Movie/Script: Alignment and Parsing of Video and Text Transcription

Penn collection
Lab Papers (GRASP)
Abstract

Movies and TV are a rich source of diverse and complex video of people, objects, actions and locales “in the wild”. Harvesting automatically labeled sequences of actions from video would enable creation of large-scale and highly varied datasets. To enable such collection, we focus on the task of recovering scene structure in movies and TV series for object tracking and action retrieval. We present a weakly supervised algorithm that uses the screenplay and closed captions to parse a movie into a hierarchy of shots and scenes. Scene boundaries in the movie are aligned with screenplay scene labels, and shots are reordered into a sequence of long continuous tracks, or threads, which allow for more accurate tracking of people, actions and objects. Scene segmentation, alignment, and shot threading are formulated as inference in a unified generative model, and we present a novel hierarchical dynamic programming algorithm that handles alignment and jump-limited reorderings in linear time. We present quantitative and qualitative results on movie alignment and parsing, and use the recovered structure to improve character naming and retrieval of common actions in several episodes of popular TV series.

Date of presentation
2008-10-01
Conference name
Lab Papers (GRASP)
Conference dates
2023-05-17T03:10:53.000
Comments
Copyright 2008 Springer. Postprint version. Published in: Movie/Script: Alignment and Parsing of Video and Text Transcription. Timothee Cour, Chris Jordan, Eleni Miltsakaki, Ben Taskar. In Computer Vision - ECCV 2008: 10th European Conference on Computer Vision. David Forsyth, Philip Torr, Andrew Zisserman, eds. Marseille, France, October 2008. Proceedings, Part IV, pp. 158-171. The original publication is available at www.springerlink.com. DOI: 10.1007/978-3-540-88693-8_12. Publisher URL: http://springerlink.com/content/p2075438g062241h/?p=e6611e70aaf14ff4a54de2cb038f9c1c&pi=11