11:00 am to 12:00 pm
Event Location: NSH 1305
Abstract: This talk addresses the problem of understanding the visual content of videos using weak supervision, such as the textual information available in television or film scripts. I will discuss two instances of this problem: the joint localization and identification of movie characters and their actions, and the assignment of either symbolic action labels or natural-language annotations to video frames using temporal ordering constraints. Both problems can be tackled within a discriminative clustering framework. I will present the underlying models, appropriate relaxations of the combinatorial optimization problems associated with learning these models, and efficient algorithms for solving the resulting convex optimization problems. I will also present experimental results on feature-length films and cooking videos.
Joint work with Piotr Bojanowski, Edouard Grave, Remi Lajugie, Francis Bach, Ivan Laptev, Cordelia Schmid, and Josef Sivic.
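The abstract mentions assigning labels to video frames under temporal ordering constraints. As a hedged illustration only, not the speakers' method or code, the discrete core of that idea can be pictured as an ordered alignment. Given a per-frame cost for each label in the script's ordered list, we choose a labeling in which labels appear in order and each covers a contiguous run of frames, minimizing the total cost. Dynamic programming solves this exactly. The function name `ordered_assignment`, the cost-matrix input, and the rule that every label covers at least one frame are all assumptions made for this sketch.

```python
import math
from typing import List, Tuple


def ordered_assignment(cost: List[List[float]]) -> Tuple[float, List[int]]:
    """Hypothetical helper: align T frames to K labels given in a fixed order.

    cost[t][k] is an assumed cost of giving frame t the k-th label of the
    ordered annotation. Labels must appear in order, each covering a
    contiguous, non-empty run of frames. Returns (total_cost, labels) where
    labels[t] is the index of the label assigned to frame t.
    """
    T = len(cost)
    K = len(cost[0]) if T else 0
    if K == 0 or T < K:
        raise ValueError("need at least one label and at least as many frames as labels")

    INF = math.inf
    # best[t][k]: minimal cost of frames 0..t with frame t assigned label k
    best = [[INF] * K for _ in range(T)]
    # came_from[t][k]: label index of frame t-1 on that optimal path
    came_from = [[-1] * K for _ in range(T)]
    best[0][0] = cost[0][0]  # the first frame must start the first label

    for t in range(1, T):
        for k in range(K):
            stay = best[t - 1][k]                           # continue label k
            advance = best[t - 1][k - 1] if k > 0 else INF  # move on from label k-1
            if stay <= advance:
                prev, value = k, stay
            else:
                prev, value = k - 1, advance
            if value < INF:
                best[t][k] = value + cost[t][k]
                came_from[t][k] = prev

    # The last frame must end the last label; walk the optimal path backwards.
    labels = [0] * T
    k = K - 1
    for t in range(T - 1, -1, -1):
        labels[t] = k
        k = came_from[t][k]
    return best[T - 1][K - 1], labels
```

For example, `ordered_assignment([[0, 5], [1, 0], [2, 0]])` returns `(0, [0, 1, 1])`: the first frame takes the first label and the remaining two take the second. This sketch assumes the costs are already known. The relaxations mentioned in the abstract target the harder problem of learning the model and the assignment jointly, which this example does not attempt.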