
Representing Videos using Mid-level Discriminative Patches

Arpit Jain, Abhinav Gupta, Mikel Rodriguez, and Larry S. Davis
Conference Paper, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2571-2578, June 2013

Abstract

How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatio-temporal patch in the video. What defines these spatio-temporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that they establish correspondence across videos and align videos for label transfer. Furthermore, these patches can be used as a discriminative vocabulary for action classification, where they demonstrate state-of-the-art performance on the UCF50 and Olympics datasets.
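The abstract only sketches the pipeline, so the following is a minimal illustrative reading rather than the paper's implementation: each mined discriminative patch is assumed to come with a linear SVM detector over some spatio-temporal descriptor (e.g., HOG3D-like features), and a video is then summarized by max-pooling each detector's responses over the video's candidate patches. The descriptor choice, the pooling scheme, and all function names here are assumptions for illustration.

```python
import numpy as np

def video_representation(patch_descriptors, detector_weights, detector_biases):
    """Represent one video by its responses to a vocabulary of mined
    discriminative patch detectors (illustrative sketch, not the paper's code).

    patch_descriptors: (n_patches, d) descriptors of candidate spatio-temporal
                       patches extracted from one video.
    detector_weights:  (n_detectors, d) linear SVM weights, one row per
                       mined discriminative patch.
    detector_biases:   (n_detectors,) linear SVM biases.

    Returns a (n_detectors,) feature vector: the strongest response of each
    detector anywhere in the video (max-pooling over candidate patches).
    """
    # Score every candidate patch against every detector, then max-pool
    # per detector across the video.
    scores = patch_descriptors @ detector_weights.T + detector_biases
    return scores.max(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    descs = rng.standard_normal((500, 128))  # 500 candidate patches, 128-d descriptors
    W = rng.standard_normal((40, 128))       # 40 hypothetical mined patch detectors
    b = np.zeros(40)
    feat = video_representation(descs, W, b)
    print(feat.shape)  # (40,): one pooled response per detector
```

Under this reading, the resulting per-video vectors can be fed to any standard classifier (e.g., a linear SVM) for action classification.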

BibTeX

@conference{Jain-2013-113360,
author = {Arpit Jain and Abhinav Gupta and Mikel Rodriguez and Larry S. Davis},
title = {Representing Videos using Mid-level Discriminative Patches},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2013},
month = {June},
pages = {2571-2578},
}