Motion Words for Videos - Robotics Institute Carnegie Mellon University

Ekaterina H. Taralova, Fernando De la Torre Frade, and Martial Hebert
Conference Paper, Proceedings of (ECCV) European Conference on Computer Vision, pp. 725 - 740, September, 2014

Abstract

In the task of activity recognition in videos, computing the video representation often involves pooling feature vectors over spatially local neighborhoods. The pooling is done over the entire video, over coarse spatio-temporal pyramids, or over predetermined rigid cuboids. Similar to pooling image features over superpixels in images, it is natural to consider pooling spatio-temporal features over video segments, e.g., supervoxels. However, since the number of segments is variable, this produces a video representation of variable size. We propose Motion Words - a new, fixed-size video representation, where we pool features over supervoxels. To segment the video into supervoxels, we explore two recent video segmentation algorithms. The proposed representation enables localization of common regions across videos in both space and time. Importantly, since the video segments are meaningful regions, we can interpret the proposed features and obtain a better understanding of why two videos are similar. Evaluation on classification and retrieval tasks on two datasets further shows that Motion Words achieves state-of-the-art performance.
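
The variable-size problem described in the abstract can be illustrated with a small sketch. The abstract does not say how the fixed-size vector is obtained, so the following assumes a generic bag-of-words step: local descriptors are average-pooled within each segment, each pooled descriptor is assigned to its nearest entry in a codebook, and the video is summarized by a histogram over codebook entries. The function names, the random codebook, and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np


def pool_over_segments(features, segment_ids):
    """Average-pool local descriptors within each segment (e.g. a supervoxel).

    features: (num_features, dim) array of local descriptors.
    segment_ids: (num_features,) array giving the segment of each descriptor.
    Returns one pooled descriptor per segment, so the number of rows varies
    from video to video.
    """
    labels = np.unique(segment_ids)
    return np.stack([features[segment_ids == s].mean(axis=0) for s in labels])


def fixed_size_histogram(segment_descriptors, codebook):
    """Quantize pooled descriptors against a codebook (assumed step).

    Returns an L1-normalized histogram whose length equals the codebook size,
    regardless of how many segments the video has.
    """
    # Squared Euclidean distances, shape (num_segments, num_words).
    diffs = segment_descriptors[:, None, :] - codebook[None, :, :]
    assignments = (diffs ** 2).sum(axis=2).argmin(axis=1)
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / hist.sum()


rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 16))  # hypothetical vocabulary: 8 words, 16-D

# Two toy "videos" with different numbers of descriptors and segments.
feats_a, segs_a = rng.normal(size=(200, 16)), rng.integers(0, 5, size=200)
feats_b, segs_b = rng.normal(size=(300, 16)), rng.integers(0, 12, size=300)

video_a = fixed_size_histogram(pool_over_segments(feats_a, segs_a), codebook)
video_b = fixed_size_histogram(pool_over_segments(feats_b, segs_b), codebook)
print(video_a.shape, video_b.shape)
```

Both videos end up with a vector of the same length even though they contain different numbers of segments, which is the property the abstract asks of a fixed-size representation. In practice the codebook would be learned from training data, for example by clustering, rather than drawn at random.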

BibTeX

@conference{Taralova-2014-17151,
author = {Ekaterina H. Taralova and Fernando De la Torre Frade and Martial Hebert},
title = {Motion Words for Videos},
booktitle = {Proceedings of (ECCV) European Conference on Computer Vision},
year = {2014},
month = {September},
pages = {725--740},
keywords = {Video representations, action classification},
}