Joint Segmentation and Classification of Human Actions in Video

M. H. Nguyen, D. Lan, and F. De la Torre

Conference Paper, Proceedings of (CVPR) Computer Vision and Pattern Recognition, pp. 3265 - 3272, June, 2011

Abstract

Automatic video segmentation and action recognition has been a long-standing problem in computer vision. Much work in the literature treats video segmentation and action recognition as two independent problems; while segmentation is often done without a temporal model of the activity, action recognition is usually performed on pre-segmented clips. In this paper we propose a novel method that avoids the limitations of the above approaches by jointly performing video segmentation and action recognition. Unlike standard approaches based on extensions of dynamic Bayesian networks, our method is based on a discriminative temporal extension of the spatial bag-of-words model that has been very popular in object recognition. The classification is performed robustly within a multi-class SVM framework whereas the inference over the segments is done efficiently with dynamic programming. Experimental results on honeybee, Weizmann, and Hollywood datasets illustrate the benefits of our approach compared to state-of-the-art methods.

BibTeX

@conference{Nguyen-2011-120922,
author = {M. H. Nguyen and D. Lan and F. De la Torre},
title = {Joint Segmentation and Classification of Human Actions in Video},
booktitle = {Proceedings of (CVPR) Computer Vision and Pattern Recognition},
year = {2011},
month = {June},
pages = {3265 - 3272},
}

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.