FACSCaps: Pose-Independent Facial Action Coding With Capsules
Abstract
Most automated facial expression analysis methods treat the face as a 2D object, flat like a sheet of paper. That works well provided images are frontal or nearly so. In real-world conditions, moderate to large head rotation is common, and expression recognition performance degrades. Multi-view Convolutional Neural Networks (CNNs) have been proposed to increase robustness to pose, but they require larger models and may generalize poorly to views not included in the training set. We propose the FACSCaps architecture to handle multi-view and multi-label facial action unit (AU) detection within a single model that can generalize to novel views. Additionally, FACSCaps's ability to synthesize faces provides insight into what is learned by the model. FACSCaps models video frames using matrix capsules, in which hierarchical pose relationships between face parts are built into internal representations. The model is trained by jointly optimizing a multi-label loss and reconstruction accuracy. FACSCaps was evaluated on the FERA 2017 facial expression dataset, which includes spontaneous facial expressions in a wide range of head orientations. FACSCaps outperformed both state-of-the-art CNNs and their temporal extensions.
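The joint objective described in the abstract, a multi-label AU detection loss optimized together with reconstruction accuracy, can be illustrated with a short sketch. The Python code below is a minimal, hypothetical PyTorch example and not the authors' implementation; the function name joint_loss, the weight lambda_recon, the number of AUs, and the image size are all assumptions made for illustration.

# Minimal sketch (not the authors' code) of a joint objective:
# multi-label AU detection loss plus a down-weighted reconstruction term.
import torch
import torch.nn.functional as F

def joint_loss(au_logits, au_labels, recon, frame, lambda_recon=0.0005):
    """Combine a multi-label AU loss with a reconstruction penalty.

    au_logits : (batch, num_aus) raw scores, one per action unit
    au_labels : (batch, num_aus) binary ground-truth AU occurrences
    recon     : (batch, C, H, W) face synthesized from the capsule code
    frame     : (batch, C, H, W) input video frame
    """
    # Independent sigmoid per AU, since several AUs can co-occur in a frame.
    detection = F.binary_cross_entropy_with_logits(au_logits, au_labels)
    # Pixel-wise reconstruction error, down-weighted so it regularizes
    # rather than dominates the detection objective (weight is an assumption).
    reconstruction = F.mse_loss(recon, frame)
    return detection + lambda_recon * reconstruction

# Example usage with random tensors standing in for model outputs.
if __name__ == "__main__":
    logits = torch.randn(8, 10)                 # 10 AUs, batch of 8 frames
    labels = torch.randint(0, 2, (8, 10)).float()
    frames = torch.rand(8, 1, 96, 96)           # grayscale 96x96 crops (assumed)
    recons = torch.rand(8, 1, 96, 96)
    print(joint_loss(logits, labels, recons, frames).item())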
BibTeX
@inproceedings{Ertugrul-2018-119661,
author = {Itir Onal Ertugrul and Laszlo A. Jeni and Jeffrey F. Cohn},
title = {FACSCaps: Pose-Independent Facial Action Coding With Capsules},
booktitle = {Proceedings of CVPR '18 Workshops},
year = {2018},
month = {June},
}