Learning facial action units with spatiotemporal cues and multi-label sampling

W. Chu, F. De la Torre, and J. Cohn
Journal Article, Image and Vision Computing, Vol. 81, pp. 1-14, January 2019

Abstract

Facial action units (AUs) may be represented spatially, temporally, and in terms of their correlation. Previous research focuses on one or another of these aspects or addresses them disjointly. We propose a hybrid network architecture that jointly models spatial and temporal representations and their correlation. In particular, we use a Convolutional Neural Network (CNN) to learn spatial representations, and a Long Short-Term Memory (LSTM) to model temporal dependencies among them. The outputs of CNNs and LSTMs are aggregated into a fusion network to produce per-frame prediction of multiple AUs. The hybrid network was compared to previous state-of-the-art approaches on two large FACS-coded video databases, GFT and BP4D, with over 400,000 AU-coded frames of spontaneous facial behavior in varied social contexts. Relative to standard multi-label CNN and feature-based state-of-the-art approaches, the hybrid system reduced person-specific biases and increased accuracy for AU detection. To address class imbalance within and between batches while training the network, we introduce multi-label sampling strategies that further increase accuracy when AUs are relatively sparse. Finally, we provide visualization of the learned AU models, which, to the best of our knowledge, reveal for the first time how machines see AUs.
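Below is a minimal, hypothetical PyTorch sketch of the kind of hybrid architecture the abstract describes: a CNN extracts per-frame spatial features, an LSTM models temporal dependencies over them, and a fusion network combines both streams into per-frame multi-label AU predictions. The backbone, layer sizes, and number of AUs here are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of a CNN + LSTM + fusion network for per-frame
# multi-label AU prediction. Input: frames of shape (batch, time, 3, H, W).
import torch
import torch.nn as nn

class HybridAUNet(nn.Module):
    def __init__(self, num_aus=12, cnn_dim=256, lstm_dim=128):
        super().__init__()
        # Spatial stream: a small CNN (the paper's backbone may differ).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, cnn_dim), nn.ReLU(),
        )
        # Temporal stream: LSTM over the sequence of per-frame CNN features.
        self.lstm = nn.LSTM(cnn_dim, lstm_dim, batch_first=True)
        # Fusion network: aggregate spatial and temporal features and
        # emit one logit per AU for every frame (multi-label output).
        self.fusion = nn.Sequential(
            nn.Linear(cnn_dim + lstm_dim, 128), nn.ReLU(),
            nn.Linear(128, num_aus),
        )

    def forward(self, frames):  # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)  # (B, T, cnn_dim)
        temporal, _ = self.lstm(feats)                          # (B, T, lstm_dim)
        logits = self.fusion(torch.cat([feats, temporal], dim=-1))
        return logits  # per-frame, per-AU logits

# Multi-label training would typically use a per-AU sigmoid loss, e.g.
# nn.BCEWithLogitsLoss()(model(frames), au_labels.float()).
# The paper's multi-label sampling strategies would additionally control
# which frames and AU combinations appear within and across minibatches
# to counter class imbalance; one generic (assumed) option in PyTorch is
# torch.utils.data.WeightedRandomSampler over sequences containing rare AUs.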

BibTeX

@article{Chu-2019-120694,
author = {W. Chu and F. De la Torre and J. Cohn},
title = {Learning facial action units with spatiotemporal cues and multi-label sampling},
journal = {Image and Vision Computing},
year = {2019},
month = {January},
volume = {81},
pages = {1--14},
}