Inference Machines: Parsing Scenes via Iterated Predictions

PhD Thesis, Tech. Report, CMU-RI-TR-13-15, Robotics Institute, Carnegie Mellon University, June, 2013

Abstract

Extracting a rich representation of an environment from visual sensor readings can benefit many tasks in robotics, e.g., path planning, mapping, and object manipulation. While important progress has been made, it remains a difficult problem to effectively parse entire scenes, i.e., to recognize semantic objects, man-made structures, and landforms. This process requires not only recognizing individual entities but also understanding the contextual relations among them. The prevalent approach to encode such relationships is to use a joint probabilistic or energy-based model which enables one to naturally write down these interactions. Unfortunately, performing exact inference over these expressive models is often intractable and instead we can only approximate the solutions. While there exists a set of sophisticated approximate inference techniques to choose from, the combination of learning and approximate inference for these expressive models is still poorly understood in theory and limited in practice. Furthermore, using approximate inference on any learned model often leads to suboptimal predictions due to the inherent approximations. As we ultimately care about predicting the correct labeling of a scene, and not necessarily learning a joint model of the data, this work proposes to instead view the approximate inference process as a modular procedure that is directly trained in order to produce a correct labeling of the scene. Inspired by early hierarchical models in the computer vision literature for scene parsing, the proposed inference procedure is structured to incorporate both feature descriptors and contextual cues computed at multiple resolutions within the scene. We demonstrate that this inference machine framework for parsing scenes via iterated predictions offers the best of both worlds: state-of-the-art classification accuracy and computational efficiency when processing images and/or unorganized 3-D point clouds. Additionally, we address critical problems that arise in practice when parsing scenes on board real-world systems: integrating data from multiple sensor modalities and efficiently processing data that is continuously streaming from the sensors.

BibTeX

@phdthesis{Munoz-2013-7741,
author = {Daniel Munoz},
title = {Inference Machines: Parsing Scenes via Iterated Predictions},
year = {2013},
month = {June},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-13-15},
}

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.