
PhD Thesis Defense

Arne Suppe
Robotics Institute, Carnegie Mellon University
Wednesday, May 4
9:00 am to 10:00 am
Learning Multi-Modal Navigation in Unstructured Environments

Abstract:
A robot that operates efficiently in a team with humans in an unstructured outdoor environment must translate commands, given in a modality intuitive to its operator, into actions. The robot must be able to perceive the world as humans do, so that its actions reflect the nuances of natural language and human perception. Traditionally, a navigation system combines separate perception, language processing, and planning blocks that are trained individually and to different performance specifications. These modules communicate through restrictive interfaces that ease development (e.g., point objects with discrete attributes and a limited command language) but also constrain the information one module can pass to another.

We propose a technique to transform a text command and a static aerial image into a cost map suitable for planning, trained with a single differentiable loss. We build upon the FiLM VQA architecture, adapt it to generate a cost map, and combine it with Max Margin Planning, modified to use the Field D* planner. We present an extensible synthetic benchmark derived from the CLEVR dataset, which we use to study the comprehension abilities of the algorithm in an unbiased environment with virtually unlimited data. We analyze the algorithm’s performance on this data to understand its limitations, and we also present results on a semi-synthetic dataset that pairs real-world aerial imagery with synthetic commands. Planning algorithms often do not map well to the GPUs that have catalyzed the development of deep learning in recent years. We introduce a version of Field D* suitable for data-parallel GPU training that uses the Bellman-Ford algorithm, boosting performance almost tenfold over our CPU-optimized implementation.
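To make the data-parallel formulation above concrete, here is a minimal sketch of a Bellman-Ford-style relaxation over a grid cost map, written in NumPy purely for illustration. The function name `bellman_ford_costmap`, the 4-connected unit-step neighborhood, and the fixed sweep count are assumptions of this sketch, not the thesis implementation; in particular, Field D*'s interpolated edge costs are omitted. The point is only that each relaxation sweep reduces to whole-array shift and minimum operations.

```python
import numpy as np

def bellman_ford_costmap(cost, goal, iters=None):
    """Illustrative grid Bellman-Ford relaxation (4-connected, unit steps).

    cost  : (H, W) array of non-negative per-cell traversal costs
    goal  : (row, col) index of the goal cell
    Returns a cost-to-go map. Every sweep relaxes all cells at once using
    shifted copies of the value map, which is what makes the scheme
    amenable to data-parallel (GPU) execution.
    """
    H, W = cost.shape
    V = np.full((H, W), np.inf)
    V[goal] = 0.0
    # Default sweep count; strongly varying costs may need more iterations.
    iters = iters or (H + W)
    big = np.inf
    for _ in range(iters):
        # Neighbours' values, shifted into place with an "infinite" border.
        up    = np.pad(V[1:, :],  ((0, 1), (0, 0)), constant_values=big)
        down  = np.pad(V[:-1, :], ((1, 0), (0, 0)), constant_values=big)
        left  = np.pad(V[:, 1:],  ((0, 0), (0, 1)), constant_values=big)
        right = np.pad(V[:, :-1], ((0, 0), (1, 0)), constant_values=big)
        best_neighbour = np.minimum(np.minimum(up, down),
                                    np.minimum(left, right))
        # Relax every cell simultaneously: step into the cheapest neighbour.
        V = np.minimum(V, best_neighbour + cost)
    return V
```

Because each line operates on the whole map at once, porting the loop to a tensor framework runs the same relaxation for a batch of cost maps in parallel, and the min-plus updates admit subgradients almost everywhere, which is what a single differentiable planning loss requires.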

The fluid interaction between humans working in a team depends upon a shared understanding of the task, the environment, and the subtleties of language. A robot operating in this context must do the same. Learning to translate commands and images into trajectories using a single differentiable planning loss is one way to capture and imitate human behavior and is a small step towards seamless interaction between robots and humans.

Thesis Committee Members:
Martial Hebert, Chair
Kris Kitani
Jean Hyaejin Oh
Junsong Yuan, State University of New York at Buffalo
