Learning Multi-Modal Navigation in Unstructured Environments
Abstract
A robot that operates efficiently in a team with humans in an unstructured outdoor environment must translate commands, given in a modality intuitive to its operator, into actions. The robot must perceive the world as humans do so that its actions reflect the nuances of natural language and human perception. Traditionally, a navigation system combines separate perception, language processing, and planning blocks that are often trained independently against different performance specifications. These blocks communicate through restrictive interfaces that ease development (e.g., point objects with discrete attributes and a limited command language) but also constrain the information one module can pass to another.
The tremendous success of deep learning has revolutionized traditional lines of research in computer vision, such as object detection and scene labeling. Visual question answering, or VQA, connects state-of-the-art techniques in natural language processing with image understanding. Symbol grounding, multi-step reasoning, and comprehension of spatial relations are already elements of these systems. These elements are unified in an architecture with a single differentiable loss, eliminating the need for well-defined interfaces between modules and the simplifying assumptions that go with them.
We introduce a technique to transform a text language command and a static aerial image into a cost map suitable for planning. We build upon the FiLM VQA architecture, adapt it to generate a cost map, and combine it with a differentiable planning loss (Max Margin Planning) modified to use the Field D* planner. With this architecture, we take a step towards unifying language, perception, and planning into a single, end-to-end trainable system.
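As a rough illustration of the fusion step, the PyTorch-style sketch below conditions convolutional features of an aerial image on a command embedding with FiLM layers and decodes them into a positive per-cell cost map. The module names, layer sizes, and the pooled text embedding are assumptions made here for illustration, not the thesis architecture, and the Max Margin Planning loss and Field D* planner are omitted.

import torch
import torch.nn as nn

class FiLMBlock(nn.Module):
    """One residual-free FiLM block: y = relu(gamma * conv(x) + beta)."""
    def __init__(self, channels, text_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.film = nn.Linear(text_dim, 2 * channels)  # predicts gamma, beta

    def forward(self, x, text):
        gamma, beta = self.film(text).chunk(2, dim=-1)
        h = self.conv(x)
        return torch.relu(gamma[..., None, None] * h + beta[..., None, None])

class CommandToCostMap(nn.Module):
    """Fuses an aerial image with a command embedding into a cost map."""
    def __init__(self, text_dim=128, channels=64, n_blocks=4):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.blocks = nn.ModuleList(
            [FiLMBlock(channels, text_dim) for _ in range(n_blocks)])
        self.head = nn.Conv2d(channels, 1, 1)  # one traversal cost per cell

    def forward(self, image, text):
        h = torch.relu(self.stem(image))
        for block in self.blocks:
            h = block(h, text)
        return nn.functional.softplus(self.head(h))  # keep costs positive

# Usage: one 128x128 RGB aerial image and a 128-d pooled command embedding.
model = CommandToCostMap()
cost_map = model(torch.rand(1, 3, 128, 128), torch.rand(1, 128))
print(cost_map.shape)  # torch.Size([1, 1, 128, 128])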
We present an extensible synthetic benchmark derived from the CLEVR dataset, which we use to study the comprehension abilities of the algorithm in the context of an unbiased environment with virtually unlimited data. We analyze the algorithm's performance on this data to understand its limitations and propose future work to address its shortcomings. We offer results on a hybrid dataset using real-world aerial imagery and synthetic commands.
Planning algorithms are often sequential with a high branching factor and do not map well to the GPUs that have catalyzed the development of deep learning in recent years. We carefully selected Field D* and Max Margin Planning to perform well on highly parallel architectures. We introduce a version of Field D* suitable for multi-GPU data-parallel training that uses the Bellman-Ford algorithm, boosting performance almost ten times compared to our CPU-optimized implementation.
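The sketch below, under simplifying assumptions (a 4-connected grid, per-cell costs, synchronous updates), illustrates why Bellman-Ford-style relaxation suits GPUs: every cell is relaxed in parallel with a few tensor shifts per iteration. It is not the thesis implementation, and Field D*'s interpolated edge costs are omitted.

import torch

def bellman_ford_grid(cost, goal, n_iters=None):
    """Distance-to-goal over a 2D cost grid by synchronous relaxation.

    cost: (H, W) tensor of non-negative traversal costs per cell.
    goal: (row, col) index of the goal cell.
    """
    H, W = cost.shape
    n_iters = n_iters or (H + W)  # enough synchronous sweeps to span the grid
    inf_row = torch.full((1, W), float('inf'), device=cost.device)
    inf_col = torch.full((H, 1), float('inf'), device=cost.device)
    dist = torch.full_like(cost, float('inf'))
    dist[goal] = 0.0
    for _ in range(n_iters):
        # Each cell reads its four neighbors' current distances in parallel.
        down  = torch.cat([dist[1:], inf_row], dim=0)     # neighbor below
        up    = torch.cat([inf_row, dist[:-1]], dim=0)    # neighbor above
        right = torch.cat([dist[:, 1:], inf_col], dim=1)  # neighbor to the right
        left  = torch.cat([inf_col, dist[:, :-1]], dim=1) # neighbor to the left
        best = torch.minimum(torch.minimum(up, down), torch.minimum(left, right))
        dist = torch.minimum(dist, best + cost)  # relax every cell at once
    return dist

# Usage: uniform unit costs, goal at the top-left corner of a 64x64 grid.
dist = bellman_ford_grid(torch.ones(64, 64), (0, 0))
print(dist[63, 63].item())  # 126.0 = Manhattan distance at unit cost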
The fluid interaction between humans working in a team depends upon a shared understanding of the task, the environment, and the subtleties of language. A robot operating in this context must do the same. Learning to translate commands and images into trajectories with a differentiable planning loss is one way to capture and imitate human behavior and is a small step towards seamless interaction between robots and humans.
BibTeX
@phdthesis{Suppe-2022-131710,
author = {Arne Suppe},
title = {Learning Multi-Modal Navigation in Unstructured Environments},
year = {2022},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-22-22},
keywords = {navigation, vision language navigation, path planning, inverse reinforcement learning, deep learning},
}