10:30 am to 11:30 am
GHC 6501
Abstract:
A robot that operates efficiently in a team with a human in an unstructured outdoor environment must be able to translate commands from a modality that is intuitive to its operator into actions. This capability is especially important as robots become ubiquitous and interact with untrained users. For this to happen, the robot must be able to perceive the world as humans do, so that the nuances of natural language and human perception are appropriately reflected in the actions the robot takes. Traditionally, this has been done with separate perception, language processing, and planning blocks unified by a grounding system. The grounding system relates abstract symbols in the command to concrete representations in perception that can be placed into a metric or topological map on which a planner can be executed. These modules are trained separately, often with different performance specifications, and are connected by restrictive interfaces (e.g., point objects with discrete attributes and a limited command language) that ease development and debugging but also limit the kinds of information one module can transfer to another.
The tremendous success of deep learning has revolutionized traditional lines of research in computer vision, such as object detection and scene labeling. Some of the most recent work goes even further, bringing together state-of-the-art techniques in natural language processing and image understanding in what is called visual question answering, or VQA. Symbol grounding, multi-step reasoning, and comprehension of spatial relations are already elements of these systems, all contained in a single differentiable deep learning architecture, eliminating the need for well-defined interfaces between modules and the simplifying assumptions that go with them.
Building upon this work, we propose a technique to transform a natural language command and a static aerial image into a cost map suitable for planning. With this technique, we take a step toward unifying language, perception, and planning in a single, end-to-end trainable system. Further, we propose a synthetic benchmark based upon the CLEVR dataset, which can be used to compare the strengths and weaknesses of the comprehension abilities of various planning algorithms in an unbiased environment with virtually unlimited data. Finally, we propose some extensions to the system as steps towards practical robotics applications.
Thesis Committee:
Martial Hebert, Chair
Kris Kitani
Jean Oh
Junsong Yuan, State University of New York at Buffalo