Abstract:
Humans learn by interacting with their surroundings using all of their senses. Touch is the first of these senses to develop, and it is the first way that young humans explore their environment, learn about objects, and tune their cost functions (via pain or treats). Yet robots are often denied this highly informative and fundamental sensory modality and instead rely entirely on vision. In this thesis, we explore how combining tactile sensing with visual understanding can improve how robots learn from interaction.
We begin by studying how robots can learn from visual interaction alone. We propose semantic curiosity, an intrinsic motivation reward that favors temporal inconsistencies in object detections along a trajectory and is used to train an exploration policy. Our experiments demonstrate that exploration driven by semantic curiosity leads to better object detection performance.
Next, we propose PoseIt, a visual and tactile dataset for understanding how the holding pose influences grasp stability. We train a classifier to predict grasp stability from the multi-modal input, and find that it generalizes well to new objects and new poses.
We then focus on more fine-grained object manipulation. Thin, malleable objects, such as cables, are particularly susceptible to severe occlusion by the gripper and by the object itself, which makes it difficult to continuously sense the cable state from vision alone. We propose combining visual perception with hand-designed, tactile-guided motion primitives to handle cable routing and assembly.
Finally, building on our previous work, we develop a framework that learns USB cable insertion from human demonstrations alone. The visual-tactile policy is trained using behavior cloning without requiring any hand-coded primitives. We demonstrate that our transformer-based policy effectively fuses sequential visual and tactile features for high-precision manipulation.
Thesis Committee Members:
Wenzhen Yuan, Chair
Abhinav Gupta
David Held
Adithya Murali, Nvidia Research