Abstract:
Robots rely heavily on sensing to reason about physical interactions, and recent advances in rapid prototyping, MEMS sensing, and machine learning have led to a plethora of sensing alternatives. However, few of these sensors have gained widespread use among roboticists. This thesis proposes a framework for incorporating sensors into a robot learning paradigm, from development to deployment, through the lens of ReSkin, a versatile and scalable magnetic tactile sensor. By examining the design, integration, and representation learning pipeline built around ReSkin, this thesis aims to provide guidance on building effective sensing systems for robot learning.
We begin with the design of ReSkin, a low-cost, compact, and versatile platform for tactile sensing. We propose a self-supervised learning technique that enables sensor replaceability by allowing learned models to generalize to new instances of the sensor. Next, we investigate the scalability of ReSkin in the context of dexterous manipulation: we introduce the D'Manus, an inexpensive, modular, and robust platform with integrated large-area ReSkin sensing, designed to meet the large-scale data collection demands of robot learning.
Moving beyond sensor integration, this thesis explores representation learning for sensors. Sensory data is typically sequential and continuous, yet most research on sequential architectures such as LSTMs and Transformers focuses on discrete modalities like text and DNA. To address this gap, we propose Hierarchical State Space (HiSS) models, a conceptually simple and novel technique for continuous sequential prediction. HiSS creates a temporal hierarchy by stacking structured state-space models on top of one another, and outperforms state-of-the-art sequence models such as causal Transformers, LSTMs, S4, and Mamba. Further, we introduce CSP-Bench, a benchmark for continuous sequence-to-sequence prediction (CSP) from real-world sensory data, which addresses the scarcity of real-world datasets available for CSP tasks.
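To make the hierarchy concrete, the following is a minimal sketch of the idea in Python/PyTorch. All module and argument names here are illustrative assumptions, and a toy diagonal linear SSM stands in for the S4 or Mamba blocks used in the actual HiSS models.

import torch
import torch.nn as nn

class ToySSM(nn.Module):
    """Toy diagonal linear state-space layer: h_t = A*h_{t-1} + B*x_t, y_t = C*h_t."""
    def __init__(self, d_in, d_state, d_out):
        super().__init__()
        self.A = nn.Parameter(torch.rand(d_state) * 0.9)   # diagonal state transition
        self.B = nn.Linear(d_in, d_state, bias=False)       # input projection
        self.C = nn.Linear(d_state, d_out)                   # readout

    def forward(self, x):                                    # x: (batch, time, d_in)
        h = torch.zeros(x.shape[0], self.A.shape[0], device=x.device)
        ys = []
        for t in range(x.shape[1]):                          # naive sequential scan (S4/Mamba do this efficiently)
            h = self.A * h + self.B(x[:, t])
            ys.append(self.C(h))
        return torch.stack(ys, dim=1)                        # (batch, time, d_out)

class HiSSSketch(nn.Module):
    """Hierarchical SSM sketch: a low-level SSM summarizes fixed-size chunks of the raw
    sensor stream, and a high-level SSM processes the resulting chunk-rate features."""
    def __init__(self, d_in, d_feat, d_out, chunk_size):
        super().__init__()
        self.chunk_size = chunk_size
        self.low = ToySSM(d_in, 32, d_feat)
        self.high = ToySSM(d_feat, 32, d_out)

    def forward(self, x):                                    # x: (batch, time, d_in)
        b, t, d = x.shape
        n = t // self.chunk_size                             # number of chunks (assumes divisibility)
        chunks = x[:, : n * self.chunk_size].reshape(b * n, self.chunk_size, d)
        feats = self.low(chunks)[:, -1]                      # last low-level output summarizes each chunk
        feats = feats.reshape(b, n, -1)                      # chunk-rate feature sequence
        return self.high(feats)                              # (batch, num_chunks, d_out) predictions

# Toy usage: predict a 3-D target at the chunk rate from a 12-D sensor stream.
model = HiSSSketch(d_in=12, d_feat=16, d_out=3, chunk_size=50)
print(model(torch.randn(4, 1000, 12)).shape)                 # -> torch.Size([4, 20, 3])

The point of the chunking is that the high-level model sees a much shorter sequence than the raw sensor stream, which is what allows the hierarchy to cope with long, high-frequency sensory data.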
Building on HiSS, our proposed work explores the interplay between modalities for robot policy learning. Specifically, we investigate cross-modal supervision as a means of learning effective multimodal representations for downstream robot policies: tactile representations learned with supervision from vision, and vice versa. By leveraging the strengths of each modality to improve the representations of others, we aim to develop multimodal representations that enhance robot policy learning and enable robots to better integrate information across sensory modalities.
Thesis Committee Members:
Abhinav Gupta, Co-chair
Carmel Majidi, Co-chair
Deepak Pathak
Lerrel Pinto, New York University