Unlocking Generalization for Robotics via Modularity and Scale
Abstract
How can we build generalist robot systems? Looking at fields such as vision and language, the common theme has been large-scale end-to-end learning with massive, curated datasets.
In robotics, on the other hand, scale alone may not be enough due to the significant multi-modality of robotics tasks, the lack of easily accessible data, and the safety and reliability challenges of deploying on physical hardware. Meanwhile, some of the most successfully deployed robotic systems today are inherently modular and can leverage the independent generalization capabilities of each module to perform well. Inspired by these qualities, this thesis seeks to tackle the task of building generalist robot agents by integrating these components into one: combining modularity with large-scale learning for general-purpose robot control.
We begin by exploring these two aspects independently. The first question we consider is: how can we build modularity and hierarchy into learning systems? Our key insight is that rather than having the agent learn hierarchy and low-level control end-to-end, we can explicitly enforce modularity via planning to enable significantly more efficient and capable robot learners. Next, we come to the role of scale in building generalist robot systems. To effectively scale, neural networks require vast amounts of diverse data, expressive architectures to fit the data and a source of supervision to generate the data. To that end, we leverage a powerful supervision source: classical planning algorithms, which can generalize broadly, but are expensive to run and require access to perfect, privileged information to perform well in practice. We use these planning algorithms to supervise large-scale policy learning in simulation to produce generalist agents.
Finally, we consider how to unify modularity with large-scale policy learning to build autonomous real-world robot systems capable of performing zero-shot long-horizon manipulation. We propose to do so by tightly integrating modular high- and mid-level planning, learned local control, procedural scene generation, and large-scale policy learning for sim-to-real transfer. We demonstrate that this recipe can produce powerful results: a single, generalist agent can solve challenging long-horizon manipulation tasks in the real world, solely from text instructions.
BibTeX
@phdthesis{Dalal-2025-145414,
author = {Murtaza Dalal},
title = {Unlocking Generalization for Robotics via Modularity and Scale},
year = {2025},
month = {January},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-25-02},
keywords = {large scale learning, hierarchy, long-horizon manipulation, sim-to-real transfer, distillation},
}