Abstract:
Traditional object detection methods are often confined to predefined object vocabularies, limiting their versatility in real-world scenarios where robots need to understand and execute diverse household tasks. Additionally, the 2D and 3D perception communities have typically pursued separate approaches tailored to their respective domains.
In this thesis, we present a language-conditioned object detector with an open and adaptable vocabulary, capable of seamlessly operating in both 2D and 3D environments with minimal architectural adjustments. Our detector incorporates top-down guidance from language commands to direct its attention within the visual stream, while also leveraging bottom-up information from pre-trained object detectors. We demonstrate its state-of-the-art performance in both 2D and 3D contexts on widely recognised benchmarks.
Furthermore, we showcase its practical utility in language-guided robot manipulation. Central to our model are energy-based concept generation modules, which enable it to handle longer instructions and novel combinations of spatial concepts. We evaluate our model on established instruction-guided manipulation benchmarks, as well as on newly introduced benchmarks for compositional instructions. Notably, our model can execute highly compositional instructions zero-shot in both simulation and real-world settings.
Committee:
Katerina Fragkiadaki, Chair
Tom M. Mitchell
Shubham Tulsiani
Nikolaos Gkanatsios