
MSR Thesis Defense

Anish Madan
MSR Student / Research Associate II
Robotics Institute, Carnegie Mellon University
Monday, July 15
2:00 pm to 3:00 pm
Rashid Auditorium – 4401 Gates and Hillman Centers
Automating Annotation Pipelines by Leveraging Multi-Modal Data
Abstract: The era of vision-language models (VLMs) trained on large web-scale datasets challenges conventional formulations of “open-world” perception. In this work, we revisit the task of few-shot object detection (FSOD) in the context of recent foundational VLMs.
First, we point out that zero-shot VLMs such as GroundingDINO significantly outperform state-of-the-art few-shot detectors on COCO (48 vs. 33 AP). Despite their strong zero-shot performance, such foundational models may still be sub-optimal for a particular target domain. For example, the notion of a truck learned from web data may differ from how trucks are defined in a target application such as autonomous vehicle perception. We argue that the task of few-shot recognition can be reformulated as aligning foundation models to target concepts using a few examples. Interestingly, such examples can be multi-modal, using both text and visual cues, mimicking the instructions often given to human annotators when defining a target concept of interest.

Concretely, we propose Foundational FSOD, a new benchmark protocol that evaluates detectors pre-trained on any external datasets and fine-tuned on multi-modal (text and visual) K-shot examples per target class. We repurpose nuImages for Foundational FSOD, benchmark several popular open-source VLMs, and provide an empirical analysis of state-of-the-art methods. Lastly, we discuss our recent CVPR 2024 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 23.9 mAP!
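To make the protocol above concrete, the following is a minimal sketch of how a K-shot multi-modal support set might be assembled before aligning a pre-trained detector to it. The data classes, the function name build_k_shot_support, and the example prompts are illustrative placeholders only, not the benchmark's actual interface.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import random

@dataclass
class MultiModalExemplar:
    image_path: str                          # visual cue: image containing the exemplar
    box: Tuple[float, float, float, float]   # (x1, y1, x2, y2) box for the exemplar
    text_prompt: str                         # textual cue describing the target concept

@dataclass
class SupportSet:
    class_name: str
    exemplars: List[MultiModalExemplar] = field(default_factory=list)

def build_k_shot_support(
    annotations: Dict[str, List[MultiModalExemplar]], k: int, seed: int = 0
) -> List[SupportSet]:
    """Sample K multi-modal exemplars per target class (e.g. per nuImages class)."""
    rng = random.Random(seed)
    support = []
    for class_name, exemplars in annotations.items():
        shots = rng.sample(exemplars, min(k, len(exemplars)))
        support.append(SupportSet(class_name, shots))
    return support

if __name__ == "__main__":
    # Toy example with a single class; real annotations would come from nuImages.
    pool = {
        "truck": [
            MultiModalExemplar("img_0001.jpg", (10, 20, 200, 180),
                               "a truck, i.e. a large vehicle used to transport cargo"),
            MultiModalExemplar("img_0042.jpg", (50, 60, 300, 240),
                               "a truck, excluding pickup trucks and vans"),
        ]
    }
    support = build_k_shot_support(pool, k=1)
    print(support[0].class_name, len(support[0].exemplars))
```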

Committee:
Prof. Deva K. Ramanan (advisor)
Prof. Katerina Fragkiadaki
Neehar Peri