Carnegie Mellon University
Abstract:
Computer vision models have proven to be tremendously capable of recognizing and detecting many real-world objects: cars, people, pets. These models are only possible due to a meticulous pipeline in which a task and application are first conceived, followed by dataset curation that collects and labels all necessary data. Commonly, studies focus on one aspect of this pipeline, whether at the application level or at the dataset curation level. This thesis aims to take a more holistic view of the entire visual dataset pipeline, specifically focusing on real-world tasks and datasets.
In the first part of this thesis, we focus on the real-world tasks of a visual pipeline. Since real-world object distributions are often imbalanced, with some categories seen frequently and others seen rarely, models struggle to perform well on underrepresented classes. Thus, we aim to improve standard vision tasks on datasets with long-tailed distributions, which resemble a real-world distribution. Our first approach addresses visual classification, where we aim to increase performance on rarer classes. In this work, we create new, stronger classifiers for rarer classes by leveraging the representations and classifiers learned for common classes. Our simple method can be applied on top of any existing set of classifiers, showing that learning better classifiers does not require extensive or complicated approaches. Our second approach ventures into visual detection and segmentation, where the additional localization task makes it difficult to train better detectors for rare classes. We take a closer look at the basic resampling approach widely used in detection on long-tailed datasets. Notably, we show that this fundamental resampling strategy can be improved by resampling not only whole images but also individual objects.
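As a rough illustration of the image-level resampling mentioned above, the sketch below computes per-image repeat factors in the style of repeat-factor sampling, a commonly used baseline for long-tailed detection. It is an assumption that this is the baseline meant here; the function name `repeat_factors`, the `threshold` parameter, and the toy labels are hypothetical, and the sketch does not implement the object-level resampling proposed in the thesis.

```python
import math
from collections import Counter


def repeat_factors(image_labels, threshold=0.5):
    """Return one repeat factor per image (hypothetical helper).

    image_labels: list of sets, each holding the category names in one image.
    threshold: categories appearing in fewer than this fraction of images
               are oversampled (assumed hyperparameter).
    """
    num_images = len(image_labels)

    # Count how many images contain each category.
    image_counts = Counter(c for labels in image_labels for c in set(labels))

    # Category-level factor: max(1, sqrt(threshold / image_frequency)).
    category_factor = {
        c: max(1.0, math.sqrt(threshold / (n / num_images)))
        for c, n in image_counts.items()
    }

    # Image-level factor: the largest factor among the categories it contains.
    return [
        max((category_factor[c] for c in labels), default=1.0)
        for labels in image_labels
    ]


# Toy data: "car" is common, "zebra" appears in only one of four images.
toy_labels = [{"car"}, {"car"}, {"car"}, {"car", "zebra"}]
factors = repeat_factors(toy_labels, threshold=0.5)
```

In this toy case only the image containing the rare category receives a factor above 1, so the whole image, including its common objects, is repeated more often during training. That coupling between rare and common objects is one reason to also consider resampling at the object level.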
Successful real-world models depend heavily on the quality of training and testing data. In part two of this thesis, we develop a dataset, then identify and explore a major challenge facing visual dataset curation. First, we build the first large-scale visual fMRI dataset, BOLD5000. In an effort to bridge the gap between computer vision and human vision, we design a dataset with 5,000 images taken from computer vision benchmark datasets. Through this effort, we identified a crucial and time-consuming component of dataset curation: creating labeling instructions (LIs) for annotators and participants. The LIs for a typical visual dataset include detailed definitions and visual examples of each category, which are provided to annotators. Notably, current datasets typically do not release their LIs. We introduce a new task, labeling instruction generation, to reverse engineer LIs from existing datasets. Our method leverages existing large vision and language models (VLMs) to generate LIs and significantly outperforms all baselines.
Thesis Committee Members:
Martial Hebert, Co-chair
Michael Tarr, Co-chair
Deva Ramanan
Alyosha Efros, UC Berkeley
Ross Girshick, FAIR