
PhD Thesis Defense

Mengtian Li
Robotics Institute, Carnegie Mellon University
Friday, April 29
11:30 am to 12:30 pm
Resource-Constrained Learning and Inference for Visual Perception

Abstract:
We have witnessed rapid advancement across major computer vision benchmarks in recent years. However, the top solutions’ hidden computational cost prevents them from being practically deployable. For example, training large models to convergence may be prohibitively expensive in practice, and autonomous driving or augmented reality may require a reaction time that rivals that of humans, typically 200 milliseconds for visual stimuli. Clearly, vision algorithms need to be adjusted or redesigned to meet resource constraints. This thesis argues that resource constraints should be embraced as first principles of algorithm design. We support this thesis with principled evaluation frameworks and novel constraint-aware solutions for a variety of computer vision tasks.

This thesis first investigates the evaluation of vision algorithms in resource-constrained settings. Latency, the primary metric for computational cost, is usually evaluated independently of accuracy, making it hard to compare algorithms with different accuracy-latency tradeoffs. To address this issue, we propose an approach that integrates latency and accuracy coherently into a single metric that we call “streaming accuracy”. We further show that we can build an evaluation framework on top of this metric and generalize it to arbitrary single-frame understanding tasks. This streaming perception framework yields several surprising conclusions and solutions, e.g., latency is sometimes minimized by sitting idle and “doing nothing”! We also discuss future extensions of streaming perception to streaming forecasting, where the evaluation protocol is one step closer to real-world applications with full-stack perception. Additionally, we propose a formal setting for studying generic deep network training in the non-asymptotic, resource-constrained regime, i.e., budgeted training.
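The matching rule behind such a streaming evaluation can be sketched in a few lines. The snippet below is a minimal illustration, not the benchmark’s actual interface: the data layout is assumed, and frame_metric stands in for any single-frame accuracy measure (e.g., a per-frame detection score).

from bisect import bisect_right

def streaming_accuracy(predictions, ground_truths, frame_metric):
    """Score a stream of (finish_time, output) predictions against
    timestamped ground truths. At each ground-truth timestamp, the most
    recent prediction that had *finished* by that time is scored, so a
    model's latency directly affects its reported accuracy.
    """
    # predictions: list of (finish_time, output), sorted by finish_time
    # ground_truths: list of (timestamp, annotation), sorted by timestamp
    finish_times = [t for t, _ in predictions]
    scores = []
    for t, annotation in ground_truths:
        # index of the last prediction completed no later than t
        i = bisect_right(finish_times, t) - 1
        if i < 0:
            scores.append(0.0)  # no output available yet: counts as a miss
        else:
            scores.append(frame_metric(predictions[i][1], annotation))
    return sum(scores) / len(scores)

Under this rule, a fast model whose outputs stay fresh can outscore a slower, more accurate one, which is exactly the accuracy-latency coupling that offline evaluation misses.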

This thesis then explores novel task-specific solutions under resource constraints. Far-range LiDAR-based 3D object detection is a compute-intensive task. Contemporary solutions use 3D voxel representations, often encoded as a bird’s-eye view (BEV) feature map. While intuitive, such representations scale quadratically with the spatial range of the map, making them ill-suited for far-field perception. We present a pyramidal representation that retains the benefits of BEV while remaining efficient by exploiting the following insight: near-field LiDAR measurements are dense and optimally encoded by small voxels, while far-field measurements are sparse and better encoded with large voxels. Additionally, this thesis proposes biologically inspired attentional warping for 2D object detection and discusses its future extension to arbitrary image-based tasks. We also propose a progressive distillation approach for learning lightweight detectors from a sequence of teacher models. To complete the perception stack, we propose future object detection with backcasting for end-to-end detection, tracking, and forecasting.
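The quadratic scaling and the pyramidal remedy can be made concrete with a back-of-the-envelope cell count. The sketch below assumes a hypothetical coarsening schedule (voxel size doubling each time the covered range doubles) purely for illustration; it is not the thesis’s actual architecture.

def bev_cells(range_m, voxel_m):
    """Cells in a square BEV grid covering [-range_m, range_m]^2."""
    side = round(2 * range_m / voxel_m)
    return side * side

# Uniform 0.1 m voxels: cost grows quadratically with range.
for r in (50, 100, 200):
    print(f"uniform grid to {r} m: {bev_cells(r, 0.1):,} cells")
# -> 1,000,000 / 4,000,000 / 16,000,000 cells

def pyramid_cells(max_range_m, base_range_m=50.0, base_voxel_m=0.1):
    """Cells in a range-adaptive pyramid: small voxels near the sensor,
    voxel size doubling each time the range doubles (assumed schedule),
    so each far-field ring costs about the same as the near-field core."""
    total = bev_cells(base_range_m, base_voxel_m)
    r, v = base_range_m, base_voxel_m
    while r < max_range_m:
        r, v = 2 * r, 2 * v
        # ring between r/2 and r, discretized at the coarser voxel size v
        total += bev_cells(r, v) - bev_cells(r / 2, v)
    return total

print(f"pyramid to 200 m: {pyramid_cells(200):,} cells")
# -> 2,500,000 cells instead of 16,000,000 at uniform resolution

Because far-field LiDAR returns are sparse, the coarse outer rings discard little information while cutting the representation cost by more than 6x in this toy example.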

Thesis Committee Members:
Deva Ramanan, Chair
Martial Hebert
Mahadev Satyanarayanan
Raquel Urtasun, Waabi & University of Toronto
Ross Girshick, Meta AI Research
