Resource-Constrained Learning and Inference for Visual Perception
Abstract
We have witnessed rapid advances on major computer vision benchmarks in recent years. However, the hidden computational cost of the top-performing solutions prevents them from being deployed in practice. For example, training large models to convergence may be prohibitively expensive, and autonomous driving or augmented reality may require a reaction time that rivals that of humans, typically 200 milliseconds for visual stimuli. Clearly, vision algorithms need to be adjusted or redesigned when facing resource constraints. This thesis argues that we should embrace resource constraints as first principles of algorithm design. We support this thesis with principled evaluation frameworks and novel constraint-aware solutions for both training and inference in computer vision tasks.
For evaluation frameworks, we first introduce a formal setting for studying training under the non-asymptotic, resource-constrained regime, i.e., budgeted training. Next, we propose streaming accuracy, a single metric that coherently evaluates both latency and accuracy for real-time online perception. More broadly, building on this metric, we introduce a meta-benchmark that systematically converts any single-frame task into a streaming perception task.
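To make the streaming accuracy idea concrete, the sketch below shows one way such an evaluation loop could be simulated: at every ground-truth timestamp, the benchmark scores whatever prediction the model has most recently finished computing, so latency directly degrades the reported accuracy. The helper names (run_model, compute_metric, gt_at) are illustrative assumptions, not the actual benchmark code.

    # Sketch of streaming evaluation: slow models get matched against stale
    # world states, because only predictions finished before each query time
    # are allowed to count.
    def streaming_evaluate(frames, gt_at, run_model, compute_metric):
        """frames: list of (timestamp, image); gt_at(t): ground truth at time t."""
        finished = []          # (finish_time, prediction) pairs, in order
        clock = 0.0            # simulated wall-clock time
        for t, image in frames:
            clock = max(clock, t)              # cannot start before the frame arrives
            pred, runtime = run_model(image)   # runtime measured on target hardware
            clock += runtime
            finished.append((clock, pred))

        scores = []
        for t, _ in frames:
            # latest prediction completed by the query time t
            ready = [p for (done, p) in finished if done <= t]
            latest = ready[-1] if ready else None   # nothing ready yet -> empty output
            scores.append(compute_metric(latest, gt_at(t)))
        return sum(scores) / len(scores)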
For constraint-aware solutions, we propose a budget-aware learning rate schedule for budgeted training, and dynamic scheduling and asynchronous forecasting for streaming perception. We also propose task-specific solutions, including foveated image magnification and progressive knowledge distillation for 2D object detection, multi-range pyramids for 3D object detection, and future object detection with backcasting for end-to-end detection, tracking and forecasting.
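As a small illustration of what budget-awareness means for a learning rate schedule, the sketch below anneals the learning rate as a function of the fraction of the training budget consumed, so it reaches zero exactly when the budget runs out, however small that budget is. The linear form and the parameter names (budget_iters, base_lr) are assumptions for illustration rather than the thesis's exact recipe.

    # Sketch of a budget-aware learning rate schedule: decay is tied to the
    # fraction of the training budget used, not to a fixed iteration count.
    def budget_aware_lr(iteration, budget_iters, base_lr):
        """Linearly anneal the learning rate to zero over the given budget."""
        progress = min(iteration / budget_iters, 1.0)   # fraction of budget consumed
        return base_lr * (1.0 - progress)

    # Example: with a budget of 10,000 iterations and base_lr = 0.1, the
    # learning rate at iteration 7,500 is 0.1 * (1 - 0.75) = 0.025.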
We conclude the thesis with a discussion of future work. We plan to extend streaming perception to include long-term forecasting, generalize our foveated image magnification to arbitrary spatial image understanding tasks, and explore multi-sensor fusion for long-range 3D detection.
BibTeX
@phdthesis{Li-2022-131683,
author = {Mengtian Li},
title = {Resource-Constrained Learning and Inference for Visual Perception},
year = {2022},
month = {May},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-22-20},
}