PVT++: A Simple End-to-End Latency-Aware Visual Tracking Framework

Bowen Li, Ziyuan Huang, Junjie Ye, Yiming Li, Sebastian Scherer, Hang Zhao, and Changhong Fu

Conference Paper, Proceedings of (ICCV) International Conference on Computer Vision, pp. 10006-10016, October, 2023

Abstract

Visual object tracking is essential to intelligent robots. Most existing approaches ignore online latency, which can cause severe performance degradation during real-world processing. For unmanned aerial vehicles (UAVs) in particular, where robust tracking is more challenging and onboard computation is limited, the latency issue can be fatal. In this work, we present a simple framework for end-to-end latency-aware tracking, i.e., end-to-end predictive visual tracking (PVT++). Unlike existing solutions that naively append Kalman filters after trackers, PVT++ can be jointly optimized, so that it not only uses motion information but also leverages the rich visual knowledge in most pre-trained tracker models for robust prediction. In addition, to bridge the training-evaluation domain gap, we propose a relative motion factor, which enables PVT++ to generalize to challenging and complex UAV tracking scenes. These careful designs make the small-capacity, lightweight PVT++ a widely effective solution. This work also presents an extended latency-aware evaluation benchmark for assessing trackers of any speed in the online setting. Empirical results on a robotic platform from the aerial perspective show that PVT++ achieves significant performance gains on various trackers and higher accuracy than prior solutions, largely mitigating the degradation caused by latency. Our code will be made public.

BibTeX

@conference{Li-2023-139110,
author = {Bowen Li and Ziyuan Huang and Junjie Ye and Yiming Li and Sebastian Scherer and Hang Zhao and Changhong Fu},
title = {PVT++: A Simple End-to-End Latency-Aware Visual Tracking Framework},
booktitle = {Proceedings of (ICCV) International Conference on Computer Vision},
year = {2023},
month = {October},
pages = {10006--10016},
keywords = {Visual Object Tracking, Motion Prediction, Latency-aware Perception},
}

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.