Carnegie Mellon University
Abstract:
Modularized and cascaded autonomy stacks (object detection, then tracking, then trajectory prediction) have been widely adopted in autonomous systems such as self-driving cars because of their interpretability. In this talk, I advocate for this modular approach while improving its accuracy and robustness by developing different 3D representations for each module and tightly integrating modules across the stack.
I will begin by discussing the progress we have made on individual modules, which includes (1) a 3D pseudo-LiDAR representation for monocular 3D object detection (Mono3D-PLiDAR, ICCVW ’19); (2) an efficient 3D bounding box representation for 3D multi-object tracking (AB3DMOT, IROS ’20); (3) a social-aware discriminative representation for multi-modal 3D tracking (GNN3DMOT, CVPR ’20); and (4) joint social-temporal modeling for trajectory prediction (AgentFormer, ICCV ’21).
To increase the end-to-end performance of the cascaded autonomy stack, I will then present two approaches for integrating individual modules: (1) an error propagation reduction framework based on parallelizing autonomy stacks (GSDT, ICRA ’21 and PTP, RAL ’21) and multi-hypothesis data association (MTP, arXiv ’21); and (2) a prediction-then-perception pipeline that learns a scene-level 3D representation and can scale the performance of prediction with self-supervised learning (SPF2, CoRL ’20 and S2Net, arXiv ’21).
The completed work is suitable for short-horizon prediction but is limited in long-horizon situations, where error accumulation is more significant and predictions lack diversity. In the proposed work, I plan to go one step further and improve the two integration approaches. First, I propose an affinity-based perception and prediction framework that removes the autonomy stack’s reliance on a trajectory representation, which is obtained through the error-prone data association step. Second, to cover different modes of future prediction in the prediction-then-perception pipeline, I propose to increase the diversity of objects’ motion prediction by injecting semantic information into the scene-level self-supervised learning.
Thesis Committee Members:
Kris Kitani, Chair
Matthew P. O’Toole
Deva Ramanan
Marco Pavone, Stanford University and NVIDIA Research