Abstract:
As humans, we are constantly interacting with and observing a three-dimensional dynamic world, where objects around us change state as they move or are moved, and we ourselves move to navigate and explore. Such an interaction between a dynamic environment and a dynamic ego-agent is complex to model, because the ego-agent's perception of the world is affected by changes in both the environment and its own motion.
In this thesis, we tackle perception and action in this “four-dimensional” world. Specifically, while observations of the world are available to us as petabytes of open-source videos, we ask whether additional physical information, such as pixel-precise depth and noise-free, calibrated camera poses, together with forward models like volumetric rendering or sensor simulation, is required to enable a rich four-dimensional understanding. We therefore propose to investigate whether 4D data and 4D priors are needed to understand and act in the 4D world.
We begin by examining prior work that attempts to understand the 4D world using sequences of 3D LiDAR data and differentiable voxel rendering for the analysis-by-synthesis task of forecasting. We find that modeling the world with an inductive bias for volumetric rendering improves future forecasting, so much so that it can be used for downstream motion planning in autonomous vehicles.
We then discuss prior work on future forecasting that uses only sequences of 2.5D data, in the form of range maps, without any physical forward models. We show a potential application of such forecasting to multi-object tracking across occlusions, where reasoning about an object's future depth is critical. Next, we propose to show that such forecasting models (which do not explicitly enforce 4D consistency) can also be used for downstream motion planning in the real world in the presence of dynamic obstacles.
Finally, we propose to contrast the above approach by instead learning a 4D model of the world that implicitly enforces 4D priors: given a pair of registered point clouds of a dynamic scene in a global coordinate frame, we predict future scene flow in that same coordinate frame.
Thesis Committee Members:
Deva Ramanan, Chair
Shubham Tulsiani
Katerina Fragkiadaki
Carl Vondrick, Columbia University
Leonidas Guibas, Stanford University & Google