Sensing the shapes of objects and their motion relative to a camera is of great importance in a wide range of applications, such as autonomous navigation, robotic manipulation, and cartography. When an observer moves about an object, shape information is revealed through changes in the appearance of the object. We are developing a method for automatically recovering both the shape of an object and the camera motion from a sequence of images.
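In principle, the stream of images produced by moving a camera about a rigid object provides enough information to fully recover both shape and motion. In practice, however, existing techniques based on stereo triangulation are ill-conditioned when the scene is relatively distant from the camera: the depth variation within the object is then small compared with its distance, so small errors in the image measurements produce large errors in the recovered depth.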
We have developed a factorization method to robustly decompose an image stream into object shape and camera motion. The method begins by identifying prominent feature points and tracking them from each image to the next. The positions of those points in each image are then entered into a large measurement matrix, which is factorized into shape and motion using singular value decomposition (SVD). The factorization method is able to reduce the effects of noise because it applies a well-conditioned numerical computation to data that is in fact highly redundant. It makes no assumptions about smoothness or regularity of motion.
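The core numerical step described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work. It assumes orthographic projection, features that are visible in every frame, and a measurement matrix that stacks all x-coordinate rows above all y-coordinate rows. The function names `factorize` and `synthetic_measurements` are hypothetical. The sketch stops at the rank-3 factorization, so shape and motion are recovered only up to an invertible 3x3 linear transformation; resolving that ambiguity requires additional constraints on the camera axes, which are omitted here.

```python
import numpy as np


def factorize(W):
    """Factor a 2F x P measurement matrix into motion (2F x 3) and shape (3 x P).

    Rows of W hold tracked image coordinates, one column per feature point.
    The result is defined only up to an invertible 3x3 transformation.
    """
    # Register the measurements: subtract each row's mean, which removes
    # the image-plane translation of the object in every frame.
    t = W.mean(axis=1, keepdims=True)
    W_reg = W - t

    # Without noise the registered matrix has rank at most 3, so the three
    # largest singular values carry the signal and the rest carry noise.
    U, s, Vt = np.linalg.svd(W_reg, full_matrices=False)
    root = np.sqrt(s[:3])
    motion = U[:, :3] * root          # 2F x 3
    shape = root[:, None] * Vt[:3]    # 3 x P
    return motion, shape, t


def synthetic_measurements(num_frames=20, num_points=30, noise=0.0, seed=0):
    """Build a test measurement matrix from random points and random cameras."""
    rng = np.random.default_rng(seed)
    points = rng.normal(size=(3, num_points))
    rows_x, rows_y = [], []
    for _ in range(num_frames):
        # The first two rows of a random orthogonal matrix act as the
        # camera's horizontal and vertical image axes.
        Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
        offset = rng.normal(size=(2, 1))
        projected = Q[:2] @ points + offset
        rows_x.append(projected[0])
        rows_y.append(projected[1])
    W = np.vstack(rows_x + rows_y)
    return W + noise * rng.normal(size=W.shape)
```

Calling `factorize(synthetic_measurements())` returns factors whose product, plus the per-row translation, reproduces the input matrix. With noisy input, truncating to three singular values discards most of the noise, which is where the redundancy of the data pays off.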
The first factorization method was based on an orthographic model of image projection. This model accounted neither for the scaling of an object's image as the object moves toward or away from the camera, nor for the apparent rotation of an object that is not centered in the image. Because of these limitations, the method was also unable to determine the distance to the object.
We have recently developed a paraperspective factorization method based on a more realistic projection model. The paraperspective projection model accounts for both the scaling effect and the apparent rotation effect. In addition, this new method is able to recover the distance to the object in each image frame. We subsequently extended the method to accommodate longer image sequences in which, due to larger motion of the camera, many of the features are not visible throughout the entire sequence.
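To make the difference between the projection models concrete, the sketch below projects a single 3D point under orthographic, paraperspective, and full perspective projection. It is an illustration under stated assumptions rather than the formulation used in this work: unit focal length, a camera described by its position and three orthonormal axes (with `k_axis` as the optical axis), and the standard construction of paraperspective projection, in which each point is first moved onto the plane through the object centroid parallel to the image plane, along the direction of the line from the camera to the centroid, and then projected by perspective. All function names are hypothetical.

```python
import numpy as np


def camera_coords(point, cam_pos, i_axis, j_axis, k_axis):
    """Express a world point in the camera frame (k_axis is the optical axis)."""
    d = np.asarray(point, dtype=float) - np.asarray(cam_pos, dtype=float)
    return d @ i_axis, d @ j_axis, d @ k_axis


def perspective(point, cam_pos, i_axis, j_axis, k_axis):
    """Full perspective projection with unit focal length."""
    x, y, z = camera_coords(point, cam_pos, i_axis, j_axis, k_axis)
    return np.array([x / z, y / z])


def orthographic(point, cam_pos, i_axis, j_axis, k_axis):
    """Orthographic projection: depth is discarded, so nothing scales with distance."""
    x, y, _ = camera_coords(point, cam_pos, i_axis, j_axis, k_axis)
    return np.array([x, y])


def paraperspective(point, centroid, cam_pos, i_axis, j_axis, k_axis):
    """Paraperspective projection relative to the object centroid."""
    cx, cy, cz = camera_coords(centroid, cam_pos, i_axis, j_axis, k_axis)
    x, y, z = camera_coords(point, cam_pos, i_axis, j_axis, k_axis)
    # Slide the point onto the plane at the centroid's depth, moving parallel
    # to the camera-to-centroid line; this models the apparent rotation of an
    # off-center object.
    shift = (z - cz) / cz
    # Dividing by the centroid's depth models the scaling with distance.
    return np.array([(x - shift * cx) / cz, (y - shift * cy) / cz])
```

Because every point is divided by the same centroid depth, the paraperspective image coordinates remain linear in the point's position, which is the property that keeps a factorization approach applicable. The per-frame depth term is also what carries the distance information that the orthographic model cannot provide.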
Experiments have shown that the method is a practical technique for sensing the shapes of objects and the motion of the observer in a variety of applications. It could be used to automatically create three-dimensional models of objects for use in virtual reality systems, to determine the motion of an autonomous vehicle and map its environment with a single camera, or to build, from a videotape of a site, models of areas slated for construction or of structures to be remodelled.