Self-Supervising Occlusions For Vision
Abstract
Virtually every scene contains occlusions. Even a scene with a single object exhibits self-occlusion: a camera can only view one side of an object (left or right, front or back), or part of the object falls outside the field of view. More complex occlusions occur when one or more objects block part(s) of another object. Understanding and handling occlusions is hard because of the large variation in the type, number, and extent of occlusions possible in scenes. Even humans cannot accurately segment or predict the contour or shape of an occluded region. Current large human-annotated datasets cannot capture such a wide range of occlusions. In this thesis, we propose learning the amodal extent of objects, i.e., both their visible and occluded regions, in a self-supervised fashion in densely populated scenes.
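As a point of notation (our phrasing for exposition, not the thesis's), the amodal mask of an object can be thought of as the union of its visible (modal) region and its occluded region:

M_amodal = M_visible ∪ M_occluded

so an amodal network must predict pixels for which it has no direct image evidence.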
Occlusions in a scene can be broadly categorized as self-occlusion, occlusion by others, and/or truncation. For learning in self-occluded regions, we use multi-view priors in a bootstrapping framework to infer the content of occluded regions of the image. We show that such supervision helps the network learn better image representations even under large occlusions. We extend this with temporal cues from a stationary camera to learn accurate 3D shapes of self-occluded objects. For occlusion by others, we explore the use of longitudinal data, i.e., videos captured over weeks, months, or even years, to supervise occluded regions of an object. We exploit this real data in a novel way: we first automatically mine a large set of unoccluded objects and then composite them into the same views to generate occlusion scenarios. This self-supervision is strong enough for an amodal network to learn to handle occlusions in real-world images.
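The compositing step in the occluded-by-others setting can be summarized with a short sketch. The function and array names below are illustrative assumptions, not the thesis's actual pipeline: an unoccluded object mined from the longitudinal data is pasted into a view of the same scene, which yields a synthetically occluded image together with exact visible (modal) and amodal masks for supervision.

# Minimal sketch (Python/NumPy) of generating an occlusion training example
# by compositing a mined, unoccluded occluder on top of a target object.
import numpy as np

def composite_occlusion(scene_rgb, target_mask, occluder_rgb, occluder_mask):
    """Paste occluder_rgb (where occluder_mask is True) onto scene_rgb.

    Returns the composited image, the visible (modal) mask of the target
    after occlusion, and its amodal mask (unchanged, since the full object
    was unoccluded before compositing). Images are HxWx3 uint8 arrays and
    masks are HxW boolean arrays of the same size.
    """
    composited = scene_rgb.copy()
    composited[occluder_mask] = occluder_rgb[occluder_mask]

    amodal_mask = target_mask.copy()              # full extent is known
    visible_mask = target_mask & ~occluder_mask   # what survives the occlusion
    return composited, visible_mask, amodal_mask

if __name__ == "__main__":
    h, w = 128, 128
    scene = np.full((h, w, 3), 120, dtype=np.uint8)
    target = np.zeros((h, w), dtype=bool)
    target[40:90, 40:90] = True                   # unoccluded target object
    occluder_img = np.full((h, w, 3), 200, dtype=np.uint8)
    occluder = np.zeros((h, w), dtype=bool)
    occluder[60:128, 20:70] = True                # mined occluder pasted back in

    img, visible, amodal = composite_occlusion(scene, target, occluder_img, occluder)
    print("amodal pixels:", amodal.sum(), "visible pixels:", visible.sum())

In this sketch the amodal mask comes for free because the target was unoccluded before compositing; the composited image and visible mask form the network input and modal supervision, respectively.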
Finally, we present two methodologies for learning across different types of occlusion. First, we combine the two previous paradigms, self-occlusion and occlusion by others, to predict the 3D amodal reconstruction of objects. Here, we show that learning and exploiting different occlusion categories, such as self-occlusion, occlusion by others, and truncation, enhances the accuracy of the reconstruction. Second, we show end-to-end learning of 3D reconstruction and tracking of objects from multi-view video input. We discuss and analyze the pros and cons of the different approaches and representations for the amodal representation of objects.
BibTeX
@phdthesis{Narapureddy-2022-134773,
author = {Dinesh Reddy Narapureddy},
title = {Self-Supervising Occlusions For Vision},
year = {2022},
month = {December},
school = {Carnegie Mellon University},
address = {Pittsburgh, PA},
number = {CMU-RI-TR-22-72},
keywords = {Occlusions, Self-Supervision, Self-Occlusion, Occluded-by-others},
}