
3DQ-Nets: Visual Concepts Emerge in Pose Equivariant 3D Quantized Neural Scene Representations

Mihir Prabhudesai, Shamit Lal, Hsiao-Yu Fish Tung, Adam W. Harley, Shubhankar Potdar, and Katerina Fragkiadaki
Workshop Paper, CVPR '20 Workshops, pp. 388-389, June 2020

Abstract

We present a framework that learns 3D object concepts without supervision from 3D annotations. Our model detects objects, quantizes their features into prototypes, infers associations across detected objects in different scenes, and uses those associations to (self-)supervise its visual feature representations. Object detection, correspondence inference, representation learning, and object-to-prototype compression take place in a 3-dimensional visual feature space, inferred from the input RGB-D images using differentiable inverse graphics architectures optimized end-to-end for predicting views of scenes. Our 3D feature space learns to be invariant to the camera viewpoint and disentangled from projection artifacts, foreshortening, and cross-object occlusions. As a result, the 3D features learn to establish accurate correspondences across objects detected under varying camera viewpoints, sizes, and poses, and to compress them into prototypes. Our prototypes are likewise represented as 3-dimensional feature maps; they are rotated and scaled appropriately during matching to explain object instances in a variety of 3D poses and scales. We show that this pose and scale equivariance permits much better compression of objects into their prototypical representations. Our model is optimized with a mix of end-to-end gradient descent and expectation-maximization iterations. We show that 3D object detection, correspondence inference, and object-to-prototype clustering improve over time and help one another. We demonstrate the usefulness of our model in few-shot learning: one or a few object labels suffice to learn a pose-aware 3D object detector for the object category. To the best of our knowledge, this is the first system to demonstrate that 3D visual concepts can emerge without language annotations, but rather by moving around and relating episodic visual experiences, in a self-paced, automated learning process.
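
The abstract describes matching detected object feature volumes against prototypes under a search over 3D poses, with optimization alternating between gradient descent and expectation-maximization. The sketch below is a simplified, illustrative version of one such EM iteration, not the authors' implementation: it assumes PyTorch, voxel feature volumes of shape (C, D, H, W), a yaw-only rotation search with hard assignments, and hypothetical names (rotate_volume, em_step) that do not come from the paper.

import math
import torch
import torch.nn.functional as F


def rotate_volume(vol, yaw):
    # Rotate a (N, C, D, H, W) feature volume about the vertical axis by `yaw`
    # radians, by resampling along an affine grid (trilinear, zero padding).
    n = vol.shape[0]
    cos, sin = math.cos(yaw), math.sin(yaw)
    theta = torch.tensor([[cos, 0.0, -sin, 0.0],
                          [0.0, 1.0,  0.0, 0.0],
                          [sin, 0.0,  cos, 0.0]], dtype=vol.dtype, device=vol.device)
    theta = theta.unsqueeze(0).repeat(n, 1, 1)
    grid = F.affine_grid(theta, vol.shape, align_corners=False)
    return F.grid_sample(vol, grid, align_corners=False)


def em_step(objects, prototypes, yaws):
    # One EM iteration. E-step: assign each object volume to its best prototype
    # over a discrete set of yaw rotations (cosine similarity in feature space).
    # M-step: refit each prototype as the mean of its assigned objects after
    # rotating them back into the prototype's canonical pose.
    n, k = objects.shape[0], prototypes.shape[0]
    flat_obj = F.normalize(objects.reshape(n, -1), dim=1)

    best_sim = objects.new_full((n, k), -2.0)
    best_yaw = objects.new_zeros((n, k))
    for yaw in yaws:
        rotated = rotate_volume(prototypes, yaw)
        flat_rot = F.normalize(rotated.reshape(k, -1), dim=1)
        sim = flat_obj @ flat_rot.t()              # (N, K) similarity at this yaw
        better = sim > best_sim
        best_sim = torch.where(better, sim, best_sim)
        best_yaw = torch.where(better, torch.full_like(best_yaw, yaw), best_yaw)

    assign = best_sim.argmax(dim=1)                # hard assignment per object
    new_protos = prototypes.clone()
    for j in range(k):
        members = (assign == j).nonzero(as_tuple=True)[0].tolist()
        if not members:
            continue
        aligned = torch.cat([rotate_volume(objects[i:i + 1], -best_yaw[i, j].item())
                             for i in members])
        new_protos[j] = aligned.mean(dim=0)
    return assign, new_protos


# Toy usage: 8 random "object" volumes, 3 prototypes, 8 candidate yaw angles.
objects = torch.randn(8, 16, 8, 8, 8)
prototypes = torch.randn(3, 16, 8, 8, 8)
yaws = [i * 2.0 * math.pi / 8 for i in range(8)]
assignments, prototypes = em_step(objects, prototypes, yaws)
print(assignments)

In the paper, this discrete pose search would also cover scale and would interleave with end-to-end feature learning; the toy loop above only shows the assignment and prototype-refitting structure.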

BibTeX

@workshop{Prabhudesai-2020-126665,
author = {Mihir Prabhudesai and Shamit Lal and Hsiao-Yu Fish Tung and Adam W. Harley and Shubhankar Potdar and Katerina Fragkiadaki},
title = {3DQ-Nets: Visual Concepts Emerge in Pose Equivariant 3D Quantized Neural Scene Representations},
booktitle = {Proceedings of CVPR '20 Workshops},
year = {2020},
month = {June},
pages = {388--389},
}