TY - GEN
T1 - Learning to synthesize 3D indoor scenes from monocular images
AU - Zhu, Fan
AU - Shen, Fumin
AU - Liu, Li
AU - Shao, Ling
AU - Xie, Jin
AU - Fang, Yi
N1 - Funding Information:
†This work was done when the author was with New York University Abu Dhabi.
Publisher Copyright:
© 2018 Copyright held by the owner/author(s).
PY - 2018/10/15
Y1 - 2018/10/15
N2 - Depth images have long played a critical role in indoor scene understanding, and are particularly important for tasks that involve 3D inference. However, since depth images are not universally available, removing them from the testing stage can significantly improve a method's generality. In this work, we consider scenarios where depth images are unavailable in the testing data, and propose to learn a convolutional long short-term memory (Conv LSTM) network and a regression convolutional neural network (regression ConvNet) using only monocular RGB images. The proposed networks benefit from 2D segmentations, object-level spatial context, object-scene dependencies and objects' geometric information, where optimization is governed by a semantic label loss, which measures the label consistency of both objects and scenes, and a 3D geometrical loss, which measures the correctness of objects' 6-DoF estimation. Conv LSTM and regression ConvNet are applied to scene/object classification, object detection and 6-DoF estimation tasks, respectively, where we utilize joint inference from both networks and further provide the perspective of synthesizing fully rigged 3D scenes according to object arrangements in monocular images. Both quantitative and qualitative experimental results are provided on the NYU-v2 dataset, and we demonstrate that the proposed Conv LSTM achieves state-of-the-art performance without requiring depth information.
AB - Depth images have long played a critical role in indoor scene understanding, and are particularly important for tasks that involve 3D inference. However, since depth images are not universally available, removing them from the testing stage can significantly improve a method's generality. In this work, we consider scenarios where depth images are unavailable in the testing data, and propose to learn a convolutional long short-term memory (Conv LSTM) network and a regression convolutional neural network (regression ConvNet) using only monocular RGB images. The proposed networks benefit from 2D segmentations, object-level spatial context, object-scene dependencies and objects' geometric information, where optimization is governed by a semantic label loss, which measures the label consistency of both objects and scenes, and a 3D geometrical loss, which measures the correctness of objects' 6-DoF estimation. Conv LSTM and regression ConvNet are applied to scene/object classification, object detection and 6-DoF estimation tasks, respectively, where we utilize joint inference from both networks and further provide the perspective of synthesizing fully rigged 3D scenes according to object arrangements in monocular images. Both quantitative and qualitative experimental results are provided on the NYU-v2 dataset, and we demonstrate that the proposed Conv LSTM achieves state-of-the-art performance without requiring depth information.
KW - CNN
KW - Indoor scene understanding
KW - LSTM
KW - Object detection
KW - Scene classification
UR - http://www.scopus.com/inward/record.url?scp=85058226066&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85058226066&partnerID=8YFLogxK
U2 - 10.1145/3240508.3240700
DO - 10.1145/3240508.3240700
M3 - Conference contribution
AN - SCOPUS:85058226066
T3 - MM 2018 - Proceedings of the 2018 ACM Multimedia Conference
SP - 501
EP - 509
BT - MM 2018 - Proceedings of the 2018 ACM Multimedia Conference
PB - Association for Computing Machinery, Inc
T2 - 26th ACM Multimedia Conference, MM 2018
Y2 - 22 October 2018 through 26 October 2018
ER -