Learning to synthesize 3D indoor scenes from monocular images

Fan Zhu, Fumin Shen, Li Liu, Ling Shao, Jin Xie, Yi Fang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution


Depth images have long played a critical role in indoor scene understanding, and are particularly important for tasks that involve 3D inference. However, since depth images are not universally available, removing them from the testing stage can significantly improve the generality of a method. In this work, we consider scenarios where depth images are not available in the testing data, and propose to learn a convolutional long short-term memory (Conv LSTM) network and a regression convolutional neural network (regression ConvNet) using only monocular RGB images. The proposed networks benefit from 2D segmentations, object-level spatial context, object-scene dependencies and objects' geometric information. Optimization is governed by two losses: the semantic label loss, which measures the label consistency of both objects and scenes, and the 3D geometric loss, which measures the correctness of objects' 6DoF estimation. The Conv LSTM and the regression ConvNet are applied to scene/object classification, object detection and 6DoF estimation tasks respectively; we utilize the joint inference from both networks and further offer the prospect of synthesizing fully rigged 3D scenes according to objects' arrangements in monocular images. Both quantitative and qualitative experimental results are provided on the NYU-v2 dataset, and we demonstrate that the proposed Conv LSTM can achieve state-of-the-art performance without requiring depth information.
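
As an illustration only, the two optimization terms named in the abstract (a semantic label loss over objects and scenes, and a 3D geometric loss over 6DoF estimates) could be combined as in the sketch below. This is not the authors' formulation: the function names, the choice of cross-entropy and mean squared error, the 6-value pose vector and the `geo_weight` factor are all assumptions made for the example.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true label under predicted probabilities."""
    # Clamp to avoid log(0); the 1e-12 floor is an arbitrary choice for this sketch.
    return -math.log(max(probs[label], 1e-12))


def pose_loss(pred_pose, true_pose):
    """Mean squared error over a pose vector (assumed: 3 translation + 3 rotation values)."""
    return sum((p - t) ** 2 for p, t in zip(pred_pose, true_pose)) / len(true_pose)


def joint_loss(obj_probs, obj_label, scene_probs, scene_label,
               pred_pose, true_pose, geo_weight=1.0):
    """Hypothetical joint objective: semantic term plus weighted geometric term."""
    # Semantic term: label consistency for both the object and the scene.
    semantic = cross_entropy(obj_probs, obj_label) + cross_entropy(scene_probs, scene_label)
    # Geometric term: correctness of the 6DoF estimate.
    geometric = pose_loss(pred_pose, true_pose)
    return semantic + geo_weight * geometric


# Usage: one object with 3 candidate classes, one scene with 2 candidate classes.
loss = joint_loss(
    obj_probs=[0.7, 0.2, 0.1], obj_label=0,
    scene_probs=[0.4, 0.6], scene_label=1,
    pred_pose=[0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    true_pose=[0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
)
print(loss)
```

A single weighted sum is the simplest way to let both terms drive training at once; how the paper actually balances or schedules the two losses is not stated in the abstract.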

Original language: English (US)
Title of host publication: MM 2018 - Proceedings of the 2018 ACM Multimedia Conference
Publisher: Association for Computing Machinery, Inc
Number of pages: 9
ISBN (Electronic): 9781450356657
State: Published - Oct 15 2018
Event: 26th ACM Multimedia conference, MM 2018 - Seoul, Korea, Republic of
Duration: Oct 22 2018 to Oct 26 2018

Publication series

Name: MM 2018 - Proceedings of the 2018 ACM Multimedia Conference


Other: 26th ACM Multimedia conference, MM 2018
Country/Territory: Korea, Republic of


Keywords

  • CNN
  • Indoor scene understanding
  • LSTM
  • Object detection
  • Scene classification

ASJC Scopus subject areas

  • Computer Graphics and Computer-Aided Design
  • Human-Computer Interaction

