Depth images have always been playing critical roles for indoor scene understanding problems, and are particularly important for tasks in which 3D inferences are involved. However, since depth images are not universally available, abandoning them from the testing stage can significantly improve the generality of a method. In this work, we consider the scenarios where depth images are not available in the testing data, and propose to learn a convolutional long short-term memory (Conv LSTM) network and a regression convolutional neural network (regression ConvNet) using only monocular RGB images. The proposed networks benefit from 2D segmentations, object-level spatial context, object-scene dependencies and objects' geometric information, where optimization is governed by the semantic label loss, which measures the label consistencies of both objects and scenes, and the 3D geometrical loss, which measures the correctness of objects' 6Dof estimation. Conv LSTM and regression ConvNet are applied to scene/object classification, object detection and 6Dof estimation tasks respectively, where we utilize the joint inference from both networks and further provide the perspective of synthesizing fully rigged 3D scenes according to objects' arrangements in monocular images. Both quantitative and qualitative experimental results are provided on the NYU-v2 dataset, and we demonstrate that the proposed Conv LSTM can achieve state-of-the-art performance without requiring the depth information.