Abstract
We infer and generate three-dimensional (3D) scene information from a single input image and without supervision. This problem is under-explored, with most prior work relying on supervision from, e.g., 3D ground truth, multiple images of a scene, image silhouettes or keypoints. We propose Pix2Shape, an approach to solve this problem with four components: (i) an encoder that infers the latent 3D representation from an image, (ii) a decoder that generates an explicit 2.5D surfel-based reconstruction of a scene from the latent code, (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation, and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from a training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, i.e., voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space, and that the decoder can then be applied to this latent representation in order to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on a novel benchmark we developed, called 3D-IQTT, to evaluate models based on their ability to enable 3D spatial reasoning. Qualitative and quantitative evaluations demonstrate Pix2Shape’s ability to solve scene reconstruction, generation and understanding tasks.
Introduction
Humans sense, plan and act in a 3D world despite only directly observing 2D projections of their 3D environment. Automatic 3D understanding seeks to recover a realistic underlying 3D structure of a scene using only 2D image projection(s). This long-standing challenge in computer vision has recently admitted learning-based solutions. Many such approaches leverage 3D supervision, such as from images annotated with ground truth 3D shape information (Girdhar et al. 2016; Wu et al. 2015, 2016b; Choy et al. 2016). Recent approaches rely on other forms of 3D supervision, such as multiple views of the same object (Yan et al. 2016; Tulsiani et al. 2017; Li et al. 2019), 2.5D supervision (Wu et al. 2016a, 2017), keypoint annotations (Kar et al. 2014; Novotný et al. 2019) and silhouette annotations (Wiles and Zisserman 2017; Henderson and Ferrari 2018; Chen et al. 2019). Our work treats the problem of unsupervised single-image 3D scene understanding. This form of the problem is challenging, as we aim to infer an encoding of 3D structure from only a single image, without any form of 3D ground truth supervision during training. We do not rely on any 3D scene supervision; however, we employ camera pose, scene reflectance profiles and outgoing/observed radiance as weak supervision signals.
While the benefits of employing supervision can certainly be argued for in this context—i.e., with the growing number of datasets with labelled 3D ground truth for objects (Chang et al. 2015) and cityscapes (Caesar et al. 2019)—one benefit of approaching the problem from an unsupervised perspective is that we are not limited to the types of 3D objects represented in these datasets. Indeed, however vast, existing datasets fall far short of capturing all possible artificial and natural 3D scenes and objects. Moreover, datasets with depth annotations often contain incomplete or noisy depth maps due to limitations in depth-capture hardware.
Unsupervised single-image 3D understanding is a relatively under-explored area, with only a few works treating this setting (Rezende et al. 2016; Yan et al. 2016). These methods rely on deformable 3D mesh or voxel representations of the world, and have only been applied to simple 3D primitives (e.g., cubes, spheres) or single objects over a clean background.
One approach to this problem is to leverage prior knowledge on how 2D images are formed from the 3D world, including the effects of shading and occlusion. Building machine learning architectures with explicit knowledge of this forward rendering model could help better disambiguate the 3D structure of geometry from 2D observations. In this spirit, we propose the Pix2Shape architecture for unsupervised single-image 3D understanding: a model that learns abstract latent encodings of the geometry of an entire scene, all from a single image. These learnt implicit scene representations can be decoded—when combined with a target viewing/camera position—into a view-dependent realization of 2.5D surfaces (depth map and surface normals) visible only from that view. We can then readily re-render these explicit view-dependent surface elements (surfels) at their corresponding 2D image projections in order to synthesize an unseen view of the scene.
Our model builds atop Adversarially Learned Inference (ALI) (Dumoulin et al. 2016), an extension of Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) that infers a latent code from an image using an encoder network. In Pix2Shape, the encoder network learns a latent representation that embeds the 3D information of an entire scene from an image. We map the latent representation to view-dependent depth and normal maps using a decoder before projecting these maps onto image space using a differentiable renderer. We evaluate the resulting image using an adversarial critic. Our model remains unsupervised as it does not require ground truth depth maps nor any other kind of 3D supervision used in previous works (e.g., observing the same object from multiple views, keypoint registration or image silhouettes). Note that, at any given instant, our model outputs the depth and surface normals conditioned on a specific camera view; we never produce/synthesize the entirety of the 3D world structure. That being said, the latent space we learn embeds the 3D geometry of the entire underlying scene, which allows our decoder and renderer to smoothly extrapolate and synthesize scene geometry from unseen camera views during inference. We refer to this indirect process of embedding 3D information in the latent code as “implicit” inference.
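The dataflow described above can be sketched with shape-only stubs. All network internals below are placeholders of our own (none of these functions reflect the paper's actual architectures); the sketch only shows how encoder, decoder and renderer chain together:

```python
import numpy as np

# Hypothetical dimensions for illustration only.
LATENT_DIM, H, W = 128, 16, 16

def encoder(image):
    """Stub encoder: image -> z_scene (a real model is a CNN)."""
    return np.zeros(LATENT_DIM)

def decoder(z_scene, z_view):
    """Stub decoder: (z_scene, camera pose) -> per-pixel depth map."""
    return np.ones((H, W))

def renderer(depth, z_view):
    """Stub differentiable renderer: surfel depth map -> shaded RGB image."""
    return np.zeros((H, W, 3))

image = np.zeros((H, W, 3))
z_view = np.array([1.0, 1.0, 1.0])   # camera position

z_scene = encoder(image)             # infer latent scene embedding
depth = decoder(z_scene, z_view)     # view-dependent 2.5D output
rgb = renderer(depth, z_view)        # projected back to 2D for the critic
```

The key structural point is that the critic only ever sees `rgb`, so gradients must flow back through the renderer into the decoder.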
An ambitious long-term goal is to infer the 3D structure of real-world photographs, and our work takes a first step in this direction: we rely on physically based rendering in order to build a model of the world. However, to make training tractable, we experiment exclusively with synthetically constructed scenes, adopting several simplifying assumptions. Of note, we assume that the world is composed of piecewise-smooth 3D elements and that, for each input image, the illumination, view and object materials are known. Since each pixel in an image is a function of geometry, illumination, view and texture, our focus in this work is to learn the underlying geometry of a scene while keeping the other parameters fixed.
We evaluate our model’s ability to recover accurate and consistent depth from a single image, for both seen and unseen viewpoints, using Hausdorff and Chamfer distance metrics between generated and ground truth depth maps. In addition to reconstruction, we can sample novel scenes (at novel views) using the generative nature of our adversarial network. Finally, we propose a new 3D understanding benchmark—the 3D IQ Test Task (3D-IQTT)—to evaluate a model’s understanding of the underlying 3D structure of an object: the test consists of matching a rotated view of a reference object (Fig. 2). To perform this task, we develop a novel 3D-IQ dataset to train and test against. In this setting, we can additionally estimate camera pose in our learnt latent 3D world embedding. Our contributions are as follows:

1.
An approach for unsupervised single-image 3D understanding that builds a latent embedding of an entire 3D scene,

2.
A decoding scheme that leverages view-dependent, explicit surfel representations to sample scene information more efficiently than (world-space) voxels and meshes,

3.
A differentiable 3D renderer that can be included as a layer in any learning-based neural network architecture, and

4.
3D-IQTT, a new 3D understanding benchmark.
Related Work
Single View 3D Reconstruction and Generation
3D generation and reconstruction have been studied extensively in the computer vision and graphics communities (Saxena et al. 2009; Chaudhuri et al. 2011; Kalogerakis et al. 2012; Chang et al. 2015; Rezende et al. 2016; Soltani et al. 2017; Kulkarni et al. 2015; Tulsiani et al. 2016; Huang et al. 2019; Jiang et al. 2019). Most methods in the literature focus on recovering the 3D structure from 2D images by using explicit 3D supervision. Choy et al. (2016), Girdhar et al. (2016), Wu et al. (2015, 2016b) and Zhu et al. (2018) reconstruct and/or generate 3D voxels from a latent representation by directly comparing with available 3D shapes. Wu et al. (2017) and Zhang et al. (2018) use 2.5D supervision during training, i.e., depth maps. More recent methods tend to use weaker forms of supervision for single-image reconstruction. Wu et al. (2016a), Kato and Harada (2018), Henderson and Ferrari (2018) and Chen et al. (2019) use image-based annotations like silhouettes, 2D keypoints or object masks. Kanazawa et al. (2018) learn both texture and shape from 2D images, leveraging multiple learning signals such as keypoints and mean shape.
Rezende et al. (2016), Yan et al. (2016), Gadelha et al. (2016) and Novotný et al. (2017) learn 3D shapes by using multiple views and approximately differentiable rendering mechanisms. However, one of Rezende et al. (2016)’s experiments shows reconstruction of 3D objects trained using a single view. As far as we know, theirs is the only fully unsupervised method for explicit 3D reconstruction from a single image. Their method is limited to reconstructing relatively simple 3D primitives floating in space due to the strong priors required for the model to work. Concurrent to our work, HoloGAN (Nguyen-Phuoc et al. 2019) can synthesize 2D images of more realistic scenes (e.g., cars, bedrooms) under camera view rotation. However, their model cannot recover geometry from its implicit representation. Compared to Rezende et al. (2016), our model can learn to represent more complex synthetic indoor scenes composed of multiple ShapeNet (Chang et al. 2015) objects and, while we do not address real image inputs (as HoloGAN does), we can infer explicit geometry for visible surfaces from each given view. As such, our model can also be applied to 3D reconstruction (like Rezende et al. (2016), but only for visible parts of the scene) and novel-viewpoint image generation (like Nguyen-Phuoc et al. (2019)).
Differentiable Rendering
In order for deep neural network based models to infer 3D structure from 2D projections (images), the derivatives of image pixels with respect to 3D geometry and other properties must be computed and propagated. Gradient estimation through the rendering process is a challenging task: in both rasterization and ray-tracing techniques, the visibility-mapping step is often non-differentiable. Loper and Black (2014) is one of the best-known methods for differentiable rendering, but has limited applicability due to high computational and memory costs. Kato et al. (2017) and Rezende et al. (2016) approximate the gradients of the rendering process and are limited to rasterization-based rendering schemes. OpenDR (Loper and Black 2014), as used by Henderson and Ferrari (2018), applies a first-order Taylor approximation to compute gradients. Liu et al. (2019) compute gradients analytically by softly assigning the contribution of each triangle face to a pixel in mesh-based representations. Chen et al. (2019) improved this soft assignment and allow the use of textures by interpolating local mesh properties for foreground pixels. Insafutdinov and Dosovitskiy (2018) proposed a differentiable re-projection mechanism for point clouds to infer 3D shapes. However, learning methods built on these approaches so far require either more than one view per object or 2D silhouettes as supervision, and can only reconstruct single objects. In our work, we circumvent the non-differentiability challenge as follows: (1) our network is trained to output only the “visible” surface elements (surfels) of the scene conditioned on the view, i.e., a 2.5D representation, and (2) we maintain a one-to-one correspondence between the output surfels and the pixels. In other words, our model outputs exactly one surfel in object space per pixel in the output image, and the final image is then formed by a differentiable shading operation.
This makes our model differentiable, easily adaptable across image resolutions, and allows end-to-end training.
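As a concrete illustration of the differentiable shading step, the following sketch computes a Lambertian (diffuse) shade for exactly one surfel per pixel. The single directional light and the specific array shapes are our own assumptions for the example, not details taken from the paper:

```python
import numpy as np

def shade_lambertian(normals, albedo, light_dir, light_color=1.0):
    """Shade one surfel per pixel: c = albedo * light * max(0, N . L).

    normals:   (H, W, 3) unit surface normals in camera space
    albedo:    (H, W, 3) diffuse reflectance per surfel
    light_dir: (3,) unit vector from surface toward the light

    Every operation (dot product, clip, multiply) is differentiable
    almost everywhere, so gradients flow back to the normals and,
    through them, to the generated depth.
    """
    ndotl = np.clip(np.einsum('hwc,c->hw', normals, light_dir), 0.0, None)
    return albedo * light_color * ndotl[..., None]

# Surfels facing the camera (+z), lit head-on: full albedo is returned.
H, W = 4, 4
normals = np.zeros((H, W, 3)); normals[..., 2] = 1.0
albedo = np.full((H, W, 3), 0.5)
img = shade_lambertian(normals, albedo, np.array([0.0, 0.0, 1.0]))
```

In an autodiff framework (e.g., the paper's PyTorch renderer), the same arithmetic expressed on tensors yields the required gradients for free.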
Method
Our method follows the ALI architecture (Dumoulin et al. 2016), where we have an encoder branch that learns to convert images into latent representations, a decoder branch that learns to generate images from randomly sampled latent representations, and a critic that tries to predict whether pairs of latent code and image are real or fake. The critic and encoder pathways are implemented as convolutional neural networks, but the decoder pathway contains an additional differentiable renderer, usable like a layer of a neural network, that converts the 2.5D surfel representation into a 2D image by computing shading at each surfel. Additionally, the decoder is conditioned on a camera pose. See Fig. 3 for an overview. In the following sections, we detail the individual components of this architecture.
3D Representation and Surfels
Representing 3D structure as voxels or meshes presents different challenges for generative models (Kobbelt and Botsch 2004). Representing entire objects using voxels scales poorly given the \(O(n^3)\) complexity. Additionally, the vast majority of generated voxels are not relevant to most viewpoints, such as voxels entirely inside objects. A common workaround is to use a surface representation such as a mesh. However, meshes come with their own drawbacks, such as their graph-like structure, which makes them difficult to generate using neural networks. Current mesh-based methods mainly rely on deforming a pre-existing mesh and thus limit the object topology to the same genus as the template mesh.
Our approach represents the 3D scene implicitly in a high-dimensional latent variable. In our framework, this latent variable (i.e., a vector) is decoded using a decoder network conditioned on the camera pose into a viewpoint-dependent representation of surface elements (i.e., surfels (Pfister et al. 2000), square-shaped planes that are scaled based on depth to roughly fit the size of a pixel) that constitute the visible part of the scene. This representation is very compact: given a renderer’s point of view, we can represent only the part of the 3D surface needed by the renderer. As the camera moves closer to a part of the scene, surfels become more compact and thereby increase the amount of visible detail. For descriptive purposes we discuss surfels as squares, but in general they can have any shape. Figure 1 compares surfels with different representations. Surfels differ from other explicit representations in that they are view-dependent, i.e., this representation changes for different camera poses (but the implicit latent vector representation does not).
Formally, surfels are represented as a tuple \((P, N, \rho )\), where \(P = (p_x, p_y, p_z)\) is the 3D position, \(N = (n_x, n_y, n_z)\) is the surface normal vector, and \(\rho = (k_r, k_g, k_b)\) is the albedo of the surface material. Note that \(\rho \) represents the material properties at the point P and could take a different size for a different shading model. Since we are only interested in modelling the structural properties of scenes, i.e., geometry and depth, we assume that objects in the scene have uniform material properties and thus keep \(\rho \) fixed. We also estimate the normals from depth by assuming locally planar surfaces. We represent the surfels in the camera coordinate system and generate one surfel for each pixel in the output image. This makes our representation very compact. Thus, the only parameter the decoder network needs to generate is \(p_z\), i.e., a depth map.
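Back-projecting a generated depth map into camera-space surfel positions P can be sketched as follows. The pinhole camera model and the field-of-view value here are illustrative assumptions of ours, not parameters taken from the paper:

```python
import numpy as np

def backproject_depth(depth, fov_deg=60.0):
    """Back-project a depth map into per-pixel 3D surfel positions
    (camera coordinates), assuming a pinhole camera and a square pixel
    grid. `fov_deg` is an illustrative choice, not a value from the
    paper. Returns an (H, W, 3) array of positions P = (p_x, p_y, p_z).
    """
    h, w = depth.shape
    # Focal length in pixels from the horizontal field of view.
    f = 0.5 * w / np.tan(np.deg2rad(fov_deg) / 2.0)
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    # Scale ray directions by the generated depth.
    x = (xs - (w - 1) / 2.0) / f * depth
    y = (ys - (h - 1) / 2.0) / f * depth
    return np.stack([x, y, depth], axis=-1)

# A constant depth map of 2.0 gives a fronto-parallel sheet of surfels.
P = backproject_depth(np.full((8, 8), 2.0))
```

Since the x and y coordinates are fully determined by the pixel grid, the camera intrinsics and \(p_z\), only the depth map needs to be generated, as stated above.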
Differentiable 3D Renderer
Since our architecture is GAN-like and uses 2D images as input to the critic network, we need to project the generated 3D representations down to 2D space using a renderer. In our setting, each stage of the rendering pipeline must be differentiable to allow us to take advantage of gradient-based optimization and backpropagate the critic’s error signal to the surfel representation. Our proposed rendering process is differentiable because: (1) each output pixel depends on exactly one surfel, and (2) we employ a differentiable shading operation to compute the color of each pixel. Our PyTorch implementation of the differentiable renderer can render a \(128 \times 128\) surfel-based scene in under 1.4 ms on a mobile NVIDIA GTX 1060 GPU. Further details about the rendering implementation can be found in Appendix B.
Model
The adversarial training paradigm allows the generator network to capture the underlying target distribution by competing with an adversarial critic network. Pix2Shape employs bidirectional adversarial training (Dumoulin et al. 2016; Donahue et al. 2016) to model the distribution of surfels from 2D images.
Bi-Directional Adversarial Training
ALI (Dumoulin et al. 2016) or BiGAN (Donahue et al. 2016) extend the GAN (Goodfellow et al. 2014) framework by including the learning of an inference mechanism. Specifically, in addition to the decoder network \(G_x\), ALI provides an encoder \(G_z\) which maps data points \(\varvec{x}\) to latent representations \(\varvec{z}\). In these bidirectional models, the critic, D, discriminates in both the data space (\(\varvec{x}\) versus \(G_x(\varvec{z})\)) and the latent space (\(\varvec{z}\) versus \(G_z(\varvec{x})\)) jointly, maximizing the adversarial value function over the two joint distributions. The final min-max objective can be written as:

$$\begin{aligned} \min _{G} \max _{D} V(D, G) = \mathbb {E}_{q(\varvec{x})}\left[ \log D(\varvec{x}, G_z(\varvec{x}))\right] + \mathbb {E}_{p(\varvec{z})}\left[ \log \left( 1 - D(G_x(\varvec{z}), \varvec{z})\right) \right] \end{aligned}$$

where \(q(\varvec{x})\) and \(p(\varvec{z})\) denote the encoder and decoder marginal distributions.
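A minimal numerical sketch of evaluating this value function over mini-batches follows. The sigmoid scoring function standing in for the critic D is purely illustrative; a real critic is a trained network over (image, latent) pairs:

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(x, z):
    """Stand-in joint critic D(x, z) -> probability in (0, 1).
    A real model would be a neural network; this is a toy score."""
    s = x.mean(axis=1) + z.mean(axis=1)
    return 1.0 / (1.0 + np.exp(-s))

# Encoder pair: real data x ~ q(x) with inferred latents G_z(x).
x_real = rng.normal(size=(64, 10)); z_inferred = rng.normal(size=(64, 4))
# Decoder pair: sampled latents z ~ p(z) with generated images G_x(z).
z_prior = rng.normal(size=(64, 4)); x_generated = rng.normal(size=(64, 10))

# Monte Carlo estimate of
# V(D, G) = E_q[log D(x, G_z(x))] + E_p[log(1 - D(G_x(z), z))]
value = (np.log(critic(x_real, z_inferred)).mean()
         + np.log(1.0 - critic(x_generated, z_prior)).mean())
```

The critic maximizes this quantity while the encoder/decoder pair minimizes it, which drives the two joint distributions to match.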
Modelling Depth and Constrained Normal Estimation
The encoder network captures the distribution over the latent space of the scene given an image data point \(\varvec{x}\). The decoder network maps a fixed scene latent distribution \(p(\varvec{z}_{scene})\) (a standard normal distribution in our case) to the 2.5D surfel representation from a given viewpoint \(\varvec{z}_{view}\). The surfel representation is rendered into a 2D image using our differentiable renderer. The resulting image is given as input to the critic to distinguish it from the ground truth image data. To clarify the notation, note that the output of the encoder is \(\varvec{z}_{scene}\) and the input to the decoder is \((\varvec{z}_{scene},\varvec{z}_{view})\).
A straightforward way to design the decoder network is to learn a conditional distribution to produce the surfels’ depth (\(p_z\)) and normal (N). However, this could lead to inconsistencies between the local shape and the surface normal. For instance, the decoder could fake an RGB image of a 3D shape simply by changing the normals while keeping the depth fixed. To avoid this issue, we exploit the fact that real-world surfaces are locally planar, and that surfaces visible to the camera have normals constrained to lie in the half-space of normal directions visible from the camera’s viewpoint. Considering the camera to be looking along the \(z\) axis, the estimated normal has the constraint \(n_z > 0\). Therefore, the local surface normal is estimated by solving the following problem for every surfel:

$$\begin{aligned} \hat{N} = \mathop {\mathrm {arg\,min}}\limits _{N} \sum _{i=1}^{8} \left( N \cdot \nabla P_i \right) ^2 \quad \text {subject to} \quad \Vert N\Vert _2 = 1, \quad n_z > 0 \end{aligned}$$
where the spatial gradient \(\nabla P\) is computed for each of the 8 neighbour points, and P is the position of the surfels in the camera coordinate system obtained by backprojecting the generated depth along rays.
This approach enforces consistency between the predicted depth field and the computed normals and provides a gradient signal to the depth from the shading process. If the depth is incorrect, the normalestimator outputs an incorrect set of normals, resulting in an inconsistent RGB image with the data distribution, which in turn would get penalized by the critic. The decoder network is thus incentivized to produce realistic depths.
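A simplified version of this depth-to-normal estimation can be sketched with finite differences and a cross product. Note the simplification: the paper fits normals against all 8 neighbour gradients, whereas this sketch uses only the two axis-aligned tangent directions:

```python
import numpy as np

def normals_from_depth(P):
    """Estimate unit normals for a grid of back-projected surfel
    positions P (H, W, 3) by crossing the two in-image tangent
    directions, then flipping so n_z > 0 (the camera-facing
    half-space). Simplified sketch: central differences instead of
    the paper's 8-neighbour least-squares fit.
    """
    dPdx = np.gradient(P, axis=1)          # tangent along image x
    dPdy = np.gradient(P, axis=0)          # tangent along image y
    n = np.cross(dPdx, dPdy)
    n /= np.linalg.norm(n, axis=-1, keepdims=True)
    flip = n[..., 2:3] < 0                 # enforce visibility constraint
    return np.where(flip, -n, n)

# A fronto-parallel plane at depth 2: every normal should be (0, 0, 1).
ys, xs = np.mgrid[0:8, 0:8].astype(float)
P = np.stack([xs, ys, np.full_like(xs, 2.0)], axis=-1)
N = normals_from_depth(P)
```

Because every step is differentiable (away from degenerate configurations), a shading error on N propagates back into the depth that produced P, which is exactly the training signal described above.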
Unsupervised Training
The Wasserstein GAN (Arjovsky et al. 2017) formulation provides stable training dynamics using the first Wasserstein distance between distributions. We adopt the gradient penalty setup proposed in Gulrajani et al. (2017) for more robust training. However, we modify the formulation to take the bidirectional training into account.
The architectures of our networks and the training hyper-parameters are explained in detail in Appendix Sect. C. Briefly, we used Conditional Normalization (Dumoulin et al. 2016; Perez et al. 2017) for conditioning the viewpoint (or camera pose) in the encoder, decoder and discriminator networks. The viewpoint is a three-dimensional vector representing the positional coordinates of the camera. In our training, the affine parameters of the batch-normalization layers (Ioffe and Szegedy 2015) are replaced by learned representations based on the viewpoint. The final objective includes a bidirectional reconstruction loss:
$$\begin{aligned} \mathcal {L}_{recon} = \mathbb {E}_{q(\varvec{x})}\left[ \Vert \varvec{x} - \text {rend}(G_x(G_z(\varvec{x})))\Vert \right] + \mathbb {E}_{p(\varvec{z})}\left[ \Vert \varvec{z} - G_z(\text {rend}(G_x(\varvec{z})))\Vert \right] \end{aligned}$$

where the rend\((\cdot )\) function synthesizes images through view-dependent decoding and projection and \(\varvec{z}\) is \((\varvec{z}_{scene},\varvec{z}_{view})\). This objective enforces the reconstructions from the model to stay close to the corresponding inputs. This reconstruction loss is used for the encoder and decoder networks, as it has been empirically shown to improve reconstructions in ALI-type models (Li et al. 2017).
Semi-Supervised Training for Classification
Our model can also be trained in a semi-supervised setting (see Algorithm 1) for solving image classification tasks that require 3D understanding, such as the 3D-IQTT (see Fig. 2). The idea is to use labeled examples to streamline the learned latent representations in order to solve the task. In this case, we do not assume that we know the camera position for the unlabeled training samples. As mentioned earlier, part of the latent vector \(\varvec{z}\) encodes the actual 3D object (denoted \(\varvec{z}_{scene}\)) and the remainder estimates the camera pose (denoted \(\varvec{z}_{view}\)). For the supervised samples, we use two additional loss terms: (a) a loss that enforces the object component (\(\varvec{z}_{scene}\)) to be the same for both the reference object and the correct answer, and (b) a loss that maximizes the distance between the reference object and the distractors. This loss is expressed as:
$$\begin{aligned} D_{\theta }(\varvec{x}_{ref}, \varvec{x}_{ans}) - \sum _{i} D_{\theta }(\varvec{x}_{ref}, \varvec{d}_i) \end{aligned}$$

where \(\varvec{x}_{ref}\) is the reference image, \(\varvec{x}_{ans}\) is the correct answer, \(d_i\) denotes the distractors, \(D_{\theta }(\varvec{x}_1, \varvec{x}_2) = \Vert \varvec{z}_{scene}^{\varvec{x}_1} - \varvec{z}_{scene}^{\varvec{x}_2}\Vert _2^2\) and \(\varvec{z}^{\varvec{x}}=Encoder_{\theta }(\varvec{x})\).
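The two supervised terms can be sketched as follows. The hinge margin on the distractor term is our own stabilizing assumption for the example (unbounded maximization of a distance would diverge); it is not a detail taken from the paper:

```python
import numpy as np

def iqtt_supervised_loss(z_ref, z_ans, z_distractors, margin=1.0):
    """Sketch of the supervised terms on z_scene: pull the correct
    answer's encoding toward the reference, push distractor encodings
    away. `margin` is an illustrative hinge, not from the paper.
    """
    d = lambda a, b: np.sum((a - b) ** 2)   # squared L2 on latents
    pull = d(z_ref, z_ans)                  # term (a): attract answer
    push = sum(max(0.0, margin - d(z_ref, zd))   # term (b): repel
               for zd in z_distractors)          # distractors
    return pull + push

# Identical answer encoding and far-away distractors give zero loss.
z_ref = np.array([1.0, 0.0])
z_ans = np.array([1.0, 0.0])
z_distractors = [np.array([5.0, 5.0]), np.array([-4.0, 2.0])]
loss = iqtt_supervised_loss(z_ref, z_ans, z_distractors)
```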
During training, we also minimize the mutual information between \(\varvec{z}_{scene}\) and \(\varvec{z}_{view}\) to explicitly disentangle the two. This is implemented via MINE (Belghazi et al. 2018). The strategy of MINE is to parameterize a variational formulation of the mutual information in terms of a neural network:

$$\begin{aligned} I(X; Z) \ge I_{\Theta }(X; Z) = \sup _{\theta \in \Theta } \; \mathbb {E}_{P_{XZ}}\left[ T_{\theta }\right] - \log \left( \mathbb {E}_{P_{X} \otimes P_{Z}}\left[ e^{T_{\theta }}\right] \right) \end{aligned}$$
This objective is optimized in an adversarial paradigm where T, the statistics network, plays the role of the critic and is fed with samples from the joint and marginal distributions. We use this loss to minimize the mutual information estimate in both unsupervised and supervised training iterations. Once the model is trained, we answer 3D-IQTT questions by inferring the latent 3D representation for each of the four images and selecting the answer closest to the reference image, as measured by \(L_2\) distance on the latent representations.
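Answer selection at test time thus reduces to a nearest-neighbour query in latent space, which can be sketched as:

```python
import numpy as np

def answer_question(z_ref, z_candidates):
    """Pick the candidate whose latent encoding is closest (L2) to
    the reference image's encoding."""
    dists = [np.linalg.norm(z_ref - z) for z in z_candidates]
    return int(np.argmin(dists))

# Toy encodings: candidate 1 is nearly identical to the reference.
z_ref = np.array([0.2, -1.0, 0.5])
candidates = [np.array([3.0, 0.0, 0.0]),
              np.array([0.25, -0.9, 0.45]),
              np.array([-2.0, 1.0, 2.0])]
```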
Experimental Setup
We evaluate Pix2Shape on three different tasks: scene reconstruction, scene generation, and the 3D-IQTT.
Scene Reconstruction
The goal of this task is to produce a 2.5D representation (depth and normals) from a given input image. Moreover, we also evaluate if the model can extrapolate to unobserved views of the scene.
For this task we created two datasets of scene images, each composed of a room containing one or more objects placed at random positions and orientations. The Shape scenes dataset is created from rendered images of multiple basic 3D shapes (e.g., box, sphere, cone, torus, teapot) placed inside a room. The ShapeNet scenes dataset is constructed from renderings of multiple objects of different categories from the ShapeNet dataset (Chang et al. 2015) (e.g., bowls, bottles, mugs, lamps, bags).
Each 3D scene is rendered into a single \(128\times 128\times 3\) image taken from a camera in a random position sampled uniformly on the positive octant of a sphere containing the room. The probability of seeing the same configuration of a scene from two different views is near zero.
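Sampling a camera position uniformly on the positive octant can be done by folding a uniform spherical sample into the all-positive octant (uniformity is preserved by the symmetry of the fold). This sketch is our own illustration of the sampling scheme described above; the radius value is arbitrary:

```python
import numpy as np

def sample_camera(radius, rng):
    """Sample a camera position uniformly on the positive octant of a
    sphere of the given radius. A normalized Gaussian vector is uniform
    on the sphere; taking absolute values folds it into the octant
    where all coordinates are non-negative.
    """
    v = rng.normal(size=3)
    v = np.abs(v) / np.linalg.norm(v)   # unit vector, all components >= 0
    return radius * v

rng = np.random.default_rng(0)
cam = sample_camera(3.0, rng)
```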
We evaluate the performance of scene reconstruction using three different metrics: (1) Chamfer distance, (2) Hausdorff distance (Hausdorff 1949) (on surfels’ position), and (3) Mean Squared Error (MSE).
Chamfer distance (CD) gives the average distance from each point in a set to the closest point in the other set. For any two point sets \(A, B \subset \mathbb {R}^3\), the Chamfer distance is measured as:

$$\begin{aligned} CD(A, B) = \frac{1}{|A|} \sum _{a \in A} \min _{b \in B} \Vert a - b\Vert _2 + \frac{1}{|B|} \sum _{b \in B} \min _{a \in A} \Vert b - a\Vert _2 \end{aligned}$$
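A brute-force implementation of this metric is straightforward, e.g.:

```python
import numpy as np

def chamfer_distance(A, B):
    """Symmetrized average nearest-neighbour distance between two
    point sets A (n, 3) and B (m, 3). O(n*m) brute force for clarity;
    large sets would use a KD-tree instead.
    """
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # (n, m)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

A = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
B = A + np.array([0.0, 0.0, 1.0])   # A shifted by 1 along z
```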
Hausdorff distance (HD) measures the correspondence of the model’s 3D reconstruction with the input for a given camera pose. Given two point sets, A and B, the Hausdorff distance is

$$\begin{aligned} D_H(A, B) = \max \left( D_H^+(A, B), \; D_H^+(B, A) \right) \end{aligned}$$
where \(D_H^+\) is the directed (asymmetric) Hausdorff distance between two point sets: \(D_H^+(A, B) = \max _{a \in A} D(a, B)\), i.e., the largest Euclidean distance \(D(\cdot )\) from a point in A to the set B, with a symmetric definition for the reverse direction \(D_H^+(B, A)\). In both evaluations, we compare our reconstructed view-centric surfels (3D positions and normals) to the ground-truth surfels.
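The symmetric Hausdorff distance can likewise be computed by brute force, e.g.:

```python
import numpy as np

def hausdorff_distance(A, B):
    """Symmetric Hausdorff distance between point sets A (n, 3) and
    B (m, 3): the larger of the two directed distances
    D_H+(A, B) = max_{a in A} min_{b in B} ||a - b||.
    """
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # (n, m)
    return max(d.min(axis=1).max(), d.min(axis=0).max())

A = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
B = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
```

Unlike the Chamfer distance, which averages nearest-neighbour distances, the Hausdorff distance reports the single worst-case mismatch, so it is more sensitive to outlier surfels.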
Scene Generation
In the second task we showcase the generative ability of our model by using the generator to sample class-conditioned shapes from the ShapeNet dataset. We evaluate the 3D scene generation task qualitatively.
3D-IQTT
In the final task we evaluate the 3D understanding capability of the model on the 3D-IQTT: a spatial-reasoning-based semi-supervised classification task. The goal of the 3D-IQTT is to quantify our model’s ability to perform a 3D spatial reasoning test by using large amounts of unlabeled training data and a small set of labeled examples.
For this task, we generated a dataset where each IQ question consists of a reference image of a Tetris-like shape, as well as three other images, one of which is a randomly rotated version of the reference (see Fig. 2 for an example). The training set consists of 100k questions, of which only a few are labeled with the correct answer (either \(1\%\) (1k) or \(0.2\%\) (200) of the total training data). The validation and test sets each contain 100k labeled questions. Earlier literature related to the 3D-IQTT is reviewed in Appendix Sect. I. We evaluate the 3D-IQTT by the percentage of questions answered correctly.
More details on experimental setup and evaluation can be found in Appendix Sects. E and G.
Experiments
Scene Reconstruction
Figure 4 shows the input shape scenes data and its corresponding shading reconstructions, along with the recovered depth and normal maps. The depth map is encoded such that the darkest points are closest to the camera. The normal map colors correspond to the cardinal directions (red/green/blue for the x/y/z axis respectively). Table 1 shows a quantitative evaluation of the Chamfer and Hausdorff distances on the Shape scenes and ShapeNet scenes datasets from a given observed view. The table also reports the mean squared error (MSE) of the generated depth map with respect to the input depth map. The shading reconstructions are almost perfect on this simple dataset. Our model successfully learns the depth of the scenes and thereby the relative positions of the surfels. It also consistently estimates the normal maps from the depth. However, the absolute distance is not always recovered perfectly.
Figure 5 shows the reconstructions from the model on challenging ShapeNet scenes where the number of objects as well as their shape varies. Note how our model is able to handle geometry of varying complexity. Figure 6 shows reconstructions on \(256 \times 256\) resolution scenes (right) constructed from more difficult thin-edged chairs and tables from the ShapeNet dataset in random configurations.
To showcase that our model can reconstruct unobserved views, we first infer the latent code \(\varvec{z}_{scene}\) of an image \(\varvec{x}\) and then decode and render different views while rotating the camera around the scene. Table 2 shows the Chamfer and Hausdorff distances and the MSE loss of reconstructing a scene from different unobserved view angles. As the view angle increases from \(0^{\circ }\) (original) to \(80^{\circ }\) for shape scenes, the reconstruction error and MSE tend to increase. However, for the ShapeNet scenes the trend is not as clear because of the complexity of the scene and inter-object occlusions. We compare our method with the PTN baseline (Yan et al. 2016). Note that PTN reconstructs the 3D object in voxels explicitly, whereas we output a 2.5D representation. Therefore, for a fair comparison we rotate and render a per-pixel depth map from a desired view and compute the Chamfer distance with respect to the ground truth projection for that view. Figure 7 qualitatively shows how Pix2Shape correctly infers the scene parts not in view, demonstrating true 3D understanding.
In all our datasets and further experiments, we use diffuse materials with uniform reflectance. The reflectance values are chosen arbitrarily and we use the same material properties for both the input and the generator side. However, our differentiable rendering setup also supports the Phong illumination model. For instance, Fig. 8 shows the input shape scenes data with specular reflection and its corresponding shading reconstructions, along with the recovered depth.
Scene Generation
We trained Pix2Shape on scenes composed of a single ShapeNet object in a room. The model was trained conditionally by giving the class label of the ShapeNet object present in the scene to the decoder and critic networks (Mirza and Osindero 2014). Figure 9 shows the results of conditioning the decoder on different target classes. Our model was able to generate accurate 3D models for the target class. We can also train the model in an unconditional fashion without giving any object category information (see Appendix F for more details and results).
In order to explore the manifold of the learned representations, we selected two images \(\varvec{x_1}\) and \(\varvec{x_2}\) from the held out data. We then linearly interpolated between their encodings \(\varvec{z}_{\varvec{1}scene}\) and \(\varvec{z}_{\varvec{2}scene}\) and decoded the intermediary points into their corresponding images using a fixed camera pose. Figure 10 shows this for two different settings.
3D-IQ Test Task
We trained our model on the 3D-IQTT using the semi-supervised training described in Sect. 3.3.4. We compared our model to the baselines listed below and to human performance.
Human We created an online test where 40 random graduate students from our lab answered 20 randomly selected questions from the test set (similar to Fig. 2). No student had seen these images before. More details can be found in Appendix J.
CNN The first baseline is composed of four ResNet-50 modules (He et al. 2016) with shared weights, followed by three fully-connected layers and a softmax output for the class label (answer 1 to 3). We trained this CNN only on the labeled samples. The architecture is depicted in the appendix, Fig. 12.
Siamese Network Our second baseline is a Siamese CNN with an architecture similar to the previous one, but with the fully-connected layers removed. Instead of the supervised loss provided in the form of correct answers, it was trained with a contrastive loss (Koch et al. 2015). This loss reduces the feature distance between the reference and the correct answer and maximizes the feature distance between the reference and the incorrect answers.
Perspective Transformer Nets As our third baseline, we used the open source implementation of the Perspective Transformer Nets (Yan et al. 2016) to solve the IQTT task using the learnt latent code.
Rezende et al. (2016) Since there is no open-source code available for this work, we implemented our own interpretation of it. We were able to reproduce the results from their paper (see Appendix K) before attempting to use it as a baseline for our model.
A more detailed description of the networks and the contrastive loss function can be found in Appendix D.
Table 3 shows 3D-IQTT results for our method and the baselines. The CNN-based baselines were not able to infer the underlying 3D structure of the data, and their results are only slightly better than random guessing. The poor performance of the Siamese CNN might be due in part to the contrastive loss rewarding similarities in pixel space, with no notion of 3D similarity. Pix2Shape, in contrast, achieved significantly better accuracy by leveraging its learned 3D knowledge of objects. Our method also outperformed the other two baseline approaches, though by a smaller margin.
Analyzing the Loss Functions
In this section, we perform an ablation study of the different loss functions used to train our model. Our final objective is a combination of a bidirectional adversarial loss \(\mathcal {L}_{ALI}\) and a reconstruction loss \(\mathcal {L}_{recon}\). For the 3D-IQTT task we augmented these losses with a mutual-information-based objective \(I_{\Theta }(z_{s},z_{v})\) to ensure that different parts of the latent code encode distinct pieces of information about a scene. This allows us to disentangle viewpoint and geometry. Table 4 shows our results for both the reconstruction task on the ShapeNet scenes dataset and the 3D-IQTT task when considering: (i) only the adversarial loss \(\mathcal {L}_{ALI}\); (ii) only the reconstruction loss \(\mathcal {L}_{recon}\); (iii) both \(\mathcal {L}_{ALI}\) and \(\mathcal {L}_{recon}\) but no mutual-information term (note that this term does not affect the reconstruction task); and (iv) all three (\(\mathcal {L}_{ALI}\), \(\mathcal {L}_{recon}\) and \(I_{\Theta }(z_{s},z_{v})\)).
We observe that each loss term improves the performance of the model on both tasks. The adversarial loss alone is not enough to faithfully reconstruct the surfels. On the other hand, the reconstruction loss alone hurts the model's ability to extrapolate shape to unseen views (e.g., view angles \(35^{\circ }\) to \(80^{\circ }\)), although it yields better performance when reconstructing from the given input viewpoint, i.e., \(0^{\circ }\). We also notice that the reconstruction loss alone degrades the quality of the generated samples; the adversarial loss \(\mathcal {L}_{ALI}\) plays the major role in obtaining detailed, high-quality samples. For the 3D-IQTT task, the role of \(\mathcal {L}_{ALI}\) is even more evident: it encourages the latent code to learn meaningful representations by constraining the model to match the joint distributions. The results also clearly indicate that omitting the mutual-information loss degrades performance on the 3D-IQTT task. This is expected, since view information then mixes with geometric information in the latent representation.
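As a hedged sketch, the three terms of the ablation could be combined into a single training objective as below; the weights `lam_recon` and `lam_mi` are illustrative placeholders, not values from the paper:

```python
def total_loss(l_ali, l_recon, mi_estimate, lam_recon=1.0, lam_mi=0.1):
    """Combine the three training terms studied in the ablation.

    l_ali:       bidirectional adversarial (ALI) loss
    l_recon:     surfel/image reconstruction loss
    mi_estimate: estimated mutual information I(z_scene, z_view),
                 minimized to disentangle geometry from viewpoint
    """
    return l_ali + lam_recon * l_recon + lam_mi * mi_estimate
```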
Conclusion
In this paper we propose a generative approach to learning 3D structural properties from single-view images in an unsupervised and implicit fashion. Our model receives an image of a scene with uniform material as input, estimates the depth of the scene points, and then reconstructs the input image through a differentiable renderer. We also provide quantitative evidence that supports our claims by introducing a novel IQ Test Task in a semi-supervised setup. We hope that this evaluation metric will be adopted as a standard benchmark for measuring the 3D understanding capability of models across different 3D representations. The main drawback of our current model is that it requires knowledge of lighting and material properties. Future work will focus on the more ambitious setting of learning complex materials and textures, along with modelling the lighting properties of the scene.
All code for this project is available at https://github.com/rajeswar18/pix2shape. The code we developed in order to reproduce the Rezende et al. (2016) baseline is available at https://github.com/fgolemo/threedeetools.
References
Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International conference on machine learning (ICML).
Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Hjelm, D., & Courville, A. (2018). Mutual information neural estimation. In International conference on machine learning (ICML).
Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2019). nuScenes: A multimodal dataset for autonomous driving. arXiv:1903.11027.
Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., & Yu, F. (2015). ShapeNet: An information-rich 3D model repository.
Chaudhuri, S., Kalogerakis, E., Guibas, L., & Koltun, V. (2011). Probabilistic reasoning for assembly-based 3D modeling. In ACM SIGGRAPH.
Chen, W., Gao, J., Ling, H., Smith, E. J., Lehtinen, J., Jacobson, A., & Fidler, S. (2019). Learning to predict 3D objects with an interpolation-based differentiable renderer. CoRR abs/1908.01210.
Choy, C., Xu, D., Gwak, J., Chen, K., & Savarese, S. (2016). 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction.
Donahue, J., Krähenbühl, P., & Darrell, T. (2016). Adversarial feature learning. arXiv:1605.09782.
Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., & Courville, A. (2016). Adversarially learned inference. arXiv:1606.00704.
Gadelha, M., Maji, S., & Wang, R. (2016). 3D shape induction from 2D views of multiple objects. CoRR abs/1612.05872.
Girdhar, R., Fouhey, D., Rodriguez, M., & Gupta, A. (2016). Learning a predictable and generative vector representation for objects. In European conference on computer vision (ECCV).
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (NIPS).
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in neural information processing systems (NIPS).
Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In Computer vision and pattern recognition (CVPR).
Hausdorff, F. (1949). Grundzüge der Mengenlehre. New York: Chelsea Pub. Co.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Computer vision and pattern recognition (CVPR).
Henderson, P., & Ferrari, V. (2018). Learning to generate and reconstruct 3D meshes with only 2D supervision. CoRR abs/1807.09259.
Huang, J., Zhou, Y., Funkhouser, T. A., & Guibas, L. J. (2019). Framenet: Learning local canonical frames of 3D surfaces from a single RGB image. CoRR abs/1903.12305.
Insafutdinov, E., & Dosovitskiy, A. (2018). Unsupervised learning of shape and pose with differentiable point clouds. CoRR abs/1810.09381.
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML).
Jiang, C. M., Wang, D., Huang, J., Marcus, P., & Nießner, M. (2019). Convolutional neural networks on nonuniform geometrical signals using euclidean spectral transformation. CoRR abs/1901.02070.
Kajiya, J. T. (1986). The rendering equation. In Annual conference on computer graphics and interactive techniques (SIGGRAPH).
Kalogerakis, E., Chaudhuri, S., Koller, D., & Koltun, V. (2012). A probabilistic model for component-based shape synthesis. ACM Transactions on Graphics, 31(4), 55:1–55:11.
Kanazawa, A., Tulsiani, S., Efros, A. A., & Malik, J. (2018). Learning category-specific mesh reconstruction from image collections. In European conference on computer vision (ECCV).
Kar, A., Tulsiani, S., Carreira, J., & Malik, J. (2014). Category-specific object reconstruction from a single image. CoRR abs/1411.6069.
Kato, H., & Harada, T. (2018). Learning view priors for singleview 3D reconstruction. CoRR abs/1811.10719.
Kato, H., Ushiku, Y., & Harada, T. (2017). Neural 3D mesh renderer. CoRR abs/1711.07566.
Kobbelt, L., & Botsch, M. (2004). A survey of pointbased techniques in computer graphics. Computers and Graphics, 28(6), 801–814.
Koch, G., Zemel, R., & Salakhutdinov, R. (2015). Siamese neural networks for oneshot image recognition. In ICML Deep Learning Workshop.
Kulkarni, T. D., Whitney, W., Kohli, P., & Tenenbaum, J. B. (2015). Deep convolutional inverse graphics network. In Advances in neural information processing systems (NIPS).
Li, C., Liu, H., Chen, C., Pu, Y., Chen, L., Henao, R., & Carin, L. (2017). ALICE: Towards understanding adversarial learning for joint distribution matching. In Advances in neural information processing systems (NIPS).
Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., & Freeman, W. T. (2019). Learning the depths of moving people by watching frozen people. CoRR abs/1904.11111.
Liu, S., Chen, W., Li, T., & Li, H. (2019). Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction. CoRR abs/1901.05567.
Loper, M. M., & Black, M. J. (2014). OpenDR: An approximate differentiable renderer. In European conference on computer vision (ECCV).
Mikolov, T., Deoras, A., Kombrink, S., Burget, L., & Cernockỳ, J. (2011). Empirical evaluation and combination of advanced language modeling techniques. In INTERSPEECH.
Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets. arXiv:1411.1784.
Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., & Yang, Y. L. (2019). HoloGAN: Unsupervised learning of 3D representations from natural images.
Niu, C., Li, J., & Xu, K. (2018). Im2Struct: Recovering 3D shape structure from a single RGB image. In Computer vision and pattern recognition (CVPR).
Novotný, D., Larlus, D., & Vedaldi, A. (2017). Learning 3D object categories by looking around them. CoRR abs/1705.03951.
Novotný, D., Ravi, N., Graham, B., Neverova, N., & Vedaldi, A. (2019). C3dpo: Canonical 3D pose networks for nonrigid structure from motion. arXiv:1909.02533.
Perez, E., Strub, F., De Vries, H., Dumoulin, V., & Courville, A. (2017). FiLM: Visual reasoning with a general conditioning layer. arXiv:1709.07871.
Pfister, H., Zwicker, M., van Baar, J., & Gross, M. (2000). Surfels: Surface elements as rendering primitives. In Annual conference on computer graphics and interactive techniques.
Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. In International conference on learning representations (ICLR).
Rezende, D. J., Eslami, S. M. A., Mohamed, S., Battaglia, P., Jaderberg, M., & Heess, N. (2016). Unsupervised learning of 3D structure from images. In Advances in neural information processing systems (NIPS).
Saxena, A., Sun, M., & Ng, A. Y. (2009). Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 31(5).
Shepard, R. N., & Metzler, J. (1971). Mental rotation of threedimensional objects. Science, 171(3972), 701–703.
Soltani, A. A., Huang, H., Wu, J., Kulkarni, T. D., & Tenenbaum, J. B. (2017). Synthesizing 3D shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In Computer vision and pattern recognition (CVPR).
Taha, A. A., & Hanbury, A. (2015). An efficient algorithm for calculating the exact Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).
Tulsiani, S., Su, H., Guibas, L. J., Efros, A. A., & Malik, J. (2016). Learning shape abstractions by assembling volumetric primitives. CoRR abs/1612.00404.
Tulsiani, S., Zhou, T., Efros, A. A., & Malik, J. (2017). Multi-view supervision for single-view reconstruction via differentiable ray consistency. CoRR abs/1704.06254.
Wiles, O., & Zisserman, A. (2017). SilNet: Single- and multi-view reconstruction by learning from silhouettes. CoRR abs/1711.07888.
Woodcock, R., Mather, N., & McGrew, K. (2001). Woodcock-Johnson III: Tests of cognitive abilities. Riverside Pub.
Wu, J., Xue, T., Lim, J., Tian, Y., Tenenbaum, J., Torralba, A., & Freeman, W. (2016a). Single image 3D interpreter network.
Wu, J., Zhang, C., Xue, T., Freeman, W. T., & Tenenbaum, J. B. (2016b). Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in neural information processing systems (NIPS).
Wu, J., Wang, Y., Xue, T., Sun, X., Freeman, W. T., & Tenenbaum, J. B. (2017). Marrnet: 3D shape reconstruction via 2.5D sketches. CoRR abs/1711.03129.
Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., & Xiao, J. (2015). 3D ShapeNets: A deep representation for volumetric shapes. In Computer vision and pattern recognition (CVPR).
Yan, X., Yang, J., Yumer, E., Guo, Y., & Lee, H. (2016). Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In Advances in neural information processing systems (NIPS).
Zhang, X., Zhang, Z., Zhang, C., Tenenbaum, J., Freeman, B., & Wu, J. (2018). Learning to reconstruct shapes from unseen classes. In Advances in neural information processing systems (NIPS).
Zhu, J. Y., Zhang, Z., Zhang, C., Wu, J., Torralba, A., Tenenbaum, J. B., & Freeman, W. T. (2018). Visual object networks: Image generation with disentangled 3D representations. In Advances in neural information processing systems (NeurIPS).
Communicated by JunYan Zhu, Hongsheng Li, Eli Shechtman, MingYu Liu, Jan Kautz, Antonio Torralba.
Appendices
Rendering Details
The color of a surfel depends on the material reflectance, its position and orientation, as well as the ambient and point-light source colors (see Fig. 11b). Given a surface point \(P_i\) with normal \(\varvec{n}_i\), the color of its corresponding pixel \(I_{rc}\) is given by the shading equation:
\(I_{rc} = \rho _i \Big ( L_a + \sum _j \frac{\max (0,\, \varvec{n}_i \cdot \hat{d}_{ij})}{1 + k_l \Vert d_{ij}\Vert + k_q \Vert d_{ij}\Vert ^2}\, L_j \Big )\)  (5)
where \(\rho _i\) is the surface reflectance, \(L_a\) is the ambient light's color, \(L_j\) is the jth positional light source's color, \(d_{ij} = L_{j}^{{\text{ pos }}} - P_i\) is the direction vector from the scene point to the point light source (with \(\hat{d}_{ij}\) its normalized form), and \(k_l\), \(k_q\) are the linear and quadratic attenuation terms, respectively. Equation 5 is an approximation of the rendering equation proposed in Kajiya (1986).
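The shading step can be sketched in a few lines of code; `shade_surfel`, its argument layout, and the attenuation defaults are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def shade_surfel(rho, normal, position, ambient, lights, k_l=0.1, k_q=0.01):
    """Diffuse shading of one surfel: ambient term plus attenuated
    contributions from each point light.

    `lights` is a list of (light_position, light_color) pairs; colors and
    reflectance are RGB arrays in [0, 1]. k_l, k_q are linear and
    quadratic attenuation coefficients (illustrative values).
    """
    color = rho * ambient
    for light_pos, light_color in lights:
        d = light_pos - position                        # vector to the light
        dist = np.linalg.norm(d)
        cos_theta = max(0.0, float(np.dot(normal, d / dist)))
        atten = 1.0 / (1.0 + k_l * dist + k_q * dist ** 2)
        color = color + rho * light_color * cos_theta * atten
    return np.clip(color, 0.0, 1.0)
```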
Architecture
Pix2Shape is composed of an encoder network (see Table 5), a decoder network (see Table 6), and a critic network (see Table 7). Specifically, the decoder architecture is similar to the generator in DCGAN (Radford et al. 2015) but with LeakyReLU (Mikolov et al. 2011) as the activation function and batch normalization (Ioffe and Szegedy 2015). We also adjusted its depth and width to accommodate the higher-resolution images. In order to condition on the camera position along with the z variable, we use conditional normalization in alternate layers of the decoder. We train our model for 60K iterations with a batch size of 6 on images of resolution \(128 \times 128 \times 3\).
Architecture for Semi-supervised Experiments
The Pixel2Surfels architecture remains similar to the previous one, but with higher capacity in the decoder and critic. The most important difference is that for these experiments we do not condition the networks on the camera pose, to be fair to the baselines. In addition to the three networks, we have a statistics network (see Table 8) that estimates and minimizes the mutual information between two sets of dimensions in the latent code using MINE (Belghazi et al. 2018). Of the 128 dimensions of z, the first 118 represent scene-based information and the rest encode view-based information.
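MINE bounds the mutual information from below via the Donsker–Varadhan representation. The following sketch assumes the statistics network's outputs on joint and shuffled (marginal) pairs are already available as arrays; `mine_lower_bound` is a hypothetical helper, not code from the released implementation:

```python
import numpy as np

def mine_lower_bound(t_joint, t_marginal):
    """Donsker-Varadhan lower bound on mutual information used by MINE:
    I(z_scene, z_view) >= E_joint[T] - log E_marginal[exp(T)].

    t_joint:    statistics-network outputs on paired (z_scene, z_view)
    t_marginal: outputs on shuffled, i.e. independent, pairs
    """
    return np.mean(t_joint) - np.log(np.mean(np.exp(t_marginal)))
```

Minimizing this estimate with respect to the encoder discourages the scene and view parts of the code from sharing information.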
The architecture of the baseline networks is shown in Fig. 12. During training we use the contrastive loss (Hadsell et al. 2006):
\(\mathcal {L}(y, \varvec{x}_1, \varvec{x}_2) = (1-y)\,\tfrac{1}{2} D^2 + y\,\tfrac{1}{2} \max (0,\, m - D)^2, \quad D = \Vert G_\theta (\varvec{x}_1) - G_\theta (\varvec{x}_2)\Vert _2\)
where \(\varvec{x}_1\) and \(\varvec{x}_2\) are the input images, y is either 0 (if the inputs are supposed to be the same) or 1 (if they are supposed to be different), \(G_\theta \) is each ResNet block, parameterized by \(\theta \), and m is the margin, which we set to 2.0. We apply the contrastive loss to the 2048 features generated by each ResNet block.
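As a minimal sketch of this loss on two feature vectors (following Hadsell et al. 2006, with the same y convention as above):

```python
import numpy as np

def contrastive_loss(f1, f2, y, m=2.0):
    """Contrastive loss on feature vectors f1, f2.

    y = 0: the pair should match (pull features together);
    y = 1: the pair should differ (push features at least margin m apart).
    """
    d = np.linalg.norm(f1 - f2)
    return (1 - y) * 0.5 * d ** 2 + y * 0.5 * max(0.0, m - d) ** 2
```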
Material, Lights, and Camera Properties
Material In our experiments, we use diffuse materials with uniform reflectance. The reflectance values are chosen arbitrarily, and we use the same material properties on both the input and the generator side. Figure 13 shows that it is possible to learn reflectance alongside the 3D structure of the scenes by requiring the model to predict the material coefficients along with the depth of the surfels. The color of the objects depends on both the lighting and the material properties. We do not delve into more detail on this, as our objective is to model the structural/geometrical properties of the world. This will be explored further in a later study.
Camera The camera is specified by its position, viewing direction, and a vector indicating the orientation of the camera. The camera positions were uniformly sampled on a sphere for the 3D-IQTT task, and on a spherical patch contained in the positive octant for the rest of the experiments. The viewing direction was updated based on the camera position and the center of mass of the objects, so that the camera was always looking at a fixed point in the scene as its position changed. The focal length ranged between 18 mm and 25 mm in all experiments, and the field of view was fixed to 24 mm. The camera properties were also shared between the input and the generator side. However, in the 3D-IQTT task we relaxed the assumption that the camera pose is known, and instead estimate the view as part of the learnt latent representation.
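The camera placement described above can be sketched as follows; `sample_camera` and its default radius are illustrative assumptions, not the paper's parameters:

```python
import numpy as np

def sample_camera(radius=2.5, target=None, rng=None):
    """Sample a camera position uniformly on a sphere around `target`
    and aim the viewing direction at that fixed point (look-at)."""
    target = np.zeros(3) if target is None else target
    rng = np.random.default_rng() if rng is None else rng
    v = rng.normal(size=3)                       # isotropic Gaussian ...
    position = target + radius * v / np.linalg.norm(v)  # ... normalized: uniform on sphere
    view_dir = target - position
    view_dir = view_dir / np.linalg.norm(view_dir)
    return position, view_dir
```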
Lights For the light sources, we experimented with single and multiple point-light sources, with the light colors chosen randomly. The light positions are uniformly sampled on a sphere for the 3D-IQTT tasks, and uniformly on a spherical patch covering the positive octant for the other scenes. The same light colors and positions are used for rendering both the input and the generated images. The lights acted as physical spot lights, with the radiant energy attenuating quadratically with distance. As an ablation study, we relaxed the assumption of having perfect knowledge of lights by using lights with random positions and random colors. Those experiments show that the light information is not needed by our model to learn the 3D structure of the data. However, as we use random lights on the generator side, the shading of the reconstructions is in a different color than in the input, as shown in Fig. 14.
Unconditional ShapeNet Generation
We trained Pix2Shape on scenes composed of ShapeNet objects from six categories (i.e., bowls, bottles, mugs, cans, caps and bags). Figure 15 shows qualitative results on unconditional generation. Since no class information is provided to the model, the latent variable captures both the object category and its configuration.
Evaluation of 3D Reconstructions
For evaluating 3D reconstructions, we use the Hausdorff distance (Taha and Hanbury 2015) as a measure of similarity between two shapes, as in Niu et al. (2018). Given two point sets A and B, the Hausdorff distance is
\(D_H(A, B) = \max \big ( D_H^+(A, B),\, D_H^+(B, A) \big )\)
where \(D_H^+\) is the asymmetric (directed) Hausdorff distance between two point sets: \(D_H^+(A, B) = \max _{a \in A} D(a, B)\), the largest Euclidean distance \(D(\cdot )\) from a point in A to the set B, with an analogous definition for the reverse case \(D_H^+(B, A)\).
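The symmetric Hausdorff distance between two small point sets can be computed directly from this definition (a brute-force sketch; for large point clouds a KD-tree-based method such as Taha and Hanbury's algorithm is more efficient):

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between point sets A (n, 3) and B (m, 3):
    the larger of the two directed distances."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # pairwise distances
    d_ab = d.min(axis=1).max()  # directed: farthest A-point from set B
    d_ba = d.min(axis=0).max()  # directed: farthest B-point from set A
    return max(d_ab, d_ba)
```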
Ablation Study on Depth Supervision
As an ablation study, we repeated the experiment that demonstrates the view-extrapolation capabilities of our model, this time with depth supervision. Table 9 reports the quantitative evaluation of reconstructions of the scenes from unobserved angles.
3D Intelligence Quotient Task
In their landmark work, Shepard and Metzler (1971) introduced the mental rotation task into the toolkit of cognitive assessment. The authors presented human subjects with reference images and answer images. The subjects had to quickly decide whether the answer was a 3D-rotated version or a mirrored version of the reference. The speed and accuracy with which people solve this mental rotation task has since become a staple of IQ tests such as the Woodcock-Johnson tests (Woodcock et al. 2001). We took this as inspiration to design a quantitative evaluation metric (number of questions answered correctly) as opposed to the default qualitative analyses of generative models. We use the same kind of 3D objects, but instead of confronting our model with pairs of images and only two possible answers, we include several distractor answers, and the subject (human or computer) has to pick which of the three possible answers is the 3D-rotated version of the reference object (see Fig. 2).
Details on Human Evaluations for 3D IQTT
We posted the questionnaire to our lab-wide mailing list, where 41 participants followed the call. The questionnaire had one calibration question for which, if answered incorrectly, we pointed out the correct answer. For all subsequent answers we did not reveal the correct answers, and each participant had to answer all 20 questions to complete the quiz.
We also asked participants for their age range, gender, education, and for comments. While many commented that the questions were hard, nobody gave us a clear reason to discard their response. All participants were at least high school graduates currently pursuing a Bachelor's degree. The majority of submissions \((78\%)\) were male, whereas the others were female or unspecified. Most of our participants \((73.2\%)\) were between 18 and 29 years old, the others between 30 and 39. The resulting test scores are normally distributed according to the Shapiro-Wilk test \((p<0.05)\) and significantly different from random choice according to a one-sample Student's t-test \((p<0.01)\).
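As a sketch of the chance-level comparison above (with three answer choices, chance accuracy is 1/3), the one-sample t statistic can be computed directly; `one_sample_t` is an illustrative helper, not the analysis script used for the paper:

```python
import numpy as np

def one_sample_t(scores, mu0=1/3):
    """One-sample t statistic comparing mean per-participant accuracy
    to chance level mu0 (three answer choices => chance = 1/3)."""
    scores = np.asarray(scores, dtype=float)
    n = scores.size
    # t = (sample mean - hypothesized mean) / standard error of the mean
    return (scores.mean() - mu0) / (scores.std(ddof=1) / np.sqrt(n))
```

The p-value then follows from the t distribution with n - 1 degrees of freedom (e.g. via `scipy.stats.t.sf`).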
Implementation of Rezende et al.
The authors of Rezende et al. (2016) did not publicly release any code, and did not offer any upon request. When implementing our own version, we first attempted to reproduce their results, as depicted in Fig. 16a. We then attempted to use the method for the same qualitative reconstruction of the primitive-in-box dataset shown in Fig. 4. We found that this worked only with a single main object and no background (see Fig. 16b). When including the background, the same method led to degenerate solutions (see Fig. 16c).
Ablation Study of the Weighted Mutual-Info Loss on 3D-IQTT
Consider the semi-supervised objective used in Algorithm 1. In this section we perform an ablation study of 3D-IQTT performance with a modified form of the equation in which the mutual-information loss \(I_{\Theta }(z_{scene}, z_{view})\) is weighted by a coefficient \(\lambda \). The plot in Fig. 17 indicates the importance of the MI term: having a good estimate of the viewpoint and disentangling the view information from the geometry is indeed crucial to performance on the IQ task.
Rajeswar, S., Mannan, F., Golemo, F. et al. Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images Using a ViewBased Representation. Int J Comput Vis 128, 2478–2493 (2020). https://doi.org/10.1007/s11263020013221
Keywords
 Computer vision
 Differentiable rendering
 3D understanding
 Adversarial training