Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images Using a View-Based Representation


We infer and generate three-dimensional (3D) scene information from a single input image and without supervision. This problem is under-explored, with most prior work relying on supervision from, e.g., 3D ground truth, multiple images of a scene, image silhouettes or key-points. We propose Pix2Shape, an approach to solve this problem with four components: (i) an encoder that infers the latent 3D representation from an image, (ii) a decoder that generates an explicit 2.5D surfel-based reconstruction of the scene from the latent code, (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation, and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from a training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, e.g., voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space, and that the decoder can then be applied to this latent representation in order to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on a novel benchmark we developed, called 3D-IQTT, to evaluate models based on their ability to enable 3D spatial reasoning. Qualitative and quantitative evaluations demonstrate Pix2Shape's ability to solve scene reconstruction, generation and understanding tasks.


Introduction

Humans sense, plan and act in a 3D world despite only directly observing 2D projections of their 3D environment. Automatic 3D understanding seeks to recover a realistic underlying 3D structure of a scene using only 2D image projection(s). This long-standing challenge in computer vision has recently admitted learning-based solutions. Many such approaches leverage 3D supervision, such as from images annotated with ground-truth 3D shape information (Girdhar et al. 2016; Wu et al. 2015, 2016b; Choy et al. 2016). Other recent approaches rely on weaker forms of 3D supervision, such as multiple views of the same object (Yan et al. 2016; Tulsiani et al. 2017; Li et al. 2019), 2.5D supervision (Wu et al. 2016a, 2017), key-point (Kar et al. 2014; Novotný et al. 2019) and silhouette annotations (Wiles and Zisserman 2017; Henderson and Ferrari 2018; Chen et al. 2019). Our work treats the problem of unsupervised single image 3D scene understanding. This form of the problem is challenging, as we aim to infer an encoding of 3D structure from only a single image, without any form of 3D ground-truth supervision during training. We do not rely on any 3D scene supervision; however, we do employ camera pose, scene reflectance profiles and outgoing/observed radiance as weak supervision signals.

Fig. 1

Comparison of 3D representations. Voxels and meshes (a and b) are viewpoint-independent representations. These representations require storage space proportional to the required level of detail. Our implicit representation captures the full scene in a fixed-length latent vector, which, given a viewpoint, can be decoded into an explicit viewpoint-dependent “surfels” representation with arbitrary level of detail (c and d)

While the benefits of employing supervision can certainly be argued for in this context—i.e., with the growing number of datasets with labelled 3D ground truth for objects (Chang et al. 2015) and cityscapes (Caesar et al. 2019)—one benefit of approaching the problem from an unsupervised perspective is that we are not limited to the types of 3D objects represented in these datasets. Indeed, however vast, existing datasets fall far from capturing all possible artificial and natural 3D scenes and objects. Moreover, datasets with depth annotations often contain incomplete or noisy depth maps due to limitations in depth capture hardware.

Unsupervised single image 3D understanding is a relatively under-explored area, with only a few works treating this setting (Rezende et al. 2016; Yan et al. 2016). These methods rely on deformable 3D mesh or voxel representations of the world, and have only been applied to simple 3D primitives (e.g., cubes, spheres) or single objects over a clean background.

One approach to this problem is to leverage prior knowledge of how 2D images are formed from the 3D world, including the effects of shading and occlusion. Building machine learning architectures with explicit knowledge of this forward rendering model could help better disambiguate 3D geometry from 2D observations. In this spirit, we propose the Pix2Shape architecture for unsupervised single image 3D understanding: a model that learns abstract latent encodings of the geometry of an entire scene, all from a single image. These implicit learnt scene representations can be decoded, when combined with a target viewing/camera position, into a view-dependent realization of 2.5D surfaces (depth map and surface normals) visible only from that view. We can then readily re-render these explicit view-dependent surface elements (surfels) at their corresponding 2D image projections in order to synthesize an unseen view of the scene.

Fig. 2

Sample questions from 3D-IQTT. For this “mental rotation” task, we present a reference image and three possible answers. The test is a classification task where the goal is to find the rotated view of the model from the reference image. To solve this task, the 3D shape of the reference must be inferred from the 2D image and compared to the inferred 3D shapes of the answers (see footnote for correct answers)

Our model builds atop Adversarially Learned Inference (ALI) (Dumoulin et al. 2016), an extension of Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) that infers a latent code from an image using an encoder network. In Pix2Shape, the encoder network learns a latent representation that embeds the 3D information of an entire scene from an image. We map the latent representation to view-dependent depth and normal maps using a decoder before projecting these maps onto image space using a differentiable renderer. We evaluate the resulting image using an adversarial critic. Our model remains unsupervised as it does not require ground-truth depth maps nor any other kind of 3D supervision used in previous works (e.g., observing the same object from multiple views, key-point registration or image silhouettes). Note that, at any given instant, our model outputs the depth and surface normals conditioned on a specific camera view; we never produce/synthesize the entirety of the 3D world structure. That being said, the latent space we learn embeds the 3D geometry of the entire underlying scene, which allows our decoder and renderer to smoothly extrapolate and synthesize scene geometry from unseen camera views during inference. We refer to this indirect process of embedding 3D information in the latent code as "implicit" inference.

An ambitious long-term goal is to infer the 3D structure of photographs of the real world, and our work takes a first step in this direction: we rely on physically based rendering in order to build a model of the world. However, to make training tractable, we experiment exclusively with synthetically constructed scenes, adopting several simplifying assumptions. Of note, we assume that the world is composed of piece-wise smooth 3D elements and that, for each input image, the illumination, view and object materials are known. Since each pixel in an image is a function of geometry, illumination, view and texture, our focus in this work is to learn the underlying geometry of a scene while keeping the other parameters fixed.

We evaluate our model’s ability to recover accurate and consistent depth from a single image, for both seen and unseen viewpoints, using Hausdorff and Chamfer distance metrics between generated and ground truth depth maps. In addition to reconstruction, we can sample novel scenes (at novel views) using the generative nature of our adversarial network. Finally, we propose a new 3D understanding benchmark—3D IQ Test Task (3D-IQTT)—to evaluate models’ understanding of the underlying 3D structure of an object: the test consists of matching a rotated view of a reference object (Fig. 2). To perform this task, we develop a novel 3D-IQ dataset to train and test against. In this setting, we can additionally estimate camera pose in our learnt latent 3D world embedding. Our contributions are as follows:

  1. An approach for unsupervised single image 3D understanding that builds a latent embedding of an entire 3D scene,

  2. A decoding scheme that leverages view-dependent, explicit surfel representations to sample scene information more efficiently than (world-space) voxels and meshes,

  3. A differentiable 3D renderer that can be included as a layer in any learning-based neural network architecture, and

  4. 3D-IQTT, a new 3D understanding benchmark.

Related Work

Single View 3D Reconstruction and Generation

3D generation and reconstruction has been studied extensively in the computer vision and graphics communities (Saxena et al. 2009; Chaudhuri et al. 2011; Kalogerakis et al. 2012; Chang et al. 2015; Rezende et al. 2016; Soltani et al. 2017; Kulkarni et al. 2015; Tulsiani et al. 2016; Huang et al. 2019; Jiang et al. 2019). Most methods in the literature focus on recovering the 3D structure from 2D images by using explicit 3D supervision. Choy et al. (2016),  Girdhar et al. (2016),  Wu et al. (2015, 2016b)  and  Zhu et al. (2018) reconstruct and/or generate 3D voxels from a latent representation by directly comparing with available 3D shapes. Wu et al. (2017) and Zhang et al. (2018) use 2.5D supervision during training, i.e., depth maps. More recent methods tend to use weaker forms of supervision for single image reconstruction. Wu et al. (2016a), Kato and Harada (2018), Henderson and Ferrari (2018) and Chen et al. (2019) use image based annotations like silhouettes, 2D keypoints or object masks. Kanazawa et al. (2018) learn both texture and shape from 2D images leveraging multiple learning signals such as keypoints and mean shape.

Rezende et al. (2016), Yan et al. (2016), Gadelha et al. (2016) and Novotný et al. (2017) learn 3D shapes by using multiple views and approximately differentiable rendering mechanisms. However, one of Rezende et al. (2016)'s experiments shows reconstruction of 3D objects trained using a single view. As far as we know, theirs is the only fully unsupervised method for explicit 3D reconstruction from a single image. Their method is limited to reconstructing relatively simple 3D primitives floating in space due to the strong priors required for the model to work. Concurrent to our work, HoloGAN (Nguyen-Phuoc et al. 2019) can synthesize 2D images of more realistic scenes (e.g., cars, bedrooms) under camera view rotation. However, their model cannot recover the geometry from its implicit representation. Compared to Rezende et al. (2016), our model can learn to represent more complex synthetic indoor scenes composed of multiple ShapeNet (Chang et al. 2015) objects and, while we do not address real image inputs (as HoloGAN does), we can infer explicit geometry for visible surfaces from each given view. As such, our model can also be applied to 3D reconstruction (like Rezende et al. (2016), but only for visible parts of the scene) and novel-viewpoint image generation (like Nguyen-Phuoc et al. (2019)).

Differentiable Rendering

To allow deep neural network based models to infer 3D structure from its 2D projections (images), we must compute and propagate derivatives of image pixels with respect to 3D geometry and other scene properties. Gradient estimation through the rendering process is challenging: in both rasterization and ray-tracing techniques, the visibility mapping step is often non-differentiable. Loper and Black (2014) is one of the best-known methods for differentiable rendering, but has limited applicability due to high computational and memory costs. Kato et al. (2017) and Rezende et al. (2016) approximate the gradients of the rendering process and are often limited to a rasterization-based rendering scheme. OpenDR (Loper and Black 2014), as used by Henderson and Ferrari (2018), applies a first-order Taylor approximation to compute gradients. Liu et al. (2019) compute the gradients analytically by softly assigning the contribution of each triangle face to a pixel in mesh-based representations. Chen et al. (2019) improved this soft assignment and allow the use of textures by interpolating local mesh properties for foreground pixels. Insafutdinov and Dosovitskiy (2018) proposed a differentiable re-projection mechanism for point clouds to infer 3D shapes. However, learning methods built on these approaches so far require either more than one view per object or 2D silhouettes as supervision, and can only reconstruct single objects. In our work we circumvent the non-differentiability challenge as follows: (1) our network is trained to output only the "visible" surface elements (surfels) of the scene conditioned on the view, i.e., a 2.5D representation, and (2) we maintain a one-to-one correspondence between the output surfels and the pixels. In other words, our model outputs exactly one surfel in object space per pixel in the output image, and the final image is then formed by a differentiable shading operation.
This makes our model differentiable, easily adaptable across image resolutions, and allows end-to-end training.


Our method follows the ALI architecture (Dumoulin et al. 2016), where we have an encoder branch that learns to convert images into latent representations, a decoder branch that learns to generate images from randomly sampled latent representations, and a critic that tries to predict if pairs of latent code and image are real or fake. The critic and encoder pathways are implemented as convolutional neural networks but the decoder pathway contains an additional differentiable renderer, usable like a layer of a neural network, that converts the 2.5D surfel representation into a 2D image by computing shading at each surfel. Additionally, the decoder is conditioned on a camera pose. See Fig. 3 for an overview. In the following section, we drill down on the individual components of this architecture.

Fig. 3

Model. Pix2Shape generates realistic 3D views of scenes by training on 2D single images only. Its decoder generates the surfel depth map \(p_z\) from a noise vector \(\varvec{z}\) conditioned on the camera pose. The surfel normals are estimated from the predicted depth. The surfels are rendered into a 2D image and, together with image samples from the target distribution, are fed to the critic, which generates a gradient for both encoder and decoder paths

3D Representation and Surfels

Representing 3D structure as voxels or meshes presents different challenges for generative models (Kobbelt and Botsch 2004). Representing entire objects using voxels scales poorly given the \(O(n^3)\) complexity. Additionally, the vast majority of the generated voxels are not relevant to most viewpoints, such as the voxels that lie entirely inside objects. A common workaround is to use a surface representation such as meshes. However, these too come with their own drawbacks, such as their graph-like structure, which makes meshes difficult to generate using neural networks. Current mesh-based methods mainly rely on deforming a pre-existing mesh and thus limit the object topology to the same genus as the template mesh.

Our approach represents the 3D scene implicitly in a high-dimensional latent variable. In our framework, this latent variable (i.e., a vector) is decoded, using a decoder network conditioned on the camera pose, into a viewpoint-dependent representation of surface elements (i.e., surfels (Pfister et al. 2000), square-shaped planes that are scaled based on depth to roughly fit the size of a pixel) that constitute the visible part of the scene. This representation is very compact: given a renderer's point of view, we represent only the part of the 3D surface needed by the renderer. As the camera moves closer to a part of the scene, surfels become more compact and thereby increase the amount of visible detail. For descriptive purposes we discuss surfels as squares, but in general they can have any shape. Figure 1 compares surfels with different representations. Surfels differ from other explicit representations in that they are view-dependent, i.e., this representation changes for different camera poses (but the implicit latent vector representation does not).

Formally, surfels are represented as a tuple \((P, N, \rho )\), where \(P = (p_x, p_y, p_z)\) is its 3D position, \(N = (n_x, n_y, n_z)\) is the surface normal vector, and \(\rho = (k_r, k_g, k_b)\) is the albedo of the surface material. Note that \(\rho \) represents the material properties at the point P and could take a different size for a different shading model. Since we are only interested in modelling structural properties of the scenes, i.e. geometry and depth, we assume that objects in the scene have uniform material properties and thus keep \(\rho \) fixed. We also estimate the normals from depth by assuming locally planar surfaces. We represent the surfels in the camera coordinate system and generate one surfel for each pixel in the output image. This makes our representation very compact. Thus, the only necessary parameter for the decoder network to generate is \(p_z\), i.e. a depth map.
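Since there is exactly one surfel per pixel and surfels live in the camera coordinate system, the decoded depth map fully determines the surfel positions. The back-projection can be sketched as follows; this is a minimal NumPy illustration assuming a simple pinhole camera with a hypothetical focal parameter (the paper's exact camera model may differ):

```python
import numpy as np

def depth_to_surfel_positions(depth, focal=1.0):
    """Back-project a depth map into per-pixel surfel positions P = (p_x, p_y, p_z)
    in the camera coordinate system, with the camera looking along the -z axis.

    depth: (H, W) array of positive depth values.
    Returns: (H, W, 3) array of 3D positions, one surfel per pixel.
    """
    h, w = depth.shape
    # Normalized pixel coordinates on the image plane, in [-1, 1].
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    # Scale the ray by depth: surfels further away cover a larger area.
    px = xs * depth / focal
    py = ys * depth / focal
    pz = -depth  # camera looks along the -z axis
    return np.stack([px, py, pz], axis=-1)
```

This is also why a single depth channel suffices as the decoder output: positions follow from depth, and normals (below) are estimated from positions.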

Differentiable 3D Renderer

Since our architecture is GAN-like and uses 2D images as input to the critic network, we need to project the generated 3D representations down to 2D space using a renderer. In our setting, each stage of the rendering pipeline must be differentiable to allow us to take advantage of gradient-based optimization and backpropagate the critic’s error signal to the surfel representation. Our proposed rendering process is differentiable because: (1) each output pixel depends exactly on one surfel, and (2) we employ a differentiable shading operation to compute the color of each pixel. Our PyTorch implementation of the differentiable renderer can render a \(128 \times 128\) surfel-based scene in under 1.4 ms on a mobile NVIDIA GTX 1060 GPU. Further details about the rendering implementation can be found in Appendix B.
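Because each output pixel depends on exactly one surfel, the shading step reduces to a per-pixel operation. A minimal sketch of diffuse (Lambertian) shading, written in NumPy for brevity (the actual renderer is implemented in PyTorch, where autograd supplies the gradients):

```python
import numpy as np

def shade_surfels(normals, albedo, light_dir):
    """Diffuse shading of a surfel map.

    normals:   (H, W, 3) unit surface normals.
    albedo:    (H, W, 3) per-surfel reflectance rho.
    light_dir: (3,) unit vector pointing towards the light.
    Returns an (H, W, 3) image; every operation here is differentiable,
    so gradients can flow back to the normals and hence to the depth map.
    """
    cos = np.clip((normals * light_dir).sum(axis=-1, keepdims=True), 0.0, None)
    return albedo * cos
```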


The adversarial training paradigm allows the generator network to capture the underlying target distribution by competing with an adversarial critic network. Pix2Shape employs bi-directional adversarial training (Dumoulin et al. 2016; Donahue et al. 2016) to model the distribution of surfels from 2D images.

Bi-Directional Adversarial Training

ALI (Dumoulin et al. 2016) or Bi-GAN (Donahue et al. 2016) extend the GAN (Goodfellow et al. 2014) framework by including the learning of an inference mechanism. Specifically, in addition to the decoder network \(G_x\), ALI provides an encoder \(G_z\) which maps data points \(\varvec{x}\) to latent representations \(\varvec{z}\). In these bi-directional models, the critic, D, discriminates in both the data space (\(\varvec{x}\) versus \(G_x(\varvec{z})\)), and latent space (\(\varvec{z}\) versus \(G_z(\varvec{x})\)) jointly, maximizing the adversarial value function over two joint distributions. The final min-max objective can be written as:

$$\begin{aligned}&\min _{G}\max _{D} \mathcal {L}_{ALI}(G,D) = \mathbb {E}_{q(\varvec{x})}[\log (D(\varvec{x},G_z(\varvec{x})))] \nonumber \\&\quad + \mathbb {E}_{p(\varvec{z})}[\log (1-D(G_x(\varvec{z}),\varvec{z}))],\nonumber \end{aligned}$$

where \(q(\varvec{x})\) and \(p(\varvec{z})\) denote encoder and decoder marginal distributions.
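As a concrete illustration, the value function above can be estimated from mini-batch critic outputs; this is a minimal NumPy sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def ali_value(d_real, d_fake):
    """Monte-Carlo estimate of the ALI value function.

    d_real: critic outputs D(x, G_z(x)) on encoder pairs, each in (0, 1).
    d_fake: critic outputs D(G_x(z), z) on decoder pairs, each in (0, 1).
    The critic maximizes this quantity; the encoder/decoder minimize it.
    """
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))
```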

Modelling Depth and Constrained Normal Estimation

The encoder network captures the distribution over the latent space of the scene given an image data point \(\varvec{x}\). The decoder network maps a fixed scene latent distribution \(p(\varvec{z}_{scene})\) (a standard normal distribution in our case), conditioned on a given viewpoint \(\varvec{z}_{view}\), to the 2.5D surfel representation. The surfel representation is rendered into a 2D image using our differentiable renderer. The resulting image is given as input to the critic to distinguish it from the ground truth image data. To clarify the notation: the output of the encoder is \(\varvec{z}_{scene}\), and the input to the decoder is \((\varvec{z}_{scene},\varvec{z}_{view})\).

A straightforward way to design the decoder network is to learn a conditional distribution to produce the surfels’ depth (\(p_z\)) and normal (N). However, this could lead to inconsistencies between the local shape and the surface normal. For instance, the decoder can fake an RGB image of a 3D shape simply by changing the normals while keeping the depth fixed. To avoid this issue, we exploit the fact that real-world surfaces are locally planar, and that surfaces visible to the camera have normals constrained to be in the half-space of visible normal directions from the camera’s view point. Considering the camera to be looking along the \(-z\) axis direction, the estimated normal has the constraint \(n_z > 0\). Therefore, the local surface normal is estimated by solving the following problem for every surfel:

$$\begin{aligned} N^T \nabla P = 0 \text{ subject } \text{ to } \Vert N\Vert = 1 \text{ and } n_z > 0, \end{aligned}$$

where the spatial gradient \(\nabla P\) is computed for each of the 8 neighbour points, and P is the position of the surfels in the camera coordinate system obtained by back-projecting the generated depth along rays.

This approach enforces consistency between the predicted depth field and the computed normals and provides a gradient signal to the depth from the shading process. If the depth is incorrect, the normal-estimator outputs an incorrect set of normals, resulting in an inconsistent RGB image with the data distribution, which in turn would get penalized by the critic. The decoder network is thus incentivized to produce realistic depths.
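The normal estimation admits a compact sketch. Instead of the least-squares fit over all 8 neighbours, the simplified version below crosses the two axis-aligned position gradients and then enforces \(\Vert N\Vert = 1\) and \(n_z > 0\) (a NumPy illustration under locally planar assumptions, not the paper's exact solver):

```python
import numpy as np

def estimate_normals(positions):
    """Estimate per-surfel normals from back-projected positions (H, W, 3).

    Takes the cross product of the horizontal and vertical spatial gradients
    of the position field, normalizes, and flips normals so that n_z > 0
    (the camera looks along -z, so visible normals face the camera).
    """
    dx = np.gradient(positions, axis=1)  # horizontal neighbour differences
    dy = np.gradient(positions, axis=0)  # vertical neighbour differences
    n = np.cross(dx, dy)
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12
    flip = n[..., 2:3] < 0
    return np.where(flip, -n, n)
```

Because every step is differentiable, the critic's gradient on the shaded image propagates through the normals back into the depth map, exactly the incentive described above.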

Unsupervised Training

The Wasserstein-GAN (Arjovsky et al. 2017) formulation provides stable training dynamics using the first Wasserstein distance between the distributions. We adopt the gradient penalty setup as proposed in Gulrajani et al. (2017) for more robust training. However, we modify the formulation to take into account the bidirectional training.

The architectures of our networks and the training hyper-parameters are explained in detail in Appendix Sect. C. Briefly, we used Conditional Normalization (Dumoulin et al. 2016; Perez et al. 2017) for conditioning on the viewpoint (or camera pose) in the encoder, decoder and discriminator networks. The viewpoint is a three-dimensional vector representing the positional coordinates of the camera. In our training, the affine parameters of the batch-normalization layers (Ioffe and Szegedy 2015) are replaced by learned representations based on the viewpoint. The final objective includes a bi-directional reconstruction loss:

$$\begin{aligned} \mathcal {L}_{recon} =~&\mathbb {E}_{q(\varvec{x})}[||\varvec{x}- {{\textsc {rend}}}(G_{x}(G_{z}(\varvec{x})))||_2] \nonumber \\&+ \mathbb {E}_{p(\varvec{z})}[||\varvec{z}- G_{z}({{\textsc {rend}}}(G_{x}(\varvec{z})))||_2], \end{aligned}$$

where the rend\((\cdot )\) function synthesizes images through view-dependent decoding and projection and \(\varvec{z}\) is \((\varvec{z}_{scene},\varvec{z}_{view})\). This objective enforces the reconstructions from the model to stay close to the corresponding inputs. This reconstruction loss is used for the encoder and decoder networks as it has been empirically shown to improve reconstructions in ALI-type models (Li et al. 2017).
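Given the already-computed network outputs, the two reconstruction terms are plain \(L_2\) norms. A minimal NumPy sketch (names are illustrative):

```python
import numpy as np

def recon_loss(x, x_recon, z, z_recon):
    """Bi-directional reconstruction loss.

    x, x_recon: (B, ...) input images and their re-renderings rend(G_x(G_z(x))).
    z, z_recon: (B, D) sampled latents and their re-encodings G_z(rend(G_x(z))).
    Returns the mean L2 image term plus the mean L2 latent term.
    """
    img_term = np.linalg.norm((x - x_recon).reshape(x.shape[0], -1), axis=1).mean()
    lat_term = np.linalg.norm(z - z_recon, axis=1).mean()
    return img_term + lat_term
```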


Semi-Supervised Training for Classification

Our model can also be trained in a semi-supervised setting (see Algorithm 1) to solve image classification tasks that require 3D understanding, such as the 3D-IQTT (see Fig. 2). The idea is to use labeled examples to streamline the learned latent representations in order to solve the task. In this case, we do not assume that we know the camera position for the unlabeled training samples. As mentioned earlier, part of the latent vector \(\varvec{z}\) encodes the actual 3D object (denoted \(\varvec{z}_{scene}\)) and the remainder estimates the camera pose (denoted \(\varvec{z}_{view}\)). For the supervised samples, two additional loss terms are used: (a) a loss that enforces the object component (\(\varvec{z}_{scene}\)) to be the same for both the reference object and the correct answer, and (b) a loss that maximizes the distance between the reference object and the distractors. This loss is expressed as:

$$\begin{aligned} \mathcal {L}_{\theta } = \frac{1}{2}D_{\theta }(\varvec{x}_{ref}, \varvec{x}_{ans}) - \frac{1}{2}\sum _{i=1}^{2} D_{\theta }(\varvec{x}_{ref}, \varvec{x}_{d_i}) \end{aligned}$$

where \(\varvec{x}_{ref}\) is the reference image, \(\varvec{x}_{ans}\) is the correct answer, \(d_i\) denotes the distractors, \(D_{\theta }(\varvec{x}_1, \varvec{x}_2) = (||\varvec{z}_{scene}^{\varvec{x}_1}- \varvec{z}_{scene}^{\varvec{x}_2}||_2)^2\) and \(\varvec{z}^{\varvec{x}}=Encoder_{\theta }(\varvec{x})\).
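The supervised loss translates directly from the definitions above; a NumPy sketch operating on already-encoded \(z_{scene}\) vectors (`iqtt_loss` is an illustrative name):

```python
import numpy as np

def iqtt_loss(z_ref, z_ans, z_distractors):
    """Semi-supervised 3D-IQTT loss on encoder latents z_scene:
    pull the correct answer towards the reference, push the two
    distractors away (squared L2 distances)."""
    d = lambda a, b: np.sum((a - b) ** 2)
    return 0.5 * d(z_ref, z_ans) - 0.5 * sum(d(z_ref, zd) for zd in z_distractors)
```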

During training, we also minimize the mutual information between \(\varvec{z}_{scene}\) and \(\varvec{z}_{view}\) to explicitly disentangle both. This is implemented via MINE (Belghazi et al. 2018). The strategy of MINE is to parameterize a variational formulation of the mutual information in terms of a neural network:

$$\begin{aligned} I_{\Theta }(z_{s},z_{v}) = \sup _{\theta \in \Theta } \mathbb {E}_{{\mathbb {P}}_{z_{s} z_{v}}}[T_\theta ] - \log (\mathbb {E}_{{\mathbb {P}}_{z_{s}} \otimes {\mathbb {P}}_{z_{v}}}[e^{T_\theta }]). \end{aligned}$$

This objective is optimized in an adversarial paradigm where T, the statistics network, plays the role of the critic and is fed with samples from the joint distribution and the product of the marginals. We use this loss to minimize the mutual information estimate in both unsupervised and supervised training iterations. Once the model is trained, we answer 3D-IQTT questions by inferring the latent 3D representation for each of the four images and selecting the answer closest to the reference image as measured by the \(L_2\) distance between latent representations.
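The test-time answering procedure is then a nearest-neighbour query in latent space. A sketch, where `encode` stands in for the trained encoder restricted to the \(z_{scene}\) part of the latent code:

```python
import numpy as np

def answer_question(encode, x_ref, candidates):
    """Answer a 3D-IQTT question: encode the reference and the candidate
    images, return the index of the candidate whose latent code is closest
    to the reference in L2 distance."""
    z_ref = encode(x_ref)
    dists = [np.linalg.norm(z_ref - encode(x)) for x in candidates]
    return int(np.argmin(dists))
```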

Fig. 4

Shape scenes reconstruction. Pix2Shape reconstruction of single objects in a room (left) and multiple objects in a room (right). On both sides, the ground truths for RGB, depth, and normals are in the upper row; the inferred image, depth and normals are in the respective lower rows. Our model is able to correctly recover the depth and normals of the scenes from a single 2D image

Experimental Setup

We evaluate Pix2Shape on three different tasks: scene reconstruction, scene generation, and 3D-IQTT.

Scene Reconstruction

The goal of this task is to produce a 2.5D representation (depth and normals) from a given input image. Moreover, we also evaluate if the model can extrapolate to unobserved views of the scene.

For this task we created two datasets of scene images, each composed of a room containing one or more objects placed at random positions and orientations. The Shape scenes dataset contains rendered images of multiple basic 3D shapes (e.g., box, sphere, cone, torus, teapot) placed inside a room. The ShapeNet scenes dataset is constructed from renderings of multiple objects of different categories from the ShapeNet dataset (Chang et al. 2015) (e.g., bowls, bottles, mugs, lamps, bags).

Each 3D scene is rendered into a single \(128\times 128\times 3\) image taken from a camera in a random position sampled uniformly on the positive octant of a sphere containing the room. The probability of seeing the same configuration of a scene from two different views is near zero.
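Such a camera position can be drawn, for instance, by folding a uniform spherical direction into the positive octant; a sketch (the radius value is hypothetical):

```python
import numpy as np

def sample_camera(rng, radius=2.0):
    """Sample a camera position uniformly on the positive octant of a sphere
    of the given radius enclosing the room.

    Taking the absolute value of an isotropic Gaussian direction folds the
    uniform spherical distribution into the positive octant."""
    v = np.abs(rng.normal(size=3))
    return radius * v / np.linalg.norm(v)
```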

We evaluate the performance of scene reconstruction using three different metrics: (1) Chamfer distance, (2) Hausdorff distance (Hausdorff 1949) (on surfels’ position), and (3) Mean Squared Error (MSE).

Chamfer distance (CD) gives the average distance from each point in one set to the closest point in the other set. For any two point sets \(A, B \subset \mathbb {R}^3\), the Chamfer distance is measured as:

$$\begin{aligned} CD(A, B)= & {} \dfrac{1}{|A|} \sum _{x \in A}\min _{y \in B}\Vert x-y\Vert _2\nonumber \\&\quad + \dfrac{1}{|B|} \sum _{x \in B}\min _{y \in A}\Vert x-y\Vert _2 \end{aligned}$$
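For small point sets the definition translates directly into code; a NumPy sketch using a dense pairwise-distance matrix (adequate for per-view surfel maps, though a KD-tree would scale better):

```python
import numpy as np

def chamfer_distance(A, B):
    """Chamfer distance between point sets A (n, 3) and B (m, 3):
    mean nearest-neighbour distance in both directions."""
    # (n, m) matrix of pairwise Euclidean distances
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```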

Hausdorff distance (HD) measures the correspondence of the model’s 3D reconstruction with the input for a given camera pose. Given two point sets, A and B, the Hausdorff distance is,

$$\max \left\{ \max D_H^+(A, B), \max D_H^+(B, A)\right\} ,$$

where \(D_H^+\) is an asymmetric (directed) Hausdorff distance between two point sets: \(\max D_H^+(A, B) = \max _{a \in A} D(a, B)\), the largest Euclidean distance \(D(\cdot )\) from a point in A to the set B; \(\max D_H^+(B, A)\) is defined analogously for the reverse direction. In both evaluations, we compare our reconstructed view-centric surfels (3D positions and normals) to the ground-truth surfels.
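The symmetric Hausdorff distance admits the same dense-matrix sketch as the Chamfer distance, replacing the means with maxima:

```python
import numpy as np

def hausdorff_distance(A, B):
    """Symmetric Hausdorff distance between point sets A (n, 3) and B (m, 3):
    the largest nearest-neighbour distance from either set to the other."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```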

Table 1 Scene reconstruction results

Scene Generation

In the second task we showcase the generative ability of our model by using our generator to sample class-conditioned shapes from the ShapeNet dataset. We evaluate the 3D scene generation task qualitatively.


3D-IQTT

In the final task we evaluate the 3D understanding capability of the model on the 3D-IQTT: a spatial-reasoning-based semi-supervised classification task. The goal of the 3D-IQTT is to quantify the ability of our model to perform a 3D spatial reasoning test by using large amounts of unlabeled training data and a small set of labeled examples.

For the 3D-IQTT task, we generated a dataset where each IQ question consists of a reference image of a Tetris-like shape, as well as three other images, one of which is a randomly rotated version of the reference (see Fig. 2 for an example). The training set consists of 100k questions, of which only a few are labeled with the correct answer (either \(1\%\) (1k) or \(0.2\%\) (200) of the total training data). The validation and test sets each contain 100k labeled questions. Earlier literature related to the 3D-IQTT is discussed in Appendix Sect. I. We evaluate the 3D-IQTT task by the percentage of questions answered correctly.

More details on experimental setup and evaluation can be found in Appendix Sects. E and G.

Fig. 5

ShapeNet scenes reconstruction. Implicit 3D reconstruction of scenes composed by multiple ShapeNet objects

Fig. 6

ShapeNet 256 \( \times \) 256 scenes reconstruction. Implicit 3D reconstruction of scenes composed by multiple ShapeNet objects


Results

Scene Reconstruction

Figure 4 shows the input shape scenes data and its corresponding shading reconstructions, along with the recovered depth and normal maps. The depth map is encoded such that the darkest points are closest to the camera. The normal map colors correspond to the cardinal directions (red/green/blue for the x/y/z axes, respectively). Table 1 shows a quantitative evaluation of the Chamfer and Hausdorff distances on the Shape scenes and ShapeNet scenes datasets from a given observed view. The table also reports the mean squared error (MSE) of the generated depth map with respect to the input depth map. The shading reconstructions are almost perfect on this simple dataset. Our model successfully learns the depth of the scenes and thereby the relative positions of the surfels. It also estimates consistent normal maps from the depth. However, the absolute distance is not always recovered perfectly.

Figure 5 shows the reconstructions from the model on challenging ShapeNet scenes where the number of objects as well as their shapes vary. Note how our model is able to handle geometry of varying complexity. Figure 6 shows reconstructions on \(256 \times 256\) resolution scenes (on the right), constructed from more difficult thin-edged chairs and tables from the ShapeNet dataset in random configurations.

To showcase that our model can reconstruct unobserved views, we first infer the latent code \(\varvec{z}_{scene}\) of an image \(\varvec{x}\) and then decode and render different views while rotating the camera around the scene. Table 2 shows the Chamfer and Hausdorff distances and MSE loss of reconstructing a scene from different unobserved view angles. As the view angle increases from \(0 ^{\circ }\) (original) to \(80 ^{\circ }\) for shape scenes, the reconstruction error and MSE tend to increase. However, for the ShapeNet scenes the trend is not as clear because of the complexity of the scenes and inter-object occlusions. We compare our method with the PTN baseline (Yan et al. 2016). Note that PTN reconstructs the 3D object explicitly in voxels, whereas we output a 2.5D representation. Therefore, for a fair comparison we rotate and render a per-pixel depth map from a desired view and compute the Chamfer distance with respect to the ground-truth projection for that view. Figure 7 qualitatively shows how Pix2Shape correctly infers the scene parts not in view, demonstrating true 3D understanding.
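For reference, the symmetric Chamfer distance used in these evaluations can be sketched as follows. This is a minimal brute-force version on small point sets, not our actual evaluation code:

```python
def chamfer_distance(A, B):
    """Symmetric Chamfer distance between two 3D point sets.

    For each point in one set, take the squared distance to its nearest
    neighbour in the other set, average within each direction, and sum
    the two directions. Brute force, O(|A|*|B|).
    """
    def nearest_sq(p, S):
        return min((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 + (p[2] - q[2]) ** 2
                   for q in S)

    d_ab = sum(nearest_sq(a, B) for a in A) / len(A)
    d_ba = sum(nearest_sq(b, A) for b in B) / len(B)
    return d_ab + d_ba
```

In practice the rendered per-pixel depth map is back-projected into a point set before computing this distance against the ground-truth projection.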

In all our datasets and further experiments we use diffuse materials with uniform reflectance. The reflectance values are chosen arbitrarily and we use the same material properties on both the input and the generator side. However, our differentiable rendering setup also supports the Phong illumination model. For instance, Fig. 8 shows the input shape-scenes data with specular reflection and the corresponding shading reconstructions, along with the recovered depth.

Table 2 Viewpoint reconstruction
Fig. 7

Viewpoint reconstruction. Given a scene (first column), we rotate the camera around it to visualize the unseen parts of the scene. The model correctly infers the unobserved geometry of the objects, demonstrating true 3D understanding of the scene. Videos of these reconstructions can be seen at https://bit.ly/2zADuqG

Fig. 8

Shape scenes reconstruction with specular reflectance. Pix2Shape reconstruction of multiple objects in a room. The input RGB images are in the upper row; the inferred images and depths are in the respective lower rows

Scene Generation

We trained Pix2Shape on scenes composed of a single ShapeNet object in a room. The model was trained conditionally by giving the class label of the ShapeNet object present in the scene to the decoder and critic networks (Mirza and Osindero 2014). Figure 9 shows the results of conditioning the decoder on different target classes. Our model was able to generate accurate 3D models for the target class. We can also train the model in an unconditional fashion without giving any object category information (see Appendix F for more details and results).

In order to explore the manifold of the learned representations, we selected two images \(\varvec{x_1}\) and \(\varvec{x_2}\) from the held out data. We then linearly interpolated between their encodings \(\varvec{z}_{\varvec{1}scene}\) and \(\varvec{z}_{\varvec{2}scene}\) and decoded the intermediary points into their corresponding images using a fixed camera pose. Figure 10 shows this for two different settings.
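The interpolation step itself is simple linear blending in latent space; decoding each intermediate code with a fixed camera pose yields the images in Fig. 10. A minimal sketch, treating latent codes as plain vectors:

```python
def interpolate_latents(z1, z2, steps=8):
    """Linearly interpolate between two latent codes z1 and z2.

    Returns `steps` codes including both endpoints; each intermediate
    code would then be decoded and rendered from a fixed camera pose.
    """
    out = []
    for i in range(steps):
        t = i / (steps - 1)  # blend factor from 0.0 to 1.0
        out.append([(1 - t) * a + t * b for a, b in zip(z1, z2)])
    return out
```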

Fig. 9

Conditional scene generation. Class-conditionally generated samples for ShapeNet dataset. These images are not part of the training data

Fig. 10

Manifold exploration. Exploration of the learned manifold of 3D representations. Generated interpolations (middle columns) between two images \(\varvec{x_1}\) and \(\varvec{x_2}\) (first and last columns)

Table 3 3D-IQTT results

3D-IQ Test Task

We trained our model on the 3D-IQTT task using the semi-supervised training described in Sect. 3.3.4. We compared our model to the baselines listed below, as well as to human performance.

Human We created an online test where 40 random graduate students from our lab answered 20 randomly selected questions from the test set (similar to Fig. 2). No student had seen these images before. More details can be found in Appendix J.

CNN The first baseline is composed of four ResNet-50 modules (He et al. 2016) with shared weights followed by three fully-connected layers and a softmax output for the class label (answer 1 to 3). We trained this CNN only on the labeled samples. The architecture is depicted in the appendix, Fig. 12.

Siamese Network Our second baseline is a Siamese CNN with an architecture similar to the previous one, but with the fully-connected layers removed. Instead of the supervised loss provided in the form of correct answers, it was trained with a contrastive loss (Koch et al. 2015). This loss reduces the feature distance between the reference and the correct answer and maximizes the feature distance between the reference and the incorrect answers.

Perspective Transformer Nets As our third baseline, we used the open-source implementation of the Perspective Transformer Nets (Yan et al. 2016) to solve the IQTT task using the learned latent code.

Rezende et al. (2016) Since no open-source code is available for this work, we implemented our own interpretation of it. We were able to reproduce the results from their paper (see Appendix K) before attempting to use it as a baseline for our model.

A more detailed description of the networks and the contrastive loss function can be found in Appendix D.

Table 3 shows 3D-IQTT results for our method and the baselines. The CNN-based baselines were not able to infer the underlying 3D structure of the data, and their results are only slightly better than random guessing. The poor performance of the Siamese CNN might be partly because the contrastive loss rewards similarities in pixel space and has no notion of 3D similarity. In contrast, Pix2Shape achieved significantly better accuracy by leveraging the learned 3D knowledge of objects. Our method also outperformed the other two baseline approaches, albeit by a smaller margin.

Table 4 Loss analysis

Analyzing the Loss Functions

In this section, we present an ablation study of the different loss functions used to train our model. Our final objective is a combination of a bi-directional adversarial loss \(\mathcal {L}_{ALI}\) and a reconstruction loss \(\mathcal {L}_{recon}\). For the 3D-IQTT task we augmented these losses with a mutual-information-based objective \(I_{\Theta }(z_{s},z_{v})\) to ensure that different parts of the latent code encode distinct pieces of information present in a scene. This allows us to disentangle viewpoint and geometry. Table 4 shows our results for both the reconstruction task on the ShapeNet scenes dataset and the 3D-IQTT task when considering: (i) only the adversarial loss (\(\mathcal {L}_{ALI}\)); (ii) only the reconstruction loss (\(\mathcal {L}_{recon}\)); (iii) both the adversarial and reconstruction losses but no mutual-information term (note that this term does not affect the reconstruction task); and (iv) all three (\(\mathcal {L}_{ALI}\), \(\mathcal {L}_{recon}\) and \(I_{\Theta }(z_{s},z_{v})\)).

We observe that each loss term improves the performance of the model on both tasks. Using the adversarial loss alone is not enough to faithfully reconstruct the surfels. On the other hand, using the reconstruction loss alone hurts the model when extrapolating the shape to unseen views (e.g., view angles \(35^{\circ }\) to \(80^{\circ }\)), although it yields better performance when reconstructing from the given input viewpoint, i.e., \(0^{\circ }\). Using the reconstruction loss alone also degrades the quality of the generated samples; the adversarial loss \(\mathcal {L}_{ALI}\) plays a major role in obtaining detailed, high-quality samples. For the 3D-IQTT task, the role of \(\mathcal {L}_{ALI}\) is even more evident: it encourages the latent code to learn meaningful representations by constraining the model to match the joint distributions. The results also clearly indicate that omitting the mutual-information loss degrades performance on the 3D-IQTT task. This is expected, as view information then mixes with geometric information in the latent representation.


In this paper we propose a generative approach to learn 3D structural properties from single-view images in an unsupervised and implicit fashion. Our model receives an image of a scene with uniform material as input, estimates the depth of the scene points and then reconstructs the input image through a differentiable renderer. We also provide quantitative evidence that supports our argument by introducing a novel IQ Test Task in a semi-supervised setup. We hope that this evaluation metric will be adopted as a standard benchmark to measure the 3D understanding capability of models across different 3D representations. The main drawback of our current model is that it requires knowledge of lighting and material properties. Future work will focus on the more ambitious setting of learning complex materials and textures along with modelling the lighting properties of the scene.

All code for this project is available at https://github.com/rajeswar18/pix2shape. The code we developed in order to reproduce the Rezende et al. (2016) baseline is available at https://github.com/fgolemo/threedee-tools.


  1. Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International conference on machine learning (ICML).

  2. Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Hjelm, D., & Courville, A. (2018). Mutual information neural estimation. In International conference on machine learning (ICML).

  3. Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2019). Nuscenes: A multimodal dataset for autonomous driving. arXiv:1903.11027.

  4. Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., & Yu, F. (2015). Shapenet: An information-rich 3D model repository.

  5. Chaudhuri, S., Kalogerakis, E., Guibas, L., & Koltun, V. (2011). Probabilistic reasoning for assembly-based 3D modeling. In ACM SIGGRAPH.

  6. Chen, W., Gao, J., Ling, H., Smith, E. J., Lehtinen, J., Jacobson, A., & Fidler, S. (2019). Learning to predict 3d objects with an interpolation-based differentiable renderer. CoRR abs/1908.01210

  7. Choy, C., Xu, D., Gwak, J., Chen, K., & Savarese, S. (2016). 3D-r2n2: A unified approach for single and multi-view 3d object reconstruction.

  8. Donahue, J., Krähenbühl, P., & Darrell, T. (2016). Adversarial feature learning. arXiv:1605.09782.

  9. Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., & Courville, A. (2016). Adversarially learned inference. arXiv:1606.00704.

  10. Gadelha, M., Maji, S., & Wang, R. (2016). 3D shape induction from 2D views of multiple objects. CoRR abs/1612.05872.

  11. Girdhar, R., Fouhey, D., Rodriguez, M., & Gupta, A. (2016). Learning a predictable and generative vector representation for objects. In European conference of computer vision (ECCV).

  12. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information processing systems (NIPS).

  13. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017). Improved training of wasserstein gans. In Advances in neural information processing systems (NIPS).

  14. Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In Computer vision and pattern recognition (CVPR).

  15. Hausdorff, F. (1949). Grundzüge der Mengenlehre. New York: Chelsea Pub. Co.


  16. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Computer vision and pattern recognition (CVPR).

  17. Henderson, P., & Ferrari, V. (2018). Learning to generate and reconstruct 3D meshes with only 2D supervision. CoRR abs/1807.09259.

  18. Huang, J., Zhou, Y., Funkhouser, T. A., & Guibas, L. J. (2019). Framenet: Learning local canonical frames of 3D surfaces from a single RGB image. CoRR abs/1903.12305.

  19. Insafutdinov, E., & Dosovitskiy, A. (2018). Unsupervised learning of shape and pose with differentiable point clouds. CoRR abs/1810.09381.

  20. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML).

  21. Jiang, C. M., Wang, D., Huang, J., Marcus, P., & Nießner, M. (2019). Convolutional neural networks on non-uniform geometrical signals using euclidean spectral transformation. CoRR abs/1901.02070.

  22. Kajiya, J. T. (1986). The rendering equation. In Annual conference on computer graphics and interactive techniques (SIGGRAPH).

  23. Kalogerakis, E., Chaudhuri, S., Koller, D., & Koltun, V. (2012). A probabilistic model for component-based shape synthesis. ACM Transactions in Graphics, 31(4), 55-1–55-11.


  24. Kanazawa, A., Tulsiani, S., Efros, A. A., & Malik, J. (2018). Learning category-specific mesh reconstruction from image collections. In European conference on computer vision (ECCV).

  25. Kar, A., Tulsiani, S., Carreira, J., & Malik, J. (2014). Category-specific object reconstruction from a single image. CoRR abs/1411.6069.

  26. Kato, H., & Harada, T. (2018). Learning view priors for single-view 3D reconstruction. CoRR abs/1811.10719.

  27. Kato, H., Ushiku, Y., & Harada, T. (2017). Neural 3D mesh renderer. CoRR abs/1711.07566.

  28. Kobbelt, L., & Botsch, M. (2004). A survey of point-based techniques in computer graphics. Computers and Graphics, 28(6), 801–814.


  29. Koch, G., Zemel, R., & Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop.

  30. Kulkarni, T. D., Whitney, W., Kohli, P., & Tenenbaum, J. B. (2015). Deep convolutional inverse graphics network. In Advances in neural information processing systems (NIPS).

  31. Li, C., Liu, H., Chen, C., Pu, Y., Chen, L., Henao, R., & Carin, L. (2017). Alice: Towards understanding adversarial learning for joint distribution matching. In Advances in neural information processing systems (NIPS).

  32. Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., & Freeman, W. T. (2019). Learning the depths of moving people by watching frozen people. CoRR abs/1904.11111.

  33. Liu, S., Chen, W., Li, T., & Li, H. (2019). Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction. CoRR abs/1901.05567.

  34. Loper, M. M., & Black, M. J. (2014). Opendr: An approximate differentiable renderer. In European conference on computer vision (ECCV).

  35. Mikolov, T., Deoras, A., Kombrink, S., Burget, L., & Cernockỳ, J. (2011). Empirical evaluation and combination of advanced language modeling techniques. In INTERSPEECH.

  36. Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets.

  37. Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., & Yang, Y. L. (2019). Hologan: Unsupervised learning of 3d representations from natural images.

  38. Niu, C., Li, J., & Xu, K. (2018). Im2struct: Recovering 3d shape structure from a single RGB image. In Computer vision and pattern recognition (CVPR).

  39. Novotný, D., Larlus, D., & Vedaldi, A. (2017). Learning 3D object categories by looking around them. CoRR abs/1705.03951.

  40. Novotný, D., Ravi, N., Graham, B., Neverova, N., & Vedaldi, A. (2019). C3dpo: Canonical 3D pose networks for non-rigid structure from motion. arXiv:1909.02533.

  41. Perez, E., Strub, F., De Vries, H., Dumoulin, V., & Courville, A. (2017). Film: Visual reasoning with a general conditioning layer. arXiv:1709.07871.

  42. Pfister, H., Zwicker, M., van Baar, J., & Gross, M. (2000). Surfels: Surface elements as rendering primitives. In Annual conference on computer graphics and interactive techniques.

  43. Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. In International conference on learning representations (ICLR).

  44. Rezende, D. J., Eslami, S. M. A., Mohamed, S., Battaglia, P., Jaderberg, M., & Heess, N. (2016). Unsupervised learning of 3D structure from images. In Advances in neural information processing systems (NIPS).

  45. Saxena, A., Sun, M., & Ng, A. Y. (2009). Make3d: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 31(5).

  46. Shepard, R. N., & Metzler, J. (1971). Mental rotation of three-dimensional objects. Science, 171(3972), 701–703.


  47. Soltani, A. A., Huang, H., Wu, J., Kulkarni, T. D., & Tenenbaum, J. B. (2017). Synthesizing 3D shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In Computer vision and pattern recognition (CVPR)

  48. Taha, A. A., & Hanbury, A. (2015). An efficient algorithm for calculating the exact hausdorff distance. In IEEE transactions on pattern analysis and machine intelligence (PAMI).

  49. Tulsiani, S., Su, H., Guibas, L. J., Efros, A. A., & Malik, J. (2016). Learning shape abstractions by assembling volumetric primitives. CoRR abs/1612.00404.

  50. Tulsiani, S., Zhou, T., Efros, A. A., & Malik, J. (2017). Multi-view supervision for single-view reconstruction via differentiable ray consistency. CoRR abs/1704.06254.

  51. Wiles, O., & Zisserman, A. (2017). Silnet : Single- and multi-view reconstruction by learning from silhouettes. CoRR abs/1711.07888.

  52. Woodcock, R., Mather, N., & McGrew, K. (2001). Woodcock-Johnson III—Tests of cognitive abilities. Riverside Pub.

  53. Wu, J., Xue, T., Lim, J., Tian, Y., Tenenbaum, J., Torralba, A., & Freeman, W. (2016a). Single image 3D interpreter network.

  54. Wu, J., Zhang, C., Xue, T., Freeman, W. T., Tenenbaum, J. B. (2016b). Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in neural information processing systems (NIPS).

  55. Wu, J., Wang, Y., Xue, T., Sun, X., Freeman, W. T., & Tenenbaum, J. B. (2017). Marrnet: 3D shape reconstruction via 2.5D sketches. CoRR abs/1711.03129.

  56. Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., & Xiao, J. (2015). 3D shapenets: A deep representation for volumetric shapes. In Computer vision and pattern recognition (CVPR).

  57. Yan, X., Yang, J., Yumer, E., Guo, Y., & Lee, H. (2016). Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In Advances in neural information processing systems (NIPS).

  58. Zhang, X., Zhang, Z., Zhang, C., Tenenbaum, J., Freeman, B., & Wu, J. (2018). Learning to reconstruct shapes from unseen classes. In Advances in neural information processing systems (NIPS).

  59. Zhu, J. Y, Zhang, Z., Zhang, C., Wu, J., Torralba, A., Tenenbaum, J. B., & Freeman, W. T. (2018). Visual object networks: Image generation with disentangled 3D representations. In Advances in neural information processing systems (NeurIPS).


Author information



Corresponding author

Correspondence to Sai Rajeswar.


Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.



Rendering Details

The color of a surfel depends on the material reflectance, its position and orientation, as well as the ambient and point light source colors (See Fig. 11b). Given a surface point \(P_i\), the color of its corresponding pixel \(I_{rc}\) is given by the shading equation:

$$\begin{aligned} I_{rc} = \rho _i \left( L_a + \sum _j \frac{L_j}{k_l \Vert d_{ij}\Vert + k_q \Vert d_{ij}\Vert ^2} \max \left( 0, \frac{N_i^T d_{ij}}{\Vert d_{ij}\Vert } \right) \right) , \end{aligned}$$

where \(\rho _i\) is the surface reflectance, \(L_a\) is the ambient light's color, \(L_j\) is the jth positional light source's color, \(d_{ij} = L_{j}^{{\text{ pos }}} - P_i\) is the direction vector from the scene point to the point light source, and \(k_l\) and \(k_q\) are the linear and quadratic attenuation terms respectively. Equation 5 is an approximation of the rendering equation proposed in Kajiya (1986).
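As a sanity check, the shading equation can be evaluated directly per surfel and per color channel. The sketch below is illustrative; the attenuation defaults are placeholders, not the values used in our experiments:

```python
import math

def shade_surfel(rho, L_a, lights, P, N, k_l=1.0, k_q=0.1):
    """Diffuse shading of one surfel for one color channel (Eq. 5).

    rho: surface reflectance; L_a: ambient light intensity;
    lights: list of (position, intensity) point lights;
    P: surfel position; N: unit surfel normal;
    k_l, k_q: linear/quadratic attenuation coefficients.
    """
    total = L_a
    for L_pos, L_j in lights:
        d = [L_pos[i] - P[i] for i in range(3)]       # direction to the light
        dist = math.sqrt(sum(x * x for x in d))
        # Clamped cosine between the normal and the (normalized) light direction
        cos_theta = max(0.0, sum(N[i] * d[i] for i in range(3)) / dist)
        total += L_j * cos_theta / (k_l * dist + k_q * dist * dist)
    return rho * total
```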

Fig. 11

Differentiable 3D renderer. a A surfel is defined by its position P, normal N, and reflectance \(\rho \). Each surfel maps to an image pixel \(P_{im}\). b The surfel’s color depends on its reflectance \(\rho \) and the angle \(\theta \) between each light I and the surfel’s normal N


Pix2Shape is composed of an encoder network (see Table 5), a decoder network (see Table 6), and a critic network (see Table 7). Specifically, the decoder architecture is similar to the generator in DCGAN (Radford et al. 2015) but with LeakyReLU (Mikolov et al. 2011) as the activation function and batch normalization (Ioffe and Szegedy 2015). We also adjusted its depth and width accordingly to accommodate high-resolution images. In order to condition the decoder on the camera position along with the z variable, we use conditional normalization in alternate layers of the decoder. We train our model for 60K iterations with a batch size of 6 on images of resolution \(128 \times 128 \times 3\).

Table 5 Pix2Shape encoder architecture
Table 6 Pix2Shape decoder architecture
Table 7 Pix2Shape critic architecture

Architecture for Semi-supervised Experiments

The Pix2Shape architecture remains similar to the previous one, but with higher capacity in the decoder and critic. The most important difference is that for these experiments we do not condition the networks on the camera pose, to be fair to the baselines. In addition to the three networks, we have a statistics network (see Table 8) that estimates and minimizes the mutual information between the two sets of dimensions in the latent code using MINE (Belghazi et al. 2018). Out of the 128 dimensions of z, the first 118 represent scene-based information and the rest encode view-based information.

Table 8 Pix2Shape statistics network architecture

The architecture of the baseline networks is shown in Fig. 12. During training we use the contrastive loss (Hadsell et al. 2006):

$$\begin{aligned} \mathcal {L}_\theta (\varvec{x}_{1}, \varvec{x}_{2}, y)&= (1-y)\frac{1}{2}\left( D_\theta (\varvec{x}_{1}, \varvec{x}_{2})\right) ^2 \nonumber \\&\quad + y\,\frac{1}{2}\left( \max \left( 0, m-D_\theta (\varvec{x}_{1}, \varvec{x}_{2})\right) \right) ^2 \nonumber \\ D_\theta (\varvec{x}_1, \varvec{x}_2)&= \Vert G_\theta (\varvec{x}_1)-G_\theta (\varvec{x}_2)\Vert _2, \end{aligned}$$

where \(\varvec{x}_1\) and \(\varvec{x}_2\) are the input images, y is either 0 (if the inputs are supposed to match) or 1 (if they are supposed to differ), \(G_\theta \) is the shared ResNet feature extractor, parameterized by \(\theta \), and m is the margin, which we set to 2.0. We apply the contrastive loss to the 2048 features generated by each ResNet block.
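Applied to two already-extracted feature vectors, the contrastive loss above can be sketched as:

```python
def contrastive_loss(f1, f2, y, margin=2.0):
    """Contrastive loss (Hadsell et al. 2006) on two feature vectors.

    y = 0: inputs should match  -> penalise the squared feature distance.
    y = 1: inputs should differ -> penalise only if closer than `margin`.
    """
    # Euclidean distance D between the two feature vectors
    d = sum((a - b) ** 2 for a, b in zip(f1, f2)) ** 0.5
    if y == 0:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```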

Fig. 12

3D-IQTT baseline architecture. The four ResNet-50 modules share the same weights and were slightly modified to support our image size. "FC" stands for fully-connected layer; the hidden layer sizes are 2048, 512, and 256 respectively. The output of the network is encoded as a one-hot vector

Fig. 13

Learning material along with structure. The model learns the foreground and background colors separately

Fig. 14

Random lights configuration

Fig. 15

Unconditional scene generation. Generated samples from Pix2Shape model trained on ShapeNet scenes. Left: shaded images; Right: depth maps

Material, Lights, and Camera Properties

Material In our experiments, we use diffuse materials with uniform reflectance. The reflectance values are chosen arbitrarily and we use the same material properties on both the input and the generator side. Figure 13 shows that it is possible to learn reflectance alongside the 3D structure of the scenes by requiring the model to predict the material coefficients along with the depth of the surfels. The color of the objects depends on both the lighting and the material properties. We do not delve into more detail here, as our objective is to model the structural/geometrical properties of the world; this will be explored further in a later study.

Camera The camera is specified by its position, its viewing direction and a vector indicating its orientation. The camera positions were sampled uniformly at random on a sphere for the 3D-IQTT task, and on a spherical patch contained in the positive octant for the rest of the experiments. The viewing direction was updated based on the camera position and the center of mass of the objects, so that the camera was always looking at a fixed point in the scene as its position changed. The focal length ranged between 18 mm and 25 mm in all experiments and the field of view was fixed to 24 mm. The camera properties were also shared between the input and the generator side. However, in the 3D-IQTT task we relaxed the assumption that the camera pose is known and instead estimate the view as part of the learned latent representation.
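The look-at update described above amounts to renormalizing the vector from the camera position to the fixed target point; a minimal sketch:

```python
import math

def view_direction(cam_pos, target):
    """Unit viewing direction from the camera position toward a fixed
    look-at target (e.g., the objects' center of mass)."""
    d = [t - c for t, c in zip(target, cam_pos)]
    norm = math.sqrt(sum(x * x for x in d))
    return [x / norm for x in d]
```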

Lights For the light sources, we experimented with single and multiple point-light sources, with the light colors chosen randomly. The light positions are sampled uniformly on a sphere for the 3D-IQTT task, and uniformly on a spherical patch covering the positive octant for the other scenes. The same light colors and positions are used for rendering both the input and the generated images. The lights act as physical spot lights, with the radiant energy attenuating quadratically with distance. As an ablation study, we relaxed the assumption of perfect knowledge of the lights by using lights with random positions and random colors. These experiments show that the light information is not needed by our model to learn the 3D structure of the data. However, as we use random lights on the generator side, the shading of the reconstructions differs in color from the input, as shown in Fig. 14.

Unconditional ShapeNet Generation

We trained Pix2Shape on scenes composed of ShapeNet objects from six categories (i.e., bowls, bottles, mugs, cans, caps and bags). Figure 15 shows qualitative results on unconditional generation. Since no class information is provided to the model, the latent variable captures both the object category and its configuration.

Table 9 Viewpoint reconstruction
Fig. 16

Reproduction of Rezende et al. (2016) and qualitative results. Top row: Samples of input images; bottom row: corresponding reconstructed images. We found that with a single centered object, the method was able to correctly reproduce the shape and orientation. However, when the object was not centered, too complex, or there was a background present, the method failed to estimate the shape

Evaluation of 3D Reconstructions

For evaluating 3D reconstructions, we use the Hausdorff distance (Taha and Hanbury 2015) as a measure of similarity between two shapes as in Niu et al. (2018). Given two point sets, A and B, the Hausdorff distance is,

$$D_H(A, B) = \max \left\{ D_H^+(A, B),\; D_H^+(B, A)\right\} ,$$

where \(D_H^+\) is the asymmetric (directed) Hausdorff distance, \(D_H^+(A, B) = \max _{a \in A} D(a, B)\), i.e., the largest Euclidean distance \(D(\cdot )\) from a point in A to the set B, with \(D_H^+(B, A)\) defined analogously.
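A direct brute-force implementation of this definition can be sketched as follows (the algorithm of Taha and Hanbury (2015) computes the same exact distance more efficiently):

```python
def hausdorff_distance(A, B):
    """Symmetric Hausdorff distance between two point sets.

    D_H^+(A, B) is the largest distance from any point in A to its
    nearest point in B; the symmetric distance is the max over both
    directions.
    """
    def point_to_set(p, S):
        # Euclidean distance from point p to its nearest neighbour in S
        return min(sum((a - b) ** 2 for a, b in zip(p, q)) for q in S) ** 0.5

    d_ab = max(point_to_set(a, B) for a in A)
    d_ba = max(point_to_set(b, A) for b in B)
    return max(d_ab, d_ba)
```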

Ablation Study on Depth Supervision

As an ablation study, we repeated the experiment that demonstrates the view-extrapolation capabilities of our model, this time with depth supervision. Table 9 shows the quantitative evaluation of reconstructing the scenes from unobserved angles.

3D Intelligence Quotient Task

In their landmark work, Shepard and Metzler (1971) introduced the mental rotation task into the toolkit of cognitive assessment. The authors presented human subjects with reference images and answer images. The subjects had to quickly decide whether the answer was a 3D-rotated version or a mirrored version of the reference. The speed and accuracy with which people solve this mental rotation task has since become a staple of IQ tests like the Woodcock-Johnson tests (Woodcock et al. 2001). We took this as inspiration to design a quantitative evaluation metric (the number of questions answered correctly) as opposed to the default qualitative analyses of generative models. We use the same kind of 3D objects, but instead of confronting our model with pairs of images and only two possible answers, we include several distractor answers, and the subject (human or computer) has to pick which of the three possible answers is the 3D-rotated version of the reference object (see Fig. 2).
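For a model with a view-invariant scene code, one natural way to answer such a question is to pick the candidate whose inferred scene code lies closest to the reference's. The sketch below illustrates this idea only; `encode_scene` is a stand-in for a trained encoder, not our actual implementation:

```python
def answer_iqtt(encode_scene, reference, candidates):
    """Answer a 3D-IQTT question by nearest scene code (illustrative).

    `encode_scene` is assumed to map an image to a view-invariant scene
    code; the rotated copy of the reference shares its geometry, so we
    pick the candidate with the smallest code distance.
    """
    z_ref = encode_scene(reference)

    def dist(z):
        return sum((a - b) ** 2 for a, b in zip(z, z_ref)) ** 0.5

    scores = [dist(encode_scene(c)) for c in candidates]
    return scores.index(min(scores))  # index of the chosen answer
```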

Details on Human Evaluations for 3D IQTT

We posted the questionnaire to our lab-wide mailing list, where 41 participants followed the call. The questionnaire had one calibration question where, if answered incorrectly, we pointed out the correct answer. For all successive answers, we did not give any participant the correct answers and each participant had to answer all 20 questions to complete the quiz.

We also asked participants for their age range, gender, education, and for comments. While many commented that the questions were hard, nobody gave us a clear reason to discard their response. All participants were at least high school graduates currently pursuing a Bachelor’s degree. The majority of submissions \((78\%)\) were male, whereas the others were female or unspecified. Most of our participants \((73.2\%)\) were between 18 and 29 years old, the others between 30 and 39. The resulting test scores are normally distributed according to the Shapiro-Wilk test \((p<0.05)\) and significantly different from random choice according to 1-sample Student’s t test \((p<0.01)\).

Implementation of Rezende et al.

The authors of Rezende et al. (2016) did not publicly release any code and, upon request, did not offer any either. When implementing our own version, we first attempted to reproduce their results, as depicted in Fig. 16a. We then attempted to use the method for the same qualitative reconstruction of the primitive-in-box dataset shown in Fig. 4. We found that this worked only with one main object and no background (see Fig. 16b). When including the background, applying the same method led to degenerate solutions (see Fig. 16c).

Fig. 17

Study of the effect of the mutual-information objective on 3D-IQTT performance. Our model's performance correlates positively with the weight on the mutual-information term

Ablation Study of the Weighted Mutual-Info Loss on 3D-IQTT

Consider the semi-supervised objective used in Algorithm 1. In this section we perform an ablation study of 3D-IQTT performance with a modified form of the equation in which the mutual-information loss \(I_{\Theta }(z_{scene}, z_{view})\) is weighted by a coefficient \(\lambda \). The plot in Fig. 17 indicates the importance of the MI term: having a good estimate of the viewpoint and disentangling the view information from the geometry is indeed crucial to performance on the IQ task.

$$\begin{aligned} L \leftarrow \mathcal {L}_{ALI} + \mathcal {L}_{recon} + \lambda \, I_{\Theta }(z_{scene}, z_{view}) \end{aligned}$$


Rajeswar, S., Mannan, F., Golemo, F. et al. Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images Using a View-Based Representation. Int J Comput Vis 128, 2478–2493 (2020). https://doi.org/10.1007/s11263-020-01322-1



  • Computer vision
  • Differentiable rendering
  • 3D understanding
  • Adversarial training