1 Introduction

Reconstructing 3D objects from 2D images is a long-standing research area in computer vision. While traditional methods rely on multiple images of the same object instance (Seitz et al. 2006; Furukawa and Hernández 2015; Broadhurst et al. 2001; Laurentini 1994; De Bonet and Viola 1999; Gargallo et al. 1999; Liu and Cooper 2010), there has recently been a surge of interest in learning-based methods that can infer 3D structure from a single image, assuming that it shows an object of a class seen during training (e.g. Fan et al. 2017; Choy et al. 2016; Yan et al. 2016; see Sect. 2.1). A related problem to reconstruction is that of generating new 3D shapes from a given object class a priori, i.e. without conditioning on an image. Again, there have recently been several works that apply deep learning techniques to this task (e.g. Wu et al. 2016; Zou et al. 2017; Gadelha et al. 2017; see Sect. 2.2).

Fig. 1

Given only unannotated 2D images as training data, our model learns (1) to reconstruct and predict the pose of 3D meshes from a single test image, and (2) to generate new 3D mesh samples. The generative process (solid arrows) samples a Gaussian embedding, decodes this to a 3D mesh, renders the resulting mesh, and finally adds Gaussian noise. It is trained end-to-end to reconstruct input images (dashed arrows), via an encoder network that learns to predict and disentangle shape, pose, and lighting. The renderer produces lit, shaded RGB images, allowing us to exploit shading cues in the reconstruction loss

Learning-based methods for single-image reconstruction are motivated by the fact that the task is inherently ambiguous: many different shapes project to give the same pixels, for example due to self-occlusion. Hence, we must rely on prior knowledge capturing what shapes are likely to occur. However, most reconstruction methods are trained discriminatively to predict complete shapes from images—they do not represent their prior knowledge about object shapes as an explicit distribution that can generate shapes a priori. In this work, we take a generative approach to reconstruction, where we learn an explicit prior model of 3D shapes, and integrate this with a renderer to model the image formation process. Inference over this joint model allows us to find the most likely 3D shape for a given image.

Most learning-based methods for reconstruction and generation rely on strong supervision. For generation (e.g. Wu et al. 2016; Zou et al. 2017), this means learning from large collections of manually constructed 3D shapes, typically ShapeNet (Chang et al. 2015) or ModelNet (Wu et al. 2015). For reconstruction (e.g. Choy et al. 2016; Fan et al. 2017; Richter and Roth 2018), it means learning from images paired with aligned 3D meshes, which is very expensive supervision to obtain (Yang et al. 2018). While a few methods do not rely on 3D ground-truth, they still require keypoint annotations on the 2D training images (Vicente et al. 2014; Kar et al. 2015; Kanazawa et al. 2018), and/or multiple views for each object instance, often with pose annotations (Yan et al. 2016; Wiles and Zisserman 2017; Kato et al. 2018; Tulsiani et al. 2018; Insafutdinov and Dosovitskiy 2018). In this paper, we consider the more challenging setting where we only have access to unannotated 2D images for training, without ground-truth pose, keypoints, or 3D shape, and with a single view per object instance.

It is well known that shading provides an important cue for 3D understanding (Horn 1975). It allows determination of surface orientations, if the lighting and material characteristics are known; this has been explored in numerous works on shape-from-shading over the years (Horn 1975; Zhang et al. 1999; Barron and Malik 2015). Unlike learning-based approaches, these methods can only reconstruct non-occluded parts of an object, and achieving good results requires strong priors (Barron and Malik 2015). Conversely, existing learning-based generation and reconstruction methods can reason over occluded or visually-ambiguous areas, but do not leverage shading information in their loss. Furthermore, the majority use voxel grids or point clouds as an output representation. Voxels are easy to work with, but cannot explicitly model non-axis-aligned surfaces, while point clouds do not represent surfaces explicitly at all. In both cases, this limits the usefulness of shading cues. To exploit shading information in a learning-based approach, we therefore need to move to a different representation; a natural choice is 3D meshes. Meshes are ubiquitous in computer graphics, and have desirable properties for our task: they can represent surfaces of arbitrary orientation and dimensions at fixed cost, and are able to capture fine details. Thus, they avoid the visually displeasing ‘blocky’ reconstructions that result from voxels. We also go beyond monochromatic light, considering the case of coloured directional lighting; this provides even stronger shading cues when combined with arbitrarily-oriented mesh surfaces. Moreover, our model explicitly reasons over the lighting parameters, jointly with the object shape, allowing it to exploit shading information even in cases where the lighting parameters are unknown—which classical shape-from-shading methods cannot.

In this paper, we present a unified framework for both reconstruction and generation of 3D shapes, which is trained to model 3D meshes using only 2D supervision (Fig. 1). Our framework is very general, and can be trained in similar settings to existing models (Tulsiani et al. 2017b, 2018; Yan et al. 2016; Wiles and Zisserman 2017), while also supporting weaker supervision scenarios. It allows:

  • Use of different mesh parameterisations, which lets us incorporate useful modelling priors such as smoothness or composition from primitives.

  • Exploitation of shading cues due to monochromatic or coloured directional lighting (Fig. 2), letting us discover concave structures that silhouette-based methods cannot (Gadelha et al. 2017; Tulsiani et al. 2017b, 2018; Yan et al. 2016; Soltani et al. 2017).

  • Training with varying degrees of supervision: single or multiple views per instance, with or without ground-truth pose annotations.

Fig. 2

Lighting: coloured directional lighting (a) provides strong cues for surface orientation; white light (b) provides less information; silhouettes (c) provide none at all. Our model is able to exploit the shading information from coloured or white lighting

To achieve this, we design a probabilistic generative model that captures the full image formation process, whereby the shape of a 3D mesh, its pose, and incident lighting are first sampled independently, then a 2D rendering is produced from these (Sect. 3). We use stochastic gradient variational Bayes for training (Kingma and Welling 2014; Rezende et al. 2014) (Sect. 4). This involves learning an inference network that can predict 3D shape, pose and lighting from a single image, with the shape placed in a canonical frame of reference, i.e. disentangled from the pose. Together, the model plus its inference network resemble a variational autoencoder (Kingma and Welling 2014) on pixels. It represents 3D shapes in a compact latent embedding space, and has extra layers in the decoder corresponding to the mesh representation and renderer. As we do not provide 3D supervision, the encoder and decoder must bootstrap and guide one another during training. The decoder learns the manifold of shapes, while at the same time the encoder learns to map images onto this. This learning process is driven purely by the objective of reconstructing the training images. While this is an ambiguous task and the model cannot guarantee to reconstruct the true shape of an object from a single image, its generative capability means that it always produces a plausible instance of the relevant class; the encoder ensures that this is consistent with the observed image. This works because the generative model must learn to produce shapes that reproject well over all training images, starting from low-dimensional latent representations. This creates an inductive bias towards regularity, which avoids degenerate solutions with unrealistic shapes that could, in isolation, explain each individual training image.

In Sect. 5, we demonstrate our method on 13 diverse object classes. This includes several highly concave classes, which methods relying on silhouettes cannot learn correctly (Yan et al. 2016; Gadelha et al. 2017; Tulsiani et al. 2017b, 2018). We first display samples from the distribution of shapes learnt by our model, showing that (i) the use of meshes yields smoother, more natural samples than those from voxel-based methods (Gadelha et al. 2017), (ii) different mesh parameterisations are better suited to different object classes, and (iii) our samples are diverse and realistic, covering multiple modes of the training distribution. We also demonstrate that our model learns a meaningful latent space, by showing that interpolating between points in it yields realistic intermediate samples. We then quantitatively evaluate performance of our method on single-view reconstruction and pose estimation, showing that: (i) it learns to predict pose, and disentangle it from shape, without either being given as supervision; (ii) exploiting information from shading improves results over using silhouettes in the reconstruction loss, even when the model must learn to estimate the lighting parameters and disentangle them from surface normals; (iii) when using a standard single white light, our model outperforms state-of-the-art 2D-supervised methods (Kato et al. 2018), both with and without pose supervision, thanks to exploiting shading cues; (iv) performance improves further when using multiple coloured lights, even approaching that of state-of-the-art 3D-supervised methods (Fan et al. 2017; Richter and Roth 2018). Finally, we evaluate the impact of design choices such as different mesh parameterisations and latent space dimensionalities, showing which choices work well for different object classes.

A preliminary version of this work appeared as Henderson and Ferrari (2018). That earlier version assumed fixed, known lighting parameters rather than explicitly reasoning over them; moreover, here we present a much more extensive experimental evaluation.

2 Related Work

2.1 Learning Single-Image 3D Reconstruction

In the last 3 years, there has been a surge of interest in single-image 3D reconstruction; this has been enabled both by the growing maturity of deep learning techniques, and by the availability of large datasets of 3D shapes (Chang et al. 2015; Wu et al. 2015). Among such methods, we differentiate between those requiring full 3D supervision (i.e. 3D shapes paired with images), and those that need only weaker 2D supervision (e.g. pose annotations); our work here falls into the second category.

3D-Supervised Methods Choy et al. (2016) apply a CNN to the input image, then pass the resulting features to a 3D deconvolutional network that maps them to occupancies of a \(32^3\) voxel grid. Girdhar et al. (2016) and Wu et al. (2016) proceed similarly, but pre-train a model to auto-encode or generate 3D shapes respectively, and regress images to the latent features of this model. Instead of directly producing voxels, Soltani et al. (2017), Shin et al. (2018) and Richter and Roth (2018) output multiple depth-maps and/or silhouettes, from known (fixed) viewpoints; these are subsequently fused if a voxel reconstruction is required. Fan et al. (2017) and Mandikal et al. (2018) generate point clouds as the output, with networks and losses specialised to their order-invariant structure. Like ours, the concurrent works of Groueix et al. (2018) and Wang et al. (2018) predict meshes, but parameterise them differently to ours. Tulsiani et al. (2017a) and Niu et al. (2018) both learn to map images to sets of cuboidal primitives, of fixed and variable cardinality respectively. Finally, Gwak et al. (2017) and Zhu et al. (2017) present methods with slightly weaker requirements on ground-truth. As in the previous works, they require large numbers of 3D shapes and images; however, these do not need to be paired with each other. Instead, the images are annotated only with silhouettes.

2D-Supervised Methods

A few recent learning-based reconstruction techniques do not rely on 3D ground-truth; these are the closest in spirit to our own. They typically work by passing input images through a CNN, which predicts a 3D representation, which is then rendered to form a reconstructed 2D silhouette; the loss is defined to minimise the difference between the reconstructed and original silhouettes. This reliance on silhouettes means they cannot exploit shading and cannot learn to reconstruct concave object classes—in contrast to our approach. Moreover, all these methods require stronger supervision than our own—they must be trained with ground-truth pose or keypoint annotations, and/or multiple views of each instance presented together during training.

Rezende et al. (2016) briefly discuss single-image reconstruction using a conditional generative model over meshes. This models radial offsets to vertices of a spherical base mesh, conditioning on an input image. The model is trained in a variational framework to maximise the reconstructed pixel likelihood. It is demonstrated only on simple shapes such as cubes and cylinders.

Yan et al. (2016) present a method that takes a single image as input, and yields a voxel reconstruction. This is trained to predict voxels that reproject correctly to the input pixels, assuming the object poses for the training images are known. The voxels are projected by computing a max operation along rays cast from each pixel into the voxel grid, at poses matching the input images. The training objective is then to maximise the intersection-over-union (IOU) between these projected silhouettes and the silhouettes of the original images. Kato et al. (2018) present a very similar method, but using meshes instead of voxels as the output representation. It is again trained using the silhouette IOU as the loss, but also adds a smoothness regularisation term, penalising sharply creased edges. Wiles and Zisserman (2017) propose a method that takes silhouette images as input, and produces rotated silhouettes as output; the input and output poses are provided. To generate the rotated silhouettes, they predict voxels in 3D space, and project them by a max operation along rays.

Tulsiani et al. (2017b) also regress a voxel grid from a single image; however, the values in this voxel grid are treated as occupancy probabilities, which allows use of probabilistic ray termination (Broadhurst et al. 2001) to enforce consistency with a silhouette or depth map. Two concurrent works to ours, Tulsiani et al. (2018) and Insafutdinov and Dosovitskiy (2018), extend this approach to the case where pose is not given at training time. To disentangle shape and pose, they require that multiple views of each object instance be presented together during training; the model is then trained to reconstruct the silhouette for each view using its own predicted pose, but the shape predicted from some other view. Yang et al. (2018) use the same principle to disentangle shape and pose, but assume that a small number of images are annotated with poses, which improves the accuracy significantly.

Vicente et al. (2014) jointly reconstruct thousands of object instances in the PASCAL VOC 2012 dataset using keypoint and silhouette annotations, but without learning a model that can be applied to unseen images. Kar et al. (2015) train a CNN to predict keypoints, pose, and silhouette from an input image, and then optimise the parameters of a deformable model to fit the resulting estimates. Concurrently with our work, Kanazawa et al. (2018) present a method that takes a single image as input, and produces a textured 3D mesh as output. The mesh is parameterised by offsets to the vertices of a learnt mean shape. These three methods all require silhouette and keypoint annotations on the training images, but only a single view of each instance.

Novotny et al. (2017) learn to perform single-image reconstruction using videos as supervision. Classical multi-view stereo methods are used to reconstruct the object instance in each video, and the reconstructions are used as ground-truth to train a regression model mapping images to 3D shapes.

2.2 Generative Models of 3D Shape

The last 3 years have also seen increasing interest in deep generative models of 3D shapes. Again, these must typically be trained using large datasets of 3D shapes, while just one work requires only images (Gadelha et al. 2017).

3D-Supervised Methods

Wu et al. (2015) and Xie et al. (2018) train deep energy-based models on voxel grids; Huang et al. (2015) train one on surface points of 3D shapes, jointly with a decomposition into parts. Wu et al. (2016) and Zhu et al. (2018) present generative adversarial networks (GANs; Goodfellow et al. 2014) that directly model voxels using 3D convolutions; Zhu et al. (2018) also fine-tune theirs using 2D renderings. Rezende et al. (2016) and Balashova et al. (2018) both describe models of voxels, based on the variational autoencoder (VAE; Kingma and Welling 2014). Nash and Williams (2017) and Gadelha et al. (2018) model point clouds, using different VAE-based formulations. Achlioptas et al. (2018) train an autoencoder for dimensionality reduction of point clouds, then a GAN on its embeddings. Li et al. (2017) and Zou et al. (2017) model shapes as assembled from cuboidal primitives; Li et al. (2017) also add detail by modelling voxels within each primitive. Tan et al. (2018) present a VAE over parameters of meshes. Calculating the actual vertex locations from these parameters requires a further energy-based optimisation, separate to their model. Their method is not directly applicable to datasets with varying mesh topology, including ShapeNet and ModelNet.

2D-Supervised Methods

Soltani et al. (2017) train a VAE over groups of silhouettes from a set of known viewpoints; these may be fused to give a true 3D shape as a post-processing stage, separate to the probabilistic model. The only prior work that learns a true generative model of 3D shapes given just 2D images is Gadelha et al. (2017); this is therefore the most similar in spirit to our own. They use a GAN over voxels; these are projected to images by a simple max operation along rays, to give silhouettes. A discriminator network ensures that projections of sampled voxels are indistinguishable from projections of ground-truth data. This method does not require pose annotations, but they restrict poses to a set of just eight predefined viewpoints. In contrast to our work, this method cannot learn concave shapes, due to its reliance on silhouettes. Moreover, like other voxel-based methods, it cannot output smooth, arbitrarily-oriented surfaces. Yang et al. (2018) apply this model as a prior for single-image reconstruction, but they require multiple views per instance during training.

3 Generative Model

Our goal is to build a probabilistic generative model of 3D meshes for a given object class. For this to be trainable with 2D supervision, we cast the entire image-formation process as a directed model (Fig. 1). We assume that the content of an image can be explained by three independent latent components—the shape of the mesh, its pose relative to the camera, and the lighting. These are modelled by three low-dimensional random variables, \(\mathbf {z}\), \(\theta \), and \(\lambda \) respectively. The joint distribution over these and the resulting pixels \(\mathbf {x}\) factorises as \( P(\mathbf {x} ,\, \mathbf {z} ,\, \theta ,\, \lambda ) = P(\mathbf {z}) P(\theta ) P(\lambda ) P(\mathbf {x} \,|\,\mathbf {z} ,\, \theta ,\, \lambda ) \).

Following Gadelha et al. (2017), Yan et al. (2016), Tulsiani et al. (2017b), and Wiles and Zisserman (2017), we assume that the pose \(\theta \) is parameterised by just the azimuth angle, with \(\theta \sim \text {Uniform}(-\pi , \pi )\) (Fig. 4a, bottom). The camera is then placed at fixed distance and elevation relative to the object. We similarly take \(\lambda \) to be a single azimuth angle with uniform distribution, which specifies how a predefined set of directional light sources are to be rotated around the origin (Fig. 4a, top). The number of lights, their colours, elevations, and relative azimuths are kept fixed. We are free to choose these; our experiments include tri-directional coloured lighting, and a single white directional light source plus an ambient component.

Following recent works on deep latent variable models (Kingma and Welling 2014; Goodfellow et al. 2014), we assume that the embedding vector \(\mathbf {z}\) is drawn from a standard isotropic Gaussian, and then transformed by a deterministic decoder network, \(F_{\phi }\), parameterised by weights \(\phi \) which are to be learnt (“Appendix A” details the architecture of this network). This produces the mesh parameters \(\varPi = F_{\phi }(\mathbf {z})\). Intuitively, the decoder network \(F_{\phi }\) transforms and entangles the dimensions of \(\mathbf {z}\) such that all values in the latent space map to plausible values for \(\varPi \), even if these lie on a highly nonlinear manifold. Note that our approach contrasts with previous models that directly output pixels (Kingma and Welling 2014; Goodfellow et al. 2014) or voxels (Wu et al. 2016; Gadelha et al. 2017; Zhu et al. 2018; Balashova et al. 2018) from a decoder network.

We use \(\varPi \) as inputs to a fixed mesh parameterisation function \(M(\varPi )\), which yields vertices \(\mathbf {v}_{\text {object}}\) of triangles defining the shape of the object in 3D space, in a canonical pose (different options for M are described below). The vertices are transformed into camera space according to the pose \(\theta \), by a fixed function T: \(\mathbf {v}_{\text {camera}} = T(\mathbf {v}_{\text {object}},\, \theta )\). They are then rendered into an RGB image \(I_0 = \mathscr {G}(\mathbf {v}_{\text {camera}} ,\, \lambda )\) by a rasteriser \(\mathscr {G}\) using Gouraud shading (Gouraud 1971) and Lambertian surface reflectance (Lambert 1760).
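To make the decoding path from embedding to rendered image concrete, the following illustrative Python/NumPy sketch mirrors Eq. (3). The callables decoder_net, mesh_from_params and render_gouraud stand in for \(F_{\phi }\), M and \(\mathscr {G}\); they, and the fixed camera distance and elevation used below, are placeholder assumptions rather than our released TensorFlow implementation.

```python
import numpy as np

def rotate_and_place(v, theta, distance=2.0, elevation=0.5):
    # T(v_object, theta): rotate about the vertical (y) axis by the azimuth theta,
    # then translate to a fixed camera distance and elevation (values are placeholders)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return v @ rot.T + np.array([0.0, -elevation, -distance])

def sample_image(decoder_net, mesh_from_params, render_gouraud, latent_dim=12):
    # Sample the three independent latent factors (Sect. 3)
    z = np.random.normal(size=latent_dim)           # shape embedding ~ N(0, I)
    theta = np.random.uniform(-np.pi, np.pi)        # object azimuth
    lam = np.random.uniform(-np.pi, np.pi)          # lighting azimuth

    pi = decoder_net(z)                             # mesh parameters, Pi = F_phi(z)
    v_object = mesh_from_params(pi)                 # canonical-frame vertices, M(Pi)
    v_camera = rotate_and_place(v_object, theta)    # camera-space vertices, T(., theta)
    return render_gouraud(v_camera, lam)            # Lambertian Gouraud rendering, G
```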

The final observed pixel values \(\mathbf {x}\) are modelled as independent Gaussian random variables, with means equal to the values in an L-level Gaussian pyramid (Burt and Adelson 1983), whose base level equals \(I_0\), and whose Lth level has smallest dimension equal to one:

$$\begin{aligned}&P_{\phi }(\mathbf {x} \,|\,\mathbf {z},\, \theta ,\, \lambda ) = \prod _l P_{\phi }(\mathbf {x}_l \,|\,\mathbf {z} ,\, \theta ,\, \lambda )\end{aligned}$$
(1)
$$\begin{aligned}&\mathbf {x}_l \sim \text {Normal}\left( I_l,\, \tfrac{\epsilon }{2^l} \right) \end{aligned}$$
(2)
$$\begin{aligned}&I_0 = \mathscr {G}(T(M(F_{\phi }(\mathbf {z})),\, \theta ),\, \lambda ) \end{aligned}$$
(3)
$$\begin{aligned}&I_{l+1} = I_l * k_G \end{aligned}$$
(4)

where l indexes pyramid levels, \(k_G\) is a small Gaussian kernel, \(\epsilon \) is the noise magnitude at the base scale, and \(*\) denotes convolution with stride two. We use a multi-scale pyramid instead of just the raw pixel values to ensure that, during training, there will be gradient forces over long distances in the image, thus avoiding bad local minima where the reconstruction is far from the input.
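The following sketch illustrates the multi-scale likelihood of Eqs. (1)–(4); a simple \(2 \times 2\) average-pool stands in for the strided convolution with \(k_G\), and the value of \(\epsilon \) is an arbitrary placeholder.

```python
import numpy as np

def downsample(img):
    # Stand-in for convolution with the Gaussian kernel k_G at stride two:
    # average 2x2 blocks (cropping to even height and width first)
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def pyramid_log_likelihood(x, rendered, eps=0.1):
    # Sum of independent Gaussian log-densities over pyramid levels,
    # with noise scale eps / 2^l at level l (Eqs. 1-4)
    log_p, level = 0.0, 0
    x_l, i_l = x, rendered
    while True:
        sigma = eps / (2 ** level)
        log_p += np.sum(-0.5 * ((x_l - i_l) / sigma) ** 2
                        - np.log(sigma * np.sqrt(2.0 * np.pi)))
        if min(i_l.shape[:2]) <= 1:     # the top level has smallest dimension one
            break
        x_l, i_l, level = downsample(x_l), downsample(i_l), level + 1
    return log_p
```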

Fig. 3

Mesh parameterisations: ortho-block and full-block (assembly from cuboidal primitives, of fixed or varying orientation) are suited to objects consisting of compact parts (a, b); subdivision (per-vertex deformation of a subdivided cube) is suited to complex continuous surfaces (c)

Mesh Parameterisations After the decoder network has transformed the latent embedding \(\mathbf {z}\) into the mesh parameters \(\varPi \), these are converted to actual 3D vertices using a simple, non-learnt mesh-parameterisation function M. One possible choice for M is the identity function, in which case the decoder network directly outputs vertex locations. However, initial experiments showed that this does not work well: it produces very irregular meshes with large numbers of intersecting triangles. Conversely, using a more sophisticated form for M enforces regularity of the mesh. We use three different parameterisations in our experiments.

In our first parameterisation, \(\varPi \) specifies the locations and scales of a fixed number of axis-aligned cuboidal primitives (Fig. 3a), from which the mesh is assembled (Zou et al. 2017; Tulsiani et al. 2017a). Changing \(\varPi \) can produce configurations with different topologies, depending which blocks touch or overlap, but all surfaces will always be axis-aligned. The scale and location of each primitive are represented by 3D vectors, resulting in a total of six parameters per primitive. In our experiments we call this ortho-block.

Our second parameterisation is strictly more powerful than the first: we still assemble the mesh from cuboidal primitives, but now associate each with a rotation, in addition to its location and scale. Each rotation is parameterised as three Euler angles, yielding a total of nine parameters per primitive. In our experiments we call this full-block (Fig. 3b).

The above parameterisations are naturally suited to objects composed of compact parts, but cannot represent complex continuous surfaces. For these, we define a third parameterisation, subdivision (Fig. 3c). This parameterisation is based on a single cuboid, centred at the origin; the edges and faces of the cuboid are subdivided several times along each axis. Then, \(\varPi \) specifies a list of 3D displacements, one per vertex, which deform the subdivided cube into the required shape. In practice, we subdivide each edge into four segments, resulting in 98 vertices, hence 294 parameters.
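As an illustration of one such function M, the sketch below converts ortho-block parameters (a 3D location and 3D scale per primitive) into the corner vertices of axis-aligned cuboids; the parameter layout is an assumption, and triangle indices are omitted for brevity.

```python
import numpy as np

# Corners of a unit cube centred at the origin
UNIT_CORNERS = np.array([[x, y, z] for x in (-0.5, 0.5)
                                   for y in (-0.5, 0.5)
                                   for z in (-0.5, 0.5)])

def ortho_block_vertices(pi, num_primitives):
    # pi holds six parameters per primitive: (location_xyz, scale_xyz)
    pi = pi.reshape(num_primitives, 6)
    locations, scales = pi[:, :3], pi[:, 3:]
    # Scale, then translate, the unit-cube corners of each primitive
    vertices = UNIT_CORNERS[None] * scales[:, None, :] + locations[:, None, :]
    return vertices.reshape(-1, 3)    # (num_primitives * 8, 3)
```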

4 Variational Training

We wish to learn the parameters of our model from a training set of 2D images of objects of a single class. More precisely, we assume access to a set of images \(\{\mathbf {x}^{(i)}\}\), each showing an object with unknown shape, at an unknown pose, under unknown lighting. Note that we do not require that there are multiple views of each object (in contrast with Yan et al. (2016) and Tulsiani et al. (2018)), nor that the object poses are given as supervision (in contrast with Yan et al. (2016), Tulsiani et al. (2017b), Wiles and Zisserman (2017), and Kato et al. (2018)).

We seek to maximise the marginal log-likelihood of the training set, which is given by \(\sum _i \log P_{\phi }(\mathbf {x}^{(i)})\), with respect to \(\phi \). For each image, we have

$$\begin{aligned}&\log P_{\phi }(\mathbf {x}^{(i)})\nonumber \\&\quad = \log \int _{\mathbf {z},\theta ,\lambda } P_{\phi }(\mathbf {x}^{(i)} \,|\,\mathbf {z}, \theta , \lambda ) P(\mathbf {z})P(\theta ) P(\lambda ) \, \mathrm {d}\mathbf {z} \, \mathrm {d}\theta \, \mathrm {d}\lambda \end{aligned}$$
(5)

Unfortunately this is intractable, due to the integral over the latent variables \(\mathbf {z}\) (shape), \(\theta \) (pose), and \(\lambda \) (lighting). Hence, we use amortised variational inference, in the form of stochastic gradient variational Bayes (Kingma and Welling 2014; Rezende et al. 2014). This introduces an approximate posterior \(Q_{\omega }(\mathbf {z}, \theta , \lambda \,|\,\mathbf {x})\), parameterised by some \(\omega \) that we learn jointly with the model parameters \(\phi \). Intuitively, Q maps an image \(\mathbf {x}\) to a distribution over likely values of the latent variables \(\mathbf {z}\), \(\theta \), and \(\lambda \). Instead of the log-likelihood (5), we then maximise the evidence lower bound (ELBO):

$$\begin{aligned}&\mathop {\mathbb {E}}_{\mathbf {z},\, \theta ,\, \lambda \sim Q_{\omega }(\mathbf {z},\, \theta ,\, \lambda \,|\,\mathbf {x}^{(i)})}\left[ \log P_{\phi }( \mathbf {x}^{(i)} \,|\,\mathbf {z},\, \theta ,\, \lambda ) \right] \nonumber \\&- KL \left[ Q_{\omega }(\mathbf {z},\, \theta ,\, \lambda \,|\,\mathbf {x}^{(i)}) \,\Big |\Big |\, P(\mathbf {z}) P(\theta ) P(\lambda ) \right] \le \log P_{\phi }(\mathbf {x}^{(i)})\nonumber \\ \end{aligned}$$
(6)

This lower-bound on the log-likelihood can be evaluated efficiently, as the necessary expectation is now with respect to Q, for which we are free to choose a tractable form. The expectation can then be approximated using a single sample.
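Restricted to the Gaussian shape embedding alone, a single-sample estimate of (6) can be sketched as follows; encoder_net, decode_and_render and log_likelihood are placeholders, and the pose and lighting terms are handled separately below.

```python
import numpy as np

def elbo_single_sample(x, encoder_net, decode_and_render, log_likelihood):
    # Encoder predicts the mean and log-std of the diagonal Gaussian Q(z | x)
    mu, log_sigma = encoder_net(x)
    sigma = np.exp(log_sigma)

    # Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, I)
    z = mu + sigma * np.random.normal(size=mu.shape)

    # Reconstruction term, approximated with this single sample of z
    recon = log_likelihood(x, decode_and_render(z))

    # KL[ N(mu, sigma^2) || N(0, I) ] in closed form, summed over latent dimensions
    kl = np.sum(0.5 * (mu ** 2 + sigma ** 2 - 1.0) - log_sigma)
    return recon - kl
```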

Fig. 4

(a) We parameterise the object pose relative to the camera by the azimuth angle \(\theta \), and rotate the lights around the object as a group according to a second azimuth angle \(\lambda \). (b) To avoid degenerate solutions, we discretise \(\theta \) into coarse and fine components, with \(\theta _{\mathrm {coarse}}\) categorically distributed over R bins, and \(\theta _{\mathrm {fine}}\) specifying a small offset relative to this. For example, to represent the azimuth indicated by the pink line, \(\theta _{\mathrm {coarse}} = 3\) and \(\theta _{\mathrm {fine}} = -18^\circ \). The encoder network outputs softmax logits \(\varvec{\rho }\) for a categorical variational distribution over \(\theta _{\mathrm {coarse}}\), and the mean \(\xi \) and standard deviation \(\zeta \) of a Gaussian variational distribution over \(\theta _{\mathrm {fine}}\), with \(\xi \) bounded to the range \(( -\pi / R ,\, \pi / R )\)

We let Q be a mean-field approximation, i.e. given by a product of independent variational distributions:

$$\begin{aligned} Q_{\omega }(\mathbf {z}, \theta , \lambda \,|\,\mathbf {x}) = Q_{\omega }(\mathbf {z} \,|\,\mathbf {x}) Q_{\omega }(\theta \,|\,\mathbf {x}) Q_{\omega }(\lambda \,|\,\mathbf {x}) \end{aligned}$$
(7)

The parameters of these distributions are produced by an encoder network, \(\mathrm {enc}_{\omega }(\mathbf {x})\), which takes the image \(\mathbf {x}\) as input. For this encoder network we use a small CNN with architecture similar to Wiles and Zisserman (2017) (see “Appendix A”). We now describe the form of the variational distribution for each of the variables \(\mathbf {z}\), \(\theta \), and \(\lambda \).

Shape For the shape embedding \(\mathbf {z}\), the variational posterior distribution \(Q_{\omega }(\mathbf {z} \,|\,\mathbf {x})\) is a multivariate Gaussian with diagonal covariance. The mean and variance of each latent dimension are produced by the encoder network. When training with multiple views per instance, we apply the encoder network to each image separately, then calculate the final shape embedding \(\mathbf {z}\) by max-pooling each dimension over all views.

Pose For the pose \(\theta \), we could similarly use a Gaussian posterior. However, many objects are roughly symmetric with respect to rotation, and so the true posterior is typically multi-modal. We capture this multi-modality by decomposing the rotation into coarse and fine parts (Mousavian et al. 2017): an integer random variable \(\theta _{\text {coarse}}\) that chooses from \(R_\theta \) rotation bins, and a small Gaussian offset \(\theta _{\text {fine}}\) relative to this (Fig. 4b):

$$\begin{aligned} \theta = -\pi + \theta _{\text {coarse}} \frac{2\pi }{R_\theta } + \theta _{\text {fine}} \end{aligned}$$
(8)

We apply this transformation in both the generative \(P(\theta )\) and variational \(Q_{\omega }(\theta )\), giving

$$\begin{aligned}&P(\theta _{\text {coarse}} = r) = 1/R_\theta \end{aligned}$$
(9)
$$\begin{aligned}&P(\theta _{\text {fine}}) = \text {Normal}(\theta _{\text {fine}} \,|\,0,\, \pi / R_\theta ) \end{aligned}$$
(10)
$$\begin{aligned}&Q_{\omega }\left( \theta _{\text {coarse}} = r \,\Big \vert \,\mathbf {x}^{(i)} \right) = \rho _r^\theta \left( \mathbf {x}^{(i)} \right) \end{aligned}$$
(11)
$$\begin{aligned}&Q_{\omega }(\theta _{\text {fine}}) = \text {Normal}\left( \theta _{\text {fine}} \,\Big \vert \,\xi ^\theta (\mathbf {x}^{(i)}),\, \zeta ^\theta (\mathbf {x}^{(i)}) \right) \end{aligned}$$
(12)

where the variational parameters \(\rho _r^\theta , \xi ^\theta , \zeta ^\theta \) for image \(\mathbf {x}^{(i)}\) are again estimated by the encoder network \(\mathrm {enc}_{\omega }(\mathbf {x}^{(i)})\). Specifically, the encoder uses a softmax output to parameterise \(\varvec{\rho }^\theta \), and restricts \(\xi ^\theta \) to lie in the range \(( -\pi / R_\theta ,\, \pi / R_\theta )\), ensuring that the fine rotation is indeed a small perturbation, so the model must correctly use it in conjunction with \(\theta _\mathrm {coarse}\).
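The sketch below illustrates how raw encoder outputs could be mapped to the variational parameters of (11)–(12), and how the azimuth is recomposed via (8). The tanh and exp mappings, and the default \(R_\theta = 8\), are assumptions; only the softmax for \(\varvec{\rho }^\theta \) is specified above.

```python
import numpy as np

def pose_variational_params(raw_logits, raw_xi, raw_zeta, R_theta=8):
    # Categorical probabilities over coarse rotation bins (Eq. 11), via softmax
    rho = np.exp(raw_logits - raw_logits.max())
    rho = rho / rho.sum()
    # Mean of the fine offset, squashed into (-pi / R_theta, pi / R_theta)
    xi = (np.pi / R_theta) * np.tanh(raw_xi)
    # Positive standard deviation of the fine offset (Eq. 12)
    zeta = np.exp(raw_zeta)
    return rho, xi, zeta

def compose_azimuth(theta_coarse, theta_fine, R_theta=8):
    # Eq. (8): combine the coarse bin index with the small fine offset
    return -np.pi + theta_coarse * 2.0 * np.pi / R_theta + theta_fine
```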

Provided \(R_\theta \) is sufficiently small, we can integrate directly with respect to \(\theta _{\text {coarse}}\) when evaluating (6), i.e. sum over all possible rotations. While this allows our training process to reason over different poses, it is still prone to predicting the same pose \(\theta \) for every image; clearly this does not correspond to the prior on \(\theta \) given by (9)–(12). The model is therefore relying on the shape embedding \(\mathbf {z}\) to model all variability, rather than disentangling shape and pose. The ELBO (6) does include a KL-divergence term that should encourage latent variables to match their prior. However, it does not have a useful effect for \(\theta _{\text {coarse}}\): minimising the KL divergence from a uniform distribution for each sample individually corresponds to independently minimising all the probabilities \(Q_{\omega }(\theta _{\text {coarse}})\), which does not encourage uniformity of the full distribution. The effect we desire is to match the aggregated posterior distribution \(\left\langle Q_{\omega }(\theta \,|\,\mathbf {x}^{(i)}) \right\rangle _i\) to the prior \(P(\theta )\), where \(\langle \,\cdot \, \rangle _i\) is the empirical mean over the training set. As \(\theta _{\text {coarse}}\) follows a categorical distribution in both generative and variational models, we can directly minimise the L1 distance between the aggregated posterior and the prior:

$$\begin{aligned}&\sum _r^{R_\theta } \bigg |\left\langle Q_{\omega }\left( \theta _{\text {coarse}} = r \,|\,\mathbf {x}^{(i)}\right) \right\rangle _i - P\left( \theta _{\text {coarse}} = r\right) \bigg |\nonumber \\&= \sum _r^{R_\theta } \bigg |\left\langle \rho _r^\theta (\mathbf {x}^{(i)}) \right\rangle _i - \frac{1}{R_\theta } \; \bigg |\end{aligned}$$
(13)

We use this term in place of \(\textit{KL}\,\,\Big [Q(\theta _{\text {coarse}} \,|\,\mathbf {x}^{(i)}) \big |\big | P(\theta _{\text {coarse}}) \Big ]\) in our loss, approximating the empirical mean with a single minibatch.
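Estimated on a minibatch of encoder outputs, this prior-matching term reduces to a few lines (an illustrative sketch):

```python
import numpy as np

def coarse_prior_matching(rho_batch):
    # rho_batch: (batch_size, R) categorical probabilities Q(theta_coarse = r | x)
    aggregated = rho_batch.mean(axis=0)     # <Q(theta_coarse = r | x)>_i over the minibatch
    uniform = 1.0 / rho_batch.shape[1]      # P(theta_coarse = r)
    return np.sum(np.abs(aggregated - uniform))   # L1 distance of Eq. (13)
```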

Lighting For the lighting angle \(\lambda \), we perform the same decomposition into coarse and fine components as for \(\theta \), giving new variables \(\lambda _\mathrm {coarse}\) and \(\lambda _\mathrm {fine}\), with \(\lambda _\mathrm {coarse}\) selecting from among \(R_\lambda \) bins. Analogously to pose, \(\lambda _\mathrm {coarse}\) has a categorical variational distribution parameterised by a softmax output \(\rho ^\lambda \) from the encoder, and \(\lambda _\mathrm {fine}\) has a Gaussian variational distribution with parameters \(\xi ^\lambda \) and \(\zeta ^\lambda \). Again, we integrate over \(\lambda _\mathrm {coarse}\), so the training process reasons over many possible lighting angles for each image, increasing the predicted probability of the one giving the best reconstruction. We also regularise the aggregated posterior distribution of \(\lambda _\mathrm {coarse}\) towards a uniform distribution.

Loss Our final loss function for a minibatch \(\mathscr {B}\) is then given by

$$\begin{aligned}&\sum _{r_\theta }^{R_\theta } \; \sum _{r_\lambda }^{R_\lambda } \Bigg \{- \bigg \langle \mathop {\mathbb {E}}_{ \mathbf {z},\, \theta _{\text {fine}},\, \lambda _{\text {fine}} \sim Q_{\omega } }\nonumber \\&\qquad \bigg [\log P_{\phi }\!\left( \mathbf {x}^{(i)} \,\Big \vert \,\mathbf {z},\, \theta _{\text {coarse}} = r_\theta ,\, \theta _{\text {fine}},\, \lambda _{\text {coarse}} = r_\lambda ,\, \lambda _{\text {fine}} \right) \bigg ]\nonumber \\&\qquad \rho ^\theta _{r_\theta } \left( \mathbf {x}^{(i)} \right) \, \rho ^\lambda _{r_\lambda } \left( \mathbf {x}^{(i)} \right) \bigg \rangle _{\!i \in \mathscr {B}} \Bigg \}\nonumber \\&\quad + \alpha \; \sum _r^{R_\theta } \Bigg \{ \bigg |\! \left\langle \rho _r^\theta \left( \mathbf {x}^{(i)} \right) \right\rangle _{\!i \in \mathscr {B}} \! - \frac{1}{R_\theta } \; \bigg |\Bigg \} \nonumber \\&\quad + \, \alpha \; \sum _r^{R_\lambda } \Bigg \{ \bigg |\! \left\langle \rho _r^\lambda \left( \mathbf {x}^{(i)} \right) \right\rangle _{\!i \in \mathscr {B}} \! - \frac{1}{R_\lambda } \; \bigg |\Bigg \} \nonumber \\&\quad +\, \beta \; \bigg \langle KL \left[ Q_{\omega }\left( \mathbf {z}, \theta _{\text {fine}}, \lambda _{\text {fine}} \,\Big \vert \,\mathbf {x}^{(i)} \right) \,\Big |\Big |\, P(\mathbf {z}) P(\theta _{\text {fine}}) P(\lambda _{\text {fine}}) \right] \bigg \rangle _{\!i \in \mathscr {B}}\nonumber \\ \end{aligned}$$
(14)

where \(\beta \) increases the relative weight of the KL term as in Higgins et al. (2017), and \(\alpha \) controls the strength of the prior-matching terms for pose and lighting. We minimise (14) with respect to \(\phi \) and \(\omega \) using ADAM (Kingma and Ba 2015) with gradient clipping, applying the reparameterisation trick to handle the Gaussian random variables (Kingma and Welling 2014; Rezende et al. 2014). Hyperparameters are given in “Appendix B”.
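Schematically, the loss (14) combines the pieces sketched above as follows. This outline only illustrates the bookkeeping: recon_neg_log_lik and kl_fine_and_shape are placeholders that internally draw single reparameterised samples of \(\mathbf {z}\), \(\theta _{\text {fine}}\) and \(\lambda _{\text {fine}}\), prior_match is the function from the previous sketch, and q is assumed to be a simple container of the encoder's variational parameters.

```python
import numpy as np

def minibatch_loss(batch, encoder_net, recon_neg_log_lik, kl_fine_and_shape,
                   prior_match, alpha=1.0, beta=1.0, R_theta=8, R_lambda=8):
    recon_term, kl_term = 0.0, 0.0
    rho_theta_all, rho_lambda_all = [], []
    for x in batch:
        q = encoder_net(x)                    # variational parameters for z, theta, lambda
        rho_theta_all.append(q.rho_theta)     # (R_theta,) categorical probabilities
        rho_lambda_all.append(q.rho_lambda)   # (R_lambda,) categorical probabilities
        # First term of (14): sum over all coarse pose and lighting bins, weighting
        # each reconstruction term by its variational probability
        for r_t in range(R_theta):
            for r_l in range(R_lambda):
                nll = recon_neg_log_lik(x, q, r_t, r_l)   # -log P(x | z, theta, lambda)
                recon_term += q.rho_theta[r_t] * q.rho_lambda[r_l] * nll
        # Final term of (14): KL for the continuous variables z, theta_fine, lambda_fine
        kl_term += kl_fine_and_shape(q)
    recon_term /= len(batch)
    kl_term /= len(batch)
    # Middle terms of (14): aggregated-posterior prior matching for the coarse variables
    prior_term = prior_match(np.stack(rho_theta_all)) + prior_match(np.stack(rho_lambda_all))
    return recon_term + alpha * prior_term + beta * kl_term
```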

Differentiable Rendering Note that optimising (14) by gradient descent requires differentiating through the mesh-rendering operation \(\mathscr {G}\) used to calculate \(P_{\phi }(\mathbf {x} \,|\,\mathbf {z} ,\, \theta ,\, \lambda )\), to find the derivative of the pixels with respect to the vertex locations and colours. While computing exact derivatives of \(\mathscr {G}\) is very expensive, Loper and Black (2014) describe an efficient approximation. We employ a similar technique here, and have made our TensorFlow implementation publicly available.

5 Experiments

We follow recent works (Gadelha et al. 2017; Yan et al. 2016; Tulsiani et al. 2017b, 2018; Fan et al. 2017; Kato et al. 2018; Richter and Roth 2018; Yang et al. 2018) and evaluate our approach using the ShapeNet dataset (Chang et al. 2015). Using synthetic data has two advantages: it allows controlled experiments modifying lighting and other parameters, and it lets us evaluate the reconstruction accuracy using the ground-truth 3D shapes.

We begin by demonstrating that our method successfully learns to generate and reconstruct 13 different object classes (Sect. 5.1). These include the top ten most frequent classes of ShapeNet, plus three others (bathtub, jar, and pot) that we select because they are smooth and concave, meaning that prior methods using voxels and silhouettes cannot learn and represent them faithfully, as shading information is needed to handle them correctly.

We then rigorously evaluate the performance of our model in different settings, focusing on four classes (aeroplane, car, chair, and sofa). The first three are used in Yan et al. (2016), Tulsiani et al. (2017b), Kato et al. (2018), and Tulsiani et al. (2018), while the fourth is a highly concave class that is hard to handle by silhouette-based approaches. We conduct experiments varying the following factors:

  • Mesh parameterisations (Sect. 5.2): We evaluate the three parameterisations described in Sect. 3: ortho-block, full-block, and subdivision.

  • Single white light versus three coloured lights (Sect. 5.3): Unlike previous works using silhouettes (Sect. 2), our method is able to exploit shading in the training images. We test in two settings: (i) illumination by three coloured directional lights (colour, Fig. 2a); and (ii) illumination by one white directional light plus a white ambient component (white, Fig. 2b).

  • Fixed versus varying lighting (Sect. 5.3): The variable \(\lambda \) represents a rotation of all the lights together around the vertical axis (Sect. 3). We conduct experiments in two settings: (i) \(\lambda \) is kept fixed across all training and test images, and is known to the generative model (fixed); and (ii) \(\lambda \) is chosen randomly for each training/test image, and is not provided to the model (varying). In the latter setting, the model must learn to disentangle the effects of lighting angle and surface orientation on the observed shading.

  • Silhouette versus shading in the loss (Sect. 5.3): We typically calculate the reconstruction loss (pixel log-likelihood) over the RGB shaded image (shading), but for comparison with 2D-supervised silhouette-based methods (Sect. 2), we also experiment with using only the silhouette in the loss (silhouette), disregarding differences in shading between the input and reconstructed pixels.

  • Latent space dimensionality (Sect. 5.4): We experiment with different sizes for the latent shape embedding \(\mathbf {z}\), which affects the representational power of our model. We found that 12 dimensions gave good results in initial experiments, and use this value for all experiments apart from Sect. 5.4, where we evaluate its impact.

  • Multiple views (Sect. 5.5): Yan et al. (2016), Wiles and Zisserman (2017), Tulsiani et al. (2018) and Yang et al. (2018) require that multiple views of each instance are presented together in each training batch, and Tulsiani et al. (2017b) also focus on this setting. Our model does not require this, but for comparison we include results with three views per instance at training time, and either one or three at test time.

  • Pose supervision: Most previous works that train for 3D reconstruction with 2D supervision require the ground-truth pose of each training instance (Yan et al. 2016; Wiles and Zisserman 2017; Tulsiani et al. 2017b). While our method does not need this, we evaluate whether it can benefit from it, in each of the settings described above (we report these results in their corresponding sections).

Finally, we compare the performance of our model to several prior and concurrent works on generation and reconstruction, using various degrees of supervision (Sect. 5.6).

Table 1 Reconstruction and pose estimation performance for the ten most-frequent classes in ShapeNet (first ten rows), plus three smooth, concave classes that methods based on voxels and silhouettes cannot handle (last three rows)

Evaluation Metrics We benchmark our reconstruction and pose estimation accuracy on a held-out test set, following the protocol of Yan et al. (2016), where each object is presented at 24 different poses, and statistics are aggregated across objects and poses. We use the following measures:

  • iou: to measure the shape reconstruction error, we calculate the mean intersection-over-union between the predicted and ground-truth shapes. For this we voxelise both meshes at a resolution of \(32^3\). This is the metric used by recent works on reconstruction with 2D supervision (e.g. Yan et al. 2016; Tulsiani et al. 2017b; Kato et al. 2018; Wiles and Zisserman 2017).

  • err: to measure the pose estimation error, we calculate the median error in degrees of predicted rotations.

  • acc: again to evaluate pose estimation, we measure the fraction of instances whose predicted rotation is within \(30^\circ \) of the ground-truth rotation.

Note that the metrics err and acc are used by Tulsiani et al. (2018) to evaluate pose estimation in a similar setting to ours.
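For concreteness, the three metrics can be computed roughly as follows, assuming the predicted and ground-truth meshes have already been voxelised at \(32^3\) (the voxelisation step itself is omitted):

```python
import numpy as np

def mean_iou(pred_voxels, gt_voxels):
    # pred_voxels, gt_voxels: boolean arrays of shape (N, 32, 32, 32)
    inter = np.logical_and(pred_voxels, gt_voxels).sum(axis=(1, 2, 3))
    union = np.logical_or(pred_voxels, gt_voxels).sum(axis=(1, 2, 3))
    return np.mean(inter / union)

def pose_metrics(pred_azimuth, gt_azimuth):
    # Smallest absolute angular difference between predicted and true azimuths, in degrees
    diff = np.abs(pred_azimuth - gt_azimuth) % (2.0 * np.pi)
    errors = np.degrees(np.minimum(diff, 2.0 * np.pi - diff))
    err = np.median(errors)           # median rotation error (err)
    acc = np.mean(errors < 30.0)      # fraction within 30 degrees (acc)
    return err, acc
```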

Training Minibatches

Each ShapeNet mesh is randomly assigned to either the training set (80% of meshes) or the test set. During training, we construct each minibatch by randomly sampling 128 meshes from the relevant class, uniformly with replacement. For each selected mesh, we render a single image, using a pose sampled from \(\text {Uniform}(-\pi ,\,\pi )\) (and also sampling a lighting angle for experiments with varying lighting). Only these images are used to train the model, not the meshes themselves. In experiments using multiple views, we instead sample 64 meshes and three poses per mesh, and correspondingly render three images.
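A sketch of this minibatch construction is given below; render_view stands in for the offline rendering pipeline, and the zero lighting angle used in the fixed-lighting case is an arbitrary placeholder.

```python
import numpy as np

def make_minibatch(train_meshes, render_view, batch_size=128, varying_lighting=False):
    images = []
    for _ in range(batch_size):
        # Sample a mesh uniformly with replacement, and a pose from Uniform(-pi, pi)
        mesh = train_meshes[np.random.randint(len(train_meshes))]
        pose = np.random.uniform(-np.pi, np.pi)
        lighting = np.random.uniform(-np.pi, np.pi) if varying_lighting else 0.0
        images.append(render_view(mesh, pose, lighting))
    # Only the rendered images are used for training, never the meshes themselves
    return np.stack(images)
```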

Fig. 5

Samples from our model for the ten most frequent classes in ShapeNet in order of decreasing frequency, plus three other interesting classes. Note the diversity and realism of our samples, which faithfully capture multimodal shape distributions, e.g. both straight and right-angled sofas, boats with and without sails, and straight- and delta-wing aeroplanes. We successfully learn models for the highly concave classes sofa, bathtub, pot, and jar, enabled by the fact that we exploit shading cues during training. Experimental setting: subdivision, fixed colour lighting, shading loss

5.1 Generating and Reconstructing Diverse Object Classes

We train a separate model for each of the 13 object classes mentioned above, using subdivision parameterisation. Samples generated from these models are shown in Fig. 5. We see that the sampled shapes are realistic, and the models have learnt a prior that encompasses the space of valid shapes for each class. Moreover, the samples are diverse: the models generate various different styles for each class. For example, for sofa, both straight and right-angled (modular) designs are sampled; for aeroplane, both civilian airliners and military (delta-wing) styles are sampled; for pot, square, round, and elongated forms are sampled; and, for vessel, boats both with and without sails are sampled. Note also that our samples incorporate smoothly curved surfaces (e.g. car, jar) and slanted edges (e.g. aeroplane), which voxel-based methods cannot represent (Sect. 5.6 gives a detailed comparison with one such method (Gadelha et al. 2017)).

Reconstruction results are given in Table 1, with qualitative results in Fig. 6. We use fixed colour lighting, shading loss, single-view training, and no pose supervision (columns iou, err, acc); we also report iou when using pose supervision in column \({\textit{iou}} \,|\,\theta \). We see that the highest reconstruction accuracy (iou) is achieved for cars, sofas, and aeroplanes, and the lowest for benches, chairs, and lamps. Providing the ground-truth poses as supervision improves reconstruction performance in all cases (\({\textit{iou}} \,|\,\theta \)). Note that performance for the concave classes sofa, bathtub, pot, and jar is comparable to or higher than that for several non-concave classes, indicating that our model can indeed learn them by exploiting shading cues.

Note that in almost all cases, the reconstructed image is very close to the input (Fig. 6); thus, the model has learnt to reconstruct pixels successfully. Moreover, even when the input is particularly ambiguous due to self-occlusion (e.g. the rightmost car and sofa examples), we see that the model infers a plausible completion of the hidden part of the shape (visible in the third column). However, the subdivision parameterisation limits the amount of detail that can be recovered in some cases, for example the slatted back of the second bench is reconstructed as a continuous surface. Furthermore, flat surfaces are often reconstructed as several faces that are not exactly coplanar, creating small visual artifacts. Finally, the use of a fixed-resolution planar mesh limits the smoothness of curved surfaces, as seen in the jar class.

The low values of the pose estimation error err (and corresponding high values of acc) for most classes indicate that the model has indeed learnt to disentangle pose from shape, without supervision. This is noteworthy given the model has seen only unannotated 2D images with arbitrary poses—disentanglement of these factors presumably arises because it is easier for the model to learn to reconstruct in a canonical reference frame, given that it is encouraged by our loss to predict diverse poses. While the pose estimation appears inaccurate for table, lamp, pot, and jar, note that these classes exhibit rotational symmetry about the vertical axis. Hence, it is not possible to define (nor indeed to learn) a single, unambiguous canonical frame of reference for them.

Fig. 6

Qualitative examples of reconstructions for different object classes. Each group of three images shows (i) ShapeNet ground-truth; (ii) our reconstruction; (iii) reconstruction placed in a canonical pose, with the different viewpoint revealing hidden parts of the shape. Experimental setting: subdivision, single-view training, fixed colour lighting, shading loss.

Fig. 7

Samples for four object classes, using our three different mesh parameterisations. ortho-block and full-block perform well for sofas and reasonably for chairs, but are less well-suited to aeroplanes and cars, which are naturally represented as smooth surfaces. subdivision gives good results for all four object classes

Fig. 8

Qualitative examples of reconstructions, using different mesh parameterisations. Each row of five images shows (i) ShapeNet ground-truth; (ii) our reconstruction with subdivision parameterisation; (iii) reconstruction placed in a canonical pose; (iv) our reconstruction with blocks; (v) canonical-pose reconstruction. Experimental setting: single-view training, fixed colour lighting, shading loss

Table 2 Reconstruction performance for four classes, with three different mesh parameterisations (Sect. 3)
Table 3 Reconstruction performance with different lighting and loss

5.2 Comparing Mesh Parameterisations

We now compare the three mesh parameterisations of Sect. 3, considering the four classes car, chair, aeroplane, and sofa. We show qualitative results for generation (Fig. 7) and reconstruction (Fig. 8); Table 2 gives quantitative results for reconstruction. Again we use fixed colour lighting, shading loss and single-view training.

We see that different parameterisations are better suited to different classes, in line with our expectations. Cars have smoothly curved edges, and are well-approximated by a single simply-connected surface; hence, subdivision performs well. Conversely, ortho-block fails to represent the curved and non-axis-aligned surfaces, in spite of giving relatively high IOU. Chairs vary in topology (e.g. the back may be solid or slatted) and sometimes have non-axis-aligned surfaces, so the flexible full-block parameterisation performs best. Interestingly, subdivision is able to partially reconstruct the holes in the chair backs by deforming the reconstructed surface such that it self-intersects. Aeroplanes have one dominant topology and include non-axis-aligned surfaces; both full-block and subdivision perform well here. However, the former sometimes has small gaps between blocks, failing to reflect the true topology. Sofas often consist of axis-aligned blocks, so the ortho-block parameterisation is expressive enough to model them. We hypothesise that it performs better than the more flexible full-block as it is easier for training to find a good solution in a more restricted representation space. This is effectively a form of regularisation. Overall, the best reconstruction performance is achieved for cars, which accords with Tulsiani et al. (2017b), Yan et al. (2016), and Fan et al. (2017). On average over the four classes, the best parameterisation is subdivision, both with and without pose supervision.

Table 4 Reconstruction performance with fixed and varying lighting

5.3 Lighting

Fixed Lighting Rotation Table 3 shows how reconstruction performance varies with the different choices of lighting, colour and white, using shading loss. Coloured directional lighting provides more information during training than white lighting, and the results are correspondingly better.

Fig. 9

Effect of varying the dimensionality of the latent embedding vector \(\mathbf {z}\) on reconstruction performance (\({\textit{iou}} \,|\,\theta \)). Experimental setting: subdivision, fixed colour lighting, shading loss

Fig. 10

Interpolating between shapes in latent space. In each row, the leftmost and rightmost images show ground-truth shapes from ShapeNet, and the adjacent columns show the result of reconstructing each using our model with subdivision parameterisation. In the centre three columns, we interpolate between the resulting latent embeddings, and display the decoded shapes. In each case, we see a semantically-plausible, gradual deformation of one shape into the other

We also show performance with silhouette loss for coloured light. This considers just the silhouette in the reconstruction loss, instead of the shaded pixels. To implement it, we differentiably binarise both our reconstructed pixels \(I_0\) and the ground-truth pixels \(\mathbf {x}^{(i)}\) prior to calculating the reconstruction loss. Specifically, we transform each pixel p into \(p / (p + \eta )\), where \(\eta \) is a small constant. This performs significantly worse than with shading in the loss, in spite of the input images being identical. Thus, back-propagating information from shading through the renderer does indeed help with learning—it is not merely that colour images contain more information for the encoder network. As in the previous experiment, we see that pose supervision helps the model (column \({\textit{iou}} \,|\,\theta \) versus iou). In particular, only with pose supervision are silhouettes informative enough for the model to learn a canonical frame of reference reliably, as evidenced by the high median rotation errors without (column err).
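This differentiable binarisation is a one-line operation; the value of \(\eta \) below is an arbitrary placeholder.

```python
def soft_binarise(p, eta=1e-2):
    # Maps a non-negative pixel intensity p smoothly towards 1 where p >> eta
    # and towards 0 where p is near zero, while remaining differentiable
    return p / (p + eta)
```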

Table 5 Reconstruction performance with multiple views at train/test time

Varying Lighting Rotation We have shown that shading cues are helpful for training our model. We now evaluate whether it can still learn successfully when the lighting angle varies across training samples (varying). Table 4 shows that our method can indeed reconstruct shapes even in this case. When the object pose is given as supervision (column \({\textit{iou}} \,|\,\theta \)), the reconstruction accuracy is on average only slightly lower than in the case of fixed, known lighting. Thus, the encoder successfully learns to disentangle the lighting angle from the surface normal orientation, while still exploiting the shading information to aid reconstruction. When the object pose is not given as supervision (column iou), the model must learn to simultaneously disentangle shape, pose and lighting. Interestingly, even in this extremely hard setting our method still manages to produce good reconstructions, although of course the accuracy is usually lower than with fixed lighting. Finally, note that our results with varying lighting are better than those with fixed lighting from the final row of Table 3, using only the silhouette in the reconstruction loss. This demonstrates that even when the model does not have access to the lighting parameters, it still learns to benefit from shading cues, rather than simply using the silhouette.

Table 6 Reconstruction performance (\({\textit{iou}} \,|\,\theta \)) in a setting matching Yan et al. (2016), Tulsiani et al. (2017b), Kato et al. (2018), and Yang et al. (2018), which are silhouette-based methods trained with pose supervision and multiple views (to be precise, Yang et al. (2018) provide pose annotations for 50% of all training images)
Table 7 Comparison of our method with the concurrent work MVC (Tulsiani et al. 2018) in different settings, on the three classes for which they report results

5.4 Latent Space Structure

The shape of a specific object instance must be entirely captured by the latent embedding vector \(\mathbf {z}\). On the one hand, using a higher dimensionality for \(\mathbf {z}\) should result in better reconstructions, due to the greater representational power. On the other hand, a lower dimensionality makes it easier for the model to learn to map any point in \(\mathbf {z}\) to a reasonable shape, and to avoid over-fitting the training set. To evaluate this trade-off, we ran experiments with different dimensionalities for \(\mathbf {z}\) (Fig. 9). We see that for all classes, increasing from 6 to 12 dimensions improves reconstruction performance on the test set. Beyond 12 dimensions, the effect differs between classes. For car and chair, higher dimensionalities yield lower performance (indicating over-fitting or other training difficulties). Instead, aeroplane and sofa continue to benefit from higher and higher dimensionalities, up to 48 for aeroplane and 64 (and maybe beyond) for sofa.

Fig. 11

Samples from the voxel-based method of Gadelha et al. (2017) (odd rows), shown above stylistically-similar samples from our model (even rows). Both methods are trained with a single view per instance, and without pose annotations. However, our model outputs meshes, and uses shading in the loss; hence, it can represent smooth surfaces and learn concave classes such as vase

For all our other experiments, we use a 12-dimensional embedding, as this gives good performance on average across classes. Note that our embedding dimensionality is much smaller than its counterpart in other works. For example, Tulsiani et al. (2017b) have a bottleneck layer with dimensionality 100, while Wiles and Zisserman (2017) use dimensionality 160. The low dimensionality of our embeddings makes it easier for the encoder to map images to a compact region of the embedding space centred at the origin; this in turn allows the embeddings to be modelled by a simple Gaussian from which samples can be drawn.
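
As a minimal sketch of this sampling step, the NumPy snippet below fits a single Gaussian to the encoder embeddings of the training images and draws new latent codes from it. The choice of a full covariance (rather than, say, a diagonal or unit covariance), the 12-dimensional shape, and the function names are illustrative assumptions, not a specification of our exact procedure.

```python
import numpy as np

def fit_latent_gaussian(embeddings: np.ndarray):
    # embeddings: array of shape (N, 12) holding encoder outputs for the
    # training images; fit a single full-covariance Gaussian to them.
    mu = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    return mu, cov

def sample_latents(mu: np.ndarray, cov: np.ndarray, n: int, seed: int = 0) -> np.ndarray:
    # Draw n new embeddings; each row can then be decoded to a mesh.
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mu, cov, size=n)
```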

Interpolating in the Latent Space To demonstrate that our models have learnt a well-behaved manifold of shapes for each class, we select pairs of ground-truth shapes, reconstruct these using our model, and linearly interpolate between their latent embeddings (Fig. 10). We see that the resulting intermediate shapes give a gradual, smooth deformation of one shape into the other, showing that all regions of latent space that we traverse correspond to realistic samples.
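
The interpolation itself is a simple linear blend in the embedding space; the sketch below illustrates it, with the number of steps and the function name chosen arbitrarily and the decoding of each interpolated embedding to a mesh omitted.

```python
import numpy as np

def interpolate_latents(z_a: np.ndarray, z_b: np.ndarray, steps: int = 8) -> np.ndarray:
    # Returns a (steps, dim) array of embeddings lying on the straight line from
    # z_a to z_b; decoding each row with the mesh decoder (not shown) yields the
    # intermediate shapes.
    alphas = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - alphas) * z_a[None, :] + alphas * z_b[None, :]
```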

5.5 Multi-view Training/Testing

Table 5 shows results when we provide multiple views of each object instance to the model, either at training time only, or during both training and testing. In both cases, this improves results over using just a single view—the model has learnt to exploit the additional information about each instance. Note that when training with three views but testing with one, the network has not been optimised for the single-view task; however, the additional information present during training means it has learnt a stronger model of valid shapes, and this knowledge transfers to the test-time scenario of reconstruction from a single image.

5.6 Comparison to Previous and Concurrent Works

Generation Figure 11 compares samples from our model to samples from that of Gadelha et al. (2017), on the four object classes we have in common. This is the only prior work that trains a 3D generative model using only single views of instances, and without pose supervision. Note however that unlike us, all images in the training set of Gadelha et al. (2017) are taken from one of a fixed set of eight poses, making their task a little easier. We manually selected samples from our model that are stylistically similar to those shown in Gadelha et al. (2017) to allow side-by-side comparison. We see that in all cases, generating meshes tends to give cleaner, more visually-pleasing samples than their use of voxels. For chair, our model is able to capture the very narrow legs; for aeroplane, it captures the diagonal edges of the wings; for car and vase, it captures the smoothly curved edges. Note that as shown in Fig. 5, our model also successfully learns models for concave classes such as bathtub and sofa—which is impossible for Gadelha et al. (2017) as they do not consider shading.

Reconstruction Table 6 compares our results with previous and concurrent 2D-supervised methods that use ground-truth object pose at training time. We consider works that appeared in 2018 to be concurrent to ours (Henderson and Ferrari 2018). Here, we conduct experiments in a setting matching Yan et al. (2016), Tulsiani et al. (2017b), Kato et al. (2018), and Yang et al. (2018): multiple views at training time, with ground-truth pose supervision [given for 50% of images in Yang et al. (2018)].

Even when using only silhouettes during training, our results are about as good as the best of the works we compare to, that of Kato et al. (2018), which is a concurrent work. Our results are somewhat worse than theirs for aeroplanes and chairs, better for cars, and identical for sofas. On average over the four classes, we reach the same iou of 62.5%. When we add shading information to the loss, our results show a significant improvement. Importantly, Yan et al. (2016), Tulsiani et al. (2017b) and Yang et al. (2018) cannot exploit shading, as they are based on voxels. Coloured lighting helps all classes even further, leading to a final performance higher than all other methods on car and sofa, and comparable to the best other method on chair and aeroplane (Kato et al. 2018). On average we reach 66.8% iou, compared to 62.5% for Kato et al. (2018).

We also show results for Yan et al. (2016) using our coloured lighting images as input, but their silhouette loss. This performs worse than our method on the same images, again showing that incorporating shading in the loss is useful: our colour images are not simply more informative to the encoder network than those of Yan et al. (2016). Interestingly, when trained with shading or colour, our method outperforms Tulsiani et al. (2017b) even when the latter is trained with depth information. When trained with colour, our results (average 66.8% iou) are even close to those of Fan et al. (2017) (67.0%) and Richter and Roth (2018) (68.2%), which are state-of-the-art methods trained with full 3D supervision.

Table 7 compares our results with those of Tulsiani et al. (2018). This is a concurrent work similar in spirit to our own, which learns reconstruction and pose estimation without 3D supervision or pose annotations, but requires multiple views of each instance to be presented together during training. We match their experimental setting by training our models on three views per instance; however, they vary elevation as well as azimuth during training, making their task a little harder. We see that the ability of our model to exploit shading cues enables it to significantly outperform Tulsiani et al. (2018), which relies on silhouettes in its loss. This is shown by iou and \({\textit{iou}} \,|\,\theta \) being higher for our method with white light and shading loss, than for theirs with white light and silhouette. Indeed, our method outperforms theirs even when they use depth information as supervision. When we use colour lighting, our performance is even higher, due to the stronger information about surface normals. Conversely, when our method is restricted to silhouettes, it performs significantly worse than theirs across all three object classes.

6 Conclusion

We have presented a framework for generation and reconstruction of 3D meshes. Our approach is flexible and supports many different supervision settings, including weaker supervision than any prior work (i.e. a single view per training instance, and without pose annotations). When pose supervision is not provided, it automatically learns to disentangle the effects of shape and pose on the final image. When the lighting is unknown, it also learns to disentangle the effects of lighting and surface orientation on the shaded pixels. We have shown that exploiting shading cues leads to higher performance than state-of-the-art methods based on silhouettes (Kato et al. 2018). It also allows our model to learn concave classes, unlike these prior works. Moreover, our performance is higher than that of methods with depth supervision (Tulsiani et al. 2017b, 2018), and even close to the state-of-the-art results using full 3D supervision (Fan et al. 2017; Richter and Roth 2018). Finally, ours is the first method that can learn a generative model of 3D meshes, trained with only 2D images. We have shown that use of meshes leads to more visually-pleasing results than prior voxel-based works (Gadelha et al. 2017).

Limitations Our method is trained to ensure that the rendered reconstructions match the original images. Such an approach is inherently limited by the requirement that images rendered by the generative model must resemble the input images to be reconstructed, in terms of the L2 distance on pixels. Thus, in order to operate successfully on natural images, the model would need to be extended to incorporate more realistic materials and lighting.

Our use of different mesh parameterisations gives flexibility to model different classes faithfully. We have shown that the subdivision parameterisation gives reasonable results for all classes; however, other parameterisations work better for particular classes. Hence, for best results on a given class, a suitable parameterisation must be selected by the user.

Finally, we note that when multiple views but only silhouettes are available as input, discriminative methods specialised for this task (Kato et al. 2018; Tulsiani et al. 2018) outperform our approach.