
Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images Using a View-Based Representation


Abstract

We infer and generate three-dimensional (3D) scene information from a single input image, without supervision. This problem is under-explored, with most prior work relying on supervision from, e.g., 3D ground truth, multiple images of a scene, image silhouettes, or key-points. We propose Pix2Shape, an approach that solves this problem with four components: (i) an encoder that infers the latent 3D representation from an image; (ii) a decoder that generates an explicit 2.5D surfel-based reconstruction of the scene from the latent code; (iii) a differentiable renderer that synthesizes a 2D image from the surfel representation; and (iv) a critic network trained to discriminate between images generated by the decoder-renderer and those from a training distribution. Pix2Shape can generate complex 3D scenes that scale with the view-dependent on-screen resolution, unlike representations that capture world-space resolution, such as voxels or meshes. We show that Pix2Shape learns a consistent scene representation in its encoded latent space, and that the decoder can then be applied to this latent representation to synthesize the scene from a novel viewpoint. We evaluate Pix2Shape with experiments on the ShapeNet dataset as well as on a novel benchmark we developed, called 3D-IQTT, to evaluate models based on their ability to enable 3D spatial reasoning. Qualitative and quantitative evaluations demonstrate Pix2Shape's ability to solve scene reconstruction, generation, and understanding tasks.



References

  • Arjovsky, M., Chintala, S., & Bottou, L. (2017). Wasserstein generative adversarial networks. In International conference on machine learning (ICML).

  • Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Hjelm, D., & Courville, A. (2018). Mutual information neural estimation. In International conference on machine learning (ICML).

  • Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., & Beijbom, O. (2019). nuScenes: A multimodal dataset for autonomous driving. arXiv:1903.11027.

  • Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., Xiao, J., Yi, L., & Yu, F. (2015). ShapeNet: An information-rich 3D model repository.

  • Chaudhuri, S., Kalogerakis, E., Guibas, L., & Koltun, V. (2011). Probabilistic reasoning for assembly-based 3D modeling. In ACM SIGGRAPH.

  • Chen, W., Gao, J., Ling, H., Smith, E. J., Lehtinen, J., Jacobson, A., & Fidler, S. (2019). Learning to predict 3d objects with an interpolation-based differentiable renderer. CoRR abs/1908.01210

  • Choy, C., Xu, D., Gwak, J., Chen, K., & Savarese, S. (2016). 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction.

  • Donahue, J., Krähenbühl, P., & Darrell, T. (2016). Adversarial feature learning. arXiv:1605.09782.

  • Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., & Courville, A. (2016). Adversarially learned inference. arXiv:1606.00704.

  • Gadelha, M., Maji, S., & Wang, R. (2016). 3D shape induction from 2D views of multiple objects. CoRR abs/1612.05872.

  • Girdhar, R., Fouhey, D., Rodriguez, M., & Gupta, A. (2016). Learning a predictable and generative vector representation for objects. In European conference of computer vision (ECCV).

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014), Generative adversarial nets. In Advances in neural information processing systems (NIPS).

  • Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., & Courville, A. C. (2017). Improved training of Wasserstein GANs. In Advances in neural information processing systems (NIPS).

  • Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant mapping. In Computer vision and pattern recognition (CVPR).

  • Hausdorff, F. (1949). Grundzüge der Mengenlehre. New York: Chelsea Pub. Co.


  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Computer vision and pattern recognition (CVPR).

  • Henderson, P., & Ferrari, V. (2018). Learning to generate and reconstruct 3D meshes with only 2D supervision. CoRR abs/1807.09259.

  • Huang, J., Zhou, Y., Funkhouser, T. A., & Guibas, L. J. (2019). Framenet: Learning local canonical frames of 3D surfaces from a single RGB image. CoRR abs/1903.12305.

  • Insafutdinov, E., & Dosovitskiy, A. (2018). Unsupervised learning of shape and pose with differentiable point clouds. CoRR abs/1810.09381.

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML).

  • Jiang, C. M., Wang, D., Huang, J., Marcus, P., & Nießner, M. (2019). Convolutional neural networks on non-uniform geometrical signals using euclidean spectral transformation. CoRR abs/1901.02070.

  • Kajiya, J. T. (1986). The rendering equation. In Annual conference on computer graphics and interactive techniques (SIGGRAPH).

  • Kalogerakis, E., Chaudhuri, S., Koller, D., & Koltun, V. (2012). A probabilistic model for component-based shape synthesis. ACM Transactions on Graphics, 31(4), 55-1–55-11.


  • Kanazawa, A., Tulsiani, S., Efros, A. A., & Malik, J. (2018), Learning category-specific mesh reconstruction from image collections. In European conference on computer Vision (ECCV).

  • Kar, A., Tulsiani, S., Carreira, J., & Malik, J. (2014). Category-specific object reconstruction from a single image. CoRR abs/1411.6069.

  • Kato, H., & Harada, T. (2018). Learning view priors for single-view 3D reconstruction. CoRR abs/1811.10719.

  • Kato, H., Ushiku, Y., & Harada, T. (2017). Neural 3D mesh renderer. CoRR abs/1711.07566.

  • Kobbelt, L., & Botsch, M. (2004). A survey of point-based techniques in computer graphics. Computers and Graphics, 28(6), 801–814.


  • Koch, G., Zemel, R., & Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop.

  • Kulkarni, T. D., Whitney, W., Kohli, P., & Tenenbaum, J. B. (2015). Deep convolutional inverse graphics network. In Advances in neural information processing systems (NIPS).

  • Li, C., Liu, H., Chen, C., Pu, Y., Chen, L., Henao, R., & Carin, L. (2017), Alice: Towards understanding adversarial learning for joint distribution matching. In Advances in neural information processing systems (NIPS).

  • Li, Z., Dekel, T., Cole, F., Tucker, R., Snavely, N., Liu, C., & Freeman, W. T. (2019). Learning the depths of moving people by watching frozen people. CoRR abs/1904.11111.

  • Liu, S., Chen, W., Li, T., & Li, H. (2019), Soft rasterizer: Differentiable rendering for unsupervised single-view mesh reconstruction. CoRR abs/1901.05567.

  • Loper, M. M., & Black, M. J. (2014). OpenDR: An approximate differentiable renderer. In European conference on computer vision (ECCV).

  • Mikolov, T., Deoras, A., Kombrink, S., Burget, L., & Cernockỳ, J. (2011), Empirical evaluation and combination of advanced language modeling techniques. In INTERSPEECH.

  • Mirza, M., & Osindero, S. (2014). Conditional generative adversarial nets.

  • Nguyen-Phuoc, T., Li, C., Theis, L., Richardt, C., & Yang, Y. L. (2019). HoloGAN: Unsupervised learning of 3D representations from natural images.

  • Niu, C., Li, J., & Xu, K. (2018). Im2Struct: Recovering 3D shape structure from a single RGB image. In Computer vision and pattern recognition (CVPR).

  • Novotný, D., Larlus, D., & Vedaldi, A. (2017). Learning 3D object categories by looking around them. CoRR abs/1705.03951.

  • Novotný, D., Ravi, N., Graham, B., Neverova, N., & Vedaldi, A. (2019). C3dpo: Canonical 3D pose networks for non-rigid structure from motion. arXiv:1909.02533.

  • Perez, E., Strub, F., De Vries, H., Dumoulin, V., & Courville, A. (2017). Film: Visual reasoning with a general conditioning layer. arXiv:1709.07871.

  • Pfister, H., Zwicker, M., van Baar, J., & Gross, M. (2000), Surfels: Surface elements as rendering primitives. In Annual conference on computer graphics and interactive techniques.

  • Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. In International conference on learning representations (ICLR).

  • Rezende, D. J., Eslami, S. M. A., Mohamed, S., Battaglia, P., Jaderberg, M., & Heess, N. (2016). Unsupervised learning of 3D structure from images. In Advances in neural information processing systems (NIPS).

  • Saxena, A., Sun, M., & Ng, A. Y. (2009). Make3D: Learning 3D scene structure from a single still image. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 31(5).

  • Shepard, R. N., & Metzler, J. (1971). Mental rotation of three-dimensional objects. Science, 171(3972), 701–703.


  • Soltani, A. A., Huang, H., Wu, J., Kulkarni, T. D., & Tenenbaum, J. B. (2017). Synthesizing 3D shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In Computer vision and pattern recognition (CVPR)

  • Taha, A. A., & Hanbury, A. (2015). An efficient algorithm for calculating the exact Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI).

  • Tulsiani, S., Su, H., Guibas, L. J., Efros, A. A., & Malik, J. (2016), Learning shape abstractions by assembling volumetric primitives. CoRR abs/1612.00404.

  • Tulsiani, S., Zhou, T., Efros, A. A., & Malik, J. (2017). Multi-view supervision for single-view reconstruction via differentiable ray consistency. CoRR abs/1704.06254.

  • Wiles, O., & Zisserman, A. (2017). SilNet: Single- and multi-view reconstruction by learning from silhouettes. CoRR abs/1711.07888.

  • Woodcock, R., Mather, N., & McGrew, K. (2001). Woodcock-Johnson III tests of cognitive abilities. Riverside Publishing.

  • Wu, J., Xue, T., Lim, J., Tian, Y., Tenenbaum, J., Torralba, A., & Freeman, W. (2016a). Single image 3D interpreter network.

  • Wu, J., Zhang, C., Xue, T., Freeman, W. T., Tenenbaum, J. B. (2016b). Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in neural information processing systems (NIPS).

  • Wu, J., Wang, Y., Xue, T., Sun, X., Freeman, W. T., & Tenenbaum, J. B. (2017). Marrnet: 3D shape reconstruction via 2.5D sketches. CoRR abs/1711.03129.

  • Wu, Z., Song, S., Khosla, A., Yu, F., Zhang, L., Tang, X., & Xiao, J. (2015). 3D ShapeNets: A deep representation for volumetric shapes. In Computer vision and pattern recognition (CVPR).

  • Yan, X., Yang, J., Yumer, E., Guo, Y., & Lee, H. (2016), Perspective transformer nets: Learning single-view 3D object reconstruction without 3D supervision. In Advances in neural information processing systems (NIPS).

  • Zhang, X., Zhang, Z., Zhang, C., Tenenbaum, J., Freeman, B., & Wu, J. (2018). Learning to reconstruct shapes from unseen classes. In Advances in neural information processing systems (NIPS).

  • Zhu, J. Y., Zhang, Z., Zhang, C., Wu, J., Torralba, A., Tenenbaum, J. B., & Freeman, W. T. (2018). Visual object networks: Image generation with disentangled 3D representations. In Advances in neural information processing systems (NeurIPS).


Author information

Corresponding author

Correspondence to Sai Rajeswar.

Additional information

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.


Appendices

Appendix

Rendering Details

The color of a surfel depends on the material reflectance, its position and orientation, as well as the ambient and point light source colors (See Fig. 11b). Given a surface point \(P_i\), the color of its corresponding pixel \(I_{rc}\) is given by the shading equation:

$$\begin{aligned} I_{rc} = \rho _i \left( L_a + \sum _j \frac{L_j}{k_l \Vert d_{ij}\Vert + k_q \Vert d_{ij}\Vert ^2} \, \max \left( 0, \frac{N_i^T d_{ij}}{\Vert d_{ij}\Vert } \right) \right) , \end{aligned}$$
(5)

where \(\rho _i\) is the surface reflectance, \(L_a\) is the ambient light’s color, \(L_j\) is the jth positional light source’s color, \(d_{ij} = L_{j}^{{\text{ pos }}} - P_i\) is the direction vector from the scene point to the point light source, and \(k_l\) and \(k_q\) are the linear and quadratic attenuation terms, respectively. Equation (5) is an approximation of the rendering equation proposed by Kajiya (1986).
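To make the shading step concrete, the following is a minimal NumPy sketch of Eq. (5) for a single surfel. The function name, attenuation defaults, and the light representation are illustrative assumptions rather than the renderer's actual implementation.

```python
import numpy as np

def shade_surfel(rho, P, N, L_a, lights, k_l=1.0, k_q=1.0):
    """Evaluate the surfel shading of Eq. (5) for a single surfel.

    rho      : (3,) diffuse reflectance of the surfel
    P, N     : (3,) surfel position and unit normal
    L_a      : (3,) ambient light color
    lights   : list of (position (3,), color (3,)) point lights
    k_l, k_q : linear and quadratic attenuation coefficients (illustrative defaults)
    """
    radiance = np.asarray(L_a, dtype=float).copy()
    for L_pos, L_col in lights:
        d = L_pos - P                                # direction from surfel to light
        dist = np.linalg.norm(d)
        atten = 1.0 / (k_l * dist + k_q * dist ** 2)
        cos_theta = max(0.0, float(N @ d) / dist)    # clamped Lambertian term
        radiance += atten * cos_theta * np.asarray(L_col, dtype=float)
    return rho * radiance                            # pixel color I_rc
```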

Fig. 11

Differentiable 3D renderer. a A surfel is defined by its position P, normal N, and reflectance \(\rho \). Each surfel maps to an image pixel \(P_{im}\). b The surfel’s color depends on its reflectance \(\rho \) and the angle \(\theta \) between each incoming light direction and the surfel’s normal N

Architecture

Pix2Shape is composed of an encoder network (See Table 5), a decoder network (See Table 6), and a critic network (See Table 7). The decoder architecture is similar to the generator in DCGAN (Radford et al. 2015), but with LeakyReLU (Mikolov et al. 2011) as the activation function and batch normalization (Ioffe and Szegedy 2015). We also adjusted its depth and width to accommodate the higher-resolution images. To condition the decoder on the camera position in addition to the latent variable z, we use conditional normalization in alternate layers of the decoder. We train our model for 60K iterations with a batch size of 6 on images of resolution \(128 \times 128 \times 3\).
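As a rough illustration of how camera-conditioned normalization can be wired into a DCGAN-style decoder block, here is a PyTorch sketch. The layer sizes, LeakyReLU slope, and FiLM-style conditioning are assumptions for illustration and do not reproduce the exact architecture in Tables 5, 6, and 7.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    """FiLM-style conditional normalization: scale/shift predicted from a conditioning vector."""
    def __init__(self, num_features, cond_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)
        self.gamma = nn.Linear(cond_dim, num_features)
        self.beta = nn.Linear(cond_dim, num_features)

    def forward(self, x, cond):
        h = self.bn(x)
        g = self.gamma(cond).unsqueeze(-1).unsqueeze(-1)   # per-channel scale
        b = self.beta(cond).unsqueeze(-1).unsqueeze(-1)    # per-channel shift
        return (1 + g) * h + b

class DecoderBlock(nn.Module):
    """One upsampling block of a DCGAN-style decoder with LeakyReLU and conditional norm."""
    def __init__(self, in_ch, out_ch, cond_dim):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        self.cbn = ConditionalBatchNorm2d(out_ch, cond_dim)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x, camera_pose):
        return self.act(self.cbn(self.deconv(x), camera_pose))
```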

Table 5 Pix2Shape encoder architecture
Table 6 Pix2Shape decoder architecture
Table 7 Pix2Shape critic architecture

Architecture for Semi-supervised experiments

The Pixel2Surfels architecture remains similar to the previous one, but with higher-capacity decoder and critic networks. The most important difference is that, for these experiments, we do not condition the networks on the camera pose, to ensure a fair comparison with the baselines. In addition to the three networks, we have a statistics network (see Table 8) that estimates and minimizes the mutual information between the two sets of dimensions in the latent code using MINE (Belghazi et al. 2018). Of the 128 dimensions of z, the first 118 represent scene-based information and the remaining 10 encode view-based information.

Table 8 Pix2Shape statistics network architecture
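To illustrate this disentanglement objective, here is a hypothetical PyTorch sketch of a MINE-style statistics network and the corresponding mutual-information estimate between the scene and view partitions of z. The hidden sizes are assumptions; the actual statistics network is specified in Table 8.

```python
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """MINE-style statistics network T(z_scene, z_view); hidden sizes are illustrative."""
    def __init__(self, scene_dim=118, view_dim=10, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(scene_dim + view_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, z_scene, z_view):
        return self.net(torch.cat([z_scene, z_view], dim=1))

def mine_lower_bound(T, z_scene, z_view):
    """Donsker-Varadhan estimate of I(z_scene; z_view); minimized to disentangle the codes."""
    joint = T(z_scene, z_view).mean()
    # marginal samples: shuffle z_view within the batch to break the pairing
    z_view_shuffled = z_view[torch.randperm(z_view.size(0))]
    marginal = torch.logsumexp(T(z_scene, z_view_shuffled), dim=0) - \
               torch.log(torch.tensor(float(z_view.size(0))))
    return joint - marginal

# usage sketch: z = encoder(x); z_scene, z_view = z[:, :118], z[:, 118:]
# mi_estimate = mine_lower_bound(stats_net, z_scene, z_view)
```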

The architecture of the baseline networks is shown in Fig. 12. During training we use the contrastive loss (Hadsell et al. 2006):

$$\begin{aligned} \mathcal {L}_\theta (\varvec{x}_{1}, \varvec{x}_{2}, y)&= (1-y)\,\frac{1}{2}\,D_\theta (\varvec{x}_{1}, \varvec{x}_{2})^2 + y\,\frac{1}{2}\,\max \left( 0, m-D_\theta (\varvec{x}_{1}, \varvec{x}_{2})\right) ^2, \nonumber \\ D_\theta (\varvec{x}_1, \varvec{x}_2)&= \Vert G_\theta (\varvec{x}_1)-G_\theta (\varvec{x}_2)\Vert _2, \end{aligned}$$
(6)

where \(\varvec{x}_1\) and \(\varvec{x}_2\) are the input images, y is 0 if the inputs are supposed to be the same and 1 if they are supposed to be different, \(G_\theta \) is each ResNet block, parameterized by \(\theta \), and m is the margin, which we set to 2.0. We apply the contrastive loss to the 2048 features generated by each ResNet block.
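A minimal PyTorch sketch of Eq. (6) follows, assuming the 2048-dimensional embeddings have already been computed by the shared ResNet; the function name and batching are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f1, f2, y, margin=2.0):
    """Contrastive loss of Eq. (6) on embedded pairs.

    f1, f2 : (B, 2048) embeddings G_theta(x1), G_theta(x2)
    y      : (B,) labels, 0 = same object, 1 = different object
    """
    d = F.pairwise_distance(f1, f2, p=2)                          # D_theta(x1, x2)
    loss_same = (1 - y) * 0.5 * d.pow(2)                          # pull matching pairs together
    loss_diff = y * 0.5 * torch.clamp(margin - d, min=0).pow(2)   # push non-matching pairs apart
    return (loss_same + loss_diff).mean()
```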

Fig. 12

3D-IQTT baseline architecture. The four ResNet-50s share the same weights and were slightly modified to support our image size. “FC” stands for fully-connected layer, and the hidden node sizes are 2048, 512, and 256, respectively. The output of the network is encoded as a one-hot vector

Fig. 13

Learning material along with structure. The model learns the foreground and background colors separately

Fig. 14

Random lights configuration

Fig. 15

Unconditional scene generation. Generated samples from Pix2Shape model trained on ShapeNet scenes. Left: shaded images; Right: depth maps

Material, Lights, and Camera Properties

Material In our experiments, we use diffuse materials with uniform reflectance. The reflectance values are chosen arbitrarily, and we use the same material properties for both the input and the generator side. Figure 13 shows that it is possible to learn reflectance alongside the 3D structure of the scenes by requiring the model to predict the material coefficients along with the depth of the surfels. The color of the objects depends on both the lighting and the material properties. We do not delve into more detail here, as our objective is to model the structural/geometrical properties of the world; this will be explored further in a later study.

Camera The camera is specified by its position, viewing direction, and a vector indicating its orientation. The camera positions were sampled uniformly at random on a sphere for the 3D-IQTT task, and on a spherical patch contained in the positive octant for the rest of the experiments. The viewing direction was updated based on the camera position and the center of mass of the objects, so that the camera was always looking at a fixed point in the scene as its position changed. The focal length ranged between 18 mm and 25 mm in all experiments, and the field of view was fixed to 24 mm. The camera properties were also shared between the input and the generator side. However, in the 3D-IQTT task we relaxed the assumption that the camera pose is known and instead estimate the view as part of the learned latent representation.
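For concreteness, here is an illustrative NumPy sketch of this camera-sampling and look-at scheme; the radius, the patch extents, and the area-uniform parameterization are assumptions rather than the exact values used in the experiments.

```python
import numpy as np

def sample_camera(radius=2.5, target=np.zeros(3), positive_octant=True, rng=np.random):
    """Sample a camera position on a sphere (or a positive-octant patch) looking at `target`."""
    if positive_octant:
        theta = np.arccos(rng.uniform(0.0, 1.0))   # area-uniform polar angle, upper hemisphere
        phi = rng.uniform(0.0, np.pi / 2)          # azimuth restricted to the positive octant
    else:
        theta = np.arccos(rng.uniform(-1.0, 1.0))  # area-uniform over the full sphere
        phi = rng.uniform(0.0, 2 * np.pi)
    position = radius * np.array([np.sin(theta) * np.cos(phi),
                                  np.sin(theta) * np.sin(phi),
                                  np.cos(theta)])
    view_dir = target - position
    view_dir /= np.linalg.norm(view_dir)           # camera always looks at a fixed scene point
    return position, view_dir
```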

Lights For the light sources, we experimented with single and multiple point-light sources, with the light colors chosen randomly. The light positions are sampled uniformly on a sphere for the 3D-IQTT task, and uniformly on a spherical patch covering the positive octant for the other scenes. The same light colors and positions are used for rendering both the input and the generated images. The lights act as physical spot lights, with the radiant energy attenuating quadratically with distance. As an ablation study, we relaxed the assumption of having perfect knowledge of the lights by using lights with random positions and random colors. These experiments show that the light information is not needed by our model to learn the 3D structure of the data. However, since we use random lights on the generator side, the shading of the reconstructions differs in color from the input, as shown in Fig. 14.

Unconditional ShapeNet Generation

We trained Pix2Shape on scenes composed of ShapeNet objects from six categories (i.e., bowls, bottles, mugs, cans, caps and bags). Figure 15 shows qualitative results on unconditional generation. Since no class information is provided to the model, the latent variable captures both the object category and its configuration.

Table 9 View point reconstruction
Fig. 16

Reproduction of Rezende et al. (2016) and qualitative results. Top row: Samples of input images; bottom row: corresponding reconstructed images. We found that with a single centered object, the method was able to correctly reproduce the shape and orientation. However, when the object was not centered, too complex, or there was a background present, the method failed to estimate the shape

Evaluation of 3D Reconstructions

For evaluating 3D reconstructions, we use the Hausdorff distance (Taha and Hanbury 2015) as a measure of similarity between two shapes as in Niu et al. (2018). Given two point sets, A and B, the Hausdorff distance is,

$$D_H(A, B) = \max \left\{ D_H^+(A, B),\; D_H^+(B, A)\right\} ,$$

where \(D_H^+\) is the directed (asymmetric) Hausdorff distance between two point sets: \(D_H^+(A, B) = \max _{a \in A} D(a, B)\), i.e., the largest Euclidean distance \(D(a, B) = \min _{b \in B} \Vert a - b\Vert _2\) from a point in A to the set B, and \(D_H^+(B, A)\) is defined analogously for the reverse direction.
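Below is a small NumPy sketch of this symmetric Hausdorff distance between two point sets. It is a direct implementation of the definition above (exact but O(nm)), not the optimized algorithm of Taha and Hanbury (2015); the directed term is also available as scipy.spatial.distance.directed_hausdorff.

```python
import numpy as np

def hausdorff_distance(A, B):
    """Symmetric Hausdorff distance between point sets A (n, 3) and B (m, 3)."""
    # pairwise Euclidean distances, shape (n, m)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    d_ab = d.min(axis=1).max()   # directed distance D_H^+(A, B)
    d_ba = d.min(axis=0).max()   # directed distance D_H^+(B, A)
    return max(d_ab, d_ba)
```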

Ablation Study on Depth Supervision

As an ablation study, we repeated the experiment that demonstrates the view-extrapolation capabilities of our model with depth supervision. Table 9 reports the quantitative evaluations on reconstruction of the scenes from unobserved angles.

3D Intelligence Quotient Task

In their landmark work, Shepard and Metzler (1971) introduced the mental rotation task into the toolkit of cognitive assessment. The authors presented human subjects with reference images and answer images. The subjects had to quickly decide whether the answer was a 3D-rotated version or a mirrored version of the reference. The speed and accuracy with which people can solve this mental rotation task has since become a staple of IQ tests such as the Woodcock-Johnson tests (Woodcock et al. 2001). We took this as inspiration to design a quantitative evaluation metric (number of questions answered correctly), as opposed to the default qualitative analyses of generative models. We use the same kind of 3D objects, but instead of confronting our model with pairs of images and only two possible answers, we include several distractor answers, and the subject (human or computer) has to pick which of the three possible answers is the 3D-rotated version of the reference object (See Fig. 2).

Details on Human Evaluations for 3D IQTT

We posted the questionnaire to our lab-wide mailing list, and 41 participants followed the call. The questionnaire had one calibration question for which, if answered incorrectly, we pointed out the correct answer. For all subsequent questions, we did not give any participant the correct answers, and each participant had to answer all 20 questions to complete the quiz.

We also asked participants for their age range, gender, education, and for comments. While many commented that the questions were hard, nobody gave us a clear reason to discard their response. All participants were at least high school graduates currently pursuing a Bachelor’s degree. The majority of submissions \((78\%)\) were male, whereas the others were female or unspecified. Most of our participants \((73.2\%)\) were between 18 and 29 years old, the others between 30 and 39. The resulting test scores are normally distributed according to the Shapiro-Wilk test \((p<0.05)\) and significantly different from random choice according to a one-sample Student’s t test \((p<0.01)\).
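As a reference for how these two tests can be run, here is a short scipy sketch. The file name is hypothetical (the raw responses are not included here); the chance baseline of 1/3 follows from the three-answer format of the task.

```python
import numpy as np
from scipy import stats

# hypothetical file with one accuracy (fraction of the 20 questions correct) per participant
scores = np.loadtxt("iqtt_human_scores.txt")

# Shapiro-Wilk test for normality of the score distribution
w_stat, p_normality = stats.shapiro(scores)

# one-sample t test against the random-guessing baseline of 1/3 accuracy
t_stat, p_vs_chance = stats.ttest_1samp(scores, popmean=1.0 / 3.0)

print(f"Shapiro-Wilk p = {p_normality:.3f}, t test vs. chance p = {p_vs_chance:.3f}")
```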

Implementation of Rezende et al.

The authors of Rezende et al. (2016) did not publicly release any code with the publication, nor did they offer any upon request. When implementing our own version, we first attempted to reproduce their results, as depicted in Fig. 16a. We then attempted to use the method for the same qualitative reconstruction of the primitive-in-box dataset shown in Fig. 4. We found that this worked only with one main object and no background (see Fig. 16b). When including the background, applying the same method led to degenerate solutions (see Fig. 16c).

Fig. 17

Study of the effect of the mutual-information objective on 3D-IQTT performance. Our model’s performance improves as the weight on the mutual-information term increases

Ablation Study of the Weighted Mutual-Info Loss on 3D-IQTT

Consider the semi-supervised objective used in Algorithm 1. In this section, we perform an ablation study on 3D-IQTT performance using a modified form of the objective in which the mutual-information loss \(I_{\Theta }(z_{scene}, z_{view})\) is weighted by a coefficient \(\lambda \). The plot in Fig. 17 indicates the importance of the MI term: having a good estimate of the viewpoint and disentangling the view information from the geometry is crucial to performance on the IQ task.

$$\begin{aligned} L \leftarrow \mathcal {L}_{ALI} + \mathcal {L}_{recon} + \lambda \, I_{\Theta }(z_{scene}, z_{view}) \end{aligned}$$

About this article

Cite this article

Rajeswar, S., Mannan, F., Golemo, F. et al. Pix2Shape: Towards Unsupervised Learning of 3D Scenes from Images Using a View-Based Representation. Int J Comput Vis 128, 2478–2493 (2020). https://doi.org/10.1007/s11263-020-01322-1