Shape and Viewpoint Without Keypoints

  • Conference paper
  • Published in Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12360)

Abstract

We present a learning framework that recovers the 3D shape, pose, and texture of an object from a single image, trained on an image collection without any ground-truth 3D shape, multi-view images, camera viewpoints, or keypoint supervision. We approach this highly under-constrained problem in an “analysis by synthesis” framework, where the goal is to predict the likely shape, texture, and camera viewpoint that could produce the image, guided by various learned category-specific priors. Our particular contribution in this paper is a representation of the distribution over cameras, which we call the “camera-multiplex”. Instead of picking a point estimate, we maintain a set of camera hypotheses that are optimized during training to best explain the image given the current shape and texture. We call our approach Unsupervised Category-Specific Mesh Reconstruction (U-CMR) and present qualitative and quantitative results on CUB, Pascal 3D, and new web-scraped datasets. We obtain state-of-the-art camera prediction results and show that we can learn to predict diverse shapes and textures across objects using an image collection without any keypoint annotations or 3D ground truth. Project page: https://shubham-goel.github.io/ucmr.
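
The abstract's key algorithmic idea, the camera-multiplex, can be illustrated with a short sketch: each training image keeps a small bank of camera hypotheses, every hypothesis renders the current shape and texture, and the per-camera reconstruction losses are softly weighted so the cameras that best explain the image dominate and get refined. The sketch below is a minimal toy illustration under stated assumptions, not the authors' implementation (see the project page for that): the names CameraMultiplex, toy_render, and multiplex_loss, the 3-parameter camera, and the softmax weighting are all illustrative choices, and toy_render merely stands in for a differentiable mesh renderer.

```python
import torch

class CameraMultiplex(torch.nn.Module):
    """Per-image bank of K camera hypotheses, optimized jointly with shape and texture."""
    def __init__(self, num_images: int, K: int = 8, cam_dim: int = 3):
        super().__init__()
        # (num_images, K, cam_dim): e.g. azimuth, elevation, distance per hypothesis.
        self.cams = torch.nn.Parameter(0.1 * torch.randn(num_images, K, cam_dim))

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        # Fetch the K hypotheses for each image in the batch: (B, K, cam_dim).
        return self.cams[idx]

def toy_render(shape_code: torch.Tensor, cams: torch.Tensor) -> torch.Tensor:
    """Stand-in for a differentiable renderer: maps (shape, camera) to 'image' features."""
    # shape_code: (B, D), cams: (B, K, C) -> renders: (B, K, D).
    return shape_code.unsqueeze(1) + cams.sum(dim=-1, keepdim=True)

def multiplex_loss(shape_code, cams, target):
    """Render under every camera hypothesis and softly weight the per-camera
    reconstruction losses, so hypotheses that best explain the image dominate."""
    renders = toy_render(shape_code, cams)                      # (B, K, D)
    per_cam = ((renders - target.unsqueeze(1)) ** 2).mean(-1)   # (B, K)
    weights = torch.softmax(-per_cam.detach(), dim=1)           # low loss -> high weight
    return (weights * per_cam).sum(dim=1).mean()

# Toy usage: gradients flow to both the shape code and the per-image cameras.
num_images, B, D = 100, 4, 16
mux = CameraMultiplex(num_images)
shape_code = torch.randn(B, D, requires_grad=True)
target = torch.randn(B, D)
idx = torch.randint(0, num_images, (B,))
multiplex_loss(shape_code, mux(idx), target).backward()
```

The point of the sketch is only that the cameras are free per-image variables trained by the same reconstruction objective, rather than a single regressed point estimate, which is what distinguishes the camera-multiplex from standard viewpoint regression.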

Acknowledgements

We thank Jasmine Collins for scraping the Zappos shoes dataset and members of the BAIR community for helpful discussions. This work was supported in part by eBay, Stanford MURI, and the DARPA MCS program.

Author information

Corresponding author

Correspondence to Shubham Goel.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 6465 KB)

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Goel, S., Kanazawa, A., Malik, J. (2020). Shape and Viewpoint Without Keypoints. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol. 12360. Springer, Cham. https://doi.org/10.1007/978-3-030-58555-6_6

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-58555-6_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58554-9

  • Online ISBN: 978-3-030-58555-6

  • eBook Packages: Computer Science, Computer Science (R0)
