3D Interpreter Networks for Viewer-Centered Wireframe Modeling

  • Jiajun Wu
  • Tianfan Xue
  • Joseph J. Lim
  • Yuandong Tian
  • Joshua B. Tenenbaum
  • Antonio Torralba
  • William T. Freeman

Abstract

Understanding 3D object structure from a single image is an important but challenging task in computer vision, mostly due to the lack of 3D object annotations for real images. Previous research tackled this problem either by searching for a 3D shape that best explains 2D annotations, or by training purely on synthetic data with ground truth 3D information. In this work, we propose 3D INterpreter Networks (3D-INN), an end-to-end trainable framework that sequentially estimates 2D keypoint heatmaps and 3D object skeletons and poses. Our system learns from both 2D-annotated real images and synthetic 3D data. This is made possible mainly by two technical innovations. First, heatmaps of 2D keypoints serve as an intermediate representation that connects real and synthetic data. 3D-INN is trained on real images to estimate 2D keypoint heatmaps from an input image; it then predicts 3D object structure from the heatmaps using knowledge learned from synthetic 3D shapes. By doing so, 3D-INN benefits from the variation and abundance of synthetic 3D objects without suffering from the domain difference between real and synthesized images, which is often caused by imperfect rendering. Second, we propose a Projection Layer that maps the estimated 3D structure back to 2D. During training, it ensures that 3D-INN predicts 3D structure whose projection is consistent with the 2D annotations of real images. Experiments show that the proposed system performs well on both 2D keypoint estimation and 3D structure recovery. We also demonstrate that the recovered 3D information has wide vision applications, such as image retrieval.
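The consistency check performed by the Projection Layer can be illustrated with a small sketch. The abstract gives no formula, so everything below is an assumption made for illustration: a pinhole camera with a single focal length, a pose given by a rotation and a translation, and a mean squared reprojection error. The function names (`rotation_y`, `project`, `reprojection_loss`) and the toy skeleton are hypothetical and are not taken from the paper.

```python
import math


def rotation_y(theta):
    """3x3 rotation about the vertical (y) axis, as nested lists."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0],
            [-s, 0.0, c]]


def project(points_3d, rotation, translation, focal):
    """Assumed pinhole projection: rotate, translate, then divide by depth."""
    projected = []
    for p in points_3d:
        # Apply the rigid transform row by row to get camera coordinates.
        x, y, z = (sum(r * v for r, v in zip(row, p)) + t
                   for row, t in zip(rotation, translation))
        projected.append((focal * x / z, focal * y / z))
    return projected


def reprojection_loss(points_3d, rotation, translation, focal, keypoints_2d):
    """Mean squared distance between projected 3D keypoints and 2D annotations."""
    proj = project(points_3d, rotation, translation, focal)
    total = sum((u - a) ** 2 + (v - b) ** 2
                for (u, v), (a, b) in zip(proj, keypoints_2d))
    return total / len(keypoints_2d)


# Toy example: four corners of a seat-like square (hypothetical skeleton).
skeleton = [(-1.0, -1.0, -1.0), (1.0, -1.0, -1.0),
            (1.0, -1.0, 1.0), (-1.0, -1.0, 1.0)]
translation = (0.0, 0.0, 5.0)  # keeps every point in front of the camera
focal = 2.0

# Stand-in for 2D annotations: the skeleton seen under a "true" pose.
annotations = project(skeleton, rotation_y(0.3), translation, focal)

loss_true = reprojection_loss(skeleton, rotation_y(0.3), translation, focal, annotations)
loss_wrong = reprojection_loss(skeleton, rotation_y(0.8), translation, focal, annotations)
```

In this sketch the loss vanishes when the predicted pose matches the one that produced the 2D keypoints and grows as the pose drifts away, which is the kind of signal that lets 2D annotations supervise a 3D prediction. A trainable version would need the same computation written in a framework with automatic differentiation.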

Keywords

3D skeleton, Single image 3D reconstruction, Keypoint estimation, Neural network, Synthetic data

Notes

Acknowledgements

This work is supported by NSF Robust Intelligence 1212849 and NSF Big Data 1447476 to W.F., NSF Robust Intelligence 1524817 to A.T., ONR MURI N00014-16-1-2007 to J.B.T., Shell Research, the Toyota Research Institute, and the Center for Brains, Minds and Machines (NSF STC award CCF-1231216). The authors would like to thank Nvidia for GPU donations. Part of this work was done when Jiajun Wu was an intern at Facebook AI Research, and Tianfan Xue was a graduate student at MIT CSAIL.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Massachusetts Institute of Technology, Cambridge, USA
  2. Google Research, Mountain View, USA
  3. University of Southern California, Los Angeles, USA
  4. Facebook Inc., Menlo Park, USA
  5. Google Research, Cambridge, USA
