Single Image 3D Interpreter Network

  • Jiajun Wu
  • Tianfan Xue
  • Joseph J. Lim
  • Yuandong Tian
  • Joshua B. Tenenbaum
  • Antonio Torralba
  • William T. Freeman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9910)


Understanding 3D object structure from a single image is an important but difficult task in computer vision, mostly due to the lack of 3D object annotations in real images. Previous work tackles this problem by either solving an optimization task given 2D keypoint positions, or training on synthetic data with ground truth 3D information.

In this work, we propose 3D INterpreter Network (3D-INN), an end-to-end framework which sequentially estimates 2D keypoint heatmaps and 3D object structure, trained on both real 2D-annotated images and synthetic 3D data. This is made possible mainly by two technical innovations. First, we propose a Projection Layer, which projects estimated 3D structure to 2D space, so that 3D-INN can be trained to predict 3D structural parameters supervised by 2D annotations on real images. Second, heatmaps of keypoints serve as an intermediate representation connecting real and synthetic data, enabling 3D-INN to benefit from the variation and abundance of synthetic 3D objects, without suffering from the difference between the statistics of real and synthesized images due to imperfect rendering. The network achieves state-of-the-art performance on both 2D keypoint estimation and 3D structure recovery. We also show that the recovered 3D information can be used in other vision applications, such as image retrieval.
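The abstract does not give the camera model or the loss behind the Projection Layer, so the following is only a minimal sketch of the general idea, assuming a pinhole camera with rotation, translation, and focal length, and a squared reprojection error. All function names and the toy square structure are hypothetical and are not taken from the authors' implementation. In a trained network this step would be a differentiable layer; NumPy is used here only to illustrate the forward computation.

```python
import numpy as np

def rotation_y(theta):
    """Rotation about the y-axis (a one-angle pose, assumed for brevity)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def project(points_3d, rotation, translation, focal_length):
    """Project (N, 3) points to (N, 2) image coordinates with a pinhole model."""
    # Rotate and translate the structure into the camera frame.
    cam = points_3d @ rotation.T + translation
    # Perspective divide by depth, scaled by the focal length.
    return focal_length * cam[:, :2] / cam[:, 2:3]

def reprojection_loss(points_3d, rotation, translation, focal_length, keypoints_2d):
    """Mean squared distance between projected keypoints and 2D annotations."""
    projected = project(points_3d, rotation, translation, focal_length)
    return float(np.mean(np.sum((projected - keypoints_2d) ** 2, axis=1)))

# Toy 3D structure: four keypoints at the corners of a unit square (hypothetical).
square = np.array([[-0.5, -0.5, 0.0],
                   [0.5, -0.5, 0.0],
                   [0.5, 0.5, 0.0],
                   [-0.5, 0.5, 0.0]])
true_rotation = rotation_y(0.3)
translation = np.array([0.0, 0.0, 5.0])  # keeps every point in front of the camera
focal_length = 2.0

# Stand-in for 2D keypoint annotations on a real image.
annotations = project(square, true_rotation, translation, focal_length)

loss_correct = reprojection_loss(square, true_rotation, translation, focal_length, annotations)
loss_wrong = reprojection_loss(square, rotation_y(1.0), translation, focal_length, annotations)
print(loss_correct, loss_wrong)
```

The loss depends only on 2D annotations, which illustrates why 3D parameters can be supervised on real images that have no 3D ground truth: a wrong pose estimate produces a larger reprojection error than the correct one.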


3D structure, Single image 3D reconstruction, Keypoint estimation, Neural network, Synthetic data



This work is supported by NSF Robust Intelligence 1212849 and NSF Big Data 1447476 to W.F., NSF Robust Intelligence 1524817 to A.T., ONR MURI N00014-16-1-2007 to J.B.T., Shell Research, and the Center for Brains, Minds and Machines (NSF STC award CCF-1231216). The authors would like to thank Nvidia for GPU donations. Part of this work was done during Jiajun Wu’s internship at Facebook AI Research.




Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Jiajun Wu (1)
  • Tianfan Xue (1)
  • Joseph J. Lim (1, 2)
  • Yuandong Tian (3)
  • Joshua B. Tenenbaum (1)
  • Antonio Torralba (1)
  • William T. Freeman (1, 4)

  1. Massachusetts Institute of Technology, Cambridge, USA
  2. Stanford University, Stanford, USA
  3. Facebook AI Research, Menlo Park, USA
  4. Google Research, Cambridge, USA
