Deep Volumetric Video From Very Sparse Multi-view Performance Capture

  • Zeng Huang
  • Tianye Li
  • Weikai Chen
  • Yajie Zhao
  • Jun Xing
  • Chloe LeGendre
  • Linjie Luo
  • Chongyang Ma
  • Hao Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11220)


Abstract

We present a deep-learning-based volumetric approach for performance capture using a passive and highly sparse multi-view capture system. State-of-the-art performance capture systems require pre-scanned actors, a large number of cameras, or active sensors. In this work, we focus on the task of template-free, per-frame 3D surface reconstruction from as few as three RGB sensors, for which conventional visual hull or multi-view stereo methods fail to generate plausible results. We introduce a novel multi-view Convolutional Neural Network (CNN) that maps 2D images to a 3D volumetric field, and we use this field to encode the probabilistic distribution of surface points of the captured subject. By querying the resulting field, we can instantiate the clothed human body at arbitrary resolutions. Our approach scales to different numbers of input images, yielding higher reconstruction quality when more views are used. Although trained only on synthetic data, our network generalizes to real footage from body performance capture. Our method is suitable for high-quality, low-cost full-body volumetric capture solutions, which are gaining popularity for VR and AR content creation. Experimental results demonstrate that our method is significantly more robust and accurate than existing techniques when only very sparse views are available.
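As an illustration of what querying a volumetric field at arbitrary resolutions can look like, the following is a minimal sketch and not the authors' implementation. The learned network is replaced by a hypothetical analytic stand-in (`toy_probability_field`, a soft sphere), and `query_field` is a name introduced here for illustration. Reading the field value as an inside probability and thresholding it at 0.5 are likewise assumptions of this sketch.

```python
import numpy as np


def toy_probability_field(points: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a learned field (NOT the paper's network).

    Returns, for each 3D point, a value in (0, 1) that is high inside a
    sphere of radius 0.5 centred at the origin and low outside it.
    """
    signed_distance = np.linalg.norm(points, axis=-1) - 0.5
    return 1.0 / (1.0 + np.exp(40.0 * signed_distance))


def query_field(field, resolution: int, bound: float = 1.0) -> np.ndarray:
    """Evaluate `field` on a resolution^3 grid covering [-bound, bound]^3."""
    axis = np.linspace(-bound, bound, resolution)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
    probabilities = field(grid.reshape(-1, 3))
    return probabilities.reshape(resolution, resolution, resolution)


# The same continuous field can be sampled coarsely or finely.
coarse = query_field(toy_probability_field, resolution=16)
fine = query_field(toy_probability_field, resolution=64)

# Assumed convention: thresholding at 0.5 gives a binary occupancy volume.
coarse_occupied = coarse > 0.5
fine_occupied = fine > 0.5
```

Because the field is defined for any 3D point, the output resolution is chosen at query time rather than fixed by the representation; a surface mesh could then be extracted from either volume with a standard iso-surface method.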


Keywords

  • Human performance capture
  • Neural networks for multi-view stereo
  • Wide-baseline reconstruction



Acknowledgements

We would like to thank the authors of [74], who helped us test with their system. This work was supported in part by the ONR YIP grant N00014-17-S-FO14; the CONIX Research Center, one of six centers in JUMP, a Semiconductor Research Corporation (SRC) program sponsored by DARPA; the Andrew and Erna Viterbi Early Career Chair; the U.S. Army Research Laboratory (ARL) under contract number W911NF-14-D-0005; Adobe; and Sony. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Supplementary material

Supplementary material 1: 474218_1_En_21_MOESM1_ESM.pdf (PDF, 3337 KB)


References

  1. Collet, A., et al.: High-quality streamable free-viewpoint video. ACM Trans. Graph. (TOG) 34(4), 69 (2015)
  2. Orts-Escolano, S., et al.: Holoportation: Virtual 3d teleportation in real-time. In: Proceedings of the 29th Annual Symposium on User Interface Software and Technology, pp. 741–754. ACM (2016)
  3. Joo, H., et al.: Panoptic studio: a massively multiview system for social motion capture. In: The IEEE International Conference on Computer Vision (ICCV) (2015)
  4. Vlasic, D., et al.: Dynamic shape capture using multi-view photometric stereo. ACM Trans. Graph. (TOG) 28(5), 174 (2009)
  5. Li, H., et al.: Temporally coherent completion of dynamic shapes. ACM Trans. Graph. (TOG) 31(1), 2 (2012)
  6. Vlasic, D., Baran, I., Matusik, W., Popović, J.: Articulated mesh animation from multi-view silhouettes. ACM Trans. Graph. (TOG) 27, 97 (2008). ACM
  7. De Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H.P., Thrun, S.: Performance capture from sparse multi-view video. ACM Trans. Graph. (TOG) 27, 98 (2008). ACM
  8. Xu, W., et al.: Monoperfcap: Human performance capture from monocular video. arXiv preprint arXiv:1708.02136 (2017)
  9. Matusik, W., Buehler, C., Raskar, R., Gortler, S.J., McMillan, L.: Image-based visual hulls. In: Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, pp. 369–374. ACM Press/Addison-Wesley Publishing Co. (2000)
  10. Furukawa, Y., Ponce, J.: Carved visual hulls for image-based modeling. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3951, pp. 564–577. Springer, Heidelberg (2006)
  11. Esteban, C.H., Schmitt, F.: Silhouette and stereo fusion for 3d object modeling. Comput. Vis. Image Underst. 96(3), 367–392 (2004)
  12. Cheung, G.K., Baker, S., Kanade, T.: Visual hull alignment and refinement across time: A 3d reconstruction algorithm combining shape-from-silhouette with stereo. In: Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on. Volume 2, IEEE (2003) II-375
  13. Song, D., Tong, R., Chang, J., Yang, X., Tang, M., Zhang, J.J.: 3d body shapes estimation from dressed-human silhouettes. In: Computer Graphics Forum, vol. 35, pp. 147–156 (2016). Wiley Online Library
  14. Zuo, X., Du, C., Wang, S., Zheng, J., Yang, R.: Interactive visual hull refinement for specular and transparent object surface reconstruction. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2237–2245 (2015)
  15. Liu, Y., Dai, Q., Xu, W.: A point-cloud-based multiview stereo algorithm for free-viewpoint video. IEEE Trans. Vis. Comput. Graph. 16(3), 407–418 (2010)
  16. Franco, J.S., Lapierre, M., Boyer, E.: Visual shapes of silhouette sets. In: Third International Symposium on 3D Data Processing, Visualization, and Transmission, pp. 397–404. IEEE (2006)
  17. Loop, C., Zhang, C., Zhang, Z.: Real-time high-resolution sparse voxelization with application to image-based modeling. In: Proceedings of the 5th High-Performance Graphics Conference, pp. 73–79. ACM (2013)
  18. Starck, J., Hilton, A.: Surface capture for performance-based animation. IEEE Comput. Graph. Appl. 27(3) (2007)
  19. Zitnick, C.L., Kang, S.B., Uyttendaele, M., Winder, S., Szeliski, R.: High-quality video view interpolation using a layered representation. ACM Trans. Graph. (TOG) 23, 600–608 (2004). ACM
  20. Waschbüsch, M., Würmlin, S., Cotting, D., Sadlo, F., Gross, M.: Scalable 3d video of dynamic scenes. Vis. Comput. 21(8), 629–638 (2005)
  21. Wu, C., Varanasi, K., Liu, Y., Seidel, H.P., Theobalt, C.: Shading-based dynamic shape refinement from multi-view video under general illumination. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 1108–1115. IEEE (2011)
  22. Ahmed, N., Theobalt, C., Dobrev, P., Seidel, H.P., Thrun, S.: Robust fusion of dynamic shape and normal capture for high-quality reconstruction of time-varying geometry. In: IEEE Conference on Computer Vision and Pattern Recognition, 2008, CVPR 2008, pp. 1–8. IEEE (2008)
  23. Stoll, C., Gall, J., De Aguiar, E., Thrun, S., Theobalt, C.: Video-based reconstruction of animatable human characters. ACM Trans. Graph. (TOG) 29(6), 139 (2010)
  24. Bradley, D., Popa, T., Sheffer, A., Heidrich, W., Boubekeur, T.: Markerless garment capture. ACM Trans. Graph. (TOG) 27, 99 (2008). ACM
  25. Wu, C., Varanasi, K., Theobalt, C.: Full body performance capture under uncontrolled and varying illumination: a shading-based approach. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7575, pp. 757–770. Springer, Heidelberg (2012)
  26. Gall, J., Stoll, C., De Aguiar, E., Theobalt, C., Rosenhahn, B., Seidel, H.P.: Motion capture using joint skeleton tracking and surface estimation. In: IEEE Conference on Computer Vision and Pattern Recognition, 2009, CVPR 2009, pp. 1746–1753. IEEE (2009)
  27. Liu, Y., Stoll, C., Gall, J., Seidel, H.P., Theobalt, C.: Markerless motion capture of interacting characters using multi-view image segmentation. In: 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1249–1256. IEEE (2011)
  28. Bray, M., Kohli, P., Torr, P.H.S.: PoseCut: simultaneous segmentation and 3D pose estimation of humans using dynamic graph-cuts. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 642–655. Springer, Heidelberg (2006)
  29. Brox, T., Rosenhahn, B., Cremers, D., Seidel, H.-P.: High accuracy optical flow serves 3-D pose tracking: exploiting contour and flow based constraints. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006. LNCS, vol. 3952, pp. 98–111. Springer, Heidelberg (2006)
  30. Brox, T., Rosenhahn, B., Gall, J., Cremers, D.: Combined region and motion-based 3d tracking of rigid and articulated objects. IEEE Trans. Pattern Anal. Mach. Intell. 32(3), 402–415 (2010)
  31. Mustafa, A., Kim, H., Guillemaut, J.Y., Hilton, A.: General dynamic scene reconstruction from multiple view video. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 900–908 (2015)
  32. Wu, C., Stoll, C., Valgaerts, L., Theobalt, C.: On-set performance capture of multiple actors with a stereo camera. ACM Trans. Graph. (TOG) 32(6), 161 (2013)
  33. Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: Smpl: A skinned multi-person linear model. ACM Trans. Graph. (TOG) 34(6), 248 (2015)
  34. Loper, M., Mahmood, N., Black, M.J.: Mosh: motion and shape capture from sparse markers. ACM Trans. Graph. (TOG) 33(6), 220 (2014)
  35. Hasler, N., Ackermann, H., Rosenhahn, B., Thormählen, T., Seidel, H.P.: Multilinear pose and body shape estimation of dressed subjects from image sets. In: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1823–1830. IEEE (2010)
  36. Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., Davis, J.: Scape: shape completion and animation of people. ACM Trans. Graph. (TOG) 24, 408–413 (2005). ACM
  37. Balan, A.O., Sigal, L., Black, M.J., Davis, J.E., Haussecker, H.W.: Detailed human shape and pose from images. In: IEEE Conference on Computer Vision and Pattern Recognition, 2007, CVPR 2007, pp. 1–8. IEEE (2007)
  38. Plänkers, R., Fua, P.: Tracking and modeling people in video sequences. Comput. Vis. Image Underst. 81(3), 285–302 (2001)
  39. Sminchisescu, C., Triggs, B.: Estimating articulated human motion with covariance scaled sampling. Int. J. Robot. Res. 22(6), 371–391 (2003)
  40. Tan, J.K.V., Budvytis, I., Cipolla, R.: Indirect deep structured learning for 3d human body shape and pose prediction
  41. Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., Black, M.J.: Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 561–578. Springer, Cham (2016)
  42. Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. arXiv preprint arXiv:1712.06584 (2017)
  43. Lassner, C., Romero, J., Kiefel, M., Bogo, F., Black, M.J., Gehler, P.V.: Unite the people: Closing the loop between 3d and 2d human representations. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
  44. Guan, P., Weiss, A., Balan, A.O., Black, M.J.: Estimating human shape and pose from a single image. In: 2009 IEEE 12th International Conference on Computer Vision, pp. 1381–1388. IEEE (2009)
  45. Dou, M., Fuchs, H., Frahm, J.M.: Scanning and tracking dynamic objects with commodity depth cameras. In: 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), pp. 99–106. IEEE (2013)
  46. Dou, M., et al.: Fusion4d: real-time performance capture of challenging scenes. ACM Trans. Graph. (TOG) 35(4), 114 (2016)
  47. Ye, G., Liu, Y., Hasler, N., Ji, X., Dai, Q., Theobalt, C.: Performance capture of interacting characters with handheld kinects. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, pp. 828–841. Springer, Heidelberg (2012)
  48. Zollhöfer, M., et al.: Real-time non-rigid reconstruction using an rgb-d camera. ACM Trans. Graph. (TOG) 33(4), 156 (2014)
  49. Wang, R., Wei, L., Vouga, E., Huang, Q., Ceylan, D., Medioni, G., Li, H.: Capturing dynamic textured surfaces of moving targets. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 271–288. Springer, Cham (2016)
  50. Tylecek, R., Sara, R.: Refinement of surface mesh for accurate multi-view reconstruction. Int. J. Virtual Real. 9(1), 45–54 (2010)
  51. Wu, C., Liu, Y., Dai, Q., Wilburn, B.: Fusing multiview and photometric stereo for 3d reconstruction under uncalibrated illumination. IEEE Trans. Vis. Comput. Graph. 17(8), 1082–1095 (2011)
  52. Hernández, C., Vogiatzis, G., Brostow, G.J., Stenger, B., Cipolla, R.: Non-rigid photometric stereo with colored lights. In: IEEE 11th International Conference on Computer Vision, 2007, ICCV 2007, pp. 1–8. IEEE (2007)
  53. Robertini, N., Casas, D., De Aguiar, E., Theobalt, C.: Multi-view performance capture of surface details. Int. J. Comput. Vis. 124, 1–18 (2017)
  54. Pons-Moll, G., Pujades, S., Hu, S., Black, M.: Clothcap: Seamless 4d clothing capture and retargeting. ACM Trans. Graph. (Proc. SIGGRAPH) [to appear] 1 (2017)
  55. Zhang, C., Pujades, S., Black, M., Pons-Moll, G.: Detailed, accurate, human shape estimation from clothed 3D scan sequences. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017). Spotlight
  56. Yang, J., Franco, J.-S., Hétroy-Wheeler, F., Wuhrer, S.: Estimation of human body shape in motion with wide clothing. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 439–454. Springer, Cham (2016)
  57. Kalogerakis, E., Averkiou, M., Maji, S., Chaudhuri, S.: 3d shape segmentation with projective convolutional networks. In: Proceedings of CVPR, 2. IEEE (2017)
  58. Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3d shape recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 945–953 (2015)
  59. Shi, B., Bai, S., Zhou, Z., Bai, X.: Deeppano: deep panoramic representation for 3-d shape recognition. IEEE Signal Process. Lett. 22(12), 2339–2343 (2015)
  60. Qi, C.R., Su, H., Nießner, M., Dai, A., Yan, M., Guibas, L.J.: Volumetric and multi-view cnns for object classification on 3d data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5648–5656 (2016)
  61. Huang, H., Kalogerakis, E., Chaudhuri, S., Ceylan, D., Kim, V.G., Yumer, E.: Learning local shape descriptors from part correspondences with multiview convolutional networks. ACM Trans. Graph. (TOG) 37(1), 6 (2018)
  62. Su, H., Wang, F., Yi, L., Guibas, L.: 3d-assisted image feature synthesis for novel views of an object. arXiv preprint arXiv:1412.0003 (2014)
  63. Zhou, T., Tulsiani, S., Sun, W., Malik, J., Efros, A.A.: View synthesis by appearance flow. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 286–301. Springer, Cham (2016)
  64. Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C.: Transformation-grounded image generation network for novel 3d view synthesis. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 702–711. IEEE (2017)
  65. Rezende, D.J., Eslami, S.A., Mohamed, S., Battaglia, P., Jaderberg, M., Heess, N.: Unsupervised learning of 3d structure from images. In: Advances In Neural Information Processing Systems, pp. 4996–5004 (2016)
  66. Sinha, A., Unmesh, A., Huang, Q., Ramani, K.: Surfnet: Generating 3d shape surfaces using deep residual networks. In: Proceedings of CVPR (2017)
  67. Choy, C.B., Xu, D., Gwak, J.Y., Chen, K., Savarese, S.: 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 628–644. Springer, Cham (2016)
  68. Lun, Z., Gadelha, M., Kalogerakis, E., Maji, S., Wang, R.: 3d shape reconstruction from sketches via multi-view convolutional networks. arXiv preprint arXiv:1707.06375 (2017)
  69. Soltani, A.A., Huang, H., Wu, J., Kulkarni, T.D., Tenenbaum, J.B.: Synthesizing 3d shapes via modeling multi-view depth maps and silhouettes with deep generative networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1511–1519 (2017)
  70. Tatarchenko, M., Dosovitskiy, A., Brox, T.: Multi-view 3D models from single images with a convolutional network. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 322–337. Springer, Cham (2016)
  71. Tulsiani, S., Zhou, T., Efros, A.A., Malik, J.: Multi-view supervision for single-view reconstruction via differentiable ray consistency. In: CVPR, vol. 1, p. 3 (2017)
  72. Kar, A., Häne, C., Malik, J.: Learning a multi-view stereo machine. In: Advances in Neural Information Processing Systems, pp. 364–375 (2017)
  73. Hartmann, W., Galliani, S., Havlena, M., Van Gool, L., Schindler, K.: Learned multi-patch similarity. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1595–1603. IEEE (2017)
  74. Ji, M., Gall, J., Zheng, H., Liu, Y., Fang, L.: Surfacenet: an end-to-end 3d neural network for multiview stereopsis. arXiv preprint arXiv:1708.01749 (2017)
  75. Dibra, E., Jain, H., Oztireli, C., Ziegler, R., Gross, M.: Human shape from silhouettes using generative hks descriptors and cross-modal neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 5 (CVPR), Honolulu, HI, USA (2017)
  76. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456 (2015)
  77. Xu, H., Barbič, J.: Signed distance fields for polygon soup meshes. Graphics Interface 2014 (2014)
  78. Varol, G., et al.: Learning from synthetic humans. In: CVPR (2017)
  79. Adobe: Mixamo (2013)
  80. Du, R., Chuang, M., Chang, W., Hoppe, H., Varshney, A.: Montage4d: interactive seamless fusion of multiview video textures. In: Proceedings of ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games (I3D), pp. 124–133. ACM (May 2018)
  81. Prada, F., Kazhdan, M., Chuang, M., Collet, A., Hoppe, H.: Spatiotemporal atlas parameterization for evolving meshes. ACM Trans. Graph. (TOG) 36(4), 58 (2017)
  82. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multiview stereopsis. IEEE Trans. Pattern Anal. Mach. Intell. 32(8), 1362–1376 (2010)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Zeng Huang (1, 2), email author
  • Tianye Li (1, 2)
  • Weikai Chen (2)
  • Yajie Zhao (2)
  • Jun Xing (2)
  • Chloe LeGendre (1, 2)
  • Linjie Luo (3)
  • Chongyang Ma (3)
  • Hao Li (1, 2, 4)

  1. University of Southern California, Los Angeles, USA
  2. USC Institute for Creative Technologies, Los Angeles, USA
  3. Snap Inc., Los Angeles, USA
  4. Pinscreen, Santa Monica, USA
