Volumetric Performance Capture from Minimal Camera Viewpoints

  • Andrew Gilbert
  • Marco Volino
  • John Collomosse
  • Adrian Hilton
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11215)


We present a convolutional autoencoder that enables high-fidelity volumetric reconstructions of human performance to be captured from multi-view video comprising only a small set of camera views. Our method yields end-to-end reconstruction error similar to that of a probabilistic visual hull computed from at least twice as many viewpoints. It relies on a deep prior learned implicitly by the autoencoder, which is trained on a dataset of view-ablated multi-view video footage covering a wide range of subjects and actions. This opens up the possibility of high-end volumetric performance capture in on-set and prosumer scenarios where time or cost prohibits a high witness camera count.
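To make the terms "probabilistic visual hull" and "view-ablated" concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes one common formulation in which a voxel's occupancy is the product of the soft foreground probabilities at its projection in each view. The function name `soft_visual_hull`, the toy orthographic cameras, and the choice to score out-of-view voxels as 0 are all illustrative assumptions.

```python
import numpy as np

def soft_visual_hull(silhouettes, projections, points):
    """Per-point occupancy as the product of per-view foreground probabilities.

    silhouettes: list of (H, W) arrays with values in [0, 1] (soft mattes).
    projections: list of 3x4 camera matrices mapping homogeneous 3D points to pixels.
    points:      (N, 3) array of voxel centres.
    Assumption: a point that falls outside a view's image contributes probability 0.
    """
    points = np.asarray(points, dtype=float)
    homog = np.hstack([points, np.ones((len(points), 1))])
    occupancy = np.ones(len(points))
    for sil, P in zip(silhouettes, projections):
        uvw = homog @ np.asarray(P, dtype=float).T
        depth = uvw[:, 2]
        in_front = depth > 1e-9
        safe_depth = np.where(in_front, depth, 1.0)  # avoid division by zero
        u = np.round(uvw[:, 0] / safe_depth).astype(int)
        v = np.round(uvw[:, 1] / safe_depth).astype(int)
        h, w = sil.shape
        visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        prob = np.zeros(len(points))
        prob[visible] = sil[v[visible], u[visible]]
        occupancy *= prob
    return occupancy

# Toy example: two hypothetical orthographic cameras looking along z and along x.
cam_a = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]], dtype=float)
cam_b = np.array([[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]], dtype=float)
sil_a = np.zeros((4, 4)); sil_a[1:3, 1:3] = 1.0
sil_b = np.zeros((4, 4)); sil_b[1:3, 1:3] = 0.5
points = np.array([[1, 1, 1], [0, 1, 1], [1, 1, 3]], dtype=float)

occ_all = soft_visual_hull([sil_a, sil_b], [cam_a, cam_b], points)  # every view
occ_few = soft_visual_hull([sil_a], [cam_a], points)                # view-ablated
```

In this toy setup the third point is carved away only when the second view is present, so the ablated hull retains volume that the full hull rejects. A learned prior of the kind the abstract describes would need to compensate for errors of this sort.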


Keywords: Multi-view reconstruction, Deep autoencoders, Visual hull



This work was supported by InnovateUK via the TotalCapture project (grant agreement 102685) and in part through the donation of GPU hardware by NVIDIA Corporation.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Andrew Gilbert (1)
  • Marco Volino (1)
  • John Collomosse (1, 2)
  • Adrian Hilton (1)

  1. Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
  2. Creative Intelligence Lab, Adobe Research, San Jose, USA
