Expressive Telepresence via Modular Codec Avatars

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12357)


VR telepresence consists of interacting with another human in a virtual space in which each person is represented by an avatar. Today most avatars are cartoon-like, but soon the technology will allow video-realistic ones. This paper moves in that direction and presents Modular Codec Avatars (MCA), a method to generate hyper-realistic faces driven by the cameras in the VR headset. MCA extends traditional Codec Avatars (CA) by replacing the holistic model with a learned modular representation. Traditional person-specific CAs are learned from few training samples and typically lack robustness as well as expressiveness when transferring facial expressions. MCA addresses these issues by learning a modulated adaptive blending of different facial components together with an exemplar-based latent alignment. We demonstrate that MCA achieves improved expressiveness and robustness with respect to CA on a variety of real-world datasets and practical scenarios. Finally, we showcase new applications in VR telepresence enabled by the proposed model.
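The core idea of modulated adaptive blending can be illustrated with a minimal, hypothetical sketch: each facial module has its own latent code and decoder, and per-pixel blending weights (normalized across modules with a softmax) combine the decoded patches into one face. All names, sizes, and the random linear "decoders" below are illustrative stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 3 facial modules (e.g. eyes, mouth, rest),
# each decoded to an 8x8 single-channel "texture" patch.
N_MODULES, H, W, LATENT_DIM = 3, 8, 8, 16

def decode(latent):
    """Stand-in for a per-module neural decoder: a fixed random
    linear map from the latent code to an HxW patch."""
    Wd = rng.standard_normal((H * W, latent.size))
    return (Wd @ latent).reshape(H, W)

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Per-module latent codes (in the real system these would come from
# encoders driven by the headset cameras).
latents = [rng.standard_normal(LATENT_DIM) for _ in range(N_MODULES)]

# Decode each module, then blend with per-pixel weights that are
# normalized across modules, so they sum to 1 at every pixel.
patches = np.stack([decode(z) for z in latents])   # (3, 8, 8)
logits = rng.standard_normal((N_MODULES, H, W))    # adaptive blend logits
weights = softmax(logits, axis=0)                  # (3, 8, 8)

face = (weights * patches).sum(axis=0)             # blended (8, 8) output
```

In practice the blending logits would themselves be predicted from the input, so the model can adaptively decide which module dominates each region of the face.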


Keywords: Virtual reality · Telepresence · Codec Avatar

Supplementary material

Supplementary material 1 (pdf 294 KB)

Supplementary material 2 (mp4 26829 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of Toronto, Toronto, Canada
  2. Vector Institute, Toronto, Canada
  3. Facebook Reality Lab, Pittsburgh, USA
