
ReliTalk: Relightable Talking Portrait Generation from a Single Video

Published in: International Journal of Computer Vision

Abstract

Recent years have witnessed great progress in creating vivid audio-driven portraits from monocular videos. However, seamlessly adapting these video avatars to new scenarios with different backgrounds and lighting conditions remains unsolved. Meanwhile, existing relighting studies mostly rely on dynamically lit or multi-view data, which are too expensive to collect for video portrait creation. To bridge this gap, we propose ReliTalk, a novel framework for relightable audio-driven talking portrait generation from monocular videos. Our key insight is to decompose the portrait’s reflectance from implicitly learned audio-driven facial normals and images. Specifically, we incorporate 3D facial priors derived from audio features to predict delicate normal maps through implicit functions. These initially predicted normals then play a crucial role in reflectance decomposition by dynamically estimating the lighting condition of the given video. Moreover, the stereoscopic face representation is refined with an identity-consistent loss under simulated multiple lighting conditions, addressing the ill-posed problem caused by the limited views available from a single monocular video. Extensive experiments validate the superiority of our proposed framework on both real and synthetic datasets. Our code is released at https://github.com/arthur-qiu/ReliTalk.
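To make the relighting idea in the abstract concrete, the sketch below shows plain second-order spherical-harmonics (SH) diffuse shading, a standard building block in single-image portrait relighting. It is an illustrative assumption, not ReliTalk's actual implementation: the albedo, normal map, and nine SH lighting coefficients are hypothetical placeholders standing in for the quantities the method decomposes or estimates (reflectance, audio-driven normals, and per-video lighting).

```python
# Minimal sketch of second-order spherical-harmonics (SH) diffuse shading.
# All inputs are hypothetical placeholders; this is NOT the ReliTalk code.
import numpy as np

def sh_basis(normals: np.ndarray) -> np.ndarray:
    """Evaluate the 9 second-order SH basis functions at unit normals (H, W, 3)."""
    x, y, z = normals[..., 0], normals[..., 1], normals[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),        # Y_0,0
        0.488603 * y,                      # Y_1,-1
        0.488603 * z,                      # Y_1,0
        0.488603 * x,                      # Y_1,1
        1.092548 * x * y,                  # Y_2,-2
        1.092548 * y * z,                  # Y_2,-1
        0.315392 * (3.0 * z ** 2 - 1.0),   # Y_2,0
        1.092548 * x * z,                  # Y_2,1
        0.546274 * (x ** 2 - y ** 2),      # Y_2,2
    ], axis=-1)                            # shape (H, W, 9)

def relight(albedo: np.ndarray, normals: np.ndarray, sh_coeffs: np.ndarray) -> np.ndarray:
    """Diffuse relighting: image = albedo * (SH basis of normals . lighting coefficients)."""
    shading = sh_basis(normals) @ sh_coeffs            # (H, W)
    return np.clip(albedo * shading[..., None], 0.0, 1.0)

# Toy usage with random placeholder inputs (stand-ins for the decomposed
# reflectance, the audio-driven normal map, and a target illumination).
H, W = 4, 4
albedo = np.random.rand(H, W, 3)
normals = np.random.randn(H, W, 3)
normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
target_light = np.array([0.8, 0.2, 0.4, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0])
relit = relight(albedo, normals, target_light)
print(relit.shape)  # (4, 4, 3)
```

Under this simplified diffuse model, relighting amounts to swapping `target_light` for SH coefficients describing a new environment while the decomposed albedo and normals stay fixed.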


Data Availability Statement

Video data and pre-trained models used in this paper are available online. Source links for reproduction are provided in the ReliTalk repository: https://github.com/arthur-qiu/ReliTalk.


Acknowledgements

This research is supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-PhD-2022-01-035T), NTU NAP, MOE AcRF Tier 2 (MOET2EP20221-0012), and under the RIE2020 Industry Alignment Fund - Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

Author information

Corresponding author: Ziwei Liu.

Additional information

Communicated by Gang Hua.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (mp4 92809 KB)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Qiu, H., Chen, Z., Jiang, Y. et al. ReliTalk: Relightable Talking Portrait Generation from a Single Video. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02007-9


