A statistical parametric approach to video-realistic text-driven talking avatar

Xie, Lei; Sun, Naicai; Fan, Bo

doi:10.1007/s11042-013-1633-3


Published: 15 August 2013

Volume 73, pages 377–396 (2014)

Multimedia Tools and Applications



Abstract

This paper proposes a statistical parametric approach to a video-realistic text-driven talking avatar. We follow the trajectory HMM approach, in which audio and visual speech are jointly modeled by HMMs and continuous audiovisual speech parameter trajectories are synthesized under the maximum likelihood criterion. Previous trajectory HMM approaches focus only on mouth animation, synthesizing either simple geometric mouth shapes or video-realistic lip motion. Our approach uses a trajectory HMM to generate visual parameters of the lower face and realizes video-realistic animation of the whole face. Specifically, we use an active appearance model (AAM) to model the visual speech, which offers a convenient and compact statistical model of both the shape and the appearance variations of the face. To realize video-realistic effects with high fidelity, we use the Poisson image editing technique to stitch the synthesized lower-face image seamlessly into a whole-face image. Objective and subjective experiments show that the proposed approach can produce natural facial animation.
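The maximum likelihood trajectory synthesis mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single one-dimensional parameter stream (for instance, one AAM coefficient), a hypothetical central-difference delta window clamped at the sequence edges, and per-frame Gaussian means and variances already read off the HMM state sequence. Under those assumptions the smooth trajectory c is the solution of (WᵀΣ⁻¹W) c = WᵀΣ⁻¹μ, where W stacks the static and delta windows. All function names are illustrative.

```python
def build_window_matrix(num_frames):
    """Rows alternate static and delta windows, so that o = W c.

    Assumed delta window: 0.5 * (c[t+1] - c[t-1]), clamped at the edges.
    """
    rows = []
    last = num_frames - 1
    for t in range(num_frames):
        static = [0.0] * num_frames
        static[t] = 1.0
        delta = [0.0] * num_frames
        delta[min(t + 1, last)] += 0.5
        delta[max(t - 1, 0)] -= 0.5
        rows.append(static)
        rows.append(delta)
    return rows


def solve(matrix, rhs):
    """Solve a dense linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def mlpg(means, variances):
    """Maximum likelihood parameter generation for one 1-D stream.

    means, variances: one (static, delta) pair per frame.
    Returns the static trajectory c that maximizes the Gaussian likelihood
    of the stacked static and delta observations W c.
    """
    num_frames = len(means)
    window = build_window_matrix(num_frames)
    mu = [value for pair in means for value in pair]
    precision = [1.0 / value for pair in variances for value in pair]
    num_rows = 2 * num_frames
    # Normal equations: (W^T P W) c = W^T P mu, with P the diagonal precision.
    lhs = [
        [
            sum(window[r][i] * precision[r] * window[r][j] for r in range(num_rows))
            for j in range(num_frames)
        ]
        for i in range(num_frames)
    ]
    rhs = [
        sum(window[r][i] * precision[r] * mu[r] for r in range(num_rows))
        for i in range(num_frames)
    ]
    return solve(lhs, rhs)


if __name__ == "__main__":
    # A step in the static means, with delta means of zero and a tight delta
    # variance, which pulls the generated trajectory toward a smooth transition.
    step_means = [(0.0, 0.0)] * 3 + [(1.0, 0.0)] * 3
    step_variances = [(1.0, 0.01)] * 6
    print(mlpg(step_means, step_variances))
```

The dense solver keeps the sketch dependency-free. In practice WᵀΣ⁻¹W is banded, so a banded Cholesky solve is the usual choice for long sequences and multi-dimensional streams.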


Notes

http://dict.bing.com.cn
https://itunes.apple.com/us/app/talking-tom-cat/id377194688?mt=8
http://marketplace.xbox.com/en-US/Product/Avatar-Kinect/66acd000-77fe-1000-9115-d8025848081a
Base vector is added in.
http://hts.sp.nitech.ac.jp/
We tested training sets of 20, 40, 100, 150, 200, 250, 300 and 350 sentences, denoted S20, ..., S350.


Acknowledgements

This work is supported by the National Natural Science Foundation of China (61175018), the Natural Science Basic Research Plan of Shaanxi Province (2011JM8009) and the Fok Ying Tung Education Foundation (131059).

Author information

Authors and Affiliations

School of Computer Science, Northwestern Polytechnical University, Xi’an, China
Lei Xie, Naicai Sun & Bo Fan


Corresponding author

Correspondence to Lei Xie.


About this article

Cite this article

Xie, L., Sun, N. & Fan, B. A statistical parametric approach to video-realistic text-driven talking avatar. Multimed Tools Appl 73, 377–396 (2014). https://doi.org/10.1007/s11042-013-1633-3


Published: 15 August 2013
Issue Date: November 2014
DOI: https://doi.org/10.1007/s11042-013-1633-3
