Abstract
In this chapter, we present two methods, fused HMM inversion and unit selection, for speech-driven facial animation. The framework systematically addresses audiovisual data acquisition, expressive trajectory analysis, and audiovisual mapping. Within this framework, we learn the correlation between neutral and expressive facial deformation with a Gaussian Mixture Model (GMM). A hierarchical structure is proposed to map acoustic parameters to lip FAPs (facial animation parameters), and the synthesized neutral FAP streams are then extended with expressive variations according to the prosody of the input speech. Quantitative evaluation of the experimental results is encouraging, and the synthesized face shows realistic quality.
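The chapter's full pipeline is not reproduced on this page, but a minimal sketch of the GMM-based mapping between neutral and expressive facial deformation described in the abstract could look as follows. All names, dimensions, the synthetic training data, and the use of scikit-learn are illustrative assumptions, not the authors' implementation: a joint GMM is fitted over concatenated neutral/expressive feature vectors, and the expressive deformation is then estimated as the conditional expectation given a neutral input.

```python
# Hedged sketch: GMM regression from neutral to expressive deformation.
# Everything here (dimensions, data, library choice) is an assumption.
import numpy as np
from sklearn.mixture import GaussianMixture

D = 10  # assumed dimensionality of the FAP-based deformation vectors

# Hypothetical paired training data: aligned frames of neutral (X) and
# expressive (Y) facial-deformation parameters (synthetic stand-ins here).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, D))
Y = 1.5 * X + rng.normal(scale=0.1, size=(500, D))

# Fit a joint GMM over concatenated [neutral, expressive] vectors so the
# component covariances capture the cross-correlation between the two.
gmm = GaussianMixture(n_components=4, covariance_type="full", random_state=0)
gmm.fit(np.hstack([X, Y]))

def predict_expressive(x):
    """MMSE estimate E[y|x]: a responsibility-weighted sum of the
    per-component conditional means mu_y + Syx Sxx^{-1} (x - mu_x)."""
    log_resp = np.empty(gmm.n_components)
    cond_means = np.empty((gmm.n_components, D))
    for k in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[k, :D], gmm.means_[k, D:]
        S = gmm.covariances_[k]
        Sxx, Syx = S[:D, :D], S[D:, :D]
        diff = x - mu_x
        # Responsibility p(k|x) from the marginal over x (log domain;
        # constant terms cancel after normalization).
        _, logdet = np.linalg.slogdet(Sxx)
        log_resp[k] = np.log(gmm.weights_[k]) - 0.5 * (
            logdet + diff @ np.linalg.solve(Sxx, diff))
        cond_means[k] = mu_y + Syx @ np.linalg.solve(Sxx, diff)
    resp = np.exp(log_resp - log_resp.max())
    return (resp / resp.sum()) @ cond_means

y_hat = predict_expressive(X[0])  # expressive deformation for one frame
```

The same conditional-expectation machinery generalizes to any paired feature streams; in the chapter's setting the inputs would come from the audiovisual database rather than the synthetic data used above.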
Copyright information
© 2009 Springer-Verlag London Limited
About this chapter
Cite this chapter
Tao, J., Yin, P., Xin, L. (2009). Face Animation Based on Large Audiovisual Database. In: Tao, J., Tan, T. (eds) Affective Information Processing. Springer, London. https://doi.org/10.1007/978-1-84800-306-4_11
DOI: https://doi.org/10.1007/978-1-84800-306-4_11
Publisher Name: Springer, London
Print ISBN: 978-1-84800-305-7
Online ISBN: 978-1-84800-306-4
eBook Packages: Computer Science, Computer Science (R0)