Multimedia Tools and Applications, Volume 74, Issue 22, pp 9889–9907

Acoustic to articulatory mapping with deep neural network

  • Zhiyong Wu
  • Kai Zhao
  • Xixin Wu (Email author)
  • Xinyu Lan
  • Helen Meng


Synthetic talking avatars have been shown to be very useful in human-computer interaction. In this paper, we address the problem of acoustic to articulatory mapping and explore several kinds of models for describing the mapping function: the general linear model (GLM), the Gaussian mixture model (GMM), the artificial neural network (ANN) and the deep neural network (DNN). Because the prediction stage of a neural network can be completed in a very short time (e.g. in real time), we develop a real-time speech driven talking avatar system based on the DNN. The input of the system is acoustic speech, and the output is articulatory movements, synchronized with the input speech, on a three-dimensional avatar. Several experiments compare the performance of GLM, GMM, ANN and DNN on MNGU0, a well-known acoustic-articulatory English speech corpus. The experimental results show that the proposed DNN-based acoustic to articulatory mapping method achieves the best performance.


Keywords: Acoustic to articulatory mapping; Audio-visual mapping; Deep neural network (DNN); Speech driven talking avatar



This work is supported by the National Basic Research Program of China (2012CB316401 and 2013CB329304). This work is also partially supported by the Hong Kong SAR Government’s Research Grants Council (N-CUHK414/09), the National Natural Science Foundation of China (61375027, 61370023 and 60805008), the National Social Science Foundation Major Project (13&ZD189) and Guangdong Provincial Science and Technology Program (2012A011100008).



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Zhiyong Wu (1, 2, 3)
  • Kai Zhao (1, 3)
  • Xixin Wu (1, 3), Email author
  • Xinyu Lan (1, 3)
  • Helen Meng (1, 2)
  1. Tsinghua-CUHK Joint Research Center for Media Sciences, Technologies and Systems, and Shenzhen Key Laboratory of Information Science and Technology, Graduate School at Shenzhen, Tsinghua University, Shenzhen, China
  2. Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, China
  3. Tsinghua National Laboratory for Information Science and Technology (TNList), and Department of Computer Science and Technology, Tsinghua University, Beijing, China
