Articulatory Speech Re-synthesis: Profiting from Natural Acoustic Speech Data

  • Dominik Bauer
  • Jim Kannampuzha
  • Bernd J. Kröger
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5641)


The quality of static phones (e.g. vowels, fricatives, nasals, laterals) generated by articulatory speech synthesizers has reached a high level in recent years. Our goal is to extend this quality to dynamic speech, i.e. whole syllables, words, and utterances, by re-synthesizing natural acoustic speech data. Re-synthesis means that vocal tract action units (articulatory gestures), which describe the succession of speech movements, are adapted spatio-temporally to a natural speech signal produced by a "model speaker" of Standard German. This adaptation is performed with the software tool SAGA (Sound and Articulatory Gesture Alignment), which is currently under development in our lab. The resulting action unit scores are stored in a database and serve as input for our articulatory speech synthesizer. The technique is intended as the basis for a future unit-selection approach to articulatory speech synthesis.
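The temporal part of such an adaptation can be illustrated with a minimal sketch. It is not the SAGA implementation, which the abstract does not describe. It assumes that a gestural score is a list of action units with onset and offset times, and that paired landmark times (e.g. segment boundaries) are available for both the synthetic and the natural signal. The names `Gesture`, `warp_time`, and `align_score` are hypothetical, and the piecewise-linear time warp is an assumed stand-in for the actual alignment procedure. Spatial adaptation of gesture targets is not covered.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Gesture:
    """One vocal tract action unit (hypothetical representation)."""
    name: str      # e.g. "labial closure"
    onset: float   # start time in seconds
    offset: float  # end time in seconds


def warp_time(t, src, dst):
    """Map time t piecewise-linearly from src landmarks onto dst landmarks.

    src and dst are equally long, strictly increasing lists of landmark times.
    Times outside the landmark range are extrapolated from the nearest segment.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two paired landmarks")
    # Index of the segment containing t, clamped to the first/last segment.
    i = min(max(bisect_right(src, t) - 1, 0), len(src) - 2)
    scale = (dst[i + 1] - dst[i]) / (src[i + 1] - src[i])
    return dst[i] + (t - src[i]) * scale


def align_score(score, synth_landmarks, natural_landmarks):
    """Return a new score whose gesture timings follow the natural landmarks."""
    return [
        Gesture(
            g.name,
            warp_time(g.onset, synth_landmarks, natural_landmarks),
            warp_time(g.offset, synth_landmarks, natural_landmarks),
        )
        for g in score
    ]


if __name__ == "__main__":
    score = [Gesture("labial closure", 0.00, 0.10),
             Gesture("vocalic /a/", 0.05, 0.30)]
    synth = [0.0, 0.1, 0.3]     # landmark times in the synthetic signal
    natural = [0.0, 0.2, 0.3]   # same landmarks in the natural signal
    for g in align_score(score, synth, natural):
        print(g)
```

In this sketch only the timing changes: each gesture keeps its identity, and its onset and offset are moved so that the synthetic landmarks coincide with the natural ones.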


Keywords: speech, articulatory speech synthesis, articulation, re-synthesis, vocal tract action units





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Dominik Bauer (1)
  • Jim Kannampuzha (1)
  • Bernd J. Kröger (1)
  1. Department of Phoniatrics, Pedaudiology, and Communication Disorders, University Hospital Aachen and RWTH Aachen University, Aachen, Germany
