International Journal of Speech Technology, Volume 6, Issue 4, pp 365–377

The German Text-to-Speech Synthesis System MARY: A Tool for Research, Development and Teaching

  • Marc Schröder
  • Jürgen Trouvain


This paper introduces the German text-to-speech synthesis system MARY. The system's main features, namely a modular design and an XML-based system-internal data representation, are pointed out, and the properties of the individual modules are briefly presented. An interface allowing the user to access and modify intermediate processing steps without the need for a technical understanding of the system is described, along with examples of how this interface can be put to use in research, development and teaching. The usefulness of the modular and transparent design approach is further illustrated with an early prototype of an interface for emotional speech synthesis.

Keywords: text-to-speech, speech synthesis, markup languages, teaching in speech technology, emotions





Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Marc Schröder (1)
  • Jürgen Trouvain (2)

  1. DFKI, Saarbrücken, Germany
  2. Institute of Phonetics, Saarland University, Germany
