Expressive Speech Synthesis: Past, Present, and Possible Futures

  • Marc Schröder

Abstract

Approaches to adding expressivity to synthetic speech have changed considerably over the last 20 years. Early systems, including formant and diphone systems, focused on "explicit control" models, whereas early unit selection systems adopted a "playback" approach. Currently, various approaches are being pursued to increase flexibility of expression while maintaining the quality of state-of-the-art systems, among them a new "implicit control" paradigm in statistical parametric speech synthesis, which provides control over expressivity by combining and interpolating between statistical models trained on different expressive databases. The present chapter provides an overview of past and present approaches and ventures a look into possible future developments.
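The "implicit control" idea mentioned above can be illustrated with a minimal sketch: in statistical parametric synthesis, the state output distributions of models trained on different expressive databases can be linearly interpolated to obtain intermediate expressive styles. The sketch below assumes diagonal-covariance Gaussian state models and uses hypothetical names and toy values; it is not the method of any specific system described in the chapter.

```python
import numpy as np

def interpolate_states(means_a, vars_a, means_b, vars_b, alpha):
    """Linearly interpolate per-state Gaussian parameters between two
    expressive models. alpha=0.0 reproduces model A, alpha=1.0 model B.
    (Illustrative only; real systems interpolate full acoustic models.)"""
    mu = (1.0 - alpha) * means_a + alpha * means_b
    var = (1.0 - alpha) * vars_a + alpha * vars_b
    return mu, var

# Toy example: one-dimensional "models" of an F0 feature (Hz),
# standing in for, e.g., a neutral and an angry voice model.
mu_neutral, var_neutral = np.array([120.0]), np.array([100.0])
mu_angry, var_angry = np.array([180.0]), np.array([400.0])

# Halfway interpolation yields an intermediate expressive style.
mu_mid, var_mid = interpolate_states(mu_neutral, var_neutral,
                                     mu_angry, var_angry, alpha=0.5)
# mu_mid -> [150.0], var_mid -> [250.0]
```

Varying `alpha` continuously is what distinguishes this paradigm from the "playback" approach: expressivity becomes a gradable parameter rather than a fixed property of a recorded database.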



Copyright information

© Springer-Verlag London Limited 2009

Authors and Affiliations

  • Marc Schröder
    1. DFKI GmbH, Saarbrücken, Germany
