A statistical approach to lexical stress assignment for TTS synthesis

  • Catalin Ungurean
  • Dragos Burileanu
  • Aurelian Dervis

Abstract

Lexical stress is essential for generating the correct pronunciation of words in many languages; its correct placement is therefore a major task in prosody prediction and generation for high-quality text-to-speech (TTS) synthesis systems. This paper proposes a statistical approach to lexical stress assignment for TTS synthesis in Romanian. The method is based on character-level n-gram language models and uses a modified Katz backoff smoothing technique to address data sparseness during training. Monosyllabic words are treated as not carrying stress and are separated out by an automatic syllabification algorithm. A maximum accuracy of 99.11% was obtained on a test corpus of about 47,000 words.

Keywords: Text-to-speech synthesis; Prosody prediction; Lexical stress assignment; n-grams; Smoothing technique
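
The idea summarized in the abstract, scoring candidate stress placements with a character-level n-gram model, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The stress marker (an apostrophe before the stressed vowel), the trigram order, the toy lexicon, and all function names are assumptions made for illustration. In place of the modified Katz backoff smoothing used in the paper, the sketch substitutes a much simpler fixed-penalty back-off that yields relative scores rather than normalized probabilities. Counting vowel letters stands in, crudely, for the automatic syllabification step.

```python
import math
from collections import Counter

N = 3            # n-gram order (assumed; not a value taken from the paper)
BACKOFF = 0.4    # fixed back-off penalty, a stand-in for Katz discounting
STRESS = "'"     # hypothetical marker written before the stressed vowel
VOWELS = set("aeiouăâî")

def pad(word):
    """Add start/end symbols so word-initial and word-final n-grams are counted."""
    return "^" * (N - 1) + word + "$"

def train(stressed_words):
    """Count every character n-gram of length 1..N in the stress-marked words."""
    counts = Counter()
    for word in stressed_words:
        s = pad(word)
        for i in range(len(s)):
            for k in range(1, N + 1):
                if i + k <= len(s):
                    counts[s[i:i + k]] += 1
    return counts

def char_score(counts, context, ch):
    """Relative frequency of ch after context, backing off to shorter contexts."""
    weight = 1.0
    while context:
        if counts[context + ch] > 0:
            return weight * counts[context + ch] / counts[context]
        weight *= BACKOFF
        context = context[1:]
    # Last resort: add-one smoothed unigram score, so the result is never zero.
    unigrams = {g: c for g, c in counts.items() if len(g) == 1}
    total = sum(unigrams.values())
    return weight * (counts[ch] + 1) / (total + len(unigrams) + 1)

def log_score(counts, marked_word):
    """Sum of log scores of each character given its N-1 predecessors."""
    s = pad(marked_word)
    return sum(math.log(char_score(counts, s[i - N + 1:i], s[i]))
               for i in range(N - 1, len(s)))

def assign_stress(counts, word):
    """Try the marker before each vowel and keep the best-scoring candidate."""
    cands = [word[:i] + STRESS + word[i:]
             for i, ch in enumerate(word) if ch in VOWELS]
    if len(cands) < 2:   # crude proxy for "monosyllabic": left unstressed
        return word
    return max(cands, key=lambda c: log_score(counts, c))

# Toy stress-marked lexicon, for illustration only (not the paper's corpus).
TOY_LEXICON = ["m'asă", "c'asă", "c'arte", "cop'il", "frum'os"]
model = train(TOY_LEXICON)

for w in ["masă", "pasă", "un"]:
    print(w, "->", assign_stress(model, w))
```

A lexicon of five words is far too small to say anything about accuracy; the sketch only shows the mechanics of generating one candidate per vowel and ranking the candidates by their n-gram score. A closer approximation of the paper's method would replace `char_score` with properly discounted Katz backoff estimates and the vowel count with a real syllabifier.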

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Catalin Ungurean (1)
  • Dragos Burileanu (1, 2)
  • Aurelian Dervis (1)

  1. Speech Technology and Human-Computer Dialogue Laboratory, Faculty of Electronics, Telecommunications and Information Technology, “Politehnica” University of Bucharest, Bucharest, Romania
  2. Romanian Academy Center for Artificial Intelligence, Bucharest, Romania
