Handling Two Difficult Challenges for Text-to-Speech Synthesis Systems: Out-of-Vocabulary Words and Prosody: A Case Study in Romanian

Chapter

Abstract

Because text-to-speech (TTS) synthesis must handle unrestricted input in an increasingly multilingual environment, it is often hampered by out-of-vocabulary (OOV) words. OOV words have many sources: technical terms, proper nouns, rare words not covered by the lexicon, and foreign words that have been only partially adapted morphologically. The last of these is a problem that non-English TTS synthesis systems confront particularly often. Furthermore, to derive natural speech from arbitrary text, every word in an utterance must undergo a series of complex processes: diacritic restoration, part-of-speech tagging, expansion to pronounceable form, syllabification, lexical stress prediction, and letter-to-sound conversion. OOV words require methods that are both automatic and trainable to perform these tasks, and such methods usually rely on a limited lexical context. The exceptions are cases in which the part of speech and the surrounding words serve as discriminative features, as in homograph disambiguation and abbreviation expansion. In this chapter we introduce the basic architecture of a generic natural language processing module for TTS synthesis, propose data-driven solutions to its various tasks, and compare our results on OOV words and prosody modeling with those of current state-of-the-art TTS synthesis systems.
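The sequence of processes listed in the abstract can be pictured as a chain of annotation stages applied to each word. The following is only a minimal sketch under stated assumptions, not the chapter's implementation: the names `Token`, `STAGE_NAMES`, `make_placeholder_stage`, and `run_pipeline` are hypothetical, whitespace tokenization is a simplification, and every stage is a placeholder that records a marker instead of running a trained model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Token:
    """A word plus the annotations added by each processing stage."""
    text: str
    annotations: Dict[str, str] = field(default_factory=dict)


# A stage takes the token list and returns it with extra annotations.
Stage = Callable[[List[Token]], List[Token]]

# Hypothetical stage names, in the order the abstract lists the tasks.
STAGE_NAMES = [
    "diacritic_restoration",
    "pos_tagging",
    "expansion",
    "syllabification",
    "lexical_stress",
    "letter_to_sound",
]


def make_placeholder_stage(name: str) -> Stage:
    """Build a stand-in stage; a real one would call a trained model."""
    def stage(tokens: List[Token]) -> List[Token]:
        for token in tokens:
            token.annotations[name] = "TODO"  # placeholder, not a real result
        return tokens
    return stage


def run_pipeline(text: str, stages: List[Stage]) -> List[Token]:
    """Split on whitespace (an assumption) and apply the stages in order."""
    tokens = [Token(word) for word in text.split()]
    for stage in stages:
        tokens = stage(tokens)
    return tokens


if __name__ == "__main__":
    stages = [make_placeholder_stage(name) for name in STAGE_NAMES]
    for token in run_pipeline("o propoziție", stages):
        print(token.text, list(token.annotations))
```

Keeping each task behind the same function signature means that a placeholder can be swapped for a data-driven component without touching the rest of the chain, which is one plausible way to organize such a module.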

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Research Institute for Artificial Intelligence, Romanian Academy Center for Artificial Intelligence (RACAI), Bucharest, Romania
  2. Institute for Intelligent Systems, The University of Memphis, Memphis, USA
