Abstract
Speech synthesis is one of the most language-dependent domains of speech technology. In particular, the natural language processing stage of a text-to-speech (TTS) system holds most of the linguistic knowledge for a given language. Building a high-quality TTS system for a new language therefore involves many theoretical and technical challenges. Above all, extensive linguistic studies must exist (or be carried out) to endow the system with the most relevant language information; this is an essential condition for obtaining truly natural synthesized speech from unrestricted input text. This paper presents fundamental research and the related implementation issues in developing a complete TTS system for Romanian, emphasizing the particularities of the language and their role in improving the efficiency of the language processing stage. The first section describes our standpoint on TTS synthesis and the overall architecture of our TTS system. The next sections formulate several important tasks of the natural language processing stage (input text preprocessing, letter-to-phone conversion, acoustic database preparation) and discuss the design philosophy of the corresponding modules, implementation decisions, and evaluation experiments. A separate section is devoted to an acoustic-phonetic study that assisted phone-set selection and acoustic database generation. The paper ends with conclusions and a description of work currently in progress at other levels of the TTS system.
Cite this article
Burileanu, D. Basic Research and Implementation Decisions for a Text-to-Speech Synthesis System in Romanian. International Journal of Speech Technology 5, 211–225 (2002). https://doi.org/10.1023/A:1020236605813
DOI: https://doi.org/10.1023/A:1020236605813