Basic Research and Implementation Decisions for a Text-to-Speech Synthesis System in Romanian

International Journal of Speech Technology

Abstract

Speech synthesis is one of the most language-dependent domains of speech technology. In particular, the natural language processing stage of a text-to-speech (TTS) system holds most of the linguistic knowledge for a given language. Building a high-quality TTS system for a new language therefore involves many theoretical and technical challenges. Above all, extensive linguistic studies must exist (or be carried out) to endow the system with the most relevant language information; this is an essential condition for obtaining truly natural synthesized speech from unrestricted input text. This paper presents fundamental research and the related implementation issues in developing a complete TTS system for Romanian, emphasizing the particularities of the language and their influence on improving the efficiency of the language processing stage. The first section describes our standpoint on TTS synthesis as well as the overall architecture of our TTS system. The next sections formulate several important tasks of the natural language processing stage (input text preprocessing, letter-to-phone conversion, acoustic database preparation) and discuss the design philosophy of the corresponding modules, implementation decisions, and evaluation experiments. A separate section is devoted to an acoustic-phonetic study that assisted the phone-set selection and acoustic database generation. The paper ends with conclusions and a description of the work currently in progress at other levels of the TTS system.




Burileanu, D. Basic Research and Implementation Decisions for a Text-to-Speech Synthesis System in Romanian. International Journal of Speech Technology 5, 211–225 (2002). https://doi.org/10.1023/A:1020236605813


  • DOI: https://doi.org/10.1023/A:1020236605813
