Polish unit selection speech synthesis with BOSS: extensions and speech corpora

  • Grażyna Demenko
  • Katarzyna Klessa
  • Marcin Szymański
  • Stefan Breuer
  • Wolfgang Hess
Article

Abstract

This article presents research and development aimed at creating a Polish speech database for speech synthesis and at adapting BOSS (the Bonn Open Synthesis System) to the Polish language. It first presents the linguistic background for the design of Polish spoken resources for unit selection, together with the transcription and annotation methods applied. The next section details the assumptions behind the Polish corpus, its structure, and its segmental and prosodic annotation. The article then reports the linguistic features used for duration modelling and for the selection of adequate speech units in two Polish modules of BOSS: the duration prediction module (described together with a concise overview of Polish duration modelling for speech technology purposes) and the cost functions module. Finally, the results of two kinds of perception tests are discussed. The first is a preference test evaluating synthesized speech obtained with three variants of speech signal segmentation (automatic, semi-automatic and manual); the second is a mean opinion score test carried out to provide a preliminary assessment of the synthesized speech quality attained with the Polish version of the BOSS synthesizer. The closing section summarizes future perspectives and challenges for Polish TTS (text-to-speech) and further developments of BOSS for Polish.

Keywords

Speech synthesis; Speech corpora; Duration modeling; Unit selection; Speech segmentation; Polish; BOSS

Copyright information

© Springer Science+Business Media, LLC 2010

Authors and Affiliations

  • Grażyna Demenko (1)
  • Katarzyna Klessa (1)
  • Marcin Szymański (2)
  • Stefan Breuer (3)
  • Wolfgang Hess (3)
  1. Instytut Językoznawstwa, Uniwersytet im. Adama Mickiewicza, Poznań, Poland
  2. Laboratorium Zintegrowanych Systemów Przetwarzania Języka i Mowy, Poznańskie Centrum Superkomputerowo-Sieciowe, Instytut Chemii Bioorganicznej PAN, Poznań, Poland
  3. Institut für Kommunikationswissenschaften, Abteilung Sprache und Kommunikation, Rheinische Friedrich-Wilhelms-Universität, Bonn, Germany