Current State of Text-to-Speech System ARTIC: A Decade of Research on the Field of Speech Technologies

  • Daniel Tihelka
  • Zdeněk Hanzlíček
  • Markéta Jůzová
  • Jakub Vít
  • Jindřich Matoušek
  • Martin Grůber
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11107)

Abstract

This paper surveys the current state of ARTIC, a modern Czech concatenative corpus-based text-to-speech system. Over more than a decade of research and development in speech technologies and applications, the system has been extended with new languages (and, as a consequence, language-dependent NLP methods), and its speech generation capabilities have been significantly improved as new speech generation modules (SPS, DNN, HSS) were designed and incorporated into it; this work is still ongoing. ARTIC also has to cope with varied requirements on the data used for speech generation, which differ in size, quality and domain of the output speech, while the requirement to achieve the highest quality in terms of both naturalness and intelligibility has always remained. The paper therefore summarizes some of the most significant achievements and most demanding tasks the system has had to tackle, illustrating the universality and flexibility of this Czech TTS system.

Keywords

Speech synthesis · Unit selection · Statistical-parametric synthesis · DNN · WaveNet · Hybrid synthesis · Personalized speech synthesis · Voice banking

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Daniel Tihelka (1)
  • Zdeněk Hanzlíček (1)
  • Markéta Jůzová (2)
  • Jakub Vít (2)
  • Jindřich Matoušek (1, 2)
  • Martin Grůber (1)

  1. New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
  2. Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Pilsen, Czech Republic
