Abstract
In recent years, new neural architectures that generate high-quality synthetic speech sample by sample have been introduced. We describe our application of statistical parametric speech synthesis for the Czech language, based on LSTM neural networks combined with a generative neural vocoder. A traditional LSTM architecture generates the vocoder parametrization from linguistic features, and a WaveRNN neural network replaces the standard vocoder. We conducted a MUSHRA listening test to compare the proposed approach with unit selection and with LSTM-based parametric speech synthesis utilizing a standard vocoder. In contrast with our previous work, the proposed system outperformed a well-tuned unit selection TTS system by a large margin on both professional and amateur voices.
Acknowledgment
This research was supported by the Czech Science Foundation (GA CR), project No. GA19-19324S. The work has been supported by the grant of the University of West Bohemia, project No. SGS-2019-027.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Vít, J., Hanzlíček, Z., Matoušek, J. (2019). Czech Speech Synthesis with Generative Neural Vocoder. In: Ekštein, K. (eds) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_26
DOI: https://doi.org/10.1007/978-3-030-27947-9_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9
eBook Packages: Computer Science, Computer Science (R0)