Czech Speech Synthesis with Generative Neural Vocoder

Vít, Jakub; Hanzlíček, Zdeněk; Matoušek, Jindřich

doi:10.1007/978-3-030-27947-9_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11697)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

In recent years, new neural architectures that generate high-quality synthetic speech sample by sample have been introduced. We describe our application of statistical parametric speech synthesis, based on LSTM neural networks combined with a generative neural vocoder, to the Czech language. We used a traditional LSTM architecture to generate the vocoder parametrization from linguistic features, and we replaced the standard vocoder with a WaveRNN neural network. We conducted a MUSHRA listening test to compare the proposed approach with unit selection and with LSTM-based parametric speech synthesis using a standard vocoder. In contrast with our previous work, we managed to outperform a well-tuned unit selection TTS system by a large margin on both professional and amateur voices.
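As a rough illustration of how ratings from a MUSHRA-style listening test might be summarised, the sketch below computes a mean score and a normal-approximation 95% confidence half-interval per system. It is not the authors' evaluation procedure: the function name `mushra_summary`, the system labels, and all ratings are hypothetical placeholders, and the only thing assumed from the MUSHRA method is its 0 to 100 rating scale.

```python
from math import sqrt
from statistics import mean, stdev


def mushra_summary(ratings):
    """Return {system: (mean, half_width)} for 0-100 MUSHRA-style ratings.

    half_width is a normal-approximation 95% confidence half-interval
    (an assumption made for this sketch, not taken from the paper).
    """
    summary = {}
    for system, scores in ratings.items():
        if any(not 0 <= s <= 100 for s in scores):
            raise ValueError(f"scores must lie in [0, 100]: {system}")
        m = mean(scores)
        # stdev needs at least two ratings; report 0.0 otherwise.
        if len(scores) > 1:
            half = 1.96 * stdev(scores) / sqrt(len(scores))
        else:
            half = 0.0
        summary[system] = (m, half)
    return summary


# Placeholder ratings for illustration only; NOT data from the paper.
ratings = {
    "reference": [100, 95, 98, 100],
    "system_a": [70, 65, 80, 75],
    "system_b": [40, 55, 45, 50],
}

for system, (m, half) in mushra_summary(ratings).items():
    print(f"{system}: {m:.1f} +/- {half:.1f}")
```

In a real test, each listener rates every system on the same utterance, so a paired analysis would usually be more appropriate than the independent intervals shown here.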

Acknowledgment

This research was supported by the Czech Science Foundation (GA CR), project No. GA19-19324S. The work has been supported by the grant of the University of West Bohemia, project No. SGS-2019-027.

Author information

Authors and Affiliations

NTIS - New Technology for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Univerzitní 22, 306 14, Plzeň, Czech Republic
Jakub Vít, Zdeněk Hanzlíček & Jindřich Matoušek

Corresponding author

Correspondence to Jakub Vít.

Editor information

Editors and Affiliations

University of West Bohemia, Pilsen, Czech Republic
Kamil Ekštein

About this paper

Cite this paper

Vít, J., Hanzlíček, Z., Matoušek, J. (2019). Czech Speech Synthesis with Generative Neural Vocoder. In: Ekštein, K. (ed.) Text, Speech, and Dialogue. TSD 2019. Lecture Notes in Computer Science, vol 11697. Springer, Cham. https://doi.org/10.1007/978-3-030-27947-9_26

DOI: https://doi.org/10.1007/978-3-030-27947-9_26
Published: 06 August 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27946-2
Online ISBN: 978-3-030-27947-9
eBook Packages: Computer Science, Computer Science (R0)
