Phone-Level Embeddings for Unit Selection Speech Synthesis

Perquin, Antoine; Lecorvé, Gwénolé; Lolive, Damien; Amsaleg, Laurent

doi:10.1007/978-3-030-00810-9_3

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11171)

Included in the following conference series:

International Conference on Statistical Language and Speech Processing

Abstract

Deep neural networks have become the state of the art in speech synthesis. They have been used to directly predict signal parameters or to provide unsupervised speech segment descriptions through embeddings. In this paper, we present four models, two of which enable us to extract phone-level embeddings for unit selection speech synthesis. Three of the models rely on a feed-forward DNN and the fourth on an LSTM. The resulting embeddings make it possible to replace the usual expert-based target costs with a Euclidean distance in the embedding space. This work is conducted on a French corpus consisting of an 11 h audiobook. Perceptual tests show that the produced speech is preferred over that of a unit selection method where the target cost is defined by an expert. They also show that the embeddings are general enough to be used for different speech styles without quality loss. Furthermore, objective measures and a perceptual test on statistical parametric speech synthesis show that our models perform comparably to state-of-the-art models for parametric signal generation, in spite of necessary simplifications, namely late time integration and information compression.
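
As a rough illustration of the idea of using a Euclidean distance in the embedding space as a target cost, the following minimal Python sketch ranks candidate units by their distance to a target embedding. It is not the authors' implementation: the function names `target_cost` and `rank_candidates` are hypothetical, the three-dimensional vectors are made-up toy values, and the sketch leaves out the concatenation cost and the search over unit sequences that a full unit selection system would also need.

```python
import math
from typing import List, Sequence

Embedding = Sequence[float]


def target_cost(target: Embedding, candidate: Embedding) -> float:
    """Euclidean distance between a target embedding and a candidate unit's embedding.

    Hypothetical helper: the paper only states that the target cost is a
    Euclidean distance in the embedding space.
    """
    if len(target) != len(candidate):
        raise ValueError("embeddings must have the same dimension")
    return math.sqrt(sum((t - c) ** 2 for t, c in zip(target, candidate)))


def rank_candidates(target: Embedding, candidates: List[Embedding]) -> List[int]:
    """Return candidate indices ordered from lowest to highest target cost."""
    costs = [target_cost(target, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: costs[i])


# Toy values, chosen only for illustration (not taken from the paper).
target = [0.0, 0.0, 0.0]
candidates = [
    [3.0, 4.0, 0.0],  # distance 5.0
    [1.0, 0.0, 0.0],  # distance 1.0
    [0.0, 2.0, 0.0],  # distance 2.0
]
ranking = rank_candidates(target, candidates)
print(ranking)  # candidate indices, closest first
```

In this sketch the candidate closest to the target embedding gets the lowest cost, so no hand-tuned feature weights are involved in the ranking.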

Acknowledgments

This study was carried out under the ANR (French National Research Agency) project SynPaFlex ANR-15-CE23-0015.

Author information

Authors and Affiliations

Univ Rennes, CNRS, IRISA, Lannion, France
Antoine Perquin, Gwénolé Lecorvé & Damien Lolive
Univ Rennes, CNRS, INRIA, IRISA, Rennes, France
Laurent Amsaleg

Corresponding author

Correspondence to Antoine Perquin.

Editor information

Editors and Affiliations

University of Mons, Mons, Belgium
Thierry Dutoit
Rovira i Virgili University, Tarragona, Spain
Carlos Martín-Vide
University of Mons, Mons, Belgium
Gueorgui Pironkov

About this paper

Cite this paper

Perquin, A., Lecorvé, G., Lolive, D., Amsaleg, L. (2018). Phone-Level Embeddings for Unit Selection Speech Synthesis. In: Dutoit, T., Martín-Vide, C., Pironkov, G. (eds) Statistical Language and Speech Processing. SLSP 2018. Lecture Notes in Computer Science, vol 11171. Springer, Cham. https://doi.org/10.1007/978-3-030-00810-9_3

DOI: https://doi.org/10.1007/978-3-030-00810-9_3
Published: 19 September 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00809-3
Online ISBN: 978-3-030-00810-9
eBook Packages: Computer Science, Computer Science (R0)
