
Phone-Level Embeddings for Unit Selection Speech Synthesis

  • Antoine Perquin
  • Gwénolé Lecorvé
  • Damien Lolive
  • Laurent Amsaleg
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11171)

Abstract

Deep neural networks have become the state of the art in speech synthesis. They have been used to directly predict signal parameters or to provide unsupervised descriptions of speech segments through embeddings. In this paper, we present four models, two of which enable us to extract phone-level embeddings for unit selection speech synthesis. Three of the models rely on a feed-forward DNN, the last one on an LSTM. The resulting embeddings allow the usual expert-defined target cost to be replaced by a Euclidean distance in the embedding space. This work is conducted on a French corpus consisting of an 11-hour audiobook. Perceptual tests show that the produced speech is preferred over a unit selection method whose target cost is defined by an expert. They also show that the embeddings are general enough to be used for different speech styles without loss of quality. Furthermore, objective measures and a perceptual test on statistical parametric speech synthesis show that our models perform comparably to state-of-the-art models for parametric signal generation, despite necessary simplifications, namely late time integration and information compression.
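For illustration, the following is a minimal sketch (not the authors' implementation) of the idea of replacing an expert-defined target cost with a Euclidean distance between phone-level embeddings. The helper functions and the random toy embeddings are assumptions standing in for embeddings produced by the paper's feed-forward DNN or LSTM models.

```python
# Sketch only: Euclidean-distance target cost over phone-level embeddings.
# The embedding arrays below are random placeholders; in the paper they would
# come from the trained DNN/LSTM models described in the abstract.
import numpy as np

def target_cost(target_embedding: np.ndarray, candidate_embedding: np.ndarray) -> float:
    """Euclidean distance in the embedding space, replacing expert-defined sub-costs."""
    return float(np.linalg.norm(target_embedding - candidate_embedding))

def rank_candidates(target_embedding: np.ndarray, unit_embeddings: np.ndarray) -> np.ndarray:
    """Return candidate unit indices sorted by ascending target cost."""
    costs = np.linalg.norm(unit_embeddings - target_embedding, axis=1)
    return np.argsort(costs)

# Toy usage: 64-dimensional embeddings for 1000 candidate units of one phone class.
rng = np.random.default_rng(0)
unit_embeddings = rng.normal(size=(1000, 64))   # embeddings extracted from the speech corpus
target_embedding = rng.normal(size=64)          # embedding predicted for the target phone
best_units = rank_candidates(target_embedding, unit_embeddings)[:10]
```

In a full unit selection system, this target cost would be combined with a concatenation cost and optimised over the whole utterance (e.g. by Viterbi search); the sketch only covers the target-cost side that the embeddings replace.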

Keywords

Speech synthesis · Unit selection · Embedding

Notes

Acknowledgments

This study was carried out under the ANR (French National Research Agency) project SynPaFlex ANR-15-CE23-0015.


Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Antoine Perquin (1)
  • Gwénolé Lecorvé (1)
  • Damien Lolive (1)
  • Laurent Amsaleg (2)
  1. Univ Rennes, CNRS, IRISA, Lannion, France
  2. Univ Rennes, CNRS, INRIA, IRISA, Rennes, France
