Segmental Pitch Control Using Speech Input Based on Differential Contexts and Features for Customizable Neural Speech Synthesis

  • Shinya Hanabusa
  • Takashi Nose
  • Akinori Ito
Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 110)

Abstract

This paper proposes a technique for controlling the pitch of synthetic speech at the segmental level using user input speech, within a framework of speech synthesis based on deep neural networks (DNNs). In a previous study, we proposed tailor-made speech synthesis, a technique that enables users to control synthetic speech naturally and intuitively, and introduced differential fundamental frequency (F0) contexts into speaker-model training for DNN-based speech synthesis. A differential F0 context represents the relative log F0 of the training data at the segmental level. In this study, we use the user's input speech to determine the F0 contexts for the synthetic speech. This approach allows users to modify and control segmental pitch more flexibly, which enhances the performance of tailor-made speech synthesis.
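The abstract's central idea — a discrete, segment-level context derived from relative log F0 — can be illustrated with a small sketch. The function below is a hypothetical illustration, not the authors' implementation: it takes a frame-wise F0 contour (e.g. as extracted by STRAIGHT [1]), computes each segment's mean log F0 relative to the utterance mean, and quantizes the difference into a small set of context labels. The function name, the number of levels, and the quantization step are all assumptions for illustration.

```python
import numpy as np

def differential_f0_contexts(f0_hz, segments, n_levels=5, step=0.1):
    """Assign each segment a discrete relative log-F0 context label.

    f0_hz:    1-D array of frame-wise F0 values in Hz (0 = unvoiced).
    segments: list of (start_frame, end_frame) pairs, one per segment.
    n_levels: number of quantized labels (assumed odd, centered on 0).
    step:     assumed quantization step in log F0 between adjacent labels.
    """
    # Log F0 with unvoiced frames masked out as NaN.
    log_f0 = np.where(f0_hz > 0, np.log(np.maximum(f0_hz, 1e-8)), np.nan)
    utt_mean = np.nanmean(log_f0)  # utterance-level mean log F0
    half = n_levels // 2
    contexts = []
    for start, end in segments:
        seg_mean = np.nanmean(log_f0[start:end])
        if np.isnan(seg_mean):        # fully unvoiced segment: neutral label
            contexts.append(0)
            continue
        diff = seg_mean - utt_mean    # relative log F0 of this segment
        contexts.append(int(np.clip(round(diff / step), -half, half)))
    return contexts
```

At synthesis time, labels produced this way from the user's input speech would replace the contexts estimated from training data, steering the DNN toward the user's intended segmental pitch pattern.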

Keywords

DNN-based speech synthesis · Tailor-made speech synthesis · Prosody control · Differential F0 context · User speech input

Notes

Acknowledgment

Part of this work was supported by JSPS KAKENHI Grant Numbers JP16K13253 and JP17H00823.

References

  1. Kawahara, H., Masuda-Katsuse, I., de Cheveigné, A.: Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds. Speech Commun. 27(3–4), 187–207 (1999)
  2. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  3. Maeno, Y., Nose, T., Kobayashi, T., Koriyama, T., Ijima, Y., Nakajima, H., Mizuno, H., Yoshioka, O.: Prosodic variation enhancement using unsupervised context labeling for HMM-based expressive speech synthesis. Speech Commun. 57, 144–154 (2014)
  4. Nishigaki, Y., Takamichi, S., Toda, T., Neubig, G., Sakti, S., Nakamura, S.: Prosody-controllable HMM-based speech synthesis using speech input. In: Proceedings of the MLSLP (2015)
  5. Nose, T., Kato, Y., Kobayashi, T.: Style estimation of speech based on multiple regression hidden semi-Markov model. In: Proceedings of the INTERSPEECH, pp. 2285–2288 (2007)
  6. Watts, O., Wu, Z., King, S.: Sentence-level control vectors for deep neural network speech synthesis. In: Proceedings of the INTERSPEECH, pp. 2217–2221 (2015)
  7. Yamada, S., Nose, T., Ito, A.: A study on tailor-made speech synthesis based on deep neural networks. In: Proceedings of the IIH-MSP, pp. 159–166 (2017)
  8. Zen, H., Senior, A., Schuster, M.: Statistical parametric speech synthesis using deep neural networks. In: Proceedings of the ICASSP, pp. 7962–7966 (2013)

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Graduate School of Engineering, Tohoku University, Aoba-ku, Sendai-shi, Japan