Two-Stage Sequence-to-Sequence Neural Voice Conversion with Low-to-High Definition Spectrogram Mapping

Miyamoto, Sou; Nose, Takashi; Hiroshiba, Kazuyuki; Odagiri, Yuri; Ito, Akinori

doi:10.1007/978-3-030-03748-2_16

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 110)

Included in the following conference series:

International Conference on Intelligent Information Hiding and Multimedia Signal Processing

Abstract

In this study, we propose a two-stage voice conversion technique realized with two models, U-Net and pix2pix. With U-Net, we aim to reproduce the intonation of a target speaker by converting low-dimensional features while taking the time direction into account. We introduce pix2pix for the task of spectrogram enhancement: it is trained to map a low-definition spectrogram to a high-definition spectrogram (low-to-high spectrogram mapping). The low-definition spectrogram is reconstructed from the low-dimensional mel-cepstrum converted by U-Net, and the high-definition spectrogram is extracted from natural speech. Objective evaluations showed that the proposed method improved mel-cepstral distance (MCD) and log F0 RMSE. Subjective evaluations indicated that the proposed method improved speech individuality to some extent while maintaining the same level of naturalness as the conventional method.
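The abstract names two objective measures, MCD and log F0 RMSE, without giving their formulas. The sketch below assumes the definitions commonly used in the voice conversion literature: MCD in dB computed per frame as (10 / ln 10) · sqrt(2 · Σ (c_d − c'_d)²) with the 0th (energy) coefficient excluded, and log F0 RMSE computed only over frames that are voiced in both sequences. The function names `mel_cepstral_distance` and `log_f0_rmse` are hypothetical, the inputs are assumed to be time-aligned already, and the paper's exact evaluation settings may differ.

```python
import math


def mel_cepstral_distance(ref_frames, conv_frames):
    """Mean MCD in dB over time-aligned frames (assumed common definition).

    Each frame is a list of mel-cepstral coefficients; index 0 (energy)
    is skipped, as is conventional.
    """
    if len(ref_frames) != len(conv_frames) or not ref_frames:
        raise ValueError("sequences must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        squared = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


def log_f0_rmse(ref_f0, conv_f0):
    """RMSE of natural-log F0 over frames voiced (F0 > 0) in both sequences."""
    diffs = [
        math.log(r) - math.log(c)
        for r, c in zip(ref_f0, conv_f0)
        if r > 0 and c > 0
    ]
    if not diffs:
        raise ValueError("no frames are voiced in both sequences")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


if __name__ == "__main__":
    # Toy values for illustration only; they are not data from the paper.
    target = [[0.0, 1.0, 0.5], [0.0, 0.8, 0.4]]
    converted = [[0.3, 0.9, 0.5], [0.1, 0.8, 0.3]]
    print(mel_cepstral_distance(target, converted))
    print(log_f0_rmse([120.0, 0.0, 130.0], [118.0, 110.0, 135.0]))
```

Lower values of both measures indicate that the converted features are closer to the target speaker's. Restricting log F0 RMSE to frames voiced in both sequences avoids taking the logarithm of zero for unvoiced frames.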

Notes

1. https://github.com/k2kobayashi/sprocket

Acknowledgment

Part of this work was supported by JSPS KAKENHI Grant Numbers JP16K13253 and JP17H00823.

Author information

Authors and Affiliations

Graduate School of Engineering, Tohoku University, Aramaki Aza Aoba 6–6–05, Aoba-ku, Sendai-shi, Miyagi, 980–8579, Japan
Sou Miyamoto, Takashi Nose & Akinori Ito
DWANGO Co., Ltd., KABUKIZA TOWER, 4–12–15 Ginza, Chuo-ku, Tokyo, 104–0061, Japan
Kazuyuki Hiroshiba & Yuri Odagiri

Corresponding author

Correspondence to Akinori Ito.

Editor information

Editors and Affiliations

College of Information Science and Engineering, Fujian University of Technology, Fuzhou, Fujian, China
Jeng-Shyang Pan
Graduate School of Engineering, Tohoku University, Sendai, Miyagi, Japan
Akinori Ito
Swinburne University of Technology, Hawthorn, VIC, Australia
Pei-Wei Tsai
Centre for Artificial Intelligence, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW, Australia
Lakhmi C. Jain

Cite this paper

Miyamoto, S., Nose, T., Hiroshiba, K., Odagiri, Y., Ito, A. (2019). Two-Stage Sequence-to-Sequence Neural Voice Conversion with Low-to-High Definition Spectrogram Mapping. In: Pan, JS., Ito, A., Tsai, PW., Jain, L. (eds) Recent Advances in Intelligent Information Hiding and Multimedia Signal Processing. IIH-MSP 2018. Smart Innovation, Systems and Technologies, vol 110. Springer, Cham. https://doi.org/10.1007/978-3-030-03748-2_16

DOI: https://doi.org/10.1007/978-3-030-03748-2_16
Published: 11 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-03747-5
Online ISBN: 978-3-030-03748-2
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
