DNN-Based Cross-Lingual Voice Conversion Using Bottleneck Features

  • M. Kiran Reddy
  • K. Sreenivasa Rao


Cross-lingual voice conversion (CLVC) is challenging because the source and target speakers speak different languages. It is essential for applications such as mixed-language speech synthesis and the customization of speaking devices. This paper proposes a deep neural network (DNN)-based approach to CLVC that uses bottleneck features. In the proposed method, the speaker-independent information in speech signals from different languages is represented by bottleneck features extracted from a deep autoencoder. A DNN model is then trained to learn the mapping between the bottleneck features and the corresponding spectral features of the target speaker. The approach captures the speaker-specific characteristics of the target speaker and requires no speech data from the source speaker during training. Performance is evaluated on data from three Indian languages: Telugu, Tamil, and Malayalam. Experimental results show that the proposed method effectively converts the source speaker's voice to the target speaker's voice in a cross-lingual scenario.
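The pipeline the abstract describes has two stages: the encoder half of a deep autoencoder maps spectral frames to speaker-independent bottleneck features, and a separate DNN maps those bottleneck features to the target speaker's spectral features. A minimal structural sketch of that two-stage forward pass is below; all dimensions, layer counts, and weights are illustrative assumptions (random and untrained here), not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical dimensions: 60-dim spectral frames, a 24-dim bottleneck.
SPEC_DIM, BN_DIM, HID = 60, 24, 128

# Encoder half of the deep autoencoder. In the paper these weights would be
# trained on multi-speaker, multi-lingual speech; here they are random.
W_enc1 = rng.normal(0.0, 0.1, (SPEC_DIM, HID))
W_enc2 = rng.normal(0.0, 0.1, (HID, BN_DIM))

def bottleneck_features(frames):
    """Spectral frames -> speaker-independent bottleneck features."""
    return relu(relu(frames @ W_enc1) @ W_enc2)

# Conversion DNN: bottleneck features -> target-speaker spectral frames.
# Trained only on target-speaker data, so no source-speaker speech is needed.
W_map1 = rng.normal(0.0, 0.1, (BN_DIM, HID))
W_map2 = rng.normal(0.0, 0.1, (HID, SPEC_DIM))

def convert(frames):
    bn = bottleneck_features(frames)   # stage 1: speaker-independent space
    return relu(bn @ W_map1) @ W_map2  # stage 2: target-speaker spectra

src = rng.normal(size=(100, SPEC_DIM))  # 100 source-speaker frames
out = convert(src)
print(out.shape)  # (100, 60): one converted spectral frame per input frame
```

The key design point is that the bottleneck layer is shared across languages and speakers, so at conversion time any source speaker's frames can be pushed through the same two stages.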


Keywords: Cross-lingual voice conversion · Deep autoencoder · Deep neural network · Gaussian mixture model


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India
