
Circuits, Systems, and Signal Processing, Volume 35, Issue 4, pp 1283–1311

A Multi-level GMM-Based Cross-Lingual Voice Conversion Using Language-Specific Mixture Weights for Polyglot Synthesis

  • B. Ramani
  • M. P. Actlin Jeeva
  • P. Vijayalakshmi
  • T. Nagarajan

Abstract

For any given mixed-language text, a multilingual synthesizer produces speech that is intelligible to a human listener. However, since speech data are usually collected from native speakers to avoid foreign accent, the synthesized speech exhibits speaker switching at language-switching points. To overcome this, the multilingual speech corpus can be converted to a polyglot speech corpus using cross-lingual voice conversion, and a polyglot synthesizer can be developed. Cross-lingual voice conversion is a technique that produces utterances in the target speaker’s voice from the source speaker’s utterances, irrespective of the language and text spoken by the source and target speakers. Conventional GMM-based voice conversion techniques suffer from degradation in speech quality because the spectrum is oversmoothed by statistical averaging. The current work focuses on alleviating this oversmoothing effect in GMM-based voice conversion by using (source) language-specific mixture weights in a multi-level GMM, followed by selective pole focusing in the unvoiced speech segments. Continuity between the frames of the converted speech is ensured by fifth-order mean filtering in the cepstral domain. In the current work, cross-lingual voice conversion is performed for four regional Indian languages and one foreign language, namely Tamil, Telugu, Malayalam, Hindi, and Indian English. The performance of the system is evaluated subjectively, using an ABX listening test for speaker identity and the mean opinion score for quality. Experimental results demonstrate that the proposed method effectively improves quality and intelligibility by mitigating the oversmoothing effect in the voice-converted speech. A hidden Markov model-based polyglot text-to-speech system is also developed using this converted speech corpus, making the system suitable for unrestricted vocabulary.
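The two ingredients named in the abstract — a GMM-based spectral mapping and fifth-order mean filtering over converted cepstral frames — can be sketched as follows. This is a minimal illustration of the standard GMM conditional-expectation conversion and a moving-average smoother, not the paper's multi-level, language-weighted implementation; all function and parameter names (`gmm_convert`, `mean_filter`, `cov_yx`, etc.) are assumptions, and diagonal covariances are assumed for simplicity.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, var_x, cov_yx):
    """Map one source cepstral frame x to the target space as the
    GMM conditional expectation E[y|x] (diagonal covariances assumed).
    Shapes: x (D,), all mixture parameters (M, D). Illustrative only."""
    # posterior responsibility of each of the M mixtures for frame x
    log_p = -0.5 * np.sum((x - mu_x) ** 2 / var_x
                          + np.log(2 * np.pi * var_x), axis=1)
    log_p += np.log(weights)
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    # per-mixture linear regression: mu_y + cov_yx / var_x * (x - mu_x)
    y_m = mu_y + cov_yx / var_x * (x - mu_x)
    # weight each mixture's prediction by its posterior probability
    return p @ y_m

def mean_filter(cepstra, order=5):
    """Moving average of the given order over time, applied per cepstral
    coefficient, to smooth discontinuities between converted frames."""
    kernel = np.ones(order) / order
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="same"), 0, cepstra)
```

The statistical averaging in `p @ y_m` is precisely the source of the oversmoothing the paper targets: frames assigned to several mixtures are pulled toward a weighted mean of regression outputs, flattening spectral detail.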

Keywords

GMM · Multilingual · Polyglot · Cross-lingual voice conversion · Oversmoothing · ABX listening test


Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • B. Ramani¹ (Email author)
  • M. P. Actlin Jeeva¹
  • P. Vijayalakshmi¹
  • T. Nagarajan¹

  1. SSN College of Engineering, Chennai, India
