Circuits, Systems, and Signal Processing

Volume 37, Issue 5, pp 2142–2163

A Multilingual to Polyglot Speech Synthesizer for Indian Languages Using a Voice-Converted Polyglot Speech Corpus

  • P. Vijayalakshmi
  • B. Ramani
  • M. P. Actlin Jeeva
  • T. Nagarajan


A multilingual synthesizer produces speech that is intelligible to human listeners for any given monolingual or mixed-language text. The need for such a synthesizer arises in a country like India, where multiple languages coexist. In the current work, multilingual synthesizers are developed using an HMM-based speech synthesis technique. However, for mixed-language text, the synthesized speech exhibits speaker switching at language switching points, which is quite annoying to the listener. This occurs because the speech data used for training are collected for each language from a different (native) speaker. To overcome speaker switching at language switching points, a polyglot speech synthesizer is developed using a polyglot speech corpus (all the speech data in a single speaker’s voice). The polyglot speech corpus is obtained using a cross-lingual voice conversion (CLVC) technique. In the current work, the polyglot synthesizer is developed for five languages, namely Tamil, Telugu, Hindi, Malayalam and Indian English. The regional Indian languages considered are acoustically similar to a certain extent, and hence a common phoneset and question set are used to build the synthesizer. Experiments are carried out by developing various bilingual polyglot synthesizers to choose the language (and thereby the speaker) that can be considered as the target for the polyglot synthesizer. The performance of the synthesizers is evaluated subjectively, for speaker/language switching using a perceptual test and for quality using the mean opinion score. Speaker identity is evaluated objectively using a GMM-based speaker identification system. Further, the polyglot synthesizer developed using the polyglot speech corpus is compared with an adaptation-based polyglot synthesizer in terms of the quality of the synthesized speech and the amount of data required for adaptation and voice conversion. It is observed that the performance of the polyglot synthesizer developed using the polyglot speech corpus obtained from the CLVC technique is better than or comparable to that of the adaptation-based polyglot synthesizer.
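The objective speaker-identity check mentioned in the abstract can be illustrated with a minimal sketch of GMM-based speaker identification: each speaker is represented by a Gaussian mixture model, and a test utterance is assigned to the speaker whose model gives the highest average per-frame log-likelihood. Everything concrete here is an assumption for illustration and is not taken from the paper: the function names, the diagonal-covariance form, the two-dimensional toy features, and the hand-set model parameters (a real system would train the GMMs, for example with EM, on spectral features).

```python
import math


def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_avg_loglik(frames, gmm):
    """Average per-frame log-likelihood of frames under a GMM.

    gmm is a list of (weight, mean, var) tuples (hypothetical layout).
    """
    total = 0.0
    for x in frames:
        # Log-sum-exp over mixture components for numerical stability.
        comps = [math.log(w) + diag_gauss_logpdf(x, m, v) for w, m, v in gmm]
        peak = max(comps)
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total / len(frames)


def identify_speaker(frames, speaker_gmms):
    """Return the best-scoring speaker name and all per-speaker scores."""
    scores = {name: gmm_avg_loglik(frames, g) for name, g in speaker_gmms.items()}
    return max(scores, key=scores.get), scores


# Toy, hand-set models (assumed values, not trained on real speech).
speaker_gmms = {
    "target_speaker": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "other_speaker": [(0.5, [5.0, 5.0], [1.0, 1.0]), (0.5, [6.0, 6.0], [1.0, 1.0])],
}

# Feature frames standing in for a synthesized test utterance.
frames = [[0.2, -0.1], [0.9, 1.1], [0.5, 0.4]]
best, scores = identify_speaker(frames, speaker_gmms)
print(best, scores)
```

In an evaluation of this kind, synthesized utterances identified as the target speaker would suggest that the speaker identity is preserved across languages; the decision rule shown is the standard maximum-likelihood one and may differ in detail from the system used in the paper.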


Keywords: Polyglot, Multilingual, HMM, GMM, Voice conversion



Copyright information

© Springer Science+Business Media, LLC 2017

Authors and Affiliations

  • P. Vijayalakshmi (1)
  • B. Ramani (1)
  • M. P. Actlin Jeeva (1)
  • T. Nagarajan (1)

  1. SSN College of Engineering, Chennai, India
