A Multilingual to Polyglot Speech Synthesizer for Indian Languages Using a Voice-Converted Polyglot Speech Corpus

Circuits, Systems, and Signal Processing

Abstract

A multilingual synthesizer synthesizes speech that is intelligible to human listeners for any given monolingual or mixed-language text. The need for such a synthesizer arises in a country like India, where multiple languages coexist. In the current work, multilingual synthesizers are developed using the HMM-based speech synthesis technique. However, for mixed-language text, the synthesized speech exhibits speaker switching at language-switching points, which is quite annoying to the listener. This is because the speech data used for training is collected for each language from a different (native) speaker. To overcome speaker switching at language-switching points, a polyglot speech synthesizer is developed using a polyglot speech corpus (all the speech data in a single speaker’s voice). The polyglot speech corpus is obtained using a cross-lingual voice conversion (CLVC) technique. In the current work, a polyglot synthesizer is developed for five languages, namely Tamil, Telugu, Hindi, Malayalam and Indian English. The regional Indian languages considered are acoustically similar to a certain extent, and hence a common phoneset and question set are used to build the synthesizer. Experiments are carried out by developing various bilingual polyglot synthesizers to choose the language (and thereby the speaker) that can serve as the target for the polyglot synthesizer. The performance of the synthesizers is evaluated subjectively for speaker/language switching using a perceptual test and for quality using the mean opinion score. Speaker identity is evaluated objectively using a GMM-based speaker identification system. Further, the polyglot synthesizer developed using the polyglot speech corpus is compared with the adaptation-based polyglot synthesizer in terms of the quality of the synthesized speech and the amount of data required for adaptation and voice conversion. It is observed that the performance of the polyglot synthesizer developed using the polyglot speech corpus obtained from the CLVC technique is better than or nearly the same as that of the adaptation-based polyglot synthesizer.
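
As a rough illustration of the subjective quality measure mentioned above, a mean opinion score is the average of listener ratings, here assumed to lie on the usual 1 to 5 scale. The sketch below is not taken from the article: the function name, the sample ratings and the normal-approximation confidence interval are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    Assumes each rating is on a 1-5 scale; the interval uses a normal
    approximation (1.96 standard errors), which is an illustrative choice.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = mean(ratings)
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners for one synthesized utterance.
ratings = [4, 3, 5, 4, 4, 3, 4, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean helps show whether a difference between two synthesizers is larger than the spread among listeners.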

Notes

  1. Various synthesizers were developed by varying the amount of training data: 1, 2, 3, 4, 5 and 12 h. Speech of reasonable quality is obtained even with 30 min of data. For the current work, 1 h of speech data is used.

Author information

Corresponding author

Correspondence to P. Vijayalakshmi.

Additional information

The authors would like to thank the Department of Information Technology, Ministry of Communication and Technology, Government of India, for funding the project “Development of Text-to-Speech synthesis for Indian Languages Phase II”, Ref. No. 11(7)/2011-HCC(TDIL).

About this article

Cite this article

Vijayalakshmi, P., Ramani, B., Jeeva, M.P.A. et al. A Multilingual to Polyglot Speech Synthesizer for Indian Languages Using a Voice-Converted Polyglot Speech Corpus. Circuits Syst Signal Process 37, 2142–2163 (2018). https://doi.org/10.1007/s00034-017-0659-6

  • DOI: https://doi.org/10.1007/s00034-017-0659-6