Investigation of Effectively Synthesizing Code-Switched Speech Using Highly Imbalanced Mix-Lingual Data

  • Conference paper
  • Neural Information Processing (ICONIP 2020)

Abstract

End-to-end text-to-speech (TTS) can synthesize monolingual speech with high naturalness and intelligibility. Recently, end-to-end models have also been applied to code-switching (CS) TTS and perform well in terms of naturalness, intelligibility, and speaker consistency. However, existing systems rely on skilled bilingual speakers to record a CS mix-lingual data set with a high Language-Mix-Ratio (LMR), while simply mixing monolingual data sets leads to accent problems. To reduce recording cost while maintaining speaker consistency, in this paper we investigate an effective method for using a low-LMR, imbalanced mix-lingual data set. Experiments show that it is possible to build a CS TTS system from such a data set by using diverse input text representations, while producing acceptable synthetic CS speech with a Mean Opinion Score (MOS) above 4.0. We also find that results improve when the mix-lingual data set is augmented with monolingual English data.
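The abstract does not formally define the Language-Mix-Ratio. A minimal sketch, assuming LMR is measured at the utterance level as the fraction of utterances that actually mix Mandarin and English (the function name and the character-range heuristic are illustrative assumptions, not from the paper):

```python
import re

def language_mix_ratio(utterances):
    """Fraction of utterances containing code-switched text.

    One plausible reading of LMR: an utterance counts as mixed if it
    contains both CJK (Mandarin) characters and Latin (English) letters.
    The paper may define LMR differently, e.g. at the word level.
    """
    def is_mixed(text):
        has_cjk = re.search(r'[\u4e00-\u9fff]', text) is not None
        has_latin = re.search(r'[A-Za-z]', text) is not None
        return has_cjk and has_latin

    mixed = sum(1 for u in utterances if is_mixed(u))
    return mixed / len(utterances)

corpus = [
    "今天的 meeting 改到下午",   # Mandarin with embedded English: mixed
    "请把报告发给我",            # pure Mandarin
    "周末一起去 shopping 吗",    # mixed
    "谢谢你的帮助",              # pure Mandarin
]
print(language_mix_ratio(corpus))  # 0.5
```

Under this reading, a "low LMR imbalanced" data set is one where mixed utterances like the first and third above are rare relative to monolingual ones.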


Notes

  1. Code-switching is also known as code-mixing. In this paper, we use the term code-switching.
  2. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
  3. https://www.data-baker.com/hc_znv_1_en.html.
  4. https://keithito.com/LJ-Speech-Dataset/.
  5. Some samples are available at https://pandagst.github.io/.


Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61771333, the Tianjin Municipal Science and Technology Project under Grant 18ZXZNGX00330, JSPS KAKENHI Grant No. 19K24376, and the NICT international fund 2020 project "Bridging Eurasia: Multilingual Speech Recognition along the Silk Road", Japan.

Author information

Corresponding author

Correspondence to Longbiao Wang.


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Guo, S., et al.: Investigation of Effectively Synthesizing Code-Switched Speech Using Highly Imbalanced Mix-Lingual Data. In: Yang, H., Pasupa, K., Leung, A.C.S., Kwok, J.T., Chan, J.H., King, I. (eds.) Neural Information Processing. ICONIP 2020. Lecture Notes in Computer Science, vol. 12532. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-63830-6_4


  • DOI: https://doi.org/10.1007/978-3-030-63830-6_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-63829-0

  • Online ISBN: 978-3-030-63830-6

  • eBook Packages: Computer Science (R0)
