Abstract
End-to-end text-to-speech (TTS) can synthesize monolingual speech with high naturalness and intelligibility. Recently, end-to-end models have also been applied to code-switching (CS) TTS and perform well in terms of naturalness, intelligibility, and speaker consistency. However, existing systems rely on skilled bilingual speakers to record a CS mix-lingual data set with a high Language-Mix-Ratio (LMR), while simply mixing monolingual data sets leads to accent problems. To reduce recording cost while maintaining speaker consistency, in this paper we investigate an effective method that uses a low-LMR, imbalanced mix-lingual data set. Experiments show that it is possible to build a CS TTS system from such a data set with diverse input text representations, while producing acceptable synthetic CS speech with a Mean Opinion Score (MOS) above 4.0. We also find that the results improve when the mix-lingual data set is augmented with monolingual English data.
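The Language-Mix-Ratio quantifies how much embedded-language material a corpus contains. The abstract does not give a formula, but a hypothetical token-level version for a Mandarin-English corpus could be sketched as follows; the function name `language_mix_ratio` and the ASCII-vs-CJK tokenization heuristic are illustrative assumptions, not the authors' definition:

```python
import re

def language_mix_ratio(utterances):
    """Sketch of a token-level LMR: the fraction of embedded-language
    (here, English) tokens among all tokens in a Mandarin-English corpus.
    Assumed definition for illustration only."""
    embedded, total = 0, 0
    for utt in utterances:
        # Treat runs of ASCII letters as English words and individual
        # CJK characters as Mandarin tokens.
        tokens = re.findall(r"[A-Za-z]+|[\u4e00-\u9fff]", utt)
        embedded += sum(1 for t in tokens if t.isascii())
        total += len(tokens)
    return embedded / total if total else 0.0
```

Under this sketch, an utterance such as "我喜欢 deep learning" contributes five tokens, two of them English, giving an LMR of 0.4; a "low LMR" corpus in the paper's sense would sit well below typical balanced bilingual recordings.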
Notes
- 1.
Code-switching is also known as code-mixing. In this paper, we use code-switching.
- 5.
Some samples are available at https://pandagst.github.io/.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61771333, the Tianjin Municipal Science and Technology Project under Grant 18ZXZNGX00330, JSPS KAKENHI Grant No. 19K24376, and the NICT international fund 2020 “Bridging Eurasia: Multilingual Speech Recognition along the Silk Road”, Japan.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Guo, S., et al. (2020). Investigation of Effectively Synthesizing Code-Switched Speech Using Highly Imbalanced Mix-Lingual Data. In: Yang, H., Pasupa, K., Leung, A.C.S., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Lecture Notes in Computer Science, vol 12532. Springer, Cham. https://doi.org/10.1007/978-3-030-63830-6_4
DOI: https://doi.org/10.1007/978-3-030-63830-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63829-0
Online ISBN: 978-3-030-63830-6
eBook Packages: Computer Science (R0)