Abstract
End-to-end text-to-speech (TTS) can synthesize monolingual speech with high naturalness and intelligibility. Recently, end-to-end models have also been applied to code-switching (CS) TTS and perform well in terms of naturalness, intelligibility, and speaker consistency. However, existing systems rely on skilled bilingual speakers to record a CS mix-lingual data set with a high Language-Mix-Ratio (LMR), while simply mixing monolingual data sets leads to accent problems. To reduce recording cost while maintaining speaker consistency, in this paper we investigate an effective method that uses a low-LMR, imbalanced mix-lingual data set. Experiments show that it is possible to build a CS TTS system from such a data set with diverse input text representations, while producing acceptable synthetic CS speech with a Mean Opinion Score (MOS) above 4.0. We also find that the results improve when the mix-lingual data set is augmented with monolingual English data.
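The Language-Mix-Ratio quantifies how much embedded-language material a corpus contains. The abstract does not give a formula, but a hypothetical token-level version for a Mandarin-English corpus could be sketched as follows; the function name `language_mix_ratio` and the ASCII-vs-CJK tokenization heuristic are illustrative assumptions, not the authors' definition:

```python
import re

def language_mix_ratio(utterances):
    """Sketch of a token-level LMR: the fraction of embedded-language
    (here, English) tokens among all tokens in a Mandarin-English corpus.
    Assumed definition for illustration only."""
    embedded, total = 0, 0
    for utt in utterances:
        # Treat runs of ASCII letters as English words and individual
        # CJK characters as Mandarin tokens.
        tokens = re.findall(r"[A-Za-z]+|[\u4e00-\u9fff]", utt)
        embedded += sum(1 for t in tokens if t.isascii())
        total += len(tokens)
    return embedded / total if total else 0.0
```

Under this sketch, an utterance such as "我喜欢 deep learning" contributes five tokens, two of them English, giving an LMR of 0.4; a "low LMR" corpus in the paper's sense would sit well below typical balanced bilingual recordings.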
Notes
- 1.
Code-switching is also known as code-mixing. In this paper, we use code-switching.
- 5.
Some samples are available at https://pandagst.github.io/.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant 61771333, the Tianjin Municipal Science and Technology Project under Grant 18ZXZNGX00330, JSPS KAKENHI Grant No. 19K24376, and the NICT international fund 2020 “Bridging Eurasia: Multilingual Speech Recognition along the Silk Road”, Japan.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Guo, S., et al. (2020). Investigation of Effectively Synthesizing Code-Switched Speech Using Highly Imbalanced Mix-Lingual Data. In: Yang, H., Pasupa, K., Leung, A.C.S., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Lecture Notes in Computer Science, vol 12532. Springer, Cham. https://doi.org/10.1007/978-3-030-63830-6_4
DOI: https://doi.org/10.1007/978-3-030-63830-6_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63829-0
Online ISBN: 978-3-030-63830-6
eBook Packages: Computer Science (R0)