Fine-Tuned Self-supervised Speech Representations for Language Diarization in Multilingual Code-Switched Speech

Frost, Geoffrey; Morris, Emily; Jansen van Vüren, Joshua; Niesler, Thomas

doi:10.1007/978-3-031-22321-1_17

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1734)

Included in the following conference series:

Southern African Conference for Artificial Intelligence Research

Abstract

Annotating a multilingual code-switched corpus is a painstaking process requiring specialist linguistic expertise. This is partly due to the large number of language combinations that may appear within and across utterances, which might require several annotators with different linguistic expertise to consider an utterance sequentially. This is time-consuming and costly. It would be useful if the spoken languages in an utterance and the boundaries thereof were known before annotation commences, to allow segments to be assigned to the relevant language experts in parallel. To address this, we investigate the development of a continuous multilingual language diarizer using fine-tuned speech representations extracted from a large pre-trained self-supervised architecture (WavLM). We experiment with a code-switched corpus consisting of five South African languages (isiZulu, isiXhosa, Setswana, Sesotho and English) and show substantial diarization error rate improvements for language families, language groups, and individual languages over baseline systems.

Notes

1. https://github.com/GeoffreyFrost/code-switched-language-diarization

Author information

Authors and Affiliations

Department of E and E Engineering, Stellenbosch University, Stellenbosch, South Africa
Geoffrey Frost, Joshua Jansen van Vüren & Thomas Niesler
Cape Town, South Africa
Emily Morris

Corresponding author

Correspondence to Geoffrey Frost.

Editor information

Editors and Affiliations

University of KwaZulu-Natal, Durban, South Africa
Anban Pillay
University of KwaZulu-Natal, Durban, South Africa
Edgar Jembere
University of Pretoria, Pretoria, South Africa
Aurona Gerber

About this paper

Cite this paper

Frost, G., Morris, E., Jansen van Vüren, J., Niesler, T. (2022). Fine-Tuned Self-supervised Speech Representations for Language Diarization in Multilingual Code-Switched Speech. In: Pillay, A., Jembere, E., Gerber, A. (eds) Artificial Intelligence Research. SACAIR 2022. Communications in Computer and Information Science, vol 1734. Springer, Cham. https://doi.org/10.1007/978-3-031-22321-1_17

DOI: https://doi.org/10.1007/978-3-031-22321-1_17
Published: 28 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22320-4
Online ISBN: 978-3-031-22321-1
eBook Packages: Computer Science, Computer Science (R0)
