Abstract
Automatic speech recognition is a mature speech technology, almost able to attain human-level recognition performance provided that sufficient labeled training data is available. However, in under-resourced scenarios the system struggles to achieve deployable performance. In such scenarios, most prior work suggests that traditional frameworks are preferable to state-of-the-art deep learning frameworks. This work creates a 6-hour dataset for the Lambani language and develops an ASR system for it. On the Lambani dataset, the system achieves a character error rate (CER) of \(39.1\%\) and \(24.1\%\) using the GMM-HMM and TDNN frameworks, respectively. The language does not have enough publicly available speech and corresponding text transcription resources of its own. Motivated by this, this work takes the publicly available wav2vec2.0 (W2V) pre-trained model (trained on unlabeled speech data from 23 Indian languages) and fine-tunes it with the labeled data of the Lambani language. The fine-tuned model is then used as a non-linear feature extractor, and the ASR task is performed with the GMM-HMM and TDNN frameworks. The proposed approach provides a relative improvement of \(53.4\%\) and \(32.1\%\) for the GMM-HMM and TDNN frameworks, respectively.
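The character error rate reported above is the character-level edit (Levenshtein) distance between the hypothesis and the reference, normalized by the reference length; the relative improvement is the reduction in CER divided by the baseline CER. A minimal sketch of the metric (an illustrative implementation, not the authors' evaluation code):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance(ref, hyp) / len(ref)."""
    m, n = len(ref), len(hyp)
    # Single-row dynamic-programming table for Levenshtein distance.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]  # dp[i-1][j], about to be overwritten
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
            )
            prev = cur
    return dp[n] / max(m, 1)


def relative_improvement(cer_base: float, cer_new: float) -> float:
    """Relative CER reduction of a new system over a baseline."""
    return (cer_base - cer_new) / cer_base
```

For example, `cer("abc", "abd")` gives one substitution over three reference characters, i.e. about 0.333.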
Acknowledgements
The Lambani data collection is a part of the "Speech to Speech Translation" project. The authors would like to acknowledge the Ministry of Electronics and Information Technology (MeitY), Govt. of India, for funding this project. The authors would also like to thank the data associates who helped in collecting the Lambani data. The authors are grateful to Mr. Swapnil Sontakke for building the GUI, which played a crucial role in the collection of the data.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mukherjee, S., Mishra, J., Prasanna, S.R.M. (2023). Significance of Indic Self-supervised Speech Representations for Indic Under-Resourced ASR. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science(), vol 14339. Springer, Cham. https://doi.org/10.1007/978-3-031-48312-7_8
DOI: https://doi.org/10.1007/978-3-031-48312-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48311-0
Online ISBN: 978-3-031-48312-7
eBook Packages: Computer Science (R0)