Abstract
Automatic speech recognition is a mature speech technology, almost able to attain human-level recognition performance provided that sufficient labeled training data is available. However, in under-resourced scenarios the system struggles to achieve deployable performance. In such scenarios, most prior work suggests that traditional frameworks are preferable to state-of-the-art deep learning frameworks. This work creates a 6-hour dataset for the Lambani language and develops an ASR system for it. On the Lambani dataset, the system achieves a character error rate (CER) of \(39.1\%\) and \(24.1\%\) using the GMM-HMM and TDNN frameworks, respectively. The language does not have enough publicly available speech and corresponding text transcription resources of its own. Motivated by this, this work takes the publicly available wav2vec2.0 (W2V) pre-trained model (trained on unlabeled speech data from 23 Indian languages) and fine-tunes it with the labeled data of the Lambani language. The fine-tuned model is then used as a non-linear feature extractor, and the ASR task is performed with the GMM-HMM and TDNN frameworks. The proposed approach provides a relative improvement of \(53.4\%\) and \(32.1\%\) for the GMM-HMM and TDNN frameworks, respectively.
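The character error rate reported above is the character-level edit (Levenshtein) distance between the hypothesis and the reference, normalized by the reference length; the relative improvement is the reduction in CER divided by the baseline CER. A minimal sketch of the metric (an illustrative implementation, not the authors' evaluation code):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance(ref, hyp) / len(ref)."""
    m, n = len(ref), len(hyp)
    # Single-row dynamic-programming table for Levenshtein distance.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]  # dp[i-1][j], about to be overwritten
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
            )
            prev = cur
    return dp[n] / max(m, 1)


def relative_improvement(cer_base: float, cer_new: float) -> float:
    """Relative CER reduction of a new system over a baseline."""
    return (cer_base - cer_new) / cer_base
```

For example, `cer("abc", "abd")` gives one substitution over three reference characters, i.e. about 0.333.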
Acknowledgements
The Lambani data collection is a part of the "Speech to Speech Translation" project. The authors would like to acknowledge the Ministry of Electronics and Information Technology (MeitY), Govt. of India, for funding this project. The authors would also like to thank the data associates who helped in collecting the Lambani data. The authors are grateful to Mr. Swapnil Sontakke for building the GUI, which played a crucial role in the collection of the data.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mukherjee, S., Mishra, J., Prasanna, S.R.M. (2023). Significance of Indic Self-supervised Speech Representations for Indic Under-Resourced ASR. In: Karpov, A., Samudravijaya, K., Deepak, K.T., Hegde, R.M., Agrawal, S.S., Prasanna, S.R.M. (eds) Speech and Computer. SPECOM 2023. Lecture Notes in Computer Science(), vol 14339. Springer, Cham. https://doi.org/10.1007/978-3-031-48312-7_8
DOI: https://doi.org/10.1007/978-3-031-48312-7_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-48311-0
Online ISBN: 978-3-031-48312-7
eBook Packages: Computer Science (R0)