Abstract
Recent advances have brought great success to speech recognition and natural language processing, and fine-tuning large pre-trained models has become a standard approach. Such models work well for languages with vast resources, but not for languages with comparatively few. In this paper, an automatic speech recognition (ASR) model for low-resource Indian regional languages is proposed, built on the well-known Wav2Vec2 model and evaluated on a Tamil dataset. The objective is to minimize the word error rate (WER) when transcribing an audio file (e.g., a WAV file) containing speech. This is achieved by pre-training Wav2Vec2 and then fine-tuning it with a linear layer added on top, fed by a custom tokenizer. The approach achieves 61.3% WER on the Mozilla Common Voice dataset, outperforming the previously reported WER of 69.76%.
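The WER metric that the abstract optimizes is the word-level Levenshtein (edit) distance between the reference transcript and the model's hypothesis, divided by the number of reference words. A minimal sketch of that computation is below; the function name `wer` and its signature are illustrative, not taken from the paper (in practice a library such as `jiwer` is commonly used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + cost,  # match or substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, a hypothesis that substitutes one word out of four yields a WER of 0.25; the 61.3% figure reported here means roughly six word errors per ten reference words on the Common Voice Tamil test set.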
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ghadekar, P., Jhanwar, K., Karpe, A., Sivanandan, A., Shetty, T., Khushalani, P. (2023). ASR for Indian Regional Languages Using Fine-Tuned Wav2Vec2 Model. In: Chakraborty, B., Biswas, A., Chakrabarti, A. (eds) Advances in Data Science and Computing Technologies. ADSC 2022. Lecture Notes in Electrical Engineering, vol 1056. Springer, Singapore. https://doi.org/10.1007/978-981-99-3656-4_14
DOI: https://doi.org/10.1007/978-981-99-3656-4_14
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-3655-7
Online ISBN: 978-981-99-3656-4
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)