Abstract
Understanding spoken language, that is, transcribing spoken words into text, was one of the earliest goals of computer language processing and falls within the realm of speech processing, a field that predates the computer by many decades. Because speech is the most common and most important means of communication for most people, it continually demands technological advances. Recent decades have therefore seen great interest in techniques such as automatic speech recognition (ASR) and text-to-speech synthesis. Most of this research has focused on English, and its scope needs to be expanded to other languages as well. In this study we explore several open-source ASR systems that offer multilingual (English and Spanish) models, discuss the various models these systems provide, and evaluate their performance. Based on our manual observations and an automatic evaluation metric (word error rate), we find that the Whisper models perform best for both English and Spanish. In addition, Whisper offers a multilingual model capable of processing audio that mixes English and Spanish words.
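As an illustration of the evaluation pipeline summarized above, the following minimal sketch transcribes an audio file with an open-source Whisper model and scores the hypothesis against a reference transcript using word error rate. It assumes the `openai-whisper` and `jiwer` packages are installed; the audio path and reference string are hypothetical placeholders, and this is not the exact pipeline used in the study.

```python
# Minimal sketch of the ASR-plus-WER evaluation described in the abstract.
# Assumes `pip install openai-whisper jiwer`; the audio file and reference
# transcript below are placeholders, not data from the study.
import whisper
from jiwer import wer

# Load a multilingual Whisper checkpoint (e.g., "base"); multilingual
# checkpoints handle both English and Spanish audio.
model = whisper.load_model("base")

# Transcribe; Whisper auto-detects the spoken language unless one is forced.
result = model.transcribe("sample_es.wav")  # hypothetical audio file
hypothesis = result["text"]
print(f"Detected language: {result['language']}")
print(f"Hypothesis: {hypothesis}")

# Score against a human reference transcript. WER = (S + D + I) / N, where
# S, D, and I are word substitutions, deletions, and insertions, and N is
# the number of words in the reference.
reference = "hola buenos días como estás"  # placeholder reference
print(f"WER: {wer(reference, hypothesis):.3f}")
```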
Acknowledgment
This work was supported by grants from the National Science Foundation (NSF; awards #2131052 and #2219587). The opinions and findings expressed in this work do not necessarily reflect the views of the funding institution, which had no involvement in any aspect of the research.