A Comparative Analysis of Real Time Open-Source Speech Recognition Tools for Social Robots

Pande, Akshara; Shrestha, Bhanu; Rani, Anshul; Mishra, Deepti

doi:10.1007/978-3-031-35708-4_26

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14033)

Included in the following conference series:

International Conference on Human-Computer Interaction

Abstract

Social robots are designed to support people by gathering, processing, and analyzing information and by making predictions from it. They play a vital role in fields such as medicine, entertainment, education, and assistance. Speech is the fundamental channel through which social robots communicate with humans, and advances in artificial intelligence have made speech recognition tools substantially more effective. Speech is easier to interpret once it has been transcribed to text, and speech recognition tools give robots this ability. Robots are expected to understand precisely what humans are trying to convey; however, this is not always achievable because of factors such as limits on robot functionality or noise in the environment. Research studies indicate that recognizing children's speech is a particularly challenging problem for robots. The built-in speech recognition capabilities of such robots can be enhanced by integrating them with a more efficient speech recognition tool. It is therefore necessary to select an appropriate tool so that robots can understand human speech consistently. In the present study, we analyze five real-time speech-to-text tools available from open sources: Google Speech Recognition, Vosk, CMUSphinx, DeepSpeech, and Whisper. Evaluation metrics are generally used to assess the performance of such tools. This analysis will enable us to determine the best real-time open-source tool to employ for human-robot interaction.
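
The abstract does not say which evaluation metrics are used. As an illustration only, the sketch below assumes word error rate (WER), a common metric for speech-to-text output, computed as the word-level edit distance between a reference transcript and a recognizer's hypothesis, divided by the number of reference words. The function name and the lowercase, whitespace-based tokenization are hypothetical choices and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: words are compared case-insensitively after splitting
    on whitespace; the paper may normalize text differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, comparing the reference "turn on the light" with the hypothesis "turn of the night" gives two substitutions over four reference words, so a WER of 0.5. A lower value means the transcript is closer to the reference.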

Author information

Authors and Affiliations

Education Technology Laboratory, Department of Computer Science (IDI), Norwegian University of Science and Technology, Gjøvik, Norway
Akshara Pande, Bhanu Shrestha & Deepti Mishra
Department of Computer Science (IDI), Norwegian University of Science and Technology, Gjøvik, Norway
Anshul Rani
Business Analytics Research Group, Inland School of Business and Social Sciences, Inland Norway University of Applied Sciences, Lillehammer, Norway
Deepti Mishra

Corresponding author

Correspondence to Akshara Pande.

Editor information

Editors and Affiliations

Aaron Marcus and Associates, Berkeley, CA, USA
Aaron Marcus
World Usability Day and Bubble Mountain Consulting, Newton Center, MA, USA
Elizabeth Rosenzweig
Southern University of Science and Technology – SUSTech, Shenzhen, China
Marcelo M. Soares

About this paper

Cite this paper

Pande, A., Shrestha, B., Rani, A., Mishra, D. (2023). A Comparative Analysis of Real Time Open-Source Speech Recognition Tools for Social Robots. In: Marcus, A., Rosenzweig, E., Soares, M.M. (eds) Design, User Experience, and Usability. HCII 2023. Lecture Notes in Computer Science, vol 14033. Springer, Cham. https://doi.org/10.1007/978-3-031-35708-4_26

DOI: https://doi.org/10.1007/978-3-031-35708-4_26
Published: 09 July 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35707-7
Online ISBN: 978-3-031-35708-4
eBook Packages: Computer Science, Computer Science (R0)
