Abstract
Voice interaction with the television is becoming a reality in domestic environments. However, one factor that affects the correct operation of these systems is background noise, which degrades the performance of the automatic speech recognition (ASR) component. To better understand this issue, this paper presents an analysis of the performance of three ASR systems (Bing Speech API, Google API, and Nuance ASR) in several domestic noise scenarios that resemble interaction with the TV in a domestic context. A group of 36 users was asked to utter sentences based on TV requests, drawn from a corpus of typical phrases used when interacting with the TV. To better understand the behavior, performance, and robustness to noise of each ASR, the tests were carried out with three recording devices placed at different distances from the user. Google ASR proved to be the most robust to noise, with higher recognition precision, followed by Bing Speech and Nuance. The results showed that the performance of the ASR systems is quite robust overall but tends to deteriorate with domestic background noise. Future replications of the evaluation setup will allow ASR solutions to be evaluated in other scenarios.
Acknowledgments
This paper is a result of the project CHIC – Cooperative Holistic for Internet and Content (grant agreement number 24498), funded by COMPETE 2020 and Portugal 2020 through the European Regional Development Fund (FEDER).
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Beça, P., Abreu, J., Santos, R., Rodrigues, A. (2019). Evaluating the Performance of ASR Systems for TV Interactions in Several Domestic Noise Scenarios. In: Abásolo, M., Silva, T., González, N. (eds) Applications and Usability of Interactive TV. jAUTI 2018. Communications in Computer and Information Science, vol 1004. Springer, Cham. https://doi.org/10.1007/978-3-030-23862-9_12
DOI: https://doi.org/10.1007/978-3-030-23862-9_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23861-2
Online ISBN: 978-3-030-23862-9
eBook Packages: Computer Science, Computer Science (R0)