Evaluating the Performance of ASR Systems for TV Interactions in Several Domestic Noise Scenarios

Beça, Pedro; Abreu, Jorge; Santos, Rita; Rodrigues, Ana

doi:10.1007/978-3-030-23862-9_12

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1004)

Included in the following conference series:

Iberoamerican Conference on Applications and Usability of Interactive TV

Abstract

Voice interaction with the television is becoming a reality in domestic environments. However, one factor that affects the correct operation of these systems is background noise, which degrades the performance of the automatic speech recognition (ASR) component. To better understand this issue, the paper analyzes the performance of three ASR systems (Bing Speech API, Google API, and Nuance ASR) in several domestic noise scenarios resembling interaction with the TV in a domestic context. A group of 36 users was asked to utter sentences based on TV requests, drawn from a corpus of typical phrases used when interacting with the TV. To better understand the behavior, performance, and robustness to noise of each ASR, the tests were carried out with three recording devices placed at different distances from the user. Google ASR proved to be the most robust to noise, with higher recognition precision, followed by Bing Speech and Nuance. The results showed that the performance of the ASR systems is quite robust overall but tends to deteriorate with domestic background noise. Future replications of the evaluation setup will allow ASR solutions to be evaluated in other scenarios.
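The abstract does not state which measure underlies "recognition precision". As a minimal sketch, assuming word error rate (WER), a widely used ASR metric, the comparison between a reference TV request and an ASR transcript could be scored as follows. The function name and the sample phrases are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: WER is used here only for illustration; the paper's
    abstract does not specify its scoring metric.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical TV request and a noisy transcript of it.
clean = word_error_rate("switch to channel five", "switch to channel five")
noisy = word_error_rate("switch to channel five", "switch channel nine")
print(clean, noisy)
```

Averaging such a score per ASR system, noise scenario, and recording device would give one way to tabulate a comparison like the one the abstract describes.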

Acknowledgments

This paper is a result of the project CHIC – Cooperative Holistic for Internet and Content (grant agreement number 24498), funded by COMPETE 2020 and Portugal 2020 through the European Regional Development Fund (FEDER).

Author information

Authors and Affiliations

Digimedia, Department of Communication and Arts, University of Aveiro, Aveiro, Portugal
Pedro Beça, Jorge Abreu, Rita Santos & Ana Rodrigues

Authors

Pedro Beça
Jorge Abreu
Rita Santos
Ana Rodrigues

Corresponding author

Correspondence to Pedro Beça.

Editor information

Editors and Affiliations

CICPBA - III-LIDI, National University of La Plata, La Plata, Argentina
María José Abásolo
Department of Communication and Art, University of Aveiro, Aveiro, Portugal
Telmo Silva
Department of Social Sciences, National University of Quilmes, Buenos Aires, Argentina
Nestor D. González

About this paper

Cite this paper

Beça, P., Abreu, J., Santos, R., Rodrigues, A. (2019). Evaluating the Performance of ASR Systems for TV Interactions in Several Domestic Noise Scenarios. In: Abásolo, M., Silva, T., González, N. (eds) Applications and Usability of Interactive TV. jAUTI 2018. Communications in Computer and Information Science, vol 1004. Springer, Cham. https://doi.org/10.1007/978-3-030-23862-9_12

DOI: https://doi.org/10.1007/978-3-030-23862-9_12
Published: 05 July 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-23861-2
Online ISBN: 978-3-030-23862-9
eBook Packages: Computer Science, Computer Science (R0)
