Evaluating the Performance of ASR Systems for TV Interactions in Several Domestic Noise Scenarios

  • Pedro Beça
  • Jorge Abreu
  • Rita Santos
  • Ana Rodrigues
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 1004)


Voice interaction with the television is becoming a reality in domestic environments. However, one factor that affects the correct operation of these systems is background noise, which degrades the performance of the automatic speech recognition (ASR) component. To better understand this issue, the paper presents an analysis of the performance of three ASR systems (Bing Speech API, Google API, and Nuance ASR) in several domestic noise scenarios resembling interaction with the TV in a domestic context. A group of 36 users was asked to utter sentences based on TV requests, drawn from a corpus of typical phrases used when interacting with the TV. To better understand the behavior, performance and robustness to noise of each ASR system, the tests were carried out with three recording devices placed at different distances from the user. Google ASR proved the most robust to noise, with the highest recognition precision, followed by Bing Speech and Nuance. The results showed that the performance of the ASR systems is generally quite robust but tends to deteriorate with domestic background noise. Future replications of the evaluation setup will allow ASR solutions to be evaluated in other scenarios.
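
The abstract does not state which metric underlies the reported recognition precision. As an illustration only, the sketch below computes word error rate (WER), a common measure for comparing an ASR transcript against the sentence the user was asked to utter. The function name and the example TV requests are hypothetical and are not taken from the paper's corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by the
    number of words in the reference.

    Assumption: WER is used here only as a familiar example metric; the
    paper's own scoring method is not described in the abstract.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical TV requests, not taken from the paper's corpus.
print(word_error_rate("switch to channel five", "switch to channel nine"))
print(word_error_rate("turn up the volume", "turn the volume"))
```

Averaging such a score per ASR system, noise scenario and recording distance would give one way to compare robustness across conditions, although the paper may aggregate its results differently.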


Natural language interaction, ASR evaluation, TV interaction, Automatic speech recognition



This paper is a result of the project CHIC – Cooperative Holistic for Internet and Content (grant agreement number 24498), funded by COMPETE 2020 and Portugal 2020 through the European Regional Development Fund (FEDER).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Digimedia, Department of Communication and Arts, University of Aveiro, Aveiro, Portugal
