Evaluating the Performance of ASR Systems for TV Interactions in Several Domestic Noise Scenarios

  • Conference paper
Applications and Usability of Interactive TV (jAUTI 2018)

Abstract

Voice interaction with the television is becoming a reality in domestic environments. However, one factor that affects the correct operation of these systems is background noise, which degrades the performance of the automatic speech recognition (ASR) component. To better understand this issue, the paper analyzes the performance of three ASR systems (Bing Speech API, Google API, and Nuance ASR) in several domestic noise scenarios that resemble interaction with the TV in a domestic context. A group of 36 users was asked to utter sentences based on TV requests, drawn from a corpus of typical phrases used when interacting with the TV. To better understand the behavior, performance, and robustness of each ASR system to noise, the tests were carried out with three recording devices placed at different distances from the user. Google ASR proved to be the most robust to noise, with the highest recognition precision, followed by Bing Speech and Nuance. The results showed that the performance of the ASR systems is, overall, quite robust but tends to deteriorate with domestic background noise. Future replications of the evaluation setup will allow ASR solutions to be evaluated in other scenarios.

Acknowledgments

This paper is a result of the project CHIC – Cooperative Holistic for Internet and Content (grant agreement number 24498), funded by COMPETE 2020 and Portugal 2020 through the European Regional Development Fund (FEDER).

Author information

Correspondence to Pedro Beça.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Beça, P., Abreu, J., Santos, R., Rodrigues, A. (2019). Evaluating the Performance of ASR Systems for TV Interactions in Several Domestic Noise Scenarios. In: Abásolo, M., Silva, T., González, N. (eds) Applications and Usability of Interactive TV. jAUTI 2018. Communications in Computer and Information Science, vol 1004. Springer, Cham. https://doi.org/10.1007/978-3-030-23862-9_12

  • DOI: https://doi.org/10.1007/978-3-030-23862-9_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-23861-2

  • Online ISBN: 978-3-030-23862-9

  • eBook Packages: Computer Science, Computer Science (R0)