
Audio Event Recognition in the Smart Home

Abstract

After a brief overview of the relevance and value of deploying automatic audio event recognition (AER) in the smart home market, this chapter reviews three aspects of the productization of AER which are important to consider when developing pathways to impact between fundamental research and “real-world” applications. The first section shows that applications introduce a variety of practical constraints which elicit new research topics in the field: clarifying the definition of sound events, which motivates the explicit modeling of temporal patterns and interruptions; running and evaluating AER in 24/7 sound detection setups, which suggests recasting the problem as open-set recognition; and running AER applications on consumer devices with limited audio quality and computational power, which raises issues of scalability and robustness. The second section explores the definition of user experience for AER. After reporting field observations on how system errors affect user experience, it proposes introducing opinion scoring into AER evaluation methodology. It then explores the link between standard AER performance metrics and subjective user experience metrics, and draws attention to the fact that F-score metrics conflate the objective evaluation of acoustic discrimination with the subjective choice of an application-dependent operating point. Solutions for separating discrimination from calibration in system evaluation are introduced, allowing the optimization of acoustic modeling to be decoupled more explicitly from that of application-dependent user experience. Finally, the last section analyses the ethical and legal issues involved in deploying AER systems which are “listening” at all times in the users’ private space.
A review of the key notions underpinning European data and privacy protection laws, questioning if and when these apply to audio data, suggests a set of guidelines which boil down to empowering users to consent by fully informing them about the use of their data, and to taking reasonable information security measures to protect users’ personal data.
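The abstract's point that F-score metrics conflate acoustic discrimination with the choice of an operating point can be illustrated with a small sketch (not from the chapter; the scores and labels are made-up toy data): the same detector scores yield different F-scores depending on the decision threshold, whereas a ranking-based discrimination measure such as ROC AUC is threshold-free.

```python
# Hedged illustration: F-score depends on the chosen decision threshold
# (the "operating point"), while ROC AUC measures discrimination alone.

def f_score(scores, labels, threshold):
    """F1-score of detector scores thresholded at `threshold`."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(scores, labels):
    """Threshold-free discrimination: probability that a random target
    outscores a random non-target (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
labels = [0,   0,   1,    0,   1,   1]

# Same scores, two operating points -> two different F-scores...
print(f_score(scores, labels, 0.3))   # lenient threshold -> 0.75
print(f_score(scores, labels, 0.65))  # strict threshold  -> 0.8
# ...but a single, threshold-independent discrimination value:
print(roc_auc(scores, labels))        # 8/9 ~ 0.889
```

Comparing systems on F-score alone thus mixes how well the acoustics separate the classes with how well the threshold happens to be calibrated for a particular application.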

Keywords

  • Smart home applications
  • Audio event recognition
  • Modelling audio events
  • Computational power
  • Embedded sound recognition
  • Audio quality
  • Open-set recognition
  • User interface
  • Objective and subjective evaluation
  • Ethics
  • Privacy

Fig. 12.1
Fig. 12.2

Notes

  1.

    The term audio event recognition (AER) in this chapter corresponds to what is referred to as sound event detection in other chapters of this book. Whereas consensus is currently forming amongst the academic research community around the latter term, the industry prefers AER for marketing reasons: firstly because it establishes a parallel with automatic speech recognition, and secondly because “recognition”, with its connotations of semantics and meaning, makes the system feel more intelligent than “detection”, which suggests plain automation.

  2.

    This suggestion may be reminiscent of speech recognition techniques, where the acoustic models and the language model contribute almost equally to recognition accuracy [54]. However, the problem may be different in AER: the proportion of silence or interruption frames relative to target acoustic frames may be much smaller in continuous speech, and thus have less effect on the overall acoustic probabilities, than in the case of short, interrupted audio events such as smoke alarms or baby cries. Besides, the structure of non-speech audio events or audio scenes may not be of a linguistic nature under the strict definition of language as a system of communication, which questions whether non-speech audio events are structured at a deeper level of cognitive concepts. More discussion of “acoustic language models” can be found in Chap. 8.

  3.

    A requirement to be able to cross-check the triggering audio may seem to contradict the privacy and eavesdropping concerns analyzed further down in Sect. 12.4.4. However, in practice there is less of a privacy concern about transmitting short audio snippets, with pre- and post-bracketing time kept to a minimum around the triggering audio event, than there is about streaming someone’s personal audio to the cloud in a continuous 24/7 manner.
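A minimal sketch of the snippet-transmission idea in this note (all names and parameter values are hypothetical, not from the chapter): when a detection fires, only a short window around the event is extracted, with the pre/post bracketing kept deliberately small.

```python
# Hypothetical sketch: extract a short snippet bracketing a detected event,
# instead of streaming the full 24/7 audio. Parameter names are illustrative.

def extract_snippet(audio, sample_rate, event_start_s, event_end_s,
                    pre_s=0.5, post_s=0.5):
    """Return the samples bracketing a detected event.

    audio: sequence of samples; event_*_s: event boundaries in seconds;
    pre_s/post_s: bracketing time, kept to a minimum for privacy.
    """
    start = max(0, int((event_start_s - pre_s) * sample_rate))
    end = min(len(audio), int((event_end_s + post_s) * sample_rate))
    return audio[start:end]

sr = 16000
audio = [0.0] * (10 * sr)  # 10 s of dummy audio
snippet = extract_snippet(audio, sr, event_start_s=4.0, event_end_s=5.0)
print(len(snippet) / sr)   # 2.0 s: 0.5 s pre + 1 s event + 0.5 s post
```

Only this two-second snippet, rather than the continuous stream, would leave the device for cross-checking.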

  4.

    False alarm (FA) is synonymous with false positive (FP), and missed detection (MD) is synonymous with false negative (FN). Usage varies across domains in the literature; e.g., speaker recognition tends to use the former pair more often.
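The terminology mapping in this note can be made concrete with a small example (the confusion-matrix counts are made up): detection work typically reports FA and MD as rates over the non-target and target trials, respectively.

```python
# Illustration of the note's terminology: FA = FP, MD = FN, expressed as
# rates. The counts below are toy values, not results from the chapter.

tp, fp, fn, tn = 40, 10, 5, 945  # toy confusion-matrix counts

false_alarm_rate = fp / (fp + tn)       # FA rate: FPs over all non-targets
missed_detection_rate = fn / (fn + tp)  # MD rate: FNs over all targets

print(false_alarm_rate)       # 10/955, about 0.0105
print(missed_detection_rate)  # 5/45, about 0.111
```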

  5.

    Many of the definitions in this section are quoted from the ICO’s documentation [69, 70] and the UK Data Service’s website [64]. Both services are UK-based and government funded, and they aim to help researchers and businesses understand the legal requirements set forth in the UK’s Data Protection Act 1998. Because the UK’s DPA 1998 itself seeks to implement European recommendations and directives, the notions quoted in this list echo the general definitions set forth in the various European laws and recommendations.

References

  1. Ahuja, K., Schneider, J., de Maisieres, M.T.: The Connected Home Market. McKinsey & Company, New York (2015). http://www.mckinsey.com/spContent/connected_homes/pdf/McKinsey_Connectedhome.pdf

  2. Aldrich, F.K.: Smart homes: past, present, future. In: Harper, R. (ed.) Inside the Smart Home, pp. 17–39. Springer, London (2013)

  3. Aucouturier, J.J., Defréville, B., Pachet, F.: The bag-of-frames approach to audio pattern recognition: a sufficient model for urban soundscapes but not for polyphonic music. J. Acoust. Soc. Am. 122(2), 881–891 (2007)

  4. Battaglino, D., Lepauloux, L., Evans, N.: The open-set problem in acoustic scene classification. In: 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 1–5 (2016)

  5. Bendale, A., Boult, T.E.: Towards open set deep networks. CoRR abs/1511.06233 (2015). http://arxiv.org/abs/1511.06233

  6. Biet, N.: Internet of Things - Overview of the Market. The Faktory, Belgium (2014). http://www.thefaktory.com/wp-content/uploads/2015/01/IoT-market-overview-Final.pdf

  7. Bimbot, F., Bonastre, J.F., Fredouille, C., Gravier, G., Magrin-Chagnolleau, I., Meignier, S., Merlin, T., Ortega-García, J., Petrovska-Delacrétaz, D., Reynolds, D.A.: A tutorial on text-independent speaker verification. EURASIP J. Appl. Signal Process. 2004, 430–451 (2004)

  8. Boehm, F.: A comparison between US and EU data protection legislation for law enforcement purposes. Technical report, European Parliament (2015). http://www.europarl.europa.eu/studies

  9. Bonomi, F., Milito, R., Natarajan, P., Zhu, J.: Fog computing: a platform for internet of things and analytics. In: Bessis, N., Dobre, C. (eds.) Big Data and Internet of Things: A Roadmap for Smart Environments, pp. 169–186. Springer, Berlin (2014)

  10. Brümmer, N.: Measuring, refining and calibrating speaker and language information extracted from speech. Ph.D. thesis, Stellenbosch University (2010)

  11. Brümmer, N., du Preez, J.: Application-independent evaluation of speaker detection. Comput. Speech Lang. 20, 230–275 (2006)

  12. Buchholz, S., Latorre, J.: Crowdsourcing preference tests, and how to detect cheating. In: Proceedings of Interspeech 2011, pp. 3053–3056 (2011)

  13. Burges, C.J.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Disc. 2(2), 121–167 (1998)

  14. Business Software Alliance: Global cloud computing scorecard - country report: Vietnam (2012). http://cloudscorecard.bsa.org/2012/assets/PDFs/country_reports/Country_Report_Vietnam.pdf

  15. Celesti, A., Fazio, M., Villari, M.: Enabling secure XMPP communications in federated IoT clouds through XEP 0027 and SAML/SASL SSO. Sensors 17, 301 (2017)

  16. Clark, R.A.J., Podsiadło, M., Fraser, M., Mayo, C., King, S.: Statistical analysis of the Blizzard Challenge 2007 listening test results. In: Proceedings of the Blizzard Challenge (2007). http://www.festvox.org/blizzard/bc2007/

  17. Commission Nationale de l’Informatique et des Libertés. https://www.cnil.fr/ (2017). Last accessed 01/2017

  18. Corti, L., Van den Eynden, V., Bishop, L., Woollard, M.: Managing and Sharing Research Data. Sage Publishing, Thousand Oaks (2014)

  19. Crossley, D.: Samsung’s listening TV is proof that tech has outpaced our rights. The Guardian (2015). https://www.theguardian.com/media-network/2015/feb/13/samsungs-listening-tv-tech-rights

  20. Dall, R., Yamagishi, J., King, S.: Rating naturalness in speech synthesis: the effect of style and expectation. In: Proceedings of the Speech Prosody Workshop (2014)

  21. Davies, M.: C-Sense - exploiting low dimensional models in sensing, computation and signal processing. http://cordis.europa.eu/project/rcn/204493_en.html (2016). European Research Council project ID 694888, hosted at the University of Edinburgh. Online description last accessed 05/2017

  22. de Hert, P., Papakonstantinou, V.: The data protection regime in China. Technical report, European Parliament (2015). http://www.europarl.europa.eu/studies

  23. Deligne, S., Bimbot, F.: Inference of variable-length linguistic and acoustic units by multigrams. Speech Commun. 23, 223–241 (1997)

  24. Eskénazi, M., Levow, G.A., Meng, H., Parent, G., Suendermann, D.: Crowdsourcing for Speech Processing. Wiley, Chichester (2013)

  25. European Council: Directive 95/46/EC of the European Parliament and of the Council (1995). http://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX:31995L0046

  26. European Court of Human Rights, Council of Europe: European Convention on Human Rights (1950). http://www.echr.coe.int/Documents/Convention_ENG.pdf

  27. European Parliament: Resolution of 6 July 2011 on a comprehensive approach on personal data protection in the European Union (2011/2025(INI)) (2011). http://www.europarl.europa.eu/sides/getDoc.do?type=TA&reference=P7-TA-2011-0323&language=EN&ring=A7-2011-0244

  28. Foster, P., Sigtia, S., Krstulovic, S., Barker, J., Plumbley, M.D.: CHiME-Home: a dataset for sound source recognition in a domestic environment. In: Proceedings of the 11th Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) (2015)

  29. Fulford, N., Sutherland, T.: ‘One voice to bind them all’ - Smart home devices, AI, children and the law. Digit. Bus. Lawyer 18(10), 12–15 (2016)

  30. Gubbi, J., Buyya, R., Marusic, S., Palaniswami, M.: Internet of things (IoT): a vision, architectural elements, and future directions. Futur. Gener. Comput. Syst. 29(7), 1645–1660 (2013)

  31. Hain, T., Garner, P.N.: Speech recognition. In: Renals, S., Bourlard, H., Carletta, J., Popescu-Belis, A. (eds.) Multimodal Signal Processing: Human Interactions in Meetings. Cambridge University Press, Cambridge (2012)

  32. Hockett, C.F.: The origin of speech. Sci. Am. 203, 88–96 (1960)

  33. Hustinx, P.: EU data protection law: the review of Directive 95/46/EC and the proposed General Data Protection Regulation. Technical report, European University Institute’s Academy of European Law (2013). https://secure.edps.europa.eu/EDPSWEB/webdav/site/mySite/shared/Documents/EDPS/Publications/Speeches/2014/14-09-15_Article_EUI_EN.pdf

  34. Information Commissioner’s Office. https://ico.org.uk/ (2017). Last accessed 01/2017

  35. International Organization for Standardization: ISO 8201: Acoustics – Audible emergency evacuation signal. International Organization for Standardization, Geneva (1987)

  36. Kim, J.H., Chung, B.T.H., Keh, J.S., Lee, I.H., Kim, I.H., Chang, I.H.: Data protection in South Korea: overview. In: Data Protection Multi-Jurisdictional Guide 2015/16. Thomson Reuters, New York (2015). http://global.practicallaw.com/2-579-7926

  37. Kinoshita, M., Asayama, S., Kosinski, E.: Data protection in Japan: overview. In: Data Protection Multi-Jurisdictional Guide 2014/15. Thomson Reuters, New York (2014). http://global.practicallaw.com/5-520-1289

  38. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)

  39. Li, R.Y.M., Li, H.C.Y., Mak, C.K., Tang, T.B.: Sustainable smart home and home automation: big data analytics approach. Int. J. Smart Home 10(8), 177–198 (2016)

  40. Loi numéro 78-17 du 6 janvier 1978 relative à l’informatique, aux fichiers et aux libertés, et convention 108 (1978). http://www.cnil.fr/

  41. Martin, A., Doddington, G., Kamm, T., Ordowski, M., Przybocki, M.: The DET curve in assessment of detection task performance. In: Proceedings of Eurospeech’97, pp. 1895–1898 (1997)

  42. Medaglia, C.M., Serbanati, A.: An overview of privacy and security issues in the internet of things. In: Giusto, D., Iera, A., Morabito, G., Atzori, L. (eds.) The Internet of Things, pp. 389–395. Springer, Berlin (2010)

  43. Mesaros, A., Heittola, T., Virtanen, T.: Metrics for polyphonic sound event detection. Appl. Sci. 6(6), 162 (2016)

  44. Möller, S., Falk, T.H.: Quality prediction for synthesized speech: comparison of approaches. In: Proceedings of the International Conference on Acoustics, pp. 1168–1171 (2009)

  45. Möller, S., Kim, D., Malfait, L.: Estimating the quality of synthesized and natural speech transmitted through telephone networks using single-ended prediction models. Acta Acust. United Acust. 94, 21–31 (2008)

  46. Murphy, K.P.: Hidden semi-Markov models (HSMMs). Technical report, Massachusetts Institute of Technology (2002). http://www.cs.ubc.ca/~murphyk/papers/segment.pdf

  47. Nesta, F., Koldovský, Z.: Supervised independent vector analysis through pilot dependent components. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 536–540 (2017)

  48. Nieto, O., Farbood, M.M., Jehan, T., Bello, J.P.: Perceptual analysis of the F-measure for evaluating section boundaries in music. In: Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR) (2014)

  49. Norton Rose Fulbright: Global data privacy directory (2014). http://www.nortonrosefulbright.com/files/global-data-privacy-directory-52687.pdf

  50. Ostendorf, M.: Moving beyond the ‘beads-on-a-string’ model of speech. In: Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 79–84 (1999)

  51. Pathak, M.A.: Privacy preserving machine learning for speech processing. Ph.D. thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA (2012)

  52. Pathak, M.A., Raj, B., Rane, S., Smaragdis, P.: Privacy-preserving speech processing: cryptographic and string-matching frameworks show promise. IEEE Signal Process. Mag. 30(2), 62–74 (2013)

  53. Pragnell, M., Spence, L., Moore, R.: The Market Potential for Smart Homes. Joseph Rowntree Foundation, York (2000)

  54. Renals, S., Bourlard, H., Carletta, J., Popescu-Belis, A.: Speech Recognition. Cambridge University Press, Cambridge (2012)

  55. Reynolds, D.: Gaussian mixture models. In: Li, S.Z., Jain, A. (eds.) Encyclopedia of Biometrics, pp. 659–663. Springer, Berlin (2009)

  56. Scheirer, W.J., Rocha, A., Sapkota, A., Boult, T.E.: Towards open set recognition. IEEE Trans. Pattern Anal. Mach. Intell. (T-PAMI) 36, 1757–1772 (2013)

  57. Sigtia, S., Stark, A.M., Krstulović, S., Plumbley, M.D.: Automatic environmental sound recognition: performance versus computational cost. IEEE/ACM Trans. Audio Speech Lang. Process. 24(11), 2096–2107 (2016)

  58. Stowell, D., Giannoulis, D., Benetos, E., Lagrange, M., Plumbley, M.D.: Detection and classification of audio scenes and events. IEEE Trans. Multimedia 17(10), 1733–1746 (2015)

  59. Sturm, B.L.: Classification accuracy is not enough. J. Intell. Inf. Syst. 41(3), 371–406 (2013). http://rdcu.be/m8F6

  60. Sturm, B.L.: A simple method to determine if a music information retrieval system is a “horse”. IEEE Trans. Multimedia 16(6), 1636–1644 (2014)

  61. Su, L., Yeh, C.C.M., Liu, J.Y., Wang, J.C., Yang, Y.H.: A systematic evaluation of the bag-of-frames representation for music information retrieval. IEEE Trans. Multimedia 16(5), 1188–1200 (2014)

  62. Temko, A., Malkin, R., Zieger, C., Macho, D., Nadeu, C., Omologo, M.: CLEAR evaluation of acoustic event detection and classification systems. In: Stiefelhagen, R., Garofolo, J. (eds.) Multimodal Technologies for Perception of Humans, pp. 311–322. Springer, Berlin (2006)

  63. The European Parliament and the Council of the European Union: Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (2016). http://eur-lex.europa.eu/legal-content/en/TXT/?uri=CELEX%3A32016R0679

  64. The UK Data Service: Obligations when sharing data. https://www.ukdataservice.ac.uk/manage-data/legal-ethical/obligations/ (2017). Last accessed 01/2017

  65. The Universal Declaration of Human Rights. United Nations General Assembly resolution 217 A (1948). http://www.un.org/en/universal-declaration-human-rights/

  66. Tokuda, K., Masuko, T., Miyazaki, N., Kobayashi, T.: Multi-space probability distribution HMM. IEICE Trans. Inf. Syst. E85-D(3), 455–464 (2002)

  67. UK Data Protection Act 1998 (1998). http://www.legislation.gov.uk/ukpga/1998/29/contents

  68. UK Information Commissioner’s Office: Determining what information is ‘data’ for the purposes of the DPA (2012). https://ico.org.uk/media/for-organisations/documents/1609/what_is_data_for_the_purposes_of_the_dpa.pdf

  69. UK Information Commissioner’s Office: Determining what is personal data (2012). https://ico.org.uk/media/1554/determining-what-is-personal-data.pdf

  70. UK Information Commissioner’s Office: Data controllers and data processors: what the difference is and what the governance implications are (2014). https://ico.org.uk/media/1546/data-controllers-and-data-processors-dp-guidance.pdf

  71. van Leeuwen, D.A., Brümmer, N.: An introduction to application-independent evaluation of speaker recognition systems. In: Müller, C. (ed.) Speaker Classification I: Fundamentals, Features, and Methods, pp. 330–353. Springer, Berlin, Heidelberg (2007)

  72. Vermesan, O., Friess, P., Guillemin, P., Sundmaeker, H., Eisenhauer, M., Moessner, K., Arndt, M., Spirito, M., Medagliani, P., Giaffreda, R., Gusmeroli, S., Ladid, L., Serrano, M., Hauswirth, M., Baldini, G.: Internet of things strategic research and innovation agenda. In: Vermesan, O., Friess, P. (eds.) Internet of Things - From Research and Innovation to Market Deployment, chap. 3. River Publishers, Gistrup (2014)

  73. Virtanen, T., Mesaros, A., Heittola, T., Plumbley, M.D., Foster, P., Benetos, E., Lagrange, M. (eds.): Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016) (2016). http://www.cs.tut.fi/sgn/arg/dcase2016/workshop-proceedings

  74. Wang, Y., Getreuer, P., Hughes, T., Lyon, R.F., Saurous, R.A.: Trainable frontend for robust and far-field keyword spotting. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2017)

  75. Weiss, M.A., Archick, K.: US-EU Data Privacy: From Safe Harbor to Privacy Shield. Technical report, Congressional Research Service (2016). https://www.fas.org/sgp/crs/misc/R44257.pdf

  76. Wikipedia entry for “Clever Hans”. https://en.wikipedia.org/wiki/Clever_Hans (2017). Last accessed 01/2017

  77. Wikipedia entry for “Displacement”. https://en.wikipedia.org/wiki/Displacement_(linguistics) (2017). Last accessed 01/2017

  78. Wikipedia entry for “Institutional Review Board”. https://en.wikipedia.org/wiki/Institutional_review_board (2017). Last accessed 01/2017

  79. Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T., Kitamura, T.: Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In: Proceedings of Eurospeech’99 (1999)

  80. Zen, H., Tokuda, K., Masuko, T., Kobayashi, T., Kitamura, T.: Hidden semi-Markov model based speech synthesis. In: Proceedings of the International Conference on Spoken Language Processing (ICSLP) (2004)

  81. Zen, H., Masuko, T., Tokuda, K., Yoshimura, T., Kobayashi, T., Kitamura, T.: State duration modeling for HMM-based speech synthesis. IEICE Trans. Inf. Syst. E90-D(3), 692–693 (2007)


Acknowledgements

The research work presented in Sect. 12.2.3 was developed as a collaboration between Queen Mary University of London and Audio Analytic, supported by InnovateUK grant nr. 131604 and EPSRC grants EP/M507088/1 & EP/N014111/1, as well as private funding from Audio Analytic Ltd. Prof. Mark Plumbley, currently at the University of Surrey, supervised the work of Dr. Siddharth Sigtia, currently with Apple, and Dr. Adam Stark, currently with Mi.mu Gloves, on the academic side of this research. Tamara Sword from Audio Analytic contributed some of the wording about use cases and marketing aspects. The author would like to thank Dr. Tuomas Virtanen and Dr. Juan Bello for their insightful comments on this chapter.

Author information

Correspondence to Sacha Krstulović.

Copyright information

© 2018 Springer International Publishing AG

About this chapter

Cite this chapter

Krstulović, S. (2018). Audio Event Recognition in the Smart Home. In: Virtanen, T., Plumbley, M., Ellis, D. (eds) Computational Analysis of Sound Scenes and Events. Springer, Cham. https://doi.org/10.1007/978-3-319-63450-0_12

  • DOI: https://doi.org/10.1007/978-3-319-63450-0_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63449-4

  • Online ISBN: 978-3-319-63450-0