Approaches to Complex Sound Scene Analysis

  • Emmanouil Benetos
  • Dan Stowell
  • Mark D. Plumbley

Abstract

This chapter presents state-of-the-art research and open topics in the analysis of complex sound scenes recorded with a single microphone. First, the concept of sound scene recognition is introduced from the perspective of the different paradigms used (classification, tagging, clustering, segmentation) and the associated methods. The core of the chapter covers sound event detection and classification, presenting the main paradigms and practical considerations along with methods for monophonic and polyphonic sound event detection. The chapter then focuses on the concepts of context and “language modeling” for sound scenes, including the modeling of relationships between sound events. Work on sound scene recognition based on event detection is also presented. Finally, the chapter summarizes the topic and provides directions for future research.

Keywords

Scene analysis · Sound scene recognition · Sound event detection · Sound recognition · Acoustic language models · Audio context recognition · Hidden Markov models (HMMs) · Markov renewal process · Non-negative matrix factorization (NMF) · Feature learning · Soundscape

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Emmanouil Benetos (1)
  • Dan Stowell (1)
  • Mark D. Plumbley (2)

  1. School of Electronic Engineering and Computer Science, Queen Mary University of London, London, UK
  2. Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
