Temporal Acoustic Words for Online Acoustic Event Detection

  • Rene Grzeszick
  • Axel Plinge
  • Gernot A. Fink
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9358)


The Bag-of-Features principle has proved successful in many pattern recognition tasks, ranging from document analysis and image classification to gesture recognition and even forensic applications. Recently, these methods have emerged in the field of acoustic event detection and have shown very promising results. The detection and classification of acoustic events is an important task for many practical applications such as video understanding, surveillance, or speech enhancement. In this paper, a novel approach for online acoustic event detection is presented that builds on the Bag-of-Features principle. Features are calculated for all frames in a given window. By applying the concept of feature augmentation, additional temporal information is encoded in each feature vector. These feature vectors are then softly quantized so that a Bag-of-Features representation is computed. These representations are evaluated by a classifier in a sliding window approach. Experiments on a challenging indoor dataset of acoustic events show that the proposed method yields state-of-the-art results compared to other online event detection methods. Furthermore, it is shown that the temporal feature augmentation significantly improves the recognition rates.
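
The processing chain described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the details, so everything concrete here is an assumption made for illustration: the temporal augmentation is modelled as one extra coordinate holding the frame's relative position in the window, the soft quantization uses Gaussian weights on Euclidean distances, and the codebook is random rather than learned. The function names, the 13-dimensional frame features, and the window and hop sizes are hypothetical, and the classifier stage is omitted.

```python
import numpy as np

def augment_with_time(frames):
    """Append each frame's relative position in the window (0..1) as an extra dimension.
    Assumption: this is one simple way to encode temporal information."""
    n = frames.shape[0]
    t = np.linspace(0.0, 1.0, n).reshape(-1, 1)
    return np.hstack([frames, t])

def soft_quantize(features, codebook, sigma=1.0):
    """Soft-assign every feature vector to all codebook entries.
    Assumption: Gaussian weights on squared Euclidean distance."""
    # Squared distances, shape (n_frames, n_words)
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Subtract the row-wise minimum for numerical stability before exponentiating
    w = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / (2.0 * sigma ** 2))
    # Normalise so that each frame's weights sum to 1
    return w / w.sum(axis=1, keepdims=True)

def bag_of_features(frames, codebook, sigma=1.0):
    """Histogram over codebook entries for one window (sums to 1)."""
    assignments = soft_quantize(augment_with_time(frames), codebook, sigma)
    return assignments.mean(axis=0)

def sliding_windows(frames, window, hop):
    """Yield (start index, frames) for each full window."""
    for start in range(0, frames.shape[0] - window + 1, hop):
        yield start, frames[start:start + window]

# Placeholder data: 100 frames of 13-dimensional features (hypothetical sizes)
rng = np.random.default_rng(0)
frames = rng.normal(size=(100, 13))
# Placeholder codebook: 8 entries over 13 feature dims + 1 temporal dim.
# In practice a codebook would be learned from training data, e.g. by clustering.
codebook = rng.normal(size=(8, 14))

histograms = [bag_of_features(w, codebook)
              for _, w in sliding_windows(frames, window=20, hop=10)]
```

Each entry of `histograms` is one fixed-length window representation that could be passed to a classifier of choice. Because the temporal coordinate is part of every feature vector, the same spectral content at the beginning and at the end of a window can fall on different codebook entries, which is the intuition behind the augmentation.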



Copyright information

© Springer International Publishing Switzerland 2015

Open Access This chapter is distributed under the terms of the Creative Commons Attribution Noncommercial License, which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Authors and Affiliations

  1. Department of Computer Science, TU Dortmund, Dortmund, Germany
