Abstract
Recognition of environmental sound is usually based on two main architectures, depending on whether the model is trained with frame-level features or with aggregated descriptions of acoustic scenes or events. The former architecture is appropriate for applications where target categories are known in advance, while the latter affords a less supervised approach. In this paper, we propose a framework for environmental sound recognition based on blind segmentation and feature aggregation. We describe a new set of descriptors, based on Recurrence Quantification Analysis (RQA), which can be extracted from the similarity matrix of a time series of audio descriptors. We analyze their usefulness for recognition of acoustic scenes and events in addition to standard feature aggregation. Our results show the potential of non-linear time series analysis techniques for dealing with environmental sounds.
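As a rough illustration of the similarity-matrix step mentioned above, the sketch below builds a binary recurrence plot by thresholding pairwise distances between frame-level descriptors. It is not the authors' implementation; the descriptor type, distance metric, and threshold value are illustrative assumptions.

```python
import numpy as np

def recurrence_plot(features, threshold=0.5):
    """Threshold pairwise Euclidean distances between frame-level descriptors
    (an (N, D) array, e.g. MFCC frames) into an N x N binary recurrence matrix R."""
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))       # (N, N) distance matrix
    return (dist <= threshold).astype(np.uint8)    # R[i, j] = 1 if frames i and j are close

# Example with random "descriptors": 100 frames of 13 coefficients each
R = recurrence_plot(np.random.rand(100, 13))
```

The RQA descriptors analyzed in the paper are then computed from such a matrix R (see the Appendix).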
Notes
Note that our original submission to the D-CASE challenge obtained very good results by narrowing the frequency range to below 1 kHz. However, in our experiments this only worked for the challenge dataset, so for the sake of generality we chose a wider range here.
References
Alpaydin, E. (2014). Introduction to machine learning. MIT Press.
Aucouturier, J.J., Defreville, B., & Pachet, F. (2007). The bag-of-frames approach to audio pattern recognition: a sufficient model for urban soundscapes but not for polyphonic music. The Journal of the Acoustical Society of America, 122(2), 881.
Barchiesi, D., Giannoulis, D., Stowell, D., & Plumbley, M.D. (2015). Acoustic scene classification: classifying environments from the sounds they produce. IEEE Signal Processing Magazine, 32(3), 16–34.
Bello, J.P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M., & Sandler, M.B. (2005). A tutorial on onset detection in music signals. IEEE Audio, Speech Language Processing, 13(5), 1035–1047.
Bisot, V., Serizel, R., & Essid, S. (2016). Acoustic scene classification with matrix factorization for unsupervised feature learning. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6445–6449). IEEE.
Böck, S., & Widmer, G. (2013). Maximum filter vibrato suppression for onset detection. In Proceedings of the 16th international conference on digital audio effects (DAFx-13). Maynooth.
Brons, I., Houben, R., & Dreschler, W.A. (2014). Effects of noise reduction on speech intelligibility, perceived listening effort, and personal preference in hearing-impaired listeners. Trends in Hearing 18. https://doi.org/10.1177/2331216514553924.
Cano, P., Koppenberger, M., & Wack, N. (2005). An industrial-strength content-based music recommendation system. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (p. 673). Salvador.
Cawley, G.C., & Talbot, N.L. (2010). On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11, 2079–2107.
Chachada, S., & Kuo, C.C.J. (2014). Environmental sound recognition: a survey. APSIPA Transactions on Signal and Information Processing, 3, e14.
Chang, C.C., & Lin, C.J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Chechik, G., Ie, E., Rehn, M., Bengio, S., & Lyon, D. (2008). Large-scale content-based audio retrieval from text queries. In Proceedings of the 1st ACM international conference on multimedia information retrieval (MIR ’08) (p. 105). Beijing.
Chu, S., Narayanan, S., Kuo, C.C.J., & Mataric, M.J. (2006). Where am I? Scene recognition for mobile robots using audio features. In 2006 IEEE International conference on multimedia and expo (pp. 885–888).
Chu, S., Narayanan, S., & Kuo, C.C.J. (2009). Environmental sound recognition with time-frequency audio features. IEEE Audio, Speech Language Processing, 17(6), 1142–1158.
Clavel, C., Ehrette, T., & Richard, G. (2005). Events detection for an audio-based surveillance system. In: IEEE International conference on multimedia and expo (ICME 2005) (pp. 1306–1309).
Dargie, W. (2009). Adaptive audio-based context recognition. IEEE Transactions on Systems, Man and Cybernetics Part A: Systems and Humans, 39(4), 715–725.
Ellis, D.P.W. (2005). PLP and RASTA (and MFCC, and inversion) in Matlab. http://www.ee.columbia.edu/ln/rosa/matlab/rastamat/. Online web resource.
Eronen, A., Peltonen, V., Tuomi, J., Klapuri, A., Fagerlund, S., Sorsa, T., Lorho, G., & Huopaniemi, J. (2006). Audio-based context recognition. IEEE Audio, Speech Language Processing, 14(1), 321–329.
Gaver, W. (1993). What in the world do we hear?: an ecological approach to auditory event perception. Ecological Psychology, 5(1), 1–29.
Geiger, J.T., Schuller, B., & Rigoll, G. (2013). Recognising acoustic scenes with large-scale audio feature extraction and SVM. Tech. rep. IEEE AASP challenge: detection and classification of acoustic scenes and events.
Giannoulis, D., Benetos, E., Stowell, D., Rossignol, M., Lagrange, M., & Plumbley, M.D. (2013). Detection and classification of acoustic scenes and events: an IEEE AASP challenge. In 2013 IEEE workshop on applications of signal processing to audio and acoustics (WASPAA) (pp. 1–4). IEEE.
Heittola, T., Mesaros, A., Eronen, A., & Virtanen, T. (2013). Context-dependent sound event detection. EURASIP Journal on Audio, Speech, and Music Processing, 1, 1.
Huang, Z., Cheng, Y.C., Li, K., Hautamäki, V., & Lee, C.H. (2013). A blind segmentation approach to acoustic event detection based on i-vector. In Proceedings of interspeech (pp. 2282–2286).
Imoto, K., Ohishi, Y., Uematsu, H., & Ohmuro, H. (2013). Acoustic scene analysis based on latent acoustic topic and event allocation. In 2013 IEEE international workshop on machine learning for signal processing (MLSP) (pp. 1–6). IEEE.
ITU-T (2010). A generic sound activity detector recommendation G.720.1. https://www.itu.int/rec/T-REC-G.720.1/en.
Klapuri, A. (1999). Sound onset detection by applying psychoacoustic knowledge. In Proceedings of 1999 IEEE international conference on acoustics, speech, and signal processing, 1999. (Vol. 6, pp. 3089–3092). IEEE.
Lagrange, M., Lafay, G., Defreville, B., & Aucouturier, J.J. (2015). The bag-of-frames approach: a not so sufficient model for urban soundscapes. The Journal of the Acoustical Society of America, 138(5), EL487–EL492.
Lee, H., Pham, P., Largman, Y., & Ng, A.Y. (2009). Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in neural information processing systems (pp. 1096–1104).
Lee, K., & Ellis, D.P.W. (2010). Audio-based semantic concept classification for consumer video. IEEE Audio, Speech and Language Processing, 18(6), 1406–1416.
Martin, R. (1994). Spectral subtraction based on minimum statistics. Proceedings of EUSIPCO, 94(1), 1182–1185.
McDermott, J.H., & Simoncelli, E.P. (2011). Sound texture perception via statistics of the auditory periphery: evidence from sound synthesis. Neuron, 71(5), 926–940.
Pachet, F., & Roy, P. (2007). Exploring billions of audio features. In 2007 international workshop on content-based multimedia indexing (pp. 227–235). IEEE.
Parascandolo, G., Huttunen, H., & Virtanen, T. (2016). Recurrent neural networks for polyphonic sound event detection in real life recordings. In 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 6440–6444). IEEE.
Rakotomamonjy, A., & Gasso, G. (2015). Histogram of gradients of time-frequency representations for audio scene classification. IEEE/ACM Transactions on Audio Speech and Language Processing, 23(1), 142–153.
Roma, G., Nogueira, W., & Herrera, P. (2013). Recurrence quantification analysis features for environmental sound recognition. In 2013 IEEE workshop on applications of signal processing to audio and acoustics (WASPAA) (pp. 1–4). IEEE.
Scheirer, E.D. (1998). Tempo and beat analysis of acoustic musical signals. The Journal of the Acoustical Society of America, 103(1), 588–601.
Serrà, J., Serra, X., & Andrzejak, R.G. (2009). Cross recurrence quantification for cover song identification. New Journal of Physics, 11.
Serrà, J., De los Santos, C., & Andrzejak, R.G. (2011). Nonlinear audio recurrence analysis with application to genre classification. In 2011 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 169–172). IEEE.
Sohn, J., Kim, N.S., & Sung, W. (1999). A statistical model-based voice activity detection. IEEE Signal Processing Letters, 6(1), 1–3.
Stowell, D., Giannoulis, D., Benetos, E., Lagrange, M., & Plumbley, M. (2015). Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 17(19).
Tzanetakis, G., & Cook, P. (2002). Musical genre classification of audio signals. IEEE Audio, Speech and Language Processing, 10(5), 293–302.
Washington, C.d.A., Assis, F.M., Neto, B.G.A., Costa, S.C., & Vieira, V.J.D. (2012). Pathological voice assessment by recurrence quantification analysis. In 2012 ISSNIP biosignals and biorobotics conference: biosignals and robotics for better and safer living (BRC) (pp. 1–6). IEEE.
Webber, C.L., & Zbilut, J.P. (1994). Dynamical assessment of physiological systems and states using recurrence plot strategies. Journal of Applied Physiology, 76(2), 965–973.
Webber, Jr, C.L., & Zbilut, J.P. (2005). Recurrence quantification analysis of nonlinear dynamical systems. Tutorials in Contemporary Nonlinear Methods for the Behavioral Sciences, 26–94.
Xu, M., Maddage, N., Xu, C., Kankanhalli, M., & Tian, Q. (2003). Creating audio keywords for event detection in soccer video. In Proceedings of the 2003 IEEE international conference on multimedia and expo (ICME ’03) (Vol. 2, pp. II–281).
Yu, D.Y.D., Deng, L.D.L., Droppo, J., Wu, J.W.J., Gong, Y.G.Y., & Acero, A. (2008). Robust speech recognition using a cepstral minimum-mean-square-error-motivated noise suppressor. IEEE Audio, Speech and Language Processing, 16(5), 1061–1070. https://doi.org/10.1109/TASL.2008.921761.
Zbilut, J.P., & Webber, C.L.J. (2006). Recurrence quantification analysis. In Akay, M. (Ed.) Wiley encyclopedia of biomedical engineering. Hoboken: Wiley.
Zhang, H., McLoughlin, I., & Song, Y. (2015). Robust sound event recognition using convolutional neural networks. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP) (pp. 559–563). IEEE.
Zhang, T., & Kuo, C.C.J. (1998). Hierarchical system for content-based audio classification and retrieval. In Photonics East (ISAM, VVDC, IEMB) (pp. 398–409). International Society for Optics and Photonics.
Acknowledgements
The first author was with the Music Technology Group at Universitat Pompeu Fabra for the main part of this work. The third author was with the Music Technology Group at Universitat Pompeu Fabra for part of this work. This work has been supported by the DFG cluster of excellence EXC 1077/1 “Hearing4all”.
Appendix: RQA features
This appendix details the equations used for computing these features, mostly derived from Webber and Zbilut (1994). Here, R refers to the recurrence plot described in Section 2.4.
- Recurrence rate (REC) is the percentage of points in the recurrence plot:

  $$ REC = \frac{1}{N^{2}} \sum\limits_{i,j=1}^{N} R_{i,j} \qquad (12) $$

- Determinism (DET) is measured as the percentage of points that lie on diagonal lines:

  $$ DET = \frac{\sum_{l=l_{min}}^{N} l P(l)}{\sum_{i,j=1}^{N} R_{i,j}} \qquad (13) $$

  where P(l) is the histogram of diagonal line lengths l.

- Laminarity (LAM) is the percentage of points that form vertical lines:

  $$ LAM = \frac{\sum_{v=v_{min}}^{N} v P(v)}{\sum_{v=1}^{N} v P(v)} \qquad (14) $$

  where P(v) is the histogram of vertical line lengths v.

- The ratio between DET and REC is often used. We also use the ratio between LAM and REC, so we define them as

  $$ DRATIO = N^{2} \, \frac{\sum_{l=l_{min}}^{N} l P(l)}{\left(\sum_{l=1}^{N} l P(l)\right)^{2}} \qquad (15) $$

  $$ VRATIO = N^{2} \, \frac{\sum_{v=v_{min}}^{N} v P(v)}{\left(\sum_{v=1}^{N} v P(v)\right)^{2}} \qquad (16) $$

- LEN and trapping time (TT) are the average diagonal and vertical line lengths:

  $$ LEN = \frac{\sum_{l=l_{min}}^{N} l P(l)}{\sum_{l=l_{min}}^{N} P(l)} \qquad (17) $$

  $$ TT = \frac{\sum_{v=v_{min}}^{N} v P(v)}{\sum_{v=v_{min}}^{N} P(v)} \qquad (18) $$

- Another common feature is the length of the longest diagonal or vertical line. The inverse of the maximum diagonal length (called divergence) is also used. We use the inverse of both the maximum diagonal and vertical lengths:

  $$ DDIV = \frac{1}{\max(l)} \qquad (19) $$

  $$ VDIV = \frac{1}{\max(v)} \qquad (20) $$

- Finally, the Shannon entropy of the diagonal line lengths is commonly used. We also compute the entropy of the vertical line lengths:

  $$ DENT = - \sum\limits_{l=l_{min}}^{N} P(l) \ln(P(l)) \qquad (21) $$

  $$ VENT = - \sum\limits_{v=v_{min}}^{N} P(v) \ln(P(v)) \qquad (22) $$
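For concreteness, the following sketch computes these descriptors from a binary recurrence matrix R using NumPy. It is an illustrative implementation of Eqs. (12)–(22), not the code used in the paper; the minimum line lengths l_min and v_min are assumed values (a common choice is 2), and at least one diagonal and one vertical line of that length are assumed to exist.

```python
import numpy as np

def line_lengths(binary_sequence):
    """Lengths of consecutive runs of ones in a 1-D binary sequence."""
    lengths, count = [], 0
    for x in binary_sequence:
        if x:
            count += 1
        elif count:
            lengths.append(count)
            count = 0
    if count:
        lengths.append(count)
    return lengths

def rqa_features(R, l_min=2, v_min=2):
    """RQA descriptors from an N x N binary recurrence matrix R."""
    N = R.shape[0]
    rec = R.sum() / N ** 2                                    # Eq. (12)

    # Collect diagonal and vertical line lengths (histograms P(l) and P(v))
    diag_all = [x for k in range(-(N - 1), N)
                for x in line_lengths(np.diagonal(R, offset=k))]
    vert_all = [x for j in range(N) for x in line_lengths(R[:, j])]
    l = np.array([x for x in diag_all if x >= l_min])
    v = np.array([x for x in vert_all if x >= v_min])

    det = l.sum() / R.sum()                                   # Eq. (13)
    lam = v.sum() / sum(vert_all)                             # Eq. (14)
    dratio = N ** 2 * l.sum() / sum(diag_all) ** 2            # Eq. (15)
    vratio = N ** 2 * v.sum() / sum(vert_all) ** 2            # Eq. (16)
    length = l.mean()                                         # Eq. (17)
    tt = v.mean()                                             # Eq. (18)
    ddiv = 1.0 / l.max()                                      # Eq. (19)
    vdiv = 1.0 / v.max()                                      # Eq. (20)

    # Shannon entropies of the line-length distributions, Eqs. (21)-(22)
    p_l = np.bincount(l)[l_min:] / len(l)
    p_v = np.bincount(v)[v_min:] / len(v)
    dent = -np.sum(p_l[p_l > 0] * np.log(p_l[p_l > 0]))
    vent = -np.sum(p_v[p_v > 0] * np.log(p_v[p_v > 0]))

    return {"REC": rec, "DET": det, "LAM": lam, "DRATIO": dratio,
            "VRATIO": vratio, "LEN": length, "TT": tt,
            "DDIV": ddiv, "VDIV": vdiv, "DENT": dent, "VENT": vent}
```

These values can then be used alongside the aggregated frame-level statistics, as described in the paper.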
Cite this article
Roma, G., Herrera, P. & Nogueira, W. Environmental sound recognition using short-time feature aggregation. J Intell Inf Syst 51, 457–475 (2018). https://doi.org/10.1007/s10844-017-0481-4
Keywords
- Audio databases
- Event detection
- Environmental sound recognition
- Audio features
- Recurrence quantification analysis
- Pattern recognition