Automatic Sound Recognition of Urban Environment Events

  • Theodoros Theodorou
  • Iosif Mporas
  • Nikos Fakotakis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9319)


Audio analysis of a speaker’s surroundings has been a first step for several processing systems that support the speaker’s mobility through daily life. These algorithms usually operate on a short-time basis, decomposing the incoming events in the time and frequency domains. In this paper, we study an automatic sound recognizer that targets audio events of interest from the urban environment. Our experiments were conducted on a closed set of audio events, from which well-known and commonly used audio descriptors were extracted and models were trained using powerful machine learning algorithms. The best urban sound recognition performance was achieved by SVMs, with an accuracy of approximately 93 %.


Automatic sound recognition · Urban environment · Dimensionality redundancy

1 Introduction

Over the last years, the recognition of a person’s communication activities (mostly phenomena with a cognitive index, like speech, gestures and nods) across an increasing number of application areas has driven automatic processing tools to incorporate analysis of the person’s surroundings. One of the most common audio surroundings is the urban environment. The authors of [20] divide urban events according to their origin (human, biological and geophysical) and draw conclusions about their effect on people’s health, psychology, economic and lifestyle behavior, cognitive needs, etc.

An analytic taxonomy of the urban soundscape is presented in [18]. A review of the bibliography shows that many studies have investigated the audio events occurring in urban environments [2, 5, 8, 15, 21, 23]. Article [12] examines the landscape as an important parameter of study. Events of urban origin can also be found in other studies [4, 6, 9, 10, 11, 14, 16, 17, 22]. As can be seen, events related to the transportation system of a city [8, 10, 14, 15, 18, 23] and to weather phenomena [9, 10, 14, 15, 18] are quite common in recordings. Typical of an urban environment (in addition to rural and non-industrial environments) is construction activity [18], which contributes interesting events to the urban soundscape. Moreover, people themselves, heard as crowd noise (in acoustic terms) [10, 14, 15, 17, 23], and the vocalizations of domestic animals and urban wildlife [4, 11, 16, 18, 23] can easily be captured in audio recordings. Local customs, traditions and technology (like horns/sirens, phone rings and bells) [4, 11, 18, 22] are also related to acoustic phenomena of the urban soundscape.

Several works have proposed architectures that address unexpected audio events from urban surroundings occurring in speech or communication audio signals. In those architectures sound recognition precedes all other stages, and thus stands as a cornerstone of the audio processing. Typical sound recognition consists of short-time analysis in the time and frequency domains, dividing the audio sequence into intervals. Audio descriptors are used to represent those intervals, and pattern recognition algorithms build models that assign each interval a sound label from the set of events of interest.

In the search for audio descriptors that capture the characteristics of urban events, several parameters from both the time and frequency domains have been proposed. The zero crossing rate [9, 14, 17] and the Mel Frequency Cepstral Coefficients [10, 15, 18, 22] are typical, commonly used examples from the time and frequency domains, respectively. Meanwhile, several studies have presented a variety of other descriptors, like the MPEG-7 descriptors [6, 11, 15], the Perceptual Wavelet Packets [15], the Linear Prediction Code [8, 10], the Matching Pursuit [10], the pitch [17] and the spectral statistics [9, 14, 17].

A variety of deterministic and probabilistic machine learning algorithms have been selected to evaluate experimental frameworks of urban environment events. Some of the well-known and commonly used algorithms are neural networks [8, 10, 21], support vector machines [6, 17, 18, 22, 23], decision trees [17, 18], k-nearest neighbors [10, 17] and hidden Markov models [4, 6, 11].

In this work, we present a sound recognition methodology for distinguishing audio events of the urban environment based on widely known and commonly used audio descriptors. We combine them with a data-driven ranking algorithm to investigate their necessity and relevance to our audio task. Accordingly, we concentrate on two hypotheses of our framework: how well the classifiers cope with dimensionality redundancy, and how classifier effectiveness changes when descriptors irrelevant to the task have not been discarded.

The rest of the article is organized as follows. In Sect. 2 we present the architecture in detail. In Sect. 3 we introduce the experimental framework and give details about the database, the audio descriptor and recognizer selection, and the ranking technique. In Sect. 4 we describe the experimental results. Finally, Sect. 5 presents the conclusions.

2 System Description

In the proposed scheme, the architecture for distinguishing audio events found in urban environment recordings relies on short-time analysis in both the time and frequency domains. A ranking technique is applied to the audio descriptors to score their discrimination ability with respect to the events of interest in this task. The scheme of this architecture appears in Fig. 1. As shown, the architecture is divided into a training and a testing phase.
Fig. 1.

General architecture of the scheme for distinguishing events from urban environments using feature subspaces.

During the training phase, the training set of recordings \(X=\{ X^r \}, r\in [1,R]\), which has previously been annotated with label tags and covers the whole closed set of urban events of interest, is driven sequentially through the stages of short-time analysis. Initially, the preprocessing stage frame-blocks the recordings into sequences of overlapping frames with constant length and time shift, \(O=\{ O^r \}\). Afterwards, these frame sequences are decomposed using a closed set of audio descriptors provided by the feature extraction stage. The resulting feature vector sequence \(V=\{ V^r \}\) consists of N frame-level features organized in feature vectors \(V^r=\{ V_i^r \}, i\in [1,N]\). Thereafter, the ranking stage scores the discrimination ability of each feature. The output rankings \(S_i^r\) are used to create feature subspaces: the features of least significance are discarded from the sequence. The threshold D that defines the boundary of necessity of a feature is set manually or determined by data-driven criteria. The new sequences \(P^r=g(V^r,S_i^r,D)\) are driven into the training step of the classification, where the model \(M^D\) is trained from the sequences \(P^r\).

During the testing phase, the audio file under examination \(X^Y\) is pre-processed with the same frame-blocking procedure as in the training phase, producing the frame sequence \(O^Y\). The feature selection made during the training phase is then fed to the feature extraction block, so that the frame sequence is decomposed using only those descriptors that belong to the working subspace. The extracted sequence \(P^Y=g(O^Y,S_i^r,D)\) is finally driven to the classifier, which produces the results \(w=f(P^Y,M^D)\) by frame-level classification over the label tags annotated during the training phase.

Further post-processing of the results is performed for fine-tuning (either at the decision level or on the classification scores). This architecture allows exploiting feature subspaces that contribute to accurate discrimination of the events of interest while simultaneously addressing dimensionality redundancy, by examining and discarding non-preferable descriptors.
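The ranking and subspace steps described above can be illustrated with a minimal sketch. The paper does not name its ranking algorithm, so the Fisher-style ratio used here is an assumption chosen only for illustration, as are the function names, the toy feature values, and the reading of the threshold D as a number of retained features.

```python
from collections import defaultdict


def fisher_scores(vectors, labels):
    """Score each feature by between-class over within-class scatter.

    This criterion is an assumed stand-in for the ranking stage S_i.
    """
    n_features = len(vectors[0])
    by_class = defaultdict(list)
    for vec, lab in zip(vectors, labels):
        by_class[lab].append(vec)
    scores = []
    for i in range(n_features):
        overall = sum(v[i] for v in vectors) / len(vectors)
        between = 0.0
        within = 0.0
        for rows in by_class.values():
            mean = sum(r[i] for r in rows) / len(rows)
            between += len(rows) * (mean - overall) ** 2
            within += sum((r[i] - mean) ** 2 for r in rows)
        if within > 0:
            scores.append(between / within)
        else:
            # Degenerate feature: constant inside every class.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def select_subspace(scores, d):
    """Indices of the d best-ranked features (D interpreted as a count)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:d])


def project(vectors, kept):
    """g(V, S, D): keep only the selected feature columns of each frame vector."""
    return [[vec[i] for i in kept] for vec in vectors]


# Toy frame-level vectors with two features; the values are invented.
train_vectors = [[0.0, 5.1], [0.2, 4.9], [1.0, 5.0], [1.2, 5.2]]
train_labels = ["wind", "wind", "engine", "engine"]

scores = fisher_scores(train_vectors, train_labels)
kept = select_subspace(scores, d=1)
train_subspace = project(train_vectors, kept)
```

The same `kept` indices would be reused at test time, which mirrors the statement that the selection made during training drives the feature extraction of the testing phase.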

3 Experimental Setup

The experimental setup for the evaluation of the architecture described in Sect. 2 is presented here. With this experimental framework we test the hypotheses mentioned previously. Initially, we study which algorithms perform best in our high-dimensional feature space; then, after the ranking-driven method defines the working subspace, we examine the classifiers’ effectiveness under this redundancy of dimensions. The following subsections present details of this framework: the audio set used for the evaluation, the feature extraction algorithms, the ranking procedure and the selection of classifiers.
Table 1.

Duration distribution of our events of interest in the collected audio

Sound Type    Duration (in seconds)
Motor Engines

3.1 Audio Data Description

The evaluation of the experimental framework relies on a collection of audio events commonly found in urban environments. Owing to the lack of a commonly used database suitable for discriminating urban events, we turned to the BBC FX Library [1], from which we collected recordings with a total duration of 1,044.3 s. These recordings represent common urban events; their duration distribution is given in Table 1. All data were stored in single-channel audio files with a sampling frequency of 16 kHz and a resolution of 8 bits per sample, and were manually annotated by an expert audio engineer.

3.2 Feature Extraction

The audio events occurring in the urban environment vary in their time and frequency characteristics, and the literature therefore proposes a variety of audio descriptors. In this study, we rely on the OpenSmile [7] framework to extract well-known and commonly used features related to general audio processing and sound event discrimination. The overall structure of our short-time analysis relies on frame-blocking the audio sequence into overlapping frames of constant length 25 ms with a time shift of 10 ms, a first-order FIR pre-emphasis filter followed by Hamming windowing, and a variety of audio descriptors extracted at frame level. The parameterization and the feature extraction are illustrated in Fig. 2.
Fig. 2.

Diagram of parameterization and feature extraction.
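As a rough illustration of the parameterization stage, the sketch below frame-blocks a 16 kHz signal into 25 ms frames with a 10 ms shift, after a first-order pre-emphasis filter and with Hamming windowing. It is not the OpenSmile implementation: the pre-emphasis coefficient 0.97 is an assumed value, the filter is applied to the whole signal for simplicity, and the synthetic tone merely stands in for a recording.

```python
import math


def pre_emphasis(signal, coeff=0.97):
    """First-order FIR filter y[n] = x[n] - coeff * x[n-1] (coeff is assumed)."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]


def hamming(length):
    """Hamming window coefficients for a frame of the given length."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping windowed frames; a trailing partial frame is dropped."""
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000      # 160 samples at 16 kHz
    window = hamming(frame_len)
    emphasized = pre_emphasis(signal)
    frames = []
    for start in range(0, len(emphasized) - frame_len + 1, shift):
        frame = emphasized[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


# One second of a synthetic 440 Hz tone standing in for a real recording.
tone = [math.sin(2.0 * math.pi * 440.0 * n / 16000) for n in range(16000)]
frames = frame_signal(tone)
```

With these settings one second of audio yields 98 full frames of 400 samples each, and consecutive frames overlap by 240 samples.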

In detail, we extract (a) the zero-crossing rate (ZCR), a quite common time-domain feature. Moreover, using suitable filter banks, we extract (b) the Mel frequency cepstral coefficients (MFCC) [19], (c) the chroma coefficients (Chroma) [3, 13] and (d) the Mel Spectrum. Thereafter, we extract (e) the energy, (f) the pitch, (g) the pitch envelope, (h) the voicing probability and some spectral statistics: (i) the energy of 4 bands, (j) the roll-off, (k) the flux, (l) the centroid, (m) the frequency of maximum magnitude and (n) the frequency of minimum magnitude. Finally, their values are concatenated into a feature vector, which is expanded with (o) first and second derivatives (delta and delta-delta coefficients).
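Two of the listed steps, the zero-crossing rate (a) and the derivative expansion (o), are simple enough to sketch directly. The version below is an illustration only and is not taken from OpenSmile: the regression-based delta formula with a window of two frames on each side and repeated edge frames is a common convention assumed here, and the function names are hypothetical.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def deltas(track, width=2):
    """Regression-based delta of a per-frame feature track.

    Frames beyond either end are replaced by the nearest edge frame.
    """
    denom = 2 * sum(k * k for k in range(1, width + 1))
    last = len(track) - 1
    out = []
    for t in range(len(track)):
        num = sum(
            k * (track[min(t + k, last)] - track[max(t - k, 0)])
            for k in range(1, width + 1)
        )
        out.append(num / denom)
    return out


def with_derivatives(track):
    """Return [value, delta, delta-delta] for every frame of one feature."""
    d1 = deltas(track)
    d2 = deltas(d1)
    return [[v, a, b] for v, a, b in zip(track, d1, d2)]


# A linearly rising feature track: constant slope, zero curvature mid-track.
expanded = with_derivatives([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
```

For the linear track above, the middle frame receives a delta of 1 and a delta-delta of 0, which is the behaviour one would expect from a slope and curvature estimate.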

3.3 Classification

The sound type classification models were built using the WEKA software toolkit [24]. The selection of classifiers was based on well-known algorithms that are widely used in audio processing tasks. We thus selected: the k-nearest neighbors classifier with linear search of the nearest neighbor and without distance weighting, also known as the instance-based classifier (IBk); the Bayes network (BayesNet) with Simple Estimator (alpha = 0.5) and the K2 search algorithm (maximum number of parents = 1); a 3-layer multilayer perceptron (MLP) neural network; a pruned C4.5 decision tree (J48); and a support vector machine trained with the sequential minimal optimization (SMO) algorithm and an RBF kernel. The training/testing framework was the same for all algorithms, to allow direct comparison of results. The models were built for the sound types of Table 1.
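To make the simplest of these classifiers concrete, the following sketch performs a k-nearest-neighbors decision with a linear search and an unweighted majority vote, in the spirit of the IBk configuration described above. It is not WEKA code: the function name, the default k = 1 (the paper does not state the value of k), the Euclidean distance and the toy vectors are all assumptions.

```python
import math
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=1):
    """Label a query vector by an unweighted vote among its k nearest neighbors.

    Every training vector is visited (linear search) and Euclidean
    distance is used, which is an assumed choice of metric.
    """
    distances = sorted(
        (math.dist(vec, query), lab)
        for vec, lab in zip(train_vectors, train_labels)
    )
    votes = Counter(lab for _, lab in distances[:k])
    return votes.most_common(1)[0][0]


# Invented one-dimensional frame vectors for two sound types.
train_vectors = [[0.0], [0.2], [1.0], [1.2]]
train_labels = ["wind", "wind", "engine", "engine"]

predicted = knn_predict(train_vectors, train_labels, [0.9])
```

Because every frame is compared against the whole training set, the cost of this search grows with both the number of training frames and the number of retained features, which is one practical reason to study reduced feature subspaces.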

4 Experimental Results

The urban sound recognition methodology described in Sect. 2 was evaluated using the experimental setup presented in Sect. 3. The experimental results are tabulated in Table 2.
Table 2.

Duration Distribution of our events of interest in the collected audio

As can be seen in Table 2, the best performing algorithm was the support vector machine with radial basis function kernel, which achieved a classification accuracy of \(92.73\,\%\). The second best was the MLP neural network, with \(90.59\,\%\) accuracy. The results show that the two evaluated discriminative algorithms outperformed the remaining algorithms.

In terms of sound types, the most frequently confused pair was the wind sounds and the motor engine sounds, which the best performing SVM algorithm misclassified at a rate of approximately \(5\,\%\). All remaining misclassifications occurred at rates below \(2\,\%\) in all cases.
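Figures of this kind are typically derived from a confusion matrix over frame-level decisions. The sketch below shows one way to compute overall accuracy and the confusion rate between a pair of classes; the paper does not define exactly how its pairwise percentage was obtained, so the formula in `pair_confusion_rate` is an assumption, and the labels are invented and unrelated to the reported results.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count frames for every (true label, predicted label) pair."""
    return Counter(zip(true_labels, predicted_labels))


def accuracy(confusion):
    """Share of frames whose predicted label equals the true label."""
    total = sum(confusion.values())
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    return correct / total if total else 0.0


def pair_confusion_rate(confusion, a, b):
    """Share of frames of classes a and b that were labelled as the other class."""
    total = sum(n for (t, _), n in confusion.items() if t in (a, b))
    swapped = confusion[(a, b)] + confusion[(b, a)]
    return swapped / total if total else 0.0


# Invented frame-level decisions for three hypothetical sound types.
true_labels = ["wind"] * 4 + ["engine"] * 4 + ["siren"] * 2
predicted_labels = [
    "wind", "wind", "wind", "engine",
    "engine", "engine", "engine", "wind",
    "siren", "siren",
]

confusion = confusion_counts(true_labels, predicted_labels)
overall = accuracy(confusion)
wind_engine = pair_confusion_rate(confusion, "wind", "engine")
```

In this toy case eight of ten frames are correct, and two of the eight wind and engine frames are swapped between the two classes.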

5 Conclusions

The increasing influence of a speaker’s surroundings has driven scientific research interest towards sound recognition of events from a person’s environment. Since the urban environment is quite common, it has become a cornerstone among the environments of interest. In the present work, we studied a methodology for automatic sound recognition with a short-time analysis framework using urban environment events. After computing a large set of audio descriptors, we applied a ranking algorithm to score the descriptors’ discrimination ability in terms of necessity and relevance to the current task. We evaluated this framework with some well-known and commonly used machine learning algorithms. Our results show that SVMs outperformed all other algorithms, with an accuracy of \(92.73\,\%\). Moreover, sequentially expanding the working descriptor space with unnecessary dimensions does not significantly affect the effectiveness of the machine learning algorithms.


  1. The BBC sound effects library original series.
  2. Aucouturier, J.J., Defreville, B., Pachet, F.: The bag-of-frames approach to audio pattern recognition: a sufficient model for urban soundscapes but not for polyphonic music. J. Acoust. Soc. Am. 122(2), 881–891 (2007)
  3. Bartsch, M.A., Wakefield, G.H.: Audio thumbnailing of popular music using chroma-based representations. IEEE Trans. Multimedia 7(1), 96–104 (2005)
  4. Casey, M.: General sound classification and similarity in MPEG-7. Organised Sound 6(02), 153–164 (2001)
  5. Couvreur, L., Laniray, M.: Automatic noise recognition in urban environments based on artificial neural networks and hidden markov models. InterNoise, Prague, Czech Republic, pp. 1–8 (2004)
  6. Dogan, E., Sert, M., Yazici, A.: Content-based classification and segmentation of mixed-type audio by using mpeg-7 features. In: First International Conference on Advances in Multimedia, MMEDIA 2009, pp. 152–157. IEEE (2009)
  7. Eyben, F., Wöllmer, M., Schuller, B.: Opensmile: the munich versatile and fast open-source audio feature extractor. In: Proceedings of the international conference on Multimedia, pp. 1459–1462. ACM (2010)
  8. Fernandez, L.P.S., Ruiz, A.R., de JM Juarez, J.: Urban noise permanent monitoring and pattern recognition. In: Proceedings of the European Conference of Communications-ECCOM, vol. 10, pp. 143–148 (2010)
  9. Huang, R., Hansen, J.H.: Advances in unsupervised audio classification and segmentation for the broadcast news and NGSW corpora. IEEE Trans. Audio Speech Lang. Process. 14(3), 907–919 (2006)
  10. Khunarsal, P., Lursinsap, C., Raicharoen, T.: Very short time environmental sound classification based on spectrogram pattern matching. Inf. Sci. 243, 57–74 (2013)
  11. Kim, H.G., Moreau, N., Sikora, T.: Audio classification based on MPEG-7 spectral basis representations. IEEE Trans. Circuits Syst. Video Technol. 14(5), 716–725 (2004)
  12. Kinnunen, T., Saeidi, R., Leppänen, J., Saarinen, J.P.: Audio context recognition in variable mobile environments from short segments using speaker and language recognizers. In: The Speaker and Language Recognition Workshop, pp. 301–311 (2012)
  13. Lee, K., Slaney, M.: Automatic chord recognition from audio using a HMM with supervised learning. In: ISMIR, pp. 133–137 (2006)
  14. Lu, H., Pan, W., Lane, N.D., Choudhury, T., Campbell, A.T.: Soundsense: scalable sound sensing for people-centric applications on mobile phones. In: Proceedings of the 7th international conference on Mobile systems, applications, and services, pp. 165–178. ACM (2009)
  15. Ntalampiras, S.: Universal background modeling for acoustic surveillance of urban traffic. Digital Signal Process. 31, 69–78 (2014)
  16. Ntalampiras, S., Potamitis, I., Fakotakis, N.: Exploiting temporal feature integration for generalized sound recognition. EURASIP J. Adv. Sig. Process. 2009(1), 807162 (2009)
  17. Patsis, Y., Verhelst, W.: A speech/music/silence/garbage/classifier for searching and indexing broadcast news material. In: 19th International Workshop on Database and Expert Systems Application, DEXA 2008, pp. 585–589. IEEE (2008)
  18. Salamon, J., Jacoby, C., Bello, J.P.: A dataset and taxonomy for urban sound research. In: Proceedings of the ACM International Conference on Multimedia, pp. 1041–1044. ACM (2014)
  19. Slaney, M.: Auditory toolbox. Interval Research Corporation. Technical report vol. 10 (1998)
  20. Smith, J.W., Pijanowski, B.C.: Human and policy dimensions of soundscape ecology. Global Environ. Change 28, 63–74 (2014)
  21. Torija, A., Diego, P.R., Ramos-Ridao, A.: Ann-based m events. a too against envi environment (2011)
  22. Tran, H.D., Li, H.: Sound event recognition with probabilistic distance SVMs. IEEE Trans. Audio Speech Lang. Process. 19(6), 1556–1568 (2011)
  23. Valero, X., Alías, F., Oldoni, D., Botteldooren, D.: Support vector machines and self-organizing maps for the recognition of sound events in urban soundscapes. In: 41st International Congress and Exposition on Noise Control Engineering (Inter-Noise-2012). Institute of Noise Control Engineering (2012)
  24. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann (2005)

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Theodoros Theodorou (1)
  • Iosif Mporas (1, 2)
  • Nikos Fakotakis (1)
  1. Artificial Intelligent Group, Wire Communication Laboratory, Department of Electrical and Computer Engineering, University of Patras, Rion-Patras, Greece
  2. Computer and Informatics Engineering Department, Technological Educational Institute of Western Greece, Antirio, Greece
