Detection and localization of selected acoustic events in acoustic field for smart surveillance applications
A method for automatic determination of position of chosen sound events such as speech signals and impulse sounds in 3-dimensional space is presented. The events are localized in the presence of sound reflections employing acoustic vector sensors. Human voice and impulsive sounds are detected using adaptive detectors based on modified peak-valley difference (PVD) parameter and sound pressure level. Localization based on signals from the multichannel acoustic vector probe is performed upon the detection. The described algorithms can be employed in surveillance systems to monitor behavior of public events participants. The results can be used to detect sound source position in real time or to calculate the spatial distribution of sound energy in the environment. Moreover, the spatial filtration can be performed to separate sounds arriving from a chosen direction.
Keywords: Acoustic event detection, Sound localization, Audio surveillance
The paper addresses the problem of detecting and localizing selected acoustic events in a 3-dimensional acoustic field. Known solutions for localizing acoustic events in most cases use a microphone (pressure sensor) array and are limited to calculating the acoustic wave direction of arrival (DoA) [14, 17]. In this work a novel approach is introduced that employs the acoustic vector sensor (AVS), which makes it possible to calculate not only the direction of arrival, but also the exact position of the sound source in 3D space. The AVS comprises a pressure sensor and 3 orthogonally placed air particle velocity sensors (vx, vy, vz) [3, 9]. The multichannel output of the probe allows the direction of arrival of the acoustic wave front to be calculated with only one sensor, without the use of any microphone array. The proposed technology is a part of the developed automatic surveillance system. Detecting and localizing acoustic events is particularly useful in monitoring public events, such as sports events or conventions, in order to detect potential security threats. The described setup of the demonstration system was installed in a lecture hall at the Gdansk University of Technology.
In order to evaluate the system’s ability to detect potentially hazardous events a series of measurements was conducted. Two setups of the applied measurement system are presented. The preliminary setup was used for evaluating the detection and the localization of individual sound sources in some selected regions of the audience. Next, some more precise measurements were carried out to evaluate the spatial resolution of the sound localization algorithm. Finally, the errors in the position calculation were analyzed and adequate correcting functions were established to enhance the spatial accuracy.
The remaining part of the paper is organized as follows. Sections 2 and 3 describe the employed signal processing methods: Section 2 is devoted to the detection algorithms, whereas Section 3 describes how the position of the sound source is calculated. Section 4 presents the results of the measurements conducted during the preliminary tests and for the fixed setup. It also contains the accuracy analysis and the methods for reducing the localization error. Finally, the usability and further development of the system are described in Section 5.
2 Detection of acoustic events
The goal of the proposed system is to localize selected acoustic events, i.e. speech signals and impulsive sounds, in 3D acoustic space. Therefore, an algorithm for detecting such events had to be developed. The engineered system employs separate algorithms for the two types of signals. Speech sounds are detected using an adaptive threshold algorithm based on the modified peak-valley difference (PVD) parameter. The detection of impulsive sounds also uses an adaptive threshold; however, the parameter used for detection is simpler, namely the equivalent sound pressure level.
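The adaptive-threshold idea for impulsive sounds can be sketched as follows. This is only an illustration under stated assumptions, not the detector used in the system: the background level is tracked with an exponential average, and the names `margin_db` and `alpha` as well as their values are hypothetical. Samples are assumed to be calibrated sound pressure values in pascals.

```python
import math

P_REF = 20e-6  # standard reference sound pressure for dB SPL, in pascals


def frame_level_db(frame):
    """Equivalent sound pressure level (dB SPL) of one frame of samples in Pa."""
    mean_square = sum(s * s for s in frame) / len(frame)
    # Guard against log10(0) for an all-zero frame.
    return 10.0 * math.log10(max(mean_square, 1e-20) / P_REF ** 2)


def detect_impulses(frames, margin_db=15.0, alpha=0.05):
    """Return indices of frames whose level exceeds background + margin_db.

    margin_db and alpha are illustrative assumptions, not values from the paper.
    """
    background_db = None
    detected = []
    for index, frame in enumerate(frames):
        level_db = frame_level_db(frame)
        if background_db is None:
            # The first frame only initializes the background estimate.
            background_db = level_db
            continue
        if level_db > background_db + margin_db:
            detected.append(index)
        else:
            # Adapt the background estimate only when no event is present.
            background_db = (1.0 - alpha) * background_db + alpha * level_db
    return detected
```

Freezing the background estimate during a detected event keeps a loud impulse from raising the threshold for the frames that follow it.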
2.1 Detection of speech signals
in speech processing the sampling rate is assumed to be equal to 22050 samples per second (S/s), since it covers the significant bandwidth for speech analysis. In the present application the signal is sampled at 48000 S/s, which is a standard sampling rate for measurements of acoustic pressure with the environmental microphones employed in the smart surveillance system;
the localization frame covers 4096 samples, whereas most speech processing applications use shorter frames;
the 4096-point Discrete Fourier Transform (DFT) representation of the signal is used to find the spectral peaks of the sound;
the distribution of peaks and valleys in the spectrum of speech signals depends on the fundamental frequency of speech;
in classic PVD detection the model of peak distribution in vowels needs to be established before calculating the parameter.
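For orientation, the classic peak-valley difference can be sketched as the mean spectral magnitude at the model's peak bins minus the mean magnitude at the remaining (valley) bins. The sketch below follows that commonly cited classic form and does not reproduce the modified parameter used in this work. The harmonic model built by `harmonic_mask`, the assumed fundamental frequency and the 4 kHz band limit are illustrative assumptions.

```python
def harmonic_mask(f0_hz, fs_hz=48000, n_fft=4096, f_max_hz=4000.0):
    """Binary peak model: 1 at bins nearest to harmonics of f0_hz, else 0.

    The mask covers only the bins from 0 Hz up to f_max_hz (an assumed band).
    """
    n_bins = int(f_max_hz * n_fft / fs_hz) + 1
    mask = [0] * n_bins
    k = 1
    while k * f0_hz <= f_max_hz:
        bin_index = round(k * f0_hz * n_fft / fs_hz)
        if bin_index < n_bins:
            mask[bin_index] = 1
        k += 1
    return mask


def peak_valley_difference(magnitude, mask):
    """Classic PVD: mean magnitude at peak bins minus mean at valley bins.

    magnitude is a DFT magnitude spectrum; only its first len(mask) bins
    are compared, because zip() stops at the shorter sequence.
    """
    peaks = [m for m, is_peak in zip(magnitude, mask) if is_peak]
    valleys = [m for m, is_peak in zip(magnitude, mask) if not is_peak]
    if not peaks or not valleys:
        raise ValueError("mask must contain both peak and valley bins")
    return sum(peaks) / len(peaks) - sum(valleys) / len(valleys)
```

A spectrum with strong harmonics at the modelled positions gives a large positive value, while a flat, noise-like spectrum gives a value close to zero, which is what makes the parameter usable with a threshold.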
2.2 Detection of impulsive sounds
3 Localization of sound sources
The large cuboid represents the shape of the interior, whereas the rhomboid represents the floor plane. The AVS is placed in the room above the floor, at the spot marked by the empty red dot. The black dotted line corresponds to the height at which the AVS is placed. The two intersecting blue lines mark the point at which the line through the AVS perpendicular to the floor meets the floor plane. The full dot marks the position of the sound source inside the interior. The vector I of the intensity of the acoustic field, calculated in the xyz space, has the direction of the arrow. The coordinate system, with its origin at the location of the AVS, is also drawn. The intersection of the direction of the intensity vector with the floor plane indicates the position of the sound source. The location of the sound source is expressed by a set of coordinates (x, y, z).
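The geometric construction described above can be sketched in a few lines. The sketch assumes that the intensity components are obtained by time-averaging the product of the pressure signal and each particle velocity signal, that the averaged vector points along the direction of propagation (away from the source, so the source is sought along the opposite direction), and that the floor is a horizontal plane at z = -h in the sensor-centred coordinate system. The function names are hypothetical.

```python
import math


def intensity_vector(p, vx, vy, vz):
    """Time-averaged intensity components (Ix, Iy, Iz) from AVS signals."""
    n = len(p)
    return tuple(sum(pi * vi for pi, vi in zip(p, v)) / n for v in (vx, vy, vz))


def direction_angles_deg(intensity):
    """Azimuth and elevation (degrees) of the direction towards the source."""
    dx, dy, dz = (-c for c in intensity)  # look back along the propagation
    azimuth = math.degrees(math.atan2(dy, dx))
    elevation = math.degrees(math.atan2(dz, math.hypot(dx, dy)))
    return azimuth, elevation


def locate_on_floor(intensity, sensor_height_m):
    """Intersect the source direction with the floor plane z = -sensor_height_m.

    Returns (x, y, z) relative to the sensor, or None when the direction
    does not point down towards the floor.
    """
    dx, dy, dz = (-c for c in intensity)
    if dz >= 0:
        return None
    t = sensor_height_m / -dz  # ray parameter at which z reaches the floor
    return (t * dx, t * dy, -sensor_height_m)
```

Because only the direction of the intensity vector matters for the intersection, the absolute calibration of the velocity channels cancels out as long as the three channels share the same gain.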
The measurements comprised the following stages. First, a preliminary measurement system was set up in the lecture hall in order to evaluate the proposed method of calculating the sound source position. The measuring system consisted of a USP regular probe (multichannel acoustic vector sensor), an MFSC-4 conditioning module, connection cables, an ESI Maya44 USB multichannel sound card and an ASUS B50A laptop. The preliminary stage involved 6 positions of a selected sound source in the audience. Next, full measurements were conducted, covering the whole area of the lecture hall. Finally, the error of the calculated sound source position was determined for every seat and a correction procedure was introduced to increase the accuracy.
4.1 Preliminary measurement setup
The emitted signals included speech sounds (a male voice counting from 1 to 10) and impulsive sounds (shots from a signal pistol). Signals registered during the experiment with the described measurement system were analyzed using the aforementioned audio signal processing algorithms. Some example results of detection and localization of impulsive sound sources are presented in subsequent sections. The output of the speech detection algorithm is shown. Based on the results of the detection of speech sounds, an effort was made to localize the position of the speaker.
4.2 Preliminary detection results
A total of 18 shots were emitted, namely 3 shots from each of the sound source positions 1–6. The dashed line on the top plot indicates the 75 dB threshold of detection of impulsive sounds. No additional noise was emitted, so the recording conditions can be considered quiet. The detection of impulsive sounds in this short sample was 100 % correct. The detection of speech was not assessed, since some overlap was present in the words uttered by the speakers. The evaluation of detection error, however, is outside the scope of this work, in which detection serves as a preliminary operation before the calculation of the sound source position.
4.3 Preliminary localization results
The root mean squared angle error (RMSE) indicator was used for evaluating the presented algorithm. The computed values of RMSE for impulsive sounds were equal to 8.0° with filtration and 24.4° without filtration. For speech signals these values were equal to 39.1° and 42.5°, respectively. More information about sensitivity and accuracy can be found in our previous papers [2, 10]. Only the dominant sound source could be localized at a given time against a typical acoustic background. When more than one sound source produced acoustic energy simultaneously, determining their positions was very difficult, because the acoustic energy produced by each source affects the resulting intensity vector. This is a limitation of the described sound source localization algorithm. A quite different approach to localizing multiple sound sources in real time using the acoustic vector sensor was presented in [5, 6]. There, the direction of arrival (DOA) for the considered source was determined with the sound intensity method supported by Fourier analysis. The spectral components obtained for the considered signal made it possible to determine the DOA for each frequency independently. Such an approach can be applied to the localization of multiple different sound sources.
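A root mean squared angle error of this kind can be computed as in the sketch below. The exact definition used for the reported values is not given in the text, so the wrapping of angle differences into the range from -180° to 180° and the function names are assumptions.

```python
import math


def angular_difference_deg(estimated, reference):
    """Signed difference between two angles, wrapped to [-180, 180) degrees."""
    return (estimated - reference + 180.0) % 360.0 - 180.0


def rms_angle_error_deg(estimated_angles, reference_angles):
    """Root mean squared angle error over paired estimates and references."""
    diffs = [
        angular_difference_deg(e, r)
        for e, r in zip(estimated_angles, reference_angles)
    ]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))
```

Wrapping matters for azimuth: without it, an estimate of 350° for a source at 10° would count as an error of 340° instead of 20°.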
4.4 Fixed measurement setup
From each seat in the lecture hall 5 bursts of acoustic energy were emitted (the sound of a signal pistol being fired). The result of the sound source localization operation is the direction of the incoming sound (azimuth and elevation) and the position of the sound source in the audience (x, y, z coordinates). The results of measurements of the error occurring during the calculation of the position of the sound source were presented in related work. The errors of the calculated x and y coordinates, as well as of the azimuth and elevation angles, were depicted on the layout of the room. It was shown that the system is prone to errors which might occur for various reasons related to sound propagation, imperfections of the model and the shape of the room. Thus a calibration procedure was introduced to correct the errors of this algorithm. In the next section the proposed correction functions are described.
4.5 Correction functions
A comparison of results obtained in the experiment described in Section 4.4 with the Ground Truth values derived from the architectural plans of the building led to forming calibration functions to correct the computed acoustic wave direction of arrival.
On the basis of real acoustic calibration data and Ground Truth values, a detailed evaluation of the localization accuracy was carried out. Several kinds of errors were taken into account: absolute error versus x coordinate, absolute error versus y coordinate, absolute error versus azimuth angle and absolute error versus elevation angle. Their distributions are presented in Figs. 12 to 15, respectively. At the beginning no correction was applied. The error values were high, especially for the left part of the room. The obtained error results were analyzed to find the relation between the position of the sound source inside the room and the localization error for that position.
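As an illustration of how such a relation can be turned into a correction, the sketch below fits a straight line that maps measured coordinates to Ground Truth coordinates by least squares and then applies it to new measurements. The linear form is an assumption made for illustration only; the actual correction functions established in this work are not reproduced here, and the function names are hypothetical.

```python
def fit_linear_correction(measured, ground_truth):
    """Least-squares fit of ground_truth ~ a * measured + b; returns (a, b)."""
    n = len(measured)
    mean_m = sum(measured) / n
    mean_g = sum(ground_truth) / n
    s_mm = sum((m - mean_m) ** 2 for m in measured)
    if s_mm == 0:
        raise ValueError("measured values must not all be identical")
    s_mg = sum((m - mean_m) * (g - mean_g) for m, g in zip(measured, ground_truth))
    a = s_mg / s_mm
    b = mean_g - a * mean_m
    return a, b


def apply_correction(value, coefficients):
    """Apply a fitted linear correction to one measured coordinate or angle."""
    a, b = coefficients
    return a * value + b
```

In practice a separate fit would be made for each corrected quantity (x, y, azimuth, elevation), and the fit should be checked on seats that were not used for calibration.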
Practical experiments with the proposed correction methodology indicated greater accuracy of the sound source localization. People present inside the room can raise the level of background noise and can influence the acoustic conditions inside the room. With an audience present, the scattering of sound should increase and the number of reflections should decrease, which means that the difference between the direct and the reflected sound should increase. For that reason, the accuracy of the sound source localization could be the same as or greater than for the empty room, provided that the energy of the acoustic event exceeds the background noise by more than 10 dB, in which case the energy of the background noise can be neglected.
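The 10 dB criterion can be checked with a short calculation, assuming that the event and the background noise are uncorrelated so that their energies add. The symbols L_event, L_noise and Δ are introduced here only for this illustration.

```latex
L_{\mathrm{total}}
  = 10\log_{10}\!\left(10^{L_{\mathrm{event}}/10} + 10^{L_{\mathrm{noise}}/10}\right)
  = L_{\mathrm{event}} + 10\log_{10}\!\left(1 + 10^{-\Delta/10}\right),
\qquad \Delta = L_{\mathrm{event}} - L_{\mathrm{noise}}

\Delta = 10\ \mathrm{dB}:\quad
L_{\mathrm{total}} - L_{\mathrm{event}} = 10\log_{10}(1.1) \approx 0.41\ \mathrm{dB}
```

At a 10 dB difference the background noise carries one tenth of the event's energy and raises the total level by less than half a decibel.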
The presented method for detection and localization of acoustic events in 3D space was found to be adequate for identifying sound sources inside auditory halls. More generally, the results show that the spatial resolution is sufficient for the monitoring of public events. The proposed correction procedure is crucial for achieving this accuracy. Proper sound source localization in real time in the presence of reverberation was possible not only for impulsive sounds, but also for speech-related sound events. The sound source position was determined using a single 3D acoustic vector sensor, which is a novel solution when compared to traditional microphone arrays. The information about the sound source position is present in the rising edge of the sound wave. Therefore, proper detection of the wave front of the acoustic event is crucial for the accuracy of sound source localization.
The band-pass filtration significantly improved the localization of the sound source. The presented algorithm is based on sound intensity computation in the time domain. The broadband signal analysis can be disturbed by sound coming from other rooms (especially the low frequency components) and by numerous reflections of high frequency components. The band-pass filtration reduces the level of low and high frequency components and increases the difference between the direct and the reflected sound. For that reason the application of filtration improves the accuracy of the sound source localization.
The experimental results are promising as far as acoustic monitoring of the activity of people is concerned. The described solutions can be useful for surveillance systems monitoring the behavior of participants of public events. In the future more complex algorithms for localizing sound sources can be employed, e.g. ray tracing can be used to reduce errors related to acoustic wave reflections arriving from the walls of the interior.
The presented research is subsidized by the European Commission within FP7 project “INDECT” (Grant Agreement No. 218086).
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.