1 Introduction

As the world becomes increasingly connected, interest in technologies that make cities more intelligent and sustainable continues to grow [32]. Machine learning plays a crucial role in this development by enabling systems to learn from the data collected by the many sensors installed throughout a city [2].

In the last decade, deep learning models have become a popular choice for several tasks in computer vision and audio processing, such as object detection, image segmentation, and audio and image classification. These models can learn complex features from data, and their ability to generalize to unseen data has been demonstrated in several real-world applications. As the field keeps pushing toward larger and more capable models, interest in leveraging the knowledge learned by these models to solve real-world problems has grown accordingly. However, in many applications these models must be deployed in the field on edge devices such as smartphones or IoT devices, which are limited in computational power and energy budget. To address these challenges, several model compression techniques have been developed that reduce the number of operations, the complexity of the operations, and the memory required to run these models, while preserving their accuracy. Among them are model pruning, quantization, low-rank factorization of weight matrices, and knowledge distillation.

The project in which this work is framed aims to develop a technological urban infrastructure based on street luminaires, which will serve as the backbone for the implementation of smart-city technologies, with a special focus on mobility. This work presents the part of the project concerned with developing an ML model capable of detecting the presence of certain objects using audio signals. Within the project, objects such as pedestrians, cyclists, and vehicles are identified using both visual and audio signals, with the models running on edge devices integrated within the luminaires. Moreover, the task considered here is multi-label classification, which is of great value for some of the planned applications, such as traffic control: the system can handle the very common case in which multiple objects, possibly of different classes, are present in a single audio clip. We first present the dataset used for the development of the model. Then, we present the model architecture, which is based on the small ResNet backbone of an object detection model for visual inputs developed in the context of the same project [1]. To further improve accuracy and decrease computational cost, we describe how we used knowledge distillation to pretrain the tiny version of the ResNet. Finally, we present the results obtained with the model and discuss the challenges and limitations of applying it to real-world data.

2 Related work

Environmental audio classification has been a topic of extensive research in recent years, with several studies proposing novel approaches to the task. Most early deep learning work approached the problem with convolutional neural networks (CNNs). One major distinction between approaches is the type of feature used as input to the CNN. Among the most commonly used features are raw waveforms [20, 28], log mel-spectrograms [11], log spectrograms [9, 15], gammatone frequency cepstral coefficients (GFCC), constant-Q transform (CQT) features, chromagrams, and combinations of these features [37, 45].

Since the audio signal is a time series, it is also possible to use recurrent neural networks (RNNs) to process it, either directly on the waveform [41] or on log mel-spectrograms of the audio [45]. More recently, several studies have explored transformers instead of CNNs [6, 7, 14], achieving state-of-the-art results on several benchmarks. However, although some work has optimized the computational cost of these models to make them suitable for edge devices [24, 26], they are still not as fast as CNNs on many devices, which lack dedicated operations for the transformer architecture. Moreover, given the more recent nature of these models, the number of published works using them is still relatively small compared to CNNs, which can make their adoption in a time-constrained project more challenging.

To develop capable models with high accuracy while keeping the computational cost low, several studies have explored model compression techniques and small architectures. AclNet [20] is a small CNN architecture capable of achieving state-of-the-art results on the ESC-50 dataset by using waveform augmentations and mixup. Cui et al. [9] used knowledge distillation to transfer the knowledge learned by an ESResNet [15] to a modified MobileNetV2 [35] model. Mohaimenuzzaman et al. [28] proposed a network called ACDNet together with a model compression pipeline based on pruning and quantization, which achieves near state-of-the-art performance while being able to run on edge devices. RACNN [11] uses a Resource Adaptive Convolutional (RAC) module to reduce the computational cost of the model while maintaining classification accuracy.

In other areas of machine learning, such as computer vision and natural language processing, the development of architectures designed to run on computationally limited devices has been a topic of extensive research, with architectures such as MobileNet [19], YOLO [34], and SqueezeNet [21], among others. Moreover, several methods have been proposed to derive models capable of running on edge devices through model compression techniques such as pruning [16, 17, 44], quantization [16, 17], weight matrix decomposition [5], and knowledge distillation [4].

In this work, we put some of these techniques to the test in a real-world scenario. Inspired by the work of Palanisamy et al. [30], we explore reusing convolutional neural networks pretrained on computer vision tasks, which not only boosts the performance of the audio classification model but also reduces deployment time. Additionally, we employ knowledge distillation to transfer the knowledge learned by a large state-of-the-art model to our small model. Most importantly, we test these methods in a real-world deployment scenario, where we demonstrate that performance on popular benchmarks [33] does not directly translate to a real-world setting.

3 Data

3.1 Data preparation

In this work we aimed to develop an audio classification model capable of detecting several classes of objects, namely “Person,” “Bus,” “Siren,” “Car,” “Truck,” “Motorcycle,” and “Bicycle”. The goal is to detect the vast majority of objects that transit a street environment and that were defined as relevant to the project. From the thousands of audio samples recorded by the luminaires, we labelled 1688, of which the last 500 were collected and labelled later in the project. These newer samples had the advantage of being recorded synchronously with video, which allowed us to improve the quality of the annotations by using not only the audio but also the video frames and the respective predictions generated by a YOLO model. Nevertheless, the quantity and quality of the audio data produced by the early prototypes of the luminaires, also developed within the project, were limited. To address this limitation, we combined the data captured with the luminaires with publicly available datasets: (i) ESC-50 [33], a multi-class collection of environmental audio recordings with 50 classes from five categories, namely animals, natural soundscapes, human sounds, domestic sounds, and urban sounds; (ii) FSD50K [12], a dataset of audio clips comprising 200 classes drawn from the AudioSet Ontology; and (iii) the DCASE 2017 Task 4 strong-label testing set [27], a dataset of audio clips extracted from YouTube videos comprising 16 classes usually found in a street environment and also drawn from the AudioSet Ontology. These datasets, except for ESC-50, were designed for multi-label classification, which is also the setting of our audio model.

Compiling a dataset from these multiple sources presented several challenges. Firstly, some audio samples did not have the expected length of 10 s. Secondly, the audio samples used different sample rates. Finally, the public datasets used different sets of classes. We solved the first two problems by normalizing all audio samples to 10 s, padding short clips and clipping long ones around the center, and by resampling all audio to 22.05 kHz. For the third problem, we manually created a mapping between the classes present in the public datasets and the classes of interest for this project (see Tables 4 and 5).
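As a minimal illustration of this step, assuming torchaudio is available, the length and sample-rate normalization could look like the following sketch (the mono mixdown and the symmetric padding are our assumptions):

```python
import torch
import torchaudio

TARGET_SR = 22050             # 22.05 kHz, the sample rate used throughout this work
TARGET_LEN = 10 * TARGET_SR   # 10 s clips

def load_and_normalize(path: str) -> torch.Tensor:
    """Load an audio file, resample it to 22.05 kHz, and force a 10 s duration
    by zero-padding short clips and center-cropping long ones."""
    waveform, sr = torchaudio.load(path)          # (channels, samples)
    waveform = waveform.mean(dim=0)               # mixdown to mono (assumption)
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)
    n = waveform.shape[0]
    if n < TARGET_LEN:                            # pad (here symmetrically)
        pad = TARGET_LEN - n
        waveform = torch.nn.functional.pad(waveform, (pad // 2, pad - pad // 2))
    elif n > TARGET_LEN:                          # crop around the center
        start = (n - TARGET_LEN) // 2
        waveform = waveform[start:start + TARGET_LEN]
    return waveform
```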

Our final dataset consists of 12,961 audio samples with 36 h of sound in total. Table 1 shows the composition of the final dataset, and Fig. 1 shows the distribution of the audio samples per class in the final dataset.

Table 1 Composition of the final dataset
Fig. 1 Distribution of the audio samples per class in the final dataset

3.2 Audio visualization

Figure 2a shows the waveforms of three audio samples containing the class “Motorcycle”, one from each of three different datasets. The figure illustrates the level of noise and the high signal amplitude of the audio samples from the luminaires, especially when compared to the samples from the public datasets. In turn, Fig. 2b shows the mel spectrograms of the same waveforms. The spectrograms of the samples from the public datasets are clearer, with better-defined high-amplitude lines, than the one from the luminaires. Moreover, the spectrogram of the luminaire sample shows static noise across all frequency bins, most prominently in the mel bins corresponding to the lower frequencies. This is confirmed by Fig. 2c, where we computed the median absolute deviation of the spectrogram across each frequency bin as an estimate of the noise level. The noise level is higher in the samples from the luminaires, especially at lower frequencies. Furthermore, we also compared audio from early versions of the luminaires, which contains a higher level of noise.

Fig. 2 Visualizations of the audio samples from the different datasets
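The noise estimate in Fig. 2c boils down to a per-bin median absolute deviation; a minimal sketch, assuming the spectrogram is a NumPy array with mel bins as rows and time frames as columns:

```python
import numpy as np

def per_band_noise_estimate(log_mel: np.ndarray) -> np.ndarray:
    """Rough noise estimate: the median absolute deviation of each frequency
    bin over time (rows = mel bins, columns = time frames)."""
    median = np.median(log_mel, axis=1, keepdims=True)
    mad = np.median(np.abs(log_mel - median), axis=1)
    return mad  # one value per bin; higher values indicate more variation/noise
```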

Even though additional work can be done to improve the quality of the data collected from the luminaires, the spectrograms, which will be used as input to the model, show that the data collected by the luminaires differs from most of the audio found in the public datasets. Nevertheless, given the limited amount of data from the luminaires and its poor class balance, these public datasets remain of great value for training the audio model.

Additionally, while some limitations of the data collected with the luminaires, such as noise, might have been partially mitigated by noise cancellation techniques, we decided to keep the data as originally collected. It was recorded in real-world conditions and is therefore representative of the audio the model will be exposed to when deployed.

4 Model

This section describes the different models we used for the multi-label audio classification task. All models were developed and trained using PyTorch 1.10.1 [31], CUDA 11 [29], and Python 3.9 [39], on a single NVIDIA V100S GPU and an Intel Xeon Silver 4214R CPU.

To keep the development and training of the audio models consistent with the environment used for the other models being developed for the project, we chose to write the model in PyTorch and reuse parts of the existing codebase. Following previous work [30], which explores reusing CNNs developed for image classification in audio classification tasks, we adopted the ResNet backbone of an object detection model, a convolutional neural network with a 3-channel input, to which we added a classification block. We started with the YOLOv4-CSP [42] backbone and later switched to the smaller and faster backbone of the YOLOv7-tiny [43] model. Although using CNN backbones for audio classification is common, most previous work considered single-channel inputs, while our backbones require three channels. To overcome this, we transform each 10 s waveform into three different mel spectrograms, each with 128 mel bands but with different window and hop sizes ((551, 220), (1102, 551), (2205, 1102)) for the short-time Fourier transform. The width of the spectrograms is normalized to 250, which we found to perform similarly to full-size spectrograms while being significantly faster to compute. With this approach, we go from a 10 s waveform with 220,500 float32 values (4 bytes each) to a spectrogram of shape (3, 128, 250), also stored as float32. Finally, we take the log of the mel spectrogram values and normalize them using a mean of 4.5 and a standard deviation of 5.0.
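A sketch of this input pipeline with torchaudio, assuming the 10 s mono waveforms prepared in Sect. 3; the small epsilon before the log and the use of bilinear interpolation to normalize the width are our assumptions:

```python
import torch
import torchaudio

SR = 22050
# (window size, hop size) pairs for the three STFT resolutions described above
STFT_PARAMS = [(551, 220), (1102, 551), (2205, 1102)]
MEL_TRANSFORMS = [
    torchaudio.transforms.MelSpectrogram(
        sample_rate=SR, n_fft=win, win_length=win, hop_length=hop, n_mels=128)
    for win, hop in STFT_PARAMS
]

def waveform_to_input(waveform: torch.Tensor) -> torch.Tensor:
    """Turn a 10 s mono waveform into the 3-channel log-mel input of shape
    (3, 128, 250), normalized with mean 4.5 and standard deviation 5.0."""
    channels = []
    for mel in MEL_TRANSFORMS:
        spec = torch.log(mel(waveform) + 1e-6)                  # (128, frames)
        spec = torch.nn.functional.interpolate(
            spec[None, None], size=(128, 250), mode="bilinear",
            align_corners=False)[0, 0]                          # width -> 250
        channels.append(spec)
    x = torch.stack(channels)                                   # (3, 128, 250)
    return (x - 4.5) / 5.0
```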

We refer to the audio models as ResNet-tiny and ResNet; they are very similar, the only difference being the backbone used. Figure 3 shows the architecture of the ResNet-tiny model, whose classification block consists of an adaptive average pooling operation with a target output of 1x1, a flattening operation, a linear layer with 1024 units, a hard swish activation, a dropout layer with a probability of 0.2, and a linear layer with as many units as there are classes in the dataset. Since we are solving a multi-label classification task, we normalize the logits with the sigmoid function.
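The classification block maps onto a few standard PyTorch layers; a sketch, where `backbone_channels` stands in for the (unspecified) number of channels produced by the last backbone stage:

```python
import torch.nn as nn

NUM_CLASSES = 7  # person, bus, siren, car, truck, motorcycle, bicycle

class ClassificationHead(nn.Module):
    """Classification block added on top of the ResNet backbone."""
    def __init__(self, backbone_channels: int, num_classes: int = NUM_CLASSES):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),        # (B, C, H, W) -> (B, C, 1, 1)
            nn.Flatten(),                   # -> (B, C)
            nn.Linear(backbone_channels, 1024),
            nn.Hardswish(),
            nn.Dropout(p=0.2),
            nn.Linear(1024, num_classes),   # raw logits; sigmoid applied outside
        )

    def forward(self, features):
        return self.head(features)
```

The sigmoid can then be folded into the loss (`BCEWithLogitsLoss`) during training and applied explicitly with `torch.sigmoid` at inference time.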

To run a model on an edge device, we exported it to the open ONNX standard and compiled it with TensorRT [8]. Our early observations showed no performance differences between the two versions; unfortunately, we were unable to perform more systematic evaluations that would have allowed us to report a more confident result.
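The export step itself uses the standard PyTorch ONNX exporter; a sketch, assuming `model` holds the trained ResNet-tiny and that a single 3-channel log-mel spectrogram is the input:

```python
import torch

# Assuming `model` is the trained ResNet-tiny with its classification head.
model.eval()
dummy = torch.randn(1, 3, 128, 250)  # one 3-channel log-mel spectrogram
torch.onnx.export(
    model, dummy, "resnet_tiny_audio.onnx",
    input_names=["spectrogram"], output_names=["logits"],
    opset_version=13,
    dynamic_axes={"spectrogram": {0: "batch"}, "logits": {0: "batch"}},
)
# The resulting ONNX graph can then be compiled to a TensorRT engine, for
# example with the `trtexec --onnx=resnet_tiny_audio.onnx` tool shipped with TensorRT.
```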

Fig. 3 Our ResNet-tiny audio classification model, which uses the backbone of a YOLOv7-tiny

We compare the computational performance and memory usage of the models, as shown in Table 2. The ResNet-tiny model contains far fewer parameters, which results in a model that is significantly faster and uses less memory.

Table 2 Comparison of computational performance and memory usage for the sound classification models. GFLOPs is the number of floating-point operations required to process a spectrogram of shape (128, 256), the input shape used during training and inference. Inference time is the time taken to process a single spectrogram. Inferences per second is the number of spectrograms that can be processed per second; to measure it we used a batch size of 1 and 5000 batches. Memory usage is the sum of the memory used by the model parameters and buffers (non-optimizable tensors, such as batch normalization statistics)
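For reference, a sketch of how such timing and memory numbers can be measured; the exact input shape and warm-up handling are assumptions on our part:

```python
import time
import torch

@torch.no_grad()
def benchmark(model, n_batches: int = 5000):
    """Timing/memory sketch following the protocol described in Table 2:
    batch size 1 and 5000 batches."""
    model.eval()
    x = torch.randn(1, 3, 128, 256)               # one 3-channel spectrogram
    model(x)                                      # warm-up run
    start = time.perf_counter()
    for _ in range(n_batches):
        model(x)
    elapsed = time.perf_counter() - start
    mem_bytes = sum(p.numel() * p.element_size() for p in model.parameters()) \
        + sum(b.numel() * b.element_size() for b in model.buffers())
    return elapsed / n_batches, n_batches / elapsed, mem_bytes
```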

5 Pretraining with knowledge distillation

While the ResNet-tiny model is faster than the ResNet model, our experiments also showed lower accuracy, which is expected given its smaller size. To close the gap between the two, we pretrained the model with the ResNet-tiny backbone on the AudioSet dataset [13] using knowledge distillation [18]. This approach compresses the knowledge of a large model, or an ensemble of models, into a smaller model such as ResNet-tiny, and has been shown in previous work to improve the performance of smaller models. We follow a procedure proposed in previous work [36] and use an ensemble of PaSST models [23] with different strides, pretrained on AudioSet-2M, as the teacher. More specifically, we use three PaSST models with strides 10, 12, and 14 and a patch size of 16. Given storage constraints, instead of using the complete AudioSet-2M dataset for knowledge distillation, we only use the AudioSet balanced training set and the evaluation set, which together contain 33,679 audio samples.

We train the student model using two loss functions: the supervised loss (Eq. 2) and the knowledge distillation loss (Eq. 1), both binary cross-entropy losses over the C classes. The supervised loss compares the student predictions against the ground-truth labels, while the distillation loss compares the student predictions against the teacher predictions (with \(S_c\) and \(T_c\) denoting the student and teacher logits for class \(c\), and \(\sigma\) the sigmoid function). The total loss is a weighted combination of the two, as shown in Eq. 3, where we give a higher weight to the knowledge distillation loss with \(\alpha =0.8\).

$$\mathcal{L}_{kd} = -\frac{1}{C} \sum_{c=1}^{C} \Big[ \sigma(T_c) \cdot \log \sigma(S_c) + \big(1 - \sigma(T_c)\big) \cdot \log \big(1 - \sigma(S_c)\big) \Big]$$
(1)
$$\mathcal{L}_{sup} = -\frac{1}{C} \sum_{c=1}^{C} \Big[ y_c \cdot \log \sigma(S_c) + (1 - y_c) \cdot \log \big(1 - \sigma(S_c)\big) \Big]$$
(2)
$$\mathcal{L} = (1-\alpha)\, \mathcal{L}_{sup} + \alpha\, \mathcal{L}_{kd}$$
(3)
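In PyTorch, both terms reduce to `binary_cross_entropy_with_logits`, with the sigmoid of the teacher logits acting as a soft target; a minimal sketch of Eqs. 1-3:

```python
import torch
import torch.nn.functional as F

ALPHA = 0.8  # weight of the distillation term, as in Eq. 3

def distillation_loss(student_logits, teacher_logits, labels, alpha=ALPHA):
    """Combined loss: BCE of the student against the teacher's sigmoid outputs
    (L_kd, Eq. 1) and against the ground-truth multi-label targets (L_sup, Eq. 2)."""
    l_kd = F.binary_cross_entropy_with_logits(
        student_logits, torch.sigmoid(teacher_logits))    # Eq. 1
    l_sup = F.binary_cross_entropy_with_logits(
        student_logits, labels.float())                   # Eq. 2
    return (1 - alpha) * l_sup + alpha * l_kd             # Eq. 3
```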

During the training loop, we augment the waveform with a random horizontal translation, also known as rolling. To compute the spectrograms, we follow the procedure found in the source code of the PaSST model for the teacher, and the procedure presented in Sect. 4 for the student. Both spectrograms are then separately augmented with time and frequency masking. Finally, before feeding the spectrograms to the models, we apply the same random mixup to the teacher and student spectrograms [3].
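A sketch of these augmentations using torchaudio; the mask sizes below are illustrative and not the values used in the project:

```python
import torch
import torchaudio

def roll_waveform(wave: torch.Tensor) -> torch.Tensor:
    """Random horizontal translation ("rolling") of the waveform."""
    shift = int(torch.randint(0, wave.shape[-1], (1,)))
    return torch.roll(wave, shifts=shift, dims=-1)

# SpecAugment-style masking, applied independently to teacher and student inputs.
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=40)
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=16)

def mixup(spec_a, spec_b, target_a, target_b, lam: float):
    """Mixup of two spectrogram/target pairs; the same `lam` and pairing must
    be used for the teacher and the student spectrograms."""
    return lam * spec_a + (1 - lam) * spec_b, lam * target_a + (1 - lam) * target_b
```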

We pretrained the ResNet-tiny model using knowledge distillation for 10 epochs with a batch size of 64, mixup with \(\lambda =0.3\), and a learning rate schedule with an exponential warmup for 3 epochs followed by a linear decay. To optimize the model, we used the Adam optimizer [22] with a weight decay of 0.0. We named the resulting model ResNet-tiny-distilled.
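One way to express such a schedule with PyTorch's `LambdaLR`; the exact shape of the exponential ramp is our assumption:

```python
import math
import torch

def warmup_linear_decay(optimizer, warmup_epochs: int = 3, total_epochs: int = 10):
    """Exponential ramp-up over the first epochs, then a linear decay."""
    def factor(epoch):
        if epoch < warmup_epochs:
            # exponential ramp from roughly e^-5 up to 1.0 at the end of warmup
            return math.exp(5.0 * ((epoch + 1) / warmup_epochs - 1.0))
        remaining = max(1, total_epochs - warmup_epochs)
        return max(0.0, 1.0 - (epoch - warmup_epochs) / remaining)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=factor)
```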

6 Training

We use the same training procedure for the ResNet-tiny, ResNet-tiny-distilled, and ResNet models. Before training, we split the complete dataset into two folds using stratified sampling from each data source. At the start, the ResNet-tiny and ResNet models have a backbone pretrained on COCO 2017 [25], while ResNet-tiny-distilled is the model pretrained with knowledge distillation. All of them start with a randomly initialized classification block with an output dimension of 7, matching the number of classes of interest. Our complete training/evaluation procedure consists of training the model on the first fold and evaluating it on the second, and then training on the second fold and evaluating on the first. Each training run consists of 300 epochs with a batch size of 64 and a learning rate of \(1\times 10^{-5}\). To optimize the model, we employ the binary cross-entropy loss and the Adam optimizer with a weight decay of \(1\times 10^{-3}\). Finally, we apply a set of augmentations to the spectrograms, namely rolling, time masking, and frequency masking. To avoid the cost of repeatedly computing the spectrograms, and considering that each model is trained twice, we precompute the spectrograms and save them to disk. In the end, training the ResNet and ResNet-tiny models took around 11 h 30 min and 4 h 30 min, respectively.
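The caching step is straightforward; a sketch that reuses the `load_and_normalize` and `waveform_to_input` helpers from the earlier sketches and stores the inputs as `.npy` files (the file naming is our choice):

```python
import numpy as np
import torch
from pathlib import Path

def precompute_spectrograms(file_list, out_dir):
    """Cache the (3, 128, 250) log-mel inputs on disk so that the two training
    runs (one per fold) do not recompute them."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in file_list:
        spec = waveform_to_input(load_and_normalize(path))
        np.save(out_dir / (Path(path).stem + ".npy"), spec.numpy())

def load_precomputed(path):
    """At training time, loading a cached spectrogram replaces the STFT."""
    return torch.from_numpy(np.load(path))
```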

7 Experimental results

Several metrics exist to evaluate multi-label classification models, among them precision, recall, F1, and average precision. Precision is the ratio of true positives (TP) to the sum of true positives and false positives (FP). Recall is the ratio of true positives to the sum of true positives and false negatives (FN). F1 is the harmonic mean of precision and recall. Average precision is the area under the precision-recall curve, obtained by plotting precision against recall for different confidence thresholds. Both F1 and average precision balance precision and recall, and for that reason are the most commonly used. Since our task is multi-label classification, these metrics are computed per class and then averaged over all classes, using either the micro average or the macro average. The micro average sums the true positives, false positives, and false negatives over all classes before computing the metric; the macro average computes the metric for each class and then averages the results.
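With scikit-learn, the two averaging schemes differ only in the `average` argument; a small example with made-up predictions:

```python
import numpy as np
from sklearn.metrics import f1_score, average_precision_score

# y_true: multi-label indicator matrix; y_score: sigmoid outputs (illustrative values)
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_score = np.array([[0.9, 0.2, 0.7], [0.1, 0.8, 0.4], [0.6, 0.3, 0.2]])
y_pred = (y_score >= 0.5).astype(int)                   # fixed confidence threshold

macro_f1 = f1_score(y_true, y_pred, average="macro")    # metric per class, then mean
micro_f1 = f1_score(y_true, y_pred, average="micro")    # pool TP/FP/FN over classes
macro_ap = average_precision_score(y_true, y_score, average="macro")
```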

To assess the audio models, we use the F1 score as the main metric, macro-averaged over all classes. The reported results are the cross-validation scores of the models over the two folds of the dataset presented earlier.

In Fig. 4, we can observe that the ResNet-tiny-distilled model is the best-performing model in macro-averaged F1 over all the classes, and also for most individual classes. All models perform well for the classes that are well represented in the dataset, such as “person” and “car”. For the less represented classes, we observe cases of poor performance, such as “truck”, “bus”, and “motorcycle”.

Fig. 4 Audio models' F1 score per class and macro average over all classes

In Fig. 5, we show the same results when testing only with audio samples recorded by the early prototypes of the luminaires. The performance of the models is much lower than when testing with the full dataset. This shows the difficulty of the task and, in part, the limitations of using audio to detect the presence of some classes of objects. However, the poor performance of ResNet-tiny-distilled for some classes can be explained by the fact that those classes are severely under-represented in the data captured from the luminaires. For example, in the training dataset the classes “person”, “bicycle”, “siren”, and “bus” are present in only 69, 18, 1, and 24 audio samples, respectively, a very small fraction of the 1688 luminaire audio clips in the dataset.

Fig. 5 F1 score per class and macro average over all classes, when testing only with the audio samples recorded from the luminaires

8 Benchmarking

To ground our results in previous work, we compare our model with state-of-the-art models on the ESC-50 dataset [33]. ESC-50 is a dataset of 2000 audio samples and 50 classes of environmental sounds for multi-class classification, frequently used to benchmark audio classification models. To evaluate a model, the dataset is split into five folds and the evaluation metric is the cross-validation accuracy over the five folds.

Our training procedure is very similar to the one described in Sect. 6, and we did not perform any hyperparameter search to further improve the results. For each fold, we train the model for 500 epochs with a batch size of 64 and a starting learning rate of \(1 \times 10^{-4}\) that is reduced by a factor of 10 at every third of the training, and we optimize the model using the cross-entropy loss and the Adam optimizer [22] with a weight decay of \(1 \times 10^{-3}\). Furthermore, we use the same augmentation procedure with random horizontal translation, random frequency masking, and random temporal masking, plus a random mixup with \(\lambda =0.2\). Unlike most of the models in the literature, we did not use a sample rate of 44.1 kHz; instead, we resampled the clips to 22.05 kHz, the sample rate we are using with the real-world data.
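Assuming `model` is the audio network with a 50-way output for ESC-50, this step schedule can be written with `MultiStepLR`; the concrete milestones reflect our reading of "every third of the training":

```python
import torch

# 500 epochs, with the learning rate divided by 10 at one third and two thirds
# of training (epochs 167 and 333); `model` is assumed to be already defined.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[167, 333], gamma=0.1)
```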

In Table 3 we compare our models against the state-of-the-art. As we can observe, our model ResNet-tiny-distilled achieves a performance comparable to some of the state-of-the-art models. Except for the ACDNet model, all the other alternatives have a much higher number of parameters and in some cases perform worse in terms of accuracy.

The most comparable model to our ResNet-tiny-distilled is the ACDNet model [28] which is a convolutional neural network designed for environmental sound classification on edge devices. The ACDNet achieves slightly better performance (+1.5%) while using a comparable number of parameters.

We find these results very promising, as they show that our model can achieve good performance on a standard audio classification task even though its backbone was not designed for this type of task, unlike, for example, the ACDNet. Taking into account the results with real-world data in Sect. 7, we believe that most of the focus in future work should be on improving data capture and data augmentation techniques.

Table 3 Comparison of the performance of different sound models on ESC-50

9 Discussion and conclusions

This work presented an audio classification model developed to detect events in a street environment using sound.

Our results show that the ResNet-tiny-distilled model, which uses as its backbone a ResNet from an object detection model for visual inputs, is the best-performing model even when testing with audio recorded by the luminaires, surpassing the performance of the larger ResNet model. However, the performance of the model is still fairly low when evaluated on real-world data. We see two main reasons for this. The first is that the data collected with the luminaires is still very noisy, despite the effort to improve audio quality; factors such as the distance to the vehicles, wind, the number of vehicles, and other environmental noise result in a much more complex signal than the data from the public datasets, making it difficult for the model to generalize. The second is the class imbalance, which is particularly pronounced in the data from the luminaires.

Future work will need to focus on improving the quantity and quality of the data by collecting more samples and tuning the microphones in the luminaires. Nevertheless, our current results confirm that the task at hand is very challenging and that there are limitations to using audio to detect the presence of some classes of objects. For example, microphones do not reliably detect objects that produce little sound (e.g. a person or a bicycle). Also, as we note in Sect. 3.1, annotating the audio recorded from the luminaires is very difficult even for humans, and the use of synchronized video contributes immensely to the quality of the annotations. Despite the difficulty of detecting some classes, during the process of collecting and annotating more data we observed a continuous improvement in the performance of the models, which leads us to believe that there is still performance to gain by improving the quantity and quality of the data. Additionally, classes like “siren”, which are easy to detect in audio recordings, are by contrast very rare in the real world, which makes capturing audio containing them challenging and affects the performance of the models.

We hope that our work can serve as a starting point for future work in this area. The use of audio to detect events in a street environment is a promising area of research, and much can be done to improve the performance of the models. While the existing public datasets are a good starting point, we show that the development of new datasets with data more representative of real-world conditions is crucial to obtaining capable models. Additionally, we hope that current and future efforts in this domain [40] can contribute to the construction of such datasets. Finally, as the field develops larger and more capable models, we believe that improving model compression techniques such as knowledge distillation is of great value.