1 Introduction

Humans have the innate ability to recognize background sounds. Classification devices, in contrast, become unreliable when sounds of similar loudness are superimposed (Salomons et al., 2016). With the advancement of artificial intelligence and signal processing, it is now feasible to identify background sounds with devices, as shown in Fig. 1. However, efficient machine learning algorithms still need to be developed so that devices can recognize background sounds as effectively as humans (Yang & Krishnan, 2017).

A hidden Markov model (HMM)-based acoustic environment classifier has been described in Ma et al. (2006) to check the feasibility of acoustic environment classification on mobile devices. This model incorporates a hierarchical classification scheme and an adaptive learning mechanism (Ma et al., 2006). An environmental sound classification (ESC) method with a power-aware wearable sensor has been proposed in Zhan and Kuroda (2014). A low-complexity one-dimensional (1-D) Haar-like sound feature has been incorporated into the method with an HMM, resulting in an average accuracy of 96.9% (Zhan & Kuroda, 2014). A heterogeneous system of Deep Mixtures of Experts (DMoEs) has been proposed in Yang and Krishnan (2017) for classifying acoustic scenes using convolutional neural networks (CNNs). Here, each DMoE is a mixture of different convolutional layers weighted by a gating network (Yang & Krishnan, 2017). In Martín-Morató et al. (2018), the authors have proposed a nonlinear time-normalization-based event representation prior to mid-term statistics extraction for audio event classification. The short-term features are then represented by constant uniform-distance sampling over a defined space in order to reduce errors in the presence of noise (Martín-Morató et al., 2018). A novel CS-LBlock-Pat. model has been proposed in Okaba and Tuncer (2021) for efficient floor tracking of a speaker in a multi-storey building. The proposed model in Okaba and Tuncer (2021) has been evaluated on an ESC dataset collected from a multi-storey hospital with ten floors. Finally, a support vector machine (SVM) has been applied to the dataset, achieving an accuracy of 95.38% (Okaba & Tuncer, 2021).

A summary of different machine learning and deep learning methods for speech enhancement in the presence of environmental noise has been presented in Das et al. (2021). A survey of different methods for acoustic signal classification is presented in Chaki (2021). An implicit Wiener filter-based algorithm has been proposed in Jaiswal et al. (2022) for the enhancement of speech affected by stationary and non-stationary environmental noise. In Huang and Pun (2020), the authors have proposed a model based on an attention-enhanced DenseNet-BiLSTM network and segment-based Linear Filter Bank (LFB) features for detecting spoofing attacks in speaker verification. Initially, the silent segments are obtained from each speech signal using the short-term zero-crossing rate and energy. Then, the LFB features are extracted from the segments. Finally, an attention-enhanced DenseNet-BiLSTM architecture is built to mitigate the problem of overfitting (Huang & Pun, 2020). A hybrid approach has been proposed in Roy et al. (2016) to recognize complex daily activities and the ambience using body-worn as well as ambient sensors. A study has been carried out in Griffiths and Langdon (1968) on the acoustic environment at 14 sites in London by conducting interviews at these sites. This study has been carried out to check the acceptability of traffic noise in residential areas (Griffiths & Langdon, 1968). In Seker and Inik (2020), the authors have designed 150 different CNN-based models for ESC and tested their accuracy on the UrbanSound8K ESC dataset. It has been concluded in Seker and Inik (2020) that the CNN model achieves an accuracy of 82.5%, which is higher than its classical counterparts. The classical methods for integrating Mel-frequency cepstral coefficients (MFCCs) and the temporal-domain information of the audio signal may lead to the loss of essential information (Yang & Krishnan, 2017). Thus, in Yang and Krishnan (2017), the authors have characterized the latent information on the temporal dynamics by adopting the local binary pattern (LBP) tool. The LBP is used to encode the evolution process by considering the frame-level MFCC features as 2-D images. Finally, the obtained features are fed to the d3C classifier for ESC (Yang & Krishnan, 2017). Very short-time ESC based on spectrum pattern matching has been proposed in Khunarsal et al. (2013). A survey on the sonic environment has been presented in Minoura and Hiramatsu (1997).

In Sameh and Lachiri (2013), the authors have utilized Gabor filter-based spectrum features and proposed a robust ESC approach. The approach proposed in Sameh and Lachiri (2013) includes three methods. In the first two methods, the outputs of log-Gabor filters applied to spectrograms are averaged and then undergo a mutual information criterion-based optimal feature selection procedure. The third method processes only three patches extracted from each spectrogram (Sameh & Lachiri, 2013). A CNN method has been proposed in Permana et al. (2022) for the classification of bird sounds in order to identify forest fires. A lightweight method for automatic heart sound classification has been proposed in Li et al. (2021). This method requires only simple preprocessing of the heart sound data, from which time-frequency features are then extracted. Finally, the heart sounds are classified based on the fused features (Li et al., 2021).

Fig. 1 Background sound classification

However, none of the works in the literature has applied the synchrosqueezed wavelet transform for pre-processing of the data or considered a lightweight deep CNN (DCNN) model, which works well with image data. Motivated by this, a lightweight DCNN model for automatic background classification using speech signals is proposed in this paper. The major contributions of this work are as follows:

  • For the time-frequency analysis of speech signals, the synchrosqueezed wavelet transform (SWT) is introduced.

  • A lightweight DCNN architecture is proposed for automatic background classification using the SWT of speech signals.

  • The approach is assessed by taking into account various backgrounds such as airport, airplane, drone, street, babble, car, helicopter, exhibition, station, restaurant, and train sounds embedded in speech signals.

  • Through extensive simulations, we compare the performance of the proposed approach with the existing approaches in terms of accuracy for background classification.

  • The proposed model is also deployed on edge computing devices such as NVIDIA GeForce, Raspberry Pi, and NVIDIA Jetson to exhibit its real-time usage.

  • Finally, all considered models are deployed on edge computing devices to compare the performance in terms of inference time.

The remainder of the paper is organized as follows. Section 2 discusses the dataset details, preprocessing using the SWT, the convolutional neural network architecture, and the benchmark models considered in this work. Section 3 presents the comparison of the proposed model with the benchmark models in terms of 5-fold classification accuracy, model size, number of parameters, and inference time. Finally, Sect. 4 concludes the paper and outlines possible future work.

Fig. 2 a Oscillogram and b the corresponding wavelet synchrosqueezed transform at SNR = +5 dB

Fig. 3 a Oscillogram and b the corresponding wavelet synchrosqueezed transform at SNR = −5 dB

2 Proposed method

The proposed method for background sound classification consists of three stages: generation of speech signals with background noise, generation of time-frequency plots using the SWT, and classification of the background embedded in the speech signals using a DCNN. In the signal generation stage, we generate speech signals with embedded noise sources such as airport, airplane, drone, street, babble, car, helicopter, exhibition, station, restaurant, and train sounds. Following the generation of these signals, the SWT is used to generate time-frequency plots. These SWT plots are used in the classification stage to classify the background embedded in the speech signals using the DCNN.

2.1 Speech signal dataset

The background noise present in human speech is classified for all methods considered in this work. We use non-stationary noises such as airport, helicopter, airplane, drone, street, restaurant, babble, car, and train at SNRs of 10 dB, 7.5 dB, 5 dB, 2.5 dB, 0 dB, −2.5 dB, −5 dB, −7.5 dB, and −10 dB. Noisy samples from the Aurora dataset (Hirsch & Pearce, 2000) are used for the station and exhibition noises, which are taken from the noisy speech corpus NOIZEUS (Hu & Loizou, 2006). The NOIZEUS corpus is created from three female and three male speakers uttering phonetically balanced IEEE English sentences. The drone audio dataset (Al-Emadi et al., 2019) is used for the drone noise, while the helicopter and airplane noises are taken from the ESC dataset (Piczak, 2015). The speech samples are narrow-band, sampled at 8 kHz, and span 3 s. All samples are saved in WAV (16-bit PCM, mono) format for processing.
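As an illustration of this stage, the following minimal Python sketch embeds a background noise into a clean speech clip at a target SNR; the file names and the use of the soundfile package are assumptions for illustration, not the authors' original code.

```python
import numpy as np
import soundfile as sf   # assumed package for WAV I/O

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals `snr_db` dB."""
    noise = noise[: len(speech)]                       # trim noise to the speech length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12              # avoid division by zero
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

speech, fs = sf.read("clean_speech.wav")               # hypothetical 8 kHz, 3 s, mono clip
noise, _ = sf.read("helicopter_noise.wav")             # hypothetical noise clip at 8 kHz
for snr in [10, 7.5, 5, 2.5, 0, -2.5, -5, -7.5, -10]:
    noisy = mix_at_snr(speech, noise, snr)
    sf.write(f"helicopter_snr_{snr}dB.wav", noisy, fs, subtype="PCM_16")
```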

Fig. 4 The proposed CNN model's architecture

2.2 Synchrosqueezed wavelet transform

The SWT is a signal processing approach used for time-frequency analysis and modal decomposition of non-stationary signals (Daubechies et al., 2011; Madhavan et al., 2019). This study uses the SWT to evaluate the time-frequency matrix of speech signals with background noise. The SWT of a noisy speech signal, z(n), is computed in three steps. First, the discrete-time continuous wavelet transform (DTCWT) of the noisy speech signal is evaluated as (Daubechies et al., 2011; Madhavan et al., 2019)

$$\begin{aligned} W(\tau , s(k))=\frac{1}{s(k)}\sum _{j=0}^{N-1} z(j) \overline{\psi }\left( \frac{j-\tau }{s(k)}\right) , \end{aligned}$$
(1)

where \(\overline{\psi }\) denotes the complex conjugate of the wavelet basis function (mother wavelet) \(\psi\), and s(k) denotes the kth scale with \(k>0\). In (1), \(\textbf{W}=[W(\tau , s(k))]_{\tau =0, k>0}^{N, S}\) is the wavelet coefficient (scalogram) matrix of the noisy speech signal z(n), where S is the total number of scales and \(0 <k \le S\). Second, the instantaneous frequency \(I(\tau , s(k))\) is estimated as (Oberlin & Meignen, 2017)

$$\begin{aligned} I(\tau , s(k))= {\left\{ \begin{array}{ll} \text {Ro}\big [R\big (\frac{N}{2 \pi i} \frac{W(\tau +1,s(k))}{W(\tau ,s(k))}\big )\big ] &{} \text {if} \quad W(\tau ,s(k)) \ne 0, \\ 0 &{} \text {if} \quad W(\tau ,s(k)) = 0, \end{array}\right. } \end{aligned}$$
(2)

where \(R(\cdot )\) and \(\text {Ro}(\cdot )\) denote the real part and round-off operations, respectively. In the third step, the reassignment of the wavelet coefficients, or synchrosqueezing, is performed based on the instantaneous frequency computed from (2). Hence, the SWT-based time-frequency matrix of the noisy speech signal is computed as (Daubechies et al., 2011)

$$\begin{aligned} W^{SWT}(\tau , \tilde{s}(k))= \frac{1}{\varDelta \tilde{s}} \sum _{s(k) } W(\tau , s(k)) s^{-\frac{3}{2}}(k) \varDelta s(k),\end{aligned}$$
(3)

where \(W^{SWT}(\tau , \tilde{s}(k))\) is the SWT matrix obtained at the kth reassigned scale, \(\tilde{s}(k)\). The range of \(\tilde{s}(k)\) for the reassignment of wavelet coefficients in (3) is given as \([\tilde{s}(k)-\frac{1}{2}\varDelta \tilde{s}, \tilde{s}(k)+\frac{1}{2}\varDelta \tilde{s}]\) with \(\varDelta \tilde{s}=\tilde{s}(k)-\tilde{s}(k-1)\). Similarly, \(\varDelta s(k)\) is evaluated as \(\varDelta s=s(k)-s(k-1)\) (Daubechies et al., 2011). Figures 2 and 3 show the oscillogram and the corresponding SWT-based time-frequency contour plots at SNR values of 5 dB and −5 dB, respectively. It is observed that the time-frequency plots of different oscillograms, or speech signals, exhibit distinct characteristics. Therefore, a DCNN model designed using the SWT-based time-frequency images of speech signals can be used for automatic background classification.
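In practice, the SWT time-frequency image for one noisy speech clip can be produced with off-the-shelf tools. The sketch below assumes the third-party ssqueezepy, soundfile, and matplotlib packages and hypothetical file names; it only illustrates this preprocessing step and is not the authors' implementation.

```python
import numpy as np
import soundfile as sf                      # assumed for WAV I/O
import matplotlib.pyplot as plt
from ssqueezepy import ssq_cwt              # assumed third-party SWT implementation

z, fs = sf.read("helicopter_snr_5dB.wav")   # hypothetical noisy speech clip (8 kHz)
# Recent ssqueezepy versions return (Tx, Wx, ssq_freqs, scales); Tx is the
# synchrosqueezed matrix, Wx the underlying CWT coefficients.
Tx, Wx, ssq_freqs, scales = ssq_cwt(z, fs=fs)

# Save |Tx| as an RGB time-frequency image to be fed to the DCNN (Sect. 2.3)
plt.figure(figsize=(2.5, 2.0), dpi=100)     # roughly 250 x 200 pixels, as in Table 1
plt.pcolormesh(np.arange(len(z)) / fs, ssq_freqs, np.abs(Tx), shading="auto")
plt.axis("off")
plt.savefig("helicopter_snr_5dB_swt.png", bbox_inches="tight", pad_inches=0)
plt.close()
```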

Table 1 The output shape and number of parameters in each layer of the proposed model

2.3 Deep convolutional neural network architecture

In this work, we propose a very lightweight deep convolutional neural network (CNN). CNNs have been shown to use local filtering and max pooling to successfully learn invariant features (Cao et al., 2019). The architecture of the proposed CNN model has 11 layers, as shown in Fig. 4. The batch normalization layer is the first layer of the CNN model, followed by convolutional (tanh activation) layer-1. The third layer is max pooling layer-1, followed by convolutional (tanh activation) layer-2, whose output feeds max pooling layer-2. The next layer is a dropout layer, followed by convolutional (tanh activation) layer-3, whose output is fed to max pooling layer-3. The ninth layer is a flatten layer, which is followed by the dense layer (softmax activation). The output cross-entropy layer is the final layer of the proposed CNN model. The number of parameters and output shape of each layer are tabulated in Table 1. In each convolutional layer, we consider a kernel size of 4 × 4, and in each max pooling layer, a pooling size of 3 × 3. In this model, 270 images of size \(250\times 200\times 3\) from each class are considered, which makes a total of 2970 images. Of the available dataset, 80% is used for training and 20% for validation.
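A minimal Keras sketch of this 11-layer architecture is given below. The filter counts and dropout rate are illustrative assumptions (the exact values follow from Table 1), while the kernel size, pooling size, activations, and input size follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_dcnn(num_classes=11, input_shape=(250, 200, 3)):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.BatchNormalization(),                     # layer 1
        layers.Conv2D(8, (4, 4), activation="tanh"),     # layer 2 (filter count assumed)
        layers.MaxPooling2D(pool_size=(3, 3)),           # layer 3
        layers.Conv2D(8, (4, 4), activation="tanh"),     # layer 4 (filter count assumed)
        layers.MaxPooling2D(pool_size=(3, 3)),           # layer 5
        layers.Dropout(0.25),                            # layer 6 (rate assumed)
        layers.Conv2D(8, (4, 4), activation="tanh"),     # layer 7 (filter count assumed)
        layers.MaxPooling2D(pool_size=(3, 3)),           # layer 8
        layers.Flatten(),                                # layer 9
        layers.Dense(num_classes, activation="softmax"), # layer 10
    ])
    # Layer 11 in the text is the cross-entropy output layer, expressed here
    # through the compile step (Adam optimizer, categorical cross-entropy loss).
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_dcnn()
model.summary()
```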

The importance of the batch normalization layer is that it allows each network layer to learn more independently of the others. It also normalizes the output of the preceding layer, which in turn reduces the sensitivity to network initialization and speeds up training. Further, the convolutional layers extract learnable feature maps (FMPs) from the input image, and these features are reduced layer by layer using the pooling layers. This reduction helps the network train faster. The tanh layer is a non-linear activation layer that applies the tanh function to its input. The output of the nth convolutional layer with tanh activation can be calculated as (Panda et al., 2020)

$$\begin{aligned} U^{n} = \tanh \left( \sum _{a=1}^{D} U_{i}^{n-1}(a)K_{i}^{n}\left( j-a+\frac{D}{2}\right) + b^{n}\right) , \end{aligned}$$
(4)

where \(U^{n-1}\) denotes the output of the previous layer \(n-1\), \(K^{n}\) and \(b^{n}\) denote the convolution kernel and bias of the nth layer, and D denotes the total number of FMPs in the convolutional layer. Here, tanh is the non-linear activation function used in the convolutional layer. The two most commonly used pooling layers, max pooling and average pooling, are defined as (Panda et al., 2020)

$$\begin{aligned} \text {Max \;pooling, } U_{n}^{l} = max[ U_{n}^{l-1}] \end{aligned}$$
(5)

and

$$\begin{aligned} \text {Average\; pooling, } U_{n}^{l} = avg[ U_{n}^{l-1}], \end{aligned}$$
(6)

respectively. Here, n indexes the feature maps of the pooling layer. In classification problems, the softmax layer follows the final dense layer, and the last layer is the classification layer. The output of the fully connected layer is obtained as (Panda et al., 2020)

$$\begin{aligned} \text {Fully \;Connected \;Layer, } G^{n} = \frac{1}{1 + e^{-(X^{n}G^{n-1} + y^{n})}} \end{aligned}$$
(7)

and the output of the Softmax layer is

$$\begin{aligned} \text {Softmax \;Layer, }\sigma (G_{i}) = \frac{e^{G_{i}}}{\sum _{t=1}^{N}e^{G_{t}}}, \end{aligned}$$
(8)

respectively. Here, \(X^{n}\) and \(y^{n}\) are the weight and bias, respectively, of the nth fully connected layer, and N represents the number of output classes. For multi-class classification, the categorical cross-entropy loss of the final classification layer is computed as

$$\begin{aligned} \text {Cost} = - \frac{1}{\mu }\sum _{i=1}^{\mu }\sum _{k=1}^{N}V_{i,k}\ln (H_{i,k}) , \end{aligned}$$
(9)

where, \(\mu\) is the number of training samples, \(V_{i,k}\) and \(H_{i,k}\) represent the actual output and calculated hypothesis of the network, respectively.
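For concreteness, the following small numpy example (with illustrative values only) evaluates the softmax output of Eq. (8) and the categorical cross-entropy cost of Eq. (9) for \(\mu =2\) samples and \(N=3\) classes.

```python
import numpy as np

def softmax(g):
    e = np.exp(g - np.max(g, axis=-1, keepdims=True))   # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

G = np.array([[2.0, 0.5, -1.0],      # dense-layer outputs for two samples (illustrative)
              [0.1, 1.2,  0.3]])
V = np.array([[1, 0, 0],             # one-hot ground-truth labels
              [0, 1, 0]])
H = softmax(G)                                      # Eq. (8)
cost = -np.mean(np.sum(V * np.log(H), axis=1))      # Eq. (9)
print(H, cost)
```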

To update the network parameters, i.e., the weights and biases, we consider the Adam solver. These parameters are obtained by minimizing the loss or cost function (Panda et al., 2020). The Adam optimization algorithm is an extension of stochastic gradient descent that combines momentum and root mean square propagation (RMSP). The method uses individual adaptive learning rates obtained from estimates of the first- and second-order moments of the gradients, as sketched below.
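The following numpy sketch shows a single Adam update with the commonly used default hyper-parameters; it illustrates the update rule only and is not the authors' training code.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update of parameters `theta` given gradient `grad` at step t (t >= 1)."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second-moment (RMSP) estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```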

2.4 Benchmark models

We compare the proposed CNN model with several benchmark models described below.

2.4.1 DenseNet

DenseNet is a densely connected network made up of multiple convolutional layers, where the input to each convolutional layer is the concatenation of the outputs of all preceding layers. DenseNet models come in different variations, including DenseNet121, DenseNet169, DenseNet201, and DenseNet264. We compare the performance of the proposed model with the DenseNet121, DenseNet169, and DenseNet201 architectures pre-trained on the ImageNet dataset.

2.4.2 InceptionNet

In this network, a typical inception layer is made up of 1 × 1, 3 × 3, and 5 × 5 convolution layers whose outputs are concatenated to form the input to the next layer. InceptionNet is made up of several such stacked inception layers. In this paper, we consider InceptionV3 and InceptionResNetV2 pre-trained on the ImageNet dataset.

2.4.3 MobileNetV3

In contrast to the other models, this network is optimized for speed on mobile hardware and provides faster inference. It was designed using AutoML with neural architecture search. It also includes squeeze-and-excitation blocks and uses the hard-swish activation, a mobile-friendly replacement for the sigmoid-based swish. In addition, it uses a modified version of the inverted bottleneck layer that was first introduced in MobileNetV2. This network comes in two sizes: MobileNetV3Large and MobileNetV3Small. In this work, we use MobileNetV2, MobileNetV3Large, and MobileNetV3Small, all pre-trained on the ImageNet dataset.

2.4.4 NASNet

The Neural Architecture Search Network (NASNet) is made up of multiple layers, each of which contains a normal cell and a reduction cell. In this paper, we consider NASNetMobile trained on the ImageNet dataset.

Fig. 5 Variation of training accuracy and validation accuracy with epoch for the proposed DCNN model

Fig. 6 Confusion matrix of the proposed DCNN

2.4.5 ResNet

The ResNet network is made up of a large number of residual blocks. Each block consists of two 3 × 3 convolution layers with the same number of output channels. The residual network also contains skip connections that perform identity mappings. ResNet models include ResNet18, ResNet34, ResNet50, ResNet101, ResNet110, ResNet152, ResNet162, and ResNet1202. In this work, we consider the ResNet50V2, ResNet101V2, and ResNet152V2 architectures pre-trained on the ImageNet dataset.

2.4.6 VGG

VGG is built from 3 × 3 convolutional layers stacked on top of each other, with max pooling used to reduce the volume size. VGG16 and VGG19, two VGG models pre-trained on the ImageNet dataset, are considered in this work.

2.4.7 Xception

The Xception network is an InceptionNet extension that replaces the standard Inception blocks with depthwise separable convolutions. In this study, we use an Xception model pre-trained on the ImageNet dataset.

2.4.8 EfficientNet

EfficientNet is a convolutional neural network that uses a compound coefficient to uniformly scale the depth, width, and resolution dimensions, instead of the arbitrary scaling used in conventional practice. In this work, we consider the EfficientNetB0, EfficientNetB1, EfficientNetB2, EfficientNetB3, EfficientNetB4, EfficientNetB5, EfficientNetB6, and EfficientNetB7 models pre-trained on the ImageNet dataset.
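As an illustration of how these pre-trained backbones are adapted to the 11-class SWT images, the following sketch fine-tunes one keras.applications model (DenseNet121 is chosen arbitrarily); the classification head and training settings are assumptions rather than the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# ImageNet-pretrained backbone without its original classification head
base = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet", input_shape=(250, 200, 3))

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(11, activation="softmax"),   # 11 background classes
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```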

Table 2 Variation of accuracy for all the models considered in this paper
Fig. 7 Variation of accuracy in each fold for the proposed model and the benchmark models

Table 3 Inference time for different models when run on different edge computing devices

3 Results and discussion

In this section, the performance of the proposed model is evaluated in terms of training accuracy and validation accuracy. Further, we compare its performance with that of the benchmark models in terms of 5-fold cross-validation accuracy, number of parameters, model size, and inference time.
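A minimal sketch of the 5-fold cross-validation protocol is shown below; the data arrays, epoch count, and use of scikit-learn's KFold are assumptions for illustration, with `build_fn` standing for any of the model constructors above (e.g., `build_dcnn`).

```python
import numpy as np
from sklearn.model_selection import KFold

def five_fold_accuracy(build_fn, images, labels, epochs=50):
    """Return the mean and standard deviation of validation accuracy over 5 folds."""
    accs = []
    for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(images):
        model = build_fn()                                   # fresh model for each fold
        model.fit(images[train_idx], labels[train_idx],
                  validation_data=(images[val_idx], labels[val_idx]),
                  epochs=epochs, verbose=0)
        _, acc = model.evaluate(images[val_idx], labels[val_idx], verbose=0)
        accs.append(acc)
    return np.mean(accs), np.std(accs)
```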

The proposed convolutional neural network model is trained and validated with a dataset created from noise samples of airplane, airport, babble, car, drone, exhibition, helicopter, restaurant, station, street, and train sounds. Figure 5 shows the variation of the training and validation accuracy with the number of epochs. It is observed from Fig. 5 that the accuracy increases rapidly with the number of epochs and saturates at a maximum of 98.82%. The confusion matrix of the proposed CNN model on the test dataset is shown in Fig. 6. From Fig. 6, it is observed that the proposed model classifies the airplane, exhibition, babble, drone, car, helicopter, and train data with 100% accuracy. The airport data are classified correctly with an accuracy of 97% and misclassified as restaurant and station data at a rate of 2% each. A classification accuracy of 96% is obtained for the street data, which are misclassified as helicopter and station at a rate of 2% each. The restaurant data are correctly classified with an accuracy of 95% and are misclassified as babble at a rate of 2% and as station at a rate of 3%. Finally, the proposed model classifies the station data correctly with an accuracy of 92% and misclassifies them as restaurant and airport at a rate of 4% each, as can be observed from Fig. 6.

Pre-trained models such as ResNet, DenseNet, Inception, MobileNet, VGG, NASNetMobile, Xception, and EfficientNet are also trained on the considered dataset for a fair comparison with the proposed model. Table 2 lists the accuracy, number of parameters, and model size for the proposed model and the pre-trained models. It is observed from Table 2 that the 5-fold cross-validation accuracy of the proposed model is \(97.96\% (\pm 0.53)\), which is considerably higher than that of all the pre-trained models considered; the closest is VGG19, with an accuracy of \(80.74\% (\pm 1.50)\). Thus, we conclude that the proposed model outperforms the other pre-trained models in terms of 5-fold cross-validation accuracy, which can also be observed from Fig. 7. Further, the proposed model achieves a test accuracy between 96.8% and 98.82%, whereas the pre-trained models achieve test accuracies between 65.31% and 81.98%, which is much lower. The proposed model achieves this higher accuracy with only 32,879 parameters in total, far fewer than the number of parameters in the other benchmark models; the benchmark model with the fewest parameters is MobileNetV3Small, with 1,541,243 parameters, as can be noticed from Table 2. It can also be observed from Table 2 that the proposed model has a model size of only 439 KB, whereas the pre-trained models range from a minimum of 6 MB for MobileNetV3Small to a maximum of 246 MB for EfficientNetB7. Thus, despite its much smaller size, the proposed model achieves a much higher accuracy, and we conclude from Table 2 that it outperforms the pre-trained models in terms of accuracy, number of parameters, and model size.

Table 3 shows the inference time of the proposed model and the pre-trained models when run on different edge computing devices. We consider a total of nine devices for this comparison. From Table 3, it is noticed that the proposed model has an inference time of 0.42 ms, 0.53 ms, 0.56 ms, 0.7 ms, 0.84 ms, 1.7 ms, 2.47 ms, 3.12 ms, and 41.27 ms when run on the Tesla P100-PCIE-16GB, Tesla T4, Tesla P4, Tesla K80, NVIDIA GeForce GTX 1050, NVIDIA Jetson Xavier, Intel(R) Xeon(R) CPU E5-2630 v4, Intel Core i5 8th gen, and Raspberry Pi, respectively, which is much lower than that of the other benchmark models on any of the edge computing devices. Further, it is noticed that the inference time of the proposed model is lowest when run on the Tesla P100-PCIE-16GB.
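The per-sample inference time on each device can be measured with a simple timing loop such as the sketch below; the warm-up step, run count, and random input are assumptions and do not reproduce the authors' exact benchmarking procedure.

```python
import time
import numpy as np

def mean_inference_time_ms(model, input_shape=(250, 200, 3), runs=100):
    """Average single-sample prediction time of a Keras model, in milliseconds."""
    x = np.random.rand(1, *input_shape).astype("float32")
    model.predict(x, verbose=0)                  # warm-up run (build graph, allocate memory)
    start = time.perf_counter()
    for _ in range(runs):
        model.predict(x, verbose=0)
    return 1000 * (time.perf_counter() - start) / runs
```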

4 Conclusion

This paper proposed a novel deep-learning-based approach for automatically classifying the background information embedded in speech signals. The SWT-based time-frequency analysis has been used to convert speech signals into time-frequency plots. These time-frequency plots are then used in conjunction with a DCNN to classify the background sound information embedded in the speech signals. The proposed DCNN model is only 439 KB in size and can classify airplane, airport, babble, car, drone, exhibition, helicopter, restaurant, station, street, and train backgrounds. Through extensive simulations, it has been concluded that the proposed SWT-based deep learning approach classifies more accurately than pre-trained models such as DenseNet, EfficientNet, InceptionNet, MobileNet, NASNet, ResNet, VGG, and Xception.