1 Introduction

Many multimedia and computerized systems, such as audio assistants and automated customer support systems, rely on audio data as an integral means of storing information, including environmental sounds, noises, Foley, speech sounds, and nonspeech utterances; audio can even carry more information than video signals [1]. Classifying environmental sounds stands apart from the classification of speech and music because little prior knowledge is available about their temporal and frequency characteristics. Unlike well-structured domains such as speech and music, environmental sounds encompass a wide and diverse array of audio sources, so classification models must adapt and generalize well in order to discern and categorize these often complex and heterogeneous inputs. Moreover, environmental sounds are highly random and follow no set pattern, so many naïve prediction algorithms (algorithms whose hyperparameters are not well tuned for the desired audio frequencies) fail to produce useful results, which makes environmental sound classification a challenging task for researchers. At present, the main focus of researchers in the auditory field is the accurate recognition of speech or music files. The analysis of environmental sounds, an immensely mixed group of day-to-day audio that is unlikely to be categorized as speech or music, has been left behind in recent improvements despite its applications in IoT technologies [2], hearing aids [3], smart room monitoring [4], video content highlight generation [5,6,7], and audio surveillance systems [8]. Over the years, advances in digital technology and broadcast facilities have given users access to a huge amount of multimedia and audio files.
Since environmental sounds contain various types of noise as well as textured and structural components, namely scattering and iterations, classifying such audio signals accurately is much harder than classifying speech and music, and obtaining accurate results is one of the challenging parts of our research.

The accurate classification and categorization of audio signals has become an important research area [9, 10]. Such classification and sampling of audio signals fall under the field of pattern recognition. Over the years, various machine learning models have been developed for accurate prediction and classification in fields such as industry and medicine [11]. The main obstacles to obtaining accurate results are feature selection and the categorization of audio signals based on the features extracted from them. To overcome these barriers, researchers generally apply various preprocessing steps to the audio files, including noise removal, feature selection, and feature extraction. Some of the feature selection methods are Linear Predictive Coefficients (LPCs) [12], Linear Predictive Cepstral Coefficients (LPCCs) [13], the Short-Time Fourier Transform (STFT) [14], and Mel Frequency Cepstral Coefficients (MFCCs) [15,16,17]. With respect to the issues mentioned above, the major contributions of the research work presented in this paper comprise (1) choosing datasets of environmental sounds with a suitable number of sampling records, (2) choosing an appropriate feature selection technique for extracting the features used to classify the audio files, and (3) removing noise and trimming the audio signals to their main and effective part without missing any important features. We use MFCC and STFT features, which are widely used in automatic speech recognition, for the classification of audio files in our experimental models, as discussed in detail in Sect. 4. The experiment also includes the classification of audio files from two different datasets that have different numbers of samples and different noise levels. For detailed analysis, we have calculated the parameters precision, recall, specificity, and F1-score. The rest of the paper is organized into the following sections: 2. Motivation and contribution, 3. Related work, 4. Concepts and dataset, 5. Classification models, 6. Experimentation, 7. Results, 8. Discussion, 9. Conclusion, and 10. Future work.

2 Motivation and contribution

In this section, we have presented the reasons which motivated us to conduct our research in the field of audio classification. We have also highlighted our key contributions that have been addressed throughout the research.

Our motivation centers around efficient file management and the recognition of audio files, which can significantly reduce human labor. Our goal is to present models that can seamlessly integrate with AI systems and take actions based on the classified type of audio file. Using Neural Network techniques in our models, we achieved a correct classification rate of 91 out of 100 instances. Our model holds great potential for applications in the development of smart roads, hospitals, and industries.

The main motivation behind our research was to offer a comprehensive comparative study of various models that can be employed based on the results obtained in different workspaces. Our experiments yielded acceptable results, which can be valuable when selecting a model, and our comparative analysis lends support to these results. Unlike previous studies, which often presented only one or two classification models and yielded less accurate results than our model, we aim to provide a broader selection of models and apply the same experimental approach to each one, offering a clear understanding of each model's performance in the field of audio classification. Additionally, our work compares the models on two datasets, a feature absent from most previous proposals. Both datasets have been chosen to maintain consistent sound quality but with varied features, and each dataset undergoes a separate feature extraction process.

3 Related work

In the past few years, many researchers have proposed algorithms and techniques for classifying audio signals that take various parameters into consideration [18,19,20,21,22]. Many of these studies follow two basic preprocessing steps: analysis of the incoming audio signal, followed by extraction of the key features from it. Feature extraction greatly reduces the amount of unwanted data, and classification is then performed on the extracted features. Feature selection can be broadly divided into two categories: waveforms [23, 24] and spectrograms [25,26,27]. Waveform-based classification processes the input data as a 1-D array, whereas spectrogram-based classification converts the audio signal into spectrograms using the time-dependent Fourier transform (STFT). Li et al. [28] (2001) discussed audio classification based on LPC and MFCC features, and their results showed that cepstral-based features helped in classifying audio signals accurately. Guo et al. [29] (2003) proposed a new metric, Distance From Boundary (DFB), for the classification process. This process searches for the appropriate boundary that contains the audio file pattern, and the audio files are finally sorted by their distances from it.

Cowling et al. [30] (2003) experimented with stationary and non-stationary time–frequency-based feature extraction for classifying environmental sounds. Audio surveillance has also been considered for the detection and classification of various acoustic incidents such as human coughing [31] and impulsive sounds [32, 33], including gunshots, glass breaking, explosions, and alarms. Dargie et al. [34] (2009) proposed a study on audio sound classification using MFCC features; although the overall performance rate was high, the results for specific sounds, including the accuracy of classifying the audio file, lagged behind. El-Maleh et al. [35] (1999) illustrated several pattern-based classification models, namely QGC (Quadratic Gaussian Classifier), KNN (K-Nearest Neighbor) classifier, and LSLC (Least-Square Linear Classifier). The experiment also included a noise removal process and used LPC feature extraction. The QGC classifier achieved the best results, with an error rate of 13.6%. However, unlike our study, that study did not compare the results or implement the models on more than one dataset, and our study achieves a lower error rate than [35].

Seker et al. [36] (2020) proposed a study on classifying environmental sounds using a CNN-based model and achieved an accuracy of 82.26%, whereas we achieved accuracies of 91.41% and 91.27% on two different datasets. Zhang et al. [37] (2021) classified environmental sounds using an ACRNN-based model and achieved an accuracy of 86.1%, which is still lower than that of our ANN-based architecture. Another contrast is the number of layers: the ACRNN structure of [37] used 10 layers, while we used 2 and 4 layers for the two datasets respectively. From this comparison, we can state that even simpler ANN architectures can achieve state-of-the-art accuracy without larger architectures, which in turn lowers training time and computational expense. Dhanalakshmi et al. [38] (2009) focused on classifying audio signals into six categories using features such as LPC, MFCC, and LPCC; in contrast, we have classified the audio signals of two different datasets into 10 and 8 categories respectively, with better accuracies and other parameters. Although we obtained our best results with the ANN classifier model, they achieved their best accuracies using SVM and the Radial Basis Function Neural Network (RBFNN). Chen Lie et al. [39] (2006) classified environmental sounds using various classifiers and showed that SVM achieved the best results, i.e. 91.41% with a loss reduction of 8–15%; however, the accuracy dropped to 64.76% when the authors classified environmental sounds with three classes, whereas in our study we achieved better results (accuracies of 91.41% and 91.27% on the two datasets) using the ANN classifier model with the least losses.

Apart from these studies, recent work has also considered deep convolutional neural networks (DNN) for classifying audio samples. Maccagno et al. [40] (2021) incorporated a CNN-based approach for audio classification at construction sites. The proposed DNN model used spectrograms that were created through the frequency scale and time derivatives. The frame size was considered to be 22,050 Hz with 60 mel bands. The dataset consisted of 5 classes, and with a fivefold cross-validation process the model achieved an accuracy of 97.08%. A similar study was presented by Mehyadin et al. [41] (2021) for bird sound classification. The authors used the model to analyze bird sounds and enable species detection. All audio samples that contained noise were treated with a separate noise filter that used the MFCC feature extraction process. The experiment included three models, namely Naïve Bayes, J4.8, and Multilayer Perceptron (MLP), of which J4.8 achieved the highest accuracy of 78.4%.

In another study, Palanisamy et al. [54] classified audio signals using a dual-dataset approach that included the UrbanSound8K and ESC-50 datasets. In the experiment, an ImageNet-pretrained standard CNN model was used and achieved validation accuracies of 92.89% and 87.42% on the respective datasets. A similar study was carried out by Zeghidour et al. [55], who proposed the LEAF architecture (Learnable Frontend for Audio Classification). The experimental results showed that the proposed model outperformed the EfficientNetB0 model. The study also states that the proposed architecture can be integrated with other neural networks at a low parameter cost. The proposed model is a fully trainable and lightweight architecture that learns all the operations for extracting audio features, from filtering to pooling. Based on this research [55], a new efficient, hybrid, and lightweight model could be developed that learns quickly and accurately and supports fire alarms and other audio signal detection methods.

From the above survey of various former proposals and experiments that have been presented in the classification of audio files, it can be seen that the authors have included only a few models for the categorization process. However, in our research, we have included a total of seven classification models that are used in two different audio datasets having different feature extraction steps.

4 Concepts and dataset

In this section, we present a detailed overview of the two datasets used for the classification models, followed by the techniques and types of features selected for extracting the key parameters from the audio files for categorization. We have used the Librosa library to extract the features from the audio samples. The features include MFCCs and STFT. In the feature extraction process, a total of 186 features have been extracted for every label in the respective dataset. The length of each audio sample after noise removal and feature extraction is taken to be 4 s. In the resulting audio samples, only the unique audio, which helps in classification, is retained. By unique audio we mean the audio that differs in frequency, pitch, etc. from the background noise and other audio. Once the audio sample is noise-free, it is trimmed to the part where the actual sound of interest can be distinguished. This is illustrated in Figs. 5, 6, and 7. The section is divided as follows: 4.1 MFCC features, 4.2 STFT features, 4.3 UrbanSound8K audio dataset, and 4.4 Sound event audio dataset.

4.1 MFCC features

The key step for accurate classification is the extraction of discrete features from the audio sample, i.e. the components that help identify the linguistic content of the sample, while discarding everything else, such as noise and other unwanted sounds. MFCCs are used for such feature extraction [42, 43] and are among the most widely used features for audio samples with little noise. The MFCCs are computed from fast Fourier transform (FFT) coefficients filtered by a bandpass filter bank. The mathematical expression for the Mel-scale computation is shown in Eq. (1).

$$Freq_{mel} = \frac{{x*log\left( {\left( {c + f} \right)/x} \right)}}{log\left( 2 \right)}$$

In Eq. (1), \({Freq}_{mel}\) is the logarithmic scale of the normal frequency scale \(f\), and \(x\) is a coefficient that plays an important role in calculating the MFCC features. This coefficient helps convert high-frequency sound into low frequencies so that changes in the audio sample are pointed out more accurately. It should lie in an appropriate range of 250 to 350, i.e. the number of triangular filters that fall in the frequency range of 200–1200 Hz, which is the range of dominant audio information. For illustration, a full filter bank can be seen in Fig. 1 [56].
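As a minimal sketch of how Eq. (1) can be evaluated, the snippet below implements the expression literally. The function name `hz_to_mel`, the default value x = 300 (chosen inside the 250 to 350 range mentioned above), and the choice c = x (so that 0 Hz maps to 0 on the Mel scale) are illustrative assumptions; the constant c is not specified in the text.

```python
import math


def hz_to_mel(f_hz, x=300.0, c=300.0):
    """Evaluate Eq. (1): x * log((c + f) / x) / log(2).

    x and c are assumed values for illustration only; with c == x the
    mapping sends 0 Hz to 0 and x Hz to x on the Mel scale.
    """
    return x * math.log((c + f_hz) / x) / math.log(2)


# Print the mapping for a few frequencies to show the logarithmic compression.
for f in (0, 300, 900, 2100):
    print(f"{f:>5} Hz -> {hz_to_mel(f):7.1f} mel")
```

With these assumed constants each doubling of (c + f) adds x units on the Mel scale, which is the compression of high frequencies described above.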

Fig. 1 Basic full filter bank [55]

In the final step, the MFCCs, denoted as \(F_{MFCC}\), are calculated using Eq. (2).

$$F_{MFCC} = \sqrt{\frac{2}{N}} \mathop \sum \limits_{k = 1}^{N} \left( {logS_{k} } \right)cos\left[ {n\left( {k - 0.5} \right)\frac{\pi }{N}} \right]$$

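where \({S}_{k}\) is the output of the \(k\)-th filter of the filter bank, \(k\) varies from 1 to \(N\), \(N\) is the number of filter-bank outputs (the length of the transform), and \(n\) is the index of the cepstral coefficient.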
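The following sketch evaluates Eq. (2) directly for one frame, given the filter-bank outputs \(S_1, \ldots, S_N\). The function name `mfcc_from_filterbank`, the parameter `num_coeffs`, and the convention that the coefficient index n starts at 0 are assumptions made for illustration; in practice a library routine such as the one provided by Librosa would be used instead of an explicit loop.

```python
import math


def mfcc_from_filterbank(filterbank_outputs, num_coeffs):
    """Apply Eq. (2) to the filter-bank outputs S_1..S_N of a single frame.

    filterbank_outputs must be strictly positive because their logarithm
    is taken. Returns coefficients for n = 0 .. num_coeffs - 1 (assumed range).
    """
    n_bands = len(filterbank_outputs)
    log_outputs = [math.log(s) for s in filterbank_outputs]
    scale = math.sqrt(2.0 / n_bands)
    coeffs = []
    for n in range(num_coeffs):
        total = 0.0
        for k in range(1, n_bands + 1):
            total += log_outputs[k - 1] * math.cos(n * (k - 0.5) * math.pi / n_bands)
        coeffs.append(scale * total)
    return coeffs


# A flat spectrum (all outputs equal) has energy only in the n = 0 coefficient.
flat = [math.e] * 8
print(mfcc_from_filterbank(flat, 4))
```

The flat-spectrum example illustrates why the higher-order coefficients describe the shape of the spectrum: they vanish when the log filter-bank outputs do not vary across bands.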

4.2 STFT features

Time-dependent signals are decomposed into their constituent frequencies using the Fourier transform. One variant is the Short-Time Fourier Transform (STFT), which is widely used for extracting features from audio samples. Although MFCCs are also widely used in feature extraction, they are very sensitive to background noise in the sample, which makes them less effective, whereas STFTs provide effective results even on noisy audio samples. The Fourier transform of a function gives the frequency-domain representation of the input amplitude signal. Figure 2 shows how a signal is converted into its frequency representation using the Fourier transform.

Fig. 2 Equivalent Fourier transformation of the provided digital signal

STFT is a method in which the FFT is applied after the signal has been segmented by a window function. The mathematical expression for the calculation of the STFT features is shown in Eq. (3).

$$Y\left( {t,f} \right) = STFT\left( {y\left( t \right)} \right) = \int\limits_{{ - \infty }}^{\infty } {y\left( u \right)h^{*} \left( {u - t} \right)e^{{ - 2j\pi fu}} du}$$

where \(y\left(t\right)\) is the original audio signal and \(h\left(t\right)\) is the STFT window function, centered at t = 0 with a length of \(L (0<L\le 1500)\). The resulting STFT values form a 2-D array, as shown below.

$$Y\left( {t,f} \right) = \begin{bmatrix} y_{1,1} & y_{1,2} & y_{1,3} & \cdots & y_{1,n} \\ y_{2,1} & y_{2,2} & y_{2,3} & \cdots & y_{2,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ y_{m,1} & y_{m,2} & y_{m,3} & \cdots & y_{m,n} \end{bmatrix}$$

where each entry \({y}_{i,j}\) is an STFT value.
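A discrete counterpart of Eq. (3) can be sketched as follows. This is a dependency-free illustration rather than the implementation used in the experiments: the Hann window, the window length of 64 samples, the hop of 32 samples, and the names `hann_window` and `stft` are assumptions. The explicit sum mirrors the integral in Eq. (3); a practical implementation would use an FFT routine such as the one in Librosa.

```python
import cmath
import math


def hann_window(length):
    """Periodic Hann window h of the given length (assumed window type)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / length) for i in range(length)]


def stft(signal, win_len=64, hop=32):
    """Return an m x n list of complex STFT values (m frames, n frequency bins).

    Each frame is multiplied by the window and transformed with an explicit
    discrete Fourier sum. The time index u is local to each frame, which only
    changes the phase of the result, not its magnitude.
    """
    window = hann_window(win_len)
    n_bins = win_len // 2 + 1  # keep non-negative frequencies only
    rows = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = [signal[start + u] * window[u] for u in range(win_len)]
        rows.append([
            sum(frame[u] * cmath.exp(-2j * math.pi * f * u / win_len)
                for u in range(win_len))
            for f in range(n_bins)
        ])
    return rows


# Toy input: a 100 Hz tone sampled at 800 Hz (both values are illustrative).
sample_rate = 800
tone = [math.sin(2 * math.pi * 100 * t / sample_rate) for t in range(256)]
Y = stft(tone)
magnitudes = [[abs(v) for v in row] for row in Y]
print(len(Y), "frames x", len(Y[0]), "bins")
```

Each row of `magnitudes` corresponds to one row of the 2-D array above; for the toy tone the largest entry of every row falls in the bin whose centre frequency is 100 Hz.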

4.3 UrbanSound8K audio dataset

In this study, we have used the UrbanSound8K audio dataset [44] for the environmental sound classification process. The dataset consists of 10 different environmental sound classes, as shown in Table 1. All the audio files are generated from real-time instances and have a recording time of nearly 4 s. We considered a total of 8732 audio samples from this dataset. The waveforms for the different classes are illustrated in Fig. 3. The samples shown for illustration are included in the classification. All the waveforms in Figs. 3 and 4 are amplitude versus time graphs, with amplitude (dB) on the Y-axis and time (seconds) on the X-axis. These images are presented in order to provide a glimpse of how varied the waveforms used for training the models are.

Table 1 Audio sample distribution in the UrbanSound8K dataset

Fig. 3 Waveform visualization for different audio samples for each class present in the UrbanSound8K dataset

Fig. 4 Audio waveforms for all the classes used in the classification process after the removal of noise and trimming process

For each audio file, we have calculated the MFCC features (40 features for each sample), and from these features we have created a modified dataset that contains, for every audio sample, the extracted features together with the respective class label. After forming the dataset, we have divided it randomly into training and test sets for the classification process, which is discussed in detail in the Classification models section.

4.4 Sound event audio dataset

The other dataset used in our research work is the Sound Event Audio Dataset [44], which was collected at the University of Moratuwa, Sri Lanka. The dataset contains many environmental sound classes; we selected 7 of them and added a further category, others, which includes sounds from the surroundings, doors opening and closing, etc. The audio sample distribution for this dataset is given in Table 2. All the audio samples were generated from real-time instances using two MOVO USB omnidirectional microphones. All the features of the individual audio samples were extracted from the dataset, and the classification models were built on them using a total of 1288 samples.

Table 2 Audio sample distribution in the Sound Event Audio dataset

Since this dataset also contained noise in the samples, we applied preprocessing to remove the noise and trim each audio sample without leaving any features behind, in order to obtain accurate results. As an example, Fig. 5 shows a random audio sample from the dataset before noise removal. Fig. 6 shows the waveform of the same sample after noise removal, which is much clearer. Finally, the third step trims the noise-free audio sample, as shown in Fig. 7. For the audio samples of the other classes, Fig. 4 illustrates the final trimmed waveforms that are used for feature extraction followed by classification. The samples shown for illustration are included in the classification. All the plots in Fig. 4 are amplitude versus time graphs.
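The trimming step can be illustrated with a simple amplitude-threshold rule. The text does not specify which trimming method was used, so the rule below, the function name `trim_to_active_region`, and the relative threshold of 0.1 are assumptions chosen only to make the idea concrete; the sketch covers trimming only and does not perform noise removal.

```python
def trim_to_active_region(samples, rel_threshold=0.1):
    """Keep the span between the first and last sample whose magnitude
    reaches rel_threshold times the peak magnitude (assumed rule)."""
    if not samples:
        return []
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return []  # a silent clip has no active region
    limit = rel_threshold * peak
    active = [i for i, s in enumerate(samples) if abs(s) >= limit]
    return samples[active[0]:active[-1] + 1]


# Toy clip: low-level residue surrounding a short, louder sound event.
quiet = [0.01, -0.02, 0.01]
event = [0.5, -0.9, 0.7, -0.3]
clip = quiet + event + quiet
trimmed = trim_to_active_region(clip)
print(trimmed)
```

Samples inside the retained span are left untouched, so quiet passages between two loud parts of the same event are preserved.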

Fig. 5 Opening pill container audio waveform with noise in the sample

Fig. 6 Opening pill container audio waveform after the removal of noise from the sample

Fig. 7 Opening pill container audio waveform after trimming the noise-free sample

For this dataset, we calculated STFT features for every audio sample because MFCC features do not perform well on audio samples that contain noise. The STFT features result in a 2-D array that contains the amplitude of each frequency bin for each individual window.

5 Classification models

Machine learning models are widely used either for prediction [46, 47] or for classifying given samples into different classes. To classify the audio samples, we have used seven different models. Six are classical machine learning classifiers, namely Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Naïve Bayes, Decision Tree, and Random Forest. In addition to these six classifiers, we have used an Artificial Neural Network (ANN) architecture-based model. A comparative study of all the classifiers used is given in Table 3.

Table 3 Comparative analysis of classifier models

6 Experimentation

In this section, we discuss the experimental setup used for categorizing the audio samples. One of the unique aspects of our research work is the comparison of two similar datasets that differ in noise level, as discussed in Sect. 4, with a different feature selection technique applied to each. Our experimental aims are as follows.

  • Extracting the unique and important features from the audio sample using two feature selection techniques namely MFCC and STFT features.

  • Removal of unwanted features and background noises from the audio sample for achieving better results.

  • Applying classifier models as discussed in Sect. 5, and comparing the results based on different parameters namely accuracy, precision, recall, specificity, F1-score, and MCC.

In Sect. 7 we further compare the results of the experiments conducted. Our approach for categorizing an audio sample from the datasets follows three main steps, namely feature extraction, data preprocessing depending on the sample (whether or not it contains noise), and finally the application of the various classifier models. All these steps can be seen in Fig. 8.

Fig. 8 Flowchart of various steps followed for classification of audio samples

This section is further divided into the following subsections: 6.1 Software and hardware, 6.2 Data preprocessing, and 6.3 Analysis of classifier models.

6.1 Software and hardware

All the classifier models used in the experiment were trained in Python 3 with the Keras library (using the TensorFlow backend) in an Anaconda environment. A high-level API was used for constructing the neural networks as well as the other classifier models. We used an Intel i5 8th-generation processor with 16 GB of RAM.

6.2 Data preprocessing

In our experiment, we have considered two datasets with 10 and 8 classes respectively. The two datasets are discussed in detail in Sect. 4. For the UrbanSound8K audio dataset we have applied the MFCC feature selection process because its samples are noise-free. In the other dataset, the samples contain some background noise and other unwanted features, so we have applied the STFT feature selection process, because MFCCs are sensitive to noise as discussed above. After collecting the features separately for the two datasets, we have split each dataset into training and testing sets. The training set contains a randomly selected 80% of the audio samples, and the testing set contains the remaining 20%, which are used for prediction with the trained models. This distribution is common to both datasets.
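The random 80/20 split described above can be sketched as follows. The helper name `train_test_split`, the fixed seed, and the rounding of the test-set size are illustrative assumptions and not details reported for the experiments; the feature vectors in the example are placeholders.

```python
import random


def train_test_split(features, labels, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out test_fraction of them for testing.

    Features and labels are indexed together so each sample keeps its label.
    """
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Placeholder data: ten one-dimensional feature vectors with two class labels.
features = [[float(i)] for i in range(10)]
labels = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = train_test_split(features, labels)
print(len(x_train), "training samples,", len(x_test), "test samples")
```

Under these assumptions, the 8732 UrbanSound8K samples would be divided into 6986 training and 1746 test samples, and the 1288 Sound Event Audio samples into 1030 and 258.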

6.3 Analysis of classifier models

After categorizing the audio samples into their classes with the various classifier models following data preprocessing, we conclude that the ANN model achieves the best accuracy among all the classifiers. However, some classifiers gave better results on one dataset than on the other, owing to the internal working and classification criteria of the different classifier models. For a better and more detailed comparison, we show the confusion matrices of the different models in Fig. 9. Figure 10 comprises the True Negatives (TN), True Positives (TP), False Positives (FP), and False Negatives (FN). Figures 9 and 10 present the results for the UrbanSound8K audio dataset. Although the Logistic Regression model was able to classify most of the audio samples into their correct category, as shown in Fig. 9, GS (Gun Shot) was poorly classified. One reason for this outcome is that GS and DB have similar waveforms after the feature extraction and trimming process. This can be seen in Fig. 3, and because of it the model misclassified GS samples into the DB category. Table 4 lists the abbreviations used in Figs. 9 and 10.
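The per-class TP, FP, FN, and TN values can be derived from a multi-class confusion matrix by treating each class in turn as the positive class and all others as negative. The sketch below assumes that rows index the true class and columns the predicted class; the function name `one_vs_rest_counts` and the 3-class toy matrix are hypothetical and do not reproduce any values from the figures.

```python
def one_vs_rest_counts(confusion, class_index):
    """Return TP, FP, FN, TN for one class of a square confusion matrix.

    confusion[i][j] is the number of samples of true class i that were
    predicted as class j (assumed orientation).
    """
    total = sum(sum(row) for row in confusion)
    tp = confusion[class_index][class_index]
    fn = sum(confusion[class_index]) - tp                 # missed samples of this class
    fp = sum(row[class_index] for row in confusion) - tp  # other classes predicted as this one
    tn = total - tp - fn - fp
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Hypothetical 3-class confusion matrix with 10 samples per true class.
toy_confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
for c in range(3):
    print(c, one_vs_rest_counts(toy_confusion, c))
```

The four counts of each class always add up to the total number of test samples, which is a quick consistency check on such a table.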

Fig. 9 Confusion matrix of different classifier models which are used in the UrbanSound8K dataset (a ANN model, b Logistic Regression model, c SVM (rbf) model, d KNN model, e Naïve Bayes model, f Decision Tree model, g Random Forest model)

Fig. 10 TP, TN, FP, and FN parameters for different classifier models used in UrbanSound8K audio dataset (a ANN model, b Logistic Regression model, c SVM (rbf) model, d KNN model, e Naïve Bayes model, f Decision Tree model, g Random Forest model)

Table 4 Abbreviations used in the UrbanSound8K dataset

Similarly, Fig. 11 shows the confusion matrices of all the classifier models used on the second dataset (Sound Event Audio Dataset), and Fig. 12 shows the FP, FN, TP, and TN parameters of the classifier models. In the next section (Results) we discuss the detailed parameters on which we base the conclusion that the ANN model achieves the best score among all. Table 5 lists the abbreviations used in Figs. 11 and 12.

Fig. 11 Confusion matrix of different classifier models which are used in the Sound Event Audio Classification dataset (a ANN model, b Logistic Regression model, c SVM (rbf) model, d KNN model, e Naïve Bayes model, f Decision Tree model, g Random Forest model)

Fig. 12 TP, TN, FP, and FN parameters for different classifier models used in Sound Event Audio dataset (a ANN model, b Logistic Regression model, c SVM (rbf) model, d KNN model, e Naïve Bayes model, f Decision Tree model, g Random Forest model)

Table 5 Abbreviations used in the Sound Event Audio dataset

7 Results

We have evaluated several parameters, namely accuracy, precision, recall, specificity, F1-score, and Matthews Correlation Coefficient (MCC), to compare the different models. From our experiment, we conclude that the ANN model gives the best results on both datasets. The mathematical expressions used for evaluating these parameters are as follows.

$$Accuracy = \frac{TN + TP}{{TN + TP + FP + FN}}$$
$$Precision = \frac{TP}{{TP + FP}}$$
$$Recall = \frac{TP}{{TP + FN}}$$
$$F1\text{-}Score = \frac{2 \times TP}{{2 \times TP + FP + FN}}$$
$$MCC = \frac{TN \times TP - FP \times FN}{{\sqrt {\left( {TN + FN} \right)\left( {FP + TP} \right)\left( {TN + FP} \right)\left( {FN + TP} \right)} }}$$

In our experiment, precision is the proportion of samples predicted as a given class that actually belong to it, recall is the proportion of actual positive cases that our models predict correctly, specificity is the proportion of negative cases that are predicted correctly, and F1-score is the harmonic mean of recall and precision; in other words, it combines the two results into one value. For a given average of precision and recall, the F1-score is highest when precision is equal to recall.
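The metrics defined above can be computed from the four counts as in the following sketch. Specificity is evaluated as TN / (TN + FP), the standard definition matching the description above, since no separate expression is listed for it. The function name `classification_metrics`, the example counts, and the convention of returning 0.0 when a denominator is zero are assumptions for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, specificity, and F1-score from the counts."""

    def ratio(numerator, denominator):
        # Assumed convention: report 0.0 when the metric is undefined.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical counts for one class of a 30-sample test set.
example = classification_metrics(tp=8, fp=1, fn=2, tn=19)
for name, value in example.items():
    print(f"{name:<12}{value:.4f}")
```

In the example, precision (8/9) and recall (8/10) differ, and the F1-score (16/19) lies between them, slightly below their arithmetic mean.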

Apart from these metrics, there is another parameter, the phi-coefficient \((\varphi )\), also known as MCC. From Eq. (9) we can see that MCC takes all four parameters (TP, FP, TN, and FN) into account, whereas precision, recall, and F1-score each ignore at least one of them (none of the three uses TN), and accuracy, although it uses all four, is sensitive to class imbalance. The MCC value ranges from −1 to 1 depending on the correlation. An MCC of 1 (FP = FN = 0) indicates a perfect positive correlation. An MCC of −1 (TP = TN = 0) indicates a perfect negative correlation; in this case the classifier always misclassifies the classes.

However, an MCC of 0 indicates that the classifier chooses classes at random. For instance, taking the ANN and Logistic Regression models from Table 5, one can see that the MCC is 0.9380 and 0.2177 respectively for different subclasses. From this, we can infer that the predictions of the ANN model are more positively correlated with the true classes than those of the Logistic Regression model.
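The three reference cases of the MCC described above can be checked numerically with a direct implementation of the MCC expression. The function name `mcc`, the example counts, and the convention of returning 0.0 when the denominator vanishes are assumptions for illustration.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient from the four confusion counts."""
    denominator = math.sqrt((tn + fn) * (fp + tp) * (tn + fp) * (fn + tp))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the table is empty
    return (tn * tp - fp * fn) / denominator


print(mcc(tp=5, fp=0, fn=0, tn=5))   # no errors: perfect positive correlation
print(mcc(tp=0, fp=5, fn=5, tn=0))   # every prediction wrong: perfect negative correlation
print(mcc(tp=5, fp=5, fn=5, tn=5))   # predictions unrelated to the true class
print(mcc(tp=8, fp=1, fn=2, tn=19))  # hypothetical intermediate case
```

Because the numerator subtracts the product of the errors from the product of the correct decisions, a classifier scores highly only when it does well on both the positive and the negative class.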

Tables 6 and 7 show every parameter evaluated for each model applied to the UrbanSound8K and Sound Event Audio datasets respectively, for each class of audio sample.

Table 6 Results of various models used in the UrbanSound8K dataset

Table 7 Results of various models used in the Sound Event Audio dataset

8 Discussion

For accurate classification of audio samples using the various models, feature extraction and noise cancellation play an important role. Apart from the features and noise present in the audio samples, audio samples belonging to different classes should be sufficiently distinct from one another. This can be seen in Figs. 13 and 14, which are plotted from the results of the experiments on the two datasets. The graphs show the accuracy for each class in the respective dataset, as classified by each machine learning model. We have also listed the most accurate model for each class in Table 8.

Fig. 13 Accuracy graph for various models used in UrbanSound8K Dataset

Fig. 14 Accuracy graph for various models used in Sound Event Audio Dataset

Table 8 Most accurate classifier in class-wise order for the UrbanSound8K dataset

From Fig. 13, the Naïve Bayes model performed the poorest among the models in classifying the CP (Children Playing) class. This may be due to noise present in the audio samples and to the internal processing the Naïve Bayes model uses to determine the class to which an audio sample belongs. However, this does not imply that Naïve Bayes is a poor approach to classification in general; some studies show that it performs very well in other fields [52, 53]. Similarly, in Fig. 14, the FA (Falling) class has been poorly classified by the Naïve Bayes and KNN models. Looking at one of the waveforms shown in Fig. 7 belonging to the FA class, one can see that segments of that waveform can be matched with other classes too, resulting in the misclassification stated above. From Figs. 13 and 14, the Artificial Neural Network model has achieved the maximum accuracy for all the audio sample classes. This is because the ANN model uses a multi-layer neural network to analyze and classify every feature of the audio sample, and the final prediction depends on the results generated by each neuron.

Table 8 allows the performance of the classifier models to be compared in terms of the accuracies achieved for the respective classes of the UrbanSound8K dataset. On the other hand, for the Sound Event Audio Dataset, it was the Artificial Neural Network model that achieved the highest accuracies among the models trained on that dataset.
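The class-wise comparison described above can be sketched in a few lines. This is only an illustrative sketch, not the code used in the experiments: it assumes that "accuracy for a class" means the fraction of samples of that class that were predicted correctly, and the function names, the toy labels (apart from CP) and the toy predictions are all hypothetical.

```python
def per_class_accuracy(y_true, y_pred, classes):
    """Fraction of the samples of each class that were predicted correctly.

    Assumption: this per-class definition is ours; the paper does not
    state how the class-wise accuracy was computed.
    """
    acc = {}
    for c in classes:
        # Keep only the samples whose true label is c.
        hits = [pred == c for true, pred in zip(y_true, y_pred) if true == c]
        acc[c] = sum(hits) / len(hits) if hits else float("nan")
    return acc


def best_model_per_class(y_true, predictions, classes):
    """Return per-model class-wise scores and the best model for each class."""
    scores = {
        name: per_class_accuracy(y_true, y_pred, classes)
        for name, y_pred in predictions.items()
    }
    best = {c: max(scores, key=lambda name: scores[name][c]) for c in classes}
    return scores, best


# Toy data (hypothetical): three classes, two models, six test samples.
classes = ["CP", "DB", "SI"]
y_true = ["CP", "CP", "DB", "DB", "SI", "SI"]
predictions = {
    "ANN": ["CP", "CP", "DB", "SI", "SI", "SI"],
    "NaiveBayes": ["DB", "CP", "DB", "DB", "SI", "CP"],
}

scores, best = best_model_per_class(y_true, predictions, classes)
for c in classes:
    print(c, best[c], scores[best[c]][c])
```

Ties are resolved in favour of the model listed first, and a class with no test samples yields NaN, so both cases would need explicit handling before building a table such as Table 8 from real results.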

9 Conclusion

In this paper, we have implemented two data preprocessing methods, namely MFCC and STFT, by virtue of which different audio samples can be classified. Our results show that MFCC features are sensitive to noise. Our study also shows that, after all the data preprocessing, the ANN model achieves the best results on both types of dataset (with and without noise). The overall accuracies achieved by the various classifiers on the UrbanSound8K dataset and the Sound Event Audio Dataset are listed in Table 9.

Table 9 Overall accuracies achieved on different datasets

From Table 9, it can be seen that the results of Logistic Regression and Naïve Bayes vary widely across the two datasets. The main reason is the way each classifier works and the relationship it forms between the audio sample points and the predicted sample point. Logistic Regression forms a linear relationship among the features, whereas the Naïve Bayes model assumes total independence between the features. From the analysis of the results presented in Table 9, it can be inferred that prediction accuracy is directly related to how distinct the features of a class are. Because the Artificial Neural Network is an adaptive model, it helps in handling audio samples with heteroskedasticity (samples with different variances), and most of the best results are achieved by ANN models.
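The two modelling assumptions contrasted above can be written out explicitly. The expressions below are the standard textbook forms, not formulas taken from our implementation; the notation is introduced here for illustration, with a feature vector $\mathbf{x} = (x_1, \dots, x_d)$, $K$ classes, and per-class weights $\mathbf{w}_k$ and biases $b_k$.

```latex
% Multinomial logistic regression: each class score is linear in the features
P(y = k \mid \mathbf{x}) =
  \frac{\exp\left(\mathbf{w}_k^{\top}\mathbf{x} + b_k\right)}
       {\sum_{j=1}^{K} \exp\left(\mathbf{w}_j^{\top}\mathbf{x} + b_j\right)}

% Naive Bayes: features are assumed conditionally independent given the class
P(y = k \mid \mathbf{x}) \propto P(y = k) \prod_{i=1}^{d} P(x_i \mid y = k)
```

In the first expression the features enter only through a weighted sum, while in the second the likelihood factorizes over the individual features. Neither assumption need hold for a given dataset, which is consistent with the two models behaving differently on the two datasets.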

10 Future work

Our next aim is to address the problems faced in efficient noise removal and to form hybrid relationships between the features of a sample through further experiments. Although this paper presents seven classifier models for classifying environmental audio samples, we also aim to present our own algorithm that can outperform the traditional models in classifying the samples efficiently. Our goal is to implement an algorithm that works efficiently on audio datasets containing noise (anything apart from the required sound) and provides improved performance. A model with high efficiency and accuracy could be used in industry for developing fully automated systems. Depending on the type of audio classified, such a model would be able to make programmed decisions and take appropriate actions immediately, saving time and energy. With some algorithmic changes (trimming the audio sample, locating the key features, and discarding the wrong features from the audio sample), we expect to use these feature selections with the same efficiency and to develop hybrid models. Based on these trained models, we further plan to build interactive GUIs that can detect different sounds and provide the necessary details about the sound signal. Such a prototype could be effective for guiding tourists, locals, and visitors in different areas for information purposes.