1 Introduction

Many multimedia, digital, and advanced computerized systems, such as audio assistants and automated customer support systems, rely on audio data as an integral medium for storing information, including environmental sounds, noise, Foley, speech sounds, and non-speech utterances; audio can even carry more information than video signals [1]. Classifying environmental sounds stands apart from the classification of speech and music due to the paucity of prior knowledge regarding their temporal and frequency characteristics. Unlike more well-structured domains such as speech and music, environmental sounds encompass a wide and diverse array of audio sources, requiring classification models to exhibit a high level of adaptability and generalization to effectively discern and categorize these often complex and heterogeneous auditory inputs. Moreover, environmental sounds are highly random and follow no set pattern, so many naïve prediction algorithms (algorithms that are not well hyper-tuned for the desired audio frequencies) tend to fail to obtain fruitful results, which makes environmental sound classification a challenging task for researchers. At present, the main focus of researchers in the auditory field is the accurate recognition of speech or music files. However, the analysis of environmental sounds, an immensely mixed group of day-to-day audio that cannot readily be categorized as speech or music, has lagged behind these improvements despite its many applications in IoT technologies [2], hearing aids [3], smart room monitoring [4], video content highlight generation [5,6,7], and audio surveillance systems [8]. Over the years, advances in digital technology and broadcast facilities have enabled users to make use of huge amounts of multimedia and audio files. Since environmental sounds contain various types of noise as well as textured and structural components, namely scattering and iterations, classifying such audio signals accurately is much harder than classifying speech and music, which is one of the challenging parts of our research in acquiring accurate results.

Accurate classification and categorization of audio signals has become an important research area [9, 10]. Such classification and sampling of audio signals fall under the field of pattern recognition. Over the years, various machine learning models have been developed for accurate prediction and classification in fields such as industry and medicine [11]. The main obstacles to obtaining accurate results are feature selection and categorizing the audio signals based on the features extracted from them. To overcome these barriers, researchers generally apply various preprocessing steps to the audio files, including noise removal, feature selection, and feature extraction. Some of the feature selection methods are Linear Predictive Coefficients (LPCs) [12], Linear Predictive Cepstral Coefficients (LPCCs) [13], the Short-Time Fourier Transform (STFT) [14], and Mel Frequency Cepstral Coefficients (MFCCs) [15,16,17]. Regarding the issues mentioned above, the major contributions of our research work presented in this paper comprise (1) choosing datasets of environmental sounds with a suitable number of samples, (2) choosing an appropriate feature selection technique for extracting the features used to classify the audio files, and (3) removing noise and trimming the main, effective part of the audio signals without losing any important features. We use MFCC and STFT features, which are widely used in automatic speech recognition, for the classification of audio files in our experimental models, as discussed in detail in Sect. 4. The experiment also includes the classification of audio files from two different datasets with different numbers of samples and variations in noise level. The rest of the paper is organized as follows: 2. Motivation and Contribution, 3. Related Work, 4. Concepts and Dataset, 5. Classification Models, 6. Experimentation, 7. Results, 8. Discussion, 9. Conclusion, and 10. Future Work. For detailed analysis, we have calculated the parameters precision, recall, specificity, and F1-score.

2 Motivation and contribution

In this section, we have presented the reasons which motivated us to conduct our research in the field of audio classification. We have also highlighted our key contributions that have been addressed throughout the research.

Our motivation centers around efficient file management and the recognition of audio files, which can significantly reduce human labor. Our goal is to present models that can seamlessly integrate with AI systems and take actions based on the classified type of audio file. Using Neural Network techniques in our models, we achieved a correct classification rate of 91 out of 100 instances. Our model holds great potential for applications in the development of smart roads, hospitals, and industries.

The main motivation behind our research was to offer a comprehensive comparative study of various models that can be employed, based on the results obtained, in different workspaces. Our experiments yielded acceptable results, which can be valuable when selecting a model, and our comparative analysis supports the results obtained in these experiments. Unlike previous studies, which often presented only one or two models for classification and yielded less accurate results than our model, we aim to provide a broader selection of models and apply the same experimental approach to each one, offering a clear understanding of each model's performance in the field of audio classification. Additionally, our work utilizes a dual-dataset approach for a more comprehensive, results-based comparison of the models, a feature not present in previous proposals. Both datasets were chosen to maintain consistent sound quality but with varied features, and each dataset undergoes a separate feature extraction process.

3 Related work

In the past few years, many researchers have proposed algorithms and techniques for classifying audio signals, taking various parameters into consideration [18,19,20,21,22]. Many of these studies followed two basic preprocessing steps: analysis of the incoming audio signal, followed by extraction of the key features from the incoming audio file. This feature extraction step reduces the unwanted data to a large extent, and classification is then performed based on the extracted features. Feature selection can broadly be classified into two categories: waveform-based [23, 24] and spectrogram-based [25,26,27]. Waveform-based classification processes the input data as a 1-D array, whereas spectrogram-based classification converts the audio signal into spectrograms using the time-dependent Fourier transform (or STFT). The study presented by Li et al. [28] (2001) discussed audio classification based on LPC and MFCC features, and the results showed that cepstral-based features helped in classifying the audio signals accurately. Guo et al. [29] (2003) proposed a new metric, Distance From Boundary (DFB), for the classification process. This process searches for appropriate boundaries that capture the audio file pattern, and the candidate boundaries are finally ranked by their distances.

The study presented by Cowling et al. [30] (2003) experimented with stationary and non-stationary time–frequency-based feature extraction processes for classifying environmental sounds. Audio surveillance has also been considered for the detection and classification of various acoustic incidents such as human coughing [31] and impulsive sounds [32, 33] including gunshots, glass breaking, explosions, and alarms. Dargie et al. [34] (2009) proposed a study on audio sound classification using MFCC features; however, although the overall performance rate was high, the accuracy for specific sounds remained lacking. Another study, by El-Maleh et al. [35] (1999), illustrated several pattern-based classification models, namely the Quadratic Gaussian Classifier (QGC), the K-Nearest Neighbor (KNN) classifier, and the Least-Squares Linear Classifier (LSLC). The experiment also included a noise removal process and used LPC feature extraction. The QGC classifier achieved the best results, with an error rate of 13.6%. However, unlike our study, that work did not compare results across or implement the models on more than one dataset, and our study also achieves a lower error rate than [35].

Seker et al. [36] (2020) proposed a study on classifying environmental sounds using a CNN-based model and achieved an accuracy of 82.26%, whereas we achieved accuracies of 91.41% and 91.27% on two different datasets. Another study, presented by Zhang et al. [37] (2021), illustrated the classification of environmental sounds using an ACRNN-based model, achieving an accuracy of 86.1%, which is still lower than that of our ANN-based architecture. Another contrast is the number of layers used: [37] used 10 layers in the ACRNN structure, while we have used 2 and 4 layers for the two datasets respectively. With respect to [37], we can state that even simpler ANN architectures can achieve state-of-the-art accuracy without larger architectures, which in turn lowers training time and computational expense. The study by P. Dhanalakshmi et al. [38] (2009) focused on classifying audio signals into six categories using features such as LPC, MFCC, and LPCC, whereas the research presented here classifies the audio signals of two different datasets into 10 and 8 categories respectively, with better accuracies and other parameters. While we obtained the best results with the ANN classifier model, they scored their best accuracies using SVM and the Radial Basis Function Neural Network (RBFNN). Chen et al. [39] (2006) proposed a study on classifying environmental sounds using various classifiers and showed that SVM achieved better results, i.e., 91.41% with a loss reduction of 8–15%; however, the results dropped to 64.76% when the authors classified environmental sounds with three classes, whereas in our study we achieved better results (91.41% and 91.27% accuracy on the two datasets) using the ANN classifier model with the least losses.

Apart from these works, recent studies have also considered deep convolutional neural networks for classifying audio samples. The study presented by Maccagno et al. [40] (2021) incorporated a CNN-based approach for audio classification at construction sites. The proposed model used spectrograms created through the frequency scale and time derivatives, with a 22,050 Hz sampling rate and 60 mel bands. The dataset consists of 5 classes, and a fivefold cross-validation process was used, achieving an accuracy of 97.08%. A similar study was presented by Mehyadin et al. [41] (2021) for bird sound classification. The authors used the model for analyzing bird sounds and enabling species detection. Audio samples that contained noise were treated with a separate noise filter before the MFCC feature extraction process. The experiment included three models, namely Naïve Bayes, J4.8, and the Multilayer Perceptron (MLP), of which J4.8 achieved the highest accuracy of 78.4%.

Another study was presented by Palanisamy et al. [54], where the authors classified audio signals. The authors incorporated a dual-dataset approach including the UrbanSound8K and ESC-50 datasets. In the experiment, an ImageNet-pretrained standard CNN model was used and achieved validation accuracies of 92.89% and 87.42% on the respective datasets. A similar study was done by Zeghidour et al. [55], where the authors proposed the LEAF architecture (Learnable Frontend for Audio Classification). The experimental results showed that the proposed model outperformed the EfficientNetB0 model. The study also states that the proposed architecture can be integrated with other neural networks at a low parameter cost. The proposed model is a fully trainable, lightweight architecture that learns all the operations for extracting audio features, from filtering to pooling. Based on this research [55], a new efficient, hybrid, and lightweight model can be developed which can provide a high and accurate learning rate, helping fire alarms and other audio signal detection methods.

From the above survey of former proposals and experiments on the classification of audio files, it can be seen that most authors included only a few models in the categorization process. In our research, however, we include a total of seven classification models, applied to two different audio datasets with different feature extraction steps.

4 Concepts and dataset

In the following section, we present a detailed overview of the two datasets used with our classification models, followed by the techniques and types of features selected for extracting the key parameters from the audio files for categorization. For the feature extraction process, we have used the Librosa library to extract features from the audio samples. The features include MFCCs and STFT. In the feature extraction process, a total of 186 features have been extracted for every label in the respective dataset. The length of each audio sample after noise removal and feature extraction is taken to be 4 s. In the resulting audio samples, only the unique audio is retained, which helps in classification. By unique audio we mean audio whose frequency, pitch, etc. differ from the background noise and other audio. Once the audio sample is noiseless, it is trimmed to the part where the actual identifiable sound can be distinguished. This is illustrated in Figs. 5, 6, and 7. The section is organized as follows: A. MFCC features, B. STFT features, C. UrbanSound8K Audio Dataset, and D. Sound Event Classification Dataset.
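As a rough illustration of this pipeline, the following sketch (our own illustration, not the exact code used in the experiments; the file path, sample rate, and frame parameters are assumptions) shows how Librosa can load a 4 s clip and return fixed-length MFCC and STFT feature vectors:

```python
# Minimal sketch (assumed, not the authors' exact code) of per-clip feature
# extraction with Librosa: load up to 4 s of audio and return fixed-length
# MFCC and STFT summary vectors.
import librosa
import numpy as np

def extract_features(path, duration=4.0, sr=22050, n_mfcc=40):
    y, sr = librosa.load(path, sr=sr, duration=duration)        # mono, resampled
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)      # (n_mfcc, n_frames)
    stft = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))  # (1 + n_fft/2, n_frames)
    # Average over the time axis so every clip yields vectors of fixed length
    return np.mean(mfcc, axis=1), np.mean(stft, axis=1)
```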

4.1 MFCC features

The key step for accurate classification is the extraction of discrete features from the audio sample, i.e., the components which can help identify the linguistic content of the sample while discarding the parts that carry noise and other unwanted sounds. For such feature extraction, MFCCs are used [42, 43]. MFCCs are among the features most widely used for extracting information from audio samples with little noise. The MFCCs are computed from fast Fourier transform (FFT) coefficients filtered by a bandpass filter bank. The mathematical expression for the Mel-scale computation is shown in Eq. (1).

$$Freq_{mel} = \frac{{x*log\left( {\left( {c + f} \right)/x} \right)}}{log\left( 2 \right)}$$
(1)

In Eq. (1), \({Freq}_{mel}\) is the logarithmic (Mel) scale of the normal frequency (\(f\)) scale, and \(x\) plays an important role in calculating the MFCC features. This coefficient helps convert high-frequency sound into lower frequencies, pointing out the changes in the audio sample more accurately. It should lie in an appropriate range of 250 to 350, i.e., the number of triangular filters that fall in the frequency range of 200–1200 Hz, which carries the dominant audio information. For illustration, a full filter bank can be seen in Fig. 1 [56].

Fig. 1

Basic full filter bank [55]

In the final step, the MFCCs are calculated using Eq. (2) and are denoted as \(F_{MFCC}\).

$$F_{MFCC} = \sqrt{\frac{2}{N}} \mathop \sum \limits_{k = 1}^{N} \left( {logS_{k} } \right)cos\left[ {n\left( {k - 0.5} \right)\frac{\pi }{N}} \right]$$
(2)

where \({S}_{k}\) is the output of the k-th filter bank channel, k varies from 1 to N, and N is the length of the DFT.
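As a hedged illustration of the filter-bank-plus-DCT view behind Eqs. (1) and (2), the computation can be sketched with Librosa and SciPy as follows; the FFT size, filter count, and test tone are illustrative assumptions, not the settings used in our experiments:

```python
# Sketch of MFCC computation from a mel filter bank (cf. Fig. 1) followed by a
# DCT over the log filter-bank energies, as in Eq. (2). Parameters are assumed.
import librosa
import numpy as np
from scipy.fftpack import dct

y = librosa.tone(440, sr=22050, duration=4.0)                   # stand-in audio signal
S = np.abs(librosa.stft(y, n_fft=2048)) ** 2                    # power spectrogram
mel_fb = librosa.filters.mel(sr=22050, n_fft=2048, n_mels=40)   # triangular mel filters
log_S_k = np.log(mel_fb @ S + 1e-10)                            # log filter-bank outputs S_k
mfcc = dct(log_S_k, axis=0, norm='ortho')[:13]                  # keep the first coefficients
```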

4.2 STFT features

Time-dependent signals are decomposed into their constituent frequencies using the Fourier transform. One such transform is the Short-Time Fourier Transform (STFT), which is widely used for extracting features from audio samples. Although MFCCs are also widely used in feature extraction, they are very sensitive to background noise in the sample, making them less effective for extracting features from noisy audio, whereas STFTs can be applied even to noisy audio samples and still provide effective results. The Fourier transform of a function maps the input amplitude signal to its equivalent frequency representation. Figure 2 shows how a signal is converted into its frequency representation using the Fourier transform.

Fig. 2

Equivalent Fourier transformation of the provided digital signal

STFT is a method in which FFT transforms are applied after the signal is windowed (trimmed) by the window function. The mathematical expression for the calculation of the STFT features is shown in Eq. (3).

$$Y\left( {t,f} \right) = STFT\left( {y\left( t \right)} \right) = \int\limits_{{ - \infty }}^{\infty } {y\left( u \right)h^{*} \left( {u - t} \right)e^{{ - 2j\pi fu}} du}$$
(3)

where \(y\left(t\right)\) is the original audio signal and \(h\left(t\right)\) is the STFT window function, centered at t = 0 and having a length of \(L\) \((0<L\le 1500)\). The resulting STFT is a 2-D array, as shown below.

$$Y\left( {t, f} \right) = \begin{bmatrix} y_{1,1} & y_{1,2} & y_{1,3} & \cdots & y_{1,n} \\ y_{2,1} & y_{2,2} & y_{2,3} & \cdots & y_{2,n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ y_{m,1} & y_{m,2} & y_{m,3} & \cdots & y_{m,n} \end{bmatrix}$$
(4)

where the individual STFT values are given by the entries \({y}_{i,j}\).
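A minimal sketch of obtaining the 2-D STFT matrix of Eq. (4) with Librosa follows; the window length and hop size are illustrative assumptions:

```python
# Compute the complex STFT matrix and its magnitude; rows are frequency bins
# and columns are analysis frames (window length and hop size are assumed).
import librosa
import numpy as np

y = librosa.tone(440, sr=22050, duration=4.0)    # stand-in for an audio clip
Y = librosa.stft(y, n_fft=1024, hop_length=256, window='hann')
magnitude = np.abs(Y)                            # |y_{i,j}| values of Eq. (4)
print(magnitude.shape)                           # (n_frequency_bins, n_frames)
```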

4.3 UrbanSound8K audio dataset

In this study, for the environmental sound classification process, we have used the UrbanSound8K audio dataset [44]. The dataset consists of 10 different environmental sound classes, as shown in Table 1. All the audio files are recorded from real-world events, each with a recording time of nearly 4 s. We considered a total of 8732 audio samples from this dataset. Waveforms for the different classes are illustrated in Fig. 3. The samples shown for illustration are included in the classification. All the plots in Fig. 4 are amplitude versus time-frame audio graphs. All the waveforms included in Figs. 3 and 4 have amplitude (dB) on the Y-axis and time (seconds) on the X-axis. These images are presented in order to provide a glimpse of the varied-frequency waveforms used for training the models.

Table 1 Audio sample distribution in the UrbanSound8K dataset
Fig. 3

Waveform visualization of audio samples for each class present in the UrbanSound8K Dataset

Fig. 4

Audio waveforms for all the classes used in the classification process after the removal of noise and trimming process

For each audio file, we calculated the MFCC features (40 features per sample) and, based on these, created a modified dataset containing all the audio samples with the features extracted from each file and their respective class labels. After the dataset formation, we randomly divided the dataset into training and test sets for the classification process, which is discussed in detail in the Classification Models section (Sect. 5).
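A hedged sketch of building this modified feature table follows; the metadata file and column names correspond to the published UrbanSound8K layout but should be verified against the local copy of the dataset:

```python
# Build a table of 40 averaged MFCCs per clip plus its class label.
import librosa
import numpy as np
import pandas as pd

meta = pd.read_csv('UrbanSound8K/metadata/UrbanSound8K.csv')
rows = []
for _, r in meta.iterrows():
    path = f"UrbanSound8K/audio/fold{r['fold']}/{r['slice_file_name']}"
    y, sr = librosa.load(path, duration=4.0)
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
    rows.append(list(mfcc) + [r['class']])

data = pd.DataFrame(rows, columns=[f'mfcc_{i}' for i in range(40)] + ['label'])
```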

4.4 Sound event audio dataset

The other dataset used in our research work is the Sound Event Audio Dataset [44], collected from the University of Moratuwa, Sri Lanka. The dataset contains many environmental sound classes; we selected 7 of them and added an eighth category, "others", which includes surrounding sounds such as doors opening and closing. The audio sample distribution for this dataset is shown in detail in Table 2. All the audio samples were recorded from real-world events using two MOVO USB omnidirectional microphones. All the features of the individual audio samples were extracted from the dataset, and based on these, classification models were built considering a total of 1288 samples.

Table 2 Audio sample distribution in sound event audio dataset

Since this dataset also contained noise in its samples, we applied preprocessing to remove the noise and trim each audio sample without leaving any features behind, in order to obtain accurate results. For instance, a random audio sample from the dataset before noise removal is shown in Fig. 5. Figure 6 shows the waveform of the same sample after noise removal, which results in a much clearer waveform. Finally, the third step trims the denoised audio sample, as shown in Fig. 7. For the remaining audio samples of the different classes, the final trimmed waveforms used for the feature extraction and subsequent classification are illustrated in Fig. 4. The samples shown for illustration are included in the classification. All the plots in Fig. 4 are amplitude versus time-frame audio graphs.

Fig. 5

Opening pill container audio waveform with noise in the sample

Fig. 6

Opening pill container audio waveform after the removal of noise from the sample

Fig. 7

Opening pill container audio waveform after trimming the noise-free sample
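The denoise-then-trim step illustrated in Figs. 5–7 can be sketched as follows; the noisereduce library and the trim threshold are our own illustrative assumptions, and the file name is hypothetical:

```python
# Remove stationary background noise, then trim leading/trailing silence.
import librosa
import noisereduce as nr

y, sr = librosa.load('opening_pill_container.wav', sr=None)  # hypothetical file
y_denoised = nr.reduce_noise(y=y, sr=sr)                     # spectral-gating noise removal
y_trimmed, _ = librosa.effects.trim(y_denoised, top_db=25)   # drop quiet edges (Fig. 7)
```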

For this dataset, we calculated STFT features for every audio sample, because MFCC features do not hold up well on audio samples that contain noise. The STFT features result in a 2-D array containing the frequency–amplitude values for each individual window.

5 Classification models

Machine learning models are widely used either for prediction [46, 47] or for classifying given samples into different classes. In the process of classifying the audio samples, we have used seven different models: six classical machine learning classifiers, namely Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Naïve Bayes, Decision Tree, and Random Forest. Apart from these six classifiers, we have also used an Artificial Neural Network (ANN) architecture-based model for classifying the different classes of audio samples. A comparative study of all the classifiers used is shown in Table 3, followed by a brief instantiation sketch.

Table 3 Comparative analysis of classifier models
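The six classical classifiers can be instantiated with scikit-learn as sketched below; all hyperparameters shown are illustrative defaults rather than the tuned values used in the experiments (only the RBF kernel for the SVM is taken from the paper):

```python
# The six scikit-learn classifiers compared alongside the ANN.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'KNN': KNeighborsClassifier(n_neighbors=5),
    'SVM (rbf)': SVC(kernel='rbf'),
    'Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
}

# Every model is trained and evaluated on the same split:
# clf.fit(X_train, y_train); y_pred = clf.predict(X_test)
```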

6 Experimentation

In this section, we discuss the experimental setup used for categorizing the audio samples. One of the unique aspects of our research work is the comparison of two similar datasets (discussed in Sect. 4) with different noise levels and different feature selection techniques applied. Our experimental aims for the research are as follows.

  • Extracting the unique and important features from the audio sample using two feature selection techniques namely MFCC and STFT features.

  • Removal of unwanted features and background noise from the audio samples to achieve better results.

  • Applying the classifier models discussed in Sect. 5, and comparing the results based on different parameters, namely accuracy, precision, recall, specificity, F1-score, and MCC.

In Sect. 7 we further compare the results based on the experiments conducted. The basic overview of our approach for categorizing an audio sample from the datasets follows some important steps, namely: feature extraction, data preprocessing depending upon the sample (whether it contains noise or not), and finally the application of the various classifier models. All the steps mentioned above can be seen in Fig. 8.

Fig. 8

Flowchart of various steps followed for classification of audio samples

This section is further divided into the following subsections: A. Software and Hardware, B. Data Preprocessing, and C. Analysis of Classifier Models.

6.1 Software and hardware

All the classifier models used in the experiment were trained with Python 3 and the Keras library (using the TensorFlow backend) in an Anaconda environment. The high-level API was used for constructing the neural networks as well as the other classifier models. We used an Intel i5 8th-generation processor with 16 GB RAM.
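A hedged Keras sketch of an ANN of the kind used here follows; the layer widths, dropout, and optimizer are illustrative assumptions (the paper only reports 2- and 4-layer variants for the two datasets):

```python
# Minimal dense ANN built with the Keras high-level API.
from tensorflow.keras import layers, models

def build_ann(n_features, n_classes):
    model = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(256, activation='relu'),
        layers.Dropout(0.3),
        layers.Dense(128, activation='relu'),
        layers.Dense(n_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# e.g. build_ann(40, 10) for the UrbanSound8K features (illustrative sizes)
```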

6.2 Data preprocessing

In our experiment, we have considered two datasets with 10 and 8 different classes respectively. The two datasets are discussed in detail in Sect. 4. To the UrbanSound8K audio dataset we applied the MFCC feature selection process because of its noise-free samples. However, for the other dataset, since the samples contain some background noise and other unwanted features, we applied the STFT feature selection process, because MFCCs are sensitive to noise, as discussed in the literature above. After collecting all the features separately for the two datasets, we split each dataset into training and testing sets. After the split, the training set contains 80% of the audio samples and the testing set contains the remaining, randomly selected, 20% of the samples, used for prediction with the trained models. This distribution is common to both datasets.
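The split can be sketched with scikit-learn as follows, assuming the feature table `data` built in the earlier sketch; the random seed is an illustrative choice:

```python
# 80/20 random split of the extracted features into training and test sets.
from sklearn.model_selection import train_test_split

X = data[[f'mfcc_{i}' for i in range(40)]].values   # feature columns
y = data['label'].values                            # class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
```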

6.3 Analysis of classifier models

After categorizing the different classes of audio samples with the various classifier models following data preprocessing, we conclude that the ANN model achieves the best accuracy among all the classifiers. However, some classifiers gave better results depending upon the dataset, owing to the internal working and classification criteria of the different classifier models. For a better and more detailed comparison, we show the confusion matrices of the different models in Fig. 9. Figure 10 comprises the True Negatives (TN), True Positives (TP), False Positives (FP), and False Negatives (FN). Figures 9 and 10 cover the results on the UrbanSound8K audio dataset. Although the Logistic Regression model was able to classify most of the audio samples into their correct category, as shown in Fig. 9, GS (Gun Shot) was poorly classified. One reason supporting this outcome is that GS and DB have similar waveforms after the feature extraction and trimming process, as can be seen in Fig. 3, and because of this the model misclassified GS samples into the DB category. Table 4 contains the various abbreviations used in Figs. 9 and 10.

Fig. 9

Confusion matrix of different classifier models which are used in the UrbanSound8K dataset (a. ANN model, b. Logistic Regression model, c. SVM (rbf) model, d. KNN model, e. Naïve Bayes model, f. Decision Tree model, g. Random Forest model)

Fig. 10

TP, TN, FP, and FN parameters for different classifier models used in UrbanSound8K audio dataset (a. ANN model, b. Logistic Regression model, c. SVM (rbf) model, d. KNN model, e. Naïve Bayes model, f. Decision Tree model, g. Random Forest model)

Table 4 Abbreviations used in the UrbanSound8K dataset
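The confusion matrices of Fig. 9 and the per-class TP/FP/FN/TN counts of Fig. 10 can be derived as in the following sketch, assuming `y_test` and `y_pred` come from one of the fitted classifiers:

```python
# Confusion matrix and one-vs-rest TP/FP/FN/TN counts per class.
import numpy as np
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
TP = np.diag(cm)                      # correctly predicted samples of each class
FP = cm.sum(axis=0) - TP              # predicted as the class but actually another
FN = cm.sum(axis=1) - TP              # samples of the class predicted as something else
TN = cm.sum() - (TP + FP + FN)        # all remaining samples
```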

Similarly, Fig. 11 shows the confusion matrices of all the classifier models used on the second dataset (the Sound Event Audio Dataset), and Fig. 12 shows the FP, FN, TP, and TN parameters of the classifier models. In the next section (Results) we discuss the detailed parameters on the basis of which we conclude that the ANN model achieves the best score among all. Table 5 contains the various abbreviations used in Figs. 11 and 12.

Fig. 11

Confusion matrix of different classifier models which are used in the Sound Event Audio Classification dataset (a ANN model, b Logistic Regression model, c SVM (rbf) model, d KNN model, e Naïve Bayes model, f Decision Tree model, g Random Forest model)

Fig. 12

TP, TN, FP, and FN parameters for different classifier models used in Sound Event Audio dataset (a ANN model, b Logistic Regression model, c SVM (rbf) model, d KNN model, e Naïve Bayes model, f Decision Tree model, g Random Forest model)

Table 5 Abbreviations used in sound event audio dataset

7 Results

We have evaluated different parameters, namely accuracy, precision, recall, specificity, F1-score, and the Matthews Correlation Coefficient (MCC), for comparing the different models. From our experiment, we conclude that the ANN model gave the best results for both datasets. The mathematical expressions used for evaluating the different parameters are shown as follows.

$$Accuracy = \frac{TN + TP}{{TN + TP + FP + FN}}$$
(5)
$$Precision = \frac{TP}{{TP + FP}}$$
(6)
$$Recall = \frac{TP}{{TP + FN}}$$
(7)
$$F1 - Score = \frac{2 \times TP}{{2 \times TP + FP + FN}}$$
(8)
$$MCC = \frac{TN \times TP - FP \times FN}{{\sqrt {\left( {TN + FN} \right)\left( {FP + TP} \right)\left( {TN + FP} \right)\left( {FN + TP} \right)} }}$$
(9)

In our experiment, precision refers to the proportion of predicted positive cases that are actually positive, recall denotes the proportion of actual positive cases that are predicted correctly by our models, specificity is the proportion of negative cases that are predicted correctly, and the F1-score is the harmonic mean of recall and precision, i.e., it provides a combined view of those two results. The F1-score is maximal when precision equals recall.

Apart from these metrics, we also use the Matthews Correlation Coefficient, also known as the phi coefficient \((\varphi )\). From Eq. (9) we can see that MCC takes all four parameters (TP, FP, TN, and FN) into account, whereas metrics such as accuracy, precision, and recall do not, which makes them sensitive to class imbalance and asymmetric. The MCC value can range from −1 to 1 depending upon the correlation. An MCC of 1 (FP = FN = 0) indicates a perfect positive correlation; an MCC of −1 (TP = TN = 0) indicates a perfect negative correlation, i.e., the classifier always misclassifies the classes.

However, an MCC of 0 indicates that the classifier is effectively choosing classes at random. For instance, taking the ANN model and the Logistic Regression model into account in the results tables, one can see MCC values of 0.9380 and 0.2177 respectively for different subclasses. From this, we can infer that the ANN model has produced more positively correlated predictions than the Logistic Regression model.
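The per-class metrics of Eqs. (5)–(9), including specificity, can be computed one-vs-rest from the TP/FP/FN/TN arrays of the earlier sketch, as shown below (a sketch only; zero denominators are not guarded):

```python
# Vectorized per-class metrics following Eqs. (5)-(9).
import numpy as np

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)
specificity = TN / (TN + FP)
f1_score    = 2 * TP / (2 * TP + FP + FN)
mcc = (TP * TN - FP * FN) / np.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
```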

Tables 6 and 7 show every parameter evaluated for each model applied to the UrbanSound8K and Sound Event Audio datasets respectively, for each class of audio sample.

Table 6 Results of various models used in the UrbanSound8K dataset
Table 7 Results of various models used in sound event audio dataset

8 Discussion

For accurate classification of audio samples with the various models, feature extraction and noise cancellation play an important role. Apart from the features and noise present in an audio sample, the difference between two audio samples belonging to different classes should be sufficiently distinct. This can be seen from Figs. 13 and 14, which are plotted from the results generated by the experiments on the two datasets. The graphs plot the accuracy for each class in the respective dataset as classified by each machine learning model. We have also listed the most accurate model per class in Table 8.

Fig. 13

Accuracy graph for various models used in UrbanSound8K Dataset

Fig. 14

Accuracy graph for various models used in Sound Event Audio Dataset

Table 8 Most accurate classifier in class-wise order for the UrbanSound8K dataset

From Fig. 13, the Naïve Bayes model performed the poorest among the models in accurately classifying the CP (Children Playing) class, achieving poor results in comparison to the other classifier models. This may be due to the noise present in the audio samples and the internal processing of the Naïve Bayes model when determining the class to which an audio sample should belong. However, this does not imply that Naïve Bayes is a poor approach to classification; some studies show that Naïve Bayes performs very well in other fields [52, 53]. Similarly, in Fig. 14, the FA (Falling) class has been poorly classified by the Naïve Bayes and KNN models. Looking at one of the waveforms shown in Fig. 7 belonging to the FA class, one can see that segments of that waveform can also be matched with other classes, resulting in the false classifications stated above. From Figs. 13 and 14, the Artificial Neural Network model achieved the maximum accuracy for all the audio sample classes. This is because of the working of the ANN model: its multi-layer neural network classifies and analyzes every feature of the audio sample, and the final prediction is obtained from the outputs generated by the neurons.

Using Table 8, one can analyse the performance of the classifier models in terms of the accuracies achieved for the respective classes in the UrbanSound8K dataset. On the other hand, for the Sound Event Audio Dataset, it was the Artificial Neural Network model that achieved the highest accuracies among the models trained on that dataset.

9 Conclusion

In this paper, we have implemented two ways of preprocessing data, namely MFCC and STFT features, by virtue of which we can classify different audio samples. Our results show that MFCC features are sensitive to any kind of noise. Our study shows that, after all the data preprocessing, the ANN model achieves the best results in classifying both types of dataset (with and without noise). The overall accuracies achieved by the various classifiers on the UrbanSound8K dataset and the Sound Event Audio dataset are listed in Table 9.

Table 9 Overall accuracies achieved on different datasets

From Table 9, it can be seen that the results of Logistic Regression and Naïve Bayes vary widely across the two datasets. The main reason behind this is the working of each classifier model and the relationship it forms between the audio sample points and the predicted sample point. Logistic Regression forms a linear relationship among the features, while the Naïve Bayes model assumes total independence between the features. From the analysis of the results presented in Table 9, it can be inferred that there is a direct relationship between prediction accuracy and how distinct the features of a class are. Since the Artificial Neural Network is an adaptive model that can handle audio samples with heteroskedasticity (samples with different variances), most of the best results are achieved by the ANN models.

10 Future work

Our next aim is to provide research work that overcomes the problems faced in efficient noise removal techniques and that forms hybrid relationships between the features of the samples through various experiments. Although in this paper we present seven classifier models for classifying environmental audio samples, we also aim to present our own algorithm that can outperform the traditional models for efficient classification of the samples. Our goal is to implement an algorithm that can work efficiently with audio datasets containing some kind of noise (anything apart from the required sound) and provide improved performance. A model with high efficiency and accuracy can be used in industry for fully automated system development: depending upon the type of audio classified, the model will be able to make programmed decisions and immediately take appropriate actions, saving time and energy. With some algorithmic changes (trimming the audio sample, locating the key features, neglecting the wrong features from the audio sample), we will be able to use these feature selections with the same efficiency and develop hybrid models. Based on these trained models, our further work will include building interactive GUIs that can detect different sounds and provide the necessary details about the sound signal. This prototype model will prove very effective for guiding tourists, locals, and visitors in different areas for information purposes.