An Extended Variational Mode Decomposition Algorithm Developed Speech Emotion Recognition Performance

Emotion recognition (ER) from speech signals is a robust approach since speech cannot be imitated like facial expressions or text-based sentiment analysis. The valuable information underlying emotions is significant for human-computer interaction, enabling intelligent machines to interact with sensitivity in the real world. Previous ER studies through speech signal processing have focused exclusively on associations between different signal mode decomposition methods and hidden informative features. However, improper decomposition parameter selection leads to losses of informative signal components due to mode duplication and mixing. In contrast, the current study proposes VGG-optiVMD, an empowered variational mode decomposition algorithm, to distinguish meaningful speech features and automatically select the number of decomposed modes and the optimum balancing parameter for the data fidelity constraint by assessing their effects on the VGG16 flattening output layer. Various feature vectors were employed to train the VGG16 network on different databases and assess VGG-optiVMD reproducibility and reliability. One, two, and three-dimensional feature vectors were constructed by concatenating Mel-frequency cepstral coefficients, chromagrams, Mel spectrograms, Tonnetz diagrams, and spectral centroids. Results confirmed a synergistic relationship between the fine-tuning of the signal sample rate and decomposition parameters and classification accuracy, achieving state-of-the-art 96.09% accuracy in predicting seven emotions on the Berlin EMO-DB database.


Introduction
Word meaning is often conveyed by the tone of voice [30]; human emotions are conveyed not solely through the words used, but also by modifying facial expressions and vocal tone. Thus, changing voice characteristics is how most humans express different emotions [36]. Consequently, considerable human-computer interaction research has considered emotion recognition (ER). Various applications detect serious states by analyzing caller emotion in emergency centers; and speech pathology, e-learning, voiceprints, security, and other smart-centric services commonly employ speech emotion recognition (SER). Other approaches have considered biosensing, electroencephalography (EEG), and facial recognition to detect emotions [2,20,11].
arXiv:2312.10937v1 [cs.SD] 18 Dec 2023
Signal-based ER employs various signals, including electrodermal activity, blood volume pulse, galvanic skin response, electrocardiogram (ECG), EEG, and speech. Because these signals are complex and nonstationary, they are commonly decomposed into several modes, which allows latent factors and patterns to be extracted more easily. Several time series analysis approaches for SER have been proposed over the previous two decades, extracting relevant speech features from nonstationary and instantaneous signals, including traditional short-time Fourier transforms (STFTs), empirical wavelet transforms (EWTs) [13], and variational mode decomposition (VMD) [12]. The nonstationary properties of the signal and its components mean that STFTs are not always suitable, and previous studies have mostly considered these approaches in isolation [8]. Huang et al. [17] proposed empirical mode decomposition (EMD), which decomposes the source signal into an unknown number of signal modes defined by frequency and amplitude modulated components. However, EMD has several limitations, including overlapping intrinsic mode functions (IMFs) and increased computational load when analyzing a large number of modes, particularly for EEG and speech signals [8].
Empirical wavelet transforms employ an adaptive wavelet subdivision scheme, similar to EMD, to address EMD drawbacks by decomposing the signal into a predetermined number of IMFs or modes. Several studies have proposed an envelope weighted transformation to decompose and denoise EEG and speech signals for processing [6,35]. Variational mode decomposition employs non-recursive decomposition to deal with nonlinear and nonstationary signals. In contrast with EWT and EMD, few studies have considered VMD to analyze EEG signals. VMD decomposes signals into modes with a narrow band around a center frequency and can overcome EWT limitations, including shift and filter bank boundary sensitivity, as well as EMD mode mixing effects. Therefore, we were motivated to apply VMD for speech signal processing.
Acoustic feature selection is essential for SER to describe various voice signal aspects captured from different features [5,11,20]. Acoustic features include time-frequency, time, and frequency domain representations. Features extracted from the time-frequency domain carry more informative data than those from the other domains, and better capture latent emotion content from speech signals [33]. Useful time-domain features include amplitude envelope, RMS energy, and zero-crossing rate, which are commonly employed as sequence evaluation ratios. In contrast, relevant frequency domain features include band energy ratio (BER), spectral centroid, and spectral flux. Several previous studies used the VMD method to analyze signals, extracting features from the decomposed signals. In contrast, we propose VGG-optiVMD, which uses a VMD-based feature augmentation method to enrich predictors and maximize emotion classification accuracy. Results from the proposed VGG-optiVMD approach on several common publicly available databases confirm significant ER improvement compared with previous approaches.
The main contributions from this study can be summarized as follows.
Dendukuri et al. [11] decomposed the speech signal into three components, sampling at 16000 Hz over 20 ms frames, then input various mode central frequency statistical parameters to a support vector machine (SVM) classifier. The optimum recognition rate reached 85.81% and 69.13% accuracy for two and four emotion classes, respectively, and accuracy increased by 5% for eight emotion classes compared with previous studies on the RAVDESS database [11].
Lal et al. [24] empirically demonstrated VMD advantages for decomposing speech signals at the correct central frequency and subsequently estimated epoch locations from noise degraded emotional speech signals.
Zhang et al. [41] proposed multidimensional feature extraction for EEG signal emotion recognition, combining wavelet packet decomposition (WPD) with VMD to break down EEG signals and extract wavelet packet entropy, modified multi-scale sample entropy, fractal dimension, and the first difference of each emotional variational mode function as feature components. They subsequently demonstrated robust results using a random forest (RF) classifier on the DEAP dataset [21].
Khare et al. [20] reduced reconstruction error using meta-heuristic techniques, condensing EEG signals from 16 dimensions to 1 using eigenvector centrality method channel selection. Their optimized variational mode decomposition (O-VMD) improved accuracy by 5% compared with traditional VMD on a four-emotion dataset they built themselves, with low computational load and model complexity. Furthermore, the SVM classifier significantly reduced average mean square error.
Generally, EEG signals can be used to analyze an individual's emotion effectively since they are subject dependent. Pandey [29] proposed subject-independent emotion recognition using VMD and deep neural networks (VMD-DNN) on the benchmark DEAP dataset. Two features, first difference and power spectral density, were used since they were sufficient to recognize calm, happy, sad, and angry emotions. SVM and DNN classifier accuracy was improved by employing VMD-based feature extraction compared with EMD, STFT, and differential entropy feature extraction, achieving 61.25% arousal and 62.50% valence prediction accuracies.
Several previous studies considered STFT signal decomposition techniques for SER. For example, Zhao et al. [42] achieved robust 91.89% accuracy on the EMODB database [7]. Few previous studies considered VMD to decompose speech signals, as most employed EEG signals for ER. Dendukuri [11] achieved 69.13% accuracy in recognizing four emotions on the RAVDESS database. However, to the best of our knowledge, the current study is the first to employ VMD to enrich multidimensional feature vectors to enhance VGG-16 network learning.

Proposed Methodology
Speech signal processing involves decoding and encoding information within the speech signal. Glottal airflow from the vocal folds, nasal cavity, and vocal tract generates sounds and words that also convey emotions. Thus, the human voice is a convolution of the vocal tract frequency response with a glottal pulse. The glottal pulse itself does not contain emotion related information, and hence is considered noise in this context. The main aim of decomposition-based speech signal processing is to constrain noise and interference frequencies to enhance signal decoding.

Speech feature extraction
Essential and informative acoustic features in the time-frequency domain include the Mel spectrogram, chromagrams, spectral contrasts, tonnetz, and Mel-frequency cepstral coefficients (MFCCs) [1,15]. These features are extracted and subsequently employed in various combinations to generate multidimensional feature vectors or maps.

Variational mode decomposition
Variational mode decomposition is a popular technique for decomposing nonstationary signals into sub-signals or modes, where each mode contains a specific meaningful property from the original signal in a narrow bandwidth around its center frequency. Modes are obtained from the Hilbert transform output and are also called intrinsic mode functions (IMFs). Furthermore, the mode center frequency can be considered a real component of the original signal for sufficiently narrow bandwidth [24]. The VMD adaptive algorithm reduces the original signal complexity [9,12].
The VMD algorithm applies the Wiener filter, Hilbert transform, analytic signals, and frequency mixing. Wiener filters are narrowband filters for noise reduction. The Hilbert transform is a time-invariant multiplier, convolving the original signal g(t) with the impulse response 1/πt [22]. It therefore supplies the imaginary part that turns the real signal into a complex one, from which magnitude and phase angle time series can be extracted for the frequencies with the most power at each specific time point. The VMD algorithm adds the Hilbert transform H[g(t)], multiplied by j, to the original signal g(t), removing any negative frequencies present (which are redundant due to Hermitian symmetry). The two main VMD objectives are to constrain the bandwidth around each IMF center frequency and to reconstruct the original signal from the sum of all modes. First, the Hilbert transform filters out frequencies on the negative side of the spectrum. Second, the obtained spectrum is shifted to the baseband region via a modulator function to obtain the bandwidth around the central frequency ω. Finally, H1 Gaussian smoothness of the demodulated signal is used to estimate the bandwidth. Thus, constraining the squared L2 norm of the gradient [12] defines the optimization problem (1),

$$\min_{\{g_k\},\{\omega_k\}} \left\{ \sum_{k=1}^{K} \left\| \frac{\partial}{\partial t} \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * g_k(t) \right] e^{-j\omega_k t} \right\|_2^2 \right\} \quad \text{subject to} \quad \sum_{k=1}^{K} g_k(t) = g(t), \tag{1}$$

where the partial derivative ∂/∂t [.] measures the variation to be minimized in the obtained bandwidth; g(t) is the original speech signal frame; g_k(t) is the kth mode of g(t); K is the total number of modes; {ω_k} = {ω_1, . . ., ω_K} is a convenient way to reference the center frequencies for the set of K modes; * denotes convolution; and e^{−jω_k t} is a modulator function that shifts the spectrum of each mode to the baseband.
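The role of the Hilbert transform in this construction can be illustrated numerically. The following is a minimal sketch, assuming NumPy and a toy 50 Hz cosine frame (both are illustrative choices, not part of the original pipeline): it builds the analytic signal g(t) + jH[g(t)] by zeroing the negative-frequency FFT bins, which is equivalent to adding j times the Hilbert transform.

```python
import numpy as np

def analytic_signal(g):
    """Return g(t) + j*H[g(t)] by suppressing negative-frequency FFT bins."""
    n = len(g)
    spectrum = np.fft.fft(g)
    weights = np.zeros(n)
    weights[0] = 1.0                      # keep the DC bin once
    if n % 2 == 0:
        weights[n // 2] = 1.0             # keep the Nyquist bin once
        weights[1:n // 2] = 2.0           # double the positive frequencies
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * weights)

# Toy frame (illustrative values): a 50 Hz cosine sampled at 1 kHz for 1 s.
fs = 1000
t = np.arange(fs) / fs
g = np.cos(2 * np.pi * 50 * t)
z = analytic_signal(g)

envelope = np.abs(z)                      # instantaneous magnitude
phase = np.unwrap(np.angle(z))            # instantaneous phase angle
```

For this toy input the real part of `z` reproduces `g`, and the spectrum of `z` has no energy at negative frequencies, which is the property the VMD bandwidth estimate relies on.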
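The analytic signal is generated by applying the Hilbert transform kernel j/(πt) and the unit impulse function δ(t), as shown in equation (1). Here δ(t) denotes the Dirac delta distribution, known as the unit impulse, whose value is zero everywhere except at the origin, where it is infinite. The original voice signal can be reproduced by solving the constrained optimization (1), which can be simplified using an augmented Lagrangian to transform it into an unconstrained problem (2),

$$\mathcal{L}(\{g_k\},\{\omega_k\},\lambda) = \alpha \sum_{k=1}^{K} \left\| \frac{\partial}{\partial t} \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * g_k(t) \right] e^{-j\omega_k t} \right\|_2^2 + \left\| g(t) - \sum_{k=1}^{K} g_k(t) \right\|_2^2 + \left\langle \lambda(t),\; g(t) - \sum_{k=1}^{K} g_k(t) \right\rangle, \tag{2}$$

where λ is a time dependent Lagrangian multiplier and α is a bandwidth control parameter.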
The unconstrained Lagrangian problem (2) can be solved for the modes and their center frequencies using the alternate direction method of multipliers (ADMM) [16,32,12], with optimization carried out in the spectral domain. Optimization outcomes are the same for the frequency and time domains. Hence mode g_k can be updated in the spectral domain,

$$\hat{G}_k^{n+1}(\omega) = \frac{\hat{G}(\omega) - \sum_{i \neq k} \hat{G}_i(\omega) + \hat{\lambda}(\omega)/2}{1 + 2\alpha(\omega - \omega_k)^2}.$$

This update is a Wiener filter applied to the current residual, using the signal prior 1/(ω − ω_k)^2 to restrain variation around the center frequency. The updated mode center frequency ω_k is then

$$\omega_k^{n+1} = \frac{\int_0^{\infty} \omega \, |\hat{G}_k^{n+1}(\omega)|^2 \, d\omega}{\int_0^{\infty} |\hat{G}_k^{n+1}(\omega)|^2 \, d\omega},$$

where Ĝ_k^{n+1}(ω) is the Fourier transform of g_k^{n+1}(t), and Ĝ(ω) and λ̂(ω) are the Fourier transforms of g(t) and λ(t). A better decomposed signal can be obtained by reconstructing the original signal as the sum of modes and estimating bandwidth using the Wiener filter. Details of the VMD algorithm are provided in [12].
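The ADMM iteration described above can be sketched in a few lines. The following is a simplified illustration, assuming NumPy, and is not the authors' implementation or the reference code of [12]: it works on the one-sided spectrum only and omits the boundary mirroring of the reference algorithm. The names `vmd_sketch`, `tau`, `n_iter`, and `tol` are hypothetical, and frequencies are expressed in cycles per sample.

```python
import numpy as np

def vmd_sketch(g, K, alpha, tau=0.0, n_iter=500, tol=1e-7):
    """Simplified VMD via ADMM on the positive half of the spectrum.

    Returns (modes, center_freqs); center frequencies are in cycles/sample.
    """
    n = len(g)
    freqs = np.fft.rfftfreq(n)                  # 0 .. 0.5 cycles/sample
    G = np.fft.rfft(g)                          # one-sided spectrum of g(t)
    modes = np.zeros((K, len(freqs)), dtype=complex)
    omega = (np.arange(K) + 0.5) * 0.5 / K      # uniform initial center freqs
    lam = np.zeros(len(freqs), dtype=complex)   # Lagrangian multiplier
    for _ in range(n_iter):
        previous = modes.copy()
        for k in range(K):
            # Residual seen by mode k: the signal minus all other modes.
            residual = G - modes.sum(axis=0) + modes[k]
            # Wiener-filter update of mode k around its center frequency.
            modes[k] = (residual + lam / 2) / (1 + 2 * alpha * (freqs - omega[k]) ** 2)
            # Center frequency: power-weighted mean frequency of the mode.
            power = np.abs(modes[k]) ** 2
            omega[k] = np.sum(freqs * power) / (np.sum(power) + 1e-12)
        lam = lam + tau * (G - modes.sum(axis=0))   # dual ascent (off if tau=0)
        change = np.sum(np.abs(modes - previous) ** 2)
        if change / (np.sum(np.abs(previous) ** 2) + 1e-12) < tol:
            break
    order = np.argsort(omega)
    return np.fft.irfft(modes[order], n=n, axis=1), omega[order]

# Toy signal (illustrative): two tones at 0.05 and 0.2 cycles/sample.
n = 1000
t = np.arange(n)
g = np.cos(2 * np.pi * 0.05 * t) + 0.6 * np.cos(2 * np.pi * 0.2 * t)
modes, center_freqs = vmd_sketch(g, K=2, alpha=2000)
```

On this toy signal the two recovered center frequencies settle near the two tone frequencies and the modes sum back to the input. Choosing K larger than the number of components in the signal is what produces the duplicated modes discussed in the next section.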
To leverage VMD effectiveness, we propose the VGG-optiVMD algorithm for automatically selecting optimum α and K by analyzing different decomposition parameter effects on classification accuracy.

Proposed VGG-optiVMD
Reconstruction error for a decomposed signal can be reduced by selecting optimum K and α. Improper decomposition parameter selection creates duplicate modes, causing signal information losses and hence reduced classifier performance. One drawback of VMD is that finding the decomposition parameters K and α that provide optimum performance is challenging. Several approaches have been proposed for ER using ECG, EEG, and vibrational signals. For example, the OVMD algorithm [25] uses a series of indicators, including permutation entropy, kurtosis criteria, extreme frequency domain value, and energy loss coefficients, to identify the optimum K. Wang et al. [38] controlled power spectral and dynamic entropy features to find the optimal K and α to decompose vibration signals and extract fault features.
However, these approaches use IMF or mode characteristics to find the best decomposition parameters for specific low amplitude input signals with empirical threshold selection, which is not applicable to speech signal processing. Dendukuri et al. [11] decomposed speech signals using five modes to recognize eight emotions, achieving 61.2% accuracy on the RAVDESS database. They combined different features into a 45-dimensional feature set, including mode center frequency, statistical values for mode center frequency, MFCCs, and spectral statistical features, to improve classifier performance.
The above methods evaluate the optimum K value using statistical features and indicators for guidance. In particular, the correctness of the identified mode number was not verified or fine-tuned in practice by monitoring classification accuracy.
In contrast, the current study proposes to automate optimum VMD decomposition parameter selection using a feedback loop from the VGG16 flattening output layer. Algorithm 1 shows the proposed optimized VMD algorithm (VGG-optiVMD). The key strengths of VGG-optiVMD are reliability, generality, and reproducibility across different speech databases for real-world applications, e.g., customer satisfaction analysis in call centers.

Feature scaling, data augmentation, and emotion classification
Figure 1 shows the proposed framework to train CNN-VGG16 [34] to extract enriched feature vectors and classify seven emotions (anger, boredom, happiness, neutral, disgust, sadness, and fear) on two databases, EMODB and RAVDESS. As shown in Figure 1, model development proceeds as follows.
1. The voice signal is sampled at 88400 Hz and five well-known acoustic features are extracted in the time-frequency domain: MFCCs, Mel spectrogram, Tonnetz, spectral contrast, and chromagram.
2. The Hann window function is applied with 2.9 s fixed length and 0.4 ms shifting time to sub-signal spectra assembled over a series of frames, and the extracted features are reshaped into a single (128 × 128 × 3) feature vector.
3. The SMOTE [26] oversampling strategy is applied to compensate for minority classes and reduce model bias. Final testing and training features are randomly partitioned into 20% and 80% sets, respectively.
4. The proposed VGG-optiVMD algorithm is applied to decode frequency statistical properties at specific times that distinguish emotions within the feature vector.
5. The VGG network is trained on the augmented feature vector to classify emotions into seven classes.
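The oversampling idea in step 3 can be illustrated with a simplified, SMOTE-like interpolation. This is a sketch assuming NumPy, not the SMOTE implementation used in the study: the function name `smote_like`, its parameters, and the toy minority-class data are hypothetical. Each synthetic sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np

def smote_like(X, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating toward near neighbours.

    X holds the minority-class feature vectors, one per row (needs > k rows).
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)             # exclude self-matches
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X), size=n_new)  # samples to start from
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1)
    return X[base] + gap * (X[pick] - X[base])

# Toy minority class (illustrative): five 2-D feature vectors.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                       [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like(X_minority, n_new=7)
```

Because every synthetic row is a convex combination of two real rows, the new samples stay inside the region already occupied by the minority class. Oversampling only the training partition keeps interpolated samples out of the test set; the text does not state which order the study used.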
This study followed the preprocessing system from [33]. All acoustic features were extracted using the Librosa tool [27] from the Ryerson Audio Visual Database of Emotional Speech and Song (RAVDESS) [26] and the Berlin EMODB [7] databases. Voice data were preprocessed with frame size = 2048, hop length = 256, and sampling rate = 88400 to avoid spectral leakage and enhance frequency resolution. Several experiments were performed on nine different feature vectors to assess the effectiveness of the proposed VGG-optiVMD algorithm. The model was implemented in the Keras framework. Details of the network implementation are available in our GitHub repository (footnote 3).
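The framing parameters quoted above imply a specific frame count and frequency resolution, which the following sketch works out. It assumes NumPy, a 2.9 s clip, no padding at the clip edges, and the symmetric Hann window from `np.hanning`; the Librosa pipeline used in the study may pad the signal and use a periodic window, so its frame count can differ. The helper name `hann_frames` is hypothetical.

```python
import numpy as np

sr, n_fft, hop = 88400, 2048, 256             # values quoted in the text
duration_s = 2.9                              # fixed clip length (assumed)

n_samples = int(round(sr * duration_s))       # samples per clip
n_frames = 1 + (n_samples - n_fft) // hop     # frames when no padding is used
freq_resolution_hz = sr / n_fft               # spacing between FFT bins

def hann_frames(signal, n_fft, hop):
    """Slice a 1-D signal into overlapping frames and apply a Hann window."""
    window = np.hanning(n_fft)
    count = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(count)[:, None]
    return signal[idx] * window

frames = hann_frames(np.ones(n_samples), n_fft, hop)
spectra = np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum per frame
```

Under these assumptions each clip yields 994 frames with roughly 43.2 Hz between frequency bins; a larger frame size narrows the bin spacing at the cost of time resolution.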

Modelling
The aim of modeling was to enhance informative data within the feature vectors and avoid overfitting. Therefore, we applied data augmentation by decomposing the feature vector data, i.e., g(t) in the proposed algorithm, into different modes. Augmentation effects on classification accuracy were assessed using diverse K and α sets. Optimal K and α were assessed iteratively until robust classification accuracy was achieved or the break loop condition was reached. K and α were set to wide ranges of 3-8 and 1000-6000, respectively, based on empirical experiments, since there was no significant improvement in prediction accuracy outside those ranges. VGG16 was selected for training on the augmented feature maps as a trade-off between model runtime and classifier accuracy. The VGG16 architecture used the ADAM optimizer with learning rate = 0.0001; six fully connected hidden layers with ReLU, SELU, and TanH activation functions; epochs = 50; batch size = 4; and the SoftMax function for the output layer.
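The iterative selection described above can be sketched as a feedback-driven search over K and α. This is a minimal illustration, not Algorithm 1 itself: the callback `evaluate`, the function name `select_vmd_parameters`, the step size of 1000 for α, and the form of the break condition (`target_accuracy`) are assumptions. The stand-in scorer `toy_evaluate` is an arbitrary function, not experimental data; its peak is simply placed at the optimum reported in the text so that the sketch runs without training a network.

```python
from itertools import product

def select_vmd_parameters(evaluate, k_values=range(3, 9),
                          alpha_values=range(1000, 7000, 1000),
                          target_accuracy=0.95):
    """Return (best_K, best_alpha, best_accuracy) from a feedback-driven search.

    `evaluate(K, alpha)` is a hypothetical callback that augments the feature
    vectors with VMD, trains the classifier, and returns test accuracy in [0, 1].
    """
    best = (None, None, -1.0)
    for K, alpha in product(k_values, alpha_values):
        accuracy = evaluate(K, alpha)
        if accuracy > best[2]:
            best = (K, alpha, accuracy)
        if accuracy >= target_accuracy:       # break-loop condition (assumed form)
            break
    return best

# Stand-in scorer (arbitrary, for illustration only).
def toy_evaluate(K, alpha):
    return 0.9 - 0.02 * abs(K - 6) - 0.00001 * abs(alpha - 2000)

best_K, best_alpha, best_accuracy = select_vmd_parameters(toy_evaluate)
```

In a real run each call to `evaluate` involves a full decomposition and training cycle, so the size of the K and α grid dominates the runtime, which is consistent with the computational constraint discussed in the results.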

Result and Discussion
To assess the effectiveness of our VMD-based feature augmentation method, several evaluation metrics were employed, including F1 score, test accuracy, and the confusion matrix. Based on the experimental results shown in Table 1, there is a correlation between the number of modes K, the parameter α, and classification accuracy.
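For reference, the relationship between the confusion matrix and the other two metrics can be sketched as follows, assuming NumPy. The three-class matrix is made-up toy data for illustration, not a result from this study, and the function name `accuracy_and_f1` is hypothetical.

```python
import numpy as np

def accuracy_and_f1(confusion):
    """Compute accuracy, per-class F1, and macro F1 from a confusion matrix.

    confusion[i, j] = number of class-i samples predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    true_pos = np.diag(confusion)
    precision = true_pos / np.maximum(confusion.sum(axis=0), 1e-12)
    recall = true_pos / np.maximum(confusion.sum(axis=1), 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = true_pos.sum() / confusion.sum()
    return accuracy, f1, f1.mean()

# Toy three-class confusion matrix (illustrative only).
toy = [[8, 2, 0],
       [1, 9, 0],
       [0, 0, 10]]
accuracy, per_class_f1, macro_f1 = accuracy_and_f1(toy)
```

Per-class F1 exposes confusions between specific classes that overall accuracy averages away.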
The different acoustic features are enriched with various sets of decomposition parameters. Results showed that higher accuracy was obtained for K (4-6) and α (2000-4000) in both datasets, although VGG-optiVMD is set to a limited range of α (1000-10000) and K (2-8) because the computational load becomes heavy when K exceeds 8 at a sample rate of 88400. This limitation can be considered a functional constraint of VGG-optiVMD. Nevertheless, a state-of-the-art result was achieved, with an accuracy of 96.09% for K=6 and α=2000, as demonstrated in Table ??.
Analyzing the results of the baseline model, which is built with the same framework but without VMD-based feature vector augmentation, helps justify the power of VGG-optiVMD in SER. Therefore, we evaluated model performance while varying the sample rate, window size, K, and α, without VMD (baseline model) and with VMD (proposed model). As shown in Figure 3, unlike the baseline model, the proposed model performed better with a larger sampling rate and window size. Moreover, the highest test accuracy and F1 score were obtained via VGG-optiVMD, indicating that our VMD-based feature augmentation method significantly improved classification accuracy.
Figure 4 shows the effect of VGG-optiVMD on the feature vector 3D-Mel Spectrogram+MFCCs+Chromagram. Figure 4(a) represents the feature before applying VMD-based data augmentation, and Figure 4(b) shows that the informative frequencies are distinguished on the feature vector after applying the data augmentation method. In addition, the image shows that the feature vector acquired more distinct energies in the time-frequency domain. This can improve the learning process in VGG16 and result in better prediction accuracy.
The confusion matrix in Table 1 demonstrates the high performance of the classification model, with accuracy above 90% for all classes. Nevertheless, the model performs comparatively poorly when predicting the happiness and anger emotions, due to the similarity of signal attributes such as intensity, frequency, and harmonic structure.
Table 2 compares the VGG-optiVMD method with the most recent works; our method outperforms previous models and achieves a state-of-the-art result in terms of accuracy. To the best of our knowledge, this is the first work to employ VMD as a feature augmentation method in SER. Moreover, the main advantage of VGG-optiVMD is its generality: it can be employed independently for other acoustic features and different databases.

Reference | Features | Classifier | Accuracy (%)
[40] | 13 MFCCs | Tree Model | 70
Popova et al. [31] | Mel spectrograms | VGG16 | 71
Hajarol. et al. [14] | Mel spectrograms+MFCCs | CNN | 72.21
Wang et al. [37] | Fourier Parameter+MFCCs | SVM | 73.3
Kown et al. [23] | Spectrogram | Deep SCNN | 79.50
Badsha et al. [4] | Spectrogram | CNN | 80.79
Huang et al. [18] | Spectrogram | CNN | 85.2
Issa et al. [19] | MFCCs+Chroma.+Melspec.+Contrast+Tonnetz | VGG16 | 86.10
Meng et al. [28] | log Mel spec.+1st & 2nd delta (log Mel spec.) | CNN-LSTM | 90.78
Wu et al. [39] | Modulation Spectral Features (MSFs) | SVM | 91.60
Rudd et al. [33] | Harmonic-Percussive (HP)+log Mel spec. | VGG16-MLP | 92.79
Demircan et al. [10] | LPC+MFCCs | SVM | 92.86
Zhao et al. [42] | log Mel spectrogram | CNN-LSTM | 95.89
VGG-optiVMD | 3D-Mel spectrogram+MFCCs+Chromagram | VGG16-VMD | 96.09

Conclusion
Speech signal processing is employed in applications where only the speech signal is available for detecting emotions, which is the first aim of this study. The second aim is to introduce specific data augmentation techniques that enrich the extracted acoustic features through the design of VGG-optiVMD, an extended VMD algorithm to improve SER performance.
The findings provide solid empirical confirmation of the key role of the sampling rate, the number of decomposed modes K, and the balancing parameter of the data-fidelity constraint α in the performance of the emotion classifier. Taken together, these findings suggest that VMD decomposition parameters K (2-6) and α (2000-6000) are optimum values on both the RAVDESS and EMODB databases. The proposed VGG-optiVMD algorithm improved emotion classification to a state-of-the-art result, with a test accuracy of 96.09% on the Berlin EMO-DB and 86.21% on the RAVDESS datasets. Further work is needed to establish whether extracting acoustic features only from informative decomposed modes can reduce computational load constraints. Therefore, the study should be repeated applying the VMD algorithm before the acoustic feature extraction process.

Fig. 1: Proposed model development workflow: extracted features are enriched using the VGG-optiVMD to automatically identify K and α.

Fig. 2: Emotion classification accuracy (%) for different sets of decomposition parameters α and K, selected automatically by the VGG-optiVMD algorithm.

Fig. 3: Model performance assessed for different signal sampling rates and VMD parameters K and α. (a) VGG-optiVMD identified K = 6 and α = 2000 as the optimum values. (b) Effect of various sample rates and window sizes on the proposed and baseline models on EMODB. The highest accuracy is achieved with SR = 88200 and WS = 2048.

Fig. 4: The efficient functionality of VGG-optiVMD on the feature vector 3D-Mel Spectrogram+MFCCs+Chromagram clearly shows a higher distinction in the energy magnitudes of frequencies in (b).
To the best of our knowledge, this study is the first to employ VMD as a dynamic data augmentation method for speech emotion recognition.

Table 1: Visualization of the model performance with confusion matrix (%) for the 3D-Mel Spectrogram+MFCCs+Chromagram with test accuracy = 96.09% on the Berlin EMO-DB dataset.

Table 2: Comparison of the proposed method with previous works on the same databases.