A DCRNN-based ensemble classifier for speech emotion recognition in Odia language

Odia is an old Eastern Indo-Aryan language spoken by 46.8 million people across India. We design an ensemble classifier based on a Deep Convolutional Recurrent Neural Network (DCRNN) for Speech Emotion Recognition (SER), motivated by recent research on SER. First, we extract utterance-level log Mel-spectrograms together with their first and second derivatives (static, delta, and delta-delta), represented as 3-D log Mel-spectrograms. We use a deep convolutional neural network to extract deep features from the 3-D log Mel-spectrograms. A bi-directional gated recurrent unit network then models the long-term temporal dependencies among these features to produce utterance-level emotion features. Finally, we combine Softmax and Support Vector Machine classifiers into an ensemble to improve the final recognition rate. The proposed framework is trained and tested on the Odia (seven emotional states) and RAVDESS (eight emotional states) datasets. The experimental results show that the ensemble classifier performs better than a single classifier. The accuracies reached are 85.31% and 77.54% on the Odia and RAVDESS datasets, respectively, outperforming some state-of-the-art frameworks.


Introduction
In recent years, with the rapid growth of artificial intelligence, biometric technologies such as voiceprint, fingerprint, face recognition, and speech emotion recognition have attracted increasing attention from researchers [1][2][3]. With advances in computer processing capability and the growing demand for pattern recognition and speech emotion recognition, both have been widely used in human-robot interaction [4,5,6,8]. Speech is the most natural form of communication in day-to-day conversation and work, and the speech signal carries the speaker's emotional information. It consists of linguistic and paralinguistic information: linguistic information covers language and context, whereas paralinguistic information relates to the emotional state of the speaker [7]. Building an SER system is a challenging task. First, speech datasets are unavailable for many languages, and creating a proper speech emotion database is time-consuming. Second, existing datasets have been built in different regions of the world, with diverse cultures, languages, and speakers with different speaking styles [8]. Consequently, these variations make it difficult to detect the emotional state from the speech signal. In addition, speech emotion recognition systems are independent of hardware equipment. The automobile industry can also benefit from SER in many real-time emotion detection tasks. Various techniques for pre-processing, feature extraction, and classification have been applied to several SER datasets [9], and several speech emotion classifiers and different types of features have been combined in the literature.
A speech emotion recognition system is mainly divided into three stages: speech pre-processing, feature extraction, and classification [12]. A robust classification model that identifies discriminative emotional feature information is an essential factor in an emotion recognition system [10]. Feature extraction is the initial step, and many hand-crafted features have been used for SER [11,14]. In recent years, spectral features have been used more often than hand-crafted features because they carry more high-level emotional information. Owing to this advantage, spectral features give better results than other types of features [12,17]. Low-level features, however, are unable to capture the actual emotional state of an utterance. A significant drawback of SER methods lies in the feature extraction process, during which important information may be lost. So how can we extract as much emotional information as possible from each utterance and train a suitable model? This is the first problem we need to solve. To mitigate this issue, we modify the deep learning method and employ a deep learning model with an ensemble classifier (using Softmax and SVM classifiers), which provides a possible solution to the feature extraction problem in SER. The deep neural network (DNN) is the most common and popular deep learning method; it can extract discriminative features and has shown excellent performance in classification tasks. It has been demonstrated that DNNs achieve better performance than traditional machine learning methods.
The Gaussian mixture model (GMM), however, performs poorly when training speech data is limited. The support vector machine (SVM), on the other hand, performs better than other classifiers on recognition tasks with limited training data. However, the SVM model cannot learn from spectral features directly because they are extracted from speech samples of variable length [13]. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are the two standard deep learning models [14,15]. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) are two basic RNN variants that handle time-series data well. Of the two, the GRU is faster to optimize while reaching error rates close to those of the LSTM. CNNs are well suited to image data and capture local patterns in the data. CNNs with one or two layers perform worse than DCNN models [16].
Zhang et al. [8] also found that 2-D convolution performs better than 1-D convolution and is the first choice when data is limited. CNNs also have limitations when they learn high-level features from Low-Level Descriptors (LLDs): LLD features are insufficient for emotion classification in complex scenarios [8], so such models may not learn features discriminative enough to determine the emotional state. Researchers therefore began to use image-like two-dimensional spectrograms, as shown in Fig. 1, to extract emotional information relevant to SER. The horizontal axis represents the time domain, and the vertical axis represents the frequency domain, which holds important emotion-related information and makes the spectrogram a suitable input for an SER system. Owing to these advantages, we adopt a deep convolutional neural network that can automatically extract the most emotionally relevant information from the spectrogram of an audio sample. Zeng et al. [17] employed gated residual networks (GResNets) to extract emotional features from spectrograms generated on the RAVDESS [23] dataset and achieved an accuracy of 65.97%. Badshah et al. [18] also used a DCNN model to extract speech emotion features from spectrograms. Abdel-Hamid et al. [14] implemented a CNN-based deep learning model with log Mel-spectrograms as input.
The success of DCNNs motivates us to apply them to speech emotion recognition. In this paper, we report a new approach that uses a DCNN and a Bi-directional Gated Recurrent Unit with an ensemble classifier (a combination of Softmax and SVM classifiers), as displayed in Fig. 2. First, we extract log Mel-spectrograms and their first and second derivatives with respect to time (static, delta, and delta-delta). Then, we pass the Mel-spectrograms through a pre-trained DCNN model to extract deep features; in this experiment, we use AlexNet [19]. After that, the deep features are fed sequentially into the Bi-GRU model, which captures the time-frequency relationships of the features and extracts high-level utterance-level features. Finally, we adopt the ensemble classifier for emotion classification. The experiments were carried out on the Odia database with seven emotional classes and on RAVDESS [23] with eight emotional classes. Our experiments show that the proposed approach outperforms some previously published results. The main contributions of our work are as follows: (1) We design a speech representation for the DCRNN network that uses image-like 3-D log Mel-spectrograms to capture the details of temporal-frequency correlations and yields a more powerful feature learning model. (2) Closely matched top prediction probabilities in the final prediction vector can prevent a classifier from identifying the actual emotional state. We therefore employ an ensemble classifier that selects the maximum probability between the final prediction vectors of the two classifiers, which leads to better performance than a single classifier for speech emotion recognition. (3) Using discriminative utterance-level features, the DCRNN with the ensemble classifier resolves two or more conflicting prediction vectors into a final prediction vector.
The remainder of this paper is organized as follows. Related work is reviewed in Section 2. The DCRNN network and the classifiers are described in Section 3. Section 4 presents our experimental results, followed by our conclusions in Section 5.

Related work
The framework of a speech emotion recognition system contains two basic components: the extraction of speech features and the selection of a classifier that identifies the emotional state of an utterance. We discuss emotion classification strategies first, followed by the feature extraction process.

Emotional classifier
The classifier plays an essential role in an SER system. Researchers have proposed various algorithms to build efficient classifiers that distinguish emotional classes. Some popular emotion classifiers are the K-Nearest Neighbor (KNN) algorithm [20], Hidden Markov Models (HMMs) [21], Gaussian Mixture Models (GMMs) [22], SVM [23], and the Softmax function [24]. Among these classifiers, KNN is better when the amount of training data is much greater than the number of features ($p \gg q$), but with less training data SVM outperforms KNN and GMM. Currently, most researchers use Softmax and SVM classifiers rather than KNN and GMM classifiers, which makes them more popular and reliable. Softmax and SVM are the most widely used classifiers in speech-related tasks, and the performance difference between them is usually very small. Sometimes, however, the selected features are not robust enough to build a classifier for speech emotion recognition, so the classifiers are trained and tested on the same data. To integrate the merits of both classifiers, we build an ensemble of the SVM and Softmax classifiers and evaluate its accuracy for speech emotion recognition.

Feature extraction
Feature extraction is a primary task in building an SER system. It can reduce noise in the raw audio data and generate features that are highly effective for learning emotions in the SER model. Various features have been used for SER systems, such as acoustic features, context information, hybrid features, and linguistic features. Among these, acoustic features, which contain local and global features, are the most widely used in the emotion recognition domain [16]. Acoustic features are separated into four groups: spectral features, prosodic features, voice quality features [14,26], and energy-related features. Prosodic features such as pitch, formant frequency, duration, and loudness are commonly used [27]. Voice quality features such as shimmer (amplitude irregularity), jitter (pitch irregularity), fundamental frequency, duration, harmonics-to-noise ratio (HNR), and power are also used [11]. In addition, combining prosodic features with voice quality information identifies emotion better than prosodic features alone. Context information has also been studied for emotion recognition: in [28], the authors present an SER system based on cultural information and show that the cross-cultural SER paradigm performs better than the multi-cultural and intra-cultural paradigms. Since the features mentioned above are low-level, they may not contain enough emotional information to identify the subjective emotional state. Deep learning strategies that automatically learn high-level features effective for speech emotion recognition may address this problem.

Proposed methodology
In this section, we present our proposed DCRNN model with different classifiers. The structure of the proposed model is shown in Fig. 2.
First, we extract three-channel (3-D) log Mel-spectrograms, similar to RGB color images, from the raw audio samples. These 3-D log Mel-spectrograms are fed to the deep convolutional neural network; we use the pre-trained AlexNet [19] DCNN model to learn deep emotion features from the image-like log Mel-spectrograms. Next, we feed the learned features into the bi-directional gated recurrent unit (Bi-GRU) network to obtain two-dimensional high-level features that highlight emotion. Finally, an ensemble of two classifiers is employed to categorize the utterance-level features for SER. The inputs and outputs of each part of the model are summarized in the subsections below.

Creation of DCRNN input
Spectral features capture emotional details well through time-frequency correlations, and high-level information can be extracted from the spectra as 2-D images. Researchers have therefore paid more attention to spectral features for speech emotion recognition [17,29]. Speech signals generally differ in duration, but most deep learning models require a fixed-size input. Incomplete feature maps may fail to capture the correct emotional state of an utterance. To overcome this drawback and minimize the loss of emotional information, we extract 3-D log Mel-spectrograms (static, delta, delta-delta) from the 1-D raw speech signal as inputs to our proposed DCRNN model.
The three-channel utterance-level log Mel-spectrograms are created as follows. (1) We use the Librosa audio library [30] to generate log Mel-spectrograms from the speech signals at a 16 kHz sample rate. (2) We then apply first- and second-order derivatives to the 2-D static log Mel-spectrogram along the time axis to obtain the delta and delta-delta log Mel-spectrograms. Together, the three form a 3-D log Mel-spectrogram. The 3-D log Mel-spectrogram of each emotion in the Odia database is shown in Fig. 1. Finally, we resize the log Mel-spectrograms to 227 × 227 × 3, the input size required by the pre-trained AlexNet DCNN.
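The channel-stacking step can be illustrated with a minimal pure-Python sketch. It assumes the static log Mel-spectrogram is already available as a list of rows (Mel bands by time frames) and approximates the time derivatives with simple differences; the function names `time_delta` and `stack_three_channels` and the toy values are illustrative only, and in practice a library routine such as Librosa's `librosa.feature.delta` would typically take the place of `time_delta`.

```python
def time_delta(spec):
    """Approximate the derivative of each Mel band along the time axis.

    Uses a central difference in the interior and one-sided differences
    at the two edges (an assumption made for this sketch).
    """
    out = []
    for row in spec:
        n = len(row)
        d = []
        for t in range(n):
            lo = max(t - 1, 0)
            hi = min(t + 1, n - 1)
            d.append((row[hi] - row[lo]) / (hi - lo) if hi > lo else 0.0)
        out.append(d)
    return out


def stack_three_channels(static):
    """Return a [mel][time][channel] array of (static, delta, delta-delta)."""
    delta = time_delta(static)
    delta2 = time_delta(delta)
    return [
        [[s, d1, d2] for s, d1, d2 in zip(s_row, d_row, dd_row)]
        for s_row, d_row, dd_row in zip(static, delta, delta2)
    ]


# Toy "spectrogram": 2 Mel bands x 4 frames (values are made up).
toy = [[0.0, 1.0, 4.0, 9.0],
       [2.0, 2.0, 2.0, 2.0]]
features = stack_three_channels(toy)
```

Each time-frequency cell of `features` then holds three values, in the same way that a pixel of an RGB image holds three color values, which is what lets an image network consume the result after resizing.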

Learning extracted feature using deep CNNs
After generating the log Mel-spectrograms, we use the pre-trained AlexNet [19] DCNN for feature extraction. AlexNet is a powerful model capable of achieving competitive performance on challenging small datasets. In contrast, pre-trained networks such as VGG, ResNet, and EfficientNet are much deeper and have more parameters, and thus require a very large number of inputs to achieve high performance [31]. We were inspired by the use of pre-trained AlexNet in [8,16]. The original AlexNet parameters are kept unchanged, and the layers are used to generate features. The AlexNet deep neural network contains several convolutional layers (CL), dropout, max-pooling layers, fully connected layers, and the ReLU (rectified linear unit) activation function. The layers of the DCNN are described in detail below.
The AlexNet model accepts a fixed input size of 227 × 227 × 3. Each convolutional layer is defined by its number of filters, kernel size, non-linear activation function, and padding. A convolutional layer extracts local patterns from its input and generates feature maps. The AlexNet model has five convolutional layers (CL1, CL2, CL3, CL4, and CL5). CL1, CL2, and CL5 are each followed by a max-pooling layer, as shown in Fig. 2. The CL1 layer has 96 kernels of size 11 × 11 with a stride of 4. The CL2 layer has 256 kernels of size 5 × 5 with a stride of 1. The CL3 layer has 384 kernels of size 3 × 3 connected to the outputs of the CL2 layer, and the CL4 layer has 384 kernels of size 3 × 3. The ReLU activation function is used in each convolutional layer, which speeds up training.
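As a quick check of how these layer settings shrink the feature maps, the following sketch applies the usual output-size formula, floor((n + 2p - k) / s) + 1, layer by layer. It assumes a square 227 × 227 input; the paddings, the 3 × 3 stride-2 pooling windows, and the CL5 settings (256 kernels of size 3 × 3) are not given in the text and are taken from the standard AlexNet definition.

```python
def out_size(n, kernel, stride, pad=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * pad - kernel) // stride + 1


# (name, kernel, stride, pad). Kernel sizes and strides of CL1-CL4 follow the
# text; paddings, CL5, and the 3x3/stride-2 pooling are assumed from the
# standard AlexNet definition.
layers = [
    ("CL1", 11, 4, 0), ("pool1", 3, 2, 0),
    ("CL2", 5, 1, 2), ("pool2", 3, 2, 0),
    ("CL3", 3, 1, 1), ("CL4", 3, 1, 1), ("CL5", 3, 1, 1),
    ("pool5", 3, 2, 0),
]

size = 227
sizes = {}
for name, kernel, stride, pad in layers:
    size = out_size(size, kernel, stride, pad)
    sizes[name] = size

flattened = sizes["pool5"] ** 2 * 256  # 256 CL5 feature maps (assumed)
```

Under these assumptions the spatial size goes 55, 27, 27, 13, 13, 13, 13, 6, so the last pooling layer hands a 6 × 6 × 256 = 9216-value tensor to the first fully connected layer.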
The max-pooling layers reduce the size of the feature maps by keeping the maximum filter activation, yielding more high-level features. The output of the last max-pooling layer is fed to the fully connected layers. In the AlexNet [19] model, the fully connected layers are FCL6, FCL7, and FCL8. FCL6 and FCL7 each produce a 4096-dimensional feature vector, whereas FCL8 produces a 1000-dimensional vector corresponding to the 1000 categories of the ImageNet data. We do not use FCL8; FCL7 produces the 4096-D feature vector that is passed to the Bi-GRU layer.

Bi-directional recurrent units
The RNN [39] was proposed to handle time-series data. In most DNNs, such as convolutional neural networks and multi-layer perceptrons, adjacent layers are connected by weights while the nodes within a layer are not connected to each other, so the nodes are independent of one another. In real life, however, an utterance is a time series of variable length [32,33]. The previous utterance of a speaker is strongly related to the present utterance, which requires a model that can review past information and process time-series data of different lengths. To solve these issues, Hochreiter and Schmidhuber introduced Long Short-Term Memory (LSTM) [34]. However, LSTMs take longer to train and require more memory than GRUs, and GRUs perform better than LSTMs when training data is limited.
The Transformer architecture is another common choice. However, it does not inherently capture input order information. In [35], the authors point out that the Transformer only starts to outperform CNNs on classification tasks when more training data is available. Audio datasets, however, typically do not contain large amounts of data, which motivates our use of the Bi-GRU. The GRU cell is a particular type of RNN and a modified version of the LSTM, as shown in Fig. 3. In the GRU cell, the cell state and the hidden state are merged, and the input gate and the forget gate are combined into an update gate. The output of the last fully connected layer (4096-D) of the DCNN is fed to the Bi-GRU network as an input sequence $x_1, x_2, x_3, \ldots, x_T$, and we obtain the output sequence $y_1, y_2, y_3, \ldots, y_T$ by evaluating the activation functions of the network for each input from time $t = 1$ to $t = T$. From Fig. 3, in the standard GRU formulation, we get

$$z_t = \sigma(W_z x_t + U_z h_{t-1}) \tag{1}$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{2}$$
$$\tilde{h}_t = \phi\big(W x_t + U (r_t \odot h_{t-1})\big) \tag{3}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{4}$$

Here, $z_t$ is the update gate, $r_t$ is the reset gate, $\tilde{h}_t$ is the candidate value of the current hidden state, $h_t$ is the activation value of the current hidden state, and $h_{t-1}$ is the activation value of the previous hidden state.
Furthermore, $\sigma$ denotes the sigmoid function, $x_t$ is the input of the GRU cell, $\odot$ denotes element-wise multiplication, $\phi$ is the tanh function, and $W$ and $U$ are the weight matrices learned during training. The main role of $r_t$ is to control how much of the previous hidden state $h_{t-1}$ is combined with the input $x_t$ when computing $\tilde{h}_t$. In essence, the GRU can learn long-distance dependencies and at the same time overcomes the vanishing-gradient problem.
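The gate interactions described above can be made concrete with a small pure-Python sketch of a single GRU update. It follows the common textbook formulation (update gate, reset gate, tanh candidate state), which is an assumption about the exact variant used here; the parameter names `Wz`, `Uz`, `Wr`, `Ur`, `Wh`, `Uh` and the helper functions are hypothetical, and bias terms are omitted for brevity.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gru_step(x, h_prev, p):
    """One GRU update; p maps names such as 'Wz', 'Uz' to weight matrices."""
    # Update gate and reset gate, both squashed to (0, 1).
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    # The reset gate scales the previous state before it enters the candidate.
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    h_cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated))]
    # The update gate interpolates between the old state and the candidate.
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]


# With all-zero weights both gates equal 0.5 and the candidate state is 0,
# so the new state is half of the previous one.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {k: zeros for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
h_next = gru_step([1.0, -1.0], [0.4, -0.2], params)
```

The zero-weight case is only a sanity check: it shows that the update gate blends the previous state with the candidate rather than overwriting it, which is the mechanism that lets information persist over many time steps.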
In this study, we treat emotion detection from speech spectrograms as a sequence labelling task. A unidirectional GRU cell cannot handle a large number of speech samples of different durations well. Therefore, a Bi-GRU (Bi-directional Gated Recurrent Unit) [39] is used to learn both the present and the past information of variable-length sound samples, processing each sequence in the forward and backward directions.
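A bi-directional layer can be sketched as two independent recurrent passes whose per-step states are joined. The example below is a minimal illustration that assumes the two directions are concatenated at each time step (a common default, not confirmed by the text) and uses a toy leaky-average cell in place of a real GRU; `run_direction`, `bidirectional`, and `toy_step` are hypothetical names.

```python
def run_direction(seq, step, h0):
    """Scan a sequence with a recurrent step and return every hidden state."""
    states, h = [], h0
    for x in seq:
        h = step(x, h)
        states.append(h)
    return states


def bidirectional(seq, step_fwd, step_bwd, h0):
    """Concatenate forward and backward hidden states at each time step."""
    fwd = run_direction(seq, step_fwd, h0)
    # Run the backward cell on the reversed sequence, then restore the order.
    bwd = run_direction(seq[::-1], step_bwd, h0)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


# Toy recurrent cell standing in for a GRU: a leaky running average.
def toy_step(x, h):
    return [0.5 * hi + 0.5 * xi for hi, xi in zip(h, x)]


frames = [[1.0], [0.0], [0.0]]            # three one-dimensional "frames"
outputs = bidirectional(frames, toy_step, toy_step, [0.0])
```

In the toy run the forward half of each output still remembers the first frame at later steps, while the backward half at the first step has already seen the whole sequence, which is why each output carries context from both sides. The scan also works for sequences of any length, matching the variable-duration utterances discussed above.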

Softmax Classifier
Selecting an efficient classifier is vital for the final classification of emotion. Most deep learning models use the Softmax [23] classifier. For a problem with $n$ possible classes, the Softmax layer has $n$ nodes denoted by $c_j$, where $j = 1, \ldots, n$.
Here $c_j$ defines a discrete probability distribution, so

$$\sum_{j=1}^{n} c_j = 1. \tag{5}$$

The output of the Softmax function is

$$c_j = \frac{\exp(d_j)}{\sum_{k=1}^{n} \exp(d_k)}, \tag{6}$$

where $d_j$ is the total input into the Softmax layer, defined as

$$d_j = \sum_{i} h_i W_{ij}. \tag{7}$$

Here, $h_i$ is the activation in the second-last layer and $W_{ij}$ is the weight connecting the second-last layer to the Softmax layer. The final predicted class $\hat{j}$ is then

$$\hat{j} = \arg\max_{j} c_j. \tag{8}$$
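A minimal sketch of the Softmax computation and the final class decision, written in pure Python. The scores are made-up values standing in for the inputs to the Softmax layer, and subtracting the maximum score before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math


def softmax(d):
    """Convert scores d_j into probabilities c_j that sum to one."""
    m = max(d)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in d]
    total = sum(exps)
    return [e / total for e in exps]


def predict(c):
    """Index of the most probable class."""
    return max(range(len(c)), key=lambda j: c[j])


scores = [2.0, 0.5, -1.0]            # made-up inputs to the Softmax layer
probs = softmax(scores)
label = predict(probs)
```

Because the exponential is monotonic, the predicted class is always the one with the largest input score; the normalization matters only when the probabilities themselves are used, as in the ensemble described later.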

Multi-class SVM classifier
The simplest way to extend SVMs to multiclass classification problems is the one-vs-all method [23,36]. For an $n$-class classification problem, $n$ linear SVMs are trained independently, where for each SVM the data from the other classes form the negative cases. The output of the $k$-th SVM ($k = 1, \ldots, n$) is

$$a_k(x) = w_k^{\top} x, \tag{9}$$

and the final predicted category is calculated using Eq. (10):

$$\hat{k} = \arg\max_{k} a_k(x). \tag{10}$$

The SVM prediction rule has the same form as that of the Softmax classifier in Eq. (8). The only difference between the multiclass SVM and Softmax lies in how their parameters, the weight matrices $W$, are learned: the Softmax classifier layer minimizes the cross-entropy loss, while the SVM classifier tries to find the maximum-margin boundary between the data points of the different classes.
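The one-vs-all scheme can be sketched in a few lines of pure Python. The weights, the optional bias terms, and the input below are made-up illustrative values, and the loss shown is the plain per-class hinge loss with targets of +1 for the true class and -1 for all others; whether a squared hinge or another variant was used in the experiments is not stated in the text.

```python
def svm_scores(x, weights, biases):
    """Score w_k . x + b_k for each of the one-vs-all linear SVMs."""
    return [sum(w * v for w, v in zip(w_k, x)) + b_k
            for w_k, b_k in zip(weights, biases)]


def one_vs_all_hinge(scores, true_class):
    """Sum of hinge losses with target +1 for the true class, -1 otherwise."""
    loss = 0.0
    for k, a in enumerate(scores):
        t = 1.0 if k == true_class else -1.0
        loss += max(0.0, 1.0 - t * a)
    return loss


# Made-up weights for three classes and two features.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x = [2.0, 0.5]
scores = svm_scores(x, W, b)
predicted = max(range(len(scores)), key=lambda k: scores[k])
loss = one_vs_all_hinge(scores, true_class=0)
```

Here the sample is classified correctly, yet the loss is not zero because the second SVM's score lies inside the margin. This is the practical difference from cross-entropy: the hinge loss penalizes margin violations and ignores samples that are already classified with enough margin.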

Ensemble of softmax and SVM classifier
Ensemble classifiers are generally built by combining two or more classifiers. To improve on the best-performing individual classifier, an ensemble must comprise accurate base classifiers [37]. In this paper, we compute the prediction probabilities of Softmax (denoted $P_{Softmax}$) and SVM (denoted $P_{SVM}$) individually. We then combine the Softmax and SVM classifiers into an ensemble probability $P_{Ensemble}$ to predict the final class, taking the maximum of the probabilities from the individual classifiers, as given in Eq. (11):

$$P_{Ensemble} = \max(P_{Softmax}, P_{SVM}). \tag{11}$$
This strategy improves our accuracy significantly. The two individual classifiers (SVM and Softmax) differ in effectiveness, and the experimental results show that the DCRNN with the ensemble classifier identifies emotion more accurately than either classifier alone.
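The maximum-probability combination can be illustrated with a short sketch. It assumes both classifiers already provide a probability vector over the same classes (how the SVM scores are converted to probabilities is not specified in the text) and takes the element-wise maximum before choosing the class; the function name and the example vectors are hypothetical.

```python
def ensemble_predict(p_softmax, p_svm):
    """Element-wise maximum of two probability vectors, then argmax."""
    p_ens = [max(a, b) for a, b in zip(p_softmax, p_svm)]
    label = max(range(len(p_ens)), key=lambda j: p_ens[j])
    return label, p_ens


# Made-up probability vectors for three classes.
p_softmax = [0.40, 0.45, 0.15]   # Softmax narrowly prefers class 1
p_svm = [0.70, 0.20, 0.10]       # SVM confidently prefers class 0
label, p_ens = ensemble_predict(p_softmax, p_svm)
```

In this example the Softmax output is nearly tied between two classes, so the more confident SVM output decides the label. Barring ties, taking the argmax of the element-wise maximum is equivalent to trusting whichever classifier reports the single highest probability.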

Speech emotional datasets
To evaluate its performance, we tested our model on the Odia dataset and on one popular public dataset, RAVDESS [23], which is widely used in speech emotion recognition and serves here to validate the results obtained on our Odia dataset.
The Odia dataset consists of 60 different utterances with seven different emotions: anger, surprise, fear, sadness, happiness, neutral, and disgust [38]. In our previous work, we used six discrete emotions and a total of 3240 utterances. The dataset was collected from three Odia dialects (Sambalpuri, Cuttacki, Berhampuri). Each dialect was recorded by six speakers (three male and three female) aged between 19 and 40 years. We used ten different Odia emotion sentences, and every sentence was repeated three times in all three dialects. In total, 18 (six speakers × three dialects) × 10 (sentences) × 7 (emotions) × 3 (repetitions) = 3780 utterances were collected. All utterances were recorded at a sampling frequency of 8.1 kHz with 16-bit quantization.
The RAVDESS speech emotion corpus [23] was recorded in English. The speech corpus consists of 1440 emotional audio (.wav) files with eight discrete emotional states: fear, calm, neutral, angry, disgust, sadness, surprise, and happiness. The dataset was recorded by twenty-four North American professional actors (twelve male and twelve female), and the average duration of each audio file is 3 s. All recordings were made at a sampling rate of 48 kHz with 16-bit quantization. The details of each emotional state of the RAVDESS and Odia datasets are shown in Table 1.

Experimental setup
The architecture of our proposed model is illustrated in Fig. 2. The DCRNN model is trained with a batch size of 32 and the Adam optimizer [46] with a learning rate of 0.001, for 150 epochs. Our experiments are carried out on the TensorFlow and Keras [39] deep learning platform, on a computer running the Windows 10 Pro 64-bit operating system with an Intel(R) Xeon(R) E-2224 CPU @ 3.4 GHz and an NVIDIA QUADRO P620 GPU with 16 GB memory. Following recent studies [3,7,8], we adopt a train-test split. We split each dataset (Odia and RAVDESS) into 80% for training and 20% held out, of which 5% of the data is used for tuning and 15% for testing the model. To reduce overfitting and obtain more stable results, the samples are divided into multiple train-test splits: the samples of each dataset are divided into n folds, of which (n − 1) folds are used for training and the remaining fold is used as the test and validation set. We set n equal to 5.
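The fold construction can be sketched as follows. This is a minimal illustration of plain shuffled 5-fold splitting over sample indices; whether the folds were stratified by emotion or grouped by speaker is not stated in the text, and the function name `k_fold_indices` and the fixed seed are assumptions made for the example.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, held_out) index lists for k shuffled folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, held_out


splits = list(k_fold_indices(3780, 5))   # 3780 = size of the Odia dataset
```

With n = 5 each held-out fold contains 756 of the 3780 Odia utterances (20%), and the remaining 3024 (80%) are used for training, which matches the 80/20 proportion described above; the held-out fold would then be divided further into tuning and test portions.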

Effects of Bi-GRU layer
It is essential to find a suitable number of Bi-GRU layers and the number of neurons per layer to obtain the best-optimized model. In addition, we studied the effect of different classifiers with and without pre-trained models. We first optimize the Bi-GRU layers, which take the output of the deep CNN as input. The performance of a deep learning model depends heavily on the number of hidden layers and neurons. We train and validate our model with different numbers of Bi-GRU layers (1, 2, 3) and neurons (128, 256, 512) on the Odia and RAVDESS datasets using the Softmax and SVM classifiers. After these experiments, we concluded that Bi-GRU_2_128 (two Bi-GRU layers with 128 neurons) on the RAVDESS dataset and Bi-GRU_2_256 (two Bi-GRU layers with 256 neurons) on the Odia dataset perform best across the classifiers.

Effects of omitting the pre-trained DCNN
We also trained and tested our method with a DCNN of the same architecture but without pre-training, in place of the pre-trained DCNN (AlexNet). Each parameter of the convolutional layers was randomly initialized from a standard normal distribution. The overall accuracy obtained without pre-training was 81.24% on the Odia dataset and 74.32% on RAVDESS with the ensemble classifier. We then used the AlexNet model pre-trained on ImageNet, which improved performance by 4.07% and 3.22%, respectively, compared with the model without pre-training, using the ensemble classifier, as shown in Table 2. The results demonstrate that using a pre-trained DCNN model improves both recognition accuracy and convergence rate.

DCRNN + Ensemble classifier results
We present confusion matrices to investigate the performance of the DCRNN model using the Softmax classifier, the SVM classifier, and the ensemble classifier on the Odia and RAVDESS datasets, shown in Figs. 4 and 5. Figure 4a shows the confusion matrix of the DCRNN model using the Softmax classifier on the Odia dataset: 'neutral' and 'surprise' achieve the highest recognition rates of 91.18% and 92%, respectively, 'disgust' has the lowest recognition rate of 56.76%, and the other four emotions are recognized with accuracies below 90%, giving an overall accuracy of 81.53%, as illustrated in Table 2. Figure 4b shows the confusion matrix of the DCRNN model using the SVM classifier: the recognition rates of 'surprise', 'sadness', and 'disgust' increase by 8% (from 92% to 100%), 14.28% (from 71.43% to 85.71%), and 10.81% (from 56.76% to 67.57%), while those of 'angry' and 'neutral' decrease by 3.12% (from 78.12% to 75%) and 5.89% (from 91.18% to 85.29%). The recognition rates of the two other emotions remain roughly the same. The overall recognition rate increases from 81.53% to 83.62%, which shows that the SVM classifier performs slightly better than the Softmax classifier.
Fig. 5 shows the performance on the RAVDESS dataset, with eight emotions, for the Softmax classifier, the SVM classifier, and the ensemble of Softmax and SVM classifiers. Figure 5a shows the performance of the DCRNN model using the Softmax classifier on the RAVDESS dataset. From Fig. 5a, we observe that 'calm', 'happiness', and 'surprise' are recognized well, with accuracy rates of 91.30%, 87.88%, and 86.84%, whereas 'neutral' is classified relatively well, with a recognition rate of 77.42%. The other four emotions are classified with rates below 70%. The overall accuracy is 73.52%, as shown in Table 2. Figure 5b shows the confusion matrix of the DCRNN model using the SVM classifier: 'neutral', 'calm', 'happiness', and 'surprise' are recognized with rates of 80.65%, 80.43%, 87.88%, and 84.21%, respectively, while the other four emotions have recognition rates below 75%. The overall recognition rate is 74.91%, which shows that the SVM classifier performs better than the Softmax classifier.
The ensemble classifier achieves an overall average recognition accuracy of 77.54%, outperforming the Softmax and SVM classifiers by 4.02% and 2.63%, respectively, as shown in Table 2.
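For reference, per-class recognition rates and the overall accuracy can be read off a confusion matrix as in the sketch below. The matrix is a made-up three-class example, not data from the paper, and the sketch assumes that true classes are in rows and that the overall rate is the fraction of all samples on the diagonal (the text does not say whether the reported overall rates are sample-weighted or averaged over classes).

```python
def recognition_rates(cm):
    """Per-class recall (%) from a confusion matrix with true classes in rows."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(cm)]


def overall_accuracy(cm):
    """Percentage of all samples lying on the diagonal."""
    correct = sum(row[i] for i, row in enumerate(cm))
    total = sum(sum(row) for row in cm)
    return 100.0 * correct / total


# Made-up 3-class confusion matrix (rows: true class, columns: predicted).
cm = [[24, 6, 2],
      [3, 30, 1],
      [5, 4, 25]]
rates = recognition_rates(cm)
accuracy = overall_accuracy(cm)
```

The first row mirrors the kind of figure quoted above: 24 correct out of 32 samples gives a recognition rate of 75%.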
In addition, we report the F1-score for each emotional state to provide a statistical assessment of our experimental results; the F1-score is the harmonic mean of precision and recall. Tables 3 and 4 present the statistical performance on the Odia and RAVDESS databases.
From the F1-scores, we observe that each dataset presents different difficulties in recognizing particular emotional states. On the Odia dataset, 'disgust' is recognized slightly less well than all the other emotions by all the classifiers, whereas on RAVDESS 'angry', 'disgust', and 'sadness' perform relatively poorly with all three classifiers.
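The per-class F1-score follows directly from a confusion matrix, as in this minimal sketch. The two-class matrix is a made-up example, and true classes are assumed to be in rows and predicted classes in columns.

```python
def f1_per_class(cm):
    """F1-score of each class from a confusion matrix (rows: true class)."""
    n = len(cm)
    scores = []
    for i in range(n):
        tp = cm[i][i]
        predicted = sum(cm[r][i] for r in range(n))   # column sum
        actual = sum(cm[i])                            # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


cm = [[8, 2],
      [1, 9]]
f1 = f1_per_class(cm)
```

Because the F1-score combines precision and recall, a class that is rarely missed but often predicted by mistake for other emotions still receives a lower score, which the recognition rate alone would not reveal.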
Finally, to further analyze the effectiveness of the proposed model, the training loss curves for the two datasets (Odia and RAVDESS) are shown in Figs. 6 and 7. Over 150 epochs of training, both datasets show slight fluctuations in convergence, possibly because the durations of the emotion samples are not equal: the average duration is 4.5 s for Odia and 3 s for RAVDESS.

Comparison with recent work
We also compare the results of our proposed method with several recent studies. Table 5 compares our results with previously published results, along with the models and features used, on the two datasets. On the RAVDESS dataset, our proposed method improves accuracy by 17.44%, 11.57%, 8.14%, 4.04%, and 5.93% compared with [3,7,17,40,41].
On the Odia dataset, our proposed method achieves 18.61% and 10.72% higher accuracy than [38], which used only prosodic features such as pitch, energy, and formants with SVM and GMM classifiers.

Conclusion and future work
Automatic speech emotion recognition is challenging because it is essential to identify which features are prominent for the task. This work is motivated by the need to build more accurate emotion classification for regional languages, which would ease translation from one language to another. We selected a deep learning method to build an SER system for the Odia language because accuracy is the main factor and not all aspects of emotion are well understood.
This study is inspired by the effectiveness of 3-D log Mel-spectrograms (similar to an RGB color representation) for identifying emotional classes from the speech signal. We propose a new deep CRNN with an ensemble of two classifiers (Softmax and SVM). A deep CNN (AlexNet) pre-trained on ImageNet produces deep high-level features from the speech spectrogram. The output of the deep CNN is then learned by the Bi-GRU, which captures long-term dependencies and yields utterance-level features. Finally, we obtain the classification results using the maximum probabilities of the two classifiers. Our experimental results show overall classification rates of 85.31% and 77.54% on the seven-class Odia dataset and the eight-class RAVDESS dataset, respectively, outperforming some previously published results. In the future, we would like to further investigate our Odia dataset with more data, and to apply the proposed framework to datasets in other languages. We also plan to implement averaging of predictions by adding more multi-class classifiers. Further, we would like to extend our work by adopting different acoustic speech features together with modern techniques such as Transformer-based models, to achieve more stable and reliable results.