Abstract
Robots that react inaccurately to human emotions have long been a concern. As technology has advanced, robots such as service robots now interact with speakers of many different languages. The traditional Speech Emotion Recognition (SER) approach trains and tests the classifier on the same corpus, which yields accurate recognition but does not generalize to multi-lingual (multi-language) settings, an essential requirement for robots used worldwide. This research proposes an ensemble learning method that combines HMLSTM and CapsNet through majority voting for a cross-corpus, multi-lingual SER system. Three corpora (EMO-DB, URDU, and SAVEE) covering three languages (German, Urdu, and English) are used to evaluate multi-language SER. Features are first extracted with the Refined Attention Pyramid Network (RAPNet). In the pre-processing step, the data are normalized using min–max normalization, and IGAN is applied to address data imbalance. The HMLSTM and CapsNet ensemble then classifies the emotion in each speech sample into its appropriate category. The proposed ensemble learning approach improves emotion recognition with reasonable accuracy, and its effectiveness is compared against existing traditional learning methods. To test multi-lingual emotion identification, the study evaluates each classifier on a corpus different from the one it was trained on; in these experiments, different classifiers achieve high accuracy on different corpora.
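The two building blocks named in the abstract, min–max normalization of features and a majority vote over the two base classifiers, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the arrays standing in for HMLSTM and CapsNet predictions are hypothetical, and with only two voters a tie falls back to the numerically smaller label (real systems typically use an odd number of voters or weighted/soft voting).

```python
import numpy as np

def min_max_normalize(x, eps=1e-8):
    """Scale each feature column to the [0, 1] range (min-max normalization)."""
    x = np.asarray(x, dtype=float)
    mn, mx = x.min(axis=0), x.max(axis=0)
    return (x - mn) / (mx - mn + eps)

def majority_vote(predictions):
    """Combine per-classifier label predictions by majority vote.

    predictions: list of 1-D integer label arrays, one per classifier,
    all of the same length. Ties resolve to the smallest label.
    """
    stacked = np.stack(predictions)          # (n_classifiers, n_samples)
    voted = []
    for sample_votes in stacked.T:           # iterate over samples
        labels, counts = np.unique(sample_votes, return_counts=True)
        voted.append(labels[np.argmax(counts)])
    return np.array(voted)

# Toy predictions standing in for the two base classifiers
# (labels here are arbitrary emotion-class indices).
hmlstm_preds = np.array([0, 1, 2, 1])
capsnet_preds = np.array([0, 1, 1, 1])
print(majority_vote([hmlstm_preds, capsnet_preds]))  # [0 1 1 1]
```

In a cross-corpus setup, the base classifiers would be trained on one corpus (e.g., EMO-DB) and the vote taken over their predictions on a different corpus (e.g., SAVEE), with features normalized per corpus before classification.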
Data availability
Data will be made available on request.
References
Kwon, S.: MLT-DNet: speech emotion recognition using 1D dilated CNN based on multi-learning trick approach. Expert Syst. Appl. 167, 114177 (2021)
Zhang, S., Tao, X., Chuang, Y., Zhao, X.: Learning deep multimodal affective features for spontaneous speech emotion recognition. Speech Commun. 127, 73–81 (2021)
Kwon, S.: Optimal feature selection based speech emotion recognition using two-stream deep convolutional neural network. Int. J. Intell. Syst. 36(9), 5116–5135 (2021)
Meena, G., Mohbey, K.K., Kumar, S., Lokesh, K.: A hybrid deep learning approach for detecting sentiment polarities and knowledge graph representation on monkeypox tweets. Decis. Anal. J. 7, 100243 (2023)
Tuncer, T., Dogan, S., Acharya, U.R.: Automated accurate speech emotion recognition system using twine shuffle pattern and iterative neighborhood component analysis techniques. Knowl. Syst. 211, 106547 (2021)
Zhao, Z., Li, Q., Zhang, Z., Cummins, N., Wang, H., Tao, J., Schuller, B.W.: Combining a parallel 2D CNN with a self-attention dilated residual network for CTC-based discrete speech emotion recognition. Neural Netw. 141, 52–60 (2021)
Mohbey, K.K., Meena, G., Kumar, S., Lokesh, K.: A CNN-LSTM-based hybrid deep learning approach for sentiment analysis on Monkeypox tweets. New Gener. Comput. 14, 1–19 (2023)
Yildirim, S., Kaya, Y., Kılıç, F.: A modified feature selection method based on metaheuristic algorithms for speech emotion recognition. Appl. Acoust. 173, 107721 (2021)
Li, S., Xing, X., Fan, W., Cai, B., Fordson, P., Xu, X.: Spatiotemporal and frequential cascaded attention networks for speech emotion recognition. Neurocomputing 448, 238–248 (2021)
Liu, Z.T., Rehman, A., Wu, M., Cao, W.H., Hao, M.: Speech emotion recognition based on formant characteristics feature extraction and phoneme type convergence. Inf. Sci. 563, 309–325 (2021)
Abdulmohsin, H.A.: A new proposed statistical feature extraction method in speech emotion recognition. Comput. Electr. Eng. 93, 107172 (2021)
Hansen, L., Zhang, Y.P., Wolf, D., Sechidis, K., Ladegaard, N., Fusaroli, R.: A generalizable speech emotion recognition model reveals depression and remission. Acta Psychiatr. Scand. 145(2), 186–199 (2022)
Fu, C., Dissanayake, T., Hosoda, K., Maekawa, T., Ishiguro, H.: Similarity of speech emotion in different languages revealed by a neural network with attention. In: 2020 IEEE 14th International Conference on Semantic Computing (ICSC), pp. 381–386. IEEE (2020)
Kumaran, U., Radha Rammohan, S., Nagarajan, S.M., Prathik, A.: Fusion of mel and gammatone frequency cepstral coefficients for speech emotion recognition using deep C-RNN. Int. J. Speech Technol. 24, 303–314 (2021)
Senthilkumar, N., Karpakam, S., Devi, M.G., Balakumaresan, R., Dhilipkumar, P.: Speech emotion recognition based on Bi-directional LSTM architecture and deep belief networks. Mater. Today Proc. 57, 2180–2184 (2022)
Qadri, S.A.A., Gunawan, T.S., Kartiwi, M., Mansor, H., Wani, T.M.: Speech emotion recognition using feature fusion of TEO and MFCC on multilingual databases. In: Recent Trends in Mechatronics Towards Industry 4.0: Selected Articles from iM3F 2020, Malaysia, pp. 681–691. Springer, Singapore (2022)
Ma, Y., Wang, W.: MSFL: explainable multitask-based shared feature learning for multilingual speech emotion recognition. Appl. Sci. 12(24), 12805 (2022)
Alsabhan, W.: Human-computer interaction with a real-time speech emotion recognition with ensembling techniques 1D convolution neural network and attention. Sensors 23(3), 1386 (2023)
Gomathy, M.: Optimal feature selection for speech emotion recognition using enhanced cat swarm optimization algorithm. Int. J. Speech Technol. 24(1), 155–163 (2021)
Ahmed, M.R., Islam, S., Islam, A.M., Shatabda, S.: An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition. Expert Syst. Appl. 218, 119633 (2023)
Pham, N.T., Dang, D.N., Nguyen, N.D., Nguyen, T.T., Nguyen, H., Manavalan, B., Lim, C.P., Nguyen, S.D.: Hybrid data augmentation and deep attention-based dilated convolutional-recurrent neural networks for speech emotion recognition. Expert Syst. Appl. 230, 120608 (2023)
Chen, W., Hu, H.: Generative attention adversarial classification network for unsupervised domain adaptation. Pattern Recogn. 107, 107440 (2020)
Kanna, P.R., Santhi, P.: Unified deep learning approach for efficient intrusion detection system using integrated spatial–temporal features. Knowl. Syst. 226, 107132 (2021)
Wang, Z., Zheng, L., Du, W., Cai, W., Zhou, J., Wang, J., He, G.: A novel method for intelligent fault diagnosis of bearing based on capsule neural network. Complexity 2019(2019), 1 (2019)
SAVEE dataset: https://www.kaggle.com/datasets/ejlok1/surrey-audiovisual-expressed-emotion-savee
EMO-DB dataset: https://www.kaggle.com/datasets/piyushagni5/berlin-database-of-emotional-speech-emodb
URDU dataset: https://www.kaggle.com/datasets/hazrat/urdu-speech-dataset?select=files
Al-onazi, B.B., Nauman, M.A., Jahangir, R., Malik, M.M., Alkhammash, E.H., Elshewey, A.M.: Transformer-based multilingual speech emotion recognition using data augmentation and feature fusion. Appl. Sci. 12(18), 9188 (2022)
Khan, A.: Improved multi-lingual sentiment analysis and recognition using deep learning. J. Inform. Sci. 12, 01655515221137270 (2023)
Funding
No funding was received to assist with the preparation of this manuscript.
Author information
Authors and Affiliations
Contributions
The contributions of the authors are as follows: Anumula Sruthi, Anumula Kalyan Kumar, Kishore Dasari, and Yenugu Sivaramaiah contributed to conceptualization, methodology, software, formal analysis, investigation, resources, writing—original draft, review & editing, and visualization. Garikapati Divya and G. Sai Chaitanya Kumar contributed to conceptualization and writing—review & editing.
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Sruthi, A., Kumar, A.K., Dasari, K. et al. Multi-language: ensemble learning-based speech emotion recognition. Int J Data Sci Anal (2024). https://doi.org/10.1007/s41060-024-00553-6