Abstract
Machine learning techniques are among the most effective approaches to Voice and Emotion Recognition (VER), and the automatic recognition of voice and emotion is essential for smooth psychosocial interaction between humans and machines. VER research has made substantial progress by combining spectrogram representations with deep learning features. However, although single machine learning (ML) methods deliver acceptable results, they do not yet meet the required standards. This motivates strategies that combine several ML techniques and target multiple aspects of voice recognition. This article proposes an ensemble classifier that combines the outputs of two base classifiers, a Capsule Network (CapsNet) and a Recurrent Neural Network (RNN), for VER. The CapsNet model captures the spatial correlations of vital speech information in spectrograms, while the RNN excels at processing time-series data; both are well known for their classification performance. Stacked generalization is used to construct the ensemble, integrating the predictions of the CapsNet and RNN classifiers. The ensemble achieves an overall accuracy of 96.05%, outperforming either CapsNet or RNN used individually. A notable strength of the proposed classifier is its detection of the emotional class 'FEAR', with a recognition rate of 96.68% among the eight classes considered.
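The stacking strategy described above can be sketched in a few lines. In this minimal, illustrative example the deep CapsNet and RNN base learners are replaced by two lightweight scikit-learn stand-ins (an MLP and a decision tree), and the emotion dataset by synthetic 8-class data; only the stacked-generalization mechanics — base classifiers whose out-of-fold predicted probabilities feed a meta-learner — match the paper's approach. All model choices here are assumptions for demonstration, not the authors' implementation.

```python
# Sketch of stacked generalization (stacking): base classifiers' out-of-fold
# predicted probabilities become input features for a meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 8-class data standing in for spectrogram-derived emotion features.
X, y = make_classification(n_samples=800, n_features=20, n_informative=10,
                           n_classes=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The two base estimators play the roles of CapsNet and RNN (stand-ins only).
stack = StackingClassifier(
    estimators=[("capsnet_standin", MLPClassifier(max_iter=500, random_state=0)),
                ("rnn_standin", DecisionTreeClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
    stack_method="predict_proba",  # stack class probabilities, not labels
    cv=5)                          # out-of-fold predictions avoid leakage
stack.fit(X_tr, y_tr)
print(f"stacked test accuracy: {stack.score(X_te, y_te):.3f}")
```

In practice the base models would be the trained CapsNet and RNN, and their class-probability outputs on held-out folds would form the meta-learner's training set.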
Ethics declarations
Conflict of interest
The author declares that he has no conflict of interest.
Funding statement
No funding was provided for the completion of this research paper.
Cite this article
Alharbi, Y. Effective ensembling classification strategy for voice and emotion recognition. Int J Syst Assur Eng Manag 15, 334–345 (2024). https://doi.org/10.1007/s13198-022-01729-8