Effective ensembling classification strategy for voice and emotion recognition

Original Article
Published in: International Journal of System Assurance Engineering and Management

Abstract

Machine learning (ML) techniques are now among the most effective approaches to Voice and Emotion Recognition (VER), and automatic recognition of voice and emotion is essential for smooth psychosocial interaction between humans and machines. VER research has made great strides toward state-of-the-art systems that combine spectrogram features with deep learning. However, although single ML methods deliver acceptable results, they still fall short of the required standards; this motivates strategies that combine several ML techniques and target multiple aspects and elements of voice recognition. This article proposes an ensemble classifier model that combines the outputs of two base classifiers, a Capsule Network (CapsNet) and a Recurrent Neural Network (RNN), for VER. The CapsNet captures the spatial correlations of salient speech information in spectrograms through dynamic routing rather than pooling, while the RNN excels at processing time-series data; both are well known for their classification performance. Stacked generalization is used to construct the ensemble classifier, which integrates the predictions made by the CapsNet and RNN classifiers. The ensemble approach achieves an overall accuracy of 96.05%, outperforming either the CapsNet or the RNN applied individually. A notable strength of the proposed classifier is its detection of the emotional class 'FEAR', which it recognizes at a rate of 96.68%, higher than the seven other classes.
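Since the abstract's central idea is stacked generalization over two base classifiers, a minimal sketch of that scheme may help: out-of-fold class probabilities from the level-0 models become the input features of a level-1 meta-learner. The sketch below is an illustration of the technique in Python with scikit-learn, not the authors' implementation; the CapsNet and RNN are replaced by generic stand-in classifiers, and the logistic-regression meta-learner and synthetic data are assumptions.

    # Stacked generalization sketch: out-of-fold probability predictions from
    # the base classifiers become the meta-learner's input features. The
    # paper's base models are a CapsNet and an RNN; any classifiers exposing
    # predict_proba can stand in for them here.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold
    from sklearn.naive_bayes import GaussianNB

    def out_of_fold_meta_features(base_models, X, y, n_splits=5):
        """Concatenate each base model's out-of-fold class probabilities."""
        n_classes = len(np.unique(y))
        meta = np.zeros((len(X), len(base_models) * n_classes))
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train_idx, val_idx in skf.split(X, y):
            for m, model in enumerate(base_models):
                model.fit(X[train_idx], y[train_idx])
                meta[val_idx, m * n_classes:(m + 1) * n_classes] = \
                    model.predict_proba(X[val_idx])
        return meta

    # Stand-ins for the CapsNet and RNN base classifiers (assumption), on
    # synthetic data in place of spectrogram features.
    X, y = make_classification(n_samples=400, n_classes=3, n_informative=6,
                               random_state=0)
    base_models = [RandomForestClassifier(random_state=0), GaussianNB()]
    meta_X = out_of_fold_meta_features(base_models, X, y)
    meta_learner = LogisticRegression(max_iter=1000).fit(meta_X, y)
    print("stacked training accuracy: %.3f" % meta_learner.score(meta_X, y))

Using out-of-fold rather than in-fold predictions is the key design choice in stacked generalization: it keeps the meta-learner from simply memorizing base models that have already seen the training labels.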



Author information


Corresponding author

Correspondence to Yasser Alharbi.

Ethics declarations

Conflict of interest

The author declares he has no conflict of interest.

Funding statement

No funding was received for this research.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Alharbi, Y. Effective ensembling classification strategy for voice and emotion recognition. Int J Syst Assur Eng Manag 15, 334–345 (2024). https://doi.org/10.1007/s13198-022-01729-8

