Abstract
In this study, novel Spectro-Temporal Energy Ratio (STER) features are introduced for single-corpus and cross-corpus speech emotion recognition experiments. The features are based on vowel formants, the linearly spaced low-frequency region, and the logarithmically spaced high-frequency region of the human auditory system. Because the underlying dynamics and characteristics of speech recognition and speech emotion recognition differ substantially, an emotion-recognition-specific filter bank is required. The proposed features formulate a novel filter bank strategy that constructs 7 trapezoidal filter banks. These filter banks differ from the Mel and Bark scales in both shape and frequency regions, and they are designed to generalize the feature space. Cross-corpus experimentation is a step forward in speech emotion recognition, but its results are often disappointing. Our goal is to create a feature set that is robust against cross-corpus variations by applying various feature selection algorithms. We demonstrate this by shrinking the feature space from 6984 down to 128 dimensions while improving accuracy with SVM, RBM, and sVGG (small-VGG) classifiers. Although RBMs are no longer considered fashionable, we show that they can perform remarkably well when tuned properly. This paper reports a striking 90.65% accuracy rate using STER features on EmoDB.
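The abstract describes band-energy ratios computed through trapezoidal filter banks rather than the triangular Mel filters. As a minimal sketch of the idea, the snippet below builds a trapezoidal frequency response and computes per-band energy ratios from a power spectrum. The band edges, the flat-top widths, and the exact ratio definition are illustrative assumptions, not the authors' published parameters:

```python
import numpy as np

def trapezoidal_filter(freqs, f_lo, f_flat_lo, f_flat_hi, f_hi):
    """Trapezoidal response: linear rise from f_lo to f_flat_lo,
    unit gain on [f_flat_lo, f_flat_hi], linear fall to f_hi."""
    h = np.zeros_like(freqs, dtype=float)
    rise = (freqs >= f_lo) & (freqs < f_flat_lo)
    h[rise] = (freqs[rise] - f_lo) / (f_flat_lo - f_lo)
    h[(freqs >= f_flat_lo) & (freqs <= f_flat_hi)] = 1.0
    fall = (freqs > f_flat_hi) & (freqs <= f_hi)
    h[fall] = (f_hi - freqs[fall]) / (f_hi - f_flat_hi)
    return h

def band_energy_ratios(power_spec, freqs, bands):
    """Energy captured by each trapezoidal band, normalized by
    the total frame energy (one possible ratio definition)."""
    total = power_spec.sum() + 1e-12  # guard against silent frames
    return np.array([(power_spec * trapezoidal_filter(freqs, *b)).sum() / total
                     for b in bands])

# Hypothetical band edges (Hz): low bands linearly spaced, high bands wider.
bands = [(0, 100, 300, 400), (300, 400, 700, 900), (700, 900, 1500, 2000),
         (1500, 2000, 3000, 3500), (3000, 3500, 5000, 5500),
         (5000, 5500, 6500, 7000), (6500, 7000, 7800, 8000)]
freqs = np.linspace(0, 8000, 257)          # bins of a 512-point FFT at 16 kHz
spec = np.abs(np.random.default_rng(0).normal(size=257)) ** 2
ratios = band_energy_ratios(spec, freqs, bands)
```

Each frame then contributes 7 ratio values; stacking statistics of these over time would yield the spectro-temporal feature vectors that the feature selection stage prunes down to 128 dimensions.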
Availability of Data and Materials
Data is available on the following site: https://github.com/cevparlak/speech-emotion-recognition.
Code Availability
The source code is available on the following site: https://github.com/cevparlak/speech-emotion-recognition.
Funding
This work did not receive any grant from funding agencies in the public, commercial, or not-for-profit sectors.
Author information
Contributions
CP: Conceptualization, Methodology, Writing, Experiments; BD: Supervising, Reviewing, Writing, Editing, Conceptualization, Validation, Data Preparation; YA: Supervising, Reviewing, Writing, Editing, Validation.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Human and Animal Rights
This article does not contain any studies with human participants or animals performed by any of the authors.
Consent to Participate and for Publication
No human participants were involved in this study.
Cite this article
Parlak, C., Diri, B. & Altun, Y. Spectro-Temporal Energy Ratio Features for Single-Corpus and Cross-Corpus Experiments in Speech Emotion Recognition. Arab J Sci Eng 49, 3209–3223 (2024). https://doi.org/10.1007/s13369-023-07920-8