Abstract
Speech emotion recognition (SER) is a task that cannot be accomplished by linguistic models alone, owing in part to figures of speech; for more accurate emotion prediction, researchers have adopted acoustic modelling. The complexity of SER stems from the variety of acoustic features, the similarity between certain emotions, and other factors. In this paper, we propose a framework named Cross Languages One-Versus-All Speech Emotion Classifier (CLOVASEC) that identifies the emotions of speech in both Chinese and English. Acoustic features are preprocessed first with the Synthetic Minority Oversampling Technique (SMOTE) to diminish the impact of class imbalance, then with principal component analysis (PCA) to reduce dimensionality. The resulting features are fed into a classifier composed of eight sub-classifiers, each tasked with distinguishing one class from the other seven. The framework significantly outperformed regular classifiers on the Chinese Natural Audio-Visual Emotion Database (CHEAVD) and on an English dataset from Deng.
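The SMOTE → PCA → one-versus-all pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses three toy classes instead of eight, a hand-rolled one-class SMOTE interpolation, and logistic regression as a placeholder base classifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

def simple_smote(X, y, target_class, n_new):
    """Minimal SMOTE for one minority class: synthesize samples by
    interpolating between a sample and a random same-class neighbour."""
    Xc = X[y == target_class]
    synth = []
    for _ in range(n_new):
        i, j = rng.choice(len(Xc), size=2, replace=False)
        lam = rng.random()
        synth.append(Xc[i] + lam * (Xc[j] - Xc[i]))
    return (np.vstack([X] + synth[:1] + synth[1:]),
            np.concatenate([y, [target_class] * n_new]))

# Imbalanced toy "acoustic feature" data: 3 emotion classes, class 2 rare
X, y = make_classification(n_samples=300, n_features=40, n_informative=10,
                           n_classes=3, weights=[0.7, 0.2, 0.1],
                           random_state=0)
X, y = simple_smote(X, y, target_class=2, n_new=150)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# PCA for dimensionality reduction, then a one-versus-all classifier:
# one binary sub-classifier per emotion class, as in CLOVASEC
pca = PCA(n_components=10).fit(X_tr)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(pca.transform(X_tr), y_tr)
acc = clf.score(pca.transform(X_te), y_te)
```

After fitting, `clf.estimators_` holds one binary classifier per class; a real system would substitute genuine acoustic features and eight emotion classes.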
References
Abdi, H., Williams, L.J.: Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2(4), 433–459 (2010). https://doi.org/10.1002/wics.101
Akçay, M.B., Oğuz, K.: Speech emotion recognition: emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 116, 56–76 (2020). https://doi.org/10.1016/j.specom.2019.12.001
Badshah, A.M., Ahmad, J., Rahim, N., Baik, S.W.: Speech emotion recognition from spectrograms with deep convolutional neural network. In: 2017 International Conference on Platform Technology and Service, PlatCon 2017 - Proceedings, pp. 3–7 (2017). https://doi.org/10.1109/PlatCon.2017.7883728
Bong, S.Z., Wan, K., Murugappan, M., Ibrahim, N.M., Rajamanickam, Y., Mohamad, K.: Implementation of wavelet packet transform and non linear analysis for emotion classification in stroke patient using brain signals. Biomed. Signal Process. Control 36, 102–112 (2017). https://doi.org/10.1016/j.bspc.2017.03.016
Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16(2), 321–357 (2002). https://doi.org/10.1613/jair.953
Chen, M., Zhao, X.: A multi-scale fusion framework for bimodal speech emotion recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2020-October, pp. 374–378 (2020). https://doi.org/10.21437/Interspeech.2020-3156
Chiba, Y., Nose, T., Ito, A.: Multi-stream attention-based BLSTM with feature segmentation for speech emotion recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2020-October, pp. 3301–3305 (2020). https://doi.org/10.21437/Interspeech.2020-1199
Deng, J., Zhang, Z., Marchi, E., Schuller, B.: Sparse autoencoder-based feature transfer learning for speech emotion recognition. In: Proceedings - 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, ACII 2013, pp. 511–516 (2013). https://doi.org/10.1109/ACII.2013.90
El Ayadi, M., Kamel, M.S., Karray, F.: Survey on speech emotion recognition: features, classification schemes, and databases. Pattern Recogn. 44(3), 572–587 (2011). https://doi.org/10.1016/j.patcog.2010.09.020
Feng, H., Ueno, S., Kawahara, T.: End-to-end speech emotion recognition combined with acoustic-to-word ASR model. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2020-October, pp. 501–505 (2020). https://doi.org/10.21437/Interspeech.2020-1180
Fujioka, T., Homma, T., Nagamatsu, K.: Meta-learning for speech emotion recognition considering ambiguity of emotion labels. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2020-October, pp. 2332–2336 (2020). https://doi.org/10.21437/Interspeech.2020-1082
Gunes, H., Piccardi, M.: Bi-modal emotion recognition from expressive face and body gestures. J. Netw. Comput. Appl. 30(4), 1334–1345 (2007). https://doi.org/10.1016/j.jnca.2006.09.007
Ilin, A., Raiko, T.: Practical approaches to principal component analysis in the presence of missing values. J. Mach. Learn. Res. 11, 1957–2000 (2010)
Issa, D., Demirci, M.F., Yazici, A.: Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control 59, 101894 (2020). https://doi.org/10.1016/j.bspc.2020.101894
Latif, S., Asim, M., Rana, R., Khalifa, S., Jurdak, R., Schuller, B.W.: Augmenting Generative Adversarial Networks for Speech Emotion Recognition. arXiv, pp. 521–525 (2020)
Li, Y., Tao, J., Chao, L., Bao, W., Liu, Y.: CHEAVD: a Chinese natural emotional audio-visual database. J. Ambient Intell. Humanized Comput. 8(6), 913–924 (2017). https://doi.org/10.1007/s12652-016-0406-z
Li, Y., Tao, J., Jiang, D., Shan, S., Jia, J.: MEC 2017: Multimodal Emotion Recognition Challenge (2018)
Lim, W., Jang, D., Lee, T.: Speech emotion recognition using convolutional and recurrent neural networks. In: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA 2016, pp. 3–6 (2017). https://doi.org/10.1109/APSIPA.2016.7820699
Mayer, J.D.: Emotional intelligence. Imagination Cogn. Pers. 9(3), 185–211 (1989). https://doi.org/10.2190/DUGG-P24E-52WK-6CDG
Nardelli, M., Valenza, G., Greco, A., Lanata, A., Scilingo, E.P.: Recognizing emotions induced by affective sounds through heart rate variability. IEEE Trans. Affect. Comput. 6(4), 385–394 (2015). https://doi.org/10.1109/TAFFC.2015.2432810
Niu, Y., Zou, D., Niu, Y., He, Z., Tan, H.: Improvement on speech emotion recognition based on deep convolutional neural networks (2018). https://doi.org/10.1145/3194452.3194460
Nwe, T.L., Foo, S.W., De Silva, L.C.: Speech emotion recognition using hidden Markov models. Speech Commun. 41(4), 603–623 (2003). https://doi.org/10.1016/S0167-6393(03)00099-2
Schuller, B., Rigoll, G., Lang, M.: Hidden Markov model-based speech emotion recognition. In: Proceedings - IEEE International Conference on Multimedia and Expo 1, pp. I401–I404 (2003). https://doi.org/10.1109/ICME.2003.1220939
Shen, G., et al.: WISE: word-level interaction-based multimodal fusion for speech emotion recognition. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2020-October, pp. 369–373 (2020). https://doi.org/10.21437/Interspeech.2020-3131
Su, B.H., Chang, C.M., Lin, Y.S., Lee, C.C.: Improving speech emotion recognition using graph attentive Bi-directional gated recurrent unit network. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2020-October, pp. 506–510 (2020). https://doi.org/10.21437/Interspeech
Sun, Y., Wen, G., Wang, J.: Weighted spectral features based on local Hu moments for speech emotion recognition. Biomed. Signal Process. Control 18, 80–90 (2015). https://doi.org/10.1016/j.bspc.2014.10.008
Giannakopoulos, T.: pyAudioAnalysis: a Python library for audio feature extraction, classification, segmentation and applications. https://github.com/tyiannak/pyAudioAnalysis
Trigeorgis, G., et al.: Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings 2016-May, pp. 5200–5204 (2016). https://doi.org/10.1109/ICASSP.2016.7472669
Yuvaraj, R., et al.: Detection of emotions in Parkinson’s disease using higher order spectral features from brain’s electrical activity. Biomed. Signal Process. Control 14(1), 108–116 (2014). https://doi.org/10.1016/j.bspc.2014.07.005
Zhang, X., Xu, M., Zheng, T.F.: Ensemble system for multimodal emotion recognition challenge. In: 2018 1st Asian Conference on Affective Computing and Intelligent Interaction, ACII Asia 2018 (MEC 2017), pp. 7–12 (2018). https://doi.org/10.1109/ACIIAsia.2018.8470352
Zhu, T., Lin, Y., Liu, Y.: Synthetic minority oversampling technique for multiclass imbalance problems. Pattern Recogn. 72, 327–340 (2017). https://doi.org/10.1016/j.patcog.2017.07.024
Acknowledgement
This work was supported by the Six-Talent Peaks Project of Jiangsu Province (XYDXX-204), the Open Fund of Key Laboratory of Urban Land Resources Monitoring and Simulation, Ministry of Natural Resources (KF-2019-04-011, KF-2019-04-065), and Angel Project of Suzhou City science and technology (Grant No. CYTS2018233).
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
Cite this paper
Liu, X., Bin, J., Li, H. (2021). Cross Languages One-Versus-All Speech Emotion Classifier. In: Zhang, H., Yang, Z., Zhang, Z., Wu, Z., Hao, T. (eds) Neural Computing for Advanced Applications. NCAA 2021. Communications in Computer and Information Science, vol 1449. Springer, Singapore. https://doi.org/10.1007/978-981-16-5188-5_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-5187-8
Online ISBN: 978-981-16-5188-5