Abstract
Speech emotion recognition (SER) plays an increasingly important role in global business, where it is used to improve service efficiency, for example in call centers. Recent SER systems are based on deep learning. However, the performance of deep learning depends on the number of layers: in general, the deeper the network, the better the performance. On the other hand, deeper networks suffer from the vanishing gradient problem, slow learning, and high time consumption. Therefore, this paper proposes a redesign of the existing local feature learning block (LFLB). The new design is called the deep residual local feature learning block (DeepResLFLB). DeepResLFLB consists of three cascaded blocks: an LFLB, a residual local feature learning block (ResLFLB), and a multilayer perceptron (MLP). The LFLB learns local correlations and extracts hierarchical correlations; DeepResLFLB uses residual learning to learn repeatedly, which lets it capture more detail in deeper layers while mitigating vanishing gradients and reducing overfitting; and the MLP is adopted to relate the learned features and to estimate probabilities for the predicted speech emotions and gender types. On two publicly available datasets, EMODB and RAVDESS, the proposed DeepResLFLB significantly improves performance as measured by standard metrics: accuracy, precision, recall, and F1-score.
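The three-block cascade described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the layer composition, so the sketch assumes each block is a plain 1-D convolution with ReLU, and the layer sizes, the seven output classes, and the function names `lflb`, `res_lflb`, and `mlp_head` are all hypothetical. Its only purpose is to show how a residual (skip) connection adds the block input back to the learned transformation.

```python
import numpy as np

def conv1d_same(x, w, b):
    """1-D convolution (cross-correlation) with zero 'same' padding.
    x: (c_in, time), w: (c_out, c_in, kernel) with odd kernel, b: (c_out,)."""
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    t = x.shape[1]
    out = np.empty((c_out, t))
    for i in range(t):
        window = xp[:, i:i + k]  # (c_in, k) slice centred on frame i
        out[:, i] = np.tensordot(w, window, axes=([1, 2], [0, 1])) + b
    return out

def relu(x):
    return np.maximum(x, 0.0)

def lflb(x, w, b):
    """Assumed, simplified LFLB: one convolution followed by ReLU."""
    return relu(conv1d_same(x, w, b))

def res_lflb(x, w1, b1, w2, b2):
    """Assumed residual block: y = relu(F(x) + x), with F two convolutions.
    The skip connection gives gradients a direct path around F."""
    f = conv1d_same(relu(conv1d_same(x, w1, b1)), w2, b2)
    return relu(f + x)

def mlp_head(features, w, b):
    """Assumed classifier head: average over time, then one softmax layer."""
    pooled = features.mean(axis=1)
    z = w @ pooled + b
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Toy forward pass: 1 input channel, 50 frames, 8 feature maps, 7 classes
# (all sizes are illustrative assumptions, not values from the paper).
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 50))
h = lflb(x, 0.1 * rng.standard_normal((8, 1, 3)), np.zeros(8))
h = res_lflb(h,
             0.1 * rng.standard_normal((8, 8, 3)), np.zeros(8),
             0.1 * rng.standard_normal((8, 8, 3)), np.zeros(8))
probs = mlp_head(h, rng.standard_normal((7, 8)), np.zeros(7))
print(probs.shape, probs.sum())
```

If the weights of the residual branch are all zero, `res_lflb` passes a non-negative input through unchanged, which is the property that keeps very deep stacks trainable.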
Acknowledgments
We would like to thank the Science Research Foundation, Siam Commercial Bank, for partial financial support of this work.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Singkul, S., Chatchaisathaporn, T., Suntisrivaraporn, B., Woraratpanya, K. (2020). Deep Residual Local Feature Learning for Speech Emotion Recognition. In: Yang, H., Pasupa, K., Leung, A.CS., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Lecture Notes in Computer Science, vol 12532. Springer, Cham. https://doi.org/10.1007/978-3-030-63830-6_21
DOI: https://doi.org/10.1007/978-3-030-63830-6_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63829-0
Online ISBN: 978-3-030-63830-6
eBook Packages: Computer Science (R0)