Deep Residual Local Feature Learning for Speech Emotion Recognition

Singkul, Sattaya; Chatchaisathaporn, Thakorn; Suntisrivaraporn, Boontawee; Woraratpanya, Kuntpong

doi:10.1007/978-3-030-63830-6_21

Sattaya Singkul (ORCID: orcid.org/0000-0001-7335-7105), Thakorn Chatchaisathaporn, Boontawee Suntisrivaraporn & Kuntpong Woraratpanya

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12532)

Included in the following conference series:

International Conference on Neural Information Processing

Abstract

Speech Emotion Recognition (SER) plays an increasingly important role in global business, where it is used to improve service efficiency, for example in call centers. Recent SER systems are based on deep learning. However, the efficiency of deep learning depends on the number of layers: the deeper the network, the higher the efficiency. On the other hand, deeper networks give rise to the vanishing gradient problem, a low learning rate, and high time consumption. This paper therefore proposes a redesign of the existing local feature learning block (LFLB), called the deep residual local feature learning block (DeepResLFLB). DeepResLFLB consists of three cascaded blocks: LFLB, residual local feature learning block (ResLFLB), and multilayer perceptron (MLP). LFLB is built to learn local correlations and extract hierarchical correlations; DeepResLFLB takes advantage of repeated learning to explain more detail in deeper layers, using residual learning to address the vanishing gradient problem and reduce overfitting; and MLP is adopted to find the relationship of learning and to estimate probabilities for the predicted speech emotions and gender types. On two publicly available datasets, EMODB and RAVDESS, the proposed DeepResLFLB significantly improves performance as measured by standard metrics: accuracy, precision, recall, and F1-score.
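
The residual learning idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' ResLFLB: the abstract does not specify its layers, so the block below assumes a toy transformation F(x) (a single 1-D convolution followed by ReLU) and wraps it in the generic residual form y = F(x) + x. The function names `conv1d_same`, `plain_block`, and `residual_block` are hypothetical and chosen only for illustration.

```python
def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output length equals input length.

    Assumes an odd-length kernel so the padding is symmetric.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def plain_block(x, kernel):
    """Toy transformation F(x): convolution followed by ReLU (an assumption)."""
    return relu(conv1d_same(x, kernel))


def residual_block(x, kernel):
    """Residual form y = F(x) + x: the input is added back through a skip connection."""
    fx = plain_block(x, kernel)
    return [f + xi for f, xi in zip(fx, x)]


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0, 0.0]
    zero_kernel = [0.0, 0.0, 0.0]  # a block that has learned nothing

    # Without the skip connection, the signal is wiped out.
    print(plain_block(x, zero_kernel))
    # With the skip connection, the input still passes through unchanged.
    print(residual_block(x, zero_kernel))
```

In this sketch the skip connection keeps a direct path for the signal even when F(x) contributes nothing. That direct path is the usual intuition for why residual connections ease the training of deeper stacks.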

Acknowledgments

We would like to thank the Science Research Foundation, Siam Commercial Bank, for partial financial support of this work.

Author information

Authors and Affiliations

Faculty of Information Technology, King Mongkut’s Institute of Technology Ladkrabang, Bangkok, Thailand
Sattaya Singkul & Kuntpong Woraratpanya
Data Analytics, Siam Commercial Bank, Bangkok, Thailand
Thakorn Chatchaisathaporn & Boontawee Suntisrivaraporn

Corresponding author

Correspondence to Kuntpong Woraratpanya.

Editor information

Editors and Affiliations

Department of AI, Ping An Life, Shenzhen, China
Haiqin Yang
Faculty of Information Technology, King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand
Kitsuchart Pasupa
City University of Hong Kong, Kowloon, Hong Kong
Andrew Chi-Sing Leung
Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, Hong Kong
James T. Kwok
School of Information Technology, King Mongkut’s University of Technology Thonburi, Bangkok, Thailand
Jonathan H. Chan
The Chinese University of Hong Kong, New Territories, Hong Kong
Irwin King

Cite this paper

Singkul, S., Chatchaisathaporn, T., Suntisrivaraporn, B., Woraratpanya, K. (2020). Deep Residual Local Feature Learning for Speech Emotion Recognition. In: Yang, H., Pasupa, K., Leung, A.CS., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Lecture Notes in Computer Science, vol 12532. Springer, Cham. https://doi.org/10.1007/978-3-030-63830-6_21

DOI: https://doi.org/10.1007/978-3-030-63830-6_21
Published: 19 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-63829-0
Online ISBN: 978-3-030-63830-6
eBook Packages: Computer Science, Computer Science (R0)
