Deep Residual Local Feature Learning for Speech Emotion Recognition

  • Conference paper
Neural Information Processing (ICONIP 2020)

Abstract

Speech Emotion Recognition (SER) is playing an increasingly important role in global business, improving the efficiency of services such as call centers. Recent SER systems are based on deep learning. However, the performance of deep learning depends on network depth: the deeper the network, the better the performance. On the other hand, deeper networks suffer from the vanishing gradient problem, slow convergence, and long training times. This paper therefore proposes a redesign of the existing local feature learning block (LFLB), called the deep residual local feature learning block (DeepResLFLB). DeepResLFLB consists of three cascaded blocks: an LFLB, a residual local feature learning block (ResLFLB), and a multilayer perceptron (MLP). The LFLB learns local correlations while extracting hierarchical correlations; the ResLFLB exploits repeated learning to capture finer detail in deeper layers, using residual learning to mitigate vanishing gradients and reduce overfitting; and the MLP models the relationships among the learned features and outputs probabilities for the predicted speech emotions and gender types. On two publicly available datasets, EMODB and RAVDESS, the proposed DeepResLFLB significantly improves performance as evaluated by standard metrics: accuracy, precision, recall, and F1-score.
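To make the cascaded structure concrete, below is a minimal PyTorch sketch of the three blocks described in the abstract. The channel counts, kernel sizes, spectrogram input shape, and seven-class output are assumptions for illustration; the abstract does not specify the authors' exact configuration.

```python
# Hypothetical sketch of the DeepResLFLB pipeline: LFLB -> ResLFLB stack -> MLP.
# All hyperparameters (channels, kernels, depth, 7 classes) are assumed, not the
# authors' published settings.
import torch
import torch.nn as nn

class LFLB(nn.Module):
    """Local feature learning block: Conv -> BatchNorm -> ELU -> MaxPool."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ELU(),
            nn.MaxPool2d(2),
        )

    def forward(self, x):
        return self.block(x)

class ResLFLB(nn.Module):
    """Residual local feature learning block: two conv layers plus an
    identity shortcut, easing gradient flow in deeper stacks."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch),
            nn.ELU(),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch),
        )
        self.act = nn.ELU()

    def forward(self, x):
        return self.act(self.body(x) + x)  # residual connection

class DeepResLFLB(nn.Module):
    """Cascade of LFLB -> stacked ResLFLBs -> MLP classifier head."""
    def __init__(self, n_classes=7, depth=3):
        super().__init__()
        self.lflb = LFLB(1, 64)
        self.res = nn.Sequential(*[ResLFLB(64) for _ in range(depth)])
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64, 128),
            nn.ELU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, x):  # x: (batch, 1, freq, time) spectrogram
        h = self.pool(self.res(self.lflb(x)))
        return self.mlp(h)  # logits over emotion (or emotion-gender) classes

# Smoke test on a dummy 64x128 spectrogram batch.
logits = DeepResLFLB()(torch.randn(2, 1, 64, 128))
```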



Acknowledgments

We would like to thank the Science Research Foundation, Siam Commercial Bank, for its partial financial support of this work.

Author information

Corresponding author

Correspondence to Kuntpong Woraratpanya.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 257 KB)


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper


Cite this paper

Singkul, S., Chatchaisathaporn, T., Suntisrivaraporn, B., Woraratpanya, K. (2020). Deep Residual Local Feature Learning for Speech Emotion Recognition. In: Yang, H., Pasupa, K., Leung, A.C.S., Kwok, J.T., Chan, J.H., King, I. (eds) Neural Information Processing. ICONIP 2020. Lecture Notes in Computer Science, vol. 12532. Springer, Cham. https://doi.org/10.1007/978-3-030-63830-6_21


  • DOI: https://doi.org/10.1007/978-3-030-63830-6_21


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-63829-0

  • Online ISBN: 978-3-030-63830-6

  • eBook Packages: Computer Science, Computer Science (R0)
