Convolutional Neural Network with Spectrogram and Perceptual Features for Speech Emotion Recognition

  • Linjuan Zhang
  • Longbiao Wang
  • Jianwu Dang
  • Lili Guo
  • Haotian Guan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11304)

Abstract

Convolutional neural networks (CNNs) have demonstrated great power in mining deep information from spectrograms for speech emotion recognition. However, perceptual features such as low-level descriptors (LLDs) and their statistical values have not been fully exploited in CNN-based emotion recognition. To address this problem, we propose novel features that combine spectrogram and perceptual features at different levels. First, frame-level LLDs are arranged as time-sequence LLDs. Then, the spectrogram and the time-sequence LLDs are fused into compositional spectrographic features (CSF). To fully exploit perceptual features and global information, statistical values of the LLDs are added to the CSF to generate rich-compositional spectrographic features (RSF). Finally, the proposed features are individually fed to a CNN to extract deep features for emotion recognition. Bi-directional long short-term memory (BLSTM) was employed to identify the emotions, and experiments were conducted on EmoDB. Compared with the spectrogram alone, CSF and RSF improve the unweighted accuracy by relative error reductions of 32.04% and 36.91%, respectively.
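The abstract describes the feature construction and the CNN-BLSTM pipeline only at a high level. The following Python sketch illustrates one plausible reading of it, not the authors' implementation: frame-level LLDs are stacked along the frequency axis of the log-spectrogram to form CSF, and utterance-level LLD statistics are tiled over time and appended to form RSF, before a CNN front end and a BLSTM classifier. The feature dimensions, the choice of statistical functionals (mean, standard deviation, maximum, minimum), and all layer sizes are illustrative assumptions; the paper's exact LLD set and network configuration are given in the full text.

import numpy as np
import torch
import torch.nn as nn

def make_csf(log_spec: np.ndarray, llds: np.ndarray) -> np.ndarray:
    # Fuse a (freq_bins, frames) log-spectrogram with (n_lld, frames)
    # frame-level LLDs into one 2-D compositional feature map (CSF).
    assert log_spec.shape[1] == llds.shape[1], "frame counts must match"
    return np.vstack([log_spec, llds])

def make_rsf(log_spec: np.ndarray, llds: np.ndarray) -> np.ndarray:
    # Append per-utterance LLD statistics, tiled over all frames,
    # to the CSF map (RSF). The four functionals here are assumptions.
    csf = make_csf(log_spec, llds)
    stats = np.concatenate([llds.mean(1), llds.std(1),
                            llds.max(1), llds.min(1)])      # (4*n_lld,)
    tiled = np.tile(stats[:, None], (1, csf.shape[1]))      # (4*n_lld, frames)
    return np.vstack([csf, tiled])

class CnnBlstm(nn.Module):
    # CNN over the 2-D feature map, then a BLSTM over the time axis.
    # feat_bins is assumed divisible by 4 so the pooled shapes line up.
    def __init__(self, feat_bins: int, n_emotions: int, hidden: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),   # pool frequency only, keep time resolution
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.blstm = nn.LSTM(32 * (feat_bins // 4), hidden,
                             batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_emotions)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_bins, frames)
        h = self.cnn(x.unsqueeze(1))             # (batch, 32, feat_bins//4, frames)
        h = h.permute(0, 3, 1, 2).flatten(2)     # (batch, frames, 32*feat_bins//4)
        out, _ = self.blstm(h)                   # (batch, frames, 2*hidden)
        return self.fc(out.mean(1))              # average over time, then classify

In this sketch, feeding make_csf or make_rsf output (converted to a tensor) into CnnBlstm corresponds to evaluating the CSF and RSF variants separately, as the abstract describes.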

Keywords

Speech emotion recognition · Spectrogram · Perceptual features · Convolutional neural network · Bi-directional long short-term memory

Notes

Acknowledgments

The research was supported by the National Natural Science Foundation of China (No. 61771333 and No. U1736219) and a JSPS KAKENHI Grant (No. 16K00297).

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University, Tianjin, China
  2. Japan Advanced Institute of Science and Technology, Ishikawa, Japan
  3. Intelligent Spoken Language Technology (Tianjin) Co., Ltd., Tianjin, China