International Journal of Speech Technology, Volume 21, Issue 4, pp 931–940

Deep and shallow features fusion based on deep convolutional neural network for speech emotion recognition

  • Linhui Sun
  • Jia Chen
  • Keli Xie
  • Ting Gu


Recent years have witnessed great progress in speech emotion recognition using deep convolutional neural networks (DCNNs). To further improve recognition performance, we propose a novel feature fusion method. As the convolutional layers go deeper, the convolutional features of traditional DCNNs gradually become more abstract, and these may not be the best features for speech emotion recognition. On the other hand, shallow features include only global information, without the detailed information extracted by deeper convolutional layers. Based on these observations, we design a deep and shallow feature fusion convolutional network that combines features from different levels of the network for speech emotion recognition. The proposed network allows us to fully exploit both deep and shallow features. Experiments on the popular Berlin data set show that the proposed network further improves the speech emotion recognition rate, which demonstrates its effectiveness.
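The idea of combining features from different network levels can be illustrated with a minimal sketch. It is not the network evaluated in the paper: the layer sizes, the random weights, the use of global average pooling, and fusion by simple concatenation are all assumptions made for illustration, and every function name below is hypothetical.

```python
import math
import random


def conv1d_relu(x, kernels):
    """Valid 1-D convolution followed by ReLU.

    x: input as C_in channels, each a list of T floats.
    kernels: C_out filters, each of shape C_in x K.
    Returns C_out channels of length T - K + 1.
    """
    k = len(kernels[0][0])
    t_out = len(x[0]) - k + 1
    out = []
    for filt in kernels:
        row = []
        for t in range(t_out):
            s = sum(filt[c][j] * x[c][t + j]
                    for c in range(len(x)) for j in range(k))
            row.append(max(0.0, s))  # ReLU
        out.append(row)
    return out


def global_avg_pool(fmap):
    """Reduce each channel to its mean, giving one value per channel."""
    return [sum(ch) / len(ch) for ch in fmap]


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def random_kernels(rng, c_out, c_in, k):
    return [[[rng.uniform(-0.5, 0.5) for _ in range(k)]
             for _ in range(c_in)] for _ in range(c_out)]


def fused_forward(x, k1, k2, w):
    """Forward pass that classifies on shallow and deep features together."""
    shallow = conv1d_relu(x, k1)      # output of the first (shallow) layer
    deep = conv1d_relu(shallow, k2)   # output of the second (deeper) layer
    # Assumed fusion rule: concatenate the pooled features of both levels.
    fused = global_avg_pool(shallow) + global_avg_pool(deep)
    logits = [sum(wi * f for wi, f in zip(row, fused)) for row in w]
    return fused, softmax(logits)


rng = random.Random(0)
x = [[rng.uniform(-1.0, 1.0) for _ in range(20)]]  # toy input: 1 channel, 20 frames
k1 = random_kernels(rng, 4, 1, 3)                  # shallow layer: 4 filters
k2 = random_kernels(rng, 6, 4, 3)                  # deep layer: 6 filters
n_classes = 7                                      # assumed number of emotion classes
w = [[rng.uniform(-0.5, 0.5) for _ in range(4 + 6)] for _ in range(n_classes)]

fused, probs = fused_forward(x, k1, k2, w)
```

In this sketch the classifier sees a 10-dimensional fused vector (4 shallow plus 6 deep values) rather than the 6 deep values alone, which is the sense in which both feature levels are exploited.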


Deep convolutional neural network; Deep and shallow feature fusion; Speech emotion recognition



This work is supported by the National Natural Science Foundation of China (61671252, 61501251) and the Natural Science Foundation of Jiangsu Province (BK20140891).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. College of Telecommunications & Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China
  2. Key Lab of Broadband Wireless Communication and Sensor Network Technology, Ministry of Education, Nanjing University of Posts and Telecommunications, Nanjing, China
