Deep and shallow features fusion based on deep convolutional neural network for speech emotion recognition
Recent years have witnessed great progress in speech emotion recognition using deep convolutional neural networks (DCNNs). To improve recognition performance, we propose a novel feature fusion method. As the convolutional layers go deeper, the features of a traditional DCNN gradually become more abstract, which may not be optimal for speech emotion recognition. On the other hand, shallow features carry only global information and lack the detailed information extracted by deeper convolutional layers. Based on these observations, we design a deep and shallow feature fusion convolutional network, which combines features from different levels of the network for speech emotion recognition. The proposed network allows us to fully exploit both deep and shallow features. We evaluate it on the widely used Berlin data set; the experimental results show that the proposed network further improves the speech emotion recognition rate, demonstrating its effectiveness.
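The fusion idea described above can be sketched as follows. This is a minimal, hypothetical illustration only; the abstract does not specify the actual layer shapes or the fusion operator, so the channel counts, spatial sizes, and the choice of global average pooling followed by concatenation are all assumptions:

```python
import numpy as np

def global_avg_pool(fmap):
    """Pool a (channels, height, width) feature map to a (channels,) vector."""
    return fmap.mean(axis=(1, 2))

# Hypothetical activations: a shallow conv layer keeps large spatial maps
# with few channels, while a deep layer has small maps with many channels.
rng = np.random.default_rng(0)
shallow = rng.random((32, 28, 28))   # assumed shallow-layer output
deep = rng.random((128, 7, 7))       # assumed deep-layer output

# Fuse by pooling each level to a fixed-length vector and concatenating,
# yielding one descriptor that mixes global and detailed information.
fused = np.concatenate([global_avg_pool(shallow), global_avg_pool(deep)])
print(fused.shape)  # (160,)
```

The concatenated vector would then feed the classifier layers; other fusion operators (e.g., weighted sums or learned projections) are equally plausible readings of the abstract.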
Keywords: Deep convolutional neural network · Deep and shallow feature fusion · Speech emotion recognition
This work is supported by the National Natural Science Foundation of China (61671252, 61501251), the Natural Science Foundation of Jiangsu Province (BK20140891).