Speech emotion recognition with unsupervised feature learning

Abstract

Emotion-based features are critical for achieving high performance in a speech emotion recognition (SER) system. In general, such features are difficult to hand-craft because the emotional ground truth is inherently ambiguous. In this paper, we apply several unsupervised feature learning algorithms (including K-means clustering, the sparse auto-encoder, and sparse restricted Boltzmann machines), which show promise for learning task-related features from unlabeled data, to speech emotion recognition. We then evaluate the performance of the proposed approach and present a detailed analysis of two important factors in the model setup: the context window size and the number of hidden-layer nodes. Experimental results show that larger context windows and more hidden nodes contribute to higher performance. We also show that a two-layer network does not clearly improve performance over a single-layer network.
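
The pipeline described in the abstract is easiest to see in its single-layer K-means variant. Below is a minimal, hypothetical Python sketch of that idea: slide a context window over spectrogram frames, learn a centroid dictionary from the unlabeled patches, encode each utterance with a soft "triangle" activation, and train a plain linear classifier on the pooled features. All sizes and names (WINDOW, N_CENTROIDS, the synthetic data) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of single-layer unsupervised feature learning for SER.
# Synthetic arrays stand in for real log-mel spectrograms; all dimensions
# below are hypothetical choices, not taken from the paper.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

N_UTTER, N_FRAMES, N_MELS = 200, 100, 40   # toy corpus dimensions
WINDOW = 5                                  # context window size (frames)
N_CENTROIDS = 100                           # number of "hidden nodes"

# Synthetic stand-in for spectrograms: (utterances, frames, mel bands)
spectrograms = rng.standard_normal((N_UTTER, N_FRAMES, N_MELS))
labels = rng.integers(0, 6, size=N_UTTER)   # e.g., six emotion classes

def windows(spec, w):
    """Slide a w-frame context window over a spectrogram; flatten each patch."""
    return np.stack([spec[i:i + w].ravel() for i in range(spec.shape[0] - w + 1)])

# 1) Collect windowed patches from all utterances (no labels used here).
patches = np.vstack([windows(s, WINDOW) for s in spectrograms])

# 2) Learn a centroid dictionary with K-means (the unsupervised stage).
scaler = StandardScaler().fit(patches)
kmeans = KMeans(n_clusters=N_CENTROIDS, n_init=4, random_state=0)
kmeans.fit(scaler.transform(patches))

def encode(spec):
    """Soft 'triangle' encoding: f_k = max(0, mean(z) - z_k), where z_k is the
    distance to centroid k; then mean-pool over time to one utterance vector."""
    z = kmeans.transform(scaler.transform(windows(spec, WINDOW)))
    f = np.maximum(0.0, z.mean(axis=1, keepdims=True) - z)
    return f.mean(axis=0)

# 3) Encode every utterance and train a simple supervised classifier on top.
X = np.stack([encode(s) for s in spectrograms])
clf = LinearSVC().fit(X, labels)
print("train accuracy (toy data):", clf.score(X, labels))
```

The two factors analyzed in the paper map directly onto WINDOW and N_CENTROIDS, so a sketch of this shape could be rerun with varying values on a real emotional speech corpus to reproduce the window-size and hidden-node comparisons.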

Author information

Corresponding author

Correspondence to Qi-rong Mao.

Additional information

Project supported by the National Natural Science Foundation of China (Nos. 61272211 and 61170126) and the Six Talent Peaks Foundation of Jiangsu Province, China (No. DZXX027)

ORCID: Zheng-wei HUANG, http://orcid.org/0000-0001-7788-0526; Qi-rong MAO, http://orcid.org/0000-0002-5021-9057

About this article

Cite this article

Huang, Z.W., Xue, W.T. & Mao, Q.R. Speech emotion recognition with unsupervised feature learning. Frontiers Inf Technol Electronic Eng 16, 358–366 (2015). https://doi.org/10.1631/FITEE.1400323
