Abstract
As a major component of speech signal processing, speech emotion recognition has become increasingly essential to understanding human communication. Benefiting from deep learning, many researchers have proposed various unsupervised models to extract effective emotional features and supervised models to train emotion recognition systems. In this paper, we utilize semi-supervised ladder networks for speech emotion recognition. The model is trained by minimizing the supervised loss together with an auxiliary unsupervised cost function. The unsupervised auxiliary task yields powerful discriminative representations of the input features and also acts as a regularizer for the supervised emotion recognition task. We also compare the ladder network with other classical autoencoder structures. The experiments were conducted on the interactive emotional dyadic motion capture (IEMOCAP) database, and the results reveal that the proposed method achieves superior performance with a small amount of labelled data and outperforms other methods.
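The combined objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a toy single-hidden-layer encoder with assumed dimensions (40-dimensional acoustic features, 16 hidden units, 4 emotion classes), and it omits the per-layer lateral connections of a full ladder network, keeping only the core idea of adding a weighted denoising-reconstruction cost on unlabelled data to the supervised cross-entropy. The function name `combined_loss` and the weight `lam` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Toy shared encoder plus classifier and decoder heads (assumed sizes).
W_enc = rng.normal(scale=0.1, size=(40, 16))   # 40-dim features -> 16 hidden units
W_cls = rng.normal(scale=0.1, size=(16, 4))    # hidden -> 4 emotion classes
W_dec = rng.normal(scale=0.1, size=(16, 40))   # hidden -> input reconstruction

def combined_loss(x_labelled, y_onehot, x_unlabelled, noise_std=0.3, lam=0.5):
    """Supervised cross-entropy on labelled data plus an unsupervised
    denoising-reconstruction cost on unlabelled data; lam weights the
    auxiliary cost, which acts as a regularizer."""
    # Supervised path: corrupted input -> hidden -> class probabilities.
    x_noisy = x_labelled + noise_std * rng.normal(size=x_labelled.shape)
    h = np.tanh(x_noisy @ W_enc)
    p = softmax(h @ W_cls)
    ce = -np.mean(np.sum(y_onehot * np.log(p + 1e-12), axis=1))
    # Unsupervised path: denoise corrupted unlabelled input back to the clean input.
    xu_noisy = x_unlabelled + noise_std * rng.normal(size=x_unlabelled.shape)
    recon = np.tanh(xu_noisy @ W_enc) @ W_dec
    mse = np.mean((recon - x_unlabelled) ** 2)
    return ce + lam * mse, ce, mse

# A few labelled utterances alongside a larger unlabelled pool.
x_l = rng.normal(size=(8, 40))
y = np.eye(4)[rng.integers(0, 4, size=8)]
x_u = rng.normal(size=(32, 40))
total, ce, mse = combined_loss(x_l, y, x_u)
```

In training, both terms would be minimized jointly by backpropagation, so the encoder weights are shaped by labelled and unlabelled data at the same time, which is what makes the approach effective when labelled data are scarce.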
Change history
13 December 2019
The article Semi-supervised Ladder Networks for Speech Emotion Recognition, written by Jian-Hua Tao, Jian Huang, Ya Li, Zheng Lian and Ming-Yue Niu, was originally published in vol. 16, no. 4 of International Journal of Automation and Computing without Open Access. After publication, the authors decided to opt for Open Choice and to make the article an Open Access publication. Therefore, the copyright of the article has been changed to © The Author(s) 2019 and the article is forthwith distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Nos. 61425017 and 61773379) and the National Key Research & Development Plan of China (No. 2017YFB1002804).
Author information
Additional information
Recommended by Associate Editor Matjaž Gams
The original version of this article was revised due to a retrospective Open Access order.
Jian-Hua Tao received the Ph. D. degree in computer science from Tsinghua University, China in 2001. He is a winner of the National Science Fund for Distinguished Young Scholars and the deputy director of the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), China. He has directed many national projects, including "863" and National Natural Science Foundation of China projects. He has published more than eighty papers in journals and proceedings, including IEEE Transactions on ASLP, ICASSP and INTERSPEECH. He also serves as a steering committee member for IEEE Transactions on Affective Computing and as the chair or a program committee member for major conferences, including the International Conference on Pattern Recognition (ICPR), INTERSPEECH, etc.
His research interests include speech synthesis, affective computing and pattern recognition.
Jian Huang received the B. Eng. degree in automation from Wuhan University, China in 2015. He is a Ph. D. degree candidate in pattern recognition and intelligent system at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, China. He has published papers in INTERSPEECH and ICASSP.
His research interests include affective computing, deep learning and multimodal emotion recognition.
Ya Li received the B. Eng. degree in automation from University of Science and Technology of China (USTC), China in 2007, and the Ph. D. degree in pattern recognition and intelligent system from the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), China in 2012. She is currently an associate professor with CASIA, China. She has published more than 50 papers in related journals and conferences, such as Speech Communication, ICASSP, INTERSPEECH and Affective Computing and Intelligent Interaction (ACII). She won the Second Prize of the Beijing Science and Technology Award in 2014 and the Best Student Paper Award at INTERSPEECH 2016.
Her research interests include affective computing and human-computer interaction.
Zheng Lian received the B. Eng. degree in telecommunication from Beijing University of Posts and Telecommunications, China in 2016. He is a Ph. D. degree candidate in pattern recognition and intelligent system at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), China.
His research interests include affective computing, deep learning and multimodal emotion recognition.
Ming-Yue Niu received the M. Sc. degree in information and computing science from Department of Applied Mathematics, Northwestern Polytechnical University (NWPU), China in 2017. Currently, he is a Ph. D. degree candidate in pattern recognition and intelligent system at the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), China.
His research interests include affective computing and human-computer interaction.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Tao, JH., Huang, J., Li, Y. et al. Semi-supervised Ladder Networks for Speech Emotion Recognition. Int. J. Autom. Comput. 16, 437–448 (2019). https://doi.org/10.1007/s11633-019-1175-x
DOI: https://doi.org/10.1007/s11633-019-1175-x