Semi-supervised Ladder Networks for Speech Emotion Recognition

  • Jian-Hua Tao
  • Jian Huang
  • Ya Li
  • Zheng Lian
  • Ming-Yue Niu
Research Article

Abstract

As a major component of speech signal processing, speech emotion recognition has become increasingly essential to understanding human communication. Benefiting from deep learning, many researchers have proposed various unsupervised models to extract effective emotional features and supervised models to train emotion recognition systems. In this paper, we utilize semi-supervised ladder networks for speech emotion recognition. The model is trained by minimizing the combination of a supervised loss and an auxiliary unsupervised cost function. The auxiliary unsupervised task yields powerful discriminative representations of the input features and also acts as a regularizer for the supervised emotion recognition task. We also compare the ladder network with other classical autoencoder structures. The experiments were conducted on the interactive emotional dyadic motion capture (IEMOCAP) database, and the results reveal that the proposed method achieves superior performance with only a small amount of labelled data and outperforms the other methods.
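As a rough illustration of the training objective described above, the sketch below combines a supervised cross-entropy loss on labelled utterances with an unsupervised denoising-reconstruction cost computed on both labelled and unlabelled utterances. It is a minimal stand-in rather than the authors' implementation: a full ladder network adds lateral connections and per-layer reconstruction costs, and all names, layer sizes, the noise level, and the cost weighting here are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LadderSketch(nn.Module):
    """Simplified denoising encoder with a classification head and a
    reconstruction head; a stand-in for a full ladder network."""
    def __init__(self, in_dim=384, hid_dim=128, n_classes=4, noise_std=0.3):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        self.classifier = nn.Linear(hid_dim, n_classes)   # supervised path
        self.decoder = nn.Linear(hid_dim, in_dim)         # unsupervised path

    def forward(self, x):
        # Corrupt the input, as in denoising autoencoders and ladder networks.
        h = self.encoder(x + self.noise_std * torch.randn_like(x))
        return self.classifier(h), self.decoder(h)

def combined_loss(model, x_lab, y, x_unlab, recon_weight=1.0):
    logits, recon_lab = model(x_lab)
    _, recon_unlab = model(x_unlab)
    supervised = F.cross_entropy(logits, y)
    # Reconstructing the clean inputs is the auxiliary unsupervised cost,
    # which also regularizes the supervised emotion task.
    unsupervised = F.mse_loss(recon_lab, x_lab) + F.mse_loss(recon_unlab, x_unlab)
    return supervised + recon_weight * unsupervised

# Illustrative usage with random tensors standing in for acoustic features.
model = LadderSketch()
x_lab, y = torch.randn(8, 384), torch.randint(0, 4, (8,))
x_unlab = torch.randn(32, 384)
loss = combined_loss(model, x_lab, y, x_unlab)
loss.backward()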

Keywords

Speech emotion recognition, ladder network, semi-supervised learning, autoencoder, regularization

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Nos. 61425017 and 61773379) and the National Key Research and Development Plan of China (No. 2017YFB1002804).

Copyright information

© Institute of Automation, Chinese Academy of Sciences and Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. National Laboratory of Pattern Recognition, Beijing, China
  2. School of Artificial Intelligence, University of Chinese Academy of Sciences (CAS), Beijing, China
  3. CAS Center for Excellence in Brain Science and Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing, China
