Multimedia Tools and Applications, Volume 76, Issue 5, pp 6785–6799

Unsupervised domain adaptation for speech emotion recognition using PCANet

  • Zhengwei Huang
  • Wentao Xue
  • Qirong Mao
  • Yongzhao Zhan


Research in emotion recognition typically develops insights into the variation of emotional features within a single, common domain. Automatic emotion recognition from speech becomes far more challenging when training and test data are drawn from different domains, owing to differences in recording conditions, languages, speakers and many other factors. In this paper, we propose a novel feature transfer approach based on PCANet (a deep network), which extracts both domain-shared and domain-specific latent features to improve performance. The approach uses PCANet to learn multiple intermediate feature representations along an interpolating path between the source and target domains, accounting for the distribution shift between them, and then aligns the representations on this path with the target subspace so that they evolve in the right direction towards the target. To demonstrate the effectiveness of our approach, we select the INTERSPEECH 2009 Emotion Challenge's FAU Aibo Emotion Corpus as the target database and two public databases (ABC and Emo-DB) as the source set. Experimental results demonstrate that the proposed feature transfer learning method outperforms both conventional machine learning methods and other transfer learning methods.
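PCANet's core step is learning a bank of PCA filters from local patches of the input features. The following is a minimal sketch of that idea for 1-D acoustic feature vectors, not the authors' implementation; the function names, the sliding-window patch extraction, and the single-stage setup are illustrative assumptions (the full PCANet stacks two such stages followed by binary hashing and histogramming).

```python
import numpy as np

def pcanet_filters(X, patch_len, n_filters):
    """Learn one stage of PCANet-style filters from 1-D feature vectors.

    X: (n_samples, dim) array of, e.g., frame-level acoustic features.
    patch_len: length of the sliding window used to extract patches.
    n_filters: number of leading PCA filters to keep.
    Returns an (n_filters, patch_len) filter bank.
    """
    patches = []
    for x in X:
        for i in range(len(x) - patch_len + 1):
            patches.append(x[i:i + patch_len])
    P = np.asarray(patches)
    P -= P.mean(axis=0)                      # remove the patch mean
    # Eigenvectors of the patch covariance matrix are the PCA filters.
    cov = P.T @ P / len(P)
    eigvals, eigvecs = np.linalg.eigh(cov)   # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # largest variance first
    return eigvecs[:, order[:n_filters]].T

def pcanet_map(X, filters):
    """Convolve each sample with the filter bank (valid mode)."""
    return np.stack(
        [[np.convolve(x, f, mode="valid") for f in filters] for x in X]
    )
```

Because the filters are eigenvectors of a covariance matrix, no gradient-based training is needed, which is what makes PCANet a cheap "deep" baseline.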


Speech emotion recognition · Domain adaptation · PCANet · Feature mapping
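The alignment step described in the abstract, projecting representations onto a target subspace so they shift towards the target domain, follows the spirit of unsupervised subspace alignment (Fernando et al., 2013). Below is a minimal sketch of that generic technique under assumed inputs, not the paper's exact procedure; the function names and the fixed subspace dimensionality `d` are illustrative.

```python
import numpy as np

def subspace_alignment(S, T, d):
    """Align the source PCA subspace with the target PCA subspace.

    S: (n_s, D) source-domain features; T: (n_t, D) target-domain features.
    d: subspace dimensionality (d <= min(n_s, n_t, D)).
    Returns source features in the aligned space and target projections.
    """
    def pca_basis(X, d):
        Xc = X - X.mean(axis=0)
        # Rows of Vt are principal directions (right singular vectors).
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Vt[:d].T                        # (D, d) orthonormal basis

    Xs = pca_basis(S, d)
    Xt = pca_basis(T, d)
    M = Xs.T @ Xt                              # closed-form alignment matrix
    S_aligned = (S - S.mean(axis=0)) @ Xs @ M  # source mapped towards target
    T_proj = (T - T.mean(axis=0)) @ Xt         # target in its own subspace
    return S_aligned, T_proj
```

A classifier trained on `S_aligned` can then be applied to `T_proj`, since both now live in the target's principal subspace.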



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Zhengwei Huang (1)
  • Wentao Xue (1)
  • Qirong Mao (1)
  • Yongzhao Zhan (1)
  1. School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China
