Multimodel Music Emotion Recognition Using Unsupervised Deep Neural Networks

  • Jianchao Zhou
  • Xiaoou Chen
  • Deshun Yang
Conference paper
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 568)


Most studies on multimodal music emotion recognition combine the modalities in a simple way and use the result for supervised training. The improved experimental results indicate that the modalities are correlated, yet few studies model the relationships between them explicitly. In this paper, we propose to model the relationships between modalities (i.e., lyric and audio data) with deep learning methods for multimodal music emotion recognition. Several deep networks are first applied to perform unsupervised feature learning over multiple modalities. We then design a series of music emotion recognition experiments to evaluate the learned features. The experimental results show that the deep networks perform well at unsupervised feature learning on multimodal data and model the relationships between modalities effectively. In addition, we present a unimodal enhancement experiment, in which the proposed deep network learns better features for one modality (e.g., lyric) when the other modality (e.g., audio) is also present during unsupervised feature learning.
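The abstract does not specify the network architectures, so the following is only a minimal sketch of the general idea, assuming a one-hidden-layer bimodal autoencoder trained on concatenated audio and lyric vectors. All names, layer sizes, the tanh activation, the learning rate, and the synthetic data are illustrative assumptions, not details from the paper. The last function mimics the unimodal enhancement setting: both modalities are seen during unsupervised training, but only lyric input is available when features are extracted.

```python
import numpy as np

def train_bimodal_autoencoder(audio, lyric, hidden=4, epochs=500, lr=0.1, seed=0):
    """Fit a shared hidden layer that reconstructs both modalities (assumed design)."""
    rng = np.random.default_rng(seed)
    x = np.hstack([audio, lyric])              # (n, d_audio + d_lyric)
    n, d = x.shape
    params = {
        "w_enc": rng.normal(0.0, 0.1, (d, hidden)),
        "b_enc": np.zeros(hidden),
        "w_dec": rng.normal(0.0, 0.1, (hidden, d)),
        "b_dec": np.zeros(d),
    }
    losses = []
    for _ in range(epochs):
        h = np.tanh(x @ params["w_enc"] + params["b_enc"])   # shared representation
        err = h @ params["w_dec"] + params["b_dec"] - x      # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the mean squared reconstruction error.
        g_out = 2.0 * err / err.size
        g_pre = (g_out @ params["w_dec"].T) * (1.0 - h ** 2)
        grads = {
            "w_enc": x.T @ g_pre,
            "b_enc": g_pre.sum(axis=0),
            "w_dec": h.T @ g_out,
            "b_dec": g_out.sum(axis=0),
        }
        for name in params:
            params[name] -= lr * grads[name]                 # plain gradient descent
    return params, losses

def encode_lyric_only(params, lyric, audio_dim):
    """Extract hidden features from lyrics alone by zeroing the audio input."""
    x = np.hstack([np.zeros((lyric.shape[0], audio_dim)), lyric])
    return np.tanh(x @ params["w_enc"] + params["b_enc"])

# Synthetic stand-in data: two modalities driven by the same latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 2))
audio = latent @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(200, 8))
lyric = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(200, 6))
audio = (audio - audio.mean(axis=0)) / audio.std(axis=0)     # z-score each feature
lyric = (lyric - lyric.mean(axis=0)) / lyric.std(axis=0)

params, losses = train_bimodal_autoencoder(audio, lyric)
lyric_features = encode_lyric_only(params, lyric, audio_dim=audio.shape[1])
print(losses[0], losses[-1], lyric_features.shape)
```

In a setup like this, `lyric_features` would then be passed to a separate supervised emotion recognizer; that downstream step is omitted here because the abstract does not describe it.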


Keywords: Music emotion recognition; Multimodal learning; Deep neural networks



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Institute of Computer Science and Technology, Peking University, Beijing, China
