Multimedia Tools and Applications

, Volume 76, Issue 9, pp 11809–11837 | Cite as

A comprehensive study on mid-level representation and ensemble learning for emotional analysis of video material



In today’s society where audio-visual content such as professionally edited and user-generated videos is ubiquitous, automatic analysis of this content is a decisive functionality. Within this context, there is an extensive ongoing research about understanding the semantics (i.e., facts) such as objects or events in videos. However, little research has been devoted to understanding the emotional content of the videos. In this paper, we address this issue and introduce a system that performs emotional content analysis of professionally edited and user-generated videos. We concentrate both on the representation and modeling aspects. Videos are represented using mid-level audio-visual features. More specifically, audio and static visual representations are automatically learned from raw data using convolutional neural networks (CNNs). In addition, dense trajectory based motion and SentiBank domain-specific features are incorporated. By means of ensemble learning and fusion mechanisms, videos are classified into one of predefined emotion categories. Results obtained on the VideoEmotion dataset and a subset of the DEAP dataset show that (1) higher level representations perform better than low-level features, (2) among audio features, mid-level learned representations perform better than mid-level handcrafted ones, (3) incorporating motion and domain-specific information leads to a notable performance gain, and (4) ensemble learning is superior to multi-class support vector machines (SVMs) for video affective content analysis.


Video affective content analysis Ensemble learning Deep learning MFCC Color Dense trajectories SentiBank 


  1. 1.
    Acar E, Hopfgartner F, Albayrak S (2014) Understanding affective content of music videos through learned representations International conference on multimedia modelling (MMM), pp. 303–314Google Scholar
  2. 2.
    Acar E, Hopfgartner F, Albayrak S (2015) Fusion of learned multi-modal representations and dense trajectories for emotional analysis in videos. In: IEEE international workshop on content-based multimedia indexing (CBMI), pp. 1–6Google Scholar
  3. 3.
    Baveye Y, Bettinelli J, Dellandréa E, Chen L, Chamaret C (2013) A large video database for computational models of induced emotion. In: Humaine association conference on affective computing and intelligent interaction (ACII), pp. 13–18Google Scholar
  4. 4.
    Baveye Y, Dellandréa E, Chamaret C, Chen L (2015) Deep learning vs. kernel methods: Performance for emotion prediction in videos. In: International conference on affective computing and intelligent interaction (ACII), pp. 77–83Google Scholar
  5. 5.
    Baveye Y, Dellandréa E, Chamaret C, Chen L (2015) LIRIS-ACCEDE: A video database for affective content analysis. IEEE Trans. Affect. Comput 6(1):43–55CrossRefGoogle Scholar
  6. 6.
    Bengio Y, Courville A, Vincent P (2013) Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell 35(8):1798–1828CrossRefGoogle Scholar
  7. 7.
    Borth D, Chen T, Ji R, Chang S (2013) Sentibank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content. In: ACM international conference on multimedia (ACMMM), pp. 459–460Google Scholar
  8. 8.
    Canini L, Benini S, Leonardi R (2013) Affective recommendation of movies based on selected connotative features. IEEE Trans. Circuits Syst. Video Technol 23 (4):636–647CrossRefGoogle Scholar
  9. 9.
    Chang C, Lin C (2011) LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol 2(3):1–27CrossRefGoogle Scholar
  10. 10.
    Chen T, Borth D, Darrell T, Chang S (2014) Deepsentibank: Visual sentiment concept classification with deep convolutional neural networks. Commun. Res. Rep abs/1410:8586Google Scholar
  11. 11.
    Chen T, Yu F X, Chen J, Cui Y, Chen Y, Chang S (2014) Object-based visual sentiment concept analysis and application. In: ACM international conference on multimedia (ACMMM), pp. 367– 376Google Scholar
  12. 12.
    Dumoulin J, Affi D, Mugellini E, Khaled O A, Bertini M, Bimbo A D (2015) Affect recognition in a realistic movie dataset using a hierarchical approach. In: First international workshop on affect andamp; sentiment in multimedia (ASM), pp. 15–20Google Scholar
  13. 13.
    Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann. Stat 32(2):407– 499MathSciNetCrossRefMATHGoogle Scholar
  14. 14.
    Eggink J, Bland D (2012) A large scale experiment for mood-based classification of tv programmes. In: IEEE international conference on multimedia and expo (ICME), pp. 140–145Google Scholar
  15. 15.
    Ellis J G, Lin W S, Lin C, Chang S (2014) Predicting evoked emotions in video. In: IEEE international symposium on multimedia (ISM), pp. 287–294Google Scholar
  16. 16.
    Fan Wu T, Lin C J, Weng R C (2003) Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Res 5:975–1005MathSciNetMATHGoogle Scholar
  17. 17.
    Gunes H, Schuller B (2013) Categorical and dimensional affect analysis in continuous input: current trends and future directions. Image Vis. Comput 31(2):120–136CrossRefGoogle Scholar
  18. 18.
    Irie G, Hidaka K, Satou T, Yamasaki T, Aizawa K (2009) Affective video segment retrieval for consumer generated videos based on correlation between emotions and emotional audio events. In: IEEE international conference on multimedia and expo (ICME), pp. 522–525Google Scholar
  19. 19.
    Irie G, Satou T, Kojima A, Yamasaki T, Aizawa K (2010) Affective audio-visual words and latent topic driving model for realizing movie affective scene classification. IEEE Trans. Multimedia 12(6):523–535CrossRefGoogle Scholar
  20. 20.
    Jeannin S, Divakaran A (2001) Mpeg-7 visual motion descriptors. IEEE Trans. Circuits Syst. Video Technol 11(6):720–724CrossRefGoogle Scholar
  21. 21.
    Ji S, Xu W, Yang M, Yu K (2013) 3d convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell 35(1):221–231CrossRefGoogle Scholar
  22. 22.
    Jia Y, Shelhamer E, Donahue J, Karayev S, Long J, Girshick R, Guadarrama S, Darrell T (2014) Caffe: Convolutional architecture for fast feature embedding. In: ACM international conference on multimedia (ACMMM), pp. 675–678Google Scholar
  23. 23.
    Jiang Y, Xu B, Xue X (2014) Predicting emotions in user-generated videos. In: The AAAI conference on artificial intelligence (AAAI)Google Scholar
  24. 24.
    Koelstra S, Mühl C, Soleymani M, Lee J, Yazdani A, Ebrahimi T, Pun T, Nijholt A, Patras I (2012) Deap: A database for emotion analysis; using physiological signals. IEEE Trans. Affect. Comput 3(1):18–31CrossRefGoogle Scholar
  25. 25.
    Krizhevsky A, Sutskever I, Hinton G E (2012) Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems (NIPS), pp. 1097–1105Google Scholar
  26. 26.
    Li T L, Chan A B, Chun A H (2010) Automatic musical pattern feature extraction using convolutional neural network. In: International multiconference of engineers and computer scientists (IMECS)Google Scholar
  27. 27.
    Mairal J, Bach F, Ponce J, Sapiro G (2010) Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res 11:19–60MathSciNetMATHGoogle Scholar
  28. 28.
    Niu J, Zhao X, Abdul Aziz M A (2015) A novel affect-based model of similarity measure of videos. Neurocomputing (in press)Google Scholar
  29. 29.
    Pang L, Ngo C W (2015) Multimodal learning with deep boltzmann machine for emotion prediction in user generated videos. In: ACM international conference on multimedia retrieval (ICMR), pp. 619–622Google Scholar
  30. 30.
    Plutchik R, Kellerman H (1986) Emotion: theory research and experience, vol 3. Academic press, New YorkGoogle Scholar
  31. 31.
    Safadi B, Quénot G (2015) A factorized model for multiple SVM and multi-label classification for large scale multimedia indexing. In: 13th international workshop on content-based multimedia indexing, CBMI 2015, Prague, Czech Republic, June 10-12, 2015, pp. 1–6Google Scholar
  32. 32.
    Schmidt E, Scott J, Kim Y (2012) Feature learning in dynamic environments: Modeling the acoustic structure of musical emotion. In: International society for music information retrieval conference (ISMIR), pp. 325–330Google Scholar
  33. 33.
    Soleymani M, Aljanaki A, Wiering F, Veltkamp R C (2015) Content-based music recommendation using underlying music preference structure. In: 2015 IEEE international conference on multimedia and expo (ICME), pp. 1–6Google Scholar
  34. 34.
    Sturm B L, Noorzad P (2012) On automatic music genre recognition by sparse representation classification using auditory temporal modulations. In: International symposium on computer music modeling and retrieval, pp. 379–394Google Scholar
  35. 35.
    Valdez P, Mehrabian A (1994) Effects of color on emotions. J. Exp. Psychol. Gen 123(4):394– 409CrossRefGoogle Scholar
  36. 36.
    Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proc. IEEE international conference on computer vision (ICCV), pp. 3551–3558Google Scholar
  37. 37.
    Wang H L, Cheong L (2006) Affective understanding in film. IEEE Trans. Circuits Syst. Video Technol 16(6):689–704CrossRefGoogle Scholar
  38. 38.
    Wang S, Ji Q (2015) Video affective content analysis: A survey of state-of-the-art methods. IEEE Trans. Affect. Comput 6(4):410–430CrossRefGoogle Scholar
  39. 39.
    Wimmer M, Schuller B, Arsic D, Rigoll G, Radig B (2008) Low-level fusion of audio and video feature for multi-modal emotion recognition. In: International joint conference on computer vision, imaging and computer graphics theory and applications, pp. 145–151Google Scholar
  40. 40.
    Xu B, Fu Y, Jiang Y, Li B, Sigal L (2015) Heterogeneous knowledge transfer in video emotion recognition, attribution and summarization. Commun. Res. Rep abs/1511:04798Google Scholar
  41. 41.
    Xu C, Cetintas S, Lee K, Li L (2014) Visual sentiment prediction with deep convolutional neural networks. Commun. Res. Rep abs/1411:5731Google Scholar
  42. 42.
    Xu M, Wang J, He X, Jin J S, Luo S, Lu H (2014) A three-level framework for affective content analysis and its case studies. Multimedia Tools and Applications 70(2):757–779CrossRefGoogle Scholar
  43. 43.
    Yang X, Wang K, Shamma S A (1992) Auditory representations of acoustic signals. IEEE Trans. Inf. Theory 38(2):824–839CrossRefGoogle Scholar
  44. 44.
    Yazdani A, Kappeler K, Ebrahimi T (2011) Affective content analysis of music video clips. In: ACM international workshop on music information retrieval with user-centered and multimodal strategies (MIRUM), pp. 7–12Google Scholar
  45. 45.
    Yucel Z, Salah A A (2009) Resolution of focus of attention using gaze direction estimation and saliency computation. In: International conference on affective computing and intelligent interaction (ACII), pp. 1–6Google Scholar
  46. 46.
    Zhou Z (2012) Ensemble methods: foundations and algorithms CRC PressGoogle Scholar

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  1. 1.DAI LaboratoryTechnische Universität BerlinBerlinGermany
  2. 2.Humanities Advanced Technology and Information InstituteUniversity of GlasgowGlasgowUK

Personalised recommendations