
A comprehensive study on mid-level representation and ensemble learning for emotional analysis of video material

Published in: Multimedia Tools and Applications

Abstract

In today’s society, where audio-visual content such as professionally edited and user-generated videos is ubiquitous, the automatic analysis of this content is an essential capability. Within this context, there is extensive ongoing research on understanding the semantics (i.e., facts) of videos, such as the objects or events they contain. However, little research has been devoted to understanding their emotional content. In this paper, we address this issue and introduce a system that performs emotional content analysis of professionally edited and user-generated videos. We concentrate on both the representation and the modeling aspects. Videos are represented using mid-level audio-visual features. More specifically, audio and static visual representations are automatically learned from raw data using convolutional neural networks (CNNs). In addition, dense-trajectory-based motion features and SentiBank domain-specific features are incorporated. By means of ensemble learning and fusion mechanisms, videos are classified into one of several predefined emotion categories. Results obtained on the VideoEmotion dataset and a subset of the DEAP dataset show that (1) higher-level representations perform better than low-level features, (2) among audio features, mid-level learned representations perform better than mid-level handcrafted ones, (3) incorporating motion and domain-specific information leads to a notable performance gain, and (4) ensemble learning is superior to multi-class support vector machines (SVMs) for video affective content analysis.
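To make the fusion and ensemble idea above concrete, the sketch below illustrates a weighted late-fusion scheme over per-modality classifiers. It is purely illustrative and not the authors' implementation: the feature matrices are synthetic stand-ins for the CNN-learned audio and visual representations, dense-trajectory motion features, and SentiBank scores; the fusion weights are hypothetical; and scikit-learn's RandomForestClassifier and SVC merely stand in for the paper's ensemble learner and multi-class SVM baseline.

```python
# Illustrative late-fusion sketch (not the authors' code): per-modality
# classifiers output class probabilities that are combined by a weighted sum.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_videos, n_classes = 200, 8            # e.g. Plutchik's eight emotion categories

# Synthetic stand-ins for the mid-level representations described in the abstract:
# CNN-learned audio, CNN-learned static visual, dense-trajectory motion, SentiBank.
modalities = {
    "audio_cnn":  rng.normal(size=(n_videos, 128)),
    "visual_cnn": rng.normal(size=(n_videos, 256)),
    "trajectory": rng.normal(size=(n_videos, 96)),
    "sentibank":  rng.normal(size=(n_videos, 1200)),
}
y = rng.integers(0, n_classes, size=n_videos)   # emotion labels

train = np.arange(n_videos) < 150               # simple train/test split
weights = {"audio_cnn": 0.25, "visual_cnn": 0.3, "trajectory": 0.25, "sentibank": 0.2}

# Late fusion: weighted sum of per-modality class probabilities.
# (Assumes every class appears in the training split, so probability columns align.)
fused = np.zeros((int((~train).sum()), n_classes))
for name, X in modalities.items():
    clf = RandomForestClassifier(n_estimators=200, random_state=0)  # ensemble learner
    clf.fit(X[train], y[train])
    fused += weights[name] * clf.predict_proba(X[~train])
ensemble_pred = fused.argmax(axis=1)

# Multi-class SVM baseline on early-fused (concatenated) features, for comparison.
X_all = np.hstack(list(modalities.values()))
svm = SVC(kernel="rbf").fit(X_all[train], y[train])
svm_pred = svm.predict(X_all[~train])

print("late-fusion ensemble accuracy:", (ensemble_pred == y[~train]).mean())
print("multi-class SVM baseline accuracy:", (svm_pred == y[~train]).mean())
```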


Notes

  1. https://www.flickr.com/

  2. https://www.youtube.com/

  3. http://www.ee.columbia.edu/ln/dvmm/vso/download/sentibank.html

  4. https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox

  5. https://github.com/rasmusbergpalm/DeepLearnToolbox/

  6. https://lear.inrialpes.fr/people/wang/improved_trajectories


Acknowledgements

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7) under grant agreement no. 261743 (NoE VideoSense).

Author information

Correspondence to Esra Acar.


About this article

Cite this article

Acar, E., Hopfgartner, F. & Albayrak, S. A comprehensive study on mid-level representation and ensemble learning for emotional analysis of video material. Multimed Tools Appl 76, 11809–11837 (2017). https://doi.org/10.1007/s11042-016-3618-5
