Exploiting objective text description of images for visual sentiment analysis


This paper addresses the problem of Visual Sentiment Analysis, focusing on estimating the polarity of the sentiment evoked by an image. Starting from an embedding approach that exploits both visual and textual features, we attempt to boost the contribution of each input view. We propose to extract and employ an Objective Text description of images in place of the classic Subjective Text provided by users (i.e., title, tags and image description), which is extensively exploited in the state of the art to infer the sentiment associated with social images. Objective Text is obtained from the visual content of the images through recent deep learning architectures, which are used to classify objects and scenes and to perform image captioning. Objective Text features are then combined with visual features in an embedding space obtained with Canonical Correlation Analysis. The sentiment polarity is then inferred by a supervised Support Vector Machine. In the evaluation, we compared a large number of combinations of textual and visual features, as well as baselines obtained from state-of-the-art methods. Experiments performed on a representative dataset of 47235 labelled samples demonstrate that exploiting Objective Text helps to outperform the state of the art in sentiment polarity estimation.

Notes

  1. Our implementation uses the MVSO English model provided by [23], which corresponds to the DeepSentiBank CNN fine-tuned to predict 4342 English Adjective Noun Pairs.
  2. The code to repeat the performance evaluation is available at: http://iplab.dmi.unict.it/sentimentembedding/


References

  1. Ahmad K, Mekhalfi ML, Conci N, Melgani F, Natale FD (2018) Ensemble of deep models for event recognition. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14(2):51
  2. Baecchi C, Uricchio T, Bertini M, Del Bimbo A (2016) A multimodal feature learning approach for sentiment analysis of social network multimedia. Multimed Tools Appl 75(5):2507–2525
  3. Battiato S, Farinella GM, Milotta FL, Ortis A, Addesso L, Casella A, D’Amico V, Torrisi G (2016) The social picture. In: Proceedings of the 2016 ACM on international conference on multimedia retrieval, pp 397–400. ACM
  4. Battiato S, Moltisanti M, Ravì F, Bruna AR, Naccari F (2013) Aesthetic scoring of digital portraits for consumer applications. In: IS&T/SPIE electronic imaging, pp 866008–866008. International Society for Optics and Photonics
  5. Borth D, Ji R, Chen T, Breuel T, Chang SF (2013) Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: Proceedings of the 21st ACM international conference on multimedia, pp 223–232. ACM
  6. Campos V, Jou B, Giró-i-Nieto X (2017) From pixels to sentiment: Fine-tuning CNNs for visual sentiment prediction. Image and Vision Computing 65:15–22 (Multimodal Sentiment Analysis and Mining in the Wild). https://doi.org/10.1016/j.imavis.2017.01.011. http://www.sciencedirect.com/science/article/pii/S0262885617300355
  7. Campos V, Salvador A, Giró-i-Nieto X, Jou B (2015) Diving deep into sentiment: Understanding fine-tuned CNNs for visual sentiment prediction. In: Proceedings of the 1st international workshop on affect & sentiment in multimedia, ASM ’15, pp 57–62. ACM, New York. https://doi.org/10.1145/2813524.2813530
  8. Chen T, Borth D, Darrell T, Chang SF (2014) DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks. arXiv:1410.8586
  9. Cui P, Liu S, Zhu W (2017) General knowledge embedded image representation learning. IEEE Transactions on Multimedia
  10. Datta R, Joshi D, Li J, Wang JZ (2006) Studying aesthetics in photographic images using a computational approach. In: European conference on computer vision, pp 288–301. Springer
  11. Esuli A, Sebastiani F (2006) SentiWordNet: A publicly available lexical resource for opinion mining. In: Proceedings of The European language resources association, vol 6, pp 417–422. Citeseer
  12. Fu Y, Hospedales TM, Xiang T, Fu Z, Gong S (2014) Transductive multi-view embedding for zero-shot recognition and annotation. In: Proceedings of the European conference on computer vision, pp 584–599. Springer
  13. Gong Y, Ke Q, Isard M, Lazebnik S (2014) A multi-view embedding space for modeling internet images, tags, and their semantics. Int J Comput Vis 106(2):210–233
  14. Gong Y, Wang L, Hodosh M, Hockenmaier J, Lazebnik S (2014) Improving image-sentence embeddings using large weakly annotated photo collections. In: Proceedings of the European conference on computer vision, pp 529–545. Springer
  15. Guillaumin M, Verbeek J, Schmid C (2010) Multimodal semi-supervised learning for image classification. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 902–909. IEEE
  16. Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Comput 16(12):2639–2664
  17. Huang F, Zhang X, Zhao Z, Xu J, Li Z (2019) Image–text sentiment analysis via deep multimodal attentive fusion. Knowl-Based Syst 167:26–37
  18. Hung C, Lin HK (2013) Using objective words in SentiWordNet to improve sentiment classification for word of mouth. IEEE Intell Syst 28(2):47–54
  19. Hwang SJ, Grauman K (2010) Accounting for the relative importance of objects in image retrieval. In: Proceedings of British machine vision conference, vol 1, 2
  20. Hwang SJ, Grauman K (2012) Learning the relative importance of objects from tagged images for retrieval and cross-modal search. Int J Comput Vis 100(2):134–153
  21. Itten J (1962) The art of color; the subjective experience and objective rationale of colour
  22. Johnson J, Ballan L, Fei-Fei L (2015) Love thy neighbors: Image annotation by exploiting image metadata. In: Proceedings of the IEEE international conference on computer vision, pp 4624–4632
  23. Jou B, Chen T, Pappas N, Redi M, Topkara M, Chang SF (2015) Visual affect around the world: A large-scale multilingual visual sentiment ontology. In: Proceedings of the 23rd ACM international conference on multimedia, pp 159–168. ACM
  24. Karpathy A, Fei-Fei L (2015) Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3128–3137
  25. Katsurai M, Satoh S (2016) Image sentiment analysis using latent correlations among visual, textual, and sentiment views. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing, pp 2837–2841. IEEE
  26. Lei X, Qian X, Zhao G (2016) Rating prediction based on social sentiment from textual reviews. IEEE Trans Multimed 18(9):1910–1921
  27. Li X, Uricchio T, Ballan L, Bertini M, Snoek CG, Bimbo AD (2016) Socializing the semantic gap: A comparative survey on image tag assignment, refinement, and retrieval. ACM Comput Surveys (CSUR) 49(1):14
  28. Machajdik J, Hanbury A (2010) Affective image classification using features inspired by psychology and art theory. In: Proceedings of the 18th ACM international conference on multimedia, pp 83–92. ACM
  29. Thelwall M, Buckley K, Paltoglou G, Cai D, Kappas A (2010) Sentiment strength detection in short informal text. Journal of the Association for Information Science and Technology 61(12):2544–2558
  30. Miller GA (1995) WordNet: a lexical database for English. In: Communications of the ACM, vol 38, pp 39–41. ACM
  31. Ortis A, Farinella GM, Torrisi G, Battiato S (2018) Visual sentiment analysis based on objective text description of images. In: 2018 International conference on content-based multimedia indexing (CBMI), pp 1–6. IEEE
  32. Pang L, Zhu S, Ngo CW (2015) Deep multimodal learning for affective analysis and retrieval. IEEE Trans Multimed 17(11):2008–2020
  33. Perronnin F, Sánchez J, Liu Y (2010) Large-scale image categorization with explicit data embedding. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2297–2304
  34. Qian S, Zhang T, Xu C, Shao J (2016) Multi-modal event topic model for social event analysis. IEEE Trans Multimed 18(2):233–246
  35. Rahimi A, Recht B (2007) Random features for large-scale kernel machines. In: Proceedings of the neural information processing systems, vol 3, pp 5
  36. Rasiwasia N, Costa Pereira J, Coviello E, Doyle G, Lanckriet GR, Levy R, Vasconcelos N (2010) A new approach to cross-modal multimedia retrieval. In: Proceedings of the 18th ACM international conference on multimedia, pp 251–260. ACM
  37. Rudinac S, Larson M, Hanjalic A (2013) Learning crowdsourced user preferences for visual summarization of image collections. IEEE Trans Multimed 15(6):1231–1243
  38. Siersdorfer S, Minack E, Deng F, Hare J (2010) Analyzing and predicting sentiment of images on the social web. In: Proceedings of the 18th ACM international conference on multimedia, pp 715–718. ACM
  39. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition
  40. Valdez P, Mehrabian A (1994) Effects of color on emotions. In: Journal of experimental psychology: General, vol. 123, p. 394. American Psychological Association
  41. Wang G, Hoiem D, Forsyth D (2009) Building text features for object image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1367–1374
  42. Wang Y, Wang S, Tang J, Liu H, Li B (2015) Unsupervised sentiment analysis for social media images. In: Proceedings of the 24th international joint conference on artificial intelligence, Buenos Aires, Argentina, pp 2378–2379
  43. Xu C, Cetintas S, Lee K, Li L (2014) Visual sentiment prediction with deep convolutional neural networks. arXiv:1411.5731
  44. Yang X, Zhang T, Xu C (2015) Cross-domain feature learning in multimedia. IEEE Trans Multimed 17(1):64–78
  45. You Q, Cao L, Cong Y, Zhang X, Luo J (2015) A multifaceted approach to social multimedia-based prediction of elections. IEEE Trans Multimed 17(12):2271–2280
  46. You Q, Luo J, Jin H, Yang J (2015) Robust image sentiment analysis using progressively trained and domain transferred deep networks. In: 29th AAAI conference on artificial intelligence
  47. Yu FX, Cao L, Feris RS, Smith JR, Chang SF (2013) Designing category-level attributes for discriminative visual recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 771–778
  48. Yuan J, McDonough S, You Q, Luo J (2013) Sentribute: image sentiment analysis from a mid-level perspective. In: Proceedings of the 2nd international workshop on issues of sentiment discovery and opinion mining. ACM
  49. Yuan Z, Sang J, Xu C (2013) Tag-aware image classification via nested deep belief nets. In: 2013 IEEE international conference on multimedia and expo (ICME), pp 1–6. IEEE
  50. Yuan Z, Sang J, Xu C, Liu Y (2014) A unified framework of latent feature learning in social media. IEEE Trans Multimed 16(6):1624–1635
  51. Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A (2014) Learning deep features for scene recognition using places database. In: Advances in neural information processing systems, pp 487–495
  52. Zhu X, Cao B, Xu S, Liu B, Cao J (2019) Joint visual-textual sentiment analysis based on cross-modality attention mechanism. In: International conference on multimedia modeling, pp 264–276. Springer


This work has been partially supported by Telecom Italia TIM - Joint Open Lab.

Author information

Correspondence to Alessandro Ortis.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ortis, A., Farinella, G.M., Torrisi, G. et al. Exploiting objective text description of images for visual sentiment analysis. Multimed Tools Appl (2020). https://doi.org/10.1007/s11042-019-08312-7


Keywords

  • Visual sentiment analysis
  • Objective text features
  • Embedding spaces
  • Social media