Multilingual visual sentiment concept clustering and analysis

  • Nikolaos Pappas
  • Miriam Redi
  • Mercan Topkara
  • Hongyi Liu
  • Brendan Jou
  • Tao Chen
  • Shih-Fu Chang
Regular Paper


Visual content is a rich medium that can be used to communicate not only facts and events, but also emotions and opinions. In some cases, visual content may carry a universal affective bias (e.g., natural disasters or beautiful scenes). Often, however, ensuring that the affect a visual medium evokes in its recipient matches what its author intended requires a deep understanding, and even a sharing, of cultural backgrounds. In this study, we propose a computational framework for the clustering and analysis of multilingual visual affective concepts used in different languages, which enables us to pinpoint alignable differences (via similar concepts) and nonalignable differences (via unique concepts) across cultures. To do so, we crowdsource sentiment labels for the MVSO dataset, which contains 16K multilingual visual sentiment concepts and 7.3M images tagged with these concepts. We then represent these concepts in a distribution-based word vector space via (1) pivotal translation or (2) cross-lingual semantic alignment, and evaluate these representations on three tasks, all across languages: affective concept retrieval, concept clustering, and sentiment prediction. The proposed clustering framework enables both quantitative and qualitative analysis of the large multilingual dataset. We also present a novel use case based on a subset of facial images and explore cultural insights about visual sentiment concepts in such portrait-focused images.
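As a rough illustration of the cross-lingual concept retrieval task mentioned above, the following minimal sketch ranks concepts from one language by cosine similarity to a query concept from another. It rests on several assumptions that are not taken from the paper: the word vectors are made-up toy values that are presumed to live in an already aligned space, a multi-word concept is represented by the sum of its word vectors, and the names `TOY_VECTORS`, `concept_vector`, `cosine`, and `retrieve` are hypothetical.

```python
import math

# Toy 3-dimensional word vectors, assumed to be already aligned across
# English and French. The values are invented purely for illustration.
TOY_VECTORS = {
    "happy": [0.9, 0.1, 0.0],
    "sad": [-0.8, 0.2, 0.1],
    "dog": [0.1, 0.9, 0.2],
    "heureux": [0.85, 0.15, 0.05],
    "triste": [-0.75, 0.25, 0.05],
    "chien": [0.15, 0.85, 0.25],
}


def concept_vector(concept, vectors):
    """Represent a multi-word concept as the sum of its word vectors.

    Summation is an assumption made for this sketch; other composition
    functions are possible.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in concept.split():
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return total


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, candidates, vectors):
    """Rank candidate concepts by decreasing similarity to the query concept."""
    query_vec = concept_vector(query, vectors)
    return sorted(
        candidates,
        key=lambda c: cosine(query_vec, concept_vector(c, vectors)),
        reverse=True,
    )


# Rank two French concepts against an English query concept.
ranking = retrieve("happy dog", ["chien triste", "chien heureux"], TOY_VECTORS)
print(ranking)
```

With real data, the toy dictionary would be replaced by vectors obtained through either of the two routes named in the abstract (pivotal translation or cross-lingual semantic alignment); the ranking step itself would stay the same.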


Keywords: Multilingual, Language, Cultures, Cross-cultural, Emotion, Sentiment, Ontology, Concept detection, Social multimedia


Copyright information

© Springer-Verlag London 2017

Authors and Affiliations

  • Nikolaos Pappas (1, corresponding author)
  • Miriam Redi (2)
  • Mercan Topkara (3)
  • Hongyi Liu (4)
  • Brendan Jou (4)
  • Tao Chen (4)
  • Shih-Fu Chang (4)

  1. Idiap Research Institute, Martigny, Switzerland
  2. Nokia Bell Labs, Cambridge, UK
  3. Teachers Pay Teachers, New York, USA
  4. Columbia University, New York, USA
