Skip to main content
Log in

Estimating the imageability of words by mining visual characteristics from crawled image data

  • Published:
Multimedia Tools and Applications Aims and scope Submit manuscript


Natural Language Processing and multi-modal analyses are key elements in many applications. However, the semantic gap is an everlasting problem, leading to unnatural results disconnected from the user’s perception. To understand semantics in multimedia applications, human perception needs to be taken into consideration. Imageability is an approach originating from Pyscholinguistics to quantize the human perception of words. Research shows a relationship between language usage and the imageability of words, making it useful for multimodal applications. However, the creation of imageability datasets is often manual and labor-intensive. In this paper, we propose a method using image data mining of a variety of visual features to estimate the imageability of words. The main assumption is a relationship between the imageability of concepts, human perception, and the contents of Web-crawled images. Using a set of low- and high-level visual features from Web-crawled images, a model is trained to predict imageability. The evaluations show that the imageability can be predicted with both a sufficiently low error, and a high correlation to the ground-truth annotations. The proposed method can be used to increase the corpus of imageability dictionaries.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
EUR 32.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or Ebook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5

Similar content being viewed by others




  3. Parts-of-speech are obtained using NLTK [29] and may thus have slight error due to ambiguities.



  1. Bai S, An S (2018) A survey on automatic image caption generation. Neurocomputing 311:291–304.

    Google Scholar 

  2. Balahur A, Mohammad S M, Hoste V, Klinger R (eds.) (2018) Proc. 9th Workshop on Computational Approaches to Subjectivity Sentiment and Social Media Analysis, ACL, Stroudsburg, PA, USA

  3. Bay H, Ess A, Tuytelaars T, Van Gool L (2008) Speeded-Up Robust Features (SURF). Comput Vis Image Underst 110(3):346–359.

    Google Scholar 

  4. Breiman L (2001) Random forests. Mach Learn 45 (1):5–32.

    MATH  Google Scholar 

  5. Charbonnier J, Wartena C (2019) Predicting word concreteness and imagery. In: Proc. 13th Int. Conf. on Computational Semantics, pp 176–187.

  6. Chollet F, et al. (2015) Keras.

  7. Coltheart M (1981) The MRC psycholinguistic database. Q J Exp Psychol A 33 (4):497–505.

    Google Scholar 

  8. Coltheart V, Laxon V J, Keating C (1988) Effects of word imageability and age of acquisition on children’s reading. Br J Psychol 79(1):1–12.

    Google Scholar 

  9. Comaniciu D, Meer P (2002) Mean Shift: A robust approach toward feature space analysis. IEEE Trans Pattern Anal Mach Intell 24(5):603–619.

    Google Scholar 

  10. Cortese M J, Fugett A (2004) Imageability ratings for 3,000 monosyllabic words. Behav Res Methods Instrum Comput 36(3):384–387.

    Google Scholar 

  11. Csurka G, Dance CR, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Proc. ECCV 2004 Workshop on Statistical Learning in Computer Vision, pp 1–22

  12. Deng JDJ, Dong WDW, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: A large-scale hierarchical image database. In: Proc. 2009 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp 2–9,

  13. Divvala SK, Farhadi A, Guestrin C (2014) Learning everything about anything: Webly-supervised visual concept learning. In: Proc. 2014 IEEE Conf. on Computer Vision and Pattern Recognition, pp 3270–3277,

  14. Douze M, Jégou H, Sandhawalia H, Amsaleg L, Schmid C (2009) Evaluation of GIST descriptors for Web-scale image search. In: Proc. ACM Int. Conf. on Image and Video Retrieval 2009, pp 19:1–19:8.

  15. Fast E, Chen B, Bernstein M S (2016) Empath: Understanding topic signals in large-scale text. Computing Research Repository. arXiv:1602.06979

  16. Giesbrecht B, Camblin C C, Swaab T Y (2004) Separable effects of semantic priming and imageability on word processing in human cortex. Cereb Cortex 14(5):521–529

    Google Scholar 

  17. Hessel J, Mimno D, Lee L (2018) Quantifying the visual concreteness of words and topics in multimodal datasets. In: Proc. 2018 Conf. of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol 1, pp 2194–2205.

  18. Hewitt J, Ippolito D, Callahan B, Kriz R, Wijaya D T, Callison-Burch C (2018) Learning translations via images with a massively multilingual image dataset. In: Proc. 56th Annual Meeting of the Association for Computational Linguistics, vol 1, pp 2566–2576.

  19. Holzinger A, Biemann C, Pattichis C S, Kell DB (2017a) What do we need to build explainable AI systems for the medical domain. Computing Research Repository. arXiv:1712.09923

  20. Holzinger A, Malle B, Kieseberg P, Roth P M, Mu̇ller H, Reihs R, Zatloukal K (2017b) Towards the augmented pathologist: Challenges of explainable-AI in digital pathology. Computing Research Repository. arXiv:1712.06657

  21. Inoue N, Shinoda K (2016) Adaptation of word vectors using tree structure for visual semantics. In: Proc. 24th ACM Multimedia Conf., pp 277–281.

  22. Itseez (2015) Open source computer vision library.

  23. Jones G V (1985) Deep dyslexia, imageability, and ease of predication. Brain Lang 24(1):1–19.

    Google Scholar 

  24. Kastner M A, Ide I, Kawanishi Y, Hirayama T, Deguchi D, Murase H (2019) Estimating the visual variety of concepts by referring to Web popularity. Multimed Tools Appl 78(7):9463–9488.

    Google Scholar 

  25. Kawakubo H, Akima Y, Yanai K (2010) Automatic construction of a folksonomy-based visual ontology. In: Proc. 2010 IEEE Int. Symposium on Multimedia, pp 330–335.

  26. Kohara Y, Yanai K (2013) Visual analysis of tag co-occurrence on nouns and adjectives. In: Li S, El Saddik A, Wang M, Mei T, Sebe N, Yan S, Hong R, Gurrin C (eds) Advances in Multimedia Modeling: 19th Int. Conf. on Multimedia Modeling Procs., Springer, Lecture Notes in Computer Science, vol 7732, pp 47–57.

  27. Li JJ, Nenkova A (2015) Fast and accurate prediction of sentence specificity. In: Proc. 29th AAAI Conf. on Artificial Intelligence, pp 2281–2287

  28. LjubeŠić N, FiŠer D, Peti-Stantić A (2018) Predicting concreteness and imageability of words within and across languages via word embeddings. In: Proc. 3rd Workshop on Representation Learning for NLP, pp 217–222,

  29. Loper E, Bird S (2002) NLTK: The Natural Language Toolkit. In: Proc. ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics Vol. 1, pp 63–70.

  30. Ma W, Golinkoff R M, Hirsh-Pasek K, McDonough C, Tardif T (2009) Imageability predicts the age of acquisition of verbs in Chinese children. J Child Lang 36:405–423.

    Google Scholar 

  31. Miller GA (1995) WordNet: A lexical database for English. Comm ACM 38 (11):39–41.

    Google Scholar 

  32. Paivio A, Yuille J C, Madigan S A (1968) Concreteness, imagery, and meaningfulness values for 925 nouns. J Exp Psychol 76(1):1–25

    Google Scholar 

  33. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res 12:2825–2830

    MathSciNet  MATH  Google Scholar 

  34. Pennebaker J W, Francis M E, Booth R J (2001) Linguistic Inquiry and Word Count: LIWC 2001. Erlbaum, Mahwah, NJ, USA

    Google Scholar 

  35. Redmon J, Farhadi A (2016) YOLO9000: Better, faster, stronger. Computing Research Repository. arXiv:1612.08242

  36. Reilly J, Kean J (2010) Formal distinctiveness of high- and low-imageability nouns: Analyses and theoretical implications. J Cogn Sci 31(1):157–168.

    Google Scholar 

  37. Ringeval F, Schuller B, Valstar M, Cowie R, Kaya H, Schmitt M, Amiriparian S, Cummins N, Lalanne D, Michaud A, Ciftçi E, Güleç H, Salah A A (2018) Proc. 2018 Audio/Visual Emotion Challenge and Workshop. ACM, New York, NY, USA, Pantic M (ed)

  38. Samek W, Wiegand T, Mu̇ller K (2017) Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. Computing Research Repository. arXiv:1708.08296

  39. Schwanenflugel P J (2013) Why are abstract concepts hard to understand? in: The Psychology of Word Meanings, Psychology Press, New York, NY, USA, pp 235–262

  40. Shu X, Qi GJ, Tang J, Wang J (2015) Weakly-shared deep transfer networks for heterogeneous-domain knowledge propagation. In: Proc. 23rd ACM Multimedia Conf., pp 35–44.

  41. Sianipar A, van Groenestijn P, Dijkstra T (2016) Affective meaning, concreteness, and subjective frequency norms for Indonesian words. Front Psychol 7:1907.

    Google Scholar 

  42. Smolik F, Kriz A (2015) The power of imageability: How the acquisition of inflected forms is facilitated in highly imageable verbs and nouns in Czech children. J First Lang 35(6):446–465.

    Google Scholar 

  43. Sun C, Shrivastava A, Singh S, Gupta A (2017) Revisiting unreasonable effectiveness of data in deep learning era. Computing Research Repository. arXiv:1707.02968

  44. Tanaka S, Jatowt A, Kato MP, Tanaka K (2013) Estimating content concreteness for finding comprehensible documents. In: Proc. 6th ACM Int. Conf. on Web Search and Data Mining, pp 475–484.

  45. Tang J, Shu X, Li Z, Qi G J, Wang J (2016) Generalized Deep Transfer Networks for Knowledge Propagation in Heterogeneous Domains. ACM Trans. Multimed Comput. Commun. Appl. 12(4s):1–22.

    Google Scholar 

  46. Tang J, Shu X, Qi G, Li Z, Wang M, Yan S, Jain R (2017) Tri-clustered tensor completion for social-aware image tag refinement. IEEE Trans Pattern Anal Mach Intell 39(8):1662–1674.

    Google Scholar 

  47. Tang J, Shu X, Li Z, Jiang Y, Tian Q (2019) Social anchor-unit graph regularized tensor completion for large-scale image retagging. IEEE Trans Pattern Anal Mach Intell 41(8):2027–2034.

    Google Scholar 

  48. Thomee B, Shamma DA, Friedland G, Elizalde B, Ni K, Poland D, Borth D, Li LJ (2016) YFCC100M: The new data in multimedia research. Comm ACM 59(2):64–73.

    Google Scholar 

  49. Vidanapathirana M (2018) YOLO3-4-Py.

  50. Yanai K, Barnard K (2005) Image region entropy: A measure of “visualness” of Web images associated with one concept. In: Proc. 13th ACM Multimedia Conf., pp 419–422.

  51. Yee L T (2017) Valence, arousal, familiarity, concreteness, and imageability ratings for 292 two-character Chinese nouns in Cantonese speakers in Hong Kong. PloS one 12 (3):e0174569.

    MathSciNet  Google Scholar 

  52. Zhang M, Hwa R, Kovashka A (2018) Equal but not the same: Understanding the implicit relationship between persuasive images and text. In: Proc. British Machine Vision Conference 2018, no. 8

Download references

Author information

Authors and Affiliations


Corresponding author

Correspondence to Marc A. Kastner.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Parts of this research were supported by JSPS KAKENHI 16H02846, and a joint research project with NII, Japan.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kastner, M.A., Ide, I., Nack, F. et al. Estimating the imageability of words by mining visual characteristics from crawled image data. Multimed Tools Appl 79, 18167–18199 (2020).

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: