
Estimating the visual variety of concepts by referring to Web popularity

  • Marc A. Kastner
  • Ichiro Ide
  • Yasutomo Kawanishi
  • Takatsugu Hirayama
  • Daisuke Deguchi
  • Hiroshi Murase

Abstract

Increasingly sophisticated methods for data processing demand knowledge of the semantic relationship between language and vision. New fields of research such as Explainable AI call for stepping away from black-box approaches and toward understanding how the underlying semantics of datasets and AI models work. Advances in psycholinguistics suggest that there is a relationship between language perception and how language production and sentence creation work. In this paper, a method to measure the visual variety of concepts is proposed in order to quantify the semantic gap between vision and language. For this, an image corpus is recomposed using ImageNet and Web data. Web-based metrics for measuring the popularity of sub-concepts are used as weights to ensure that the image composition of the dataset is as natural as possible. Using clustering methods, a score describing the visual variety of each concept is determined. A crowd-sourced survey is conducted to create ground-truth values applicable to this research. The evaluations show that the recomposed image corpus largely improves the measured variety compared to previous datasets. The results are promising and provide additional knowledge about the relationship between language and vision.
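
The pipeline sketched above (popularity-weighted corpus composition followed by clustering) can be illustrated with a short sketch. The following Python snippet is a minimal, hypothetical illustration and not the authors' implementation: it assumes that image feature vectors and per-image Web-popularity weights have already been computed, uses scikit-learn's MeanShift for clustering, and scores variety as the entropy of the popularity mass across clusters. The entropy formulation and the helper name visual_variety are assumptions made here purely for illustration.

  # Hypothetical sketch of a popularity-weighted visual variety score,
  # not the paper's actual method. Assumes precomputed image features
  # and per-image Web-popularity weights.
  import numpy as np
  from sklearn.cluster import MeanShift

  def visual_variety(features: np.ndarray, popularity_weights: np.ndarray) -> float:
      """features: (n_images, dim) array of image descriptors.
      popularity_weights: (n_images,) e.g. normalized Web hit counts."""
      labels = MeanShift().fit(features).labels_
      # Total popularity mass that falls into each visual cluster.
      cluster_mass = np.array(
          [popularity_weights[labels == c].sum() for c in np.unique(labels)]
      )
      p = cluster_mass / cluster_mass.sum()
      # Entropy over clusters: high when popular images spread across
      # many distinct visual clusters, low for a visually uniform concept.
      return float(-(p * np.log(p + 1e-12)).sum())

Under such a scoring, a concept like "dog", whose popular images split into many visually distinct clusters (breeds, poses, settings), would receive a higher score than a visually uniform concept like "sun".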

Keywords

Visual variety · Language and vision · Concept semantics · Semantic gap

Acknowledgements

We are grateful to Dr. Kazuaki Nakamura at Osaka University who provided expertise that greatly assisted this research.

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Graduate School of Informatics, Nagoya University, Furo-cho, Chikusa-ku, Japan
  2. Institute of Innovation for Future Society, Nagoya University, Furo-cho, Chikusa-ku, Japan
  3. Information Strategy Office, Nagoya University, Furo-cho, Chikusa-ku, Japan
