International Journal of Computer Vision, Volume 115, Issue 1, pp 29–43

Predicting Entry-Level Categories

  • Vicente Ordonez
  • Wei Liu
  • Jia Deng
  • Yejin Choi
  • Alexander C. Berg
  • Tamara L. Berg


Entry-level categories—the labels people use to name an object—were originally defined and studied by psychologists in the 1970s and 1980s. In this paper we extend these ideas to study entry-level categories at a larger scale and to learn models that can automatically predict entry-level categories for images. Our models combine visual recognition predictions with linguistic resources like WordNet and proxies for word “naturalness” mined from the enormous amount of text on the web. We demonstrate the usefulness of our models for predicting nouns (entry-level words) associated with images by people, and for learning mappings between concepts predicted by existing visual recognition systems and entry-level concepts. In this work we make use of recent successful efforts on convolutional network models for visual recognition by training classifiers for 7404 object categories on ConvNet activation features. Results for category mapping and entry-level category prediction for images show promise for producing more natural human-like labels. We also demonstrate the potential applicability of our results to the task of image description generation.
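As a rough illustration of how a word "naturalness" proxy might be traded off against position in a hierarchy such as WordNet, the sketch below walks up from a fine-grained category and picks the ancestor with the best score. Everything in it is an assumption made for illustration: the toy hypernym chain, the made-up frequency counts, the scoring rule (log count minus a penalty per step up), and the names `PARENT`, `COUNTS`, `ancestors`, `entry_level`, and `lam`. It covers only the linguistic side and is not the model evaluated in the paper.

```python
import math

# Hypothetical toy hypernym chain (child -> parent); not taken from WordNet.
PARENT = {
    "grampus griseus": "dolphin",
    "dolphin": "cetacean",
    "cetacean": "mammal",
    "mammal": "animal",
}

# Made-up word frequencies standing in for web-scale n-gram counts.
COUNTS = {
    "grampus griseus": 1_000,
    "dolphin": 5_000_000,
    "cetacean": 60_000,
    "mammal": 2_000_000,
    "animal": 50_000_000,
}


def ancestors(node):
    """Yield (word, steps_up) for the node itself and each of its hypernyms."""
    steps = 0
    while node is not None:
        yield node, steps
        node = PARENT.get(node)
        steps += 1


def entry_level(leaf, lam=1.0):
    """Pick the ancestor maximizing log-frequency minus lam * distance from the leaf."""
    def score(pair):
        word, steps = pair
        return math.log(COUNTS[word]) - lam * steps

    best_word, _ = max(ancestors(leaf), key=score)
    return best_word


if __name__ == "__main__":
    for lam in (0.0, 1.0, 20.0):
        print(lam, entry_level("grampus griseus", lam))
```

In this toy setup the assumed parameter `lam` controls how far the chosen label may drift from the original category: with no penalty the most frequent word wins regardless of how generic it is, while a very large penalty keeps the original fine-grained name.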


Keywords: Recognition, Categorization, Entry-level categories, Psychology



This work was supported by NSF Career Award #1444234 and NSF Award #1445409.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Vicente Ordonez (1)
  • Wei Liu (1)
  • Jia Deng (2)
  • Yejin Choi (3)
  • Alexander C. Berg (1)
  • Tamara L. Berg (1)

  1. Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, USA
  2. Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, USA
  3. Computer Science and Engineering, University of Washington, Seattle, USA
