Predicting Entry-Level Categories

Abstract

Entry-level categories, the labels people most naturally use to name an object, were originally defined and studied by psychologists in the 1970s and 1980s. In this paper we extend these ideas to study entry-level categories at a larger scale and to learn models that can automatically predict entry-level categories for images. Our models combine visual recognition predictions with linguistic resources such as WordNet and with proxies for word “naturalness” mined from the enormous amount of text on the web. We demonstrate the usefulness of our models for predicting the nouns (entry-level words) people associate with images, and for learning mappings between the concepts predicted by existing visual recognition systems and entry-level concepts. We build on recent successful convolutional network models for visual recognition, training classifiers for 7404 object categories on ConvNet activation features. Results for category mapping and entry-level category prediction show promise for producing more natural, human-like labels for images. We also demonstrate the potential applicability of our results to the task of image description generation.
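
To make the approach concrete, here is a minimal sketch of the hypernym-based mapping idea, assuming NLTK's WordNet interface; the NGRAM_LOG_COUNT table and the linear naturalness-vs-specificity trade-off are hypothetical stand-ins for the web-mined frequency scores and the learned objective described in the paper.

```python
# A minimal sketch, not the paper's exact objective. Requires NLTK with the
# WordNet corpus installed (run nltk.download('wordnet') once beforehand).
from nltk.corpus import wordnet as wn

# Hypothetical log-frequency proxies for word "naturalness"; a real system
# would mine these from web-scale text (e.g., an n-gram corpus).
NGRAM_LOG_COUNT = {"grizzly": 12.1, "bear": 16.8, "carnivore": 10.3,
                   "mammal": 13.5, "animal": 16.2, "entity": 11.0}

def entry_level_word(leaf_synset, trade_off=1.0):
    """Map a leaf synset to the word on its hypernym path that best
    trades naturalness against hops climbed toward the root."""
    best_word, best_score = None, float("-inf")
    path = leaf_synset.hypernym_paths()[0][::-1]  # leaf first, root last
    for hops, synset in enumerate(path):
        word = synset.lemma_names()[0].replace("_", " ")
        # Reward frequent words, penalize climbing to abstract ancestors.
        score = NGRAM_LOG_COUNT.get(word, 0.0) - trade_off * hops
        if score > best_score:
            best_word, best_score = word, score
    return best_word

print(entry_level_word(wn.synset("grizzly.n.01")))  # -> 'bear' with these toy counts
```

With these toy counts the grizzly leaf maps to “bear”, matching the intuition that people tend to name an object with a common word rather than the most specific one.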



Notes

  1. This function might bias decisions toward internal nodes. Other alternatives could be explored to estimate internal node scores.
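
As an illustration of the note above, here is a hedged sketch of this kind of aggregation, with a hypothetical two-node hierarchy and made-up leaf probabilities; summing leaf scores favors internal nodes with many descendants (the bias mentioned), while the mean and max reductions are possible alternatives rather than the paper's actual estimator.

```python
# Hypothetical toy hierarchy: internal node -> descendant leaf categories.
# In the paper these would be WordNet internal nodes over 7404 leaf classes.
DESCENDANT_LEAVES = {
    "bear":   ["grizzly", "polar bear", "sloth bear"],
    "animal": ["grizzly", "polar bear", "sloth bear", "tabby", "siamese"],
}

def internal_node_score(node, leaf_probs, reduce="sum"):
    """Estimate an internal node's score from leaf classifier outputs."""
    scores = [leaf_probs.get(leaf, 0.0) for leaf in DESCENDANT_LEAVES[node]]
    if reduce == "sum":    # accumulates mass: biased toward big subtrees
        return sum(scores)
    if reduce == "mean":   # one alternative: normalize by subtree size
        return sum(scores) / len(scores)
    if reduce == "max":    # another: trust the single strongest leaf
        return max(scores)
    raise ValueError(f"unknown reduction: {reduce}")

leaf_probs = {"grizzly": 0.6, "tabby": 0.2}  # toy classifier outputs
for node in ("bear", "animal"):
    print(node, internal_node_score(node, leaf_probs))
# With "sum": bear=0.6, animal=0.8 -- the broader node wins merely by
# having more descendant leaves, which is the bias the note describes.
```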


Acknowledgments

This work was supported by NSF CAREER Award #1444234 and NSF Award #1445409.

Author information


Corresponding author

Correspondence to Vicente Ordonez.

Additional information

Communicated by Phil Torr, Steve Seitz, Yi Ma, and Kiriakos Kutulakos.


About this article


Cite this article

Ordonez, V., Liu, W., Deng, J. et al. Predicting Entry-Level Categories. Int J Comput Vis 115, 29–43 (2015). https://doi.org/10.1007/s11263-015-0815-z


Keywords

  • Recognition
  • Categorization
  • Entry-level categories
  • Psychology