
Predicting Entry-Level Categories

International Journal of Computer Vision

Abstract

Entry-level categories—the labels people most naturally use to name an object—were originally defined and studied by psychologists in the 1970s and 1980s. In this paper we extend these ideas to study entry-level categories at a larger scale and to learn models that automatically predict entry-level categories for images. Our models combine visual recognition predictions with linguistic resources such as WordNet and with proxies for word “naturalness” mined from web-scale text. We demonstrate the usefulness of our models for predicting the nouns (entry-level words) that people associate with images, and for learning mappings between concepts predicted by existing visual recognition systems and entry-level concepts. We build on recent convolutional network models for visual recognition, training classifiers for 7404 object categories on ConvNet activation features. Results for category mapping and entry-level category prediction show promise for producing more natural, human-like labels. We also demonstrate the potential applicability of our results to image description generation.
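The language side of this pipeline can be pictured with a small sketch. The snippet below is illustrative only, not the authors' exact formulation: it walks up the WordNet hypernym chain from a leaf synset and trades a word-frequency proxy for “naturalness” against the specificity lost with each hop toward the root. The UNIGRAM_COUNTS table, the lam weight, and hop counting as a semantic-distance proxy are all hypothetical stand-ins; the paper mines its naturalness proxies from web-scale text.

```python
import math
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

# Hypothetical unigram counts standing in for web-scale n-gram
# frequencies; the paper's real naturalness proxies come from web text.
UNIGRAM_COUNTS = {"grampus": 60, "dolphin": 11000, "whale": 42000,
                  "mammal": 17000, "animal": 98000, "entity": 12000}

def naturalness(synset):
    # Best available frequency among the synset's lemma names.
    return max(UNIGRAM_COUNTS.get(name.lower().replace("_", " "), 1)
               for name in synset.lemma_names())

def entry_level(leaf, lam=1.5):
    # Score each hypernym by log-frequency ("naturalness") minus a
    # penalty of lam per hop, a crude proxy for lost specificity.
    path_up = leaf.hypernym_paths()[0][::-1]  # leaf ... root
    return max(enumerate(path_up),
               key=lambda hop_node: math.log(naturalness(hop_node[1]))
                                    - lam * hop_node[0])[1]

leaf = wn.synsets("Grampus_griseus")[0]   # Risso's dolphin
print(entry_level(leaf).lemma_names())    # e.g. ['dolphin']
```

With a real n-gram table and the trade-off weight tuned against human annotations, the same scheme favors words like “dolphin” over both the overly specific “grampus” and the overly general “animal”.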


Notes

  1. This function might bias decisions toward internal nodes; other alternatives could be explored to estimate internal node scores, as the toy sketch below illustrates.
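To make the footnote concrete, here is a toy sketch under an explicit assumption (ours, not necessarily the paper's function): scoring an internal node by summing the posteriors of its descendant leaves. Sums grow monotonically toward the root, so an unpenalized argmax drifts to internal nodes; taking the maximum over descendant leaves is one alternative that does not inflate general categories, though it ignores accumulated evidence.

```python
# Hypothetical mini-hierarchy: parent -> children; leaves have no children.
TREE = {"animal": ["dog", "cat", "bird"], "dog": [], "cat": [], "bird": []}
POSTERIOR = {"dog": 0.30, "cat": 0.25, "bird": 0.10}  # leaf classifier scores

def leaves(node):
    # All leaf nodes in the subtree rooted at `node`.
    kids = TREE[node]
    return [node] if not kids else [l for k in kids for l in leaves(k)]

def score_sum(node):  # assumed scoring: sum of descendant-leaf posteriors
    return sum(POSTERIOR[l] for l in leaves(node))

def score_max(node):  # alternative: strongest single descendant leaf
    return max(POSTERIOR[l] for l in leaves(node))

print(score_sum("dog"), score_sum("animal"))  # 0.30 vs 0.65: sum favors the internal node
print(score_max("dog"), score_max("animal"))  # 0.30 vs 0.30: max does not inflate it
```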


Acknowledgments

This work was supported by NSF Career Award #1444234 and NSF Award #1445409.

Author information

Correspondence to Vicente Ordonez.

Additional information

Communicated by Phil Torr, Steve Seitz, Yi Ma, and Kiriakos Kutulakos.


Cite this article

Ordonez, V., Liu, W., Deng, J. et al. Predicting Entry-Level Categories. Int J Comput Vis 115, 29–43 (2015). https://doi.org/10.1007/s11263-015-0815-z
