International Journal of Computer Vision, Volume 115, Issue 3, pp 211–252

ImageNet Large Scale Visual Recognition Challenge

  • Olga Russakovsky
  • Jia Deng
  • Hao Su
  • Jonathan Krause
  • Sanjeev Satheesh
  • Sean Ma
  • Zhiheng Huang
  • Andrej Karpathy
  • Aditya Khosla
  • Michael Bernstein
  • Alexander C. Berg
  • Li Fei-Fei

Abstract

The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.
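
The classification accuracy discussed above is conventionally reported as top-5 error: a prediction counts as correct if the ground-truth label appears among the five highest-scoring classes for an image. As a minimal illustration (not the organizers' official evaluation code), the Python sketch below computes top-5 error; the function name top5_error and the arrays scores and labels are assumptions for this example.

    import numpy as np

    def top5_error(scores, labels):
        """Top-5 error: fraction of images whose true label is absent
        from the five highest-scoring predicted classes.

        scores: (N, C) array of per-class scores, one row per image.
        labels: (N,) array of ground-truth class indices.
        """
        # Indices of the five best-scoring classes per image.
        top5 = np.argsort(-scores, axis=1)[:, :5]
        # An image counts as correct if its true label appears anywhere
        # in its top five predictions.
        correct = (top5 == labels[:, None]).any(axis=1)
        return 1.0 - float(correct.mean())

    # Toy usage: 4 images, 10 classes, random scores.
    rng = np.random.default_rng(0)
    scores = rng.standard_normal((4, 10))
    labels = np.array([3, 1, 7, 0])
    print(f"top-5 error: {top5_error(scores, labels):.2f}")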

Keywords

Dataset · Large-scale · Benchmark · Object recognition · Object detection

Acknowledgments

We thank Stanford University, UNC Chapel Hill, Google and Facebook for sponsoring the challenges, and NVIDIA for providing computational resources to participants of ILSVRC2014. We thank our advisors over the years: Lubomir Bourdev, Alexei Efros, Derek Hoiem, Jitendra Malik, Chuck Rosenberg and Andrew Zisserman. We thank the PASCAL VOC organizers for partnering with us in running ILSVRC2010–2012. We thank all members of the Stanford vision lab for supporting the challenges and putting up with us along the way. Finally, and most importantly, we thank all researchers who have made the ILSVRC effort a success by competing in the challenges and by using the datasets to advance computer vision.


Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Olga Russakovsky (1)
  • Jia Deng (2)
  • Hao Su (1)
  • Jonathan Krause (1)
  • Sanjeev Satheesh (1)
  • Sean Ma (1)
  • Zhiheng Huang (1)
  • Andrej Karpathy (1)
  • Aditya Khosla (3)
  • Michael Bernstein (1)
  • Alexander C. Berg (4)
  • Li Fei-Fei (1)

  1. Stanford University, Stanford, USA
  2. University of Michigan, Ann Arbor, USA
  3. Massachusetts Institute of Technology, Cambridge, USA
  4. UNC Chapel Hill, Chapel Hill, USA
