# ImageNet Large Scale Visual Recognition Challenge

## Abstract

The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the 5 years of the challenge, and propose future directions and improvements.

### Keywords

Dataset, Large-scale, Benchmark, Object recognition, Object detection

## Notes

### Acknowledgments

We thank Stanford University, UNC Chapel Hill, Google and Facebook for sponsoring the challenges, and NVIDIA for providing computational resources to participants of ILSVRC2014. We thank our advisors over the years: Lubomir Bourdev, Alexei Efros, Derek Hoiem, Jitendra Malik, Chuck Rosenberg and Andrew Zisserman. We thank the PASCAL VOC organizers for partnering with us in running ILSVRC2010-2012. We thank all members of the Stanford vision lab for supporting the challenges and putting up with us along the way. Finally, and most importantly, we thank all researchers that have made the ILSVRC effort a success by competing in the challenges and by using the datasets to advance computer vision.
