International Journal of Computer Vision, Volume 111, Issue 1, pp 98–136

The Pascal Visual Object Classes Challenge: A Retrospective

  • Mark Everingham
  • S. M. Ali Eslami
  • Luc Van Gool
  • Christopher K. I. Williams
  • John Winn
  • Andrew Zisserman

Abstract

The Pascal Visual Object Classes (VOC) challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardised evaluation software; and (ii) an annual competition and workshop. There are five challenges: classification, detection, segmentation, action classification, and person layout. In this paper we provide a review of the challenge from 2008–2012. The paper is intended for two audiences: algorithm designers, researchers who want to see what the state of the art is, as measured by performance on the VOC datasets, along with the limitations and weak points of the current generation of algorithms; and, challenge designers, who want to see what we as organisers have learnt from the process and our recommendations for the organisation of future challenges. To analyse the performance of submitted algorithms on the VOC datasets we introduce a number of novel evaluation methods: a bootstrapping method for determining whether differences in the performance of two algorithms are significant or not; a normalised average precision so that performance can be compared across classes with different proportions of positive instances; a clustering method for visualising the performance across multiple algorithms so that the hard and easy images can be identified; and the use of a joint classifier over the submitted algorithms in order to measure their complementarity and combined performance. We also analyse the community’s progress through time using the methods of Hoiem et al. (Proceedings of European Conference on Computer Vision, 2012) to identify the types of occurring errors. We conclude the paper with an appraisal of the aspects of the challenge that worked well, and those that could be improved in future challenges.
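As a rough illustration of the bootstrapping idea mentioned above, the following sketch tests whether two algorithms differ significantly in average precision on a shared test set. It is a minimal sketch under stated assumptions, not the procedure used in the paper: it assumes test images are resampled with replacement, uses a non-interpolated average precision, and takes a percentile confidence interval on the difference. The function names and parameters (`n_rounds`, `alpha`, `seed`) are hypothetical.

```python
import random


def average_precision(scores, labels):
    """Non-interpolated AP: mean of the precision at the rank of each positive.

    Assumption: the official VOC measure uses its own interpolation scheme;
    this simpler variant is used only to keep the sketch short.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def bootstrap_ap_difference(scores_a, scores_b, labels,
                            n_rounds=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AP(a) - AP(b) over resampled images.

    Returns (lower, upper, significant). The difference is called significant
    when the interval excludes zero.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    for _ in range(n_rounds):
        # Resample test images with replacement; both algorithms are
        # evaluated on the same resample so the comparison is paired.
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_labels = [labels[i] for i in idx]
        ap_a = average_precision([scores_a[i] for i in idx], resampled_labels)
        ap_b = average_precision([scores_b[i] for i in idx], resampled_labels)
        diffs.append(ap_a - ap_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_rounds)]
    upper = diffs[int((1 - alpha / 2) * n_rounds) - 1]
    return lower, upper, (lower > 0 or upper < 0)
```

Evaluating both algorithms on the same resampled images keeps the comparison paired, so variation caused by which images happen to be drawn affects both scores together rather than inflating the interval.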

Keywords

Database, Benchmark, Object recognition, Object detection, Segmentation

References

  1. Alexe, B., Deselaers, T., & Ferrari, V. (2010). What is an object? In Proceedings of Conference on Computer Vision and Pattern Recognition (pp. 73–80).
  2. Alexiou, I., & Bharath, A. (2012). Efficient Kernels couple visual words through categorical opponency. In Proceedings of British Machine Vision Conference.
  3. Bertail, P., Clémençon, S. J., & Vayatis, N. (2009). On bootstrapping the ROC curve. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in Neural Information Processing Systems (Vol. 21, pp. 137–144). Red Hook, NY: Curran Associates, Inc.
  4. Carreira, J., Caseiro, R., Batista, J., & Sminchisescu, C. (2012). Semantic segmentation with second-order pooling. In Proceedings of European Conference on Computer Vision.
  5. Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. Transactions on Intelligent Systems and Technology, 2, 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  6. Chen, Q., Song, Z., Hua, Y., Huang, Z., & Yan, S. (2012). Generalized hierarchical matching for image classification. In Proceedings of Conference on Computer Vision and Pattern Recognition.
  7. Csurka, G., Dance, C., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In Proceedings of ECCV2004 Workshop on Statistical Learning in Computer Vision (pp. 59–74).
  8. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of Conference on Computer Vision and Pattern Recognition.
  9. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2013). Decaf: A deep convolutional activation feature for generic visual recognition. CoRR abs/1310.1531.
  10. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88, 303–338.
  11. Farhadi, A., Endres, I., Hoiem, D., & Forsyth, D. (2009). Describing objects by their attributes. In Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE (pp. 1778–1785).
  12. Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
  13. Flickr website. (2013). http://www.flickr.com/.
  14. Girshick, R. B., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of Conference on Computer Vision and Pattern Recognition.
  15. Hall, P., Hyndman, R., & Fan, Y. (2004). Nonparametric confidence intervals for receiver operating characteristic curves. Biometrika, 91, 743–50.
  16. Hoai, M., Ladicky, L., & Zisserman, A. (2012). Action Recognition from Still Images by Aligning Body Parts. http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2012/workshop/segmentation_action_layout.pdf. Slides contained in the presentation by Luc van Gool on Overview and results of the segmentation challenge and action taster.
  17. Hoiem, D., Chodpathumwan, Y., & Dai, Q. (2012). Diagnosing error in object detectors. In Proceedings of European Conference on Computer Vision.
  18. Ion, A., Carreira, J., & Sminchisescu, C. (2011a). Image segmentation by figure-ground composition into maximal cliques. In Proceedings of International Conference on Computer Vision.
  19. Ion, A., Carreira, J., & Sminchisescu, C. (2011b). Probabilistic joint image segmentation and labeling. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems (Vol. 24, pp. 1827–1835). Red Hook, NY: Curran Associates, Inc.
  20. Karaoglu, S., Van Gemert, J., & Gevers, T. (2012). Object reading: Text recognition for object recognition. In Proceedings of ECCV 2012 Workshops and Demonstrations.
  21. Khan, F., Anwer, R., Van de Weijer, J., Bagdanov, A., Vanrell, M., & Lopez, A. M. (2012a). Color attributes for object detection. In Proceedings of Conference on Computer Vision and Pattern Recognition.
  22. Khan, F., Van de Weijer, J., & Vanrell, M. (2012b). Modulating shape features by color attention for object recognition. International Journal of Computer Vision, 98(1), 49–64.
  23. Khosla, A., Yao, B., & Fei-Fei, L. (2011). Combining randomization and discrimination for fine-grained image categorization. In Proceedings of Conference on Computer Vision and Pattern Recognition.
  24. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems (Vol. 25, pp. 1106–1114). Red Hook, NY: Curran Associates, Inc.
  25. Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of Conference on Computer Vision and Pattern Recognition (pp. 2169–2178).
  26. Leibe, B., Leonardis, A., & Schiele, B. (2004). Combined object categorization and segmentation with an implicit shape model. In Proceedings of ECCV Workshop on Statistical Learning in Computer Vision.
  27. Lempitsky, V., & Zisserman, A. (2010). Learning to count objects in images. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in Neural Information Processing Systems (Vol. 23, pp. 1324–1332). Red Hook, NY: Curran Associates, Inc. http://papers.nips.cc/paper/4043-learning-to-count-objects-in-images.pdf
  28. Li, F., Carreira, J., Lebanon, G., & Sminchisescu, C. (2013). Composite statistical inference for semantic segmentation. In Proceedings of Conference on Computer Vision and Pattern Recognition.
  29. Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
  30. Nanni, L., & Lumini, A. (2013). Heterogeneous bag-of-features for object/scene recognition. Applied Soft Computing, 13(4), 2171–2178.
  31. O’Connor, B. (2010). A response to “comparing Precision-Recall curves the Bayesian way?”. A comment on the blog post by Bob Carpenter on Comparing Precision-Recall Curves the Bayesian Way? http://lingpipe-blog.com/2010/01/29/comparing-precision-recall-curves-bayesian-way/.
  32. Oquab, M., Bottou, L., Laptev, I., & Sivic, J. (2014). Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of Conference on Computer Vision and Pattern Recognition.
  33. Russakovsky, O., Lin, Y., Yu, K., & Fei-Fei, L. (2012). Object-centric spatial pooling for image classification. In Proceedings of European Conference on Computer Vision.
  34. Russell, B., Torralba, A., Murphy, K., & Freeman, W. T. (2008). LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77(1–3), 157–173. http://labelme.csail.mit.edu/
  35. Salton, G., & Mcgill, M. J. (1986). Introduction to modern information retrieval. New York, NY: McGraw-Hill Inc.
  36. Sener, F., Bas, C., & Ikizler-Cinbis, N. (2012). On recognizing actions in still images via multiple features. In Proceedings of ECCV Workshop on Action Recognition and Pose Estimation in Still Images.
  37. Song, Z., Chen, Q., Huang, Z., Hua, Y., & Yan, S. (2011). Contextualizing object detection and classification. In Proceedings of Conference on Computer Vision and Pattern Recognition.
  38. Pascal VOC best practice guidelines. (2012). http://pascallin.ecs.soton.ac.uk/challenges/VOC/#bestpractice.
  39. Pascal VOC evaluation server. (2012). http://host.robots.ox.ac.uk:8080/.
  40. Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In Proceedings of Conference on Computer Vision and Pattern Recognition, IEEE (pp. 1521–1528).
  41. Uijlings, J., Van de Sande, K., Gevers, T., & Smeulders, A. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.
  42. Van de Sande, K., Uijlings, J., Gevers, T., & Smeulders, A. (2011). Segmentation as selective search for object recognition. In Proceedings of International Conference on Computer Vision.
  43. Van Gemert, J. (2011). Exploiting photographic style for category-level image classification by generalizing the spatial pyramid. In Proceedings of International Conference on Multimedia Retrieval.
  44. Vedaldi, A., Gulshan, V., Varma, M., & Zisserman, A. (2009). Multiple kernels for object detection. In International Conference on Computer Vision.
  45. Viola, P., & Jones, M. (2004). Robust real-time object detection. International Journal of Computer Vision, 57(2), 137–154.
  46. Wang, X., Lin, L., Huang, L., & Yan, S. (2013). Incorporating structural alternatives and sharing into hierarchy for multiclass object recognition and detection. In Proceedings of Conference on Computer Vision and Pattern Recognition.
  47. Wasserman, L. (2004). All of statistics. Berlin: Springer.
  48. Xia, W., Song, Z., Feng, J., Cheong, L. F., & Yan, S. (2012). Segmentation over detection by coupled global and local sparse representations. In Proceedings of European Conference on Computer Vision.
  49. Yang, J., Yu, K., Gong, Y., & Huang, T. (2009). Linear spatial pyramid matching using sparse coding for image classification. In Proceedings of Conference on Computer Vision and Pattern Recognition.
  50. Zeiler, M. D., & Fergus, R. (2013). Visualizing and understanding convolutional networks. CoRR abs/1311.2901.
  51. Zhu, L., Chen, Y., Yuille, A., & Freeman, W. (2010). Latent hierarchical structural learning for object detection. In Proceedings of Conference on Computer Vision and Pattern Recognition.
  52. Zisserman, A., Winn, J., Fitzgibbon, A., Van Gool, L., Sivic, J., Williams, C., et al. (2012). In memoriam: Mark Everingham. Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2081–2082.

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Mark Everingham (1)
  • S. M. Ali Eslami (2)
  • Luc Van Gool (3, 4)
  • Christopher K. I. Williams (5)
  • John Winn (2)
  • Andrew Zisserman (6)

  1. University of Leeds, Leeds, UK
  2. Microsoft Research, Cambridge, UK
  3. KU Leuven, Leuven, Belgium
  4. ETH, Zurich, Switzerland
  5. University of Edinburgh, Edinburgh, UK
  6. University of Oxford, Oxford, UK
