International Journal of Computer Vision

, Volume 88, Issue 2, pp 303–338

The Pascal Visual Object Classes (VOC) Challenge

  • Mark Everingham
  • Luc Van Gool
  • Christopher K. I. Williams
  • John Winn
  • Andrew Zisserman
Article

Abstract

The Pascal Visual Object Classes (VOC) challenge is a benchmark in visual object category recognition and detection, providing the vision and machine learning communities with a standard dataset of images and annotation, and standard evaluation procedures. Organised annually from 2005 to present, the challenge and its associated dataset has become accepted as the benchmark for object detection.

This paper describes the dataset and evaluation procedure. We review the state-of-the-art in evaluated methods for both classification and detection, analyse whether the methods are statistically different, what they are learning from the images (e.g. the object or its context), and what the methods find easy or confuse. The paper concludes with lessons learnt in the three year history of the challenge, and proposes directions for future improvement and extension.

Keywords

Database Benchmark Object recognition Object detection 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Bergtholdt, M., Kappes, J., & Schnörr, C. (2006). Learning of graphical models and efficient inference for object class recognition. In Proceedings of the annual symposium of the German association for pattern recognition (DAGM06) (pp. 273–283) Google Scholar
  2. Chum, O., & Zisserman, A. (2007). An exemplar model for learning object classes. In Proceedings of the IEEE conference on computer vision and pattern recognition. Google Scholar
  3. Chum, O., Philbin, J., Isard, M., & Zisserman, A. (2007). Scalable near identical image and shot detection. In Proceedings of the international conference on image and video retrieval (pp. 549–556). Google Scholar
  4. Csurka, G., Bray, C., Dance, C., & Fan, L. (2004). Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV (pp. 1–22). Google Scholar
  5. Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 886–893). Google Scholar
  6. Demsar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30. MathSciNetGoogle Scholar
  7. Duygulu, P., Barnard, K., de Freitas, N., & Forsyth, D. A. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of the European conference on computer vision (pp. 97–112). Google Scholar
  8. Everingham, M., Zisserman, A., Williams, C. K. I., & Van Gool, L. (2006a). The 2005 PASCAL visual object classes challenge. In LNAI: Vol. 3944. Machine learning challenges—evaluating predictive uncertainty, visual object classification, and recognising textual entailment (pp. 117–176). Berlin: Springer. CrossRefGoogle Scholar
  9. Everingham, M., Zisserman, A., Williams, C. K. I., & Van Gool, L. (2006b). The PASCAL visual object classes challenge 2006 (VOC2006) results. http://pascal-network.org/challenges/VOC/voc2006/results.pdf.
  10. Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2007). The PASCAL visual object classes challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/index.html.
  11. Fei-Fei, L., Fergus, R., & Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594–611. http://www.vision.caltech.edu/Image_Datasets/Caltech101/Caltech101.html. CrossRefGoogle Scholar
  12. Fellbaum, C. (Ed.) (1998). WordNet: an electronic lexical database. Cambridge: MIT Press. MATHGoogle Scholar
  13. Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In Proceedings of the IEEE conference on computer vision and pattern recognition. Google Scholar
  14. Fergus, R., Fei-Fei, L., Perona, P., & Zisserman, A. (2005). Learning object categories from Google’s image search. In Proceedings of the international conference on computer vision. Google Scholar
  15. Fergus, R., Perona, P., & Zisserman, A. (2007). Weakly supervised scale-invariant learning of models for visual recognition. International Journal of Computer Vision, 71(3), 273–303. CrossRefGoogle Scholar
  16. Ferrari, V., Fevrier, L., Jurie, F., & Schmid, C. (2008). Groups of adjacent contour segments for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1), 36–51. CrossRefGoogle Scholar
  17. Fritz, M., & Schiele, B. (2008). Decomposition, discovery and detection of visual categories using topic models. In Proceedings of the IEEE conference on computer vision and pattern recognition. Google Scholar
  18. Geusebroek, J. (2006). Compact object descriptors from local colour invariant histograms. In Proceedings of the British machine vision conference (pp. 1029–1038). Google Scholar
  19. Grauman, K., & Darrell, T. (2005). The pyramid match kernel: Discriminative classification with sets of image features. In Proceedings of the international conference on computer vision (pp. 1458–1465). Google Scholar
  20. Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset (Technical Report 7694). California Institute of Technology. http://www.vision.caltech.edu/Image_Datasets/Caltech256/.
  21. Hoiem, D., Efros, A. A., & Hebert, M. (2006). Putting objects in perspective. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2137–2144). Google Scholar
  22. Kohli, P., Ladicky, L., & Torr, P. (2008). Robust higher order potentials for enforcing label consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition. Google Scholar
  23. Lampert, C. H., Blaschko, M. B., & Hofmann, T. (2008). Beyond sliding windows: Object localization by efficient subwindow search. In Proceedings of the IEEE conference on computer vision and pattern recognition. Google Scholar
  24. Laptev, I. (2006). Improvements of object detection using boosted histograms. In Proceedings of the British machine vision conference (pp. 949–958). Google Scholar
  25. Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2169–2178). Google Scholar
  26. Leibe, B., Leonardis, A., & Schiele, B. (2004). Combined object categorization and segmentation with an implicit shape model. In ECCV2004 workshop on statistical learning in computer vision, Prague, Czech Republic (pp. 17–32). Google Scholar
  27. Liu, X., Wang, D., Li, J., & Zhang, B. (2007). The feature and spatial covariant kernel: Adding implicit spatial constraints to histogram. In Proceedings of the international conference on image and video retrieval. Google Scholar
  28. Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110. CrossRefGoogle Scholar
  29. Marszalek, M., & Schmid, C. (2007). Semantic hierarchies for visual object recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. Google Scholar
  30. Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In Proceedings of the IEEE conference on computer vision and pattern recognition. Google Scholar
  31. Pinto, N., Cox, D., & DiCarlo, J. (2008). Why is real-world visual object recognition hard? PLoS Computational Biology, 4(1), 151–156. CrossRefMathSciNetGoogle Scholar
  32. Russell, B., Torralba, A., Murphy, K., & Freeman, W. T. (2008). LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1–3), 157–173. http://labelme.csail.mit.edu/. CrossRefGoogle Scholar
  33. Salton, G., & McGill, M. J. (1986). Introduction to modern information retrieval. New York: McGraw-Hill. Google Scholar
  34. Scharstein, D., & Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1–3), 7–42. http://vision.middlebury.edu/stereo/. MATHCrossRefGoogle Scholar
  35. Shotton, J., Winn, J. M., Rother, C., & Criminisi, A. (2006). TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In Proceedings of the European conference on computer vision (pp. 1–15). Google Scholar
  36. Sivic, J., & Zisserman, A. (2003). Video Google: A text retrieval approach to object matching in videos. In Proceedings of the international conference on computer vision (Vol. 2, pp. 1470–1477). http://www.robots.ox.ac.uk/~vgg.
  37. Smeaton, A. F., Over, P., & Kraaij, W. (2006). Evaluation campaigns and TRECVID. In MIR ’06: Proceedings of the 8th ACM international workshop on multimedia information retrieval (pp. 321–330). Google Scholar
  38. Snoek, C., Worring, M., & Smeulders, A. (2005). Early versus late fusion in semantic video analysis. In Proceedings of the ACM international conference on multimedia (pp. 399–402). Google Scholar
  39. Snoek, C., Worring, M., van Gemert, J., Geusebroek, J., & Smeulders, A. (2006). The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of ACM multimedia. Google Scholar
  40. Sorokin, A., & Forsyth, D. (2008). Utility data annotation with Amazon mechanical turk. In Proceedings of the first IEEE workshop on Internet vision (at CVPR 2008). Google Scholar
  41. Spain, M., & Perona, P. (2008). Some objects are more equal than others: Measuring and predicting importance. In Proceedings of the European conference on computer vision (pp. 523–536). Google Scholar
  42. Stoettinger, J., Hanbury, A., Sebe, N., & Gevers, T. (2007). Do colour interest points improve image retrieval? In Proceedings of the IEEE international conference on image processing (pp. 169–172). Google Scholar
  43. Sudderth, E. B., Torralba, A. B., Freeman, W. T., & Willsky, A. S. (2008). Describing visual scenes using transformed objects and parts. International Journal of Computer Vision, 77(1–3), 291–330. CrossRefGoogle Scholar
  44. Torralba, A. B. (2003). Contextual priming for object detection. International Journal of Computer Vision, 53(2), 169–191. CrossRefGoogle Scholar
  45. Torralba, A. B., Murphy, K. P., & Freeman, W. T. (2007). Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 854–869. CrossRefGoogle Scholar
  46. van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2008). Evaluation of color descriptors for object and scene recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. Google Scholar
  47. van de Weijer, J., & Schmid, C. (2006). Coloring local feature extraction. In Proceedings of the European conference on computer vision. Google Scholar
  48. van Gemert, J., Geusebroek, J., Veenman, C., Snoek, C., & Smeulders, A. (2006). Robust scene categorization by learning image statistics in context. In CVPR workshop on semantic learning applications in multimedia. Google Scholar
  49. Viitaniemi, V., & Laaksonen, J. (2008). Evaluation of techniques for image classification, object detection and object segmentation (Technical Report TKK-ICS-R2). Department of Information and Computer Science, Helsinki University of Technology. http://www.cis.hut.fi/projects/cbir/.
  50. Viola, P. A., & Jones, M. J. (2004). Robust Real-time Face Detection. International Journal of Computer Vision, 57(2), 137–154. CrossRefGoogle Scholar
  51. von Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. In Proceedings of the ACM CHI (pp. 319–326). Google Scholar
  52. Wang, D., Li, J., & Zhang, B. (2006). Relay boost fusion for learning rare concepts in multimedia. In Proceedings of the international conference on image and video retrieval. Google Scholar
  53. Winn, J., & Everingham, M. (2007). The PASCAL visual object classes challenge 2007 (VOC2007) annotation guidelines. http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2007/guidelines.html.
  54. Yao, B., Yang, X., & Zhu, S. C. (2007). Introduction to a large scale general purpose ground truth dataset: methodology, annotation tool, and benchmarks. In Proceedings of the 6th international conference on energy minimization methods in computer vision and pattern recognition. http://www.imageparsing.com/.
  55. Yilmaz, E., & Aslam, J. (2006). Estimating average precision with incomplete and imperfect judgments. In Fifteenth ACM international conference on information and knowledge management (CIKM). Google Scholar
  56. Zehnder, P., Koller-Meier, E., & Van Gool, L. (2008). An efficient multi-class detection cascade. In Proceedings of the British machine vision conference. Google Scholar
  57. Zhang, J., Marszalek, M., Lazebnik, S., & Schmid, C. (2007). Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision, 73(2), 213–238. CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  • Mark Everingham
    • 1
  • Luc Van Gool
    • 2
  • Christopher K. I. Williams
    • 3
  • John Winn
    • 4
  • Andrew Zisserman
    • 5
  1. 1.University of LeedsLeedsUK
  2. 2.KU LeuvenLeuvenBelgium
  3. 3.University of EdinburghEdinburghUK
  4. 4.Microsoft ResearchCambridgeUK
  5. 5.University of OxfordOxfordUK

Personalised recommendations