Microsoft COCO: Common Objects in Context

  • Tsung-Yi Lin
  • Michael Maire
  • Serge Belongie
  • James Hays
  • Pietro Perona
  • Deva Ramanan
  • Piotr Dollár
  • C. Lawrence Zitnick
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8693)

Abstract

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

References

  1. 1.
    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR (2009)Google Scholar
  2. 2.
    Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. IJCV 88(2), 303–338 (2010)Google Scholar
  3. 3.
    Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. In: CVPR (2010)Google Scholar
  4. 4.
    Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: An evaluation of the state of the art. PAMI 34 (2012)Google Scholar
  5. 5.
    Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)Google Scholar
  6. 6.
    Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)Google Scholar
  7. 7.
    Sermanet, P., Eigen, D., Zhang, S., Mathieu, M., Fergus, R., LeCun, Y.: OverFeat: Integrated recognition, localization and detection using convolutional networks. In: ICLR (April 2014)Google Scholar
  8. 8.
    Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In: CVPR (2009)Google Scholar
  9. 9.
    Patterson, G., Hays, J.: SUN attribute database: Discovering, annotating, and recognizing scene attributes. In: CVPR (2012)Google Scholar
  10. 10.
    Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3D human pose annotations. In: ICCV (2009)Google Scholar
  11. 11.
    Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part V. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  12. 12.
    Palmer, S., Rosch, E., Chase, P.: Canonical perspective and the perception of objects. Attention and Performance IX 1, 4 (1981)Google Scholar
  13. 13.
    Hoiem, D., Chodpathumwan, Y., Dai, Q.: Diagnosing error in object detectors. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part III. LNCS, vol. 7574, pp. 340–353. Springer, Heidelberg (2012)CrossRefGoogle Scholar
  14. 14.
    Brostow, G., Fauqueur, J., Cipolla, R.: Semantic object classes in video: A high-definition ground truth database. PRL 30(2), 88–97 (2009)CrossRefGoogle Scholar
  15. 15.
    Russell, B., Torralba, A., Murphy, K., Freeman, W.: LabelMe: a database and web-based tool for image annotation. IJCV 77(1-3), 157–173 (2008)CrossRefGoogle Scholar
  16. 16.
    Bell, S., Upchurch, P., Snavely, N., Bala, K.: OpenSurfaces: A richly annotated catalog of surface appearance. SIGGRAPH 32(4) (2013)Google Scholar
  17. 17.
    Ordonez, V., Kulkarni, G., Berg, T.: Im2text: Describing images using 1 million captioned photographs. In: NIPS (2011)Google Scholar
  18. 18.
    Deng, J., Russakovsky, O., Krause, J., Bernstein, M., Berg, A., Fei-Fei, L.: Scalable multi-label annotation. In: CHI (2014)Google Scholar
  19. 19.
    Lin, T., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. CoRR abs/1405.0312 (2014)Google Scholar
  20. 20.
    Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV 47(1-3), 7–42 (2002)Google Scholar
  21. 21.
    Baker, S., Scharstein, D., Lewis, J., Roth, S., Black, M., Szeliski, R.: A database and evaluation methodology for optical flow. IJCV 92(1), 1–31 (2011)CrossRefGoogle Scholar
  22. 22.
    Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In: CVPR Workshop of Generative Model Based Vision, WGMBV (2004)Google Scholar
  23. 23.
    Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology (2007)Google Scholar
  24. 24.
    Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR (2005)Google Scholar
  25. 25.
    Lecun, Y., Cortes, C.: The MNIST database of handwritten digits (1998)Google Scholar
  26. 26.
    Nene, S.A., Nayar, S.K., Murase, H.: Columbia object image library (coil-20). Technical report, Columbia Universty (1996)Google Scholar
  27. 27.
    Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep. (2009)Google Scholar
  28. 28.
    Torralba, A., Fergus, R., Freeman, W.T.: 80 million tiny images: A large data set for nonparametric object and scene recognition. PAMI 30(11), 1958–1970 (2008)CrossRefGoogle Scholar
  29. 29.
    Ordonez, V., Deng, J., Choi, Y., Berg, A., Berg, T.: From large scale image categorization to entry-level categories. In: ICCV (2013)Google Scholar
  30. 30.
    Fellbaum, C.: WordNet: An electronic lexical database. Blackwell Books (1998)Google Scholar
  31. 31.
    Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-UCSD Birds 200. Technical Report CNS-TR-201, Caltech. (2010)Google Scholar
  32. 32.
    Hjelmås, E., Low, B.: Face detection: A survey. CVIU 83(3), 236–274 (2001)Google Scholar
  33. 33.
    Huang, G.B., Ramesh, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild. Technical Report 07-49, University of Massachusetts, Amherst (October 2007)Google Scholar
  34. 34.
    Russakovsky, O., Deng, J., Huang, Z., Berg, A., Fei-Fei, L.: Detecting avocados to zucchinis: what have we done, and where are we going? In: ICCV (2013)Google Scholar
  35. 35.
    Shotton, J., Winn, J., Rother, C., Criminisi, A.: TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV 81(1), 2–23 (2009)CrossRefGoogle Scholar
  36. 36.
    Seitz, S.M., Curless, B., Diebel, J., Scharstein, D., Szeliski, R.: A comparison and evaluation of multi-view stereo reconstruction algorithms. In: CVPR (2006)Google Scholar
  37. 37.
    Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. PAMI 33(5), 898–916 (2011)CrossRefGoogle Scholar
  38. 38.
    Lampert, C., Nickisch, H., Harmeling, S.: Learning to detect unseen object classes by between-class attribute transfer. In: CVPR (2009)Google Scholar
  39. 39.
    Heitz, G., Koller, D.: Learning spatial context: Using stuff to find things. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 30–43. Springer, Heidelberg (2008)CrossRefGoogle Scholar
  40. 40.
    Sitton, R.: Spelling Sourcebook. Egger Publishing (1996)Google Scholar
  41. 41.
    Berg, T., Berg, A.: Finding iconic images. In: CVPR (2009)Google Scholar
  42. 42.
    Torralba, A., Efros, A.: Unbiased look at dataset bias. In: CVPR (2011)Google Scholar
  43. 43.
    Douze, M., Jégou, H., Sandhawalia, H., Amsaleg, L., Schmid, C.: Evaluation of gist descriptors for web-scale image search. In: CIVR (2009)Google Scholar
  44. 44.
    Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. PAMI 32(9), 1627–1645 (2010)CrossRefGoogle Scholar
  45. 45.
    Girshick, R., Felzenszwalb, P., McAllester, D.: Discriminatively trained deformable part models, release 5. PAMI (2012)Google Scholar
  46. 46.
    Zhu, X., Vondrick, C., Ramanan, D., Fowlkes, C.: Do we need more training data or better models for object detection? In: BMVC (2012)Google Scholar
  47. 47.
    Brox, T., Bourdev, L., Maji, S., Malik, J.: Object segmentation by alignment of poselet activations to image contours. In: CVPR (2011)Google Scholar
  48. 48.
    Yang, Y., Hallman, S., Ramanan, D., Fowlkes, C.: Layered object models for image segmentation. PAMI 34(9), 1731–1743 (2012)CrossRefGoogle Scholar
  49. 49.
    Ramanan, D.: Using segmentation to verify object hypotheses. In: CVPR (2007)Google Scholar
  50. 50.
    Dai, Q., Hoiem, D.: Learning to localize detected objects. In: CVPR (2012)Google Scholar
  51. 51.
    Rashtchian, C., Young, P., Hodosh, M., Hockenmaier, J.: Collecting image annotations using Amazon’s Mechanical Turk. In: NAACL Workshop (2010)Google Scholar

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Tsung-Yi Lin
    • 1
  • Michael Maire
    • 2
  • Serge Belongie
    • 1
  • James Hays
    • 3
  • Pietro Perona
    • 2
  • Deva Ramanan
    • 4
  • Piotr Dollár
    • 5
  • C. Lawrence Zitnick
    • 5
  1. 1.CornellUSA
  2. 2.CaltechUSA
  3. 3.BrownUSA
  4. 4.UC IrvineUSA
  5. 5.Microsoft ResearchUSA

Personalised recommendations