
Semantic Understanding of Scenes Through the ADE20K Dataset

Published in: International Journal of Computer Vision

Abstract

Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite the community's efforts in data collection, few image datasets cover a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present ADE20K, a densely annotated dataset that spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total, the dataset contains 25k images of complex everyday scenes with a variety of objects in their natural spatial context; on average, each image has 19.5 instances and 10.5 object classes. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation, provide baseline performances on both, and release re-implementations of state-of-the-art models as open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for semantic segmentation performance. We show that networks trained on ADE20K can segment a wide variety of scenes and objects.
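The synchronized batch normalization finding is straightforward to reproduce in modern frameworks. Below is a minimal sketch, not the authors' released training code (which ships its own implementation), using PyTorch's built-in `nn.SyncBatchNorm` so that normalization statistics are computed across all GPUs rather than per device; the toy network is a stand-in, not the paper's architecture.

```python
# Minimal sketch: enabling synchronized batch normalization in PyTorch so that
# batch statistics are computed over the effective (global) batch across GPUs,
# the property the paper finds crucial for semantic segmentation accuracy.
import torch
import torch.nn as nn

model = nn.Sequential(              # stand-in for a real segmentation network
    nn.Conv2d(3, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 150, 1),          # 150 classes, as in SceneParse150
)

# Replace every BatchNorm layer with its synchronized counterpart.
# Note: SyncBatchNorm requires an initialized torch.distributed process group
# at training time; with a single process it behaves like ordinary BatchNorm.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```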



Notes

  1. As the original images in the ADE20K dataset come in various sizes, for simplicity we rescale the larger images in the SceneParse150 benchmark so that their minimum side (height or width) is 512 pixels (see the sketch after these notes).

  2. http://sceneparsing.csail.mit.edu.

  3. Re-implementations of the state-of-the-art models are released at https://github.com/CSAILVision/semantic-segmentation-pytorch.
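For concreteness, here is one way the rescaling described in Note 1 could be implemented. This is an illustrative sketch using Pillow, not the benchmark's actual preprocessing script; the choice of bilinear resampling is an assumption.

```python
# Illustrative sketch of the rescaling in Note 1: downscale any image whose
# shorter side exceeds 512 pixels so that its minimum side becomes exactly 512;
# smaller images are left untouched.
from PIL import Image

def rescale_min_side(path, target=512):
    img = Image.open(path)
    w, h = img.size
    short = min(w, h)
    if short <= target:              # only large images are downscaled
        return img
    scale = target / short
    new_size = (round(w * scale), round(h * scale))
    return img.resize(new_size, Image.BILINEAR)  # bilinear is an assumption
```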


Acknowledgements

This work was partially supported by Samsung, by NSF Grant No. 1524817 to AT, and by CUHK Direct Grant for Research 2018/2019 No. 4055098 to BZ. SF acknowledges the support from NSERC.

Author information


Corresponding author

Correspondence to Bolei Zhou.

Additional information

Communicated by Bernt Schiele.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Dataset is available at http://groups.csail.mit.edu/vision/datasets/ADE20K. Pretrained models and code are released at https://github.com/CSAILVision/semantic-segmentation-pytorch.
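As an illustration of how such pretrained models are typically applied, the sketch below runs single-image inference with a generic PyTorch segmentation network. The `build_model` loader is a hypothetical stand-in, not the repository's real API, and the ImageNet mean/std normalization used here is a common convention assumed for illustration; the actual entry points and constants are defined in the repository above.

```python
# Hedged sketch of single-image scene parsing with a pretrained PyTorch model.
# `build_model` is a hypothetical loader; the real API is in the repo above.
import torch
import torchvision.transforms.functional as TF
from PIL import Image

@torch.no_grad()
def parse_scene(model, image_path, device="cpu"):
    img = Image.open(image_path).convert("RGB")
    x = TF.to_tensor(img)  # HWC uint8 -> CHW float in [0, 1]
    # ImageNet statistics are assumed here; check the repo for the real values.
    x = TF.normalize(x, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    logits = model(x.unsqueeze(0).to(device))      # (1, 150, H, W) class scores
    return logits.argmax(dim=1).squeeze(0).cpu()   # (H, W) per-pixel class ids

# model = build_model("resnet50dilated-ppm_deepsup").eval()  # hypothetical
# mask = parse_scene(model, "example.jpg")
```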


About this article


Cite this article

Zhou, B., Zhao, H., Puig, X. et al. Semantic Understanding of Scenes Through the ADE20K Dataset. Int J Comput Vis 127, 302–321 (2019). https://doi.org/10.1007/s11263-018-1140-0

