Modeling Visual Context Is Key to Augmenting Object Detection Datasets

  • Nikita Dvornik
  • Julien Mairal
  • Cordelia Schmid
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11216)

Abstract

Data augmentation is well known to be important when training deep neural networks for visual recognition. By artificially increasing the number of training examples, it helps reduce overfitting and improves generalization. For object detection, classical data augmentation consists of generating new images through basic geometric transformations and color changes of the original training images. In this work, we go one step further and leverage segmentation annotations to increase the number of object instances present in the training data. For this approach to be successful, we show that appropriately modeling the visual context surrounding objects is crucial to placing them in the right environment; without such a context model, we show that this pasting strategy actually hurts performance. With our context model, we achieve significant mean average precision improvements on the VOC'12 benchmark when few labeled examples are available.
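The augmentation procedure the abstract describes can be summarized in a short sketch. The Python code below is a hypothetical illustration of context-guided instance pasting, not the authors' released implementation: `instance_bank` (segmented object cut-outs harvested from annotated training images), `context_scorer` (a model rating how plausible an object class is given only the surroundings of a candidate box), and the alpha-blending routine are all assumed names and simplifications.

```python
# Minimal sketch of context-guided copy-paste augmentation for detection.
# All helper names are illustrative assumptions, not the paper's actual API.
import random
import numpy as np

def paste_instance(image, instance_rgba, box):
    """Alpha-blend a segmented object cut-out (H x W x 4) into `image` at `box`."""
    x, y, w, h = box
    patch = instance_rgba[..., :3].astype(np.float32)
    alpha = instance_rgba[..., 3:4].astype(np.float32) / 255.0
    region = image[y:y + h, x:x + w].astype(np.float32)
    image[y:y + h, x:x + w] = (alpha * patch + (1.0 - alpha) * region).astype(np.uint8)
    return image

def augment(image, boxes, labels, instance_bank, context_scorer,
            n_candidates=20, threshold=0.7):
    """Sample random placements for a cut-out instance and keep only a
    placement whose surrounding context the scorer judges plausible for
    the object's class. Patches are assumed smaller than the image."""
    patch, label = random.choice(instance_bank)
    img_h, img_w = image.shape[:2]
    h, w = patch.shape[:2]
    for _ in range(n_candidates):
        x = random.randint(0, img_w - w)
        y = random.randint(0, img_h - h)
        # The scorer sees only the neighborhood of the candidate box (the
        # object region itself is masked out) and returns P(class | context).
        if context_scorer(image, (x, y, w, h), label) >= threshold:
            image = paste_instance(image, patch, (x, y, w, h))
            boxes.append((x, y, w, h))
            labels.append(label)
            break  # one accepted placement per call in this sketch
    return image, boxes, labels
```

The smooth alpha blending in the sketch reflects a practical concern with naive pasting: hard object boundaries introduce artifacts that a detector can latch onto instead of learning the object's appearance.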

Keywords

Object detection · Data augmentation · Visual context

Notes

Acknowledgments

This work was supported by a grant from ANR (MACARON project ANR-14-CE23-0003-01), by the ERC grant 714381 (SOLARIS project), the ERC advanced grant ALLEGRO and gifts from Amazon and Intel.

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Nikita Dvornik¹
  • Julien Mairal¹
  • Cordelia Schmid¹

  1. Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP (Institute of Engineering, Univ. Grenoble Alpes), LJK, Grenoble, France
