Enabling Deep Residual Networks for Weakly Supervised Object Detection

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12353)

Abstract

Weakly supervised object detection (WSOD) has attracted extensive research attention because it can flexibly exploit large-scale image-level annotation for detector training. While deep residual networks such as ResNet and DenseNet have become the standard backbones for many computer vision tasks, cutting-edge WSOD methods still rely on plain networks, e.g., VGG, as backbones. Employing deep residual networks for WSOD is not trivial: doing so can significantly degrade detection accuracy and even lead to non-convergence. In this paper, we identify the root cause through detailed analysis and propose a sequence of design principles that take full advantage of deep residual learning for WSOD from the perspectives of adding redundancy, improving robustness and aligning features. First, a redundant adaptation neck is key for effective object instance localization and discriminative feature learning. Second, small-kernel convolutions and MaxPool down-sampling help improve the robustness of information flow, which gives finer object boundaries and makes the detector more sensitive to small objects. Third, dilated convolution is essential to align the proposal features and exploit diverse local information by extracting high-resolution feature maps. Extensive experiments show that the proposed principles enable deep residual networks to establish new state-of-the-art results on PASCAL VOC and MS COCO.
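The third principle above rests on a general property of dilated convolution: it can enlarge the receptive field without reducing spatial resolution. The following is a minimal pure-Python sketch of that property only, not the authors' implementation. The 56x56 feature-map size, the 3x3 kernel, and the helper names `conv_output_size` and `dilated_conv1d` are assumptions chosen for illustration.

```python
def conv_output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution along one dimension.

    Uses the standard formula; the effective kernel extent grows
    with dilation as dilation * (kernel - 1) + 1.
    """
    effective_kernel = dilation * (kernel - 1) + 1
    return (n + 2 * padding - effective_kernel) // stride + 1


def dilated_conv1d(signal, weights, dilation=1):
    """Valid (unpadded) 1-D dilated cross-correlation on Python lists."""
    span = dilation * (len(weights) - 1) + 1
    return [
        sum(w * signal[i + j * dilation] for j, w in enumerate(weights))
        for i in range(len(signal) - span + 1)
    ]


# Hypothetical 56x56 feature map entering a down-sampling stage.
# A stride-2 3x3 convolution halves the resolution ...
strided = conv_output_size(56, kernel=3, stride=2, padding=1)
# ... while a stride-1 3x3 convolution with dilation 2 keeps it,
# and each output still covers a 5-pixel-wide window.
dilated = conv_output_size(56, kernel=3, stride=1, padding=2, dilation=2)

print(strided, dilated)  # 28 56
print(dilated_conv1d([1, 2, 3, 4, 5], [1, 0, -1], dilation=2))  # [-4]
```

Under these assumed sizes, the dilated layer preserves the 56x56 map where the strided layer would shrink it to 28x28, which is the sense in which dilation yields higher-resolution feature maps for proposal features.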

Notes

Acknowledgment

This work is supported by the Nature Science Foundation of China (No. U1705262, No. 61772443, No. 61572410, No. 61802324 and No. 61702136), the National Key R&D Program (No. 2017YFC0113000 and No. 2016YFB1001503), the Key R&D Program of Jiangxi Province (No. 20171ACH80022) and the Natural Science Foundation of Guangdong Province in China (No. 2019B1515120049).

Supplementary material

504445_1_En_8_MOESM1_ESM.pdf (545 kb)
Supplementary material 1 (pdf 544 KB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Media Analytics and Computing Lab, Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen, China
  2. Pinterest, San Francisco, USA
  3. CSE, Southern University of Science and Technology, Shenzhen, China
  4. Tencent Youtu Lab, Tencent Technology (Shanghai) Co., Ltd., Shanghai, China
