
Simple feature pyramid network for weakly supervised object localization using multi-scale information

Abstract

The purpose of weakly supervised object localization (WSOL) is to localize objects using only classification labels. However, most WSOL methods tend to highlight only the most discriminative part of an object. Moreover, to compensate for the lack of annotations such as bounding boxes, they introduce optimization problems more complex than the classification problem itself. To make WSOL more efficient, we propose a new architecture that exploits a feature pyramid network (FPN) and multi-scale information, simplifying the optimization while improving localization. In the proposed model, the FPN produces high-quality multi-scale feature maps, which are then gathered to perform classification. Localization therefore draws on rich, high-quality information, which yields two advantages. First, the proposed model improves localization accuracy. Second, it does not require solving a complex optimization problem, which removes a significant burden such as hyperparameter tuning. Experiments confirm that the proposed method outperforms state-of-the-art methods on the CUB-200-2011 and ILSVRC datasets.
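The core idea of the abstract can be sketched in a few lines: an FPN top-down pathway fuses backbone feature maps into a multi-scale pyramid, and the pyramid levels are gathered for classification. The following is a minimal NumPy sketch of that pattern, not the authors' implementation; the number of levels, channel widths, and the pooled-average fusion are illustrative assumptions.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_fuse(features, proj_weights):
    """Top-down FPN pathway (Lin et al., 2017): project each backbone
    map to a common channel width with a 1x1 projection, then add the
    upsampled coarser level."""
    # features: list of (C_i, H_i, W_i) maps, fine -> coarse,
    # with spatial size halving at each level.
    laterals = [np.einsum('oc,chw->ohw', w, f)
                for w, f in zip(proj_weights, features)]
    fused = [laterals[-1]]                      # start from the coarsest level
    for lat in reversed(laterals[:-1]):
        fused.append(lat + upsample2x(fused[-1]))
    return fused[::-1]                          # fine -> coarse, common width

def classify(pyramid, cls_weights):
    # Gather multi-scale information: global-average-pool each level,
    # average the pooled vectors, and apply a linear classifier.
    pooled = np.mean([p.mean(axis=(1, 2)) for p in pyramid], axis=0)
    return cls_weights @ pooled                 # class logits

# Toy usage: a 3-level pyramid with hypothetical shapes.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, 16, 16)),
         rng.standard_normal((16, 8, 8)),
         rng.standard_normal((32, 4, 4))]
proj = [0.1 * rng.standard_normal((4, c)) for c in (8, 16, 32)]
pyramid = fpn_fuse(feats, proj)
logits = classify(pyramid, rng.standard_normal((10, 4)))
```

Because every pyramid level shares the common channel width, a class activation map for localization can be read off any level (or their sum) with the same classifier weights.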



References

  1. Choe, J., & Shim, H. (2019). Attention-based dropout layer for weakly supervised object localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2219–2228.

  2. Cubuk, E. D, Zoph, B., Mane, D., Vasudevan, V., & Le, Q. V. (2019). Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 113–123.

  3. Dabkowski, P., & Gal, Y. (2017). Real time image saliency for black box classifiers. In Advances in neural information processing systems (pp. 6967–6976).

  4. He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969.

  5. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778.

  6. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

  7. Jaderberg, M., Simonyan, K., & Zisserman, A., et al. (2015). Spatial transformer networks. In Advances in neural information processing systems (pp. 2017–2025).

  8. Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125.

  9. Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988.

  10. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). Ssd: Single shot multibox detector. In European conference on computer vision (pp. 21–37). Springer.

  11. Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440.

  12. Luo, L., Yuan, C., Zhang, K., Jiang, Y., Zhang, Y., & Zhang, H. (2020). Double shot: Preserve and erase based class attention networks for weakly supervised localization (peca-net). In 2020 IEEE international conference on multimedia and expo (ICME) (pp. 1–6). IEEE.

  13. Mai, J., Yang, M., & Luo, W. (2020). Erasing integrated learning: A simple yet effective approach for weakly supervised object localization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8766–8775.

  14. Meethal, A., Pedersoli, M., Belharbi, S., & Granger, E. (2019). Convolutional stn for weakly supervised object localization and beyond. arXiv preprint arXiv:1912.01522.

  15. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., & Lerer, A. (2017). Automatic differentiation in PyTorch. In NIPS 2017 Autodiff Workshop.

  16. Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99.

  17. Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. In International conference on medical image computing and computer-assisted intervention (pp. 234–241). Springer.

  18. Russakovsky, O., Deng, J., Hao, S., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.


  19. Singh, K. K., & Lee, Y. J. (2017). Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization. In 2017 IEEE international conference on computer vision (ICCV) (pp. 3544–3553). IEEE.

  20. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826.

  21. Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.

  22. Xue, H., Liu, C., Wan, F., Jiao, J., Ji, X., & Ye, Q. (2019). Danet: Divergent activation for weakly supervised object localization. In Proceedings of the IEEE international conference on computer vision, pp. 6589–6598.

  23. Yun, S., Han, D., Oh, S. J., Chun, S., Choe, J., & Yoo, Y. (2019) Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE international conference on computer vision, pp. 6023–6032.

  24. Zhang, X., Wei, Y., Feng, J., Yang, Y., & Huang, T. S. (2018). Adversarial complementary learning for weakly supervised object localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1325–1334.

  25. Zhang, X., Wei, Y., Kang, G., Yang, Y., & Huang, T. (2018). Self-produced guidance for weakly-supervised object localization. In Proceedings of the European conference on computer vision (ECCV), pp. 597–613.

  26. Zhang, S., Wen, L., Bian, X., Lei, Z., & Li, S. Z. (2018). Single-shot refinement neural network for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4203–4212.

  27. Zhao, Q., Sheng, T., Wang, Y., Tang, Z., Chen, Y., Cai, L., & Ling, H. (2019). M2det: A single-shot object detector based on multi-level feature pyramid network. Proceedings of the AAAI Conference on Artificial Intelligence, 33, 9259–9266.


  28. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. (2016). Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929.


Acknowledgements

Myungjoo Kang was supported by the National Research Foundation Grant of Korea (2015R1A5A1009350, 2021R1A2C3010887) and the ICT R&D program of MSIT/IITP(No. 1711117093).

Corresponding author

Correspondence to Myungjoo Kang.




Cite this article

Koo, B., Choi, HS. & Kang, M. Simple feature pyramid network for weakly supervised object localization using multi-scale information. Multidim Syst Sign Process 32, 1185–1197 (2021). https://doi.org/10.1007/s11045-021-00778-9


Keywords

  • Deep learning
  • Convolutional neural network
  • Weakly supervised learning
  • Object localization