
Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Abstract

Weakly Supervised Object Localization (WSOL), which aims to localize objects using only image-level labels, has attracted much attention because of its low annotation cost in real applications. Recent studies exploit the ability of self-attention in vision Transformers to model long-range dependencies in order to reactivate semantic regions, aiming to avoid the partial activation seen in traditional class activation mapping (CAM). However, the long-range modeling in a Transformer neglects the inherent spatial coherence of the object, and it usually diffuses the semantic-aware regions far from the object boundary, so that the localization results are significantly larger or far smaller than the object. To address this issue, we introduce a simple yet effective Spatial Calibration Module (SCM) for accurate WSOL, incorporating the semantic similarities of patch tokens and their spatial relationships into a unified diffusion model. Specifically, we introduce a learnable parameter to dynamically adjust the semantic correlations and spatial context intensities for effective information propagation. In practice, SCM is designed as an external module of the Transformer and can be removed during inference to reduce the computation cost. The object-sensitive localization ability is implicitly embedded into the Transformer encoder through optimization in the training phase. This enables the generated attention maps to capture sharper object boundaries and to filter out object-irrelevant background areas. Extensive experimental results demonstrate the effectiveness of the proposed method, which significantly outperforms its counterpart TS-CAM on both the CUB-200 and ImageNet-1K benchmarks. The code is available at .
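The abstract gives no formulas, so the following is only a rough sketch of the general idea of diffusing a coarse activation map over a graph that mixes semantic similarity with spatial adjacency. It is not the authors' SCM. Every name here (`grid_adjacency`, `cosine`, `calibrate`, `lam`, `alpha`, `steps`) is hypothetical, the 4-neighbour grid and the restart-style iteration are assumptions, and the fixed float `lam` merely stands in for the learnable parameter the abstract mentions.

```python
import math


def grid_adjacency(h, w):
    """4-neighbour adjacency matrix over an h x w grid of patch tokens."""
    n = h * w
    adj = [[0.0] * n for _ in range(n)]
    for r in range(h):
        for c in range(w):
            i = r * w + c
            for dr, dc in ((1, 0), (0, 1)):  # link to the patch below and to the right
                rr, cc = r + dr, c + dc
                if rr < h and cc < w:
                    j = rr * w + cc
                    adj[i][j] = adj[j][i] = 1.0
    return adj


def cosine(u, v):
    """Cosine similarity of two token embeddings (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def calibrate(tokens, activation, h, w, lam=0.5, alpha=0.5, steps=20):
    """Diffuse `activation` over a graph blending semantic and spatial terms.

    Assumption: an edge between neighbouring patches i and j has weight
    lam * max(cos(i, j), 0) + (1 - lam), so lam trades semantic similarity
    against plain spatial context.
    """
    n = h * w
    adj = grid_adjacency(h, w)
    trans = []
    for i in range(n):
        row = [0.0] * n
        for j in range(n):
            if adj[i][j]:
                sem = max(cosine(tokens[i], tokens[j]), 0.0)
                row[j] = lam * sem + (1.0 - lam)
        total = sum(row)
        # Row-normalise so each patch averages over its weighted neighbours.
        trans.append([x / total for x in row] if total > 0 else row)
    x = list(activation)
    for _ in range(steps):
        # Keep part of the original activation, propagate the rest.
        x = [(1.0 - alpha) * activation[i]
             + alpha * sum(trans[i][j] * x[j] for j in range(n))
             for i in range(n)]
    return x


# Toy 1 x 4 strip: two "object" patches followed by two "background" patches.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
coarse = [1.0, 0.0, 0.0, 0.0]  # only the first patch is activated
print(calibrate(tokens, coarse, h=1, w=4, lam=1.0))
print(calibrate(tokens, coarse, h=1, w=4, lam=0.5))
```

In this toy setup, a purely semantic weighting (`lam=1.0`) spreads activation to the second object patch but not across the object/background boundary, while a mixed weighting (`lam=0.5`) lets some activation leak into the background.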

H. Bai: Research done while Haotian Bai was a Research Assistant at Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong (Shenzhen).



Acknowledgement

This work is supported in part by the Young Scientists Fund of the National Natural Science Foundation of China under grant No. 62106154, by the Natural Science Foundation of Guangdong Province, China (General Program) under grant No. 2022A1515011524, by Shenzhen Science and Technology Program ZDSYS20211021111415025, and by the Guangdong Provincial Key Laboratory of Big Data Computing, The Chinese University of Hong Kong (Shenzhen).

Author information

Corresponding author

Correspondence to Ruimao Zhang.

Electronic supplementary material

Supplementary material 1 (pdf 911 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bai, H., Zhang, R., Wang, J., Wan, X. (2022). Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13669. Springer, Cham. https://doi.org/10.1007/978-3-031-20077-9_36

  • DOI: https://doi.org/10.1007/978-3-031-20077-9_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-20076-2

  • Online ISBN: 978-3-031-20077-9

  • eBook Packages: Computer Science, Computer Science (R0)
