Weakly-Supervised Semantic Segmentation with Visual Words Learning and Hybrid Pooling

International Journal of Computer Vision

Abstract

Weakly-supervised semantic segmentation (WSSS) methods with image-level labels generally train a classification network to generate Class Activation Maps (CAMs) as the initial coarse segmentation labels. However, current WSSS methods still perform far from satisfactorily because the CAMs they adopt (1) typically focus on only the partial, most discriminative object regions and (2) usually contain useless background regions. These two problems are attributed to the sole image-level supervision and the aggregation of global information when training the classification network. In this work, we propose a visual words learning module and a hybrid pooling approach, and incorporate them into the classification network to mitigate the above problems. In the visual words learning module, we counter the first problem by forcing the classification network to learn fine-grained visual word labels so that more of the object extent can be discovered. Specifically, the visual words are learned with a codebook, which can be updated via two proposed strategies: a learning-based strategy and a memory-bank strategy. The second drawback of CAMs is alleviated by the proposed hybrid pooling, which incorporates global average and local discriminative information to simultaneously ensure object completeness and reduce background regions. We evaluated our method on the PASCAL VOC 2012 and MS COCO 2014 datasets. Without any extra saliency prior, our method achieved 70.6% and 70.7% mIoU on the val and test sets of the PASCAL VOC dataset, respectively, and 36.2% mIoU on the val set of the MS COCO dataset, significantly surpassing state-of-the-art WSSS methods.
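
To make the two ideas in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "local discriminative information" can be approximated by the maximum response in a class score map, that the two pooled values are blended with a hypothetical `weight` parameter, and that a visual word label is the index of the nearest codebook entry under squared Euclidean distance. None of these specifics are stated in the abstract, and all function names are illustrative.

```python
from typing import List, Sequence

ScoreMap = List[List[float]]  # H x W activation scores for a single class


def global_average_pool(score_map: ScoreMap) -> float:
    """Mean over all spatial positions (favours complete object coverage)."""
    values = [v for row in score_map for v in row]
    return sum(values) / len(values)


def global_max_pool(score_map: ScoreMap) -> float:
    """Strongest single response (assumed stand-in for the local discriminative cue)."""
    return max(v for row in score_map for v in row)


def hybrid_pool(score_map: ScoreMap, weight: float = 0.5) -> float:
    """Blend average and max pooling; `weight` is a hypothetical mixing factor."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    avg = global_average_pool(score_map)
    peak = global_max_pool(score_map)
    return weight * avg + (1.0 - weight) * peak


def nearest_visual_word(feature: Sequence[float],
                        codebook: Sequence[Sequence[float]]) -> int:
    """Index of the codeword closest to `feature` (assumed assignment rule)."""
    def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)),
               key=lambda k: squared_distance(feature, codebook[k]))


if __name__ == "__main__":
    toy_map = [[0.0, 0.2], [0.2, 1.0]]
    toy_codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(hybrid_pool(toy_map, weight=0.5))
    print(nearest_visual_word([0.9, 0.1], toy_codebook))
```

In this sketch, a `weight` near 1 behaves like plain average pooling, which tends to spread activation over the whole object but also over background, while a `weight` near 0 behaves like max pooling, which concentrates on the single most discriminative location. The blend is one simple way to trade off the two effects the abstract describes.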

Notes

  1. https://github.com/jiwoon-ahn/irn.

  2. http://host.robots.ox.ac.uk:8080/anonymous/XJDOJG.html.

  3. http://host.robots.ox.ac.uk:8080/anonymous/J00QBG.html.

  4. http://host.robots.ox.ac.uk:8080/anonymous/Y0XECB.html.

  5. http://host.robots.ox.ac.uk:8080/anonymous/0QVYDO.html.

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grants 62141112, T2122014, 41871243 and 61971317, the Science and Technology Major Project of Hubei Province (Next-Generation AI Technologies) under Grant 2019AEA170, and the Natural Science Foundation of Hubei Province under Grant 2020CFB594.

Author information

Corresponding author

Correspondence to Bo Du.

Additional information

Communicated by Vittorio Ferrari.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ru, L., Du, B., Zhan, Y. et al. Weakly-Supervised Semantic Segmentation with Visual Words Learning and Hybrid Pooling. Int J Comput Vis 130, 1127–1144 (2022). https://doi.org/10.1007/s11263-022-01586-9

  • DOI: https://doi.org/10.1007/s11263-022-01586-9
