
Learning Self-supervised Low-Rank Network for Single-Stage Weakly and Semi-supervised Semantic Segmentation


Semantic segmentation with limited annotations, such as weakly supervised semantic segmentation (WSSS) and semi-supervised semantic segmentation (SSSS), is a challenging task that has attracted much attention recently. Most leading WSSS methods employ a sophisticated multi-stage training strategy to estimate pseudo-labels as precisely as possible, but they suffer from high model complexity. In contrast, another line of research trains a single network with image-level labels in one training cycle. However, such a single-stage strategy often performs poorly because errors from inaccurate pseudo-label estimation compound during training. To address this issue, this paper presents a Self-supervised Low-Rank Network (SLRNet) for single-stage WSSS and SSSS. The SLRNet uses cross-view self-supervision: it simultaneously predicts several complementary attentive low-rank (LR) representations from different views of an image to learn precise pseudo-labels. Specifically, we reformulate LR representation learning as a collective matrix factorization problem and optimize it jointly with the network learning in an end-to-end manner. The resulting LR representation suppresses noisy information while capturing semantics that are stable across views, which makes it robust to input variations and thereby reduces overfitting to self-supervision errors. The SLRNet provides a unified single-stage framework for several label-efficient semantic segmentation settings: (1) WSSS with image-level labeled data, (2) SSSS with a small amount of pixel-level labeled data, and (3) SSSS with a small amount of pixel-level labeled data and a large amount of image-level labeled data. Extensive experiments on the Pascal VOC 2012, COCO, and L2ID datasets show that SLRNet outperforms state-of-the-art WSSS and SSSS methods across these settings, demonstrating its generalizability and efficacy.
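To make the idea of a collective matrix factorization concrete, the following is a minimal NumPy sketch, not the SLRNet implementation. It assumes each view of an image is already summarized as a feature matrix of shape (channels, positions), and it factorizes all views against one shared basis using plain alternating least squares. The function name `collective_low_rank`, the `ridge` term, and the synthetic data are assumptions made for illustration; the paper optimizes the factorization jointly with the network, which this sketch does not attempt.

```python
import numpy as np


def collective_low_rank(views, rank, n_iters=50, ridge=1e-6, seed=0):
    """Minimize sum_v ||X_v - D @ C_v||_F^2 over a shared basis D and per-view codes C_v.

    Hypothetical helper for illustration only: plain alternating least squares.
    """
    rng = np.random.default_rng(seed)
    d = views[0].shape[0]
    D = rng.standard_normal((d, rank))   # shared basis, random start
    eye = ridge * np.eye(rank)           # small ridge keeps the solves well-posed

    def solve_codes(basis):
        gram = basis.T @ basis + eye
        return [np.linalg.solve(gram, basis.T @ X) for X in views]

    for _ in range(n_iters):
        # Step 1: per-view coefficients given the current shared basis.
        codes = solve_codes(D)
        # Step 2: shared basis given all coefficients; summing over views
        # is what couples the views together.
        A = sum(C @ C.T for C in codes) + eye
        B = sum(X @ C.T for X, C in zip(views, codes))
        D = np.linalg.solve(A, B.T).T    # solves D @ A = B (A is symmetric)

    return D, solve_codes(D)


# Synthetic example: two noisy "views" that share a 3-dimensional subspace.
rng = np.random.default_rng(1)
true_basis = rng.standard_normal((16, 3))
views = [
    true_basis @ rng.standard_normal((3, 40)) + 0.01 * rng.standard_normal((16, 40))
    for _ in range(2)
]

D, codes = collective_low_rank(views, rank=3)
for X, C in zip(views, codes):
    low_rank = D @ C                     # rank-3 approximation of this view
    rel_err = np.linalg.norm(X - low_rank) / np.linalg.norm(X)
    print(f"relative reconstruction error: {rel_err:.4f}")
```

Because every view is reconstructed from the same basis `D`, structure that appears in only one view tends to be left in the residual, while structure shared across views is retained. This is one way to read the abstract's claim that the LR representation keeps stable semantics and discards noise.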






This work was partially supported by the National Key Research and Development Program of China under Grant 2019YFB2101904, and by the National Natural Science Foundation of China under Grants 61732011, 61876127, 61876088 and 61925602.

Corresponding author

Correspondence to Pengfei Zhu.

Additional information

Communicated by Zhouchen Lin.

Cite this article

Pan, J., Zhu, P., Zhang, K. et al. Learning Self-supervised Low-Rank Network for Single-Stage Weakly and Semi-supervised Semantic Segmentation. Int J Comput Vis 130, 1181–1195 (2022).


  • Weakly-supervised learning
  • Semi-supervised learning
  • Semantic segmentation