Imbalance-Aware Discriminative Clustering for Unsupervised Semantic Segmentation

International Journal of Computer Vision

Abstract

Unsupervised semantic segmentation (USS) aims to partition an image into semantically meaningful segments by learning from a collection of unlabeled images. Current approaches are hampered by four difficulties: coordinating representation learning with pixel clustering, modeling the varying feature distributions of different classes, handling outliers and noise, and addressing pixel class imbalance. This paper introduces a novel approach for USS, termed Imbalance-Aware Dense Discriminative Clustering (IDDC), which addresses all of these difficulties in a unified framework. Unlike existing approaches, which learn USS in two stages (i.e., generating and updating pseudo masks, or refining and clustering embeddings), IDDC learns pixel-wise feature representation and dense discriminative clustering in an end-to-end, self-supervised manner. It does so through a novel objective function that transfers the manifold structure of pixels in the embedding space of a vision Transformer (ViT) to the label space while tolerating noise in the pixel affinities. During inference, the trained model directly outputs the classification probability of each pixel conditioned on the image. In addition, this paper proposes a new regularizer, based on the Weibull function, that handles pixel class imbalance and cluster degeneration in a single shot. Experimental results demonstrate that IDDC significantly outperforms all previous USS methods on three real-world datasets: COCO-Stuff-27, COCO-Stuff-171, and Cityscapes. Extensive ablation studies validate the effectiveness of each design. Our code is available at https://github.com/MY-LIU100101/IDDC.
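
As a rough illustration of the inference step described in the abstract, where the model outputs a class probability for every pixel, the sketch below turns a map of per-pixel class scores into probabilities and a label map, then measures how unevenly the pixels are spread across classes. The function names, the (C, H, W) array layout, the toy scores, and the use of NumPy are all assumptions made for illustration; none of this is taken from the IDDC code, and it does not reproduce the paper's objective or its Weibull-based regularizer.

```python
import numpy as np


def pixel_probabilities(logits):
    """Softmax over the class axis of a (C, H, W) score map (hypothetical layout)."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def segment(logits):
    """Assign each pixel to its most probable class."""
    return pixel_probabilities(logits).argmax(axis=0)


def class_frequencies(labels, num_classes):
    """Fraction of pixels assigned to each class; a skewed result signals imbalance."""
    counts = np.bincount(labels.ravel(), minlength=num_classes)
    return counts / counts.sum()


# Toy scores for 3 classes on a 2x4 image (made-up numbers, not model output).
logits = np.zeros((3, 2, 4))
logits[0] = 2.0        # class 0 wins almost everywhere
logits[1, 0, 0] = 5.0  # class 1 wins a single pixel
labels = segment(logits)
freqs = class_frequencies(labels, num_classes=3)
print(labels)
print(freqs)
```

In this toy case one class takes most of the pixels and another takes none. That kind of skew, and the empty cluster in particular, is what the abstract refers to as pixel class imbalance and cluster degeneration.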

Data Availability

The datasets generated and/or analysed during the current study are available in the GitHub repository at https://github.com/nightrome/cocostuff (COCO-Stuff) and in the Cityscapes repository at https://www.cityscapes-dataset.com/.


Acknowledgements

This work was supported in part by Wei Tang’s startup funds from the University of Illinois Chicago and the National Science Foundation (NSF) award CNS-1828265.

Author information

Corresponding author

Correspondence to Wei Tang.

Additional information

Communicated by Ziyue Xu.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, M., Zhang, J. & Tang, W. Imbalance-Aware Discriminative Clustering for Unsupervised Semantic Segmentation. Int J Comput Vis (2024). https://doi.org/10.1007/s11263-024-02083-x

  • DOI: https://doi.org/10.1007/s11263-024-02083-x
