Abstract
Weakly-supervised semantic segmentation is a challenging task as no pixel-wise label information is provided for training. Recent methods have exploited classification networks to localize objects by selecting regions with strong response. Such response maps provide only sparse information; however, natural images exhibit strong pairwise relations between pixels, which can be used to propagate the sparse map to a much denser one. In this paper, we propose an iterative algorithm to learn such pairwise relations. It consists of two branches: a unary segmentation network, which learns the label probabilities for each pixel, and a pairwise affinity network, which learns an affinity matrix and refines the probability map generated by the unary network. The results refined by the pairwise network are then used as supervision to train the unary network, and the procedure is repeated iteratively to obtain progressively better segmentation. To learn reliable pixel affinity without accurate annotation, we also propose to mine confident regions. We show that iteratively training this framework is equivalent to optimizing an energy function with convergence to a local minimum. Experimental results on the PASCAL VOC 2012 and COCO datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
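The propagation idea in the abstract, spreading a sparse probability map along pairwise pixel relations, can be illustrated with a toy sketch. This is not the authors' implementation: the learned pairwise affinity network is replaced here by a hand-crafted Gaussian affinity on pixel intensity, the "image" is a hypothetical row of six pixels, and no network training or outer iteration is shown. The names `affinity_matrix`, `propagate`, and `sigma` are assumptions made for illustration only.

```python
import math


def affinity_matrix(features, sigma=0.1):
    """Row-normalized Gaussian affinity between every pair of pixels.

    Assumption: a fixed intensity kernel stands in for the learned
    affinity network described in the abstract.
    """
    n = len(features)
    rows = []
    for i in range(n):
        row = [
            math.exp(-((features[i] - features[j]) ** 2) / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(row)
        rows.append([w / total for w in row])  # each row sums to 1
    return rows


def propagate(probs, transition, steps=3):
    """Refine per-pixel class probabilities by repeating P <- T P."""
    n, k = len(probs), len(probs[0])
    for _ in range(steps):
        probs = [
            [sum(transition[i][j] * probs[j][c] for j in range(n)) for c in range(k)]
            for i in range(n)
        ]
    return probs


# Toy 1D "image": three dark pixels followed by three bright pixels.
features = [0.10, 0.12, 0.11, 0.90, 0.88, 0.91]

# Sparse unary map: only pixels 0 and 3 carry a confident response;
# the remaining pixels are uninformative (uniform over two classes).
unary = [
    [0.9, 0.1], [0.5, 0.5], [0.5, 0.5],
    [0.1, 0.9], [0.5, 0.5], [0.5, 0.5],
]

T = affinity_matrix(features)
refined = propagate(unary, T)

# Dense labels after propagation: argmax over classes for each pixel.
labels = [max(range(2), key=lambda c: p[c]) for p in refined]
print(labels)
```

In this sketch the two confident pixels pull their similar-looking neighbors toward the same class, so every pixel receives a label even though only two carried a strong initial response. In the framework described above, such a refined map would then serve as supervision for the unary network in the next iteration.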
Notes
We use the code provided by the authors. The authors report results on the original training set (1464 images) of the PASCAL VOC 2012 dataset. Here we present results on the augmented training set (10582 images) as all models are trained on the augmented training set.
Acknowledgements
This work is supported by the National Key Basic Research Program of China (No. 2016YFB0100900), the Beijing Science and Technology Planning Project (No. Z191100007419001), the National Natural Science Foundation of China (No. 61773231), and the National Science Foundation (CAREER No. 1149783).
Additional information
Communicated by Kristen Grauman.
About this article
Cite this article
Wang, X., Liu, S., Ma, H. et al. Weakly-Supervised Semantic Segmentation by Iterative Affinity Learning. Int J Comput Vis 128, 1736–1749 (2020). https://doi.org/10.1007/s11263-020-01293-3
DOI: https://doi.org/10.1007/s11263-020-01293-3