Abstract
We consider the task of learning a classifier for semantic segmentation using weak supervision in the form of image labels specifying objects present in the image. Our method uses deep convolutional neural networks (cnns) and adopts an Expectation-Maximization (EM) based approach. We focus on the following three aspects of EM: (i) initialization; (ii) latent posterior estimation (E-step) and (iii) the parameter update (M-step). We show that saliency and attention maps, bottom-up and top-down cues respectively, of images with single objects (simple images) provide highly reliable cues to learn an initialization for the EM. Intuitively, given weak supervisions, we first learn to segment simple images and then move towards the complex ones. Next, for updating the parameters (M step), we propose to minimize the combination of the standard softmax loss and the KL divergence between the latent posterior distribution (obtained using the E-step) and the likelihood given by the cnn. This combination is more robust to wrong predictions made by the E step of the EM algorithm. Extensive experiments and discussions show that our method is very simple and intuitive, and outperforms the state-of-the-art method with a very high margin of 3.7% and 3.9% on the PASCAL VOC12 train and test sets respectively, thus setting new state-of-the-art results.
Q. Hou and D. Massiceti—These authors are contributed equally.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
The softmax function is defined as \(\sigma (f_k) = \frac{e^{f_k}}{\sum _{j=0}^c e^{f_j}}\).
References
Ahmed, F., Tarlow, D., Batra, D.: Optimizing expected intersection-over-union with candidate-constrained CRFs. In: ICCV (2015)
Alexe, B., Deselares, T., Ferrari, V.: Measuring the objectness of image windows. PAMI 34(11), 2189–2202 (2012)
Arbelaez, P., Pont-Tuset, J., Barron, J., Marques, F., Malik, J.: Multiscale combinatorial grouping. In: CVPR (2014)
Bearman, A., Russakovsky, O., Ferrari, V., Fei-Fei, L.: What’s the point: semantic segmentation with point supervision. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 549–565. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_34
Chandra, S., Kokkinos, I.: Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 402–418. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_25
Chen, L.-G., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Semantic image segmentation with deep convolutional nets and fully connected. In: ICLR (2015)
Cheng, M., Zhang, Z., Lin, W., Torr, P.H.S.: BING: binarized normed gradients for objectness estimation at 300fps. In: CVPR (2014)
Cheng, M.-M., Mitra, N.J., Huang, X., Torr, P.H.S., Hu, S.-M.: Global contrast based salient region detection. IEEE TPAMI 37(3), 569–582 (2015)
Cogswell, M., Lin, X., Purushwalkam, S., Batra, D.: Combining the best of graphical models and convnets for semantic segmentation (2014). arXiv:1412.4313
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Royal Stat. Soc. 39(1), 1–38 (1977)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)
Everingham, M., Eslami, S.M.A., Gool, L.V., Williams, C., Winn, J., Zisserman, A.: The Pascal visual object classes challenge a retrospective. IJCV 111(1), 98–136 (2015)
Felzenszwalb, P.F., Huttenlocher, D.P.: Efficient graph based image segmentation. IJCV 59(2), 167–181 (2004)
Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., Malik, J.: Semantic contours from inverse detectors. In: ICCV (2011)
Hou, Q., Cheng, M.-M., Hu, X.-W., Borji, A., Tu, Z., Torr, P.: Deeply supervised salient object detection with short connections. In: IEEE CVPR (2017)
Jeff Wu, C.F.: On the convergence properties of the EM algorithm. Ann. Stat. 11(1), 95–103 (1983)
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: convolutional architecture for fast feature embedding. In: ACM International Conference on Multimedia (2014)
Jiang, H., Wang, J., Yuan, Z., Wu, Y., Zheng, N., Li, S.: Salient object detection: a discriminative regional feature integration approach. In: CVPR (2013)
Kolesnikov, A., Lampert, C.H.: Seed, expand and constrain: three principles for weakly-supervised image segmentation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 695–711. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_42
Krahenbuhl P., Koltun, V.: Efficient inference in fully connected CRFs with Gaussian edge potentials. In: NIPS (2011)
Kullback, S., Leibler, R.A.: On information and sufficiency. Ann. Math. Stat. 22(1), 79–86 (1951)
Lin, D., Dai, J., Jia, J., He, K., Sun, J.: ScribbleSup: scribble-supervised convolutional networks for semantic segmentation. In: CVPR (2016)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions. Wiley, Hoboken (1997)
Nowozin, S.: Optimal decisions from probabilistic models: the intersection-over-union case. In: CVPR (2014)
Papandreou, G., Chen, L.-C., Murphy, K.P., Yuille, A.L.: Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In: ICCV (2015)
Pathak, D., Krahenbuhl, P., Darrell, T.: Constrained convolutional neural networks for weakly supervised segmentation. In: ICCV (2015)
Pathak, D., Shelhamer, E., Long, J., Darrell, T.: Fully convolutional multi-class multiple instance learning. In: ICLR (2014)
Pinheiro, P.O., Collobert, R.: From image-level to pixel-level labeling with convolutional networks. In: CVPR (2015)
Qi, X., Liu, Z., Shi, J., Zhao, H., Jia, J.: Augmented feedback in semantic segmentation under image level supervision. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 90–105. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_6
Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. In: ICLR (2014)
Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. IJCV 104(2), 154–171 (2013)
Wei, Y., Feng, J., Liang, X., Cheng, M., Zhao, Y., Yan, S.: Object region mining with adversarial erasing: a simple classification to semantic segmentation approach. In: CVPR (2017)
Xu, J., Schwing, A., Urtasun, R.: Learning to segment under various forms of weak supervision. In: CVPR (2015)
Yunchao, W., Xiaodan, L., Yunpeng, C., Xiaohui, S., Cheng, M.-M., Yao, Z., Shuicheng, Y.: STC: a simple to complex framework for weakly-supervised semantic segmentation (2015). arXiv:1509.03150
Zhang, J., Lin, Z., Brandt, J., Shen, X., Sclaroff, S.: Top-down neural attention by excitation backprop. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 543–559. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_33
Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., Torr, P.: Conditional random fields as recurrent neural networks. In: ICCV (2015)
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: CVPR (2016)
Acknowledgments
Qibin Hou, Yunchao Wei and Ming-Ming Cheng were sponsored by NSFC (61620106008, 61572264), CAST (YESS20150117), Huawei Innovation Research Program (HIRP), and IBM Global SUR award. Daniela Massiceti, Punnet K. Dokania and Philip H.S. Torr were sponsored by ERC grant ERC-2012-AdG 321162-HELIOS. Ms Massiceti was also sponsored by the Skye Foundation. We thank all sponsors for their support.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Hou, Q., Massiceti, D., Dokania, P.K., Wei, Y., Cheng, MM., Torr, P.H.S. (2018). Bottom-Up Top-Down Cues for Weakly-Supervised Semantic Segmentation. In: Pelillo, M., Hancock, E. (eds) Energy Minimization Methods in Computer Vision and Pattern Recognition. EMMCVPR 2017. Lecture Notes in Computer Science(), vol 10746. Springer, Cham. https://doi.org/10.1007/978-3-319-78199-0_18
Download citation
DOI: https://doi.org/10.1007/978-3-319-78199-0_18
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-78198-3
Online ISBN: 978-3-319-78199-0
eBook Packages: Computer ScienceComputer Science (R0)