Bottom-Up Top-Down Cues for Weakly-Supervised Semantic Segmentation

Hou, Qibin; Massiceti, Daniela; Dokania, Puneet Kumar; Wei, Yunchao; Cheng, Ming-Ming; Torr, Philip H. S.

doi:10.1007/978-3-319-78199-0_18

ORCID (Ming-Ming Cheng): orcid.org/0000-0001-5550-8758

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10746)

Included in the following conference series:

International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition

Abstract

We consider the task of learning a classifier for semantic segmentation using weak supervision in the form of image labels specifying the objects present in the image. Our method uses deep convolutional neural networks (CNNs) and adopts an Expectation-Maximization (EM) based approach. We focus on the following three aspects of EM: (i) initialization; (ii) latent posterior estimation (E-step); and (iii) the parameter update (M-step). We show that saliency and attention maps, which are bottom-up and top-down cues respectively, of images with single objects (simple images) provide highly reliable cues for learning an initialization for EM. Intuitively, given weak supervision, we first learn to segment simple images and then move towards the complex ones. Next, for updating the parameters (M-step), we propose to minimize the combination of the standard softmax loss and the KL divergence between the latent posterior distribution (obtained using the E-step) and the likelihood given by the CNN. This combination is more robust to wrong predictions made by the E-step of the EM algorithm. Extensive experiments and discussions show that our method is very simple and intuitive, and outperforms the state-of-the-art method by a margin of 3.7% and 3.9% on the PASCAL VOC12 train and test sets respectively, thus setting new state-of-the-art results.
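The M-step objective described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `m_step_loss`, the mixing weight `weight`, the direction of the KL term, and the use of the argmax of the E-step posterior as the hard label for the softmax loss are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def softmax(f):
    """Row-wise softmax over class scores, shifted for numerical stability."""
    f = f - f.max(axis=1, keepdims=True)
    e = np.exp(f)
    return e / e.sum(axis=1, keepdims=True)

def m_step_loss(scores, posterior, weight=1.0, eps=1e-12):
    """Hypothetical combined loss: softmax loss + weight * KL(posterior || likelihood).

    scores:    (num_pixels, num_classes) raw CNN scores
    posterior: (num_pixels, num_classes) latent posterior from the E-step
    weight:    assumed mixing coefficient (not given in the abstract)
    """
    likelihood = softmax(scores)
    # Assumption: the softmax loss uses the most probable class under the posterior.
    hard = posterior.argmax(axis=1)
    rows = np.arange(scores.shape[0])
    softmax_loss = -np.log(likelihood[rows, hard] + eps).mean()
    # Assumption: KL is taken from the posterior to the CNN likelihood.
    kl = (posterior * (np.log(posterior + eps) - np.log(likelihood + eps))).sum(axis=1).mean()
    return softmax_loss + weight * kl

scores = np.array([[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]])
agreeing = softmax(scores)                 # posterior equal to the likelihood
disagreeing = agreeing[:, ::-1].copy()     # posterior with classes reversed
print(m_step_loss(scores, agreeing), m_step_loss(scores, disagreeing))
```

When the posterior equals the CNN likelihood the KL term vanishes and only the softmax loss remains; a posterior that disagrees with the CNN raises both terms.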

Q. Hou and D. Massiceti contributed equally.

Notes

1.
The softmax function is defined as \(\sigma (f_k) = \frac{e^{f_k}}{\sum _{j=0}^c e^{f_j}}\).
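As a small worked example of this definition, the sketch below evaluates the softmax over a vector of c + 1 scores (indices 0 to c). The helper name `softmax_scores` and the max-subtraction step are illustrative choices rather than something stated in the note; subtracting the maximum leaves the result unchanged mathematically and only avoids overflow in the exponentials.

```python
import numpy as np

def softmax_scores(f):
    """Softmax of a 1-D score vector f_0..f_c, following the note's definition."""
    f = np.asarray(f, dtype=float)
    shifted = f - f.max()      # same result as the raw formula, but overflow-safe
    e = np.exp(shifted)
    return e / e.sum()

probs = softmax_scores([1.0, 2.0, 3.0])
print(probs, probs.sum())
```

The outputs are positive, sum to one, and preserve the ordering of the input scores.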

Acknowledgments

Qibin Hou, Yunchao Wei and Ming-Ming Cheng were sponsored by NSFC (61620106008, 61572264), CAST (YESS20150117), Huawei Innovation Research Program (HIRP), and IBM Global SUR award. Daniela Massiceti, Puneet K. Dokania and Philip H. S. Torr were sponsored by ERC grant ERC-2012-AdG 321162-HELIOS. Ms Massiceti was also sponsored by the Skye Foundation. We thank all sponsors for their support.

Author information

Authors and Affiliations

Nankai University, Tianjin, China
Qibin Hou & Ming-Ming Cheng
University of Oxford, Oxford, UK
Daniela Massiceti, Puneet Kumar Dokania & Philip H. S. Torr
NUS, Singapore, Singapore
Yunchao Wei

Corresponding author

Correspondence to Daniela Massiceti.

Editor information

Editors and Affiliations

Ca’ Foscari University of Venice, Venice, Italy
Marcello Pelillo
University of York, York, United Kingdom
Edwin Hancock

About this paper

Cite this paper

Hou, Q., Massiceti, D., Dokania, P.K., Wei, Y., Cheng, M.-M., Torr, P.H.S. (2018). Bottom-Up Top-Down Cues for Weakly-Supervised Semantic Segmentation. In: Pelillo, M., Hancock, E. (eds) Energy Minimization Methods in Computer Vision and Pattern Recognition. EMMCVPR 2017. Lecture Notes in Computer Science, vol 10746. Springer, Cham. https://doi.org/10.1007/978-3-319-78199-0_18

DOI: https://doi.org/10.1007/978-3-319-78199-0_18
Published: 22 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-78198-3
Online ISBN: 978-3-319-78199-0
eBook Packages: Computer Science, Computer Science (R0)
