Abstract
In this paper, we consider salient instance segmentation. In addition to producing bounding boxes, our network outputs high-quality instance-level segments as initial selections that indicate the regions of interest. Taking into account the category-independent nature of each target, we design a single-stage salient instance segmentation framework with a novel segmentation branch. The new branch considers not only the local context inside each detection window but also its surrounding context, enabling us to distinguish instances in the same scope even under partial occlusion. Our network is end-to-end trainable and fast, running at 40 fps on images with resolution 320 × 320. We evaluate our approach on a publicly available benchmark and show that it outperforms alternative solutions. We also provide a thorough analysis of our design choices to help readers better understand the function of each part of the network. Source code can be found at https://github.com/RuochenFan/S4Net.
Acknowledgements
This research was supported by the National Natural Science Foundation of China (61521002, 61572264, 61620106008), the National Youth Talent Support Program, the Tianjin Natural Science Foundation (17JCJQJC43700, 18ZXZNGX00110), and the Fundamental Research Funds for the Central Universities (Nankai University, No. 63191501).
Author information
Additional information
An earlier version of this paper was presented at IEEE CVPR 2019.
Ruochen Fan is a master's student in the Computer Science Department, Tsinghua University, under the supervision of Prof. Shi-Min Hu. He currently focuses on perception systems for autonomous driving, especially point cloud segmentation and RGB detection. Previously, he worked on saliency detection and weakly-supervised segmentation.
Ming-Ming Cheng is a professor in the College of Computer Science, Nankai University, leading the Media Computing Lab. He received his Ph.D. degree from Tsinghua University in 2012. Then he worked as a research fellow for 2 years with Prof. P. Torr in Oxford. Dr. Cheng’s research primarily centers on algorithmic issues in image understanding and processing, including image segmentation, editing, retrieval, etc. He has published over 30 papers in leading journals and conferences, such as IEEE TPAMI, ACM TOG, ACM SIGGRAPH, IEEE CVPR, and IEEE ICCV.
Qibin Hou is a Ph.D. student under Prof. Ming-Ming Cheng’s supervision. Before joining the Media Computing Lab at Nankai University, he was a machine learning engineer at Baidu. His research interests include low-level vision, deep learning, and multimedia applications.
Tai-Jiang Mu is an Assistant Researcher in the Graphics and Geometric Computing Group in the Department of Computer Science and Technology at Tsinghua University. He received his bachelor's degree and Ph.D. degree in Computer Science from Tsinghua University in 2011 and 2016, respectively. His research interests are in computer graphics, image and video processing, and stereoscopic perception.
Jingdong Wang is a senior researcher in the Visual Computing Group, Microsoft Research Asia. His areas of interest include efficient CNN architecture design, human pose estimation, semantic segmentation, image classification, object detection, large-scale indexing, and salient object detection. He is serving or has served as an Associate Editor of IEEE TPAMI, IEEE TMM, and IEEE TCSVT, and an area chair (or SPC) of various prestigious conferences in vision, multimedia, and AI, such as CVPR, ICCV, ECCV, ACM MM, IJCAI, and AAAI. He is an ACM Distinguished Member and a Fellow of IAPR.
Shi-Min Hu is a professor in the Department of Computer Science and Technology, Tsinghua University. He received his Ph.D. degree from Zhejiang University in 1996. His research interests include digital geometry processing, video processing, rendering, computer animation, and computer aided geometric design. He has published more than 100 papers in journals and refereed conferences. He is the Editor-in-Chief of Computational Visual Media, and on the editorial boards of several other journals, including Computer Aided Design and Computers & Graphics.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
About this article
Cite this article
Fan, R., Cheng, MM., Hou, Q. et al. S4Net: Single stage salient-instance segmentation. Comp. Visual Media 6, 191–204 (2020). https://doi.org/10.1007/s41095-020-0173-9
DOI: https://doi.org/10.1007/s41095-020-0173-9