Abstract
This paper presents a systematic study of the role of spatial and temporal attention mechanisms in the video salient object detection (VSOD) task. We present a two-stage spatial-temporal attention network, named STA-Net, which makes two major contributions. In the first stage, we devise a Multi-Scale-Spatial-Attention (MSSA) module that reduces the computational cost on non-salient regions while exploiting multi-scale saliency information. This sliced attention method offers a distinct way to efficiently exploit the high-level features of the network with an enlarged receptive field. In the second stage, we propose a Pyramid-Saliency-Shift-Aware (PSSA) module, which emphasizes dynamic object information because it offers a valid shift cue for confirming the salient object and capturing temporal information. This temporal detection module encourages precise salient region detection. Extensive experiments show that the proposed STA-Net is effective for the video salient object detection task and achieves compelling performance in comparison with state-of-the-art methods.
Cite this article
Bi, HB., Lu, D., Zhu, HH. et al. STA-Net: spatial-temporal attention network for video salient object detection. Appl Intell 51, 3450–3459 (2021). https://doi.org/10.1007/s10489-020-01961-4
DOI: https://doi.org/10.1007/s10489-020-01961-4