STA-Net: spatial-temporal attention network for video salient object detection

Abstract

This paper presents a systematic study of the role of spatial and temporal attention mechanisms in the video salient object detection (VSOD) task. We present a two-stage spatial-temporal attention network, named STA-Net, which makes two major contributions. In the first stage, we devise a Multi-Scale-Spatial-Attention (MSSA) module that reduces the computational cost spent on non-salient regions while exploiting multi-scale saliency information. This sliced attention method offers an efficient way to exploit the high-level features of the network with an enlarged receptive field. In the second stage, we propose a Pyramid-Saliency-Shift-Aware (PSSA) module, which emphasizes dynamic object information: the frame-to-frame shift of the saliency response offers a valid cue for confirming the salient object and capturing temporal information. This temporal detection module encourages precise salient region detection. Extensive experiments show that the proposed STA-Net is effective for the video salient object detection task and achieves compelling performance compared with the state of the art.
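
Only the abstract is available on this page, so as a rough illustration of the two stages it describes, here is a minimal PyTorch sketch of what an MSSA-style gated multi-scale spatial attention module and a PSSA-style saliency-shift cue might look like. Every name, shape, and design choice below (the dilation rates, the sigmoid gating, the frame-difference shift cue, the omission of the pyramid levels) is an assumption inferred from the abstract, not the authors' implementation.

```python
# Minimal, illustrative PyTorch sketch -- NOT the authors' code.
# All names, channel sizes, dilation rates, and the frame-difference
# shift cue are assumptions based only on the abstract.
import torch
import torch.nn as nn


class MultiScaleSpatialAttention(nn.Module):
    """MSSA-style idea: compute single-channel attention logits at several
    dilation rates (enlarged receptive field), fuse them into one spatial
    attention map, and gate the high-level features so later computation
    concentrates on likely-salient regions."""

    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=3, padding=d, dilation=d)
             for d in dilations]
        )
        self.fuse = nn.Conv2d(len(dilations), 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # One single-channel attention logit per scale, then fuse and gate.
        logits = torch.cat([b(feats) for b in self.branches], dim=1)
        attn = torch.sigmoid(self.fuse(logits))  # (B, 1, H, W)
        return feats * attn                      # suppress non-salient regions


class SaliencyShiftCue(nn.Module):
    """PSSA-style idea (pyramid levels omitted for brevity): treat the
    frame-to-frame change of the attended features as a shift cue that
    confirms a moving salient object, then refine the current features
    with that temporal evidence."""

    def __init__(self, channels: int):
        super().__init__()
        self.refine = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, feat_t: torch.Tensor, feat_prev: torch.Tensor) -> torch.Tensor:
        shift = torch.abs(feat_t - feat_prev)    # crude motion/shift evidence
        return self.refine(torch.cat([feat_t, shift], dim=1))


if __name__ == "__main__":
    mssa = MultiScaleSpatialAttention(channels=64)
    pssa = SaliencyShiftCue(channels=64)
    feat_prev, feat_t = torch.randn(2, 1, 64, 56, 56).unbind(0)  # two fake frames
    out = pssa(mssa(feat_t), mssa(feat_prev))
    print(out.shape)  # torch.Size([1, 64, 56, 56])
```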




Author information

Corresponding author

Correspondence to Hong-Bo Bi.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Bi, HB., Lu, D., Zhu, HH. et al. STA-Net: spatial-temporal attention network for video salient object detection. Appl Intell 51, 3450–3459 (2021). https://doi.org/10.1007/s10489-020-01961-4
