DeepVS: A Deep Learning Based Video Saliency Prediction Approach

  • Lai Jiang
  • Mai Xu
  • Tie Liu
  • Minglang Qiao
  • Zulin Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11218)


In this paper, we propose DeepVS, a novel deep-learning-based video saliency prediction method. Specifically, we establish a large-scale eye-tracking database of videos (LEDOV), which includes the fixations of 32 subjects on 538 videos. From LEDOV, we find that human attention is more likely to be attracted by objects, particularly moving objects or the moving parts of objects. Hence, we develop an object-to-motion convolutional neural network (OM-CNN) to predict intra-frame saliency in DeepVS, composed of an objectness subnet and a motion subnet. In OM-CNN, a cross-net mask and hierarchical feature normalization are proposed to combine the spatial features of the objectness subnet with the temporal features of the motion subnet. We further find from our database that human attention is temporally correlated, with smooth saliency transitions across video frames. We thus propose a saliency-structured convolutional long short-term memory (SS-ConvLSTM) network, which takes the features extracted by OM-CNN as its input. Consequently, the inter-frame saliency maps of a video can be generated, accounting for both the structured output with center-bias and the cross-frame transitions of human attention maps. Finally, experimental results show that DeepVS advances the state-of-the-art in video saliency prediction.
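The SS-ConvLSTM component builds on the convolutional LSTM, which replaces the matrix multiplications of a standard LSTM with convolutions so that hidden and cell states remain spatial feature maps. The paper's specific additions (cross-net mask, hierarchical feature normalization, center-bias structured output) are not reproduced here; the following is a minimal NumPy sketch of one generic ConvLSTM cell update, where the names (`ConvLSTMCell`, `conv2d`) and the random weight initialization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d(x, w):
    """'Same'-padded 2-D convolution: x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    p = k // 2
    H, W = x.shape[1], x.shape[2]
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(c_in):
            for di in range(k):
                for dj in range(k):
                    out[o] += w[o, i, di, dj] * xp[i, di:di + H, dj:dj + W]
    return out

class ConvLSTMCell:
    """One ConvLSTM cell: LSTM gates computed by convolutions over spatial maps."""
    def __init__(self, c_in, c_hid, k=3, seed=0):
        rng = np.random.default_rng(seed)
        # Stacked weights for the four gates (input, forget, output, candidate).
        self.wx = rng.normal(0.0, 0.1, (4 * c_hid, c_in, k, k))
        self.wh = rng.normal(0.0, 0.1, (4 * c_hid, c_hid, k, k))
        self.b = np.zeros((4 * c_hid, 1, 1))

    def step(self, x, h, c):
        # Gate pre-activations from the input map and the previous hidden map.
        z = conv2d(x, self.wx) + conv2d(h, self.wh) + self.b
        i, f, o, g = np.split(z, 4, axis=0)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h_new = sigmoid(o) * np.tanh(c_new)
        return h_new, c_new

# One recurrent step on a random 8x8 feature map with 2 input channels.
cell = ConvLSTMCell(c_in=2, c_hid=3)
x = np.random.default_rng(1).normal(size=(2, 8, 8))
h = np.zeros((3, 8, 8))
c = np.zeros((3, 8, 8))
h, c = cell.step(x, h, c)
print(h.shape)  # (3, 8, 8)
```

Because the gates are convolutions, the hidden state preserves the spatial layout of the frame, which is what allows a saliency map to be read out at every time step.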


Keywords: Saliency prediction · Convolutional LSTM · Eye-tracking database



This work was supported by the National Nature Science Foundation of China under Grant 61573037 and by the Fok Ying Tung Education Foundation under Grant 151061.

Supplementary material

Supplementary material 1: 474202_1_En_37_MOESM1_ESM.pdf (3.7 MB)



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. Beihang University, Beijing, China
