Unified Image and Video Saliency Modeling

  • Conference paper
Computer Vision – ECCV 2020 (ECCV 2020)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12350)

Included in the following conference series: European Conference on Computer Vision

Abstract

Visual saliency modeling for images and videos is treated as two independent tasks in recent computer vision literature. While image saliency modeling is a well-studied problem and progress on benchmarks like SALICON and MIT300 is slowing, video saliency models have shown rapid gains on the recent DHF1K benchmark. Here, we take a step back and ask: Can image and video saliency modeling be approached via a unified model, with mutual benefit? We identify different sources of domain shift between image and video saliency data, and between different video saliency datasets, as a key challenge for effective joint modeling. To address this, we propose four novel domain adaptation techniques (Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing, and Bypass-RNN) in addition to an improved formulation of learned Gaussian priors. We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data. We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and on the image saliency datasets SALICON and MIT300. With one set of parameters, UNISAL achieves state-of-the-art performance on all video saliency datasets and is on par with the state of the art on the image saliency datasets, despite a faster runtime and a 5- to 20-fold smaller model size than all competing deep methods. We provide retrospective analyses and ablation studies that confirm the importance of the domain shift modeling. The code is available at https://github.com/rdroste/unisal.
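
A minimal sketch may help make the ideas summarized above concrete. The PyTorch snippet below is a hypothetical illustration, not the authors' UNISAL implementation (which is available in the linked repository); all class and variable names are invented for this example. It shows one set of shared encoder and decoder weights combined with domain-adaptive components: per-dataset batch normalization, learned per-domain Gaussian prior maps, and a Bypass-RNN path in which still-image batches skip the temporal recurrence that video clips pass through.

# Hypothetical sketch of the domain-adaptation ideas described in the abstract.
import torch
import torch.nn as nn

class DomainAdaptivePriorMaps(nn.Module):
    """One set of learned 2-D Gaussian prior maps per domain (dataset)."""
    def __init__(self, num_domains: int, num_gaussians: int = 4):
        super().__init__()
        # Per-domain means and log-standard-deviations, learned jointly.
        self.mu = nn.Parameter(torch.rand(num_domains, num_gaussians, 2))
        self.log_sigma = nn.Parameter(torch.zeros(num_domains, num_gaussians, 2))

    def forward(self, domain: int, height: int, width: int) -> torch.Tensor:
        ys = torch.linspace(0.0, 1.0, height, device=self.mu.device)
        xs = torch.linspace(0.0, 1.0, width, device=self.mu.device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([yy, xx], dim=-1)                   # (H, W, 2)
        mu = self.mu[domain].view(-1, 1, 1, 2)                 # (G, 1, 1, 2)
        sigma = self.log_sigma[domain].exp().view(-1, 1, 1, 2)
        dist = ((grid - mu) / sigma).pow(2).sum(-1)            # (G, H, W)
        return torch.exp(-0.5 * dist)                          # Gaussian prior maps

class UnifiedSaliencyNet(nn.Module):
    """Shared encoder/decoder with domain-specific BatchNorm and a Bypass-RNN."""
    def __init__(self, num_domains: int, channels: int = 32):
        super().__init__()
        self.encoder = nn.Conv2d(3, channels, 3, padding=1)    # stand-in for a CNN backbone
        # Domain-adaptive normalization: one BatchNorm per dataset.
        self.domain_bn = nn.ModuleList(
            nn.BatchNorm2d(channels) for _ in range(num_domains))
        self.rnn = nn.GRU(channels, channels, batch_first=True)
        self.priors = DomainAdaptivePriorMaps(num_domains)
        self.decoder = nn.Conv2d(channels + 4, 1, 1)           # 4 = number of prior maps

    def forward(self, frames: torch.Tensor, domain: int, is_video: bool):
        # frames: (batch, time, 3, H, W); a still image uses time == 1.
        b, t, _, h, w = frames.shape
        feats = self.encoder(frames.flatten(0, 1))             # (B*T, C, H, W)
        feats = self.domain_bn[domain](feats)
        if is_video:
            # Temporal modeling over globally pooled features.
            pooled = feats.mean(dim=(2, 3)).view(b, t, -1)
            temporal, _ = self.rnn(pooled)                     # (B, T, C)
            feats = feats + temporal.reshape(b * t, -1, 1, 1)
        # else: Bypass-RNN, image batches skip the recurrence entirely.
        prior = self.priors(domain, h, w).unsqueeze(0).expand(b * t, -1, -1, -1)
        logits = self.decoder(torch.cat([feats, prior], dim=1))
        return logits.view(b, t, 1, h, w)                      # per-frame saliency logits

# Example: one forward pass for an image-style batch and a video-style clip,
# both through the same network and parameters.
net = UnifiedSaliencyNet(num_domains=4)
image_sal = net(torch.randn(2, 1, 3, 64, 64), domain=0, is_video=False)
video_sal = net(torch.randn(2, 8, 3, 64, 64), domain=1, is_video=True)

In the paper, such domain-adaptive operations are integrated into a lightweight encoder-RNN-decoder network trained jointly on image and video saliency data; the sketch only illustrates how a single set of shared weights can serve both kinds of batches.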

R. Droste and J. Jiao contributed equally to this work.


Acknowledgements

We acknowledge the EPSRC (Project Seebibyte, reference EP/M013774/1) and the NVIDIA Corporation for the donation of a GPU.

Author information

Corresponding author

Correspondence to Richard Droste.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 23433 KB)

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Droste, R., Jiao, J., Noble, J.A. (2020). Unified Image and Video Saliency Modeling. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science, vol. 12350. Springer, Cham. https://doi.org/10.1007/978-3-030-58558-7_25

  • DOI: https://doi.org/10.1007/978-3-030-58558-7_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-58557-0

  • Online ISBN: 978-3-030-58558-7

  • eBook Packages: Computer Science, Computer Science (R0)
