Unified Image and Video Saliency Modeling

Droste, Richard; Jiao, Jianbo; Noble, J. Alison

doi:10.1007/978-3-030-58558-7_25

Richard Droste¹²,
Jianbo Jiao¹² &
J. Alison Noble¹²

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 12350))

Included in the following conference series:

European Conference on Computer Vision

4225 Accesses
80 Citations

Abstract

Visual saliency modeling for images and videos is treated as two independent tasks in recent computer vision literature. While image saliency modeling is a well-studied problem and progress on benchmarks like SALICON and MIT300 is slowing, video saliency models have shown rapid gains on the recent DHF1K benchmark. Here, we take a step back and ask: Can image and video saliency modeling be approached via a unified model, with mutual benefit? We identify different sources of domain shift between image and video saliency data and between different video saliency datasets as a key challenge for effective joint modelling. To address this we propose four novel domain adaptation techniques—Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing and Bypass-RNN—in addition to an improved formulation of learned Gaussian priors. We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data. We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300. With one set of parameters, UNISAL achieves state-of-the-art performance on all video saliency datasets and is on par with the state-of-the-art for image saliency datasets, despite faster runtime and a 5 to 20-fold smaller model size compared to all competing deep methods. We provide retrospective analyses and ablation studies which confirm the importance of the domain shift modeling. The code is available at https://github.com/rdroste/unisal.

R. Droste and J. Jiao—Contributed equally to this work.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Hierarchical Domain-Adapted Feature Learning for Video Saliency Prediction

Article Open access 05 October 2021

Aggregated Deep Saliency Prediction by Self-attention Network

A dual-stream encoder–decoder network with attention mechanism for saliency detection in video(s)

Article 26 December 2023

References

Bak, C., Kocak, A., Erdem, E., Erdem, A.: Spatio-temporal saliency networks for dynamic saliency prediction. IEEE TMM 20(7), 1688–1698 (2017)
Google Scholar
Borji, A.: Saliency prediction in the deep learning era: an empirical investigation. arXiv:1810.03716 (2018)
Borji, A., Itti, L.: State-of-the-art in visual attention modeling. IEEE TPAMI 35(1), 185–207 (2012)
Article Google Scholar
Bousmalis, K., Trigeorgis, G., Silberman, N., Krishnan, D., Erhan, D.: Domain separation networks. In: NeurIPS (2016)
Google Scholar
Bylinskii, Z., Judd, T., Oliva, A., Torralba, A., Durand, F.: What do different evaluation metrics tell us about saliency models? IEEE TPAMI 41(3), 740–757 (2019)
Article Google Scholar
Chang, W.G., You, T., Seo, S., Kwak, S., Han, B.: Domain-specific batch normalization for unsupervised domain adaptation. In: CVPR (2019)
Google Scholar
Cornia, M., Baraldi, L., Serra, G., Cucchiara, R.: Predicting human eye fixations via an LSTM-based saliency attentive model. IEEE TIP 27(10), 5142–5154 (2016)
MathSciNet Google Scholar
Fang, Y., Wang, Z., Lin, W., Fang, Z.: Video saliency incorporating spatiotemporal cues and uncertainty weighting. IEEE TIP 23(9), 3910–3921 (2014)
MathSciNet MATH Google Scholar
Gal, Y., Ghahramani, Z.: A Theoretically grounded application of dropout in recurrent neural networks. In: NeurIPS (2016)
Google Scholar
Gorji, S., Clark, J.J.: Going from image to video saliency: augmenting image salience with dynamic attentional push. In: CVPR (2018)
Google Scholar
Guo, C., Ma, Q., Zhang, L.: Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform. In: CVPR (2008)
Google Scholar
Guo, C., Zhang, L.: A novel multiresolution spatiotemporal saliency detection model and its applications in image and video compression. IEEE TIP 19(1), 185–198 (2009)
MathSciNet MATH Google Scholar
Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: NeurIPS (2007)
Google Scholar
Hossein Khatoonabadi, S., Vasconcelos, N., Bajic, I.V., Shan, Y.: How many bits does it take for a stimulus to be salient? In: CVPR (2015)
Google Scholar
Hou, X., Zhang, L.: Dynamic visual attention: searching for coding length increments. In: NeurIPS (2009)
Google Scholar
Huang, X., Shen, C., Boix, X., Zhao, Q.: SALICON: reducing the semantic gap in saliency prediction by adapting deep neural networks. In: ICCV (2015)
Google Scholar
Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI 20(11), 1254–1259 (1998)
Article Google Scholar
Jetley, S., Murray, N., Vig, E.: End-to-end saliency mapping via probability distribution prediction. In: CVPR (2016)
Google Scholar
Jiang, L., Xu, M., Liu, T., Qiao, M., Wang, Z.: DeepVS: a deep learning based video saliency prediction approach. In: ECCV (2018)
Google Scholar
Jiang, M., Huang, S., Duan, J., Zhao, Q.: SALICON: saliency in context. In: CVPR (2015)
Google Scholar
Judd, T., Durand, F., Torralba, A.: A Benchmark of computational models of saliency to predict human fixations. In: MIT-CSAIL-TR-2012, vol. 1, pp. 1–7 (2012)
Google Scholar
Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: ICCV (2009)
Google Scholar
Kruthiventi, S.S.S., Ayush, K., Babu, R.V.: DeepFix: a fully convolutional neural network for predicting human eye fixations. IEEE TIP 26(9), 4446–4456 (2015)
MathSciNet MATH Google Scholar
Kümmerer, M., Wallis, T.S.A., Bethge, M.: DeepGaze II: reading fixations from deep features trained on object recognition. arXiv:1610.01563 (2016)
Lai, Q., Wang, W., Sun, H., Shen, J.: Video saliency prediction using spatiotemporal residual attentive networks. IEEE TIP 26, 1113–1126 (2019)
Google Scholar
Le Meur, O., Le Callet, P., Barba, D., Thoreau, D.: A coherent computational approach to model bottom-up visual attention. IEEE TPAMI 28(5), 802–817 (2006)
Article Google Scholar
Leboran, V., Garcia-Diaz, A., Fdez-Vidal, X.R., Pardo, X.M.: Dynamic whitening saliency. IEEE TPAMI 39(5), 893–907 (2016)
Article Google Scholar
Li, Y., Wang, N., Shi, J., Liu, J., Hou, X.: Revisiting batch normalization for practical domain adaptation. In: ICLR (2016)
Google Scholar
Linardos, P., Mohedano, E., Nieto, J.J., McGuinness, K., Giro-i Nieto, X., O’Connor, N.E.: Simple vs complex temporal recurrences for video saliency prediction. In: BMVC (2019)
Google Scholar
Liu, J., Shahroudy, A., Xu, D., Wang, G.: Spatio-temporal LSTM with trust gates for 3D human action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 816–833. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_50
Chapter Google Scholar
Maaten, L.V.D., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9(Nov), 2579–2605 (2008)
MATH Google Scholar
Mahadevan, V., Vasconcelos, N.: Spatiotemporal saliency in dynamic scenes. IEEE TPAMI 32(1), 171–177 (2009)
Article Google Scholar
Marat, S., Phuoc, T.H., Granjon, L., Guyader, N., Pellerin, D., Guérin-Dugué, A.: Modelling spatio-temporal saliency to predict gaze direction for short videos. Int. J. Comput. Vis. 82(3), 231 (2009)
Article Google Scholar
Mathe, S., Sminchisescu, C.: Actions in the eye: dynamic gaze datasets and learnt saliency models for visual recognition. IEEE TPAMI 37(7), 1408–1424 (2015)
Article Google Scholar
Min, K., Corso, J.J.: TASED-net: temporally-aggregating spatial encoder-decoder network for video saliency detection. In: ICCV (2019)
Google Scholar
Pan, J., et al.: SaLGAN: visual saliency prediction with generative adversarial networks. arXiv:1701.01081 (2017)
Pan, J., Sayrol, E., Giro-i Nieto, X., McGuinness, K., O’Connor, N.E.: Shallow and deep convolutional networks for saliency prediction. In: CVPR (2016)
Google Scholar
Rozantsev, A., Salzmann, M., Fua, P.: Beyond sharing weights for deep domain adaptation. IEEE TPAMI 41(4), 801–814 (2019)
Article Google Scholar
Rudoy, D., Goldman, D.B., Shechtman, E., Zelnik-Manor, L.: Learning video saliency from human gaze using candidate selection. In: CVPR (2013)
Google Scholar
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C.: MobileNetV2: inverted residuals and linear bottlenecks. In: CVPR (2018)
Google Scholar
Seo, H.J., Milanfar, P.: Static and space-time visual saliency detection by self-resemblance. J. Vis. 9(12), 15–15 (2009)
Article Google Scholar
Sun, Y., Fisher, R.: Object-based visual attention for computer vision. Artif. Intell. 146(1), 77–123 (2003)
Article MathSciNet Google Scholar
Tsai, J.C., Chien, J.T.: Adversarial domain separation and adaptation. In: 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6 (2017)
Google Scholar
Valipour, S., Siam, M., Jagersand, M., Ray, N.: Recurrent fully convolutional networks for video segmentation. In: IEEE WACV, pp. 29–36 (2017)
Google Scholar
Vig, E., Dorr, M., Cox, D.: Large-scale optimization of hierarchical features for saliency prediction in natural images. In: CVPR (2014)
Google Scholar
Wang, W., Shen, J.: Deep visual attention prediction. IEEE TIP 27(5), 2368–2378 (2017)
MathSciNet Google Scholar
Wang, W., Shen, J., Guo, F., Cheng, M.M., Borji, A.: Revisiting video saliency: a large-scale benchmark and a new model. In: CVPR (2018)
Google Scholar
Wang, W., Shen, J., Xie, J., Cheng, M.M., Ling, H., Borji, A.: Revisiting video saliency prediction in the deep learning era. IEEE TPAMI (2019, early access)
Google Scholar
Xiao, T., Li, H., Ouyang, W., Wang, X.: Learning deep feature representations with domain guided dropout for person re-identification. arXiv:1604.07528 (2016)
Yang, S., Lin, G., Jiang, Q., Lin, W.: A dilated inception network for visual saliency prediction. IEEE TMM 22(8), 2163–2176 (2020)
Google Scholar
Zheng, Q., Jiao, J., Cao, Y., Lau, R.W.: Task-driven webpage saliency. In: ECCV (2018)
Google Scholar
Zhong, S.H., Liu, Y., Ren, F., Zhang, J., Ren, T.: Video saliency detection via dynamic consistent spatio-temporal attention modelling. In: AAAI (2013)
Google Scholar

Download references

Acknowledgements

We acknowledge the EPSRC (Project Seebibyte, reference EP/M013774/1) and the NVIDIA Corporation for the donation of GPU.

Author information

Authors and Affiliations

University of Oxford, Oxford, UK
Richard Droste, Jianbo Jiao & J. Alison Noble

Authors

Richard Droste
View author publications
You can also search for this author in PubMed Google Scholar
Jianbo Jiao
View author publications
You can also search for this author in PubMed Google Scholar
J. Alison Noble
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Richard Droste .

Editor information

Editors and Affiliations

University of Oxford, Oxford, UK
Andrea Vedaldi
Graz University of Technology, Graz, Austria
Horst Bischof
University of Freiburg, Freiburg im Breisgau, Germany
Thomas Brox
University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jan-Michael Frahm

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 23433 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Droste, R., Jiao, J., Noble, J.A. (2020). Unified Image and Video Saliency Modeling. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12350. Springer, Cham. https://doi.org/10.1007/978-3-030-58558-7_25

Download citation

DOI: https://doi.org/10.1007/978-3-030-58558-7_25
Published: 29 October 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58557-0
Online ISBN: 978-3-030-58558-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Unified Image and Video Saliency Modeling

Abstract

Access this chapter

Similar content being viewed by others

Hierarchical Domain-Adapted Feature Learning for Video Saliency Prediction

Aggregated Deep Saliency Prediction by Self-attention Network

A dual-stream encoder–decoder network with attention mechanism for saliency detection in video(s)

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (zip 23433 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

Unified Image and Video Saliency Modeling

Abstract

Access this chapter

Similar content being viewed by others

Hierarchical Domain-Adapted Feature Learning for Video Saliency Prediction

Aggregated Deep Saliency Prediction by Self-attention Network

A dual-stream encoder–decoder network with attention mechanism for saliency detection in video(s)

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Editor information

Editors and Affiliations

1 Electronic supplementary material

Supplementary material 1 (zip 23433 KB)

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation