Attentive Models in Vision: Computing Saliency Maps in the Deep Learning Era

  • Marcella CorniaEmail author
  • Davide Abati
  • Lorenzo Baraldi
  • Andrea Palazzi
  • Simone Calderara
  • Rita Cucchiara
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10640)


Estimating the focus of attention of a person looking at an image or a video is a crucial step which can enhance many vision-based inference mechanisms: image segmentation and annotation, video captioning, autonomous driving are some examples. The early stages of the attentive behavior are typically bottom-up; reproducing the same mechanism means to find the saliency embodied in the images, i.e. which parts of an image pop out of a visual scene. This process has been studied for decades in neuroscience and in terms of computational models for reproducing the human cortical process. In the last few years, early models have been replaced by deep learning architectures, that outperform any early approach compared against public datasets. In this paper, we propose a discussion on why convolutional neural networks (CNNs) are so accurate in saliency prediction. We present our DL architectures which combine both bottom-up cues and higher-level semantics, and incorporate the concept of time in the attentional process through LSTM recurrent architectures. Eventually, we present a video-specific architecture based on the C3D network, which can extracts spatio-temporal features by means of 3D convolutions to model task-driven attentive behaviors. The merit of this work is to show how these deep networks are not mere brute-force methods tuned on massive amount of data, but represent well-defined architectures which recall very closely the early saliency models, although improved with the semantics learned by human ground-truth.


Saliency Human attention Neuroscience Vision Deep learning 


  1. 1.
    Alletto, S., Palazzi, A., Solera, F., Calderara, S., Cucchiara, R.: DR(eye)VE: a dataset for attention-based tasks with applications to autonomous and assisted driving. In: CVPR Workshops (2016)Google Scholar
  2. 2.
    Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
  3. 3.
    Baraldi, L., Grana, C., Cucchiara, R.: Hierarchical boundary-aware neural encoder for video captioning. In: CVPR (2017)Google Scholar
  4. 4.
    Bazzani, L., Larochelle, H., Torresani, L.: Recurrent mixture density network for spatiotemporal visual attention. In: ICLR (2017)Google Scholar
  5. 5.
    Bruce, N., Tsotsos, J.: Saliency based on information maximization. In: ANIPS, pp. 155–162 (2005)Google Scholar
  6. 6.
    Bylinskii, Z., Judd, T., Borji, A., Itti, L., Durand, F., Oliva, A., Torralba, A.: Mit saliency benchmark.
  7. 7.
    Bylinskii, Z., Judd, T., Oliva, A., Torralba, A., Durand, F.: What do different evaluation metrics tell us about saliency models? arXiv preprint arXiv:1604.03605 (2016)
  8. 8.
    Cornia, M., Baraldi, L., Serra, G., Cucchiara, R.: A deep multi-level network for saliency prediction. In: ICPR (2016)Google Scholar
  9. 9.
    Cornia, M., Baraldi, L., Serra, G., Cucchiara, R.: Multi-level net: a visual saliency prediction model. In: Hua, G., Jégou, H. (eds.) ECCV 2016. LNCS, vol. 9914, pp. 302–315. Springer, Cham (2016). CrossRefGoogle Scholar
  10. 10.
    Cornia, M., Baraldi, L., Serra, G., Cucchiara, R.: Predicting human eye fixations via an LSTM-based saliency attentive model. arXiv preprint arXiv:1611.09571 (2017)
  11. 11.
    Cornia, M., Baraldi, L., Serra, G., Cucchiara, R.: Visual saliency for image captioning in new multimedia services. In: ICME Workshops (2017)Google Scholar
  12. 12.
    Greenspan, H., Belongie, S., Goodman, R., Perona, P., Rakshit, S., Anderson, C.H.: Overcomplete steerable pyramid filters and rotation invariance. In: CVPR (1994)Google Scholar
  13. 13.
    Hadizadeh, H., Baji, I.V.: Saliency-aware video compression. IEEE Trans. Image Process. 23(1), 19–33 (2014)CrossRefMathSciNetzbMATHGoogle Scholar
  14. 14.
    Harel, J., Koch, C., Perona, P.: Graph-based visual saliency. In: ANIPS, pp. 545–552 (2006)Google Scholar
  15. 15.
    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)Google Scholar
  16. 16.
    Huang, X., Shen, C., Boix, X., Zhao, Q.: SALICON: reducing the semantic gap in saliency prediction by adapting deep neural networks. In: ICCV (2015)Google Scholar
  17. 17.
    Itti, L., Koch, C.: Computational modelling of visual attention. Nat. Rev. Neurosci. 2(3), 194–203 (2001)CrossRefGoogle Scholar
  18. 18.
    Itti, L., Koch, C., Niebur, E., et al.: A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI 20(11), 1254–1259 (1998)CrossRefGoogle Scholar
  19. 19.
    Jetley, S., Murray, N., Vig, E.: End-to-end saliency mapping via probability distribution prediction. In: CVPR (2016)Google Scholar
  20. 20.
    Jiang, M., Huang, S., Duan, J., Zhao, Q.: Salicon: saliency in context. In: CVPR (2015)Google Scholar
  21. 21.
    Judd, T., Durand, F., Torralba, A.: A benchmark of computational models of saliency to predict human fixations. MIT Technical report (2012)Google Scholar
  22. 22.
    Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: ICCV (2009)Google Scholar
  23. 23.
    Koch, C., Ullman, S.: Shifts in selective visual attention: towards the underlying neural circuitry. In: Vaina, L.M. (ed.) Matters of Intelligence, pp. 115–141. Springer, Dordrecht (1987). CrossRefGoogle Scholar
  24. 24.
    Kruthiventi, S.S., Ayush, K., Babu, R.V.: Deepfix: a fully convolutional neural network for predicting human eye fixations. arXiv preprint arXiv:1510.02927 (2015)
  25. 25.
    Kümmerer, M., Theis, L., Bethge, M.: DeepGaze I: Boosting saliency prediction with feature maps trained on ImageNet. In: ICLR Workshops (2015)Google Scholar
  26. 26.
    Palazzi, A., Solera, F., Calderara, S., Alletto, S., Cucchiara, R.: Learning to attend like a human driver. In: Intelligent Vehicles Symposium (2017)Google Scholar
  27. 27.
    Pan, J., Sayrol, E., Giro-i-Nieto, X., McGuinness, K., Giró-i, N.X.: Shallow and deep convolutional networks for saliency prediction. In: CVPR (2016)Google Scholar
  28. 28.
    Rudoy, D., Goldman, D.B., Shechtman, E., Zelnik-Manor, L.: Learning video saliency from human gaze using candidate selection. In: CVPR (2013)Google Scholar
  29. 29.
    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)Google Scholar
  30. 30.
    Mathe, S., Sminchisescu, C.: Actions in the eye: dynamic gaze datasets and learnt saliency models for visual recognition. IEEE TPAMI 37(7), 1408–1424 (2015)CrossRefGoogle Scholar
  31. 31.
    Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV (2015)Google Scholar
  32. 32.
    Treisman, A.M., Gelade, G.: A feature-integration theory of attention. Cogn. Psychol. 12(1), 97–136 (1980)CrossRefGoogle Scholar
  33. 33.
    Vig, E., Dorr, M., Cox, D.: Large-scale optimization of hierarchical features for saliency prediction in natural images. In: CVPR (2014)Google Scholar
  34. 34.
    Wang, W., Shen, J., Porikli, F.: Saliency-aware geodesic video object segmentation. In: CVPR (2015)Google Scholar
  35. 35.
    Wang, W., Shen, J., Shao, L.: Consistent video saliency using local gradient flow optimization and global refinement. IEEE Trans. Image Process. 24(11), 4185–4196 (2015)CrossRefMathSciNetGoogle Scholar
  36. 36.
    Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8689, pp. 818–833. Springer, Cham (2014). Google Scholar
  37. 37.
    Zhai, Y., Shah, M.: Visual attention detection in video sequences using spatiotemporal cues. In: ACM MM (2006)Google Scholar
  38. 38.
    Zhang, J., Sclaroff, S.: Saliency detection: a boolean map approach. In: ICCV (2013)Google Scholar
  39. 39.
    Zhong, S.H., Liu, Y., Ren, F., Zhang, J., Ren, T.: Video saliency detection via dynamic consistent spatio-temporal attention modelling. In: AAAI (2013)Google Scholar

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Marcella Cornia
    • 1
    Email author
  • Davide Abati
    • 1
  • Lorenzo Baraldi
    • 1
  • Andrea Palazzi
    • 1
  • Simone Calderara
    • 1
  • Rita Cucchiara
    • 1
  1. 1.University of Modena and Reggio EmiliaModenaItaly

Personalised recommendations