International Journal of Computer Vision, Volume 117, Issue 1, pp 93–110

Visual Saliency Detection Using Group Lasso Regularization in Videos of Natural Scenes

  • Nasim Souly
  • Mubarak Shah


Visual saliency is the ability of a vision system to promptly select the most relevant data in a scene, reducing the amount of visual data that needs to be processed. Its applications to complex tasks such as object detection, object recognition, and video compression have therefore attracted interest in computer vision. In this paper, we introduce a novel unsupervised method for detecting visual saliency in videos of natural scenes. We divide a video into non-overlapping cuboids and create a matrix whose columns correspond to the intensity values of these cuboids. Simultaneously, we segment the video using a hierarchical segmentation method and obtain super-voxels. A dictionary learned from the feature data matrix of the video is subsequently used to represent the video as coefficients of atoms. These coefficients are then decomposed into salient and non-salient parts. We propose to use group lasso regularization to find the sparse representation of a video, which benefits from the grouping information provided by the super-voxels and the features extracted from the cuboids. We find salient regions by decomposing the feature matrix of a video into low-rank and sparse matrices using the robust principal component analysis matrix recovery method. The applicability of our method is tested on four video data sets of natural scenes. Our experiments provide promising results in terms of predicting eye movements using standard evaluation methods. In addition, we show that our video saliency can be used to improve the performance of human action recognition on a standard dataset.
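Two of the steps named in the abstract can be illustrated with a minimal NumPy sketch: building the matrix whose columns are the intensity values of non-overlapping cuboids, and the block soft-thresholding operator that is the standard proximal step for a group lasso penalty. This is not the authors' implementation. The function names, the `(T, H, W)` grayscale layout, the cuboid size, the choice to drop incomplete border cuboids, and the integer group labels (standing in for super-voxel membership) are all assumptions made for illustration.

```python
import numpy as np


def video_to_cuboid_matrix(video, size):
    """Split a (T, H, W) grayscale video into non-overlapping cuboids.

    Returns a matrix with one column per cuboid; each column holds the
    flattened intensity values of that cuboid. Trailing frames, rows, or
    columns that do not fill a whole cuboid are dropped (an assumption).
    """
    t, h, w = size
    T, H, W = video.shape
    nt, nh, nw = T // t, H // h, W // w
    v = video[: nt * t, : nh * h, : nw * w]
    # Separate block indices (nt, nh, nw) from within-block indices (t, h, w).
    v = v.reshape(nt, t, nh, h, nw, w).transpose(0, 2, 4, 1, 3, 5)
    return v.reshape(nt * nh * nw, t * h * w).T


def group_soft_threshold(x, groups, lam):
    """Proximal operator of lam * sum_g ||x_g||_2 (group lasso penalty).

    `groups` holds one hypothetical integer label per entry of `x`.
    Each group is shrunk as a block, so a group whose l2 norm is at most
    `lam` is set to zero entirely.
    """
    x = np.asarray(x, dtype=float)
    groups = np.asarray(groups)
    out = np.zeros_like(x)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(x[idx])
        if norm > lam:
            out[idx] = (1.0 - lam / norm) * x[idx]
    return out
```

For a 4×4×4 video and 2×2×2 cuboids, `video_to_cuboid_matrix` returns an 8×8 matrix (8 cuboids of 8 voxels each). The block-wise shrinkage in `group_soft_threshold` is what lets a group lasso penalty switch whole groups of coefficients on or off together, rather than individual entries as with the plain lasso.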


Visual saliency, Sparse coding, Super-voxels, Group lasso



This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20066. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. The Center for Research in Computer Vision, Orlando, USA
