Visual Saliency Detection Using Group Lasso Regularization in Videos of Natural Scenes

Abstract

Visual saliency is the ability of a vision system to promptly select the most relevant data in a scene, reducing the amount of visual data that needs to be processed. Its applications to complex tasks such as object detection, object recognition, and video compression have therefore attracted interest in computer vision. In this paper, we introduce a novel unsupervised method for detecting visual saliency in videos of natural scenes. We divide a video into non-overlapping cuboids and create a matrix whose columns correspond to the intensity values of these cuboids. In parallel, we segment the video with a hierarchical segmentation method to obtain super-voxels. A dictionary learned from the feature data matrix of the video is then used to represent the video as coefficients of atoms, and these coefficients are decomposed into salient and non-salient parts. We propose using group lasso regularization to find the sparse representation of a video, which benefits from the grouping information provided by the super-voxels and from the features extracted from the cuboids. We find salient regions by decomposing the feature matrix of a video into low-rank and sparse matrices using the robust principal component analysis matrix recovery method. We test the applicability of our method on four video datasets of natural scenes. Our experiments show promising results in predicting eye movements under standard evaluation methods. In addition, we show that our video saliency can be used to improve the performance of human action recognition on a standard dataset.
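The building blocks named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the cuboid size, the choice to group coefficient columns by a per-cuboid super-voxel label, and the use of an inexact augmented Lagrange multiplier solver for the low-rank plus sparse decomposition are all assumptions made for illustration. Dictionary learning and super-voxel segmentation are omitted, and the sparse part is assumed to carry the salient content.

```python
import numpy as np


def video_to_cuboid_matrix(video, size=(4, 8, 8)):
    """Stack non-overlapping t x h x w cuboids of a grayscale video as columns."""
    T, H, W = video.shape
    t, h, w = size
    # Crop so that every dimension is a multiple of the cuboid size.
    v = video[: T - T % t, : H - H % h, : W - W % w]
    nt, nh, nw = v.shape[0] // t, v.shape[1] // h, v.shape[2] // w
    blocks = v.reshape(nt, t, nh, h, nw, w).transpose(0, 2, 4, 1, 3, 5)
    # One column per cuboid, one row per voxel inside the cuboid.
    return blocks.reshape(nt * nh * nw, t * h * w).T, (nt, nh, nw)


def group_soft_threshold(coeffs, groups, lam):
    """Proximal operator of lam * sum_g ||coeffs[:, g]||_F.

    `groups` holds one (hypothetical) super-voxel label per column, so all
    cuboids of a group are shrunk, or zeroed, together.
    """
    groups = np.asarray(groups)
    out = np.zeros_like(coeffs, dtype=float)
    for g in np.unique(groups):
        idx = groups == g
        norm = np.linalg.norm(coeffs[:, idx])
        if norm > lam:
            out[:, idx] = (1.0 - lam / norm) * coeffs[:, idx]
    return out


def rpca(M, lam=None, tol=1e-7, max_iter=500):
    """Split M into low-rank L and sparse S with an inexact ALM iteration."""
    M = np.asarray(M, dtype=float)
    if lam is None:
        lam = 1.0 / np.sqrt(max(M.shape))
    norm_M = np.linalg.norm(M)
    spec = np.linalg.norm(M, 2)  # largest singular value
    Y = M / max(spec, np.abs(M).max() / lam)  # dual variable
    mu, rho = 1.25 / spec, 1.5
    S = np.zeros_like(M)
    for _ in range(max_iter):
        # Low-rank update: singular value thresholding.
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # Sparse update: elementwise soft thresholding.
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        Z = M - L - S  # constraint residual
        Y = Y + mu * Z
        mu *= rho
        if np.linalg.norm(Z) <= tol * norm_M:
            break
    return L, S


def cuboid_saliency(S):
    """Score each cuboid (column) by the energy of its sparse component."""
    return np.linalg.norm(S, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.random((16, 32, 32))  # synthetic stand-in for a real clip
    X, grid = video_to_cuboid_matrix(video)
    L, S = rpca(X)
    saliency_volume = cuboid_saliency(S).reshape(grid)
    print(saliency_volume.shape)
```

Reshaping the per-cuboid scores back to the cuboid grid gives a coarse spatio-temporal saliency volume. In the pipeline described above, the decomposition would act on learned dictionary coefficients rather than on raw intensities, with the group shrinkage step supplying the super-voxel structure.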

Acknowledgments

This work was supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior National Business Center contract number D11PC20066. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoI/NBC, or the U.S. Government.

Author information

Correspondence to Nasim Souly.

Additional information

Communicated by Jakob Verbeek.

About this article

Cite this article

Souly, N., Shah, M. Visual Saliency Detection Using Group Lasso Regularization in Videos of Natural Scenes. Int J Comput Vis 117, 93–110 (2016). https://doi.org/10.1007/s11263-015-0853-6

  • DOI: https://doi.org/10.1007/s11263-015-0853-6

Keywords

  • Visual saliency
  • Sparse coding
  • Super-voxels
  • Group lasso