Space-Variant Descriptor Sampling for Action Recognition Based on Saliency and Eye Movements

  • Eleonora Vig
  • Michael Dorr
  • David Cox
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7578)


Algorithms using “bag of features”-style video representations currently achieve state-of-the-art performance on action recognition tasks, such as the challenging Hollywood2 benchmark [1,2,3]. These algorithms are based on local spatiotemporal descriptors that can be extracted either sparsely (at interest points) or densely (on regular grids), with dense sampling typically leading to the best performance [1]. Here, we investigate the benefit of space-variant processing of inputs, inspired by attentional mechanisms in the human visual system. We employ saliency-mapping algorithms to find informative regions; descriptors corresponding to these regions are either used exclusively or given greater representational weight (additional codebook vectors). We evaluate this approach with three state-of-the-art action recognition algorithms [1,2,3] and several saliency algorithms. We also use saliency maps derived from human eye movements to probe the limits of the approach. Saliency-based pruning allows up to 70% of descriptors to be discarded while maintaining high performance on Hollywood2. Moreover, pruning 20–50% of descriptors (depending on the model) can even improve recognition. Further improvements can be obtained by combining representations learned separately on saliency-pruned and unpruned descriptor sets. Not surprisingly, using the human eye movement data gives the best mean Average Precision (mAP; 61.9%), providing an upper bound on what is possible with a high-quality saliency map. Even without such external data, the Dense Trajectories model [1] enhanced by automated saliency-based descriptor sampling achieves the best mAP (60.0%) reported on Hollywood2 to date.
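
As a rough illustration of the pruning idea described above, the following Python sketch keeps only the descriptors sampled at the most salient locations and then builds a bag-of-features histogram from the survivors. It is not the authors' implementation: the function names, the (t, y, x) position layout, the `keep_fraction` parameter, and the nearest-codeword histogram are all assumptions made for this example.

```python
import numpy as np


def prune_by_saliency(descriptors, positions, saliency, keep_fraction=0.3):
    """Keep the descriptors whose sampling positions are most salient.

    Hypothetical helper for illustration only.
    descriptors:   (N, D) array of local descriptors.
    positions:     (N, 3) integer array of (t, y, x) sampling locations.
    saliency:      (T, H, W) array, higher values mean more salient.
    keep_fraction: share of descriptors retained (0.3 discards 70%).
    """
    t, y, x = positions.T
    scores = saliency[t, y, x]  # saliency value at each descriptor location
    n_keep = max(1, int(round(keep_fraction * len(descriptors))))
    # argsort is ascending, so the last n_keep indices are the most salient;
    # sorting them restores the original descriptor order.
    keep = np.sort(np.argsort(scores, kind="stable")[-n_keep:])
    return descriptors[keep], keep


def bag_of_features(descriptors, codebook):
    """L1-normalised histogram of nearest-codeword assignments (assumed encoding)."""
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    words = (diffs ** 2).sum(axis=2).argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()


# Toy usage: four descriptors in a 2-frame, 4x4 video; keep the top half.
saliency = np.zeros((2, 4, 4))
saliency[0, 1, 1] = 1.0
saliency[1, 2, 3] = 0.5
positions = np.array([[0, 0, 0], [0, 1, 1], [1, 2, 3], [1, 3, 3]])
descriptors = np.arange(8, dtype=float).reshape(4, 2)
codebook = np.array([[2.0, 3.0], [4.0, 5.0], [100.0, 100.0]])

kept, kept_idx = prune_by_saliency(descriptors, positions, saliency, keep_fraction=0.5)
histogram = bag_of_features(kept, codebook)
```

Under these assumptions, setting `keep_fraction=0.3` would correspond to discarding 70% of the descriptors; the alternative mentioned in the abstract, giving salient descriptors additional codebook vectors rather than discarding the rest, is not shown here.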


Keywords: action recognition, saliency maps, eye movements, bag of features, descriptor pruning


  1. Wang, H., Ullah, M.M., Kläser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-temporal features for action recognition. In: BMVC, p. 127 (2009)
  2. Wang, H., Kläser, A., Schmid, C., Cheng-Lin, L.: Action recognition by dense trajectories. In: CVPR, pp. 3169–3176 (2011)
  3. Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR, pp. 3361–3368 (2011)
  4. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: ICPR, vol. 3, pp. 32–36 (2004)
  5. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV, Nice, France (2003)
  6. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatio-temporal features. In: Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72 (2005)
  7. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. on PAMI 20, 1254–1259 (1998)
  8. Bruce, N., Tsotsos, J.: Saliency based on information maximization. In: Advances in NIPS, vol. 18, pp. 155–162 (2006)
  9. Vig, E., Dorr, M., Martinetz, T., Barth, E.: Intrinsic dimensionality predicts the saliency of natural dynamic scenes. IEEE Trans. on PAMI 34, 1080–1091 (2012)
  10. Geisler, W.S., Perry, J.S.: A real-time foveated multiresolution system for low-bandwidth video communication. In: Rogowitz, B.E., Pappas, T.N. (eds.) Human Vision and Electronic Imaging: SPIE Proceedings, pp. 294–305 (1998)
  11. Rutishauser, U., Walther, D., Koch, C., Perona, P.: Is bottom-up attention useful for object recognition. In: CVPR, pp. 37–44 (2004)
  12. Serre, T., Wolf, L., Bileschi, S., Riesenhuber, M., Poggio, T.: Robust object recognition with cortex-like mechanisms. IEEE Trans. on PAMI 29, 411–426 (2007)
  13. Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., Shum, H.Y.: Learning to detect a salient object. IEEE Trans. on PAMI 33, 353–367 (2011)
  14. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: CVPR, pp. 2929–2936 (2009)
  15. Mathe, S., Sminchisescu, C.: Dynamic Eye Movement Datasets and Learnt Saliency Models for Visual Action Recognition. In: Fitzgibbon, A., Lazebnik, S., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part VII, vol. 7573, pp. 842–856. Springer, Heidelberg (2012)
  16. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: CVPR, pp. 1–8 (2008)
  17. Pinto, N., DiCarlo, J.J., Cox, D.D.: How far can you get with a modern face recognition test set using only simple features? In: CVPR (2009)
  18. Dorr, M., Martinetz, T., Gegenfurtner, K., Barth, E.: Variability of eye movements when viewing dynamic natural scenes. Journal of Vision 10, 1–17 (2010)
  19. Tseng, P., Carmi, R., Cameron, I., Munoz, D., Itti, L.: Quantifying center bias of observers in free viewing of dynamic natural scenes. Journal of Vision 9 (2009)
  20. Mota, C., Aach, T., Stuke, I., Barth, E.: Estimation of multiple orientations in multi-dimensional signals. In: ICIP, pp. 2665–2668 (2004)
  21. Pomplun, M., Ritter, H., Velichkovsky, B.: Disambiguating complex visual information: Towards communication of personal views of a scene. Perception 25, 931–948 (1996)
  22. Bach, F.R., Lanckriet, G.R.G., Jordan, M.I.: Multiple kernel learning, conic duality, and the SMO algorithm. In: ICML, New York, USA, pp. 6–13 (2004)
  23. Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B.: Large scale multiple kernel learning. J. Mach. Learn. Res. 7, 1531–1565 (2006)
  24. Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: ICCV, pp. 2106–2113 (2009)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Eleonora Vig (1)
  • Michael Dorr (2)
  • David Cox (1)
  1. The Rowland Institute at Harvard, Cambridge, USA
  2. Schepens Eye Research Institute, Harvard Medical School, Boston, USA
