
Augmenting bag-of-words: a robust contextual representation of spatiotemporal interest points for action recognition

Original Article · The Visual Computer

Abstract

Although the traditional bag-of-words model, together with local spatiotemporal features, has shown promising results for human action recognition, it ignores all structural information among features, which carries important cues about motion structure in videos. Recent methods typically characterize the relationships among quantized spatiotemporal features to overcome this drawback, but the propagation of quantization error leads to an unreliable representation. To alleviate this propagation, we present a coding method that considers not only the spatial similarity but also the reconstruction ability of visual words, after giving a probabilistic interpretation of the coding coefficients. Based on this coding method, a new type of feature, the cumulative probability histogram (CPH), is proposed to robustly characterize the contextual structural information around interest points; these features are extracted from multi-layered contexts and are assumed to be complementary to local spatiotemporal features. The proposed method is verified on four benchmark datasets. Experimental results show that it achieves better performance than previous methods in action recognition.
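To make the coding idea above concrete, the sketch below soft-assigns each contextual descriptor to its k nearest visual words with distance-based weights normalized into a probability vector, then accumulates these probabilities into one histogram per context layer. This is a minimal sketch under assumed design choices: the Gaussian weighting, the L2 normalization, and the names `soft_assign` and `layer_histogram` are illustrative, and the sketch does not model the reconstruction-ability term that the paper's coding method additionally includes.

```python
import numpy as np

def soft_assign(descriptor, codebook, k=5, beta=1.0):
    """Soft-assign one descriptor to its k nearest visual words.

    Squared distances are turned into Gaussian-kernel weights and
    normalized, so the coefficients read as probabilities over words.
    """
    d2 = np.sum((codebook - descriptor) ** 2, axis=1)  # squared distance to every word
    nearest = np.argsort(d2)[:k]                       # indices of the k nearest words
    w = np.exp(-beta * d2[nearest])                    # closer words get larger weights
    p = np.zeros(len(codebook))
    p[nearest] = w / w.sum()                           # normalize into a probability vector
    return p

def layer_histogram(context_descriptors, codebook, k=5, beta=1.0):
    """Accumulate the soft-assignment probabilities of all descriptors
    in one contextual layer around an interest point, L2-normalized."""
    h = np.zeros(len(codebook))
    for x in context_descriptors:
        h += soft_assign(x, codebook, k=k, beta=beta)
    n = np.linalg.norm(h)
    return h / n if n > 0 else h

# A multi-layered contextual feature would then concatenate one histogram
# per nested context region around the interest point, e.g.:
# cph = np.concatenate([layer_histogram(layer, codebook) for layer in layers])
```

Concatenating the per-layer histograms over several nested context regions would yield a contextual feature of this kind, which could then be fused with the usual bag-of-words histogram of the local descriptors themselves.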


Notes

  1. For the case of \(k=1\), CPH features based on our coding method reduce to those based on the hard-assignment coding method, so we report the result of hard-assignment coding as the accuracy for \(k=1\).
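This reduction is easy to verify: restricting the assignment to the single nearest word makes the normalized weight vector one-hot, which is exactly hard assignment. A toy check, reusing the hypothetical `soft_assign` from the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.random((100, 64))   # toy codebook of 100 visual words
x = rng.random(64)                 # toy descriptor

p = soft_assign(x, codebook, k=1)  # assignment restricted to one word
nearest = np.argmin(((codebook - x) ** 2).sum(axis=1))
assert p[nearest] == 1.0 and p.sum() == 1.0  # one-hot: identical to hard assignment
```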


Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities of China under Grant 106112013CDJZR120014 and by the Scientific and Technological Research Program of Chongqing Municipal Education Commission of China under Grant KJ1401207.

Author information

Correspondence to Junyong Ye.


Cite this article

Li, Y., Ye, J., Wang, T. et al. Augmenting bag-of-words: a robust contextual representation of spatiotemporal interest points for action recognition. Vis Comput 31, 1383–1394 (2015). https://doi.org/10.1007/s00371-014-1020-8
