
Augmenting bag-of-words: a robust contextual representation of spatiotemporal interest points for action recognition

Original Article · The Visual Computer

Abstract

Although the traditional bag-of-words model, together with local spatiotemporal features, has shown promising results for human action recognition, it ignores all structural information among features, which carries important cues about motion structure in videos. Recent methods typically characterize the relationships among quantized spatiotemporal features to overcome this drawback, but the propagation of quantization error leads to an unreliable representation. To alleviate this propagation, we present a coding method that considers not only the spatial similarity but also the reconstruction ability of visual words, after giving a probabilistic interpretation of the coding coefficients. Based on this coding method, a new type of feature, the cumulative probability histogram (CPH), is proposed to robustly characterize the contextual structural information around interest points; these features are extracted from multi-layered contexts and are assumed to be complementary to local spatiotemporal features. The proposed method is verified on four benchmark datasets. Experimental results show that it achieves better performance than previous methods in action recognition.
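To make the coding idea above concrete, the sketch below soft-assigns each contextual descriptor to its k nearest visual words with distance-based weights normalized into a probability vector, then accumulates these probabilities into one histogram per context layer. This is a minimal sketch under assumed design choices: the Gaussian weighting, the L2 normalization, and the names `soft_assign` and `layer_histogram` are illustrative, and the sketch does not model the reconstruction-ability term that the paper's coding method additionally includes.

```python
import numpy as np

def soft_assign(descriptor, codebook, k=5, beta=1.0):
    """Soft-assign one descriptor to its k nearest visual words.

    Squared distances are turned into Gaussian-kernel weights and
    normalized, so the coefficients read as probabilities over words.
    """
    d2 = np.sum((codebook - descriptor) ** 2, axis=1)  # squared distance to every word
    nearest = np.argsort(d2)[:k]                       # indices of the k nearest words
    w = np.exp(-beta * d2[nearest])                    # closer words get larger weights
    p = np.zeros(len(codebook))
    p[nearest] = w / w.sum()                           # normalize into a probability vector
    return p

def layer_histogram(context_descriptors, codebook, k=5, beta=1.0):
    """Accumulate the soft-assignment probabilities of all descriptors
    in one contextual layer around an interest point, L2-normalized."""
    h = np.zeros(len(codebook))
    for x in context_descriptors:
        h += soft_assign(x, codebook, k=k, beta=beta)
    n = np.linalg.norm(h)
    return h / n if n > 0 else h

# A multi-layered contextual feature would then concatenate one histogram
# per nested context region around the interest point, e.g.:
# cph = np.concatenate([layer_histogram(layer, codebook) for layer in layers])
```

Concatenating the per-layer histograms over several nested context regions would yield a contextual feature of this kind, which could then be fused with the usual bag-of-words histogram of the local descriptors themselves.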


Notes

  1. For the case of \(k=1\), CPH features based on our coding method reduce to those based on the hard-assignment coding method, so we report the result of hard-assignment coding as the accuracy for \(k=1\).
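This reduction is easy to verify: restricting the assignment to the single nearest word makes the normalized weight vector one-hot, which is exactly hard assignment. A toy check, reusing the hypothetical `soft_assign` from the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.random((100, 64))   # toy codebook of 100 visual words
x = rng.random(64)                 # toy descriptor

p = soft_assign(x, codebook, k=1)  # assignment restricted to one word
nearest = np.argmin(((codebook - x) ** 2).sum(axis=1))
assert p[nearest] == 1.0 and p.sum() == 1.0  # one-hot: identical to hard assignment
```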


Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities of China under Grant 106112013CDJZR120014 and by the Scientific and Technological Research Program of Chongqing Municipal Education Commission of China under Grant KJ1401207.

Author information

Correspondence to Junyong Ye.


Cite this article

Li, Y., Ye, J., Wang, T. et al. Augmenting bag-of-words: a robust contextual representation of spatiotemporal interest points for action recognition. Vis Comput 31, 1383–1394 (2015). https://doi.org/10.1007/s00371-014-1020-8
