International Journal of Computer Vision, Volume 112, Issue 1, pp 90–114

Continuous Action Recognition Based on Sequence Alignment

  • Kaustubh Kulkarni
  • Georgios Evangelidis
  • Jan Cech
  • Radu Horaud

Abstract

Continuous action recognition is more challenging than isolated recognition because classification and segmentation must be carried out simultaneously. We build on the well-known dynamic time warping framework and devise a novel visual alignment technique, namely dynamic frame warping (DFW), which performs isolated recognition based on a per-frame representation of videos and on aligning a test sequence with a model sequence. Moreover, we propose two extensions that perform recognition concomitantly with segmentation, namely one-pass DFW and two-pass DFW. These two methods have their roots in the domain of continuous speech recognition and, to the best of our knowledge, their extension to continuous visual action recognition has been overlooked. We test and illustrate the proposed techniques with a recently released dataset (RAVEL) and with two public-domain datasets widely used in action recognition (Hollywood-1 and Hollywood-2). We also compare the performance of the proposed isolated and continuous recognition algorithms with that of several recently published methods.
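
To make the alignment idea concrete, the following is a minimal sketch of classic dynamic time warping used for isolated recognition by nearest template. It is not the DFW algorithm of the article, whose frame representation and alignment cost are not given in this abstract. The function names `dtw_cost` and `classify_isolated`, the Euclidean frame distance, and the toy one-dimensional features are all assumptions made for illustration.

```python
import math


def dtw_cost(test, model):
    """Classic DTW cost between two sequences of per-frame feature vectors.

    Assumption: frames are equal-length numeric tuples compared with the
    Euclidean distance; the article's actual frame cost may differ.
    """
    n, m = len(test), len(model)
    inf = float("inf")
    # acc[i][j] = minimal accumulated cost of aligning test[:i] with model[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = math.dist(test[i - 1], model[j - 1])
            # Allowed steps: advance in test only, in model only, or in both.
            acc[i][j] = frame_cost + min(
                acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1]
            )
    return acc[n][m]


def classify_isolated(test, models):
    """Return (label, cost) of the template that aligns best with `test`.

    `models` maps an action label to a list of template sequences.
    The cost is not length-normalised, which is a simplification.
    """
    best_label, best_cost = None, float("inf")
    for label, templates in models.items():
        for template in templates:
            cost = dtw_cost(test, template)
            if cost < best_cost:
                best_label, best_cost = label, cost
    return best_label, best_cost


# Hypothetical toy data: one-dimensional "features" per frame.
models = {
    "wave": [[(0.0,), (1.0,), (0.0,)]],
    "walk": [[(0.0,), (2.0,), (4.0,)]],
}
test = [(0.0,), (0.0,), (1.0,), (0.0,)]  # a time-stretched "wave"
label, cost = classify_isolated(test, models)
print(label)  # wave
```

Because the warping path may repeat frames, the stretched test sequence aligns with the "wave" template at zero cost. Continuous recognition, as described above, additionally requires deciding where one action ends and the next begins, which this isolated sketch does not attempt.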

Keywords

  • Action recognition
  • Video segmentation
  • Example-based recognition
  • Template matching
  • Dynamic programming
  • Dynamic time warping
  • Bag of words

Notes

Acknowledgments

The authors acknowledge support from the European project HUMAVIPS #247525 (2010–2013) and from the ERC Advanced Grant VHIA #340113 (2014–2019). J. Cech acknowledges support from the Czech Science Foundation Project GACR.

Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Kaustubh Kulkarni (1)
  • Georgios Evangelidis (1)
  • Jan Cech (2)
  • Radu Horaud (1)

  1. INRIA Grenoble Rhône-Alpes, Montbonnot Saint-Martin, France
  2. Center for Machine Perception, Czech Technical University in Prague, Prague, Czech Republic
