International Journal of Computer Vision, Volume 79, Issue 3, pp 299–318

Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words

Article

Abstract

We present a novel unsupervised learning method for human action categories. A video sequence is represented as a collection of spatial-temporal words by extracting space-time interest points. The algorithm automatically learns the probability distributions of the spatial-temporal words and the intermediate topics corresponding to human action categories. This is achieved by using latent topic models such as the probabilistic Latent Semantic Analysis (pLSA) model and Latent Dirichlet Allocation (LDA). Because it relies on these probabilistic models, our approach can handle noisy feature points arising from dynamic backgrounds and moving cameras. Given a novel video sequence, the algorithm can categorize and localize the human action(s) contained in the video. We test our algorithm on three challenging datasets: the KTH human motion dataset, the Weizmann human action dataset, and a recent dataset of figure skating actions. Our results reflect the promise of such a simple approach. In addition, our algorithm can recognize and localize multiple actions in long and complex video sequences containing multiple motions.
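The topic-model step described in the abstract can be illustrated with a minimal sketch of pLSA fitted by EM. This is not the authors' implementation: it assumes the interest-point detection and codebook stages have already produced a video-by-codeword count matrix, the toy counts below are invented for illustration, and the names `normalize`, `plsa`, and `categorize` are hypothetical. Each latent topic z plays the role of an action category, and a video is assigned to the topic with the highest P(z | d).

```python
import random


def normalize(values):
    """Scale non-negative values so they sum to 1 (uniform if all are zero)."""
    total = sum(values)
    if total == 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def plsa(counts, num_topics, iterations=200, seed=0):
    """Fit pLSA with EM; counts[d][w] is the count of codeword w in video d.

    Returns (p_w_z, p_z_d): P(w | z) per topic and P(z | d) per video.
    """
    rng = random.Random(seed)
    num_docs, num_words = len(counts), len(counts[0])
    # Random initialization of both sets of distributions.
    p_w_z = [normalize([rng.random() for _ in range(num_words)])
             for _ in range(num_topics)]
    p_z_d = [normalize([rng.random() for _ in range(num_topics)])
             for _ in range(num_docs)]

    for _ in range(iterations):
        new_w_z = [[0.0] * num_words for _ in range(num_topics)]
        new_z_d = [[0.0] * num_topics for _ in range(num_docs)]
        for d in range(num_docs):
            for w in range(num_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(w | z) * P(z | d).
                post = normalize([p_w_z[z][w] * p_z_d[d][z]
                                  for z in range(num_topics)])
                # Accumulate expected counts for the M-step.
                for z in range(num_topics):
                    new_w_z[z][w] += n * post[z]
                    new_z_d[d][z] += n * post[z]
        # M-step: renormalize the expected counts into distributions.
        p_w_z = [normalize(row) for row in new_w_z]
        p_z_d = [normalize(row) for row in new_z_d]
    return p_w_z, p_z_d


def categorize(p_z_d):
    """Assign each video to its most probable topic."""
    return [max(range(len(row)), key=lambda z: row[z]) for row in p_z_d]


# Toy data (assumed): 4 videos, 6 codewords. Videos 0-1 mostly use
# codewords 0-2, videos 2-3 mostly use codewords 3-5, with a little noise.
toy_counts = [
    [9, 7, 8, 0, 0, 1],
    [8, 9, 6, 1, 0, 0],
    [0, 1, 0, 7, 9, 8],
    [1, 0, 0, 9, 8, 7],
]
p_w_z, p_z_d = plsa(toy_counts, num_topics=2)
labels = categorize(p_z_d)
print(labels)
```

Topic indices are arbitrary, so only the grouping of videos is meaningful, not which topic number a group receives. Since no labels are used during fitting, mapping topics to named action categories would require a separate step.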

Keywords

Action categorization, Bag of words, Spatio-temporal interest points, Topic models, Unsupervised learning

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Juan Carlos Niebles (1, 2)
  • Hongcheng Wang (3)
  • Li Fei-Fei (4)

  1. Department of Electrical Engineering, Princeton University, Engineering Quadrangle, Princeton, USA
  2. Robotics and Intelligent Systems Group, Universidad del Norte, Barranquilla, Colombia
  3. United Technologies Research Center (UTRC), East Hartford, USA
  4. Department of Computer Science, Princeton University, Princeton, USA
