Cognitive Computation, Volume 4, Issue 4, pp 409–419

Supervised Learning and Codebook Optimization for Bag-of-Words Models

  • Mingyuan Jiu
  • Christian Wolf
  • Christophe Garcia
  • Atilla Baskurt


In this paper, we present a novel approach for supervised codebook learning and optimization for bag-of-words models. This type of model is frequently used in visual recognition tasks such as object class recognition or human action recognition. An entity is represented as a histogram of codewords. Traditionally, the codewords are obtained with unsupervised methods like k-means or random forests, and the resulting histograms are then classified in a supervised way. We propose a new supervised method for joint codebook creation and class learning, which learns the cluster centers of the codebook in a goal-directed way using the class labels of the training set. As a result, the codebook is closely tied to the recognition problem and is therefore more discriminative. We propose two different learning algorithms, one based on error backpropagation and the other based on cluster label reassignment. We apply the proposed method to human action recognition from video sequences and evaluate it on the KTH data set, reporting very promising results. The proposed technique allows us either to improve the discriminative power of a codebook learned without supervision, or to keep the discriminative power while decreasing the size of the learned codebook, which reduces the computational cost of the nearest neighbor search.
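To make the traditional pipeline mentioned above concrete, the following is a minimal sketch of the unsupervised baseline only: a k-means codebook, nearest neighbor assignment of descriptors to codewords, and a normalized histogram per entity. It is not the supervised method proposed in the paper. The function names (`kmeans`, `assign`, `bow_histogram`), the descriptor dimension, the codebook size, and the synthetic random descriptors are all assumptions made for illustration.

```python
import numpy as np


def assign(features, centers):
    """Nearest neighbor search: index of the closest codeword per descriptor."""
    # Squared Euclidean distances, shape (n_features, n_codewords).
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def kmeans(features, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns the k cluster centers (the codebook)."""
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct training descriptors.
    centers = features[rng.choice(len(features), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(features, centers)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def bow_histogram(features, centers):
    """Represent one entity (e.g. one video) as a normalized codeword histogram."""
    labels = assign(features, centers)
    counts = np.bincount(labels, minlength=len(centers)).astype(float)
    return counts / counts.sum()


# Synthetic stand-ins for local descriptors (hypothetical data, 8 dimensions).
rng = np.random.default_rng(42)
train_descriptors = rng.normal(size=(500, 8))
codebook = kmeans(train_descriptors, k=16)

video_descriptors = rng.normal(size=(60, 8))
hist = bow_histogram(video_descriptors, codebook)
```

In this sketch, `assign` compares every descriptor with every codeword, so its cost grows linearly with the codebook size. That is the cost the abstract refers to when it mentions shrinking the codebook. The histogram `hist` would then be passed to a supervised classifier.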


Bag-of-words models, Supervised learning, Neural networks, Action recognition


Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  • Mingyuan Jiu (1)
  • Christian Wolf (1)
  • Christophe Garcia (1)
  • Atilla Baskurt (1)
  1. Université de Lyon, CNRS, INSA-Lyon, LIRIS, UMR5205, Villeurbanne, France
