Supervised Learning and Codebook Optimization for Bag-of-Words Models


In this paper, we present a novel approach for supervised codebook learning and optimization for bag-of-words models. This type of model is frequently used in visual recognition tasks such as object class recognition or human action recognition. An entity is represented as a histogram of codewords. Traditionally, the codewords are obtained with unsupervised methods like k-means or random forests, and the resulting histograms are then classified in a supervised way. We propose a new supervised method for joint codebook creation and class learning, which learns the cluster centers of the codebook in a goal-directed way using the class labels of the training set. As a result, the codebook is highly correlated with the recognition problem and therefore more discriminative. We propose two different learning algorithms, one based on error backpropagation and the other based on cluster label reassignment. We apply the proposed method to human action recognition from video sequences and evaluate it on the KTH data set, reporting very promising results. The proposed technique allows us either to improve the discriminative power of a codebook learned without supervision, or to keep the discriminative power while decreasing the size of the learned codebook, which reduces the computational cost of the nearest neighbor search.
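The histogram representation mentioned in the abstract can be illustrated with a minimal sketch. It assumes Euclidean nearest-neighbor assignment and L1 normalization, neither of which is specified in the abstract. The function names and the toy data are hypothetical, the codebook is fixed by hand rather than learned, and the supervised codebook learning itself is not shown.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def nearest_codeword(feature: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codeword closest to `feature` (Euclidean distance)."""
    distances = [math.dist(feature, center) for center in codebook]
    return distances.index(min(distances))


def bow_histogram(features: Sequence[Vector], codebook: Sequence[Vector]) -> List[float]:
    """Build a normalized histogram of codeword assignments for one entity."""
    counts = [0] * len(codebook)
    for feature in features:
        counts[nearest_codeword(feature, codebook)] += 1
    total = sum(counts)
    # Guard against an entity with no local features.
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy data (hypothetical): three 2-D codewords and four local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
features = [(0.1, -0.2), (0.9, 1.2), (1.1, 0.8), (3.7, 4.1)]

histogram = bow_histogram(features, codebook)
print(histogram)  # [0.25, 0.5, 0.25]
```

Each feature is compared against every codeword, so in this brute-force form the assignment cost grows linearly with the codebook size. This is the nearest neighbor search cost that a smaller codebook reduces.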




Author information



Corresponding author

Correspondence to Mingyuan Jiu.


About this article

Cite this article

Jiu, M., Wolf, C., Garcia, C. et al. Supervised Learning and Codebook Optimization for Bag-of-Words Models. Cogn Comput 4, 409–419 (2012).

Keywords


  • Bag-of-words models
  • Supervised learning
  • Neural networks
  • Action recognition