Abstract
In this paper, we present a novel approach for supervised codebook learning and optimization for bag-of-words models. This type of model is frequently used in visual recognition tasks such as object class recognition and human action recognition. An entity is represented as a histogram of codewords; the codewords are traditionally obtained by clustering with unsupervised methods such as k-means or random forests, and the resulting histograms are then classified in a supervised way. We propose a new supervised method for joint codebook creation and class learning, which learns the cluster centers of the codebook in a goal-directed way using the class labels of the training set. As a result, the codebook is closely tied to the recognition problem and is therefore more discriminative. We propose two different learning algorithms, one based on error backpropagation and the other based on cluster label reassignment. We apply the proposed method to human action recognition from video sequences and evaluate it on the KTH data set, reporting very promising results. The proposed technique makes it possible either to improve the discriminative power of a codebook learned without supervision, or to keep the discriminative power while reducing the size of the learned codebook, which lowers the computational cost of the nearest neighbor search.
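To make the bag-of-words representation concrete, the following is a minimal sketch of how one entity can be turned into a histogram of codewords by nearest neighbor assignment. It illustrates only the generic representation, not the supervised learning algorithms proposed in the paper. The function names `nearest_codeword` and `bow_histogram`, the Euclidean distance, the normalization, and the toy data are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def nearest_codeword(descriptor, codebook):
    """Return the index of the codeword closest to `descriptor`."""
    # Linear scan: the cost grows with the number of codewords.
    return min(range(len(codebook)), key=lambda k: dist(descriptor, codebook[k]))


def bow_histogram(descriptors, codebook):
    """Normalized histogram of nearest-codeword assignments for one entity."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest_codeword(d, codebook)] += 1
    total = sum(counts)
    # Leave the all-zero histogram unchanged if there are no descriptors.
    return [c / total for c in counts] if total else counts


# Hypothetical toy data: three 2-D codewords and four local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]
descriptors = [(0.1, -0.2), (0.9, 1.2), (1.1, 0.8), (3.5, 4.2)]

hist = bow_histogram(descriptors, codebook)
print(hist)  # [0.25, 0.5, 0.25]
```

In this sketch each descriptor is compared against every codeword, so the assignment cost is proportional to the codebook size. This is the nearest neighbor search cost that a smaller codebook reduces.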
Cite this article
Jiu, M., Wolf, C., Garcia, C. et al. Supervised Learning and Codebook Optimization for Bag-of-Words Models. Cogn Comput 4, 409–419 (2012). https://doi.org/10.1007/s12559-012-9137-4
DOI: https://doi.org/10.1007/s12559-012-9137-4