Boosting VLAD with Supervised Dictionary Learning and High-Order Statistics

  • Xiaojiang Peng
  • Limin Wang
  • Yu Qiao
  • Qiang Peng
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8691)


Recent studies show that aggregating local descriptors into a super vector yields an effective representation for retrieval and classification tasks. A popular method along this line is the vector of locally aggregated descriptors (VLAD), which aggregates the residuals between descriptors and visual words. However, the original VLAD ignores high-order statistics of local descriptors, and its dictionary may not be optimal for classification tasks. In this paper, we address these problems by utilizing high-order statistics of local descriptors and performing supervised dictionary learning. Our main contributions are twofold. First, we propose a high-order VLAD (H-VLAD) for visual recognition, which leverages two kinds of high-order statistics in the VLAD-like framework, namely diagonal covariance and skewness. These high-order statistics provide information complementary to VLAD and can be computed efficiently. Second, to further boost the performance of H-VLAD, we design a supervised dictionary learning algorithm that discriminatively refines the dictionary and can also be extended to other super-vector-based encoding methods. We examine the effectiveness of our methods in image-based object categorization and video-based action recognition. Extensive experiments on the PASCAL VOC 2007, HMDB51, and UCF101 datasets show that our method achieves state-of-the-art performance on both tasks.
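
As a rough illustration of the aggregation described in the abstract, the sketch below computes, for each visual word, the sum of residuals (the classic VLAD term) together with the diagonal variance and skewness of the descriptors assigned to that word. It is a minimal sketch under stated assumptions, not the paper's formulation: the hard nearest-word assignment, the function names `high_order_stats` and `encode`, the plain per-word variance and skewness (rather than whatever reference statistics the paper compares against), and the final L2 normalization are all choices made here for illustration. The supervised dictionary learning step is not covered.

```python
import numpy as np


def high_order_stats(descriptors, centers, eps=1e-12):
    """Per-word first-, second-, and third-order statistics (illustrative).

    descriptors: (n, d) array of local descriptors.
    centers:     (k, d) array of visual words (e.g. from k-means).
    Returns three (k, d) arrays: residual sums, diagonal variances, skewness.
    """
    k, d = centers.shape
    # Squared Euclidean distance from every descriptor to every visual word.
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)  # hard assignment (an assumption of this sketch)

    first = np.zeros((k, d))
    variance = np.zeros((k, d))
    skewness = np.zeros((k, d))
    for j in range(k):
        members = descriptors[assign == j]
        if members.shape[0] == 0:
            continue  # an empty word contributes zeros
        # First order: sum of residuals to the visual word (classic VLAD).
        first[j] = (members - centers[j]).sum(axis=0)
        # Second order: diagonal covariance, i.e. per-dimension variance.
        centered = members - members.mean(axis=0)
        variance[j] = (centered ** 2).mean(axis=0)
        # Third order: per-dimension skewness; eps guards zero-variance dims.
        skewness[j] = (centered ** 3).mean(axis=0) / (variance[j] ** 1.5 + eps)
    return first, variance, skewness


def encode(descriptors, centers):
    """Concatenate the three blocks and L2-normalize (normalization is assumed)."""
    blocks = high_order_stats(descriptors, centers)
    vec = np.concatenate([b.ravel() for b in blocks])
    return vec / max(np.linalg.norm(vec), 1e-12)
```

With `k` visual words and `d`-dimensional descriptors, `encode` returns a vector of length `3 * k * d`, three times the size of a plain VLAD vector built from the same dictionary.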


Keywords (added by machine, not by the authors): Visual Word; Action Recognition; Local Descriptor; Sparse Code; Convolutional Neural Network



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Xiaojiang Peng (1, 4, 3)
  • Limin Wang (2, 3)
  • Yu Qiao (3)
  • Qiang Peng (1)

  1. Southwest Jiaotong University, Chengdu, China
  2. Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong, China
  3. Shenzhen Key Lab of CVPR, Shenzhen Institutes of Advanced Technology, CAS, Shenzhen, China
  4. Hengyang Normal University, Hengyang, China
