Bag-of-Words Image Representation: Key Ideas and Further Insight

Chapter
Part of the Advances in Computer Vision and Pattern Recognition book series (ACVPR)

Abstract

In object and scene recognition, state-of-the-art performance is obtained with visual Bag-of-Words (BoW) models, which build mid-level representations from densely sampled local descriptors (e.g., Scale-Invariant Feature Transform (SIFT)). Several methods for combining low-level features and setting mid-level parameters have recently been evaluated for image classification. In this chapter, we study in detail the different components of the BoW model in the context of image classification. In particular, we focus on the coding and pooling steps and investigate the impact of the main parameters of the BoW pipeline. We show that an adequate combination of several low-level (sampling rate, multiscale) and mid-level (codebook size, normalization) parameters is decisive for reaching good performance. Based on this analysis, we propose a merging scheme that exploits the specificities of edge-based descriptors: low-contrast and high-contrast regions are pooled separately and then combined to provide a powerful image representation. We also study how classification performance depends on the contrast threshold that determines whether a SIFT descriptor corresponds to a low-contrast or a high-contrast region. We report successful experiments on the Caltech-101 and Scene-15 datasets.
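To make the coding, pooling, and contrast-based merging steps described in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the chapter's implementation: the abstract does not specify the coding scheme, the contrast measure, or the way the two pooled vectors are combined. The sketch assumes hard-assignment coding, uses the descriptor's L2 norm (before any normalization) as a stand-in for contrast, concatenates the two pooled vectors, and applies L2 normalization. The names `hard_assign`, `pool`, and `contrast_split_bow` are hypothetical.

```python
import numpy as np


def hard_assign(descriptors, codebook):
    """Coding step (assumed: hard assignment to the nearest codeword).

    descriptors: (N, D) array of local descriptors.
    codebook:    (K, D) array of visual words.
    Returns an (N, K) one-hot code matrix.
    """
    # Squared Euclidean distance between every descriptor and every codeword.
    d2 = (
        (descriptors ** 2).sum(axis=1, keepdims=True)
        - 2.0 * descriptors @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    nearest = d2.argmin(axis=1)
    codes = np.zeros((descriptors.shape[0], codebook.shape[0]))
    codes[np.arange(descriptors.shape[0]), nearest] = 1.0
    return codes


def pool(codes, mode="sum"):
    """Pooling step: aggregate the codes of one image region into a K-vector."""
    if codes.shape[0] == 0:
        # No descriptor fell into this group: return an all-zero histogram.
        return np.zeros(codes.shape[1])
    return codes.sum(axis=0) if mode == "sum" else codes.max(axis=0)


def contrast_split_bow(descriptors, codebook, threshold, mode="sum"):
    """Pool low- and high-contrast descriptors separately, then merge.

    Assumptions (not taken from the chapter): contrast is approximated by
    the descriptor L2 norm, and merging is a plain concatenation followed
    by L2 normalization.
    """
    contrast = np.linalg.norm(descriptors, axis=1)
    codes = hard_assign(descriptors, codebook)
    low = pool(codes[contrast < threshold], mode)
    high = pool(codes[contrast >= threshold], mode)
    feature = np.concatenate([low, high])
    norm = np.linalg.norm(feature)
    return feature / norm if norm > 0 else feature


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    descriptors = rng.random((500, 128))  # stand-in for dense SIFT descriptors
    codebook = rng.random((16, 128))      # stand-in for a learned codebook
    threshold = np.median(np.linalg.norm(descriptors, axis=1))
    feature = contrast_split_bow(descriptors, codebook, threshold)
    print(feature.shape)  # (32,): one K-bin histogram per contrast group
```

With a codebook of K words, the merged vector has 2K dimensions, so the contrast threshold controls how descriptors are distributed between the two halves rather than the size of the representation.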

Keywords

Local Descriptor; Sparse Code; SIFT Descriptor; Codebook Size; Contrast Region


Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  1. LIP6, UPMC - Sorbonne University, Paris, France
