Neural Computing and Applications, Volume 28, Issue 5, pp 925–939

Evolving weighting schemes for the Bag of Visual Words

  • Hugo Jair Escalante
  • Víctor Ponce-López
  • Sergio Escalera
  • Xavier Baró
  • Alicia Morales-Reyes
  • José Martínez-Carranza
Computational Intelligence for Vision and Robotics


The Bag of Visual Words (BoVW) is an established representation in computer vision. Taking inspiration from text mining, this representation has proved very effective in many domains. However, in most cases, standard term-weighting schemes are adopted (e.g., term-frequency or TF-IDF). Whether alternative weighting schemes could boost the performance of BoVW-based methods remains an open question. More importantly, it is unknown whether effective weighting schemes can be learned automatically from scratch. This paper sheds some light on both of these unknowns. On the one hand, we report an evaluation of the weighting schemes most commonly used in text mining but rarely used in computer vision tasks. On the other hand, we propose an evolutionary algorithm capable of automatically learning weighting schemes for computer vision problems. We report empirical results of an extensive study on several computer vision problems. The results show the usefulness of the proposed method.
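To make the two standard schemes named above concrete, here is a minimal sketch of term-frequency and TF-IDF weighting applied to BoVW histograms. It assumes the textbook definitions from text mining (TF as counts normalized per image, IDF as log(N / df)). The function names and the toy count matrix are hypothetical, and this is not the evolved weighting scheme the paper proposes.

```python
import math

def tf_weights(counts):
    """Term frequency: normalize each image's visual-word counts to sum to 1."""
    weighted = []
    for row in counts:
        total = sum(row)
        weighted.append([c / total if total else 0.0 for c in row])
    return weighted

def idf(counts):
    """Inverse document frequency per visual word: log(N / df)."""
    n_images = len(counts)
    n_words = len(counts[0])
    result = []
    for j in range(n_words):
        # df = number of images in which visual word j occurs at least once
        df = sum(1 for row in counts if row[j] > 0)
        result.append(math.log(n_images / df) if df else 0.0)
    return result

def tfidf_weights(counts):
    """TF-IDF: scale each normalized count by the IDF of its visual word."""
    word_idf = idf(counts)
    return [[tf * w for tf, w in zip(row, word_idf)]
            for row in tf_weights(counts)]

# Hypothetical toy data: 3 images, vocabulary of 4 visual words.
counts = [
    [2, 0, 1, 1],
    [0, 3, 1, 0],
    [1, 1, 2, 0],
]
tf = tf_weights(counts)
tfidf = tfidf_weights(counts)
```

In this toy example the third visual word occurs in every image, so its IDF is log(1) = 0 and TF-IDF removes it entirely, while the fourth word, seen in only one image, receives the largest IDF.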


  • Bag of Visual Words
  • Bag of features
  • Genetic programming
  • Term-weighting schemes
  • Computer vision



This work was supported by CONACyT under Project Grant No. CB-2014-241306 (Clasificación y recuperación de imágenes mediante técnicas de minería de textos) and Spanish Ministry of Economy and Competitiveness TIN2013-43478-P. Víctor Ponce-López is supported by Fellowship No. 2013FI-B01037 and Project TIN2012-38187-C03-02.



Copyright information

© The Natural Computing Applications Forum 2016

Authors and Affiliations

  • Hugo Jair Escalante (1)
  • Víctor Ponce-López (2, 3, 4)
  • Sergio Escalera (3, 4)
  • Xavier Baró (2, 3, 4)
  • Alicia Morales-Reyes (1)
  • José Martínez-Carranza (1)

  1. Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
  2. Universitat Oberta de Catalunya, Barcelona, Spain
  3. University of Barcelona, Barcelona, Spain
  4. Computer Vision Center, Barcelona, Spain
