Multimedia Tools and Applications

Volume 76, Issue 9, pp 11941–11958

Learned features versus engineered features for multimedia indexing

  • Mateusz Budnik
  • Efrain-Leonardo Gutierrez-Gomez
  • Bahjat Safadi
  • Denis Pellerin
  • Georges Quénot

Abstract

In this paper, we compare "traditional" engineered (hand-crafted) features (or descriptors) with learned features for content-based indexing of image and video documents. Learned (or semantic) features are obtained by training classifiers on a source collection whose samples are annotated with concepts. These classifiers are then applied to the samples of a destination collection, and the classification scores for each sample are gathered into a vector that serves as its feature. These feature vectors are in turn used to train another classifier for the destination concepts on the destination collection. When the classifiers trained on the source collection are Deep Convolutional Neural Networks (DCNNs), the intermediate values output by the hidden layers can also be used as feature vectors. We made an extensive comparison of the performance of such features with that of traditional engineered ones, as well as with combinations of the two. The comparison was carried out in the context of the TRECVid semantic indexing task. Our results confirm those obtained for still images: features learned from other training data generally outperform engineered features for concept recognition. Additionally, we found that directly training KNN and SVM classifiers on these features performs significantly better than partially retraining the DCNN to adapt it to the new data. We also found that, even though the learned features performed better than the engineered ones, fusing both performs better still, indicating that engineered features remain useful, at least in the considered case. Finally, the combination of DCNN features with KNN and SVM classifiers was applied to the VOC 2012 object classification task, where it currently obtains the best performance with a MAP of 85.4%.
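To make the pipeline concrete, the following is a minimal sketch in Python, assuming scikit-learn and NumPy. It is an illustration of the two ideas described above, not the authors' implementation: per-concept classifier scores stacked into a "semantic" feature vector, and score-level (late) fusion of SVMs trained on learned and engineered features. All data, dimensionalities, and concept counts are random stand-ins.

```python
# A minimal sketch, not the paper's actual pipeline. Illustrates (1) building
# "semantic" features as vectors of per-concept classifier scores, and (2)
# late-fusing an SVM trained on learned features with one trained on
# engineered features. All arrays and sizes are illustrative stand-ins.
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# --- (1) Semantic features: scores of source-concept classifiers ----------
# Source collection: samples annotated with several source concepts.
X_src = rng.normal(size=(300, 128))         # engineered descriptors
Y_src = rng.integers(0, 2, size=(300, 10))  # annotations for 10 source concepts

# One binary classifier per source concept.
concept_clfs = [SVC(probability=True).fit(X_src, Y_src[:, c]) for c in range(10)]

# Destination collection: the stacked per-concept scores form the feature.
X_dst = rng.normal(size=(200, 128))
semantic = np.column_stack([clf.predict_proba(X_dst)[:, 1] for clf in concept_clfs])

# --- (2) Learned vs. engineered features, with late fusion ----------------
X_learned = normalize(semantic)  # could also be DCNN hidden-layer activations
X_engineered = normalize(rng.normal(size=(200, 64)))  # e.g. a BoW histogram
y = rng.integers(0, 2, size=200)  # one destination concept, binary labels

train, test = slice(0, 150), slice(150, 200)
svm_l = SVC(probability=True).fit(X_learned[train], y[train])
svm_e = SVC(probability=True).fit(X_engineered[train], y[train])

# Simple score-level (late) fusion: average the two classifiers' scores.
fused = (svm_l.predict_proba(X_learned[test])[:, 1]
         + svm_e.predict_proba(X_engineered[test])[:, 1]) / 2.0
print(fused[:5])
```

Averaging is only one of many possible fusion rules; the point is that the two score streams are combined after independent classification, so either feature type can be added or removed without retraining the other.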

Keywords

Semantic indexing · Engineered features · Learned features

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Mateusz Budnik (1)
  • Efrain-Leonardo Gutierrez-Gomez (1)
  • Bahjat Safadi (1)
  • Denis Pellerin (2)
  • Georges Quénot (1)

  1. Université Grenoble Alpes, CNRS, LIG, Grenoble, France
  2. Université Grenoble Alpes, CNRS, GIPSA-Lab, Grenoble, France
