Semantic Concept Detection Using Dense Codeword Motion

  • Claudiu Tănase
  • Bernard Mérialdo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8192)


When detecting semantic concepts in video, much of the existing research in content-based classification uses keyframe information only. In particular, the combination of local features such as SIFT with the Bag of Words model is very popular among TRECVID participants. The few existing motion and spatiotemporal descriptors are computationally heavy and become impractical when applied to large datasets such as TRECVID. In this paper, we propose a way to efficiently combine positional motion, obtained from optical flow in the keyframe, with the information given by the Dense SIFT Bag of Words feature. The features we propose work by spatially binning motion vectors that belong to the same codeword into separate histograms describing movement direction (left, right, vertical, zero, etc.). Feature vectors are mapped using the homogeneous kernel map technique, which approximates the χ² kernel, and classifiers are then trained efficiently with a linear SVM. Using a simple linear fusion technique, we improve the Mean Average Precision of the Bag of Words DSIFT classifier on the TRECVID 2010 Semantic Indexing benchmark from 0.0924 to 0.0972, an increase confirmed to be statistically significant by standardized TRECVID randomization tests.


Keywords: content-based video retrieval, semantic indexing, TRECVID, spatio-temporal features, motion feature





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Claudiu Tănase (1)
  • Bernard Mérialdo (1)
  1. EURECOM, Campus SophiaTech, Biot, France
