Multimedia Tools and Applications, Volume 75, Issue 15, pp 9045–9072

A modified vector of locally aggregated descriptors approach for fast video classification

  • Ionuţ Mironică
  • Ionuţ Cosmin Duţă
  • Bogdan Ionescu
  • Nicu Sebe


Most video classification approaches represent video data at frame level in order to reduce computational complexity. In this paper we investigate a novel perspective that combines frame features into a global video descriptor. The main contributions are: (i) a fast algorithm to densely extract global frame features, which are easier and faster to compute than spatio-temporal local features; (ii) replacing the traditional k-means visual vocabulary of Bag-of-Words with a Random Forest approach, allowing a significant speedup; (iii) the use of a modified Vector of Locally Aggregated Descriptors (VLAD) combined with a Fisher kernel approach that replaces the classic Bag-of-Words representation, allowing us to achieve high accuracy. By doing so, the proposed approach combines frame-based features while effectively capturing the variation of video content in time. We show that our framework is highly general and does not depend on a particular type of descriptor. Experiments performed on four different scenarios (movie genre classification, human action recognition, daily activity recognition and violent scene classification) show the superiority of the proposed approach compared to the state of the art.
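To make the aggregation idea concrete, below is a minimal sketch of a standard VLAD encoding of per-frame descriptors: each descriptor is assigned to its nearest vocabulary centroid, residuals are accumulated per centroid, and the result is power- and L2-normalized. This illustrates only the baseline encoding the paper modifies; the Random Forest vocabulary and Fisher kernel extensions described in the abstract are not reproduced here, and all names (`vlad_encode`, plain k-means-style centroids) are illustrative assumptions.

```python
import numpy as np

def vlad_encode(frame_descriptors, centroids):
    """Encode a set of per-frame descriptors into one VLAD vector.

    frame_descriptors: (n, d) array, one descriptor per video frame.
    centroids:         (k, d) visual vocabulary (e.g. from k-means).
    Returns a power- and L2-normalized vector of length k * d.
    """
    k, d = centroids.shape
    # hard-assign each descriptor to its nearest centroid
    dists = ((frame_descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    # accumulate residuals (descriptor - centroid) per cluster
    vlad = np.zeros((k, d))
    for i, c in enumerate(assign):
        vlad[c] += frame_descriptors[i] - centroids[c]
    vlad = vlad.ravel()
    # signed square-root (power) normalization, then L2 normalization
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad
```

The resulting fixed-length vector summarizes an entire video regardless of its number of frames, which is what allows a standard classifier to be trained on top of it.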


Capturing content variation in time in video · Modified vector of locally aggregated descriptors · Random forests · Video classification



The work has been funded by the Sectoral Operational Programme Human Resources Development 2007-2013 of the Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/132395.


  1. Almeida J, Pedronette DC, Penatti OA (2014) Unsupervised manifold learning for video genre retrieval. In: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer International Publishing, pp 604–612
  2. Bilinski P, Corvee E, Bak S, Bremond F (2013) Relative dense tracklets for human action recognition. In: IEEE International Conference on Automatic Face and Gesture Recognition (FG)
  3. Bouckaert RR, Frank E, Hall M, Kirkby R, Reutemann P, Seewald A, Scuse D (2013) WEKA manual for version 3-7-8
  4. Brezeale D, Cook DJ (2008) Automatic video classification: a survey of the literature. IEEE Trans Syst Man Cybern Part C Appl Rev 38(3):416–430
  5. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
  6. Bosch A, Zisserman A, Munoz X (2007) Image classification using random forests and ferns. In: IEEE International Conference on Computer Vision (ICCV)
  7. Csurka G, Dance CR, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: European Conference on Computer Vision (ECCV), pp 1–2
  8. Ciresan DC, Meier U, Masci J, Maria Gambardella L, Schmidhuber J (2011) Flexible, high performance convolutional neural networks for image classification. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) 22(1):1238–1242
  9. Chakraborty B, Holte MB, Moeslund TB, Gonzàlez J (2012) Selective spatio-temporal interest points. Comput Vis Image Underst 116(3):396–410
  10. Demarty C-H, Penet C, Soleymani M, Gravier G (2013) VSD, a public dataset for the detection of violent scenes in movies: design, annotation, analysis and evaluation. Multimed Tools Appl
  11. Demarty C-H, Penet C, Schedl M, Ionescu B, Quang VL, Jiang Y-G (2013) The MediaEval 2013 Affect Task: violent scenes detection. In: Working Notes Proceedings [33]
  12. Demarty C-H, Ionescu B, Jiang Y-G, Quang VL, Schedl M, Penet C (2014) Benchmarking violent scenes detection in movies. In: IEEE International Workshop on Content-Based Multimedia Indexing (CBMI)
  13. Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2012) The PASCAL Visual Object Classes Challenge 2012 (VOC2012) results
  14. García Seco de Herrera A, Kalpathy-Cramer J, Demner-Fushman D, Antani S, Müller H (2013) Overview of the ImageCLEF 2013 medical tasks. In: Working Notes of CLEF 2013, Cross Language Evaluation Forum, Valencia, Spain
  15. Goto S, Aoki T (2013) TUDCL at MediaEval 2013 Violent Scenes Detection: training with multimodal features by MKL. In: Working Notes Proceedings [33]
  16. Gold K, Petrosino A (2010) Using information gain to build meaningful decision forests for multilabel classification. In: IEEE International Conference on Development and Learning (ICDL), pp 58–63
  17. Ionescu B, Mironică I, Seyerlehner K, Knees P, Schlüter J, Schedl M, Cucu H, Buzo A, Lambert P (2012) ARF @ MediaEval 2012: multimodal video classification. In: MediaEval Workshop
  18. Csiszár I, Körner J (2011) Information theory: coding theorems for discrete memoryless systems. Cambridge University Press
  19. Ikizler-Cinbis N, Sclaroff S (2011) Object, scene and actions: combining multiple features for human action recognition. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 494–507
  20. Jiang Y-G, Liu J, Roshan Zamir A, Laptev I, Piccardi M, Shah M, Sukthankar R (2013) THUMOS challenge: action recognition with a large number of classes. In: ICCV Workshop on Action Recognition with a Large Number of Classes
  21. Jain M, Jégou H, Bouthemy P (2013) Better exploiting motion for better action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  22. Jégou H, Douze M, Schmid C, Pérez P (2010) Aggregating local descriptors into a compact image representation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  23. Jégou H, Perronnin F, Douze M, Sánchez J, Pérez P, Schmid C (2012) Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)
  24. Karaman S, Seidenari L, Bagdanov AD, Del Bimbo A (2013) L1-regularized logistic regression stacking and transductive CRF smoothing for action recognition in video. In: ICCV Workshop on Action Recognition with a Large Number of Classes
  25. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  26. Soomro K, Zamir AR, Shah M (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. CoRR. arXiv:1212.0402
  27. Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS), pp 1097–1105
  28. Ludwig O, Delgado D, Goncalves V, Nunes U (2009) Trainable classifier-fusion schemes: an application to pedestrian detection. IEEE Int Conf Intell Transp Syst 1:432–437
  29. Lucas B, Kanade T (1981) An iterative image registration technique with an application to stereo vision. In: Proceedings of the Imaging Understanding Workshop
  30. Laptev I, Marszalek M, Schmid C, Rozenfeld B (2008) Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1–8
  31. Liu J, Luo J, Shah M (2009) Recognizing realistic actions from videos in the wild. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 1996–2003
  32. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  33. Larson M, Anguera X, Reuter T, Jones GJF, Ionescu B, Schedl M, Piatrik T, Hauff C, Soleymani M (eds) (2013) Working Notes Proceedings of the MediaEval 2013 Workshop, co-located with ACM Multimedia, Barcelona, Spain, October 18-19, ISSN 1613-0073, Vol. 1043
  34. Mironica I, Uijlings J, Rostamzadeh N, Ionescu B, Sebe N (2013) Time matters!: capturing variation in time in video using Fisher kernels. In: Proceedings of the 21st ACM International Conference on Multimedia, pp 701–704
  35. Marin J, Vázquez D, López AM, Amores J, Leibe B (2013) Random forests of local experts for pedestrian detection. In: IEEE International Conference on Computer Vision (ICCV), pp 2592–2599
  36. Messing R, Pal C, Kautz H (2009) Activity recognition using the velocity histories of tracked keypoints. In: IEEE International Conference on Computer Vision (ICCV)
  37. Mathieu B, Essid S, Fillon T, Prado J, Richard G (2010) YAAFE, an easy to use and efficient audio feature extraction software. In: Proceedings of the 11th ISMIR Conference, pp 441–446
  38. Murthy OV, Goecke R (2013) Ordered trajectories for large scale human action recognition. In: IEEE International Conference on Computer Vision (ICCV)
  39. Ma Z, Yang Y, Sebe N, Hauptmann A (2014) Knowledge adaptation with partially shared features for event detection using few exemplars. IEEE Trans Pattern Anal Mach Intell 36(9):1789–1802
  40. Nakayama H (2012) Aggregating descriptors with local Gaussian metrics. In: Proceedings of the NIPS 2012 Workshop on Large Scale Visual Recognition and Retrieval
  41. Nowozin S (2012) Improved information gain estimates for decision tree induction. arXiv preprint arXiv:1206.4620
  42. Over P, Awad G, Michel M, Fiscus J, Sanders G, Kraaij W, Smeaton AF, Quénot G (2013) TRECVID 2013 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In: Proceedings of TRECVID 2013, NIST, USA
  43. Peng X, Wang L, Wang X, Qiao Y (2014) Bag of visual words and fusion methods for action recognition: comprehensive study and good practice. CoRR, abs/1405.4506
  44. Perronnin F, Sánchez J, Mensink T (2010) Improving the Fisher kernel for large-scale image classification. In: European Conference on Computer Vision (ECCV), pp 143–156
  45. Penet C, Demarty C-H, Gravier G, Gros P (2013) Technicolor/INRIA team at the MediaEval 2013 Violent Scenes Detection Task. In: Working Notes Proceedings [33]
  46. Picard D, Gosselin P-H (2011) Improving image similarity with vectors of locally aggregated tensors. In: 18th IEEE International Conference on Image Processing (ICIP)
  47. Le QV (2013) Building high-level features using large scale unsupervised learning. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  48. Rohrbach M, Amin S, Andriluka M, Schiele B (2012) A database for fine grained activity detection of cooking activities. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  49. Rostamzadeh N, Zen G, Mironică I, Uijlings J, Sebe N (2013) Daily living activities recognition via efficient high and low level cues combination and Fisher kernel representation. In: International Conference on Image Analysis and Processing (ICIAP)
  50. Raptis M, Soatto S (2011) Tracklet descriptors for action modeling and video analysis. In: European Conference on Computer Vision (ECCV), pp 577–590
  51. Simonyan K, Vedaldi A, Zisserman A (2013) Deep Fisher networks for large-scale image classification. In: Advances in Neural Information Processing Systems (NIPS)
  52. Schmiedeke S, Xu P, Ferrané I, Eskevich M, Kofler C, Larson M, Estève Y, Lamel L, Jones G, Sikora T (2013) Blip10000: a social video dataset containing SPUG content for tagging and retrieval. In: ACM Multimedia Systems Conference, Oslo, Norway
  53. Schmiedeke S, Kofler C, Ferrané I (2012) Overview of the MediaEval 2012 Tagging Task. In: Working Notes Proceedings of the MediaEval 2012 Workshop, Pisa, Italy, October 4-5, 2012, ISSN 1613-0073
  54. Semela T, Tapaswi M, Ekenel H, Stiefelhagen R (2012) KIT at MediaEval 2012 – content-based genre classification with visual cues. In: MediaEval Workshop
  55. Schmiedeke S, Kelm P, Sikora T (2012) TUB @ MediaEval 2012 Tagging Task: feature selection methods for bag-of-(visual)-words approaches. In: MediaEval Workshop
  56. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Conference on Computer Vision and Pattern Recognition
  57. Solmaz B, Assari SM, Shah M (2013) Classifying web videos using a global video descriptor. Mach Vis Appl 24(7):1473–1485
  58. Sjöberg M, Schlüter J, Ionescu B, Schedl M (2013) FAR at MediaEval 2013 Violent Scenes Detection: concept-based violent scenes detection in movies. In: MediaEval 2013 Workshop, Barcelona
  59. Uijlings JRR, Smeulders AWM, Scha RJH (2010) Real-time visual concept classification. IEEE Trans Multimed 12(7):665–681
  60. Uijlings JRR, Duta IC, Sangineto E, Sebe N (2014) Video classification with densely extracted HOG/HOF/MBH features: an evaluation of the accuracy/computational efficiency trade-off. Int J Multimed Inf Retr, pp 1–12
  61. Van de Weijer J, Schmid C, Verbeek J, Larlus D (2009) Learning color names for real-world applications. IEEE Trans Image Process 18(7):1512–1523
  62. Wang H, Kläser A, Schmid C, Liu C (2013) Dense trajectories and motion boundary descriptors for action recognition. Int J Comput Vis 103(1):60–79
  63. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: IEEE International Conference on Computer Vision (ICCV), pp 3551–3558
  64. Wang J, Chen Z, Wu Y (2011) Action recognition with multiscale spatio-temporal contexts. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  65. Wang H, Schmid C (2013) LEAR-INRIA submission for the THUMOS workshop. In: ICCV Workshop on Action Recognition with a Large Number of Classes
  66. Yang Y, Ramanan D (2013) Articulated human detection with flexible mixtures of parts. IEEE Trans Pattern Anal Mach Intell 35(12):2878–2890

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Ionuţ Mironică (1)
  • Ionuţ Cosmin Duţă (2)
  • Bogdan Ionescu (1)
  • Nicu Sebe (2)

  1. LAPI, University Politehnica of Bucharest, Bucharest, Romania
  2. DISI, University of Trento, Trento, Italy
