Resource Constrained Multimedia Event Detection

  • Zhen-Zhong Lan
  • Yi Yang
  • Nicolas Ballas
  • Shoou-I Yu
  • Alexander Hauptmann
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8325)


We present a study comparing the cost and efficiency trade-offs of multiple features for multimedia event detection. Low-level and semantic features are a critical part of contemporary multimedia and computer vision research. Arguably, combinations of multiple feature sets have been a major reason for recent progress in the field, not just as low-dimensional representations of multimedia data but also as a means to semantically summarize images and videos. However, their efficacy for complex event recognition in unconstrained videos on standardized datasets has not been systematically studied. In this paper, we evaluate the accuracy and contribution of more than 10 multi-modality features, including semantic and low-level video representations, on two newly released NIST TRECVID Multimedia Event Detection (MED) open-source datasets, MEDTEST and KINDREDTEST, which contain more than 1000 hours of video. Contrasting multiple performance metrics, such as average precision, probability of missed detection, and minimum normalized detection cost, we propose a framework to balance the trade-off between accuracy and computational cost. This study provides an empirical foundation for selecting feature sets that can handle large-scale data with limited computational resources and are likely to produce superior multimedia event detection accuracy. The framework also applies to other resource-limited multimedia analyses, such as selecting or fusing multiple classifiers and different representations of each feature set.
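To make the accuracy-versus-cost trade-off concrete, the sketch below computes average precision for a ranked list and then picks features under a compute budget. It is only an illustration and not the framework proposed in the paper: the greedy accuracy-per-cost rule, the function names `average_precision` and `select_under_budget`, and the idea of scoring each feature independently are all assumptions made here, and a real system would measure the accuracy of the fused feature set rather than of each feature in isolation.

```python
from typing import Dict, List, Sequence, Tuple


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean of the precision values measured at each positive in the ranking."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def select_under_budget(
    accuracy: Dict[str, float],
    cost: Dict[str, float],
    budget: float,
) -> Tuple[List[str], float]:
    """Hypothetical greedy rule: take features in decreasing
    accuracy-per-cost order as long as they still fit in the budget."""
    order = sorted(accuracy, key=lambda name: accuracy[name] / cost[name], reverse=True)
    chosen: List[str] = []
    spent = 0.0
    for name in order:
        if spent + cost[name] <= budget:
            chosen.append(name)
            spent += cost[name]
    return chosen, spent


if __name__ == "__main__":
    # Toy inputs invented for illustration; they are not results from the paper.
    toy_accuracy = {"feature_a": 0.30, "feature_b": 0.20, "feature_c": 0.25}
    toy_cost = {"feature_a": 10.0, "feature_b": 1.0, "feature_c": 5.0}
    print(select_under_budget(toy_accuracy, toy_cost, budget=6.0))
```

A ratio-based greedy rule is cheap to evaluate but ignores complementarity between features, which is why it should be read as a baseline heuristic rather than a recommended selection method.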


Keywords: Multimedia Event Detection, Limited Resource, Feature Selection





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Zhen-Zhong Lan (1)
  • Yi Yang (1)
  • Nicolas Ballas (1)
  • Shoou-I Yu (1)
  • Alexander Hauptmann (1)

  1. School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
