Category-Specific Video Summarization

  • Danila Potapov
  • Matthijs Douze
  • Zaid Harchaoui
  • Cordelia Schmid
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8694)


In large video collections with clusters of typical categories, such as “birthday party” or “flash-mob”, category-specific video summarization can produce higher-quality summaries than unsupervised approaches that are blind to the video category.

Given a video from a known category, our approach first efficiently performs a temporal segmentation into semantically consistent segments, delimited not only by shot boundaries but also by general change points. Then, equipped with an SVM classifier, our approach assigns an importance score to each segment. The resulting summary assembles the sequence of segments with the highest scores, and is therefore both short and highly informative. Experimental results on videos from the multimedia event detection (MED) dataset of TRECVID’11 show that our approach produces video summaries with higher relevance than the state of the art.
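The final assembly step described above can be illustrated with a minimal sketch. It assumes that segment boundaries and per-segment importance scores are already available, and it uses a greedy selection under a duration budget. The `Segment` type, the `summarize` function, the `budget` parameter, and the example scores are hypothetical and chosen only for illustration; the abstract does not specify how the summary length is constrained.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # segment start time, in seconds
    end: float    # segment end time, in seconds
    score: float  # importance score (assumed here to be a classifier decision value)

    @property
    def duration(self) -> float:
        return self.end - self.start


def summarize(segments: List[Segment], budget: float) -> List[Segment]:
    """Greedily keep the highest-scoring segments that fit within
    `budget` seconds, then restore their original temporal order."""
    chosen: List[Segment] = []
    used = 0.0
    # Visit segments from most to least important.
    for seg in sorted(segments, key=lambda s: s.score, reverse=True):
        if used + seg.duration <= budget:
            chosen.append(seg)
            used += seg.duration
    # The summary plays the selected segments in chronological order.
    return sorted(chosen, key=lambda s: s.start)


# Made-up segments and scores, for illustration only.
segments = [
    Segment(0.0, 4.0, 0.2),
    Segment(4.0, 9.0, 1.5),
    Segment(9.0, 12.0, -0.3),
    Segment(12.0, 18.0, 0.9),
    Segment(18.0, 20.0, 1.1),
]
summary = summarize(segments, budget=8.0)
print([(s.start, s.end) for s in summary])  # [(4.0, 9.0), (18.0, 20.0)]
```

Greedy selection is only one possible choice; it keeps the sketch short, but it does not guarantee the best total score for a given budget.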


Keywords: video summarization, temporal segmentation, video classification



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Danila Potapov (1)
  • Matthijs Douze (1)
  • Zaid Harchaoui (1)
  • Cordelia Schmid (1)
  1. Inria, France
