Video Summarization with Long Short-Term Memory

  • Ke Zhang
  • Wei-Lun Chao
  • Fei Sha
  • Kristen Grauman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9911)


We propose a novel supervised learning technique for summarizing videos by automatically selecting keyframes or key subshots. Casting the task as a structured prediction problem, our main idea is to use Long Short-Term Memory (LSTM) to model the variable-range temporal dependency among video frames, so as to derive both representative and compact video summaries. The proposed model successfully accounts for the sequential structure crucial to generating meaningful video summaries, leading to state-of-the-art results on two benchmark datasets. In addition to advances in modeling techniques, we introduce a strategy to address the need for a large amount of annotated data for training complex learning approaches to summarization. There, our main idea is to exploit auxiliary annotated video summarization datasets, in spite of their heterogeneity in visual styles and contents. Specifically, we show that domain adaptation techniques can improve learning by reducing the discrepancies in the original datasets’ statistical properties.
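The full model is described in the paper itself, not on this page. As a generic illustration of the core mechanism the abstract describes — running an LSTM over per-frame features so that each frame's importance score depends on its variable-range temporal context — here is a minimal numpy sketch. It is not the authors' architecture: the dimensions, random weights, and simple top-k selection rule are all illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_scores(x, params):
    """Run a single-layer LSTM over frame features x of shape (T, d_in)
    and return a per-frame importance score in [0, 1], shape (T,)."""
    W, U, b, w_out, b_out = params
    d_h = U.shape[1]
    h = np.zeros(d_h)          # hidden state
    c = np.zeros(d_h)          # memory cell (carries long-range context)
    scores = []
    for x_t in x:
        z = W @ x_t + U @ h + b            # all four gates at once, (4*d_h,)
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)          # gated update of the memory cell
        h = o * np.tanh(c)
        scores.append(sigmoid(w_out @ h + b_out))  # frame-importance head
    return np.array(scores)

def init_params(d_in, d_h, seed=0):
    """Random illustrative parameters (a real model would learn these)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.1, (4 * d_h, d_in))
    U = rng.normal(0, 0.1, (4 * d_h, d_h))
    b = np.zeros(4 * d_h)
    w_out = rng.normal(0, 0.1, d_h)
    b_out = 0.0
    return W, U, b, w_out, b_out

# Toy usage: 20 "frames" with 8-dim features; keep the 5 highest-scoring
# frames (in temporal order) as the keyframe summary.
T, d_in, d_h = 20, 8, 16
x = np.random.default_rng(1).normal(size=(T, d_in))
s = lstm_scores(x, init_params(d_in, d_h))
keyframes = np.sort(np.argsort(s)[-5:])
```

Because the cell state `c` is updated multiplicatively through the forget gate, a frame's score can depend on evidence arbitrarily far back in the sequence — the "variable-range temporal dependency" the abstract refers to.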


Keywords: Video summarization · Long short-term memory
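The abstract notes that domain adaptation can improve learning by reducing discrepancies in the datasets' statistical properties. One standard technique of that kind — not necessarily the one used in the paper — is correlation alignment, which matches the mean and covariance of source-domain features to the target domain. A minimal numpy sketch, with all data and dimensions illustrative:

```python
import numpy as np

def coral_align(source, target, eps=1e-3):
    """Whiten source features, then re-color them with the target
    covariance and mean (second-order distribution alignment).
    source: (n_s, d), target: (n_t, d); returns aligned source, (n_s, d)."""
    def cov(z):
        zc = z - z.mean(axis=0)
        return zc.T @ zc / (len(z) - 1) + eps * np.eye(z.shape[1])

    def mat_pow(m, p):
        # matrix power via eigendecomposition; m is symmetric positive-definite
        vals, vecs = np.linalg.eigh(m)
        return vecs @ np.diag(vals ** p) @ vecs.T

    cs, ct = cov(source), cov(target)
    whitened = (source - source.mean(axis=0)) @ mat_pow(cs, -0.5)
    return whitened @ mat_pow(ct, 0.5) + target.mean(axis=0)

# Toy usage: two feature sets with different second-order statistics.
rng = np.random.default_rng(0)
src = rng.normal(size=(500, 4)) @ np.diag([1.0, 2.0, 0.5, 3.0])
tgt = rng.normal(size=(600, 4)) @ rng.normal(size=(4, 4))
aligned = coral_align(src, tgt)
```

After alignment, the source features share the target's mean and (approximately, up to the `eps` regularizer) its covariance, so a model trained on the pooled data sees more homogeneous statistics.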



KG is partially supported by NSF IIS-1514118 and a gift from Intel. The other authors are partially supported by USC Graduate Fellowships, NSF IIS-1451412, IIS-1513966, CCF-1139148, and an A. P. Sloan Research Fellowship.

Supplementary material

Supplementary material 1 (PDF, 287 KB)



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Ke Zhang (1), email author
  • Wei-Lun Chao (1)
  • Fei Sha (2)
  • Kristen Grauman (3)

  1. Department of Computer Science, University of Southern California, Los Angeles, USA
  2. Department of Computer Science, University of California, Los Angeles, USA
  3. Department of Computer Science, University of Texas at Austin, Austin, USA
