Video Summarization Using Deep Semantic Features

  • Mayu Otani
  • Yuta Nakashima
  • Esa Rahtu
  • Janne Heikkilä
  • Naokazu Yokoya
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10115)

Abstract

This paper presents a video summarization technique for Internet videos that provides a quick way to overview their content. This is a challenging problem because finding the important or informative parts of the original video requires understanding its content. Furthermore, the content of Internet videos is very diverse, ranging from home videos to documentaries, which makes video summarization much harder, as almost no prior knowledge is available. To tackle this problem, we propose to use deep video features that can encode various levels of content semantics, including objects, actions, and scenes, improving the efficiency of standard video summarization techniques. For this, we design a deep neural network that maps videos as well as descriptions to a common semantic space and jointly train it with associated pairs of videos and descriptions. To generate a video summary, we extract the deep features from each segment of the original video and apply a clustering-based summarization technique to them. We evaluate our video summaries on the SumMe dataset and compare them with baseline approaches. The results demonstrate the advantages of incorporating our deep semantic features in a video summarization technique.
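
As a rough illustration of the final step described in the abstract, the sketch below picks representative segments from per-segment feature vectors. The abstract does not name the clustering algorithm, so the greedy k-medoids-style selection used here, the function name `summarize_segments`, the parameter `k`, and the two-dimensional toy features are all assumptions made for illustration; real features would come from the trained network.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def summarize_segments(features, k):
    """Greedily pick k representative segments (k-medoids-style objective).

    features: list of equal-length feature vectors, one per video segment
              (hypothetical stand-ins for the deep semantic features).
    Returns the indices of the chosen segments in temporal order.
    """
    if not 0 < k <= len(features):
        raise ValueError("k must be between 1 and the number of segments")
    chosen = []
    # nearest[i] = distance from segment i to its closest chosen segment
    nearest = [float("inf")] * len(features)
    for _ in range(k):
        best_idx, best_cost = None, float("inf")
        for cand in range(len(features)):
            if cand in chosen:
                continue
            # Total distance to the nearest representative if cand were added
            cost = sum(min(d, dist(f, features[cand]))
                       for d, f in zip(nearest, features))
            if cost < best_cost:
                best_idx, best_cost = cand, cost
        chosen.append(best_idx)
        nearest = [min(d, dist(f, features[best_idx]))
                   for d, f in zip(nearest, features)]
    return sorted(chosen)


# Toy example: six segments forming two visually distinct groups.
toy_features = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # segments 0-2: one scene
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # segments 3-5: another scene
]
summary = summarize_segments(toy_features, k=2)
```

On this toy input the sketch selects one segment from each of the two groups. A medoid-style choice is convenient because every representative is an actual segment of the video, so a summary can be assembled directly from the selected indices.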

Notes

Acknowledgement

This work is partly supported by JSPS KAKENHI No. 16K16086.

Supplementary material

Supplementary material 1: 440742_1_En_23_MOESM1_ESM.zip (zip, 7866 KB)

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Mayu Otani (1)
  • Yuta Nakashima (1)
  • Esa Rahtu (2)
  • Janne Heikkilä (2)
  • Naokazu Yokoya (1)

  1. Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Japan
  2. Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland
