Video Summarization Using Deep Semantic Features

Otani, Mayu; Nakashima, Yuta; Rahtu, Esa; Heikkilä, Janne; Yokoya, Naokazu

doi:10.1007/978-3-319-54193-8_23

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10115)

Included in the following conference series:

Asian Conference on Computer Vision

Abstract

This paper presents a video summarization technique for Internet videos that provides a quick way to overview their content. This is a challenging problem because finding the important or informative parts of the original video requires understanding its content. Furthermore, the content of Internet videos is very diverse, ranging from home videos to documentaries, which makes video summarization much harder because little prior knowledge is available. To tackle this problem, we propose to use deep video features that can encode various levels of content semantics, including objects, actions, and scenes, improving the efficiency of standard video summarization techniques. For this, we design a deep neural network that maps videos as well as descriptions to a common semantic space and jointly train it with associated pairs of videos and descriptions. To generate a video summary, we extract the deep features from each segment of the original video and apply a clustering-based summarization technique to them. We evaluate our video summaries on the SumMe dataset and compare them with baseline approaches. The results demonstrate the advantages of incorporating our deep semantic features in a video summarization technique.
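
The clustering-based step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, so plain k-means with a deterministic farthest-point initialisation is assumed here, and the 2-D vectors merely stand in for the deep semantic features of video segments. All function names (`squared_distance`, `farthest_point_init`, `kmeans`, `summarize`) are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def farthest_point_init(features, k):
    """Deterministic init (an assumption): start at the first feature, then
    repeatedly add the feature farthest from all centroids chosen so far."""
    centroids = [list(features[0])]
    while len(centroids) < k:
        farthest = max(
            features,
            key=lambda f: min(squared_distance(f, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(features, k, iterations=50):
    """Plain Lloyd's k-means; returns the final list of centroids."""
    if not 1 <= k <= len(features):
        raise ValueError("k must be between 1 and the number of segments")
    centroids = farthest_point_init(features, k)
    assignment = None
    for _ in range(iterations):
        # Assign each segment feature to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(f, centroids[c]))
            for f in features
        ]
        # Recompute each centroid; keep the old one if its cluster is empty.
        for c in range(k):
            members = [f for f, a in zip(features, new_assignment) if a == c]
            if members:
                centroids[c] = mean_vector(members)
        if new_assignment == assignment:
            break  # assignments are stable, so the centroids have converged
        assignment = new_assignment
    return centroids


def summarize(features, k):
    """Return sorted indices of the segments nearest to each cluster centre."""
    chosen = set()
    for centroid in kmeans(features, k):
        nearest = min(
            range(len(features)),
            key=lambda i: squared_distance(features[i], centroid),
        )
        chosen.add(nearest)
    return sorted(chosen)


if __name__ == "__main__":
    # Toy stand-ins for per-segment deep features: three groups of three.
    toy_features = [
        [0.0, 0.0], [1.0, 0.0], [2.0, 0.0],
        [10.0, 10.0], [11.0, 10.0], [12.0, 10.0],
        [20.0, 0.0], [21.0, 0.0], [22.0, 0.0],
    ]
    print(summarize(toy_features, 3))  # prints [1, 4, 7]
```

In this sketch the summary is the set of segments closest to the cluster centres, so each group of semantically similar segments contributes one representative. In practice the number of clusters would have to be chosen to fit the desired summary length, which the abstract does not specify.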

Acknowledgement

This work is partly supported by JSPS KAKENHI No. 16K16086.

Author information

Authors and Affiliations

Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Japan
Mayu Otani, Yuta Nakashima & Naokazu Yokoya
Center for Machine Vision and Signal Analysis, University of Oulu, Oulu, Finland
Esa Rahtu & Janne Heikkilä

Corresponding author

Correspondence to Mayu Otani.

Editor information

Editors and Affiliations

National Tsing Hua University, Hsinchu, Taiwan
Shang-Hong Lai
Graz University of Technology, Graz, Austria
Vincent Lepetit
Drexel University, Philadelphia, Pennsylvania, USA
Ko Nishino
The University of Tokyo, Tokyo, Japan
Yoichi Sato

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 7866 KB)

About this paper

Cite this paper

Otani, M., Nakashima, Y., Rahtu, E., Heikkilä, J., Yokoya, N. (2017). Video Summarization Using Deep Semantic Features. In: Lai, S.-H., Lepetit, V., Nishino, K., Sato, Y. (eds) Computer Vision – ACCV 2016. ACCV 2016. Lecture Notes in Computer Science, vol 10115. Springer, Cham. https://doi.org/10.1007/978-3-319-54193-8_23

DOI: https://doi.org/10.1007/978-3-319-54193-8_23
Published: 11 March 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-54192-1
Online ISBN: 978-3-319-54193-8
eBook Packages: Computer Science, Computer Science (R0)