Adaptive Video Highlight Detection by Learning from User History

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12366)

Abstract

Recently, there has been increasing interest in highlight detection, where the goal is to create a short video from a longer one by extracting its interesting moments. However, most existing methods ignore the fact that the definition of a video highlight is highly subjective: different users may prefer different highlights for the same input video. In this paper, we propose a simple yet effective framework that learns to adapt highlight detection to a user by exploiting the user’s history, in the form of highlights that the user has previously created. Our framework consists of two sub-networks: a fully temporal convolutional highlight detection network H that predicts highlights for an input video, and a history encoder network M for the user history. We introduce a newly designed temporal-adaptive instance normalization (T-AIN) layer into H, through which the two sub-networks interact. T-AIN has affine parameters that are predicted from M based on the user history and is responsible for the user-adaptive signal to H. Extensive experiments on a large-scale dataset show that our framework makes more accurate and user-specific highlight predictions.
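The core mechanism the abstract describes is an instance-normalization layer whose affine (scale/shift) parameters come from the history encoder rather than being learned as fixed weights. The following is a minimal sketch of that idea in NumPy, not the authors' implementation: the function name `t_ain`, the `(C, T)` feature layout, and the assumption that M outputs one scale and one shift per channel are illustrative choices.

```python
import numpy as np

def t_ain(x, gamma, beta, eps=1e-5):
    """Sketch of a temporal-adaptive instance normalization (T-AIN) layer.

    x:     (C, T) feature map inside the highlight network H
           (C channels, T temporal positions)
    gamma: (C,) per-channel scale, predicted by the history encoder M
    beta:  (C,) per-channel shift, predicted by M

    Each channel is normalized over the temporal axis, then re-scaled
    and shifted by the user-conditioned affine parameters, injecting
    the user-adaptive signal into H.
    """
    mu = x.mean(axis=1, keepdims=True)       # per-channel mean over time
    sigma = x.std(axis=1, keepdims=True)     # per-channel std over time
    x_norm = (x - mu) / (sigma + eps)        # instance-normalize along T
    return gamma[:, None] * x_norm + beta[:, None]
```

In a standard instance-normalization layer the affine parameters would be trainable constants; here they are a function of the user history, so the same network H produces different highlight scores for different users.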

Keywords

Video highlight detection · User-adaptive learning

Notes

Acknowledgements

The work was supported by NSERC. We thank NVIDIA for donating some of the GPUs used in this work.


Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. University of Manitoba, Winnipeg, Canada
  2. Huawei Technologies, Markham, Canada
