
MINI-Net: Multiple Instance Ranking Network for Video Highlight Detection

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12358)

Abstract

We address weakly supervised video highlight detection: learning to detect the most attractive segments in training videos from their video-level event labels alone, without the expensive supervision of manually annotated highlight segments. Although it avoids manual localization of highlight segments, weakly supervised modeling is challenging, because an everyday video may contain highlight segments of multiple event types, e.g., skiing and surfing. In this work, we propose to cast weakly supervised video highlight detection for a given specific event as learning a multiple instance ranking network (MINI-Net). We treat each video as a bag of segments, and MINI-Net learns to assign a higher highlight score to a positive bag that contains highlight segments of the specific event than to irrelevant negative bags. In particular, we formulate a max-max ranking loss that yields a reliable relative comparison between the most likely positive segment instance and the hardest negative segment instance. With this max-max ranking loss, MINI-Net effectively leverages all segment information to learn a more discriminative video feature representation for localizing the highlight segments of a specific event in a video. Extensive experimental results on three challenging public benchmarks validate the efficacy of our multiple instance ranking approach.
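
For illustration, the max-max ranking objective described in the abstract can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' implementation: the function name, tensor shapes, and the margin value of 1.0 are assumptions, and in the paper the per-segment highlight scores come from the MINI-Net scoring branch rather than random inputs.

import torch
import torch.nn.functional as F

def max_max_ranking_loss(pos_scores: torch.Tensor,
                         neg_scores: torch.Tensor,
                         margin: float = 1.0) -> torch.Tensor:
    # pos_scores: (P,) highlight scores for the segments of a positive bag
    #             (a video labelled with the target event).
    # neg_scores: (N,) highlight scores for the segments of a negative bag
    #             (a video irrelevant to the target event).
    s_pos = pos_scores.max()  # most likely positive (highlight) segment
    s_neg = neg_scores.max()  # hardest negative segment
    # Encourage the best positive segment to outscore the hardest
    # negative segment by at least `margin`.
    return F.relu(margin - s_pos + s_neg)

# Toy usage with random segment scores (hypothetical bag sizes).
pos = torch.rand(8)   # 8 segments in the positive bag
neg = torch.rand(10)  # 10 segments in the negative bag
loss = max_max_ranking_loss(pos, neg)
print(loss.item())

Comparing only the bag-level maxima is what makes the supervision weak: no per-segment highlight labels are needed, only the video-level event label that determines which bags are positive and which are negative.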

Notes

Acknowledgements

This work was supported partially by the National Key Research and Development Program of China (2018YFB1004903), NSFC (U1911401, U1811461), Guangdong Province Science and Technology Innovation Leading Talents (2016TX03X157), Guangdong NSF Project (No. 2018B030312002), Guangzhou Research Project (201902010037), and Research Projects of Zhejiang Lab (No. 2019KD0AB03).

Supplementary material

Supplementary material 1: 504454_1_En_21_MOESM1_ESM.pdf (5.5 MB)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
  2. Peng Cheng Laboratory, Shenzhen, China
  3. VICO Group, University of Edinburgh, Edinburgh, UK
  4. Pazhou Lab, Guangzhou, China
  5. Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education, Guangzhou, China
