Video Highlight Detection via Deep Ranking Modeling

  • Yifan Jiao
  • Xiaoshan Yang
  • Tianzhu Zhang
  • Shucheng Huang
  • Changsheng Xu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10749)


Video highlight detection aims to localize the key elements of a video, i.e., the moments of major or special interest to the user. Most existing highlight detection approaches extract features from a video segment as a whole, without distinguishing local features either temporally or spatially. Because video content is complex, such mixed features degrade the final highlight prediction. Temporally, not all frames are worth watching: some contain only environmental background without people or other moving objects. Spatially, the situation is similar: not all regions within a frame belong to the highlight, especially when the background is cluttered. To address this problem, we propose a novel attention model that automatically localizes the key elements of a video without any extra supervised annotations. Specifically, the proposed model produces attention weights for local regions along both the spatial and temporal dimensions of a video segment; regions containing key elements are strengthened with large weights, yielding a more effective segment feature for predicting the highlight score. The proposed attention scheme can be easily integrated into a conventional end-to-end deep ranking model, which learns a deep neural network to compute the highlight score of each video segment. Extensive experiments on the YouTube dataset demonstrate that the proposed approach achieves significant improvement over state-of-the-art methods.
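The two components described above — attention weighting over spatio-temporal local regions, and a pairwise ranking objective on segment scores — can be illustrated with a minimal numpy sketch. The function names, the linear scoring head `w`, `b`, and the margin value are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a flat vector of attention logits.
    e = np.exp(x - x.max())
    return e / e.sum()

def attended_feature(feats, w, b):
    """Spatio-temporal attention pooling (illustrative sketch).

    feats: (T, R, D) array of local features, T frames x R regions per frame.
    w, b : parameters of a hypothetical linear attention-scoring head.
    Returns a single D-dim segment feature, weighted by attention.
    """
    T, R, D = feats.shape
    flat = feats.reshape(T * R, D)   # treat every (frame, region) cell alike
    scores = flat @ w + b            # one attention logit per local region
    alpha = softmax(scores)          # weights sum to 1 over space and time
    return alpha @ flat              # weighted average -> (D,) segment feature

def pairwise_ranking_loss(s_high, s_low, margin=1.0):
    # Hinge loss: a highlight segment's score should exceed a
    # non-highlight segment's score by at least `margin`.
    return max(0.0, margin - (s_high - s_low))
```

In a full model, the segment feature would be produced by a deep network and both the attention head and the scoring network would be trained end-to-end from ranked segment pairs; the sketch only shows how the attention weights reshape the pooled feature and how the pairwise objective compares two segments.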


Keywords: Video highlight detection · Attention model · Deep ranking



This work is supported in part by the National Natural Science Foundation of China under Grants 61432019, 61572498, 61532009, and 61772244; the Key Research Program of Frontier Sciences, CAS, under Grant QYZDJ-SSW-JSC039; the Beijing Natural Science Foundation under Grant 4172062; and the Postgraduate Research & Practice Innovation Program of Jiangsu Province under Grant SJCX17_0599.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Yifan Jiao (1)
  • Xiaoshan Yang (2)
  • Tianzhu Zhang (2)
  • Shucheng Huang (1)
  • Changsheng Xu (2)
  1. School of Computer, Jiangsu University of Science and Technology, Zhenjiang, China
  2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
