Gaze Aware Deep Learning Model for Video Summarization

  • Jiaxin Wu
  • Sheng-hua Zhong
  • Zheng Ma
  • Stephen J. Heinen
  • Jianmin Jiang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11165)


Video summarization is an ideal tool for skimming videos. Previous computational models extract explicit information from the input video, such as visual appearance, motion, or audio, to generate informative summaries. Eye gaze, an implicit cue, has proved useful for indicating important content and the viewer’s interest. In this paper, we propose a novel gaze-aware deep learning model for video summarization. In our model, the position and velocity of the observers’ raw eye movements are processed by a deep neural network to indicate the users’ preferences. Experiments on two widely used video summarization datasets show that our model outperforms state-of-the-art methods in summarizing videos according to both general and personal preferences. The results provide an innovative and improved algorithm for using gaze information in video summarization.
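As a rough illustration of the gaze inputs mentioned in the abstract, the sketch below derives a per-sample velocity from raw gaze positions and pairs it with the position. The function names `gaze_velocity` and `gaze_features`, the data layout, and the finite-difference scheme are assumptions made for illustration only; the abstract does not state how velocity is computed or how these features are fed to the network.

```python
def gaze_velocity(positions, timestamps):
    """Estimate gaze velocity from raw gaze samples (hypothetical helper).

    positions:  list of (x, y) gaze coordinates, e.g. in pixels
    timestamps: list of sample times in seconds, strictly increasing
    Returns a list of (vx, vy) tuples, one per sample.
    """
    n = len(positions)
    if n != len(timestamps) or n < 2:
        raise ValueError("need at least two samples and matching lengths")
    velocities = []
    for i in range(n):
        # Central difference in the interior, one-sided difference at the ends.
        # This scheme is an assumed choice, not taken from the paper.
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        dt = timestamps[hi] - timestamps[lo]
        vx = (positions[hi][0] - positions[lo][0]) / dt
        vy = (positions[hi][1] - positions[lo][1]) / dt
        velocities.append((vx, vy))
    return velocities


def gaze_features(positions, timestamps):
    """Pair each gaze position with its estimated velocity: (x, y, vx, vy)."""
    velocities = gaze_velocity(positions, timestamps)
    return [(x, y, vx, vy) for (x, y), (vx, vy) in zip(positions, velocities)]
```

Each resulting `(x, y, vx, vy)` tuple could serve as one time step of gaze input; how such features are combined with the video frames is left open here.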


Keywords: Video summarization, Gaze information, Convolutional neural networks



This work was supported by the National Natural Science Foundation of China (No. 61502311, No. 61620106008), the Natural Science Foundation of Guangdong Province (No. 2016A030310053, 2016A030310039, 2017A030310521), the Science and Technology Innovation Commission of Shenzhen under Grant (No. JCYJ20160422151736824), Shenzhen Emerging Industries of the Strategic Basic Research Project under Grant (No. JCYJ20160226191842793), the Shenzhen high-level overseas talents program, the Tencent “Rhinoceros Birds” Scientific Research Foundation for Young Teachers of Shenzhen University (2016), the National Institutes of Health Grant (5T32EY025201-03), and the Smith-Kettlewell Eye Research Institute Grant.



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Jiaxin Wu (1)
  • Sheng-hua Zhong (1)
  • Zheng Ma (2), corresponding author
  • Stephen J. Heinen (2)
  • Jianmin Jiang (1)
  1. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
  2. The Smith-Kettlewell Eye Research Institute, San Francisco, USA
