Video Scene Analysis: A Machine Learning Perspective

  • Wen GaoEmail author
  • Yonghong Tian
  • Lingyu Duan
  • Jia Li
  • Yuanning Li


With the increasing proliferation of digital video contents, learning-based video scene analysis has proven to be an effective methodology for improving the access and retrieval of large video collections. This chapter is devoted to present a survey and tutorial on the research in this topic. We identify two major categories of the state-of-the-art tasks based on their application setup and learning targets: generic methods and genre-specific analysis techniques. For generic video scene analysis problems, we discuss two kinds of learning models that aim at narrowing down the semantic gap and the intention gap, two main research challenges in video content analysis and retrieval. For genre-specific analysis problems, we take sports video analysis and surveillance event detection as illustrating examples.


Visual Saliency Sport Video Video Annotation Shot Sequence Video Content Analysis 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.



The work is supported by grants from the Chinese National Natural Science Foundation under contract No. 60973055 and No. 61035001, and National Basic Research Program of China under contract No. 2009CB320906.


  1. 1.
    S. Aksoy, K. Koperski, C. Tusk, G. Marchisio, and J.C. Tilton, “Learning Bayesian classifiers for scene classification with a visual grammar,” IEEE Trans. Geoscience and Remote Sensing, vol. 43, no. 3, pp. 581-589, 2005.CrossRefGoogle Scholar
  2. 2.
    Y. Altun, I. Tsochantaridis, and T. Hofman, “Hidden Markov support vector machines,” in Proc. IEEE Int. Conf. Mechine Learning, 2003, pp. 3-10.Google Scholar
  3. 3.
    K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. I. Jordan, “Matching words and pictures,” J. Machine Learning Research, vol 3, pp. 1107-1135, 2003.zbMATHCrossRefGoogle Scholar
  4. 4.
    S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.Google Scholar
  5. 5.
    N. D. Bruce and J. K. Tsotsos. Saliency based on information maximization. In Advances in neural information processing systems, pp. 155-162, 2006.Google Scholar
  6. 6.
    M. Cerf, J. Harel, W. Einhauser, and C. Koch, Predicting human gaze using low-level saliency combined with face detection, in Advances in Neural Information Processing Systems, 2008, pp. 241-248.Google Scholar
  7. 7.
    Dai, J., Duan, L., Tong, X., Xu, C., Tian, Q., Lu, H., and Jin, J. 2005. Replay scene classification in soccer video using web broadcast text. In Proc. IEEE ICME. 1098-1101.Google Scholar
  8. 8.
    L. Duan, I.W. Tsang, D. Xu, and S.J. Maybank, “Domain transfer SVM for video concept detection,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2009, pp. 1-8.Google Scholar
  9. 9.
    S. Ebadollahi, L. Xie, S.-F., Chang, and J.R. Smith, “Visual event detection using multidimensional concept dynamics,” in Proc. IEEE Int. Conf. Multimedia and Expo, 2006, pp. 881-884.Google Scholar
  10. 10.
    C. Frith. The top in top-down attention. In Neurobiology of attention (pp. 105-108), 2005.Google Scholar
  11. 11.
    Wen Gao, Yonghong Tian, Tiejun Huang, Qiang Yang. Vlogging: A Survey of Video Blogging Technology on the Web. ACM Computing Survey, 2(4), Jun. 2010.Google Scholar
  12. 12.
    Gunawardana, A., Mahajan, M., Acero, A., and Platt, J. 2005. Hidden conditional random fields for phone classification. In Proc. Interspeech. 1117-1120.Google Scholar
  13. 13.
    C. Guo, Q. Ma, and L. Zhang, Spatio-temporal saliency detection using phase spectrum of quaternion fourier transform, in IEEE Conference on Computer Vision and Pattern Recognition, 2008.Google Scholar
  14. 14.
    J. S. Hare, P. H. Lewis, P. G. B. Enser and C. J. Sandom, “Mind the Gap: Another look at the problem of the semantic gap in image retrieval,” Multimedia Content Analysis, Management and Retrieval 2006, vol. 6073, No. 1, 2006, San Jose, CA, USA.Google Scholar
  15. 15.
    J. Harel, C. Koch, and P. Perona, Graph-based visual saliency, in Advances in Neural Information Processing Systems, 2007, pp. 545-552.Google Scholar
  16. 16.
    X. Hou and L. Zhang, Saliency detection: A spectral residual approach, in IEEE Conference on Computer Vision and Pattern Recognition, 2007.Google Scholar
  17. 17.
    H. Hsu, L. Kennedy, and S. F. Chang, “Video search reranking through random walk over document-level context graph,” in Proc. ACM Multimedia, 2007, pp. 971-980.Google Scholar
  18. 18.
    Y. Hu, D. Rajan, and L.-T. Chia, Robust subspace analysis for detecting visual attention regions in images, in ACM International Conference on Multimedia, 2005, pp. 716-724.Google Scholar
  19. 19.
    L. Itti and C. Koch, Computational modeling of visual attention, Nature Review Neuroscience, vol. 2, no. 3, pp. 194-203, 2001.CrossRefGoogle Scholar
  20. 20.
    L. Itti and P. Baldi, A principled approach to detecting surprising events in video, in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2005, pp. 631-637.Google Scholar
  21. 21.
    L. Itti, G. Rees, and J. Tsotsos. Neurobiology of attention. San Diego: Elsevier, 2005Google Scholar
  22. 22.
    L. Itti, Crcns data sharing: Eye movements during free-viewing of natural videos, in Collaborative Research in Computational Neuroscience Annual Meeting, 2008.Google Scholar
  23. 23.
    L. Itti and C. Koch. Feature combination strategies for saliency-based visual attention systems. Journal of Electronic Imaging, 10(1), 161-169, 2001.CrossRefGoogle Scholar
  24. 24.
    L. Itti, C. Koch, and E. Niebur, A model of saliency-based visual attention for rapid scene analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, 1998.CrossRefGoogle Scholar
  25. 25.
    W. Jiang, S. F. Chang, and A. Loui, “Context-based concept fusion with boosted conditional random Fields,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, 2007, pp. 949-952.Google Scholar
  26. 26.
    Shuqiang Jiang, Yonghong Tian, Qingming Huang, Tiejun Huang, Wen Gao. Content-Based Video Semantic Analysis. Book Chapter in Semantic Mining Technologies for Multimedia Databases (Edited by Tao, Xu, and Li), IGI Global, 2009.Google Scholar
  27. 27.
    Y. G. Jiang, J. Wang, S. F. Chang, C. W. Ngo, “Domain adaptive semantic diffusion for large scale context-based video annotation,” in Proc. IEEE Int. Conf. Computer Vision, 2009, pp. 1-8.Google Scholar
  28. 28.
    L. Kennedy, and S. F. Chang, “A reranking approach for context-based concept fusion in video indexing and retrieval,” in Proc. IEEE Int. Conf. on Image and Video Retrieval, 2007, pp. 333-340.Google Scholar
  29. 29.
    W. Kienzle, F. A.Wichmann, B. Scholkopf, and M. O. Franz, A nonparametric approach to bottom-up visual saliency, in Advances in Neural Information Processing Systems, 2007, pp. 689-696.Google Scholar
  30. 30.
    W. Kienzle, B. Scholkopf, F. A. Wichmann, and M. O. Franz, How to find interesting locations in video: a spatiotemporal interest point detector learned from human eye movements, in 29th DAGM Symposium, 2007, pp. 405-414.Google Scholar
  31. 31.
    M. Li, Y. T. Zheng, S. X. Lin, Y. D. Zhang, T.-S. Chua, Multimedia evidence fusion for video concept detection via OWA operator, in Proc. Advances in Multimedia Modeling, pp. 208-216, 2009.Google Scholar
  32. 32.
    H. Liu, S. Jiang, Q. Huang, C. Xu, and W. Gao, Region-based visual attention analysis with its application in image browsing on small displays, in ACM International Conference on Multimedia, 2007, pp. 305-308.Google Scholar
  33. 33.
    T. Liu, J. Sun, N.-N. Zheng, X. Tang, and H.-Y. Shum, Learning to detect a salient object, in IEEE Conference on Computer Vision and Pattern Recognition, 2007.Google Scholar
  34. 34.
    T. Liu, N. Zheng, W. Ding, and Z. Yuan, Video attention: Learning to detect a salient object sequence, in IEEE International Conference on Pattern Recognition, 2008.Google Scholar
  35. 35.
    Y. Liu, F. Wu, Y. Zhuang, J. Xiao, “Active post-refined multimodality video semantic concept detection with tensor representation,” in Proc. ACM Multimedia, 2008, pp. 91-100.Google Scholar
  36. 36.
    K. H. Liu, M. F. Weng, C. Y. Tseng, Y. Y. Chuang, and M. S. Chen, “Association and temporal rule mining for post-processing of semantic concept detection in video,” IEEE Trans. Multimedia, 2008, pp. 240-251.Google Scholar
  37. 37.
    Y.-F. Ma, X.-S. Hua, L. Lu, and H.-J. Zhang, A generic framework of user attention model and its application in video summarization, IEEE Transactions on Multimedia, vol. 7, no. 5, pp. 907-919, 2005.CrossRefGoogle Scholar
  38. 38.
    S. Marat, T. H. Phuoc, L. Granjon, N. Guyader, D. Pellerin, and A. Guerin-Dugue, Modelling spatio-temporal saliency to predict gaze direction for short videos, International Journal of Computer Vision, vol. 82, no. 3, pp. 231-243, 2009.CrossRefGoogle Scholar
  39. 39.
    G. Miao, G. Zhu, S. Jiang, Q. Huang, C. Xu, and W. Gao, A Real-Time Score Detection and Recognition Approach for Broadcast Basketball Video. In Proc. IEEE Int. Conf. Multimedia and Expo, 2007, pp. 1691-1694.Google Scholar
  40. 40.
    F. Monay and D. Gatica-Perez, “Modeling semantic aspects for cross-media image indexing,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 10, pp. 1802-1917, Oct. 2007.CrossRefGoogle Scholar
  41. 41.
    M. R. Naphade, I. Kozintsev, and T. Huang, “Factor graph framework for semantic video indexing,” IEEE Trans. Circuits and Systems for Video Technology, 2002, pp. 40-52.Google Scholar
  42. 42.
    M. R. Naphade, “On supervision and statistical learning for semantic multimedia analysis,” Journal of Visual Communication and Image Representation, vol. 15, no. 3, pp. 348-369, Sep. 2004.CrossRefGoogle Scholar
  43. 43.
    A. Natsev, A. Haubold, J. Tesic, L. Xie, R. Yan, “Semantic concept-based query expansion and re-ranking for multimedia retrieval,” in Proc. ACM Multimedia, 2007, pp. 991-1000.Google Scholar
  44. 44.
    V. Navalpakkam and L. Itti, Search goal tunes visual features optimally, Neuron, vol. 53, pp. 605-617, 2007.CrossRefGoogle Scholar
  45. 45.
    T. N. Pappas, J.Q. Chen, D. Depalov, “Perceptually based techniques for image segmentation and semantic classification,” IEEE Communications Magazine, vol. 45, no. 1, pp. 44-51, Jan. 2007.CrossRefGoogle Scholar
  46. 46.
    R. J. Peters and L. Itti, Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention, in IEEE CVPR, 2007.Google Scholar
  47. 47.
    R. J. Peters and L. Itti. Congruence between model and human attention reveals unique signatures of critical visual events. In Advances in neural information processing systems (pp. 1145-1152), 2007.Google Scholar
  48. 48.
    G. J. Qi, X. S. Hua, Y. Rui, J. Tang, T. Mei, and H. J. Zhang, “Correlative multi-label video annotation,” in Proc. ACM Multimedia, 2007, pp. 17-26.Google Scholar
  49. 49.
    Quattoni, A.,Wang, S., Morency, L., Collins, M., Darrell, T., and Csail, M. 2007. Hidden state conditional random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence 29, 10, 1848-1852.CrossRefGoogle Scholar
  50. 50.
    A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, and R. Jain, “Content-based image retrieval at the end of the early years,” IEEE Trans. Pattern Anal. Mach Intell., vol. 22, no.12, pp. 1349-1380, Dec. 2000.CrossRefGoogle Scholar
  51. 51.
    J. R. Smith, M. Naphade, and A. Natsev, “Multimedia semantic indexing using model vectors,” in Proc. IEEE Int. Conf. Multimedia and Expo, 2003, pp. 445-448.Google Scholar
  52. 52.
    C. G. M. Snoek, M. Worring, J.C. Gemert, J.-M. Geusebroek, and A.W.M. Smeulers, “The challenge problem for automated detection of 101 semantic concepts in multimedia,” in Proc. ACM Multimedia, 2006, pp. 421-430.Google Scholar
  53. 53.
    E. Spyrou and Y. Avrithis, “Detection of High-Level Concepts in Multimedia,” Encyclopedia of Multimedia, 2nd Edition, Springer 2008.Google Scholar
  54. 54.
    A. M. Treisman and G. Gelade, A feature-integration theory of attention, Cognitive Psychology, vol. 12, no. 1, pp. 97-136, 1980.CrossRefGoogle Scholar
  55. 55.
    I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, “Support vector machine learning for interdependent and structured output spaces,” in Proc. IEEE Int. Conf. Machine Learning, 2004, pp. 823-830.Google Scholar
  56. 56.
    D. Walther and C. Koch, Modeling attention to salient proto-objects, Neural Networks, vol. 19, no. 9, pp. 1395-1407, 2006.zbMATHCrossRefGoogle Scholar
  57. 57.
    T. Wang, J. Li, Q. Diao, W. Hu, Y. Zhang, and C. Dulong, “Semantic event detection using conditional random fields,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition Workshop, 2006.Google Scholar
  58. 58.
    M. Weng, Y. Chuang, “Multi-cue fusion for semantic video indexing,” in Proc. ACM Multimedia, 2008, pp. 71-80.Google Scholar
  59. 59.
    L. Xie and S. F. Chang, “Structural analysis of soccer video with hidden markov models,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, 2002, pp. 767-775.Google Scholar
  60. 60.
    Xiong, Z. Y., Zhou, X. S., Tian, Q., Rui, Y., and Huang, T. S. Semantic retrieval of video: Review of research on video retrieval in meetings, movies and broadcast news, and sports. IEEE Signal Processing Magazine 18, 3, 18-27, 2006.CrossRefGoogle Scholar
  61. 61.
    Xu, C., Wang, J., Wan, K., Li, Y., and Duan, L. 2006. Live sports event detection based on broadcast video and web-casting text. In Proc. ACM MM. 230.Google Scholar
  62. 62.
    Xu, C., Zhang, Y., Zhu, G., Rui, Y., Lu, H., and Huang, Q. 2008. Using webcast text for semantic event detection in broadcast sports video. IEEE Transactions on Multimedia 10, 7, 1342-1355.CrossRefGoogle Scholar
  63. 63.
    R. Yan, M. Y. Chen, and A. Hauptmann, “Mining relationship between video concepts using probabilistic graphical models,” in Proc. IEEE Int. Conf. Multimedia and Expo, 2006, pp. 301-304.Google Scholar
  64. 64.
    J. Yang and A. G. Hauptmann, “Exploring temporal consistency for video analysis and retrieval,” in Proc. 8th ACM SIGMM Int. Workshop on Multimedia Information Retrieval, 2006, pp. 33-46.Google Scholar
  65. 65.
    J. Yang, R. Yan, A. Hauptmann, “Cross-domain video concept detection using adaptive svms,” in Proc. ACM Multimedia, 2007, pp. 188-297.Google Scholar
  66. 66.
    Yang Yang, Jingen Liu, Mubarak Shah, Video Scene Understanding Using Multi-scale Analysis, Proc. 12th Int’l Conf. Computer Vision, 1669-1676, 2009.Google Scholar
  67. 67.
    Zheng-Jun Zha, Linjun Yang, Tao Mei, Meng Wang, Zengfu Wang, Tat-Seng Chua, Xian-Sheng Hua. Visual query suggestion: Towards capturing user intent in internet image search. ACM Transactions on Multimedia Computing, Communications, and Applications, 6(3), Article 13, August 2010.Google Scholar
  68. 68.
    Y. Zhai and M. Shah, Visual attention detection in video sequences using spatiotemporal cues, in ACM International Conference on Multimedia, 2006, pp. 815-824.Google Scholar
  69. 69.
    H. Zhang, A. C. Berg. M. Maire, and J. Malik, ”Svm-knn: Discriminative nearest neighbor classification for visual category recognition,” Proc. IEEE Conf. CVPR, pp. 2126-2136, 2006.Google Scholar

Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  • Wen Gao
    • 1
    Email author
  • Yonghong Tian
  • Lingyu Duan
  • Jia Li
  • Yuanning Li
  1. 1.School of EE & CSPeking UniversityBeijingChina

Personalised recommendations