
Video Scene Analysis: A Machine Learning Perspective

Abstract

With the increasing proliferation of digital video content, learning-based video scene analysis has proven to be an effective methodology for improving access to and retrieval from large video collections. This chapter presents a survey of and tutorial on research in this area. We identify two major categories of state-of-the-art tasks based on their application setup and learning targets: generic methods and genre-specific analysis techniques. For generic video scene analysis problems, we discuss two kinds of learning models that aim to narrow the semantic gap and the intention gap, the two main research challenges in video content analysis and retrieval. For genre-specific analysis problems, we take sports video analysis and surveillance event detection as illustrative examples.
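As a concrete illustration of the generic setting described above, one widely used learning model for narrowing the semantic gap is a per-concept classifier that maps low-level keyframe features to high-level concept scores. The following minimal Python sketch uses synthetic feature vectors and a scikit-learn SVM; the data, feature dimensions, and library choice are assumptions for illustration only, not the chapter's implementation.

```python
# Minimal illustrative sketch (not the chapter's method): train a binary SVM
# to detect one semantic concept from low-level keyframe features, then rank
# unseen keyframes by the concept score for retrieval.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Synthetic stand-in for pre-extracted low-level features
# (e.g., color/texture histograms), one row per keyframe.
n_keyframes, feat_dim = 500, 64
X = rng.random((n_keyframes, feat_dim))
# Toy concept labels loosely correlated with the first feature dimension.
y = (X[:, 0] + 0.1 * rng.standard_normal(n_keyframes) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# One binary classifier per concept; probability outputs let us rank
# keyframes for concept-based retrieval.
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]
print("toy average precision:", round(average_precision_score(y_te, scores), 3))
```

In practice, one such detector would typically be trained for each concept in a vocabulary, and the resulting concept scores used to index and retrieve video scenes.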

Acknowledgements

This work was supported by grants from the Chinese National Natural Science Foundation under contracts No. 60973055 and No. 61035001, and by the National Basic Research Program of China under contract No. 2009CB320906.

Author information

Corresponding author: Wen Gao.

Copyright information

© 2011 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Gao, W., Tian, Y., Duan, L., Li, J., Li, Y. (2011). Video Scene Analysis: A Machine Learning Perspective. In: Ngan, K., Li, H. (eds) Video Segmentation and Its Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-9482-0_4

  • DOI: https://doi.org/10.1007/978-1-4419-9482-0_4

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4419-9481-3

  • Online ISBN: 978-1-4419-9482-0
