A Foreground Segmentation Scheme

  • Shilin Zhang
  • Heping Li
  • Shuwu Zhang
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 148)


The speech signal, video caption text, and video frame images are all key cues for a person to understand video content. Based on this observation, we propose a scheme that integrates continuous speech recognition, video caption text recognition, and object recognition. The video is first segmented into a number of shots by shot-boundary detection. Caption text recognition and speech recognition are then carried out, and their results are treated as two paragraphs of text from which only the nouns are kept. These words are further represented as a graph: the vertices stand for the words, and the edges denote the semantic relation between two neighboring words. In the last step, we apply a dense-subgraph-finding method to mine the video's semantic meaning. Experiments show that our video semantic mining method is efficient.
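The final mining step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes edges link neighboring noun words within a small window, and uses Charikar's greedy peeling heuristic as one possible dense-subgraph finder, since the abstract does not specify the exact algorithm.

```python
from collections import defaultdict

def build_word_graph(words, window=1):
    """Link each noun to its neighbors within a small window
    (assumed here as a proxy for the semantic relation between words)."""
    edges = set()
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if w != words[j]:
                edges.add(tuple(sorted((w, words[j]))))
    return edges

def densest_subgraph(edges):
    """Greedy peeling (Charikar's 2-approximation): repeatedly remove the
    minimum-degree vertex and keep the snapshot with the highest edge density."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = set(adj)
    m = len(edges)
    best_density, best_nodes = 0.0, set(nodes)
    while nodes:
        density = m / len(nodes)
        if density >= best_density:
            best_density, best_nodes = density, set(nodes)
        u = min(nodes, key=lambda x: len(adj[x]))  # peel min-degree vertex
        m -= len(adj[u])
        for v in adj[u]:
            adj[v].discard(u)
        del adj[u]
        nodes.discard(u)
    return best_nodes
```

On a toy transcript, the densest cluster of co-occurring nouns would then serve as the mined semantic summary of the shot.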


Keywords: Video Semantic Mining · Automatic Speech Recognition · Information Fusion · Object Recognition





Copyright information

© Springer-Verlag GmbH Berlin Heidelberg 2012

Authors and Affiliations

  1. High Technology and Innovation Center, Institute of Automation, Chinese Academy of Sciences, Beijing, China
