Multimedia Tools and Applications, Volume 70, Issue 3, pp 1487–1502

Video text detection and localization in intra-frames of H.264/AVC compressed video



Video text is closely related to video content, and the information it carries can facilitate content-based video analysis, indexing and retrieval. Video sequences are usually compressed before storage and transmission. A basic step in text-based applications is text detection and localization. In this paper, an overlaid text detection and localization method is proposed for H.264/AVC compressed video that uses the integer discrete cosine transform (DCT) coefficients of intra-frames. The main contributions of this paper are twofold: 1) coarse text block detection using thresholds that adapt to block size and quantization parameter; 2) text line localization according to the characteristics of text in intra-frames in the H.264/AVC compressed domain. The method is compared with a pixel-domain text detection method on H.264/AVC compressed video. Text detection results on five H.264/AVC video sequences at various quality levels show the effectiveness of the proposed method.
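The coarse detection step can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the threshold formula, so the rule used here (a threshold proportional to the number of AC coefficients in the block and inversely proportional to the H.264 quantizer step size) is an assumption, as are the function names and the `scale` constant. Only the step-size rule, which doubles for every increase of 6 in the quantization parameter, follows the H.264/AVC standard.

```python
# Hypothetical sketch of block-size- and QP-adaptive text block detection.
# The threshold rule and the `scale` constant are illustrative assumptions,
# not values taken from the paper.

# H.264 quantizer step sizes for QP 0..5; the step doubles every 6 QP.
_QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]


def qstep(qp):
    """Return the H.264 quantizer step size for a QP in 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in 0..51")
    return _QSTEP_BASE[qp % 6] * (2 ** (qp // 6))


def ac_energy(block):
    """Sum of absolute AC coefficients (everything except block[0][0])."""
    total = sum(abs(c) for row in block for c in row)
    return total - abs(block[0][0])


def adaptive_threshold(block_size, qp, scale=16.0):
    """Assumed rule: grow with the AC count, shrink with the quantizer step."""
    n_ac = block_size * block_size - 1
    return scale * n_ac / qstep(qp)


def is_text_block(block, qp, scale=16.0):
    """Flag a square block of quantized coefficients as a text candidate."""
    return ac_energy(block) > adaptive_threshold(len(block), qp, scale)


flat = [[50, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
busy = [[50, 12, -9, 4], [10, -7, 3, 0], [-6, 2, 0, 0], [1, 0, 0, 0]]
print(is_text_block(flat, qp=28), is_text_block(busy, qp=28))
```

Under these assumptions a flat block with only a DC coefficient is rejected, while a block with strong AC content is kept as a text candidate. Coarser quantization lowers the threshold because the quantized levels themselves become smaller.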


Text detection, DCT coefficient, H.264/AVC, Integer DCT, AC coefficient, Intra prediction



This work is supported in part by the National Natural Science Foundation of China (NSFC) under Projects No. 60903121 and No. 61173109, and by the Foundations of Microsoft Research Asia.



Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. Xi’an Jiaotong University, Xi’an, China
