A fast hierarchical method for multi-script and arbitrary oriented scene text extraction


Typography and layout lead to the hierarchical organization of text into words, text lines, and paragraphs. This inherent structure is a key property of text in any script and language, yet existing scene text detection methods have made little use of it. This paper addresses the problem of text segmentation in natural scenes from a hierarchical perspective. Unlike existing methods, we make explicit use of text structure, aiming directly at the detection of region groupings that correspond to text within a hierarchy produced by an agglomerative similarity clustering process over individual regions. We propose an optimal way to construct such a hierarchy, introducing a feature space designed to produce text group hypotheses with high recall, and a novel stopping rule that combines a discriminative classifier with a probabilistic measure of group meaningfulness based on perceptual organization. Results obtained over four standard datasets, covering text in variable orientations and different languages, demonstrate that our algorithm, while being trained on a single mixed dataset, outperforms state-of-the-art methods in unconstrained scenarios.
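
As an illustration of the general idea only, the following minimal sketch builds a hierarchy by single-linkage agglomerative clustering over per-region feature vectors and then keeps the largest groups accepted by a group test. Everything here is an assumption made for illustration: the function names `build_hierarchy` and `select_groups`, the Euclidean distance, the toy two-dimensional features, and the `same_line` predicate, which merely stands in for the stopping rule (classifier plus meaningfulness measure) described above and does not reproduce it.

```python
from itertools import combinations
from math import dist


def build_hierarchy(features):
    """Single-linkage agglomerative clustering over region feature vectors.

    Returns the merged groups (frozensets of region indices) in creation
    order, i.e. the inner nodes of the resulting dendrogram.
    """
    clusters = [frozenset([i]) for i in range(len(features))]
    nodes = []
    while len(clusters) > 1:
        # Pick the pair of clusters whose closest members are nearest.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(
                dist(features[i], features[j])
                for i in pair[0]
                for j in pair[1]
            ),
        )
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        nodes.append(merged)
    return nodes


def select_groups(nodes, is_text_group):
    """Keep accepted nodes that are not contained in a larger accepted node.

    `is_text_group` is a hypothetical placeholder for a real group test.
    """
    accepted = [n for n in nodes if is_text_group(n)]
    return [g for g in accepted if not any(g < other for other in accepted)]


# Toy data: (x, y) centres of six regions; two horizontal runs and an outlier.
features = [(0, 0), (1, 0), (2, 0), (10, 10), (11, 10), (50, 50)]


def same_line(group):
    # Toy acceptance test: all regions in the group share one y coordinate.
    return len({features[i][1] for i in group}) == 1


groups = select_groups(build_hierarchy(features), same_line)
```

On this toy input the two horizontal runs are kept as separate groups and the isolated region is left out. A real system would replace both the feature vectors and the acceptance test with learned, task-specific components.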



This project was supported by the Spanish project TIN2011-24631, the fellowship RYC-2009-05031, and the Catalan government scholarship 2013FI1126.

Author information

Corresponding author

Correspondence to Lluis Gomez.

About this article

Cite this article

Gomez, L., Karatzas, D. A fast hierarchical method for multi-script and arbitrary oriented scene text extraction. IJDAR 19, 335–349 (2016).


  • Scene text
  • Segmentation
  • Detection
  • Hierarchical grouping
  • Perceptual organization