Text Detection in Document Images by Machine Learning Algorithms

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 403)


In this paper, we consider the problem of text detection in document images. This problem plays an important role in OCR systems and is a challenging task. In the first step of the proposed approach, we use a self-adjusting bottom-up segmentation algorithm to segment a document image into a set of connected components (CCs). The segmentation algorithm is based on the Sobel edge detection method. In the second step, each CC is described by 27 features, and a machine learning algorithm then classifies the CCs as text or nontext. To test the approach, we collected a dataset (ASTRoID) that contains 500 images of text blocks and 500 images of nontext blocks. We empirically compare the performance of the proposed text detection method when using seven different machine learning algorithms.
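The two steps described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the fixed `threshold` stands in for the self-adjusting behaviour, which the abstract does not specify, and the four features computed by `describe` are hypothetical placeholders rather than the 27 features used in the paper. All function names are assumptions introduced for illustration.

```python
from collections import deque

def sobel_magnitude(img):
    """Approximate gradient magnitude |gx| + |gy| using 3x3 Sobel kernels.
    img is a list of rows of grey levels; border pixels are left at 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = abs(gx) + abs(gy)
    return out

def connected_components(mask):
    """Group 8-connected True pixels; return one list of (row, col) per component."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:  # breadth-first flood fill
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components

def describe(pixels):
    """Illustrative features only (assumed, not the paper's 27 features)."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        "density": len(pixels) / (width * height),  # edge pixels per bounding-box pixel
    }

def segment(img, threshold):
    """Step 1: Sobel edges -> binary mask -> CCs. Step 2 (partial): feature vectors."""
    magnitude = sobel_magnitude(img)
    mask = [[value >= threshold for value in row] for row in magnitude]
    return [describe(c) for c in connected_components(mask)]
```

Each dictionary returned by `segment` would then serve as one input row for a text/nontext classifier trained on labelled blocks; the choice of classifier is left open here, since the abstract does not name the seven algorithms compared.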


Text detection, Document segmentation, Text/nontext classification, Machine learning



The presented work was supported by Creative Core FISNM-3330-13-500033 ‘Simulations’ project funded by the European Union, The European Regional Development Fund. The operation is carried out within the framework of the Operational Programme for Strengthening Regional Development Potentials for the period 2007–2013, Development Priority 1: Competitiveness and research excellence, Priority Guideline 1.1: Improving the competitive skills and research excellence.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Laboratory of Data Technologies, Faculty of Information Studies, Novo Mesto, Slovenia
  2. Jožef Stefan Institute, Department of Knowledge Technologies, Ljubljana, Slovenia
