Abstract
In the proposed paper, we consider a problem of text detection in document images. This problem plays an important role in OCR systems and is a challenging task. In the first step of our proposed text detection approach, we use a self-adjusting bottom-up segmentation algorithm to segment a document image into a set of connected components (CCs). The segmentation algorithm is based on the Sobel edge detection method. In the second step, CCs are described in terms of 27 features and a machine learning algorithm is then used to classify the CCs as text or nontext. For testing the approach, we have collected a dataset (ASTRoID), which contains 500 images of text blocks and 500 images of nontext blocks. We empirically compare performance of the proposed text detection method when using seven different machine learning algorithms.
Keywords
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Kise, K.: Page Segmentation Techniques in Document Analysis. Handbook of Document Image Processing and Recognition, pp. 135–175. Springer, London (2014)
Coppi, D., Grana, C., Cucchiara, R.: Illustrations segmentation in digitized documents using local correlation features. In: 10th Italian Research Conference on Digital Libraries, vol. 38, pp. 76–83. Procedia Computer Science, Padua (2014)
Shafait, F., Keysers, D., Breuel, T.: Performance evaluation and benchmarking of six-page segmentation algorithms. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 941–954. IEEE Press (2008)
Kruatrachue, B., Moongfangklang, N., Siriboon, K.: Fast document segmentation using contour and X-Y cut technique. In: The Third World Enformatika Conference, WEC vol. 5, pp. 27–29. Turkey (2005)
Barlas, P., Kasar, T., Adams, S., Chatelain, C., Paquet, T.: A typed and handwritten text block segmentation system for heterogeneous and complex documents. In: 11th IAPR International Workshop on Document Analysis Systems, pp. 46–50, IEEE Press, Tours (2014)
Priyadharshini, N., Vijaya, M.S.: Genetic programming for document segmentation and region classification using discipulus. Int. J. Adv. Res. Artif. Intell. 2, 15–22 (2013)
Priyanka, N., Pal, S., Mandal, R.: Line and word segmentation approach for printed documents. Int. J. Comput. Appl. 1, 30–36 (2010)
Vikas, J.D., Vijay, H.M.: Devnagari document segmentation using histogram approach. Int. J. Comput. Sci. Eng. Inf. Tech. 1, 46–53 (2011)
Bukhari, S.S., Azawi, M.A., Shafait, F., Breuel, T.M.: Document image segmentation using discriminative learning over connected components. In: 9th IAPR International Workshop on Document Analysis Systems, pp. 183–190. Boston (2010)
Bukhari, S.S., Asi, A., Breuel, T.M., El-Sana, J.: Layout analysis for arabic historical document images using machine learning. In: International Conference on Frontiers in Handwriting Recognition, pp. 639–644 (2012)
Zagoris, K., Chatzichristofis, S.A., Papamarkos, N.: Text Localization using standard deviation analysis of structure elements and support vector machines. EURASIP J. Adv. Sign. Process. 47, 1–2 (2011)
Bukhari, S.S., Shafait, F., Breuel, T.M.: Improved document image segmentation algorithm using multiresolution morphology. In: 18th Document Recognition and Retrieval Conference, pp. 1–10. San Jose (2011)
Sumathi, C.P., Priya, N.: A combined edge-based text region extraction from document images. Int. J. Adv. Res. Comput. Sci. Softw. Eng. 3, 827–835 (2013)
Kundu, M.K., Dhar, S., Banerjee, M.: A new approach for segmentation of image and text in natural and commercial color document. In: Proceedings of International Conference on Communication, Devices and Intelligent Systems, pp. 85–88. IEEE Press, India (2012)
Roy, P.P., Pal, U., Lladós, J.: Touching text character localization in graphical documents using SIFT. In: Proceedings of the 8th International Conference on Graphics Recognition: Achievements, Challenges, and Evolution, pp. 199–211. Springer, France (2010)
Vasuki, S., Ganesan, L.: Performance measure for edge based color image segmentation in color spaces. In: Proceedings of the International Conference on Emerging Technologies in Intelligent System and Control: Exploring, Exposing, and Experiencing the Emerging Technologies, pp. 621–626. Allied Publishers, Coimbatore (2005)
Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. 9, 62–66 (1979)
Basilis, G.G.: Imaging Techniques in Document Analysis Processes. Handbook of Document Image Processing and Recognition. Springer, London (2014)
Burger, W., Burge, M.J.: Principles of Digital Image Processing. Springer, London (2009)
WEKA (Open source, Data Mining software in Java), University of Waikato, New Zealand. http://www.cs.waikato.ac.nz/ml/weka
Acknowledgments
The presented work was supported by Creative Core FISNM-3330-13-500033 ‘Simulations’ project funded by the European Union, The European Regional Development Fund. The operation is carried out within the framework of the Operational Programme for Strengthening Regional Development Potentials for the period 2007–2013, Development Priority 1: Competitiveness and research excellence, Priority Guideline 1.1: Improving the competitive skills and research excellence.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Zelenika, D., Povh, J., Ženko, B. (2016). Text Detection in Document Images by Machine Learning Algorithms. In: Burduk, R., Jackowski, K., Kurzyński, M., Woźniak, M., Żołnierek, A. (eds) Proceedings of the 9th International Conference on Computer Recognition Systems CORES 2015. Advances in Intelligent Systems and Computing, vol 403. Springer, Cham. https://doi.org/10.1007/978-3-319-26227-7_16
Download citation
DOI: https://doi.org/10.1007/978-3-319-26227-7_16
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26225-3
Online ISBN: 978-3-319-26227-7
eBook Packages: EngineeringEngineering (R0)