Text Detection in Document Images by Machine Learning Algorithms

Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 403)

Abstract

In this paper, we consider the problem of text detection in document images. This problem plays an important role in OCR systems and remains a challenging task. In the first step of our text detection approach, a self-adjusting bottom-up segmentation algorithm, based on the Sobel edge detection method, segments a document image into a set of connected components (CCs). In the second step, each CC is described by 27 features, and a machine learning algorithm classifies the CCs as text or nontext. To test the approach, we collected a dataset (ASTRoID) containing 500 images of text blocks and 500 images of nontext blocks. We empirically compare the performance of the proposed text detection method across seven different machine learning algorithms.
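
The two-step pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`sobel_magnitude`, `connected_components`, `segment`, `describe`) are hypothetical, a fixed threshold of 100 is assumed in place of the paper's self-adjusting segmentation, and the three features shown are stand-ins, since the abstract does not list the 27 features. The classification step is omitted; the feature dictionaries would be passed to whichever classifier is being evaluated.

```python
from collections import deque


def sobel_magnitude(img):
    """Approximate gradient magnitude |gx| + |gy| with 3x3 Sobel kernels.

    `img` is a list of rows of grayscale values; border pixels stay 0.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = abs(gx) + abs(gy)
    return out


def connected_components(mask):
    """Group truthy pixels into 8-connected components via breadth-first search."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y0 in range(h):
        for x0 in range(w):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue, pixels = deque([(y0, x0)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # Visit the 3x3 neighbourhood, clipped to the image bounds.
                for ny in range(max(y - 1, 0), min(y + 2, h)):
                    for nx in range(max(x - 1, 0), min(x + 2, w)):
                        if mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def segment(img, threshold=100):
    """Threshold the Sobel response and label CCs.

    The fixed threshold is an assumption made for this sketch; the paper's
    segmentation is described as self-adjusting.
    """
    mag = sobel_magnitude(img)
    mask = [[v > threshold for v in row] for row in mag]
    return connected_components(mask)


def describe(pixels):
    """Illustrative bounding-box features only; not the paper's 27 features."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {"width": width, "height": height,
            "fill_ratio": len(pixels) / (width * height)}
```

Calling `segment(img)` on a grayscale image stored as a list of rows returns one pixel list per connected edge region, and `describe` turns each region into a small feature dictionary. A pure-Python implementation is used here only to keep the sketch dependency-free; an array library would be the usual choice for real document images.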

Keywords

Text detection; Document segmentation; Text/nontext classification; Machine learning

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Laboratory of Data Technologies, Faculty of Information Studies, Novo Mesto, Slovenia
  2. Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
