Abstract
The performance of document text recognition depends on text-line segmentation algorithms, which in turn depend heavily on the language, the author’s writing style, the pen type, and the document quality. In this paper, we present a novel unsupervised text-line segmentation algorithm for printed Arabic documents with and without diacritics. The presented approach employs a projection profile along with connected components in an iterative manner to detect text lines. The primary benefits of the presented algorithm are that (i) it is not threshold dependent, (ii) it does not require a training phase for threshold selection, and (iii) it is robust to page rotation and to variation in font type, size, and style, for documents both with and without diacritics. Extensive computational simulations on a manually collected dataset demonstrate the efficiency of the proposed scheme compared with several baseline and state-of-the-art methods, including the Voronoi, X-Y Cut, Docstrum, Smearing, and Seam-carving methods. A computational time analysis is also presented.
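To make the term "projection profile" concrete, the following is a minimal sketch of the plain horizontal projection profile that such methods build on. It is not the algorithm proposed in the paper: it omits the connected-component analysis and the iterative refinement, and it assumes an already binarized, unrotated page in which ink pixels are 1. The function names `horizontal_projection` and `split_lines` and the toy `page` are hypothetical, introduced only for illustration.

```python
from typing import List, Tuple


def horizontal_projection(image: List[List[int]]) -> List[int]:
    """Count ink pixels (value 1) in every row of a binary image."""
    return [sum(row) for row in image]


def split_lines(profile: List[int]) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges, end exclusive, where the profile is non-zero.

    Assumption: text lines are separated by at least one completely blank row.
    """
    lines: List[Tuple[int, int]] = []
    start = None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                      # first inked row of a new band
        elif count == 0 and start is not None:
            lines.append((start, y))       # blank row closes the current band
            start = None
    if start is not None:                  # band that runs to the bottom edge
        lines.append((start, len(profile)))
    return lines


# Toy 7x6 binary "page" with two bands of ink separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
profile = horizontal_projection(page)
lines = split_lines(profile)
print(profile)
print(lines)
```

A naive split like this treats every inked band as a line, so a row of diacritics separated from its base text by a blank row would be reported as a line of its own. That limitation is one plausible reason for pairing the profile with connected components, as the abstract describes.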
Cite this article
Mohammad, K., Qaroush, A., Washha, M. et al. An adaptive text-line extraction algorithm for printed Arabic documents with diacritics. Multimed Tools Appl 80, 2177–2204 (2021). https://doi.org/10.1007/s11042-020-09737-1
DOI: https://doi.org/10.1007/s11042-020-09737-1