Text line segmentation of historical documents: a survey

Likforman-Sulem, Laurence; Zahour, Abderrazak; Taconet, Bruno

doi:10.1007/s10032-006-0023-z

ORIGINAL PAPER
Published: 28 September 2006

Volume 9, pages 123–138 (2007)

International Journal of Document Analysis and Recognition (IJDAR)

Laurence Likforman-Sulem¹,
Abderrazak Zahour² &
Bruno Taconet²

Abstract

A huge number of historical documents held in libraries and in various national archives have not yet been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication and extraction of specific fields are already in use today. For all of these tasks, a major step is segmenting the document into text lines. Because of the low quality and complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. This paper surveys existing methods, developed during the last decade, that are dedicated to documents of historical interest.

Author information

Authors and Affiliations

GET-Ecole Nationale Supérieure des Télécommunications/TSI and CNRS-LTCI, 46 rue Barrault, 75013, Paris, France
Laurence Likforman-Sulem
IUT, Université du Havre/GED, Place Robert Schuman, 76610, Le Havre, France
Abderrazak Zahour & Bruno Taconet

Corresponding author

Correspondence to Laurence Likforman-Sulem.

About this article

Cite this article

Likforman-Sulem, L., Zahour, A. & Taconet, B. Text line segmentation of historical documents: a survey. IJDAR 9, 123–138 (2007). https://doi.org/10.1007/s10032-006-0023-z

Received: 14 February 2005
Revised: 14 November 2005
Accepted: 28 May 2006
Published: 28 September 2006
Issue Date: April 2007
DOI: https://doi.org/10.1007/s10032-006-0023-z
