Abstract
A huge number of historical documents held in libraries and national archives have not yet been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication, and extraction of specific fields are in use today. For all of these tasks, a major step is segmenting the document into text lines. Because of the low quality and complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. This paper presents a survey of existing methods, developed during the last decade, that are dedicated to documents of historical interest.
Likforman-Sulem, L., Zahour, A. & Taconet, B. Text line segmentation of historical documents: a survey. IJDAR 9, 123–138 (2007). https://doi.org/10.1007/s10032-006-0023-z