Text line segmentation of historical documents: a survey

Abstract

A huge number of historical documents held in libraries and national archives have not yet been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication and extraction of specific fields are already in use today. For all these tasks, a major step is the segmentation of the document into text lines. Because of the low quality and complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. This paper presents a survey of existing methods, developed during the last decade, that are dedicated to documents of historical interest.

Author information

Correspondence to Laurence Likforman-Sulem.

About this article

Cite this article

Likforman-Sulem, L., Zahour, A. & Taconet, B. Text line segmentation of historical documents: a survey. IJDAR 9, 123–138 (2007). https://doi.org/10.1007/s10032-006-0023-z

Keywords

  • Segmentation
  • Handwriting
  • Text lines
  • Historical documents
  • Survey