Layout Detection and Table Recognition – Recent Challenges in Digitizing Historical Documents and Handwritten Tabular Data

  • Conference paper
Digital Libraries for Open Knowledge (TPDL 2020)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12246)

Abstract

In this paper, we discuss the computer-aided processing of handwritten tabular records of historical weather data. The observationes meteorologicae, which are housed by the Regensburg University Library, are one of the oldest collections of weather data in Europe. Starting in 1771, meteorological data was consistently documented in a standardized form over almost 60 years by several writers. The tabular structure, as well as the unconstrained textual layout of comments and the use of historical characters, pose various challenges in layout and text recognition. We present a customized strategy to digitize tabular and handwritten data by combining various state-of-the-art methods for OCR processing to fit the collection. Since the recognition of historical documents still poses major challenges, we provide lessons learned from experimental testing during the first project stages. Our results show that deep learning methods can be used for text recognition and layout detection. However, they are less efficient for the recognition of tabular structures. Furthermore, a tailored approach had to be developed for the historical meteorological characters during the manual creation of ground truth data. The customized system achieved an accuracy rate of 82% for the text recognition of the heterogeneous handwriting and 87% accuracy for layout recognition of the tables.

Author information

Corresponding author

Correspondence to Constantin Lehenmeier.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Lehenmeier, C., Burghardt, M., Mischka, B. (2020). Layout Detection and Table Recognition – Recent Challenges in Digitizing Historical Documents and Handwritten Tabular Data. In: Hall, M., Merčun, T., Risse, T., Duchateau, F. (eds) Digital Libraries for Open Knowledge. TPDL 2020. Lecture Notes in Computer Science, vol 12246. Springer, Cham. https://doi.org/10.1007/978-3-030-54956-5_17

  • DOI: https://doi.org/10.1007/978-3-030-54956-5_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-54955-8

  • Online ISBN: 978-3-030-54956-5

  • eBook Packages: Computer Science, Computer Science (R0)
