Layout Detection and Table Recognition – Recent Challenges in Digitizing Historical Documents and Handwritten Tabular Data

Lehenmeier, Constantin; Burghardt, Manuel; Mischka, Bernadette

doi:10.1007/978-3-030-54956-5_17


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12246)

Included in the following conference series:

International Conference on Theory and Practice of Digital Libraries


Abstract

In this paper, we discuss the computer-aided processing of handwritten tabular records of historical weather data. The Observationes meteorologicae, housed at the Regensburg University Library, are one of the oldest collections of weather data in Europe. Starting in 1771, several writers consistently documented meteorological data in a standardized form over almost 60 years. The tabular structure, the unconstrained textual layout of comments, and the use of historical characters pose various challenges for layout and text recognition. We present a customized strategy for digitizing tabular and handwritten data that combines various state-of-the-art OCR methods and adapts them to the collection. Since the recognition of historical documents still poses major challenges, we report lessons learned from experimental testing during the first project stages. Our results show that deep learning methods can be used for text recognition and layout detection. However, they are less efficient for the recognition of tabular structures. Furthermore, a tailored approach had to be developed for the historical meteorological characters during the manual creation of ground truth data. The customized system achieved an accuracy of 82% for text recognition of the heterogeneous handwriting and 87% for layout recognition of the tables.
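The abstract does not state how the text recognition accuracy was measured. As a rough illustration only, the sketch below assumes a character-level accuracy, defined as one minus the edit distance between a recognized line and its ground truth transcription, divided by the length of the ground truth. The function names and the sample strings are hypothetical and are not taken from the paper.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning one string into the other."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), clipped at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # Hypothetical ground truth line and OCR output, for illustration only.
    ground_truth = "Nebel"
    recognized = "Nebal"
    print(character_accuracy(ground_truth, recognized))
```

Under this assumed definition, a figure such as 82% would mean that roughly 18 character edits are needed per 100 ground truth characters. A word-level or line-level metric would give different numbers for the same output.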



Author information

Authors and Affiliations

University Library of Regensburg, 93053, Regensburg, Germany
Constantin Lehenmeier
Computational Humanities, Leipzig University, 04109, Leipzig, Germany
Manuel Burghardt
University of Regensburg, 93053, Regensburg, Germany
Bernadette Mischka


Corresponding author

Correspondence to Constantin Lehenmeier.

Editor information

Editors and Affiliations

School of Computing and Communications, The Open University, Milton Keynes, UK
Mark Hall
Faculty of Arts, University of Ljubljana, Ljubljana, Slovenia
Tanja Merčun
University Library J. C. Senckenberg, Goethe University Frankfurt, Frankfurt am Main, Germany
Thomas Risse
Université Claude Bernard Lyon 1, Villeurbanne, France
Fabien Duchateau


About this paper

Cite this paper

Lehenmeier, C., Burghardt, M., Mischka, B. (2020). Layout Detection and Table Recognition – Recent Challenges in Digitizing Historical Documents and Handwritten Tabular Data. In: Hall, M., Merčun, T., Risse, T., Duchateau, F. (eds) Digital Libraries for Open Knowledge. TPDL 2020. Lecture Notes in Computer Science, vol 12246. Springer, Cham. https://doi.org/10.1007/978-3-030-54956-5_17


DOI: https://doi.org/10.1007/978-3-030-54956-5_17
Published: 17 August 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-54955-8
Online ISBN: 978-3-030-54956-5
eBook Packages: Computer Science, Computer Science (R0)
