Abstract
In this paper, we discuss the computer-aided processing of handwritten tabular records of historical weather data. The Observationes meteorologicae, held by the Regensburg University Library, are one of the oldest collections of weather data in Europe. Starting in 1771, several writers consistently documented meteorological data in a standardized form over almost 60 years. The tabular structure, the unconstrained textual layout of comments, and the use of historical characters pose various challenges for layout and text recognition. We present a customized strategy for digitizing tabular and handwritten data that combines several state-of-the-art OCR methods to fit the collection. Since the recognition of historical documents still poses major challenges, we report lessons learned from experimental testing during the first project stages. Our results show that deep learning methods can be used for text recognition and layout detection, but that they are less efficient for the recognition of tabular structures. Furthermore, a tailored approach had to be developed for the historical meteorological characters during the manual creation of ground truth data. The customized system achieved an accuracy of 82% for text recognition of the heterogeneous handwriting and 87% for layout recognition of the tables.
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Lehenmeier, C., Burghardt, M., Mischka, B. (2020). Layout Detection and Table Recognition – Recent Challenges in Digitizing Historical Documents and Handwritten Tabular Data. In: Hall, M., Merčun, T., Risse, T., Duchateau, F. (eds) Digital Libraries for Open Knowledge. TPDL 2020. Lecture Notes in Computer Science, vol 12246. Springer, Cham. https://doi.org/10.1007/978-3-030-54956-5_17
DOI: https://doi.org/10.1007/978-3-030-54956-5_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-54955-8
Online ISBN: 978-3-030-54956-5
eBook Packages: Computer Science (R0)