Abstract
Large and important parts of cultural heritage are stored in archives that are difficult to access, even after digitization. Documents and notes are written in hard-to-read historical handwriting and are often interspersed with illustrations. Such collections are weakly structured and largely inaccessible to scholars and the wider public. Traditionally, humanities researchers treat text and images separately. This separation extends to traditional handwriting recognition systems. Many of them use a segmentation-free OCR approach, which can only resolve manuscripts that are homogeneous in layout, style and linguistic content. Our infrastructure, in contrast, aims to resolve heterogeneous handwritten manuscript pages in which different scripts and images are closely intertwined. The authors in our use case, a 17,000-page account of the exploration of the Indonesian Archipelago between 1820 and 1850 (“Natuurkundige Commissie voor Nederlands-Indië”), tried to record their knowledge and observations in a semantically structured way; this discipline, however, does not carry over to the handwriting itself. The use of several languages, including German, Latin, Dutch, Malay, Greek, and French, makes interpretation more challenging. Our infrastructure takes the state-of-the-art word retrieval system MONK as its starting point. Owing to its visual approach, MONK can handle the diversity of material we encounter in our use case and in many other historical collections: text, drawings and images. By combining text and image recognition, we go significantly beyond the state of the art and provide meaningful additions to integrated manuscript recognition. This paper describes the infrastructure and presents early results.
Mahya Ameryan and Andreas Weber share first authorship of this paper.
Notes
1. The Metamorfoze programme funds the preservation of paper heritage that is deemed to be of national importance for the Netherlands. The FCD programme (FES Collection Digitization, 2010–2015) digitized a significant part of the specimens preserved by Naturalis.
2. An exception is: https://kogs-www.informatik.uni-hamburg.de/projekte/IMPACT.html, last accessed 2017/09/02.
3. http://biodivlib.wikispaces.com/The+Field+Book+Project, last accessed 2017/09/01.
4. https://github.com/lisestork/NHC-Ontology, last accessed 2017/09/01.
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Weber, A., Ameryan, M., Wolstencroft, K., Stork, L., Heerlien, M., Schomaker, L. (2018). Towards a Digital Infrastructure for Illustrated Handwritten Archives. In: Ioannides, M. (ed.) Digital Cultural Heritage. Lecture Notes in Computer Science, vol 10605. Springer, Cham. https://doi.org/10.1007/978-3-319-75826-8_13
DOI: https://doi.org/10.1007/978-3-319-75826-8_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-75825-1
Online ISBN: 978-3-319-75826-8
eBook Packages: Computer Science, Computer Science (R0)