From Handwritten Manuscripts to Linked Data
Museums, archives and digital libraries make increasing use of Semantic Web technologies to enrich and publish their collection items. The contents of those items, however, are not often enriched in the same way. Extracting named entities within historical manuscripts and disclosing the relationships between them would facilitate cultural heritage research, but it is a labour-intensive and time-consuming process, particularly for handwritten documents.
This process requires either automated handwriting recognition or manual annotation by domain experts before the content can be semantically structured. Different workflows have been proposed to address this problem, involving full-text transcription and named entity extraction, with results ranging from unstructured files to semantically annotated knowledge bases. Here, we detail these workflows and describe the approach we have taken to disclose historical biodiversity data, which enables the direct labelling and semantic annotation of document images in handwritten archives.
Keywords: Linked data, Cultural heritage, Handwriting recognition, Semantic annotation, Named entity recognition