A Semantic Search Engine for Historical Handwritten Document Images

Part of the Lecture Notes in Computer Science book series (LNISA, volume 12866)

Abstract

A very large number of historical manuscript collections are available in image formats, and searching through them requires extensive manual processing. We therefore propose and build a search engine that automatically stores, indexes and efficiently searches manuscript images. First, a handwritten text recognition technique converts the images into textual representations. We then apply named entity recognition and a historical knowledge graph to build a semantic search model that understands the user’s intent in the query and the contextual meaning of concepts in documents, and returns the relevant transcriptions together with their corresponding images.

Keywords

  • Handwriting transcription
  • Named entity
  • Knowledge graph

1 Introduction

Every year, large collections of historical handwritten manuscripts in museums, libraries and other organisations are digitised as electronic images. Digitisation makes the manuscripts available to a wider audience and preserves cultural heritage. Recognising text and named entities in medieval and early-modern manuscript sources automatically and with high accuracy remains a challenge [2, 20, 22]. To make manuscript images accessible and searchable, they are often processed through keyword spotting or word recognition [4, 8, 14, 17, 18]. Several papers build search systems for handwritten images [1, 5, 15, 16, 21, 23]. However, these systems offer only keyword search.

Unlike keyword search, semantic search improves search precision and recall by understanding the user’s intent and the contextual meaning of concepts in documents and queries [3, 12, 19, 24]. This paper proposes a semantic search engine for full-text retrieval of historical handwritten document images based on named entities (NEs), keywords (KWs) and a knowledge graph (KG). The engine not only processes, stores and indexes manuscripts automatically, but also allows users to access and retrieve them quickly and efficiently.

2 System Architecture

The Public Record Office of Ireland (PROI) was destroyed on 30 June 1922, resulting in the loss of 700 years of Irish history. The Beyond 2022 Project (https://beyond2022.ie) is combining historical research, archival discovery, and technical innovation to create a virtual reconstruction of the PROI. There are over 300 volumes of surviving and collected handwritten copies of lost documents, comprising some 100,000 pages and 25 million words of text.

Fig. 1. The system architecture

The architecture of our search engine is illustrated in Fig. 1. It has four separate processing modules: Handwritten Text Recognition, NE Recognition, KW-NE Indexing and the KW-NE-Based IR Model. First, the historical handwritten document images are converted to transcriptions by the Handwritten Text Recognition module. The transcriptions are then annotated with NEs by the NE Recognition module, which connects to the Knowledge Graph to extract the classes and identifiers of the NEs. Next, the KWs and NEs of the annotated transcriptions, together with the respective original images, are represented and indexed by the KW-NE Indexing module and stored in the KW-NE Annotated Text and Image Repository. The raw text query is likewise annotated with NEs by the NE Recognition module to become a KW-NE annotated query. Finally, the KW-NE-Based IR Model module compares the annotated query with the annotated documents and returns the ranked transcriptions and images.
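To make the indexing step more concrete, the following is a minimal in-memory sketch of a concept index that keeps each transcription together with its original image. It is only an illustration: the deployed system relies on Elasticsearch, and the class name ConceptIndex, its methods, the document identifiers and the image paths are all hypothetical.

```python
from collections import defaultdict


class ConceptIndex:
    """Toy inverted index from concepts (KWs or NEs) to document ids.

    Hypothetical stand-in for the KW-NE Annotated Text and Image
    Repository; the real system stores this data in Elasticsearch.
    """

    def __init__(self):
        self.postings = defaultdict(set)  # concept -> set of doc ids
        self.records = {}                 # doc id -> (transcription, image path)

    def add(self, doc_id, concepts, transcription, image_path):
        """Store one annotated transcription with its source image."""
        self.records[doc_id] = (transcription, image_path)
        for concept in concepts:
            self.postings[concept].add(doc_id)

    def candidates(self, query_concepts):
        """Return ids of documents sharing at least one concept with the query."""
        ids = set()
        for concept in query_concepts:
            ids |= self.postings.get(concept, set())
        return sorted(ids)


index = ConceptIndex()
# Illustrative entries only; ids, texts and paths are made up.
index.add("doc1", {"coun_meath", "shilling", "silver"}, "text of doc1", "img/doc1.jpg")
index.add("doc2", {"occu_clerk", "greeting"}, "text of doc2", "img/doc2.jpg")
hits = index.candidates({"coun_meath", "silver"})
```

Candidate documents retrieved this way would still need to be ranked, which is the task of the IR model module.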

3 Image Representation and Knowledge Graph

Transkribus [13] is used for training and deploying Handwritten Text Recognition (HTR) models that derive text transcriptions from image scans. Given the rate at which transcriptions can be generated, NE Recognition (NER) and Entity Linking (EL) are required to automatically annotate all instances of entities occurring in the transcription text. We used SpaCy [11] for NER and obtained good results on 18th-century English text. To provide flexibility, the NLP pipeline has been implemented as a thin layer over a number of standard NLP tools. The output of the pipeline is in the NLP Interchange Format (NIF) [10], in which a NER tool has annotated the classes of entities and, where possible, an EL tool has linked the recognised entities to the KG.

The KG collects structured data from various historical sources. Part of the data is manually curated by historians through spreadsheets. Other data sources (e.g. geographical data from OSi [6]) are imported automatically as RDF for direct insertion into the KG. The schema (or ontology) used to structure the KG is mainly based on the popular CIDOC-CRM ontology [7]. A short excerpt of the KG is depicted in Fig. 2. It shows a few main entities and relationships related to a person (of type CIDOC-CRM:E21_Person) named “William Sutton”, who was a member of a few relevant offices in Ireland.
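As a rough illustration of how such an excerpt can be held as subject-predicate-object triples and queried for an entity's identifier and class, consider the sketch below. Everything in it is an assumption made for illustration: the example.org namespace, the entity identifiers, the placeholder office and the choice of the CIDOC-CRM terms E74_Group and P107i_is_current_or_former_member_of are not taken from the actual KG.

```python
RDF_TYPE = "rdf:type"
RDFS_LABEL = "rdfs:label"
CRM = "http://www.cidoc-crm.org/cidoc-crm/"  # CIDOC-CRM namespace
EX = "https://example.org/kg/"               # hypothetical KG namespace

# Hypothetical triples; the real KG content is not reproduced here.
TRIPLES = [
    (EX + "william_sutton", RDF_TYPE, CRM + "E21_Person"),
    (EX + "william_sutton", RDFS_LABEL, "William Sutton"),
    (EX + "william_sutton", CRM + "P107i_is_current_or_former_member_of", EX + "office_a"),
    (EX + "office_a", RDF_TYPE, CRM + "E74_Group"),
    (EX + "office_a", RDFS_LABEL, "Office A (placeholder)"),
]


def objects(subject, predicate):
    """Return all objects of triples matching the subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def lookup_entity(label):
    """Return (identifier, classes) of the first entity with this label, else None."""
    for s, p, o in TRIPLES:
        if p == RDFS_LABEL and o.lower() == label.lower():
            return s, objects(s, RDF_TYPE)
    return None


person = lookup_entity("william sutton")
offices = objects(EX + "william_sutton", CRM + "P107i_is_current_or_former_member_of")
```

A lookup of this kind is what would let an annotation step attach a class and, when one is found, an identifier to a recognised name.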

Fig. 2. A portion of our historical KG about “William Sutton”.

4 Information Retrieval Model and Demo

A search engine needs not only to return the best documents but also to be fast. We implemented the index and search functions on Elasticsearch [9] to obtain a real-time search engine. We use the Okapi BM25 model to find and rank the handwritten manuscripts relevant to a query. In the model, documents and queries are represented by sets of concepts, each being an NE or a KW. Figure 3 presents an image of a handwritten medieval historical manuscript, its transcription and its concept set d, as used in the model. The transcription contains three kinds of words, as determined by our NER tool: (1) stop-words: the, to, of, we and you; (2) NEs: sheriff, Meath, clerk and William Sutton; and (3) KWs: king, &c, greeting, direct, pay, shilling and silver. The stop-words are not added to the concept set d.
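For readers unfamiliar with Okapi BM25, the sketch below scores toy documents against a query, with both given as lists of concepts. It is a plain-Python illustration under stated assumptions, not the deployed Elasticsearch configuration: the parameter values k1 = 1.2 and b = 0.75 are commonly used defaults, the IDF is the non-negative variant ln(1 + (N - n + 0.5) / (n + 0.5)), and the documents are invented.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (a list of concepts) against the query concepts."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each concept.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalisation term shared by all concepts of this document.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for concept in set(query):
            if concept not in tf:
                continue
            n = doc_freq[concept]
            idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
            score += idf * tf[concept] * (k1 + 1) / (tf[concept] + norm)
        scores.append(score)
    return scores


# Invented toy collection using concept names that appear in the text.
docs = [
    ["occu_sheriff", "coun_meath", "occu_clerk", "william_sutton/person",
     "king", "pay", "shilling", "silver"],
    ["king", "greeting", "pay", "shilling"],
    ["occu_clerk", "direct", "greeting"],
]
query = ["coun_meath", "silver", "shilling"]
scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Because NE identifiers and keywords are scored in the same way, a query concept such as coun_meath matches only documents annotated with that entity, not every document that happens to contain the string Meath.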

Fig. 3. An example of NE and KW annotation of a medieval historical manuscript

Fig. 4. User interface of our deployed search engine

Figure 4 presents the interface of our search engine (see Note 1) and the concept sets of \(q_1\) and \(q_2\). There, coun_meath is the identifier of an entity named Meath and classed Country, as determined by our NER algorithm, whereas silver and shilling are keywords. To exploit the features of NEs for semantic search, an NE needs to be represented by its most specific meaning in the concept set d. That is, for an NE in the transcription:

  • If our NER can determine its identifier, the NE is represented by its identifier in d. For example, occu_sheriff, coun_meath and occu_clerk are the identifiers of the entities named sheriff, Meath and clerk, and are added to d.

  • If our NER can determine only its most specific class, the NE is represented by a combination of its name and class. For example, the entity named William Sutton does not exist in our historical KG, so its identifier cannot be extracted. However, the NER determines that its most specific class is Person, so william_sutton/person is added to d.
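The two rules above, together with the stop-word filter, can be sketched as follows. This is a minimal illustration, assuming the NER output is available as a list of dictionaries; the token structure, the function names and the class name Occupation are hypothetical, while the identifiers and the name/class form follow the examples in the text.

```python
STOP_WORDS = {"the", "to", "of", "we", "you"}


def concept_for_entity(name, entity_class, identifier=None):
    """Return the most specific representation of a named entity."""
    if identifier is not None:
        return identifier  # rule 1: the identifier is known
    # Rule 2: only the class is known, so combine name and class.
    return f"{name.lower().replace(' ', '_')}/{entity_class.lower()}"


def build_concept_set(tokens):
    """Build the concept set d from annotated tokens.

    Each token is a dict with 'text' and, for NEs, 'class' and
    optionally 'id' (a hypothetical format for the NER output).
    """
    concepts = set()
    for tok in tokens:
        if "class" in tok:
            concepts.add(concept_for_entity(tok["text"], tok["class"], tok.get("id")))
        elif tok["text"].lower() not in STOP_WORDS:
            concepts.add(tok["text"].lower())  # plain keyword
    return concepts


tokens = [
    {"text": "the"},
    {"text": "sheriff", "class": "Occupation", "id": "occu_sheriff"},
    {"text": "of"},
    {"text": "Meath", "class": "Country", "id": "coun_meath"},
    {"text": "William Sutton", "class": "Person"},
    {"text": "shilling"},
]
d = build_concept_set(tokens)
```

The same function could be applied to an annotated query, so that queries and documents are compared over the same kind of concepts.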

5 Conclusion

We proposed a novel semantic full-text search system for images of historical handwritten manuscripts. Unlike existing approaches, which use only KWs extracted from images, we exploited NEs, KWs and a KG to increase search performance. NER and HTR tools were built to recognise transcriptions and NEs from the manuscript images. In addition, to increase the precision of our NER tool, a historical KG was designed. We then implemented the index and search functions for transcriptions on Elasticsearch with Okapi BM25 to search images in real time. Finally, the semantic search engine was implemented and deployed.

Notes

  1. https://by2022.adaptcentre.ie/conf_demo

References

  1. Aghbari, Z., Brook, S.: HAH manuscripts: a holistic paradigm for classifying and retrieving historical Arabic handwritten documents. Expert Syst. Appl. 36(8), 10942–10951 (2009)

  2. Ahmed, R., Al-Khatib, W., Mahmoud, S.: A survey on handwritten documents word spotting. Int. J. Multimed. Inf. Retr. 6(1), 31–47 (2017). https://doi.org/10.1007/s13735-016-0110-y

  3. Cao, T., Ngo, V.: Semantic search by latent ontological features. Int. J. New Gener. Comput. 30(1), 53–71 (2012). https://doi.org/10.1007/s00354-012-0104-0

  4. Cheikhrouhou, A., Kessentini, Y., Kanoun, S.: Multi-task learning for simultaneous script identification and keyword spotting in document images. Pattern Recogn. 113, 107832 (2021)

  5. Colutto, S., Kahle, P., Guenter, H., Muehlberger, G.: Transkribus. A platform for automated text recognition and searching of historical documents. In: Proceedings of the 15th International Conference on eScience (eScience), pp. 463–466 (2019)

  6. Debruyne, C., et al.: Ireland’s authoritative geospatial linked data. In: d’Amato, C., et al. (eds.) ISWC 2017. LNCS, vol. 10588, pp. 66–74. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68204-4_6

  7. Doerr, M.: The CIDOC conceptual reference module: an ontological approach to semantic interoperability of metadata. AI Mag. 24(3), 75–92 (2003)

  8. Frinken, V., Palakodety, S.: Handwritten keyword spotting in historical documents. In: Handwritten Historical Document Analysis, Recognition, and Retrieval—State of the Art and Future Trends, Series in MP&AI, vol. 89, pp. 81–99. World Scientific Publishing (2021)

  9. Gheorghe, R., Hinman, M., Russo, R.: Elasticsearch in Action, 1st edn. Manning Publications Co., Shelter Island (2015)

  10. Hellmann, S., Lehmann, J., Auer, S., Brümmer, M.: Integrating NLP using linked data. In: Alani, H., et al. (eds.) ISWC 2013. LNCS, vol. 8219, pp. 98–113. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41338-4_7

  11. Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A.: SpaCy: industrial-strength natural language processing in Python (2020). https://doi.org/10.5281/zenodo.1212303

  12. Jiang, Y.: Semantically-enhanced information retrieval using multiple knowledge sources. Clust. Comput. 23(4), 2925–2944 (2020). https://doi.org/10.1007/s10586-020-03057-7

  13. Kahle, P., Colutto, S., Hackl, G., Mühlberger, G.: Transkribus - a service platform for transcription, recognition and retrieval of historical documents. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 04, pp. 19–24 (2017). https://doi.org/10.1109/ICDAR.2017.307

  14. Kang, L., Riba, P., Villegas, M., Fornés, A., Rusiñol, M.: Candidate fusion: integrating language modelling into a sequence-to-sequence handwritten word recognition architecture. Pattern Recogn. 112, 107790 (2021)

  15. Lang, E., Puigcerver, J., Toselli, A.H., Vidal, E.: Probabilistic indexing and search for information extraction on handwritten German parish records. In: Proceedings of 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 44–49 (2018)

  16. Leydier, Y., Lebourgeois, F., Emptoz, H.: Text search for medieval manuscript images. Pattern Recogn. 40(12), 3552–3567 (2007)

  17. Li, Z., Wu, Q., Xiao, Y., Jin, M., Lu, H.: Deep matching network for handwritten Chinese character recognition. Pattern Recogn. 107, 107471 (2020)

  18. Martínek, J., Lenc, L., Král, P.: Building an efficient OCR system for historical documents with little training data. Neural Comput. Appl. 32(23), 17209–17227 (2020). https://doi.org/10.1007/s00521-020-04910-x

  19. Ngo, V., Cao, T.: Discovering latent concepts and exploiting ontological features for semantic text search. In: Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP-2011), pp. 571–579. ACL (2011)

  20. Nozza, D., Manchanda, P., Fersini, E., Palmonari, M., Messina, E.: LearningToAdapt with word embeddings: domain adaptation of named entity recognition systems. Inf. Process. Manag. 58(3), 102537 (2021)

  21. Stauffer, M., Fischer, A., Riesen, K.: Filters for graph-based keyword spotting in historical handwritten documents. Pattern Recogn. Lett. 134, 125–134 (2020)

  22. Toledo, J., Carbonell, M., Fornés, A., Lladós, J.: Information extraction from historical handwritten document images with a context-aware neural model. Pattern Recogn. 86, 27–36 (2019)

  23. Vidal, E., et al.: The carabela project and manuscript collection: large-scale probabilistic indexing and content-based classification. In: The 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 85–90 (2020)

  24. Wang, J., et al.: A pseudo-relevance feedback framework combining relevance matching and semantic matching for information retrieval. Inf. Process. Manag. 57(6), 102342 (2020)

Acknowledgment

Beyond 2022 is funded by the Government of Ireland, through the Department of Culture, Heritage and the Gaeltacht, under the Project Ireland 2040 framework. The project is also partially supported by the ADAPT Centre for Digital Content Technology under the SFI Research Centres Programme (Grant 13/RC/2106_P2).

Author information

Corresponding author

Correspondence to Vuong M. Ngo.

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Copyright information

© 2021 The Author(s)

Cite this paper

Ngo, V.M., Munnelly, G., Orlandi, F., Crooks, P., O’Sullivan, D., Conlan, O. (2021). A Semantic Search Engine for Historical Handwritten Document Images. In: Berget, G., Hall, M.M., Brenn, D., Kumpulainen, S. (eds) Linking Theory and Practice of Digital Libraries. TPDL 2021. Lecture Notes in Computer Science, vol 12866. Springer, Cham. https://doi.org/10.1007/978-3-030-86324-1_7

  • DOI: https://doi.org/10.1007/978-3-030-86324-1_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86323-4

  • Online ISBN: 978-3-030-86324-1

  • eBook Packages: Computer Science, Computer Science (R0)