Abstract
Search in collections of digitised historical documents is hindered by a two-pronged problem: orthographic variation and optical character recognition (OCR) errors. We present DuoSearch, a new search engine for historical documents that uses ElasticSearch and machine learning methods based on deep neural networks to address this problem. It was tested on a collection of historical newspapers in Bulgarian from the mid-19th to the mid-20th century. The system provides an interactive and intuitive interface that lets end users enter search terms in modern Bulgarian and search across historical spellings. This is the first solution facilitating the use of digitised historical documents in Bulgarian.
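To illustrate the idea of searching historical spellings from a modern query, the following is a minimal sketch and not the DuoSearch implementation. It assumes a single toy rule (pre-1945 Bulgarian spelling wrote a silent final "ъ" after many word-final consonants) and a hypothetical index field named `text`. The function names `expand_term` and `build_query` are illustrative. The sketch only builds an Elasticsearch-style `bool` query body as a plain dictionary and does not contact a server.

```python
import json

# Consonants after which this toy rule appends a silent final "ъ".
# Assumption: real historical spelling also used "ь", "ѣ" and "ѫ",
# and OCR errors add further variants that this sketch ignores.
CONSONANTS = set("бвгджзклмнпрстфхцчшщ")


def expand_term(term):
    """Return the modern term plus a hypothetical historical variant."""
    term = term.lower()
    variants = [term]
    if term and term[-1] in CONSONANTS:
        variants.append(term + "ъ")
    return variants


def build_query(term, field="text"):
    """Build a bool/should query body matching any spelling variant.

    `field` is a hypothetical name for the indexed OCR text field.
    """
    should = [{"match": {field: variant}} for variant in expand_term(term)]
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


if __name__ == "__main__":
    # Print the query body for the modern word "град" (city).
    print(json.dumps(build_query("град"), ensure_ascii=False, indent=2))
```

A production system would replace the single rule with a fuller mapping of spelling changes or a learned model, as the abstract indicates for DuoSearch. The query shape, a disjunction over variants, would stay the same.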
Acknowledgements
This research is partially supported by Project UNITe BG05M2OP001-1.001-0004 funded by the OP “Science and Education for Smart Growth” and co-funded by the EU through the ESI Funds. The contribution of M. Dobreva is supported by the project KP-06-DB/6 DISTILL funded by the NSF of Bulgaria.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Beshirov, A., Hadzhieva, S., Koychev, I., Dobreva, M. (2022). DuoSearch: A Novel Search Engine for Bulgarian Historical Documents. In: Hagen, M., et al. Advances in Information Retrieval. ECIR 2022. Lecture Notes in Computer Science, vol 13186. Springer, Cham. https://doi.org/10.1007/978-3-030-99739-7_31
DOI: https://doi.org/10.1007/978-3-030-99739-7_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-99738-0
Online ISBN: 978-3-030-99739-7
eBook Packages: Computer Science, Computer Science (R0)