The System for Efficient Indexing and Search in the Large Archives of Scanned Historical Documents

  • Conference paper
Advances in Information Retrieval (ECIR 2023)

Abstract

The paper introduces software capable of indexing and searching large archives of scanned historical documents. The system's capabilities are demonstrated on a collection of documents from the archives of the post-Soviet security services. The backend of the system was designed with a focus on flexibility (it is already being used for other related tasks) and on scalability to larger volumes of data. The graphical user interface was designed in consultation with historians interested in using the archived documents and was developed over several iterations, gradually incorporating changes prompted both by the users' requests and by our improving knowledge of the nature of the processed data.

The work described herein has been supported by the Ministry of Education, Youth and Sports of the Czech Republic project LINDAT/CLARIAH-CZ and by the grant of the University of West Bohemia, project No. SGS-2022-017.

Notes

  1. The core worker in the presented demo is an OCR worker that internally uses the Tesseract OCR engine [6]. The OCR engine is wrapped with additional functionality that improves its performance in the domain of scanned documents; see the details in [2, 3].
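The wrapping idea in the note above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `OcrWorker`, the step lists, and the stand-in `fake_engine` are all hypothetical, and the actual pre- and post-processing used by the system is described only in [2, 3].

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type aliases; a "page" is whatever the chosen engine accepts.
Engine = Callable[[object], str]
PreStep = Callable[[object], object]
PostStep = Callable[[str], str]


@dataclass
class OcrWorker:
    """Hypothetical worker: wraps an OCR engine with optional extra steps."""
    engine: Engine
    pre_steps: List[PreStep] = field(default_factory=list)
    post_steps: List[PostStep] = field(default_factory=list)

    def process(self, page: object) -> str:
        # Apply each pre-processing step to the page, in order.
        for step in self.pre_steps:
            page = step(page)
        # Run the wrapped engine on the prepared page.
        text = self.engine(page)
        # Apply each post-processing step to the recognized text, in order.
        for step in self.post_steps:
            text = step(text)
        return text


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def fake_engine(page: object) -> str:
    """Stand-in for a real OCR call (assumption, for illustration only)."""
    return f"  recognized:\n {page} "


worker = OcrWorker(engine=fake_engine, post_steps=[normalize_whitespace])
result = worker.process("page-001")
print(result)
```

Passing the engine in as a callable keeps the wrapper independent of any particular OCR library. Assuming the `pytesseract` package and a Tesseract installation are available, `pytesseract.image_to_string` could be supplied as `engine` in place of `fake_engine`.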

References

  1. Chýlek, A., Šmídl, L., Švec, J.: Multimodal Dialog with the MALACH Audiovisual Archive. In: Proceedings of Interspeech 2019, pp. 3663–3664 (2019)

  2. Gruber, I.: OCR improvements for images of multi-page historical documents. In: Karpov, A., Potapova, R. (eds.) SPECOM 2021. LNCS (LNAI), vol. 12997, pp. 226–237. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-87802-3_21

  3. Gruber, I., et al.: An automated pipeline for robust image processing and optical character recognition of historical documents. In: Karpov, A., Potapova, R. (eds.) SPECOM 2020. LNCS (LNAI), vol. 12335, pp. 166–175. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-60276-5_17

  4. Institute for Study of the Totalitarian Regimes (2022). https://www.ustrcr.cz/en/

  5. Psutka, J., et al.: System for fast lexical and phonetic spoken term detection in a Czech cultural heritage archive. EURASIP J. Audio Speech Music Process. 2011(1), 10 (2011). https://doi.org/10.1186/1687-4722-2011-10

  6. Smith, R.: An overview of the Tesseract OCR engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, pp. 629–633 (2007). https://doi.org/10.1109/ICDAR.2007.4376991

  7. Stanislav, P., Švec, J., Ircing, P.: An engine for online video search in large archives of the holocaust testimonies. In: Proceedings of Interspeech 2016, pp. 2352–2353 (2016)

  8. Zajíc, Z., et al.: Towards processing of the oral history interviews and related printed documents. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Miyazaki (2018). https://aclanthology.org/L18-1331

Author information

Corresponding author

Correspondence to Martin Bulín.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bulín, M., Švec, J., Ircing, P. (2023). The System for Efficient Indexing and Search in the Large Archives of Scanned Historical Documents. In: Kamps, J., et al. Advances in Information Retrieval. ECIR 2023. Lecture Notes in Computer Science, vol 13982. Springer, Cham. https://doi.org/10.1007/978-3-031-28241-6_15

  • DOI: https://doi.org/10.1007/978-3-031-28241-6_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-28240-9

  • Online ISBN: 978-3-031-28241-6

  • eBook Packages: Computer Science (R0)