Assessing the impact of OCR noise on multilingual event detection over digitised documents

Boros, Emanuela; Nguyen, Nhu Khoa; Lejeune, Gaël; Doucet, Antoine

doi:10.1007/s00799-022-00325-2

Published: 04 April 2022

Volume 23, pages 241–266, (2022)

International Journal on Digital Libraries

Emanuela Boros ORCID: orcid.org/0000-0001-6299-9452¹,
Nhu Khoa Nguyen¹,
Gaël Lejeune² &
…
Antoine Doucet¹

Abstract

Event detection is a crucial task in natural language processing: it involves identifying instances of specified types of events in text and classifying them into event types. Detecting events in digitised documents could enable historians to gather and combine a large amount of information into an integrated whole, a panoramic interpretation of the past. However, the degradation of digitised documents and the quality of optical character recognition (OCR) tools might hinder the performance of an event detection system. Although several studies have detected events in historical documents, the transcribed documents had to be validated by hand, which demanded considerable human expertise and labour-intensive manual work. In this study, we therefore explore the robustness of two language-independent event detection models to OCR noise, over two datasets that cover different event types and multiple languages. We analyse their ability to mitigate problems caused by the low quality of digitised documents, and we simulate the existence of transcribed data, synthesised from clean annotated text, by injecting synthetic noise. To create the noisy synthetic data, we use four main types of noise that commonly occur after the digitisation process: Character Degradation, Bleed Through, Blur, and Phantom Character. Finally, we conclude that the imbalance of the datasets, the richness of the different annotation styles, and the language characteristics are the most important factors that can influence event detection in digitised documents.
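
To make the idea of injecting synthetic noise into clean annotated text concrete, the following is a minimal sketch and not the procedure used in the study. The notes point to ImageMagick and Tesseract, which suggests that the study degrades rendered images and re-runs OCR; the sketch instead perturbs characters directly in the text as a simplified stand-in. The confusion table, the function name `inject_noise`, and the `rate` and `seed` parameters are all hypothetical.

```python
import random

# Hypothetical confusion table: visually similar characters that an OCR
# engine might mix up. It is illustrative only, not taken from the study.
CONFUSIONS = {"l": "1", "o": "0", "e": "c", "m": "rn", "i": "l", "s": "5"}


def inject_noise(text: str, rate: float, seed: int = 0) -> str:
    """Replace each confusable character with probability `rate`."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


clean = "The rebels attacked the village at dawn."
noisy = inject_noise(clean, rate=0.3)
print(noisy)
```

Raising `rate` gives progressively noisier copies of the same annotated sentence, which is the kind of controlled comparison the abstract describes, although character substitution alone cannot reproduce image-level effects such as Bleed Through or Blur.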

Notes

https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-events-guidelines-v5.4.3.pdf.
https://catalog.ldc.upenn.edu/LDC2006T06.
However, even though the reported results were better than those of the model that we experiment with in this study, the CNN-based model in [3], which utilises a wide range of convolutional windows, requires a considerable amount of memory and therefore could not be put into practice.
https://framenet.icsi.berkeley.edu/fndrupal/.
The authors utilised the Adobe Acrobat Pro DC OCR software, version 2015. However, the system has long been outdated.
The authors made available the trained models on GitHub: https://github.com/dhfbk/Histo.
https://catalog.ldc.upenn.edu/LDC2006T06.
We note here that this sub-task is not treated in this study. Because event detection is already challenging, we base our experiments only on ED.
This process is done by using the interlingual links coming from English infectious diseases Wikipedia pages.
If no location matches the previous rules, the system assumes that the event takes place in the country of the “source” metadata (Implicit Location Rule [33]).
https://imagemagick.org.
https://docs.microsoft.com/en-us/typography/font-list/arial-unicode-ms.
https://github.com/tesseract-ocr/tesseract.
We assume that using different major versions of Tesseract (e.g. from 3.x to 4.x) may affect our results, since the OCR engine has changed considerably according to the changelog. However, since we chose the latest version available, repeating the experiments with other Tesseract versions, each with a different OCR engine, would be too tedious and time-consuming.
We noticed that the batch size affects the Adam optimiser [60], hence our choice of 256, which performed best on the validation set.
This is also observed in Fig. 7.
We consider this a limitation of our set-up and detail it in Sect. 8.3.
While Bidirectional Encoder Representations from Transformers (BERT) had a major impact on the NLP community, its ability to handle noisy inputs is still an open question [62] or at least requires the addition of complementary methods [44, 52].
https://github.com/tesseract-ocr/tesseract.
The ACE 2005 dataset is available only under a paid licence, so we cannot redistribute it.
https://zenodo.org/record/3709617.
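
The Implicit Location Rule mentioned in the notes above can be sketched as a simple fallback. This is a hedged illustration only: the "previous rules" are not described in this section, so they are represented here by a list of locations that earlier rules are assumed to have already matched, and taking the first match is likewise an assumption. The function and parameter names are hypothetical.

```python
from typing import Optional, Sequence


def resolve_event_location(
    matched_locations: Sequence[str],
    source_country: Optional[str],
) -> Optional[str]:
    """Pick a location for an event, falling back to the source country."""
    # Assumption: locations matched by earlier rules take priority,
    # and the first match is used.
    if matched_locations:
        return matched_locations[0]
    # Implicit Location Rule: no location matched, so assume the event
    # takes place in the country of the "source" metadata.
    return source_country


print(resolve_event_location([], "France"))
print(resolve_event_location(["Kenya"], "France"))
```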

References

Bedi, H., Patil, S., Hingmire, S., Palshikar, G.: Event timeline generation from history textbooks. In: Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017), pp. 69–77 (2017)
Boros, E.: Neural methods for event extraction. Ph.D. thesis, Université Paris Sud (2018)
Boros, E., Besançon, R., Ferret, O., Grau, B.: The importance of character-level information in an event detection model. In: International Conference on Applications of Natural Language to Information Systems, pp. 119–131. Springer (2021)
Boroş, E., Besançon, R., Ferret, O., Grau, B.: Intérêt des modèles de caractères pour la détection d’événements (the interest of character-level models for event detection). In: Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1: conférence principale, pp. 179–188 (2021)
Boros, E., Hamdi, A., Linhares Pontes, E., Cabrera-Diego, L.A., Moreno, J.G., Sidere, N., Doucet, A.: Alleviating digitization errors in named entity recognition for historical documents. In: Proceedings of the 24th Conference on Computational Natural Language Learning, pp. 431–441. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.conll-1.35. https://www.aclweb.org/anthology/2020.conll-1.35
Boros, E., Linhares Pontes, E., Cabrera-Diego, L.A., Hamdi, A., Moreno, J.G., Sidère, N., Doucet, A.: Robust Named Entity Recognition and Linking on Historical Multilingual Documents. In: Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.) CLEF 2020 Working Notes. Working Notes of CLEF 2020—Conference and Labs of the Evaluation Forum. CEUR-WS (2020)
Boros, E., Moreno, J., Doucet, A.: Event detection with entity markers. In: European Conference on Information Retrieval, pp. 233–240. Springer (2021)
Boroş, E., Romero, V., Maarand, M., Zenklová, K., Křečková, J., Vidal, E., Stutzmann, D., Kermorvant, C.: A comparison of sequential and combined approaches for named entity recognition in a corpus of handwritten medieval charters. In: 2020 17th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 79–84. IEEE (2020)
Boschee, E., Natarajan, P., Weischedel, R.: Automatic extraction of events from open source text for predictive forecasting. In: Handbook of Computational Approaches to Counterterrorism, pp. 51–67. Springer (2013)
Boschetti, F., Cimino, A., Dell’Orletta, F., Lebani, G., Passaro, L., Picchi, P., Venturi, G., Montemagni, S., Lenci, A.: Computational analysis of historical documents: an application to Italian war bulletins in world war I and II. In: Workshop on Language Resources and Technologies for Processing and Linking Historical Documents and Archives (LRT4HDA 2014), pp. 70–75. ELRA (2014)
Bronstein, O., Dagan, I., Li, Q., Ji, H., Frank, A.: Seed-based event trigger labeling: how far can event descriptions get us? In: ACL, vol. 2, pp. 372–376 (2015)
Chen, C., Ng, V.I.: Joint modeling for Chinese event extraction with rich linguistic features. In: COLING. Citeseer (2012)
Chen, Y., Xu, L., Liu, K., Zeng, D., Zhao, J.: Event extraction via dynamic multi-pooling convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, vol. 1, pp. 167–176 (2015)
Collier, N.: Towards cross-lingual alerting for bursty epidemic events. J. Biomed. Semant. 2(5), S10 (2011)
Collier, N., Doan, S., Kawazoe, A., Goodwin, R.M., Conway, M., Tateno, Y., Ngo, Q.H., Dien, D., Kawtrakul, A., Takeuchi, K., et al.: Biocaster: detecting public health rumors with a web-based text mining system. Bioinformatics 24(24), 2940–2941 (2008)
Cybulska, A., Vossen, P.: Event models for historical perspectives: determining relations between high and low level events in text, based on the classification of time, location and participants. In: LREC (2010)
Cybulska, A., Vossen, P.: Historical event extraction from text. In: Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, pp. 39–43 (2011)
Doddington, G., Mitchell, A., Przybocki, M., Ramshaw, L., Strassel, S., Weischedel, R.: The automatic content extraction (ace) program-tasks, data, and evaluation. In: Proceedings of LREC, vol. 4, pp. 837–840. Citeseer (2004)
Du, X., Cardie, C.: Event extraction by answering (almost) natural questions. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 671–683. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.emnlp-main.49. https://aclanthology.org/2020.emnlp-main.49
Duan, S., He, R., Zhao, W.: Exploiting document level information to improve event detection via recurrent neural networks. In: Eighth International Joint Conference on Natural Language Processing (IJCNLP 2017), pp. 352–361. Asian Federation of Natural Language Processing (2017)
Feng, X., Huang, L., Tang, D., Ji, H., Qin, B., Liu, T.: A language-independent neural network for event detection. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), vol. 2, pp. 66–71 (2016)
Filatova, E., Hatzivassiloglou, V.: Event-based extractive summarization (2004)
Giguet, E., Lucas, N.: La détection automatique des citations et des locuteurs dans les textes informatifs. Le discours rapporté dans tous ses états: Question de frontières, pp. 410–418 (2004)
Grishman, R., Sundheim, B.: Message understanding conference-6: a brief history. In: COLING 1996, pp. 466–471 (1996)
Hamborg, F., Lachnit, S., Schubotz, M., Hepp, T., Gipp, B.: Giveme5w: main event retrieval from news articles by extraction of the five journalistic w questions. In: Transforming Digital Worlds, pp. 356–366. Springer International Publishing, Cham (2018). https://doi.org/10.1007/978-3-319-78105-1_39
Hamdi, A., Jean-Caurant, A., Sidere, N., Coustaty, M., Doucet, A.: An analysis of the performance of named entity recognition over ocred documents. In: 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp. 333–334. IEEE, Illinois, USA (2019)
Hong, Y., Zhang, J., Ma, B., Yao, J., Zhou, G., Zhu, Q.: Using cross-entity inference to improve event extraction. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 1127–1136. Association for Computational Linguistics (2011)
Huang, R., Riloff, E.: Peeling back the layers: detecting event role fillers in secondary contexts. In: ACL 2011, pp. 1137–1147 (2011)
Huff, A.G., Breit, N., Allen, T., Whiting, K., Kiley, C.: Evaluation and verification of the global rapid identification of threats system for infectious diseases in textual data sources. In: Interdisciplinary Perspectives on Infectious Diseases (2016)
Ide, N., Woolner, D.: Exploiting semantic web technologies for intelligent access to historical documents. In: LREC. Citeseer (2004)
Journet, N., Visani, M., Mansencal, B., Van-Cuong, K., Billy, A.: Doccreator: a new software for creating synthetic ground-truthed document images. J. Imaging 3(4), 62 (2017)
Lai, V., Nguyen, M.V., Kaufman, H., Nguyen, T.H.: Event extraction from historical texts: A new dataset for black rebellions. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 2390–2400. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.findings-acl.211. https://aclanthology.org/2021.findings-acl.211
Lejeune, G., Brixtel, R., Doucet, A., Lucas, N.: Multilingual event extraction for epidemic detection. Artif. Intell. Med. (2015). https://doi.org/10.1016/j.artmed.2015.06.005
Lejeune, G., Zhu, L.: A new proposal for evaluating web page cleaning tools. Computacion y Sistemas 22(4), 1249–1258 (2018)
Li, Q., Ji, H., Huang, L.: Joint event extraction via structured prediction with global features. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 73–82. Association for Computational Linguistics, Sofia, Bulgaria (2013). https://www.aclweb.org/anthology/P13-1008
Linhares Pontes, E., Cabrera-Diego, L.A., Moreno, J.G., Boros, E., Hamdi, A., Sidère, N., Coustaty, M., Doucet, A.: Entity linking for historical documents: challenges and solutions. In: Ishita, E., Pang, N.L.S., Zhou, L. (eds.) Digital Libraries at Times of Massive Societal Transition, pp. 215–231. Springer, Cham (2020)
Linhares Pontes, E., Hamdi, A., Sidere, N., Doucet, A.: Impact of OCR quality on named entity linking. In: Digital Libraries at the Crossroads of Digital Information for the Future—21st International Conference on Asia-Pacific Digital Libraries, ICADL 2019, Kuala Lumpur, Malaysia, November 4–7, 2019, Proceedings, pp. 102–115 (2019). https://doi.org/10.1007/978-3-030-34058-2_11
Liu, J., Chen, Y., Liu, K., Bi, W., Liu, X.: Event extraction as machine reading comprehension. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1641–1651 (2020)
Liu, M., Li, W., Wu, M., Lu, Q.: Extractive summarization based on event term clustering. In: Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pp. 185–188 (2007)
Lucas, N.: The enunciative structure of news dispatches, a contrastive rhetorical approach. In: Language, Culture, Rhetoric, pp. 154–164 (2004)
Lucas, N.: Modélisation différentielle du texte, de la linguistique aux algorithmes. Ph.D. thesis, Université de Caen (2009)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Bengio, Y., LeCun, Y. (eds.) 1st International Conference on Learning Representations, ICLR 2013, p. IEEE, Scottsdale, Arizona, USA (2013)
Miller, D., Boisen, S., Schwartz, R., Stone, R., Weischedel, R.: Named entity extraction from noisy input: speech and OCR. In: Proceedings of the Sixth Conference on Applied Natural Language Processing, pp. 316–324. Association for Computational Linguistics (2000)
Muller, B., Sagot, B., Seddah, D.: Enhancing bert for lexical normalization. In: Proceedings of the 5th Workshop on Noisy User-Generated Text (W-NUT 2019), pp. 297–306 (2019)
Mutuvi, S., Boros, E., Doucet, A., Lejeune, G., Jatowt, A., Odeo, M.: Multilingual epidemiological text classification: a comparative study. In: COLING, International Conference on Computational Linguistics (2020)
Mutuvi, S., Doucet, A., Odeo, M., Jatowt, A.: Evaluating the impact of OCR errors on topic modeling. In: International Conference on Asian Digital Libraries, pp. 3–14. Springer, Berlin (2018)
Nguyen, T.H., Cho, K., Grishman, R.: Joint event extraction via recurrent neural networks. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 300–309. Association for Computational Linguistics, San Diego, California (2016). https://doi.org/10.18653/v1/N16-1034. https://www.aclweb.org/anthology/N16-1034
Nguyen, T.H., Grishman, R.: Event detection and domain adaptation with convolutional neural networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 365–371. Association for Computational Linguistics, Beijing, China (2015). https://doi.org/10.3115/v1/P15-2060. https://www.aclweb.org/anthology/P15-2060
Nguyen, T.H., Grishman, R.: Modeling skip-grams for event detection with convolutional neural networks. In: Proceedings of EMNLP (2016)
Nguyen, T.T.H., Jatowt, A., Coustaty, M., Nguyen, N.V., Doucet, A.: Deep statistical analysis of ocr errors for effective post-ocr processing. In: Proceedings of the 18th Joint Conference on Digital Libraries, pp. 29–38 (2019)
Oberbichler, S., Boroş, E., Doucet, A., Marjanen, J., Pfanzelter, E., Rautiainen, J., Toivonen, H., Tolonen, M.: Integrated interdisciplinary workflows for research on historical newspapers: perspectives from humanities scholars, computer scientists, and librarians. J. Assoc. Inform. Sci. Technol. (2021)
Pruthi, D., Dhingra, B., Lipton, Z.C.: Combating adversarial misspellings with robust word recognition. In: 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), pp. 5582–5591. Florence, Italy (2019)
Riloff, E.: Automatically generating extraction patterns from untagged text. In: AAAI’96, pp. 1044–1049 (1996)
Riloff, E.: An empirical study of automated dictionary construction for information extraction in three domains. Artif. Intell. 85(1), 101–134 (1996)
Rodriquez, K.J., Bryant, M., Blanke, T., Luszczynska, M.: Comparison of named entity recognition tools for raw OCR text. In: Jancsary, J., (ed.) 11th Conference on Natural Language Processing, KONVENS 2012, Empirical Methods in Natural Language Processing, September 19–21, 2012, Scientific Series of the ÖGAI, vol. 5, pp. 410–414. ÖGAI, Wien, Österreich, Vienna, Austria (2012). http://www.oegai.at/konvens2012/proceedings/60_rodriquez12w/
Rovera, M., Nanni, F., Ponzetto, S.P.: Providing advanced access to historical war memoirs through the identification of events, participants and roles (2019)
Saurí, R., Knippen, R., Verhagen, M., Pustejovsky, J.: Evita: A robust event recognizer for QA systems. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 700–707. Association for Computational Linguistics, Vancouver, British Columbia, Canada (2005). https://aclanthology.org/H05-1088
Shaw, R.B.: Events and Periods as Concepts for Organizing Historical Knowledge. University of California, Berkeley (2010)
Smith, R.: An overview of the tesseract ocr engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, pp. 629–633. IEEE, IEEE Computer Society, USA (2007)
Smith, S.L., Kindermans, P., Le, Q.V.: Don’t decay the learning rate, increase the batch size. CoRR abs/1711.00489 (2017). http://arxiv.org/abs/1711.00489
Sprugnoli, R.: Event detection and classification for the digital humanities. Ph.D. thesis, University of Trento (2018)
Sun, L., Hashimoto, K., Yin, W., Asai, A., Li, J., Yu, P., Xiong, C.: Adv-bert: bert is not robust on misspellings! generating nature adversarial samples on bert. arXiv preprint arXiv:2003.04985 (2020)
Ukkonen, E.: Maximal and minimal representations of gapped and non-gapped motifs of a string. Theor. Comput. Sci. 410, 4341–4349 (2009). https://doi.org/10.1016/j.tcs.2009.07.015
van Strien, D., Beelen, K., Ardanuy, M.C., Hosseini, K., McGillivray, B., Colavizza, G.: Assessing the impact of ocr quality on downstream nlp tasks. In: ICAART 2020—Proceedings of the 12th International Conference on Agents and Artificial Intelligence, vol. 1, pp. 484–496 (2020)
Walker, C., Stephanie, S., Julie, M., Kazuaki, M.: Ace 2005 multilingual training corpus. Linguistic Data Consortium, Technical report (2005)
Wang, P., Sun, R., Zhao, H., Yu, K.: A new word language model evaluation metric for character based languages. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds.) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, pp. 315–324. Springer, Berlin (2013)
Yangarber, R., Grishman, R., Tapanainen, P., Huttunen, S.: Automatic acquisition of domain knowledge for information extraction. In: 18th International Conference on Computational Linguistics (COLING 2000), pp. 940–946 (2000)

Acknowledgements

This work has been supported by the European Union’s Horizon 2020 research and innovation programme under grants 770299 (NewsEye) and 825153 (EMBEDDIA), and by the ANNA and Termitrad projects funded by the Nouvelle-Aquitaine Region.

Author information

Authors and Affiliations

University of La Rochelle, 17000, La Rochelle, France
Emanuela Boros, Nhu Khoa Nguyen & Antoine Doucet
Sorbonne University, 75006, Paris, France
Gaël Lejeune

Contributions

Emanuela Boros: Conceptualization, Methodology, Software, Formal analysis, Investigation, Data Curation, Writing. Nhu Khoa Nguyen: Conceptualization, Methodology, Software, Formal analysis, Investigation, Writing. Gaël Lejeune: Conceptualization, Methodology, Supervision, Writing - review & editing. Antoine Doucet: Funding acquisition, Conceptualization, Methodology, Project administration, Validation, Supervision, Writing - review & editing.

Corresponding author

Correspondence to Emanuela Boros.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Availability of code, data, and material

Our code is freely available at https://github.com/NewsEye/event-detection.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Boros, E., Nguyen, N.K., Lejeune, G. et al. Assessing the impact of OCR noise on multilingual event detection over digitised documents. Int J Digit Libr 23, 241–266 (2022). https://doi.org/10.1007/s00799-022-00325-2

Received: 27 April 2021
Revised: 06 March 2022
Accepted: 08 March 2022
Published: 04 April 2022
Issue Date: September 2022
DOI: https://doi.org/10.1007/s00799-022-00325-2
