When to Use OCR Post-correction for Named Entity Recognition?

Huynh, Vinh-Nam; Hamdi, Ahmed; Doucet, Antoine

doi:10.1007/978-3-030-64452-9_3

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12504)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

Over the last decades, a huge number of documents have been digitised and then processed with optical character recognition (OCR) to extract their textual content. This step is crucial for indexing the documents and making the resulting collections accessible. However, indexing documents through their OCRed content raises a number of problems, because the performance of OCR methods has varied over time. Indeed, OCR quality has a considerable impact on the indexing, and therefore the accessibility, of digital documents. Named entities are among the most useful information for indexing documents, particularly in digital libraries, where log analysis studies have shown that around 80% of user queries include a named entity. With the computational power of modern natural language processing (NLP) systems, named entity recognition (NER) can be run efficiently over enormous OCR corpora. Despite progress in OCR, the resulting text files still contain misrecognised words (noise, for short), which harm NER performance. To address this challenge, in this paper we apply a spelling correction method to noisy versions of a corpus with varying OCR error rates, in order to quantitatively estimate the contribution of post-OCR correction to NER. Our main finding is that NER performance can indeed be improved consistently when the OCR quality is reasonable, that is, when the character error rate (CER) is between 2% and 10% and the word error rate (WER) is between 10% and 25%. The noise correction algorithm we propose is language-independent and has low complexity.
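To make the two error rates in the abstract concrete, the following is a minimal sketch of how CER and WER are commonly computed: the Levenshtein edit distance between OCR output and ground truth, divided by the ground-truth length in characters or in words. It is not the authors' evaluation code. The function names (`levenshtein`, `cer`, `wer`, `in_reported_range`) are hypothetical, whitespace tokenisation is an assumption, and requiring both ranges to hold at once is one possible reading of the abstract.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate over whitespace tokens (tokenisation is an assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


def in_reported_range(cer_value, wer_value):
    """True if both rates fall in the ranges quoted in the abstract.

    Assumption: both conditions are required together; the abstract does
    not say how the two ranges are combined.
    """
    return 0.02 <= cer_value <= 0.10 and 0.10 <= wer_value <= 0.25


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr = "the qu1ck brown fox"  # one misrecognised character
    print(f"CER={cer(truth, ocr):.3f} WER={wer(truth, ocr):.3f}")
    print("within reported range:", in_reported_range(cer(truth, ocr), wer(truth, ocr)))
```

A single wrong character counts as one full word error, so WER on short texts rises much faster than CER. This is consistent with the abstract quoting a higher range for WER than for CER.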

Notes

1. Gallica is the digital portal of the National Library of France.
2. https://zenodo.org/record/3877554
3. https://github.com/kermitt2/delft
4. https://github.com/mammothb/symspellpy

Acknowledgements

This work has been supported by the European Union Horizon 2020 research and innovation programme under grant 770299 (NewsEye).

Author information

Authors and Affiliations

ICT Lab, University of Science and Technology of Hanoi, Hanoi, Vietnam
Vinh-Nam Huynh
University of La Rochelle, La Rochelle, France
Ahmed Hamdi & Antoine Doucet

Corresponding author

Correspondence to Ahmed Hamdi.

Editor information

Editors and Affiliations

Kyushu University, Fukuoka, Japan
Emi Ishita
National University of Singapore, Singapore, Singapore
Natalie Lee San Pang
Wuhan University, Wuhan, China
Lihong Zhou

Cite this paper

Huynh, VN., Hamdi, A., Doucet, A. (2020). When to Use OCR Post-correction for Named Entity Recognition? In: Ishita, E., Pang, N.L.S., Zhou, L. (eds) Digital Libraries at Times of Massive Societal Transition. ICADL 2020. Lecture Notes in Computer Science, vol 12504. Springer, Cham. https://doi.org/10.1007/978-3-030-64452-9_3

DOI: https://doi.org/10.1007/978-3-030-64452-9_3
Published: 26 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-64451-2
Online ISBN: 978-3-030-64452-9
eBook Packages: Computer Science, Computer Science (R0)
