Adaptive Edit-Distance and Regression Approach for Post-OCR Text Correction

Nguyen, Thi-Tuyet-Hai; Coustaty, Mickael; Doucet, Antoine; Jatowt, Adam; Nguyen, Nhu-Van

doi:10.1007/978-3-030-04257-8_29

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11279)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

Post-processing is a crucial step in improving the performance of the OCR process. In this paper, we present a novel approach that explores a modified way of generating and scoring candidates at the character level as well as the word level. These features are combined with important features suggested by related work to rank candidates in a regression model. The experimental results show that our approach achieves results comparable to those of the top-performing approaches in the ICDAR 2017 competition on post-OCR text correction.
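
The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea it names: generate correction candidates by edit distance, compute a few features per candidate, and rank candidates by a weighted score. It is not the authors' method. The toy lexicon, the two features, the function names, and the hand-set weights are all assumptions made for illustration; in a regression-based setup such as the one the abstract describes, the weights would be learned from training data rather than fixed by hand.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def generate_candidates(token: str, lexicon: Dict[str, int],
                        max_dist: int = 2) -> List[str]:
    """Keep lexicon words within max_dist edits of the OCR token."""
    return [w for w in lexicon if levenshtein(token, w) <= max_dist]


def features(token: str, cand: str,
             lexicon: Dict[str, int]) -> Tuple[float, float]:
    """Two illustrative features: string similarity and relative frequency."""
    dist = levenshtein(token, cand)
    similarity = 1.0 - dist / max(len(token), len(cand))
    frequency = lexicon[cand] / sum(lexicon.values())
    return similarity, frequency


def rank(token: str, lexicon: Dict[str, int],
         weights: Tuple[float, float] = (0.7, 0.3),  # hypothetical, not learned
         max_dist: int = 2) -> List[Tuple[str, float]]:
    """Score each candidate as a weighted sum of its features, best first."""
    scored = []
    for cand in generate_candidates(token, lexicon, max_dist):
        score = sum(w * f for w, f in zip(weights, features(token, cand, lexicon)))
        scored.append((cand, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical word-frequency lexicon, for illustration only.
LEXICON = {"the": 500, "then": 60, "they": 80, "tree": 20}

if __name__ == "__main__":
    for cand, score in rank("tbe", LEXICON):
        print(f"{cand}\t{score:.3f}")
```

With this toy lexicon, the OCR token "tbe" ranks "the" first, because it is both the closest candidate in edit distance and the most frequent word.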

This work has been supported by the European Union’s Horizon 2020 research and innovation programme under grant 770299 (NewsEye).

Notes

1. https://sites.google.com/view/icdar2017-postcorrectionocr/, last visited on 28 June 2018.

Author information

Authors and Affiliations

L3i, University of La Rochelle, La Rochelle, France
Thi-Tuyet-Hai Nguyen, Mickael Coustaty, Antoine Doucet & Nhu-Van Nguyen
Department of Social Informatics, Kyoto University, Kyoto, Japan
Adam Jatowt

Corresponding author

Correspondence to Thi-Tuyet-Hai Nguyen.

Editor information

Editors and Affiliations

University College London Qatar, Doha, Qatar
Milena Dobreva
University of Waikato, Hamilton, New Zealand
Annika Hinze
University of Ljubljana, Ljubljana, Slovenia
Maja Žumer

About this paper

Cite this paper

Nguyen, TTH., Coustaty, M., Doucet, A., Jatowt, A., Nguyen, NV. (2018). Adaptive Edit-Distance and Regression Approach for Post-OCR Text Correction. In: Dobreva, M., Hinze, A., Žumer, M. (eds) Maturity and Innovation in Digital Libraries. ICADL 2018. Lecture Notes in Computer Science, vol 11279. Springer, Cham. https://doi.org/10.1007/978-3-030-04257-8_29

DOI: https://doi.org/10.1007/978-3-030-04257-8_29
Published: 15 November 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-04256-1
Online ISBN: 978-3-030-04257-8
eBook Packages: Computer Science, Computer Science (R0)
