Adaptive Edit-Distance and Regression Approach for Post-OCR Text Correction

  • Conference paper
Maturity and Innovation in Digital Libraries (ICADL 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11279)

Abstract

Post-processing is a crucial step in improving the performance of the OCR process. In this paper, we present a novel approach that explores a modified way of generating and scoring candidates at the character level as well as the word level. The resulting features are combined with important features suggested by related work and used to rank candidates in a regression model. The experimental results show that our approach achieves results comparable to those of the top-performing approaches in the ICDAR 2017 competition on post-OCR text correction.
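The general idea of generating correction candidates with an edit distance and then ranking them can be illustrated with a minimal sketch. This is not the authors' method: the confusion costs, the distance threshold, the hand-set linear weights (standing in for a trained regression model), and all function names below are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical substitution costs for visually confusable character pairs.
# A real system would estimate such costs from aligned OCR/ground-truth text.
CONFUSION_COST: Dict[Tuple[str, str], float] = {
    ("l", "1"): 0.2, ("1", "l"): 0.2,
    ("O", "0"): 0.2, ("0", "O"): 0.2,
    ("c", "e"): 0.4, ("e", "c"): 0.4,
}


def weighted_edit_distance(source: str, target: str) -> float:
    """Levenshtein distance with cheaper substitutions for confusable pairs."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = float(i)  # delete every character of the source prefix
    for j in range(1, cols):
        dist[0][j] = float(j)  # insert every character of the target prefix
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = source[i - 1], target[j - 1]
            sub = 0.0 if a == b else CONFUSION_COST.get((a, b), 1.0)
            dist[i][j] = min(
                dist[i - 1][j] + 1.0,      # deletion
                dist[i][j - 1] + 1.0,      # insertion
                dist[i - 1][j - 1] + sub,  # match or substitution
            )
    return dist[-1][-1]


def generate_candidates(
    token: str, lexicon: Dict[str, int], max_distance: float = 1.5
) -> List[Tuple[str, float]]:
    """Return lexicon words whose weighted distance to the token is small."""
    scored = [(word, weighted_edit_distance(token, word)) for word in lexicon]
    return [(word, d) for word, d in scored if d <= max_distance]


def rank_candidates(
    token: str, lexicon: Dict[str, int], w_dist: float = -1.0, w_freq: float = 0.5
) -> List[Tuple[str, float]]:
    """Rank candidates by a linear score over two features.

    The weights are hand-set placeholders; a regression model trained on
    labelled corrections would normally supply them.
    """
    total = sum(lexicon.values())
    ranked = [
        (word, w_dist * d + w_freq * (lexicon[word] / total))
        for word, d in generate_candidates(token, lexicon)
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

For example, `rank_candidates("he1lo", {"hello": 50, "help": 30, "hero": 20})` keeps only `hello` as a candidate, because the other two words fall outside the assumed distance threshold.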

This work has been supported by the European Union’s Horizon 2020 research and innovation programme under grant 770299 (NewsEye).

Notes

  1. https://sites.google.com/view/icdar2017-postcorrectionocr/, last visited on 28 June 2018.

Author information

Corresponding author

Correspondence to Thi-Tuyet-Hai Nguyen.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Nguyen, TTH., Coustaty, M., Doucet, A., Jatowt, A., Nguyen, NV. (2018). Adaptive Edit-Distance and Regression Approach for Post-OCR Text Correction. In: Dobreva, M., Hinze, A., Žumer, M. (eds) Maturity and Innovation in Digital Libraries. ICADL 2018. Lecture Notes in Computer Science, vol 11279. Springer, Cham. https://doi.org/10.1007/978-3-030-04257-8_29

  • DOI: https://doi.org/10.1007/978-3-030-04257-8_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-04256-1

  • Online ISBN: 978-3-030-04257-8

  • eBook Packages: Computer Science (R0)
