Post-correction of OCR Errors Using PyEnchant Spelling Suggestions Selected Through a Modified Needleman–Wunsch Algorithm

  • Ewerton Cappelatti
  • Regina De Oliveira Heidrich
  • Ricardo Oliveira
  • Cintia Monticelli
  • Ronaldo Rodrigues
  • Rodrigo Goulart
  • Eduardo Velho
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 850)

Abstract

This article describes the efforts of the Vocalizer project development team to correct errors in text generated by the Tesseract OCR engine. Vocalizer is a device that captures images of book pages, converts them into plain text using OCR (Optical Character Recognition) software, post-processes the resulting text, and converts it into speech. The whole process is performed autonomously. In the post-processing step, a modified Needleman-Wunsch algorithm selects among the suggestions made by the PyEnchant spellchecker. The results obtained were reasonable, which encourages further research.
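The selection step can be illustrated with a minimal sketch. The abstract does not say how the Needleman-Wunsch algorithm was modified, so the code below uses the standard global alignment score with assumed scoring values (match +1, mismatch -1, gap -1). The function names `nw_score` and `pick_suggestion` are hypothetical, and the candidate list is passed in directly; in practice it might come from PyEnchant, for example `enchant.Dict("en_US").suggest(word)`.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Standard Needleman-Wunsch global alignment score of strings a and b.

    The scoring values are assumptions for illustration, not the
    modified scheme used by the Vocalizer team.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] with the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def pick_suggestion(ocr_word, candidates):
    """Return the candidate that aligns best with the OCR output.

    Falls back to the OCR word itself when there are no candidates.
    """
    if not candidates:
        return ocr_word
    return max(candidates, key=lambda c: nw_score(ocr_word, c))


if __name__ == "__main__":
    # Hypothetical suggestion list for the misrecognized word "hause".
    print(pick_suggestion("hause", ["horse", "house", "has"]))
```

Ranking by alignment score favors suggestions that differ from the OCR output by a few character substitutions, which is a typical OCR error pattern. A modified scheme could, for instance, penalize visually similar character pairs less, but that is speculation about the design rather than something stated here.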

Keywords

Optical character recognition software, Error correction, Character recognition

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Ewerton Cappelatti (1, corresponding author)
  • Regina De Oliveira Heidrich (1)
  • Ricardo Oliveira (1)
  • Cintia Monticelli (1)
  • Ronaldo Rodrigues (1)
  • Rodrigo Goulart (1)
  • Eduardo Velho (1)
  1. Feevale University, Novo Hamburgo, Brazil
