Fine-Tuning OCR Error Detection and Correction in a Polish Corpus of Scientific Abstracts

  • Conference paper
  • Recent Challenges in Intelligent Information and Database Systems (ACIIDS 2022)

Abstract

The paper explores detecting and correcting post-OCR errors in a corpus of Polish scientific abstracts. It first evaluates several available spellchecking approaches and then reuses one of the rule-based solutions to eliminate frequent errors that most likely result from technical problems of the OCR process. The fine-tuning consisted of removing word breaks, rejecting corrections that change the case of the output, removing unnecessary spaces between word segments, and restoring Polish letters replaced with spaces whenever the correction resulted in a valid Polish word. The resulting system proved competitive with language model-based solutions.
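
As an illustration only, the four fine-tuning steps named in the abstract might be sketched in Python as follows. The function names, the toy lexicon, and the exact matching conditions (for example, joining only end-of-line hyphenation, or treating a correction as case-changing when it differs from the input in case alone) are assumptions made for this sketch. The abstract does not say how the paper implements these steps, and a real system would validate candidates against a full dictionary of Polish word forms rather than a small set.

```python
import re

# Polish letters with diacritics that OCR may have replaced with a space.
POLISH_LETTERS = "ąćęłńóśźż"

# Hypothetical toy lexicon; stands in for a real dictionary of valid forms.
TOY_LEXICON = {"społeczny", "badanie", "populacja"}


def join_word_breaks(text):
    """Remove end-of-line hyphenation, e.g. 'popu-\\nlacja' -> 'populacja'."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def changes_only_case(original, suggestion):
    """True if a suggested correction differs from the input only in case."""
    return original != suggestion and original.lower() == suggestion.lower()


def merge_segments(left, right, lexicon):
    """Try to repair two adjacent segments separated by a spurious space.

    First try plain concatenation (an unnecessary space), then try each
    Polish letter in place of the space (a letter lost in OCR). A repair
    is accepted only if the result is a valid word in the lexicon.
    Returns the repaired word, or None if no valid repair is found.
    """
    # Assumption: leave the pair alone when both segments are valid words.
    if left.lower() in lexicon and right.lower() in lexicon:
        return None
    joined = left + right
    if joined.lower() in lexicon:
        return joined
    for letter in POLISH_LETTERS:
        candidate = left + letter + right
        if candidate.lower() in lexicon:
            return candidate
    return None


if __name__ == "__main__":
    print(join_word_breaks("popu-\nlacja"))
    print(changes_only_case("DNA", "dna"))
    print(merge_segments("ba", "danie", TOY_LEXICON))
    print(merge_segments("spo", "eczny", TOY_LEXICON))
```

Accepting a repair only when it yields a word found in the lexicon mirrors the condition stated in the abstract ("whenever the correction resulted in a valid Polish word"); everything else in the sketch is a simplification.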

Notes

  1. See also http://www.djvu.com.pl/galeria/UJ/Gazety_czasopisma.php.

  2. http://clip.ipipan.waw.pl/POSMAC.

  3. https://bibliotekanauki.pl/.

  4. https://curlicat.eu/.

  5. https://languagetool.org/.

  6. https://ws.clarin-pl.eu/speller.

  7. https://spacy.io/.

  8. https://github.com/filyp/autocorrect.

  9. https://ws.clarin-pl.eu/symspell.

  10. https://github.com/wolfgarbe/SymSpell.

  11. https://huggingface.co/clarin-pl/fastText-kgr10.

  12. http://2021.poleval.pl/tasks/task3.

  13. https://huggingface.co/allegro/plt5-large.

  14. https://answers.microsoft.com/en-us/msoffice/forum/all/how-to-accept-all-autocorrect-suggestions-in/e8de0d2c-5429-4a48-8f0c-c62c0f69c717.

  15. Detected automatically in the process of linguistic annotation with Concraft disambiguating tagger [9] which in some rare cases resulted in several true sentences treated as a single one.

  16. Compare e.g. https://languagetool.org/development/api/org/languagetool/rules/Categories.html.

  17. https://metacpan.org/pod/Algorithm::Merge.

  18. http://morfeusz.sgjp.pl/download/, version 20220410.

References

  1. Hládek, D., Staš, J., Pleva, M.: Survey of automatic spelling correction. Electronics 9(10) (2020). https://doi.org/10.3390/electronics9101670, https://www.mdpi.com/2079-9292/9/10/1670

  2. van Huyssteen, G.B., Eiselen, E.R., Puttkammer, M.J.: Evaluating evaluation metrics for spelling checker evaluations. In: Proceedings of the First International Workshop on Proofing Tools and Language Technologies, pp. 91–99 (2004)

  3. Kobyliński, Ł., Kieraś, W., Rynkun, S.: PolEval 2021 task 3: post-correction of OCR results. In: Ogrodniczuk and Kobyliński [5], pp. 85–91 (2021). http://poleval.pl/files/poleval2021.pdf

  4. Lewandowski, R.: Społeczna korekta post-OCR w bibliotekach cyfrowych. In: Ilona Koutny, P.N. (ed.) Język, Komunikacja, Informacja, pp. 123–134. Sorus (2011). 5/2010-2011

  5. Ogrodniczuk, M., Kobyliński, Ł. (eds.): Proceedings of the PolEval 2021 Workshop. Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland (2021). http://poleval.pl/files/poleval2021.pdf

  6. Pęzik, P., Mikołajczyk, A., Wawrzyński, A., Nitoń, B., Ogrodniczuk, M.: Keyword extraction from short texts with a text-to-text transfer transformer. In: Szczerbicki, E. (ed.) ACIIDS 2022. CCIS, vol. 1716, pp. 530–542. Springer, Singapore (2022). https://doi.org/10.1007/978-981-19-8234-7_41

  7. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020). http://jmlr.org/papers/v21/20-074.html

  8. Váradi, T., et al.: Introducing the CURLICAT corpora: seven-language domain specific annotated corpora from curated sources. In: Calzolari, N., et al. (eds.) Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022), pp. 100–108. European Language Resources Association (ELRA), Marseille (2022). http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.11.pdf

  9. Waszczuk, J., Kieraś, W., Woliński, M.: Morphosyntactic disambiguation and segmentation for historical Polish with graph-based conditional random fields. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2018. LNCS (LNAI), vol. 11107, pp. 188–196. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00794-2_20

  10. Woliński, M.: Morfeusz reloaded. In: Calzolari, N., et al. (eds.) Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 1106–1111. European Language Resources Association (ELRA), Reykjavík (2014). http://www.lrec-conf.org/proceedings/lrec2014/pdf/768_Paper.pdf

  11. Woliński, M., Miłkowski, M., Ogrodniczuk, M., Przepiórkowski, A., Szałkiewicz: PoliMorf: a (not so) new open morphological dictionary for Polish. In: Calzolari, N., et al. (eds.) Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 860–864. European Language Resources Association (ELRA), Istanbul (2012). http://www.lrec-conf.org/proceedings/lrec2012/pdf/263_Paper.pdf

  12. Woliński, M., Saloni, Z., Wołosz, R., Gruszczyński, W., Skowrońska, D., Bronk, Z.: Słownik gramatyczny języka polskiego (2020). http://sgjp.pl/. 4th edition

  13. Wróbel, K.: OCR correction with encoder-decoder transformer. In: Ogrodniczuk and Kobyliński [5], pp. 97–102 (2021). http://poleval.pl/files/poleval2021.pdf

Acknowledgements

The work reported here was supported by the European Commission in the CEF Telecom Programme (Action No: 2019-EU-IA-0034, Grant Agreement No: INEA/CEF/ICT/A2019/1926831) and by the Polish Ministry of Science and Higher Education (research project 5103/CEF/2020/2, funds for 2020-2022).

We would like to thank Krzysztof Wróbel for his language model-based error candidate detection experiment using the ED 3 pl tool, and Stanisław Lorys for the first-pass manual correction of the evaluation data and for proposing the classification of spelling errors.

Author information

Correspondence to Maciej Ogrodniczuk.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Ogrodniczuk, M. (2022). Fine-Tuning OCR Error Detection and Correction in a Polish Corpus of Scientific Abstracts. In: Szczerbicki, E., Wojtkiewicz, K., Nguyen, S.V., Pietranik, M., Krótkiewicz, M. (eds) Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2022. Communications in Computer and Information Science, vol 1716. Springer, Singapore. https://doi.org/10.1007/978-981-19-8234-7_35

  • DOI: https://doi.org/10.1007/978-981-19-8234-7_35

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-19-8233-0

  • Online ISBN: 978-981-19-8234-7

  • eBook Packages: Computer Science (R0)
