Fine-Tuning OCR Error Detection and Correction in a Polish Corpus of Scientific Abstracts

Ogrodniczuk, Maciej

doi:10.1007/978-981-19-8234-7_35

Maciej Ogrodniczuk (ORCID: orcid.org/0000-0002-3467-9424)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1716)

Included in the following conference series:

Asian Conference on Intelligent Information and Database Systems

Abstract

The paper explores the idea of detecting and correcting post-OCR errors in a corpus of Polish scientific abstracts by first evaluating several available spellchecking approaches and then reusing one of the rule-based solutions to eliminate frequent errors most likely resulting from technical problems of the OCR process. The fine-tuning consisted in removing word breaks, rejecting corrections which change the case of the output, removing unnecessary spaces between word segments and restoring Polish letters replaced with spaces whenever the correction resulted in a valid Polish word. The obtained system proved competitive with language model-based solutions.
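
The fine-tuning steps listed above can be illustrated with a minimal Python sketch. It is not the author's implementation: the toy `LEXICON`, the helper names, and the exact acceptance rules are assumptions made for illustration only. In particular, the sketch reads "removing word breaks" as re-joining words hyphenated across a line break, and "corrections which change the case" as suggestions that differ from the original only in letter case.

```python
import re

# Hypothetical toy lexicon standing in for a real Polish morphological dictionary.
LEXICON = {"język", "badanie", "metoda", "analiza"}
POLISH_LETTERS = "ąćęłńóśźż"


def is_valid_word(word: str) -> bool:
    """Assumed validity check: case-insensitive lookup in the toy lexicon."""
    return word.lower() in LEXICON


def join_word_breaks(text: str) -> str:
    """Re-join a word hyphenated across a line break if the result is valid."""
    def repl(match: re.Match) -> str:
        joined = match.group(1) + match.group(2)
        return joined if is_valid_word(joined) else match.group(0)
    return re.sub(r"(\w+)-\s*\n\s*(\w+)", repl, text)


def is_case_only_change(original: str, suggestion: str) -> bool:
    """True if a suggestion differs from the original only in letter case
    (in this sketch such suggestions would be rejected)."""
    return original != suggestion and original.lower() == suggestion.lower()


def merge_pair(left: str, right: str):
    """Try to repair a spurious space between two segments.

    Returns the merged word, or None if no repair yields a valid word.
    """
    if is_valid_word(left) and is_valid_word(right):
        return None  # both segments are already words; leave them alone
    if is_valid_word(left + right):
        return left + right  # unnecessary space between word segments
    for letter in POLISH_LETTERS:
        if is_valid_word(left + letter + right):
            return left + letter + right  # Polish letter replaced by a space
    return None


def repair_spaces(text: str) -> str:
    """Scan adjacent tokens left to right and apply merge_pair where possible."""
    tokens = text.split(" ")
    out, i = [], 0
    while i < len(tokens):
        merged = merge_pair(tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if merged is not None:
            out.append(merged)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(join_word_breaks("bada-\nnie"))
print(repair_spaces("ana liza"))
print(repair_spaces("j zyk"))
```

In this sketch every repair is accepted only if the result is found in the lexicon, mirroring the abstract's condition that a correction must yield a valid Polish word. A Polish letter lost at the very start or end of a word is not handled here.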

Notes

1. See also http://www.djvu.com.pl/galeria/UJ/Gazety_czasopisma.php.
2. http://clip.ipipan.waw.pl/POSMAC.
3. https://bibliotekanauki.pl/.
4. https://curlicat.eu/.
5. https://languagetool.org/.
6. https://ws.clarin-pl.eu/speller.
7. https://spacy.io/.
8. https://github.com/filyp/autocorrect.
9. https://ws.clarin-pl.eu/symspell.
10. https://github.com/wolfgarbe/SymSpell.
11. https://huggingface.co/clarin-pl/fastText-kgr10.
12. http://2021.poleval.pl/tasks/task3.
13. https://huggingface.co/allegro/plt5-large.
14. https://answers.microsoft.com/en-us/msoffice/forum/all/how-to-accept-all-autocorrect-suggestions-in/e8de0d2c-5429-4a48-8f0c-c62c0f69c717.
15. Detected automatically in the process of linguistic annotation with Concraft disambiguating tagger [9] which in some rare cases resulted in several true sentences treated as a single one.
16. Compare e.g. https://languagetool.org/development/api/org/languagetool/rules/Categories.html.
17. https://metacpan.org/pod/Algorithm::Merge.
18. http://morfeusz.sgjp.pl/download/, version 20220410.

References

Hládek, D., Staš, J., Pleva, M.: Survey of automatic spelling correction. Electronics 9(10) (2020). https://doi.org/10.3390/electronics9101670, https://www.mdpi.com/2079-9292/9/10/1670
van Huyssteen, G.B., Eiselen, E.R., Puttkammer, M.J.: Evaluating evaluation metrics for spelling checker evaluations. In: Proceedings of the First International Workshop on Proofing Tools and Language Technologies, pp. 91–99 (2004)
Kobyliński, Ł., Kieraś, W., Rynkun, S.: PolEval 2021 task 3: post-correction of OCR results. In: Ogrodniczuk and Kobyliński [5], pp. 85–91 (2021). http://poleval.pl/files/poleval2021.pdf
Lewandowski, R.: Społeczna korekta post-OCR w bibliotekach cyfrowych. In: Ilona Koutny, P.N. (ed.) Język, Komunikacja, Informacja, pp. 123–134. Sorus (2011). 5/2010-2011
Ogrodniczuk, M., Kobyliński, Ł. (eds.): Proceedings of the PolEval 2021 Workshop. Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland (2021). http://poleval.pl/files/poleval2021.pdf
Pęzik, P., Mikołajczyk, A., Wawrzyński, A., Nitoń, B., Ogrodniczuk, M.: Keyword extraction from short texts with a text-to-text transfer transformer. In: Szczerbicki, E. (ed.) ACIIDS 2022. CCIS, vol. 1716, pp. 530–542. Springer, Singapore (2022). https://doi.org/10.1007/978-981-19-8234-7_41
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020). http://jmlr.org/papers/v21/20-074.html
Váradi, T., et al.: Introducing the CURLICAT corpora: seven-language domain specific annotated corpora from curated sources. In: Calzolari, N., et al. (eds.) Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022), pp. 100–108. European Language Resources Association (ELRA), Marseille (2022). http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.11.pdf
Waszczuk, J., Kieraś, W., Woliński, M.: Morphosyntactic disambiguation and segmentation for historical Polish with graph-based conditional random fields. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2018. LNCS (LNAI), vol. 11107, pp. 188–196. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00794-2_20
Woliński, M.: Morfeusz reloaded. In: Calzolari, N., et al. (eds.) Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 1106–1111. European Language Resources Association (ELRA), Reykjavík (2014). http://www.lrec-conf.org/proceedings/lrec2014/pdf/768_Paper.pdf
Woliński, M., Miłkowski, M., Ogrodniczuk, M., Przepiórkowski, A., Szałkiewicz: PoliMorf: a (not so) new open morphological dictionary for Polish. In: Calzolari, N., et al. (eds.) Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 860–864. European Language Resources Association (ELRA), Istanbul (2012). http://www.lrec-conf.org/proceedings/lrec2012/pdf/263_Paper.pdf
Woliński, M., Saloni, Z., Wołosz, R., Gruszczyński, W., Skowrońska, D., Bronk, Z.: Słownik gramatyczny języka polskiego (2020). http://sgjp.pl/. 4th edition
Wróbel, K.: OCR correction with encoder-decoder transformer. In: Ogrodniczuk and Kobyliński [5], pp. 97–102 (2021). http://poleval.pl/files/poleval2021.pdf

Acknowledgements

The work reported here was supported by the European Commission in the CEF Telecom Programme (Action No: 2019-EU-IA-0034, Grant Agreement No: INEA/CEF/ICT/A2019/1926831) and the Polish Ministry of Science and Higher Education (research project 5103/CEF/2020/2, funds for 2020-2022).

We would like to thank Krzysztof Wróbel for his language model-based error candidate detection experiment using the ED 3 pl tool and Stanisław Lorys for first-pass manual correction of the evaluation data and proposing the classification of spelling errors.

Author information

Authors and Affiliations

Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
Maciej Ogrodniczuk

Corresponding author

Correspondence to Maciej Ogrodniczuk.

Editor information

Editors and Affiliations

University of Newcastle Australia, Newcastle, NSW, Australia
Edward Szczerbicki
Wrocław University of Science and Technology, Wrocław, Poland
Krystian Wojtkiewicz
International University - VNU-HCM, Ho Chi Minh City, Vietnam
Sinh Van Nguyen
Wrocław University of Science and Technology, Wrocław, Poland
Marcin Pietranik
Wrocław University of Science and Technology, Wrocław, Poland
Marek Krótkiewicz

Cite this paper

Ogrodniczuk, M. (2022). Fine-Tuning OCR Error Detection and Correction in a Polish Corpus of Scientific Abstracts. In: Szczerbicki, E., Wojtkiewicz, K., Nguyen, S.V., Pietranik, M., Krótkiewicz, M. (eds) Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2022. Communications in Computer and Information Science, vol 1716. Springer, Singapore. https://doi.org/10.1007/978-981-19-8234-7_35

DOI: https://doi.org/10.1007/978-981-19-8234-7_35
Published: 24 November 2022
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8233-0
Online ISBN: 978-981-19-8234-7
eBook Packages: Computer Science, Computer Science (R0)
