Abstract
This paper explores detecting and correcting post-OCR errors in a corpus of Polish scientific abstracts. It first evaluates several available spellchecking approaches and then reuses one of the rule-based solutions to eliminate frequent errors that most likely result from technical problems in the OCR process. The fine-tuning consisted of removing word breaks, rejecting corrections that change the case of the output, removing unnecessary spaces between word segments, and restoring Polish letters replaced with spaces whenever the correction resulted in a valid Polish word. The resulting system proved competitive with language model-based solutions.
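The four fine-tuning heuristics named in the abstract can be illustrated with a minimal Python sketch. It is not the system evaluated in the paper: the toy `LEXICON`, the helper names (`remove_word_breaks`, `accept_correction`, `repair_pair`, `repair_text`) and the case-shape test are all assumptions made for illustration. A real setup would presumably consult a full morphological dictionary rather than a hard-coded set, and would also handle punctuation and letters lost at word boundaries, which this sketch ignores.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only); a real
# system would query a full morphological dictionary instead.
LEXICON = {"badań", "języka", "wyników", "źródło", "metoda"}

POLISH_LETTERS = "ąćęłńóśźż"


def is_valid(word, lexicon=LEXICON):
    """Return True if the lower-cased word is found in the lexicon."""
    return word.lower() in lexicon


def remove_word_breaks(text):
    """Join words hyphenated across a line break ('meto-\\nda' -> 'metoda')."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)


def case_shape(token):
    """Classify a token as 'upper', 'title' or 'lower' (a coarse heuristic)."""
    if token.isupper():
        return "upper"
    if token[:1].isupper():
        return "title"
    return "lower"


def accept_correction(original, correction):
    """Reject spellchecker suggestions that change the case of the token."""
    return case_shape(original) == case_shape(correction)


def repair_pair(left, right, lexicon=LEXICON):
    """Try to merge two adjacent segments into one valid word.

    First try plain concatenation (a spurious space), then try inserting a
    single Polish letter (a letter replaced with a space). Return the
    repaired word, or None if no unambiguous repair exists.
    """
    if is_valid(left, lexicon) and is_valid(right, lexicon):
        return None  # both parts are already words; leave them alone
    joined = left + right
    if is_valid(joined, lexicon):
        return joined
    candidates = [
        left + ch + right
        for ch in POLISH_LETTERS
        if is_valid(left + ch + right, lexicon)
    ]
    return candidates[0] if len(candidates) == 1 else None


def repair_text(text, lexicon=LEXICON):
    """Apply word-break removal and pairwise segment repair to a text."""
    tokens = remove_word_breaks(text).split()
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            fixed = repair_pair(tokens[i], tokens[i + 1], lexicon)
            if fixed is not None:
                out.append(fixed)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)


print(repair_text("meto-\nda bada ń j zyka"))  # -> metoda badań języka
```

In this sketch a repair is applied only when it yields a word found in the lexicon, which mirrors the condition stated in the abstract that a correction must result in a valid Polish word. Requiring exactly one candidate when restoring a missing letter is an extra, conservative assumption.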
Notes
- 15.
Detected automatically during linguistic annotation with the Concraft disambiguating tagger [9], which in rare cases resulted in several true sentences being treated as a single one.
- 18.
http://morfeusz.sgjp.pl/download/, version 20220410.
References
Hládek, D., Staš, J., Pleva, M.: Survey of automatic spelling correction. Electronics 9(10) (2020). https://doi.org/10.3390/electronics9101670, https://www.mdpi.com/2079-9292/9/10/1670
van Huyssteen, G.B., Eiselen, E.R., Puttkammer, M.J.: Evaluating evaluation metrics for spelling checker evaluations. In: Proceedings of the First International Workshop on Proofing Tools and Language Technologies, pp. 91–99 (2004)
Kobyliński, Ł., Kieraś, W., Rynkun, S.: PolEval 2021 task 3: post-correction of OCR results. In: Ogrodniczuk and Kobyliński [5], pp. 85–91 (2021). http://poleval.pl/files/poleval2021.pdf
Lewandowski, R.: Społeczna korekta post-OCR w bibliotekach cyfrowych. In: Ilona Koutny, P.N. (ed.) Język, Komunikacja, Informacja, pp. 123–134. Sorus (2011). 5/2010-2011
Ogrodniczuk, M., Kobyliński, Ł. (eds.): Proceedings of the PolEval 2021 Workshop. Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland (2021). http://poleval.pl/files/poleval2021.pdf
Pęzik, P., Mikołajczyk, A., Wawrzyński, A., Nitoń, B., Ogrodniczuk, M.: Keyword extraction from short texts with a text-to-text transfer transformer. In: Szczerbicki, E. (ed.) ACIIDS 2022. CCIS, vol. 1716, pp. 530–542. Springer, Singapore (2022). https://doi.org/10.1007/978-981-19-8234-7_41
Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21(140), 1–67 (2020). http://jmlr.org/papers/v21/20-074.html
Váradi, T., et al.: Introducing the CURLICAT corpora: seven-language domain specific annotated corpora from curated sources. In: Calzolari, N., et al. (eds.) Proceedings of the Thirteenth International Conference on Language Resources and Evaluation (LREC 2022), pp. 100–108. European Language Resources Association (ELRA), Marseille (2022). http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.11.pdf
Waszczuk, J., Kieraś, W., Woliński, M.: Morphosyntactic disambiguation and segmentation for historical Polish with graph-based conditional random fields. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds.) TSD 2018. LNCS (LNAI), vol. 11107, pp. 188–196. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00794-2_20
Woliński, M.: Morfeusz reloaded. In: Calzolari, N., et al. (eds.) Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC 2014), pp. 1106–1111. European Language Resources Association (ELRA), Reykjavík (2014). http://www.lrec-conf.org/proceedings/lrec2014/pdf/768_Paper.pdf
Woliński, M., Miłkowski, M., Ogrodniczuk, M., Przepiórkowski, A., Szałkiewicz: PoliMorf: a (not so) new open morphological dictionary for Polish. In: Calzolari, N., et al. (eds.) Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 860–864. European Language Resources Association (ELRA), Istanbul (2012). http://www.lrec-conf.org/proceedings/lrec2012/pdf/263_Paper.pdf
Woliński, M., Saloni, Z., Wołosz, R., Gruszczyński, W., Skowrońska, D., Bronk, Z.: Słownik gramatyczny języka polskiego (2020). http://sgjp.pl/. 4th edition
Wróbel, K.: OCR correction with encoder-decoder transformer. In: Ogrodniczuk and Kobyliński [5], pp. 97–102 (2021). http://poleval.pl/files/poleval2021.pdf
Acknowledgements
The work reported here was supported by the European Commission in the CEF Telecom Programme (Action No: 2019-EU-IA-0034, Grant Agreement No: INEA/CEF/ICT/A2019/1926831) and by the Polish Ministry of Science and Higher Education (research project 5103/CEF/2020/2, funds for 2020-2022).
We would like to thank Krzysztof Wróbel for his language model-based error candidate detection experiment using the ED 3 pl tool, and Stanisław Lorys for the first-pass manual correction of the evaluation data and for proposing the classification of spelling errors.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Ogrodniczuk, M. (2022). Fine-Tuning OCR Error Detection and Correction in a Polish Corpus of Scientific Abstracts. In: Szczerbicki, E., Wojtkiewicz, K., Nguyen, S.V., Pietranik, M., Krótkiewicz, M. (eds) Recent Challenges in Intelligent Information and Database Systems. ACIIDS 2022. Communications in Computer and Information Science, vol 1716. Springer, Singapore. https://doi.org/10.1007/978-981-19-8234-7_35
DOI: https://doi.org/10.1007/978-981-19-8234-7_35
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8233-0
Online ISBN: 978-981-19-8234-7
eBook Packages: Computer Science, Computer Science (R0)