Corpus Construction for Historical Newspapers: A Case Study on Public Meeting Corpus Construction Using OCR Error Correction

  • Original Research
  • SN Computer Science

Abstract

Large text corpora are indispensable for natural language processing. However, in fields such as literature and the humanities, many documents of interest exist only as scanned images and have not been converted to text data. Optical character recognition (OCR) converts scanned document images into text data, but it often misrecognizes characters when the scanned images are of low quality, which substantially degrades the quality of the resulting text corpora. This paper addresses corpus construction for historical newspapers. We present a corpus construction method based on a pipeline of image processing, OCR, and filtering. To improve corpus quality, we further propose integrating OCR error correction. To this end, we manually construct an OCR error correction dataset in the historical newspaper domain, propose methods to improve a neural OCR correction model, and compare various OCR error correction models. We evaluate our corpus construction method by the accuracy of extracting articles on a specific topic to construct a historical newspaper corpus. With OCR error correction, our method improves the article extraction F score by \(1.7\%\) compared with previous work, which verifies the effectiveness of OCR error correction for corpus construction.
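
The filtering and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the similarity threshold of 0.8, and the sample texts are all hypothetical, and fuzzy matching with Python's difflib merely stands in for whatever filter the paper uses. The sketch keeps articles whose OCRed text contains something close to the key phrase "public meeting" and then scores the extracted set against a gold set with the F score.

```python
from difflib import SequenceMatcher


def contains_phrase(text, phrase="public meeting", threshold=0.8):
    """Return True if some window of `text` is similar enough to `phrase`.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    text = " ".join(text.lower().split())  # normalise case and whitespace
    phrase = phrase.lower()
    n = len(phrase)
    if len(text) < n:
        return SequenceMatcher(None, text, phrase).ratio() >= threshold
    # Slide a window of the phrase's length over the text.
    for start in range(len(text) - n + 1):
        window = text[start:start + n]
        if SequenceMatcher(None, window, phrase).ratio() >= threshold:
            return True
    return False


def f_score(extracted, gold):
    """Harmonic mean of precision and recall over sets of article ids."""
    true_positives = len(extracted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical OCR output; "PUBLIO MEETlNG" imitates typical OCR confusions.
articles = {
    "a1": "A PUBLIC MEETING will be held at the Town Hall.",
    "a2": "A PUBLIO MEETlNG of ratepayers is called for Monday.",
    "a3": "For sale: two draught horses and a dray.",
}
gold = {"a1", "a2"}

extracted = {aid for aid, text in articles.items() if contains_phrase(text)}
print(sorted(extracted), round(f_score(extracted, gold), 3))
```

An exact string match would miss the second article because of its OCR errors, which is the kind of loss that OCR error correction is meant to reduce.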

Notes

  1. https://trove.nla.gov.au.

  2. Trove is an online library database service maintained by the Australian government, which covers major Australian daily newspapers and local newspapers.

  3. https://github.com/INL/AttestationTool.

  4. Note that due to the large number (i.e., 407,756) of advertisement pages that include “public meeting” articles, it is practically impossible to digitize all of them or to correct their OCR errors manually.

  5. Unfortunately, multiple OCR candidates are unavailable for our “public meeting” domain, and thus we used a corpus from a different domain in our experiment.

  6. We used top-128 in our experiments.

  7. https://openai.com/blog/language-unsupervised/.

  8. https://opencv.org/.

  9. https://onlizer.com/google_drive/tesseract_ocr

  10. https://docs.python.jp/3/library/difflib.html

  11. http://www.statmt.org/moses/

  12. 5-gram language models are the default in SMT, and we adopted the same setting for the OCR error correction task, following [8].

  13. https://github.com/kpu/kenlm/

  14. http://code.google.com/p/giza-pp

  15. http://opennmt.net/

  16. https://github.com/Doreenruirui/ACL2018_Multi_Input_OCR

  17. http://dlxs.richmond.edu/d/ddr/

  18. http://adb.anu.edu.au/

  19. https://auspost.com.au/

  20. http://www.let.osaka-u.ac.jp/seiyousi/Ghost-Gazetteer/index.htm

  21. https://www.imagemagick.org/

  22. Note that here we applied our SMT OCR correction model to the OCRed text provided by Trove. This baseline can therefore be regarded as an improved version of what can be achieved when extracting “public meeting” articles from the advertisement pages using the functions provided by Trove.

  23. https://nlp.stanford.edu/software/lex-parser.shtml

  24. As the advertisement pages in our data contain the key phrase “public meeting,” each must contain target articles to be extracted. If Baseline extracts no articles from an advertisement page, we treat it as a failure.

References

  1. Afli H, Barrault L, Schwenk H. OCR error correction using statistical machine translation. Int J Comput Ling Appl. 2015;7(1):175–91.

  2. Afli H, Qiu Z, Way A, Sheridan P. Using SMT for OCR error correction of historical texts. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp 962–966. 2016

  3. Barbaresi A. Bootstrapped OCR error detection for a less-resourced language variant. In: 13th Conference on Natural Language Processing (KONVENS 2016), pp 21–26. 2016

  4. Barrault L, Bojar O, Costa-jussà MR, Federmann C, Fishel M, Graham Y, Haddow B, Huck M, Koehn P, Malmasi S, Monz C, Müller M, Pal S, Post M, Zampieri M. Findings of the 2019 conference on machine translation (WMT19). In: Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp 1–61. 2019

  5. Cassidy S. Publishing the Trove newspaper corpus. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp 4520–4525. 2016

  6. Chiron G, Doucet A, Coustaty M, Visani M, Moreux JP. Impact of OCR errors on the use of digital libraries: towards a better access to information. In: Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries, JCDL ’17, pp 249–252. 2017

  7. Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1724–1734. 2014

  8. Chu C, Nakazawa T, Kurohashi S. Integrated parallel sentence and fragment extraction from comparable corpora: A case study on Chinese–Japanese Wikipedia. ACM Trans Asian Low-Resour Lang Inform Process. 2015;15(2):10:1–10:22

  9. Chung J, Cho K, Bengio Y. A character-level decoder without explicit segmentation for neural machine translation. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 1693–1703. 2016

  10. Davies M. Expanding horizons in historical linguistics with the 400-million word Corpus of Historical American English. Corpora. 2012;7:121–57.

  11. Dong R, Smith D. Multi-input attention for unsupervised OCR correction. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 2363–2372. 2018

  12. Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res. 2011;12:2121–59.

  13. Eger S, Brück T, Mehler A. A comparison of four character-level string-to-string translation models for (OCR) spelling error correction. Prag Bull Math Ling. 2016;106:77–99.

  14. Evershed J, Fitch K. Correcting noisy OCR: context beats confusion. In: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, pp 45–51. 2014

  15. Fujikawa T. Public meetings in New South Wales: 1871–1901. J R Aust Hist Soc. 1990;76:45–61.

  16. Kingma D, Ba J. Adam: A method for stochastic optimization. In: International Conference on Learning Representations. 2015

  17. Klein S, Kopel M. A voting system for automatic OCR correction. 2002

  18. Klein G, Kim Y, Deng Y, Senellart J, Rush A. OpenNMT: open-source toolkit for neural machine translation. In: Proceedings of ACL 2017, System Demonstrations, pp 67–72. 2017

  19. Koehn P. Europarl: a parallel corpus for statistical machine translation. In: Proceedings of Machine Translation Summit, pp 79–86. 2005

  20. Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E. Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pp 177–180. 2007

  21. Kolak O, Resnik P. OCR error correction using a noisy channel model. In: Proceedings of the Second International Conference on Human Language Technology Research, pp 257–262. 2002

  22. Kolak O, Byrne W, Resnik P. A generative probabilistic OCR model for NLP applications. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp 134–141. 2003

  23. Lund WB, Kennard DJ, Ringger EK. Combining multiple thresholding binarization values to improve OCR output. Doc Recogn Retrie XX. 2013;8658:254–64.

  24. Lyu L, Koutraki M, Krickl M, Fetahu B. Neural OCR post-hoc correction of historical corpora. Trans Assoc Comput Ling. 2021;9:479–93.

  25. Marcus MP, Marcinkiewicz MA, Santorini B. Building a large annotated corpus of English: The Penn Treebank. Comput Ling. 1993;19(2):313–30.

  26. Mokhtar K, Bukhari SS, Dengel A. OCR error correction: state-of-the-art vs an NMT-based approach. In: Proceedings of the 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp 429–434. 2018

  27. Moreno-García C, Elyan E. Digitisation of assets from the oil & gas industry: Challenges and opportunities. In: Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), pp 2–5. 2019

  28. Moreno-García C, Elyan E, Jayne C. New trends on digitisation of complex engineering drawings. Neural Comput Appl. 2019;31(6):1695–712.

  29. Neudecker C. An open corpus for named entity recognition in historic newspapers. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp 4348–4352. 2016

  30. Och FJ. Minimum error rate training in statistical machine translation. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp 160–167. 2003

  31. Otsu N. A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern. 1979;9(1):62–6.

  32. Radford A, Narasimhan K. Improving language understanding by generative pre-training. 2018

  33. Richter C, Wickes M, Beser D, Marcus M. Low-resource post processing of noisy OCR output for historical corpus digitisation. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC-2018), pp 2331–2339. 2018

  34. Rögnvaldsson E, Ingason AK, Sigurðsson EF, Wallenberg J. The Icelandic parsed historical corpus (IcePaHC). In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pp 1977–1984. 2012

  35. Sánchez-Martínez F, Martínez-Sempere I, Ivars-Ribes X, Carrasco R. An open diachronic corpus of historical Spanish. Lang Resour Evaluat. 2013;47:1327–42.

  36. Sherratt T. GLAM Workbench: using the Trove newspaper gazette harvester (the web app version). 2021

  37. Smith R. An overview of the Tesseract OCR engine. In: Proc. of International Conference on Document Analysis and Recognition, vol 2, pp 629–633. 2007

  38. Smith DA, Cordell R, Dillon EM, Stramp N, Wilkerson J. Detecting and modeling local text reuse. In: IEEE/ACM Joint Conference on Digital Libraries, pp 183–192. 2014

  39. Snoek J, Larochelle H, Adams RP. Practical Bayesian optimization of machine learning algorithms. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, pp 2951–2959. 2012

  40. Suzuki S, Abe K. Topological structural analysis of digitized binary images by border following. Comput Vis Graph Image Process. 1985;30(1):32–46.

  41. Tanaka K, Chu C, Ren H, Renoust B, Nakashima Y, Takemura N, Nagahara H, Fujikawa T. Constructing a public meeting corpus. In: Proceedings of The 12th Language Resources and Evaluation Conference, pp 1934–1940. 2020

  42. Trad A, Doush I. Improving post-processing optical character recognition documents with Arabic language using spelling error detection and correction. Int J Reason Based Intell Syst. 2016;8:91.

  43. Wilkerson J, Smith D, Stramp N. Tracing the flow of policy ideas in legislatures: A text reuse approach. Am J Polit Sci. 2015;59(4)

  44. Xu S, Smith D. Retrieving and combining repeated passages to improve OCR. In: 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp 1–4. 2017

  45. Yamazoe T, Etoh M, Yoshimura T, Tsujino K. Hypothesis preservation approach to scene text recognition with weighted finite-state transducer. In: Proceedings of the 2011 International Conference on Document Analysis and Recognition, pp 359–363. 2011

  46. Zoph B, Yuret D, May J, Knight K. Transfer learning for low-resource neural machine translation. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp 1568–1575. 2016

Acknowledgements

This work was supported by Grant-in-Aid for Scientific Research (B) #19H01330, JSPS.

Author information

Corresponding author

Correspondence to Chenhui Chu.

Ethics declarations

Conflict of interest

The authors declare that they do not have a financial or personal relationship with a third party whose interests could be positively or negatively influenced by the article’s content.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tanaka, K., Chu, C., Kajiwara, T. et al. Corpus Construction for Historical Newspapers: A Case Study on Public Meeting Corpus Construction Using OCR Error Correction. SN COMPUT. SCI. 3, 489 (2022). https://doi.org/10.1007/s42979-022-01393-6

  • DOI: https://doi.org/10.1007/s42979-022-01393-6
