Corpus Construction for Historical Newspapers: A Case Study on Public Meeting Corpus Construction Using OCR Error Correction

Tanaka, Koji; Chu, Chenhui; Kajiwara, Tomoyuki; Nakashima, Yuta; Takemura, Noriko; Nagahara, Hajime; Fujikawa, Takao

doi:10.1007/s42979-022-01393-6

Original Research
Published: 25 September 2022

Volume 3, article number 489 (2022)
SN Computer Science

Koji Tanaka¹,
Chenhui Chu² (ORCID: orcid.org/0000-0001-9848-6384),
Tomoyuki Kajiwara¹,
Yuta Nakashima¹,
Noriko Takemura¹,
Hajime Nagahara¹ &
Takao Fujikawa¹

Abstract

Large text corpora are indispensable for natural language processing. However, in fields such as literature and the humanities, many documents of interest exist only as scanned images and have not been converted to text data. Optical character recognition (OCR) converts scanned document images into text, but it often misrecognizes characters when the scans are of low quality, which significantly degrades the quality of the resulting text corpora. This paper addresses corpus construction for historical newspapers. We present a corpus construction method based on a pipeline of image processing, OCR, and filtering. To improve corpus quality, we further propose integrating OCR error correction. To this end, we manually construct an OCR error correction dataset in the historical newspaper domain, propose methods to improve a neural OCR correction model, and compare various OCR error correction models. We evaluate our corpus construction method by its accuracy in extracting articles on a specific topic to construct a historical newspaper corpus. As a result, our method improves the article extraction F score by $1.7\%$ via OCR error correction compared with previous work. This result verifies the effectiveness of OCR error correction for corpus construction.
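
To make the evaluation idea concrete, the following is a minimal sketch of key-phrase filtering followed by an F score over the extracted articles. It is not the authors' implementation: the function names `filter_by_key_phrase` and `f_score`, the article IDs, and the toy texts are all assumptions introduced for illustration, and the real pipeline also involves image processing and OCR steps that are not shown here.

```python
def filter_by_key_phrase(articles: dict[str, str],
                         key_phrase: str = "public meeting") -> set[str]:
    """Return the IDs of articles whose text contains the key phrase (case-insensitive)."""
    needle = key_phrase.lower()
    return {article_id for article_id, text in articles.items()
            if needle in text.lower()}


def f_score(extracted: set[str], gold: set[str]) -> float:
    """Harmonic mean of precision and recall over extracted article IDs."""
    true_positives = len(extracted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(extracted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy data invented for illustration; "a2" carries typical OCR confusions.
raw_ocr = {
    "a1": "A PUBLIC MEETING will be held at the Town Hall.",
    "a2": "A pubhc meetmg of ratepayers is called.",
    "a3": "For sale: two draught horses.",
}
# Hypothetical output of an OCR error correction step applied to "a2".
corrected = dict(raw_ocr, a2="A public meeting of ratepayers is called.")
gold = {"a1", "a2"}  # articles that truly are public meeting notices

f_raw = f_score(filter_by_key_phrase(raw_ocr), gold)
f_corrected = f_score(filter_by_key_phrase(corrected), gold)
```

On this toy input the uncorrected text misses article `a2`, so recall drops and `f_raw` is 2/3, while the corrected text recovers it and `f_corrected` is 1.0. The size of the gap here is an artifact of the invented data and says nothing about the gain reported in the abstract.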

Notes

https://trove.nla.gov.au.
Trove is an online library database service maintained by the Australian government, which covers major Australian daily newspapers and local newspapers.
https://github.com/INL/AttestationTool.
Note that, because of the large number (i.e., 407,756) of advertisement pages that include “public meeting” articles, it is almost impossible either to digitize all of them or to correct the OCR errors in all of them manually.
Unfortunately, multiple OCR candidates are unavailable for our “public meeting” domain, and thus we used a corpus from a different domain in our experiment.
We used top-128 in our experiments.
https://openai.com/blog/language-unsupervised/.
https://opencv.org/.
https://onlizer.com/google_drive/tesseract_ocr
https://docs.python.jp/3/library/difflib.html
http://www.statmt.org/moses/
5-gram language models are used by default in SMT, and we followed that practice for the OCR error correction task, following [8].
https://github.com/kpu/kenlm/
http://code.google.com/p/giza-pp
http://opennmt.net/
https://github.com/Doreenruirui/ACL2018_Multi_Input_OCR
http://dlxs.richmond.edu/d/ddr/
http://adb.anu.edu.au/
https://auspost.com.au/
http://www.let.osaka-u.ac.jp/seiyousi/Ghost-Gazetteer/index.htm
https://www.imagemagick.org/
Note that here we applied our SMT OCR correction model to the OCRed text provided by Trove. Therefore, this baseline can be treated as an improved version of what can be achieved in extracting “public meeting” articles from the advertisement pages using the functions provided by Trove.
https://nlp.stanford.edu/software/lex-parser.shtml
As the advertisement pages in our data contain the key phrase “public meeting,” each page must contain target articles to be extracted. If the baseline extracts no articles from an advertisement page, we treat it as a failure.
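
The notes above cite Python's `difflib` module, but this page does not say how the authors used it. One common use in OCR correction work is aligning OCR output with its manual correction to list character-level differences. The sketch below shows that use under this assumption; the helper name `char_edits` and the sample strings are hypothetical.

```python
import difflib


def char_edits(ocr_text: str, truth_text: str) -> list[tuple[str, str, str]]:
    """List (operation, ocr_span, truth_span) for every span where the texts differ."""
    # autojunk=False disables the popularity heuristic so long texts align faithfully.
    matcher = difflib.SequenceMatcher(None, ocr_text, truth_text, autojunk=False)
    return [
        (tag, ocr_text[i1:i2], truth_text[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical OCR output and its manual correction.
edits = char_edits("pubhc", "public")
```

Here `edits` holds a single `replace` operation mapping `h` to `li`. Pairs like this can be counted across a dataset to see which character confusions are most frequent.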

References

Afli H, Barrault L, Schwenk H. OCR error correction using statistical machine translation. Int J Comput Ling Appl. 2015;7(1):175–91.
Afli H, Qiu Z, Way A, Sheridan P. Using SMT for OCR error correction of historical texts. In: Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pp 962–966. 2016
Barbaresi A. Bootstrapped OCR error detection for a less-resourced language variant. In: 13th Conference on Natural Language Processing (KONVENS 2016), pp 21–26. 2016
Barrault L, Bojar O, Costa-jussà MR, Federmann C, Fishel M, Graham Y, Haddow B, Huck M, Koehn P, Malmasi S, Monz C, Müller M, Pal S, Post M, Zampieri M. Findings of the 2019 conference on machine translation (WMT19). In: Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp 1–61. 2019
Cassidy S. Publishing the Trove newspaper corpus. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp 4520–4525. 2016
Chiron G, Doucet A, Coustaty M, Visani M, Moreux JP. Impact of OCR errors on the use of digital libraries: towards a better access to information. In: Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries, JCDL ’17, pp 249–252. 2017
Cho K, van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp 1724–1734. 2014
Chu C, Nakazawa T, Kurohashi S. Integrated parallel sentence and fragment extraction from comparable corpora: A case study on chinese–japanese wikipedia. ACM Trans Asian Low-Resour Lang Inform Process. 2015;15(2):10:1–10:22
Chung J, Cho K, Bengio Y. A character-level decoder without explicit segmentation for neural machine translation. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 1693–1703. 2016
Davies M. Expanding horizons in historical linguistics with the 400-million word corpus of historical american english. Corpora. 2012;7:121–57.
Dong R, Smith D. Multi-input attention for unsupervised OCR correction. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp 2363–2372. 2018
Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res. 2011;12:2121–59.
Eger S, Brück T, Mehler A. A comparison of four character-level string-to-string translation models for (ocr) spelling error correction. Prag Bull Math Ling. 2016;106:77–99.
Evershed J, Fitch K. Correcting noisy OCR: context beats confusion. In: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, pp 45–51. 2014
Fujikawa T. Public meetings in New South Wales: 1871–1901. J R Aust Hist Soc. 1990;76:45–61.
Kingma D, Ba J. Adam: A method for stochastic optimization. In: International Conference on Learning Representations. 2015
Klein S, Kopel M. A voting system for automatic ocr correction. 2002
Klein G, Kim Y, Deng Y, Senellart J, Rush A. OpenNMT: open-source toolkit for neural machine translation. In: Proceedings of ACL 2017, System Demonstrations, pp 67–72. 2017
Koehn P. Europarl: a parallel corpus for statistical machine translation. In: Proceedings of Machine Translation Summit, pp 79–86. 2005
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E. Moses: open source toolkit for statistical machine translation. In: Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pp 177–180. 2007
Kolak O, Resnik P. OCR error correction using a noisy channel model. In: Proceedings of the Second International Conference on Human Language Technology Research, pp 257–262. 2002
Kolak O, Byrne W, Resnik P. A generative probabilistic OCR model for NLP applications. In: Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, pp 134–141. 2003
Lund WB, Kennard DJ, Ringger EK. Combining multiple thresholding binarization values to improve OCR output. Doc Recogn Retrie XX. 2013;8658:254–64.
Lyu L, Koutraki M, Krickl M, Fetahu B. Neural OCR post-hoc correction of historical corpora. Trans Assoc Comput Ling. 2021;9:479–93.
Marcus MP, Marcinkiewicz MA, Santorini B. Building a large annotated corpus of English: The Penn Treebank. Comput Ling. 1993;19(2):313–30.
Mokhtar K, Bukhari SS, Dengel A. OCR error correction: state-of-the-art vs an nmt-based approach. In: Proceedings of the 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp 429–434. 2018
Moreno-García C, Elyan E. Digitisation of assets from the oil gas industry: Challenges and opportunities. In: Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), pp 2–5. 2019
Moreno-García C, Elyan E, Jayne C. New trends on digitisation of complex engineering drawings. 2019;31(6):1695–712.
Neudecker C. An open corpus for named entity recognition in historic newspapers. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp 4348–4352. 2016
Och FJ. Minimum error rate training in statistical machine translation. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp 160–167. 2003
Otsu N. A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern. 1979;9(1):62–6.
Radford A, Narasimhan K. Improving language understanding by generative pre-training. 2018
Richter C, Wickes M, Beser D, Marcus M. Low-resource post processing of noisy OCR output for historical corpus digitisation. In: Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC-2018), pp 2331–2339. 2018
Rögnvaldsson E, Ingason AK, Sigurðsson EF, Wallenberg J. The Icelandic parsed historical corpus (IcePaHC). In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pp 1977–1984. 2012
Sánchez-Martínez F, Martínez-Sempere I, Ivars-Ribes X, Carrasco R. An open diachronic corpus of historical Spanish. Lang Resour Evaluat. 2013;47:1327–42.
Sherratt T (2021) Glam workbench—using the trove newspaper gazette harvester (the web app version)
Smith R. An overview of the Tesseract OCR engine. In: Proc. of International Conference on Document Analysis and Recognition, vol 2, pp 629–633. 2007
Smith DA, Cordel R, Dillon EM, Stramp N, Wilkerson J. Detecting and modeling local text reuse. In: IEEE/ACM Joint Conference on Digital Libraries, pp 183–192. 2014
Snoek J, Larochelle H, Adams RP. Practical bayesian optimization of machine learning algorithms. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 2, pp 2951–2959. 2012
Suzuki S, Abe K. Topological structural analysis of digitized binary images by border following. Comput Vis Graph Image Process. 1985;30(1):32–46.
Tanaka K, Chu C, Ren H, Renoust B, Nakashima Y, Takemura N, Nagahara H, Fujikawa T. Constructing a public meeting corpus. In: Proceedings of The 12th Language Resources and Evaluation Conference, pp 1934–1940. 2020
Trad A, Doush I. Improving post-processing optical character recognition documents with arabic language using spelling error detection and correction. Int J Reason Based Intell Syst. 2016;8:91.
Wilkerson J, Smith D, Stramp N. Tracing the flow of policy ideas in legislatures: A text reuse approach. Am J Polit Sci. 2015;59(4)
Xu S, Smith D. Retrieving and combining repeated passages to improve ocr. In: 2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pp 1–4. 2017
Yamazoe T, Etoh M, Yoshimura T, Tsujino K. Hypothesis preservation approach to scene text recognition with weighted finite-state transducer. In: Proceedings of the 2011 International Conference on Document Analysis and Recognition, pp 359–363. 2011
Zoph B, Yuret D, May J, Knight K. Transfer learning for low-resource neural machine translation. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp 1568–1575. 2016

Acknowledgements

This work was supported by Grant-in-Aid for Scientific Research (B) #19H01330, JSPS.

Author information

Authors and Affiliations

Osaka University, Osaka, Japan
Koji Tanaka, Tomoyuki Kajiwara, Yuta Nakashima, Noriko Takemura, Hajime Nagahara & Takao Fujikawa
Kyoto University, Kyoto, Japan
Chenhui Chu

Corresponding author

Correspondence to Chenhui Chu.

Ethics declarations

Conflict of interest

The authors declare that they do not have a financial or personal relationship with a third party whose interests could be positively or negatively influenced by the article’s content.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Tanaka, K., Chu, C., Kajiwara, T. et al. Corpus Construction for Historical Newspapers: A Case Study on Public Meeting Corpus Construction Using OCR Error Correction. SN COMPUT. SCI. 3, 489 (2022). https://doi.org/10.1007/s42979-022-01393-6

Received: 21 October 2021
Accepted: 27 August 2022
Published: 25 September 2022
DOI: https://doi.org/10.1007/s42979-022-01393-6
