Data Centric Domain Adaptation for Historical Text with OCR Errors

März, Luisa; Schweter, Stefan; Poerner, Nina; Roth, Benjamin; Schütze, Hinrich

doi:10.1007/978-3-030-86331-9_48

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12822)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings, and we address OCR errors by injecting synthetic OCR errors into the source domain, a form of data-centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Our cross-domain as well as our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
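
The idea of injecting synthetic OCR errors into clean source-domain text can be illustrated with a minimal sketch. This is not the method of the paper: the confusion table, the function name inject_ocr_noise, and the error_rate parameter are hypothetical and chosen only for illustration. The sketch replaces characters inside tokens so that the number of tokens, and therefore the alignment with NER labels, stays unchanged.

```python
import random

# Hypothetical confusion table: clean character -> plausible OCR misreadings.
# These pairs are illustrative assumptions, not error statistics from the paper.
CONFUSIONS = {
    "e": ["c", "o"],
    "l": ["1", "i"],
    "m": ["rn"],
    "s": ["f"],
    "u": ["n"],
}


def inject_ocr_noise(tokens, error_rate=0.1, seed=0):
    """Return a noisy copy of tokens.

    Each character that has an entry in CONFUSIONS is replaced with
    probability error_rate. The token count never changes, so token-level
    NER labels remain aligned with the noisy tokens.
    """
    rng = random.Random(seed)  # local generator keeps the result reproducible
    noisy = []
    for token in tokens:
        chars = []
        for ch in token:
            if ch in CONFUSIONS and rng.random() < error_rate:
                chars.append(rng.choice(CONFUSIONS[ch]))
            else:
                chars.append(ch)
        noisy.append("".join(chars))
    return noisy


# Made-up example sentence with made-up labels.
tokens = ["Le", "ministre", "arrive", "à", "Bruxelles"]
labels = ["O", "O", "O", "O", "B-LOC"]
noisy_tokens = inject_ocr_noise(tokens, error_rate=0.3, seed=42)
for clean, noisy, label in zip(tokens, noisy_tokens, labels):
    print(clean, noisy, label)
```

A realistic variant would estimate the confusion table and its probabilities from aligned OCR output and ground truth rather than hard-coding them.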

Acknowledgement

This work was funded by the European Research Council (ERC #740516).

Author information

Authors and Affiliations

Center for Information and Language Processing, Ludwig Maximilian University, Munich, Germany
Luisa März, Nina Poerner & Hinrich Schütze
Digital Philology, Research Group Data Mining and Machine Learning, University of Vienna, Vienna, Austria
Luisa März & Benjamin Roth
Bayerische Staatsbibliothek München, Digital Library/Munich Digitization Center, Munich, Germany
Stefan Schweter
NLP Expert Center, Data:Lab, Volkswagen AG, Munich, Germany
Luisa März

Corresponding author

Correspondence to Luisa März.

Editor information

Editors and Affiliations

Universitat Autònoma de Barcelona, Barcelona, Spain
Josep Lladós
Lehigh University, Bethlehem, PA, USA
Daniel Lopresti
Kyushu University, Fukuoka-shi, Japan
Seiichi Uchida

A Appendix

Detailed Information About Experiments and Data

All experiments run on a single GeForce GTX 1080Ti GPU, with an average runtime of 12 h per experiment. The French and Dutch baseline models (NN base) have 15,895,683 parameters each. The French NN ensemble model has 88,264,777 parameters and the Dutch NN ensemble has 96,895,161 parameters.

The Europeana Newspaper Corpus is split 80/10/10 into train/dev/test (Table 6). The downsampled French WikiNER corpus is split 70/15/15 into train/dev/test, and the Dutch CoNLL-02 corpus is used with the split provided in its original version. The data can be downloaded from https://github.com/stefan-it/historic-domain-adaptation-icdar.
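
A ratio-based split such as 80/10/10 or 70/15/15 can be sketched as follows. The function split_corpus and its parameters are hypothetical, and the sketch assumes a shuffled, sentence-level split; the text above does not state whether the original split was shuffled or at which level it was made.

```python
import random


def split_corpus(sentences, percentages=(80, 10, 10), seed=0):
    """Split a list of sentences into train/dev/test parts.

    percentages are integers summing to 100. Integer arithmetic avoids
    floating-point rounding surprises; any remainder goes to the test part.
    Shuffling at sentence level is an assumption of this sketch.
    """
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    rng = random.Random(seed)
    shuffled = list(sentences)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n * percentages[0] // 100
    n_dev = n * percentages[1] // 100
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


# Placeholder sentence IDs stand in for real annotated sentences.
corpus = [f"sentence-{i}" for i in range(100)]
train, dev, test = split_corpus(corpus, percentages=(80, 10, 10))
print(len(train), len(dev), len(test))
```

Passing percentages=(70, 15, 15) gives the ratio stated for the downsampled French WikiNER corpus.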

Table 6. Number of tokens for each data split.


Cite this paper

März, L., Schweter, S., Poerner, N., Roth, B., Schütze, H. (2021). Data Centric Domain Adaptation for Historical Text with OCR Errors. In: Lladós, J., Lopresti, D., Uchida, S. (eds) Document Analysis and Recognition – ICDAR 2021. ICDAR 2021. Lecture Notes in Computer Science, vol 12822. Springer, Cham. https://doi.org/10.1007/978-3-030-86331-9_48

DOI: https://doi.org/10.1007/978-3-030-86331-9_48
Published: 02 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86330-2
Online ISBN: 978-3-030-86331-9
eBook Packages: Computer Science, Computer Science (R0)
