Benchmarking NAS for Article Separation in Historical Newspapers

  • Conference paper
Leveraging Generative Intelligence in Digital Libraries: Towards Human-Machine Collaboration (ICADL 2023)

Abstract

The digitization of historical newspapers is crucial for preserving cultural heritage and making it accessible to natural language processing and information retrieval tasks. A key challenge in digitizing old newspapers is article separation: identifying and extracting individual articles from scanned newspaper images and retrieving their semantic structure. It is a critical step in making historical newspapers machine-readable and searchable, enabling tasks such as information extraction, document summarization, and text mining. In this work, we assess NewsEye Article Separation (NAS), a multilingual dataset for article separation in historical newspapers. It consists of scanned newspaper pages from the 19th and 20th centuries and annotation files in German, Finnish, and French. The dataset is challenging because of its varying layouts and font styles, which make it difficult for models to generalize to unseen data. We also introduce new metrics (article error rate, article coverage score, proper predicted article, and segmentation) to evaluate the performance of models trained on NAS and to highlight the relevance and challenges of this dataset. We believe that NAS, which is publicly available, will be a valuable resource for researchers working on historical newspaper digitization.

Notes

  1. https://www.newseye.eu/

  2. https://transkribus.eu/Transkribus

  3. https://github.com/NancyGirdhar/AS_EvaluationMetrics

  4. https://github.com/CITlabRostock/citlab-article-separation-new

Acknowledgement

This work has been supported by the ANNA (2019-1R40226), TERMITRAD (AAPR2020-2019-8510010), Pypa (AAPR2021-2021-12263410), and Actuadata (AAPR2022-2021-17014610) projects funded by the Nouvelle-Aquitaine Region, France. We would also like to thank our colleagues Max Weidemann, Johannes Michael, and Roger Labahn for their review, valuable suggestions, and insightful comments, which greatly contributed to the improvement of this manuscript.

Author information

Corresponding author

Correspondence to Nancy Girdhar.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Girdhar, N., Coustaty, M., Doucet, A. (2023). Benchmarking NAS for Article Separation in Historical Newspapers. In: Goh, D.H., Chen, SJ., Tuarob, S. (eds) Leveraging Generative Intelligence in Digital Libraries: Towards Human-Machine Collaboration. ICADL 2023. Lecture Notes in Computer Science, vol 14457. Springer, Singapore. https://doi.org/10.1007/978-981-99-8085-7_7

  • DOI: https://doi.org/10.1007/978-981-99-8085-7_7

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8084-0

  • Online ISBN: 978-981-99-8085-7

  • eBook Packages: Computer Science, Computer Science (R0)
