Abstract
The digitization of historical newspapers is crucial for preserving cultural heritage and for making it accessible to natural language processing and information retrieval tasks. A key challenge in digitizing old newspapers is article separation: identifying and extracting individual articles from scanned newspaper images and retrieving their semantic structure. This is a critical step in making historical newspapers machine-readable and searchable, and it enables tasks such as information extraction, document summarization, and text mining. In this work, we assess NewsEye Article Separation (NAS), a multilingual dataset for article separation in historical newspapers. It consists of scanned newspaper pages from the \(19^{th}\) and \(20^{th}\) centuries, together with annotation files, in German, Finnish, and French. The dataset is challenging because its layouts and font styles vary widely, which makes it difficult for models to generalize to unseen data. We also introduce new metrics (article error rate, article coverage score, proper predicted article, and segmentation) to evaluate the performance of models trained on NAS and to highlight the relevance and challenges of this dataset. We believe that NAS, which is publicly available, will be a valuable resource for researchers working on historical newspaper digitization.
Acknowledgement
This work has been supported by the ANNA (2019-1R40226), TERMITRAD (AAPR2020-2019-8510010), Pypa (AAPR2021-2021-12263410), and Actuadata (AAPR2022-2021-17014610) projects funded by the Nouvelle-Aquitaine Region, France. We would also like to thank our colleagues Max Weidemann, Johannes Michael, and Roger Labahn for their review, valuable suggestions, and insightful comments, which greatly contributed to the improvement of this manuscript.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Girdhar, N., Coustaty, M., Doucet, A. (2023). Benchmarking NAS for Article Separation in Historical Newspapers. In: Goh, D.H., Chen, SJ., Tuarob, S. (eds) Leveraging Generative Intelligence in Digital Libraries: Towards Human-Machine Collaboration. ICADL 2023. Lecture Notes in Computer Science, vol 14457. Springer, Singapore. https://doi.org/10.1007/978-981-99-8085-7_7
DOI: https://doi.org/10.1007/978-981-99-8085-7_7
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8084-0
Online ISBN: 978-981-99-8085-7
eBook Packages: Computer Science, Computer Science (R0)