Document Layout Annotation: Database and Benchmark in the Domain of Public Affairs

Peña, Alejandro; Morales, Aythami; Fierrez, Julian; Ortega-Garcia, Javier; Grande, Marcos; Puente, Íñigo; Córdova, Jorge; Córdova, Gonzalo

doi:10.1007/978-3-031-41501-2_9

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14194)

Included in the following conference series:

International Conference on Document Analysis and Recognition

Abstract

Every day, thousands of digital documents are generated with useful information for companies, public organizations, and citizens. Since processing them manually is infeasible, automatic processing of these documents is becoming increasingly necessary in certain sectors. However, this task remains challenging, since in most cases text-only parsing is not enough to fully understand information presented through different components of varying significance. In this regard, Document Layout Analysis (DLA), which aims to detect and classify the basic components of a document, has been an interesting research field for many years. In this work, we use a procedure to semi-automatically annotate digital documents with different layout labels, including 4 basic layout blocks and 4 text categories. We apply this procedure to collect a novel database for DLA in the public affairs domain, using a set of 24 data sources from the Spanish Administration. The database comprises 37.9K documents with more than 441.3K document pages, and more than 8M labels associated with 8 layout block units. The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
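To give a sense of scale, the figures quoted in the abstract can be turned into rough per-document and per-page averages. The sketch below is only illustrative: the constant and function names are hypothetical, and because the abstract reports the page and label counts as "more than" a given value, the resulting averages are approximations rather than figures reported by the authors.

```python
# Approximate database sizes quoted in the abstract. Pages and labels are
# given there as "more than" these values, so treat them as rough figures.
NUM_DOCUMENTS = 37_900
NUM_PAGES = 441_300
NUM_LABELS = 8_000_000


def dataset_averages(documents: int, pages: int, labels: int) -> dict:
    """Return rough average densities derived from the three totals."""
    return {
        "pages_per_document": pages / documents,
        "labels_per_page": labels / pages,
        "labels_per_document": labels / documents,
    }


averages = dataset_averages(NUM_DOCUMENTS, NUM_PAGES, NUM_LABELS)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

Under these assumptions, the database averages a little under 12 pages per document and roughly 18 labels per page.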

Acknowledgments

This work was supported by VINCES Consulting under the project VINCESAI-ARGOS and BBforTAI (PID2021-127641OB-I00 MICINN/FEDER). The work of A. Peña is supported by an FPU Fellowship (FPU21/00535) from the Spanish MIU.

Author information

Authors and Affiliations

BiDA - Lab, Universidad Autónoma de Madrid (UAM), 28049, Madrid, Spain
Alejandro Peña, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia & Marcos Grande
VINCES Consulting, 28010, Madrid, Spain
Íñigo Puente, Jorge Córdova & Gonzalo Córdova

Corresponding author

Correspondence to Alejandro Peña.

Editor information

Editors and Affiliations

University of La Rochelle, La Rochelle, France
Mickael Coustaty
Autonomous University of Barcelona, Bellaterra, Spain
Alicia Fornés

Annex

Table 4. List of data sources used to collect the PAL Database.

Cite this paper

Peña, A. et al. (2023). Document Layout Annotation: Database and Benchmark in the Domain of Public Affairs. In: Coustaty, M., Fornés, A. (eds) Document Analysis and Recognition – ICDAR 2023 Workshops. ICDAR 2023. Lecture Notes in Computer Science, vol 14194. Springer, Cham. https://doi.org/10.1007/978-3-031-41501-2_9

DOI: https://doi.org/10.1007/978-3-031-41501-2_9
Published: 15 August 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41500-5
Online ISBN: 978-3-031-41501-2
eBook Packages: Computer Science, Computer Science (R0)
