1 Introduction

Several research and application fields require annotated datasets to advance the development of intelligent systems. Among many, ImageNet [1] enabled the growth of novel approaches that have guided the creation of some of the most modern learning systems. Document Image Analysis and Recognition (DIAR) is no exception, and several well-known benchmark datasets allowed researchers to advance the state of the art in DIAR and, more generally, in pattern recognition. For instance, in the 1990s the NIST [2] and MNIST [3] datasets of handwritten digits were instrumental for significant advances in pattern recognition techniques [4]. DIAR is not limited to isolated character recognition, but encompasses several tasks ranging from pre-processing to layout analysis, with the overall aim of achieving document understanding in many application domains [5]. Many application areas deal with proprietary data that cannot be made publicly available due to copyright and privacy issues, as is the case for financial documents or health records. These difficulties, along with the annotation effort required for large quantities of documents, are usually the main challenges faced when creating a new benchmark dataset for DIAR.

Among other tasks, Document Layout Analysis (DLA) research advanced significantly in the 1990s thanks to a new collection of scanned pages of scientific articles. Similar to NIST, the UW datasets [6, 7] set a milestone for evaluating research progress. Thereafter, scientific articles have been widely used as benchmark sources of data due to their availability, in terms of both quantity and accessibility, and their rich semantic structure that allows researchers to focus on different tasks in the document understanding pipeline, e.g., Table Detection (TD) and Table Recognition (TR). Although some tasks are nowadays essentially solved (e.g., physical layout analysis, that is, the identification of homogeneous regions of text in the page), there is still room for research in the analysis of challenging regions of documents (e.g., tables and graphical illustrations), as well as in the overall understanding of scientific articles published with uncommon styles and layouts.

In this paper, we aim to provide a guide to the different datasets that have been proposed over the past 30 years to support research on DLA over scientific articles: other recent surveys focused either on historical document collections [8] or on state-of-the-art methods for page object detection [9], and we refer the reader to them for a broader overview of the DIAR field. In addition to a comprehensive inventory of datasets, highlighting their strengths and limitations, we focus our attention on the annotation procedures that have been proposed for such collections, with an analysis of the advantages and disadvantages of the different approaches. We also review some of the most important state-of-the-art methods tested on the collections covered in this survey, but for a complete overview of DLA methods we refer the reader to the dedicated survey of the most important approaches proposed to tackle this task [10].

We focus on DLA of scientific articles, for three main reasons:

  • to investigate the most used annotation procedures, along with the challenges of creating large datasets with high-quality annotations;

  • to provide researchers with an overview of available datasets along with their details, to help the identification of the best suitable benchmarks to develop and test novel algorithms;

  • to outline how document collections have evolved over time, posing questions and open problems that could further advance DIAR research.

The paper is organized as follows. In Sect. 2, we describe DLA and the principal techniques used to tackle it, with a particular focus on the analysis of scientific articles. Then, a detailed description of annotation procedures is given in Sect. 3. After an overview of the collections reported in this survey (Sect. 4), we divide them into three main categories, starting with small-scale fully annotated datasets in Sect. 5, which mostly contain scanned documents. Then, in Sect. 6, partially annotated collections are described, focused only on challenging parts of scientific documents such as tables and figures. Finally, large-scale fully annotated datasets are listed in Sect. 7. For completeness, in Sect. 8 we provide a broader overview of significant datasets related to DLA for different types of documents, and in Sect. 9 we discuss and summarize the impact and complexity of each collection presented in this survey, along with the latest state-of-the-art methods tested on the datasets. We then discuss open problems and challenges in the field of DLA for scientific articles in Sect. 10. Finally, we outline the conclusions in Sect. 11.

Table 1 Acronyms of tasks addressed in layout analysis of scientific articles

2 Document layout analysis of scientific articles

In addition to Optical Character Recognition (OCR) of printed or handwritten characters, one of the most investigated tasks in Document Image Analysis and Recognition has been Document Layout Analysis, which aims at finding regions in a page, such as text or figures (physical layout analysis), and at recognizing and classifying them, e.g., labeling text blocks as title or paragraph (logical layout analysis). In physical layout analysis, the aim is to identify homogeneous regions, usually by means of bounding boxes [10, 11]. Since many tasks can be addressed in DLA of scientific articles, we summarize the main ones in Table 1 together with the acronyms used in this paper.

Over the years, several methods have been proposed to solve DLA, following the application of novel techniques and the gathering of larger collections of annotated data. From the early 1990s up to today, the different techniques can be broadly divided into three main groups: heuristics, statistical machine learning, and deep learning methods. The first two groups are described in [12], which divides the different approaches according to two criteria.

The first criterion refers to how the document is analyzed, using either bottom-up, top-down, or hybrid techniques. Bottom-up techniques start from information at the pixel level and iteratively group it into larger areas, from connected components (CCs) up to larger meaningful areas of text or non-text (e.g., figures). Representative algorithms from this group are RLSA [13], Docstrum [14], and Voronoi diagrams [15]. On the contrary, top-down techniques start from the whole document and split it in subsequent steps until basic components are found, as in the X-Y cut algorithm [16]. Finally, hybrid methods are compositions of the aforementioned ones.
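
To make the top-down strategy concrete, the following minimal Python sketch (our own illustration, not code from the cited works) recursively splits a binary page image along the widest empty band of its horizontal and vertical projection profiles, in the spirit of the X-Y cut algorithm [16]; the thresholds and the middle-of-the-gap cut heuristic are illustrative assumptions.

import numpy as np

def xy_cut(page, min_gap=10, min_size=20):
    """Recursively split a binary page image (1 = ink, 0 = background) into
    blocks by cutting along the widest empty band of its projection profiles.
    Returns a list of (top, left, bottom, right) boxes in page coordinates."""
    h, w = page.shape
    if h < min_size or w < min_size or page.sum() == 0:
        return [(0, 0, h, w)]

    def widest_gap(profile):
        # longest run of empty rows/columns, if longer than min_gap
        best, run = None, 0
        for i, v in enumerate(profile):
            run = run + 1 if v == 0 else 0
            if run > min_gap and (best is None or run > best[1]):
                best = (i - run + 1, run)
        return best  # (start index, length) or None

    for axis, profile in ((0, page.sum(axis=1)), (1, page.sum(axis=0))):
        gap = widest_gap(profile)
        if gap is None:
            continue
        cut = gap[0] + gap[1] // 2  # cut in the middle of the empty band
        halves = ([(page[:cut], (0, 0)), (page[cut:], (cut, 0))] if axis == 0
                  else [(page[:, :cut], (0, 0)), (page[:, cut:], (0, cut))])
        boxes = []
        for part, (dy, dx) in halves:
            for t, l, b, r in xy_cut(part, min_gap, min_size):
                boxes.append((t + dy, l + dx, b + dy, r + dx))
        return boxes
    return [(0, 0, h, w)]  # no cut found: the region is a leaf block

Bottom-up methods such as RLSA or Docstrum proceed in the opposite direction, merging pixels or connected components into progressively larger homogeneous regions.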

The second criterion distinguishes the techniques according to what is analyzed: either the physical or the logical document layout. The former aims at identifying homogeneous regions in the page, while the latter assigns functional information, i.e., a label, to these regions. Methods are categorized in these terms depending on the downstream task they are used for. To cite a few examples, Strouthopoulos and Papamarkos [17] propose an Artificial Neural Network (ANN) to classify \(8 \times 8\) document patches as graphics or halftones. Wu et al. [18] segment text regions using a series of split-or-merge operations guided by a binary SVM classifier. Once the page objects are segmented and/or classified, some post-processing techniques can be applied to generalize the results over different layouts [10]. It is worth noticing that most methods for layout analysis have been demonstrated and tested on collections of digitized scientific articles.

More recently, deep learning techniques have also been used for DLA, taking advantage of larger document collections. In a recent paper summarizing models, tasks, and datasets for document AI [19], the most important techniques are listed and divided into three broad categories: Convolutional Neural Networks (CNN), Graph Neural Networks (GNN), and Transformers. For instance, Faster R-CNN [20]- and Mask R-CNN [21]-based architectures have been widely used across several benchmarks to detect page objects, while LayoutLM [22] has been the first transformer-based architecture applied to DIAR. Graphs have mainly been used for information extraction over administrative documents [23, 24] and for TD/TR [25]. In the ICDAR2021 competition [26], Zhang et al. [27] achieved the state of the art on the Document Layout Recognition track, proposing a multi-modal Mask R-CNN-based object detection framework that makes use of vision, language, and geometry. A more detailed description of the state-of-the-art methods tested on the datasets presented in this survey can be found in Sect. 9.3.

Fig. 1

Different pipelines to create labeled data depend on the combination of three main factors: the sources from which the data are crawled (left), which data are available (center), and which annotation procedures are involved (right). Synthetic documents do not require any specific annotation procedure, since labels come with the generated data

It is useful to observe that the majority of methods used so far for DLA strongly rely on supervised learning. This is why the availability of labeled data has always been both an important and a complex issue to address in DIAR. There are two main problems related to annotated collections of documents: (i) not all types of documents are publicly available due to, for example, policy issues; therefore, many benchmarks are composed of scientific articles; (ii) not all available data come with structured information suitable for automatic annotation, forcing a choice between manually inspecting a small amount of data and excluding a vast amount of unlabeled documents. These problems strongly affect the proposed DLA frameworks in terms of both robustness and generalization, due to the lack of variability in available benchmarks. A solution to fill the gap between expensive annotation procedures and large automatically labeled collections is the generation of synthetic data that, by construction, come with annotations (e.g., [28]). Even if this solution opens new possibilities, it is not easy to generate data that faithfully resemble real documents and to prevent the trained algorithms from dropping drastically in performance in the wild.

In the next sections, we describe the peculiarities of scientific document datasets and then analyze traditional and novel annotation procedures, including some generative methods.

Table 2 Main tools used to support annotation

3 Annotation procedures

As discussed above, the state-of-the-art techniques proposed to tackle DLA are mostly supervised, thus requiring carefully annotated data. In recent years in particular, the increasing demand for larger datasets for deep learning methods has opened new challenges on how to annotate such collections. Over the years, annotation procedures have been proposed that try to maximize two usually inversely proportional measures, the quality and the amount of annotations, while also taking into account time, cost, and data variability. As summarized in Fig. 1, we broadly categorize the annotation procedures into three classes:

  • Manual. A designated team is given a set of rules to be followed, along with an annotation tool to both help them and to output the labels in a standard format.

  • Automatic. A set of techniques, algorithmic or based on machine learning, that scale with the amount of data to be annotated. The main constraint is that PDF articles must be accompanied by additional structured files, such as LaTeX, XML, or HTML. Human intervention is usually limited to checking the quality of a subset of randomly selected samples.

  • Generative. Generative models can synthesize data from documents of a given domain. Along with generated samples, annotations are also generated, exploiting deep learning and transformer architectures.

Fig. 2

General scheme adopted by automatic annotation. The upper part summarizes the two approaches based on recoloring the LaTeX source code and recompiling: both “Content Recoloring” and “Regions Recoloring” make use of RegEx matching to find target commands, injecting custom code, and compiling a new colored version of the PDF. Then, using RGB values, regions and labels are extracted. In the lower part, XML is used to give logical meaning to the content and locations extracted from the PDF through PDFMiner-based [39] tools. To match XML and PDF content, algorithms such as Levenshtein distance and Bag of Words are usually involved

In Table 2, we list the tools for document annotation used in the literature. We report whether they can be used through an online service, a local application, or an API. We also summarize the annotation process for each tool, outlining the input data, the procedure, and the output data format.

3.1 Manual annotations

Manual techniques do not scale well with the dataset size. In addition, guidelines need to be defined so that different annotators follow the same rules, producing coherent and homogeneous annotations across the data. This approach has been the foundation of most datasets across different machine learning fields; today, it is often not preferred due to its high cost both in terms of money and time. However, researchers continue to propose manually annotated data, since human supervision is capable of yielding high-quality annotations and of gathering important information that is not easily obtainable through automatic annotation.

3.2 Automatic annotations

PDF files of scientific articles are often not semantically structured: starting from a PDF, it is not trivial to access information such as text, tables, and titles, or to detect them in the page layout. However, scientific documents are often stored in other structured formats along with the PDF. The information in these files, such as LaTeX and XML, makes it possible to automatically identify objects in the layout. The general annotation scheme in this case is depicted in Fig. 2. We describe in detail the two main annotation approaches in the next subsections, while the differences between datasets can be found in their respective sections. We also include in this category semi-automatic annotation procedures, i.e., automatic ones complemented by human supervision during the process.

3.2.1 From LaTeX format

ArXiv is a huge collection of scientific documents made available by the authors, and many researchers collect data from it to build document datasets. Some datasets presented in this survey [44,45,46] have been built starting from arXiv, because most papers in the repository are stored with both PDF and LaTeX files, which allows researchers to implement automatic annotation pipelines that are much faster than human annotators. This approach is visually described in the top part of Fig. 2: the source LaTeX code is injected with custom commands (typically recoloring macros) around detected target regions such as tables, lists, and images. Then, a new version of the PDF is compiled and, leveraging information extraction tools that can identify colored text, regions are extracted in terms of bounding boxes and labels. The two main approaches are based on content or region recoloring, according to the granularity of the colored items: individual words in the first case and bounding boxes of regions in the second. The main drawback of automatic annotation based on LaTeX is that changing the source code sometimes results in either layout alterations or compilation errors.
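
The following minimal Python sketch illustrates the injection step under simplifying assumptions: the regular expressions, the xcolor-based recoloring command, the set of target environments, and the file names are illustrative and do not reproduce the exact rules used by the cited datasets.

import re

COLOR_PREAMBLE = r"\usepackage{xcolor}"  # needed for the \color command

# Illustrative mapping from target environments to marker RGB colors;
# real pipelines typically use one distinct color per region class.
TARGETS = {"table": (255, 0, 0), "figure": (0, 255, 0), "itemize": (0, 0, 255)}

def inject_recoloring(tex_source: str) -> str:
    """Wrap every target environment in a color group so that, after
    recompilation, its content can be located in the PDF by RGB value."""
    out = tex_source.replace(r"\begin{document}",
                             COLOR_PREAMBLE + "\n" + r"\begin{document}", 1)
    for env, (r, g, b) in TARGETS.items():
        begin = re.compile(r"\\begin\{%s\}" % env)
        end = re.compile(r"\\end\{%s\}" % env)
        color = r"\color[RGB]{%d,%d,%d}" % (r, g, b)
        # open a group, switch color, and close the group after the environment
        out = begin.sub(lambda m: "{" + color + m.group(0), out)
        out = end.sub(lambda m: m.group(0) + "}", out)
    return out

if __name__ == "__main__":
    with open("paper.tex") as f:            # hypothetical input file
        recolored = inject_recoloring(f.read())
    with open("paper_recolored.tex", "w") as f:
        f.write(recolored)
    # The recolored source is then compiled, and the colored words or regions
    # are read back from the resulting PDF, e.g., with a PDFMiner-based script.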

3.2.2 From XML format

It is possible to approach automatic annotation also by considering XML files together with the corresponding PDF files. This is the case for datasets created from PubMed (e.g., [47] [48] [49]). PubMed Central (PMC) is a free full-text archive of biomedical and life sciences journal literature hosted at the US National Institutes of Health's National Library of Medicine (NIH/NLM), and it has been freely accessible since 1997. In PubMed, documents are provided in both PDF and XML formats, making it possible to integrate the information contained in the two. As depicted in the bottom part of Fig. 2, region coordinates and text are first extracted from the PDF using PDFMiner-based tools. Then, using string matching techniques such as Levenshtein distance or Bag of Words (BoW), different regions are associated with different tags. Finally, labels are attached to the final layout annotations. The drawback of automatic annotation based on XML is that this structured information is not as widely available as LaTeX for scientific documents and cannot be easily edited to compile a modified PDF.
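
A minimal sketch of this matching idea is shown below: it relies on pdfminer.six to extract text blocks and their bounding boxes from the PDF, and uses Python's difflib as a stand-in for Levenshtein or BoW similarity; the XML traversal and the similarity threshold are illustrative assumptions rather than the exact procedure of any cited dataset.

from difflib import SequenceMatcher
from xml.etree import ElementTree

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

def pdf_text_boxes(pdf_path):
    """Extract (text, bounding box) pairs for every text block in the PDF."""
    boxes = []
    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                boxes.append((element.get_text().strip(), element.bbox))
    return boxes

def match_xml_regions(xml_path, pdf_path, threshold=0.8):
    """Assign to each labeled XML element the bounding box of the most similar
    PDF text block (string similarity stands in for Levenshtein/BoW matching)."""
    boxes = pdf_text_boxes(pdf_path)
    annotations = []
    for node in ElementTree.parse(xml_path).iter():
        target = "".join(node.itertext()).strip()
        if not target or not boxes:
            continue
        best = max(boxes,
                   key=lambda tb: SequenceMatcher(None, tb[0], target).ratio())
        if SequenceMatcher(None, best[0], target).ratio() >= threshold:
            annotations.append({"label": node.tag, "bbox": best[1]})
    return annotations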

Fig. 3

Datasets overview based on the different annotation procedures, enriched with information on where the documents have been crawled from. In particular, SciBank and TableBank explicitly state that some human intervention occurred for quality checking. TableBank and DeepFigures used automatic procedures based on both LaTeX and XML

3.3 Generative models

An alternative solution is the generation of synthetic documents together with their annotations. Using generative methods, it is possible to automatically create annotations for an arbitrary amount of data, with layout and content variations. These approaches can use automatically or manually labeled documents as a starting point, and they learn how to create new layouts and contents that are reasonably similar to real ones. However, despite recent improvements, creating synthetic documents is still a difficult task.

Differently from manual and automatic procedures, in this case it is not possible to outline a general approach. Even if none of the datasets presented in this survey is generated, we found it interesting to review, to the best of our knowledge, the most important contributions developed so far in this direction. Document Domain Randomization (DDR) [50] is a method that attempts to create document pages by diversifying real document examples. DDR models the textual and non-textual contents of a starting set of data following user-defined rules concerning, e.g., document layouts, fonts, and styles. DocSynth [51] is a model which, given a set of document images, generates new variations leveraging a Generative Adversarial Network (GAN); it generates images to augment existing datasets. Similar to DocSynth, READ (Recursive Autoencoders for Document Layout Generation) [52] is an architecture that relies on a Recurrent Neural Network (RNN) and a Variational Auto-Encoder (VAE). Starting from a training set of document examples with semantic labels, READ is able to generate large quantities of varied and plausible synthetic layouts similar to the original ones.

More recently, transformer-based architectures, building on their success in NLP tasks, have started to be used to learn and generate document layouts as well. LayoutTransformer [53] leverages self-attention to learn the relationships between regions in a given layout; the authors based the generation of new scientific papers on data from the PubLayNet [49] dataset. Inspired by this work, we proposed an approach to generate the layout and content of scientific papers in high-resolution PDF format [54], filling the regions with generated text and with tables and images crawled from [55].

3.4 Discussion

One of the most challenging aspects when creating a new dataset is how to annotate the data so that the procedure scales with the size of the collection and the data include a good amount of variability. Manual annotations are time consuming and expensive, but ensure good quality; on the contrary, automatic procedures can generate more annotated data, but often require human intervention due to errors. Document generation could be a good alternative to overcome the pitfalls of the aforementioned approaches, but it is still a complex task: it therefore represents an excellent opportunity to advance research in this direction.

4 Categorization of datasets

In this survey, we focus on scientific document collections, since they have been widely used by the research community to test and compare different approaches for DLA. Other related datasets, either considering scientific papers for different tasks or addressing DLA for different documents, are summarized in Sect. 8.

Following the description of the annotation procedures given above, we broadly divide the datasets as shown in Fig. 3. This graphic visualizes the different procedures used to create the datasets. Generative techniques are used during model training for several purposes and, to the best of our knowledge, there are no published collections to be reported here.

Moreover, considering what has been annotated in addition to how, we divide the datasets into three categories:

  • small-scale fully annotated, where manual annotation is commonly used and document images come from scanners or digital cameras.

  • partially annotated documents, where the focus is mainly on finding and recognizing specific regions, such as tables or figures, two important objects for DLA.

  • large-scale fully annotated, where automatic labeling is preferred because of the huge amount of data.

These three main approaches for building datasets also roughly follow the research timeline of the last 30 years. In the timeline shown in Fig. 4, the datasets are arranged in chronological order and labeled according to the previous categories. The first proposed datasets contained manually annotated scanned documents, while the most recent collections are built automatically and in general contain digital-born documents. Small-scale fully annotated datasets (Sect. 5) are represented as red triangles, partially annotated datasets (Sect. 6) as green circles, and large-scale fully annotated datasets (Sect. 7) as blue squares.

Fig. 4

Timeline of small-scale fully annotated (red triangles), partially annotated (green circles), and large-scale fully annotated (blue squares) datasets. On the y-axis, the size of each dataset is shown on a logarithmic scale. Symbols are scaled according to the size of the corresponding dataset

Table 3 Small-scale fully annotated document collections comparison

5 Small-scale fully annotated datasets

In this section, we present small datasets, i.e., with at most on the order of 10,000 pages. It is no coincidence that these collections are mostly composed of scanned documents, considering the resources required to obtain and annotate the data. Moreover, these datasets were released before 2009, except for ScanBank [57] (2021), which is also the largest among them. These datasets are summarized and compared in Table 3.

5.1 University of Washington datasets

In the 1990s, the research group led by Prof. Haralick released three datasets aimed at fostering research on the development of OCR tools and on document analysis in general. The aim was certainly achieved, since the three datasets coming out of this work have been widely used and established the practice of relying on benchmark datasets and open evaluation metrics to design and evaluate novel algorithms [6, 7]. In short, the UW datasets served as a model for the subsequent collections of data arranged by researchers in the community.

The English Document Database I was released in 1993 and consists of scanned pages from English technical journals randomly selected from the University of Washington library. The dataset was distributed as CD-ROM delivered by express mail. It contains image files (bi-level and gray scale) and corresponding ground-truth information. The images came from scanned documents (from the journals themselves or photocopies) and synthetic ones generated with LaTeX. Additional information is related to the noise level in the document and metadata characterizing the provenance of the document.

Concerning the ground-truth information, pages are first decomposed into “header,” “footer,” and “live-matter” areas. The header is text appearing at the top of the page; for scientific journals, it usually includes information such as the name of the article, the journal, the authors, and the page number. Apart from the “footer,” the remaining part of the page is referred to as the “live matter.”

Each of these zones is described by its bounding box and is often decomposed at a finer level into sub-zones of different types (e.g. text, figure, tables, half-tones, and mathematical equations) that are again described by their bounding boxes. For each zone, additional information is provided, including zone semantic meaning (e.g., for text zones we can have a section heading or a reference list item), the dominant font, the font style, etc. At the lower level, for each text zone, its contents in terms of ASCII text are provided, using escape sequences for non-ASCII symbols.

5.2 Medical Article Records Groundtruth

The Medical Article Records Groundtruth Dataset (MARG) contains bi-level images mostly coming from journal articles [56]. Ground-truth information is gathered from the output of the MARS (Medical Article Records System) system, which combines scanning, OCR, document image analysis, and lexical analysis algorithms. MARS has been developed by an R&D team at the National Library of Medicine (NLM) and can automatically extract bibliographic metadata from paper-based biomedical journals to populate the Library's database (Medline). The ground-truth data, provided in XML, include page, zone, line, word, and character-level information.

The MARG dataset contains only title pages where the title, authors, affiliation, and abstract are organized in nine different layouts. For instance, in the layout of type “A” the title, authors, affiliation, and abstract are centered at the top of the title page and the rest of the page has a two-column layout.

5.3 PRImA

The dataset produced by the PRImA research group [29] is a realistic and highly accurate collection of several documents with a wide variety of layouts. It is easily accessible through a web application, allowing researchers to identify subsets of interest. One subset was used for the ICDAR2009 Page Segmentation Competition [58]. The dataset is composed of scientific publications, magazines, and technical journals from mainstream publications related to news, business, and technology.

It contains 1240 ground-truth images in a 7 to 1 ratio in favor of magazine pages [29]; as of September 2022, the available set is composed of 305 images, with 265 magazine pages and 40 technical articles. Metadata associated with each document contain administrative and bibliographic information (title, publication, author, copyright holder, etc.), scanning features, and content information such as the presence of images and/or graphical items, the number of columns, and the variety of font sizes. The dataset was built from scans of documents at 300 dpi in 24-bit color. Steps to eliminate skew were applied both during digitization and after scanning. The scans were then cropped and binarized.

The ground truth is stored in an XML-based format whose schema is part of the PAGE [30] image representation framework. Different region types are represented: text, image, line drawing, graphic, table, chart, separator, maths, noise, and frame. The textual regions are hierarchically divided into paragraphs, text lines, words, and glyphs. Each region is outlined with a polygon having only horizontal and vertical edges. For each region type, different information is collected; text regions are described by language, font, reading direction, text color, background color, and logical label. Trained operators added the region boundaries and region-level metadata through the semi-automated ground-truthing tool Aletheia. After a set of non-overlapping imprecise outlines is marked by the operator, Aletheia fits the set to the region contents like a shrink wrap operation. The tool can outline pages, regions, text lines, words, and glyphs.
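
A minimal sketch of how such PAGE ground truth can be consumed is given below; it reads region polygons with the standard library XML parser, assuming the usual structure of a Coords child element carrying a points attribute (the namespace URI differs across PAGE schema versions, so it is read from the root element).

from xml.etree import ElementTree

def parse_page_regions(page_xml_path):
    """Read region polygons and labels from a PAGE ground-truth file."""
    root = ElementTree.parse(page_xml_path).getroot()
    ns = root.tag.split("}")[0].strip("{")   # namespace URI of this PAGE version
    regions = []
    for region in root.iter():
        tag = region.tag.split("}")[-1]
        if not tag.endswith("Region"):       # TextRegion, ImageRegion, TableRegion, ...
            continue
        coords = region.find("{%s}Coords" % ns)
        if coords is None or "points" not in coords.attrib:
            continue
        polygon = [tuple(int(float(c)) for c in p.split(","))
                   for p in coords.attrib["points"].split()]
        regions.append({"region": tag,
                        "subtype": region.attrib.get("type", ""),
                        "polygon": polygon})
    return regions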

The MARG and PRImA datasets are similar to the UW datasets in size, but they concentrate on layout analysis and use more modern annotation formats (XML).

5.4 ScanBank

After years of automatic labeling of digital-born documents (Sects. 6 and 7), the ScanBank dataset [57] collects and annotates scanned electronic theses and dissertations (ETDs) to be used for figure extraction. To motivate the need for this type of dataset, the authors claim that methods such as DeepFigures [47], trained only on digital-born documents, are unable to generalize to scanned ones, which present a more challenging task. A total of 10,182 page images and 3300 figures were obtained across the 70 sampled ETDs, crawled from MIT's DSpace repository. Each PDF page is converted into an image and then labeled manually with the VGG Image Annotator (VIA) [36]. The resolution is scaled to 100 DPI to resemble a “screenshot” of a page from the PDF. Captions, tables of contents, and lists of tables and figures are labeled too, using bounding boxes. The authors also propose some data augmentation techniques to simulate actual scanned documents in a large collection of annotated scientific papers.

Table 4 Partially annotated document collections comparison. Full description of tasks is given in Table 1

6 Partially annotated datasets

In this section, we present datasets intended for layout analysis of only some specific parts of the page, such as tables and figures. Table detection and figure extraction are important and challenging subtasks of layout analysis: they introduce significant variability beyond text regions and require more complex pipelines and techniques. The datasets summarized in Table 4 are relatively large and recent. As shown in Fig. 4, they are temporally located between the scanned datasets and the digital-born datasets. Their size ranges from about 100 pages (CS-150 [60]) to about 1 M pages (TabLeX [66]). It is possible to notice a strong focus of the community on specific page objects, tables in particular, given the large number of collections devoted to them compared to other regions of the document layout. This is mainly due to two reasons: (i) tables come with nearly infinite variations in layout, and finding methods that generalize over them requires a huge amount of labeled data; (ii) tables summarize important information, and their detection and recognition are crucial for document understanding and information retrieval.

6.1 Marmot

Marmot [59] is a large standard Chinese and English dataset for table detection. It contains over 2000 PDF pages collected in a 1:1 proportion from Chinese e-books and English conference and journal papers. The ground-truth data were extracted with a semi-automatic tool, Marmot, and then double-checked by 15 people. The pages show a good variety of language types, page layouts, and table styles. The Chinese e-book pages, provided by the Founder Apabi digital library, contain no more than 15 pages per book and are mostly in a one-column layout. The English pages are in both one-column and two-column layouts. Half of the pages in the dataset contain at least one table, while the other half contain complex layouts that can be mistaken for a table, such as matrices and figures. Each page is described by a labeled ground truth, a page image at 600 dpi, and an XML description of object attributes. The ground-truth data schema defines a set of tags that describe leaves (basic page units corresponding to text, image, or graph contents) and composites (logical components labeled by the tool and including several leaves). Each table is composed of three parts: table caption, table footnote, and table body. Each part contains textline elements (an entire textline not crossing a page column) that in tables correspond to single lines in table cells. Textlines, in turn, contain text characters.

6.2 CS-150 and CS-Large

In 2015, the CS-150 dataset [60] was released to address figure extraction. It is a collection of 50 papers from NIPS 2008–2013, 50 from ICML 2009–2014, and 50 from AAAI 2009–2014, selecting 10 published papers at random from each conference and year. Images, tables, and captions have been labeled by experts using the LabelMe tool [33]. In 2016, the authors released a larger version of the dataset, called CS-Large [61], by randomly sampling 346 papers published after 1999 from Semantic Scholar [68].

The authors also propose PDFFigures 2.0 [61], a tool built on top of the previous version for a wider range of different styles and layouts (tested on CS-Large). It is based on data-driven heuristics, which exploit formatting conventions used consistently in the computer science domain.

6.3 FigureSeer

The end-to-end framework FigureSeer [37] parses result figures to enable search and retrieval of results in research papers. To benchmark the framework, an annotated figure parsing dataset is offered. The dataset is produced using over 20,000 papers from five conferences (CVPR, ICML, ACL, CHI, AAAI) gathered from CiteSeerX indexed by Semantic Scholar. In the papers, there are more than 60,000 figures. Mechanical Turk [69] is used to annotate the figures.

6.4 DeepFigures

DeepFigures [47] is the first large automatically labeled dataset of scientific papers for figure extraction. It counts more than one million scientific papers, with 5.5 million induced labels of table and figure bounding boxes. The corresponding paper introduces one of the first approaches to automatically label large document collections.

Following the general automatic procedure described in Sect. 3.2, DeepFigures uses both LaTeX and XML structured information. Tables, figures, and captions are surrounded by bounding boxes using: (i) for LaTeX sources, the difference between the original and recompiled PDFs in terms of RGB pixel values; (ii) for XML sources, dynamic programming to find the substrings in the PDF text with the smallest Levenshtein distance to the caption text for figures, and bag-of-words similarity for tables (using PDFBox [70]). The overall precision of the induced labels has been evaluated by hiring experts to check randomly selected papers from both sources, resulting in an average precision of 96.8%. The dataset has been used by the authors to train a deep learning method, later deployed in Semantic Scholar to extract figures from 13 million scientific documents.

6.5 SciTSR

SciTSR [62] is a large-scale table structure recognition dataset, which contains 15,000 tables in PDF and their corresponding structure labels obtained from LaTeX source files. The authors identify also a subset of 3600 tables, called SciTSR-COMP, focused on complicated tables containing multiple spanning cells, i.e., a cell that spans over two or more rows/columns.

Differently from the general annotation scheme for LaTeX structured information (Sect. 3.2.1), no recoloring is used here. The authors extract table regions by RegEx matching of the LaTeX table environments and compile each of them into a PDF containing only the table, without recoloring. Finally, they extract cells, columns, and rows by parsing the corresponding LaTeX commands, from which the structure labels for Table Structure Recognition are generated.

6.6 TableBank

TableBank [63] is an image-based table detection and recognition dataset built with a novel weak supervision from Word and LaTeX documents gathered on the Internet. It contains 417,234 high-quality labeled tables from documents taken from a variety of domains. TableBank was built exploiting the Word and LaTeX files’ mark-up syntax to detect table structures. The Word files are in DOCX format and have been crawled from the web. Table bounding boxes are contained inside internal Office XML code. LaTeX files have been bulk-downloaded from arXiv considering scientific papers published in the period 2014–2018. The TableBank dataset focuses on table detection (TD) and table structure recognition (TSR).

For the Table Detection task, the TableBank authors created 417,234 labeled tables from the crawled documents as follows: tables are detected and annotated at both the LaTeX and XML levels. In the latter case, each table is delimited by \(\texttt {<w:tbl>}\) and \(\texttt {</w:tbl>}\) tags; the bounding box can still be colored through XML, since the Office XML code can be compiled back into the Word file. A similar approach is used for LaTeX tables, where bounding boxes are colored using the fcolorbox command. In this way, the authors were able to annotate the tables in a manner similar to DeepFigures [47].
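
As an illustration of where these markers live in Office Open XML, the short sketch below counts the <w:tbl> elements of a DOCX file by reading its word/document.xml entry; this is only a simplified view, not the TableBank pipeline itself, which additionally injects colored bounding boxes and compiles the modified XML back into a Word document.

import zipfile
from xml.etree import ElementTree

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def count_word_tables(docx_path):
    """Count <w:tbl> elements in a DOCX file (a ZIP archive whose main body
    is stored in word/document.xml)."""
    with zipfile.ZipFile(docx_path) as zf:
        body = ElementTree.fromstring(zf.read("word/document.xml"))
    return sum(1 for _ in body.iter(W_NS + "tbl"))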

Concerning Table Structure Recognition, TableBank contains 145,463 training instances obtained by converting Word and LaTeX documents into sequences of HTML tags, passing through the XML information. For Word files, the XML is already available, while LaTeX files are first converted to XML using the LaTeXML toolkit. In both cases, \(\texttt {<cell\_y>}\) and \(\texttt {<cell\_n>}\) tags are extracted, denoting cells with and without text, respectively.

Table Detection experiments have been conducted using a Faster R-CNN model (ResNeXt-152) [71].

6.7 Table2Latex-450k

Table2Latex-450k [44] is a dataset of 450K table images for scientific table recognition, digitally rendered from LaTeX files crawled from arXiv articles between July 1991 and November 2016.

As in SciTSR [62], the tables are extracted by means of RegEx matching, compiled into PDF, and then converted into cropped images. Of the resulting 940,279 tables from 298,168 articles, only 465,957 were kept, since the others produced compilation errors. The dataset is split into training/validation/test sets at the article level, due to the similarity shared by tables within the same article.

Along with the PNG images, the dataset also contains a tokenized and normalized version of the LaTeX source code, to reduce ambiguities and make it easier for neural models to learn a consistent mapping. To do so, plasTeX, a tool to transform LaTeX into an XML DOM, has been used.

6.8 PubTabNet

The PubTabNet dataset [64] gathers scientific documents from PMC, with annotations automatically generated by matching PDF and XML files. Differently from the previous ones, this dataset has three main advantages: (i) the tables are typeset by the publishers of over 6,000 journals in PMC, which offers considerably more diversity; (ii) cells are categorized into header and body cells, which is important when retrieving information from tables; (iii) the output format is HTML, which can be directly integrated into web applications. In addition, tables in HTML format are represented as a tree structure.

The PDF and XML files are matched through the algorithm of [49], following the general scheme described in Sect. 3.2.2 to detect table regions and subsequently convert them to images. Table regions are identified within the two sources and kept if the cosine similarity of the TF-IDF features of the two texts is larger than 90% and their lengths differ by less than 10%; rare tables containing formulas or infrequent characters are removed as well. Finally, each table is rendered as an image along with its HTML ground-truth annotation, stripped of hyperlinks and metadata not used for the downstream task.
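
The filtering criterion can be sketched as follows, using scikit-learn to compute the TF-IDF cosine similarity between the text extracted from the PDF table region and the text of the corresponding XML table node; the tokenization and the exact thresholds used by PubTabNet may differ, so this is only an approximation of the published procedure.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tables_match(pdf_table_text, xml_table_text,
                 min_similarity=0.9, max_length_diff=0.1):
    """Keep a table only if the PDF and XML texts are similar enough and their
    lengths are close, mirroring the filtering rule described above."""
    tfidf = TfidfVectorizer().fit_transform([pdf_table_text, xml_table_text])
    similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    longer = max(len(pdf_table_text), len(xml_table_text))
    shorter = min(len(pdf_table_text), len(xml_table_text))
    length_diff = 1.0 - shorter / longer if longer else 1.0
    return similarity >= min_similarity and length_diff <= max_length_diff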

6.8.1 FinTabNet and PubXNet

Two datasets are proposed in [48]: (i) PubXNet is an enhanced version of PubTabNet [64] that adds table cell labels for a joint subset of PubTabNet and PubLayNet [49]; (ii) FinTabNet is proposed to perform table detection and structure recognition on financial documents. Since HTML was already available, the authors directly matched that information with the source PDF for automatic labeling.

6.9 PubTables-1M

PubTables-1M [65] is a large dataset composed of nearly 1 million tables extracted from scientific articles. It contains detailed information about table structure and attempts to address the problem of over-segmentation using a novel canonicalization technique to obtain more consistent ground truth. PubTables-1M has been developed to address table extraction and its three main subtasks: table detection, table structure recognition, and functional analysis. It aims at coping with the large variety of formats, styles, and structures that characterize tables in scientific papers, at countering over-segmentation problems, and at addressing the need for more data. PubTables-1M contains rich annotation information for all rows, columns, and table headers. The annotation procedure follows the one described in Sect. 3.2.2: to align XML and PDF, the Needleman–Wunsch algorithm has been used, along with custom noise-removal heuristics to ensure a certain level of quality. PubTables-1M contains 947,642 tables annotated for structure recognition. For table detection, there are 575,304 fully annotated tables.

6.10 TabLeX

TabLeX [66] is a dataset of table images generated from scientific articles downloaded from arXiv, created working directly on LaTeX as described in Sect. 3.2.1. The papers belong to eight categories: Physics, Computer Science, Mathematics, Statistics, Electrical Engineering, Quantitative Biology, Quantitative Finance, and Economics. TabLeX consists of two subsets: one for table structure extraction and the other for table content extraction. The dataset contains images in different aspect ratios and in a variety of fonts to ease the development of tools that are able to robustly extract tables from scientific documents. The dataset is organized in sets for table structure detection (TSD) and table content detection (TCD), and each set is further split into short (TSD-250, TCD-250) and long (TSD-500, TCD-500) tables.

6.11 TNCR

TNCR (Table Net Detection and Classification dataset [67]) is an image dataset of scanned documents. Data are collected as PDF files from various public access libraries, for a total of 875,026 PDF pages; these pages are then manually screened to select only pages containing tables, obtaining a subset of images used to train a Faster R-CNN that proposes candidate tables. Finally, the selected tables are manually annotated with one of the following labels, based on the table type: full lined, no lines, merged cells, partial lined, and partial lined merged cells.

Table 5 Large-scale fully annotated document collections comparison

7 Large-scale fully annotated datasets

In this section, we present large-scale datasets, which are summarized in Table 5. Large datasets for DLA have been collected by taking advantage of publicly available collections of papers, motivated by the increasing interest in document analysis based on machine learning. Researchers have therefore been able to adopt deep learning techniques leveraging such huge datasets. In Fig. 4, these datasets are the most recent, intended to be used with state-of-the-art methods for DLA. PubLayNet [49] is the first dataset of this type, counting 500k annotated PDF pages and leveraging automatic annotation techniques. Following this thread, even larger datasets have been developed, covering even more classes. Today, these datasets allow researchers to obtain the best results in DLA.

7.1 PubLayNet

PubLayNet [49] has been developed with the aim of constructing the largest dataset ever created for DLA. It is made up of publicly available scientific journals, specifically medical ones, in PDF and XML format, taken from PMC and labeled as described in Sect. 3.2.2 using PDFMiner [39].

Layout categories have been chosen to cover important document regions, such as paragraphs, titles, lists, images, and tables. Once all the PDF pages have been annotated, authors proceeded with quality control, discarding the noisy documents. Finally, documents were divided into training, development, and testing sets at the journal level to maximize the differences between sets and for a better evaluation of how well an object detector can generalize to unseen layouts and contents.

7.2 DocBank

DocBank [45] is a large-scale dataset built using a weak supervision approach. It enables models to integrate both the textual and layout information for downstream tasks. DocBank includes 500K document pages. The strength of the dataset is that it can be used for both NLP and Computer Vision tasks since its annotations are at the token level. DocBank documents are taken from arXiv, along with their LaTeX source files. The downloaded papers belong to many domains, such as Physics, Mathematics, and Computer Science; in general, DocBank is built to focus on scientific documents.

DocBank is a natural extension of the TableBank [63] dataset, and their annotation pipelines are quite similar. Working on the LaTeX semantics as described in Sect. 3.2.1, twelve classes are identified: Abstract, Author, Caption, Equation, Figure, Footer, List, Paragraph, Reference, Section, Table, and Title.

PDFPlumber [43], a package based on PDFMiner, is then used to extract text and non-text elements from the documents. Text is tokenized using white spaces; then, bounding boxes are defined as the rectangle spanning from the upper-left point of the first character to the lower-right point of the last character of a token. Labels are determined based on the color mapping defined above.

Finally, the token-level dataset is converted into a region-level dataset: tokens belonging to the same class are grouped using the Breadth-First Search algorithm to find connected components, which then define regions of tokens.
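
A minimal sketch of this token-to-region step is given below; the adjacency criterion (overlap of slightly expanded bounding boxes) is an illustrative assumption, while the grouping itself follows the breadth-first search over same-class tokens described above.

from collections import deque

def tokens_to_regions(tokens, max_gap=10):
    """Group same-class token boxes into regions via breadth-first search over
    an adjacency graph. tokens: list of (label, (x0, y0, x1, y1))."""
    def adjacent(a, b):
        ax0, ay0, ax1, ay1 = a
        bx0, by0, bx1, by1 = b
        return not (ax1 + max_gap < bx0 or bx1 + max_gap < ax0 or
                    ay1 + max_gap < by0 or by1 + max_gap < ay0)

    visited, regions = set(), []
    for i, (label, _) in enumerate(tokens):
        if i in visited:
            continue
        queue, component = deque([i]), []
        visited.add(i)
        while queue:                              # breadth-first search
            j = queue.popleft()
            component.append(j)
            for k, (other_label, other_box) in enumerate(tokens):
                if (k not in visited and other_label == label
                        and adjacent(tokens[j][1], other_box)):
                    visited.add(k)
                    queue.append(k)
        xs0, ys0, xs1, ys1 = zip(*(tokens[j][1] for j in component))
        regions.append((label, (min(xs0), min(ys0), max(xs1), max(ys1))))
    return regions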

7.3 Dense Article Dataset

The Dense Article Dataset (DAD) [35] has been constructed to overcome the limited number of labeled classes in previous datasets. It covers research articles from multiple disciplines and captures the detailed elements of a research article in its original publication format. The dataset is composed of 450 open-access research articles from 14 different journals and 5 publishers (Elsevier, Springer, SAGE, Wiley, and IEEE).

The collected research articles are annotated with 43 classes that specialize three general categories: front matter, body matter, and back matter. 5980 pages are manually annotated with the Microsoft open-source tool VoTT [35]. The citations, due to their frequency (37k instances), are annotated by exploiting the output of Grobid [42] and then manually corrected. Annotations are saved as JSON files with the coordinates of the rectangular boxes.

7.4 DocLayNet

DocLayNet [72] is a dataset consisting of documents belonging to six distinct domains (Financial, Scientific, Patents, Manuals, Laws, and Tenders) collected from different free repositories. The main goal was to build a heterogeneous dataset that helps trained detectors produce more robust predictions without being domain specific. The dataset is composed of manual annotations (bounding boxes) for 80,863 PDF pages. Among these, 7059 carry two instances of manual annotations and 1591 carry three, for 91,104 total annotation instances. Each bounding box belongs to one of the following 11 classes: Caption, Footnote, Formula, List-item, Page-footer, Page-header, Picture, Section-header, Table, Text, and Title. The authors avoided the use of scanned pages and looked for documents consisting of more than 10 pages, with many tables, figures, and plots. Most documents in this dataset are written in English (95%).

DocLayNet is annotated using an augmented COCO format and the dataset is composed of three main parts: (i) original PDFs with text-cell coordinates and text (in JSON format); (ii) PNG files for each page (with size: \(1025 \times 1025\) pixels); (iii) COCO annotations (in JSON format) for training, development, and test sets.

The authors preferred manual annotations over automatic ones, also because structured source files were not available for the collected documents. DocLayNet manual annotations have been obtained using IBM CCS (Corpus Conversion Service) [34], a cloud-native platform that provides a visual annotation interface and allows for dataset inspection and analysis. To ensure homogeneity among different annotators, a 100-page annotation guideline was written and administered to 32 selected experts. The whole annotation process lasted 6 months.

7.5 SciBank

SciBank [46] is a dataset containing 74,435 pages from 9635 scientific papers selected from arXiv. There are 12 annotation classes, including Caption, Equation, Inline equation, Figure, Keywords, Reference, Section, Subsection, Table, and Title. Annotations follow the general scheme described in Sect. 3.2.1. The dataset consists of images and CSV files containing information about the page and the regions on the page. For each annotation, in addition to the class label and the region coordinates, there is information about the paper, the page number, the size of the page, and the corresponding PNG file.

Differently from PubLayNet [49], TableBank [63], and DocBank [45], SciBank contains annotations for inline equations.

8 Datasets of other types of documents

Scientific publications have been widely used for DLA for two main reasons: (i) there is a significant amount of publicly available data in digital libraries that are easy to gather and usually come with additional structural information, such as XML or LaTeX, which helps in the automatic annotation; (ii) scientific articles have a semantically rich, non-trivial structure that is useful to train and test DLA algorithms. However, when it comes to transferring what has been learned to other domains, it is anything but trivial to keep the same performance, due to the high variability in layouts and contents. Other benchmarks have been released to cope with this lack of information. First attempts to collect such data were made with the MediaTeam Document Database [73] and the UvA Document Dataset [74].

The MediaTeam dataset contains 512 pages belonging to 198 documents of different categories (e.g., advertisements, manuals, and line drawing); the 58 “articles” are from newspapers. Originally all documents were stored in uncompressed TIFF format (total 7.2 GB). Due to the limited capacity of CD-ROM at the time, documents were distributed in compressed JPEG.

The UvA Dataset contains complex regions (the ground truth is not limited to upright rectangles) and is focused on magazine pages with advertisements. Its most important feature is the inclusion of color pages, in contrast to previous collections that were limited to gray-level or black-and-white document images.

One relevant task in DIAR is document classification, which can be useful in DLA since identifying the specific type of document at hand can guide the analysis of its regions. In 2015, the RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset [75] was released, containing 400,000 scanned gray-scale images with a great variety of contents in 16 different classes. It is still one of the largest collections of documents for DIAR-related tasks. Some subsets have been extracted from RVL-CDIP over the years, trying to extend the possible uses of this collection. Among others, we would like to cite Tobacco800 [76] and the annotated invoices gathered by Riba et al. [23]. The first, composed of 1290 document images, is a realistic collection for document image analysis research, since the images were scanned using a large variety of devices and tools over time. In addition, a significant percentage of Tobacco800 consists of consecutively numbered multi-page business documents, making it a valuable test for various content-based document image retrieval approaches. The second collection, instead, annotates a subset of the RVL-CDIP invoice class for table detection and layout analysis, gathering some important information such as sender, receiver, and total amount.

Over the years, important competitions have been organized at the ICDAR conferences, releasing benchmarks for TD and TSR that are still used today to compare different methods. The most important ones are ICDAR 2013 [77] and ICDAR 2019 [78]. The first collection counts 156 tables crawled from two government sources (site:europa.eu and site:*.gov). The organizers searched for tables and kept the two pages before and after each of them, leaving room for false positive detections. The second collection is more varied, both in layouts and contents. For the 2019 competition, 1639 tables were collected for TD and TR. Approximately half of them are handwritten, collected from the contributions of 23 different institutions, while the remaining are printed, coming from different sources such as scientific journals, forms, and financial statements. The content is in English and Chinese. Annotations and results are provided in XML format.

In 2020, IIT-AR-13K [79] was released. Manually annotated, it is one of the largest collections of business documents, gathered from publicly available annual reports. There are five categories of objects that can be detected: tables, figures, natural images, logos, and signatures. The dataset also presents variability in content and layout: the reports are written in different languages and scripts, and span ten years of reports from twenty-nine different companies.

The most recent collection of this category is FinTab [80], a Chinese dataset for table extraction from financial documents that responds to the lack of diversity in the tables of currently existing datasets. It consists of 19 PDF files with more than 1600 tables, sourced from annual and semi-annual reports, debt financing, bond financing, collections of medium-term notes, short-term financing, and a prospectus. In total, the dataset adds up to 3329 pages, 2522 of which contain tables. The dataset has been manually reviewed and contains textual and structural ground truth for the tables. The textual ground truth describes both character and string information, while the structural ground truth concerns cell and table-line information; the ground truth is stored in JSON.

9 Impact of datasets

The release and widespread use of a research dataset can have a significant impact on the research on a given topic. In order to evaluate the impact of the datasets analyzed in this paper, we investigated three main features: i) the scientific impact, indirectly measured by the number of citations; ii) the dataset complexity, which is related to the quantity and quality of data and annotations and strictly connected to the task to be tackled; iii) the most relevant methods evaluated on the datasets. The above factors are summarized in Table 6 and further discussed in this section.

While the impact can be measured with quantitative, albeit sometimes inaccurate, numerical values obtained from the number of citations to papers describing the datasets, the complexity is harder to estimate. We therefore assess the dataset complexity by looking at some task-dependent features of the data. The most relevant methods are, on the other hand, identified by looking at research papers that describe techniques using the data. Impact and complexity can be inversely proportional: a dataset released several years ago could have significantly pushed research forward in the past, but nowadays, given its small scale, may be of little use for training data-hungry machine learning models. On the contrary, a novel dataset may have few citations, but could pose a new important challenge to the research community and be useful to train novel models. Impact and complexity of the datasets are summarized in Table 6 and discussed in Sects. 9.1 and 9.2. The tasks to be performed on a given dataset strongly affect the impact and complexity as well: for a fair comparison and discussion, a further division is therefore needed in our analysis. In reference to Table 1, four macrotasks are considered: Document Layout Analysis (DLA), Table Understanding (TD, TSR, TR, and TFA), Figure Extraction (FE), and Document Classification (for [75]). For each task, the most impactful datasets have been analyzed to identify the latest and most effective methods proposed, as summarized in Table 6 and briefly discussed in Sect. 9.3. Some sample pages selected from the most impactful datasets per task are shown in Fig. 5.

In the rest of this section, we inspect the datasets on the basis of the three directions previously mentioned.

9.1 Impact

The impact of a dataset is evaluated in this work as the number of citations divided by the number of years from the dataset release to the time of writing (months are not taken into account). Citations are collected from Google Scholar, and wrong references are usually not filtered out. This information is listed in the Impact columns in Table 6. Given a dataset, the identification of the most suitable paper to look for is not always easy. Nowadays, when a new dataset is released, it is often described in a research article that is also mentioned in the data repository, and users of the dataset are invited to cite this article when referencing the data. For older datasets, it is less common to have a unique paper, or technical report, to be cited. For instance, papers discussing methods developed and tested on the UW-I/UW-III datasets referenced the data by means of several citations [6, 7, 81,82,83], sometimes including more than one citation in the same article. In this case, the number of citations reported in Table 6 is obtained by roughly filtering the wrong results of a Google search for "UW-III document." In our view, this is an underestimation of the number of actual references to the dataset. Similar considerations also apply to other earlier datasets like MARG [56], MediaTeam [73], and UvA [74]. Table 6 does not include information for SciBank [46], which is hosted on IEEE DataPort and has no citations yet.

Table 6 Impact numbers from Google Scholar citations; years do not take into account months

9.2 Complexity

We summarize the data complexity by looking at two pieces of information about the dataset content: the number of different types of labels and the type of data annotated.

  • Number of labels: depending on the task, a larger number of classes to be distinguished usually makes a dataset more informative and more challenging. For instance, in DLA it is quite easy to distinguish between text and non-text regions, but it is harder to identify the semantic role of text regions. In the case of Table Understanding, rather than counting classes it is more informative to note the type of information provided in the ground truth, which can be: Table Position (TP), Table Type (TT), Cell Position (CP), and Cell Type (CT);

  • Type of data (ToD) labeled: can be scanned (S), digital-born (D), or a mixture of the two (M). Digital-born documents are usually easier to parse (e.g., there are no OCR errors), while images and low-quality scans make document analysis harder. More recent datasets nearly always deal with digital-born articles, while earlier datasets focus on scanned pages. This shift is due, on the one hand, to the increasing number of applications dealing with digital-born data and, on the other hand, to the development of techniques for automatically generating the ground truth, which rely on information embedded in digital-born documents.

This information summarizes the dataset descriptions reported in this paper and is shown in the Complexity columns in Table 6.
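To make this encoding concrete, a hypothetical record mirroring the Complexity columns could be structured as follows; the dataset name, field names, and values are illustrative only and are not taken from Table 6.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set

class DataType(Enum):
    SCANNED = "S"
    DIGITAL_BORN = "D"
    MIXED = "M"

@dataclass
class ComplexityRecord:
    """One hypothetical row of the Complexity summary."""
    dataset: str
    num_labels: Optional[int] = None                            # used for DLA-style datasets
    table_ground_truth: Set[str] = field(default_factory=set)   # subset of {"TP", "TT", "CP", "CT"}
    data_type: DataType = DataType.DIGITAL_BORN

# Illustrative example: a Table Understanding dataset providing table and cell positions on digital-born pages
record = ComplexityRecord("ExampleTableDataset", table_ground_truth={"TP", "CP"})
```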

9.3 Methods

In this section, we briefly overview the most common methods tested on the main datasets analyzed in this paper. We already summarized the main approaches for DLA in Sect. 2. In particular, the methods analyzed in the following, and listed in the Methods column in Table 6, were selected from among the best results reported on the two most impactful datasets for each task. For the remaining datasets, we report some specific methods tested on the data.

Fig. 5

Random samples selected from the most impactful datasets per task (as described in Section 9). TableBank and PubMedX are skipped since they are extensions of DocBank and PubTabNet, containing similar data. For figure extraction, FigureSeer contains directly annotated figures; DeepFigures examples are not shown since the authors provide only annotations

Without aspiring to cover all the methods of interest, we report the following approaches to provide a brief and general overview.

  • Heuristic methods for DLA of scientific articles rely on well-known algorithms based, for instance, on RLSA, projection profiles, and the area Voronoi diagram (Sect. 2). Other common approaches are based on the analysis of connected components in bi-level images. For instance, in [84] connected components are used to extract figures and captions from PDFs of biomedical articles. Most methods in this group have been tested on the UW datasets [85,86,87,88,89].

  • Moving to machine learning-based approaches, object detectors for images have been extensively explored in computer vision in general, and in DIAR in particular, by adopting seminal models like Faster R-CNN [90] and Mask R-CNN [91], which demonstrated significant advancements in object detection performance. Using a dual-backbone ResNeXt-101 with deformable convolutions, DOLNet [92] achieves remarkable results over seven different benchmarks. Differently, HiM [93] and VSR [27] are state-of-the-art methods for DocBank [45] and [49], respectively; they use an RPN in conjunction with textual embeddings and a graph structure for the final refinement of document objects in a multi-modal fashion. Focusing on a specific object of the document page, the GTE framework [94], composed of a series of vision-based neural networks, operates at two levels of depth, table and cell, to detect table boundaries and their structure. Other relevant methods in this category are DeepLabV3+ [95], TNCR [67], and FigureSeer [37] for Figure Extraction and analysis.

  • Since document layouts are, or at least contain, highly structured items, several researchers have focused their efforts on graph-based techniques to tackle several tasks. Even if more traditional approaches (like the XY-tree and the area Voronoi diagram) can be regarded as graph-based, the widespread development of geometric deep learning (and in particular Graph Neural Networks) gave rise to applications for DLA of scientific articles. These algorithms usually propose a graph representation of the document, considering clusters of similar semantics (e.g., words or paragraphs) as nodes and using distance metrics as edge connections (a minimal sketch of such a page-graph construction is given after this list). Compared to other methods, these techniques still rely heavily on the quality of the document preprocessing, which restricts their expressive power. One recent work belonging to this area is [25], where the authors show the potential of a GNN over a dense network on a custom dataset composed of 0.5 million synthetic tables. More recently, other proposals such as GLAM [96], GFTE [80], and GraphTSR [62] have shown remarkable results on datasets such as DocLayNet [72], FinTab [48], and SciTSR [62]. Concerning the analysis of chart images, in [97] a semantic segmentation network is used to predict probability maps for line styles; from this output, a graph is built and line tracing is performed using linear programming. We also recently proposed a graph-based architecture to detect and recognize tables and their surroundings on a subset of [49] and [65], relying on a graph structure of the page [98].

  • Since the first proposal of the transformer architecture [99], the DIAR community has begun to explore transformer-based architectures for several tasks, including DLA of scientific articles. LayoutLMv3 [100] is the first multi-modal architecture that does not rely on pre-trained visual extractors; by mixing visual, textual, and linear embeddings in a transformer-based model, it achieves state-of-the-art performance over PubLayNet [49] and other important benchmarks. Belonging to the same family, it is also worth mentioning DocFormer [101] and DiT [102], which, similarly to LayoutLMv3, achieve remarkable results, especially on the ICDAR table competitions [77, 78] and RVL-CDIP [75] datasets.
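As anticipated in the graph-based bullet above, the following is a minimal sketch of how a page graph can be built from pre-extracted regions. It uses networkx; the node attributes and the distance threshold are arbitrary illustrative choices and do not reproduce any of the cited methods.

```python
import math
import networkx as nx

def build_page_graph(regions, max_distance=150.0):
    """Build a page graph: one node per layout region (e.g., word or paragraph),
    with edges between regions whose centers are closer than max_distance (in pixels).

    `regions` is a list of dicts such as {"id": 0, "bbox": (x0, y0, x1, y1), "label": "paragraph"}.
    """
    graph = nx.Graph()
    centers = {}
    for region in regions:
        x0, y0, x1, y1 = region["bbox"]
        centers[region["id"]] = ((x0 + x1) / 2.0, (y0 + y1) / 2.0)
        # Node attributes (geometry, semantics) can serve as node features for a downstream GNN
        graph.add_node(region["id"], bbox=region["bbox"], label=region["label"])

    ids = list(centers)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            dist = math.dist(centers[a], centers[b])
            if dist <= max_distance:
                # The distance is stored as an edge attribute (a simple edge feature)
                graph.add_edge(a, b, distance=dist)
    return graph
```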

10 Discussion and open problems

In this survey, we explored the datasets available for performing DLA on scientific articles, mainly focusing on the annotation procedures. The first collections included a few manually annotated documents. More recently, deep learning techniques have gained increasing interest and achieved important results. Training these models requires large quantities of annotated data, along with automatic annotation techniques to cope with the growing amount of data.

It is worth mentioning that, despite technological and algorithmic developments, there are datasets, such as DocLayNet [72], that are still manually annotated. This prompts a reflection on the trade-off among quality, heterogeneity, and quantity of annotated data, and between the use of manual and automatic annotation techniques to build datasets.

The use of automatic annotation techniques allows researchers to label large datasets, which, however, are often scarcely heterogeneous. On the other hand, manual annotation makes it possible to build datasets composed of widely varied data, which are unfortunately not comparable in quantity to those annotated automatically.

The two approaches come with different limitations: manual procedures are expensive in terms of money and time and do not scale well. On the other side, automatic annotations are not applicable to every application domain, since they require additional structured information that is not always available. Moreover, their annotation quality is less reliable than that of manual annotations, and they often require human supervision to check a representative sample of the data. To cope with these limitations, generative procedures can be a viable alternative, as discussed in Sect. 3.3. An arbitrarily large number of synthetically generated documents can help enhance the variability in contents and layout, providing also the corresponding annotations. On the other hand, the generation is restricted to the space of documents that the generator observed during training, and there is still the risk of greatly diverging from reality. That is why generated datasets are usually not used in isolation, but combined with other collections.

Problems and limitations related to the collection of documents and their annotations are far from solved. In other fields, such as computer vision and natural language processing, new paths have started to be traced (e.g., [110]); drawing inspiration from these fields, some novel directions could also be explored by the DIAR community. Among others, we believe that addressing the following points could help overcome the aforementioned limitations:

  • gathering a wider variety of documents: for instance, documents written in different languages and scripts, to make architectures more robust over different data distributions. We focused mainly on DLA for scientific literature (nearly always written in English) since most benchmarks belong to this area. However, there are several open-access collections of documents on the Internet that can be crawled, and unsupervised techniques could be explored to work with them;

  • creating multi-page collections: the community usually focuses on single pages for training, while in real-world scenarios there is interest also in understanding multi-page documents, such as in administrative applications. From the training point of view, using multiple pages would provide a more informative signal to the system in use, while from the application perspective, models could be extended to real-world multi-page documents;

  • developing novel generative methods: unfortunately, not all types of documents are accessible, due to privacy issues, e.g., medical records or legal documents. Recently, the community has seen the growth of new generative models based on diffusion [111] and unsupervised pre-training [112]. In this direction, it could be interesting to apply similar techniques to generate an arbitrarily large stream of documents simply given a text prompt as input, on top of techniques similar to I2DFormer [113].

11 Conclusions

In this survey, we analyzed and compared the most used datasets for layout analysis, with a focus on those related to scientific publications. In particular, we drew a taxonomy of the most common techniques to annotate data, either manual or automatic. We also discussed techniques (e.g., based on generative models) that might be used for the development of datasets to train intelligent systems able to generalize over larger distributions of documents.

We provided detailed information on the datasets proposed over the years, with reference to the most widely used tools for PDF analysis and labeling. Our aim was to summarize the annotation procedures proposed and to provide a general overview of available datasets, helping scholars choose the most suitable data for experimenting with their techniques.

We compared all the datasets described in the paper by looking at their estimated impact and complexity. We also presented a brief overview of the most used methods designed and tested on these datasets. A numerical comparison of the experimental results achieved by different approaches on common data is challenging, due to different articles addressing different sub-tasks, variations in data splits, and diverse performance measures; adding this information would lead to a fragmented presentation. We recall that this survey primarily focuses on data generation/annotation algorithms rather than data utilization techniques, and we believe that a deeper analysis of these techniques should be addressed in a dedicated paper.

Finally, we discussed the advantages and drawbacks of various approaches used for labeling datasets, proposing new perspectives and interesting open problems to be investigated in the future.