Towards a Novel Classification of Table Types in Scholarly Publications

He, Jilin; Borisova, Ekaterina; Rehm, Georg

doi:10.1007/978-3-031-65794-8_3

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14770)

Included in the following conference series:

International Workshop on Natural Scientific Language Processing and Research Knowledge Graphs


Abstract

Tables are one of the prevalent means of organising and representing structured data. They contain a wealth of valuable information that is challenging to extract automatically, yet can be leveraged for downstream tasks such as question answering and knowledge base construction. Table Type Classification (TTC) is one of the tasks which contributes to better semantic understanding and extraction of knowledge in tabular data. While multiple classification schemas exist, almost all of them are focused on web tables. Therefore, these classifications might overlook certain types which are common in other areas such as scientific research. This paper addresses this gap by introducing ten novel TTC taxonomies tailored towards tables used in scholarly publications. We also evaluate the applicability of taxonomies derived from web tables to scientific tables. Additionally, we propose a new dataset containing 13,000 annotated table images, called TD4CLTabs. Our results indicate that both existing and newly proposed taxonomies are suitable and effective for classifying scientific tables.



1 Introduction

Tables are used to summarise and present information in a structured manner across various areas such as business, finance, science, education, and healthcare [40]. With a growing interest in the field of Table Understanding (TU), several studies have focused on the automatic extraction of knowledge from tables [3, 16, 36, 45] and applying it to various tasks, e. g., question answering [5, 7, 9, 20, 22, 29, 33, 43, 48, 50], knowledge base construction [25, 27], table-to-text generation [28], tabular data augmentation [12, 44, 45], content extension and completion [21, 27], fact-checking [1, 6], and natural language inference [17].

Table Type Classification (TTC) is a TU sub-task that aims to categorise tables according to a predefined schema based on their layout structure, content or purpose of use [45]. Classifying tables into specific types helps to uncover the semantics of the data they contain, facilitating tasks such as detecting and filtering layout tables (which do not contain any meaningful data), recognising table structures, and information extraction [14, 15, 23, 25]. Even though various TTC schemas exist [4, 8, 11, 25, 26, 27, 41], most were designed focusing on tabular structures that exist in web pages, commonly referred to as web tables [26]. As a consequence, these classifications might overlook certain table features and types, especially domain-specific ones. In particular, they might not be fully applicable to tables found in scholarly papers. We refer to such tables as scientific tables, defining them as tabular structures found in (digital) scholarly publications and labelled as a table by the authors. To the best of our knowledge, there is only one study by Kruit et al. [25] that proposed a table type taxonomy derived from scientific tables. No taxonomies based on structural or layout features exist for the field of scientific publications. The present paper addresses this gap by developing ten novel taxonomies based on scientific tables. To this end, we collect a corpus of tables extracted from Computational Linguistics (CL) articles. We develop various taxonomies based on two well-established classification schemas and by considering table features identified in previous studies and our own corpus analysis. We train and evaluate classifiers on the dataset of scientific tables that we annotated according to the two pre-existing schemas and our newly proposed taxonomies.

Our contributions can be summarised as follows:

We construct and release the TD4CLTabs dataset with 13,000 annotated images of scientific tables extracted from CL articles.
We propose and evaluate ten novel TTC taxonomies defined based on scientific tables.
We assess the applicability of taxonomies derived from web tables to scientific tables.
We offer a list of table features which are potentially important for TTC. The list includes attributes considered by previous taxonomies, alongside those overlooked by these schemas but identified in the literature and in our TD4CLTabs dataset.

This article is structured as follows: Sect. 2 discusses related work. Section 3 describes our approach to the dataset and taxonomies construction. Sections 4 and 5 present the evaluation results and main findings, respectively. Section 6 outlines limitations. Concluding remarks are provided in Sect. 7.

2 Related Work

Tables are ubiquitous data structures, often stored in relational databases (e. g., MySQL, PostgreSQL), spreadsheets (e. g., Microsoft Excel, Google Sheets), web pages (e. g., Wikipedia), and scientific articles. Tables vary greatly in terms of their layout structures and content, posing challenges for automatic TU [2, 46]. In order to effectively process and extract knowledge from tables, several TTC schemas have been proposed.

The existing schemas vary in their complexity, ranging from simple binary classifications to multi-layer taxonomies. Additionally, most TTC schemas have been designed based on tables found in web pages. For instance, in the pioneering work by Wang and Hu [42], web tables were classified into two categories: genuine, i. e., leaf tables (not containing other tables, lists, images, etc.) and non-genuine. Later, Cafarella et al. [4] distinguished between extremely small tables, HTML forms, calendars, non-relational (contain low-quality data), and relational (contain high-quality data) tables. Subsequent studies proposed more fine-grained classifications by organising table types into hierarchical taxonomies. Crestan et al. [11] introduced the categories of relational knowledge tables, which contain relational data, and layout tables, which do not contain any meaningful data at all. The former class included sub-types defined based on the positioning of table headers: vertical listing, horizontal listing, matrix, attribute/value, enumeration, and calendar. The layout category contained formatting and navigational tables. Lautert et al. [26] refined this taxonomy by revisiting the relational knowledge tables class and incorporating types derived from cell features. On the first layer, relational knowledge tables were categorised as horizontal, vertical, and matrix. These were subsequently divided into concise (contain merged cells), nested (contain a table in a cell), splitted (contain repeated labels in headers), simple and composed multivalued (contain multiple values in a single cell) categories. Chen and Cafarella [8] devised an alternative TTC taxonomy focusing on the use-case of web spreadsheets. In contrast to previous studies, this taxonomy incorporates major classes such as data frame spreadsheets and non-data frame (flat) spreadsheets, along with their respective sub-categories. More recent studies have shifted back to single-level classification schemas. Eberius et al. [14] distinguished between three main table types, namely matrix, horizontal listing, and vertical listing (see Fig. 6 in Appendix A). Similarly, Lehmberg et al. [27] also classified tables into three major categories: relational, entity, and matrix.

In contrast to web tables, there is currently only one TTC taxonomy defined based on scientific tables extracted from Computer Science papers. It was proposed by Kruit et al. [25] for the development of Tab2Know, i.e., a novel end-to-end system for building a knowledge base from scientific tables. This taxonomy consists of four root classes (observation, example, input, other) with their respective sub-classes and primarily focuses on the narrative role tables play in scholarly articles rather than their structural characteristics.

As emphasised by Zhang and Balog [45], the established approaches to TTC were designed for different use-cases. Therefore, it is not surprising that existing schemas might overlook certain table features. For instance, Shigarov et al. [38, 39] highlighted that current classifications fail to address header and cell-related characteristics such as header hierarchies, the presence of non-textual content and diagonally split cells. Additionally, the schemas do not consider the concepts of complicated tables (i. e., containing spanning cells) and void cells introduced by Chi et al. [10] and Rolan et al. [35], respectively (see Fig. 7 in Appendix B).

In earlier studies, TTC relied on traditional machine learning algorithms such as decision trees, support vector machines, and logistic regression [4, 11, 14, 25, 26, 42]. Recent research has shifted towards the adoption of deep learning techniques such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and attention mechanisms for automatic feature extraction from tables [18, 31]. Previous approaches primarily utilised plain-text and HTML representations of tables. However, not all tables are readily accessible in a machine-readable format. For instance, scientific tables are commonly embedded in unstructured PDF documents. Such tables have to be extracted and transformed into a format suitable for training and testing models. One of the widely used approaches involves obtaining the image-like representations of tables from a PDF file [24, 25, 49] which can either be directly used as model input or first converted into structured formats like CSV or JSON.

3 Methodology

3.1 Data

To assess the applicability of web tables-based taxonomies to the area of science and to construct novel TTC taxonomies, we created a corpus of table images from scholarly articles in the ACL Anthology.^{Footnote 1} We fetched a total of 3,219 papers from the year 2022, chosen as the latest collection of publications in the readily available ACL Anthology corpus.^{Footnote 2} As ACL papers are available only in PDF, Tab2Know was used to obtain table images. Out of the 3,219 PDF files, Tab2Know successfully processed 2,687, resulting in a total of 15,292 table images. Since Tab2Know is designed to locate and extract tables without their respective captions and titles, these are not present in our corpus.

3.2 Taxonomies Construction

We applied two established schemas based on web tables to the corpus of scientific tables, i. e., the classifications proposed by Eberius et al. [14] and Crestan et al. [11]. We picked these two taxonomies based on their usage in recent applications and tasks. We did not consider the taxonomy proposed by Kruit et al. [25] since it classifies tables based on their narrative role in scientific articles rather than their layout structure.

In order to determine whether any adjustments are needed in the two taxonomies, such as excluding under-represented classes, we examined their presence and distribution in a sample of 1200 table images from our corpus. The results are presented in Fig. 1. Eberius et al.’s schema, featuring the classes listing and matrix, was adopted directly for the TTC task because both classes occur frequently in the corpus. The taxonomy by Crestan et al. was adjusted by keeping horizontal listing, vertical listing, matrix, and enumeration, while disregarding other classes (e. g., calendar, form, layout tables, etc.) since these could not be observed in the sample data. Additionally, all tables of the attribute/value class were classified as either vertical listing or horizontal listing since they represent specific instances of these classes [11]. Together with the class other tables, which was introduced for tables that do not fit any of the pre-defined classes, we refer to the final two taxonomies as Baseline_I and Baseline_II, respectively. The graphical illustration of the baseline taxonomies is provided in Fig. 2(a).

In addition, ten novel taxonomies were defined by incorporating the table types from the baseline taxonomies as well as header and cell features. As a first step, we determined which classes should be preserved from Baseline_I and Baseline_II by analysing their preliminary frequencies of occurrence (Fig. 1). Hence, only the matrix and horizontal listing classes were considered while designing the taxonomies. Vertical listing and enumeration were disregarded due to their low frequencies in the dataset. Then, we compiled a list of table layout features which are neglected by the existing taxonomies but distinguished by previous studies (see Sect. 2). We further extended the list with additional features observed during the examination of the 1200 sample tables. The collected features fall into header and other table attributes and are outlined in Table 1.

Table 1. Header and other features potentially significant for Table Type Classification. Attributes identified based on a sample of 1200 tables extracted from ACL papers are highlighted in italics.


Initially, we constructed the TTC taxonomies by combining the selected table types and additional header features. We refer to these as Header-Feature Table Taxonomies (HFTTs) and present them in Figs. 2 (b) and (c). Thus, taking into account the absence or presence of a header hierarchy, we extended Baseline_I with the classes flat listing, flat matrix, hierarchical listing, and hierarchical matrix, and called it HFTT_Novel_I. Then, we incorporated the positioning of hierarchical headers (HHs) within the classes matrix and horizontal listing into HFTT_Novel_I. For the former, HH might exclusively appear in a column header (CH), row header (RH), or in both. We refer to these three additional classes as type-1, type-2, and type-3 hierarchical matrix. In the case of horizontal listing, HH may be positioned on the left, right or middle of a table, potentially with repetitions. We name the resulting taxonomy HFTT_Novel_II. As can be seen from Fig. 2(b), for HFTT_Novel_III, we further distinguished between matrix with diagonally split cells at the top-left cell (pseudo matrix) and without those (regular matrix). Note that pseudo matrices often bear a resemblance to listing. For the final HFTT_Novel_IV, we excluded HH and the three respective HH positioning types related to matrix and pseudo matrix. Eventually, the ten different taxonomies developed vary in terms of their number of classes, from 3 to 17. Baseline_I contains the fewest categories, while FFTT_Novel_V includes the most.
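
The assignment of a matrix table to one of the header-based sub-classes described above can be sketched as a small decision function. This is only an illustration and not the annotation procedure used in the study: the function name and boolean inputs are hypothetical, and the mapping of type-2 to row headers and type-3 to both headers is an assumption based on the order in which the three cases are listed (only type-1, HHs in a column header, is stated explicitly).

```python
def matrix_subtype(hh_in_column_header: bool, hh_in_row_header: bool) -> str:
    """Map the position of hierarchical headers (HHs) to a matrix sub-class.

    Assumption: type-1 = HH only in the column header (CH),
    type-2 = HH only in the row header (RH), type-3 = HH in both.
    """
    if hh_in_column_header and hh_in_row_header:
        return "type-3 hierarchical matrix"
    if hh_in_column_header:
        return "type-1 hierarchical matrix"
    if hh_in_row_header:
        return "type-2 hierarchical matrix"
    # No header hierarchy at all: the table stays in the flat class.
    return "flat matrix"


print(matrix_subtype(hh_in_column_header=True, hh_in_row_header=False))
```

Because exactly one branch fires for any input, the resulting classes are mutually exclusive, which is what allows the header-based taxonomies to be treated as a multi-class problem.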

As outlined in Table 1, HFTTs can be extended with other table features related to cell types and table splitting. Thus, each feature introduces a new category within each table type across HFTTs. When focusing solely on header features, the resulting table types are mutually exclusive. For instance, if a table is categorised as matrix, it cannot simultaneously belong to the listing class. Similarly, once it falls into the type-1 hierarchical matrix, it cannot be classified as type-2, type-3 or pseudo matrix. However, when considering both header and other table features, the resulting table types become inclusive. Thus, matrix can exhibit features such as spanning cells and being split at the same time, leading to a new category called split complex matrix. We refer to the refined HFTTs, containing header features, cell-related attributes, and table splitting, as Full-Feature Table Taxonomies (FFTTs). Figure 3 shows two examples.
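
Since FFTT labels are inclusive, one table can carry a table type and several features at once. A minimal sketch of how such a label set might be turned into a multi-hot target vector is shown below. The label vocabulary is a hypothetical subset chosen for illustration and does not reproduce the actual FFTT class lists.

```python
# Hypothetical label vocabulary: three table types followed by three features.
LABELS = ["matrix", "listing", "other", "spanning cells", "split table", "void cells"]


def multi_hot(active_labels):
    """Return a 0/1 vector with one slot per entry in LABELS."""
    unknown = set(active_labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in active_labels else 0 for label in LABELS]


# A matrix that has spanning cells and is split ("split complex matrix" in the text).
print(multi_hot({"matrix", "spanning cells", "split table"}))
```

A multi-class target would have exactly one active slot, whereas here several slots can be active for the same image.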

3.3 Annotation

To label the corpus of 15,292 table images according to the defined taxonomies, we ran an annotation project. LabelStudio^{Footnote 3} was used as the annotation tool. Since only one annotator was involved, a Master's student of Data Science, no inter-annotator agreement (IAA) score was calculated. To ensure that the final corpus contains well-structured images, displaying only the complete and clear layout of tables, we filtered out inappropriate samples while annotating. To this end, we introduced the class non-table and used the following rules during the annotation:

If a table is partially extracted, as if incorrectly cropped, it is not considered to be a complete table and should be annotated as non-table.
If a table is fully extracted but labelled as Figure in a paper, it should be annotated as non-table.
If a table is fully extracted but there is other information in the image, such as segments of text, it should be annotated as non-table.
If a table is fully extracted but an image contains multiple scattered tables, it is considered as incorrect input and should be annotated as non-table.

As a result, 1,937 table images belong to the non-table category and were excluded from the corpus. We also checked the labelled data with respect to annotation errors. Consequently, 54 images were removed from the corpus.

The final dataset comprises 13,301 annotated scientific table images along with their respective metadata (image name, image label, image path, and dataset split). We refer to the final corpus as TD4CLTabs (Type Detection for Computational Linguistics Tables) dataset.^{Footnote 4} As a post-processing step, we encoded the categorical features with numerical values. Then we divided the dataset into a training set containing 10,347 table images and a test set comprising 2,954 samples.
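
The encoding of categorical labels as numerical values could look like the sketch below. The source does not say which encoder was used, so the sorted-vocabulary convention and the example class names are assumptions made for illustration.

```python
def fit_label_encoder(labels):
    """Assign each distinct class name an integer id, in sorted order."""
    classes = sorted(set(labels))
    return {name: idx for idx, name in enumerate(classes)}


# Hypothetical image labels; the real dataset stores one label per table image.
labels = ["matrix", "listing", "matrix", "other"]
encoder = fit_label_encoder(labels)
encoded = [encoder[name] for name in labels]

print(encoder)
print(encoded)
```

Fitting the mapping once and reusing it for both the training and the test split keeps the integer ids consistent across splits.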

3.4 Models

Considering recent advances of deep learning in computer vision (CV), alongside the proven successful application of table images for TU tasks such as table detection and table structure recognition [30, 32, 34, 37, 49], we approach TTC as an image classification task. In particular, TTC based on HFTTs was tackled as a multi-class problem, while classification based on FFTTs was addressed as a multi-label task.

Two models, ResNet50 [19] and Vision Transformer (ViT) [13], were trained.^{Footnote 5} ResNet50 is a deep CNN model widely utilised in CV tasks that performs well in image classification problems. ViT is a newer approach to CV that uses the self-attention of the Transformer architecture to capture global image information and has been reported to outperform traditional CNN models [13]. We combined pre-encoded labels from all hierarchy levels into one flat list and fed them as input into the models along with table images.

ResNet50 was implemented using the Fastai framework.^{Footnote 6} For the ViT model, we utilised the Hugging Face implementation.^{Footnote 7} To enhance the robustness and reliability of the image classification models, cross-validation was applied with k set to 4. For both models, the batch size was set to 16. The resize dimensions of (500, 900) and (224, 224) were chosen for ResNet50 and ViT, respectively. FocalLoss was employed as the loss function for ResNet50, while the default CrossEntropy was used for ViT. The training process for ResNet50 extended to 30 epochs with early stopping enabled and a patience of 5 epochs. ViT was trained for 15 epochs with the option to save the best model. Both models utilised pretrained weights, with the pretrained option set to True for ResNet50 and the ‘google/vit-base-patch16-224-in21k’ pretrained configuration used for ViT.
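
To make the difference between the two loss functions concrete, the sketch below computes cross-entropy and focal loss for a single sample in plain Python. It is not the Fastai or Hugging Face implementation. The focusing parameter gamma is not reported in the text, so the value 2.0 used here is an assumption, and no class-weighting term is included.

```python
import math


def cross_entropy(probs, target):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t) ** gamma (gamma = 2.0 is assumed)."""
    p_t = probs[target]
    return ((1.0 - p_t) ** gamma) * cross_entropy(probs, target)


easy = [0.9, 0.1]  # confident, correct prediction for class 0
hard = [0.5, 0.5]  # uncertain prediction for class 0

print(cross_entropy(easy, 0), focal_loss(easy, 0))
print(cross_entropy(hard, 0), focal_loss(hard, 0))
```

With gamma set to 0 the focal loss reduces to cross-entropy. With a positive gamma, well-classified samples contribute much less to the loss than uncertain ones, which is why focal loss is often considered for imbalanced data.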

3.5 Evaluation Metrics

To evaluate the performance of the two models on the multi-class classification task, error rate, precision (weighted), recall (weighted), and F1 score (weighted) were used. In the case of multi-label classification, hamming loss, macro and micro F1 scores were utilised.
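
The metrics named above can be written out in a few lines of plain Python. The sketch below is for illustration only and is not the evaluation code used in the study. It covers the weighted F1 score for the multi-class case (weighted precision and recall follow the same pattern) and hamming loss with micro and macro F1 for the multi-label case. All function names and toy inputs are hypothetical, and the thresholding helper assumes a cut-off of 0.5 applied with a strict comparison.

```python
from collections import Counter


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Multi-class F1: per-class F1 averaged with the class support as weight."""
    total = 0.0
    for cls, support in Counter(y_true).items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        total += support * f1_from_counts(tp, fp, fn)
    return total / len(y_true)


def binarise(probabilities, threshold=0.5):
    """Turn per-label probabilities into 0/1 predictions (strictly greater)."""
    return [1 if p > threshold else 0 for p in probabilities]


def hamming_loss(Y_true, Y_pred):
    """Fraction of individual label slots that are predicted incorrectly."""
    wrong = sum(t != p for rt, rp in zip(Y_true, Y_pred) for t, p in zip(rt, rp))
    return wrong / (len(Y_true) * len(Y_true[0]))


def micro_macro_f1(Y_true, Y_pred):
    """Micro F1 pools the counts over labels; macro F1 averages per-label F1."""
    counts = []
    for j in range(len(Y_true[0])):
        tp = sum(rt[j] == 1 and rp[j] == 1 for rt, rp in zip(Y_true, Y_pred))
        fp = sum(rt[j] == 0 and rp[j] == 1 for rt, rp in zip(Y_true, Y_pred))
        fn = sum(rt[j] == 1 and rp[j] == 0 for rt, rp in zip(Y_true, Y_pred))
        counts.append((tp, fp, fn))
    macro = sum(f1_from_counts(*c) for c in counts) / len(counts)
    micro = f1_from_counts(*(sum(column) for column in zip(*counts)))
    return micro, macro


# Toy multi-label example: two tables, three labels, the third label is rare.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [binarise([0.9, 0.2, 0.4]), binarise([0.1, 0.8, 0.3])]

print(hamming_loss(Y_true, Y_pred))
print(micro_macro_f1(Y_true, Y_pred))
```

In the toy example a single missed rare label pulls macro F1 down more than micro F1, because macro averaging gives every label the same weight regardless of how often it occurs.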

4 Results

4.1 Dataset Analysis

The table images in our dataset have a wide range of resolutions, spanning from a minimum of \(100 \times 100\) pixels to a maximum of either \(1200 \times 200\) or \(1000 \times 1400\) pixels. In terms of dimensions, tables average 7.60 rows and 6.68 columns.

The distribution of tables per class within each HFTT is presented in Fig. 4. As can be seen, with the increase in the number of classes, the degree of data imbalance also rises. The analysis shows that matrix tables are approximately 15% more common than listings in the dataset. Interestingly, other tables comprise less than 5%. Among the matrix tables, those with HHs constitute approximately half of all (49%). Furthermore, the majority of such tables (about 64%) fall under type-1 hierarchical matrix, i.e., have HHs located in a CH. Matrix tables with diagonally split cells are quite frequent (about 71%). The least common across the matrix sub-categories are type-2 hierarchical and type-3 hierarchical. In terms of the listing class, horizontal tables are more frequent (about 84% of the total) than vertical and enumeration types. In contrast to hierarchical matrix tables, the number of hierarchical listings in the dataset is considerably lower (approx. 8% of all listings).

Figure 5 illustrates the distribution of table splitting and cell-related features incorporated into FFTTs within the TD4CLTabs dataset. The results indicate the infrequent occurrence of those across the given corpus of scientific tables. The highest value of about 13% was achieved for the missing and void cells type, followed by the presence of hierarchical rows (approximately 10%). A limited number of tables contain cells with non-textual content (about 3%) and other complex cells (about 2%).

4.2 Table Type Classification

Table 2 presents the TTC results across HFTTs. The ViT model outperforms ResNet50 in all but one case, namely HFTT_Novel_II. We can also see a general trend of decreasing performance among the models as the number of classes in the taxonomy increases. The class imbalance indicated in Sect. 4.1 might have also influenced the predictions. The best F1 value (0.82) was obtained for ViT based on Baseline_I. This is not surprising since it is a 1-level schema with the fewest classes and the most balanced data. The second highest F1 scores (0.78) were achieved by Baseline_II and HFTT_Novel_IV, both of which contain two additional categories when compared to Baseline_I. Even though HFTT_Novel_III contains four more categories than HFTT_Novel_II, the models based on these taxonomies yield very similar results (approx. 1% difference). The study also shows that HFTT_Novel_IV achieved the highest scores among the novel taxonomies.

Table 2. Multi-class classification results based on baseline and Header-Feature Table Taxonomies


The results for multi-label classification based on FFTTs are provided in Table 3. In terms of micro F1, the ViT model demonstrates overall better performance compared to ResNet50 across all taxonomies, except FFTT_Novel_IV and FFTT_Novel_V. However, all models exhibit low macro F1 scores, which reflects the dataset imbalance. The hamming loss values are also consistently low across the models (0.05–0.07), suggesting an overall good performance of the classifiers. Similar to the classification based on HFTTs, we note a trend where models tend to perform worse on FFTTs with a larger number of classes. Furthermore, the highest score (0.75) for FFTTs is about 7% and 2% lower compared to those obtained for the baselines and HFTTs, respectively.

Table 3. Multi-label classification results based on Full-Feature Table Taxonomies. The threshold is set to 0.5: a prediction with a probability greater than 0.5 is counted as positive, and as negative otherwise.


To address the problem of class imbalance, we applied the random oversampling technique [47] on novel HFTTs.^{Footnote 8} This involved duplicating instances of the minority classes to align with the majority classes. As shown in Table 4, oversampling improved F1 scores by 1–5% for all but one model. The ViT model based on HFTT_Novel_IV is the only instance where a slight decrease in score (by about 2%) is observed. All other evaluation scores also increased in the majority of HFTT classifiers. Furthermore, comparable results to ResNet50 with Baseline_I were achieved on ResNet50 with HFTT_Novel_I and HFTT_Novel_IV. However, despite the overall improvement in model performance, the prediction accuracy for novel taxonomies still remains lower (by approximately 5%) than that of Baseline_I based on ViT.
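
Random oversampling as described above, duplicating minority-class instances until every class matches the largest one, can be sketched as follows. This is an illustrative version rather than the implementation used in the study. The function name, the fixed seed, and the toy data are assumptions, and the sketch assumes that only training data would be resampled.

```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples up to the majority class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw with replacement from the class's own samples to fill the gap.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: three matrix tables, one listing, one other (hypothetical file names).
X = ["t1.png", "t2.png", "t3.png", "t4.png", "t5.png"]
y = ["matrix", "matrix", "matrix", "listing", "other"]
X_over, y_over = random_oversample(X, y)

print(len(X_over), {label: y_over.count(label) for label in set(y_over)})
```

Every original sample is kept and no new images are synthesised, so the balanced set contains exact duplicates of minority-class images.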

Table 4. Multi-class classification results based on Header-Feature Table Taxonomies after applying oversampling


5 Discussion

The study indicates that matrix and listing tables are the most commonly used across CL papers. In particular, matrix with hierarchical headers, frequently found in CHs, matrix with diagonally split cells, and horizontal listings are prevalent. Hence, these types are worth considering when classifying scientific tables. In contrast, the findings suggest that incorporating table splitting and cell features may not be advantageous, as they seem to be relatively uncommon in scientific tables.

The study further showcased the applicability of the TTC schema by Eberius et al. to scientific tables. The taxonomy by Crestan et al. also proved to be adaptable after minor adjustments. The models based on these baseline schemas perform better on TTC than those trained on the newly proposed taxonomies. Hence, although the two established classification schemas were designed for web tables, they are still suitable for scientific tables.

While the experimental results do not demonstrate a clear advantage of the novel domain-specific taxonomies, they do show promising outcomes. Among the newly developed taxonomies, HFTT_Novel_I and HFTT_Novel_IV have proven to be the most successful. This could potentially be attributed to the smaller number of categories within those, indicating a lower level of complexity, compared to other schemas. These taxonomies also achieved performance comparable to the results obtained for ResNet50 with the baseline schemas.

6 Limitations

While this study sheds light on devising TTC taxonomies for scientific tables, it is not without limitations. First, the annotations may be subjective and contain errors due to the involvement of only one annotator. Having at least one additional annotator and curator, and subsequently validating the results by calculating the IAA score, would be beneficial. Second, the novel taxonomies were constructed and tested based on scientific tables from CL papers. Thus, the applicability of those to other domains remains an open research question, which we leave for future work. Third, the study considered only two existing web table based taxonomies, limiting the analysis to types within them and potentially neglecting other categories relevant to scientific tables. Finally, the hierarchy of the taxonomies’ labels was not taken into account in this study. Additionally, to tackle class imbalance, we considered only oversampling and applied it only to taxonomies with header features. Future endeavours could incorporate the label hierarchy in the model training process and focus on annotating more samples for the minority classes or on utilising other automatic methods for solving class imbalance (e. g., resampling).

7 Conclusion

In this paper, we developed and evaluated the effectiveness of ten novel TTC taxonomies tailored for tables found in scholarly publications. Additionally, we examined the applicability of well-established schemas designed for and based on web tables to the use-case of scientific tables. The findings reveal that existing taxonomies are indeed suitable for classifying scientific tables. However, while established taxonomies perform well, comparable performance can also be achieved with two novel domain-specific taxonomies. Finally, our study indicates that header features are essential for classifying scientific tables, whereas cell features and table splitting have not been shown to provide significant advantages. The proposed taxonomies can be beneficial for downstream tasks such as information retrieval from scholarly papers by helping to reduce the search space, data integration allowing mapping of scientific tables with similar structures across different datasets, and scientific table structure recognition.

Notes

1. https://aclanthology.org
2. https://github.com/shauryr/ACL-anthology-corpus
3. https://labelstud.io
4. https://zenodo.org/records/10972922
5. The code is available on Software Heritage: https://archive.softwareheritage.org/browse/directory/1f492fb7db23db3a57484edd196af4fdf7139061/?origin_url=https://github.com/JilinHe/TD4CLTabs&revision=b549ac21bb59386734457eb6a36b8d358b0a68ee&snapshot=6b93b959741a8fbfff4f5ebeaf71e8177b81ff6f
6. https://www.fast.ai
7. https://huggingface.co
8. Note that we have not addressed the data imbalance for FFTTs.

References

Aly, R., et al.: The fact extraction and VERification over unstructured and structured information (FEVEROUS) shared task. In: Aly, R., et al. (eds.) Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), pp. 1–13. Association for Computational Linguistics, Dominican Republic (2021). https://doi.org/10.18653/v1/2021.fever-1.1
Bonfitto, S., Casiraghi, E., Mesiti, M.: Table understanding approaches for extracting knowledge from heterogeneous tables. WIREs Data Min. Knowl. Discov. 11(4), e1407 (2021). https://doi.org/10.1002/widm.1407
Borisov, V., Leemann, T., Sessler, K., Haug, J., Pawelczyk, M., Kasneci, G.: Deep neural networks and tabular data: a survey. IEEE Trans. Neural Netw. Learn. Syst. 1–21 (2022). https://doi.org/10.1109/tnnls.2022.3229161
Cafarella, M.J., Halevy, A.Y., Zhang, Y., Wang, D.Z., Wu, E.: Uncovering the relational web. In: WebDB, pp. 1–6. Citeseer (2008)
Chen, W., Chang, M.W., Schlinger, E., Wang, W., Cohen, W.W.: Open question answering over tables and text. arXiv (2021)
Chen, W., et al.: TabFact: a large-scale dataset for table-based fact verification. In: International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia (2020)
Chen, W., Zha, H., Chen, Z., Xiong, W., Wang, H., Wang, W.Y.: HybridQA: a dataset of multi-hop question answering over tabular and textual data. In: Cohn, T., He, Y., Liu, Y. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 1026–1036. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.findings-emnlp.91
Chen, Z., Cafarella, M.: Automatic web spreadsheet data extraction. In: Proceedings of the 3rd International Workshop on Semantic Search Over the Web, SSW 2013. Association for Computing Machinery, New York (2013). https://doi.org/10.1145/2509908.2509909
Cheng, Z., et al.: HiTab: a hierarchical table dataset for question answering and natural language generation. In: Muresan, S., Nakov, P., Villavicencio, A. (eds.) Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1094–1110. Association for Computational Linguistics, Dublin (2022). https://doi.org/10.18653/v1/2022.acl-long.78
Chi, Z., Huang, H., Xu, H.D., Yu, H., Yin, W., Mao, X.L.: Complicated table structure recognition. arXiv preprint arXiv:1908.04729 (2019)
Crestan, E., Pantel, P.: Web-scale table census and classification. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 545–554 (2011)
Del Bimbo, D., Gemelli, A., Marinai, S.: Data augmentation on graphs for table type classification. In: Krzyzak, A., Suen, C.Y., Torsello, A., Nobile, N. (eds.) S+SSPR 2022. LNCS, vol. 13813, pp. 242–252. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-23028-8_25
Dosovitskiy, A., et al.: An image is worth 16×16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Eberius, J., Braunschweig, K., Hentsch, M., Thiele, M., Ahmadov, A., Lehner, W.: Building the Dresden web table corpus: A classification approach. In: 2015 IEEE/ACM 2nd International Symposium on Big Data Computing (BDC), pp. 41–50. IEEE (2015)
Ghasemi-Gol, M., Szekely, P.: TabVec: table vectors for classification of web tables. arXiv preprint arXiv:1802.06290 (2018)
Gorishniy, Y., Rubachev, I., Khrulkov, V., Babenko, A.: Revisiting deep learning models for tabular data. In: Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W. (eds.) Advances in Neural Information Processing Systems, vol. 34, pp. 18932–18943. Curran Associates, Inc. (2021)
Gupta, V., Mehta, M., Nokhiz, P., Srikumar, V.: INFOTABS: inference on tables as semi-structured data. In: Jurafsky, D., Chai, J., Schluter, N., Tetreault, J. (eds.) Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2309–2324. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.acl-main.210
Habibi, M., Starlinger, J., Leser, U.: DeepTable: a permutation invariant neural network for table orientation classification. Data Min. Knowl. Disc. 34(6), 1963–1983 (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385
Herzig, J., Müller, T., Krichene, S., Eisenschlos, J.M.: Open domain question answering over tables via dense retrieval. arXiv (2021)
Hu, K., et al.: VizNet: towards a large-scale visualization learning and benchmarking repository. arXiv (2019)
Iyyer, M., Yih, W.T., Chang, M.W.: Search-based neural structured learning for sequential question answering. In: Barzilay, R., Kan, M.Y. (eds.) Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1821–1831. Association for Computational Linguistics, Vancouver (2017). https://doi.org/10.18653/v1/P17-1167
Kardas, M., et al.: AxCell: automatic extraction of results from machine learning papers. In: Webber, B., Cohn, T., He, Y., Liu, Y. (eds.) Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8580–8594. Association for Computational Linguistics (2020). https://doi.org/10.18653/v1/2020.emnlp-main.692, https://aclanthology.org/2020.emnlp-main.692
Karishma, Z., Rohatgi, S., Puranik, K.S., Wu, J., Giles, C.L.: ACL-Fig: a dataset for scientific figure classification. arXiv (2023)
Kruit, B., He, H., Urbani, J.: Tab2Know: building a knowledge base from tables in scientific papers. In: Pan, J.Z., Tamma, V., d’Amato, C., Janowicz, K., Fu, B., Polleres, A., Seneviratne, O., Kagal, L. (eds.) ISWC 2020. LNCS, vol. 12506, pp. 349–365. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-62419-4_20
Lautert, L.R., Scheidt, M.M., Dorneles, C.F.: Web table taxonomy and formalization. ACM SIGMOD Rec. 42(3), 28–33 (2013)
Lehmberg, O., Ritze, D., Meusel, R., Bizer, C.: A large public corpus of web tables containing time and context metadata. In: Proceedings of the 25th International Conference Companion on World Wide Web, WWW 2016 Companion, pp. 75–76. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE (2016). https://doi.org/10.1145/2872518.2889386
Moosavi, N.S., Rücklé, A., Roth, D., Gurevych, I.: Learning to reason for text generation from scientific tables. arXiv preprint arXiv:2104.08296 (2021)
Nan, L., et al.: FeTaQA: free-form table question answering. Trans. Assoc. Comput. Linguist. 10, 35–49 (2022). https://doi.org/10.1162/tacl_a_00446
Nassar, A., Livathinos, N., Lysak, M., Staar, P.: TableFormer: table structure understanding with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4614–4623 (2022)
Nishida, K., Sadamitsu, K., Higashinaka, R., Matsuo, Y.: Understanding the semantic structures of tables with a hybrid deep neural network architecture. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)
Paliwal, S., Vishwanath, D., Rahul, R., Sharma, M., Vig, L.: TableNet: deep learning model for end-to-end table detection and tabular data extraction from scanned document images. arXiv (2020)
Pasupat, P., Liang, P.: Compositional semantic parsing on semi-structured tables. In: Zong, C., Strube, M. (eds.) Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1470–1480. Association for Computational Linguistics, Beijing (2015). https://doi.org/10.3115/v1/P15-1142
Prasad, D., Gadpal, A., Kapadni, K., Visave, M., Sultanpure, K.: CascadeTabNet: an approach for end to end table detection and structure recognition from image-based documents. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 572–573 (2020)
Roldán, J.C., Jiménez, P., Corchuelo, R.: On extracting data from tables that are encoded using HTML. Knowl.-Based Syst. 190, 105157 (2020)
Sahakyan, M., Aung, Z., Rahwan, T.: Explainable artificial intelligence for tabular data: a survey. IEEE Access 9, 135392–135422 (2021). https://doi.org/10.1109/ACCESS.2021.3116481
Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: DeepDeSRT: deep learning for detection and structure recognition of tables in document images. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 01, pp. 1162–1167 (2017). https://doi.org/10.1109/ICDAR.2017.192
Shigarov, A.: Table understanding: problem overview. WIREs Data Min. Knowl. Discov. 13(1), e1482 (2023). https://doi.org/10.1002/widm.1482
Shigarov, A.O., Mikhailov, A.A.: Rule-based spreadsheet data transformation from arbitrary to relational tables. Inf. Syst. 71, 123–136 (2017). https://doi.org/10.1016/j.is.2017.08.004
Shwartz-Ziv, R., Armon, A.: Tabular data: deep learning is not all you need. Inf. Fusion 81, 84–90 (2022). https://doi.org/10.1016/j.inffus.2021.11.011
Wang, Y., Hu, J.: Detecting tables in HTML documents. In: Lopresti, D., Hu, J., Kashi, R. (eds.) DAS 2002. LNCS, vol. 2423, pp. 249–260. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45869-7_29
Wang, Y., Hu, J.: A machine learning based approach for table detection on the web. In: Proceedings of the 11th International Conference on World Wide Web, pp. 242–250 (2002)
Zayats, V., Toutanova, K., Ostendorf, M.: Representations for question answering from documents with tables and text. arXiv preprint arXiv:2101.10573 (2021)
Zhang, L., Zhang, S., Balog, K.: Table2vec: neural word and entity embeddings for table population and retrieval. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1029–1032 (2019)
Zhang, S., Balog, K.: Web table extraction, retrieval, and augmentation: a survey. ACM Trans. Intell. Syst. Technol. (TIST) 11(2), 1–35 (2020)
Zheng, X., Burdick, D., Popa, L., Zhong, X., Wang, N.X.R.: Global table extractor (GTE): a framework for joint table identification and cell structure recognition using visual context. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 697–706 (2021)
Zheng, Z., Cai, Y., Li, Y.: Oversampling method for imbalanced classification. Comput. Inform. 34(5), 1017–1037 (2015)
Zhong, V., Xiong, C., Socher, R.: Seq2SQL: generating structured queries from natural language using reinforcement learning. arXiv (2017)
Zhong, X., ShafieiBavani, E., Yepes, A.J.: Image-based table recognition: data, model, and evaluation. arXiv (2020)
Zhu, F., et al.: TAT-QA: a question answering benchmark on a hybrid of tabular and textual content in finance. In: Zong, C., Xia, F., Li, W., Navigli, R. (eds.) Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 3277–3287. Association for Computational Linguistics (2021). https://doi.org/10.18653/v1/2021.acl-long.254

Acknowledgments

The work presented in this paper was partially supported by the consortium NFDI for Data Science and Artificial Intelligence (NFDI4DS, no. 460234259) (https://www.nfdi4datascience.de) as part of the non-profit association National Research Data Infrastructure (NFDI e. V.). The NFDI is funded by the Federal Republic of Germany and its states.

Author information

Authors and Affiliations

Technische Universität Berlin (TU), Berlin, Germany
Jilin He
Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Berlin, Germany
Ekaterina Borisova & Georg Rehm

Authors

Jilin He
Ekaterina Borisova
Georg Rehm

Corresponding author

Correspondence to Ekaterina Borisova.

Editor information

Editors and Affiliations

Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Berlin, Germany
Georg Rehm
GESIS Leibniz Institut für Sozialwissenschaften and Heinrich-Heine-University Düsseldorf, Cologne, Germany
Stefan Dietze
Technical University of Berlin and Fraunhofer FOKUS, Berlin, Berlin, Germany
Sonja Schimmler
Wismar University of Applied Sciences, Wismar, Germany
Frank Krüger

Appendices

A Examples of Matrix, Horizontal Listing, and Vertical Listing Tables

B Illustrations of Table Features

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

He, J., Borisova, E., Rehm, G. (2024). Towards a Novel Classification of Table Types in Scholarly Publications. In: Rehm, G., Dietze, S., Schimmler, S., Krüger, F. (eds) Natural Scientific Language Processing and Research Knowledge Graphs. NSLP 2024. Lecture Notes in Computer Science (LNAI), vol 14770. Springer, Cham. https://doi.org/10.1007/978-3-031-65794-8_3

DOI: https://doi.org/10.1007/978-3-031-65794-8_3
Published: 15 August 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-65793-1
Online ISBN: 978-3-031-65794-8
eBook Packages: Computer Science, Computer Science (R0)
