1 Introduction

Keywords and other classifications may help when searching or organizing scholarly publications [20]. They can be annotated by the authors or the publishers, at the cost of manual effort, or they may be machine-generated. The latter is an application of natural language processing that, with the advent of pre-trained large language models such as BERT [14], has recently gained momentum. Still, the automated classification of research papers remains challenging [27].

This paper describes our submission to the shared task Field of Research Classification of Scholarly Publications of the 1st Workshop on Natural Scientific Language Processing and Research Knowledge Graphs (NSLP 2024). Its Subtask I, which our contribution addresses, is to develop a single-label classifier for general scholarly publications. We trained and tested it on a dataset of around 60,000 English scientific papers [1, 2], each from one of 123 hierarchical classes of a subset of the Open Research Knowledge Graph taxonomy. Our approach, dubbed SLAMFORC (short for Single-Label Multi-modal Field of Research Classification), is multi-modal in that we incorporated data from three different sources: the dataset provided by the organizers of the challenge containing metadata of the articles (e.g., title, abstract), the semantic information provided by Crossref, and the contents of the papers (i.e., full text and images). Using this data as features, we engineered a classifier that produces single-label predictions for a given input document. To this end, we computed embeddings with two different flavors of a pre-trained BERT [14] model and subsequently fed these vectors to a handful of traditional classifiers. Then, we applied a voting ensemble [21] to their output to combine them into a final classifier, incorporating all of them as well as the entirety of the available features.

The shared task was very competitive, with 13 system submissions. The margin among the top five submissions was very narrow (\({\pm }0.75\%\)), illustrating that they pushed the boundaries of what was possible with the provided data and task. In the end, our approach ranked among the top results, scoring the highest values for two out of four evaluated metrics and the second-best for the other two: accuracy (\(75.6\%\)), precision (\(75.7\%\)), recall (\(75.6\%\)), and F1 (\(75.4\%\)).

The remainder of this paper is structured as follows. Section 2 presents the related work, and Sect. 3 introduces our methodology. In the ensuing Sect. 4, we describe our experiments. Finally, we draw conclusions in Sect. 5.

2 Related Work

The classification of scholarly papers into research fields has found ample application, for example, to ease organizing or searching the flood of new publications.

One such system [8] groups biomedical papers by applying non-negative matrix factorization [17] to the term relevance vectors of the documents. It uses bisecting k-means clustering [6], and, at the same time, assigns semantic meaning to each document and cluster inferred from the matrix decompositions.

The work by Taheriyan [27] describes an approach to classifying papers by using relationships such as common authors and references as well as citations in a graph. This information allows new papers to be assigned topics automatically instead of requiring manual annotations.

Nguyen and Shirai [20] focus on various text features, such as the segmentation of the paper, and apply three different classifiers: multi-label kNN [30], a binary approach [28], and their newly proposed back-off model. While the latter performs best, another interesting insight from their results is that using only the title, abstract, and the Introduction and Conclusions sections of papers improves over using the full text as a feature.

Another approach is presented by Kim and Gil [16]: they describe a classification system based on latent Dirichlet allocation [7] and term frequency-inverse document frequency [25]. The former is employed to extract relevant keywords from the abstracts, the latter to cluster papers with similar topics using k-means [4].

More recently, SPECTER [12] uses pre-trained language models (e.g., SciBERT [5]) to generate document-level embeddings from the titles and abstracts. These can be used for downstream tasks, such as predicting the class of a document, which is demonstrated by applying SPECTER to a new dataset with papers in 19 classes. In that work, incorporating the entire text of papers remains an open issue due to memory limitations and the availability of the paper contents.

3 The SLAMFORC System

This section describes our approach to solving the shared task. We first explain the multi-modal data of our system. Then, we detail the classifiers we used with this data.

Figure 1 shows an overview of the system. Its code is publicly available.

Fig. 1. Overview of the system architecture.

3.1 Multi-modal Data

The dataset for the shared task [1, 2] consisted of approximately 60,000 scholarly articles, compiled from various sources such as the Open Research Knowledge Graph [3], arXiv, Crossref, and the Semantic Scholar Academic Graph [29]. It spans 123 fields of research (FoR) across five major domains and four hierarchical levels, mapped to the ORKG taxonomy. The challenge of imbalanced data is evident in the dataset, where the distribution of fields is uneven, varying from as few as eight articles (for Molecular, Cellular, and Tissue Engineering) to over 6,000 (for Physics).

We utilized Crossref to further enrich the text data of the papers. Specifically, for each paper, we used its Digital Object Identifier (DOI) and the Crossref API client to retrieve its annotated subjects and references from the Crossref Unified Resource API. For the paper with the DOI “10.1007/JHEP06(2012)126,” for example, we retrieved the subject “Nuclear and High Energy Physics” and the metadata of 37 referenced papers. Although Crossref adopts a different taxonomy, the retrieved subject remains highly useful for predicting the target label of this paper (i.e., “Physics”). Likewise, the referenced papers mostly belong to the Physics domain, which provides an additional useful signal.
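To make this retrieval step concrete, the following is a minimal sketch that queries the public Crossref REST API directly with the `requests` library instead of the client we used; the `works/{DOI}` endpoint and the `subject` and `reference` fields are part of the documented Crossref works schema, but error handling and polite rate limiting are omitted.

```python
import requests

CROSSREF_WORKS = "https://api.crossref.org/works/{doi}"

def fetch_crossref_metadata(doi: str) -> dict:
    """Retrieve the subjects and reference entries that Crossref annotates for a DOI."""
    response = requests.get(CROSSREF_WORKS.format(doi=doi), timeout=30)
    response.raise_for_status()
    message = response.json()["message"]
    return {
        "subjects": message.get("subject", []),      # e.g., ["Nuclear and High Energy Physics"]
        "references": message.get("reference", []),  # one dict of metadata per cited work
    }

meta = fetch_crossref_metadata("10.1007/JHEP06(2012)126")
print(meta["subjects"], len(meta["references"]))
```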

We used the title, abstract, and publisher information from the provided dataset, along with the retrieved subject data, to generate the metadata embeddings for each paper. We concatenated all this data into a single input text and fed it to SciNCL [23], a pre-trained BERT model, to compute an embedding as a comprehensive representation of each paper.
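A minimal sketch of this embedding step is shown below, assuming the SciNCL checkpoint published on the Hugging Face hub (`malteos/scincl`) and, following the SPECTER/SciNCL convention, using the [CLS] token representation as the document vector; the exact field separator and pooling we used may differ.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint name for the pre-trained SciNCL model.
tokenizer = AutoTokenizer.from_pretrained("malteos/scincl")
model = AutoModel.from_pretrained("malteos/scincl").eval()

def metadata_embedding(title: str, abstract: str, publisher: str, subjects: list) -> torch.Tensor:
    """Concatenate the metadata fields into one input text and embed it with SciNCL."""
    text = tokenizer.sep_token.join([title, abstract, publisher, ", ".join(subjects)])
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)  # [CLS] token as the paper representation
```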

In order to make use of the full text of the papers in the dataset, we first had to obtain the respective documents. This was straightforward for items that already had a download link annotated. For all other papers, we used the DOI field, where available, to find the PDFs. In some cases, neither was available; for those, we relied on Crossref's API to resolve the paper title to its DOI, which allowed us to download the full-text document, if it was available.
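The title-to-DOI resolution can be sketched as follows, again against the public Crossref REST API; the `query.bibliographic` parameter is part of that API, while taking the top-ranked hit without further validation is a simplification.

```python
import requests

def resolve_doi_by_title(title: str):
    """Query Crossref for the best-matching work and return its DOI, if any."""
    response = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 1},
        timeout=30,
    )
    response.raise_for_status()
    items = response.json()["message"]["items"]
    return items[0]["DOI"] if items else None
```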

To extract the text from the PDFs, we employed PaperMage [19]. For each PDF, it produces a JSON file with information about its content and structure. We only relied on the extracted symbols, which we used to reconstruct the full text of the respective papers. Using this data, we computed the document-level embeddings with two pre-trained BERT models: SciBERT [5] and SciNCL [23]. Because BERT can only process 512 tokens at a time [14] and papers exceed this limit, we split the input data accordingly. We employed a sliding window of 512 tokens with an overlap of 128 to preserve semantics near the window borders. After computing the embeddings for each such chunk, we averaged them to obtain the final document-level embedding.
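A minimal sketch of this chunking-and-averaging scheme with SciBERT (`allenai/scibert_scivocab_uncased`) follows; taking the [CLS] vector per chunk is an assumption about the pooling detail, and batching across chunks is omitted for brevity.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased").eval()

def fulltext_embedding(text: str, window: int = 512, overlap: int = 128) -> torch.Tensor:
    """Slide a 512-token window with 128-token overlap over the full text and
    average the per-chunk embeddings into one document-level vector."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    stride = window - overlap
    cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
    chunk_vectors = []
    for start in range(0, max(len(ids), 1), stride):
        chunk = [cls_id] + ids[start : start + window - 2] + [sep_id]  # leave room for [CLS]/[SEP]
        input_ids = torch.tensor([chunk])
        with torch.no_grad():
            output = model(input_ids=input_ids, attention_mask=torch.ones_like(input_ids))
        chunk_vectors.append(output.last_hidden_state[:, 0, :])  # [CLS] vector of this chunk
        if start + window >= len(ids):  # the current window already covers the rest
            break
    return torch.cat(chunk_vectors).mean(dim=0)
```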

To incorporate the visual information contained in the PDFs, we extracted all their images and converted them to raster graphics. For each image, we used an OpenCLIP [11] model pre-trained on the LAION-5B dataset [26] as well as a pre-trained DINOv2 [22] model to extract image features. When PDFs contained multiple images, we used mean-pooling to aggregate the multiple feature vectors per model, resulting in two vectors per PDF, one for each applied model. For papers where the PDF did not contain any images or the PDF was not available, these vectors were set to zero.
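The sketch below illustrates this feature extraction under assumed checkpoints (the `ViT-B-32` OpenCLIP model with a LAION-trained weight tag and the `dinov2_vits14` hub model); the exact variants, preprocessing, and feature dimensions we used may differ, and for simplicity both models share the CLIP preprocessing here.

```python
import torch
import open_clip
from PIL import Image

# Assumed checkpoints; the exact OpenCLIP/DINOv2 variants may differ from the ones we used.
clip_model, _, clip_preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
clip_model.eval()
dino_model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

def image_features(image_paths: list, dim_clip: int = 512, dim_dino: int = 384):
    """Mean-pool the per-model features over all images of a PDF; zero vectors if there are none."""
    if not image_paths:
        return torch.zeros(dim_clip), torch.zeros(dim_dino)
    clip_feats, dino_feats = [], []
    with torch.no_grad():
        for path in image_paths:
            image = clip_preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
            clip_feats.append(clip_model.encode_image(image).squeeze(0))
            dino_feats.append(dino_model(image).squeeze(0))  # DINOv2 normalization is simplified here
    return torch.stack(clip_feats).mean(dim=0), torch.stack(dino_feats).mean(dim=0)
```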

3.2 Classifier

For the final system, we used a mixture of traditional classifiers and neural methods that we combined with an ensemble voting method [21]. Figure 1 shows an overview of the system. After computing the embeddings for the various data sources, we trained several classifiers that could handle vectors as input and predict the single-label class for each item in the dataset.

An obvious choice is Support Vector Machines (SVM) [13]. Given the nature of the input data, they naturally classify the embedding vectors in a high-dimensional space and predict the field of research label. We employed a Random Forest (RF) [15] since it is less prone to overfitting to the training data, a problem to be expected given the class imbalance in the dataset. Logistic Regression (LR) is another widely used traditional classifier for predicting single labels on linearly separable data. With eXtreme Gradient Boosting (XGB) [10], we used another popular method that can achieve good performance while sacrificing interpretability. Next, we employed a fully connected neural network, a Multilayer Perceptron (MLP), which can handle data that is not linearly separable. Furthermore, we also trained SciNCL [23] end-to-end on the metadata.

Finally, we combined the output of the classifiers described above into an ensemble method [21] with hard voting [18]. This enabled the use of all techniques and all available data at the same time while still producing a single predicted label for each item in the dataset.
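A minimal sketch of the traditional classifiers and the hard-voting combination, using scikit-learn and xgboost, is given below; the hyperparameters are placeholders, and the end-to-end fine-tuned SciNCL model, which is not a scikit-learn estimator, is left out of this sketch.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

def build_ensemble() -> VotingClassifier:
    """Hard-voting ensemble over the individual classifiers; hyperparameters are placeholders."""
    return VotingClassifier(
        estimators=[
            ("svm", SVC()),
            ("rf", RandomForestClassifier()),
            ("lr", LogisticRegression(max_iter=1000)),
            ("xgb", XGBClassifier()),
            ("mlp", MLPClassifier()),
        ],
        voting="hard",
    )

# X_*: stacked metadata, full-text, and image embeddings; y_*: integer-encoded FoR labels
# ensemble = build_ensemble().fit(X_train, y_train)
# predictions = ensemble.predict(X_val)
```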

Table 1. Initial results of the individual classifiers on the validation set with all features (embeddings of metadata, full text, and images), as measured by accuracy, weighted precision, recall, and F1.
Table 2. Ablation study on the validation set by feature combination.

4 Experiments

Table 1 shows the results of the initial experiments. We used a set of traditional classifiers as implemented by scikit-learn [24] with all of the available data for each paper, consisting of the stacked embedding vectors. Since no method significantly outperformed the others, we combined all of them post hoc using a voting ensemble, giving us the final classifier whose results we submitted to the shared task.

To illustrate the impact of each data source and dissect our multi-modal approach, we performed a feature ablation study, the results of which are shown in Table 2. We used our final system architecture with all classifiers combined by voting on the powerset of possible feature combinations. It is evident that the (embeddings of the) metadata have the most positive influence on the results. Still, adding further information to the classifier is not detrimental but rather contributes to a higher score. This holds especially for the (embeddings of the) full texts of the papers, which perform decently on their own. Using only the embeddings of the images in the papers, where applicable, achieves clearly worse results than the other two data sources. Nevertheless, the combination of all features is among the highest scoring for all four employed metrics, so there was no reason not to rely on everything available.
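The ablation can be sketched as iterating over every non-empty combination of the three feature groups and evaluating the classifier on each; the helper below assumes pre-computed embedding matrices per modality, index arrays for the train/validation split, and a factory like the `build_ensemble` sketch above.

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import f1_score

def feature_ablation(modalities, y_train, y_val, train_idx, val_idx, build_model):
    """Evaluate a classifier on every non-empty combination of feature groups.

    `modalities` maps a name (e.g., "metadata") to its embedding matrix, aligned
    row-wise with the labels; `build_model` returns a fresh, unfitted classifier.
    """
    scores = {}
    for r in range(1, len(modalities) + 1):
        for subset in combinations(sorted(modalities), r):
            X_train = np.hstack([modalities[m][train_idx] for m in subset])
            X_val = np.hstack([modalities[m][val_idx] for m in subset])
            model = build_model().fit(X_train, y_train)
            scores["+".join(subset)] = f1_score(y_val, model.predict(X_val), average="weighted")
    return scores
```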

Finally, Table 3 shows the results of the shared task evaluation. Our submission (ID 683689, top row) scored the highest for precision (\(75.7\%\)) and F1 (\(75.4\%\)) while achieving the second-best values for accuracy (\(75.6\%\)) and recall (\(75.6\%\)). This goes to show that our multi-modal approach worked and performed well in this competition. Without further knowledge of the other systems, no detailed comparisons can be made or insights gained; we leave these for future work. In conclusion, the automated field of research classification of scientific papers is still challenging, but the submissions to this shared task seem to have pushed the boundaries of what was possible with the given tools and information, seeing how close the top results were.

Table 3. The final evaluation results on the test set as measured by accuracy, weighted precision, recall, and F1 (best for each in bold, runner-up underlined). Our submission is the first line.

5 Conclusions

In this paper, we presented SLAMFORC, a system for Single-Label Multi-modal Field of Research Classification. We used it to produce the results for our submission to the shared task Field of Research Classification of Scholarly Publications. Pursuing a multi-modal approach that incorporates not only the given dataset containing metadata of the papers but also the full text of the publications as well as the images in these documents, we combined a set of traditional classifiers into a voting ensemble. We computed the embeddings with pre-trained large language models, stacked these vectors, and trained the individual classifiers. Then, we used them jointly to obtain a single-label prediction for each item in the dataset.

As one of the conclusions of this work, we would like to raise some issues with the evaluation method. A metric that also considers the semantics of the taxonomy might have enabled a more effective evaluation and allowed for insights into the inner workings of the systems, especially in connection with the misclassified items. One such metric was proposed by Chen et al. [9]; it evaluates the performance of taxonomic assignments based on the given taxonomy.

Our system achieved the highest precision and F1 and the second-best accuracy and recall values of all submissions, demonstrating its effectiveness. While, judging by the narrow range of the top submissions, the ceiling of what was possible in this shared task seems to have been reached, we hope to have contributed to the still challenging classification of research fields for scientific publications.