Domain-independent Extraction of Scientific Concepts from Research Articles

We examine the novel task of domain-independent scientific concept extraction from abstracts of scholarly articles and present two contributions. First, we suggest a set of generic scientific concepts that have been identified in a systematic annotation process. This set of concepts is utilised to annotate a corpus of scientific abstracts from 10 domains of Science, Technology and Medicine at the phrasal level in a joint effort with domain experts. The resulting dataset is used in a set of benchmark experiments to (a) provide baseline performance for this task and (b) examine the transferability of concepts between domains. Second, we present two deep learning systems as baselines. In particular, we propose active learning to deal with different domains in our task. The experimental results show that (1) a substantial agreement is achievable by non-experts after consultation with domain experts, (2) the baseline system achieves a fairly high F1 score, and (3) active learning enables us to nearly halve the amount of required training data.


Introduction
Scholarly communication today is a document-centric process. Research results are usually conveyed in written articles, as a PDF file with text, tables and figures. Automatic indexing of these texts is limited and generally does not access their semantic content. There are thus severe limitations on how current research infrastructures can support scientists in their work: finding relevant research works, comparing them, and compiling summaries is still tedious and error-prone manual work. The strong increase in the number of published research papers aggravates this situation [7].
Knowledge graphs are recognised as an effective approach to facilitate semantic search [3]. For academic search engines, Xiong et al. [42] have shown that exploiting knowledge bases like Freebase can improve search results. However, the introduction of new scientific concepts occurs at a faster pace than knowledge base curation, resulting in a large gap in knowledge base coverage of scientific entities [1], e.g. the task geolocation estimation of photos from the Computer Vision field is neither present in Wikipedia nor in more specialised knowledge bases like Computer Science Ontology (CSO) [35] or "Papers with code" [32]. Information extraction from text helps to identify emerging entities and to populate knowledge graphs [3]. Thus, information extraction from scientific texts is a first vital step towards a fine-grained research knowledge graph in which research articles are described and interconnected through entities like tasks, materials, and methods. Our work is motivated by the idea of the automatic construction of a research knowledge graph.
arXiv:2001.03067v1 [cs.IR] 9 Jan 2020
Information extraction from scientific texts obviously differs from its general-domain counterpart: understanding a research paper and determining its most important statements demands certain expertise in the article's domain. Every domain is characterised by its specific terminology and phrasing, which is hard to grasp for a non-expert reader. In consequence, extraction of scientific concepts from text would entail the involvement of domain experts and a specific design of an extraction methodology for each scientific discipline; both requirements are rather time-consuming and costly.
At present, a structured study of these assumptions is missing. We thus present the task of domain-independent scientific concept extraction. This article examines the intuition that most domain-specific articles share certain core concepts such as the mentions of research tasks, used materials, or data. If so, these would allow a domain-independent information extraction system, which does not reach all semantic depths of the analysed article, but still provides some science-specific structure.
In this paper, we introduce a set of science concepts that generalise well over the set of examined domains (10 disciplines from Science, Technology and Medicine (STM)). These concepts have been identified in a systematic, joint effort of domain experts and non-domain experts. The inter-coder agreement is measured to ensure the adequacy and quality of concepts. A set of research abstracts has been annotated using these concepts and the results are discussed with experts from the corresponding fields. The resulting dataset serves as a basis to train two baseline deep learning classifiers. In particular, we present an active learning approach to reduce the amount of required training data. The systems are evaluated in different experimental setups.
Our main contributions can be summarised as follows: (1) We introduce the novel task of domain-independent scientific concept extraction, which aims at automatically extracting scientific entities in a domain-independent manner. (2) We release a new corpus that comprises 110 abstracts of 10 STM domains annotated at the phrasal level. Additionally, we release a silver-labelled corpus with 62K automatically annotated Elsevier abstracts under a CC-BY license and 1.2 million extracted unique concepts comprising 24 domains. (3) We present two baseline deep learning systems for this task, including an active learning approach. To the best of our knowledge, this is the first approach that applies active learning to scholarly texts. We demonstrate that about half of the training data are sufficient to match the performance obtained with the entire training set. (4) We make our corpora and source code publicly available to facilitate further research.

Related Work
This section gives a brief overview of existing annotated scientific corpora before some exemplary applications for domain-independent information extraction from scientific papers and the respective state of the art are introduced.

Scientific corpora
Sentence level annotation. Early approaches for semantic structuring of research papers focused on sentences as the basic unit of analysis. This supports, for instance, automatic highlighting of relevant paper passages so that quality and relevance can be assessed efficiently. Several ontologies have been created that focus on the rhetorical [17,11], argumentative [41,27] or activity-based [33] structure of research papers.
Annotated datasets exist for several domains, e.g. PubMed200k [12] from biomedical randomized controlled trials, NICTA-PIBOSO [22] from evidence-based medicine, Dr. Inventor [14] from Computer Graphics, Core Scientific Concepts (CoreSC) [27] from Chemistry and Biochemistry, Argumentative Zoning (AZ) [41] from Chemistry and Computational Linguistics, and Sentence Corpus [8] from Biology, Machine Learning and Psychology. Most datasets cover only a single domain, while a few cover three domains. Several machine learning methods have been proposed for scientific sentence classification [20,12,14,26].
Phrase level annotation. More recent corpora have been annotated at phrasal level. SciCite [9] and ACL ARC [21] are datasets for citation intent classification from Computer Science, Medicine, and Computational Linguistics. ACL RD-TEC [18] from Computational Linguistics aims at extracting scientific technology and non-technology terms. ScienceIE17 [2] from Computer Science, Material Sciences, and Physics contains three concepts: PROCESS, TASK and MATERIAL. SciERC [28] from the machine learning domain contains six concepts: TASK, METHOD, METRIC, MATERIAL, OTHER-SCIENTIFICTERM and GENERIC. Each corpus covers at most three domains.
Experts vs. non-experts. The aforementioned datasets were usually annotated by domain experts [12,22,2,28,18,27]. In contrast, Teufel et al. [41] explicitly use non-experts in their annotation tasks, arguing that text understanding systems can use general, rhetorical and logical aspects also when qualifying scientific text. Following this line of thought, more researchers used (presumably cheaper) non-expert annotation as an alternative [14,8].
Snow et al. [39] provide a study on expert versus non-expert performance for general, non-scientific annotation tasks. They state that about four non-experts (Mechanical Turk workers, in their case) were needed to rival the experts' annotation quality. However, systems trained on data generated by non-experts were shown to benefit from annotation diversity and to suffer less from annotator bias. A recent study [34] examines the agreement between experts and non-experts for visual concept classification and person recognition in historical video data. For the task of face recognition, training with expert annotations led to an increase of only 1.5 % in classification accuracy.
Active learning in Natural Language Processing (NLP). To the best of our knowledge, active learning has not been applied to classification tasks for scientific text yet. Recent publications demonstrate the effectiveness of active learning for NLP tasks such as Named Entity Recognition (NER) [37] and sentence classification [44]. Siddhant and Lipton [38] and Shen et al. [37] compare several sampling strategies on NLP tasks and show that Maximum Normalized Log-Probability (MNLP) based on uncertainty sampling performs well in NER.

Applications for domain-independent scientific information extraction
[...] [1]. These graphs interlink the papers through metadata such as citations, authors, venues, and keywords, but not through deep semantic representation of the articles' content. However, first attempts towards a more semantic representation of article content exist: Ammar et al. [1] interlink the Semantic Scholar Corpus with DBpedia [25] and Unified Medical Language System (UMLS) [6] using entity linking techniques. Yaman et al. [43] connect SciGraph with DBpedia person entities. Xiong et al. [42] demonstrate that academic search engines can greatly benefit from exploiting general-purpose knowledge bases. However, the coverage of science-specific concepts is rather low [1].
Research paper recommendation systems. Beel et al. [4] provide a comprehensive survey about research paper recommendation systems. Such systems usually employ different strategies (e.g. content-based and collaborative filtering) and several data sources (e.g. text in the documents, ratings, feedback, stereotyping). Graph-based systems, in particular, exploit citation graphs and genes mentioned in the papers [23]. Beel et al. conclude that it is not possible to determine the most effective recommendation approach at the moment. However, we believe that a fine-grained research knowledge graph can improve such systems. Although "Papers with code" [32] is not a typical recommendation system, it allows researchers to browse easily for papers from the field of machine learning that address a certain task.

Domain-independent scientific concept extraction: A corpus
In this section, we introduce the novel task of domain-independent extraction of scientific concepts and present an annotated corpus. As the discussion of related work reveals, the annotation of scientific resources is not a novel task. However, most researchers focus on at most three scientific disciplines and on expert-level annotations. In this work, we explore the domain-independent annotation of scientific concepts based on abstracts from ten different science domains. Since other studies have also shown that non-expert annotations are feasible for the general and scientific domain, we opt for a cost-efficient middle course: annotations by non-experts experienced in the annotation task, combined with consultation of domain experts. Finally, we explore how well state-of-the-art machine learning approaches perform on this novel, domain-independent information extraction task and whether active learning can save annotation costs. The base corpus, which we make publicly available, and the annotation process are described below.

OA STM Corpus
The OA STM corpus [13] is a set of open access (OA) articles from various domains in Science, Technology and Medicine (STM). It was published in 2017 as a platform for benchmarking methods in scholarly article processing, among them scientific information extraction. The dataset contains a selection of 110 articles from 10 domains, namely Agriculture (Agr), Astronomy (Ast), Biology (Bio), Chemistry (Che), Computer Science (CS), Earth Science (ES), Engineering (Eng), Materials Science (MS), Mathematics (Mat), and Medicine (Med). While the original corpus contains full articles, this first annotation cycle focuses on the articles' abstracts.

Annotation process
The OA STM Corpus is used as a base for (a) the identification of potential domain-independent concepts and (b) a first annotated corpus for baseline classification experiments. The main actors in the annotation process were two post-doctoral researchers with a background in computer science (acting as non-expert annotators); their basic annotation assumptions were checked by experts from the respective domains.
Table 1: The four core scientific concepts that were derived in this study.
PROCESS: Natural phenomena or activities, e.g. growing (Bio), reduction (Mat), flooding (ES).
METHOD: A commonly used procedure that acts on entities, e.g. powder X-ray (Che), the PRAM analysis (CS), magnetoencephalography (Med).
MATERIAL: A physical or abstract entity used in scientific experiments or proofs, e.g. soil (Agr), the moon (Ast), the carbonator (Che).
DATA: The data themselves, measurements, or quantitative or qualitative characteristics of entities, e.g. rotational energy (Eng), tensile strength (MS), 3D time-lapse seismic data (ES).
Pre-annotation. A literature review of annotation schemes [27,2,26,11] provided a seed set of potential candidate concepts. Both non-experts independently annotated a subset of the STM abstracts with these concepts and discussed the outcome. In a three-step process, the concept set was pruned to only contain those which seemed suitably transferable between domains. Our set of generic scientific concepts consists of PROCESS, METHOD, MATERIAL, and DATA (see Table 1 for their definitions). We also identified TASK [2], OBJECT [26], and RESULTS [11]. However, since we do not consider nested span concepts in this study, we leave them out: they were almost always nested with the other scientific entities (e.g. a RESULT may be nested with DATA).
Phase I. Five abstracts per domain (i.e. 50 abstracts) were annotated by both annotators and the inter-annotator agreement was computed using Cohen's κ [10]. Results showed a moderate inter-annotator agreement of 0.52 κ.
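For readers unfamiliar with the agreement measure, the following is a minimal sketch of Cohen's κ for two annotators. It assumes that agreement is computed over aligned units (here, one label per token); the text does not state the unit of comparison, and the function name and toy labels are purely illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same units.

    labels_a, labels_b: equally long sequences of category labels,
    where position i in both sequences refers to the same unit.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty units")
    n = len(labels_a)

    # Observed agreement: fraction of units with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example (made-up labels, not taken from the corpus).
annotator_1 = ["PROCESS", "DATA", "O", "O"]
annotator_2 = ["PROCESS", "O", "O", "O"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

On the toy example, observed agreement is 0.75 and chance agreement is 0.4375, which gives κ = 5/9.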
Phase II. The annotations were then presented to subject specialists who each reviewed (a) the choice of concepts and (b) annotation decisions on the respective domain corpus. The interviews mostly confirmed the concept candidates as generally applicable. The experts' feedback on the annotation was even more valuable: The comments allowed for a more precise reformulation of the annotation guidelines, including illustrating examples from the corpus.
Consolidation. Finally, the 50 abstracts from phase I were reannotated by the non-experts. Based on the revised annotation guidelines, a substantial agreement of 0.76 κ could be reached (see Table 2). Subsequently, the remaining 60 abstracts (six per domain) were annotated by one annotator. This last phase also involved reconciliation of the previously annotated 50 abstracts to obtain a gold standard corpus.

Experimental setup: Two baseline classifiers
The current state-of-the-art for scientific entity extraction is Beltagy et al.'s system [5]. We use their NER task-specific deep learning architecture atop SciBERT embeddings with a Conditional Random Field (CRF) based sequence tag decoder [29] and BILOU (beginning, inside, last, outside, unit) tagging scheme. The following classifiers are implemented in AllenNLP [15]. We report span-based micro-averaged F1 scores and use the ScienceIE17 [2] evaluation script.
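To illustrate the BILOU tagging scheme mentioned above, here is a small sketch that converts phrase-level annotations into per-token tags. The span representation (inclusive token indices) and the example sentence are assumptions made for illustration; they are not the format used by the authors' pipeline or by AllenNLP.

```python
def spans_to_bilou(num_tokens, spans):
    """Convert non-overlapping labelled spans into BILOU tags.

    spans: list of (start, end, label) with inclusive token indices.
    Returns one tag per token: B-, I-, L- for multi-token spans,
    U- for single-token spans, and O for tokens outside any span.
    """
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if start == end:
            tags[start] = f"U-{label}"
            continue
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
        tags[end] = f"L-{label}"
    return tags


# Illustrative token sequence built from two Table 1 examples.
tokens = ["3D", "time-lapse", "seismic", "data", "of", "soil"]
spans = [(0, 3, "DATA"), (5, 5, "MATERIAL")]
tags = spans_to_bilou(len(tokens), spans)
# tags == ["B-DATA", "I-DATA", "I-DATA", "L-DATA", "O", "U-MATERIAL"]
```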

Traditionally trained classifiers
Using the above-mentioned architecture, we train one model with data from all domains combined. We refer to this model as the domain-independent classifier. Similarly, we train 10 models, one for each domain in our corpus; we refer to these as the domain-specific classifiers.
To obtain a robust evaluation of models, we perform five-fold cross-validation experiments. In each fold experiment, we train a model on 8 abstracts per domain (i.e. 80 abstracts), tune hyperparameters on 1 abstract per domain (i.e. 10 abstracts), and test on the remaining 2 abstracts per domain (i.e. 20 abstracts) ensuring that the data splits are not identical between the folds. All results reported in the paper are averaged over the five folds. Please note that 8 abstracts have about 445 concepts so that the training data should be sufficient for the domain-dependent classifier.
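The split sizes above can be reproduced with a sketch like the following. The text does not specify how the folds were constructed, so the per-domain shuffle with a fixed seed and the rotation by two abstracts per fold are assumptions; the function name and abstract identifiers are hypothetical.

```python
import random


def make_folds(abstracts_by_domain, n_folds=5, n_test=2, n_dev=1, seed=0):
    """Build per-domain train/dev/test splits for each fold.

    abstracts_by_domain: dict mapping a domain name to a list of abstract ids.
    Each fold rotates the shuffled list of a domain by n_test positions,
    so the test abstracts differ between folds.
    """
    rng = random.Random(seed)
    shuffled = {
        domain: rng.sample(ids, len(ids))
        for domain, ids in sorted(abstracts_by_domain.items())
    }
    folds = []
    for k in range(n_folds):
        fold = {"train": [], "dev": [], "test": []}
        for domain, ids in shuffled.items():
            offset = (k * n_test) % len(ids)
            rotated = ids[offset:] + ids[:offset]
            fold["test"] += rotated[:n_test]
            fold["dev"] += rotated[n_test:n_test + n_dev]
            fold["train"] += rotated[n_test + n_dev:]
        folds.append(fold)
    return folds


# 10 domains with 11 abstracts each, as in the annotated corpus.
domains = ["Agr", "Ast", "Bio", "Che", "CS", "ES", "Eng", "MS", "Mat", "Med"]
corpus = {d: [f"{d}-{i}" for i in range(11)] for d in domains}
folds = make_folds(corpus)
```

With 11 abstracts per domain, every fold contains 80 training, 10 development and 20 test abstracts, and under this rotation no abstract appears in the test set of more than one fold.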

Active learning trained classifier
Based on the results of the aforementioned comparison studies [38,44], we decide to use MNLP [37] as the sampling strategy in the active learning setting. It is chosen over other possibly suitable candidates such as Bayesian Active Learning by Disagreement (BALD) [19], which is another powerful strategy but has higher computational requirements. In each iteration, the algorithm greedily selects sentences from the overall dataset, aiming at higher performance with a minimum number of sentences. In our experiments, we found that adding 4% of the data per iteration was the most discriminative step size with respect to classifier performance. Therefore, we run 25 iterations of active learning, each adding 4% of the training data. To obtain a robust evaluation of models, we repeat the experiment for five folds and average the results. The models use the same hyperparameters as for the domain-independent classifier. We retrain the model within each iteration and fold.
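The selection step of MNLP can be sketched as follows. The sketch assumes that, for every unlabelled sentence, the current model supplies the log-probability of its most likely tag sequence together with the sentence length; how these values are obtained from the CRF decoder is not shown, and all names and numbers are illustrative rather than taken from the authors' implementation.

```python
def mnlp_score(best_sequence_log_prob, num_tokens):
    """Length-normalised log-probability of the most likely tag sequence.

    Normalising by length avoids favouring long sentences, whose raw
    sequence log-probability is low simply because they have more tokens.
    """
    return best_sequence_log_prob / num_tokens


def select_batch(pool, batch_size):
    """Pick the sentences the model is least confident about.

    pool: dict mapping sentence id -> (best_sequence_log_prob, num_tokens).
    Returns the ids with the lowest MNLP scores, most uncertain first.
    """
    ranked = sorted(pool, key=lambda sid: mnlp_score(*pool[sid]))
    return ranked[:batch_size]


# Made-up pool: s3 has the lowest raw log-probability, but only because
# it is long; after normalisation s2 and s1 are the most uncertain.
pool = {
    "s1": (-0.9, 3),   # score -0.3
    "s2": (-3.0, 2),   # score -1.5
    "s3": (-4.0, 20),  # score -0.2
}
selected = select_batch(pool, batch_size=2)
```

In the setting described above, one such batch would hold 4% of the training data, and the model would be retrained after each of the 25 selections.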

Experimental results and discussion
This section describes the results of the experimental setup and the correlation analysis between inter-annotator agreement and performance of the several classifiers. Table 4 shows an overview of the domain-independent classifier results. The system achieves an F1 score of 65.5 (± 1.26) in the overall task. For this classifier, MATERIAL was the easiest concept with an F1 of 71 (± 1.88), whereas METHOD was the hardest concept with an F1 of 43 (± 6.30). The concept METHOD is also the most underrepresented one in our corpus, which partly explains the poor extraction performance.
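As a reminder of how a span-based micro-averaged F1 score is obtained, the following sketch pools all predicted and gold spans over documents and concepts and counts exact matches. It is a simplified illustration under the assumption of exact boundary and label matching; it is not the ScienceIE17 evaluation script, and the span tuples are made up.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over exact-match spans.

    gold, predicted: sets of (doc_id, start, end, label) tuples.
    A predicted span counts as correct only if boundaries and label
    both match a gold span in the same document.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: two of three predictions match the gold standard.
gold = {(0, 0, 3, "DATA"), (0, 5, 5, "MATERIAL"), (1, 2, 2, "PROCESS")}
predicted = {(0, 0, 3, "DATA"), (0, 5, 5, "PROCESS"), (1, 2, 2, "PROCESS")}
score = micro_f1(gold, predicted)
```

Here precision and recall are both 2/3, so the F1 score is 2/3 as well.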

Traditionally trained classifiers
Next, we compare the 10 domain-specific classifiers according to their capability to extract the concepts from their own domains and from other domains. The results are shown as F1 scores in Figure 1, where the x-axis represents the 10 test domains. We discuss some observations below.
One domain-specific classifier (see Figure 1) extracts scientific entities from its own domain at the same performance as the domain-independent classifier, with an F1 score of 71 (± 9.0), demonstrating a robust domain. This domain comprises only 11% of the overall data, yet the domain-independent classifier trained on all data does not outperform it.
Most generic domain. MS (the third last bar in each domain in Figure 1) exhibits a high degree of domain independence since it is among the top 3 classifiers for seven of the 10 domains (viz. ES, Che, CS, Ast, Agr, MS, and Bio).
Most specialised domain. Mat (the second last bar in each domain in Figure 1) shows the lowest performance in extracting scientific concepts from all domains except itself. Hence, it appears to be the most specialised domain in our corpus. Notably, a characteristic feature of this domain is that it has short abstracts (nearly a third of the size of the longest abstracts), so it is also the most underrepresented in our corpus. Also, unlike the other domains, Mat has triple the number of DATA entities compared to each of its other concepts, whereas in the other domains PROCESS and MATERIAL are consistently predominant.
Domain-independent vs. domain-dependent classifier. Except for Bio, the domain-independent classifier clearly outperforms the domain-dependent ones in extracting concepts from their respective domains. To analyse the reason, we investigate the improvements in the CS domain. We chose CS as an example because the size of the domain is slightly below the average and the domain strongly benefits from the domain-independent classifier: the F1 score improves from 49.5 (± 4.22) for the CS classifier to 65.9 (± 1.21). The F1 score for span detection improves from 73.4 (± 3.45) to 82.0 (± 3.98). Span detection usually requires fewer domain-dependent signals, so the domain-independent classifier can benefit from other domains. Token-level performance also improves from 67.7 (± 5.35) to 77.5 (± 4.42) F1; that is, correct labelling of the tokens also benefits from other domains. This is also supported by the token-level confusion matrix depicted in Figure 2 for the CS and the domain-independent classifier.
Scientific concept extraction. Figure 3 depicts the 10 domain-specific classifier results for extracting each of the four scientific concepts. It can be observed that the Agr, Med, Bio, and Ast classifiers are the best in extracting PROCESS, METHOD, MATERIAL, and DATA, respectively.
Figure 4 shows the results of the active learning experiment. Table 5 reports the fraction of training data at which the performance obtained with the entire training dataset is reached. MNLP clearly outperforms the random baseline: about 52 % of the training data suffice to reach that performance. Similar results for SciERC [28] and ScienceIE17 [2] demonstrate that MNLP can significantly reduce the amount of labelled data. To find out which mix of training data produces the most generic model, we analyse the distribution of sentences in the training data sampled by MNLP. As expected, the random sampling strategy uniformly samples sentences from all domains in each iteration. However, MNLP prefers Mat and CS the most and Eng and MS the least.

Correlations between inter-annotator agreement and performance
In this section, we analyse the correlations of inter-annotator agreement κ and the number of annotated concepts per domain (#) with the performance and variance of the classifiers, employing Pearson's correlation coefficient (Pearson's R). Table 6 summarises the results of our correlation analysis. The active learning classifier (AL-trained) has been trained with 52 % training data sampled by MNLP. For the domain-dependent, domain-independent and AL-trained classifier we observe a strong correlation between F1 and number of concepts per domain (R 0.70, 0.76, 0.68) and a weak correlation between κ and F1 (R 0.20, 0.28, 0.23). Thus, we can hypothesise that the number of annotated concepts in a particular domain has more influence on the performance than the inter-annotator agreement.
The correlation values for std differ between the classifier types. For the domain-dependent classifier, the correlations between κ and std (R 0.29) and between the number of concepts per domain and std (R 0.28) are slightly positive. In other words: the higher the agreement and the size of the domain, the higher the variance of the domain-dependent classifier. This is different for the domain-independent classifier, for which there is no correlation anymore. For the AL-trained classifier there is, on the other hand, a moderate negative correlation between κ and std (R -0.41), and a strong negative correlation between number of concepts per domain and std (R -0.72), i.e. higher agreement and larger amount of training data in a domain lead to less variance for the AL-trained classifier. We hypothesise that more diversity through several domains in the domain-independent and the AL-trained classifier leads to better performance and lower variance by introducing an inductive bias.
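The correlation coefficient used in this analysis can be computed as in the sketch below. The input values are placeholders chosen to make the result easy to verify; they are not the per-domain figures from Table 6, and the function name is illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return covariance / math.sqrt(var_x * var_y)


# Placeholder data: one value per domain would be used in practice,
# e.g. number of annotated concepts versus F1 score.
concepts_per_domain = [1.0, 2.0, 3.0, 4.0]
f1_per_domain = [2.0, 4.0, 6.0, 8.0]
r_positive = pearson_r(concepts_per_domain, f1_per_domain)
r_negative = pearson_r(concepts_per_domain, list(reversed(f1_per_domain)))
```

A perfectly linear increasing relation yields R = 1 and a perfectly linear decreasing relation yields R = -1; the values reported above lie between these extremes.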

Conclusions
In this paper, we have introduced the novel task of domain-independent concept extraction from scientific texts. During a systematic annotation procedure involving domain experts, we have identified four general core concepts that are relevant across the domains of Science, Technology and Medicine. To enable and foster research on these topics, we have annotated a corpus for the domains. We have verified the adequacy of the concepts by evaluating the human annotator agreement for our broad STM domain corpus. The results indicate that the identification of the generic concepts in a corpus covering 10 different scholarly domains is feasible for non-experts with moderate agreement and, after consultation with domain experts, with substantial agreement (0.76 κ).
We have presented two deep learning systems which achieved a fairly high F1 score (65.5% overall). The domain-independent system noticeably outperforms the domain-dependent systems, which indicates that the model can generalise well across domains. We also observed a strong correlation between the number of annotated concepts per domain and classifier performance, and only a weak correlation between inter-annotator agreement per domain and the performance. We can hypothesise that more annotated data positively influence the performance in the respective domain.
Furthermore, we have suggested active learning for our novel task. We have shown that only approx. 5 annotated abstracts per domain serving as training data are sufficient to build a performant model. Our active learning results for SciERC [28] and ScienceIE17 [2] datasets were similar. The promising results suggest that we do not need a large annotated dataset for scientific information extraction. Active learning can significantly save annotation costs and enable fast adaptation to new domains.
We make our annotated corpus, a silver-labelled corpus with 62K abstracts comprising 24 domains, and source code publicly available. We hope to facilitate the research on that task and several applications, e.g. academic search engines or research paper recommendation systems.
In the future, we plan to extend and refine the concepts for certain domains. Besides, we want to apply and evaluate the information extraction system to populate a research knowledge graph. For that we plan to extend the corpus with co-reference annotations [24] so that mentions referring to the same concept can be collapsed.