1 Introduction

Characterising scholarly documents according to their relevant research topics enables a variety of functionalities, such as: (i) semantically enhancing the metadata of scientific publications, (ii) categorising proceedings in digital libraries, (iii) producing smart analytics, (iv) generating recommendations, and (v) detecting research trends [53]. In general, state-of-the-art approaches either classify papers in a top-down fashion, taking advantage of pre-existing categories from domain vocabularies, such as MeSH, PhySH, and the STW Thesaurus for Economics, or proceed in a bottom-up fashion, by means of topic detection methods such as probabilistic topic models [8, 24]. The first solution has the advantage of relying on a set of formally defined research topics associated with human-readable labels; however, it requires such a controlled vocabulary to be available. Conversely, bottom-up approaches do not require a predefined vocabulary, but they tend to produce noisier and less interpretable results [42].

In 2019, we released the Computer Science Ontology (CSO) [55], a large-scale, granular, and automatically generated ontology of research areas in Computer Science, which includes more than 14K research topics and 159K semantic relationships. CSO has been adopted by Springer Nature editors to classify the computer science proceedings they publish, such as the well-known LNCS series [45]. We also released CSO as a publicly available resource, to foster its adoption and the development of novel CSO-powered applications. However, many users interested in adopting CSO for characterising their data have limited understanding of semantic technologies and how to use an ontology for annotating documents. Hence, the natural next step was to develop a classifier that supports the annotation of research papers according to CSO [54].

In this paper, we present the latest version of the CSO Classifier (v3.0), a scalable solution for automatically classifying research papers according to the Computer Science Ontology. The CSO Classifier takes as input the textual components of a scientific paper (usually title, abstract, and keywords) and returns a selection of research topics drawn from CSO. It operates in three steps. First, it finds all topics in the ontology that are explicitly mentioned in the input text. Then, it identifies further semantically related topics by utilising part-of-speech tagging and word embeddings. Finally, it discards outliers and enriches this set of topics by taking advantage of the CSO taxonomy to include their super-areas. The Classifier has been evaluated on a gold standard of manually annotated research papers, demonstrating a significant improvement over a number of alternative approaches, including an earlier version (v2.0), which has been used by Springer Nature editors since 2018 to support the annotation of Computer Science proceedings [50].

This paper extends our earlier work, which was presented at TPDL 2019 [54]. In particular, the novel contributions presented here are as follows:

  1. an improved version of the CSO Classifier (3.0), which takes advantage of a novel component for discarding outliers from the result set;

  2. a new evaluation evidencing the improvement in performance brought about by the outlier detection component;

  3. a new solution for improving the classifier’s scalability;

  4. a strategy for applying the CSO Classifier to other disciplines;

  5. an overview of applications developed by adopters of the CSO Classifier;

  6. a revisited and updated literature review.

The CSO Classifier is implemented in Python and can be installed from PyPI using the command: pip install cso-classifier. It can also be downloaded from https://github.com/angelosalatino/cso-classifier. The data produced in the evaluation and the word embeddings model are available at http://w3id.org/cso/cso-classifier.

The rest of the manuscript is organised as follows. In Sect. 2, we review the literature regarding the classification of research articles and outline current limitations. In Sect. 3, we discuss the Computer Science Ontology. In Sect. 4, we describe the CSO Classifier and its modules. Next, in Sect. 5 we evaluate the CSO Classifier against alternative approaches, focusing on the performance of the new method for detecting outliers. In Sect. 6, we discuss our new solution for improving the scalability of the CSO Classifier, and in Sect. 7, we show how to apply the classifier to other fields of Science. In Sect. 8, we provide an overview of applications developed by early adopters of the CSO Classifier. Finally, in Sect. 9 we summarise the main contributions and outline future directions of research.

2 Literature review

The goal of topic classification is to identify the relevant subjects within a set of documents. In the scholarly communication domain, it specifically aims at identifying research topics within scientific documents. Approaches in this area can typically be characterised according to four main categories: (i) topic modelling, (ii) supervised machine learning approaches, (iii) approaches based on citation networks, and (iv) approaches based on natural language processing. In this section, we present the main state-of-the-art approaches in these four categories and discuss their limitations.

2.1 Topic modelling

Topic modelling is a type of statistical approach for discovering topics that occur in a collection of documents. One of the most acclaimed approaches is latent Dirichlet allocation (LDA), developed by Blei et al. [7]. LDA is a three-level hierarchical Bayesian model for retrieving latent—or hidden—patterns in texts. The basic idea is that each document is modelled as a mixture of topics, where a topic is a multinomial distribution over words, characterised as a discrete probability distribution defining the likelihood that each word will appear in a given topic. In other words, LDA aims to discover the latent structure which connects words to topics and topics to documents. This is achieved by computing the conditional distribution of the hidden variables (topics) given the observed variables (words) [7]. Over the years, LDA has influenced many other approaches, such as the work of Griffiths et al. [24], who designed a generative model for document collections. Their author–topic model simultaneously modelled the content of documents and the interests of authors. Bolelli et al. [8] further extended the author–topic model by introducing the segmented author–topic model (S-ATM), a model that uses the temporal ordering of documents to identify topic evolution and then exploits citations to weight the main terms in documents. Other approaches that fall within the topic modelling category are latent semantic analysis (LSA) [17], probabilistic latent semantic analysis (pLSA) [26], and the correlated topic model (CTM) [31], a subsequent work by Blei et al. that mitigates some limitations of the original LDA [7].
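To make the document-as-mixture-of-topics idea concrete, the following is a minimal sketch of training an LDA model with gensim; the toy corpus and parameter values are purely illustrative and do not correspond to any of the systems cited above.

```python
# Minimal LDA sketch with gensim; the corpus and parameters are illustrative.
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["semantic", "web", "ontology", "linked", "data"],
    ["neural", "network", "image", "segmentation"],
    ["database", "query", "indexing", "storage"],
]

dictionary = corpora.Dictionary(docs)               # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=10, random_state=0)

# Each document is modelled as a mixture of topics ...
for bow in corpus:
    print(lda.get_document_topics(bow))
# ... and each topic is a multinomial distribution over words.
print(lda.show_topic(0, topn=5))
```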

The advantage of these approaches is that they can achieve good results in the absence of a strong a priori categorisation and do not require training data. However, the resulting topics typically require manual verification by domain experts (e.g. senior researchers) in order to assign sound labels to them and to assess the best number of topics for a set of documents [6]. In addition, using a high number of topics usually introduces noise. As a result, the number of topics is normally kept low, with the consequence that the resulting classification is not very granular.

2.2 Supervised machine learning approaches

This second category of approaches for classifying research topics aims at developing a multi-class model in which each class refers to a research topic.

Mai et al. [36] developed an approach to subject classification using deep learning techniques and applied it to a set of papers annotated with the STW Thesaurus for Economics (\(\sim \)5K classes) and MeSH (\(\sim \)27K classes). Similarly, Chernyak [15] presented a supervised approach for annotating papers in Computer Science with topics from the ACM Computing Classification System.

Caragea et al. [12] developed an approach that combines a research article's textual content and citation network to predict the topic of the article. Specifically, they trained two different classifiers on the two sets of data, with the idea that their combined information has the potential to improve topic classification. They trained their classifiers over a corpus of 3186 papers distributed over six classes: agents, artificial intelligence, information retrieval, human computer interaction, machine learning, and databases. However, the goal of this technique is to classify papers according to one single broad subject category. In contrast, we propose an approach for classifying research documents according to more than one topic.

Kandimalla et al. [29] proposed a deep attentive neural network for classifying papers according to 104 Web of Science subject categories. Their classifier was trained on 9M abstracts from Web of Science and can be applied directly to abstracts without the need for additional information, e.g. citations. However, their approach struggles to discriminate between overlapping categories, due to the nature of research papers, which often encompass more than one subject.

HierClasSArt [1] is a recent approach for classifying articles according to a taxonomy of mathematical topics, which uses a combination of neural networks and knowledge graphs. It generates a knowledge graph from the abstracts and then classifies the papers using a latent representation that takes into account both the concepts in the knowledge graph and the metadata.

Garcia-Silva et al. [23] focus on the task of classifying scientific publications against a taxonomy of scientific disciplines, taking advantage of BERT [20] and its different flavours specialised in the scientific domain: BioBERT [33] and SciBERT [5]. Specifically, they train a multi-label classifier on 450K papers tagged with the 22 first-level categories of the ANZSRC taxonomy.

One of the major difficulties arising when developing supervised approaches is related to the gold standard [12]. It requires an intensive manual labelling effort to generate a gold standard that includes all possible classes (research topics) and that is also balanced with regard to the number of papers labelled per topic. Indeed, very broad areas tend to have many published papers and hence are extensively represented, while very specific areas tend to have fewer papers.

Some recent deep learning models, such as few-shot learning and zero-shot learning, may be able to mitigate this issue, but they require more research and refinement [69].

2.3 Approaches based on citation networks

Another set of approaches for classifying documents uses citation networks, and most of them are based on the principle of clustering scientific documents by means of co-citation analysis. The use of citations for detecting topics has been explored in many different ways, and some approaches combine citations with other entities, such as keywords and abstracts.

Upham et al. [64] used the Web of Science corpus to identify emerging topics within the years 1999–2004, represented as co-citation clusters. Small et al. [59] also performed co-citation analysis over Scopus data, aiming at identifying the top 25 emergent topics for each year from 2007 to 2010.

Boyack and Klavans [10] built a map of science from 20 million research articles spanning 16 years by means of co-citation techniques. Through this map, it is possible to observe the disciplinary structure of science, in which papers of the same area tend to group together.

Van Eck et al. [65] developed CitNetExplorer and VOSviewer, which can be used to cluster publications and to analyse the resulting clustering solutions. These two applications work at two different levels of granularity. Through visualisation techniques, CitNetExplorer focuses on the analysis of clusters at the level of individual publications, while VOSviewer focuses on the analysis of clusters at an aggregate level. CitNetExplorer and VOSviewer are heavily used by scientometricians to analyse developments in science [25, 70].

The main drawback of citation-based approaches is that they are able to assign each document to only one topic, whereas a document is rarely monothematic.

2.4 Approaches based on natural language processing

This category of topic classifiers groups all those unsupervised approaches that take advantage of natural language processing techniques, such as text analysis [16, 45] and word embeddings [72].

For instance, Decker [16] introduced an unsupervised approach that generates paper–topic relationships by exploiting keywords and abstracts, in order to analyse the trends of topics on different timescales.

Jo et al. [27] developed an approach that combines distributions of terms (i.e. n-grams) with the distribution of the citation graph related to publications containing that term. In particular, the authors assume that if a term is relevant for a topic, documents containing that term will have a stronger connection than randomly selected ones. Then, their algorithm identifies the set of terms having citation patterns exhibiting synergy.

Another set of methods relies only on keywords. For instance, Duvvuru et al. [21] built a network of co-occurring keywords and subsequently performed statistical analysis by calculating degree, strength, clustering coefficient, and end-point degree to identify clusters and associate them with research topics.

Some recent approaches use word embeddings, aiming to quantify semantic similarities between words based on their distributional properties in samples of text. For example, Zhang et al. [72] applied k-means to a set of words represented as embeddings. However, all these approaches to topic detection need to generate the topics from scratch rather than exploiting a domain vocabulary or ontology, resulting in noisier and less interpretable results [42]. The Microsoft Academic Graph team developed an approach for tagging documents according to the fields of study, a controlled vocabulary of research topics [58]. This approach associates embeddings with both topics and articles and computes the cosine similarity between them. It then classifies each article with all the topics that score a similarity higher than a threshold. However, this technique is not described in detail and the evaluation data are not available, making it difficult for the scientific community to reuse it or compare against it.

In sum, we still lack practical unsupervised approaches for classifying papers according to a granular set of topics. The CSO Classifier aims to address this gap, by providing high-quality automatic classification of research papers in the domain of Computer Science.

3 The computer science ontology

The Computer Science Ontology (CSO) is a large-scale ontology of research areas in the field of Computer Science. It was automatically generated using the Klink-2 algorithm [42] on a dataset of 16 million publications, mainly in the field of Computer Science [44]. Compared to other solutions available in the state of the art (e.g. the ACM Computing Classification System), the Computer Science Ontology includes a much higher number of research topics, which can support a granular representation of the content of research papers, and it can be easily updated by running Klink-2 on recent corpora of publications.

The current version of CSO includes 14K semantic topics and 159K relationships. The main root is Computer Science; however, the ontology also includes a few additional roots, such as Linguistics, Geometry, and Semantics. The CSO data model is an extension of SKOS, and it includes four main semantic relations:

  • superTopicOf, which indicates that a topic is a super-area of another one (e.g. Semantic Web is a super-area of Linked Data).

  • relatedEquivalent, which indicates that two topics can be treated as equivalent for the purpose of exploring research data (e.g. Ontology Matching and Ontology Mapping).

  • contributesTo, which indicates that the research output of one topic contributes to another.

  • owl:sameAs, which indicates that a research concept in CSO is equivalent to a concept described in an external resource, such as DBpedia, Wikidata, or Freebase.

CSO is available through the CSO Portal, a web application that enables users to download, explore, and visualise sections of the ontology. Moreover, users can use the portal to provide granular feedback at different levels, such as rating topics and relationships, and suggesting missing relationships. The reader can refer to [56] for a more detailed description of CSO and how it has been developed.

CSO is used by several tools and proved to effectively support a wide range of tasks, such as exploring and analysing scholarly data (e.g. Rexplore [44], ScholarLensViz [35], ConceptScope [71]), detecting research communities (e.g. TST [47], RCMB [46]), identifying domain experts (e.g. VeTo [66]), recommending articles [62] and video lessons [9], generating knowledge graphs [18] (e.g. Temporal KG [49], AIDA KG [2], AI KG [19]), knowledge graph embeddings (e.g. Trans4E [39]), and topic models (e.g. CoCoNoW [4]), and predicting academic impact (e.g. ArtSim [13]), research topics (e.g. Augur [52]), ontology concepts (e.g. SIM [11], POE [43]), and technologies (e.g. TTF [41], TechMiner [40]). CSO has also been adopted by Springer Nature, one of the top two international academic publishers, which uses it to support a number of innovative applications, including (i) Smart Topic Miner [51], a tool designed to assist the Springer Nature editorial team in classifying proceedings, (ii) Smart Book Recommender [62], an ontology-based recommender system for selecting books to market at academic venues, and (iii) the AIDA Dashboard [2], a web application for exploring and making sense of scientific conferences.

4 CSO classifier

The CSO Classifier is a tool that takes as input the textual components of a research paper (usually title, abstract, and keywords) and outputs the relevant topics drawn from CSO. It adopts an unsupervised approach, which has been shown to perform well against alternative methods—see Sect. 5. Here, we should emphasise that, although the classifier leverages a word embedding model, we consider this approach unsupervised because it does not require labelled examples, consistent with the characterisation of unsupervised methods in the work of Song and Roth [61] and other relevant literature [34].

The choice of an unsupervised approach is quite natural in our scenario. As already pointed out, a supervised machine learning algorithm would require an extensive set of annotated examples, covering the thousands of research topics provided by CSO. Clearly, such a dataset does not exist and it would be non-trivial to develop. A further advantage of an unsupervised approach is that there is no need for retraining the algorithm when new versions of the Computer Science Ontology are released.

Fig. 1 Architecture of the CSO Classifier

The CSO Classifier consists of three main components: (i) the syntactic module, (ii) the semantic module, and (iii) the post-processing module. Figure 1 shows its architecture.

The syntactic module parses the input documents and identifies CSO concepts that are explicitly referred to in the document. The semantic module uses part-of-speech tagging to identify promising terms and then exploits word embeddings to infer semantically related topics. Finally, the post-processing module combines the results of these two modules, discards outliers, and enhances the topic set by including relevant super-areas. To assist the description of our approach, we will use the sample paper [30] shown in Table 1 as a running example.

Table 1 Sample paper that will be analysed by the CSO Classifier [30]

4.1 Syntactic module

The syntactic module identifies topics that are explicitly referred to in the textual input, mapping n-grams to CSO concepts. First, the algorithm removes English stop words and collects unigrams, bigrams, and trigrams. Then, for each n-gram, it computes its Levenshtein similarity with the labels of the topics in CSO. Research topics whose similarity with an n-gram is equal to or higher than a threshold (i.e. the constant msm) are selected for the final set of topics. We empirically set msm to 0.94, which allows us to recognise lexical variations of CSO topics, such as hyphens (e.g. “knowledge based systems” and “knowledge-based systems”), plurals (e.g. “database” and “databases”), and British versus American spelling (e.g. “data visualisation” and “data visualization”).
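The following is a simplified sketch of this matching step. The tiny label list is illustrative rather than the full CSO vocabulary, stop-word removal is omitted for brevity, and Python's difflib ratio is used as a standard-library stand-in for the Levenshtein-based similarity used by the released classifier.

```python
# Simplified sketch of the syntactic matching step (illustrative labels only).
from difflib import SequenceMatcher

MSM = 0.94  # minimum similarity threshold, as described above
cso_labels = ["neural networks", "image segmentation", "data visualisation"]

def similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "recurrent neural networks for image segmentation and data visualization".split()
candidates = ngrams(tokens, 1) + ngrams(tokens, 2) + ngrams(tokens, 3)

matched = {label
           for gram in candidates
           for label in cso_labels
           if similarity(gram.lower(), label) >= MSM}
print(matched)
# {'neural networks', 'image segmentation', 'data visualisation'} (set order may vary)
```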

In Table 2, we report the list of topics returned by the syntactic module for the running example. In particular, we can identify some key topics that are central to the analysed paper, such as: “neural networks”, “image segmentation”, “recurrent neural networks”, and “image retrieval”.

4.2 Semantic module

The semantic module is designed to identify topics that are semantically related to the paper, but may not be explicitly mentioned in the text. It uses word embeddings produced by a word2vec model to compute the semantic similarity between the terms in the document and the CSO concepts.

The semantic module follows four steps: (i) entity extraction, (ii) CSO concept identification, (iii) concept ranking, and (iv) concept selection.

In the following sections, we first describe how we trained the word embedding model and then illustrate the algorithm.

4.2.1 Word embedding generation

We generated the word embeddings by training a word2vec model [37, 38] on a collection of text from Microsoft Academic Graph (MAG). MAG is a knowledge graph containing scientific publication records, citation relationships, authors, institutions, journals, conferences, and fields of study. It is the largest dataset of scholarly data publicly available [68], and, as of April 2021, it contains more than 258 million publications.

We first downloaded titles and abstracts of 4,654,062 English papers in the field of Computer Science. Then, we pre-processed the data by replacing spaces with underscores in all n-grams matching the CSO topic labels (e.g. “digital libraries” became “digital_libraries”) and performed a collocation analysis to identify frequent bigrams and trigrams (e.g. “highest_accuracies”, “highly_cited_journals”). These frequent n-grams were identified by analysing combinations of words that co-occur together, as suggested in [38]. Indeed, while it is possible to obtain the vector of an n-gram by summing the embedding vectors of all its tokens, the resulting representation is usually not as good as the one obtained by considering the n-gram as a single word during the training phase. As an example, Fig. 2 shows a two-dimensional projection of the term semantic_web and of the vector obtained by summing the embeddings of semantic and web (semantic+web). We can appreciate that semantic_web is closer than semantic+web to well-known semantic web technologies such as linked_data, ontologies, and RDF.

Table 2 Topics returned from the syntactic module when processing the paper in Table 1
Fig. 2 Scatter plot of PCA projection of the words: semantic_web, semantic+web, linked_data, ontology, and RDF

Finally, we trained the word2vec model, after testing several combinations of parameters.
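The sketch below illustrates this training pipeline with gensim (4.x API). The hyper-parameter values are placeholders rather than the ones selected by our tests, and the toy `sentences` list stands for the tokenised MAG titles and abstracts in which CSO labels have already been joined with underscores.

```python
# Sketch of the embedding-training pipeline with gensim (placeholder parameters).
from gensim.models import Word2Vec
from gensim.models.phrases import Phrases

sentences = [
    ["we", "index", "digital_libraries", "with", "semantic_web", "technologies"],
    ["recurrent", "neural_networks", "for", "image_segmentation"],
]

# Collocation analysis: promote frequently co-occurring tokens to n-grams.
bigrams = Phrases(sentences, min_count=1, threshold=1)
sentences = [bigrams[s] for s in sentences]

model = Word2Vec(sentences,
                 vector_size=128,   # embedding dimensionality (placeholder)
                 window=10,         # context window (placeholder)
                 min_count=1,       # keep rare words in this toy example
                 sg=1,              # skip-gram
                 workers=4)

print(model.wv.most_similar("semantic_web", topn=5))
```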

4.2.2 Entity extraction

Our main assumption is that research concepts are represented either by nouns or by adjectives followed by nouns. Indeed, 12% of the topics in CSO consist of just nouns and the remaining 88% follow the adjective–noun pattern (e.g. semantic web, neural networks). Analysing only these text chunks allows us to speed up computation and avoid combinations that usually result in false positives. To this end, the classifier tags each token in the text according to its part of speech (e.g. nouns, verbs, adjectives, adverbs) and then applies a grammar-based chunk parser to identify chunks of words, expressed by the following grammar:

$$\begin{aligned}<\mathrm {JJ}.*>*<\mathrm {NN}.*>+ \end{aligned}$$
(1)

where JJ represents adjectives and NN represents nouns.
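A minimal sketch of this chunk extraction with NLTK is shown below, using the grammar in Eq. (1); the sample sentence is illustrative, and the downloaded resource names may vary slightly across NLTK versions.

```python
# Sketch of the chunk-extraction step with NLTK, using the grammar in Eq. (1).
import nltk

# Resource names differ across NLTK versions; only the matching ones are needed.
for resource in ("punkt", "punkt_tab",
                 "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

text = "We propose a recurrent neural network for semantic image segmentation."
tagged = nltk.pos_tag(nltk.word_tokenize(text))     # [('We', 'PRP'), ('propose', 'VBP'), ...]

grammar = "CHUNK: {<JJ.*>*<NN.*>+}"                  # adjectives followed by nouns
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)

chunks = [" ".join(word for word, tag in subtree.leaves())
          for subtree in tree.subtrees()
          if subtree.label() == "CHUNK"]
print(chunks)   # e.g. ['recurrent neural network', 'semantic image segmentation']
```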

4.2.3 CSO concept identification

In this phase, the classifier processes the extracted chunks of words and uses the word embedding model to identify semantically related topics. First, it decomposes these chunks into unigrams, bigrams, and trigrams. Next, for each gram, it retrieves the top ten similar words (with cosine similarity higher than 0.7) from the word2vec model. The CSO topics matching these words are added to the result set. Figure 3 illustrates this process in more detail.

Fig. 3 Identification of CSO concepts semantically related to n-grams

When processing bigrams or trigrams, the classifier joins their tokens using an underscore, e.g. “web_application”, in order to refer to the corresponding word in the word2vec model. If an n-gram is not available within the vocabulary of the model, the classifier generates its representation by averaging the embedding vectors of its tokens.

A specific CSO concept can be identified multiple times, for two main reasons: (i) the same n-gram can appear multiple times within the title, abstract, and keywords, or (ii) multiple n-grams can be semantically related to the same CSO concept. Indeed, the concept “image_segmentation” can be inferred by several semantically related n-grams, such as: “segmentation”, “image_analysis”, “segmentation_method”, “contour_extraction”, “segmentation_techniques”, “object_segmentation”, and “image_segmentation_algorithm”.
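The sketch below illustrates this identification step. Here `model` is assumed to be the gensim word2vec model of Sect. 4.2.1 (e.g. the one trained in the previous sketch), and `cso_labels` is an illustrative stand-in for the CSO vocabulary.

```python
# Sketch of the concept-identification step for a single extracted n-gram.
import numpy as np

SIM_THRESHOLD = 0.7
cso_labels = {"image_segmentation", "image_analysis", "neural_networks"}

def ngram_vector(ngram):
    """Embedding of an n-gram; averages token vectors when the
    underscore-joined form is not in the model vocabulary."""
    if ngram in model.wv:
        return model.wv[ngram]
    tokens = [t for t in ngram.split("_") if t in model.wv]
    return np.mean([model.wv[t] for t in tokens], axis=0) if tokens else None

def related_concepts(ngram, topn=10):
    vec = ngram_vector(ngram)
    if vec is None:
        return set()
    similar = model.wv.similar_by_vector(vec, topn=topn)
    return {word for word, sim in similar
            if sim >= SIM_THRESHOLD and word in cso_labels}

print(related_concepts("segmentation_method"))   # e.g. {'image_segmentation'}
```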

4.2.4 Concept ranking

The previous step tends to produce a large number of topics (typically more than 70), some of which are only marginally related to the analysed research paper. For instance, when processing the paper in Table 1, some n-grams triggered topics like “automatic_segmentations” and “retrieval_algorithms”, which may be considered less relevant. For this reason, the semantic module weighs the identified CSO concepts according to their overall relevance to the paper. The relevance score of a topic is computed as the product of the number of times it was identified (frequency) and the number of unique n-grams that led to it (diversity). For instance, if a concept has been identified seven times, from three different n-grams, its final score will be 21. In addition, if a topic is directly mentioned in the paper, its score is set to the maximum score found. Finally, the classifier ranks the topics according to their relevance score.
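The scoring rule can be expressed in a few lines, as in the sketch below; the `triggers` map from candidate topics to the n-grams that led to them contains illustrative data.

```python
# Sketch of the relevance scoring: score = frequency x diversity, with
# directly mentioned topics promoted to the maximum score.
triggers = {
    "image_segmentation": ["segmentation", "segmentation", "image_analysis",
                           "segmentation_method", "object_segmentation",
                           "segmentation", "contour_extraction"],
    "retrieval_algorithms": ["retrieval_algorithm"],
}
directly_mentioned = {"image_segmentation"}

scores = {topic: len(grams) * len(set(grams))         # frequency * diversity
          for topic, grams in triggers.items()}

max_score = max(scores.values())
for topic in directly_mentioned:                      # explicit mentions get the top score
    scores[topic] = max_score

ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)   # [('image_segmentation', 35), ('retrieval_algorithms', 1)]
```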

4.2.5 Concept selection

Typically, the relevance scores of the candidate topics follow a long-tail distribution. For this reason, the classifier employs the elbow method [57] to ensure that only relevant topics are eventually selected. This technique was originally designed to find the appropriate number of clusters in a dataset. In particular, it observes the cost function for different numbers of clusters. The optimal number of clusters is then located at the elbow of the resulting curve. This point balances the number of clusters and the percentage improvement of the cost function.
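The sketch below shows one common way to locate the elbow of the ranked score curve, namely the point with maximum distance from the straight line joining the first and last points; the classifier's implementation may differ in detail.

```python
# Sketch of elbow detection on the ranked relevance scores.
import numpy as np

def elbow_index(scores):
    y = np.asarray(sorted(scores, reverse=True), dtype=float)
    x = np.arange(len(y), dtype=float)
    p1, p2 = np.array([x[0], y[0]]), np.array([x[-1], y[-1]])
    direction = (p2 - p1) / np.linalg.norm(p2 - p1)   # unit vector of the chord
    vecs = np.stack([x, y], axis=1) - p1
    proj = np.outer(vecs @ direction, direction)      # projections onto the chord
    distances = np.linalg.norm(vecs - proj, axis=1)   # perpendicular distances
    return int(np.argmax(distances))

scores = [35, 30, 28, 12, 6, 4, 3, 2, 2, 1]
cut = elbow_index(scores)
selected = sorted(scores, reverse=True)[:cut + 1]
print(cut, selected)   # keeps only the head of the long-tail distribution
```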

In Table 3, we report the list of topics obtained using the semantic module on the running example. In bold are the topics that were detected by the semantic module, but not by the syntactic module.

Table 3 Topics returned from the semantic module when processing the paper in Table 1

4.3 Post-processing

The post-processing module of the CSO Classifier combines the output of both the syntactic and semantic modules, discards outliers, and enhances the topic set by including relevant super-areas.

It takes the union of the result sets of the two modules, since this solution maximises the f-measure according to our experiments (see Sect. 5). However, it discards the topics returned by the semantic module that appear among the n most frequent words in the vocabulary of the embedding model (n = 3,000 in the current version). This is done to exclude very generic terms (e.g. “language”, “learning”, “component”) that tend to have a good similarity value with a large number of n-grams, typically resulting in too many false positives.

Differently from the original design presented in [54], version 3.0 of the CSO Classifier takes advantage of a new component that identifies outliers. We introduced this solution to address a specific type of recurring errors pointed out by the users of the CSO Classifier during the last year: in some cases, the approach would return erroneous topics that were conceptually distant from the others. Even in the cases in which these outliers were actually mentioned in the document, they were very marginal to the core topics and often identified as erroneous by human users. Indeed, several false positives produced by previous versions would follow this pattern. For instance, let us consider a set of topics such as bioinformatics, database, graph algorithms, keyword queries, query evaluation, query languages, query processing, query results, rdf, rdf graph, recommendation, routing scheme, semantic web, single path, sparql, and user query. Here, it is clear that the focus of the paper leans towards the application of SPARQL queries. Hence, topics such as bioinformatics, recommendation, routing scheme, and single path are in principle outliers and therefore candidates for rejection.

To address this issue, we studied the effect of outliers on the quality of our results and developed different methods for outlier detection. The main idea behind all these approaches is to compute the pairwise similarity of the topics and then delete the uncoupled ones. In order to find a good solution in this space, we performed extensive experiments and compared five metrics for measuring pairwise similarity and three approaches for selecting uncoupled topics. Section 5.2 describes these alternative solutions in detail, and Sect. 5.4 reports the best results for each combination. The evaluation confirmed that some outlier detection approaches improve the performance of our classifier.

Section 4.3.1 describes the method for detecting outliers that produced the best performance in our experiments and is therefore included in the new version of the CSO Classifier. Section 4.3.2 describes the semantic enhancement component, which defines the last step of the post-processing module.

4.3.1 Outlier detection

In order to identify outliers, we compute the pairwise similarity of the topics and identify disconnected ones. We then apply a set of heuristics to detect outliers to discard. In the following subsections, we describe this approach more in detail.

Computation of Similarity Matrix

We compute the pairwise similarity between pairs of topics by taking the maximum value between two similarity indices: graph similarity and embeddings similarity.

For the graph similarity, we take advantage of the graph representation of the Computer Science Ontology, where topics and the relationships between them are, respectively, expressed by nodes and edges. This representation allows us to compute the distance between a pair of topics by applying Dijkstra's algorithm to find the shortest path between them and measuring its length. The idea is that two topics are similar if they reside very close to each other within the CSO graph.

In practice, given a pair of topics we compute the length of their shortest path and we populate a distance matrix \(D_{graph}\). The distance falls in the range [1, n], where n is the diameter of the CSO graph (15 in the current version). The similarity matrix \(S_{graph}\) is the complement to 1 of the normalised distance matrix: \(S_{graph} = 1 - norm(D_{graph})\).

For the embedding similarity, we use the word2vec model trained for the CSO Classifier and described in Sect. 4.2.1. The idea is that, since the word embedding model captures a semantic relation between words, it follows that similar topics have similar representations. In particular, for each pair of topics we compute the cosine similarity of their embeddings and we populate a similarity matrix \(S_{embeddings}\). For the multi-word topics that are not available in the model vocabulary, we create their embedding representation by averaging the embeddings of all their tokens.

We create the final similarity matrix by comparing \(S_{graph}\) and \(S_{embeddings}\) and taking the element-wise maxima. The resulting similarity matrix is a square (\(n \times n\)) matrix, where n is the number of topics. Since the similarity between a topic A and a topic B is equivalent to the similarity between topic B and topic A, the similarity matrix is symmetric. Moreover, similarity values are between 0 and 1.
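The sketch below outlines this construction. It assumes two helpers that are not defined here: `cso_graph`, a networkx Graph of CSO topics, and `topic_vector(t)`, the embedding lookup of Sect. 4.2.1; dividing distances by the graph diameter is one plausible reading of the normalisation described above.

```python
# Sketch of the similarity-matrix construction (graph + embedding similarity).
import networkx as nx
import numpy as np

def similarity_matrix(topics, cso_graph, topic_vector, diameter=15):
    n = len(topics)
    d_graph = np.full((n, n), float(diameter))        # unreachable pairs get the diameter
    for i, a in enumerate(topics):
        for j, b in enumerate(topics):
            if i < j and a in cso_graph and b in cso_graph and nx.has_path(cso_graph, a, b):
                d_graph[i, j] = d_graph[j, i] = nx.shortest_path_length(cso_graph, a, b)
    np.fill_diagonal(d_graph, 0)
    s_graph = 1 - d_graph / diameter                  # complement of the normalised distance

    vecs = np.stack([topic_vector(t) for t in topics])
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    s_emb = unit @ unit.T                             # pairwise cosine similarity

    return np.maximum(s_graph, s_emb)                 # element-wise maxima
```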

Identification of outliers

In order to identify isolated topics, we binarise the similarity matrix, based on whether the similarity between a pair of topics is higher or lower than a threshold. To this end, we devised a dynamic threshold that sets to 1 only the top k similarity values, where k is a multiple of the number of topics. Since the similarity matrix is symmetric, to discard redundant similarities we identify the top k values by considering only the upper triangle and excluding the 1s on the main diagonal. During the evaluation, and specifically in Sect. 5.5, we observe that the optimal multiplier is 1; in other words, we select the same number of top similarities as there are topics.
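A minimal numpy sketch of this dynamic binarisation is given below; the `multiplier` parameter corresponds to the multiple of the number of topics discussed above, with 1 being the value selected in Sect. 5.5.

```python
# Sketch of the dynamic binarisation of the similarity matrix.
import numpy as np

def binarise(sim, multiplier=1):
    n = sim.shape[0]
    iu = np.triu_indices(n, k=1)                  # upper triangle, diagonal excluded
    top = min(multiplier * n, iu[0].size)
    if top == 0:
        return np.zeros_like(sim, dtype=int)
    threshold = np.sort(sim[iu])[::-1][top - 1]   # the (multiplier * n)-th largest value
    binary = (sim >= threshold).astype(int)
    np.fill_diagonal(binary, 0)
    return binary
```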

We then identify all topics that have no relationship with any other in the binarised matrix and exclude from this set: (i) the syntactic topics that have more than two grams (multigrams), (ii) the super topics of the retained group, and (iii) the topics that have high string similarity with the retained group. The remaining topics are considered outliers and discarded from the ones returned by the CSO Classifier.

We exclude from this process the topics returned by the syntactic module with more than two tokens because they are quite specific, and the fact that they have been syntactically matched means they are central to the classified document. Similarly, we preserve super topics because they are entailed by at least one other topic in the set. For a similar reason, we also retain topics with a high string similarity to at least one other topic. We tested different string similarity measures, such as cosine, Jaro-Winkler, Levenshtein, and the normalised Longest Common Subsequence. The method that produced the best results is the normalised Longest Common Subsequence with a threshold of 0.5.
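The following sketch shows one way to compute the normalised Longest Common Subsequence used by this last heuristic; normalising by the length of the longer label is one plausible choice.

```python
# Sketch of the normalised Longest Common Subsequence heuristic.
def lcs_length(a, b):
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if ca == cb else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def normalised_lcs(a, b):
    return lcs_length(a, b) / max(len(a), len(b))

def keep_by_string_similarity(candidate, kept_topics, threshold=0.5):
    """A candidate outlier is retained if it is close enough to any kept topic."""
    return any(normalised_lcs(candidate, t) >= threshold for t in kept_topics)

print(normalised_lcs("rdf graph", "rdf"))                       # ~0.33: below 0.5
print(normalised_lcs("query evaluation", "query processing"))   # ~0.56: above 0.5
```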

4.3.2 Semantic enhancement

The resulting set of topics is enriched by inferring all the direct super topics, according to the superTopicOf relationship in CSO [56]. For instance, when the classifier extracts the topic “machine learning”, it will infer also “artificial intelligence”. By default, the classifier enriches the set of topics by adding only the direct super topics. However, to provide a broader set of topics fitting the research paper, it is also possible to infer the list of all their super topics up to the root, i.e. Computer Science.

In Table 4, we report a brief summary of the topics obtained at the different stages of this post-processing module. The union set includes the results from both the syntactic and semantic modules. The outlier topics are the ones excluded by the newly developed component, and the enhanced topics are the super topics. Among the latter, we can see several other topics that are pertinent to the paper in Table 1, such as: “image analysis”, “image enhancement”, and “information retrieval systems”.

Table 4 Topics obtained from the enhancement process when processing the running paper

5 Evaluation

To measure the performance of the approach introduced in this manuscript, we evaluated 10 versions of the CSO Classifier and 13 alternative approaches on the task of classifying papers drawn from a manually generated gold standard. In particular, the versions of the CSO Classifier tested in this evaluation differ according to the mechanism used for outlier detection, which is the main novelty introduced by CSO Classifier 3.0.

In Sect. 5.1, we describe the creation of the gold standard. Then, in Sect. 5.2 we discuss the metrics and approaches for outlier detection. Finally, Sect. 5.3 presents the experimental set-up, Sect. 5.4 reports the results, and Sect. 5.5 focuses on the different techniques for identifying the outlier topics.

5.1 Creation of the gold standard

Due to the absence of corpora annotated with fine-grained topics and the fact that the Computer Science Ontology had only recently been released, we lacked a gold standard for evaluating the CSO Classifier. To this end, we developed a gold standard with the support of 21 domain experts, who classified 70 papers according to the CSO ontology. The objectives of this gold standard are twofold. Firstly, it allows us to evaluate the classifier. Secondly, it will be a valuable resource to facilitate further evaluations by other members of the research community.

5.1.1 Data preparation

From Microsoft Academic Graph, we selected the 70 most cited papers published within the decade 2007–2017 from the fields of Semantic Web (23 papers), Natural Language Processing (23 papers), and Data Mining (24 papers).

Next, we contacted 21 researchers in these fields, at various levels of seniority and without prior experience with CSO, and asked them to annotate ten papers each. We organised the data collection so that each paper was annotated by three experts, and we used the majority vote to address disagreements. We randomly assigned the papers to the experts, while minimising the number of shared papers between each pair of experts, in order to foster diversity.

5.1.2 Data collection

To support the domain experts during the annotation process, we developed a web application. Through this application, the annotators were able to read the title, abstract, keywords (when available), and the set of candidate topics of each paper. We asked the experts to thoroughly read all the information and assess the set of candidate topics by dragging and dropping them into two different baskets: relevant and not relevant. The application also allowed the experts to add further CSO topics that, according to their judgement, were missing from the candidate topics.

We created the set of candidate topics, which were displayed by the application, by aggregating the output of three classifiers: the syntactic module (Sect. 4.1), the semantic module (Sect. 4.2), and a third approach, which was introduced to mitigate the bias towards the first two methods.

The third approach first splits the input document into overlapping windows of size 10 (the same as the training window of the word2vec model), each of them overlapping by five words. As a second step, it creates the embedding representation of each window by computing the average of the embedding vectors of all its tokens and uses the word2vec model to identify the top 20 similar words with similarity above 0.6. Then, the algorithm returns the CSO concepts matching those words. Next, it assigns to each CSO concept a score based (i) on the number of times it is found in the list of similar words and (ii) on the embedding similarity, i.e. the cosine similarity between the vector representation of the window and the word embedding. Finally, it sorts the concepts in descending order and prunes the result set using the elbow method [57]. For each paper, the combination of these approaches produced a very inclusive set of 41.8 ± 17.5 candidate topics.
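The sketch below outlines this windowed candidate generation. As before, `model` and `cso_labels` are assumed from the previous sections; the scoring is simplified with respect to the two-factor score described above, and the elbow pruning is omitted.

```python
# Sketch of the window-based candidate generation (simplified scoring).
import numpy as np

def window_candidates(tokens, model, cso_labels, size=10, stride=5,
                      topn=20, threshold=0.6):
    scores = {}
    for start in range(0, max(len(tokens) - size + 1, 1), stride):
        window = [t for t in tokens[start:start + size] if t in model.wv]
        if not window:
            continue
        vec = np.mean([model.wv[t] for t in window], axis=0)
        for word, sim in model.wv.similar_by_vector(vec, topn=topn):
            if sim >= threshold and word in cso_labels:
                scores[word] = scores.get(word, 0.0) + sim   # frequency weighted by similarity
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```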

5.1.3 Gold standard

The data collection process produced 210 annotations—70 papers times 3 annotations per paper, or 21 experts times 10 annotations. On average, each paper was assigned 18 ± 9 topics.

We computed the Fleiss’ kappa to measure the agreement among the three annotators on each paper. We obtained an average of 0.451 ± 0.177, indicating a moderate inter-rater agreement, according to Landis and Koch [32].

We created the gold standard using the majority rule approach. Specifically, if a topic was considered relevant by at least two annotators, it was added to the gold standard. As a result, on average each paper is associated with 14.4 ± 7.0 topics. In order to take the taxonomic relationships of CSO into account, the resulting set of topics was semantically enriched by including also their direct super-topics, as in [45].

5.2 Methods for outlier detection

As discussed in Sect. 4.3, the CSO Classifier 2.0 [54] sometimes returns topics that have a very weak connection with all the other topics identified in the document. These outliers are often false positives and tend to have a negative impact on the performance of the classifier.

We thus decided to perform extensive experiments for answering two research questions:

Q1. Can an outlier detection algorithm improve the performance of an unsupervised approach for topic detection such as the CSO Classifier?

Q2. What kind of methodologies and similarity metrics would yield the best performance?

To reduce the search space, we considered a simple architecture that includes two steps: (i) computing the pairwise similarity of topics and (ii) detecting uncoupled topics. In order to evaluate different alternatives for these steps, we compared the performance of five similarity metrics and three approaches for detecting disconnected topics. These similarity measures were combined with additional heuristics, described in Sect. 4.3.1, resulting in eight different approaches. In Sect. 5.2.1 we describe the five similarity metrics, in Sect. 5.2.2 the three methods for finding uncoupled topics, and in Sect. 5.2.3 the eight approaches for detecting outliers in more detail.

5.2.1 Similarity metrics

The Graph similarity considers only the \(S_{graph}\) matrix described in Sect. 4.3.1, computing the distance between topics as the length of their shortest path in the CSO graph. The similarity matrix is obtained as the complement to 1 of the normalised distance matrix.

The Branch similarity is also based on CSO. It takes advantage of the hierarchical relationships (see Sect. 3), and for each pair of topics, the branch similarity is computed as the Jaccard similarity between their sets of super topics. The idea is that two topics have high similarity if they share almost the same broader topics.
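A minimal sketch of this metric is given below; `super_topics(t)` is an assumed helper returning the set of super topics of t in CSO.

```python
# Sketch of the branch similarity: Jaccard overlap of the super-topic sets.
def branch_similarity(topic_a, topic_b, super_topics):
    sup_a, sup_b = set(super_topics(topic_a)), set(super_topics(topic_b))
    if not sup_a and not sup_b:
        return 0.0
    return len(sup_a & sup_b) / len(sup_a | sup_b)
```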

The Graph-branch similarity extends the graph similarity. It computes the length of the shortest path between topics. Additionally, for a given pair of topics, if one is the super topic of the other, then their distance is set to 0 (minimum distance). This is because intuitively a paper about “machine learning” is also a paper in “artificial intelligence” and “computer science”. This similarity reflects the inclusion aspect of the CSO hierarchy.

The Word2vec similarity considers only the \(S_{embeddings}\) matrix shown in Sect. 4.3.1, computing the cosine similarity of topic embeddings.

The Graph-word2vec similarity measure produces the similarity matrix as shown in Sect. 4.3.1.

5.2.2 Approaches for finding isolated nodes

In this section, we summarise the three approaches used for identifying isolated topics. First, we used the approach described in Sect. 4.3.1, identified as MAT. Then, we also employed the clique percolation method and the hierarchical clustering, which will be described in the following subsections.

Clustering with clique percolation method

The clique percolation method (CPM) is an algorithm for finding clusters within networks, introduced by Palla et al. [48]. It takes as input a graph data structure and returns the list of topics organised into clusters. In this case, the similarity matrix produced by the different metrics described above can be seen as a weighted adjacency matrix representing a graph. However, since such weights fall within the range [0, 1], considering all these positive similarities as edges would lead to a highly interconnected graph, hindering the possibility of identifying outliers. To this end, we prune the graph retaining only the top k edges, based on the similarity, where k is a multiple of the number of vertices (topics). The optimal value of k will be the object of the evaluation in Sect. 5.5. In brief, the constructed graph has all topics as nodes and an edge between a pair of topics if their similarity is higher than the defined threshold. This approach is equivalent to binarising the similarity matrix as described in Sect. 4.3.1. The set of candidate topics is identified as the ones that form a cluster on their own.

As parameter of the algorithm, the dimension of cliques has been set to 3. The Python implementation of CPM is available on GitHub.
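The sketch below illustrates this strategy using networkx's CPM implementation (k_clique_communities) as a substitute for the GitHub implementation we actually used; `binary` is the binarised similarity matrix and `topics` the corresponding topic labels.

```python
# Sketch of outlier-candidate detection with the clique percolation method.
import networkx as nx
from networkx.algorithms.community import k_clique_communities

def cpm_outlier_candidates(binary, topics, k=3):
    graph = nx.Graph()
    graph.add_nodes_from(topics)
    n = len(topics)
    for i in range(n):
        for j in range(i + 1, n):
            if binary[i, j]:
                graph.add_edge(topics[i], topics[j])
    communities = list(k_clique_communities(graph, k))
    covered = set().union(*communities) if communities else set()
    # Topics not covered by any k-clique community form clusters on their own.
    return [t for t in topics if t not in covered]
```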

Hierarchical Clustering

Another strategy we used for identifying uncoupled topics is the unweighted pair group method with arithmetic mean (UPGMA), developed by Sokal et al. [60] and identified here as HIER.

This algorithm builds a dendrogram (i.e. rooted tree) that reflects the existing structure within the similarity matrix. This approach starts with each observation in its own cluster, then in a bottom-up fashion it merges pairs of clusters, until it reaches the root where all observations belong to one single cluster. At each iteration, the two most similar clusters are combined into a higher-level cluster. The similarity values of the newly formed cluster are given by averaging the similarity values of the two clusters.

To obtain a good clustering, the algorithm then cuts the dendrogram at a certain level. Since cluster analysis is essentially an exploratory approach, the interpretation of the hierarchical structure depends on the context. To this end, we performed several cuts across the different levels, which will be the object of the evaluation in Sect. 5.5.

From the returned clusters, we identify the candidate topics as the ones that form clusters on their own, i.e. the ones for which the algorithm did not find a suitable merge.
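A sketch of this strategy with SciPy's UPGMA implementation (method="average") is shown below; cutting the dendrogram into a given number of clusters is one possible way of choosing the cut level, and `sim` and `topics` are the similarity matrix and topic labels.

```python
# Sketch of the HIER strategy: UPGMA dendrogram, cut, singleton clusters flagged.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def hier_outlier_candidates(sim, topics, num_clusters):
    dist = 1 - sim
    np.fill_diagonal(dist, 0)                                    # required by squareform
    dendrogram = linkage(squareform(dist, checks=False), method="average")   # UPGMA
    labels = fcluster(dendrogram, t=num_clusters, criterion="maxclust")
    counts = np.bincount(labels)
    # Candidates are the topics whose cluster contains only themselves.
    return [t for t, lab in zip(topics, labels) if counts[lab] == 1]
```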

5.2.3 Outlier detection approaches

From the five similarity metrics, we devised eight approaches for identifying outliers. These approaches also combine the heuristics described in Sect. 4.3.1. Specifically, we identify with multigrams the heuristic that keeps the syntactic topics that have more than two grams, with superTopics the one that keeps the super topics of the retained group, and with stringSim the heuristic that keeps the topics that have high string similarity with the retained group.

For these approaches, we also tested the most suitable algorithm for finding the isolated topics. In addition, for each of these algorithms we tested different pruning thresholds. Specifically, for both the algorithm that removes isolated topics (described in Sect. 4.3.1) and the clique percolation method, we determined the optimal threshold for binarising the similarity matrix. This threshold identifies the top k similarity values, where k is a multiple of the number of topics. The lower k is, the fewer ones are present in the similarity matrix, resulting in a restrictive threshold. On the other hand, the higher k is, the more ones appear in the similarity matrix, resulting in a conservative approach. Indeed, when k is higher, fewer and fewer topics are excluded and the results of the classifier converge to the ones presented in [54].

For the hierarchical clustering, instead, we identified the optimal threshold for cutting the dendrogram and therefore identifying the clusters of cohesive groups of topics. This threshold identifies the level at which the cut is performed, starting from the top. If k is low, the cut is near the root node, where almost all topics belong to a few large clusters. If k is higher, the cut is performed closer to the leaf nodes and thus returns several small clusters.

For each of the eight outlier detection approaches, we performed a grid search to determine the best configuration in terms of the algorithm for finding uncoupled topics (CPM, MAT, or HIER) and the pruning threshold (1–5), by evaluating them as a component of the CSO Classifier against the gold standard described in Sect. 5.1. Table 5 summarises the settings which obtained the best f-measure for each approach. For instance, the approach GRA-S obtained the best results when using a threshold of 5 and the CPM algorithm. We report and discuss their performance in Sect. 5.4.

Table 5 The eight approaches for detecting outlier topics according to their similarity measures, optimal thresholds (THR), optimal algorithms (Algor.), and heuristics

5.3 Experimental set-up

We compared 23 alternative methods and evaluated their results against the gold standard. Table 6 describes their main features and reports their performance.

TF-IDF returns for each paper a ranked list of words according to their term frequency-inverse document frequency (TF-IDF) score. The inverse document frequency of the terms was computed on the dataset of 4.6M papers in Computer Science, introduced in Sect. 4.2.1. TF-IDF-M maps these terms to CSO by returning all the CSO topics having Levenshtein similarity higher than 0.8 with them.

The next six classifiers use latent Dirichlet allocation (LDA) [7] to produce a number of keywords extracted from the distribution of terms associated with the LDA topics. We trained three versions of LDA over the same corpus with different numbers of topics, i.e. 100, 500, and 1000 (LDA100, LDA500, and LDA1000, respectively). These three classifiers select all LDA topics with a probability of at least j and return all their words with a probability of at least k. LDA100-M, LDA500-M, and LDA1000-M work in the same way, but the resulting keywords are then mapped to CSO topics. In particular, they return all CSO topics that have a Levenshtein similarity higher than 0.8 with the resulting set of terms. We performed a grid search to find the best values of j and k on the gold standard and report here the best results of each classifier in terms of f-measure.

W2V-W is the classifier described in Sect. 5.1.2, designed to produce further candidate topics for the domain experts. It processes the input document with a sliding window and uses the word2vec model to identify concepts semantically similar to the embedding of the window.

STM is the classifier originally adopted by Smart Topic Miner [45], the application used by Springer Nature for classifying proceedings in the field of Computer Science. It works similarly to the syntactic module described in Sect. 4.1, but it detects only exact matches between the terms extracted from the text and the CSO topic labels. SYN is the first version of the CSO classifier, first introduced in [50], and it is equivalent to the syntactic module described in Sect. 4.1. SEM consists of the semantic module described in Sect. 4.2. INT is a hybrid version that returns the intersection of the topics produced by the syntactic (SYN) and semantic (SEM) modules. Finally, CC 2.0 is the implementation of the CSO Classifier v2.0 presented in [54]. As described in Sect. 4.3, this version produces the union of the topics returned by the two modules, but it does not yet include the outlier detection component.

The remaining nine classifiers provide alternative approaches to extending CC 2.0, to produce a new version of the CSO Classifier.

CC+RAND is a very simple baseline that randomly removes 10% of the topics in the set. The other eight classifiers use the eight methods reported in Table 5 and are labelled as CC+CODE, where CC stands for CSO Classifier and CODE refers to the outlier detection approach described in Sect. 5.2.3, e.g. GRA-S for graph similarity.

We assessed the performance of these 23 approaches by means of precision, recall, and f-measure. When classifying a given paper p, the value of precision pr(p) and recall re(p) are computed as shown in Eq. 2:

$$\begin{aligned} {\text {pr}}(p)=\frac{|cl(p) \cap gs(p)|}{|cl(p)|} \quad {\text {re}}(p)=\frac{|cl(p) \cap gs(p)|}{|gs(p)|} \end{aligned}$$
(2)

where cl(p) identifies the topics returned by the classifier, and gs(p) the gold standard obtained for that paper, including the super-areas of the gold standard used to enrich the user annotations as mentioned in Sect. 5.1.3. In order to obtain a better comparison between the different classifiers, we enhanced the results of each method with their direct super-concepts. The overall precision and recall for a given classifier are computed as the average of the values of precision and recall obtained over the papers. The f-measure (F1) is the harmonic mean of precision and recall.
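The sketch below implements the metrics of Eq. (2): per-paper precision and recall between the classifier output cl(p) and the gold standard gs(p), averaged over the papers, with the f-measure as the harmonic mean of the averaged values; the example data are illustrative.

```python
# Sketch of the evaluation metrics in Eq. (2).
def evaluate(classified, gold):
    """classified, gold: dicts mapping paper id -> set of CSO topics."""
    precisions, recalls = [], []
    for paper, cl in classified.items():
        gs = gold[paper]
        overlap = len(cl & gs)
        precisions.append(overlap / len(cl) if cl else 0.0)
        recalls.append(overlap / len(gs) if gs else 0.0)
    pr = sum(precisions) / len(precisions)
    re = sum(recalls) / len(recalls)
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, f1

pr, re, f1 = evaluate(
    {"p1": {"semantic web", "linked data", "ontology"}},
    {"p1": {"semantic web", "ontology", "rdf", "sparql"}},
)
print(round(pr, 2), round(re, 2), round(f1, 2))   # 0.67 0.5 0.57
```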

5.4 Results

Table 6 reports the values of precision, recall, and f-measure of the different classifiers. The upper panel shows the results of the 14 approaches discussed in [54], while the lower panel summarises the results of the nine new versions of the CSO Classifier.

Table 6 Values of precision, recall, and f-measure for the different classifiers. In bold are the best results for the two sections

The approaches based on LDA and TF-IDF performed poorly and did not exceed an f-measure of 30.1%. Arguably, we could raise the precision by increasing the Levenshtein similarity threshold for matching terms with CSO topics; however, this drastically reduces the recall, making these approaches mostly unfit for this task. An analysis of the LDA topics showed that they tend to be broad and noisy. Indeed, they cluster distinct CSO topics (e.g. “databases” and “search engines”) together in the same LDA topic. In a nutshell, LDA performs quite well in identifying the broader topics characterising a large collection of documents, but, as discussed in Sect. 2.1, it is typically less suitable for inferring more specific research topics, which may be associated with a low number of publications (50–200). W2V-W also performed poorly in terms of both precision and recall (41.2% and 16.7%, respectively).

STM and SYN yielded a very good precision of, respectively, 80.8% and 78.3%. Indeed, these methods are good at finding topics that are explicitly mentioned in the text, which tend to be very relevant. However, their low recall (58.2% and 63.8%, respectively) is due to their failure to detect more subtle topics that are only implied. The difference in performance between these two classifiers is due to the method used to map n-grams to CSO topics: STM identifies only exact matches, while SYN also finds partial matches, thus increasing recall at the expense of precision.

Compared with SYN, the semantic module (SEM) lost some precision but gained in recall and f-measure. This suggests that it is able to identify further topics that are not explicitly available in the paper, but clearly this may also produce some more false positives. INT yielded a higher precision (79.3%) compared to the syntactic and the semantic modules (78.3% and 70.8%), but it did not perform well in terms of recall, which dropped from 63.8% and 72.2% to 59.1%.

Finally, the CSO Classifier v2.0 (CC 2.0), presented in [54], outperformed all previous methods in terms of both recall and f-measure, respectively, 75.3% and 74.1%.

The following nine classifiers represent the contribution introduced with this manuscript. The first solution (CC+RAND) randomly removes 10% of the identified topics for each paper. Compared with CC 2.0 [54], it returns a slightly higher precision (73.1%), but the recall (67.6%) and f-measure (70.2%) are dramatically reduced.

The classifiers CC+GRA-S, CC+BRA-S, CC+GBR-S, and CC+W2V-S produce comparable results, with f-measure around 74.4%. In particular, their precision values, ranging between 73.4% and 73.9%, are higher than that of CC 2.0, while their recall values range between 74.9% and 75.3%.

CC+GWV-S combines CC+GRA-S and CC+W2V-S and produces a higher precision (75.6%), but its outlier detection method is very aggressive, which impacts the recall (73.9%). Both CC+GWG-S and CC+GWP-S extend CC+GWV-S and, in comparison with the latter, improve the overall f-measure to 75.3%. These two versions of the CSO Classifier produce very high values of precision, respectively 77.7% and 77.5%; however, their recall values go as low as 73.1%.

Finally, CC+GWS-S outperforms the other classifiers, with an f-measure of 75.4% and more balanced values of precision (76.7%) and recall (74.0%) than CC+GWG-S and CC+GWP-S. For this reason, we selected this configuration as the new version of the CSO Classifier (v3.0).

In order to assess the difference between CSO Classifier 2.0 [54] (CC 2.0) and CSO Classifier 3.0 (CC+GWS-S), we used the Wilcoxon nonparametric test. When considering the full results reported in Table 6, the two approaches are statistically different in terms of precision (\(p<0.0001\)), but not of f-measure (\(p=0.1500\)). However, this is because the outlier detection component is triggered only in about half of the cases, i.e. when CSO Classifier 2.0 returns a topic that is semantically inconsistent with the others. When considering only the 34 articles (out of 70) in which the outlier detection component was triggered, the effect of the new approach becomes very evident: CC+GWS-S gains about 8% precision over CC 2.0 (77.0% vs 69.3%) while losing only about 4% recall (76.7% vs 80.7%). When running the Wilcoxon test on this subset, CSO Classifier 3.0 significantly outperforms CSO Classifier 2.0 in terms of both precision (\(p<0.0001\)) and f-measure (\(p<0.0001\)).
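For reference, the test is run on paired per-paper scores; the snippet below is a minimal sketch using SciPy, with placeholder values rather than the actual evaluation data.

```python
# Minimal sketch of the paired Wilcoxon test on per-paper precision scores.
# The arrays below are illustrative placeholders, not the real evaluation data.
from scipy.stats import wilcoxon

precision_cc2 = [0.70, 0.65, 0.72, 0.68, 0.71]  # CC 2.0, one value per paper
precision_cc3 = [0.78, 0.74, 0.75, 0.77, 0.76]  # CC+GWS-S, one value per paper

stat, p_value = wilcoxon(precision_cc2, precision_cc3)
print(f"Wilcoxon statistic = {stat}, p = {p_value:.4f}")
```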

5.5 Evaluating outlier detection techniques

In this section, we focus on the different techniques developed for identifying topic outliers. We can see them as a selection operation that, given the set of topics returned by the CSO Classifier, needs to identify and exclude the false positives, thereby improving the precision of the classifier. We can then compute the precision, recall, and f-measure of this selection operation to study their performance. Naturally, we cannot assume that all the false positives are outlier topics; therefore, the recall is capped by the percentage of outliers within the set of false positives.

In this context, the true positives (TP) are the topics that were selected for exclusion and are not present in the gold standard (i.e. correctly discarded), the false positives (FP) are the topics that were selected for exclusion but are in the gold standard (i.e. wrongly discarded), and the false negatives (FN) are the topics that were not selected but are not in the gold standard (i.e. they should have been discarded).
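A minimal sketch of how these metrics can be computed for the selection step is shown below; the function and variable names are illustrative and not part of the released code.

```python
# Minimal sketch: precision, recall, and f-measure of the outlier selection step.
def selection_metrics(selected, gold_standard, returned_topics):
    """selected: topics the outlier detector excludes;
    gold_standard: correct topics for the paper;
    returned_topics: all topics returned by the classifier."""
    wrong_topics = returned_topics - gold_standard   # topics that ought to be excluded
    tp = len(selected & wrong_topics)                # excluded and indeed wrong
    fp = len(selected & gold_standard)               # excluded but actually correct
    fn = len(wrong_topics - selected)                # wrong but kept
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```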

Table 7 reports the best values of precision, recall, and f-measure for all the similarity measures, alongside the f-measure obtained by the CSO Classifier when the corresponding similarity measure is applied in the outlier detection component (last column).

Table 7 Values of precision, recall, and f-measure for the outlier detection techniques. In bold are the best results. In the last column, we also report the f-measure of the CSO Classifier when the corresponding technique is applied

The table shows that GRA-S and GBR-S, which adopt, respectively, the graph similarity and the hybrid graph and branch similarity, obtain a very high precision (88.9%). However, these approaches are overly conservative, as they tend to identify only a few outliers.

The similarity measure that performed best in terms of f-measure (38%) is GWG-S, with a precision of 70.6% and a recall of 26%. However, for this task, precision is more important than recall. Among the approaches that obtained a reasonable recall (\(>3.5\%\)), GWS-S is the clear winner in terms of precision (75.0%), significantly outperforming all of them (\(p<0.0001\) according to the Wilcoxon test). For this reason, we adopted GWS-S in the CSO Classifier 3.0.

The fact that the best approaches for outlier detection obtained a recall of around 20–25% suggests that roughly the same percentage of the false positives produced by the CSO Classifier can be ascribed to outliers. However, further experiments are needed to reach a definitive conclusion.

Answering the questions posed in Sect. 5.2, these results provide evidence that an outlier detection algorithm can indeed improve the performance of an unsupervised approach like the CSO Classifier.

Furthermore, the outlier detection approaches that take advantage of both the topological structure of CSO and the word embedding model appear to perform best.

6 Improving classification scalability

A recurring problem reported by early adopters of the CSO Classifier was its scalability. For instance, the version of the CSO Classifier presented in the previous section, CC+GWS-S, requires about 2.6 seconds, on average, to process one paperFootnote 21. If we wish to integrate this classifier in the Smart Topic Miner [51], an application that classifies conference proceedings, which may contain hundreds of papers, this implies a processing time of about four and a half minutes for each batch of 100 papers. In addition, we estimated that classifying all Computer Science papers in Microsoft Academic Graph would require around 2 months, assuming such a computation is carried out by ten parallel processes. By inspecting the code and timing the several components of the classifier, we identified the major bottlenecks. In particular, we found two components that required an excessive amount of time. The first concerns the code in the syntactic module, which compares the n-grams against the list of topics. The second concerns the code in the semantic module, which retrieves the words most similar to a given n-gram from the word2vec model.

To address the first problem, we assumed that the similarity between an n-gram and a research topic can be high only if they share the first four characters. To this end, we developed a topic-stems dictionary in which CSO topics are grouped together if they share the first four characters. In this way, instead of comparing each n-gram against all topics in the ontology, the syntactic module identifies the group of topics sharing the n-gram's stem (its first four characters) and compares the n-gram only against that limited set. For instance, an n-gram starting with “digi” will be compared against only 70 topics, such as digital libraries, digital signal processing, digital-to-analog converters, and digital rights management, instead of the whole set of 14K topics.
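A minimal sketch of such a stem index is given below; the function names and the toy topic list are illustrative, not taken from the released code.

```python
# Minimal sketch of a topic-stems dictionary: topics grouped by their first four characters.
from collections import defaultdict

def build_stem_index(topics, stem_len=4):
    index = defaultdict(list)
    for topic in topics:
        index[topic[:stem_len].lower()].append(topic)
    return index

def candidate_topics(ngram, index, stem_len=4):
    # Only topics sharing the n-gram's stem are considered for matching
    return index.get(ngram[:stem_len].lower(), [])

topics = ["digital libraries", "digital signal processing", "data mining"]
index = build_stem_index(topics)
print(candidate_topics("digital rights management", index))
# ['digital libraries', 'digital signal processing']
```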

To mitigate the second bottleneck, we created a cached model. Specifically, this model connects the words of the word2vec vocabulary directly to the topics in the Computer Science Ontology. To create this cache, we iterated over the model vocabulary and, for each word, extracted its top ten most similar words (considering only terms with a cosine similarity higher than 0.7). Then, for each similar word, we computed the Levenshtein similarity against all CSO topics. All topics matching with a similarity greater than or equal to 0.94 are linked to the selected vocabulary word.
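The procedure can be sketched as follows, assuming a gensim KeyedVectors model and the python-Levenshtein package; the dictionary layout mirrors the fields shown in Fig. 4, but the code is illustrative rather than the released implementation.

```python
# Minimal sketch of building the cached word2vec model.
import Levenshtein
from gensim.models import KeyedVectors

def build_cache(model: KeyedVectors, cso_topics, top_n=10, min_cos=0.7, min_lev=0.94):
    cache = {}
    for word in model.key_to_index:                       # iterate over the vocabulary
        entries = []
        for wet, sim_w in model.most_similar(word, topn=top_n):
            if sim_w <= min_cos:                          # keep only cosine similarity > 0.7
                continue
            for topic in cso_topics:
                sim_t = Levenshtein.ratio(wet, topic)     # string similarity against CSO topics
                if sim_t >= min_lev:                      # link topics with similarity >= 0.94
                    entries.append({"wet": wet, "sim_w": sim_w,
                                    "topic": topic, "sim_t": sim_t})
        if entries:
            cache[word] = entries
    return cache
```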

Figure 4 shows an excerpt of the resulting cache: for example, the vocabulary word “web_services” has a similarity of 0.82 (sim_w) with the word embedding token (wet) “service_oriented_architecture”, which in turn matches the topic service-oriented_architectures with a Levenshtein similarity of 0.949 (sim_t).

Fig. 4 An excerpt of the cached word2vec model

In order to assess the effect of these improvements, we performed a scalability analysis of the different versions of the classifier. Specifically, we ran CSO Classifier 2.0 (CC 2.0) and CSO Classifier 3.0 (CC 3.0), both with and without cache, on a sample of 1000 papers. In Table 8, we report the computational times expressed in terms of i) the total number of seconds for classifying the set of papers, ii) papers per minute, iii) papers per second, and iv) gained speed. The gained speed is relative to CC 3.0, which is the slowest classifier. To help the reader compare the trade-off between speed and loss of accuracy, we also report the corresponding values of precision, recall, and f-measure when evaluated against the gold standard.

Table 8 Computational time of the different versions of the classifier both with and without cache

We can observe that, when switching from CC 3.0 to CC 3.0 Cached, we lose 4% f-measure, but the classifier becomes more than 15x faster. When switching from CC 3.0 to CC 2.0 Cached, we lose less than 5% f-measure, but we gain a speed-up of up to 31x. Figure 5 better illustrates the trade-off between computational time and f-measure.

The CSO Classifier 3.0 offers two flags that allow users to choose the best trade-off between precision and speed: delete_outliers = True enables the outlier detection module (CC 3.0), whereas fast_classification = True enables the cache.
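A minimal usage sketch is shown below. The two flags are those described above; the remaining parameter names and the run method reflect our understanding of the publicly released Python package and should be checked against its documentation.

```python
# Minimal usage sketch of the CSO Classifier with the two flags discussed above.
from cso_classifier import CSOClassifier

cc = CSOClassifier(modules="both", enhancement="first",
                   delete_outliers=True,       # enables the outlier detection module (CC 3.0)
                   fast_classification=True)   # enables the cached word2vec model

paper = {
    "title": "Ontology-based classification of research papers",
    "abstract": "We present an unsupervised approach that maps papers to research topics...",
    "keywords": "ontology, classification, scholarly data"
}
result = cc.run(paper)
print(result)
```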

Fig. 5 The variation of f-measure based on computational time

7 Employing the CSO Classifier in other domains of Science

The architecture of the CSO Classifier makes it possible to apply it to other scientific domains, as long as a comprehensive ontology or taxonomy of research areas is available for the domain in question. In this section, we illustrate the methodology required to use the CSO Classifier in other scientific domains.

7.1 Ontology or taxonomy of research topics

The Computer Science Ontology is a crucial component of the CSO Classifier, as it provides the list of possible topics to associate with documents. However, since CSO only covers Computer Science, when applying the CSO Classifier to another scientific field, the first step is to replace CSO with an alternative taxonomy or ontology of research topics. In particular, a good replacement ought to provide the following relationships:

  • superTopicOf, which provides hierarchical information and shows how research topics are distributed, from the most generic to the most specific ones;

  • relatedEquivalent, which identifies synonyms of a research concept;

  • label, which provides the possible lexical materialisations of a research topic in a document.

An optional relation is primaryLabel, which defines which of the labels of a concept to use by default. Providing it is advisable, since it makes the resulting annotations easier to read and analyse. A minimal, illustrative sketch of such a structure is given below.
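The snippet below shows one hypothetical way of encoding these relationships for a new domain; it is purely illustrative and does not reflect the format actually expected by the released classifier.

```python
# Hypothetical, minimal encoding of a replacement taxonomy for a new domain.
# The topics and layout are illustrative only.
taxonomy = {
    "neoplasms": {
        "superTopicOf": ["lung_neoplasms", "breast_neoplasms"],
        "relatedEquivalent": ["tumors"],
        "label": ["neoplasms", "neoplasm", "tumors", "tumours"],
        "primaryLabel": "neoplasms",
    },
    "lung_neoplasms": {
        "superTopicOf": [],
        "relatedEquivalent": ["lung_cancer"],
        "label": ["lung neoplasms", "lung cancer", "pulmonary neoplasms"],
        "primaryLabel": "lung neoplasms",
    },
}
```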

7.2 Word embedding model

The advantage of using a word embedding model within the CSO Classifier is that it enables the classifier to capture the semantics of words. However, such semantics highly depends on the domain of application, i.e. Computer Science in our case, and on how words are used within the language of that field.

In order to apply the classifier to another domain, it is crucial to train the word2vec model on a corpus of research papers that fits the new domain, so as to capture the semantics of the words in that particular domain. Moreover, it is also important to re-evaluate the number n of most frequent words in the vocabulary, as described in Sect. 4.3, so that the semantic module can avoid inferring generic terms.

As shown in Sect. 4.2.1, the word2vec model was trained using titles and abstracts from Microsoft Academic Graph in the field of Computer Science. Since MAG also covers other areas of Science, it can be a suitable resource for training the new model. However, depending on the scientific field in question, other sources may be available for training the word2vec model. In particular, we are currently working on transferring the CSO Classifier to the field of Medicine, and we are training the model using PubMedFootnote 22.
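Training such a domain-specific model is straightforward with gensim; the sketch below uses illustrative hyperparameters and a toy corpus, not those of the released Computer Science model.

```python
# Minimal sketch of training a domain-specific word2vec model with gensim.
# Hyperparameters and the toy corpus are illustrative.
from gensim.models import Word2Vec

corpus = [
    ["deep", "learning", "for", "medical", "image", "segmentation"],
    ["gene", "expression", "analysis", "in", "lung", "cancer"],
]  # in practice: millions of tokenised titles and abstracts from the target domain

model = Word2Vec(sentences=corpus, vector_size=128, window=10,
                 min_count=1, workers=4, sg=1)
model.wv.save("domain_word2vec.kv")
```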

8 Applications that use the CSO Classifier

Since the introduction of the CSO Classifier in 2019, several research papers and surveys [63] have cited it, emphasising its value. In addition, a number of researchers have embedded the classifier in concrete applications.

Dörpinghaus et al. [22] developed a scientific knowledge graph that integrates bibliographic data and metadata from PubMed and DBLP. Since the PubMed data were already annotated, they used the CSO Classifier to extract topics from DBLP. The authors then use this knowledge graph to generate graph embeddings, applied to tasks such as topic detection, document clustering, and knowledge discovery.

Vergoulis et al. [67] used the CSO Classifier to classify 1.5M papers and exploited this topical representation to identify experts who share similar publishing habits. This exercise can support various real-life applications, such as reviewer recommendation, collaborator seeking, and new hire recommendation. The same research team applied the CSO Classifier to a corpus of over 3M papers, which was then used to develop ArtSim [14], an approach that estimates the popularity of papers in their cold-start period. This work is based on the intuition that similar papers are likely to follow a similar trajectory in terms of popularity. The authors calculate paper similarities using metapath analyses on scholarly knowledge graphs, which provide better results than citation-based measures.

Jose et al. [28] developed an ontology-based framework that integrates CSO and the CSO Classifier for retrieving specific journal articles from academic repositories. This framework also aims at dynamically expanding the Computer Science Ontology with new specialisations, by analysing recently published research papers. In this way, academic repositories reflect such recently introduced specialisations and hence support the retrieval of more accurate results.

In addition, the CSO Classifier is currently integrated within the Smart Topic Miner [54], an application that assists the Springer Nature editorial team in annotating the volumes of all books covering conference proceedings in Computer Science. STM uses the CSO Classifier to annotate each paper with the topics from CSO. Then it groups and ranks the topics according to the number of papers addressing them. A demo of STM is available at http://stm-demo.kmi.open.ac.uk/.

The CSO Classifier has also been used to generate the Academia/Industry DynAmics (AIDA) Knowledge Graph [2], which characterises 14M papers and 8M patents according to the research topics drawn from the Computer Science Ontology. We used this dataset to develop the AIDA Dashboard, a tool for exploring and making sense of scientific conferences [3].

9 Conclusions

In this paper, we introduced an improved version of the CSO Classifier, v3.0, which takes advantage of a new approach for detecting outliers, thus producing a more accurate classification of research documents. This solution was evaluated on a gold standard of 70 manually annotated documents and shown to outperform a number of alternative approaches in terms of f-measure. The code of the CSO Classifier and all the relevant material are freely available to the wider research community.

This work opens up several interesting research directions. We plan to test BERT [20], SciBERT [5], and similar modern embedding models to enhance the semantic and post-processing modules. We also intend to explore the application of our approach to other research fields. In particular, as already mentioned, we are working on adapting the CSO Classifier to the biomedical domain, with the aim of classifying PubMed documents according to concepts drawn from the Medical Subject Headings.