Our ontology development involves three aspects: first, the design of the ontology structure, consisting of a set of related topics and subtopics in the relevant subject areas; second, populating the ontology with keywords; and third, classifying documents based on the frequency of keywords.
The mapping process can be seen as a multi-class classification problem with a large number of classes, and is achieved by relying on source-specific vocabularies and mapping techniques that also exploit (expert) knowledge about the structure of individual data sources. This is an iterative process, based on co-dependencies between data, topics, and the representation system. Our initial ontology, derived from policy documents, was enriched and customised based on the outcome of the matching process and expert assessment of the results. Eventually, the original ontology classes may also be adapted based on their distinctiveness in terms of data items. Such a staged approach, distinguishing between core elements that are stabilised (the ontology classes) and elements that are dynamic and can be revised (the assignment of data items to classes), is desirable from a design and user perspective. The approach is therefore flexible, for example in responding to changes in policy interests (see "Discussion and conclusions" section), and to a certain extent scalable, since new data sources can be integrated within the process. All three steps require some human intervention to define prior assumptions and to evaluate outcomes, but they integrate automatic processing through advanced NLP techniques for the parts that involve handling large volumes of data. The classification process is fully automated once the ontology is complete, so re-annotation can easily take place if the ontology is changed or new data becomes available.
Ontology design
The ontology is defined according to the two strands of KET and SGC. This has implications because there is inherent overlap, not only between these two domains, but also within them. For example, within SGC, the topics of energy and climate change are closely intertwined, while much current research on transport is connected with sustainability. While KET topics focus primarily on technological research, there are overlaps with the "social" topics of SGCs, which often require technological solutions. A good structure is therefore hard to define, both because it is not clear what level of precision is practical and because such choices affect the implementation of the document-topic mapping. Moreover, as already discussed, the intrinsic vagueness of the notion of KETs and especially SGCs means that the topics are hard to define, and there is no gold standard against which to evaluate.
The structure must also be intuitive for human users to navigate, and this is perhaps the most challenging component. Ontologies must be dynamic: new terms and definitions continuously emerge from researchers and standardisation groups, while other terms may become irrelevant or be replaced by more popular synonyms. This means that existing ontologies need to be updated by reference to new documents.
We have attempted to mitigate these problems by consulting experts at every stage of the process, holding workshops with policy makers from a variety of fields. We take as a starting point some existing classifications, such as the mappings between IPC (International Patent Classification) codes and both KETs (Van der Velde 2012) and SGCs (Frietsch et al. 2016). For KETs, we also make use of the structure implemented in the nature.com ontologies portal (Hammond and Pasin 2015). Some of these topics are already connected to DBpedia and MeSH, which provide an additional source of information for keywords. Linking with the nature.com ontology helps with mapping scientific publications, and enables future extension of the ontology to other topics. We also collected relevant EU policy documents describing how the KETs and SGCs are structured (Maynard and Lepori 2017), and then followed an iterative process of annotating documents and looking for missing topics.
However, initial experimentation made it clear that relying heavily on pre-existing classifications was impractical, not only due to the huge number of topics, but more importantly because these classifications were very different (and no single classification covered all topics), so that the classes in the ontology were unevenly distributed and varied greatly in coverage. Furthermore, aligning elements from different origins led to a number of inconsistencies and duplications. We therefore manually refined this initial structure, removing the lower levels, reconfiguring branches, and adding additional topics where needed, in order to develop a more balanced classification system and to reflect expert assessment of the relevant topics.
The first version of the ontology contained 4 levels of categorisation and a total of 457 topics, which is impractical for user selection. The refinement process left us with a set of 150 topics in 3 levels: the first containing the distinction between KET and SGC, the second containing the 13 major topics belonging to them, and the third containing the major subtopics (e.g. "society" is divided into topics such as "housing", "education" and "employment"). This classification is distinctive enough to be interesting for policymakers without making the choices too specific. Greater specificity would affect not only quality, because it is hard to allocate documents to topics at very precise levels, but also the usability of the system.
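To make the shape of this structure concrete, the fragment below sketches it as a plain nested mapping. Apart from the topic names quoted above, the groupings (for example, placing the transport-related subtopics under a single "Transport" heading) are illustrative assumptions rather than the actual ontology content.

```python
# Illustrative fragment of the three-level topic structure (not the full ontology).
# Level 1 distinguishes KET from SGC; level 2 holds the major topics;
# level 3 holds their subtopics.
ONTOLOGY_FRAGMENT = {
    "KET": {
        "Advanced Manufacturing": ["Advanced Materials for Manufacturing"],
        "Micro- and Nano-Engineering": ["Microelectronics"],
    },
    "SGC": {
        "Society": ["Housing", "Education", "Employment"],
        "Transport": ["Intelligent Transport", "Maritime Transport", "Aeronautics"],
    },
}

def all_subtopics(fragment):
    """Flatten the hierarchy into (major topic, subtopic) pairs."""
    return [(major, sub)
            for strand in fragment.values()
            for major, subs in strand.items()
            for sub in subs]
```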
A key expert decision relates also to the conceptual overlap between classes. For example, the KET "Advanced Manufacturing" is deliberately designed to be cross-cutting across the other six KETs, so its direct subclasses include "Advanced Materials for Manufacturing" (which overlaps with the "Advanced Materials" KET). While the use of an ontology goes some way towards addressing this problem of overlap, the topic classification method essentially relies on matching each document with the best-fitting class. For this to work effectively, classes must be as distinct as possible. We aim for a middle ground whereby documents can be classified according to multiple topics, but the topics themselves are as distinct as possible.
Ontology population
The ontology needs to be populated with instances (keywords) from various data sources, which help to: (1) match user queries to topics; and (2) match documents from the various databases to these topics.
In the KET domain, topic definitions have until now been based mostly on keywords in papers; however, this is not sufficient, and these definitions also need to consider other kinds of documents and references. Furthermore, terms used by policymakers may not correspond to the keywords used in the data sources, and even between the different types of data source, terms vary widely.
SGCs offer a particular set of terminology-related problems, because keywords are often less technical and more ambiguous than those belonging to KET topics. For example, a related keyword for the topic of “education” could be “learning”, but this occurs frequently in relation to other topics; similarly, “skill” is indicative of the “employment” topic but occurs in many unrelated documents.
Concerning the mapping of data sources to the ontology, differences in vocabularies within academia, industry and society mean that the same concepts are typically expressed in different ways, especially in patents, which are extremely technical. Existing attempts at classification, as described earlier, have highlighted these issues. Our solution lies in the use of techniques from NLP and Machine Learning, where this kind of language variation is a common problem and techniques go far beyond the simple keyword matching approach used in other work.
Following a series of initial experiments, the solution adopted involves multiple layers of keyword extraction and a mixture of automated techniques interspersed with expert knowledge at key junctures. First, a small set of specific high-quality keywords is selected manually for each topic (typically around 5 per topic). These key terms are used, together with the preferred terms for each class (automatically derived from the class name or a linguistic variant), as seed terms for the later expansion stage. For example, "intelligent transport" is a key term for the topic "intelligent navigation". An additional source of keywords comes from the subject index of the EU-FP project database, which we have mapped to our ontology (footnote 3).
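As a rough illustration of how the seed terms can be organised per topic, consider the sketch below. Terms not quoted above (e.g. "route guidance" or "curriculum") are hypothetical placeholders, not the actual manually selected keywords.

```python
# Seed terms per topic: a preferred term derived from the class label plus a
# handful of manually selected key terms (values here are illustrative only).
SEED_TERMS = {
    "intelligent navigation": {
        "preferred": ["intelligent navigation"],
        "key": ["intelligent transport", "route guidance", "driver assistance"],
    },
    "education": {
        "preferred": ["education"],
        "key": ["curriculum", "teacher training", "vocational education"],
    },
}

def seeds_for(topic):
    """All seed terms for a topic, used as input to the later expansion stage."""
    entry = SEED_TERMS.get(topic, {})
    return entry.get("preferred", []) + entry.get("key", [])
```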
The next stage consists of automatically generating further terms from the ontology class names and associated information, such as class descriptions, using GATE’s Automatic Term Recognition tool TermRaider (Maynard et al. 2007; Zhang et al. 2018). These terms are known as generated terms, and are only used for the matching stage later, where they have a lower weighting, since we are less confident about their relevance or because they may be ambiguous. For example, “radar tracker” is a non-preferred term for the topic “intelligent transport”. This term might be relevant here only if found in conjunction with another relevant term for the topic.
Initial experiments with generating keywords automatically were largely unsuccessful for two reasons: first, this information was very inconsistent (some classes had detailed descriptions while others had none), and second, many important keywords were missing, even with the addition of information extracted from external knowledge sources such as Wikipedia. Furthermore, term extraction tools could not sufficiently distinguish high-quality (specific and distinct) keywords from more general ones, resulting in the same keywords being extracted for a large number of classes. Previous approaches to mapping documents to topics based on keywords, especially in the patent domain (e.g. Gok et al. 2015), have focused on a very specific domain, and thus the keywords have been manually selected, which is not feasible here. It is clear that some expert intervention is necessary in order to ensure high quality.
To resolve these issues, a stop list was first manually created in order to prevent generic keywords (e.g. "method") from being selected. Furthermore, at every stage, multi-word terms are preferred, as these are better at distinguishing between similar topics. Then, an automatic keyword enrichment method was used to boost the number of keywords, based on a large collection of training material (2.6 million documents containing a mixture of patent, project and publication abstracts as well as EU policy documents), from which we extracted new candidate terms. The enrichment process can be broken down into three main steps: corpus pre-processing, embeddings training, and embeddings-based term scoring (footnote 4). First, we apply linguistic pre-processing to our training corpus to find: (1) all occurrences of original ontology keywords in the corpus (both keywords and corpus text are lemmatised); and (2) single- and multi-word term candidates in the corpus, filtering out any Named Entities (e.g. names of people, places, etc.). Next, we merge the ontology matches and the term candidates, and create (potentially overlapping) keyword candidates. We then calculate the canonical lemmatised string for these candidates, and finally calculate term statistics for all term candidates (using tf, df and idf). This results in a set of 1.2 million keyword candidates in 180 million locations in the corpus.
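The term-statistics step at the end of pre-processing can be sketched as follows, assuming the corpus has already been reduced to lists of lemmatised candidate terms (the GATE-based matching and Named Entity filtering are not shown):

```python
import math
from collections import Counter

def term_statistics(docs):
    """Compute tf, df and idf for candidate terms.

    docs: list of documents, each given as a list of lemmatised candidate terms.
    Returns three dicts keyed by term: corpus frequency (tf), document
    frequency (df) and inverse document frequency (idf).
    """
    tf = Counter()
    df = Counter()
    for doc in docs:
        tf.update(doc)
        df.update(set(doc))  # count each term at most once per document
    n_docs = len(docs)
    idf = {term: math.log(n_docs / df[term]) for term in df}
    return tf, df, idf

# Toy corpus of two "abstracts" already reduced to candidate terms.
docs = [["intelligent transport", "radar tracker", "sensor"],
        ["radar tracker", "packaging", "microelectronics"]]
tf, df, idf = term_statistics(docs)  # e.g. df["radar tracker"] == 2
```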
Next, we train the embeddings (vector representations for single- and multi-word terms) from our keyword candidates and corpus. These embeddings were used to find the similarity between the seed terms and new terms, and to decide which new terms to keep, as well as which topic to map them to (footnote 5). Finally, we score the terms based on the embeddings, according to their "representativeness" of that class, and on prior probabilities generated using Pointwise Mutual Information (PMI) for term combinations, based on frequency of co-occurrence in the training data. These were used in the final classification stage, in order to ensure that more representative terms got a higher weighting, and to avoid outliers being ranked too highly: some keywords are only good indicators when they occur in the same document as another keyword. For example, the term "packaging" could refer to many topics, but when found with the term "microelectronics" it is a good indicator of various subtopics of Micro- and Nano-Engineering. For representativeness we use a novel method we call centrboth: for each class, we calculate the average embedding for the set of preferred terms, and another average embedding for the set of non-preferred terms related to the class; the final class embedding is the weighted average of both. Candidate terms are then scored with a method we term simonly: the 0/1-normalised cosine similarity between the embedding representing the ontology class (centrboth), calculated in the previous step, and the embedding representing the candidate term. In both averaging steps, we take the unweighted average over terms, since using the weighted (tf, idf) average did not work well in early experiments.
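A minimal sketch of this scoring, assuming term embeddings are already available as vectors, is given below. The 0.5 weighting between the preferred and non-preferred centroids and the document-level PMI helper are illustrative assumptions, not the parameters of the actual system.

```python
import numpy as np

def centrboth(preferred_vecs, nonpreferred_vecs, w_preferred=0.5):
    """Class embedding: weighted average of the centroid of the preferred-term
    embeddings and the centroid of the non-preferred-term embeddings."""
    c_pref = np.mean(preferred_vecs, axis=0)
    c_non = np.mean(nonpreferred_vecs, axis=0)
    return w_preferred * c_pref + (1.0 - w_preferred) * c_non

def simonly(class_vec, term_vec):
    """Cosine similarity between class and candidate term, rescaled from [-1, 1] to [0, 1]."""
    cos = np.dot(class_vec, term_vec) / (np.linalg.norm(class_vec) * np.linalg.norm(term_vec))
    return (cos + 1.0) / 2.0

def pmi(co_count, count_a, count_b, n_docs):
    """Pointwise mutual information for a keyword pair, from document co-occurrence counts."""
    if co_count == 0:
        return float("-inf")
    return np.log((co_count / n_docs) / ((count_a / n_docs) * (count_b / n_docs)))

# Toy example with random 50-dimensional embeddings.
rng = np.random.default_rng(0)
pref = rng.normal(size=(3, 50))      # embeddings of preferred terms for a class
nonpref = rng.normal(size=(7, 50))   # embeddings of generated/non-preferred terms
candidate = rng.normal(size=50)      # embedding of a candidate term
score = simonly(centrboth(pref, nonpref), candidate)
pair_score = pmi(co_count=120, count_a=800, count_b=950, n_docs=2_600_000)
```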
A major challenge with the keyword enrichment process is that there is no gold standard with which to compare the results, so manual judgements must be made about the best method of defining the similarity and cut-off thresholds. Starting from a set of 2122 ontology keyword/class pairs, 11,814 new keyword/class pairs were generated, before a second stopword list was applied, to produce a final set of 9076 pairs. This stopword list was developed based on manual judgement, and contains keyword-concept pairs which should not be matched (for example, “shipyard” is not a good keyword for the topic “aeronautics”, but it is for “maritime transport”).
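The pair-level stop list can be applied as a simple set-membership filter over the enriched keyword/class pairs; the blocked pair shown below is the one quoted above.

```python
# Keyword/class pairs that should never be matched, maintained manually.
BLOCKED_PAIRS = {
    ("shipyard", "aeronautics"),
}

def filter_pairs(candidate_pairs):
    """Drop enriched keyword/class pairs that appear on the pair-level stop list."""
    return [(kw, cls) for kw, cls in candidate_pairs
            if (kw, cls) not in BLOCKED_PAIRS]

pairs = [("shipyard", "aeronautics"), ("shipyard", "maritime transport")]
assert filter_pairs(pairs) == [("shipyard", "maritime transport")]
```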
The result of the ontology population stage is thus a set of keywords associated with each class, each of which has a score indicating the degree of its relevance (see Table 1). There is some overlap because occasionally the same keyword appears in a higher-level class and one (or more) of its subclasses. Preferred terms are automatically generated from the class label and are usually similar to, or the same as, the class name itself. Key terms are the additional terms manually generated by experts, or which come from other knowledge sources such as DBpedia. Both are considered to be high quality (though they are also manually checked), are used as input for the term enrichment process, and are given a higher weighting during the annotation process. Project terms come from existing project keyword classifications. Generated terms are those created by the term extraction tool, while enriched terms come from the automatic enrichment process. These may be of lower quality and get a lower weighting.
Table 1 Number of each type of keyword for the high-level topics
Document classification
Our data sources comprise three major datasets on S&T made available within the RISIS H2020 infrastructure project (footnote 6): the Web of Science version at CWTS, University of Leiden (about 30 m publications), the PATSTAT version at IFRIS in Paris (2.37 m patents), and the EUPRO database of European Framework Programme projects (67,475 projects), all from the period 2000–2017. The annotation links each data element (e.g. a project) with the relevant topic(s) in the ontology, so that indicators can be built around them. The amount of data that can be annotated is restricted only by time and processing power, and annotation time can be reduced by adding extra threads to the processing.
Due to availability and licensing restrictions, we only have access to titles, abstracts and some internal classifications (such as IPC classes for patents). This limits the data available for training, and might affect the matching of keywords: previous findings have shown that while the abstract has the best ratio of keywords, neglecting the rest of the paper might lead to the omission of important relevant terms (Shah et al. 2003). We also currently only consider documents in English, which limits the patent collection.
Our classifier takes documents as input and returns information about the class(es) to which each is linked, along with a score based on: (i) the weights of the keywords found in the document for that class (preferred terms have a higher score, as do terms ranked close in similarity to them); (ii) the combinations of keywords found in the document, using the PMI calculations from the ontology population stage; and (iii) subclass boosting, whereby keywords belonging to a more specific class in the ontology are preferred over more general ones.
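A simplified sketch of how these three components could be combined into a single score is shown below; the default weights and boost factors are placeholders for illustration, not the values used in the actual classifier.

```python
def score_document(doc_terms, class_keywords, keyword_weight, pmi_pairs,
                   class_depth, pmi_boost=1.5, subclass_boost=1.2):
    """Score one document against one ontology class.

    doc_terms:      set of lemmatised terms found in the document
    class_keywords: set of keywords attached to the class
    keyword_weight: dict keyword -> weight (preferred/key terms weighted higher)
    pmi_pairs:      set of keyword pairs with high PMI for this class
    class_depth:    depth of the class in the ontology (deeper = more specific)
    """
    matched = doc_terms & class_keywords
    # (i) sum of keyword weights for matched terms
    score = sum(keyword_weight.get(t, 0.5) for t in matched)
    # (ii) boost when a high-PMI keyword combination co-occurs in the document
    for a, b in pmi_pairs:
        if a in matched and b in matched:
            score *= pmi_boost
    # (iii) subclass boosting: prefer matches on more specific classes
    score *= subclass_boost ** class_depth
    return score

# Toy usage, reusing the "packaging" + "microelectronics" example from above.
score = score_document(
    doc_terms={"packaging", "microelectronics"},
    class_keywords={"packaging", "microelectronics", "wafer"},
    keyword_weight={"microelectronics": 1.0, "packaging": 0.5},
    pmi_pairs={("packaging", "microelectronics")},
    class_depth=2)
```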
The classification process assigns multiple possible topics to each document. Thresholds are then used to decide which of the topics are most relevant, since the ontology is used to build aggregated indicators at the regional and/or topical level. Setting these thresholds is a typical expert-based task that involves manual checking of classified documents and distribution analysis to find a reasonable balance between recall and precision. Different thresholding approaches were tested, resulting in a simple criterion that assigns documents to classes with a score above the median of the whole set of documents; this works reasonably well, although there is admittedly room for fine-tuning the scoring approach in the future.
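Expressed as a sketch, the adopted criterion keeps a document/class assignment only when its score exceeds the median score over the whole scored set (assuming, for illustration, that all scores fit in memory):

```python
from statistics import median

def assign_above_median(doc_scores):
    """Keep (document, class) assignments whose score is above the median score
    over the whole set of scored documents."""
    threshold = median(score for _doc, _cls, score in doc_scores)
    return [(doc, cls) for doc, cls, score in doc_scores if score > threshold]

scored = [("doc1", "education", 0.9), ("doc2", "education", 0.2),
          ("doc3", "housing", 0.6)]
assignments = assign_above_median(scored)  # median is 0.6, so only doc1 is kept
```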