
1 Introduction

The Entity Linking (EL) task identifies entity mentions in a text corpus and associates them with corresponding entities in a Knowledge Base (KB). In this way, we can leverage the information of publicly available KBs about real-world entities to achieve a better understanding of their semantics and also of natural language. For instance, in the text “in the world of pop music, there is Michael Jackson and there is everybody else” quoted from The New York Times, we can link the mention Michael Jackson with its corresponding entry in, e.g., the Wikidata KB [35] (wd:Q2831), or the DBpedia KB [16] (dbr:Michael_Jackson)Footnote 1 allowing us to leverage, thereafter, the information in the KB about this entity to support semantic search, relationship extraction, text enrichment, entity summarisation, or semantic annotation, amongst other applications.

One of the major driving forces for research on EL has been the development of a variety of ever-expanding KBs that describe a broad selection of notable entities covering various domains (e.g., Wikipedia, DBpedia, Freebase, YAGO, Wikidata). Hence, while traditional Named Entity Recognition (NER) tools focused on identifying mentions of entities of specific types in a text, EL further requires disambiguation of which entity in the KB is being spoken about; this remains a challenging problem. On the one hand, name variations – such as “Michael Joseph Jackson”, “Jackson”, “The King of Pop” – mean that the same KB entity may be referred to in a variety of ways by a given text. On the other hand, ambiguity – where the name “Michael Jackson” may refer to various (other) KB entities, such as a journalist (wd:Q167877), a football player (wd:Q6831558), an actor (wd:Q6831554), amongst others – means that an entity mention in a text may have several KB candidates associated with it.

Many research works have addressed these challenges of the EL task down through the years. Most of the early EL systems proposed in the literature were monolingual approaches focusing on texts written in one single language, in most cases English (e.g., [12, 18]). These approaches often use resources of a specific language, such as Part-Of-Speech taggers and WordNetFootnote 2, which prevent generalisation or adaptation to other languages. Furthermore, most of the labelled datasets available for training and evaluating EL approaches were English only (e.g., AIDA/CoNLL [12], DBpedia Spotlight Corpus [18], KORE 50 [13]).

However, as the EL area has matured, more and more works have begun to focus on languages other than English, including multilingual approaches that are either language agnostic [5, 6, 8, 22] – relying only on the language of labels available in the reference KB – or that can be configured for multiple languages [21, 30]. Recognising this trend, a number of multilingual datasets for EL were released, such as for the 2013 TAC KBP challengeFootnote 3 and the 2015 SemEval Task 13 challengeFootnote 4. Although such resources are valuable for multilingual EL research – where in previous work [28] we presented an evaluation of EL systems comparing two languages from the SemEval dataset – they have their limitations, key amongst which are their limited availability (participants onlyFootnote 5), a narrow selection of languages, and differences in text and annotations across languages that make it difficult to compare performance across languages. More generally, the EL datasets available in multiple languages – and in languages other than English – greatly lag behind what is available for English.

Contributions: In this paper, we propose the VoxEL dataset: a manually-annotated gold standard for EL considering five European languages, namely German, English, Spanish, French and Italian. This dataset is based on an online source of multilingual news, from which we selected and annotated 15 corresponding news articles for these five languages (75 articles in total). Additionally, we created two versions of VoxEL: a strict version, where annotations follow a restricted definition of entity as a mention of a person, place or organisation (based on traditional MUC/NER definitions), and a relaxed version, where we consider a broader selection of mentions referring to entities described by Wikipedia. Based on the VoxEL dataset, using the GERBIL evaluation framework [34], we present results for various EL systems, allowing us to compare not only across systems, but also across languages. As an additional contribution, we compare the performance of EL systems configurable for a given language with the analogous results produced by applying state-of-the-art machine translation (Google Translate) to English and then applying EL configured for English. Our findings show that most systems perform best for English text. Furthermore, machine translation of input text to English achieves comparable – and often better – performance when compared with dedicated multilingual EL approaches.

2 Preliminaries

We first introduce some preliminaries relating to EL. Let E be a set of entity identifiers in a KB; these are typically IRIs, such as wd:Q2831 or dbr:Michael_Jackson. Given an input text, the EL process can be conceptualised in terms of two main phases. First, Entity Recognition (ER) establishes a set of entity mentions M, where each such mention is a sub-string referring to an entity, annotated with its start position in the input text, e.g., (37, “Michael Jackson”). Second, for each mention \(m\in M\) recognised by the first phase, Entity Disambiguation (ED) attempts to establish a link between m and the corresponding identifier \(e\in E\) for the KB entity to which it refers. The second disambiguation phase can be further broken down into a number of (typical) sub-tasks, described next:

  • Candidate entity generation: For each mention \(m\in M\), this stage selects a subset of the most probable KB entities \(E_m \subseteq E\) to which it may refer. There are two high-level approaches by which candidate entities are often generated. The first is a dictionary-based approach, which involves applying keyword or string matching between the mention m and the label of entities from E. The second is an NER-based approach, where traditional NER tools are used to identify entity mentions (potentially) independently of the KB.

  • Candidate entity ranking: This stage is where the final disambiguation is made: the candidate entities \(E_m\) for each mention m are ranked according to some measure indicating their likelihood of being the reference for m. The measures used for ranking each entity \(e \in E_m\) may take into account features of the candidate entity e (e.g., centrality), features of the candidate link (m, e) (e.g., string similarity), features involving e and candidates for neighbouring mentions \(E_{m'}\) (e.g., graph distance in the KB), and so forth. Ranking may take the form of an explicit metric that potentially combines several measures, or may be implicit in the use of machine-learning methods that classify candidates, or that compute an optimal assignment of links.

  • Unlinkable mention prediction: The target KBs considered by EL are often, by their nature, incomplete. In some applications, it may thus be useful to extract entity mentions from the input text that do not (yet) have a corresponding entity in the KB. These are sometimes referred to as emerging entities, are typically produced by NER candidate generation (rather than a dictionary approach), and are assigned a label such as NIL (Not In Lexicon).

It is important to note that while the above processes provide a functional overview of the operation of most EL systems, not all EL systems follow this linear sequence of steps. Most systems perform recognition first, and once the mentions are identified the disambiguation phase is initiated [18, 21]. However, other approaches may instead apply a unified process, building models that create feedback between the recognition and disambiguation steps [7]. In any case, the output of the EL process will be a set of links of the form (m, e), where the mention m in the text is linked to the entity e in the KB, optionally annotated with a confidence score – often called a support – for the link.
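To make these definitions concrete, the following minimal sketch models a linear EL pipeline in Python, using a toy dictionary-based recogniser over a hand-written label map; all names (KB_LABELS, Mention, EntityLink, etc.) are illustrative assumptions rather than the interface of any system discussed in this paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Toy "KB": entity IRI -> label; in practice E would come from Wikidata/DBpedia.
KB_LABELS = {
    "wd:Q2831":    "Michael Jackson",   # the singer
    "wd:Q167877":  "Michael Jackson",   # the journalist
    "wd:Q6831558": "Michael Jackson",   # the football player
}

@dataclass
class Mention:
    start: int     # character offset of the mention in the input text
    surface: str   # the sub-string referring to an entity

@dataclass
class EntityLink:
    mention: Mention
    entity: Optional[str]  # KB identifier, or None for a NIL/emerging entity
    support: float         # confidence score ("support") for the link

def recognise(text: str) -> List[Mention]:
    """ER phase: dictionary-based recognition by exact label matching."""
    return [Mention(text.find(label), label)
            for label in set(KB_LABELS.values()) if label in text]

def disambiguate(m: Mention) -> List[Tuple[str, float]]:
    """ED phase: generate candidates E_m and rank them (here: a uniform score)."""
    candidates = [e for e, label in KB_LABELS.items() if label == m.surface]
    return [(e, 1.0 / len(candidates)) for e in candidates]

def entity_linking(text: str) -> List[EntityLink]:
    links = []
    for m in recognise(text):
        ranked = disambiguate(m)
        if ranked:
            entity, score = max(ranked, key=lambda pair: pair[1])
            links.append(EntityLink(m, entity, score))
        else:
            links.append(EntityLink(m, None, 0.0))  # unlinkable mention (NIL)
    return links

print(entity_linking("in the world of pop music, there is Michael Jackson ..."))
```

A real system would, of course, replace the uniform score with measures such as entity centrality, string similarity for the candidate link (m, e), or coherence with the candidates of neighbouring mentions, as described above.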

3 Related Work

We now cover related works in the context of multilingual EL, first discussing approaches and systems, thereafter discussing available datasets.

3.1 Multilingual EL Systems

In theory, any EL system can be applied to any language; as demonstrated in our previous work [28], even a system supporting only English may still be able to correctly recognise and link the name of a person such as Michael Jackson in the text of another language, assuming the alphabet remains the same. Hence, the notion of a multilingual EL system can become blurred. For example, language-agnostic systems – systems that require no linguistic components or resources specific to a language – can become multilingual simply by virtue of having a reference KB with labels in a different – or multiple different – language(s).

Here we thus focus on EL systems that have published evaluation results over texts from multiple languagesFootnote 6, thus demonstrating proven multilingual capabilities. We summarise such systems in Table 1, where we provide details on the year of the main publication and the languages evaluated, as well as denoting whether or not entity recognition is supportedFootnote 7, and whether or not a demo, source code or API is currently available.Footnote 8 As expected, a high-level inspection of the table shows that English is the most commonly evaluated (and, we surmise, best supported) language, followed by European languages such as German, Spanish, French, Dutch and Italian. We also highlight that most of the multilingual EL approaches included in the table have emerged since 2010.

Table 1. Overview of multilingual EL approaches; the italicised approaches will be incorporated as part of our experiments.

We will later conduct experiments using the GERBIL evaluation framework [34], which allows for invoking and integrating the results of a variety of public APIs for EL, generating results according to standard metrics in a consistent manner. Hence, in our later experiments, we shall only consider those systems with a working REST API made available by the authors of the system. In addition, we will manually label our VoxEL dataset according to Wikipedia, with which other important KBs such as DBpedia, YAGO, Freebase, Wikidata, etc., can be linked; hence we only include systems that target such a KB linked with Wikipedia. Note that GERBIL automatically takes care of mapping coreferent identifiers across KBs (and even across languages in cases such as DBpedia, with different KB identifiers for different languages and cross-language links).

With these criteria in mind, we experiment with the following systems:

  • TagME (2010) uses analyses of anchor texts in Wikipedia pages to perform EL [8]. The ranking stage is based primarily on two measures: commonness, which describes how often an anchor text is associated with a particular Wikipedia entity; and relatedness, which is a co-citation measure indicating how frequently candidate entities for different mentions are linked from the same Wikipedia article. TagME is language agnostic: it can take advantage of the Wikipedia Search API to apply the same conceptual process over different language versions of Wikipedia to support multilingual EL.

  • THD (2012) is based on three measures [6]: most frequent senses, which ranks candidates for a mention based on the Wikipedia Search API results for that mention; co-occurrence, which is a co-citation measure looking at how often candidate entities for different mentions are linked from the same paragraphs in Wikipedia; and explicit semantic analysis, which uses keyword similarity measures to relate mentions with a concept. These methods are language agnostic and applicable to different language versions of Wikipedia.

  • DBpedia Spotlight (2013) was first proposed to deal with English annotations [18], using keyword and string matching functions whose candidates are ranked by a probabilistic model based on a variant of a TF–IDF measure. DBpedia Spotlight is largely language agnostic; an extended version later proposed by Daiber et al. [5] leverages the multilingual information of the Wikipedia and DBpedia KBs to support multiple languages.

  • Babelfy (2014) performs EL with respect to a custom multilingual KB BabelNetFootnote 9 constructed from Wikipedia and WordNet, using machine translation to bridge the gaps in information available for different language versions of Wikipedia [21]. Recognition is based on POS tagging for different languages, selecting candidate entities by string matching. Ranking is reduced to finding the densest subgraph that relates neighbouring entities and mentions.

  • FREME (2016) delegates the recognition of entities to the Stanford-NER tool, which is trained over the anchor texts of Wikipedia corpora in different languages. Candidate entities are generated by keyword search over local indexes, which are then ranked based on the number of matching anchor texts in Wikipedia linking to the corresponding article of the candidate entity [30].

With respect to FOX, note that while it meets all of our criteria, at the time of writing, we did not succeed in getting the API to run over VoxEL without error; hence we do not include this system. We also omit AGDISTIS and MAG from our selection because they do not perform recognition, requiring a prior identification of the entities in the input text (finding a suitable NER tool/model is not straightforward for some of the languages in our dataset).
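As a concrete illustration of how such systems expose EL over the Web, the following sketch queries DBpedia Spotlight's public annotation endpoint directly (GERBIL wraps such calls behind its NIF-based abstraction). The endpoint layout, parameters and JSON field names reflect our understanding of Spotlight's public documentation at the time of writing and may change; treat them as assumptions.

```python
import requests

# Assumed public endpoint layout: /<lang>/annotate, with a 'text' parameter and
# JSON output requested via the Accept header.
SPOTLIGHT = "https://api.dbpedia-spotlight.org/{lang}/annotate"

def spotlight_annotate(text, lang="en", confidence=0.5):
    resp = requests.get(
        SPOTLIGHT.format(lang=lang),
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    # Each resource is expected to carry the linked DBpedia IRI, the surface
    # form and its character offset.
    return [(r["@surfaceForm"], int(r["@offset"]), r["@URI"])
            for r in resp.json().get("Resources", [])]

print(spotlight_annotate("There is Michael Jackson and there is everybody else."))
```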

3.2 Multilingual EL Datasets

In order to train and evaluate EL approaches, labelled datasets – annotated with the correct entity mentions and their respective KB links – are essential. In some cases these datasets are labelled manually, while in other cases labels can be derived from existing information, such as anchor texts. In Table 2 we survey the labelled datasets most frequently used by EL approaches (note that sentence counts were not available for some datasets).

Table 2. Survey of datasets for the EL task. For multilingual datasets, the quantities shown refer to the English data available. We denote the relaxed and strict versions of our dataset by VoxEL\(_R\) and VoxEL\(_S\) respectively. (Abbreviations: |D| number of documents, |S| number of sentences, |E| number of entities, Mn denotes that all entities were manually annotated.)

We can see that the majority of datasets provide text in one language only – predominantly English – with the exceptions being as follows:

  • SemEval 2015 Task 13: is built over documents from the biomedical, mathematical, computing and social domains, and is designed to support EL and WSD at the same time, containing annotations to Wikipedia, BabelNet and WordNet [20].

  • DBpedia Abstracts: provides a large-scale training and evaluation corpus based on the anchor texts extracted from the abstracts (first paragraphs) of Wikipedia pages in seven languages [2].Footnote 10

  • MEANTIME: consists of 120 news articles from WikiNewsFootnote 11 with manual annotations of entities, events, temporal information and semantic roles [19].Footnote 12

With respect to DBpedia Abstracts, while it offers a very large multilingual corpus, the texts across different languages vary, as do the documents available; while such a dataset could be used to compare different systems for the same language, it could not be used to compare the same systems across different languages. Furthermore, there are no guarantees for the completeness of the annotations since they are anchor texts/links extracted from Wikipedia; hence the dataset is best suited as a large collection of positive (training) examples, in a similar manner to how TagME [8] and FREME [30] use anchor texts.

Unlike DBpedia Abstracts, the SemEval and MEANTIME datasets contain analogous documents translated to different languages (also known as parallel corpora [20]). Our VoxEL dataset complements these previous resources but with some added benefits. Primarily, both the SemEval and MEANTIME datasets exhibit slight variations in the annotations across languages, leading to (e.g.) a different number of entity annotations in the text for different languages; for example SemEval [20] reports 1,261 annotations for English, 1,239 for Spanish, and 1,225 for Italian, while MEANTIME [19] reports 2,790 entity mentions for English, 2,729 for Dutch, 2,709 for Italian and 2,704 for Spanish. On the other hand, VoxEL has precisely the same annotations across languages aligned at the sentence level, and also features datasets labelled under two definitions of entity. More generally, we see VoxEL as complementing these other datasets.Footnote 13

4 The VoxEL Dataset

In this section, we describe the VoxEL Dataset that we propose as a gold standard for EL involving five languages: German, English, Spanish, French and Italian. VoxEL is based on 15 news articles sourced from the VoxEuropFootnote 14 web-site: a European newsletter with the same news articles professionally translated to different languages. This source of text thus obviates the need for translation of texts to different languages, and facilitates the consistent identification and annotation of mentions (and their Wikipedia links) across languages. With VoxEL, we thus provide a high-quality resource with which to evaluate the behaviour of EL systems across a variety of European languages.

While the VoxEurop newsletter is a valuable source of professionally translated text in several European languages, there are sometimes natural variations across languages that – although they preserve meaning – may change how the entities are mentioned. A common example is the use of pronouns rather than repeating a person’s name to make the text more readable in a given language. Such variations would then lead to different entity annotations across languages, hindering comparability. Hence, in order to achieve the same number of sentences and annotations for each news article (document), we applied small manual edits to homogenise the text (e.g., replacing a pronoun by a person’s name). On the other hand, sentences that introduce new entities in one particular language, or that deviate too significantly across all languages, were eliminated; fewer than 10% of the sentences from the original source were eliminated.

When labelling entities, we take into consideration the lack of consensus about what is an “entity” [14, 17, 29]: some works conservatively consider only mentions of entities referring to fixed types such as person, organisation and location as entities (similar to the traditional NER/TAC consensus on an entity), while other authors note that a much more diverse set of entities are available in Wikipedia and related KBs for linking, and thus consider any noun-phrase mentioning an entity in Wikipedia to be a valid target for linking [24]. Furthermore, there is a lack of consensus on how overlapping entities – like New York City Fire Department – should be treated [14, 17]; should New York City be annotated as a separate entity or should we only cover maximal entities? Rather than take a stance on such questions – which appear application dependent – we instead create two versions of the data: a strict version that considers only maximal entity mentions referring to persons, organisations and locations; and a relaxed version that considers any noun phrase mentioning a Wikipedia entity as a mention, including overlapping mentions where applicable. For example, in the sentence “The European Central Bank released new inflation figures today” the strict version would only include “European Central Bank”, while the relaxed version would also include “Central Bank” and “inflation”.
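To illustrate how the two guidelines differ at the data level, a sketch of the annotations for the example sentence above is given below; the character offsets are exact for this sentence, while the Wikipedia targets are the obvious (illustrative) choices rather than entries taken from VoxEL itself.

```python
sentence = "The European Central Bank released new inflation figures today"

# Strict guideline: only maximal mentions of persons, organisations and locations.
strict_annotations = [
    {"mention": "European Central Bank", "start": 4,
     "target": "https://en.wikipedia.org/wiki/European_Central_Bank"},
]

# Relaxed guideline: any noun phrase mentioning a Wikipedia entity,
# including overlapping mentions where applicable.
relaxed_annotations = strict_annotations + [
    {"mention": "Central Bank", "start": 13,
     "target": "https://en.wikipedia.org/wiki/Central_bank"},
    {"mention": "inflation", "start": 39,
     "target": "https://en.wikipedia.org/wiki/Inflation"},
]
```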

To create the annotation of mentions with corresponding KB identifiers, we implemented a Web toolFootnote 15 that allows a user to annotate a text, producing output in the NLP Interchange Format (NIF) [11], as well as offering visualisations of the annotations that facilitate, e.g., revision. For each language, we provide annotated links targeting the English Wikipedia entry, as well as that language’s version of Wikipedia (if different from English). Where there was no appropriate Wikipedia entry for a mention of a person, organisation or place, we annotated the mention with a NotInLexicon marker. These annotations were first created by the first author for English, and then revised by the other authors according to the two labelling guidelines (strict and relaxed). The first author then extended these annotations to the other languages using the sentence-level correspondence, thereafter verifying that each language has the same number of annotations and the same set of English Wikipedia identifiers for each sentence.
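For reference, the following minimal sketch uses rdflib to show what a single NIF annotation might look like when serialised; it follows our reading of the NIF Core ontology and is not necessarily byte-identical to the output of our annotation tool (the document IRI is a placeholder).

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import XSD

NIF = Namespace("http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#")
ITSRDF = Namespace("http://www.w3.org/2005/11/its/rdf#")

text = "There is Michael Jackson and there is everybody else."
base = "http://example.org/voxel/doc1#"  # placeholder document IRI

g = Graph()
ctx = URIRef(base + "char=0,{}".format(len(text)))
g.add((ctx, RDF.type, NIF.Context))
g.add((ctx, NIF.isString, Literal(text)))

# One mention: "Michael Jackson" at character offsets 9-24, linked to Wikipedia.
mention = URIRef(base + "char=9,24")
g.add((mention, RDF.type, NIF.Phrase))
g.add((mention, NIF.referenceContext, ctx))
g.add((mention, NIF.anchorOf, Literal("Michael Jackson")))
g.add((mention, NIF.beginIndex, Literal(9, datatype=XSD.nonNegativeInteger)))
g.add((mention, NIF.endIndex, Literal(24, datatype=XSD.nonNegativeInteger)))
g.add((mention, ITSRDF.taIdentRef,
       URIRef("https://en.wikipedia.org/wiki/Michael_Jackson")))

print(g.serialize(format="turtle"))
```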

In summary, VoxEL consists of 15 news articles (documents) from the multilingual newsletter VoxEurop, totalling 94 sentences; the central topic of these documents is politics, particularly at a European level. The text is annotated for each of the five languages and, per language, under both the strict and relaxed guidelines, giving a total of 150 annotated documents and 940 sentences. The same number of annotations is given for each language (including per sentence). For the strict version, each language has 204 annotated mentions, while for the relaxed version, each language has 674 annotated mentions. In the relaxed version, 6.2%, 10.8%, 20.3% and 62.7% of the entries correspond to persons, organisations, places and others respectively, while in the strict version the entities falling in the first three classes constitute 16.9%, 28.7% and 54.4% respectively (others are excluded by definition under the strict guidelines). Again, this homogeneity of text and annotations across languages was non-trivial to achieve, but facilitates comparison of evaluation results not only across systems, but also across languages.

5 Experiments

We now use our proposed VoxEL dataset to conduct experiments in order to explore the behaviour of state-of-the-art EL systems for multilingual settings. In particular, we are interested in the following questions:

  • RQ1: How does the performance of systems compare for multilingual EL?

  • RQ2: For which of the five languages are the best results achieved?

  • RQ3: How would a method based on machine translation to English compare with directly configuring the system for a particular language?

In order to address RQ1 and RQ2, we ran the multilingual EL systems Babelfy, DBpedia Spotlight, FREME, TagME and THD over both versions of VoxEL in all five languages. These experiments were conducted with the GERBIL [34] EL evaluation framework, which provides unified access to the public APIs of multiple EL tools, abstracting different input and output formats using the NIF vocabulary, translating identifiers across KBs, and allowing standard metrics to be applied to measure the performance of results with respect to a labelled dataset. GERBIL calls these systems via their REST APIs keeping the default values of all (non-language) parameters, except in the case of Babelfy, for which we analyse two configurations: one that applies a more liberal interpretation of entities to include conceptual entities (Babelfy\(_R\)), and another that applies a stricter definition of entities (Babelfy\(_S\)); the two configurations correspond loosely to the relaxed and strict versions of our dataset.

Table 3. GERBIL Evaluation of EL systems with Micro Recall (mR), Precision (mP) and F\(_1\) (mF). A value “–” indicates that the system does not support the corresponding language. The results in bold are the best for that metric, system and dataset variant comparing across the five languages (i.e., the best in each row, split by Relax/Strict).

The results of these experiments are shown in Table 3, where we present micro-measures for Precision (mP), Recall (mR) and \(F_1\) (mF), with all systems, for all languages, in both versions of the dataset.Footnote 16 From first impressions, we can observe that two systems – TagME and THD – cannot be configured for all languages, where we mark the corresponding results with “–”.

With respect to RQ1, for the Relaxed version, the highest \(F_1\) scores are obtained by Babelfy\(_R\) (0.662: ES) and DBpedia Spotlight (0.650: EN). On the other hand, the highest \(F_1\) scores for the Strict version are obtained by TagME (0.857: EN) and Babelfy\(_R\) (0.805: ES). In general, the \(F_1\) scores for the Strict version were higher than those for the Relaxed version: investigating further, the GERBIL framework only considers annotations to be false positives when a different annotation is given in the labelled dataset at an overlapping position; hence fewer labels in the Strict dataset imply fewer false positives overall, which seems to outweigh the effect of the extra true positives that the Relaxed version would generate. Comparing the best Strict/Relaxed results for each system, we can see that Babelfy\(_R\), DBpedia Spotlight and FREME have a smaller gap between the two versions, meaning that they tend to annotate a broader range of entities; on the other hand, Babelfy\(_S\) and THD are more restrictive in the entities they link.
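The following sketch captures our reading of this matching behaviour (a simplification for illustration, not GERBIL's actual implementation): a system annotation only counts as a false positive if it overlaps some gold annotation that carries a different identifier, so a smaller gold standard leaves fewer opportunities for false positives.

```python
def overlaps(a, b):
    """Two character spans (start, end) overlap if neither ends before the other starts."""
    return a[0] < b[1] and b[0] < a[1]

def is_false_positive(pred, gold_annotations):
    """pred and each gold annotation are dicts with a 'span' (start, end) and an 'iri'."""
    overlapping = [g for g in gold_annotations if overlaps(pred["span"], g["span"])]
    # No overlapping gold label: the prediction is ignored rather than penalised.
    return bool(overlapping) and all(g["iri"] != pred["iri"] for g in overlapping)

gold = [{"span": (4, 25), "iri": "dbr:European_Central_Bank"}]
print(is_false_positive({"span": (4, 25), "iri": "dbr:Central_bank"}, gold))  # True
print(is_false_positive({"span": (40, 49), "iri": "dbr:Inflation"}, gold))    # False
```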

With respect to RQ2, considering all systems, we can see a general trend that English had the best results overall, with the best mF for DBpedia Spotlight, FREME and TagME. For THD, German had higher precision but much lower recall; a similar result can be seen for FREME in Italian in the Relaxed version. On the other hand, Babelfy generally had best results in German and Spanish, where, in fact, it often had the lowest precision in English.

With respect to possible factors that explain such differences across languages, there are variations between languages that may make the EL task easier or harder depending on the features used; for example, systems that rely on capitalisation may perform differently for Spanish, which uses less capitalisation (e.g., “Jungla de cristal”, a Spanish movie title written in sentence case), and for German, where all nouns are capitalised. Furthermore, the quality of EL resources available for different languages – in terms of linguistic components, training sets, contextual corpora, KB meta-data, etc. – may also vary.

Regarding RQ3, we present another experiment to address the question of the efficacy of using machine translation. First we note that, although works in related areas – such as cross-lingual ontology matching [9] – have used machine translation to adapt to multilingual settings, to the best of our knowledge, no system listed in Table 1 applies machine translation over the input text (though systems such as Babelfy do use machine translations to enrich the lexical knowledge available in the KB). Hence we check to see whether translating a text to English using a state-of-the-art approach – Google TranslateFootnote 17 – and applying EL over the translated English text fares better than applying EL directly over the text in its original language; we choose a single target language for translation to avoid generating results for a quadratic pairing of languages, and we choose English since it was the only language supported by all systems in Table 3.

A complication for these translation experiments is that while VoxEL contains annotations for the texts in their original five languages, including English, it does not contain annotations for the texts translated to English. While we considered manually annotating such documents produced by Google Translate, we opted against it partly due to the amount of labour it would again involve, but more importantly because it would be specific to one translation service at one point in time: as these translation services improve, these labelled documents would quickly become obsolete. Instead, we apply evaluation on a per-sentence basis, where for each sentence of a text in a non-English language, we translate it and then compare the set of annotations produced against the set of manually-annotated labels from the original English documents; in other words, we check the annotations produced by sentence, rather than by their exact position. This is only possible because in the original VoxEL dataset, we defined a one-to-one correspondence between sentences across the five different languages.

Note that since GERBIL requires labels to have a corresponding position, we thus needed to run these experiments locally outside of the GERBIL framework. Hence, for a sentence s, let A denote the IRIs associated with manual labels for s in the original English text, and let B denote the IRIs annotated by the system for the corresponding sentence of the translated text; we denote true positives by \(A \cap B\), false positives by \(B - A\), and false negatives by \(A - B\).Footnote 18
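A minimal sketch of this sentence-level evaluation, aggregating micro scores over all sentences exactly as defined above (the function name and input format are ours):

```python
def micro_scores(sentence_pairs):
    """sentence_pairs: iterable of (A, B), where A is the set of gold IRIs for a
    sentence of the original English text and B is the set of IRIs annotated by
    the system for the corresponding (translated) sentence."""
    tp = fp = fn = 0
    for A, B in sentence_pairs:
        tp += len(A & B)  # true positives:  A intersection B
        fp += len(B - A)  # false positives: B minus A
        fn += len(A - B)  # false negatives: A minus B
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: one sentence with gold {wd:Q2831} and system output {wd:Q2831, wd:Q167877}.
print(micro_scores([({"wd:Q2831"}, {"wd:Q2831", "wd:Q167877"})]))  # (0.5, 1.0, 0.666...)
```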

In Table 4, we show the results of this second experiment, focusing this time on the Micro-F\(_1\) (mF) score obtained for each system over the five languages of VoxEL, again for the relaxed and strict versions. For each system, we consider three experiments: (1) the system is configured for the given language and run over text for the given language, (2) the system is configured for English and run over the text translated from the given language, (3) the system is configured for English and run over the text in the given language without translation. We use the third experiment to establish how the translation to English – rather than the system configuration to English – affects the results. First we note that without using positional information to check false positives (as per GERBIL), the results change from those presented in Table 3; more generally, the gap between the Relaxed and Strict version is reduced.

Table 4. Micro \(F_1\) scores for systems performing EL with respect to the VoxEL dataset. For each system and each non-English language, we show the results of three experiments: first, the system is configured for the same language as the input text; second, denoted (EN,EN\(_t\)), the system is configured for English and applied to text translated to English from the original language; third, the system is configured for English and run over the text in the original language without translation. Below the name of each system, we provide the relaxed and strict results for the original English text (EN,EN). Underlined results indicate the best of the three configurations for the given system, language and dataset variant (i.e., the best in each column of three values). The best result for each system across all variations (excluding English input) is bolded.

With respect to RQ3, in Table 4, for each system, language and dataset variant, we underline which of the three configurations performs best. For example, for DBpedia Spotlight, all values on the (EN,EN\(_t\)) row – which denotes applying DBpedia Spotlight configured for English over text translated to English – are underlined, meaning that for all languages, prior translation to English outperformed submitting the text in its original language to DBpedia Spotlight configured for that language.Footnote 19 In fact, for almost all systems, translating the input text to English generally outperforms using the available language configurations of the respective EL systems, with the exception of Babelfy, where the available multilingual settings generally outperform a prior translation to English (recall that in Table 3, Babelfy performed best for texts other than English). We further note that the translation results are generally competitive with those for the original English text – shown below the name of the system for the Relaxed and Strict datasets – even slightly outperforming those results in some cases. We also observe, from the generally poor results of the third configuration, that translation itself is important; in other words, one cannot simply apply an EL system configured for English over text in another language and expect good results.

To give a better impression of the results obtained from the second experiment, in Fig. 1, for the selected systems, we show the following aggregations: (1) Calibrated: the mean Micro-\(F_1\) score across the four non-English languages with the EL system configured for that language; (2) Translation (EN,EN\(_t\)): the mean Micro-\(F_1\) score across the four non-English languages with the text translated to English and the EL system configured for English; (3) English (EN,EN): the (single) Micro-\(F_1\) score for the original English text. From this figure, we can see that translation is comparable to native English EL, and that translation often considerably outperforms EL in the original language.

We highlight that using translation to English, the result will be an annotated text in English rather than the original language. However, given that translation is done per-sentence, the EL annotations for the translated English text could potentially be “mapped” back per sentence to the text in the original language; at the very least, the translated English annotations would be a useful reference.

Fig. 1. Summary of the Micro-\(F_1\) results over VoxEL Relaxed/Strict for the translation experiments, comparing mean values for setting the EL system to the language of the text (Calibrated), translating the text to English first (Translation), and the corresponding \(F_1\) score for EL over the original English text (English).

6 Conclusion

While Entity Linking has traditionally focused on processing texts in English, in recent years there has been a growing trend towards developing techniques and systems that can support multiple languages. To support such research, in this paper we have described a new labelled dataset for multilingual EL, which we call VoxEL. The dataset contains 15 news articles in 5 different languages with 2 different criteria for labelling, resulting in a corpus of 150 manually-annotated news articles. In the Strict version of the dataset, considering a core set of entity types, we derive 204 annotated mentions in each language, while in the Relaxed version, considering a broader range of entities described by Wikipedia, we derive 674 annotated mentions in each language. The VoxEL dataset is distinguished by having a one-to-one correspondence of sentences – and annotated entities per sentence – between languages. The dataset (in NIF) is available online under a CC-BY 4.0 licence: https://dx.doi.org/10.6084/m9.figshare.6539675.

We used the VoxEL dataset to conduct experiments comparing the performance of selected EL systems in a multilingual setting. We found that, in general, Babelfy and DBpedia Spotlight performed the most consistently across languages. We also found that, with the exception of Babelfy, EL systems performed best over English versions of the text. Next, we compared configuring the multilingual EL system for each non-English language versus applying a machine translation of the text to English and running the system in English; with the exception of Babelfy, we found that the machine translation approach outperformed configuring the system for a non-English language; even in the case of Babelfy, the translation sometimes performed better, while in other cases it remained competitive. This raises a key issue for research on multilingual EL: state-of-the-art machine translation is now reaching a point where we must ask whether it is worth building dedicated multilingual EL systems, or whether we should rather focus on EL for one language to which other languages can be machine translated.