1 Introduction

The huge volume of data exchanged every day via social media, text messaging and chat services has led to an increase in the number of works on natural language processing (NLP) during the last decades. Different NLP areas require extracting meaningful information automatically and quickly from text that is, in the majority of cases, unstructured. Some of the most important and studied areas of NLP are information extraction (IE), machine translation (MT), opinion mining (OM) and question-answering (QA). These areas share one important aspect: they all need to identify proper nouns and classify them into the appropriate type of named entity [63, 136]. Named entities can be the names of persons (such as David or Satoshi), locations (e.g. Tokyo or Canada), or organisations (e.g. Stanford University or Amazon). For example, let us consider this sentence: “Mr Brown is living in California, and he is working at Amazon”. A named entity recognition (NER) system should recognise “Brown” as the name of a person, “California” as the name of a location and “Amazon” as the name of an organisation. Based on this example, it seems that a simple dictionary of names combined with a set of regular expressions could solve the problem.

However, NER is not as simple as it appears. For instance, consider “Mr. Brown is living in California, and he is working at Amazon. He would like to buy a brown jacket. He also would like to visit the Amazon rainforest”. Here, both “Brown” and “Amazon” each appear in two different roles. “Brown” appears as the name of a person and as a colour (which is not a named entity). “Amazon” appears as the name of an organisation and as the name of a location. These ambiguities make the task harder than it looks: a simple system relying on dictionaries or regular expressions would fail to recognise the correct named entities.

Solving this problem involves another research area: named entity disambiguation (NED). “Mr. Brown” could be Chris Brown the singer or Joseph Brown, a Systems Development Engineer at Amazon. To answer this question, we need to link “Brown” to a knowledge base and select the best candidate based on the context in which this entity appears. In this case, the sub-sentence “is working at Amazon” is crucial for affirming that “Mr. Brown” is an engineer from Amazon and not a famous singer. Likewise, the surrounding entities make it easy to recognise “Amazon” as an organisation rather than a rainforest. The combined end-to-end process of finding the mention of the entity Mr. Brown in the text and disambiguating it to the correct entry in a knowledge base (here, Joseph Brown rather than Chris Brown) is known as entity linking (EL).

During the last decade, various works have addressed entity linking, recognition and disambiguation. Works focusing on disambiguation mainly rely on transformers and contextual embeddings, models that provide state-of-the-art results across many NLP tasks. Returning to the example above, distinguishing between the person name “Brown” and the colour “brown” requires, in the majority of cases, a separate context-dependent embedding for each occurrence of “brown”. Because of the challenges related to this disambiguation, the majority of works focus on the NER task, and only a few propose systems covering the whole entity linking pipeline.

One of the major challenges related to entity linking resides in the scarcity of datasets. Almost all datasets are constructed manually, which is effort- and time-consuming, leading to corpora of only thousands of items (documents, sentences, reviews, or comments). While automatic corpus construction is starting to be adopted by researchers, manual approaches remain more accurate and provide better results. However, those corpora are domain-centric and only useful for a single research purpose. Most of these resources cannot be adapted to real-life scenarios, as they perform poorly when applied to another domain. Also, the majority of resources focus on English, leading to a lack of research on other languages (which translates to less entity linking (EL) tooling for those languages).

The main goal of this survey is to highlight the most recent studies, directions, challenges and limitations that have been proposed for entity linking. For this purpose, this survey is organised as follows. We start with Sect. 2 presenting some generalities about entity linking, recognition and disambiguation. Then, we present in Sect. 3 the most recent previous surveys that we analysed. Section 4 presents the methodology that we followed to construct this survey and the research questions that we aim to answer. We divided the surveyed works into two categories: Section 5 focuses on research on the English language, and Sect. 6 presents multilingual research. In Sect. 7, we focus on the resources that have been made available. We synthesise the studied works and resources in Sect. 8. We compare our survey to others in Sect. 9. The paper ends with a discussion of open issues and perspectives for future works in Sect. 10.

2 Entity linking: background

Consider, as an example, the mention of “Ford”. This mention could be associated with Ford Motor Company (the American multinational automaker), with Henry Ford (its founder), or with the Ford Foundation. Only the context can indicate which entity in a knowledge base the mention (“Ford”) should be linked to.Footnote 1 From the example above, two tasks can be highlighted: (1) extracting the different mentions from a given text, and (2) disambiguation, i.e. linking each extracted mention to the right entity for the given context.

First, the mention “Ford” is extracted from a given text/document. Afterwards, this mention is associated with the motor company, the founder, or the foundation, depending on the context. We observe that, depending on the context, the mention “Ford” could be the name of a person, of an organisation, or of a location (if we referred to Ford IslandFootnote 2). The task of extracting “Ford” from the document and determining its category (name of a person, organisation, location, etc.) is known under several terms: most commonly, named entity recognition, named entity resolution, or named entity extraction. The task of linking the extracted mention “Ford” to its entity in a given knowledge base is known as named entity disambiguation. Both research areas are presented in detail below.
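To make these two steps concrete, here is a toy sketch (the candidate entities and their context words are invented for the example and are not taken from any real knowledge base): mentions are detected by dictionary lookup, and each mention is then linked to the candidate whose associated words overlap most with the document.

```python
# Toy end-to-end entity linking: mention detection by dictionary lookup,
# then disambiguation by context-word overlap. The "knowledge base" below
# is invented purely for illustration.
KB = {
    "Ford": {
        "Ford_Motor_Company": {"automaker", "car", "company", "vehicle"},
        "Henry_Ford": {"founder", "industrialist", "born", "died"},
        "Ford_Island": {"island", "hawaii", "pearl", "harbor"},
    }
}

def detect_mentions(text):
    """Step 1: naive mention detection by dictionary lookup."""
    return [m for m in KB if m in text]

def disambiguate(mention, text):
    """Step 2: rank candidate entities by context-word overlap."""
    context = set(text.lower().split())
    scores = {cand: len(words & context) for cand, words in KB[mention].items()}
    return max(scores, key=scores.get), scores

text = "Ford unveiled a new electric vehicle at the company headquarters."
for mention in detect_mentions(text):
    entity, scores = disambiguate(mention, text)
    print(mention, "->", entity, scores)
# Ford -> Ford_Motor_Company {'Ford_Motor_Company': 2, 'Henry_Ford': 0, 'Ford_Island': 0}
```

Real systems replace both steps with statistical models, but the overall shape (mention detection followed by context-based candidate ranking) is the same.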

2.1 Named entity recognition (NER)

Depending on the domain of interest, the named entities to consider are different. For example, in the biomedical domain, genes are the entities of interest. In the general domain, recognising the names of persons, organisations, and locations is crucial. In addition to these, dates, insurance numbers, and postal codes are important for companies handling customer data.

In some cases, producing labels from a small set of entity classes is not enough, especially where other details are required for each entity. For example, in addition to the name of a person, their role (doctor, engineer, director, soldier, terrorist, etc.) in a given organisation may also be needed [107]. For locations, it can also be useful to detect whether the extracted location represents a city, a country, a mountain, a park, etc. Classifying named entities into such categories and subcategories is known as fine-grained named entity recognition (FGER). To distinguish between the two research areas, the first one is usually referred to as coarse-grained named entity recognition (CGER), because it uses a more general classification.

Fig. 1 Named entity recognition example

2.2 Named entity disambiguation (NED)

The principal characteristic of NED systems is that they focus on the task of disambiguating a given entity, independently from the NER task [50]. To disambiguate the extracted entities, two directions are considered [29]: (1) disambiguating each entity locally, independently of the other entities and relying only on the surrounding text, or (2) disambiguating all the entities in the document globally (collectively, at the same time) to ensure coherence. Some systems are solely dedicated to disambiguation [60, 76, 144]. Other systems are end-to-end, covering both NER and NED [92, 141]. An end-to-end NED system is equivalent to entity linking (EL) because it covers both NER and NED.

Fig. 2 Named entity disambiguation example

Figure 2 illustrates the main idea of NED, where both Paris and France are linked to their respective Wikipedia pages.

In order to give the reader a global overview of the works described throughout this survey, we present Fig. 3. The figure classifies all the works that are described in more detail in the following sections. The works are classified into three categories: survey works (presenting a state of the art), research works (presenting an approach or a methodology), and datasets, tools and knowledge bases.

Fig. 3 Summary of the presented works

3 Previous surveys

To construct this paper, we followed four main steps: (1) gathering surveys on EL, NER, and NED; (2) analysing the gathered surveys; (3) extracting a set of issues related to the studied surveys; (4) proposing an approach for constructing a survey that presents the most recent works and resolves the issues of the other surveys. To gather all the pertinent surveys from the research literature, we ran searches on Google Scholar using the queries “survey entity linking”, “survey named entity recognition/resolution/extraction”, “survey named entity disambiguation”, “state of the art entity linking”, “state of the art named entity recognition/resolution/extraction”, and “state-of-the-art named entity disambiguation”. We filtered by year to retain only surveys published from 2016 onwards.

This search yielded five recent survey papers on entity linking (one dates back to 2007, but most of the research literature cites it, so we include it). The first one, from Balog [17], considered entity linking in general, including both NER and NED. The others were principally dedicated to NER [63, 124, 136, 188]. To the best of our knowledge, no survey has been dedicated exclusively to NED.

The survey from Balog [17] in 2018 defined the problems of EL, NER and NED. The author presents the general process of EL, which is composed of NER followed by NED, affirming that the named entities have to be extracted before being disambiguated. The author also covers both NER and NED by citing pertinent research in each area. For disambiguation, the author distinguishes between local and global disambiguation. Afterwards, the author describes some available entity linking systems (such as AIDA, DBpedia Spotlight and TagMe) and some publicly available datasets (such as MSNBC, AQUAINT and ACE2004). The author concludes by presenting a set of challenges related to both NER and NED and by briefly highlighting recent trends in entity linking, namely the semantic embeddings and neural models that have recently been used. In this context, four recent works were cited [57, 60, 174, 200].

The 2007 survey of Nadeau et al. [124] is the first survey dedicated to NER; it covers works published from 1991 to 2006. The authors classify NER approaches into three main categories: (1) supervised approaches using maximum entropy models, decision trees, hidden Markov models (HMM) and conditional random fields (CRF); (2) semi-supervised approaches using "bootstrapping" to construct a corpus based on a set of initial seeds; (3) unsupervised approaches using clustering. This survey also considers multilingual aspects, presenting works on German, Japanese, Greek, Italian and other languages.

Goyal et al. published in 2018 a detailed survey about NER [63]. The authors classify approaches as rule-based (using hand-crafted features) or machine learning-based. Similarly to Nadeau’s survey, the ML-based approaches were divided into three main categories: supervised, unsupervised, and semi-supervised or hybrid. For each category, the authors produced a summary table comparing the approaches along several dimensions: target language/domain, technique used, dataset, and results.

Yadav et al. [188] present the most recent survey on NER (published in 2019), focusing on architectures based on deep learning models. The survey aims to compare feature-engineered and neural network systems proposed for multi-domain and multilingual NER. The authors classify the NER systems presented in the research literature into four categories: (1) knowledge-based systems that use a domain-specific lexicon, (2) unsupervised systems using bootstrapping, (3) supervised systems that use annotated data and models such as hidden Markov models (HMM) and support vector machines (SVM), and (4) neural network systems. The authors focus on the last category and group the studied neural network systems into four further categories: (1) word-level architectures, using the set of words (embeddings) composing a sentence as input to a recurrent neural network (RNN), (2) character-level architectures, using a set of characters as input to the RNN, (3) character- + word-level architectures, where two models were dominant (the first combines word embeddings with a convolution over the characters composing the words and uses a CRF for the decoding step; the second concatenates word embeddings with character representations produced by an LSTM or Bi-LSTM layer), and (4) character + word + prefix/suffix models, which also integrate prefix/suffix features. In addition to the systems, the authors also presented some NER datasets.

Finally, the survey from Patil et al. [136] on multilingual NER classified systems into two categories: (1) systems for Indian languages (such as Hindi, Bengali and Punjabi) and (2) systems dedicated to non-Indian languages (such as English, Spanish, Chinese and Arabic). Most of the presented approaches are statistical, relying on conditional random fields (CRF) and maximum entropy (Maxent). The authors also highlighted the use of hybrid systems combining rule-based and statistical approaches, or combining more than one statistical algorithm (such as Maxent and HMM). The authors concluded that the most critical issues for NER in Indian languages are the lack of annotated corpora, the rich morphology of Indian languages and the variations in writing style.

Our analysis of the available surveys in the literature identified the following issues:

  • From the presented surveys, it can be concluded that only one paper focuses on NER + NED, and all the other surveys focus on NER only.

  • No survey handles all of NER, NED, and multilingual aspects.

  • Almost all the works presented by the studied surveys are old (before 2015).

  • The surveys lack descriptions of the constructed resources, such as tools, APIs and datasets.

  • The few datasets that were described did not include the necessary information to locate them online (e.g. the link to their website).

Our aim with this survey is to present the most recent works on multilingual NER and NED while resolving the issues cited above.

4 Survey methodology

We followed an incremental method for gathering the research works presented in this survey. Figure 4 outlines our approach.

Fig. 4 Methodology for gathering research papers

We started by gathering the keywords related to EL, such as NER and NED. Then, we gathered a few recent works in each field, targeting both English-specific and multilingual works. For each work, we focus on four main aspects: (1) previous works, to extract the added value of each studied work within the research literature, (2) methodology, to present to the community the most fundamental aspects of each methodology, (3) the used or constructed resources, to gather the publicly available APIs, tools, and datasets that represent valuable resources for the research community, and (4) the experiments and the parameters used with each model, their results and their comparisons with previous works. Each of these steps led to gathering more works and more resources. From the collected works, we excluded those handling word disambiguation.Footnote 3 We also excluded research works presenting similar approaches using the same techniques, keeping one representative work in each family of approaches. This process resulted in the selection of 167 research works on EL/NER/NED, which we discuss below.

Finally, in addition to presenting and analysing the works and resources focusing on EL/NER/NED, we also aim to answer the following research questions:

  • Q1: What are the most recent methods/techniques used for entity linking (including named entity recognition and disambiguation)?

  • Q2: What is the tendency regarding corpora? Do studies tend to use publicly available corpora, or do they prefer to construct their own?

  • Q3: What are the main techniques proposed for constructing a corpus, and what are their main advantages and disadvantages?

  • Q4: Which English-centric approaches produce the best results and performance?

  • Q5: Which multilingual approaches produce the best results and performance?

  • Q6: What are the open issues for entity linking?

To answer the above questions, we classify the works presented in the research literature into two main categories: the research works that have been done in English and the multilingual research works. For both categories, we split the works into NER and NED.

5 Research works on English

5.1 Named entity recognition

The most widely used approach for NER is “coarse-grained”, which considers entities belonging to a small number of major classes (from one to ten). Most of the research studies focusing on a single class are dedicated to biomedical NER: they extract the names of diseases, viruses, patients, etc. [98, 129]. Works focusing on three classes aim to extract person, location and organisation names [83]. Works focusing on four classes tend to identify person, location, organisation, and miscellaneous names [44, 162, 187]. Others focused on six [6] or ten categories [105]. The most important issue with coarse-grained approaches is that they focus on a small set of classes (up to ten). Fine-grained approaches aim to resolve this limitation, with some systems detecting more than 100 classes [107].

The coarse- and fine-grained classifications are correlated. For example, the coarse-grained class of person (PER) may contain the fine-grained classes of judge, lawyer and other person (plaintiffs, defendants, witnesses, appraisers, etc.). The location class (LOC) includes the fine-grained classes of country (LD: countries, states and city-states), city (ST: cities, villages and communities), street (STR: streets, squares, avenues, municipalities and attractions), and so on. The coarse-grained class “organization” (ORG) is divided into public, social, state and economic institutions, etc. [100]. In the following, we present the works proposed for both approaches (coarse-grained and fine-grained).
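The correlation between the two granularities can be pictured as a small type hierarchy; the mapping below is only an illustrative subset built from the classes cited above, not a complete tag set.

```python
# Toy coarse-to-fine type hierarchy (illustrative subset only).
TYPE_HIERARCHY = {
    "PER": ["judge", "lawyer", "other_person"],
    "LOC": ["country", "city", "street"],
    "ORG": ["public_institution", "social_institution",
            "state_institution", "economic_institution"],
}

def coarse_of(fine_type):
    """Map a fine-grained label back to its coarse-grained parent."""
    for coarse, fine_list in TYPE_HIERARCHY.items():
        if fine_type in fine_list:
            return coarse
    return "MISC"

print(coarse_of("city"))  # LOC
```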

5.1.1 Coarse-grained NER approach

Different approaches have been used, including rule-based, machine learning-based and clustering-based approaches. Rule-based approaches rely on pre-defined vocabularies and rules that can encode complex logic. Machine learning-based approaches employ statistical models (such as support vector machines (SVM) and decision trees) [178]. More recently, deep learning methods have provided significant performance improvements in multiple visual analysis tasks, such as object detection, classification and tracking. Deep learning models typically contain hundreds of thousands or even millions of trainable parameters, which gives them their edge in terms of performance [133]. Finally, the goal of clustering is to discover the natural groupings of a set of objects. Many clustering algorithms are generic in the sense that they can be applied to any type of data equipped with a measure of distance between data points. Diverse types of clustering methods are available; the most popular clustering algorithm is k-means, which iteratively identifies k cluster centres (centroids) and assigns each data point to the closest centroid [91].
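To make the clustering idea concrete, the following minimal k-means implementation (plain NumPy on random toy data; not taken from any of the cited systems) alternates the two steps just described: assigning each point to its nearest centroid and recomputing the centroids.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: iteratively assign points to the nearest centroid
    and move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every point to every centroid, then hard assignment
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
print(centroids)
```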

In the context of rule-based approaches, Neelakantan et al. [129] propose an approach to automatically construct a dictionary for NER from Wikipedia, using a large corpus of unlabelled data and a few seed examples. This approach includes two steps: (1) collecting a list of candidate phrases from the unlabelled corpus for every named entity type using simple rules and (2) removing the noisy candidates from the obtained list to construct an accurate dictionary. To predict whether a candidate phrase represents a named entity, the lower-dimensional, real-valued canonical correlation analysis (CCA) embeddings of the candidate phrases are used as features, and training is done using a small number of labelled examples and a binary SVM for classification. Two kinds of experiments were carried out: (1) using a dictionary-based tagger relying on the four constructed dictionaries and the two corpora used (GENIA and NCBI) and (2) using a CRF-based tagger that takes the constructed dictionaries as features. First, the authors compare the results obtained with the four dictionaries. For the CRF, different regularisation valuesFootnote 4 were tried: 0.0001, 0.001, etc. The CRF experiments were done using both CCA word and phrase embeddings. The best F1-scores obtained are 62.30 on the GENIA corpus using CCA, 48.03 on NCBI using manual construction, 79 on GENIA using the CRF tagger with CCA phrase embeddings, and 81 on NCBI using the CRF tagger with CCA phrase embeddings.
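The general shape of this pipeline (low-dimensional CCA embeddings of candidate phrases, then a binary SVM trained on a handful of labelled seeds to filter noisy candidates) can be sketched as follows; the feature matrices and seed labels are synthetic placeholders rather than the paper's actual data.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.svm import SVC

# Synthetic stand-ins: one row per candidate phrase.
# X = phrase-internal features, Y = context features (e.g. co-occurrence counts).
rng = np.random.default_rng(0)
X = rng.random((200, 50))
Y = rng.random((200, 40))

# Step 1: low-dimensional CCA embeddings of the candidate phrases.
cca = CCA(n_components=10)
phrase_emb, _ = cca.fit_transform(X, Y)

# Step 2: a binary SVM trained on a few labelled seeds
# (1 = genuine entity of the target type, 0 = noise).
seed_idx = np.arange(20)
seed_labels = np.array([1] * 10 + [0] * 10)
clf = SVC(kernel="linear").fit(phrase_emb[seed_idx], seed_labels)

# Step 3: keep only candidates predicted as entities -> the filtered dictionary.
keep = clf.predict(phrase_emb) == 1
print(f"{keep.sum()} of {len(keep)} candidates kept for the dictionary")
```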

Only some studies use classic machine learning algorithms such as SVM [129]. The majority of the proposed studies rely on neural networks [6, 44, 98, 105, 187]:

  • Xu et al. [187] proposed an approach that examines all possible fragments in the text (up to a certain length) one by one. It uses the FOFE methodFootnote 5 to fully encode each fragment, its left context and its right context into fixed-size representations, which are in turn fed to a FENNFootnote 6 to predict the entity mentions. This model is based on both character- and word-level models. In the evaluation phase, the authors also consider nested entities (names embedded in other names, for example, British Columbia or Western Canada).

  • Aguilar et al. [6] propose a system which embeds a sentence into a high-dimensional space (using CNN,Footnote 7 BiLSTM,Footnote 8 and dense encoders) to extract features. Afterwards, the resulting vectors of the encoders are concatenated for multi-task learning. Finally, a CRF classifier uses the weights of the common dense layer to perform sequential classification.

  • Lee et al. [98] propose an approach relying on transfer learning and artificial neural networks to perform NER for patient note de-identification. Transfer learning is used to improve a learner in one domain by transferring information from a related domain.Footnote 9 The proposed model includes six main layers: two embedding layers (one for tokens, one for characters), two LSTM layers (one for tokens, one for characters), a fully connected layer, and a CRF layer. Two kinds of experiments were carried out: (1) experiments with different sizes of the target training set, to show how many labels the target dataset needs to achieve consistent performance with and without transfer learning, and (2) experiments transferring different combinations of the network parameters rather than all of them, to show the importance of each layer.

  • Dernoncourt et al. [44] propose NeuroNER, a state-of-the-art neural network-based NER system. The purpose of NeuroNER is to allow users to annotate entities using a graphical web-based user interface (BRAT)Footnote 10 [169]. The model contains three layers: (1) an LSTM for character embeddings, (2) an LSTM for token embeddings and (3) a CRF.

  • The purpose of the work in [105] is to propose an approach for dealing with the noisy and colloquial nature of tweets, using an LSTM to learn orthographic features automatically. The proposed approach includes three main components: (1) an orthographic sentence generator, (2) word representations as input vectors, and (3) a bidirectional LSTM. At the output layer, the CRF log-likelihood (the likelihood of labelling the whole sentence correctly by modelling the interactions between two successive labels) is used.

  • Finally, the purpose of the work in [83] is to propose a hybrid system (implemented as a Python script) combining different freely available NER tools. Four freely available NER tools were used: Stanford NER, spaCy, LingPipe and NLTK. The proposed tool can recognise the three basic entity types: PERSON, LOCATION and ORGANISATION. The four tools were evaluated on WikiGold. As Stanford NER gave the best results, the constructed corpus was first annotated by Stanford NER and then reviewed manually. The hybrid system was evaluated on both constructed corpora (History and Infopedia), and its results were compared to those returned by Stanford NER and spaCy.
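A minimal sketch of such a tool-combining strategy is shown below, using two freely available taggers (spaCy and NLTK, both of which need their models/data downloaded first) and simply taking the union of their predictions; the label-normalisation map is an assumption made for this example, not the cited system's actual rules.

```python
# Requires: pip install spacy nltk ; python -m spacy download en_core_web_sm
# and nltk.download() of 'punkt', 'averaged_perceptron_tagger',
# 'maxent_ne_chunker' and 'words'.
import spacy
import nltk

# Hypothetical label normalisation, just for this sketch.
NORMALISE = {"PERSON": "PERSON", "ORG": "ORGANISATION", "ORGANIZATION": "ORGANISATION",
             "GPE": "LOCATION", "LOC": "LOCATION", "LOCATION": "LOCATION"}

def spacy_entities(text, nlp):
    return {(ent.text, NORMALISE[ent.label_])
            for ent in nlp(text).ents if ent.label_ in NORMALISE}

def nltk_entities(text):
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    ents = set()
    for subtree in tree:
        if hasattr(subtree, "label") and subtree.label() in NORMALISE:
            ents.add((" ".join(tok for tok, _ in subtree.leaves()),
                      NORMALISE[subtree.label()]))
    return ents

text = "Mr Brown is living in California, and he is working at Amazon."
nlp = spacy.load("en_core_web_sm")
combined = spacy_entities(text, nlp) | nltk_entities(text)   # simple union of both tools
print(sorted(combined))
```

A stricter variant would keep only entities predicted by both tools, or weight each tool by its accuracy on a development set.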

The current state-of-the-art results were recently obtained using models targeting more than one natural language processing (NLP) problem, including NER. In this context, Baevski et al. [15] propose a bidirectional transformer architecture that predicts every token in the training corpus using a cloze-style training objective (cloze tests asked humans to guess omitted words in a sentence using its context, knowledge of syntax and other skills [176]). The proposed model aims to predict the centre word given right-to-left and left-to-right context representations. This model was used for many NLP tasks, including text classification, question answering, parsing, and NER. For NER and parsing, the authors rely on different architectures (using embedding models previously presented in the research literature [45, 139, 140]), different language models and different learning rates. This model outperforms all the results presented in the research literature, with an F1-score of up to 93.5. With a slightly lower F1-score (up to 93.47), Jiang et al. [84] propose a neural architecture search (NAS) approach dedicated to both language modelling and NER. These authors were the first to integrate differentiable NAS for these tasks. Differentiable NAS uses a continuous relaxation of the architecture representation, making gradient descent straightforwardly applicable to the search [84]. They use recurrent neural networks (RNNs) with a recurrent cell consisting of 8 nodes. The proposed system runs for 40 training epochs with a batch size of 256 and a learning rate of 20. For NER, in addition to the works presented in [45, 139], the authors also compare their results to those of Lample et al. [96] (F1-score of 90.94) and Akbik et al. [8] (F1-score of 93.18).
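The cloze objective itself is easy to try with an off-the-shelf masked language model; the snippet below uses the Hugging Face transformers fill-mask pipeline with a standard BERT model purely as an illustration of predicting an omitted word from its two-sided context, not the specific architectures of [15] or [84].

```python
# Illustration of the cloze objective with an off-the-shelf masked language model.
# Requires: pip install transformers torch
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-cased")
for pred in fill_mask("Mr Brown is working at [MASK] in Seattle.", top_k=3):
    print(f"{pred['token_str']:>12}  {pred['score']:.3f}")
```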

However, no publicly available system is associated with either of the works by Baevski et al. [15] and Jiang et al. [84]. Akbik et al. [7, 8] obtained results slightly below the aforementioned systems, but their proposed system (Flair) is publicly available.Footnote 11 Flair is based on contextualised word embeddings, which associate each word with its context in a given sentence [8]. This work was later improved by integrating a memory of all the contextual embeddings produced for a given word [7]; the accumulated vectors are then combined using a pooling operation. This work was also compared to different recent works proposed in the research literature [5, 36, 45, 96, 139].
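Since Flair is publicly available, its pretrained English NER tagger can be tried directly; the snippet below assumes the flair package's standard interface and downloads the default model on first use.

```python
# Requires: pip install flair (the pretrained English NER model is downloaded on first use)
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")               # 4-class English NER model
sentence = Sentence("Mr Brown is living in California and works at Amazon.")
tagger.predict(sentence)
for span in sentence.get_spans("ner"):
    print(span.text, span.tag, round(span.score, 3))
```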

Only one work was found to use clustering [162]. In this work, the authors propose an approach based on k-means clustering with three different numbers of clusters: 100, 1000, and 5000. The authors also rely on the skip-gram (SG) model of word2vec for extracting word vectors and on a linear support vector classifier (SVC) for classification. Table 1 summarises all the works presented in this part by showing the constructed and used datasets and giving some details on the experimentation and results.
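The overall shape of that pipeline (skip-gram vectors, k-means cluster identifiers, and a linear SVC) might look as follows; the corpus, cluster count and labels are toy placeholders rather than the setup of [162].

```python
# Toy version of a skip-gram + k-means + linear SVC pipeline.
# Requires: pip install gensim scikit-learn
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

sentences = [["mr", "brown", "lives", "in", "california"],
             ["amazon", "is", "an", "organisation"],
             ["tokyo", "is", "a", "location"]] * 50        # tiny toy corpus

# Skip-gram (sg=1) word vectors.
w2v = Word2Vec(sentences, vector_size=50, sg=1, min_count=1, epochs=20)

# Cluster the vocabulary; a word's cluster id can serve as one of its features.
words = list(w2v.wv.index_to_key)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(w2v.wv[words])
cluster_of = dict(zip(words, kmeans.labels_))

# A linear SVC over the word vectors themselves (toy labels: 1 = part of a name).
X = w2v.wv[["brown", "california", "amazon", "tokyo", "lives", "is", "a", "an"]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
clf = LinearSVC().fit(X, y)
print(cluster_of["amazon"], clf.predict(w2v.wv[["california"]]))
```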

Table 1 Works on coarse-grained NER

5.1.2 Fine-grained NER approach

FIGER [107] is one of the first fine-grained systems proposed for NER. The authors consider the fine-grained problem as a multi-class, multi-label classification problem. This system recognises 112 tags, using a conditional random field (CRF)Footnote 12 tagger to find the candidates and the perceptron algorithm for classification. For training the CRF, the authors opt for automatic corpus construction, in order to generate a dataset large enough to cover all the tags. For this, they exploit the anchor links in Wikipedia text to automatically label entity segments with the appropriate tags. To validate their system, the authors carry out two types of experiments: the first is dedicated to NER and the second to relation extraction (RE). For the first, the authors compared their system to two other systems in the research literature, Stanford NER [55] and Illinois Named-Entity Linking (NEL) [146]. For RE, the authors use MultiRFootnote 13 [77], trained using distant supervision by heuristically matching relation instances.
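The corpus-construction idea, using anchor links whose target page has a known type to label entity segments automatically, can be sketched as follows; the anchor-to-type table and the example sentence are fabricated for illustration.

```python
import re

# Fabricated mapping from Wikipedia page titles to fine-grained types.
PAGE_TYPES = {"Barack_Obama": "/person/politician",
              "Chicago": "/location/city",
              "Harvard_Law_School": "/organization/educational_institution"}

def distant_label(wikitext):
    """Turn [[Target|surface]] anchors into (surface, type) training labels
    whenever the target page has a known type."""
    labels = []
    for target, surface in re.findall(r"\[\[([^|\]]+)\|([^\]]+)\]\]", wikitext):
        fine_type = PAGE_TYPES.get(target.replace(" ", "_"))
        if fine_type:
            labels.append((surface, fine_type))
    return labels

sentence = ("[[Barack Obama|Obama]] taught law before moving to "
            "[[Chicago|the city of Chicago]].")
print(distant_label(sentence))
# [('Obama', '/person/politician'), ('the city of Chicago', '/location/city')]
```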

FIGER was compared to a more recent system, TypeNet [123]. TypeNet aims to integrate hierarchical information into the embedding space of entities and types to improve entity linking and fine-grained entity typing. Although the principal goal of TypeNet is fine-grained NER, it also integrates an entity linking model based on a combination of a string similarity score and cosine similarity. To model the hierarchy in the embedding space (the links between types and entities), the authors use two state-of-the-art knowledge graph embedding models: real and complex bi-linear maps. The real model is equivalent to RESCAL (with a single IS-A relation type), proposed by Nickel et al. [131]. The complex model is based on the ComplEx model (using complex-valued vectors for types and complex diagonal matrices for relations) proposed by Trouillon et al. [180]. For evaluating their system, the authors consider three kinds of experiments: (1) mention-level entity typing using FIGER, (2) entity-level typing using Wikipedia and TypeNet, and (3) entity linking using MedMentions. The authors compare their approach to different state-of-the-art systems [68, 107, 160, 161].

Table 2 Works on fine-grained NER

Abhishek et al. [2] build on FIGER [107] and TypeNet [123] to present the HAnDS (Heuristics Allied with Distant Supervision) framework for automatically constructing a dataset suitable for fine-grained NER. HAnDS requires three inputs: a linked text corpus (e.g. Wikipedia), a knowledge base (capturing concepts, their properties, and inter-concept relations, e.g. Freebase) and a type hierarchy (a hierarchical organisation of entity types, e.g. FIGER and TypeNet). To reduce false positives and false negatives, HAnDS follows three stages: (1) link categorisation and processing, to remove incorrect anchors detected as entity mentions, (2) inference of additional links, by linking the correct referential name of an entity mention to the correct concept in the knowledge base, and (3) sentence selection, allowing high-quality annotations by using a POS tagger and other features. For evaluation, the authors consider two sub-tasks: Fine-ED, a sequence labelling problem, and Fine-ET, a multi-label classification problem. For Fine-ED, an LSTM-CNN-CRF modelFootnote 14 is used [112]; for Fine-ET, an LSTM-based model is used.

Lal et al. [95] present SANE, a system using Wikipedia categories to recognise fine-grained entities. The authors focus on named entity typing (NET), where a semantic type is associated with a given entity. SANE is based on Stanford NER for the extraction of named entities and on pattern-based matching for fine-grained NER. The best categories are chosen using a word2vec-based selection model, and the categories selected in the lookup-based extraction phase are mapped to appropriate WordNet types. A 3-class (Person, Organization, Location) NER classifier is used to find coarse-grained (CG) named entities; the identified entities are then processed by SANE. SANE is compared to FINET. The results of both systems (i.e. SANE and FINET) were manually labelled by two independent annotators; the inter-annotator agreement (Kappa) is 0.72 for FINET and 0.86 for SANE.

Table 2 summarises all the works presented in this part by showing the constructed and used datasets and giving some details on the experimentation and results.

5.2 Named entity disambiguation

The first proposed NED approaches focused on local disambiguation, which resolves entity mentions independently and uses various hand-designed features and heuristics (specific to each mention) [34]. This approach suffers from two major limitations: it overlooks the topical coherence among the target entities, and unseen words/features are not recognised (data sparseness) [29]. To resolve these issues, global disambiguation (where all entity mentions are disambiguated simultaneously) was proposed [22, 29, 195]. We present below the identified works for both local and global disambiguation approaches.

5.2.1 Local disambiguation

Local disambiguation approaches disambiguate each mention in a document separately, utilising clues such as the textual similarity between the document and each candidate [146].

In this context, Chisholm et al. [34] propose an entity disambiguation system, named NEL (named entity linking), to compare Wikipedia and Wikilinks. For extracting features, the authors apply three approaches: (1) entity prior, corresponding to the probability of a link pointing to a given entity, (2) name probability, corresponding to the relationship between a name and an entity, and (3) textual context, using the surrounding words. For the latter, two techniques were used: a Bag-of-Words (BOW) context and a distributional BOW, where a 300-dimensional word embedding vector is used. To perform disambiguation, an SVM classifier was used. The authors carry out many experiments to compare Wikipedia and Wikilinks, showing the impact of combining them and the impact of corpus size on the results, and then compare their NEL system to four other systems in the literature [11, 73, 76, 78].
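The three feature families used there can be approximated in a few lines; the link counts and candidate descriptions below are invented, and a real system would learn the combination weights (e.g. with the SVM mentioned above) rather than simply summing the scores.

```python
import math
from collections import Counter

# Invented link statistics: how often each surface name links to each entity.
LINK_COUNTS = {"Brown": Counter({"Chris_Brown": 900, "Joseph_Brown_(engineer)": 30})}
ENTITY_PRIOR = Counter({"Chris_Brown": 1200, "Joseph_Brown_(engineer)": 40})
DESCRIPTIONS = {"Chris_Brown": "american singer songwriter rnb music",
                "Joseph_Brown_(engineer)": "systems development engineer amazon"}

def cosine(a, b):
    a, b = Counter(a.split()), Counter(b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def local_score(mention, context):
    total_prior = sum(ENTITY_PRIOR.values())
    total_name = sum(LINK_COUNTS[mention].values())
    scores = {}
    for cand in LINK_COUNTS[mention]:
        entity_prior = ENTITY_PRIOR[cand] / total_prior          # P(entity)
        name_prob = LINK_COUNTS[mention][cand] / total_name      # P(entity | name)
        context_sim = cosine(DESCRIPTIONS[cand], context)        # textual context
        scores[cand] = entity_prior + name_prob + context_sim    # naive combination
    return scores

print(local_score("Brown", "mr brown is an engineer working at amazon"))
```

With this naive equal weighting the popularity prior dominates and the singer wins, which is precisely why learned combination weights (or a proper probabilistic model) matter in practice.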

Table 3 Works on NED

5.2.2 Global disambiguation

Global disambiguation is an NP-hard optimisation problem, so approximations are required. For example, with Wikipedia, the common approach is to utilise the Wikipedia link graph to obtain an estimate of the pairwise relatedness between titles in order to efficiently generate a disambiguation context [146].
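The joint objective can be illustrated with a brute-force search over a tiny candidate space (real systems resort to beam search, greedy heuristics or graph algorithms precisely because this joint space grows exponentially); all scores below are placeholders.

```python
from itertools import product

# Placeholder local scores and pairwise relatedness (e.g. from the Wikipedia link graph).
LOCAL = {"Paris": {"Paris_(France)": 0.6, "Paris_Hilton": 0.5},
         "France": {"France": 0.9, "France_Gall": 0.2}}
RELATEDNESS = {frozenset({"Paris_(France)", "France"}): 0.8,
               frozenset({"Paris_Hilton", "France_Gall"}): 0.1}

def related(a, b):
    return RELATEDNESS.get(frozenset({a, b}), 0.0)

def global_disambiguation(mentions):
    """Score every joint assignment as the sum of local scores plus
    pairwise coherence, and return the best one."""
    best, best_score = None, float("-inf")
    names = list(mentions)
    for assignment in product(*(mentions[m] for m in names)):
        score = sum(mentions[m][e] for m, e in zip(names, assignment))
        score += sum(related(a, b) for i, a in enumerate(assignment)
                     for b in assignment[i + 1:])
        if score > best_score:
            best, best_score = dict(zip(names, assignment)), score
    return best, best_score

print(global_disambiguation(LOCAL))
# ({'Paris': 'Paris_(France)', 'France': 'France'}, 2.3)
```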

Yang et al. [195] propose the structured gradient tree boosting (SGTB) learning model for named entity disambiguation. The framework is built using the SGTB model [194], employing a conditional random field (CRF) objective. To compute the partition function (normalisation term) for training and inference, beam search is used. Moreover, Bidirectional Beam Search with a Gold path (BiBSG) is used to reduce the model variance and to consider both past and future information in the prediction step. Two experiments are conducted: in-domain (using the AIDA-CoNLL corpus) and cross-domain (using all the other datasets). Cao et al. [29] present NCEL, a neural model for collective entity linking. NCEL includes three main components: candidate generation, feature extraction and the neural model. Firstly, the generation of the different candidates is based on Wikipedia page titles, a dictionary derived from a large web corpus, and Yet Another Great Ontology (YAGO). Secondly, NCEL uses both local contextual features (based on similarity) and global coherence information (based on a window size defining neighbour adjacency). Finally, NCEL incorporates graph convolutional networks into a deep neural network to exploit structured graph information for collective feature abstraction. NCEL was compared to 16 other systems focusing on EL [34, 59,60,61, 68, 73, 76, 93, 117, 122, 135, 141, 177, 181, 193, 199]. The results were compared using the GerbilFootnote 15 benchmark. The authors also provide a detailed analysis of the least complex corpus (TAC2010) and the most complex ones (WIKI and CWeb).

Bhatia et al. [22] present a simple, fast, and accurate probabilistic entity-linking algorithm used in enterprise settings. To do this, the authors rely on automatically constructed domain-specific knowledge graphs. The idea of this approach is to first extract the named entities from the query (using publicly available systems such as Apache OpenNLPFootnote 16 or Stanford NERFootnote 17). Afterwards, a list of target entities is generated by retrieving all entities from the graph containing the extracted tokens. For each target entity, an entity-context score and a text-context score are computed using the naive Bayes algorithm. The role of the entity-context component is to compute the probability of observing the entities forming the context given the target entities. The role of the text-context component is to compute the probability of observing the query terms given the target entities. Finally, the scores for all the target entities are combined to produce a final ranked list. The authors compare their approach to 5 other works in the research literature [4, 30, 75, 76, 113]. Their knowledge base has, on average, 2,261 candidates per mention to disambiguate, which is high compared to manually cleaned knowledge bases such as DBpedia.

Kolitsas et al. present an end-to-end system for the task of entity linking [92], inspired by the recent models of Lee et al. [99] and Ganea et al. [60]. The purpose of this system is to generate all possible spans/mentions and select the top candidates referred to by each mention. The best candidates are selected using an empirical probabilistic entity map built by Ganea et al. [60] from Wikipedia hyperlinks, Crosswikis [167] and YAGO. To disambiguate the generated candidates, the authors compute a similarity score using embedding dot products (between the word embedding vectors constructed for the mentions and their context). To extend their model from local to global disambiguation, the authors added a layer to their neural network model; however, for global disambiguation, they only consider the candidate with the highest local score. The proposed system was compared to many state-of-the-art systems included in Gerbil.

Mulang et al. [121] present Arjun, a context-aware entity linking approach comprising three subtasks: (1) surface form extraction, identifying all the surface forms associated with entities, (2) entity mapping (or candidate generation), where the surface forms are mapped to a list of candidate entities from a local knowledge graph, and (3) entity disambiguation, where the most appropriate candidate entity for each surface form is selected. For subtasks (1) and (3), the authors extend the attentive neural model proposed by Luong et al. [109]. In contrast to Luong et al., they use a bidirectional long short-term memory (Bi-LSTM) model for the encoder and a one-directional LSTM model for the decoder. For creating the local knowledge graph, the authors follow the methodology described by Sakor et al. [151], where each entity label is extended with its aliases from Wikidata. Arjun was compared to OpenTapioca [42], an end-to-end EL approach released for Wikidata.

Table 3 summarises all the works presented in this part by showing the constructed and used datasets and giving some details on the experimentation and results.

6 Multilingual research works

6.1 Named entity recognition

The purpose of the paper by Seyler et al. [155] is to show the importance of external knowledge for performing NER. The authors present a novel modular framework that divides the knowledge into four categories: (1) knowledge-agnostic, including local features extracted directly from the text, (2) name-based knowledge, which identifies patterns in names and exploits the fact that the set of distinct names is limited, (3) knowledge-base-based knowledge, extracted from an entity-annotated corpus, and (4) entity-based knowledge, encoding document-specific knowledge about the entities found in the text. The extracted features are used to train a linear-chain CRF. The experiments show the impact of incrementally adding external knowledge. The system was also applied to two additional languages, namely German and Spanish.

Kuru et al. [94] present CharNER, a character-level tagger for language-independent NER. CharNER operates at the character level, where the characters belonging to the same word are annotated with the same tag. The system architecture is composed of a 5-layer bidirectional LSTM network connected to a softmax output layer. Finally, a Viterbi decoder takes the sequence of character tag probabilities produced by the softmax layer and produces word-level tags. The presented results are close to those in the literature, without using any manually engineered features.
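The final decoding step, turning per-position tag probabilities into a single consistent tag sequence, is the classic Viterbi algorithm; a minimal NumPy version over made-up probabilities is shown below.

```python
import numpy as np

def viterbi(log_emissions, log_transitions):
    """log_emissions: (seq_len, n_tags) per-position tag log-probabilities
    (e.g. from a softmax layer); log_transitions: (n_tags, n_tags) log-probabilities
    of moving from tag i to tag j. Returns the most probable tag sequence."""
    seq_len, n_tags = log_emissions.shape
    scores = np.zeros((seq_len, n_tags))
    backptr = np.zeros((seq_len, n_tags), dtype=int)
    scores[0] = log_emissions[0]
    for t in range(1, seq_len):
        trans = scores[t - 1][:, None] + log_transitions      # (n_tags, n_tags)
        backptr[t] = trans.argmax(axis=0)
        scores[t] = trans.max(axis=0) + log_emissions[t]
    path = [int(scores[-1].argmax())]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(backptr[t][path[-1]]))
    return path[::-1]

# Made-up 4-position, 3-tag (O, B-PER, I-PER) example.
emissions = np.log(np.array([[0.8, 0.1, 0.1],
                             [0.1, 0.8, 0.1],
                             [0.1, 0.1, 0.8],
                             [0.8, 0.1, 0.1]]))
transitions = np.log(np.array([[0.6, 0.35, 0.05],   # O     -> O, B-PER, I-PER
                               [0.2, 0.10, 0.70],   # B-PER -> ...
                               [0.5, 0.10, 0.40]]))  # I-PER -> ...
print(viterbi(emissions, transitions))   # [0, 1, 2, 0]
```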

Shen et al. [158] present a deep active learning architecture for NER with a small training corpus. To reduce the computational complexity, a CNN is used as the character-level and word-level encoder and an LSTM as the tag decoder. For active learning, the authors explore uncertainty-based sampling strategies [101]. They use several algorithms: (1) least confidence (LC), which sorts examples in ascending order according to the probability the model assigns to its best tag sequence, (2) Maximum Normalised Log-Probability (MNLP), which normalises the log-probability by sequence length so that, unlike LC, the strategy does not concentrate only on long sequences, (3) interpreting the variability of the predictions over successive forward passes under dropout as a measure of the model’s uncertainty, and (4) other sampling strategies (OSS), which maximise the representativeness of the selected set without querying similar examples.
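The first two strategies are easy to state concretely: LC ranks sentences by one minus the probability of their best tag sequence, while MNLP divides the negative log-probability by the sequence length so that long sentences are not systematically preferred. A small sketch with made-up sequence probabilities:

```python
import math

# Made-up model outputs: (sentence id, probability of its best tag sequence, length).
candidates = [("s1", 0.70, 5), ("s2", 0.40, 25), ("s3", 0.55, 8)]

def least_confidence(prob, length):
    return 1.0 - prob                       # LC: ignore length

def mnlp(prob, length):
    return -math.log(prob) / length         # MNLP: length-normalised uncertainty

for name, score_fn in [("LC", least_confidence), ("MNLP", mnlp)]:
    ranked = sorted(candidates, key=lambda c: score_fn(c[1], c[2]), reverse=True)
    print(name, "->", [sid for sid, _, _ in ranked])
# LC   -> ['s2', 's3', 's1']   (long, low-probability sentence queried first)
# MNLP -> ['s3', 's1', 's2']   (normalisation removes the bias towards long sentences)
```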

Al-Rfou et al. [9] propose Polyglot, a language-independent NER system. To automatically construct a system dedicated to 40 languages, the authors relied on Wikipedia, Freebase and neural word embeddings. The authors consider the NER task as a word-level classification problem (same as the model proposed by Collobert [37]). Polyglot includes two main stages: (1) encoding the semantic and syntactic features of words in each language and (2) automatically generating a corpus from Wikipedia and Freebase. The Polyglot embeddings [10] were used for each language: the model was trained on Wikipedia without any labelled data. The process of creating a NER corpus includes two steps as well: (1) linking the Wikipedia articles to the corresponding entities and (2) using the exact surface form matching to extend the annotation (oversampling). The authors compare their system to that of Nothman et al. [132].

Yu et al. [197] present Cog-Comp, a Character-level Language Model (CLM) that considers each letter as a word and each word as a sentence, to show its impact on multilingual NER. The authors focus on 8 languages, including English. They also propose two additional features (an entity feature and a language feature, based on the original language of the named entities) for improving the results. The proposed system was compared to two state-of-the-art NER systems, Cog-CompNER [88] and LSTM-CRF [96].

Shao et al. [156] also investigate the impact of additional features and configurations on neural network-based models in the context of multilingual NER. The authors focus on three baseline models: a standard Bi-LSTM, a feed-forward network, and a window-based Bi-LSTM. They consider many features, such as a CRF at the output layer, POS and character embedding layers, and three different activation functions (hard sigmoid, ReLU and tanh). The models were applied to three languages: English, German and Arabic. The authors compare their models to many systems proposed in the research literature. For English, the models were compared to 4 state-of-the-art systems [35, 37, 56, 80]; for German, to three state-of-the-art systems [3, 70, 147]; and for Arabic, to the system of Benajiba et al. [19].

Halwe et al. [74] also focus on a low-resourced language (Arabic), presenting a deep co-learning approach for extracting named entities. The authors first construct an algorithm classifying Arabic Wikipedia articles into one of four categories, namely person, location, organisation and object (for non-entities). Afterwards, they rely on the proposed classifier to automatically annotate a large corpus of Arabic Wikipedia articles (25,000 articles), partially annotating 66,156 sentences from the extracted corpus. Finally, the authors adopt the concept of co-training proposed by Blum et al. [24] to combine existing annotated corpora with their partially annotated corpus for the task of NER. As a deep neural network architecture, the authors use both LSTM and BiLSTM layers, and also combine a Bi-LSTM with a CRF.

More recently, Jin et al. [85,86,87] focused on approaches that transform the entities of a knowledge base (KB) into an entity graph in order to apply graph-based algorithms. The idea of their first work [86] is to construct an entity graph using the links between entities, and to use both the graph structure and entity features for fine-grained NER. They apply an attributed and predictive network embedding model to construct entity features and structure the graph. Finally, they use multi-label classifiers to determine the entity class. The authors compare their approach to 8 state-of-the-art methods (FIGMENT [190], CUTE [186], MuLR [191], Global [128], Corpus [189], PTE [175], Planetoid [196] and ASNE [104]).

In their second work, Jin et al. [87] convert the entities of the KB into three semantic graphs, each representing a specific kind of correlation among entities. The first one (Aco) represents the co-occurrence relations among entities. The second one (Acat) represents the category proximity between entities. The third one (Aprop) represents the property proximity between entities. The authors then propose hierarchical multi-graph convolutional networks (HMGCNs), a deep learning architecture combining Aco, Acat and Aprop. To handle relations between types, recursive regularisation is adopted. The proposed approach was compared to 4 of the state-of-the-art systems mentioned above (FIGMENT, CUTE, MuLR and APE). Experiments show that the two approaches proposed by Jin et al. significantly outperform all the compared systems.

Finally, in their most recent work [85], the same authors propose a multilingual transfer learning model based on a mixture-of-experts approach. Their model dynamically captures the relationship between the target language and each source language and generalises to predict types of unseen entities in new languages. They investigate the role of the similarity between the source and the target languages on performance, focusing on six languages: German, English, Dutch, Russian, Spanish and Chinese. The main idea of their model is to use multiple source languages as a mixture of experts and to learn how to weight the experts for different target examples. For extracting features, they rely on mBERT [46], pre-trained on concatenated Wikipedia data in 104 languages. From their experiments, the authors conclude that the more similar the source and the target languages are, the better the performance will be: a large set of source languages with a high deviation of similarity performs worse than one of its subsets whose members are more similar to the target. The best F1-score (0.636) was achieved using three source languages (English, German and Spanish), where English was relatively more important.
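Extracting multilingual contextual features from a pretrained mBERT model can be done directly with the Hugging Face transformers library; the snippet below shows only this feature-extraction step, not the mixture-of-experts model built on top of it.

```python
# Requires: pip install transformers torch
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

sentences = ["Mr Brown works at Amazon.",        # English
             "Herr Braun arbeitet bei Amazon."]  # German
inputs = tokenizer(sentences, padding=True, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state   # (batch, seq_len, 768)
print(features.shape)
```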

Table 4 summarises all the works presented in this part by showing the constructed/used dataset and giving some details on the experimentation and results.

Table 4 Work on multilingual NER

6.2 Named entity disambiguation

Rosales et al. [149] present VoxEL, a manually annotated multilingual dataset dedicated to entity linking. In addition to English, VoxEL includes German, Spanish, French and Italian. VoxEL is based on 15 news articles (94 sentences, mostly dedicated to politics, particularly at the European level) sourced from VoxEurop. Two kinds of tagging are used: (1) strict tagging, based on three entity types (person, location and organisation), and (2) relaxed tagging, using a knowledge base and considering any noun phrase mentioned in Wikipedia as a valid entity. For each language, 204 mentions were annotated in strict VoxEL and 674 in relaxed VoxEL. For validating this dataset, the authors compare it to other multilingual corpora dedicated to EL by using various state-of-the-art multilingual systems, including TagME [53], TDH [49], DBpedia Spotlight [113], Babelfy [117], and FREME [154]. To present a fair comparison, the authors carry out all the experiments on the Gerbil benchmark.

Sil et al. [163] present LIEL, a Language-Independent Entity Linking system that includes two steps: (1) extracting the different mentions related to named entities and (2) linking the extracted mentions to a knowledge base (Wikipedia). To extract mentions and perform co-reference resolution, the authors use the IBM Statistical Information and Relation Extraction (SIRE) tool.Footnote 18 For mention detection, they use a CRF model from IBM SIRE, and they use the maximum entropy clustering algorithm for co-reference resolution (where 53 entity types were identified). For the entity linking step, the authors search for the best link for each mention by maximising the information extracted from the entire document. LIEL was compared to many systems for English ([32, 114, 159]), and for Chinese and Spanish it was compared to the systems of the shared tasks at TAC 2013Footnote 19 and TAC 2014.Footnote 20

BENGAL [130] is the first automatic approach that uses structured data to produce entity-linking benchmarks. Its main purpose is to generate gold-standard benchmarks in English and also in other languages, such as Brazilian Portuguese and Spanish. BENGAL is based on an RDFFootnote 21 graph. It starts by selecting a set of seed resources from the graph using a given number of triples for the generation process: for example, if the number is 3, BENGAL focuses on the person, organisation and location triples. To extract the set of seeds, a SPARQL query is used. A set of sub-graphs is then extracted, describing the information of each entity, the relationships between entities, and other aspects. The last part of the approach consists of applying verbalisation, which transforms the graph into a set of sentences (documents) by using a set of predefined predicates. Gerbil was used to evaluate the performance of BENGAL on English by comparing it to other, manually constructed datasets. BENGAL was also used for evaluating annotation performance in Brazilian Portuguese; in this case, an RDF verbaliser [118] was used, which was extended to Spanish using an adaptation of SimpleNLG [165].
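Selecting seed resources from an RDF graph is typically done with a SPARQL query; the query below, run against the public DBpedia endpoint via SPARQLWrapper, is only a generic illustration of the idea (retrieving a handful of person resources), not BENGAL's actual query.

```python
# Requires: pip install SPARQLWrapper
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?person ?name WHERE {
        ?person a dbo:Person ;
                rdfs:label ?name .
        FILTER (lang(?name) = "en")
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["person"]["value"], "-", row["name"]["value"])
```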

MAG is a multilingual, knowledge-base-agnostic and deterministic entity linking approach [120]. MAG consists of an offline phase and an online phase. During the offline phase, five indexes are generated: surface forms (all the labels related to an entity), person names (all the variations of person names across different languages), rare references (using the Stanford POS tagger [179] to extract adjectives related to the entities), acronyms (a handcrafted index from STANDS4Footnote 22), and context (using Concise Bounded DescriptionFootnote 23). The online phase consists of two steps: candidate generation and disambiguation. To generate all the candidates, all mentions are preprocessed by separating the acronyms (each word containing 5 letters or less is considered an acronym) from the string mentions. The string mentions are normalised. Afterwards, the candidates are searched using three different techniques: (1) by acronym, if the mention is classified as an acronym, (2) by label, relying on the set of surface forms which were generated, and (3) by context, using the TF-IDF metric.Footnote 24 Finally, a disambiguation graph is constructed to extract the optimal candidate. This step is equivalent to the disambiguation approach of AGDISTIS [181] based on HITSFootnote 25 and PageRank.Footnote 26 MAG was recently extended to support 40 languages, including low-resourced languages such as Ukrainian, Greek, Hungarian, Croatian, Portuguese, Japanese and Korean [119]. This work also presents a demo relying on online web services which allows for easy access to the entity linking approaches previously proposed by the authors [120]. By using this demo, the user is also able to define a set of parameters such as the graph-based algorithm (choosing between HITS and PageRank), whether to use acronyms or not, etc. MAG is also used in a domain-specific problem using a knowledge base of music terms [134].

Raiman et al. [144] propose DeepType, a system that associates with each entity a set of types (e.g. Person, Place, etc.) in order to disambiguate entities. The authors were inspired by the previous work of Ling et al. [106], which showed an improvement in the performance of their system after integrating the types proposed in FIGER [107]. The type system is automatically designed using a set of relations from Wikipedia and Wikidata. To predict the type system, the authors propose an algorithm containing two steps: (1) stochastic optimisation or heuristic search, and (2) gradient descent to fit the classifier parameters. The idea of the stochastic optimisation is to use a proxy objective function to avoid training an entire entity prediction model for each evaluation of the objective function. A neural network classifier is then trained by incorporating the resulting type system to label data in multiple languages. A bidirectional LSTM network [96] with word, prefix, and suffix embeddings (as previously done by Andor et al. [13]) is used. DeepType was compared to three other state-of-the-art systems [53, 114, 193].

Table 5 summarises all the works presented in this part by showing the constructed/used datasets and giving some details on the experimentation and results. More details about the constructed/used datasets, tools and ontologies referenced in this section are presented in the following part (Sect. 7).

Table 5 Work on multilingual NED

7 Datasets, tools and knowledge bases

7.1 Datasets

7.1.1 NER datasets

This part describes 22 datasets that were proposed or used by the works presented in this survey. CoNLL03Footnote 27 consists of a set of English newswires [152] built from the Reuters RCV1Footnote 28 corpus (RCV1 was constructed between August 1996 and August 1997) [102]. KBP2015 is a trilingual datasetFootnote 29 that consists of discussion forum posts and news articles published in recent years: all the documents are related, but they are not parallel across languages [82]. WNUT-2016Footnote 30 [172] is a corpus of tweets manually annotated using BRATFootnote 31: the corpus distinguishes 10 different named entity types (i.e. person, company, facility, geo-loc, movie, music artist, other, product, sports team and TV show). WNUT-2017Footnote 32 [43] is a manually annotated corpus used in the 3rd Workshop on Noisy User-generated Text (W-NUT). The documents contain the types person, location, corporation, product, creative-work and group. This corpus was extracted from many sources, such as YouTube and Twitter; for the manual annotation, three annotators were assigned to each document. The GermEvalFootnote 33 2014 NER shared task dataset [21] is a collection of citations extracted from news corpora and German Wikipedia. ANERcorpFootnote 34 [20] was annotated using the same format as the CoNLL 2002 corpora; it was manually collected and annotated by a single annotator to maintain coherence.

The MIMIC and i2b2 (2014 and 2016) datasets [173] were used by Lee et al. [98] to show the utility of transfer learning for NER. MIMIC is a part of the MIMIC-III dataset. WikiGold [69] is an annotated corpus comprising a small sample of Wikipedia articles in CoNLL (IOB) format [83]. MUC-7 is a set of New York Times articles [33]. CoNLL2003g [152] was extracted from the German newspaper Frankfurter Rundschau, between September and December 1992. CoNLL2002 [153] is a collection of newswire articles from May 2000, made available by the Spanish EFE News Agency. OntoNotes 5.0 is a large-scale corpus annotated at multiple levels of the shallow semantic structure in text. OntoNotes 5.0 covers three languages: English (one million words, plus 200K words of English translation), Chinese (one million words) and Arabic (300K words). It was extracted from newswire and magazine articles, broadcast news, broadcast conversations, web data and conversational speech data. The GENIA corpus is a semantically annotated corpus dedicated to biological text mining. In the GENIA corpus, the articles are encoded in an XML-based mark-up scheme, and each article contains its MEDLINE ID, title and abstract; all the texts in the abstracts are segmented into sentences [89]. The NCBI Disease corpus is a large-scale corpus consisting of 6,900 disease mentions in 793 PubMed citations. This corpus was developed by a team of 12 annotators (two people per annotation) and covers every sentence of each abstract. Disease mentions are categorised into Specific Disease, Disease Class, Composite Mention and Modifier categories [48].
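
Several of these corpora (WikiGold, the CoNLL family) distribute their annotations in the CoNLL/IOB column format. The snippet below shows a reader for a simplified two-column variant of that format (the actual CoNLL-2003 files also carry POS and chunk columns); the example sentence and tags are illustrative.

```python
# Minimal reader for a simplified two-column IOB layout (token and tag per line,
# blank line between sentences). The example sentence and tags are illustrative.
conll_snippet = """Mr B-PER
Brown I-PER
works O
at O
Amazon B-ORG
in O
California B-LOC
. O
"""

def read_conll(text):
    sentence = []
    for line in text.splitlines():
        if not line.strip():                 # blank line ends the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        token, tag = line.split()
        sentence.append((token, tag))
    if sentence:
        yield sentence

for sent in read_conll(conll_snippet):
    print(sent)
```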

WikiFbF is a corpus created automatically using Wikipedia, Freebase and the FIGER hierarchy [107]. WikiFbT is a corpus also created automatically using Wikipedia, Freebase and the TypeNet hierarchy. Wiki-NDS is a corpus created using the naive distant supervision approach with the same Wikipedia version used for creating both WikiFbF and WikiFbT. 1k-WFB-g is a fine-grained corpus manually annotated by Ling et al. [1] to cover a large set of types; the sentences used for its construction were extracted from Wikipedia text. TypeNet [123] is a dataset of hierarchical entity types for fine-grained entity typing, created manually by mapping Freebase types [26] to synsets of the WordNet hierarchy [52]. Another fine-grained dataset is FEW-NERD [47], a large-scale, publicly available, human-annotated few-shot NER dataset with 8 coarse-grained and 66 fine-grained entity types. FEW-NERD includes 188,238 sentences from Wikipedia (corresponding to 4,601,160 words).

7.1.2 NED datasets

Nine datasets were proposed for NED. KORE50 [75] is a dataset containing highly ambiguous entity mentions. The research community widely uses this corpus, and it is considered among the most challenging datasets for entity disambiguation. On average, each sentence contains only 14 words, of which 3 represent mentions [22]. AIDA-CoNLL [76] is based on CoNLL 2003. The annotation of this corpus was done manually using YAGO2 (detailed in Sect. 7.4): each mention was disambiguated by two students and, in case of conflict, the authors chose the right candidate. AQUAINT [114] is a randomly selected and manually linked subset of newswire documents. MSNBC [39] is a subset containing two stories for each of ten categories (business, U.S. politics, entertainment, health, sports, tech and science, travel, TV news, U.S. news, and world news); the corpus was extracted on January 2, 2007. ACE2004 [146] is a corpus manually annotated using Amazon Mechanical Turk (AMT). The accuracy of the annotations was approximately 85%, and the authors manually corrected them to increase their precision. WIKI and CWeb [67] are two corpora that respectively contain 345 and 320 files, constructed by sampling large publicly annotated corpora such as ClueWeb and Wikipedia. For constructing these corpora, the authors collected many annotated documents and retained only the most ambiguous ones; they tried several thresholds on their difficulty indicator and kept the bracket where accuracy was highest. TAC [81] is a corpus extracted from English Wikipedia in October 2008. This corpus includes three kinds of entities: persons, organisations and geo-political entities. N³ news.de is a real-world dataset collected from 2009 to 2011 [120]. DBpedia Abstracts [27] is a large multilingual corpus of annotated Wikipedia abstracts generated from enriched Wikipedia data. In addition to English, DBpedia Abstracts covers six languages: Dutch, French, German, Italian, Japanese and Spanish [120]. T-REx [121] is a dataset annotated using Wikidata triples.

To sum up, Tables 6 and 7 present all the corpora mentioned above. Some metrics, including the size of each corpus, its language and the entities covered, are included. Some of the research works using each dataset are mentioned as well.

Table 6 NER dataset statistics
Table 7 NED dataset statistics

7.2 Tools and ontology

Tools are the different systems developed for NER and NED. These systems can be used through the corresponding Python or Java libraries in order to detect and disambiguate names automatically with a few lines of code. Ontology, in the general sense, addresses how entities are grouped into categories and which of these entities exist at the most fundamental level. Ontologists often try to determine what the categories or highest kinds are and how they form a system of categories that encompasses the classification of all entities. Commonly proposed categories include substances, properties, relations, states of affairs and events.

7.2.1 NER tools

Stanford NER [55] is a Java package based on a linear-chain Conditional Random Field (CRF). Its models were trained on a mixture of the CoNLL, MUC-6, MUC-7 and ACE named entity corpora. The basic output tags are "PERSON", "LOCATION" and "ORGANIZATION" [83].

spaCy is implemented in Python; little detailed information is published about its underlying model. Its output tags include "PERSON", "LOC", "ORG", "GPE", etc. [83].
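
As an illustration of how such a tool is typically called, a minimal spaCy example follows; it assumes the small English model en_core_web_sm has been downloaded, and the predicted labels shown in the comment are only indicative.

```python
import spacy

# Requires a one-off download: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Mr Brown is living in California, and he is working at Amazon.")
for ent in doc.ents:
    print(ent.text, ent.label_)   # typically: Brown PERSON, California GPE, Amazon ORG
```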

LingPipe is implemented in Java and supports rule-based NER, supervised training of a statistical model, and more direct methods such as dictionary matching. Its NER model was trained on the MUC-6 corpus. It is relatively slow but comparatively accurate. The output entity types are PERSON, LOCATION and ORGANISATION [83].

The Natural Language Toolkit (NLTK) [23] is a Python NLP toolkit heavily used in the research community. NLTK's NER is based on a supervised machine learning algorithm (a Maximum Entropy classifier) trained on the ACE corpus. The output entities include PERSON, LOCATION and ORGANISATION [83].
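
A hedged sketch of NLTK's chunk-based NER follows; the resource names reflect common NLTK releases (newer versions may ship renamed data packages), and the detected chunks are only indicative.

```python
import nltk

# One-off downloads for the tokeniser, POS tagger and NE chunker models.
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)

sentence = "Mr Brown is living in California, and he is working at Amazon."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
for subtree in tree.subtrees(lambda t: t.label() != "S"):
    print(subtree.label(), " ".join(token for token, _ in subtree.leaves()))
```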

OpenNLP is a machine learning-based toolkit (written in Java) for processing natural language text. It supports the most common NLP tasks, including named entity recognition (NER) [16].

FINET [41] is a fine-grained NER (FGNER) system handling short texts (such as tweets or single sentences). FINET generates candidate types (explicitly and implicitly mentioned types) using a sequence of extractors and selects the most appropriate one using word-sense disambiguation approaches. FINET is an unsupervised system relying on WordNet and another knowledge base to generate training data.
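
To give a feel for the kind of WordNet-based word-sense disambiguation FINET builds on (this is not FINET's own algorithm), the simplified Lesk implementation shipped with NLTK can be used as follows; the sense returned depends on the installed WordNet data.

```python
import nltk
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)   # WordNet data used by the Lesk algorithm

context = "He would like to visit the Amazon rainforest".split()
sense = lesk(context, "amazon")        # selects the WordNet synset that best overlaps the context
print(sense, "-", sense.definition() if sense else "no sense found")
```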

7.3 NED tools

In this section, we present the NED tools through Gerbil, a benchmark that gathers different NED tools in order to compare them.

Gerbil [182] is a framework dedicated to evaluating the datasets and tools proposed for semantic entity annotation (including NER, NED and EL). The aim of Gerbil is to provide an easy way to compare the results of the different state-of-the-art EL approaches. Gerbil relies on the framework proposed by Cornolti et al. [38] and offers six kinds of experiments: (1) D2KB (mapping a set of given entity mentions to a knowledge base), (2) A2KB (an extension of D2KB in which the mentions must also be extracted from the text), (3) Sa2KB (an extension integrating the score of each mention during the evaluation process), (4) C2KB (detecting the entities relevant to a given document), (5) Sc2KB (an extension of C2KB where a scored subset of entities is returned), and (6) Rc2KB (an extension of C2KB returning a ranked list of resources from the entity set).
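
To make the experiment types more concrete, the sketch below computes micro precision, recall and F1 for a D2KB-style setting (gold mentions are given and the system returns a KB URI, or nothing, per mention). It captures the general idea only; Gerbil's actual matching rules (e.g. the handling of emerging entities) are more involved, and the example annotations are invented.

```python
def micro_prf(gold, predicted):
    """gold, predicted: dicts mapping (doc_id, start, end) -> KB URI (or None)."""
    tp = sum(1 for m, uri in predicted.items() if uri is not None and gold.get(m) == uri)
    fp = sum(1 for m, uri in predicted.items() if uri is not None and gold.get(m) != uri)
    fn = sum(1 for m, uri in gold.items() if uri is not None and predicted.get(m) != uri)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = {("d1", 3, 8): "dbr:Chris_Brown", ("d1", 40, 46): "dbr:Amazon_(company)"}
pred = {("d1", 3, 8): "dbr:Chris_Brown", ("d1", 40, 46): "dbr:Amazon_rainforest"}
print(micro_prf(gold, pred))   # -> (0.5, 0.5, 0.5)
```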

Gerbil includes 9 NER/NED systems: (1) Wikipedia Miner [114], based on prior probabilities, context relatedness and quality, combined with classification, (2) DBpedia Spotlight [113], using DBpedia and based on a vector representation with cosine similarity, (3) TagMe 2 [54], which recognises named entities by using link texts from Wikipedia (for disambiguation, it uses a link graph), (4) NERD-ML [183], based on machine learning classification and on a CRF in order to extract and disambiguate entities, (5) KEA NER/NED [168] (based on a predetermined context, an n-gram analysis and a lookup of all potential DBpedia candidate entities for each n-gram), (6) WAT [141] (a set of graph-based algorithms and SVM linear models for collective entity linking), (7) AGDISTIS [181] (based on string similarity measures, a set of heuristics for handling co-references and the graph-based HITS algorithm), (8) Babelfy [116] (based on random walks and a sub-graph algorithm for multilingual disambiguation using BabelNet [127]), and (9) Dexter [31] (an open-source entity disambiguation framework implementing several state-of-the-art disambiguation methods).
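
Several of these systems expose public web endpoints. As an example, the snippet below queries the public DBpedia Spotlight /annotate endpoint; the endpoint URL, parameters and JSON field names follow the publicly documented API at the time of writing and may change, so treat this as a sketch rather than a guaranteed interface.

```python
import requests

# Public DBpedia Spotlight endpoint; a self-hosted instance exposes the same /annotate path.
resp = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": "Mr Brown is working at Amazon.", "confidence": 0.5},
    headers={"Accept": "application/json"},
    timeout=10,
)
for resource in resp.json().get("Resources", []):
    print(resource["@surfaceForm"], "->", resource["@URI"])
```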

Gerbil integrates all the datasets used to train and evaluate the aforementioned systems (i.e. those of Wikipedia Miner, DBpedia Spotlight, TagMe 2, NERD-ML, WAT, AGDISTIS and Babelfy). In addition to the systems' datasets, Gerbil includes ACE2004, IITB (containing 103 documents and 11,250 entities), Meij [166] (containing 502 tweets and 814 entities), MSNBC, and N³ Reuters-128 and N³ RSS-500 [148] (respectively containing 128 news documents/880 entities and 500 RSS feeds/1,000 entities).

Table 8 NER and NED tools

7.4 Ontology

YAGO [51] is an extensible ontology derived from Wikipedia, WordNet and GeoNames, covering both entities and relations. In its original version, YAGO contains more than 1 million entities (persons, organisations, cities, etc.) and more than 5 million facts about these entities (was born, wrote music for, etc.). The facts were automatically extracted from Wikipedia and unified with WordNet, using a combination of rule-based and heuristic methods. The resulting knowledge base can be seen as an improved WordNet, enriched with knowledge about individuals such as persons, organisations and products, together with their semantic relationships. Different versions of YAGO have been proposed: YAGO, YAGO2 and YAGO3.

DBpedia [14] is a multilingual knowledge base constructed by extracting structured information from Wikipedia, such as categorisation information, links to external web pages and geo-coordinates. The English version of the DBpedia knowledge base currently contains about 4,233,000 entities.
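
Knowledge bases such as DBpedia are typically queried through SPARQL when building or checking entity links. The sketch below uses the third-party SPARQLWrapper package against the public DBpedia endpoint; the endpoint availability, the example resource and the result limit are assumptions made for illustration.

```python
from SPARQLWrapper import SPARQLWrapper, JSON   # third-party package: pip install sparqlwrapper

# Retrieve the rdf:type statements of one DBpedia resource from the public endpoint.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?type
    WHERE { <http://dbpedia.org/resource/Amazon_(company)> a ?type }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["type"]["value"])
```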

Freebase [26] is a large online knowledge base created collaboratively by its community members. Freebase has a friendly interface allowing all users to interact with it. It contains data extracted from many sources, such as Wikipedia, and includes more than 43 million entities and 2.4 billion facts [157].

MedMentions [123] is a large dataset that identifies entity mentions in PubMed abstracts and links them to specific UMLS concepts. In total, 246,000 UMLS entities were manually annotated using 3,704 mentions extracted from PubMed abstracts. The average depth of a concept in the hierarchy is 14.4, and the maximum depth is 43.

8 Synthesis and analysis

8.1 Review metrics

To sum up, we analysed 177 works in the presented survey. Ninety-seven works (including 42 research works and 56 resource descriptions) are described in detail. Of the 42 research papers, 5 are surveys and 37 are solution works. Of these 37 research works, 24 focus on English and 13 on other languages (most of the time including English as well, e.g. German and Spanish). Of the 24 works on English, 18 focus on NER while 6 focus on NED. Of the 13 multilingual works, 7 focus on NER and 6 on NED. Hence, from the total number of the presented works (33 works), 20 works are on NER while 13 are on NED. Almost all studied works are recent (published up to 2023), and 35 of them are from 2015 onwards.

Concerning the resources, of the 56 resources described, 31 are corpora and 25 are APIs and tools. Of the 31 corpora, 22 focus on NER and 9 on NED. Of the 25 APIs and tools, 6 focus on NER and 19 focus on NED. Of the 22 corpora dedicated to NER, 16 are dedicated to English, 2 are multilingual (including English), and 4 are dedicated to other languages such as German, Spanish and Arabic. Of the 9 corpora dedicated to NED, 6 are in English, 2 are multilingual (including English), and the last one is dedicated to German. The APIs and tools are language-independent: they are usually trained on corpora in English or German, but users can train their models on a language of their choice.

Finally, we observe that the new tendency for extracting named entities and disambiguating them is to use neural networks. Of the 33 studied works, 16 use neural networks; 10 of these 16 address English and 6 are multilingual. The other works use standard machine learning algorithms such as SVM and CRF (the latter being the most widely used). Among the neural network approaches, almost all the works are based on CNN, LSTM or Bi-LSTM architectures. Other works rely on existing tools such as CoreNLP, OpenNLP and Gerbil.

8.2 Analytical synthesis

This section aims to answer the research questions presented in Sect. 4. Answering these questions allows us to conclude this survey by giving a general picture of the current state of research related to EL, NER and NED, and it also highlights open issues in the field of EL that require further investigation.

R1: After analysing the presented works, we conclude that the majority of them aim to extract/recognise entities only (the NER task). Disambiguation is only proposed by the most recent studies (the majority of them from 2018). Also, only a few systems can be considered to perform entity linking (also called end-to-end systems), such as those of Cao et al. [29] or Bhatia et al. [22]; an EL system is a system providing both NER and NED. Almost all the proposed systems focus on one task only, NED systems typically starting from a set of predefined named entities. Concerning the algorithms used, the tendency is to rely on neural networks (NNs). The originality of each work lies in the architecture used (word level, character level, or word + character level) and the type of NN used at each level (such as CNN, LSTM or Bi-LSTM). We also observed that a CRF is usually combined with the NN architecture, either as a feature extractor or as a layer replacing the softmax function for generating the labels.
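
To illustrate the "CRF as an output layer" pattern mentioned above, here is a hedged PyTorch sketch that places a CRF on top of a Bi-LSTM encoder. It relies on the third-party pytorch-crf package; the vocabulary size, tag set and dimensions are placeholders and do not reproduce any surveyed system.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # third-party package: pip install pytorch-crf

class BiLSTMCRFTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden, num_tags)   # per-token tag scores
        self.crf = CRF(num_tags, batch_first=True)          # replaces a softmax output layer

    def _features(self, tokens):
        return self.emissions(self.lstm(self.embedding(tokens))[0])

    def loss(self, tokens, tags, mask):
        return -self.crf(self._features(tokens), tags, mask=mask)   # negative log-likelihood

    def decode(self, tokens, mask):
        return self.crf.decode(self._features(tokens), mask=mask)   # best tag sequence per sentence

model = BiLSTMCRFTagger(vocab_size=10000, num_tags=9)   # e.g. BIO tags over PER/LOC/ORG/MISC
tokens = torch.randint(0, 10000, (2, 7))                # toy batch of two 7-token sentences
mask = torch.ones(2, 7, dtype=torch.bool)
print(model.decode(tokens, mask))
```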

R2: It can be seen that almost all the constructed resources are publicly available. Some of them (such as MUC-6) are published under the non-free LDC licence, but the majority are free for research purposes. This is the main reason why almost all the recent works rely on publicly available corpora, and only a few recent research works construct their own. Among the freely available corpora, almost all the works presented in the research literature focus on the CoNLL corpora (including all their variants: CoNLL2003 for English, CoNLL2002 for Spanish and Dutch, and CoNLL2003 for German) for NER, and on CoNLL/AIDA for NED. The main problems with these corpora are their limited size (only 1,393 articles for CoNLL2003) and their limited domain coverage: they were extracted from newspapers, so they transfer poorly to medical or technical domains. Also, these corpora include only 4 entity types (PER, ORG, LOC and MISC), limiting them to coarse-grained NER.

R3: There are two types of approaches to constructing annotated corpora: manual and automatic. Almost all the works on coarse-grained recognition use corpora that were constructed manually. Automatic construction is the most appropriate for fine-grained recognition, where the number of recognised entity types can reach 118. Manual construction is time- and effort-consuming, but it produces corpora that give more accurate results. Automatically constructed corpora can cover more entity types and more domains with less effort, but they suffer from a lack of precision. Almost all the automatically constructed corpora rely on ontologies such as Wikipedia or YAGO. Finally, a lack of real-life scenarios can also be observed: almost all the research using these corpora evaluates them intrinsically, where the training and test sets are different parts of the same corpus, whereas in practice the trained tools have to be applied to data outside the corpus.
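
As a minimal illustration of the automatic (silver-standard) construction discussed here, the sketch below labels tokens by matching them against a small gazetteer derived from a knowledge base; the gazetteer and the naive matching are assumptions, and real pipelines add link statistics and heuristics to reduce exactly the noise described above.

```python
# Silver-standard annotation by gazetteer matching: the source of the noise discussed
# above ("Amazon" is labelled ORG even when it denotes the rainforest). Gazetteer invented.
gazetteer = {"amazon": "ORG", "california": "LOC", "brown": "PER"}

def silver_annotate(sentence):
    labels = []
    for token in sentence.split():
        key = token.strip(".,").lower()
        labels.append((token, gazetteer.get(key, "O")))
    return labels

print(silver_annotate("He would like to visit the Amazon rainforest ."))
```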

R4: It can be concluded that the results reported for NER are more promising than those reported for NED. This is understandable, since NER consists only of extracting the different entities without disambiguating them. Also, the corpora used for NER are less challenging than those used for NED, where a mention can correspond to different entities. The results for coarse-grained recognition are much better than those for fine-grained recognition, which is equally understandable: extracting four entity types is less challenging than extracting 118. The approaches relying on neural networks give promising results compared to the works using classical machine learning approaches, and these results improve further when the NN models are combined with word embeddings for feature extraction. However, almost all the presented works compare themselves to at most three other works, and in most cases the comparison is not made using a benchmark, which could compare the approach to many other systems on many corpora.

R5: Two main tendencies emerge from the multilingual approaches: works that propose language-independent approaches, and works that create parallel corpora across all the studied languages. For constructing these corpora, the presented research usually relies on ontologies. These corpora represent new resources, but the real added value lies in the language-independent approaches, where the proposed system can be applied to different languages. We also conclude that almost all the language-independent systems work at the character level using NNs. However, the experiments show that in almost all cases, the results obtained on English are better than those obtained on other languages.

R6: By answering all the previous research questions, we have provided a general overview of the research works recently proposed for EL/NER/NED. However, each of the presented answers also raised a set of open issues. From R1, we conclude that end-to-end EL systems are rare: the majority of the works focus on NER or NED, but not both. Also, as almost all the systems focus on coarse-grained NER, more work is needed on fine-grained NER. A stronger focus on these kinds of systems would undoubtedly improve the quality of the proposed systems. From R2, we conclude that the research literature should focus more on the construction of annotated corpora rather than reusing the same ones for all the proposed studies. Given the lack of resources dedicated to fine-grained NER, constructing more such resources is undoubtedly an open issue. Furthermore, considering the heterogeneity of the corpora used for training and testing would certainly produce more realistic results. It is also essential to focus more on unstructured data coming from social media: the results on W-NUT (corpora constructed from Twitter) are low compared to the results obtained on CoNLL (corpora constructed from newspapers). From R3, we conclude that both manual and automatic resource construction present advantages and disadvantages. Some research works on semi-automatic construction integrating deep active learning are ongoing. Focusing more on semi-automatic construction, where the corpus is annotated automatically and reviewed manually, is an open line of work which could resolve the disadvantages of both techniques: it would be less time- and effort-consuming since the corpus is first constructed automatically, and more precise because the corpus is then reviewed manually.

The results of R4 confirm our previous assumptions. However, another important aspect is highlighted by this answer: the lack of reliable comparisons between results. Some recent works rely on benchmarks (such as Gerbil) to provide a fair comparison between the proposed approach and the results presented in the research literature, but more works should do so. Another related open issue is the lack of benchmarks: Gerbil provides a reliable comparison framework, but it includes only a few systems. More systems can be added to Gerbil, but a system must expose a web API in order to be integrated. Finally, from R5, we conclude that even when proposing language-independent systems, the research literature should pay more attention to the characteristics of each studied language, since the results on English remain better than those on other languages.

9 Comparison with the other surveys presented in the research literature

From 2007 to 2019, five other surveys related to entity linking were proposed. However, only one of them focused on both NER and NED [17]; all the others were dedicated to NER. The present survey covers the works, resources and tools related to entity linking by handling both NER and NED. Some statistics comparing our survey to those proposed in the research literature are presented in Table 9. From this table, we first observe that the most recent survey was proposed in 2019. However, only 23 works published since 2015 were described in that survey, which focuses on neural network models proposed for NER. Also, 15 datasets were presented without being described in detail: no information about the size, the construction technique or the location of the presented datasets was given.

Concerning the number of described works, almost all the surveys presented more works than ours: Nadeau et al. [124] described 60 works, Goyal et al. [63] described 48 and Patil et al. [136] described 43. However, almost all the works described in these surveys are not recent; no work after 2015 was presented by Nadeau et al. [124] or Patil et al. [136]. The main aim of our survey is to cover the most recent works proposed for entity linking. As almost all the works presented before 2015 were covered in the previous surveys, we saw no reason to survey them again. Mainly for this reason, we described and detailed only 37 research works, 35 (95%) of which were published after 2015. In addition to the works that were detailed and classified, we also present some comparisons with works presented before 2015.

The most valuable resources behind any natural language processing problem (including NER and NED) are the datasets, tools and APIs. However, Table 9 shows that the previous surveys described only a few resources. Yadav et al. [188] cite a set of corpora without providing any details, and Goyal et al. [63] present a table associating each work with the dataset it uses; however, the latter only give some statistics about the datasets, without detailing their construction approaches or even the classes the datasets deal with (person, location, etc.). The previous surveys also neglected the tools and APIs: only 11 tools were presented by Balog et al. [17] and only 6 by Patil et al. [136]. In contrast, our work gives particular attention to the available resources (tools, APIs and datasets): we described, detailed and classified 55 resources (25 tools/APIs and 30 datasets). In addition to describing and classifying these resources, we also provide the location of each resource (as a web link).

Table 9 Comparison with other surveys

Like almost all the other surveys, we classify the works presented in the research literature by distinguishing those dedicated exclusively to English from those covering other languages. For both NER and NED, we present the works on English and the multilingual works (focusing on more than one language). We also report in Table 7 the language of each constructed dataset to highlight the resources constructed in languages other than English. Finally, it can be seen from Table 9 that, except for Yadav et al. [188], we are the only survey giving particular attention to the works using neural networks. In contrast to Yadav et al., we focus on the most recent works on NER and NED, which leads us to the new tendency of using neural network models.

10 Conclusion

The role of this survey was to present and classify the most recent studies on EL. As highlighted from the beginning, EL is the task of recognising entities mentioned in text and linking them to the corresponding entries in a knowledge repository [17]. Hence, we mainly focused on the papers proposed for both NER and NED (43 research papers and 55 resources, including corpora and tools). Studies focusing on English as well as those proposed for other languages were considered, with an emphasis on the most recent work: 95% of the presented papers were published after 2015.

After analysing the studied papers, we concluded that the majority of the works focus on one task only, either NER or NED, and only a few works present the whole pipeline leading to EL. Among the works focusing on NER, the coarse-grained approach dominates over fine-grained NER, with the majority of the studies covering only four entity types (PER, ORG, LOC and MISC). This is mainly due to the scarcity of datasets dedicated to fine-grained NER. The most used corpora are the CoNLL corpora (including CoNLL2003 for English and German and CoNLL2002 for Spanish and Dutch), which focus on the four entity types mentioned earlier. Also, coarse-grained NER returns more accurate results than fine-grained NER, mainly due to the way the corpora are constructed: almost all the corpora used for coarse-grained NER were constructed manually, whereas almost all the corpora used for fine-grained NER were constructed automatically by relying on ontologies such as Wikipedia or YAGO. Due to the lack of diversity in the constructed resources, only a few studies are multilingual, while the majority of the papers focus on English.

Moreover, it was also concluded that the systems proposed for NER are much more promising than those presented for NED, since disambiguating entities is much more complex than recognising them. However, the lack of reliable benchmarks is the most important obstacle to a fair comparison among the different proposed studies. Almost all the presented works compare themselves to at most three other works of their choice (typically those presenting the most similar approaches). This is acceptable for the CoNLL corpora, as the majority of the studies use them for result comparison. However, when a corpus is used by only a few studies, it is difficult to draw a strong and fair comparison. More benchmarks such as Gerbil should be constructed to provide reliable comparisons among frameworks.

Finally, we observed that the latest tendency regarding entity linking is to apply it to medical data, a domain where events such as the COVID-19 pandemic trigger sudden increases in the number of publications; for instance, PubMed added 952,919 new citations in 2020 alone. In this context, new word embedding models (such as BioBERT [97], PubMedBERT [64] and SciBERT [18]) are pre-trained on large biomedical corpora through unsupervised tasks and then fine-tuned [65, 66] for specific tasks, including EL. These models achieve state-of-the-art performance on several EL benchmarks [150].