Citation recommendation: approaches and datasets

Citation recommendation describes the task of recommending citations for a given text. Due to the overload of published scientific works in recent years on the one hand, and the need to cite the most appropriate publications when writing scientific texts on the other hand, citation recommendation has emerged as an important research topic. In recent years, several approaches and evaluation data sets have been presented. However, to the best of our knowledge, no literature survey has been conducted explicitly on citation recommendation. In this article, we give a thorough introduction to automatic citation recommendation research. We then present an overview of the approaches and data sets for citation recommendation and identify differences and commonalities using various dimensions. Last but not least, we shed light on the evaluation methods and outline general challenges in the evaluation and how to meet them. We restrict ourselves to citation recommendation for scientific publications, as this document type has been studied the most in this area. However, many of the observations and discussions included in this survey are also applicable to other types of text, such as news articles and encyclopedic articles.

thor need to be backed up in order to ensure transparency, reliability, and truthfulness. Secondly, mentions of methods and data sets and further important domain-specific concepts need to be linked via references in order to help the reader to properly understand the text and to give attribution to the corresponding publications and authors (see Table 1). However, citing properly has become increasingly difficult due to the dramatically increasing number of scientific publications published each year [24,155,54] (see also Fig. 1). For instance, in the computer science domain alone, more than 100,000 new papers are published every year and three times more papers were published in 2010 than in 2000 [88]. A similar trend can be observed in other disciplines [90]. For instance, in the medical digital library database PubMed, the number of publications in 2014 (514k) was more than triple the amount published in 1990 (137k) and more than 100 times the amount published in 1950 (4k) [25]. Despite this phenomenon of information overload in science in the form of a "tsunami of publications," citing appropriate publications remains a necessity for scientific writing.
As a consequence, approaches for citation recommendation have been developed. Citation recommendation refers to the task of recommending appropriate citations for a text passage within a document. For instance, given the phrase "is similar to sequence-tosequence learning" in a document, a citation recommendation system might insert a citation marker (e.g., " [1]") after this textual phrase, and might add a reference to a publication in the document's reference section. The added reference thereby needs to fit semantically to the context in the citing document and may be required to meet further constraints (e.g., concerning its recency).  Table 1 Examples for in-text citations from [105].

Citation type Example sentence
concept "To this end, SWRL [14] extends OWL-DL and OWL-Lite with Horn clauses." claim "In the traditional hypertext Web, browsing and searching are often seen as the two dominant modes of interaction (Olston & Chi, 2003)." author "Gibson et al. [12] used hyperlink for identifying communities." Note that citation recommendation differs from paper recommendation [14,135]: paper recommendation aims to recommend documents to the user that are worthwhile to read and to investigate (particularly, in the context of a research topic). To that end, one or several papers [157,62,4,131] or the user's already clicked/ bookmarked/written documents [7,91] can, for instance, be used for the recommendation. We can refer to [14,10] for surveys on paper recommendation. Citation recommendation, by contrast, assists the user in substantiating a given text passage (e.g., written claim or scientific concept) within an input document by recommending publications that can be used as citations. The textual phrase to be backed up can vary in length -from one word up to a paragraph -and is called citation context. In some cases [67,99], the citation context needs to be discovered before the actual citation recommendation. While some existing works consider citation recommendation as a task of extending the set of known references for a given paper [78,79,60], we consider citation recommendation purely as a task for substantiating claims and concepts in the citation context. This makes citation recommendation context-aware and very challenging, because the concept of relevance is much stricter than in ad hoc retrieval [137]. Consequently, citation recommendation approaches have been proposed using additional information besides the citation con-text for the recommendation, such as the author's name of the input document [44]. Evaluating a citation recommendation approach requires to verify if the recommended paper is relevant as citation for a given citation context. For scalability reasons, usually the citations in existing papers and their citation contexts are used as ground truth (see Sec. 5.1).
Existing surveys focus only on related research areas of citation recommendation, but not explicitly on citation recommendation itself. Among the most closely related studies are the surveys on paper recommendation [14,10]. In these articles, the authors do not consider recommender systems for given citation contexts. Several surveys on other aspects of citation contexts have also been published. Alvarez et al. [6] summarize and discuss works on the identification of citation contexts, on the classification of each citation's role (called citation function), and on the classification of each citation's "sentiment" (called citation polarity). Ding et al. [40] focus on the content-based analyses of citation contexts, while White [156] considers primarily the classification of citations into classes, the topics covered by citation contexts, and the motivation of citing. Moreover, distantly related to this survey, surveys on the analysis of citing behavior [23,143] and surveys on works about the analysis of citation networks exist, for instance, for the purpose of creating better measurements of the scientific impact of researchers or communities [151]. Dedicated approaches and data sets for citation recommendation are not covered in all those works, nor is there any analysis of citation recommendation evaluations and evaluation challenges. This makes it necessary to consider citation recommendation separately and to use task-specific dimensions for comparing the approaches.
We make the following contributions in this survey: 1. We describe the process of citation recommendation, the scenarios in which it can be applied, as well as the advantages it has in general. 2. We systematically compare citation recommendation to related tasks and research topics. 3. We outline the different approaches to citation recommendation published so far and compare them by means of specifically introduced dimensions. 4. We give an overview of evaluation data sets and further working data sets for citation recommendation and show their limitations. 5. We shed light on the evaluation methods used so far for citation recommendation, we point out the challenges of evaluating citation recommendation approaches, and present guidelines for improving citation recommendation evaluations in the future.
Several reader groups can benefit from this survey: non-experts can obtain an overview of citation recommendation; the community of citation recommendation researchers can use the survey as the basis for discussions of critical points in approaches and evaluations, as well as for getting suggestions for future research directions (e.g., research topic suggestions for PhD candidates); and finally, the survey can assist developers in choosing among the available approaches or data sets.
The rest of this article is structured as follows: in Section 2, we introduce the field of citation recommendation to the reader. In Section 3, we describe how we collected publications presenting citation recommendation approaches. We propose classification dimensions and compare the approaches by these dimensions. In Section 4, we give an overview of evaluation data sets and compare the data sets by corresponding dimensions. Section 5 gives a systematic overview of the evaluation methods that have been applied so far and of the challenges that emerge when evaluating citation recommendation approaches. Section 6 is dedicated to potential future work. The survey closes in Section 7 with a summary.

Terminology
In the following, we define some important concepts of citation recommendation, which we use throughout the article. In order to have a generic task formalization, as we prefer, we do not restrict ourselves to scientific papers as a document type, but consider text documents in general.
The basic concept of citing is depicted in Fig. 2. A citation is defined as a link between a citing document and a cited document at a specific location in the citing document. This location is called the citation marker (e.g., " [1]") and the text fragment which should be supported by the citation is called the citation context. The citation context can be transformed into an abstract representation, such as an embedding vector [44,20] or a translation model [70,72]. This enables us to more accurately match the information in the citation context with the information provided in the "citable" documents (also called candidate cited documents).
"References" and "citations" are often used interchangeably in the literature. However, we name in-text references, given by citation markers, citations. References, in contrast, are listed in the reference section of the citing document and describe links to other documents on a document level without context.
In the academic field, both the citing documents and the cited documents are usually scientific papers. We use the terms paper, publication, and work interchangeably in this article. The authors of scientific papers are usually researchers. We then use researcher and scientist interchangeably. Researchers who use a citation recommendation system become users.
Citing documents and cited documents consist of content and metadata. In the case of scientific papers, the paper's metadata typically consists of the title, the author information, an abstract, and other information, such as the venue in which the paper has been published.
Different citation context lengths can be used for citation recommendation. If only a fragment of an input text document is used as citation context (e.g., a sentence [65,70] or a window of 50 words), we call it local citation recommendation or context-aware citation recommendation. If no specific citation context, but instead the whole input text document or the document's abstract is used for the recommendation (see, e.g., [137,113,144,85]), we call it global citation recommendation or non-context-aware citation recommendation (following He et al. [68]). In the following sections, we will primarily focus on local citation recommendation, since only this variant targets the recommendation of papers for backing up single concepts and claims in a text fragment (i.e., assists the user in the actual citing process) and has not been addressed in other surveys, to the best of our knowledge. 1 1 It should be noted that it is also possible to design global context-aware citation recommendation approaches, i.e., approaches which recommend citations for specific contexts (e.g., sentences) but which take the whole paper into account (e.g., to ensure an even greater understanding of the context or to diversify the recommendations). However, we are not aware of any such approach being published (see also Section 6 for potential future work).

Scenarios, Advantages, and Caveats of Citation Recommendation
In the "traditional" process of finding appropriate citations, the researcher needs to come up with candidate publications for citing on her own. The candidate papers that can be cited are either already known by her, are contained in a given document collection, or first need to be discovered. For the last option, the scientist typically uses widely used bibliographic databases, such as Google Scholar, 2 or domain-specific platforms such as DBLP 3 or PubMed. 4 The search for citing material typically requires considerable time and effort as well as skills: the right keywords for querying need to be found, and the top n returned documents need to be manually assessed with regards to their relevance to the citing document and to the specific citation contexts.
The idea of citation recommendation is to enhance the citing process: The user provides the text she has written (with or without initial citations) to the recommender system. This system then presents to the user for specific segments of the input text all publications which were determined automatically as suitable citations. The user can investigate the recommendations in more detail and approve or disapprove them. Following this procedure, the tedious manual, separate search in bibliographic databases and paper collections can be considerably reduced (and maybe even skipped). The user does not need to think of meaningful keywords for searching papers any more. Last but not least, citing may become less dependent on the (often very limited) set of papers known to the current user.
We do not want to hide that citation recommendation can also entail problematic features if applied inadequately. Firstly, if citing becomes purely automated, the role of citations might change (e.g., instead of criticizing, citations might support a statement; see [148,111,147] for citation function schemes). The trust in citations might decrease, since machines (here: recommender systems) might not engender as much trust as experts who have dealt with the topic. We thus argue that a human-in-the-loop is still needed for citation recommendation. Secondly, if the recommendation models are trained on a fixed publication data set, instead of removing citation biases, the recommender systems could introduce additional biases towards specific papers. Therefore, it must be ensured that a sufficiently large number of papers is indexed and that the new papers are indexed periodically. Caveats of citation recommendation are discussed in depth in Sec. 5.
Citation recommendation systems can be designed for several user groups: (1) Expert Setting. In this setting, a researcher is familiar with her research area and is in the process of writing an expert text, such as a scientific publication (e.g., after having developed a novel approach or for conducting a survey in her research field). Recommendations of citations can still be beneficial for her, as such a user might still be unaware of publications in their field in the light of the "tsunami of publications" common in all scientific fields nowadays [24,155,54]. Citation recommendation systems might come up with recommendations which were not in the focus of the researcher if she cited in the traditional way, since the system might be able to bridge language barriers [83,146] and also find publications which use synonyms or otherwise related concepts.
(2) Non-Expert Setting. We can think of several non-expert user types for which citation recommendation can be beneficial: -A researcher needs to write a scientific text on a topic that is outside of her core research area and expertise (e.g., generic research proposals [19] and potential future work descriptions). -A journalist in the science domain -e.g., authoring texts for a popular science magazine -needs to write an article on a certain scientific topic [123,118]. We can assume that the journalist typically is not an expert on the topic she needs to write about. Having citations in the text helps to substantiate the written facts and make the text more complete and understandable. -"Newcomers" in science, such as Masters students and PhD students in their early years, are confronted with the vast amount of citable publications and typically do not know any or all of the relevant literature in the research field yet [67,166]. Getting papers recommended helps not only students in writing systematic and scientific texts, such as research project proposals (exposés), but also their mentors (e.g., professors).
In all these non-expert settings, the relevance of the recommended publications is presumably not so much determined by the timeliness of the publications, as in the expert setting, but instead more by the general importance and prominence of the publications. Thus, the relevance function for finding the most appropriate citations might vary from setting to setting. Besides the pure topical relevance of recommended citations, also the fit from a social perspective might be essential. In recent decades, the citing behavior of scientists has been studied extensively in order to find good measurements for the scientific impact of scientists and their publications [23]. In this context, several biases in citing have been considered. Most notably, the hypothesis has been made that researchers tend to cite publications which they have written themselves or which have been written by colleagues [74]. Another hypothesis is that very prominent and highly cited works get additional citations only due to their prominence and visibility in the community (see, e.g., [156]). Citation recommendation systems can help in reducing biases by recommending citations which are the best fit for the author, the citation context with its argumentation, and the community. 5 Section 5.2 discusses citing bias in the context of citation recommendation in detail.
Overall, we can summarize the benefits of citation recommendation as follows: 1. Finding suitable citations should become more effective. This is because the match between the query (citation context) and the citable documents is more sophisticated than via manual matching (e.g., also considering synonyms, related topics, etc.). Furthermore, the recommender system typically covers a much larger collection of known publications than the set of documents known to the user. 2. Researchers are more (time-)efficient during the process of citing, as the number and extent of manual investigations (using bibliographic databases or own document collections) are reduced, and because recommendations are returned immediately. 3. The search for publications which can be cited becomes easier and more user-friendly ("citing for everyone"). As a consequence, citing is no longer just a "privilege" for experts, but potentially something for almost anyone. 4. By establishing a formal relevance function dealing with the issue of which papers are cited and what characteristics they have, the process is no longer left to chance. Hence, biases in citing behavior can be minimized. 5. Ideally, citation recommendation systems only recommend citations for valid statements and existing concepts, while unexaminable statements are not cited. Hence, citation recommendation implies an implicit fact checking process by showing sources to the user which support the written statements. 6. Advanced citation recommendation systems can, in addition, search for suitable, cite-worthy publications in other languages than the citing document (cross-linguality). They can also recommend publications under the special consideration of topic evo- 5 However, please also note the caveats of citation recommendation as outlined above and in Sec. 5.3. lution over time, of current buzzwords, or in a personalized way, by incorporating user profiles.

Task Definition
In the following, we define local citation recommendation. By considering the whole document, abstract, or title as citation context, this definition can also serve as definition for global citation recommendation. The general architecture of a context-aware citation recommendation system is depicted in Fig. 3. State-of-theart citation recommendation approaches are supervised learning approaches. Thus, we can distinguish between an offline step, in which a recommendation model is learned based on a collection of documents, and an online step, in which the recommendation model is applied to a new incoming text document. Note, however, that unsupervised learning approaches and rule-based approaches are also possible (although, to date, to the best of our knowledge, none such have been proposed). In that case, the learning phase in the offline step is eliminated and a given model (e.g., set of rules) can be directly applied (see Fig. 3).
In the following, we give an overview of the steps in case of supervised learning (using the symbols summarized in Table 2). Note that existing citation recommendation approaches use, to the best of our knowledge, content-based filtering techniques and are not based on other recommendation techniques, such as collaborative filtering or hybrid models. It is therefore not surprising that the approaches are mostly not personalized 6 (i.e., not incorporating user profiles). Hence, our task formalization does not consider personalization.

Offline Step
Input Input is a set of documents D = {d 1 , ..., d n }, which we call in the following the citing documents, with citations and references. 7 Processing The processing of the input texts consists of the following steps: (1) Reference Extraction. All references from the reference sections of all citing documents are extracted and stored in a global index R.
(2) Citation Context Extraction & Representation. First, all citation contexts c ij ∈ C i from each (3) Model Learning. Given the output of the previous steps (the citing documents D, the cited documents R, and the abstract citation contexts Z), we can learn a mapping function f which maps each citation context representation z ij and its citing document d i to a reference (cited document) r m ∈ R as given by the training data: Note that some approaches to citation recommendation might not use any other information from the citing documents besides the citation contexts, eliminating thus d i as argument in the mapping function. In those cases, only the representation of the citation context z ij is decisive (e.g., representation of a concept).
The mapping function f and the whole task can be formulated as a binary classification task (as also presented in [127]), especially in order to employ statistical models. Then, each citable document r m is considered as a class and the task is to determine if (z ij , d i ) should be in class r m : [0, 1] is the probability of citing r m given z ij and d i . As mentioned above, d i might be optional for some approaches. In reality, g is often learned based on machine learning. However, one can also think of other ways to create g (e.g., rule-based approaches).
Output Output is the function g, given the abstract citation contexts Z, the citing documents D, and the cited documents R.

Online
Step Input Input is a text document d without citations and references (or only a few ones).
Processing Processing the document d consists of the following steps: (1) Reference Extraction (optional): If d already contains citations and a reference section, the references R d from d can be extracted and the corresponding representations can be retrieved from the database of cited papers R. These representations can be utilized for improving the citation recommendation within Model Application, e.g., for a better topical coherence among existing and recommended citations [87].
(2) Citation Context Extraction & Representation: First, if the existing citations in document d are to be used, the task is to extract and represent them in the same way as in the Offline Step. Then, all potential citation contexts c d k ∈ C d -i.e., contexts in d, which are judged as suitable for having a citationare extracted from d and transformed into the same abstract representation form z d k as used in the Model Learning: ∀c d k ∈ C d : c d k → z d k . Note that, sometimes, an additional filtering step filters out all potential citation contexts which are not worth considering.
(3) Model Application: Here, the mapping function g, learned during the training, is applied on the potential citation context representations z d k of document d for recommending citations: R is thereby the global index of "citable" papers (gathered during the offline step). R z d k is the set of recommended cited papers. These papers were classified as cited with a likelihood of at least θ.
(4) Text Enrichment Given the document d and the set of recommendations R z d k for each citation context representation z d k , the running text of document d gets enriched by the recommended citations and the reference section of d gets enriched by the corresponding references.  Output Output is the annotated document d .

Non-Scholarly Citation Recommendation
Also, outside academia, there is a demand for citing written knowledge. We can mention three kinds of documents, which often appear in such scenarios as citing documents: encyclopedic articles, news articles, and patents. Citation recommendation approaches developed for the scholarly field can in principle also be applied to such fields outside academia. Note, however, that each of the use cases might bring additional requirements and challenges. The scholarly domain is characterized by the use of a particular vocabulary, thus making it hard to apply models (e.g., embeddings) that were pretrained on other domains (e.g., news). In contrast, documents in the non-scholarly field, such as news articles, often do not have a (dense) citation network. This might make it harder to build metadata-based representations of the documents and to evaluate the recommender systems, because no co-citation network can be used for the evaluation (see the fuzzy evaluation metrics in Sec. 5.1). In the following, we outline specifically developed approaches for non-scholarly citation recommendation.
Encyclopedic articles as citing documents: The English Wikipedia is nowadays already very rich and quite complete in the number of articles included, but still lacks citations in the range of (at least) hundreds of thousands [76]. This lack of citations diminishes the potential of Wikipedia to be a reliable source of information. Since in Wikipedia mainly news articles are cited [52], several approaches have focused on developing methods for recommending news citations for Wikipedia [106,107,52,51].
News articles as citing documents: Peng et al. [118] approach the task of citation recommendation for news articles. They use a combination of existing implicit and explicit citation context representations as well as 200 preselected candidate articles instead of hundreds of thousands per citation context.
Patents as citing documents: Authors of patents need to reference other patents in order to show the context in which the patent is embedded. Thus, approaches for patent citation recommendation have been proposed [103].

Scholarly Data Recommendation
Scientists are not only confronted with an information overload regarding publications, but also regarding various other items, such as books, venues, and data sets. As a consequence, these items can also be recommended appropriately in order to assist the scientist in her work. Among others, approaches have been developed for recommending books [110], scientific events [86], venues [165] and reviewers [93] for given papers, patents [116], scientific data sets [132], potentially identical texts (by that means identifying plagiarism) [59], and newly published papers, via notifying functions [46].

Related Citation-based Tasks
In the following, we describe some citation-based tasks that are either strongly related to or an integral part of citation recommendation.
Citation network analysis Citation network analysis describes the task of analyzing the references between documents in order to make statements about the scientific landscape and to investigate quantitatively scientific publishing. Among others, citation network analysis has been performed to determine communities of researchers [164,39], to find experts in a domain [64], to know which researchers or publications have been or will become important, and to obtain trends in what is published over time [66]. Note that citation network analysis operates on the document level and generally does not consider the document's contents.
Citation context detection and extraction Each citation is textually embedded in a citation context. The citation context can vary in length, ranging typically from a part of a sentence to many sentences. As shown in several analyses [6,3], precisely determining the borders of the citation context is non-trivial. This is because several citations might appear in the same sentence and because citations can have different roles. While in some cases a claim made by the author needs to be backed up, in other cases a single concept (e.g., method, data set, or other domain-specific entity) needs to be referenced by a corresponding publication [105]. In conclusion, there seems to be no consistent single optimal citation context length [6,126,125]. Different citation context lengths have been used for citation recommendation (see Table 4).
To extract citation contexts and references from papers, specific approaches have been developed [149,150]. These approaches were developed for PDFs with a papertypical layout. They are not only capable of extracting a paper's metadata, such as title, author information, and abstract, in a structured format, but also the references from the reference section, as well as linking the citation markers in the text to the corresponding references.
Given the tools outlined in [150], we can state: (1) All underlying approaches are a rule engine or a conditional random field (except Neural-ParsCit). (2) Several tools (e.g., ParsCit) have the additional feature that they can extract not only the fulltext from the PDF documents, but also a citation context around the found citation markers. (3) Several tools (e.g., ParsCit) require plaintext files as input. Transforming PDF to plaintext is, however, an additional burden and leads to noise in the data. (4) The tools differ considerably in the processing time needed for processing PDF files [11]. ParsCit and GROBID, which have been used most frequently by researchers, to our knowledge, are among the fastest.
Citation context characterization Citations can have different roles, i.e., citations are used for varying purposes. These reasons are also called citation functions. The citation function can be determined -to some degree automatically -by analyzing the citation context and by extracting features [148,111,147]. Similar tasks to the citation function determination are the polarity determination (i.e., if the author speaks in a positive, neutral, or negative way about the cited paper) [1,57] and the determination of the citation importance [152,35].
The general typical structure of publications has been studied and brought into a schema, such as the IM-RaD structure [134], standing for introduction, methods, results, and discussion. In [18], for instance, the authors find out that the average number of citations among the same sections in article texts is invariant across all considered journals, with the introduction and discussion accounting for most of the citations. Furthermore, apparently the age of cited papers varies by section, with references to older papers being found in the methods section and citations to more recent papers in the discussion. Although such insights have not been used for development of citation recommendation approaches yet, we believe that they can be beneficial for better approximating real human citing behavior.
Citation-based document summarization Citation-based document summarization is based on the idea that the citation contexts within the citing papers are written very carefully by the authors and that they reveal noteworthy aspects of the cited papers. Thus, by collecting all citation contexts and grouping them by cited papers, summaries and opinions about the cited papers can be obtained, opening the door for citation-based automatic survey generation and automatic related work section generation [2,45,108].
Citation matching and modeling Citation matching [117] deals with the research challenge of finding identical citations in different documents in order to build a coherent citation network, i.e., a global index of citations for a document collection.
Representing the metadata of both citing and cited papers in a structured way is essential for any citationbased task. Recently, several ontologies, such as FaBiO and CiTO [119], have been proposed for this purpose. Besides the metadata of papers, further relations and concepts can be modeled ontologically in order to facilitate transparency and advances in research [120].

Comparison of Citation Recommendation Approaches
Approaches to (local and global) citation recommendation have been published over the years, using diverse methods, and proposing many variations of the citation recommendation task, such as a recommendation across languages [146] or using specific metadata about the input text [127,44]. However, no overview and comparison of these approaches has been presented in the literature so far. In the following, we give such an overview.

Corpus Creation
Following a similar procedure as in [14], we collect the papers for our comparison as follows: 1. On May 3, 2019, we searched in DBLP for papers containing "citation" and "rec*" in the title. This resulted in a set of 179 papers. We read those papers and manually classified each of them whether they present an approach to (local or global) citation recommendation or not. 2. In a further step, we also investigated all papers referenced by the so-far given relevant papers, and the ones that refer to these so-far given papers, and classify them as relevant or not. 3. To avoid missing any papers, we used Google Scholar as an academic search engine with the query keywords "citation recommendation" and "cite recommend," as well as the Google Scholar profiles from the authors of the so-far relevant papers. Based on that, we added a few more relevant papers to our corpus. 8 Overall, 51 papers propose a novel, either global or local citation recommendation approach (see Table 3). Out of these, 17 present local citation recommendation approaches, that is, approaches that use a specific citation context within the input document (see Sec. 2.1 for the distinction between local and global citation recommendation). This means that only 33.3% of the approaches Table 3 Approaches to global and local citation recommendation (CR).

Corpus Characteristics
1. We can observe that approaches to citation recommendation have been published over the last 17 years (see Figure 4). The task of global citation recommendation has attracted the interest of researchers at an earlier stage than local citation recommendation (first publication year 2002 [104] vs. 2010 [68]). Both the number of approaches to global citation recommendation and local citation recommendation has increased continuously. Overall, more approaches to global citation recommendation system have been published than approaches to local citation recommendation. However, note that the most recent publications on global citation recommendation have been published in very short time intervals at similar or same venues from partially identical authors (see Table 3). 2. Some precursor works on the general task of analyzing and predicting links between documents [36] have , the number of publications increased by 33%, while citations increased by 55%. 3. Before the content-based (local and global) citation recommendation approaches -as considered in this survey -, several systems had already been proposed that use purely the citation graph as basis for the recommendation. This "prehistory" of contentbased citation recommendation is explainable by the fact that quantitative science studies such as bibliometrics have a long history, and were already quite established in the 2000s. 4. Having an appropriate and large collection of scientific papers as evaluation and training data is crucial and not easy to obtain, since -especially in the past -papers were often "hidden" behind paywalls of publishers. Therefore, it is not very surprising that several approaches [19,82,20,83] consider only abstracts as citing documents instead of full paper contents. Citation recommendation then turns into reference recommendation for abstract texts. 5. Citation recommendation is located in the intersection of the research areas information retrieval, digital libraries, natural language processing, and machine learning. This is also reflected in the venues in which approaches to citation recommendation have been presented. Considering both global and local citation recommendation, SIGIR, IEEE Access, CIKM, and JCDL have been chosen most frequently as venues (5 times SIGIR, 5 times IEEE Access, 5 times CIKM, 3 times JCDL; together accounting for 35% of all papers). Particularly, IEEE Access has become popular as a venue for publishing citation recommendation approaches by a few researches in 2018 and 2019. Note that this journal's reviewing and publication process is designed to be very tight (one review round takes 7 days) and that IEEE has an article processing charge. Our paper corpus also contains a few publications from medium-ranked conferences, such as AIRS [97]. It became apparent that these papers provide less comprehensive evaluations, but relatively high evaluation results (see the evaluation metrics paragraph in Section 3.3). Due to missing baselines, these results need to be taken with care. 6. Considering purely local citation recommendation, SIGIR (3 times) and ACL (2 times) occur most frequently as venue. The remaining venues occur only once.  [104,124,81,82,69,34,56,100,61,20,83,84,28,27,163,37,112,161,29]: B: [19]; C: [167]; D: [137,113,144,159,80,94,96,43,42,168,38,101,95]; E: [68,67,70,127,97,41,146,71,166,44,162,77]; F: [65,99,87,63,85] The numbers correspond to the references in the reference section.
Big picture. In Fig. 5, we present visually a "big picture" of the different settings in all citation recommendation approaches. We thereby differentiate between what data is used from the citing documents (either only metadata (incl. abstract), or metadata plus content, or metadata plus specific citation contexts), and what data is used from the cited documents (either only metadata, or metadata plus content). Note that approaches using the metadata or the content of the citing documents make up the group of global citation recommendation approaches, while approaches using specific citation contexts target local citation recommendation. Note also that approaches using only the metadata of the citing documents can be regarded as targeting both the expert setting and the non-expert setting (see Section 2.2), while the other approaches are designed primarily for the expert setting. The publications that propose the approaches sometimes do not point out in detail what data is used (e.g., whether the author information of the citing papers is also used), which makes a valid comparison infeasible. Thus, this "big picture" figure tries to provide a clear picture of what has been pursued so far. Notable, for instance, is that 23.5% (12 out of 51) of all approaches use citation contexts (less than the whole content) of the citing documents and only the metadata of the cited documents (see class E). In contrast, we can find only one approach that uses the whole content of the citing documents and only the metadata of the cited documents (see class C). We can mention two potential reasons for this fact. Firstly, it can be difficult to obtain the publications' full texts (due to, among other reasons, limited APIs and copyright issues). Secondly, operating only with papers' metadata is also easier from a technical perspective.
Citation relationships. Fig. 6 shows the citationrelationships between papers with citation recommendation approaches. The papers are thereby ordered from left to right by publishing year. It is eye-catching that there is no continuous citing behavior along the temporal dimension, i.e., a paper in our set does not necessarily cite preceeding papers in our set. However, in some cases we can explain this by the fact that publications were published within short time intervals. Consequently, the authors might not have been aware of other approaches which had either been published very recently or had not yet been published. For instance, the standalone papers [127,97,41,99,146] were published between February 2013 and July 2014. Nevertheless, we can observe that authors of citation recommendation approaches do omit references to other citation recommendation approaches.

Comparison of Local Citation Recommendation Approaches
When comparing citation recommendation approaches, it is important to differentiate between approaches to local citation recommendation (making recommendations based on a small text fragment) and approaches to global citation recommendation. To understand that, consider a scenario in which a text document with 20 citation markers is given. In case of local citation recommendation, it is not uncommon to provide, for instance, three recommendations per citation context. However, a global citation recommendation system would provide only a list of 60 recommendations without indications where to insert the corresponding citation markers. In our mind, it is not reasonable to call this process context-aware citation recommendation and to evaluate the list of 60 recommendations in the same way as the 20 lists with 3 recommendations, since citations are meant to back up single statements and concepts on a clause level, i.e., being suitable only for specific contexts. Note also that global recommendation approaches in the context of paper recommendation are   Introduction). This survey, in contrast, focuses on context-awareness, which, to date, has not yet been considered systematically. Thus, in this subsection, we compare only the 17 approaches to local citation recommendation.
In order to characterize and distinguish the different approaches from each other, we introduce the following dimensions: 1. What is the underlying approach and to which data mining technique is it associated? 2. What information is used for the user modeling, if any? 3. Is the set of candidate papers prefiltered before the recommendation? 4. What is used as the citation context (e.g., 1 sentence or 50 words before and after the citation marker)? 5. Is the citation context pre-specified in the evaluation or do cite-worthy contexts first need to be determined by the algorithm? 6. Is the content of the cited papers also needed (limiting the evaluation to corresponding data sets)? 7. Which evaluation data set is used (e.g., CiteSeerX or own data set)? 8. From which domain are the papers used in the evaluation (e.g., computer science)? 9. What are the used evaluation metrics? Table 4 shows the classification of the approaches according to these dimensions. While in the following we point out the main findings per dimension, note that we also provide a description of the single approaches and their characteristics in an online semantic wiki. 9 1. Approach: A variety of methods have been developed for local citation recommendation. We can group them into the following four groups: (a) Hand-crafted feature based models [67,127,97,41,99]. All approaches in this group are based on features that were hand-crafted by the developers. Text similarity scores obtained between the citation context and the candidate papers are examples of text-based features. Remarkably, all features used for the approaches are kept comparably simple. Moreover, the approaches do not use additional external data sources, but rather statistics derived from the paper collection itself (e.g., citation count and text similarity). Relatively basic techniques used for the ranking of citations for the purpose of citation recommendation (e.g., logistic regression and linear SVM [97], or merely the cosine similarity of TF-IDF vectors [41]) seem to lead to already noteworthy  [67] and gradient boosted regression trees [99]. Note, however, that their superiority compared to simpler models is hard to judge due to differing evaluation settings, such as data sets and metrics. In recent years, no novel approaches of this group have been published any more (latest one from 2014), likely due to the fact that (1) the obvious features have already been used and evaluated, and (2) recent approaches (e.g., neural networks) seem to outperform the hand-crafted featurebased models. Nevertheless, hand-crafted feature based models provide the following advantages: 1. Scalability: Since both the computation of the features and the used classifier/regression model are kept rather simple, the citation recommendation approaches become very scalable and fast. 2. Explainability: The described techniques are particularly beneficial when it comes to getting to know which features are most indicative for recommending appropriate citations. 3. Small data: The models do not require huge data sets for training, but may already work well for small data sets (e.g., a few thousand documents). Existing approaches in this group use mainly lexical features and other bibliometrics-based features (e.g., citation count). Hand-crafted features focusing on the semantics and pragmatics of the citation contexts and of the candidate cited documents, are missing. In the future, one can envision a scenario in which claims or argumentation structures are extracted from the citation contexts and compared with the claims/argumentation structures from the citable documents. (b) Topic modeling [68,85]. Topic modeling is a way of representing text (here: candidate papers and citation contexts) by means of abstract topics, and thereby exploiting the latent semantic structure of texts. Topic modeling became popular, among others, after the publication of the LDA approach by Blei et al. in 2002 and was applied to local citation recommendation in 2010 [68,85]. Using topic modeling in the context of citation recommendation means to adapt default topic modeling approaches, which work purely on plain text documents, in such a way that they can deal with both texts and citations. To this end, He et al. [68] use a probabilistic model based on Gleason's Theorem, while Kataria et al. [85] propose the LDA-variations Link-LDA and Link-PLSA-LDA. Note that topic modeling per se is computationally rather expensive and may require more resources than approaches of the group (a). Moreover, conceptually it might be designed rather for longer texts, and, thus, more suitable for global citation recommendation (where it has been applied in [144,113]). In the series of citation recommendation approaches, topic modeling has been applied within a relatively short time interval (2010 only for local citation recommendation; 2008 and 2009 in case of global citation recommendation) and has been replaced first by machine translation models (group (c)) and later by neural network-based approaches (group (d)). (c) Machine translation [65,70]. The authors of [65,70] apply the concept of machine translation to local citation recommendation. These approaches had been published also within a short time frame, namely only in 2012. Using machine translation might appear surprising at first. However, the developed approaches do not translate words from one language into another, but merely "translate" the citation context into the cited document (written in the same language, but maybe with a partially different vocabulary). In this way, the vocabulary mismatch problem is avoided. The first published approach using machine translation for citation recommendation was designed for global citation recommendation [101]. Here, the words in the citing document are translated to the words in the cited document. This requires the cited documents' content to be available. Approaches to local citation recommendation follow: In [65], the translation model uses several positions in the citable document for translations. However, this makes the approach computationally very expensive. The last approach in this group [70] translates the citing document merely into the identifiers of the cited documents and does not use the cited documents' content any more. By doing that, the authors obtain surprisingly high evaluation results. Note that machine translation is a statistical approach and requires a large training data set. However, in the published papers and their evaluations, rather small data sets (e.g., 3,000 and 14,000 documents in [70] and 30,000 documents in [65]) are used. Moreover, high thresholds for the translation probability may be set to make the machine translation approach feasible [70].
(d) Neural networks [146,71,166,44,87,63]. This group contains not only many approaches to local citation recommendation (6 out of 17, that is, 35%), but also the most recent ones: here, papers have been published since 2014. Due to the large field of neural network research in general, the architectures proposed here also vary considerably. Although there are also relatively generic neural network architectures applied [166,146], we can observe a tendency in increasing complexity of the approaches. Approaches are either specifically designed for texts with citations (e.g., [44,87]) or consider texts with citations as a special case of hyperlinked documents [63]. In the first subgroup are approaches using convolutional neural networks [44] and special attention mechanisms, such as for authors [44]. In the latter subgroup is an approach which uses two vector representations for each paper. Note that the approaches in this approach group do not incorporate any user model information, but work purely on the sequence of words. An exception is [44] which exploits the citing document's author information.
When it comes to deciding whether neural networks should be used in a productive system, one should note that neural networks need to be trained on large data sets. In recent years, large paper collections have been published (see Sec. 4). However, also the infrastructure, such as GPUs, needs to be available. Moreover, considerable approximations need to be applied to keep the approach feasible. This includes the negative sampling strategy [166,71,87,63]. But also a prefiltering step before the actual citation recommendation approach is often performed, which reduces the set of candidate papers significantly [71].
Han et al. [63], who propose one of the most recent citation recommendation systems and who evaluate their approach on data sets with realworld sizes, report recall@10 values of 0.16/ 0.32/ 0.21 and nDCG@10 values of 0.08/0.21/0.13 for the data sets NIPS, ACL-Anthology, and Cite-Seer+DBLP data. This shows that the results depend considerably on the data set and on the pre-processing steps (e.g., whether PDF-to-text conversion is performed). Overall, it can be assumed that the novel approaches to citation recommendation published in the near future will mainly be based on neural networks, too.
Overall, existing approaches are primarily based on implicit representations of the cited statements and concepts (e.g., embeddings of citation contexts [87,63]), but not on fine-grained explicit representations of statements or events. One reason for that might be the missing research on the different citation types besides the citation function, and the current relatively low performance of fact extraction and event extraction methods from text. 2. User model: As outlined in Sec. 2, approaches to citation recommendation can optionally incorporate user information, such as the user name, the venue that the input text should be submitted to, or keywords which categorize the input text explicitly. Overall, we can observe that most approaches (12 out of 17, i.e., 71%) do not use any user model. Five approaches are dependent on the author name of the citing document. 10 3. Prefilter: By default, all candidate papers need to be taken into account for any citation recommendation. This often results in millions of comparisons between representation forms and, thus, turns out to be unfeasible. To escape from that, the proposed methods often incorporate a pre-filtering step as a step before the actual recommendation, in which the set of candidate papers is drastically reduced. In 30% (5 out the 17) of the considered papers, the authors mention such a step (see Table 4). While three authors implement a certain numerical value as threshold [127,99,44], 11 [6]. 5. Citation placeholders: The citation placeholders, i.e., the places in which a citation should be recommended, and therefore also the citation context, are typically already provided a priori for evaluating the single approaches (exceptions are [67,99]). The main reason for this fact is presumably that the past approaches focus on the citation recommendation task itself and see the identification of "cite-worthy" contexts as a separate task. Determining the citeworthiness, which is similar to determining the citation function, is not tackled in the approaches. How- 10 The two global citation recommendation approaches [95,20] allow the user to disclose more information about her optionally. 11 Examples in the case of global citation recommendation are [137,20]. 12 Concerning global citation recommendation, we can refer here to [95,82]. ever, there have been separate attempts at solving this task [50,141] (and related: [3]). Also, with respect to performing the evaluation, having a flexible citation context makes it very tricky to compare the approaches in offline evaluations with the citation contexts and their citations from the ground truth.
Single attempts such as [67,99], solve it, however, for instance, by using only those citation contexts and associated citations which overlap with the found citation contexts to a considerable degree. 6. Cited papers' content needed: The approaches to citation recommendation differ in the characteristic of whether they incorporate the content of the cited documents or not. Incorporating the contents means that all cited documents need to be available in the form of full text. This is often a limitation, since any paper published somewhere could be referenced by authors; the cited documents are, thus, often not in any ready collection of citing documents. For instance, in the CiteSeer data set of [113], only 16% of the cited documents are also citing documents; this is similar to the arXiv CS data set [48] and unarXiv data set [129]. Not incorporating the content, on the other hand, leads to a less fine-grained recommendation and the vision of even a single fact-based recommendation is illusive. Considering the approaches to local citation recommendation, we cannot recognize a clear trend concerning the aspect of used content: both approaches using the cited papers' content and not using it have been proposed in recent years. 7. Evaluation data set: In general, a variety of data sets have been used in the publications. Most frequently (in 8 out of 17, i.e., 47% of the cases), versions of the CiteSeer data set (i.e., CiteSeer, Cite-SeerX, RefSeer) have been applied, because this data set has been available since the early years of citation recommendation research and because it is relatively large. However, even the approaches in recent years are often evaluated on newly created data sets. As Sec. 4.1 is dedicated to data sets used for citation recommendation, we can refer to this section for more details. 8. Domain: Independent of which data set has been applied, all data sets cover the computer science or computational linguistics domain. We can assume that this is because (1) the papers in those domains are relatively easy to obtain online, and because (2) the papers are understandable by the authors of these approaches, allowing them to judge at first sight whether the recommendations are suitable. 9. Evaluation metrics: Concerning the usage of evaluation metrics and the interpretation of evaluation scores, the following aspects are especially noteworthy: (a) Varying metrics: The metrics used across the papers vary considerably; most frequently, recall, MAP, nDCG, and MRR are used (10/9/7/7 out of 17 times). This variety makes it hard to compare the effectiveness of the approaches. (b) Varying data sets: Since largely systems have been evaluated on varying data sets and with varying document filtering criteria, we can hardly compare the systems' performance overall. For instance, the recent approaches [71,44] report both nDCG@10 scores of 0.26. 13 (c) Varying k: Even if the same metrics are used in different papers, and maybe when even the same data sets are used, for considering the top k returned recommendations, different k values are considered, with a great variance from k = 1 up to k = 200. Especially high values like k = 100 [87] or k = 80 [77] seem to be unrealistic as no user-friendly system would presumably expect the user to check so many recommendations. (d) Missing baselines: It can be observed that the considered papers do not reference all prior works (see also Fig. 6) and that previously proposed approaches are not used sufficiently as baselines in the evaluations, although the papers propose solutions for the same research problem. This applies to papers on local citation recommendation and global citation recommendation. For instance, [83] does not cite [146], although both tackle the cross-language citation recommendation problem. This issue was already observed for papers on paper recommendation in [14]. (e) Varying citation recommendation tasks: The system's performance strongly depends on the kind of citation recommendation which is pursued. Given not only a citation context as input, but also the metadata of the citing paper, such as the authors, the venue, etc., then the nDCG@10 score can be 0.62 as in [99] instead of around 0.26 as in [71,44]. 14 In total, it is very hard to compare the effectiveness of the approaches (1) if different metrics are used and with different top k values, (2) if different evaluation data sets are used, (3) if the approaches do not use existing 13 In case of global citation recommendation, see [96] with an nDCG@10 score of 0.21. 14 Moreover, global citation recommendation systems using only the papers' abstracts perform differently to the ones based on the papers' full contents. This can be illustrated by the fact that Liu et al. [96] use an abstract as input and obtain MAP@all of 0.16, while the same authors in [95] obtain a MAP@all score of 0.64 when using the full content.
systems as baselines, and (4) if the differences in the task set-up are not outlined. Considering the abovediscussed approaches, we can observe this phenomenon to a high degree.

System Demonstrations
While a relatively large amount of approaches to citation recommendation have been published, only RefSeer [72] and CITEWERTs 15 [49] have been presented as systems for demonstration purposes. RefSeer is based on the model proposed by He et al. [68] and uses Ci-teULike as the underlying document corpus. It recommends one citation for each sentence in the input text. CITEWERTs, in contrast, is the first system which not only recommends citations but also identifies citeworthy contexts in the input text beforehand. This makes the system more user-friendly, since it hides unnecessary recommendations, and it reduces the number of costly recommendation computations. Besides these systems, to the best of our knowledge, only paper recommendation systems exist, i.e., systems that do not use any citation context, but, for instance, only use a citation graph [73]. TheAdvisor [89], FairScholar [8] are further examples of paper recommender system demonstrations. Google Scholar, 16 Mendeley 17 , Docear [15], and Mr. DLib [16] also provide a functionality for obtaining paper recommendations.

Data Sets for Citation Recommendation
In this section, we give an overview of data sets which can be used in the context of citation recommendation. Section 4.1 presents data sets containing papers' content, while Section 4.2 outlines data sets containing purely metadata about papers.

Overview of Data Sets
There exist several corpora which provide papers' content and, hence, can serve as a gold standard for automatic evaluations. Table 5 gives an overview of the data sets which are considered by us. Note that we only consider data sets here that are not outdated and that are still available (either online or upon request from the author). Hence, old data sets, such as the Rexa data base [137] or the initial CiteSeer database [58], are not included. 18 Generally, we can differentiate between two corpora sets: firstly, the CiteSeer data sets, available in different versions, have been explicitly created for citationbased tasks. They already provide the citation contexts of each citing paper and can be described as follows: -CiteSeerX (complete) [31]: Referring to the Cite-SeerX version of 2014, the number of indexed documents exceeded 2M. The CiteSeerX system crawls, indexes, and parses documents that are openly available on the Web. Therefore, only about half of all indexed documents are actually scientific publications, while a large fraction of the documents are manuscripts. The degree to which the findings resulting from the evaluations based on CiteSeerX also hold for the actual citing behavior in science is therefore unknown to some degree. -CiteSeerX cleaned by Caragea et al. [31]: The raw CiteSeerX data set contains a lot of noise and errors as outlined by Roy et al. [128]. Thus, in 2014, Caragea et al. [31] released a smaller, cleaner version of it. The revised data set resolves some of the noise problems and in addition links papers to DBLP. -RefSeer [71]: RefSeer has been used for evaluating several citation recommendation approaches [71,44]. Since it contains the data of CiteSeerX as of October 2013 without further data quality improvement efforts, RefSeer is on the same quality level as CiteSeerX. -CiteSeerX cleaned by Wu et al. [160]: According to Wu et al. [160], the cleaned data set [31] still has relatively low precision in terms of matching CiteSeerX papers with papers in DBLP. Hence, Wu et al. have published another approach for creating a cleaner data set out of the raw CiteSeerX data, achieving slightly better results on the matching of the papers from CiteSeerX and DBLP.
Then, there are collections of scientific publications, with and without provided metadata, for which citation contexts are not explicitly provided. However, in those cases, the citation contexts can be extracted by appropriate tools based on the papers' content, making these corpora also applicable as ground truth for offline evaluations. They are listed alphabetically in the following: -ACL Anthology Network (ACL-AAN) [122]: ACL-AAN is a manually curated database of citations, collaborations, and summaries in the field of Computational Linguistics. It is based on 18k papers. The latest release is from 2016. ACL-AAN has been used as an evaluation data set for many tasks. -ACL Anthology Reference Corpus (ACL-ARC) [21]: 19 ACL-ARC is a widely used corpus of scholarly publications about computational linguistics.
There are different versions of it available. ACL-ARC is based on the ACL Anthology website and contains the source PDF files (about 11k for the February 2007 snapshot), the corresponding content as plaintext, and metadata of the documents taken either from the website or from the PDFs. -arXiv CS [48]: This data set, used by [50,47], was obtained by utilizing all arXiv.org source data of the computer science domain and transforming the L A T E X files into plaintext by an own implemented T E Xparser. As far as possible, each reference is linked to its DBLP entry. -CORE: 20 CORE collects openly available scientific publications (originating from institutional repositories, subject repositories, and journal publishers) as data basis for approaches concerning search, text mining, and analytics. As of October 2019, the data set contains 136M open access articles. CORE has been proposed for citation-based tasks for several years. However, to the best of our knowledge, it has not yet been used for evaluating or deploying any of the published citation recommendation systems. -Scholarly Paper Recommendation Dataset 2 (Scholarly Data Set): 21 This data set contains about 100k publications of the ACM Digital Library and has been used for evaluating paper recommendation approaches [140,139]. -unarXiv [129]: This data set is an extension of the arXiv CS data set. It consists of over one million full text documents (about 269 million sentences) and links to 2.7 million unique papers via 29.2 million citation contexts (having 15.9 million unique references). All papers and citations are linked to the Microsoft Academic Graph. Table 5 shows the mentioned data sets categorized by different dimensions. We can outline the following highlights with respect to these dimensions:

Comparison of Evaluation Data Sets
Size of data set The considered data sets differ considerably in their sizes: they range from small (below 100k documents; see ACL-ARC and ACL-AAN) to very large (over 1M documents; see CiteSeerX complete). Note thereby that the cleanliness of the provided papers' contents does not necessarily depend on the overall size of the data set: for instance, ACL-AAN and ACL-ARC are quite noisy, as they contain rather old publications, which are hard to parse. However, clean metadata of the cited papers is available for those data sets.
Availability of citation context CiteSeerX, arXiv CS, and the unarXiv data set provide explicitly extracted citation contexts of the citations in the documents. In case of the different versions of CiteSeerX, a fixed window of 400 characters has been chosen around the citation markers. In the case of arXiv CS and unarXiv, the content is provided sentence-wise, so that all sentences annotated with citation identifiers can be used as citation context. The corpora which contain the publications contents in their original form -namely, ACL-AAN, ACL-ARC, CORE, and Scholarly -do not provide citation contexts. However, these contexts could be extracted without much effort by using appropriate tools from the source PDF files.
Structured metadata of citing papers For all the presented corpora, structured metadata of all the citing papers is provided. An exception is Scholarly, which only consists of PDF files. Hence, the metadata needs to be extracted by oneself with the corresponding tools. Note that the metadata is clean only for those corpora for which the information has been entered manually at some point. For CiteSeerX, all information, including the metadata of citing papers, has been extracted from the publications (mainly PDFs). Hence, this framework is independent of external data. However, as a tradeoff, the extracted metadata is to some extent noisy and inaccurate (missing information or wrongly split strings etc.) [128].
Structured metadata of cited papers Only the CiteSeer data sets as well as arXiv CS and unarXiv provide this information per se. In the case of CORE, it is planned that publications will be linked to the Microsoft Academic Graph. Consequently, structured metadata of cited papers will be retrievable from this data set. 22 For the other corpora containing publications' content, the metadata of the cited papers can be obtained by extracting the information from the publications' reference sections via the appropriate tools. However, note that it  Table 6 Overview of papers' metadata data sets applicable to citation recommendation.
Size of data set does not only require the parsing via an appropriate information extraction tool, but also the reconciliation of the data (i.e., building a global database of publications' metadata). The task of how to find out if two referenced papers are actually the same and, hence, should have the same identifier is non-trivial and is known as citation matching.
Paper content of citing papers Some approaches, such as sequence-to-sequence approaches, require the complete contents of all citing papers. In the complete Cite-SeerX data set, all citing papers' contents are still available. Also the paper collections Scholarly, arXiv CS, unarXiv, ACL-ARC, and ACL-AAN (and CORE to some degree) contain the full paper contents. However, in case of Scholarly and ACL-AAN, the original data sets do not contain the contents as plaintext, so that one first needs to run appropriate transformation approaches.
Abstract of citing papers Since the abstract of papers belongs to the metadata, it is quite easily obtainable for both citing papers and cited papers. Furthermore, it already summarizes the main points of each paper (although typically not sufficiently for a detailed and precise recommendation) and can be used for obtaining a better representation of the paper, and, hence, for improving the recommendation of papers based on citation contexts overall. Regarding the citing papers, all data sets either provide the abstract already in an explicitly given form (see the CiteSeerX data set and partially CORE) or contain the original publications (as PDF or similar formats), so that the abstract can be extracted from them (see Scholarly, arXiv CS, unarXiv, ACL-ARC, ACL-AAN).
Abstract of cited papers Having as much information as possible about what the cited papers are dealing with is crucial for a good citation recommendation. In this context, the abstracts of cited papers are very useful and are used by several approaches [68,67,99,82,101,41,166,20,83]. However, none of the data sets contain abstracts for all cited papers.
Full citation graph In a full citation graph (also called citation network ), not only the citations of the citing papers are represented, but the citations of any paper of a given document collection. Such a graph can be used for obtaining a good representation of the papers (see paper embeddings [44,59]) and to compute similarities among papers. None of the considered corpora provides such an extended citation graph. 23 As an alternative, one can think of linking papers from one corpus with papers of a metadata corpus (see Section 4.2).
Cleanliness The situation is mixed in this regard: the metadata of the papers is of good quality, especially if it originates from corresponding, dedicated databases instead of being extracted solely from the publications themselves (see ACL-AAN, ACL-ARC, arXiv CS, and unarXiv vs. the CiteSeerX data sets). The papers' content is typically provided via information extraction methods, meaning that the quality is not that high, particularly if the papers were hard to parse and process, e.g., due to being very old (see the papers of ACL-ARC and ACL-AAN vs. Scholarly, which contains newer papers) or due to special formating in the publications, such as formulas in the text (see CiteSeerX data sets vs. the arxiv CS and unarXiv data sets, where formulas were detected and removed).
Links to bibliographic data sets Having publications linked to external bibliographic data sets such as DBLP allows the use of interlinked information for paper representations and for search. Corpora of scientific papers have often been created in the area of computer science, since there are many publications available online. As a consequence, the most widely used bibliographic database for computer science, DBLP, has been used as a reference of interlinking. More precisely, the cleaned versions of CiteSeerX and the arXiv CS data set provide links to DBLP. unarXiv provides links to the Microsoft Academic Graph, as it covers not only computer science papers, but also many other disciplines.

Corpora Containing Papers' Metadata
Besides corpora including papers' content, data sets exist that contain metadata about publications; typical metadata include the citation relations between papers and the titles, venues, publication years, and abstracts of the publications. Although no content is usually provided, the metadata can be regarded as an explicit, structured representation of the papers and, hence, can be used as a valuable representation of the papers, e.g., for learning embedding vectors based of them (see, e.g., [44,55]). Due to their extensive sizes, the following data sets are in our view particularly suitable for citation recommendation: 24 23 Note, however, that data sets such as unarXive (and CORE in the future) link to the Microsoft Academic Graph providing citation information. 24 The data set Mendeley DataTEL is not listed, as it has not been available to us after several requests. Further -AMiner DBLPv10 25 [145]: This data set contains over 3M papers and 25.2M citation relationships, making it a large citation network data set. Since DBLP was used as data source, the data is very clean. -AMiner ACMv9 26 [145]: This data set has the same structure as AMiner DBLPv10, but was constructed from 2.4M ACM publications, with 9.7M citations. -Microsoft Academic Graph: 27 This data set can be considered as an actual knowledge graph about publications and associated entities such as authors, institutions, journals, and fields of study. Direct access to the MAG is only provided via an API. However, dump versions have been created. 28 Prior versions of the MAG are known as the Microsoft Academic Search data set, based on a the project Microsoft Academic Search which retired in 2012. -Open Academic Graph: 29 This data set is designated to be an intersection of the Microsoft Academic Graph and the AMiner data. In many cases, the DBLP entries for computer science publications ought to be retrievable. -PubMed: 30 PubMed is a database of bibliographic information with a focus on life science literature. As of October 2019, it contains 29M citations and abstracts. It also provides links to the full-text articles and third-party websites if available (but no content). Table 6 shows the mentioned data sets categorized by various dimensions. The same dimensions are used as for comparing the corpora in Section 4.1, except the ones which are homogeneous among the metadata data sets (e.g. availability of citation context, paper content of citing papers). Due to page limitations, we omit a textual comparison of the mentioned metadata data sets.
data sets, such as CORA (https://relational.fit.cvut. cz/dataset/CORA), have not been shortlisted due to their small size. We have also not listed bibliographic databases like DBLP here, as they contain neither the papers' contents nor information about the citations between papers. Also Springer's SciGraph does not contain any citation information yet. Bibliographic databases, such as Scopus and Web of Science, are dedicated information retrieval platforms, but do not officially support bulk downloads. 25

Evaluation Methods and Challenges
In this section, we first discuss the different ways of evaluating citation recommendation approaches. Secondly, we point out important challenges related to evaluating citation recommendation approaches. Afterwards, we provide the reader with guidelines concerning what aspects to consider for evaluating future recommender systems.

Evaluation Methods for Citation Recommendation
Generally, we can distinguish between offline evaluations, online evaluations, and user studies. In offline evaluations, no users are involved and the evaluation is performed automatically. Online evaluations measure the acceptance rates of recommendations in deployed recommender systems. User studies are used for measuring the user satisfaction through explicit user ratings.
For offline evaluations, the following evaluation methods have been applied so far for citation recommendation: 1. Strict "citation re-prediction:" This evaluation method has been used by almost all approaches to local citation recommendation (15 out of 17; see [68,85,65,70,127,97,41,146,71,166,44,87,63,162,77]). The evaluation is performed as follows: an approach is evaluated by assessing which of the citations that have been recommended by the system are also in the original publications. We can therefore call this method "re-prediction." This evaluation method scales very well, but ignores several evaluation challenges, such as the relevance of alternative citations, and the cite-worthiness of contexts (see Section 5.2). Hence, the evaluation metrics used for strict citation reprediction, such as normalized discounted cumulative gain (nDCG), mean average precision (MAP) and mean reciprocal rank (MRR), might reflect the reality in the sense of the citing behavior observed in the past, but not the desired citing behavior. 2. "Relaxed citation re-prediction:" In order to allow papers to be recommended which are not written as citations by the authors of the papers, but which are still relevant, and on the other hand, to keep the evaluation still automatic and scalable, sometimes a relaxation of the strict re-prediction method is applied. In the set of considered approaches, the following methods have been applied by both He et al. [67] and Livne et al. [99]: (a) The relative co-cited probability metric is designed as a modified accuracy metric and based on the assumption that papers which are frequently cocited are relevant to each other. Hence, if not the actual cited paper, but a co-cited paper 31 is recommended, this paper is also considered as a hit to some degree. The relative co-cited probability is the ratio to which recommended papers are either directly cited or are co-citations of actual citations. In the latter case, the co-cited paper is only scored gradually. (b) The regular nDCG score is used for measuring the correct ranking of items. Modifying this score is based on the idea that if the actual paper is not standing on the intended position, but there is another paper there, which is also relevant (here, again determined by the co-citations), then this should also be judged as correct to some degree. More specifically, the authors use the average relative co-cited probability of r with all original citations of d to obtain a citation relevance score of r to d. Then the documents in D are sorted w.r.t. this relevance score and each document is assigned a score on a 5 point scale regarding its relevance. Finally, the average nDCG score over all documents is calculated based on these scores.
A more comprehensive, but not very scalable way to evaluate approaches is to rely on online evaluations [17]. None of the considered approaches has been evaluated in this way so far. Also no user studies for citation recommendation systems are known to us. 32

Challenges of Evaluating Citation Recommendation Approaches
In the previous subsection we learned that it is hard to apply traditional evaluation metrics for offline evaluations of citation recommendation systems. We now point out further challenges when it comes to determining the performance of citation recommendation systems. In Section 5.3, we then propose steps for approaching some of these challenges.

Fitness of Citations
Training and evaluating a citation recommendation system based on an existing paper collection used as ground truth is tricky, since the citing behavior encoded in the citations of these considered papers is taken as ground 31 B is a co-cited paper of A, if both A and B are cited by a third paper C. 32 For paper recommendation, a few manual evaluations exist [14]. However, paper recommendation is out of our scope.
truth. This becomes a problem when the original citing behavior is not favorable and adaptations are desired. In the past, several analyses of scientific citing behavior have been published [143]. We can reuse these for characterizing the different aspects of citing biases in the context of evaluating citation recommendation. We thereby group citing biases along the attributes of the citable publications: 1. Content Understanding: Authors of citing papers may differ in their expertise, knowledge level, and working style when selecting citations (cf. professor vs. masters student). The suitability of the content of citable papers is therefore often judged differently. Furthermore, authors of citing papers might perform literature investigations and reviews in a rather sloppy way [75] and read, for instance, mainly titles and abstracts of documents only. However, titles and abstracts may deceive users about the true claims and contributions of papers. Moreover, the selection of citations can be biased by the style of the titles and abstracts (see, e.g., [26,138]). Also the writing style of the fulltext of the citable papers has some influence on citing, as it reflects the perceived quality of the paper [98]. 2. Authors: It is quite common to cite publications written by oneself, called self-citations, [5,74] or written by colleagues, advisors, and friends [153], with an element of preferential bias. Although analyses have shown that this is not per se harmful [143], a citation recommender system ought to be designed independent of any bias. Furthermore, the user of a citation recommendation system might be interested particularly in works she does not yet know. There are also cases in which the authors of the citing and cited document do not know each other, but in which the author of the citing document still favors specific persons as authors of the cited documents. Most notably, sometimes citation cartels exist in the scientific communities, which first of all cite papers within sub-communities [53]. Furthermore, it has been observed that even the country a person comes from, the race, and the gender play a role in the selection of citations [142]. A bias towards citing authors who act as the referee or reviewer of the citing document in a peer-review process is also plausible [154]. 3. Venue and Paper Type: It is obvious that the venue is an influential factor in selecting appropriate citations for a given text. Highly rated conferences and journals might get higher levels of attention and are privileged compared to lower rated conferences, workshops, and similar publication formats [30,155].
A bias can go so far that a relatively weak publication in a prestigious journal receives a high number of citations only due to the centrality of the journal [30]. Papers in interdisciplinary journals are more likely to be cited [9]. Last but not least, it should be noted that, in the frame of the widely performed peer-reviewing process, especially papers that were published in the same venue as the citing paper are more often selected as citations [158]. Many venues have introduced a page limit for submitted papers. As a consequence, authors often choose to cut several citations which would be relevant and important for understanding the content. 4. Timeliness: The temporal dimension concerning citing behavior is, to the best of our knowledge, relatively unexplored in the context of citation recommendation. On the one hand, due to the acceleration in the publishing rate of scientific contributions, authors of citing papers might target citing especially recent papers. On the other hand, older papers have more citations and are easier to find. Note also that the reasons for citing specific publications can change over time [33]. 5. Accessibility and Visibility: During the citing process, researchers are limited by their capabilities for finding appropriate publications for citing. In particular, they typically cite only papers to which they have fulltext access. However, a considerable amount of researchers have limitations in this regard, such as having no license for accessing papers of specific publishers (e.g., ACM or Springer) and paper collections. Consequently, the set of citable papers is narrowed down considerably. Hence, either not all concepts and claims in the citing paper can be backed up by citations or they cannot be backed up by optimal citations. Papers are also embedded in the social interactions and dissemination processes of researchers. Most notably, the claim that prominent publications get cited more is comprehensible and well-studied, even though more relevant alternative publications might exist for citation [156]. Prominent papers are papers which already have a high number of citations, or papers written by authors who are well known in the field and who also have a high aggregated citation count. We can refer in this context to the studies on the socalled Matthew effect [13] and on the Google Scholar effect [130]. Particularly prominent papers are called landmark papers and citation classics [133]. They are characterized by the fact that they are often added as citations in a ritualized way and self-enforce their citing.
Last but not least, it cannot be neglected that nowadays many publications are disseminated via social networks and other channels. Research on these aspects in the context of citing behavior has been performed only to a limited extent [92]. 6. Discipline: Firstly, researchers naturally work within scientific communities and disciplines, with the consequence that they are often exclusively familiar with works published in their discipline or field and that it is difficult for them to discover papers from other fields (due to different venues, terminology, etc.). Hence, citations tend to be limited by the affiliation to the discipline (or even research field). Secondly, the citing behavior also changes from discipline to discipline. Comparing the citing behavior across disciplines, and, hence, comparing also citation recommendation systems trained and tested on different disciplines, is challenging. For instance, disciplines differ in (1) the number of articles published, (2) the number of co-authors per paper, (3) the relevance of the publication type (e.g., journal, conference, book) for publishing, and (4) the age of cited papers [102]. These aspects have a direct influence on the relevance function of any citation recommendation model. Investigations and evaluations on the context of citation recommendation approaches are missing so far, however. As stated in Section 3, evaluations on citation recommendation have been performed mainly on corpora containing only computer science publications.

Cite-Worthiness of Contexts
Citation recommendation systems typically consider predefined citation contexts for their prediction. However, in reality, typically not only the provided citation contexts are cite-worthy, but also further contexts. Among others, one reason for missing citations is the page restriction which authors need to fulfill for submitting papers to venues. 33 In the past, there have been a few approaches for assessing the cite-worthiness of potential citation contexts automatically, however, only in the sense of a binary classification task [50,22,141,49].
Although there are single works on characterizing the citation context, such as on the citation function, the citation importance, and the citation polarity (see Sec-tion 2.4), these aspects are not considered in citation recommendation approaches so far. In particular the type of citation, given as the citation function or in the form of another classification, such as whether the citation backs up a single concept or a claim, seems to be a notable aspect to be considered.

Scenario Specificity
As outlined in Section 2.2, citation recommendation systems can be applied in different scenarios, differing in particular in (1) the user type (see expert vs. nonexpert setting) and (2) in the type and length of input text. Considering these nuances during evaluation makes a comparison of approaches difficult. However, it is necessary, as the comparison would be unfair otherwise. For instance, citation recommendation sytems using only text from an abstract perform differently than ones based on a paper's full content (see the MAP@all score of 0.16 [96] vs. 0.64 [95]). In contrast to that, the difference in the usability of systems for different user types can be assessed via online evaluations and user studies.

Discussion
Based on the given observations, we propose the following suggestions for an improved evaluation of citation recommendation systems: Concerning offline evaluations In the main, nDCG, MRR, MAP, and recall have been used as the evaluation metric in existing offline evaluations. We recommend using them for the top k recommendations with a rather low value for k (e.g., k = 5 or k = 10) as in [70,99,71,63], since it is in our view realistic to return only very few recommendations to the user per citation context (and not using e.g., nDCG@50, and nDCG@75 as in [68] or MAP@100 as in [146]). [144] agrees with us that it is hard to specify for each citation context how many recommended citations should be returned and notes that for simplicity, the average number of citations per paper could be set as k (e.g., 11 in [144]), if the whole input document is considered. Common evaluation metrics used for citation recommendation reflect the reality only in the sense of the citing behavior observed in the past, but not alternatively valid citations. So far, only a few citation recommendation systems have been evaluated based on alternative offline evaluation metrics (see "relaxed citation re-prediction" in Section 5.1). For instance, the precision metric is softened and papers are also assessed as a hit if they are only related to the cited publications in the citation graph. We argue that such metrics need to be taken with care in the light that citation recommendation aims to back up specific claims and concepts.
Concerning online evaluations and user studies As outlined in Section 5.1, user studies and online evaluations are so far missing in the context of citation recommendation, while offline evaluations predominate. The situation is therefore similar to the situation in the field of paper recommendation [14]. Similar to [17], we recommend performing user studies and online evaluations as necessary steps in the future. This might be particularly fruitful (1) for determining a reasonable ratio of citations per document (cf. cite-worthiness of contexts), and (2) for assessing the relevance of alternative citations, which can be even more relevant than the original citations. 34 Differentiating and automatically determining different levels of relevance seems to be necessary to address this issue, as outlined by [136]. Studies on the importance and grading of citations are rare (see Section 2.4.3), and, to the best of our knowledge, there are no user assessment studies on assessing alternative papers in the context of (personalized or unpersonalized) citation recommendation.
Concerning citing biases In order to minimize the biases in the citing behavior, the corpora used for training and testing might need to be changed. For instance, only those publications might be considered for which a high degree of fairness can be guaranteed. Single publications could be classified in this respect and might receive a confidence value concerning biases [121].
To not introduce a citing bias via recommending specific papers, citation recommendation systems should use large paper collections (see Sec. 4) and the information which recommendation algorithm and candidate papers are used, should be made available to the user.
Concerning scenario specificity Similar to paper recommender systems [14], the evaluation results of citation recommendation approaches are often not reproducible, since the data sets are not available and/or many important details of the implementation are omitted in the papers due to constraints such as page limitations [12]. Therefore, we recommend making evaluation data sets, the implementation of the system, and the calculation of evaluation metrics as transparent as possible. Also the targeted scenario (see Section 2.2) and use case characteristics should be clearly visible.

Potential Future Work
There are still many variations of the architecture and of the input and output of citation recommendation systems which have not been considered yet. More specifically, we can think of the following adaptations to enhance citation recommendation: -Inserting a minimal, yet, sufficient (optimal) set of citations; -Given an input text with already present citations, suggesting newer/better ones to update some obsolete/poor citations; -Topically diversifying recommended citations [34]; -Combating the cold-start problem for freshly published papers which are not yet cited, hence no training data is available on them; -Incorporating information on social networks among researchers and considering knowledge sharing platforms; -Focusing on specific user groups, which have a given preknowledge in common (see our listed scenarios in Sec. 2.2); -Recommending papers which state similar, related, or contrary claims as the ones in the citation contexts; -Studying the influences of citing behavior on citation recommendation systems and developing methods for minimizing citing biases (cf. Section 5.2); -Developing global context-aware citation recommendation approaches, i.e., approaches that recommend citations in a context-aware way, yet consider the whole paper; -Recommending citations refuting an argument (using argumentation mining); -Evaluating citation recommendation approaches on different disciplines (outside computer science).
Besides these concrete future works, we can think of the following visions in the long term, which embrace a new process of citing in the future: 1. One can envision that, in the mid-term, citation recommendation approaches can better capture the semantics of the citation context, with the result that actual fact-based citation recommendation becomes reality. This describes the opportunity of obtaining precise citation recommendations, since both the claims in the citation context and the claims in the candidate cited documents are represented explicitly in a semantically-structured form. In this sense, citation recommendation systems might be capable of not only citing publications, but any knowledge (in particular, facts and events) available on the Web. This vision becomes particularly feasible in light of the Linked Open Data (LOD) cloud and is in line with research on LOD-based recommender systems [114]. 2. One can envision that the working style of researchers will dramatically change in the next few decades [32,109]. As a result, one might think not only of citation recommendation as considered in this article, but based on the expected or potential characteristics of scientific publishing. For instance, one can imagine that publications will not be published in PDF format any more, but in either an annotated version of it (with information about the hypotheses, the methods, the data sets, the evaluation setup, and the evaluation results), or in the form of a flexible publication form (e.g., subversioning system), in which authors can change the content, especially the citations, since over the time citations might become obsolete or new citations might become relevant.

Conclusions and Outlook
In this survey, we gave a profound overview of the research field of citation recommendation. To that end, we firstly introduced citation recommendation via outlining possible scenarios and via a description of the task. We saw that the approaches to context-aware citation recommendation can be grouped into hand-crafted feature-based models, topic models, machine translation models, and neural network models. The approaches do not only differ w.r.t. the underlying method, but also w.r.t. the provided input data. More specifically, the considered set-ups differ in the use of a user model, the prefiltering of candidate papers, the length of the citation context, whether citation placeholders are provided, and whether the content of cited papers is needed. Concerning the evaluation, the approaches are evaluated based on very diverse metrics and different data sets, making it hard to assess the validity and advance of single approaches. Moreover, approaches are often compared to existing approaches to a limited extent. We also considered the data sets that can be used for deploying and evaluating citation recommendation. We distinguished between corpora containing papers' content and corpora providing papers' metadata. Here we learned that several corpora exist, especially in the field of computer science. However, the data sets differ considerably in their size and in their quality (e.g., noise due to information extraction).
Concerning the challenges of evaluating citation recommendation and the evaluation methods used so far, we found out that biases in the citing behavior have largely been ignored, as well as the "worthiness" to cite at all or in specific circumstances. Assessing citation recommendations might also depend on the scientific discipline and on the concrete use case. Approaches have been evaluated rather unilaterally and not across disciplines.
Upcoming approaches on citation recommendation are likely to be based on more advanced techniques of machine learning, such as variants of recurrent neural networks. In the long term, one can envision that citation recommendation approaches can better capture the semantics of the citation context, with the result that actual fact-based citation recommendation becomes reality. Given the likely continuation and proliferation of the "tsunami" of publications and of citations in the years and decades to come, we can assume that citation recommendation will become an integral component of a researcher's working environment.