Introduction

Citations are an essential tool for scientific practice. By allowing authors to refer to existing publications, citations make it possible to position one’s work within the context of others’, critique, compare, and point readers to supplementary reading material. In other words, citations enable scientific discourse. Because of this, citations are a valuable indicator for the academic community’s reception of and interaction with published works. Their analysis is used, for example, to quantify research output [17], qualify references [1], and detect trends [5]. Furthermore, citations can be utilized to aid researchers through, for example, summarization [11] or recommendation [12, 33] of papers, and through applications driven by document embeddings in general [7].

As such analyses and applications require data to be based on, the availability of citation data or lack thereof is decisive with regard to the areas, in which respective insights can be gained and approaches developed. Here, the literature points in two major directions of lacking coverage—namely the humanities [9, 24] and non-English publications [30, 36, 38, 46]. Because most large scholarly data sets are either artificially limited to few languages (e.g., English only) or do not provide language metadata, a particular practice not well researched so far is cross-lingual citation. That is, references where the citing and cited documents are written in different languages (see (vi) in Fig. 1). Cross-lingual citations are, however, important bridges between otherwise insufficiently connected “language silos” [38, 42].

Fig. 1
figure 1

Schematic explanation of terminology

Because English is currently the de facto academic lingua franca [37], citations from non-English languages to English are significantly more prevalent than the other way around. This dichotomy is reflected in existing literature, where usually either citations from English [24, 29], or to English [20, 21, 41, 44] are analyzed. As both directions involve a non-English document on one side of the citation, the analysis of either is challenging with today’s Anglocentric state of citation data.

Setting our focus to cross-lingual citations from English, we perform a large-scale analysis on over one million documents. In line with existing literature, we determine the prevalence of cross-lingual citations across multiple dimensions. Additionally, we investigate the citation’s usage as well as impact. In particular, the following research questions are addressed.

  1. RQ1)

    How prevalent are English to non-English references? We consider prevalence in general, in different disciplines, across time, and within publications that use them.

  2. RQ2)

    In what circumstances are cross-lingual citations in English papers used? Here, we consider self-citation, geographic origin, as well as citation function and sentiment.

  3. RQ3)

    What is the impact of cross-lingual citations in English documents? We consider the aspects of acceptance, data mining challenges, as well as impact on the success of a publication.

Through our analysis, we make the following contributions.

  1. 1.

    We conduct an analysis of cross-lingual citations in English papers that is considerably more extensive than existing literature in terms of corpus size as well as covered languages, time, and disciplines. This not only makes the results more representative of the areas covered, but also enables the use of our collected data for machine learning-based applications such as cross-lingual citation recommendation.

  2. 2.

    We propose an easy and reliable method for identifying cross-lingual citations from English papers to publications in non-Latin script languages (e.g., Russian and Chinese).

  3. 3.

    We highlight key challenges for handling cross-lingual citations that can inform future developments in scholarly data mining.

  4. 4.

    To facilitate further research, we make our collected data, source code, and full results publicly available.Footnote 1

The remainder of the paper is structured as follows. After briefly addressing our use of terminology down below, we give an overview of related work in Sect. 2. In Sect. 3, we discuss the identification of cross-lingual citations, data sources considered, and our data collection process. Subsequent analyses with regard to our research questions are then covered in Sect. 4. We end with a discussion of our findings and concluding remarks in Sect. 5.

Terminology

Because citation, reference and related terms are not used consistently in the literature, we briefly address their use in this paper. As shown in Fig. 1, a citing document creates a bibliographical link to a cited document. We use the terms citation and reference interchangeably for this type of link (e.g., “(vi) in Fig. 1 marks a cross-lingual reference,” or “Paper\(^a\) makes two citations”). The textual manifestation of a bibliographic reference, often found at the end of a paper (e.g., “[1] Smith” in Fig. 1), is referred to as reference section entry, or sometimes reference for short. We call the combined set of these entries reference section. Lastly, parts within the text of a paper, which contain a marker connected to one of the reference section entries, are called in-text citations.

Related work

Table 1 Comparison of corpora

Existing literature on cross-lingual citations in academic publications covers analyses as well as approaches to prediction tasks. These are, however, only based on small corpora or restricted to specific language pairs. As shown in Table 1, our work is based on a considerably larger corpus which is also more comprehensive in terms of the time span and disciplines that are covered.

In the following, we describe the works in Table 1 in more detail, reporting on the key corpus characteristics and findings. This is complemented by a short overview of existing literature on various types of cross-lingual interconnections in media other than academic publications.

Cross-lingual citations in academic publications

The literature concerning cross-lingual citations in academic publications can be found in the form of analyses and applications. In [24], Kellsey and Knievel conduct an analysis of 468 articles containing 16,138 citations. The analysis spans 4 English language journals in the humanities (disciplines: history, classics, linguistics, and philosophy) over 5 particular years (1962, 1972, 1982, 1992, and 2002). They count cross-lingual citations to English, German, French, Italian, Spanish, Portuguese, and Latin, while further languages are grouped into a category “other.” The authors find that 21.3% of the citations in their corpus are cross-lingual, but note strong differences between the covered disciplines. Over time, they observe a steady total, but declining relative number of cross-lingual citations per article. The authors furthermore find that the ratio of publications that contain at least one cross-lingual citation is increasing.

Lillis et al. [29] investigate if the global status of English is impacting the “citability” of non-English works in English publications. They base their analysis on 240 articles from 2000 to 2007 in psychology journals and furthermore use the Social Sciences Citation Index and ethnographic records. Their corpus contains 10,688 references, of which 8.5% are cross-lingual. Analyzing the prevalence of references in various contexts, they find that authors are more likely to cite a “local language” in English-medium national journals than in international journals. Further conducting analyses of, e.g., in-text citation surface forms, they come to the conclusion that there are strong indicators for a pressure to cite English rather than non-English publications.

Similar observations are made by Kirchik et al. [27] concerning citations to Russian. Analyzing 498,221 papers in Thomson Reuters’ Web of Science between 1993 and 2010, they find that Russian scholars are more than twice as likely to cite Russian publications when publishing in Russian language journals (21% of citations) than when they publish in English (10% of citations).

In [41], Schrader analyzes citations from non-English documents to English articles in open access and “traditional” journals. The corpus used comprises 403 cited articles published between 2011 and 2012 in the discipline of library and information science. The articles were cited 5,183 times (13.8% by non-English documents). In their analysis, the author observes that being open access makes no statistically significant difference for the ratio of incoming cross-lingual citations of an article, or the language composition of citations a journal receives.

Apart from analyses, there are also approaches to prediction tasks based on cross-lingual citations [20, 21, 33, 44]. Tang et al. [44] propose a bilingual context-citation embedding algorithm for the task of predicting suitable citations to English publications in Chinese sentences. To train and evaluate their approach, they use 2,061 articles from 2002 to 2012 in the Chinese Journal of Computers, which contain citations to 17,693 English publications. Comparing to several baseline methods, they observe the best performance for their novel system. Similarly, in [20] and [21] Jiang et al. propose two novel document embedding methods jointly learned on publication content and citation relations. The corpus used in both cases consists of 14,631 Chinese computer science papers from the Wanfang digital library. The papers contain 11,252 references to Chinese publications and 27,101 references to English publications. For the task of predicting a list of suitable English language references for a Chinese query document, both approaches are reported to outperform a range of baseline methods.

Cross-lingual interconnections in other types of media

Apart from academic publications, cross-lingual connections are also described in other types of media. Hale [15] analyzes cross-lingual hyperlinks between online blogs centered around a news event in 2010. In a corpus of 113,117 blog pages in English, Spanish, and Japanese, 12,527 hyperlinks (5.6% of them cross-lingual) are identified. Analysis finds that less than 2% of links in English blogs are cross-lingual, while the number in Spanish and Japanese blogs is slightly above 10%. Hyperlinks between Spanish and Japanese are almost non-existent (7 in total). Further investigating the development of links over time, the author observes a gradual decrease in language group insularity driven by individual translations of blog content—a phenomenon described as “bridgeblogging” by Zuckerman [48]. Similar structural features are reported by Eleta et al. [10] and Hale [14] for Twitter, where multilingual users are bridging language communities.

Focusing on types of information diffusion that are not textually manifested through connections such as bibliographic references and hyperlinks, there also is literature on cross-lingual phenomena on collaborative online platforms, such as the study of cross-lingual information diffusion on Wikipedia [26, 40].

Lastly, as with academic publications, there furthermore exists literature on link prediction tasks. In [22], Jin et al. analyze cross-lingual information cascades and develop a machine learning approach based on language and content features to predict the size and language distribution of such cascades.

Data collection

In this section, we first discuss how to identify cross-lingual citations. Subsequently, we outline the steps of data source selection and corpus construction. Lastly, we describe the key characteristics of our corpus.

Identification of cross-lingual citations

Identifying cross-lingual citations requires information about the language of the citing and cited document. However, this is often missing in scholarly data sets.Footnote 2 Identifying the involved documents’ language when it is not given in metadata, however, is challenging, because (a) the full text, especially of the cited documents, is not always available, (b) abstracts are not reliable because non-English publications often provide an additional English abstract, and (c) language identification on short strings (e.g., titles in references) does not achieve sufficient results with existing techniques [19].

To nevertheless be able to conduct an analysis of cross-lingual citations on a large scale, we utilize the common practice of authors appending an explicit marker in the form of “(in <Language>)” to such references. This shifts the requirements from language metadata or language identification to the existence of reference section entries in the data. This is because the language of the cited document is given by the “<Language>” part of the marker, and the language the marker itself is written in (i.e., English) provides the citing document’s language. For example, the reference section entry “M. Saitou, ‘Hydrodynamics on non-commutative space’ (in Japanese), [...]”Footnote 3 by itself contains enough information to determine that the cited document is written in Japanese and the citing document is written in English.

The question then remains, how common the practice of using such explicit markers is—that is, to cite, for example, “A Modern Model Description of Magnetism (in Russian)” instead of .Footnote 4 To answer this question, we perform a preliminary analysis on the data set unarXive [39], which comprises 39 million reference section entries. Specifically, we conduct a large automated analysis on all reference section entries in the data set and additionally perform a smaller, manual analysis on a stratified sample of 5,000 references.

In the large automated analysis, we first identify the cited document’s title within references using the state-of-the-art [45] reference string parser module of GROBID [32] and then determine the title’s language using the language identification tool Lingua,Footnote 5 which is specialized for very short text. Manually inspecting our results, we note that non-Latin script languages (e.g., Chinese, Japanese, Russian) are detected reliably,Footnote 6 but Latin script languages (e.g., German and French) are not. For instance, many English titles are falsely identified as German.

Table 2 References to non-Latin script languages in the automated analysis

For non-Latin script languages, which we is shown in Table 2, only a small fraction of cross-lingual citations is not explicitly marked. We observe ratios of unmarked cross-lingual citations relative to explicit markers consistently below 2%.Footnote 7

Table 3 Results of manual labeling

To get a reliable estimate for Latin script languages as well, we additionally perform a smaller, manual analysis. To this end, we label a stratified sampleFootnote 8 of 5,000 references from unarXive with the reference’s language as well as whether an explicit language marker was used or not. The results of our evaluation are shown in Table 3. In accordance with our automated large analysis, we observe that non-Latin script languages are generally explicitly marked. For Latin script languages, however, explicit marking appears to be considerably less common. We additionally evaluate the automated language identification results for our manually annotated references and measure F1 scores of 0.48, 0.46, and 0.60 for French, German, and Italian, respectively. Notably, less than half of the references with German titles are detected (44% recall) and more than half of the references identified as German are false positives (48% precision).

The results of above preliminary investigations have two consequences for the findings in our main analyses, which are based on explicit language markers. First, a direct comparison between our results on non-Latin and Latin script languages is only valid for explicitly marked cross-lingual citations, as there is a notable amount of undetected cross-lingual citations for Latin script languages. Second, the number of undetected cross-lingual citations for non-Latin script languages such as Chinese, Japanese, and Russian, is negligible. Accordingly, concerning these languages, our results are valid for cross-lingual citations regardless of language markers.

Data source selection

Table 4 Overview of data sets

As our data source, we considered five large scholarly data sets commonly used for citation related tasks [12, 25]. Table 4 gives an overview of their key properties. The Microsoft Academic Graph (MAG) and CORE are both very large data sets with some form of language metadata present. In the MAG, the language is given not for documents themselves, but for URLs associated with papers. CORE contains a language label for 1.79% of its documents. S2ORC, the PubMed Central Open Access Subset (PMC OAS), and unarXive do not offer language metadata, but all contain some form of reference sections (GROBID output, JATS [18] XML, and raw strings extracted from LaTeX source files, respectively).

From these five, we decided to use unarXive and the MAG. This decision was motivated by two key reasons: (1) metadata of cited documents, and (2) evaluation of the acceptance of cross-lingual citations in English papers. As for (1), both S2ORC and the PMC OAS link references in their papers to document IDs within the data set itself (only partly in the PMC OAS, where also MEDLINE IDs and DOIs are found [13]). This is problematic in our case, because S2ORC is restricted to English papers, and the PMC OAS is constrained to Latin script contents,Footnote 9 which means metadata on non-English cited documents is non-existent (S2ORC) or very limited (PMC OAS). In unarXive, on the other hand, references are linked to the MAG, which contains metadata on publications regardless of language. Concerning reason (2), the fact that unarXive is built from papers on the preprint server arxiv.org, and the MAG contains metadata on paper’s preprint and published versions, allows us to analyze whether or not cross-lingual citations are affected by the peer review process.

With these two data sources selected, the extent of our analysis is over one million documents, across 3 disciplines (physics, mathematics, computer science), over a span of 27 years (1992–2019).

Data Collection

To identify references with “(in <Language>)” markers, we iterate through the total of 39.7M reference section entries in unarXive and first filter for the regular expression . This yields 51,380 matches with 207 unique tokens following “in” within the parentheses. Within these 207 tokens, we manually remove those referring to non-languages (e.g., “press” or “preparation”) and correct misspellings (e.g., “japanease” or “russain”), resulting in 44 unique language tokens. These are (presented in ISO 639-1 codes) be, bg, ca, cs, da, de, el, en, eo, es, et, fa, fi, fr, he, hi, hr, hu, hy, id, is, it, ja, ka, ko, la, lv, mk, mr, nl, no, pl, pt, ro, ru, sa, sk, sl, sr, sv, tr, uk, vi, and zh. These 44 languages cover 43 of the 78 languages, in which journals indexed in the Directory of Open Access JournalsFootnote 10 (DOAJ) are published as of July 2020. The one language found in our data, but with no journal in the DOAJ, is Marathi. In terms of journal count by language, above 44 languages cover 97.54% of the DOAJ. In total, our data contain 33,290 reference section entries in 18,171 unique citing documents. We refer to this set of documents as the cross-lingual set.

To analyze differences between papers containing cross-lingual citations in unarXive and a comparable random set, we also generate a second set of papers. To ensure comparability, we go through each year of the cross-lingual set, note the number of documents per discipline and then randomly sample the same number of documents from all of unarXive within this year and discipline. This means the cross-lingual set and the random set have the same document distribution across years and disciplines. Table 5 gives an overview of the resulting data used.

Results

In the following, we describe the results of our analyses with regard to the research questions laid out in the introduction. We begin with general numbers concerning the prevalence of cross-lingual citations. These results are based on unarXive alone. This is followed by more in depth observations regarding cross-lingual citations’ usage (e.g., the underlying motivation or the citation’s function) and impact (e.g., acceptance by reviewers or challenges for data mining). These subsequent in depth analyses additionally utilize the MAG metadata.

Table 5 Overview of data used

Prevalence

We find “(in <Language>)” markers in 33,290 out of 39,694,083 reference section entries (0.08%). These appear in 18,171 out of 1,192,097 documents (1.5%)—in other words in every 66th document. Of these 18k documents, 17,223 cite one language other than English, 864 cite two, 76 three, 7 documents four, and a single document cites works in English and five further languages (Russian, French, Polish, Italian, and German). The five most common language pairs within a single document are Russian–Ukrainian (277 documents), German–Russian (166), French–Russian (135), French–German (68), and Chinese–Russian (59).

Table 6 Most prevalent languages

Table 6 shows the absolute number of reference section entries and unique citing documents for the five most prevalent languages, which combined make up over 90% in terms of both references and documents. As we can see, Russian is by far the most common, making up about two-thirds of the cross-lingual set. When breaking down these numbers by year or discipline, it is important to also factor in the distribution of documents along these dimensions in the whole data set. Doing so, we show in Fig. 2 the relative number of documents with cross-lingual citations over time for each of the aforementioned five languages. While the numbers in earlier years can be a bit unstable due to low numbers of total documents, we can observe a downwards trend of citations to Russian, an upwards trend of citations to Chinese, and a somewhat stable proportion in documents citing Japanese works. Looking at the numbers per discipline in Fig. 3, we can see that cross-lingual citations occur most often in mathematics papers and are about half as common in physics and computer science.

Fig. 2
figure 2

Relative number of documents citing Russian, Chinese, Japanese, German, and French works. Showing all aforementioned in the bottom right

Fig. 3
figure 3

Relative number of mathematics, physics, and computer science documents citing non-English works

Fig. 4
figure 4

“Cross-linguality” of reference sections by discipline

Lastly, within the reference section of a document that has at least one cross-lingual citation, the mean value of “cross-linguality” (i.e., what portion of the reference section is cross-lingual) is 0.083 with a standard deviation of 0.099. Breaking these numbers down by discipline, we can see in Fig. 4 that there is no large difference, although mathematics papers tend to have a slightly higher portion of cross-lingual citations. The mean values for mathematics, physics and computer science are 0.090, 0.078, and 0.080, respectively.

Regarding prevalence, we observe that in English papers in the disciplines of physics, mathematics, and computer science about 1 in 66 publications contains at least one explicitly marked citation to a non-English document. About two-thirds of these citations are to Russian documents, although in the last years there is a downwards trend with regard to Russian and an upwards trend in citations to Chinese. Furthermore, cross-lingual citations appear about twice as often in mathematics compared to physics and computer science. These observations suggest that while cross-lingual citations are not very frequent in general, they might be worth considering in applications dealing with specific disciplines and languages (e.g., citations to Russian in mathematics publications).

Usage

Regarding the usage of cross-lingual citations in English publications, we analyze four different aspects. (1) Whether or not self-citations are a driving factor, (2) to what degree the geographical origin of a cross-lingual citation is correlated with the cited document’s language, (3) what function they serve, and (4) what sentiment they express toward the cited document.

Self-citation

To assess the relative degree of self-citation when referring to publications in other languages, we compare the ratio of self-citations in (a) the cross-lingual citations within the documents of the cross-lingual set, and (b) the monolingual citations within the documents of the cross-lingual set. Comparing two sets of citations from identical documents allows us to control for confounding effects such as author specific self-citation bias.

To determine self-citation, we rely on the author metadata in the MAG and therefore require both the citing and cited document of a reference to have a MAG ID. Within the cross-lingual set, this is the case for 3,370 cross-lingual references and 264,341 monolingual references. While at first, we strictly determine a self-citation by author IDs in the MAG being identical, manual inspection of matches and non-matches reveals, that author disambiguation within the MAG is somewhat lacking—that is, in a non-trivial amount of cases there are several IDs for a single author. We therefore measure self-citation by two metrics. A strict metric which only counts a match of MAG IDs, and a loose metric which counts an overlap of the sets of author names on both ends of the reference as a self-citation.

Table 7 Self-citations

Table 7 shows that going by the strict metric, self-citation is twice as common in monolingual citations. Applying the loose metric, however, self-citation appears to be slightly more common in cross-lingual citations. The larger discrepancy between the results of the strict and loose metric for cross-lingual citations suggests that authors publishing in multiple languages might be less well disambiguated in the MAG. With regard to self-citation being a motivating factor for cross-lingual citations—be it, for example, due to the need to reference one’s own prior work—we can note that our data do not suggest this to be the case. Authors using cross-lingual citations appear to be at least equally as likely to self-cite when referencing English works.

Fig. 5
figure 5

Geographic origin of cross-lingual citations to the ten most cited languages (absolute count)

Fig. 6
figure 6

Geographic origin of cross-lingual citations to the ten most cited languages(relative count)

Geographical origin

In this section, we analyze the geographical origin of cross-lingual citations. As a measure for geographical origin, we use the country in which a citing author’s affiliation is located. We refer to a citation as being to a “local language” or of “local origin,” if the cited document’s language is the most commonly spoken language in the affiliation’s location. An example of this would be a researcher affiliated with a research institution located in Russia, being the author of a paper in which they cite a publication written in Russian.

For our analysis, we rely on author affiliation metadata in the MAG. We start off with all documents in the cross-lingual set that have a MAG ID.Footnote 11 From those, we select all which provide information on the authors’ affiliations.Footnote 12 This leaves us with 7,522 out of 16,300 papers. To associate an author’s affiliation with a language, we use the most commonly spoken language in the country or territory.Footnote 13 Grouping affiliations by language, we can then view the correlation of (a) cited languages and (b) language grouped affiliations in two ways. On the one hand, we can see for each cited language how many of the citations are of local origin—compared to, for example, from an English speaking country. On the other hand, we can see for each language group of affiliations how many cross-lingual citations are to a local language. Our results of this analysis are shown for the 10 most commonly cited languages in Figs. 5 and 6, and for all identified cited languages in Appendix A.

Figure 5 shows citation numbers in absolute terms. Looking, for example, at citations to Russian publications (the bottom row of the figure), we can see that the largest amount of citations originates from Russian speaking countries (5,599 out of 18,672) followed by English speaking countries (4,535) and German speaking countries (1,427).

In Fig. 6, we show relative numbers per cited language. That is, the values of each row add up to 1. Here, we can see that citations to Japanese, Polish and particularly Portuguese appear to be of local origin comparatively often, with 68% for Japanese, 64% for Polish and 86% for Portuguese. Overall, we observe that cross-lingual citations are most often either of local origin or from an English speaking country. Evaluated over all languages, 37% of cross-lingual citations are local (the diagonal in Figs. 5 and 6), while 26% are from the Anglosphere (the “en” column in Figs. 5 and 6).

Fig. 7
figure 7

Geographic origin of cross-lingual citations (local vs. English speaking countries). Marker size (surface area) indicates number of citations

In Fig. 7, we jointly visualize how “locally” cited each language in our corpus is (x-axis) compared to which portion of citations originate from English speaking countries (y-axis). Overall, we observe larger variation on the “locality” dimension (values ranging from 0 to 1 with a variance of 0.058) than on the “from English speaking countries” dimension (values from 0 to 0.67 with a variance of 0.026). Looking at non-Latin script languages, we can see that Cyrillic script languages (e.g., Russian and Ukrainian) are less often of local origin than Asian languages (Chinese, Japanese, Korean) or languages written in Arabic script (PersianFootnote 14). Narrowing down on above-mentioned three Asian languages, we observe that for Chinese the relative portion of citations from English-speaking countries (0.41) is more than double of the same measure for Japanese (0.19), which is more than triple the value for Korean (0.06). The comparatively high ratio for Chinese (not just among Asian languages but overallFootnote 15) could be taken as an indication for two phenomena: first, an increased relevance of publications written in Chinese (i.e., a higher necessity to cite) and second, an increased rate of scholars able to read Chinese in English speaking country research institutions (i.e., a higher probability of the ability to cite).

Citation intent and sentiment

To assess whether or not cross-lingual citations tend to serve a different purpose than their monolingual counterpart, and whether or not authors have a different disposition toward cited works, we analyze the in-text citations (see Fig. 1) in our corpus.

The analysis of in-text citations—commonly referred to as citation context analysis—is concerned with the textual context of citations [16]. Two tasks in citation context analysis are the classification of citation intent (also referred to as citation function) and citation sentiment (also referred to as citation polarity) [16]. Citation intent can reveal why an author added a reference, while the citation sentiment can give insight into the author’s disposition toward that reference. Both citation intent and sentiment have been used in a number of diverse tasks, such as classification [4, 8, 23], summarization [6], and citation recommendation [12]. For citation intent, many schemes have been proposed to classify different functions, ranging from fine-grained to coarse-grained schemes. A partial overview of these can be found in Hernández-Alvarez [16], Jurgens et al. [23], Cohan et al. [8], and Lauscher et al. [28]. These schemes, however, are often domain-specific and too fine-grained [8]. Jurgens et al. [23] proposed a unified scheme of previous work (with six categories), while Cohan et al. [8] proposed a more generalized scheme (with three categories) that works for multiple domains. Recently, Lauscher et al.  [28] expanded these schemes to multi-sentence and multi-label citation contexts. Given the number of diverse domains on arXive, we adopt the general scheme by Cohan et al. [8]. For citation sentiment, a three category scheme (positive, negative, or neutral) is widely adopted [1, 3, 16]. Previous approaches to citation intent and sentiment classification have used either hand-crafted rules or classical machine learning models [1, 23], while more recent approaches using deep learning and word embeddings have demonstrated significant improvements in performance [4, 8, 28].

For our analysis, we create two, equally sized sets of in-text citations. The in-text x-ling set (cross-lingual) and the in-text mono set (monolingual). In the following, we describe the creation of both sets, the classifier model training, and our results for citation intent and sentiment classification.

Data Preparation For the in-text x-ling set, we determine all in-text citations associated with the references in the cross-lingual set. This yields 45,516 in-text citations for our 33,290 cross-lingual references. The in-text mono set is then created by extracting in-text citations associated with adjacent monolingual references. We illustrate this process in Fig. 8, showing a paper with a single cross-lingual reference for which, accordingly, a single adjacent monolingual reference would be determined and its associated in-text citations (indicated by the two blue markers above) extracted. For in-text mono, we extract 53,177 in-text citations (i.e., on average more in-text citations per reference) which we reduce to 45,516 through stratified sampling. By sourcing our monolingual in-text citations for comparison from the same papers, we avoid confounding effects such as author specific differences in citation styles.

Fig. 8
figure 8

Schematic explanation of an adjacent monolingual reference

As a citing sentence can contain more than one citation marker, it is possible that the in-text citations associated with two adjacent reference section entries appear within the same sentence (e.g., as indicated in the second “text” line in Fig. 8). This is the case for 10,454 of the in-text citations we extracted (i.e., these appear in both sets). We define them as a third set called mixed, leaving in-text x-ling and in-text mono at 35,062 items each.

Table 8 Class distribution and evaluation details for the model training

Model Training Training data for citation sentiment and intent classification regarding papers cannot easily be crowdsourced, because domain knowledge is needed for annotation. As a consequence, available data sets are comparatively small. We identify SciCite [8] for citation intent and the data set proposed by Athar [3] for citation sentiment as most appropriate for our purposes.

  • SciCite contains 11,020 citations that originate from the Semantic Scholar corpus, which covers several disciplines such as computer science, molecular biology, microbiology and neuroscience [2]. Citations in SciCite are labeled regarding their intent across three categories, namely Background, Method, and Result. The class distribution can be seen in Table 8. We select the data set because it is currently the largest available, and classifiers trained on the data set achieve good performance.

  • The data set created by Athar contains 8,736 annotated citations from 310 research papers. To the best of our knowledge, it is the largest citation sentiment data set currently available. Following [35], we manually remove 809 items from the data set that are either duplicates or too short to be accurately evaluated regarding their sentiment. The resulting data set, which we refer to as Athar from hereon, contains 7,927 citations annotated with one of the three labels Negative, Neutral, and Positive. Citations labeled Negative and Positive are comparably infrequent in the corpus (see Table 8), which makes classifying them more difficult. As possible mitigation strategies, we consider the following options.

    • Athar\(^{\dagger }\): balancing the data by under-sampling.

    • Athar\(^{\mathsection }\): removing the Negative class, as its low performance (see Table 8) puts its informativeness into question.

    • Athar\(^{\ddagger }\): both of the aforementioned.

For each of our classification models, we fine-tune SciBERT [4], a pre-trained language model for scientific text that achieves state-of-the-art performance on sentence classification tasks.

Our evaluation results are shown in Table 8. On both SciCite and Athar our models perform on par with the best performing models presented in their respective publications. For citation intent, we achieve an F1 score of 86.6% and relatively similar performance across classes. For citation sentiment, we achieve an F1 score of 67.9% on the original Athar data set. Two of our three class imbalance mitigation strategies (Athar\(^{\mathsection }\) and Athar\(^{\ddagger }\)) result in an increase in the F1 score to over 80%. Of those two, we decide to use the model trained on Athar\(^{\ddagger }\). While training on Athar\(^{\mathsection }\) gives us a slightly higher F1 score, the model trained on Athar\(^{\ddagger }\) achieves high precision and recall for positive citations—which are presumably less common—while also maintaining good performance for neural citations. Implementation details for the model training can be found in Appendix B.

Table 9 Citation intent and sentiment classification results for cross-lingual, monolingual, and mixed in-text citations. (Values are the number of citations per class followed by the percentage in brackets)
Fig. 9
figure 9

Comparison of citation intent distribution across arXiv categories for in-text x-ling (left) and in-text mono (right)

Classification Results Based on above evaluation, we proceed by using our models trained on SciCite, Athar, and Athar\(^{\ddagger }\) to classify the intent and sentiment of citations in in-text x-ling and in-text mono. In Table 9, we show the classification results for citation intent (top half) and sentiment (bottom half). The classifiers trained on SciCite and Athar appear to amplify the unbalanced data distribution they were trained on to some degree. Comparing the sentiment classifiers trained on the original Athar and balanced Athar\(^{\ddagger }\) data set, we see that citations classified as Positive increase from around 3% to almost 38%. We take this as a clear sign that reliably distinguishing neutral from positive citations remains a challenge even with state-of-the art models and training data.

Comparing our results across the data sets in-text x-ling, in-text mono, and in-text mixed we see that in terms of both intent and sentiment class distributions are similar. Taking a closer look at citation intent across the scientific disciplines,Footnote 16 we can see in Fig. 9 that the distributions are overall comparable among disciplines and between cross- and monolingual citations, with mathematics showing a slightly higher use of background citations.

Overall, our results for citation sentiment and intent show no distinct differences between cross- and monolingual citations. This can be taken as an indication for two things. First, that authors cite existing literature with a certain intent and sentiment regardless of the cited work’s language. Second, that cross-lingual—while occurring less frequent—serve the same functions as monolingual citations and are therefore not less significant.

Impact

Regarding the impact of cross-lingual citations, we analyze whether cross-lingual citations in English papers are seen as an “acceptable” practice, whether or not they pose a particular challenge for citation data mining, and their potential impact on the success of the paper they’re part of. Our results concerning these three aspects are described in the following sections.

Acceptance

To assess the acceptance of cross-lingual citations by the scientific community—that is, whether or not non-English publications are deemed “citable” [29]—we analyze papers in our data that have both a preprint version as well as a published version (in a journal or conference proceedings) dated later than the preprint. This is the case for 2,982 papers. For each preprint-published paper pair, we check if there is a difference in cross-lingual citations. This gives an indication of how the process of peer review affects cross-lingual citations. We perform a manual as well as an automated analysis.Footnote 17

For the manual evaluation, we take a random sample of 100 paper pairs. We then retrieve a PDF file of both the preprint and the published version, and manually compare their reference sections. For the automated evaluation, we find that 599 of the 2.9k paper pairs have PDF source URLs given in the MAG. After automatically downloading these and parsing them with GROBID, we are left with 498 valid sets of references. For these, we identify explicitly marked cross-lingual references as described in Sect. 3 and calculate their differences.

Table 10 shows the results of our evaluations. In both, cross-lingual citations are more often removed than added, but in the majority of cases left intact. The larger volatility in the automated evaluation is likely due to parsing inconsistencies of GROBID. Our findings complement those of Lillis et al. [29], who, analyzing psychology journals, observe “some evidence that gatekeepers [...] are explicitly challenging citations in other languages.” For the fields of physics, mathematics, and computer science, we find no clear indication of a consistent in- or decreasing effect of the peer review process on cross-lingual citations.

Table 10 Changes in cross-ling. cit. between preprints and published papers

Impact on paper success

To get an indication of whether or not an English paper’s success is influenced by the fact that it contains citations to non-English documents, we compare our cross-lingual set with the random set (cf. Table 4). For both sets, we first determine the number of papers that in the MAG metadata have a published version (journal or conference proceedings) in addition to the preprint on arxiv.org. That is, we assume that papers which only have a preprint version did not make it through the peer review process. Using this measure, we observe 9,390 of 16,224 (57.88%) successful papers in the cross-lingual set, and 10,966 of 16,378 (66.96%) successful papers in the random set. Unsurprisingly, due to the higher ratio of published versions, the papers in the random set are also cited more. Table 11 shows a comparison of the average number of citations that documents in both sets received. Due to the high standard deviation in the complete sets, we also look at papers which received between 1 and 100 citations, which are comparably frequent in both sets. As we can see, in the unfiltered as well as the filtered case, documents with cross-lingual citations tend to be cited a little less. Because here we can only control for the distribution of papers across years and disciplines, and not for individual authors (as we did in the Sect. 4.2.1), there might be various confounding factors involved.

Table 11 Comparison of citations received

Impact on citation data mining

To assess if cross-lingual citations pose a particular challenge for scholarly data mining—and are therefore likely to be underrepresented in scholarly data—we compare the ratio of references that could be resolved to MAG metadata records for the cross-lingual set and the whole unarXive data set. Of the 39M references in unarXive 42.6% are resolved to a MAG ID. For the complete reference sections of the papers in the cross-lingual set (i.e., references to both non-English and English documents), the number is 45.7% (290,421 of 635,154 references). Looking only at the cross-lingual citations, the success rate of reference resolution drops to 11.2% (3,734 of 33,290 references). We interpret this as a clear indication that resolving cross-lingual references is a challenge. Possible reasons for this are, for example:

  1. 1.

    A lack of language coverage in the target data set.

    For example, if the target data set only contains records of English papers, references to non-English publications cannot be found within and resolved to that target data set.

  2. 2.

    Missing metadata in the target data set.

    For example, when there is a primary non-English as well as an alternative English title of a publication, only the former is in the target data set’s metadata, but the latter is used in the cross-lingual reference.

  3. 3.

    The use of a title translated “on the fly.”

    If a non-English publication has no alternative English title, a self translated title in a reference cannot be found in any metadata. To give an example, reference 14 in arXiv:1309.1264 titled “Hierarchy of reversible logic elements with memory” is only found in metadataFootnote 18 as .

  4. 4.

    The use of a title transliterated “on the fly.”

    Similar to an unofficial translated title, if a title is transliterated and this transliteration is not existent in metadata, the provided title is not resolvable. A concrete example of this is the third reference in arXiv:cs/9912004 titled “Daimeishi-ga Sasumono Sono Sashi-kata” which is only found in metadataFootnote 19 as .

Cases 4 and especially 3 additionally impose a challenge on human readers, as the referred documents can only be found by trying to translate or transliterate back to the original. References to non-English documents which do not have an alternative English title should therefore ideally include enough information to (a) identify the referenced document (i.e., at least the original title), and (b) a way for readers not familiar with the cited document’s language to get an idea of what is being cited (e.g., by adding a freely translated English title).Footnote 20 There are, however, situations where an original title cannot be used. Documents in PubMed Central, for example, cannot contain non-Latin scripts,Footnote 21 meaning that references to documents in Russian, Chinese, Japanese, etc., which do not have alternative English titles are inevitably a challenge for both human readers as well as data mining approaches, unless there is a DOI, URL, or similar identifier that can be referred to.

In light of this, taking a closer look at the 88.8% of unmatched references in the cross-lingual set broken down by languages, we note the following matching failure rates for the five most prevalent languages: Russian: 88.6%, Chinese: 87.0%, Japanese: 91.0%, German: 85.4%, and French: 83.2%. While all of these are high, the numbers for the three non-Latin script languages are noticeably higher than those of German and French. As can be seen with the task of resolving references—and as also indicated through our self-citation data shown in Table 7—cross-lingual citations do pose a particular challenge for scholarly data mining.

Discussion and conclusion

Utilizing two large data sets, unarXive and the MAG, we performed a large-scale analysis of citations from English papers to non-English language publications (i.e., cross-lingual citations). The data analyzed spans over one million citing publications, 3 disciplines, and 27 years. We gained insights into cross-lingual citations’ prevalence, usage and impact.

Recapitulating our key results, we find that citations to non-Latin script languages can reliably be identified by a “(in <Language>)” marker, which enables automated identification in large corpora. Between the disciplines of physics, mathematics, and computer science, cross-lingual citations appear twice as often in mathematics papers compared to the remaining two fields. Over the course of time, we see a downwards trend in citations to Russian and an upwards trend for citations to Chinese. In general, cross-lingual citations are more often of linguistically local origin than originating from English speaking countries. Citations to Chinese, however, are about twice as likely to come from the Anglosphere than citations to other languages. Concerning authors citing behavior, we observe no remarkable differences between cross- and monolingual citations in terms of self-citations, intent, and sentiment. We also see no clear indication for gatekeeping of cross-lingual citations through the process of peer review. As for the impact of cross-lingual citations on a paper’s success, we only get inconclusive results. Finally, we see clear indicators that cross-lingual citations pose challenges for scholarly data mining, such as a lower likelihood to resolve a cited document due to more complex metadata (e.g., publications having two titles, a primary non-English and an alternative English title) and shortcomings in data integration (e.g., with local citation indices).

Through our preliminary analyses (see Sect. 3.1), we identify challenges in reliably assessing cross-lingual citations to Latin script languages, preventing automated identification in large corpora. These insights can facilitate future efforts in overcoming the identified challenges. Our detailed findings regarding prevalence can help identify scenarios, in which a dedicated effort to take into account cross-lingual citations is warranted. For example, a citation-driven analysis of research trends in mathematics might benefit from being able to track “citation trails” into the realm of Russian publications. Lastly, due to the large scale of our investigation, the use of our collected data for machine learning-based applications such as cross-lingual citation recommendation is possible.

As our analysis is based on explicit language markers of cited documents, which has shown to be reliable for non-Latin script languages but only capture a small fraction of citations to Latin script languages, we want to investigate further methods for identifying cross-lingual citations, to be able perform more exhaustive analyses. Furthermore, our corpus covers publications from the fields of physics, mathematics, and computer science. While arxiv.org has extensive coverage of physics and mathematics, the share of computer science publications is currently still in a phase of rapid growth. We therefore want to expand our investigation regarding computer science publications to get more representative results, but also include additional disciplines not covered so far. Lastly, we would like to conduct complementary analyses of cross-lingual citations from non-English to English. These might be more challenging to perform on a large scale, because non-English scholarly data is not as readily available. However, such analyses are also likely to yield insights with a larger impact, as citing English language publications is rather common in other languages.