What do we know about the disruption index in scientometrics? An overview of the literature

The purpose of this paper is to provide a review of the literature on the original disruption index (DI1) and its variants in scientometrics. The DI1 has received much media attention and prompted a public debate about science policy implications, since a study published in Nature found that papers in all disciplines and patents are becoming less disruptive over time. The first part of this review explains the DI1 and its variants in detail by examining their technical and theoretical properties. The remaining parts are devoted to studies that examine the validity and the limitations of the indices. Particular focus is placed on (1) possible biases that affect disruption indices, (2) the convergent and predictive validity of disruption scores, and (3) the comparative performance of the DI1 and its variants. The review shows that, while the literature on convergent validity is not entirely conclusive, it is clear that some modified index variants, in particular DI5, show higher degrees of convergent validity than DI1. The literature draws attention to the fact that (some) disruption indices suffer from inconsistency, time-sensitive biases, and several data-induced biases. The limitations of disruption indices are highlighted and best practice guidelines are provided. The review encourages users of the index to inform themselves about the variety of DI1 variants and to apply the most appropriate variant. More research on the validity of disruption scores, as well as a more precise understanding of disruption as a theoretical construct, is needed before the indices can be used in research evaluation practice.


Introduction
Only five years have passed since the introduction of the disruption index (DI1) by Funk and Owen-Smith (2017) 1 , and it has already seen widespread application. Many researchers have used the DI1 to identify the most disruptive publications in specific disciplines and/or subdisciplines. Numerous articles, especially in the field of the life sciences, have applied the DI1 to the field-specific literature to identify disruptive publications in different disciplines: surgery (Becerra et al., 2022; Hansdorfer et al., 2021; Horen et al., 2021; Sullivan et al., 2021; Williams et al., 2021), radiology (Abu-Omar et al., 2022), breast cancer research, urology (Khusid et al., 2021), ophthalmology (Patel et al., 2022), energy security (Jiang & Liu, 2023a), and nanoscience (Kong et al., 2023). In the field of scientometrics, Bornmann and Tekles (2019b) and Bornmann et al. (2020b) tried to find the most disruptive papers published in Scientometrics with the help of (a modified version of) the DI1. The popularity of the new index is reflected not only in its application in several disciplines, but also in the recent introduction of an index variant at the journal level. Jiang and Liu (2023b) proposed the Journal Disruption Index (JDI) as an alternative to (traditional) journal-level metrics such as the Journal Impact Factor (JIF, provided by Clarivate). Furthermore, Yang, Hu, et al. (2023) and others proposed different ways to incorporate the DI1 into the evaluation of scientists' research impact.
The DI1 played a key role in two influential science of science papers published recently in Nature: (1) Wu et al. (2019) used the DI1 to investigate how the growth of team science impacts research outputs. They found that large teams tend to conduct consolidating research while small teams tend to produce disruptive publications. Although (international) cooperation is frequently seen as a key factor for scientific excellence, disruptive research seems to be connected with rather small research groups. (2) Park et al. (2023) shocked the scientific community (and beyond) with the claim that scientific papers and patents have been getting less disruptive since World War II. Using data on 45 million papers and 3.9 million patents, they report that there has been a continuous decrease in average disruption scores across all disciplines. The article made waves in and beyond the science system and prompted a public debate surrounding the question of whether and why science is running out of steam in spite of the massive expansion of the (global) science system in recent decades.

1 The authors called the disruption index the CD index.
While the finding that both patents and papers are getting »less bang per buck« is certainly spectacular, it is important not to jump unreflectively to far-reaching conclusions (or science policy actions). Park et al. (2023, p. 143) themselves point out that "even though research to date supports the validity of the CD index [referred to as DI1 in this review], it is a relatively new index of innovative activity and will benefit from future work on its behaviour and properties". Therefore, any meaningful discussion about the results of Park et al. (2023) (as well as the results of any other study involving the DI1) requires a detailed understanding of the index's properties and limitations, which have been studied in several (empirical) studies since 2019.
In order to provide detailed insights into the properties and limitations of (different variants of) the DI1, this paper provides a systematic review of the current literature on the DI1 and its modified variants. The review consists of three parts. In the first part, the technical and theoretical properties of the DI1 and its variants are explained in detail. 2 The remaining parts are devoted to studies that examine the validity and the limitations of the indices.

2 Funk and Owen-Smith (2017) created the DI1, but the idea of using citation data to identify transformative research was proposed in earlier publications. For example, Huang et al. (2013, p. 291) stated in a conference paper: "We view the process by which transformative research is recognized by the scientific community as a competition between paradigms for the attention of the scientific community. […] We claim that transformative research shifts attention of the scientific community away from the established paradigm and that this is observable as a disruption of the growth of its citations cascade. Disruption occurs when the challenger paradigm can explain new citations received by the established paradigm".

The DI1 has its roots in the literature on technological change. Tushman and Anderson (1986) distinguished between competence-enhancing and competence-destroying technological discontinuities. The defining characteristic of competence-destroying discontinuities is that mastery of the new technology fundamentally "alters the set of relevant competences within a product class" (Tushman & Anderson, 1986, p. 442). In other words, some technological innovations improve upon established technologies without replacing them, whereas others render previous technologies obsolete. Funk and Owen-Smith (2017) took manifold inspiration from the literature on technological shifts, but they were of the opinion that the dichotomy of competence-destroying or competence-enhancing technologies lacked nuance. They argued that "a new technology's influence on the status quo is a matter of degree, not categorical influence" (Funk & Owen-Smith, 2017, p. 792).
Funk and Owen-Smith (2017) also claimed that established measures of technological impact (like citation counts) only capture the magnitude of a technology's use and thus miss "the key substantive distinction between new things that are valuable because they reinforce the status quo and new things that are valuable because they challenge the existing order" (Funk & Owen-Smith, 2017, p. 793). Therefore, they created the DI1 that could take advantage of vast patent databases like the U.S. Patent Citations Data File. Since innovation is a valuable resource not just in the realm of technology (measured by patents and their citations data), but also in the realm of science (measured by publications and their citation data), the concept of disruption attracted the attention of , who were the first to apply the DI1 to the world of bibliometrics.
The DI1 is based on citation networks (Figure 1). Each citation network consists of three elements: a focal paper (FP), a set of references cited by the FP (set R), and a set of citing papers (set C). The citing papers are divided into three mutually exclusive groups. Group F (for "FP") encompasses all publications that cite the FP without citing even a single one of the FP's cited references. Publications that cite both the FP and at least one of its cited references belong in group B (for "both"), whereas group R (for "reference") consists of publications that cite at least one of the FP's cited references without citing the FP itself. N_F, N_B, and N_R represent the total number of papers in group F, B, and R, respectively.

Figure 1: Illustration based on Funk and Owen-Smith (2017).
The interpretation of N_F and N_B is rather straightforward: A large N_F indicates that the FP renders its own cited references obsolete and is thus associated with highly disruptive publications. In contrast, a large N_B is a sign of a consolidating publication because the citation impact of the FP is dependent on the citation impact of its references. The intended purpose of N_R is to weaken the disruption value of the FP, but this only works if the numerator (N_F − N_B) is positive. In case of (N_F − N_B) < 0, however, N_R actually strengthens the disruption score of the FP (in the sense of being less consolidating). Since this inconsistency poses a significant threat to the validity of disruption scores, more information on this topic will be presented in Section 4.1. The DI1 is equivalent to the following ratio:

DI1 = (N_F − N_B) / (N_F + N_B + N_R)

The DI1 has a range of -1 to 1. Negative values are supposed to indicate developmental papers whereas positive values supposedly signify disruptive papers. Two things should be kept in mind about the calculation of the DI1: First, the DI1 is based on bibliographic coupling.
Bibliographic coupling is a method that looks for publications that cite the same references.
The DI1 applies bibliographic coupling to FPs and their citing papers. Consequently, one might argue that the DI1 "can be considered as a continuity indicator more than a disruption indicator since the operation is grounded in bibliographic coupling. The bibliographic coupling of a focal paper to its references generates a representation of continuity. From this perspective, discontinuity is indicated when the bibliographic coupling is not sufficiently generating continuity" (Leydesdorff et al., 2021).
This point ties in with a second point: Following the terminology proposed by Bu et al. (2021), DI1 is a relative index because it treats disruption and consolidation as opposite concepts.
From an absolute perspective, the citation network of an FP may simultaneously contain many bibliographic coupling links (indicating consolidating science) and a large N_F (indicating disruptive science). In absolute terms, such an FP is both highly disruptive and highly consolidating. In contrast, from a relative perspective, the relationship between disruption and consolidation is a zero-sum game: no publication may be both disruptive and consolidating at the same time. For example, an article with a DI1 score of 0.5 is supposed to be more disruptive and less consolidating than an article with a DI1 score of 0, and an article with a DI1 score of 0.3 is less disruptive and more consolidating than an article with a DI1 score of 0.4.
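To make the mechanics of the index concrete, the three counts and the resulting DI1 score can be computed directly from a small citation network. The following Python sketch is a minimal illustration (the toy network and paper identifiers are invented, not taken from any of the studies reviewed):

```python
def disruption_index(fp_refs, citations):
    """Compute DI1 for a focal paper.

    fp_refs: set of papers cited by the focal paper (set R).
    citations: dict mapping each paper in the network to the set of
        papers it cites; the focal paper is identified as "FP".
    """
    n_f = n_b = n_r = 0
    for refs in citations.values():
        cites_fp = "FP" in refs
        cites_refs = bool(refs & fp_refs)
        if cites_fp and not cites_refs:
            n_f += 1          # group F: cites only the focal paper
        elif cites_fp and cites_refs:
            n_b += 1          # group B: cites the FP and its references
        elif cites_refs:
            n_r += 1          # group R: cites only the references
    return (n_f - n_b) / (n_f + n_b + n_r)

# Toy network: the FP cites r1 and r2; five later papers cite into it.
citing = {
    "p1": {"FP"},          # group F
    "p2": {"FP", "r1"},    # group B
    "p3": {"FP"},          # group F
    "p4": {"r2"},          # group R
    "p5": {"r1", "r2"},    # group R
}
print(disruption_index({"r1", "r2"}, citing))  # (2 - 1) / (2 + 1 + 2) = 0.2
```

The positive score indicates that, on balance, the FP's citing papers ignore its cited references.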

The disruption index's underlying theoretical concepts
In this section, implicit theoretical assumptions built into the DI1 (and its modified variants) are explained in relation to two important theoretical concepts: the concept of novelty and the concept of scientific revolutions. Although the literature does not provide a precise definition of the term »disruption«, it can be said with certainty that there are significant differences between the concept of »disruption« and the concept of »novelty«. Research on novelty indices predates the creation of the DI1 by a couple of years (Foster et al., 2015;Lee et al., 2015;Uzzi et al., 2013). Novelty indices are guided by the notion that creativity is no creatio ex nihilo but rather a cumulative process that manifests in atypical combinations of prior knowledge. According to Lee et al. (2015, p. 685), novelty indices were born out of a stream of research that "views creativity as an evolutionary search process across a combinatorial space and sees creativity as the novel recombination of elements". For example, researchers calculated the novelty value of papers by searching their bibliography for atypical (Uzzi et al., 2013) or unique (Wang et al., 2017) combinations of cited references.
In contrast to novelty indices, which only consider the cited references of an FP, the DI1 also considers the FP's citing papers. This is not just a technical, but also a conceptual difference.
Novelty indices focus on the origin of creative ideas in combinatorial processes. But, as Lee et al. (2015) explain, creativity is not just about the origin of ideas, it is also their usefulness and impact that matters. By also considering citing papers in its calculation, the DI1 captures not just the origin, but also the impact of new ideas. This is intuitively plausible since a novel idea that receives barely any attention from the scientific community hardly deserves to be labelled »disruptive«: "Although novelty may be necessary for disruptiveness, it is not necessarily sufficient to make something disruptive" (Bornmann et al., 2020a, p. 1256).
The conceptual difference between disruption and novelty is also reflected in empirical results. By examining a dataset on Citation Classics, Leahey et al. (2023) showed that only specific types of novelty are linked to higher disruption scores. In the Citation Classics dataset, new methods are positively associated with disruption scores, whereas new theories and new results are negatively associated with disruption scores. Shibayama and Wang (2020) investigated the relationship between two types of novelty (theoretical and methodological) and disruption scores (see Section 5.4). The study is based on data from a survey, which asked researchers to rate the theoretical and methodological originality of their own publications. Shibayama and Wang (2020) found that disruption scores are positively associated with self-assessed theoretical originality, but not with self-assessed methodological originality. Even though it is difficult to draw conclusions from two studies that employed different methods and produced seemingly contradictory results, both Shibayama and Wang (2020) and Leahey et al. (2023) highlight that only a specific subset of novel research is also disruptive research.
In addition to novelty, the DI1 also relies heavily on concepts inspired by Kuhn's theory of paradigm shifts (Kuhn, 1962). According to Kuhn, the history of science can be categorized into two repeating phases: normal science and scientific revolutions. Normal science is characterized by the modus operandi of a specific paradigm: "For Kuhn science progresses by gradual, incremental changes in a particular discipline's practice and knowledge" (Marcum, 2015, p. 143). The phase of normal science is brought to an end by sudden paradigm shifts caused by scientific breakthroughs that drastically alter the status quo. Negative (or low) disruption scores are often interpreted as representations of the consolidating nature of normal science, whereas positive (or high) disruption scores are supposed to indicate drastic scientific breakthroughs or even paradigm shifts (Bornmann et al., 2020a; Li & Chen, 2022; Liang et al., 2022; Shibayama & Wang, 2020).

Variants of the disruption index
Since the introduction of DI1 by Funk and Owen-Smith (2017), a number of researchers have suggested modified variants of the index. These variants will be explained in this section. 3 The explanations do not follow a chronological order; instead the different variants are categorized into distinct groups based on their specific type of modification.

Disruption and citation impact
The first alternative to DI1 was suggested by Funk and Owen-Smith (2017) themselves. In addition to DI1, they also constructed mDI1. The difference between the two indices is the inclusion of a weighting parameter, which captures only those citations directly linked to the FP: in their formulation, the weighted count "differs from [the unweighted count] in that the former counts only citations of the focal patent, whereas the latter includes citations of both the focal patent and its predecessors" (Funk & Owen-Smith, 2017, p. 795). Whereas DI1 "does not discriminate among inventions that influence a large stream of subsequent work and those that shape the attention of a smaller number of later inventors" (Funk & Owen-Smith, 2017, p. 795), mDI1 also accounts for the magnitude of a patent's use. Even though mDI1 has so far received little attention from researchers, the idea of an index that measures both citation impact and disruption is not without merit.
Consider the hypothetical case of two papers A and B: A and B are assigned identical DI1 scores, but A's citation impact by far surpasses B's citation impact. This in turn means that A inspired many researchers to pursue new ideas, whereas B did not have a lasting impact on the scientific community. While there are good reasons to differentiate between low and high impact research, Wei et al. (2023) argue that the influence of citation impact is too dominant in the calculation of mDI1 because of the different scaling of citation counts and disruption scores.
As an alternative to distilling citation impact and disruption values down to one number, Wei et al. (2023) constructed a two-dimensional framework that keeps the measurement of citation impact and disruption separate ( Figure 2). In this framework, publications with both high citation counts and high disruption scores are classified as revolutionary science. Articles like paper B in the example above fall in the low impact direction-changing science category because they introduce original ideas but do not gain the recognition of many researchers.
High impact incremental science represents influential consolidating research. Most articles fall into the low impact incremental science category, since they neither contain revolutionary ideas nor achieve high citation impact.

Depending on the research evaluation context, it might be worth considering not only the magnitude, but also the field-specificity of a publication's citation impact. Hypothetically, two papers A and B may have identical citation counts and disruption values, but differ greatly in the way they exert influence on the scientific community: While Paper A is a source of inspiration for scientists from many different disciplines, Paper B mainly grabs the attention of scientists working within a specific field. Since the DI1 considers all citations of the FP regardless of the disciplines the citing papers belong to, it is not able to distinguish between papers with a broad citation impact (like Paper A) and papers with a field-specific citation impact (like Paper B). 4 This is an issue if one seeks to find the most disruptive publications in a particular discipline. Therefore, Bornmann et al. (2020b) suggested an improved field-specific variant of the DI1. In order to find disruptive papers published in Scientometrics, they redefined N_B and N_R as follows:

• N_B: Number of papers citing the FP and at least l of the cited references of all Scientometrics papers published in the same year as the FP.
• N_R: Number of papers citing at least one of the cited references of all Scientometrics papers published in the same year as the FP, but not the FP itself.
The reasoning behind the addition of the threshold to N_B will be discussed in the next section. Following Bittmann et al. (2022), the field-specific versions of the DI1 will be referred to as DI1n. Compared to DI1, DI1n is based on a larger set of cited references, as it does not only consider the cited references of the FP, but all cited references of all papers published in a certain journal within a certain time window.

Dealing with noise caused by highly cited references
Recall that N_R denotes the number of publications that cite at least one of the FP's cited references, but do not cite the FP itself. Since N_R is part of the denominator, a large N_R pushes DI1 scores closer to zero (see Section 4.1). Because N_R essentially captures the citation impact of the FP's cited references within the citation network, the FP's disruption value could be biased by the number of references it cites and by the citation impact of these references.
Bornmann and Tekles (2021) explain this problem in detail: "Suppose that a focal paper cites a few highly cited papers, which are very likely to be cited by papers citing the focal paper, even if the focal paper is rather disruptive. In such a situation, the citing papers with only a few citation links to the focal paper's cited references may not be adequate indices for disruptive research" (see Section 4.3). Bornmann et al. (2020a) were the first to suggest a way to eliminate biases caused by highly cited references. They modified DI1 by implementing a threshold (l > 1), such that only those citing papers that cite at least l of the FP's cited references are considered in the calculation of the index values. Whereas the DI1 only takes into account whether or not there is at least one bibliographic coupling link between the FP and its citing papers, DIl also considers the strength of the bibliographic coupling links. More specifically, Bornmann et al. (2020a) recommend a threshold of l = 5. DI5 excludes all citing papers that cite fewer than five of the FP's cited references and thereby focuses on citing papers that rely more heavily on the FP's cited references. In the hypothetical case of an FP that cites three highly influential publications, DI5 would not consider citing papers that cite only these three publications and none of the other references cited by the FP.
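The threshold variant can be sketched by filtering citing papers on their bibliographic coupling strength before computing the index. The following is an illustrative reimplementation, not the authors' code; in particular, the handling of citing papers with an intermediate number of coupling links (here: excluded entirely, following the description above) is an assumption:

```python
def di_l(fp_refs, citations, l=5):
    """Threshold variant DI_l (sketch). Citing papers that cite the FP
    and between 1 and l-1 of its references are excluded entirely;
    this handling follows the verbal description and is an assumption."""
    n_f = n_b = n_r = 0
    for refs in citations.values():
        links = len(refs & fp_refs)       # bibliographic coupling strength
        if "FP" in refs:
            if links == 0:
                n_f += 1                  # no coupling: counts toward N_F
            elif links >= l:
                n_b += 1                  # strong coupling: counts toward N_B
            # papers with 0 < links < l are excluded
        elif links > 0:
            n_r += 1                      # cites references only
    return (n_f - n_b) / (n_f + n_b + n_r)

network = {
    "p1": {"FP"},                  # counts toward N_F
    "p2": {"FP", "r1"},            # excluded for l = 2
    "p3": {"FP", "r1", "r2"},      # counts toward N_B
    "p4": {"r1"},                  # counts toward N_R
}
print(di_l({"r1", "r2"}, network, l=2))  # (1 - 1) / (1 + 1 + 1) = 0.0
```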
Recently, Deng and Zeng (2023) suggested a different way to get rid of the noise brought about by highly cited references. Instead of excluding citing papers that do not reach a minimum threshold of bibliographic coupling links with the FP, they opted to use a threshold X such that the X% most highly cited references are excluded from the calculation. As Figure 3 illustrates, the exclusion of highly cited references could turn some red or orange citing papers into green citing papers. Deng and Zeng (2023) chose to refer to their new index by the simple name of "new disruption". To fit it in with the notation used for other variants, the new disruption will be denoted as DIX% for a threshold of X (e.g. DI1%, DI5%, DI10%, etc.).
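The exclusion step can be sketched as follows. This is an illustrative reconstruction of the approach (the rounding rule for the cutoff and all toy data are my assumptions):

```python
def di_x_percent(fp_refs, citations, ref_citations, x):
    """Sketch of the Deng & Zeng-style variant: drop the x% most highly
    cited of the FP's references, then compute DI1 on the reduced network.
    ref_citations maps each reference to its citation count (toy data)."""
    ranked = sorted(fp_refs, key=lambda ref: ref_citations[ref], reverse=True)
    n_drop = round(len(ranked) * x / 100)
    kept = set(ranked[n_drop:])           # references surviving the cut
    n_f = n_b = n_r = 0
    for refs in citations.values():
        cites_fp = "FP" in refs
        coupled = bool(refs & kept)
        if cites_fp and not coupled:
            n_f += 1
        elif cites_fp and coupled:
            n_b += 1
        elif coupled:
            n_r += 1
    return (n_f - n_b) / (n_f + n_b + n_r)

# r1 is very highly cited; with x = 50 it is excluded, so p1 "turns green".
refs = {"r1", "r2"}
counts = {"r1": 500, "r2": 4}
net = {"p1": {"FP", "r1"}, "p2": {"FP", "r2"}, "p3": {"r2"}}
print(di_x_percent(refs, net, counts, x=50))  # (1 - 1) / 3 = 0.0
print(di_x_percent(refs, net, counts, x=0))   # (0 - 2) / 3 ≈ -0.667
```

Comparing the two calls shows how removing a single highly cited reference can raise the disruption score of the focal paper.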

Variants without N_R
DIl and DIX% keep N_R, but try to eliminate some of the noise caused by highly cited references. A potential disadvantage of indices like DIl and DIX% is that they rely on arbitrary thresholds (l = 5 and X = 3). Instead of using thresholds, one could also drop N_R entirely. Dropping N_R results in indices that consider only those citing papers that cite the FP. Wu and Yan (2019) discussed an index that corresponds to DI1 but drops N_R. In line with the notation used by Bornmann et al. (2020a), indices of this type will be referred to as DI_noR.

Figure 3: Comparison of how DI1 and DIX% handle highly cited references. Illustration based on Deng and Zeng (2023). The colours green, red, and orange represent the groups F, B, and R, respectively.

DI_noR = (N_F − N_B) / (N_F + N_B)
Another approach to get rid of N_R was suggested by Bu et al. (2021) in a paper that introduced the dependency index (DEP). 5 The "DEP is defined as the average number of citation links from a paper citing the FP to the FP's cited references. A high (average) number of such citation links indicates a high dependency of citing papers on earlier work so that disruptiveness is represented by small values of DEP" (Bittmann et al., 2022, p. 1250).

DEP = N_L / N_C
In this formulation, N_L represents the total number of bibliographic coupling links between the FP and its citing papers, and N_C is the total number of citing papers. As the name suggests, the DEP measures the dependency of the FP's citing papers on earlier work.

A third variant of the DI1 without N_R is the Shibayama-Wang originality, named after its creators Shibayama and Wang (2020). They took advantage of the fact that dropping N_R allows them to construct an index that counts the actual bibliographic coupling links instead of counting the linked publications. Let x_cr = 1 if the c-th citing paper cites the r-th cited reference of the FP, and x_cr = 0 otherwise. The originality index, denoted as Orig, is calculated as follows:

Orig = (1 / (N_C × N_Ref)) × Σ_c Σ_r (1 − x_cr)

In the formula, N_C denotes the total number of the FP's citing papers and N_Ref denotes the total number of the FP's cited references. Analogously, c and r refer to a specific citing paper and a specific cited reference, respectively. Note that the originality index does not consider publications that only cite the FP's cited references, but do not cite the FP itself. The originality score ranges from 0 to 1 and is equivalent to the proportion of x_cr = 0 in the citation network (represented by green dashed lines in Figure 4). Like other index variants, the Shibayama-Wang originality is influenced by cited references with high citation counts. Shibayama and Wang (2020) also address the possibility that the number of cited references of the FP's citing papers could bias the calculation of the originality index, because citing papers with many references are more likely to cite papers in set R. To tackle these two sources of bias, Shibayama and Wang (2020) suggested two weighted versions of Orig.

Figure 4: Graphical representation of the Shibayama-Wang originality in a simple citation network. The links connecting the FP to its cited references were left out for aesthetic reasons.
In the formulas, n_c denotes the reference count of the c-th citing paper; m_r is the citation count of the r-th cited reference; and ε is an arbitrary positive number which may be chosen such that the minimum originality value equals zero.
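Both DEP and the unweighted Shibayama-Wang originality reduce to simple aggregations over the coupling indicators x_cr. A minimal sketch under the definitions above (the weighted variants are omitted; function and variable names are mine):

```python
def dep_and_originality(fp_refs, citing_papers):
    """Compute DEP (Bu et al., 2021) and the unweighted Shibayama-Wang
    originality for a focal paper. citing_papers maps each paper that
    cites the FP to the set of papers it cites."""
    n_c = len(citing_papers)              # number of citing papers
    n_ref = len(fp_refs)                  # number of the FP's cited references
    # total bibliographic coupling links, i.e. the sum of all x_cr
    links = sum(len(refs & fp_refs) for refs in citing_papers.values())
    dep = links / n_c                     # average links per citing paper
    orig = 1 - links / (n_c * n_ref)      # share of x_cr = 0 pairs
    return dep, orig

citing = {"p1": {"FP"}, "p2": {"FP", "r1"}}
print(dep_and_originality({"r1", "r2"}, citing))  # (0.5, 0.75)
```

Note how the two indices move in opposite directions: more coupling links raise DEP and lower the originality score.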

Disentangling disruption and consolidation
The index variants discussed so far treat the relationship between disruption and consolidation as a trade-off, because they distil the disruptive and consolidating aspects of a given publication down to a single number. In certain cases, it may be more useful to treat disruption and consolidation not as opposites, but as two distinct concepts which require two distinct indices. As demonstrated by Leydesdorff et al. (2021), the simplest way to construct indices that serve this purpose is to change the numerator in the calculation of the DI1:

DI* = N_F / (N_F + N_B + N_R)
DI# = N_B / (N_F + N_B + N_R)

Consider two papers A and B: The DI1 assigns the value of 0 to Paper A and -0.75 to Paper B. This might lead to the conclusion that Paper B is less disruptive. However, a more detailed inspection using DI* and DI# reveals that DI*, focusing on disruption, assigns the same value (0.083) to both papers, implying that they are equally disruptive. The two publications only differ with respect to their consolidation values. The DI# value is ten times larger for Paper B (0.83) than for Paper A (0.083), meaning that Paper B is more consolidating than Paper A. In addition to this example, Leydesdorff et al. (2021) also provided another more conceptual argument for the use of DI* and DI# that relates to the weight given to N_B in the calculation of the index values: "The difference between the total number of citing papers (N_F + N_B) and the value in the numerator […] is (N_F + N_B) − (N_F − N_B) = 2 × N_B. One could argue that it would be more parsimonious to subtract N_B only once from the total citations (N_F + N_B)" (Leydesdorff et al., 2021). 6 A different line of argument was put forward by Chen et al. (2021) in the context of research on patent data. They criticized the dichotomous typology of either competency-destroying or competency-enhancing technologies, which is fundamental to the construction of the DI1, as being too one-sided. The main reason for this criticism is the failure of the dichotomous typology to identify "dual technologies" (Chen et al., 2021).
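The Paper A / Paper B comparison can be checked numerically. The counts below are one reconstruction consistent with the reported scores (other combinations yield the same values):

```python
def di1(n_f, n_b, n_r):
    return (n_f - n_b) / (n_f + n_b + n_r)

def di_star(n_f, n_b, n_r):        # disruption only: numerator N_F
    return n_f / (n_f + n_b + n_r)

def di_hash(n_f, n_b, n_r):        # consolidation only: numerator N_B
    return n_b / (n_f + n_b + n_r)

paper_a = (1, 1, 10)   # assumed counts for Paper A
paper_b = (1, 10, 1)   # assumed counts for Paper B
print(round(di1(*paper_a), 2), round(di1(*paper_b), 2))          # 0.0 -0.75
print(round(di_star(*paper_a), 3), round(di_star(*paper_b), 3))  # 0.083 0.083
print(round(di_hash(*paper_a), 3), round(di_hash(*paper_b), 3))  # 0.083 0.833
```

Both papers receive identical DI* scores, while their DI# scores differ by a factor of ten, reproducing the values discussed above.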
Analogous to the calculation of DI1, N_F^j denotes the total number of publications that cite the FP but not prior art j, N_B^j represents the total number of publications that cite both the FP and prior art j, and N_R^j is the total number of publications that cite prior art j but do not cite the FP. In the second step, the final D and C values are calculated by averaging the resulting per-prior-art disruption and consolidation values D_j and C_j. The D and C indices provide detailed insights into the citation networks of patents and papers. Not only do they allow for the separate calculation of disruption and consolidation values, but the respective D_j and C_j values also provide information about the relationship between an FP and its prior arts. Chen et al. (2021) illustrated the advantage of using separate indices for disruption and consolidation scores by constructing a more nuanced framework of technological innovation.
As shown in Figure 5, dual technologies are characterized by both high D and high C values.
Both this framework and the D and C indices may be repurposed for bibliometrics by simply replacing the prior arts with the FP's cited references (Li & Chen, 2022).

Measuring disruption with keywords and MeSH terms
While all studies mentioned so far try to measure disruption using citation networks, researchers have also made efforts to measure disruption and/or novelty with keywords and text data (Arts et al., 2021; Boudreau et al., 2016; Foster et al., 2015; Hou et al., 2022). One such approach is the ED index proposed by S. Wang et al. (2023), which measures disruption on the basis of knowledge elements operationalized as MeSH (Medical Subject Headings) terms. Note that the ED does not distinguish between major topic and subheading MeSH terms. Instead, it differentiates between six different types of occurrences of knowledge elements within a citation network (Figure 6).
S. Wang et al. (2023) tested two different ways to operationalize knowledge elements. The first approach treats every MeSH term as a knowledge element. This means that the resulting index, referred to as ED(ent), looks for FPs with unique MeSH terms (compared to their cited references). In contrast, the second approach is based on MeSH co-occurrences. Therefore, ED(rel) searches for unique combinations of MeSH terms. Out of all variants of the DI1 (explained so far), ED(rel) shares the most similarities with keyword-based novelty indices.
The calculation of ED takes three steps. In the first step, ED considers the relationship of the FP to its cited references. EDR "quantifies the knowledge change directly caused by FP compared to existing research stream" (S. Wang et al., 2023, p. 154) by subtracting the proportion of knowledge elements found in both the FP and its cited references from the proportion of knowledge elements only found in the FP.
This procedure is followed up by a second step that groups the knowledge elements contained in the FP's citing papers into one of four categories, depending on whether they derive exclusively from the FP, from both the FP and its cited references, exclusively from the cited references, or from the citing papers themselves (S. Wang et al., 2023, pp. 154-155). Like in step one, the number of knowledge elements that originate from the FP's cited references is subtracted from the number of the elements that indicate new and original ideas introduced by either the FP or its citing papers, yielding EDC. In the third and last step, the two equations from steps one and two are combined using a parameter that defaults to 0.5 and can be used to give more weight to one part of the equation, if so desired. In fact, S. Wang et al. (2023) recommend a parameter value below 0.5, because their results suggest that EDC contributes more to the correct identification of breakthrough papers than EDR. Since the calculation of ED involves multiple steps, the six groups of knowledge elements are not mutually exclusive; one and the same knowledge element may be counted once in step one and again in step two.
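Under this description, the final ED score is a weighted combination of the two components. The sketch below assumes a simple convex combination; the parameter name alpha and the exact combination form are my assumptions based on the verbal description, not the authors' published formula:

```python
def entity_disruption(edr, edc, alpha=0.5):
    """Combine the reference-side component (EDR, step one) with the
    citing-side component (EDC, step two). Values of alpha below 0.5
    weight EDC more heavily, in line with the recommendation above.
    The convex-combination form is an assumption."""
    return alpha * edr + (1 - alpha) * edc

# With alpha = 0.3, EDC dominates the combined score.
print(entity_disruption(0.2, 0.8, alpha=0.3))  # 0.3*0.2 + 0.7*0.8 ≈ 0.62
```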
Following the example of Funk and Owen-Smith (2017) […]

Possible disadvantages of using citation data to measure disruption and consolidation

Illustration of possible combinations
With the exception of the ED, the DI1 and all of its variants rely on citation data to measure disruption and consolidation. For multiple reasons, citation data may not be treated as a perfect representation of the disruptive and consolidating qualities of publications and patents. The citations of patents and scientific publications paint only an incomplete picture of the knowledge and the ideas that circulate through the relevant communities. Not all inventors seek patent protection for their inventions (Funk & Owen-Smith, 2017), and not all publications are properly indexed by bibliometric databases. In the science system, the gap between the total number of publications and the number of publications indexed by bibliometric databases is much larger in the social sciences and the humanities than in the natural and life sciences (Bornmann, 2020; Moed, 2005). In summary, this means that there is a danger of selection bias when using citation data to measure disruption and consolidation.
The DI1 and its variants are further limited by the fact that actual citation behaviour is not always in line with the normative citation theory (Merton, 1988), which states that citations represent cognitive influences and are used to give credit to previous research or to prior art.
In reality, however, the inclusion or omission of citations in patents may be a strategic process, and some companies may have incentives not to properly cite all prior art (Alcácer et al., 2009). Similarly, the cited references of a scientific publication often do not represent all of the sources of inspiration that went into a paper (Tahamtan & Bornmann, 2018b). Since citations are a "complex, multidimensional and not a unidimensional phenomenon" (Bornmann & Daniel, 2008, p. 69), any application of the DI1 and its variants will be limited by noisy data.
In addition to the general limitations of citation data, the following subsections provide a summary of studies that examine possible (data-induced) biases that might affect the DI1 and its variants. An unbiased index should only be affected by parameters that relate to the theoretical construct that the index is supposed to measure. In the case of the DI1 and its variants, this means that disruption scores should only reflect the disruptive and consolidating qualities of publications (and nothing else). If, on the other hand, parameters that are unrelated to disruption and consolidation affect disruption scores, then it may be concluded that the DI1 and its variants suffer from biases. Each of the following subsections represents a different kind of bias that was investigated in the literature: inconsistency, time-dependency, biases related to reference lists, and coverage-induced biases.

N_R as a source of inconsistency
Consistent disruption indices should have the following feature: disruptive qualities of an FP should always lead to higher disruption scores and consolidating qualities should always lead to lower disruption scores. With only a few calculations, it has been shown that the DI1 as well as many of its variants are not consistent. The inconsistency is caused by the term N_R. N_R represents consolidating qualities and is therefore supposed to weaken the disruption score of papers. This works as intended as long as the numerator (N_F - N_B) is positive. In case, however, that N_F < N_B, an issue arises: N_R actually strengthens the disruptiveness of papers with negative disruption scores. This problem is illustrated in Table 2. As Table 2 shows, the performance of FP C and FP D is identical with the exception of the N_R values. FP D should be assigned a lower disruption score than FP C, since a high N_R is supposed to indicate consolidation. The results in Table 2 show, however, that N_R artificially inflates FP D's disruption score because it strengthens the denominator. Thus, FP D is falsely rewarded with a higher DI1 score than FP C. The issue is caused by the fact that N_R is only part of the denominator and thus has no influence on whether the disruption score is positive or negative: the sign of DI1 is determined solely by (N_F - N_B). The same issue also affects mDI1, DIl, DIX%, and DIn. Variants that are positive by definition as well as variants that do not contain N_R do not suffer from the inconsistency. ED is also not affected, because every term in its denominator is also part of its numerator.
Another consequence of the inconsistency is that N_R pushes the disruption scores of all papers closer to zero, regardless of whether they are on the consolidating or the disruptive part of the scale. Since N_R tends to be quite large in many cases, DI1 assigns values close to zero to many papers (Leydesdorff et al., 2021). One could argue that this goes against the original intention of Funk and Owen-Smith (2017) to create a nuanced metric because it "raises the question of whether different nuances of disruptions can be adequately captured by DI1, or if the term [here: N_R] is too dominant for this purpose" (Bornmann et al., 2020a, p. 1245).
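The inconsistency described above can be reproduced with a few lines of Python. The function below implements the standard DI1 formula of Funk and Owen-Smith (2017), DI1 = (N_F - N_B)/(N_F + N_B + N_R); the counts for the two hypothetical focal papers are invented for illustration, not taken from Table 2.

```python
# Sketch of the DI1 computation and the inconsistency caused by N_R.
# The example values are hypothetical.

def di1(n_f: int, n_b: int, n_r: int) -> float:
    """DI1 score of a focal paper (FP).

    n_f: papers citing only the FP (disruptive signal)
    n_b: papers citing both the FP and its references (consolidating signal)
    n_r: papers citing only the FP's references (consolidating signal)
    """
    return (n_f - n_b) / (n_f + n_b + n_r)

# Two hypothetical FPs with identical N_F and N_B but different N_R:
fp_c = di1(n_f=10, n_b=30, n_r=5)    # low N_R
fp_d = di1(n_f=10, n_b=30, n_r=100)  # high N_R, should look MORE consolidating

# Because the numerator (N_F - N_B) is negative, the large N_R of FP D
# pushes its score towards zero; FP D is falsely rewarded with a HIGHER
# score than FP C:
assert fp_c < fp_d < 0
```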

Time-dependency of disruption scores
Since the citation network of an FP keeps changing as long as the FP keeps receiving additional citations, disruption scores may vary greatly depending on the time of measurement. Using four example papers, Bornmann and Tekles (2019a) investigated the variation of DI1 scores over time (Figure 7). While the disruption score of Randall and Sundrum (1999) stabilized rather quickly, it took Davis et al. (1995) five years to arrive at a stable disruption value. The development of the DI1 scores of O'Regan and Grätzel (1991) and Iijima (1991) is also insightful, because even after 15 years it seems they still had not fully stabilized. Note that the time-sensitivity affects all citation-based variants of the DI1 and not just DI1. Based on their findings, Bornmann and Tekles (2019a) recommend using a citation window of at least three years.
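The time-dependency can be sketched in Python with an invented citation record: the DI1 score is simply recomputed with the citations observed up to each measurement year. The citing papers and years below are hypothetical.

```python
# Hypothetical sketch: recomputing DI1 at successive measurement years.
# Each citing paper is (year, cites_focal_refs); the citation network,
# and hence the score, keeps changing as citations accrue.

def di1_at(citing, refs_citing_years, year):
    """DI1 using only citations observed up to `year`."""
    n_f = sum(1 for y, cites_refs in citing if y <= year and not cites_refs)
    n_b = sum(1 for y, cites_refs in citing if y <= year and cites_refs)
    n_r = sum(1 for y in refs_citing_years if y <= year)
    denom = n_f + n_b + n_r
    return (n_f - n_b) / denom if denom else 0.0

# (year, does the paper also cite the FP's references?) for the FP's citers:
citing = [(2001, False), (2002, False), (2003, True), (2004, True), (2005, True)]
# years of papers citing only the FP's references:
refs_only = [2002, 2003, 2004, 2005, 2005]

trajectory = [round(di1_at(citing, refs_only, y), 2) for y in range(2001, 2006)]
# The score drifts from the fully disruptive end of the scale (1.0) towards
# the consolidating side as the citation network grows.
```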

Possible biases caused by the number and the citation impact of cited references
In addition to time-dependency, the DI1 and its variants may also be biased by the total number and the citation impact of the FP's cited references. The more references an FP contains and the more citations these references have received in total, the more likely it is that the citing papers cite at least one of the FP's cited references. Liu et al. (2023) demonstrated the effect that the removal or addition of an important cited reference may have on disruption scores (see also Bornmann & Tekles, 2019a).
Since the addition of highly cited references results in lower DI1 scores, Liu et al. (2023) conclude that according to the DI1 "it is difficult for a focal paper to disrupt highly cited predecessors". Database coverage constitutes a further source of bias: because coverage is generally worse for early publication years, "it will be better to restrict the analysis to more recent publications" (Liang et al., 2022, p. 5728). Note that the issue of publications wrongfully receiving high disruption scores because of a lack of (indexed) references also affects certain document types that usually contain no or only a few references (e.g., editorials, letters, book reviews, meeting abstracts). In summary, a document-type-, language-, field- and publication-age-dependent lack of coverage may artificially boost disruption values.
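The reference-list bias can be illustrated with a small Python sketch. The focal paper's reference list and the citing papers' reference sets are invented; for simplicity, N_R is held fixed, although in reality a highly cited reference would also increase N_R and lower the score further.

```python
# Hypothetical sketch: adding one highly cited reference to the focal
# paper's (FP) reference list turns former "FP only" citers into
# "FP and references" citers, lowering DI1.

def di1_from_sets(fp_refs, citing_papers, other_citers_of_refs):
    n_f = sum(1 for refs in citing_papers if not (refs & fp_refs))
    n_b = sum(1 for refs in citing_papers if refs & fp_refs)
    n_r = other_citers_of_refs  # papers citing only the FP's references
    return (n_f - n_b) / (n_f + n_b + n_r)

citing = [{"A"}, {"B"}, {"C"}, {"D", "E"}]         # works cited by the FP's citers
score_without = di1_from_sets({"X"}, citing, 2)     # obscure reference X only
score_with = di1_from_sets({"X", "A"}, citing, 2)   # add widely cited classic A

# The FP now shares a reference with one of its citers, so N_B rises:
assert score_with < score_without
```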

Convergent and predictive validity of DI 1 and its variants
As Section 3 illustrates, the DI1 and its variants seem, on the one hand, to be attractive and versatile tools for measuring the disruptiveness of research in empirical studies based on large publication sets; in their influential Nature study, for example, Park et al. (2023) reported that papers and patents are becoming less disruptive over time. On the other hand, the validity of disruption scores is still a matter of debate. In order to present the information in a systematic and approachable way, this review focuses on studies that tested the convergent validity and/or the predictive validity of the DI1 and its variants. Convergent validity addresses the question of whether a metric is positively associated with the construct it is supposed to measure. The convergent validity of a metric may be assessed in two ways: A) checking how well the results of the metric in question correspond with the results of other metrics that measure the same (or similar) concepts (Forthmann & Runco, 2020); B) checking how well the results of the metric in question correspond with expert evaluations of the same (or similar) concepts (Kreiman & Maunsell, 2011). "The criteria for convergent validity would not be satisfied in a bibliometric experiment that found little or no correlation between, say, peer review grades and citation measures" (Rowlands, 2018). In the case of the DI1 and its variants, the basic idea is to use lists of landmark publications (e.g., publications leading to the Nobel Prize, NP) that were compiled by groups of experts and to compare the disruption scores of landmark and non-landmark publications. If disruption scores do measure what they are supposed to measure, they should identify the landmark papers picked by experts by assigning significantly higher scores to landmark publications than to non-landmark publications.
Tests of predictive validity involve the usage of historical data in order to assess how well a bibliometric index is able to predict future outcomes of the concept of interest (Kreiman & Maunsell, 2011). Currently, there is only one study, namely Shibayama and Wang (2020), that investigated the predictive validity of an index measuring disruption (i.e., the Shibayama-Wang originality). The methods and results of validation studies are listed and explained in the following subsections. To make the results of the validation studies more easily comparable, the subsections are based on the type of data used to validate disruption indices. Particular focus is placed on the comparative performance of different disruption index variants. One study is highlighted in Table 4 because it is the only one that tested the convergent validity of multiple disruption indices. The results of the regression analyses are shown in Table 4. Only the coefficients of DI5 and mDI1 achieve statistical significance. Against expectation, all indices have negative coefficients: the higher the disruption value of a publication, the lower the likelihood of it being a prize-winning paper. In other words, prize-winning papers appear to be more consolidating on average than non-prize-winning PubMed publications with comparable bibliometric features.

Nobel Prize-winning publications
A hint at how the conflicting findings of the above-mentioned studies may be interpreted can be found in the study of Liang et al. (2022).

Faculty Opinions
Faculty Opinions aims to provide curated selections of relevant and high-quality research within the disciplines of biology and medicine. The company is responsible for the Faculty Opinions database (formerly known as F1000Prime), which is based on a post-publication peer review process. Peer-nominated Faculty Members (FMs) rate and rank papers according to their quality and their importance using a three-star system (1 star = "good", 2 stars = "very good", 3 stars = "exceptional"). We found three studies that take advantage of Faculty Opinions in order to assess the validity of the DI1 and its variants. These studies are Bornmann et al. (2020a), S. Wang et al. (2023), and Wei et al. (2023). Each study took a slightly different approach to operationalizing expert judgements of disruptiveness (and similar concepts).

Bornmann et al. (2020a) used tags, which FMs may choose to assign to papers in addition to ratings. The purpose of the tags is to provide an "'at a glance' guide for the reason(s) the article is being recommended". Examples of such tags are displayed in Table 5. The expectation was that disruption scores correlate positively with tags that indicate aspects of novelty and show negative correlations with tags that indicate consolidating research. "As disruptive research should include elements of novelty, we expect that the tags are positively related to the disruption indicator scores. For instance, we assume that a paper receiving many 'new finding' tags from FMs will have a higher disruption index score than a paper receiving only a few tags (or none at all)" (Bornmann et al., 2020a, p. 1247). The study was based on a dataset of 157,020 papers published between 2000 and 2016. Only papers with at least ten cited references and at least ten citations were included, and a citation window of at least three years was chosen. In total, the DI1 and four variants were tested: DI1, DI5, DI1noR, DI5noR, and DEP. A factor analysis (FA) suggests that "… (b) all other indicators measuring disruption are independent of DI1" (Bornmann et al., 2020a, p. 1252). In other words, DI5, DI1noR, DI5noR, and DEP load strongly on the same dimension, implying that at least one of them is an improvement compared to the DI1. The FA also shows that disruption scores do not correlate with reviewers' ratings, suggesting that reviewers do not tend to assign higher (or lower) star ratings to disruptive publications than to consolidating publications.
The lack of a correlation between reviewers' ratings and disruption scores probably affects the results of S. Wang et al. (2023), who assessed the validity of DI1, mDI1, DI5, mED(ent), and mED(rel) using a combination of tags and reviewers' ratings. They categorized papers that earned a reviewer score of at least two stars and received the tags "Hypothesis", "New finding", "Novel drug target", "Technical advance", and "Changes in clinical practice" as breakthrough papers. S. Wang et al. (2023) collected 2,002 breakthrough papers that were published between 1991 and 2002. They constructed a dependent variable that is 1 if the paper is a breakthrough paper and 0 if not. A logistic regression analysis was performed with the DI1 and its variants as independent variables. The authors tested whether the indices are able to differentiate between breakthrough papers and randomly sampled non-breakthrough papers from the PubMed database with the same publication year and similar citation counts.
The results of the logistic regression analyses are displayed in Table 7.

Table 7: Results from the logistic regression analysis based on S. Wang et al. (2023). Results that match expectations are highlighted in bold.

Wei et al. (2023) used reviewers' comments instead of tags. They decided to recognize a paper as "revolutionary science" if its review comments included the words "innovative", "revolutionize", "revolutionary", "novel", "novelty", "creativity", "creative", "innovation", "original", "initial", "originality", "radical", "breakthrough", "new", "bridge", "combine", "first ones", "contribute to", "thought-provoking" or "provocative". They provided examples of reviews that indicate revolutionary, i.e., disruptive, science (Table 8). Wei et al. (2023) collected 70 revolutionary papers and compiled a control group of 1,405 papers from the same journals.

Based on the reviewers' comments, Wei et al. (2023) created a binary variable that assumes 1 if a paper is considered revolutionary science and 0 otherwise. This binary variable was used as the independent variable in a multivariate linear regression. The dependent variables were citation counts and DI1 values; the control variables were the number of authors and the number of cited references.

Table 8: Examples of reviews based on Wei et al. (2023). Words that point to revolutionary science are underlined. Although the second comment does not contain any of the words listed by Wei et al. (2023), the paper was still coded as "revolutionary science".
Review comment 1: I have found that it is such an outstanding article overall. I find the work innovative and recommend indexing.
Review comment 2: This paper provides an important advance in the study of spatial proteomics.
Review comment 3: EGSEA is a new gene set analysis tool that combines results from multiple individual tools in R so as to yield better results. The authors have published the EGSEA methodology previously. This paper focuses on the practical analysis workflow based on EGSEA with specific examples. As EGSEA is a compound and complicated analysis procedure, this work serves as valuable guidance for users to make full use of this tool.
The analysis reveals that both the average citation counts and the average DI1 values are higher for revolutionary papers than for papers in the control group (by 25 citations and 0.016 value points, respectively). Both coefficients are statistically significant and point in the expected direction, but the coefficient for disruption scores is very small. Wei et al. (2023) conclude that the combination of citation counts and DI1 is able to correctly identify disruptive science. In summary, the DI1 and its variants perform significantly less favourably in the study of S. Wang et al. (2023) than in the studies of Bornmann et al.
(2020a) and Wei et al. (2023). A short overview of the results is provided in Table 9. The weak correlation between disruption scores and reviewer scores observed in the FA by Bornmann et al. (2020a) may have caused mDI1, DI1 and DI5 to perform poorly in the calculations of S. Wang et al. (2023). While the DI1 and its variants are not able to identify publications with high reviewer scores, DI5 and DEP in particular seem capable of identifying novel research.

Milestone and breakthrough papers
In 2008, Physical Review Letters (PRL) compiled a list of milestone publications to celebrate the journal's 50th anniversary. The list includes publications from 1958 to 2001 and is available online (https://journals.aps.org/prl/50years/milestones). According to the information provided by the publisher on this website, the collection "contains Letters that have made long-lived contributions to physics, either by announcing significant discoveries, or by initiating new areas of research. A number of these articles report on work that was later recognized with a Nobel Prize for one or more of the authors". The milestone papers have been carefully selected to represent the various subdisciplines of the field of physics. A similar list of milestone papers was published in 2015 by Physical Review E (PRE) in celebration of its 50,000th publication. This collection includes papers published between 1993 and 2004, with the objective of identifying significant scientific contributions in different fields of physics.
As of now, there are three studies that have used the PRL and PRE collections to assess the convergent validity of the DI1 and its variants. Based on the assumption that milestone assignments are a proxy for disruption, the three studies investigated whether the DI1 and its variants assign higher values to milestone than to non-milestone publications. We will start with two studies that are very similar: Bornmann and Tekles (2021) and Bittmann et al. (2022). In cases where observable covariates are unevenly distributed among unbalanced data, matching algorithms may be used to combat biases: "The general idea behind statistical matching is to simulate an experimental design when only observational data are available to make (causal) inferences. In an experiment, usually two groups are compared: treatment and control. The randomized allocation process in the experiment guarantees that both groups are similar, on average, with respect to observed and unobserved characteristics before the treatment is applied. Matching tries to mimic this process by balancing known covariates in both groups. The balancing creates a statistical comparison where treatment and control are similar, at least with respect to measured covariates" (Bittmann et al., 2022, p. 1251). In other words, if a covariate is unequally distributed among the treatment and control groups, a matching algorithm tries to compare members of the two groups with comparable covariate values. Since both Bornmann and Tekles (2021) and Bittmann et al. (2022) mainly relied on coarsened exact matching (CEM), we forgo a detailed explanation of the different types of matching algorithms and focus on CEM instead. Unlike other matching algorithms, CEM tries to find exact matches and actively discards dissimilar cases from the calculation. This procedure considerably improves the balancing of the data.
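A minimal Python sketch of the CEM logic may be helpful here. It assumes a single covariate (citation counts) coarsened into quintiles; all values are invented, and the actual studies matched on several covariates at once.

```python
# Minimal sketch of coarsened exact matching (CEM) with one covariate.
# The data below are made up for illustration.
import bisect
import statistics

def cem_ate(treated, control):
    """Treatment effect on the outcome after quintile coarsening.

    treated/control: lists of (covariate, outcome) pairs.
    Strata without members from both groups are discarded, as in CEM;
    matched strata are weighted by their number of treated papers.
    """
    cuts = statistics.quantiles([c for c, _ in treated + control], n=5)
    strata = {}
    for label, rows in (("t", treated), ("c", control)):
        for cov, outcome in rows:
            q = bisect.bisect_right(cuts, cov)  # quintile index 0-4
            strata.setdefault(q, {"t": [], "c": []})[label].append(outcome)
    matched = [s for s in strata.values() if s["t"] and s["c"]]
    n_t = sum(len(s["t"]) for s in matched)
    return sum((statistics.mean(s["t"]) - statistics.mean(s["c"])) * len(s["t"])
               for s in matched) / n_t

# Hypothetical milestone (treated) and control papers: (citations, DI1 score)
milestone = [(10, 0.5), (30, 0.6), (120, 0.3), (400, 0.2)]
ordinary = [(20, 0.1), (40, 0.2), (130, 0.1), (900, 0.4)]
ate = cem_ate(milestone, ordinary)  # dissimilar cases are dropped automatically
```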
If perfect matches are difficult to find, coarsening is employed: "For example, a continuous variable with a large number of distinct values is coarsened into a prespecified number of categories, such as quintiles. Matching is then performed based on quintile categories, and the original information is retained. After matching based on the coarsened variables, the final effects are calculated as differences in the outcome variable between group means using the original and unchanged dependent variable" (Bittmann et al., 2022, p. 1255). The results of Bornmann and Tekles (2021) and Bittmann et al. (2022) are displayed side by side in Table 10 (standard errors in brackets).

Figure 9: Illustration of how the choice of X for DIX% influences the ranks of review articles, milestone papers, and NPs (Deng and Zeng, 2023).

Deng and Zeng (2023) compared the disruption values of three groups of publications, i.e., NPs, milestone papers, and review articles, and analyzed ordinary papers
in the database of the American Physical Society. After restricting the sample to papers that received at least five citations (to include only papers with a minimum impact), the sample consisted of 230,867 publications.
Unlike Bornmann and Tekles (2021) and Bittmann et al. (2022), Deng and Zeng (2023) chose not to calculate average treatment effects (ATEs) and instead opted for a different approach. They assumed that review papers are prime examples of consolidating publications, and that milestone papers and NPs are the most disruptive papers in the dataset. In their study, disruption was measured with ranks instead of scores. For example, a paper ranked at 15 has a higher disruption value than a paper ranked at 200. Therefore, a valid index should (1) assign lower ranks (i.e., ranks closer to 0) to milestone papers and NPs than to ordinary papers and (2) assign higher ranks to reviews than to primary research papers. Three indices were tested: DI1, DI5, and DIX%. Figure 9 shows how the percentile choice for DIX% changes the ranks of the publications. Deng and Zeng (2023) report that a threshold of 3% produces the best results because DI3% assigns minimum ranks to milestone papers and NPs. Table 11 provides an overview of the performance of the three indices. It displays percentile ranks instead of absolute ranks. Low percentile ranks indicate consolidation, and high percentile ranks indicate disruptive science. DI5 succeeds in assigning particularly low percentile ranks to review articles, while DI3% maximizes the percentile ranks of NPs and milestone papers. Therefore, Deng and Zeng (2023) concluded that DI5 as well as DI3% are considerable improvements compared to DI1. DI5 appears to excel at the identification of consolidating research, whereas DI3% performs well at identifying disruptive research.
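The percentile-rank comparison can be sketched as follows (Python, with invented disruption scores); as in Table 11, higher percentile ranks stand for more disruptive papers.

```python
# Sketch of a percentile-rank comparison between paper groups.
# All disruption scores below are invented for illustration.
from statistics import mean

def percentile_rank(score, all_scores):
    """Share of papers (in %) with a score less than or equal to `score`."""
    return 100 * sum(1 for s in all_scores if s <= score) / len(all_scores)

reviews = [-0.3, -0.2, -0.1]      # assumed consolidating
ordinary = [-0.05, 0.0, 0.1]
milestones = [0.4, 0.6]           # assumed disruptive
pool = reviews + ordinary + milestones

mean_rank = {name: mean(percentile_rank(s, pool) for s in group)
             for name, group in [("reviews", reviews),
                                 ("ordinary", ordinary),
                                 ("milestone", milestones)]}

# A valid index should place reviews below ordinary papers and milestone
# papers above them on the percentile scale:
assert mean_rank["reviews"] < mean_rank["ordinary"] < mean_rank["milestone"]
```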
(Note on Table 11: according to Deng and Zeng, 2023, t-tests showed that the differences between the three indices are statistically significant for milestone papers and NPs, p < 0.01.) Following the same fundamental idea as the three studies above, but using a different dataset, S. Wang et al. (2023) tested the ability of five indices (mED(rel), mED(ent), mDI1, DI5, DI1) to identify articles that have been declared breakthrough papers by Science magazine. In 1989, Science magazine started awarding the title of "molecule of the year" to molecules connected to major scientific developments (Guyer & Koshland, 1989). In 1996, the award was renamed to "breakthrough of the year". It is given to publications that made significant contributions to a field of research and led to a scientific breakthrough (not necessarily connected to molecules). From this annually updated list of breakthrough publications, S. Wang et al. (2023) collected 321 articles published between 1991 and 2014. A logistic regression reveals that only mED(rel) and mED(ent) are able to identify the breakthrough papers among PubMed publications with comparable bibliometric features (Table 12). The performance of mED(rel) is clearly superior to that of any other index.

Self-assessments of researchers
Shibayama and Wang (2020) tested both the convergent and the predictive validity of Origbase, Origweighted_yc, and Origweighted_zr by combining survey data with bibliometric data from the WoS.
The survey was mailed to 573 randomly selected researchers in the life sciences who had earned their PhD degrees between 1996 and 2011 in Japan. Shibayama and Wang (2020) used a two-dimensional concept of originality and differentiated between theoretical and methodological originality. The respondents were asked to evaluate the theoretical as well as the methodological originality of their dissertation projects using a three-point scale: 0 (not original), 1 (somewhat original), 2 (original). The survey data was then linked with bibliometric data about the publications that the researchers had published in the year of their graduation or 1-2 years before. In total, 246 responses from the survey and the bibliometric data for 546 publications were collected.
The test of convergent validity was conducted by calculating correlation coefficients between the researchers' self-assessments and the originality scores of their publications.

To assess the predictive validity of disruption indices, Shibayama and Wang (2020) investigated whether Origbase, Origweighted_yc, and Origweighted_zr are able to predict highly impactful papers. The design of this test was guided by the notion that highly original research is more likely to be highly cited than less original research. Based on citation counts up to 2018, a binary variable was constructed that assumes 1 if a paper is among the top 10% most highly cited publications in the dataset and 0 otherwise. A logistic regression showed that all variants of the Shibayama-Wang originality are able to predict future citation impact reasonably well, and similarly well (Table 13).
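The construction of such a top-10% dependent variable can be sketched in a few lines of Python; the citation counts are invented, and ties at the threshold (which could push the share above 10%) are ignored for simplicity.

```python
# Hypothetical sketch: flagging the top 10% most cited papers of a dataset
# as the binary dependent variable of a predictive-validity test.

def top10_flags(citations):
    """1 if a paper is among the top 10% most cited papers, else 0."""
    k = max(1, len(citations) // 10)                  # size of the top-10% group
    threshold = sorted(citations, reverse=True)[k - 1]
    return [1 if c >= threshold else 0 for c in citations]

cites = [3, 0, 12, 5, 250, 40, 7, 1, 9, 60]  # ten papers -> top 10% = one paper
flags = top10_flags(cites)
assert sum(flags) == 1 and flags[cites.index(250)] == 1
```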
In summary, Origbase, Origweighted_yc, and Origweighted_zr displayed similar levels of convergent and predictive validity. For the most part, the findings of Shibayama and Wang (2020) support the validity of their originality indicators.

Discussion
There are two main takeaways from the current research on disruption indices. On the one hand, the DI1 and its variants enable the exploration of intriguing research questions that require vast amounts of bibliometric data. Two popular bibliometric studies published by Wu et al. (2019) and Park et al. (2023) in Nature would scarcely be possible without the DI1. On the other hand, it is apparent that the DI1 and its variants have some considerable limitations that they share with many other bibliometric indices. Not only are these indices, as citation-based indices, highly time-sensitive metrics that depend on many factors influencing citation decisions (e.g., the language of the cited paper or the reputation of the authors) (Tahamtan & Bornmann, 2018a); disruption scores are also heavily affected by a time-, discipline-, document-type-, and language-related lack of coverage in literature databases.
Because no database offers perfect coverage of the worldwide literature and because coverage is generally worse for publications from early publication years, there is no way to rule out the possibility that the results of historic studies like Park et al. (2023) are affected by coverage-induced biases.

Convergent and predictive validity
A systematic review of the literature reveals that empirical evidence on the predictive validity of disruption indices is still too scarce to arrive at any substantial conclusions. The current literature only provides some illustrative calculations with the D and C indices (Li & Chen, 2022) and more detailed evidence on the predictive validity of the Shibayama-Wang originality (Shibayama & Wang, 2020). Compared to predictive validity, there is a richer body of literature on the convergent validity of disruption indices, but the results are not entirely conclusive.
There are only two consistent findings across all studies: 1) Citation impact measures are strongly and positively associated with milestone and NP status (Bittmann et al., 2022; Wei et al., 2023). 2) Some DI1 variants offer considerable improvements compared to the DI1. The comparative performance of disruption indicators has so far been assessed by five studies. The favourable performances of DI5 in four of these studies (Bittmann et al., 2022; Bornmann et al., 2020a; Bornmann & Tekles, 2021; Deng & Zeng, 2023) are contrasted by only one result that does not confirm the convergent validity of the DI5 (S. Wang et al., 2023). Therefore, we conclude that DI5 shows the most consistently favourable performance in the current literature. The fact that DEP also shows some promising results suggests that indices measuring disruption profit from considering the strength of bibliographic coupling links between the FP and its citing papers. Still, researchers should be aware of the limitations of DI5: it suffers from the inconsistency caused by N_R, and its application requires the use of an arbitrary threshold.

Limitations of validation studies
As with any type of research, it should be kept in mind that the studies on the convergent validity of the DI1 and its variants have their own limitations: (1) Expert judgements and self-assessments used as benchmarks may be flawed in their own way since they may be biased (Bornmann & Daniel, 2009). For example, empirical evidence suggests that expert opinions may be biased against highly novel science (Boudreau et al., 2016).
(2) Since experts have access to bibliometric data, it is possible that they take citations counts into consideration when assigning milestone status to papers. The same applies to the selfassessments of researchers.
(3) If disruption scores are related to citation counts, the results of validation studies could be confounded by citation impact (Bittmann et al., 2022).
(4) The studies relying on the Faculty Opinions database use aspects of novelty to test the convergent validity of the DI1 and its variants. Bornmann et al. (2020a, p. 1256) point out that, because novelty and disruption are distinct concepts, there is no way to "completely exclude the possibility that many nondisruptive discoveries are novel".

Best practice guidelines
On the positive side, the current state of research highlights a key strength of the DI1 and its variants. They are highly versatile indices, which provide researchers with a number of options to tackle some of their weaknesses: (1) Researchers who want to avoid the inconsistency caused by N_R still have a variety of indices to choose from.
(2) Since publications without (indexed) cited references are a major threat to the validity of disruption scores, it seems to become standard practice to calculate disruption scores only for publications with at least a certain number (e.g. 10) of citations and cited references (Bornmann et al., 2020a;Bornmann & Tekles, 2019b;Deng & Zeng, 2023;Ruan et al., 2021;Sheng et al., 2023).
(3) For the same reason, it seems also advisable not to calculate disruption scores for articles from very early publication years.
(4) Because a short citation window may lead to non-reliable results, Bornmann and Tekles (2019a) propose a citation window of at least three years after publication, as is usually recommended in bibliometrics (van Raan, 2019). However, a time window of three years does not in any way guarantee reliable results for articles that keep accruing citations long after publication. Since there is no one-size-fits-all approach to citation windows for the DI1 and its variants, researchers who want to work with the indices are encouraged to provide transparent reasons for their choice of citation window.

Future research
Future research is needed to address four key aspects that have not been sufficiently covered by the current literature: (1) The concept of disruption requires a precise definition. In the current literature, disruption is loosely associated with the idea of scientific breakthrough or paradigm shift. However, Wuestman et al. (2020) showed that there are different types of scientific breakthroughs and that breakthroughs should therefore not be treated as a homogeneous group. Consequently, there needs to be a discussion about how the concept of disruptive research fits into the typology of scientific breakthroughs.
(2) The current state of research shows that the time-sensitivity of disruption scores is one of the major limitations of the DI1 and its variants. Time-sensitive bias could render historical analysis with the indices challenging (like Park et al., 2023). Current research on time-sensitive biases that affect DI1 (and other variants) is still in a very early (preprint) stage (Bentley et al., 2023;Macher et al., 2023;Petersen et al., 2023) and requires further support by future publications.
(3) The substantial correlation between citation impact and expert assessments of disruptive papers needs to be examined in more detail, since this result seems to contradict the central claim that disruption scores capture a quality of research that citation impact alone does not.
(4) The studies which investigated the relationship between disruption scores and different aspects of novelty lead to seemingly conflicting results. The results by Shibayama and Wang (2020) seem to imply that theoretical (and not methodological) innovations contribute to disruptive research, whereas the findings of Leahey et al. (2023) indicate that new methods (and not new theories) are a key driving force behind disruption in science. Given these inconclusive results, more research on the relationship between disruption scores and specific types of novelty is needed. Such research could not only provide valuable insights into the theoretical and technical properties of disruption metrics but also improve our understanding of scientific innovation.
(5) More research is needed on the lesser-known variants of the DI1, since some of them (like mED(rel) and Origbase) have so far been examined by only one or very few studies. Future research could also explore how the use of different thresholds affects DIl (e.g., l = 3, l = 10, l = 20). The investigation of the DI1 and the development of variants have led to important insights and recommendations for necessary improvements, but it looks as if this research has not yet reached its full potential.

Conclusions
This review was intended to provide an overview of the research on the DI1 and its variants.
Although these indices have already been applied in science of science studies targeting important science policy questions (e.g., is research becoming more or less disruptive over time?), we would like to encourage more empirical studies on the reliability, validity, and other properties of the different indices. These results are necessary to learn significantly more about the indices. Only if the properties of the DI1 and its variants are well known can the empirical studies that investigate science using the indices be properly designed and interpreted.

Statements and declarations
Competing interests: The authors have no relevant financial or non-financial interests to disclose.
Funding: Open Access funding enabled and organized by Project DEAL.
Lutz Bornmann is Editorial Board Member of Scientometrics.