Abstract
The accuracy of interdisciplinarity measurements is directly related to the quality of the underlying bibliographic data. Existing indicators of interdisciplinarity are not capable of reflecting the inaccuracies introduced by incorrect and incomplete records because correct and complete bibliographic data can rarely be obtained. This is the case for the Rao–Stirling index, which cannot handle references that are not categorized into disciplinary fields. We introduce a method that addresses this problem. It extends the Rao–Stirling index to acknowledge missing data by calculating its interval of uncertainty using computational optimization. The evaluation of our method indicates that the uncertainty interval is not only useful for estimating the inaccuracy of interdisciplinarity measurements, but it also delivers slightly more accurate aggregated interdisciplinarity measurements than the Rao–Stirling index.
Introduction
Most quantitative measures of the output of InterDisciplinary Research (IDR) rely on bibliometric methods. Since such methods are commonly used to inform science and technology policy, they require reliable indicators and results. While analytical indicators and tools have been refined over time, their results are in most cases not precise. The accuracy of such indicators depends on the quality of the bibliographic data, which should be correct and complete. Unfortunately, gathering a correct and complete bibliographic dataset is complicated because not all scientific publications are indexed by digital libraries. Current bibliographic databases, such as the Web of Science (WoS) or Scopus, do not cover books, book chapters, or the many regional non-English journals in which some fields mainly publish. Even conference proceedings, which constitute the main publication venues in many applied, fast-changing fields, are often not indexed. Gathering and comparing records from different bibliographic sources mitigates this problem to some extent. However, an additional problem affects top–down approaches to measuring IDR, such as the Rao–Stirling diversity index: the need for a predefined taxonomy of disciplines that classifies all publications in the dataset. This problem cannot be solved by comparing data gathered from different sources, because not all libraries classify their publications into a taxonomy of disciplines or use the same taxonomy, and even those that use a taxonomy might not classify all their indexed publications with it, as is the case for WoS. Manual classification of publications into disciplinary fields is also not viable for a large number of uncategorized publications. In consequence, top–down measurements of IDR usually deliver proxy results.
In this paper we acknowledge the problem of dealing with incomplete data gathered from several libraries. We focus on the problem of uncategorized publications in the measurement of IDR with the Rao–Stirling index. We choose this index because it is a well-established bibliometric indicator that requires a complete categorization of all references into disciplinary fields, yet this requirement has not received adequate attention in the literature. We propose a theoretical extension of the Rao–Stirling index that accounts for the uncertainty resulting from references that remain uncategorized.
Background
The field of measuring IDR relies heavily on bibliometric methods and data, owing to the widely held view that scientific research is disseminated via publications. Different types of approaches exist for measuring IDR, each endorsed for different analytical needs. For an extensive review of approaches, we refer to the work of Wagner et al. (2011). Among them, the most common method for measuring IDR is citation analysis, in which an exchange or integration among fields is captured via discipline-specific citations pointing to other fields. Two distinguishable strategies for measuring IDR are bottom–up and top–down. The first is based on clusters of articles without a predefined taxonomy of disciplines; the clustering derives from the structural relationships of a network of publications (Boyack and Klavans 2010; Chen et al. 2010; Leydesdorff 2007; Leydesdorff et al. 2013). In contrast, top–down approaches rely on a predefined taxonomy of disciplines that is used to classify publications into disciplinary fields (Leydesdorff et al. 2013; Porter and Rafols 2009; Rafols et al. 2012). While bottom–up approaches are suited to capturing emerging developments that do not fit into existing categories, the classification-based approach is useful for large-scale explorations, such as comparisons of areas of science using an extensive amount of data or of the disciplinary breadth of research institutions. The latter approach is the focus of this paper.
The results of citation analyses are subject to the quality of bibliographic data in terms of completeness and accuracy. Well-established top–down methods used to analyze the number of disciplines cited by a publication or their degree of concentration, such as Shannon entropy (Shannon 1948) and the Herfindahl index (Rhoades 1993), are designed for datasets with complete information, since they cannot acknowledge the degree of missing data. This is also the case for the Rao–Stirling diversity index, a more complete top–down index proposed by Porter et al. (2007) and Porter and Rafols (2009). Precise IDR measurement using these methods requires a bibliographic dataset with: (1) complete records of references, (2) a correct list of references for each publication, (3) accurate categorization of publications into disciplinary fields, and (4) the categorization of each reference into at least one discipline. The combination of these quality characteristics results in ground-truth bibliographic data, which is rarely attainable, since no publication database provides adequate correctness and completeness with respect to both references and categorization into disciplinary fields.
Concerning references, verification mechanisms as discussed by van Raan (1996) are crucial to detect incomplete records of references and to remove incorrect references in bibliographic sources, such as those encountered by Moed et al. (1995) and Chen et al. (2012). In regard to taxonomies of disciplines, their accuracy has been widely discussed in the literature without reaching consensus on an adequate one (National Research Council 2010; Rafols and Leydesdorff 2009). In spite of its weaknesses, the list of categories provided by WoS is the most widely used (Bensman and Leydesdorff 2009; Pudovkin and Garfield 2002). The exhaustive categorization of all references within a dataset into disciplinary fields remains an open issue that is under-discussed in the literature. Although the important consequences of missing data in bibliographic datasets have been acknowledged in the literature (Moed et al. 1985), to our knowledge the problem of uncategorized records in top–down IDR measurement has not been properly addressed. Some bibliometric studies minimize this problem by excluding uncategorized publications from the dataset. The use of the categories of WoS implies the exclusion of all publications other than journals indexed by WoS (i.e., proceedings papers, books, technical reports) (Bjurström and Polk 2011; Carley and Porter 2011; Chen et al. 2012). Other studies report the percentage of uncategorized publications and compute the index on the categorized references (Rafols et al. 2012; Porter and Rafols 2009). These approaches do not take into account the potential diversity of the excluded or missing data; hence interdisciplinarity is underestimated.
A method that automates the assignment of disciplines was implemented by Ponomarev et al. (2013) in order to categorize authors into one of a small set of major research fields. It is based on aggregated information on the categories of the authors' publications and their references, for which disciplines are grouped into broad categories that relate to the research activity of the group of individuals; disciplines unrelated to that research activity are categorized as 'others'. It therefore does not allow for the automatic assignment of specific categories loosely related to the selected major fields, which is needed to compute the Rao–Stirling index.
In the following we propose a method that acknowledges missing data and determines the associated uncertainty (see "Method" section), followed by its evaluation and discussion in the subsequent sections.
Method
Introduction
In this section we briefly introduce the Rao–Stirling index and present, as our main theoretical contribution, an extension of it that encodes the uncertainty caused by missing bibliographic data as an uncertainty interval. The Rao–Stirling index is a distance-based indicator, inspired by the Stirling index (Stirling 2007), which captures not only the variety and balance of the disciplines cited by a paper, but also their disparity, using a measure of similarity between disciplines. A hypothetical document \(\mathcal {D}\) and a set \(\mathcal {T}\) of \({N_{\mathcal {T}}}\) disciplines will serve as an example for the following explanations. The index can be expressed as

\(I = 1 - \sum _{i,j} p_i \, p_j \, s_{ij} \qquad (1)\)
where \(p_i\) is the proportion of references of discipline i in a given paper, and \(s_{ij}\) is a cosine measure of similarity between disciplines i and j. The values \(s_{ij}\) form a matrix of similarities in which disciplines that are more often co-cited by the same paper are 'closer' than disciplines that are less frequently co-cited (Porter and Rafols 2009). This ensures low integration scores for publications citing very similar disciplines and high integration scores for publications citing very diverse disciplines. The integration score ranges from 0 to 1 (the metric can only asymptotically approach this upper limit) as variety, balance, and disparity increase.
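As an illustration, the computation of the index from reference counts can be sketched in a few lines of Python. This is a didactic sketch with a hypothetical 3-discipline similarity matrix; the actual analysis uses the 249-category WoS matrix.

```python
import numpy as np

def rao_stirling(counts, S):
    """Rao-Stirling diversity I = 1 - sum_ij p_i p_j s_ij, where p is
    the vector of per-discipline reference proportions and S is the
    discipline similarity matrix."""
    c = np.asarray(counts, dtype=float)
    if c.sum() == 0:
        raise ValueError("no categorized references")
    p = c / c.sum()              # proportions p_i = c_i / ||c||_1
    return 1.0 - p @ S @ p

# Hypothetical similarity matrix: disciplines 0 and 1 are closely
# related, discipline 2 is distant from both.
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
print(rao_stirling([5, 5, 0], S))  # similar disciplines -> 0.1
print(rao_stirling([5, 0, 5], S))  # dissimilar disciplines -> 0.45
```

Citing two dissimilar disciplines yields a higher integration score than citing two similar ones, exactly as described above.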
The information on the disciplines of the categorized references of \(\mathcal {D}\) can be aggregated into a vector \(\mathbf {c} = (c_1, c_2, \ldots , c_{N_{\mathcal {T}}})\) of reference counts per discipline. Each count \(c_i\) gives the number of references of \(\mathcal {D}\) that belong to the ith discipline of \(\mathcal {T}\). Note that a reference can itself be interdisciplinary and belong to several disciplines. By denoting the number of references cited by \(\mathcal {D}\) with \({N_\text{ref}}\), we have for the 1-norm of \(\mathbf {c}\) that

\({\Vert \mathbf {c}\Vert }_{1} = \sum _{i=1}^{N_{\mathcal {T}}} c_i \ge {N_\text{ref}}\)
if complete bibliographic data is assumed. Each count \(c_i\) corresponds to a proportion \(p_i\) via the relation \(p_i = \frac{c_i}{{\Vert \mathbf {c}\Vert }_{1}}\). The Rao–Stirling diversity \(I\) is then given as

\(I = 1 - \frac{\mathbf {c} \, \mathbf {S} \, \mathbf {c}^\intercal }{{\Vert \mathbf {c}\Vert }_{1}^{2}}\)
where the similarity matrix \(\mathbf {S}= (s_{ij})\) encodes the pairwise similarity between the different disciplines (Stirling 2007).
Missing Data
Problems arise when the disciplines of one or more references are unknown. As a consequence, \(\mathbf {c}\) cannot be determined and \(I\) is not well defined. The common approach is to simply omit these references and compute the index using only the categorized references (Bjurström and Polk 2011; Carley and Porter 2011; Chen et al. 2012; Rafols et al. 2012; Porter and Rafols 2009). Depending on the counts \(\mathbf {c}\) obtained from the categorized references, as well as the number of uncategorized references, the uncertainty can vary widely. For a single uncategorized reference among dozens of categorized ones, the effect would be minor, whereas in the converse case, the uncertainty spans nearly the whole range of the index, rendering the initial estimate meaningless.
To capture the effects of missing data, we will compute the range in which the Rao–Stirling diversity \(I\) can vary when the uncategorized references are assigned to (sensible) arbitrary disciplines. While this range could be determined by enumerating all possible assignments and computing \(I\) for each, such an approach is computationally infeasible as it suffers from combinatorial explosion: each uncategorized reference can be assigned to any non-empty subset of the \({N_{\mathcal {T}}}\) disciplines, i.e., in \(2^{{N_{\mathcal {T}}}}-1\) ways. Instead, we will formulate the search for an upper and lower bound on \(I\) as an optimization problem. In the following, we present its basic formulation and several subsequent refinements.
Uncertainty Estimation
Given a document \(\mathcal {D}\), let us denote with \(\mathbf {c}\) the reference counts per discipline for all references categorized into disciplinary fields. Furthermore, \(\mathcal {D}\) references u uncategorized documents, i.e., documents for which we have no information on their respective disciplines. We now aim to compute new vectors \(\mathbf {n}_-\) and \(\mathbf {n}_+\) of reference counts per discipline such that all uncategorized references are assigned to one or more disciplines. Our goal is to obtain the smallest (resp. largest) possible diversity index \(I_-\) (resp. \(I_+\)) when computed with these new counts. Formally, we can state this requirement as

\(\mathbf {n}_{\pm } = \underset{\mathbf {n}}{\arg \min /\max } \; \left( 1 - \frac{\mathbf {n} \, \mathbf {S} \, \mathbf {n}^\intercal }{{\Vert \mathbf {n}\Vert }_{1}^{2}} \right) \;\; \text {s.t.} \;\; c_i \le n_i \le c_i + u \;\; \forall i, \quad {\Vert \mathbf {c}\Vert }_{1} + u \le {\Vert \mathbf {n}\Vert }_{1} \le {\Vert \mathbf {c}\Vert }_{1} + u \, {N_{\mathcal {T}}} \qquad (2)\)
In this formulation, \(\mathbf {n}_-\) and \(\mathbf {n}_+\) are given as those new counts \(\mathbf {n}\) that minimize and maximize, respectively, the Rao–Stirling diversity defined in Eq. 1. The optimization is subject to two constraints that ensure that the information obtained from the categorized references, in the form of the counts \(\mathbf {c}\), is respected. The first constraint requires that the new count \(n_i\) for each discipline cannot decrease below \(c_i\) and that each discipline may acquire up to u reassigned references. The second constraint expresses that each uncategorized reference is assigned to at least one and at most \({N_{\mathcal {T}}}\) disciplines. The optimization problem can also be stated in terms of proportions \(\mathbf {p} = \mathbf {n} / {\Vert \mathbf {n}\Vert }_{1}\) (see Eq. 1), which removes the normalization in the quadratic term:
A derivation of the transformation from Eqs. 2 to 3 can be found in Appendix “Hypercube constraints”. While the formulation of the optimization problem in terms of counts \(\mathbf {n}\) allows a more intuitive description of the various constraints, the formulation in terms of proportions \(\mathbf {p}\) allows a more efficient computation of the solution as we show in “Computational methods” section.
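Under the simplifying assumption that every uncategorized reference is assigned to exactly one discipline (so the total count is fixed), the upper bound \(I_+\) reduces to a convex program that can be sketched with off-the-shelf tools. The toy matrix and the use of SciPy's SLSQP solver below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def upper_bound(counts, u, S):
    """Sketch of I_+: maximize 1 - p S p^T over proportions p with
    p_i >= c_i / T (the categorized counts are respected) and
    sum(p) = 1, where T = ||c||_1 + u.  Maximizing I is equivalent
    to minimizing the convex quadratic form p S p^T."""
    c = np.asarray(counts, dtype=float)
    T = c.sum() + u
    lo = c / T                               # lower bounds on proportions
    x0 = lo + (1.0 - lo.sum()) / len(c)      # feasible starting point
    res = minimize(lambda p: p @ S @ p, x0,
                   bounds=[(l, 1.0) for l in lo],
                   constraints=[{"type": "eq",
                                 "fun": lambda p: p.sum() - 1.0}],
                   method="SLSQP")
    return 1.0 - res.fun

S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
# 5 categorized references in discipline 0 and 5 uncategorized ones:
# the solver spreads the unknown references over distant disciplines.
print(upper_bound([5, 0, 0], 5, S))
```

Because the objective is convex over a linearly constrained region, any local solution found by the solver is the global one, which is what makes the computation of \(I_+\) tractable.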
Constraint refinement
The full range of uncertainty in the Rao–Stirling diversity index under missing data is given by the solutions to the optimization problems stated in Eqs. 2 and 3. We found, however, that this general form considers situations that are highly unlikely to occur in real-world scenarios. In the above formulation it is possible for each uncategorized reference to increase the per-discipline count of every discipline by one, which would mean that the reassigned reference is maximally interdisciplinary in the sense that it covers all disciplines. Since this is not a realistic scenario, we limit the number of disciplines to which each uncategorized reference may belong. If we assume that each uncategorized reference cannot cover more than k disciplines, we can represent this as an additional constraint for optimization problem Eq. 2:

\({\Vert \mathbf {n}\Vert }_{1} \le {\Vert \mathbf {c}\Vert }_{1} + u \, k \qquad (4)\)
In proportion space, the equivalent constraint for Eq. 3 is given as
Details on this derivation can be found in Appendix “Constraint refinement”. In “Computation of the Rao–Stirling index and its uncertainty interval” section we derive a value of \(k=4\) as suitable for uncertainty computations in our context. The impact of this choice on the actual calculations is discussed in “Computational methods” section.
Discipline pruning
A reassignment of an uncategorized reference to an arbitrary subset of disciplines can lead to highly improbable results even when the cardinality of the subset is bounded as described in the "Constraint refinement" section. This arises naturally from the maximization of the Rao–Stirling diversity index in the aforementioned optimization problems. A concrete example could be a document in the field of computer science that exclusively cites previous works from its own discipline but has two uncategorized references. A reassignment that would significantly increase its diversity could assign them to unrelated disciplines such as, for example, zoology and Slavic literature. While such an assignment is not invalid per se, it is nevertheless prohibitively unlikely. In this section we present a method to exclude such improbable disciplines.
Our primary goal is to choose for each document a subset \({\mathcal {T}}_{\text {prune}}\) of the set \(\mathcal {T}\) of all disciplines that contains such exceedingly unlikely candidates. Since we do not possess any knowledge of the disciplines of uncategorized references, we infer this information from the disciplines of the categorized references. In the end, these deductions lead to additional constraints for the optimization problems Eqs. 2 and 3 of the form

\(n_i = 0 \;\; \text {(resp. } p_i = 0\text {)} \quad \forall \, i \in {\mathcal {I}}_{\text {prune}} \qquad (6)\)
where \({\mathcal {I}}_{\text {prune}}\) denotes the indices that correspond to the pruned disciplines that are contained in \({\mathcal {T}}_{\text {prune}}\).
A straightforward solution would be to eliminate all disciplines that are not observed among the categorized references, i.e., to set the constraint \(n_i = 0\) (resp. \(p_i = 0\)) whenever \(c_i = 0\). The problem with this approach is that it does not allow new disciplines to be introduced through the reassignment of uncategorized references, which would significantly underestimate the achievable diversity.
In contrast, we take the mutual similarities of different disciplines into account for which we utilize the similarity matrix \(\mathbf {S}\) as given in Eq. 1. If the categorized references are from closely related disciplines, we only permit very similar disciplines to participate in the reassignment procedure, whereas we allow a larger set of disciplines for categorized references belonging to a diverse set of disciplines.
Our method is based on the concept of a discipline neighborhood \(\mathcal {H}_i\) of a discipline \(\tau _i \in {\mathcal {T}}\) with index i, given by all those disciplines that have a similarity higher than a given value \(\Delta\), i.e.,

\(\mathcal {H}_i = \{ \tau _j \in {\mathcal {T}} \mid s_{ij} > \Delta \} \qquad (7)\)
where \(\Delta\) effectively controls the size of \(\mathcal {H}_i\). The set of permissible disciplines \({\mathcal {T}}_{\text {valid}}\) is then given as a union of such neighborhoods—one for each discipline that is observed from the categorized references. Note that the set of removed disciplines \({\mathcal {T}}_{\text {prune}}\) is given as the complement of this set, i.e., \({\mathcal {T}}_{\text {prune}}= {\mathcal {T}}\setminus {\mathcal {T}}_{\text {valid}}\). For the actual computation of this set of neighborhoods, we propose the following objectives:
Completeness: Each neighborhood should contain at least two observed disciplines. This ensures that each neighborhood includes at least all disciplines that are more similar than the next most similar known discipline.
Cohesion: The neighborhoods should form a single connected component, to avoid multiple disjoint discipline clusters. For documents with references in, for example, two dissimilar disciplines, omitting this objective could yield a set of permissible disciplines that are very similar to either of the two known disciplines while ignoring the disciplines between them.
Conciseness: The neighborhoods should be chosen so as to yield the smallest possible set of permissible disciplines that fulfills the previous objectives. This ensures the meaningfulness of the upper bound of the uncertainty interval.
As we show in Appendix "Discipline pruning", we can obtain a set of permissible disciplines \({\mathcal {T}}_{\text {valid}}\) that obeys these objectives with the help of maximal spanning trees on the complete graph of disciplines, regarding the similarity matrix \(\mathbf {S}\) as its adjacency matrix. Furthermore, our approach provides a user-chosen tolerance parameter, modulating the similarity values \(\Delta\) of Eq. 7, with which the strictness of the pruning can be controlled. A tolerance of 0 allows all disciplines to participate in the redistribution process (i.e., \({\mathcal {T}}_{\text {prune}}= \emptyset\)), while a value of 1 does not introduce any additional tolerance. Note that the corresponding constraints (see Eq. 6) effectively reduce the dimensionality of the optimization problem: Eqs. 2 or 3 can be computed only on those discipline counts or proportions that are members of \({\mathcal {T}}_{\text {valid}}\). Details on the employed algorithms can be found in the "Computational methods" section, and our choice of the tolerance value is motivated in the "Computation of the Rao–Stirling index and its uncertainty interval" section.
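As an illustration of the neighborhood idea, the following sketch implements only the completeness objective: every observed discipline keeps at least its most similar observed peer in its neighborhood. The cohesion step via maximal spanning trees, detailed in the appendix, is omitted, and the similarity matrix is hypothetical:

```python
import numpy as np

def valid_disciplines(observed, S, tolerance=1.0):
    """Simplified sketch of discipline pruning: for each observed
    discipline, keep every discipline at least as similar as its
    nearest observed neighbour.  `tolerance` in [0, 1] relaxes the
    threshold (0 keeps all disciplines, 1 is strictest).  The paper's
    full method additionally enforces cohesion via a maximal
    spanning tree; see its appendix."""
    observed = list(observed)
    valid = set(observed)
    for i in observed:
        others = [j for j in observed if j != i]
        if not others:
            continue
        # highest similarity from i to another observed discipline
        delta = max(S[i, j] for j in others) * tolerance
        valid.update(int(j) for j in np.flatnonzero(S[i] >= delta))
    return sorted(valid)

S = np.array([[1.0, 0.8, 0.3, 0.05],
              [0.8, 1.0, 0.4, 0.05],
              [0.3, 0.4, 1.0, 0.1],
              [0.05, 0.05, 0.1, 1.0]])
print(valid_disciplines([0, 1], S))            # strictest: only 0 and 1
print(valid_disciplines([0, 1], S, 0.5))       # relaxed: 2 joins, 3 stays pruned
```

Lowering the tolerance admits moderately similar disciplines such as discipline 2, while a clearly unrelated discipline (here 3) remains pruned.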
Computational methods
In this section, we describe the computational methods used to solve the optimization problems stated in Eqs. 2 and 3 while taking the constraints in Eqs. 4–6 into account. We choose different solution strategies for finding the reassignments with the lowest possible diversity index \(I_-\) and the highest possible diversity index \(I_+\). The need for different strategies lies in the nature of the similarity measure between disciplines, given by the similarity matrix \(\mathbf {S}\): it has to be positive semidefinite to yield a nonnegative diversity index for arbitrary discipline counts. The associated quadratic form \(\mathbf {c} \, \mathbf {S} \, \mathbf {c}^\intercal\) is thus a convex function in \(\mathbf {c}\), while \(-\mathbf {c} \, \mathbf {S} \, \mathbf {c}^\intercal\) is concave. Thus, the Rao–Stirling diversity (see Eq. 1) is a concave function and its maximization (to obtain \(I_+\)) can be computed with the help of quadratic programming (Nocedal and Wright 2006). Note that the constraints in Eqs. 2–5 are linear, so they can be incorporated into the computation as linear equality and inequality constraints and do not impact its polynomial runtime complexity (Kozlov et al. 1980).
The minimization of a concave function has significantly worse complexity, and the computation of \(I_-\) is NP-hard (Pardalos and Vavasis 1991; Sahni 1974). However, we exploit the fact that the Rao–Stirling diversity is purely concave in the sense that all eigenvalues of the negated similarity matrix \(-\mathbf {S}\) are nonpositive. It follows that all local minima lie on the vertices of the polytope bounded by the constraints of the optimization problems (Floudas and Visweswaran 1995). A search over all possible vertices yields the global minimum in exponential time, since the polytope of optimization problem Eq. 2 has \(2^{N_{\mathcal {T}}}\) vertices, where \({N_{\mathcal {T}}}\) denotes the number of disciplines (\({N_{\mathcal {T}}}= 249\) in our case). The constraint refinement of the "Constraint refinement" section reduces the search space significantly and, apart from a more realistic uncertainty estimation, ensures the efficient computability of \(I_-\). Limiting the discipline reassignment to at most four disciplines (i.e., \(k = 4\)) restricts the search space to only \(\sum _{i=1}^{k=4} \left( {\begin{array}{c}{N_{\mathcal {T}}}\\ i\end{array}}\right) \approx 1.6\times 10^{8}\) vertices, which can be explored exhaustively on commodity hardware. See the "Computation of the Rao–Stirling index and its uncertainty interval" section for a discussion of the choice of \(k = 4\).
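The vertex search can be sketched as follows. We first check the size of the search space quoted above, then enumerate the vertices of a simplified polytope in which each uncategorized reference belongs to exactly one discipline, so each vertex assigns all u references to a single discipline. This is a didactic simplification of the full enumeration over subsets of up to k disciplines:

```python
from math import comb
import numpy as np

# Search-space size for k = 4 and N_T = 249 disciplines.
n_vertices = sum(comb(249, i) for i in range(1, 5))
print(f"{n_vertices:.2e}")   # about 1.6e8, as stated in the text

def lower_bound(counts, u, S):
    """Sketch of I_-: the diversity is concave, so the minimum is
    attained at a vertex of the feasible polytope.  Assuming each
    uncategorized reference belongs to exactly one discipline, the
    vertices are the count vectors in which all u references are
    assigned to a single discipline j."""
    c = np.asarray(counts, dtype=float)
    best = np.inf
    for j in range(len(c)):
        n = c.copy()
        n[j] += u                    # assign every unknown reference to j
        p = n / n.sum()
        best = min(best, 1.0 - p @ S @ p)
    return best

S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
print(lower_bound([5, 0, 0], 5, S))  # 0.0: all unknowns join discipline 0
```

The lower bound is reached by the least diverse admissible assignment; here, putting every uncategorized reference into the single already-observed discipline.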
The discipline pruning and the corresponding maximal spanning tree have negligible computational overhead but reduce the dimensionality of the aforementioned minimization and maximization problems even further; the computation of \(I_-\) especially benefits from this approach. For the spanning tree computation, Prim's algorithm is used (Prim 1957).
Evaluation
The evaluation of the proposed method was conducted empirically. Following the framework for knowledge integration and diffusion suggested by Liu et al. (2012), we calculated the uncertainty intervals of the interdisciplinarity of the publications of a set of individuals. Ground-truth bibliographic data provided by the authors in personal interviews was used to evaluate the method. The results of our method, computed with incomplete data from digital libraries, were compared with the results of the Rao–Stirling index calculated with ground-truth data.
Sample frame
The sample frame of this study consists of the publications of doctoral researchers in a Computer Science (CS) faculty of a highly ranked European university between 2009 and 2014. Doctoral researchers are usually the main authors of their publications and have thorough knowledge of the literature they reference. We focus on CS because this field emerged from the integration of several disciplines and continues to be one of the most interdisciplinary fields owing to its diverse applications. Moreover, CS is an ideal field for evaluating our method because gathering publication data with a high percentage of categorized references is especially challenging. While in other fields conferences serve as venues for community building and maintenance, in CS they focus on selectivity, quality, and the fast dissemination needed in such a fast-evolving field, which drives down conference acceptance rates (Grudin 2011). Therefore, CS researchers target their publications at conferences, which are regarded as the primary means of publication in the field. Since conference publications are not associated with the taxonomy of disciplines of WoS, which we use in this analysis, a high number of uncategorized references is obtained.
Data collection
In order to gather the most complete and accurate record of publications and their references, data was collected from several sources. First, the publication database of the university was used to collect all publications of doctoral students of the CS faculty published between 2009 and 2014. This database contains an exhaustive list of publications authored by those affiliated with the university, as its records are used to compute the financial allocations to the different research groups. Because the university's publication database does not keep records of references, in the next step we gathered additional data from online bibliographic databases: (1) Scopus from Elsevier, which offers high coverage of articles; and (2) WoS from Thomson Reuters, which provides a comprehensive citation search and encompasses publications of multiple online databases, resulting in multidisciplinary coverage.
The association of publications with disciplinary fields was made using the taxonomy of disciplines of WoS, called Category Terms (CTs). It contains 249 CTs and is elaborated from a combination of subject-matter expert judgments and inter-journal citation patterns that together serve to cluster journals into topical groupings. Since there is no consensus on a perfect taxonomy of disciplines, that of WoS was selected because of its extensive use in the bibliometric analyses of previous related work, but other taxonomies could also be used. As a measure of similarity between CTs, we used the co-citation similarity matrix provided by Porter and Rafols (2009).
The combination of several databases increases the completeness of the record of references but decreases the percentage of publications categorized with CTs, since only journal publications indexed by WoS are categorized. Our dataset contains 1746 publications authored by 225 doctoral students. The extraction of references was possible for 1068 publications indexed by WoS or Scopus. The association of CTs with references was possible for the 979 publications that had references indexed by WoS. A total of 12,243 references were extracted, of which 5310 are categorized with CTs.
Computation of the Rao–Stirling index and its uncertainty interval
We calculated the Rao–Stirling index and the uncertainty interval of the 1068 publications for which the extraction of references was possible. The limit of discipline reassignment for the uncertainty interval was set to \({k=4}\); this value corresponds to the 99th percentile of the number of CTs used by WoS to categorize the journals of our dataset. The tolerance was also set to the 99th percentile of similarity between CTs (\({t=0.233}\)) in order to incorporate a slight diversity into the pool of similar CTs used in the reassignment procedure.
The results can be observed in Fig. 1. It is very common for publications to have only some of their references categorized, while the rest remain uncategorized (publication IDs 81–979). When every reference of a publication is categorized with the same single CT, both endpoints of the uncertainty interval are 0, as no CTs need to be redistributed (IDs 1–6). In cases where a publication that references a single CT also has uncategorized references (IDs 7–80), the lower bound of the interval is 0 (all uncategorized references could be assigned to the same single CT), while the upper bound is greater than 0 (the uncategorized references could be assigned to different CTs). If all references of a publication are uncategorized, the Rao–Stirling index cannot be computed and the size of the uncertainty interval is at its maximum (IDs 980–1068).
The size of the uncertainty interval indicates the level of accuracy of the Rao–Stirling index. The interval is large when publications contain a large proportion of uncategorized references, and it converges to a single value when all references are categorized (see Fig. 2). The significance of this relationship is confirmed through linear regression analysis with p value \(<2.2 \times 10^{-16}\).
Collection of groundtruth data
We refer to ground-truth data as complete and correct publication records with complete and correct categorization of references. The manual gathering of such data is very time-consuming; therefore, a sample of publications was selected from the whole publication dataset. We applied stratified sampling with samples of equal size in each stratum, in order to obtain a sample of publications with different degrees of completeness and interdisciplinarity. Publications were divided into mutually exclusive subgroups depending on two variables: (1) the proportion of categorized references among all references of a paper; and (2) the degree of interdisciplinarity of a publication, calculated using the Rao–Stirling index with the incomplete publication dataset previously gathered from the digital libraries WoS and Scopus (see "Data collection" section). Both variables were divided into 4 intervals, creating 16 subgroups of publications. From each subgroup 3 publications were randomly selected, yielding a sample of 48 publications. First authors were invited to participate in our study; in a few cases a co-author was invited instead, for reasons such as expertise or availability. In personal interviews, the participants categorized the references of their publications using one to four CTs from the taxonomy of WoS. For each interview we provided the following material:

- Digital copies of the author's publication and of all its references, which were gathered manually from digital libraries.

- A printout of the taxonomy of CTs of WoS. In order to make the search for CTs easier for the participants, CTs were grouped into macro-disciplines.
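The stratified selection described at the start of this section can be sketched as follows. The tuple format and the equal-width binning are illustrative assumptions, not the authors' code:

```python
import random

def stratified_sample(pubs, n_bins=4, per_stratum=3, seed=7):
    """Sketch of stratified sampling: publications are binned by
    (1) proportion of categorized references and (2) Rao-Stirling
    index into n_bins x n_bins strata, and up to `per_stratum`
    publications are drawn from each stratum.  `pubs` is assumed to
    be a list of (id, completeness, rao_stirling) tuples with both
    variables in [0, 1]."""
    rng = random.Random(seed)
    strata = {}
    for pid, completeness, rs in pubs:
        key = (min(int(completeness * n_bins), n_bins - 1),
               min(int(rs * n_bins), n_bins - 1))
        strata.setdefault(key, []).append(pid)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample
```

With 16 populated strata and 3 publications drawn from each, this yields the sample of 48 publications described above.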
Data collection via personal interviews was chosen over a questionnaire in order to ensure the gathering of higher quality data, which allowed us to:

Explain the importance of providing objective data. Since interdisciplinary research has a good connotation, it was important to make our participants understand that they were not going to be evaluated in terms of interdisciplinarity. We asked them to provide us with the most objective data without exaggerating interdisciplinarity or singledisciplinarity.

Make sure that participants became acquainted with the taxonomy of CTs, as none of the participants were familiar with it.

Confirm that participants understood their task. Participants were asked to think out loud and explain their choice of CTs for verification purposes.

Make sure that each participant followed the same criteria to categorize publications into disciplines.
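The stratified sampling scheme described above (4 × 4 = 16 mutually exclusive subgroups, 3 publications each) can be sketched as follows. The publication records here are toy data and the stratum boundaries are illustrative assumptions, not the study's actual dataset:

```python
import random

# Illustrative reconstruction of the paper's stratified sampling scheme.
# The real study binned publications by (proportion of categorized
# references, Rao-Stirling score); here both variables are random toys.
random.seed(0)
publications = [(i, random.random(), random.random()) for i in range(2000)]

def stratum(completeness, rao_stirling, bins=4):
    """Map a publication to one of bins * bins mutually exclusive subgroups."""
    c = min(int(completeness * bins), bins - 1)
    r = min(int(rao_stirling * bins), bins - 1)
    return (c, r)

strata = {}
for pub in publications:
    strata.setdefault(stratum(pub[1], pub[2]), []).append(pub)

# 3 publications per stratum -> 16 * 3 = 48 publications in total
sample = [p for group in strata.values() for p in random.sample(group, 3)]
```

Sampling an equal number of publications per stratum guarantees that rare combinations (e.g., highly interdisciplinary papers with very incomplete reference categorization) are represented in the interviews.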
Comparative analysis
In order to evaluate the performance of our method, its results were compared with the measurement of IDR based on completed data. The ground-truth data provided by the interview participants was used to complete the missing categorization of references in their publications. We then recomputed the Rao–Stirling index of these publications, this time using the completed data. The results of the Rao–Stirling index with completed data are compared with the results of the Rao–Stirling index with incomplete data in Fig. 3. On average, the results of the Rao–Stirling index calculated with completed data are higher and less variable (see Table 1).
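Both computations use the Rao–Stirling diversity (Stirling 2007), \(\sum_{i \ne j} p_i p_j (1 - s_{ij})\), where \(p_i\) is the proportion of a publication's references in category \(i\) and \(s_{ij}\) the inter-category similarity. A minimal pure-Python sketch:

```python
def rao_stirling(counts, similarity):
    """Rao-Stirling diversity: sum over pairs of category proportions,
    weighted by their distance d_ij = 1 - s_ij (Stirling 2007)."""
    total = sum(counts)
    p = [c / total for c in counts]          # per-category proportions
    n = len(counts)
    return sum(p[i] * p[j] * (1.0 - similarity[i][j])
               for i in range(n) for j in range(n) if i != j)

# Two entirely dissimilar categories, references split evenly:
s = [[1.0, 0.0], [0.0, 1.0]]
print(rao_stirling([5, 5], s))   # maximal diversity for two categories
print(rao_stirling([10, 0], s))  # single-discipline publication
```

Completing the categorization of references changes the proportions \(p\), which is why the index values in Fig. 3 shift between the incomplete and completed datasets.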
Since the bounds of the uncertainty interval are an estimation of the possible highest and lowest Rao–Stirling index of a publication with incomplete data, its result is correct when the interval includes the Rao–Stirling index with completed data (see Fig. 4). The accuracy of the uncertainty interval is affected by the degree of categorized reference completeness of the publications.
In order to assess the performance of both our method and the Rao–Stirling index, where both use incomplete data, we compare the average of their results to those of the Rao–Stirling index with completed data (see Table 2). Since our method provides a measure of uncertainty, we also assess its performance by weighting the results of the uncertainty interval according to the size of the intervals, where smaller intervals have more weight than larger ones. Thus, more accurate intervals (publications with more complete data) have more weight than inaccurate intervals (publications with more incomplete data).
The results of our method are closer to those of the Rao–Stirling index with completed data. This suggests that the center of the uncertainty interval is a more accurate IDR measurement than the Rao–Stirling index with incomplete data. The most accurate results are those of the weighted center of the uncertainty interval, whose standard deviation even includes the actual mean of the Rao–Stirling index with completed data (see Fig. 5).
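The weighted aggregation can be sketched as follows. The paper only states that smaller intervals receive more weight; the inverse-width weighting used here is an illustrative assumption, not the study's exact weighting function:

```python
def interval_center(lo, hi):
    return (lo + hi) / 2.0

def weighted_center_mean(intervals, eps=1e-9):
    """Aggregate interval centers, down-weighting wide (uncertain) intervals.
    The weight w = 1 / (width + eps) is an illustrative choice; eps guards
    against division by zero for fully categorized publications."""
    weights = [1.0 / (hi - lo + eps) for lo, hi in intervals]
    centers = [interval_center(lo, hi) for lo, hi in intervals]
    return sum(w * c for w, c in zip(weights, centers)) / sum(weights)

# A narrow (reliable) interval dominates a wide (unreliable) one:
intervals = [(0.40, 0.42), (0.10, 0.70)]
print(round(weighted_center_mean(intervals), 3))
```

With this scheme, publications whose references are almost fully categorized dominate the aggregate, mirroring the intuition that their Rao–Stirling values are the most trustworthy.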
Discussion
The accuracy of citation-based IDR measurements depends heavily on the quality of the bibliographic data. The combination of data from several sources might help to enhance the quality of the data, but it certainly does not assure ground-truth bibliographic data. The dataset gathered for the evaluation of our method is an example of an incomplete one, even though data from three different digital libraries was extracted and combined. Not all publications in our dataset have a complete record of references, and not all references are categorized with CTs. The Rao–Stirling index cannot take either problem into account, as it is not designed to handle missing data.
Our method tackles the problem of uncategorized references by extending the Rao–Stirling index to encode the uncertainty caused by missing data as an interval. For particularly interdisciplinary publications, a high degree of incompleteness may also result in underestimating the upper bound of the uncertainty interval. This is especially problematic when a publication has only one reference categorized with a single CT. Such a degree of incompleteness affects the rational redistribution of CTs needed to compute the upper endpoint of the uncertainty interval (see publication ID = 6 in Figs. 3 and 4). The main benefit of the uncertainty interval is that it acts as a confidence indicator for the results delivered by the Rao–Stirling index. On the one hand, publications with a low proportion of uncategorized references have correspondingly small uncertainty intervals, implying a more reliable measurement of the Rao–Stirling index. On the other hand, publications with a high proportion of uncategorized references have correspondingly large uncertainty intervals, indicating an unreliable measurement. This finding confirms the importance of selecting publications with a proportion of categorized references above a threshold value when computing an index of interdisciplinarity, as in the analysis of Rafols et al. (2012).
The empirical evaluation of our method confirms that acknowledging missing data delivers a more accurate aggregated IDR measurement than the Rao–Stirling index. Our contribution constitutes a first approach to measuring IDR while taking into account the inaccuracy of the bibliographic data, but other problems still affect the results of the Rao–Stirling and other IDR indices. Future evaluations of this method should be conducted using other taxonomies of disciplines. Further work is needed to tackle the problem of incomplete and incorrect records of references, as well as the incorrect categorization of publications into disciplinary fields. Additional issues to consider are the use of a precise taxonomy of disciplines and a precise similarity matrix. Therefore, further avenues of research towards more precise IDR indicators remain open. To aid these efforts, we are providing the source code of our implementation of the uncertainty computation to the community at https://gitlab.com/mc.calatrava.moreno/robustrao.git.
References
Bensman, S. J., & Leydesdorff, L. (2009). Definition and identification of journals as bibliographic and subject entities: Librarianship versus ISI Journal Citation Reports methods and their effect on citation measures. Journal of the American Society for Information Science and Technology, 60(6), 1097–1117. doi:10.1002/asi.21020.
Bjurström, A., & Polk, M. (2011). Climate change and interdisciplinarity: A co-citation analysis of the IPCC Third Assessment Report. Scientometrics, 87(3), 525–550. doi:10.1007/s11192-011-0356-3.
Boyack, K. W., & Klavans, R. (2010). Co-citation analysis, bibliographic coupling, and direct citation: Which citation approach represents the research front most accurately? Journal of the American Society for Information Science and Technology, 61(12), 2389–2404. doi:10.1002/asi.21419.
Carley, S., & Porter, A. L. (2011). A forward diversity index. Scientometrics, 90(2), 407–427. doi:10.1007/s11192-011-0528-1.
Chen, C., Hu, Z., Liu, S., & Tseng, H. (2012). Emerging trends in regenerative medicine: A scientometric analysis in CiteSpace. Expert Opinion on Biological Therapy, 12(5), 593–608. doi:10.1517/14712598.2012.674507.
Chen, C., Ibekwe-SanJuan, F., & Hou, J. (2010). The structure and dynamics of co-citation clusters: A multiple-perspective co-citation analysis. Journal of the American Society for Information Science and Technology, 61(7), 1386–1409. doi:10.1002/asi.21309.
Floudas, C. A., & Visweswaran, V. (1995). Quadratic optimization. In R. Horst & P. Pardalos (Eds.), Handbook of global optimization (Vol. 2, pp. 217–269). New York: Springer.
Grudin, J. (2011). Technology, conferences, and community. Communications of the ACM, 54(2), 41–43. doi:10.1145/1897816.1897834.
Kozlov, M., Tarasov, S., & Khachiyan, L. (1980). The polynomial solvability of convex quadratic programming. USSR Computational Mathematics and Mathematical Physics, 20(5), 223–228. doi:10.1016/0041-5553(80)90098-1.
Leydesdorff, L. (2007). Betweenness centrality as an indicator of the interdisciplinarity of scientific journals. Journal of the American Society for Information Science and Technology, 58(9), 1303–1319. doi:10.1002/asi.20614.
Leydesdorff, L., Carley, S., & Rafols, I. (2013). Global maps of science based on the new Web-of-Science categories. Scientometrics, 94(2), 589–593. doi:10.1007/s11192-012-0784-8.
Leydesdorff, L., Rafols, I., & Chen, C. (2013). Interactive overlays of journals and the measurement of interdisciplinarity on the basis of aggregated journaljournal citations. Journal of the American Society for Information Science and Technology, 64(12), 2573–2586. doi:10.1002/asi.22946.
Liu, Y., Rafols, I., & Rousseau, R. (2012). A framework for knowledge integration and diffusion. Journal of Documentation, 68(1), 31–44. doi:10.1108/00220411211200310.
Moed, H., Burger, W., Frankfort, J., & Van Raan, A. F. (1985). The application of bibliometric indicators: Important field- and time-dependent factors to be considered. Scientometrics, 8(3–4), 177–203. doi:10.1007/BF02016935.
Moed, H., De Bruin, R., & Van Leeuwen, T. (1995). New bibliometric tools for the assessment of national research performance: Database description, overview of indicators and first applications. Scientometrics, 33(3), 381–422. doi:10.1007/BF02017338.
National Research Council. (2010). Data on federal research and development investments: A pathway to modernization. Washington, DC: The National Academies Press.
Nocedal, J., & Wright, S. J. (2006). Numerical optimization (2nd ed.). New York: Springer. doi:10.1007/978-0-387-40065-5.
Pardalos, P. M., & Vavasis, S. A. (1991). Quadratic programming with one negative eigenvalue is NPhard. Journal of Global Optimization, 1(1), 15–22. doi:10.1007/BF00120662.
Ponomarev, I., Sulima, P., Basner, J., Jensen, U., Schnell, J., Jo, K., et al. (2013). A new approach for automated author discipline categorization and evaluation of cross-disciplinary collaborations for grant programs. In Proceedings of the 14th International Society of Scientometrics and Informetrics Conference (Vol. 2).
Porter, A. L., Cohen, A. S., Roessner, J. D., & Perreault, M. (2007). Measuring researcher interdisciplinarity. Scientometrics, 72(1), 117–147. doi:10.1007/s11192-007-1700-5.
Porter, A. L., & Rafols, I. (2009). Is science becoming more interdisciplinary? Measuring and mapping six research fields over time. Scientometrics, 81(3), 719–745. doi:10.1007/s11192-008-2197-2.
Prim, R. (1957). Shortest connection networks and some generalizations. The Bell System Technical Journal, 36(6), 1389–1401. doi:10.1002/j.1538-7305.1957.tb01515.x.
Pudovkin, A. I., & Garfield, E. (2002). Algorithmic procedure for finding semantically related journals. Journal of the American Society for Information Science and Technology, 53(13), 1113–1119. doi:10.1002/asi.10153.
Rafols, I., & Leydesdorff, L. (2009). Contentbased and algorithmic classifications of journals: Perspectives on the dynamics of scientific communication and indexer effects. Journal of the American Society for Information Science and Technology, 60(9), 1823–1835. doi:10.1002/asi.21086.
Rafols, I., Leydesdorff, L., O'Hare, A., Nightingale, P., & Stirling, A. (2012). How journal rankings can suppress interdisciplinary research: A comparison between innovation studies and business & management. Research Policy, 41(7), 1262–1282. doi:10.1016/j.respol.2012.03.015.
Rhoades, S. A. (1993). The Herfindahl–Hirschman index. Federal Reserve Bulletin, 79(Mar), 188–189.
Sahni, S. (1974). Computationally related problems. SIAM Journal on Computing, 3(4), 262–279. doi:10.1137/0203021.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423. doi:10.1002/j.1538-7305.1948.tb01338.x.
Stirling, A. (2007). A general framework for analysing diversity in science, technology and society. Journal of the Royal Society Interface, 4(15), 707–719. doi:10.1098/rsif.2007.0213.
van Raan, A. (1996). Advanced bibliometric methods as quantitative core of peer review based evaluation and foresight exercises. Scientometrics, 36(3), 397–420. doi:10.1007/BF02129602.
Wagner, C. S., Roessner, J. D., Bobb, K., Klein, J. T., Boyack, K. W., Keyton, J., et al. (2011). Approaches to understanding and measuring interdisciplinary scientific research (IDR): A review of the literature. Journal of Informetrics, 5(1), 14–26. doi:10.1016/j.joi.2010.06.004.
Acknowledgments
The authors wish to thank the 48 doctoral researchers who agreed to participate in this study and generously shared their time to be interviewed.
Appendices
Appendix: Transformation of the optimization problem
In this section, we show how to transform the optimization problems between count space and proportion space. Starting with the optimization problem given in Eq. 2, we transform the per-discipline counts \(\mathbf{n}\) into per-discipline proportions \(\mathbf{p}\) by applying \(\mathbf{p} = \mathbf{n}/\Vert \mathbf{n} \Vert_1\). The quadratic form \(\frac{\mathbf{n} \, \mathbf{S} \, \mathbf{n}^\intercal}{\Vert \mathbf{n} \Vert_1^2}\) becomes \(\mathbf{p} \, \mathbf{S} \, \mathbf{p}^\intercal\) due to linearity. The normalization of \(\mathbf{p}\) is captured by the constraint \(\Vert \mathbf{p} \Vert_1 = 1\). The transformation of the other constraints requires more effort: after proving a general transformation lemma in Appendix “Constraint transformation lemma” section, we apply it to convert the various constraints on \(\mathbf{n}\) from count space to proportion space (see Appendices “Hypercube constraints” and “Constraint refinement”).
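The equality of the two quadratic forms under \(\mathbf{p} = \mathbf{n}/\Vert \mathbf{n} \Vert_1\) can be checked numerically; a small sketch with toy values (the matrix and counts below are illustrative, not from the paper):

```python
def quad_form(v, S):
    """Evaluate v S v^T for a vector v and square matrix S."""
    n = len(v)
    return sum(v[i] * S[i][j] * v[j] for i in range(n) for j in range(n))

S = [[1.0, 0.3, 0.1],
     [0.3, 1.0, 0.5],
     [0.1, 0.5, 1.0]]          # symmetric similarity matrix (toy values)
n = [4.0, 1.0, 2.0]            # per-discipline reference counts
norm1 = sum(n)                 # counts are non-negative, so ||n||_1 = sum
p = [x / norm1 for x in n]     # per-discipline proportions

lhs = quad_form(n, S) / norm1 ** 2   # count-space form, normalized
rhs = quad_form(p, S)                # proportion-space form
assert abs(lhs - rhs) < 1e-12        # the two formulations agree
```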
Constraint transformation lemma
To transform the constraints given in count space to their corresponding form in proportion space, we will develop a lemma that treats the general case of arbitrary constraints. In the subsequent appendix sections, we employ it to transform the concrete constraints that arise from the formulation of the discipline assignment as an optimization problem. Here and below, we represent the per-discipline count \(\mathbf{n}\) as the sum of the initial count \(\mathbf{c}\) and the additional count \(u \mathbf{\lambda}\) that arises from the assignment of disciplines to the uncategorized references, i.e., \(\mathbf{n} = \mathbf{c} + u \mathbf{\lambda}\).
Lemma 1
(Transformation Lemma) Let \(\mathbf{q}(\mathbf{\lambda}) = \frac{\mathbf{c} + u \mathbf{\lambda}}{\Vert \mathbf{c} + u \mathbf{\lambda} \Vert_1} \in \mathbb{R}^n\) where \(\mathbf{\lambda} \in \mathbb{R}^n\) is contained in the intersection of the non-identical \((n-1)\)-dimensional hyperplanes \(\mathbf{a} \cdot \mathbf{\lambda} = \alpha\) and \(\mathbf{b} \cdot \mathbf{\lambda} = \beta\). Thus, \(\mathbf{q}(\mathbf{\lambda})\) constitutes a hyperplane in the space \(\Vert \cdot \Vert_1 = 1\) and the sign of the expression
determines on which side of this hyperplane a point \(\mathbf {p}\) lies.
Proof
Since the hyperplanes intersect, we can assume—apart from \(\mathbf{a} \ne \mathbf{0}\) and \(\mathbf{b} \ne \mathbf{0}\)—that \(\mathbf{a}\) and \(\mathbf{b}\) are linearly independent. As a consequence, there exist \(i,j \in \{1,\dots,n\}\) with \(i \ne j\) such that the elements \(a_i \in \mathbf{a}\) and \(b_j \in \mathbf{b}\) are nonzero. The linear system given by the hyperplane equations \(\mathbf{a} \cdot \mathbf{\lambda} = \alpha\) and \(\mathbf{b} \cdot \mathbf{\lambda} = \beta\) allows us to express two components of \(\mathbf{\lambda}\) as
where \(\underline{\mathbf{v}} \in \mathbb{R}^{n-2}\) denotes a vector \(\mathbf{v} \in \mathbb{R}^n\) with the \(i\)th and \(j\)th components removed. The components \(q_m\) of \(\mathbf{q}\) can be written as
with
To compute the \(n-2\) vectors that span the \((n-2)\)-dimensional space of \(\mathbf{q}(\underline{\mathbf{\lambda}})\), we compute its derivative with respect to all components \(\lambda_k\) of \(\underline{\mathbf{\lambda}}\) with \(k \in \{1,\dots,n\}\) and \(k \ne i,j\). As only the sign of the final expression is of interest, uniform scaling of these vectors is permitted and we omit the \(N(\underline{\mathbf{\lambda}})^{-2}\) term that arises with the differentiation. We get
Together with the normal vector \(\mathbf{1} = (1,\cdots,1)\) of the \(\Vert \cdot \Vert_1 = 1\) hyperplane, of which \(\mathbf{q}(\underline{\mathbf{\lambda}})\) is a subset, we can compute the ‘binormal’ vector \(\mathbf{r}\) as the \((n-1)\)-ary product
of the \(n-2\) derivatives, the normal vector \(\mathbf{1}\) and the set of standard basis vectors \(\mathbf{e}_1, \cdots, \mathbf{e}_n\) and we obtain
The scaled signed distance between an arbitrary point \(\mathbf {p}\) and the hyperplane defined by its normal vector \(\mathbf {r}\) yields the desired expression
\(\square\)
Hypercube constraints
In this section, we transform the constraints \(c_i \le n_i \le c_i + u\) of Eq. 2, which describe a hypercube in \(N_{\mathcal{T}}\) dimensions, to proportion space. We will abbreviate \(N_{\mathcal{T}}\) with \(N\) and see that, due to the normalization by the 1-norm, the hypercube is projected onto an \((N-1)\)-dimensional hyperplane along the radial directions. First, we observe that after the projection, the hypercube vertex \({\mathbf{c}}_{\text{min}} = (c_1,\ldots,c_N)\) that lies closest to the origin is a convex combination of its neighboring vertices, i.e., for \(\mathbf{\mu} = (\mu_1,\ldots,\mu_N)\) with \(\Vert \mathbf{\mu} \Vert_1 = 1\) we have that
which confirms the convexity of the combination since \(0 \le c_i / \Vert \mathbf{c} \Vert_1 \le 1\). \({\hat{\mathbf{u}}}_{i}\) denotes a vector of zeros with \(u\) as the \(i\)th component, and we will use \(\check{\mathbf{u}}_{i}\) for a vector of \(u\) entries with zero at the \(i\)th component. The vertex \({\mathbf{c}}_{\text{max}} = (c_1 + u,\ldots,c_N + u)\) that lies farthest from the origin can also be represented by a convex combination of its neighbors, since
This leads us to the conclusion that all \((n-1)\)-dimensional facets that contain either \({\mathbf{c}}_{\text{min}}\) or \({\mathbf{c}}_{\text{max}}\) lie completely in the interior of the hypercube’s projection and, consequently, their \((n-2)\)-facets that contain those vertices do not contribute to the boundary of the projected hypercube. Note that this is not the case for any other facet. This also indicates that the constraints \(\Vert \mathbf{c} \Vert_1 + u \le \Vert \mathbf{n} \Vert_1 \le \Vert \mathbf{c} \Vert_1 + (N-1) \, u\) and \(\Vert \mathbf{c} \Vert_1 \le \Vert \mathbf{n} \Vert_1 \le \Vert \mathbf{c} \Vert_1 + N u\) are effectively equivalent after projection onto \(\Vert \cdot \Vert_1 = 1\).
To determine the form of the constraints \(c_i \le n_i \le c_i + u\) in proportion space, we project the \((n-2)\)-dimensional ridges of the associated hypercube onto the \((n-1)\)-dimensional hyperplane defined by \(\Vert \cdot \Vert_1 = 1\). Each ridge is given as an intersection of two of the hyperplanes that contain the facets of the hypercube. For \(\mathbf{n} = \mathbf{c} + u \, \mathbf{\lambda}\), they are given as \(\lambda_i = 1\) and \(\lambda_j = 0\). All ridges can be obtained by varying \(i,j \in \{1,\ldots,n\}\) with \(i \ne j\). Note that \(\mathbf{\lambda} = \mathbf{0}\) and \(\mathbf{\lambda} = \mathbf{1}\) are omitted due to the convexity argument given above.
In the context of Lemma 1, the hyperplane equations are \(\mathbf{a} = \hat{\mathbf{1}}_i\) and \(\alpha = 1\) as well as \(\mathbf{b} = \hat{\mathbf{1}}_j\) and \(\beta = 0\). An application of Eq. 8 gives the expression \((u + c_i)\,p_j - c_j\,p_i \ge 0\) as the criterion that \(\mathbf{p}\) lies inside the projection of the hypercube. This gives the hypercube constraints of the optimization problem, as stated in Eq. 2, in proportion space (see Eq. 3) as
Note that these constraints trivially hold for the diagonal elements \(i=j\) as well, since \(u \ge 0\). In the case of vanishing reference count \(c_j\) for a given discipline j, we simply set \(p_i \le \infty\) and effectively omit the constraint.
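The derived criterion can be checked numerically: projecting any count vector \(\mathbf{n} = \mathbf{c} + u\,\mathbf{\lambda}\) with \(\mathbf{\lambda} \in [0,1]^N\) onto \(\Vert \cdot \Vert_1 = 1\) yields a point that satisfies \((u + c_i)\,p_j - c_j\,p_i \ge 0\) for all pairs. A sketch with toy values:

```python
import random

random.seed(1)
N, u = 4, 3.0                   # disciplines, number of uncategorized refs
c = [2.0, 1.0, 4.0, 0.5]        # categorized reference counts (toy values)

for _ in range(1000):
    lam = [random.random() for _ in range(N)]      # lambda_i in [0, 1]
    n = [c[i] + u * lam[i] for i in range(N)]      # count-space point
    t = sum(n)
    p = [x / t for x in n]                         # projection onto ||.||_1 = 1
    for i in range(N):
        for j in range(N):
            # hypercube constraint in proportion space
            assert (u + c[i]) * p[j] - c[j] * p[i] >= -1e-12
```

Expanding the expression shows it equals \(u\,(c_j(1-\lambda_i) + (u + c_i)\lambda_j)/\Vert\mathbf{n}\Vert_1 \ge 0\), which is why the check never fails for feasible \(\mathbf{\lambda}\).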
Constraint refinement
By limiting the number of disciplines that each uncategorized reference can be assigned to by \(k\), we arrive at the additional count-space constraint \(\Vert \mathbf{n} \Vert_1 \le \Vert \mathbf{c} \Vert_1 + k \, u\) (see Eq. 4). With the hypercube constraints of the previous section and after writing \(\mathbf{n} = \mathbf{c} + u \, \mathbf{\lambda}\), it can be stated as \(\Vert \mathbf{\lambda} \Vert_1 = k\) and \(\lambda_i \ge 0,\ i \in \{1,\ldots,N\}\). In terms of Lemma 1, we have \(\mathbf{a} = \mathbf{1}\) and \(\alpha = k\) as well as \(\mathbf{b} = \hat{\mathbf{1}}_i\) and \(\beta = 0\). Applying the term given by Eq. 8 yields \(-c_i \Vert \mathbf{p} \Vert_1 + (u \, k + \Vert \mathbf{c} \Vert_1) \, p_i \ge 0\) and we obtain the proportion-space equivalent of the constraint in Eq. 4 as
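This refined constraint can likewise be verified numerically: for any \(\mathbf{\lambda} \ge \mathbf{0}\) with \(\Vert \mathbf{\lambda} \Vert_1 \le k\), the projected point satisfies \((u\,k + \Vert\mathbf{c}\Vert_1)\,p_i \ge c_i \Vert\mathbf{p}\Vert_1\). A sketch with toy values:

```python
import random

random.seed(2)
N, u, k = 5, 2.0, 2             # each uncategorized ref spreads over <= k CTs
c = [3.0, 1.0, 0.0, 2.0, 4.0]   # categorized reference counts (toy values)
c1 = sum(c)                     # ||c||_1

for _ in range(1000):
    lam = [random.random() for _ in range(N)]
    scale = k / max(sum(lam), k)               # enforce ||lambda||_1 <= k
    lam = [scale * x for x in lam]
    t = c1 + u * sum(lam)                      # ||n||_1
    p = [(c[i] + u * lam[i]) / t for i in range(N)]
    p1 = sum(p)                                # equals 1 by construction
    for i in range(N):
        # refined constraint in proportion space
        assert (u * k + c1) * p[i] - c[i] * p1 >= -1e-12
```

The inequality follows because \(\Vert\mathbf{n}\Vert_1 \le \Vert\mathbf{c}\Vert_1 + u\,k\), so \((u\,k + \Vert\mathbf{c}\Vert_1)\,p_i \ge \Vert\mathbf{n}\Vert_1\,p_i = c_i + u\lambda_i \ge c_i\).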
Discipline pruning
In this section, we present a method to compute the set \({\mathcal {T}}_{\text {valid}}\) of disciplines that fulfills the requirements laid out in the “Discipline pruning” section, with an accompanying illustration given in Fig. 6. For this, we exploit the properties of the similarity matrix \(\mathbf{S}\), which encodes the closeness between different scientific disciplines. Since it is symmetric (i.e., discipline \(\tau_i\) has the same similarity to \(\tau_j\) as \(\tau_j\) to \(\tau_i\)) and its entries are non-negative, it can be seen as the adjacency matrix of a complete undirected graph \(\mathcal{G}\) with positive (or vanishing) edge weights. Note that while the conventional similarity matrix has ones along its diagonal, we assume the graph to be loop-free without invalidating our argument. We denote the set of vertices of a graph \(g\) with \(V(g)\) and its edges with \(E(g)\), where the latter is a subset of \(V(g) \times V(g)\).
Each vertex in \(V(\mathcal{G})\) corresponds to a discipline, whereas an edge in \(E(\mathcal{G})\) with nonzero weight indicates a certain similarity between its respective disciplines. We now construct the set \({\mathcal {T}}_{\text {valid}}\) of vertices and later validate it against the required properties of the “Discipline pruning” section. In the first step, we take the subgraph \({\mathcal {G}}_{\text {known}}\) of \(\mathcal{G}\) that contains as vertices only those disciplines found in the categorized references of the document at hand, which again yields a complete graph. Next, a maximum spanning tree \({\mathcal {G}}_{\text {span}}\) is computed from \({\mathcal {G}}_{\text {known}}\) and, for each of its vertices \(v_i \in V(\mathcal{G}_{\text{span}})\), we compute a local similarity threshold \(\Delta_i\) by
where the weight of an edge e is denoted by w(e) and t is a usergiven tolerance value in the interval [0, 1]. In the final step, we construct around each vertex \(v_i\) of the spanning tree \({\mathcal {G}}_{\text {span}}\) a discipline neighborhood \(\mathcal {H}_i\) given by
where \(e_{ij}\) is the edge of the initial complete graph \(\mathcal{G}\) that connects vertices \(v_i\) and \(v_j\). The set \({\mathcal {T}}_{\text {valid}}\) of disciplines that participate in the reassignment process is obtained as the union of all neighborhoods, i.e.,
This definition of \({\mathcal {T}}_{\text {valid}}\) fulfills all objectives stated in “Discipline pruning” section for a tolerance of \(t = 1\):
Completeness: Due to condition II in the computation of \(\Delta_i\), each neighborhood \(\mathcal{H}_i\) contains at least one edge of \(E(\mathcal{G})\) and, consequently, its two endpoints.

Cohesion: Due to condition I in the computation of \(\Delta_i\) and the fact that a spanning tree of a complete graph is connected, the neighborhoods form a single connected set.

Conciseness: The maximum spanning tree \({\mathcal {G}}_{\text {span}}\) is the subgraph with the highest internal similarity that still provides a connected subgraph. In this sense, it produces the smallest neighborhoods that are still connected, since condition I ensures that the neighborhoods only ‘touch’ along the edge with the least similarity.
Note that higher tolerances (\(t < 1\)) violate these objectives only in their original sense; they would still be respected for appropriately scaled similarity values.
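The pruning pipeline can be sketched as follows. Since the paper's display equation for \(\Delta_i\) is not reproduced above, the threshold rule here—\(\Delta_i\) equals \(t\) times the smallest spanning-tree edge weight incident to \(v_i\)—is an assumption made for illustration only; the spanning-tree construction itself follows Prim's algorithm (Prim 1957) on maximum weights:

```python
def maximum_spanning_tree(sim, vertices):
    """Prim-style maximum spanning tree: repeatedly attach the outside
    vertex with the highest similarity to the tree built so far."""
    tree, inside = [], {vertices[0]}
    while len(inside) < len(vertices):
        i, j = max(((a, b) for a in inside for b in vertices if b not in inside),
                   key=lambda e: sim[e[0]][e[1]])
        tree.append((i, j))
        inside.add(j)
    return tree

def prune(sim, known, t=1.0):
    """Union of discipline neighborhoods around the spanning-tree vertices.
    The threshold Delta_i used below is an ASSUMED stand-in for the
    paper's formula: t * (weakest spanning-tree edge incident to v_i)."""
    tree = maximum_spanning_tree(sim, known)
    valid = set()
    for v in known:
        incident = [sim[a][b] for a, b in tree if v in (a, b)]
        delta = t * min(incident)                  # assumed Delta_i
        valid |= {w for w in range(len(sim)) if w != v and sim[v][w] >= delta}
    return valid | set(known)

sim = [[1.0, 0.9, 0.1, 0.05],
       [0.9, 1.0, 0.2, 0.05],
       [0.3, 0.2, 1.0, 0.8],
       [0.05, 0.05, 0.8, 1.0]]
kept = prune(sim, [0, 1])   # only the tightly linked disciplines survive
```

Restricting the reassignment of uncategorized references to \({\mathcal {T}}_{\text {valid}}\) keeps the optimization over the upper interval bound from spreading mass onto disciplines unrelated to the publication.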
Calatrava Moreno, M., Auzinger, T. & Werthner, H. On the uncertainty of interdisciplinarity measurements due to incomplete bibliographic data. Scientometrics 107, 213–232 (2016). https://doi.org/10.1007/s11192-016-1842-4