# On the uncertainty of interdisciplinarity measurements due to incomplete bibliographic data

DOI: 10.1007/s11192-016-1842-4

- Cite this article as:
- Calatrava Moreno, M., Auzinger, T. & Werthner, H. Scientometrics (2016) 107: 213. doi:10.1007/s11192-016-1842-4


## Abstract

The accuracy of interdisciplinarity measurements is directly related to the quality of the underlying bibliographic data. Existing indicators of interdisciplinarity are not capable of reflecting the inaccuracies introduced by incorrect and incomplete records because correct and complete bibliographic data can rarely be obtained. This is the case for the Rao–Stirling index, which cannot handle references that are not categorized into disciplinary fields. We introduce a method that addresses this problem. It extends the Rao–Stirling index to acknowledge missing data by calculating its interval of uncertainty using computational optimization. The evaluation of our method indicates that the uncertainty interval is not only useful for estimating the inaccuracy of interdisciplinarity measurements, but it also delivers slightly more accurate aggregated interdisciplinarity measurements than the Rao–Stirling index.

### Keywords

Interdisciplinarity, Rao–Stirling index, Bibliometrics, Missing data, Uncertainty, Optimization, Spanning tree

## Introduction

Most quantitative measures of the output of InterDisciplinary Research (IDR) rely on bibliometric methods. Since such methods are commonly used to inform policy in science and technology, they require reliable indicators and results. While analytical indicators and tools have been refined over time, their results are in most cases not precise. The accuracy of such indicators depends on the quality of the bibliographic data, which should be correct and complete. Unfortunately, gathering a correct and complete bibliographic dataset is a complicated task because not all scientific publications are indexed by digital libraries. Current bibliographic databases, such as the Web of Science (WoS) or Scopus, do not cover books, book chapters and many regional non-English journals in which some fields mainly publish. Even conference proceedings, which constitute the main publication venues in many applied fast-changing fields, are often not indexed. Combining and comparing records from different bibliographic sources mitigates this problem to some extent. However, an additional problem affects top–down approaches to measuring IDR such as the Rao–Stirling diversity index: the need for a predefined taxonomy of disciplines that classifies all publications in the dataset. This problem cannot be solved by comparing data gathered from different sources, because not all libraries classify their publications into a taxonomy of disciplines or use the same taxonomy, and even those that use a taxonomy might not classify all their indexed publications with it, as is the case for WoS. Manual classification of publications into disciplinary fields is also not viable for a large number of uncategorized publications. As a consequence, top–down measurements of IDR usually deliver proxy results.

In this paper we acknowledge the problem of dealing with incomplete data gathered from several libraries. We focus on the problem of uncategorized publications for the measurement of IDR with the Rao–Stirling index. We choose this index because it is a well-established bibliometric indicator that requires a complete categorization of all references into disciplinary fields, yet the problem of uncategorized references has not received adequate attention in the literature. We propose a theoretical extension of the Rao–Stirling index to account for the uncertainty resulting from references that remain uncategorized.

## Background

The field of measuring IDR heavily relies on bibliometric methods and data due to the widely-held view that scientific research is disseminated via publications. Different types of approaches exist for measuring IDR, which have been accordingly endorsed for differing needs of analysis. For an extensive review of approaches, we refer to the work of Wagner et al. (2011). Among them, the most common method for measuring IDR is citation analysis, in which an exchange or integration among fields is captured via discipline-specific citations pointing to other fields. Two distinguishable strategies for measuring IDR are bottom–up and top–down. The first approach is based on clusters of articles without a predefined taxonomy of disciplines. The clustering is based on the structural relationships of a network of publications (Boyack and Klavans 2010; Chen et al. 2010; Leydesdorff 2007; Leydesdorff et al. 2013). In contrast, top–down approaches rely on a predefined taxonomy of disciplines that is used to classify publications into disciplinary fields (Leydesdorff et al. 2013; Porter and Rafols 2009; Rafols et al. 2012). While bottom–up approaches are suited for capturing emerging developments that do not fit into existing categories, the classification-based approach is useful for large-scale explorations, such as comparisons of areas of science using an extensive amount of data or the disciplinary breadth of research institutions. The latter approach is the focus of this paper.

The results of citation analyses are subject to the quality of bibliographic data in terms of completeness and accuracy. Well-established top–down methods used to analyze the number of disciplines cited by a publication or their degree of concentration, such as Shannon entropy (Shannon 1948) and the Herfindahl index (Rhoades 1993), are designed to be used with datasets with complete information, since they cannot acknowledge the degree of missing data. This is also the case for the Rao–Stirling diversity index, a more complete top–down index proposed by Porter et al. (2007) and Porter and Rafols (2009). Precise IDR measurement using these methods requires a bibliographic dataset with: (1) complete records of references, (2) a correct list of references for each publication, (3) accurate categorization of publications into disciplinary fields, and (4) the categorization of each reference into at least one discipline. The combination of such quality characteristics results in ground-truth bibliographic data, which is rarely attainable since no publication database provides adequate correctness and completeness with respect to both references and categorization into disciplinary fields.

Concerning references, verification mechanisms as discussed by van Raan (1996) are crucial to detect incomplete records of references and remove incorrect references in bibliographic sources, such as those encountered by Moed et al. (1995) and Chen et al. (2012). With regard to taxonomies of disciplines, their accuracy has been widely discussed in the literature without reaching consensus on an adequate one (National Research Council 2010; Rafols and Leydesdorff 2009). In spite of its weaknesses, the list of categories provided by WoS is the most widely used (Bensman and Leydesdorff 2009; Pudovkin and Garfield 2002). The exhaustive categorization of all references within a dataset into disciplinary fields remains an open issue that is under-discussed in the literature. Although the important consequences of missing data in bibliographic datasets have been acknowledged in the literature (Moed et al. 1985), to our knowledge the problem of uncategorized records in top–down IDR measurement has not been properly addressed. Some bibliometric studies minimize this problem by excluding uncategorized publications from the dataset. The use of the categories of WoS implies the exclusion of all publications other than journals indexed by WoS (i.e., proceedings papers, books, technical reports) (Bjurström and Polk 2011; Carley and Porter 2011; Chen et al. 2012). Other studies account for the percentage of uncategorized publications and compute the index on the categorized references (Rafols et al. 2012; Porter and Rafols 2009). These approaches do not take into account the potential diversity of the excluded or missing data; hence interdisciplinarity is underestimated.

A method that automates the assignment of disciplines was implemented by Ponomarev et al. (2013) in order to categorize authors into one of a small set of major research fields. It is based on aggregated information on the categories of the publications of the author and their references, for which disciplines are grouped into broad categories that relate to the research activity of the group of individuals. Disciplines unrelated to the research activity of the group of individuals are categorized as ‘others’. Therefore, it does not allow for the automatic assignment of specific categories loosely related to the selected major fields, which is needed to compute the Rao–Stirling index.

In the following, we propose a method that acknowledges missing data and determines the associated uncertainties (see “Method” section); its evaluation and discussion follow in the subsequent sections.

## Method

### Introduction

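The Rao–Stirling diversity index, referred to below as the integration score (Porter and Rafols 2009), is given by

\[I = 1 - \sum _{i,j} s_{ij} \, p_i \, p_j \qquad (1)\]

where \(p_i\) is the proportion of references citing the discipline *i* in a given paper. \(s_{ij}\) is a cosine measure of similarity between the disciplines *i* and *j*. The values \(s_{ij}\) form a matrix of similarities \(\mathbf {S}\) in which disciplines that are co-cited more often by the same paper are ‘closer’ than disciplines that are less frequently co-cited (Porter and Rafols 2009). This ensures low integration scores for publications citing very similar disciplines and high integration scores for publications citing very diverse disciplines. The integration score ranges from 0 to 1 (the metric can asymptotically approach this upper limit) and grows as variety, balance, and disparity increase.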
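The behaviour described above can be illustrated with a minimal sketch. It assumes the usual integration-score form \(I = 1 - \sum_{i,j} s_{ij} p_i p_j\); the function name `rao_stirling` and the 3×3 similarity matrix are made up for illustration and are not taken from the authors' implementation.

```python
def rao_stirling(counts, similarity):
    """Integration score I = 1 - sum_ij s_ij * p_i * p_j for discipline counts."""
    total = sum(counts)
    if total == 0:
        raise ValueError("at least one categorized reference is required")
    p = [c / total for c in counts]  # proportions of references per discipline
    n = len(p)
    return 1.0 - sum(similarity[i][j] * p[i] * p[j]
                     for i in range(n) for j in range(n))


# Hypothetical similarity matrix: disciplines 0 and 1 are close, 2 is distant.
S = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.2],
     [0.1, 0.2, 1.0]]

single = rao_stirling([5, 0, 0], S)   # all references in one discipline
similar = rao_stirling([5, 5, 0], S)  # two closely related disciplines
diverse = rao_stirling([5, 0, 5], S)  # two dissimilar disciplines
print(single, similar, diverse)
```

In this toy setting the score is lowest for the single-discipline paper and highest for the paper citing two dissimilar disciplines, which matches the qualitative description in the text.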

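Let \(\mathcal {D}\) be a document and \(\mathcal {T}\) a taxonomy of \({N_{\mathcal {T}}}\) disciplines. We denote by \(\mathbf {c}\) the vector of discipline counts, where \(c_i\) is the number of references of \(\mathcal {D}\) that belong to the *i*-th discipline of \(\mathcal {T}\). Note that a reference can already be interdisciplinary and belong to several disciplines. By denoting the number of references that are cited by \(\mathcal {D}\) with \({N_\text{ref}}\), we have for the 1-norm of \(\mathbf {c}\) that

\[{{||}\mathbf {c}{||}_{1}} = \sum _i c_i \ge {N_\text{ref}}.\]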

### Missing Data

Problems arise when the disciplines of one or more references are unknown. As a consequence, \(\mathbf {c}\) cannot be determined and \(I\) is not well defined. The common approach is to simply omit these references and compute the index on the references categorized with disciplines (Bjurström and Polk 2011; Carley and Porter 2011; Chen et al. 2012; Rafols et al. 2012; Porter and Rafols 2009). Depending on the counts \(\mathbf {c}\) obtained from the categorized references, as well as the number of uncategorized references, the uncertainty can widely vary. For a single uncategorized reference among dozens categorized, the effect would be minor, whereas in the converse case, the uncertainty spans nearly the whole range of the index, rendering the initial estimate meaningless.

To capture the effects of missing data, we will compute the range in which the Rao–Stirling diversity \(I\) can vary when the uncategorized references are assigned to (sensible) arbitrary disciplines. While this range could be determined by enumerating all possible assignments and computing \(I\) for each, such an approach is computationally infeasible as it suffers from combinatorial explosion, i.e., an uncategorized reference can be assigned to \({N_{\mathcal {T}}}\) disciplines in \(2^{N_{\mathcal {T}}}\) ways. Instead, we will formulate the search for an upper and lower bound on \(I\) as an optimization problem. In the following, we present its basic formulation and several subsequent refinements.
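To make the combinatorial growth concrete, the following sketch enumerates every assignment for a toy case. It is only an illustration of the brute-force idea that the text rules out for realistic sizes; the function names, the 3-discipline similarity matrix, and the assumed form \(I = 1 - \sum_{i,j} s_{ij} p_i p_j\) are illustrative assumptions.

```python
from itertools import combinations, product


def rao_stirling(counts, S):
    """Assumed integration-score form: I = 1 - sum_ij s_ij * p_i * p_j."""
    total = sum(counts)
    p = [c / total for c in counts]
    n = len(p)
    return 1.0 - sum(S[i][j] * p[i] * p[j] for i in range(n) for j in range(n))


def brute_force_interval(c, u, S):
    """Try every non-empty discipline subset for each of the u uncategorized references."""
    n = len(c)
    subsets = [s for r in range(1, n + 1) for s in combinations(range(n), r)]
    low, high = float("inf"), float("-inf")
    for assignment in product(subsets, repeat=u):
        counts = list(c)
        for subset in assignment:
            for i in subset:
                counts[i] += 1  # the reference counts once for each assigned discipline
        value = rao_stirling(counts, S)
        low, high = min(low, value), max(high, value)
    return low, high, len(subsets) ** u


# Hypothetical data: 4 categorized references in discipline 0, 2 uncategorized.
S = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
low, high, n_assignments = brute_force_interval([4, 0, 0], 2, S)
print(low, high, n_assignments)
```

With three disciplines and two uncategorized references there are already \(7^2 = 49\) assignments; the count grows as \((2^{N_{\mathcal{T}}} - 1)^u\), which is why enumeration is not an option for a full taxonomy.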

### Uncertainty Estimation

Let \(\mathbf {c}\) denote the discipline counts of those references of a document \(\mathcal {D}\) that are *categorized* into disciplinary fields. Furthermore, \(\mathcal {D}\) is referencing *u* *uncategorized* documents, i.e., documents for which we have no information on their respective disciplines. We now aim to compute new sets \(\mathbf {n}_-\) and \(\mathbf {n}_+\) of reference counts per discipline such that all uncategorized references are assigned to one or more disciplines. Our goal is to obtain the smallest (resp. largest) possible diversity index \(I_-\) (resp. \(I_+\)) when computed with these new counts. Formally, we can state this requirement as the optimization problem of Eq. 2.

*u* reassigned references. The last constraint indicates that we expect each uncategorized reference to be assigned to at least one discipline and at most \({N_{\mathcal {T}}}\) disciplines. The optimization problem can also be stated in terms of proportions \(\mathbf {p} = \mathbf {n} / {{||}\mathbf {n}{||}_{1}}\) (see Eq. 1), which removes the normalization in the quadratic term:
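The optimization problem described above can be sketched as follows. This is a hedged reconstruction from the prose, not necessarily the authors' exact statement: the per-discipline bound \(n_i \le c_i + u\) and the vector form of the objective are assumptions, with \(\mathbf{c}\) the categorized counts, \(\mathbf{n}\) the counts after reassignment, and \(u\) the number of uncategorized references.

```latex
\begin{aligned}
I_{-} = \min_{\mathbf{n}} \; I(\mathbf{n}), \qquad
I_{+} = \max_{\mathbf{n}} \; I(\mathbf{n}), \qquad
I(\mathbf{n}) &= 1 - \frac{\mathbf{n}\,\mathbf{S}\,\mathbf{n}^\intercal}{\|\mathbf{n}\|_1^{2}} \\
\text{subject to} \quad c_i \le n_i &\le c_i + u \quad \text{for all } i, \\
u \;\le\; \|\mathbf{n} - \mathbf{c}\|_1 &\le u \, N_{\mathcal{T}} .
\end{aligned}
```

In terms of proportions \(\mathbf{p} = \mathbf{n} / \|\mathbf{n}\|_1\), the objective becomes \(I(\mathbf{p}) = 1 - \mathbf{p}\,\mathbf{S}\,\mathbf{p}^\intercal\), a quadratic without the normalizing denominator.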

### Constraint refinement

The formulation above allows each uncategorized reference to be assigned to *all* disciplines. Since this is not a realistic scenario, we limit the number of disciplines that each uncategorized reference could belong to. If we assume that each uncategorized reference cannot cover more than *k* disciplines, we can represent this as an additional constraint in optimization problem Eq. 2:
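One way to express this cap, assuming the count-based form of the problem with categorized counts \(\mathbf{c}\), reassigned counts \(\mathbf{n}\), and \(u\) uncategorized references, is to tighten the upper bound on the total number of added counts. This aggregate form is an assumption made for illustration; it is implied by a per-reference limit of \(k\) disciplines but is not necessarily the exact constraint used by the authors.

```latex
u \;\le\; \|\mathbf{n} - \mathbf{c}\|_1 \;\le\; k \, u
```

For example, with \(u = 2\) uncategorized references and \(k = 4\), the reassignment adds at least 2 and at most 8 discipline counts in total.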

### Discipline pruning

A reassignment of an uncategorized reference to an arbitrary subset of disciplines can lead to highly improbable results even when the cardinality of the subset is bounded as described in “Constraint refinement” section. This arises naturally from the maximization of the Rao–Stirling diversity index in the aforementioned optimization problems. A concrete example could be a document in the field of *computer science* that exclusively cites previous works from its own discipline but has two uncategorized references. A possible reassignment that would significantly increase its diversity can be realized by assigning them to the unrelated disciplines of, for example, *zoology* and *Slavic literature*. While such an assignment is not invalid per se, it is nevertheless highly unlikely, and in this section we present a method to exclude such improbable disciplines.

A straightforward solution would be to eliminate all disciplines that are not already observed in the categorized references, i.e., to set the constraint \(n_i = 0\) (resp. \(p_i = 0\)), if \(c_i = 0\). The problem with this approach is that it does not allow for the introduction of new disciplines through the reassignment of uncategorized references, which would underestimate the achievable diversity significantly.

Instead, we take the mutual similarities of different disciplines into account by means of the similarity matrix \(\mathbf {S}\) as given in Eq. 1. If the categorized references are from closely related disciplines, we only permit very similar disciplines to participate in the reassignment procedure, whereas we allow a larger set of disciplines for categorized references belonging to a diverse set of disciplines.

The *discipline neighborhood* \(\mathcal {H}_i\) of a discipline \(\tau _i \in {\mathcal {T}}\) with index *i* is given by all those disciplines that have a similarity to \(\tau _i\) higher than a given value \(\Delta\), i.e., \(\mathcal {H}_i = \{\tau _j \in {\mathcal {T}} : s_{ij} > \Delta \}\).

The neighborhoods are chosen according to the following objectives:

- *Completeness*: Each neighborhood should contain at least two observed disciplines. This ensures that each neighborhood includes at least all disciplines that are more similar than the next most similar known discipline.
- *Cohesion*: The neighborhoods should form a single connected component to avoid having multiple disjoint discipline clusters. For documents with references in, for example, two dissimilar disciplines, an omission of this objective could lead to a set of permissible disciplines that are very similar to either of these two known disciplines without considering the disciplines in between them.
- *Conciseness*: The neighborhoods should be chosen in such a way as to yield the smallest possible set of permissible disciplines that fulfills the previous objectives. This keeps the upper bound of the uncertainty interval meaningful.
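The neighborhood idea can be sketched in a few lines. This is a minimal illustration under stated assumptions: a single global threshold `delta`, a permissible set formed as the union of the neighborhoods of all observed disciplines, and a made-up 4×4 similarity matrix. The function names are hypothetical, and the procedure that selects the threshold to meet the three objectives is not reproduced here.

```python
def discipline_neighborhood(i, S, delta):
    """Indices of disciplines whose similarity to discipline i exceeds delta."""
    return {j for j in range(len(S)) if S[i][j] > delta}


def permissible_disciplines(counts, S, delta):
    """Union of the neighborhoods of all observed disciplines (those with c_i > 0)."""
    allowed = set()
    for i, c in enumerate(counts):
        if c > 0:
            allowed |= discipline_neighborhood(i, S, delta)
    return allowed


# Hypothetical similarities: {0, 1} and {2, 3} form two tight clusters.
S = [[1.0, 0.8, 0.1, 0.0],
     [0.8, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.7],
     [0.0, 0.1, 0.7, 1.0]]

strict = permissible_disciplines([5, 0, 0, 0], S, delta=0.5)
loose = permissible_disciplines([5, 0, 0, 0], S, delta=0.05)
print(strict, loose)
```

A high threshold admits only the closest discipline as a reassignment target, while a low threshold lets more distant disciplines participate, which widens the resulting uncertainty interval.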

A *tolerance* parameter, which modulates the similarity values \(\Delta\) of Eq. 7, controls the strictness of the pruning. A tolerance of 0 would allow all disciplines to participate in the redistribution process (i.e., \({\mathcal {T}}_{\text {prune}}= \emptyset\)), while a value of 1 does not introduce any additional tolerance. Note that the corresponding constraints (see Eq. 6) effectively reduce the dimensionality of the optimization problem: it is possible to compute Eqs. 2 or 3 only on those discipline counts or proportions that are members of \({\mathcal {T}}_{\text {valid}}\). Details on the employed algorithms for these methods can be found in “Computational methods” section and our choice of the tolerance value is motivated in “Computation of the Rao–Stirling index and its uncertainty interval” section.

### Computational methods

In this section, we describe the computational methods used to compute the solutions of the optimization problems stated in Eqs. 2 or 3 while taking the constraints in Eqs. 4–6 into account. We choose different solution strategies for finding the reassignments with lowest possible diversity index \(I_-\) and highest possible diversity index \(I_+\). The need for different strategies lies in the nature of the similarity measure between different disciplines, given by the similarity matrix \(\mathbf {S}\); it has to be *positive semidefinite* to yield a non-negative diversity index for arbitrary discipline counts. The associated quadratic form \(\mathbf {c} \, \mathbf {S} \, \mathbf {c}^\intercal\) is thus a *convex* function in \(\mathbf {c}\), while \(- \mathbf {c} \, \mathbf {S} \, \mathbf {c}^\intercal\) is *concave*. Consequently, the Rao–Stirling diversity (see Eq. 1) is a concave function and its maximization (to obtain \(I_+\)) can be computed with the help of quadratic programming (Nocedal and Wright 2006). Note that the constraints in Eqs. 2–5 constitute linear functions, which can be incorporated into the computation as linear equality and inequality constraints and do not impact its polynomial runtime complexity (Kozlov et al. 1980).
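The concavity argument can be spelled out in a short derivation, assuming the proportion form \(I(\mathbf{p}) = 1 - \mathbf{p}\,\mathbf{S}\,\mathbf{p}^\intercal\) with a symmetric, positive semidefinite \(\mathbf{S}\) and \(\mathbf{p}\), \(\mathbf{v}\) written as row vectors:

```latex
\begin{aligned}
\nabla I(\mathbf{p}) &= -2\,\mathbf{p}\,\mathbf{S}, \\
\nabla^{2} I(\mathbf{p}) &= -2\,\mathbf{S}, \\
\mathbf{v}\,\bigl(-2\,\mathbf{S}\bigr)\,\mathbf{v}^\intercal &= -2\,\mathbf{v}\,\mathbf{S}\,\mathbf{v}^\intercal \;\le\; 0
\quad \text{for all } \mathbf{v}.
\end{aligned}
```

The Hessian is therefore negative semidefinite everywhere, so \(I\) is concave. Maximizing a concave quadratic over a polytope defined by linear constraints is a convex quadratic program, whereas its minimum over the same polytope is attained at a vertex.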

The minimization of a concave function has significantly worse complexity, and the computation of \(I_-\) is NP-hard (Pardalos and Vavasis 1991; Sahni 1974). However, we exploit the fact that the Rao–Stirling diversity is purely concave in the sense that all the eigenvalues of \(-\mathbf {S}\), the matrix of its quadratic term, are non-positive. It follows that all local minima lie on the vertices of the polytope that is bounded by the constraints of the optimization problems (Floudas and Visweswaran 1995). A search over all possible vertices yields the global minimum in exponential time, since the polytope for optimization problem Eq. 2 has \(2^{N_{\mathcal {T}}}\) vertices, where \({N_{\mathcal {T}}}\) denotes the number of disciplines, with \({N_{\mathcal {T}}}= 249\) in our case. Our constraint refinement of “Constraint refinement” section reduces the search space significantly and, apart from a more realistic uncertainty estimation, ensures the efficient computability of \(I_-\). Limiting the discipline reassignment to at most four disciplines (i.e., \(k = 4\)) limits the search space to only \(\sum _{i=1}^{4} \binom{N_{\mathcal {T}}}{i} \approx 1.6\times 10^{8}\) vertices, which can be explored exhaustively on commodity hardware. See “Computation of the Rao–Stirling index and its uncertainty interval” section for a discussion of the choice of \(k = 4\).
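The size of the restricted search space can be checked directly. The snippet below only evaluates the binomial sum quoted in the text for \(N_{\mathcal{T}} = 249\) and \(k = 4\); the variable names are illustrative.

```python
from math import comb

N_T = 249  # number of disciplines in the taxonomy, as stated in the text
k = 4      # maximum number of disciplines per uncategorized reference

# Number of ways to choose between 1 and k disciplines out of N_T.
search_space = sum(comb(N_T, i) for i in range(1, k + 1))
unrestricted = 2 ** N_T  # vertex count quoted for the unrestricted problem

print(f"{search_space:,}")  # 158,913,875
print(search_space < unrestricted)
```

The sum evaluates to roughly \(1.6 \times 10^{8}\), in line with the figure given above, and is smaller than \(2^{249}\) by many orders of magnitude.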

The discipline pruning and the corresponding maximal spanning tree have negligible computational overhead but reduce the dimensionality of the aforementioned minimization or maximization problem even further. The computation of \(I_-\) especially benefits from this approach. The spanning tree is computed with Prim’s algorithm (Prim 1957).
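A minimal sketch of Prim's algorithm adapted to a maximum spanning tree over similarities is given below. It assumes a dense, symmetric similarity matrix; the function name and the toy matrix are illustrative and not taken from the authors' implementation.

```python
import heapq


def maximum_spanning_tree(S):
    """Prim's algorithm on a dense similarity matrix, always taking the most similar edge."""
    n = len(S)
    in_tree = [False] * n
    in_tree[0] = True
    # heapq is a min-heap, so similarities are negated to pop the largest first.
    heap = [(-S[0][j], 0, j) for j in range(1, n)]
    heapq.heapify(heap)
    edges = []
    while heap and len(edges) < n - 1:
        neg_w, i, j = heapq.heappop(heap)
        if in_tree[j]:
            continue  # stale entry: j was connected through a better edge already
        in_tree[j] = True
        edges.append((i, j, -neg_w))
        for m in range(n):
            if not in_tree[m]:
                heapq.heappush(heap, (-S[j][m], j, m))
    return edges


# Hypothetical similarities between three disciplines.
S = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
tree = maximum_spanning_tree(S)
print(tree)
```

By the bottleneck property of maximum spanning trees, the smallest similarity on the tree is the largest threshold at which the graph of edges with similarity at or above that threshold is still connected, which is one way such a tree can relate to the cohesion objective.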

## Evaluation

The evaluation of the proposed method was conducted empirically. Following the framework for knowledge integration and diffusion suggested by Liu et al. (2012), the uncertainty intervals of the interdisciplinarity of the publications of a set of individuals were calculated. Ground-truth bibliographic data provided by the authors in personal interviews was used to evaluate the method. The results of our method computed with incomplete data from digital libraries were compared with the results of the Rao–Stirling index calculated with ground-truth data.

### Sample frame

The sample frame of this study consists of the publications of doctoral researchers in a Computer Science (CS) faculty of a highly ranked European university between 2009 and 2014. Doctoral researchers are usually the main authors of their publications and have a thorough knowledge of the literature they reference. We focus on CS because this field emerged as a result of integrating disciplines and it continues to be one of the most interdisciplinary fields because of its diverse applications. Moreover, CS is an ideal field to use in evaluating our method because gathering publication data with a high percentage of categorized references is especially challenging. While in other fields conferences serve as venues for community building and maintenance, in CS they focus on selectivity, quality and fast dissemination (needed in such a fast-evolving field), which drives down conference acceptance rates (Grudin 2011). Therefore, CS researchers target their publications at conferences, which are regarded as the primary means of publication in the field. Since conference publications are not associated with the taxonomy of disciplines of WoS, which we use in this analysis, a high number of uncategorized references is obtained.

### Data collection

In order to gather the most complete and accurate record of publications and their references, data was gathered from different sources. First, the publication database of the university was used to collect all the publications of doctoral students of the CS faculty published between 2009 and 2014. This database contains a very exhaustive list of publications authored by those affiliated to the university, as its records are used to compute the financial assignments to the different research groups. Because the publication database of the university does not keep records of references, in the next step we gathered more data from online bibliographic databases: (1) Scopus from Elsevier, which offers high coverage of articles; and (2) WoS from Thomson Reuters, which provides a comprehensive citation search and encompasses publications of multiple online databases, resulting in multidisciplinary coverage.

The association of publications with disciplinary fields was possible using the taxonomy of disciplines of WoS, called *Category Terms* (CTs). It contains 249 CTs and is built from a combination of subject matter expert judgments and inter-journal citation patterns that together serve to cluster journals into topical groupings. Since there is no consensus on a perfect taxonomy of disciplines, the one of WoS was selected because of its extensive use in the bibliometric analyses of previous related work, but other taxonomies could also be used. As a measure of similarity between CTs, we used the co-citation similarity matrix provided by Porter and Rafols (2009).

The combination of several databases increases the completeness of the record of references at the same time that it decreases the percentage of publications categorized with CTs—only journal publications indexed by WoS are categorized. Our dataset contains 1746 publications authored by 225 doctoral students. The extraction of references was possible for 1068 publications indexed by WoS or Scopus. The association of CTs to references was possible for 979 of the publications that had references indexed by WoS. A total of 12,243 references were extracted, of which 5310 are categorized with CTs.

### Computation of the Rao–Stirling index and its uncertainty interval

We calculated the Rao–Stirling index and the uncertainty interval of the 1068 publications for which the extraction of references was possible. The limit of discipline reassignment for the uncertainty interval was set to \({k=4}\). This value is at the 99th percentile of the number of CTs used by WoS to categorize the journals of our dataset. The tolerance was also set to the 99th percentile of similarity between CTs (\({t=0.233}\)) in order to incorporate a slight diversity into the pool of similar CTs to be used in the reassignment procedure.

*p* value \(<2.2 \times 10^{-16}\).

### Collection of ground-truth data

Each participant was provided with the following material:

- Digital copies of the author’s publication and all its references, which were gathered manually from digital libraries.
- A print-out of the taxonomy of CTs of WoS. In order to make the search of CTs easier for the participants, CTs were grouped into macro-disciplines.

During the interviews, we took care to:

- Explain the importance of providing objective data. Since interdisciplinary research has a good connotation, it was important to make our participants understand that they were not going to be evaluated in terms of interdisciplinarity. We asked them to provide us with the most objective data without exaggerating interdisciplinarity or single-disciplinarity.
- Make sure that participants became acquainted with the taxonomy of CTs, as none of the participants were familiar with it.
- Confirm that participants understood their task. Participants were asked to think out loud and explain their choice of CTs for verification purposes.
- Make sure that each participant followed the same criteria to categorize publications into disciplines.

### Comparative analysis

Estimated mean and standard deviation (SD) of the Rao–Stirling index of the 48 publications of the sample calculated with incomplete and completed data. These estimated values were calculated with a bootstrapped sample of 50,000 elements with replacement.

Rao–Stirling index | Estimated mean | SD |
---|---|---|
Incomplete data | 0.47495 | 0.03929 |
Completed data | 0.53862 | 0.03307 |
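Bootstrap estimates of this kind could be obtained along the following lines. This is a sketch under stated assumptions: "50,000 elements" is read as 50,000 resamples drawn with replacement from the per-publication scores, the function name is hypothetical, and the input values are made-up placeholders rather than the study's data.

```python
import random
import statistics


def bootstrap_mean(values, n_resamples=50_000, seed=0):
    """Return the bootstrap estimate of the mean and the SD of the resampled means."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.choices(values, k=len(values)))
             for _ in range(n_resamples)]
    return statistics.fmean(means), statistics.stdev(means)


# Placeholder scores for 48 publications (not the data of the study).
values = [0.2, 0.4, 0.6, 0.8] * 12
# Fewer resamples than in the text, to keep the example fast.
est_mean, est_sd = bootstrap_mean(values, n_resamples=2000, seed=1)
print(round(est_mean, 3), round(est_sd, 3))
```

The SD returned here is the spread of the resampled means, i.e., an estimate of the standard error of the mean, which is the natural reading of an SD reported next to a bootstrapped mean.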

Estimated mean, bias and standard deviation of the indices of the 48 publications of the sample: Rao–Stirling index with completed data (first row), Rao–Stirling with incomplete data (second row), the center of the uncertainty interval (third row), and the center of the uncertainty interval weighted according to its size (fourth row). These estimated values were calculated with a bootstrapped sample of 50,000 elements with replacement. A visual representation of these values can be observed in Fig. 5.

Diversity index | Estimated mean | Bias | SD |
---|---|---|---|
Rao–Stirling with completed data | 0.539 | −9.646 × 10 | 3.308 × 10 |
Rao–Stirling with incomplete data | 0.475 | 1.390 × 10 | 3.929 × 10 |
Center uncertainty interval | 0.569 | 2.869 × 10 | 2.964 × 10 |
Weighted center uncertainty interval | 0.558 | 1.342 × 10 | 3.266 × 10 |

## Discussion

The accuracy of citation-based IDR measurements heavily depends on the quality of the bibliographic data. The combination of data from several sources might help to enhance the quality of data but it certainly does not assure ground-truth bibliographic data. The dataset gathered for the evaluation of our methods is an example of an incomplete one, even though data from three different digital libraries was extracted and combined. Not all publications of our dataset have a complete record of references, and not all references are categorized with CTs. The Rao–Stirling index is incapable of taking both problems into account as it is not designed to handle missing data.

Our method tackles the problem of uncategorized references, extending the Rao–Stirling index to encode the uncertainty caused by missing data as an interval. A high degree of incompleteness in publications that are particularly interdisciplinary in nature may also result in underestimating the upper bound of the uncertainty interval. This is especially problematic when a publication has only one reference categorized with a single CT. Such a degree of incompleteness affects the rational redistribution of CTs needed to compute the upper endpoint of the uncertainty interval (see publication ID = 6 in Figs. 3 and 4). The main benefit of the uncertainty interval is that it acts as a confidence indicator of the results delivered by the Rao–Stirling index. On the one hand, publications with a low proportion of uncategorized references have correspondingly small uncertainty intervals, implying a more reliable measurement of the Rao–Stirling index. On the other hand, publications with a high proportion of uncategorized references have correspondingly large uncertainty intervals, indicating an unreliable measurement of the Rao–Stirling index. This finding underlines the importance of selecting publications with a proportion of categorized references above a threshold value when computing an index of interdisciplinarity, as in the analysis of Rafols et al. (2012).

The empirical evaluation of our method confirms that the acknowledgment of missing data delivers a more accurate aggregated IDR measurement than the Rao–Stirling index. Our contribution constitutes a first approach to measure IDR taking into account the inaccuracy of the bibliographic data, but other problems still affect the results of the Rao–Stirling and other IDR indices. Future analysis to evaluate this method should be conducted using other taxonomies of disciplines. Further work would be needed in order to tackle the problem of incomplete and incorrect records of references, as well as incorrect categorization of publications into disciplinary fields. Additional issues to consider are the use of a precise taxonomy of disciplines and similarity matrix. Therefore, further avenues of research towards more precise IDR indicators remain open. To aid these efforts, we are providing the source code for our implementation of the uncertainty computation to the community, which can be found at https://gitlab.com/mc.calatrava.moreno/robustrao.git.

## Acknowledgments

The authors wish to thank the 48 doctoral researchers who agreed to participate in this study and generously shared their time to be interviewed.

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.