On the uncertainty of interdisciplinarity measurements due to incomplete bibliographic data

The accuracy of interdisciplinarity measurements is directly related to the quality of the underlying bibliographic data. Existing indicators of interdisciplinarity are not capable of reflecting the inaccuracies introduced by incorrect and incomplete records because correct and complete bibliographic data can rarely be obtained. This is the case for the Rao–Stirling index, which cannot handle references that are not categorized into disciplinary fields. We introduce a method that addresses this problem. It extends the Rao–Stirling index to acknowledge missing data by calculating its interval of uncertainty using computational optimization. The evaluation of our method indicates that the uncertainty interval is not only useful for estimating the inaccuracy of interdisciplinarity measurements, but it also delivers slightly more accurate aggregated interdisciplinarity measurements than the Rao–Stirling index.

Electronic supplementary material: The online version of this article (doi:10.1007/s11192-016-1842-4) contains supplementary material, which is available to authorized users.


Introduction
Since the similarity matrix S that is used in the Rao–Stirling diversity index is positive semidefinite, the index, given by

$$I(c) = \sum_{i,j} (1 - s_{ij})\, p_i p_j = 1 - p^\top S p, \qquad p = \frac{c}{N}, \quad N = \sum_i c_i,$$

is a concave function of c for a fixed total number of references N. As described in Section 3, the minimization of this function is used to compute the lower bound of the uncertainty interval. Because of this concavity, the minima lie on the vertices of the polytope that is spanned by the constraints on c (Floudas and Visweswaran 1995).
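The concavity argument can be illustrated numerically. The following is a minimal sketch, assuming the standard form of the index, 1 - p^T S p with p the vector of reference shares; the function name `rao_stirling`, the matrix `S`, and the count vectors `a` and `b` are hypothetical and chosen only for illustration.

```python
def rao_stirling(counts, similarity):
    """Rao-Stirling index 1 - p^T S p for reference counts and a similarity matrix.

    Assumes the standard definition with shares p_i = c_i / sum(c).
    """
    total = sum(counts)
    shares = [c / total for c in counts]
    n = len(shares)
    quad = sum(
        similarity[i][j] * shares[i] * shares[j]
        for i in range(n)
        for j in range(n)
    )
    return 1.0 - quad


# Hypothetical similarity matrix: symmetric and positive semidefinite.
S = [
    [1.0, 0.6, 0.0],
    [0.6, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Two count vectors with the same total, i.e. two points of the same polytope.
a = [4, 2, 3]
b = [2, 2, 5]
mid = [(x + y) / 2 for x, y in zip(a, b)]

# Concavity: the midpoint value is at least the average of the endpoint values.
value_mid = rao_stirling(mid, S)
value_avg = 0.5 * (rao_stirling(a, S) + rao_stirling(b, S))
print(value_mid >= value_avg)
```

A single midpoint check does not prove concavity; it only shows the inequality that the positive semidefiniteness of S guarantees along any segment with a fixed total.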

Example Setting
We provide a simplified setting to illustrate the complexity of the minimization problem. If we assume that only three disciplines exist, the similarity matrix can be given as

$$S = \begin{pmatrix} 1 & \alpha & 0 \\ \alpha & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix},$$

where α describes the similarity between the first and second disciplines. We assume that the third discipline is completely dissimilar to the first and second disciplines.
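The positive semidefiniteness required for the concavity argument can be checked directly in this setting. The following is a short worked check, assuming the matrix has a unit diagonal, the entry α coupling the first two disciplines, and zeros elsewhere, as the description above implies.

```latex
% Characteristic polynomial of the assumed 3x3 similarity matrix:
\det(S - t I) = (1 - t)\bigl[(1 - t)^2 - \alpha^2\bigr]
% Eigenvalues:
t_1 = 1 + \alpha, \qquad t_2 = 1 - \alpha, \qquad t_3 = 1
% All eigenvalues are non-negative exactly when -1 \le \alpha \le 1,
% so S is positive semidefinite for every similarity 0 \le \alpha \le 1.
```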
A hypothetical document with both categorized and uncategorized references serves as the basis for this example. We assume the following categorized references in the three disciplines:

$$c = (2,\; 2,\; c_3)^\top,$$

where the first two disciplines are cited by two references each. The number of references that cite the third discipline is given by c3. Finally, the number of uncategorized references is given by u. In total, 2 + 2 + c3 + u references are present in this example, of which u are uncategorized.

Example Minimization
The diversity index of our setting is thus given as

$$I(\lambda, \mu) = 1 - \frac{(2 + \lambda u)^2 + (2 + \mu u)^2 + \bigl(c_3 + (1 - \lambda - \mu)\,u\bigr)^2 + 2\alpha\,(2 + \lambda u)(2 + \mu u)}{(4 + c_3 + u)^2},$$

where 0 ≤ λ ≤ 1 (resp. 0 ≤ μ ≤ 1 − λ) determines the extent to which the uncategorized references are redistributed to the first (resp. second) discipline. The remainder is redistributed to the third discipline. Since the minimum of the index is attained at the vertices of the constraint polytope, it can only be located at one of three extremal reference redistributions I1, I2, and I3. For these, all the uncategorized references are redistributed to either the first, second, or third discipline. The diversity index of these three extremal cases is given as

$$\begin{aligned}
I_1 &= 1 - \frac{(2 + u)^2 + 2^2 + c_3^2 + 4\alpha\,(2 + u)}{(4 + c_3 + u)^2}, \\
I_2 &= 1 - \frac{2^2 + (2 + u)^2 + c_3^2 + 4\alpha\,(2 + u)}{(4 + c_3 + u)^2}, \\
I_3 &= 1 - \frac{2^2 + 2^2 + (c_3 + u)^2 + 8\alpha}{(4 + c_3 + u)^2}.
\end{aligned}$$

For a fixed similarity between the first and second discipline of α = 0.6, these quantities can be plotted as surfaces over the number of uncategorized references u and the number of references c3 that cite the third discipline. To clarify the exposition, we fix u = 2 for the following discussion. Note that the same conclusion is valid for any u > 1. Depending on the value of c3, the minimum can be attained by redistributing all references to either the first or third discipline. In the case of c3 = 3, the minimum can be achieved by redistributing the uncategorized references to the first discipline (i.e., I1 < I3 for c3 = 3). For c3 = 4, a redistribution to the third discipline yields the minimal index (i.e., I3 < I1 for c3 = 4).
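The switch between the two cases can be made explicit with a short worked derivation. It is a sketch under the assumption that the index has the standard form I = 1 - Q / N^2, where Q = c^T S c and N = 4 + c_3 + u, with the three-discipline similarity matrix of the example setting; the symbols Q_1 and Q_3 are introduced here for illustration only.

```latex
% Quadratic forms of the two relevant extremal redistributions:
Q_1 = (2 + u)^2 + 2^2 + c_3^2 + 4\alpha\,(2 + u), \qquad
Q_3 = 2^2 + 2^2 + (c_3 + u)^2 + 8\alpha
% Their difference simplifies to
Q_1 - Q_3 = 2u\,(2 + 2\alpha - c_3)
% and therefore
I_1 - I_3 = -\frac{Q_1 - Q_3}{N^2} = \frac{2u\,(c_3 - 2 - 2\alpha)}{(4 + c_3 + u)^2}
% For u > 0: I_1 < I_3 if and only if c_3 < 2 + 2\alpha.
% With \alpha = 0.6 the threshold is c_3 = 3.2, which lies between c_3 = 3 and c_3 = 4.
%
% Numerical values for u = 2:
% c_3 = 3: I_1 = 1 - 38.6/81  \approx 0.5235, \quad I_3 = 1 - 37.8/81 \approx 0.5333
% c_3 = 4: I_1 = 1 - 45.6/100 = 0.544,        \quad I_3 = 1 - 48.8/100 = 0.512
```

Under these assumptions the sign of I_1 - I_3 does not depend on u as long as u > 0, which is consistent with the statement that the conclusion holds for other values of u.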

Conclusion
Thus, it cannot be assumed that simply adding all uncategorized references to the discipline with the highest number of categorized references would yield the minimal index. In both aforementioned cases, the third discipline has the highest number of categorized references (c3 = 3, 4 > 2).
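The comparison can be reproduced with a few lines of code. The following is a minimal sketch, not the optimization procedure of the paper: it assumes the standard form of the index, 1 - c^T S c / N^2, and the simple constraint set of this example, in which every uncategorized reference may go to any discipline, so that the vertices are the three "all to one discipline" redistributions. The function names are hypothetical.

```python
def rao_stirling(counts, similarity):
    """Rao-Stirling index 1 - c^T S c / N^2 (assumed standard form)."""
    total = sum(counts)
    n = len(counts)
    quad = sum(
        similarity[i][j] * counts[i] * counts[j]
        for i in range(n)
        for j in range(n)
    )
    return 1.0 - quad / total ** 2


def lower_bound_by_vertices(categorized, uncategorized, similarity):
    """Evaluate every vertex: all uncategorized references go to one discipline."""
    best = None
    for k in range(len(categorized)):
        counts = list(categorized)
        counts[k] += uncategorized
        value = rao_stirling(counts, similarity)
        if best is None or value < best[0]:
            best = (value, k)
    return best  # (minimal index, zero-based discipline position)


def greedy_largest(categorized, uncategorized, similarity):
    """Heuristic under test: add everything to the most-cited discipline."""
    k = max(range(len(categorized)), key=lambda i: categorized[i])
    counts = list(categorized)
    counts[k] += uncategorized
    return rao_stirling(counts, similarity), k


alpha, u = 0.6, 2
S = [
    [1.0, alpha, 0.0],
    [alpha, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

results = {}
for c3 in (3, 4):
    exact = lower_bound_by_vertices([2, 2, c3], u, S)
    greedy = greedy_largest([2, 2, c3], u, S)
    results[c3] = (exact, greedy)
    print(f"c3={c3}: vertex minimum={exact}, greedy={greedy}")
```

Enumerating vertices is only feasible here because the example has three of them; with more disciplines and constraints on which references may go where, the number of vertices grows quickly.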