Limiting sample size
Because of the large number of sampling iterations, the average of the mean CNCIs from the samples was similar to the population CNCI for all institutions at all sample sizes. The distribution of means typically approached a normal distribution, and its skew, which even at small sample sizes was much less than the source population skew, decreased rapidly as sample size grew. Some institutions had a bimodal distribution, which is discussed separately below.
The statistic of interest is not the average value of a large sample and its departure from the population mean but the variance in the sample means (Fig. 1).
The variance associated with small sample sizes is very high (Fig. 1). The range of variances is correlated with the average CNCI of the institution, which in turn derives from the distribution of individual paper CNCIs. No institution has a uniformly highly cited set of papers, but the spread (kurtosis) of individual paper CNCI values is greater where the institution’s average CNCI is higher. Helsinki and Edinburgh have two of the three highest average CNCIs and relatively platykurtic (though skewed) distributions with a wide range of individual paper CNCIs from which samples may be drawn. UNAM has a low average CNCI and a clustered (more leptokurtic) range of paper CNCIs because it has relatively few highly cited papers.
The variance was greater than 1 for seven institutions, and over 0.5 for all ten, at a sample size of 20 papers. It dropped to a range of up to around 1.0 at a sample size of 50, and to 0.5 or less at a sample size of 100. Each step up through the chosen range of sample sizes broadly halves the variance (Fig. 1).
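This halving pattern follows from the familiar \(\sigma^{2}/n\) scaling of the variance of sample means. A minimal simulation sketch illustrates it; note that the lognormal population below is an assumed stand-in for a skewed CNCI distribution, not the actual institutional data:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed stand-in population: lognormal values (skewed, like real
# citation data, but NOT the actual institutional dataset).
population = rng.lognormal(mean=0.0, sigma=1.2, size=10_000)

def variance_of_sample_means(pop, sample_size, iterations=2_000):
    """Estimate the variance of the mean over repeated random samples."""
    means = [rng.choice(pop, size=sample_size, replace=False).mean()
             for _ in range(iterations)]
    return float(np.var(means))

# Variance of sample means at each of the sample sizes used in the text.
sample_mean_variance = {n: variance_of_sample_means(population, n)
                        for n in (20, 50, 100, 200)}
```

With a few thousand iterations per sample size, the estimated variances fall roughly in proportion to \(1/n\), mirroring the step-wise halving seen in Fig. 1.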
At what point on this spectrum do the ranges of sample CNCI values broadly overlap, and at what point does the variance drop far enough that the sample value approaches the true CNCI closely enough for the universities to be distinguished more accurately?
In Fig. 2, the population average CNCI values for the full dataset of papers for the ten institutions are shown with an indicator of the magnitude of the standard deviation (which should cover slightly more than two-thirds of the datapoints) at each of three sample sizes. It is evident that a sample size of 50 produces a relatively high probability of indistinguishable results; in this scenario, the ranking of institutions by CNCI could vary considerably.
Even with sample sizes of 200 there is an appreciable likelihood of misinterpretation. If we consider Tel Aviv University, with an average CNCI near the middle of our institutional set, we can see that a spread of other institutional means, from Yonsei to Zurich, lies within the range of one standard deviation. In fact, the ranges of all the institutional standard deviations still overlap, except for the institutions with the two lowest and three highest average CNCIs. The institutions in the middle range of mean CNCIs are effectively indistinguishable at this level of sampling.
Selecting highly cited papers
A naive expectation, when the highest-impact papers (by CNCI) are selected from each DAIS-ID cluster, would be that removing low-cited papers should both raise the institution’s average CNCI and reduce its variance. At the aggregated institutional level, however, reality does not match this, because of the variance between DAIS-IDs, some of which are mostly highly cited and some of which are mostly low cited, particularly in the social sciences and arts outside North America. We therefore reduced the dataset and took the paper with the highest CNCI from each cluster, as described in Methods.
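The cluster reduction itself is a simple per-group maximum; the sketch below shows one way it might be done (the field names and values are hypothetical, not the actual data schema):

```python
# Hypothetical paper records: "dais_id" and "cnci" are assumed field
# names for illustration only.
papers = [
    {"dais_id": "A", "cnci": 0.4},
    {"dais_id": "A", "cnci": 2.1},
    {"dais_id": "B", "cnci": 0.9},
    {"dais_id": "B", "cnci": 5.3},
    {"dais_id": "B", "cnci": 1.2},
]

def top_paper_per_cluster(papers):
    """Keep only the highest-CNCI paper from each DAIS-ID cluster."""
    best = {}
    for p in papers:
        k = p["dais_id"]
        if k not in best or p["cnci"] > best[k]["cnci"]:
            best[k] = p
    return list(best.values())

top_papers = top_paper_per_cluster(papers)
# One paper per cluster: CNCIs 2.1 (cluster A) and 5.3 (cluster B)
```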
For information, the overall institutional distributions of the subset of researchers’ ‘top’ CNCI papers for DAIS-IDs with four or more papers were plotted for each institution (Appendix: Fig. 7). The spread of most impactful papers (in terms of CNCI) for the set of researchers at these institutions is skewed in a similar way to the overall CNCI distribution. It is interesting, however, to note the similarity of distributions between many institutions, with modal CNCI values around 1–2 times world average and a tail extending to 4–8 times world average. Indeed, institutional differences in this tail may be a principal differentiator (Glänzel 2013).
There is a general agreement in the scientometric literature that, on average, there is a broad relationship between average CNCI values and other quantitative (research income) and qualitative (peer review) indicators of research performance (reviewed in Waltman 2016). Figure 7 (in Appendix) therefore seems to suggest that the researcher population at each institution is made up of a very large platform of common-run individuals (sensu Glänzel and Moed 2013) whose most highly cited papers are a little above world average and a right-skewed tail of high-end researchers whose papers are much more highly cited for their field and year. The relative distribution of the mainstream and the talented must then influence the net institutional outcome.
Because the population is so skewed, the standard deviation, and hence the error in the sample means, also increases. The main driver is the residual skewed distribution: although some low-cited papers have been removed, plenty of others remain. Removing low-cited papers raises the mean, the remaining low-citation papers consequently lie further from it, and the standard deviation therefore grows. Although the average CNCI of the highest-CNCI papers in each DAIS-ID cluster is correlated with, and about 2.5 to 3 times, the overall mean CNCI for each institution (Fig. 3), the means are statistically indistinguishable for the distributions of researchers’ highest-CNCI papers (Appendix: Fig. 8).
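The effect is easy to reproduce arithmetically. In the toy example below (illustrative CNCI values, not real data), trimming the lowest-cited papers raises the mean, and the standard deviation rises with it:

```python
import statistics as st

# Toy CNCI values (illustrative only): many low-cited papers plus a
# skewed highly cited tail.
cncis = [0.1, 0.2, 0.2, 0.3, 0.5, 0.8, 1.0, 2.0, 5.0, 12.0]

# Trim the lowest-cited papers, as the 'top' paper selection does.
trimmed = [c for c in cncis if c >= 0.3]

mean_all, sd_all = st.mean(cncis), st.pstdev(cncis)
mean_top, sd_top = st.mean(trimmed), st.pstdev(trimmed)
# The trimmed mean is higher, and so is the standard deviation: the
# surviving low-cited papers now sit further from the (raised) mean.
```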
There were about 3000 papers (range: 2392 for Moscow to 4195 for Nanjing) in the ‘top’ papers dataset for each institution. Sampling this dataset, using 10,000 iterations of 1000 papers each, produces the aforementioned increase in the average CNCI for each institution, since many low-cited papers have been removed. The distributions of sample means are shown in Fig. 4. Although the underlying distribution remains very skewed (Appendix: Fig. 7), the distribution of the sample means is much narrower and again approaches normality.
UNAM’s distribution in Fig. 4 has a double peak and is clearly not normal. Lomonosov Moscow University may also have an emergent second peak. Further investigation (below) was carried out to explore the source of this anomaly.
The distributions of sample means are relatively discrete (Fig. 5) and provide a better level of discrimination than did the samples of 200 papers from the full population (Fig. 2). As noted earlier, given that the modal peaks of highest CNCI values are similar across these institutions (Appendix: Fig. 7), the differentiating factor that separates the much tighter peak values must be the relative frequency of higher ‘top’ CNCI values (see Glänzel 2013). Thus, in research assessment exercises where selectivity is supported, the ability to select such material will be of critical significance in determining the outcomes.
It is feasible to analyse the data by direct analysis of indicative author names, but this produces no substantive difference in the results.
Bi-modal distributions
While the distribution of sample means (for the full institutional data and for the ‘top’ papers data) was typically normal, some of the institutions had a double peak in the distribution of their sample means, particularly for larger sample sizes. This was investigated by progressive sampling of the UNAM data (which has the most evident bimodality) with a greater number of sample size intervals from very small (20 papers) to comprehensive (2000 papers) samples from the ‘top’ paper dataset of 2714 papers for the 5-year period.
Figure 6 shows a plot of the distributions resulting from this spread of varying sample sizes using 10,000 samples at each interval. The horizontal axis shows the range of average CNCI for the samples and was set to a maximum of 10 times world average since a valid institutional average greater than this would be extremely unlikely. The distribution appears unimodal with a very small sample size of 20 because the right-hand modal peak is in fact above a CNCI of 20 and is thus out-of-frame. As the sample size is increased to 50 it just begins to come into view on the right of the plot. As the eye progresses through increasing sample sizes it is evident that this peak grows in frequency and moves leftward.
Why does this happen? It is a consequence of one UNAM paper being particularly highly cited compared with the rest of the institution’s output. Samples that included this paper would, of course, have a distinctly higher mean CNCI, and the probability that this paper is included in a sample is a simple function of the sample size. Samples with and without this paper separately approached normal distributions, but combined they produce a double peak. The most highly cited UNAM paper has a CNCI of ~ 418; its second most highly cited paper has a CNCI of ~ 91, followed by progressively closer CNCIs of 73, 70 and 54. The peak on the right in Fig. 6 represents the samples that include that highly cited paper, whereas the peak on the left denotes the samples where it was not captured. The peak on the right should be centred around \(\left( {418 - \bar{x}} \right)/n\) to the right of the left peak, where \(\bar{x}\) is the mean of the left peak (around 2) and \(n\) is the sample size. For a sample size of 1000, the right-hand peak would therefore lie 0.416 to the right of the other peak, while the standard deviation of either peak is 0.22; and, with 2714 papers in the top-cited papers dataset, the right-hand peak should have a height of \(n/\left( {2714 - n} \right)\) relative to the left peak (i.e. the odds that a sample of \(n\) randomly chosen papers includes that top paper against the odds that it does not).
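The arithmetic behind this peak geometry can be written out directly, using the figures quoted above (top paper CNCI ~ 418, left-peak mean ~ 2, 2714 papers, samples of 1000):

```python
# Peak geometry for a bimodal sample-mean distribution driven by one
# exceptional paper, using the values quoted in the text.
top_cnci = 418.0       # CNCI of the single exceptional paper
left_peak_mean = 2.0   # mean of samples that miss that paper
dataset_size = 2714    # papers in the top-cited dataset
sample_size = 1000     # papers drawn per sample

# Capturing the top paper shifts a sample's mean by (418 - x_bar)/n.
offset = (top_cnci - left_peak_mean) / sample_size

# Relative height of the right peak: the odds that a random sample of
# n papers includes the top paper versus excludes it, n / (N - n).
height_ratio = sample_size / (dataset_size - sample_size)
```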