We react to a study by Waltman et al. (2011a), entitled “Towards a new crown indicator: An empirical analysis.” The authors go to great lengths to show that a change in the normalization, made in reaction to our previous critique of the Leiden “crown” indicators (Opthof and Leydesdorff 2010), did not significantly affect the rankings at various levels of aggregation. Since the Center for Science and Technology Studies (CWTS) data under discussion were not publicly available, let us use a previous occasion on which Van Raan (2006) revealed some of the micro-data underlying the evaluations, in the case of 147 research groups in chemistry. The defense at that time was triggered by the introduction of the h-index by Hirsch (2005). How did the Leiden “crown” indicators perform in comparison with the h-index? Unlike these citation indicators, the h-index is sensitive to the number of publications for which citation rates are compared. Decomposition of the aggregated data allows for distinguishing mechanisms, for example, the variance “within groups” versus “between groups.”
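Such a decomposition splits the total variation in citation counts into a within-group and a between-group component. The following minimal sketch in Python illustrates the identity on which this rests; the group labels and citation counts are hypothetical and not taken from the CWTS data.

# Sketch: decomposition of the total variation in citation counts into a
# within-group and a between-group component (the counts are hypothetical,
# for illustration only; they are not taken from the CWTS data).
import numpy as np

groups = {
    "group_A": np.array([2, 5, 7, 11, 30]),  # hypothetical citation counts
    "group_B": np.array([0, 1, 1, 3, 4]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Within-group sum of squares: deviations of papers from their group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())
# Between-group sum of squares: deviations of group means from the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups.values())
ss_total = ((all_values - grand_mean) ** 2).sum()

assert np.isclose(ss_total, ss_within + ss_between)
print(f"within: {ss_within:.1f}  between: {ss_between:.1f}  total: {ss_total:.1f}")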

Since Narin (1976) suggested the use of bibliometrics for evaluative purposes, semi-industrial centers have sprung up, either connected to academia (such as in Budapest, Leiden, Leuven, Beijing, Shanghai, etc.) or as independent commercial enterprises (e.g., Science-Metrix in Montreal). Two major companies (Thomson Reuters and Elsevier) are also active in this market. In other words, citation analysis has become an industry. Intellectual property of the data and of the results of the analysis has become a major asset in this (quasi-)industry. Although contractors sometimes state that the results are freely available to the users, the licenses of the data (the Science Citation Index) often do not permit publishing the results freely so that the scientists under study would be able to check these evaluations themselves (cf. Opthof and Leydesdorff 2010). This practice of secrecy tends to shield the evaluations against the criticism that has been voiced against the use of citation analysis for evaluative purposes (Leydesdorff 2008; MacRoberts and MacRoberts 1987, 1996, 2010).

The invention of the h-index as a new statistic in 2005 (Hirsch 2005), however, challenged the leading researcher of CWTS (Van Raan 2006) to test whether this new indicator correlated with the “crown” indicators of scientometric evaluation in use at CWTS: citations per paper divided by the mean field citation score (CPP/FCSm) and divided by the mean journal citation score (CPP/JCSm) (Schubert and Braun 1986; Vinkler 1986; Moed et al. 1995). These latter indicators have been used extensively for such purposes as the Leiden Rankings of universities, research evaluation at the institutional level, and science-policy advice at national and international (e.g., EU) levels (e.g., Moed 2005). Vinkler (1996) considered this indicator, which he denoted as RW, the most appropriate one for the evaluation.
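For reference, the h-index is the largest number h such that h of a researcher’s (or group’s) publications have each been cited at least h times. The following minimal sketch (with made-up citation counts) shows the computation and why the index is bounded by the number of publications, in contrast to the CWTS ratios.

# Sketch: computing the h-index from a list of citation counts
# (the counts below are made up, for illustration only).
def h_index(citations):
    """Largest h such that h papers have been cited at least h times each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

print(h_index([25, 8, 5, 3, 3, 1, 0]))  # -> 3
# The h-index can never exceed the number of publications, whereas the
# CWTS indicators CPP/FCSm and CPP/JCSm are ratios of citation rates.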

The CWTS study (VSNU 2002) was based on more than 18,000 publications of 147 research groups in chemistry and chemical engineering in the Netherlands for the years 1991–1998. A subset of these data was secondarily analyzed by Van Raan (2006). In addition to the citation indicators, the research groups under study were peer-reviewed on their quality using a five-point scale. All fields within chemistry were covered by this set of university groups. The author notes that the various specialties exhibit different citation characteristics and that field-normalization is therefore essential (cf. Leydesdorff and Opthof 2010, 2011). CPP/FCSm normalizes the number of citations per paper (CPP) by the mean field citation score (FCSm), where a “field” is defined as a set of journals sharing a field code of the ISI Subject Categories. Analogously, CPP/JCSm normalizes by the mean citation scores of the individual journals (Schubert and Braun 1986; Vinkler 1986; Waltman et al. 2011b).
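In formula form (our notation, not taken from the cited sources): with c_i the number of citations of publication i (i = 1, …, n), f_i the mean citation rate of the field(s) to which publication i belongs, and j_i the mean citation rate of its journal, both indicators are quotients of two averages:

\[
\mathrm{CPP/FCSm} = \frac{\frac{1}{n}\sum_{i=1}^{n} c_i}{\frac{1}{n}\sum_{i=1}^{n} f_i} = \frac{\sum_{i} c_i}{\sum_{i} f_i},
\qquad
\mathrm{CPP/JCSm} = \frac{\frac{1}{n}\sum_{i=1}^{n} c_i}{\frac{1}{n}\sum_{i=1}^{n} j_i}.
\]

In other words, a quotient of means rather than a mean of quotients; the change of normalization referred to above concerns the alternative of averaging the quotients c_i/f_i instead.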

Van Raan (2006, p. 495) provided the data reproduced in Table 1.

Table 1 Example of the results of the bibliometric analysis for the chemistry groups

Table 1 shows the results for 12 research groups at one university which published 1,327 papers during this period, obtaining a total of 17,566 citations. The bibliometric indicators, the h-index, and the peer ratings are provided. In the latter, “5” indicates “excellent,” “4” means “good,” and “3” is classified as “satisfactory.” Ratings below “3” are not considered “satisfactory,” but such low ratings did not occur in this set of data.

Table 2 shows the Pearson correlations (r) in the lower triangle and the Spearman rank correlations (ρ) in the upper triangle. As noted (cf. Van Raan 2006, p. 499), the h-index is also dependent on the number of publications, whereas the CWTS indicators are not. As could be expected, the two CWTS indicators are highly correlated with each other (r = 0.783; p < 0.01). However, the quality parameter Q is uncorrelated with any of these scientometric indicators. Thus, we may conclude that the indicators are not validated by this study, despite the author’s claim to the contrary.

Table 2 Pearson correlations (lower triangle) and Spearman rank correlations (upper triangle) among three citation indicators and one peer-review-based quality indicator
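The reported coefficients can be recomputed with standard routines once the columns of Table 1 are available. A minimal sketch in Python follows; the arrays are placeholders for two indicator columns, not the actual values of Table 1.

# Sketch: Pearson (r) and Spearman (rho) correlations between two indicator
# columns; x and y are placeholders, not the actual values of Table 1.
import numpy as np
from scipy import stats

x = np.array([3.2, 1.8, 2.5, 1.1, 2.9, 1.5])  # e.g., a CPP/FCSm column (hypothetical)
y = np.array([2.8, 1.6, 2.7, 1.0, 2.4, 1.9])  # e.g., a CPP/JCSm column (hypothetical)

r, p_r = stats.pearsonr(x, y)        # parametric, on the values
rho, p_rho = stats.spearmanr(x, y)   # non-parametric, on the ranks

print(f"Pearson r = {r:.3f} (p = {p_r:.3f}); Spearman rho = {rho:.3f} (p = {p_rho:.3f})")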

Figure 1 shows the discriminating power of the h-index and the two CWTS indicators (CPP/JCSm and CPP/FCSm) using the set provided in Table 1. We added error bars in order to show that the differences are contained within the margins of the standard errors of the measurement. Thus, none of the citation-based indicators is able to discriminate between the categories “good” and “excellent” that were distinguished in the peer review.

Fig. 1 Discrimination between “good” and “excellent” research using the h-index and the Leiden indicators CPP/JCSm and CPP/FCSm in the case of Table 1
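The error bars in Fig. 1 represent the standard errors of the group means per peer-rating category. The comparison can be sketched as follows; the indicator values used here are hypothetical, since we do not reproduce the full columns of Table 1.

# Sketch: group means with standard errors for the peer ratings "good" (4)
# and "excellent" (5); the indicator values below are hypothetical.
import numpy as np
from scipy import stats

good = np.array([1.2, 1.9, 2.4, 1.6, 2.1])       # e.g., CPP/FCSm of groups rated 4
excellent = np.array([1.5, 2.6, 1.8, 2.9, 2.2])  # e.g., CPP/FCSm of groups rated 5

for label, values in (("good", good), ("excellent", excellent)):
    print(f"{label}: mean = {values.mean():.2f}, SE = {stats.sem(values):.2f}")

# If the intervals mean +/- SE overlap, the indicator does not discriminate
# between the two peer-rating categories.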

In his Table 2, Van Raan (2006, p. 500) also provided aggregated data for the full set of 147 research groups. In this table, the association between Q and h is significant (using χ²; p < 0.05), but not the association between Q and CPP/FCSm when testing Q = 4 against Q = 5 (χ² = 4.211; df = 2; p = 0.112). Thus, even at this aggregated level (N = 147), these results confirm the previous conclusion of Bornmann et al. (2010; cf. Van den Besselaar and Leydesdorff 2009) that peer-review systems and citation analysis are able to distinguish the tails of the distributions (low quality) from the high end of the set, but perform poorly in distinguishing between excellent and good research, to the extent that the correlation between evaluations based on these scientometric indicators and those based on peer review can even be negative (cf. Neufeld and von Ins 2011).
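For readers who wish to redo this test: it is a chi-squared test of association on a contingency table of peer ratings (Q = 4 versus Q = 5) against binned indicator classes. The sketch below uses placeholder cell counts, not the values reported in Van Raan’s (2006) Table 2.

# Sketch: chi-squared test of association between peer rating (rows: Q = 4,
# Q = 5) and three binned indicator classes (columns); the cell counts are
# placeholders, not the values reported in Van Raan's (2006) Table 2.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([
    [20, 35, 15],  # groups rated "good" (Q = 4), per indicator class
    [10, 25, 20],  # groups rated "excellent" (Q = 5), per indicator class
])

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p:.3f}")  # df = 2 for a 2 x 3 table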

In Tables 19.3 and 19.4 (at p. 243), Moed (2005) used these same data, but, having access to the source data, he added the larger set of similar results for biology and physics (whereas Table 1 above contained only the data for 12 research groups in chemistry at a single university among the 147 chemistry groups at ten universities). By additionally aggregating the CPP/FCSm values along a scale of “Citation impact classes,” he could conclude (at p. 242) that “a very high citation impact discriminated very well between departments rated excellent or good and those receiving lower peer ratings, but it did not discriminate so well between good and excellent departments in the perception of peers.”

This wording (“not so well”) suggests a poor correlation, whereas we showed above that there was no correlation at the level of the smaller set in chemistry, using the values of CPP/FCSm before binning them into “Citation impact classes”: citation analysis is not always helpful in distinguishing between good and excellent research. Aggregation may inadvertently obscure the absence of correlations. Unfortunately, the selection between “excellent” and “good” is one of the policy contexts in which citation analysis is used, for example, in rankings and funding schemes (e.g., Bornmann et al. 2010; Halffman and Leydesdorff 2010; Geuna and Martin 2003).

In summary, we argue that the industrial character of citation analysis for evaluative purposes has hidden technical flaws in these measurements because of a lack of openness about the data and, therefore, a lack of critical discussion in academia. Notwithstanding the prevailing use of these indicators in research evaluation and strategic decision-making, the statistical analysis of these scientometric data, for example, supports the claim of the critics (e.g., MacRoberts and MacRoberts 2010) that citation analysis hitherto cannot legitimate the strategic selection of excellence.