Introduction

Citations create links between publications, but to relate citations and publications as two different things, one needs a model (for example, an equation). The journal impact factor (JIF) indexes only one aspect of this relationship: citation impact. The h-index counts the papers with at least h citations; one can also count papers with h² or h/2 citations (Egghe 2008). This paper is based on a different and, in our opinion, more informative model: the Integrated Impact Indicator I3.

The 2-year JIF was outlined by Garfield and Sher (1963; cf. Garfield 1955; Sher and Garfield 1965) at the time of establishing the Institute for Scientific Information (ISI). JIF2 is defined as the number of citations in the current year (t) to any of a journal’s publications of the two previous years (t − 1 and t − 2), divided by the number of citable items (substantive articles, reviews, and proceedings) in the same journal in these two previous years. Although not strictly a mathematical average, JIF2 provides a functional approximation of the mean early citation rate per citable item. A JIF2 of 2.5 implies that, on average, the citable items published 1 or 2 years ago were cited two and a half times. Other JIF variants are also available; for example, JIF5 covers a 5-year window.
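In formula form (our notation, not taken from the sources cited above), with c_t(y) denoting the citations received in year t by the journal's items published in year y, and p(y) the number of citable items published in year y:

$$JIF2(t) = \frac{c_{t}(t - 1) + c_{t}(t - 2)}{p(t - 1) + p(t - 2)}$$

JIF5 is defined analogously over the five publication years t − 1 to t − 5.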

The central problem that led Garfield (1972, 1979) to use the JIF when developing the Science Citation Index was the selection of journals for inclusion in this database. He argued that citation analysis provides an excellent source of information for evaluating journals. The choice of a 2-year time window was based on experiments with the Genetics Citation Index and the early Science Citation Index (Garfield 2003, at p. 364; Martyn and Gilchrist 1968). However, one possible disadvantage of the short term (2 years) could be that “the journal impact factors enter the picture when an individual’s most recent papers have not yet had time to be cited” (Garfield 2003, p. 365; cf. Archambault and Larivière 2009). Bio-medical fields have a fast-moving research front with a short citation cycle; JIF2 may be an appropriate measure for such fields but less so for others (Price 1970). In the 2007 edition of the Journal Citation Reports (reissued for this reason in 2009), a 5-year JIF (JIF5, considering five instead of only two publication years) was added to balance the focus on short-term citations provided by JIF2 (Jacsó 2009; cf. Frandsen and Rousseau 2005).

The skew of citation distributions provides another challenge to evaluation (Seglen 1992, 1997). The mean of a skewed distribution provides less information than the median as a measure of central tendency. To address this problem, McAllister et al. (1983, at p. 207) proposed the use of percentiles or percentile classes as a non-parametric indicator (Narin 1987; later: Bornmann and Mutz 2011; Tijssen et al. 2002). Using this non-parametric approach, and on the basis of a list of criteria provided by Leydesdorff et al. (2011), two of us first developed the Integrated Impact Indicator (I3) based on the integration of the quantile values attributed to each element in a distribution (Leydesdorff and Bornmann 2011).

Since I3 is based on integration, it presents citation analysts with a construct fundamentally different from a methodology based on averages. An analogy that demonstrates the difference between integration and averaging is given by basic mechanics: the impact of two colliding bodies is determined by their combined mass and velocity, and not by the average of their velocities. Analogously, the gross impact of a journal as an entity is the combined volume and citations of its contents (articles and other items), and not an average. Journals differ both in size (the number of published items) and in the skew and kurtosis of the distribution of citations across items. A useful and informative indicator for the comparison of journal influences should respond to these differences. A citation average cannot reflect the variation in both publications and citations, but an indicator based on integration can do so.

One route to indexing both performance and impact via a single number has been provided by the h-index (Hirsch 2005) and its variants (e.g., Bornmann et al. 2011a, b; Egghe 2008). However, the h-index has many drawbacks, not least mathematical inconsistency (Marchant 2009; Waltman and Van Eck 2012). Furthermore, Bornmann et al. (2008) showed that the h-index is mainly determined by the number of papers (and not by citation impact). In other words, the impact dimension of a publication set may not be properly measured using the h-index. One aspect that I3 has in common with the h-index is that the focus is no longer on impact as an attribute but on the information production process (Egghe and Rousseau 1990; Ye et al. 2017). This approach could be applied not only to journals but also to other sets of documents with citations such as the research portfolios of departments or universities. In this study, however, we focus on journal indicators.

At the time of our previous paper about I3 (Leydesdorff and Bornmann 2011), we were unable to demonstrate the generic value of the non-parametric approach because of limited data access. Recently, however, the complete Web of Science became accessible under license to the Max Planck Society (Germany). This enables us to compare I3-values across the database with other journal indicators such as JIF2 and JIF5, total citations (NCit), and numbers of publications (NPub). The choice of journals as units of analysis provides us with a rich and well-studied domain.

Our approach based on percentiles can be considered as the development of “second generation indicators” for two reasons. First, we build on the first-generation approach that Garfield (1979, 2003, 2006) developed for the selection of journals. Second, the original objective of journal selection is very different from the purposes of research evaluation to which the JIF has erroneously been applied (e.g., Alberts 2013). The relevant indicators should accordingly be appropriately sophisticated.

The weighting scheme and the I3* indicator

In this study, we introduce I3*—a variant within the general I3 scheme—by proposing a weighting scheme for percentile classes. We elaborate on Bornmann and Mutz (2011), who distinguished six percentile classes with weights from one to six. Since that publication, however, several threads of work have clarified the position of the top-10% and top-1% categories as proxies for excellence. On the basis of this literature (e.g., Bornmann 2014), our basic assertion is that a paper in the top-1% class can be weighted at ten times the value of a paper in the top-10% class. It follows log-linearly that a top-1% paper is weighted 100 times more than a paper at the bottom. This weighting scheme reflects the highly skewed nature of citation distributions. As a second assertion, we add a weighting to distinguish between papers in the top-50% (weight = 2) and bottom-50% (weight = 1). The dividing line between the bottom-50% and top-50% is less pronounced than that between an averagely cited paper and an exceptionally cited one.

Figure 1 and Table 1 clarify the correspondence between the approaches. (We will show the differences empirically in a later section.) In Fig. 1 the left axis is logarithmic, running from log(1) to log(100), whereas the right axis is linear (one to six). In the original scheme of Bornmann and Mutz (2011), the relative weighting of a top-1% and a top-10% paper was only 6:4.5 (equivalent to 4:3), whereas we apply 10:1 (= 10) in the new scheme. Using quantiles (Leydesdorff and Bornmann 2011), the relation between a top-1% and a top-10% paper would only be 99:89 (= 1.1).

Fig. 1 Weighting factors of the percentile ranks in Bornmann and Mutz (2011) and this study

Table 1 Weighting factors of the percentile ranks in Bornmann and Mutz (2011) and this study

In other words, we distinguish between I3 as a general scheme and a possible family of specific weighting schemes. The latter are applications for specific evaluation contexts. In general, I3 can be written as follows:

$$I3\left( PR_{1}\text{-}W_{1},\; PR_{2}\text{-}W_{2},\; \ldots,\; PR_{n}\text{-}W_{n} \right)$$

where PR defines the lower threshold of the respective percentile rank class and W the corresponding weight; n is the number of classes (and of weights). In this notation, the scheme proposed by Bornmann and Mutz (2011)—at the time called PR6—can be written as I3(99-6, 95-5, 90-4, 75-3, 50-2, 0-1). The notation also covers other percentile-based indicators: the top-10% so-called excellence indicator (e.g., Bornmann et al. 2012; Waltman et al. 2012), for example, can be formalized as the special case I3(90-1). In this study, we propose a new variant, denoted I3*, as a pragmatic shorthand of I3(99-100, 90-10, 50-2, 0-1).
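As an illustration of this notation only, the following minimal sketch (Python; the function and variable names are ours, not taken from any of the cited studies) evaluates an arbitrary I3 scheme for a set of papers, each represented by its percentile value in the reference distribution:

```python
# Weighting schemes as lists of (lower percentile threshold, weight) pairs
I3_STAR = [(99, 100), (90, 10), (50, 2), (0, 1)]                  # I3(99-100, 90-10, 50-2, 0-1)
PR6     = [(99, 6), (95, 5), (90, 4), (75, 3), (50, 2), (0, 1)]   # Bornmann and Mutz (2011)
TOP10   = [(90, 1)]                                               # top-10% excellence indicator

def i3(percentiles, scheme):
    """Sum, over all papers, the weight of the highest class each paper reaches."""
    total = 0
    for p in percentiles:
        for threshold, weight in sorted(scheme, reverse=True):    # highest class first
            if p >= threshold:
                total += weight
                break                                             # count each paper once
    return total
```

For example, i3([99.5, 95, 60, 10], I3_STAR) returns 100 + 10 + 2 + 1 = 113, whereas the same four papers yield 6 + 5 + 2 + 1 = 14 under PR6 and 1 + 1 + 0 + 0 = 2 under the top-10% indicator.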

As is the case for all I3 evaluations, I3* is size-dependent: it scales ceteris paribus with journal size. By dividing I3* by the number of elements of the distribution, N = Σi ni (the number of documents), a size-independent equivalent can be generated. Not surprisingly, this latter measure is highly correlated with JIF2 and JIF5. In other words, I3*i/Ni provides the journal-specific expected I3* value of a paper published in journal i. This value can be used as a benchmark for testing whether the observed citation count for a specific paper is above or below expectation.

Note that we test expected citation rates against observed ones at the level of a sample (e.g., a journal). Consequently, our approach avoids the “ecological fallacy” of using a journal characteristic as an expected value to compare with observed values derived from the individual papers published in the respective journal (Robinson 1950; Kreft and de Leeuw 1988; cf. Waltman and Traag 2017). The observed values are not estimated on the basis of a journal characteristic, but are measured in order to inform the expectation.

Methods

Data

Data were harvested from the Max Planck Digital Library (MPDL) in-house database of the Max Planck Society during the period October 15–29, 2018. This database contains an analytically enriched copy of the Science Citation Index-Expanded (SCI-E), the Social Sciences Citation Index (SSCI), and the Arts and Humanities Citation Index (AHCI). Citation count data can be normalized for the Clarivate Web of Science Subject Categories (WoS Categories) and could, in principle, be based on whole-number counting or on fractional counting in the case of more than a single co-author. The unit of analysis in this study, however, is the individual paper, to which citation counts are attributed irrespective of whether the paper is single- or multi-authored.

The citation window available in the in-house database extended, at the time of data collection, to the end of 2017. We collected substantive items (articles and reviews) from the publication year 2014 with a 3-year citation window to the end of 2017. The results were checked against a similar download for the publication year 2009, that is, 5 years earlier. The year 2014 was chosen as the last year with a complete 3-year citation window at the time of this research (October–November, 2018); furthermore, 2009 is the first year after the update of WoS to its current version 5.

Non-normalized data

The in-house database contains many more journals than the Journal Citation Reports (JCR, which form the basis for the computation of JIF). In order to be able to compare between I3*-values and other indicators, we use only the subset of publications in the 11,761 journals contained in the JCR 2014. These journals all have JIFs and other standard indicators. Of these journals, 11,149 are unique in the SCI-E and SSCI, and the overlap between SSCI and SCI-E is 612 journals. Another 207 journals could not be matched unequivocally on the basis of journal name abbreviations in the in-house database and JCR, so that our sample is 10,942 journals. Note that we are using individual-journal attributes so that the inclusion or exclusion of a specific journal does not affect the values for the other journals under study.

We collected the data as follows. On the basis of the number of papers (articles and reviews, excluding non-academic ephemera such as editorials) in a specific year (in this case: 2014), we identified the threshold number of citations at the category boundaries, e.g. the lower boundary of the 1% most-frequently cited papers. If there are, for example, a total of 100,000 papers in a year, then one thousand of them belong, by definition, to the most-cited 1%. If the papers are ranked by descending citation counts, the citation count of the 1000th paper is the threshold value (Ahlgren et al. 2014). For each journal, the number of papers in this set can be counted. By counting the number of papers with a citation count equal to or exceeding this threshold value, the problem of ties is circumvented. However, more than 1,000 papers may thereby be included in the top-1%, because several papers may have the same citation count as the threshold (in 2014, e.g., 1.03% of the papers instead of exactly 1%). The same applies to the other top-x % classes.
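A sketch of this thresholding and counting step (Python; the data structure and names are our own illustration, not the routine actually run on the in-house database):

```python
# 'papers' is assumed to be a list of (journal, citation_count) records
# for all articles and reviews of one publication year.

def top_x_threshold(citation_counts, x):
    """Citation count of the paper at rank x% when the papers are ranked by
    descending citation counts (cf. Ahlgren et al. 2014): for 100,000 papers
    and x = 1, the count of the 1,000th paper."""
    ranked = sorted(citation_counts, reverse=True)
    rank = max(1, int(len(ranked) * x / 100))
    return ranked[rank - 1]

def journal_class_counts(papers, journal, thresholds):
    """Number of the journal's papers at or above each top-x% threshold.
    Ties at the threshold are included, so the classes may hold slightly
    more than x% of all papers (e.g., 1.03% instead of 1% in 2014)."""
    cites = [c for (j, c) in papers if j == journal]
    return {x: sum(1 for c in cites if c >= t) for x, t in thresholds.items()}

# Usage with hypothetical data:
# thresholds = {x: top_x_threshold([c for (_, c) in papers], x) for x in (1, 10, 50)}
# counts = journal_class_counts(papers, "PLOS ONE", thresholds)
```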

In summary, we harvest the top-1%, top-10%, top-50%, and bottom-50% publication scores for each journal by first determining the thresholds of these percentile classes for the entire database and, second, by counting each journal’s participation in the respective layers of the database. Using a dedicated routine, the data are organized in a relational database with JCR-2014 data. The tables resulting from the analyses can be read into standard software (e.g., Excel, SPSS) for further processing and statistical analysis.

Normalized data

Citation counts are also field-normalized in the in-house database using the WoS Categories, because citation rates differ between fields. These field-normalized scores are available at individual document level for all publications since 1980. The I3* indicator calculated with field-normalized data will be denoted as I3*F—pragmatically abbreviating I3*F(99-100, 90-10, 50-2, 0-1) in this case. Some journals are assigned to more than a single WoS category: in these instances, the journal items and their citation counts are fractionally attributed. In the case of ties at the thresholds of a top-x% class of papers (see above), the field-normalized indicators have been calculated following Waltman and Schreiber (2013). Thus, the in-house database shows whether a paper belongs to the top-1%, top-10%, or top-50% of papers in the corresponding WoS Categories. Papers at the threshold separating the top from the bottom are fractionally assigned to the top paper set.

Statistics

Table 2 shows how to calculate I3* based on publication numbers using PLOS One as an example. The publication numbers in the first columns (a and b) are obtained from the in-house database of the Max Planck Society. These are the numbers of papers in the different top-x%-classes. Since the publication numbers in the higher classes are subsets of the numbers in the lower classes, the percentile classes are corrected (by subtraction) to avoid double counting. The resulting values in each distinct class are provided in the columns c and d. The distinct class counts are then multiplied by the appropriate weights. In the last step of calculating I3*, the weighted numbers of papers in the distinct classes are summed into I3*. In this case, I3* = 78,733 (non-normalized) and I3*F = 53,570.256 (field-normalized).
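The same procedure can be written out as follows (a sketch with hypothetical counts, not the actual values of Table 2):

```python
# Cumulative top-x% counts for one journal: the top-1% papers are contained
# in the top-10% and top-50% counts as well (hypothetical numbers).
cum = {"top1": 50, "top10": 900, "top50": 6000, "all": 20000}
weights = {"top1": 100, "top10": 10, "top50": 2, "bottom50": 1}

# Correct by subtraction to obtain the distinct classes (columns c and d)
distinct = {
    "top1":     cum["top1"],
    "top10":    cum["top10"] - cum["top1"],
    "top50":    cum["top50"] - cum["top10"],
    "bottom50": cum["all"]   - cum["top50"],
}

# Multiply by the weights and sum to obtain I3*
i3_star = sum(distinct[k] * weights[k] for k in weights)

i3_star_per_n = i3_star / cum["all"]   # size-independent I3*/N (see below)
max_i3_star = cum["all"] * 100         # reached if all papers were in the top-1% class
share_of_max = i3_star / max_i3_star
```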

Table 2 PLOS One data as an example of the calculation of I3*, based on non-normalized and field-normalized values

The maximal I3* is ((30,042 * 100) + (0 * 10) + (0 * 2) + (0 * 1) =) 3,004,200, which would be reached if all papers in the journal belonged to the 1% most frequently cited papers in the corresponding fields. With I3*F = 53,570.256, the journal reaches 1.78% of this maximum; without field-normalization (I3* = 78,733), the figure is 2.62%. In other words, there is ample room for improvement.

As noted, I3* can be divided by N, the number of publications (which is by definition equal to the sum of the numbers in the four percentile classes). I3*/N is based on relative frequencies, since the number in each term (ni) is divided by N (= Σi ni). One can expect I3*/N to no longer be size-dependent and thus to have applications different from I3*, as we shall show below. We focus on I3* in this paper; we will discuss potential applications of I3*/N in a later paper.

We have applied Spearman rank-correlation analysis and factor analysis (principal component analysis with varimax rotation) to the following variables:

  1. total numbers of publications (NPub);
  2. citations (NCit);
  3. JIF2;
  4. JIF5;
  5. non-normalized I3*-values (I3*);
  6. field-normalized I3*-values (I3*F);
  7. I3*/N for the non-normalized case (I3*/N).

The results are shown as factor-plots using the first two components as x- and y-axes. This representation in a two-dimensional map provides a ready means of assessing the results visually.

We chose two components in accordance with our design, but the number of components with an eigenvalue larger than one is also two. The results indicate that the first two components explain about 85–90% of the variance in the subsequent analyses. Since the distributions are non-normal, Spearman’s rank-order correlations are preferable to Pearson correlations. Note that the factor analysis is based on Pearson correlations and the results are consequently, in this respect, approximations. Rotated factor matrices and the percentages of explained variance are also provided for each analysis.
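These computations can be sketched as follows, assuming the seven indicators are available per journal in a pandas DataFrame df (the column names and the hand-rolled varimax routine are our own illustration; dedicated packages can of course be used instead):

```python
import numpy as np
import pandas as pd

cols = ["NPub", "NCit", "JIF2", "JIF5", "I3star", "I3starF", "I3star_per_N"]
spearman = df[cols].corr(method="spearman")        # rank-order correlations (cf. Table 4)

def varimax(loadings, max_iter=100, tol=1e-6):
    """Kaiser's varimax rotation of a loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    score = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p)
        )
        rotation = u @ vt
        if s.sum() < score * (1 + tol):
            break
        score = s.sum()
    return loadings @ rotation

# PCA on the Pearson correlation matrix, retaining the two components with
# eigenvalues larger than one, followed by varimax rotation (cf. Table 3).
corr = df[cols].corr()                             # the factor analysis is based on Pearson correlations
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1][:2]
loadings = eigenvectors[:, order] * np.sqrt(eigenvalues[order])
rotated = pd.DataFrame(varimax(loadings), index=cols, columns=["Component 1", "Component 2"])
explained_variance = eigenvalues[order].sum() / eigenvalues.sum()
```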

Results

Full set (journal count, n = 10,942)

Figure 2 shows the two-dimensional factor plot of the data provided numerically in Table 3. The first two factors explain 87.5% of the variance. I3* and its field-normalized equivalent I3*F correlate with each other, and load on the first component, at levels greater than 0.9; they can therefore be considered as capturing essentially the same characteristic. The factor loadings of the numbers of citations (NCit) and publications (NPub) on this first factor are greater than 0.8. NPub, the size indicator of output (number of publications), does not load substantially on the second factor, which represents impact (number of citations); the number of citations (NCit), however, loads much more on factor 1 (.802) than on factor 2 (.324).

Fig. 2 Component plot in rotated space of the two main components in the matrix (varimax-rotated PCA) of 10,942 cases (journals) versus seven indicators: total numbers of publications (NPub), citations (NCit), JIF2, JIF5, non-normalized I3*-values (I3*), field-normalized I3*-values (I3*F), and I3*/N for the non-normalized case (I3*/N). Note: The asterisk is an illegal character in a variable name or label in SPSS, and therefore not included in the plots

Table 3 Rotated factor matrix of the seven indicators plotted in Fig. 2

The correlations in Table 4 are all statistically significant (p < .01). Note that the number of journals is large (n = 10,942) and that significance is therefore less meaningful. However, it can be noted that JIF2 and JIF5 correlate with publication count (NPub) at an observably lower level (0.44 and 0.42) than I3* and I3*F (0.92 and 0.86). Obviously, size-normalization (dividing by n) does not completely remove the effect of size. This is in accordance with the recently published conclusions of Antonoyiannakis (2018). I3*/N can also be considered as a mean and thus a parametric statistic.

Table 4 Spearman rank-order correlations between the variables listed in Table 3

Table 5 shows the ranking of the 25 journals with the greatest index values for each of I3*, field-normalized I3*F, and I3*/N, respectively. (The full listing is available at http://www.leydesdorff.net/I3/ranking.htm.) The size effect of PLOS One dominates both the I3* and I3*F ranking, but not the third column (I3*/N) which is size-independent because of the division by N. Twelve of the 25 titles in this latter column are attributed to journals in the Nature publishing group indicating the high quality of this portfolio. Note that Science, which occupies sixth position in the first two columns, drops to 29th position on the size-independent indicator. PLOS One falls much further, to position 2064.

Table 5 The 25 journals with the highest non-normalized I3* values (I3*), field-normalized values (I3*F), and non-normalized values divided by N (I3*/N), respectively

There may be a disciplinary interaction with normalization: field-normalization seems to affect the leading chemistry journals more than others. The Journal of the American Chemical Society (JACS), for example, holds second place on the (left-side) list of I3*, but only ninth place on I3*F. By contrast, leading physics journals seem to list higher on the normalized indicator. Perhaps, these relatively well-cited journals in chemistry have a longer-tailed citation distribution than comparable physics journals: normalization (division by N) will have a greater effect with increasing values of N. As noted, the two indicators are highly correlated overall, but the possibility of a disciplinary factor will need further elucidation in future research.

The Social Sciences Citation Index (SSCI)

The citation environment of journals listed in the Social Sciences Citation Index (SSCI) is very different from that of journals in the SCI-E. The SSCI journals in JCR constitute about 28% (3105/10,942) of the total serial titles, but the total citations to SSCI journals constitute less than 10% of all citations to JCR titles (4,506,510/48,340,046 in our time window). The average yearly total cites (NCit) of a journal in SSCI is 1451.3 compared with 4417.8 for the combined set. Figure 3 shows the relatively small contribution of the SSCI journals to the citation indices in terms of citations.

Fig. 3 Aggregated citation counts for journals in the Web of Science editions for the SCI and SSCI, both separate and combined. Source: Leydesdorff et al. (2018, p. 627)

Figure 4 shows the factor plot for the 3105 journals in the SSCI, for comparison with the factor plot for the combined sets of SSCI and SCI provided in Fig. 2. The main difference is the greater distance between the number of citations (NCit) and the number of publications (NPub). The correlation between these two indicators is smaller in the SSCI (.633) than in the full set (.719) (Tables 7 and 4, respectively). Consequently, NPub and NCit are further apart in Fig. 4 and the order of the two factors is reversed (Table 6). Nonetheless, these two factors together still explain 84% of the variance.

Fig. 4 Two-component factor plot of seven indicators for 3105 journals in the SSCI. Notes: The indicators are: total numbers of publications (NPub); citations (NCit); JIF2; JIF5; non-normalized I3*-values (I3*); field-normalized I3*-values (I3*F); and scaled I3* for the non-normalized case (I3*/N). The asterisk is an illegal character in a variable name or label in SPSS, and therefore not included in the plots

Table 6 Rotated factor matrix of the seven indicators plotted in Fig. 4
Table 7 Spearman rank-order correlations among the seven indicators under study for 3105 journals in the SSCI

If we focus on a specific journal category of SSCI, such as the 83 journals in Information & Library Science, the difference in citation cultures between SCI and SSCI outcomes is further emphasized. Alternatively, if we focus on a narrow specialism in the natural sciences, such as Spectroscopy with 41 journals, we find that the distinction between the two components is even more pronounced than for the full set of 10,942 journals. Table 8 juxtaposes the rotated factor matrices showing these differences numerically.

Table 8 Rotated factor matrices for two specialist WoS categories, one each from SSCI and SCI

While the number of publications drives the number of citations in the SCI-E, this appears to be less the case in the SSCI: size is less important for impact in the SSCI than in the SCI-E, and I3* correlates with size (NPub) more than with citations (NCit) in the social sciences.

Comparison with 2009

It is possible that the results obtained for 2014 were specific to that year, because it is relatively recent and the citation counts had not yet stabilized. We tested this by repeating the analysis with 2009 data; 2009 was chosen because WoS (version 5) was reorganized in 2008/2009.

Of the 9216 journals in the combined 2009 sets of JCRs for SCI-E and SSCI, 8994 journal title abbreviations could automatically be matched between the data from the in-house database and JCR. The two-component plot in Fig. 5 shows that the outcome for 2009 data is very similar to that seen with 2014 data (Table 9).

Fig. 5 Plot of the first two components for 8994 journals in JCR 2009. Notes: The indicators are: total numbers of publications (NPub); citations (NCit); JIF2; JIF5; non-normalized I3*-values (I3*); field-normalized I3*-values (I3*F); and scaled I3* for the non-normalized case (I3*/N). The asterisk is an illegal character in a variable name or label in SPSS, and therefore not included in the plots

Table 9 Rotated factor matrices for full sets in 2009 and 2014

Two factors explain 88.1% of the variance in 2009 and 87.5% in 2014. Figure 5 shows the 2-component plot for 2009. The results are virtually identical in these two sample years. Thus, the indicator appears to be robust over time.

Statistics

PLOS One was by far the largest journal in 2014 with 30,042 publications. It was followed in this analysis by RSC Advances with 8345 citable items. In terms of total citations, however, PLOS One is in eighth place with 332,716 citations. In the same year, Nature accrued 617,363 citations to 862 publications. The simple citations/publication (c/p) ratio for Nature is 716.3 and for PLOS One is 11.1. By comparison, the values of I3*/N are 61.4 for Nature and 2.6 for PLOS One and, in seeming contradiction to conventional indicators, the (non-normalized) I3* values are 78,733 for PLOS One and 52,883 for Nature.

What do these figures mean, and are the differences statistically and practically significant? One can test the distribution of papers over the classes against the expected numbers. This can be done for the frequencies in the matrix using chi-squared statistics, or by a test between means (in the case of I3*/N) using the z test and/or Cohen’s h for “practical significance.” Table 10 shows various options for testing observed values against expected ones; Table 11 generalizes this to the possibility of testing any two distributions against each other. As empirical instances, we again use PLOS One for the comparison of observed with expected values (Table 10), and this same journal versus RSC Advances in Table 11.

Table 10 Comparison of PLOS One with expected values
Table 11 Comparison of PLOS One with RSC Advances

The results of the chi-squared tests are statistically significant (p < .001), both when comparing PLOS One with the expectation and when comparing PLOS One with RSC Advances. One can summarize the results of the chi-squared ex post using Cramér’s V, which conveniently ranges from zero to one. In this case, Cramér’s V = 0.27 in Table 10 and Cramér’s V = 0.05 in Table 11. In other words, the difference between the expected and observed percentile-rank distributions is more than five times larger than the corresponding difference between PLOS One and RSC Advances. (The template provides these values automatically.) The results of the chi-squared tests based on the I3* values (in columns g and h in both Tables 10 and 11) are provided at the bottom of column k.

While the chi-squared statistic provides a test for comparing the entire distributions (two vectors of four classes), the decomposition of chi-squared into standardized residuals \(\left[ \frac{\text{observed} - \text{expected}}{\sqrt{\text{expected}}} \right]\) provides us with a statistic for each class. Standardized residuals can be considered as z-values: they are significant at the 5% level if the absolute value is larger than 1.96, at the 1% level for an absolute value > 2.576, and at the 1‰ level for an absolute value > 3.291 (Sheskin 2011, at p. 672). These statistics are provided in column j.
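A sketch of these computations (Python with hypothetical class counts; the actual values are in Tables 10 and 11):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Papers per class (top-1%, top-10%, top-50%, bottom-50%), hypothetical numbers:
observed = np.array([210, 2800, 12100, 14900])   # e.g., a journal's distinct-class counts
expected = np.array([300, 2700, 12000, 15010])   # e.g., the expectation, or a second journal

table = np.vstack([observed, expected])           # 2 x 4 table: eight cells
chi2, p_value, dof, _ = chi2_contingency(table)

# Cramér's V summarizes the chi-squared and ranges from zero to one
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Standardized residuals per class, interpretable as z-values
residuals = (observed - expected) / np.sqrt(expected)
```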

Furthermore, the residuals are signed and indicate (in Table 10, for example) that PLOS One scores are significantly below expectation in the top-1% class (p < .001), but above expectation in the top-50% class (p < .001). The overall distribution over the percentile classes (including the vertical direction of columns e and f) is statistically significant at the 1‰ level: the journal as a whole performs significantly below expectation in terms of I3*. (Note that each of the four decompositions in column l is based on two observations, since eight cells are used in the computation of the chi-squared.)

In Table 11, RSC Advances scores statistically significantly higher than PLOS One in the top-10% (column l), but not statistically significantly below PLOS One in the lower-ranked classes. Tables 10b and 11b add the statistics for I3*/N. The division by N makes all the frequencies relative. Since these relative frequencies can also be considered as proportions, one can use the z test for differences between proportions (Sheskin 2011, pp. 656f.) or compute an effect size using Cohen’s w (1988, at p. 216; Leydesdorff et al. 2019).
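These tests on proportions can be sketched as follows (Python; the proportions are hypothetical):

```python
import math
from scipy.stats import norm

def z_test_proportions(k1, n1, k2, n2):
    """Two-sided z test for the difference between two proportions."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    z = (p1 - p2) / math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return z, 2 * (1 - norm.cdf(abs(z)))

def cohens_h(p1, p2):
    """Cohen's h effect size for the difference between two proportions (Cohen 1988)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# e.g., the shares of two journals' papers that fall into the top-10% class
z, p = z_test_proportions(2800, 30000, 950, 8000)
h = cohens_h(2800 / 30000, 950 / 8000)
```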

The z-values in column q of Table 10 show that PLOS One scores above expectation in the percentile class between 50 and 89, but this value is not statistically significant (column r). PLOS One scores below expectation in the top-1% class, and even more so in the top-10% and bottom-50% classes, but these differences are not statistically significant either.

These results may come as no surprise, but cases other than PLOS One may offer less intuitive results about the status of a journal. For example, specification of the differences between RSC Advances and Nature in terms of these four classes would be far from obvious. The template available at https://www.leydesdorff.net/I3/template.xlsx automatically fills out the numbers and significance levels when the user provides the field-normalized and non-normalized values for top-1%, top-10%, top-50%, and total number of papers in the respective cells.

To assess the significance of the results on the basis of effect sizes (Cohen 1988; Schneider 2013; Wasserstein and Lazar 2016; Williams and Bornmann 2014), we added Cohen’s h and w for the comparison of proportions as column s to Tables 10 and 11. The w index is 0.4 in Table 10; the difference between PLOS One and its expected citation rates in these four categories is thus meaningful and significant for practical purposes. This is not the case for the difference between the two journals: w = 0.1. The values of h accord with those of the z test for each of the classes.

It should be kept in mind that the tests on proportions address the size-independent indicator I3*/N. As noted, this measure can be used as the expected citation value of a paper published in the relevant journal. In other words, a paper that is accepted for publication in RSC Advances has a statistically significantly greater likelihood of being cited in the overall top-10% than a paper in PLOS One; it is also less likely to be cited below the 50% threshold.

Effects of different weighting schemes

Weighting schemes have a considerable effect on the outcome and interpretation of the analysis of categorized data; weighting introduces a level of subjectivity. Using the general scheme of I3, variants can be adapted to the context of the evaluation. For example, if the focus is solely on research excellence, the percentile classes reflecting high impact can be given a higher weight. Reducing the weights of the higher impact classes means that productivity is relatively more emphasized.

What happens if, instead of the logarithmic set, we use the linear set of Bornmann and Mutz (2011) specified in Table 1, or the respective quantile values as used by Leydesdorff and Bornmann (2011)? Our data collection is categorized in four classes, so we can approximate the linear scheme with a weight of 6 for the top-1% papers, 4 for the top-10%, 2 for the top-50%, and 1 for the bottom-50%. (Bornmann and Mutz (2011) used two additional classes: a weight of 5 for the top-5% papers, and the class between 50 and 89 was divided into 75–89, weighted 3, and 50–74, weighted 2.) The analysis is now less sensitive: using a linear scale, the information benefit of I3* is considerably reduced.
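The effect can be made concrete by applying the three weight vectors to the same (hypothetical) distinct-class counts:

```python
# Hypothetical distinct-class counts (top-1%, top-10%, top-50%, bottom-50%)
counts = [50, 850, 5100, 14000]
schemes = {
    "I3* (log-linear)":             [100, 10, 2, 1],
    "Bornmann and Mutz, collapsed": [6, 4, 2, 1],
    "quantile values":              [99, 89, 50, 1],
}
for name, weights in schemes.items():
    value = sum(c * w for c, w in zip(counts, weights))
    top1_share = counts[0] * weights[0] / value        # contribution of the top-1% papers
    print(f"{name:30s} {value:>8,d}  top-1% share: {top1_share:.1%}")
```

Under the log-linear weights the top-1% papers contribute noticeably to the total (about 13% in this example), whereas under the linear and quantile weights their contribution drops to around 1%.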

With linear weighting, Fig. 6 shows that I3* no longer captures the number of citations, but becomes a size indicator (correlated more with NPub than with NCit). The top-1% papers, for example, are now given a relative weight of six instead of one hundred; highly skewed citation frequencies therefore no longer play a strongly differentiating role in the assessment across higher and lower-ranked percentile classes.

Fig. 6 Plot of the two main components in the matrix (varimax-rotated PCA) of 10,942 cases (journals). The indicators are: total numbers of publications (NPub); citations (NCit); JIF2; JIF5; non-normalized I3*-values (I3*); field-normalized I3*-values (I3*F); and scaled I3* for the non-normalized case (I3*/N). The asterisk is an illegal character in a variable name or label in SPSS, and therefore not included in the plots

Table 12 compares the use of the six percentile ranks of Bornmann and Mutz (2011) with the use of the quantile values 99, 89, 50, and 1 for the four classes (Leydesdorff and Bornmann 2011). The position of NCit changes similarly in both cases; the coordinate values of NCit (boldfaced in Table 12) are slightly lower. Correspondingly, the Pearson correlation of NCit and I3* declines further, from .624 to .613 (p < .01).

Table 12 Rotated factor matrices of the seven indicators based on replacing I3* values with six percentile ranks (Bornmann and Mutz 2011) and quantile values (Leydesdorff and Bornmann 2011), respectively

Recall the similar effect for the social sciences (SSCI compared with SCI), particularly when we focused on the 83 journals in the LIS category. In that case the cause was a difference in the data; in the general case, however, the cause is the (mis)specification of a model that does not give appropriate attention to the skew in the distribution.

Summary and conclusions

We argue in this paper that an indicator can be developed that reflects both impact and output, and that combines the two dimensions of publications and citations into a single measure by using non-parametric statistics. The generic Integrated Impact Indicator I3 is a sum of weighted publication numbers in different percentile classes. The indicator can be used very flexibly with a range of percentile classes and weights. Depending on the chosen parameters, I3 can be made more output- or more impact-oriented. In this study, we introduced I3* = I3(99-100, 90-10, 50-2, 0-1), which categorizes and weights papers in the higher citation-impact range in a more informed way, given the skew of the distribution, than the indicator proposed by Bornmann and Mutz (2011) and the quantile-based approach elaborated by Leydesdorff and Bornmann (2011).

I3* can be size-normalized by dividing its value by the original number of publications, to obtain a secondary indicator that expresses the expected contribution made by a single paper given the journal’s characteristics. The size-dependent and size-independent indicators can be considered as relating to two nearly orthogonal axes. When we consider the relationship between conventional journal indicators and these new indicators, we see that I3* correlates strongly with both the total number of citations and the number of publications, whereas I3*/N correlates with size-independent indicators such as the JIF.

The journal impact factor developed by Garfield and Sher (1963) was originally intended as a journal statistic of value to publishers and librarians for portfolio management. It was not intended for research evaluation, but it has in fact been increasingly employed for this purpose and mistakenly used as a benchmark for individual researchers and their research output. An average citation rate over a 2-year (JIF2) or 5-year (JIF5) window is not representative of the journal as a whole. The JIF can be used as one indicator of the reputation or status of a journal, subject to appropriate contextual considerations, but it cannot be used as an impact value for single papers (Pendlebury and Adams 2012; Bornmann and Williams 2017; Leydesdorff et al. 2016a, b; cf. Waltman and Traag 2017).

Can the I3* indicator be compared with the h-index? Only to the extent that both indicators combine the measurement of output and impact into a single number. However, the h-index is mathematically inconsistent; it overrides discipline-specific citation cultures and other considerations, and observed values cannot be tested systematically against expected ones. By contrast, I3* can be analyzed using various statistical tests or power analysis, depending on the context in which one wishes to use the indicator. Furthermore, I3* does not provide only a single value like the h-index, but additionally gives four reference values with performance information for the different impact classes. This information can be compared with expected values and between different publication sets (e.g., of two or more institutions). Thus, I3* can be used as a single number (e.g., for policy purposes), but it can also be decomposed into the contributions of the percentile rank classes (e.g., the top-10% group). Importantly, one is able to specify error terms on the basis of statistics.

The versatility of I3* is illustrated by an Excel spreadsheet containing a template for the computation, available at https://www.leydesdorff.net/i3/template.xlsx. The Ptop10% and PPtop10% indicators have become established as quasi-standard indicators in professional bibliometrics, especially when research institutions are compared (Waltman et al. 2012). The use of these percentile-based indicators is recommended, for instance, in the Leiden Manifesto, which included ten guiding principles for research evaluation (Hicks et al. 2015). An advantage of the I3* indicator, itself a percentile-based indicator, is that it integrates the top-1% with the top-10% information and combines them with information about the other percentile classes. I3* thus provides a broader picture than Ptop10% and PPtop10%.

The almost weekly invention of a new h-type indicator signals that many innovative analysts are not aware of a central problem with bibliometric data, shared with other forms of collected data: indicators necessarily generate error, both in source measurement and through analytical methodology (Leydesdorff et al. 2016, pp. 2144f.). Consequently, one should not underestimate the need to elaborate, test, and report on algorithms and their analytics, both empirically and statistically. Elegance on purely mathematical (that is, a priori) grounds is not a sufficient condition for claiming scientometric utility (Ye and Leydesdorff 2014).

Perspectives for further research

The convergent validity of different (field-normalized) indicators can be investigated by comparing the indicators with assessments by peers (Bornmann et al. 2019). Peer assessments of papers published in the biomedical area are available in the F1000Prime database (see https://f1000.com/prime). High correlations between quantitative and qualitative assessments signal the convergent validity of bibliometric indicators, which should then be preferred in the practice of research evaluation. Bornmann and Leydesdorff (2013) correlated different indicators with the assessments by peers provided in the F1000Prime database. The results showed, for instance, that “Percentile in Subject Area achieves the highest correlation with F1000 ratings” (p. 286). In a follow-up study, the I3* indicators are examined with a similar design to investigate whether these new indicators also show convergent validity (Bornmann et al. 2019).