Abstract
We propose the I3* indicator as a non-parametric alternative to the journal impact factor (JIF) and the h-index. We apply I3* to more than 10,000 journals. The results can be compared with other journal metrics. I3* is a promising variant within the general scheme of non-parametric I3 indicators introduced previously: I3* provides a single metric which correlates with both impact in terms of citations (c) and output in terms of publications (p). We argue for weighting using four percentile classes: the top-1% and top-10% as excellence indicators; the top-50% and bottom-50% as output indicators. Like the h-index, which also incorporates both c and p, I3* values are size-dependent; however, division of I3* by the number of publications (I3*/N) provides a size-independent indicator which correlates strongly with the 2- and 5-year journal impact factors (JIF2 and JIF5). Unlike the h-index, I3* correlates significantly with both the total number of citations and the total number of publications. The values of I3* and I3*/N can be statistically tested against the expectation or against one another using chi-squared tests or effect sizes. A template (in Excel) is provided online for the relevant tests.
Introduction
Citations create links between publications; but to relate citations to publications as two different things, one needs a model (for example, an equation). The journal impact factor (JIF) indexes only one aspect of this relationship: citation impact. Using the h-index, papers with at least h citations are counted. One can also count papers with h^{2} or h/2 citations (Egghe 2008). This paper is based on a different and, in our opinion, more informative model: the Integrated Impact Indicator I3.
The 2-year JIF was outlined by Garfield and Sher (1963; cf. Garfield 1955; Sher and Garfield 1965) at the time of establishing the Institute for Scientific Information (ISI). JIF2 is defined as the number of citations in the current year (t) to any of a journal’s publications of the two previous years (t − 1 and t − 2), divided by the number of citable items (substantive articles, reviews, and proceedings) in the same journal in these two previous years. Although not strictly a mathematical average, JIF2 provides a functional approximation of the mean early citation rate per citable item. A JIF2 of 2.5 implies that, on average, the citable items published one or two years previously were cited two and a half times. Other JIF variants are also available; for example, JIF5 covers a 5-year window.^{Footnote 1}
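The verbal definition above can be written out as a formula (a reconstruction from the definition in the text; the symbols are ours):

```latex
\mathrm{JIF2}(t) \;=\; \frac{c_{t}(t-1) + c_{t}(t-2)}{p(t-1) + p(t-2)}
```

where \(c_{t}(y)\) denotes the number of citations received in year \(t\) by the journal's items published in year \(y\), and \(p(y)\) the number of citable items the journal published in year \(y\). JIF5 follows analogously with the five publication years \(t-1, \ldots, t-5\) in numerator and denominator.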
The central problem that led Garfield (1972, 1979) to use the JIF when developing the Science Citation Index was the selection of journals for inclusion in this database. He argued that citation analysis provides an excellent source of information for evaluating journals. The choice of a 2-year time window was based on experiments with the Genetics Citation Index and the early Science Citation Index (Garfield 2003, at p. 364; Martyn and Gilchrist 1968). However, one possible disadvantage of the short term (2 years) could be that “the journal impact factors enter the picture when an individual’s most recent papers have not yet had time to be cited” (Garfield 2003, p. 365; cf. Archambault and Larivière 2009). Biomedical fields have a fast-moving research front with a short citation cycle, and JIF2 may be an appropriate measure for such fields but less so for other fields (Price 1970). In the 2007 edition of Journal Citation Reports (reissued for this reason in 2009), a 5-year JIF (JIF5, considering five instead of only two publication years) was added to balance the focus on short-term citations provided by JIF2 (Jacsó 2009; cf. Frandsen and Rousseau 2005).^{Footnote 2}
The skew in citation distributions provides another challenge to evaluation (Seglen 1992, 1997). The mean of a skewed distribution provides less information than the median as a measure of central tendency. To address this problem, McAllister et al. (1983, at p. 207) proposed the use of percentiles or percentile classes as a non-parametric indicator (Narin 1987^{Footnote 3}; see later: Bornmann and Mutz 2011; Tijssen et al. 2002). Using this non-parametric approach, and on the basis of a list of criteria provided by Leydesdorff et al. (2011), two of us first developed the Integrated Impact Indicator (I3) based on the integration of the quantile values attributed to each element in a distribution (Leydesdorff and Bornmann 2011).
Since I3 is based on integration, the development of I3 presents citation analysts with a construct fundamentally different from a methodology based on averages. An analogy that demonstrates the difference between integration and averaging is given by basic mechanics: the impact of two colliding bodies is determined by their combined mass and velocity, and not by the average of their velocities. So, it can be argued that the gross impact of the journal as an entity is the combined volume and citation of its contents (articles and other items); but not an average. Journals differ both in size (the number of published items) and in the skew and kurtosis of the distribution of citations across items. A useful and informative indicator for the comparison of journal influences should respond to these differences. A citation average cannot reflect the variation in both publications and citations but an indicator based on integration can do so.
One route to indexing both performance and impact via a single number has been provided by the h-index (Hirsch 2005) and its variants (e.g., Bornmann et al. 2011a, b; Egghe 2008). However, the h-index has many drawbacks, not least mathematical inconsistency (Marchant 2009; Waltman and Van Eck 2012). Furthermore, Bornmann et al. (2008) showed that the h-index is mainly determined by the number of papers (and not by citation impact). In other words, the impact dimension of a publication set may not be properly measured using the h-index. One aspect that I3 has in common with the h-index is that the focus is no longer on impact as an attribute but on the information production process (Egghe and Rousseau 1990; Ye et al. 2017). This approach could be applied not only to journals but also to other sets of documents with citations, such as the research portfolios of departments or universities. In this study, however, we focus on journal indicators.
At the time of our previous paper about I3 (Leydesdorff and Bornmann 2011), we were unable to demonstrate the generic value of the non-parametric approach because of limited data access. Recently, however, the complete Web of Science became accessible under license to the Max Planck Society (Germany). This enables us to compare I3 values across the database with other journal indicators such as JIF2 and JIF5, total citations (NCit), and numbers of publications (NPub). The choice of journals as units of analysis provides us with a rich and well-studied domain.
Our approach based on percentiles can be considered as the development of “second-generation indicators” for two reasons. First, we build on the first-generation approach that Garfield (1979, 2003, 2006) developed for the selection of journals. Second, the original objective of journal selection is very different from the purposes of research evaluation to which the JIF has erroneously been applied (e.g., Alberts 2013). The relevant indicators should accordingly be appropriately sophisticated.
The weighting scheme and the I3* indicator
In this study, we introduce I3*—a variant within the general I3 scheme—by proposing a weighting scheme of percentile classes. We elaborate on Bornmann and Mutz (2011), who counted six percentile classes with weights from one to six. Since that publication, however, several threads of work have clarified the position of the top-10% and top-1% categories as proxies for excellence. On the basis of this literature (e.g., Bornmann 2014), our basic assertion is that a paper in the top-1% class can be weighted at ten times the value of a paper in the top-10% class. It follows log-linearly that a top-1% paper weighs 100 times more than a paper at the bottom. This weighting scheme reflects the highly skewed nature of citation distributions. We add, as a second assertion, a weighting to distinguish between papers in the top-50% (weight = 2) and bottom-50% (weight = 1). The dividing line between the bottom-50% and top-50% is less pronounced than that between an averagely cited paper and an exceptionally cited one.
Figure 1 and Table 1 clarify the correspondence between the approaches. (We will show the differences empirically in a later section.) In Fig. 1 the left axis is logarithmic—that is, log(1) to log(100)—whereas the right axis is linear (one to six). In the original scheme of Bornmann and Mutz (2011), the relative weighting of a top-1% and a top-10% paper was only 6:4.5 (equivalent to 4:3), whereas we apply 10:1 (= 10) in the new scheme. Using quantiles (Leydesdorff and Bornmann 2011), the relation between a top-1% and a top-10% paper would only be 99:89 (= 1.1).
In other words, we distinguish between I3 as a general scheme and a possible family of specific weighting schemes. The latter are applications for specific evaluation contexts. In general, I3 can be written as follows:

\(I3 = \sum_{i=1}^{n} W(PR_i) \cdot P(PR_i)\)

where PR_i defines the lower threshold of the respective percentile rank class, W(PR_i) the corresponding weight, and P(PR_i) the number of papers in the class; n is the number of classes and weights, respectively. In this notation, with each class written as a pair “lower threshold-weight,” the scheme proposed by Bornmann and Mutz (2011)—at the time called PR6—can be written as follows: I3(99-6, 95-5, 90-4, 75-3, 50-2, 0-1); and the scheme in this paper (I3*) can be formalized as I3(99-100, 90-10, 50-2, 0-1). However, the scheme can be used more broadly for percentile-based indicators: the top-10% so-called excellence indicator (e.g., Bornmann et al. 2012; Waltman et al. 2012), for example, can be formalized as the special case I3(90-1). In this study, we propose a new variant which we denote as I3*; I3* can be considered as a pragmatic shorthand of I3(99-100, 90-10, 50-2, 0-1).
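The general scheme can be sketched in a few lines of code. The following is a minimal illustration (the percentile values of the five papers are hypothetical; the thresholds and weights are those of I3* as defined in the text):

```python
def i3(percentiles, scheme):
    """Generic I3: each paper contributes the weight of the highest
    percentile-rank class whose lower threshold it reaches.
    `scheme` maps lower thresholds to weights."""
    total = 0
    thresholds = sorted(scheme, reverse=True)  # check the highest class first
    for p in percentiles:
        for t in thresholds:
            if p >= t:
                total += scheme[t]
                break
    return total

# The I3* scheme: I3(99-100, 90-10, 50-2, 0-1)
I3_STAR = {99: 100, 90: 10, 50: 2, 0: 1}

# Hypothetical journal with five papers: one top-1%, one top-10%,
# two top-50%, one bottom-50%
print(i3([99.5, 92.0, 60.0, 55.0, 20.0], I3_STAR))  # 100 + 10 + 2 + 2 + 1 = 115
```

Other schemes, such as PR6 or the excellence indicator I3(90-1), are obtained simply by passing a different threshold-to-weight mapping.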
As is the case for all I3 evaluations, I3* is size-dependent: it scales ceteris paribus with journal size. By dividing I3* by the number of elements of the distribution N = Σ_{i}n_{i} (of documents), a size-independent equivalent can be generated. Not surprisingly, this latter measure is highly correlated with JIF2 and JIF5. In other words, I3*_{i}/N_{i} provides the journal-specific expected I3* value of a paper published in journal i. This value can be used as a benchmark for testing whether the observed citation count for a specific paper is above or below expectation.
Note that we test expected citation rates against observed ones at the level of a sample (e.g., a journal). Consequently, our approach avoids the “ecological fallacy” of using a journal characteristic as an expected value to compare with observed values derived from the individual papers published in the respective journal (Robinson 1950; Kreft and de Leeuw 1988; cf. Waltman and Traag 2017). The observed values are not estimated on the basis of a journal characteristic, but are measured in order to inform the expectation.
Methods
Data
Data were harvested at the Max Planck Digital Library (MPDL) in-house database of the Max Planck Society during the period October 15–29, 2018. This database contains an analytically enriched copy of the Science Citation Index Expanded (SCIE), the Social Sciences Citation Index (SSCI), and the Arts and Humanities Citation Index (AHCI). Citation count data can be normalized for the Clarivate Web of Science Subject Categories (WoS Categories) and could in principle be based on whole-number counting or fractional counting in the case of more than a single co-author. The unit of analysis in this study, however, is the individual paper to which citation counts are attributed, irrespective of whether the paper is single- or multi-authored.
The (current) citation window in the in-house database extended to the end of 2017 at the time of the data collection. We collected substantive items (articles and reviews) for the publication year 2014 with a 3-year citation window to the end of 2017. The results were checked against a similar download for the publication year 2009, that is, 5 years earlier. The year 2014 was chosen as the last year with a complete 3-year citation window at the time of this research (October–November, 2018); furthermore, the year 2009 is the first year after the update of WoS to its current version 5.
Non-normalized data
The in-house database contains many more journals than the Journal Citation Reports (JCR, which form the basis for the computation of the JIF). In order to be able to compare between I3* values and other indicators, we use only the subset of publications in the 11,761 journals contained in the JCR 2014. These journals all have JIFs and other standard indicators. Of these journals, 11,149 are unique in the SCIE and SSCI, and the overlap between SSCI and SCIE is 612 journals. Another 207 journals could not be matched unequivocally on the basis of journal name abbreviations in the in-house database and JCR, so that our sample is 10,942 journals. Note that we are using individual-journal attributes, so that the inclusion or exclusion of a specific journal does not affect the values for the other journals under study.
We collected the data as follows. On the basis of the number of papers (articles and reviews, excluding non-academic ephemera such as editorials) in a specific year (in this case: 2014), we identified the threshold number of citations at category boundaries, e.g. the lower boundary of the 1% most frequently cited papers. If there are, for example, a total of 100,000 papers in a year, then one thousand of them belong, by definition, to the most-cited 1%. If the papers are ranked by descending citation counts, then the citation count of the 1000^{th} paper is the threshold value (Ahlgren et al. 2014). For each journal, the number of papers in this set can be counted. By counting the number of papers with a citation count equalling or exceeding this threshold value, the problem of ties is circumvented. However, there is a possibility that more than 1,000 papers may thereby be included in the top-1% because there are several papers with the same citation count as the threshold (in 2014, e.g., 1.03% of the papers instead of exactly 1%). The same applies to the other top-x% classes.
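A minimal sketch of this thresholding step (the citation counts below are invented for illustration; the actual in-house routines are not reproduced here):

```python
def top_class(citations, fraction=0.01):
    """Determine the citation threshold for the top `fraction` of a
    database-wide list of citation counts, and return the threshold
    together with the size of the resulting class. Ties at the
    threshold can push the class above the nominal fraction
    (e.g. 1.03% instead of exactly 1%)."""
    ranked = sorted(citations, reverse=True)
    k = max(1, round(len(ranked) * fraction))   # nominal class size
    threshold = ranked[k - 1]                   # citation count of the k-th paper
    n_in_class = sum(1 for c in citations if c >= threshold)
    return threshold, n_in_class

# Hypothetical database of 200 papers with ties at the cut-off:
cites = [50, 40, 40, 40, 40] + [10] * 95 + [1] * 100
print(top_class(cites, fraction=0.02))
# Nominal top-2% of 200 papers is 4 papers; the tie at 40 citations
# makes the class contain 5 papers (2.5%).
```

Per journal, one would then count how many of the journal's papers fall into each class thus delimited.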
In summary, we harvest the top-1%, top-10%, top-50%, and bottom-50% publication scores for each journal by first determining the thresholds of these percentile classes for the entire database and, second, by counting each journal’s participation in the respective layers of the database. Using a dedicated routine, the data are organized in a relational database with JCR 2014 data. The tables resulting from the analyses can be read into standard software (e.g., Excel, SPSS) for further processing and statistical analysis.
Normalized data
Citation counts are also field-normalized in the in-house database using the WoS Categories, because citation rates differ between fields. These field-normalized scores are available at the individual document level for all publications since 1980. The I3* indicator calculated with field-normalized data will be denoted as I3*F—pragmatically abbreviating I3*F(99-100, 90-10, 50-2, 0-1) in this case. Some journals are assigned to more than a single WoS category: in these instances, the journal items and their citation counts are fractionally attributed. In the case of ties at the thresholds of a top-x% class of papers (see above), the field-normalized indicators have been calculated following Waltman and Schreiber (2013). Thus, the in-house database shows whether a paper belongs to the top-1%, top-10%, or top-50% of papers in the corresponding WoS Categories. Papers at the threshold separating the top from the bottom are fractionally assigned to the top paper set.
Statistics
Table 2 shows how to calculate I3* based on publication numbers, using PLOS One as an example. The publication numbers in the first columns (a and b) are obtained from the in-house database of the Max Planck Society. These are the numbers of papers in the different top-x% classes. Since the publication numbers in the higher classes are subsets of the numbers in the lower classes, the percentile classes are corrected (by subtraction) to avoid double counting. The resulting values in each distinct class are provided in columns c and d. The distinct class counts are then multiplied by the appropriate weights. In the last step of calculating I3*, the weighted numbers of papers in the distinct classes are summed to give I3*. In this case, I3* = 78,733 (non-normalized) and I3*F = 53,570.256 (field-normalized).
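The subtraction-and-weighting step can be sketched as follows (the class counts here are hypothetical, not the actual PLOS One numbers from Table 2):

```python
# Weights for the distinct classes: top-1%, top-10%, top-50%, bottom-50%
WEIGHTS = [100, 10, 2, 1]

def i3_star(top1, top10, top50, n_pub):
    """I3* from cumulative class membership counts. Papers in the
    top-1% are also counted in the top-10% and top-50%, so the
    classes are de-duplicated by subtraction before weighting."""
    distinct = [
        top1,            # top-1%
        top10 - top1,    # top-10% excluding the top-1%
        top50 - top10,   # top-50% excluding the top-10%
        n_pub - top50,   # bottom-50%
    ]
    return sum(w * d for w, d in zip(WEIGHTS, distinct))

# Hypothetical journal: 30 of 1,000 papers in the top-1%,
# 150 in the top-10%, 600 in the top-50%
print(i3_star(30, 150, 600, 1000))  # 100*30 + 10*120 + 2*450 + 1*400 = 5500
```

The field-normalized I3*F is computed in the same way from the (possibly fractional) field-normalized class counts.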
The maximal I3* is ((30,042 × 100) + (0 × 10) + (0 × 2) + (0 × 1) =) 3,004,200, which would be attained if all papers in the journal belonged to the 1% most frequently cited papers in the corresponding fields. With I3*F = 53,570.256, the journal reaches 1.78% of this maximum. Without field-normalization (I3* = 78,733), the figure is 2.62%. In other words, there is ample room for improvement.^{Footnote 4}
As noted, I3* can be divided by N, the number of publications (which is by definition equal to the sum of the numbers in the four percentile classes). I3*/N is based on relative frequencies, since the number in each term (n_{i}) is divided by N (= Σ_{i}n_{i}). One can expect I3*/N to no longer be size-dependent and thus to have applications different from I3*, as we shall show below. We focus on I3* in this paper; we will discuss potential applications of I3*/N in a later paper.
We have applied Spearman rank-correlation analysis and factor analysis (principal component analysis with varimax rotation) to the following variables^{Footnote 5}:

1. total numbers of publications (NPub);
2. citations (NCit);
3. JIF2;
4. JIF5;
5. non-normalized I3* values (I3*);
6. field-normalized I3* values (I3*F);
7. I3*/N for the non-normalized case (I3*/N).
The results are shown as factor plots using the first two components as x- and y-axes. This representation in a two-dimensional map provides a ready means of assessing the results visually.
We chose two components in accordance with our design, but the number of eigenvectors with a value larger than one is also two. The results indicate that the first two eigenvectors explain about 85–90% of the variance in the subsequent analyses. Since the distributions are non-normal, Spearman’s rank-order correlations are preferable to Pearson correlations.^{Footnote 6} Note that the factor analysis is based on Pearson correlations and the results are consequently, in this respect, approximations. Rotated factor matrices and the percentages of explained variance are also provided for each analysis.
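Spearman's rank-order correlation is simply the Pearson correlation computed on ranks, which is what makes it robust to the skew of citation distributions. A self-contained sketch (the data are invented; in practice one would use a statistics package such as SPSS or scipy.stats.spearmanr):

```python
def rank(xs):
    """Average ranks (1-based), with ties sharing the mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# A monotone but highly nonlinear (skewed) relation still yields rho = 1,
# whereas the Pearson correlation of the raw values would be well below 1.
print(spearman([1, 2, 3, 4, 5], [1, 4, 9, 16, 1000]))  # 1.0
```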
Results
Full set (journal count, n = 10,942)
Figure 2 shows the two-dimensional factor plot of the data provided numerically in Table 3. The first two factors explain 87.5% of the variance. The correlations between I3* and its field-normalized equivalent I3*F, and between each of these and the first component, are greater than 0.9, so the two can be considered as capturing essentially the same characteristic. The factor loadings of the numbers of citations (NCit) and publications (NPub) on this first factor are greater than 0.8. NPub, which is the size indicator of output (number of publications), does not load substantially on the second factor, which represents impact (number of citations); however, the number of citations (NCit) loads on factor 1 (.802) much more than on factor 2 (.324).
The correlations in Table 4 are all statistically significant (p < .01). Note that the number of journals is large (n = 10,942), so that statistical significance is of limited informative value here. However, it can be noted that JIF2 and JIF5 correlate with publication count (NPub) at an observably lower level (0.44 and 0.42) than do I3* and I3*F (0.92 and 0.86). Obviously, size-normalization (dividing by N) does not completely remove the effect of size. This is in accordance with the recently published conclusions of Antonoyiannakis (2018). I3*/N can also be considered as a mean and thus a parametric statistic.
Table 5 shows the ranking of the 25 journals with the greatest index values for each of I3*, field-normalized I3*F, and I3*/N, respectively. (The full listing is available at http://www.leydesdorff.net/I3/ranking.htm.) The size effect of PLOS One dominates both the I3* and I3*F rankings, but not the third column (I3*/N), which is size-independent because of the division by N. Twelve of the 25 titles in this latter column are attributed to journals in the Nature Publishing Group, indicating the high quality of this portfolio. Note that Science, which occupies sixth position in the first two columns, drops to 29^{th} position on the size-independent indicator. PLOS One falls much further, to position 2064.
There may be a disciplinary interaction with normalization: field-normalization seems to affect the leading chemistry journals more than others. The Journal of the American Chemical Society (JACS), for example, holds second place on the (left-side) list of I3*, but only ninth place on I3*F. By contrast, leading physics journals seem to list higher on the normalized indicator. Perhaps these relatively well-cited journals in chemistry have a longer-tailed citation distribution than comparable physics journals: normalization (division by N) will have a greater effect with increasing values of N. As noted, the two indicators are highly correlated overall, but the possibility of a disciplinary factor will need further elucidation in future research.
The Social Sciences Citation Index (SSCI)
The citation environment of journals listed in the Social Sciences Citation Index (SSCI) is very different from that of journals in the SCIE. The SSCI journals in JCR constitute about 28% (3105/10,942) of the total serial titles, but the total citations to SSCI journals constitute less than 10% of all citations to JCR titles (4,506,510/48,340,046 in our time window). The average yearly total cites (NCit) of a journal in SSCI is 1451.3 compared with 4417.8 for the combined set. Figure 3 shows the relatively small contribution of the SSCI journals to the citation indices in terms of citations.
Figure 4 shows the factor plot for the 3105 journals in SSCI, for comparison with the factor plot for the combined sets of SSCI and SCI provided in Fig. 2. The main difference is the greater distance between the number of citations (NCit) and the number of publications (NPub). The correlation between the two indicators is smaller in SSCI (.633) than in SCI (.719) (Tables 4 and 7, respectively). Consequently, NPub and NCit are further apart in Fig. 4, and the order of the two factors is reversed (Table 6). Nonetheless, these two factors together still explain 84% of the variance.
If we focus on a specific journal category of SSCI, such as the 83 journals in Information & Library Science, the difference in citation cultures between SCI and SSCI outcomes is further emphasized. Alternatively, if we focus on a narrow specialism in the natural sciences, such as Spectroscopy with 41 journals, we find that the distinction between the two components is even more pronounced than for the full set of 10,942 journals. Table 8 juxtaposes the rotated factor matrices showing these differences numerically.
While the number of publications drives the number of citations in the SCIE, this appears to be less the case in the SSCI. Size is less important for impact in SSCI than in SCIE. I3* correlates with size (NPub) more than with citations (NCit) in the social sciences.
Comparison with 2009
It is possible that the results obtained for 2014 were specific to that year, because it is relatively recent and the citation counts had not yet stabilized. We tested this by repeating the analysis with 2009 data, chosen because WoS (version 5) was reorganized in 2008/2009.
Of the 9216 journals in the combined 2009 sets of JCRs for SCIE and SSCI, 8994 journal title abbreviations could automatically be matched between the data from the in-house database and JCR. The two-component plot in Fig. 5 shows that the outcome for the 2009 data is very similar to that seen with the 2014 data (Table 9).
Two factors explain 88.1% of the variance in 2009 and 87.5% in 2014. The results are virtually identical in these two sample years; thus, the indicator appears to be robust over time.
Statistics
PLOS One was by far the largest journal in 2014 with 30,042 publications. It was followed in this analysis by RSC Advances with 8345 citable items. In terms of total citations, however, PLOS One is in eighth place with 332,716 citations. In the same year, Nature accrued 617,363 citations to 862 publications. The simple citations/publication (c/p) ratio for Nature is 716.3 and for PLOS One is 11.1. By comparison, the values of I3*/N are 61.4 for Nature and 2.6 for PLOS One and, in seeming contradiction to conventional indicators, the (nonnormalized) I3* values are 78,733 for PLOS One and 52,883 for Nature.
What do these figures mean, and are the differences statistically and practically significant? One can test the distribution of papers over the classes against the expected numbers. This can be done for the frequencies in the matrix using chi-squared statistics, or by a test between means (in the case of I3*/N) using the z test and/or Cohen’s h for “practical significance.”^{Footnote 7} Table 10 shows various options for testing observed values against expected ones; Table 11 generalizes this to the possibility of testing any two distributions against each other. As empirical instances, we again use PLOS One for the comparison of observed with expected values (Table 10), and this same journal versus RSC Advances in Table 11.
The results of the chi-squared tests are statistically significant (p < .001), both when comparing PLOS One with the expectation and when comparing PLOS One with RSC Advances. One can summarize the results of the chi-squared test ex post using Cramér’s V, which conveniently ranges from zero to one. In this case, Cramér’s V = 0.27 in Table 10 and Cramér’s V = 0.05 in Table 11. In other words, the difference between the expected and observed percentile-rank distributions is more than five times larger than the corresponding difference between PLOS One and RSC Advances. (The template provides these values automatically.) The results of the chi-squared test based on the I3* values (in columns g and h of both Tables 10 and 11) are provided at the bottom of column k.
While the chi-squared statistic provides a test for comparing the entire distributions (two vectors of four classes), the decomposition of chi-squared into standardized residuals \(\left[ \frac{\left( \text{observed} - \text{expected} \right)}{\sqrt{\text{expected}}} \right]\) provides us with a statistic for each class. Standardized residuals can be considered as z-values: they are significant at the 5% level if the absolute value is larger than 1.96, at the 1% level for an absolute value > 2.576, and at the 1‰ level for an absolute value > 3.291 (Sheskin 2011, at p. 672). These statistics are provided in column j.
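A simplified sketch of this decomposition for one journal tested against the database-wide expectation (the four class counts below are hypothetical; the full template also includes the complementary cells, i.e. eight cells in total):

```python
def chi_squared_residuals(observed, expected):
    """Chi-squared over the percentile classes, together with the
    standardized residual (observed - expected) / sqrt(expected)
    for each class. |residual| > 1.96, 2.576, 3.291 indicates
    significance at the 5%, 1%, and 1 per-mille levels."""
    residuals = [(o - e) / e ** 0.5 for o, e in zip(observed, expected)]
    chi2 = sum(r * r for r in residuals)
    return chi2, residuals

def cramers_v(chi2, n):
    """Cramér's V for a table with df* = 1 (e.g. 2 x k classes):
    sqrt(chi2 / n), ranging from zero to one."""
    return (chi2 / n) ** 0.5

# Hypothetical journal with 1,000 papers; expectation per distinct
# class: 1% top-1%, 9% top-10% (excl.), 40% top-50% (excl.), 50% bottom
obs = [4, 70, 420, 506]
exp = [10, 90, 400, 500]
chi2, res = chi_squared_residuals(obs, exp)
print(round(chi2, 2))                    # 9.12
print([round(r, 2) for r in res])        # [-1.9, -2.11, 1.0, 0.27]
# Only the second class (top-10% excl. top-1%) is individually
# significant at the 5% level (|-2.11| > 1.96).
```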
Furthermore, the residuals are signed and indicate (in Table 10, for example) that PLOS One scores significantly below expectation in the top-1% class (p < .001), but above expectation in the top-50% class (p < .001). The overall distribution over the percentile classes (including the vertical direction of columns e and f) is statistically significant at the 1‰ level: the journal as a whole performs significantly below expectation in terms of I3*. (Note that each of the four decompositions in column l is based on two observations, since eight cells are used in the computation of the chi-squared.)
In Table 11, RSC Advances scores statistically significantly higher than PLOS One in the top-10% (column l), but not (statistically) significantly below PLOS One in the lower-ranked classes. Tables 10b and 11b add the statistics for I3*/N. The division by N makes all the frequencies relative. Since these relative frequencies can also be considered as proportions, one can apply the z test for the difference between proportions (Sheskin 2011, pp. 656f.) or compute an effect size using Cohen’s w (1988, at p. 216; Leydesdorff et al. 2019).
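The proportion tests can be sketched as follows (the shares and journal sizes are hypothetical, not the values from Tables 10b and 11b; the pooled-variance z test and Cohen's h are textbook formulas, cf. Sheskin 2011 and Cohen 1988):

```python
import math

def z_two_proportions(p1, n1, p2, n2):
    """z test for the difference between two independent proportions,
    using the pooled estimate for the standard error."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def cohens_h(p1, p2):
    """Cohen's h effect size for two proportions
    (difference of arcsine-transformed proportions)."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Hypothetical shares of top-10% papers in two journals of
# different sizes (15% of 2,000 papers vs. 10% of 30,000 papers)
z = z_two_proportions(0.15, 2000, 0.10, 30000)
h = cohens_h(0.15, 0.10)
print(round(z, 2), round(h, 2))  # 7.12 0.15
# The difference is statistically significant (|z| > 3.291, p < .001),
# but the effect size h = 0.15 is small by Cohen's conventions.
```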
The z-values in column q of Table 10 show that PLOS One scores above expectation in the percentile class between 50 and 89, but this value is not statistically significant (column r). PLOS One scores below expectation in the top-1%, and even further below in the top-10% and bottom-50%, although these differences are not statistically significant either.
These results may come as no surprise, but cases other than PLOS One may offer less intuitive results about the status of a journal. For example, specification of the differences between RSC Advances and Nature in terms of these four classes would be far from obvious. The template available at https://www.leydesdorff.net/I3/template.xlsx automatically fills out the numbers and significance levels when the user provides the field-normalized and non-normalized values for the top-1%, top-10%, top-50%, and the total number of papers in the respective cells.
In order to have information about the significance of the results on the basis of effect sizes (Cohen 1988; Schneider 2013; Wasserstein and Lazar 2016; Williams and Bornmann 2014), we added Cohen’s h and w for the comparison among proportions as column s to Tables 10 and 11. The w index is 0.4 in Table 10, and thus the difference between PLOS One and its expected citation rates in these four categories is meaningful and significant for practical purposes. This is not the case for the difference between the two journals: w = 0.1. The values of h accord with those of the z test for each of the classes.
It should be kept in mind that the tests on proportions address the size-independent indicator I3*/N. As noted, this measure can be used as the expected I3* value of a paper published in the relevant journal. In other words, a paper that is accepted for publication in RSC Advances has a statistically significantly greater likelihood of being cited in the overall top-10% than a paper in PLOS One. It is also less likely to be cited below the 50% threshold.
Effects of different weighting schemes
Weighting schemes have a significant effect on the outcome and interpretation of the analysis of categorized data; weighting introduces a level of subjectivity. Using the general scheme of I3, I3 variants can be adapted to the context of the evaluation situation. For example, if the focus is solely on research excellence, the percentile classes reflecting high impact can be provided with a higher weight. Reducing the weighting for higher impact classes would mean that productivity is relatively more emphasized.
What happens if, instead of the logarithmic set, we use the linear set of Bornmann and Mutz (2011) specified in Table 1, or the respective quantile values as used by Leydesdorff and Bornmann (2011)? Our data collection is categorized in four classes, so we can do this with a weight of 6 for the top-1% papers, 4 for the top-10%, 2 for the top-50%, and 1 for the bottom-50%. Bornmann and Mutz (2011) used two additional classes: 5 for the top-5% papers, and the class between 50 and 89 was divided into 75–89 weighted with 3 and 50–74 weighted with 2. The analysis is now less sensitive: using a linear scale, the information benefit of I3* is considerably reduced.
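The loss of sensitivity can be illustrated numerically (the two journals and their class counts are hypothetical; the weights are the I3* scheme and the four-class linear variant from the text):

```python
def weighted_i3(distinct_counts, weights):
    """I3 for already de-duplicated class counts, ordered as
    top-1%, top-10% (excl.), top-50% (excl.), bottom-50%."""
    return sum(w * c for w, c in zip(weights, distinct_counts))

LOG_WEIGHTS = [100, 10, 2, 1]   # I3*
LINEAR_WEIGHTS = [6, 4, 2, 1]   # four-class linear variant

# Two hypothetical journals with 1,000 papers each:
elite = [30, 120, 350, 500]     # strong tail of highly cited papers
bulk = [5, 95, 400, 500]        # similar output, much weaker top end

print(weighted_i3(elite, LOG_WEIGHTS), weighted_i3(bulk, LOG_WEIGHTS))
# 5400 2750 -> the journal with the stronger top end scores
# almost twice as high under the logarithmic weights
print(weighted_i3(elite, LINEAR_WEIGHTS), weighted_i3(bulk, LINEAR_WEIGHTS))
# 1860 1710 -> under linear weights the gap shrinks to under 10%
```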
With linear weighting, Fig. 6 shows that I3* no longer captures the number of citations but becomes a size indicator (correlated with NPub more than with NCit). The top-1% papers, for example, are now given a relative value of six instead of one hundred; thus, highly skewed citation frequencies no longer play a strongly differentiating role in the assessment across higher- and lower-ranked percentile classes.
Table 12 shows the comparison between using the six percentile ranks of Bornmann and Mutz (2011) and using the quantile values 99, 89, 50, and 1 for the four classes (Leydesdorff and Bornmann 2011). The position of NCit is similarly changed in both cases; the coordinate values of NCit—boldfaced in Table 12—are slightly lower. Correspondingly, the Pearson correlation of NCit and I3* declines further, from .624 to .613 (p < .01).
Recall the similar effect for the social sciences (SSCI compared with SCI), particularly when we focused on the 83 journals in the LIS category. In that case the reason was a difference in the data; in the general case the reason is the (mis)specification of a model that does not give appropriate attention to the skew of the distribution.
Summary and conclusions
We argue in this paper that an indicator can be developed that reflects both impact and output, and that combines the two dimensions of publications and citations into a single measure by using nonparametric statistics. The generic Integrated Impact Indicator I3 is a sum of weighted publication numbers in different percentile classes. The indicator can be used very flexibly with a range of percentile classes and weights. Depending on the chosen parameters, I3 can be made more output- or more impact-oriented. In this study, we introduced I3* = I3(99-100, 90-10, 50-2, 0-1), which categorizes and weights papers in the higher citation-impact range in a more informed way, given the skew of the distribution, than the indicator proposed by Bornmann and Mutz (2011) and the quantile-based approach elaborated by Leydesdorff and Bornmann (2011).
I3* can be size-normalized by dividing its value by the number of publications, yielding a secondary indicator that expresses the expected contribution made by a single paper given the journal's characteristics. The size-dependent indicator (I3*) and the size-normalized indicator (I3*/N) can be considered as relating to two nearly orthogonal axes. When we consider the relationship between conventional journal indicators and these new indicators, we see that I3* correlates strongly with both the total number of citations and the total number of publications, whereas I3*/N correlates with size-independent indicators such as the JIF.
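As a worked sketch of the size normalization (the class counts of the hypothetical journal below are assumptions; the weights are those of I3*):

```python
# I3* weights for the (top-1%, top-10%, top-50%, bottom-50%) classes
WEIGHTS = (100, 10, 2, 1)
counts = (2, 18, 80, 100)  # hypothetical journal; class counts sum to N = 200

i3_star = sum(n * w for n, w in zip(counts, WEIGHTS))  # 200 + 180 + 160 + 100 = 640
n_pub = sum(counts)
i3_star_per_paper = i3_star / n_pub  # expected contribution per paper

print(i3_star, i3_star_per_paper)  # 640 3.2
```

A single division thus turns the size-dependent value (640) into a size-normalized value (3.2) that can be compared across journals of different sizes.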
The journal impact factor developed by Garfield and Sher (1963) was originally intended as a journal statistic of value to publishers and librarians for portfolio management. It was not intended for research evaluation, but it has increasingly been employed for this purpose and mistakenly used as a benchmark for individual researchers and their research output. An average citation rate over two (JIF2) or five (JIF5) years is not representative of the journal as a whole. The JIF can be used as one indicator of the reputation or status of a journal, subject to appropriate contextual considerations, but it cannot be used as an impact value for single papers (Pendlebury and Adams 2012; Bornmann and Williams 2017; Leydesdorff et al. 2016a, b; cf. Waltman and Traag 2017).
Can the I3* indicator be compared with the h-index? Only to the extent that both indicators combine the measurement of output and impact into a single number. However, the h-index is mathematically inconsistent; it overrides discipline-specific cultural and other considerations, and observed values cannot be tested systematically against expected ones. By contrast, I3* can be analyzed using various statistical tests or power analysis, depending on the context in which one wishes to use the indicator. Furthermore, I3* does not provide only a single value, as the h-index does, but also gives four reference values with performance information in the different impact classes. This information can be compared with expected values and between different publication sets (e.g., of two or more institutions). Thus, I3* can be used as a single number (e.g., for policy purposes), but it can also be decomposed into the contributions of the percentile-rank classes (e.g., the top-10% group). Importantly, one is able to specify error terms on a statistical basis.
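For instance, the distribution of a journal's papers over the four percentile classes can be tested against an expected reference distribution with a chi-squared statistic. The counts below are hypothetical; the statistic itself is standard:

```python
def chi_square_stat(observed, expected):
    """Chi-squared statistic over the percentile classes (df = classes - 1)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts in (top-1%, top-10%, top-50%, bottom-50%) for 1000 papers
observed = [20, 120, 400, 460]   # journal under study
expected = [10, 90, 400, 500]    # e.g., counts expected from the reference set

stat = chi_square_stat(observed, expected)
print(round(stat, 1))  # 23.2; compare with the chi-squared distribution, df = 3
```

With three degrees of freedom, a value of this size would indicate that the journal's class distribution deviates significantly from the expectation.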
The versatility of I3* is illustrated in an Excel spreadsheet containing a template for the computation at https://www.leydesdorff.net/i3/template.xlsx. The P_{top10%} and PP_{top10%} indicators have become established as quasi-standard indicators in professional bibliometrics, especially when research institutions are compared (Waltman et al. 2012). The use of these percentile-based indicators is recommended, for instance, in the Leiden Manifesto, which lists ten guiding principles for research evaluation (Hicks et al. 2015).^{Footnote 8} It is an advantage of the I3* indicator, itself percentile-based, that it integrates the top-1% with the top-10% information and combines both with information about the other percentile classes. I3* thus provides a broader picture than P_{top10%} and PP_{top10%}.
The almost weekly invention of a new h-type indicator signals that many innovative analysts are not aware of a central problem with bibliometric data, shared with other forms of collected data: indicators necessarily generate error both in source measurement and through analytical methodology (Leydesdorff et al. 2016b, pp. 2144f.). Consequently, one should not underestimate the need to elaborate, test, and report on algorithms and their analytics, both empirically and statistically. Elegance on purely mathematical (that is, a priori) grounds is not a sufficient condition for claiming scientometric utility (Ye and Leydesdorff 2014).
Perspectives for further research
The convergent validity of different (field-normalized) indicators can be investigated by comparing the indicators with assessments by peers (Bornmann et al. 2019). Peer assessments of papers published in the biomedical area are available in the F1000Prime database (see https://f1000.com/prime). High correlations between quantitative and qualitative assessments signal the convergent validity of bibliometric indicators, which should be preferred in the practice of research evaluation. Bornmann and Leydesdorff (2013) correlated different indicators with the peer assessments provided in the F1000Prime database. The results showed, for instance, that “Percentile in Subject Area achieves the highest correlation with F1000 ratings” (p. 286). In a follow-up study, the I3* indicators are investigated with a similar design to determine whether these new indicators also have convergent validity (Bornmann et al. 2019).
Notes
A journal that publishes many items that do not report substantive research, but nonetheless attract citations, can inflate its JIF (Moed and van Leeuwen 1996).
Analogously, the minimal I3* which PLOS One 2014 could reach is 30,042; all publications would in this case belong to the bottom-50% class and thus be weighted with one (0 * 100 + 0 * 10 + 0 * 2 + 30,042 * 1 = 30,042).
We also checked for oblique rotation, but the results are very similar.
A nonparametric alternative would be to use multidimensional scaling (MDS, Schiffman et al. 1981).
Cohen’s h tests proportions against each other for each row using h = 2(arcsin√p_{obs} − arcsin√p_{exp}) (Cohen 1988, pp. 180ff.), whereas Cohen’s w first sums over the rows and then takes the square root (Cohen 1988, pp. 216f.): \(w = \sqrt{\sum_{i=1}^{m} \frac{(p_{\text{observed}} - p_{\text{expected}})^{2}}{p_{\text{expected}}}}\).
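The two effect sizes in this note can be written directly from the formulas; the observed and expected shares over the four percentile classes below are hypothetical:

```python
import math

def cohens_h(p_obs, p_exp):
    """Cohen's h: difference between arcsine-transformed proportions (per class)."""
    return 2 * (math.asin(math.sqrt(p_obs)) - math.asin(math.sqrt(p_exp)))

def cohens_w(p_obs, p_exp):
    """Cohen's w: root of the summed normalized squared deviations."""
    return math.sqrt(sum((o - e) ** 2 / e for o, e in zip(p_obs, p_exp)))

# Hypothetical shares over the (top-1%, top-10%, top-50%, bottom-50%) classes
observed = [0.02, 0.12, 0.40, 0.46]
expected = [0.01, 0.09, 0.40, 0.50]

h_top1 = cohens_h(observed[0], expected[0])  # per-class effect size
w = cohens_w(observed, expected)             # overall effect size, roughly 0.15
```

Note that h is computed per class, whereas w aggregates the deviations over all classes before taking the square root.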
As explained above, I3(90-1) is the notation for P_{top10%}, whereas PP_{top10%} can be written as I3(90-1)/N.
References
Ahlgren, P., Persson, O., & Rousseau, R. (2014). An approach for efficient online identification of the top-k percent most cited documents in large sets of Web of Science documents. ISSI Newsletter, 10(4), 81–89.
Alberts, B. (2013). Impact factor distortions. Science, 340(6134), 787.
Antonoyiannakis, M. (2018). Impact factors and the central limit theorem: Why citation averages are scale dependent. Journal of Informetrics, 12(4), 1072–1088.
Archambault, É., & Larivière, V. (2009). History of the journal impact factor: Contingencies and consequences. Scientometrics, 79(3), 635–649.
Bensman, S. J. (2007). Garfield and the impact factor. Annual Review of Information Science and Technology, 41(1), 93–155.
Bornmann, L. (2014). How are excellent (highly cited) papers defined in bibliometrics? A quantitative analysis of the literature. Research Evaluation, 23(2), 166–173.
Bornmann, L., De Moya Anegón, F., & Leydesdorff, L. (2012). The new excellence indicator in the World Report of the SCImago Institutions Rankings 2011. Journal of Informetrics, 6(2), 333–335. https://doi.org/10.1016/j.joi.2011.11.006.
Bornmann, L., & Leydesdorff, L. (2013). The validation of (advanced) bibliometric indicators through peer assessments: A comparative study using data from InCites and F1000. Journal of Informetrics, 7(2), 286–291. https://doi.org/10.1016/j.joi.2012.12.003.
Bornmann, L., & Mutz, R. (2011). Further steps towards an ideal method of measuring citation performance: The avoidance of citation (ratio) averages in field-normalization. Journal of Informetrics, 5(1), 228–230.
Bornmann, L., Mutz, R., & Daniel, H.-D. (2008). Are there better indices for evaluation purposes than the h index? A comparison of nine different variants of the h index using data from biomedicine. Journal of the American Society for Information Science and Technology, 59(5), 830–837. https://doi.org/10.1002/asi.20806.
Bornmann, L., Mutz, R., Hug, S. E., & Daniel, H.-D. (2011a). A multilevel meta-analysis of studies reporting correlations between the h index and 37 different h index variants. Journal of Informetrics, 5(3), 346–359.
Bornmann, L., Mutz, R., Marx, W., Schier, H., & Daniel, H.-D. (2011b). A multilevel modelling approach to investigating the predictive validity of editorial decisions: Do the editors of a high profile journal select manuscripts that are highly cited after publication? Journal of the Royal Statistical Society: Series A (Statistics in Society), 174(4), 857–879.
Bornmann, L., Tekles, A., & Leydesdorff, L. (2019). How well does I3 perform for impact measurement compared to other bibliometric indicators? The convergent validity of several (field-normalized) indicators. Scientometrics. https://doi.org/10.1007/s11192-019-03071-6.
Bornmann, L., & Williams, R. (2017). Can the journal impact factor be used as a criterion for the selection of junior researchers? A large-scale empirical study based on ResearcherID data. Journal of Informetrics, 11(3), 788–799. https://doi.org/10.1016/j.joi.2017.06.001.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Egghe, L. (2008). Mathematical theory of the h- and g-index in case of fractional counting of authorship. Journal of the American Society for Information Science and Technology, 59(10), 1608–1616.
Egghe, L., & Rousseau, R. (1990). Introduction to informetrics. Amsterdam: Elsevier.
Frandsen, T. F., & Rousseau, R. (2005). Article impact calculated over arbitrary periods. Journal of the American Society for Information Science and Technology, 56(1), 58–62.
Garfield, E. (1955). Citation indexes for science: A new dimension in documentation through association of ideas. Science, 122(3159), 108–111.
Garfield, E. (1972). Citation analysis as a tool in journal evaluation. Science, 178(4060), 471–479.
Garfield, E. (1979). Is citation analysis a legitimate evaluation tool? Scientometrics, 1(4), 359–375.
Garfield, E. (2003). The meaning of the impact factor. Revista Internacional de Psicologia Clinica y de la Salud, 3(2), 363–369.
Garfield, E. (2006). The history and meaning of the journal impact factor. JAMA, 295(1), 90–93.
Garfield, E., & Sher, I. H. (1963). New factors in the evaluation of scientific literature through citation indexing. American Documentation, 14(3), 195–201.
Gross, P. L. K., & Gross, E. M. (1927). College libraries and chemical education. Science, 66(1713), 385–389.
Hicks, D., Wouters, P., Waltman, L., de Rijcke, S., & Rafols, I. (2015). Bibliometrics: The Leiden Manifesto for research metrics. Nature, 520(7548), 429–431.
Hirsch, J. E. (2005). An index to quantify an individual’s scientific research output. Proceedings of the National Academy of Sciences of the USA, 102(46), 16569–16572.
Jacsó, P. (2009). Fiveyear impact factor data in the journal citation reports. Online Information Review, 33(3), 603–614.
Kreft, G. G., & de Leeuw, E. (1988). The seesaw effect: A multilevel problem? Quality & Quantity, 22(2), 127–137.
Leydesdorff, L., & Bornmann, L. (2011). Integrated impact indicators compared with impact factors: An alternative research design with policy implications. Journal of the American Society for Information Science and Technology, 62(11), 2133–2146. https://doi.org/10.1002/asi.21609.
Leydesdorff, L., Bornmann, L., Comins, J., & Milojević, S. (2016a). Citations: Indicators of quality? The impact fallacy. Frontiers in Research Metrics and Analytics. https://doi.org/10.3389/frma.2016.00001.
Leydesdorff, L., Bornmann, L., & Mingers, J. (2019). Statistical significance and effect sizes of differences among research universities at the level of nations and worldwide based on the Leiden rankings. Journal of the Association for Information Science and Technology, 70(5), 509–525. https://doi.org/10.1002/asi.24130.
Leydesdorff, L., Bornmann, L., Mutz, R., & Opthof, T. (2011). Turning the tables on citation analysis one more time: Principles for comparing sets of documents. Journal of the American Society for Information Science and Technology, 62(7), 1370–1381. https://doi.org/10.1002/asi.21534.
Leydesdorff, L., Wagner, C., & Bornmann, L. (2018). Discontinuities in citation relations among journals: Self-organized criticality as a model of scientific revolutions and change. Scientometrics, 116(1), 623–644. https://doi.org/10.1007/s11192-018-2734-6.
Leydesdorff, L., Wouters, P., & Bornmann, L. (2016b). Professional and citizen bibliometrics: Complementarities and ambivalences in the development and use of indicators—A state-of-the-art report. Scientometrics, 109(3), 2129–2150. https://doi.org/10.1007/s11192-016-2150-8.
Marchant, T. (2009). An axiomatic characterization of the ranking based on the h-index and some other bibliometric rankings of authors. Scientometrics, 80(2), 325–342.
Martyn, J., & Gilchrist, A. (1968). An evaluation of British scientific journals. London: Aslib.
McAllister, P. R., Narin, F., & Corrigan, J. G. (1983). Programmatic evaluation and comparison based on standardized citation scores. IEEE Transactions on Engineering Management, 30(4), 205–211.
Moed, H. F., & Van Leeuwen, T. N. (1996). Impact factors can mislead. Nature, 381(6579), 186.
Narin, F. (1976). Evaluative bibliometrics: The use of publication and citation analysis in the evaluation of scientific activity. Washington, DC: National Science Foundation.
Narin, F. (1987). Bibliometric techniques in the evaluation of research programs. Science and Public Policy, 14(2), 99–106.
Pendlebury, D. A., & Adams, J. (2012). Comments on a critique of the Thomson Reuters journal impact factor. Scientometrics, 92, 395–401. https://doi.org/10.1007/s11192-012-0689-6.
Price, D. J. (1970). Citation measures of hard science, soft science, technology, and non-science. In C. E. Nelson & D. K. Pollock (Eds.), Communication among scientists and engineers (pp. 3–22). Lexington, MA: Heath.
Robinson, W. S. (1950). Ecological correlations and the behavior of individuals. American Sociological Review, 15(3), 351–357.
Schiffman, S. S., Reynolds, M. L., & Young, F. W. (1981). Introduction to multidimensional scaling: Theory, methods, and applications. New York: Academic Press.
Schneider, J. W. (2013). Caveats for using statistical significance tests in research assessments. Journal of Informetrics, 7(1), 50–62.
Seglen, P. O. (1992). The skewness of science. Journal of the American Society for Information Science, 43(9), 628–638.
Seglen, P. O. (1997). Why the impact factor of journals should not be used for evaluating research. British Medical Journal, 314, 498–502.
Sher, I. H., & Garfield, E. (1965). New tools for improving and evaluating the effectiveness of research. Paper presented at the Second conference on Research Program Effectiveness, July 27–29, Washington, DC.
Sheskin, D. J. (2011). Handbook of parametric and nonparametric statistical procedures (5th ed.). Boca Raton, FL: Chapman & Hall/CRC.
Tijssen, R. J. W., Visser, M. S., & Van Leeuwen, T. N. (2002). Benchmarking international scientific excellence: Are highly cited research papers an appropriate frame of reference? Scientometrics, 54(3), 381–397.
Waltman, L., Calero-Medina, C., Kosten, J., Noyons, E., Tijssen, R. J., van Eck, N. J., et al. (2012). The Leiden ranking 2011/2012: Data collection, indicators, and interpretation. Journal of the American Society for Information Science and Technology, 63(12), 2419–2432.
Waltman, L., & Schreiber, M. (2013). On the calculation of percentile-based bibliometric indicators. Journal of the American Society for Information Science and Technology, 64(2), 372–379.
Waltman, L., & Traag, V. A. (2017). Use of the journal impact factor for assessing individual articles need not be wrong. arXiv preprint arXiv:1703.02334.
Waltman, L., & Van Eck, N. J. (2012). The inconsistency of the h-index. Journal of the American Society for Information Science and Technology, 63(2), 406–415.
Wasserstein, R. L., & Lazar, N. A. (2016). The ASA’s statement on pvalues: context, process, and purpose. The American Statistician, 70(2), 129–133.
Williams, R., & Bornmann, L. (2014). The substantive and practical significance of citation impact differences between institutions: Guidelines for the analysis of percentiles using effect sizes and confidence intervals. In Y. Ding, R. Rousseau, & D. Wolfram (Eds.), Measuring scholarly impact: Methods and practice (pp. 259–281). Heidelberg: Springer.
Ye, F. Y., Bornmann, L., & Leydesdorff, L. (2017). h-based I3-type multivariate vectors: Multidimensional indicators of publication and citation scores. COLLNET Journal of Scientometrics and Information Management, 11(1), 153–171.
Ye, F. Y., & Leydesdorff, L. (2014). The “Academic Trace” of the performance matrix: A mathematical synthesis of the h-index and the Integrated Impact Indicator (I3). Journal of the Association for Information Science and Technology, 65(4), 742–750. https://doi.org/10.1002/asi.23075.
Acknowledgements
The bibliometric data used in this paper are from an inhouse database developed and maintained in collaboration with the Max Planck Digital Library (MPDL, Munich) of the Max Planck Society, and derived from the Science Citation Index Expanded (SCIE), the Social Sciences Citation Index (SSCI), and the Arts and Humanities Citation Index (AHCI) prepared by Clarivate Analytics (Philadelphia, Pennsylvania, USA). We are also grateful to ISI/Clarivate Analytics for providing one of us with JCR data.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Leydesdorff, L., Bornmann, L., & Adams, J. (2019). The integrated impact indicator revisited (I3*): A nonparametric alternative to the journal impact factor. Scientometrics, 119, 1669–1694. https://doi.org/10.1007/s11192-019-03099-8.
Keywords
 Journal indicator
 Percentile
 Citation analysis
 I3*
 Journal impact factor