How well does I3 perform for impact measurement compared to other bibliometric indicators? The convergent validity of several (field-normalized) indicators

Recently, the integrated impact indicator (I3) was introduced, in which citations are weighted in accordance with the percentile rank class of each publication in a set of publications. I3 can also be used as a field-normalized indicator. Field-normalization is common practice in bibliometrics, especially when institutions and countries are compared. Publication and citation practices are so different among fields that citation impact is normalized for cross-field comparisons. In this study, we test the ability of I3 to discriminate between quality levels of papers as defined by Faculty members at F1000Prime. F1000Prime is a post-publication peer review system for assessing papers in the biomedical area. Thus, we test the convergent validity of I3 (in this study, we test I3/N, the size-independent variant of I3 where I3 is divided by the number of papers) using assessments by peers as baseline and compare its validity with several other (field-normalized) indicators: the mean-normalized citation score (MNCS), relative-citation ratio (RCR), citation score normalized by cited references (CSNCR), characteristic scores and scales (CSS), source-normalized citation score (SNCS), citation percentile, and proportion of papers which belong to the x% most frequently cited papers (PPtop x%). The results show that the PPtop 1% indicator discriminates best among different quality levels. I3 performs similarly to (slightly better than) most of the other field-normalized indicators. Thus, the results suggest that the indicator could be a valuable alternative to other indicators in bibliometrics.


Introduction
In the application of citation analysis in research evaluation, one may need to compare the citation impact of publications from different fields. Rather than relying on raw citation counts from the Web of Science (WoS, Clarivate Analytics) or Scopus (Elsevier) databases, professional bibliometricians take into account differences in publication and citation cultures among fields of science (e.g., concerning the speed and frequency of citations) and use methods to assess the citation impact of focal papers against the impact of all other papers in the same field and publication year (McAllister, Narin, & Corrigan, 1983; Narin, 1981; Wang, Song, & Barabási, 2013). Field delineation, however, is not an easy task (e.g., Klavans & Boyack, 2017; Leydesdorff, 2006).

Various indicators (approaches) have been introduced in bibliometrics since the early 1980s to construct field-normalized scores. According to Waltman (2016) "the idea of these indicators is to correct as much as possible for the effect of variables that one does not want to influence the outcomes of a citation analysis, such as the field, the year, and the document type of a publication" (p. 375). The necessity to normalize citation impact for cross-field comparisons is also one of the ten principles for research evaluation formulated in the Leiden Manifesto (Hicks, Wouters, Waltman, de Rijcke, & Rafols, 2015). Leydesdorff and Bornmann (2011b) introduced the integrated impact indicator (I3) where citations are weighted in accordance with the percentile rank class of each publication in a set of publications (e.g., published by a researcher or research group). Percentiles are a priori field-normalized: one can compare the top-1% for different reference sets. Although several publications appearing afterwards have dealt with the indicator (Leydesdorff & Bornmann, 2012; Rousseau, 2012; Wagner & Leydesdorff, 2012; Ye, Bornmann, & Leydesdorff, 2017), a comparison with other (field-normalized) indicators has not yet been done. In this study, we undertake this comparison by investigating the convergent validity of the indicator. In psychometrics, convergent validity tests whether measurements which are assumed to be related (here: assessments by peers and citation impacts) are actually related: we are interested in the question of how I3 discriminates between papers having received different quality scores by peers compared to various other indicators. We received a dataset from F1000Prime (see https://f1000.com/prime) including the bibliographic information of papers published in the biomedical area and their quality scores by peers. We use these scores as a benchmark for testing the indicators (Garfield, 1979).

Normalization of citation impact in bibliometrics
In this section, we discuss the various field-normalized indicators which are used for the comparison with the I3 indicator: mean-normalized citation score (MNCS), relative-citation ratio (RCR), citation score normalized by cited references (CSNCR), characteristic scores and scales (CSS), source-normalized citation score (SNCS), citation percentile, and proportion of papers which belong to the x% most frequently cited papers (PPtop x%). More comprehensive overviews of methods for normalizing citations can be found in Mingers and Leydesdorff (2015), Waltman (2016), and Bornmann (in press). I3 is explained in section 2.7 after all other indicators have been explained, since the I3 variant used in this study is based on other field-normalizing approaches.
One can distinguish between field-normalization and statistical normalization: each indicator assumes some form of reference sets (field-normalization) and some form of comparison-strategy (statistical normalization). The indicators compared in this study vary with respect to both these aspects: different reference sets (e.g., papers published in the same subject category or co-cited papers) and different strategies to compare the focal papers to these reference sets (e.g., comparing values in relation to the mean or generating nonparametric percentiles for the comparison). Most of the variance among the indicators selected for this study is a result of the statistical normalization. However, there is always (at least some) variance among the indicators with respect to the field categorizations (i.e., the indicators have not been calculated by using one single categorization scheme). Most of the indicators in this study have been calculated based on WoS subject categories (WCs). However, RCR and the citing-side indicators do not rely on these categories, but on co-cited papers and papers published in the same journal or paper.
The use of WCs for field-normalization has been criticized as imprecise in terms of its analytical basis. WCs are attributed to journals (and not to individual papers) and journals are not homogeneous in terms of the disciplines of papers published in them (Leydesdorff & Bornmann, 2016). Although other field-categorisation schemes have been proposed for the normalization of citation impact, such as algorithmically constructed classification systems (Ruiz-Castillo & Waltman, 2015) or expert-based field categorisations (Bornmann, Marx, & Barth, 2013), "the WoS journal subject categories are the most commonly used field classification system for normalisation purposes" (Wouters et al., 2015, p. 18

Mean-normalized citation score (MNCS)
Based on early proposals by Schubert and Braun (1986), Moed, Burger, Frankfort, and van Raan (1985) used field-normalization based on the WCs in the so-called "crown indicator" of the Leiden Centre for Science and Technology Studies (CWTS). Opthof and Leydesdorff (2010) note that the statistical normalization in the definition of the crown indicator was erroneous (see Lundberg, 2007). Given the order of operations, one should first multiply and divide and only thereafter sum and subtract (Gingras & Larivière, 2011; Leydesdorff & Opthof, 2018). The normalization can then be formulated as follows:

\[ \mathrm{MNCS} = \frac{1}{n} \sum_{i=1}^{n} \frac{c_i}{e_i} \]

where c_i is the number of citations of paper i, e_i is the expected (mean) citation rate of the reference set of paper i, and n is the number of papers in the set. In a response, Waltman, van Eck, van Leeuwen, Visser, and van Raan (2011) proposed to use this "mean-normalized citation score" (MNCS) with field-normalization by defining the mean in the denominator of each paper in terms of the WCs attributed to the respective journals. MNCS is currently a frequently used field-normalized indicator (Purkayastha, Palmaro, Falk-Krzesinski, & Baas, 2018). It is calculated by dividing the citations of a paper in question by the average citation rate of the papers that were published in the same subject category (and publication year).
Two normalizations are thus involved: (1) normalization relative to the mean and (2) normalization in terms of WCs. MNCS, however, can also be used with classification schemes other than WCs. The first assumption, that the mean of the citation rate of the papers in the sample can be considered as an expected value, is not valid. The citation distributions are always skewed and thus non-normal. (The Central Limit Theorem is only valid for much larger samples.) At the time (2011), we proposed the use of percentile classes instead (Leydesdorff, Bornmann, Mutz, & Opthof, 2011).
A further complication arises when a paper is published in a journal that belongs to more than a single subject category. MNCS can then be calculated with reference to different sets, e.g., by using "fractional counting" (Smolinsky, 2016;Waltman et al., 2011). In this study, the average is calculated over the MNCSs in the case of multiple categories. The impact of different publication sets can then be compared by using the mean of the MNCSs.
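The calculation described in this section can be illustrated with a minimal Python sketch. It is not the implementation used for this study: the input structure (dictionaries with the hypothetical keys "id", "year", "citations", and "categories") is an assumption, and the sketch assumes that a paper assigned to several subject categories counts fully in each reference set before its category-specific scores are averaged.

```python
from collections import defaultdict
from statistics import mean


def mncs_per_paper(papers):
    """Return a dict mapping paper id to its mean-normalized citation score.

    papers: list of dicts with the (hypothetical) keys
    'id', 'year', 'citations', and 'categories'.
    """
    # Reference sets: all papers sharing a subject category and publication year.
    reference_sets = defaultdict(list)
    for paper in papers:
        for category in paper["categories"]:
            reference_sets[(category, paper["year"])].append(paper["citations"])

    # Expected value of each reference set: its mean citation rate.
    expected = {key: mean(values) for key, values in reference_sets.items()}

    scores = {}
    for paper in papers:
        # One ratio per subject category; reference sets with mean 0 are skipped.
        ratios = [
            paper["citations"] / expected[(category, paper["year"])]
            for category in paper["categories"]
            if expected[(category, paper["year"])] > 0
        ]
        # Average over the category-specific scores in the case of multiple categories.
        scores[paper["id"]] = mean(ratios) if ratios else None
    return scores


def mncs_of_set(scores):
    """Mean of the paper-level scores of a publication set (papers without a score are ignored)."""
    return mean(score for score in scores.values() if score is not None)
```

The impact of two publication sets could then be compared by calling mncs_of_set on the paper-level scores of each set.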

Relative-citation ratio (RCR)
Hutchins, Yuan, Anderson, and Santangelo (2016) proposed the Relative Citation Ratio (RCR) as a new field-normalized impact indicator. The indicator is designed similarly to MNCS: it is a quotient of the focal paper's citation counts and the expected number of citations in the reference set. The RCR differs from the MNCS in that the expected value (and thus the reference set) is based on co-citations: the papers co-cited with the focal paper are considered to represent a more precise reference set at the paper level than WCs, which are attributed at the journal level. In bibliometrics, co-citations are frequently used similarity measures which are based on citation relations. An overview of research on the RCR can be found in Lindner, Torralba, and Khan (2018).

Citation score normalized by cited references (CSNCR)
Bornmann and Haunschild (2016) introduced the field-normalized indicator "citation score normalized by cited references" (CSNCR) which is closely related to the MNCS. The indicator is rooted in early suggestions by Garfield (1979) that "the most accurate measure of citation potential is the average number of references per paper published in a given field".
The CSNCR is defined as follows: the citations of a focal paper are divided by the mean number of cited references in a subject category. The theoretical analysis of the CSNCR by Bornmann and Haunschild (2016) demonstrated that the indicator has the properties of consistency and homogeneous normalization. The authors' empirical comparison of the CSNCR with other field-normalized indicators revealed that it is as suitable as other field-normalized indicators to normalize citations.
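The definition given above can be written as a very small Python sketch. It only mirrors the one-sentence definition in this section; any further restrictions in the original proposal (e.g., regarding publication years or the kind of references counted) are not modeled, and the function and parameter names are hypothetical.

```python
from statistics import mean


def csncr(focal_citations, reference_counts_in_category):
    """Citations of the focal paper divided by the mean number of
    cited references of the papers in its subject category."""
    average_references = mean(reference_counts_in_category)
    if average_references == 0:
        raise ValueError("mean number of cited references must be positive")
    return focal_citations / average_references
```

For a focal paper with 30 citations in a subject category whose papers contain 20, 40, and 60 cited references, the sketch divides 30 by the category mean of 40.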

Characteristic scores and scales (CSS)
The characteristic scores and scales (CSS) method by Glänzel and Schubert (1988) for normalizing citation data is one of the earliest proposed field-normalization approaches. The CSS method classifies the publications in reference sets (subject categories) as follows: "characteristic scores are obtained from iteratively truncating a distribution according to conditional mean values from the low end up to the high end. In particular, the scores bk (k > 0) are obtained from iteratively truncating samples at their mean value and recalculating the mean of the truncated sample until the procedure is stopped or no new scores are obtained" (Glänzel, 2013, p. 111). In many studies based on this method, four impact classes are used to group the papers in reference sets (see Glänzel, Thijs, & Debackere, 2014): 1. poorly cited (papers with fewer citations than b1), 2. fairly cited (papers with at least b1 but fewer than b2 citations), 3. remarkably cited (papers with at least b2 but fewer than b3 citations), and 4. outstandingly cited (papers with at least b3 citations).
In the in-house database of the Max Planck Society (MPG), all papers in each reference set published since 1980 are classified following the CSS method.
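The iterative truncation can be illustrated with a minimal Python sketch. It assumes that truncation keeps the papers with at least the current mean number of citations and that three thresholds (b1 to b3) define the four classes; the function names are hypothetical and the sketch is not the procedure implemented in the in-house database.

```python
from statistics import mean


def css_thresholds(citations, n_thresholds=3):
    """Characteristic scores b1, b2, ... obtained by iterative truncation at the mean."""
    thresholds = []
    sample = list(citations)
    for _ in range(n_thresholds):
        if not sample:
            break
        score = mean(sample)
        if thresholds and score == thresholds[-1]:
            break  # no new score is obtained, so the procedure stops
        thresholds.append(score)
        # Truncate: keep only the papers at or above the current mean.
        sample = [c for c in sample if c >= score]
    return thresholds


def css_class(citation_count, thresholds):
    """Class 1 (poorly cited) up to class len(thresholds) + 1 (outstandingly cited)."""
    return 1 + sum(1 for score in thresholds if citation_count >= score)
```

The same two functions can be applied to any metric, for example to the summed peer scores of a paper set instead of citation counts.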

Citing-side normalization of citation impact
Citations are attributed to papers on the cited side by the indicators mentioned above. Zitt and Small (2008) first introduced the idea of normalizing citation impact on the citing side. The authors proposed a modification of the journal impact factor (JIF) by fractional citation weighting. Citing-side normalization is also named source normalization, fractional citation weighting, fractional counting of citations, or a priori normalization (Waltman & van Eck, 2013a). The method can be used not only for journals, as initially proposed by Zitt and Small (2008), but also for any other publication sets (Moed, 2010). Citing-side normalization considers the environment of a given citation (Leydesdorff & Bornmann, 2011a; Leydesdorff, Radicchi, Bornmann, Castellano, & de Nooy, 2013): the citation is weighted depending on its environment. A citation from a subject category with papers containing long reference lists (e.g., bio-medicine) receives a lower weighting than a citation from a subject category with, on average, only a few cited references.
For citing-side normalization, the number of references of the citing paper is usually used to weight a specific citation (Waltman & van Eck, 2013b). The assumption is that this number of references reflects the typical number in the field (subject categories) of the citing paper. However, this assumption cannot always be made. For this reason, an average number of references is calculated (and used as weighting factor) which includes other papers appearing in a journal alongside the citing paper. In this study, we consider three variants of citing-side normalization, which are explained by Waltman and van Eck (2013b) in more detail.

Variant 1:
The first variant is the SNCS1 (source normalized citation score) indicator:

\[ \mathrm{SNCS1} = \sum_{i=1}^{c} \frac{1}{a_i} \]

In the formula, c is the number of citations of the focal paper and a_i is the average number of linked references in those papers which appeared in the same journal and in the same publication year as the citing paper i. "Linked references" are references to papers in journals covered by the WoS. The reduction to linked references (instead of using all references) is intended to prevent subject categories of citing publications not indexed in WoS from being disadvantaged.

Variant 2:
For the second variant, SNCS2, each citation of a paper is divided by the number of linked references in the citing publication, r_i:

\[ \mathrm{SNCS2} = \sum_{i=1}^{c} \frac{1}{r_i} \]

The difference to SNCS1 is that SNCS2 focusses on the linked references in the citing paper and not on the journal of the citing paper. The selection of the reference publication years is done analogously to the SNCS1.

Variant 3:
SNCS3 combines SNCS1 and SNCS2:

\[ \mathrm{SNCS3} = \sum_{i=1}^{c} \frac{1}{p_i r_i} \]

r_i is defined as in SNCS2. p_i is the share of papers containing at least one linked reference among the papers in the same journal and publication year as the citing publication i. The selection of the reference publication years follows the same procedure as for the SNCS1 and SNCS2.
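The three variants can be illustrated with a minimal Python sketch. It assumes that each citation contributes the weights 1/a_i, 1/r_i, and 1/(p_i * r_i), respectively, following the descriptions above, and that the values a_i, r_i, and p_i have already been determined for every citing paper. The dictionary keys "a", "r", and "p" are hypothetical names chosen for this example.

```python
def sncs1(citing_papers):
    """Each citation is weighted by 1 / a_i (journal-level average of linked references)."""
    return sum(1.0 / paper["a"] for paper in citing_papers)


def sncs2(citing_papers):
    """Each citation is weighted by 1 / r_i (linked references in the citing paper itself)."""
    return sum(1.0 / paper["r"] for paper in citing_papers)


def sncs3(citing_papers):
    """Each citation is weighted by 1 / (p_i * r_i), combining both perspectives."""
    return sum(1.0 / (paper["p"] * paper["r"]) for paper in citing_papers)


# Hypothetical focal paper with two citing papers.
example_citing_papers = [
    {"a": 20.0, "r": 10, "p": 0.5},   # citing paper with a short reference list
    {"a": 40.0, "r": 40, "p": 1.0},   # citing paper with a long reference list
]
```

In the example, the citation from the paper with the shorter reference list receives the larger weight in all three variants, which is the intended effect of citing-side normalization.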

Percentile-based indicators

Citation impact percentiles
The distribution of citation data is usually very skewed with only a few papers being highly-cited (Seglen, 1992). Since the arithmetic mean is not appropriate as a measure of the central tendency of such skewed distributions, percentiles offer a non-parametric alternative. Citation impact percentiles can be calculated with various procedures (see the overview in Bornmann, Leydesdorff, & Mutz, 2013). In the current study, two approaches were used which are frequently applied in evaluative bibliometrics. For both approaches, all papers in the reference sets are ranked in decreasing or increasing order by their citation counts (i), and the number of publications in the reference set (n) is determined in the first step. For the product InCites (a customised, web-based research evaluation tool based on bibliometric data from WoS), Clarivate Analytics calculates the percentiles by using (basically) the formula ([i/n] * 100). This inverted ranking will be named "InCites percentiles" in the following. However, the use of this formula may lead to a mean percentile of a reference set unequal to 50 (the median). The formula ([(i - 0.5)/n] * 100) (Hazen, 1914) does not suffer this disadvantage. We will use the abbreviation "Hazen percentile" for these percentiles in the following. Furthermore, the papers are sorted in increasing impact order for InCites percentiles, but in decreasing order for Hazen percentiles; we invert the InCites percentiles in this study by subtracting the values from 100.
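The difference between the two percentile formulas can be illustrated with a minimal Python sketch. For comparability, the sketch ranks the papers in increasing order of citations for both formulas and breaks ties arbitrarily; both choices are simplifying assumptions and do not reproduce the procedures of InCites or of the in-house database.

```python
def rank_percentiles(citations, hazen=True):
    """Percentiles in input order; rank i = 1 is assigned to the least cited paper.

    hazen=True  uses ((i - 0.5) / n) * 100
    hazen=False uses (i / n) * 100
    """
    n = len(citations)
    # Indices of the papers sorted by increasing citation count (ties in arbitrary order).
    order = sorted(range(n), key=lambda k: citations[k])
    percentiles = [0.0] * n
    for i, k in enumerate(order, start=1):
        percentiles[k] = ((i - 0.5) / n if hazen else i / n) * 100
    return percentiles


example_citations = [3, 0, 10, 7]
hazen_values = rank_percentiles(example_citations, hazen=True)
simple_values = rank_percentiles(example_citations, hazen=False)
```

For the four example papers, the Hazen percentiles average exactly 50, whereas the percentiles from the simpler formula average more than 50, which illustrates the disadvantage mentioned above.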

Proportion of papers belonging to the top-x%
Citation percentiles can be directly used for impact measurements. However, it is also very common in bibliometrics to focus on certain percentile classes (Bornmann, 2014). In this study, we include three indicators focusing on three classes: PPtop-50%, PPtop-10%, and PPtop-1%.
The indicators reveal the proportion of papers published by a unit which belong to the x% most frequently cited papers. The results of Tahamtan and Bornmann (2018) show that the PPtop-x% indicators, especially the PPtop-10% indicator, are among the earliest field-normalized indicators used in scientometrics; they were introduced by Narin (1981). In this study, we used PPtop-x% indicators which have been calculated based on two fractional counting approaches.
Papers may be equal in the rankings if the papers are sorted by citations and more than one paper has the same citation count. These ties in citations make it difficult to assign the papers unambiguously to the top-x% class or the corresponding bottom-x% class. To solve this problem, we use an approach introduced by Waltman and Schreiber (2013). They propose to fractionally assign the papers at the top-x% threshold to the top-x% and bottom-x% classes, depending on the number of papers with the same number of citations at the threshold.
The second fractional counting approach used for the indicators concerns the multiple assignment of journals to subject categories. We use the fractional counting approach by Waltman et al. (2011) to calculate the PPtop-x% indicators across multiple subject categories.
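The fractional assignment of tied papers at the threshold can be illustrated with a minimal Python sketch. It follows the idea that each paper represents a share of 1/n of the citation distribution and that tied papers share their combined interval equally; the function name and the handling of details are assumptions, and the second approach (fractional counting across multiple subject categories) is not covered.

```python
from itertools import groupby


def top_x_shares(citations, x):
    """Fraction of each paper (input order) assigned to the top-x class, with 0 < x < 1."""
    n = len(citations)
    top_n = x * n  # size of the top-x class, measured in papers
    # Indices of the papers sorted by decreasing citation count.
    order = sorted(range(n), key=lambda k: citations[k], reverse=True)
    shares = [0.0] * n
    start = 0  # number of papers with more citations than the current tie group
    for _, group in groupby(order, key=lambda k: citations[k]):
        tied = list(group)
        size = len(tied)
        # Part of the tie group's interval [start, start + size] that lies within the top class.
        overlap = max(0.0, min(top_n, start + size) - start)
        for index in tied:
            shares[index] = overlap / size
        start += size
    return shares


example_shares = top_x_shares([10, 5, 5, 5, 1, 1, 0, 0, 0, 0], x=0.2)
```

In the example with ten papers, the most cited paper belongs fully to the top-20% class and the three tied papers with five citations each belong to it with a share of one third, so that the class contains exactly two papers in total.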

I3 indicator
One of the newest (a priori field-normalized) indicators is the integrated impact indicator (I3) which is also percentile-based. It was defined as a non-parametric alternative in response to the above mentioned discussions about statistical normalization of the CWTS "crown-indicator" (Leydesdorff et al., 2011). Bornmann (2010) and Bornmann and Mutz (2011) proposed to use the weighted number of papers of units (e.g., journals or institutions) belonging to certain percentile impact classes for performance measurements. The further elaboration into I3, the integrated impact indicator, combines these proposals in a unified scheme (Leydesdorff & Bornmann, 2011b;Leydesdorff & Bornmann, 2012;Rousseau, 2012;Wagner & Leydesdorff, 2012).
In the most recent development, Leydesdorff, Bornmann, and Adams (2018) propose to use four percentile classes (top-1%, top-10%, top-50%, and bottom-50%) as weighting scheme for I3. They argued that a paper in the top-1% class can be valued ten times more than a paper in the top-10% class. It follows that a top-1% paper is weighted 100 times as much as a paper at the bottom. It is an advantage of this scheme that it takes the highly-skewed nature of citation data into account by using a logarithmic scale. Papers in the top-50% are weighted with two and papers in the bottom-50% with one. The resulting indicator correlates above .9 with the numbers of both publications and citations in empirical cases.
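The weighting scheme can be illustrated with a minimal Python sketch. It assumes that the four classes are mutually exclusive (a paper receives only the weight of the highest class it reaches), that percentiles are scaled so that higher values mean higher impact, and that the class boundaries lie at the 99th, 90th, and 50th percentile; the function names are hypothetical.

```python
# (lower percentile bound, weight) for top-1%, top-10%, and top-50%; all other papers get weight 1.
I3_WEIGHTS = ((99.0, 100), (90.0, 10), (50.0, 2))


def i3_weight(percentile):
    """Weight of a single paper according to its percentile rank class."""
    for lower_bound, weight in I3_WEIGHTS:
        if percentile >= lower_bound:
            return weight
    return 1  # bottom-50%


def i3(percentiles):
    """Size-dependent I3: sum of the weights of all papers in the set."""
    return sum(i3_weight(p) for p in percentiles)


def i3_per_paper(percentiles):
    """Size-independent variant I3/N."""
    return i3(percentiles) / len(percentiles)


example_percentiles = [99.5, 95.0, 60.0, 30.0, 10.0]
```

For the five example papers, the weights are 100, 10, 2, 1, and 1, which gives I3 = 114 and I3/N = 22.8.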

Peer ratings provided by F1000Prime
F1000Prime is a post-publication peer review system for assessing papers in the biomedical area. The selected papers for F1000Prime are rated by the Faculty members as "good", "very good", or "exceptional" which are set to the scores of 1, 2, or 3, respectively. Since many papers are assessed not only by one Faculty member but by several, we calculated the sum of the scores for this study. This accords with the F1000Prime practice of using the individual scores to calculate the total score for each paper (the total scores are then used to rank the papers in the disciplines). The assessments in the F1000 database can be used not only by scientists for receiving pointers to relevant papers in their areas, but also as a database for research evaluation purposes. According to Wouters and Costas (2012) "the data and indicators provided by F1000 are without doubt rich and valuable, and the tool has a strong potential for research evaluation, being in fact a good complement to alternative metrics for research assessments at different levels (papers, individuals, journals, etc.)" (p. 14).

Used datasets
In 2018, F1000 provided one of the authors with data on recommendations made by the Faculty members and the bibliographic information for the corresponding papers in their system (n=51,461 papers). We matched the papers with the papers in our WoS in-house database (of the Max Planck Society) using the DOI. We restricted the set to papers with the document types "article" and "review". In the statistical analyses, we included not only the field-normalized indicators explained in section 2 (with a citation window between publication year and the end of 2017), but also citation counts (1) for a three-year citation window and (2) for the variable citation window until the end of 2017. These restrictions reduce the number of papers available for the analyses (see Table 2). Most of the reduction is due to the necessity of using a minimum citation window.

Subject-specific differences in publication and citation cultures are usually revealed by differences in the mean number of citations, authors, and cited references. Table 3 shows the minimum and maximum of the mean number of citations, authors, and cited references in the 157 WCs (in addition to the minimum and maximum of the number of papers). The number of papers in the WCs ranges between 1 and 5466. As the results in Table 3 point out, the F1000Prime dataset is characterized by large differences in the mean numbers of citations, authors, and cited references. Since these results point to considerable subject-specific heterogeneity in the dataset, it might be reasonable to use the dataset for studying the validity of different methods for cross-disciplinary normalization.

Results
We included 14 (field-normalized) indicators in this study for comparing them with I3.
To have a first overview of the different (field-normalized) indicators, we calculated Spearman rank order correlations (see Table 4). As the correlation coefficients in the table reveal, most of the coefficients are on a large or (much) larger than typical level (following the guidelines by Cohen, 1988, to interpret correlation coefficients). This is also the case for the correlations between normalized and non-normalized indicators (i.e., number of citations).
The results in Table 4 might be interpreted as first hints that the differences between the indicators in measuring citation impact (field-normalized) are not very large. However, we could not include I3 in the correlation analysis, since I3 can only be used on the aggregated (group) level.
Since I3 can be used as a field-normalized indicator, we are interested in this study in how it discriminates between papers rated differently by Faculty members compared to other (field-normalized) indicators (see section 2). In other words, we are interested in its convergent validity: does the indicator discriminate worse, equal to, or better than the other indicators between the different quality levels and is thus more convergently valid to the assessment by peers than the other indicators? I3 differs from the other indicators by being calculated on the aggregated, and not on the single paper level. Thus, we need groups of papers for the comparison of I3 with other indicators.
The CSS method which we explained in section 2.4 can be used not only to field-normalize single papers, but also to group any paper set with metrics (see, e.g., Bornmann & Glänzel, 2018). Using the CSS method to group the papers in four classes (based on the sum of the F1000Prime scores per paper), we found 1396 papers (4.97%) in the class with the best scores (F1000 class 4, sum scores between 5 and 35), 3737 papers (13.32%) in the second best class (F1000 class 3, sum scores between 3 and 4), 10,334 papers (36.82%) in the next class (F1000 class 2, sum scores equal to 2), and 12,596 papers (44.88%) in the lowest class (F1000 class 1, sum scores equal to 1).
For the four groups, we calculated the arithmetic average of each indicator per group. The median would have been an alternative, but this statistic fails to properly differentiate between the groups because of ties in certain indicator values. For example, the PPtop x% indicators mostly consist of the values 0 and 1, which leads to correspondingly undifferentiated median values for the classes. We decided not to use the sum, since the results are dependent on the sample size: the more papers a group contains, the better the results that can be expected.
Although I3 was designed to reflect the output in addition to the impact dimension (as a sum score), the output dimension is not relevant for this validity study. The performance of the four F1000 classes is not dependent on the output dimension; only the impact of the single papers matters. In the usual evaluation of research groups or institutions, however, we are faced with a different situation in which both dimensions (publications and citations) are of equal interest for assessing performance.
In the case of the I3 indicator, we divided I3 by the number of papers in a group and obtained I3/N. This has already been proposed for the comparison of journal I3 scores with the Journal Impact Factor (which is a mean citation rate). For the four F1000Prime quality groups, we received the following I3/N values: F1000 class 1 = 11.68, F1000 class 2 = 14.63, F1000 class 3 = 22.66, and F1000 class 4 = 39.03. The mean values point out that I3 measures quality as expected: it discriminates validly between the four performance groups. However, does I3/N discriminate better between the groups than the other indicators (and is thus more convergently valid)? As the results in Table 5 show, all other indicators which we considered in this study are similarly able to discriminate between the four F1000 classes.

To compare the ability of the indicators to discriminate between the four F1000 classes, we calculated the so-called "Average Annual Growth Rate" (AAGR); instead of annual differences, we have quality group differences in our study. The AAGR is the average increase in citation impact over the quality groups. It is computed by taking the arithmetic average of a series of growth rates. In the first step of calculating AAGR for each indicator, we determined the percentage growth for each group (except for F1000 class 1), which is (value for F1000 class x / value for F1000 class x-1) - 1. In the second step, the AAGR is calculated as the sum of each indicator's growth rates divided by the number of F1000 classes minus one. We also calculated the "Sum Annual Growth Rate" (SAGR) for comparison with the AAGR; it is a measure of the total increase in citation impact over the quality groups. The results on the basis of AAGR and SAGR for the various indicators are shown in Table 6. The indicators are sorted by SAGR (and AAGR) in decreasing order.
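The two steps can be illustrated with a minimal Python sketch, applied to the I3/N class means reported above. The function names are hypothetical; since the input values are rounded, the results need not coincide exactly with the values in Table 6.

```python
def growth_rates(class_means):
    """Growth from each class to the next: (value of class x / value of class x-1) - 1."""
    return [current / previous - 1
            for previous, current in zip(class_means, class_means[1:])]


def sagr(class_means):
    """Sum of the growth rates over the quality classes."""
    return sum(growth_rates(class_means))


def aagr(class_means):
    """Sum of the growth rates divided by the number of classes minus one."""
    return sagr(class_means) / (len(class_means) - 1)


# I3/N values for F1000 classes 1 to 4 as reported in the text.
i3_per_paper_by_class = [11.68, 14.63, 22.66, 39.03]
i3_sagr = sagr(i3_per_paper_by_class)
i3_aagr = aagr(i3_per_paper_by_class)
```

With these rounded inputs, the sketch yields a SAGR of about 1.52 (roughly 152%) and an AAGR of about 0.51 (roughly 51%).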
The column "Difference to previous SAGR" reveals how much the SAGR of an indicator differs from the SAGR of the indicator with the rank x - 1. Thus, the column indicates how much larger the scores in the better class are. The results in Table 6 point out that PPtop 1% discriminates best between the different quality classes. The indicator is followed by the number of citations (measured across a three-year citation window). CSNCR is in third position, whereby I3 has AAGR and SAGR values very similar to those of CSNCR.
As the "Difference to previous SAGR" column reveals, PPtop 1% discriminates much better than the second best positioned indicator, the number of citations (measured across a three-year citation window), which itself performs much better than the CSNCR indicator. The indicators with the rank positions 3 to 10 are able to discriminate similarly between the four quality levels. The next larger performance differences are visible between PPtop 10% and SNCS3 (-45.84%) as well as between InCites percentiles and CSS (-66.51%).

Discussion: limitations and perspectives
The discussion about the normalization of citation impact has a long tradition in bibliometrics. Since publication and citation practices are very different among the various fields of science, citation numbers from different fields cannot be directly compared. The use of field-normalized indicators in research evaluation is one of the guiding principles in the Leiden Manifesto (Hicks et al., 2015). The same Manifesto advocates the use of percentiles for field normalization. In many evaluation contexts one uses field-normalized indicators (based on statistical normalization by the mean) for measuring citation impact instead of using the raw times-cited information from the WoS or Scopus databases. For example, field-normalized indicators are used in the popular Times Higher Education Rankings (see https://www.timeshighereducation.com/world-university-rankings).
Research on these indicators focused especially on the use of the arithmetic average of highly-skewed citation distributions. This poses a problem, for instance, for the use of MNCS and the way in which "research fields" are operationalized. Various categorization schemes can be used to define fields (e.g., schemes based on citation relations or subject categorizations from field-specific literature databases) and fields can be defined at different levels of aggregation (Wilsdon et al., 2015). Some research has been undertaken hitherto to identify field-normalized indicators using methods which normalize citation impact better than other indicators. According to the empirical results of Waltman and van Eck (2013a, 2013b), citing-side normalization has been shown to be more successful than cited-side normalization in field-normalizing citation impact. Purkayastha et al. (2018) reported the following results: "from the high correlations within our analyses of the two metrics across a range of research areas, we conclude that RCRScopus and FWCI [field-weighted citation impact] can be used interchangeably to evaluate citation impact of an article or of larger entities such as universities". Bornmann and Marx (2015) used assessments from F1000Prime to compare the validity of different (field-normalized) citation impact indicators.
We included a range of (field-normalized) indicators in the current study to compare the newly proposed I3 indicator with other indicators with respect to their convergent validity (using assessments by peers as a baseline; sometimes called the "gold standard" of peer review). We wanted to know whether I3 is better able than other indicators to discriminate between different quality levels as defined by Faculty members working for F1000Prime. The indicators differ in terms of field-categorization (e.g., papers in the same WC or co-cited papers) and comparison-strategy (e.g., comparison of percentiles or focal papers with mean values). The investigation of the different indicators shows smaller differences between different types of reference sets (field-categorization), but larger differences with respect to the comparison strategy (statistical normalization).
The results show that the PPtop 1% indicator discriminates best compared to the other indicators given the assumed baseline of F1000Prime. However, this result reflects the orientation of F1000Prime towards excellence in biomedicine which the PPtop 1% indicator targets more precisely than any of the other indicators. The second best indicator is the raw number of citations in the first three years after publication. Although this indicator is neither field-normalized nor statistically normalized, it performs comparably well, perhaps because it focusses specifically on the period when most of the papers are selected by the Faculty members for inclusion in the F1000Prime database. The Faculty members might also consider the number of citations in their selection decisions and assessments of the papers. Furthermore, the F1000Prime dataset is a relatively homogeneous dataset with respect to field differences, and for this reason field-normalization may not play an important role.
At the third and fourth positions in the validity ranking of the indicators are CSNCR and I3 with very similar values. Both indicators also scarcely differ from (perform slightly better than) RCR, MNCS, and the three SNCS indicators (as well as the citation counts measured over the variable citation window until 2017). Thus, the newly developed I3 indicator holds up well against many other (field-normalized) indicators: it discriminates between the four F1000 quality classes as well as (or even slightly better than) the other indicators.
With regard to percentiles (InCites and Hazen percentiles), our results are in disagreement with the previous results of . They reported very positive results for citation percentiles when this indicator is compared with other (field-normalized) indicators: "Percentile in Subject Area achieves the highest correlation with F1000 ratings" (p. 286). Using other data, the results in this study show, however, that percentiles (InCites and Hazen percentiles) perform comparatively worse. The reasons for the differences between the two studies should be investigated further in future studies.
A reason for the comparatively poor performance of some of the percentile-based indicators might be that the F1000Prime data is a selective group of papers regarded as especially useful for other researchers in biomedicine. Therefore, a discrimination of these papers with respect to their quality scores focuses on a rather high level of quality (very high quality vs. high quality). This suggests that percentile-based indicators focusing on the upper end of the citation distribution (especially the top-1% indicator) are better suited for adequately discriminating this specific data, whereas indicators considering other parts of the distribution may have less discriminative power in this set.
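The argument about the upper end of the distribution can be made concrete with a small sketch of a PPtop x% calculation. It assumes that the citation threshold for belonging to the top x% of a field and publication year is already known, and it ignores the handling of ties at the threshold. The function name `pp_top`, the thresholds, and the two paper sets are hypothetical and do not reproduce the data of this study.

```python
def pp_top(citations, threshold):
    """Proportion of papers whose citation count reaches the top-x% threshold."""
    return sum(c >= threshold for c in citations) / len(citations)

# Hypothetical citation thresholds for one field and publication year.
TOP10_THRESHOLD = 40   # assumed minimum citations to be in the top 10%
TOP1_THRESHOLD = 150   # assumed minimum citations to be in the top 1%

# Hypothetical citation counts for two quality classes of an already selective set.
class_1 = [12, 30, 45, 60, 150]
class_4 = [40, 90, 160, 220, 400]

for name, papers in [("class 1", class_1), ("class 4", class_4)]:
    print(name,
          "PPtop 10%:", pp_top(papers, TOP10_THRESHOLD),
          "PPtop 1%:", pp_top(papers, TOP1_THRESHOLD))
# class 1 PPtop 10%: 0.6 PPtop 1%: 0.2
# class 4 PPtop 10%: 1.0 PPtop 1%: 0.6
```

In this constructed example both classes already contain many top-10% papers, so PPtop 10% separates them by a factor of less than two, whereas PPtop 1% separates them by a factor of three. This is only an illustration of the reasoning above, not evidence for it.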
Furthermore, even in the already selective F1000 dataset, highly skewed distributions of quality scores and indicator values can be observed. Most of the papers fall into F1000 classes 1 or 2 which are very similar when compared to indicators which include low quality scores. This is also reflected in the indicator values across classes: for most of the indicators, classes 1 and 2 are rather similar, whereas class 4 substantially differs from the other classes.
As a result, the assessment of the indicators' validity mainly rests on the ability to discriminate the top papers from the rest of the (already selective set of) papers. This may also favor percentile-based indicators focusing on the upper end of the citation distribution. We expect that other percentile-based indicators would be better able to differentiate between papers (groups of papers) reflecting the broad range of different quality levels.
Although many (field-normalized) indicators which we included in this study might measure citation impact similarly, the results of our study also show that the concordance between the indicators is not perfect. The use of certain (field-normalized) indicators in research evaluation might lead to different results on citation impact, depending on the indicators used. Against the backdrop of our results concerning differences between the indicators, it might be interesting to investigate in future studies whether there are particular papers or types of papers which differ significantly between the various indicators. Information like this would be valuable in pointing out the biases of the various indicators.