A percentile rank score of group productivity: an evaluation of publication productivity for researchers from various fields

The difficulty in evaluating the research performance of groups is attributable to two factors: 1) differences in the population size and disciplinary composition of group members and 2) the skewed distribution of the research performance of individuals. This study attempts to overcome this difficulty, focusing on research performance measured by publication productivity. We employ a normalized index of the number of papers, in which publication efficiency is considered and disciplinary variation in publication intensity is corrected using disciplinary averages, to calculate a new percentile rank score. The score was developed on the basis of the principle that a person who is rare is valuable. It was tested with publication data for faculty members of 17 Japanese universities. Employing the normalized index increased the scores of universities with relatively few faculty members working in disciplines of high productivity, resulting in more plausible university rankings. The rankings show a high correlation with those based on a previously established percentile rank score developed for citation analysis, and they are consistent with the judgment of evaluators at several of the universities under study. The advantage of the new score over the previous one is that it leaves no room for arbitrariness in determining the scheme of rank classification and the weights given to each rank class.


Introduction
Scientometric indicators are statistical measures based on the number and distribution of publications, authors, references, and citations (Schubert et al., 1988). They are used for assessment, such as time-series analysis of research performance or performance comparison with others. The data are often obtained from large abstract and citation databases, such as Web of Science by Clarivate Analytics and Scopus by Elsevier, which have specific criteria for registering scientific journals. The indicators offered by the databases can be roughly divided into two types: one for evaluating publication productivity based on the number of papers and the other for measuring the visibility of papers or journal impact. This study focuses on the former, for which intensive studies are scarce compared with the latter.
Group assessment using the bibliometric indicators is a traditional practice at various levels ranging from countries or research institutions to research projects. For example, Gul et al. (2015) evaluated the difference in research performance among Middle Eastern countries by comparing the share of the number of papers in the region or the average number of citations per paper of each country. At the institutional level, Guskov et al. (2018) conducted a time-series analysis for the growth rate of the number of papers sorted by document types, covering research institutes participating in a national project for improvement of research performance. At the project level, to evaluate the research performance of 147 chemistry groups in the Netherlands, van Raan (2006) compared bibliometric indicators, such as the number of papers and citations, to the research quality judged by peers.
Two types of data set can be used to evaluate group publication productivity: one collected using a group name, such as a university name, and the other using authors' names. The two kinds of data set produce different results when the time window of data collection is long. The paper count generated by the former represents the research activity of a group and is widely used because it can be obtained directly from the abstract and citation databases, whereas the count generated by the latter represents the research potential of a group because it includes the publications that constituent members published while in other groups in the past. The latter, a bottom-up approach of data collection and integration, requires great care to unify author IDs or authorial information, which tend to split into multiple records over time; thus, name disambiguation is essential. For this reason, research potential is evaluated less often than research activity, but such evaluation is also important for executive board members of research institutes in formulating strategies to improve research performance. Kotsemir and Shashnov (2017) employed the bottom-up approach in evaluating the research potential of departments and staff members of a Russian university to avoid the underestimation of the number of publications, unique to some countries, caused by missing correct affiliation names in the publications indexed in the Scopus database. This study likewise evaluates the research potential of groups using publication data collected at the individual level.
The number of papers aggregated at several levels, such as individual, department, and university, is a simple index for estimating the publication productivity of the target, but it has two deficiencies. First, efficiency is not taken into account. Schubert et al. (1988) noted the importance of evaluating citation impact with consideration for the balance between the impact as an output and the publication effort as an input. The same applies to evaluating publication productivity. Second, comparing the index across disciplinary boundaries is difficult due to large disciplinary variation stemming from differences in average publication intensity and average number of coauthors. For citation impact, a number of studies have developed normalized indices, which are provided in research performance evaluation tools associated with abstract and citation databases (e.g., CNCI by Clarivate Analytics and FWCI by Elsevier); these enable us to compare the citation impact across years, document types, and disciplinary boundaries. Using these indices, for example, comparisons of research quality have been made not only among faculty members but also among nations or institutions (Avanesova and Shamliyan, 2018), as have comparisons of the effectiveness of interuniversity cooperation (Khor and Yu, 2016). Regarding publication productivity, in contrast, no such normalized index is widely recognized; evaluation and comparison within a specific discipline are considered preferable (Abramo and D'Angelo, 2014) and have therefore generally been conducted in previous studies (e.g., Vinluan, 2012; Barrot, 2017).
This study employs the normalized index for the number of papers named Discipline Weighted Publication Productivity (DWPP), proposed by Yamamoto and Ishikawa (2017), to compare groups across the population sizes and disciplines of their constituent members. The DWPP is calculated at the individual level as the ratio of the paper count as an output to the publication effort as an input, representing publication efficiency; the publication effort is defined here as an individual's effort devoted to a research field, following Schubert et al. (1988). The normalization is conducted by dividing the efficiency by its average in each discipline and aggregating over all disciplines. Even with this index, however, comparing publication productivity at the group level remains difficult. This is because most groups are composed of a small number of prolific researchers and a large number of unproductive researchers, which skews the distribution of the DWPP and renders its arithmetic mean inappropriate as a representative value.
Although the median can be taken as the representative value, information contained in the shape of the distribution would be lost. The same is true of citation impact, and attempts have been made to evaluate researchers or groups more precisely on the basis of relative rankings. To compare researchers with different publication sets, Leydesdorff et al. (2011) developed the mean percentile rank score, which is calculated by classifying the researchers' publications by percentile rank based on the number of citations, obtaining the relative frequencies in every rank class, and summing the values after weighting. The authors proposed two schemes for assigning the weights, which increase in different manners as the percentile rank rises; in their method, any other scheme can be chosen depending on the purpose of the evaluation. This degree of freedom in choosing the scheme gives evaluators great flexibility, but it has the disadvantage that the evaluation is liable to be arbitrary. The present study therefore aims to develop a novel index called the Better Performance Index (BPI), which is derived from the relationship between relative ranks in the whole sample and in each group, leaving no room for arbitrariness. This paper first describes the outline and calculation method of the DWPP and BPI, and then presents the values calculated from publication data collected at the individual level for nearly 15,000 faculty members of 17 Japanese universities. Finally, the following are discussed: the interpretation of the DWPP and BPI, the effects of normalizing the number of papers on the BPI, and the degree of agreement between the university rankings for the BPI and those for the median value or the scores of Leydesdorff et al. (2011).

Data
The Scopus database employs an original author ID system, in which users can unify multiple IDs for a single author manually or through an optional service of Elsevier, the Profile Refinement Service. The publication data used in the present study were collected from Scopus at the individual level, based on a list of researchers in 2019, for the 9 years from 2010 to 2018 inclusive. The long time window was chosen to ensure the sufficient sample size required for accurate determination of the DWPP, following Yamamoto and Ishikawa (2017). It also minimizes the effect of authors who do not publish within a given period, the so-called "potential authors" (e.g., Schubert and Telcs, 1986; Koski et al., 2016), because it is reasonable to assume that those who do not publish over such a long period are unlikely to publish after it, meaning they need not be counted as potential authors in the analysis. The data are confined to the faculty members of 17 Japanese universities that have conducted the ID unification, which improves the accuracy of name disambiguation. The faculty members with IDs, hereinafter simply referred to as researchers, total 14,912, corresponding to ∼8% of all the faculty members belonging to any Japanese university as of 2017. The number of papers to be analyzed amounts to 267,946; only the document types "article," "review," and "conference paper" are included.
The data were not separated by publication year in this analysis because the number of papers does not tend to increase as time passes, unlike the number of citations, setting aside the effect of a few years of registration delay after publication. Although the long period of analysis may conceivably be disadvantageous for early-career faculty, they were treated in the same way as the others, as the effects would be negligible on the whole. Table 1 summarizes the 17 universities under study: 12 comprehensive (i.e., multidisciplinary) universities and 5 specialized universities. The universities vary widely in the number of researchers, from just over 100 to just under 3,000, as well as in their standing in the Quacquarelli Symonds (QS) World University Rankings, from the top 100 to unranked. At the request of some universities, university names are not disclosed in this article.

Disciplinary classification of researchers
Several studies have analyzed publication productivity across disciplinary boundaries with consideration for efficiency, which requires determining the researcher population in each discipline as an input. The Times Higher Education (THE) World University Rankings includes a research performance evaluation based on normalized publication productivity calculated from universities' self-reported data on the number of faculty members and the number of papers in each THE discipline (Times Higher Education, 2019). The normalization is, however, very rough, as the classification comprises only 11 disciplines. In contrast, in the Italian higher education system, every researcher is classified into a single research field out of 370; Abramo and D'Angelo (2014) employed normalized publication productivity to determine university rankings and compared the results with those based on a normalized citation index. Although fine classification is utilized in the normalization, comparing the researchers with those outside Italy in the same framework is not possible. Besides, uniquely determining the discipline of researchers working in interdisciplinary fields can be difficult. Thus, in the present study, the All Science Journal Classification (ASJC) of Scopus assigned to source titles was used to determine a researcher's discipline. The classification has two levels: 27 major disciplines and 334 minor disciplines. We employed the major disciplines (hereinafter, ASJC27) in light of the available sample size. For example, suppose a researcher has published two papers in different journals. If ENGI and MATE are assigned to one journal and ENGI and PHYS to the other, his/her discipline is represented as ENGI 50%, MATE 25%, and PHYS 25% (see the appendix for the correspondence between the ASJC27 disciplines and their abbreviations). In calculating the DWPP, an individual's publication effort is allocated to each related discipline on the basis of this ratio.
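The allocation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation; the function name and the input format (a list of ASJC code lists, one per paper) are assumptions:

```python
from collections import Counter

def discipline_shares(papers):
    """Estimate a researcher's discipline shares from the ASJC codes
    of the journal of each paper. Each paper carries a total weight of
    one, split evenly over the disciplines assigned to its journal."""
    weights = Counter()
    for asjc_codes in papers:
        for code in asjc_codes:
            weights[code] += 1.0 / len(asjc_codes)
    total = sum(weights.values())
    return {code: w / total for code, w in weights.items()}

# The example from the text: one paper in a journal tagged ENGI and MATE,
# another in a journal tagged ENGI and PHYS.
shares = discipline_shares([["ENGI", "MATE"], ["ENGI", "PHYS"]])
# shares: ENGI 50%, MATE 25%, PHYS 25%
```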

Discipline weighted publication productivity
The DWPP employs fractional counting by the number of coauthors to evaluate the degree of contribution in the publication process, following Moed (2005), who noted that integer counting represents participation, whereas fractional counting represents a kind of credit for a study. The coauthors of a publication usually have not contributed equally (Waltman and Van Eck, 2015), but this poses a problem for evaluation at the individual level rather than at the level of a large group such as a university.
The following is an overview of the DWPP proposed by Yamamoto and Ishikawa (2017). To take work efficiency into consideration, a unit of publication effort is defined as manpower, with each researcher having a total of one manpower in a given time period. According to Abramo and D'Angelo (2014), productivity is the quintessential indicator of efficiency, and defining research productivity as the number of publications per researcher is the norm in bibliometrics. The DWPP is equivalent to this indicator when each researcher is classified into a single research discipline. In the present study, as researchers are classified into one or more disciplines (see "Disciplinary classification of researchers" section), a separable value of manpower is employed instead of the number of researchers. Furthermore, the following simplifying assumptions are adopted:
1) Coauthors contribute equally to a publication.
2) The paper count can be allocated evenly when a paper is assigned to multiple disciplines.
3) A given researcher's publication effort devoted to a specific discipline is proportional to the ratio of the number of his/her papers in that discipline to the total number of his/her papers.
4) Non-publishing faculty members, whose fields of study are not identifiable, can be ignored.
The method for calculating the DWPP and the components of its formula are summarized in Table 2. The DWPP values of all researchers average one; if a given researcher has a DWPP larger than one, he/she has above-average publication productivity.
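Under the assumptions above, the calculation can be sketched in Python. This is a minimal illustration of the verbal description (fractional paper counts per researcher and discipline, the disciplinary average as the ratio of total papers to total effort, and the DWPP as the sum over disciplines of the normalized counts), with hypothetical toy data, not the authors' code:

```python
def dwpp(counts):
    """counts: {researcher: {discipline: fractional paper count n_ij}},
    where n_ij is already divided by the numbers of coauthors and of
    disciplines. Returns {researcher: DWPP_i}."""
    totals = {i: sum(row.values()) for i, row in counts.items()}  # n_i.
    papers, effort = {}, {}
    for i, row in counts.items():
        for j, n_ij in row.items():
            papers[j] = papers.get(j, 0.0) + n_ij              # total papers in j
            effort[j] = effort.get(j, 0.0) + n_ij / totals[i]  # total effort in j
    P = {j: papers[j] / effort[j] for j in papers}             # disciplinary average
    return {i: sum(n_ij / P[j] for j, n_ij in row.items())
            for i, row in counts.items()}

# Hypothetical toy data: two engineers of different output and one
# researcher in a low-output field.
scores = dwpp({
    "A": {"ENGI": 8.0, "MATE": 2.0},
    "B": {"ENGI": 2.0},
    "C": {"ARTS": 1.0},
})
# By construction, the DWPP averages one over all researchers.
```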

Percentile rank score
The DWPP makes it possible to compare the publication productivity of individuals across disciplinary boundaries. Meanwhile, its arithmetic mean and standard deviation are considered unsuitable for the comparison of groups. Bibliometric indicators, such as the number of papers and citations, have long been known to have highly skewed distributions; normality cannot be assumed (e.g., Schubert and Glänzel, 1984; Seglen, 1992; Baccini et al., 2014). As expected, in the present study, the null hypothesis that the DWPP values are normally distributed was rejected for all universities by the Shapiro-Wilk normality test at the 5% significance level. As the percentile rank is less affected by outliers and can be used for highly skewed data to obtain a representative value, several studies have employed it to evaluate differences in citation impact among institutions or researchers (e.g., Leydesdorff et al., 2011; Bornmann et al., 2013; Bornmann and Marx, 2014).
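A small numerical illustration of why the mean misleads here, using fabricated numbers chosen only to be right-skewed (not real DWPP data): the mean is pulled up by a few large values, while the median and the ranks are unaffected when an outlier grows.

```python
import statistics

# A deliberately right-skewed toy sample standing in for DWPP values:
# many low-output researchers and a handful of prolific ones.
values = [0.2] * 50 + [1.0] * 40 + [5.0] * 9 + [40.0]

mean, median = statistics.mean(values), statistics.median(values)
# mean ≈ 1.35 vs median ≈ 0.6: the mean sits far above the typical value.

# Rank-based measures are robust: inflating the single largest value
# tenfold changes the mean but not the median or anyone's rank.
inflated = list(values)
inflated[-1] *= 10
```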
This study also adopts the percentile rank score to evaluate the publication productivity of groups. First, the relative frequency of researcher rankings p(k), where k is the class number, is calculated following the steps below:

Table 2 The method for calculating the DWPP and the components of its formula
- Fractionally counted paper count, n_ij: n is the number of papers, and the subscripts i and j indicate the identity numbers of the researcher and the discipline, respectively. The symbol n_i⋅ represents the total number of papers for researcher i, integrated over all disciplines.
- Publication productivity of discipline j: P_j = ∑_{i=1}^{N} n_ij ∕ ∑_{i=1}^{N} (n_ij ∕ n_i⋅), where N is the total number of researchers included in the analysis. The numerator and denominator represent the total number of papers and the total effort ratio, respectively, in discipline j.
- Discipline weighted publication productivity: DWPP_i = ∑_{j=1}^{M} n_ij ∕ P_j, where M is the total number of disciplines. The DWPP_i is calculated by dividing the number of papers for researcher i by the reference value P_j in each discipline and integrating the ratios over all disciplines.
1) Calculate each individual's DWPP and his/her ranking in the whole sample based on that value.
2) Divide the researchers' rankings into l percentile rank classes.
3) For each university, count the number of researchers in each class and divide it by the university population.
When the percentile ranks are classified at regular intervals, the k-th class includes researchers with percentile ranks greater than 100(k − 1)∕l and less than or equal to 100k∕l. Next, three percentile rank scores are calculated using the p(k). Two are the scores proposed by Leydesdorff et al. (2011), for which p(k) is calculated in classes divided at various intervals, weighted with the class number k, and then summed over all classes:

R(l) = ∑_{k=1}^{l} k · p(k).

Leydesdorff et al. (2011) calculated R(6) and R(100) by employing two different schemes of rank classification. For R(6), the ranking data are divided into six percentile rank classes at the 50th, 75th, 90th, 95th, and 99th percentiles, following National Science Board (2010). Because of this uneven classification, the weight given to p(k) increases rapidly at higher ranks. For R(100), the data are evenly divided into 100 classes, and the weight is proportional to the percentile rank. Bearing in mind that their target of analysis was the number of citations, not the number of papers, we employ R(6) and R(100) for comparison with our original score, introduced next.
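Steps 1-3 and the weighted sum can be sketched for equal-width classes as follows. This is an illustrative Python sketch; the function names and the input convention (a whole-sample list of researcher IDs sorted from worst to best) are assumptions:

```python
import math

def relative_frequencies(ranked_ids, group_ids, l=10):
    """Steps 1-3: cut the whole-sample ranking into l equal percentile
    classes (class l = highest ranks) and return the group's relative
    frequency p(k) for k = 1..l. ranked_ids is sorted worst to best."""
    n, group = len(ranked_ids), set(group_ids)
    p = [0.0] * l
    for pos, rid in enumerate(ranked_ids, start=1):
        percentile = 100.0 * pos / n
        k = min(l, math.ceil(percentile * l / 100.0))  # class number
        if rid in group:
            p[k - 1] += 1.0 / len(group)
    return p

def rank_score(p):
    """Percentile rank score in the style of R(l) for equal-width
    classes: relative frequencies weighted by the class number k."""
    return sum(k * pk for k, pk in enumerate(p, start=1))

# Ten researchers ranked worst to best; the group holds the best five,
# so its members fall only in classes 6..10.
p = relative_frequencies(list(range(10)), [5, 6, 7, 8, 9], l=10)
```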
The third score we propose, the BPI, is calculated as follows, with the ranks evenly divided into l classes:

BPI(l) = 1∕(l − 1) ∑_{k=1}^{l−1} (1 − q_s(k)) ∕ (1 − q_e(k)), (1)

where q(k) denotes the cumulative relative frequency up to the k-th class, and the subscripts s and e indicate a single university and the equilibrium conditions, respectively; the equilibrium conditions are met when the distribution of researcher rankings coincides with that for the whole sample. Here q_s(k) can be calculated as ∑_{i=1}^{k} p_s(i), and q_e(k) as k∕l. Equation (1) can be transformed into the following form using the deviation Δ(k) = q_e(k) − q_s(k), which represents the excess of the researcher ratio above the k-th class for a specific university compared with that for the whole sample:

BPI(l) = 1 + 1∕(l − 1) ∑_{k=1}^{l−1} Δ(k) ∕ (1 − q_e(k)). (2)

Equation (2) suggests two outcomes. First, the deviation Δ(k) receives a weight inversely proportional to 1 − q_e(k); the weight increases nonlinearly as k increases. Second, the deviation of the BPI from one represents the weighted Δ(k) averaged over all classes. To be specific, for l = 100, one researcher in the highest class (100th percentile class; top 1%) weighs 99 times as much as one in the lowest class (1st percentile class; bottom 1%). In the case of l = 10,
the weight for the highest class (10th decile class; top 10%) is nine times as large as that for the lowest (1st decile class; bottom 10%). Although a small interval should be used when the analysis includes small groups, this study focuses on universities, and no significant difference was observed between BPI(10) and BPI(100); hereinafter, BPI(10) is referred to as the BPI and is analyzed further.
The cumulative distribution function of researcher rankings in a model university is illustrated for visual understanding of the method (Fig. 1). The university is assumed to have p(k) decreasing from 0.19 by 0.02 at each step as k increases. The function shows the relation between relative ranks in the whole sample and in the model university; specifically, it reads that a given researcher at the 5th decile, i.e., the 50th percentile, in the whole sample is at the 75th percentile in the university. If the distribution of researcher rankings for the university were similar to that for the whole sample, the function would coincide with the equilibrium line, and he/she would be at the 50th percentile in the university as well. In Fig. 1, an example is shown for the derivation of the deviation Δ(k), which is calculated for the 5th class as q_e(5) − q_s(5) = 0.5 − 0.75 = −0.25. The BPI of the model university comes to 0.5 when the weighted Δ(k) are accumulated over all classes. In summary, the BPI is an indicator of the difference between researchers' relative ranks in the whole sample and those in a specific group. A BPI value of less than one can be interpreted as a researcher distribution inclined toward the lower ranks relative to the overall average.
Fig. 1 The cumulative distribution function of researcher rankings in a model university. The function is drawn by linearly interpolating the values of adjacent decile classes. The dashed line denotes the equilibrium conditions between a specific university and all universities, i.e., the whole sample.
Researchers ranked in the 10th decile class have the highest publication productivity, whereas those in the 1st have the lowest. The closed square and circle show the relative frequency cumulated up to the 5th decile for the equilibrium conditions (q_e(5)) and for a specific university (q_s(5)), respectively; Δ(5) is calculated as q_e(5) − q_s(5) = 0.5 − 0.75 = −0.25.
Figure 2 shows the nine-year average publication productivity P_j calculated for each discipline (see Table 2); the data are arranged so that related disciplines are placed adjacent to each other, following the "Wheel of Science" provided by Elsevier (Elsevier, 2020). Whereas researchers in the COMP, MATE, and ENGI disciplines have the highest P_j, those in the ARTS and DENT disciplines have the lowest. In this study, disciplines whose P_j is significantly higher than the overall average are referred to as high-P_j disciplines and the others as low-P_j disciplines, excluding statistically insignificant cases. The high-P_j disciplines were COMP, MATH, PHYS, CHEM, CENG, MATE, ENGI, ENER, and EART; the P_j was consistently high in these closely related disciplines. The low-P_j disciplines encompassed not only the initially anticipated ones in soft science, such as ARTS, PSYC, SOCI, BUSI, and ECON, but also some in medical science, such as HEAL, NURS, and DENT (see the appendix for the abbreviations and P_j values). The P_j values were used to normalize the number of papers of individuals in calculating the normalized index DWPP.
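The BPI calculation and the model-university example can be sketched as follows. The closed form used here, BPI(l) = 1∕(l − 1) ∑_{k=1}^{l−1} (1 − q_s(k))∕(1 − q_e(k)), is the form implied by the weighting described above (it equals one at equilibrium and reproduces the stated value of 0.5 for the model university); it is an illustrative sketch, not the authors' code:

```python
def bpi(p):
    """Better Performance Index for a group with relative frequencies
    p(k), k = 1..l (class l = highest ranks): the mean over k = 1..l-1
    of (1 - q_s(k)) / (1 - q_e(k)), i.e., one plus the weighted mean of
    the deviations q_e(k) - q_s(k)."""
    l = len(p)
    q_s, total = [], 0.0
    for pk in p:
        total += pk
        q_s.append(total)
    return sum((1.0 - q_s[k - 1]) / (1.0 - k / l) for k in range(1, l)) / (l - 1)

# Model university of Fig. 1: p(k) falls from 0.19 by 0.02 per class.
model = [0.19 - 0.02 * k for k in range(10)]
# bpi(model) comes to 0.5, and a group matching the whole sample
# (p(k) = 0.1 for every class) comes to 1.
```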

Variation of publication productivity among disciplines and individuals
The variation of the DWPP among individuals is presented as a boxplot for each decile class (Fig. 3). The DWPP is approximately linear with respect to the decile rank in the lower classes, whereas it increases exponentially above the 5th class, resulting in a skewed distribution overall. A notable feature is the extremely large variation observed in the highest, 10th class.
Fig. 2 The nine-year average publication productivity P_j for each discipline. The dotted line and error bars on the P_j bars indicate the average value of P_j over all disciplines and the 95% confidence intervals calculated by the bootstrap method, respectively. To facilitate comparisons of P_j between research areas consisting of closely related disciplines, the data are arranged in the same order as in the Wheel of Science provided by Elsevier.

Percentile rank score of each university

Figure 4 shows the relative frequency distribution of researcher rankings based on the DWPP for three universities, U1, U2, and U7 (see details in Table 1), which were selected for their characteristic distribution patterns. The relative frequency increases as the rank becomes higher for U1, whereas the opposite is true for U7. No obvious tendency is present for U2; the values are almost constant over all classes. These patterns suggest that the research performance is superior for U1, dominated by high-ranking researchers; medium for U2, with balanced researcher rankings; and inferior for U7, dominated by low-ranking researchers. The cumulative distribution function, generated by summing the relative frequencies upward from the lowest class, manifests the difference more clearly (Fig. 5). It is convex for universities such as U1, with values smaller than 0.1 in the lower classes, whereas it is concave for universities such as U7, with values larger than 0.1 there. For universities such as U2, with values comparable to 0.1 in all classes, the function lies close to the equilibrium line.
The function visually represents the correspondence between a given researcher's relative rank in the whole sample and that in each university. For example, a researcher at the 50th percentile in the whole sample is at the 39th percentile in U1, the 49th in U2, and the 62nd in U7. Table 3 shows the percentile rank scores of all universities calculated for the three patterns described in the "Percentile rank score" section; the table includes the median of the DWPP for each university as a benchmark score. For the later discussion, also shown is the score calculated from rankings based on the publication productivity before normalization (PP), obtained in the same way as the DWPP except that the P_j values of every discipline are set equal to unity (see the equation for the DWPP in Table 2).
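Reading a whole-sample percentile off the cumulative distribution function, as in Figs. 1 and 5, amounts to linear interpolation of the group's cumulative frequencies between class boundaries. A sketch using the model university of Fig. 1 (the real U1, U2, and U7 distributions are not reproduced here):

```python
def within_group_percentile(p, overall_percentile):
    """Map a whole-sample percentile to the within-group percentile by
    linearly interpolating the group's cumulative distribution q_s
    between the class boundaries 0, 100/l, ..., 100."""
    l = len(p)
    xs, ys, total = [0.0], [0.0], 0.0
    for k, pk in enumerate(p, start=1):
        total += pk
        xs.append(100.0 * k / l)
        ys.append(100.0 * total)
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        if x0 <= overall_percentile <= x1:
            return y0 + (y1 - y0) * (overall_percentile - x0) / (x1 - x0)

# Model university of Fig. 1: a researcher at the 50th percentile of
# the whole sample sits at the 75th percentile within the university.
model = [0.19 - 0.02 * k for k in range(10)]
```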

Interpretation of the DWPP and BPI
Comparing publication productivity across disciplinary boundaries is considered undesirable, and this discourages attempts to measure the research activity of groups. In this study, therefore, a new percentile rank score, the BPI, based on the normalized index for the number of papers, the DWPP, was developed and tested with real publication data of faculty members from 17 Japanese universities. In calculating the DWPP, the disciplinary average of publication productivity, P_j, is crucial for achieving accuracy. As described in the "Variation of publication productivity among disciplines and individuals" section, the variation in the P_j values is consistent with the empirical knowledge that publication productivity is higher than the overall average in hard science, while the opposite is true in soft science. These results provide assurance for further examination of the BPI calculated from the researcher rankings for the DWPP.
For the soft-science disciplines, however, the real meaning of the P_j is ambiguous. Outputs in these disciplines are frequently written in languages other than English, because the necessity of writing in English is not widely accepted; furthermore, they are often published in the form of a book or monograph (e.g., Kyvik, 2003; Nederhof, 2006). As a result, they are prone to be excluded from worldwide bibliometric databases; accordingly, evaluation based on such databases tends to be disadvantageous to these disciplines (Mongeon and Paul-Hus, 2016). Although the normalization of the number of papers favors researchers in these disciplines thanks to the smaller P_j, it does not resolve the problem of a large number of unindexed outputs. Furthermore, the potential authors pointed out by Koski et al. (2016) are expected to be more common in these disciplines than in the hard-science disciplines; although including their manpower in the calculation would decrease the P_j and consequently increase the DWPP, there is a limit to the accuracy with which their manpower can be estimated.
Considering these limitations, the DWPP can represent the publication productivity only of researchers in hard science. Fortunately, the 17 universities under study include no specialized universities in the field of soft science, and 92% of the total manpower is devoted to hard science. Thus, it would be reasonable to regard the scores in Table 3 as representing the publication productivity of hard-science faculty, and the error stemming from including soft-science faculty in the analysis is insignificant.
Figure 3 illustrates that the median and variation of the DWPP are remarkably higher in the highest class than in the second-highest class. The results suggest that the top-ranked researchers contribute greatly to the overall publication productivity of universities even though they are a minority, which is consistent with findings from previous studies (e.g., Kwiek, 2016). This behavior of the DWPP is considered to justify applying the nonlinearly increasing weight to the difference between q_s and q_e in the calculation of the BPI (see "Percentile rank score" section).
Table 3 The percentile rank scores of all universities calculated for the three patterns. Notes: The median and all scores except BPI_pp are based on the normalized publication productivity; the subscript pp denotes the score based on the publication productivity before normalization. Square brackets denote the university rankings based on each score. ROC represents the rate of change, calculated as (BPI_dwpp − BPI_pp)∕BPI_pp × 100 (%), and "high" the manpower ratio obtained by dividing the total effort ratio in the high-P_j disciplines by the number of researchers for each university. (a) R(6) and R(100) were calculated following the method employed by Leydesdorff et al. (2011). (b) Values less than 20% or more than 75% are highlighted in bold.

Effects of normalizing the number of papers
Next, this study examines how the normalization of the number of papers affects the BPI. As shown in Table 3, the rate of change in the BPI, i.e., (BPI_dwpp − BPI_pp)∕BPI_pp × 100 (%), exceeded 10% only for U7 and U17; these two universities gained an advantage from the normalization. In contrast, the normalization disadvantaged five universities; the rate of change fell below −10% for U11, U12, U13, U15, and U16. All of the latter except U12 are institutes of technology (see Table 1). Although U12 is a comprehensive university, it is well known for its high research performance in the engineering disciplines. The ratio of manpower devoted to the high-P_j disciplines is less than 20% for the universities with a substantial increase in the score, whereas it is more than 75% for those with a substantial decrease. Considering that the high-P_j disciplines correspond to the science and engineering disciplines (see "Variation of publication productivity among disciplines and individuals" section), the normalization naturally yields better scores for universities with a small ratio of researchers working in these disciplines.
Figure 6 shows the change in the relative frequency distribution of researcher rankings based on the publication productivity before and after normalization, i.e., the PP and the DWPP, for three universities with characteristic changes: U5, U11, and U17. U11 is one of the universities with a substantial decrease in the BPI due to the normalization, and its distribution shifts from the higher to the lower ranks on the whole (Table 3). This is probably because the manpower devoted to the high-P_j disciplines accounts for 86% of the total, so the publication productivity of most researchers is decreased by the normalization.
No significant change is seen in the distribution before and after normalization for U5, where 54% of the manpower is devoted to the high-P_j disciplines; elevated and lowered rankings can be inferred to be balanced. The largest improvement of the BPI is observed for the agricultural specialized university, U17, where only 8% of the manpower is devoted to the high-P_j disciplines; the overall distribution shifts upward after normalization.

Fig. 6 The change in the relative frequency of researcher rankings based on publication productivity before and after normalization, i.e., the PP and the DWPP, for three universities. The open and solid bars denote the values for the PP and the DWPP, respectively.

A certain correlation can be found between the difficulty of entrance exams and research performance at Japanese universities. In light of this correlation, the 3rd place of U4, 6th of U11, and 10th of U13 in the rankings for BPI_dwpp are plausible, whereas the 7th place of U4, 2nd of U11, and 5th of U13 for BPI_pp are implausible (see Table 3). The results suggest that the normalization of the number of papers is crucial for a fair comparison of the publication productivity of groups that incorporate researchers from a large variety of disciplines.
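The rate-of-change criterion used above can be sketched in a few lines; the numerical values here are hypothetical illustrations, not figures from Table 3.

```python
def rate_of_change(bpi_dwpp: float, bpi_pp: float) -> float:
    """Rate of change (%) of the BPI caused by normalizing the paper counts,
    i.e., (BPI_dwpp - BPI_pp) / BPI_pp * 100."""
    return (bpi_dwpp - bpi_pp) / bpi_pp * 100.0

# Hypothetical example: a BPI rising from 0.50 to 0.58 after normalization
# is a +16% change, exceeding the +10% threshold discussed in the text.
print(round(rate_of_change(0.58, 0.50), 1))  # 16.0
```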

University rankings
In this section, rankings are compared based on the median of the DWPP and three percentile rank scores (see Table 3). To examine their similarities, Spearman's rank correlation coefficients were calculated (Table 4). Although the correlation coefficients are generally high, the BPI has the highest correlation with the R(6), followed by the median, and the lowest with the R(100). The R(100) had low correlations with both the R(6) and the median. As mentioned in the "Percentile rank score" section, nonlinear weights with respect to the percentile rank are used in the calculation of the R(6) and the BPI, whereas linear weights are employed for the R(100). Interviews with evaluators of several universities under study revealed a high degree of agreement between the rankings based on the R(6) or the BPI and their subjective evaluations of the 17 universities. The results suggest that applying weights that increase nonlinearly with rank is reasonable when calculating percentile rank scores.
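Spearman's rank correlation between two ranking lists without ties can be computed from the textbook formula ρ = 1 − 6·Σd²/(n(n² − 1)); the rankings below are hypothetical, not those of Table 4.

```python
def spearman_rho(rank_a: list, rank_b: list) -> float:
    """Spearman's rank correlation for two tie-free rankings of the
    same items: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical ranks of five universities under two different scores;
# swapping only 2nd and 3rd place still yields a high correlation.
print(spearman_rho([1, 2, 3, 4, 5], [1, 3, 2, 4, 5]))  # 0.9
```

With tied ranks, average ranks must be assigned first (or a Pearson correlation of the rank vectors used); library routines such as SciPy's handle this automatically.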
In this analysis, the correlation of the rankings for the R(6) with those for the BPI is quite high; the ranking difference for each university is two at most, and the standard deviation is 0.97. The group size can, however, make a difference in the results. When evaluating small groups such as departments and research units, instead of universities, a small number of classes makes it difficult for the score to represent differences among groups; this problem can be avoided by making the evaluation at fine rank intervals. In adopting the R(6), therefore, the scheme of rank classification and weighting must be reconsidered.
The BPI in Equation (2) does not require such an adjustment; this is a merit of the score, obtained in exchange for flexibility in the weighting. Based on the principle that a person who is rare is valuable and on the empirical fact that the publication productivity of individuals increases exponentially with increasing rank (see "Variation of publication productivity among disciplines and individuals" section), the weight for each class is approximated by the reciprocal of the cumulative relative frequency of researchers ranked above the class in the whole sample. The score is thus easy to apply even to small groups or to cases where evaluators have no clear policy on how to classify the ranks and how to weight the relative frequency in each class.

Finally, the difference between the rankings based on the BPI and on the simplest score, the median of the DWPP, is discussed. Although the standard deviation of the ranking differences, 1.33, is unexpectedly low, the largest difference appears for U5, i.e., 9th place for the BPI and 13th place for the median. For U5, the relative frequency of researcher rankings exceeds 10% in the 10th decile class as well as in the 4th decile class or below (Fig. 6). Only universities U1-U5 have more than 10% of researchers in the highest class; the BPI places significant value on the existence of such exceptional researchers, whereas the median does not, resulting in the wide difference. The results suggest that the BPI or the R(6) is preferable to the median of the DWPP because weights can be adopted as a function of rank.
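The reciprocal-of-cumulative-frequency weighting can be illustrated with decile classes, where the whole sample places 10% of researchers in each class, so the weight of the k-th class from the top is 10/k. This is a minimal reading of the construction described in the text, not a reproduction of the paper's exact Equation (2).

```python
def bpi_sketch(freq: list) -> float:
    """Sketch of a BPI-like score from a group's relative frequencies
    over decile classes, freq[0] being the top class.

    In the whole sample, the cumulative relative frequency of
    researchers down to class k (1-indexed) is k/10, so the class
    weight is its reciprocal, 10/k: rare, high-ranked researchers
    count for much more than common, low-ranked ones.
    """
    weights = [10.0 / (k + 1) for k in range(len(freq))]
    return sum(f * w for f, w in zip(freq, weights))

# A uniform group (10% of its members in every decile) gives the
# harmonic-number baseline sum_{k=1..10} 1/k.
print(round(bpi_sketch([0.1] * 10), 3))  # 2.929
```

A group concentrated in the top deciles scores well above this baseline, while a group concentrated in the bottom deciles scores below it, which is exactly the behavior the text attributes to the BPI.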
In this study, we examined the utility of the BPI; in practical research performance evaluation based on the score, statistical significance testing of differences would be crucial, as discussed by Leydesdorff et al. (2011).
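One generic way to test whether a difference in a group score is significant is a resampling test under the null hypothesis that the two groups are drawn from the same population. The sketch below is a plain pooled-bootstrap procedure, not the specific method of Leydesdorff et al. (2011), and the `score` function and sample values are placeholders.

```python
import random
from statistics import median

def bootstrap_pvalue(group_a, group_b, score, n_boot=2000, seed=0):
    """Two-sided resampling p-value for the difference in a group score.

    group_a, group_b: per-researcher productivity values (e.g., DWPP).
    score: any function mapping a list of values to a group score.
    Under the null, both groups come from the pooled sample, so we
    resample with replacement and count how often the resampled
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(score(group_a) - score(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_boot):
        sample = [rng.choice(pooled) for _ in range(len(pooled))]
        if abs(score(sample[:n_a]) - score(sample[n_a:])) >= observed:
            hits += 1
    return hits / n_boot

# Hypothetical per-researcher values for two small groups:
p = bootstrap_pvalue([2.0, 3.1, 1.2, 4.5], [0.5, 0.8, 1.1, 0.9], median)
```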

Future work
When comparing the QS World University Rankings with the rankings for the BPI, universities ranked in the ranges of 501-750 and 751-1000 in the former often appear in reverse order in the latter (see Tables 1 and 3). In particular, for U5, the QS ranking is in the second highest class, being in the range of 101-300, whereas the ranking for the BPI is in the middle of the 17 universities, being 9th. In light of the fact that universities with comparable values of the BPI (i.e., U8, U9, U10, and U13) are in the range of 501-750 or below, U5 receives an unexpectedly high evaluation. Such differences are inevitable because the QS rankings are based on reputation among academics or employers, the faculty/student ratio, citations per paper, and the international faculty or student ratio (Quacquarelli Symonds, 2020); the evaluation does not incorporate a factor of publication productivity. Regarding U5, the university is large in scale, having the second largest institutional average of the normalized citation index FWCI after U1.
Improving the reputation metrics used for the world university rankings, such as those reported by the THE or QS, is difficult because of their vagueness. Benchmarking of pure research performance is therefore imperative for university executive board members deciding future policies. Undoubtedly, research performance cannot be measured by publication productivity alone; rather, citation impact tends to be preferred in evaluation. To the authors' knowledge, however, there exists no normalized citation index in which efficiency is considered in a way similar to the DWPP. Since the number of citations depends on the publication year, more data are required for the normalization. A major future challenge is to evaluate the research performance of groups in terms of both activity and impact by developing a normalized index for the number of citations, in a way similar to the DWPP, with an expanded data set.
Another issue to be solved is which document types should be included in the analysis. As mentioned above, some universities are placed at lower ranks in the QS rankings but at higher ranks in the rankings for the BPI. The feature of these universities, specifically U11, U12, and U13, is that the ratio of researchers related to the science and engineering disciplines is high and that conference papers account for over 35% of the total number of papers. Because a number of researchers at the 17 Japanese universities specialize in computer science, which places importance on conference papers (Vrettas and Sanderson, 2015), we decided not to exclude this document type from the analysis, so as to evaluate as many researchers as possible. Depending on the subjects or purposes of the evaluation, however, careful deliberation would be needed on whether to include conference papers, or whether to give them the same weight as articles and reviews.

Conclusion
The difficulty in evaluating the research performance of groups is attributable to two factors: 1) difference of population size or discipline of group members and 2) skewed distribution of the research performance of individuals. This study attempted to overcome this difficulty, focusing on research performance based on publication productivity. We employed the normalized index for the number of papers, DWPP, to calculate a new percentile rank score, BPI, developed on the basis of the principle that a person who is rare is valuable. In calculating the DWPP, publication efficiency was considered by introducing the concept of manpower devoted to the publication, and the disciplinary variation in the publication intensity was corrected by the disciplinary averages of publication productivity P_j.
The BPI based on the DWPP was tested with publication data collected at the individual level for faculty members of 17 universities in Japan. Regarding the P_j, significantly high values were observed in the science and engineering disciplines, except for the biological disciplines, while low values were distinct in the soft sciences and some medical disciplines. The normalization of the number of papers increased the BPI by over 10% for the universities in which the manpower devoted to the high-P_j disciplines was less than 20% of the total, resulting in more plausible university rankings. The results suggest that the DWPP can alleviate the problem whereby universities with a small portion of faculty members working in the high-P_j disciplines tend to be disadvantaged in research performance evaluation. The university rankings for the BPI showed a high correlation with those for the R(6) developed by Leydesdorff et al. (2011), and they were consistent with the subjective judgment by evaluators of several universities under study. The advantage of the BPI over the R(6) is that it leaves no room for arbitrariness in determining the scheme of rank classification and the weights given to each rank class.