Robustness of personal rankings: the Handelsblatt example

Abstract In recent years, Handelsblatt has published several rankings of business economists from German, Swiss and Austrian research institutions based on their journal publication output. These rankings have a strong influence on the academic profession. We scrutinize the Handelsblatt methodology by examining the effect the rankings' underlying algorithms and assumptions have on the scores and ranks of individual researchers. In doing so, we clarify how robust the result is with respect to these internal parameters. Since the parameters used by Handelsblatt are not scientifically substantiated but defined ad hoc, this question is of great importance. For each parameter variation, we provide several robustness measures for both the Handelsblatt life's work ranking and the Handelsblatt recent research performance ranking. For example, if one applies a weighting scheme that places more emphasis on first tier journal publications, such that the weight of a particular category is always double the weight of the next lower category, rank correlations based on all researchers in both personal rankings exceed 80 %. However, if one considers only the top 25 researchers, rank correlations fall below 50 %, and 20 % of researchers even drop out of this top group. Further research as well as discussion in the academic community should clarify whether these correlations confirm the robustness of the ranking or indicate the opposite.
Keywords Research evaluation · Personal ranking · Handelsblatt ranking

Introduction
In recent years, Handelsblatt has published several rankings of business economists from German, Swiss and Austrian research institutions based on their research performance. The first ranking was launched in 2009. In September 2012 and December 2014, two more recent versions followed. While Handelsblatt also produces rankings of entire research institutions (ranking "Top-25 Departments"), its focus is placed on rankings of individual researchers (e.g., "Top-250 Life's Work Ranking", "Top-100 Recent Research Performance"). These rankings have gained great importance in the academic profession; however, they are still heavily debated (see, e.g., Kieser and Osterloh 2012).
To produce individual rankings, Handelsblatt solely considers journal publication output and uses several journal rankings that classify journals according to their (perceived) quality. Using this journal classification and "certain algorithms", each scholar was assigned a score to reflect their publication success. This score was then used to sort the scholars. Because the resulting rankings have such a sustained influence on both the public perception of business administration and the impression researchers have of their peers, it is appropriate and necessary to critically scrutinize the underlying methodology. To this end, we focus on the Handelsblatt life's work ranking as well as the Handelsblatt recent research performance ranking (which is solely based on the journal publications in the last 5 years) and examine the effect that the mentioned "certain algorithms", according to which the researchers' journal publications are translated into scores, have on the ranking. Handelsblatt itself states: "There are several ways to make research performance comparable, each of them with their special strengths and weaknesses. For instance, when selecting and weighting journals one can reasonably justify different decisions. In individual cases this can have a significant effect on a researcher's rank. Also, in case of co-authorship the way the scores are split up among authors can influence the rank as well as whether a researcher's life's work, recent publication output or annual productivity is used as ranking criterion." (translated from Müller and Storbeck 2009).
Against this background, there is good reason to analyze the robustness of the Handelsblatt rankings. The objective of this study is thus to measure the extent to which both personal rankings change when the methods and algorithms used by Handelsblatt (referred to in this context as "internal parameters") are modified. In particular, our aim is to clarify how robust the result is with respect to variations in these internal parameters. What happens if single parameters are changed slightly? Does this displace the rank of only a few researchers, or does it lead to completely different results? We also analyze which internal parameter exhibits the strongest effect on the ranking and, given that the top 10, 25 or 100 researchers receive more public attention, whether the same scholars would still be in these top performing groups. Since the parameters used by Handelsblatt are not scientifically substantiated but defined ad hoc, this question is of great importance. If the Handelsblatt ranking is shown to depend heavily on the parameters, this may indicate a lack of validity of the underlying methodology. Our dataset dates from January 2013 and our robustness analyses are thus based on the 2012 ranking. A more recent ranking dates from December 2014; we did not investigate it, however, because the same methodology was used (for details see section 3) and results are assumed to be very similar.
The paper is structured as follows. The next section reviews the literature on performance measurement. Section 3 describes the dataset underlying the Handelsblatt personal rankings. In section 4, we examine whether they are robust to changes in the internal parameters. The final section provides implications and limitations of our study.

Indicators of research performance and previous studies on the Handelsblatt ranking
It is beyond dispute that the performance of individual researchers is multidimensional in nature (e.g., Hussain 2011; Bornmann and Marx 2013). Besides research, acquisition of external funds, academic self-administration, active participation in professional associations and communities as well as teaching and supporting young scholars play an important role in academia. Therefore, plenty of indicators have been proposed in the literature that measure, compare and rank individuals' performance. With regard to research output, one generally makes use of scientometrics (see, e.g., Bornmann and Marx 2013; Kreiman and Maunsell 2011; Froghi et al. 2012), i.e., metrics that capture research performance based on library or bibliographic resources. According to Fig. 1, such quantitative evaluation instruments can be classified into the following categories: Productivity indicators are a measure of the researcher's publication output, e.g., total number of publications or average number of publications per year. Most productivity indicators additionally take into account journal quality weights since publication quantity does not equate with publication quality (see, e.g., Rost and Frey 2011). A feature common to these indicators is that they assume important research results are published and that publications are the only visible research output. When calculating such metrics, several decision criteria have to be set, such as the choice of considered publication outlets and type of publication, the choice of time period, the handling of co-authorships and the determination of appropriate journal quality weights. Because of this discretionary or even arbitrary nature, productivity indicators have often been challenged and criticized (Adler and Harzing 2009). In particular, the determination of journal quality weights is an often discussed issue. 
The commonly used proxies for the quality of a journal, Journal Impact Factors (JIF) (Garfield 2006) and Normalized Journal Position (NJP) (Bornmann and Marx 2013; Costas et al. 2010), are based on a journal's citation frequency. However, this frequency can also be determined by non-scientific reasons (Bornmann and Daniel 2008; Judge et al. 2007; Jones et al. 1996) such as the reputation of a journal or its relationship to the authors and might create potential for manipulation or misuse (Archambault and Larivière 2009). Another possible way to determine accurate journal weights is to rely on researchers' assessment of a journal's quality. The German Academic Association for Business Research (VHB) uses this approach in their journal ranking JOURQUAL (Hennig-Thurau et al. 2004). The second category of research performance measures consists of impact indicators such as the sum of citations a researcher receives, or their average number of citations per publication. Unlike productivity indicators, they attempt to capture the response to individual publications. However, it is well known that these metrics are strongly time- and field-specific, e.g., due to the different numbers of journals indexed or different citation practices across scientific disciplines (see Abramo et al. 2011). Therefore, it seems appropriate to standardize citation frequency with respect to field and time (Bornmann and Marx 2013). In particular, Leydesdorff et al. (2011) suggest using percentile ranks that rate each paper in terms of its percentile in the citation distribution. Besides non-scientific motivations for citing a paper, citation analyses are criticized for leaving room for discretionary decisions, e.g., how to deal with self-citations or what the appropriate length of the considered citation window is (Abramo et al. 2011). Recently, additional indicators have been proposed to combine productivity and impact and thus increase assessment accuracy. 
Most notably, Hirsch (2005) introduces the h-index: a researcher has index h if h of their publications have each been cited at least h times. An overview and critical evaluation of the h-index and its variants are provided by Froghi et al. (2012).
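To make the definition of the h-index concrete, the following minimal Python sketch computes it from a list of citation counts. The function name `h_index` and the citation counts are illustrative assumptions and do not stem from the Handelsblatt data.

```python
def h_index(citations):
    """Return the largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)  # most cited paper first
    h = 0
    for position, cites in enumerate(ranked, start=1):
        if cites >= position:
            h = position  # the first `position` papers all have >= position citations
        else:
            break
    return h


# Made-up citation counts for one hypothetical researcher
example_citations = [10, 8, 5, 4, 3]
print(h_index(example_citations))  # four papers have at least four citations each
```

The sketch only illustrates the basic index; the variants discussed in the literature are not covered.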

[Fig. 1: categories of research performance indicators (productivity indicators, impact indicators, esteem indicators)]
In addition to productivity and impact indicators, esteem indicators as a surrogate for quality have been proposed in the literature. For example, Albers (2011) suggests also considering honorary doctorates when assessing researchers' overall achievements. Rost and Frey (2011) focus on membership of the editorial boards of professional journals. They provide evidence that a ranking based on this esteem indicator randomly correlates with citation and publication rankings. However, Backes-Gellner (2011) casts doubt on this esteem indicator for various reasons, one of which is that, unlike productivity indicators, it leaves more room for manipulation and discriminates against younger scholars.
To summarize, all research performance indicators have their advantages and disadvantages, and no standardized approach has emerged. All indicators have been discussed controversially by critics questioning the validity of these indicators and of any rankings of individuals, faculties or universities that are based on them. The methodology used by Handelsblatt, which is based solely on productivity indicators, has also been analyzed and frequently criticized in several studies. A broad critique of the Handelsblatt ranking even led to an online appeal to boycott the ranking (see Kieser and Osterloh 2012; Dilger 2013). Criticism was levelled at several aspects including the inference from journal quality to paper quality, lack of neutrality with respect to different fields of specialization, and the setting of false incentives that adversely affect science and society. For a rebuttal of these criticisms, see Storbeck (2012). It was also criticized that Handelsblatt focuses on journal-based publications only. Note, however, that Handelsblatt does not claim to measure the quality of researchers in general. Concerning the 2007 Handelsblatt economists' ranking, Hofmeister and Ursprung (2008) complain about an incomplete weighting of the extent of research results, the distorting co-author weighting, and an overly restrictive journal selection. More recent Handelsblatt rankings have at least addressed the last two concerns in that they use a less distorting co-author weighting and a broadly expanded journal selection. Voeth et al. (2011) analyze the journal rankings underlying the Handelsblatt ranking with respect to more impact-oriented quality. For marketing, they show that journal rankings are only weakly aligned with the bibliometric impact of the journals. Müller (2010) centers his criticism on the Handelsblatt business economists' ranking, which he compares to a citation-based researchers' ranking. 
His comparison reveals considerable discrepancies between both rankings. Accordingly, he concludes that citations cannot be used to predict the rank assigned by Handelsblatt. Individual rankings should thus be interpreted with caution.
The study most similar to ours is that of Krapf (2011). Krapf uses alternative journal rankings (e.g., Jourqual 2, impact factors, journal-rating of the Wirtschaftsuniversität Wien) and compares the resulting department and life's work rankings to the corresponding Handelsblatt rankings by providing correlation measures. His findings suggest that department rankings are more or less robust to different journal classifications. By contrast, individual rankings react much more sensitively to a change in the underlying journal ranking. This remains true independently of whether the economists' or business economists' ranking of Handelsblatt is considered. In contrast to Krapf, we analyze neither department rankings nor rankings of economists but limit our robustness analyses to rankings of individual business economists, which are in turn investigated in more depth. In particular, our study focuses on both the Handelsblatt life's work ranking and the Handelsblatt recent research performance ranking, and differs from his analysis in several ways.
Firstly, we not only investigate the impact of variations in journal weights but also in other internal parameters that have not been examined before, such as the way scores are split among co-authors and the weighting of different types of journal publications. Secondly, our approach also differs with respect to variations in journal weights. When assigning weights to each journal publication, Handelsblatt had to face two problems: the allocation of journals to several quality categories (which journals belong to the best, second best, third best etc. quality level?) and the assignment of appropriate weights to each of these categories (which weight is assigned to the best, second best, third best etc. category?). Krapf concentrates on the former aspect by investigating ranking differences that arise when different journal rankings are used that classify journals differently into quality categories. By contrast, we focus on the latter aspect by still relying on the Handelsblatt allocation but changing the weights that Handelsblatt assigned ad hoc to each quality category. This enables us to isolate the effect of the Handelsblatt weighting scheme. Finally, we contribute to prior research by providing more meaningful statistical measures to analyze the robustness of the rankings. In particular, we additionally use rank correlations that do not assume equivalence of rank differences (see Sect. 5 for details) as well as median rank changes, and also provide statistics for the most interesting group of top performers (top 10, 25, 100 and 250) on how many researchers drop out of the respective top group when internal parameters vary.

Data summary
The data used by Handelsblatt were collected by Thurgauer Wirtschaftsinstitut (TWI) at the University of Konstanz and are currently managed by the KOF Swiss Economics Institute (ETH Zürich). TWI records all publications of the individual researchers, who are able to constantly view, update and correct their data via the online portal Forschungsmonitoring. Our dataset dates from January 2013 and thus essentially matches the data underlying the 2012 ranking.
Originally, the database contained 3,016 researchers. 493 were excluded from our dataset since they have not published yet. Around 15 % of these researchers without any publications are professors. Thus, the database on which our analysis is based contains 2,523 researchers with at least one publication each. As shown in the distribution of academic or job titles in Table 1, only 62 % of them hold a professorship. This category includes junior professors, assistant professors, associate professors, honorary professors, irregular (''außerplanmäßiger'') professors as well as full professors at universities of applied sciences (''Fachhochschulen'') or universities.
Besides names and academic degrees, the data also contain researchers' dates of birth, their university or research institute affiliation, field of specialization and information on their journal publications. In total, the database covers 47,998 journal publications. However, several are co-authored, so duplicate entries cannot be ruled out. For each publication, the number of co-authors, type of publication, year of publication and the weight assigned by Handelsblatt to the journal in question are provided. Note that we do not have information on the particular journal of publication. Concerning the assignment of weights, Handelsblatt considered all journals that are ranked by JOURQUAL 2.1 of the VHB, journals that are indexed in the Social Science Citation Index (SSCI) or Science Citation Index (SCI) as well as journals that are mentioned in the 2011 version of the journal ranking of the Erasmus Research Institute of Management (EJL) (see Schläpfer and Storbeck 2012). In addition, all economics journals were added to the database as long as they were assigned a weight of 0.1 or higher in the Handelsblatt economists' ranking. This procedure results in a total of 950 journals that were considered by Handelsblatt in 2012 [meanwhile the number has increased to more than 1000 journals and the so-called Journal to Field Impact Scores (JFIS) are additionally used that account for different publication and citation practices across disciplines in business administration, see Gygli et al. (2014)] and categorized into eight quality levels. An overview of the journals' classification into these quality levels can be found in Schläpfer and Storbeck (2012). In a second step, Handelsblatt then assigned to each quality level the weights shown in Table 2.
When calculating the individual scores, Handelsblatt also took into account the type of journal publication by assigning only half points for comments and zero points for editorials, corrections and book reviews. Moreover, the scores were split among the authors. In analogy to the Handelsblatt economists' ranking, the score is divided by the number of authors n. By contrast, the prior ranking of business economists in 2009 used the allocation formula 2/(n + 1) to split up the score. However, this approach has been criticized because it incentivizes the excessive use of co-authorships. As a consequence, the formula 1/n was introduced in the 2012 version of the ranking. The influence of this approach is also addressed in the next section. The individual score of each researcher was then determined according to

$$\mathrm{Score}_i = \sum_{j=1}^{\mathrm{Pub}_i} \frac{w_j \, t_j}{n_j},$$

where w_j denotes the weight assigned by Handelsblatt to the journal of publication j, t_j the factor for its publication type (1 for regular articles, 0.5 for comments, 0 for editorials, corrections and book reviews) and n_j its number of authors. While index i denotes a particular researcher, index j stands for a particular journal publication. The summation runs over a researcher's j = 1, ..., Pub_i journal publications. "Pub_i" denotes the total number of journal publications of researcher i as far as the life's work ranking is concerned and the total number of journal publications in the last five years as far as the recent research performance ranking is concerned. Table 3 shows that a journal publication had 2.29 authors on average and that only very few publications were comments or book reviews.
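The score calculation described above (journal weight, scaled by publication type and divided by the number of authors) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the type labels in `TYPE_FACTOR`, the function names and the example journal weights are hypothetical and are not taken from Table 2 or from the Handelsblatt database.

```python
# Publication-type factors as described in the text: full points for regular
# articles, half points for comments, zero points for editorials, corrections
# and book reviews. The dictionary keys are hypothetical labels.
TYPE_FACTOR = {
    "article": 1.0,
    "comment": 0.5,
    "editorial": 0.0,
    "correction": 0.0,
    "book_review": 0.0,
}


def publication_points(journal_weight, pub_type, n_authors):
    """Points one author receives for one publication (1/n split)."""
    return journal_weight * TYPE_FACTOR[pub_type] / n_authors


def researcher_score(publications):
    """Sum the points over all publications of one researcher."""
    return sum(publication_points(w, t, n) for (w, t, n) in publications)


# Hypothetical publication list: (journal weight, type, number of authors)
pubs = [(1.0, "article", 2), (0.5, "comment", 1), (0.2, "book_review", 1)]
print(researcher_score(pubs))  # 0.5 + 0.25 + 0.0
```

Restricting `pubs` to the publications of the last five years would give the input for the recent research performance ranking rather than the life's work ranking.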
To provide descriptive statistics of the performance of various disciplines in business administration in the Handelsblatt life's work ranking, we grouped researchers according to their fields of specialization. However, the allocation by Handelsblatt is associated with an enormously high error rate and is incomplete. In particular, more than 600 researchers were not allocated to any field of specialization. Therefore, we rely on the researchers' self-assignment which follows from their affiliation with one or more of the 16 scientific sections of the VHB. 964 of the 2,523 business economists belong to at least one section. We manually assigned another 1,529 researchers to one or more fields of specialization in business administration by comparing their websites with the allocation made by Handelsblatt. This approach proved successful, with only 30 researchers who could not be allocated to a discipline. Table 4 shows the absolute and relative frequencies of each specialization's affiliates. The sum of all affiliates is 3,390 rather than 2,493, which implies that some researchers are assigned to more than one specialization. Moreover, Table 4 describes the disciplinary composition of top performing groups in the Handelsblatt life's work ranking. It reads as follows: Out of the 25 best researchers according to Handelsblatt, 11.1 % have specialized in banking and finance whereas not a single top 25 researcher has specialized in business taxation. This reveals that the disciplines are not equally represented in the top performing groups. Differences across disciplines also become obvious when looking at the Handelsblatt quality classification of their publications, ranging from 1 (highest quality level) to 8 (lowest quality level). Researchers specialized in business taxation published in journals with an average quality level of 7.4. No other specialization shows a lower average journal quality (i.e., a higher mean level) or a smaller standard deviation. 
The reason might lie in the fact that they predominantly publish in German journals that are (by intention) lower ranked than international ones. Researchers specialized in accounting or auditing published in journals with an average quality level of 7.0, so they come in second.

Table 4 This table describes the allocation of researchers to their fields of specialization and provides information on the research performance across these specializations. The allocation was possible for 2,493 of the 2,523 researchers with non-zero publications considered by Handelsblatt. The first columns report absolute and relative frequencies of researchers in each discipline. Note that the sum of all affiliates adds up to 3,390 rather than to 2,493 because some researchers have specialized in more than one field of business administration. The next two columns report the disciplinary composition of the top 25 and top 250 group according to the Handelsblatt life's work ranking. The last columns display mean and standard deviation (SD) of journal quality levels, ranging from 1 (highest quality) to 8 (lowest quality), per discipline. The numbers are based on 47,551 publications of 2,493 researchers in the Handelsblatt database that have non-zero publications and that could be allocated to the fields of business administration.

Robustness with respect to internal parameters

Robustness measures
In this section, we explore whether and, if so, to what extent variations in the internal parameters change the Handelsblatt life's work ranking as well as the Handelsblatt recent research performance ranking of business economists. We begin by attempting to reconstruct both rankings on the basis of our dataset by calculating and sorting the scores according to the approach described above. In line with the approach of Handelsblatt, we placed researchers with identical scores at the same rank. We then compare our results with the top 250 life's work ranking of business economists and with the top 100 recent research performance ranking of business economists, respectively, that were published by Handelsblatt in 2012. Differences between our reconstruction and the published version of the rankings arise only when a researcher refused to be ranked and named in the public rankings. At the time of publication of the ranking, the number of boycotters was 339; most of them are lower-ranked. For the purpose of our study, we include all boycotters in the data. Next, we modify the underlying score calculation in a consistent way and analyze the resulting effect on the (ordinal) ranks, (cardinal) scores and the composition of the top performing groups of researchers. More precisely, we provide the following robustness measures for each parameter variation.
Score correlation We analyze the correlation between the obtained cardinal scores before and after the parameter change measured by Pearson's correlation coefficient (Pearson's rho). A high coefficient indicates a strong linear relationship between the scores before and after the change in internal parameters.
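As an illustration of the reconstruction step and the score correlation, the following Python sketch ranks scores so that identical scores share a rank and computes Pearson's correlation coefficient between two score vectors. The function names and the example scores are hypothetical. The sketch additionally assumes that the rank following a tie is skipped (1, 2, 2, 4); the text only states that identical scores share a rank, so this detail is an assumption.

```python
from math import sqrt


def ranks_with_ties(scores):
    """Rank scores in descending order; identical scores share the same rank.

    Assumption: after a tie the next rank is skipped (1, 2, 2, 4, ...).
    """
    ordered = sorted(scores, reverse=True)
    first_position = {}
    for position, score in enumerate(ordered, start=1):
        first_position.setdefault(score, position)  # keep best position per score
    return [first_position[score] for score in scores]


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores of four researchers before and after a parameter change
scores_before = [3.0, 1.5, 1.5, 0.0]
scores_after = [3.5, 1.0, 2.0, 0.5]
print(ranks_with_ties(scores_before))  # [1, 2, 2, 4]
print(round(pearson(scores_before, scores_after), 3))
```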
Rank correlation With respect to individual ranks, we raise the question to what extent these ordinal data react to parameter changes. As is known from the literature, there are two common rank correlation measures: Spearman's correlation coefficient and Kendall's tau. To assess which of them is most appropriate for our purposes, we take a closer look at both rank correlation coefficients. Although the Spearman correlation coefficient is calculated like the Pearson correlation coefficient, it is based on ranks instead of raw scores. This approach assumes equivalence between all rank neighbors, which implies that, e.g., the difference between the first and second rank is equivalent to the difference between ranks 100 and 101. The above-mentioned study by Krapf (2011) uses Spearman's rho. However, due to the features we just mentioned, we are not convinced that it is appropriate for our purposes. A closer look at, e.g., the first twenty researchers reveals immense score differences. The further down in the ranking, the smaller these differences. From the thirtieth rank downwards, the differences are marginal and practically negligible. Therefore, we do not use Spearman's rho. Instead, Kendall's tau seems suitable as a correlation measure in our study as it does not assume equivalence of rank differences (see, e.g., Cleff 2008: 118). Moreover, Kendall's tau is easy to interpret. Tau takes values between −100 % and +100 %; the actual value is determined by all pairwise rank comparisons that are possible within the sample. A value of +100 % indicates that both rankings, those before and after the parameter change, have the exact same order. In case of a correlation amounting to −100 %, all comparisons reverse to the opposite. A zero correlation implies that concordant and discordant comparisons are perfectly balanced. Consequently, we use Kendall's tau instead of Spearman's rho as a correlation measure for ordinal ranks.
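The pairwise logic behind Kendall's tau can be sketched as follows. This minimal Python example implements the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; which variant underlies the reported figures is not stated in the text, so this choice, the function name and the example ranks are assumptions.

```python
from itertools import combinations


def kendall_tau(ranks_before, ranks_after):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Pairs that are tied in either ranking count as neither concordant
    nor discordant (simplifying assumption of this sketch).
    """
    concordant = discordant = 0
    pairs = list(combinations(range(len(ranks_before)), 2))
    for i, j in pairs:
        direction = (ranks_before[i] - ranks_before[j]) * (ranks_after[i] - ranks_after[j])
        if direction > 0:
            concordant += 1  # both rankings order the pair the same way
        elif direction < 0:
            discordant += 1  # the rankings order the pair differently
    return (concordant - discordant) / len(pairs)


# Hypothetical ranks of four researchers; the last two swap places
print(kendall_tau([1, 2, 3, 4], [1, 2, 4, 3]))  # 5 concordant, 1 discordant pair
```

Multiplying the result by 100 gives the percentage scale used in the text.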
Percentage out Because most public attention is given to the top performing group of researchers, we complement all analyses by additionally providing statistics for the top 10, 25, 100, 250 and 500 researchers. We then investigate the percentage of researchers that drop out of the respective top group if a parameter variation takes place. The lower this percentage, the more robust is the composition of the top performing group.
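A minimal sketch of the percentage-out measure, assuming ranks are stored as a mapping from researcher to rank; the function name `percentage_out` and the five example researchers are hypothetical.

```python
def percentage_out(ranks_before, ranks_after, k):
    """Share (in %) of the original top-k researchers who leave the top k."""
    top_before = {name for name, rank in ranks_before.items() if rank <= k}
    top_after = {name for name, rank in ranks_after.items() if rank <= k}
    dropped = top_before - top_after
    return 100.0 * len(dropped) / len(top_before)


# Hypothetical ranks before and after a parameter variation
before = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}
after = {"A": 1, "B": 4, "C": 2, "D": 3, "E": 5}
print(percentage_out(before, after, k=2))  # B leaves the top 2, so 50.0
```

With tied ranks the top-k group may contain more than k researchers; the sketch then divides by the actual group size.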
Median absolute rank change Finally, we want to describe whether researchers are ranked quite differently on average or if they only slightly decrease or increase in rank if internal parameters are varied. To prevent negative and positive rank changes from neutralizing each other, we look at absolute rank changes only. Instead of arithmetic means, we rely on the median (absolute) change in number of ranks because the median is robust to outliers. If, e.g., within the top 10 the best ranked researcher goes down from rank 1 to 10 due to a parameter change and all nine other scores remain stable, the median absolute rank change is 1 whereas the mean absolute rank change would amount to 1.8, even though all researchers but one just move up by one rank.
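The top 10 example can be checked with a short Python sketch; the function names are illustrative assumptions.

```python
from statistics import mean, median


def abs_rank_changes(ranks_before, ranks_after):
    """Absolute change in rank for each researcher."""
    return [abs(after - before) for before, after in zip(ranks_before, ranks_after)]


def median_abs_rank_change(ranks_before, ranks_after):
    return median(abs_rank_changes(ranks_before, ranks_after))


def mean_abs_rank_change(ranks_before, ranks_after):
    return mean(abs_rank_changes(ranks_before, ranks_after))


# The leader falls from rank 1 to rank 10; the other nine move up by one rank
ranks_before = list(range(1, 11))
ranks_after = [10] + list(range(1, 10))

print(median_abs_rank_change(ranks_before, ranks_after))  # median of nine 1s and one 9
print(mean_abs_rank_change(ranks_before, ranks_after))    # (9 + 9 * 1) / 10
```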

Internal parameters
To investigate the impact of internal parameters on the rankings, we deviate from the Handelsblatt procedure and verify how individual scores, individual ranks as well as the composition of the top performing groups change. The internal parameters of interest are:
Co-author weight First, we deal with parameter variations that we expect to have no serious effect on the ranking. Concerning co-authorships, we change the current allocation formula 1/n into 2/(n + 1), which was formerly used by Handelsblatt to split up scores among n authors. Accordingly, in cases with two authors both receive 0.67 points instead of 0.5 points as in the recent ranking. The formerly used formula has been heavily criticized for creating (false) incentives for multi-coauthorship (Hofmeister and Ursprung 2008). The robustness measures associated with the alteration of this formula indicate the extent to which such multi-coauthorships have been used.
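The difference between the two allocation formulas can be illustrated with a short Python sketch for a publication worth one point; the function names are illustrative assumptions.

```python
def share_current(n_authors):
    """Per-author share under the current formula 1/n."""
    return 1 / n_authors


def share_former(n_authors):
    """Per-author share under the formerly used formula 2/(n + 1)."""
    return 2 / (n_authors + 1)


# Per-author share and total points handed out for a one-point publication
for n in range(1, 5):
    print(
        n,
        round(share_current(n), 2),
        round(share_former(n), 2),
        round(n * share_former(n), 2),  # total awarded under 2/(n + 1)
    )
```

Under 1/n the shares always sum to one point, whereas under 2/(n + 1) the total awarded, 2n/(n + 1), exceeds one point as soon as there is more than one author, which illustrates the incentive for co-authorship criticized in the literature.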
Weight of publication type Furthermore, we integrate and fully weight comments, editorials, corrections and book reviews when determining the scores. Such a weighting can be seen as problematic since they may not be classified as original research output. However, they are part of a researcher's journal publication output and overall productivity. As prior research claims that rankings should be based on multiple criteria, not only on research output (see, e.g., Albers 2011), we additionally rely on these journal publication types to measure productivity and want to examine if even small changes like this lead to serious adjustments in the ranking. By contrast, Handelsblatt assigned only half points for comments and zero points for editorials, corrections and book reviews.
Journal weights The selection of weights is another subjective influence on the ranking and there is no reasonable justification for the concrete values shown in Table 2. However, there does not exist an overall accepted or "theoretically correct" weighting scheme. Thus, we apply some alternative journal weighting schemes that we believe are reasonable and leave it to the readers' judgement which scheme reflects the journals' quality best. We then want to test how sensitively the ranking reacts to these changes in journal weights. Our changes in journal weights can be divided into two blocks.
First, we only change two weights at most. For instance, we raise or lower one of the values presented in Table 2 and verify the effect of this modification on the obtained score and the resulting rank. For consistency, we also redefine the weights by merging categories. That is, in one analysis we combine the Handelsblatt journal categories HB4, HB5 and HB6 and assign them a weight of 0.3, thereby downgrading the former category HB4 and upgrading the former category HB6. Note that Handelsblatt assigns zero points to publications in journals of the category HB8, meaning that of approximately 2,500 considered researchers, more than 300 received no points and had to share the last rank. Since category HB8 carries no weight, journals in, e.g., JOURQUAL category E, which play an important role in transferring research findings to professional practice, are not taken into account. In light of the strong applied nature of business administration, the decision to disregard these publications as research output should be considered carefully. Therefore, we also analyze the impact on the ranking when taking into account the lowest journal category.
In a second step, we scrutinize the ranking more closely by modifying the weights of all journal categories instead of those of just one or two. We assess the effects that result from arranging the weights in an arithmetic or geometric progression. There are also other degrees of freedom. For instance, in the case of an arithmetic series one could assign a value of 0.7 to the highest journal category and reduce this weight downwards in steps of c = 0.1. In the case of geometric progressions, the growth ratio g between two journal categories could reach, e.g., 0.5 or 0.8 without one value being more appropriate than the other.
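The two progression types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, eight journal categories (HB1 to HB8) are assumed, and the relative weight of a category is taken to be its share in the sum of all weights, which may differ from the exact definition behind Table 5.

```python
def arithmetic_weights(top=1.0, step=0.1, levels=8):
    """Weights for HB1..HB<levels> that fall by a constant step c per category."""
    # Rounding removes floating-point noise such as 0.7000000000000001.
    return [round(max(top - step * i, 0.0), 10) for i in range(levels)]


def geometric_weights(top=1.0, ratio=0.5, levels=8):
    """Weights for HB1..HB<levels>; each category is `ratio` (g) times the one above."""
    return [top * ratio ** i for i in range(levels)]


def relative_weights(weights):
    """Assumed definition: share of each category in the sum of all weights."""
    total = sum(weights)
    return [w / total for w in weights]


arith = arithmetic_weights(top=1.0, step=0.1)
geom = geometric_weights(top=1.0, ratio=0.5)

print("arithmetic:", arith)
print("geometric: ", geom)
print("relative HB1 weight (arithmetic):", round(relative_weights(arith)[0], 3))
print("relative HB1 weight (geometric): ", round(relative_weights(geom)[0], 3))
```

Under these assumptions the geometric scheme with g = 0.5 gives the top category roughly half of the total weight, whereas the arithmetic scheme with c = 0.1 gives it less than a fifth.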
The specific weighting schemes that we tested in our analysis are displayed in Table 5. Note that the choice of one specific weighting scheme directly defines whether quality or quantity is rewarded better in the ranking. Researchers who want to obtain high scores in the Handelsblatt personal rankings have to decide whether to spend their time producing many papers of low quality or a smaller number of top papers. In particular, not the absolute but the relative weights determine the quality-quantity trade-off and are thus also reported in Table 5. The higher the relative weight of first tier publications, the stronger the emphasis placed on quality over quantity. With a relative weight of 50 % for publications with highest quality (HB1), the geometric weighting scheme with g = 0.5 thus creates the strongest incentive to publish in first tier journals, i.e., the quality of one category is always double that of the next lower category. Such a weighting scheme would thus punish researchers that (almost) never publish in the highest journal category. By contrast, out of the various weighting schemes tested in our analysis, the arithmetic progression with HB1 = 1 and c = 0.1 places the least emphasis on HB1 publications and instead assigns more equal weights to all journals. The weighting scheme applied by Handelsblatt ranges in between. To illustrate the effect of these three weighting schemes, consider the following example (see Table 6) of authors having different publication structures. The geometric scheme presented in the first row gives an advantage to quality as author 1 would receive a higher score than authors 2 and 3. The opposite is true for the arithmetic weighting presented in the second row where quantity is rewarded better. By contrast, Handelsblatt would position author 1 and author 2 at the same rank.
When comparing the scores of authors with different publication structures, it becomes obvious that the more an author chooses quantity over quality, the more strongly the score reacts to changes in the weighting. Thus, a larger number of low quality publications (which is also prevalent in our data as Table 3 shows) leads to leverage effects.
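The quality-quantity trade-off and the leverage effect can be made concrete with a small Python sketch. The two authors below are hypothetical and are not the authors of Table 6; all papers are assumed to be single-authored so that co-author weights play no role, and the `score` function is a simplified stand-in for the Handelsblatt scoring.

```python
def score(publications, weights):
    """Sum of journal weights over an author's publications.

    `publications` maps a 0-based category index (0 = HB1) to a paper count.
    """
    return sum(weights[cat] * count for cat, count in publications.items())


# Two of the weighting schemes discussed in the text, for eight categories.
geometric = [0.5 ** i for i in range(8)]        # HB1 = 1, g = 0.5
arithmetic = [1.0 - 0.1 * i for i in range(8)]  # HB1 = 1, c = 0.1

# Hypothetical publication structures (illustrative only).
authors = {
    "quality-oriented": {0: 2},   # two HB1 papers
    "quantity-oriented": {5: 6},  # six HB6 papers
}

for name, pubs in authors.items():
    print(name,
          "geometric:", score(pubs, geometric),
          "arithmetic:", score(pubs, arithmetic))
```

In this toy case the quality-oriented author keeps the same score under both schemes, while the score of the quantity-oriented author changes by a large factor and the order of the two authors reverses.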

Results
The robustness measures that result from a comparison between the Handelsblatt ranking and its modifications due to alterations of the aforementioned internal parameters are reported in Table 7 for the life's work ranking and in Table 8 for the recent research performance ranking. Because there are no generally accepted guidelines or thresholds above which robustness measures indicate stability, we leave it to the reader to judge whether or not the reported numbers show that the ranking is robust.
Co-author weight Based on all 2,523 researchers with non-zero publications, and independently of whether all or only their most recent publications are considered, changing the co-author weights to the formerly used version 2/(n + 1) results in rank and score correlations that exceed 96 % (see first row of Tables 7 and 8). This indicates that less than 2 % of rank comparisons reverse to the opposite. However, the modification of co-author weights more strongly impacts the ranks of top performers: Rank correlations no longer exceed 85 % and even drop to 59.1 % in the top 25 list of the recent research performance ranking. Two researchers drop out of the top 10 list.
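The step from a rank correlation to a share of reversed comparisons can be written out as follows. This is a sketch that assumes Kendall's rank correlation without ties, where C and D denote the numbers of concordant and discordant pairs of researchers; with tied ranks the relation holds only approximately.

```latex
\tau = \frac{C - D}{C + D}
\quad\Longrightarrow\quad
\frac{D}{C + D} = \frac{1 - \tau}{2},
\qquad\text{e.g.}\quad
\tau = 0.96 \;\Rightarrow\; \frac{D}{C + D} = \frac{1 - 0.96}{2} = 0.02 .
```

A rank correlation above 96 % therefore corresponds to fewer than 2 % of pairwise comparisons being reversed, and a rank correlation below 50 % to more than 25 %.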
Weight of publication type With respect to including all types of journal publications, the Handelsblatt life's work ranking (recent research ranking) shows rank and score correlations that almost all exceed 85 % (90 %) in all (sub)samples (see second row of Tables 7 and 8, respectively); only as far as the top 25 researchers are concerned does Kendall's rank correlation fall to 68 % (83.6 %). Thus, the vast majority of comparisons remain stable, which is consistent with what we expected as only 4.8 % of all publications are comments, editorials, corrections or book reviews. The median absolute rank change based on all 2,523 researchers indicates that 50 % of them decrease or increase their position in the life's work ranking (recent research ranking) by 17 (10) ranks at most.
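The robustness measures used in this section can be sketched in plain Python. The code below is an illustration under stated assumptions, not the computation behind Tables 7 and 8: the function names and the toy scores are hypothetical, tied researchers are assumed to share the best rank of their tie, and Kendall's rank correlation is computed in its simplest form (tau-a), in which tied pairs count as neither concordant nor discordant.

```python
from itertools import combinations
from statistics import median


def competition_ranks(scores):
    """Rank 1 = highest score; tied researchers share the best rank of the tie."""
    ordered = sorted(scores.values(), reverse=True)
    return {name: ordered.index(s) + 1 for name, s in scores.items()}


def median_abs_rank_change(base, variant):
    """Median of |rank before - rank after| over all researchers."""
    return median(abs(base[n] - variant[n]) for n in base)


def top_k_dropout(base, variant, k):
    """Share of the original top-k group that is no longer in the new top-k."""
    top_base = {n for n, r in base.items() if r <= k}
    top_variant = {n for n, r in variant.items() if r <= k}
    return len(top_base - top_variant) / len(top_base)


def kendall_tau(base, variant):
    """(concordant - discordant) / all pairs; a simple tau-a variant."""
    total, pairs = 0, 0
    for a, b in combinations(list(base), 2):
        pairs += 1
        prod = (base[a] - base[b]) * (variant[a] - variant[b])
        total += (prod > 0) - (prod < 0)
    return total / pairs


# Hypothetical scores before and after a parameter variation.
base_scores = {"A": 5.0, "B": 4.0, "C": 3.0, "D": 2.0, "E": 1.0}
variant_scores = {"A": 4.0, "B": 5.0, "C": 1.0, "D": 3.0, "E": 2.0}

base = competition_ranks(base_scores)
variant = competition_ranks(variant_scores)

print("Kendall's tau:", kendall_tau(base, variant))
print("median absolute rank change:", median_abs_rank_change(base, variant))
print("top-3 dropout share:", top_k_dropout(base, variant, 3))
```

The quadratic pair loop is adequate for a few thousand researchers; for larger samples a library routine would be the more practical choice.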
Journal weights In both personal rankings, combining journal quality categories yields score correlations that almost all amount to more than 95 %, independently of whether all or only the top researchers are considered. Rank correlations take values that are on average 11 percentage points lower and reach a minimum of 55.6 % as far as the impact of merging the first three journal quality levels on the top 10 list of the recent research performance ranking is concerned. In comparison to other parameter variations, the percentage of researchers that drop out of the top groups (which does not exceed 12 %) as well as the median change in ranks is rather moderate. When taking into account the lowest journal category, the recent research performance ranking shows greater stability than the life's work ranking in almost all subsamples. This might indicate a change in the researchers' publication strategies across time because weighting of low tier publications causes less variation in the ranking if only researchers' most recent publications instead of their lifetime publications are taken into account. If one, e.g., assigns the lowest journal category a value of only 0.05 (instead of 0), the rank correlation is still above 90 % in the recent research performance ranking of all 2,523 researchers but drops to approximately 86 % if researchers' entire publication records are considered. This means that almost 7 % of comparisons between specific researchers reverse. Note that Handelsblatt also considers researchers with zero points, meaning that of 2,523 considered researchers, 303 received no points and had to share the last rank. One could argue that the majority of these researchers would obtain positive scores when weighting the lowest journal category and that the changes in the life's work ranking are mainly explained by the resulting new ranks.
To verify this, we additionally calculate correlation measures based on only those researchers that received positive scores in the Handelsblatt life's work ranking, i.e., we exclude the 303 researchers that shared the last rank. The results are provided in Table 9 in the Appendix and show that correlation coefficients remain stable, indicating that the researchers with a zero score do not cause the decline in rank correlation.
Finally, we more generally varied journal weights by arranging them in arithmetic and geometric progressions. For the life's work ranking (recent research performance ranking), Table 7 (Table 8) reveals that the rank correlation decreases to only 67.7 % (76.7 %) if all 2,523 researchers are taken into account and more equal weights are assigned to all eight journal quality levels, i.e., an arithmetic progression with HB1 = 1 and c = 0.1. Corresponding rank correlations in the top 250 rankings even drop below 50 %, which indicates that more than 25 % of rank comparisons reverse. If, in contrast, one better rewards quality by attaching the greatest relative importance to publications in first tier journals (i.e., geometric progression with HB1 = 1 and g = 0.5), rank correlations based on all 2,523 researchers exceed 80 % but again fall below 50 % in some subsamples. Also, the composition of the top groups and the median rank change show that both Handelsblatt personal rankings react more sensitively to these systematic modifications of journal weights than to the journal weight variations described before. One possible explanation is that a researcher's number of journal publications per quality level is correlated between adjacent quality levels. Therefore, weight alterations that simultaneously affect the weights of the highest as well as lowest categories should have the largest impact on the rankings.

Conclusion
Handelsblatt has ranked more than 2,500 business economists according to their life's work (based on their total journal publication output) as well as according to their recent research performance (based on their journal publications within the last 5 years). Our aim is to analyze the impact of the underlying internal parameters on these personal rankings. In particular, we investigate to what extent they are robust with respect to changes in internal parameters, such as the splitting of scores in case of co-authorships, inclusion of all types of journal publications, and journal weighting.
We find that there are differences in journal publication intensity between researchers that persist even when internal parameters are varied. However, individual performance evaluations based on a researcher's specific rank can strongly depend on the internal parameters of the ranking. Specifically, our findings suggest that the Handelsblatt rankings of business economists tend to be more robust with respect to the allocation of scores among co-authors and full integration of all types of journal publication (i.e., book reviews, comments, corrections and editorials besides full original articles) than to systematic changes in journal weights. The underlying weighting scheme directly determines whether quality or quantity is rewarded better in the rankings. In particular, Handelsblatt applies a weighting scheme such that, e.g., the quality of the second best journal category is considered 70 % of the quality of first tier journal publications. Our study investigates the effect of several alternative weighting schemes on both Handelsblatt personal rankings. If one applies a scheme that lays more emphasis on quality, e.g., such that the weight of a particular category is always double the weight of the next lower category, the rank correlation based on all researchers amounts to 83.6 % in the life's work ranking (85.9 % in the recent research performance ranking), which indicates that about 8 % (7 %) of rank comparisons reverse to the opposite. However, if, e.g., solely the top 25 performing researchers are considered, the corresponding rank correlations fall below 50 %, and 20 % of researchers even drop out of the top 25 list. In general, our analyses show that rank and score correlations tend to be lower if only a group of top performers (e.g., top 10, 25, 100, 250, 500) is taken into account. Future research as well as the (ongoing) discussion in the academic community will ascertain whether or not such numbers allow the conclusion that the ranking is robust.
In general, when comparing researchers' publication output in journals, e.g., in the course of filling a vacancy, quantitative measures cannot replace an evaluation of the papers' actual content. In this respect, our result is consistent with Krapf (2011), who has already claimed that the actual rank assigned by Handelsblatt is not suitable (and not intended to be suitable) for evaluating individual researchers. However, ''managing by numbers'' is still common, and Germany's academic community is not the only one to have raised concerns about the danger of a mechanistic use of ratings when comparing applicants' profiles [for the UK see Hussain (2011)]. If one nevertheless relies on ''managing by numbers'', researchers' metric score is a more robust estimate of their publication output than their rank in the Handelsblatt rankings; the comparison of both correlation coefficients used in this study shows that in most cases, the correlation of scores takes higher values than the correlation of ranks.
This study has a number of limitations. Firstly, we have not addressed all the problems that are generally associated with rankings. For example, it is possible that the correlation between reviewers' evaluations is rather low, and that the quality classification of a journal can be manipulated (e.g., by coalitions). Moreover, another restriction of the underlying data is that it contains information neither on the number of pages per publication nor on other types of publications such as books or articles in outlets other than journals. Therefore, an analysis of a researcher's overall publication record and its impact is not possible.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Appendix
See Table 9. This table contains score correlations according to Pearson's ρ and rank correlations (Kendall's τ) between the ranking before and after varying the journal weights. The first two columns are based on all 2,523 researchers with non-zero publications that were considered by the Handelsblatt life's work ranking. In contrast, the last columns show correlation measures based on 2,220 researchers with non-zero publications who obtained positive scores in this ranking. Thus, 303 researchers sharing the last rank (non-zero publications but zero score) are excluded.