Quantitative analysis of automatic performance evaluation systems based on the h-index

Hauer, Marc P.; Hofmann, Xavier C. R.; Krafft, Tobias D.; Zweig, Katharina A.

doi:10.1007/s11192-020-03407-7

Quantitative analysis of automatic performance evaluation systems based on the h-index

Open access
Published: 14 March 2020

Volume 123, pages 735–751, (2020)
Cite this article

Download PDF

You have full access to this open access article

Scientometrics Aims and scope Submit manuscript

Quantitative analysis of automatic performance evaluation systems based on the h-index

Download PDF

2358 Accesses
5 Citations
Explore all metrics

Abstract

Since the h-index has been invented, it is the most frequently discussed bibliometric value and one of the most commonly used metrics to quantify a researcher’s scientific output. The more it is increasingly gaining popularity to use the metric as an indication of the quality of a job applicant or an employee the more important it is to assure its correctitude. Many platforms offer the h-index of a scientist as a service, sometimes without the explicit knowledge of the respective person. In this article we show that looking up the h-index for a researcher on the five most commonly used platforms, namely AMiner, Google Scholar, ResearchGate, Scopus and Web of Science, results in a variance that is in many cases as large as the average value. This is due to the varying definitions of what a scientific article is, the underlying data basis, and different qualities of the entity recognition problem. To perform our study, we crawled the h-index of the worlds top researchers according to two different rankings, all the Nobel Prize laureates except Literature and Peace, and the teaching staff of the computer science department of the TU Kaiserslautern Germany with whom we additionally computed their h-index manually. Thus we showed that the individual h-indices differ to an alarming extent between the platforms. We observed that researchers with an extraordinary high h-index and researchers with an index appropriate to the scientific career path and the respective scientific field are affected alike by these problems.

Comprehensive evaluation of h-index and its extensions in the domain of mathematics

Article 24 January 2019

Multiple versions of the h-index: cautionary use for formal academic purposes

Article 20 February 2018

$$h_u$$ -index: a unified index to quantify individuals across disciplines

Article 25 February 2021

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

Using online services to get more information about the quality of a job applicant or an employee is increasingly gaining popularity, as information gathered through platforms like, for example, Facebook or LinkedIn can be automatically processed without the need of investing personnel expenditure (Sattelberger 2015). Furthermore, the process suggests neutrality and fairness (Gapski 2015), implying an objective treatment of the individual, since every person is evaluated by the same success measures. Using a bibliometric measure to evaluate the performance of scientists, for example, is common practice (Nature 2017). One of the most frequently used of such metrics is the h-index (Ball 2007; Saleem 2011). It computes a value depending on the number of papers published by a scientist and their respective impact on other researches such that:

A scientist has index h, if h of his or her $N_{{\text {p}}}$ papers have at least h citations each and the other ($N_{{\text {p}}}-h$) papers have $\le h$ citations each (Hirsch 2005).

The validity of such a one-dimensional evaluation of scientists relies heavily on a complete list of all publications the scientist authored, as well as all publications which cite any of those. Since manually gathering such a list is impossible, the utilization of citation databases is inevitable. The most popular platforms offering the h-index are: AMiner (2017a), Google Scholar (2017), ResearchGate (2017), Scopus (2017) and Web of Science (2017). Searching for the h-index of multiple names on all of those platforms it can be seen that a number of inconsistencies between the different platforms like wrong assignments of works to an author (entity resolution errors), missing, doubled or false database entries and many more sources of error (MacRoberts and MacRoberts 1989) can lead to huge discrepancies in the resulting h-indices (and make it almost impossible to correctly estimate the exact value). Additionally, it is generally known that citation behaviour often varies greatly between different scientific disciplines. Various studies have already proven this, but the investigations are either directed at a single platform where the names and associated scientific disciplines can be extracted from the citation database itself (Batista et al. 2006), which limits the validity of the results due to the error-proneness of such an approach, or to multiple platforms with a very small set of names (Bar-Ilan 2008).

For promotion committees at a university looking up the h-index value of a potential employee on at least one of the platforms, it is important to know to what extent those errors may influence the results. Therefore, we compare the aforementioned five platforms offering the computation of the h-index, following four big research questions:

RQ1 (Differences between platforms): To what extent are there differences of h-index value distributions for a given sample between the platforms?
RQ2 (Difference between values for the individuals): For a given set of persons, how big are the individual discrepancies of h-index values?
RQ3 (Differences between scientific disciplines): How much do the h-indices for Nobel Prize winners differ depending on their scientific discipline on the various platforms?
RQ4 (Comparison to ground truth): Of what magnitude are the differences compared to the persons manually assessed h-indices?

To answer these questions, we first introduce the platforms in “The platforms” section. We then present the results of our study for which we gathered the h-indices on said platforms for different sets of names and determine the aggregated gaps in “Study I” section. Afterwards we evaluate how these differences are reflected in the h-indices of the nobel prize winners when we differentiate by scientific discipline (“Study II” section). Finally, we compute the reasonably accurate h-indices for a test group of 25 names by hand in order to compare them to the results delivered by the platforms and to inspect the respective deviations, which is presented in “Study III” section. In “Threats to validity” section, we will analyze potential threats to validity and in “General discussion” section, we will conclude the results in a general discussion.

The platforms

In this section, we discuss the platforms used in our studies and their individual properties. In Table 1 those properties are summarized and extended by additional common aspects. We are especially interested in the question of whether scientists create their own accounts or whether an account can also be automatically created by the system or by someone else.

AMiner (AM) started as a research project led by Dr. Jie Tang at Tsinghua University, China. Based on the comparably lower number of unique surnames in China than in other countries, one of its primal goals is to differentiate between multiple people with the same name, but at the same time it ignores the fact that a person may publish under several name variations (e.g. different name abbreviations) (Tang et al. 2008). The database is constructed by crawling a variety of different web-sources, which leads to automatically constructed profiles without a permission or notice. This procedure and additional manually constructed profiles may lead to duplicated and non-scientific entries. The platform also allows the user to manually correct mistakes and complete profiles. Manual modification does not require any form of authentication or validation and therefore allows easy manipulation of profiles.

Google Scholar (GS) is a free to use platform for scientific investigation provided by Alphabet. It screens websites for a certain kind of formatting and checks the indexing of online documents to decide whether they are scientific publications. Due to the susceptibility to errors of this way of data extraction, many unscientific contributions are listed on GS (Petersen et al. 2014). A profile has to be constructed manually, automatically constructed profiles are created only for deceased people such as Sigmund Freud or Albert Einstein.

ResearchGate (RG) is a social network for researchers and scientists, focusing on the person as central entity instead of their work. In general, accounts can be manually created by a scientist or automatically constructed. Only on manually created accounts, the h-index is shown. If such an account is abandoned later, the indexing algorithm automatically adds further papers and increases the h-index accordingly, though the respective person can configure a mandatory manual validation.

Scopus (SP) is a platform that offers a number of services which are strongly limited depending on whether the user has a purchased or free access. Since the free version provides all functionality important for this study, the following description focuses on the latter. The database is constructed by extraction of bibliometric information from a specific set of journals (on the platform this set is called content coverage) this set is publicly visible. In case of missing publications or mistakes, the automatically constructed profiles cannot be edited by the user, a time-consuming support system has to be used instead. Whether and to what extent user feedback will actually influence the database is unclear.

Web of Science (WoS) is a database set up in a similar way as SP, by screening publications from a limited set of journals. It has to be noted that the platform does not support profiles of any kind to provide bibliometric information of authors. Instead, it allows the dynamic construction of a so-called citation report, which contains any publication released by a person with the name searched for. From this report, falsely assigned publications can be excluded manually to correct the displayed h-index. This process leads to a comparatively low error rate. However, the strongly limited selection of journals considered results in rather low values (Piwowar 2013; Nature 1965).

All of the presented platforms lack transparency regarding the limited validity of the h-indices they provide. Even well-known basic aspects that have to be considered when it comes to the h-index, for example the incomparability of values between different scientific fields, are nowhere to be found, let alone explanations about less-known potential sources of errors. Though all of them list an explanation about how they gathered their data, they do not note that they cannot guarantee for complete publication coverage. Additionally, none of the platforms seems to consider a differentiation between scholarly articles and others beyond the method they use for crawling the data, at least there is no information available indicating such a differentiation. Last but not least the crucial information to what extent authors actually administrate their profile is not visible. Thus, we were interested in how much the h-indices would differ for a wide range of scientists. In the following, we will describe three different studies consisting of various lists of scientists for which we gathered their h-index, and the resulting variance of a person’s h-index over the different platforms.

Table 1 Important aspects of the platforms

Full size table

Study I

Description

In order to find out the differences between the platforms as posed in RQ1 we collected several sets of names and developed a scraper visiting each platform once per name, extracting the respective h-index and saving it to a database for elaboration. Subsequently, the maximum deviation of the average h-index of individuals compared to the h-index found on the platforms was analyzed in order to address RQ2.

Method

Step 1 Setup sets of names: To encounter multiple possible threats to validity we chose four sets of names with different criteria to analyze as described in the following.

${D}_{{\text {topGS}}}$ contains 1360 names of the researchers with an h-index of at least 100 according to the Webometrics Ranking of World Universities initiative of the Cybermetrics Lab research group in Spain (Webometrics 2017), who get their data from GS. Since some of the names are only available in an abbreviated form, we removed them to increase the chance of correct evaluation. After doing so, 1295 names were left.
${D}_{{\text {topAM}}}$ contains 139 names of the researchers with an h-index of at least 100 according to AM (2017b).
${D}_{{\text {nobel}}}$ contains the 632 names of the researchers who won a Nobel Prize in chemistry, physics, medicine or economics (Nobel 2017).
${D}_{{\text {TUKL}}}$ contains the 56 names of the current docents of a technical department at a German university.

Step 2 Collect h-indices: Most of the platforms do not offer an API to allow easy automatic access. As a consequence, the scraper visits the respective websites via browser by using the Selenium testing environment (Razak and Fahrurazi 2011) to search each name from a sequentially loaded subset of names and extracts the returning h-indices for the first result found respectively, to simulate an employers behavior. The h-indices are stored in an Elasticsearch database, which allows easy access, modification and visualization via Kibana.

Step 3 Refine the results: Since the crawling procedure is prone to failure due to occasional loading errors, delays and further problems beyond our control, some subsets delivered very little results. To avoid a lack of applicable data those subsets were reprocessed. Names with accented or special characters that might not be processed correctly have been excluded from the study as well.^{Footnote 1}

Step 4 Evaluation: Sometimes, a name cannot be found on all platforms discussed. Therefore we restrict the database for our analysis to those names which were found on all platforms (see Table 2). The found h-indices are split by platform and visualized as box plots for each dataset, respectively.

Table 2 The number of names in each dataset, for which an h-index on all platforms could be found

Full size table

Results

The box plots in Fig. 1 unveil notable differences between the platforms for each set of names. The key values for ${D}_{{\text {topGS}}}$ as presented in Table 3 show that the average h-index differs up to a factor of 3 between AM and GS, the median between RG and GS even by a factor of 12. The Inter Quartile Ranges (IQR) for the values from AM and RG are at least twice as high of those for GS, SP and WOS (see Fig. 1 and Table 3).

Except for AM, the plots for ${D}_{{\text {topAM}}}$ look very similar (see Fig. 2 and Table 4).

For ${D}_{{\text {nobel}}}$, the distribution between the plots look much like those for ${D}_{{\text {topGS}}}$ except that they are lower (see Fig. 3 and Table 5). This could be explained by the different criteria for the selection of scientists (citation based on ${D}_{{\text {topGS}}}$ vs. Nobel Prize based on ${D}_{{\text {nobel}}}$).

${D}_{{\text {TUKL}}}$ focuses on scientists from only one department of a university. Accordingly, the resulting values are lower and lead to different distributions. While h-indices from GS are still greater than the ones from other platforms, AM, RG and SP yield similar values compared to each other (see Fig. 4 and Table 6).

To inspect the worst-case impact of the findings for individuals, we examined the maximum and minimum h-indices from any of the platforms for each of the 1052 names in all datasets sorted by their average h-index on all five platforms (see Fig. 5). It turns out that the discrepancies between maximum and minimum are considerably high.

Table 3 Key values describing the results of ${D}_{{\text {topGS}}}$

Full size table

Table 4 Key values describing the results of ${D}_{{\text {topAM}}}$

Full size table

Table 5 Key values describing the results of ${D}_{{\text {nobel}}}$

Full size table

Table 6 Key values describing the results of ${D}_{{\text {TUKL}}}$

Full size table

Discussion

Scientific evidence that different platforms yield different results when it comes to the h-index values and therefore address RQ1 has already been provided by others (Falagas et al. 2008). Consequently, we did not expect to find very similar h-index values, however, the extent of differences was considerably higher than expected. Especially the h-index values from GS generally seem to be higher than on the other platforms. The values found for ${D}_{{\text {topAM}}}$ are approximately as high for AM as for RG. This is due to the fact that the dataset focuses on names with a high h-index on AM itself. WOS is considered a well-known and well-used tool for scientific literature research, but due to its relatively small coverage of journals (Reuters 2008), the gathered h-indices are comparably low.

The full scope of the individual discrepancies of h-index values (RQ2) becomes apparent when considering the individual maximum and minimum h-index value that can be found on any of the five platforms. Naturally this discrepancy is somehow related to the actual h-index, which however can only be calculated with great effort. Therefore we have chosen the average h-index of a person on all five platforms as x-axis and plotted the minimum and maximum against it (see Fig. 5). The results clearly show that the discrepancies between the h-indices on the different platforms are enormous even for scientists with a small h-index. The potential harm of consulting the h-index on the wrong platform is considerably high, since an h-index that is too small can have a negative impact on a scientist’s career, whereas an h-index that is too large can lead to unfair competition and thus to an advantage for scientists who deserve it less than their competitors. In the area of individual observation, further studies in quantitative terms do not exist yet. Thus, our results harden the impression that the individual database issues have a higher negative impact than assumed by studies examining small datasets like Bar-Ilan (2008).

Since a comparison with the approximately correct real h-indices disambiguates the magnitude of differences even further, we elaborate them for the scientists in dataset ${D}_{{\text {TUKL}}}$ in study III.