The h-index as an almost-exact function of some basic statistics

As is known, the h-index, h, is an exact function of the citation pattern. At the same time, and more generally, it is recognized that h is “loosely” related to the values of some basic statistics, such as the number of publications and the number of citations. In the present study we introduce a formula that expresses the h-index as an almost-exact function of some (four) basic statistics. On the basis of an empirical study—in which we consider citation data obtained from two different lists of journals from two quite different scientific fields—we provide evidence that our ready-to-use formula is able to predict the h-index very accurately (at least for practical purposes). For comparative reasons, alternative estimators of the h-index have been considered and their performance evaluated by drawing on the same dataset. We conclude that, in addition to its own interest, as an effective proxy representation of the h-index, the formula introduced may provide new insights into “factors” determining the value of the h-index, and how they interact with each other.


Introduction
The purpose of this paper is to present a formula with which to determine (estimate) the h-index, h, under incomplete information conditions (IIC). By IIC we mean the situation in which, for whatever reason, we do not know the whole set of citation data, that is, the entire citation profile that would allow us to obtain the exact value of the h-index. This is the case, for example, when only a few “basic” citation statistics (other than the h-index) are published or known to us.
To be concrete, we will refer to simple citation indicators (to use the words of Hirsch (2005), “single-number criteria commonly used to evaluate scientific output”) as:
1. total number of citations, $C$;
2. total number of citations for the $t$ ($t \in \{1, 2, 3, \ldots\}$) most-cited publications, $C_t$; thus $C_t = \sum_{i=1}^{t} c_{(i)}$, where $c_{(i)}$ represents the number of citations to publication $i$, and where publications are ranked in decreasing order of the number of citations: $c_{(1)} \ge c_{(2)} \ge \cdots \ge c_{(T)}$;
3. total number of publications, $T$;
4. total number of “significant” publications, that is, those with at least a predetermined number of citations $k$ each ($k \in \{1, 2, 3, \ldots\}$), $T_k$.
In this paper we focus on these indicators in their simplest versions, that is: $C$, $C_1$, $T$ and $T_1$. The purpose of the analysis is twofold: to estimate the h-index (when it cannot be determined directly from the data) and, at the same time, to identify the main factors which influence the level of the h-index. A crucial question is therefore the extent to which the h-index can be satisfactorily predicted from knowledge of only the above basic statistics, i.e. under IIC.
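As an illustration of the four basic statistics and of the exact h-index they are meant to approximate, the following minimal sketch computes them from a full citation profile. The function names and the sample profile are hypothetical and chosen only for this example.

```python
def basic_statistics(citations):
    """Return (C, C1, T, T1) for a list of per-paper citation counts."""
    ranked = sorted(citations, reverse=True)
    C = sum(ranked)                        # total number of citations
    C1 = ranked[0] if ranked else 0        # citations of the most-cited paper
    T = len(ranked)                        # total number of publications
    T1 = sum(1 for c in ranked if c >= 1)  # papers cited at least once
    return C, C1, T, T1


def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h


profile = [10, 8, 5, 4, 3, 0, 0]   # hypothetical citation profile
print(basic_statistics(profile))   # (30, 10, 7, 5)
print(h_index(profile))            # 4
```

Under IIC only the first function's output would be available; the second requires the whole profile.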
More formally, we are searching for a formula
$$\hat{h} = \hat{h}(S_1, \ldots, S_r), \quad 1 \le r \le 4, \qquad (1)$$
with $S_j \in S$, $1 \le j \le r$, where $S = \{C, C_1, T, T_1\}$. Note that the formula $\hat{h}$ can be interpreted as a genuine estimator of the h-index, $h$, i.e. $\hat{h} \cong h$, because it does not depend on values of unknown parameters.
Possible estimators of the h-index under IIC can be found in the literature:
- A very simple proxy for the h-index is given by $h_H = \sqrt{C/a}$. This model, which can be traced back to Hirsch (2005), is not a genuine estimator of the h-index because $h_H$ is still a function of an unknown parameter, $a$, and the formula itself does not specify how to estimate this parameter in terms of the above basic statistics. Nevertheless, an estimator for the h-index can be obtained by substituting the unknown parameter $a$ with a fixed constant (Hirsch found “empirically” that $a$ lay between 3 and 5). Redner (2010) found that “$\sqrt{C}$ is essentially equivalent to the h-index, up to an overall factor that is close to 2” (put otherwise, he found that the ratio $\sqrt{C}/2h$ has an empirical distribution “sharply peaked about 1”). This suggests the approximating formula $h_R = \sqrt{C}/2$, with $r = 1$, $S = \{C\}$, which we could then call the Redner formula, probably the simplest estimator of the h-index under IIC.
- While $h_R$ is a model-free proxy for the h-index, more elaborate solutions have been attempted in the literature by assuming specific probability distributions for the citation rate. For example, a formula that follows model (1), with $r = 4$, has recently been introduced by Bertoli-Barsotti and Lando (2017) (denoted $\hat{h}^{(1)}_W$ in what follows). In that formula, $\tilde{m}_1 = (C - C_1)/(T_1 - 1)$ is nothing but a “trimmed” version of the simple sample mean $C/T_1$, and $W(\cdot)$ represents the so-called Lambert-W function (Corless and Jeffrey 2015). The Lambert-W function is the function $W(z)$ satisfying $z = W(z)\,e^{W(z)}$, and can be computed using mathematical software, for example the Mathematica® software package (Wolfram Research, Inc. 2014) or the R statistical computing environment (R Development Core Team 2012). The use of a “trimmed” version of the sample mean is a simple technique with which to make the sample mean more robust with respect to a single outlier, namely a single highly-cited paper that could substantially inflate the mean.
  The formula of Bertoli-Barsotti and Lando (2017) is based on the assumption that the citation rate of papers (cited at least once) follows a shifted-geometric distribution (SGD) with parameter $Q$ ($Q > 1$) and probability function $p(y) = Q^{-y}(Q - 1)^{y-1}$, $y = 1, 2, \ldots$. Here $p(y)$ represents the probability of observing the number of citations $y$ of a paper (cited at least once), while $Q$ represents the expectation of the SGD. Then, $\hat{n}(y) = T p(y)$ expresses the “expected” (estimated) number of articles with $y$ citations.
- As an alternative approach, an important class of models is the one defined by the formula
  $$\hat{h} = c_0\, C^{2/3}\, T^{-1/3}, \qquad (4)$$
  where $c_0$ is a fixed and known positive constant (Schubert and Glänzel 2007). From model (4), specific ready-to-use formulas are obtained by taking, in particular: (a) $c_0 = 4^{-1/3}$ (Iglesias and Pecharroman 2007; see also Ionescu and Chopard 2013; Panaretos and Malesios 2009; Vinkler 2009, 2013), (b) $c_0 = 0.75$ (Schubert and Glänzel 2007), (c) $c_0 = 1$ (Prathap 2010a, b). Following the notation of Bertoli-Barsotti and Lando (2017), let $h_{SG}(c_0) = c_0\, C^{2/3}\, T^{-1/3}$. Note that these formulas are functions of the data only through two of the four basic statistics ($r = 2$, $S = \{C, T\}$), and they are based on the assumption of a continuous-type distribution. The formula $h_{SG}(1)$ is also known as the “p-index” (Prathap 2010a, b).
- Another approach which deserves mention for completeness, even if it does not yield a ready-to-use formula, is that proposed by Iglesias and Pecharroman (2007). Adopting a different perspective, i.e. the rank-size formulation, and starting from the assumption that the number $c_{(k)}$ of citations of the paper of rank $k$ is approximately distributed following a stretched exponential type PDF (not to be confused with a Weibull PDF, see below), Iglesias and Pecharroman suggest deriving a formula for the h-index as the solution of the corresponding equation for $h$. Interestingly, the solution may be derived in closed form (even if the authors did not realize this) by means of the Lambert-W function. Unfortunately, this solution still depends on the value of an unknown free parameter, specifically $b$ [see their Eqs. (16) and (17)]. Hence, their formula could become a genuine estimator of the h-index, of the form $\hat{h} = \hat{h}(C, T, T_1)$, $r = 3$, only by constraining the unknown parameter $b$ to assume a fixed (but arbitrary) value $b_0$.
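The two simplest families of estimators listed above, the Redner formula $\sqrt{C}/2$ and the Schubert-Glänzel class $c_0 C^{2/3} T^{-1/3}$, can be sketched as follows. The function names are hypothetical; the values of $C$ and $T$ are those reported in this paper for the Journal of the American Statistical Association.

```python
import math


def h_redner(C):
    """Redner-type estimator: sqrt(C) / 2."""
    return math.sqrt(C) / 2.0


def h_schubert_glanzel(C, T, c0):
    """Schubert-Glanzel-type estimator: c0 * C**(2/3) * T**(-1/3)."""
    return c0 * C ** (2.0 / 3.0) * T ** (-1.0 / 3.0)


C, T = 5231, 663
print(h_redner(C))
# The three choices of c0 quoted in the text.
for c0 in (4 ** (-1.0 / 3.0), 0.75, 1.0):
    print(c0, h_schubert_glanzel(C, T, c0))
```

Both functions use only one or two of the four basic statistics, which is what makes them convenient baselines for a four-statistic formula.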

A new formula for the h-index under the Weibull assumption
Let $N(y)$ be the empirical citation distribution function, i.e. the function giving the number of papers which have been cited $y$ times at most. Then, in particular, $n(y) = N(y) - N(y - 1)$, for $y = 1, 2, \ldots$, with $n(0) = N(0)$, is the number of papers that have been cited exactly $y$ times. We assume that the citation rate of a paper is a random variable $X$ that follows a two-parameter Weibull distribution, with CDF
$$F(x; a, b) = 1 - \exp\{-a x^{b}\}, \quad x > 0,\ a > 0,\ b > 0.$$
The probability density function is then
$$f(x; a, b) = a b\, x^{b-1} \exp\{-a x^{b}\}$$
for $x > 0$, and 0 otherwise. The Weibull distribution is a rather flexible model: the PDF is reverse J-shaped for $b \le 1$ and bell-shaped otherwise. Since our assumption involves a continuous distribution, a suitable discretization rule is needed. In particular, for every $y$, $y = 0, 1, 2, \ldots$, let
$$T\,[1 - F(y; a, b)] = T \exp\{-a y^{b}\}$$
express the “expected” number of articles with at least $y$ citations. Hence, $\hat{n}(y) = T\,[F(y + 1; a, b) - F(y; a, b)]$ represents the expected number of articles with exactly $y$ citations, and $\hat{N}(y) = T F(y + 1; a, b)$ the expected number of papers which have been cited $y$ times at most. As a special case, $\hat{n}(0)/T = F(1; a, b) = 1 - e^{-a}$ can be interpreted as a model for the so-called uncitedness factor, $(T - T_1)/T = n(0)/T$.
Under this model, the h-index corresponds to the value $x$ at which the expected number of articles with at least $x$ citations equals $x$, that is, to the solution of the equation
$$T \exp\{-a x^{b}\} = x.$$
Replacing $a x^{b}$ with $t$ in the equation, we have
$$a T^{b} = t\, e^{b t}.$$
Thus, replacing $bt$ with $s$, we obtain the equivalent equation
$$a b T^{b} = s\, e^{s}.$$
Hence, by definition of the above-mentioned Lambert-W function, we find the solution $s = W(a b T^{b})$. Since $x = (s/(ab))^{1/b}$, we finally arrive at the formula
$$h = \left( \frac{W(a b T^{b})}{a b} \right)^{1/b}.$$
An empirical counterpart of the above theoretical model for the h-index may now be obtained by substituting the parameters $a$ and $b$ with estimates, $a^{*}$ and $b^{*}$, that depend on the citation data only through the basic statistics $C$, $C_1$, $T$ and $T_1$.
This can be done firstly by using the uncitedness factor to derive the equation $1 - e^{-a} = (T - T_1)/T$, which can be solved (under the assumption $0 < T_1 < T$) for the variable $a$ as
$$a^{*} = \ln(T/T_1),$$
an estimate of parameter $a$, and secondly by using the trimmed sample citation rate, $(C - C_1)/(T - 1)$, as an estimate of the expectation of $X$, that is $E(X) = a^{-1/b}\, \Gamma(1 + 1/b)$. Note that, by construction, our approximation slightly overestimates the true average number of citations, so that a correction for continuity by one-half is needed: $m^{*} = (C - C_1)/(T - 1) + 1/2$. We then find $b^{*}$ as the solution (method of moments) of the equation
$$(a^{*})^{-1/b}\, \Gamma(1 + 1/b) = m^{*}, \qquad (15)$$
which can be solved numerically. It should be noted that the existence and uniqueness of the solution of Eq. (15) are not always guaranteed a priori. Indeed, it can be proved that the necessary and sufficient condition for existence and uniqueness of the solution is $m^{*} > 1$ (see “Appendix”). We should then consider “out of range” the cases where $m^{*} \le 1$, and exclude them from the analysis. With $a$ and $b$ replaced by $a^{*}$ and $b^{*}$, we obtain the formula
$$h_{WW} = \left( \frac{W(a^{*} b^{*} T^{b^{*}})}{a^{*} b^{*}} \right)^{1/b^{*}},$$
where the suffix WW is motivated by the fact that the formula is based on a Weibull distribution and on the Lambert-W function.
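The two estimation steps can be sketched numerically as follows. This is a minimal illustration, assuming that $a^{*} = \ln(T/T_1)$, that $m^{*}$ is the trimmed mean $(C - C_1)/(T - 1)$ plus one-half, and that $b^{*}$ solves $(a^{*})^{-1/b}\,\Gamma(1 + 1/b) = m^{*}$. Bisection is used here only as one convenient root finder; the search interval, the tolerance and the function names are choices made for this sketch, not taken from the paper.

```python
import math


def log_weibull_mean(a, b):
    """Logarithm of E(X) = a**(-1/b) * Gamma(1 + 1/b)."""
    return -math.log(a) / b + math.lgamma(1.0 + 1.0 / b)


def estimate_parameters(C, C1, T, T1, b_lo=0.05, b_hi=50.0, tol=1e-10):
    """Return (a_star, m_star, b_star) from the four basic statistics."""
    if not 0 < T1 < T:
        raise ValueError("requires 0 < T1 < T")
    a_star = math.log(T / T1)            # from 1 - exp(-a) = (T - T1) / T
    m_star = (C - C1) / (T - 1) + 0.5    # trimmed mean plus continuity correction
    if m_star <= 1:
        raise ValueError("out of range: m* <= 1")
    target = math.log(m_star)
    lo, hi = b_lo, b_hi
    # The root must be bracketed: mean above m* at b_lo, below m* at b_hi.
    if not (log_weibull_mean(a_star, lo) > target > log_weibull_mean(a_star, hi)):
        raise ValueError("root not bracketed; widen [b_lo, b_hi]")
    # Bisection: for m* > 1 the model mean exceeds m* only left of the root.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_weibull_mean(a_star, mid) > target:
            lo = mid
        else:
            hi = mid
    return a_star, m_star, 0.5 * (lo + hi)


# Basic statistics reported in this paper for JASA (ISSN 0162-1459).
a_star, m_star, b_star = estimate_parameters(5231, 156, 663, 519)
print(a_star, m_star, b_star)
```

Working with the logarithm of the mean avoids overflow of the gamma function when $b$ is small.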

Analysis

Two datasets
This section empirically investigates the effectiveness of formula $h_{WW}$ as an estimate of the actual value of the h-index, $h$. We will compare estimates derived from $h_{WW}$ with the real values of the h-index. In order to facilitate possible comparisons with other formulas (see below), we choose to use the same two datasets as in Bertoli-Barsotti and Lando (2017), where the authors present an empirical study based on citation data obtained from two different sets of journals belonging to two different scientific fields: (1) the S&MM list and (2) the EE&F list.
Estimation of the h-index with the formula $h_{WW}$
Table 1 for the S&MM list and Table 2 for the EE&F list report, for each journal, identified by its ISSN code, the four basic statistics, $C$, $C_1$, $T$ and $T_1$, the h-index, $h$, as computed using the above procedure, and the value provided by the formula $h_{WW}$ in its rounded-off version $\langle h_{WW} \rangle$, that is, in symbols,
$$\langle h_{WW} \rangle = \lfloor h_{WW} + 0.5 \rfloor,$$
where $\lfloor \cdot \rfloor$ is the floor function (recall that the floor function of $x$ gives the greatest integer less than or equal to $x$). Note that, from an operational point of view, all estimating formulas of type (1) generate real numbers. However, for estimation purposes, these numbers should be rounded off to the nearest integer, not only in order to produce numbers in the same range of values as the h-index but also to avoid “false precision” (Hicks et al. 2015).
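The rounding rule, floor of the estimate plus one-half, can be sketched in one line. It is written out explicitly because Python's built-in `round` resolves ties to the nearest even integer, which is a different rule. The function name is hypothetical.

```python
import math


def round_half_up(x):
    """Rounded-off version of an estimate: floor(x + 0.5)."""
    return math.floor(x + 0.5)


print(round_half_up(30.9))          # 31
print(round_half_up(2.5), round(2.5))  # 3 2
```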
To give an example illustrating the calculation of this estimate, let us consider the case of the Journal of the American Statistical Association (ISSN 0162-1459, from the S&MM list). We have $C = 5231$, $C_1 = 156$, $T = 663$ and $T_1 = 519$. Hence
$$a^{*} = \ln(663/519) = 0.2449, \qquad m^{*} = \frac{5231 - 156}{663 - 1} + 0.5 = 8.1662.$$
Then, substituting $a^{*}$ and $m^{*}$ into Eq. (15) we find
$$0.2449^{-1/b}\, \Gamma(1 + 1/b) = 8.1662,$$
which yields the solution $b^{*} = 0.7365$. Thus, since $W(0.2449 \cdot 0.7365 \cdot 663^{0.7365}) \approx 2.258$, we finally conclude that
$$h_{WW} = \left( \frac{2.258}{0.2449 \cdot 0.7365} \right)^{1/0.7365} \approx 30.9,$$
so that the rounded-off version of $h_{WW}$ in this case exactly coincides with the actual h-index, $h = 31$.
In Figs. 1 and 2 we plot for each journal, respectively for the S&MM list and the EE&F list, the empirical value of the h-index $h$ versus its predicted value by $h_{WW}$.
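The last step of the worked example for the Journal of the American Statistical Association can be checked numerically with the sketch below. It assumes the closed form $h_{WW} = (W(a^{*} b^{*} T^{b^{*}})/(a^{*} b^{*}))^{1/b^{*}}$ and evaluates the Lambert-W function with a small Newton iteration written for this example; a library routine such as SciPy's `scipy.special.lambertw` could be used instead. The function names are hypothetical.

```python
import math


def lambert_w(z, tol=1e-12, max_iter=100):
    """Principal branch W(z) for z >= 0, by Newton's method on w*exp(w) = z."""
    if z < 0:
        raise ValueError("this sketch only handles z >= 0")
    w = math.log1p(z)  # starting point at or above the root, since z >= 0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


def h_ww(a_star, b_star, T):
    """(W(a* b* T**b*) / (a* b*)) ** (1 / b*)."""
    ab = a_star * b_star
    return (lambert_w(ab * T ** b_star) / ab) ** (1.0 / b_star)


# Parameter values quoted in the worked example (JASA, S&MM list).
a_star, b_star, T = 0.2449, 0.7365, 663
estimate = h_ww(a_star, b_star, T)
rounded = math.floor(estimate + 0.5)

# Self-check: the estimate should solve T * exp(-a * x**b) = x.
residual = T * math.exp(-a_star * estimate ** b_star) - estimate
print(rounded)  # 31
```

The residual check confirms that the closed form really is a root of the defining equation, independently of how $W$ is computed.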

A comparative analysis of the accuracy
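To assess the accuracy of formula $h_{WW}$ comparatively, we considered, among several possible ready-to-use formulas, the following ones among those defined above: $h_R$, $\hat{h}^{(1)}_W$, $h_{SG}(4^{-1/3})$, $h_{SG}(0.75)$ and $h_{SG}(1)$ (see also Glänzel (2006), Malesios (2015), Schreiber et al. (2012) and Schubert and Glänzel (2007) for formulas $h_{SG}$). For each formula, (a) we computed, journal by journal, the absolute relative error of the rounded-off estimate with respect to the actual h-index, where $\langle \hat{h}^{(i)}_j \rangle = \lfloor \hat{h}^{(i)}_j + 0.5 \rfloor$ is the rounded-off version of formula $i$, $i = 1, 2, \ldots, 6$, for journal $j$; then, (b) as a criterion with which to assess the overall quality of the formula, we computed the mean absolute relative error (MARE),
$$\mathrm{MARE}_i = \frac{1}{n} \sum_{j=1}^{n} \frac{|\langle \hat{h}^{(i)}_j \rangle - h_j|}{h_j},$$
where $n$ is the number of journals in the list and $h_j$ is the h-index of journal $j$. The results are summarized in Table 3.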
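The MARE criterion can be sketched as below, assuming the usual definition: the average over journals of the absolute difference between the rounded-off estimate and the actual h-index, divided by the actual h-index. The function name and the sample values are hypothetical and are not taken from Tables 1 to 3.

```python
def mare(estimates, actual):
    """Mean absolute relative error of rounded estimates against true h values."""
    if len(estimates) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(e - h) / h for e, h in zip(estimates, actual)]
    return sum(errors) / len(errors)


# Hypothetical rounded estimates and actual h-indices for three journals.
print(mare([31, 18, 12], [31, 20, 12]))
```

Because each error is divided by the journal's own h-index, a miss of two units counts more for a journal with $h = 20$ than for one with $h = 40$.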

Conclusion
This paper has addressed the need to gain a better understanding of how simple citation metrics are related to the h-index, or rather, to a “good” proxy representation of the h-index. This also responds to the more basic requirement of “building bridges” between different types of known and available measures of impact (impact indicators) under IIC.
Unlike other studies (which consider the problem of defining a “model” of the h-index), our concern has not been to estimate the parameters (sometimes even considered at the unit level, i.e. single journal or single scientist; see e.g. Petersen et al. 2011) of a parametric model for the h-index under the assumption of knowing the entire citation pattern; rather, we addressed the quite different and more practical problem of finding a proxy representation of $h$ through a universal formula that depends only on a few summary statistics of the data. The formula $h_{WW}$ is “universal” in the sense that it gives a proxy representation of $h$ that holds for any given journal and any dataset.
The issue of determining an indicator under IIC is closely related to the search for a solution of the problem of recovering and comparing impact indicators from different databases. As a simple but significant example of this issue, we may cite the specific problem of estimating the impact factor (IF) of journals using the Google Scholar-based h-index as a predictor (Bertocchi et al. 2015).
As confirmed in our case study analysis, the h-index can be viewed as an almost-exact function of $C$, $C_1$, $T$ and $T_1$, through $h_{WW}$; that is, the basic statistics $C$, $C_1$, $T$ and $T_1$ provide salient information for the evaluation of the h-index with high precision. In practice, while computation of the h-index $h$ requires knowledge of the entire citation profile (or at least a large part of it, e.g. the so-called h-core), formula $h_{WW}$ requires knowledge of only a few elementary summary statistics, but reproduces the actual value of $h$ quite well. In truth, in our computations we found that the estimates yielded by $h_{WW}$ were slightly biased downwards for quite high values of the h-index but, as can be seen from Table 3, overall the formula $h_{WW}$ yields very accurate approximations to the empirical value of the h-index, with values of the MARE ranging around 5-6%, not too dissimilar from those obtained by formula $\hat{h}^{(1)}_W$. Formulas $\hat{h}^{(1)}_W$ and $h_{WW}$ thus exhibit comparable levels of accuracy (the advantages of the formula $\hat{h}^{(1)}_W$, as compared to formula $h_{WW}$, may be that: (i) it yields an explicit expression in terms of the basic indicators $C$, $C_1$, $T$ and $T_1$, while the latter does not, and (ii) it is based on a simpler probabilistic model). Even though the Pearson correlation, $\rho$, is not an adequate measure of the accuracy of the estimation and should not be used to compare the effectiveness of the different estimators considered (which is why it has been left out of this study), for the sake of completeness we point out that: (1) … $= 0.90$.
Ultimately, despite the differences between the datasets considered, in terms of scientific areas, time windows for publication and citation, types of “citable” documents considered, and mean level of the basic indicators $C$, $C_1$, $T$ and $T_1$ (with values of respectively 2111, 95, 432 and 312 for the S&MM dataset and 741, 33, 199 and 159 for the EE&F dataset), we may conclude that, on the whole, $h_{WW}$ provides fairly accurate approximations to the real value of the h-index, at least for not too large values of $T$ (e.g. $T < 2000$), $m$ (e.g. $m < 20$) and $h$ (e.g. $h < 40$), such as those considered in this study.
(a) If $0 < a^{*} \le \exp(-\gamma)$, the inequality (26) holds. In this case the function $g(a^{*}, b)$ is strictly decreasing from $+\infty$ at 0 to 1 at $+\infty$, with a limit approached from above. We conclude that, in this case, Eq. (15) has a unique solution if and only if $m^{*} > 1$; otherwise, if $m^{*} \le 1$, Eq. (15) has no solution.
(b) On the other hand, if $a^{*} > \exp(-\gamma)$, the derivative function $\frac{\partial}{\partial b} g(a^{*}, b)$ changes its sign from negative to positive at $b = b_0$, for some $b_0 > 0$; hence $g(a^{*}, b)$ is strictly decreasing for every $0 < b < b_0$, and strictly increasing for every $b > b_0$, and the point $b_0$ is a global minimum for $g(a^{*}, b)$. Moreover since, as seen before, $\lim_{b \to \infty} g(a^{*}, b) = 1$, then $0 < g(a^{*}, b_0) < 1$, and the limit at infinity is approached from below. We conclude that, in this case too, Eq. (15)