Agent-based model for the h-index – Exact solution

. Hirsch’s h -index is perhaps the most popular citation-based measure of scientiﬁc excellence. In 2013 G. Ionescu and B. Chopard proposed an agent-based model describing a process for generating publications and citations in an abstract scientiﬁc community. Within such a framework, one may simulate a scientist’s activity, and – by extension – investigate the whole community of researchers. Even though the Ionescu and Chopard model predicts the h -index quite well, the authors provided a solution based solely on simulations. In this paper, we complete their results with exact, analytic formulas. What is more, by considering a simpliﬁed version of the Ionescu-Chopard model, we obtained a compact, easy to compute formula for the h -index. The derived approximate and exact solutions are investigated on a simulated and real-world data sets.


Introduction
Since the 1999 seminal paper by Barabási and Albert [1] many methods that originally were developed in statistical physics have been successfully applied in a wide range of problems coming from diverse domains. Scientometrics, an area in which one is concerned with the quantitative characteristics of science and scientific research, is one of such domains. Recently, different authors studied -among others -the long term prediction of scientific success [2], impact that an affiliation change has on a scientist's productivity [3], or production and consumption of the knowledge in physics [4,5]. However, historically main efforts were focused on the study of the structure of citation networks [6,7,8,9], and the reproduction of their degree distributions [9,10,11,12]. Starting from the de Solla Price seminal work [13] it is a known fact that citation networks arise due to the preferential attachment rule [1]. This process, well known in complex network analysis [8,10,11], was studied from the point of view of citation networks [7,9,12,15,16], where it is also known as the rich get richer rule or the Matthew effect [14]. Different variations of the classical, linear, preferential attachment (see [10] or Table 1 in [7]) were considered, but to the best of our knowledge there is a lack of models in the literature which concern the h-index (except [17], which is described in Sec. 2). a Corresponding author; e-mail: zogala@ibspan.waw.pl.
The h-index proposed in 2005 by J.E. Hirsch [18] is the most popular citation-based measure of scientific excellence. Even though this data fusion tool was already studied in the 1940s (compare the notion of the Ky Fan metric [19] and also the Sugeno integral, see, e.g., [20]), it may be conceived as a turning point in the history of scientometrics. The idea standing behind the Hirsch index is to measure not only the overall quality of a scientist's output (most often expressed by the number of citations that each individual paper received), but also its size. Thus, it may be understood as a measure of both productivity and impact of a researcher (or an institution). More formally, let us assume that we are given a list S = (S 1 , . . . , S n ) ∈ N n 0 , where S i denotes the number of citations to the i-th paper. If S (n) ≥ 1, the Hirsch index is given by the formula: h-index = max h = 1, . . . , n : S (n−h+1) ≥ h , where S (n−h+1) denotes the (n−h+1)-th order statistic of S. Moreover, if S (n) = 0, then h-index = 0. Intuitively, an author has his/her h-index equal to H, if H of her/his n papers have at least H citations each, and the other n− H papers have at most H citations each.
There were a few papers devoted to the stochastic properties of the h-index in some simple probabilistic models, see [21,22,23,24]). Recently, Ionescu and Chopard in [17] considered a publication-citation process in an abstract scientific community which was described by a multiagent model. Such a model consists of a scientist producing new papers, giving citations to the already published papers (including his/her own ones), and receiving citations from the community. This bottom-up approach allows to simulate a single scientist's activity as well as to investigate the whole community of researchers. What was very inspiring for us is the fact, that Fig. 3. in [13] is a perfect illustration of the mechanism of Ionescu-Chopard model, but this de Solla Price article was published almost 50 years before Ionescu and Chopard paper. Nevertheless, it turns out that their approach predicts quite well the h-index from bibliometric data. However, its authors did not provide an analytic form of a solution to their model, relying only on Monte Carlo simulations instead. In the current work we present an exact solution to that model as well as its simplification and an application on realworld data.
The paper is organized as follows. In Sec. 2 the agentbased model proposed by Ionescu and Chopard (referred to as the IC model) is described in very detail. Sec. 3 presents theoretical results concerning exact formulas for vectors of citations and the results of comparative simulation studies. In Sec. 4 a simplified model is proposed. Next, in Sec. 5 the results of an empirical analysis concerning all investigated approximations of the h-index are presented. Finally, Sec. 6 concludes the paper.

The IC single-scientist model
In 2013 Ionescu and Chopard [17] introduced a multiagent model to describe a publication-citation generation process in an abstract scientific community. Their approach consists of a scientist producing new papers, giving citations to his own and other already published papers, and receiving citations from the community. The model is based on a preferential attachment rule [1], which was observed in many real-world systems [1,8]. As we mentioned before, preferential attachment rule is strongly connected with the so-called Matthew effect [14]: highly cited articles are more eagerly cited by other authors than lowly cited ones. More precisely, the probability of adding new citations to a paper is proportional to the number of citations it has already obtained.

Simulation description
Unlike in the case of various well-known models for constructing citation networks [9,12,15,16], the IC model focuses not on the overall structure of a citation network but only on the node degree distribution, i.e., on the number of citations of papers written by one author. Its aim is to approximate citation scores for each published paper of a given author, i.e., an N dimensional vector S = (S 1 , . . . , S N ), where S k denotes the number of citations of the k-th paper. By definition, this shall be based solely on the number N of papers he/she published as well as the total number M of citations that his/her papers obtained. Moreover, we assume that citations to each paper S k are of two kinds: external X k and internal (self) ones Y k , thus The simulation of interest is an iterative process. We start with an initial number of papers N 0 , none of which is cited. During each iteration we add a new paper to the collection and distribute both self and external citations to the existing papers according to the preferential attachment rule. We give a fixed number of p internal and q external citations to the k-th paper with probability of: Due to the form of the given probability distribution, in [17] it is assumed that only external citations are taken into account when assigning the new ones. Self citations do not influence a paper's importance. Once the fixed number N of published papers is reached, the process goes on, but only q external citations are being granted during each step. The simulation ends as soon as the total number of citations M has been distributed. The initial values for sequences X and Y are given by X denote the number of external and self citations, respectively, of the k-th paper in the t-th iteration. Before the k-th paper is published, its citation counts are set to 0. Thus, X = 0 for k > t. Nevertheless, please note that this assumption has no impact on further derivations, as it is well-known that papers with no citations do not influence the h-index value.
The simulation consists of the three following phases.
Phase 0. Firstly, we initialize the variables X 1 , . . . , X N0 and Y 1 , . . . , Y N0 , and set t = N 0 . In the first step of the next phase we are going to distribute citations across the first N 0 articles. In other words, the considered author has already published her/his first N 0 articles and is waiting for citations. Two cases are possible: N → the author published less than N 0 papers. In such a case, the simulation ends before going to phase (I), even though it is possible that there are still citations left to be distributed. We could try going straight to phase (II) and distribute these citations, yet it would not increase the precision of the h-index estimation significantly (due to the fact that N 0 is small). On the other hand, this would unnecessarily complicate the formulas for X k and Y k . -N 0 < N → there are enough papers and citations to go to phase (I).
Phase (I) For each t = N 0 + 1, . . . , min{N, ⌊ M p+q ⌋ + N 0 }, we distribute q external and p self citations according to the preferential attachment rule given by Eq. (1): When phase (I) comes to an end (which means that the author has already published all her/his works and obtained all self citations), the three following cases are possible: In this case we could distribute such leftover citations, yet it would not increase the precision of the h-index estimation significantly and would unnecessarily complicate the formulas for X k and Y k , -M p+q + N 0 > N → simulation does not end, there are still citations to be distributed. We go to phase (II).
⌋+N , we shall distribute only the external citations among the already published N papers: When phase (II) comes to an end, two situations are possible: even though there are possibly up to q−1 undistributed citations left. The reason to abandon the leftover citations distribution is the same as in phase (I).

Exact formulas for citation vectors
Let us now present the exact formulas for X derived for the IC model.

External citations
Please notice that the sums in Eqs. (2) and (4) are in fact the expected values of random variables from binomial distributions Bin(q, p k,t ) and Bin(q,p k,t ), respectively, where probabilities p k,t ,p k,t are given by: andp The value of X k in the t-th step can be written as: The sums in the denominators are equal to Therefore, and now this recurrence relation can be solved easily. We wish to find X k = X (tmax) k , where the value of t max depends on whether the simulation stops in phase (I) or (II). When solving the recurrence equations we continue until reaching X As a consequence, if t max N , then we obtain: and if t max > N , then it holds: Please note that we can simplify the above formula by relying on the notion of the Γ function. Firstly, observe that: where: Moreover, Hence, the formula for X k can be written as: with: The above simplification gives a more elegant representation of X. However, it is worth noting that the product form is more computationally stable than calculating gamma functions for large arguments. Due to this fact in our simulations we use the product form. Nevertheless, both representations enable us to compute the elements of X significantly faster than in the case of the simulation procedure presented in [17].

Self citations
Similarly as in the previous subsection, we can solve the equation for self citations distribution. Basing on Eqs. (3) and (5), we have: We would like to find Y k = Y (smax) k , and due to the fact that here we always end up in phase (I), s max is equal to: Hence, Also, the formula for Y k may be simplified as follows: where α = qN 0 q + 1 , β = q(N 0 + 1) q + 1 .

Non-integer values of p and q
The authors of the IC model mention in [17] that, given non-integer values of p and q, one distributes: self-citations as well as: citations given by the scientific community. Therefore, as with probabilities of 1 − (⌈p⌉ − p) and 1 − (⌈q⌉ − q) the average number of self-and external citations is equal to p and q, respectively.
Let us consider X (t) k and let the second summand in Eq. (6) be denoted as: where the number of external citations to the k-th paper obtained at time t, Q t k , follows the Bin(q, p k,t ) distribution for t N and the Bin(q,p k,t ) distribution for t > N . Moreover, Taking into account the distribution of q ′ , we have that: Therefore, Eq. (6) is now of the form: By relying on a similar reasoning we obtain: It is easily seen that final form of the result is the same as for the formulas for integer p and q.

Overall number of citations and the h-index
Once we have determined the formulas for external X k and self Y k citations corresponding to the Ionescu and Chopard model, the only action left to estimate h-index is just to sum them up. One sees that both X k and Y k are nondecreasing, so S k = X k + Y k is also nondecreasing and thus: h exact = max{k : S k k}.

Comparative simulation study
Let us now briefly compare the estimates of the h-index obtained with the IC model (denoted as h IC ) and h exact , i.e., the ones that are based on Eq. (7)  According to [17], parameters p and q giving the best global agreement between the IC model and the original h-index are equal to 1 and 2, respectively. However, the authors also stated that p and q can be tuned up in such a way that almost any scientific profile can be fit well. In the case of the h-index of Hirsch, we found out that p = 1 and q = 3 gives a reasonable agreement, while for q = 2 the final h-index is overestimated. Please note that the model is stochastic in its nature and its results vary across different simulation runs, even for the same values of p, q, M and N . Therefore, for the purpose of a sensible comparison, we analyzed 1000 samples for p = 1, q = 1, 2, . . . , 10 and N 0 = p + q. The h IC distribution estimates are presented in Fig. 1 in a form of box-and-wiskers plots 1 Additionally, the h exact and the h-index obtained by averaging the citation vectors as generated by the IC model are indicated. We may observe a high agreement between h IC computed on an averaged citation vector and h exact (the largest difference between these two estimates, i.e., |h IC − h exact | is equal to 1).     Step plots of vectors of external X and self citations Y obtained from the IC model, depicted by ×, and Eqs. (7) and (8), depicted by •.
As it was stated in [17], the initial number of publications N 0 should be small enough so as to not influence the rest of the process, but large enough in order to provide enough papers to cite in the first iteration. The authors suggest to choose N 0 = p + q. In order to assess the influence of this parameter on h IC and h excat , we analyze N 0 varying from 1 to 50, for p = 1 and q = 2. Please note that N 0 = 50 is nearly 25% of all the Hirsch's publication count. For the IC model, we perform 1000 runs and average the obtained values: AVR(h IC ) denotes the mean of h IC obtained in each run and sd its standard deviation, while h IC (AVR) denotes the h-index computed on an averaged citation vector from the IC model. The results presented in Table 1 suggest that there is no significant difference between N 0 = 1, 2 and N 0 = p + q = 3 in this case. Therefore, one may choose N 0 = 1 and if N = 1 and M > 0, simply assign the h-index equal to 1.

A simplification of the IC model
Please note that the exact solution to the IC model, i.e., Eqs. (7) and (8), gives an analytical expression of the very intuitive and reasonable simulation setup as proposed by Ionescu and Chopard. Nevertheless, as the form of the derived formulas is quite complicated, their intuitive interpretation is difficult. Let us recall that the IC model is based on an assumption that only external citations are taken into account when assigning new ones (due to the form of the probability distribution given by Eq. (1)). Moreover, during the simulation study, as it was also stated in [17], we observed that the parameter p has no significant influence on the outcoming h-index. Hence, in this section we reduce the number of parameters, which leads to a significant simplification of the model.
Let us employ the following assumptions: (i) We assume N 0 = 0, so the first paper starts to gain citations just after its publication. (ii) We consider only one vector X, which means that we take into account all the citations together without distinguishing between external and self citations.
The number of simulation parameters is decreased to only two: q, which is the number of citations given in each iteration and T , which is number of simulation steps. Similarly as in Sec. 3 let us write the recurrence relation for X (t) k : which may be expressed as: which may be further simplified as: Eq. (10) is the exact solution of our simplified version of the model, but the following asymptotic relation: allows us to obtain the approximation of X k as: where α = q/(q + 1). Even with our exact solution finding the compact formula for the h-index seems untraceable. Fortunately, for the simplification of the IC model, an observation that X k is an increasing function of k leads to: which is equivalent to: One can show that for every T > 0 and α ∈ (0, 1) Eq. (11) has always exactly one solution, which is the h-index.

Real data evaluation
In this section we perform an empirical analysis of exemplary citation vectors gathered from Elsevier's Scopus (see [25] for the detailed description of the data set). Please note that the whole data set includes citation vectors corresponding to 16282 authors. Nevertheless, about 78% of all the vectors are of length one (among them ca. 32% represent a single uncited paper). This is typical to bibliometric data sets, which consist of a high number of short vectors. Moreover, since it is observed that all the vectors are skewed, usually to model them distributions like exponential or Pareto type II (Lomax) are used, (e.g., compare [27,28,29]). Table 2 presents basic sample statistics for the Scopus data set. For the sake of clarity of the results presented in this paper, a subset of 100 randomly chosen authors has been selected. In order to assess the quality of the proposed approximation we choose vectors of length greater than or equal to 20 (in total number of 69) and from the vectors of length smaller than 20 we randomly choose 31 with uniform distribution. Basic sample statistics of the selected sample are presented in Table 3.  In Fig. 3 there are presented the approximated values of h-index as a function of real values from considered data. Please note that the mean squared difference between the estimated values and the h-index equals to 6.15, 6.45 and 4.14, respectively for the estimates based on Eq. (7), Eq. (11) and Eq. (9).
Please note that in the case of Hirsch himself, considered in Sec. 3.5, the approximations of the h-index are equal to 51.99 ≈ 52 for approximation given by Eq. (11) and 52 for approximation based on Eq. (9). The obtained estimates of the Hirsch h-index for various values of parameter q are presented in Table 4. Please note that by an appropriate selection of the parameter we were able to recreate the value of his h-index. Moreover, Fig. 4 depicts its predicted growth dynamic over each iteration. We see that our simplification does not predict the h-index worse than the original simulation. However, one should be aware that the approximate h-index given by Eq. (11) is not necessarily an integer value (compare Fig. 3 and Table 4). This should be taken into account in analysis of real data sets: if needed, e.g., proper rounding can be applied.

Conclusions
In this paper we investigated an agent-based model for the bibliometric h-index introduced in [17]. The main contribution included is an exact formula for the number of external citations and self citations for each paper produced by a given author. This result not only completes the work conducted by Ionescu and Chopard, but also gives a perspective for a better insight into the citation process. What is more, we proposed a simplification of the i.e., one that is based on Eq. (7), the points marked with + correspond to the estimate of the h-index that is based only on a vector of external citations (Eq. (9)) with parameter q ′′ = 3, while the points marked with △ correspond to the approximation given by Eq. (11) with also q ′′ = 3. The dotted lines of corresponding color, depicted as − − −, − · − · − and · · · · ·, are the least squares fit of the h-index values and considered approximations, respectively.
IC model and presented the approximation of the h-index based on such an approach. The obtained exact formulas were compared with the results of simulations proposed by Ionescu and Chopard. Interestingly, we may observe a good level of compatibility between them, but mostly for a large number of papers and citations. In this case, however, simulations are more computationally demanding, which makes the usage of the exact formulas more preferable. Also a real data evaluation on an informetric data set was presented.
There are still many issues worth deeper investigation. First of all, due to the analytical formulas one may analyze the theoretical properties of the h-index estimate. Since it has been shown that the h-index is an example of an aggregation operator and its properties can be studied by the means of aggregation theory, it is worth to investigate if such properties are still valid when it comes to the IC model estimate.
Moreover, also the theoretical evaluation of the influence of the considered parameters on the results, which  has been done by Ionescu and Chopard only by an empirical study, should be performed. Note that the exact formula for the approximation given by Eq. (17) as well as the comparative study of the proposed approximations of the Hirsch index and the ones already available in the literature opens an interesting future research direction. Also, it is reasonable to perform similar analysis on different data sets, for example representing the data concerning the social network (Facebook, Twitter) users or citation information gathered from different fields of science. Also, there are a lot of variations of the classical preferential attachment rule, proposed by Barabási and Albert. There is also a rich discussion in the literature on the proper version of those mechanisms for considered problem [7,9,10]. Mostly due to the simplicity (and for agree-ment with original work [17]) we chose a classical linear version [8,11]. The analysis of different forms of preferential attachment rule (as those presented in [12] or [9,10]) is also left for future studies.