Flexible models for overdispersed and underdispersed count data

Within the framework of probability models for overdispersed count data, we propose the generalized fractional Poisson distribution (gfPd), which is a natural generalization of the fractional Poisson distribution (fPd), and the standard Poisson distribution. We derive some properties of gfPd and more specifically we study moments, limiting behavior and other features of fPd. The skewness suggests that fPd can be left-skewed, right-skewed or symmetric; this makes the model flexible and appealing in practice. We apply the model to a real big count data and estimate the model parameters using maximum likelihood. Then, we turn to the very general class of weighted Poisson distributions (WPD's) to allow both overdispersion and underdispersion. Similar to Kemp's generalized hypergeometric probability distribution, based on hypergeometric functions, we introduce a novel WPD case where the weight function is chosen as a suitable ratio of three-parameter Mittag--Leffler functions. The proposed family includes the well-known COM-Poisson and the hyper-Poisson models. We characterize conditions on the parameters allowing for overdispersion and underdispersion, and analyze two special cases of interest which have not yet appeared in the literature.


Introduction and mathematical background
The negative binomial distribution is one of the most widely used discrete probability models that allow departure from the mean-equal-variance Poisson model. More specifically, the negative binomial distribution models overdispersion of data relative to the Poisson distribution. For clarity, we refer to the extended negative binomial distribution with probability mass function where r > 0. If r ∈ {1, 2, . . . }, x is the number of failures which occur in a sequence of independent Bernoulli trials to obtain r successes, and p is the success probability of each trial. One limitation of the negative binomial distribution in fitting overdispersed count data is that the skewness and kurtosis are always positive. An example is given in Section 2.1.1, in which we introduce two real world data sets that do not fit a negative binomial model. The data sets reflect reported incidents of crime that occurred in the city of Chicago from January 1,2001 to May 21, 2018. These data sets are overdispersed but the skewness coefficients are estimated to be respectively -0.758 and -0.996. Undoubtedly, the negative binomial model is expected to underperform in these types of count populations. These data sets are just two examples in a yet to be discovered non-negative binomial world, thus demonstrating the real need for a more flexible alternative for overdispersed count data. The literature on alternative probabilistic models for overdispersed count data is vast. A history of the overdispersed data problem and related literature can be found in [32]. In this paper we consider the fractional Poisson distribution (fPd) as an alternative. The fPd arises naturally from the widely studied fractional Poisson process [31,33,16,22,3,5,25]. It has not yet been studied in depth and has not been applied to model real count data. We show that the fPd allows big (large mean), both left-and right-skewed overdispersed count data making it attractive for practical settings, especially now that data are becoming more available and bigger than before. fPd's usually involve one parameter; generalizations to two parameters are proposed in [3,14]. Here, we take a step forward and further generalize the fPd to a three parameter model, proving the resulting distribution is still overdispersed.
One of the most popular measures to detect the departures from the Poisson distribution is the so-called Fisher index which is the ratio of the variance to the mean (≶ 1) of the count distribution. As shown in the crime example of Section 2.1.1, the computation of the Fisher index is not sufficient to determine a first fitting assessment of the model, which indeed should take into account at least the presence of negative/positive skewness. To compute all these measures, the first three factorial moments should be considered. Consider a discrete random variable X with probability generating function (pgf) where {a k } is a sequence of real numbers such that a 0 = 1. Observe that Q(t) = G X (1 + t) is the factorial moment generating function of X. The k-th moment is where S(k, r ) are the Stirling numbers of the second kind [12]. By means of the factorial moments it is straightforward to characterize overdispersion or underdispersion as follows: letting a 2 > a 2 1 yields overdispersion whereas a 2 < a 2 1 gives underdispersion. Let c 2 and c 3 be the second and third cumulant of X, respectively. Then, the skewness can be expressed as If the condition lim n→∞ a n (n − k) is fulfilled, the probability mass function of X can be written in terms of its factorial moments [10]: As an example, the very well-known generalized Poisson distribution which accounts for both under and overdispersion [6,8], put in the above form has factorial moments given by a 0 = 1 and where h(λ 2 ) = ∞ and k = 1, 2, . . ., if λ 2 > 0. While h(λ 2 ) = M − k and k = 1, . . . , M, if max(−1, −λ 1 /M) ≤ λ 2 < 0 and M is the largest positive integer for which λ 1 + Mλ 2 > 0. Another example is given by the Kemp family of generalized hypergeometric factorial moments distributions (GHFD) [19] for which the factorial moments are given by where Γ [(a); , with a 1 , . . . , a p , b 1 , . . . , b q ∈ R and p, q non negative integers. The factorial moment generating function is and (a) m = a(a + 1) · · · (a + n − 1), m ≥ 1. Both overdispersion and underdispersion are possible, depending on the values of the parameters [37]. The generalized fractional Poisson distribution (gfPd), which we introduce in the next section, lies in the same class of the Kemp's GHFD but with the hypergeometric function in (9) substituted by a generalized Mittag-Leffler function (also known as three-parameter Mittag-Leffler function or Prabhakar function). In this case, as we have anticipated above, the model is capable of not only describing overdispersion but also having a degree of flexibility in dealing with skewness.
To take into account underdispersion we then turn our attention to the very general class of weighted Poisson distributions (WPD). It is worthy to note that a second family of distributions based on hypergeometric functions allowing both underdispersion and overdispersion, namely the Kemp's generalized hypergeometric probability distribution (GHPD) [18], lies in the class of WPD's. In Theorem 3.2 we first give a general necessary and sufficient condition to have an underdispersed or an overdispersed WPD random variable in the case in which the weight function may depend on the underlying Poisson parameter λ. Special cases of WPD's admitting a small number of parameters have already proven to be of practical interest, such as for instance the well-known COM-Poisson [9] or the hyper-Poisson [2] models. We present a novel WPD special case in which the weight function is chosen as a suitable ratio of generalized Mittag-Leffler functions. The proposed distribution family includes the above-mentioned well-known classical cases. We characterize conditions on the parameters allowing overdispersion and underdispersion, and analyze two further special cases of interest which have not yet appeared in the literature. We derive recursions to generate probability mass functions (and thus random numbers) and show how to approximate the mean and the variance.
The paper is organized as follows: in Section 2, we introduce the generalized fractional Poisson distribution, discuss some properties and recover the classical fPd as a special case. These models are fitted to the two real-world data sets mentioned above. Section 3 is devoted to weighted Poisson distributions, their characteristic factorial moments and related conditions to obtain overdispersion and underdispersion. Furthermore, the novel WPD based on generalized Mittag-Leffler functions is introduced and described in Section 3.1: we discuss some properties and show how to get exact formulae for factorial moments by using Faà di Bruno's formula [34]. The two special models are thus characterized depending on the values of the parameters and finally some illustrative plots end the paper.
Note that the probability mass function can be determined using the following integral representation [27]: where the Wright function φ is defined as the convergent sum [20] φ Remark 2.1. The random variable X δ α,β has factorial moments By expressing the moments in terms of factorial moments and after some algebra we obtain and Thus the distribution is overdispersed, for Observe that the function βBeta(β, α) is increasing in β for α, β ∈ (0, 1) as where ψ is the digamma function. Note that (21) is positive by formula (1.3.3) of [23] as ψ is increasing on (0, ∞). Thus and for δ ∈ (0, β/α) the bound (20) is always verified.

Fractional Poisson distribution
This section analyzes the classical fPd, which is a special case of gfPd, and is obtained when β = δ = 1. The fPd can model asymmetric (both left-skewed and right-skewed) overdispersed count data for all mean count values (small and large). The fPd has probability mass function (pmf) where µ > 0, α ∈ [0, 1]. Notice that if α = 1, the standard Poisson distribution is retrieved, while for α = 0 we have X 0 Furthermore, the probability mass function can be determined using the following integral representation [4]: where the M-Wright function [24] is the probability density function of the random variable S −α with S d = α + -stable supported in R + . By using (25), the cumulative distribution function turns out to be Remark 2.2. From (2.1), the random variable X α has factorial moments Hence the probability generating function is G Xα (u) = E 1 α,1 (µ (u − 1)), |u| ≤ 1.
With respect to the symmetry structure of X α , from (4) and (28), the skewness of X α reads Moreover, which correctly vanishes if α = 1, like the ordinary Poisson distribution. q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q α = 0.9 α = 0.75 α = 0.5 α = 0.1 Figure 1: Probability mass functions of X α for α = 0.1, 0.5, 0.75, 0.9, and µ = 20.

Simulation and parameter estimation
The integral representation (25) allows visualization of the probability mass function of X α (see Figure 1). Figure 1 shows the flexibility of the fPd. The probability distribution ranges from zero-inflated right-skewed (α → 0) to left-skewed (α → 1) and symmetric (α = 1) overdispersed count data. To compute the integral in (25) by means of Monte Carlo techniques, we use the approximation, where Y j s iid = S −α . Note that the random variable S can be generated using the following formula [17,7]: where U 1 and U 2 are independently and uniformly distributed in [0, 1]. Thus, fractional Poisson random numbers can be generated using the algorithm below. Algorithm: Step 1. Set X = 0, and T = 0.
Note that the random variable V follows the exponential distribution with density function µ exp(−µv ), v ≥ 0. Algorithms for generating random variables from the exponential density function are well-known. Hence, the algorithm allows estimation of the kth moment, i.e., EX k α . Figure 2 shows the plot of the skewness coefficient (30) as a function of µ and α. Unlike the negative binomial, the fPd can accommodate both left-skewed and right-skewed count data making it more flexible. Thus, the fPd is more flexible than the negative binomial, especially if  Figure  3.
Furthermore, we compared fPd(α, µ) with the negative binomial NegBinom(size, mean) using the usual chi-square goodness-of-fit test statistic and the maximum likelihood estimates for both models. Note that the chi-square test statistic follows, approximately, a chi-square distribution with (k − 1 − p) degrees of freedom where k is the number of cells and p is the number of parameters to be estimated plus one.
For illustration purposes, we used grid search for the fPd(α, µ) as it is relatively fast due to α being bounded in (0, 1) and to µ, which is just in the neighborhood of the true data mean scaled by Γ(1 + α). Observe that 5 × 10 5 random numbers are used in all the calculations. From the results below, the fractional Poisson distribution fPd(α, µ) provides better fits than the negative binomial NegBinom(si z e, mean) model for both data sets at 5% level of significance. This exercise clearly demonstrates the limitation of the negative binomial in dealing with left-skewed count data.  When β = α and δ = 1, we have X α,α d = gfPd(α, α, 1, µ) with Proposition 2.1. The probability mass function can be written as where ν S is the distribution of a random variable S whose density has Laplace transform exp(−t α ).
Proof. Note that The above result provides an algorithm to evaluate the probability mass function as Thus, we can now estimate α and µ using maximum likelihood just like in the fPd case. The maximum likelihood estimates for the two crime datasets above are given in Table 2 below. The chi-square goodness-of-fit test statistics are large, indicating bad fits.
From (4) and (37), the symmetry structure of X α,α can be determined as follows: Moreover, which vanishes if α = 1 (Poisson distribution). Moreover, (39) is non-negative and decreasing: this explains the bad fits indicated by the large chi-square values above.

Underdispersion and overdispersion for weighted Poisson distributions
Weighted Poisson distributions [30] provide a unifying approach for both overdispersion and underdispersion [21]. Let Y be a Poisson random variable of parameter λ > 0 and let Y w be the corresponding WPD with weight function w . Proof. It is enough to observe that the pgf G Y w (u) can be written in form (2) as follows: Let T be the linear left-shift operator acting on number sequences. Let us still denote with T its coefficientwise extension to the ring of formal power series in R + [[λ]] [34]. Next proposition links overdispersion and underdispersion of Y w respectively to a Turán-type and a reverse Turán-type inequality involving T .
Proof. The random variable Y w is overdispersed if and only if a 2 > a 2 1 , that is Ew and the result follows observing that T j f (λ) = k≥0 Remark 3.1. Observe that when w does not depend on λ, then T j f (λ) = D j λ f (λ) for j = 1, 2.
i.e. log-convexity (log-convexity) of f . This is already known in the literature (see Theorem 3 of [21]).
and some algebra leads us to the following sufficient condition for overdispersion or underdisper- Notice that Ew (Y ) is a function of the Poisson parameter λ. For the sake of clarity, from now on, let us denote it by η(λ). Weighted Poisson distributions with a weight function w not depending on the Poisson parameter λ are also known as power series distributions (PSD) [15]. It is easy to see that the factorial generating function is in this case whose factorial moments are A special well-known family of PSD is the generalized hypergeometric probability distribution (GHPD) [18], where with p F q given in (9). Depending on the values of the parameters of GHPD both overdispersion and underdispersion are possible [37]. For p = q = 1, a special case of GHPD is the hyper-Poisson distribution [2]. In the next section we will analyze an alternative WPD in which the hyper-Poisson distribution remains a special case and that exhibits both underdispersion and overdispersion.

A novel flexible WPD allowing overdispersion or underdispersion
Let Y w be a WP random variable with weight function where γ > 0, min(α, β, ν) ≥ 0, α + β > 0. Moreover, if γ = β and ν ≥ 1 then β is allowed to be zero. Since it is a PSD, the random variable Y w is characterized by the normalizing function The convergence of the above series can be ascertained as follows. Let γ ≤ 1; by Gautschi's inequality (see [36], formula (2.23)) we have the upper bound which converges by ratio test and taking into account the well-known asymptotics for the ratio of gamma functions (see [38]). Now, let γ > 1. In this case an upper bound can be derived by formula (3.72) of [36]: Remark 3.3. Since η γ+r,ν α,αr +β (λ) = j≥0 λ j j! A j,r with by using Faà di Bruno's formula [34] one has where {A j,0 } and and {B i,k } are the coefficients of η γ,ν α,β (λ) and the partial Bell exponential polynomials [34], respectively.
Furthermore, the probability mass function reads Concerning the variability of Y w , by using Theorem 3 of [21] and the succeeding Corollary, that is by imposing log-convexity (log-concavity) of the weight function, we write for y ∈ R + , where ψ(z) is the Psi function (see [23], Section 1.3). In addition, by considering formula (6.4.10) of [1], Therefore log-convexity (log-concavity) of w (y ) is equivalent to the condition This yields that if (58) holds, then Y w is overdispersed (underdispersed).
In the two next sections we analyze two special cases of interest which, to the best of our knowledge, are still not considered in the literature.

Model I
We first introduce the special case in which α = 1, γ = β, β > 0, and β is allowed to be zero only if ν ≥ 1. This is a three-parameter model (λ, ν, β) which retains the same simple conditions for underdispersion and overdispersion as for the COM-Poisson model. Indeed, formula (58) reduces to ν > 1 and ν ∈ [0, 1), respectively. However, this model is more flexible than the COM-Poisson model for the presence of the parameter β. Notice that the pmf can be written as which suggests that Model I belongs to the exponential family of distributions with parameters log λ and 1 − ν, where β is a nuisance parameter or is known. Figures 4 and 5 show sample shapes of this family of distributions. Note that distributions in Figure 4 ( Figure 5) are overdispersed (underdispersed). Also, This gives a procedure to calculate iteratively the probability mass function and generate random numbers. The only thing to figure out is to compute η β,ν 1,β (λ) in order to obtain P (Y w = 0) = 1/[Γ(β) ν−1 η β,ν 1,β (λ)]. An upper bound for the normalizing function η β,ν 1,β (λ) can be determined similarly to [26], Section 3.2, taking into consideration that the multiplier is ultimately monotonically decreasing. Hence, we can approximate the normalizing constant η β,ν 1,β (λ) by truncating the series and bound the truncation error R k , where k is such that for j > k the multiplier (61) is already monotonically decreasing and bounded above by ε k ∈ (0, 1). Correspondingly, denoting with η β,ν 1,β (λ) = k j=0 λ j j! Γ(j + β) 1−ν , the relative truncation error R k / η β,ν 1,β (λ) is bounded by

Model II
If we set α = ν = 1 we get another three-parameter model (λ, γ, β). The reparametrization β = ξγ together with condition (58) shows that both overdispersion (ξ > 1) and underdispersion (ξ ∈ (0, 1)) are possible. This comes from the fact that ω → ∞ r =0 (y + ω + r ) −2 is decreasing for all fixed y ∈ R + . As for Model I, the probability distribution belongs to the exponential family with parameter log λ, with γ and β as nuisance parameters. Explicitly, the pmf reads and, as in the previous Section 3.1.1, the iterative representation allows an approximated evaluation of the pmf with error control, and consequently random number generation. Also in this case this holds as the involved multiplier is ultimately monotonically decreasing. Figures 6 and 7 show some forms of this class of distributions. Observe that distributions in Figure 6 (Figure 7) are underdispersed (overdispersed). If we further let λ = 1, we obtain a two-parameter model, still allowing for underdispersion if β ∈ (0, γ) (or equivalently ξ ∈ (0, 1)) and overdispersion if β > γ (or ξ > 1), which is also directly comparable with the two-parameter Model I above, the COM-Poisson model, and the hyper-Poisson model.