A new model for over-dispersed count data: Poisson quasi-Lindley regression model

In this paper, a new regression model for a count response variable is proposed via a re-parametrization of the Poisson quasi-Lindley distribution. Maximum likelihood and method of moments estimation are considered for the unknown parameters of the re-parametrized Poisson quasi-Lindley distribution. A simulation study is conducted to evaluate the efficiency of the estimation methods. A real data set is analyzed to demonstrate the usefulness of the proposed model against well-known regression models for count data, such as the Poisson and negative-binomial regression models. Empirical results show that when the response variable is over-dispersed, the proposed model provides better results than the competing models.


Introduction
Interest in count data modeling has increased greatly in the last decade. The most widely used distribution for modeling count data is the Poisson distribution, whose well-known property is that its mean and variance are equal. Consequently, the Poisson distribution is not suitable in the case of over-dispersion or under-dispersion. In spite of this weakness, it is widely used in many research fields, such as actuarial, environmental and economic sciences, because of its simple form, easy implementation and software support. To remove this drawback, researchers have shown great interest in introducing mixed-Poisson and related distributions for modeling over-dispersed or under-dispersed count data sets; see, for example, Bhati et al. [1], Imoto et al. [7], Mahmoudi and Zakerzadeh [9], Gencturk and Yigiter [5], Wongrin and Bodhisuwan [15], Déniz [3], Cheng et al. [2], Lord and Geedipally [8], Zamani et al. [16], Sáez-Castillo and Conde-Sánchez [12], Rodríguez-Avi et al. [10], Shmueli et al. [11], Shoukri et al. [13].
As mentioned above, the Poisson distribution is insufficient to model over-dispersed count data sets. The main motivation of this study is to introduce an alternative regression model for such data. To this end, a re-parametrization of the Poisson quasi-Lindley (PQL) distribution of Grine and Zeghdoudi [6] is introduced, and its statistical properties, such as the mean, the variance and the estimation of the model parameters, are studied. The maximum likelihood (ML) and method of moments (MM) estimation methods are considered to estimate the unknown parameters of the re-parametrized PQL distribution. The efficiencies of the two estimation methods are compared in an extensive simulation study. Using the re-parametrized PQL distribution, a new regression model for over-dispersed count data sets is introduced. To demonstrate the effectiveness of the proposed regression model, a real data set on the days of absence of high school students is analyzed with the Poisson, negative-binomial and PQL regression models.
The rest of the paper is organized as follows: In "Re-parametrization of Poisson quasi-Lindley distribution" section, the statistical properties of the re-parametrized Poisson quasi-Lindley distribution are obtained. In "Estimation" section, the ML and MM estimation methods are considered to estimate the unknown model parameters. In "Simulation" section, the finite sample performance of the estimation methods is compared via a Monte Carlo simulation study. In "Poisson quasi-Lindley regression model" section, a new regression model is introduced. In "Empirical study" section, a real data set is analyzed to demonstrate the usefulness of the proposed model against the Poisson and negative-binomial regression models. "Conclusion" section contains the concluding remarks.

Re-parametrization of Poisson quasi-Lindley distribution
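Let the random variable $X$ follow a Poisson distribution. Its probability mass function (pmf) is

$$P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \quad x = 0, 1, 2, \ldots,$$

where $\lambda > 0$. The mean and variance of the Poisson distribution are $E(X) = \lambda$ and $Var(X) = \lambda$, respectively. So, the dispersion index (DI) of the Poisson distribution is $DI = Var(X)/E(X) = \lambda/\lambda = 1$. Because this index is fixed at one, over-dispersed or under-dispersed data sets cannot be modeled by the Poisson distribution. Note that over-dispersion occurs when the variance is greater than the mean, and under-dispersion occurs when the variance is smaller than the mean. Grine and Zeghdoudi [6] introduced a new mixed-Poisson distribution, called Poisson quasi-Lindley (PQL), by compounding the Poisson distribution with the quasi-Lindley distribution introduced by Shanker and Mishra [14]. The pmf of the PQL distribution is given by

$$P(Y = y) = \frac{\theta}{\alpha + 1}\,\frac{\alpha(\theta + 1) + \theta(y + 1)}{(\theta + 1)^{y+2}}, \quad y = 0, 1, 2, \ldots, \tag{1}$$

where $\theta > 0$ and $\alpha > -1$. Hereafter, a random variable $Y$ with pmf (1) is denoted as PQL($\alpha$, $\theta$). The cumulative distribution function (cdf) corresponding to (1) is

$$F(y) = 1 - \frac{\alpha(\theta + 1) + \theta(y + 2) + 1}{(\alpha + 1)(\theta + 1)^{y+2}}. \tag{2}$$

The mean and variance of the PQL distribution are given by, respectively,

$$E(Y) = \frac{\alpha + 2}{\theta(\alpha + 1)}, \tag{3}$$

$$Var(Y) = \frac{\alpha + 2}{\theta(\alpha + 1)} + \frac{\alpha^{2} + 4\alpha + 2}{\theta^{2}(\alpha + 1)^{2}}. \tag{4}$$

Here, a re-parametrization of the PQL distribution is considered. The motivation for the re-parametrization comes from the generalized linear model approach, in which the covariates are linked to the mean of the response variable.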
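As a numerical check on Eqs. (1)-(4), the following Python sketch evaluates the pmf, cdf, mean and variance of the PQL distribution. The paper's own computations are done in R; Python and the function names `pql_pmf`, `pql_cdf`, `pql_mean` and `pql_var` are assumptions made here for illustration only.

```python
def pql_pmf(y, theta, alpha):
    """P(Y = y) from Eq. (1); requires theta > 0 and alpha > -1."""
    numerator = theta * (alpha * (theta + 1.0) + theta * (y + 1))
    return numerator / ((alpha + 1.0) * (theta + 1.0) ** (y + 2))


def pql_cdf(y, theta, alpha):
    """P(Y <= y) from Eq. (2)."""
    numerator = alpha * (theta + 1.0) + theta * (y + 2) + 1.0
    return 1.0 - numerator / ((alpha + 1.0) * (theta + 1.0) ** (y + 2))


def pql_mean(theta, alpha):
    """E(Y) from Eq. (3)."""
    return (alpha + 2.0) / (theta * (alpha + 1.0))


def pql_var(theta, alpha):
    """Var(Y) from Eq. (4): the mean plus the variance of the mixing law."""
    extra = (alpha ** 2 + 4.0 * alpha + 2.0) / (theta ** 2 * (alpha + 1.0) ** 2)
    return pql_mean(theta, alpha) + extra


if __name__ == "__main__":
    theta, alpha = 0.5, 1.5  # arbitrary illustrative values
    probs = [pql_pmf(y, theta, alpha) for y in range(400)]
    print("total probability:", sum(probs))
    print("dispersion index:", pql_var(theta, alpha) / pql_mean(theta, alpha))
```

Summing the pmf over a long enough range should return a value very close to one, and the partial sums should agree with the closed-form cdf.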

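Proposition 1 Let $\theta = (2 + \alpha)/[(1 + \alpha)\mu]$; then the pmf of the PQL distribution is

$$P(Y = y) = \frac{\alpha + 2}{\mu(\alpha + 1)^{2}}\;\frac{\alpha\left(1 + \frac{\alpha + 2}{\mu(\alpha + 1)}\right) + \frac{(\alpha + 2)(y + 1)}{\mu(\alpha + 1)}}{\left(1 + \frac{\alpha + 2}{\mu(\alpha + 1)}\right)^{y+2}}, \quad y = 0, 1, 2, \ldots, \tag{5}$$

where $\mu > 0$ and $\alpha > 0$. The mean and variance of (5) are given by, respectively,

$$E(Y) = \mu, \qquad Var(Y) = \mu + \frac{\alpha^{2} + 4\alpha + 2}{(\alpha + 2)^{2}}\,\mu^{2}. \tag{6}$$

Note that the parameter $\alpha$ should be greater than zero to ensure a positive variance. The other statistical properties of the PQL distribution under the above re-parametrization, such as the probability and moment generating functions, the mode and the cdf, can be obtained following the results in Grine and Zeghdoudi [6]. As seen from (6), since the second term of the variance is greater than zero for all values of the parameters $\mu$ and $\alpha$, the variance of the PQL distribution is always greater than its mean. Therefore, the PQL distribution can be a good choice for modeling over-dispersed data sets. Figure 1 displays the dispersion index and possible shapes of the PQL distribution. When the parameters $\mu$ and $\alpha$ increase, the dispersion of the PQL distribution increases. Note that the effect of the parameter $\mu$ on dispersion is higher than that of the parameter $\alpha$: from (6), $DI = 1 + \mu\,[1 - 2/(\alpha + 2)^{2}]$, which grows without bound in $\mu$, whereas the factor in brackets stays between 1/2 and 1 for $\alpha > 0$. As seen from the left side of Fig. 1, the PQL distribution can be a good choice for modeling extremely right-skewed data sets.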
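The behaviour of the dispersion index under the mean parametrization can be checked numerically. The sketch below is written in Python rather than the R used in the paper, and the names `pql_pmf_mu` and `dispersion_index` are hypothetical. It evaluates the re-parametrized pmf and compares the dispersion index implied by the variance formula with a direct summation.

```python
def pql_pmf_mu(y, mu, alpha):
    """Re-parametrized PQL pmf with mean mu > 0 and alpha > 0."""
    theta = (alpha + 2.0) / (mu * (alpha + 1.0))  # Proposition 1
    numerator = theta * (alpha * (1.0 + theta) + theta * (y + 1))
    return numerator / ((alpha + 1.0) * (1.0 + theta) ** (y + 2))


def dispersion_index(mu, alpha):
    """Var(Y) / E(Y) implied by the mean and variance of the re-parametrized model."""
    return 1.0 + mu * (alpha ** 2 + 4.0 * alpha + 2.0) / (alpha + 2.0) ** 2


if __name__ == "__main__":
    mu, alpha = 3.0, 0.5  # arbitrary illustrative values
    probs = [pql_pmf_mu(y, mu, alpha) for y in range(600)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    print("numerical DI:", var / mean)
    print("formula DI:  ", dispersion_index(mu, alpha))
```

Because the index equals one plus a positive term, it always exceeds one, which is the over-dispersion property stated in the text.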

Generating random variables from PQL distribution
Here, a general algorithm and the corresponding code written in R are given to generate random variables from the PQL distribution. The same approach can be used for any discrete distribution, such as the Poisson, Poisson-Lindley and negative-binomial distributions.
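The authors' R code is not reproduced in this text. A minimal sketch of one general-purpose approach, inverse-transform sampling, is given below in Python. It is an assumption that this matches the authors' algorithm. The function names `pql_cdf_mu` and `rpql` are hypothetical, and the cdf used is the one obtained by summing the PQL pmf.

```python
import random


def pql_cdf_mu(y, mu, alpha):
    """cdf of the re-parametrized PQL distribution (mean mu, alpha > 0)."""
    theta = (alpha + 2.0) / (mu * (alpha + 1.0))
    numerator = alpha * (1.0 + theta) + theta * (y + 2) + 1.0
    return 1.0 - numerator / ((alpha + 1.0) * (1.0 + theta) ** (y + 2))


def rpql(n, mu, alpha, rng=None):
    """Draw n PQL variates by inverse transform.

    For each uniform draw u, return the smallest y with F(y) >= u.
    Swapping in another cdf gives a sampler for any other count distribution.
    """
    rng = rng or random.Random()
    sample = []
    for _ in range(n):
        u = rng.random()
        y = 0
        while pql_cdf_mu(y, mu, alpha) < u:
            y += 1
        sample.append(y)
    return sample


if __name__ == "__main__":
    draws = rpql(10, mu=3.0, alpha=0.5, rng=random.Random(1))
    print(draws)
```

The linear search is adequate for moderate means; for very large means a bisection on the cdf would reduce the number of evaluations.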

Estimation
In this section, ML and MM estimation methods are considered to estimate the unknown parameters of PQL distribution.

Maximum likelihood estimation
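Let $Y_1, Y_2, \ldots, Y_n$ be independent and identically distributed PQL random variables with pmf (5) and observed values $y_1, y_2, \ldots, y_n$. Writing $\theta = (2 + \alpha)/[(1 + \alpha)\mu]$ for brevity, the log-likelihood function is

$$\ell(\mu, \alpha) = n \ln\left[\frac{2 + \alpha}{\mu(\alpha + 1)^{2}}\right] + \sum_{i=1}^{n} \ln\left[\alpha(1 + \theta) + \theta(y_i + 1)\right] - \ln(1 + \theta)\sum_{i=1}^{n}(y_i + 2). \tag{7}$$

Taking partial derivatives of (7) with respect to $\mu$ and $\alpha$, and using $\partial\theta/\partial\mu = -\theta/\mu$ and $\partial\theta/\partial\alpha = -1/[\mu(\alpha + 1)^{2}]$, we have

$$\frac{\partial \ell}{\partial \mu} = -\frac{n}{\mu} - \frac{\theta}{\mu}\sum_{i=1}^{n}\frac{\alpha + y_i + 1}{\alpha(1 + \theta) + \theta(y_i + 1)} + \frac{\theta}{\mu(1 + \theta)}\sum_{i=1}^{n}(y_i + 2) = 0, \tag{8}$$

$$\frac{\partial \ell}{\partial \alpha} = \frac{n}{\alpha + 2} - \frac{2n}{\alpha + 1} + \sum_{i=1}^{n}\frac{1 + \theta - \frac{\alpha + y_i + 1}{\mu(\alpha + 1)^{2}}}{\alpha(1 + \theta) + \theta(y_i + 1)} + \frac{1}{\mu(\alpha + 1)^{2}(1 + \theta)}\sum_{i=1}^{n}(y_i + 2) = 0. \tag{9}$$

The ML estimates of $(\mu, \alpha)$ are obtained by solving (8) and (9) simultaneously. Explicit forms of the ML estimates are not available since the likelihood equations contain nonlinear functions. For this reason, numerical optimization is needed, and the nonlinear minimization (nlm) function of R is used for this purpose, applied to the negative of (7). The corresponding interval estimates of the parameters are obtained by means of the observed information matrix, which is given by

$$I_F(\boldsymbol{\Theta}) = -\begin{pmatrix} \dfrac{\partial^{2}\ell}{\partial\mu^{2}} & \dfrac{\partial^{2}\ell}{\partial\mu\,\partial\alpha} \\[2ex] \dfrac{\partial^{2}\ell}{\partial\alpha\,\partial\mu} & \dfrac{\partial^{2}\ell}{\partial\alpha^{2}} \end{pmatrix}, \qquad \boldsymbol{\Theta} = (\mu, \alpha)^{T}.$$

The elements of the observed information matrix are available upon request from the authors. It is well known that, under the regularity conditions, which are fulfilled for the parameters, the asymptotic joint distribution of $(\hat{\mu}, \hat{\alpha})$ as $n \to \infty$ is a bivariate normal distribution with mean $(\mu, \alpha)$ and variance-covariance matrix $I_F^{-1}(\boldsymbol{\Theta})$. Using the asymptotic normality, the asymptotic $100(1 - p)\%$ confidence intervals for the parameters $\mu$ and $\alpha$ are given by, respectively,

$$\hat{\mu} \pm z_{p/2}\sqrt{\widehat{Var}(\hat{\mu})}, \qquad \hat{\alpha} \pm z_{p/2}\sqrt{\widehat{Var}(\hat{\alpha})},$$

where $\widehat{Var}(\hat{\mu})$ and $\widehat{Var}(\hat{\alpha})$ are the diagonal elements of $I_F^{-1}(\hat{\boldsymbol{\Theta}})$ and $z_{p/2}$ is the upper $p/2$ quantile of the standard normal distribution.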
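To illustrate the numerical maximization step, the sketch below evaluates the log-likelihood of the re-parametrized PQL model and maximizes it with a simple compass (pattern) search on the log scale. This search is a stand-in chosen so that the example needs only the Python standard library. It is not the nlm routine used in the paper, and the names `pql_loglik` and `fit_pql_ml` as well as the toy data are hypothetical.

```python
import math


def pql_loglik(mu, alpha, ys):
    """Log-likelihood of the re-parametrized PQL model for counts ys."""
    theta = (alpha + 2.0) / (mu * (alpha + 1.0))
    ll = len(ys) * math.log((alpha + 2.0) / (mu * (alpha + 1.0) ** 2))
    for y in ys:
        ll += math.log(alpha * (1.0 + theta) + theta * (y + 1))
        ll -= (y + 2) * math.log(1.0 + theta)
    return ll


def fit_pql_ml(ys, start=(1.0, 1.0), step=0.5, tol=1e-7, bound=20.0):
    """Maximize the log-likelihood over (log mu, log alpha).

    Working on the log scale keeps both parameters positive. Each pass tries
    a move of size `step` in every coordinate direction and halves the step
    when no move improves the log-likelihood.
    """
    point = [math.log(start[0]), math.log(start[1])]
    best = pql_loglik(math.exp(point[0]), math.exp(point[1]), ys)
    while step > tol:
        improved = False
        for i in (0, 1):
            for delta in (step, -step):
                trial = list(point)
                trial[i] += delta
                if abs(trial[i]) > bound:  # keep the search in a bounded box
                    continue
                value = pql_loglik(math.exp(trial[0]), math.exp(trial[1]), ys)
                if value > best:
                    point, best, improved = trial, value, True
        if not improved:
            step /= 2.0
    return math.exp(point[0]), math.exp(point[1]), best


if __name__ == "__main__":
    toy = [0, 1, 1, 2, 0, 3, 5, 2, 1, 0, 4, 7, 2, 1, 0, 3, 6, 1, 2, 9]  # made-up counts
    mu_hat, alpha_hat, ll_hat = fit_pql_ml(toy)
    print(mu_hat, alpha_hat, ll_hat)
```

A gradient-based or quasi-Newton optimizer would normally converge faster; the compass search is used here only because it is short and easy to verify.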

Method of moments
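The MM estimators of the parameters $\mu$ and $\alpha$ can be obtained by equating the mean and variance of the PQL distribution to the sample mean and variance, as follows:

$$\bar{y} = \mu, \tag{10}$$

$$s^{2} = \mu + \frac{\alpha^{2} + 4\alpha + 2}{(\alpha + 2)^{2}}\,\mu^{2}, \tag{11}$$

where $\bar{y}$ and $s^{2}$ are the sample mean and variance, respectively. Since $(\alpha^{2} + 4\alpha + 2)/(\alpha + 2)^{2} = 1 - 2/(\alpha + 2)^{2}$, solving (10) and (11) simultaneously gives

$$\hat{\mu}_{MM} = \bar{y}, \qquad \hat{\alpha}_{MM} = \bar{y}\sqrt{\frac{2}{\bar{y}^{2} + \bar{y} - s^{2}}} - 2,$$

which requires $\bar{y}^{2} + \bar{y} - s^{2} > 0$. Detailed information about the asymptotic properties of MM estimators can be found in Farbod and Arzideh [4].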
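The moment equations can be solved in closed form, as the following Python sketch shows. The function name `pql_mm` and the toy data are hypothetical, and the closed-form expression for the second parameter is the one obtained by solving the mean and variance equations of the re-parametrized PQL distribution for the sample mean and the sample variance.

```python
import math
from statistics import mean, variance


def pql_mm(ys):
    """Method-of-moments estimates (mu, alpha) for the re-parametrized PQL model.

    Raises ValueError when the moment equations have no real solution, which
    happens when the sample variance is at least ybar**2 + ybar.
    """
    ybar = mean(ys)
    s2 = variance(ys)  # sample variance with divisor n - 1
    denom = ybar ** 2 + ybar - s2
    if denom <= 0:
        raise ValueError("moment equations have no solution for these data")
    alpha = ybar * math.sqrt(2.0 / denom) - 2.0
    return ybar, alpha


if __name__ == "__main__":
    toy = [0, 1, 1, 2, 0, 3, 5, 2, 1, 0, 4, 7, 2, 1, 0, 3, 6, 1, 2, 9]  # made-up counts
    print(pql_mm(toy))
```

The returned second estimate can be non-positive when the sample is only mildly over-dispersed, so it should be checked against the parameter space before use.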

Simulation
In this section, a Monte Carlo simulation study is conducted to evaluate the finite sample performance of the ML and MM estimates of the PQL distribution. The following simulation procedure is used.
1. Set the sample size $n$ and the vector of parameters $\boldsymbol{\Theta} = (\mu, \alpha)^{T}$;
2. Generate random observations of size $n$ from the PQL($\mu$, $\alpha$) distribution, using the algorithm given in the section on generating random variables;
3. Use the observations generated in Step 2 to estimate $\hat{\boldsymbol{\Theta}}$ by means of the ML and MM estimation methods;
4. Repeat Steps 2 and 3 $N$ times;
5. Use $\hat{\boldsymbol{\Theta}}$ and $\boldsymbol{\Theta}$ to calculate the biases, mean relative estimates (MREs) and mean square errors (MSEs) from the following equations, where $\hat{\Theta}_{j}^{(i)}$ is the estimate of the $j$th parameter in the $i$th replication:

$$Bias_{j} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{\Theta}_{j}^{(i)} - \Theta_{j}\right), \qquad MRE_{j} = \frac{1}{N}\sum_{i=1}^{N}\frac{\hat{\Theta}_{j}^{(i)}}{\Theta_{j}}, \qquad MSE_{j} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{\Theta}_{j}^{(i)} - \Theta_{j}\right)^{2}.$$

The sample sizes considered are $n = 40, 45, 50, \ldots, 500$. When $n$ is sufficiently large, the MREs should be close to one and the MSEs and biases should be close to zero. As seen from Fig. 2, when the sample size $n$ increases, the MSEs and biases approach zero and the MREs approach one for both estimation methods. The MM and ML estimation methods yield similar results for the parameter $\mu$ in terms of estimated MSE, bias and MRE. However, the ML estimation method provides more satisfactory results for the parameter $\alpha$, especially for small sample sizes. Therefore, we suggest using the ML estimation method when the sample size is small.
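The three performance measures in Step 5 can be computed from the replicated estimates as in the following Python sketch. The function name `simulation_metrics` and the numbers in the usage example are hypothetical and are not results from the paper.

```python
def simulation_metrics(estimates, true_value):
    """Return (bias, MRE, MSE) of replicated estimates of one parameter."""
    n_rep = len(estimates)
    bias = sum(est - true_value for est in estimates) / n_rep
    mre = sum(est / true_value for est in estimates) / n_rep
    mse = sum((est - true_value) ** 2 for est in estimates) / n_rep
    return bias, mre, mse


if __name__ == "__main__":
    # Three made-up replications of an estimator whose true value is 2.0.
    bias, mre, mse = simulation_metrics([1.8, 2.0, 2.2], 2.0)
    print(bias, mre, mse)
```

In a full study this function would be applied once per parameter, estimation method and sample size, and the resulting values plotted against the sample size.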

Poisson quasi-Lindley regression model
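The Poisson and negative-binomial models are the two most commonly used regression models for count data. When the response variable is over-dispersed, the negative-binomial regression model is preferable to the Poisson model. Here, an alternative regression model is introduced for an over-dispersed response variable.
Let the random variable $Y$ follow a PQL distribution, given in (5). The mean of $Y$ is $E(Y \mid \mu, \alpha) = \mu$. Therefore, the covariates can be linked to the mean of the response variable by means of the log-link function, given by

$$\mu_i = \exp\left(\mathbf{x}_i^{T}\boldsymbol{\beta}\right), \quad i = 1, \ldots, n, \tag{16}$$

where $\mathbf{x}_i^{T} = (1, x_{i1}, x_{i2}, \ldots, x_{ik})$ is the vector of covariates and $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_k)^{T}$ is the unknown vector of regression coefficients. Inserting (16) in (5), the log-likelihood function is obtained as

$$\ell(\boldsymbol{\tau}) = \sum_{i=1}^{n}\left\{\ln\left[\frac{2 + \alpha}{\mu_i(\alpha + 1)^{2}}\right] + \ln\left[\alpha(1 + \theta_i) + \theta_i(y_i + 1)\right] - (y_i + 2)\ln(1 + \theta_i)\right\}, \tag{17}$$

where $\theta_i = (2 + \alpha)/[(1 + \alpha)\mu_i]$, $\mu_i$ is given by (16) and $\boldsymbol{\tau} = (\alpha, \boldsymbol{\beta}^{T})^{T}$. The unknown parameters, $\alpha$ and $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2, \ldots, \beta_k)^{T}$, are obtained by maximizing (17) with the nlm function of R. Under standard regularity conditions, the asymptotic distribution of $(\hat{\boldsymbol{\tau}} - \boldsymbol{\tau})$ is multivariate normal $N_{k+2}(0, J(\boldsymbol{\tau})^{-1})$, where $J(\boldsymbol{\tau})$ is the expected information matrix. The asymptotic covariance matrix $J(\boldsymbol{\tau})^{-1}$ of $\hat{\boldsymbol{\tau}}$ can be approximated by the inverse of the $(k + 2) \times (k + 2)$ observed information matrix $I(\boldsymbol{\tau})$, whose elements can be evaluated numerically in most statistical packages. The approximate multivariate normal distribution $N_{k+2}(0, I(\boldsymbol{\tau})^{-1})$ for $\hat{\boldsymbol{\tau}}$ can be used to construct asymptotic confidence intervals for the vector of parameters $\boldsymbol{\tau}$.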
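The objective function that an optimizer would minimize for the regression model can be sketched as follows. The code is in Python rather than R, the function name `pql_reg_negloglik` and the small design matrix are hypothetical, and each row of `X` is assumed to start with a 1 for the intercept.

```python
import math


def pql_reg_negloglik(alpha, beta, X, y):
    """Negative log-likelihood of the PQL regression model with a log link.

    alpha : dispersion-type parameter, alpha > 0
    beta  : regression coefficients, one per column of X
    X     : list of covariate rows (first entry 1 for the intercept)
    y     : list of observed counts
    """
    nll = 0.0
    for row, yi in zip(X, y):
        mu = math.exp(sum(b * x for b, x in zip(beta, row)))  # log link
        theta = (alpha + 2.0) / (mu * (alpha + 1.0))
        nll -= math.log((alpha + 2.0) / (mu * (alpha + 1.0) ** 2))
        nll -= math.log(alpha * (1.0 + theta) + theta * (yi + 1))
        nll += (yi + 2) * math.log(1.0 + theta)
    return nll


if __name__ == "__main__":
    X = [[1, 0], [1, 1], [1, 0], [1, 1]]  # made-up intercept plus one binary covariate
    y = [2, 7, 0, 4]                      # made-up counts
    print(pql_reg_negloglik(1.0, [0.5, 0.8], X, y))
```

Passing this function to a general-purpose minimizer, with `alpha` optimized on the log scale to keep it positive, is one way to obtain the estimates.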

Empirical study
In this section, the modeling ability of the PQL regression model is compared with that of the Poisson and negative-binomial (NB) regression models via an application to a real data set. The data contain the number of days of absence, gender and type of instructional program of 314 high school students from two urban high schools. The data set can be obtained from https://stats.idre.ucla.edu/stat/stata/dae/nb_data.dta. The response variable, number of days of absence $y_i$, is modeled with gender (female = 1, male = 0), $x_1$, and type of instructional program (general = 1, academic = 2, vocational = 3). The vocational program is used as the baseline category for the type of instructional program variable. The general and academic instructional programs are coded as $x_2$ and $x_3$, respectively. To decide the best model, the estimated negative log-likelihood value, Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) values are used. The lowest values of these statistics indicate the best-fitting model for the data set. The following regression model is fitted:

$$\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}.$$

Figure 3 displays the distribution of days of absence. The mean of the response variable is 5.955 and its variance is 49.518, a dispersion index of about 8.3, which is evidence of over-dispersion. Table 1 lists the estimated parameters of the models and the corresponding standard errors (SEs), together with the estimated negative log-likelihood ($-\ell$), AIC and BIC values. Since the PQL regression model has the lowest values of these statistics, we conclude that the PQL regression model provides a better fit than the Poisson and NB regression models for this over-dispersed data set.
The observed information matrix of the PQL regression model, $I(\hat{\boldsymbol{\tau}})$, is evaluated at the estimates, and the diagonal elements of its inverse give the variances of the estimated parameters. The asymptotic confidence intervals of the regression parameters are $-0.430 < \beta_1 < 0.049$, $0.961 < \beta_2 < 1.349$ and $0.673 < \beta_3 < 1.215$, respectively. Since the interval for $\beta_1$ contains zero, we conclude from the estimated regression coefficients of the PQL regression model that gender has no statistically significant effect on the days of absence of students. However, the days of absence for general and academic instructional program students are 1.348 and 0.945 times higher than those of the vocational instructional program students.

Fig. 3 The distribution of days of absence of students
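The model-selection statistics follow directly from the negative log-likelihood, the number of estimated parameters and the sample size. A minimal Python sketch is given below. The function names are hypothetical and the negative log-likelihood in the usage example is a made-up placeholder, not a value from Table 1.

```python
import math


def aic(neg_loglik, n_params):
    """Akaike Information Criterion: 2 * (-loglik) + 2 * k."""
    return 2.0 * neg_loglik + 2.0 * n_params


def bic(neg_loglik, n_params, n_obs):
    """Bayesian Information Criterion: 2 * (-loglik) + k * ln(n)."""
    return 2.0 * neg_loglik + n_params * math.log(n_obs)


def dispersion_index(sample_mean, sample_variance):
    """Ratio of variance to mean; values above one indicate over-dispersion."""
    return sample_variance / sample_mean


if __name__ == "__main__":
    # Mean and variance of the response as reported in the text.
    print(dispersion_index(5.955, 49.518))
    # Placeholder fit: 5 parameters (four coefficients plus alpha), 314 students.
    print(aic(900.0, 5), bic(900.0, 5, 314))
```

Because BIC penalizes each parameter by ln(314), roughly 5.75, it favours the one-parameter-smaller Poisson model more strongly than AIC does when fits are close.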
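The step from an observed information matrix to asymptotic confidence intervals can be sketched as follows. The Python code inverts the matrix by Gauss-Jordan elimination and forms Wald intervals from the diagonal of the inverse. The function names `invert` and `wald_ci` and the 2 x 2 matrix in the usage example are hypothetical and are not the matrix of the fitted model.

```python
import math
from statistics import NormalDist


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def wald_ci(estimates, info, p=0.05):
    """Asymptotic 100(1 - p)% intervals: estimate +/- z * sqrt(diag(info^-1))."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    cov = invert(info)
    intervals = []
    for j, est in enumerate(estimates):
        se = math.sqrt(cov[j][j])
        intervals.append((est - z * se, est + z * se))
    return intervals


if __name__ == "__main__":
    info = [[4.0, 1.0], [1.0, 3.0]]  # made-up information matrix
    for lower, upper in wald_ci([1.0, -0.2], info):
        print(lower, upper, "contains zero:", lower < 0.0 < upper)
```

A coefficient whose interval contains zero is not statistically significant at the chosen level, which is the criterion applied to the gender effect in the text.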

Conclusion
A re-parametrization of the Poisson quasi-Lindley distribution is introduced and studied. The parameter estimation problem of the Poisson quasi-Lindley distribution is discussed via an extensive simulation study. A new regression model for count data is proposed and compared with the Poisson and negative-binomial regression models using a real data set. We conclude that the Poisson quasi-Lindley regression model exhibits better fitting performance than the Poisson and negative-binomial regression models when the response variable is over-dispersed. We hope that the results given in this study will be helpful for researchers working in this field.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.