Zero to k Inflated Poisson Regression Models with Applications

In the count data set, the frequency of some points may occur more than expected under the standard data analysis models. Indeed, in many situations, the frequencies of zero and of some other points tend to be higher than those of the Poisson. Adapting existing models for analyzing inflated observations has been studied in the literature. A method for modeling the inflated data is the inflated distribution. In this paper, we extend this inflated distribution. Indeed, if inflations occur in three or more of the support point, then the previous models are not suitable. We propose a model based on zero, one, …,\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\ldots ,$$\end{document} and k inflated points with probabilities w0,w1,…,\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$w_{0},w_1,\ldots ,$$\end{document} and wk,\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$w_{k},$$\end{document} respectively. By choosing the appropriate values for the weights w0,…,wk,\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$w_{0},\ldots ,w_{k},$$\end{document} various inflated distributions, such as the zero-inflated, zero–one-inflated, and zero–k-inflated distributions, are derived as special cases of the proposed model in this paper. Various illustrative examples and real data sets are analyzed using the obtained results.


Introduction
One of the important causes of overdispersion in the count data is an inflated number of zeros in excess of the number expected under the distribution.In such cases, one of the appropriate models is zero-inflated Poisson (ZIP) distribution.There are numerous papers in the literature dealing with the ZIP model.The earliest studies on the ZIP model were done by Cohen [6] and Yoneda [37].Lambert [14] introduced and studied the ZIP regression model using the Expectation-Maximization (EM) approach.Vandenbroek [34] gave a score test for testing a standard Poisson regression model.Jansakul and Hinde [11] extended this test to a more general situation where the zero probability depends on covariates.Ridout et al. [27] derived a score test for testing a ZIP regression model against the zero-inflated negative binomial alternatives.Agarwal et al. [1] applied the ZIP regression model for analyzing spatial count data sets.The score test using the normal approximation might underestimate the nominal significance level for small sample cases.Jung et al. [12] proposed a parametric bootstrap method for this problem.They observed that the ZIP regression model for prediction is more robust than the usual Poisson regression model.Long et al. [20] developed a marginalized ZIP regression model approach for independent responses to model the population mean count directly, allowing straightforward inference for overall exposure effects and derived an empirical robust variance estimation for overall incidence density ratios.Zhu et al. [40] have extended zero-inflated count models to account for random effects.Lim et al. [16] proposed the ZIP regression mixture model to account for both excess zeros and overdispersion caused by unobserved heterogeneity.A Bayesian latent factor ZIP model was proposed by Neelon and Chung [25] to analyze the molecular differences among breast cancer patients.Furthermore, ZIP models for censored data were studied by Saffari and Adnan [28] and Yang and Simpson [36].Research on these models is still in progress.For instance, the ZIP model's parameters have been estimated using the Genetic Algorithm by Dikheel and Jouda [9] and the Shrinkage estimator by Zandi et al. [38].ZIP models continue to be frequently utilized in many studies, even with the development of more comprehensive models [24,32].Empirical evidence shows that inflation may occur at more than one point.For example, Lin and Tsai [17] discussed a model that can be applied to both excessive zeros and ones known and called the zero-one inflated Poisson (ZOIP) model.Zhang et al. [39] initially studied the likelihood-based ZOIP model without covariates.When the covariates are available, it is essential to build a ZOIP regression model to clarify the relationship between the covariates and the response variable.Tang et al. [33], Liu et al. [18], Liu et al. [19], and Arora and Chaganty [5] studied the statistical inference for the ZOIP model.Melkersson and Rooth [23] proposed a zero-two inflated Poisson model, which accounts for a relative excess of both zero-two children in modeling complete female fertility.
In this article, we attempt to provide a generalization for an inflated regression model based on the Poisson distribution.For this purpose, we generalized the inflated points to 0, … , k, for k = 0, 1, 2, … .This generalization provides various benefits for the modeling of inflated and non-inflated data.For example, this generalization includes all previous inflated regression models based on Poisson distribution.Also, this model provides a wide range of models to the researcher, who can choose the most appropriate ones according to the data.In summary, the originality of this paper lies in the development of the family of inflated Poisson distribution and its application to generalized linear models.This generalization is significant because it gives the researcher access to the entire family of Poisson-based inflated distributions (the most used family in discrete inflated distributions) at a single model.Thus researcher chooses the appropriate model for the data set by choosing the appropriate value of k according to the inflation at each point or any number of points.Besides, with other choices of k and distribution weights, new models can also be introduced.Selecting the appropriate value for k might be a challenge when using the ZKIP models.An effective approach for selecting the correct value for k in the ZKIP models is described in order to address this issue.Section 7 provides an explanation of this algorithm along with useful examples.
Therefore, the rest of the paper is organized as follows.In Sect.2, we introduce the zero to k inflated Poisson (ZKIP) distribution and some special cases.Estimation by the maximum likelihood (ML) method and standard errors of the ML estimates are outlined in Sect.3. Section 4 deals with the regression ZKIP model and the ML estimation, EM algorithm, and hypotheses testing for this model.In Sect.5, the method of the randomized quantile residual (RQR) for the adequacy of the proposed model is introduced.Simulations are conducted in Sect.6 to assess the usefulness of this model.In Sect.7, two real data sets are used to demonstrate the flexibility and superiority of the proposed model against the existing ones.Finally, in Sect.8, conclusions are given.

Proposing the New Model
We can build the zero to k inflated distribution with introducing the function g(y) as where f(y) is a discrete distribution, say Poisson, 0 ≤ w i , i = 0, … , k, (w) = ∑ k i=0 w i and 0 ≤ (w) ≤ 1.By substituting the Poisson probability mass function (PMF) f (y) = exp(− ) y ∕y!, y = 0, 1, … , into Equation (1), we have where w = (w 0 , … , w k ).The discrete random variable Y with the PMF (2) is called to follow the ZKIP distribution and denoted by Y ∼ZKIP(w, ).Here, some special cases of the family of ZKIP models are given as follows: • k = 0 → ZIP distribution [9,24,38] and [32].
If Y ∼ZKIP(y; , ), then the moment generating function is obtained from Eq. (2) as Equation (3) gives the expectation and variance of the random variable Y as Figure 1 shows PMF of the ZKIP distribution for = 1 and some selected values of k and w.

Estimation
Let Y = (Y 1 , … , Y n ) be a random sample arising from the ZKIP distribution.In this section, we study the problem of estimating unknown parameter vector = (w, ) on the basis of Y. Let (3) We can rewrite PMF (2) as where Thus, the likelihood function (LF) of the observed sample Y = y reads Then the logarithm of the LF (LLF) is From Eq. ( 8), the ML estimates are derived by solving the following equations with respect to the parameters: where (6) (9) obs (w, ; y) obs (w, ; y) Journal of Statistical Theory and Applications (2023) 22:366-392 Since there is no closed-form solution for this system of equations, we may use numerical methods to estimate the parameters.To do this, we need to derive the elements of the observed information, which are given by where and For the problems of interval estimation and testing statistical hypotheses, the Fisher information (FI) matrix is useful.The FI matrix ( (Θ) = ij , i, j = 0, … , k + 1), obtained from the observed information matrix by taking the expected values of each entry, is given as follows: 1 3 Journal of Statistical Theory and Applications (2023) 22:366-392 If is the arbitrary parameter of model and θ is the ML estimates for this parameter, then from Lehmann et al. [15], we have Therefore, we can use asymptotic distribution (12) to construct an asymptotic confidence interval and hypotheses testing for the parameters of the proposed model (see Lehmann et al. [15]) for more details).

ZKIP Regression Model
In Sect.2, we introduce and motivate the ZKIP distribution.Recall that the PMF of the ZKIP distribution is where P (y) = exp(− ) y ∕y! for y = 0, 1, … .So the ZKIP is a distribution with k + 2 parameters w = (w 0 , … , w k ) and .In fact, it is a mixture of k + 2 distributions.The first distribution is degenerate at zero with weight w 0 , and the second is degen- erate at one with weight w 1 , and the same goes for up to k.Finally, the (k + 2) th the distribution is the Poisson with mean with weight (1 − (w)).
Suppose that we have a vector y = (y 1 , … , y n ) of n independent count responses from a ZKIP distribution.We assume that, associated with each y i , the vector of covari- ates (x T i ) has been observed.The layout of the observed data is shown in Table 1.From ( 13), the LF of the available data in Table 1 can be rewritten as where = ( 1 , … , n ) and P j (y j ) = exp(− j ) y j j ∕y j !, (y j ≥ 0).To connect the param- eters with the covariates, we follow the standard generalized linear model (GLM) framework for the multinomial distribution;\, see Agresti [2] for further reading.The k + 2 mixing distributions can be viewed as k + 2 nominal categories.Thus, the probabilities of the k + 2 (degenerate(0), degenerate(1), … , degenerate(k), Poisson) categories are w 0 , w 1 , … , w k , (1 − (w)), respectively.Following the GLM baseline category logit model for the multinomial, let .
Here, we treat the Poisson distribution as the baseline category, and thus we have ((k + 2) − 1) = k + 1 equations for the other k + 1 categories.As in log-linear mod- els, the ZKIP regression model assumes that the Poisson parameter i is a loglinear function of the covariates;\, that is, where = ( 1 , … , p ) T is a p-dimensional unknown regression parameters vector.
For the sake of brevity, we assume that the parameters vector = ( 0 , … , k ) is con- stant.The generalization where these k + 1 parameters are functions of the covari- ates is straightforward.Thus, the parameters of the ZKIP regression model are and .In what follows, the problems of estimating parameters and hypotheses testing are discussed in detail.

Estimation of Regression Parameters
In the following, we study methods for estimating the parameters of the ZKIP regression model.Two popular methods are the ML and EM methods.The ML method involves optimizing the LF (14) or the logarithm of the LF with respect to the unknown parameters and .By substituting the reparameterizations (15) into the LF ( 14), we can rewrite (8) as follows: where log( j ) = x T j and * = 1∕{1 + ∑ k l=0 exp( l )}.The ML estimates can be obtained by maximizing the LLF (16) directly with respect to the parameters.Alternatively, one can take the partial derivatives of the LLF and solve the (k + 2) score equations as follows: Table 1 Form of data for regression analyses on inflated ZKIP distribution

Observation
Response Covariates An alternative and popular method for parameter estimation is the EM approach.The EM approach treats the observed data y = (y 1 , … , y n ) as a part of com- plete data that includes z = (z 1 , … , z n ), which is regarded as missing.Here each ) is a k + 2 component vector with PMF (17).By the definition of latent variable z, we have where w k+1 = (1 − (w)).
Then, the conditional PMF of y i given z i is calculated.Thus, the joint distri- bution of the observed and missing data, which is given by the ZKIP distribution, is a mixture of Poisson with k + 1 degenerate distributions at zero to k. Con- sider a latent variable z = (z 0 , … , z k+1 ), which is distributed as a multinomial with parameters (1, w 0 , … , w k+1 ).Note that z takes values (0, … , 0, 1 i , 0, … , 0) with probability w i , i = 0, … , (k + 1).That is, Finally, the joint PMF of (Y, z) is obtained from ( 17) and (19) Therefore from (20), the complete LF of the ZKIP model is By (15), the LLF of the complete data (y, z) reads For w 1 = ⋯ = w k = 0, the ZKIP is reduced to the ZIP model.From (22), the LLF of the ZIP for the complete data is derived as Lambert [14] used Eq. ( 22) as the LLF of the complete data for the ZIP model to obtain the EM estimates.We now proceed to describe the EM algorithm (Dempster et al. [8];\, Wu [35]) for the ZKIP model.The first step in the EM algorithm involves selecting some initial values for the unknown parameters.The choice of the initial values is important for the convergence of the algorithm.An incorrect choice of the initial values could result in slow convergence or breakdown of the algorithm.We recommend using the proportions of zeros, … , k's, respectively, in the observed data as initial values for the parameters w 0 , … , w k and use the relations (15) to get initial values 00 , … , k0 , for the parameters 0 , … , k , respectively.The next step involves filling the latent values z i by their expectations, which is the E-step.We ( 20)

3
Journal of Statistical Theory and Applications (2023) 22:366-392 will use the conditional expected values of E(z|y) given in Table 2 to generate z i 's.
We use Table 2 to estimate the missing values in the expectation step of the EM algorithm as follows: For the maximization step in the EM algorithm, we solve the following score equations instead of maximizing the complete LF directly: where ẑ(k+1)j = 1 − ∑ k i=0 ẑij .In summary, the EM algorithm to estimate the param- eters ( 0 , … , k ) and the regression parameter for the ZKIP regression model is as follows.
In the next subsection, we discuss how to obtain the standard errors of the estimates obtained by the EM algorithm.

Standard Errors for the EM Algorithm
The most commonly used method to get the standard errors in the mixture models is to compute the matrix of partial derivatives of the LLF for the observed data, that is, to calculate the information matrix from the observed data.Lambert [14] used this method for computing the standard errors for the ZIP regression model.Lin and Tsai [17] used the Hessian matrix to get the standard errors for the zero-k inflated Poisson model without actually computing second-order partial derivatives of the LLF.Recall that the Hessian matrix comes out as a byproduct of the nonlinear optimization methods in the common statistical packages.
To compute the standard errors of the estimates obtained by the EM algorithm, we follow the approach described by Louis [22].The relation between the likelihood of the complete, observed, and missing data is given by where y and z stand for the observed and missing data, respectively.By taking the logarithm from Equation ( 27), we get where obs (w, |y) = log L obs (w, |y) , comp (w, |y, z) = log L comp (w, |y, z) and miss (w, |(z|y)) = log{L miss (w, |(z|y))}.Taking second-order partial deriva- tives from Eq. ( 28), the information matrices for the complete, observed, and missing data satisfy the following identity: where the matrices I comp and I miss are, respectively, the Hessian matrix of the LLF of complete data and the LLF of missing data.Since the right hand side of Eq. ( 29) depends on the missing data, Louis [22] suggested to take the expected value of the missing data given the observed.This gives us the identity In other words, Louis [22] estimated of the observed information matrix by Note that the LLF of the complete data for the ZKIP regression model is given by Eq. ( 22), and the corresponding first-order derivatives are shown in Eqs.(25) and (26).The elements of the matrix E(I comp |y) are the expected values of the negative of second-order partial derivatives of the complete LLF (22) given by Journal of Statistical Theory and Applications (2023) 22:366-392 To see this, from Eq. ( 15), we have and then Eq. ( 22) yields From Eq. ( 32) we conclude that After some algebraic manipulations and the value of E(z|y) from Table 2, the fol- lowing result is concluded Similarly, E − 2 comp ∕ l s and E − 2 comp ∕ T are obtained.The LLF of the missing data for the ZKIP regression model is Finally, the elements of the matrix E(I miss |y) are the negative of the expected value of the second-order derivatives of (35).These are given by 2 comp

Hypothesis Testing
Testing the impact of the jth covariate on the count response is equivalent to testing H 0 ∶ j = 0 vs. H 1 ∶ j ≠ 0. This hypothesis test is straightforward and can be done using the standard Wald statistic, z = βi ∕SE( βi ), which has asymptotically standard normal distribution under the null hypothesis H 0 .Here, SE( βj ) minus twice the LLF stands for the standard error of the estimate βj .An alternative approach for testing H 0 ∶ j = 0 is to use the generalized likelihood ratio test statistic defined by which has asymptotically the chi-Square distribution with one degree of freedom, where ̃ and ̃ denote the ML estimates under the hypothesis j = 0 and β and ̂ are the ML estimates under the ZKIP model.Since 0 ≤ w i ≤ 1, (i = 0, … , k), the null hypothesis H 0 ∶ w i 1 = ⋯ = w i r = 0, 0 ≤ i r ≤ k, corresponds to testing the parameters to check the necessity of their pres- ence in the model.In this case, the regularity conditions are not met.That is, the standard asymptotic theory for the LRT statistic (36) is not applicable, and in fact, its asymptotic distribution is a mixture of chi-Square distributions.

The Model Diagnosis
The model diagnosis is an essential step to ensure that a fitted model is adequate for the observed data set.However, diagnosing counts models is still a challenging research problem.Pearson and deviance residuals are often used in practice for diagnosing counts models, despite wide recognition that these residuals are far from normality when applied to count data.RQRs, proposed by Dunn and Smyth [10] and Park et al. [26], are used to overcome the above-mentioned problems.The key idea of the RQR Journal of Statistical Theory and Applications (2023) 22:366-392 is to randomize the lower tail probability into a uniform random number between the discontinuity gap of the cumulative density function (CDF).It can be shown that the RQRs are normally distributed under the true model.To the best of our knowledge, the RQR has not been applied to the residual analyses for zero-inflated or modified mixed effects models.
To do this, we follow the Dunn and Smyth [10] approach.The RQR inverts the fitted distribution function for each response value and finds the equivalent standard normal quantile.Let G(y; w, ) denote the CDF for random variable y.If the CDF is continu- ous, then G(y i ; w, ) is uniformly distributed on the unit interval.RQRs can thus be defined as where Φ −1 (⋅) is the quantile function of the standard normal distribution.However, if the CDF is discrete, then randomization is added to modify q i in Eq. (37).To be more specific, let g(y i ; w, ) denote the PMF of y.The CDF can be redefined as where U is a uniform random variable on [0, 1], and G(y − ; w, ) is the lower limit of G in y.When G is discrete, we let a i = lim y→y − i G(y; ŵi , λi ) and b i = G(y i ; ŵi , λi ).Then, the randomized quantile residual is defined by where G * i is a uniform random variable on the interval (a i , b i ], and q i ∼ N(0, 1).Here, N(0, 1) stands for the standard normal distribution.Therefore, the only information that is required for calculating RQRs is the CDF of the response variable.In numerical analysis, we will use RQRs to investigate the adequacy of the used models.

Simulation Studies
In this simulated scenario, we have evaluated the efficiency of the ZKIP regression model with the link function log( ) = 0 + 1 x 1 + 2 x 2 , where 0 = 0.5, 1 = 2 = −0.05,w 0 = w 1 = w 2 = w 3 = 0.2 and x 1 and x 2 are, respectively, gener- ated from N(0, 1) and the Bernoulli distribution with parameter 0.5 and k = 0, 1, 2, 3.In each step, we select a sample with sizes of n = 100, 500, 1000 from ZKIP regres- sion (k = 0, 1, 2, 3) models and estimate the parameters of the assumed models using the ML method.We repeat this step 10,000 times.If the model and its related definitions work correctly, the estimates obtained for the parameters of this models should be close to their real values.Also, as the sample size rises, the means of bias and standard error should be decreased.This assertions are supported by Table 3's results.
For the next simulated scenario, we have selected a sample with sizes n = 100, 500, 1000, from Z3IP regression model (with the parameters similar to (37) G * (y; w, ) = G(y − ; w, ) + U × g(y; w, ), Table 3).Then, we have fitted the regression ZKIP with k = 0, 1, 2 and Poisson models, for the same sample.As anticipated, the proposed model performed admirably results.The regression Z3IP model shows the best outcomes based on the comparative criteria in Table 4.

Real Data Analysis
To assess the performance of the proposed ZKIP model and the corresponding regression version, two real data sets are analyzed.

Example 1 (Alcohol Consumption)
DeHart et al. [7] described a study in which "moderate to heavy drinkers" (at least 12 alcoholic drinks/week for women, 15 for men) were recruited to keep a daily record of each drink that they consumed over a 30-day study period.Participants also completed a variety of rating scales covering daily events in their lives and items related to self-esteem.Among the researchers' hypotheses, the negative events, particularly those involving romantic relationships, are suspected to be related to the amount of alcohol consumed, especially among those with low self-esteem.
In this example, we consider numall (number of alcoholic beverages, or "drinks," consumed in one day) as a response variable.In Tables 5 and 6 , we fit the ZKIP distribution to this data set.In the following, negevent (an index for combining the total number and intensity of negative events experienced during the day) and nrel (a measure of negative relationship interactions) are considered covariate variables.Tables 7 and 8 show the fitted regression ZKIP models.
The results in Tables 5 and 6 show that the ZKIP regression model with k = 6 performs well and dominates the rest models.Tables 7 and 8 also confirm that the regression ZKIP with k = 6 is the best model among the considered models.In Table 9, we consider the problem of the hypothesis testing H 0 ∶ w 6 = 0 against the alternative H 1 ∶ w 5 > 0 and H 1 ∶ w 7 = 0, with significance level = 0.05.The asymptotic distribution is 0.5 2 0 + 0.5 2 1 , where 2 0 ≡ 0, for the Alcohol consump- tion data set (for more details, see [29]).We see that k = 6 is the best choose for the ZKIP model.
With these explanations, an algorithm can be presented to choose the appropriate value of k.We start fitting the inflated model from k = 0 and continuing this fitting until k, so there is no significant difference between the fitting of k and the larger k + 1.

Example 2 (Lung Data Set)
This data set is about survival in those with advanced lung cancer from the "North Central Cancer Treatment Group" (Loprinzi et al. [21]).Performance scores rate how well the patient can perform usual daily activities.The Eastern Cooperative Oncology Group (ECOG) performance score is measured by the physician as 1 3 Journal of Statistical Theory and Applications (2023) 22:366-392 0 = asymptomatic, 1 = symptomatic but completely ambulatory, 2 = in bed < 50% of the day, 3 = in bed > 50% of the day but not bedbound, 4 = bedbound (ph.ecog).We consider the ph.ecog as the response variable and the Age and Sex of patients as covariates.The fitted ZKIP model for the response variable in the Lung data set is shown in Tables 10 and 11 .It is observed that the ZKIP model with k = 2 is the best one.Tables 12 and 13 confirm the same results in the regression ZKIP model.Table 14 also introduces k = 2 as an appropriate value for the regression ZKIP model for this data set.
Figures 3 and 5 for, respectively, real data sets 1 and 2 show the RQRs of the regression ZKIP model for all observations.For the best k-obtained regression ZKIP fitted, these figures approximately show a random scatter around zero for RQR in the three real data sets, which seems to be reasonable.Figures 2 and 4 also show that the RQRs follow approximately the N(0, 1) distribution for the obtained k (Figs.2,3,4,5).The introduced model can be used in decision trees, random forests and statistical quality control or any model that can be used for inflated discrete data.Also, this model can be developed for the neutrosophic statistics [3,4,31].

Fig. 5
Fig. 5 QQ plots for the RQR's of the Lung data set

Table 2 E
(z|y)for the ZKIP regression model

Table 3
Bias of ML estimates and standard errors (in parentheses), and model diagnostics in the simulation study

Table 4
Frequency comparisons associated with simulated data from the ZKIP regression distribution

Table 5
Estimates, standard errors (in parentheses), and model diagnostics for the Alcohol consumption data set

Table 6
Observed number of illness and the corresponding expected values under the fitted models in Table5

Table 7
Estimates, standard errors (in parentheses), and model diagnostics (log-likelihood, and AIC) for the Alcohol consumption data set

Table 8
Observed number of illness and the corresponding expected values under the fitted models in Table7Count Observed k

Table 10
Estimates, standard errors (in parentheses), and model diagnostics for the Lung data set

Table 12
Estimates, standard errors (in parentheses), and model diagnostics (log-likelihood, and AIC) for Lung data set

Table 14
Testing to choose k in Tables