Transition Models for Count Data: a Flexible Alternative to Fixed Distribution Models

A flexible semiparametric class of models is introduced that offers an alternative to classical regression models for count data as the Poisson and negative binomial model, as well as to more general models accounting for excess zeros that are also based on fixed distributional assumptions. The model allows that the data itself determine the distribution of the response variable, but, in its basic form, uses a parametric term that specifies the effect of explanatory variables. In addition, an extended version is considered, in which the effects of covariates are specified nonparametrically. The proposed model and traditional models are compared by utilizing several real data applications.


Introduction
In many applications the response variable of interest is a nonnegative integer or count which one wants to relate to a set of covariates. Classical regression models are the Poisson model and the negative binomial model, which can be embedded into the framework of generalized linear models (McCullagh and Nelder, 1989). More general models use the generalized Poisson distribution, the double Poisson distribution or the Conway-Maxwell-Poisson distribution (Consul, 1998;Zou et al., 2013;Sellers and Shmueli, 2010). Specific models designed to account for excess zeros are the hurdle model and the zero-inflation model. Concise overviews of modeling strategies were given by Kleiber and Zeileis (2008), Hilbe (2011), Cameron and Trivedi (2013) and Hilbe (2014). Specific models using distributions beyond the classical ones were considered, among others, by Joe and Zhu (2005), Gschoessl and Czado (2006), Nikoloulopoulos and Karlis (2008) and Rigby et al. (2008). More recently, existing count regression approaches were compared, for example, by Hayat and Higgins (2014), Payne et al. (2017) and Maxwell et al. (2018).
Most of the established models assume that a fixed distribution holds for the response variable conditional on the values of covariates, and the mean (and possibly the variance parameter) is linked to a linear function of the covariates. Various methods have been proposed to estimate the parametric effect of covariates on the counts (Cameron and Trivedi, 2013). Specifying a distribution of the counts, however, can be rather restrictive, and the validity of inference tools depends on the correct specification of the distribution.
To address this issue, we propose a class of models that does not require specifying a fixed distribution for the counts. Instead, the form of the distribution is determined by parameters that reflect the tendency to higher counts, which has the effect that the fitted distribution is fully data-driven. For the estimation of the parameters that determine the distribution penalized maximum likelihood estimation procedures are proposed. The proposed models automatically account for zero inflation, which typically calls for more complex models, see, for example, Loeys et al. (2012). Interpretation of the parameters is kept simple, since the conceptualization uses that counts typically result from a process, where the final count is the result of increasing numbers.
The effect of covariates on the response is modeled by a linear term. This enables an easy interpretation of the regression coefficients in terms of multiplicative increases or decreases of the counts. The proposed models are semiparametric in nature, because the distribution of the response is modeled in a flexible way adapted to the data while the effect of covariates is modeled parametrically. The models are also extended to allow for smoothly varying coefficients in a nonparametric fashion.
A main advantage of the proposed model class is that it can be embedded into the framework of binary regression. This implies that standard software for maximum likelihood estimation of generalized additive models (Wood, 2006) can be used for model fitting.
The rest of the article is organized as follows: In Section 2 classical models for count data are briefly reviewed. In Section 3 the model class is introduced and penalized maximum likelihood estimation methods are described. Section 4 is devoted to illustrative applications. In Section 5 the model is extended to allow for more flexible effects of covariates, which are not necessarily restricted to linear effects. Details on implementation and software are given in Section 6. Section 7 summarizes the main findings of the article.

Classical Models for Count Data
Let Y i ∈ {0, 1, 2, . . . } denote the response variable and x T i = (x i1 , . . . , x ip ) a vector of covariates of an i.i.d. sample with n observations. In generalized linear models (GLMs) one specifies a distributional assumption for Y i |x i and a structural assumption that links the mean µ i = E(Y i |x i ) to the covariates. The structural assumption in GLMs has the form where β = (β 1 , . . . , β p ) is a set of real-valued coefficients, g is a known link function and h = g −1 denotes the response function, see, for example, McCullagh and Nelder (1989), Fahrmeir and Tutz (2001).

Poisson and Negative Binomial Model
Popular models for count data are the Poisson model and the negative binomial model. The Poisson model assumes that Y i |x i follows a Poisson distribution P (µ i ), where the mean and the variance are given by µ i , respectively. The most widely used model uses the canonical link function by specifying In model (1) the expressions exp(β 1 ), . . . , exp(β p ) have a simple interpretation in terms of multiplicative increases or decreases of µ i . More generally on can assume that Y i |x i follows a negative binomial distribution NB(ν, µ i ) with probability density function (p.d.f.) with mean and variance given by While the mean is the same as for the simple Poisson model, the variance exceeds the Poisson variances by µ 2 i /ν, which may be seen as a limiting case (ν → ∞).

Zero-Inflation Model
In some applications one observes more zero counts than is consistent with the Poisson or negative binomial model; then data display overdispersion through excess zeros. This happens in cases in which the population consists of two subpopulations, the non-responders who are "never at risk" with counts Y i = 0 and the responders who are at risk with counts Y i ∈ {0, 1, . . . }. Formally, the zero-inflated density function is a mixture of distributions. With C i denoting the class indicator of subpopulations (C i = 1 for responders and C i = 0 for non-responders) one obtains the mixture distribution where π i = P (C i = 0) are the mixing probabilities. Typically one assumes that a classical count data model, for example the Poisson model (1), holds for the responders, that is, one assumes Y i |x i , C i = 1 ∼ P (µ i ), and a binary model, for example the logistic model, determines class membership. Then the link between responses and covariates is determined by the two predictors where γ = (γ 1 , . . . , γ p ) is a second set of real-valued coefficients. More general models, in which the Poisson distribution is replaced by the generalized Poisson distribution have been considered by Famoye and Singh (2003), Famoye and Singh (2006), Gupta et al. (2004), Czado et al. (2007) and Min and Czado (2010). Estimation procedures for zero-inflated models are available in the R package pscl .

Hurdle Model
An alternative model that is able to account for excess zeros is the hurdle model (Mullahy, 1986;Creel and Loomis, 1990). The model specifies two processes that generate the zero counts and the positive counts. It combines a truncated-at-zero count model (left-truncated at Y i = 1) which is employed for positive counts and a binary model or censored count model (right-censored at Y i = 1) which determines whether the response is zero or positive, i.e., if the "hurdle is crossed". Formally, the hurdle model is determined by where f 1 determines the binary decision between zero and a positive outcome. If the hurdle is crossed, the response is determined by the truncated count model with p.d.f.
If f 1 = f 2 , the model collapses to the so-called parent process f 2 . The model is quite flexible, because it allows for both under-and overdispersion. For example, in the hurdle Poisson model where both f 1 and f 2 correspond to Poisson distributions with means µ 1 and µ 2 , the link between responses and covariates is determined by the two predictors The Poisson and geometric hurdle model have been examined by Mullahy (1986), negative binomial hurdle models have been considered by Pohlmeier and Ulrich (1995). Estimation procedures for hurdle models are available in the R package pscl .

The Transition Model for Count Data
In particular the Poisson and the negative binomial model have a simple structure with a clearly defined link between the covariates and the mean. The models however are rather restrictive since they assume that the distribution of Y i |x i is known and fixed. A strict parametric form is assumed for the whole support {0, 1, 2, . . . } and typically just the dependence of the mean on the predictors is specified by the model.

The Basic Transition Model
The approach proposed here is much more flexible and does not assume a fixed distribution for the response. It focusses on the modelling of the transition between counts. In its simplest form, it assumes for where F (·) is a fixed distribution function. The parameters θ r represent intercept coefficients and β T = (β 1 , . . . , β p ) is a vector of regression coefficients. One models the transition probability δ ir = P (Y i > r|Y i ≥ r, x i ), which is the conditional probability that the number of counts is larger than r (i.e., transition to a higher value than r) given the number of counts is at least r. These probabilities are determined by a classical binary regression model. For example, if F (·) is the logistic distribution function, one uses the binary logit model. In general, a distribution of count data Y i can be characterized by the probabilities on the support π i0 , π i1 , . . . , where π ir = P (Y i = r), or the (conditional) transition probabilities δ i0 , δ i1 , . . . given by The transition model (5) specifies the transition probabilities. If no covariates are present, any discrete distribution with support {0, 1, 2, . . .} can be represented by the model, determined by the intercept parameters θ 0 , θ 1 , . . .. In the presence of covariates the intercepts represent the basic distribution of the counts, which is modified by the values of the covariates. Thus, the functional form of the count distribution is not restricted. In particular the response may follow a Poisson distribution or a negative binomial distribution. In addition the model is able to account for specific phenomena like excess zeros.
The parameters in model (5) have an easy interpretation depending on the function F (·). If one chooses the logistic distribution function one obtains the conditional transition to higher categories in the form and the regression coefficients β 1 , . . . , β p have the usual interpretation as in the common binary logit model. An alternative form is the representation as continuation ratios The ratio P (Y i > r|x i )/P (Y i = r|x i ) compares the probability that the number of counts is larger than r to the probability that the number of counts is equal to r.
The basic assumption of model (6) is that the effect of covariates is the same for any given category r. This property can also be seen as a form of strict stochastic ordering. That means, if one considers two population that are characterized by the covariate values x andx, one obtains Thus it is assumed that continuation ratios are the same for each category r.
The modeling of transitions as specified in model (5) was used in various contexts before. In ordinal regression transition modeling is known under the name sequential model; in the logistic version it is called continuation ratio model (Agresti, 2002;Tutz, 2012). Its properties as an ordinal regression model have been investigated in particular by McCullagh (1980). It is also used in discrete survival analysis, where one parameterizes the discrete hazard function λ(r|x i ) = P (Y i = r|Y i ≥ r, x i ) instead of the conditional probability of transition (Tutz and Schmid, 2016).
Modeling of transitions, however, seems not to have been used in the modeling of count data. The main difference to the use in ordinal regression and discrete hazard modeling is that, in contrast to these models, the number of categories is not restricted. In ordinal models one typically uses up to ten categories, which limits the number of parameters. In count data, however, the number of intercepts is unlimited. In extended models, where also the regression coefficients vary across categories (to be considered in Section 5) the main problem is the number of parameters that cannot be handled by simple maximum likelihood estimation. For count data and fixed predictor value, model (5) is a Markov chain model of order one, because the probability of transition depends only on the previously obtained category. It is related to categorical time series, which have been investigated by Kaufmann (1987), Fahrmeir and Kaufmann (1987), and Kedem and Fokianos (2002).

Illustration of Flexibility of the Model
Before giving details of the fitting procedure we demonstrate the flexibility of the proposed transition model by a small benchmark experiment that was based on 100 replications. We generated samples of size n = 100 with the outcome values drawn from (i) a Poisson distribution, y i ∼ P o(µ i = 5), i = 1, . . . , n, and (ii) a negative binomial distribution, y i ∼ N B(ν = 5/8, µ i = 5), i = 1, . . . , n, which equals variance var(y i ) = 45. Figure 1 shows the estimated probability density functions for 10 randomly chosen replications (upper and middle panel) and the average estimated probability density function over all 100 replications (lower panel) obtained from fitting the transition model and the true data-generating model (Poisson or negative binomial) to the data, respectively. In both cases, it is seen that the transition model is well able to capture the underlying distribution. In particular, the average estimated p.d.f. and the true p.d.f. (black line) closely coincide for both distributions (lower panel).

Estimation Maximum Likelihood Estimation
where α T = (θ 0 , θ 1 , . . . , β T ) collects all the parameters and π ir = P (Y i = r|x i ) is the probability of observing category r, which for the transition model (5) is given by For the model considered here it is helpful to represent the data in a different way. One considers the underlying Markov chain with I(·) denoting the indicator function (I(a) = 1, if a holds, I(a) = 0 otherwise). Then the likelihood can be given in the form By using (8) it can be rewritten as In (9) . . , 0, 1). They can be seen as dummy variables for the response or as binary variables that code if transition to the next category occurred or not. The value Y ir = 0, r < Y i , denotes that the transition to a higher category than r occurred. If one wants to code transition as 0-1 variable with 1 denoting transition to a higher category one usesỸ ir = 1 − Y ir yielding the likelihood which is obviously equivalent to the log-likelihood of the binary response model Thus the model can be fitted by using maximum likelihood methods for binary data that encode the sequence of transitions up to the observed response.

Penalized Maximum Likelihood Estimation
Maximum likelihood (ML) estimators tend to fail because model (5) contains many parameters, in particular the number of intercepts becomes large unless the counts are restricted to very small numbers. Therefore alternative estimators are needed. We will use penalized maximum likelihood estimates. Then instead of the log-likelihood (10) one maximizes the penalized log-likelihood where (·) is the common log-likelihood of the model and J(α) is a penalty term that penalizes specific structures in the parameter vector. The parameter λ is a tuning parameter that specifies how serious the penalty term has to be taken. Since the intercept parameters θ r determine the dimensionality of the estimation problem the penalty is used to regularize these parameters.
A reasonable assumption on the θ-parameters is that they are changing slowly over categories. A penalty that enforces smoothing over response categories uses the squared differences between adjacent categories, Let M max denote the maximal value that has been observed, that is, M max = max{Y i }. Then one uses the penalty where M is larger than M max , for example, M can be chosen as the integer closest to 1.2 M max . When maximizing the penalized log-likelihood one obtains θ Mmax = θ Mmax+1 = · · · = θ M . It is important to choose M larger than M max to account for possibly larger future observations and to avoid irregularities at the boundaries. If one uses λ = 0 in (11) one obtains the ML estimates. In the extreme case λ → ∞ all parameters obtain the same value. An alternative, more general smoothing technique uses penalized splines as proposed by Eilers and Marx (1996). Let the θ-parameters be specified as a smooth function over categories by using an expansion in basis functions of the form where φ k (·) are fixed basis functions. We will use B-splines (Eilers and Marx, 1996) on equally spaced knots in the range [0, M ], where M is again a larger value than the maximal observed response. The penalty now does not refer to the θ-parameters themselves but to the γ-parameters. A flexible form is where ∆ d is the difference operator, operating on adjacent B-spline coefficients, that is, The method is referred to as P-splines (for penalized splines).

Embedding into the Framework of Varying-Coefficients Models
The transition variables (Ỹ i0 ,Ỹ i1 , . . . ,Ỹ i,Y i ) T = (1, 1, . . . , 1, 0) in (10) are binary variables. The log-likelihood is the same as for binary response models of the form P (Ỹ ir = 1|x i ) = F (θ r + x T i β). The binary models can also be seen as varying-coefficients models of the form P (Ỹ ir = 1|x i ) = F β(r) + x T i β , where β(r) = θ r is an unknown function of the category, that is, the intercepts vary across categories. By considering the category as an explanatory variable one may treat the models as specific varying-coefficients models and use existing software for fitting (see Section 6 for further details).

Selection of Smoothing Parameter and Prediction Accuracy
For the choice of the tuning parameter (for example by resampling or crossvalidation) a criterion for the accuracy of prediction is needed. A classical approach in linear models is to estimate the mean and compare it to the actually observed response by using the quadratic distance. But since the whole distribution given a fixed covariate is estimated it is more appropriate to compare the estimated distribution to the degenerated distribution that represents the actual observation by using loss functions.
Candidates for loss functions are the quadratic loss function L 2 (π i ,π i ) = r (π ir −π ir ) 2 and the Kullback-Leibler loss L KL (π i ,π i ) = r π ir log(π ir /π ir ), where the vectors π i = (π i0 , π i1 , . . .) andπ i = (π i0 ,π i1 , . . .) denote the true and estimated probabilities. When the true probability vector is replaced by the 0-1 vector of observations Y i = (Y i0 , Y i1 , Y i2 , . . .) one obtains for the quadratic loss the Brier score For the Kullback-Leibler loss one obtains the logarithmic score L KL (Y i ,π i ) = − log(π i,Y i ). The latter has the disadvantage that the predictive distributionπ i is only evaluated at the observation Y i . Therefore, it does not take the whole predictive distribution into account. As Gneiting and Raftery (2007) postulate, a desirable predictive distribution should be as sharp as possible and well calibrated. Sharpness refers to the concentration of the distribution and calibration to the agreement between the distribution and the observation. For count data, a more appropriate loss function derived from the continuous ranked probability score (Gneiting and Raftery, 2007), which will also be used in the applications in the next section, is whereπ i (r) =π i0 +π i1 + . . . +π ir represents the cumulative distribution. It was also used by Czado et al. (2009) for the predictive assessment of count data.

Applications
To illustrate the applicability and the added value of the proposed transition model, we show the results of three real-world data examples comparing the various approaches introduced in the previous sections. Specifically, we consider (i) the Poisson model, hereinafter referred to as Poisson, (ii) the negative binomial model, referred to as NegBin, (iii) the zero-inflation model (3) using a Poisson model for the responders and a logit model to determine the class membership, referred to as Zero, (iv) the hurdle model (4) using a logit model for f 1 and a Poisson model for f 2 , referred to as Hurdle, (v) the transition model with a quadratic difference penalty (11) on the intercepts, referred to as QuadPen and (vi) the transition model, where the intercepts are expanded in cubic B-splines using a first order difference penalty (12), referred to as P-Splines.
To determine the optimal smoothing parameters λ for models (v) and (vi) and to compare the predictive performance of all six approaches we used the ranked  probability score (13). For this purpose, we repeatedly (100 replications) generated subsamples without replacement containing 2/3 of the observations in the original data and computed the ranked probability score from the remaining test data sets (i.e., from 1/3 of the original data).

Absenteeism from School
We first consider a sociological study on children in Australia. The data set is available in the R package MASS (Venables and Ripley, 2002) and was initially analysed by Aitkin (1979). The data consists of a sample of 146 children from New South Wales, Australia. The outcome of interest is the number of days a child was absent from school in one particular school year, taking values Y i ∈ {0, . . . , 81}. The covariates included in the models are Aboriginal ethnicity (Eth; 0: no, 1: yes), gender (Sex; 0: female, 1: male), the educational stage (Edu; 0: primary, 1: first form, 2: second form, 3: third form) and a learner status (Lrn; 0: average, 1: slow). Figure 2 shows the results of the resampling experiment comparing the six different approaches. The ranked probability score was computed from the test data sets (containing 46 observations each) for outcome values r ∈ {0, . . . , 30}. This range of values was chosen to ensure outcome values up to r = 30 in each of the learning and test samples. It is seen that the negative binomial model (second boxplot) and the two transition models (fifth and sixth boxplot) outperform the three alternative approaches with only minor differences between them. This result indicates that the negative binomial distribution is much more appropriate than the Poisson distribution for the analysis of this data set. Importantly, the performance of the flexible transition model in terms of accuracy of prediction is the same as for the negative binomial model. The transition model is well able to capture the essential characteristics of the data.
The estimated coefficientsβ and the corresponding standard errors and zvalues obtained by the negative binomial model and the transition model with penalized B-splines using the whole sample of n = 146 children are given in Ta- ble 1. Overall, the results of both models widely coincide. From the z-values it can be derived that only ethnicity has a significant effect on the outcome (at the 5% type 1 error level). Based on the negative binomial model, the expected number of days a child is absent from school is increased by the factor exp(0.569) = 1.766 for the group of aboriginal children. In terms of the transition model, the continuation ratio (defined in (7)) is increased by the factor exp(0.585) = 1.795, indicating higher counts in the group of aboriginal children. Deb and Trivedi (1997) analyzed the demand for medical care for individuals, aged 66 and over, based on a dataset from the U.S. National Medical Expenditure survey in 1987/88. The data ("NMES1988") are available from the R package AER . Like Zeileis et al. (2008) we consider the number of physician/non-physician office and hospital outpatient visits (Ofp) as outcome variable. The covariates used in the present analysis are the selfperceived health status (Health; 0: poor, 1: excellent), the number of hospital stays (Hosp), the number of chronic conditions (Numchron), age, maritial status (Married; 0: no, 1: yes), and number of years of education (School). Since the effects vary across gender, we restrict consideration to male patients (n = 356). Figure 3 shows the unconditional distribution of the outcome Y i ∈ {0, . . . , 40}. The ranked probabilty scores for outcome values r ∈ {0, . . . , 30} obtained from the approaches (i) to (vi) are shown in Figure 4. When fitting the transition models the minimal ranked probability score (averaged over 100 replications) was  obtained for λ = 5 (QuadPen) and λ = 16 (P-Splines), respectively. Similar to the previous example the negative binomial model (second boxplot) and the two transition models (fifth and sixth boxplot) performed considerably better than the Poisson model and the two models accounting for excess zeros. The estimated coefficientsβ and the corresponding standard errors and zvalues obtained by the negative binomial model and the transition model with penalized B-splines using the whole sample of n = 356 patients are given in Table 2. Again, the two models yielded very similar results. An excellent health status reduced the expected number of visits, whereas the number of hospital   stays and the number of years of education significantly increased the expected number of visits. Figure 5 shows the fitted smooth function of the θ-parameters obtained by the transition model with penalized B-splines, which represents the basic distribution of the counts. The function reveals decreasing coefficients with a local peak at ∼ 20 visits.

Transition Model with Varying Coefficents
An extended form of the transition model assumes for where the regression coefficients β T r = (β 1r , . . . , β pr ) may vary over categories. The parameter β jr represents the weight on variable j for the transition to higher categories than r. Within the basis functions approach each β-parameter can be represented as where φ k (r) are fixed basis functions. Then the whole predictor of the model becomes x ij γ jk φ k (r) .

Extended Model for Excess Zeros
A specific model with varying coefficients that is tailored to the case of excess zeros is obtained if one separates the first transition from all the other transitions.
In the model the first transition is determined by the parameter vector β 0 while the other transitions are determined by the parameter vector β. As in zero-inflated count models and hurdle models one specifies separate effects that model the occurrence of excess zeros. In Model (15) one has varying coefficients β(r) = θ r in the second equation of the model that vary across categories r = 1, 2, . . .. These can again be fitted using a quadratic difference penalty or penalized B-splines as described for the basic model in Section 3.

Absenteeism from School (contd.)
Let us again consider the study on schoolchildren in Australia. Figure 6 shows the results when fitting the extended transition model (14) in its most general, that means the coefficients of all covariates are expanded in cubic B-splines using a first order difference penalty.
As in the basic transition model (cf. Table 1) the estimated functions indicate non-significant constant effects for girls compared to boys and for second form pupils compared to primary pupils (Edu:2). However, there are significant nonlinear effects for ethnicity (which was also significant in the basic model) as well as for first form pupils (Edu:1) and learner status.
As an example, let us consider how slow learners compare to average learners (upper right panel of Figure 6). It is seen that the continuation ratio increases by the factor exp(0.024) = 1.024 (r = 0) up to the factor exp(0.726) = 2.067 (r = 23), which doubles the probability for a higher count. That means the type of learner has a stronger impact on the days of absence if the student already has been absent for many days.

Demand for Medical Care (contd.)
The results from fitting the extended model (15) accounting for excess zeros to the data of the National Medical Expenditure survey are shown in Table 3 and Figure 7. There are remarkable differences compared to the previous results in Table 2: (i) the number of years of education (school) had a significant effect on the first transition, but there was no evidence for an effect on the other transitions (z-value = 1.591), (ii) the number of chronic conditions, which was not significant in the basic model, showed a significant positive effect (β 0 = 0.591) on the first transition (driving the decision to consult a doctor or not), and (iii) an excellent health status and the number of hospital stays had significant effects only in part of the model that models the transition to higher categories given the number of visits was already above zero. The ranked probability score of the extended model (evaluated on the test data sets and averaged over 100 replications) was 3.562, which indicates the best predictive performance among all considered models (see also Figure 4).

Boating Trips
As third example we consider data based on a survey in 1980 administered to n = 659 leisure boat owners in eastern Texas, which is available from R package AER  and was analyzed before by Ozuna and Gomez (1995). Here, the outcome of interest is the number of recreational boating trips to Lake Somerville, Y i ∈ {0, . . . , 40}. Note that 417 (63%) observations take the value zero, which calls for a model accounting for excess zeros (two outliers with outcome values 50 and 88 were excluded). The five covariates used in the present analysis are the facility's subjective quality ranking (Quality; 1: very negative -5: very positive), an indicator, if the individual did water-skiing at the lake (Ski; 0: no, 1:yes), the annual household income (Income; in 1,000 USD), an indicator, if the individual payed an annual user fee at the lake (Userfee; 0: no, 1: yes) and the expenditure when visiting the lake (Cost; in USD). Next to the approaches (i) -(iv), we also considered the extended transition model (15), referred to as P-Splines (Zero), for model comparison. Figure 8 shows the ranked probability scores computed from the test data sets (containing 219 observations each) for outcome values r ∈ {0, . . . , 30}. Among the classical models (right panel) the zero-inflation and hurdle model (third and fourth boxplot) showed a better prediction accuracy than the simple Poisson model and the negative binomial model. They also outperformed the basic transition models. The best-performing model (on average), however, was the extended transition model (seventh boxplot), which demonstrates the added value of accounting for excess zeros.
The results from fitting the extended transition model to the whole sample of n = 657 individuals is given in Table 4. Due to a quasi-complete separation of the outcome with regard to Userfee (all individuals paying a user fee had > 0 counts), the corresponding coefficient estimate tends to +∞ in the zero part of the model. Also a high quality ranking increases the probability for outcome values greater than zero (β 0 = 1.480). Within the second part of the model the expected number of boating trips was significantly higher for individuals that (i) gave a positive quality evaluation, (ii) did water-skiing at the lake, and (iii) payed a user fee. On the other hand, the expected number of boating trips decreased with the expenditure spent when visiting the lake. The continuation ratio (7) is decreased by the factor exp(−1) = 0.368 with an increased expenditure of 100 USD.

Software
A crucial advantage of the proposed transition models is that they can be fitted using standard software for binary response models. Before fitting models one has to generate the binary data (Ỹ i0 ,Ỹ i1 , . . . ,Ỹ i,Y i ) T = (1, 1, . . . , 1, 0) that encode the transitions up to the observed outcome. This is done by the generation of an augmented data matrix, which is composed of a set of smaller (augmented) data matrices for each individual. The resulting matrix has n i=1 Y i rows. In R the augmented data matrix can be generated using the function dataLong() in the R package discSurv (Welchowski and Schmid, 2019). Estimates of the model with a quadratic penalty on the intercepts can be computed using the function ordSmooth() in the R package ordPens (Gertheiss, 2015), estimates of the model with penalized B-splines can be obtained using the function gam() in the R package mgcv (Wood, 2006).

Concluding Remarks
A semiparametric alternative for the modeling of count data is proposed. The models are very flexible, as they do not assume a fixed distribution for the response variable, but adapt the distribution to the data by using smoothly varying coefficients. The extended form of the model further allows that also the regression coefficients vary smoothly over categories. Importantly, the models directly account for the presence of excess zeros, which has the advantage that no specific two-component model needs to be specified.
Our applications showed that in terms of prediction (measured by the ranked probability score) the transition models outperform the simple Poisson model and the models tailored to excess zeros, and perform at least as good as the negative binomial model. It was also illustrated that the extension to varying regression coefficients further enhances the flexibility and the predictive ability of the model class.
An important advantage of the transition models is that they can be embedded into the class of binary regression models. Therefore all inference techniques including asymptotic results to obtain confidence intervals that have been shown to hold for this class of models can be used. A further consequence is that selection of covariates can be done within that framework. Selection of covariates may be demanding even for a moderate number of covariates, because the number of regression coefficients highly depends on the number of response categories. For example, regularization methods as the lasso are applicable but beyond the scope of this article. We restricted our consideration to the logistic link function. Although it is an attractive choice, because of the simple interpretation of effects, it should be noted that also alternative link function (e.g., the complementary log-log link function) can be used for fitting. Also the assumption of linear predictors can be relaxed by additive predictors, for example, by the use of spline functions for continuous covariates. Then one might investigate non-linear (smoothly varying) coefficients.