1 Introduction

The analysis of count data represents an important topic in the statistical literature and numerous textbooks (e.g., Cameron and Trivedi 1998; McCullagh and Nelder 1989; Hastie and Tibshirani 1990; Winkelmann 2005) have been partially or entirely dedicated to the subject. The most established approach is to use parametric or semiparametric regression models that are typically based on the Poisson distribution and its generalizations, and to maximize a pseudo-likelihood to estimate conditional means (Gourieroux et al. 1984).

Machado and Santos Silva (2005) proposed analyzing count data using quantile regression (\(\textsc {qr}\); Koenker and Bassett 1978; Koenker 2005). Their approach avoids strong parametric assumptions and enables investigating every aspect of the conditional distribution, and not just its mean. This idea, however, comes with some complications. The fact that the conditional distribution of the data is not absolutely continuous, in combination with the non-smoothness of the objective function, causes a non-standard rate of convergence (Manski 1975, 1985) and may also generate identifiability issues and computational problems.

The solution proposed in Machado and Santos Silva’s paper is to artificially smooth the data by applying jittering (Stevens 1950). A continuous outcome is generated by adding a random quantity in [0, 1) to the original counts, and estimation and inference are carried out by applying standard \(\textsc {qr}\). A one-to-one correspondence between the quantiles of the counts and those of the artificial data can be established. This method has been applied in various fields including the analysis of fertility (Miranda 2008; Booth and Kee 2009), frequency of doctor visits (Moreira and Barros 2010; Winkelmann 2006), car accidents (Qin and Reyes 2011), and the capacity of pre-enrollment tests to predict students’ performance (Grilli et al. 2016). A Bayesian version of jittering has been proposed by Lee and Neocleous (2010). Jittering has been advocated as a computational trick to avoid degenerate solutions (e.g., Koenker 2017), which commonly occur in the presence of discrete responses.

Other recent approaches include that of Congdon (2017), in which the asymmetric Laplace distribution is combined with a Poisson model in a Bayesian framework, and the model-based quantile regression of Padellini and Rue (2018), in which quantiles are mapped to the parameters of a generalized linear model identified by a continuous version of a valid count distribution. Tzavidis et al. (2015) proposed a semiparametric M-quantile approach for counts that extends the ideas of Cantoni and Ronchetti (2001) and Breckling and Chambers (2001). These methods avoid jittering, but depend on a limited choice of predefined parametric models.

The idea presented in this paper is to describe the quantile regression coefficients, say \(\varvec{\beta }(p)\), by parametric, smooth functions of p, say \(\varvec{\beta }(p \mid \varvec{\theta })\), using the quantile regression coefficients modeling (qrcm) framework described in Frumento and Bottai (2016, 2017). With this approach, the conditional quantile function is modeled parametrically as if the response variable were continuous. The goal is to utilize a “working model” that does not reflect the actual data distribution, but permits estimating a smooth quantile function by minimizing a smooth loss function, in the same spirit as Efron (1992).

The proposed method bears some similarities with Padellini and Rue’s (2018) approach, in which a statistical model describes a hypothetical continuous counterpart of an originally discrete response. Our modeling framework, however, does not utilize known families of distributions such as Poisson or Negative Binomial, and allows estimating quantile functions with an arbitrary parametric structure. As shown in the paper, this approach not only avoids jittering, but also generates more efficient estimators, simplifies estimation and inference, and facilitates the interpretation of the results.

The paper is structured as follows. In Sect. 2 we describe the \(\textsc {qrcm}\) framework in a general quantile regression situation. In Sect. 3 we show how \(\textsc {qrcm}\) can be applied to count data. We describe computation and inference in Sect. 4, and report simulation results in Sect. 5. In Sect. 6 we show the usefulness of the proposed method by analyzing a dataset relating the frequency of doctor’s visits to a set of demographic and socio-economic predictors. The R package qrcm implements the described estimator and includes all the necessary functions for inference, prediction, and plotting.

2 Quantile regression coefficients modeling

We denote by \(Y_i\) a response variable of interest, and by \(\varvec{x}_i\) a q-dimensional vector of observed covariates, \(i = 1, \ldots , n\). The standard quantile regression (qr) model assumes that

$$\begin{aligned} Q_{T(Y)}(p \mid \varvec{x}) = \varvec{x}^{\mathrm {\scriptscriptstyle T} }\varvec{\beta }(p) \end{aligned}$$
(1)

is the conditional quantile function of some known, monotone transformation \(T(\cdot )\) of Y. Working with a transformed response is common in practice and may be convenient with non-negative or bounded outcomes. Note that, in a general \(\textsc {qr}\) framework, the response variable is assumed to be sampled from an absolutely continuous population, which is not true if Y is a count. Estimation of \(\varvec{\beta }(p)\) is carried out by minimizing the objective function

$$\begin{aligned} L(\varvec{\beta }(p)) = \sum _{i = 1}^{n}{(p - \omega _{p,i})(T(y_i) - \varvec{x}^{{\mathrm {\scriptscriptstyle T} }}_i\varvec{\beta }(p))}, \end{aligned}$$
(2)

in which \(y_i\) is a realization of \(Y_i\), and \(\omega _{p,i} = I(T(y_i) \le \varvec{x}^{\mathrm {\scriptscriptstyle T} }_i\varvec{\beta }(p))\).
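
As a minimal illustration of objective (2), the following Python sketch evaluates the loss in the simplest possible setting, assuming no covariates and the identity transformation, so that \(\varvec{\beta }(p)\) reduces to a scalar. The function name `check_loss` and the data are hypothetical and are not part of the original analysis.

```python
def check_loss(beta, y, p):
    """Objective (2) without covariates: sum_i (p - I(y_i <= beta)) * (y_i - beta)."""
    return sum((p - (1.0 if yi <= beta else 0.0)) * (yi - beta) for yi in y)


# Hypothetical sample of counts.
y = [0, 1, 1, 2, 3, 3, 3, 5, 8, 13]

# The loss is piecewise linear in beta, so a minimizer can be searched
# among the observed values.
candidates = sorted(set(y))
best_median = min(candidates, key=lambda b: check_loss(b, y, 0.5))
best_q75 = min(candidates, key=lambda b: check_loss(b, y, 0.75))
```

In this setting the minimizer is a sample quantile of order p, and the loss is a non-smooth function of the coefficient, which is the feature that the rest of the section discusses.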

Quantile regression does not require distributional assumptions and permits investigating every aspect of possibly asymmetric, heavy-tailed, or multimodal response variables, showing the effect of covariates at different quantiles. Although the distribution-free nature of \(\textsc {qr}\) is generally seen as an advantage, it also represents its main weakness. Quantiles are estimated one at a time and no parametric structure is assigned to the coefficient functions \(\varvec{\beta }(p)\). This makes estimation inefficient, generates a large amount of random variability, and causes both the loss function, \(L(\varvec{\beta }(p))\), and the estimated coefficients, \({\hat{\varvec{\beta }}}(p)\), to be non-smooth functions of their arguments.

Joint estimation of multiple quantiles has been discussed by numerous authors (e.g., Tokdar and Kadane 2012; Reich 2012; Reich and Smith 2013; Yang and Tokdar 2017; Das and Ghosal 2017, 2018; Fabrizi et al. 2020), and is implemented in the R packages BSquare (Smith and Reich 2013) and qrjoint (Tokdar and Cunningham 2018). The idea of simultaneous quantile regression is recurrent in the literature on quantile crossing (e.g., He 1997; Bondell et al. 2010; Liu and Wu 2011), and can be used to improve estimation of individual fixed effects (Koenker 2004).

Frumento and Bottai (2016, 2017) suggested using a fully parametric approach and reformulated model (1) as

$$\begin{aligned} Q_{T(Y)}(p \mid \varvec{x}, \varvec{\theta }) = \varvec{x}^{\mathrm {\scriptscriptstyle T} }\varvec{\beta }(p \mid \varvec{\theta }), \end{aligned}$$
(3)

where \(\varvec{\theta }\) is a vector of model parameters that describe the functional form of the coefficient functions, \(\varvec{\beta }(p) = \varvec{\beta }(p \mid \varvec{\theta })\). This approach is referred to as quantile regression coefficients modeling (qrcm) and is implemented in the qrcm R package (Frumento 2020). An estimate of \(\varvec{\theta }\) can be obtained by minimizing

$$\begin{aligned} L(\varvec{\theta }) = \int _0^1{L(\varvec{\beta }(p \mid \varvec{\theta })) \mathrm {d}p}, \end{aligned}$$
(4)

which is the integral, with respect to the order of the quantile, of the loss function of standard quantile regression displayed in (2). Unlike \(L(\varvec{\beta }(p))\), the loss defined in Eq. (4) is a smooth function of its arguments, which permits using Newton-type algorithms to carry out minimization, and applying the standard theory of M-estimators (e.g., Newey and McFadden 1994) to derive asymptotic results and implement inference.

An illustration of the parametric approach described in model (3) is given in Fig. 1, which is constructed using simulated data. On the left, we report the estimated \(\textsc {qr}\) coefficients of order \(p = 0.01,0.02, \ldots , 0.99\). On the right, we show a parametric fit based on a cubic function, \(\beta (p \mid \varvec{\theta }) = \theta _0 + \theta _1p + \theta _2p^2 + \theta _3p^3\). Describing coefficients by smooth functions with closed-form mathematical expressions comes with an obvious gain in efficiency, especially in the tails, and makes it simpler to report and interpret the results.

Fig. 1

Comparison between \(\textsc {qr}\) and \(\textsc {qrcm}\). Left: coefficient function estimated by standard quantile regression. Right: the coefficient is modeled by a cubic function, \(\beta (p \mid \theta ) = \theta _0 + \theta _1p + \theta _2p^2 + \theta _3p^3\), and estimation is carried out by minimizing \(L(\varvec{\theta })\) (Eq. 4). Shaded areas represent pointwise confidence intervals

3 Quantile regression coefficients modeling with counts

When \(\textsc {qr}\) is applied to count data, a non-standard rate of convergence is obtained as a result of the non-smoothness of the objective function in conjunction with the discreteness of the response variable. The problem was addressed, among others, by Manski (1975, 1985) and Huber (1981). In their paper, Machado and Santos Silva (2005) suggested generating an artificial continuous variable \(Y^* = Y + U\) by adding a quantity \(U \in [0,1)\) to the original counts Y, a procedure that was referred to as jittering (Stevens 1950). The most common choice is to define U to be a uniform random variable, independent of Y and \(\varvec{x}\). A quantile regression model of the form

$$\begin{aligned} Q_{T(Y^*)}(p \mid \varvec{x}) = \varvec{x}^{\mathrm {\scriptscriptstyle T} }\varvec{\beta }(p) \end{aligned}$$
(5)

is assumed to hold for some known monotone transformation \(T(\cdot )\) of \(Y^*\). Because \(Y^*\) is continuous, ordinary quantile regression can be applied to \(T(Y^*)\) and standard asymptotic theory holds. Given an estimate \({\hat{\varvec{\beta }}}(p)\) of \(\varvec{\beta }(p)\), quantiles of the original count Y are consistently estimated by

$$\begin{aligned} {\hat{Q}}_Y(p \mid \varvec{x}) = \lceil T^{-1}(\varvec{x}^{\mathrm {\scriptscriptstyle T} }{\hat{\varvec{\beta }}}(p)) - 1 \rceil , \end{aligned}$$
(6)

where \(\lceil a \rceil\) denotes the ceiling operator. Because the value of \({\hat{\varvec{\beta }}}(p)\) depends on the specific realization \(\{u_i\}_{i = 1}^n\) of U, it is preferable to compute \({\hat{\varvec{\beta }}}(p)\) as the average estimate across m jittered samples (average-jittering). Another solution is to adopt the Bayesian framework described by Lee and Neocleous (2010), in which new values of U are simulated at each iteration of the Markov chain Monte Carlo algorithm.
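
The jittering procedure and the back-transformation in Eq. (6) can be sketched in a few lines of Python. The example is a simplified illustration under stated assumptions: there are no covariates, \(T(\cdot )\) is the identity, and a sample quantile replaces the quantile regression fit. All function names are hypothetical.

```python
import math
import random


def jitter(y, rng):
    """Return y_i + u_i, with u_i drawn uniformly from [0, 1)."""
    return [yi + rng.random() for yi in y]


def sample_quantile(values, p):
    """Smallest order statistic whose empirical CDF is at least p."""
    s = sorted(values)
    k = max(math.ceil(p * len(s)), 1)
    return s[k - 1]


def count_quantile(q_star):
    """Eq. (6) with T the identity: map a quantile of Y* to a quantile of Y."""
    return math.ceil(q_star - 1)


rng = random.Random(1)
y = [rng.randrange(10) for _ in range(2000)]  # counts uniform on {0, ..., 9}
p, m = 0.35, 50

# Average-jittering: average the estimate over m jittered copies of the data.
q_star_hat = sum(sample_quantile(jitter(y, rng), p) for _ in range(m)) / m
q_hat = count_quantile(q_star_hat)
```

For counts that are uniform on \(\{0, \ldots , 9\}\), the jittered variable has quantile function 10p, so that `count_quantile(3.5)` returns 3, which is the quantile of order 0.35 of the original counts.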

Alternative approaches, which are briefly discussed in Machado and Santos Silva’s paper, aim to replace \(L(\varvec{\beta }(p))\) with a smooth objective function. This can be obtained by replacing the indicator functions \(\omega _{p,i} = I(T(y_i^*) \le \varvec{x}^{\mathrm {\scriptscriptstyle T} }_i\varvec{\beta }(p))\) (Eq. 2) with a smooth counterpart (e.g., an integrated kernel), or by directly using a different loss, such as the asymmetric maximum likelihood proposed by Efron (1992).

In this paper, we suggest applying the \(\textsc {qrcm}\) framework to model a discrete response, using a parametric quantile function as if the data were generated from an absolutely continuous population. The objective is that of imposing some degree of smoothing to the assumed distribution, without altering the response itself.

Our proposal is based on the empirical evidence that almost identical estimators are obtained by applying \(\textsc {qrcm}\) to the jittered response, \(Y^*= Y + U\), and directly to \(Y^\circ = Y + E[U]\). Such equivalence results from the parametric structure that is imposed on the quantile function, which permits “smoothing away” the points of mass in the empirical distribution of the data.

Without loss of generality, we assume \(E[U] = 0.5\) and apply model (3) to some transformation \(T(\cdot )\) of \(Y^\circ = Y + 0.5\),

$$\begin{aligned} Q_{T(Y^\circ )}(p \mid \varvec{x}, \varvec{\theta }) = \varvec{x}^{\mathrm {\scriptscriptstyle T} }\varvec{\beta }(p \mid \varvec{\theta }). \end{aligned}$$
(7)

Here, \(\varvec{\beta }(p \mid \varvec{\theta })\) is a vector of parametric coefficient functions that are assumed to be continuous and differentiable functions of p. The resulting quantile function is itself continuous, and depends on the covariates according to the same modeling structure used in standard quantile regression. The function to be minimized, \(L(\varvec{\theta })\), is given in Eq. (4) and, unlike \(L(\varvec{\beta }(p))\), is also continuous and continuously differentiable.

After an estimate \({\hat{\varvec{\theta }}}\) of \(\varvec{\theta }\) has been computed, quantile regression coefficients are obtained as \({\hat{\varvec{\beta }}}(p) = \varvec{\beta }(p \mid {\hat{\varvec{\theta }}})\), and Eq. (6) can be used to estimate the quantiles of Y,

$$\begin{aligned} Q_Y(p \mid \varvec{x}, {\hat{\varvec{\theta }}}) = \lceil T^{-1}(\varvec{x}^{\mathrm {\scriptscriptstyle T} }\varvec{\beta }(p \mid {\hat{\varvec{\theta }}})) - 1 \rceil . \end{aligned}$$
(8)

By definition, model (7) cannot be the true data-generating process and should be thought of as a “working model”. Although an interpretation in terms of a latent continuous variable is possible, the idea of fitting a continuous quantile function to a discrete outcome should be regarded as a computational expedient and can be seen as an implicit way of performing jittering.

3.1 A toy example

Suppose that a discrete response variable Y is uniformly distributed on the support \(\{0,1,2, \ldots , 9\}\). Assuming \(U \sim U[0,1)\), the jittered response \(Y^*= Y + U\) has a continuous U[0, 10) distribution, with quantile function \(Q_{Y^*}(p) = 10p\). To implement \(\textsc {qrcm}\), define a parametric model \(Q(p \mid \theta ) = \theta p\), in which the true value of the parameter is \(\theta = 10\). Simple algebra permits showing that the minimizer of \(L(\theta )\) under this model solves \(n^{-1}\sum _i \min (y_i/\theta ,1)^2 = 1/3\) and is a function of the quadratic mean of the data. It can be easily shown that, asymptotically, the same estimators of quantiles are obtained with the following three methods: (1) standard \(\textsc {qr}\) applied to \(Y^*\); (2) \(\textsc {qrcm}\) estimator applied to \(Y^*\); and (3) \(\textsc {qrcm}\) estimator applied to \(Y^\circ = Y + \sqrt{903}/6 - 4.5 \approx Y + 0.508\). To prove this result, just note that \((3E[(Y^*)^2])^{1/2} = (3E[(Y^\circ )^2])^{1/2} = 10\). While the equivalence between (1) and (2) is straightforward, point (3) shows that the same estimator can be obtained without jittering.
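
The toy example can be checked numerically. The Python sketch below works with the population version of the estimating equation, replacing the sample mean by the expectation over the uniform support of Y, and solves it by bisection. Function names are hypothetical.

```python
import math

support = range(10)                # Y uniform on {0, 1, ..., 9}
c = math.sqrt(903) / 6 - 4.5       # shift applied to Y in method (3)


def mean_sq_ratio(theta, shift):
    """E[min((Y + shift) / theta, 1)^2] under the uniform distribution of Y."""
    return sum(min((y + shift) / theta, 1.0) ** 2 for y in support) / len(support)


def solve_theta(shift, lo=1.0, hi=100.0, tol=1e-12):
    """Solve mean_sq_ratio(theta, shift) = 1/3 by bisection.

    The left-hand side is decreasing in theta on the bracket [lo, hi].
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_sq_ratio(mid, shift) > 1.0 / 3.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


theta_shift = solve_theta(c)       # population minimizer for Y + c
theta_half = solve_theta(0.5)      # population minimizer for Y + 0.5
```

With the shift \(\sqrt{903}/6 - 4.5\) the solution is \(\theta = 10\), as stated in the text. With the simpler shift 0.5 the same equation gives \(\theta = \sqrt{99.75} \approx 9.987\), which is close to, but not exactly equal to, the value obtained from the jittered response.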

3.2 Model building

As shown in simulations in Sect. 5, a “good” parametric model may outperform standard \(\textsc {qr}\) with uniform jittering in terms of bias and standard error. Defining a parametric model for a quantile function, however, is not trivial. Numerous alternative strategies to model \(\varvec{\beta }(p \mid \varvec{\theta })\) parametrically are presented in Frumento and Bottai’s (2016; 2017) papers, and the excellent book by Gilchrist (2000) can also be used for inspiration. The coefficient functions are usually simple, often monotone, and can sometimes be well approximated by linear or even constant functions. On the other hand, when flexibility is needed, polynomials or spline functions can be used. Some sensible options for model building are reported in Table 1.

Table 1 Strategies for model building

Note that standard \(\textsc {qr}\) estimators can be obtained as a special case of \(\textsc {qrcm}\) by allowing \(\varvec{\beta }(p \mid \varvec{\theta })\) to be an arbitrarily flexible function of p. The linear regression model, in which \(Y \sim N(\beta _0 + \beta _1x_1 + \beta _2x_2 + \cdots , \sigma ^2)\), is also a special case of model (3) and corresponds to the parametric quantile function defined by \(Q_Y(p \mid x) = \beta _0 + \sigma z(p) + \beta _1x_1 + \beta _2x_2 + \cdots\), where z(p) is the quantile function of a standard Normal distribution.
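
The Normal linear model special case can be verified with a short Python sketch that relies only on the standard library class `statistics.NormalDist`. The parameter values are hypothetical. The code builds the parametric quantile function with a single covariate and checks that the Normal distribution function evaluated at that quantile returns p.

```python
from statistics import NormalDist


def q_linear(p, x, beta0, beta1, sigma):
    """Q_Y(p | x) = beta0 + sigma * z(p) + beta1 * x, z = standard Normal quantile."""
    return beta0 + sigma * NormalDist().inv_cdf(p) + beta1 * x


beta0, beta1, sigma = 1.0, 0.5, 2.0   # hypothetical parameter values
x, p = 3.0, 0.9

q = q_linear(p, x, beta0, beta1, sigma)

# The CDF of N(beta0 + beta1 * x, sigma^2) evaluated at q recovers p.
p_back = NormalDist(mu=beta0 + beta1 * x, sigma=sigma).cdf(q)
```

In this model only the intercept, \(\beta _0 + \sigma z(p)\), depends on p, while the slopes are constant across quantiles.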

In this section, we describe a general strategy to formulate parametric quantile regression models that can be applied to count data. We assume that the working model defined by (7) is fitted,

$$\begin{aligned} Q_{T(Y^\circ )}(p \mid \varvec{x}, \varvec{\theta }) = \varvec{x}^{\mathrm {\scriptscriptstyle T} }\varvec{\beta }(p \mid \varvec{\theta }), \end{aligned}$$

where \(Y^\circ = Y + 0.5\).

In their 2005 paper, Machado and Santos Silva suggested modeling the following quantity: \(T(Y^*, p) = \log (Y^*- p)I(Y^*> p) + \log (\epsilon )I(Y^*\le p)\), where \(Y^*= Y + U\) is the jittered variable, and \(\epsilon\) some small positive number. This transformation is justified by the use of uniform jittering and, being a function of p, is only convenient when quantiles are estimated one at a time.

Here we suggest two alternative options: (i) to directly model \(Y^\circ\), letting \(T(\cdot )\) be the identity function; and (ii) to use a \(\log\) transformation, \(T(Y^\circ ) = \log (Y^\circ )\), by analogy with the standard link function of log-linear models. The choice is mainly determined by whether the association with the covariates is assumed to be linear or \(\log\)-linear. Note, however, that a \(\log\)-linear association could be well approximated by a linear model in which the covariates have been suitably transformed, e.g., by replacing them by the corresponding spline basis.

To formulate a parametric model for the coefficient functions, \(\varvec{\beta }(p \mid \varvec{\theta })\), we suggest using the following linear parametrization:

$$\begin{aligned} \varvec{\beta }(p \mid \varvec{\theta }) = \varvec{\theta }\varvec{b}(p), \end{aligned}$$
(9)

where \(\varvec{b}(p) = \left[ b_1(p), \ldots , b_k(p)\right] ^{{\mathrm {\scriptscriptstyle T} }}\) is a set of k known functions of p, and \(\varvec{\theta }\) is a \(q\times k\) matrix with entries \(\theta _{jh}\), \(j = 1, \ldots , q\), \(h = 1, \ldots , k\). The j-th regression coefficient is given by \(\beta _j(p \mid \varvec{\theta }) = \theta _{j1}b_1(p) + \cdots + \theta _{jk}b_k(p)\), and the quantile function is rewritten as

$$\begin{aligned} Q_{T(Y^\circ )}(p \mid \varvec{x}, \varvec{\theta }) = \varvec{x}^{\mathrm {\scriptscriptstyle T} }\varvec{\theta }\varvec{b}(p). \end{aligned}$$
(10)

This parametrization can prove flexible and computationally convenient (Frumento and Bottai 2016, 2017).

Consider, for example, a regression model with a single covariate x:

$$\begin{aligned} Q_{T(Y^\circ )}(p \mid x, \varvec{\theta }) = \beta _0(p \mid \varvec{\theta }) + \beta _1(p \mid \varvec{\theta })x. \end{aligned}$$

Some distributional assumptions can be directly translated into a parametric model for \(\beta _0(p \mid \varvec{\theta })\) and \(\beta _1(p \mid \varvec{\theta })\). For instance, if the working model assumes \((Y^\circ - \theta _{11}x) \sim \text {Exp}(1/\theta _{00})\), the quantile function is \(Q_{Y^\circ }(p \mid x, \varvec{\theta }) = -\theta _{00}\log (1 - p) + \theta _{11}x\), which can be written as

$$\begin{aligned} Q_{Y^\circ }(p \mid x, \varvec{\theta }) = \begin{bmatrix} 1&x \end{bmatrix} \begin{bmatrix} 0 &{} \theta _{00}\\ \theta _{11} &{} 0 \end{bmatrix} \begin{bmatrix} 1\\ -\log (1 - p) \end{bmatrix}, \end{aligned}$$

identifying \(\varvec{b}(p) = \left[ 1, -\log (1 - p)\right] ^{{\mathrm {\scriptscriptstyle T} }}\) as the “basis” to be used for model building.

To avoid strong parametric assumptions, it may be preferable to formulate a more flexible working model:

$$\begin{aligned} \beta _0(p \mid \varvec{\theta })= & {} \theta _{00} - \theta _{01}\log (1 - p) + \theta _{02}\sin {(\pi p)} + \theta _{03}\cos {(\pi p)},\\ \beta _1(p \mid \varvec{\theta })= & {} \theta _{10} + \theta _{14}p. \end{aligned}$$

In this model, the intercept is described by the quantile function of an Exponential distribution, \(-\log (1 - p)\), that determines the shape of the right tail, and a combination of trigonometric functions; while the coefficient associated with x is assumed to be linear. In matrix form, the model is defined by

$$\begin{aligned} Q_{T(Y^\circ )}(p \mid x, \varvec{\theta }) = \begin{bmatrix} 1&x \end{bmatrix} \begin{bmatrix} \theta _{00} &{} \theta _{01} &{} \theta _{02} &{} \theta _{03} &{} 0\\ \theta _{10} &{} 0 &{} 0 &{} 0 &{} \theta _{14} \end{bmatrix} \begin{bmatrix} 1\\ -\log (1 - p)\\ \sin {(\pi p)}\\ \cos {(\pi p)}\\ p \end{bmatrix}. \end{aligned}$$

Note that this quantile function does not correspond to any known, closed-form family of probability distributions.
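
The matrix form of the flexible working model translates directly into code. The following Python sketch evaluates \(\varvec{\beta }(p \mid \varvec{\theta }) = \varvec{\theta }\varvec{b}(p)\) and the corresponding quantile function for the five-element basis used above. The numerical values of \(\varvec{\theta }\) are hypothetical and serve only to illustrate the computation.

```python
import math


def basis(p):
    """b(p) = [1, -log(1 - p), sin(pi p), cos(pi p), p]."""
    return [1.0, -math.log(1.0 - p), math.sin(math.pi * p),
            math.cos(math.pi * p), p]


def coefficients(theta, p):
    """beta(p | theta) = theta b(p), with theta a q x k nested list."""
    b = basis(p)
    return [sum(t * bh for t, bh in zip(row, b)) for row in theta]


def quantile(x, theta, p):
    """Q(p | x, theta) = x' theta b(p)."""
    return sum(xj * bj for xj, bj in zip(x, coefficients(theta, p)))


# Hypothetical parameters; zeros mark basis functions excluded from a coefficient.
theta = [
    [1.0, 2.0, 0.3, -0.2, 0.0],   # intercept beta_0(p)
    [0.5, 0.0, 0.0, 0.0, 1.5],    # slope beta_1(p) = 0.5 + 1.5 p
]
x = [1.0, 2.0]                     # design vector [1, x] with x = 2

beta_median = coefficients(theta, 0.5)
q_median = quantile(x, theta, 0.5)
```

Fixing an entry of \(\varvec{\theta }\) at zero is how a basis function is excluded from a given coefficient, so the same basis can serve an unbounded intercept and a bounded slope.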

An alternative flexible parametrization is given by

$$\begin{aligned} \beta _0(p \mid \varvec{\theta })= & {} \theta _{00} + \theta _{01}p - \theta _{02}\log (1 - p) + \theta _{03}\log {p} , \\ \beta _1(p \mid \varvec{\theta })= & {} \theta _{10} + \theta _{11}p + \theta _{14}(p - 0.25)^3 + \theta _{15}(p - 0.75)^3. \end{aligned}$$

Here, the intercept is modeled by a quantile function that includes as special cases that of the (shifted) Exponential (\(\theta _{01} = \theta _{03} = 0\)), the asymmetric Logistic (\(\theta _{01} = 0\)), the standard Logistic (\(\theta _{01} = 0, \theta _{02} = \theta _{03}\)), and the Uniform (\(\theta _{02} = \theta _{03} = 0\)). The slope of x is a combination of linear and cubic functions. If \(x \ge 0\), then \(\varvec{\theta }\ge \varvec{0}\) is a sufficient condition for \(Q_{T(Y^\circ )}(p \mid x, \varvec{\theta })\) to be monotonically increasing. This makes it simple to control for quantile crossing, which is a common issue in standard quantile regression.
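
The monotonicity condition can be inspected numerically. The Python sketch below evaluates the quantile function of this parametrization on a grid of values of p and checks whether it is increasing. It is only a grid-based check with hypothetical parameter values, not a proof of monotonicity.

```python
import math


def basis(p):
    """b(p) = [1, p, -log(1 - p), log(p), (p - 0.25)^3, (p - 0.75)^3]."""
    return [1.0, p, -math.log(1.0 - p), math.log(p),
            (p - 0.25) ** 3, (p - 0.75) ** 3]


def quantile(x, theta, p):
    """Q(p | x, theta) = [1, x] theta b(p), with theta a 2 x 6 nested list."""
    b = basis(p)
    beta = [sum(t * bh for t, bh in zip(row, b)) for row in theta]
    return beta[0] + beta[1] * x


def is_increasing(x, theta, grid=999):
    """Check that Q is increasing over p = 1/(grid + 1), ..., grid/(grid + 1)."""
    values = [quantile(x, theta, j / (grid + 1)) for j in range(1, grid + 1)]
    return all(a < b for a, b in zip(values, values[1:]))


# Hypothetical non-negative parameters (zeros mark excluded basis functions).
theta_ok = [
    [2.0, 1.0, 0.5, 0.3, 0.0, 0.0],
    [0.4, 0.8, 0.0, 0.0, 1.0, 2.0],
]
# A negative linear term in the slope produces crossing when x is large enough.
theta_bad = [
    [2.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.4, -5.0, 0.0, 0.0, 0.0, 0.0],
]

ok_at_3 = is_increasing(3.0, theta_ok)
bad_at_3 = is_increasing(3.0, theta_bad)
```

Each element of the basis other than the constant is an increasing function of p, which is why non-negative parameters and non-negative covariates are sufficient for a non-decreasing quantile function.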

Possible choices of \(\varvec{b}(p)\) include polynomials \(\left[ p, p^2, p^3, \ldots \right]\), splines, piecewise linear functions, roots \([p^{1/2}\), \((1 - p)^{1/2}\), \(p^{1/3}\), \((1 - p)^{1/3}, \ldots ]\), logarithms \(\left[ \log (p), -\log (1 - p)\right]\), trigonometric functions \(\left[ \cos (\pi p), \sin (\pi p)\right]\), quantile functions of known distributions (e.g., that of a Beta or Gamma distribution), and combinations of the above. The intercept, \(\beta _0(p \mid \varvec{\theta })\), is more frequently modeled using unbounded functions, while the coefficients associated with \(\varvec{x}\) are usually assumed to be bounded and, in some situations, can be described by linear or even constant functions. A variety of modeling options will be used in the application described in Sect. 6.

As shown in the above examples, a common feature of parametric quantile functions is that the support of the response, which is identified by quantiles of order \(p = 0\) and \(p = 1\), may depend on estimated parameters. This is always the case, for example, if \(\varvec{\beta }(p \mid \varvec{\theta })\) is assumed to be polynomial. In this situation, maximum likelihood estimators do not meet regularity conditions and cannot be computed by standard algorithms, which explains why most methods involving some sort of parametric modeling (e.g., Reich and Smith 2013) use Bayesian inference. This does not represent an issue in the current framework, where the likelihood function is not used. The minimizer of the simultaneous loss function \(L(\varvec{\theta })\) defined in (4) always corresponds to an interior point.

4 Computation and inference

Define the model as in (7),

$$\begin{aligned} Q_{T(Y^\circ )}(p \mid \varvec{x}, \varvec{\theta }) = \varvec{x}^{\mathrm {\scriptscriptstyle T} }\varvec{\beta }(p \mid \varvec{\theta }), \end{aligned}$$

where \(Y^\circ = Y + 0.5\), and denote by \(y_i^\circ = y_i + 0.5\), \(i = 1, \ldots , n\), a vector of observed responses. We remark that, since Y is a count while \(Q_{T(Y^\circ )}(p \mid \varvec{x}, \varvec{\theta })\) describes a continuous response, the model does not directly reflect the true data-generating process. If an underlying continuous variable Z is invoked such that \(Y = \lfloor Z \rfloor\), where \(\lfloor a \rfloor\) denotes the floor operator, then the model may be assumed to correctly describe the quantile function of Z. In this case, however, \(Y^\circ = Y + 0.5\) should be considered interval-censored between \(Y^\circ - 0.5\) and \(Y^\circ + 0.5\). Here we avoid questioning whether \({\hat{\varvec{\theta }}}\) consistently estimates a true parameter \(\varvec{\theta }_0\), and treat \(Q_{T(Y^\circ )}\) as a working model.

As shown in (4), an estimate \({\hat{\varvec{\theta }}}\) of \(\varvec{\theta }\) is computed as the minimizer of

$$\begin{aligned} L(\varvec{\theta }) = \int _0^1{L(\varvec{\beta }(p \mid \varvec{\theta })) \mathrm {d}p} = \int _0^1{\sum _{i = 1}^{n}{(p - \omega _{p,i})(T(y_i^\circ ) - \varvec{x}^{{\mathrm {\scriptscriptstyle T} }}_i\varvec{\beta }(p \mid \varvec{\theta }))}\mathrm {d}p}. \end{aligned}$$

Numerical integration can be used to evaluate \(L(\varvec{\theta })\), and Newton-type algorithms can be applied to perform optimization. When \(\varvec{\beta }(p \mid \varvec{\theta }) = \varvec{\theta }\varvec{b}(p)\) as in model (9), the following expression can be obtained (Frumento and Bottai 2016):

$$\begin{aligned} L(\varvec{\theta }) = \sum _{i = 1}^{n}{T(y_i^\circ )(p_i - 0.5) + \varvec{x}_i^{\mathrm {\scriptscriptstyle T} }\varvec{\theta }\left[ {\bar{\varvec{B}}} - \varvec{B}(p_i)\right] }, \end{aligned}$$

where

$$\begin{aligned} \varvec{B}(p) = \int _0^p{\varvec{b}(u) \mathrm {d}u}, \qquad {\overline{\varvec{B}}} = \int _0^1{\varvec{B}(u) \mathrm {d}u}. \end{aligned}$$

In the above formulas, \(p_i = F(T(y_i^\circ ) \mid \varvec{x}_i, \varvec{\theta })\) corresponds to the cumulative distribution function of \(y^\circ _i\) evaluated at \(\varvec{\theta }\), and can be obtained as the inverse of \(Q_{T(Y^\circ )}\). In the implementation of the qrcm R package, \(\varvec{B}(p)\) and \({\bar{\varvec{B}}}\) are evaluated numerically, and a bisection algorithm is used to compute \(p_i\) at the current estimate of \(\varvec{\theta }\).
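
The closed-form expression of the loss can be compared with a direct numerical integration of objective (2). The Python sketch below does so in a deliberately simple setting, assuming no covariates, the identity transformation, and the basis \(\varvec{b}(p) = [1, p]^{\mathrm {\scriptscriptstyle T} }\), for which \(\varvec{B}(p)\) and \({\bar{\varvec{B}}}\) are available analytically. It is not the implementation used in the qrcm package, and all names and numerical values are hypothetical.

```python
def b(p):
    """Basis b(p) = [1, p]."""
    return [1.0, p]


def B(p):
    """B(p) = integral of b(u) over (0, p)."""
    return [p, p * p / 2.0]


B_BAR = [0.5, 1.0 / 6.0]  # integral of B(u) over (0, 1)


def Q(p, theta):
    """Working quantile function theta_0 + theta_1 p."""
    return sum(t * bh for t, bh in zip(theta, b(p)))


def cdf_by_bisection(t, theta, iters=60):
    """p_i with Q(p_i | theta) = t, clipped to [0, 1]; assumes Q increasing in p."""
    if t <= Q(0.0, theta):
        return 0.0
    if t >= Q(1.0, theta):
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if Q(mid, theta) < t:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def loss_closed_form(theta, t_values):
    """sum_i t_i (p_i - 0.5) + theta' [B_bar - B(p_i)]."""
    total = 0.0
    for t in t_values:
        p_i = cdf_by_bisection(t, theta)
        total += t * (p_i - 0.5)
        total += sum(th * (bb - bp) for th, bb, bp in zip(theta, B_BAR, B(p_i)))
    return total


def loss_by_quadrature(theta, t_values, grid=20000):
    """Midpoint-rule approximation of the integral over p of objective (2)."""
    total = 0.0
    for j in range(grid):
        p = (j + 0.5) / grid
        q = Q(p, theta)
        total += sum((p - (1.0 if t <= q else 0.0)) * (t - q) for t in t_values)
    return total / grid


theta = [0.2, 5.0]                          # hypothetical parameter values
t_values = [0.1, 0.5, 1.5, 2.5, 4.5, 6.0]   # hypothetical responses
closed = loss_closed_form(theta, t_values)
approx = loss_by_quadrature(theta, t_values)
```

The two values agree up to the quadrature error, including for observations that fall outside the range of the working quantile function, for which \(p_i\) is set to 0 or 1.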

Following the standard theory of M-estimators (e.g., Newey and McFadden 1994), an estimate of \(\text {cov}({\hat{\varvec{\theta }}})\) can be obtained as

$$\begin{aligned} {\widehat{\text{cov}}}({\hat{\varvec{\theta }}}) = {\hat{\varvec{H}}}^{-1}{\hat{\varvec{\Omega }}} {\hat{\varvec{H}}}^{-1}, \end{aligned}$$
(11)

where \({\hat{\varvec{H}}}\) is the matrix of second derivatives of \(L(\varvec{\theta })\), evaluated at \({\hat{\varvec{\theta }}}\), and \({\hat{\varvec{\Omega }}}\) is the outer product of the summands of the gradient. The exact expressions for \(\varvec{H}\) and \(\varvec{\Omega }\), of which \({\hat{\varvec{H}}}\) and \({\hat{\varvec{\Omega }}}\) are the sample counterparts, are provided in Frumento and Bottai (2016). Note that, unlike standard \(\textsc {qr}\), the asymptotic covariance matrix of the \(\textsc {qrcm}\) estimator does not involve nuisance parameters, which makes it unnecessary to estimate the sparsity function (e.g., Koenker and Machado 1999) or to use the bootstrap to perform inference.

If \(\varvec{\beta }(p \mid \varvec{\theta }) = \varvec{\theta }\varvec{b}(p)\) as in model (9), an estimate of \(\text {cov}({\hat{\varvec{\beta }}}(p \mid \varvec{\theta }))\) is easily obtained from \({\widehat{\text{cov}}}({\hat{\varvec{\theta }}})\) by using quadratic forms. Obviously, the described inferential procedures are imperfect, as they ignore the fact that \(Q_{T(Y^\circ )}\) is not the true quantile function. However, simulation results show that reliable estimates of the standard errors are obtained even with a relatively small sample size.
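
The quadratic-form computation mentioned above can be sketched as follows for a single coefficient of the form \(\beta (p \mid \varvec{\theta }) = \theta _0 + \theta _1p\), whose variance at a given p is \(\varvec{b}(p)^{\mathrm {\scriptscriptstyle T} }\,{\widehat{\text{cov}}}({\hat{\varvec{\theta }}})\,\varvec{b}(p)\). The estimates and the covariance matrix in this Python example are hypothetical numbers chosen for illustration.

```python
import math


def basis(p):
    """b(p) = [1, p], i.e. a coefficient of the form theta_0 + theta_1 p."""
    return [1.0, p]


def quad_form(b, V):
    """b' V b for a vector b and a square matrix V given as nested lists."""
    n = len(b)
    return sum(b[r] * V[r][s] * b[s] for r in range(n) for s in range(n))


# Hypothetical estimates and covariance matrix for the parameters of one coefficient.
theta_hat = [1.2, 0.8]
cov_theta = [[0.04, -0.01],
             [-0.01, 0.09]]

p = 0.5
beta_hat = sum(t * bh for t, bh in zip(theta_hat, basis(p)))
se_beta = math.sqrt(quad_form(basis(p), cov_theta))
```

Repeating the calculation over a grid of values of p gives pointwise standard errors for the whole coefficient function from a single covariance matrix.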

5 Simulation results

We considered a model of the form

$$\begin{aligned} Q(p \mid \varvec{x}) = \beta _0(p) + \beta _1(p)x_1 + \beta _2(p)x_2, \end{aligned}$$
(12)

where \(x_1\) was uniform between 0 and 3, and \(x_2\) was binary with \(P(x_2 = 1) = 0.5\). To generate discrete data, we first simulated a continuous variable Z from model (12); then, we defined \(Y = \lfloor Z \rfloor\). We considered two scenarios. In scenario 1 we defined

$$\begin{aligned} \beta _0(p) = -\log (1 - p), \quad \beta _1(p) = 2(1 + p), \quad \beta _2(p) = 2p^{1/2}. \end{aligned}$$

In scenario 2 we defined

$$\begin{aligned} \beta _0(p) = 10(1 - (1 - p)^{1/4}), \quad \beta _1(p) = 3, \quad \beta _2(p) = 5p^5. \end{aligned}$$

These scenarios were regarded as “true” models, although they describe the conditional quantile function of the unobserved response Z, and not that of the observed count Y. Both scenarios generate relatively small counts, rarely exceeding \(Y = 15\).
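
A data-generating mechanism of this kind can be sketched in Python for scenario 1. The sketch assumes that the continuous variable Z is drawn by evaluating the quantile function at a uniform random order p, which is valid here because the quantile function is increasing in p for non-negative covariates. The sampling scheme and function names are assumptions made for illustration.

```python
import math
import random


def q_scenario1(p, x1, x2):
    """Quantile function of Z in scenario 1 of model (12)."""
    beta0 = -math.log(1.0 - p)
    beta1 = 2.0 * (1.0 + p)
    beta2 = 2.0 * math.sqrt(p)
    return beta0 + beta1 * x1 + beta2 * x2


def simulate(n, rng):
    """Draw (x1, x2, y) triples with y = floor(z) and z = Q(u | x), u ~ U[0, 1)."""
    data = []
    for _ in range(n):
        x1 = rng.uniform(0.0, 3.0)            # continuous covariate on [0, 3]
        x2 = 1 if rng.random() < 0.5 else 0   # binary covariate, P(x2 = 1) = 0.5
        u = rng.random()
        z = q_scenario1(u, x1, x2)
        data.append((x1, x2, math.floor(z)))
    return data


rng = random.Random(2024)
sample = simulate(300, rng)
```

The estimators under comparison would then be fitted to the counts in `sample`, using the jittered response for \(\textsc {qr}\) and the shifted response for \(\textsc {qrcm}\).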

For each scenario, we simulated \(R = 1000\) datasets of size \(n = 300\). For each simulated dataset we computed: (1) the \(\textsc {qr}\) estimators of \(\varvec{\beta }(p)\), using average-jittering with \(m = 100\) replicates; (2) the \(\textsc {qrcm}\) estimators, defining \(\varvec{\beta }(p \mid \varvec{\theta })\) as in the “true” model; and (3) the \(\textsc {qrcm}\) estimators, defining \(\varvec{\beta }(p \mid \varvec{\theta })\) to be a third-degree shifted Legendre polynomial (e.g., El Attar 2009), an orthogonal polynomial in (0, 1) that can be used to formulate flexible models for the coefficient functions. \(\textsc {qr}\) estimators were applied to the response variable \(Y^*= Y + U\) with \(U \sim U[0,1)\), while \(\textsc {qrcm}\) estimators were applied to \(Y^\circ = Y + 0.5\).

In Tables 2 and 3, we report the average estimates of \(\varvec{\beta }(p)\) at different values of p, and the corresponding standard errors. When the “true” model was fitted, \(\textsc {qrcm}\) estimators were much more efficient than standard \(\textsc {qr}\), and their average was closer to the “true” parameters. When a flexible model was used instead, the performance of \(\textsc {qrcm}\) estimators was more similar to that of average-jittering \(\textsc {qr}\), although some relevant efficiency gains were observed in the right tails where the data were more sparse. This suggests that describing \(\varvec{\beta }(p \mid \varvec{\theta })\) by a parsimonious model can substantially improve on standard \(\textsc {qr}\) estimators, while overfitting tends to nullify the gain.

In the last two columns of each table, only for \(\textsc {qrcm}\) estimators, we report the average estimated standard errors computed using the asymptotic covariance matrix defined by Eq. (11). Results suggest that inferential procedures are reliable, although the working model is only an approximation of the true data-generating process.

Table 2 Simulation results (1)
Table 3 Simulation results (2)

6 Analysis of NMES data

To illustrate the usefulness of the proposed method, we analyzed data from the US National Medical Expenditure Survey (nmes) conducted in 1987 and 1988 (Deb and Trivedi 1997; Kleiber and Zeileis 2008). The dataset includes a representative sample of \(n = 4406\) US civilians aged \(\ge 66\), for whom numerous socio-economic indicators (age, gender, marital status, education, income) and indicators of health condition are available.

The goal of our analysis was to predict the number of doctor’s visits during a one-year period. The response variable had a skewed distribution with a very long right tail (Fig. 2). Higher-order quantiles, which correspond to very frequent doctor’s visits, were considered particularly important. We formulated the following quantile regression model:

$$\begin{aligned} Q(p \mid \varvec{x})= & {} \beta _0(p) + \beta _1(p)I(\text {Health = poor}) + \beta _2(p)I(\text {Health = excellent})\\&+\,\beta _3(p)\text {Nchronic} + \beta _4(p)\text {Male} + \beta _5(p)(\text {Age} - 73)/10 \\&+\,\beta _6(p)(\text {School} - 12) + \beta _7(p)\text {Married} + \beta _8(p)\text {Employed} \\&+\,\beta _9(p)(\text {Income} - 1.7) + \beta _{10}(p)\text {Insurance} + \beta _{11}(p)\text {MedicAid}, \end{aligned}$$

where “Health” is a self-perceived health status, with levels “poor”, “average”, and “excellent”; “Nchronic” is the number of chronic diseases; “Male” is an indicator of male gender; age was expressed in decades, and centered at its median; “School” is the number of years of education, and was centered at its modal value of 12 years; “Married” is an indicator of marital status; “Employed” is an indicator of employment (about \(90\%\) of subjects were retired); the household income (in tens of thousands of US dollars) was centered at its median. All individuals in the sample were covered by Medicare, a public insurance program that offers protection against health-related costs. “Insurance” is an indicator of whether the subject was also covered by private insurance; and “MedicAid” indicates whether he or she was covered by MedicAid, a federal program complementary to Medicare.
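As a rough illustration of the covariate coding just described, the following Python sketch builds the design row \(\varvec{x}\) for one individual. The helper `design_row` and its argument names are hypothetical and are not part of any package; the centering constants (73, 12, and 1.7) are those appearing in the model equation.

```python
def design_row(health, nchronic, male, age, school, married,
               employed, income, insurance, medicaid):
    """Return [1, x_1, ..., x_11], ordered as beta_0(p), ..., beta_11(p).

    Hypothetical helper: `health` is "poor", "average" or "excellent";
    `income` is in tens of thousands of US dollars; indicators are 0/1.
    """
    if health not in ("poor", "average", "excellent"):
        raise ValueError("unknown health level: %r" % (health,))
    return [
        1.0,                                     # intercept
        1.0 if health == "poor" else 0.0,        # I(Health = poor)
        1.0 if health == "excellent" else 0.0,   # I(Health = excellent)
        float(nchronic),                         # number of chronic diseases
        float(male),
        (age - 73) / 10,                         # decades, centered at 73
        float(school - 12),                      # years, centered at 12
        float(married),
        float(employed),
        income - 1.7,                            # centered at 1.7
        float(insurance),
        float(medicaid),
    ]


# Average health, one chronic condition, male, age 73, 12 school years,
# married, not employed, income 1.7, private insurance, no MedicAid.
typical = design_row("average", 1, 1, 73, 12, 1, 0, 1.7, 1, 0)
```

With this coding, "average" health is the reference level, and the intercept \(\beta _0(p)\) refers to an individual for whom all other entries of the row are zero.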

Considering that most predictors were binary, and that the association between the number of doctor’s visits and the non-binary covariates appeared to be well approximated by a straight line, we decided not to transform the response variable.

We first estimated a grid of percentiles, \(p = \{0.01, 0.02, \ldots , 0.99\}\), by using ordinary quantile regression with average-jittering (\(m = 100\)). We used bootstrap (\(R = 100\)) to estimate the standard errors. Note that this procedure is very time-consuming, as it requires estimating \(99 \times 100 \times 100 = 990,000\) quantile regression models. In Figs. 3 and 4, the estimated coefficient functions are represented by broken dashed lines.
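The computational burden of this procedure can be checked with a few lines of Python. This is only a count of model fits under the settings stated above, not an implementation of jittering; the variable names are illustrative.

```python
# Percentile grid p = 0.01, 0.02, ..., 0.99
grid = [k / 100 for k in range(1, 100)]

m = 100  # jittered samples averaged for each estimate
R = 100  # bootstrap replicates used for the standard errors

# One quantile regression fit per (percentile, jittered sample, replicate)
n_fits = len(grid) * m * R
print(n_fits)  # 990000
```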

We then formulated a variety of parametric models to be applied to \(Y^\circ = Y + 0.5\). We considered the following alternative parametrizations:

$$\begin{aligned} \beta _j(p \mid \varvec{\theta })&= \theta _{0j} + \theta _{1j}p + \theta _{2j}p^2 + \theta _{3j}\log \{1 - p^{\delta }\}, \end{aligned}$$
(i)
$$\begin{aligned} \beta _j(p \mid \varvec{\theta })&= \theta _{0j} + \theta _{1j}p + \theta _{2j}p^2 + \theta _{3j}(1 - p)^{\delta }, \end{aligned}$$
(ii)
$$\begin{aligned} \beta _j(p \mid \varvec{\theta })&= \theta _{0j} + \theta _{1j}p + \theta _{2j}p^2 + \theta _{3j}\exp \{p^{\delta }\}, \end{aligned}$$
(iii)

\(j = 0, \ldots , 11\). All models were formed by the combination of a 2nd-degree polynomial, and a function that could be used to describe a long right tail. Other flexible models could be defined using splines, trigonometric functions, or piecewise-linear functions. We considered several values of \(\delta\), namely \(\delta = \{0.5,1,2\}\) in model (i), \(\delta = \{0.05,0.10,0.25\}\) in model (ii), and \(\delta = \{1,5,10\}\) in model (iii). We allowed the parametric form of \(\beta _0(p \mid \varvec{\theta })\) to differ from that of the other coefficients. For example, we could use model (i) for \(\beta _0(p \mid \varvec{\theta })\), and model (ii) for \(\beta _1(p \mid \varvec{\theta }), \beta _2(p \mid \varvec{\theta }), \ldots\).
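To make the three parametrizations concrete, the sketch below evaluates \(\beta _j(p \mid \varvec{\theta })\) under models (i)–(iii) for a given \(\delta\). The function names and the numerical values of `theta` are hypothetical and chosen only for illustration; they are not estimates from the data.

```python
import math


def beta_i(p, theta, delta):
    """Model (i): polynomial plus theta3 * log(1 - p^delta), 0 <= p < 1."""
    t0, t1, t2, t3 = theta
    return t0 + t1 * p + t2 * p**2 + t3 * math.log(1 - p**delta)


def beta_ii(p, theta, delta):
    """Model (ii): polynomial plus theta3 * (1 - p)^delta."""
    t0, t1, t2, t3 = theta
    return t0 + t1 * p + t2 * p**2 + t3 * (1 - p) ** delta


def beta_iii(p, theta, delta):
    """Model (iii): polynomial plus theta3 * exp(p^delta)."""
    t0, t1, t2, t3 = theta
    return t0 + t1 * p + t2 * p**2 + t3 * math.exp(p**delta)


# Values of delta considered for each model in the text
delta_grid = {
    "i": (0.5, 1, 2),
    "ii": (0.05, 0.10, 0.25),
    "iii": (1, 5, 10),
}

theta = (1.0, 2.0, 3.0, 4.0)  # hypothetical (theta0, theta1, theta2, theta3)
value = beta_i(0.5, theta, 2)  # 2.75 + 4 * log(0.75)
```

Only the last term differs across the three models; in model (i) it diverges as \(p \rightarrow 1\), while in models (ii) and (iii) it remains bounded on \([0, 1]\).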

Since all models had the same number of parameters, we selected the “best” model based on the minimized loss function. Note, however, that information criteria such as aic and bic can also be used for model selection. For general results on information criteria for M-estimators, see Machado (1993); a discussion of criteria for quantile regression models can be found in Lee et al. (2014).

The optimal model had all coefficients parametrized as in (i), with \(\delta = 2\):

$$\begin{aligned} \beta _j(p \mid \varvec{\theta }) = \theta _{0j} + \theta _{1j}p + \theta _{2j}p^2 + \theta _{3j}\log \{1 - p^2\}, j = 0, \ldots , 11. \end{aligned}$$
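The selected model is an instance of the linear parametrization \(\varvec{\beta }(p \mid \varvec{\theta }) = \varvec{\theta }\varvec{b}(p)\) adopted in the paper. One way to arrange the parameters, assuming that each row of \(\varvec{\theta }\) collects the parameters of one coefficient function, is

$$\begin{aligned} \varvec{\theta } = \begin{bmatrix} \theta _{0,0} & \theta _{1,0} & \theta _{2,0} & \theta _{3,0} \\ \vdots & \vdots & \vdots & \vdots \\ \theta _{0,11} & \theta _{1,11} & \theta _{2,11} & \theta _{3,11} \end{bmatrix}, \qquad \varvec{b}(p) = \left[ 1, p, p^2, \log \{1 - p^2\}\right] ^{\mathrm {\scriptscriptstyle T} }, \end{aligned}$$

so that \(\varvec{\theta }\) is a \(12 \times 4\) matrix, for a total of 48 parameters, and the row of \(\varvec{\theta }\varvec{b}(p)\) with index \(j\) is \(\beta _j(p \mid \varvec{\theta })\). Because \(\log \{1 - p^2\} \rightarrow -\infty\) as \(p \rightarrow 1\), any coefficient with \(\theta _{3j} \ne 0\) is unbounded in the right tail, which is what allows the model to describe a long right tail.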

The estimated coefficient functions are represented by continuous lines in Figs. 3 and 4. The model parameters are summarized in Table 4, while selected percentiles are reported in Table 5.

Results showed that more frequent doctor’s visits were associated with poorer health conditions, female gender (at least at quantiles below 0.75), higher education, and having additional private or public insurance coverage. Age, marital status, and economic indicators did not appear to have a clear association with the response. Importantly, predictors affected the large quantiles much more than the low quantiles of the distribution. For example, the regression coefficient associated with poor health was less than 2 at the median, greater than 3 at the 75th percentile, and greater than 6 at the 95th percentile. The quantile functions of two representative individuals, computed using Eqs. (6) and (8), are shown in Fig. 5.

The estimates obtained using the described \(\textsc {qrcm}\) approach were very close to those of ordinary quantile regression with average-jittering, suggesting that the two methods yield virtually equivalent point estimates. Using \(\textsc {qrcm}\), however, resulted in a much faster computation and did not require using bootstrap to compute standard errors. Additionally, using a parametric model made it possible to describe the coefficient functions by means of simple mathematical equations, improving the efficiency of the estimators and making it easier to summarize and interpret the results.

Fig. 2 Distribution of the number of doctor’s visits in the nmes dataset (\(n = 4406\))

Fig. 3 Estimated quantile regression coefficient functions. The dashed lines are obtained using standard quantile regression with average-jittering. The continuous lines (with pointwise confidence intervals represented by shaded areas) are based on the parametric model summarized in Table 4. For better readability, all figures are truncated at \(p = 0.97\)

Fig. 4 Estimated quantile regression coefficient functions (continued from Fig. 3)

Table 4 Estimated model parameters
Table 5 Estimated quantile regression coefficients
Fig. 5 Estimated quantile function of the number of doctor’s visits, obtained using quantile regression with average-jittering (qr) and quantile regression coefficients modeling (qrcm). Left: quantile function of the “typical” individual: average perceived health, one chronic condition, male, median age, 12 school years, married, not employed, median income, with a private insurance, no MedicAid. Right: quantile function of a “disadvantaged” individual: poor health, three chronic conditions, male, age = 80, 6 school years, not married, not employed, 1st decile of income, no private insurance, no MedicAid

7 Conclusions

We showed how the \(\textsc {qrcm}\) paradigm can be applied to count data, using a working model in which the assumed quantile function describes a continuous response. This, in combination with the smoothness of the objective function, avoids using jittering, generates efficient estimators, and simplifies inference.

Unlike other forms of model-based quantile regression, in which estimation is carried out in a Bayesian framework, the proposed approach adopts the frequentist paradigm. This is only possible because the minimizers of the objective function \(L(\varvec{\theta })\), unlike those of the likelihood function, correspond to interior points whether or not the model parameters affect the support of the response. This not only avoids the problem of selecting prior distributions, but may also be considered an advantage in fields, like medicine and epidemiology, where Bayesian techniques are used only sparingly.

In the paper, we only considered a linear quantile regression model of the form \(Q_{T(Y)}(p \mid \varvec{x}, \varvec{\theta }) = \varvec{x}^{\mathrm {\scriptscriptstyle T} }\varvec{\beta }(p \mid \varvec{\theta })\), and we used a linear parametrization \(\varvec{\beta }(p \mid \varvec{\theta }) = \varvec{\theta }\varvec{b}(p)\) to describe the regression coefficients. These assumptions could be relaxed, for example by allowing \(Q_{T(Y)}(p \mid \varvec{x}, \varvec{\theta })\) to be a nonlinear function of \(\varvec{x}\) or \(\varvec{\beta }\), or by assuming that \(\varvec{\beta }(p \mid \varvec{\theta })\) is a nonlinear function of \(\varvec{\theta }\). This would not affect the estimation method, but would make computation much more complicated without necessarily representing an advantage in terms of model flexibility. Describing a variety of meaningful parametric quantile functions can be the subject of future research.

Although empirical evidence suggests that parametric models are relatively immune to quantile crossing, the monotonicity of the estimated quantile function is not generally guaranteed. Special parametrizations (e.g., Reich and Smith 2013; Yang and Tokdar 2017; Das and Ghosal 2017) can be used to avoid crossing. Alternatively, \(L(\varvec{\theta })\) could be minimized subject to monotonicity constraints. This represents an important subject for future work, and a challenge from a computational standpoint.
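A simple diagnostic for crossing, short of constrained estimation, is to evaluate the fitted quantile function on a grid of \(p\) and flag any decrease. The Python sketch below assumes the basis \(\varvec{b}(p) = [1, p, p^2, \log \{1 - p^2\}]\) of the selected model; the function names, the covariate vector, and the parameter values are hypothetical and serve only to show a monotone and a non-monotone case. A grid check can only detect crossing at the evaluated points and does not enforce monotonicity.

```python
import math


def quantile_fn(p, x, theta):
    """Working-scale Q(p | x) = x' theta b(p), b(p) = (1, p, p^2, log(1 - p^2))."""
    b = (1.0, p, p * p, math.log(1.0 - p * p))
    return sum(
        xj * sum(t * bk for t, bk in zip(row, b))
        for xj, row in zip(x, theta)
    )


def crossing_points(x, theta, grid):
    """Return the grid values p_k at which Q(p_k | x) < Q(p_{k-1} | x)."""
    q = [quantile_fn(p, x, theta) for p in grid]
    return [grid[k] for k in range(1, len(q)) if q[k] < q[k - 1]]


grid = [k / 100 for k in range(1, 100)]
x = [1.0, 1.0]  # intercept plus one hypothetical binary covariate

# Hypothetical parameters: one row of theta per coefficient function
theta_ok = [[2.0, 1.0, 0.0, -1.0], [0.5, 0.0, 0.0, -0.5]]
theta_bad = [[2.0, 1.0, 0.0, -1.0], [0.5, -5.0, 0.0, 0.0]]

no_crossing = crossing_points(x, theta_ok, grid)     # empty list
with_crossing = crossing_points(x, theta_bad, grid)  # non-empty list
```

In practice the check would be repeated over the observed covariate vectors, since crossing may occur for some values of \(\varvec{x}\) and not for others.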

Quantiles are often more interesting than simple measures of location and scale, such as the mean and the variance. For example, many applications aim to describe the tail behavior and the impact of extreme observations. Our proposal may promote the widespread adoption of quantile regression methods in the analysis of count data. Possible applications are found in medicine, epidemiology, life sciences in general, sociology, psychology, and economics.

Providing a user-friendly implementation of the described estimator is an important part of our work. All software used in this paper is implemented in the qrcm R package. The package includes a main function iqr that carries out model fitting, and a variety of auxiliary functions that permit extracting information from the fitted model, performing prediction and extrapolation, plotting the estimated regression coefficients, and obtaining goodness-of-fit measures. The R package is available at http://CRAN.R-project.org/package=qrcm, and from the authors upon request.