Statistics and Computing

Volume 24, Issue 3, pp 461–479

Linear quantile mixed models


DOI: 10.1007/s11222-013-9381-9

Cite this article as:
Geraci, M. & Bottai, M. Stat Comput (2014) 24: 461. doi:10.1007/s11222-013-9381-9

Abstract

Dependent data arise in many studies. Frequently adopted sampling designs, such as cluster, multilevel, spatial, and repeated measures, may induce this dependence, which the analysis of the data needs to take into account. In a previous publication (Geraci and Bottai in Biostatistics 8:140–154, 2007), we proposed a conditional quantile regression model for continuous responses where subject-specific random intercepts were included to account for within-subject dependence in the context of longitudinal data analysis. The approach hinged upon the link existing between the minimization of weighted absolute deviations, typically used in quantile regression, and the maximization of a Laplace likelihood. Here, we consider an extension of those models to more complex dependence structures in the data, which are modeled by including multiple random effects in the linear conditional quantile functions. We also discuss estimation strategies to reduce the computational burden and inefficiency associated with the Monte Carlo EM algorithm we have proposed previously. In particular, the estimation of the fixed regression coefficients and of the random effects’ covariance matrix is based on a combination of Gaussian quadrature approximations and non-smooth optimization algorithms. Finally, a simulation study and a number of applications of our models are presented.

Keywords

Best linear predictor · Clarke’s derivative · Hierarchical models · Gaussian quadrature

1 Introduction

Conditional quantile regression (QR) pertains to the estimation of unknown quantiles of an outcome as a function of a set of covariates and a vector of fixed regression coefficients. Generally, QR estimation makes no assumption on the shape of the distribution of the outcome (Boscovich 1757; Wagner 1959; Barrodale and Roberts 1978; Bassett and Koenker 1978; Koenker and Bassett 1978). Its ability to provide a thorough description of the distributional effects has contributed to making QR attractive in several fields. See for example Yu et al. (2003) and Koenker (2005) for an overview of recent applications.

A number of sampling designs such as multilevel, longitudinal and cluster sampling typically require the application of statistical methods that allow for the correlation between observations that belong to the same unit or cluster. Mixed effects models, also known as multilevel or hierarchical or random-effects models, represent highly popular and flexible models to analyze complex data. They model and estimate between-cluster variability by means of cluster-specific random effects. These, in turn, provide a modeling structure for estimating the intraclass correlation coefficient (ICC). For example, let Z be a block-diagonal matrix with diagonal blocks given by vectors of ones and I be an identity matrix of appropriate dimensions, and consider the two-level random intercept model
$$ y = \mu+ Zu + \varepsilon ,\qquad u \sim N \bigl(0,\psi^{2}_{u} I \bigr),\quad \varepsilon \sim N \bigl(0,\psi^{2} I \bigr), $$
(1)
with u and ε independent. In this model, the ratio of the variance of the random intercepts to the total variance defines the proportion of variance explained by clustering, namely \(\mathrm{ICC} = \psi^{2}_{u}/(\psi^{2}_{u}+\psi^{2})\). Inference can then be conducted at the population (marginal) or cluster (conditional) level. Indeed, this represents an advantage of random-effects models over strictly marginal models, for the latter cannot provide conditional inferences (Lee and Nelder 2004). In these models, both fixed and random effects are assumed to be purely location-shift effects.
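The ICC formula above can be checked empirically: in a two-level random intercept model, the correlation between two observations from the same cluster equals \(\psi^{2}_{u}/(\psi^{2}_{u}+\psi^{2})\). A minimal simulation sketch (not from the paper; the values ψu = 2, ψ = 1 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
M, psi_u, psi = 4000, 2.0, 1.0        # clusters, random-intercept SD, error SD

u = rng.normal(0.0, psi_u, size=M)    # cluster-specific random intercepts
y = u[:, None] + rng.normal(0.0, psi, size=(M, 2))   # two observations per cluster

icc = psi_u**2 / (psi_u**2 + psi**2)                 # theoretical ICC = 0.8
icc_hat = np.corrcoef(y[:, 0], y[:, 1])[0, 1]        # within-cluster pair correlation
print(round(icc, 2), round(icc_hat, 2))
```

With many clusters, the empirical within-cluster correlation approaches the theoretical ICC.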

In the last few years, the need for extending QR for independent data to clustered data has led to several and quite distinct approaches. These can be roughly classified into two groups: distribution-free and likelihood-based. The former includes fixed effects (Koenker 2004; Lamarche 2010; Galvao and Montes-Rojas 2010; Galvao 2011) and weighted (Lipsitz et al. 1997; Karlsson 2008; Fu and Wang 2012) approaches. The latter are mainly based on the asymmetric Laplace (AL) density (Geraci and Bottai 2007; Liu and Bottai 2009; Yuan and Yin 2010; Lee and Neocleous 2010; Farcomeni 2012) or other parametric distributions (Reich et al. 2010a).

These categories are by no means mutually exclusive. For example, penalty methods such as those proposed by Koenker (2004) might have a strict relationship with the asymmetric Laplace regression with double-exponential random effects as suggested by Geraci and Bottai (2007). Yet, this has not been fully explored—see Lamarche (2010) for a discussion on the asymptotics of L1 penalized fixed-intercept models. Nor is the classification exhaustive. In the approach by Canay (2011) location-shift fixed effects are eliminated by a transformation. Also, other approaches might involve modeling the moments of a distribution function using a parametric family (e.g., Rigby and Stasinopoulos 2005) and deriving the quantiles of the response by inversion. Yet, the spirit of a likelihood-based approach to QR is different from a probability model fitting exercise, where a given distribution is assumed to be the ‘true’ distribution.

In Sect. 2, we briefly review Geraci and Bottai’s AL-based approach (Geraci and Bottai 2007, hereafter GB) and we introduce a generalization of the model proposed therein. In Sect. 3, we describe an estimation process based on numerical integration and nonsmooth optimization. Inferential issues are also discussed. A simulation study is presented in Sect. 4. We then conclude with some applications (Sect. 5) and final remarks (Sect. 6). All computations were performed using the package lqmm (Geraci 2012) for the statistical programming environment R (R Development Core Team 2012).

2 Linear quantile mixed models

2.1 Random-effects models with asymmetric Laplace error

A continuous random variable w∈ℝ is said to follow an asymmetric Laplace density with parameters (μ,σ,τ), w∼AL(μ,σ,τ), if its density can be expressed as
$$ p(w|\mu, \sigma, \tau)=\frac{\tau(1-\tau)}{\sigma}\exp\biggl\{ -\frac{1}{\sigma} \rho_\tau(w-\mu) \biggr\}, $$
where −∞<μ<+∞ is the location parameter, σ>0 is the scale parameter, 0<τ<1 is the skew parameter, and ρτ(v)=v{τ−I(v<0)} is the loss function, with I(⋅) denoting the indicator function. Note that ρτ is the asymmetrically weighted L1 function used in quantile regression problems (Koenker and Bassett 1978). The AL distribution first appeared in Hinkley and Revankar (1977) under a different parameterization. See Yu and Zhang (2005) for more details on this distribution. In our case, the location parameter μ is of great interest as this is the τth quantile of w, i.e. \(\operatorname{Pr} (w \leq\mu ) = \tau\).
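The defining property \(\operatorname{Pr}(w \leq \mu) = \tau\) can be verified numerically from the density above. A minimal sketch (not part of the original; μ = 0, σ = 1, τ = 0.3 are illustrative choices):

```python
import numpy as np

def rho(v, tau):
    """Check loss rho_tau(v) = v * (tau - I(v < 0))."""
    return v * (tau - (v < 0))

def al_pdf(w, mu=0.0, sigma=1.0, tau=0.5):
    """Asymmetric Laplace density AL(mu, sigma, tau)."""
    return tau * (1.0 - tau) / sigma * np.exp(-rho(w - mu, tau) / sigma)

# Pr(w <= mu) should equal tau: integrate the density up to mu = 0.
tau = 0.3
w = np.linspace(-40.0, 0.0, 400001)
f = al_pdf(w, tau=tau)
mass_below = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(w))   # trapezoid rule
print(round(mass_below, 3))   # 0.3
```

The analytic integral over (−∞, μ] is exactly τ(1−τ)/(1−τ) = τ, which the quadrature reproduces to numerical precision.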
Let y∈ℝ be a response variable with unknown, continuous distribution function Fy. For a fixed skewness parameter τ, it is convenient to estimate the τth quantile of y, μ(τ), from the model
$$ y = \mu^{(\tau)} + \varepsilon ^{(\tau)}, $$
where ε(τ)∼AL(0,σ,τ). Note that this assumption is ancillary as we do not assume that Fy is AL.
For a random variable w composed of n independent variates wi with common skew and scale parameters, but generic location μ=(μ1,…,μn)′, wi∼AL(μi,σ,τ), i=1,…,n, we use the simplified notation
$$ p(w|\mu, \sigma, \tau)=\sigma_{n}(\tau)\exp\biggl\{-\frac {1}{\sigma} \rho_\tau(w-\mu) \biggr\}, $$
where σn(τ)=τn(1−τ)n/σn and \(\rho _{\tau}(w-\mu) = \sum_{i=1}^{n}\rho_{\tau}(w_{i}-\mu_{i} )\).
In GB, we proposed a random-intercepts QR model for longitudinal data using the AL to model the τth conditional quantile of a continuous response variable. In particular, we assumed the following regression function:
$$ Q_{y|u} (\tau|x,u )=X\beta^{(\tau)} + u, $$
where (y,X) represents longitudinal data, u a vector of subject-specific random effects and Qy|u denotes the inverse of the (unknown) distribution function Fy|u. The τth regression quantile of y|u was then estimated under the convenient assumption y|u∼AL(Xβ(τ)+u,σ(τ),τ), where the τ-dependent parameters β(τ) and σ(τ) have a frequentist interpretation. A Bayesian character to this model, however, has been attributed by others (Reich et al. 2010b; Li et al. 2010). The link between the L1 norm regression problem and the AL-based estimation of the coefficients β(τ) has been discussed by Koenker and Machado (1999) and Yu and Moyeed (2001). Here, we emphasize that the mode of y|u is Xβ(τ)+u, for any τ∈(0,1).

A model with random intercepts only cannot obviously account for between-clusters heterogeneity associated with given explanatory variables. This is the case, for example, of growth curves in which both intercepts and slopes of temporal trajectories differ by subject due to genetic and environmental effects. In the next section, we extend our model to include multiple random effects.

2.2 Model generalization

Consider clustered data in the form \((x'_{ij},z'_{ij},y_{ij})\), for j=1,…,ni and i=1,…,M, N=∑ini, where \(x'_{ij}\) is the jth row of a known ni×p matrix Xi, \(z'_{ij}\) is the jth row of a known ni×q matrix Zi and yij is the jth observation of the response variable in the ith cluster. Mixed effects models represent a common and well-known class of regression models used to analyze data coming from similar designs. A typical formulation of a linear mixed model (LMM) for clustered data is
$$ y_{ij} = x'_{ij}\beta+ z'_{ij}u_{i}+ \varepsilon _{ij},\quad j=1,\ldots, n_{i},\ i=1,\ldots, M, $$
where β and ui, i=1,…,M, are, respectively, fixed and random effects associated with p and q model covariates and y is assumed to follow a multivariate normal distribution characterized by some parameter θ. Within this framework, the target of the analysis is the mean of the response; whether this is the conditional mean, i.e. E(yi|ui), or the marginal mean, i.e. E(yi), depends on the purpose of the analysis. The likelihood of the LMM specified above is, from a modeling standpoint, conditional. The term marginal refers to the likelihood integrated with respect to u. Throughout the paper, we will maintain this distinction.

There are several reasons why an analyst might consider a random-effects approach for their data while wanting to go beyond a location-shift model. A quantile regression approach allows for departures from location-shift assumptions, typical of LMMs, in which covariates do not affect the shape of the distribution. Here, we follow the same approach as GB but we provide a framework within which modeling of the regression quantiles of clustered continuous outcomes is more general and estimation is computationally more efficient.

We introduce the convenient assumption that the \(y_{i} = (y_{i1},\ldots,y_{in_{i}})'\), i=1,…,M, conditionally on a q×1 vector of random effects ui, are independently distributed according to a joint AL with location and scale parameters given by \(\mu^{(\tau)}_{i} = X_{i}\theta^{(\tau)}_{x} + Z_{i}u_{i}\) and σ(τ), where \(\theta^{(\tau)}_{x}\in\mathbb{R}^{p}\) is a vector of unknown fixed effects. The skew parameter τ is set a priori and defines the quantile level to be estimated. Also, we assume that ui=(ui1,…,uiq)′, for i=1,…,M, is a random vector independent of the model’s error term and distributed according to p(ui|Ψ(τ)), where Ψ(τ) is a q×q covariance matrix. Note that all the parameters are τ-dependent. The random effects vector u depends on τ through Ψ(τ). We will omit the superscript τ when this is not a source of confusion. Moreover, we assume that the random effects are zero-median vectors. Other scenarios may include the case in which u are not zero-centered (see for example Koenker 2005, p. 281) and/or are not symmetric. A discussion on proposed distributions for the random effects is deferred to the next section.

If we let \(u = (u'_{1},\ldots,u'_{M})'\), \(y = (y'_{1},\ldots,y'_{M})'\), \(X= [X'_{1}|\ldots|X'_{M} ]'\), \(Z=\bigoplus_{i=1}^{M}Z_{i}\) and \(\mu^{(\tau )} = X\theta^{(\tau)}_{x} + Zu\), the joint density of (y,u) based on M clusters for the τth linear quantile mixed model (LQMM) is given by
$$ p \bigl(y,u|\theta_{x},\sigma,\varPsi^{(\tau)} \bigr) = \sigma_{N}(\tau)\exp \biggl\{-\frac{1}{\sigma}\rho_\tau \bigl(y-\mu^{(\tau)} \bigr) \biggr\}\, p \bigl(u|\varPsi^{(\tau)} \bigr). $$
(2)
Throughout the paper we will assume that \(\varPsi^{(\tau)} \in S^{q}_{++}\), where \(S^{q}_{++}\) is the set of real symmetric positive-definite q×q matrices. The likelihood in (2) can be equivalently seen as arising from the model
$$ y =\mu^{(\tau)} + \varepsilon ^{(\tau)}, $$
with independent and identically distributed errors \(\varepsilon _{ij}^{(\tau)} \sim\mathrm{AL} (0,\sigma, \tau)\).
The formulation of the model in (2) is general. For Zi=(1,…,1)′, i=1,…,M, the LQMM corresponds to the random-intercepts model developed by GB. We obtain the ith contribution to the marginal likelihood by integrating out the random effects, which leads to
$$ L_{i}(\theta_{x},\sigma, \varPsi|y_i)=\int_{R^{q}}p(y_{i},u_{i}| \theta_{x},\sigma,\varPsi) \, du_{i}, $$
(3)
where Rq denotes the q-dimensional Euclidean space. We denote the marginal log-likelihood by \(\ell_{i}(\theta_{x},\sigma,\varPsi|y_{i})=\log L_{i}(\theta_{x},\sigma,\varPsi|y_{i})\), i=1,…,M.

Clearly, since inference in model (2) is done separately for each τ, quantile crossing phenomena may arise within the convex hull of X, perhaps as a result of model misspecification (Koenker 2005, p. 55). As explained by Lum and Gelfand (2012), the stronger constraint of a joint quantile specification is replaced by the weaker stochastic ordering of the asymmetric Laplace. Ad-hoc solutions to avoid quantile-crossing have been proposed (He 1997; Zhao 2000).

3 Estimation

GB proposed to fit the random-intercepts QR model by using a Monte Carlo EM algorithm. This approach on the one hand avoids evaluating a multidimensional integral, but on the other hand, it can be computationally demanding. In a simulation study, Alhamzawi et al. (2011) compared Gibbs sampling with GB’s EM algorithm and with Yuan and Yin’s (2010) MCMC algorithm. In terms of relative bias, the three approaches were found to be analogous. However, the EM algorithm showed poorer computational efficiency. Modeling complex structures of the random effects increases the computational requirements. In this section, we will explore alternative strategies based on different optimization techniques.

Our goal is to evaluate the integral
$$ p(y_{i}|\theta_{x},\sigma,\varPsi)=\int_{\mathbb{R}^{q}}p(y_{i}|u_{i},\theta_{x},\sigma)\, p(u_{i}|\varPsi) \, du_{i}. $$
(4)
The marginal density in (4) can be characterized by using Theorem 6 by Prékopa (1973).

Theorem 1

(Prékopa 1973)

Let f(x,y) be a function of p+q variables, where x is a p-component and y is a q-component vector. Suppose that f is logarithmic concave in ℝp+q and let A be a convex subset of ℝq. Then the function of the variable x
$$ g (x ) = \int_{A}f(x,y) \, dy $$
is logarithmic concave in the entire space ℝp.

If p(ui|Ψ) is log-concave in ui, the integrand in (4) will be log-concave in (yi,ui) and Prékopa’s theorem will apply to p(yi|θx,σ,Ψ). As we shall see, the distributions of the random effects considered in this study are log-concave. The joint log-likelihood, however, is not a concave function of the scale and variance parameters, for which a reparameterization is required in order to apply the above results.

3.1 Numerical integration

The integral in (4) has the form \(\int_{\mathbb{R}^{q}}f(u)w(u)\prod_{l=1}^{q}du_{l}\). By choosing a suitable weighting function or kernel w(u) and upon a change of the integration variable where necessary, the q-dimensional Gaussian quadrature formula, based on q successive applications of simple one-dimensional rules, provides the approximation
$$ \int_{\mathbb{R}^q}f(u)w(u)\,du\approx\sum _{k_1=1}^{K}\cdots\sum_{k_q=1}^{K}f(v_{k_1,\ldots,k_q}) \prod_{l=1}^{q}w_{k_{l}}, $$
where K is a given integer. The abscissas \(v_{k_{1},\ldots,k_{q}}=(v_{k_{1}},\allowbreak\ldots,v_{k_{q}})'\) and the weights \(w_{k_{l}}\), kl=1,…,K, l=1,…,q, are chosen so that the approximation is exact if f(u) is a polynomial of a given total order. More precisely, the product rule defined above would be exact for a tensor product of univariate polynomials.

For this reason, the product rule entails a ‘curse of dimensionality’, an exponential increase of the number of evaluations of the integrand function. For example, we would need 3,200,000 function evaluations for a 5-dimensional quadrature rule based on 20 nodes, and 64,000,000 after adding only one random effect, let alone the total number of evaluations needed for convergence when the estimation algorithm is iterative (as will be seen to be the case). Fortunately, satisfactory accuracy is typically achieved with fewer than 20 nodes.
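The tensor-product rule and its Kq cost can be sketched concisely. The example below (not from the paper) applies a product Gauss–Hermite rule to a standard normal kernel; the test function f(u) = exp(u1 + u2) is an illustrative choice whose exact expectation under N(0, I2) is e:

```python
import numpy as np
from itertools import product

def gh_expectation(f, q, K):
    """E[f(u)] for u ~ N(0, I_q) via a K^q tensor-product Gauss-Hermite rule."""
    x, w = np.polynomial.hermite.hermgauss(K)   # 1-d nodes/weights for kernel e^{-x^2}
    nodes = np.sqrt(2.0) * x                    # change of variable to N(0, 1)
    weights = w / np.sqrt(np.pi)
    total, n_eval = 0.0, 0
    for idx in product(range(K), repeat=q):     # all K^q node combinations
        v = nodes[list(idx)]
        total += f(v) * np.prod(weights[list(idx)])
        n_eval += 1
    return total, n_eval

val, n_eval = gh_expectation(lambda u: np.exp(u.sum()), q=2, K=10)
print(round(val, 5), n_eval)   # ~ e = 2.71828, with 10^2 = 100 evaluations
```

The evaluation count grows as Kq, which is the curse of dimensionality described above; with q = 5 and K = 20 it is already 20^5 = 3,200,000.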

The choice of an appropriate distribution for the random effects is not straightforward. As recognized by GB, robustness issues apply to the distribution of y|u as well as of u. As a robust alternative to the Gaussian choice, we suggested the use of the symmetric Laplace. This choice led to a regression model which, after a rescaling (Geraci and Bottai 2007, p. 146), was similar to the penalized model proposed by Koenker (2004). As noted by Koenker (2004), using L1 penalties in fixed-effect models is convenient as this produces an elegant form of the penalized objective function apt to be solved via linear programming algorithms. See also Koenker and Mizera (2004) for quantile bivariate smoothing using L1 penalties.

In the following, we will focus explicitly on two types of distributions, namely Gaussian and Laplacian. It is immediate to verify that these assumptions correspond to applying, respectively, a Gauss–Hermite and a Gauss–Laguerre quadrature to the integral in (4). More general considerations can be made with regard to the use of symmetric as well as asymmetric kernels belonging to the exponential family.

3.1.1 Normal random effects

Under the assumption of normal random effects, the approximation of the integral in (4) by Gauss–Hermite quadrature is, bar a proportionality constant,
$$ p(y_{i}|\theta_{x},\sigma,\varPsi) \approx \sum_{k_1=1}^{K}\cdots\sum_{k_q=1}^{K} \sigma_{n_i}(\tau)\exp \biggl\{-\frac{1}{\sigma}\rho_\tau \bigl(y_{i} - X_{i}\theta_{x} - Z_{i}\varPsi^{1/2}v_{k_1,\ldots,k_q} \bigr) \biggr\} \prod_{l=1}^{q}w_{k_{l}}, $$
with nodes \(v_{k_{1},\ldots,k_{q}}=(v_{k_{1}},\ldots,v_{k_{q}})'\) and weights \(w_{k_{l}}\), l=1,…,q.
The (marginal) log-likelihood for all clusters is approximated by
$$ \ell_{\mathrm{app}}(\theta_{x},\sigma,\varPsi) = \sum_{i=1}^{M}\log \Biggl[\sum_{k_1=1}^{K}\cdots\sum_{k_q=1}^{K} \sigma_{n_i}(\tau)\exp \biggl\{-\frac{1}{\sigma}\rho_\tau \bigl(y_{i} - X_{i}\theta_{x} - Z_{i}\varPsi^{1/2}v_{k_1,\ldots,k_q} \bigr) \biggr\} \prod_{l=1}^{q}w_{k_{l}} \Biggr]. $$
(5)

The integral above can be recognized as a normal-Laplace convolution (Reed 2006). This type of distribution is known in a special form in meta-analysis (Demidenko 2004).
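For the random-intercept case (q = 1), the quadrature can be compared directly with brute-force integration of the normal–Laplace convolution. A sketch (not from the paper; the data, τ = 0.25 and unit scale parameters are illustrative):

```python
import numpy as np

def rho(v, tau):
    return v * (tau - (v < 0))

def al_pdf(w, sigma, tau):
    return tau * (1.0 - tau) / sigma * np.exp(-rho(w, tau) / sigma)

def marginal_gh(y_i, theta, sigma, psi, tau, K=31):
    """Gauss-Hermite approximation of the random-intercept cluster likelihood
    L_i = int prod_j AL(y_ij - theta - u) N(u; 0, psi^2) du   (q = 1)."""
    x, w = np.polynomial.hermite.hermgauss(K)
    u_nodes = np.sqrt(2.0) * psi * x                 # nodes mapped to N(0, psi^2)
    return sum(wk / np.sqrt(np.pi) * np.prod(al_pdf(y_i - theta - uk, sigma, tau))
               for uk, wk in zip(u_nodes, w))

rng = np.random.default_rng(1)
y_i = rng.normal(size=5)                             # illustrative cluster data
L_gh = marginal_gh(y_i, theta=0.0, sigma=1.0, psi=1.0, tau=0.25)

# brute-force check: trapezoid rule on a fine grid
u = np.linspace(-10.0, 10.0, 20001)
g = np.array([np.prod(al_pdf(y_i - uk, 1.0, 0.25)) for uk in u])
g *= np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
L_num = np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(u))
rel_err = abs(L_gh - L_num) / L_num
print(f"relative error: {rel_err:.2e}")
```

The integrand has kinks at u = yij − θ, so convergence in K is slower than for smooth integrands, but a moderate number of nodes already gives a small relative error.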

3.1.2 Robust random effects

We consider independent random effects under the assumption of a symmetric Laplace distribution, therefore \(\varPsi=\bigoplus_{l = 1}^{q}\psi_{l}\). Since each one-dimensional integral in (4) can be split around zero, the Gauss–Laguerre quadrature can be applied to the interval [0,∞) and, by symmetry, to the interval (−∞,0]. This results in the following approximation:
$$ p(y_{i}|\theta_{x},\sigma,\varPsi) \approx \sum_{k_1=1}^{K}\cdots\sum_{k_q=1}^{K} \sigma_{n_i}(\tau)\exp \biggl\{-\frac{1}{\sigma}\rho_\tau \bigl(y_{i} - X_{i}\theta_{x} - Z_{i}\varPsi v_{k_1,\ldots,k_q} \bigr) \biggr\} \prod_{l=1}^{q}w_{k_{l}}, $$
with nodes \(v_{k_{1},\ldots,k_{q}}=(v_{k_{1}},\ldots,v_{k_{q}})'\) and weights \(w_{k_{l}}\), l=1,…,q, opportunely chosen.
The log-likelihood for all clusters is approximated by
$$ \ell_{\mathrm{app}}(\theta_{x},\sigma,\varPsi) = \sum_{i=1}^{M}\log \Biggl[\sum_{k_1=1}^{K}\cdots\sum_{k_q=1}^{K} \sigma_{n_i}(\tau)\exp \biggl\{-\frac{1}{\sigma}\rho_\tau \bigl(y_{i} - X_{i}\theta_{x} - Z_{i}\varPsi v_{k_1,\ldots,k_q} \bigr) \biggr\} \prod_{l=1}^{q}w_{k_{l}} \Biggr]. $$
(6)
There are several papers on variations of the univariate Laplace distribution (Kozubowski and Nadarajah 2010). A valid multivariate extension of the asymmetric Laplace distribution that imposes a general covariance structure is not obvious. For example, Kotz et al. (2000) proposed a multivariate asymmetric Laplace distribution whose density in the n-variate symmetric case is given by
$$ p(y|\varSigma) = \frac{2}{(2\pi)^{n/2}|\varSigma|^{1/2}} \biggl(\frac{y'\varSigma^{-1}y}{2} \biggr)^{\lambda/2} K_{\lambda} \Bigl(\sqrt{2\, y'\varSigma^{-1}y} \Bigr), $$
(7)
with \(\operatorname{Cov}(y)=\varSigma\), λ=1−n/2 and Kλ(t) the modified Bessel function of the third kind. For n=1 and Σ=σ2, the density in (7) reduces to
$$ p(y|\sigma) = \frac{\sqrt{2}}{2\sigma}\exp\biggl(-\frac{\sqrt {2}}{\sigma}|y| \biggr), $$
that is \(y \sim \mathrm{AL} (0,\frac{\sigma}{2\sqrt{2}},\frac{1}{2} )\). It can be shown that for an n×n real matrix G, \(\operatorname{Cov}(Gy)=G\varSigma G'\) (Kotz et al. 2000).

The use of (7) for correlated random effects is uncertain. Even though it is easy to re-scale Ψ to a diagonal matrix, the joint density does not factorize into q AL variates (Eltoft et al. 2006) and, therefore, the q-dimensional quadrature cannot be based on q successive applications of one-dimensional rules. Therefore, at the moment, we do not consider any of the available proposals for a generalized Laplace distribution as suitable for our purposes. See for example Liu and Bottai (2009) for some results on the use of Kotz et al.’s (2000) multivariate Laplace distribution.

3.2 Nonsmooth optimization

The nondifferentiability of the loss function ρτ at points where \(y_{ij}-x'_{ij}\theta_{x}-z'_{ij}u = 0\) interferes with the standard theory of smooth optimization. Subgradient optimization and derivative-free optimization techniques (e.g., coordinate and pattern-search methods, modified Nelder–Mead methods, implicit filtering) have been developed to tackle nonstandard optimization problems. Here, we briefly study the Clarke’s derivative of the loss function as a starting point for a nonsmooth optimization approach. In his book, first published in 1983, Clarke (1990) developed a general theory of nonsmooth analysis that leads to a powerful and elegant approach to mathematical programming. Here, we focus our attention on theorems for Lipschitz functions, which play an important role in Clarke’s treatise.

We begin by characterizing the (approximated) marginal likelihood as a Lipschitz function. The covariance matrix of the random effects is reparameterized in terms of an m-dimensional vector of non-redundant parameters θz, i.e. Ψ(θz). Let us define \(\theta= (\theta'_{x},\theta'_{z} )'\) and consider the ith contribution to the likelihood in (5) or (6), rewritten compactly as
$$ \ell_{\mathrm{app},i}(\theta,\sigma) = \log\sigma_{n_i}(\tau) + \log\sum_{\mathbf{k}}\exp \biggl\{-\frac{1}{\sigma}\sum_{j}\rho_\tau \bigl(y_{ij}-\tilde{x}'_{ij,\mathbf{k}}\theta \bigr) \biggr\}\prod_{l=1}^{q}w_{k_{l}}, $$
(8)
where \(\tilde{X}_{i,\mathbf{k}} = [X_{i}| (v'_{\mathbf{k}}\otimes Z_{i} )T_{q} ]\) has row vectors \(\tilde{x}'_{ij,\mathbf{k}}\), vk is a q×1 vector of nodes and k=(k1,…,kq)′. Tq is a matrix of order q2×m so that \(\operatorname{vec} (\varPsi^{1/2} ) = T_{q}\theta_{z}\) (Gauss–Hermite) or \(\operatorname{vec} (\varPsi) = T_{q}\theta_{z}\) (Gauss–Laguerre). Since the likelihood function is strictly differentiable with respect to σ, we focus on θ alone for brevity.
It can be noted that ℓapp,i is a real-valued function of θ given by the composition g∘h, where \(h:\mathbb{R}^{p+m}\rightarrow\mathbb{R}^{n_{i}K^{q}}\) and \(g:\mathbb{R}^{n_{i}K^{q}}\rightarrow\mathbb{R}\). Each component function hij,k of h,
$$ h_{ij,\mathbf{k}}(\theta)=\rho_\tau\bigl(y_{ij}- \tilde{x}'_{ij,\mathbf{k}}\theta\bigr)/\sigma, $$
is Lipschitz near θ, so is
$$g \bigl(h(\theta) \bigr) = \log\sigma_{n_i}(\tau) + \log\sum _{\mathbf{k}}\exp\biggl\{-\sum_{j}h_{ij,\mathbf{k}}( \theta) \biggr\}\prod_{l=1}^{q}w_{k_{l}} $$
near h(θ). This implies that g∘h is Lipschitz near θ.
Then, we calculate the generalized gradient for Lipschitz functions (Clarke 1990). We have that (i) −g is regular at h(θ), (ii) each hij,k is regular at θ, and (iii) every niKq-dimensional element λ of ∂(−g)(h(θ)) has nonnegative components λr,h, for r=1,…,ni and h=(h1,…,hq)′, hl=1,…,K, l=1,…,q.
For the chain rule I (Clarke 1990, Theorem 2.3.9, p. 42), it follows that
$$ \partial (-\ell_{\mathrm{app},i} ) (\theta) \subseteq \overline{\mathrm{co}} \biggl\{\sum_{j,\mathbf{k}} \lambda_{j,\mathbf{k}}\,\xi_{ij,\mathbf{k}} : \lambda\in\partial(-g) \bigl(h(\theta) \bigr),\ \xi_{ij,\mathbf{k}}\in\partial h_{ij,\mathbf{k}}(\theta) \biggr\}, $$
where the summation is extended to all j’s and k’s and \(\overline{\mathrm{co}}\) denotes the weak*-closed convex hull. For the chain rule II (Clarke 1990, Theorem 2.3.10, p. 45)
$$ \xi_{ij,\mathbf{h}} \equiv\partial h_{ij,\mathbf{h}} (\theta) = \left\{ \begin{array}{l} -\frac{\tau}{\sigma} \tilde{x}_{ij,\mathbf{h}},\quad{}\text{if } y_{ij}-\tilde{x}'_{ij,\mathbf{h}}\theta> 0,\\[5pt] -\frac{(\tau-1)}{\sigma} \tilde{x}_{ij,\mathbf{h}},\quad{}\text{if } y_{ij}- \tilde{x}'_{ij,\mathbf{h}}\theta< 0,\\[5pt] -\frac{1}{\sigma} (\omega+ \tau- \frac{1}{2} ) \tilde{x}_{ij,\mathbf{h}}:|\omega|\leq\frac{1}{2},\\[5pt] \quad{} \text{if } y_{ij}-\tilde{x}'_{ij,\mathbf{h}}\theta= 0. \end{array} \right. $$
For estimation purposes, the contribution to the gradient of the zero-probability observation \(y_{ij}=\tilde{x}'_{ij,\mathbf{h}}\theta\) is set at \(-\frac{\tau}{\sigma} \tilde{x}_{ij,\mathbf{h}}\) for all i, j and h, that is \(\omega= \frac{1}{2}\).
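The generalized gradient above, with the convention ω = 1/2 at zero residuals, reduces to an ordinary gradient almost everywhere and can be checked against finite differences away from the kinks. A sketch (not from the paper; the data and parameter values are illustrative, and the cluster/node structure is collapsed into a single design matrix):

```python
import numpy as np

def loss_gradient(y, X, theta, sigma, tau):
    """An element of Clarke's generalized gradient of
    sum_j rho_tau(y_j - x_j' theta) / sigma, taking omega = 1/2 at zero residuals."""
    r = y - X @ theta
    a = np.where(r >= 0, tau, tau - 1.0)   # rho_tau'(r); ties (r == 0) assigned tau
    return -(X * a[:, None]).sum(axis=0) / sigma

# finite-difference sanity check away from the kinks (illustrative data)
rng = np.random.default_rng(2)
X, y = rng.normal(size=(20, 3)), rng.normal(size=20)
theta, tau, sigma = np.zeros(3), 0.25, 1.0
g = loss_gradient(y, X, theta, sigma, tau)

def obj(t):
    r = y - X @ t
    return np.sum(r * (tau - (r < 0))) / sigma

eps = 1e-6
fd = np.array([(obj(theta + eps * e) - obj(theta - eps * e)) / (2 * eps)
               for e in np.eye(3)])
print(np.allclose(g, fd, atol=1e-4))   # True when no residual sits exactly at zero
```

Since the objective is piecewise linear in θ, the finite-difference quotient matches the generalized gradient exactly whenever no residual changes sign within the step.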
If a local optimum is attained at \(\hat{\theta}\), solution to the minimization problem
$$ \min_{\theta} \Biggl\{-\sum_{i=1}^{M} \ell_{\mathrm{app},i} (\theta,\sigma)\ \Big|\ \theta_{x} \in \mathbb{R}^{p},\ \varPsi(\theta_{z}) \in S^{q}_{++} \Biggr\}, $$
(9)
then \(0\in\partial (-\sum_{i}\ell_{\mathrm{app},i} )(\hat{\theta})\) (Clarke 1990). In practice, the constraint \(\varPsi(\theta_{z}) \in S^{q}_{++}\) can be imposed a posteriori by calculating the nearest symmetric positive definite matrix (Higham 2002) after θz is estimated or, a priori, by a reparameterization of Ψ (Pinheiro and Bates 1996; Pourahmadi 1999). Equality and inequality constraints can be accommodated by the Lagrangian rule provided by Theorem 6.1.1 in Clarke (1990, p. 228).

Using the subgradient approach of Rockafellar (1970), Koenker (2005) gives optimality conditions of the quantile regression problem for independent data. Here, we consider the gradient search algorithm for Laplace likelihood originally developed by Bottai and Orsini (2012) for censored Laplace regression (Bottai and Zhang 2010), which works as follows.

From a current parameter value the algorithm searches the positive semi-line in the direction of the gradient for a new parameter value at which the likelihood is larger. The algorithm stops when the change in the likelihood is less than a specified tolerance. Convergence is guaranteed by the continuity and concavity of the log-likelihood (8). Let us denote the gradient of \(-\ell_{\mathrm{app}} \{\theta,\sigma \}\) in (9) with \(s (\theta^{k_{1}},\sigma^{0} )\). The minimization steps are:
  1. Set θ=θ0; δ=δ0; σ=σ0; k1=0.
  2. If \(\ell_{\mathrm{app}} \{\theta^{k_{1}} - \delta^{k_{1}}s (\theta^{k_{1}} ),\sigma^{0} \} \geq\ell_{\mathrm{app}} \{\theta ^{k_{1}},\sigma^{0} \}\),
     (a) then set \(\delta^{k_{1}+1} = a\delta^{k_{1}}\);
     (b) else if \(\ell_{\mathrm{app}} \{\theta^{k_{1}},\sigma^{0} \} - \ell_{\mathrm{app}} \{\theta^{k_{1}} - \delta^{k_{1}}s (\theta^{k_{1}} ),\sigma^{0} \} < \omega_{1}\),
        (i) then return \(\theta^{k_{1}+1}\); stop;
        (ii) else set \(\theta^{k_{1}+1} = \theta^{k_{1}} - \delta^{k_{1}} s (\theta^{k_{1}} )\); \(\delta^{k_{1}+1} = b\delta^{k_{1}}\).
  3. Set k1=k1+1; go to step 2.
The algorithm requires setting the starting values of the parameters θ0 and σ0, the initial step δ0>0, the contraction step factor a∈(0,1), the expansion step factor b≥1, and the tolerance ω1>0 for the change in ℓapp. Additionally, a check on the convergence of the parameter θ can be introduced in step (b) by verifying \(\max \mid \delta^{k_{1}}s (\theta^{k_{1}} )\mid< \omega_{2}\), where ω2>0 controls the tolerance. Least squares estimates can be used to initialize θx. Once the algorithm finds a solution for θ, the scale parameter σ can be estimated residually to obtain σ1, or a second iterative loop as above can be initialized by setting σ=σ1 and so forth until the change in the parameter is sufficiently small. Simulation results suggest that in most cases the number of iterations of the second loop is small (typically, fewer than four), as is the change in the estimate.
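The step 1–3 loop above can be sketched for the simplest case: no random effects, fixed σ, τ = 0.5 (median regression). This is a minimal illustration, not the lqmm implementation; data, starting values and step factors are arbitrary choices:

```python
import numpy as np

def rho(v, tau):
    return v * (tau - (v < 0))

def neg_ll(theta, y, X, tau):
    return np.sum(rho(y - X @ theta, tau))

def grad(theta, y, X, tau):
    r = y - X @ theta
    return -(X * np.where(r >= 0, tau, tau - 1.0)[:, None]).sum(axis=0)

def gradient_search(y, X, tau, theta0, delta0=1.0, a=0.5, b=1.2,
                    omega1=1e-8, max_iter=20000):
    """Gradient search: step against the gradient of the negative log-likelihood,
    contract the step on failure (factor a), expand on success (factor b),
    stop when the gain falls below omega1."""
    theta, delta = theta0.astype(float), delta0
    f = neg_ll(theta, y, X, tau)
    for _ in range(max_iter):
        s = grad(theta, y, X, tau)
        f_new = neg_ll(theta - delta * s, y, X, tau)
        if f_new >= f:
            delta *= a                  # 2(a): no improvement, contract the step
        elif f - f_new < omega1:
            break                       # 2(b)(i): negligible improvement, stop
        else:                           # 2(b)(ii): accept the step, expand
            theta, f, delta = theta - delta * s, f_new, delta * b
    return theta

rng = np.random.default_rng(3)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
theta_hat = gradient_search(y, X, tau=0.5, theta0=np.zeros(2))
print(theta_hat)   # close to the true coefficients (1, 2)
```

With symmetric noise, the median regression recovers the generating coefficients up to sampling error.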

3.3 Interpretation of parameters

Consider the random intercept model introduced in (1). Its marginal distribution is given by \(y \sim N (\mu,\allowbreak \psi^{2}_{u} ZZ' + \psi^{2}I )\). The parameters of such model have a straightforward interpretation: μ is the mean effect at the population level, \(\psi^{2}_{u}\) is a measure of the dispersion of the cluster-specific random effects, directly related to the ICC, and ψ2 is the error variance.

Each cluster has a conditional distribution N(μ+ui,ψ2I), therefore the ‘atomic’ τth quantile is qi(τ)≡μ+ui+ψΦ−1(τ) for all j=1,…,ni, where Φ−1 denotes the inverse of the cumulative distribution function of a standard normal. Conditionally on ui, the yij’s are independent. Thus, the τth sample quantile \(\hat{q}_{i\cdot}(\tau)\) is an estimator of qi(τ), which, for large ni,
$$ \hat{q}_{i\cdot}(\tau) \,\dot{\sim}\, N \biggl(q_{i\cdot}(\tau), \frac{\tau(1-\tau)}{n_i [p_{y_{i\cdot}|u_i}(q_{i\cdot}(\tau)) ]^2} \biggr). $$
Note that this result is valid for any 0<τ<1 and continuous p. If we took the average of M such estimators we would obtain, when M is large,
$$ \frac{1}{M}\sum_{i=1}^{M} \hat{q}_{i\cdot}(\tau) \,\dot{\sim}\, N \bigl( \bar{q}(\tau),\nu \bigr) $$
where \(\bar{q}(\tau)=\frac{1}{M}\sum_{i} q_{i\cdot}(\tau)\approx \mu+ \psi\varPhi^{-1}(\tau)\) and
$$ \nu= \frac{1}{M^2}\sum_{i=1}^{M} \frac{\tau(1-\tau)}{n_i [p_{y_{i\cdot}|u_i}(q_{i\cdot}(\tau)) ]^2}. $$
It is intended that, under the above random intercept model, the approximation of the mean is valid on average (that is, \(E_{u} \{\bar {q}(\tau) \}\)). Thus, the fixed effects can be interpreted as a weighted average approximation of the population quantiles they target.

Let us now turn to the linear quantile mixed models. Our starting point is conditional. The marginal likelihood (4), estimated under (5) or (6), implicitly assumes that the τth regression quantiles of the clusters ‘gravitate’ around a common regression quantile which, in the case of the normal intercept model, would be μ+ψΦ−1(τ), clearly different from the τth quantile of the marginal model \(\mu+\sqrt{\psi^{2}_{u}+\psi^{2}}\varPhi^{-1}(\tau)\). Oberhofer and Haupt (2005) showed that the unconditional quantile estimator for dependent random variables is asymptotically unbiased.

The scale parameter σ does not have, in general, a straightforward interpretation since the use of the Laplace distribution for the conditional response follows from the convenience of manipulating a likelihood rather than from the observation that the data is indeed Laplacian. But what if it is? Consider the linear median mixed model (τ=0.5). The variance of the quantile estimator of qi(0.5) based on normal approximations is \(\frac{\tau (1-\tau)}{n_{i} [\frac{\tau(1-\tau)}{\sigma} ]^{2}} = \frac{4\sigma ^{2}}{n_{i}}\). The relative efficiency of this estimator under Laplacian and normal hypotheses is then \(\frac{4\sigma^{2}}{n_{i}}/\frac{\pi \psi^{2}}{2n_{i}}=8\sigma^{2}/\pi\psi^{2}\). Note that if y∼AL(μ,σ,τ) then \(\operatorname{var}(y) = \frac{\sigma^{2}(1-2\tau+2\tau^{2})}{(1-\tau)^{2}\tau^{2}}\) (Yu and Zhang 2005), hence \(\operatorname{var}(y) = 8\sigma^{2}\) for τ=0.5. The median estimator would achieve the same asymptotic efficiency under the two distributions if, in terms of variances, 8σ2/ψ2=π or, equivalently in terms of scale parameters, if σ/ψ≈0.62.
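The arithmetic above is easy to verify. A sketch (not from the paper; σ = 1.3 is an arbitrary value):

```python
import numpy as np

def al_var(sigma, tau):
    """var(y) for y ~ AL(mu, sigma, tau) (Yu and Zhang 2005)."""
    return sigma**2 * (1 - 2 * tau + 2 * tau**2) / ((1 - tau)**2 * tau**2)

sigma = 1.3   # illustrative scale value
print(np.isclose(al_var(sigma, 0.5), 8 * sigma**2))   # True: var = 8 sigma^2 at tau = 0.5

# scale ratio equating the asymptotic efficiencies: 8 sigma^2 / psi^2 = pi
print(round(np.sqrt(np.pi / 8), 3))   # 0.627, i.e. sigma/psi ~ 0.62
```

At τ = 0.5 the variance formula gives σ²(0.5)/(0.25 · 0.25) = 8σ², and solving 8σ²/ψ² = π for the scale ratio gives σ/ψ = √(π/8).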

3.4 Inference

The LQMM in (2) as well as GB’s random-intercept model assume constant scale parameter and, under this assumption, the asymptotics in heteroscedastic data would lead to incorrect inferential approximations. However, GB’s standard error calculation relied on bootstrap resampling which makes large-n approximations irrelevant (Kim and Yang 2011). The inclusion of variance weights might represent a possible development to address such limitation in our models. See Koenker (2005, p. 94) for a discussion on likelihood ratio tests based on the asymmetric Laplace residuals.

Bootstrap represents a very flexible approach and is often used in quantile estimation (Parzen et al. 1994; Buchinsky 1995; He and Hu 2002; Kocherginsky et al. 2005; Bose and Chatterjee 2003; Koenker 2005; Feng et al. 2011; Canay 2011). Here, we consider a block bootstrap approach to assess the uncertainty in conditional quantile estimates. R bootstrap samples are obtained by resampling the index i=1,…,M with replacement. The standard errors are computed as the square root of the diagonal elements of the covariance matrix \(\hat{V} = \operatorname{Cov} (B )\), where B is an R×(p+1) matrix with row vectors of bootstrap estimates of (θx,σ)′. Similarly, standard errors for Ψ can be obtained after opportunely transforming the bootstrap estimates of θz.
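The cluster-resampling scheme is generic: it only needs a list of clusters and an estimator. A minimal sketch (not the lqmm implementation; the data and the two-component estimator are illustrative stand-ins for the model fit):

```python
import numpy as np

def block_bootstrap_se(estimator, cluster_data, R=200, seed=0):
    """Block (cluster) bootstrap: resample whole clusters i = 1,...,M with
    replacement, re-estimate, and take standard errors as the square roots of
    the diagonal of the covariance of the R x p matrix B of bootstrap estimates."""
    rng = np.random.default_rng(seed)
    M = len(cluster_data)
    B = np.array([
        estimator([cluster_data[i] for i in rng.integers(0, M, size=M)])
        for _ in range(R)
    ])
    V = np.atleast_2d(np.cov(B, rowvar=False))   # bootstrap covariance matrix
    return np.sqrt(np.diag(V))

# toy illustration: clustered normal data, estimator = (grand mean, grand SD)
rng = np.random.default_rng(4)
clusters = [rng.normal(loc=rng.normal(), size=10) for _ in range(50)]
est = lambda cs: np.array([np.mean(np.concatenate(cs)), np.std(np.concatenate(cs))])
se = block_bootstrap_se(est, clusters)
print(se.shape)   # (2,)
```

Resampling whole clusters, rather than individual observations, preserves the within-cluster dependence that motivates the block bootstrap here.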

Model selection can be conducted by using traditional measures such as the Akaike information criterion (AIC) or the Bayesian information criterion (BIC). A Laplacian likelihood ratio test can be used for testing linear hypotheses of the type H0: θ2=0, with partitioned coefficients vector \(\theta_{x} = (\theta'_{1},\theta'_{2} )'\). See Koenker and Machado (1999) for additional details.

As mentioned previously, random-effects models allow conditional inferences in addition to marginal inferences (Robinson 1991; Lee and Nelder 2004). Prediction of the random effects vector u needs some elaboration. Here, we briefly illustrate a possible approach based on best prediction. Consider the simple case where y=u+ε. The best predictor of u is defined as the vector \(\tilde{u}\) that solves
$$ \min_{\tilde{u}} \mathrm{E} \bigl\{ (\tilde{u} - u )^2 \bigr\}. $$
(10)
The solution to (10) is the conditional expectation E(u|y) (Ruppert et al. 2003). If we restrict the family of predictors to be linear and we consider the general LQMM \(y = X\theta^{(\tau)}_{x} + Zu + \varepsilon ^{(\tau)}\), the solution is the best linear predictor (BLP) of Zu
$$ Z\tilde{u} = \mathrm{E} (Zu ) + \bigl\{\operatorname {Cov} (Zu,y ) \bigr \}\varSigma^{-1} \bigl\{y - \mathrm{E} (y ) \bigr\}. $$
(11)
The first term of expression (11) is zero and, following from the assumption that uε(τ), \(\operatorname{Cov} (Zu,y )=Z\varPsi^{(\tau)} Z'\). The last two terms of the product are the inverse of the matrix \(\varSigma\equiv \operatorname{Cov} (y ) = Z\varPsi^{(\tau)} Z' + \operatorname{Cov} (\varepsilon ^{(\tau)} )\) and the vector \(y - \mathrm{E} (y ) = y - X\theta^{(\tau)}_{x} - \mathrm{E} (\varepsilon ^{(\tau)} )\). Expressions for mean and variance of the asymmetric Laplace are given in Yu and Zhang (2005). If in (11) we plug an estimate of θ and σ, we refer to (11) as estimated BLP (eBLP) and we write
$$ \hat{u}^{(\tau)}_{\mathrm{eBLP}} = \hat{\varPsi }^{(\tau)} Z' \hat{\varSigma}^{-1} \bigl\{y - X\hat{ \theta}_{x}^{(\tau)} - \hat{\mathrm{E}} \bigl( \varepsilon ^{(\tau)} \bigr) \bigr\}. $$
(12)
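A minimal sketch of the eBLP in (12), under the simplifying assumption that the AL errors are independent so that Cov(ε(τ)) is a multiple of the identity; the AL mean and variance formulas are those given by Yu and Zhang (2005):

```python
import numpy as np

def al_mean_var(sigma, tau):
    """Mean and variance of the asymmetric Laplace error (Yu and Zhang 2005)."""
    m = sigma * (1 - 2 * tau) / (tau * (1 - tau))
    v = sigma**2 * (1 - 2 * tau + 2 * tau**2) / (tau * (1 - tau))**2
    return m, v

def eblp(y, X, Z, theta_x, Psi, sigma, tau):
    """Estimated best linear predictor (12):
    u_hat = Psi Z' Sigma^{-1} (y - X theta - E(eps)),
    with Sigma = Z Psi Z' + Var(eps) I under independent AL errors."""
    m, v = al_mean_var(sigma, tau)
    Sigma = Z @ Psi @ Z.T + v * np.eye(len(y))      # Cov(y)
    resid = y - X @ theta_x - m                     # y - X theta - E(eps)
    return Psi @ Z.T @ np.linalg.solve(Sigma, resid)
```

As with the BLUP in linear mixed models, the predictor shrinks the raw cluster deviations toward zero.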

For the sake of simplicity, consider the LMM y=u+ε, u∼N(0,I), ε∼N(0,I), which entails an ICC of 0.5. As seen previously, the random effects play a pure location-shift role. Therefore, we expect that in this case a good predictor for u in LQMMs provides a prediction \(\hat{u}\) that is close to the best linear unbiased prediction (BLUP) of u from a LMM.

In a small-scale simulation study, we generated n=10 repeated observations from a LMM with M=50 groups and we calculated the eBLP \(\hat{u}^{(\tau)}_{\mathrm{eBLP}}\) given in (12) over a grid of quantiles τ, together with the estimated BLUP (eBLUP) \(\hat{u}_{\mathrm{eBLUP}}\), for 200 replications. We then estimated the coefficients of the linear regression \(\mathrm{E} (\hat{u}^{(\tau)}_{\mathrm{eBLP}} \mid\hat{u}_{\mathrm{eBLUP}} ) = a + b \cdot\hat{u}_{\mathrm{eBLUP}}\). The mean squared error for a and b was, respectively, 0.056 and 0.018. This was calculated as the squared difference between the estimated regression coefficients and those from a perfect linear relationship, averaged over the simulation realizations and the number of τ's. Selected results are presented in Fig. 1. In general, the median of the intercepts estimated from the model above was close to 0 at all estimated quantiles. The median of the slopes b was close to 1 for τ's in the region of 0.5 but decreased to around 0.8 for τ=0.01 and τ=0.99. The spread of the slopes was larger for tail quantiles. These results remained unchanged when M=100 and n=20. However, higher values of the intraclass correlation produced slopes closer to 1 at all estimated quantiles. Conversely, lower ICC values led to a weaker association between \(\hat{u}^{(\tau)}_{\mathrm{eBLP}}\) and \(\hat{u}_{\mathrm{eBLUP}}\).
Fig. 1

Plot of best linear and best linear unbiased predictions (grey lines) of u for (a) τ=0.5 and (b) τ=0.99; and boxplots of the intercepts (c) and slopes (d) from the regression model \(\mathrm{E} (\hat{u}^{(\tau)}_{\mathrm{eBLP}} \mid\hat{u}_{\mathrm{eBLUP}} ) = a + b \cdot\hat{u}_{\mathrm{eBLUP}}\) for three quantiles in 200 replicated data sets

The residuals \(\hat{e} = y - X\hat{\theta}^{(\tau)}_{x} - Z\hat{u}^{(\tau)}_{\mathrm{eBLP}}\) did not satisfy the optimality condition of Koenker and Bassett (1978, Theorem 3.4). An ad-hoc adjustment might consist in constraining the estimation of u in (11) to satisfy this condition. The development of a predictor for the random effects that takes these preliminary results into account is needed.

4 Simulation study

We carried out an extensive simulation study. Data were generated according to
$$ y_{ij} = (\beta_{0} + u_i) + ( \beta_{1} + v_{i})x_{ij} + \beta_{2}z_{ij} + (1+\gamma x_{ij})e_{ij}, $$
where β=(100,2,1)′, ui and vi are cluster-specific random effects, xij=δi+ζij, δiN(0,1), ζijN(0,1) and zij∼Binom(1,0.5). To ensure a positive scale parameter and thus avoid quantile-crossing, δi and ζij were generated as standard uniforms in all heteroscedastic models (γ≠0). A summary of the simulation scenarios for different data-generating distributions and sample sizes is given in Table 1.
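For concreteness, the generating model can be sketched as below for the normal "location-shift symmetric" scenarios; other scenarios simply swap the distributions of u, v, and e (and, when γ≠0, draw δ and ζ as standard uniforms, as noted above):

```python
import numpy as np

def simulate(n, M, beta=(100.0, 2.0, 1.0), gamma=0.0, rng=None):
    """Generate y_ij = (b0 + u_i) + (b1 + v_i) x_ij + b2 z_ij + (1 + gamma x_ij) e_ij
    for the normal 'location-shift symmetric' scenarios (N(0,5) denotes variance 5)."""
    if rng is None:
        rng = np.random.default_rng(0)
    b0, b1, b2 = beta
    u = rng.normal(0, np.sqrt(5), M)          # cluster random intercepts
    v = rng.normal(0, np.sqrt(5), M)          # cluster random slopes
    i = np.repeat(np.arange(M), n)            # cluster index per observation
    x = rng.normal(0, 1, M)[i] + rng.normal(0, 1, n * M)   # x_ij = delta_i + zeta_ij
    z = rng.binomial(1, 0.5, n * M)
    e = rng.normal(0, np.sqrt(5), n * M)
    y = (b0 + u[i]) + (b1 + v[i]) * x + b2 * z + (1 + gamma * x) * e
    return y, x, z, i
```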
Table 1

Simulation study scenarios. Distributions: N(0,ψ2), normal with mean 0 and variance ψ2; tg, t with g degrees of freedom; \(\chi^{2}_{g}\), chi-square with g degrees of freedom; Stg, skew t with g degrees of freedom (Azzalini and Capitanio 2003)

Model description | (n,M) | u | v | e | γ
(1) location-shift symmetric | (5,50) | N(0,5) | – | N(0,5) | 0
(2) location-shift symmetric | (5,50) | t3 | – | N(0,5) | 0
(3) location-shift heavy-tailed | (5,50) | N(0,5) | – | t3 | 0
(4) location-shift heavy-tailed | (5,50) | t3 | – | t3 | 0
(5) location-shift asymmetric | (5,50) | N(0,5) | – | \(\chi^{2}_{2}\) | 0
(6) location-shift asymmetric | (5,50) | t3 | – | \(\chi^{2}_{2}\) | 0
(7) location-shift symmetric, \(\operatorname{cor}(u,v) = 0\) | (5,50) | N(0,5) | N(0,5) | N(0,5) | 0
(8) location-shift heavy-tailed, \(\operatorname{cor}(u,v) = 0\) | (5,50) | t3 | t3 | t3 | 0
(9) location-shift symmetric | (10,100) | N(0,5) | – | N(0,5) | 0
(10) location-shift symmetric | (20,200) | N(0,5) | – | N(0,5) | 0
(11) location-shift heavy-tailed, \(\operatorname{cor}(u,v) = 0\) | (10,100) | t3 | t3 | t3 | 0
(12) location-shift heavy-tailed, \(\operatorname{cor}(u,v) = 0\) | (20,200) | t3 | t3 | t3 | 0
(13) location-shift asymmetric, \(\operatorname{cor}(u,v) = 0\) | (20,200) | t3 | t3 | \(\chi^{2}_{2}\) | 0
(14) heteroscedastic symmetric | (10,100) | N(0,5) | – | N(0,5) | 0.25
(15) heteroscedastic heavy-tailed | (10,100) | t3 | – | t3 | 0.25
(16) heteroscedastic asymmetric | (10,100) | t3 | – | \(\chi^{2}_{2}\) | 0.25
(17) location-shift symmetric, \(\operatorname{cor}(u,v) > 0\) | (10,100) | N(0,5) | N(0,5) | N(0,5) | 0
(18) location-shift symmetric, \(\operatorname{cor}(u,v) < 0\) | (10,100) | N(0,5) | N(0,5) | N(0,5) | 0
(19) location-shift heavy-tailed, \(\operatorname{cor}(u,v) > 0\) | (10,100) | t3 | t3 | t3 | 0
(20) location-shift heavy-tailed, \(\operatorname{cor}(u,v) < 0\) | (10,100) | St3 | St3 | t3 | 0
(21) location-shift heavy-tailed, \(\operatorname{cor}(u,v) > 0\) | (10,100) | St3 | St3 | t3 | 0
(22) location-shift with 5 % contamination | (10,100) | N(0,5) | – | \(\chi^{2}_{2}\) + N(0,50) | 0
(23) location-shift with 5 % contamination | (10,100) | N(0,5) + N(0,50) | – | \(\chi^{2}_{2}\) | 0

A dash indicates that the random slope v was not included in the generating model.

Quantile mixed models of the form
$$ y_i = X_{i}\theta_{x}^{(\tau)} + Z_{i}u_{i} + \varepsilon _{i}^{(\tau)}, $$
where Xi is an n×3 matrix that collects a column of ones and the variables x and z, were estimated for τ∈{0.5,0.75,0.9}. The matrix Zi and the dimension of the random-effects vector ui were defined according to each scenario, and so was the LQMM covariance matrix Ψ(τ) (i.e., either an identity or a symmetric matrix). A Gauss–Hermite quadrature with K=11 nodes was used to approximate the marginal log-likelihood as in (5). For data generated under models (14–16), the number of nodes was set to K=17. All models were estimated with the gradient search algorithm described in Sect. 3.2, with the exception of those applied to data generated under scenarios (22–23), which were estimated using a Nelder-Mead algorithm. The latter was also used for models (1–6) to compare derivative-free with gradient-based optimization. The tolerance parameter and the maximum number of iterations were set to, respectively, 10⁻³ and 500 for the likelihood, and to 10⁻⁵ and 10 for the scale parameter σ. The standard deviation of the response was used as initial step size. The contraction and expansion step factors were set to 0.5 and 1 respectively. Starting values for θx were obtained via least squares estimation. For models fitted to contaminated data (22–23), Gauss–Laguerre quadrature was also assessed. In addition, tail quantiles τ∈{0.01,0.05,0.95,0.99} were estimated for data generated under scenarios (1,3,5) using a sample size n=10 and M=100, and a likelihood tolerance parameter equal to 5⋅10⁻³.
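The quadrature approximation can be illustrated for the simplest random-intercept case. The following is a sketch of the Gauss–Hermite idea behind (5), not the authors' implementation; the function name and parameterization are illustrative:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def rho(e, tau):
    """Quantile regression check function rho_tau(e)."""
    return e * (tau - (e < 0))

def loglik_intercept(beta, sigma, psi, y, X, groups, tau, K=11):
    """Gauss-Hermite approximation to the marginal log-likelihood of a
    random-intercept LQMM with asymmetric Laplace errors: the normal
    integral over u_i is replaced by a weighted sum over K nodes."""
    t, w = hermgauss(K)               # nodes/weights for the weight exp(-t^2)
    u = np.sqrt(2.0) * psi * t        # rescaled nodes for u ~ N(0, psi^2)
    ll = 0.0
    for i in np.unique(groups):
        r = y[groups == i] - X[groups == i] @ beta
        # AL density of cluster i, evaluated at each quadrature node
        dens = np.array([np.exp(len(r) * np.log(tau * (1 - tau) / sigma)
                                - np.sum(rho(r - uk, tau)) / sigma) for uk in u])
        ll += np.log(np.dot(w, dens) / np.sqrt(np.pi))
    return ll
```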

We also assessed the coverage rate of the median estimator’s confidence intervals at the nominal 90 % level. Data were generated as in scenario (3) (Table 1) but with fixed design points (i.e., drawn only once), either for γ=0 or γ=0.25 and for three sample sizes (n,M): (5,50), (10,100), and (10,200). The standard errors were calculated using R=50 bootstrap replications. The estimates were obtained using a Gauss–Hermite quadrature with K=17 nodes and Nelder-Mead optimization.

Finally, we evaluated the Gauss–Hermite quadrature for several quadrature grid sizes. Data were generated from yij=(βk+ui,k)xij,k+eij, ui,kN(0,5), eijN(0,5) with n=5 and M=50. The number of fixed and random effects k ranged from 1 (intercept model) to 6, with xij,k=δi+ζij as described above for k=2,…,6. LQMMs were estimated for τ∈{0.5,0.75,0.9} using K=5 or K=17 nodes.
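The grid sizes evaluated here follow from taking the q-fold Cartesian product of a K-node one-dimensional rule, which yields a K^q × q grid; a minimal sketch:

```python
import numpy as np
from itertools import product

def quad_grid(nodes, q):
    """q-fold Cartesian product of the one-dimensional nodes: a K^q x q grid,
    so the grid size grows exponentially in the number of random effects q."""
    return np.array(list(product(nodes, repeat=q)))
```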

In all cases, datasets were generated independently 500 times. In what follows, we report selected results for the sake of brevity. All results are available upon request from the first author.

For each scenario in Table 1, we assessed the relative bias, averaged over the simulation realizations, and reported the results in Table 2. Overall, the performance of the LQMM estimator was satisfactory, except in specific cases. Estimation bias was low at all considered quantiles for one- or two-dimensional random-effects models. However, an increase of bias in the slope of x was observed in data generated with correlated random effects, in particular when using t3 distributions (scenario 19; see footnote 1). The bias for this coefficient became severe when u and v were drawn from skewed, in addition to heavy-tailed, distributions (scenarios 20 and 21), which, of course, was due to the use of symmetric quadrature weights. Mean CPU time (see footnote 2), averaged over the simulation realizations and the number of τ's, ranged from 0.02 seconds in simpler scenarios to 2.80 seconds in scenarios with more complex random-effects structures or larger sample sizes.
Table 2

Relative bias of \(\hat{\theta}_{x}^{(\tau)}\) for τ∈{0.5,0.75,0.9}

          Intercept                   x                           z
Model     0.5     0.75    0.9         0.5     0.75    0.9         0.5     0.75    0.9
(1)       −0.000  0.001   0.002       −0.002  −0.003  0.000       0.007   0.019   −0.018
(2)       0.000   0.001   0.000       0.001   −0.004  0.001       0.018   0.041   −0.017
(3)       0.000   0.003   0.008       0.005   0.004   −0.002      0.013   0.034   0.013
(4)       0.000   0.002   0.004       0.004   −0.000  −0.002      −0.006  0.034   −0.024
(5)       0.002   0.005   0.003       0.001   0.004   0.001       0.010   0.056   0.072
(6)       0.002   0.003   0.002       0.004   0.007   0.011       −0.018  0.055   0.048
(7)       −0.000  0.001   0.003       0.005   0.006   −0.009      −0.002  0.047   −0.107
(8)       −0.000  0.010   0.039       −0.013  −0.013  −0.004      0.038   −0.002  −0.038
(9)       −0.000  0.000   0.002       0.002   −0.001  −0.001      −0.007  0.002   −0.007
(10)      −0.000  −0.001  0.002       −0.000  0.002   −0.001      −0.002  0.001   −0.008
(11)      −0.000  0.011   0.057       −0.010  0.006   −0.006      −0.006  −0.008  0.008
(12)      0.000   0.009   0.021       −0.016  −0.018  −0.020      −0.005  −0.011  −0.005
(13)      0.002   0.006   0.015       0.012   −0.009  0.018       −0.004  −0.007  −0.009
(14)      −0.000  0.000   0.001       −0.006  −0.005  −0.072      0.011   0.012   0.001
(15)      0.000   0.001   0.002       0.009   0.010   −0.005      −0.011  −0.007  −0.018
(16)      0.002   0.001   0.000       0.018   0.034   0.024       −0.005  −0.002  −0.006
(17)      0.000   −0.000  0.003       0.002   −0.108  0.017       0.004   0.032   −0.006
(18)      −0.000  0.002   0.005       −0.013  −0.045  −0.089      −0.010  −0.000  −0.034
(19)      0.000   0.011   0.068       −0.009  0.135   0.459       −0.008  −0.024  −0.014
(20)      0.006   0.010   0.016       0.328   0.258   0.218       −0.003  −0.002  −0.018
(21)      0.012   0.015   0.019       0.624   0.566   0.614       0.004   0.013   0.004
(22)      0.003   0.005   0.010       0.001   0.003   0.005       −0.002  −0.009  −0.034
(22)a     0.003   0.005   0.009       0.001   0.005   0.004       −0.006  −0.010  −0.050
(23)      0.003   0.003   0.004       0.001   −0.000  −0.007      −0.002  −0.001  −0.006
(23)a     0.003   0.003   0.003       −0.001  −0.002  −0.005      −0.003  0.003   0.002

aRobust (Gauss–Laguerre) random-effects estimation.

Outlier contamination, either in the random intercept or in the error term (scenarios 22 and 23), did not impact the estimation bias. However, for data generated under scenario (23), the Monte Carlo variance of the estimated intercepts using Gauss–Laguerre quadrature was 51 % (τ=0.5), 39 % (τ=0.75), and 28 % (τ=0.9) smaller than that calculated when using normal quadrature weights (results not shown).

Estimation bias in scenarios (1–6) remained unchanged when using derivative-free estimation (results not shown).

As expected, tail quantiles had higher estimation bias, in particular for τ=0.01 and τ=0.99, though results (not shown) differed markedly depending on the coefficient being estimated. The relative absolute bias, averaged over τ and the three scenarios, was 1 % for the intercept and 0.6 % for x’s slope but 24 % for the binary covariate’s coefficient.

Coverage rate and average length of bootstrap confidence intervals are reported in Table 3. Coverage rates were close to the nominal level in most cases. As expected, the average length of the confidence intervals decreased with increasing sample size and was larger in the heteroscedastic scenario. In the Bayesian context, credible intervals for the location parameter of AL-distributed variates have been found to achieve nominal coverage at several quantiles (Lum and Gelfand 2012).
Table 3

Coverage rate and mean length of the confidence intervals based on 50 bootstrap replications for the median estimator at nominal 90 % level

              Location-shift (γ=0)             Heteroscedastic (γ=0.25)
(n,M)         (5,50)   (10,100)  (10,200)      (5,50)   (10,100)  (10,200)

Coverage (%)
Intercept     91.6     91.2      94.4          91.0     92.0      92.4
x             93.0     93.2      94.2          90.8     91.0      94.2
z             90.6     91.0      94.4          93.6     94.6      90.2

Length
Intercept     1.7      1.4       1.2           2.2      1.6       1.2
x             0.4      0.2       0.2           1.5      0.8       0.6
z             0.8      0.4       0.3           0.9      0.4       0.3

Relative absolute bias, total number of iterations and CPU time, averaged over the simulation realizations and the number of τ’s, are reported in Table 4. While the number of iterations grows linearly, computation time increases exponentially with increasing q. Bias is low at all grid sizes and, for a given dimension q, it remains approximately unchanged when increasing the number of nodes from 5 to 17. Based on these results, we recommend using a small number of quadrature nodes (say, between 5 and 11) especially when fitting models with more than three random effects. However, as noted with regard to scenario 19 (Table 2), random effects distributions characterized by heavy tails may require using a larger number of nodes to reduce the estimation bias.
Table 4

Relative absolute bias of \(\hat{\theta}_{x}^{(\tau)}\) for τ∈{0.5,0.75,0.9}, number of iterations and time to convergence

Nodes K   Dimension q   Grid size   Relative absolute bias   Iterations   CPU time (seconds)
5         1             5×1         0.000                    37           0.01
5         2             25×2        0.006                    50           0.05
5         3             125×3       0.020                    56           0.31
5         4             625×4       0.014                    61           2.36
5         5             3125×5      0.014                    65           18.10
5         6             15625×6     0.026                    71           126.83
17        1             17×1        0.000                    39           0.02
17        2             289×2       0.004                    52           0.46
17        3             4913×3      0.005                    58           11.77
17        4             83521×4     0.023                    65           332.02

Substantial reductions in the size of the quadrature grid for a given level of accuracy can be obtained by employing efficient quadrature rules, such as, for example, integration on sparse grids (Heiss and Winschel 2008) and nested integration rule for Gaussian weights (Genz and Keister 1996). Also, adaptive rescaling (Pinheiro and Bates 1995; Pinheiro and Chao 2006) may improve the estimation when the variance of the random effects is large. These are topics of current research by the first author.

5 Examples

5.1 Treatment of lead-exposed children

In this section, we revisit a placebo-controlled, double-blind, randomized trial of succimer (a chelating agent) in children with blood lead levels (BLL) of 20–44 μg/dL conducted by the Treatment of Lead-Exposed Children (TLC) Trial Group (2000). Lead exposure is known to increase the risk of encephalopathy and, in general, to impair cognitive function. The trial was designed to test the hypothesis that children with moderate BLL who were treated with succimer would score higher than children given placebo on a number of behavioral and cognitive tests. Drug treatment was later found not to improve psychological test scores (Rogan et al. 2001).

The original data consist of repeated measurements of BLL observations on an initial sample of 780 children (396 in the succimer group and 384 in the placebo group). Here, we analyze a subset of the data for M=100 children and for a shorter time period. Our results, therefore, may not be extended to draw general conclusions on the original study. The dataset, available at http://biosun1.harvard.edu/~fitzmaur/ala/tlc.txt, provides BLL obtained at baseline (or week 0), week 1, week 4, and week 6 on children randomly assigned to treatment (50) or placebo (50) group.

A Trellis plot (Sarkar 2008) of selected individual blood lead trajectories is shown in Fig. 2. The initial drop in BLL is followed by a rebound after 1 week of treatment in the succimer group (all sample curves are reported in Treatment of Lead-Exposed Children (TLC) Trial Group 2000). A monotonic pattern, in contrast, describes observed BLL in the placebo group. The overall distribution of BLL at each week, conditional on treatment status, is presented in Fig. 3. Non-normal features of the conditional distribution of BLL at each time point and in both groups are apparent.
Fig. 2

Trellis plot of blood lead level in 24 children treated with succimer (subjects 1–12) or with placebo (subjects 51–62)

Fig. 3

Histograms of blood lead level in 100 children treated with succimer (grey) or with placebo (white) at 0, 1, 4, and 6 weeks from randomization

We modeled the temporal trajectories by using natural cubic splines with three degrees of freedom and break-points placed at 0, 4 and 6 weeks. The B-spline basis matrix T=[T1|T2] was then included in the quantile mixed model
$$ y_i = X_{i}\theta_{x}^{(\tau)} + Z_{i}u_{i} + \varepsilon _{i}^{(\tau)}, $$
where Xi is a 4×6 matrix that collects a column of ones, a dummy variable for treatment (baseline: placebo), the basis splines T1 and T2 and their interaction with treatment. The matrix Zi includes the same variables as Xi except for those associated with T2 (the inclusion of a random slope for T2 was not found to significantly improve the fit of the model). Therefore, θx is a 6×1 vector of fixed coefficients and ui is a 4×1 vector of subject-specific random effects.

We assumed a general positive-definite covariance matrix and approximated the log-likelihood using a Gauss–Hermite quadrature as in (5) with K=7 nodes, giving a quadrature grid of size 2401×4, for the 5 quantiles τ∈{0.1,0.25,0.5,0.75,0.9}. The objective function was optimized by using the gradient search algorithm described in Sect. 3.2. Standard errors were computed using R=30 bootstrap replications.

The estimated regression quantiles and their 95 % confidence intervals (error bars) are shown in Fig. 4. The least squares solution from a LMM (dashed line) and the 95 % confidence interval (dotted lines) are provided for comparison.
Fig. 4

Regression quantiles for blood lead level and 95 % confidence intervals (error bars). Least squares estimates (dashed lines) and 95 % confidence intervals (dotted lines) are reported

The intercepts can be interpreted as the quantiles of the distribution of BLL in untreated children at the beginning of the study. The estimated treatment effect shows that the BLL distributions in the two groups at baseline differ markedly at lower quantiles and moderately at the centre.

The estimated slopes for spline T1 in the placebo group show a fall of BLL over the 6-week time period at approximately uniform rates across the distribution, with levels comparable to the average rate. The negative estimates of T2’s coefficients indicate a slightly steeper decrease of BLL after 4 weeks at rates that are, again, similar across the distribution.

In the treatment group, there is a generalized fall of BLL levels at all quantiles but at faster rates than those in the placebo group. The positive coefficients for spline T2 reflect the rebound that we observed in Fig. 2. The magnitude of this effect is weaker at higher levels of lead concentration.

The analysis of the so-called Lehman-Doksum quantile treatment effect (Doksum 1974; Lehmann 1975; Koenker and Xiao 2002) has particular relevance in placebo-controlled studies as the one considered here. In the case of independent observations, formal testing of location-shift and location-scale-shift hypotheses can be carried out, for example, by using Khmaladze-type tests (Koenker and Xiao 2002; Koenker 2005). To the best of our knowledge, these are not available for correlated data though it has been suggested that an extension of these tests to the analysis of cluster-specific distributions may be contemplated (Koenker 2005, p. 281).

We conclude this example with a brief analysis of the effects related to the random part of the model. Variance components and correlations are shown in Table 5. Mean and median show similar levels of variability and covariability of the random effects. There is an increasing trend in the variance of the intercepts from lower to higher quantiles as well as for the effects associated with T1’s slopes in both groups. In contrast, intercepts in the treatment group show decreasing variance from lower to higher quantiles. In general, the correlation between random effects is substantial at all quantiles and between most of the random effects. Box-centile plots of the random effects predicted with the eBLP in (12) are shown in Fig. 5.
Fig. 5

Box-centile plots of predicted child-specific random effects for blood lead level

Table 5

Correlation matrix of the random effects for intercept (Int), basis spline T1, treatment variable (Trt) and their interaction term (Trt: T1). Variances are reported in brackets

τ=0.1      Int      T1       Trt      Trt: T1
Int        (11.05)
T1         0.84     (4.51)
Trt        0.62     0.89     (9.93)
Trt: T1    −0.64    −0.65    −0.52    (1.34)

τ=0.25     Int      T1       Trt      Trt: T1
Int        (13.40)
T1         0.86     (3.19)
Trt        0.48     0.84     (4.92)
Trt: T1    0.55     0.87     0.98     (4.96)

τ=0.5      Int      T1       Trt      Trt: T1
Int        (17.49)
T1         1.00     (4.03)
Trt        −0.59    −0.61    (1.00)
Trt: T1    0.95     0.93     −0.49    (10.79)

Mean       Int      T1       Trt      Trt: T1
Int        (16.38)
T1         1.00     (2.03)
Trt        −0.44    −0.44    (4.91)
Trt: T1    0.74     0.74     0.28     (23.56)

τ=0.75     Int      T1       Trt      Trt: T1
Int        (19.48)
T1         0.72     (7.28)
Trt        −0.64    −0.39    (1.30)
Trt: T1    0.76     0.89     −0.75    (13.31)

τ=0.9      Int      T1       Trt      Trt: T1
Int        (31.54)
T1         0.40     (10.52)
Trt        0.25     0.15     (1.95)
Trt: T1    0.58     0.94     0.04     (15.27)

5.2 A-level Chemistry scores

The dataset consists of A-level Chemistry examination scores in 1997 from 31,022 students in 2410 English schools (Fielding et al. 2003). Additional variables included in the dataset were gender, age in months centered at 222 months (18.5 years), average General Certificate of Secondary Education (GCSE) score centered at mean (6.3) and school identifier.

The outcome is a discrete variable bounded between 0 and 10. Therefore, the following analysis is to be considered mainly for illustration purposes. To model the quantiles of A-level Chemistry scores, we considered the random-intercept model
$$ y_i = X_{i}\theta_{x}^{(\tau)} + u_{i} + \varepsilon _{i}^{(\tau)}, $$
where Xi is an ni×4 matrix, i=1,…,2410, that collects a column of ones, a variable for age (range −6 to 5), a dummy variable for sex (baseline: male), and a variable for mean GCSE score as prior attainment (range −6.3 to 1.7). A random intercept ui was introduced to capture the heterogeneity between schools, whose number of students ni varied from 1 to 188 (median 8).
We approximated the log-likelihood using a Gauss–Hermite quadrature as in (5) with K=9 nodes for 7 quantiles and optimized the objective function via Nelder-Mead. Standard errors were computed using R=20 bootstrap replications. In Fig. 6, the estimated regression quantiles and their 95 % confidence intervals (error bars) are depicted, together with the variance of the random intercepts, ψ2, and the ICC, which are shown in the two plots at the bottom of the figure. The least squares solution from a linear mixed-effects model (dashed line) and the 95 % confidence interval (dotted lines) are provided for comparison.
Fig. 6

Regression quantiles, variance ψ2 and intraclass correlation for A-level Chemistry scores and 95 % confidence intervals (error bars). Least squares estimates (dashed lines) and 95 % confidence intervals (dotted lines) are reported

The intercepts can be interpreted as the quantiles of the distribution of Chemistry scores in 18.5 year-old males with a GCSE score equal to 6.3. The mean and median scores are very close and the monotonic increase of the quantiles denotes a uniform distribution.

The regression coefficients for age and sex are approximately constant and negative across quantiles and close to the corresponding mean age (−0.037) and sex (−0.74) effects. That is, older individuals and females perform less satisfactorily than younger peers and males, respectively, regardless of their rank in the Chemistry score distribution. However, at higher quantiles (τ=0.9) and therefore for high performing students, there is some indication that these effects may be weaker.

Prior achievement, as measured by mean GCSE results, has a beneficial effect on the average Chemistry score: for a one-point increase of mean GCSE score, A-level Chemistry score increases by 2.57 points. However, the positive effect is much stronger at lower quantiles (low performing students) than at higher quantiles (high performing students), with decreasing magnitude in between.

Finally, the between-school variability shows a decreasing pattern from lower to higher quantiles. It is interesting to note that, in contrast, the ICC follows an inverted U curve. That is, the intra-school correlation is higher at the center of the outcome distribution where, proportionally, the within-school variability is smaller. Differences between schools, therefore, might play a less important role in explaining the total variability in below- and above-the-average students’ performance. If these results were to be confirmed by other analyses, this would obviously have an important implication for education policies at the national level.

5.3 Doubly robust meta-analysis

Let us consider the case in which M aggregate measures or summaries for a common underlying outcome are available but the raw data are not. These could be, for example, standard deviations coming from M experiments aimed at assessing the precision of an instrument. Some instruments might have a much higher precision than others. We are interested in summarizing or pooling these studies by estimating one overall measure of the precision β0 and estimating the deviation ui, i=1,…,M, from β0. Suppose each experiment has a valid sample of size ni. Rather than assuming a normal random-effects model (DerSimonian and Laird 1986), we could consider estimating the random-effects median model
$$ y_{i} = \beta_{0} + u_{i} + \varepsilon _{i}^{(0.5)} $$
(13)
where yi is the effect size for study i, i=1,…,M and ui is a random effect that measures the deviation of the ith study effect from the population median β0. In addition, we could introduce Laplacian random effects in model (13). This would entail a double-robustness of the model: to unusual deviations from β0 and to unusual deviations from β0+ui. To take into account the different weight of the studies, the log-likelihood to be maximized would be \(\sum_{i=1}^{M}\varPi(n_{i} )\ell_{\mathrm{app},i} (\theta,\sigma)\), where Π(⋅) is some function of the size of the samples. See Demidenko (2004) for a normal-Laplace regression, where normality is assumed for ε. Model (13) would not have an advantage over existing meta-analysis models if the effect size has a sampling distribution approximately normal (e.g., as in the case of the sample mean).
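A sketch of the weighted Laplacian log-likelihood just described, with illustrative choices Π(n)=√n, σ=1, and τ=0.5 (none of which are prescribed by the text):

```python
import numpy as np

def rho(e, tau):
    """Quantile regression check function rho_tau(e)."""
    return e * (tau - (e < 0))

def weighted_loglik(beta0, u, y, sizes, sigma=1.0, tau=0.5, weight=np.sqrt):
    """Weighted Laplacian log-likelihood sum_i Pi(n_i) l_i for the
    random-effects median model (13); Pi(n) = sqrt(n) is one possible choice."""
    r = np.asarray(y) - beta0 - np.asarray(u)          # study-level residuals
    li = np.log(tau * (1 - tau) / sigma) - rho(r, tau) / sigma
    return float(np.sum(weight(np.asarray(sizes, dtype=float)) * li))
```

As expected for a median model, the weighted log-likelihood is larger when β0 sits near the (weighted) centre of the effect sizes than when it is far from them.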

5.4 Smoothing splines

Nonlinear QR for dependent data has received some attention in recent years. Karlsson (2008) developed a weighted approach to nonlinear QR estimation for repeated measures. Wang (2012) extended GB’s model to account for nonlinear effects in normal-Laplace models.

Here, we show another possible application of the LQMMs presented in Sect. 2. We generated n=10 observations clustered within M=500 groups using the nonlinear heteroscedastic model
$$ y_{ij} = \exp\biggl\{\frac{5x_{ij}}{1+2x_{ij}^2} \biggr\} + u_i + x_{ij}\varepsilon _{ij}, $$
(14)
where xijU(0,1), uiN(0,0.25) and εijN(0,1), independently. In this model, the within-group correlation \(\operatorname{cor} (y_{ij},y_{ij'} )\), jj′, will vary in the range 0.2–1, depending on xij.
We approximated the nonlinear function in (14) with a natural cubic spline model with four degrees of freedom, break-points placed at the quartiles of x and boundary knots at the extremes of x's range. The B-spline basis matrix S was then included in the quantile mixed model y=Sθ(τ)+u+ε(τ). Fitted regression models estimated with a 9-node Gauss–Hermite quadrature and 95 % confidence bands estimated by using 50 bootstrap replications are shown in Fig. 7 for three quantiles τ∈{0.1,0.5,0.9}. Due to the symmetry of the error term, the estimated median and the (true) mean are very close. However, the spread of the distribution differs across the values of x.
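The generating model (14) and the within-group correlation it implies can be sketched as follows:

```python
import numpy as np

def simulate_nonlinear(n=10, M=500, seed=2):
    """Generate clustered data from model (14):
    y_ij = exp{5 x_ij / (1 + 2 x_ij^2)} + u_i + x_ij e_ij."""
    rng = np.random.default_rng(seed)
    i = np.repeat(np.arange(M), n)             # group index
    x = rng.uniform(0, 1, n * M)
    u = rng.normal(0, 0.5, M)                  # Var(u) = 0.25
    e = rng.normal(0, 1, n * M)
    y = np.exp(5 * x / (1 + 2 * x**2)) + u[i] + x * e
    return y, x, i

def within_cor(x1, x2):
    """cor(y_ij, y_ij') = 0.25 / sqrt((0.25 + x1^2)(0.25 + x2^2)), which spans
    the range 0.2 to 1 as x varies over (0, 1)."""
    return 0.25 / np.sqrt((0.25 + x1**2) * (0.25 + x2**2))
```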
Fig. 7

Smoothed regression quantiles (τ∈{0.1,0.5,0.9}) with data points (a) and 95 % confidence bands (b). The true mean function is represented by a dashed line

5.5 Spatial modeling

Spatial modeling refers, loosely, to the case in which geographical information is introduced in the model. This includes the case in which x is a vector of geographical coordinates or, more simply, when the grouping factor is some geographical unit (e.g., postal codes, wards or counties). In the latter case, each random term u of the LQMM would be associated with a small-area effect and the resulting variance-covariance matrix would be interpreted as a spatial correlation matrix at the quantile of interest. The modeling of Ψ, therefore, should take into account the spatial association between areas (e.g., contiguity). See for example Lee and Neocleous (2010) for an application of Bayesian quantile regression (Yu and Moyeed 2001) to environmental epidemiology using the results of Machado and Santos Silva (2005). A Bayesian spatial quantile regression approach using the asymmetric Laplace is proposed also by Lum and Gelfand (2012).

Quantile smoothing of surfaces and related inference is a recent topic (He et al. 1998; He and Portnoy 2000; Koenker and Mizera 2004). An application of triogram smoothing splines to poverty mapping is described by Geraci and Salvati (2007). A Bayesian quantile modeling of ozone concentration surfaces is given by Reich et al. (2010b).

Suppose we want to estimate the quantiles of an outcome under the following nonparametric model:
$$ y_{i} = f(x_{i}) + \varepsilon _{i}, $$
where \(x'_{i}\) is a vector of geographical coordinates. We could introduce a bivariate smoother for f as, for example, a generalized covariance function with quadratic penalty and adopt an approach as described in Ruppert et al. (2003), or a triogram smoothing spline model (Koenker and Mizera 2004) with L1 penalized coefficients. Unlike the classical L2 penalty used for estimating nonparametric mean functions, L1 penalization (Koenker et al. 1994) has a more direct and natural application to quantiles from both a modeling and an estimation standpoint. In both circumstances, however, the random effects would play the role of spline coefficients and the type of quadrature to be used would naturally follow the metric of the associated penalty term.

6 Final remarks

The growing interest in the theory and applications of regression methods for quantiles of outcome variables of research interest is to a large extent a consequence of the insight and ease of interpretation that only quantiles can provide. The present paper extends the work by Geraci and Bottai (2007) to the general linear quantile mixed effects regression model; discusses relevant aspects of estimation, testing, and computation in detail; illustrates the use of the linear quantile mixed effects regression in two real data examples; shows some of the results of an extensive simulation study; and indicates possible avenues of future research. The linear quantile mixed effects regression presented in this paper constitutes a practical statistical tool for the evaluation of quantiles with dependent data that complements the widely popular mixed effects regression for the mean and may help to further understanding in many fields of applied research.

Footnotes
1

For this scenario, we increased the number of quadrature nodes K from 11 to 17. The relative bias decreased from 0.135 to 0.051 (τ=0.75) and from 0.459 to 0.075 (τ=0.9).

 
2

All computations were performed on a 64-bit operating system machine with 16 GB of RAM and a quad-core processor at 2.93 GHz.

 

Acknowledgements

The Centre for Paediatric Epidemiology and Biostatistics benefits from funding support from the Medical Research Council in its capacity as the MRC Centre of Epidemiology for Child Health (G0400546). The UCL Institute of Child Health receives a proportion of funding from the Department of Health’s NIHR Biomedical Research Centres funding scheme.

Copyright information

© Springer Science+Business Media New York 2013

Authors and Affiliations

  1. Centre for Paediatric Epidemiology and Biostatistics, MRC Centre of Epidemiology for Child Health, Institute of Child Health, University College London, London, UK
  2. Unit of Biostatistics, Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden