1 Introduction

Conventional regression analysis, based on the ordinary least squares (OLS) framework, plays a vital role in exploring relationships among variables. It is, however, well known that the OLS estimator is not robust to deviations from normality of the response, such as heavy-tailed distributions or outliers. As a robust alternative, quantile regression (QR), introduced by Koenker and Bassett (1978), has become a popular paradigm for describing the conditional distribution more completely. One caveat is that QR based on a single quantile can have arbitrarily small relative efficiency compared to the OLS estimator. Further, quantile regression at one quantile may provide more efficient estimates than at another. Zou and Yuan (2008) proposed simultaneous estimation over multiple quantiles, with equal weights assigned to each quantile considered, to abate the issues of single-quantile regression. Zhao and Xiao (2014) extended composite quantile regression to unequal weights for each quantile. A Bayesian estimation procedure for the composite model with unequal weights was outlined by Huang and Chen (2015).

A different set of issues with QR arises from the perspective of axiomatic theory. Artzner et al. (1999) provide a foundation for coherent risk measures, where it is shown that quantiles are not coherent as they do not satisfy the criterion of subadditivity; see, e.g., Bellini et al. (2014) for more details. Generalized quantiles represent a separate extension of QR, which addresses these issues with quantiles as a risk measure, and is based on replacing the asymmetric linear loss function of quantiles with more general loss functions. Newey and Powell (1987) introduced expectiles as minimizers of an asymmetric squared loss. Chen (1996) introduced \(L^p\)-quantiles as minimizers of an asymmetric power function, which includes both the expectile and quantile loss functions as special cases. As \(L^p\)-quantiles are coherent risk measures, they have gained much recent popularity in actuarial applications and extreme value analysis (Bellini et al. 2014; Usseglio-Carleve 2018; Daouia et al. 2019; Konen and Paindaveine 2022). Apart from coherency, the use of \(L^p\)-quantiles is mainly motivated by their flexibility, bridging the robustness of quantiles and the sensitivity of expectiles. However, the \(L^p\)-quantile approach is not without drawbacks, mainly that \(L^p\)-quantiles do not have an interpretation as direct as that of ordinary quantiles. For a general discussion relating the interpretation of \(L^p\)-quantiles to ordinary quantiles, see Jones (1994).

This paper aims to extend \(L^p\)-quantile regression to composite \(L^p\)-quantile regression from the Bayesian perspective, emulating the extension for ordinary quantiles by Huang and Chen (2015). Bayesian single \(L^p\)-quantile regression, based on the skewed exponential power distribution (SEPD) (Komunjer 2007; Zhu and Zinde-Walsh 2009), has been considered by Bernardi et al. (2018) and Arnroth and Vegelius (2023). However, maximization of the likelihood of the SEPD corresponds to minimization of a transformation of the \(L^p\)-quantile loss function. Therefore, a novel parametrization of the SEPD, based directly on the loss function of \(L^p\)-quantiles, is introduced. Compared to the single \(L^p\)-quantile setting, the composite extension significantly alleviates the interpretational issues, as the parameters of interest in Bayesian composite \(L^p\)-quantile regression (BCLQR) are the same as those of Bayesian composite quantile regression (BCQR). Hence, results based on \(L^1\)-quantiles and \(L^p\)-quantiles are directly comparable. Furthermore, following Huang and Chen (2015), variable selection is simultaneously addressed through a Laplace prior on the regression coefficients, which translates into an \(L^1\) penalty on the regression coefficients, i.e., the lasso (Tibshirani 1996).

The article is organized as follows. In Sect. 2, the BCLQR method with a lasso penalty on the regression coefficients is introduced. Section 3 presents numerical results for both simulated and empirical data. Finally, discussion and conclusions are given in Sect. 4.

2 Bayesian composite \(L^p\)-quantile regression

Consider the following linear model

$$\begin{aligned} y = b_0 + \varvec{{x}}^T\varvec{{\beta }} + \varepsilon , \end{aligned}$$
(1)

where \(\varvec{{x}} \in \mathbbm {R}^m\) is the m-dimensional covariate, \(\varvec{{\beta }} \in \mathbbm {R}^m\) is the m-dimensional vector of unknown parameters, \(b_0 \in \mathbbm {R}\) is the intercept, \(y \in \mathbbm {R}\) is the response, and \(\varepsilon \) is the noise. The conditional \(\tau \)th \(L^p\)-quantile of \(y|\varvec{{x}}\) is

$$\begin{aligned} b_0 + \varvec{{x}}^T\varvec{{\beta }} + q_\tau = b_\tau + \varvec{{x}}^T\varvec{{\beta }}, \end{aligned}$$

where \(q_\tau \) is the \(\tau \)th \(L^p\)-quantile of \(\varepsilon \) for \(\tau \in (0,1)\), independent of \(\varvec{{x}}\). For a random variable z with cumulative distribution function F, the \(\tau \)th \(L^p\)-quantile is defined as (Chen 1996)

$$\begin{aligned} q_\tau = \arg \min _q \int \rho _{\tau ,p}(z-q) dF(z), \end{aligned}$$
(2)

where \(\rho _{\tau , p}(\cdot )\) is an asymmetric power function defined as

$$\begin{aligned} \rho _{\tau , p}(y) = |\tau - I(y \le 0)||y|^p. \end{aligned}$$
(3)
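For illustration, the loss (3) and the defining minimization (2) are straightforward to implement. The following is a minimal Python sketch (the paper's own implementation is in Julia; the function names `rho` and `lp_quantile` are ours), which recovers an empirical \(L^p\)-quantile by numerical minimization:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rho(u, tau, p):
    """Asymmetric power loss of Eq. (3): |tau - 1(u <= 0)| * |u|^p."""
    u = np.asarray(u, dtype=float)
    return np.abs(tau - (u <= 0)) * np.abs(u) ** p

def lp_quantile(z, tau, p):
    """Empirical tau-th L^p-quantile: minimizer of sum_i rho(z_i - q), cf. Eq. (2)."""
    obj = lambda q: rho(z - q, tau, p).sum()
    return minimize_scalar(obj, bounds=(z.min(), z.max()), method="bounded").x

# For p = 2 and tau = 0.5 the minimizer is the 0.5-expectile, i.e. the sample mean.
z = np.random.default_rng(0).normal(size=500)
q = lp_quantile(z, 0.5, 2.0)
```

The final check reflects the special cases noted above: \(p=2\) gives expectiles, whose value at \(\tau = 0.5\) is the mean.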

For a random sample \((y_1, \varvec{{x}}_1), \ldots , (y_n, \varvec{{x}}_n)\), the \(\tau \)th \(L^p\)-quantile regression model estimates \(b_\tau \) and \(\varvec{{\beta }}\) by solving

$$\begin{aligned} ({\hat{b}}_{\tau }, \hat{\varvec{{\beta }}}) = \text {argmin}_{b_{\tau }, \varvec{{\beta }}} \sum _{i=1}^n \rho _{\tau , p}(y_i - b_\tau - \varvec{{x}}_i^T \varvec{{\beta }}). \end{aligned}$$
(4)

Setting \(p = 1\) in (4) recovers the standard quantile regression estimator. To recast minimization of (4) as maximization of a likelihood function, we assume the noise term in (1) follows a density of the form

$$\begin{aligned} f(y; \tau , p) = K_{\tau ,p} \exp \{-|\tau - I(y \le 0)||y|^p\} = K_{\tau ,p}\exp \{-\rho _{\tau ,p}(y)\}, \end{aligned}$$
(5)

where \(p > 0\) and \(K_{\tau ,p}^{-1} = \Gamma (1+1/p)(\tau ^{-1/p}+(1-\tau )^{-1/p})\). The density (5) corresponds to a re-scaling of the skewed exponential power distribution (Komunjer 2007; Zhu and Zinde-Walsh 2009). Furthermore, setting \(p=1\) directly recovers the asymmetric Laplace distribution utilized for Bayesian quantile regression (Kozumi and Kobayashi 2011). For scale \(\sigma > 0\) and location \(\mu \in \mathbbm {R}\), the density is

$$\begin{aligned} p(y; \mu ,\sigma ,\tau ,p) = \sigma ^{-1}f\left( \sigma ^{-1}(y-\mu ); \tau , p \right) . \end{aligned}$$
(6)
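The normalizing constant \(K_{\tau,p}\) is easy to verify numerically. The following illustrative Python sketch (our code, not the paper's Julia implementation) evaluates density (6) and checks that it integrates to one:

```python
import numpy as np
from math import gamma
from scipy.integrate import quad

def sepd_pdf(y, mu, sigma, tau, p):
    """Density (6): sigma^{-1} f((y - mu)/sigma; tau, p), with f as in Eq. (5)."""
    K = 1.0 / (gamma(1.0 + 1.0 / p) * (tau ** (-1.0 / p) + (1.0 - tau) ** (-1.0 / p)))
    u = (y - mu) / sigma
    return (K / sigma) * np.exp(-abs(tau - (u <= 0)) * abs(u) ** p)

# numerical check that the density integrates to one; the kink at y = mu is
# flagged to the quadrature routine via `points`
total, _ = quad(sepd_pdf, -60.0, 60.0, args=(0.0, 1.0, 0.3, 1.5), points=[0.0])
```

For \(p = 1\) and \(\tau = 0.5\) the same function reduces to a symmetric Laplace density, consistent with the remark above.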

The reparametrization \(\eta = \sigma ^p\) shall be considered in the sequel, with a prior placed directly on \(\eta \), which simplifies the subsequent MCMC sampling scheme. Following Huang and Chen (2015) and Zhao and Xiao (2014), composite \(L^p\)-quantile regression is defined as

$$\begin{aligned} ({\hat{b}}_{\tau _1}, \ldots , {\hat{b}}_{\tau _K}, \hat{\varvec{{\beta }}}) = \text {argmin}_{b_{\tau _1}, \ldots , b_{\tau _K}, \varvec{{\beta }}} \sum _{i=1}^n \left\{ \sum _{k=1}^K w_k\rho _{\tau _k, p}(y_i - b_{\tau _{k}} - \varvec{{x}}_i^T \varvec{{\beta }})\right\} , \end{aligned}$$
(7)

where \(0< \tau _1< \ldots< \tau _K < 1\) are quantile levels and \(0< w_k < 1\) is the weight of the kth component, where \(\varvec{{w}}=(w_1, \ldots , w_K)^T\) lies on the \((K-1)\)-dimensional simplex \(\{\varvec{{w}}\in [0,1]^K: \sum _{k=1}^K w_k = 1\}\). Note that, unlike fitting K independent quantile models, (7) assumes the same \(\varvec{{\beta }}\) across quantiles. By fixing \(p = 1\) in (7), the estimation procedure of Huang and Chen (2015) is retained.
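For concreteness, the composite objective in (7) can be written as a short Python function (illustrative; the name `composite_loss` is ours):

```python
import numpy as np

def composite_loss(y, X, b, beta, taus, w, p):
    """Composite L^p objective of Eq. (7):
    sum_i sum_k w_k * rho_{tau_k, p}(y_i - b_{tau_k} - x_i' beta)."""
    total = 0.0
    for tau_k, b_k, w_k in zip(taus, b, w):
        r = y - b_k - X @ beta          # residuals under the k-th intercept
        total += w_k * np.sum(np.abs(tau_k - (r <= 0.0)) * np.abs(r) ** p)
    return total
```

With a single component and \(w_1 = 1\), this reduces to the single \(L^p\)-quantile objective (4).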

The joint distribution of \(\varvec{{y}} = (y_1, \ldots , y_n)^T\) given \(\varvec{{X}} = \big (\varvec{{x}}_1^T, \ldots , \varvec{{x}}_n^T\big )^T\) for a composite model is

$$\begin{aligned} p(\varvec{{y}}|\varvec{{X}}, \varvec{{\beta }}, \varvec{{b}}, \varvec{{w}}, p, \eta ) = \prod _{i=1}^n \left( \sum _{k=1}^K w_k p_{\tau _k}(y_i | \varvec{{x}}_i, b_{\tau _{k}}, \varvec{{\beta }}, \eta , p)\right) , \end{aligned}$$
(8)

where \(\varvec{{b}} = (b_{\tau _1}, \ldots , b_{\tau _K})^T\) and

$$\begin{aligned} p_{\tau _k}(y_i |\varvec{{x}}_i, b_{\tau _{k}}, \varvec{{\beta }}, \eta , p) = \frac{K_{\tau _k, p}}{\eta ^{1/p}} \exp \left\{ -\frac{1}{\eta }\rho _{\tau _k, p}\left( y_i - \varvec{{x}}_i^T\varvec{{\beta }} - b_{\tau _{k}}\right) \right\} . \end{aligned}$$

To facilitate inference based on (8), we introduce a cluster-assignment matrix C whose ikth element, \(C_{ik}\), is equal to 1 if the ith subject belongs to the kth cluster, and \(C_{ik} = 0\) otherwise. C is treated as a latent variable and we start from the complete likelihood, which has the form

$$\begin{aligned} p(\varvec{{y}}|\varvec{{X}}, \varvec{{\beta }}, \varvec{{b}}, \varvec{{w}}, \eta , p, \varvec{{\tau }}, \varvec{{C}})&= \prod _{i=1}^n \prod _{k=1}^K \big [w_k p_{\tau _k}(y_i |\varvec{{x}}_i, b_{\tau _{k}}, \varvec{{\beta }}, \eta , p) \big ]^{C_{ik}} \\&= \frac{1}{\eta ^{n/p}\Gamma (1+1/p)^n} \prod _{k=1}^K \left( \frac{w_k}{\tau _k^{-\frac{1}{p}} + (1-\tau _k)^{-\frac{1}{p}}} \right) ^{n_k} \\&\quad \exp \left\{ - \frac{1}{\eta }\sum _{i=1}^n \sum _{k=1}^K C_{ik} \rho _{\tau _k,p}\left( y_i - \varvec{{x}}_i^T\varvec{{\beta }} - b_{\tau _k}\right) \right\} , \end{aligned}$$

where \(n_k = \sum _{i=1}^n C_{ik}\).

To perform variable selection, we consider a Laplace prior for \(\varvec{{\beta }}\) with rate parameter \(\lambda > 0\),

$$\begin{aligned} \pi (\varvec{{\beta }}|\lambda ) = \frac{\lambda ^m}{2^m}\exp \left\{ -\lambda \sum _{j=1}^m |\beta _j| \right\} . \end{aligned}$$

A Dirichlet prior is considered for \(\varvec{{w}}\),

$$\begin{aligned} \pi (\varvec{{w}}) = \text {Dirichlet}(\alpha _1, \ldots , \alpha _K), \end{aligned}$$

with \(\alpha _1 = \ldots = \alpha _K = 0.1\). The priors of \(\eta \) and p are set to

$$\begin{aligned} \pi (\eta ) = \mathcal{I}\mathcal{G}(0.1,0.1) \quad \text {and} \quad \pi (p) = {\mathcal {U}}(0,5), \end{aligned}$$

where \(\mathcal{I}\mathcal{G}(a,b)\) denotes the inverse gamma distribution with shape a and scale b and \({\mathcal {U}}(a,b)\) denotes the continuous uniform distribution on \([a,b]\). Here, the upper bound on p was set to 5, which was sufficiently high not to impact the posterior samples of p in Sect. 3 meaningfully. Also, from a practical perspective, limiting p to moderate values is motivated by the increased flatness around the mode of the SEPD as p increases, which could cause issues for an MCMC procedure. It is worth noting that, from a practitioner's perspective, the most intuitive approach is to restrict the prior on p to a continuous distribution on \([1,2]\), so that the procedure can be interpreted as bridging the robustness of quantiles and the sensitivity of expectiles. We avoid such restrictions here, however, to allow for increased flexibility of the estimation procedure. Additionally, we treat the quantile-specific intercepts as described in Huang and Chen (2015), setting \(\pi (\varvec{{b}}) \propto 1\). Such an improper prior yields a proper posterior for single \(L^p\)-quantile regression (Arnroth and Vegelius 2023). The posterior distribution is thus given by

$$\begin{aligned} \pi (\varvec{{\beta }}, \varvec{{b}}, \varvec{{w}}, \eta , p, \varvec{{C}} | \varvec{{y}}, \varvec{{X}}) \propto p(\varvec{{y}}|&\varvec{{X}}, \varvec{{\beta }}, \varvec{{b}}, \varvec{{w}}, \eta , p, \varvec{{\tau }}, \varvec{{C}}) \pi (\varvec{{\beta }}) \pi (\eta ) \pi (\varvec{{w}})\pi (p). \end{aligned}$$
(9)

An MCMC procedure is utilized to sample from (9). The conditional distribution of \(\varvec{{\beta }}\) is

$$\begin{aligned}&\pi (\varvec{{\beta }}| \varvec{{y}}, \varvec{{X}}, \varvec{{\tau }}, \varvec{{b}}, \varvec{{C}}, \varvec{{w}}, \eta , p, \lambda ) \propto \nonumber \\ {}&\quad \exp \left\{ - \frac{1}{\eta }\sum _{i=1}^n \sum _{k=1}^K C_{ik} \rho _{\tau _k,p}\left( y_i - \varvec{{x}}_i^T\varvec{{\beta }} - b_{\tau _k}\right) - \lambda \sum _{j=1}^m |\beta _j| \right\} , \end{aligned}$$
(10)

of which the normalizing constant is unknown. Due to the possibly large dimension of \(\varvec{{\beta }}\), the Metropolis-adjusted Langevin algorithm (MALA) is used to sample from (10) efficiently. Proposals are generated as

$$\begin{aligned} \varvec{{\beta }}' = \varvec{{\beta }}+ \frac{\epsilon _\beta ^2}{2}\varvec{{A}} \nabla _{\varvec{{\beta }}}{\mathcal {L}}(\varvec{{\beta }}) + \epsilon _\beta \sqrt{\varvec{{A}}} \varvec{{Z}}, \end{aligned}$$

where \({\mathcal {L}}(\varvec{{\beta }})\) is the log of (10), \(\varvec{{A}} = (\varvec{{X}}^T\varvec{{X}})^{-1}\), \(\sqrt{\varvec{{A}}}\) is a matrix square root of \(\varvec{{A}}\), and \(\varvec{{Z}} \sim {\mathcal {N}}_m(\varvec{{0}}, \varvec{{I}})\). Note that \(\varvec{{X}}^T\varvec{{X}}\) is used rather than the Hessian due to issues with positive definiteness for some values of p. Ignoring the non-differentiability of \(|\cdot |\) at 0, the gradient is

$$\begin{aligned} \nabla _{\varvec{{\beta }}} {\mathcal {L}}(\varvec{{\beta }})&= \frac{p}{\eta } \sum _{k=1}^K \bigg [(1-\tau _k) \sum _{i\in N_{1,k}}|\varvec{{x}}_{i}^T\varvec{{\beta }}+ b_{\tau _{k}} - y_i|^{p-1}\varvec{{x}}_i\\&\quad - \tau _k \sum _{i\in N_{2,k}} |y_i - \varvec{{x}}_{i}^T\varvec{{\beta }}- b_{\tau _{k}}|^{p-1}\varvec{{x}}_i\bigg ] - \lambda \, \text {sgn}(\varvec{{\beta }}), \end{aligned}$$

where \(N_{1,k} = \{i: y_i < \varvec{{x}}_{i}^T\varvec{{\beta }}+ b_{\tau _{k}} \text { and } C_{ik} = 1\}\) and \(N_{2,k} = \{i: y_i > \varvec{{x}}_{i}^T\varvec{{\beta }}+ b_{\tau _{k}} \text { and } C_{ik} = 1\}\). The acceptance probability is then given by

$$\begin{aligned} \varphi _{\varvec{{\beta }}}(\varvec{{\beta }}', \varvec{{\beta }}) = 1\ \wedge \ \frac{\pi (\varvec{{\beta }}' | \varvec{{y}}, \varvec{{X}},\varvec{{\tau }}, \varvec{{b}}, \varvec{{C}}, \varvec{{w}}, \eta , p, \lambda ) q(\varvec{{\beta }}| \varvec{{\beta }}')}{\pi (\varvec{{\beta }}| \varvec{{y}}, \varvec{{X}},\varvec{{\tau }}, \varvec{{b}}, \varvec{{C}}, \varvec{{w}}, \eta , p, \lambda ) q(\varvec{{\beta }}' | \varvec{{\beta }})}, \end{aligned}$$

where \(q(\varvec{{\beta }} | \varvec{{\beta }}') = {\mathcal {N}}\big (\varvec{{\beta }}| \varvec{{\beta }}' + \epsilon _\beta ^2/2 \varvec{{A}} \nabla _{\varvec{{\beta }}'}{\mathcal {L}}(\varvec{{\beta }}'),\ \epsilon _\beta ^2\varvec{{A}} \big )\) and \(a \wedge b = \min \{a,b\}\).
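The update above can be sketched as a generic preconditioned MALA step in Python (illustrative code; `logpi` and `grad` are placeholder callables for the log of (10) and its subgradient, supplied by the user):

```python
import numpy as np

def mala_step(beta, logpi, grad, A, eps, rng):
    """One preconditioned MALA update with proposal
    N(beta + (eps^2/2) A grad(beta), eps^2 A) and Metropolis-Hastings correction."""
    L = np.linalg.cholesky(A)                       # square root of the preconditioner A
    drift = lambda b: b + 0.5 * eps ** 2 * A @ grad(b)
    prop = drift(beta) + eps * L @ rng.standard_normal(beta.size)

    def logq(b_to, b_from):                         # log N(b_to | drift(b_from), eps^2 A), up to a constant
        d = b_to - drift(b_from)
        return -0.5 * d @ np.linalg.solve(eps ** 2 * A, d)

    log_alpha = logpi(prop) - logpi(beta) + logq(beta, prop) - logq(prop, beta)
    return prop if np.log(rng.uniform()) < log_alpha else beta
```

As a sanity check, targeting a standard bivariate normal with \(\varvec{A} = \varvec{I}\) yields a chain whose sample mean is near zero.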

The conditional distribution for the \(L^p\)-quantile specific intercepts is

$$\begin{aligned}&\pi (\varvec{{b}} | \varvec{{y}}, \varvec{{X}},\varvec{{\tau }}, \varvec{{\beta }}, \varvec{{C}}, \varvec{{w}}, \eta , p) \propto \nonumber \\ {}&\quad \exp \left\{ -\frac{1}{\eta } \sum _{i=1}^n \sum _{k=1}^K C_{ik} \rho _{\tau _k,p}\left( y_i - \varvec{{x}}_i^T\varvec{{\beta }} - b_{\tau _k}\right) \right\} . \end{aligned}$$
(11)

As for \(\varvec{{\beta }}\), MALA is used, based only on the first-order derivative of (11). Proposals are thus generated as

$$\begin{aligned} \varvec{{b}}' = \varvec{{b}}+ \frac{\epsilon _{\varvec{{b}}}^2}{2}\nabla _{\varvec{{b}}}{\mathcal {L}}(\varvec{{b}}) + \epsilon _{\varvec{{b}}} \varvec{{Z}}, \end{aligned}$$

where \(\nabla _{\varvec{{b}}}{\mathcal {L}}(\varvec{{b}})\) has kth element

$$\begin{aligned} \frac{\partial }{\partial b_{\tau _{k}}} {\mathcal {L}}(\varvec{{b}})&= \frac{p}{\eta } \bigg [(1-\tau _k) \sum _{i\in N_{1}}C_{ik}|\varvec{{x}}_{i}^T\varvec{{\beta }}+ b_{\tau _{k}} - y_i|^{p-1} \\&\quad - \tau _k \sum _{i\in N_{2}} C_{ik}|y_i - \varvec{{x}}_{i}^T\varvec{{\beta }}- b_{\tau _{k}}|^{p-1}\bigg ], \end{aligned}$$

where \(N_{1} = \{i: y_i < \varvec{{x}}_{i}^T\varvec{{\beta }}+ b_{\tau _{k}}\}\) and \(N_{2} = \{i: y_i > \varvec{{x}}_{i}^T\varvec{{\beta }}+ b_{\tau _{k}}\}\). The acceptance probability is then given by

$$\begin{aligned} \varphi _{\varvec{{b}}}(\varvec{{b}}', \varvec{{b}}) = 1\ \wedge \ \frac{\pi (\varvec{{b}}' | \varvec{{y}}, \varvec{{X}},\varvec{{\tau }}, \varvec{{\beta }}, \varvec{{C}}, \varvec{{w}}, \eta , p) q(\varvec{{b}}| \varvec{{b}}')}{\pi (\varvec{{b}}| \varvec{{y}}, \varvec{{X}},\varvec{{\tau }}, \varvec{{\beta }}, \varvec{{C}}, \varvec{{w}}, \eta , p) q(\varvec{{b}}' | \varvec{{b}})}, \end{aligned}$$

where \(q(\varvec{{b}}| \varvec{{b}}') = {\mathcal {N}}\big (\varvec{{b}}| \varvec{{b}}' + \epsilon _{\varvec{{b}}}^2/2 \nabla _{\varvec{{b}}'}{\mathcal {L}}(\varvec{{b}}'),\ \epsilon _{\varvec{{b}}}^2\varvec{{I}} \big )\).

The conditional distribution of \(\eta \), with prior \(\pi (\eta ) = \mathcal{I}\mathcal{G}(a_1, a_2)\), is

$$\begin{aligned}&\pi (\eta | \varvec{{y}}, \varvec{{X}},\varvec{{\tau }}, \varvec{{\beta }}, \varvec{{b}}, \varvec{{C}}, \varvec{{w}}, p) \propto \frac{1}{\eta ^{\frac{n}{p}+a_1+1}} \\ {}&\quad \exp \left\{ - \frac{1}{\eta }\left( \sum _{i=1}^n \sum _{k=1}^K C_{ik} \rho _{\tau _k,p}\left( y_i - \varvec{{x}}_i^T\varvec{{\beta }} - b_{\tau _k}\right) + a_2\right) \right\} . \end{aligned}$$

Hence

$$\begin{aligned}&\eta | \varvec{{y}}, \varvec{{X}},\varvec{{\tau }}, \varvec{{\beta }}, \varvec{{b}}, \varvec{{C}}, \varvec{{w}}, p \sim \\ {}&\quad \mathcal{I}\mathcal{G}\left( \frac{n}{p}+a_1, \sum _{i=1}^n \sum _{k=1}^K C_{ik} \rho _{\tau _k,p}\left( y_i - \varvec{{x}}_i^T\varvec{{\beta }} - b_{\tau _k}\right) +a_2\right) . \end{aligned}$$
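Since the full conditional is inverse gamma, \(\eta\) admits a direct Gibbs draw. A Python sketch (illustrative; the helper name is ours), using the shape and scale implied by combining the \(\eta^{-n/p}\) likelihood factor with the \(\mathcal{IG}(a_1, a_2)\) prior, and the fact that if \(G \sim \text{Gamma}(a, 1)\) then \(b/G \sim \mathcal{IG}(a, b)\):

```python
import numpy as np

def draw_eta(residual_loss, n, p, a1=0.1, a2=0.1, rng=None):
    """Gibbs draw of eta, where residual_loss is
    sum_{i,k} C_ik * rho_{tau_k,p}(y_i - x_i'beta - b_{tau_k})."""
    rng = rng or np.random.default_rng()
    shape = n / p + a1
    scale = residual_loss + a2
    return scale / rng.gamma(shape)   # b / Gamma(a, 1) is an IG(a, b) draw
```

A quick Monte Carlo check: for shape 10 and scale 9 the inverse gamma mean is \(9/(10-1) = 1\).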

The conditional distribution of p is given by

$$\begin{aligned} \pi (p|&\varvec{{y}}, \varvec{{X}},\varvec{{\tau }}, \varvec{{\beta }}, \varvec{{b}}, \varvec{{C}}, \varvec{{w}}, \eta ) \nonumber \\&\propto \frac{1}{\Gamma (1+1/p)^n} \prod _{k=1}^K \left( \frac{1}{\tau _k^{-\frac{1}{p}} + (1-\tau _k)^{-\frac{1}{p}}} \right) ^{n_k} \nonumber \\ {}&\quad \exp \left\{ - \frac{1}{\eta }\sum _{i=1}^n \sum _{k=1}^K C_{ik} \rho _{\tau _k,p}\left( y_i - \varvec{{x}}_i^T\varvec{{\beta }} - b_{\tau _k}\right) \right\} I_{(0,5]}(p), \end{aligned}$$
(12)

where \(I_{\mathcal {X}}(x)\) is the indicator function, defined as 1 if \(x \in {\mathcal {X}}\) and 0 otherwise. The normalizing constant of (12) is unknown, so p is sampled via Metropolis-Hastings with proposals generated as \(p' \sim {\mathcal {N}}_{(0,5]}(p, \epsilon _p^2)\), where \({\mathcal {N}}_{{\mathcal {A}}}(\mu ,\sigma ^2)\) denotes the normal distribution with location \(\mu \) and variance \(\sigma ^2\) truncated to the set \({\mathcal {A}}\). The corresponding acceptance probability is

$$\begin{aligned} \varphi _p(p',p) = 1\ \wedge \frac{\pi (p'| \varvec{{y}}, \varvec{{X}},\varvec{{\tau }}, \varvec{{\beta }}, \varvec{{b}}, \varvec{{C}}, \varvec{{w}}, \eta )}{\pi (p|\varvec{{y}}, \varvec{{X}},\varvec{{\tau }}, \varvec{{\beta }}, \varvec{{b}}, \varvec{{C}}, \varvec{{w}}, \eta )}\frac{\Phi \big (\frac{5-p}{\epsilon _p}\big ) - \Phi \big (\frac{-p}{\epsilon _p}\big )}{\Phi \big (\frac{5-p'}{\epsilon _p}\big ) - \Phi \big (\frac{-p'}{\epsilon _p}\big )}, \end{aligned}$$

where \(\Phi (\cdot )\) denotes the standard normal cumulative distribution function.
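The p-update can be sketched as follows in Python (illustrative; `log_cond` is a placeholder for the log of (12) up to a constant). The two \(\Phi\) terms implement the truncation correction of the acceptance ratio above:

```python
import numpy as np
from scipy.stats import norm

def draw_p(p_cur, log_cond, eps_p, rng, upper=5.0):
    """Metropolis-Hastings update of p with an N(p, eps_p^2) proposal
    truncated to (0, upper]."""
    # inverse-cdf draw from the truncated normal proposal
    lo = norm.cdf((0.0 - p_cur) / eps_p)
    hi = norm.cdf((upper - p_cur) / eps_p)
    p_prop = p_cur + eps_p * norm.ppf(lo + rng.uniform() * (hi - lo))
    # truncation correction: ratio of the Phi normalizers of the two proposals
    log_corr = (np.log(norm.cdf((upper - p_cur) / eps_p) - norm.cdf(-p_cur / eps_p))
                - np.log(norm.cdf((upper - p_prop) / eps_p) - norm.cdf(-p_prop / eps_p)))
    log_alpha = log_cond(p_prop) - log_cond(p_cur) + log_corr
    return p_prop if np.log(rng.uniform()) < log_alpha else p_cur
```

With a flat `log_cond`, the chain's stationary distribution is uniform on (0, 5], which provides a simple correctness check of the correction.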

The conditional distribution of the component weights is

$$\begin{aligned}&p(\varvec{{w}} | \varvec{{y}}, \varvec{{X}},\varvec{{\tau }}, \varvec{{\beta }}, \varvec{{b}}, \varvec{{C}}, p, \eta ) \propto \\ {}&\quad \prod _{k=1}^K w_k^{n_k + \alpha _k - 1} \propto \text {Dirichlet}(n_1 + \alpha _1, \ldots , n_K + \alpha _K). \end{aligned}$$
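The weight update is therefore a standard conjugate Dirichlet draw; in Python (illustrative helper, names ours):

```python
import numpy as np

def draw_weights(counts, alphas, rng):
    """Gibbs draw of w from Dirichlet(n_1 + alpha_1, ..., n_K + alpha_K),
    where counts[k] = n_k is the number of subjects assigned to cluster k."""
    return rng.dirichlet(np.asarray(counts, dtype=float) + np.asarray(alphas, dtype=float))
```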

The conditional distribution of \(\varvec{{C}}_i = (C_{i1}, \ldots , C_{iK})^T\) is a multinomial distribution

$$\begin{aligned}&p(\varvec{{C}}_i | y_i, \varvec{{x}}_i,\varvec{{\tau }}, \varvec{{\beta }}, \varvec{{b}}, p, \eta ) \nonumber \\&\quad \propto \prod _{k=1}^K \left[ \frac{w_k}{\tau _k^{-\frac{1}{p}} + (1-\tau _k)^{-\frac{1}{p}}} \exp \big \{-\eta ^{-1}\rho _{\tau _k,p}\left( y_i - \varvec{{x}}_i^T\varvec{{\beta }} - b_{\tau _k}\right) \big \} \right] ^{C_{ik}} \\&\propto \text {Multinomial}(1, {\hat{p}}_1,\ldots ,{\hat{p}}_K), \end{aligned}$$

where

$$\begin{aligned} {\hat{p}}_k = \frac{w_k\big [\tau _k^{-1/p} + (1-\tau _k)^{-1/p}\big ]^{-1}\exp \{- \rho _{\tau _k,p}(y_i - b_{\tau _{k}} - \varvec{{x}}_{i}^T\varvec{{\beta }}) / \eta \}}{\sum _{l=1}^K w_l\big [\tau _l^{-1/p} + (1-\tau _l)^{-1/p}\big ]^{-1}\exp \{- \rho _{\tau _l,p}(y_i - b_{\tau _{l}} - \varvec{{x}}_{i}^T\varvec{{\beta }}) / \eta \}}. \end{aligned}$$
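The assignment probabilities can be computed directly from the component weights \(w_k K_{\tau_k, p} \exp\{-\rho_{\tau_k,p}(\cdot)/\eta\}\) implied by the mixture (8); the \(\Gamma(1+1/p)\) factor of \(K_{\tau_k,p}\) cancels on normalization. A Python sketch (function name ours):

```python
import numpy as np

def assignment_probs(y_i, x_i, b, beta, taus, w, eta, p):
    """Posterior probabilities that observation i belongs to each component k."""
    taus = np.asarray(taus)
    r = y_i - x_i @ beta - np.asarray(b)                 # residual under each intercept
    loss = np.abs(taus - (r <= 0.0)) * np.abs(r) ** p    # rho_{tau_k, p}
    kern = np.asarray(w) / (taus ** (-1.0 / p) + (1.0 - taus) ** (-1.0 / p))
    probs = kern * np.exp(-loss / eta)
    return probs / probs.sum()

# C_i is then a single draw from the resulting categorical distribution,
# e.g. rng.choice(len(w), p=assignment_probs(...))
```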

3 Numerical studies

3.1 Simulation

In this section, Monte Carlo simulations are performed to compare the performance of Bayesian regularized composite \(L^p\)-quantile regression and Bayesian regularized composite quantile regression. Julia (Bezanson et al. 2017) was used to produce the results. Note that priors for BCQR are set as in Huang and Chen (2015). We consider the linear model where data are generated from

$$\begin{aligned} y_i = \varvec{{x}}_{i}^T\varvec{{\beta }}+ \epsilon _i,\quad i = 1,\ldots ,n, \end{aligned}$$
(13)

where \(\varvec{{x}}_i\) is sampled from \({\mathcal {N}}(\varvec{{0}}, \varvec{{I}})\) and multiple error distributions are considered for \(\epsilon _i\). The error distributions have been chosen based on those considered in Huang and Chen (2015), and all results are stable over parameter choices other than those reported. In the sequel, MN denotes the mixture of normal distributions, \(0.5{\mathcal {N}}(-2, 1) + 0.5 {\mathcal {N}}(2,1)\), ML denotes the mixture of Laplace distributions, \(0.5\text {Lap}(-2, 1) + 0.5 \text {Lap}(2,1)\), and the dimension of \(\varvec{{\beta }}\) is denoted by m. Two settings are considered. The first is the dense case with \(n = 200\) and \(m = 8\), where \(\varvec{{\beta }} = \varvec{{1}}\). The second is the sparse case with \(n = 100\) and \(m = 20\), where \((\beta _1, \beta _2, \beta _5)^T = (0.5, 1.5, 0.2)^T\), \(\beta _j\) denoting the jth position of \(\varvec{{\beta }}\), and the remaining coefficients are set to 0. The root mean square error (RMSE), \(\text {RMSE}({\hat{\varvec{{\beta }}}}) = E (||{\hat{\varvec{{\beta }}}} - \varvec{{\beta }}||)\), is used to compare the methods, where \({\hat{\varvec{{\beta }}}}\) is taken as the mean of the posterior sample. For each simulated dataset, the first 3000 sweeps of the chains are discarded as burn-in. Then, an additional 10,000 sweeps are performed, with every 5th sweep kept to reduce the serial correlation of the chains. As in Huang and Chen (2015), the number of components is fixed to \(K = 9\) with \(\tau _k = k/(K+1)\) for \(k = 1, \ldots , 9\). Other values of K have been considered; however, results are not sensitive to this choice, so results over multiple K are not displayed. The simulations are repeated 1000 times.
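As an illustration, the MN configuration of model (13) can be generated as follows (illustrative Python helper, names ours; the paper's simulations were run in Julia):

```python
import numpy as np

def simulate_mn(n, m, beta, rng):
    """Data from model (13): x_i ~ N(0, I) with mixture-of-normals errors
    0.5 N(-2, 1) + 0.5 N(2, 1) (the MN setting)."""
    X = rng.standard_normal((n, m))
    comp = rng.integers(0, 2, size=n)                     # mixture component label
    eps = rng.normal(np.where(comp == 0, -2.0, 2.0), 1.0)
    y = X @ beta + eps
    return X, y
```

The mixture error has mean 0 and variance \(1 + 4 = 5\), which is easy to verify on a large sample.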

Results in terms of RMSE for both settings are found in Table 1. The proposed method outperforms BCQR for all distributions except the mixture of Laplace distributions in setting 1, where the results are very similar. As expected, the results are close for the Laplace error distribution, reflecting that BCQR is the special case of BCLQR with \(p = 1\).

Table 1 Summary table of RMSE over 1000 replications of simulated data, standard errors in parenthesis

For the sparse case, the variable selection properties of BCQR and BCLQR are compared. Denote the number of correctly classified non-zero coefficients, true positives, by TP and the number of incorrectly classified zero coefficients, false positives, by FP. A coefficient is classified as non-zero if the \(95\%\) highest posterior density interval does not cover zero; otherwise, it is classified as zero. Note that, from the specification of \(\varvec{{\beta }}\), TP is at most 3. We also denote a correctly classified 0 as a true negative (TN), and define overall accuracy (OA) as \((TN + TP)/20\). Table 2 shows that the methods perform similarly, in terms of TP and FP, except for the mixture Laplace case, where BCQR performs better than BCLQR. Further, for OA, the proposed method outperforms BCQR for all error distributions.
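The classification rule can be made concrete in a few lines of Python (illustrative; an equal-tailed credible interval is used here as a simple stand-in for the highest posterior density interval of the paper):

```python
import numpy as np

def selection_metrics(samples, true_beta, level=0.95):
    """TP, FP and overall accuracy when a coefficient is declared non-zero
    whenever its central `level` posterior interval excludes zero.
    `samples` has one row per posterior draw, one column per coefficient."""
    a = (1.0 - level) / 2.0
    lo = np.quantile(samples, a, axis=0)
    hi = np.quantile(samples, 1.0 - a, axis=0)
    selected = (lo > 0.0) | (hi < 0.0)
    nonzero = np.asarray(true_beta) != 0.0
    tp = int(np.sum(selected & nonzero))
    fp = int(np.sum(selected & ~nonzero))
    tn = int(np.sum(~selected & ~nonzero))
    return tp, fp, (tp + tn) / len(true_beta)
```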

Table 2 Summary of variable selection by comparison of true and false positives and overall accuracy for the sparse setting with mean over 1000 replications of simulated data reported, standard errors in parenthesis

3.2 Empirical data

In this section, the proposed method is applied to empirical data and compared with BCQR. As in Huang and Chen (2015), 10-fold cross-validation is used to evaluate the performance of the two methods, with accuracy measured by the mean absolute prediction error (MAPE) and corresponding standard deviation evaluated on the test data. As in the simulation study, the number of components is fixed to \(K = 9\) with \(\tau _k = k/(K+1)\) for \(k = 1, \ldots , 9\).

3.2.1 Boston housing data

The Boston housing data were first analyzed by Harrison Jr and Rubinfeld (1978) and have been used extensively in the context of Bayesian quantile regression, see, e.g., Huang and Chen (2015); Kozumi and Kobayashi (2011); Li et al. (2010). The data consist of 506 observations and were obtained from the MASS (Venables and Ripley 2002) package in R (R Core Team 2021). The relationship between the log-transformed median value of owner-occupied housing (in 1000 USD) and the remaining 13 variables is considered; for details on the variables, see Venables and Ripley (2002). We draw 30,000 samples from the posterior with the first 5000 discarded as the transient phase. The posterior sample is thinned down to every 5th sample.

The MAPE of BCLQR is 0.137, with a standard deviation of 0.032. Similarly, the MAPE of BCQR is 0.139, with a standard deviation of 0.031. Thus, the proposed method performs slightly better. In Fig. 1, the estimators are compared for the complete data, where differences are generally negligible; however, the parameter Crim stands out in terms of interval width and MAP estimate. The intervals of Fig. 1 show that the procedures give the same result in terms of variable selection. In Table 3, MAP estimates and corresponding standard deviations for six regression coefficients are presented. Table 3 also includes the effective sample size (ESS), which shows that the efficiency in exploring the posterior is quite different for the two MCMC sampling schemes, with BCLQR generally more efficient. For more details on ESS, see, e.g., Section 11.5 of Gelman et al. (2014).

Fig. 1

MAP estimates and 95% credible sets for all parameters based on BCLQR and BCQR for the entire Boston housing data

Table 3 ESS of the MCMC procedures and estimated regression coefficients for the entire Boston housing data, with standard error in parenthesis

In Fig. 2, the results for quantile-specific intercepts and mixture weights are displayed. The main difference between BCLQR and BCQR is that the weights of the former concentrate to a much larger degree than those of the latter, as seen in Fig. 2b. The concentration explains the larger degree of oscillation over quantiles of the BCLQR intercepts, relative to those of BCQR, as seen in Fig. 2a.

Fig. 2

Comparison over quantiles \(\tau \in \{0.1, 0.2, \ldots , 0.9\}\) for estimation on entire Boston data

3.2.2 Body measurement data

In this section, the body measurement data, first analysed by Heinz et al. (2003), is considered. The data consist of 507 observations and were downloaded from the Brq package in R (R Core Team 2021). The relationship between body weight and 24 other variables is considered. See Heinz et al. (2003) for details on the data and descriptions of the independent variables. We draw 30,000 samples from the posterior with the first 5000 discarded as the transient phase. The posterior sample is thinned down to every 5th sample.

The MAPE of BCLQR is 1.577, with a standard deviation of 0.221. The MAPE of BCQR is 1.680, with a standard deviation of 0.242. Thus, the proposed method performs better than BCQR. The result from the estimation on the whole sample is shown in Fig. 3, with 95% credible sets. In terms of variable selection, the procedures differ on the variables ShouldGi, KneeSk and Gender. The interval of BCLQR for ShouldGi covers 0 whilst that of BCQR does not, and vice versa for KneeSk and Gender.

MAP estimates, standard deviations and ESS for all regression coefficients are presented in Table 4. The difference in ESS for the posterior samples of \(\varvec{{\beta }}\) between the methods is much greater for the body measurement data than for the Boston housing data. This is most likely due to the joint sampling of \(\varvec{{\beta }}\) in our MCMC procedure, compared to each element of \(\varvec{{\beta }}\) being sampled individually in BCQR (Huang and Chen 2015).

Fig. 3

Point estimates and 95% credible sets for all parameters based on BCLQR and BCQR for the entire Body measurement data

Fig. 4

Comparison over quantiles \(\tau \in \{0.1, 0.2, \ldots , 0.9\}\) for entire Body measurement data

Table 4 ESS of the MCMC procedures and estimated regression coefficients for the entire body measurement data, with standard error in parenthesis

In Fig. 4, the results for quantile specific intercepts and mixture weights are displayed where patterns similar to those noted in Fig. 2 can be seen.

4 Conclusions

In this paper, we present an efficient Bayesian method that extends the approach of Huang and Chen (2015) to combine multiple \(L^p\)-quantile regressions for both inference and variable selection. A simulation study demonstrates that the proposed method performs as well as, or better than, Bayesian composite quantile regression (BCQR). BCQR outperforms the proposed method only in terms of true positives for variable selection accuracy for some specific error distributions. However, when considering true negatives as well, the proposed method surpasses BCQR across all error distributions. This performance improvement holds for both sparse and dense settings, underscoring the versatility of \(L^p\)-quantiles, which, in contrast to ordinary quantiles, encompass quantiles and expectiles as special cases through the parameter p.

Comparisons using empirical data reveal that the proposed method outperforms BCQR in terms of prediction error during 10-fold cross-validation. Furthermore, the overall effective sample size of the regression coefficients in Bayesian composite \(L^p\)-quantile regression (BCLQR) is higher for the parameters of interest in practical applications, indicating more efficient exploration of the posterior distribution.

For future research, it may be valuable to relax the assumption of a fixed number of quantiles in the composite model. Additionally, treating discretized quantiles, \(\{\tau _k\}_{k=1}^K\), as continuous variables and considering a Dirichlet process as a prior on the component weights could be explored. Extending BCQR to infinite mixtures remains an open and intriguing possibility. Another promising avenue is to study the specific case of BCLQR with a fixed \(p = 2\), essentially composite expectile regression.