1 Introduction

Enhancements in data collection technologies have highlighted the importance of modern variable selection techniques. Traditional methods, such as best subset selection, are suboptimal and computationally expensive when the number of variables is large (Fan and Lv 2010). Modern approaches make use of penalization methods to execute simultaneous model selection and estimation. A popular method is the least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), which employs an \(L_{1}\) penalty but leads to biased estimates. In contrast, the adaptive LASSO (Zou 2006) uses adaptive weights to reduce the bias present in the LASSO estimates. These methods have been developed primarily in the context of normal linear regression and have been extended to generalized linear models (GLMs) (Tibshirani 1996; Friedman et al. 2010) and Cox’s proportional hazards models for survival data (Tibshirani 1997). In these classical models, covariates enter through the location parameter (or the hazard scale in the Cox model case). A more modern and flexible approach is to include covariates in multiple distributional parameters, such as the location and dispersion, simultaneously; this approach is known as “distributional regression” (Stasinopoulos et al. 2018) and “multiparameter regression” (MPR) (Burke et al. 2020). The goal of this paper is to expand penalized regression to the flexible MPR setting using a novel differentiable \(L_{0}\) penalty that does not require tuning parameter selection, which is especially appealing in the MPR setting where one would typically require multiple tuning parameters (one for each distributional parameter).

Originally, methods such as quadratic programming were used to solve these non-differentiable LASSO-type problems, but Efron et al. (2004) and Friedman et al. (2007), respectively, proposed the least angle regression (LARS) and co-ordinate descent algorithms, with the latter proving to be particularly fast for problems of this type. These are somewhat “non-standard” estimation procedures in the context of classical statistical estimation, where non-differentiable objective functions are relatively less common. As an alternative, perturbing the penalty function slightly to render it differentiable (Hunter and Li 2005; Lloyd-Jones et al. 2018) enables standard optimization methods to be used. Oelker and Tutz (2017) outline a series of approximations of different penalties, which yield smooth penalized objective functions. These differentiable penalties can be easily implemented and solved using standard gradient based optimization procedures, e.g., Newton-Raphson. The tuning parameter that controls the strength of the penalty is typically obtained by minimizing the cross-validation error or an information criterion (IC), such as the Akaike IC (AIC) (Akaike 1974) or Bayesian IC (BIC) (Schwarz 1978). This is a two-step estimation process, which tends to be computationally intensive as it involves fitting an array of different models and selecting the best one.

Su (2015) and Su et al. (2018) present an estimation procedure that is not based on the \(L_{1}\) norm, titled “MIC” (minimum approximated information criterion). They exploit the close connection between model selection criteria and penalization (Fan and Lv 2010) and introduce an approximated information criterion in order to avoid the classic two-step estimation process. At its core, the MIC utilizes an approximation of the “\(L_{0}\) norm” with a continuous unit dent function. The \(L_0\) norm is discrete in nature, and a penalty function with some degree of smoothness is preferable for optimization purposes. Su (2015) describes a “subtle uprooting” method for variable selection, which involves using a smooth surrogate function for approximating cardinality. This is followed by a second technical step for enhancing sparsity, where the final problem becomes non-differentiable. This approach is extended to GLMs in Su et al. (2018). Fixing the tuning parameter at two for the AIC or \(\log (n)\) for the BIC is computationally advantageous, as it avoids the tuning parameter selection problem. It is not required to compute the whole regularization path of solutions, nor is it necessary to choose the best tuning parameter using cross-validation, as is typically done.

We propose a more straightforward method of approximating the IC function using a smooth approximation of the \(L_{0}\) norm, which can be optimized directly. Instead of performing the reparameterization step as outlined in Su (2015), which renders the problem non-differentiable, we achieve sparsity in a different way. Our approach squeezes the coefficient values to zero by optimizing a sequence of objective functions that get successively closer to the non-differentiable one. Consequently, our proposed “smooth IC” (SIC) function can be optimized directly using standard gradient based optimization techniques. Additionally, we extend this new SIC variable selection procedure for use in the developing area of distributional regression (Stasinopoulos et al. 2018). Our proposed methods are implemented in our publicly available R package “smoothic” (O’Neill and Burke 2021). To date, penalized estimation has been primarily applied in the context of classical regression models, where the covariates are allowed to enter the model through a single parameter (e.g., a location parameter). Other distributional parameters, such as a dispersion parameter, are typically constant. This “single parameter regression” (SPR) does not take into account the possible impact of covariates on the other distributional parameters. Distributional regression, which is also referred to as “multiparameter regression” (MPR), is a more flexible approach where multiple parameters are regressed simultaneously on covariates. For example, covariates can enter the model through the location and dispersion parameters, or scale and shape parameters of the hazard function in the survival context (see Burke and MacKenzie 2017; Burke et al. 2020 and references therein), or indeed in various different distributional parameters as in generalized additive models for location, scale and shape (GAMLSS) (Rigby and Stasinopoulos 2005). Mayr et al. 
(2012) address the problem of variable selection by utilizing classical gradient boosting techniques to fit GAMLSS models. More recently, Groll et al. (2019) suggest implementing a LASSO-type penalization in the GAMLSS framework. This regularization approach to GAMLSS is highly flexible, but it has the added complexity of separate tuning parameters for each regression component. Groll et al. (2019) state that carrying out the computationally demanding grid search for the optimal tuning parameters is a drawback of their method. In our proposed multiparameter regression with smooth IC (MPR-SIC) procedure, this issue is circumvented as the values of both tuning parameters are known in advance.

The model formulation, including the introduction of the “smooth \(L_0\) norm”, the estimation procedure and the optimization algorithm are outlined in Sect. 2. In Sect. 3, the performance of our proposed methods is evaluated in both variable selection and parameter estimation through extensive simulation studies. We consider three real data analyses to demonstrate our proposed methods in Sect. 4. Finally, we close with some concluding remarks in Sect. 5.

2 Model formulation

2.1 Preliminaries

The classic normal linear regression is a single parameter problem that assumes there is constant variance in the errors. The model is

$$\begin{aligned} y_i={x_{i}^{T}\beta }+\varepsilon _i \end{aligned}$$

for \(i = 1, \ldots ,n\), where \(y_i\) is the response value and \({x_i} = (1, x_{1i},\ldots ,x_{pi})^T\) is a vector of covariates for the ith individual over the predictor variables \(j=0,1,\ldots ,p\). Here, \({\beta }=(\beta _{0},\beta _{1},\ldots ,\beta _{p})^{T}\) is the vector of regression coefficients for the location parameter and \(\varepsilon _i \sim \text {N}(0, \sigma ^2)\) under the homogeneity assumption. For the multiparameter regression (MPR) approach, where covariates appear in multiple distributional parameters simultaneously, the single parameter model in (1) is extended to include heterogeneity of error variance:

$$\begin{aligned} \text {Var}(\varepsilon _i)=\sigma _i^2=e^{{x_{i}^T\alpha }}, \end{aligned}$$

where the log-linear form ensures that \(\sigma ^2_i\) remains positive. The vector of regression coefficients for the dispersion parameter is \({\alpha }=(\alpha _{0},\alpha _{1},\ldots ,\alpha _{p})^{T}\). There may be different (possibly overlapping) sets of covariates impacting the location and dispersion, and although we use \(x_i\) for both, a given \(\beta \) or \(\alpha \) coefficient may be set to zero, which removes the covariate from that model component. Because we apply penalized variable selection, the regression coefficients need to be on a similar scale, and therefore we assume that the predictors are scaled to have unit variance.
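To make this formulation concrete, the following sketch simulates data from the heteroscedastic normal model above; the coefficient values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 2

# Design matrix with an intercept column of ones.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# Illustrative coefficient values (assumed for this sketch).
beta = np.array([1.0, 0.5, 0.0])     # location coefficients
alpha = np.array([-0.5, 1.0, 0.0])   # dispersion coefficients

mu = X @ beta                        # location: x_i^T beta
sigma2 = np.exp(X @ alpha)           # variance: exp(x_i^T alpha) > 0
y = rng.normal(mu, np.sqrt(sigma2))  # heteroscedastic normal response
```

Note that the log-linear form guarantees positive variances regardless of the values of \(\alpha\).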

The log-likelihood function for the MPR normal model is

$$\begin{aligned} \ell (\theta ) = -\frac{n}{2}\log (2\pi ) - \frac{1}{2}\sum _{i=1}^{n} x_{i}^{T}\alpha - \frac{1}{2}\sum _{i=1}^{n} e^{-x_{i}^{T}\alpha }\left( y_i - x_{i}^{T}\beta \right) ^2, \end{aligned}$$

where \(\theta =(\beta ^T,\alpha ^T)^T=(\beta _0,\ldots ,\beta _{p},\alpha _0,\ldots ,\alpha _{p})^T\). Our focus is on variable selection in the location and dispersion components, and we note that model selection criteria, such as the AIC and BIC, have a penalized functional form similar to regularization. In the distributional regression framework, an information criterion (IC) can be formulated as

$$\begin{aligned} \textrm{IC}=-2\ell (\theta )+\lambda \big [{||\tilde{\beta }||}_0 + ||\tilde{\alpha }||_0 + 2\big ], \end{aligned}$$

where \(\lambda \) is fixed at \(\lambda =2\) or \(\lambda =\log (n)\) for the AIC and BIC respectively, and \(\tilde{\beta }=(\beta _1,\ldots ,\beta _{p})^T\) and \(\tilde{\alpha }=(\alpha _1,\ldots ,\alpha _{p})^T\), i.e., the coefficient vectors with the intercepts omitted; there is an addition of two in the penalty to take into account the estimation of the intercept terms \(\beta _0\) and \(\alpha _0\). The \(L_{0}\) norm, \(||\theta ||_0 = \textrm{card}(\theta ) = \sum _{j=1}^p I (\theta _{j} \ne 0)\), indicates the cardinality or the number of non-zero elements in \(\theta \). This is not truly a norm since it fails absolute homogeneity: \(||c \theta ||_0 = ||\theta ||_0 \ne |c|\,||\theta ||_0\) whenever \(c \ne 0\) and \(|c| \ne 1\). The AIC is reported to be asymptotically “selection inconsistent” and “loss-efficient” as a variable selection criterion (Shao 1997; Yang 2005; Wang et al. 2009). As a result of its consistency property and superior empirical performance in variable selection, we employ a BIC-type criterion (Wang and Leng 2007) where \(\lambda =\log (n)\).

Using the likelihood in (3) and arranging (4) as an IC-based penalized likelihood results in

$$\begin{aligned} \ell ^\text {IC}(\theta )=\ell (\theta )-\frac{ \log (n)}{2}\big [||\tilde{\beta }||_0+||\tilde{\alpha }||_0+2\big ]. \end{aligned}$$

To enable gradient-based optimization, we define \(||\theta ||_{0, \epsilon } = \sum _{j=1}^{p}\phi _\epsilon (\theta _j)\) as the “smooth \(L_0\) norm” (see Sect. 2.2), and substitute the \(L_0\) norm in (5) with \(||\tilde{\beta }||_{0, \epsilon }\) and \(||\tilde{\alpha }||_{0, \epsilon }\). This results in our proposed approach of MPR with smooth IC (MPR-SIC), which is the maximization of

$$\begin{aligned} \ell ^\text {SIC}(\theta )=\ell (\theta )-\frac{ \log (n)}{2}\big [||\tilde{\beta }||_{0, \epsilon }+ ||\tilde{\alpha }||_{0, \epsilon }+2\big ]. \end{aligned}$$

Therefore, since BIC minimization is intrinsic to this formulation, it obviates the usual need for estimating the model at a range of tuning parameter grid points and then evaluating each of these using an external BIC in a second step. Avoiding this grid search is especially useful in the context of distributional regression. For the more commonly used \(L_1\) norm, there is no direct link to the BIC, in which case one must search for the optimal tuning parameter. Moreover, one would typically have a separate tuning parameter for each distributional parameter to account for differing scales in these parameters, and this multidimensional grid search optimization is quite computationally intensive. In contrast, the BIC penalizes all parameters equally: it is \(\log (n)\) for all non-zero parameters, irrespective of their size or distributional type (e.g., location or dispersion), and it is zero for zero parameters.

2.2 Smooth \(L_0\) norm

Due to the non-differentiability of the \(L_0\) norm, we propose a smooth function to approximate it:

$$\begin{aligned} \phi _\epsilon (x) = \frac{x^2}{x^2 + \epsilon ^2}. \end{aligned}$$

This is differentiable for \(\epsilon >0\) and \(\lim _{\epsilon \rightarrow 0} \phi _\epsilon (x) = ||x ||_0\). Figure 1 demonstrates how \(\phi _\epsilon (x)\) approaches \(||x ||_0\) as \(\epsilon \) decreases. The smallest value shown (\(\epsilon =10^{-5}\)) approximates the \(L_0\) norm very closely, but its sharp transition near \(x=0\) makes optimization unstable. Ultimately, (7) requires a small \(\epsilon \) value to produce shrinkage, but we have found that simply fixing it to a small value from the outset yields poor results due to this near-discontinuous behaviour (see the Supplementary Material). Therefore, to create a more stable problem, we recommend the use of a decreasing sequence of \(\epsilon \) values (described in Sect. 2.4). Interestingly, with a fixed “large” value of \(\epsilon = 1\), this penalty has been referred to as a “weight elimination penalty” in the context of neural networks (Rumelhart et al. 1991). It is noteworthy that Oelker and Tutz (2017) develop a general penalized estimation procedure based on smooth approximations to penalties. Within this framework, they consider an \(L_0\)-norm approximation that is slightly less straightforward than ours, since it is based on a logistic function with two smoothing parameters. Devriendt et al. (2021) provide an alternative estimation procedure to Oelker and Tutz (2017), which is exact rather than approximate, but which does not include the \(L_0\) penalty (albeit they suggest that adapting it to a stochastic algorithm could potentially handle this). Crucially, however, both of these approaches require a grid search to find the optimal tuning parameter, but this is avoided in our work due to the connection to an information criterion established in Sect. 2.1.
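A minimal numerical illustration of this limiting behaviour (the function name `phi` is ours):

```python
import numpy as np

def phi(x, eps):
    """Smooth L0 approximation: x^2 / (x^2 + eps^2)."""
    return x**2 / (x**2 + eps**2)

# As eps decreases, phi approaches the indicator I(x != 0):
# phi is exactly 0 at x = 0 for any eps, and tends to 1 for x != 0.
for eps in (1.0, 1e-1, 1e-3, 1e-5):
    print(eps, phi(0.5, eps))
```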

Note that the first and second derivatives have a simple analytic form and therefore can be used within the gradient based optimization procedure of Sect. 2.3:

$$\begin{aligned} \phi _{\epsilon }^{\prime }(x)= \frac{2x\epsilon ^2}{\left( x^2+\epsilon ^2\right) ^2}, \qquad \phi _{\epsilon }^{\prime \prime }(x)=\frac{2\epsilon ^2\left( \epsilon ^2-3x^2\right) }{\left( x^2+\epsilon ^2\right) ^3}. \end{aligned}$$
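These analytic forms can be checked against central finite differences; the helper names below are ours:

```python
import numpy as np

def phi(x, eps):
    return x**2 / (x**2 + eps**2)

def phi_d1(x, eps):
    # First derivative: 2*x*eps^2 / (x^2 + eps^2)^2
    return 2 * x * eps**2 / (x**2 + eps**2)**2

def phi_d2(x, eps):
    # Second derivative: 2*eps^2*(eps^2 - 3x^2) / (x^2 + eps^2)^3
    return 2 * eps**2 * (eps**2 - 3 * x**2) / (x**2 + eps**2)**3

# Compare against central finite differences at an arbitrary point.
x, eps = 0.7, 0.3
h1, h2 = 1e-6, 1e-4
fd1 = (phi(x + h1, eps) - phi(x - h1, eps)) / (2 * h1)
fd2 = (phi(x + h2, eps) - 2 * phi(x, eps) + phi(x - h2, eps)) / h2**2
```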
Fig. 1 Smooth \(L_0\) norm

2.3 Estimation procedure

We define the penalized estimates as

$$\begin{aligned} \hat{\theta } = \underset{\theta }{\arg \max }\,\big (\ell ^{\text {SIC}}(\theta )\big ), \end{aligned}$$

where \(\ell ^{\text {SIC}}(\theta )\) is given by (6). The first and second derivatives with respect to the parameters are

$$\begin{aligned} \frac{\partial \ell ^{\text {SIC}}}{\partial \beta }&= \frac{\partial \ell }{\partial \beta } - \frac{ \log (n)}{2}\nu _{\beta } = X^T z_{\beta } - \frac{ \log (n)}{2}\nu _{\beta },\nonumber \\ \frac{\partial \ell ^{\text {SIC}}}{\partial \alpha }&= \frac{\partial \ell }{\partial \alpha } - \frac{ \log (n)}{2}\nu _{\alpha } = X^T z_{\alpha } - \frac{ \log (n)}{2}\nu _{\alpha }, \end{aligned}$$

where X is an \(n \times (p+1)\) matrix, whose ith row is \(x_i\), \(z_\beta \) and \(z_\alpha \) are vectors of length n such that \(z_{\beta ,i}=e^{-x_{i}^T\alpha }(y_i-x_{i}^T\beta )\) and \(z_{\alpha ,i}=\big ({e^{{-x_{i}^T\alpha }}}{(y_{i}-x_{i}^T\beta )^2}-1\big )/2\), and \(\nu _\beta \) and \(\nu _\alpha \) are vectors whose \((j+1)\)th elements are given by \(\phi _\epsilon ^{\prime }(\beta _{j})\) and \(\phi _\epsilon ^{\prime }(\alpha _{j})\) respectively, except that their first elements are zero because the intercepts are not penalized.

The matrix of negative second derivatives of \(\ell ^{\text {SIC}}(\theta )\), i.e., \(-\nabla _{\theta }\nabla _{\theta }^T\ell ^{\text {SIC}}(\theta )\) is given by

$$\begin{aligned} I(\theta )&= I_0(\theta ) + \begin{pmatrix} \log (n)\,\Sigma _{\beta }/2 & 0 \\ 0 & \log (n)\,\Sigma _{\alpha }/2 \end{pmatrix}\\&= \begin{pmatrix} X^{T} W_{\beta } X + \log (n)\Sigma _{\beta }/2 & X^{T} W_{\alpha \beta } X \\ X^{T} W_{\alpha \beta } X & X^{T} W_{\alpha } X + \log (n)\Sigma _{\alpha }/2 \end{pmatrix}, \end{aligned}$$

where \(I_0(\theta ) = -\nabla _{\theta }\nabla _{\theta }^T\ell (\theta )\) is the observed information matrix of the unpenalized likelihood; \(\Sigma _\beta \) and \(\Sigma _\alpha \) are diagonal matrices arising from the penalties, whose \((j+1)\)th diagonal elements are given by \(\phi _\epsilon ^{\prime \prime }(\beta _{j})\) and \(\phi _\epsilon ^{\prime \prime }(\alpha _{j})\) respectively, except that their first diagonal elements are zero because the intercepts are not penalized; \(W_\beta \), \(W_\alpha \) and \(W_{\alpha \beta }\) are \(n \times n\) diagonal weight matrices whose ith diagonal elements are given by \(e^{-x_{i}^T\alpha }\), \(e^{-x_{i}^T\alpha }(y_i-x_{i}^T\beta )^2/2\) and \(e^{-x_{i}^T\alpha }(y_i-x_{i}^T\beta )\) respectively. We employ the “RS” algorithm (Rigby and Stasinopoulos 2005), which does not use cross derivatives. This algorithm is motivated by the fact that in many classical models, including location and scale models, the parameters are information orthogonal, as discussed in Cox and Reid (1987). However, Stasinopoulos and Rigby (2007) report that the RS algorithm works well even when the parameters are not information orthogonal.

The resulting system of Newton–Raphson equations can be expressed compactly as

$$\begin{aligned} \begin{pmatrix} X^{T} W_{\beta }^{(m)} X + \log (n)\Sigma _{\beta }^{(m)}/2 & 0 \\ 0 & X^{T} W_{\alpha }^{(m)} X + \log (n)\Sigma _{\alpha }^{(m)}/2 \end{pmatrix} \begin{pmatrix} \beta ^{(m+1)} - \beta ^{(m)}\\ \alpha ^{(m+1)} - \alpha ^{(m)} \end{pmatrix} = \begin{pmatrix} X^T z_{\beta }^{(m)} - \log (n)\nu _{\beta }^{(m)}/2\\ X^T z_{\alpha }^{(m)} - \log (n)\nu _{\alpha }^{(m)}/2 \end{pmatrix}. \end{aligned}$$

They are iteratively solved for \(\theta ^{(m+1)} = ({\beta ^{(m+1)}}^T, {\alpha ^{(m+1)}}^T)^T\), where the elements super-scripted by (m) depend on \(\theta ^{(m)}\), but this dependence is suppressed for notational convenience. Note that, since the RS algorithm sets the off-diagonal blocks to zero, it is possible to optimize the problem by alternating between the estimation of the mean and variance models; however, this is not considered here. We use the classical ordinary least squares estimates as initial values for the location parameter, i.e., \(\beta ^{(0)} = (X^T X)^{-1} X^T Y\), where \(Y = (y_1,\ldots ,y_n)^T\) is the response vector. We fix the starting value for the intercept of the dispersion term at \(\log (s^2)\), where the classical residual variance estimator \(s^2 = \sum _{i=1}^n (y_i - x_i^T \beta ^{(0)})^2/(n - p)\) is used. The remaining elements of the \(\alpha ^{(0)}\) parameter vector are set to zero (Rutemiller and Bowers 1968; Harvey 1976), which gives \(\alpha ^{(0)}=(\log (s^2), 0, \ldots , 0)^T\). The standard errors of the estimates can be directly acquired by estimating the covariance of the penalized estimates for the true non-zero parameters using the sandwich formula,

$$\begin{aligned} \hat{\textrm{cov}}(\hat{\theta })=\{I(\hat{\theta })\}^{-1} I_0(\hat{\theta })\{I(\hat{\theta })\}^{-1}, \end{aligned}$$

which has been shown to be accurate for moderate sample sizes (Fan and Li 2001, 2002).
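The estimation machinery of this section (score vectors, penalized block-diagonal information, and one RS-style update) can be sketched as follows. This is a simplified illustration, not the smoothic implementation, and the function names are ours:

```python
import numpy as np

def phi_d1(t, eps):
    # First derivative of the smooth L0 penalty phi_eps.
    return 2 * t * eps**2 / (t**2 + eps**2)**2

def phi_d2(t, eps):
    # Second derivative; note it can be negative when 3t^2 > eps^2, so the
    # penalized information matrix is not guaranteed positive definite.
    return 2 * eps**2 * (eps**2 - 3 * t**2) / (t**2 + eps**2)**3

def sic_newton_step(X, y, beta, alpha, eps):
    """One RS-style Newton-Raphson update of the SIC-penalized likelihood:
    the cross-derivative blocks are set to zero, so beta and alpha are
    updated via two separate linear solves."""
    n = X.shape[0]
    lam = np.log(n)
    r = y - X @ beta
    w = np.exp(-(X @ alpha))            # e^{-x_i^T alpha}

    z_beta = w * r                      # score weights for the location
    z_alpha = (w * r**2 - 1.0) / 2.0    # score weights for the dispersion

    # Penalty gradient (nu) and curvature (Sigma); intercepts unpenalized.
    nu_b = phi_d1(beta, eps); nu_b[0] = 0.0
    nu_a = phi_d1(alpha, eps); nu_a[0] = 0.0
    S_b = phi_d2(beta, eps); S_b[0] = 0.0
    S_a = phi_d2(alpha, eps); S_a[0] = 0.0

    # Block-diagonal penalized information matrices.
    I_b = X.T @ (w[:, None] * X) + (lam / 2) * np.diag(S_b)
    I_a = X.T @ ((w * r**2 / 2)[:, None] * X) + (lam / 2) * np.diag(S_a)

    beta_new = beta + np.linalg.solve(I_b, X.T @ z_beta - (lam / 2) * nu_b)
    alpha_new = alpha + np.linalg.solve(I_a, X.T @ z_alpha - (lam / 2) * nu_a)
    return beta_new, alpha_new
```

In practice such a step would be iterated to convergence at each value of \(\epsilon\) in the telescoping sequence described next.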

2.4 \(\epsilon \)-telescoping

Although smaller values of \(\epsilon \) lead to a better approximation of the \(L_0\) norm (see Fig. 1), and hence IC optimization, we have found that the procedure becomes less numerically stable (see Supplementary Material). On the other hand, larger values of \(\epsilon \) lead to a more stable optimization procedure, but one that does not yield coefficients close to zero. Therefore, we propose a method that “telescopes” through a decreasing sequence of \(\epsilon \) values and makes use of “warm starts”, whereby the solution to the previous optimization problem is used as the initial point for the current nearby problem. The method can produce final estimates of the true zero coefficients that are extremely close to zero, and, so, can be treated as being equal to zero for practical purposes.

In this paper, we treat values below \(10^{-8}\) as being zero. We have found that using a sequence of \(T=100\) steps from \(\epsilon _1=10\) to \(\epsilon _{T}=10^{-5}\) performs well. Of course, applying fewer steps in the sequence from \(\epsilon _1\) to \(\epsilon _T\) is an option in practice. However, larger values of T help to ensure that the repeated fitting procedure brings the parameters close to zero while avoiding estimation instability. If T is too small (e.g., \(T=10\)), then the variable selection performance declines; simulation results for \(T=50\) and \(T=10\) are provided in the Supplementary Material. Once an adequate number of steps is used, the performance of the method is not highly influenced by the choice of the sequence. A large enough \(\epsilon _1\) must be chosen in order to introduce smoothness and give stable estimates, while a smaller \(\epsilon _{T}\) more closely approximates the \(L_0\) norm to induce pseudo-sparsity (where we say “pseudo” since the algorithm produces coefficients that can be made arbitrarily close to zero while not being exactly zero). In addition to this, we propose implementing an exponentially decaying sequence of the form \(\epsilon _1 r^{t-1}\), where \(\epsilon _1\) is the starting value, \(r \in (0,1)\) is the rate of decay and t is the step number. For our suggested sequence with \(T=100\) steps from \(\epsilon _1=10\) to \(\epsilon _{T}=10^{-5}\), the decay parameter is \(r = 0.87\). This is advantageous as the optimization begins with large increments from \(\epsilon _1 = 10\), which provides rapid improvements and estimates that are initially close to the unpenalized values. The smaller increments leading to \(\epsilon _{T}=10^{-5}\) allow for smaller refinements, especially with regard to squeezing some coefficients to be close to zero.
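The telescoping sequence described above can be generated directly; choosing \(r\) from the endpoints recovers the quoted decay rate of approximately 0.87:

```python
import numpy as np

# Exponentially decaying epsilon-telescope: eps_t = eps_1 * r**(t - 1),
# with r chosen so that the T-th value lands exactly on eps_T.
T, eps_1, eps_T = 100, 10.0, 1e-5
r = (eps_T / eps_1) ** (1.0 / (T - 1))   # decay rate, approximately 0.87
eps_seq = eps_1 * r ** np.arange(T)
```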

Although we avoid a grid search over penalty tuning parameters (typically denoted by \(\lambda \) in penalized estimation), we instead have a sequence of \(\epsilon \) values. However, there is a key distinction between the objectives of these two approaches. In tuning parameter selection, the grid search over \(\lambda \) is an optimization procedure, which, as previously discussed, is computationally demanding in the context of distributional regression due to it being a multidimensional grid. Moreover, the position of the optimal solution is unknown and could potentially be missed—especially if one reduces the number of grid points to combat the aforementioned computational expense. In contrast, our \(\epsilon \)-telescoping approach is unidimensional and is not itself an optimization procedure since we know in advance, from a mathematical perspective, that \(\epsilon \) should effectively be zero. Thus, the role of the \(\epsilon \)-telescope is to move the problem to an arbitrarily small value of \(\epsilon _{T}\) in a stable way. It should be noted that, although we use \(\epsilon _{T}=10^{-5}\), it may be that a relationship between \(\epsilon _{T}\) and the sample size could be established using asymptotic analysis, e.g., a larger \(\epsilon _{T}\) value might be acceptable at smaller sample sizes; however, this is beyond the scope of the current article.

Table 1 Coefficient values of \(\beta _1\) and \(\beta _2\) as the method telescopes through \(\epsilon \)
Fig. 2 Slice through objective function for different values of \(\epsilon \). Dashed vertical lines mark the true value

Table 1 presents an example of the coefficient estimates for a true zero and a true non-zero coefficient for some simulated data. The shrinkage effect of decreasing \(\epsilon \) values is apparent: the magnitude of the true zero coefficient \(\beta _1\) reduces drastically at each step, and the final estimate at \(\epsilon _{T} = 10^{-5}\) is extremely close to zero. As a result, it can be treated as a zero coefficient for practical purposes, and, indeed, it could be shrunk further if desired by reducing \(\epsilon _{T}\). The estimate for the true non-zero coefficient \(\beta _2\) varies little over the telescoping steps.

Figure 2 provides a visualization of the telescoping effect in terms of the objective function. This is a slice through the objective function, which is plotted as a function of the coefficient value. Different curves are plotted corresponding to the \(\epsilon \) values in the telescope sequence. In the case of the true zero coefficient \(\beta _1\), it is clear that as \(\epsilon \) decreases, the width of the curves becomes narrower, and therefore there is less uncertainty around the estimate. Additionally, it is evident that the minimum of the curve shifts towards zero as \(\epsilon \) decreases. For the true non-zero coefficient \(\beta _2\), the curves for the different \(\epsilon \) values are almost identical, i.e., the telescoping has little impact on the shape of the objective function in this case.

2.5 Algorithm

The proposed MPR-SIC variable selection method is summarized in Algorithm 1.


3 Simulation studies

3.1 Setup

We have undertaken a simulation study to investigate the performance of our proposed MPR-SIC method. We simulate data from the normal MPR model from Sect. 2, where X is a matrix of 12 covariates. To achieve a realistic setup, we make use of a range of different regression parameter values and covariate distributions. The regression coefficients, provided in Table 2, take values of 0, 0.5 or 1, corresponding to covariates having no effect, a weak effect or a strong effect, respectively. Moreover, there are several combinations where covariates enter through: both distributional parameters (\(X_1\) to \(X_4\)); the location only (\(X_5\) and \(X_6\)); the dispersion only (\(X_7\) and \(X_8\)); and neither the location nor the dispersion (\(X_9\) to \(X_{12}\)), this latter group being pure noise covariates, i.e., they have no impact on the response. As for the covariate distributions, we include: two skewed covariates, \((X_1, X_{11}) \sim \) Exponential(1); two unbalanced binary covariates \((X_3, X_{10}) \sim \) Bernoulli(0.75); four independent normal covariates \((X_4, X_5, X_7, X_8) \sim \) \(\text {N}(0,1)\); and four correlated multivariate normal covariates \((X_2, X_6, X_9, X_{12}) = (Z_1, Z_2, Z_3, Z_4) \sim \) MVN wherein \(\text {corr}(Z_j, Z_k)= 0.8^{|j - k |}\). Lastly, three different sample sizes (\(n=100\), 500 and 1000) are considered, where each scenario is replicated 1000 times. The Supplementary Material contains some additional simulation studies (not discussed here): scenarios where no covariate enters the dispersion (i.e., classical linear regression) and scenarios with only independent normally distributed covariates.
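The covariate design described above can be sketched as follows (a hypothetical helper, with 0-based columns so that, e.g., column 0 corresponds to \(X_1\)):

```python
import numpy as np

def simulate_design(n, rng):
    """Sketch of the 12-covariate simulation design: skewed, unbalanced
    binary, independent normal, and correlated multivariate normal."""
    X = np.empty((n, 12))
    X[:, [0, 10]] = rng.exponential(1.0, size=(n, 2))    # X1, X11 skewed
    X[:, [2, 9]] = rng.binomial(1, 0.75, size=(n, 2))    # X3, X10 binary
    X[:, [3, 4, 6, 7]] = rng.normal(size=(n, 4))         # X4, X5, X7, X8
    # X2, X6, X9, X12 = (Z1, Z2, Z3, Z4) with corr(Zj, Zk) = 0.8^|j-k|
    idx = np.arange(4)
    cov = 0.8 ** np.abs(idx[:, None] - idx[None, :])
    X[:, [1, 5, 8, 11]] = rng.multivariate_normal(np.zeros(4), cov, size=n)
    return X

X = simulate_design(1000, np.random.default_rng(42))
```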

Table 2 True parameter values

For each scenario, we perform our proposed procedures, MPR-SIC and SPR-SIC; the latter is the SIC implemented for a single-parameter regression model, i.e., penalized linear regression. We compare the performance of our method to the “bamlss” package (Umlauf et al. 2018), which implements penalized distributional regression (among other things). More specifically, we assign LASSO penalties to the location and dispersion regression parameters of a normal distribution, where the associated tuning parameters are selected by minimizing the BIC using a two-dimensional grid search with \(50 \times 50\) grid points; hereafter, we refer to this as BAMLSS. We note that the BAMLSS method does not always necessarily bring parameters very close to zero, and, therefore, our interpretation of a zero effect in BAMLSS is based on the associated 95% credible interval containing zero.

We also apply the adaptive LASSO (ALASSO) method from the “glmnet” package (Friedman et al. 2010), which corresponds to penalized linear regression, where we select the value of the tuning parameter by minimizing the BIC (ALASSO-IC). Note that glmnet only supports tuning parameter selection via cross-validation, and, therefore, we compute the BIC by evaluating the normal likelihood at the ALASSO estimates, \(\hat{\beta }\), and with the variance estimator

$$\begin{aligned} \hat{\sigma }^2=\sum ^{n}_{i=1}\frac{\left( y_i-{x_{i}}^{T}{\hat{\beta }}\right) ^2}{n-{k}} \end{aligned}$$

as suggested in Reid et al. (2016), where k is the number of non-zero elements in \(\hat{\beta }\). Since glmnet does not provide parameter inference, we compute standard errors (and hence confidence intervals) using the general sandwich formula provided in (11). (The form of the matrix \(I(\hat{\theta })\) is slightly different due to the presence of the \(L_1\) penalty.) The ALASSO method is misspecified in the scenarios we consider here, since it does not cover dispersion effects. However, we include it as a very commonly used method in practice, and it is useful to see how variable selection in the location is impacted when one fails to model the dispersion. The SPR-SIC method is similarly misspecified and is included as an SIC-based alternative to the ALASSO. Note that the Supplementary Material includes scenarios (not discussed here) where these are not misspecified, i.e., the true model only contains location effects.
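This BIC computation can be sketched as follows, assuming a fitted sparse coefficient vector (the function name is ours):

```python
import numpy as np

def alasso_bic(y, X, beta_hat):
    """BIC at a location-only fit, using the residual variance estimator
    of Reid et al. (2016), which divides by n - k rather than n."""
    n = len(y)
    k = np.count_nonzero(beta_hat)     # number of non-zero coefficients
    rss = np.sum((y - X @ beta_hat) ** 2)
    sigma2 = rss / (n - k)
    loglik = -n / 2 * np.log(2 * np.pi * sigma2) - rss / (2 * sigma2)
    return -2 * loglik + np.log(n) * k
```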

3.2 Simulation results

Before we consider performance in terms of variable selection and parameter inference, we first briefly review the computational expense. To this end, average computation times for each of the methods are given in Table 3. We can see that our MPR-SIC procedure is 40–50 times faster than BAMLSS, in large part due to the two-dimensional grid required by the latter. Even though the SPR-SIC approach is misspecified here, it is still useful to note that it is 4–8 times faster than the MPR-SIC approach. Thus, as expected, the distributional MPR approach is slightly slower than the SPR approach due to the fact that the former specifies (correctly) a dispersion model, and, hence, has twice the number of parameters to estimate (ignoring intercepts). However, the difference is not as dramatic as MPR-SIC versus BAMLSS since the SIC approaches both have the same penalty with unidimensional \(\epsilon \)-telescoping. The ALASSO-IC approach is the fastest overall, but it should be noted that the core of its implementation is compiled C code. Even so, the SPR-SIC is still relatively competitive computationally, at approximately 10 times slower, despite being implemented purely in R code.

Turning now to variable selection performance, metrics including the average number of true zero coefficients correctly set to zero (C) and the average number of true non-zero coefficients incorrectly set to zero (IC) are investigated. The probability of choosing the true model (PT) is examined by looking at the proportion of times the true model is selected. The mean squared error (MSE) is computed for each simulation replicate in order to assess in-sample prediction accuracy, and is calculated by \(\text {MSE}(\hat{\theta })=(\hat{\theta } - \theta )^T{X^TX}(\hat{\theta } - \theta )/n\) (Tibshirani 1997). These metrics, averaged over simulation replicates, are presented in Table 4.
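The MSE metric above is a direct quadratic form and can be transcribed as follows (function name ours):

```python
import numpy as np

def model_mse(theta_hat, theta, X):
    """In-sample prediction accuracy metric of Tibshirani (1997):
    (theta_hat - theta)^T X^T X (theta_hat - theta) / n."""
    d = theta_hat - theta
    return d @ (X.T @ X) @ d / X.shape[0]
```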

Table 3 Simulation results: average computation time per simulation replicate (in s)
Table 4 Simulation results: model selection metrics

For all of the methods, the C values are close to six (the true value) and improve as the sample size increases. For the MPR-SIC method, the IC values are zero in most cases, apart from \(n = 100\). This is due to the three smaller-valued weak effects being set to zero incorrectly in some simulation replicates (\(\beta _2\), \(\beta _3\), \(\beta _5\) in the location and \(\alpha _1\), \(\alpha _3\), \(\alpha _7\) in the dispersion). This behaviour is also conveyed by the probability of choosing the true model (PT). The PT values are low for \(n=100\), which is due to both \(\beta \) and \(\alpha \) sometimes having a zero coefficient not set to zero and sometimes having a non-zero coefficient incorrectly set to zero. The sample size of \(n=100\) is a challenging scenario as the MPR-SIC method is fitting a total of \(2(p+1)\) parameters, which in this case is 26 parameters for a relatively small sample size. Taking this into account, we suggest that the performance of the method in this setting is reasonable. In contrast, the PT values are high for \(n = 500\) and 1000, and appear to be converging to one. The BAMLSS procedure has higher IC values than the MPR-SIC method for \(n = 100\) and lower C values for the larger sample sizes. The net effect of this is that the PT values are generally lower than for the MPR-SIC method (except for the location parameter at \(n = 100\)). For the SPR-SIC and ALASSO-IC methods, it only makes sense to consider their performance in the location, since there is no dispersion model. We can see that the IC values for these approaches are quite large, which means that they are setting some of the non-zero parameters to zero (albeit this is reducing with the sample size). The corresponding PT values are also quite low compared to the MPR-SIC approach.
Ultimately, this indicates that erroneously ignoring the dispersion has an impact on the estimation of the location, even though the location and dispersion parameters are orthogonal for the normal distribution.

The estimation and inferential performance of our proposed MPR-SIC method is investigated in Table 5. The average estimate over simulation replicates is shown along with the true standard error (SE), which is the standard deviation of the estimates over simulation replicates, and the average estimated standard error (SEE) over simulation replicates, where the SEE in a given replicate is computed using (11); also shown is the empirical coverage probability (CP) for a nominal 95% confidence interval. We can see that, in all cases, the estimated parameter is close to the true value, although there is some bias in the larger \(\alpha \) coefficients at \(n=100\). The standard errors for both the \(\beta \) and \(\alpha \) parameters are underestimated at \(n=100\), leading to CPs below 0.95. However, at \(n \ge 500\) the standard errors are well estimated and the coverage is very close to the desired 0.95 level. The equivalent results for the BAMLSS, SPR-SIC and ALASSO-IC methods are deferred to the Supplementary Material, but we briefly outline them here: although BAMLSS appears to be better at the smallest sample size, the dispersion parameter results are not as good as MPR-SIC for the larger sample sizes (with higher SE values and CP values generally below 0.9); both the SPR-SIC and ALASSO-IC methods perform poorly in the location parameter in all respects (biased estimates, large SEs that are underestimated by the SEEs, and quite low CP values).

Table 5 Simulation results: estimation and inference metrics
Table 6 Simulation results: out-of-sample prediction coverage probabilities
Fig. 3

Prediction coverage probabilities (PCPs) of observations for different dispersion levels, \(\sigma _i\). Solid black line indicates the coverage and the red dashed line is a reference line at 0.95

Out-of-sample prediction coverage probabilities (PCPs) for the methods are calculated for a sample 20% the size of the original data and are shown in Table 6. This is calculated as the proportion of times the true response value lies in a nominal 95% prediction interval (PI) in each replicate. The average is then taken over the 1000 replicates. The 95% PIs are calculated as \(x_i^T\hat{\beta } \pm 1.96 \sqrt{\exp (x_i^T\hat{\alpha })}\). Note that for the SPR-SIC and ALASSO-IC, the dispersion parameter is held constant across observations and so \(\alpha \) is a vector of zeros except for its first element, which is the intercept. The methods appear to perform similarly when examining the overall PCP, although both the MPR-SIC and BAMLSS methods are somewhat poorer than the others in the \(n=100\) case. However, note that the observations can be categorized by their variability \(\sigma _i\) and, therefore, we split them into groups with low, medium and high variability using the thresholds \(\sigma _i \le 1\), \(\sigma _i \in (1, 2.2]\) and \(\sigma _i > 2.2\), respectively, where these thresholds are the tertiles computed numerically from the true underlying distribution of \(\sigma _i\). Doing so reveals that none of the methods perform particularly well at \(n=100\) when viewed in terms of these three levels of variability. For \(n=500\) and 1000, the coverage for the MPR-SIC and BAMLSS procedures remains at approximately 95%, which is the desired nominal value. In contrast, for the SPR-SIC and ALASSO-IC methods, the PIs are too wide for the low and medium variability cases (leading to 100% coverage) and too narrow for the large variability cases (leading to approximately 85% coverage). This is unsurprising since these methods assume a constant \(\sigma \), and, therefore, cannot adapt to heterogeneity in the data. This can also be visualized in Fig. 3, which shows the coverage for ten \(\sigma _i\) categories in the case where \(n=1000\); again we see that the MPR-SIC and BAMLSS methods lie close to 95%, whereas the other methods are either too high or too low.
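The PI and PCP computations above can be sketched as follows (a minimal Python illustration under the stated normal model; the names are ours). For the SPR-SIC and ALASSO-IC methods, \(\hat{\alpha }\) would contain only a non-zero intercept, so \(\sigma _i\) is constant:

```python
import numpy as np

def prediction_coverage(X_test, y_test, beta_hat, alpha_hat, z=1.96):
    """Proportion of test responses inside the nominal 95% prediction
    interval x^T beta_hat +/- 1.96 * sqrt(exp(x^T alpha_hat))."""
    mu = X_test @ beta_hat
    sigma = np.sqrt(np.exp(X_test @ alpha_hat))
    covered = (y_test >= mu - z * sigma) & (y_test <= mu + z * sigma)
    return covered.mean()

# sanity check on simulated data where the fitted model is exactly true,
# so the empirical coverage should sit near the nominal 95% level
rng = np.random.default_rng(1)
n = 20000
X_test = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_hat = np.array([1.0, 2.0])
alpha_hat = np.array([0.0, 0.0])   # sigma_i = 1 for all observations
y_test = X_test @ beta_hat + rng.normal(size=n)
pcp = prediction_coverage(X_test, y_test, beta_hat, alpha_hat)
```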

In order to examine the generalizability of our simulation study, we have also considered several additional simulation scenarios, whose results can be found in the Supplementary Material. In particular, we have changed the effect sizes and cardinality of the active sets (i.e., the sets of covariates with non-zero effects) so they differ across the location and dispersion parameters; we also consider settings where we have doubled the number of covariates (to 24). In general, the performance is comparable to the results presented here, but two noteworthy differences are as follows: (1) when the dispersion effects are much larger than the location effects, the selection performance in the location reduces considerably (albeit inferential performance remains good); and (2) when the number of covariates is increased to 24, the problem becomes unstable for \(n = 100\) and the larger-sample PT values are reduced (by about 10–15 percentage points).

4 Real data analyses

4.1 Overview

We consider three real data analyses to illustrate our proposed MPR-SIC method, which is implemented using the \(\epsilon \)-telescope (Algorithm 1). For each dataset, the resulting MPR-SIC, BAMLSS, SPR-SIC and ALASSO-IC estimates (\(\hat{\beta }, \hat{\alpha }\)) are presented, and note that, for the SPR-SIC and ALASSO-IC methods, \(\hat{\alpha }\) is a vector of zeros except for its first element (the intercept). We also compare these methods in terms of out-of-sample PCP values. Additionally, for the proposed MPR-SIC, we provide the associated standard errors for each non-zero coefficient and the change in BIC, denoted \(\Delta \)BIC, that arises upon setting that coefficient to zero. The \(\Delta \)BIC value provides a measure of the impact of dropping a variable from the location (\(\beta \) coefficient) or the dispersion (\(\alpha \) coefficient), and, therefore, indicates its importance in these model components. For the other methods, these additional metrics are deferred to Supplementary Material, but we indicate statistical significance by presenting significant coefficient values in bold for all methods in the main text. (Note that, for BAMLSS, “statistical significance” is based on the credible intervals excluding zero.)
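The \(\Delta \)BIC computation is straightforward given the maximized log-likelihoods of the full and reduced models. A hedged sketch follows, assuming (for illustration) that the reduced model is refitted with the coefficient in question fixed at zero, so that it has one fewer free parameter:

```python
import numpy as np

def bic(loglik, k, n):
    """BIC = -2*loglik + log(n)*k, where k counts the estimated parameters."""
    return -2.0 * loglik + np.log(n) * k

def delta_bic(loglik_full, k_full, loglik_reduced, n):
    """Increase in BIC from setting one coefficient to zero: the reduced
    model has one fewer free parameter but typically fits less well."""
    return bic(loglik_reduced, k_full - 1, n) - bic(loglik_full, k_full, n)

# hypothetical values: dropping the coefficient costs 3 log-likelihood units
d = delta_bic(-100.0, 5, -103.0, n=100)
```

A positive \(\Delta \)BIC indicates that removing the coefficient worsens the criterion, i.e., the variable is worth keeping in that model component.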

4.2 Prostate cancer data

We examine the prostate cancer data, which come from a study by Stamey et al. (1989) and which appear in Tibshirani (1996) and Zou and Hastie (2005). The association between the level of prostate-specific antigen (PSA) and various clinical measures in 97 men who were about to receive a radical prostatectomy is examined. The predictors consist of eight clinical measures: log(cancer volume (\(\text {cm}^3\))) (lcavol), log(prostate weight (g)) (lweight), presence of seminal vesicle invasion (SVI) (svi), age of the patient (age), log(amount of benign prostatic hyperplasia (\(\text {cm}^2\))) (lbph), log(capsular penetration (cm)) (lcp), Gleason score (gleason) and percentage of Gleason scores four or five (pgg45). The logarithm of PSA (ng/mL) is the response variable. The presence of SVI (svi) is a binary variable (1 = yes, 0 = no) and gleason is a discrete numerical variable with four values. The Gleason score relates to prostate cancer grades, and the pgg45 predictor provides information on the history of the patient: it is the percentage of Gleason scores of four or five received before the final Gleason score recorded in gleason. PSA is a protein that is produced by normal and malignant prostate cells, and is useful as a preoperative marker, as prostate cancer causes PSA to be discharged into the blood.

Figure 4 plots the standardized coefficient values with respect to the MPR-SIC \(\epsilon \)-telescope, which shows how the method works as \(\epsilon \) moves towards zero. We note that the coefficients are essentially unpenalized at the largest value, \(\epsilon = 10^1\), where there is no variable selection; this is because the penalty in (7) is close to zero for large \(\epsilon \) values. Then, decreasing \(\epsilon \) moves the problem towards \(L_0\) penalization such that variable selection occurs. In particular, lcavol is selected only in the location component while lweight and svi are selected in both the location and dispersion components. Interestingly, although we decrease \(\epsilon \) to \(10^{-5}\), the results here do not change appreciably beyond \(\epsilon =10^{-2}\).
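The exact form of penalty (7) is not reproduced in this section, but its qualitative behaviour along the telescope can be illustrated with a representative smooth \(L_0\) approximation, \(\phi _\epsilon (\beta ) = \beta ^2/(\beta ^2 + \epsilon ^2)\) (our choice, for illustration only): the penalty is negligible for large \(\epsilon \) and approaches the \(L_0\) indicator \(1(\beta \ne 0)\) as \(\epsilon \rightarrow 0\):

```python
import numpy as np

def phi(beta, eps):
    """Representative smooth L0 approximation: phi -> 1(beta != 0) as
    eps -> 0, and phi -> 0 as eps grows (effectively no penalty)."""
    return beta**2 / (beta**2 + eps**2)

# behaviour of the penalty for a fixed coefficient along a decreasing
# eps-telescope, mirroring the values discussed in the text
telescope = [10.0, 1.0, 1e-2, 1e-5]
values = [phi(0.5, eps) for eps in telescope]
```

At \(\epsilon = 10\) the penalty on \(\beta = 0.5\) is essentially zero (no selection pressure), while by \(\epsilon = 10^{-2}\) it is already close to one, consistent with the observation that results stabilize beyond \(\epsilon = 10^{-2}\).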

Fig. 4

Prostate Cancer Data: standardized coefficient values through the \(\epsilon \)-telescope for the location and dispersion components. Lines are coloured corresponding to the selected variables, where grey indicates that the variable is set to zero and not selected in the final model. (Color figure online)

From Table 7, we can see that, like the MPR-SIC method, the other three methods also select lcavol, lweight, and svi in the location component; in all cases, their location coefficients are positive. Thus, increased values of log(cancer volume) and log(prostate weight), and the presence of SVI are associated with increased log(PSA) values, and therefore may be indicative of prostate cancer. As for the dispersion component, while the MPR-SIC method selects both lweight and svi, BAMLSS only identifies lweight as being important. Both of these methods have similar BIC values (224 and 222 units respectively), which are lower than the models with only location regression components (227 and 228 units, respectively, for SPR-SIC and ALASSO-IC). Distributional regression approaches improve on classical single parameter regression approaches since they can capture more complex covariate effects, e.g., lweight and svi appear in both the location and dispersion within the MPR-SIC model. Given this additional complexity, it can be helpful to visualize the effects. To this end, inspired by Stadlmann and Kneib (2021), we provide a series of model-based (MPR-SIC) conditional density curves for different covariate combinations in Fig. 5.

Table 7 Prostate Cancer Data: estimation metrics

As mentioned, within the MPR-SIC model (and all models considered), increased lcavol values are associated with increased log(PSA). Moreover, the large \(\Delta \)BIC value of 40.39 identifies the lcavol location effect as being the most important (across all location and dispersion effects)—and there is a clear location shift in the associated conditional density plots in Fig. 5. Similarly, increased lweight values are also associated with increased log(PSA), but to a lesser extent than with lcavol. In line with this, there is more overlap between the conditional densities for high and low lweight values and the \(\Delta \hbox {BIC}\) value is smaller (19.79). In other words, the cancer volume (lcavol) is more strongly associated with PSA values than the prostate weight (lweight). However, the nature of the lweight effect differs from that of lcavol since it impacts the dispersion: increased lweight values are associated with reduced dispersion. As for svi, its presence primarily increases the dispersion. This can be seen both visually from the density plots and confirmed by the fact that removing svi from the dispersion increases the BIC by 4.09 units, whereas its removal from the location increases the BIC by just 1.64 units.

Fig. 5

Prostate Cancer Data: MPR-SIC model-based conditional density curves. The black curve corresponds to an individual whose covariates (lcavol, lweight, svi) are all equal to median values (serving as a “baseline” or “average” individual); dashed grey lines mark the 2.5th, 50th and 97.5th quantiles of this density. Keeping two of the covariates fixed at the median values, the red and blue densities correspond to the modification of the third covariate as: “low” (Q1, the first quartile) and “high” (Q3, the third quartile) for the continuous covariates, lcavol and lweight; and “absence” (\(=0\)) and “presence” (\(=1\)) for the binary covariate, svi; red and blue vertical lines mark the 2.5th and 97.5th quantile values of each density. (Color figure online)

4.3 Sniffer data

When gasoline is pumped into a tank, hydrocarbon vapours are forced out and emitted into the atmosphere. This is a source of air pollution, and to reduce it, devices that capture the vapour are installed. Testing these vapour recovery systems involves a “sniffer” device to measure the amount of vapour that is recovered, and a method of estimating the total amount released is required to assess the efficiency of the system. A laboratory experiment was carried out to discover factors that impact the amount of hydrocarbon vapour released when gasoline is pumped into a tank. Four factors are varied—vapour pressure (psi) of the dispensed gasoline (gaspres), temperature (\(^\circ \hbox {F}\)) of the dispensed gasoline (gastemp), initial tank temperature (\(^\circ \hbox {F}\)) (tanktemp) and initial vapour pressure (psi) in the tank (tankpres). The quantity of emitted hydrocarbon (g) is the response variable. There are 125 runs in the data. These data have previously been considered by Weisberg (2013), who noted that the dispersion may depend on the predictors but did not apply a heteroscedastic model, and Bedrick (2000), who used a model with all four predictors in the location along with gastemp and gaspres in the dispersion.

The MPR-SIC, BAMLSS and SPR-SIC methods each select a different combination of variables for the location parameter (see Table 8). In terms of the selected statistically significant effects, the two location regression models (SPR-SIC and ALASSO-IC) select gaspres, gastemp, and tankpres. However, these models have higher BIC values than the distributional regression models (MPR-SIC and BAMLSS), with the latter models choosing tanktemp rather than tankpres as being important in the location. In any case, the location effects of gaspres and gastemp are positive across all models (albeit gaspres is not statistically significant in BAMLSS), indicating that increased gasoline pressure and temperature values are related to increased amounts of emitted hydrocarbon; moreover, the MPR-SIC model identifies these as the most important effects with \(\Delta \)BIC values of 68.41 and 56.85, respectively. The initial tank temperature (tanktemp) appears to be less important (\(\Delta \)BIC \(= 5.60\)), but its negative location coefficient in the MPR-SIC and BAMLSS models indicates that higher temperatures reduce the emitted hydrocarbon. In addition to the location effect of gastemp, the MPR-SIC model also finds this variable to increase the dispersion. With a \(\Delta \)BIC value of 17.62, the gastemp dispersion effect is far greater than the tanktemp location effect; this demonstrates the fact that modelling only the location—as is most often done in practice—can miss important features of the process under study. We note that the BAMLSS model is somewhat more complex than the MPR-SIC model, in that there are more coefficients that are far from zero. Overall, the MPR-SIC model achieves the lowest BIC of 616 units.

Table 8 Sniffer Data: estimation metrics
Fig. 6

Sniffer Data: a MPR-SIC model-based conditional density curves for each of the selected variables (see Fig. 5 for more details); b all eight conditional density curves obtained from the combinations of “low” (Q1, the first quartile) and “high” (Q3, the third quartile) for each of the three covariates gaspres, gastemp, and tankpres; these are ordered based on mean and coloured based on variance. (Color figure online)

Figure 6a displays conditional density plots of the various effects (analogous to those shown in Fig. 5 for the prostate cancer data). Here we can clearly see: the large impact of gaspres on the location; the lesser impact of gastemp on the location and its impact on the dispersion; and the weak impact of tanktemp on the location. Moreover, since there are only three selected variables, and this is an industrial setup where each one can be altered in practice, Fig. 6b also displays all eight combinations of conditional densities that arise from varying each covariate between its “low” and “high” values. There is a clear optimal configuration here, which yields both the lowest and least variable hydrocarbon emissions: reduce the gasoline pressure (gaspres) and temperature (gastemp) and increase the initial tank temperature (tanktemp). This setting will yield emissions approximately between 20 and 30, with a mean of 25. In contrast, for the worst setting (high gaspres and gastemp and low tanktemp), emissions will generally be between 27 and 42, with a mean of 35. Thus, air pollution can be reduced considerably by optimizing the setup.

4.4 Boston house price data

Data from a cross-sectional study of 506 communities in the Boston area carried out in 1970 (Harrison and Rubinfeld 1978) are available in Wooldridge (2015). The association between median house prices in a particular community with various community characteristics is examined. There are eight explanatory variables: average number of rooms per house (rooms), percentage of the population that are “lower status” (lowstat), average student-teacher ratio of schools in the community (stratio), log(property tax per $1000) (lproptax), log(weighted distances to five employment centres in the Boston region) (ldist), crimes committed per capita (crime), log(annual average nitrogen oxide concentration (pphm)) (lnox) and index of accessibility to radial highways (radial). The log(median house price ($)) is the dependent variable. DiCiccio et al. (2019) applied a weighted least squares approach to these data (which accounts for heterogeneity but does not model the dispersion), where they only considered the rooms, stratio, ldist and lnox variables.

Table 9 Boston House Price Data: estimation metrics

Table 9 shows that all eight covariates are included in the location component across the MPR-SIC, BAMLSS, SPR-SIC and ALASSO-IC methods. All the coefficients are statistically significant, and the signs of the estimated location coefficients are consistent across the methods. Only two covariates (rooms and radial) have a positive effect on house prices. Houses with a greater number of rooms and access to radial highways are generally desirable, which results in increased house prices. The remaining variables may be considered to be undesirable, thus reducing the median house prices. In particular, the percentage of the population in the community that are “low status” and the student-teacher ratio of schools in the community both have a sizeable impact on the BIC when they are removed from the location parameter. The MPR-SIC method selects three covariates in the dispersion parameter, lowstat, ldist and radial, of which ldist has a negative coefficient while the other two have positive coefficients. We note that ldist, when dropped from the dispersion, leads to a greater \(\Delta \)BIC value than three of the eight variables in the location component, again highlighting that modelling only the location ignores important effects. Interestingly, the BAMLSS approach finds rooms rather than ldist to be statistically significant, but is otherwise quite comparable to the MPR-SIC approach in the values of the (statistically significant) model coefficients. Moreover, we can see that the BIC values for these two models are much lower than those of the location-regression models (SPR-SIC and ALASSO-IC).

Fig. 7

Boston House Price Data: mean-variance pairs for each of the 506 communities. The colour of the points relate to whether a community has higher or lower values for rooms and ldist than the medians for each, e.g., the dark blue points correspond to communities where both the rooms and ldist values are higher than the median values. (Color figure online)

It is useful to estimate the average sale price and its variability for a given community, and, therefore, for each of the 506 communities in the dataset, we compute the mean-variance pairs \((\mu (x_i), \sigma ^2(x_i))^T\), where \(\mu (x) = x^T\beta \) and \(\sigma ^2(x) = e^{x^T\alpha }\). These are displayed in Fig. 7; note also that each point corresponds to an underlying conditional density (as the normal distribution is fully characterized by its mean and variance). Interestingly, we see that communities with higher prices also tend to have lower variability in these prices. The points are coloured according to the rooms and ldist values, these being the most important location and dispersion variables, respectively. From this, we see that higher values of rooms are associated with increased prices, while higher values of ldist are associated with reduced variability. Thus, from the perspective of a real estate agent, desirable homes are those with a higher number of rooms located a greater distance from employment centres. That being said, there are certainly other factors influencing house prices and their variability, as previously discussed based on Table 9.
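The mean-variance pairs above are cheap to compute once \(\hat{\beta }\) and \(\hat{\alpha }\) are available; a minimal sketch (with hypothetical coefficient vectors):

```python
import numpy as np

def mean_variance_pairs(X, beta_hat, alpha_hat):
    """Fitted (mu_i, sigma_i^2) for each observation under the normal MPR
    model: mu = x^T beta, sigma^2 = exp(x^T alpha)."""
    mu = X @ beta_hat
    var = np.exp(X @ alpha_hat)
    return np.column_stack([mu, var])

# toy example: intercept plus two covariates, hypothetical coefficients
X = np.array([[1.0, 0.5, 1.0],
              [1.0, 2.0, 0.0]])
beta_hat = np.array([10.0, 1.0, 0.0])   # second covariate: no location effect
alpha_hat = np.array([0.0, 0.0, -1.0])  # second covariate reduces dispersion
pairs = mean_variance_pairs(X, beta_hat, alpha_hat)
```

Each row of `pairs` is one point in a figure such as Fig. 7, and fully determines the corresponding conditional normal density.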

4.5 Prediction coverage probabilities

Table 10 contains the out-of-sample PCPs, overall and split by category of variability, calculated using 10-fold cross-validation for the prostate cancer, sniffer, and Boston house price data. Considered overall, the MPR-SIC, BAMLSS, SPR-SIC and ALASSO-IC methods perform similarly across all three data analyses. As expected based on the simulation, the coverage improves with sample size, as we compare results from the smaller prostate cancer and sniffer datasets with the larger Boston house price dataset. The overall pattern is that the SPR-SIC and ALASSO-IC methods tend to produce overly wide PIs for the low and medium variability categories and overly narrow PIs for the high category, although their coverage is adequate for the low variability cases in the two smaller datasets (prostate cancer and sniffer data). The MPR-SIC and BAMLSS approaches are more balanced and tend towards good coverage with increasing sample size. This effect can be seen in Fig. 8 for the Boston house price data, where the coverage is displayed for six \(\sigma _i\) categories; the coverage for both the MPR-SIC and BAMLSS methods lies close to 95%, while this is not the case for the other methods (which do not model the dispersion).

Table 10 Real data analyses results: out-of-sample prediction coverage probabilities
Fig. 8

Boston House Price Data: prediction coverage probabilities (PCPs) of observations for different dispersion levels, \(\sigma _i\). Solid black line indicates the coverage and the red dashed line is a reference line at 0.95. (Color figure online)

5 Discussion

Our proposed variable selection procedure uses a smooth \(L_0\) norm to facilitate smooth information criterion optimization, and is extended for application in the developing area of distributional regression. This enables straightforward selection of more complex covariate effects afforded by modelling multiple distributional parameters simultaneously. The smooth objective function means that this is achieved using standard gradient based optimization procedures, i.e., Newton–Raphson. Moreover, because the objective function is an information criterion, the approach circumvents the need for penalty tuning parameter optimization, e.g., this is fixed at \(\lambda = \log (n)\) in the BIC case. This is something that can be computationally intensive in LASSO-type problems, especially in the context of distributional regression modelling due to the additional parameters to be estimated and the fact that, in theory, there may be a separate tuning parameter for each distributional parameter. We provide a publicly available package smoothic for the implementation of our proposed methods (O’Neill and Burke 2021).
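As a schematic of the overall procedure, the following is a simplified Python sketch (the actual implementation is the R package smoothic; the penalty form \(\phi _\epsilon (\theta ) = \theta ^2/(\theta ^2+\epsilon ^2)\), the telescope values, and the final thresholding rule here are illustrative assumptions, not the paper's exact choices):

```python
import numpy as np
from scipy.optimize import minimize

def fit_mpr_sic(X, y, eps_telescope=(1.0, 0.1, 0.01, 0.001)):
    """Sketch of smooth-BIC optimization for the normal MPR model:
    minimize -2*loglik(beta, alpha) + log(n) * sum_j phi_eps(slope_j),
    warm-starting each fit from the previous eps (the eps-telescope).
    Intercepts (column 0 of X) are left unpenalized."""
    n, p = X.shape
    lam = np.log(n)  # BIC-type penalty: the tuning parameter is fixed

    def neg2loglik(theta):
        beta, alpha = theta[:p], theta[p:]
        log_var = X @ alpha                  # log sigma_i^2 = x_i^T alpha
        resid = y - X @ beta
        return np.sum(np.log(2.0 * np.pi) + log_var
                      + resid**2 / np.exp(log_var))

    def objective(theta, eps):
        slopes = np.concatenate([theta[1:p], theta[p + 1:]])
        pen = np.sum(slopes**2 / (slopes**2 + eps**2))  # smooth L0 proxy
        return neg2loglik(theta) + lam * pen

    theta = np.zeros(2 * p)
    for eps in eps_telescope:                # decreasing eps
        theta = minimize(objective, theta, args=(eps,), method="BFGS").x
    beta_hat, alpha_hat = theta[:p].copy(), theta[p:].copy()
    # illustrative hard threshold: telescoped near-zero coefficients dropped
    beta_hat[np.abs(beta_hat) < 1e-2] = 0.0
    alpha_hat[np.abs(alpha_hat) < 1e-2] = 0.0
    return beta_hat, alpha_hat

# simulated check: one active location effect, constant dispersion
rng = np.random.default_rng(0)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = 1.5 * X[:, 1] + rng.normal(size=n)
beta_hat, alpha_hat = fit_mpr_sic(X, y)
```

Since the objective is smooth, any standard gradient-based optimizer applies; the telescope serves to avoid the flat regions of the near-\(L_0\) penalty at small \(\epsilon \) by approaching it gradually from a well-behaved starting surface.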

Through extensive simulation studies, we have demonstrated that the procedure has very favourable performance in terms of variable selection, parameter inference, and out-of-sample prediction; this is true in both single and multiparameter settings. Results from the real data analyses illustrate the effectiveness of our procedure, and in particular, the advantage of modelling the dispersion is clear from the fact that we have found dispersion effects that are stronger (in BIC terms) than location effects.

The methods proposed in this article are not restricted to the normal distribution. We believe that the techniques presented herein can be extended for use with non-normal models, for example, the generalized linear model family. Moreover, we anticipate that the methods will also extend to the setting of robust statistical modelling, which is particularly important given the ever-growing presence of complex datasets (Fan et al. 2014; Avella Medina and Ronchetti 2015; Maronna et al. 2019). Such an extension would be capable of dealing with heteroscedasticity and outliers, while also carrying out variable selection and parameter estimation simultaneously. Additionally, an anonymous reviewer has pointed out that, in combination with our proposal, a fused or group penalty would also be useful in practice for nominal or ordinal covariates. Such extensions will be a focus of our future work.