Abstract
Due to the increasing complexity and dimensionality of data sources, it is favorable that methodological approaches yield robust results so that corrupted observations do not jeopardize overall conclusions. We propose a modelling approach for generalized additive models for location, scale and shape (GAMLSS) which is robust towards outliers in the response variable. We extend a recently proposed robustification of the log-likelihood to gradient boosting for GAMLSS, which is based on trimming low log-likelihood values via a log-logistic function to a boundary depending on a robustness constant. We recommend a data-driven choice for the involved robustness constant based on a quantile of the unconditioned response variable and investigate this choice in a simulation study for low- and high-dimensional data situations. The versatile application possibilities of robust gradient boosting for GAMLSS are illustrated via three biomedical examples, including the modelling of thyroid hormone levels, spatial effects for functional magnetic resonance brain imaging and a high-dimensional application with gene expression levels for cancer cell lines.
1 Introduction
With the emergence of complex observational data in biomedical research there is also an increasing need for robust data analyses (see Barrios 2015; Monti and Filzmoser 2022). Robust statistical approaches should ensure that, also for analyses purely based on observational or routine data from clinical practice (and hence less controlled settings than clinical trials), the overall conclusions are not driven by small amounts of corrupted or inconsistent observations such as outliers. On the other hand, many modern research questions also warrant more complex and flexible analysis tools, which may go beyond the classical focus on the mean of a distribution (Kneib et al. 2021). One of these flexible model classes are Generalized Additive Models for Location, Scale and Shape (GAMLSS, introduced by Rigby and Stasinopoulos 2005).
In contrast to classical regression models, GAMLSS allow for the modelling of multiple distribution parameters (e.g., also scale and shape parameters) in dependence of potentially different sets of covariates for each parameter. As a result, GAMLSS model the complete conditional distribution and not only the mean. Furthermore, the response variable can follow any distribution (continuous, discrete, mixed, e.g. zero-inflated continuous distributions with a spike at zero) and is not restricted to the exponential family as it is the case for regular generalized additive models (Hastie and Tibshirani 1990; Wood 2006). Recently, a robust fitting approach for GAMLSS was proposed by Aeberhard et al. (2021), which tackles the problem of outliers in the response variable. The core idea is to robustify the likelihood via a log-logistic transformation, which ensures that some extreme observations do not dominate the model fit. In contrast to more classical robust regression methods based on composite loss functions (Huber 1981; Amato et al. 2021), the focus on the likelihood facilitates the transfer of this robustness concept towards distributional regression. While the approach by Aeberhard et al. (2021) shows a very promising performance for low-dimensional models, it is not applicable for high-dimensional data with potentially more explanatory variables than observations (\(p > n\)), and does not provide data-driven variable selection.
To overcome these remaining issues, we incorporate the robustification by Aeberhard et al. (2021) in a boosting algorithm for GAMLSS (Mayr et al. 2012; Thomas et al. 2018). Boosting is a concept from machine learning that was later adapted to fit statistical models (Friedman 2001). An important advantage of these statistical boosting approaches is that they yield interpretable effect estimates for single predictor variables while allowing for high-dimensional data with \(p > n\). Furthermore, they can be adapted to carry out variable selection by stopping the algorithm before it converges (Bühlmann 2006). Recently, some approaches for robust fitting of statistical models via boosting have been proposed (Ju and Salibián-Barrera 2021; Speller et al. 2022), but they focus only on classical mean regression. We hence propose to incorporate the robustified likelihood approach into the boosting framework to fill the gap of robust approaches for fitting GAMLSS in the context of high-dimensional data.
As in classical robust regression approaches (Huber 1981; Maronna et al. 2019), the robustification of the likelihood in Aeberhard et al. (2021) is controlled by a robustness constant, which has to be specified before fitting the model. While there often exist reasonable default values which work well in many settings, the specification is generally difficult as the amount of required robustness is typically unknown. To facilitate the application of our approach in practice, we additionally propose a data-driven, quantile-based way to choose the robustness constant.
We analyse the performance of our approach regarding prediction accuracy and variable selection for different amounts and types of corrupted outcomes via a simulation study (Sect. 3) for Gaussian and Gamma distributed data, comparing it to the classical boosting algorithm for GAMLSS (Mayr et al. 2012; Thomas et al. 2018). In low-dimensional settings, we compare our approach also to the robust penalized maximum likelihood approach of Aeberhard et al. (2021). To further illustrate the practical relevance of our proposed method, we provide three biomedical applications (Sect. 4). We revisit the low-dimensional data application provided in Aeberhard et al. (2021) on spatial effects for functional magnetic resonance brain imaging using our robust boosting approach. Additionally, we present the analysis of a classical medium-sized epidemiological trial on thyroid hormone levels from the general population with data-driven variable selection. As a last illustrative example, we focus on gene expression levels with \(p \gg n\) in cancer cell lines.
2 Methods
2.1 Gradient boosting for GAMLSS
The gamboostLSS algorithm combines variable selection and prediction modeling while fitting GAMLSS using a componentwise gradient boosting approach (Mayr et al. 2012). A pre-specified set of regression-type base-learners (typically one for each covariate) is iteratively fitted to the negative gradient of the likelihood. In each iteration of the algorithm, the best-performing base-learner is selected and only this one is updated (Bühlmann 2006). This iteration-based procedure automatically leads to the selection of informative variables or base-learners, while base-learners that are never selected to be updated are effectively excluded from the final model. Typically, model tuning is performed by stopping the algorithm before convergence to avoid overfitting and to enforce variable selection and shrinkage.
To describe the componentwise gradient boosting algorithm in the GAMLSS regression framework, we use the following notation: The response observations \(y_i\) for \(i=1,\ldots ,n\) with conditional density functions \(f(y_i\,\vert \, \varvec{\theta }_{i})\) are assumed to be independent given the parameter vector \(\varvec{\theta }_{i}=(\theta _{i,k})_{k=1,\ldots ,K}\). Here, each parameter \(\theta _{i,k}\) of the parameter vector \(\varvec{\theta }_{i}\) can depend on a different subset of all available covariates, which is taken into account by using potentially varying index sets \(J_k\) for the covariates. In principle, the number of simultaneously modelled parameters K can be arbitrarily high, but for most distributions there are not more than \(K=4\) parameters (originally also denoted by \(\varvec{\theta }=(\mu ,\sigma ,\nu ,\tau )\) for location, scale, skewness and kurtosis parameters, Rigby and Stasinopoulos 2005).
Every parameter \(\theta _k\) is related to its additive predictor \(\eta _{\theta _k}\) by its specific monotonic link function \(g_k\):

\[ g_k(\theta _k) = \eta _{\theta _k} = \beta _{\theta _k,0} + \sum _{j\in J_k} h_{\theta _k,j}({\varvec{x}}_{k,j}), \quad k=1,\ldots ,K, \quad (1) \]
where \(\beta _{\theta _k,0}\) denotes the parameter-specific intercept and \(h_{\theta _k,j}({\varvec{x}}_{k,j})\) is the effect of the covariates \({\varvec{x}}_{k,j}\) within the additive predictor for parameter \(\theta _k\). In the context of componentwise gradient boosting, each \(h_{\theta _k,j}\) corresponds to one base-learner. In the simplest case this is a linear effect, but it can also be a non-linear spline or another type of effect. Note that \(h_{\theta _k,j}\) may also depend on multiple covariates, as for interactions or spatial effects, where, e.g., coordinates are used for estimating the joint effect via one spatial base-learner (see the brain data example in Sect. 4.1).
Once the model and base-learners in Eq. (1) are specified, the actual optimization problem \(\varvec{{\hat{\theta }}}= \underset{\varvec{\theta }}{\text {argmax}} \,\ell ({\varvec{y}}\,\vert \,\varvec{\theta })\) with the log-likelihood function

\[ \ell ({\varvec{y}}\,\vert \,\varvec{\theta }) = \sum _{i=1}^{n} \log f(y_i\,\vert \,\varvec{\theta }_{i}) \quad (2) \]
is solved iteratively via gradient boosting.
Prior to the first iteration, an offset value is calculated for each additive predictor based on the unconditional distribution of the response variable \({\textbf{Y}}\) (e.g. the mean for the location parameter of a Gaussian distribution). Then, in each iteration \(m\ge 1\), the partial derivatives with respect to the additive predictors \(\eta _{\theta _k}\),

\[ \frac{\partial }{\partial \eta _{\theta _k}}\, \ell ({\varvec{y}}\,\vert \,\varvec{\theta }) = \sum _{i=1}^{n} \frac{\partial }{\partial \eta _{\theta _k}}\, \log f(y_i\,\vert \,\varvec{\theta }_{i}), \quad (3) \]

are computed and the base-learners are fitted separately to this gradient, while only the best-fitting base-learner \(j^*\) is chosen to be added to the current additive predictor \(\eta _{\theta _k}^{[m]}\) via a small fixed step-length \(\nu\) (typically chosen to be 0.1):

\[ \eta _{\theta _k}^{[m]} = \eta _{\theta _k}^{[m-1]} + \nu \cdot {\hat{h}}_{\theta _k,j^*}^{[m]}({\varvec{x}}_{k,j^*}). \]
This iteration process is executed until the final (optimal) stopping iteration \(m_{\text {stop}}\) is reached. The boosting algorithm is usually stopped before convergence to avoid overfitting and to improve the prediction performance on test data via shrinkage of effect estimates towards zero. Resampling techniques like cross-validation or bootstrapping are typically used to tune the stopping iteration \(m_{\text {stop}}\) based on the predictive performance when no additional validation set for tuning \(m_{\text {stop}}\) is available.
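The selection-and-update logic described above can be illustrated with a minimal sketch for plain mean regression with simple linear base-learners. This is a schematic toy in Python, not the gamboostLSS implementation; all names are our own:

```python
import numpy as np

def componentwise_l2_boost(X, y, n_iter=100, nu=0.1):
    """Toy componentwise gradient boosting for mean regression with one
    simple linear base-learner per column of X. Illustrates only the
    selection-and-update logic, not the gamboostLSS implementation."""
    n, p = X.shape
    coef = np.zeros(p)
    intercept = y.mean()            # offset from the unconditional response
    fit = np.full(n, intercept)
    for _ in range(n_iter):
        u = y - fit                 # negative gradient of the squared-error loss
        # fit every base-learner to the gradient and keep the best one
        best_j, best_rss, best_b = 0, np.inf, 0.0
        for j in range(p):
            xj = X[:, j]
            b = xj @ u / (xj @ xj)  # least-squares slope of one base-learner
            rss = np.sum((u - b * xj) ** 2)
            if rss < best_rss:
                best_j, best_rss, best_b = j, rss, b
        coef[best_j] += nu * best_b          # update only the winner ...
        fit += nu * best_b * X[:, best_j]    # ... with a small step-length
    return intercept, coef
```

Columns that never win the selection step keep a coefficient of exactly zero, which is how early stopping induces variable selection.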
Similar to most other recent methodological works on multidimensional boosting (Strömer et al. 2022, 2023; Hans et al. 2023; Griesbach et al. 2023; Stöcker et al. 2021), we use the so-called non-cyclic variant (Thomas et al. 2018) for iteratively updating the previous estimate in the boosting algorithm. Thereby, only the update (base-learner fit) leading to the best overall improvement over all distribution parameters is actually executed and all other additive predictors remain without updates for this specific iteration. This leads to a relatively fast one-dimensional tuning process for \(m_{\text {stop}}\), as it is also the case in regular model-based boosting. If there are only updates for the location parameter, this can reduce the GAMLSS model to a GAM. In comparison, the older cyclic variant (Mayr et al. 2012) applies a grid search and updates all parameters one after another within one iteration. Typically, the non-cyclic variant leads to essentially the same results but substantially reduces the computation time for tuning (Thomas et al. 2018). Both variants are implemented in the R add-on package gamboostLSS (Hofner et al. 2016), which builds on the mboost (Hofner et al. 2014) package for model-based boosting.
2.2 Robust boosting for GAMLSS
One way to achieve robust model fitting under potentially contaminated data (particularly in the response values) is to apply some kind of trimming of extreme observations to control the influence of single observations. This general concept is widely used in robust statistics, starting with the trimmed mean (Maronna et al. 2019; Lugosi and Mendelson 2019), but also for more complex methodologies, e.g., for high-dimensional regression (Alfons et al. 2013; Speller et al. 2022). For continuous distributions, Rigby et al. (2019) trim a pre-specified amount of extreme observations to marginal quantiles towards the center of the outcome distribution. This basically leads to a mixed distribution (with spikes at the quantiles) before model fitting, which can be adjusted for by a bias correction to preserve Fisher consistency.
Another robustification approach, based on different robust evaluation functions (Eguchi and Kano 2001) applied on the log-likelihood level to reduce the influence of single response observations, was adapted and proposed for the GAMLSS framework by Aeberhard et al. (2021). It is implemented via the log-logistic function within the R package GJRM, which can be used as an extension to the classical gamlss package to fit GAMLSS. However, it is limited to smaller numbers of covariates, also in comparison to the number of observations.
To apply robust regression also for larger numbers of covariates up to high-dimensional cases, where the number of covariates p may exceed the sample size n, we propose a robust gradient boosting approach to fit GAMLSS. Similar to Aeberhard et al. (2021), we consider the robustified log-likelihood function \({\tilde{\ell }}_c\) as a direct penalisation of Eq. (2) by the log-logistic function \(\rho _c\)

\[ {\tilde{\ell }}_c({\varvec{y}}\,\vert \,\varvec{\theta }) = \sum _{i=1}^{n} \rho _c\big ( \log f(y_i\,\vert \,\varvec{\theta }_{i}) \big ), \quad (4) \]
leading to a similar optimisation problem as before, where the log-likelihood is replaced by its robustified version:

\[ \varvec{{\hat{\theta }}} = \underset{\varvec{\theta }}{\text {argmax}}\, {\tilde{\ell }}_c({\varvec{y}}\,\vert \,\varvec{\theta }). \quad (5) \]
The log-logistic function \(\rho _c\) is defined by

\[ \rho _c(z) = \log \left( \frac{1+\exp (z+c)}{1+\exp (c)} \right) \quad (6) \]
with derivative

\[ \rho _c'(z) = \frac{\exp (z+c)}{1+\exp (z+c)} = \frac{1}{1+\exp (-(z+c))}. \quad (7) \]
The log-logistic function and its derivative are illustrated in Fig. 1 for different values of the robustness constant c.
For each \(c>0\), the log-logistic function \(\rho _c\) is twice continuously differentiable on \({\mathbb {R}}\), convex and bounded from below, limiting the impact of observations with small log-likelihood values on the fitting process. Positive log-likelihood values remain nearly unchanged. Negative log-likelihood values are shifted towards zero so that the influence of single observations on the overall likelihood is bounded. A larger value of c leads to fewer observations whose influence gets restricted, while asymptotically, for \(c\rightarrow \infty\), the approach corresponds to the original, non-robust log-likelihood procedure. Smaller c values have the opposite effect and lead to a sharper cut-off. The slope of \(\rho _c\) is monotonically increasing, as can be seen in Fig. 1, with a slope close to 0 for most negative values. At \(z=-c\) the derivative \(\rho _c'\) reaches its point of inflection.
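These properties can be checked numerically. The following sketch assumes the parameterisation \(\rho _c(z)=\log \{(1+e^{z+c})/(1+e^{c})\}\), which matches the behaviour described above; it is an illustration, not the paper's implementation:

```python
import numpy as np

def rho_c(z, c):
    """Log-logistic trimming function, assumed parameterisation:
    rho_c(z) = log((1 + exp(z + c)) / (1 + exp(c)))."""
    # logaddexp(0, x) = log(1 + exp(x)), numerically stable
    return np.logaddexp(0.0, z + c) - np.logaddexp(0.0, c)

def rho_c_prime(z, c):
    """Derivative of rho_c: a logistic sigmoid with values in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(z + c)))

def b_low(c):
    """Lower bound -log(1 + exp(c)), approached as z -> -infinity."""
    return -np.logaddexp(0.0, c)
```

For instance, `rho_c(-100.0, 2.0)` is essentially `b_low(2.0)`, while `rho_c(50.0, 2.0)` stays close to 50: very negative log-likelihood values are lifted to the bound, positive ones are nearly untouched.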
In the context of gradient boosting, the most important part of the fitting procedure is the computation of the partial derivatives of the (robustified) log-likelihood for each additive predictor \(\eta _{\theta _k}\):

\[ \frac{\partial }{\partial \eta _{\theta _k}}\, {\tilde{\ell }}_c({\varvec{y}}\,\vert \,\varvec{\theta }) = \sum _{i=1}^{n} \rho _c'\big ( \ell (y_i\,\vert \,\varvec{\theta }_{i}) \big ) \cdot \frac{\partial }{\partial \eta _{\theta _k}}\, \ell (y_i\,\vert \,\varvec{\theta }_{i}). \quad (8) \]
The derivative \(\rho _{c}'\) takes values in the unit interval \((0, 1)\), which means that the gradient of the robustified log-likelihood in Eq. (8) is the log-likelihood gradient (3) multiplied by weights \(w_i=\rho _{c}'(\ell (y_i\,\vert \,\varvec{\theta }_i))\in (0,1)\) for \(i=1,\ldots ,n\), downweighting observations with extreme (negative) log-likelihood values. The choice of the robustness constant c therefore entails a trade-off between robustness for corrupted data and efficiency for uncorrupted data.
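A small numeric illustration of this weighting, assuming the logistic form of \(\rho _c'\) described above (the log-likelihood values are made up for the example):

```python
import numpy as np

# per-observation log-likelihood values; the last one is an extreme outlier
loglik = np.array([1.2, 0.8, -0.5, -1.1, -40.0])
c = 2.0

# gradient weights w_i = rho_c'(loglik_i): a logistic sigmoid in (0, 1)
w = 1.0 / (1.0 + np.exp(-(loglik + c)))

# well-fitted observations keep weights close to 1, while the outlier's
# contribution to the boosting gradient is almost completely removed
```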
For positive values of c, there is a natural lower bound of \(\rho _c\) given by the infimum over all log-likelihood values \(z\in {\mathbb {R}}\) (see Fig. 1), which can be calculated by considering the following limit:

\[ b_{\text {low}}(c) = \inf _{z\in {\mathbb {R}}} \rho _c(z) = \lim _{z\rightarrow -\infty } \rho _c(z) = -\log \big (1+\exp (c)\big ). \quad (9) \]
The value \(b_{\text {low}}(c)\) is negative for all \(c > 0\). This means that all log-likelihood values below this boundary are at least lifted to \(b_{\text {low}}(c)\). Since the scale of the log-likelihood itself depends on the statistical model and on the data, this should also be taken into account for the choice of the robustness constant c. As discussed above, the value of c determines the robustness of the final model. We opt for a choice of c that is simple to interpret: we propose to use an intercept model with maximum likelihood estimates for all distributional parameters, based exclusively on the full response sample \({{\textbf {y}}}\), to generate initial log-likelihood values. Under an assumed amount \(\tau\) of corruption of the response data, the quantile \(q_{\tau }\) gives an upper bound for corrupted observations. For the general distribution case this quantile is given by the empirical \(\tau\)-quantile \({\hat{Q}}_{\tau }\) of the per-observation log-likelihood values:

\[ q_{\tau } = {\hat{Q}}_{\tau }\big ( \ell (y_1\,\vert \,\varvec{{\hat{\theta }}}),\ldots ,\ell (y_n\,\vert \,\varvec{{\hat{\theta }}}) \big ). \quad (10) \]
If we now equate the \(\tau\)-quantile \(q_{\tau }\) of the log-likelihood values with the lower bound \(b_{\text {low}}\), we can solve for \(c_{\tau }\), which can be used for our model fit, guaranteeing that all log-likelihood values smaller than \(q_{\tau }\) are bounded by our method:

\[ b_{\text {low}}(c_{\tau }) = q_{\tau } \quad \Longleftrightarrow \quad c_{\tau } = \log \big ( \exp (-q_{\tau }) - 1 \big ). \quad (11) \]
Note that Eq. (11) can only be applied for \(q_{\tau }<0\), which holds in most practical cases. Exceptions are, for instance, distributions with very small variances; these are captured via an appropriate boundary within our implementation.
In cases where no prior information on the amount of corrupted observations is available (which might be the case in most practical settings), we recommend the default value \(\tau =0.05\). This means that \(5\%\) of the observations have lower log-likelihood values than \(b_{\text {low}}\), irrespective of the particular GAMLSS distribution. As an example, for an assumed amount of corruption \(\tau =0.05\), the intercept model of the Gaussian distribution leads to the quantile (with \({\hat{Q}}_{0.05}\) denoting the empirical \(5\%\) quantile of the per-observation log-likelihood values)

\[ q_{0.05} = {\hat{Q}}_{0.05}\left( \left\{ -\log \big ({\hat{\sigma }}\sqrt{2\pi }\big ) - \frac{(y_i-{\hat{\mu }})^2}{2{\hat{\sigma }}^2} \right\} _{i=1,\ldots ,n} \right) \quad (12) \]
with offset values \(\mu ={\hat{\theta }}_1= {\hat{\eta }}_{\theta _1}\) and \(\sigma = {\hat{\theta }}_2= \exp ({\hat{\eta }}_{\theta _2})\), given by the sample mean and the maximum likelihood estimate of the standard deviation of the response \({{\textbf {y}}}\). These are the maximum likelihood estimators for both parameters, resulting in a robustness constant \(c_{0.05}\).
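This quantile-based choice can be sketched for the Gaussian intercept model as follows. The sketch assumes the lower bound \(b_{\text {low}}(c)=-\log (1+e^{c})\) and the resulting formula \(c_{\tau }=\log (e^{-q_{\tau }}-1)\); it is a schematic re-implementation, not the code from the authors' repository:

```python
import numpy as np

def c_tau_gaussian(y, tau=0.05):
    """Sketch of the quantile-based choice of the robustness constant for a
    Gaussian intercept model: fit mean and standard deviation by maximum
    likelihood, take the tau-quantile q_tau of the per-observation
    log-likelihoods and solve b_low(c) = q_tau for c, i.e.
    c_tau = log(exp(-q_tau) - 1), which requires q_tau < 0."""
    mu, sigma = y.mean(), y.std()   # ML estimates of the intercept model
    loglik = (-0.5 * np.log(2.0 * np.pi) - np.log(sigma)
              - 0.5 * ((y - mu) / sigma) ** 2)
    q_tau = np.quantile(loglik, tau)
    if q_tau >= 0:                  # e.g. distributions with tiny variance
        raise ValueError("formula requires q_tau < 0")
    return np.log(np.exp(-q_tau) - 1.0)
```

For a standard normal sample, \(q_{0.05}\approx -2.84\) and hence \(c_{0.05}\approx 2.8\); a larger assumed corruption \(\tau\) gives a larger \(q_{\tau }\) and thus a smaller, more aggressive \(c_{\tau }\).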
Example code for our approach can be found in the supplement. The implementation and the code to reproduce the simulations is available on GitHub: https://github.com/JSpBonn/RobustGAMLSS.
3 Simulations
The specific goals of our simulation study were to investigate

A: how the robust boosting approach behaves under uncorrupted data situations in comparison to (non-robust) classical boosting for GAMLSS,

B: in which data situations the robust methods are beneficial, but also limited in their usage when facing corrupted data of different types,

C: how sensitive the results are with respect to the choice of \(\tau\) corresponding to the robustness constant \(c_{\tau }\),

D: and how the robust boosting method performs in comparison to the original robust fitting procedure for GAMLSS of Aeberhard et al. (2021) in low-dimensional settings.
Especially for high-dimensional settings, we also investigated the performance of the boosting approaches regarding variable selection, and we compared the computational runtime with the original version for the low-dimensional settings.
3.1 Settings
We considered a low-dimensional setting with \(p=5\) and a high-dimensional setting with \(p=1000\) explanatory variables with the same number of observations \(n=1000\). The general structure of a Gaussian distributed outcome \(y_i\) for \(i\in \{1,\ldots ,1000\}\) was given by

\[ y_i \sim {\mathcal {N}}\big (\mu _i,\, \sigma _i^2\big ) \quad \text {with} \quad \mu _i = 1 + 1.5\cdot x_{i1} - 0.75\cdot x_{i2}, \quad \sigma _i = \exp (0.5 - 0.25\cdot x_{i1} + 0.5\cdot x_{i3}), \]
while all other covariates remained uninformative. We considered a Toeplitz correlation structure with \(\rho =0.5\) for neighbouring covariates, i.e., \({\textbf{x}}_i\) followed a multivariate Gaussian distribution \({\mathcal {N}}_p({\textbf{1}},\Sigma )\) with covariance matrix entries \(\sigma _{i,j}=0.5^{\vert i-j\vert }\), resulting in diagonal entries \(\sigma _{i,i}=1\) and, for example, correlation \(\sigma _{2,5}=0.5^3\) between covariates \({\textbf{x}}_2\) and \({\textbf{x}}_5\).
We simulated different amounts of corrupted response observations \(\pi \in \{0\%,5\%,10\%,15\%,20\%\}\) by adding or subtracting \(4 \cdot \text {sd}({\textbf{y}})\) to generate a symmetric error (see Fig. 2 for an illustration). Furthermore, we also simulated a skewed corruption, where all error entries were added with their absolute values as one-sided bias. We applied Eqs. (10) and (11) for \(\tau \in \{0.01,0.05,0.10\}\) to specify the corresponding robustness constants \(c_{\tau }\).
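The corruption mechanism can be sketched in a few lines. This is an illustrative re-implementation; the Gaussian linear predictors below are our own assumption, chosen by analogy to the Gamma setting, and are not taken from the paper's code:

```python
import numpy as np

def simulate_corrupted_gaussian(n=1000, p=5, pi=0.05, seed=1):
    """Sketch of the simulation design: Toeplitz-correlated covariates and a
    fraction pi of responses shifted by +/- 4*sd(y) (symmetric corruption).
    The linear predictors are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])    # sigma_ij = 0.5^{|i-j|}
    X = rng.multivariate_normal(np.ones(p), Sigma, size=n)
    mu = 1.0 + 1.5 * X[:, 0] - 0.75 * X[:, 1]             # location predictor
    sigma = np.exp(0.5 - 0.25 * X[:, 0] + 0.5 * X[:, 2])  # scale predictor
    y = rng.normal(mu, sigma)
    n_corrupt = int(round(pi * n))                        # corrupt pi of the responses
    pos = rng.choice(n, size=n_corrupt, replace=False)
    signs = rng.choice([-1.0, 1.0], size=n_corrupt)       # symmetric: add or subtract
    y[pos] += signs * 4.0 * y.std()                       # sd of the uncorrupted response
    return X, y, pos
```

Dropping the random signs and adding the absolute error instead would yield the skewed (one-sided) corruption variant.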
The same data generating process was used for a Gamma-distributed outcome \(y_i\) with skewed corruption, with expected value \(\mu =\exp (1+1.5\cdot x_{i1}-0.75\cdot x_{i2})\) and a variance of \(\mu ^2/\sigma\) with \(\sigma =\exp (0.5 - 0.25\cdot x_{i1}+0.5\cdot x_{i3})^2\).
All individual explanatory variables were considered as componentwise linear base-learners within the boosting algorithms for each distribution parameter. The stopping iteration \(m_{\text {stop}}\) as the main tuning parameter of the boosting models was optimised in a resource-efficient way based on a validation data set of size \(n_{\text {validation}}=1000\) (which was corrupted in the same way as the training data), by minimising the mean empirical risk. To compare the different models and address questions A to D, we measured the predictive performance on an uncorrupted test data set of size \(n_{\text {test}}=1000\) via the negative log-likelihood values of a Gaussian or Gamma distribution, respectively, at the optimal stopping iteration.
As additional performance measures, we also considered the mean absolute deviation of the estimates of \(\eta _{\mu }\) and \(\eta _{\sigma }\) to their true values, the selected variables in the final models (true and false positive rate), the final stopping iteration, also in detail for both parameters separately (due to the usage of the non-cyclic updating method), the computational runtime, and in the specific case of the robustified gradient boosting approach also the \(c_{\tau }\) values. All simulations were performed \(B=100\) times using R version 4.0.5. Additional results of the simulation study (e.g., for different amounts of corruption and boundary values for c) can be found in the supplement.
3.2 Results
An overview of the prediction performance is given in Table 1 based on the average negative log-likelihood (NLL) values over \(B=100\) simulation runs. The classical boosting model shows low NLL values in all non-corrupted settings: the lowest for all three low-dimensional cases, the second lowest behind the robust boosting approach with \(\tau =0.01\) in both high-dimensional Gaussian cases and the lowest for the high-dimensional Gamma setting. Larger choices of the quantile \(\tau\) in uncorrupted settings lead to an increasing loss in prediction accuracy. As a summary for question A, we can conclude that there is typically a loss in prediction accuracy for non-corrupted data when using the robust boosting approach, especially for higher \(\tau\) quantiles, while for the choice \(\tau =0.01\) the loss in accuracy is negligible.
On the other hand, already for a small amount of corruption of \(5\%\) of observations, classical boosting falls short of all robust models, even in comparison to the NLL values of robust boosting for \(0\%\) corruption, and performs increasingly worse for larger amounts of corruption. In the low-dimensional case, this is particularly due to the incorporation of corrupted data points in the fitting process, while in the high-dimensional case, the classical boosting algorithm tends to stop quite early to avoid overfitting (cf. supplement Tables S8 and S16).
Regarding question B, the compact answer would be that all robust boosting models are beneficial for all considered amounts of corruption \(>0\%\). In detail, there is a tendency for \(\tau =0.01\) to perform better for smaller amounts of corruption, but in more extreme situations in the high-dimensional case, with amounts of corruption close to \(20\%\), it is no longer robust enough and its performance gets closer to classical boosting (while still being better). In such extreme cases, the quantile with \(\tau =0.10\) has its greatest advantages. The intermediate choice of \(\tau =0.05\) combines both beneficial behaviours: it does not induce a large loss in prediction accuracy for no or small amounts of corruption, but still yields robust results for higher amounts in all simulation settings.
The robustification via the log-logistic function in combination with the idea of choosing the robustness constant \(c_{\tau }\) in a data-driven way (based on reasonably low quantile values \(\tau\)) leads to overall good performance of the robust boosting models. The comparison of different \(\tau\) values shows limited differences in their NLL values. Still, when considering question C concerning the sensitivity regarding the choice of \(\tau\), it seems better to choose one of the smaller quantiles, for which the results stay robust until confronted with extreme amounts of corruption (\(\tau =0.01\) and \(20\%\), especially for high-dimensional data), while performing similarly to classical boosting when confronted with non-corrupted data.
Additional simulation results in the supplement to the paper show that the overall behaviour of all models had similar tendencies independently of the kind of corruption (symmetric or skewed), the dimensionality (low- or high-dimensional) and the distribution (Gaussian or Gamma). In particular, results regarding the mean absolute value of the estimated parameters and the variable selection properties are also in line with the performance regarding the NLL values (cf. Tables S2, S3 and S5, S6 for the Gaussian and Tables S12 and S14 for the Gamma setting of the supplement, respectively). Especially in the high-dimensional settings, the robust boosting models result in larger true positive rates than the classical boosting model when the data is corrupted by any amount.
Regarding question D, the comparison with the robust penalised maximum likelihood (robust ML) approach of Aeberhard et al. (2021) was only possible for the low-dimensional settings. There, however, this approach resulted in a very competitive performance regarding NLL values for the Gaussian setting. The initially chosen robustness constant \(c_{0.05}\) based on the \(\tau =0.05\) quantile is optimised during their fitting approach, which comes at the cost of longer computation runtimes (cf. Tables S7, S15). Within the simulations for the Gamma distribution, the NLL values are increased in comparison to the boosting approaches. Furthermore, the robust penalised ML approach is not applicable to high-dimensional settings and does not incorporate data-driven variable selection, in contrast to the proposed robust boosting algorithm.
4 Biomedical applications
To demonstrate the flexibility and adaptability of gradient boosting due to its modular structure, in combination with the benefits of robust model fitting for GAMLSS, we present three application examples from different biomedical research fields. In addition to the Gaussian distribution, we also consider the Gamma distribution for modelling positive right-skewed data. We use the same parameterization as in the simulations, where the expected value equals \(\mu\) and the shape parameter \(\sigma\) leads to a variance of \(\mu ^2/\sigma\) (with log-link for both parameters). For all examples, we chose the robustness constant c in the same way as in the simulations via Eqs. (10) and (11) with \(\tau =0.05\) for all applications and distributions. Additional information including runtimes is presented in the supplement Tables S17-S21.
4.1 Functional magnetic resonance brain imaging
When modelling brain activity parameters based on functional magnetic resonance brain imaging data, the assumption of homoscedasticity can be too restrictive (Landau et al. 2004). Furthermore, measurements can easily be influenced, e.g. via movement of the patient during testing, but also through physiological conditions like nearby veins. In such situations, robust regression methods with the capability of simultaneously fitting multiple parameters while accounting for potential heteroscedasticity are particularly suited.
We focus on the same application as Aeberhard et al. (2021) (based on Landau et al. 2004), who modelled brain activity via the median Fundamental Power Quotient \((\text {medFPQ})\) and reported both a robust and a classical GAMLSS. Given is the measured resonance of an experimental stimulus of a human brain subject for one 2D slice through the dorsal cerebral cortex for \(n=1567\) brain voxels. The dataset is publicly available via the R package gamair, where three replicates of FPQ for each voxel are summarised by their median, resulting in a right-skewed, non-negative distribution of \(\text {medFPQ}\). Note that no extreme observations (potential outliers) were excluded, which stresses the general idea of robust model fitting, for instance in contrast to (Wood 2006, p. 228).
In Aeberhard et al. (2021), the response variable \(\text {medFPQ}\) was modelled via a Gamma distribution in dependence of the spatial covariates \(x_1\) and \(x_2\), which represent the location of each voxel within the 2D brain slice. We fitted similar Gamma regression models, but using gradient boosting for GAMLSS based on the classical and the new robust version, in combination with a spatial base-learner (for \(x_1\) and \(x_2\)) for best comparability on the full data set. Both boosting approaches were tuned using 25-fold bootstrapping, which resulted in optimal stopping iterations \(m_{\text {stop}}=1480\) for the classical Gamma fit and \(m_{\text {stop}}=197\) for the robust version.
Figure 3 displays the results of the predicted \(\text {medFPQ}\) values for both boosting approaches as coloured deviations for different brain activities. It can be observed that the robust approach led to much smoother spatial effect estimates for both parameters. The classical boosting approach, in contrast, fitted rougher and more detailed spatial effects. This may also be due to the fact that the boosting algorithm stopped much later for the classical approach (a brain image with converged effect estimates can be found for both approaches in supplement Figure S3). For the location parameter (\(\eta _{\mu }\)), the classical model shows more heterogeneity and a larger effect in the left frontal lobe of the brain, while the robust model updated more regions for the shape parameter (\(\eta _{\sigma }\)), especially in the frontal lobe (stronger, but similar to the classical one). Overall, for this example our robust boosting results for GAMLSS are comparable to the ones presented by Aeberhard et al. (2021) (when taking the different parameterizations into account). An additional direct comparison between the two robust approaches (also regarding computation time) was included in the supplement Figure S4 using Gaussian models.
4.2 Thyroid hormone levels of a general adult population
Thyroid hormones (TH) play a major role in regulating basal metabolism in humans. TH synthesis is regulated by feedback mechanisms: decreased thyroid hormone levels lead to increased synthesis of hypothalamic thyrotropin-releasing hormone, which increases the secretion of thyroid-stimulating hormone (TSH). TSH stimulates the production of the thyroid hormones thyroxine (T4) and triiodothyronine (T3). In order to examine the relationship between thyroid function and hematological parameters, we analyzed the data from the A Estrada Glycation and Inflammation Study (AEGIS), a cross-sectional study conducted in the municipality of A Estrada in north-western Spain. An outline of the project can be found at https://www.clinicaltrials.gov (code: NCT01796184) and details also in Alende-Castro et al. (2019).
Blood samples were taken from \(n = 1516\) participants to describe the relationship between blood cell components and thyroid function. In total, we considered \(p = 33\) potential explanatory variables for \(n = 1151\) complete observations from participants who had not previously been diagnosed with any kind of thyroid dysfunction. To model the thyroid hormone level of T3 as the response variable (see Fig. 4), we boosted a Gaussian GAMLSS with all available candidate variables based on the classical boosting approach as well as the new robust variant. Besides fitting a model on the full data set, we performed 10-fold cross-validation to compare the selection frequencies of the variables in the boosting models and the prediction accuracy via the log-likelihood on test data. To avoid further reduction of the sample size for tuning via a validation set, the stopping iteration \(m_{\text {stop}}\) was determined by 25-fold bootstrap for all models (separately on each of the training folds of the 10-fold cross-validation). For comparison, we also fitted the original robust approach by Aeberhard et al. (2021) for the full model without variable selection.
While boosting the non-robust Gaussian model selected 5 variables for the location parameter \(\mu\) and 6 variables for the dispersion parameter \(\sigma\), the new robust boosting version stopped later and selected 15 predictors for the location, but only one variable (transferrin saturation) for \(\sigma\), which was also included relatively late in the final model (see supplement Table S11 for a complete overview of selected variables). From a biological point of view, these results are consistent with previous studies in which small differences in thyroid function are associated with significant differences in a range of clinical parameters. Anemia and other blood abnormalities are common in thyroid dysfunction. Iron metabolism (hemoglobin, ferritin, transferrin) is also closely connected to thyroid hormone metabolism (Bremner et al. 2012).
Regarding prediction accuracy, the classical boosting fit yielded slightly favorable results, with median negative log-likelihood values of \(55.34~ (\text {IQR}=16.93)\) compared to \(65.49 ~(\text {IQR}=26.26)\) for boosting the robust version, which might also be related to the later stopping (\(m_{\text {stop}}\) was on average \(27.50 ~(\text {sd}=12.55)\) for the classical Gaussian model and \(46.10 ~(\text {sd}=7.25)\) for the robust one). Similar likelihood results were also found for the full model by the original robust GAMLSS approach (55.22 with \(\text {IQR}=22.96\)). In general, there seems to be only one unconditioned extreme outlier (Fig. 4), which can also be handled well by the non-robust boosting approach given the relatively large sample size. On the other hand, the robust boosting variant does not incur a large loss in prediction accuracy when applied to essentially uncorrupted data (cf. Table 1). The most frequently selected variables for each approach were also the ones from the final model fitted on the full data.
4.3 Protein expression data via the NCI-60 cancer cell panel
High-dimensional data situations are very common in the field of genetics, where the number of potential covariates often exceeds the number of observations. We illustrate the performance of our approach for this situation via the NCI-60 data set, which includes only \(n=59\) human cancer cell lines in comparison to \(p = 14951\) gene expression measurements as potential predictors. The data set is available at https://discover.nci.nih.gov/cellminer/.
We followed the same data processing steps as in Speller et al. (2022) and Sun et al. (2020) and considered the positive and highly right-skewed KRT19 protein antibody array from the “Protein: Antibody Array DTB” set as response variable (Sun et al. 2020). The KRT19 protein data are an example with extreme kurtosis and some (extreme) outliers (the median value is 1.10, while the \(75\%\) quantile is 32.14, with values up to 257), implying a challenging modelling task that is particularly interesting for robust regression approaches.
We applied classical gradient boosting for GAMLSS using the Gamma distribution to compare it with our robust boosting approach. The predictor variables are given by the gene expression measurements, which we included via separate linear base-learners for all boosting models. Besides considering both classical and robust boosting on the full data set, we additionally performed leave-one-out cross-validation (LOOCV); in this case with \(n=59\), each model was fitted on 58 observations as training data and the remaining observation was used as test data to evaluate the negative log-likelihood of the original Gamma distribution. To avoid forming a validation set, all models were stopped early at their optimal stopping iteration via an inner 25-fold bootstrap on the respective training observations. To investigate the variable selection behaviour of boosting, we also computed selection frequencies on the 59 LOOCV training folds for all potential predictor variables.
The robust model fit on the full data resulted in an optimal stopping iteration of \(m_{\text {stop}}=152\), with only one selected predictor for the location parameter \(\mu\) (the KRT19 gene) and 7 selected genes for the shape parameter \(\sigma\) (see Fig. 5 for an illustration of the coefficient paths). The early stopping of the boosting algorithm results in shrinkage of effect estimates, which is nicely visible for SETD1A. Variables updated only after iteration \(m_{\text {stop}}\) are not included in the final model (see Fig. 5; the light-blue paths deviate from zero only for iterations \(m>m_{\text {stop}}\)). The only updated base-learner for \(\eta _{\mu }\) is the one referring to the KRT19 gene, which is included relatively late, suggesting that the default offset value for \(\mu\) was a good starting point and that the data may generally contain more information on the shape parameter \(\sigma\).
The non-robust Gamma regression, in contrast, resulted in an early stopping of the boosting algorithm at iteration \(m_{\text {stop}}=1\). As a result, only the intercept for \(\sigma\) was updated once, while \(\mu\) stayed at its offset value. This illustrates an interesting property of gradient boosting: when updates are not beneficial for the prediction, the tuned boosting model can stay extremely sparse to avoid overfitting. In case of extreme \(p \gg n\) situations with highly skewed data and outliers, this behaviour can be very reasonable. However, it does not necessarily mean that the data contain no valid information, as could be observed for the robust approach.
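The selection-and-shrinkage mechanism behind this sparsity can be illustrated with a minimal sketch of component-wise gradient boosting. For simplicity it uses a squared-error loss with simple linear base-learners rather than the GAMLSS likelihood, so it is a schematic illustration of the update rule, not the algorithm of Mayr et al. (2012).

```python
import numpy as np

def componentwise_boost(X, y, m_stop, nu=0.1):
    """Component-wise L2 gradient boosting with simple linear base-learners:
    in each iteration, fit every column separately to the current residuals
    (the negative gradient of the squared-error loss) and update only the
    single best-fitting coefficient by a small step nu."""
    n, p = X.shape
    beta = np.zeros(p)
    intercept = y.mean()              # offset
    fit = np.full(n, intercept)
    for _ in range(m_stop):
        resid = y - fit               # negative gradient
        col_ss = np.sum(X ** 2, axis=0)
        slopes = X.T @ resid / col_ss           # per-column least-squares slope
        rss = np.sum(resid ** 2) - slopes ** 2 * col_ss  # fit of each base-learner
        j = int(np.argmin(rss))       # best-fitting component only
        beta[j] += nu * slopes[j]
        fit += nu * slopes[j] * X[:, j]
    return intercept, beta
```

With `m_stop=1` this returns a model with a single updated coefficient, mirroring how an early-stopped boosting fit can remain almost an intercept model.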
To further assess the performance of both methods, Table 2 summarizes the results of LOOCV regarding prediction accuracy and variable selection. The non-robust boosting variant selected no genes at all and, due to early stopping, resulted in intercept models. In contrast, the robust approach selected on average two genes for \(\mu\), and the gene expression of KRT19 in particular was selected in nearly all cross-validation steps (58 of 59). For \(\sigma\), the results are also in line with the robust model on the full data: the 7 most frequently selected genes in LOOCV were also selected for \(\sigma\) on the full data (cf. Fig. 5). Taking the high dimensionality and the small sample size of \(n=59\) into account, the selection process for the predictors appears very stable for the robust approach and points to the same small subset of genes for further investigation. Regarding prediction accuracy, the robust GAMLSS fits were also beneficial and yielded lower negative log-likelihood values on the test observations (mean \(3.53 ~(\text {sd}=2.09)\) in comparison to \(4.39 ~(\text {sd}=1.97)\)).
5 Discussion
We have proposed a robust boosting approach for generalized additive models for location, scale and shape (GAMLSS, Rigby and Stasinopoulos 2005). It reduces the effects of outliers in the response variable while incorporating variable selection for potentially high-dimensional data. The core of the approach is a robustification of the likelihood via the log-logistic function, as recently proposed by Aeberhard et al. (2021). We incorporated the robustified likelihood as the loss function for a component-wise gradient boosting algorithm for GAMLSS (Mayr et al. 2012). To the best of our knowledge, our approach makes it possible for the first time to fit distributional regression models via GAMLSS with a robust boosting algorithm. Additionally, we have proposed a quantile-based, data-driven method for specifying the necessary robustness constant in practice.
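As a numerical sketch, the log-logistic robustification can be written down in a few lines. We assume here the functional form \(\rho_c(\ell) = \log\{(1 + e^{\ell + c})/(1 + e^{c})\}\) applied to pointwise log-likelihood contributions \(\ell\), following our reading of Aeberhard et al. (2021); the exact parameterization should be checked against the original paper.

```python
import numpy as np

def rho(loglik, c):
    """Log-logistic robustification of a pointwise log-likelihood value:
    approximately linear for large values, but trimmed towards the lower
    bound -log(1 + exp(c)) for very low (outlier-driven) values."""
    # np.logaddexp(0, x) is a numerically stable log(1 + exp(x))
    return np.logaddexp(0.0, loglik + c) - np.logaddexp(0.0, c)

def rho_weight(loglik, c):
    """Derivative of rho w.r.t. the log-likelihood: a factor in (0, 1)
    that down-weights the gradient of poorly fitting observations."""
    return 1.0 / (1.0 + np.exp(-(loglik + c)))
```

In a gradient boosting algorithm, this derivative rescales each observation's contribution to the negative gradient, so outlying observations influence the base-learner fits less while well-fitting observations pass through almost unchanged.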
The results of our simulation study suggest that our approach works well and provides robustness for reasonable amounts of corrupted observations (up to 15–20%), both for symmetric and skewed corruption types. At the same time, when applied to data without any outliers, the performance decreases only slightly compared to a classical non-robust boosting algorithm. A similar behaviour could also be observed in the three biomedical applications: our robust boosting approach was able to replicate the results of a previous robust analysis for functional magnetic resonance brain imaging, showing more robust effect estimates than classical boosting. However, in a much larger epidemiological data set on thyroid hormone levels with only one extreme outlier, the robust boosting version did not outperform the classical algorithm regarding prediction accuracy. This might be due to the fact that classical boosting can also lead to a relatively robust model fit by imposing stronger regularisation via early stopping. This could be observed in a more extreme way for the high-dimensional gene expression data from cancer cells: here, the classical boosting algorithm essentially selected only an intercept model, while the robust version identified the most influential genes, outperforming the classical boosting algorithm with respect to prediction accuracy.
Our approach also has several limitations. Due to the iterative fitting of the base-learners to the gradient of the loss, statistical boosting algorithms typically cannot provide standard errors for effect estimates. This is also true for the proposed robust variant, making it mostly suitable for exploratory data analysis. There are some workarounds based on permutation tests (Mayr et al. 2017; Hepp et al. 2019) and bootstrapping (Hofner et al. 2016), though they tend to become computationally very demanding for high-dimensional data. Even without these additional resampling measures, boosting and tuning a robust GAMLSS leads to a longer computation time than the classical approach by Aeberhard et al. (2021). The use of boosting for model fitting in low-dimensional settings is hence most favorable in situations where the variable selection properties or shrinkage of effect estimates are required. Another limitation is that we have tested the robust version only on a limited combination of distributions (Gaussian, Gamma) and base-learners (linear, spatial), and focused on the non-cyclic variant. There is no reason to believe that the approach should not work for other GAMLSS distributions, the cyclic update variant and/or base-learners available in the boosting framework (Mayr and Hofner 2018; Hofner et al. 2014) – nevertheless, this should be properly tested. Furthermore, even though the event of almost exclusively positive log-likelihood values is unlikely to occur in practical applications, it is possible that our proposed quantile-based selection of the tuning constant fails for extreme data situations. Further research is also warranted on the robustification of base-learners, as our approach is currently limited to outliers in the response variable. Another line of research could focus on an automated selection of the most appropriate (robust) response distribution based on prediction performance.
In this context, quantile regression (Koenker and Hallock 2001) could also be a natural alternative approach for robust distributional regression. In contrast to GAMLSS, it does not carry the risk of misspecifying the distribution. However, quantile regression does not focus on the complete conditional distribution, as each quantile is fitted separately, which can lead to crossing of neighboring predicted quantiles. GAMLSS, on the other hand, model the complete distribution and allow for the interpretation of effects directly on the location or scale parameters. If quantiles are needed, they can still be computed from the predicted distribution – without the risk of quantile crossing.
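The last point can be made concrete with a small sketch: given hypothetical GAMLSS-type predictions of a location and a scale parameter per observation (the numbers below are made up for illustration, for a Gaussian response), all quantiles are read off one fitted distribution per observation and are therefore monotone in the quantile level by construction.

```python
from statistics import NormalDist

# Hypothetical predicted distribution parameters for three observations
# (illustrative values, not taken from any of the fitted models):
mu = [1.0, 2.0, 3.0]       # predicted location
sigma = [0.5, 1.0, 2.0]    # predicted scale

def predicted_quantile(tau):
    """Quantile at level tau, computed from each predicted distribution."""
    return [NormalDist(m, s).inv_cdf(tau) for m, s in zip(mu, sigma)]

q10, q50, q90 = (predicted_quantile(t) for t in (0.1, 0.5, 0.9))
# Monotone in tau for every observation, i.e. no quantile crossing:
assert all(a < b < c for a, b, c in zip(q10, q50, q90))
```

Separately fitted quantile regressions carry no such guarantee, since nothing ties the individual quantile fits together.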
References
Aeberhard WH, Cantoni E, Marra G, Radice R (2021) Robust fitting for generalized additive models for location, scale and shape. Stat Comput. https://doi.org/10.1007/s11222-020-09979-x
Alende-Castro V, Alonso-Sampedro M, Vazquez-Temprano N, Tuñez C, Rey D, García-Iglesias C, Sopeña B, Gude F, Gonzalez-Quintela A (2019) Factors influencing erythrocyte sedimentation rate in adults: new evidence for an old test. Medicine. https://doi.org/10.1097/MD.0000000000016816
Alfons A, Croux C, Gelper S (2013) Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann Appl Stat 7(1):226–248
Amato U, Antoniadis A, De Feis I, Gijbels I (2021) Penalised robust estimators for sparse and high-dimensional linear models. Stat Methods Appl 30:1–48
Barrios EB (2015) Robustness, data analysis, and statistical modeling: the first \(50\) years and beyond. Commun Statist Appl Methods 22(6):543–556
Bremner AP, Feddema P, Joske DJ, Leedman PJ, O’Leary PC, Olynyk JK, Walsh JP (2012) Significant association between thyroid hormones and erythrocyte indices in euthyroid subjects. Clin Endocrinol 76(2):304–311. https://doi.org/10.1111/j.1365-2265.2011.04228.x
Bühlmann P (2006) Boosting for high-dimensional linear models. Ann Stat 34(2):559–583
Eguchi S, Kano Y (2001) Robustifing maximum likelihood estimation by psi-divergence. ISM Res Memo 802
Friedman JH (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232
Griesbach C, Mayr A, Bergherr E (2023) Variable selection and allocation in joint models via gradient boosting techniques. Mathematics 11(2):411
Hans N, Klein N, Faschingbauer F, Schneider M, Mayr A (2023) Boosting distributional copula regression. Biometrics. https://doi.org/10.1111/biom.13765
Hastie T, Tibshirani R (1990) Generalized additive models. Chapman & Hall, London
Hepp T, Schmid M, Mayr A (2019) Significance tests for boosted location and scale models with linear base-learners. Int J Biostat. https://doi.org/10.1515/ijb-2018-0110
Hofner B, Mayr A, Robinzonov N, Schmid M (2014) Model-based boosting in R: a hands-on tutorial using the R package mboost. Comput Stat 29:3–35
Hofner B, Kneib T, Hothorn T (2016) A unified framework of constrained regression. Stat Comput 26:1–14
Hofner B, Mayr A, Schmid M (2016) gamboostLSS: an R package for model building and variable selection in the gamlss framework. J Stat Softw. https://doi.org/10.18637/jss.v074.i01
Huber PJ (1981) Robust statistics. Wiley, New York
Ju X, SalibiánBarrera M (2021) Robust boosting for regression problems. Comput Stat Data Anal. https://doi.org/10.1016/j.csda.2020.107065
Kneib T, Silbersdorff A, Säfken B (2021) Rage against the mean – a review of distributional regression approaches. Econom Stat. https://doi.org/10.1016/j.ecosta.2021.07.006
Koenker R, Hallock KF (2001) Quantile regression. J Econ Perspect 15(4):143–156
Landau S, EllisonWright IC, Bullmore ET (2004) Tests for a difference in timing of physiological response between two brain regions measured by using functional magnetic resonance imaging. J Roy Stat Soc: Ser C (Appl Stat) 53(1):63–82
Lugosi G, Mendelson S (2019) Mean estimation and regression under heavytailed distributions: a survey. Found Comput Math 19(5):1145–1190
Maronna RA, Martin RD, Yohai VJ, Salibián-Barrera M (2019) Robust statistics: theory and methods (with R), 2nd edn. John Wiley & Sons, New York
Mayr A, Hofner B (2018) Boosting for statistical modelling: a nontechnical introduction. Stat Model 18(3–4):365–384
Mayr A, Fenske N, Hofner B, Kneib T, Schmid M (2012) Generalized additive models for location, scale and shape for high-dimensional data – a flexible approach based on boosting. J Roy Stat Soc: Ser C (Appl Stat) 61(3):403–427
Mayr A, Schmid M, Pfahlberg A, Uter W, Gefeller O (2017) A permutation test to analyse systematic bias and random measurement errors of medical devices via boosting location and scale models. Stat Methods Med Res 26(3):1443–1460
Monti GS, Filzmoser P (2022) Robust logistic zero-sum regression for microbiome compositional data. Adv Data Anal Classif 16:301–324. https://doi.org/10.1007/s11634-021-00465-4
Rigby RA, Stasinopoulos DM (2005) Generalized additive models for location, scale and shape. J Roy Stat Soc: Ser C (Appl Stat) 54(3):507–554
Rigby RA, Stasinopoulos MD, Heller GZ, De Bastiani F (2019) Distributions for modeling location, scale and shape: using GAMLSS in R. CRC Press, Boca Raton
Speller J, Staerk C, Mayr A (2022) Robust statistical boosting with quantile-based adaptive loss functions. Int J Biostat. https://doi.org/10.1515/ijb-2021-0127
Stöcker A, Brockhaus S, Schaffer SA, von Bronk B, Opitz M, Greven S (2021) Boosting functional response models for location, scale and shape with an application to bacterial competition. Stat Model 21(5):385–404
Strömer A, Staerk C, Klein N, Weinhold L, Titze S, Mayr A (2022) Deselection of base-learners for statistical boosting – with an application to distributional regression. Stat Methods Med Res 31(2):207–224
Strömer A, Klein N, Staerk C, Klinkhammer H, Mayr A (2023) Boosting multivariate structured additive distributional regression models. Stat Med 42(11):1779–1801. https://doi.org/10.1002/sim.9699
Sun Q, Zhou W, Fan J (2020) Adaptive Huber regression. J Am Stat Assoc 115(529):254–265
Thomas J, Mayr A, Bischl B, Schmid M, Smith A, Hofner B (2018) Gradient boosting for distributional regression: faster tuning and improved variable selection via noncyclical updates. Stat Comput 28(3):673–687. https://doi.org/10.1007/s11222-017-9754-6
Wood SN (2006) Generalized additive models: an introduction with R. Chapman & Hall/CRC, London
Funding
Open Access funding enabled and organized by Projekt DEAL.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Speller, J., Staerk, C., Gude, F. et al. Robust gradient boosting for generalized additive models for location, scale and shape. Adv Data Anal Classif (2023). https://doi.org/10.1007/s11634-023-00555-5