Robust Fitting for Generalized Additive Models for Location, Scale and Shape

The validity of estimation and smoothing parameter selection for the wide class of generalized additive models for location, scale and shape (GAMLSS) relies on the correct specification of a likelihood function. Deviations from such assumption are known to mislead any likelihood-based inference and can hinder penalization schemes meant to ensure some degree of smoothness for non-linear effects. We propose a general approach to achieve robustness in fitting GAMLSSs by limiting the contribution of observations with low log-likelihood values. Robust selection of the smoothing parameters can be carried out either by minimizing information criteria that naturally arise from the robustified likelihood or via an extended Fellner-Schall method. The latter allows for automatic smoothing parameter selection and is particularly advantageous in applications with multiple smoothing parameters. We also address the challenge of tuning robust estimators for models with non-linear effects by proposing a novel median downweighting proportion criterion. This enables a fair comparison with existing robust estimators for the special case of generalized additive models, where our estimator competes favorably. The overall good performance of our proposal is illustrated by further simulations in the GAMLSS setting and by an application to functional magnetic resonance brain imaging using bivariate smoothing splines.


Introduction
Generalized additive models for location, scale and shape (GAMLSS) are flexible non-parametric regression models that have been introduced by Rigby and Stasinopoulos (2005); see also the recent book and tutorial by Stasinopoulos et al. (2017) and Stasinopoulos et al. (2018) for a review. These models allow to use explanatory variables not only to model the location parameter (e.g., the mean) of a response distribution, like in generalized additive models (GAM; Hastie and Tibshirani, 1990), but also the scale and shape parameters. GAMLSSs also go beyond the exponential family of distributions. In fact, the approach can be seen more broadly as a way to model any parameter of any given distribution. As such, some authors refer to it as distributional or multi-parameter regression (e.g., Burke and MacKenzie, 2017, Lang et al., 2014, Pan and Mackenzie, 2003, Stasinopoulos et al., 2018. Software availability for a wide range of families of distributions, such as the R package gamlss (Stasinopoulos and Rigby, 2019), have helped making these models very popular and widely applied in several fields: we can cite Glasbey and Khondoker (2009) (normalizing cDNA microarray), Rudge and Gilchrist (2005) (health impact of temperatures in dwellings), De Castro et al. (2010) (long-term survival models for clinical studies), Beyerlein et al. (2008) (childhood obesity), and Cole et al. (2009) (charts for child growth curves).
The motivation of this paper comes from challenging applications like the one presented in Section 5. The study first reported in Landau et al. (2003) investigates differences in the brain physiological response to controlled stimuli between anatomically distinct regions. The brain activity response is measured at voxels in a brain slice (a 2D raster image) with sole explanatory variables being the coordinates identifying the location of each voxel. The measurements are highly noisy but the mean response level and its spread are believed to vary smoothly over the brain slice, thus prompting non-linear effects for both location and scale parameters. Wood (2017, p. 329) identified two voxel responses in these data that were deemed too extreme, and were then discarded for the subsequent analysis. We believe a robust fitting of a GAMLSS is hence appropriate here, and this for two reasons: to guarantee that estimates and uncertainties are reliable, and to identify potentially outlying observations in an automated way thanks to robustness weights.
The fitting of GAMLSSs is typically performed by penalized maximum likelihood (ML) estimation. For datasets like the one above, where extreme observations likely occur, the ML estimation procedure suffers from a lack of robustness, meaning that the estimated smooth functions can be distorted by the outliers. Both the nonparametric function estimates themselves and the choice of the smoothing parameters associated to them are affected. To address these issues, we introduce a general robust estimator for GAMLSSs. Our approach covers special cases where robustness has been previously addressed, in particular in the (extended) GAM context (Alimadad and Salibian-Barrera, 2011, Croux et al., 2012, Wong et al., 2014. These works, however, cannot be extended to the more general setting of GAMLSS. Specifically, in contrast with the cited literature which acts at the level of the score equations, we introduce robustness by modifying the objective function following an idea introduced by Eguchi and Kano (2001). We also propose a novel and general procedure to tune the robustness parameter associated with the robust approach. This problematic issue has been partially ignored in the literature for robust (extended) GAMs. For the selection of the smoothing parameters, we additionally propose robust versions of the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC), that can be typically minimized in a grid search, and an adaptation of the Fellner-Schall automatic multiple smoothing parameter selection method (Wood and Fasiolo, 2017), which has important practical advantages. The proposed robust models can be easily used via the newly-revised gamlss() function in the R package GJRM (Marra and Radice, 2020).
In Section 2 we introduce the GAMLSS framework and the related estimation procedure which is based on penalized maximum likelihood. Our proposal is fully introduced in Section 3, with subsections devoted to the definition of a penalized robust objective function, theoretical properties and inference, the practical implementation of the estimation procedure, smoothing parameter selection, and the challenge of the choice of the robustness tuning constant. In Section 4 we present two simulation studies to highlight the good behavior of our proposal: one in the GAMLSS setting with a design mimicking the brain imaging data example, and one in the special case of a GAM to allow comparison with existing robust procedure in this context. The brain imaging data analysis is then presented in Section 5, while conclusions are given in Section 6.
where the functions g d are applied element-wise, 1 n is an n-dimensional vector of ones, the (n × J dk ) matrix X dk has (i, j)th element b dkj (x dki ), and β dk = (β dk1 , . . . , β dkJ kd ) . The predictors can thus be re-written as We note that our results and methods here are understood in a fixedknot framework, i.e. that the number of basis functions is fixed at a high value so that any approximation bias in s dk (x dki ) is negligible compared to estimation variability (as in e.g. Vatter and Chavez-Demoulin, 2015).
To enforce a certain degree of smoothness for every approximated s dk (·) function, each β dk has an associated quadratic penalty λ dk β dk D dk β dk , where D dk only depends on the choice of basis functions. The smoothing parameter λ dk ∈ [0, ∞) controls the trade-off between fit and smoothness and plays a crucial role in determining the shape of the estimatedŝ dk (·). The overall penalty can be written as Following Wood (2017), the approximated s dk (·) smooth functions are subject to centering constraints to ensure identifiability. Examples of smooth function specification include onedimensional, multi-dimensional, random field and random effect smoothers; see e.g. Wood (2017) for details. Note that we have considered distributions with up to three parameters (location, scale and shape), hence the adopted notation with d = 1, 2, 3, yet the proposed framework can be conceptually extended to distributions with more parameters in a straightforward manner. The families of distributions implemented in this work are listed in Table S1 in Web Appendix D.

Penalized Log-likelihood
Let δ = (β 1 , β 2 , β 3 ) ∈ ∆ ⊆ R p denote the full model parameter vector. Given a sample of n realizations y 1 , . . . , y n , the log-likelihood function corresponding to (2) is given by where f (y i |·) can either denote the probability density function (pdf) or the probability mass function (pmf) corresponding to the distribution D. Because of the flexibility of the smooth terms, the use of an unpenalized optimization algorithm is likely to result in unduly wiggly estimates (e.g. Wood, 2017). Estimation is thus typically performed by maximizing the penalized version p (δ) = (δ) − 1 2 δ Sδ, where S = diag(D 1 , D 2 , D 3 ). The smoothing parameters contained in the D d 's make up the vector λ = (λ 1 , λ 2 , λ 3 ) . Estimation of δ is typically achieved for a given value of λ, while the selection of λ is often performed by minimizing some prediction error criterion, either as an outer optimization or in an alternating scheme (Wood, 2017). Examples of such a criterion include cross-validation (CV; e.g. Hastie and Tibshirani, 1990) and generalized cross-validation (Craven and Wahba, 1979) estimates of prediction error, as well as estimates of the Kullback-Leibler divergence between a true model and the fitted one such as the AIC, and the Generalized Information Criterion (GIC) of Konishi and Kitagawa (1996).

Robust Estimation
The estimation procedures mentioned in the previous section rely on strict distributional assumptions. These methods are known to be highly sensitive to deviations from model assumptions (e.g. Hampel et al., 1986, Huber andRonchetti, 2009). To this end, we propose a general robust fitting approach, which is valid for the entire class of GAMLSS and that directly yields robust criteria for the selection of smoothing parameters.

Penalized Robustified Log-likelihood
Based on the Ψ-divergence approach of Eguchi and Kano (2001), we introduce the robustified log-likelihood˜ where, for a given δ, the user-specified ρ c function is designed to reduce low log-likelihood values (δ) i while leaving large log-likelihood values essentially unchanged, and is a correction term ensuring Fisher consistency (see Theorem 1 below), where ρ c is directly derived from the specified ρ c through where ρ c (s) = ∂ρ c (s)/∂s. The ρ function is indexed by a so-called robustness tuning constant c > 0 which regulates the trade-off between loss of estimation efficiency, should the data exactly come from the assumed GAMLSS, and the magnitude of the maximum estimation bias should the data not come from the postulated model. For any given c, ρ c is assumed to be convex, monotonically increasing and twice continuously differentiable over R and have bounded first derivative ρ c within [0, 1]. The latter can be interpreted as a multiplicative robustness weight, as one would do when weighting the estimating equations in robust Mestimation. The important difference here is that the "robustification" happens at the loglikelihood level and not by directly applying weights at the score level, such as in Wong et al. (2014) for example. An advantage of our approach is that it leads to a natural definition of robust criteria for the selection of smoothing parameters (see Section 3.2), for instance. Eguchi and Kano (2001) proposed the following log-logistic ρ function: with corresponding ρ c (z) = exp(z) − exp(c) log 1 + exp(z + c) and first derivative ρ c (z) = exp(z + c)/ 1 + exp(z + c) . Web Figure S1 in Web Appendix C displays the log-logistic ρ c and its first derivative. It illustrates how a smaller value of c leads to an earlier flattening of the ρ function applied on log-likelihood contributions, thus limiting earlier their impact. Note that lim c→∞ ρ c (z) = z so that an increasingly large c value leads to the (non-robust) original (δ). We discuss the choice of c in Section 3.5. For a given smoothing parameter λ, we define our robust estimatorδ =δ(λ) by maximizing the penalized robustified log-likelihood where the penalty is identical to that of non-robust penalized estimation. Indeed, our robustification scheme targets only deviations in the response variable, the latter which does not appear in δ Sδ so that only contributions to the unpenalized log-likelihood (δ) need to be accounted for. The robust estimator is thus the solution in δ to the following estimating equations (first-order conditions): In (6), the response variable Y i only appears through (δ) i since b ρ is an expectation. Thus ρ c indeed plays the role of a multiplicative weight within [0, 1] which limits the impact of potentially deviating observations given some δ. This robustness weight is proved useful both for selecting c (see Section 3.5) and as a diagnostic tool (see the data analysis in Section 5).

Asymptotic Properties and Inference
The unpenalized robust estimator which maximizes˜ (δ) admits a statistical M -functional representation T (F ), for some generic probability distribution F , which is the solution in δ with expectations taken under F . Thus, the finite-sample solution in δ to can be written as T (F n ), where F n denotes the empirical distribution putting mass 1/n on each observation. T (F n ) amounts to an unpenalized robust estimator.
To discuss the asymptotic properties of the proposed (penalized) robust estimator, we define δ 0 as the parameter value to which the unpenalized MLE maximizing (δ) in (3) converges, as n → ∞. By viewing δ 0 as the "true" parameter that generates the data under distribution D with moments defined in Equation (2), Theorem 1 below establishes the Fisher consistency ofδ and its asymptotic distribution; the proof is deferred to Web Appendix A.
Theorem 1. Under conditions (C1)-(C5) in Web Appendix A, as n → ∞ the penalized robust estimatorδ admits the same M -functional representation T as the unpenalized robust estimator and we have T (D) = δ 0 . Moreover, where the asymptotic covariance matrix is given by the so-called sandwich formula with expectations taken under the assumed distribution D.
In Theorem 1, T (D) = δ 0 means thatδ is Fisher consistent: it returns the true parameter when T is evaluated at the assumed distribution D, which implies thatδ is asymptotically unbiased for δ 0 . The influence function (IF; Hampel, 1974) of the Fisher consistent functional T is proportional to the score ψ(Y, δ) given in (7). This score being bounded in the response variable Y thanks to ρ c ∈ [0, 1], the IF is itself bounded. This guarantees a bounded maximum asymptotic bias under arbitrary contamination in Y , which is the main robustness property ofδ.
Remark 1. The asymptotic variance V(δ 0 ) in Theorem 1 corresponds to an unpenalized robust estimation because we assume the usual asymptotically vanishing penalty for consistency (see condition (C5) in Web Appendix A). A better approximation of the finite-sample covariance matrix with non-zero penalty can be obtained from a Taylor expansion of the penalized robustified score, as given in Equation (2) in Web Appendix A. It amounts to In these expressions, δ 0 being unknown in practice one would typically "plug-in" the estimateδ to compute standard errors. This allows for the computation of approximate (point-wise) confidence intervals, which can then be interpolated for confidence bands for non-linear effects. See for instance Croux et al. (2012, p. 39) for the analogue in the extended GAM setting.
Remark 2. An alternative covariance can be computed following an empirical Bayes approach, which is often reported to lead to good finite-sample coverage of confidence intervals in the frequentist sense (see e.g. Wood, 2012, Wood, 2017). For a given λ, viewing the quadratic penalty as an improper Gaussian prior distribution for δ (seen as a random vector here), with mean zero and covariance S −1 , the joint density of (Y , δ) is given, up to normalization constants, by L(y, δ; λ) = exp ˜ (δ) exp − δ Sδ/2 |S| 1/2 , with | · | denoting matrix determinant. We seek the covariance of the posterior distribution of δ|Y , as the posterior mode corresponds to the robust estimateδ. As in Wood and Fasiolo (2017), a second-order Taylor expansion of the posterior log-density about its mode reveals that as n → ∞ the posterior distribution approaches a multivariate Gaussian with covariance given by M p (δ) −1 . Our experience is that the observed version of this posterior covariance matrix, ∂δ∂δ , can be used as a computationally efficient alternative to V p (δ).
The effective degrees of freedom (edf) of smooth terms are a valuable tool for assessing the degree of smoothness achieved by a fit. We follow the discussion of Wood (2017, Chapter 6) based on links with generalized linear mixed models and restricted ML estimation to obtain that the edf of a GAMLSS robust fit is tr . This term matches the "penalty term" of our robust AIC introduced in Section 3.4 below.

Estimation Approach and Implementation
To maximize (5), we have modified the efficient and stable trust region algorithm of Marra et al. (2017) to accommodate the robustified objective function and corresponding correction term b ρ (δ). Estimation of δ and λ is carried out as follows. At iteration a, holding λ fixed and for some tuning constant value c, for a given δ [a] we maximize equation (5) using a trust region algorithm (Conn et al., 2000): where · denotes the Euclidean norm, ∆ [a] is the radius of the trust region which is d for d = 1, 2, 3, and the Hessian matrix H has , for d, h = 1, 2, 3. Equation (9) uses a quadratic approximation of −˜ p about δ [a] (the so-called model function) in order to choose the best e [a+1] within the ball centered in δ [a] of radius ∆ [a] , the trust region. Close to the converged solution, the trust region usually behaves like an unconstrained optimization algorithm.
Trust region algorithms have several advantages over classical alternatives. For instance, in line search methods, when an iteration falls in a long plateau region, the search for step δ [a+1] can occur so far away from δ [a] that the evaluation of the model log-likelihood may be indefinite or not finite, in which case the user's intervention is required. Trust region methods, on the other hand, always solve the sub-problem (9) before evaluating the objective function. So, if˜ p is not finite at the proposed δ [a+1] then step e [a+1] is rejected, the trust region shrunken, and the optimization computed again. The radius is also reduced if there is no agreement between the model and objective functions (i.e., the proposed point in the region is not better than the current one). Reversibly, if an agreement occurs, the trust region is expanded for the next iteration. In summary, δ [a+1] is accepted if it improves over δ [a] and allows for the evaluation of˘ p , whereas the reduction/expansion of ∆ [a+1] is based on the similarity between model and objective functions. Theoretical and practical details of the method can be found in Nocedal and Wright (2006, Chapter 4) and Geyer (2015). The latter also discusses the necessary modifications to the sub-problem (9) and the radius for ill-scaled variables.
The analytical score and Hessian of (the non-robust) (δ) can be derived in a modular way. This allows for a direct extension to other families of distributions not included in Table 1 in Web Appendix D as long as their pdf/pmf are known and their derivatives with respect to their parameters exist. Regarding the optimization of the robustified˜ p (δ), the integral defining b ρ (δ) in (4), as well as its derivatives, in general have to be approximated. For discrete distributions over countably infinite supports, this amounts to a straightforward truncation of a converging infinite sum. For continuous distributions, we rely on a unidimensional adaptive Gaussian quadrature rule for which we compute data-based finite bounds for numerical stability and to increased speed.

Robust Selection of Smoothing Parameters
Our robustification scheme with ρ c directly applied on log-likelihood contributions has the advantage of yielding a natural robust AIC (RAIC). Following the construction of the generalized information criterion (GIC) of Konishi and Kitagawa (1996), we can define the Kullback-Leibler divergence d KL between the true distribution G that generated the data, with density g, and the distribution corresponding to our robustified likelihood (up to normalization constants) as The generic random variable Y here stands for an out-of-sample observation to be predicted, thus d KL represents a measure of prediction error. Minimizing d KL with respect to δ is equivalent to maximizing E G [˜ (δ, Y )] since the first term on the right hand side of (10) is a constant. But because G is unknown, the In the GIC framework, the first-order correction of this bias depends on the estimator used for δ. We consider here the penalized robust estimatorδ, so that by Theorem 2.2 of Konishi and Kitagawa (1996) the bias correction amounts to tr M p (δ) −1 Q(δ) = tr (M(δ) + S) −1 Q(δ) . Thus we define the RAIC as where recall that S = S(λ), and the observed matrices M(δ) and Q(δ) allow for fast computations. Selecting λ can thus be done by minimizing RAIC(λ). In (11), since all terms are based on the robustified˜ (δ), the RAIC naturally inherits robustness and the selected λ is thus expected to remain stable in the presence of model deviations.
Minimizing an AIC-type criterion for smoothing parameter selection is known to favor more complex models, with function estimates more on the wiggly side. As this feature may carry over to our RAIC, an alternative is to consider a robust version of the Bayesian Information Criterion where its heavier penalty coefficient (log(n) rather than 2) generally favors simpler models, with smoother function estimates. Similarly to Wong et al. (2014), in our setting a robust BIC (RBIC) is naturally given by That being said, the proposed RAIC and RBIC procedures involve two nested optimizations: an inner optimization for computingδ given λ, and an outer optimization over λ. The high computational cost involved makes the selection of λ nearly unfeasible, or unbearably slow, whenever more than one or two smoothers are considered. We therefore propose an alternative robust selection method that can be automated as part of the estimation process with little computational overhead. This alternative is a robust version of the Fellner-Schall method recently introduced in Wood and Fasiolo (2017), which we will call the extended Fellner-Schall (EFS) method. Web Appendix B provides the detailed development, the main ideas can be summarized as follows. First, we take the empirical Bayes viewpoint as in Remark 2 above to consider the quadratic penalty as an improper Gaussian prior on δ, resulting in the the joint (robustified) likelihood L(y, δ; λ). Next, we approximate the integral defining the marginal likelihood L(y; λ) = ∆ L(y, δ; λ) dδ by Laplace's method. By considering the estimateδ =δ(λ) as based on a previous iterate for λ, we obtain a tractable expression for (the Laplace-approximated) ∂ log L(y; λ)/∂λ. Finally, we follow the heuristic reasoning of Wood and Fasiolo (2017) to derive the following update from iteration [k] to [k + 1] for all elements of λ: whereδ =δ(λ [k] ) here. In this expression, ∂S(λ)/∂λ j is straightforward to write down and implement since S(λ) is block-diagonal and each block is typically linear in the components of λ and only involves the (known) basis functions. We note that under the conditions of Theorem 1 the update guarantees by construction that λ remains positive and that the iterates converge whenever the gradient with respect to λ gets arbitrarily close to zero. This update rule can thus be alternated with computingδ in an automated and efficient way since both rely on similar quantities (see Section 3.3).
Remark 3. The proposed EFS method is simple to implement and avoids unfeasible grid searches. All that is required is a set of explicit formulas, as given above, to update λ in order to increase the (Laplace-approximated) marginal robustified log-likelihood. Our derivation also highlights the method's broader appeal since it can be easily adapted to modeling situations requiring the use of non-standard models and estimators (i.e. beyond the robust estimation in this paper) as long as a Laplace-approximated marginal likelihood is available.

Choice of the Robustness Tuning Constant
The robustness tuning constant c regulates how early ρ c starts to diminish the contribution of an observation to the objective function˜ . The choice of c is typically made before fitting the model to data by targeting a certain loss of estimation efficiency with respect to the MLE at the assumed model. With strictly parametric models, the usual criterion is the ratio of the traces of the asymptotic covariance matrices of the model parameters. But with non-parametric models, where basis function coefficients are subject to some smoothness constraint (as is the case here) the asymptotic covariance matrices of the penalized MLE and of the robust estimator are not necessarily comparable. The reason is that robust estimation may achieve a different degree of smoothness, i.e. a different bias-variance tradeoff stemming from different λ values selected by minimizing some prediction error criterion. If the two estimators achieve different degrees of smoothness, then the coefficients variances are not necessarily on the same scale and are thus not comparable. One may constrain the smoothness to be similar between the two estimation methods, but this would defeat the purpose of robustness: we are indeed interested in potential differences between the fitted functions and typically suspect that deviating observations may push classical estimates to be too wiggly. Hence the need for a different criterion for the choice of c. We note that previous works (Alimadad and Salibian-Barrera, 2011, Croux et al., 2012, Wong et al., 2014 have not discussed this important issue, resorting to somewhat default values for c taken from strictly parametric cases. We propose a novel general criterion for the selection of the tuning parameter c which covers both additive models and strictly parametric ones. It is simulation-based and relies on the heuristic idea of controlling how the robustness weights at the score level (represented here by ρ c ) behave under data generated from the assumed model. Our procedure is as follows: Step 1: For a given tuning constant value c, compute the robust estimatorδ c on the original data by maximizing (5), including the optimal smoothing parameterλ c .
Step 2 Step 3: The criterion corresponding to c is the median downweighting proportion (MDP) over the B independent replicates: median{w 1 /n, . . . , w B /n}.
Since ρ c ( (δ) b,i ) ∈ [0, 1] for any δ by construction, the ratio w b /n indeed represents how much downweighting has occurred on a particular sample y b . The value w b /n = 1 indicates no downweighting at all, i.e. the corresponding estimate is the penalized MLE. We empirically confirmed over a variety of models (through simulations not presented here) that the MDP indeed increases monotonically with c until reaching one and remaining constant beyond that. This implies that our new criterion shares a one-to-one relation with the traditional criterion of the ratio of the trace of the asymptotic covariance matrices within the subset of c values that lead to some downweighting under the given design. The MDP is not asymptotic and is in effect tailored to the model and design of the data under study. We finally note that no heavy computation is involved in Step 2: we do not estimate parameters on the simulated y b vectors, we only need to evaluate the log-likelihood at the true parameter δ c that generated the sample. In addition, our experience is that Monte Carlo simulation variability is quite small in the MDP so that B = 100 seems sufficient for most practical purposes.

Simulation Studies
To investigate the finite sample properties of the proposed estimator, we carry out two simulation studies. In the first one, we assess the robustness properties of our methodology in a GAMLSS setting inspired by the motivating data we analyze in Section 5. In the second simulation study, we compare our proposal to existing alternatives in the simpler setting of a GAM. All computations are performed in R (R Development Core Team, 2020). Our robust estimator is available in the R package GJRM (Marra and Radice, 2020).

Simulation under a GAMLSS
The motivating brain imaging data introduced in Section 5 have a nonnegative response variable representing the physiological activation level of voxels in a "brain slice". There are two covariates, X and Y, defining the location of each voxel. The combinations of X and Y result in a sample size of n = 1567. In this simulation study, we use the covariates to generate a response for each voxel according to a GAMLSS with a gamma distribution with expectation µ and variance σ 2 µ 2 where log(µ) = η 1 = s 1 (X, Y) and log(σ) = η 2 = s 2 (X, Y). The smooth functions s 1 and s 2 are constructed to mimic the main features of the fitted surfaces on the real data in Section 5, see Figure S2 in Web Appendix C.
To generate data that is contaminated in a similar way to what is observed in the real data, we modify a clean simulated dataset by choosing at random 78 (= 5%) of the responses falling in the upper-right corner of the brain slice, for X > 70 and Y > 30, and by adding 10 to their original value. We simulate 200 replications of the above in both a "clean" scenario (at the assumed model) and in the contaminated scenario. For each replication we fit a gamma GAMLSS with log links for both µ and σ, both with a classical (ML) and with our robust estimation method. We use bivariate thin plate regression splines with k=100 bases to approximate the s 1 and s 2 smooth functions. Both estimators rely on the EFS method for selecting the smoothing parameters. The robust estimator is tuned to achieve an MDP of 0.95, resulting in c = 3.1 given the design. We assess estimation performance by investigating the differences between the true parameter and the estimated one, both on the linear predictor scale (η 1 and η 2 ) and on the canonical parameter scale (µ and σ). We also compute the mean squared error (MSE) of each target θ computed as MSE(θ, θ) = 1 Figure 1 below presents boxplots of the MSE of both methods under both scenarios, while Figure S3 in Web Appendix C shows the same but with the vertical scales manually set to improve visualization. Similarly, Figures S4 and S5 present boxplots of MSEs on the scale of µ and σ. In the clean data scenario, the MSE of classical estimates for both parameters is slightly smaller than that of robust estimates, as theoretically expected. When the data are contaminated, the MSE of classical estimates explodes whereas the MSE of the robust method only shows a slight increase with somewhat more variability across replications.
We investigate these differences further by looking at the fitted surfaces for s 1 (X, Y) and s 2 (X, Y). Figure 2 below shows colored surfaces representing the average bias across replications 1 200 200 j=1 (θ i,j −θ i ), where i = 1, . . . , n and θ is either η 1 or η 2 , in the clean data scenario; Figure 3 shows the same under the contaminated scenario. Note that the coloring scales are not the same between the two figures. At the assumed model, we see that both methods perform equally well, showing overall little bias centered about zero. However, under contamination the classical estimates show a large positive bias in the top-right corner of the brain slice, which is precisely the area that is contaminated (X > 70 and Y > 30). Under contamination, the robust estimates show roughly similar biases than at the model, meaning that the fitted surfaces are quite stable in spite of the contamination. In Web Appendix C, Figures S6 and S7 present similar colored surfaces but for µ and σ; the results are essentially the same. Overall, this simulation study not only highlights the robustness property of our proposed estimator but also how tuning for an MDP of 0.95 yields smooth functions estimations that are nearly indistinguishable from ML-based ones when the data come form the assumed model.

Comparison to Robust Alternatives in a GAM Setting
In order to compare our proposed estimation method to existing robust approaches in the special case of a GAM, we consider here one of the simulation designs of Wong et al. (2014). For i = 1, . . . , n, we generate independent responses Y i ∼ Poisson(µ i ) with µ i = exp(η i ) and η i = 4 cos(2π(1 − x 2 i )), where the x i 's are independently drawn from a Uniform(0, 1) distribution. The sample size is set to n = 100. Following Wong et al. (2014, p. 280), contaminated data are obtained by randomly selecting 5% of the original responses and changing them to the nearest integer y i u u 2 1 , where u 1 is drawn from a Uniform(2, 5) distribution and where u 2 is randomly set to either 1 or −1. We simulate 200 replications.
We compare the following methods, with the same setting choices as in Wong et al. (2014): • AS: the approach of Alimadad and Salibian-Barrera (2011) with span=0.5; • CGP: the approach of Croux et al. (2012) with nknots=15; • WYL: the approach of Wong et al. (2014) with k = 30 basis functions and with smoothing parameter chosen by minimizing their robust BIC, following their recommendation; • GAMLSS: our proposed approach with k = 20 basis functions; • Classical: ML-based estimation with k = 20 basis functions and smoothing parameter selected by the Fellner-Schall method of Wood and Fasiolo (2017).
All existing approaches build on Cantoni and Ronchetti (2001a) to define robust penalized estimating equations for δ. Croux et al. (2012) additionally define a similar set of estimating equations for the dispersion parameter in their extended GAM setting. That is, all these approaches robustify estimating (score) equations, typically by appending weights, whereas our proposed approach directly robustifies a likelihood. Regarding smoothers and basis functions, Alimadad and Salibian-Barrera (2011)   Salibian-Barrera (2011) use a robust version of CV defined as a sum of squared weighted residuals in line with Cantoni and Ronchetti (2001b), and implemented it in a "brute-force" way; Croux et al. (2012) construct a robust GCV criterion and a robust AIC by applying some bounded function to the deviances appearing in the classical counterparts; while Wong et al. (2014) define robust versions of AIC, BIC and leave-one-out CV, all of them borrowing from the quasi-likelihood definition in Cantoni and Ronchetti (2001a). The proposals based on brute-force (G)CV are generally too demanding to be practical for medium to large applications. The robust information criteria are more tractable, although still with a high computational cost if grid searches are to be used. In all of the three existing approaches, there is no formal treatment of the robustness tuning constant selection. Alimadad and Salibian-Barrera (2011, p. 723) advise to use c = 1.5, commenting on the fact that "values of c between 1 and 4 produce similar qualitative results". Croux et al. (2012, p. 33) suggest using c = 1.345 for both estimating equations for the mean and the dispersion, borrowing from the Gaussian regression setting and stating that "this value gives reasonable results for other models as well". Finally, Wong et al. (2014) suggest to use c = 1.6 as in Cantoni and Ronchetti (2001a) without further discussion, even though the simulation designs are different.
Since we only have one smooth term here, we can afford the computational cost of the brute-force CV of AS and consider three variants of our estimator to compare smoothing parameter selection methods: minimizing our proposed robust AIC (RAIC); minimizing our robust BIC (RBIC); and the extended Fellner-Schall method (EFS). The RAIC/RBIC minimization are performed by a grid search with a relative numerical tolerance of 10 −5 on the scale of the single smoothing parameter λ. All the methods have been tuned to achieve an MDP of 0.95 following the procedure introduced in Section 3.5, to make them comparable. The resulting tuning constants are k = 1.2 for the AS method, tccM = 1.2 and tccG = 1.345 for CGP, c = 1.2 for WYL, and c = 5.8 for our approach. As already noted by Wong et al. (2014, p. 286), the CGP method estimates an additional dispersion parameter by default. This implies greater modeling flexibility and may make the comparison unfair in some situations, but we do not expect this to contribute much to its performance in the simulation settings considered here. We evaluate and compare the performances of the methods by assessing their MSE for the Poisson mean parameter µ computed as MSE(μ, µ) = 1 n n i=1 (μ i − µ i ) 2 . The R code for the WYL approach is available through the R package robustGAM, whereas the AS approach is available via the R package rgam. The code for the CGP approach was retrieved from the online supplementary material of Wong et al. (2014). Figure 4 displays boxplots of the MSE for all methods both at the assumed Poisson GAM model (left sub-panel) and under contamination (right sub-panel). Some numerical summary statistics are given in Table 1. The classical (ML-based) estimation has the lowest MSE under clean data, while it unsurprisingly shows poor performance under contamination. Among the robust methods, AS has the largest MSEs and tends to vary more than the others. CGP, WYL and our method all roughly have the same MSEs on average, although WYL shows larger variability across samples. Among our three variants (RAIC, RBIC and EFS) performance is similar at the model, but under contamination RAIC features slightly larger MSEs. This is in line with remarks made by Wong et al. (2014) about AIC/RAIC favoring wigglier fits which here may allow contaminated observations to contribute relatively more to the fit than with heavier penalties such as BIC/RBIC, and this regardless of the robustness property of the method.
Overall, these simulation results yields two main conclusions. First, our proposed robust method performs similarly to the best-performing existing alternatives in the GAM

Application to Brain Imaging Data
The data motivating the proposed method come from fMRI of the human brain. These data were presented in Landau et al. (2003) and subsequently used in Wood (2017), and are available in the R package gamair available on CRAN. The goal of the original study is to test for a difference in the timing (phase shift) of the physiological response between two anatomically distinct brain regions. For this purpose, a set of fMRI measures were acquired from a healthy participant during the performance of a verbal fluency task. The active task of this experiment consisted of generating words beginning with a cued letter, while the baseline condition was given by covertly repeating a letter. Brain activity was then summarized as the median of three measurements of fundamental power quotient on each brain voxel. The coordinates of each voxel were also recorded; these can be used to model the response surface. Wood (2017, p. 329) identified two extreme voxel responses that were discarded for the subsequent analysis. The response variable is thus the median fundamental power quotient medFPQ which represents the physiological response of the brain to controlled stimuli. This response is measured at voxels in a 2D brain slice with two covariates X and Y identifying the location of each voxel. The medFPQ measurements are rather noisy with possible spikes and troughs in activity which do not relate to the controlled stimulus, but the mean response level and its spread are likely to vary smoothly over the brain slice. Following Wood (2017), we thus model both the mean and variance of medFPQ as joint functions s(X,Y) to be approximated by thin plate regression spline basis functions with a smoothness penalty based on second order derivatives. However, contrary to the analysis in (Wood, 2017, p. 329) where two voxels with medFPQ ≤ 5 × 10 −3 were excluded on the ground that they can be regarded as outliers, we will consider the entire data set without exclusions. Given the nonnegative and positively skewed nature of medFPQ, we postulate a gamma distribution parameterized with mean µ and variance σ 2 µ 2 , with log(µ) = η 1 = s 1 (X, Y) and log(σ) = η 2 = s 2 (X, Y). Other families were considered, including the log-logistic distribution which is outside the exponential family; diagnostics and model validation (not presented here) confirmed that a gamma distribution provides the best fit. We fit the gamma GAMLSS with a classical (ML, non-robust) estimation method and our proposed robust method. Because of the joint smoother used here, we rely on the EFS method which provides fast computations. The robust estimator is tuned to achieve an MDP of 0.95, resulting in a robustness constant of c = 4.5.
The fitted surfaces for η 1 and η 2 are given in Figure 5. Overall, the robust fitted surfaces appear smoother for both parameters, with a surface that is nearly flat for η 2 . The classical fit uses a total of 77.2 effective degrees of freedom (56.09 for fitting η 1 , 19.11 for η 2 , plus 2 for the constants), whereas the robust fit only uses 30.04 effective degrees of freedom (26.00 for fitting η 1 , 2.04 for η 2 , and 2 for the constants). This hints that the automatic selection of the smoothing parameter in the classical fit was influenced by some potentially outlying observations.
Consider the largest local differences in Figure 5 between the two fits: in the upper-right corner of the brain forη 1 (which has motivated the contamination scheme of Section 4.1), and in the leftmost part of the brain forη 2 . For the latter, classical estimates imply a much larger localized response variance than robust estimates do. This is driven by two observations in this area which are the ones excluded from the analysis in Wood (2017). But for the large difference inη 1 , the spike in mean brain activity implied by classical estimates is much subdued when considering robust estimation. This is explained when investigating the robustness weights, which are displayed in Figure 6. A few observations in the top-right corner are heavily downweighted by the robust method, which results in the smoother mean surface in Figure 5. These low weights do not imply that these observations are necessarily outliers, but simply that they do not seem to follow the same trends as the majority of the data given the gamma GAMLSS assumed here. The downweighted observations in the topright corner may indeed represent a physiological response of interest here, we note that the robustness weights identify them in an automated way. Also, note that the two observations excluded by Wood (2017) are also heavily downweighted; these are indicated in Figure 6 as green crosses for reference. Hence, the robust fitted surfaces combined with the robustness weights are effective at both modeling smooth functions in a reliable way and at automatically detecting observations deviating from trends and model assumptions.

Discussion
We introduced a robust estimation method for the broad class of GAMLSS. Our approach is quite general since it can be employed for any differentiable likelihood. By directly robustifying the log-likelihood and correcting it for Fisher consistency, this method yields natural robust versions of AIC and BIC. For more complicated designs where grid searches are not feasible, our extended Fellner-Schall method allows for a reliable and automatic selection of smoothing parameters. Our implementation in the R package GJRM, based on the trust region algorithm, is modular and stable. Furthermore, the introduced MDP addresses the challenge of the selection of the robustness tuning constant for models with flexible non-linear effects in a simple and effective way. We believe this criterion has broad applicability in the imple-mentation of robust methods in many contexts, including the ones where efficiency criteria based on asymptotic covariances already exist but may be computationally expensive.
Simulations in the special case of a GAM showed that our robust estimator is on par with the best-performing existing approaches, when tuned for comparable robustness under the assumed model. Simulations in the broader GAMLSS setting as well as our application to the brain imaging data showed that our robust estimator allows for the automatic detection of deviating observations through the robustness weights, and that the approach yields trustworthy estimates.
Future work includes the extension to high-dimensional settings, following for instance Andreas et al. (2012) where the problem of variable selection is considered. An alternative strategy for variable selection is developed in  and  using L 1 -type of penalties. The following conditions are required for Theorem 1 in the main body of the paper.
(C2) The parameter space ∆ ⊂ R p is compact and δ 0 is an interior point of ∆.
(C5) The smoothness parameter λ = o(n 1/2 )1, i.e. a vanishing penalty n −1 λ = o(n −1/2 )1 for the log-likelihood scaled by 1/n. Condition (C1) summarizes the requirements for the user-specified ρ function. In particular, the boundedness of the first derivative is the key to the robustness property of the proposed estimator as it ensures that the influence function is itself bounded (see below). These requirements are satisfied by the log-logistic function used in the main body of the paper.
The conditions in (C2) essentially guarantee that the asymptotic distribution is absolutely continuous (i.e. without point-masses on the boundary of the domain). The compactness requirement is rather standard in non-parametric and sieve estimation contexts.
Condition (C3) includes most commonly used families of distributions. We note it may be relaxed to Lipschitz requirements on the derivative of log f (y|µ, σ, ν) and may hold only for asymptotically non-null sets (see Huber, 1967).
The requirements in (C4) are standard minimal conditions in the robustness literature, akin to assuming that the Fisher information matrix exists in a neighborhood of the true parameter in a maximum likelihood (ML) framework.
Finally, (C5) imposes a minimum rate at which λ decreases as n → ∞. This rate is necessary to guarantee that the penalization does not induce any asymptotic bias for the robust estimator.
Proof of Theorem 1. The penalized robust estimatorδ is the solution in δ to where here˜ (δ) = 1 n n i=1 ρ c (δ) i − 1 n b ρ (δ) is the scaled robustified log-likelihood. Given the smoothness of˜ (δ) ensured by (C1) and (C3), a first-order Taylor expansion about δ 0 of the right-hand side of (1) evaluated at the solutionδ yields Multiplying by √ n on both sides and rearranging lead to Considering the limit as n → ∞, (C5) implies n −1 S(λ) → 0 and n −1/2 S(λ)δ 0 → 0. This effectively reduces the estimating equations to those of the unpenalized robust estimator that solves ∂˜ (δ)/∂δ = 0. Viewing δ 0 as the true parameter that generated the data, we can thus invoke standard results from M -estimation theory. In particular, conditions (C1)-(C4) are sufficient to apply Theorem 2 of Huber (1967) so that the unpenalized robust estimator converges a.s. to δ 0 . This implies that the statistical functional T (F ) defined as the solution satisfies T (D) = δ 0 , i.e. it is Fisher consistent for δ 0 . Asymptotic normality follows under (C1)-(C4) by observing that the remainder in (2) is O p (n −1/2 ) because of √ n-consistency and by Theorem 3 of Huber (1967):

Web Appendix B: Extended Fellner-Schall Method
We follow here the main steps of Wood and Fasiolo (2017) in developing a robust extended Fellner-Schall (EFS) method to select the smoothness parameter λ. The quadratic penalty in the penalized robustified log-likelihood can be viewed as an improper Gaussian prior distribution on δ, seen as a random vector throughout this section, with λ now seen as another model parameter. For this, consider adding the term corresponding to the determinant of the Gaussian covariance matrix to get an improper prior, i.e. up to a constant term (involving √ 2π). We thus define the joint robustified log-likelihood l for Y = (Y 1 , . . . , Y n ) and δ as where | · | denotes matrix determinant, so that we can view L(y, δ; λ) = exp(l(y, δ)) as a joint (robustified) likelihood up to constant terms: The multivariate Gaussian distribution of δ has a mean vector of zero and covariance S(λ) −1 , or precision matrix S(λ). By construction, S(λ) is symmetric and positive definite so that all its eigenvalues are strictly positive and it is thus invertible. Note that this added term does not depend on δ so that our robust estimatorδ can also be defined as arg max δ l(y, δ; λ). We derive the marginal robustified likelihood for Y by integrating out δ from L(y, δ; λ): L(y; λ) = ∆ L(y, δ; λ) dδ. The Laplace approximation of this integral can be summarized as a second-order Taylor expansion of l(y, δ; λ) about the global maximumδ =δ(λ) = arg max δ l(y, δ; λ) for a given λ and observation vector y. This yields the following approximation to the marginal robustified log-likelihood where the "LA" subscript identifies Laplace-approximated quantities and the negative Hessian is guaranteed to be positive definite at least in a neighborhood ofδ for sufficiently large n under the conditions of Theorem 1. Now, viewing λ as the parameter of this (Laplace-approximated) marginal log-likelihood, ideally we would then compute the first-order derivative: Unfortunately, it cannot be computed directly in general since λ appears essentially everywhere. In particular, λ appears in M p both explicitly because M p is a direct function of S(λ) (see above) and implicitly throughδ(λ). We therefore need to neglect the dependence on λ for some terms in (5). Specifically, we are going to considerδ as a function of a previous "inactive" iterate λ , so thatδ is fixed with respect to the current "active" λ. This implies that M p (δ, λ) = − ∂ 2˜ (δ) ∂δ∂δ δ=δ + S(λ) is now a function of the active λ only explicitly through S(λ). That way, the new Laplace-approximated marginal log-likelihood, written as l LA (y; λ, λ ), is now a function of the active λ only through S(λ). To compute the derivative of l LA (y; λ, λ ) with respect to the active λ, we adapt Jacobi's formula to invertible matrices and obtain ∂ log |S(λ)| ∂λ j = tr S(λ) −1 ∂S(λ) ∂λ j since S(λ) is symmetric and invertible (and a differentiable function of λ). Based on this result, we can get a simplified version of (5):      Tables   Table S1 below lists the distributions implemented in the R package GJRM. These have been parametrized according to Rigby and Stasinopoulos (2005) and are defined in terms of µ, σ and ν. The means and variances of DAGUM, FISK (also known as log-logistic) and SM are indeterminate for certain values of σ and ν. If a parameter can only take positive values then the transformation/link function log(· − ) is employed. If a parameter can take values in (0, 1) then the inverse of the cumulative distribution function of a standardized logistic is used. B(·, ·) is the beta function, Γ(·) is the gamma function, α = 1 σ 2 + 2µ (1−exp(−µ)) 2 y = 1, 2, . . . µ > 0 Table S1: Definition and some properties of the distributions implemented in GJRM.