The generalized equivalence of regularization and min–max robustification in linear mixed models

The connection between regularization and min–max robustification in the presence of unobservable covariate measurement errors in linear mixed models is addressed. We prove that regularized model parameter estimation is equivalent to robust loss minimization under a min–max approach. Using the LASSO, Ridge regression, and the Elastic Net as examples, we derive uncertainty sets that characterize the feasible noise that can be added to a given estimation problem. These sets allow us to determine measurement error bounds without distributional assumptions. A conservative Jackknife estimator of the mean squared error in this setting is proposed. We further derive conditions under which min–max robust estimation of the model parameters is consistent. The theoretical findings are supported by a Monte Carlo simulation study under multiple measurement error scenarios.

from survey data (Alfons et al. 2013). Biomarker measures may be contaminated due to errors in specimen collection and storage (White 2011). Further, since big data sources are used for analysis more and more often (Davalos 2017; Yamada et al. 2018), the measurement error is often not controllable. If the correct data values cannot be recovered from their noisy observations, linear regression fails to provide valid results. In that case, methodological adjustments are required to allow for statistically well-founded results. These adjustments are often summarized under the umbrella term robust estimation (Li and Zheng 2007).
Robustness is not connoted consistently in statistics. There is a multitude of different methods that account for the effects of measurement interference in the estimation process. Bertsimas et al. (2017) categorize them into two general approaches to robustification, an optimistic and a pessimistic perspective, which they call the min-min and the min-max approach. Let X̃ = X + D be a design matrix that is contaminated by measurement errors. Here, X ∈ R^{n×p} denotes the original design matrix without errors and D ∈ R^{n×p} is a matrix of error terms. Burgard et al. (2020a, b) assumed that measurement errors are normally distributed. However, no distributional assumptions are made here about D. Further, X and D are not required to be independent. Now, for a loss function g : R^n → R_+, R_+ = [0, ∞), a response vector y ∈ R^n, a perturbation matrix Δ ∈ R^{n×p}, and a set U ⊆ R^{n×p}, the min-min approach is formulated by the problem

min_{β∈R^p} min_{Δ∈U} g(y − (X̃ + Δ)β),     (1)

while the min-max approach is characterized by the problem

min_{β∈R^p} max_{Δ∈U} g(y − (X̃ + Δ)β).     (2)
In both variants, the design matrix is perturbed to account for possibly contained additive measurement errors. In the min-min approach, the perturbations are chosen minimally with respect to U. This is the most common idea of robustness in statistics. Bertsimas et al. (2017) refer to it as optimistic, because the researcher is allowed to choose which observations to discard for model parameter estimation. The primary concern of the min-min approach is to robustify against distribution outliers. Therefore, distribution information about the measurement errors is often required. Examples of min-min methods include Least Trimmed Squares (Rousseeuw and Leroy 2003), the Trimmed LASSO (Bertsimas et al. 2017), Total Least Squares (Markovsky and Van Huffel 2007), as well as M-estimation with influence functions (Huber 1973; Schmid and Münnich 2014). The min-max method, on the other hand, introduces perturbations that are chosen maximally with respect to U. This idea mainly stems from robust optimization theory. The objective is to find solutions that are still good or feasible under a general level of uncertainty. In the process, deterministic assumptions about U are made, which is then called an uncertainty set. The researcher chooses it in accordance with how the additive error might be structured. Bertsimas et al. (2017) refer to this approach as pessimistic, since model parameter estimation is performed under a worst-case scenario for the perturbations. Unlike in the min-min approach, the target is not to robustify against errors of a given distribution, but against errors of a given magnitude. This robustness viewpoint has been studied, for example, by El Ghaoui and Lebret (1997), Ben-Tal et al. (2009), and Bertsimas and Copenhaver (2018).
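The contrast between the two perspectives can be made concrete in the scalar case, which is solvable by grid search. For g = ‖·‖_2 and perturbations of the single covariate column bounded by ‖δ‖_2 ≤ c, the worst case adds c|b| to the loss while the best case subtracts it, so the min-max fit shrinks toward zero and the min-min fit does the opposite. The following is a toy sketch with synthetic data and hypothetical constants, not an example from the paper:

```python
import numpy as np

rng = np.random.default_rng(6)
n, c = 50, 0.3
x = rng.normal(size=n)                       # single contaminated covariate column
y = 1.5 * x + rng.normal(scale=0.5, size=n)  # true slope 1.5

bs = np.linspace(0.0, 3.0, 3001)             # grid over candidate slopes b
base = np.array([np.linalg.norm(y - b * x) for b in bs])

# worst case over ||delta||_2 <= c adds c|b|; best case subtracts it (floored at 0)
minmax_obj = base + c * np.abs(bs)
minmin_obj = np.maximum(base - c * np.abs(bs), 0.0)

b_ols = bs[base.argmin()]
b_minmax = bs[minmax_obj.argmin()]           # pessimistic: shrinks toward zero
b_minmin = bs[minmin_obj.argmin()]           # optimistic: moves away from zero
```

The pessimistic fit is conservative by construction, which is exactly the behavior exploited in the remainder of the paper.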
In practice, distribution information on the measurement errors is rarely available. This is particularly the case for big data sources, as the origin of the data is often unknown. In these settings, it makes sense to adopt robust optimization and regard the disturbance of X̃ pessimistically. Under this premise, we obtain conservative, yet valid results. That is to say, the min-max method can be used to achieve robust estimates in the absence of distribution information on data contamination. Yet, it is not obvious how to efficiently solve a corresponding min-max problem (Bertsimas et al. 2011). But recent results from robust optimization show that it is related to regularized regression problems of the form

min_{β∈R^p} g(y − X̃β) + λh(β),     (3)

where λ > 0 and h : R^p → R_+ is a regularization function. From an optimization standpoint, problems like (3) can be handled better and solved more efficiently. But given the literature on regularized regression, it is uncommon to regard (3) as a robustification. In many cases, regression models are extended by regularization due to at least one of the following aspects: (i) to allow for high-dimensional inference (Neykov et al. 2014), (ii) to perform variable selection (Zhang and Xiang 2015; Yang et al. 2018), and (iii) to deal with multicollinearity (Norouzirad and Arashi 2019). Apart from these common applications, Bertsimas and Copenhaver (2018) provided novel insights by showing that (2) and (3) are equivalent if g is a semi-norm and h is a norm. Unfortunately, many regularized regression methods do not fit naturally into this framework. Popular techniques like the LASSO (Tibshirani 1996), Ridge regression (Hoerl and Kennard 1970), or the Elastic Net (Zou and Hastie 2005) use the squared ℓ2-norm as loss function (g), which is not a semi-norm. Further, many regularizations (h) include squares or other functions of norms, which are not norms.
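A simple numeric illustration of why a regularized fit is attractive under contaminated designs: across repeated contaminations of the same design matrix, shrinkage stabilizes the coefficient estimates relative to ordinary least squares. All data and the penalty level below are hypothetical choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 100, 5, 25.0
beta_true = np.ones(p)
X = rng.normal(size=(n, p))                  # error-free design, held fixed

ols_fits, ridge_fits = [], []
for _ in range(200):
    D = rng.normal(scale=0.5, size=(n, p))   # unobservable measurement errors
    X_tilde = X + D                          # only the contaminated design is seen
    y = X @ beta_true + rng.normal(scale=0.5, size=n)
    ols_fits.append(np.linalg.lstsq(X_tilde, y, rcond=None)[0])
    ridge_fits.append(np.linalg.solve(X_tilde.T @ X_tilde + lam * np.eye(p),
                                      X_tilde.T @ y))

# total coefficient variance across contamination draws
var_ols = np.var(ols_fits, axis=0).sum()
var_ridge = np.var(ridge_fits, axis=0).sum()
```

The reduced variability is the statistical face of the worst-case guarantee discussed above; the price is additional shrinkage bias.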
Given this limitation, it is desirable to find a more general connection of regularization and robustification that applies to a broader range of methods.
In this paper, we prove that (2) and (3) are equivalent when g and h are functions of semi-norms and norms. Using linear mixed models (LMMs; Pinheiro and Bates 2000) as an example, we show that regularization obtains efficient estimates in the presence of measurement errors without distributional assumptions. Compared with the majority of regularized regression applications, this introduces a fairly new perspective on these methods. Past developments mainly focused on how to robustify regularized regression under contaminated data (Rosenbaum and Tsybakov 2010; Loh and Wainwright 2012; Sørensen et al. 2015). We show that regularization itself is a robustification. Building upon this result, we derive uncertainty sets for the LASSO, Ridge regression, and the Elastic Net. They characterize the nature of the respective robustification effect and allow us to find upper bounds for the measurement errors. From the error bounds, we construct a conservative Jackknife estimator of the mean squared error (MSE) for contaminated data. Further, we study conditions under which robust optimization allows for consistency in model parameter estimation.
We proceed as follows. In Sect. 2, the generalized equivalence is established. We use it to derive the uncertainty sets resulting from the three regularizations. Next, we build a robust version of the LMM and show how robust empirical best predictors from the model are obtained. Section 3 addresses MSE estimation. We first derive error bounds from the uncertainty sets. Then, we present the conservative Jackknife estimator. In Sect. 4, we cover consistency in model parameter estimation. Section 5 contains a Monte Carlo simulation to demonstrate the effectiveness of the methodology. Section 6 closes with some concluding remarks. The paper is supported by a supplemental material file with eight appendices. Appendices 1 to 7 contain the proofs of the mathematical developments presented in this study. Appendix 8 contains MSE calculations for a general random effect structure. This paper contains insights of a related working paper by Burgard et al. (2019).

Min-max robustification
In Sect. 1, we introduced min-max robustification (2) as a conservative approach to obtain estimates in the presence of unknown measurement errors. Since it is unclear how to efficiently solve the underlying optimization problem, we now present how it is related to regularized regression problems of the form (3). For this purpose, the following result by Bertsimas and Copenhaver (2018) is helpful, as it connects min-max robustification with regularization.
Proposition 1 (Bertsimas and Copenhaver 2018) If g : R^n → R_+ is a semi-norm which is not identically zero and h : R^p → R_+ is a norm, then for any z ∈ R^n, β ∈ R^p and λ > 0,

max_{Δ∈U} g(z + Δβ) = g(z) + λh(β), where U = {Δ ∈ R^{n×p} : g(Δβ) ≤ λh(β) for all β ∈ R^p}.

Clearly, the proposition directly implies that

min_{β∈R^p} max_{Δ∈U} g(y − (X̃ + Δ)β) = min_{β∈R^p} g(y − X̃β) + λh(β)

for g, h and U as in Proposition 1. Thus, the framework provided by Bertsimas and Copenhaver (2018) gives us novel insights into the role of regularization in regression. The choice of a regularization function h with parameter λ directly constrains the uncertainty set U, which defines a set of perturbations for the design matrix. In other words, the regularization controls the magnitude of noise that can be added to X̃. Under this interference, β is chosen such that the loss is minimal. The effect can be imagined as a two-player game where one player tries to minimize the loss by controlling β, while the other player tries to maximize the deviation by controlling the noise that is added to X̃. However, many regression methods are formulated using a squared norm or a mix of squared and non-squared norms. For instance, the LASSO is posed as the optimization problem

min_{β∈R^p} ‖y − X̃β‖_2^2 + λ‖β‖_1.
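The identity in Proposition 1 can be checked numerically for g = ‖·‖_2 and h = ‖·‖_1, where the uncertainty set consists of matrices with column-wise ℓ2-norm at most λ. The worst case is attained by aligning every column of Δ with the residual direction; any feasible perturbation stays below the bound. A minimal sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 20, 5, 0.7
z = rng.normal(size=n)          # plays the role of the residual z = y - X~ beta
beta = rng.normal(size=p)

# worst-case perturbation: column i equals lam * sign(beta_i) * z / ||z||_2
u = z / np.linalg.norm(z)
Delta_star = lam * np.outer(u, np.sign(beta))

lhs = np.linalg.norm(z + Delta_star @ beta)          # the maximum, attained
rhs = np.linalg.norm(z) + lam * np.abs(beta).sum()   # g(z) + lam * h(beta)

# any feasible Delta (all column norms <= lam) stays below the bound
feasible_ok = True
for _ in range(200):
    D = rng.normal(size=(n, p))
    D *= lam / np.maximum(np.linalg.norm(D, axis=0), lam)  # project columns into ball
    feasible_ok &= np.linalg.norm(z + D @ beta) <= rhs + 1e-9
```

This is the mechanism behind the "two-player game" reading: the adversary's best move only depends on the residual direction and the signs of β.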
Here, the deviation is squared while the regularization term is not. Ridge regression, on the other hand, is posed as the optimization problem

min_{β∈R^p} ‖y − X̃β‖_2^2 + λ‖β‖_2^2,

with both the deviation and the regularization being squared. Neither optimization problem fits naturally into the framework of Proposition 1, since a squared (semi-)norm ‖·‖^2 is not a (semi-)norm. We provide a generalization of Proposition 1 that (i) displays a more fundamental connection between regularization and robustification, and (ii) enables us to regard more sophisticated regularizations in the light of robustification.
Theorem 1 Let λ_1, ..., λ_d be positive real numbers, let g : R^n → R_+ be a semi-norm which is not identically zero, let h_1, ..., h_d : R^p → R_+ be norms, and let f, f_1, ..., f_d : R_+ → R_+ be increasing convex functions. Then there exist ϕ_1 > 0, ..., ϕ_d > 0 such that the regularized problem

min_{β∈R^p} f(g(y − X̃β)) + Σ_{l=1}^d λ_l f_l(h_l(β))

and the min-max problem

min_{β∈R^p} max_{Δ∈U} f(g(y − (X̃ + Δ)β)), with U = {Δ ∈ R^{n×p} : g(Δβ) ≤ Σ_{l=1}^d ϕ_l h_l(β) for all β ∈ R^p},

have the same set of optimal solutions.

The proof can be found in Appendix 1 of the supplemental material. Observe the difference between Theorem 1 and Proposition 1. In the original statement, regularization and robustification are equivalent when the loss function is a semi-norm and the regularization is a norm. In the generalization, the equivalence also holds when loss function and regularization are increasing convex functions of semi-norms and norms. This covers a broader range of settings, which we demonstrate hereafter. For Ridge regression, we have g(z) = ‖z‖_2, h_1(z) = ‖z‖_2, f(z) = z^2, f_1(z) = z^2 and d = 1 with λ > 0. Applying Theorem 1 yields

β̂_2 = argmin_{β∈R^p} max_{Δ∈U_2} ‖y − (X̃ + Δ)β‖_2^2 with U_2 = {Δ ∈ R^{n×p} : ‖Δβ‖_2 ≤ ϕ‖β‖_2 for all β ∈ R^p}

for some ϕ > 0. For the LASSO, we have g(z) = ‖z‖_2, h_1(z) = ‖z‖_1, f(z) = z^2, f_1(z) = z and d = 1 with λ > 0. Applying Theorem 1 obtains

β̂_1 = argmin_{β∈R^p} max_{Δ∈U_1} ‖y − (X̃ + Δ)β‖_2^2 with U_1 = {Δ ∈ R^{n×p} : ‖Δβ‖_2 ≤ ϕ‖β‖_1 for all β ∈ R^p}

for some ϕ > 0. And finally, for the Elastic Net, we have g(z) = ‖z‖_2, h_1(z) = ‖z‖_1, h_2(z) = ‖z‖_2, f(z) = z^2, f_1(z) = z, f_2(z) = z^2 and d = 2 with λ_1, λ_2 > 0, which yields

β̂_EN = argmin_{β∈R^p} max_{Δ∈U_EN} ‖y − (X̃ + Δ)β‖_2^2 with U_EN = {Δ ∈ R^{n×p} : ‖Δβ‖_2 ≤ ϕ_1‖β‖_1 + ϕ_2‖β‖_2 for all β ∈ R^p}

for some ϕ_1, ϕ_2 > 0. We see that the theorem can be broadly applied and establishes the robustification effect for a variety of regularized regression methods. However, the manner in which robustification is achieved depends on the regularization. By looking at the definition of U in Theorem 1, we see that the effects of measurement errors with respect to the loss function are bounded by a generic term Σ_{l=1}^d ϕ_l h_l(·) for some ϕ_1 > 0, ..., ϕ_d > 0. The exact form of this term depends on the regularization the researcher wishes to apply. Accordingly, in the light of the three regularized regression approaches considered before, the robustification effect manifests itself differently given the penalty. On that note, Bertsimas and Copenhaver (2018) provided another result that allows for an interpretation of the robustification effects. It is summarized in the subsequent proposition.
Proposition 2 (Bertsimas and Copenhaver 2018) Let t ∈ [1, ∞], let ‖β‖_0 be the number of non-zero entries of β, and let Δ_i be the i-th column of Δ for i = 1, ..., p.

Applying this result to our generalization, we see that for U_2 (Ridge regression), the maximum singular value of Δ is bounded by ϕ. With respect to U_1 (LASSO), the column-wise ℓ2-norm of Δ is bounded by ϕ. Thus, while Ridge regression induces an upper bound on the entire noise matrix, the LASSO provides a componentwise bound. With respect to the Elastic Net, the error bound is a linear combination of the componentwise LASSO bound and the general Ridge bound. Unfortunately, it is less apparent how to interpret the corresponding robustification effect.

Model and robust empirical best prediction
Based on these insights, we now use min-max robustification to construct a robust version of the basic LMM in a finite population setting. Let U denote a population of |U| = N individuals indexed by i = 1, ..., N. Assume that U is segmented into m domains U_j of size |U_j| = N_j indexed by j = 1, ..., m, with U_j and U_k pairwise disjoint for all j ≠ k. Let S ⊂ U be a random sample of size |S| = n < N that is drawn from U. Assume that the sample design is such that there are domain-specific subsamples S_j ⊂ U_j of size |S_j| = n_j > 1 for all j = 1, ..., m. Let Y be a real-valued response variable of interest. Denote y_ij ∈ R as the realization of Y for a given individual i ∈ U_j. For convenience, assume that the objective is to estimate the mean of Y for all population domains, that is,

Ȳ_j = N_j^{-1} Σ_{i∈U_j} y_ij, j = 1, ..., m.     (12)

Let X be a p-dimensional real-valued vector of covariates statistically related to Y. Let x_ij ∈ R^{1×p} denote the realization of X for i ∈ U_j and let z_ij ∈ R^{1×q} be the known incidence vector. In the light of Sect. 1, we assume that the observations of X are impaired by measurement errors. That is, we only observe the contaminated values x̃_ij = x_ij + d_ij, where d_ij ∈ R^{1×p} is an unobservable error term. The robust LMM is formulated as

y_j = (X̃_j + Δ_j)β + Z_j b_j + e_j, j = 1, ..., m,     (13)

where y_j = (y_1j, ..., y_njj)' is the vector of sample responses in S_j. Further, we have X̃_j = (x̃'_1j, ..., x̃'_njj)' ∈ R^{n_j×p} as matrix of contaminated covariate observations, Δ_j ∈ R^{n_j×p} as the corresponding perturbation matrix, Z_j = (z'_1j, ..., z'_njj)', and β ∈ R^{p×1} as vector of fixed effect coefficients. The term Z_j ∈ R^{n_j×q} denotes the random effect design matrix and b_j ∈ R^{q×1} with b_j ∼ N(0_q, Σ) is the vector of random effects. The latter follows a normal distribution with a positive-definite covariance matrix Σ ∈ R^{q×q} that is parametrized by some vector ψ ∈ R^{q*×1}. The vector e_j ∈ R^{n_j×1} contains random model errors with e_j ∼ N(0_{n_j}, σ^2 I_{n_j}) and a variance parameter σ^2. We assume that b_1, ..., b_m, e_1, ..., e_m are independent. Under model (13), the response vectors follow independent normals

y_j ∼ N((X̃_j + Δ_j)β, V_j), where V_j = Z_j Σ Z_j' + σ^2 I_{n_j}.

Formulating (13) over all domains yields

y = (X̃ + Δ)β + Zb + e,     (16)

where y = (y_1', ..., y_m')', X̃ = (X̃_1', ..., X̃_m')', Δ = (Δ_1', ..., Δ_m')', b = (b_1', ..., b_m')', e = (e_1', ..., e_m')', and Z = diag(Z_1, ..., Z_m) ∈ R^{n×mq} are stacked matrices.
Finally, the model parameters are θ := (β', η')' with η = (σ^2, ψ')'. Please note that, in accordance with a likelihood-based estimation setting, the random effects b are not model parameters, but random variables. Thus, in order to obtain robust estimates of (12), random effect and response realizations have to be predicted. For this, we first state the best predictors of b_j and Ȳ_j under the robust LMM with the preliminary assumption that θ is known. They are obtained from the respective conditional expectations given the response observations under (16). We refer to them as robust best predictors (RBPs). Afterwards, the model parameters are substituted by empirical estimates to obtain the robust empirical best predictors (REBPs). The RBPs are stated in the subsequent proposition.

Proposition 3 Under model (16), the RBPs of b_j and Ȳ_j are given by the respective conditional expectations E(b_j | y_j) and E(Ȳ_j | y_j).
The proof can be found in Appendix 2 of the supplemental material. Note that the RBP of Ȳ_j requires covariate observations for all i ∈ U_j. Such knowledge may be unrealistic in practice, depending on the application. Therefore, we use an alternative expression that is less demanding in terms of data. Battese et al. (1988) suggested the approximation Ȳ_j ≈ μ_j = X̄_j β + Z̄_j b_j for cases when n_j/N_j ≈ 0. Here, X̄_j and Z̄_j are the domain means of X_j and Z_j in domain U_j. Observe that the unknown μ_j is generated without measurement errors. The RBP of μ_j is

μ̂_j = (X̄_j + D̄_j)β + Z̄_j b̂_j,

where D̄_j is the hypothetical domain mean of the measurement errors. Based on this approximation and Proposition 3, we can state the REBP of μ_j by substituting the unknown model parameter θ with an estimator θ̂ = (β̂', η̂')' under the min-max setting. This is done by solving two optimization problems iteratively. For fixed effect estimation, we solve the regularized weighted least squares problem

min_{β∈R^p} ‖V(η̂)^{-1/2}(y − X̃β)‖_2^2 + Σ_{l=1}^d λ_l f_l(h_l(β))     (17)

given variance parameter candidates η̂ and a predefined regularization Σ_{l=1}^d λ_l f_l(h_l(β)). For variance parameter estimation, we solve the maximum likelihood (ML) problem

min_η log|V(η)| + (y − X̃β̂)'V(η)^{-1}(y − X̃β̂)     (18)

given the robust solution β̂, where |V(η)| is the determinant of V(η). Both estimation steps are performed conditionally on each other until convergence. For an iteration index r = 1, 2, ..., the complete procedure is summarized in the subsequent algorithm.
Algorithm 1 Model parameter estimation
1: Center the response observations y and standardize the covariate observations X̃.
2: Initialize the variance parameter estimate η̂^(0).
3: while not converged do
4: Solve (17) given η̂^(r−1) to obtain β̂^(r).
5: Solve (18) given β̂^(r) to obtain η̂^(r).
6: end while
In the above description, centering means transforming y such that it has zero mean. Further, standardization implies transforming X̃ such that each of its columns has zero mean and unit length. At this point, we omit a detailed presentation of suitable methods to solve the individual problems. This has already been addressed exhaustively in the literature. For the solution of (17), coordinate descent methods are often applied. See for instance Tseng and Yun (2009), Friedman et al. (2010), as well as Bottou et al. (2018). Regarding the solution of (18), a Newton-Raphson algorithm can be used (Lindstrom and Bates 1988; Searle et al. 1992). For an estimate θ̂ = (β̂', η̂')', the REBP of μ_j is

μ̂_j^{REBP} = (X̄_j + D̄_j)β̂ + Z̄_j b̂_j(θ̂),

where b̂_j(θ̂) denotes the RBP of b_j evaluated at θ̂.
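A minimal sketch of the alternating scheme in Algorithm 1 for a random-intercept model with a Ridge-type penalty. All names are our own, the closed-form β-step is a simplification valid only for h = ‖·‖_2^2, and a production implementation would use coordinate descent for general penalties in (17) and Newton-Raphson for (18):

```python
import numpy as np
from scipy.optimize import minimize

def fit_robust_lmm_ridge(y, X, groups, lam, n_iter=10):
    """Alternate a ridge-penalized WLS step for beta and an ML step for
    eta = (sigma2, psi2) under V(eta) = sigma2*I + psi2 * Z Z' (random intercept)."""
    n, p = X.shape
    Z = (groups[:, None] == np.unique(groups)[None, :]).astype(float)  # incidence
    log_eta = np.zeros(2)                              # (log sigma2, log psi2)
    for _ in range(n_iter):
        sigma2, psi2 = np.exp(log_eta)
        Vi = np.linalg.inv(sigma2 * np.eye(n) + psi2 * Z @ Z.T)
        # step (17): ridge-penalized weighted least squares, closed form
        beta = np.linalg.solve(X.T @ Vi @ X + lam * np.eye(p), X.T @ Vi @ y)
        r = y - X @ beta
        def nll(le):                                   # step (18): ML for eta
            s2, p2 = np.exp(le)
            Vc = s2 * np.eye(n) + p2 * Z @ Z.T
            return 0.5 * (np.linalg.slogdet(Vc)[1] + r @ np.linalg.solve(Vc, r))
        log_eta = minimize(nll, log_eta, method="Nelder-Mead").x
    return beta, np.exp(log_eta)
```

The log-parametrization keeps the variance components positive without explicit constraints; convergence monitoring is omitted for brevity.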

Conservative MSE estimation under measurement errors
Hereafter, we demonstrate how to obtain conservative estimates for the MSE of the REBP, which is given by E[(μ̂_j^{REBP} − μ_j)^2]. This is done in two steps. We first derive upper bounds for the MSE of the RBP under known model parameters in the presence of measurement errors, that is, for E[(μ̂_j − μ_j)^2]. Then, we state a Jackknife procedure that accounts for the additional uncertainty resulting from model parameter estimation. Combining both steps ultimately allows for conservative MSE estimates.

MSE bound for the RBP
For the sake of a compact presentation, we assume that the random effect structure is limited to a random intercept, thus z_ij b_j = b_j with b_j ∼ N(0, ψ^2). Note that there is a related MSE derivation for a general random effect structure in Appendix 8 of the supplemental material. Under the random intercept setting, the MSE of the RBP is given by

MSE(μ̂_j) = E[(μ̂_j − μ_j)^2],     (22)

where the prediction error is driven by b_j, by the mean model error ē_j = n_j^{-1} Σ_{i∈S_j} e_ij with e_ij iid ∼ N(0, σ^2), and by the mean measurement error D̄_j. Recall that b_j and ē_j are independent. Since D̄_j is a fixed unknown quantity under the min-max setting, it follows that the MSE contains the term (D̄_j β)^2, which is unknown due to the measurement errors being unobservable. Thus, even with all the model parameters known, we cannot calculate the exact value of (22) because D̄_j is an unknown quantity under the considered setting. Yet, recall that for min-max robustification, we introduce perturbations to account for the uncertainty resulting from covariate contamination. Therefore, we can replace the term (D̄_j β)^2 by a corresponding expression (Δ̄_j β)^2, provided that λ_1, ..., λ_d are chosen sufficiently high. Here, Δ̄_j is the j-th row of the matrix of domain-averaged perturbations Δ̄. As described in Sect. 2.1, the perturbations are elements of an underlying uncertainty set U, which depends on the regularization Σ_{l=1}^d λ_l f_l(h_l(β)). If Δ ∈ U, the uncertainty set induces an upper bound Σ_{l=1}^d ϕ_l h_l(β) on the total impact of the perturbations on model parameter estimation in accordance with Theorem 1. Provided that the loss function g is the squared ℓ2-norm, the total impact is measured by ‖Δβ‖_2 in accordance with Theorem 1. From this argumentation, we can state the error bound

(Δ̄_j β)^2 ≤ (Σ_{l=1}^d ϕ_l h_l(β))^2.     (23)

Substituting this bound into (22) subsequently yields an upper limit for the MSE. However, by Theorem 1, the error bound (23) depends on the chosen regularization. Naturally, it has to be determined for each uncertainty set individually. Here, we encounter the problem that the uncertainty set parameters ϕ_1, ..., ϕ_d are unknown.
The one-to-one relation between the regularization parameters and the uncertainty set parameters in Proposition 1 is lost in Theorem 1. Hence, the values of the uncertainty set parameters have to be recovered before corresponding error bounds can be used. In order to recover the uncertainty set parameters, we apply the following basic procedure. Assume that we have computed an optimal solution β̂ of a regularized regression problem in accordance with Theorem 1. By Theorem 1, β̂ is also an optimal solution of

min_{β∈R^p} max_{Δ∈U} f(g(y − (X̃ + Δ)β))

for appropriate ϕ_1 > 0, ..., ϕ_d > 0 forming the uncertainty set U. Thus, by Proposition 1, we know that β̂ is an optimal solution of

min_{β∈R^p} max_{Δ∈U} g(y − (X̃ + Δ)β).     (26)

Following this argumentation, we recover the relation by choosing ϕ_1, ..., ϕ_d in such a way that β̂ is an optimal solution of (26). For this, we need to derive the optimality conditions of (26) for a given specification of g and h. We then solve the arising system of equations for the uncertainty set parameters in dependence of the regularization parameters. This yields the relation between ϕ_1, ..., ϕ_d and λ_1, ..., λ_d. In what follows, we demonstrate the procedure for Ridge regression, the LASSO, and the Elastic Net. The obtained results are subsequently used to find upper bounds for the MSE stated in (22). Let W := V^{-1/2}X̃ and denote the columns of W by W_1, ..., W_p. Further, define v := V^{-1/2}y.

Error bound for Ridge regression
We have a single uncertainty set parameter ϕ and the optimization problem

min_{β∈R^p} ‖v − Wβ‖_2^2 + λ‖β‖_2^2.     (27)

In this setting, ϕ can be recovered according to the subsequent proposition.
Proposition 4 Let β̂_2 ≠ 0_p be an optimal solution to the optimization problem (27).
Then, the related uncertainty set, as described in Theorem 1, is given by

U_2 = {Δ ∈ R^{n×p} : ‖Δβ‖_2 ≤ ϕ‖β‖_2 for all β ∈ R^p} with ϕ = λ‖β̂_2‖_2 / ‖v − Wβ̂_2‖_2.

The proof can be found in Appendix 3 of the supplemental material. Observe that the uncertainty set parameter ϕ has a closed-form solution when Δ ∈ U_2. We use this expression to substitute ϕ in the error bound (23). This allows us to state the upper bound (28) for the MSE (22) when min-max robustification is achieved via Ridge regression.

Error bound for the LASSO

We have a single uncertainty set parameter ϕ and the optimization problem

min_{β∈R^p} ‖v − Wβ‖_2^2 + λ‖β‖_1.     (29)

Under this premise, ϕ can be recovered as follows.
Proposition 5 Let β̂_1 ≠ 0_p be an optimal solution to the optimization problem (29).
Then, the related uncertainty set, as described in Theorem 1, is given by

U_1 = {Δ ∈ R^{n×p} : ‖Δβ‖_2 ≤ ϕ‖β‖_1 for all β ∈ R^p} with ϕ = λ / (2‖v − Wβ̂_1‖_2).

The proof can be found in Appendix 4 of the supplemental material. We see that the uncertainty set parameter has a closed-form solution when Δ ∈ U_1. We use this expression to substitute ϕ in the error bound (23). This allows us to state the upper bound (30) for the MSE (22) when min-max robustification is achieved via the LASSO.
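The recovery procedure can be sanity-checked numerically. Matching the stationarity conditions of the squared problems (27) and (29) with those of the corresponding non-squared problems min_β ‖v − Wβ‖_2 + ϕ‖β‖_2 and min_β ‖v − Wβ‖_2 + ϕ‖β‖_1 suggests the candidates ϕ = λ‖β̂_2‖_2/‖v − Wβ̂_2‖_2 (Ridge) and ϕ = λ/(2‖v − Wβ̂_1‖_2) (LASSO). The following sketch with random data verifies both conditions; the ISTA solver and all constants are our own choices, not the paper's procedure:

```python
import numpy as np

def lasso_ista(W, v, lam, iters=20_000):
    """Proximal gradient (ISTA) for min ||v - W b||_2^2 + lam * ||b||_1."""
    L = 2 * np.linalg.norm(W, 2) ** 2              # Lipschitz constant of the smooth part
    b = np.zeros(W.shape[1])
    for _ in range(iters):
        g = b - (2 / L) * (W.T @ (W @ b - v))      # gradient step
        b = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-thresholding
    return b

rng = np.random.default_rng(2)
W = rng.normal(size=(40, 6))
v = rng.normal(size=40)
lam = 2.0

# Ridge: stationarity of ||v - W b||_2 + phi * ||b||_2 at the ridge solution
b_r = np.linalg.solve(W.T @ W + lam * np.eye(6), W.T @ v)
r_r = v - W @ b_r
phi_r = lam * np.linalg.norm(b_r) / np.linalg.norm(r_r)
grad_r = -W.T @ r_r / np.linalg.norm(r_r) + phi_r * b_r / np.linalg.norm(b_r)

# LASSO: W'r / ||r||_2 must lie in phi * subdifferential of ||b||_1
b_l = lasso_ista(W, v, lam)
r_l = v - W @ b_l
phi_l = lam / (2 * np.linalg.norm(r_l))
c = W.T @ r_l / np.linalg.norm(r_l)
nz = b_l != 0
```

For the Ridge case the check is exact algebra (the normal equations give W'r = λβ̂); for the LASSO it holds up to the solver tolerance.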

Error bound for the Elastic Net
We have two uncertainty set parameters ϕ_1, ϕ_2 and the optimization problem

min_{β∈R^p} ‖v − Wβ‖_2^2 + λ_1‖β‖_1 + λ_2‖β‖_2^2.     (31)

They are recovered as demonstrated hereafter.

Proposition 6
Let β̂_EN ≠ 0_p be an optimal solution to the optimization problem (31).
Then, the related uncertainty set, as described in Theorem 1, is given by

U_EN = {Δ ∈ R^{n×p} : ‖Δβ‖_2 ≤ ϕ_1‖β‖_1 + ϕ_2‖β‖_2 for all β ∈ R^p},

with ϕ_1, ϕ_2 being a solution of the system of optimality conditions of (31). The proof can be found in Appendix 5 of the supplemental material. Note that the term 1_p marks a column vector of p ones. We see that the uncertainty set parameters ϕ_1, ϕ_2 do not have a closed-form solution. They can be quantified numerically, for instance by applying the Moore-Penrose inverse. For a given robust estimation problem with optimal solution β̂_EN, let ϕ*_1, ϕ*_2 be the solutions for the uncertainty set parameters. Plugging them into the MSE equation (22), we obtain the upper bound (32) when min-max robustification is achieved via the Elastic Net.

Conservative Jackknife estimator
We now use the MSE bounds (28), (30), and (32) for the RBP to construct a conservative Jackknife estimator for the MSE of the REBP. For this, we rely on theoretical developments presented by Jiang et al. (2002). The term "conservative" stems from the fact that we do not use the actual MSEs, but their upper bounds in accordance with Theorem 1. With this, we do not obtain estimates of MSE(μ̂_j^{REBP}) in the classical sense. Instead, an upper bound for the measure is estimated under a given level of uncertainty resulting from unobservable measurement errors. We refer to this as pessimistic MSE (PMSE) estimation. A delete-1 Jackknife procedure is applied. In every iteration of the algorithm, a domain-specific subsample S_j is deleted from the database. The remaining observations are used to perform model parameter estimation. Based on the obtained estimates, predictions for μ_j in all domains U_1, ..., U_m are produced. The principle is repeated until all domain-specific subsamples have been deleted once. With this resampling scheme, we obtain an approximation to the prediction uncertainty resulting from model parameter estimation (Jiang et al. 2002; Burgard et al. 2020a).

Algorithm 2 Conservative Jackknife estimation
1: for j = 1, ..., m do
2: Delete the subsample S_j from the database.
3: Re-estimate the model parameters via Algorithm 1 to obtain θ̂_(−j).
4: for k = 1, ..., m do
5: Compute the prediction for μ_k under θ̂_(−j).
6: end for
7: Restore S_j.
8: end for
After the algorithm is completed, the conservative Jackknife estimator for the REBP of μ_j is calculated by combining the respective MSE bound (28), (30), or (32) with the delete-1 estimates in accordance with Jiang et al. (2002).
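The combination step can be sketched as follows. We assume the standard delete-1 form of the Jiang et al. (2002) estimator, a bias correction on the plug-in bound plus a resampling variance term; all array names are hypothetical:

```python
import numpy as np

def jackknife_pmse(b_full, b_del, mu_full, mu_del):
    """Conservative delete-1 Jackknife in the spirit of Jiang et al. (2002).

    b_full : (m,)   MSE bounds evaluated at the full-sample estimate theta_hat
    b_del  : (m, m) row u: bounds evaluated with domain u deleted
    mu_full: (m,)   REBPs at theta_hat
    mu_del : (m, m) row u: REBPs with domain u deleted
    """
    m = b_full.shape[0]
    # bias correction for plugging theta_hat into the bound
    bias = (m - 1) / m * (b_del - b_full[None, :]).sum(axis=0)
    # extra variability caused by model parameter estimation
    var = (m - 1) / m * ((mu_del - mu_full[None, :]) ** 2).sum(axis=0)
    return b_full - bias + var
```

If deleting a domain changes neither the bounds nor the predictions, the estimator reduces to the plug-in bound itself, which is the intended conservative baseline.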

Consistency
Hereafter, we study conditions under which min-max robustification as described in Sect. 2.1 allows for consistency in model parameter estimation. We adapt the theoretical framework developed by Fan and Li (2001) as well as Ghosh and Thoresen (2018) for the asymptotic behavior of regularized regression with non-concave penalties. Based on the insights of Sect. 2, we introduce min-max robustification into their developments. For this purpose, let us state the optimization problem for model parameter estimation as follows:

min_{θ∈Θ} Q_n(θ) = L_n(β, η) + n Σ_{l=1}^d λ_n^l Σ_{k=1}^p h_k^l(β_k),

with Θ ⊆ R^p × (0, ∞)^{1+q*} as parameter space, Q_n : Θ → R as objective function, and L_n denoting the negative log-likelihood function of (16). The superscript l = 1, ..., d marks the index of the regularization terms and k = 1, ..., p is the index of the fixed effect coefficients. Thus, the term n Σ_{l=1}^d λ_n^l Σ_{k=1}^p h_k^l(β_k) is a componentwise non-concave regularization with parameters λ_n^1, ..., λ_n^d depending on the sample size. For each parameter λ_n^l, we have a family of increasing, convex and non-concave functions h_1^l, ..., h_p^l with h_k^l : R → R_+. This componentwise notation is required to establish consistency for a robust estimator β̂ that is potentially sparse. Further, following Fan and Li (2001) as well as Ghosh and Thoresen (2018), we let the regularization directly depend on n via multiplication. Note that this does not affect the min-max robustification presented in Theorem 1. For a given regularized regression problem, we can substitute a predefined parameter value λ_l with an equivalent term nλ'_l by choosing λ'_l = λ_l/n.

Asymptotic perturbation behavior
A necessary condition for consistency is that λ_n^l → 0 as n → ∞ for all l = 1, ..., d. In the classical context of regularized regression, this is a reasonable assumption. Increasing the sample size solves typical issues for which regularization is applied, such as the model not being identifiable or rank deficiency of the design matrix. However, in the light of robustification, it also implies that the impact of the measurement errors on model parameter estimation has to approach zero. Let us restate the uncertainty set from Theorem 1 in the asymptotic setting for g = ‖·‖_2. With this formulation, the mean impact of the measurement errors on model parameter estimation is constrained by the regularization. Yet, if we draw new observations, it is not guaranteed that the mean impact will approach zero for n → ∞. This is demonstrated hereafter. Without loss of generality, assume that the sample size in the asymptotic setting is increased in terms of sequential draws denoted by r = 1, 2, .... We start with a set of n_r^init > 0 initial sample observations in the r-th draw with r = 1. The perturbations resembling the uncertainty in the corresponding covariate observations are represented by the perturbation matrix Δ_r^init ∈ R^{n_r^init×p}. Next, we draw n_r^new > 0 new observations and pool them with the initial sample observations. The perturbation matrix for these new observations is Δ_r^new ∈ R^{n_r^new×p}. In the next draw r + 1 = 2, the previously pooled observations represent the initial ones, such that n_{r+1}^init = n_r^init + n_r^new. This is repeated for r → ∞, implying that n_{r+1}^init → ∞. Note that the limit is not necessarily zero, provided that γ ≠ 0_p. This, however, is imperative if λ_n^l → 0 for all l = 1, ..., d while simultaneously ensuring a robust solution in the sense of Theorem 1.
At this point, we can conclude that min-max robust model parameter estimation within the robust LMM is not consistent for arbitrary design matrix perturbations, since the impact of the measurement error does not vanish. In order to guarantee that the mean impact of the measurement errors approaches zero, we have to introduce further assumptions on the asymptotic perturbation behavior. These assumptions are with respect to the magnitude of the perturbations. We explicitly avoid assumptions regarding their distribution, as the main feature of min-max robustification is the absence of distributional assumptions. The required behavior in terms of sequential draws is characterized by the subsequent lemma. The proof can be found in Appendix 6 of the supplemental material. In practice, the behavior stated in Lemma 1 may be viewed as the measurement process becoming more accurate over time, or the number of contaminated observations rising at a smaller rate than the number of correct observations.

Asymptotic results
For illustrative purposes, let θ* denote the true value of the model parameter vector. Consistency is studied by investigating the asymptotic behavior of ‖θ̂ − θ*‖_2 as n → ∞. We consider a deterministic design matrix setting with a fixed number of covariates (p + 1 + q*) < n. With this, we assume that model parameter estimation is a low-dimensional problem. The asymptotics of regularized regression are usually studied in high-dimensional settings. However, the focus of our contribution is on robust estimation rather than on high-dimensional inference. Therefore, we restrict the analysis to the simpler case of low-dimensional problems. See Schelldorfer et al. (2011), Shao and Deng (2012), as well as Loh and Wainwright (2012) for theoretical results on high-dimensional inference.
In what follows, we draw from the theoretical framework presented by Fan and Li (2001) as well as Ghosh and Thoresen (2018). Several assumptions are introduced that are required in order to establish consistency. For simplicity, and by the centering of the response observations as well as the standardization of the covariate values as displayed in Algorithm 1, we assume that the domain response vectors y_j ∈ y are iid. Thus, we can state the negative log-likelihood

L_n(β, η) = (1/2) Σ_{j=1}^m [n_j log(2π) + log|V_j(η)| + (y_j − X̃_jβ)'V_j(η)^{-1}(y_j − X̃_jβ)].

See that L_n(β, η) is convex in β and non-convex in η. Let λ_n^1, ..., λ_n^d → 0 as n → ∞. Define A_n and B_n as the maximum values of the first- and second-order derivatives of the regularization terms with respect to all non-zero elements of the regression coefficient vector at the true point β*.

Assumption 1
The model is identifiable and the support of f(y; β, η) is independent of the parameter θ = (β', η')'. The density f(y; β, η) has first- and second-order derivatives satisfying

E[∂ log f(y; β, η)/∂θ_k] = 0 and I_kl(θ) = E[(∂ log f/∂θ_k)(∂ log f/∂θ_l)] = E[−∂^2 log f/(∂θ_k ∂θ_l)],

with the Fisher information matrix I(θ) being finite and positive definite at θ = θ*.

Assumption 3
There exists an open subset of Θ containing (β*, η*) on which f(y; β, η) admits all its third-order partial derivatives for almost all y, and these derivatives are uniformly bounded by some function with finite expectation under the true value of the full parameter vector.
Assumption 4 Regarding the regularization term, $B_n \to 0$ as $n \to \infty$.
Assumptions 1 to 3 are basic regularity conditions on ML estimation problems and are satisfied by the majority of common statistical models. Assumption 4 is a requirement for non-concave regularizations to ensure that the objective value difference between a local minimizer and the true parameter value approaches zero asymptotically. Since we have assumed that $\lambda_1, \dots, \lambda_d \to 0$ as $n \to \infty$, this holds for all considered regularizations. Assumption 5 is a technical requirement that can be viewed as the number of contaminated observations rising at a slower rate than the correctly measured observations. Based on the presented system of assumptions, the following theorem can be stated.
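Schematically, consistency results established under Fan and Li (2001)-type conditions take the following form; this is a hedged paraphrase of the generic result, not the paper's exact theorem statement:

```latex
% Fan--Li-type local consistency (hedged paraphrase, not the paper's exact theorem):
% under Assumptions 1--5, with \lambda_1,\dots,\lambda_d \to 0 and B_n \to 0
% as n \to \infty, there exists a local minimizer \hat{\theta} of the
% regularized objective such that
\[
  \bigl\lVert \hat{\theta} - \theta^{*} \bigr\rVert_{2}
  \;=\; O_{p}\!\bigl(n^{-1/2}\bigr),
\]
% i.e. the min-max robust estimator is root-n consistent.
```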
The proof can be found in Appendix 7 of the supplemental material.

Set up
A Monte Carlo simulation with $R = 500$ iterations indexed by $r = 1, \dots, R$ is conducted. We use a deterministic design matrix setting and create a synthetic population $U$ of $N = 50\,000$ individuals in $m = 100$ equally sized domains with $N_j = 500$. A random subset $S \subset U$ of size $n = 500$ is drawn once such that $n_j = 5$ for $j = 1, \dots, m$.
The response variable realizations are generated in each iteration individually. The overall specifications are $\mu_X = (2, 2, 2)'$, $\sigma^2_X = 1$, $\beta = (2, 2, 2)'$, and $\psi^2 = \sigma^2 = 1$. Since we consider a deterministic design matrix setting, $x_{ij}$ is generated once for each $i \in U$ and then held fixed over all Monte Carlo iterations. The random components $z_j$ and $e_{ij}$ are drawn from their respective distributions in each iteration individually. The objective is to estimate the domain mean of $Y$ as given in (12) based on $S$. For the sample observations, we simulate erroneous covariate observations in the sense that only $\tilde{x}_{ij} = x_{ij} + d_{ij}$ is observed for all $i \in S$ rather than $x_{ij}$. Regarding the covariate measurement errors $d_{ij}$, we once again follow the deterministic design matrix setting. Thus, we generate $d_{ij}$ once for each $i \in U$ and then hold it fixed over all Monte Carlo iterations. However, we implement four different scenarios with respect to $d_{ij}$, defined for $j = 1, \dots, m$, $i = 1, \dots, N_j$, to evaluate the benefits of min-max robustification under different data constellations. Scenario 1 corresponds to the absence of measurement errors, while Scenarios 2 to 4 introduce errors, with $\Sigma_D^{[2]}$ and $\Sigma_D^{[3]}$ denoting the scenario-specific covariance matrices.

Under each setting, we consider the following prediction and estimation methods:
• EBLUP/ML: empirical best linear unbiased predictor under the basic LMM, ML estimation (Pinheiro and Bates 2000).

We use several performance measures for the evaluation of domain mean prediction. For model parameter estimation, let $k = 1, \dots, p + q^* + 1$ be the index of all model parameters. For each $\hat{\theta}_k \in \hat{\theta}$, we consider the mean squared error, the relative bias, and the coefficient of variation (CV).

We use the R package glmnet for the optimization problem (17) to obtain fixed effect estimates. The tuning of the regularization parameters required for (17) is performed via standard cross validation using a parameter grid of 1000 candidate values. Further, we use the R package nloptr for the maximum likelihood problem (18) to obtain variance parameter estimates.
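The data-generating process described above can be sketched as follows. This is an illustrative Python translation (the paper's implementation uses R), and the additive noise shown stands in for one of the four measurement error scenarios, whose exact specifications are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population layout following the set-up: m = 100 domains of N_j = 500 individuals
N, m, p = 50_000, 100, 3
N_j = N // m                      # 500 individuals per domain
domain = np.repeat(np.arange(m), N_j)

# Deterministic design matrix: drawn once, then held fixed over iterations
mu_X, sigma_X, beta = np.full(p, 2.0), 1.0, np.full(p, 2.0)
X = rng.normal(mu_X, sigma_X, size=(N, p))

# Sample of n = 500 with n_j = 5 per domain, drawn once
n_j = 5
sample_idx = np.concatenate(
    [rng.choice(np.where(domain == j)[0], size=n_j, replace=False) for j in range(m)]
)

def simulate_response(X, domain, beta, psi2=1.0, sigma2=1.0, rng=rng):
    """One Monte Carlo iteration: z_j and e_ij are redrawn each time."""
    z = rng.normal(0.0, np.sqrt(psi2), size=m)        # domain random effects
    e = rng.normal(0.0, np.sqrt(sigma2), size=len(X)) # individual error terms
    return X @ beta + z[domain] + e

# Measurement error: generated once and held fixed (illustrative normal noise;
# the paper's scenarios 2-4 use different, partly asymmetric, distributions)
D = rng.normal(0.0, 0.5, size=(N, p))
X_tilde = X + D                   # only X_tilde is observed on the sample S

y = simulate_response(X, domain, beta)
print(X_tilde[sample_idx].shape, y[sample_idx].shape)
```

Refitting `simulate_response` inside a loop over $r = 1, \dots, R$ while keeping `X`, `D`, and `sample_idx` fixed reproduces the deterministic design matrix logic of the study.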

Results
We start with the results for domain mean prediction. They are summarized in Table 1 and further visualized in Fig. 1. In the latter, the densities of the prediction errors $\hat{\mu}_j^{(r)} - \mu_j^{(r)}$ are plotted for the EBLUP and the L2.REBP. From Table 1, it can be seen that in Scenario 1 (absence of measurement errors) the unregularized EBLUP and the regularized predictors have very similar performance. This is due to the fact that the optimal regularization parameter found by cross validation is very close to zero in the absence of measurement errors. A different picture arises in the presence of measurement errors, that is, in Scenario 2 to Scenario 4. Here, the regularized predictors are much more efficient than the unregularized EBLUP.
The efficiency advantage in terms of the MSE ranges from 37% to 47%. Observe that the advantage through min-max robustification is evident even for asymmetric errors. This underlines the theoretical result that we do not require any distribution assumptions on the error. Another interesting aspect is that min-max robustification leads to a bias in domain mean prediction. In general, this was expected, since it is well known that regularization introduces bias to model parameter estimation (Hoerl and Kennard 1970). However, by looking at Fig. 1, we see that the bias increases in the presence of measurement errors. This is due to the fact that the optimal regularization parameter found by cross validation is larger in Scenario 2 to 4 compared to Scenario 1. This is an important aspect, as it implies that cross validation is sensitive with respect to measurement errors. The robustness argument provided in Theorem 1 relies on the assumption that $\lambda_1, \dots, \lambda_d$ are chosen sufficiently large. Although in practice we never have a guarantee that this assumption is satisfied, the observation suggests that cross validation is capable of finding $\lambda_1, \dots, \lambda_d$ that at least approximate the level of uncertainty in $\tilde{X}$.

We continue with the results for model parameter estimation. They are summarized in Table 2 and further visualized in Fig. 2. For Table 2, note that since $\beta_1 = \beta_2 = \beta_3 = 2$, we pool the estimates $\hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3$ for the calculation of the performance measures and summarize the results in the columns MSE($\beta$) and Bias($\beta$). For the variance parameters, we proceed accordingly. Since $\sigma^2 = \psi^2 = 1$, we pool the estimates and summarize the results in the columns MSE($\eta$) and Bias($\eta$). Likewise, in Fig. 2, the variance parameter deviations $\hat{\eta}_k^{(r)} - \eta_k$ are plotted for all considered methods and both parameters simultaneously. With respect to $\beta$-estimation, the unregularized approach is slightly more efficient than the regularized methods in Scenario 1.
This is because regularization introduces bias to the estimates, as pointed out by Hoerl and Kennard (1970). In the presence of measurement errors, min-max robustification obtains much more efficient results. The advantage ranges from 36 to 72%. We further see that the additional noise in the covariate data affects the unregularized approach considerably, which leads to a negative bias. Regularization, on the other hand, manages to reduce the bias in this setting, as the estimates are less influenced by the contamination. For $\eta$-estimation, the benefits of min-max robustification are also visible. We see that the variance parameter estimates based on the unregularized $\beta$-estimates are severely biased upwards. This suggests that the additional noise in the design matrix is falsely interpreted as increased error term variance $\sigma^2$. The results based on the robust $\beta$-estimates are less biased. By looking at Fig. 2, it further becomes evident that min-max robustification allows for stable variance parameter estimates. The overall spread of the estimates is much narrower than for the unregularized approach. In terms of the MSE, the advantage of regularized estimation over unregularized estimation ranges from 15 to 31%.
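The pooled performance measures can be computed along the following lines. The definitions below are common Monte Carlo conventions and may differ in detail from the paper's exact formulas, which are not reproduced here:

```python
import numpy as np

def mc_performance(estimates, truth):
    """Monte Carlo performance measures for a scalar parameter.

    estimates : array of estimates, one per Monte Carlo iteration
    truth     : true parameter value
    Returns (MSE, relative bias, coefficient of variation) using standard
    conventions (illustrative; not necessarily the paper's exact formulas).
    """
    est = np.asarray(estimates, dtype=float)
    mse = np.mean((est - truth) ** 2)                # mean squared error
    rbias = (np.mean(est) - truth) / truth           # relative bias
    cv = np.std(est, ddof=1) / np.mean(est)          # coefficient of variation
    return mse, rbias, cv

# Pooling the estimates of beta_1, beta_2, beta_3 (all equal to 2), as done
# for the MSE(beta) and Bias(beta) columns of Table 2 (toy data for illustration):
rng = np.random.default_rng(1)
pooled = rng.normal(2.0, 0.1, size=(500, 3)).ravel()
mse, rbias, cv = mc_performance(pooled, 2.0)
```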
We finally look at the results for (pessimistic) MSE estimation. They are presented in Table 3. For the absence of measurement errors, the resulting estimates are approximately unbiased. This makes sense given the fact that the optimal regularization parameter found by cross validation on correctly observed data is close to zero. In this setting, the modified Jackknife procedure summarized in Algorithm 2 basically reduces to a standard Jackknife, which is known to be suitable for MSE estimation in mixed models (Jiang et al. 2002). In Scenario 2 to 4, the proposed method leads to biased estimates of the true MSE. The relative overestimation ranges from 68 to 90%. However, this was expected in light of the theoretical results from Chapter 3. The obtained estimates are based on the MSE bounds (28), (30), and (32). These bounds were used in order to find upper limits of the true MSE given the fact that covariate observations are contaminated by unknown errors. If the bounds are plugged into Algorithm 2, the resulting estimator (33) will always overestimate the true MSE for $\lambda_1, \dots, \lambda_d$ chosen appropriately large. This is due to the nature of min-max robustification characterized in Proposition 1. It introduces design matrix perturbations that are maximal with respect to the underlying uncertainty set. Thus, the obtained MSE estimates are based on the premise that the measurement error associated with a given prediction is potentially maximal with respect to $U$. If the distribution of the covariate measurement errors were known, it would be possible to find more accurate MSE bounds. See Loh and Wainwright (2012) for a corresponding theoretical analysis. However, we do not consider this case here, as our main focus is to achieve robustness without knowledge of the measurement error.
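A minimal delete-one-domain jackknife in the spirit of Jiang et al. (2002) can be sketched as follows. This is a simplified illustration of the variability term only, not the paper's Algorithm 2 with its conservative MSE bounds; `predict` is a hypothetical refitting routine:

```python
import numpy as np

def jackknife_mse(predict, m):
    """Delete-one-domain jackknife sketch (illustrative simplification).

    predict(domains) refits the model on the listed domains and returns the
    prediction of interest; m is the number of domains. Returns the
    jackknife variability term (m-1)/m * sum of squared deviations of the
    delete-one predictions from the full-sample prediction.
    """
    full = predict(list(range(m)))
    loo = np.array([predict([j for j in range(m) if j != l]) for l in range(m)])
    return (m - 1) / m * np.sum((loo - full) ** 2)

# Toy usage with hypothetical domain means: predicting the grand mean
means = np.array([1.0, 2.0, 3.0, 4.0])
mse_hat = jackknife_mse(lambda doms: np.mean(means[doms]), m=len(means))
```

In the paper's setting, `predict` would refit the regularized LMM without domain $l$ and the squared deviations would be combined with the pessimistic bounds (28), (30), and (32), which is what makes the resulting estimator (33) conservative.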

Conclusion
The presented paper addressed the connection between regularization and min-max robustification for linear mixed models in the presence of unobserved covariate measurement errors. It was demonstrated that min-max robustification represents a powerful tool to obtain reliable model parameter estimates and predictions when the underlying data are contaminated by unknown errors. We showed that this approach to robust estimation is equivalent to regularized model parameter estimation for a broad range of penalties. These insights were subsequently used to construct a robust version of the basic LMM to perform regression analysis on contaminated data sets. We derived robust best predictors under the model and presented a novel Jackknife algorithm for conservative mean squared error estimation with respect to response predictions. The methodology allows for reliable uncertainty measurement and does not require any distribution assumptions regarding the error. In addition, we conducted an asymptotic analysis and proved that min-max robustification allows for consistency in model parameter estimation.
The theoretical findings of our study shed new light on regularized regression for future research. The proposed min-max robustification marks a very attractive addition to big data analysis, where measurement errors tend to be uncontrollable. Indeed, regularized regression gains novel legitimacy in these contexts. Nevertheless, our results further suggest that it is also beneficial for standard applications. The methodology introduces an alternative concept of robustness that is relatively new to statistics. Accounting for general data uncertainty of a given magnitude (min-max robust) rather than measurement errors of a given distribution (min-min robust) marks a different paradigm that can enhance robust statistics in future applications. By virtue of these properties, regularized regression can obtain reliable results, for instance, in survey analysis when sampled individuals provide inaccurate information, or in official statistics when indicators are based on estimates. Another big advantage of our method is its simplicity. In Theorem 1, we establish that regularized regression and min-max robustification are equivalent under the considered setting. This implies that we can obtain min-max robust estimation results by using well-known standard software packages, such as glmnet. Accordingly, min-max robustification can be broadly applied and is computationally efficient even for large data sets.
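The flavor of this equivalence can be illustrated numerically for the Ridge-type case via the classical identity of El Ghaoui and Lebret (1997): for a spectral-norm uncertainty set $\{\Delta : \|\Delta\|_2 \le \lambda\}$, the worst-case root loss equals the nominal loss plus $\lambda \|\beta\|_2$. The sketch below checks this under that assumed uncertainty set; it is a minimal illustration, not a reproduction of the paper's Theorem 1:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, lam = 50, 3, 0.3
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Evaluate the robust objective at an arbitrary coefficient vector b.
# Closed form of the inner maximum over {Delta : ||Delta||_2 <= lam}:
#   max ||y - (X + Delta) b||_2 = ||y - X b||_2 + lam * ||b||_2
b = rng.normal(size=p)
nominal = np.linalg.norm(y - X @ b)
worst_case = nominal + lam * np.linalg.norm(b)

# The maximum is attained by this rank-one perturbation of the design matrix
r = y - X @ b
Delta_star = -lam * np.outer(r / np.linalg.norm(r), b / np.linalg.norm(b))
achieved = np.linalg.norm(y - (X + Delta_star) @ b)

# Random feasible perturbations never exceed the closed-form bound
for _ in range(200):
    Delta = rng.normal(size=(n, p))
    Delta *= lam / np.linalg.norm(Delta, 2)   # scale onto the boundary of the set
    assert np.linalg.norm(y - (X + Delta) @ b) <= worst_case + 1e-9
```

Minimizing the closed-form right-hand side over $b$ is exactly a Ridge-type regularized problem, which is why standard solvers such as glmnet return min-max robust estimates.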
Nevertheless, there is still demand for future research. In Sect. 2.1, it was stated that the nature of the robustification effect is determined by the underlying uncertainty set, which is in turn determined by the regularization the researcher chooses. It is likely that in practice there are situations where one regularization works better than another, depending on the underlying data contamination. As of now, it remains an open question how an optimal regularization could be determined when the measurement errors are completely unknown.