ROBOUT: a conditional outlier detection methodology for high-dimensional data

This paper presents a methodology, called ROBOUT, to identify outliers conditional on a high-dimensional noisy information set. In particular, ROBOUT is able to identify observations with outlying conditional mean or variance when the dataset contains leverage outliers, highly correlated variables, and a dimension that is large compared to the sample size. ROBOUT entails a preliminary robust imputation procedure, which prevents leverage outliers from corrupting predictor recovery, a selection stage of the statistically relevant predictors (through cross-validated LASSO-penalized Huber loss regression), the estimation of a robust regression model based on the selected predictors (via MM regression), and a criterion to identify conditional outliers. We conduct a comprehensive simulation study in which the proposed algorithm is tested under a wide range of perturbation scenarios. The combination formed by LASSO-penalized Huber loss and MM regression turns out to be the best in terms of conditional outlier detection under the above described perturbed conditions, also compared to existing integrated methodologies like SPARSE-LTS and RLARS. The proposed methodology is finally applied to a granular supervisory banking dataset collected by the European Central Bank, in order to model the total assets of euro area banks.


Introduction
Data quality is a fundamental prerequisite for any kind of quantitative analysis, and the large datasets which are becoming increasingly available present specific challenges to the task of monitoring and ensuring data quality. One critical aspect of data quality monitoring is outlier detection, i.e. the identification of values which are either obviously mistaken or seem to be unjustified from an empirical perspective. An outlier is defined by [12] as 'an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism'. Similarly, [2] defines an outlier as 'an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data'. Classical statistical methods, which underpin analytical tools supporting the analysis of large datasets, are sensitive to the presence of outliers and, consequently, reach distorted conclusions from the data analysis [25].
In this paper, we focus on outlier detection in high-dimensional datasets, where the number of variables p (i.e. the dimension of the data space) is large, possibly larger than the number of observations n (i.e. the sample size). In this way, we also include so-called fat datasets, featuring p > n. Such datasets arise in diverse fields, such as bioinformatics, economics, neuroscience, signal processing and others. Our aim is to retrieve from such datasets any anomaly in a target variable y with respect to a set of K related variables (the predictors of y) that constitute a subset of the p ≫ K variables of the dataset (the candidate predictors of y). The K predictors of y are ex ante unknown and need to be identified from the p variables, which are also usually affected by multicollinearity effects. In addition, we allow the K predictors to contain leverage outliers, a feature that typically inflates the predictive power of irrelevant predictors.
Given these premises, we propose a new method, called ROBOUT, to solve the problem of conditional outlier detection in high dimensions. ROBOUT can accommodate diverse statistical properties of the examined dataset (such as multicollinearity, leverage outliers or high-dimensional information sets) while at the same time being computationally efficient. Figure 1 illustrates visually the problem that we aim to solve and the idea of the method that we propose. We consider the target variable y and two variables of the dataset, x_1 and x_2 (in principle, the variables of the dataset could be many more than just two).
x_1 is a predictor of y while x_2 is not; however, this is not known ex ante. We assume that there is an outlier in variable y, namely point 4, which is an outlier of y conditional on predictor x_1. Point 4′ shows the 'correct' position that 4 would have occupied if it had followed the same data generating process as the other three points. The perturbation of point 4 affects the linear relationship between y and x_1, as line l_0 is shifted below to line l_1, but also affects the relationship between y and x_2, potentially leading to a spuriously statistically significant relationship.
Identifying the perturbation in point 4 requires solving two tasks: first, to identify that x_1 is the relevant conditional variable, and second, to identify that 4 is an outlier of y conditional on x_1.

Fig. 1: Identification of conditional outliers when the predictors are not known ex ante. The left panel depicts the scatter plot of the target variable y and its predictor variable x_1, while the right panel depicts the scatter plot of the target variable y and a non-predictor variable x_2. Point 4 is a conditional outlier of y on x_1. The blue line represents the OLS regression line with point 4 in the correct (theoretical) position, that is 4′, while the red line represents the OLS regression line with point 4 in the perturbed (actual) position.

Since the presence of outliers in y may disrupt the identification of the statistical relationship between two related variables (in this case y and x_1) and may generate a misleading statistical relationship between unrelated variables (in this case y and x_2), in this paper we separate the two problems: we first run a robust variable selection step, with the aim of identifying the relevant predictors, and we subsequently run a low-dimensional robust regression to identify conditional outliers. As will be argued in this paper, such a two-step approach has substantial advantages in terms of computational efficiency and performance in detecting conditional outliers.
If the conditional outliers in y are present in the same observations as anomalous instances of the related predictors, it has been shown that predictors are typically not identified correctly (see [14]). That is the reason why we develop a robust preliminary imputation procedure, which prevents such points (i.e. leverage points) from corrupting predictor recovery, thus restoring the identification of the true predictors. Ensuring the recovery of the true predictors is the key to successfully applying a subsequent robust regression in a consistent way and to identifying conditional outliers by means of the robustly estimated residual scale, thus avoiding the trap of variance estimation in high-dimensional linear models [5,8,22]. Other confounding factors for predictor recovery are a high level of multicollinearity, a high number of variables p compared to the sample size n, and a small signal-to-noise ratio.
The practical relevance of the problem as formulated above can be clarified by referring to the banking supervisory dataset that is used in this paper to show the behaviour of the proposed outlier detection method. A bank may present a particularly high value of, e.g., total assets (the variable y in the above formulation), which may be spotted as an anomalous instance compared to the rest of the banks. However, considering other related bank indicators, like debt securities and derivatives, we may realize that the total asset value for the bank in question is perfectly in line with expectations. On the other hand, a bank with an average value of total assets may be judged as anomalous with respect to the predictors. This could be, for example, the case when for a specific observation (i.e. bank) a relatively moderate amount of assets corresponds to a disproportionally high amount of derivatives, a financial instrument that one would normally expect to be used extensively by the largest banks. In general, whether an observation of variable y is considered to be an outlier does not depend solely on the distribution of y but also on the corresponding values of the most relevant predictors of y. This kind of outlier detection is referred to as conditional outlier detection [13], because it focuses on the detection of improbable values of a target variable given (i.e. conditionally on) the observed values of a set of related variables.
Let us formalize the described problem in probabilistic terms. In [13], unconditional outliers are defined as instances which fall into a low probability density region of f(y), where f(y) is the unconditional probability density function of the scalar target variable y. Instead, conditional outliers are defined as instances of y which fall into a low probability density region of f(y|x) = f(y, x)/f(x), where f(y|x) is the conditional probability density function of the scalar target variable y given a vector of predictors x, and the set of predictors x is a subset of a wider noisy information set Ω, which is given at the outset of the problem. In this paper, we assume that the expected value of f(y|x) is a linear function of the variables in x. Note that this does not constrain the nature of the prescribed relationships between y and the initial information set Ω from which x is identified, because we can always include, e.g., quadratic and exponential functions of specific variables in the vector x.
For the sake of simplicity, let us suppose that for the single observation i ∈ {1, ..., n} it holds that y_i | x_{D,i} ∼ N(a_i + x_{D,i}′β, σ_i²), where, for observation i, y_i is the specific value of y, a_i is the intercept term, x_{D,i} is the K × 1 vector of predictors, β is the K × 1 vector of regression coefficients, and σ_i² is the variance of y_i conditional on x_{D,i}.
We denote by D the set of predictor indices, such that |D| = K and D ⊆ {1, ..., p}, and by D′ the complementary set of D with respect to the set {1, ..., p}. The n × K matrix X_D contains as columns the predictor variables, indexed by D, the n × (p − K) matrix X_{D′} contains as columns the non-predictor variables, indexed by D′, and the n × p complete matrix X = [X_D | X_{D′}] represents the available information set Ω.
We define a set of outlier indices O, with O ⊆ {1, ..., n}, such that |O| = [αn]. The parameter α represents the contamination rate of the dataset, with α ∈ [0, 0.5]. We assume, without loss of generality, that for any i ∉ O it holds that a_i = a and σ_i² = σ², i.e. that all non-outlying observations feature a constant conditional mean and variance.
Definition 1. A conditional outlier in mean is defined as any observation i′ such that y_{i′} falls in a low probability density region of the conditional density of y given x_{D,i′}, because of a perturbation of the conditional mean.

Definition 2. A conditional outlier in variance is defined as any observation i′ such that y_{i′} falls in a low probability density region of the conditional density of y given x_{D,i′}, because of an inflation of the conditional variance.

Definition 3. A leverage outlier is defined as any observation i′ such that its vector of predictors x_{D,i′} is perturbed entry-wise.

Practically speaking, Definitions 1 and 2 may reflect that the target variable y is affected by measurement errors, idiosyncratic events, unpredictable shocks etc., as well as structural differences in the data generating mechanism. Definition 3 may represent the same situations in the matrix of predictors X_D. Conditional outlier recovery is critical to spot hidden inconsistencies or frauds in y, and cannot be performed properly if the unknown predictors of y, contained in X_D, are not identified. For these reasons, following the provided definitions, we study how the presence of multicollinearity effects and leverage outliers affects the identification of the true predictors of y and the recovery of conditional outliers (in mean or variance) in y, under different p/n ratios.
Few integrated methods have been presented in the literature that recover simultaneously both the K predictors of the response variable y (out of the p candidate predictors) and the conditional outliers in y, such as SPARSE-LTS [1] and RLARS [14]. To the best of our knowledge, their performance as conditional outlier detection methods in the presence of conditional outliers in variance, in the sense of Definition 2, has not been explored yet. Another existing method, called the SNCD (Semismooth Newton Coordinate Descent) algorithm [39], provides a fast and reliable solution to the recovery of predictors, by minimizing a Huber or a least absolute deviation (LAD) loss of the residuals penalized by an elastic net [40]. That optimization problem is solved by explicitly deriving the Karush-Kuhn-Tucker (KKT) conditions. However, SNCD has a significant drawback: it is not robust to leverage outliers, as also shown in [1]. For this reason, we provide a robust preliminary imputation procedure, which replaces anomalous values in the predictors by the median of the s closest non-outlying neighbours, identified by looking at the variable most (robustly) correlated with the predictor in question. In this way, SNCD may be applied to the clean imputed dataset to identify the right predictors, on which a robust regression model is finally estimated to spot conditional outliers.
The described method, called ROBOUT, is very efficient in terms of computational cost, which is a direct function of the degree of perturbation in the dataset. ROBOUT improves on existing methods in the literature by enhancing both predictor recovery and conditional outlier detection performance in high dimensions. This result is achieved by exploiting the large cross section to set up the preliminary imputation procedure, whose accuracy benefits from the presence of a high number of positively correlated variables, thus turning the curse of dimensionality into a blessing.
We conduct an extensive simulation study that includes scenarios featuring several levels of outlyingness in the target variable and in the predictors, various degrees of correlation in the dataset, and different p/n ratios. The results show that ROBOUT performs systematically better than existing alternatives, especially when p/n is high (larger than 1), predictors are correlated and leverage outliers are present, thus providing a feasible and reliable solution to the conditional outlier detection problem in high dimensions, which is effective also under challenging conditions when other methods may fail.
The paper is structured as follows. Section 2 presents a literature review on high-dimensional variable selection and robust regression methods. The ROBOUT methodology is formally defined in Section 3, entailing a robust preliminary imputation, a variable selection, a low-dimensional regression and a conditional outlier detection step. The wide simulation exercise that we undertake in Section 4 aims to test the performance of ROBOUT, placing emphasis on the versatility of the ensuing conditional outlier detection under several perturbed scenarios for different p/n ratios. Subsequently, we apply ROBOUT to a banking dataset collected by the European Central Bank in Section 5, and we provide some concluding remarks in Section 6.

State of the art
Large and high-dimensional datasets appear very suitable for conditional outlier detection in the regression context [26]. As [34] notes, the large size of the data requires automated techniques for the identification of subsets of variables which are statistically linked. In such datasets, the challenge is to identify the critical predictors and then to perform a conditional outlier detection for the variables of interest. To this end, we need to define both a variable selection and a robust regression procedure, in order to recover consistently both the true set of predictors and the regression coefficients. This type of outlier detection may spot outliers which could remain unnoticed if single variables are considered separately, thus taking advantage of the information content present in a large cross section.
The original idea of robust regression is based on the application of a weighting scheme to observations, with the aim of dampening the influence of outlying points on regression estimates, both in y (conditional outliers) and in X (leverage outliers). Two established methods for low-dimensional robust regression are:
- least trimmed squares (LTS) estimation [23], which identifies the 100(1 − α)% most concentrated observations and estimates the regression coefficients on those via ordinary least squares;
- MM-estimation [37], which is a three-stage procedure that minimizes a Tukey's bisquare function of the residuals using a robust initialization of the coefficients in β (obtained by S-regression) and of the residual scale σ (obtained by M-estimation).
In the p > n case, the mentioned traditional robust regression techniques no longer work, because they simply are weighted versions of the least squares method. As described in [9], which provides an exhaustive literature review on robust linear regression for high-dimensional data, a number of different strategies have consequently been proposed to perform robust regression in high dimensions.
A first strand of methodologies is based on dimension reduction. A possible idea is to robustly estimate the top r principal components of all the potential predictors, with r selected by cross-validation, and then to apply a low-dimensional robust regression method. This method is referred to as robust principal component regression (PCR). An alternative method of the same family suggests replacing robust principal component analysis with robust partial least squares (PLS), which robustly estimates the top r components explaining most of the covariance between the target variable y and the potential predictors. PCR requires the robust estimation of the covariance matrix in high dimensions (see [20]), while PLS requires the robust estimation of the covariance between y and weighted potential predictors [28]. Another relevant method in this strand is robust ridge regression, which shrinks down the contribution of non-predictor variables by means of a robustified version of ridge regression [19].
The methods described so far perform coefficient estimation, but do not provide any explicit predictor selection. Many variable selection methods, like SURE [7], SCAD [15], the Dantzig selector [4], Forward Regression [35] and SLOPE [3], are effective for fat datasets, but do not tolerate the presence of outliers, neither in the response variable nor in the rest of the dataset. In contrast, LAD-LASSO [36] and MD-LASSO [17] are only robust to the presence of outliers in the response variable. Several other variants of LASSO regression [31], like EXTENDED [18], FUSED [33], GRAPH [21], GROUP [38], SPARSE-GROUP [29] and ADAPTIVE [41], are not robust to outliers, as they all employ a least squares type of loss.
There are few methods proposed in the literature that perform simultaneously variable selection and conditional outlier detection. A robust LASSO regression method performing both variable selection and conditional outlier detection is SPARSE-LTS [1], which is based on the simultaneous optimization of a trimmed least squares loss and a LASSO penalty. The coefficient estimates are derived by performing at each iteration a LASSO estimation on the 100 × (1 − α)% observations with the smallest squared residuals. Another robust LASSO regression method is MM-LASSO [30]. Building on the S-ridge approach to high-dimensional regression model estimation in [19], MM-LASSO combines (adaptive) MM-regression and LASSO penalization to provide consistent identification of predictors and estimation of coefficients under fat-tailed distributions.
A previous relevant solution of a different nature is plug-in RLARS [14], which is the robustified version of LARS [6] where the covariance matrix of the features is robustly estimated via an adjusted Winsorization step. RLARS manages to deal with outliers by rendering them uninfluential in the iterative computation of the covariances needed to retrieve predictors. Predictor selection is then performed by iteratively selecting the best angle to update the coefficients, as in LARS, and MM regression is used as a final step.
A further approach to identify predictors and spot conditional outliers in high dimensions is to robustify elastic net regression. In this respect, a robust solution is ENET-LTS [16], which provides the trimmed version of the elastic net. In [10], the penalized elastic net S-estimator PENSE and its refinement, the penalized elastic net M-estimator PENSEM, provide a consistent solution to conditional outlier detection, with the relevant advantage for PENSEM of providing the most precise τ-estimate of the residual scale.
The SNCD algorithm [39], which simultaneously minimizes a Huber or a LAD loss of the residuals and an elastic net penalty by explicitly deriving the Karush-Kuhn-Tucker (KKT) conditions of the objective function, is instead not robust to leverage outliers, even though it is scalable to both large sample sizes and high dimensions, as its computational cost is O(pn).
For this reason, we propose a robust preliminary imputation procedure for the potential predictors that renders SNCD applicable when outliers in the predictors are present. Such outliers are detected by computing robustified z-scores for each potential predictor. Points with outlying robustified z-scores are replaced by plausible values calculated by exploiting the multicollinear structure of the large cross section. Such a preliminary step has a computational cost proportional to the number of identified potential leverage outliers. Then, the relevant predictors of y are identified via SNCD, applied to the robustly imputed dataset. Finally, we apply a low-dimensional robust regression method, like MM, to perform conditional outlier detection by means of the robustly estimated residual scale. We call the entire procedure ROBOUT; it is a comprehensive methodology to spot conditional outliers in high-dimensional datasets.
ROBOUT: a comprehensive approach for conditional outlier detection

We consider the linear regression model

y = a + X_D β + ϵ,    (1)

where y is the n × 1 vector of the response variable, a is the n × 1 vector of intercepts, X_D is the n × K matrix of predictors (whose columns are indexed by D), β is the K × 1 vector of regression coefficients and ϵ is the n × 1 vector of residuals. The same regression model for the single observation i ∈ {1, ..., n} can be written as

y_i = a_i + x_{D,i}′ β + ϵ_i,    (2)

where a_i denotes the intercept and x_{D,i} denotes the K × 1 vector of predictors.
For each non-outlier index i ∉ O, we make a simplifying distributional assumption on the vector of predictors x_{D,i} and the vector of non-predictors x_{D′,i} (D′ stores the indices of non-predictor variables). For each outlier index i′ ∈ O, we still fix a_{i′} = a, but we assume the following: 1) consistently with Definition 1, conditional outliers in mean are generated by perturbing the conditional mean of y_{i′}; 2) consistently with Definition 2, conditional outliers in variance are generated by inflating the conditional variance of y_{i′}; 3) consistently with Definition 3, the vector of predictors is first generated as x_{D,i′} ∼ N(0_K, I_K), y_{i′} is then generated by model (2) as in 1) or 2), and, in the end, leverage outliers are generated by perturbing the entries of x_{D,i′} a posteriori. In all the above cases, the parameter m > 1 represents the degree of perturbation. In addition, we also allow for multicollinearity by setting COV(x_{j′}, x_{j″}) = ρ_{j′j″}, for all j′, j″ ∈ {1, ..., p}, j′ ≠ j″, where x_j is the j-th variable in the n × p matrix X, representing the available information set Ω. Such settings allow us to exemplify three outlyingness schemes for each i′ ∈ O: 1) case 1: conditional outliers in mean & leverage outliers, where the outliers are generated in y_{i′} according to Definition 1 and in the entries of x_{D,i′} according to Definition 3; 2) case 2: conditional outliers in variance & leverage outliers, where the outliers are generated in y_{i′} according to Definition 2 and in the entries of x_{D,i′} according to Definition 3; 3) case 3: conditional outliers in mean, where the outliers are only generated in y_{i′} according to Definition 1.
Let us assume model (2) as true, with E(ϵ_i | Ω) = E(ϵ_i), for each i = 1, ..., n. In order to quantify the model strength under each of the three outlyingness schemes, we specifically derive the expected perturbation in y_i (conditionally on the predictors) and the residual variance inflation factor under each case.

The expected perturbation E(y_{i′} | x_{D,i′}) − E(y_i | x_{D,i}) and the residual variance ratio V(ϵ_{i′})/V(ϵ_i) can be derived under each case: V(ϵ_{i′})/V(ϵ_i) = 1 under cases 1 and 3, and V(ϵ_{i′})/V(ϵ_i) = m² under case 2. These calculations enable us to derive the expected overall Signal-to-Noise Ratio SNR = ESS/RSS of model (2) under each of the three outlyingness schemes, where ESS, the expected Explained Sum of Squares, is constant across cases, while the formula for RSS varies under each case. We stress that ESS depends on the coefficient vector β (signs and magnitudes), the covariance matrix of the predictors, and the number of predictors K, while RSS may depend on the outlier proportion α, the outlyingness parameter m, the intercept a, the residual variance σ², the number of predictors K, the sign of the predictor values, and the coefficient vector β, according to the underlying outlyingness scheme.

Methodology
We present here our proposed conditional outlier detection method, ROBOUT, consisting of four robust steps, namely preliminary imputation, variable selection, low-dimensional regression and conditional outlier detection. We stress that ROBOUT, unlike many existing methods, separates the steps of variable selection and conditional outlier detection, because this leads to a better performance for conditional outlier identification, as shown by our simulation study. The preliminary imputation procedure is needed to make the method robust to leverage outliers in the predictors.

Preliminary imputation
Let us consider the n × p matrix of potential predictors X. First, we calculate the median and the unbiased median absolute deviation of each variable: for each j = 1, ..., p, we compute x_med,j = med(x_j) and x_mad,j = 1.4826 mad(x_j). Then, relying on x_med,j and x_mad,j, we derive a robust z-score for each entry ij of X: z_ij = (x_ij − x_med,j)/x_mad,j, i = 1, ..., n, j = 1, ..., p. In the end, we flag as outliers the entries that present an outlying z_ij, i.e. we set w_ij = 0 if |z_ij| > Φ⁻¹(0.995), where Φ⁻¹ is the inverse standard normal distribution function, and w_ij = 1 otherwise.
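As a minimal illustration of this flagging rule, the following R sketch (base R only; the helper name flag_matrix is ours) computes the robust z-scores and the flag matrix W:

```r
# Flag potential leverage outliers via robust z-scores (Section 3.2.1).
# X is the n x p matrix of potential predictors; w_ij = 0 marks a flagged entry.
flag_matrix <- function(X, cutoff = qnorm(0.995)) {
  x_med <- apply(X, 2, median)
  x_mad <- apply(X, 2, mad)                      # R's mad() already applies the 1.4826 factor
  Z <- sweep(sweep(X, 2, x_med, "-"), 2, x_mad, "/")
  (abs(Z) <= cutoff) * 1                         # 1 = clean entry, 0 = flagged entry
}
```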
Algorithm 1: Algorithm for preliminary imputation. For each entry ij, i = 1, ..., n, j = 1, ..., p, such that w_ij = 0:
1. derive the set of ordered variable indices D_j by sorting the robust correlations ρ_{j′j} (or τ_{j′j}), j′ ≠ j, in decreasing order;
2. take the first index in the ordered set D_j, denoted j_top, such that w_{i,j_top} = 1;
3. identify among all the points i′ ≠ i the set I_{i,s} of the s closest neighbours of x_{i,j_top} (in absolute norm) with w_{i′,j_top} = 1;
4. derive the subset I_{i,3} ⊆ I_{i,s} that contains the three closest neighbours of x_{i,j_top} (in absolute norm) with w_{i′,j} = 1, and impute x_ij with the median of the values x_{i′,j}, i′ ∈ I_{i,3}.

Three relevant features of Algorithm 1 need to be stressed. First, the computational cost is a direct function of Σ_{i=1}^n Σ_{j=1}^p 1(w_ij = 0), which represents the degree of dataset perturbation. Second, Algorithm 1 exploits the multicollinearity structure of a large cross section, so that a large p and a rich multicollinearity structure actually improve the imputation procedure systematically. Third, a safe choice for the number s of nearest neighbours to ensure the recovery of the set I_{i,3} used for imputation is s = max_{j=1,...,p} Σ_{i=1}^n 1(w_ij = 0) + 3, which is usually much smaller than n. The overall computational cost of the imputation step is thus proportional to O(pn log n) as p and n diverge.
In this way, the described imputation procedure is robust to the classical leverage outliers prescribed by Definition 3, it is computationally efficient, and it takes advantage of a rich multicollinearity structure. Finally, we can standardize each column of the imputed matrix X*, obtaining the standardized imputed values x**_ij used in the subsequent variable selection step.
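The sketch below illustrates Algorithm 1 under stated assumptions: it takes the flag matrix W from the previous sketch, a robust correlation matrix R among the columns of X (e.g. Spearman), and the neighbourhood size s suggested above; the helper name impute_flagged and the tie-breaking choices are ours, not the paper's.

```r
# Minimal sketch of Algorithm 1: flagged entries are replaced by the median of variable j
# over the three closest clean neighbours, found via the most correlated clean variable j_top.
impute_flagged <- function(X, W, R, s) {
  X_imp <- X
  for (j in seq_len(ncol(X))) {
    ord <- setdiff(order(R[, j], decreasing = TRUE), j)   # variables sorted by correlation with j
    for (i in which(W[, j] == 0)) {
      j_top <- ord[which(W[i, ord] == 1)[1]]              # most correlated variable, clean in row i
      if (is.na(j_top)) next
      cand <- setdiff(which(W[, j_top] == 1), i)          # rows with clean values of x_{j_top}
      nbrs <- cand[order(abs(X[cand, j_top] - X[i, j_top]))]
      nbrs <- head(nbrs, min(s, length(nbrs)))            # the s closest neighbours
      nbrs3 <- head(nbrs[W[nbrs, j] == 1], 3)             # three closest, also clean in variable j
      if (length(nbrs3) > 0) X_imp[i, j] <- median(X[nbrs3, j])
    }
  }
  X_imp
}
```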

Variable selection
Starting with our target response variable y_i, i = 1, ..., n, we aim to consistently select the relevant set of its predictors from a large set of variables under the presence of conditional outliers in mean or variance. Consequently, we aim to identify a model such as (2) from the data. Importantly, the focus of this step is on the selection of predictors rather than on the estimated coefficients and residual scale. This is a core idea of ROBOUT, i.e. that a robust predictor selection step is distinguished from the robust regression step, as this renders the method more reliable in challenging situations compared to the existing methods that combine these two steps. The option [39] that we consider is to use the objective function

min_{β_0, β_1, ..., β_p}  (1/n) Σ_{i=1}^n ρ_{H,δ}(ϵ_i) + λ Σ_{j=1}^p |β_j|,    (3)

where ϵ_i = y_i − β_0 − Σ_{j=1}^p β_j x**_ij, i = 1, ..., n, λ is a penalization parameter, and

ρ_{H,δ}(t) = t²/2 if |t| ≤ δ,   ρ_{H,δ}(t) = δ(|t| − δ/2) if |t| > δ,

is the Huber loss function, where δ is a tuning parameter. Henceforth, we call (3) the SNCD-H objective function.
SNCD-H estimates a LASSO-penalized Huber loss regression (a special case of the elastic net-penalized regression handled by SNCD), optimized by using the semismooth Newton coordinate descent algorithm presented in [39]. SNCD has a computational cost proportional to O(pn), i.e. linear in both p and n. Weighting observations is precisely what renders the results robust in the face of conditional outliers in y, because it annihilates their influence.
The other robust alternative that we consider in the simulation study substitutes ρ_{H,δ}(ϵ_i) in (3) with the absolute loss ρ_L(ϵ_i), where ρ_L(t) = |t|. We call that version the SNCD-L objective function. SNCD-L estimates the corresponding penalized absolute loss (LAD) regression, optimized in the same way. Minimization of (3) with ρ_{H,δ}(t) or ρ_L(t) is practically performed for a decreasing sequence of values λ = λ_t, t = 0, 1, ..., T, such that λ_0 = λ_max and λ_T = λ_min, where λ_max returns no predictors and λ_min returns K_max predictors. Predictor selection is performed by the adaptive version of the strong rule of [32], proposed in [39]. At each value of λ_t, that rule exploits the coefficient estimates at λ_{t−1}. The optimal λ is then selected by cross-validation, using as Out-Of-Sample (OOS) metric the loss ρ_{H,δ} for SNCD-H and the loss ρ_L for SNCD-L.
Following [11], two optimal values of λ can be selected. The value λ_0SE is the one returning the minimum OOS loss along the sequence λ_t. The value λ_1SE is the maximum value along the sequence λ_t such that the OOS metric stands within one standard error of the OOS metric estimated at λ_0SE. As is typical in such applications, the model estimated at λ_0SE is subject to overfitting, and consequently the set of retrieved predictors can be very redundant. The model estimated at λ_1SE is instead much more parsimonious, including many fewer irrelevant predictors when p/n is large. For this reason, we select the model estimated at λ_1SE, with the only recommendation that the allowed maximum number of predictors must be large enough to ensure the consistency of such a model selection process when p is large (we suggest not exceeding the upper limit value K_max = [p/2]). At the end of this step, we get K, the chosen number of predictors, and D, the corresponding set of predictor indices.
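As a rough sketch of this selection step in R, the cv.hqreg call below uses the settings reported in Section 4 (method, dfmax and type.measure); the coefficient-extraction line is an assumption about the package interface and may need adapting, for instance by applying the λ_1SE rule described above to the cross-validation error curve manually.

```r
# Variable selection via cross-validated Huber-loss regression with a LASSO-type penalty.
# X_star is the standardized imputed matrix (the x** values), y the target variable.
library(hqreg)
set.seed(1)
cv_fit <- cv.hqreg(X_star, y, method = "huber", dfmax = 10, type.measure = "deviance")
beta_hat <- coef(cv_fit)             # assumed: coefficients at the cross-validated lambda
D_hat <- which(beta_hat[-1] != 0)    # indices of the selected predictors (intercept dropped)
```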

Robust regression
In this step, we apply a robust regression method to the recovered predictors stored in the matrix X_D, in order to robustly estimate coefficients, residuals, and the residual scale. To this aim, we utilise MM regression [37]. In the simulation study, we also test the performance of least trimmed squares (LTS) [23]. Concerning computational cost, we know that both LTS and MM share a burden of O(n) operations, due to the use of Fast-LTS [24] for LTS and of Fast-S [27] for the initial scale in MM. At the end of this step, we get the estimated K × 1 vector of coefficients β, the estimated n × 1 vector of residuals ϵ, and the residual scale estimate σ.
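A minimal sketch of this step with robustbase (package defaults, as in Section 4), assuming X_star and D_hat from the previous sketch:

```r
# MM regression on the selected predictors and extraction of the robust residual scale.
library(robustbase)
X_D <- X_star[, D_hat, drop = FALSE]
mm_fit <- lmrob(y ~ ., data = data.frame(y = y, X_D))
beta_hat  <- coef(mm_fit)            # robust coefficient estimates
res_hat   <- residuals(mm_fit)       # robust residuals
sigma_hat <- mm_fit$scale            # robust residual scale estimate
```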

Outlier detection
As a last step, we recover the set O of conditional outlier indices as the set of all the points i ∈ {1, ..., n} with robustly rescaled residuals ϵ_i/σ larger than Φ⁻¹(0.995) in absolute value.
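Using the objects from the previous sketch, this flagging rule amounts to a single line:

```r
# Conditional outliers: robustly rescaled residuals beyond the 99.5% standard normal quantile.
O_hat <- which(abs(res_hat / sigma_hat) > qnorm(0.995))
```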

A comparative simulation study
In this section, we test the four possible variants of the ROBOUT methodology based on the different variable selection and robust regression estimation options presented in Section 3.2. In the first step, either the SNCD-H or the SNCD-L objective function can be used, while in the second step either LTS or MM can be used to estimate the regression equation. We refer to the four ensuing versions of ROBOUT as H+LTS, L+LTS, H+MM and L+MM, where H and L refer to SNCD-H and SNCD-L, respectively.
Our simulation study aims both to compare ROBOUT to competitor methods and to identify the optimal design of ROBOUT with respect to its constituent components. The competitor methods against which the ROBOUT versions are tested are SPARSE-LTS and RLARS. Calculations are performed in R (RStudio). We set the following parameters for each method used:
- SNCD-H and SNCD-L are run via the function cv.hqreg of package hqreg with method='huber' or method='quantile', respectively, with a maximum number of predictors dfmax equal to 10, and type.measure='dev' to select the number of predictors as in Section 3.2.2;
- as an input to the Fast-LTS algorithm in the robust regression step, the trimming parameter α of the ltsReg function in the robustbase package is set equal to the default α = 0.25; alternatively, the function lmrob in the robustbase package estimates MM regression by the default Tukey's bisquare loss with 95% asymptotic efficiency for normal errors;
- the competitor method SPARSE-LTS is run by the function sparseLTS of the robustHD package with mode 'fraction', setting a vector of 10 fractions, 0.05f, f = 1, ..., 10, with the default trimming parameter α = 0.25, the option nsamp = c(200, 5), and the prediction error as cross-validation criterion for predictor selection (crit='PE');
- the plug-in version of the competitor method RLARS is performed by the function rlars of package robustHD with crit='PE' and a maximum number of predictors sMax equal to 10.
All the methods are run with the default preprocessing step for all the potential predictors: standardization for SNCD, unit-norm normalization for SPARSE-LTS, robust standardization for RLARS. The intercept is included for all estimations.
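For reference, a rough sketch of the corresponding calls for the LTS step and the two competitors, using the argument values listed above (the trimming is parameterized as coverage in these functions, so 25% trimming is expressed as 75% coverage; X, y and X_D denote the simulated design, response and selected columns; package versions may differ):

```r
# Competitor and LTS calls with the settings reported in this section.
library(robustbase)
library(robustHD)

fit_lts   <- ltsReg(X_D, y, alpha = 0.75)                 # LTS: 75% coverage, i.e. 25% trimming
fit_slts  <- sparseLTS(X, y, lambda = 0.05 * (1:10), mode = "fraction",
                       alpha = 0.75, nsamp = c(200, 5), crit = "PE")
fit_rlars <- rlars(X, y, sMax = 10, crit = "PE")
```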
According to the definitions provided in Section 3.1, we generate the data from model (1) under cases 1, 2, 3, for m = 5, 9, 13, 17, 23. Each scenario is defined by the degree of multicollinearity in the matrix X and the relative size of the matrix (the p/n ratio). As regards the degree of multicollinearity, we set ρ_{j′j″} = ρ, for all j′, j″ ∈ {1, ..., p}, j′ ≠ j″, and we distinguish three cases: ρ = 0, 0.3, 0.7. As regards the relative size of X, we examine four different cases for the p/n ratio, labelled a to d. The rationale for differentiated scenarios with respect to the p/n ratio and the level of correlation ρ is that the relative dimensions and the degree of multicollinearity affect the performance of the various methods that are tested, for example by impacting the effectiveness of the selection of predictors, as explained in Section 2.
Table 1 reports how the minimum expected overall Signal-to-Noise Ratio (SNR) varies under the three outlyingness schemes and the three levels of correlation, with respect to the perturbation parameter m. We note that case 1 presents the lowest values of the minimum expected SNR, even smaller than 1 at large m. Cases 2 and 3 also present comparable values, although slightly larger.
Table 1: Minimum expected overall SNR as derived in Section 3.1 under cases 1, 2, 3, for m = 5, 9, 13, 17, 23 and ρ = 0.7, 0.3, 0, with α = 0.1.

Henceforth, we refer to scenarios by combining the number and the letter of the above cases, e.g. "scenario 1a, ρ = 0.7" refers to a scenario with case 1, case a for the p/n ratio (i.e. p = 60 and n = 300), and ρ = 0.7. Each scenario is run 100 times for each value of m, and the various metrics presented below are calculated by averaging across these 100 replications.
For each scenario, each value of the parameter m and each method, we derive for each replicate r = 1, ..., 100 the set of recovered outliers O_r, and we calculate the following performance measures for conditional outlier detection:
- the masking rate MR_r, defined as the proportion of masked outliers (i.e. false negatives) over the true number of outliers, MR_r = Σ_{i∈O} 1(i ∉ O_r) / |O|;
- the swamping rate SR_r, defined as the proportion of swamped outliers (i.e. false positives) over the number of recovered outliers, SR_r = Σ_{i∈O_r} 1(i ∉ O) / |O_r|;
- the F_1-score, defined as 2 PREC_r REC_r / (PREC_r + REC_r), where the precision is PREC_r = 1 − SR_r and the recall is REC_r = 1 − MR_r; this is an overall performance measure incorporating both masking and swamping effects.
All measures range from 0 to 1, and are averaged across the 100 replications of each scenario and value of m for each method.
Similarly, for each replicate r = 1, ..., 100 we derive the set of recovered predictors D_r, we calculate as performance measure for predictor recovery the number of masked predictors, MP_r = Σ_{j∈D} 1(j ∉ D_r), and we then derive the recovery rate of all true predictors as AP = Σ_{r=1}^{100} 1(MP_r = 0)/100.
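The following base-R sketch computes these measures for a single replicate, given the true and recovered index sets (the function names are ours):

```r
# Outlier-detection metrics (masking rate, swamping rate, F1) and predictor-recovery metric.
outlier_metrics <- function(O_true, O_hat) {
  MR <- sum(!(O_true %in% O_hat)) / length(O_true)                               # masked / true outliers
  SR <- if (length(O_hat) > 0) sum(!(O_hat %in% O_true)) / length(O_hat) else 0  # swamped / recovered
  prec <- 1 - SR
  rec  <- 1 - MR
  F1 <- if ((prec + rec) > 0) 2 * prec * rec / (prec + rec) else 0
  c(MR = MR, SR = SR, F1 = F1)
}
masked_predictors <- function(D_true, D_hat) sum(!(D_true %in% D_hat))           # MP_r
# AP is then the share of the 100 replicates with masked_predictors() equal to zero.
```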

Results
In Figure 2, we report performance measures for scenario 1a, ρ = 0.7. Under this challenging p ≪ n scenario with leverage outliers and conditional outliers in mean, we can observe that H+MM and SPARSE-LTS are the best methods concerning outlier detection, measured by the F_1 score (right panel), followed by L+MM and H+LTS, while L+LTS and RLARS are the worst. In particular, RLARS suffers considerably from a small value of m. Concerning predictor retrieval, we can observe in the left panel that SNCD-H is the best predictor selection method across m, followed by SPARSE-LTS by a small margin. SNCD-L is in third position, with a gap of around 10% from SNCD-H in the AP measure, while RLARS is last, with AP below 70%. It follows that the ROBOUT option H+MM turns out to be the best method overall.
In Figure 3, we report the results for scenario 2a, ρ = 0.7. In this case, i.e. with conditional outliers in variance rather than in mean as in the previous scenario, but still including leverage outliers, both SPARSE-LTS and RLARS perform worse than ROBOUT, both for predictor selection (left panel) and conditional outlier detection (right panel). Concerning the latter, H+LTS and H+MM are the top performers in this case, followed by L+LTS and L+MM. RLARS and SPARSE-LTS perform considerably worse instead, with values of the F_1 score around 0.9 across m. Concerning predictor retrieval (left panel), SNCD-H and SNCD-L attain values of the AP metric close to 1, while SPARSE-LTS and RLARS slide below 0.9 and are affected by large values of m.
In Figure 4, we find performance results for scenario 3a, ρ = 0.7. Under this scenario, with conditional outliers in mean and no leverage outliers, the right panel, containing the average F_1 score across m, shows that H+MM and L+MM outperform SPARSE-LTS for conditional outlier detection. This is due to the capability of MM to consistently identify the outliers even from a partial recovery of the true predictors. On the contrary, H+LTS and L+LTS perform much worse, and RLARS performs very badly, particularly at low values of m. The ROBOUT option H+MM is thus consistently the best for conditional outlier detection, even in this case.
In Figure 5, performance results are shown for scenario 1a, ρ = 0.3, i.e. we now turn our attention to scenarios with lower correlation among the variables of the information set. We observe that SNCD-L performs considerably better than in Figure 2, showing that a lower value of ρ helps predictor recovery via SNCD-L. It follows that the ROBOUT options H+MM and L+MM are overall the best performers in this case, while H+LTS and L+LTS follow. On the contrary, the performance of the competitor method SPARSE-LTS gets slightly worse than in Figure 2, both concerning the AP measure (left panel) and the F_1 score (right panel), due to a lower SNR (see Table 1). RLARS is still by far the worst performer.
In Figure 6, performance results for scenario 2a, ρ = 0.3, are reported. The ROBOUT options H+MM, L+MM and H+LTS are the best concerning the F_1 score across m, followed by RLARS, which performs markedly better than in Figure 3. Moreover, SNCD-H is still the most consistent performer for predictor recovery, especially at large m. The performance of SPARSE-LTS is instead very poor both for predictor recovery and outlier detection, even worse than in Figure 3, due to the lower SNR compared to scenario 1a (see Table 1). In Figure 7, reporting performance results for scenario 3a, ρ = 0.3, when the information set does not present leverage outliers, the ROBOUT options H+MM and L+MM offer the best performance for the F_1 score at large m, while SPARSE-LTS is still the best performer for the AP measure, as in Figure 4. Instead, H+LTS and L+LTS perform worse than SPARSE-LTS by some margin. At the same time, SNCD-H and SNCD-L perform acceptably concerning the AP measure, standing around 0.9 and 0.8 at m = 23, respectively. RLARS is still not competitive in this scenario.

Proceeding with our analysis, we focus on scenario 1b, ρ = 0.7. In Figure 8, we find that all ROBOUT options work almost perfectly, both for predictor recovery and outlier detection. A slight ineffectiveness is recorded for RLARS concerning predictor recovery, and for SPARSE-LTS concerning outlier detection, particularly at small values of m. Under scenario 2b, ρ = 0.7 (see Figure 9), all methods but SPARSE-LTS perform really well, with the performance of SPARSE-LTS being considerably better than in Figure 3. Under scenario 3b, ρ = 0.7, all methods perform very well, with values close to 1 both for the AP metric and the F_1 score across m. Under scenario 1b, ρ = 0.3, all methods show a perfect behaviour. Under scenario 2b, ρ = 0.3, only RLARS shows some slight decrease in the AP metric as m increases, but this does not affect outlier detection. Under scenario 3b, ρ = 0.3, performance is again optimal for all methods, except for the average F_1 score of SPARSE-LTS at m = 5.
Concerning scenarios 2c, ρ = 0.7, 3c, ρ = 0.7, 2c, ρ = 0.3, and 3c, ρ = 0.3, no method presents any issue. Under scenarios 1c, ρ = 0.7 (see Figure 10) and 1c, ρ = 0.3, some ineffectiveness is shown only by SPARSE-LTS at small values of m. The same performance patterns appear under scenarios 1d, ρ = 0.7 (see Figure 11) and 1d, ρ = 0.3. When the underlying correlation parameter ρ is equal to 0, ROBOUT, specifically tailored to tackle high-dimensional situations with a large p/n, is expected to perform less effectively, because leverage outliers are more difficult to impute correctly and the model strength is lower. However, looking at the results of scenario 1a, ρ = 0 (see Figure 12), we may observe that, although predictor recovery suffers compared to SPARSE-LTS and RLARS, H+MM and L+MM still offer a reliable conditional outlier detection, comparable to that of SPARSE-LTS, with RLARS having the best performance and H+LTS and L+LTS failing. Under scenario 2a, ρ = 0 (Figure 13), where the degree of model perturbation is lower, all the options of ROBOUT perform very well, better than the competitors. Under scenario 1b, ρ = 0 (Figure 14), SPARSE-LTS is inconsistent for conditional outlier detection, while the other methods perform optimally. Under scenario 2b, ρ = 0 (Figure 15), only RLARS shows some ineffectiveness at large m, both concerning predictor recovery and outlier detection.
Under scenario 3a, ρ = 0 (Figure 16), although predictor recovery for ROBOUT is not accurate, as in Figure 12, H+MM and L+MM are still the best performers for conditional outlier detection, similarly to SPARSE-LTS and better than RLARS, while H+LTS and L+LTS perform badly. This shows again the resilience of MM regression even under a partial recovery of predictors. Under scenario 3b, ρ = 0 (Figure 17), all performance patterns improve, with H+LTS and L+LTS showing acceptable F_1 scores, and H+MM and L+MM dominating with a large gap compared to SPARSE-LTS.

Discussion
In this section, we discuss how the relevant simulation parameters impact the statistical performance of the tested methods. A large correlation level ρ has a positive impact on ROBOUT, due to the preliminary imputation procedure and the need for a strong SNR for predictor selection. For this reason, SNCD-H and SNCD-L suffer from the absence of correlation among the dataset variables. SPARSE-LTS is not much affected by a large ρ, unlike RLARS, which is significantly and negatively impacted by a large ρ.
A large p/n ratio has a remarkably negative impact on SNCD-L, and a slightly negative impact on SNCD-H, as predictor selection methods. Instead, a large p/n positively impacts the results obtained by SPARSE-LTS, which suffers from a large n, particularly under conditional outliers in variance.
Importantly, RLARS performs quite badly under all scenarios of type a, i.e. featuring p/n = 5.
In this respect, we need to stress the impact of the maximum allowed number of predictors K_max on the performance of all procedures. A larger K_max obviously increases the chance to include all relevant predictors when p is large, but this comes at the expense of the inclusion of irrelevant predictors in the model and a larger computational cost. These drawbacks particularly impact RLARS, but the other methods are also markedly affected, particularly when the SNR is low. It follows that a larger K_max positively impacts conditional outlier detection performance, although at the price of model redundancy.
Considering the outlyingness scheme, we stress that SNCD-H and SNCD-L slightly suffer from the absence of leverage outliers, while SPARSE-LTS particularly suffers from the presence of conditional outliers in variance, irrespective of the value of ρ.
Noticeably, we stress that MM, unlike LTS, is resilient to conditional outliers in mean (with or without leverage outliers) under both very large and very small ρ and a large p/n ratio, thus being resistant to a small model strength or to possible inaccuracies in the predictor selection.
All in all, SNCD-H+MM appears to be the ROBOUT combination that guarantees a reliable performance for conditional outlier detection under all tested scenarios. In particular, SNCD-H+MM is the most reliable when conditional outliers in mean or variance are present together with leverage outliers, and when the dataset is multicollinear with more variables than observations, which is the most challenging and typical case in high dimensions.

A real banking supervisory data example
In this section, we apply the proposed conditional outlier detection procedure, ROBOUT, to a real dataset that contains granular data on the activities of the largest euro area banks, on both the asset and the liability sides of their balance sheets. These data are submitted by the euro area banks to the European Central Bank in the context of their supervisory reporting requirements. The dataset in question is compositional, i.e. it includes some parent categories, like 'Debt', and their sub-categories, like 'Debt versus other banks', 'Debt versus central government', etc. In addition, the dataset is very sparse, as not all banks are engaged in all the activities spanned by the granular set of variables. The reference date of the data is end-2014.
Since the original scale of the variables is in the order of billions, we apply a logarithmic transformation to each entry x_ij of the data matrix X (referring to cost or loss items), where i = 1, ..., 365 indexes banks and j = 1, ..., 771 indexes variables. This preliminary step is needed to make the data interpretable and to symmetrize the distribution shapes.
Then, we apply a variable screening based on the robust Spearman correlation: we derive the list of all the variable pairs with Spearman correlation exceeding 0.8, and we exclude from the dataset the variable with the smaller index. At the end of this procedure, 361 variables survive. We then find that two rows present the same values on the retained 361 variables. We drop the one with the smaller index, and we get a data matrix with n = 364 observations and p = 361 variables.
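A base-R sketch of this screening rule (X_log denotes the log-transformed data matrix; the 0.8 threshold and the "drop the smaller index" rule follow the text above):

```r
# Spearman-correlation screening: for each pair with correlation above 0.8,
# drop the variable with the smaller column index.
rho <- cor(X_log, method = "spearman")
p <- ncol(X_log)
drop_idx <- integer(0)
for (j1 in seq_len(p - 1)) {
  for (j2 in (j1 + 1):p) {
    if (rho[j1, j2] > 0.8) drop_idx <- c(drop_idx, j1)   # exclude the smaller-index variable
  }
}
X_screened <- X_log[, setdiff(seq_len(p), unique(drop_idx)), drop = FALSE]
```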
Our final data matrix presents 79.05% zero entries, 18.86% positive entries, and 2.08% negative entries. The retained variables show a mean Spearman correlation of 0.0896 and a mean absolute Spearman correlation of 0.1711. The heat map of the Spearman correlation matrix (Figure 18) displays the presence of a very rich multicollinearity structure, whose pattern matches the order of the dataset variables. A rich negative correlation is especially present among the variables related to loans and receivables.
Importantly, the sample covariance matrix presents 17 eigenvalues below the numerical tolerance level (2.2 × 10⁻¹⁶). This is the reason why SPARSE-LTS cannot be applied to these data, since it requires the input data matrix to be far from rank deficiency. We set the log of the banks' size as the variable on which we would like to identify outlying observations. The distribution of the log of the banks' size shows that the log-normality assumption on total assets cannot be rejected; the p-value of the Jarque-Bera test is in fact 46.72%. The fact that the size follows a log-normal distribution is intuitive, given the high variance, due to the existence of both large and small banks, reinforced by the size dispersion of home countries in the sample (e.g. with respect to their GDP), and the non-negativity of the size variable. The mean and the median of the log-size are almost equal (23.04 vs 23.14), while the standard deviation is slightly larger than the rescaled MAD (2.14 vs 1.73).
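A normality check of this kind can be reproduced, for instance, with the Jarque-Bera implementation in the tseries package (log_size denotes the vector of log total assets):

```r
# Jarque-Bera test: a large p-value is consistent with log-normality of total assets.
library(tseries)
jarque.bera.test(log_size)
```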
We run the four variants of the ROBOUT method and RLARS on the final data matrix, with the maximum number of predictors set to 15. The histogram of the rates of identified leverage outliers (flagged by the procedure in Section 3.2.1) across the 360 potential predictors is reported in Figure 19. Even if approximately one third of the potential predictors presents no leverage outliers, 36 variables out of 360 present more than 6 flagged leverage outliers (out of 364 observations). Predictor selection via SNCD-H as in Section 3.2.2 identifies 12 predictors, while via SNCD-L 14 predictors are identified, which constitute a superset of the 12 predictors identified by SNCD-H. Relying on the simulation study of Section 4, we follow the results of SNCD-H, although the two predictor sets are very similar. RLARS identifies instead 15 predictors, which differ markedly from the SNCD-H and SNCD-L ones.
Subsequently, we run MM regression with the predictors identified by RLARS, SNCD-H and SNCD-L, respectively. In Figure 20, we report two relevant diagnostic plots for RLARS+MM. We can note that the predictor set of RLARS does not allow the left tail of the empirical distribution of the log-size to be modelled properly. On the contrary, the same diagnostic plots for H+MM, reported in Figure 21, show that the H+MM option does not present this issue. The same can be stated for the L+MM option.

Table 2 reports the results of the MM regression estimated on the predictors retrieved by SNCD-H. In addition, Table 3 contains the descriptions of the identified predictors. The set of predictors contains variables from both the banks' asset and liability sides. The broad variable 'loans and receivables' (i.e. without reference to the type of counterpart, such as firms, households, etc.), together with securities mainly from the government sector, represents the predictors from the asset side. Furthermore, liability items reflecting banks' core financial intermediation function (i.e. channelling savings to investments) are identified as predictors, such as deposits and issued debt. The MM regression estimated on the predictors retrieved by SNCD-L returns a similar output.

The outliers recovered by the H+MM option are 15, against the 14 outliers recovered by L+MM; 11 elements of the two outlier sets are in common. Importantly, both options recover two strongly negative outliers, which present values of total assets with a wrong scale (1000 times smaller). The two additional ROBOUT variants, namely H+LTS and L+LTS, recover 19 and 18 outliers, respectively; 12 outliers recovered by H+LTS are also recovered by H+MM, and 11 outliers recovered by L+LTS are also recovered by L+MM. The two strongly negative outliers are always present. On the contrary, RLARS recovers 17 outliers, which do not include the two strongly negative ones, which we consider an indication that it performs worse than the ROBOUT variants. Based on the simulation study in Section 4, which showed that H+MM is reliable and parsimonious under various setups, we concentrate in the remainder of this section on the H+MM results.
The resulting 364 × 12 matrix of relevant predictors presents 44% zero elements and 2 zero rows out of 364. The heat map of the corresponding Spearman correlation matrix shows that all predictors are strongly and positively correlated, with an average correlation of 0.54 (see Figure 22). In Figure 23, we can observe that the chosen approach recovers the two strongly negative outliers in the left tail of the log-size distribution. Apart from those two, only one other recovered outlier presents a negative residual, while the remaining 12 present positive residuals. As we can see, there is no specific pattern in the log-size of the recovered outliers, because they are conditional outliers with respect to the recovered predictors; therefore, the combined set of values of the log-size variable and the predictors should be considered to understand why these log-size values are labelled as outliers.
In general, the recovered conditional outliers are banks with a too large value of total assets compared to the associated values of loans, deposits, or debt securities. This fact can be appreciated in Table 4, showing that the proportion of zero values recorded for each of the predictors in the set of recovered outliers is markedly larger than in the set of non-outliers. This finding characterizes the nature of the recovered outliers, as they are mainly idiosyncratic banks, like public financing institutions, branches of investment banks, and clearing houses.


Concluding remarks

In a comprehensive simulation study, we have tested perturbation conditions including conditional outliers in mean or in variance in the response variable, multicollinearity and leverage outliers in the predictors, and we have considered cases where the dimension is both (much) smaller and (much) larger than the sample size. The simulation study shows that, when the ROBOUT performance is compared to that of relevant competitors such as RLARS and SPARSE-LTS, the option SNCD-H+MM is overall the most resilient for what concerns predictor recovery and conditional outlier detection across all tested scenarios.
After the validation study, ROBOUT has been applied to a large granular banking dataset containing several balance sheet indicators for euro area banks. In this way, we are able to robustly model the log-size of euro area banks, and to identify the banks presenting anomalous values of total assets with respect to its most relevant predictors, pertaining to loans, deposits and bank securities. The recovered outliers constitute a set of idiosyncratic banks with respect to the textbook prototype of a traditional bank, pointing to public financing institutions, local branches of investment banks, and clearing houses. ROBOUT may thus be a useful tool for bank supervisors, who need to spot hidden anomalies in raw balance sheet data.

Fig. 18: Heat map of the Spearman correlation matrix estimated on the 364 × 361 final data matrix. Lighter colours correspond to larger correlation values.

Fig. 19: Histogram of the rates of identified leverage outliers by the procedure in Section 3.2.1 across the 360 potential predictors.

Fig. 20: Diagnostic plots for the MM regression applied on the predictors retrieved by RLARS.

Fig. 21: Diagnostic plots for the MM regression applied on the predictors retrieved by SNCD-H.

Fig. 22: Heat map of the Spearman correlation matrix estimated on the predictors recovered by SNCD-H. Lighter colours correspond to larger correlation values.
Fig. 2: Average AP measure (left panel) and F_1 score (right panel) of all considered methods under scenario 1a, ρ = 0.7.

Table 2: Output of the MM regression applied on the predictors retrieved by SNCD-H. The response variable is the log-size of the banks. The variable legend is reported in Table 3.

Table 3: Variable legend for the MM output reported in Table 2.

V1  Loans and receivables. Loans and advances. Carrying amount.
V2  Debt securities. General governments. Incurred but not reported losses.
V3  Debt securities. Non-financial corporations. Carrying amount.
V4  Held to maturity investment. Carrying amount.
V5  Deposits. General governments. Designated at fair value through profit and loss.
V6  Deposits. General governments. Deposits redeemable at notice. Fair value.
V7  Deposits. Non-financial corporations. Repurchase agreements. Held for trading.
V8  Households. Deposits redeemable at notice. Held for trading.
V9  Debt securities issued. Amount of cumulative change.
V10 Debt securities issued. Amount contractually required to pay at maturity.
V11 Non-performing loans. Nominal amount.
V12 Derivatives for trading. Credit spread option. Notional amount. Sold.