Robust Bayesian regression with the forward search: theory and data analysis
Abstract
The frequentist forward search yields a flexible and informative form of robust regression. The device of fictitious observations provides a natural way to include prior information in the search. However, this extension is not straightforward, requiring weighted regression. Bayesian versions of forward plots are used to exhibit the presence of multiple outliers in a data set from banking with 1903 observations and nine explanatory variables which shows, in this case, the clear advantages from including prior information in the forward search. Use of observation weights from frequentist robust regression is shown to provide a simple general method for robust Bayesian regression.
Keywords
Consistency factor; Fictitious observation; Forward search; Graphical methods; Outliers; Weighted regression
Mathematics Subject Classification
62F15 62F35 62J05 65C60 68U20
1 Introduction
Frequentist methods for robust regression are increasingly studied and applied. The foundations of robust statistical methods are presented in the books of Hampel et al. (1986), of Maronna et al. (2006) and of Huber and Ronchetti (2009). Book length treatments of robust regression include Rousseeuw and Leroy (1987) and Atkinson and Riani (2000). However, none of these methods includes prior information; they can all be thought of as robust developments of least squares. The present paper describes a procedure for robust regression incorporating prior information, determines its properties and illustrates its use in the analysis of a dataset with 1903 observations.
1. Hard (0,1) Trimming. In Least Trimmed Squares (LTS: Hampel 1975; Rousseeuw 1984) the amount of trimming of the n observations when the number of parameters in the full-rank model is p, is determined by the choice of the trimming parameter h, \([n/2]+ [(p+1)/2] \le h \le n\), which is specified in advance. The LTS estimate is intended to minimize the sum of squares of the residuals of h observations. In Least Median of Squares (LMS: Rousseeuw 1984) the estimate minimizes the median of h squared residuals.
2. Adaptive Hard Trimming. In the Forward Search (FS), the observations are again hard trimmed, but the value of h is determined by the data, being found adaptively by the search. Data analysis starts from a very robust fit to a few, carefully selected, observations found by LMS or LTS with the minimum value of h. The number of observations used in fitting then increases until all are included.
3. Soft Trimming (downweighting). M estimation and derived methods (Huber and Ronchetti 2009). The intention is that observations near the centre of the distribution retain their value, but the function \(\rho \), which determines the form of trimming, ensures that increasingly remote observations have a weight that decreases with distance from the centre.
As we describe in more detail in the next section, the FS uses least squares to fit the model to subsets of m observations, chosen to have the m smallest squared residuals, the subset size increasing during the search. The results of the FS are typically presented through a forward plot of quantities of interest as a function of m. As a result, it is possible to connect individual observations with changes in residuals and parameter estimates, thus identifying outliers and systematic failures of the fitted model. (See Atkinson et al. (2010) for a general survey of the FS, with discussion). In addition, since the method is based on the repeated use of least squares, it is relatively straightforward to introduce prior information into the search.
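The frequentist search just described can be sketched in a few lines. The following is a minimal illustration, not the FSDA implementation: the function name `forward_search` is hypothetical, and the starting subset is found by a crude random-subset approximation to LMS, which is an assumption made only to keep the example short.

```python
import numpy as np

def forward_search(X, y, n_start_trials=200, seed=0):
    """Minimal frequentist forward search: returns beta(m) and the subsets S(m)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Assumed stand-in for an LMS start: the best of several random p-subsets,
    # judged by the median squared residual over all n observations.
    best, subset = np.inf, None
    for _ in range(n_start_trials):
        idx = rng.choice(n, size=p, replace=False)
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        crit = np.median((y - X @ beta) ** 2)
        if crit < best:
            best, subset = crit, idx
    subsets, betas = [np.sort(subset)], []
    for m in range(p, n + 1):
        # Least squares fit to the current subset of m observations.
        beta, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
        betas.append(beta)
        if m < n:
            # Next subset: the m + 1 observations with smallest squared residuals.
            r2 = (y - X @ beta) ** 2
            subset = np.argsort(r2)[: m + 1]
            subsets.append(np.sort(subset))
    return np.array(betas), subsets
```

Plotting the columns of the returned estimates against m gives a simple forward plot; observations that appear only in the last few subsets are the candidates for outliers.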
Whichever of the three forms of robust regression given above is used, the aim in outlier detection is to obtain a “clean” set of data providing estimates of the parameters uncorrupted by any outliers. Inclusion of outlying observations in the data subset used for parameter estimation can yield biased estimates of the parameters, making the outliers seem less remote, a phenomenon called “masking”. The FS avoids masking by the use, for as large a value of m as possible, of observations believed not to be outlying. The complementary “backward” procedure starts with diagnostic measures calculated from all the data and then deletes the most outlying. The procedure continues until no further outliers are identified. Such procedures, described in the books of Cook and Weisberg (1982) and Atkinson (1985), are prone to the effect of masking. Illustrations of this effect for several different models are in Atkinson and Riani (2000) and demonstrate the failure of the method to identify outliers. The Bayesian outlier detection methods of West (1984) and Chaloner and Brant (1988) start from parameter estimates from the full sample and so can also be expected to suffer from masking.
Although it is straightforward to introduce prior information into the FS, an interesting technical problem arises in estimation of the error variance \(\sigma ^2\). Since the sample estimate in the frequentist search comes from a set of order statistics of the residuals, the estimate of \(\sigma ^2\) has to be rescaled. In the Bayesian search, we need to combine a prior estimate with one obtained from such a set of order statistics from the subsample of observations. This estimate has likewise to be rescaled before being combined with the prior estimate of \(\sigma ^2\); parameter estimation then uses weighted least squares. A similar calculation could be used to provide a version of least trimmed squares (Rousseeuw 1984) that incorporates prior information. Our focus throughout is on linear regression, but our technique of representing prior information by fictitious observations can readily be extended to more complicated models such as those based on ordinal regression described in Croux et al. (2013) or for sparse regression (Hoffmann et al. 2015).
The paper is structured as follows. Notation and parameter estimation for Bayesian regression are introduced in Sect. 2. Section 2.3 describes the introduction into the FS of prior information in the form of fictitious observations, leading to a form of weighted least squares which is central to our algorithm. We describe the Bayesian FS in Sect. 3 and, in Sect. 4, use forward plots to elucidate the change in properties of the search with variation of the amount of prior information. The example, in Sect. 5, shows the effect of the outliers on parameter estimation and a strong contrast with the frequentist analysis which indicated over twelve times as many outliers. In Sect. 6 a comparison of the forward search with a weighted likelihood procedure (Agostinelli 2001) leads to a general method for the extension of robust frequentist regression to include prior information. A simulation study in Sect. 7 compares the power of frequentist and Bayesian procedures, both when the prior specification is correct and when it is not. The paper concludes with a more general discussion in Sect. 8.
2 Parameter estimation
2.1 No prior information
We first, to establish notation, consider parameter estimation in the absence of prior information, that is least squares.
In the regression model \(y = X\beta + \varepsilon \), y is the \(n \times 1\) vector of responses, X is an \(n \times p\) full-rank matrix of known constants, with ith row \(x_i^{\mathrm {T}}\), and \(\beta \) is a vector of p unknown parameters. The normal theory assumptions are that the errors \(\varepsilon _i\) are i.i.d. \(N(0,\sigma ^2)\).
The least squares estimator of \(\beta \) is \(\hat{\beta }\). Then the vector of n least squares residuals is \( e = y - \hat{y} = y - X\hat{\beta } = (I - H)y\), where \(H = X(X^{\mathrm {T}}X)^{-1}X^{\mathrm {T}}\) is the ‘hat’ matrix, with diagonal elements \(h_i\) and off-diagonal elements \(h_{ij}\). The residual mean square estimator of \(\sigma ^2\) is \( s^2 = e^{\mathrm {T}}e/(n-p) = \sum _{i=1}^n e_i^2/(n-p). \)
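As a small numerical illustration of this notation, the sketch below computes \(\hat{\beta }\), the residuals e, the leverages \(h_i\) and \(s^2\). The function name and the simulated data are ours and purely illustrative.

```python
import numpy as np

def least_squares_summary(X, y):
    """Return beta_hat, residuals e, leverages h_i and s^2 for y = X beta + eps."""
    n, p = X.shape
    # Solve the least squares problem without forming (X^T X)^{-1} explicitly.
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta_hat              # e = y - y_hat = (I - H) y
    # The leverages are the diagonal of H = X (X^T X)^{-1} X^T; with the
    # reduced QR factorisation X = QR, H = Q Q^T, so h_i is a row sum of Q^2.
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q ** 2, axis=1)
    s2 = e @ e / (n - p)              # residual mean square
    return beta_hat, e, h, s2

# Illustrative data: n = 50 observations, p = 3 parameters.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.standard_normal((50, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.standard_normal(50)
beta_hat, e, h, s2 = least_squares_summary(X, y)
```

Two standard checks are that the leverages sum to p and that the residuals are orthogonal to the columns of X.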
2.2 The normal inverse-gamma prior distribution
We represent prior information using the conjugate prior for the normal theory regression model leading to a normal prior distribution for \(\beta \) and an inverse-gamma distribution for \(\sigma ^2\).
2.3 Prior distribution from fictitious observations
The device of fictitious prior observations provides a convenient representation of this conjugate prior information. We follow, for example, Chaloner and Brant (1988), who are interested in outlier detection, and describe the parameter values of these prior distributions in terms of \(n_0\) fictitious observations.
Prior information for the linear model is given as the scaled information matrix \( R= X_0^{\mathrm {T}}X_0\) and the prior mean \(\hat{\beta }_0 = R^{-1}X_0^{\mathrm {T}}y_0\). Then \( S_0 = y_0^{\mathrm {T}}y_0 - \hat{\beta }_0^{\mathrm {T}} R \hat{\beta }_0.\) Thus, given \(n_0\) prior observations the parameters for the normal inverse-gamma prior may readily be calculated.
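The three quantities R, \(\hat{\beta }_0\) and \(S_0\) follow directly from the fictitious observations, as the following sketch shows. The function name `prior_from_fictitious` is hypothetical; the mapping from these quantities to the remaining parameters of the normal inverse-gamma prior is not reproduced here.

```python
import numpy as np

def prior_from_fictitious(X0, y0):
    """Compute R, beta_0 and S_0 from n_0 fictitious observations (X0, y0)."""
    R = X0.T @ X0                              # scaled information matrix
    beta0 = np.linalg.solve(R, X0.T @ y0)      # prior mean R^{-1} X0^T y0
    S0 = y0 @ y0 - beta0 @ R @ beta0           # y0^T y0 - beta0^T R beta0
    return R, beta0, S0
```

Note that \(S_0\) is the residual sum of squares of the fictitious observations about their own least squares fit, which provides a convenient numerical check.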
2.4 Posterior distributions
3 The Bayesian search
3.1 Parameter estimation
The posterior distributions of Sect. 2.4 arise from the combination of \(n_0\) prior observations, perhaps fictitious, and the n actual observations. In the FS we combine the \(n_0\) prior observations with a carefully selected m out of the n observations. The search proceeds from \(m = 0\), when the fictitious observations provide the parameter values for all n residuals from the data. It then continues with the fictitious observations always included amongst those used for parameter estimation; their residuals are ignored in the selection of successive subsets.
Since, during the forward search, n in (3) is replaced by the subset size m, y and X in (4) become \(y(m)/\sqrt{c(m,n)}\) and \(X(m)/\sqrt{c(m,n)}\), giving rise to posterior values \(a_1(m)\), \(b_1(m)\), \(\tau _1(m)\) and \({\hat{\sigma }}^2_1(m)\).
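The rescaling can be illustrated for the posterior mean of \(\beta \) alone. The sketch below rests on two assumptions that are not taken from the expressions of this section: that \(c(m,n)\) is the variance of a standard normal variable truncated to its central m/n proportion (the form associated with Tallis 1963), and that the posterior mean is the least squares fit to the fictitious observations stacked on the rescaled subset. The function names are hypothetical.

```python
from statistics import NormalDist
import numpy as np

def consistency_factor(m, n):
    """Assumed form of c(m, n): variance of N(0,1) truncated to its central m/n."""
    if m >= n:
        return 1.0                      # no truncation at the end of the search
    nd = NormalDist()
    nu = nd.inv_cdf(0.5 * (1.0 + m / n))
    return 1.0 - 2.0 * (n / m) * nu * nd.pdf(nu)

def posterior_mean_subset(X0, y0, Xm, ym, n):
    """Posterior mean of beta from n_0 prior observations and a subset of size m."""
    m = Xm.shape[0]
    if m == 0:
        # Start of the search: only the fictitious observations are used.
        c, X_aug, y_aug = 1.0, X0, y0
    else:
        c = consistency_factor(m, n)
        # Rescale the subset by 1/sqrt(c); the prior observations are unscaled.
        X_aug = np.vstack([X0, Xm / np.sqrt(c)])
        y_aug = np.concatenate([y0, ym / np.sqrt(c)])
    beta1, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    return beta1, c
```

Under these assumptions the fit is a weighted least squares fit in which each subset observation has weight \(1/c(m,n)\) and each prior observation has weight one.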
3.2 Forward highest posterior density intervals
3.3 Outlier detection
3.4 Envelopes and multiple testing
A Bayesian FS through the data provides a set of n absolute minimum deletion residuals. We require the null pointwise distribution of this set of values and find, for each value of m, a numerical estimate of, for example, the 99% quantile of the distribution of \(r_{\mathrm{imin}}(m)\).
When used as the boundary of critical regions for outlier testing, these envelopes have a pointwise size of 1%. Performing n tests of outlyingness of this size leads to a procedure for the whole sample which has a size much greater than the pointwise size. In order to obtain a procedure with a 1% samplewise size, we require a rule which allows for the simple behaviour in which a few outliers enter at the end of the search and the more complicated behaviour when there are many outliers which may be apparent away from the end of the search. However, at the end of the search such outliers may be masked and not evident. Our chosen rule achieves this by using exceedances of several envelopes to give a “signal” that outliers may be present.
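The inflation of size under repeated testing can be seen from a simple calculation. The sketch below assumes independent tests, which the successive tests in the search are not, so it indicates only the order of magnitude of the effect; the function name is ours.

```python
def samplewise_size(pointwise_size, n_tests):
    """Probability of at least one exceedance in n_tests independent tests."""
    return 1.0 - (1.0 - pointwise_size) ** n_tests

# With a pointwise size of 1%, one hundred independent tests already give
# a samplewise size of about 63%.
size_100 = samplewise_size(0.01, 100)
```

This is why a rule based on exceedances of several envelopes, rather than a single pointwise test, is needed to control the samplewise size.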
In cases of appreciable contamination, the signal may occur too early, indicating an excessive number of outliers. This happens because of the way in which the envelopes increase towards the end of the search. Accordingly, we check the sample size indicated by the signal for outliers and then increase it, checking the 99% envelope for outliers as the value of n increases, a process known as resuperimposition. The notation \(r_{\mathrm{min}}(m,n)\) indicates the dependence of this process on a series of values of n.
In the next section, where interest is in envelopes over the whole search, we find selected percentage points of the null distribution of \(r_{\mathrm{imin}}(m)\) by simulation. However, in the data analyses of Sect. 5 the focus is on the detection of outliers in the second half of the search. Here we use a procedure derived from the distribution of order statistics to calculate the envelopes for the many values of \(r_{{\mathrm{min}}}(m,n)\) required in the resuperimposition of envelopes. Further details of the algorithm and its application to the frequentist analysis of multivariate data are in Riani et al. (2009).
4 Prior information and simulation envelopes
The panels of Fig. 3 are for similar simulations, but now with \(n_0\) and n both 500. The main differences from Fig. 2 are that the widths of the bands now decrease only slightly with m and that the estimate of \(\sigma ^2\) is relatively close to one throughout the search; the minimum value in this simulation is 0.97.
The widths of the intervals for \(\hat{\beta }_3(m)\) depend on the information matrices. If, as here, the prior data and the observations come from the same population, the ratio of the widths of the prior band to that at the end of the search is \(\surd \{(n_0 + n - p)/(n_0 - p)\}\), here \(\surd (525/25)\), or approximately 4.58, for the results plotted in Fig. 2. In Fig. 3 the ratio is virtually \(\surd 2\). This difference is clearly reflected in the figures.
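The two ratios can be checked numerically. In the sketch below the values \(n_0 = 30\) and \(p = 5\) for Fig. 2 are assumptions, chosen only so that \(n_0 - p = 25\) and \(n_0 + n - p = 525\) with \(n = 500\); the value \(p = 5\) for Fig. 3 is likewise assumed.

```python
import math

def band_width_ratio(n0, n, p):
    """Ratio of the prior band width to that at the end of the search."""
    return math.sqrt((n0 + n - p) / (n0 - p))

ratio_fig2 = band_width_ratio(30, 500, 5)    # sqrt(525/25), about 4.58
ratio_fig3 = band_width_ratio(500, 500, 5)   # sqrt(995/495), close to sqrt(2)
```

With equal amounts of prior and sample information the ratio is close to \(\surd 2\) whatever the moderate value of p, which is why the bands in Fig. 3 narrow only slightly.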
5 Example: bank profit data
As an example of the application of the Bayesian FS, we now analyse data on the profitability to an Italian bank of customers with a variety of profiles, as measured by nine explanatory variables.

\(y_i\): annual profit or loss per customer;

\(x_{1i}\): number of products bought by the customers;

\(x_{2i}\): current account balance plus holding of bonds issued by the bank;

\(x_{3i}\): holding of investments for which the bank acted as an agent;

\(x_{4i}\): amount in deposit and savings accounts with the bank;

\(x_{5i}\): number of activities in all accounts;

\(x_{6i}\): total value of all transactions;

\(x_{7i}\): total value of debit card spending (recorded with a negative sign);

\(x_{8i}\): number of credit and debit cards;

\(x_{9i}\): total value of credit card spending.
Bank profit data: prior estimates of parameters
Parameter  \(\beta _0\)  \(\beta _1\)  \(\beta _2\)  \(\beta _3\)  \(\beta _4\)  \(\beta _5\) 
Mean  −0.5  9.1  0.001  0.0002  0.002  0.12 
Parameter  \(\beta _6\)  \(\beta _7\)  \(\beta _8\)  \(\beta _9\)  \(s_0^2\)  
Mean  0.0004  −0.0004  1.3  0.00004  10,000 
Figure 6 shows the forward plots of the HPD regions, together with 95 and 99% envelopes. The horizontal lines indicate the prior values of the parameters and the vertical line indicates the point at which outliers start to be included in the subset used for parameter estimation.
These results show very clearly the effect of the outliers. In the left-hand part of the panels and, indeed, in the earlier part of the search not included in the figure, the parameter estimates are stable, in most cases lying close to their prior values. However, inclusion of the outliers causes changes in the estimates. Some, such as \(\hat{\beta }_1(m)\), \(\hat{\beta }_3(m)\) and \(\hat{\beta }_7(m)\), move steadily in one direction. Others, such as \(\hat{\beta }_6(m)\) and \(\hat{\beta }_9(m)\), oscillate, especially towards the very end of the search. The most dramatic change is in \(\hat{\beta }_4(m)\) which goes from positive to negative as the vertical strip of outliers is included. From a banking point of view, the most interesting results are those for the two parameters with negative prior values. It might be expected that the intercept would be zero or slightly negative. But \(\hat{\beta }_7(m)\) remains positive throughout the search, thus changing understanding of the importance of \(x_7\), debit card spending. More generally important is the appreciable increase in the estimate of \(\sigma ^2\). In the figure this has been truncated, so that the stability of the estimate in the earlier part of the search is visible. However, when all observations are used in fitting, the estimate has a value of \(3.14 \times 10^{4}\), as opposed to a value close to \(1.0 \times 10^{4}\) for much of the outlier-free search. Such a large value renders inferences imprecise, with some loss of information. This shows particularly clearly in the plots of those estimates less affected by outliers, such as \(\hat{\beta }_0(m)\), \(\hat{\beta }_5(m)\) and \(\hat{\beta }_8(m)\).
The 95 and 99% HPD regions in Fig. 6 also provide information about the importance of the predictors in the model. In the absence of outliers, only the regions for \(\hat{\beta }_0(m)\), \(\hat{\beta }_8(m)\) and \(\hat{\beta }_9(m)\) include zero, so that these terms might be dropped from the model, although dropping one term might cause changes in the HPD regions for the remaining variables. The effect of the outliers is to increase the seeming importance of some other variables, such as \(x_1\) and \(x_3\). Only \(\hat{\beta }_4(m)\) shows a change of sign.
We do not make a detailed comparison with the frequentist forward search which declares 586 observations as outliers. This apparent abundance of outliers is caused by anomalously high values of some of the explanatory variables. Such high leverage points can occasionally cause misleading fluctuations in the forward search trajectory leading to early stopping. However, such behaviour can be detected by visual inspection of such plots as the frequentist version of Fig. 4. The Bayesian analysis provides a stability in the procedure which avoids an unnecessary rejection of almost one third of the data.
6 A comparison with weighted likelihood
6.1 Background
The fundamental output of a robust analysis is the weight attached to each observation. In the forward search, the adaptively calculated weights have the values 0 and 1; in the analysis of the bank profit data the weights from the forward search contain 48 zeroes.
Many other robust methods, such as MM- and S-estimation (Maronna et al. 2006), downweight observations more smoothly, resulting in weights that have values in [0,1]. As an example, we use the trimmed likelihood weights from the R package wle (Agostinelli 2001). The calculation of these robust weights, which forms a first stage of their Bayesian analysis, is described in Agostinelli and Greco (2013, §2). Incorporation of prior information forms a second stage.
6.2 Comparison of methods on the bank profit data
Observations with small robust weights are outliers. Agostinelli (2001) suggests a threshold value of 0.5. For the bank profit data, we find 46 observations with weights below 0.5, all of which are also found by the forward search. In the Bayesian analysis using (13), we use the same prior as in the forward search and obtain parameter estimates differing (apart from the last two variables) by no more than 1.3%. The maximum difference is 17%.
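The two-stage idea can be sketched as follows. This is one plausible reading rather than a reproduction of (13): the weights w from any frequentist robust fit are assumed to enter the conjugate update as observation weights, so that the posterior mean is \((R + X^{\mathrm {T}}WX)^{-1}(R\hat{\beta }_0 + X^{\mathrm {T}}Wy)\) with \(W = \mathrm{diag}(w)\). The function names are hypothetical.

```python
import numpy as np

def weighted_posterior_mean(R, beta0, X, y, w):
    """Posterior mean of beta with prior (R, beta0) and observation weights w."""
    XtW = X.T * w                    # X^T W for W = diag(w), by broadcasting
    return np.linalg.solve(R + XtW @ X, R @ beta0 + XtW @ y)

def flag_outliers(w, threshold=0.5):
    """Indices of observations whose robust weight falls below the threshold."""
    return np.flatnonzero(np.asarray(w) < threshold)
```

With weights restricted to 0 and 1 this reduces to fitting the prior observations together with the retained subset, which is the situation at the end of an outlier-free forward search.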
The agreement between the two methods is not surprising in this example, where virtually the same set of outliers is declared and the same prior distribution is used. In other examples, such as the Boston housing data (Anglin and Gençay 1996), the differences between the two analyses are greater than those for the bank profit data, but not sufficient to change any conclusions drawn from the analysis of the data. Amongst the comparisons of several methods for frequentist robust regression presented by Riani et al. (2014a), we prefer the forward search because it adds to parameter estimation the monitoring of inferential quantities during the search. As an example, Fig. 6 shows the effect of the outliers which enter towards the end of the search on the HPD regions for the parameters.
7 Power of Bayesian and frequentist procedures
We simulate normally distributed observations from a regression model with four variables and a constant (\(p = 5\)), the values of the explanatory variables having independent standard normal distributions. The simulation envelopes for the distribution of the residuals are invariant to the numerical values of \(\beta \) and \(\sigma ^2\), so we take \(\beta _0 =0\) and \(\sigma _0^2=1\). The outliers were generated by adding a constant, in the range 0.5 to 7, to a specified proportion of observations, and \(n_0\) was taken as 500. To increase the power of our comparisons, the explanatory variables were generated once for each simulation study. We calculated several measures of power, all of which gave a similar pattern. Here we present results from 10,000 simulations on the average power, that is the average proportion of contaminated observations correctly identified.
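The contamination scheme and the power measure can be sketched as below. The outlier detection step itself is not shown, and the sample size n = 100 in the example is an assumed value for illustration only; the function names are ours.

```python
import numpy as np

def contaminated_sample(n, p, shift, rate, rng):
    """Simulate y with beta = 0, sigma^2 = 1 and a shifted proportion of outliers."""
    # Constant plus p - 1 independent standard normal explanatory variables.
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
    y = rng.standard_normal(n)
    n_out = int(round(rate * n))
    out_idx = rng.choice(n, size=n_out, replace=False)
    y[out_idx] += shift              # add the constant shift to the outliers
    return X, y, np.sort(out_idx)

def proportion_detected(true_idx, flagged_idx):
    """Proportion of the contaminated observations that were flagged."""
    if len(true_idx) == 0:
        return float("nan")
    return len(np.intersect1d(true_idx, flagged_idx)) / len(true_idx)

rng = np.random.default_rng(0)
X, y, out_idx = contaminated_sample(n=100, p=5, shift=3.0, rate=0.05, rng=rng)
```

Averaging `proportion_detected` over the simulated samples, for each value of the shift, gives one point on a curve of average power.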
Figure 7 shows power curves for Bayesian and frequentist procedures and also for Bayesian procedures with incorrectly specified priors when the contamination rate is 5%. For powers from a little below 0.2 upwards, the curves do not cross. The procedure with highest power is the curve that is furthest to the left which, in the figure, is the correctly specified Bayesian procedure. The next best is the frequentist one, ignoring prior information. The central power curve is that in which the mean of \(\beta _0\) is wrongly specified as −1.5. This is the most powerful procedure for small shifts, as the incorrect prior is in the opposite direction to the positive quantity used to generate outliers. With large shifts, this effect becomes less important. For most values of average power, the curve for misspecified \(\sigma ^2\) comes next, with positive misspecification of \(\beta \) worst. Over these values, three of the four best procedures have power curves which are virtually translated horizontally. However, the curve for misspecified \(\beta \) has a rather different shape at the lower end caused by the shape of the forward envelopes for minimum deletion residuals. With \(\beta \) misspecified, the envelopes for large m sometimes lie slightly above the frequentist envelopes. The effect is to give occasional indication of outliers for relatively small values of the shift generating the outliers.
8 Discussion
Data do contain outliers. Our Bayesian analysis of the bank profit data has revealed 46 outliers out of 1903 observations. Working backwards from a full fit using single or multiple deletion statistics cannot be relied upon to detect such outliers. Robust methods are essential.
The results of Sect. 6 indicate how prior information may be introduced into a wide class of methods for robust regression. However, in this paper we have used the forward search as the method of robust regression into which to introduce prior information. There were two main reasons for this choice. One is that our comparisons with other methods of robust regression showed the superiority of the frequentist forward search in terms of power of outlier detection and the closeness of empirical power to the nominal value. A minor advantage is the absence of adjustable parameters; it is not necessary to choose trimming proportion or breakdown point a priori. A second, and very important, advantage is that the structure of the search makes clear the relationship between individual observations entering the search and changes in inferences. This is illustrated in the final part of the plots of parameter estimates and HPD regions in Fig. 6. The structure can also make evident divergencies between prior estimates and the data in the initial part of the search.
A closely related second application of the method of fictitious observations combined with the FS would be to multivariate analysis. Atkinson et al. (2018) use the frequentist FS for outlier detection and clustering of normally distributed data. The extension to the inclusion of prior information can be expected to bring the advantages of stability and inferential clarity we have seen here.
The advantage of prior information in stabilising inference in the bank profit data is impressive; as we record, the frequentist analysis found 586 outliers. Since many forms of data, for example the bank data, become available annually, statistical value is certainly added by carrying forward, from year to year, the prior information found from previous robust analyses.
Routines for the robust Bayesian regression described here are included in the FSDA toolbox downloadable from http://fsda.jrc.ec.europa.eu/ or http://www.riani.it/MATLAB. Computation for our analysis of the bank profit data took less than 10 s on a standard laptop computer. Since, from the expressions for parameter estimation and inference in Sect. 3, the order of complexity of calculation is the same as that for the frequentist forward search, guidelines for computational time can be taken from Riani et al. (2015).
References
Agostinelli C (2001) wle: A package for robust statistics using weighted likelihood. R News 1(3):32–38
Agostinelli C, Greco L (2013) A weighted strategy to handle likelihood uncertainty in Bayesian inference. Comput Stat 28:319–339
Anglin P, Gençay R (1996) Semiparametric estimation of a hedonic price function. J Appl Econ 11:633–648
Atkinson AC (1985) Plots, transformations, and regression. Oxford University Press, Oxford
Atkinson AC, Riani M (2000) Robust diagnostic regression analysis. Springer, New York
Atkinson AC, Riani M, Cerioli A (2010) The forward search: theory and data analysis (with discussion). J Korean Stat Soc 39:117–134. doi:10.1016/j.jkss.2010.02.007
Atkinson AC, Riani M, Cerioli A (2018) Cluster detection and clustering with random start forward searches. J Appl Stat (In press). doi:10.1080/02664763.2017.1310806
Chaloner K, Brant R (1988) A Bayesian approach to outlier detection and residual analysis. Biometrika 75:651–659
Cook RD, Weisberg S (1982) Residuals and influence in regression. Chapman and Hall, London
Croux C, Haesbroeck G, Ruwet C (2013) Robust estimation for ordinal regression. J Stat Plan Inference 143:1486–1499
Hampel F, Ronchetti EM, Rousseeuw P, Stahel WA (1986) Robust statistics. Wiley, New York
Hampel FR (1975) Beyond location parameters: robust concepts and methods. Bull Int Stat Inst 46:375–382
Hoffmann I, Serneels S, Filzmoser P, Croux C (2015) Sparse partial robust M regression. Chemom Intell Lab Syst 149(Part A):50–59
Huber PJ, Ronchetti EM (2009) Robust statistics, 2nd edn. Wiley, New York
Johansen S, Nielsen B (2016) Analysis of the forward search using some new results for martingales and empirical processes. Bernoulli 21:1131–1183
Maronna RA, Martin RD, Yohai VJ (2006) Robust statistics: theory and methods. Wiley, Chichester
Pison G, Van Aelst S, Willems G (2002) Small sample corrections for LTS and MCD. Metrika 55:111–123. doi:10.1007/s001840200191
Rao CR (1973) Linear statistical inference and its applications, 2nd edn. Wiley, New York
Riani M, Atkinson AC, Cerioli A (2009) Finding an unknown number of multivariate outliers. J R Stat Soc Ser B 71:447–466
Riani M, Cerioli A, Atkinson AC, Perrotta D (2014a) Monitoring robust regression. Electron J Stat 8:642–673
Riani M, Cerioli A, Torti F (2014b) On consistency factors and efficiency of robust S-estimators. TEST 23:356–387
Riani M, Atkinson AC, Perrotta D (2014c) A parametric framework for the comparison of methods of very robust regression. Stat Sci 29:128–143
Riani M, Perrotta D, Cerioli A (2015) The forward search for very large datasets. J Stat Softw 67(1):1–20
Rousseeuw PJ (1984) Least median of squares regression. J Am Stat Assoc 79:871–880
Rousseeuw PJ, Leroy AM (1987) Robust regression and outlier detection. Wiley, New York
Tallis GM (1963) Elliptical and radial truncation in normal samples. Ann Math Stat 34:940–944
West M (1984) Outlier models and prior distributions in Bayesian linear regression. J R Stat Soc Ser B 46:431–439
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.