# Outliers

**DOI:** https://doi.org/10.1057/978-1-349-95189-5_1884

## Abstract

Nearly all empirical investigations in economics, particularly those involving linear structural models or regressions, are subject to the problem of anomalous data, commonly called outliers. Roughly speaking, there are three sources of outliers. First, the distribution of the model’s random disturbances often has longer tails than the normal distribution, resulting in a greatly increased chance of larger disturbances. Second, the data set may contain erroneous numbers, or ‘gross errors’. The data bases most prone to gross errors are large cross sections, particularly those compiled from surveys; gross errors can result from misinterpreted questions, incorrectly recorded answers, keypunch errors, etc. Third, the model itself, typically linear in (transformations of) the variables, is only an approximation to reality. It is apt to be a poor representation of the process generating the data for extreme values of the explanatory variables. This source of outliers applies even to, say, macroeconomic time series, where the likelihood of gross errors is minimal.

Outliers resulting from heavy-tailed but still symmetric disturbance distributions can greatly decrease the efficiency of least squares, while gross errors can in addition cause substantial biases. These potentially damaging effects of anomalous data have been recognized for many years; indeed, the first published work on least squares (Legendre 1805) recommended that outliers be removed from the sample before estimation. The wisdom of this and other approaches that give the observations unequal weights was debated throughout the 19th century.

Despite considerable evidence that error distributions tend to be heavy tailed, many statisticians were reluctant to modify least squares, which was known to be optimal when the disturbances are normally distributed. There were notable exceptions, however, such as Simon Newcomb, an astronomer and mathematician as well as an economist. Newcomb (1886) introduced the idea of modelling the disturbance distribution as a mixture of normal distributions with differing variances; the implied marginal distribution then has heavier tails than the normal. Newcomb also proposed a ‘weighted least squares’ alternative that, it turns out, is similar to a 1964 proposal of Peter Huber, discussed below, which has numerous desirable robustness properties. (The contributions of Newcomb and other late 19th- and early 20th-century statisticians are discussed in more detail by Stigler 1973.)
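The heavier tails of such a mixture can be checked numerically. The sketch below is illustrative only: the function names, the mixing weights (0.9 and 0.1) and the standard deviations (1 and 3) are assumptions chosen for the example, not values taken from Newcomb. It uses the fact that a zero-mean normal with standard deviation σ has fourth moment 3σ⁴.

```python
import math

def mixture_kurtosis(weights, sigmas):
    """Kurtosis E[u^4] / (E[u^2])^2 of a zero-mean mixture of normals."""
    if abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("mixing weights must sum to one")
    second = sum(p * s ** 2 for p, s in zip(weights, sigmas))
    fourth = 3.0 * sum(p * s ** 4 for p, s in zip(weights, sigmas))
    return fourth / second ** 2

def tail_prob(weights, sigmas, t):
    """P(|u| > t) for the mixture; for one normal, P(|Z| > z) = erfc(z / sqrt(2))."""
    return sum(p * math.erfc(t / (s * math.sqrt(2.0)))
               for p, s in zip(weights, sigmas))

# Hypothetical mixture: 90% of disturbances have sd 1, 10% have sd 3.
weights, sigmas = [0.9, 0.1], [1.0, 3.0]
sd = math.sqrt(sum(p * s ** 2 for p, s in zip(weights, sigmas)))

kurt_mix = mixture_kurtosis(weights, sigmas)   # exceeds the normal value of 3
p_mix = tail_prob(weights, sigmas, 3.0 * sd)   # mixture: beyond 3 standard deviations
p_norm = tail_prob([1.0], [sd], 3.0 * sd)      # normal with the same variance
```

With these assumed values the kurtosis is 25/3, against 3 for any single normal, and the probability of a disturbance beyond three standard deviations is about 0.018, against about 0.0027 for a normal distribution with the same variance.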

Huber (1964) proposed a class of estimators, indexed by a parameter \(c > 0\), that have a certain optimal minimax efficiency-robustness property. Suppose the regression model is \(y_i = x_i \beta + u_i\) (\(i = 1, \ldots, n\)), where \(y_i\) is the \(i\)th observation on the dependent variable, \(x_i\) is the \(k\)-dimensional row vector containing the \(i\)th observation on the explanatory variables, \(u_i\) is the \(i\)th disturbance, and \(\beta\) is the \(k\)-vector of parameters to be estimated. Then the Huber estimate \(b\) is the vector that solves the equations

\[ \sum_{i=1}^{n} \psi_c(y_i - x_i b)\, x_i^{\top} = 0, \]

where \(\psi_c(t) \equiv \max[-c, \min(t, c)]\) and where the choice of the parameter \(c\) depends on the scale of the disturbance distribution and the desired tradeoff between robustness and efficiency. As \(c \to \infty\), the Huber estimator reduces to ordinary least squares, whereas if \(c\) is near zero, the estimator is similar to the method of least absolute residuals, which had been studied as early as Laplace (1818) and which gained some popularity in the 1950s (see Taylor 1974). The Huber estimator and over sixty others were compared for small samples in the ‘location’ problem (regression on just a constant term) in an extensive 1970–71 Monte Carlo study (Andrews et al. 1972). The results suggested that the asymptotic properties hold quite well in samples as small as twenty.
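The role of \(c\) is easiest to see in the ‘location’ problem. The sketch below assumes the Huber estimate then solves \(\sum_i \psi_c(y_i - b) = 0\), the standard form when the only regressor is a constant, and finds the root by bisection. That works because the left-hand side is non-increasing in \(b\). The function names, the data and the values of \(c\) are hypothetical, and \(c\) is expressed directly in the units of the data, not relative to an estimated scale.

```python
def psi(t, c):
    """Huber's psi_c(t) = max(-c, min(t, c))."""
    return max(-c, min(t, c))

def huber_location(y, c, iterations=200):
    """Solve sum_i psi_c(y_i - b) = 0 for b by bisection.

    The sum is non-increasing in b, non-negative at min(y) and
    non-positive at max(y), so a root lies in [min(y), max(y)].
    """
    lo, hi = min(y), max(y)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if sum(psi(yi - mid, c) for yi in y) > 0:
            lo = mid   # estimating equation still positive: root is to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical sample with one gross error (100.0).
y = [1.0, 2.0, 3.0, 5.0, 100.0]

b_large_c = huber_location(y, c=1e9)    # no residual is clipped: the sample mean
b_mid_c = huber_location(y, c=1.5)      # intermediate tradeoff
b_small_c = huber_location(y, c=0.01)   # nearly all residuals clipped: the median
```

For this sample the estimate moves from the mean (22.2) when \(c\) is very large to the median (3.0) when \(c\) is near zero, with the intermediate choice \(c = 1.5\) giving 3.25.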

Though the Huber estimators, and others designed for efficiency robustness, maintain a high efficiency even for heavy-tailed disturbance distributions, they are not resistant to other sources of outliers, such as low-probability gross errors. A second desirable robustness property, introduced by Hampel (1968, 1971) and corresponding to the mathematical concept of uniform continuity, is that if gross errors are generated with small probability, then, irrespective of the distribution of those gross errors, the estimator’s bias should be small. Estimators having this property are called ‘qualitatively robust’. Hampel quantified this relationship by means of an estimator’s ‘sensitivity’, which he defined as the right-hand derivative of the maximum possible bias, with respect to the probability of gross errors, evaluated at probability zero.
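The verbal definition of sensitivity can be restated in symbols. The notation here is an assumption introduced for illustration, not Hampel's own: \(\varepsilon\) is the probability of gross errors, \(F\) the distribution of the uncontaminated data, \(G\) an arbitrary gross-error distribution, and \(B(\varepsilon)\) the maximum possible bias of the estimator \(b\).

\[ B(\varepsilon) = \sup_{G} \bigl\| \operatorname{bias}\bigl(b;\ (1-\varepsilon)F + \varepsilon G\bigr) \bigr\|, \qquad \gamma = \lim_{\varepsilon \downarrow 0} \frac{B(\varepsilon) - B(0)}{\varepsilon}. \]

If the estimator is unbiased in the absence of gross errors, so that \(B(0) = 0\), a finite sensitivity \(\gamma\) means that the maximum bias is approximately \(\gamma\varepsilon\) when \(\varepsilon\) is small.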

*w* and *v* are non-negative weight functions that allow for the downweighting of observations with outlying values for the explanatory variables, called ‘leverage points’. The proposals of Krasker (1980) and Krasker and Welsch (1982) also have a certain efficiency property among estimators with the same sensitivity to gross errors. The idea of finding an estimator that has maximum efficiency subject to a bound on the sensitivity was developed by Hampel (1968).

If an estimator is qualitatively robust, its asymptotic bias will be small provided the probability of gross errors is sufficiently small. However, this property does not tell us how the estimator will behave if the gross errors are, say, ten per cent of the data. One crude measure of this behaviour, introduced by Hampel (1968, 1971) and called the ‘breakdown point’, is the smallest probability of gross errors that can cause the asymptotic bias to be arbitrarily large. Equivalently, it is the largest fraction of gross errors in the data that the estimator can handle before it becomes totally unreliable. By the 1980s it was clear that the most common qualitatively robust regression estimators, such as those listed earlier, have low breakdown points when *k*, the number of parameters, is large. Several alternative estimators have been proposed with breakdown points of \( \frac{1}{2} \), the largest possible value. Examples are the ‘repeated medians’ estimator of Siegel (1982), the projection-pursuit approach of Donoho and Huber (1983), and the estimator proposed by Rousseeuw (1984), which minimizes the median of the squared residuals (rather than their sum). However, all of these estimators are computationally burdensome unless *k* is small, and in fact, it appears that the computational difficulties are an inherent feature of high-breakdown multivariate procedures that transform naturally under linear changes in the coordinate system.
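A small numerical illustration of a high breakdown point follows, using the median-of-squared-residuals criterion in a simple regression. It is a sketch under stated assumptions, not Rousseeuw's algorithm: the search only considers lines through pairs of observations, so in general it approximates the minimizer, and the function names and data (ten observations, four of them gross errors) are hypothetical. The cost of the pairwise search also hints at the computational burden mentioned above.

```python
from itertools import combinations
from statistics import median

def ols_line(x, y):
    """Ordinary least squares intercept and slope for a simple regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def lms_line(x, y):
    """Approximate least-median-of-squares line.

    Tries the line through every pair of observations and keeps the one
    with the smallest median squared residual (illustrative search only).
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # vertical line: slope undefined
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        crit = median((yk - intercept - slope * xk) ** 2
                      for xk, yk in zip(x, y))
        if best is None or crit < best[0]:
            best = (crit, intercept, slope)
    return best[1], best[2]

# Hypothetical data: six points on y = 1 + 2x, then four gross errors.
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi for xi in x[:6]] + [-20.0] * 4

a_ols, s_ols = ols_line(x, y)   # pulled towards the gross errors
a_lms, s_lms = lms_line(x, y)   # fits the majority of the data
```

With 40 per cent of the observations contaminated, the least-squares slope is negative, while the median-based fit recovers the intercept 1 and slope 2 of the uncontaminated points.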

One of the most important uses for high-breakdown procedures is simply to facilitate the identification of outliers, which are often masked by non-robust estimators. For example, in a simple regression, a single outlier associated with an extreme value of the explanatory variable can have so much influence on the least-squares estimate that its own residual is very small. Thus, mere examination of the residuals from a non-robust fit can fail to reveal the anomalous observations. This problem becomes much more severe in higher dimensions, where even many qualitatively robust estimators can break down due to a small cluster of outlying observations. Belsley et al. (1980) have proposed a variety of methods for identifying outliers in regression.
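The masking effect in a simple regression can be reproduced with a few numbers. In the sketch below the data and function name are hypothetical. Five observations lie exactly on the line y = x and a sixth is an outlier at an extreme value of the explanatory variable. The leverage values are the diagonal of the hat matrix, which in a simple regression with an intercept is \(1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2\).

```python
def ols_diagnostics(x, y):
    """Least-squares fit of y on a constant and x.

    Returns intercept, slope, residuals and hat-matrix diagonals (leverage).
    """
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    residuals = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    leverage = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return intercept, slope, residuals, leverage

# Hypothetical data: the last observation is the anomalous one.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 30.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 0.0]

intercept, slope, residuals, leverage = ols_diagnostics(x, y)

largest_good_residual = max(abs(r) for r in residuals[:-1])
outlier_residual = abs(residuals[-1])
```

Here the outlier drags the fitted slope below zero, yet its own residual is several times smaller in absolute value than the largest residual among the five well-behaved points. Its leverage is close to one, which is what flags it when the residuals alone do not.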

For statistical inference, as opposed to data analysis, identification of the outliers is only a small part of the problem. An important difficulty is that it is often impossible to determine solely from the data whether an outlying observation results from aberrant data, or whether the true regression function is slightly non-linear. Typically either of these possibilities will ‘explain’ the outlier, but for inference their implications may be very different. In these circumstances it seems essential to place a prior on the amount of curvature in the regression function, but this is difficult to do, particularly when there are several explanatory variables. One approach is outlined in Krasker et al. (1983, section 5).

Finally, although the preceding remarks have dealt with regression, outliers occur and have similar consequences in many other statistical contexts, such as discrete or censored dependent variable models, stochastic parameter models, or linear structural models. The most reliable way to identify outliers in these contexts is to estimate robustly the model’s underlying parameters, and check for observations that deviate greatly in an appropriate sense from the model’s prediction. For example, Krasker and Welsch (1985b) have presented a qualitatively robust weighted-instrumental-variables estimator for simultaneous-equations models, analogous to their proposal for regression. In general, however, methods for dealing with outliers in models of the kind just mentioned are far less developed than those for regression.

### Bibliography

- Andrews, D.F., P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rogers, and J.W. Tukey. 1972. *Robust estimates of location: Survey and advances*. Princeton: Princeton University Press.
- Belsley, D.A., E. Kuh, and R.E. Welsch. 1980. *Regression diagnostics*. New York: Wiley.
- Donoho, D.L., and P.J. Huber. 1983. The notion of breakdown point. In *A Festschrift for Erich L. Lehmann*, ed. P. Bickel, K. Doksum, and J.L. Hodges Jr. Belmont: Wadsworth International Group.
- Hampel, F.R. 1968. Contributions to the theory of robust estimation. PhD thesis, University of California, Berkeley.
- Hampel, F.R. 1971. A general qualitative definition of robustness. *Annals of Mathematical Statistics* 42: 1887–1896.
- Huber, P.J. 1964. Robust estimation of a location parameter. *Annals of Mathematical Statistics* 35(1): 73–101.
- Krasker, W.S. 1980. Estimation in linear regression models with disparate data points. *Econometrica* 48: 1333–1346.
- Krasker, W.S., and R.E. Welsch. 1982. Efficient bounded-influence regression estimation. *Journal of the American Statistical Association* 77(379): 595–604.
- Krasker, W.S., and R.E. Welsch. 1985b. Resistant estimation for simultaneous-equations models using weighted instrumental variables. *Econometrica* 53(6): 1475–1488.
- Krasker, W.S., E. Kuh, and R.E. Welsch. 1983. Estimation for dirty data and flawed models. In *Handbook of econometrics*, vol. 1, ed. Z. Griliches and M.D. Intriligator. Amsterdam: North-Holland.
- de Laplace, P.S. 1818. *Deuxième supplément à la théorie analytique des probabilités*. Paris: Courcier. Reprinted in *Oeuvres de Laplace*, vol. 7, 569–623. Paris: Imprimerie Royale, 1847. Reprinted in *Oeuvres complètes de Laplace*, vol. 7, 531–580. Paris: Gauthier-Villars, 1886.
- Legendre, A.M. 1805. On the method of least squares. Trans. in *A source book in mathematics*, ed. D.E. Smith. New York: Dover Publications, 1959.
- Newcomb, S. 1886. A generalized theory of the combination of observations so as to obtain the best result. *American Journal of Mathematics* 8: 343–366.
- Rousseeuw, P.J. 1984. Least median of squares regression. *Journal of the American Statistical Association* 79(388): 871–880.
- Siegel, A.F. 1982. Robust regression using repeated medians. *Biometrika* 69: 242–244.
- Stigler, S.M. 1973. Simon Newcomb, Percy Daniell, and the history of robust estimation, 1885–1920. *Journal of the American Statistical Association* 68(344): 872–879.
- Taylor, L.D. 1974. Estimation by minimizing the sum of absolute errors. In *Frontiers of econometrics*, ed. P. Zarembka. New York: Academic Press.
- Tukey, J.W. 1960. A survey of sampling from contaminated distributions. In *Contributions to probability and statistics*, ed. I. Olkin. Stanford: Stanford University Press.