The New Palgrave Dictionary of Economics

2018 Edition | Editors: Macmillan Publishers Ltd

Outliers

  • William S. Krasker
Reference work entry
DOI: https://doi.org/10.1057/978-1-349-95189-5_1884

Abstract

Nearly all empirical investigations in economics, particularly those involving linear structural models or regressions, are subject to the problem of anomalous data, commonly called outliers. Roughly speaking, there are three sources of outliers. First, the distribution of the model’s random disturbances often has longer tails than the normal distribution, resulting in a greatly increased chance of large disturbances. Second, the data set may contain erroneous numbers, or ‘gross errors’. The databases most prone to gross errors are large cross sections, particularly those compiled from surveys; gross errors can result from misinterpreted questions, incorrectly recorded answers, keypunch errors, etc. Third, the model itself, typically linear in (transformations of) the variables, is only an approximation to reality. It is apt to be a poor representation of the process generating the data for extreme values of the explanatory variables. This source of outliers applies even to, say, macroeconomic time series, where the likelihood of gross errors is minimal.

Outliers resulting from heavy-tailed but still symmetric disturbance distributions can greatly decrease the efficiency of least squares, while gross errors can in addition cause substantial biases. These potentially damaging effects of anomalous data have been recognized for many years; indeed, the first published work on least squares (Legendre 1805) recommended that outliers be removed from the sample before estimation. The wisdom of this and other approaches that give the observations unequal weights was debated throughout the 19th century.

Despite considerable evidence that error distributions tend to be heavy tailed, many statisticians were reluctant to modify least squares, which was known to be optimal when the disturbances are normally distributed. There were notable exceptions, however, such as Simon Newcomb, an astronomer and mathematician as well as an economist. Newcomb (1886) introduced the idea of modelling the disturbance distribution as a mixture of normal distributions with differing variances; the implied marginal distribution then has heavier tails than the normal. Newcomb also proposed a ‘weighted least squares’ alternative that, it turns out, is similar to a 1964 proposal of Peter Huber, discussed below, which has numerous desirable robustness properties. (The contributions of Newcomb and other late-19th and early 20th-century statisticians are discussed in more detail by Stigler 1973.)
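
To make the mixture idea concrete, the following minimal sketch (the 10 per cent contamination rate and the threefold standard deviation are illustrative choices, not Newcomb’s own values) draws from such a mixture and confirms that its tails are heavier than the normal’s:

```python
import numpy as np

rng = np.random.default_rng(0)

# Newcomb-style scale mixture: with probability 0.9 draw from N(0, 1),
# with probability 0.1 draw from N(0, 9).  Parameters are illustrative.
eps, n = 0.10, 100_000
wide = rng.random(n) < eps
u = np.where(wide, rng.normal(0.0, 3.0, n), rng.normal(0.0, 1.0, n))

# Excess kurtosis is 0 for a normal distribution; it is strongly positive
# here, confirming that the marginal distribution has heavier tails.
excess_kurtosis = np.mean(u**4) / np.mean(u**2) ** 2 - 3.0
print(f"excess kurtosis ~ {excess_kurtosis:.2f}")  # theoretical value: 27/1.8**2 - 3 = 5.33
```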

There was a rapid increase in interest in robustness in the mid-20th century, in part due to the work of John Tukey (see, e.g., Tukey 1960). Robustness research benefited greatly in the 1960s from the formalization of certain desirable robustness properties of estimators. The first is ‘efficiency robustness’: one would like an estimator to maintain a high efficiency for all symmetric disturbance distributions that are ‘close to’ the normal distribution. Peter Huber (1964) found a one-parameter family of estimators, indexed by c > 0, that have a certain optimal minimax efficiency-robustness property. Suppose the regression model is yi = xiβ + ui (i = 1,…, n), where yi is the ith observation on the dependent variable, xi is the k-dimensional row vector containing the ith observation on the explanatory variables, ui is the ith disturbance, and β is the k-vector of parameters to be estimated. Then the Huber estimate b is the vector that solves the equations
$$ 0=\sum_{i=1}^n {\psi}_c\left({y}_i-{x}_ib\right){x}_{ij}\qquad \left(j=1,\dots, k\right), $$
where ψc(t) ≡ max [−c, min(t, c)] and where the choice of the parameter c depends on the scale of the disturbance distribution and the desired tradeoff between robustness and efficiency. As c → ∞, the Huber estimator reduces to ordinary least squares, whereas for c near zero the estimator is similar to the method of least absolute residuals, which had been studied as early as Laplace (1818) and which gained some popularity in the 1950s (see Taylor 1974). The Huber estimator and over sixty others were compared for small samples in the ‘location’ problem (regression on just a constant term) in an extensive 1970–71 Monte Carlo study (Andrews et al. 1972). The results suggested that the asymptotic properties hold quite well in samples as small as twenty.
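
In practice the Huber estimate can be computed by iteratively reweighted least squares. The sketch below is a minimal implementation under simplifying assumptions: the disturbance scale is treated as known and normalized to one, and the default c = 1.345 is a conventional tuning choice rather than anything prescribed by Huber (1964):

```python
import numpy as np

def huber_regression(X, y, c=1.345, tol=1e-10, max_iter=200):
    """Huber regression: solve 0 = sum_i psi_c(y_i - x_i b) x_ij for
    j = 1..k, where psi_c(t) = max[-c, min(t, c)], by iteratively
    reweighted least squares.  The disturbance scale is treated as known
    and normalized to one purely to keep the sketch short; in practice it
    is estimated jointly, e.g. from the median absolute deviation."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]          # least-squares start
    for _ in range(max_iter):
        r = y - X @ b
        # psi_c(r)/r = min(1, c/|r|): weight 1 for small residuals,
        # c/|r| for large ones, so extreme residuals are downweighted.
        w = np.minimum(1.0, c / np.maximum(np.abs(r), 1e-12))
        sw = np.sqrt(w)
        b_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(b_new - b)) < tol:
            break
        b = b_new
    return b
```

At a fixed point of the iteration, the weighted normal equations coincide with the defining equations above; as c grows large every weight approaches one, and the procedure reduces to ordinary least squares, matching the limiting behaviour just described.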

Though the Huber estimators, and others designed for efficiency robustness, maintain a high efficiency even for heavy-tailed disturbance distributions, they are not resistant to other sources of outliers, such as low-probability gross errors. A second desirable robustness property, introduced by Hampel (1968, 1971) and corresponding to the mathematical concept of uniform continuity, is that if gross errors are generated with small probability, then, irrespective of the distribution of those gross errors, the estimator’s bias should be small. Estimators having this property are called ‘qualitatively robust’. Hampel quantified this relationship by means of an estimator’s ‘sensitivity’, which he defined as the right-hand derivative of the maximum possible bias, with respect to the probability of gross errors, evaluated at probability zero.
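
A small location-problem simulation (the 5 per cent contamination rate and the placement of the gross errors at +50 are arbitrary illustrative choices) shows what qualitative robustness buys: the sample mean’s bias is proportional to the magnitude of the gross errors, and hence unbounded, while the median’s bias remains small wherever the gross errors are placed:

```python
import numpy as np

rng = np.random.default_rng(1)

# Location problem: 5% of the sample are gross errors, all placed at +50.
n, eps, gross = 100_000, 0.05, 50.0
x = rng.normal(0.0, 1.0, n)
x[rng.random(n) < eps] = gross

# The mean shifts by roughly eps * gross = 2.5, and the shift can be made
# arbitrarily large by moving the gross errors further out.  The median's
# bias stays near 0.07 no matter where the gross errors sit.
print(f"mean   = {x.mean():.3f}")
print(f"median = {np.median(x):.3f}")
```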

Modifications of the Huber estimator designed to make it qualitatively robust were proposed by several researchers in the 1970s (see Krasker and Welsch (1982) for further discussion). They have the general form
$$ 0=\sum_{i=1}^n v\left({x}_i\right)\,{\psi}_c\!\left(\frac{{y}_i-{x}_ib}{w\left({x}_i\right)}\right){x}_{ij}\qquad \left(j=1,\dots, k\right), $$
where w and v are non-negative weight functions that allow for the downweighting of observations with outlying values of the explanatory variables, called ‘leverage points’. The proposals of Krasker (1980) and Krasker and Welsch (1982) also have a certain efficiency property among estimators with the same sensitivity to gross errors. The idea of finding an estimator that has maximum efficiency subject to a bound on the sensitivity was developed by Hampel (1968).
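
The sketch below is one simple, illustrative instance of this general form: a Mallows-type estimator in which v downweights high-leverage rows through the hat-matrix diagonal, and a single robust scale estimate stands in for w(xi). Both weight choices are assumptions made here for brevity; Krasker and Welsch instead derive their weights from the efficiency criterion just mentioned:

```python
import numpy as np

def gm_regression(X, y, c=1.345, tol=1e-10, max_iter=200):
    """Mallows-type GM-estimator: solve
        0 = sum_i v(x_i) psi_c((y_i - x_i b) / s) x_ij,  j = 1..k,
    where v downweights high-leverage rows and a single robust scale
    estimate s stands in for w(x_i).  Both weight choices below are
    illustrative, not those of Krasker and Welsch."""
    h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))     # leverages h_ii
    v = np.sqrt(np.clip(1.0 - h, 0.0, 1.0))            # small v at leverage points
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(max_iter):
        r = y - X @ b
        s = 1.4826 * np.median(np.abs(r - np.median(r)))   # MAD scale
        t = r / max(s, 1e-12)
        # v_i * psi_c(t_i)/t_i: downweight both large residuals and leverage
        w = v * np.minimum(1.0, c / np.maximum(np.abs(t), 1e-12))
        sw = np.sqrt(w)
        b_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(b_new - b)) < tol:
            break
        b = b_new
    return b
```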

If an estimator is qualitatively robust, its asymptotic bias will be small provided the probability of gross errors is sufficiently small. However, this property does not tell us how the estimator will behave if the gross errors are, say, ten per cent of the data. One crude measure of this behaviour, introduced by Hampel (1968, 1971) and called the ‘breakdown point’, is the smallest probability of gross errors that can cause the asymptotic bias to be arbitrarily large. Equivalently, it is the largest fraction of gross errors in the data that the estimator can handle before it becomes totally unreliable. By the 1980s it was clear that the most common qualitatively robust regression estimators, such as those listed earlier, have low breakdown points when k, the number of parameters, is large. Several alternative estimators have been proposed whose breakdown points are close to 1/2, the largest possible value. Examples are the ‘repeated medians’ estimator of Siegel (1982), the projection-pursuit approach of Donoho and Huber (1983), and the estimator proposed by Rousseeuw (1984), which minimizes the median of the squared residuals (rather than their sum). However, all of these estimators are computationally burdensome unless k is small, and in fact it appears that the computational difficulties are an inherent feature of high-breakdown multivariate procedures that transform naturally under linear changes in the coordinate system.
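
As an illustration of both the estimator and its computational burden, the following sketch approximates Rousseeuw’s least-median-of-squares fit by the standard random-resampling device: fitting exact solutions through random subsets of k observations and keeping the candidate whose median squared residual is smallest. The number of subsets needed for a reliable answer grows rapidly with k:

```python
import numpy as np

def lms_regression(X, y, n_trials=3000, seed=0):
    """Approximate least median of squares (Rousseeuw 1984): minimize the
    median of the squared residuals over candidate fits obtained from
    random 'elemental' subsets of k observations.  The exact minimizer is
    combinatorially hard, hence the resampling approximation."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    best_b, best_crit = None, np.inf
    for _ in range(n_trials):
        idx = rng.choice(n, size=k, replace=False)
        try:
            b = np.linalg.solve(X[idx], y[idx])    # exact fit through k points
        except np.linalg.LinAlgError:
            continue                               # singular subset; redraw
        crit = np.median((y - X @ b) ** 2)
        if crit < best_crit:
            best_b, best_crit = b, crit
    return best_b
```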

One of the most important uses for high-breakdown procedures is simply to facilitate the identification of outliers, which are often masked by non-robust estimators. For example, in a simple regression, a single outlier associated with an extreme value of the explanatory variable can have so much influence on the least-squares estimate that its own residual is very small. Thus, mere examination of the residuals from a non-robust fit can fail to reveal the anomalous observations. This problem becomes much more severe in higher dimensions, where even many qualitatively robust estimators can break down due to a small cluster of outlying observations. Belsley et al. (1980) have proposed a variety of methods for identifying outliers in regression.
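
A two-variable example, with numbers chosen purely for illustration, shows the masking effect: a gross error at an extreme x-value attracts the least-squares line so strongly that its own residual looks unremarkable, even though it lies far from the true regression line:

```python
import numpy as np

rng = np.random.default_rng(2)

# 29 clean points near y = 1 + 2x on [0, 1], plus one gross error at x = 10.
n = 30
X = np.column_stack([np.ones(n), rng.uniform(0.0, 1.0, n)])
y = 1.0 + 2.0 * X[:, 1] + rng.normal(0.0, 0.1, n)
X[0, 1], y[0] = 10.0, 0.0          # leverage point far below the true line

b_ls = np.linalg.lstsq(X, y, rcond=None)[0]
r_ls = y - X @ b_ls

# The outlier's own least-squares residual is tiny (it is 'masked'),
# even though it lies about 21 units below the true regression line.
print(f"LS residual of the gross error: {r_ls[0]:.2f}")
print(f"deviation from the true line:   {y[0] - (1.0 + 2.0 * 10.0):.1f}")
```

A high-breakdown fit such as the least-median-of-squares sketch above would assign this observation a residual of roughly −21, exposing it immediately.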

For statistical inference, as opposed to data analysis, identification of the outliers is only a small part of the problem. An important difficulty is that it is often impossible to determine solely from the data whether an outlying observation results from aberrant data, or whether the true regression function is slightly non-linear. Typically either of these possibilities will ‘explain’ the outlier, but for inference their implications may be very different. In these circumstances it seems essential to place a prior on the amount of curvature in the regression function, but this is difficult to do, particularly when there are several explanatory variables. One approach is outlined in Krasker et al. (1983, section 5).

Finally, although the preceding remarks have dealt with regression, outliers occur and have similar consequences in many other statistical contexts, such as discrete or censored dependent variable models, stochastic parameter models, or linear structural models. The most reliable way to identify outliers in these contexts is to estimate the model’s underlying parameters robustly, and check for observations that deviate greatly in an appropriate sense from the model’s predictions. For example, Krasker and Welsch (1985) have presented a qualitatively robust weighted-instrumental-variables estimator for simultaneous-equations models, analogous to their proposal for regression. In general, however, methods for dealing with outliers in models of the kind just mentioned are far less developed than those for regression.

Bibliography

  1. Andrews, D.F., P.J. Bickel, F.R. Hampel, P.J. Huber, W.H. Rogers, and J.W. Tukey. 1972. Robust estimates of location: Survey and advances. Princeton: Princeton University Press.
  2. Belsley, D.A., E. Kuh, and R.E. Welsch. 1980. Regression diagnostics. New York: Wiley.
  3. Donoho, D.L., and P.J. Huber. 1983. The notion of breakdown point. In A Festschrift for Erich L. Lehmann, ed. P. Bickel, K. Doksum, and J.L. Hodges Jr. Belmont: Wadsworth International Group.
  4. Hampel, F.R. 1968. Contributions to the theory of robust estimation. PhD thesis, University of California, Berkeley.
  5. Hampel, F.R. 1971. A general qualitative definition of robustness. Annals of Mathematical Statistics 42: 1887–1896.
  6. Huber, P.J. 1964. Robust estimation of a location parameter. Annals of Mathematical Statistics 35(1): 73–101.
  7. Krasker, W.S. 1980. Estimation in linear regression models with disparate data points. Econometrica 48: 1333–1346.
  8. Krasker, W.S., and R.E. Welsch. 1982. Efficient bounded-influence regression estimation. Journal of the American Statistical Association 77(379): 595–604.
  9. Krasker, W.S., and R.E. Welsch. 1985. Resistant estimation for simultaneous-equations models using weighted instrumental variables. Econometrica 53(6): 1475–1488.
  10. Krasker, W.S., E. Kuh, and R.E. Welsch. 1983. Estimation for dirty data and flawed models. In Handbook of econometrics, vol. 1, ed. Z. Griliches and M.D. Intriligator. Amsterdam: North-Holland.
  11. de Laplace, P.S. 1818. Deuxième supplément à la théorie analytique des probabilités. Paris: Courcier. Reprinted in Oeuvres de Laplace, vol. 7, 569–623. Paris: Imprimerie Royale, 1847. Reprinted in Oeuvres complètes de Laplace, vol. 7, 531–580. Paris: Gauthier-Villars, 1886.
  12. Legendre, A.M. 1805. On the method of least squares. Trans. in A source book in mathematics, ed. D.E. Smith. New York: Dover Publications, 1959.
  13. Newcomb, S. 1886. A generalized theory of the combination of observations so as to obtain the best result. American Journal of Mathematics 8: 343–366.
  14. Rousseeuw, P.J. 1984. Least median of squares regression. Journal of the American Statistical Association 79(388): 871–880.
  15. Siegel, A.F. 1982. Robust regression using repeated medians. Biometrika 69: 242–244.
  16. Stigler, S.M. 1973. Simon Newcomb, Percy Daniell, and the history of robust estimation, 1885–1920. Journal of the American Statistical Association 68(344): 872–879.
  17. Taylor, L.D. 1974. Estimation by minimizing the sum of absolute errors. In Frontiers of econometrics, ed. P. Zarembka. New York: Academic Press.
  18. Tukey, J.W. 1960. A survey of sampling from contaminated distributions. In Contributions to probability and statistics, ed. I. Olkin. Stanford: Stanford University Press.

Copyright information

© Macmillan Publishers Ltd. 2018
