Outlyingness: Which variables contribute most?

Abstract

Outlier detection is an inevitable step in most statistical data analyses. However, merely detecting an outlying case does not answer all scientific questions associated with that data point. Outlier detection techniques, classical and robust alike, typically flag the entire case as outlying, or attribute a specific case weight to the entire case. In practice, particularly in high-dimensional data, the outlier will most likely not be outlying along all of its variables, but only along a subset of them. If so, the scientific question why the case has been flagged as an outlier becomes of interest. In this article, a fast and efficient method is proposed to detect the variables that contribute most to an outlier's outlyingness, thereby helping the analyst understand in which way an outlier lies out. The approach pursued in this work is to estimate the univariate direction of maximal outlyingness. It is shown that the problem of estimating that direction can be rewritten as the normed solution of a classical least squares regression problem. Identifying the subset of variables contributing most to outlyingness can thus be achieved by solving the associated least squares problem in a sparse manner. From a practical perspective, sparse partial least squares (SPLS) regression, preferably via the fast sparse NIPALS (SNIPLS) algorithm, is suggested to tackle this problem. The proposed method is shown to perform well on both simulated data and real-life examples.
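To make the abstract's recipe concrete, here is a minimal Python sketch of its two ingredients: the direction of maximal outlyingness, which is proportional to \(\hat{\varvec{\varSigma }}^{-1}(\varvec{x}- \hat{\varvec{\mu }})\), and a sparsified version obtained by solving the associated regression problem with a sparsity penalty. Classical mean and covariance stand in for the robust weighted estimates used in the paper, and a plain lasso stands in for the SPLS/SNIPLS step that is actually proposed; the data, the penalty value and all names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Clean data: n = 200 cases on p = 5 variables.
X = rng.normal(size=(200, 5))
# One extra case that is outlying along variables 0 and 3 only.
x = np.zeros(5)
x[[0, 3]] = [6.0, -5.0]

mu = X.mean(axis=0)                                  # location (robust in the paper)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse scatter (robust in the paper)

# Direction of maximal outlyingness: a(x) = Sigma^{-1}(x - mu), normed.
a = Sigma_inv @ (x - mu)
a /= np.linalg.norm(a)

# Sparse variant: lasso as a stand-in for SPLS/SNIPLS on the regression
# formulation, with the case x appended at a small weight eps.
eps = 0.01
X_aug = np.vstack([X - mu, np.sqrt(eps) * (x - mu)])
y_aug = np.zeros(201)
y_aug[-1] = 1.0
theta = Lasso(alpha=1e-3, fit_intercept=False).fit(X_aug, y_aug).coef_

print(np.round(a, 2))        # large |a_j| only on variables 0 and 3
print(np.nonzero(theta)[0])  # the sparse fit selects the same two variables
```

In this toy setting the dense direction already points at variables 0 and 3; the sparse fit sets all other coefficients exactly to zero, which is what makes the contributing subset easy to read off in high dimensions.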




Author information

Correspondence to Tim Verdonck.

Additional information


This work was supported by the BNP Paribas Fortis Chair in Fraud Analytics and Internal Funds KU Leuven under Grant C16/15/068.

Electronic supplementary material


Supplementary material 1 (zip 981 KB)

A Appendix: Proofs

A.1 Proof of Proposition 1

Proof

Note that our weighted covariance matrix \(\hat{\varvec{\varSigma }}_w\), like all covariance matrices, is positive-semidefinite. Since we also assume it is nonsingular, so that \(\hat{\varvec{\varSigma }}_w^{-1}\) exists, \(\hat{\varvec{\varSigma }}_w\) is in fact positive-definite. We now apply the Cauchy-Bunyakovsky-Schwarz inequality to \(\varvec{x}= \hat{\varvec{\varSigma }}_w^{-1/2}\varvec{x}_1\) and \(\varvec{y}= \hat{\varvec{\varSigma }}_w^{1/2}\varvec{y}_1\), for arbitrary \(\varvec{x}_1,\varvec{y}_1 \in \mathbb {R}^p\). This yields the inequality

$$\begin{aligned} (\varvec{x}_1^T\varvec{y}_1)^2 \le \varvec{x}_1^T\hat{\varvec{\varSigma }}_w^{-1}\varvec{x}_1 \varvec{y}_1^T \hat{\varvec{\varSigma }}_w\varvec{y}_1 \end{aligned}$$

Equality holds if \(\varvec{y}= c\varvec{x}\) with \(c\in \mathbb {R}\), which means \(\hat{\varvec{\varSigma }}_w^{1/2}\varvec{y}_1 = c \hat{\varvec{\varSigma }}_w^{-1/2}\varvec{x}_1\), or \(\varvec{y}_1 = c \hat{\varvec{\varSigma }}_w^{-1}\varvec{x}_1\). In summary, for any \(\varvec{x},\varvec{y}\in \mathbb {R}^p\) we have the inequality

$$\begin{aligned} (\varvec{x}^T\varvec{y})^2 \le \varvec{x}^T\hat{\varvec{\varSigma }}_w^{-1}\varvec{x}\varvec{y}^T \hat{\varvec{\varSigma }}_w\varvec{y}, \end{aligned}$$

where there is equality if and only if \(\varvec{y}= c \hat{\varvec{\varSigma }}_w^{-1}\varvec{x}\).

We now look at

$$\begin{aligned} \frac{(\varvec{x}^T\varvec{a}- \hat{\varvec{\mu }}_w^T\varvec{a})^2}{\varvec{a}^T\hat{\varvec{\varSigma }}_w\varvec{a}} \end{aligned}$$

and apply this inequality:

$$\begin{aligned} \frac{((\varvec{x}- \hat{\varvec{\mu }}_w)^T\varvec{a})^2}{\varvec{a}^T\hat{\varvec{\varSigma }}_w\varvec{a}}&\le \frac{(\varvec{x}- \hat{\varvec{\mu }}_w)^T \hat{\varvec{\varSigma }}_w^{-1}(\varvec{x}- \hat{\varvec{\mu }}_w)\, \varvec{a}^T \hat{\varvec{\varSigma }}_w \varvec{a}}{\varvec{a}^T\hat{\varvec{\varSigma }}_w\varvec{a}} \\&= (\varvec{x}- \hat{\varvec{\mu }}_w)^T \hat{\varvec{\varSigma }}_w^{-1}(\varvec{x}- \hat{\varvec{\mu }}_w). \end{aligned}$$

Equality holds in the above inequality if \(\varvec{a}= c \hat{\varvec{\varSigma }}_w^{-1}(\varvec{x}- \hat{\varvec{\mu }}_w)\). Hence

$$\begin{aligned} \varvec{a}= \frac{\hat{\varvec{\varSigma }}_w^{-1}(\varvec{x}- \hat{\varvec{\mu }}_w)}{\Vert \hat{\varvec{\varSigma }}_w^{-1}(\varvec{x}- \hat{\varvec{\mu }}_w)\Vert } \end{aligned}$$

is the direction \(\varvec{a}\) that maximizes

$$\begin{aligned} \frac{|\varvec{x}^T\varvec{a}- \hat{\varvec{\mu }}_w^T\varvec{a}|}{\sqrt{\varvec{a}^T\hat{\varvec{\varSigma }}_w\varvec{a}}} \end{aligned}$$

and for this \(\varvec{a}\) we have

$$\begin{aligned} \left( \frac{|\varvec{x}^T\varvec{a}- \hat{\varvec{\mu }}_w^T\varvec{a}|}{\sqrt{\varvec{a}^T\hat{\varvec{\varSigma }}_w\varvec{a}}}\right) ^2 = (\varvec{x}- \hat{\varvec{\mu }}_w)^T \hat{\varvec{\varSigma }}_w^{-1}(\varvec{x}- \hat{\varvec{\mu }}_w) = r(\varvec{x};\varvec{X})^2. \end{aligned}$$

\(\square \)
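The proposition is easy to verify numerically. The sketch below checks that the closed-form direction attains an outlyingness equal to the Mahalanobis distance, and that no random direction beats it; classical mean and covariance are used as simplifying stand-ins for the weighted estimates \(\hat{\varvec{\mu }}_w\) and \(\hat{\varvec{\varSigma }}_w\), and all data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
x = np.array([4.0, 0.0, -3.0, 0.0])   # outlying along variables 0 and 2

mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)
S_inv = np.linalg.inv(S)

def outlyingness(a):
    """Standardized projection |a^T (x - mu)| / sqrt(a^T S a)."""
    return abs(a @ (x - mu)) / np.sqrt(a @ S @ a)

# Closed-form maximizer from Proposition 1.
a_star = S_inv @ (x - mu)
a_star /= np.linalg.norm(a_star)

# Its outlyingness equals the Mahalanobis distance r(x; X) ...
md = np.sqrt((x - mu) @ S_inv @ (x - mu))
assert np.isclose(outlyingness(a_star), md)

# ... and no random direction exceeds it.
dirs = rng.normal(size=(10_000, 4))
assert all(outlyingness(d) <= outlyingness(a_star) for d in dirs)
```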

A.2 Proof of Theorem 1

Proof

We know that, by the theory of ordinary least squares regression,

$$\begin{aligned} \varvec{\theta }_{\varepsilon } = (\varvec{X}_{w,\varepsilon }^T\varvec{X}_{w,\varepsilon })^{-1}\varvec{X}_{w,\varepsilon }^T \varvec{y}^{n+1}_{w,\varepsilon } \end{aligned}$$

and by the definition of our weighted covariance matrix, \(\hat{\varvec{\varSigma }}_{w,\varepsilon } = \frac{1}{n_{w,\varepsilon }-1} \varvec{X}_{w,\varepsilon }^T\varvec{X}_{w,\varepsilon }\), we can write

$$\begin{aligned} \varvec{\theta }_{\varepsilon } = ((n_{w,\varepsilon }-1)\hat{\varvec{\varSigma }}_{w,\varepsilon })^{-1}\varvec{X}_{w,\varepsilon }^T\varvec{y}^{n+1}_{w,\varepsilon }. \end{aligned}$$

We know that \(((n_{w,\varepsilon }-1)\hat{\varvec{\varSigma }}_{w,\varepsilon })^{-1} = \frac{1}{n_{w,\varepsilon }-1} \hat{\varvec{\varSigma }}_{w,\varepsilon }^{-1}\), and from the definitions of \(\varvec{X}_{w,\varepsilon }\) and \(\varvec{y}^{n+1}_{w,\varepsilon }\) it is easy to see that \(\varvec{X}^T_{w,\varepsilon }\varvec{y}^{n+1}_{w,\varepsilon } = \sqrt{\varepsilon }(\varvec{x}- \hat{\varvec{\mu }}_{w,\varepsilon })\). Thus we have that

$$\begin{aligned} \varvec{\theta }_{\varepsilon } = \frac{\sqrt{\varepsilon }}{n_{w,\varepsilon }-1} \hat{\varvec{\varSigma }}_{w,\varepsilon }^{-1}(\varvec{x}- \hat{\varvec{\mu }}_{w,\varepsilon }). \end{aligned}$$

Since \(\varepsilon \) is strictly larger than zero, we have that

$$\begin{aligned} \frac{\varvec{\theta }_{\varepsilon }}{\Vert \varvec{\theta }_{\varepsilon }\Vert } = \frac{\hat{\varvec{\varSigma }}_{w,\varepsilon }^{-1}(\varvec{x}- \hat{\varvec{\mu }}_{w,\varepsilon })}{\Vert \hat{\varvec{\varSigma }}_{w,\varepsilon }^{-1}(\varvec{x}- \hat{\varvec{\mu }}_{w,\varepsilon })\Vert }. \end{aligned}$$

Taking the limit \(\varepsilon \rightarrow 0\), we obtain

$$\begin{aligned} \lim _{\varepsilon \rightarrow 0}\frac{\varvec{\theta }_{\varepsilon }}{\Vert \varvec{\theta }_{\varepsilon }\Vert }=\frac{\hat{\varvec{\varSigma }}_w^{-1}(\varvec{x}- \hat{\varvec{\mu }}_w)}{\Vert \hat{\varvec{\varSigma }}_w^{-1}(\varvec{x}- \hat{\varvec{\mu }}_w)\Vert } = \varvec{a}(\varvec{x}) \end{aligned}$$

since \(\lim _{\varepsilon \rightarrow 0}n_{w,\varepsilon }=n_w\), \(\lim _{\varepsilon \rightarrow 0}\hat{\varvec{\mu }}_{w,\varepsilon }=\hat{\varvec{\mu }}_w\) and \(\lim _{\varepsilon \rightarrow 0}\hat{\varvec{\varSigma }}_{w,\varepsilon }^{-1}=\hat{\varvec{\varSigma }}_w^{-1}\). \(\square \)
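The limit can also be checked numerically. The sketch below builds an augmented design satisfying the two identities used in the proof, \(\hat{\varvec{\varSigma }}_{w,\varepsilon } = \varvec{X}_{w,\varepsilon }^T\varvec{X}_{w,\varepsilon }/(n_{w,\varepsilon }-1)\) and \(\varvec{X}^T_{w,\varepsilon }\varvec{y}^{n+1}_{w,\varepsilon } = \sqrt{\varepsilon }(\varvec{x}- \hat{\varvec{\mu }}_{w,\varepsilon })\); since the exact definitions of \(\varvec{X}_{w,\varepsilon }\) and \(\varvec{y}^{n+1}_{w,\varepsilon }\) are given in the main text, the construction here (unit case weights, classical estimates, a fixed center) is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(150, 3))
x = np.array([5.0, -4.0, 0.0])
mu = X.mean(axis=0)

# Closed-form direction a(x) from Proposition 1, normed.
a = np.linalg.solve(np.cov(X, rowvar=False), x - mu)
a /= np.linalg.norm(a)

for eps in (1e-1, 1e-3, 1e-6):
    # Append the case x at weight eps, so X_aug^T y_aug = sqrt(eps) (x - mu).
    X_aug = np.vstack([X - mu, np.sqrt(eps) * (x - mu)])
    y_aug = np.zeros(len(X_aug))
    y_aug[-1] = 1.0
    theta, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
    theta /= np.linalg.norm(theta)
    print(eps, np.linalg.norm(theta - a))   # gap shrinks as eps -> 0
```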

About this article

Cite this article

Debruyne, M., Höppner, S., Serneels, S. et al. Outlyingness: Which variables contribute most? Stat Comput 29, 707–723 (2019). https://doi.org/10.1007/s11222-018-9831-5


Keywords

  • Partial least squares
  • Robust statistics
  • Sparsity
  • Variable selection