Abstract
Anomalies persist in the use of deletion diagnostics in regression. Tests for outliers under subset deletions use the R-Fisher F_I statistics, each having a noncentral F-distribution with noncentrality parameter λ as a function of shifts only at deleted rows in the index set I. Numerous studies examine empirical outcomes of these diagnostics in random experiments. In contrast, studies here are probabilistic, examining distributions behind those empirical outcomes and tracking the effects of shifts at nondeleted rows. By allowing shifts at nondeleted rows in a set J, in addition to traditional shifts at deleted rows in I, F_I is shown to have a doubly noncentral F-distribution. By removing the unnecessary restriction that shifts occur only at deleted rows, these findings support constructs akin to power curves in tracking probabilities of masking or swamping as shifts evolve. In addition, “regression effects” among outliers may have unforeseen consequences. A dichotomy of shifts is discovered as projections into the “regressor” and “error” spaces of a model. Hidden shifts at nondeleted rows can obfuscate the meanings ascribed not only to traditional outlier diagnostics, but also to subset influence diagnostics corresponding one-to-one with F_I. In short, despite wide usage abetted by software support, deletion diagnostics in current vogue can no longer be recommended to achieve the objectives traditionally cited. Case studies illustrate the debilitating effects of these anomalies in practice, together with conclusions misleading to prospective users.
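As an illustration of why a doubly noncentral F-distribution matters for masking, the following is a minimal Monte Carlo sketch, not the authors' method. It treats a generic doubly noncentral F variate as the ratio of two independent noncentral chi-square variates, each divided by its degrees of freedom. The degrees of freedom, the noncentralities `nc1` and `nc2`, and all function names are hypothetical choices made for this example; how shifts in I and J map onto the two noncentrality parameters is developed in the article and is not reproduced here. The cutoff is estimated by simulation from the central case rather than taken from tables.

```python
import math
import random


def noncentral_chi2(df, nc, rng):
    """Draw one noncentral chi-square variate with `df` degrees of freedom
    and noncentrality `nc`, as a sum of squared unit-variance normals."""
    # The distribution depends on the means only through their sum of
    # squares, so the whole noncentrality is placed on one coordinate.
    total = (rng.gauss(0.0, 1.0) + math.sqrt(nc)) ** 2
    for _ in range(df - 1):
        total += rng.gauss(0.0, 1.0) ** 2
    return total


def doubly_noncentral_f(df1, df2, nc1, nc2, rng):
    """Ratio of independent noncentral chi-squares, each scaled by its df."""
    numerator = noncentral_chi2(df1, nc1, rng) / df1
    denominator = noncentral_chi2(df2, nc2, rng) / df2
    return numerator / denominator


def empirical_cutoff(df1, df2, level=0.05, n_draws=20000, seed=1):
    """Simulated upper-`level` critical value of the central F(df1, df2)."""
    rng = random.Random(seed)
    draws = sorted(
        doubly_noncentral_f(df1, df2, 0.0, 0.0, rng) for _ in range(n_draws)
    )
    return draws[int((1.0 - level) * n_draws)]


def exceedance_rate(df1, df2, nc1, nc2, cutoff, n_draws=20000, seed=0):
    """Monte Carlo estimate of P(F > cutoff) for the given noncentralities."""
    rng = random.Random(seed)
    hits = sum(
        doubly_noncentral_f(df1, df2, nc1, nc2, rng) > cutoff
        for _ in range(n_draws)
    )
    return hits / n_draws


# Hypothetical settings: 2 numerator and 10 denominator degrees of freedom.
cutoff = empirical_cutoff(2, 10)
print("central (no shifts):       ", exceedance_rate(2, 10, 0.0, 0.0, cutoff))
print("numerator shift only:      ", exceedance_rate(2, 10, 8.0, 0.0, cutoff))
print("numerator + denominator:   ", exceedance_rate(2, 10, 8.0, 10.0, cutoff))
```

Under these assumed settings, a nonzero denominator noncentrality makes the denominator stochastically larger, so the probability of exceeding the cutoff falls relative to the singly noncentral case. Sweeping `nc1` and `nc2` over a grid would give curves of the kind the abstract likens to power curves.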
Cite this article
Jensen, D.R., Ramirez, D.E. Noncentralities Induced in Regression Diagnostics. J Stat Theory Pract 8, 141–165 (2014). https://doi.org/10.1080/15598608.2014.847758