Abstract
Regression lies at the heart of statistics; it is one of the most important branches of multivariate techniques for extracting knowledge in almost every field of study and research. It has recently drawn great interest for tasks in fields such as machine learning, pattern recognition and data mining. Investigating outliers (exceptional observations) has been a problem for data analysts and researchers for a century. Blind analysis of data can have dangerous consequences, leading to the discovery of meaningless patterns and to imperfect knowledge. As a result of the digital revolution and the growth of the Internet and intranets, data continue to accumulate at an exponential rate, so detecting outliers and studying their costs and benefits as a tool for reliable knowledge discovery deserves close attention. Over the last few decades, outliers in regression have been investigated within two schools of thought: robust regression and regression diagnostics. Robust regression first fits a regression to the majority of the data and then identifies outliers as the points with large residuals from the robust fit, whereas regression diagnostics first finds the outliers, deletes or corrects them, and then fits the remaining regular data by classical (usual) methods. At first there was much confusion, but researchers have now reached a consensus that robustness and diagnostics are complementary approaches to data analysis and that neither is sufficient on its own. In this chapter, we discuss both under the unified heading of regression diagnostics. The chapter sets out the need for, and views of, regression diagnostics and presents several contemporary methods through numerical examples in linear regression within each of these categories, together with current challenges and
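The contrast between the two approaches described in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's own method: the Huber M-estimator is used here only as a simple stand-in for a robust fit (it does not protect against high-leverage points), the cutoff of 2.5 is an assumed convention, the data are synthetic, and all function names are hypothetical.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def mad_scale(r):
    """Robust residual scale: normalized median absolute deviation."""
    return 1.4826 * np.median(np.abs(r - np.median(r)))

def huber_fit(X, y, c=1.345, n_iter=50):
    """Huber M-estimate by iteratively reweighted least squares (assumed stand-in)."""
    beta = ols(X, y)
    for _ in range(n_iter):
        r = y - X @ beta
        s = mad_scale(r)
        if s == 0:
            break
        u = np.abs(r) / s
        w = np.minimum(1.0, c / np.maximum(u, 1e-12))  # Huber weights
        sw = np.sqrt(w)
        beta = ols(X * sw[:, None], y * sw)
    return beta

def robust_then_flag(X, y, cutoff=2.5):
    """Robust route: fit the majority first, then flag large robust residuals."""
    beta = huber_fit(X, y)
    r = y - X @ beta
    return beta, np.abs(r) / mad_scale(r) > cutoff

def flag_then_ols(X, y, cutoff=2.5):
    """Diagnostic route: flag by deletion residuals, delete, refit by OLS."""
    n, p = X.shape
    r = y - X @ ols(X, y)
    h = np.einsum("ij,ji->i", X, np.linalg.pinv(X))  # leverages (hat matrix diagonal)
    rss = r @ r
    s2_del = (rss - r**2 / (1 - h)) / (n - p - 1)    # variance with case i deleted
    t = r / np.sqrt(s2_del * (1 - h))                # externally studentized residuals
    keep = np.abs(t) <= cutoff
    return ols(X[keep], y[keep]), ~keep

# Synthetic data (assumption): y = 2 + 3x with small noise, two shifted responses.
x = np.arange(20, dtype=float)
y = 2.0 + 3.0 * x + 0.1 * np.sin(x)
y[[4, 15]] += 30.0
X = np.column_stack([np.ones_like(x), x])

beta_rob, flags_rob = robust_then_flag(X, y)
beta_diag, flags_diag = flag_then_ols(X, y)
print("robust fit:", beta_rob, "flagged:", np.flatnonzero(flags_rob))
print("diagnostic refit:", beta_diag, "flagged:", np.flatnonzero(flags_diag))
```

In this toy case both routes should single out the same two observations; with several outliers or with leverage points, single-case deletion can suffer from masking, which is one reason the two approaches are regarded as complementary.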
Copyright information
© 2012 Springer Science+Business Media, LLC
About this paper
Cite this paper
Nurunnabi, A.A.M., Dai, H. (2012). Robust-Diagnostic Regression: A Prelude for Inducing Reliable Knowledge from Regression. In: Dai, H., Liu, J., Smirnov, E. (eds) Reliable Knowledge Discovery. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-1903-7_4
DOI: https://doi.org/10.1007/978-1-4614-1903-7_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4614-1902-0
Online ISBN: 978-1-4614-1903-7
eBook Packages: Computer Science, Computer Science (R0)