Data Mining and Knowledge Discovery Handbook, pp. 131–146
Outlier Detection
Chapter
Abstract
Outlier detection is a primary step in many data-mining applications. We present several methods for outlier detection, distinguishing between univariate and multivariate techniques and between parametric and nonparametric procedures. In the presence of outliers, special care should be taken to ensure the robustness of the estimators used. Outlier detection for data mining is often based on distance measures, clustering, and spatial methods.
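To make the point about estimator robustness concrete, the following is a minimal univariate sketch and is not taken from the chapter. It compares classical z-scores (mean and standard deviation) with modified z-scores built on the median and the median absolute deviation (MAD). The sample data, the function names, the 0.6745 scaling constant, and the cutoffs of 3.0 and 3.5 are all assumptions chosen for illustration.

```python
import statistics


def classical_z_scores(values):
    """Z-scores based on the sample mean and standard deviation (non-robust)."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


def modified_z_scores(values):
    """Z-scores based on the median and MAD (robust to a few extreme values)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; modified z-scores are undefined")
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # under normality (an assumed, conventional constant).
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(scores, threshold):
    """Return the indices whose absolute score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if abs(s) > threshold]


# Hypothetical sample: eight readings near 10 and two gross outliers.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0, 26.0]

classical_flags = flag_outliers(classical_z_scores(data), threshold=3.0)
robust_flags = flag_outliers(modified_z_scores(data), threshold=3.5)

print("classical:", classical_flags)
print("robust:   ", robust_flags)
```

In this sample the two extreme values inflate the mean and standard deviation enough that the classical scores flag nothing, while the median-based scores flag both. This masking effect is one reason robust estimators matter when outliers are present.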
Keywords
Outliers, Distance measures, Statistical Process Control, Spatial data
Copyright information
© Springer Science+Business Media, Inc. 2005