Abstract
Outlier detection is a primary step in many data-mining applications. We present several methods for outlier detection, while distinguishing between univariate vs. multivariate techniques and parametric vs. nonparametric procedures. In presence of outliers, special attention should be taken to assure the robustness of the used estimators. Outlier detection for Data Mining is often based on distance measures, clustering and spatial methods.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Acuna E., Rodriguez C. A., ”Meta analysis study of outlier detection methods in classification,” Technical paper, Department of Mathematics, University of Puerto Rico at Mayaguez, Retrived from academic.uprm.edu/ ea-cuna/paperout.pdf. In proceedings IPSI2004, Venice, 2004.
Alwan L.C., Ebrahimi N., Soofi E.S., ”Information theoretic framework for process control,” European Journal of Operations Research, 111, 526–542, 1998.
Alwan L.C., Roberts H.V., ”Time-series modeling for statistical process control,” Journal of Business and Economics Statistics, 6(1), 87–95, 1988.
Apley D.W., Shi J., ”The GLRT for statistical process control of autocorrelated processes,” IIE Transactions, 31, 1123–1134, 1999.
Barbara D., Faloutsos C, Hellerstein J., Ioannidis Y., Jagadish H.V., Johnson T., Ng R., Poosala V., Ross K., Sevcik K.C., ”The New Jersey Data Reduction Report,” Data Eng. Bull., September, 1996.
Barbara D., Chen P., ”Using the fractal dimension to cluster datasets,” In Proc. ACM KDD 2000, 260–264, 2000.
Barnett V., Lewis T., Outliers in Statistical Data. John Wiley, 1994.
Bay S.D., Schwabacher M., ”Mining distance-based outliers in near linear time with randomization and a simple pruning rule,” In Proc. of the ninth ACM-SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 2003.
Ben-Gal I., Morag G., Shmilovici A., ”CSPC: A Monitoring Procedure for State Dependent Processes,” Technometrics, 45(4), 293–311, 2003.
Box G. E. P., Jenkins G. M., Times Series Analysis, Forecasting and Control, Oakland, CA: Holden Day, 1976.
Breunig M.M., Kriegel H.P., Ng R.T., Sander J., ”Lof: Identifying density-based local outliers,” In Proc. ACMSIGMOD Conf. 2000, 93–104, 2000.
Caussinus H., Roiz A., ”Interesting projections of multidimensional data by means of generalized component analysis,” In Compstat 90, 121–126, Heidelberg: Physica, 1990.
David H. A., ”Robust estimation in the presence of outliers,” In Robustness in Statistics, eds. R. L. Launer and G.N. Wilkinson, Academic Press, New York, 61–74, 1979.
Davies L., Gather U., ”The identification of multiple outliers,” Journal of the American Statistical Association, 88(423), 782–792, 1993.
DuMouchel W., Schonlau M., ”A fast computer intrusion detection algorithm based on hypothesis testing of command transition probabilities,” In Proceedings of the 4th International Conference on Knowledge Discovery and Data-mining (KDD98), 189–193, 1998.
Fawcett T, Provost F., ”Adaptive fraud detection,” Data-mining andKnowledge Discovery, 1(3), 291–316, 1997.
Ferguson T. S., ”On the Rejection of outliers,” In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, 253–287, 1961.
Gather U., ”Testing for multisource contamination in location / scale families,” Communication in Statistics, Part A: Theory and Methods, 18, 1–34, 1989.
Grubbs F. E., ”Proceadures for detecting outlying observations in Samples,” Technometrics, 11, 1–21, 1969.
Hadi A. S., ”Identifying multiple outliers in multivariate data,” Journal of the Royal Statistical Society. Series B, 54, 761–771, 1992.
Hadi A. S., ”A modification of a method for the detection of outliers in multivariate samples,” Journal of the Royal Statistical Society, Series B, 56(2), 1994.
Hawkins D., Identification of Outliers, Chapman and Hall, 1980.
Hawkins S., He H. X., Williams G. J., Baxter R. A., ”Outlier detection using replicator neural networks,” In Proceedings of the Fifth International Conference and Data Warehousing and Knowledge Discovery (DaWaK02), Aix en Provence, France, 2002.
Haining R., Spatial Data Analysis in the Social and Environmental Sciences. Cambridge University Press, 1993.
Hampel F. R., ”A general qualitative definition of robustness,” Annals of Mathematics Statistics, 42, 1887–1896, 1971.
Hampel F. R., ”The influence curve and its role in robust estimation,” Journal of the American Statistical Association, 69, 382–393, 1974.
Haslett J., Brandley R., Craig P., Unwin A., Wills G., ”Dynamic Graphics for Exploring Spatial Data With Application to Locating Global and Local Anomalies,” The American Statistician, 45, 234–242, 1991.
Hu T, Sung S. Y., Detecting pattern-based outliers, Pattern Recognition Letters, 24, 3059–3068.
Iglewics B., Martinez J., Outlier Detection using robust measures of scale, Journal of Sattistical Computation and Simulation, 15, 285–293, 1982.
Jin W., Tung A., Han J., ”Mining top-n local outliers in large databases,” In Proceedings of the 7th International Conference on Knowledge Discovery and Data-mining (KDD01), San Francisco, CA, 2001.
Johnson R., Applied Multivariate Statistical Analysis. Prentice Hall, 1992.
Johnson T, Kwok I., Ng R., ”Fast Computation of 2-Dimensional Depth Contours,” In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 224–228. AAAI Press, 1998.
Kaufman L., Rousseeuw P.J., Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York, 1990.
Knorr E., Ng R., ”A unified approach for mining outliers,” In Proceedings Knowledge Discovery KDD, 219–222, 1997.
Knorr E., Ng. R., ”Algorithms for mining distance-based outliers in large datasets,” In Proc. 24th Int. Conf. Very Large Data Bases (VLDB), 392–403, 24-27, 1998.
Knorr, E., Ng R., Tucakov V., ”Distance-based outliers: Algorithms and applications,” VLDB Journal: Very Large Data Bases, 8(3–4):237–253, 2000.
Knorr E. M., Ng R. T, Zamar R. H., ”Robust space transformations for distance based operations,” In Proceedings of the 7th International Conference on Knowledge Discovery and Data-mining (KDD01), 126–135, San Francisco, CA, 2001.
Kollios G., Gunopulos D., Koudas N., Berchtold S., ”Efficient biased sampling for approximate clustering and outlier detection in large data sets,” IEEE Transactions on Knowledge and Data Engineering, 15(5), 1170–1187, 2003.
Liu H., Shah S., Jiang W., ”On-line outlier detection and data cleaning,” Computers and Chemical Engineering, 28, 1635–1647, 2004.
Lu C, Chen D., Kou Y., ”Algorithms for spatial outlier detection,” In Proceedings of the 3rd IEEE International Conference on Data-mining (ICDM’03), Melbourne, FL, 2003.
Lu C.W., Reynolds M.R., ”EWMA Control Charts for Monitoring the Mean of Autocorrelated Processes,” Journal of Quality Technology, 31(2), 166–188, 1999.
Luc A., ”Local Indicators of Spatial Association: LISA,” Geographical Analysis, 27(2), 93–115, 1995.
Luc A., ”Exploratory Spatial Data Analysis and Geographic Information Systems,” In M. Painho, editor, New Tools for Spatial Analysis, 45–54, 1994.
Martin R. D., Thomson D. J., ”Robust-resistant spectrum estimation,” In Proceeding of the IEEE, 70, 1097–1115, 1982.
Montgomery D.C., Mastrangelo CM., ”Some statistical process control methods for autocorrelated data,” Journal of Quality Technology, 23(3), 179–193, 1991.
Ng R.T., Han J., Efficient and Effective Clustering Methods for Spatial Data Mining, In Proceedings of Very Large Data Bases Conference, 144-155, 1994.
Oliver J. J., Baxter R. A., Wallace C. S., ”Unsupervised Learning using MML,” In Proceedings of the Thirteenth International Conference (ICML96), pages 364–372, Morgan Kaufmann Publishers, San Francisco, CA, 1996.
Panatier Y., Variowin. Software for Spatial Data Analysis in 2D., Springer-Verlag, New York, 1996.
Papadimitriou S., Kitawaga H., Gibbons P.G., Faloutsos C, ”LOCI: Fast Outlier Detection Using the Local Correlation Integral,” Intel research Laboratory Technical report no. IRP-TR-02-09, 2002.
Penny K. I., Jolliffe I. T., ”A comparison of multivariate outlier detection methods for clinical laboratory safety data,” The Statistician 50(3), 295–308, 2001.
Perarson R. K., ”Outliers in process modeling and identification,” IEEE Transactions on Control Systems Technology, 10, 55–63, 2002.
Ramaswamy S., Rastogi R., Shim K., ”Efficient algorithms for mining outliers from large data sets,” In Proceedings of the ACM SIGMOD International Conference on Management of Data, Dalas, TX, 2000.
Rosner B., On the detection of many outliers, Technometrics, 17, 221–227, 1975.
Rousseeuw P., ”Multivariate estimation with high breakdown point,” In: W. Grossmann et al., editors, Mathematical Statistics and Applications, Vol. B, 283–297, Akademiai Kiado: Budapest, 1985.
Rousseeuw P., Leory A., Robust Regression and Outlier Detection, Wiley Series in Probability and Statistics, 1987.
Runger G., Willemain T., ”Model-based and Model-free Control of Autocor-related Processes,” Journal of Quality Technology, 27(4), 283–292, 1995.
Ruts I., Rousseeuw P., ”Computing Depth Contours of Bivariate Point Clouds,” In Computational Statistics and Data Analysis, 23, 153–168, 1996.
Schiffman S. S., Reynolds M. L., Young F. W., Introduction to Multidimensional Scaling: Theory, Methods and Applications. New York: Academic Press, 1981.
Shekhar S., Chawla S., A Tour of Spatial Databases, Prentice Hall, 2002.
Shekhar S., Lu C. T, Zhang P., ”Detecting Graph-Based Spatial Outlier: Algorithms and Applications (A Summary of Results),” In Proc. of the Seventh ACM-SIGKDD Conference on Knowledge Discovery and Data Mining, SF, CA, 2001.
Shekhar S., Lu C. T, Zhang P., ”Detecting Graph-Based Spatial Outlier,” Intelligent Data Analysis: An International Journal, 6(5), 451–468, 2002.
Shekhar S., Lu C. T, Zhang P., ”A Unified Approach to Spatial Outliers Detection,” Geolnformatica, an International Journal on Advances of Computer Science for Geographic Information System, 7(2), 2003.
Wardell D.G., Moskowitz H., Plante R.D., ”Run-length distributions of special-cause control charts for correlated processes,” Technometrics, 36(1), 3–17, 1994.
Tukey J.W., Exploratory Data Analysis. Addison-Wesley, 1977.
Williams G. J., Baxter R. A., He H. X., Hawkins S., Gu L., ”A Comparative Study of RNN for Outlier Detection in Data Mining,” IEEE International Conference on Data-mining (ICDM’02), Maebashi City, Japan, CSIRO Technical Report CMIS-02/102, 2002.
Williams G. J., Huang Z., ”Mining the knowledge mine: The hot spots methodology for mining large real world databases,” In Abdul Sattar, editor, Advanced Topics in Artificial Intelligence, volume 1342 of Lecture Notes in Artificial Intelligence, 340–348, Springer, 1997.
Zhang N.F., ”A Statistical Control Chart for Stationary Process Data,” Technometrics, 40(1), 24–38, 1998.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer Science+Business Media, Inc.
About this chapter
Cite this chapter
Ben-Gal, I. (2005). Outlier Detection. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/0-387-25465-X_7
Download citation
DOI: https://doi.org/10.1007/0-387-25465-X_7
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-24435-8
Online ISBN: 978-0-387-25465-4
eBook Packages: Computer ScienceComputer Science (R0)