Outlier Detection

Ben-Gal, Irad

doi:10.1007/0-387-25465-X_7

Abstract

Outlier detection is a primary step in many data-mining applications. We present several methods for outlier detection, distinguishing between univariate and multivariate techniques and between parametric and nonparametric procedures. In the presence of outliers, special care should be taken to ensure the robustness of the estimators used. Outlier detection for data mining is often based on distance measures, clustering, and spatial methods.

Author information

Authors and Affiliations

Department of Industrial Engineering, Tel-Aviv University, Ramat-Aviv, Tel-Aviv, 69978, Israel
Irad Ben-Gal

Editor information

Editors and Affiliations

Dept. of Industrial Engineering, Tel-Aviv University, 69978, Ramat-Aviv, Israel
Oded Maimon & Lior Rokach

About this chapter

Cite this chapter

Ben-Gal, I. (2005). Outlier Detection. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/0-387-25465-X_7

DOI: https://doi.org/10.1007/0-387-25465-X_7
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-24435-8
Online ISBN: 978-0-387-25465-4
eBook Packages: Computer Science (R0)
