Abstract

Outlier detection is a primary step in many data-mining applications. We present several methods for outlier detection, distinguishing between univariate and multivariate techniques and between parametric and nonparametric procedures. In the presence of outliers, special care should be taken to ensure the robustness of the estimators used. Outlier detection for data mining is often based on distance measures, clustering, and spatial methods.
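The point about robust estimators can be illustrated with a minimal univariate sketch. It is not taken from the chapter: the function names, the sample data, and the thresholds (3.0 for the classical z-score, 3.5 for the median/MAD score with the usual 0.6745 scaling constant) are assumptions chosen for illustration. It compares a rule built on the mean and standard deviation with one built on the median and the median absolute deviation (MAD).

```python
import statistics


def classical_outliers(values, threshold=3.0):
    """Flag values whose z-score (mean and standard deviation) exceeds threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []  # no spread, so nothing can be flagged
    return [x for x in values if abs(x - mean) / sd > threshold]


def robust_outliers(values, threshold=3.5):
    """Flag values using the median and the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    if mad == 0:
        return []  # degenerate scale; this sketch declines to flag anything
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the bulk of the data is roughly normal.
    return [x for x in values if 0.6745 * abs(x - med) / mad > threshold]


# Hypothetical sample: seven readings near 10 and two gross outliers.
sample = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 50.0, 55.0]
classical_flags = classical_outliers(sample)
robust_flags = robust_outliers(sample)
print("classical:", classical_flags)
print("robust:   ", robust_flags)
```

On this sample the two extreme values inflate the mean and standard deviation so much that the classical rule flags nothing, while the median/MAD rule flags both. This masking effect is one reason robust estimators matter when outliers are present.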

Copyright information

© 2005 Springer Science+Business Media, Inc.

About this chapter

Cite this chapter

Ben-Gal, I. (2005). Outlier Detection. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/0-387-25465-X_7

  • DOI: https://doi.org/10.1007/0-387-25465-X_7

  • Publisher Name: Springer, Boston, MA

  • Print ISBN: 978-0-387-24435-8

  • Online ISBN: 978-0-387-25465-4

  • eBook Packages: Computer Science, Computer Science (R0)
