Abstract

Outlier detection is a primary step in many data-mining applications. We present several methods for outlier detection, distinguishing between univariate and multivariate techniques and between parametric and nonparametric procedures. In the presence of outliers, special care should be taken to ensure the robustness of the estimators used. Outlier detection for data mining is often based on distance measures, clustering, and spatial methods.

Keywords

Outliers, Distance measures, Statistical Process Control, Spatial data

Copyright information

© Springer Science+Business Media, Inc. 2005

Authors and Affiliations

  • Irad Ben-Gal (1)
  1. Department of Industrial Engineering, Tel-Aviv University, Ramat-Aviv, Tel-Aviv, Israel
