Knowledge and Information Systems, Volume 9, Issue 3, pp 309–338

Finding centric local outliers in categorical/numerical spaces

  • Jeffrey Xu Yu
  • Weining Qian
  • Hongjun Lu
  • Aoying Zhou
Regular Paper

Abstract

Outlier detection techniques are widely used in applications such as credit-card fraud detection and monitoring criminal activities in electronic commerce. These applications attempt to identify outliers as noise, exceptions, or objects around the border. Existing density-based local outlier detection assigns each object a degree to which it is an outlier in a numerical space. In this paper, we propose a novel mutual-reinforcement-based local outlier detection approach. Instead of detecting local outliers as noise, we attempt to identify local outliers in the center, where they are similar to some clusters of objects on the one hand and unique on the other. For example, our technique can be used in bank investment to identify a unique body, similar to many good competitors, in which to invest. We attempt to detect local outliers in categorical and ordinal as well as numerical data. In categorical data, the challenge is that there are many similar but different ways to specify relationships among the data items. Our mutual-reinforcement-based approach is stable under similar but different user-defined relationships. Our technique can reduce the burden on users of determining the relationships among data items, and can help explain why the outliers are found. We conducted extensive experimental studies using real datasets.
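
The abstract does not spell out the algorithm, so the following is only a generic illustration of what "mutual reinforcement" between objects and categorical attribute values can look like, in the spirit of HITS-style iterations. It is not the authors' method and does not reproduce their notion of centric local outliers. The function name `mutual_reinforcement`, the scoring rule, and the toy data are all assumptions made for this sketch.

```python
from collections import defaultdict
from math import sqrt


def mutual_reinforcement(records, iterations=50):
    """Hypothetical HITS-style iteration on a bipartite object/value graph.

    records: non-empty list of tuples of categorical values, one per object.
    Returns (object_scores, value_scores), each L2-normalised.
    This is an illustrative assumption, not the algorithm from the paper.
    """
    # Use (attribute index, value) pairs as nodes so that equal strings
    # appearing in different columns are kept distinct.
    items = [[(j, v) for j, v in enumerate(rec)] for rec in records]
    obj_score = [1.0] * len(records)
    val_score = defaultdict(float)

    for _ in range(iterations):
        # Step 1: each value is reinforced by the objects that contain it.
        val_score = defaultdict(float)
        for i, its in enumerate(items):
            for it in its:
                val_score[it] += obj_score[i]
        norm = sqrt(sum(s * s for s in val_score.values()))
        for it in val_score:
            val_score[it] /= norm

        # Step 2: each object is reinforced by the values it contains.
        obj_score = [sum(val_score[it] for it in its) for its in items]
        norm = sqrt(sum(s * s for s in obj_score))
        obj_score = [s / norm for s in obj_score]

    return obj_score, dict(val_score)


if __name__ == "__main__":
    # Toy data (made up): three similar objects and one that shares nothing.
    data = [
        ("red", "small"),
        ("red", "small"),
        ("red", "small"),
        ("blue", "large"),
    ]
    scores, _ = mutual_reinforcement(data)
    # Report the index of the object with the weakest reinforcement.
    print(min(range(len(data)), key=lambda i: scores[i]))
```

In this toy setting, an object whose values are shared by few other objects receives little reinforcement and ends up with a low score. The approach described in the abstract differs in that it looks for objects that are both similar to some clusters and unique, which this sketch does not attempt.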

Keywords: Data mining; Clustering; Outlier detection

Copyright information

© Springer-Verlag London Limited 2005

Authors and Affiliations

  • Jeffrey Xu Yu (1)
  • Weining Qian (2)
  • Hongjun Lu (3)
  • Aoying Zhou (2)

  1. The Chinese University of Hong Kong, Shatin, N.T., China
  2. Fudan University, Shanghai, China
  3. The Hong Kong University of Science and Technology, Hong Kong, China
