The VLDB Journal

, Volume 14, Issue 2, pp 211–221 | Cite as

An effective and efficient algorithm for high-dimensional outlier detection

Abstract.

The outlier detection problem has important applications in the field of fraud detection, network robustness analysis, and intrusion detection. Most such applications are most important for high-dimensional domains in which the data can contain hundreds of dimensions. Many recent algorithms have been proposed for outlier detection that use several concepts of proximity in order to find the outliers based on their relationship to the other points in the data. However, in high-dimensional space, the data are sparse and concepts using the notion of proximity fail to retain their effectiveness. In fact, the sparsity of high-dimensional data can be understood in a different way so as to imply that every point is an equally good outlier from the perspective of distance-based definitions. Consequently, for high-dimensional data, the notion of finding meaningful outliers becomes substantially more complex and nonobvious. In this paper, we discuss new techniques for outlier detection that find the outliers by studying the behavior of projections from the data set.

Keywords:

Data mining High-dimensional spaces Outlier detection 

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. 1.
    Aggarwal CC (2001) Re-designing distance functions and distance based applications for high dimensional data. ACM SIGMOD Rec 30(1):13-18Google Scholar
  2. 2.
    Aggarwal CC(1999) Fast algorithms for projected clustering. In: Proceedings of ACM SIGMOD, pp 61-72Google Scholar
  3. 3.
    Aggarwal CC, Yu P (2000) Finding generalized projected clusters in high dimensional spaces. In: Proceedings of ACM SIGMOD, pp 70-81Google Scholar
  4. 4.
    Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional space. In: Proceedings of ICDT, pp 420-434Google Scholar
  5. 5.
    Aggarwal CC, Orlin JB, Tai RP (1997) Optimized crossover for the independent set problem. Operat Res 45(2):226-234MathSciNetMATHGoogle Scholar
  6. 6.
    Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of ACM SIGMOD, pp 94-105Google Scholar
  7. 7.
    Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of ACM SIGMOD, pp 207-216Google Scholar
  8. 8.
    Arning A, Agrawal R, Raghavan P (1996) A linear method for deviation detection in large databases. In: Proceedings of KDD, pp 164-169Google Scholar
  9. 9.
    Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is nearest neighbors meaningful? In: Proceedings of ICDT, pp 217-235Google Scholar
  10. 10.
    Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of ACM SIGMOD, pp 93-104Google Scholar
  11. 11.
    Chakrabarti K, Mehrotra S (2000) Local dimensionality reduction: a new approach to indexing high dimensional spaces. In: Proceedings of the VLDB conference, pp 89-104Google Scholar
  12. 12.
    Darwin C (1859) The origin of species by natural selection. Available at: http://www.literature.org/authors/darwin-charles/the-origin-of-species/Google Scholar
  13. 13.
    De Jong KA (1975) Analysis of the behaviour of a class of genetic adaptive systems. PhD dissertation, University of Michigan, Ann Arbor, MIGoogle Scholar
  14. 14.
    Ester M, Kriegel H-P, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, pp 226-231Google Scholar
  15. 15.
    Goldberg DE (1989) Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading, MAGoogle Scholar
  16. 16.
    Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Proceedings of ACM SIGMOD, pp 73-84Google Scholar
  17. 17.
    Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces? In: Proceedings of the VLDB conference, pp 506-515Google Scholar
  18. 18.
    Holland JH (1975) Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor, MIGoogle Scholar
  19. 19.
    Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Science 220(4589):671-680MathSciNetGoogle Scholar
  20. 20.
    Knorr E, Ng R (1998) Algorithms for mining distance-based outliers in large data sets. In: Proceedings of the VLDB conference, pp 392-403Google Scholar
  21. 21.
    Knorr E, Ng R (1999) Finding intensional knowledge of distance-based outliers. In: Proceedings of the VLDB conference, pp 211-222Google Scholar
  22. 22.
    Parthasarathy, S Aggarwal CC (2003) On the use of conceptual reconstruction for mining massively incomplete data sets. IEEE Trans Knowl Data Eng 15(6):1512-1531CrossRefGoogle Scholar
  23. 23.
    Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Proceedings of ACM SIGMOD, pp 427-438Google Scholar
  24. 24.
    Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of ACM SIGMOD, pp 103-114Google Scholar

Copyright information

© Springer-Verlag Berlin/Heidelberg 2005

Authors and Affiliations

  1. 1.IBM T.J. Watson Research CenterHawthorneUSA

Personalised recommendations