
Fast Outlier Detection in High Dimensional Spaces

  • Fabrizio Angiulli
  • Clara Pizzuti
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2431)

Abstract

In this paper we propose a new definition of distance-based outlier that considers, for each point, the sum of the distances from its k nearest neighbors, called its weight. Outliers are those points having the largest weight values. In order to compute these weights, we find the k nearest neighbors of each point in a fast and efficient way by linearizing the search space through the Hilbert space-filling curve. The algorithm consists of two phases: the first provides an approximate solution, within a small factor, after executing at most d + 1 scans of the data set with a low time complexity cost, where d is the number of dimensions of the data set. During each scan the number of points that are candidates to belong to the solution set is considerably reduced. The second phase returns the exact solution by performing a single scan that examines a small fraction of the data set further. Experimental results show that the algorithm always finds the exact solution during the first phase after d* ≪ d + 1 steps and that it scales linearly both in the dimensionality and the size of the data set.
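The weight-based definition above can be illustrated without the Hilbert-curve machinery. The following minimal Python sketch (not the paper's algorithm, which avoids the quadratic cost by linearizing the search space along the Hilbert curve) computes each point's weight as the sum of the distances to its k nearest neighbors and reports the n points with the largest weights; the function and parameter names are ours, chosen for illustration.

```python
import numpy as np

def weight_outliers(X, k, n):
    """Brute-force illustration of the weight-based outlier definition:
    the weight of a point is the sum of the distances to its k nearest
    neighbors, and the n points with the largest weights are the outliers.
    This costs O(N^2) distance computations; the paper's algorithm instead
    orders the points along the Hilbert space-filling curve to find
    approximate nearest neighbors in a small number of data-set scans."""
    # Pairwise Euclidean distances (N x N).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Exclude each point itself, then sum its k smallest distances.
    np.fill_diagonal(dist, np.inf)
    knn_dist = np.sort(dist, axis=1)[:, :k]
    weights = knn_dist.sum(axis=1)
    # Indices of the n points with the largest weight.
    return np.argsort(weights)[::-1][:n], weights

# Usage: three clustered points plus one isolated point in 2D.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
top, w = weight_outliers(X, k=2, n=1)
print(top, w)  # the isolated point has the largest weight
```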

Keywords

Point Feature · Outlier Detection · High Dimensional Space · Local Outlier · Hilbert Curve


Copyright information

© Springer-Verlag Berlin Heidelberg 2002

Authors and Affiliations

  • Fabrizio Angiulli (1)
  • Clara Pizzuti (1)
  1. ISI-CNR, c/o DEIS, Università della Calabria, Rende, Italy
