# Fast Outlier Detection in High Dimensional Spaces

## Abstract

In this paper we propose a new definition of distance-based outlier that considers for each point the sum of the distances from its *k* nearest neighbors, called weight. Outliers are those points having the largest values of weight. In order to compute these weights, we find the *k* nearest neighbors of each point in a fast and efficient way by linearizing the search space through the Hilbert space filling curve. The algorithm consists of two phases, the first provides an approximated solution, within a small factor, after executing at most *d* + 1 scans of the data set with a low time complexity cost, where *d* is the number of dimensions of the data set. During each scan the number of points candidate to belong to the solution set is sensibly reduced. The second phase returns the exact solution by doing a single scan which examines further a little fraction of the data set. Experimental results show that the algorithm always finds the exact solution during the first phase after *d*- 《 *d* + 1 steps and it scales linearly both in the dimensionality and the size of the data set.

## Keywords

Point Feature Outlier Detection High Dimensional Space Local Outlier Hilbert Curve## References

- 1.C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In
*Proc. ACM Int. Conference on Managment of Data (SIGMOD’01)*, 2001.Google Scholar - 2.F. Angiulli and C. Pizzuti. Fast outlier detection in high dimensional spaces. In
*Tech. Report*,*n*.*25*,*ISI-CNR*, 2002.Google Scholar - 3.A. Arning, C. Aggarwal, and P. Raghavan. A linear method for deviation detection in large databases. In
*Proc. Int. Conf. on Knowledge Discovery and Data Mining*, pages 164–169, 1996.Google Scholar - 4.V. Barnett and T. Lewis.
*Outliers in Statistical Data*. John Wiley & Sons, 1994.Google Scholar - 5.M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander. Lof: Identifying density-based local outliers. In
*Proc. ACM Int. Conf. on Managment of Data (SIGMOD’00)*, 2000.Google Scholar - 6.C. E. Brodley and M. Friedl. Identifying and eliminating mislabeled training instances. In
*Proc. National American Conf. on Artificial Intelligence (AAAI/IAAI 96)*, pages 799–805, 1996.Google Scholar - 7.Yu D., Sheikholeslami S., and A. Zhang. Findout: Finding outliers in very large datasets. In
*Tech. Report*,*99-03*,*Univ. of New York, Buffalo*, pages 1–19, 1999.Google Scholar - 8.C. Faloutsos and S. Roseman. Fractals for secondary key retrieval. In
*Proc. ACM Int. Conf. on Principles of Database Systems (PODS’89)*, pages 247–252, 1989.Google Scholar - 9.J. Han and M. Kamber.
*Data Mining, Concepts and Technique*. Morgan Kaufmann, San Francisco, 2001.Google Scholar - 10.H. V. Jagadish. Linear clustering of objects with multiple atributes. In
*Proc. ACM Int. Conf. on Managment of Data (SIGMOD’90)*, pages 332–342, 1990.Google Scholar - 11.H. V. Jagadish. Linear clustering of objects with multiple atributes. In
*Proc. ACM Int. Conf. on Managment of Data (SIGMOD’90)*, pages 332–342, 1990.Google Scholar - 12.E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In
*Proc. Int. conf. on Very Large Databases (VLDB98)*, pages 392–403, 1998.Google Scholar - 13.M. Lopez and S. Liao. Finding
*k*-closest-pairs efficiently for high dimensional data. In*Proc. 12th Canadian Conf. on Computational Geometry (CCCG)*, pages 197–204, 2000.Google Scholar - 14.S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In
*Proc. ACM Int. Conf. on Managment of Data (SIGMOD’00)*, pages 427–438, 2000.Google Scholar - 15.Hans Sagan.
*Space Filling Curves*. Springer-Verlag, 1994.Google Scholar - 16.S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of olap data cubes. In
*Proc. Sixth Int. Conf on ExtendingD atabase Thecnology (EDBT)*, Valencia, Spain, March 1998.Google Scholar - 17.J. Shepherd, X. Zhu, and N. Megiddo. A fast indexing method for multidimensional nearest neighbor search. In
*Proc. SPIE Conf. on Storage and Retrieval for image and video databases VII*, pages 350–355, 1999.Google Scholar - 18.Z. R. Struzik and A. Siebes. Outliers detection and localisation with wavelet based multifractal formalism. In
*Tech. Report*,*CWI, Amsterdam*,*INS-R0008*, 2000.Google Scholar - 19.K. Yamanishi and J. Takeuchi. Discovering outlier filtering rules from unlabeled data. In
*Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining*, pages 389–394, 2001.Google Scholar - 20.K. Yamanishi, J. Takeuchi, G. Williams, and P. Milne. On-line unsupervised learning outlier detection using finite mixtures with discounting learning algorithms. In
*Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining*, pages 250–254, 2000.Google Scholar