Abstract
Outlier or anomaly detection is a fundamental data mining task that aims to identify data points, events, or transactions that deviate from the norm. Identifying outliers can provide insight into the underlying data-generating process. In general, outliers are of two kinds: global and local. A global outlier is distinct with respect to the whole data set, while a local outlier is distinct with respect to the points in its local neighbourhood. While several approaches have been proposed to scale up the discovery of global outliers in large databases, this has not been the case for local outliers. We tackle this problem by optimising the use of the local outlier factor (LOF) for large, high-dimensional data. We propose projection-indexed nearest-neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to accurately approximate k-nearest-neighbour distances, which serve as the core density measurement within LOF. The reduced dimensionality allows indexing that is sub-quadratic in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of random projection (RP) and PINN shows that the density of the data set's intrinsic manifold is preserved after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF on many high-dimensional real-world data sets of up to 300,000 elements and 102,600 dimensions. A further investigation into high-dimensionality-specific indexing, such as the spatial approximate sample hierarchy (SASH), shows that our novel technique holds benefits even over these highly efficient forms of indexing.
We cement the practical applications of our novel technique with insights into what it means to find local outliers in real data, including image and text data, and discuss potential applications of this knowledge.
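The PINN idea described above — project the data down with a random projection, gather an enlarged candidate neighbour set in the projected space, then re-rank candidates by their original-space distances to approximate the k-NN table that LOF needs — can be illustrated with a minimal sketch. This is not the paper's implementation: the Gaussian projection, the brute-force candidate search, the candidate multiplier `c`, and all function names are illustrative assumptions; the paper's actual construction and indexing differ in detail.

```python
import numpy as np

def random_projection(X, t, rng):
    """Project n x d data X down to t dimensions with a Gaussian matrix.

    The 1/sqrt(t) scaling preserves squared distances in expectation
    (in the Johnson-Lindenstrauss style)."""
    d = X.shape[1]
    R = rng.standard_normal((d, t)) / np.sqrt(t)
    return X @ R

def pinn_knn(X, Y, k, c):
    """Approximate k-NN: gather c*k candidates in the projected space Y,
    then re-rank those candidates by their exact distance in X."""
    n = X.shape[0]
    idx = np.empty((n, k), dtype=int)
    dist = np.empty((n, k))
    for i in range(n):
        dy = np.linalg.norm(Y - Y[i], axis=1)
        dy[i] = np.inf                          # exclude the point itself
        cand = np.argpartition(dy, c * k)[:c * k]
        dx = np.linalg.norm(X[cand] - X[i], axis=1)
        order = np.argsort(dx)[:k]
        idx[i], dist[i] = cand[order], dx[order]
    return idx, dist

def lof_scores(idx, dist):
    """LOF in the style of Breunig et al. (2000), from a k-NN table."""
    kdist = dist[:, -1]                      # k-distance of each point
    reach = np.maximum(kdist[idx], dist)     # reachability distances
    lrd = 1.0 / reach.mean(axis=1)           # local reachability density
    return lrd[idx].mean(axis=1) / lrd       # neighbours' lrd vs. own
```

In this sketch the candidate multiplier `c` trades accuracy for speed: larger candidate sets in the projected space make it more likely that the true k nearest neighbours survive the projection, at the cost of more exact-distance computations. In a real deployment the linear scan over `Y` would be replaced by an index built in the low-dimensional space, which is where the sub-quadratic behaviour comes from.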
References
Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. SIGMOD Rec 29(2): 93–104
Golub GH, Van Loan CF (1996) Matrix computations. 3rd edn. Johns Hopkins University Press, Baltimore, MD, USA
Knorr EM, Ng RT, Tucakov V (2000) Distance-based outliers: algorithms and applications. VLDB J 8(3–4): 237–253
Geusebroek J-M, Burghouts GJ, Smeulders AWM (2005) The Amsterdam library of object images. Int J Comput Vis 61(1): 103–112
Bay SD, Schwabacher M (2003) Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: KDD ’03: proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 29–38
Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces? In: VLDB ’00: proceedings of the 26th international conference on very large data bases. Morgan Kaufmann Publishers Inc, San Francisco, pp 506–515
Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500): 2319–2323
Papadimitriou S, Kitagawa H, Gibbons PB, Faloutsos C (2003) LOCI: fast outlier detection using the local correlation integral. In: ICDE
Jin W, Tung AKH, Han J (2001) Mining top-n local outliers in large databases. In: KDD ’01: proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 293–298
Kriegel H-P, Kröger P, Schubert E, Zimek A (2009) Outlier detection in axis-parallel subspaces of high dimensional data. In: Proceedings of the 13th Pacific-Asia conference on knowledge discovery and data mining (PAKDD), Bangkok, Thailand
Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: SIGMOD ’01: proceedings of the 2001 ACM SIGMOD international conference on management of data. ACM, New York, pp 37–46
Zhang J, Wang H (2006) Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance. Knowl Inform Syst 10: 333–355
Chawla S, Sun P (2006) SLOM: a new measure for local spatial outliers. Knowl Inform Syst 9(4): 412–429
Agarwal D (2007) Detecting anomalies in cross-classified streams: a Bayesian approach. Knowl Inform Syst 11(1): 29–44
Yu JX, Qian W, Lu H, Zhou A (2006) Finding centric local outliers in categorical/numerical spaces. Knowl Inform Syst 9: 309–338
Tang J, Chen Z, Fu AW, Cheung DW (2006) Capabilities of outlier detection schemes in large datasets, framework and methodologies. Knowl Inform Syst 11: 45–84
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC ’98: proceedings of the thirtieth annual ACM symposium on theory of computing. ACM, New York, pp 604–613
Houle ME, Sakuma J (2005) Fast approximate similarity search in extremely high-dimensional data sets. In: ICDE, pp 619–630
Sharma A, Paliwal KK (2007) Fast principal component analysis using fixed-point algorithm. Pattern Recogn Lett 28(10): 1151–1155
Deegalla S, Boström H (2006) Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification. In: ICMLA ’06: proceedings of the 5th international conference on machine learning and applications. IEEE Computer Society, Washington, DC, pp 245–250
Johnson WB, Lindenstrauss J (1982) Extensions of Lipschitz mappings into a Hilbert space. In: Conference in modern analysis and probability (New Haven, Conn.). Amer. Math. Soc., pp 189–206
Dasgupta S, Gupta A (1999) An elementary proof of the Johnson-Lindenstrauss lemma. International Computer Science Institute, Berkeley, CA, Technical Report TR-99-006
Achlioptas D (2001) Database-friendly random projections. In: 20th ACM symposium on principles of database systems. ACM, pp 274–281
Karger DR, Ruhl M (2002) Finding nearest neighbors in growth-restricted metrics. In: STOC ’02: proceedings of the thirty-fourth annual ACM symposium on theory of computing. ACM, New York, pp 741–750
Yap CK (1988) A geometric consistency theorem for a symbolic perturbation scheme. In: SCG ’88: proceedings of the fourth annual symposium on computational geometry. ACM, New York, pp 134–142
Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
de Vries, T., Chawla, S. & Houle, M.E. Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst 32, 25–52 (2012). https://doi.org/10.1007/s10115-011-0430-4