Knowledge and Information Systems, Volume 32, Issue 1, pp 25–52

Density-preserving projections for large-scale local anomaly detection

  • Timothy de Vries
  • Sanjay Chawla
  • Michael E. Houle
Regular Paper

Abstract

Outlier or anomaly detection is a fundamental data mining task whose aim is to identify data points, events, or transactions that deviate from the norm. The identification of outliers in data can provide insights into the underlying data-generating process. In general, outliers can be of two kinds: global and local. Global outliers are distinct with respect to the whole data set, while local outliers are distinct with respect to data points in their local neighbourhood. While several approaches have been proposed to scale up the process of global outlier discovery in large databases, this has not been the case for local outliers. We tackle this problem by optimising the use of the local outlier factor (LOF) for large and high-dimensional data. We propose projection-indexed nearest-neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to create an accurate approximation of k-nearest-neighbour distances, which is used as the core density measurement within LOF. The reduced dimensionality allows for indexing that is sub-quadratic in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of random projection (RP) and PINN shows that the density of the intrinsic manifold of the data set is preserved after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF on many high-dimensional real-world data sets of up to 300,000 elements and 102,600 dimensions. A further investigation into high-dimensionality-specific indexing such as the spatial approximate sample hierarchy (SASH) shows that our technique offers benefits even over these highly efficient indexing structures. We ground the practical applications of the technique with insights into what it means to find local outliers in real data, including image and text data, and discuss potential applications of this knowledge.
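As a rough illustration of the idea described above (not the authors' implementation), the following Python sketch shows how a PINN-style pipeline might be assembled: the data are randomly projected to a lower dimension, an extended candidate neighbour set is gathered for each point in the projected space, the candidates are re-ranked by original-space distance to approximate the k-nearest neighbours, and standard LOF scores are computed from those approximate neighbour sets. The function name pinn_lof, the brute-force candidate search, and the parameters target_dim and expansion are illustrative assumptions; the paper's method would use a proper spatial index on the projected data rather than an all-pairs distance matrix.

```python
import numpy as np

def pinn_lof(X, k=10, target_dim=20, expansion=3, seed=0):
    """Sketch of a PINN-style approximate k-NN search feeding LOF.

    Assumptions: a brute-force candidate search in the projected space
    stands in for a real index; parameter names are illustrative only.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # 1. Gaussian random projection (Johnson-Lindenstrauss style).
    R = rng.normal(size=(d, target_dim)) / np.sqrt(target_dim)
    Y = X @ R

    # 2. Extended candidate neighbour sets from the projected space.
    m = min(expansion * k, n - 1)
    proj_d2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(proj_d2, np.inf)
    candidates = np.argsort(proj_d2, axis=1)[:, :m]

    # 3. Re-rank candidates by exact distance in the original space.
    knn_idx = np.empty((n, k), dtype=int)
    knn_dist = np.empty((n, k))
    for i in range(n):
        cand = candidates[i]
        exact = np.linalg.norm(X[cand] - X[i], axis=1)
        order = np.argsort(exact)[:k]
        knn_idx[i] = cand[order]
        knn_dist[i] = exact[order]

    # 4. Standard LOF computed from the approximate neighbour sets.
    k_dist = knn_dist[:, -1]                       # k-distance of each point
    reach = np.maximum(k_dist[knn_idx], knn_dist)  # reachability distances
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)       # local reachability density
    lof = lrd[knn_idx].mean(axis=1) / lrd          # LOF score per point
    return lof

# Toy usage: one point displaced from a dense cluster should score highest.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 1, size=(200, 50)),
                   rng.normal(8, 1, size=(1, 50))])
    scores = pinn_lof(X, k=10, target_dim=15)
    print("highest-scoring index:", int(np.argmax(scores)))
```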

Keywords

Anomaly detection · Dimensionality reduction


References

  1. Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. SIGMOD Rec 29(2):93–104
  2. Golub GH, Van Loan CF (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore, MD, USA
  3. Knorr EM, Ng RT, Tucakov V (2000) Distance-based outliers: algorithms and applications. VLDB J 8(3–4):237–253
  4. Geusebroek J-M, Burghouts GJ, Smeulders AWM (2005) The Amsterdam library of object images. Int J Comput Vis 61(1):103–112
  5. Bay SD, Schwabacher M (2003) Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: KDD '03: proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 29–38
  6. Hinneburg A, Aggarwal CC, Keim DA (2000) What is the nearest neighbor in high dimensional spaces? In: VLDB '00: proceedings of the 26th international conference on very large data bases. Morgan Kaufmann Publishers Inc, San Francisco, pp 506–515
  7. Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319–2323
  8. Papadimitriou S, Kitagawa H, Gibbons PB, Faloutsos C (2003) LOCI: fast outlier detection using the local correlation integral. In: ICDE
  9. Jin W, Tung AKH, Han J (2001) Mining top-n local outliers in large databases. In: KDD '01: proceedings of the seventh ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, pp 293–298
  10. Kriegel H-P, Kröger P, Schubert E, Zimek A (2009) Outlier detection in axis-parallel subspaces of high dimensional data. In: Proceedings of the 13th Pacific-Asia conference on knowledge discovery and data mining (PAKDD), Bangkok, Thailand
  11. Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: SIGMOD '01: proceedings of the 2001 ACM SIGMOD international conference on management of data. ACM, New York, pp 37–46
  12. Zhang J, Wang H (2006) Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance. Knowl Inform Syst 10:333–355
  13. Chawla S, Sun P (2006) SLOM: a new measure for local spatial outliers. Knowl Inform Syst 9(4):412–429
  14. Agarwal D (2007) Detecting anomalies in cross-classified streams: a Bayesian approach. Knowl Inform Syst 11(1):29–44
  15. Yu JX, Qian W, Lu H, Zhou A (2006) Finding centric local outliers in categorical/numerical spaces. Knowl Inform Syst 9:309–338
  16. Tang J, Chen Z, Fu AW, Cheung DW (2006) Capabilities of outlier detection schemes in large datasets, framework and methodologies. Knowl Inform Syst 11:45–84
  17. Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC '98: proceedings of the thirtieth annual ACM symposium on theory of computing. ACM, New York, pp 604–613
  18. Houle ME, Sakuma J (2005) Fast approximate similarity search in extremely high-dimensional data sets. In: ICDE, pp 619–630
  19. Sharma A, Paliwal KK (2007) Fast principal component analysis using fixed-point algorithm. Pattern Recogn Lett 28(10):1151–1155
  20. Deegalla S, Bostrom H (2006) Reducing high-dimensional data by principal component analysis vs. random projection for nearest neighbor classification. In: ICMLA '06: proceedings of the 5th international conference on machine learning and applications. IEEE Computer Society, Washington, DC, pp 245–250
  21. Johnson WB, Lindenstrauss J (1982) Extensions of Lipschitz mappings into a Hilbert space. In: Conference in modern analysis and probability (New Haven, Conn.). Amer. Math. Soc., pp 189–206
  22. Dasgupta S, Gupta A (1999) An elementary proof of the Johnson-Lindenstrauss lemma. International Computer Science Institute, Berkeley, CA, Technical Report TR-99-006
  23. Achlioptas D (2001) Database-friendly random projections. In: 20th ACM symposium on principles of database systems. ACM, pp 274–281
  24. Karger DR, Ruhl M (2002) Finding nearest neighbors in growth-restricted metrics. In: STOC '02: proceedings of the thirty-fourth annual ACM symposium on theory of computing. ACM, New York, pp 741–750
  25. Yap CK (1988) A geometric consistency theorem for a symbolic perturbation scheme. In: SCG '88: proceedings of the fourth annual symposium on computational geometry. ACM, New York, pp 134–142
  26. Asuncion A, Newman D (2007) UCI machine learning repository. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html

Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  • Timothy de Vries (1)
  • Sanjay Chawla (1)
  • Michael E. Houle (2)
  1. School of Information Technologies, University of Sydney, Sydney, Australia
  2. National Institute of Informatics, Tokyo, Japan
