Skip to main content
Log in

Fast mining of distance-based outliers in high-dimensional datasets

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art algorithm, often by an order of magnitude.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  • Angiulli F, Pizzuti C (2002) Fast outlier detection in high dimensional spaces. In: Proceedings of the international conference on principles of data mining and knowledge discovery

  • Arya S, Mount DM, Netanyahu NS, Silverman R and Wu AY (1998). An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J ACM 45(6): 891–923

    Article  MATH  MathSciNet  Google Scholar 

  • Barnett V, Lewis T (1994) Outliers in statistical data. John Wiley and Sons

  • Bay S (1999). The UCI KDD archive. University of California, Department of Information and Computer Science, Irvine, CA

    Google Scholar 

  • Bay S, Schwabacher M (2003) Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of the international conference on knowledge discovery and data mining

  • Bentley J (1975). Multidimensional binary search trees used for associative searching. Commun ACM 18: 509–517

    Article  MATH  Google Scholar 

  • Berchtold S, Keim D, Kreigel H (1996) The X-tree: an index structure for high dimensional data. In: Proceedings of the international conference on very large data bases (VLDB)

  • Bolton R and Hand D (2002). Statistical fraud detection: a review. Stat Sci 17: 235–255

    Article  MATH  MathSciNet  Google Scholar 

  • Gamberger D, Lavrac N, Groselj C (1999) Experiments with noise filtering in the medical domain. In: Proceedings of the international conference on machine learning

  • Ghoting A, Parthasarathy S, Otey M (2005) Fast mining of distance-based outliers in high dimensional datasets. Technical report TR71, Department of Computer Science and Engineering, The Ohio State University

  • Guttmann R (1984) A dynamic index structure for spatial searching. In: Proceedings of the international conference on management of data (SIGMOD)

  • Hartigan J (1975) Clustering algorithms. John Wiley and Sons

  • Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the symposium on theory of computing (STOC), pp 604–613

  • Jagadish H, Poosala V, Koudas N, Sevcik K, Muthukrishnan S, Suel T (1998) Optimal histograms with guarantees. In: Proceedings of the international conference on very large databases (VLDB)

  • Jolliffe I (1986) Principal component analysis. Springer-Verlag

  • Kleinberg JM (1997) Two algorithms for nearest-neighbor search in high dimensions. In: Proceedings of the symposium on theory of computing (STOC), pp 599–608

  • Knorr E, Ng R (1999) Finding intensional knowledge of distance-based outliers. In: Proceedings of the international conference on very large data bases (VLDB)

  • Kushilevitz E, Ostrovsky R, Rabani Y (1998) Efficient search for approximate nearest neighbor in high dimensional spaces. In: Proceedings of the symposium on theory of computing (STOC)

  • Muralikrishna M, DeWitt D (1988) Equi-depth histograms for estimating selectivity factors for multidimensional queries. In: Proceedings of the international conference on management of data (SIGMOD)

  • Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large datasets. In: Proceedings of the international conference on management of data

  • Ruggles S, Sobek M (1997) Integrated public use microdata series: version 2.0

  • Sequeira K, Zaki M (2002) ADMIT: anomaly-based data mining for intrusions. In: Proceedings of the international conference on knowledge discovery and data mining

  • Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the international conference on management of data (SIGMOD)

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Amol Ghoting.

Additional information

Responsible editor: Thorsten Joachims.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ghoting, A., Parthasarathy, S. & Otey, M.E. Fast mining of distance-based outliers in high-dimensional datasets. Data Min Knowl Disc 16, 349–364 (2008). https://doi.org/10.1007/s10618-008-0093-2

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-008-0093-2

Keywords

Navigation