Fast mining of distance-based outliers in high-dimensional datasets

Ghoting, Amol; Parthasarathy, Srinivasan; Otey, Matthew Eric

doi:10.1007/s10618-008-0093-2

Published: 04 March 2008

Volume 16, pages 349–364 (2008)

Data Mining and Knowledge Discovery

Abstract

Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art algorithm, often by an order of magnitude.
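To make the notion of a distance-based outlier concrete, the following is a minimal sketch of a brute-force baseline. It is not RBRP, whose details are not given in the abstract. It assumes one common definition, in which a point's outlier score is the Euclidean distance to its k-th nearest neighbor and the n highest-scoring points are reported. The function names `kth_nn_distance` and `top_outliers` and the parameters `k` and `n` are hypothetical, chosen only for illustration.

```python
import heapq
import math


def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (brute force)."""
    # Assumption: Euclidean distance; the abstract does not fix a metric.
    dists = [math.dist(points[i], q) for j, q in enumerate(points) if j != i]
    return heapq.nsmallest(k, dists)[-1]


def top_outliers(points, k, n):
    """Return the n (index, score) pairs with the largest k-th NN distance."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")
    # Quadratic in the number of points: every point is compared to all others.
    scores = [(i, kth_nn_distance(points, i, k)) for i in range(len(points))]
    return heapq.nlargest(n, scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Four tightly grouped points and one point far away from them.
    data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
    print(top_outliers(data, k=2, n=1))
```

This baseline performs a number of distance computations that is quadratic in the number of data points, which is the cost that fast distance-based algorithms such as the one presented in the paper aim to avoid.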

Author information

Authors and Affiliations

IBM T. J. Watson Research Center, 1101 Kitchawan Road, Yorktown Heights, NY, 10598, USA
Amol Ghoting
The Ohio State University, Columbus, OH, USA
Srinivasan Parthasarathy
Google, Inc., Pittsburgh, PA, USA
Matthew Eric Otey

Corresponding author

Correspondence to Amol Ghoting.

Additional information

Responsible editor: Thorsten Joachims.

About this article

Cite this article

Ghoting, A., Parthasarathy, S. & Otey, M.E. Fast mining of distance-based outliers in high-dimensional datasets. Data Min Knowl Disc 16, 349–364 (2008). https://doi.org/10.1007/s10618-008-0093-2

Received: 29 January 2007
Accepted: 14 February 2008
Published: 04 March 2008
Issue Date: June 2008
DOI: https://doi.org/10.1007/s10618-008-0093-2
