Abstract
Defining outliers by their distance to neighboring data points has been shown to be an effective non-parametric approach to outlier detection. In recent years, many research efforts have looked at developing fast distance-based outlier detection algorithms. Several of the existing distance-based outlier detection algorithms report log-linear time performance as a function of the number of data points on many real low-dimensional datasets. However, these algorithms are unable to deliver the same level of performance on high-dimensional datasets, since their scaling behavior is exponential in the number of dimensions. In this paper, we present RBRP, a fast algorithm for mining distance-based outliers, particularly targeted at high-dimensional datasets. RBRP scales log-linearly as a function of the number of data points and linearly as a function of the number of dimensions. Our empirical evaluation demonstrates that we outperform the state-of-the-art algorithm, often by an order of magnitude.
Similar content being viewed by others
References
Angiulli F, Pizzuti C (2002) Fast outlier detection in high dimensional spaces. In: Proceedings of the international conference on principles of data mining and knowledge discovery
Arya S, Mount DM, Netanyahu NS, Silverman R and Wu AY (1998). An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J ACM 45(6): 891–923
Barnett V, Lewis T (1994) Outliers in statistical data. John Wiley and Sons
Bay S (1999). The UCI KDD archive. University of California, Department of Information and Computer Science, Irvine, CA
Bay S, Schwabacher M (2003) Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of the international conference on knowledge discovery and data mining
Bentley J (1975). Multidimensional binary search trees used for associative searching. Commun ACM 18: 509–517
Berchtold S, Keim D, Kreigel H (1996) The X-tree: an index structure for high dimensional data. In: Proceedings of the international conference on very large data bases (VLDB)
Bolton R and Hand D (2002). Statistical fraud detection: a review. Stat Sci 17: 235–255
Gamberger D, Lavrac N, Groselj C (1999) Experiments with noise filtering in the medical domain. In: Proceedings of the international conference on machine learning
Ghoting A, Parthasarathy S, Otey M (2005) Fast mining of distance-based outliers in high dimensional datasets. Technical report TR71, Department of Computer Science and Engineering, The Ohio State University
Guttmann R (1984) A dynamic index structure for spatial searching. In: Proceedings of the international conference on management of data (SIGMOD)
Hartigan J (1975) Clustering algorithms. John Wiley and Sons
Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the symposium on theory of computing (STOC), pp 604–613
Jagadish H, Poosala V, Koudas N, Sevcik K, Muthukrishnan S, Suel T (1998) Optimal histograms with guarantees. In: Proceedings of the international conference on very large databases (VLDB)
Jolliffe I (1986) Principal component analysis. Springer-Verlag
Kleinberg JM (1997) Two algorithms for nearest-neighbor search in high dimensions. In: Proceedings of the symposium on theory of computing (STOC), pp 599–608
Knorr E, Ng R (1999) Finding intensional knowledge of distance-based outliers. In: Proceedings of the international conference on very large data bases (VLDB)
Kushilevitz E, Ostrovsky R, Rabani Y (1998) Efficient search for approximate nearest neighbor in high dimensional spaces. In: Proceedings of the symposium on theory of computing (STOC)
Muralikrishna M, DeWitt D (1988) Equi-depth histograms for estimating selectivity factors for multidimensional queries. In: Proceedings of the international conference on management of data (SIGMOD)
Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large datasets. In: Proceedings of the international conference on management of data
Ruggles S, Sobek M (1997) Integrated public use microdata series: version 2.0
Sequeira K, Zaki M (2002) ADMIT: anomaly-based data mining for intrusions. In: Proceedings of the international conference on knowledge discovery and data mining
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the international conference on management of data (SIGMOD)
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Thorsten Joachims.
Rights and permissions
About this article
Cite this article
Ghoting, A., Parthasarathy, S. & Otey, M.E. Fast mining of distance-based outliers in high-dimensional datasets. Data Min Knowl Disc 16, 349–364 (2008). https://doi.org/10.1007/s10618-008-0093-2
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10618-008-0093-2