Abstract
Today’s real-world databases typically have millions of items with many thousands of fields, resulting in data that range in size into terabytes. As a result, traditional distribution-based outlier detection techniques have more and more restricted capabilities and novel approaches that find unusual samples in a data set based on their distances to neighboring samples have become more and more popular. The problem with these k-nearest neighbor-based methods is that they are computationally expensive for large datasets. At the same time, today’s databases are often too large to fit into the main memory at once. As a result, memory capacity and, correspondingly, I/O cost, become an important issue. In this chapter, we present a simple distance-based outlier detection algorithm that can compete with existing solutions in both CPU and I/O efficiency.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
References
Hawkins, D. M. (1980). Identification of Outliers. London: Chapman and Hall.
Eskin, E., Arnold, A., Prerau, M., Portnoy, L. & Stolfo, S. (2002). A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In: Barbará, D., Jajodia, S. (eds) Applications of Data Mining in Computer Security. Advances in Information Security, vol. 6, pp. 77–101.
Lane, T. & Brodley, C.E. (1998). Temporal sequence learning and data reduction for anomaly detection. In Proceedings of the 1998 5th ACM Conference on Computer and Communications Security (CCS-5), San Francisco, CA, USA, pp. 150–158.
Bolton, R. J., & David, J. H. (2002). Unsupervised profiling methods for fraud detection. Statistical Science, 17(3), 235–255.
Wong, W., Moore, A., Cooper, G. & Wagner, M. (2002). Rule-based anomaly pattern detection for detecting disease outbreaks. In Proceedings of the 18th National Conference on Artificial Intelligence, Edmonton, Alta., Canada, pp. 217–223.
Sheng, B., Li, Q., Mao, W. & Jin, W. (2007). Outlier detection in sensor networks. In Proceedings of ACM International Symposium on Mobile Ad Hoc Networking and Computing, pp. 219–228.
Hodge, V. J., & Austin, J. (2004). A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2), 85–126.
Chandola,V., Banerjee, A. & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3): 15.1–15.58.
Knorr, E. M. & Ng, R. T. (1998). Algorithms for mining distance-based outliers in large datasets. In Proceedings of the International Conference on Very Large Data Bases (VLDB’98), New York, pp. 392–403.
Angiulli, F., & Pizzuti, C. (2005). Outlier mining in large high-dimensional data sets. IEEE Transactions on Knowledge and Data Engineering, 17(2), 203–215.
Angiulli, F., Basta, S., & Pizzuti, C. (2006). Distance-based detection and prediction of outliers. IEEE Transactions on Knowledge and Data Engineering, 18(2), 145–160.
Angiulli, F., & Fassetti, F. (2009). DOLPHIN: An efficient algorithm for mining distance-based outliers in very large datasets. ACM Transactions on Knowledge Discovery from Data, 3(1), 1–57.
Bay, S. D. & Schwabacher, M. (2003). Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘03), Washington, DC, United states, pp. 29–38.
Ghoting, A., Parthasarathy, S., & Otey, M. E. (2006). Fast mining of distance-based outliers in high-dimensional datasets. Data Mining and Knowledge Discovery, 16(3), 349–364.
Tao, Y., Xiao, X. & Al, E. (2006). Mining distance-based outliers from large databases in any metric space. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ‘06). Philadelphia, PA, United states, pp. 394–403.
Wang, X., Wang, X. L. & Wilkes, D. M. (2008). A fast distance-based outlier detection technique. In Poster Proceedings of the 8th Industrial Conference on Data Mining, Leipzig, Germany, pp. 25–44.
Knorr, E. M. & Ng, R. T. (1999). Finding intensional knowledge of distance-based outliers. In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB’99), Edinburgh, Scotland, pp. 211–222.
Knorr, E. M., Ng, R. T., & Tucakov, V. (2000). Distance-based outliers: Algorithms and applications. The VLDB Journal, 8(3–4), 237–253.
Ramaswamy, S., Rastogi, R. & Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM International Conference on Management of Data (SIGMOD’00), Dallas, pp. 427–438.
Angiulli, F. and Pizzuti, C. (2002). Fast outlier detection in high dimensional spaces. In Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD’02), Helsinki, pp. 15–26.
Wang, X., Wang, X. L., & Wilkes, D. M. (2009). A divide-and-conquer approach for minimum spanning tree-based clustering. IEEE Transactions on Knowledge and Data Engineering, 21(7), 945–958.
Palmer, C. R., Gibbons, P. B. & Faloutsos, C. (2002). Fast approximation of the “neighborhood” function for massive graphs. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’02), Edmonton, Alta, Canada, pp. 81–90.
UCI: The UCI KDD Archive. (https://kdd.ics.uci.edu/). Irvine, CA: University of California, Department of Information and Computer Science.
Yu, C., Ooi, B. C., Tan, K. L. & Jagadish, H. V. (2001). Indexing the distance: An efficient method to KNN processing. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB’01), Roma, Italy, pp. 421–430.
Yu, C., Ooi, B. C., Tan, K. L., & Jagadish, H. V. (2005). iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems, 30(2), 364–397.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
Copyright information
© 2021 Xi'an Jiaotong University Press
About this chapter
Cite this chapter
Wang, X., Wang, X., Wilkes, M. (2021). A Fast Distance-Based Outlier Detection Technique Using a Divisive Hierarchical Clustering Algorithm. In: New Developments in Unsupervised Outlier Detection. Springer, Singapore. https://doi.org/10.1007/978-981-15-9519-6_3
Download citation
DOI: https://doi.org/10.1007/978-981-15-9519-6_3
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9518-9
Online ISBN: 978-981-15-9519-6
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)