A Fast Distance-Based Outlier Detection Technique Using a Divisive Hierarchical Clustering Algorithm

Wang, Xiaochun; Wang, Xiali; Wilkes, Mitch

doi:10.1007/978-981-15-9519-6_3

Abstract

Today’s real-world databases typically hold millions of items with many thousands of fields, so data sets can reach terabytes in size. As a result, traditional distribution-based outlier detection techniques are increasingly limited, and approaches that find unusual samples in a data set based on their distances to neighboring samples have become increasingly popular. The problem with these k-nearest neighbor-based methods is that they are computationally expensive for large data sets. At the same time, today’s databases are often too large to fit into main memory at once, so memory capacity and, correspondingly, I/O cost become important issues. In this chapter, we present a simple distance-based outlier detection algorithm that can compete with existing solutions in both CPU and I/O efficiency.
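
To make the cost concern concrete, the following is a minimal brute-force sketch of one common distance-based outlier score: the distance from each point to its k-th nearest neighbor. It is not the algorithm presented in the chapter, which the abstract says relies on divisive hierarchical clustering; it only illustrates the kind of baseline whose cost grows quadratically with the number of samples. The function names and the toy data are hypothetical.

```python
import heapq
import math


def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor.

    Brute force: every point is compared with every other point,
    so the number of distance computations grows quadratically.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")
    scores = []
    for i, p in enumerate(points):
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        # The k-th smallest distance is the largest of the k smallest.
        scores.append(heapq.nsmallest(k, dists)[-1])
    return scores


def top_outliers(points, k, n):
    """Return the indices of the n points with the largest k-NN distance."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical toy data: a tight cluster plus one distant point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(data, k=2, n=1))  # prints [4]
```

On the toy data the isolated point at index 4 receives the largest score. For N points the loop performs N(N - 1) distance computations, which is the expense that faster distance-based methods aim to reduce.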

Author information

Authors and Affiliations

School of Software Engineering, Xi’an Jiaotong University, Xi’an, Shaanxi, China
Xiaochun Wang
School of Information Engineering, Chang’an University, Xi’an, Shaanxi, China
Xiali Wang
Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, USA
Mitch Wilkes

Corresponding author

Correspondence to Xiaochun Wang.

About this chapter

Cite this chapter

Wang, X., Wang, X., Wilkes, M. (2021). A Fast Distance-Based Outlier Detection Technique Using a Divisive Hierarchical Clustering Algorithm. In: New Developments in Unsupervised Outlier Detection. Springer, Singapore. https://doi.org/10.1007/978-981-15-9519-6_3

DOI: https://doi.org/10.1007/978-981-15-9519-6_3
Published: 25 November 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9518-9
Online ISBN: 978-981-15-9519-6
eBook Packages: Intelligent Technologies and Robotics (R0)