A Fast Distance-Based Outlier Detection Technique Using a Divisive Hierarchical Clustering Algorithm

New Developments in Unsupervised Outlier Detection

Abstract

Today’s real-world databases typically have millions of items with many thousands of fields, resulting in data that range in size into terabytes. As a result, traditional distribution-based outlier detection techniques have become increasingly limited, and approaches that find unusual samples in a dataset based on their distances to neighboring samples have grown in popularity. The problem with these k-nearest neighbor-based methods is that they are computationally expensive for large datasets. At the same time, today’s databases are often too large to fit into main memory at once, so memory capacity and the corresponding I/O cost become important issues. In this chapter, we present a simple distance-based outlier detection algorithm that can compete with existing solutions in both CPU and I/O efficiency.
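
To make the cost argument concrete, the following is a minimal sketch of a brute-force k-nearest neighbor outlier score. It is not the chapter's algorithm. It only illustrates the baseline that such methods try to speed up. The function names `knn_outlier_scores` and `top_outliers`, the Euclidean metric, and the choice of the k-th neighbor distance as the score are assumptions made for illustration.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor.

    Brute force: every point is compared with every other point, so the
    number of distance computations grows quadratically with len(points).
    """
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < number of points")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to all other points, sorted ascending.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_outliers(points, k, n):
    """Return the indices of the n points with the largest k-NN distance."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical toy data: four clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(top_outliers(points, k=2, n=1))  # prints [4]
```

Because each point is compared against all others, the sketch needs on the order of n² distance computations and assumes the whole dataset fits in memory, which are the two costs the abstract identifies.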

Author information

Corresponding author

Correspondence to Xiaochun Wang.

Copyright information

© 2021 Xi'an Jiaotong University Press

About this chapter

Cite this chapter

Wang, X., Wang, X., Wilkes, M. (2021). A Fast Distance-Based Outlier Detection Technique Using a Divisive Hierarchical Clustering Algorithm. In: New Developments in Unsupervised Outlier Detection. Springer, Singapore. https://doi.org/10.1007/978-981-15-9519-6_3
