An Effective Boundary Point Detection Algorithm Via k-Nearest Neighbors-Based Centroid

  • Chapter in: New Developments in Unsupervised Outlier Detection

Abstract

Outlier detection techniques and clustering techniques are important areas of data mining. Clustering is concerned with finding groups of data points, whereas outlier analysis is concerned with finding data points that lie far away from these clusters. Clustering and outlier detection therefore share a well-known complementary relationship. A simplistic view is that every data point is either a member of a cluster or an outlier. Data points in the boundary regions of a cluster may also be considered weak outliers. However, the study of boundary points is sometimes more meaningful than that of clusters and outliers. Much research has been done on boundary point detection; however, as the data collected become increasingly complex, existing boundary point detection algorithms suffer from problems such as low precision, parameter dependence, and difficulty in separating outliers. In this chapter, we propose a boundary point detection algorithm, CENTROID-B, based on the concept of the kNN-based centroid, which has low parameter dependence and high precision and can detect outliers at the same time. Experimental results on different types of data sets show that the proposed boundary point detection algorithm is effective and achieves high accuracy. Euclidean minimum spanning tree algorithms typically run with quadratic computational complexity, which is not practical for large-scale multi-dimensional data sets. In this chapter, we also propose a new two-level approximate Euclidean minimum spanning tree algorithm for large-scale multi-dimensional data sets. In the first level, we apply the proposed outlier and boundary point detection to a given data set to identify a small number of boundary points. In the second level, we run the standard Prim's algorithm on the reduced data set to obtain an approximate Euclidean minimum spanning tree. Experiments on sample data sets demonstrate the efficiency of the proposed method while maintaining high approximation precision.
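
To make the two ideas in the abstract concrete, the following is a minimal, hypothetical Python sketch of the general approach, not the chapter's CENTROID-B implementation: each point is scored by the displacement between the point and the centroid of its k nearest neighbors (interior points are surrounded by neighbors, so the centroid stays close; boundary points see neighbors mostly on one side, which pushes the centroid away; outliers additionally have large kNN radii), and standard Prim's algorithm is then run on the much smaller retained point set. The function names, quantile thresholds, and choice of k below are illustrative assumptions.

# Hypothetical sketch; thresholds and names are assumptions, not the chapter's rules.
import numpy as np
from scipy.spatial import cKDTree

def knn_centroid_scores(X, k=10):
    """Return (displacement score, kNN radius) for every point in X."""
    tree = cKDTree(X)
    dists, idx = tree.query(X, k=k + 1)          # first neighbor is the point itself
    neighbors = X[idx[:, 1:]]                    # shape (n, k, d)
    centroid = neighbors.mean(axis=1)            # kNN-based centroid of each point
    radius = dists[:, 1:].mean(axis=1)           # local scale: mean distance to the k neighbors
    displacement = np.linalg.norm(X - centroid, axis=1) / (radius + 1e-12)
    return displacement, radius

def label_points(X, k=10, boundary_q=0.85, outlier_q=0.98):
    """Crude three-way split: 0 = interior, 1 = boundary, 2 = outlier."""
    displacement, radius = knn_centroid_scores(X, k)
    labels = np.zeros(len(X), dtype=int)
    labels[displacement > np.quantile(displacement, boundary_q)] = 1   # centroid pushed to one side
    labels[radius > np.quantile(radius, outlier_q)] = 2                # sparse neighborhood -> outlier
    return labels

def prim_emst(X):
    """Standard O(n^2) Prim's algorithm on the complete Euclidean graph over X.
    Returns a list of (parent, child, weight) tree edges."""
    n = len(X)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = np.linalg.norm(X - X[0], axis=1)      # cheapest known edge from the tree to each point
    parent = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[j]), j, float(best[j])))
        in_tree[j] = True
        d = np.linalg.norm(X - X[j], axis=1)
        closer = (~in_tree) & (d < best)
        best[closer] = d[closer]
        parent[closer] = j
    return edges

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 2))
    labels = label_points(X, k=15)
    reduced = X[labels == 1]                     # the second level runs on the reduced point set
    print(len(reduced), "boundary points,", len(prim_emst(reduced)), "MST edges")

The sketch mirrors the two-level structure only in outline: Prim's quadratic cost is paid on the small reduced set rather than the full data set, but no claim is made here about how the chapter's algorithm treats the remaining interior points or outliers when completing the approximate tree.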

References

  1. Hawkins, D. M. (1980). Identification of outliers. London: Chapman and Hall.

  2. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’96), Portland, Oregon, USA, pp. 226–231.

  3. Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD’00), Dallas, TX, USA, pp. 93–104.

  4. Qiu, B. Z., Yue, F., & Shen, J. Y. (2007). BRIM: An efficient boundary points detecting algorithm. In Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’07), Nanjing, China, pp. 761–768.

  5. He, Y. Z., Wang, C. H., & Qiu, B. Z. (2013). Clustering boundary points detection algorithm based on gradient binarization. Applied Mechanics and Materials, 263–266, 2358–2363.

  6. Li, X., Wu, X., Lv, J., He, J., Guo, J., & Li, M. (2018). Automatic detection of boundary points based on local geometrical measures. Soft Computing, 22(11), 3663–3674.

  7. Kruskal, J. B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1), 48–50.

  8. Prim, R. C. (1957). Shortest connection networks and some generalizations. Bell System Technical Journal, 36(6), 1389–1401.

  9. An, L., Xiang, Q. S., & Chavez, S. (2000). A fast implementation of the minimum spanning tree method for phase unwrapping. IEEE Transactions on Medical Imaging, 19(8), 805–808.

  10. Xu, Y., & Uberbacher, E. C. (1997). 2D image segmentation using minimum spanning trees. Image and Vision Computing, 15(1), 47–57.

  11. Zahn, C. T. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20(1), 68–86.

  12. Xu, Y., Olman, V., & Xu, D. (2002). Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning trees. Bioinformatics, 18(4), 536–545.

  13. Zhong, C., Miao, D., & Wang, R. (2010). A graph-theoretical clustering method based on two rounds of minimum spanning trees. Pattern Recognition, 43(3), 752–766.

  14. Juszczak, P., Tax, D. M. J., Pekalska, E., & Duin, R. P. W. (2009). Minimum spanning tree based one-class classifier. Neurocomputing, 72, 1859–1869.

  15. Yang, C. L. (2005). Building k edge-disjoint spanning trees of minimum total length for isometric data embedding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), 1680–1683.

  16. Gower, J. C., & Ross, G. J. S. (1969). Minimum spanning trees and single linkage cluster analysis. Applied Statistics, 18(1), 54–64.

  17. Balcan, M-F., Blum, A., & Vempala, S. (2008). A discriminative framework for clustering via similarity functions. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC’08), Victoria, BC, Canada, pp. 671–680.

  18. Tang, J., Chen, Z., Fu, A. W. C., & Cheung, D. W. (2002). Enhancing effectiveness of outlier detections for low density patterns. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’02), Taipei, Taiwan, pp. 535–548.

  19. Gibbons, P. B., Papadimitriou, S., Kitagawa, H., & Faloutsos, C. (2003). LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the IEEE 19th International Conference on Data Engineering (ICDE’03), Bangalore, India, pp. 315–326.

  20. Sun, P., & Chawla, S. (2004). On local spatial outliers. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM’04), Brighton, UK, pp. 209–216.

  21. Jin, W., Tung, A. K. H., Han, J., & Wang, W. (2006). Ranking outliers using symmetric neighborhood relationship. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’06), Singapore, pp. 577–593.

  22. Fan, H., Zaiane, O. R., Foss, A., & Wu, J. (2006). A nonparametric outlier detection for efficiently discovering top-N outliers from engineering data. In Proceedings of the 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD’06), Singapore, pp. 557–566.

  23. Zhang, K., Hutter, M., & Jin, H. (2009). A new local distance-based outlier detection approach for scattered real-world data. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’09), pp. 813–822.

  24. Huang, H., Mehrotra, K., & Mohan, C. K. (2013). Rank-based outlier detection. Journal of Statistical Computation and Simulation, 83(3), 518–531.

  25. Schubert, E., Zimek, A., & Kriegel, H. P. (2014). Generalized outlier detection with flexible kernel density estimates. In Proceedings of the 14th Siam International Conference on Data Mining (SDM’14), Philadelphia, pp. 542–550.

  26. Ru, X., Liu, Z., Huang, Z., et al. (2016). Normalized residual-based constant false-alarm rate outlier detection. Pattern Recognition Letters, 69, 1–7.

  27. Tang, B., & He, H. (2017). A local density-based approach for outlier detection. Neurocomputing, 241, 171–180.

  28. Kriegel, H.-P., Schubert, M., & Zimek, A. (2008). Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08), Las Vegas, Nevada, USA, pp. 444–452.

  29. Pham, N., & Pagh, R. (2012). A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In Proceedings of the 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD’12), Beijing, China, pp. 877–885.

  30. Borůvka, O. (1926). O jistém problému minimálním (About a Certain Minimal Problem). Práce moravské přírodovědecké společnosti v Brně, III, 37–58 (in Czech with German summary).

  31. Jarník, V. (1930). O jistém problému minimálním (About a Certain Minimal Problem). Práce moravské přírodovědecké společnosti v Brně, VI, 57–63 (in Czech).

  32. Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1(1), 269–271.

  33. Bentley, J., & Friedman, J. (1978). Fast algorithms for constructing minimal spanning trees in coordinate spaces. IEEE Transactions on Computers, C-27(2), 97–105.

  34. Preparata, F. P., & Shamos, M. I. (1985). Computational Geometry. New York: Springer.

  35. Callahan, P., & Kosaraju, S. (1993). Faster algorithms for some geometric graph problems in higher dimensions. In Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms, Austin, TX, USA, pp. 291–300.

  36. Narasimhan, G., Zachariasen, M., & Zhu, J. (2000). Experiments with computing geometric minimum spanning trees. In Proceedings of the 2nd Workshop on Algorithm Engineering and Experimentation (ALENEX’00), pp. 183–196.

  37. March, W. B., Ram, P., & Gray, A. G. (2010). Fast Euclidean minimum spanning tree: algorithm, analysis, and applications. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’10), Washington, DC, USA, pp. 603–612.

  38. Vaidya, P. M. (1988). Minimum spanning trees in k-dimensional space. SIAM Journal on Computing, 17(3), 572–582.

  39. Wang, X., Wang, X., & Wilkes, D. M. (2009). A divide-and-conquer approach for minimum spanning tree-based clustering. IEEE Transactions on Knowledge and Data Engineering, 21(7), 945–958.

  40. Lai, C., Rafa, T., & Nelson, D. E. (2009). Approximate minimum spanning tree clustering in high-dimensional space. Intelligent Data Analysis, 13(4), 575–597.

  41. Wang, X., Wang, X. L., & Zhu, J. (2014). A new fast minimum spanning tree based clustering technique. In Proceedings of the 2014 IEEE International Workshop on Scalable Data Analytics (ICDMW’14), Shenzhen, China, pp. 1053–1060.

  42. Zhong, C., Malinen, M., Miao, D., & Fränti, P. (2015). A fast minimum spanning tree algorithm based on K-means. Information Sciences, 295, 1–17.

  43. Yu, C., Ooi, B. C., Tan, K. L., & Jagadish, H. V. (2005). iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems (TODS), 30(2), 364–397.

  44. Wang, X., Wang, X. L., & Wilkes, D. M. (2012). Modifying iDistance for a fast CHAMELEON with application to patch based image segmentation. In Proceedings of the 9th IASTED International Conference on Signal Processing, Pattern Recognition and Applications, Crete, Greece, pp. 107–114.

  45. Li, Y., & Maguire, L. (2011). Selecting critical patterns based on local geometrical and statistical information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6), 1189–1201.

  46. Li, X., Lv, J., & Yi, Z. (2018). An efficient representation-based method for boundary point and outlier detection. IEEE Transactions on Neural Networks and Learning Systems, 29(1), 51–62.

  47. Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492–1496.

  48. Karypis, G., Han, E. H., & Kumar, V. (1999). CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8), 68–75.

  49. UCI Machine Learning Repository, https://archive.ics.uci.edu/ml/datasets.html.

  50. Wang, X. L., Wang, X., & Li, X. (2018). A fast two-level approximate Euclidean minimum spanning tree algorithm for high-dimensional data. In Proceedings of the 14th International Conference on Machine Learning and Data Mining, New York, USA, pp. 273–287.


Acknowledgements

This chapter was modified from a paper published by our group in the Proceedings of the 14th International Conference on Machine Learning and Data Mining [50]. The related contents are reused with permission.

Author information

Corresponding author

Correspondence to Xiaochun Wang.

Copyright information

© 2021 Xi'an Jiaotong University Press

About this chapter

Cite this chapter

Wang, X., Wang, X., Wilkes, M. (2021). An Effective Boundary Point Detection Algorithm Via k-Nearest Neighbors-Based Centroid. In: New Developments in Unsupervised Outlier Detection. Springer, Singapore. https://doi.org/10.1007/978-981-15-9519-6_8
