Abstract
Outlier detection and clustering are both important areas of data mining. Clustering finds groups of data points, whereas outlier analysis finds data points that lie far from these groups; the two tasks therefore share a well-known complementary relationship. A simplistic view is that every data point is either a member of a cluster or an outlier. Data points in the boundary regions of a cluster may also be considered weak outliers, and the study of boundary points is sometimes more meaningful than that of clusters and outliers. Much research has been devoted to boundary point detection; however, as the data collected become increasingly complex, existing boundary point detection algorithms suffer from low precision, parameter dependence, and difficulty in separating outliers. In this chapter, we propose a boundary point detection algorithm, CENTROID-B, based on the concept of a kNN-based centroid; it has low dependence on parameters and high precision, and can detect outliers at the same time. Experimental results on different types of data sets show that the proposed algorithm is effective and achieves high accuracy.
Euclidean minimum spanning tree algorithms typically run with quadratic computational complexity, which is impractical for large-scale multi-dimensional datasets. We therefore also propose a new two-level approximate Euclidean minimum spanning tree algorithm for such datasets. In the first level, we apply the proposed outlier and boundary point detection to identify a small set of boundary points. In the second level, we run the standard Prim's algorithm on the reduced dataset to build an approximate Euclidean minimum spanning tree. Experiments on sample data sets demonstrate the efficiency of the proposed method while maintaining high approximation precision.
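The exact CENTROID-B score and the two-level reduction are defined in the chapter body, which is not included on this page. The Python sketch below illustrates only the general ingredients the abstract names, under our own assumptions: a hypothetical boundary score (distance from a point to the centroid of its k nearest neighbours, normalized by the mean kNN distance — interior points sit near their neighbourhood centroid, boundary points and outliers see it displaced toward the cluster core) and the standard O(n²) Prim's algorithm that the second level would run on the reduced point set.

```python
import numpy as np

def knn_centroid_scores(X, k=10):
    """Hypothetical boundary/outlier score per point (not the chapter's
    exact CENTROID-B definition): displacement of the kNN centroid,
    normalized by the mean kNN distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        nn = np.argsort(D[i])[1:k + 1]          # k nearest neighbours, excluding self
        centroid = X[nn].mean(axis=0)           # centroid of the kNN
        radius = D[i, nn].mean()                # mean kNN distance
        scores[i] = np.linalg.norm(X[i] - centroid) / (radius + 1e-12)
    return scores

def prim_mst(X):
    """Standard Prim's algorithm on the complete Euclidean graph, O(n^2).
    Returns a list of (parent, child, weight) tree edges."""
    n = len(X)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = np.linalg.norm(X - X[0], axis=1)     # cheapest known edge into the tree
    parent = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[j]), j, float(best[j])))
        in_tree[j] = True
        dj = np.linalg.norm(X - X[j], axis=1)
        upd = dj < best
        best = np.where(upd, dj, best)
        parent = np.where(upd, j, parent)
    return edges
```

In the two-level scheme the abstract describes, a threshold on such a score would select the small set of boundary points, and `prim_mst` would then be run on that reduced set; interior points could afterwards be attached to the tree through their nearest retained neighbour.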
References
Hawkins, D. M. (1980). Identification of outliers. London: Chapman and Hall.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD’96), Portland, Oregon, USA, pp. 226–231.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD’00), Dallas, TX, United States, pp. 93–104.
Qiu, B. Z., Yue, F., & Shen, J. Y. (2007). BRIM: an efficient boundary points detecting algorithm. In Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’07), Nanjing, China, pp. 761–768.
He, Y. Z., Wang, C. H., & Qiu, B. Z. (2013). Clustering boundary points detection algorithm based on gradient binarization. Applied Mechanics and Materials, 263–266, 2358–2363.
Li, X., Wu, X., Lv, J., He, J., Guo, J., & Li, M. (2018). Automatic detection of boundary points based on local geometrical measures. Soft Computing, 22(11), 3663–3674.
Kruskal, J. B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1), 48–50.
Prim, R. C. (1957). Shortest connection networks and some generalizations. Bell System Technical Journal, 36(6), 1389–1401.
An, L., Xiang, Q. S., & Chavez, S. (2000). A fast implementation of the minimum spanning tree method for phase unwrapping. IEEE Transactions on Medical Imaging, 19(8), 805–808.
Xu, Y., & Uberbacher, E. C. (1997). 2D image segmentation using minimum spanning trees. Image and Vision Computing, 15(1), 47–57.
Zahn, C. T. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20(1), 68–86.
Xu, Y., Olman, V., & Xu, D. (2002). Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning trees. Bioinformatics, 18(4), 536–545.
Zhong, C., Miao, D., & Wang, R. (2010). A graph-theoretical clustering method based on two rounds of minimum spanning trees. Pattern Recognition, 43(3), 752–766.
Juszczak, P., Tax, D. M. J., Pekalska, E., & Duin, R. P. W. (2009). Minimum spanning tree based one-class classifier. Neurocomputing, 72, 1859–1869.
Yang, C. L. (2005). Building k edge-disjoint spanning trees of minimum total length for isometric data embedding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), 1680–1683.
Gower, J. C., & Ross, G. J. S. (1969). Minimum spanning trees and single linkage cluster analysis. Applied Statistics, 18(1), 54–64.
Balcan, M-F., Blum, A., & Vempala, S. (2008). A discriminative framework for clustering via similarity functions. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC’08), Victoria, BC, Canada, pp. 671–680.
Tang, J., Chen, Z., Fu, A. W. C., & Cheung, D. W. (2002). Enhancing effectiveness of outlier detections for low density patterns. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’02), Taipei, Taiwan, pp. 535–548.
Papadimitriou, S., Kitagawa, H., Gibbons, P. B., & Faloutsos, C. (2003). LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE’03), Bangalore, India, pp. 315–326.
Sun, P., & Chawla, S. (2004). On local spatial outliers. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM’04), Brighton, UK, pp. 209–216.
Jin, W., Tung, A. K. H., Han, J., & Wang, W. (2006). Ranking outliers using symmetric neighborhood relationship. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’06), Singapore, pp. 577–593.
Fan, H., Zaiane, O. R., Foss, A., & Wu, J. (2006). A nonparametric outlier detection for efficiently discovering top-N outliers from engineering data. In Proceedings of the 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD’06), Singapore, Singapore, pp. 557–566.
Zhang, K., Hutter, M., & Jin, H. (2009). A new local distance-based outlier detection approach for scattered real-world data. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’09), pp. 813–822.
Huang, H., Mehrotra, K., & Mohan, C. K. (2013). Rank-based outlier detection. Journal of Statistical Computation and Simulation, 83(3), 518–531.
Schubert, E., Zimek, A., & Kriegel, H.-P. (2014). Generalized outlier detection with flexible kernel density estimates. In Proceedings of the 14th SIAM International Conference on Data Mining (SDM’14), Philadelphia, pp. 542–550.
Ru, X., Liu, Z., Huang, Z., et al. (2016). Normalized residual-based constant false-alarm rate outlier detection. Pattern Recognition Letters, 69, 1–7.
Tang, B., & He, H. (2017). A local density-based approach for outlier detection. Neurocomputing, 241, 171–180.
Kriegel, H.-P., Schubert, M., & Zimek, A. (2008). Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08), Las Vegas, Nevada, USA, pp. 444–452.
Pham, N., & Pagh, R. (2012). A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In Proceedings of the 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD’12), Beijing, China, pp. 877–885.
Borůvka, O. (1926). O jistém problému minimálním (About a certain minimal problem). Práce moravské přírodovědecké společnosti v Brně, III, 37–58 (in Czech with German summary).
Jarník, V. (1930). O jistém problému minimálním (About a certain minimal problem). Práce moravské přírodovědecké společnosti v Brně, VI, 57–63 (in Czech).
Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1(1), 269–271.
Bentley, J., & Friedman, J. (1978). Fast algorithms for constructing minimal spanning trees in coordinate spaces. IEEE Transactions on Computers, C-27(2), 97–105.
Preparata, F. P., & Shamos, M. I. (1985). Computational Geometry. New York: Springer.
Callahan, P., & Kosaraju, S. (1993). Faster algorithms for some geometric graph problems in higher dimensions. In Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms, Austin, TX, USA, pp. 291–300.
Narasimhan, G., Zachariasen, M., & Zhu, J. (2000). Experiments with computing geometric minimum spanning trees. In Proceedings of the 2nd Workshop on Algorithm Engineering and Experimentation (ALENEX’00), pp. 183–196.
March, W. B., Ram, P., & Gray, A. G. (2010). Fast Euclidean minimum spanning tree: algorithm, analysis, and applications. In Proceedings of 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’10), Washington, USA, pp. 603–612.
Vaidya, P. M. (1988). Minimum spanning trees in k-dimensional space. SIAM Journal on Computing, 17(3), 572–582.
Wang, X., Wang, X., & Wilkes, D. M. (2009). A divide-and-conquer approach for minimum spanning tree-based clustering. IEEE Transactions on Knowledge and Data Engineering, 21(7), 945–958.
Lai, C., Rafa, T., & Nelson, D. E. (2009). Approximate minimum spanning tree clustering in high-dimensional space. Intelligent Data Analysis, 13(4), 575–597.
Wang, X., Wang, X. L., & Zhu, J. (2014). A new fast minimum spanning tree based clustering technique. In Proceedings of the 2014 IEEE International Workshop on Scalable Data Analytics (ICDMW’14), Shenzhen, China, pp. 1053–1060.
Zhong, C., Malinen, M., Miao, D., & Fränti, P. (2015). A fast minimum spanning tree algorithm based on K-means. Information Sciences, 295, 1–17.
Yu, C., Ooi, B. C., Tan, K. L., & Jagadish, H. V. (2005). iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems (TODS), 30(2), 364–397.
Wang, X., Wang, X. L., & Wilkes, D. M. (2012). Modifying iDistance for a fast CHAMELEON with application to patch based image segmentation. In Proceedings of the 9th IASTED International Conference on Signal Processing, Pattern Recognition and Applications, Crete, Greece, pp. 107–114.
Li, Y., & Maguire, L. (2011). Selecting critical patterns based on local geometrical and statistical information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6), 1189–1201.
Li, X., Lv, J., & Yi, Z. (2018). An efficient representation-based method for boundary point and outlier detection. IEEE Transactions on Neural Networks and Learning Systems, 29(1), 51–62.
Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492–1496.
Karypis, G., Han, E. H., & Kumar, V. (1999). CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. Computer, 32(8), 68–75.
Wang, X. L., Wang, X., & Li, X. (2018). A fast two-level approximate Euclidean minimum spanning tree algorithm for high-dimensional data. In Proceedings of the 14th International Conference on Machine Learning and Data Mining, New York, USA, pp. 273–287.
Acknowledgements
This chapter is a modified version of the paper published by our group in Machine Learning and Data Mining [50]. The related contents are reused with permission.
© 2021 Xi'an Jiaotong University Press
Wang, X., Wang, X., Wilkes, M. (2021). An Effective Boundary Point Detection Algorithm Via k-Nearest Neighbors-Based Centroid. In: New Developments in Unsupervised Outlier Detection. Springer, Singapore. https://doi.org/10.1007/978-981-15-9519-6_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9518-9
Online ISBN: 978-981-15-9519-6
eBook Packages: Intelligent Technologies and Robotics (R0)