Abstract
Outlier detection and clustering are both important areas of data mining. Clustering finds groups of data points, whereas outlier analysis finds data points that lie far from these groups; the two tasks therefore share a well-known complementary relationship. A simplistic view is that every data point is either a member of a cluster or an outlier. Data points in the boundary regions of a cluster may also be considered weak outliers, and the study of boundary points is sometimes more meaningful than that of clusters and outliers. Much research has been devoted to boundary point detection; however, as the data collected become increasingly complex, existing boundary point detection algorithms suffer from low precision, parameter dependence, and difficulty in separating outliers. In this chapter, we propose a boundary point detection algorithm, CENTROID-B, based on the concept of a kNN-based centroid; it has low dependence on parameters and high precision, and can detect outliers at the same time. Experimental results on different types of data sets show that the proposed algorithm is effective and achieves high accuracy.
Euclidean minimum spanning tree algorithms typically run with quadratic computational complexity, which is impractical for large-scale multi-dimensional datasets. We therefore also propose a new two-level approximate Euclidean minimum spanning tree algorithm for such datasets. In the first level, we apply the proposed outlier and boundary point detection to identify a small set of boundary points. In the second level, we run the standard Prim's algorithm on the reduced dataset to build an approximate Euclidean minimum spanning tree. Experiments on sample data sets demonstrate the efficiency of the proposed method while maintaining high approximation precision.
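The exact CENTROID-B score and the two-level reduction are defined in the chapter body, which is not included on this page. The Python sketch below illustrates only the general ingredients the abstract names, under our own assumptions: a hypothetical boundary score (distance from a point to the centroid of its k nearest neighbours, normalized by the mean kNN distance — interior points sit near their neighbourhood centroid, boundary points and outliers see it displaced toward the cluster core) and the standard O(n²) Prim's algorithm that the second level would run on the reduced point set.

```python
import numpy as np

def knn_centroid_scores(X, k=10):
    """Hypothetical boundary/outlier score per point (not the chapter's
    exact CENTROID-B definition): displacement of the kNN centroid,
    normalized by the mean kNN distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        nn = np.argsort(D[i])[1:k + 1]          # k nearest neighbours, excluding self
        centroid = X[nn].mean(axis=0)           # centroid of the kNN
        radius = D[i, nn].mean()                # mean kNN distance
        scores[i] = np.linalg.norm(X[i] - centroid) / (radius + 1e-12)
    return scores

def prim_mst(X):
    """Standard Prim's algorithm on the complete Euclidean graph, O(n^2).
    Returns a list of (parent, child, weight) tree edges."""
    n = len(X)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = np.linalg.norm(X - X[0], axis=1)     # cheapest known edge into the tree
    parent = np.zeros(n, dtype=int)
    edges = []
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[j]), j, float(best[j])))
        in_tree[j] = True
        dj = np.linalg.norm(X - X[j], axis=1)
        upd = dj < best
        best = np.where(upd, dj, best)
        parent = np.where(upd, j, parent)
    return edges
```

In the two-level scheme the abstract describes, a threshold on such a score would select the small set of boundary points, and `prim_mst` would then be run on that reduced set; interior points could afterwards be attached to the tree through their nearest retained neighbour.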
References
Hawkins, D. M. (1980). Identification of outliers. London: Chapman and Hall.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD’96), Portland, Oregon, USA, pp. 226–231.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD’00), Dallas, TX, United States, pp. 93–104.
Qiu, B. Z., Yue, F., & Shen, J. Y. (2007). BRIM: an efficient boundary points detecting algorithm. In Proceedings of the 11th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’07), Nanjing, China, pp. 761–768.
He, Y. Z., Wang, C. H., & Qiu, B. Z. (2013). Clustering boundary points detection algorithm based on gradient binarization. Applied Mechanics and Materials, 263–266, 2358–2363.
Li, X., Wu, X., Lv, J., He, J., Guo, J., & Li, M. (2018). Automatic detection of boundary points based on local geometrical measures. Soft Computing, 22(11), 3663–3674.
Kruskal, J. B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1), 48–50.
Prim, R. C. (1957). Shortest connection networks and some generalizations. Bell System Technical Journal, 36(6), 1389–1401.
An, L., Xiang, Q. S., & Chavez, S. (2000). A fast implementation of the minimum spanning tree method for phase unwrapping. IEEE Transactions on Medical Imaging, 19(8), 805–808.
Xu, Y., & Uberbacher, E. C. (1997). 2D image segmentation using minimum spanning trees. Image and Vision Computing, 15(1), 47–57.
Zahn, C. T. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20(1), 68–86.
Xu, Y., Olman, V., & Xu, D. (2002). Clustering gene expression data using a graph-theoretic approach: An application of minimum spanning trees. Bioinformatics, 18(4), 536–545.
Zhong, C., Miao, D., & Wang, R. (2010). A graph-theoretical clustering method based on two rounds of minimum spanning trees. Pattern Recognition, 43(3), 752–766.
Juszczak, P., Tax, D. M. J., Pekalska, E., & Duin, R. P. W. (2009). Minimum spanning tree based one-class classifier. Neurocomputing, 72, 1859–1869.
Yang, C. L. (2005). Building k edge-disjoint spanning trees of minimum total length for isometric data embedding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), 1680–1683.
Gower, J. C., & Ross, G. J. S. (1969). Minimum spanning trees and single linkage cluster analysis. Applied Statistics, 18(1), 54–64.
Balcan, M-F., Blum, A., & Vempala, S. (2008). A discriminative framework for clustering via similarity functions. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC’08), Victoria, BC, Canada, pp. 671–680.
Tang, J., Chen, Z., Fu, A. W. C., & Cheung, D. W. (2002). Enhancing effectiveness of outlier detections for low density patterns. In Proceedings of the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’02), Taipei, Taiwan, pp. 535–548.
Papadimitriou, S., Kitagawa, H., Gibbons, P. B., & Faloutsos, C. (2003). LOCI: Fast outlier detection using the local correlation integral. In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE’03), Bangalore, India, pp. 315–326.
Sun, P., & Chawla, S. (2004). On local spatial outliers. In Proceedings of the 4th IEEE International Conference on Data Mining (ICDM’04), Brighton, UK, pp. 209–216.
Jin, W., Tung, A. K. H., Han, J., & Wang, W. (2006). Ranking outliers using symmetric neighborhood relationship. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’06), Singapore, pp. 577–593.
Fan, H., Zaiane, O. R., Foss, A., & Wu, J. (2006). A nonparametric outlier detection for efficiently discovering top-N outliers from engineering data. In Proceedings of the 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD’06), Singapore, Singapore, pp. 557–566.
Zhang, K., Hutter, M., & Jin, H. (2009). A new local distance-based outlier detection approach for scattered real-world data. In Proceedings of the 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’09), pp. 813–822.
Huang, H., Mehrotra, K., & Mohan, C. K. (2013). Rank-based outlier detection. Journal of Statistical Computation and Simulation, 83(3), 518–531.
Schubert, E., Zimek, A., & Kriegel, H.-P. (2014). Generalized outlier detection with flexible kernel density estimates. In Proceedings of the 14th SIAM International Conference on Data Mining (SDM’14), Philadelphia, pp. 542–550.
Ru, X., Liu, Z., Huang, Z., et al. (2016). Normalized residual-based constant false-alarm rate outlier detection. Pattern Recognition Letters, 69, 1–7.
Tang, B., & He, H. (2017). A local density-based approach for outlier detection. Neurocomputing, 241, 171–180.
Kriegel, H.-P., Schubert, M., & Zimek, A. (2008). Angle-based outlier detection in high-dimensional data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08), Las Vegas, Nevada, USA, pp. 444–452.
Pham, N., & Pagh, R. (2012). A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data. In Proceedings of the 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (SIGKDD’12), Beijing, China, pp. 877–885.
Borůvka, O. (1926). O jistém problému minimálním (About a certain minimal problem). Práce moravské přírodovědecké společnosti v Brně, III, 37–58 (in Czech with German summary).
Jarník, V. (1930). O jistém problému minimálním (About a certain minimal problem). Práce moravské přírodovědecké společnosti v Brně, VI, 57–63 (in Czech).
Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische Mathematik, 1(1), 269–271.
Bentley, J., & Friedman, J. (1978). Fast algorithms for constructing minimal spanning trees in coordinate spaces. IEEE Transactions on Computers, C-27(2), 97–105.
Preparata, F. P., & Shamos, M. I. (1985). Computational Geometry. New York: Springer.
Callahan, P., & Kosaraju, S. (1993). Faster algorithms for some geometric graph problems in higher dimensions. In Proceedings of the 4th Annual ACM-SIAM Symposium on Discrete Algorithms, Austin, TX, USA, pp. 291–300.
Narasimhan, G., Zachariasen, M., & Zhu, J. (2000). Experiments with computing geometric minimum spanning trees. In Proceedings of the 2nd Workshop on Algorithm Engineering and Experimentation (ALENEX’00), pp. 183–196.
March, W. B., Ram, P., & Gray, A. G. (2010). Fast Euclidean minimum spanning tree: algorithm, analysis, and applications. In Proceedings of 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’10), Washington, USA, pp. 603–612.
Vaidya, P. M. (1988). Minimum spanning trees in k-dimensional space. SIAM Journal on Computing, 17(3), 572–582.
Wang, X., Wang, X., & Wilkes, D. M. (2009). A divide-and-conquer approach for minimum spanning tree-based clustering. IEEE Transactions on Knowledge and Data Engineering, 21(7), 945–958.
Lai, C., Rafa, T., & Nelson, D. E. (2009). Approximate minimum spanning tree clustering in high-dimensional space. Intelligent Data Analysis, 13(4), 575–597.
Wang, X., Wang, X. L., & Zhu, J. (2014). A new fast minimum spanning tree based clustering technique. In Proceedings of the 2014 IEEE International Workshop on Scalable Data Analytics (ICDMW’14), Shenzhen, China, pp. 1053–1060.
Zhong, C., Malinen, M., Miao, D., & Fränti, P. (2015). A fast minimum spanning tree algorithm based on K-means. Information Sciences, 295, 1–17.
Yu, C., Ooi, B. C., Tan, K. L., & Jagadish, H. V. (2005). iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems (TODS), 30(2), 364–397.
Wang, X., Wang, X. L., & Wilkes, D. M. (2012). Modifying iDistance for a fast CHAMELEON with application to patch based image segmentation. In Proceedings of the 9th IASTED International Conference on Signal Processing, Pattern Recognition and Applications, Crete, Greece, pp. 107–114.
Li, Y., & Maguire, L. (2011). Selecting critical patterns based on local geometrical and statistical information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6), 1189–1201.
Li, X., Lv, J., & Yi, Z. (2018). An efficient representation-based method for boundary point and outlier detection. IEEE Transactions on Neural Networks and Learning Systems, 29(1), 51–62.
Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492–1496.
Karypis, G., Han, E. H., & Kumar, V. (1999). CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. Computer, 32(8), 68–75.
Wang, X. L., Wang, X., & Li, X. (2018). A fast two-level approximate Euclidean minimum spanning tree algorithm for high-dimensional data. In Proceedings of the 14th International Conference on Machine Learning and Data Mining, New York, USA, pp. 273–287.
Acknowledgements
This chapter is a modified version of the paper published by our group in Machine Learning and Data Mining [50]. The related contents are reused with permission.
© 2021 Xi'an Jiaotong University Press
Wang, X., Wang, X., Wilkes, M. (2021). An Effective Boundary Point Detection Algorithm Via k-Nearest Neighbors-Based Centroid. In: New Developments in Unsupervised Outlier Detection. Springer, Singapore. https://doi.org/10.1007/978-981-15-9519-6_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9518-9
Online ISBN: 978-981-15-9519-6
eBook Packages: Intelligent Technologies and Robotics (R0)