Enhancing Hierarchical Linkage Clustering via Boundary Point Detection

  • Xiaochun Wang
  • Xiali Wang
  • Don Mitchell Wilkes
Chapter

Abstract

Hierarchical clustering is an effective data analysis tool in many disciplines. However, traditional hierarchical clustering algorithms do not scale to very large databases because of their high computational cost. To partially circumvent this drawback, we propose a new hierarchical linkage clustering algorithm that aims to be both efficient and reliable. The algorithm consists of two stages. In the first stage, boundary point detection is used to obtain a size-reduced version of the original dataset, which is then clustered with a traditional linkage algorithm. In the second stage, a k-nearest-neighbors classifier assigns a cluster label to each of the remaining data points. Our evaluation shows that the proposed algorithm achieves reasonable run times together with better accuracy.
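
The two stages can be illustrated with a minimal Python sketch. It is not the chapter's implementation, and several choices are assumptions made only for illustration: the mean distance to the k nearest neighbors stands in for the boundary point detector, the points with the highest scores are the ones set aside, single linkage is the linkage rule, and the names `boundary_scores`, `single_linkage`, `two_stage_cluster`, and `drop_fraction` are hypothetical.

```python
import math
from collections import Counter


def knn(points, i, k, candidates):
    """Indices of the k candidates nearest to points[i] (brute force, excluding i)."""
    others = [j for j in candidates if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def boundary_scores(points, k):
    """Assumed proxy score: mean distance to the k nearest neighbors.

    A larger score means a sparser neighborhood. This stands in for a real
    boundary point detector, which the abstract does not specify.
    """
    idx = range(len(points))
    return [
        sum(math.dist(points[i], points[j]) for j in knn(points, i, k, idx)) / k
        for i in idx
    ]


def single_linkage(points, subset, n_clusters):
    """Naive single-linkage agglomeration over the given point indices."""
    clusters = [[i] for i in subset]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[p], points[q])
                        for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return {i: label for label, members in enumerate(clusters) for i in members}


def two_stage_cluster(points, n_clusters, k=3, drop_fraction=0.3):
    """Stage 1: cluster a reduced set. Stage 2: label the rest by k-NN vote."""
    scores = boundary_scores(points, k)
    order = sorted(range(len(points)), key=lambda i: scores[i])
    n_keep = max(n_clusters, round(len(points) * (1 - drop_fraction)))
    kept, set_aside = order[:n_keep], order[n_keep:]

    # Stage 1: linkage clustering on the size-reduced dataset only.
    labels = single_linkage(points, kept, n_clusters)

    # Stage 2: majority vote among the k nearest already-clustered points.
    for i in set_aside:
        votes = Counter(labels[j] for j in knn(points, i, k, kept))
        labels[i] = votes.most_common(1)[0][0]
    return [labels[i] for i in range(len(points))]
```

The expensive agglomeration runs only on the reduced set, so its cost depends on the number of retained points rather than on the full dataset. The brute-force neighbor search used here would normally be replaced by an index structure for large inputs.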

Keywords

  • Agglomerative hierarchical clustering
  • Distance-based outlier detection
  • Density-based outlier detection
  • Boundary point detection
  • K-nearest neighbors search

Copyright information

© Xi'an Jiaotong University Press 2020

Authors and Affiliations

  • Xiaochun Wang (1, corresponding author)
  • Xiali Wang (2)
  • Don Mitchell Wilkes (3)
  1. School of Software Engineering, Xi’an Jiaotong University, Xi’an, China
  2. School of Information Engineering, Chang’an University, Xi’an, China
  3. Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, USA
