
Wuhan University Journal of Natural Sciences, Volume 23, Issue 6, pp 514–524

Application of Algorithm CARDBK in Document Clustering

  • Yehang Zhu
  • Mingjie Zhang
  • Feng Shi

Abstract

In the K-means clustering algorithm, each data point is assigned to exactly one cluster, and clustering quality depends heavily on the initial cluster centroids: different initializations can yield very different results, and local adjustment cannot rescue the result from a poor local optimum. Moreover, an outlier in a cluster severely distorts that cluster's mean, and K-means is only suitable for clusters with convex shapes. We therefore propose a novel clustering algorithm, CARDBK, where "centroid all rank distance (CARD)" means that all centroids are ranked by their distance from a given point and "BK" stands for "batch K-means". In CARDBK, a point updates not only the cluster centroid nearest to it but also the centroids of multiple adjacent clusters, and the degree of influence a point exerts on a centroid depends on the point's distances to the other, nearer cluster centroids. Experimental results show that CARDBK outperformed other algorithms on a number of data sets in terms of the following performance indexes: entropy, purity, F1 value, Rand index, and normalized mutual information (NMI). Our algorithm also proved more stable, linearly scalable, and faster.
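The abstract describes the update rule only informally, so the following is a minimal Python sketch of one CARDBK-style batch update, not the authors' exact method. Each point contributes to its several nearest centroids rather than only the closest one; the function name cardbk_step, the parameter n_influence, and the inverse-distance weighting scheme are all illustrative assumptions, since the abstract does not specify the precise weights.

```python
import numpy as np

def cardbk_step(X, centroids, n_influence=3):
    """One batch update in the spirit of CARDBK (a sketch, not the
    published algorithm): every point adjusts its n_influence nearest
    centroids, with closer centroids receiving more influence.
    The inverse-distance weights below are an assumed scheme."""
    k = centroids.shape[0]
    n_influence = min(n_influence, k)
    new_centroids = np.zeros_like(centroids)
    weight_sums = np.zeros(k)
    for x in X:
        # "all rank distance": compute and rank distances to every centroid
        d = np.linalg.norm(centroids - x, axis=1)
        nearest = np.argsort(d)[:n_influence]
        # assumed weighting: inverse distance, normalized over the
        # n_influence nearest centroids (epsilon avoids division by zero)
        w = 1.0 / (d[nearest] + 1e-12)
        w /= w.sum()
        for idx, wi in zip(nearest, w):
            new_centroids[idx] += wi * x
            weight_sums[idx] += wi
    # weighted mean per centroid; centroids that attracted no weight stay put
    mask = weight_sums > 0
    new_centroids[mask] /= weight_sums[mask][:, None]
    return np.where(mask[:, None], new_centroids, centroids)
```

With n_influence = 1 this degenerates to an ordinary batch K-means step; larger values let each point smooth several nearby centroids at once, which is the intuition behind the reported robustness to poor initialization.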

Key words

algorithm design and analysis; clustering; document analysis; text processing

CLC number

TP 391 



Copyright information

© Wuhan University and Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. College of Economics and Management, Xi'an University of Posts and Telecommunications, Xi'an, Shaanxi, China
  2. Information Business Department, Puyang Technician College, Puyang, Henan, China
