# A Novel Hierarchical Document Clustering Algorithm Based on a kNN Connection Graph

## Abstract

Bottom-up hierarchical document clustering normally merges two most similar clusters in each step iteratively. This paper proposes a novel bottom-up hierarchical document clustering algorithm to merge several pairs of most similar clusters in each step. This is done via a concept of “kNN-connectedness”, which measures the mutual connectedness of clusters in kNNs, and a kNN connection graph, which organizes given clusters into several sets of kNN-connected clusters. In such a graph, a connection between any two clusters only exists in the kNN-connected clusters of the same set. Moreover, a new kNN-based attraction function is proposed to measure the similarity between two clusters and indicates the potential probability of the two clusters being merged. The attraction function only considers the relative distribution of their nearest neighbors between two clusters in a vector space while other criteria, such as the well-known cluster-based cosine similarity function, measures the absolute distance between two clusters. This makes the attraction function effectively apply to the cases where different clusters may have very different distance variation. In each step, a kNN connection graph, consisting of several sets of kNN-connected clusters, is first constructed from the given clusters using a kNN algorithm and the concept of “kNN-connectedness”. For each set of kNN-connected clusters, the attraction degree between any two clusters is calculated and several top connected cluster pairs will be merged. In this way, the iteration number can be largely reduced and the clustering process can be much speeded. Evaluation on a news document corpus shows that the kNN connection graph-based hierarchical document clustering algorithm can achieve better performance than the famous k-means clustering algorithm while reducing the iteration number sharply in comparison with normal hierarchical document clustering.

## Keywords

Cluster Process Similar Cluster Absolute Distance Document Cluster Attraction Function## Preview

Unable to display preview. Download preview PDF.

## References

- 1.Breiman, L.: Bagging Predictors. Machine Learning 24, 123–140 (1996)MATHMathSciNetGoogle Scholar
- 2.Croft, W.B.: Clustering large files of codument using the single-link method. Journal of the American Society for Information Science (1977)Google Scholar
- 3.Hammouda, K.M., Kamel, M.S.: Efficient Phrase-Based Document Indexing for Web Document Clustering. IEEE Transactions on Knowledge and Data Engineering 16(10), 1279–1296 (2004)CrossRefGoogle Scholar
- 4.Hartigan, J.A., Wong, M.A.: Algorithm AS 136: A K-means clustering algorithm. Applied Statistics (1979)Google Scholar
- 5.Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., et al.: An Efficient k-Means Clustering Algorithm: Analysis and Implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2002)Google Scholar
- 6.Han, J.W., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, San Francisco (2001)Google Scholar
- 7.Wei, J.H., He, P.L., Sun, Y.H.: Research on Text Hierarchical Clustering Algorithm based on K-Means. Computer Applications 25(10), 2323–2324 (2005)Google Scholar
- 8.Zhen, T.: Research of Clustering Algorithm Based on Hierarchical and Partitioning Method. Computer engineering and Applications 8, 182–184 (2006)Google Scholar
- 9.Wu, F., Li, S.J.: An Efficient Hierarchical Clustering Algorithm. Computer Engineering 30(9), 70–71 (2004)Google Scholar
- 10.Schutze, H.: Single-link, complete link and average-link, http://www-csli.stanford.edu/~schuetze/completelink.html (accessed, 20/05/2006)
- 11.Steinbach, M., Karipis, G.: A comparison of document clustering techniques. In: KDD workshop on text mining (2000)Google Scholar
- 12.Willett, P.: Recent trends in hierarchical document clustering: a critical review. Information Processing and Management (1988)Google Scholar
- 13.Zamir, O., Etzioni, O.: Web document clustering: a feasibility demonstration. In: SIGIR 1998 (1998)Google Scholar
- 14.Zhang, H.P., Yu, H.K., Xiong, D.Y., Liu, Q.: HHMM-based Chinese Lexical Analyzer ICTCLAS. In: Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pp. 184–187 (2003)Google Scholar
- 15.Zhao, Y., Karipis, G.: Evaluation of hierarchical document clustering for document datasets. In: CIKM 2002 (2002)Google Scholar