Document Clustering Based on a Weighted Exponential Measurement
Frequent term set clustering has been proposed to overcome the difficulties of high dimensionality and of finding meaningful labels for clusters. Although this method provides meaningful cluster labels, it has low accuracy. In this research, candidate clusters are extracted by mining frequent term sets within a document dataset. Each document is assigned to these clusters according to the support values. A new similarity measure for clusters, based on the similarity and weight of clusters, is designed and used to remove unwanted clusters in a noise reduction step. The proposed method operates on the concepts of term sets, support value, and the weight of each cluster. Experimental results on the “Re0” and “Hitech” datasets show that the proposed method produces more accurate clusters than previous efforts.
Keywords: Clustering, Frequent Terms Set, Noise Reduction
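
The first step named in the abstract, extracting candidate clusters by mining frequent term sets, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `frequent_term_sets`, the whitespace tokenization, the `max_size` cap, and the brute-force enumeration are all assumptions made for illustration. The document assignment rule and the weighted similarity measure are not specified in the abstract, so they are not sketched here.

```python
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Map each frequent term set to the indices of the documents containing it.

    Hypothetical helper: a term set is "frequent" when the fraction of
    documents containing all of its terms is at least min_support.
    Each frequent term set serves as a candidate cluster whose label is
    the term set itself and whose members are the covering documents.
    """
    # Assumption: naive lowercase whitespace tokenization.
    doc_terms = [set(d.lower().split()) for d in docs]
    n = len(doc_terms)
    vocab = sorted(set().union(*doc_terms))

    candidates = {}
    # Brute-force enumeration; a real system would use Apriori-style pruning.
    for size in range(1, max_size + 1):
        for combo in combinations(vocab, size):
            cover = {i for i, terms in enumerate(doc_terms)
                     if terms.issuperset(combo)}
            if len(cover) / n >= min_support:
                candidates[frozenset(combo)] = cover
    return candidates


docs = [
    "apple banana",
    "apple banana cherry",
    "apple durian",
    "cherry durian",
]
clusters = frequent_term_sets(docs, min_support=0.5)
for terms, members in clusters.items():
    print(sorted(terms), sorted(members))
```

Because every candidate cluster is defined by a term set, the cluster label comes for free, which is the property the abstract attributes to this family of methods. Candidate clusters can overlap, which is why a later assignment and noise reduction step is needed.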