Document Clustering Based on a Weighted Exponential Measurement

  • Shahrooz Taheri
  • Alex Tze Hiang Sim
  • Seyed Hamid Ghorashi
Part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 279)


Frequent term set clustering was proposed to overcome the difficulty of high dimensionality and to find meaningful labels for clusters. Although this method provides meaningful cluster labels, it has low accuracy. In this research, candidate clusters are extracted by mining frequent term sets within a document dataset. Each document is assigned to these clusters by considering the support values. A new similarity measure for clusters, based on the similarity and weight of clusters, is proposed to remove unwanted clusters in a noise reduction step. The proposed method operates on the concepts of term sets, support values, and the weight of each cluster. Experimental results on the “Re0” and “Hitech” datasets show that the proposed method produces more accurate clusters than previous approaches.
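The candidate-extraction step described in the abstract can be illustrated with a minimal sketch. It treats every term set whose support (the fraction of documents containing all of its terms) meets a threshold as a candidate cluster, and the documents covering that term set as its members. The function name `frequent_term_sets`, the parameters `min_support` and `max_size`, the brute-force enumeration, and the toy documents are all assumptions made for illustration. The abstract does not define the paper's weighted similarity measure or its noise reduction step, so neither is reproduced here.

```python
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Map each frequent term set to the indices of the documents covering it.

    Hypothetical helper: brute-force enumeration, suitable only for tiny inputs.
    """
    # Represent each document as a set of lowercase whitespace-separated terms.
    doc_terms = [set(d.lower().split()) for d in docs]
    n = len(doc_terms)
    vocab = sorted(set().union(*doc_terms))

    candidates = {}
    for size in range(1, max_size + 1):
        for combo in combinations(vocab, size):
            # The cover is every document that contains all terms of the set.
            cover = {i for i, terms in enumerate(doc_terms)
                     if terms.issuperset(combo)}
            # Keep the term set only if its support reaches the threshold.
            if len(cover) / n >= min_support:
                candidates[frozenset(combo)] = cover
    return candidates


# Toy corpus (illustrative only, not taken from Re0 or Hitech).
docs = [
    "apple banana fruit",
    "apple fruit juice",
    "car engine wheel",
    "car wheel road",
]
clusters = frequent_term_sets(docs, min_support=0.5)
for term_set, members in clusters.items():
    print(sorted(term_set), sorted(members))
```

Each frequent term set doubles as a readable label for its candidate cluster, which is the property the abstract credits to this family of methods. A practical implementation would replace the brute-force loop with a frequent itemset miner such as FP-growth.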


Keywords: Clustering, Frequent Terms Set, Noise Reduction





Copyright information

© Springer-Verlag Berlin Heidelberg 2014

Authors and Affiliations

  • Shahrooz Taheri (1)
  • Alex Tze Hiang Sim (1)
  • Seyed Hamid Ghorashi (1)
  1. Department of Information Systems, Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, Skudai, Malaysia
