Analysis of Similarity Measures with WordNet Based Text Document Clustering

  • Nadella Sandhya
  • A. Govardhan
Part of the Advances in Intelligent and Soft Computing book series (AINSC, volume 132)

Abstract

Text document clustering reorganizes large collections of documents into a smaller number of manageable clusters. While several clustering methods and associated similarity measures have been proposed, partitional clustering algorithms are reported to perform well on document clustering. The cosine function is usually used to measure the similarity between two documents in the criterion function, but it may not work well when the clusters are not well separated. Word meanings represent the topics of documents better than word forms do, so we incorporate an ontology into the text clustering algorithm. In this research, a WordNet-based document representation is built by assigning each word a part-of-speech (POS) tag and by enriching the ‘bag-of-words’ representation with synset concepts, the synonym sets introduced by WordNet. After the words in the ‘bag of words’ are replaced with their respective synset IDs, a variant of the K-Means algorithm is used for document clustering. We then compare three popular similarity measures (cosine, Pearson correlation coefficient, and extended Jaccard) in conjunction with two vector space representations of documents (term frequency and term frequency-inverse document frequency).
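The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the word-to-synset map `TOY_SYNSETS` and its IDs are hypothetical stand-ins for a real WordNet lookup after POS tagging, the IDF variant `log(N / df)` is one common choice among several, and the three similarity functions follow their standard textbook definitions rather than any formulation confirmed by this page.

```python
import math
from collections import Counter

# Hypothetical word -> synset-ID map (made-up IDs, for illustration only).
# A real system would obtain synsets from WordNet after POS tagging.
TOY_SYNSETS = {
    "car": "syn_vehicle", "automobile": "syn_vehicle",
    "fast": "syn_quick", "quick": "syn_quick",
    "road": "syn_road", "street": "syn_road",
}

def to_concepts(tokens):
    """Replace each word by its synset ID when one is known."""
    return [TOY_SYNSETS.get(t, t) for t in tokens]

def tf_vector(tokens, vocab):
    """Raw term-frequency vector over a fixed vocabulary order."""
    counts = Counter(tokens)
    return [float(counts[t]) for t in vocab]

def tfidf_vectors(docs, vocab):
    """TF-IDF vectors using idf = log(N / df), one common variant."""
    n = len(docs)
    idf = {t: math.log(n / sum(1 for d in docs if t in d)) for t in vocab}
    return [[tf * idf[t] for tf, t in zip(tf_vector(d, vocab), vocab)]
            for d in docs]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0

def pearson(a, b):
    """Pearson correlation: cosine of the mean-centered vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - ma for x in a], [y - mb for y in b])

def extended_jaccard(a, b):
    """Extended (Tanimoto) Jaccard: a.b / (|a|^2 + |b|^2 - a.b)."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0

raw_docs = [
    "car fast road".split(),
    "automobile quick street".split(),
    "banana fruit sweet".split(),
]
concept_docs = [to_concepts(d) for d in raw_docs]

raw_vocab = sorted({t for d in raw_docs for t in d})
concept_vocab = sorted({t for d in concept_docs for t in d})

raw_tf = [tf_vector(d, raw_vocab) for d in raw_docs]
concept_tf = [tf_vector(d, concept_vocab) for d in concept_docs]
concept_tfidf = tfidf_vectors(concept_docs, concept_vocab)

for name, sim in [("cosine", cosine), ("pearson", pearson),
                  ("ext. jaccard", extended_jaccard)]:
    print(name, sim(concept_tf[0], concept_tf[1]),
          sim(concept_tfidf[0], concept_tfidf[1]))
```

Under this toy mapping the first two documents share no word forms, so their raw term-frequency vectors are orthogonal, yet they map to the same synset IDs and become identical in the concept space. That is the effect the synset-based representation is meant to capture; how much it helps on real corpora is what the paper's comparison evaluates.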

Keywords

Term Frequency, Cosine Similarity, Cluster Accuracy, Document Cluster, Inverse Document Frequency
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Nadella Sandhya (1)
  • A. Govardhan (2)
  1. CSE Dept., Gokaraju Rangaraju Institute of Engineering & Technology, Hyderabad, India
  2. JNTUH College of Engineering, Jagtial, India
