Knowledge and Information Systems, Volume 28, Issue 2, pp 365–393

Statistical semantics for enhancing document clustering

  • Ahmed K. Farahat
  • Mohamed S. Kamel
Regular Paper

Abstract

Document clustering algorithms usually use the vector space model (VSM) as their underlying model for document representation. VSM assumes that terms are independent and accordingly ignores any semantic relations between them. As a result, documents are mapped to a space where the proximity between document vectors does not reflect their true semantic similarity. This paper proposes new models for document representation that capture semantic similarity between documents based on measures of correlation between their terms. The paper uses the proposed models to enhance the effectiveness of different document clustering algorithms. The proposed representation models define a corpus-specific semantic similarity by estimating measures of term–term correlations from the documents to be clustered. The corpus of documents accordingly defines the context in which semantic similarity is calculated. Experiments were conducted on thirteen benchmark data sets to empirically evaluate the effectiveness of the proposed models and to compare them with VSM and other well-known models for capturing semantic similarity.
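
As an illustration only, and not a reproduction of the models proposed in the paper, the contrast between VSM and a correlation-based representation can be sketched in Python with NumPy. The sketch assumes a tiny hypothetical corpus and uses the cosine between term columns as the term–term correlation measure; the function names, the data, and that choice of measure are all assumptions introduced here.

```python
import numpy as np

def vsm_similarity(X):
    """Cosine similarity between documents under plain VSM.
    X is a documents-by-terms matrix; terms are treated as independent."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for empty documents
    Xn = X / norms
    return Xn @ Xn.T

def term_correlation_similarity(X):
    """Document similarity x_i^T C x_j, where C holds term-term correlations
    estimated from the corpus itself (assumed here: cosine between term columns)."""
    col_norms = np.linalg.norm(X, axis=0, keepdims=True)
    col_norms[col_norms == 0] = 1.0  # avoid division by zero for unused terms
    T = X / col_norms                # each term as a unit vector over documents
    C = T.T @ T                      # terms-by-terms correlation matrix
    K = X @ C @ X.T                  # unnormalised document-document similarity
    d = np.sqrt(np.diag(K))
    d[d == 0] = 1.0
    return K / np.outer(d, d)        # rescale so each document has self-similarity 1

# Hypothetical toy corpus; columns are the terms: car, automobile, banana.
X = np.array([
    [1, 0, 0],   # doc 0 mentions only "car"
    [0, 1, 0],   # doc 1 mentions only "automobile"
    [1, 1, 0],   # doc 2 mentions both, so the two terms co-occur
    [0, 0, 1],   # doc 3 mentions only "banana"
], dtype=float)

vsm = vsm_similarity(X)
sem = term_correlation_similarity(X)

print("VSM similarity of doc 0 and doc 1:        ", vsm[0, 1])
print("Correlation-based similarity of doc 0 and 1:", sem[0, 1])
print("Correlation-based similarity of doc 0 and 3:", sem[0, 3])
```

In this toy corpus, documents 0 and 1 share no terms, so VSM gives them a similarity of 0. Because "car" and "automobile" co-occur in document 2, the correlation-based similarity of documents 0 and 1 is about 0.5, while the unrelated document 3 stays at 0.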

Keywords

Document clustering, Statistical semantics, Semantic similarity, Term–term correlations

Copyright information

© Springer-Verlag London Limited 2010

Authors and Affiliations

  1. Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, Canada
