Knowledge and Information Systems, Volume 28, Issue 3, pp 687–708

An integration of fuzzy association rules and WordNet for document clustering

Regular Paper

Abstract

With the rapid growth of text documents, document clustering techniques are emerging as a means of efficient document retrieval and better document browsing. Recently, some methods have been proposed that address the problems of high dimensionality, scalability, accuracy, and meaningful cluster labels by clustering documents with frequent itemsets derived from association rule mining. To improve the quality of document clustering results, we propose an effective Fuzzy Frequent Itemset-based Document Clustering (F2IDC) approach that combines fuzzy association rule mining with the background knowledge embedded in WordNet. A term hierarchy generated from WordNet is applied to discover generalized frequent itemsets as candidate cluster labels for grouping documents. We have conducted experiments to evaluate our approach on the Classic4, Re0, R8, and WebKB datasets. Our experimental results show that the proposed approach provides more accurate clustering results than prior influential clustering methods presented in the recent literature.
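The idea of generalized fuzzy frequent itemsets can be illustrated with a minimal sketch. It is not the F2IDC algorithm itself, whose details the abstract does not give. Everything below is an assumption made for illustration: the toy `HYPERNYMS` map (a stand-in for a WordNet-derived term hierarchy), the single "high frequency" ramp membership function and its thresholds, the use of the minimum to combine memberships, and the sample documents.

```python
from itertools import combinations

# Hypothetical one-level hypernym map standing in for a WordNet term hierarchy.
HYPERNYMS = {"apple": "fruit", "banana": "fruit", "car": "vehicle", "truck": "vehicle"}

def add_hypernyms(term_counts):
    """Return a copy of the counts with each hypernym added as the sum of its children."""
    out = dict(term_counts)
    for term, count in term_counts.items():
        parent = HYPERNYMS.get(term)
        if parent is not None:
            out[parent] = out.get(parent, 0) + count
    return out

def membership_high(count, low=1.0, high=4.0):
    """Assumed ramp membership for the fuzzy region 'high frequency'."""
    if count <= low:
        return 0.0
    if count >= high:
        return 1.0
    return (count - low) / (high - low)

def fuzzy_support(itemset, docs):
    """Average over documents of the minimum membership across the itemset's terms."""
    total = 0.0
    for doc in docs:
        total += min(membership_high(doc.get(term, 0)) for term in itemset)
    return total / len(docs)

def frequent_itemsets(docs, min_support, max_size=2):
    """Brute-force enumeration of itemsets whose fuzzy support meets the threshold."""
    terms = sorted({term for doc in docs for term in doc})
    result = {}
    for size in range(1, max_size + 1):
        for itemset in combinations(terms, size):
            support = fuzzy_support(itemset, docs)
            if support >= min_support:
                result[itemset] = support
    return result

# Made-up term-frequency vectors for three documents.
raw_docs = [
    {"apple": 3, "banana": 2},
    {"apple": 1, "car": 4},
    {"truck": 5, "banana": 4},
]
docs = [add_hypernyms(d) for d in raw_docs]
labels = frequent_itemsets(docs, min_support=0.5)
print(labels)
```

With this toy data, no individual leaf term reaches the 0.5 threshold, but the generalized terms `fruit` and `vehicle` each do, which is the kind of effect a term hierarchy is meant to produce when proposing candidate cluster labels.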

Keywords

Fuzzy association rule mining, Text mining, Document clustering, Frequent itemsets, WordNet

Copyright information

© Springer-Verlag London Limited 2010

Authors and Affiliations

  • Chun-Ling Chen (1)
  • Frank S. C. Tseng (2)
  • Tyne Liang (1)

  1. Department of Computer Science, National Chiao Tung University, HsinChu, Taiwan, ROC
  2. Department of Information Management, National Kaohsiung 1st University of Science and Technology, YenChao, Taiwan, ROC
