Knowledge and Information Systems, Volume 32, Issue 3, pp 475–503

Interpretable and reconfigurable clustering of document datasets by deriving word-based rules

Regular Paper

Abstract

Clusters of text documents output by clustering algorithms are often hard to interpret. We describe motivating real-world scenarios that necessitate reconfigurability and high interpretability of clusters, and outline the problem of generating clusterings with interpretable and reconfigurable cluster models. Toward this goal, we develop two clustering algorithms, RGC and RGC-D. Both generate clusters with associated rules composed of conditions on word occurrences or nonoccurrences. The two approaches differ in the complexity of the rule format: RGC employs disjunctions and conjunctions in rule generation, whereas RGC-D rules are simple disjunctions of conditions signifying the presence of various words. In both cases, each cluster comprises precisely the set of documents that satisfy the corresponding rule. Rules of the latter kind are easier to interpret, whereas rules of the former kind lead to more accurate clustering. We show that our approaches outperform, by significant margins, both the unsupervised decision tree approach for rule-generating clustering and an approach we provide for generating interpretable models for general clusterings. We empirically show that the purity and F-measure losses incurred to achieve interpretability can be as little as 3% and 5%, respectively, using the algorithms presented herein.
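To make the rule format concrete, the following is a minimal sketch of how a word-based rule can define a cluster and how purity can be measured against known labels. It is not the RGC or RGC-D algorithm, neither of which is specified in the abstract; it only evaluates hand-written rules. The representation (a rule as a list of conjunctions, each a list of `(word, must_be_present)` pairs), the function names `satisfies`, `cluster_for`, and `purity`, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter

def satisfies(doc_words, rule):
    """Return True if the document's word set satisfies the rule.

    Assumed representation: `rule` is a disjunction (list) of conjunctions,
    each conjunction a list of (word, must_be_present) conditions.
    """
    return any(
        all((word in doc_words) == must_be_present for word, must_be_present in conj)
        for conj in rule
    )

def cluster_for(docs, rule):
    """The cluster is precisely the set of documents satisfying the rule."""
    return [i for i, text in enumerate(docs) if satisfies(set(text.split()), rule)]

def purity(clusters, labels):
    """Fraction of clustered documents that belong to their cluster's majority label."""
    total = sum(len(c) for c in clusters)
    majority = sum(
        Counter(labels[i] for i in c).most_common(1)[0][1] for c in clusters if c
    )
    return majority / total

# Toy data (hypothetical, for illustration only).
docs = [
    "stock market rally",
    "market crash fears",
    "football match result",
    "cricket match stock",
]
labels = ["finance", "finance", "sports", "sports"]

# Simple disjunction of word-presence conditions: stock OR market.
simple_rule = [[("stock", True)], [("market", True)]]

# Disjunction of conjunctions with a nonoccurrence condition:
# market OR (stock AND NOT match).
richer_rule = [[("market", True)], [("stock", True), ("match", False)]]

sports_rule = [[("match", True)]]

simple_cluster = cluster_for(docs, simple_rule)   # [0, 1, 3]
richer_cluster = cluster_for(docs, richer_rule)   # [0, 1]
sports_cluster = cluster_for(docs, sports_rule)   # [2, 3]

print(purity([simple_cluster, [2]], labels))             # 0.75
print(purity([richer_cluster, sports_cluster], labels))  # 1.0
```

On this toy data, the purely disjunctive rule is shorter to read but pulls in a sports document that mentions "stock", while the rule with a conjunction and a nonoccurrence condition excludes it. That mirrors, in miniature, the trade-off between interpretability and accuracy that the abstract describes.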

Keywords

Data clustering; Text clustering; Interpretability

Copyright information

© Springer-Verlag London Limited 2011

Authors and Affiliations

  1. VMware, Bangalore, India
  2. IBM Research, Bangalore, India
  3. IIT Madras, Chennai, India
