Interpretable and reconfigurable clustering of document datasets by deriving word-based rules

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Clusters of text documents output by clustering algorithms are often hard to interpret. We describe motivating real-world scenarios that call for reconfigurable and highly interpretable clusters, and we outline the problem of generating clusterings with interpretable and reconfigurable cluster models. We develop two clustering algorithms toward this goal. Both generate clusters with associated rules composed of conditions on word occurrences or nonoccurrences. The two approaches differ in the complexity of the rule format: RGC employs disjunctions and conjunctions in rule generation, whereas RGC-D rules are simple disjunctions of conditions signifying the presence of various words. In both cases, each cluster comprises precisely the set of documents that satisfy the corresponding rule. RGC-D rules are easier to interpret, whereas RGC rules lead to more accurate clustering. We show that our approaches outperform, by significant margins, both the unsupervised decision tree approach for rule-generating clustering and an approach we provide for generating interpretable models for general clusterings. We empirically show that the purity and F-measure losses incurred to achieve interpretability can be as little as 3% and 5%, respectively, using the algorithms presented herein.
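To make the two rule formats concrete, the following Python sketch evaluates word-based rules over documents represented as sets of words and computes purity in its standard form (the fraction of documents that fall in the majority class of their cluster). It is only an illustration of the rule formats described in the abstract, not the RGC or RGC-D algorithm itself: the rules here are written by hand rather than learned, and the function names, the toy documents, the labels, and the choice to encode the richer format as a disjunction of conjunctions over presence/absence conditions are all assumptions made for this example.

```python
from collections import Counter


def satisfies_disjunction(doc_words, words):
    """RGC-D-style rule (assumed encoding): true if any listed word is present."""
    return any(w in doc_words for w in words)


def satisfies_dnf(doc_words, clauses):
    """Richer rule (assumed encoding): a disjunction of conjunctions.

    Each clause is a list of (word, must_be_present) conditions, so a
    condition can require either occurrence or nonoccurrence of a word.
    """
    return any(
        all((word in doc_words) == present for word, present in clause)
        for clause in clauses
    )


def cluster_by_rule(docs, rule_fn):
    """A cluster is exactly the set of documents that satisfy its rule."""
    return {doc_id for doc_id, words in docs.items() if rule_fn(words)}


def purity(clusters, labels):
    """Standard purity: sum of majority-class counts over total cluster size."""
    total = sum(len(c) for c in clusters)
    if total == 0:
        return 0.0
    majority = sum(
        Counter(labels[d] for d in c).most_common(1)[0][1]
        for c in clusters
        if c
    )
    return majority / total


# Hypothetical toy corpus: document id -> set of words it contains.
docs = {
    "d1": {"stock", "market", "shares"},
    "d2": {"football", "goal", "league"},
    "d3": {"market", "football", "sponsor"},
    "d4": {"shares", "dividend"},
}
# Hypothetical ground-truth labels, used only for evaluation.
labels = {"d1": "finance", "d2": "sport", "d3": "sport", "d4": "finance"}

# Simple disjunction: "stock" OR "shares".
finance = cluster_by_rule(
    docs, lambda w: satisfies_disjunction(w, ["stock", "shares"])
)

# Disjunction of conjunctions: "football" OR ("market" AND NOT "shares").
sport = cluster_by_rule(
    docs,
    lambda w: satisfies_dnf(
        w, [[("football", True)], [("market", True), ("shares", False)]]
    ),
)

print(sorted(finance), sorted(sport), purity([finance, sport], labels))
```

Because a cluster is defined entirely by its rule, editing the rule (for example, adding or removing a word from the disjunction) directly changes cluster membership, which is one way to read the reconfigurability the abstract refers to.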

Author information

Corresponding author

Correspondence to Deepak P.

About this article

Cite this article

Balachandran, V., Deepak P & Khemani, D. Interpretable and reconfigurable clustering of document datasets by deriving word-based rules. Knowl Inf Syst 32, 475–503 (2012). https://doi.org/10.1007/s10115-011-0446-9

  • DOI: https://doi.org/10.1007/s10115-011-0446-9