Abstract
Clusters of text documents output by clustering algorithms are often hard to interpret. We describe motivating real-world scenarios that necessitate reconfigurability and high interpretability of clusters, and we outline the problem of generating clusterings with interpretable and reconfigurable cluster models. We develop two clustering algorithms, RGC and RGC-D, toward this goal. Both generate clusters with associated rules that are composed of conditions on word occurrences or nonoccurrences. The two approaches differ in the complexity of the rule format: RGC employs disjunctions and conjunctions in rule generation, whereas RGC-D rules are simple disjunctions of conditions signifying the presence of various words. In both cases, each cluster comprises precisely the set of documents that satisfy the corresponding rule. Rules of the latter kind are easy to interpret, whereas rules of the former kind lead to more accurate clustering. We show that our approaches outperform the unsupervised decision tree approach for rule-generating clustering, and also an approach we provide for generating interpretable models for general clusterings, both by significant margins. We empirically show that the purity and F-measure losses incurred to achieve interpretability can be as little as 3% and 5%, respectively, using the algorithms presented herein.
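To make the rule format concrete, the following is a minimal Python sketch of how a disjunctive, RGC-D-style rule could define a cluster: a document belongs to a cluster exactly when it contains at least one of the rule's words. It does not reproduce the rule-generation algorithms themselves. The function names, the toy documents, the labels, and the hand-written rules are all hypothetical, and the purity computation uses the common majority-label definition, which may differ in detail from the one used in the article.

```python
from collections import Counter


def satisfies_disjunction(doc_words, rule_words):
    """Return True if the document contains at least one word of the rule."""
    return not rule_words.isdisjoint(doc_words)


def cluster_by_rules(docs, rules):
    """Map each rule name to the indices of the documents that satisfy it."""
    return {
        name: [i for i, words in enumerate(docs)
               if satisfies_disjunction(words, rule_words)]
        for name, rule_words in rules.items()
    }


def purity(clusters, labels):
    """Fraction of clustered documents carrying their cluster's majority label."""
    total = sum(len(members) for members in clusters.values())
    if total == 0:
        return 0.0
    majority = sum(
        max(Counter(labels[i] for i in members).values())
        for members in clusters.values() if members
    )
    return majority / total


# Hypothetical toy data: each document is the set of words it contains.
docs = [
    {"goal", "match", "team"},
    {"election", "vote"},
    {"team", "coach"},
    {"vote", "senate", "team"},
]
labels = ["sports", "politics", "sports", "politics"]

# Hand-written disjunctive rules (assumed for illustration, not learned).
rules = {
    "c1": {"goal", "coach"},      # goal OR coach
    "c2": {"vote", "election"},   # vote OR election
}

clusters = cluster_by_rules(docs, rules)
print(clusters)
print(purity(clusters, labels))
```

Because cluster membership is determined entirely by the rule, editing a rule's word set (for example, adding or dropping a word) immediately reconfigures the cluster, which is the sense in which such cluster models are reconfigurable.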
About this article
Cite this article
Balachandran, V., Deepak P & Khemani, D. Interpretable and reconfigurable clustering of document datasets by deriving word-based rules. Knowl Inf Syst 32, 475–503 (2012). https://doi.org/10.1007/s10115-011-0446-9
DOI: https://doi.org/10.1007/s10115-011-0446-9