Interpretable and reconfigurable clustering of document datasets by deriving word-based rules

Balachandran, Vipin; Deepak P; Khemani, Deepak

doi:10.1007/s10115-011-0446-9

Regular Paper
Published: 12 November 2011

Volume 32, pages 475–503, (2012)
Knowledge and Information Systems

Vipin Balachandran¹, Deepak P² & Deepak Khemani³

Abstract

Clusters of text documents output by clustering algorithms are often hard to interpret. We describe motivating real-world scenarios that necessitate reconfigurability and high interpretability of clusters, and we outline the problem of generating clusterings with interpretable and reconfigurable cluster models. We develop two clustering algorithms toward this goal. They generate clusters with associated rules that are composed of conditions on word occurrences or nonoccurrences. The proposed approaches vary in the complexity of the rule format: RGC employs disjunctions and conjunctions in rule generation, whereas RGC-D rules are simple disjunctions of conditions signifying the presence of various words. In both cases, each cluster comprises precisely the set of documents that satisfy the corresponding rule. Rules of the latter kind are easy to interpret, whereas those of the former kind lead to more accurate clustering. We show that our approaches outperform the unsupervised decision tree approach for rule-generating clustering, and also an approach we provide for generating interpretable models for general clusterings, both by significant margins. We empirically show that the purity and F-measure losses incurred to achieve interpretability can be as little as 3% and 5%, respectively, using the algorithms presented herein.
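
The idea that a cluster is exactly the set of documents satisfying a word-based rule can be illustrated with a minimal sketch. This is not the authors' RGC or RGC-D algorithm (the abstract does not describe how rules are derived); it only evaluates hand-written rules on toy data. The rule encodings, the function names, the documents, and the labels are all assumptions made for illustration. In particular, the richer rule format is assumed here to be a disjunction of conjunctions of presence/absence conditions, and purity is computed in its common form as the fraction of cluster members carrying their cluster's majority label.

```python
from collections import Counter

# Assumed encoding: a literal is (word, present). present=True requires the
# word to occur in the document; present=False requires it not to occur.

def satisfies_disjunction(doc_words, words):
    """RGC-D-style rule: true if the document contains any listed word."""
    return any(w in doc_words for w in words)

def satisfies_dnf(doc_words, rule):
    """Assumed richer rule: a disjunction of conjunctions of literals."""
    return any(
        all((word in doc_words) == present for word, present in conjunction)
        for conjunction in rule
    )

def cluster_by_rules(docs, rules, matcher):
    """Each cluster is exactly the set of documents satisfying its rule."""
    return {
        name: {doc_id for doc_id, words in docs.items() if matcher(words, rule)}
        for name, rule in rules.items()
    }

def purity(clusters, labels):
    """Fraction of cluster members that carry their cluster's majority label."""
    total = sum(len(members) for members in clusters.values())
    if total == 0:
        return 0.0
    majority = sum(
        Counter(labels[d] for d in members).most_common(1)[0][1]
        for members in clusters.values()
        if members
    )
    return majority / total

# Toy data (hypothetical): documents as sets of words, plus reference labels.
docs = {
    "d1": {"goal", "match", "league"},
    "d2": {"election", "vote", "party"},
    "d3": {"match", "vote"},
    "d4": {"league", "coach"},
}
labels = {"d1": "sports", "d2": "politics", "d3": "politics", "d4": "sports"}

# Simple disjunctive rules: presence of any listed word.
simple_rules = {"sports": ["match", "league"], "politics": ["election", "vote"]}
simple = cluster_by_rules(docs, simple_rules, satisfies_disjunction)

# Richer rules: (match AND NOT vote) OR league; vote.
detailed_rules = {
    "sports": [[("match", True), ("vote", False)], [("league", True)]],
    "politics": [[("vote", True)]],
}
detailed = cluster_by_rules(docs, detailed_rules, satisfies_dnf)

print(simple, purity(simple, labels))
print(detailed, purity(detailed, labels))
```

On this toy data the simple disjunctive rules place d3 in both clusters, while the nonoccurrence condition in the richer rule excludes it from the sports cluster. That mirrors, in miniature, the trade-off the abstract describes between rule simplicity and clustering accuracy, though the numbers here say nothing about the paper's actual results.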

Author information

Authors and Affiliations

VMware, Bangalore, India: Vipin Balachandran
IBM Research, Bangalore, India: Deepak P
IIT Madras, Chennai, India: Deepak Khemani

Corresponding author

Correspondence to Deepak P.

About this article

Cite this article

Balachandran, V., Deepak P & Khemani, D. Interpretable and reconfigurable clustering of document datasets by deriving word-based rules. Knowl Inf Syst 32, 475–503 (2012). https://doi.org/10.1007/s10115-011-0446-9

Received: 02 November 2010
Revised: 08 June 2011
Accepted: 22 October 2011
Published: 12 November 2011
Issue Date: September 2012
DOI: https://doi.org/10.1007/s10115-011-0446-9
