Data Mining and Knowledge Discovery, Volume 21, Issue 1, pp 153–185

Hierarchical document clustering using local patterns

  • Hassan H. Malik
  • John R. Kender
  • Dmitriy Fradkin
  • Fabian Moerchen

Abstract

The global pattern mining step in existing pattern-based hierarchical clustering algorithms may result in an unpredictable number of patterns. In this paper, we propose IDHC, a pattern-based hierarchical clustering algorithm that builds a cluster hierarchy without mining for globally significant patterns. IDHC first discovers locally promising patterns by allowing each instance to “vote” for its representative size-2 patterns in a way that ensures an effective balance between local pattern frequency and pattern significance in the dataset. The cluster hierarchy (i.e., the global model) is then constructed directly, using these locally promising patterns as features. Each pattern forms an initial (possibly overlapping) cluster, and the rest of the cluster hierarchy is obtained through a unique iterative cluster refinement process. By effectively utilizing instance-to-cluster relationships, this process directly identifies clusters for each level in the hierarchy and efficiently prunes duplicate clusters. Furthermore, IDHC produces cluster labels that are more descriptive (patterns are not artificially restricted), and adopts a soft clustering scheme that allows instances to exist in suitable nodes at various levels in the cluster hierarchy. We present results of experiments performed on 16 standard text datasets, and show that IDHC outperforms state-of-the-art hierarchical clustering algorithms in terms of average entropy and FScore measures.
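
The voting step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how local frequency and dataset-level significance are combined, so the score below (the smaller of the two term counts, multiplied by a log-scaled inverse support) is a hypothetical stand-in. The function name `initial_clusters`, the `top_k` parameter, and the choice to place every instance containing a selected pattern into that pattern's cluster are likewise assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import combinations


def initial_clusters(docs, top_k=1):
    """Let each document vote for its top_k size-2 patterns.

    docs: list of dicts mapping term -> count within that document.
    Returns a dict mapping each selected pattern (a sorted term pair)
    to the set of document indices that contain both terms.
    """
    n = len(docs)

    # Dataset-level support: number of documents containing each term pair.
    support = defaultdict(int)
    for doc in docs:
        for pair in combinations(sorted(doc), 2):
            support[pair] += 1

    selected = set()
    for doc in docs:
        scored = []
        for pair in combinations(sorted(doc), 2):
            # Local frequency of the pair (assumption: the rarer term's count).
            local = min(doc[pair[0]], doc[pair[1]])
            # Hypothetical significance: rarer pairs in the dataset score higher.
            significance = math.log(1 + n / support[pair])
            # Negate the score so an ascending sort puts the best pair first;
            # the pair itself breaks ties deterministically.
            scored.append((-local * significance, pair))
        scored.sort()
        selected.update(pair for _, pair in scored[:top_k])

    # Each selected pattern forms an initial, possibly overlapping cluster.
    return {
        pair: {i for i, doc in enumerate(docs)
               if pair[0] in doc and pair[1] in doc}
        for pair in selected
    }


docs = [
    {"apple": 3, "pie": 2, "the": 1},
    {"apple": 2, "pie": 2, "the": 1},
    {"stock": 3, "market": 2, "the": 1},
]
clusters = initial_clusters(docs, top_k=1)
```

With this toy input, the first two documents vote for the pattern ("apple", "pie") and the third votes for ("market", "stock"), so no pattern involving the uninformative term "the" is selected. Raising `top_k` lets documents vote for more patterns and yields overlapping clusters; the iterative refinement that builds the rest of the hierarchy is not sketched here.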

Keywords

  • Pattern based hierarchical clustering
  • Interestingness measures
  • Dimensionality reduction
  • Pattern selection
  • Global modeling using local patterns

Copyright information

© The Author(s) 2010

Authors and Affiliations

  • Hassan H. Malik (1)
  • John R. Kender (2)
  • Dmitriy Fradkin (3)
  • Fabian Moerchen (3)

  1. Thomson Reuters, New York, USA
  2. Columbia University, New York, USA
  3. Siemens Corporate Research, Princeton, USA
