Data Mining and Knowledge Discovery, Volume 25, Issue 2, pp 208–242

Diverse subgroup set discovery

Open Access Article

DOI: 10.1007/s10618-012-0273-y

Cite this article as:
van Leeuwen, M. & Knobbe, A. Data Min Knowl Disc (2012) 25: 208. doi:10.1007/s10618-012-0273-y

Abstract

Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.
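To make the redundancy problem concrete, the following is a minimal sketch of a cover-based selection step, not the DSSD implementation itself. The abstract does not specify how selection is done, so the weighted-covering score, the decay parameter `alpha`, the function name `select_diverse`, and the toy candidates are all assumptions made for illustration. The idea is that a subgroup whose rows are already covered by earlier picks has its quality down-weighted, so near-duplicates of the same pattern stop crowding out other results.

```python
from collections import Counter


def select_diverse(candidates, k, alpha=0.5):
    """Greedily pick up to k subgroups, down-weighting rows already covered.

    candidates: list of (description, quality, cover), cover being a set of row ids.
    alpha: hypothetical decay in (0, 1]; alpha=1 reduces to plain top-k by quality.
    """
    remaining = list(candidates)
    selected = []
    times_covered = Counter()  # how often each row is covered by selected subgroups

    def score(cand):
        _, quality, cover = cand
        if not cover:
            return 0.0
        # Average per-row weight: rows covered c times contribute alpha ** c.
        weight = sum(alpha ** times_covered[row] for row in cover) / len(cover)
        return quality * weight

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
        times_covered.update(best[2])
    return selected


# Toy candidates (invented for illustration): two near-identical numeric
# variants of one pattern, plus one lower-quality but distinct subgroup.
candidates = [
    ("age > 40", 0.90, {1, 2, 3, 4}),
    ("age > 41", 0.89, {1, 2, 3}),
    ("city = Leiden", 0.60, {7, 8, 9}),
]

# Plain top-k keeps both variants of the "age" pattern.
top_k = sorted(candidates, key=lambda c: c[1], reverse=True)[:2]

# Cover-based selection keeps the best variant and the distinct subgroup.
diverse = select_diverse(candidates, k=2)
```

In this toy setting, top-2 by quality returns both `age` variants, while the cover-aware selection returns `age > 40` and `city = Leiden`. In a beam search, a step of this kind would be applied when choosing the beam at each level, which is where the trade-off between exploration and exploitation is decided.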

Keywords

Subgroup set discovery, Exceptional model mining, Pattern selection, Heuristic search, Diversity

Copyright information

© The Author(s) 2012

Authors and Affiliations

  1. Machine Learning, Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium
  2. Algorithmic Data Analysis, Department of Information and Computer Sciences, Faculty of Science, Universiteit Utrecht, Utrecht, The Netherlands
  3. Leiden Institute of Advanced Computer Science, Universiteit Leiden, Leiden, The Netherlands