Open Access
Article

Data Mining and Knowledge Discovery, Volume 25, Issue 2, pp 208-242

Diverse subgroup set discovery

Authors

  • Matthijs van Leeuwen
    • Machine Learning, Department of Computer Science, Katholieke Universiteit Leuven
    • Algorithmic Data Analysis, Department of Information and Computer Sciences, Faculty of Science, Universiteit Utrecht
  • Arno Knobbe
    • Leiden Institute of Advanced Computer Science, Universiteit Leiden

DOI: 10.1007/s10618-012-0273-y

Abstract

Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.
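
To make the redundancy problem concrete, the following is a minimal sketch of how a cover-based filter can differ from plain top-k selection. It is not the DSSD algorithm or any of the selection strategies proposed in the article; the `Subgroup` record, the Jaccard overlap test, the `max_overlap` threshold and the toy candidates are all assumptions made for illustration only.

```python
from typing import FrozenSet, List, NamedTuple


class Subgroup(NamedTuple):
    description: str        # human-readable condition (hypothetical)
    quality: float          # score from some quality measure, assumed given
    cover: FrozenSet[int]   # indices of the rows the subgroup covers


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Overlap of two covers: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_diverse(candidates: List[Subgroup], k: int,
                   max_overlap: float) -> List[Subgroup]:
    """Greedily keep up to k high-quality subgroups with dissimilar covers."""
    selected: List[Subgroup] = []
    # Visit candidates from best to worst quality.
    for cand in sorted(candidates, key=lambda s: s.quality, reverse=True):
        if len(selected) == k:
            break
        # Accept only if the cover overlaps little with every accepted one.
        if all(jaccard(cand.cover, s.cover) <= max_overlap for s in selected):
            selected.append(cand)
    return selected


# Toy candidates (made up): three near-identical variants and one distinct subgroup.
candidates = [
    Subgroup("age > 40", 0.90, frozenset({1, 2, 3, 4})),
    Subgroup("age > 41", 0.89, frozenset({1, 2, 3})),
    Subgroup("age > 42", 0.88, frozenset({2, 3, 4})),
    Subgroup("city = Leiden", 0.60, frozenset({7, 8, 9})),
]

# Plain top-k by quality returns two variants of the same pattern.
top_k = sorted(candidates, key=lambda s: s.quality, reverse=True)[:2]
# The cover-based filter skips the variants and keeps the distinct subgroup.
diverse = select_diverse(candidates, k=2, max_overlap=0.5)

print([s.description for s in top_k])
print([s.description for s in diverse])
```

In this toy setting, plain top-k spends both slots on variants of one numeric condition, while the filtered selection gives up some quality in the second slot in exchange for covering different rows. Applying such a filter at each level of a beam search is one way to read the exploration versus exploitation trade-off described in the abstract.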

Keywords

Subgroup set discovery; Exceptional model mining; Pattern selection; Heuristic search; Diversity

Acknowledgements

The authors wish to thank Antti Ukkonen for sharing the Finnish elections data. This research is financially supported by the Netherlands Organisation for Scientific Research (NWO) through the EMM project (number 612.065.822) and a Rubicon grant.

Open Access

This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Copyright information

© The Author(s) 2012