Diverse subgroup set discovery

Abstract

Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.
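To make the idea of selecting a set of subgroups rather than the individual top-k more concrete, the following is a minimal sketch of one possible cover-based selection step. The abstract does not spell out the selection strategies, so everything here is an assumption made for illustration: the names `select_diverse` and `top_k`, the tuple layout of a subgroup, the toy candidates, and the multiplicative weight `alpha` that discounts rows already covered by previously selected subgroups. It should not be read as the DSSD implementation.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical representation: (description, cover as a set of row ids, quality).
Subgroup = Tuple[str, FrozenSet[int], float]


def top_k(candidates: List[Subgroup], k: int) -> List[Subgroup]:
    """Baseline: keep the k candidates with the highest quality."""
    return sorted(candidates, key=lambda sg: sg[2], reverse=True)[:k]


def weighted_score(sg: Subgroup, cover_count: Dict[int, int], alpha: float) -> float:
    """Quality multiplied by the average discount over the rows in the cover.

    A row already covered c times contributes alpha ** c (assumed scheme).
    """
    _, cover, quality = sg
    discount = sum(alpha ** cover_count.get(row, 0) for row in cover) / len(cover)
    return discount * quality


def select_diverse(candidates: List[Subgroup], k: int, alpha: float = 0.9) -> List[Subgroup]:
    """Greedily pick up to k subgroups, penalising overlap with earlier picks."""
    remaining = [sg for sg in candidates if sg[1]]  # skip empty covers
    selected: List[Subgroup] = []
    cover_count: Dict[int, int] = {}
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda sg: weighted_score(sg, cover_count, alpha))
        remaining.remove(best)
        selected.append(best)
        for row in best[1]:
            cover_count[row] = cover_count.get(row, 0) + 1
    return selected


# Toy candidates: three near-identical numeric thresholds and one distinct subgroup.
candidates: List[Subgroup] = [
    ("age > 40", frozenset({1, 2, 3, 4}), 0.90),
    ("age > 41", frozenset({1, 2, 3, 4}), 0.89),
    ("age > 42", frozenset({1, 2, 3}), 0.88),
    ("city = X", frozenset({7, 8, 9}), 0.60),
]

print([sg[0] for sg in top_k(candidates, 2)])
print([sg[0] for sg in select_diverse(candidates, 2, alpha=0.5)])
```

On this toy input, plain top-2 keeps the two near-identical age thresholds, whereas the weighted selection keeps "age > 40" and "city = X". In a beam search, a step of this kind could replace the plain top-k choice of the beam at each level, which is one way to trade some exploitation for exploration.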

References

  1. Abudawood T, Flach P (2009) Evaluation measures for multi-class subgroup discovery. In: Proceedings of the ECML/PKDD’09, Bled, pp 35–50

  2. Atzmüller M, Lemmerich F (2009) Fast subgroup discovery for continuous target concepts. In: Proceedings of ISMIS ’09, Prague, pp 35–44

  3. Aumann Y, Lindell Y (1999) A statistical theory for quantitative association rules. In: Proceedings of KDD’99, San Diego, pp 261–270

  4. Bailey J, Dong G (2007) Contrast data mining: methods and applications. Tutorial at the IEEE international conference on data mining (ICDM), Omaha

  5. Bay S, Pazzani M (2001) Detecting group differences: mining contrast sets. Data Min Knowl Discov 5(3): 213–246

  6. Bringmann B, Zimmermann A (2007) The chosen few: on identifying valuable patterns. In: Proceedings of the ICDM’07, Omaha, pp 63–72

  7. Clark P, Boswell R (1991) Rule induction with CN2: some recent improvements. In: Proceedings of the European working session on learning (EWSL-91), Porto, pp 151–163

  8. Clark P, Niblett T (1989) The CN2 induction algorithm. Mach Learn 3: 261–283

  9. Cover T, Thomas J (2006) Elements of information theory, 2nd ed. Wiley, New York

  10. Daly O, Taniar D (2005) Exception rules in data mining. In: Encyclopedia of information science and technology (II), pp 1144–1148

  11. Dong G, Zhang X, Wong L, Li J (1999) CAEP: classification by aggregating emerging patterns. In: Proceedings of DS’99, Tokyo, pp 30–42

  12. Duivesteijn W, Knobbe A, Feelders A, van Leeuwen M (2010) Subgroup discovery meets Bayesian networks: an exceptional model mining approach. In: Proceedings of the ICDM’10, Sydney, pp 158–167

  13. Friedman J, Fisher N (1999) Bump hunting in high-dimensional data. Stat Comput 9(2): 123–143

  14. Garriga G, Kralj P, Lavrac N (2008) Closed sets for labeled data. J Mach Learn Res 9: 559–580

  15. Grosskreutz H, Paurat D (2011) Fast and memory-efficient discovery of the top-k relevant subgroups in a reduced candidate space. In: Proceedings of the ECML/PKDD ’11, Athens, pp 533–548

  16. Grosskreutz H, Rüping S (2009) On subgroup discovery in numerical domains. Data Min Knowl Discov 19(2): 210–226

  17. Grosskreutz H, Rüping S, Wrobel S (2008) Tight optimistic estimates for fast subgroup discovery. In: Proceedings of the ECML/PKDD’08, Antwerp, pp 440–456

  18. Grosskreutz H, Boley M, Krause-Traudes M (2010) Subgroup discovery for election analysis: a case study in descriptive data mining. In: Proceedings of DS’10, no. 6332 in LNAI. Springer, New York, pp 57–71

  19. Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge

  20. Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1): 55–86

  21. Heikinheimo H, Fortelius M, Eronen J, Mannila H (2007) Biogeography of European land mammals shows environmentally distinct and spatially coherent clusters. J Biogeogr 34(6): 1053–1064

  22. Klösgen W (1996) Advances in knowledge discovery and data mining, chap Explora: a multipattern and multistrategy discovery assistant. MIT Press, Cambridge, pp 249–271

  23. Klösgen W (2002) Handbook of data mining and knowledge discovery, chap Subgroup discovery. Oxford University Press, Oxford

  24. Knobbe A (2006) Multi-relational data mining. IOS Press, Amsterdam

  25. Knobbe A, Ho E (2006a) Maximally informative k-itemsets and their efficient discovery. In: Proceedings of the KDD’06, Philadelphia, pp 237–244

  26. Knobbe A, Ho E (2006b) Pattern teams. In: Proceedings of the ECML PKDD’06, Berlin, pp 577–584

  27. Knobbe A, Valkonet J (2009) Building classifiers from pattern teams. In: Proceedings of the ECML PKDD’09 workshop LeGo 2009, Bled, pp 77–93

  28. Kocev D, Struyf J, Dzeroski S (2007) Beam search induction and similarity constraints for predictive clustering trees. In: LNCS KDID 2006, Berlin, pp 134–151

  29. Kralj Novak P, Lavrač N, Webb G (2009) Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res 10: 377–403

  30. Kullback S, Leibler R (1951) On information and sufficiency. Ann Math Stat 22(1): 79–86

  31. Lavrač N, Kavšek B, Flach P, Todorovski L (2004) Subgroup discovery with CN2-SD. J Mach Learn Res 5: 153–188

  32. Leman D, Feelders A, Knobbe A (2008) Exceptional model mining. In: Proceedings of the ECML/PKDD’08, vol 2, Antwerp, pp 1–16

  33. Lemmerich F, Puppe F (2011) Local models for expectation-driven subgroup discovery. In: Proceedings of the ICDM’11, Vancouver

  34. Lemmerich F, Rohlfs M, Atzmüller M (2010) Fast discovery of relevant subgroup patterns. In: Proceedings of FLAIRS, Daytona Beach

  35. Liu B, Hsu W, Ma Y (2001) Discovering the set of fundamental rule changes. In: Proceedings of KDD’01, San Francisco, pp 335–340

  36. Lowerre B (1976) The Harpy speech recognition system. PhD thesis

  37. Mannila H, Toivonen H (1996) Multiple uses of frequent sets and condensed representations. In: Proceedings of the KDD’96, Portland, pp 189–194

  38. Mitchell-Jones A, Amori G, Bogdanowicz W, Krystufek B, Reijnders P, Spitzenberger F, Stubbe M, Thissen J, Vohralik V, Zima J (1999) The atlas of European mammals. Academic Press, London

  39. Morishita S, Sese J (2000) Traversing itemset lattice with statistical metric pruning. In: Proceedings PODS, Dallas, pp 226–236

  40. Nijssen S, Guns T, De Raedt L (2009) Correlated itemset mining in ROC space: a constraint programming approach. In: Proceedings KDD’09, Paris, pp 647–656

  41. Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of the ICDT’99, Jerusalem, pp 398–416

  42. Peng H, Long F, Ding C (2005) Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8): 1226–1238

  43. Pieters B, Knobbe A, Dzeroski S (2010) Subgroup discovery in ranked data, with an application to gene set enrichment. In: Proceedings preference learning workshop (PL 2010) at ECML PKDD ’10, Barcelona

  44. Shell P, Rubio JH, Barro GQ (1994) Improving search through diversity. In: AAAI, Seattle, pp 1323–1328

  45. Tsoumakas G, Vilcek J, Spyromitros L (2010) MULAN: a java library for multi-label learning. http://mulan.sourceforge.net/

  46. van Leeuwen M (2010) Maximal exceptions with minimal descriptions. Data Min Knowl Discov 21(2): 259–276

  47. van Leeuwen M, Knobbe A (2011) Non-redundant subgroup discovery in large and complex data. In: Proceedings of the ECML PKDD’11, Athens, pp 459–474

  48. Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1): 169–214

  49. Webb G (1995) Opus: an efficient admissible algorithm for unordered search. J Artif Intell Res 3: 431–465

  50. Webb G (2001) Discovering associations with numeric variables. In: Proceedings of KDD’01, San Francisco, pp 383–388

  51. Webb G, Butler S, Newlands D (2003) On detecting differences between groups. In: Proceedings of KDD’03, Washington, pp 256–265

  52. Wrobel S (1997) An algorithm for multi-relational discovery of subgroups. In: Proceedings of PKDD 1997. Springer, Heidelberg, pp 78–87

  53. Yan X, Han J (2002) gSpan: Graph-based substructure pattern mining. In: Proceedings of the ICDM’02, Maebashi, pp 721–724

Acknowledgements

The authors wish to thank Antti Ukkonen for sharing the Finnish elections data. This research is financially supported by the Netherlands Organisation for Scientific Research (NWO) through the EMM project (number 612.065.822) and a Rubicon grant.

Open Access

This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Author information

Corresponding author

Correspondence to Matthijs van Leeuwen.

Additional information

The research described in this paper builds upon and extends the work appearing in ECML PKDD’11 (van Leeuwen and Knobbe 2011).

Responsible editors: Dimitrios Gunopulos, Donato Malerba, Michalis Vazirgiannis.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

van Leeuwen, M., Knobbe, A. Diverse subgroup set discovery. Data Min Knowl Disc 25, 208–242 (2012). https://doi.org/10.1007/s10618-012-0273-y

Keywords

  • Subgroup set discovery
  • Exceptional model mining
  • Pattern selection
  • Heuristic search
  • Diversity