Data Mining and Knowledge Discovery, Volume 25, Issue 2, pp 208–242

Diverse subgroup set discovery

Open Access
Article

Abstract

Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.
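The redundancy problem described above can be illustrated with a small sketch. The abstract does not spell out the three selection strategies, so the code below is not the DSSD algorithm. It is a hypothetical greedy selection step, of the kind that could be applied at each level of a beam search, which skips a candidate when its cover overlaps too much with an already selected subgroup. The Jaccard measure, the `max_overlap` threshold, the function names, and the toy data are all assumptions made for illustration.

```python
from typing import FrozenSet, List, Tuple

# A subgroup is modelled as (description, quality, cover), where the cover is
# the set of record ids that satisfy the description. This is an assumed
# representation, not one taken from the paper.
Subgroup = Tuple[str, float, FrozenSet[int]]


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard similarity of two covers (1.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def select_diverse(candidates: List[Subgroup], k: int,
                   max_overlap: float = 0.5) -> List[Subgroup]:
    """Greedily pick up to k subgroups by descending quality, skipping any
    candidate whose cover is too similar to one already selected."""
    selected: List[Subgroup] = []
    for cand in sorted(candidates, key=lambda s: s[1], reverse=True):
        if len(selected) == k:
            break
        if all(jaccard(cand[2], s[2]) <= max_overlap for s in selected):
            selected.append(cand)
    return selected


# Toy candidates: the first two are near-duplicates of the same pattern.
candidates: List[Subgroup] = [
    ("age > 40", 0.90, frozenset({1, 2, 3, 4})),
    ("age > 41", 0.89, frozenset({1, 2, 3})),
    ("city = X", 0.60, frozenset({7, 8, 9})),
    ("income < 10", 0.50, frozenset({4, 5, 6})),
]

# Plain top-k keeps both variants of the age pattern.
top_k = sorted(candidates, key=lambda s: s[1], reverse=True)[:2]

# The overlap-aware selection replaces the second variant with a subgroup
# that covers different records.
diverse = select_diverse(candidates, k=2)
```

In this toy setting, plain top-2 returns the two `age` variants, whereas the overlap-aware selection returns `age > 40` and `city = X`. It gives up a little quality in the second slot in exchange for covering a different part of the data, which is the exploration versus exploitation balance the abstract refers to.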

Keywords

Subgroup set discovery, Exceptional model mining, Pattern selection, Heuristic search, Diversity

Copyright information

© The Author(s) 2012

Authors and Affiliations

  1. Machine Learning, Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium
  2. Algorithmic Data Analysis, Department of Information and Computer Sciences, Faculty of Science, Universiteit Utrecht, Utrecht, The Netherlands
  3. Leiden Institute of Advanced Computer Science, Universiteit Leiden, Leiden, The Netherlands
