Data Mining and Knowledge Discovery, Volume 30, Issue 3, pp 711–762

Fast exhaustive subgroup discovery with numerical target concepts

  • Florian Lemmerich
  • Martin Atzmueller
  • Frank Puppe

Abstract

Subgroup discovery is a key data mining method that aims at identifying descriptions of subsets of the data that show an interesting distribution with respect to a pre-defined target concept. For practical applications, the integration of numerical data is crucial. Accordingly, a wide variety of interestingness measures that use a numerical attribute as the target concept has been proposed in the literature. However, efficient mining in this setting is still an open issue. In this paper, we present novel techniques for fast exhaustive subgroup discovery with a numerical target concept. We first survey previously proposed measures in this setting. Then, we explore options for pruning the search space using optimistic estimate bounds. Specifically, we introduce novel bounds in closed form, as well as ordering-based bounds as a new technique to derive estimates for several types of interestingness measures with no previously known bounds. In addition, we investigate efficient data structures, namely adapted FP-trees and bitset-based data representations, and discuss their interdependencies with interestingness measures and pruning schemes. The presented techniques are incorporated into two novel algorithms. Finally, the benefits of the proposed pruning bounds and algorithms are assessed and compared in an extensive experimental evaluation on 24 publicly available datasets. The novel algorithms consistently reduce runtimes by more than one order of magnitude.
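
To make the idea of optimistic estimate pruning over a bitset-based representation concrete, the following is a minimal sketch and not one of the two algorithms presented in the paper. It assumes one common mean-based interestingness measure, q(P) = n_P · (mean_P − mean_all), for which the sum of the positive deviations inside a subgroup is an upper bound on the quality of any of its specializations. The function name `discover`, its parameters, and the toy data in the usage note are hypothetical and chosen for illustration only.

```python
def discover(targets, selectors, k=3, max_depth=2):
    """Top-k subgroups by q(P) = n_P * (mean_P - mean_all).

    targets:   list of numerical target values, one per instance
    selectors: dict mapping a selector name to the set of instance
               indices it covers (hypothetical input format)
    """
    n = len(targets)
    mu = sum(targets) / n
    dev = [t - mu for t in targets]  # deviation of each instance from the mean
    names = list(selectors)
    # Cover of each selector as a bitset (a Python int): bit i is set
    # if instance i satisfies the selector.
    covers = [sum(1 << i for i in selectors[name]) for name in names]
    result = []  # (quality, description) pairs, best first, at most k entries

    def stats(bits):
        """Return (quality, optimistic estimate) of the subgroup 'bits'."""
        q = oe = 0.0
        i = 0
        while bits:
            if bits & 1:
                q += dev[i]          # n_P * (mean_P - mu) equals the sum of deviations
                if dev[i] > 0:
                    oe += dev[i]     # best possible subset keeps only positive deviations
            bits >>= 1
            i += 1
        return q, oe

    def threshold():
        # Quality a new subgroup must exceed to enter the top-k list.
        return result[-1][0] if len(result) == k else float("-inf")

    def search(start, bits, desc):
        for j in range(start, len(names)):
            new_bits = bits & covers[j]  # conjunction = bitwise AND of covers
            if not new_bits:
                continue                 # empty subgroup, nothing to evaluate
            q, oe = stats(new_bits)
            new_desc = desc + (names[j],)
            if q > threshold():
                result.append((q, new_desc))
                result.sort(key=lambda r: -r[0])
                del result[k:]
            # Prune: no specialization can beat the current top-k threshold.
            if len(new_desc) < max_depth and oe > threshold():
                search(j + 1, new_bits, new_desc)

    search(0, (1 << n) - 1, ())
    return result
```

For example, with the toy input `discover([10, 12, 1, 2, 11, 3], {"a": {0, 1, 4}, "b": {0, 2}, "c": {1, 4, 5}}, k=2)`, the branches starting at "b" and "c" are never expanded, because their optimistic estimates do not exceed the quality of the second-best subgroup found so far. The sketch recomputes statistics bit by bit for readability; a practical implementation would use faster population-count and aggregation routines.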

Keywords

Subgroup discovery, Pattern mining, Numerical data, Pruning, Data structures, Data mining, Algorithms

Notes

Acknowledgments

This work has been partially supported by the VENUS research cluster at the interdisciplinary Research Center for Information System Design (ITeG) at Kassel University.

Copyright information

© The Author(s) 2015

Authors and Affiliations

  • Florian Lemmerich (1)
  • Martin Atzmueller (2)
  • Frank Puppe (3)

  1. Computational Social Science Department, GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany
  2. Research Center for Information System Design (ITeG), Knowledge and Data Engineering Group, University of Kassel, Kassel, Germany
  3. Institute of Computer Science, Artificial Intelligence and Applied Computer Science Group, University of Würzburg, Würzburg, Germany
