Skip to main content

Discovering Outstanding Subgroup Lists for Numeric Targets Using MDL

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2020)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 12457))

Abstract

The task of subgroup discovery (SD) is to find interpretable descriptions of subsets of a dataset that stand out with respect to a target attribute. To address the problem of mining large numbers of redundant subgroups, subgroup set discovery (SSD) has been proposed. State-of-the-art SSD methods have their limitations though, as they typically heavily rely on heuristics and/or user-chosen hyperparameters.

We propose a dispersion-aware problem formulation for subgroup set discovery that is based on the minimum description length (MDL) principle and subgroup lists. We argue that the best subgroup list is the one that best summarizes the data given the overall distribution of the target. We restrict our focus to a single numeric target variable and show that our formalization coincides with an existing quality measure when finding a single subgroup, but that—in addition—it allows to trade off subgroup quality with the complexity of the subgroup. We next propose SSD++, a heuristic algorithm for which we empirically demonstrate that it returns outstanding subgroup lists: non-redundant sets of compact subgroups that stand out by having strongly deviating means and small spread.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    The extended version of this work is available on arXiv [16].

  2. 2.

    To obtain code lengths in bits, all logarithms in this paper are to the base 2.

  3. 3.

    \(L_\mathbb {N}(i)= \log k_0 + \log ^{*} i \), where \(\log ^{*} i = \log i + \log \log i + \ldots \) and \( k_0 \approx 2.865064\).

  4. 4.

    See proof in Appendix 2 of the extended version [16].

  5. 5.

    The full derivation of the Bayesian encoding and an in-depth explanation are given in Appendix 1 of the extended version [16].

  6. 6.

    Derivations are given in Appendix 4 of the extended version [16].

  7. 7.

    For the implementation of SSD++ and to reproduce the experiments see Proença [15].

  8. 8.

    http://www.patternsthatmatter.org/software.php#dssd/.

  9. 9.

    http://www.keel.es/.

References

  1. Antonio, N., de Almeida, A., Nunes, L.: Hotel booking demand datasets. Data Brief 22, 41–49 (2019)

    Article  Google Scholar 

  2. Atzmueller, M.: Subgroup discovery. Wiley Interdisc. Rev. Data Min. Knowl. Disc. 5(1), 35–49 (2015)

    Article  Google Scholar 

  3. Belfodil, A., et al.: FSSD-a fast and efficient algorithm for subgroup set discovery. In: Proceedings of DSAA 2019 (2019)

    Google Scholar 

  4. Boley, M., Goldsmith, B.R., Ghiringhelli, L.M., Vreeken, J.: Identifying consistent statements about numerical data with dispersion-corrected subgroup discovery. Data Min. Knowl. Disc. 31(5), 1391–1418 (2017). https://doi.org/10.1007/s10618-017-0520-3

    Article  MathSciNet  MATH  Google Scholar 

  5. Bosc, G., Boulicaut, J.F., Raïssi, C., Kaytoue, M.: Anytime discovery of a diverse set of patterns with Monte Carlo tree search. Data Min. Knowl. Disc. 32(3), 604–650 (2018). https://doi.org/10.1007/s10618-017-0547-5

    Article  MathSciNet  MATH  Google Scholar 

  6. Gönen, M., Johnson, W.O., Lu, Y., Westfall, P.H.: The Bayesian two-sample t test. Am. Stat. 59(3), 252–257 (2005)

    Article  MathSciNet  Google Scholar 

  7. Grünwald, P., Roos, T.: Minimum description length revisited. Int. J. Math. Ind. 11(1), 1930001 (29 p.) (2019)

    Article  MathSciNet  Google Scholar 

  8. Grünwald, P.D.: The Minimum Description Length Principle. MIT Press, Cambridge (2007)

    Book  Google Scholar 

  9. Klösgen, W.: Explora: a multipattern and multistrategy discovery assistant. In: Advances in Knowledge Discovery and Data Mining, pp. 249–271 (1996)

    Google Scholar 

  10. Lavrač, N., Kavšek, B., Flach, P., Todorovski, L.: Subgroup discovery with CN2-SD. J. Mach. Learn. Res. 5, 153–188 (2004)

    MathSciNet  Google Scholar 

  11. van Leeuwen, M.: Maximal exceptions with minimal descriptions. Data Min. Knowl. Disc. 21(2), 259–276 (2010). https://doi.org/10.1007/s10618-010-0187-5

    Article  MathSciNet  Google Scholar 

  12. van Leeuwen, M., Knobbe, A.: Diverse subgroup set discovery. Data Min. Knowl. Disc. 25(2), 208–242 (2012). https://doi.org/10.1007/s10618-012-0273-y

    Article  MathSciNet  Google Scholar 

  13. Lijffijt, J., Kang, B., Duivesteijn, W., Puolamaki, K., Oikarinen, E., De Bie, T.: Subjectively interesting subgroup discovery on real-valued targets. In: 2018 IEEE ICDE, pp. 1352–1355. IEEE (2018)

    Google Scholar 

  14. Meeng, M., Knobbe, A.: For real: a thorough look at numeric attributes in subgroup discovery. Data Min. Knowl. Disc. 35(1), 158–212 (2021)

    Article  MathSciNet  Google Scholar 

  15. Proença, H.M. : HMProenca/SSDpp-numeric: v2020.06.0 (2020). https://github.com/HMProenca/SSDpp-numeric. Archived at https://doi.org/10.5281/zenodo.3901236

  16. Proença, H.M., Grünwald, P., Bäck, T., van Leeuwen, M.: Discovering outstanding subgroup lists for numeric targets using MDL. Preprint arXiv:2006.09186 (2020)

  17. Proença, H.M., Klijn, R., Bäck, T., van Leeuwen, M.: Identifying flight delay patterns using diverse subgroup discovery. In: 2018 SSCI, pp. 60–67. IEEE (2018)

    Google Scholar 

  18. Proença, H.M., van Leeuwen, M.: Interpretable multiclass classification by MDL-based rule lists. Inf. Sci. 512, 1372–1393 (2020)

    Article  Google Scholar 

  19. Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)

    Article  Google Scholar 

  20. Rouder, J.N., Speckman, P.L., Sun, D., Morey, R.D., Iverson, G.: Bayesian t tests for accepting and rejecting the null hypothesis. Psychon. Bull. Rev. 16(2), 225–237 (2009)

    Article  Google Scholar 

  21. Van Leeuwen, M., Galbrun, E.: Association discovery in two-view data. IEEE Trans. Knowl. Data Eng. 27(12), 3190–3202 (2015)

    Article  Google Scholar 

  22. Vreeken, J., Van Leeuwen, M., Siebes, A.: KRIMP: mining itemsets that compress. Data Min. Knowl. Disc. 23(1), 169–214 (2011). https://doi.org/10.1007/s10618-010-0202-x

    Article  MathSciNet  MATH  Google Scholar 

Download references

Acknowledgment

This work is part of the research programme Indo-Dutch Joint Research Programme for ICT 2014 with project number 629.002.201, SAPPAO, which is financed by the Netherlands Organisation for Scientific Research.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hugo M. Proença .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Proença, H.M., Grünwald, P., Bäck, T., Leeuwen, M.v. (2021). Discovering Outstanding Subgroup Lists for Numeric Targets Using MDL. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science(), vol 12457. Springer, Cham. https://doi.org/10.1007/978-3-030-67658-2_2

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-67658-2_2

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-67657-5

  • Online ISBN: 978-3-030-67658-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics