Discovering Outstanding Subgroup Lists for Numeric Targets Using MDL

Proença, Hugo M.; Grünwald, Peter; Bäck, Thomas; Leeuwen, Matthijs van

doi:10.1007/978-3-030-67658-2_2

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12457)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

The task of subgroup discovery (SD) is to find interpretable descriptions of subsets of a dataset that stand out with respect to a target attribute. To address the problem of mining large numbers of redundant subgroups, subgroup set discovery (SSD) has been proposed. State-of-the-art SSD methods have their limitations, though, as they typically rely heavily on heuristics and/or user-chosen hyperparameters.

We propose a dispersion-aware problem formulation for subgroup set discovery that is based on the minimum description length (MDL) principle and subgroup lists. We argue that the best subgroup list is the one that best summarizes the data given the overall distribution of the target. We restrict our focus to a single numeric target variable and show that our formalization coincides with an existing quality measure when finding a single subgroup, but that, in addition, it allows trading off subgroup quality against subgroup complexity. We next propose SSD++, a heuristic algorithm for which we empirically demonstrate that it returns outstanding subgroup lists: non-redundant sets of compact subgroups that stand out by having strongly deviating means and small spread.
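As a rough illustration of what a subgroup list summarizes, the sketch below assigns each record to the first subgroup whose description it matches and reports the size, mean, and spread of the target per subgroup next to the overall dataset statistics. It is not SSD++ and involves no MDL scoring; the function names, the toy records, and the two hand-written descriptions are all hypothetical, and first-match semantics is assumed here as the usual reading of an ordered list.

```python
from statistics import mean, pstdev

def covers(description, attrs):
    # A description is a conjunction of attribute == value conditions.
    return all(attrs.get(key) == value for key, value in description.items())

def summarize(rows, subgroup_list):
    """Assign each row to the FIRST matching subgroup and summarize the target."""
    groups = {name: [] for name, _ in subgroup_list}
    uncovered = []
    for attrs, target in rows:
        for name, description in subgroup_list:
            if covers(description, attrs):
                groups[name].append(target)
                break
        else:
            uncovered.append(target)  # no subgroup in the list matched
    all_targets = [target for _, target in rows]
    # Overall statistics serve as the reference that subgroups deviate from.
    summary = {"dataset": (len(all_targets), mean(all_targets), pstdev(all_targets))}
    for name, values in groups.items():
        if values:
            summary[name] = (len(values), mean(values), pstdev(values))
    summary["uncovered"] = len(uncovered)
    return summary

# Hypothetical toy data: (attributes, numeric target).
rows = [
    ({"room": "suite", "season": "summer"}, 300.0),
    ({"room": "suite", "season": "summer"}, 310.0),
    ({"room": "suite", "season": "winter"}, 200.0),
    ({"room": "single", "season": "winter"}, 90.0),
    ({"room": "single", "season": "summer"}, 100.0),
]
# Hand-written ordered list; a real method would search for these descriptions.
subgroup_list = [
    ("suite & summer", {"room": "suite", "season": "summer"}),
    ("suite", {"room": "suite"}),
]
summary = summarize(rows, subgroup_list)
for name, stats in summary.items():
    print(name, stats)
```

Because the list is ordered, the second subgroup only receives the suite records not already claimed by the first, which is what keeps the subgroups in a list from overlapping.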

Notes

1. The extended version of this work is available on arXiv [16].
2. To obtain code lengths in bits, all logarithms in this paper are to the base 2.
3. \(L_\mathbb{N}(i) = \log k_0 + \log^{*} i\), where \(\log^{*} i = \log i + \log \log i + \ldots\) and \(k_0 \approx 2.865064\).
4. See proof in Appendix 2 of the extended version [16].
5. The full derivation of the Bayesian encoding and an in-depth explanation are given in Appendix 1 of the extended version [16].
6. Derivations are given in Appendix 4 of the extended version [16].
7. For the implementation of SSD++ and to reproduce the experiments see Proença [15].
8. http://www.patternsthatmatter.org/software.php#dssd/.
9. http://www.keel.es/.
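
The code length in note 3 can be evaluated numerically. The following is a minimal sketch, assuming the usual convention for the universal code for integers that the sum in \(\log^{*} i\) includes only its positive terms (the note itself leaves the stopping point implicit); the function names are illustrative and are not taken from the SSD++ implementation. Logarithms are to base 2, as stated in note 2.

```python
import math

K0 = 2.865064  # constant k_0 from note 3

def log_star(i: int) -> float:
    """log2(i) + log2(log2(i)) + ..., summing only the positive terms (assumed convention)."""
    total = 0.0
    term = math.log2(i)
    while term > 0:
        total += term
        term = math.log2(term)
    return total

def universal_code_length(i: int) -> float:
    """Code length in bits of a positive integer i: log2(k_0) + log*(i)."""
    if i < 1:
        raise ValueError("defined for integers i >= 1 only")
    return math.log2(K0) + log_star(i)

for i in (1, 2, 16, 1000):
    print(i, round(universal_code_length(i), 3))
```

For example, \(\log^{*} 16 = 4 + 2 + 1 = 7\), so the code length of 16 is \(7 + \log k_0\) bits; the length grows only slightly faster than \(\log i\), which is why this code is a common choice when no upper bound on the integer is known in advance.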

References

1. Antonio, N., de Almeida, A., Nunes, L.: Hotel booking demand datasets. Data Brief 22, 41–49 (2019)
2. Atzmueller, M.: Subgroup discovery. Wiley Interdisc. Rev. Data Min. Knowl. Disc. 5(1), 35–49 (2015)
3. Belfodil, A., et al.: FSSD-a fast and efficient algorithm for subgroup set discovery. In: Proceedings of DSAA 2019 (2019)
4. Boley, M., Goldsmith, B.R., Ghiringhelli, L.M., Vreeken, J.: Identifying consistent statements about numerical data with dispersion-corrected subgroup discovery. Data Min. Knowl. Disc. 31(5), 1391–1418 (2017). https://doi.org/10.1007/s10618-017-0520-3
5. Bosc, G., Boulicaut, J.F., Raïssi, C., Kaytoue, M.: Anytime discovery of a diverse set of patterns with Monte Carlo tree search. Data Min. Knowl. Disc. 32(3), 604–650 (2018). https://doi.org/10.1007/s10618-017-0547-5
6. Gönen, M., Johnson, W.O., Lu, Y., Westfall, P.H.: The Bayesian two-sample t test. Am. Stat. 59(3), 252–257 (2005)
7. Grünwald, P., Roos, T.: Minimum description length revisited. Int. J. Math. Ind. 11(1), 1930001 (29 p.) (2019)
8. Grünwald, P.D.: The Minimum Description Length Principle. MIT Press, Cambridge (2007)
9. Klösgen, W.: Explora: a multipattern and multistrategy discovery assistant. In: Advances in Knowledge Discovery and Data Mining, pp. 249–271 (1996)
10. Lavrač, N., Kavšek, B., Flach, P., Todorovski, L.: Subgroup discovery with CN2-SD. J. Mach. Learn. Res. 5, 153–188 (2004)
11. van Leeuwen, M.: Maximal exceptions with minimal descriptions. Data Min. Knowl. Disc. 21(2), 259–276 (2010). https://doi.org/10.1007/s10618-010-0187-5
12. van Leeuwen, M., Knobbe, A.: Diverse subgroup set discovery. Data Min. Knowl. Disc. 25(2), 208–242 (2012). https://doi.org/10.1007/s10618-012-0273-y
13. Lijffijt, J., Kang, B., Duivesteijn, W., Puolamaki, K., Oikarinen, E., De Bie, T.: Subjectively interesting subgroup discovery on real-valued targets. In: 2018 IEEE ICDE, pp. 1352–1355. IEEE (2018)
14. Meeng, M., Knobbe, A.: For real: a thorough look at numeric attributes in subgroup discovery. Data Min. Knowl. Disc. 35(1), 158–212 (2021)
15. Proença, H.M.: HMProenca/SSDpp-numeric: v2020.06.0 (2020). https://github.com/HMProenca/SSDpp-numeric. Archived at https://doi.org/10.5281/zenodo.3901236
16. Proença, H.M., Grünwald, P., Bäck, T., van Leeuwen, M.: Discovering outstanding subgroup lists for numeric targets using MDL. Preprint arXiv:2006.09186 (2020)
17. Proença, H.M., Klijn, R., Bäck, T., van Leeuwen, M.: Identifying flight delay patterns using diverse subgroup discovery. In: 2018 SSCI, pp. 60–67. IEEE (2018)
18. Proença, H.M., van Leeuwen, M.: Interpretable multiclass classification by MDL-based rule lists. Inf. Sci. 512, 1372–1393 (2020)
19. Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)
20. Rouder, J.N., Speckman, P.L., Sun, D., Morey, R.D., Iverson, G.: Bayesian t tests for accepting and rejecting the null hypothesis. Psychon. Bull. Rev. 16(2), 225–237 (2009)
21. Van Leeuwen, M., Galbrun, E.: Association discovery in two-view data. IEEE Trans. Knowl. Data Eng. 27(12), 3190–3202 (2015)
22. Vreeken, J., Van Leeuwen, M., Siebes, A.: KRIMP: mining itemsets that compress. Data Min. Knowl. Disc. 23(1), 169–214 (2011). https://doi.org/10.1007/s10618-010-0202-x

Acknowledgment

This work is part of the research programme Indo-Dutch Joint Research Programme for ICT 2014 with project number 629.002.201, SAPPAO, which is financed by the Netherlands Organisation for Scientific Research.

Author information

Authors and Affiliations

Leiden University, Leiden, Netherlands
Hugo M. Proença, Peter Grünwald, Thomas Bäck & Matthijs van Leeuwen
CWI, Amsterdam, Netherlands
Peter Grünwald

Corresponding author

Correspondence to Hugo M. Proença.

Editor information

Editors and Affiliations

Albert-Ludwigs-Universität, Freiburg, Germany
Frank Hutter
TU Darmstadt, Darmstadt, Germany
Kristian Kersting
Ghent University, Ghent, Belgium
Jefrey Lijffijt
Saarland University, Saarbrücken, Germany
Isabel Valera

About this paper

Cite this paper

Proença, H.M., Grünwald, P., Bäck, T., Leeuwen, M.v. (2021). Discovering Outstanding Subgroup Lists for Numeric Targets Using MDL. In: Hutter, F., Kersting, K., Lijffijt, J., Valera, I. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2020. Lecture Notes in Computer Science, vol 12457. Springer, Cham. https://doi.org/10.1007/978-3-030-67658-2_2

DOI: https://doi.org/10.1007/978-3-030-67658-2_2
Published: 25 February 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-67657-5
Online ISBN: 978-3-030-67658-2
eBook Packages: Computer Science; Computer Science (R0)