Robust subgroup discovery

We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) non-redundant. Many attempts have been made to mine either locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables, including traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, finding optimal subgroup lists is NP-hard. Therefore, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. In fact, the greedy gain is shown to be equivalent to a Bayesian one-sample proportion, multinomial, or t-test between the subgroup and dataset marginal target distributions plus a multiple hypothesis testing penalty. Furthermore, we empirically show on 54 datasets that SSD++ outperforms previous subgroup discovery methods in terms of quality, generalisation on unseen data, and subgroup list size.

on 54 datasets that SSD++ outperforms previous subgroup discovery methods in terms of quality, generalisation on unseen data, and subgroup list size.
Keywords subgroup discovery · subgroup list · the Minimum Description Length (MDL) principle · interpretability 1 Introduction Exploratory Data Analysis (EDA) (Tukey, 1977) aims at enhancing its practitioner's natural ability to recognise patterns in the data being studied. The more she explores, the more she discovers, but also the higher the risk of finding interesting results arising out of coincidences, e.g., spurious relations between variables that have no connection in the real world. Intuitively this corresponds to testing multiple hypotheses without realising it. This duality of EDA requires a thorough analysis of results and highlights the need for statistically robust techniques that allow us to explore the data in a responsible way. While EDA encompasses all techniques referring to data exploration, Subgroup Discovery (SD) (Klösgen, 1996;Atzmueller, 2015) is the subfield that is concerned with discovering interpretable descriptions of subsets of the data that stand out with respect to a given target variable, i.e., subgroups. This work aims at improving the discovery of subgroup lists, i.e., ordered sets of subsets, that describe different regions of the data while being statistically robust by themselves and against multiple hypothesis testing. Two simple examples of subgroup lists can be found in Figures 1 and 2.
Pr(animaltype = · · · | s) in % The dataset was modified, some variables removed and others discretised, for ease of presentation. n s refers to the number of instances covered by subgroup 's' defined by 'description',μ andσ its estimated mean and standard deviation for the target variable in thousands of dollars (K). * concerns instances not covered by any of the five subgroups Subgroup discovery (SD) can be seen as a generalisation of association rule mining or as the exploratory counterpart of rule learning, where the targets/consequent of the rules are fixed and rules are ranked according to quality measures combining subgroup size and deviation of the target variable(s) with respect to the overall distribution in the data. In its traditional form, subgroup discovery is also referred to as top-k subgroup mining (Atzmueller, 2015), which entails mining the k top-ranking subgroups according to a local quality measure and a number k selected by the user. Since its conception, subgroup discovery has been developed for various types of data and targets, e.g., nominal, numeric (Großkreutz and Rüping, 2009), and multi-label (van Leeuwen, 2010) targets. SD has been applied in a wide range of different domains Atzmueller, 2015), such as identifying the properties of materials , unusual consumption patterns in smart grids (Jin et al., 2014), identifying the characteristics of delayed flights (Proença et al., 2018), and understanding the influence of pace in long-distance running (De Leeuw et al., 2018). Although SD appeals to several domains, top-k mining traditionally suffers from three main issues that make it impractical for many applications: 1) poor computational efficiency of exhaustive search with the more relevant quality measures (van Leeuwen and Knobbe, 2012;Bosc et al., 2018); 2) redundancy of mined subgroups, i.e., the fact that subsets with the highest deviation according to a certain local quality measure tend to cover the same region of the dataset with slight variations in their description of the subset (van Leeuwen and Knobbe, 2012); and 3) lack of generalisation or statistical robustness of mined subgroups (van Leeuwen and Ukkonen, 2016). In this work, we focus on the last two issues together: reducing redundancy by finding small lists of subgroups that describe the differences in the data well; and obtaining statistically robust subgroups. First, we define the problem of robust subgroup discovery in its full generality using the minimum description length (MDL) principle. Second, as optimally solving the problem is unfeasible, we focus solely on subgroup lists (not sets) and propose a greedy algorithm that finds good subgroup lists using a local objective for binary, nominal, or numeric target variables. Moreover, we show that this MDL-based greedy gain is equivalent to maximising the Bayes factor between each subgroup's distribution and the dataset marginal distribution plus a penalty for multiple hypothesis testing. Since subgroup lists with only one subgroup are equivalent to top-1 mining, all statistical properties developed here can be directly applied to top-k subgroup discovery. For a formal definition of the Robust Subgroup Discovery problem and our approximation, please refer to Section 4. Note that we restrict our focus to finding subgroup lists with the MDL principle because 1) subgroup lists are one of the core subgroup set models and one of the first that was proposed (Lavrač et al., 2004); 2) they allow for an optimal formulation based on the MDL principle due to their property of unambiguously partitioning the data into non-overlapping parts; 3) the subgroups can be interpreted sequentially, i.e., from most to least relevant discovered subgroup; and 4) the MDL principle is a statistical criterion for model selection that requires the fewest assumptions possible about the model class, data, and (data) generation process.
In recent years both redundancy and statistical robustness issues have been partially addressed, mostly independent of each other. We next briefly discuss recent advances and limitations and refer the reader to Section 11 for an indepth analysis of related work. In terms of redundancy, the first main limitation of existing works is their focus on one type of target variables, such as binary targets (Bosc et al., 2018;Belfodil et al., 2019), nominal targets (Lavrač et al., 2004), or numeric targets (Lijffijt et al., 2018), where only DSSD focuses on univariate and multivariate nominal and numeric targets (van Leeuwen and Knobbe, 2012). The second main limitation is the lack of an optimality criterion for subgroup sets or lists, where the only exception is FSSD (Belfodil et al., 2019). It is important to emphasise that some works aim at finding sequential subgroups or subgroup lists, while others aim at finding unordered sets or subgroup sets. Subgroup lists are akin to rule lists  in the sense that each subgroup/rule needs to be interpreted sequentially, and thus they should be read and interpreted sequentially and cannot overlap, while subgroup sets are allowed to overlap. In this work, we focus solely on subgroup lists, and although previous works often did not use this term, we retroactively rename those models that are, in fact, subgroup lists. In terms of statistical robustness, the approaches can be divided into postprocessing techniques (Duivesteijn and Knobbe, 2011;van Leeuwen and Ukkonen, 2016) and modified quality measures (Song et al., 2016;Song, 2017). Postprocessing techniques first mine the top-k subgroups and then apply a statistical test to find statistical significance. Modified quality measure approaches, which are more closely related to our work, add a statistical test directly to the quality measure definition; however, they do not consider multiple-hypothesis testing correction and only focus on top-k mining. Next, we show how our contributions address both issues in a unified way.
Contributions. We aim to bridge the gap in the literature by finding the best non-redundant subgroup list from a global dataset perspective while guaranteeing the local quality of the found subgroups, making the approach statistically robust from both perspectives. Two examples of subgroup lists for nominal and numeric targets can be seen in Figures 1 and 2. To solve this problem, we propose a formal definition of a subgroup list and employ the Minimum Description Length (MDL) principle (Rissanen, 1978) to define an optimal subgroup list from a global perspective. We provide this formalisation for univariate and multivariate nominal and numeric targets. Notably, the subgroup that minimises the MDL-optimal formulation for a subgroup list with one subgroup is the same subgroup that would be found by top-1 subgroup discovery with Weighted Kullback-Leibler divergence (WKL) as a quality measure. This makes our proposal the first global formulation of subgroup set discovery that is a direct generalisation of traditional subgroup discovery. Thus, all statistical properties developed in this work also apply to top-k subgroup discovery. As finding optimal subgroup lists is NP-Hard, we propose SSD++, a heuristic algorithm that finds "good" subgroup lists. SSD++ combines beam searchto find individual subgroups-with greedy search-to iteratively add the bestfound subgroup to the subgroup list. Maximising the MDL criterion in each iteration guarantees that each subgroup added to the list adheres to a local statistical test equivalent to Bayesian proportions, multinomial, or t-test (for binary, nominal and numeric targets, respectively) plus a penalty to compensate for multiple hypothesis testing.
Previous work. This work is an extension of  and builds on some of the results of . The former introduces MDL-based subgroup lists for univariate numeric target variables and SSD++, a heuristic algorithm for finding such subgroup lists. The current manuscript significantly extends our previous work by generalising the MDL data encoding to three new target variable types: multivariate numeric targets, and univariate and multivariate nominal targets. To find and prove its efficacy, the algorithm and empirical results are also extended to those new target variables. Moreover, the current work provides a new interpretation of the MDL encoding and of the greedy gain as an MDL equivalent to Bayesian testing.  introduced MDL-based rule lists for classification; however, compared to the current work, it is limited to Boolean explanatory and nominal target variables, its model and data encoding are less optimal, it has no local statistical guarantees, and the algorithm was less flexible in accepting user-defined hyperparameters.
Summary of contributions. To summarise, the primary contributions presented in this work-including the contributions originally from , which we indicate with a * below-are: 1. Subgroup list model class -We define the subgroup list model class* over a tabular dataset in general (Section 4.1), providing a global formulation for the problem of sequential subgroup mining, and in particular for univariate and multivariate nominal targets (Section 6.1), and univariate numeric* and multivariate numeric targets (Section 6.2). 2. Robust subgroup discovery using MDL -We define the optimal problem of robust subgroup discovery in Section 4.2 for the case of subgroup lists using the MDL principle. We show the relation of model encoding and multiple-hypothesis testing (Section 5), and resort to the optimal Normalised Maximum Likelihood (NML) encoding for nominal targets (Section 6.1) and the Bayesian encoding with non-informative priors for numeric targets* (Section 6.2). Notably, we show that this problem formalisation is equivalent to the standard definition of top-1 subgroup discovery with WKL as a quality measure for the case of a subgroup list with one subgroup (Section 7.1). 3. Greedy MDL algorithms maximise local statistical test -We show that the greedy gain commonly used in the MDL for pattern mining literature can be interpreted as an MDL equivalent to a local Bayesian hypothesis test, a.k.a. Bayes factor, on the likelihood of the data being better fitted by the greedy extended model versus the current model plus a penalty for the extra model complexity (Section 7.2). In the case of our specific algorithm, SSD++, the greedy objective is equivalent to one-sample Bayes proportions, multinomial, and t-test (for binary, nominal and numeric* targets, respectively) plus a penalty to compensate for multiple hypothesis testing (Section 7.2).
Moreover, this work includes the following secondary contributions, the details of which are all included in the appendices for the interested reader: 4. Normalised Maximum Likelihood for partition models -Derivation of the Normalised Maximum Likelihood (NML) optimal encoding, a refined MDL encoding, for model classes that partition the data for nominal target variables-subgroup lists, rule lists, trees, etc. (Appendix A). 5. Bayesian encoding of normal distributions -Derivation of a Bayesian optimal encoding of normal distributions with non-informative priors for numeric targets* (Appendix B). It is shown that for a large number of instances, it converges to the BIC* (Appendix B.1). Similarly to the NML encoding, it can be used by any model class that unambiguously partitions the data, such as subgroup lists, rule lists, trees, etc. 6. Subgroups discovery versus rule-based prediction -We demonstrate the difference between the formal objectives for subgroup discovery and predictive rule models, such as classification rule lists, from the perspective of our MDL-based approach (Appendix D).
Structure of the paper. Besides the customary introduction, related work, and conclusion, this work contains two preliminary sections-Sections 2 and 3a problem statement section-Section 4-three theoretical sections related with MDL encoding-Sections 5 to 7-one algorithm section-Section 8-and two empirical results sections-Sections 9 and 10. The preliminary sections introduce the notation and basic concepts, while the problem statement formalises the objective of this work. Together, these three sections-where a reader familiar with the topic at hand can skim through-allow reading each of the following sections independently of each other. The specific details of each section can be described as follows. Section 2 introduces the notation used throughout this work and the preliminaries related to subgroup discovery and subgroup set discovery. After that, Section 3 presents the preliminaries related to model selection in the frequentist, Bayesian, and the Minimum Description Length (MDL) principle branches of statistics. Then, Section 4 defines the subgroup list model class, the robust subgroup discovery problem statement, and a novel quality measure for subgroup lists based on the Weighted Kullback-Leibler (WKL) divergence. After that, Section 5 shows the MDL principle model encoding part of the subgroup list and its relation to multiple hypothesis testing. Section 6 presents the data encoding for nominal and numeric targets. Then, Section 7 demonstrates how our MDLbased formulation of subgroup lists equals WKL-based subgroup discovery and Bayesian testing. After that, Section 8 presents SSD++, a heuristic algorithm to mine subgroup lists, as well as its time complexity. Then, in Section 9 we show the empirical results of our proposed method when compared against the state-of-the-art algorithms for univariate and multivariate nominal and numeric targets over 54 datasets. After that, in Section 10 we apply robust subgroup discovery to find how descriptions of the socioeconomic background affect the grades of engineering students in Colombia. Then, Section 11 covers the most relevant related work, together with the main differences to our approach. Finally, Section 12 presents our main conclusions and future work.

Preliminaries: notation and subgroup discovery
In this section, the mathematical notation used throughout this work is introduced together with all the topics of Subgroup Discovery (SD), Subgroup Set Discovery (SSD), and beam search and separate and conquer algorithms for subgroup discovery.
This section is divided as follows. First, Section 2.1 defines the notation used for data, target variables, and subgroups. Then, in Section 2.2, we present the task of subgroup discovery and how to rank the discovered subgroups according to a quality measure. A particular emphasis is given to the Kullback-Leibler divergence as it plays a fundamental role in our definition of MDLbased subgroup lists. After that, Section 2.3 defines the task of subgroup set discovery. Finally, Section 2.4 briefly describes beam search and separate and conquer, algorithms that we use to generate the top-k subgroups and add subgroups to the subgroup list. Note that we use subscripts on the dataset variables (D, X, Y, X, Y, x, y) to indicate column indices and superscripts for row indices. In the case of other notation, such as the number of elements n or statistics µ, σ we will not use the superscript as it could be confused with the exponentiation of that value. Also, X i (resp. Y i ) refers to both the properties of the i th explanatory (resp. target) variable and to all the values of this variable for a specific column.

Data and subgroups
Depending on the type and number of targets (one or multiple), the type of problem can be divided into four categories: 1) single-nominal ; 2) singlenumeric; 3) multi-nominal ; and 4) multi-numeric. In machine learning, the single-numeric case corresponds to regression, the single-nominal to classification, and in the case of more than one variable their multi-target generalisations, respectively.

Subgroups
A subgroup, denoted by s, consists of a description (also intent) that defines a cover (also extent), i.e., a subset of dataset D.
Subgroup description: A description a is a Boolean function over all explanatory variables X. Formally, it is a function a : X 1 × · · · × X m → {f alse, true}. In our case, a description a is a conjunction of conditions on X, each specifying a specific value or interval on a variable. The domain of possible conditions depends on the type of a variable: numeric variables support greater and less than {≥, ≤}; nominal support equal to {=}. The size of a description a, denoted |a|, is the number of conditioned variables it contains.
Example 1: In Figure 2, subgroup 1 has a description of size |a| = 2, with one condition on a nominal variable: {weigth = heavy}; and another on a numeric variable: {consumption-city ≤ 8km/L}.
Generic probability distribution with parameters Θ. N (µ; σ) Normal probability distribution with parameters µ and σ. Cat(p 1 , · · · , p k ) Categorical probability distribution with p i probability per category. p y|c Probability of category y given description a, i.e., Pr(y | a) µ Mean value parameter. σ Standard deviation parameter. θ Maximum likelihood estimation of parameter θ. q(a) Subgroup discovery quality measure.

Q(S)
Subgroup set discovery quality measure. f (Θ a ,Θ d ) Function of differences between distributionΘ a andΘ d . α Tradeoff between subgroup coverage and distribution difference.  The two points that make the Bayesian encoding proper. L Bayes2.0 Bayesian length of encoding made proper with first 2 points. Γ (n) Gamma function, the extension of the factorial to real numbers.
Compression gain of adding subgroup s to model M .
Level of normalisation of the compression gain. ζ Set of all items (possible single conditions) in X. dmax Beam search maximum depth of search. w b Beam search beam width. ncut Number of cut points for numeric discretisation.
Subgroup cover: The cover is the bag of instances from D where the subgroup description holds true. Formally, it is defined by: where we use a x to denote a(x) = true. Further, let n a = |D a | denote the coverage of the subgroup, i.e., the number of instances it covers.
Example 2 (continuation): In Figure 2, subgroup 1 covers 11 instances in the dataset which can be found by conditions in its description, and thus its coverage is 11.

Subgroups as probabilistic rules
As D a encompasses both the explanatory and target variables, the effect of a on the target variables can be interpreted as a probabilistic rule. Regarding the multiple target variables, we assume that they are independent. This simplifies the problem and is a common approach in multi-label classification (Herrera et al., 2016). Thus, the general form of the rule is: where y j is a value of variable Y j , Dist is a probability distribution (defined later) andΘ a j is the shorthand for the maximum likelihood estimation of the parameters of Dist over values Y a j , i.e.,Θ a j =Θ j (Y a ). Thus, y i ∼ Dist(Θ a j ) tells us that the values of variable Y j are distributed according to a distribution Dist with parametersΘ a j estimated over the values Y a j . The vector of all parameter values of a rule is denoted by Θ a . In our case, Dist can be a categorical or normal distribution in the nominal or numeric target case, respectively. In the numeric case the normal distribution is represented as: N (μ,σ). In the nominal case the distribution is Cat(p 1 , · · · ,p k ), where k is the number of classes (or categories) of the corresponding variable andp c the estimated probability for class c.

Subgroup discovery
Subgroup discovery is the data mining task of finding subgroups that stand out with respect to some given target variable(s). The definition of standing out, also known as interestingness, is quantified by a quality measure, which depends on the task at hand (Webb, 1995;Klösgen, 1996). Generally, these measures quantify quality by how different the target variable distribution of a subgroup is from what is defined as 'normal' behaviour in a dataset. In the case of structured data, a subgroup generally takes the form of an association rule, and the 'normal' behaviour is usually measured by the average behaviour of the target variable of that dataset (Atzmueller, 2015).
Quality measures. Thus, depending on the target variable and task, different quality measures can be chosen to assess the quality (or interestingness) of a subgroup description a, over is cover D a . In general, quality measures have two components: 1) representativeness of the subgroup in the data, based on coverage n a = |D a |; and 2) a function of the difference between statistics of the empirical target distribution of the pattern,Θ a =Θ(Y a ), and the overall empirical target distribution of the dataset,Θ d =Θ(Y). The latter corresponds to the statistics estimated over the whole data, e.g., in the case of the Automobile import subgroup list of Figure 2 it isΘ d = {μ = 13;σ = 8} and it is estimated over (all) 197 instances of the dataset. The general form of a quality measure to be maximised is where α allows to control the trade-off between coverage and the difference of the distributions, and f (Θ a ,Θ d ) is a function that measures how different the subgroup and dataset distributions are. As an example, the most commonly adopted quality measure for single-numeric targets is Weighted Relative Accuracy (WRAcc) (Lavrač et al., 1999), with α = 1 and f (Θ a ,Θ d ) =μ a −μ d (the difference between subgroup and dataset averages).

Weighted Kullback-Leibler divergence
Another commonly adopted measure is the Weighted-Kullback Leibler divergence (WKL) (van Leeuwen and . This is also the measure that we consider throughout this work because of 1) its flexibility in terms of (number and types of) supported target variables; and 2) its relationship to the MDL principle (see Sections 7.1); and 3) it arises from using the Logloss for assessing the goodness of fit of the dataset and subgroup distribution, which is a proper scoring rule (Song et al., 2016). WKL is defined as the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) between a subgroup's and dataset target distribution KL(Θ a ;Θ d ) linearly weighted by its coverage. Revisiting Eq.
(3) this corresponds to f (.) = KL(.) and α = 1. The definition of WKL for a univariate target variable Y is given by: where KL(Θ a ;Θ d ) is the Kullback-Leibler divergence between subgroup and dataset for target Y . The KL divergence in Eq. (4) depends on the probabilistic model chosen to describe the target variables. In its general form, the KL divergence can be defined as: where the logarithm is to the base two (like all logs in this work). Thus the choice of the distribution used to describe the target is of great importance and should reflect what the analyse would like to find in the data. Now, depending on the type of target we will see show how to compute W KL(Θ a ;Θ d ).
It is easy to see that for multivariate targets, we either use a multivariate distribution, e.g., a multivariate normal distribution or assume that they are independent target variables, where the total WKL turns out to be just the sum of the WKL for each target variable.
We will now provide the definitions of WKL for univariate categorical and normal distributions.

Weighted Kullback-Leibler for categorical distributions
In the case of a univariate nominal target Y , the distribution can be uniquely described by a categorical distribution with the probability of each categorŷ Θ a = {p 1|a , ...,p k|a }, so that the KL(Θ a ;Θ d ) of Eq. (4) takes the form of: wherep c|a = Pr(c | a) is the maximum likelihood estimate of the conditional probability of the target c given the subgroup a, andp c is the marginal probability for that category.

Weighted Kullback-Leibler for normal distributions
In the case of a univariate numeric target Y , many distributions could be used for modelling. We resort to the normal distribution for its robustness and analytical properties, as mentioned before. Nonetheless, still two possibilities remain: a location distributionΘ a = {µ a } that only accounts for the mean, or a 'complete' normal distributionΘ a = {µ a , σ a } that accounts for the mean and the variance. With the location distribution KL(Θ a ;Θ d ) equals: while with the normal distribution one obtains: Note that sinceσ d is a constant for each dataset, there is a strong resemblance between W KL µ (s) and WRAcc, where the only difference is the square of the difference of the means. Also, notice that W KL µ,σ directly takes penalises subgroups with large variance-a dispersion-aware quality measurewhile W KL µ (s) (and also WRAcc) fail to give importance to the dispersion of subgroup values.

Subgroup set discovery
Subgroup set discovery (SSD) (van Leeuwen and Knobbe, 2012) is the task of finding a set of high-quality, non-redundant subgroups that together describe all substantial deviations in the target distribution. It can be seen as an instantiation of the LeGo (from Local Patterns to Global Models) framework, which describes the steps to pass from local descriptions of the data to a global model (Knobbe et al., 2008). LeGo identifies three phases of this process for SSD: 1) mining local candidate subgroups; 2) finding a compact set of the subgroups from the candidates found in 1; and 3) combining the interesting subgroups identified in 2 in one global model. Of course, not all phases need to happen in this order, and some stages can be combined.
Here, we are interested in phase 3: how to aggregate the subgroups in a global model. There are three most common aggregation models for subgroups: 1) top-k subgroups, which are just the best k ranking subgroups according to a local quality measure (sometimes called traditional SD); 2) subgroup list, a sequentially ordered set of subgroups; and 3) subgroup set, an unordered set of subgroups. Top-k is still a local paradigm as it does not consider how those k subgroups describe different regions of the data. Contrastingly, subgroup lists and sets are true global models as they consider both the subgroups' local coverage and their global coverage as sets.

Defining the task of Subgroup Set Discovery
Now, we will formally define the task of SSD from a global dataset perspective. SSD can be defined as, given a quality function Q for subgroup sets and the set of all possible subgroup sets S, the task is to find the subgroup set S * = {s 1 , . . . , s k } given by S * = arg max S∈S Q(S). Note that Q should take into account the individual quality of subgroups q(a) and the overlap of their coverages D a and quantify the contribution of each instance only once. As opposed to top-k mining where only their individual qualities are considered, i.e., Q(S) = q(a).
Ideally, a quality measure for subgroup sets Q should: 1) be global, i.e., for a given dataset it should be possible to compare subgroup set qualities regardless of subgroup set size or coverage; 2) maximise the individual qualities of the subgroups; and 3) minimise redundancy of the subgroup set, i.e., the subgroups covers should overlap as little as possible while ensuring the previous point. Next, we formulate the subgroup list model class and propose a new global measure for subgroup lists.

Beam search and separate and conquer algorithms in subgroup discovery
To find good subgroup lists, we propose the SSD++ algorithm in Section 8. SSD++ is a heuristic based on the Separate-and-Conquer (SaC) (Fürnkranz, 1999) strategy of iteratively adding the local best subgroup to the list, combined with beam search for candidate subgroup generation at each iteration level.
For that reason, we now present the beam search algorithm in subgroup discovery and the SaC algorithm usually used in SSD. For an in-depth analysis of algorithms in SD and SSD check Section 11.1 and 11.1.2, respectively. Greedy approaches are often employed in SSD in general, and subgroup list discovery in particular, as the task of finding the optimal unordered or ordered set of patterns is NP-Hard.
Beam search is arguably the most common heuristic in subgroup discovery (Lavrač et al., 2004;Meeng and Knobbe, 2011;van Leeuwen and Knobbe, 2012;Meeng and Knobbe, 2021). It is a greedy hill-climbing approach that starts with candidate subgroups of size one and iteratively refine a subset of those to subgroups to a larger length by adding one more condition per iteration. Specifically, beam-search has three main hyperparameters: 1) the beam-width w b ; 2) the maximum search depth d max ; and 3) a quality measure.
In its standard form, the process starts by finding all the best w b subgroups of size one, i.e., their description only includes one condition such as x 1 < 5 or x 2 = category, according to the quality measure. Then, it refines all the w b size one subgroups by adding one more condition, selecting the w b best refinements, and discarding the rest according to the quality measure. The process continues until the maximum number of conditions d max is achieved, and the subgroup that maximises the quality measure is returned. There are variations such as the one used by van Leeuwen and Knobbe (2012) and Meeng and Knobbe (2021) where the numeric explanatory variables are discretised in each refinement, also known as dynamic discretisation.
Separate and Conquer. Most algorithms for Subgroup Set Discovery find their global models-subgroup lists or sets-sequentially by adding one subgroup at the time (Lavrač et al., 2004;van Leeuwen and Knobbe, 2012;Bosc et al., 2018;Lijffijt et al., 2018;Belfodil et al., 2019), and mostly vary on how they remove the data or generate their candidate subgroups. Thus, all SSD approaches that sequentially add subgroups to a model can be seen as a variation of the traditional Separate-and-Conquer (SaC) rule learning strategy, defined by: 1) adding local best rule/subgroup to the model; 2) remove (traditional SaC) or re-weight (deviation from traditional SaC) the data covered by it; and 3) repeat process 1 and 2 until there is no data left to cover. Thus, using SaC for finding subgroup lists is the obvious choice. Depending if they find list or sets, in step 2 they either remove or re-weight, respectively. For details on each specific method please refer to Section 11.1.2.

Preliminaries: Model selection in frequentist, Bayesian, and MDL perspective
This section briefly revisits model selection in different branches of statistics: classic statistics, also known as frequentist; Bayesian statistics; and using the Minimum Description Length (MDL) principle. For an in-depth comparison of these methods, please refer to Chapter 7 of Hastie et al. (2009). The objective of model selection is to find the best point hypothesis, i.e., choosing the best model (and its parameters set to specific values) for a dataset from a class of possible models. In our case, this translates to selecting the best subgroup list, such as in Figure 1, out of all the possible subgroup lists that can be constructed for that dataset. Furthermore, the best model should describe the data well while not overfitting, i.e., it should generalise its findings beyond the (training) data used to estimate its parameters. Many methods are reformulations of the principle often called Occam's Razor, i.e., select the simplest model that fits the data well. However, depending on the branch of statistics, the assumptions and notation differ, making it hard to compare the methods directly; thus, we attempt to present them in a more unified way. The reason for this is twofold: first, we aim to provide a gentle introduction to the MDL principle for the unacquainted reader by starting from more known branches of statistics; second, our proposed MDL formulation of robust subgroup discovery is related to concepts from other branches, such as model comparison with Bayesian factors or multiple hypothesis testing.
Note that we do not delve into the related Akaike Information Criterion (AIC) (Akaike, 1998). Although it is usually advantageous in predictive settings (Grünwald and Roos, 2019), the AIC has a higher rate of false positives and a bias towards more complex models than the Bayesion Information Criterion (BIC) (Rouder et al., 2009) and, consequently, our MDL formulation (as it asymptotically converges to BIC up to a constant; see Appendix B.1). We want to avoid these properties when mining statistically robust subgroups.
This section is divided as follows. First, in Section 3.1, we present frequentist approaches as they are the most commonly employed. Then, Section 3.2 discusses Bayesian hypothesis testing. After that, Section 3.3 introduces the basic principles of MDL for model selection. Finally, Section 3.4 looks into the concept of multiple hypothesis testing from the three different perspectives.

Frequentist approach to model selection
From a classical statistics perspective, one can use several methods to select the best model from a set of models. These can be broadly divided into outdata and in-data methods, corresponding to testing the models on an external or the same (internal) data source on which they were estimated (trained), respectively. To the first category belong methods commonly used in machine learning, such as cross-validation (Hastie et al., 2009). The second category, which concerns us most, corresponds to structural measures, which have additional terms to penalise the complexity of the model, i.e., how general is a model class constrained to a certain number of parameters and dataset size. Model complexity is sometimes also interpreted as the effective number of parameters. For example, imagine two classes of subgroup lists: subgroup lists with one subgroup; and subgroup lists with up to three subgroups (same number of conditions and dataset); it is easy to see that the second subgroup list includes the first class and can potentially divide the data in more ways. Common examples of model complexity measures are the L1 and L2 norms and the Vapnik-Chervonenkis (VC) dimension (Vapnik and Chervonenkis, 2015). We do not directly mention common 'performance' measures such as accuracy or maximum likelihood. In fact, from an in-data testing perspective and for a nested model class such as subgroup lists, e.g., the class of all subgroup lists with two subgroups includes the class of all subgroup lists with one subgroup, these measures will overfit on the data (Grünwald, 2007).

Structural measures.
Ideally, the method chosen should give guarantees on the model performance on unseen data. The risk minimisation principle guarantees those by using the VC dimension for model complexity. The main idea is to select the model that maximises a performance measure, e.g. accuracy, while having the smallest VC dimension. Informally, the VC dimension of a model is given by the largest set of data examples it (e.g., the class of subgroup list with 3 subgroups of one condition) can separate. In practice, computing the VC dimension can be impractical for certain model classes, and one resorts to more straightforward model complexities, such as the L1 or L2 norm. An example of an L1 norm structural measure for finding decision lists was described by Angelino et al. (2017) as where Q(D, M ) is the objective function used to select the model, Acc.(Y | X, M ) is the accuracy, λ is an adjustable parameter, and |M | is the number of rules in the model. For completeness, the method should include an extra term to penalise the number of conditions in each rule. Bayesian statistics and the MDL principle can be seen as probabilistic structural measures with mathematically rigorous foundations, similarly to the structural risk minimisation, that take into account all model parameters to quantify model complexity (Grünwald and Roos, 2019). The main difference between these branches of statistics is that frequentist statistics usually focus on the probability of the data given the model Pr(D | M ). In contrast, Bayes and MDL focus on Pr(M | D) by making additional assumptions about the probability of the model before seeing any data Pr(M ) (prior)-similar, albeit different, to the additional assumptions made by the VC dimension.

Bayes hypothesis testing and Bayes factor
Bayesian hypothesis testing was introduced by Jeffreys (Jeffreys, 1935(Jeffreys, , 1998, and focuses on comparing the hypothesis/models based on their probability of occurrence given the data, i.e., based on the posterior probability of each model Pr(M | D). To compare both models, one computes the ratio of their posterior distributions such as (Kass and Raftery, 1995;Rouder et al., 2009): where the transition from the second to the third expression is made using the Bayes rule and removing the terms Pr(D) from the expression, and K 1,2 is the Bayes factor between the models. Indeed, depending on how large or how small the ratio K 1,2 is, we can interpret it-similar to p-values in frequentist statistics-as more evidence in favour of hypothesis 1 or 2, respectively (Kass and Raftery, 1995). It is interesting to notice that, in the case of a large number of instances and a smooth prior, the Bayes factor approximates the Bayes Information Criterion (BIC) up to a constant (Schwarz, 1978;Raftery, 1995).
As the two hypotheses being tested are over the same data, the ratio Ξ can be rewritten as Pr(M 1 , D)/ Pr(M 2 , D) using the chain rule. Also, as we deal with a supervised setting, Pr(D | M ) becomes Pr(Y | X, M ), taking the same form used in the next section for the MDL principle. When this comparison is extended to a whole model class M, several methods can be used to select the best model; however, the one most similar to our MDL approach corresponds to choosing the model with the highest probability. i.e., the mode of the distribution.

MDL-based model selection
The Minimum Description Length (MDL) principle originates from the ideas of information theory. It allows for model selection by comparing code lengths for different models and selecting the model that compresses the data best (Rissanen, 1978;Grünwald, 2007;Grünwald and Roos, 2019). In our specific case, the goal is to find the best subgroup list model M from the class of all possible subgroup list models M, as defined in Section 4.1. As we want to find the best point-hypothesis, such as in Figure 2, the model selection problem should be formalised using a two-part code (Grünwald, 2007), i.e., where L(Y | X, M ) is the encoded length, in bits 1 , of target variables data Y given explanatory data X and model M , L(M ) is the encoded length, in bits, of the model, and L(D, M ) is the total encoded length and the sum of both terms. Note that the data encoding changes from L(D | M ) to L(Y | X, M ) to reflect our supervised setting and how we are only concerned with encoding the target variables Y in the goodness of fit part of MDL. Intuitively, the best model M * is the model that results in the best trade-off between how well the model compresses the target data and the complexity of that model-thus minimising redundancy and automatically selecting the best subgroup list size.
MDL and probabilities. Although the MDL principle measures the length of encodings in bits, every encoding can be translated to probabilities by the Shannon-Fano code (Shannon, 1948): where A is an event-in our case, it can be the model M , the target variables Y, or any of their subparts defined in Section 5 and 6-and Pr(A) its probability. Thus, each code length in MDL can be directly interpreted as the negative logarithm of a probability. Consequently, the model with the smallest total encoded length L(D, M ) is that one having the largest probability Pr(D, M ).

Multiple-hypothesis testing
Multiple-hypothesis testing is the task of testing more than one hypothesis 2 on the same data (Shaffer, 1995). To avoid increasing the Type I error rate, i.e., selecting models that are false discoveries, we should compensate for the fact that we test multiple hypotheses. For clarity, note that our focus here is on multiple-hypothesis testing to select a single best model from a class of models for one dataset, not selecting the best algorithm over multiple datasets. For an in-depth explanation of the latter, please refer to Demšar (2006).
In frequentist statistics, one usually accepts a model based on a p-value and an assigned significance level α. Suppose you have chosen α equal to 0.05 and you are testing only two hypotheses; then, the α value tells us that by chance, you have a 0.05 probability of having evidence in favour of one model when it is false and should be rejected. If we would instead test 100 different models (and these are considered independently), it is easy to see that we have a chance of at least 5 models being acceptable when they should not have been. One way to counteract this effect is by adjusting the significance level to accommodate the number of hypotheses tested. Several methods exist to that end, and the conceptually more simple is the Bonferroni correction, where one divides α by the number of hypotheses/models being tested.
In the case of Bayesian statistics and the MDL principle, this translates to using Pr(M ) and L(M ), respectively, to account for all possible models that are tested as hypotheses. In Section 5.1 we will show how to interpret L(M ) as a correction for multiple-hypothesis testing in the context of subgroup lists.

Problem statement: robust subgroup discovery
This section formally introduces the problem we propose to solve: robust subgroup discovery. Informally, the problem can be described as: Find the globally optimal set or list (i.e., an ordered set) of non-redundant and statistically robust subgroups that together explain relevant local deviations in the data with respect to specified target variables.
As this is a broad problem, we need to narrow it down: we only deal with subgroup lists, define the optimal list using the MDL principle, and propose an iterative greedy algorithm that guarantees that each subgroup added to the list is statistically robust. Our MDL-based formulation of subgroup lists includes the MDL-based top-k subgroup discovery problem (Section 7.1), which gives a direct relationship to subgroup discovery as it was originally introduced.
To formally state our problem, we first need to introduce the subgroup list model class, which we will do in Section 4.1. Then, based on this model, we provide our problem statement in Section 4.2. Finally, in Section 4.3 we propose a quality measure that quantifies the goodness of fit of subgroup lists.

Subgroup list model class
Subgroup lists are a sequentially ordered set of subgroups; see Figure 3. Given its ordered format, a subgroup list always partitions the data, i.e., each instance of data is covered by one and only one subgroup (or the default rule). For example, if a subgroup list contains 4 subgroups, the dataset will be partitioned into 4 + 1 parts, one for each subgroup plus one for the dataset/default rule. A subgroup down the list, such as the second subgroup, should be interpreted as: the second subgroup is active only when its description is active and the description of the first subgroup is not active.
More specifically, as we are only interested in finding subgroups for which the target deviates from the overall distribution, we assume Y values distributed according toΘ d by default (last line in Figure 3). Thus, for each subset in the data where the target distribution deviates fromΘ d and a description exists, a subgroup specifying a different distributionΘ a could be added to the list. Ordering the rules formed by subgroups S = {s 1 , · · · , s ω } and adding the dataset rule at the end (default rule) leads to a subgroup list M of the form of Figure 3.
Regarding the possible distributions Dist, we use categorical distributions for the nominal targets, i.e., Dist Cat(p 1 , · · · ,p k ), or normal distributions for the numeric target case, i.e., Dist N (μ i ,σ i ).
The categorical distribution is a natural choice for describing the probabilities of classes (Letham et al., 2015). For numeric targets, several distributions can be selected; however, the normal distribution captures two properties of interest in numeric variables, i.e., centre and spread, while being robust to cases where the data violates the normality assumption. Also, it allows for a closed-form solution from a Bayesian (Jeffreys, 1998) and MDL (Grünwald, 2007) perspective. For an analysis of the direct use of the numeric empirical distribution in subgroup discover, please refer to Meeng et al. (2020).

Subgroup lists versus subgroup sets
While we formulate our theory solely for subgroup lists, each global model has advantages and disadvantages. On the one hand, subgroup lists allow for a sequential interpretation of the subgroups, generally in decreasing order of their importance. Moreover, each instance in the data is associated with only one subgroup. On the other hand, subgroup sets allow for a semi-independent interpretation of each subgroup and can be considered a more general framework. These properties tend to make sets more interpretable when looking at all the subgroups. At the same time, lists are usually more interpretable from an instance perspective-as each instance is only covered once-and on the contribution of each subgroup to the global model. However, as the number of subgroups in a model increases, both types of models become harder to interpret. In the case of subgroup lists, one must inspect the covering subgroup and all the preceding ones for each instance. In the case of subgroup sets, there can be a considerable overlap for each instance, making it hard to assess the individual contribution of each subgroup.
In addition to being one of the first subgroup set discovery models, the main advantage of selecting subgroup lists, in our case, is their property of unambiguously partitioning the data into non-overlapping parts. This property allows us to use the MDL principle to formulate the robust subgroup discovery problem for subgroup lists optimally.

Difference between subgroup lists and (predictive) rule lists
A subgroup list defined above corresponds to a probabilistic rule list with ω = |S| rules and a last (default) rule fixed to the overall empirical distributions for each target variable . Fixing this last 'rule' distribution is crucial and differentiates a subgroup list from a rule list as used in classification and/or regression (Proença, 2021), as this enforces the discovery of a set of subgroups whose individual target distributions all substantially deviate from the overall target distribution (dataset rule). It is shown in Section 7.1 that the objective of finding a subgroup list with this format is equivalent to top-k subgroup discovery when finding subgroup lists with just one subgroup. A theoretical comparison of the difference between the objectives of predictive rule lists and subgroup lists from an MDL-based perspective is given in Appendix D.

Formal problem statement
Let D be a dataset consisting of explanatory variables data X and target variables data Y, i.e., D = {X, Y}. Let M ∈ M(D) be all possible subgroup lists for D formed by an ordered set of subgroups S and a dataset rule, as in Figure 3. Let s ∈ Ω(D) be all possible subgroups in D with respect to all possible descriptions in X, which are formed by conjunctions of conditions (pattern language) on the possible values X of the explanatory variables X ∈ X. The conditions vary by variable type X and can be an interval over the reals R for numeric, e.g., consumption − city ∈ [0, 8] in Figure 2, or equality for nominal or Boolean, e.g. weight = light in Figure 2. The target description of each subgroup is restricted to Categorical or Normal distributions. Given the definition of all possible subgroup lists in a dataset, the objective is to return the subgroup list M that minimises the MDL two-part code of Eq. (11), i.e., the objective is to find where the first part concerns the global optimality of the subgroup list, on the complete dataset, while the constraint Ξ(Θ i ,Θ d ) states that there should be more evidence in favour of having each subgroup in the list than for using the overall dataset distribution (for the motivation on using the MDL principle refer to Section 4.2.1).
To operationalise this, L(Y | X, M ) and L(M ) need to be defined. Thus, we propose a model encoding L(M ) in Section 5.1 and a data encoding in Section 6. Then, Section 7 shows that the data encoding equals WKL-based subgroup discovery and Bayesian testing, reflecting the statistical robustness of each subgroup in the list. Finally, as finding the optimal subgroup list according to this formulation is unfeasible for most real-world problems, Section 8 proposes SSD++, a heuristic algorithm (for NP-Hard intuition, check Section 4.2.2). Moreover, this algorithm approximates the MDL minimisation by restricting the search space of possible subgroups and subgroup lists. Nonetheless, it guarantees that the most statistically robust subgroup found by beam search is added to the subgroup list. Indeed, the greedily adding a subgroup automatically accounts for the statistical constraint of Eq (13), i.e.,

Motivation for the MDL principle
The MDL principle is used due to its statistical robustness and objectivity when compared to other approaches (for a short introduction to model selection, please refer to Section 3). First, it does not assume that the model that generated the data belongs to the model class used. Second, it allows the data encoding, i.e., L(Y | Y, M ) = − log Pr(Y | Y, M ) to be computed with several optimal methods, including the one in Bayesian statistics-Bayes updating rule (Grünwald and Roos, 2019). Third, from a model encoding perspective, i.e., L(M ) = − log Pr(M ), which is the most subjective part of the encoding, the MDL principle recommends choosing an encoding that uses the minimum number of assumptions, similarly to the max entropy principle (Jaynes, 1957).

Finding optimal subgroup lists is NP-Hard
The general task of finding an ordered set of patterns (Mielikäinen and Mannila, 2003) and that of finding the smallest decision list for a dataset D (Rivest, 1987) are both NP-hard problems. Thus, it is trivial to see that the problem of finding a subgroup list, which is a probabilistic generalisation of the decision list, is NP-hard.

A new measure for subgroup lists: the sum of WKL divergences
This section extends the WKL divergence to subgroup lists, allowing us to compare the quality of different algorithms that mine subgroup lists. Also, it is shown in Section 7.1 to correspond to one part of the data encoding of our MDL formulation. Following the introduction of quality measures in Section 2.2 and subgroup lists in the previous sections, we can extend the KL-based measure of Eq. (4) for individual subgroups to measure subgroup lists. That is, we propose the Sum of Weighted Kullback-Leibler divergences (SWKL), which can be interpreted as the sum of weighted KL divergences for the individual subgroups: where i is the subgroup index in a subgroup list, ω is the number of subgroups in S, and |D| is the number of instances in D. The latter is used to normalise the measure and compare values across datasets. In the case of multiple target variables, the normalisation could also include the number of targets, but we do not use this in this work. The SWKL measure assumes that the data is partitioned per subgroup and is based on the assumption that subgroups can be interpreted sequentially as a list. An advantage of the SWKL measure is that it can be used for any target variable(s), as long as probabilistic models are used. Note that computing SWKL is straightforward for subgroup lists, but not for subgroup sets as multiple subgroups can cover an instance. For subgroup sets, it would be necessary to explicitly define the type of probabilistic overlap, e.g., additive or multiplicative mixtures of the individual subgroup models.
Overfitting. It should be noted that this measure only quantifies how well a list of subgroups captures the deviations in a given dataset and is prone to overfitting: the higher the number of subgroups, the easier it is to obtain a higher value as there is no penalty for the number of subgroups (or their complexities, for that matter). As such, SWKL can be seen as a measure of 'goodness of fit' for subgroup lists. This is not an issue for our approach as our MDL-based criterion naturally penalises for multiple hypothesis testing and the complexity of the individual subgroups, which is empirically validated by the statistical robustness analysis Section 9.5. Moreover, overfitting does not seem to be an issue in our empirical comparisons with other algorithms of Section 9.3 and 9.4, as the number of subgroups found was similar for most algorithms, rendering the subgroup lists comparable based on SWKL.

Model encoding of subgroup lists
We presented the MDL principle in its generality in Section 3.3 and the specific problem statement of finding optimal subgroup lists in Section 4.2. In this section, we define the model encoding L(M ) of subgroup lists and its relationship to multiple hypothesis testing.

Model Encoding
Following the MDL principle (Grünwald, 2007), we need to ensure that 1) all models in the model class, i.e., all subgroup lists for a given dataset, can be distinguished; and 2) larger code lengths are assigned to more complex models. To accomplish the former we encode all elements of a model that can change, while for the latter we resort to two different codes: when a larger value represents a larger complexity we use the universal code for integers (Rissanen, 1983), denoted 3 L N , and when we have no prior knowledge but need to encode an element from a set we choose the uniform code. Specifically, the encoded length of a model M over variables in X is given by where we first encode the number of subgroups |S| using the universal code for integers, and then encode each subgroup description individually. For each description, first the number |a i | of variables used is encoded, then the set of variables using a uniform code over the set of all possible combinations of |a i | from all explanatory variables, and finally the specific condition for a given variable. As we allow variables of two types, the latter is further specified by where the code for each variable type assigns code lengths proportional to the number of possible parts the variable's domain can partition the dataset. Note that this seems justified, as more parts imply more potential spurious associations with the target that we would like to avoid. For nominal variables this is given by the size of the domain, i.e., the number of categories in a nominal variable. For numeric variables it equals the number of operators used L N|2 (|n op |) 4 plus the possible number of outcomes N (n op , n cut ) given the operators and n cut cut points. The number of operators for numeric variables can be one or two, as there can be conditions with one (e.g., x ≤ 2) or two operators (e.g., 1 ≤ x ≤ 2), which is a function of the number of possible subsets generated by n cut cut points. Note that we here assume that equal frequency binning is used, which means that knowing X and n cut is sufficient to determine the cut points.
In the case of top-k subgroup discovery, i.e., only be interested in the top individual subgroups, the model becomes a subgroup list with only one subgroup. Thus, the model encoding of Eq. (15) remains the same except for the first term (L N (|S|)), which should be removed. This is because it is not required to account for a subgroup list with more than one subgroup.
Example 5 (continuation): Let us assume that the subgroup list of the Automobile example of Figure 2 is composed of only the first subgroup. In that case the list only has one subgroup with description: {weight = heavy & consumption-city ≤ 8 km/L }. Taking into account that the dataset has 17 variables, |X weight | = 3 and only 3 cut-points were used for numeric attributes, the expression of the model length is given by: It is important to note that the length of the model can (and should) be a real number, as we are only concerned with the idea of compression, not with materialising and transmitting the actually encoded data (Grünwald, 2007).

Multiple-hypothesis testing and model length.
As presented in the previous section, the model encoding L(M ) must be able to distinguish all possible models that could be learned for a dataset D. In fact, we are counting all possible models and then giving them a probability of occurring-with a smaller probability to models with fewer terms. Looking at Eq. (15) in particular, we can see that each term counts different parts of the model: 1) L N (|S|) counts the possible number of subgroups in subgroup lists; 2) L N (|a i |) counts the possible length of a subgroup description; 3) log m |ai| counts the possible pairs of descriptions of size |a i |; and 4) L(v) counts the possible values a variable can have. Together, these four terms count all possible subgroup lists. In some cases, we are being extra conservative and counting more models that do not exist in a dataset, e.g., by using L N (|S|) for the number of subgroups we allow for subgroup lists with infinite subgroups-above the number of instances in the data. Thus, the MDL principle is actively trying to avoid false positive or Type I error models by penalising all the models that could be learned and compared to each other.

Data encoding of target variables
When the model is defined, what remains is to define the length function of the target data given the explanatory data and model, L(Y | X, M ). In this section, we show how to encode the target data Y by dividing it into smaller subsets that can be encoded individually and then summed together, and why there are different types of data encoding for each of the subsets. The specifics of encoding nominal and numeric targets are described in Sections 6.1 and 6.2, respectively.
Cover of a subgroup in a subgroup list. First, we observe that for any given subgroup list of the form of Figure 3, any individual instance (x i , y i ) can only be 'covered' by one subgroup. That is, the cover of a subgroup a i , denoted D a , depends on the order of the list and is given by the instances where its description occurs minus those instances covered by previous subgroups: Next, let n i = |D i | be the number of instances covered by a subgroup (also known as usage). In case an instance (x i , y i ) is not covered by any subgroup s ∈ S then it is 'covered' by the default rule. The instances covered by the default rule D d are the ones not covered by any subgroup (hence the name default rule) and formally defined as: Now, given that the subsets for each subgroup or default rule and each target variable are well-defined, one can-for each of the rules and targets-estimate the parameters of its probabilistic distribution using the maximum likelihood estimator. Note that this shows us that a subgroup s i ∈ M is fully defined by its description a i in a dataset D, and we will interchangeably refer to the subgroup by its description and to its elements (statistics, parameters, distributions, etc.) by its index i when obvious from context. As the subgroup list induces a partition of the data, the total length of the encoded data can be given by the sum of its non-overlapping parts: where Θ d is the vector of parameters for each variable Θ d 1 , . . . , Θ d t . Observe that we dropped X a as these are not necessary to encode Y a but only to generate the partition of the data, and also dropped the parameters Θ i of the subgroups as we do not know what are their parameters until we see the data. This last part will be clarified in the next paragraphs, where we describe how to encode subsets without knowing the parameters. As a side-note, note that Eq. (19) concerns the encoding of any supervised partition of the data, which allows to directly quantify the quality of any tree learning method-each such tree induces a partition of the data.
Encoding data of t (assumed) independent target variables. As each target variable is assumed independent from each other the encoding of target data is given by the sum of their individual encodings: Integrating (19) and (20), one obtains: Two types of data encoding: data encoding can be separated in two different categories: 1) with known parameters; and 2) with unknown parameters.
1) Known parameters: when the parameters of a distribution are known, one can encode the data points directly using the probability for those points given by the distribution with the known parameters. Thus, the encoding of points Y i j (j th variable and i th subgroup) is equal to the negative logarithm of their probability given by known parametersΘ i j : This type of code is used in the case of the default rule of a subgroup list, as the parametersΘ d j are equal to the marginal distribution of variable Y j and are constant for each dataset. Note that this is the key difference between a subgroup list and a predictive rule list: the last rule of a subgroup list is fixed to the marginal distribution, while in the (predictive) rule list its parameters are unknown and depend on the subset D d .
2) Unknown parameters: when the parameters are unknown we need to encode both the parameter values and the data points. We have two possibilities: 1) crude MDL, i.e., encoding the probabilities using a suboptimal probability distribution and then applying the Shannon-Fano code, i.e., the logarithm of the empirical probability (Shannon, 1948); or 2) employ an optimal encoding of both parameters of the distribution and data points together (Grünwald and Roos, 2019). In this work, we employ optimal encoding of parameters, as it guarantees optimality in the sense that the encoding is the best possible in the worst-case scenario, i.e., in case the sample of the data is not representative of the population. For our problem, three main types of optimal encodings exist, which are, in increasing order of optimality guarantees: 1) prequential plug-in; 2) Bayesian; 3) Normalised Maximum Likelihood (NML). While the first two are asymptotically optimal, the NML encoding is optimal for fixed sample sizes. Depending on the target type, we employ the best encoding possible while being computationally feasible, i.e., we require adequate run-time for our algorithm. For nominal targets, we present an NML encoding for both the probabilities of each class and the data points in Section 6.1, which is a theoretical improvement over the prequential plug-in code that was recently proposed for classification rule lists by . For numeric targets, we resort to a Bayesian encoding, as recently proposed by , as the NML code is not computationally feasible for that case.

Data encoding: nominal target variables
When the data have one or more nominal targets, the distributions in the probabilistic rules (2) are categorical distributions Cat(Θ), each with a set of parameters Θ = {p 1 , · · · , p k } representing the k classes: This implies a subgroup of the form: a → y 1 ∼ Cat(p 1 , · · · , p k ), · · · , y t ∼ Cat(p 1 , · · · , p k ), where k and k are the number of classes Y 1 and Y t , respectively. To simplify the introduction of concepts we will assume we only have one target variable in Y, and then generalise the results to multiple variables at the end. Thus, throughout this section Y becomes Y , and the parameters of each subgroup s i becomeΘ i = {p 1|i , · · · , p k|i } as there is only one variable with k classes, where p 1|i is the probability of class 1 for subgroup i, i.e., Pr(c = 1 | a i ). The general form of a subgroup list with one nominal target takes the form of Figure 4.
In the following sections, we will derive the data encoding for subgroup lists with categorical distributions. First, in Section 6.1.1 we introduce the maximum likelihood estimators that will be needed to derive the MDL encodings. Then, in Section 6.1.2, it is shown how to encode a categorical distribution when its parameters are known, which is the case for the default rule of a subgroup list. Finally, in Section 6.1.3 it is shown how to encode a categorical distribution when the parameters of the distribution are unknown. Each description a i uniquely defines a subset D i given by its cover Eq. (17). However in the nominal case for each class label c, we also need to find its subset of the data D c|i , formally given by: which allows us to compute the usage over each class n c|i = |D c|i |. Now, we are in a position to use the maximum likelihood estimator for the parameterŝ Θ i of each categorical distribution as: We can show how to encode each subset of target values with the known parameters of the distribution-the default rule of a subgroup list-and unknown parameters-all the subgroups.

Encoding categorical distributions with known parameters
To encode target values with known parameters-as is the case for the default rule of a subgroup list-we can directly use Eq. (22) with given parameter estimatesΘ d =p 1|d , · · · ,p k|d (marginal distribution over the whole dataset): where (Θ d | Y d ) is the log-likelihood of the parameter setΘ d , and n c|d denotes the number of points associated with each class c covered by default rule Y d .

Encoding categorical distributions with unknown parameters
When the parameters are unknown-as is the case for each individual subgroup distribution-we will employ the Normalised Maximum Likelihood (NML) code, as it "is optimal in the sense that it achieves the minimax optimal code length regret" (Grünwald, 2007). Although the expression of the NML code can be daunting, its intuition is very clear (Kontkanen et al., 2005), i.e., the NML code is equivalent to first encoding all maximum likelihood estimates of sequences Z of n i points based on their likelihoods, and then encoding data Y i with its maximum likelihood estimateΘ i as in Eq. (26). Formally, the NML code length of the subset Y i is given by 5 : where Y ni is the space of all possible sequences of n i points with cardinality k = |Y| (possible values per point),Θ Z is the maximum likelihood estimate over Z, C(n i , k) is the complexity-as it is called in MDL literature (Grünwald, 2007)-of the multinomial distribution over n i points and k categories. Note that this term can be efficiently computed in sub-linear time O( √ dn i + k) if approximated by a finite floating-point precision of d digits (Mononen and Myllymäki, 2008).
Finally, inserting (26) and (27) in (21) we obtain, for the total data encoding of a subgroup list: 6.2 Data encoding: numeric target variables When we have one or more numeric target variables, the consequent of probabilistic rules as in Eq.
(2) are now normal distributions N (Θ) with parameters Θ = {µ, σ}, and take the following form: where we use Pr(y | µ, σ) to denote the probability density function (pdf), which is a slight abuse of notation that we admit to unify the whole work. This translates to a probabilistic rule of the form: To simplify the introduction of concepts, we will again assume we have only one target variable in Y, and generalise the results to multiple variables at the end. Thus, throughout this section Y becomes Y , and the parameters of each subgroup s i become Θ i = {µ i , σ i } as there is only one variable. The general form of a subgroup list with normal target distribution is given in Figure 5. In the following subsections, we will derive the data encoding for subgroup lists with normal distributions. First, in Section 6.2.1 we introduce the maximum likelihood estimators that will be needed to derive the MDL encodings. Then, in Section 6.2.2 we show how to encode a normal distribution when its parameters µ and σ are known, such as is the case for the default rule of a subgroup list. Finally, in Section 6.2.3 we show how to encode a normal distribution using an uninformative prior when the parameters of the distribution are unknown.

Maximum Likelihood (ML) estimation of the parameters
Each description a i uniquely defines a subset D i given by its cover (17), which allows to estimate the parameters of each normal distribution using the maximum likelihood estimate over Y i : whereσ 2 i is the biased estimator such that the estimate times n i equals the Residual Sum of Squares, i.e., n iσ 2 i = y∈Y a (y −μ i ) 2 = RSS i . Note that the parameters of the default rule of Figure 5 are fixed for a dataset and thus correspond to estimatesμ d andσ d over all target values Y .
As each subgroup list defines a partition of the data, we can encode each target value part, Y i or Y d , separately and sum them to obtain the total encoding of Y . In the case of subgroup lists, the last rule-i.e., default rule-has fixed parameters equal to the overall dataset distribution, while the subgroups parameters are not known in advance and have thus to be encoded together with the data points. We start by showing how to encode the subset of target values with the default 'rule'-known parameters of the distribution-and then show how to encode each subgroup subset-unknown parameters of the distribution.

Encoding normal distributions with known parameters
The target values not covered by any subgroup Y d , as defined in (18), are covered by the default dataset 'rule' and distribution at the end of a subgroup list. As the statisticsΘ d = {μ d ,σ d } are known and constant for a given dataset, one can simply encode the instances using this (normal) distribution, resulting in encoded length The first two terms are normalising terms of a normal distribution, while the last term represents the Residual Sum of Squares (RSS) normalised by the variance of the data. Note that when Y d = Y , i.e., the whole dataset target, RSS is equal to n d σ d and the last term reduces to n d /2 log e.

Encoding normal distributions with unknown parameters
In contrast to the previous case, here we do not know a priori the statistics defining the probability distribution corresponding to the subgroup, i.e.,μ and σ are not given by the model and thus both need to be encoded. For this, we resort to the Bayesian encoding of a normal distribution with mean µ and standard deviation σ unknown, which was shown to be asymptotically optimal (Grünwald, 2007). The optimal code length is given by the negative logarithm of a probability, and the optimal Bayesian probability for Y a is given by where w(µ, σ) is the prior on the parameters, which needs to be chosen.
Choosing the prior. The MDL principle requires the encoding to be as unbiased as possible for any values of the parameters, which leads to the use of uninformative priors. The most uninformative prior is Jeffrey's prior, which is 1/σ 2 and therefore constant for any value of µ and σ, but unfortunately its integral is undefined, i.e., σ −2 dσ dµ = ∞. Thus, we need to make the integral finite, which we will do next. It should be noted that when using normal distributions with Bayes factors-Bayesian equivalent to traditional statistical testing-the authors tend to also add a normal prior on the effect size, as e.g., δ = µ/σ ∼ N (0, τ ) (Jeffreys, 1998;Gönen et al., 2005;Rouder et al., 2009). Nonetheless, this prior gives a higher probability to values of µ closer to zero, which is a bias that we do not want to impose. Thus we only use Jeffrey's prior, which converges 6 to the Bayes Information Criterion (BIC) for large n. Now, given the our prior w(µ, σ) = 1 σ 2 √ 2π -where √ 2π was added for normalisation reasons-the remaining question is how we can make the integral finite. The most common solution, which we also employ, is to use u data points from Y i , denoted Y i|u , to create a proper conditional prior w(µ, σ | Y i|u ). As there are only two unknown parameters, we only need two points hence u = 2 (Grünwald, 2007); for more on the interpretation of such "priors conditional on initial data points", see Grünwald and Roos (2019). Consequently, we first encode Y i|2 with a non-optimal code that is readily available-i.e., the dataset distribution of Eq. (32)-and then use the Bayesian rule to derive the total encoded length of Y i as where is the extra cost incurred by encoding two points non-optimally. After some re-writing 7 we obtain the encoded length of the y values covered by a subgroup Y i as where Γ is the Gamma function that extends the factorial to the real numbers (Γ (n) = (n − 1)! for integer n) andμ i andσ i are the statistics of Eqs. (30) and (31), respectively. Note that for Y i|2 any two unequal values (otherwisê σ 2 = 0 and L Bayes (Y i|2 ) = ∞) can be chosen from Y a , thus we choose them such that they minimise L cost (Y i|2 ). Finally, inserting (32) and (35) in (21) we obtain for the total data encoding for a subgroup list: Example 7 (continuation): We revisit the Automobile subgroup list of Figure 2 and find the length of the Bayes2.0 encoding (Eq. (35)) of the first subgroup.
To compute it we need to get the statistics of the subgroup (Θ 1 = {μ 1 = 35;σ 1 = 8}), the number of samples it covers (n 1 = 11), the dataset statistics (Θ d = {μ d = 13;σ d = 8}), and the two points closest to the dataset mean Y 1|2 = {14; 31} that makes the encoding proper (and which are not available in the example information). Assuming that L cost (Y i|2 ) = 0.69bits for simplicity, the length of the encoding of Y 1 is given by: =58.06 + 0.69 =58.75 bits.

MDL-optimal subgroup lists relation to WKL-based SD and Bayesian testing
This section investigates the equivalence of our MDL-optimal formulation of subgroup lists to top-k subgroup discovery with WKL and Bayesian testing with multiple hypotheses. First, Section 7.1 shows that when our subgroup lists only contain one subgroup, they correspond top-1 subgroup discovery with WKL as a quality measure. Second, Section 7.2 shows that adding one subgroup to the list-MDL greedy gain-corresponds to Bayesian proportions, multinominal, and t-test for binary, nominal, and numeric targets, respectively, plus a term for multiple-hypothesis testing.

Relationship of MDL-optimal subgroup lists to WKL-based SD
We now investigate the relationship between finding an MDL-optimal subgroup list and WKL-based top-k subgroup discovery. Remember that WKL is a subgroup discovery measure based on information-theory (van Leeuwen, 2010) and takes the form of Eq. (3) for a general probability distribution; we describe it in more detail in Subsection 2.2.1. Now, assume that we have a single target variable (Y instead of Y) and a subgroup list consisting of just one subgroup s with description a (and the default rule). Next, let us turn the MDL minimisation problem into a maximisation problem by multiplying Eq. (11) by minus one and adding a constant (for each dataset) L(Y | Θ d ) to obtain: In the case of a subgroup list with one subgroup and one target, the data encoding of Eq. (28) for nominal targets or Eq. (6.2.3) can be substituted by where L * stands for L NML or L Bayes2.0 for nominal and numeric targets, respectively. Also, note that Y d is given by all the points not covered by the subgroup description a, i.e., Y ¬a . Thus, we can further develop the maximisation problem to 8 : where n a KL(Θ a ;Θ d ) is the Weighted Kulback-Leibler divergence fromΘ a tô Θ d , and COMP(n a , Dist) is the complexity associated with the target probability distribution-C(n a , k) for categorical and log n a for normal. The equality of the expression holds for categorical and is asymptotically equal for the normal. This result shows that finding the MDL-optimal subgroup is equivalent to finding the subgroup that maximises WKL, plus two extra terms: one that defines the complexity of the distribution C(n a , k), and another that defines the complexity of the subgroup L(M ).
Dispersion-correction quality measure for numeric targets. Importantly, we can observe from Eq. (36) that the measure based on the Kullback-Leibler divergence of normal distributions is part of the family of dispersion-corrected subgroup quality measures, as it takes into account both the centrality and the spread of the target values . When we consider subgroup lists having more than one subgroup, Eq. (36) simply expands to: where SWKL(S) is the measure for subgroup set quality that we proposed in Section 4.3, and the other terms penalise the complexity of the subgroup list. This demonstrates that the MDL-based objective for the optimal subgroup corresponds to a subgroup set quality minus two terms for model complexity-multiple=hypothesis testing.

Relationship of MDL-optimal subgroup lists to Bayesian testing
We will now show how our MDL criterion relates to Bayesian and multiple hypothesis testing. The Bayesian alternative to statistical testing is the Bayesian factor presented in Section 3.2, which compares the best model by comparing the likelihoods of the data given each model. Now imagine we are comparing two models M 1 and M 2 based on their MDL quality as defined by Eq. (11): where we use the Shannon-Fano code of Eq. (12) to transform code length in bits L(· · · ) to probabilities Pr(· · · ), and K 1,2 is the Bayes factor between model 1 and 2 (as presented in Section 3.2). In practice, taking into account Pr(M ) (or L(M )) is equivalent to using the posterior distributions instead of just the "Bayes" factor, and in our case, amounts to a penalty for multiple hypothesis testing as described in Section 5.2. This is the general equivalence between our formulation and Bayesian testing but now let us look at the specific case of adding only one subgroup at the end of the list, i.e., the greedy gain.
Greedy gain. Adding one subgroup to the list is of special interest to us because it relates with both the greedy gain of our algorithm (Section 8.1) and to Bayesian statistical testing against the dataset distributionΘ d .
where we obtained the expression on WKL divergence again. Thus, the hypothesis we are testing here is that the data non-covered by M is better described by the dataset distributionΘ d or the subgroup distributionΘ a . This tells us that adding one subgroup that minimises the MDL expression to the subgroup list is equivalent to Bayesian testing Rouder et al. (2009). Specifically, the first two terms are equivalent to a Bayesian proportions test (with a binary target), a multinomial test (with a nominal target), or a t-test (with a numeric target described by a normal). L(M ) − L(M ⊕ s) accounts for multiple hypothesis testing by penalising for all the possible subgroups that could be added to the model M , as explained in Section 5.1. Indeed, in the nominal case, the first two terms are similar to one of the quality measures proposed by Song et al. (2016)-we use the NML encoding and they use a Bayesian one-for taking into account the uncertainty of the class distribution for nominal targets, assuming that the subgroup and dataset are different. However, they do not consider our additional term for multiple hypothesis testing, which is necessary to have a low number of false positives.

The SSD++ Algorithm
This section proposes SSD++, a heuristic algorithm to find good subgroup lists based on the proposed MDL formulation of Section 5 and 6. Our proposed algorithm combines the Separate-and-Conquer (SaC) (Fürnkranz, 1999) strategy of iteratively adding the local best subgroup to the list with beam search for candidate subgroup generation (for a short introduction on these algorithms, please refer to Section 2.4). We use a double greedy approach because the problem of finding the optimal subgroup list is NP-hard (Section 4.2.2), and the algorithm can be extended to different types of target variables. Further, in subgroup discovery, beam search was empirically shown to be competitive in terms of quality compared to a complete search while offering a considerable speed-up (Meeng and Knobbe, 2021). Greedy heuristic approaches are a common practice in MDL-based pattern mining (Vreeken et al., 2011; and rule-based learning (Fürnkranz et al., 2012). Lastly, our approach of greedy search and adding one subgroup at a time is computationally interpretable to the user, as it adds at each iteration the locally best and most statistically significant subgroup found by beam search.
This section is divided as follows. First, Section 8.1 presents the greedy gaincompression gain-of adding one subgroup to the list and its equivalence to WKL-based SD. Then, Section 8.2 describes the SSD++ algorithm. Finally, Section 8.3 shows the time and space complexity of SSD++.

Compression gain
To quantify the quality of annexing a subgroup s at the end (after all the other subgroups and before the default rule) of model M , denoted M ⊕ s, we employ the compression gain: i.e., a favourable statistical test for adding a subgroup (Section 7.2). Considering the extremes, with β = 1 we have the normalised gain first introduced for the classification setting by , and for β = 0 we have the absolute gain which is just the regular gain used in the greedy search of previous MDL-based pattern mining (Vreeken et al., 2011). Developing Eq. (38) further shows that the compression gain only depends on the added subgroup s with description a, as in the specific case of a subgroup list the default rule is fixed and it is the same for M and M ⊕ s: where ∆ β L(Y | X, M ⊕ s) and ∆ β L(M ⊕ s) are the data and model compression gain, respectively. The last expression shows the equivalence of the compression gain to statistical testing with multiple-hypothesis as shown in Eq. (37). COMP(n a , Dist) is the uncertainty associated with the probability distribution for n a points, which for categorical distributions is given by C(n a , k) and for normal is log n a Interpretation of hyperparameter β. The hyperparameter β represents a tradeoff between finding many subgroups that cover few instances or few subgroups that cover many instances 9 . In the general form of a subgroup quality measure of Eq. (3), β is just given by β = 1 − α. Later, we empirically show that the normalised gain (β = 1) usually achieves a better MDL score than other β values; this was already known for other measures from rule learning theory (Fürnkranz et al., 2012). Nonetheless, the main objective of subgroup discovery is to locally describe regions in the data that strongly deviate from a certain target. Thus, the user can specify what she is looking for in the data: either a more granular and detailed perspective (β close to one) or a more general and high-level one (β close to zero). Note that, for comparison to other algorithms we will always use the normalised gain (β = 1) except when explicitly stated.

SSD++ algorithm
We propose SSD++ 10 , a heuristic algorithm with two main components: 1) a SaC iterative loop that adds the best-found subgroup at the end of the subgroup list; and 2) a beam search to find high-quality subgroups at each SaC iteration based on the compression gain of Eq. 39.
More specifically, the greedy search algorithm starts from an empty list, with just a default rule equal to the priors in the data, and adds subgroups according to the well-known separate-and-conquer strategy (Fürnkranz et al., 2012): 1) iteratively find and add the subgroup that gives the most considerable improvement in compression; 2) remove the data covered by that rule; and 3) repeat steps 1-2 until compression cannot be improved. This implies that we always add subgroups at the end of the list, but before the default rule. Beam search is used for candidate generation at each iteration to find the best candidate to add. In terms of the LeGo framework (Knobbe et al., 2008), our work can be seen as first mining all the possible local patterns (subgroups)technically we could use beam-search only once, but using it at each iteration results in better candidates-and then using the SaC algorithm to sequentially add the best subgroup to a list.
Note that this algorithm extends that of  for univariate numeric targets to three extra target types-univariate and multivariate nominal, and multivariate numeric-with the added normalisation hyperparameter β.

Algorithm description
Algorithm 1 presents SSD++. The algorithm starts by taking as input a dataset D and the beam search parameters, namely the number of cut points n cut , the width of the beam w b , and the maximum depth of search d max . It initialises the rule list with the default rule, based on the dataset empirical distribution (Ln 1). Then, while the beam search algorithm returns subgroups that improve compression (Ln 3), it keeps iterating over two steps: 1) finding the best subgroup from all candidates generated in the beam search (Ln 4); and 2) adding that subgroup to the end of the model, i.e., after all the existing subgroups in the model (Ln 5). The beam search (explained further in the next paragraph) returns the best subgroup on the data not covered by any subgroup already in model M . When no subgroup improves compression (non-positive gain), the loop stops and returns the subgroup list. Beam search is used at each iteration to generate the best candidates at each SaC iteration, instead of only once at the beginning, as it could yield local optima and get stuck on the top-k subgroups.
Beam search. Given a beam width w b and maximum search depth d max it consists of: 1) find all items, i.e., all conditioned variables such as x 1 < 5 (see next paragraph for the numeric discretisation with cut points) or x 2 = category, and add the best w b items according to compression gain (Eq. (39)) as subgroups of size 1 to the beam; 2) refine all subgroups in the beam with all items and add the best w b to a new empty beam; 3) repeat 2 and 3 until the maximum depth d max of the beam is reached and return the best subgroupaccording to the compression score-found in all iterations.
Numeric discretisation. Suppose a numeric variable X j , and a number of cut points n cut . The items generated from this numeric variable are all valid subsets (they must cover at least one instance) given by equal frequency discretisation with open and closed intervals for n cut cut points. Open intervals require one operator (≥ or ≤), while closed intervals require two (≥ and ≤). As an example, in the case of a generic variable X j and n cut = 2, with cut point 1 = 10 and cut point 2 = 20 it generates four items with one operator, i.e., items 1op = { x j ≥ 10, x j ≤ 10,x j ≥ 20, x j ≤ 20}, and one item with two operators, i.e., items 2op = {10 ≤ x j ≤ 20}.

Algorithm 1 SSD++ algorithm
Input: Dataset D, number of cut points ncut, beam width w b , depth max. dmax and normalisation β Output:

Time and Space Complexity
In this section, we analyse the time and space complexity of SSD++ as given in Algorithm 1 in Section 8.3.1 and 8.3.2, respectively.

Time Complexity
The algorithm can be divided into three parts: 1) preprocessing of the data; 2) the Separate and Conquer (SaC) algorithm; and 3) the beam search. In addition, there are different complexities depending on the target type, as each statistic requires different computations.
1) Preprocessing phase. In the preprocessing phase, all the coverage bitsets of the items are generated, i.e., the indexes of the instances covered by each item generated from numerical and nominal variables. The set of all items is ζ and its size |ζ|. Thus, we go over the data a maximum of |ζ| times, obtaining a time complexity of O(|ζ|n), and the results are stored in a dictionary for O(1) access. Also, some constants are cached for a fixed amount the first time they are computed, such as the universal code of integers L N (i), and Γ (i) for the numeric target case, and C(i) in the categorical case.
2) SaC phase. For the SaC phase, it is clear that the algorithm runs the beam search |S| times and will thus multiply the time complexity of the beam search by |S|.
3) Beam search phase. For the last d max − 1 iterations of the loop, each of w b candidates in the beam is refined with all |ζ| items, which gives a time complexity by itself of O(d max w b |ζ|). Then, for each refinement, the algorithm computes its coverage, statistics, and score, where the last two depend on the number and type of target. The coverage of the refinement is the logical conjunction of two bitsets, i.e., the bitset of the candidate b cand and that of the item b item . The computation of this new coverage has a time complexity of O(|b cand | + |b item |), which in a worst-case equals a run over the dataset O(n + n) = O(n). Thus the time complexity of the algorithm is given by where stats is the time complexity associated with computing the statistics for one candidate. Now, we will analyse the specific stats complexity depending on the type of target.
Nominal target variables. The statistics for categorical distributions require the computation of the usage for each class for each target of each subgroup rule and the new default rule. Assuming a maximum number of classes k (for all target variables) and t target variables, then the worst case for the coverage gives O(tnk) from which the likelihood can be directly computed. The nominal score requires the computation of the data and model encoding, from which the data encoding dominates. The data encoding entails the computation of the NML complexity and likelihood for each refinement. In general, the values of the NML complexity are just computed once and then cached; thus, in a worst-case where one requires to compute n values for C(n i ), ∀ ni=1,...,n . Using the approximation of Mononen and Myllymäki (2008) for its computation, with O( √ 10n i + k), gives a worst-case complexity of O(tn( √ n + k)). This does not depend on the parameters of the beam, as the lookup of these values is O(1). The likelihood generally dominates over this term as it is computed for each refinement. Thus the total time complexity for nominal targets is given by: Numeric target variables. The statistics for normal distributions require the computation of the mean and variance (or residual sum of squares) for the refined subgroup and for the default rule. The mean can be computed in O(n) and given its value the variance can also be computed in O(n). Thus, for all the targets, one obtains O(tn).
The numeric score requires the computation of the data and model encoding, from which the data encoding dominates. The data encoding entails calculating the gamma function and the direct use of the statistics. Similar to the NML complexity, we compute the values of the gamma function as needed and cache them afterwards. In general, the computation of the gamma function is dominated by the other terms as we only compute it at most n times.
Thus the total time complexity for numeric targets is given by: Notice that this represents a worst-case scenario. In practice, the direct use of bitsets to compute the class usages in the nominal case makes it faster than its numeric counterpart for the same dataset size.

Space Complexity
The main memory consumption resources of the algorithm are: 1) the storage of items ζ; 2) the beam; and 3) the cached constants. The item storage requires at most the storage of |ζ| bitsets, with each bitset taking O(n), thus it totals O(|ζ|n). The beam saves w b bitsets at a time, thus having a space complexity of O(w b n). The cached values make up a total of n values being dominated by the items or beam part. Thus, depending on which part dominates, the space complexity of the algorithm is O(w b n + |ζ|n).

Empirical evaluation
In this section, we will empirically validate our proposed problem formulation and the SSD++ 11 algorithm. To do this, we will test how varying the hyperparameters of SSD++ affects the subgroups found, and then we will compare SSD++ against state-of-the-art algorithms in subgroup set discovery.
This section is divided as follows. In Section 9.1 we evaluate the effect of changing the different hyperparameters of SSD++. Then, in Section 9.2 we present the setup for validating our approach based on algorithms compared against, datasets, and measures used to evaluate them. After that, in Section 9.3, the results for univariate and multivariate nominal targets compared with state-of-the-art algorithms are presented. Then, in Section 9.4 the results for univariate and multivariate numeric targets compared with state-of-the-art algorithms are shown. After that, in Section 9.5 the statistical robustness of our formulation and algorithm are tested based on the generalisation to unseen data. Finally, in Section 9.6 the runtimes of the algorithms are compared.

Influence of SSD++ hyperparameters
Here we study the effect of SSD++ hyperparameters on the discovered subgroup lists. To not overfit our hyperparameters to the datasets and for this reason obtain a better performance than other methods, the values of SSD++ hyperparameters for the remaining of the experiments (besides this section) are fixed at the standard values of the DSSD implementation for the beam search, i.e., beam width w b = 100, number of cut points n cut = 5, and maximum search depth d max = 5, and to the compression gain normalisation term β = 1 (normalised gain). These values are assumed to be enough to achieve convergence and to obtain good subgroup lists and are thus taken as the standard values of SSD++. Now, to evaluate hyperparameter influence, we vary one hyperparameter value at a time while others remain fixed at their standard values. The results of varying the compression gain normalisation hyperparameter β can be seen in Appendix H; the results of varying the beam search hyperparameters w b , n cut , and d max can be found in Appendix I.
Normalisation term β. The results are evaluated in terms of compression ratio, SWKL, and the number of rules. For compression gain, the results (as shown in Appendix H) are similar for a small number of samples, but β = 1 and 0.5 obtain better results for larger datasets. In terms of SWKL, normalised gain (β = 1) is better. On the other hand, in terms of the number of rules β = 1 can obtain one order of magnitude more rules than the others, especially for larger datasets.
Beam search hyperparameters w b , d max , and n cut . The results are evaluated in terms of compression ratio and the average number of conditions per subgroup (for d max ). In general, increasing any of the three values results in better models according to relative compression. It is also interesting to note that for maximum depths above 5 it is rare to have an average number of conditions above 4, backing up our decision for the standard value d max = 5.

Setup of the subgroup quality performance comparisons
In this section, we evaluate the quality of our proposed method by comparing it to the state-of-the-art approaches in subgroup set discovery, which may vary depending on the type of target variable(s). The comparison takes three dimensions: 1) the algorithms used to compare against; 2) measures used to evaluate the quality of the subgroups found by each algorithm; 3) the datasets in which the algorithms are evaluated. We now discuss the details of each dimension.

Algorithms
The algorithms we compared and their relevant characteristic are listed in Table 2. A short description of each is as follows: 1. top-k 12 -standard subgroup discovery miner used as a benchmark. 2. seq-cover 12 -sequential covering as implemented in the DSSD implementation. 3. CN2-SD 13 -the classical sequential covering subgroup discovery algorithm, which is only implemented for nominal targets, and only removes the examples of the class of interest already covered (not all examples covered, as seq-cover does). 4. Diverse Subgroup Set Discovery (DSSD) 12 -diverse beam search for diverse sets of subgroups (van Leeuwen and Knobbe, 2012). 5. Monte Carlo Tree Search for Data Mining (MCTS4DM) -an approach to improve on beam search to find better subgroups without getting stuck in local optima (Bosc et al., 2018). 6. FSSD -a sequential approach for subgroup set discovery that defines a set as a disjunction of subgroups (Belfodil et al., 2019).
As can be seen in Table 2 most algorithms can only be applied to singletarget binary problems, and besides SSD++ only top-k, seq-cover and CN2-SD support the use of Sum of Weighted Kullback-Leibler (SWKL) divergence to measure the quality of the found subgroup set. Thus we only compare against seq-cover and CN2-SD, algorithms that output a subgroup list and can be applied to many target types, and with top-k as a reference of a non-diverse subgroup discovery algorithm. The algorithms that output sets do not have a stopping criterion or global formulation, and underperform in terms of SWKL; thus those comparisons are relegated to Appendix F. As an example, DSSD can indeed be applied to all types of target variables, but the fact that it uses weighted sequential covering makes it unsuitable to use the SWKL, making it unfairly underperform and unsuitable for a fair comparison (as shown in the Appendix). Also, note that we do not compare with machine learning algorithms that generate rules for classification or regression, such as RIPPER or CART, as the rules generated aim at making the best prediction possible and not the highest difference from the dataset distribution, as shown theoretically in Appendix D. Quality measures. As the quality of a set is measured using the SWKL, the most appropriate measure to use is the Weighted Kullback-Leibler (WKL) for the algorithms that support it. CN2-SD supports entropy which is related to WKL. FSSD only supports WRAcc at the moment. Note that for the case of numeric targets, except SSD++, all use a WKL that only takes into account the mean, given by W KL µ (s) = n s /σ d (μ d −μ s ) 2 , in contrast to the deviationaware measure of SSD++ in Eq. 8. Hyperparameters. Most algorithms use beam search, thus only have three main hyperparameters: the maximum depth of search d max ; the width of the beam w b ; and the number of cut points to discretise numeric explanatory variables n cut . The larger the values, the better the performance, but the slower The algorithms only support W KLµ for numeric targets (Eq. (7)), i.e., a Weighted Kullback-Leibler divergency that only takes into account the mean, contrary to the one used by SSD++ that also uses the variance (Eq. (8) the algorithms become, as time complexity is linear to each of them. To be fair and not over-search the hyperparameters, we selected the default values of the DSSD and seq-cover implementation for all beam-search algorithms: d max = 5, w b = 100, n cut = 5. For the case of MCTS4DM, which requires a larger set of hyperparameters, only the number of iterations is set, n iter = 50 000, to ensure good convergence, and the rest were set as default. FSSD only requires the maximum depth, which was set at 5.

Measures
To compare the quality of the subgroup sets obtained by different algorithms, we use three different measures. The first is our proposal to measure the overall quality of an ordered set of subgroups, the Sum of Weighted Kullback-Leibler (SWKL), as defined in Eq. (14). The other two are the number of subgroups |S| and the average number of conditions per subgroup |a|, two commonly used measures for the interpretability/complexity of a set of rules. These two measures follow the law of parsimony and assume that fewer subgroups with fewer conditions are easier to understand by humans, which can be an invalid assumption in some situations. Nonetheless, it is widely used and its simple understanding typically makes for a good proxy (Doshi-Velez and Kim, 2018).
Generalisation. In machine learning, algorithms are evaluated based on their generalisation to unseen data (e.g., cross-validation). This is not common practice in subgroup discovery and other algorithm implementations cannot run on unseen data. For this reason, we test against other algorithms in the same dataset. In terms of generalisation we compare SSD++ in its proposed format, versus SSD++ with KL and WKL divergence as quality measures instead of the greedy MDL gain (in Ln 3 of the SSD++), i.e., our formulation without distribution and model complexity, COMP(n a ) and L(M ), respectively, in Eq (39).

Datasets
For a thorough analysis we use a total of 54 datasets-10-univariate binary; 10 univariate nominal; 9 multivariate nominal; 15 univariate numeric; and 9 multivariate numeric-that are listed in Tables 7 and 8 of Appendix E. The datasets are commonly used benchmarks of machine learning and subgroup discovery, which are publicly available from the UCI 14 , Keel 15 , and MULAN 16 repositories. The datasets were selected to be the most varied possible. In the case of the nominal target datasets in Table 7, the number of targets ranging from 1 to 374, the classes from 2 to 28, the samples from 150 to 45 222, and the variables from 3 to 1 186. In the case of the numeric target datasets in Table 8, the number of targets ranging from 1 to 16, the samples from 154 to 22 784. Note that we used multi-label datasets instead of multi-nominal as the latter are not widely available.

Nominal target results
The results obtained on binary, nominal, and multi-label datasets with sequential subgroup set miners can be seen in Table 3 and in a graphical representation for the SWKL measure in Figure 6, while the results for algorithms that output sets can be found in Table 9 in Appendix F. Overall, we can see that our algorithm gets 14 out of 29 best results, compared with seq-cover in second place with 13 best results. In terms of SWKL per type of data, SSD++ achieves the smallest ranking for binary, seq-cover for nominal, and both are tied for multi-nominal. This small difference in the results between SSD++ and seq-cover is important for two reasons. First, it validates SWKL, showing that seq-cover is already implicitly maximising it without knowing it. Second, it shows that SSD++ can obtain on par or slightly better results than other established approaches. Our non-diverse baseline, top-k, shows that covering different dataset regions is important to maximise SWKL. Regarding the number of found subgroups, we can see that in most cases, all algorithms are in the same order of magnitude, except when SSD++ obtains many more subgroups (for adult, nursery, kr-vs-k, and mediamill ). These results can be explained by the use of normalised gain (β = 1) by SSD++, together with the fact that these datasets have a large number of samples, few variables, or a large number of categories. First, let us recall that the normalised compression gain of Eq. (38) is composed of a data covering part and a model penalisation part and that both are normalised by the number of instances covered, which gives an advantage to subgroups that cover less data but are well-covered (only one category, or few categories). When the datasets are larger and the number of variables is reasonably small, like adult with 45 222 examples and 14 variables, there is a larger chance of finding more statistically "significant" subgroups, as there can be more regions where subgroups only (or almost only) cover one class, and the penalisation of the model encoding is small as there are not many variables. On the other hand, subgroups covering more data can more easily have a larger entropy in the class label distribution. For example, kr-vs-k, which is a reasonably large dataset  Fig. 6 Comparison of SSD++ against other algorithms for numeric targets datasets with the Bonferroni-Dunn test (Demšar, 2006) of the SWKL measure. The values shown represent the average ranking of the respective algorithms. Ranks outside the marked interval are significantly different-from a frequentist perspective-(p < 0.05) from SSD++. Note that these graphs were added to help visualisation and the authors do not recommend inferring from the "significance" obtained. Moreover, the lack of significance was expected given the small number of datasets per target type with 28 056 and with 18 class labels, a subgroup that only covers one class label, as opposed to covering many class labels, will have a higher chance of being chosen. The number of subgroups found can be large, but it was shown in a classification setting that they generalise well . It is interesting to note that in the case of corel-5k, SSD++ does not find any "significant" subgroup to add. Regarding the number of conditions per subgroup, the two best-performing algorithms in terms of SWKL, SSD++, and seq-cover, tend to have a similar and lower number of conditions than the other algorithms. As Top-k only covers the same region, it tends to be close to the maximum depth of 5.

Numeric target results
The results for the single-target and multi-target numeric datasets can be seen in Table 4 and in a graphical representation for the SWKL measure in Figure 7. In general, SSD++ obtains the best results for 23 out of 25 datasets. This is expected as SWKL and SSD++ take into account the dispersion/deviation of the subgroup target while top-k and seq-cover do not. Moreover, the normalised standard deviation of the first subgroup found supports this claim, as SSD++ tends to find subgroups with smaller deviations for 10 out of 15 cases.  Fig. 7 Comparison of SSD++ against the rest with the Bonferroni-Dunn test (Demšar, 2006) for numeric targets for the SWKL measure. The values represent the average ranking of the respective algorithms and all algorithm with ranks outside the marked interval are significantly different-from a frequentist perspective-(p < 0.05) from the SSD++. Note that this graphs were added to help visualisation and the authors do not recommend to infer from the "significance" obtained. Moreover, the lack of significance was expected given the small number of datasets per target type Comparing SWKL results for top-k with seq-cover and SSD++ shows that irrespective of dispersion-aware (SSD++) or not (seq-cover), covering different regions of the data increases the quality of the list in terms of SWKL, validating the use of our measure. It should be noted that both top-k and seq-cover could, in practice, support taking into account the deviation, but that would require several non-trivial modifications in their source code.
Regarding the number of subgroups, seq-cover tends to have more rules than SSD++ for datasets with less than 5 000 examples, while SSD++ tends to have more for a larger number of examples. This makes sense as there is more evidence to identify possible significant subgroups. Regarding the number of antecedents, SSD++ tends to have, on average, one condition fewer than seq-cover for single-target and a similar number for the multi-target case.

Statistical robustness and generalisation
The main results of the statistical robustness analysis for single-binary,nominal, and -numeric targets are shown in Figures 8a and 8b, while the complete results can be seen in Tables 10 and 11 of Appendix G. Only single-target datasets are used for an easier interpretation of the results. Our proposed formulations-MDL β=1 (normalised gain) and MDL β=0 (absolute gain)-are compared against KL and WKL divergence, i.e., their counterparts, that do not take into account distribution complexity and multiple-hypothesis testing in Eq. (39). These counterpart 'non-testing' versions of SSD++ are similar, in the essential parts, to the state-of-the-art seq-cover algorithm. To compare the different quality measures used, we use the difference between log loss in train and test sets 17 with a 50%-50% train-test split. The log loss measures how well the estimated probabilities model the distribution present in the data (a lower value is better), so that the difference between log losses measures how well the model generalises to unseen data. At first glance, the two MDL-based approaches achieve the best generalisation (difference of log losses), lower log loss on the test set, and the smallest number of subgroups in 86%, 77%, and 100% of the cases, respectively, when compared to their 'non-testing' counterparts.
It is interesting to observe that, on the one hand, MDL β=0 obtains the lowest difference of log losses in 69% of the cases. On the other hand, however, MDL β=1 has the best test set log loss overall, obtaining the best value 54% of the cases. This difference is expected, as MDL β=0 produces more conservative subgroup lists in terms of log loss in the train set, which is reflected by a lower generalisation error. Also, while MDL β=1 has the best test set performance, its counterpart (KL) has the worst, showing that our MDL formulation adds statistical robustness to well-known existing measures. Further, we observe test log losses with large or infinite values in the numeric target case. This happens when one subgroup with a small variance sees a point far from its mean in the test set. For the case of MDL β=1 and W KL µ,σ , the infinite values appear in few (one to two) subgroups in their lists, not making it a problematic behaviour for description; however, for KL µ,σ this happens for most subgroups found. generalise to unseen data, tested with a 50%-50% train-test split. As baselines, we used SSD++ with KL and WKL divergence as quality measures, i.e., the same as our approach but without accounting for multiple hypothesis testing and distribution complexity (that is, without L(M ) and COMP(n a ) in Eq. (39)). The properties of the datasets can be seen in Table 7, and are ordered in ascending number of: 1) number of classes (if nominal); and 2) number of samples. The evaluation measure is |LogLossRatio(train) − LogLossRatio(test)|, where LogLossRatio(*) is the ratio between the log loss measure of the obtained subgroup list and the log loss of the dataset marginal distribution (dataset rule). The lower the value of the measure, the better, as this indicates that the model generalises well and does not overfit on the training set. The complete results can be seen in Tables 10 and 11 in Appendix G

Runtime comparison
Runtimes of all algorithms compared, i.e., top-k, seq-cover, CN2-SD, and SSD++ are shown in Figures 9a and 9b. In general, the runtime increases with the number of samples in the dataset for a fixed data type. For the nominal datasets, there is an increase in runtime with the number of target variables, which does not seem to happen for numeric targets. This is be-cause the number of subgroups found for multivariate numeric targets was, in general, smaller.
Comparing the algorithms against each other, as expected, top-k was the fastest algorithm, as it only needs to search for the subgroups once, while the others need multiple iterations. For nominal targets, CN2-SD was the slowest algorithm, which stems from entropy as a quality measure-experiments with WRAcc proved to be much faster. On the other hand, SSD++ seems to perform on par with seq-cover and is often even faster. For numeric targets, SSD++ was one order of magnitude slower than seq-cover.
One possible reason is the extra time to compute the variance, although this does not explain the difference between both algorithms. A further study of the numeric implementation could make for an interesting research direction.
s o n a r h a b e r m a n b r e a s tC a n c e r a u s tr a li a n T ic T a c T o e g e r m a  Fig. 9 Runtime in seconds for all algorithms for each dataset. The black vertical line divides the type of datasets, i.e., from left to right: univariate binary, nominal, and multi-label for nominal targets, and univariate and multivariate for numeric 10 Case study: associations between socioeconomic background and university grades of Colombia engineering students In this section, we apply SSD++ to a real use case to assess its usefulness and limitations. To this end, we aim to understanding how socioeconomic factors affect the grades of engineering university students in Colombia on their national exams. The dataset used to study this is fully described by Delahoz-Dominguez et al. (2020). It contains socioeconomic variables and grades in national exams done at the high school and university level for engineering students in Colombia. For our specific case study, we have selected two of their exam grades at the university for two reasons. First, the relationship between socioeconomic variables and university grades is weaker (than for high school grades), thus more interesting to see if we can find relations, and second, only having two exam grades improves the visualisation of the results.
Dataset. The dataset used is composed of 12 412 samples, 22 explanatory variables, and 2 numeric target variables. The explanatory variables refer to the socioeconomic background of the students at the time of high school, and they are made of variables such as parent' level of education, the household income, which type of high school they attended, the utilities available at home (e.g., internet and television), and their neighbourhood stratum 18 . The numeric targets represent their grades, from 0% to 100%, in two national university-level exams, namely quantitative reasoning and English. An additional reason for selecting this dataset is that it violates two of our model assumptions: 1) the target variables values are truncated between 0 and 100, thus violating the use of a continuous normal distribution to describe them; and 2) the target variables are not independent, as suggested by a correlation of 53%. If our approach is shown to work despite these violations, we may consider this is a good result.

Analysis of the subgroups obtained with SSD++
The first four subgroups with absolute (β = 0) and normalised (β = 1) gain can be seen in Figures 11a and 11b, respectively. The distributions of the first two subgroups for both gains can be seen in Figures 10a, 10b, 10c, and 10d. The two extreme gains were used to show the interest (from a user perspective) of using different gains depending on the goal of the data exploration, i.e., coarse versus fine-grained perspective.
Comparison of absolute and normalised gain. Overall, with absolute and normalised gain, our method finds 7 and 34 subgroups that cover a total of 84% and 92% of the data, respectively. Looking at Figures 10a, 10b, 10c, and 10d, it can be seen that normalised gain favours smaller and compact subgroups that deviate more from the dataset distribution, while absolute gain favours larger subgroups that deviate less from the dataset distribution. These conclusions can be verified by noting that normalised gain subgroups tend to have a smaller standard deviation, between 5% and 9%, while absolute gain has values in the same order of magnitude of the dataset distribution, i.e., around 23%.
Interpretation of the results. Both normalised and absolute gain results show that having a 'better' socioeconomic background is associated with higher average grades in both exams, and the contrary is associated with lower grades. This is clearer in the absolute gain case, as each subgroup covers more data. It is noticeable in Figure 10b that a subgroup with a standard deviation similar to the dataset leads to subgroups that are spread throughout the whole range of values. Nonetheless, that subgroup covers more regions with lower grades than the dataset, making it a relevant result to understand the dataset better.
In general, it can be seen that some conditions often appear in the subgroups, such as household income above and below 5 minimum wages and education of one of the parents equal or above high school. It seems that the presence or absence of these variables is highly associated with above or below-average performance, respectively. Looking at specific subgroups, it is interesting to see that in the 4 th subgroup of the absolute gain, the Quantitative reasoning grade is equal to the average behaviour of the dataset (77%), while the English grade is 8% above average. Looking at the subgroups with normalised gain, we see that there are only slight variations of their descriptions and that they belong to a similar socioeconomic macro group but with slight differences in their descriptions, which corresponds to small differences in their grades distribution.
Violation of the model assumptions. Here, we can observe how our method behaves when some modelling assumptions are violated. Regarding the truncated values, it seems that the normalised gain is affected by grades around 100 (as seen in Figures 10c and 10d) as most of its subgroups capture these students, which increases the average and lowers the standard deviation, making them rank higher. Our method was not developed for highly stratified target values, but the results seem to show that it does not seem prohibitive to the use of SSD++ in these cases as long as the stratification is mild and the user takes into account this fact.
Regarding the independence assumption, it seems that the subgroups found are still relevant, although both grades are almost always taken into account together, i.e., as the values are positively correlated, it is more likely to find subgroups with mean values that are high or low for both exams, but not high for one and low for the other. This is expected as the encoding of independent normal distributions does not take into account the covariance between target variables, and thus that case is not deemed a deviation by the current model formulation.  (b) Subgroup list with normalized gain (β = 1). First 4 subgroups of a total of 34 and swkl = 0.52 Fig. 11 Colombia engineering students performance in Quantitative Reasoning and English exams. The results of Fig. 11a and 11b were obtained by SSD++ with absolute gain (β = 0) and normalised gain (β = 1). The dataset contains two numeric target variable Quantitative Reasoning and English exams in a 0-100% scale. The dataset represents 12 412 engineering students in Colombia, their grades in university national exams and their socialeconomic background. Description contains information regarding students socio-economic background, n s the number of instances covered, Quant. and English the average grade and standard deviation in the respective exams. * The n of the dataset is the total number of instances in the dataset

Related work
In this section we cover work related to our proposed MDL subgroup lists, in three categories: subgroup discovery; rule learning; and MDL for pattern mining. The relevance of each topic is as follows: subgroup discovery directly relates to the task at hand; rule learning are generalisations of subgroup discovery; and MDL for pattern mining shares the same theory for formalising the problem.

Subgroup discovery
In its traditional form, subgroup discovery is also known as top-k subgroup mining (Atzmueller, 2015), entails the mining of the k top-ranking subgroups according to a quality measure and a number k selected by the user. This formulation suffers from three main issues that make it impractical for most applications: 1) lack of efficient search algorithms for more relevant quality measures (van Leeuwen and Knobbe, 2012;Bosc et al., 2018); 2) redundancy of subgroup sets mined, i.e., the fact that subsets with the highest deviation according to a certain quality measure tend to cover the same region of the dataset with slight variations in their description of the subset (van Leeuwen and Knobbe, 2012); 3) lack of statistical guarantees and generalisation of mined subgroups (van Leeuwen and Ukkonen, 2016). We will now go over the contributions of previous works on all these issues, with a specific focus on the last two-redundancy and statistical guarantees-which our work proposes to solve.

Efficient search algorithms
Algorithms for subgroup discovery can be broadly divided into three categories: 1) exhaustive search; 2) sampling-based methods; and 3) heuristics. In our work, we use a heuristic approach based on beam search to generate the candidate subgroups to add at each iteration. We will now present the developments in each of these three topics and why they are not feasible for finding good subgroup lists.
Exhaustive search methods have the advantage of guaranteeing to find the best solution. Most of these approaches in SD are either based on branchand-bound (Webb, 1995) or on extending frequent pattern mining algorithms (Aggarwal et al., 2014) to SD, such as Apriori-SD (Kavšek et al., 2003;Kavšek and Lavrač, 2006) based on Apriori, and SD-Map (Atzmueller and Puppe, 2006) and SD-Map* (Atzmueller and Lemmerich, 2009) based on FP-growth. However, for the implementation to be efficient in terms of time complexity, e.g., SD-Map and SD-Map*, they combine pruning of the search space with efficient traversal and data structures. And even though these approaches can handle multi-target problems, their efficiency is constrained to specific quality measures that allow for efficient search and/or pruning of the search space.
To address some of the limitations regarding quality measures, Boley et al. (2017) proposed an efficient exhaustive search (for numeric targets) to quality measures that consider the dispersion of the target values. Also, note that the previous methods could only find the optimal given already discretised explanatory variables, thus Belfodil et al. (2018) proposing to mine subgroups over numeric explanatory variables with guarantees. The main limitation of these approaches is that they need to be tailored for specific quality measures, mostly focus on binary targets, require special handling of numeric explanatory variables, do not take the redundancy of the found subgroups into account, and are less efficient for a task where they need to be run multiple times-such as in SSD.
Sampling approaches can be seen as an attractive time-efficient alternative to exhaustive search, especially when interacting with user's preferences is required Boley et al. (2011); Moens and Boley (2014). Nonetheless, they require a probability distribution over the pattern space, which needs to be tailored to specific quality measures and attribute types.
Heuristic approaches are used when an exhaustive search is not feasible, such as in the case of non-trivial targets, e.g., Bayesian networks in exceptional model mining (Duivesteijn et al., 2010), when the optimal subgroup definition changes throughout the problem such as in SSD, or when faster solutions are deemed necessary. The most common heuristic is the beam search (Lavrač et al., 2004;Meeng and Knobbe, 2011;van Leeuwen and Knobbe, 2012;Meeng and Knobbe, 2021). It allows for a non-exhaustive but efficient procedure that can easily generalise any quality measure or type of target variables. This makes for an obvious choice for our problem as our quality measureequivalent to WKL plus some additional terms-cannot be easily pruned. Thus, beam-search has the efficiency and flexibility that we need. Other heuristics include Genetic Algorithms (GAs) (Carmona et al., 2010(Carmona et al., , 2014, which, to be efficient, need to appropriately select a suitable formulation of the problem and tweak the hyperparameters for each dataset.

Redundancy of subgroup sets and subgroup set discovery
To address redundancy among the found subgroups, most previously proposed approaches encompass supervised pattern set mining (Bringmann and Zimmermann, 2007), and methods based on relevance (Großkreutz et al., 2012), and diversity (van Leeuwen and Knobbe, , 2012. Unlike diversity-based methods, the supervised pattern set mining objective is to find a fixed number of patterns, which must be chosen in advance. At the same time, relevance is limited to non-numeric targets. It is the last group, the diversity-based methods, that share the most similarities to our work, i.e., the area of Subgroup Set Discovery. As introduced in Section 2.3, Subgroup Set Discovery is an instantiation of the LeGo framework, that passes from local descriptions of the data (SD) to a global model (SSD) (Knobbe et al., 2008). The main approaches in SSD are CN2-SD (Lavrač et al., 2004), Diverse Subgroup Set Discovery (DSSD) (van Leeuwen and Knobbe, 2012), Skylines of subgroup sets (van Leeuwen and Ukkonen, 2013), Monte Carlo Tree Search for Data Mining (MCTS4DM) (Bosc et al., 2018), Subjectively Interesting Subgroup Mining (SISD) (Lijffijt et al., 2018), and FSSD (Belfodil et al., 2019). Table 5 summarises the differences between Subgroup Set Discovery methods, with SSD++ representing our approach and where all methods are compared in terms of: if they use a list or a set; the target variables they support; if they have statistical guarantees; if they have automatic stopping criteria (not defined by the user); and if they have a global definition of a subgroup set or list. Note that top-k mining algorithms could be directly applied to SSD if one iteratively adds subgroups and re-weights the instances (Lavrač et al., 2004). However, these methods would also miss a global definition of the problem and would need to be fine-tuned for the dataset at hand. Considering the methods in more detail, CN2-SD (Lavrač et al., 2004) is a direct adaptation of CN2-a classical rule learner for classification-was one of the first methods to deal with redundancy and can be applied to nominal target variables. Algorithmically, it uses a sequential approach, wherein each iteration adds one subgroup to the set and then removes the data covered by that subgroup until no more data can be covered in this way. More specifically, the method can also be used for unordered search-where only the data of the class of interest covered by each subgroup is iteratively removedor use a weighted covering scheme that, instead of iteratively removing the covered instances, weighs them based on how many times they were covered before. DSSD (van Leeuwen and Knobbe, 2012) developed a technique based on a novel measure of overlap between subgroups to iteratively find a set of subgroups. It can be applied to single-and-multi-target nominal and numeric variables with different types of quality measures. Skylines of subgroup sets (van Leeuwen and Ukkonen, 2013) are proposed to directly account for quality-diversity trade-off and find the Pareto optimal subgroup sets of size k. MCTS4DM (Bosc et al., 2018) uses Monte Carlo tree search to improve the quality of the subgroups found. However, it can only be applied to binary target variables and explanatory variables of the same type (all numeric or all nominal). Subjectively Interesting Subgroup Discovery (Lijffijt et al., 2018) finds the subjectively most interesting subgroup for numeric target variables with regard to the user's prior knowledge, based on an information-theoretic framework for formalising subjective interestingness. By successively updating the prior knowledge based on the found subgroups, it iteratively mines a diverse set of subgroups that are also dispersion-aware. FSSD (Belfodil et al., 2019) is a more recent approach that considers the 'union' of all subgroups as a single pattern by forming a disjunction of subgroups and evaluating its quality and can only be applied to binary target variables. This approach is similar to a sequential approach for mining subgroups. However, the individual contributions of each subgroup are dissolved in the 'new' subgroup formed by the disjunction of all subgroups.

Subgroup discovery with statistical guarantees
In terms of statistical guarantees to subgroup discovery, most approaches consider first mining the top-k subgroups and then post-processing them in terms of a test to find statistically significant subgroups (Duivesteijn and Knobbe, 2011;van Leeuwen and Ukkonen, 2016). Duivesteijn and Knobbe (2011) proposed to use random permutations of the target variable with respect to a quality measure to evaluate how the discovered subgroups compare against the null hypothesis generated by those permutations. Later, van Leeuwen and Ukkonen (2016) discussed the concept of significance for subgroup discovery and concluded that p-values should be used with caution as not all false discoveries can be removed in this way, as there will always be random subsets with large effect sizes. Two approaches that automatically find statistically robust subgroups are Lijffijt et al. (2018) and Song et al. (2016);Song (2017). The first approach (already mentioned in the last section), uses the maximum entropy principle to iteratively find subjectively interesting subgroups against a user's prior knowledge. The second proposes a quality measure that directly considers the subgroup distribution and if this is statistically different from the background/dataset distribution.
Our approach strongly deviates from the first two, as our method tests for statistical guarantees during the mining process, and it is parametric-as we use categorical and normal distributions to model the targets. Also, our notion of statistical robustness takes into account the concept of the subgroup list model class. Regarding Lijffijt et al. (2018), even though they also mine subgroups iteratively, they lack a definition of an optimal subgroup set. Their goal is to model the user's subjective knowledge and find regions in the data the user has no prior knowledge. Finally, our quality measure is similar to that of Song et al. (2016); Song (2017) in their top-k nominal target case; however, we also take into account multiple-hypothesis testing and focus on subgroup lists.

Rule learning
Pattern mining and association rule mining (Agrawal et al., 1993) are concerned with mining items that co-occur together, i.e., itemsets or patterns, and relationships between itemsets and a target item, e.g., a class, respectively. A known problem of their direct approach is the infamous pattern explosion, i.e., they tend to return enormous amounts of patterns/rules. To solve this problem, many approaches were proposed, but two stand out concerning our work, namely, rule-based classifier and statistical rule mining. To see the relationship with rule mining, first, note that subgroup discovery can be seen as a specialisation of association rule mining. Second, subgroup lists could be regarded as rule lists with a fixed default rule, i.e., the last rule that gets activated when no other rule applies is fixed to 'predict' the global distribution of the complete dataset. Rule lists and rule sets have long been a common and successful way to compactly apply rules for classification (Rivest, 1987).

Rule-based classifiers
Earlier approaches to finding good rule-based models can be broadly divided into two categories based on their model construction: greedy top-down or bottom-up approaches. From a top-down perspective, methods such as CBA (Liu et al., 1998) and CMAR (Li et al., 2001) start by mining all association rules from the data and then adding them to the model one by one. From a bottom-up perspective, methods such as (Cohen, 1995) mine one rule at a time until a final model is obtained. Cheng et al. (2008) effectively removes the two-step approach by using a branch-in-bound search on the FP-growth process and iteratively reduces the search space until a set of discriminant sets for classification is found. Nonetheless, the main limitation of these approaches is that they are based on a heuristic definition of a rule-based model, i.e., they add rules without a global optimal criteria.
Over the past years, rule learning methods that go beyond greedy approaches have been developed, i.e., Monte-Carlo search for Bayesian rule lists (Letham et al., 2015;Yang et al., 2017), and branch-and-bound with tight bounds for decision lists (Angelino et al., 2017) and rule sets . However, the main limitation of these methods is that they can only be applied to small or mid-size datasets and are mostly limited to binary targets.
Even though all algorithms mentioned in this section resemble our approach, their main goal is to make the best predictions-not to find the largest deviations in the data. Even though the two problems are related, we emphasise the theoretical difference between subgroup discovery and prediction in Appendix D, where the former focuses on local deviations and the latter on a globally homogeneous partition of the data.

Statistical rule mining
The idea of mining rules with statistical guarantees is appealing as it increases the users' trust in the patterns found while at the same time reducing the number of rules returned by a miner (Hämäläinen and Webb, 2019). The concept of statistical rule mining progressed by incrementally adding more statistical guarantees. Webb (2007) proposed for the first time the mining of statistically significant patterns, then Hämäläinen (2012) proposed KingFisher, an efficient algorithm to mine dependent rules, i.e., rules that show a dependency with respect to a target in terms of a dependency test like Fisher's exact test. After that, Hämäläinen and Webb (2017) added extra procedures to remove spurious relations from the miner findings. Lastly, the criteria under which causal rules can be mined were defined and an efficient algorithm to mine them was proposed (Budhathoki et al., 2021). All these methods focus on mining all the possible individual statistically significant (or causal) rules and not on finding a non-redundant set, as is the case of Subgroup Set Discovery. In this paper, we aim to accomplish both at the same time, finding the best global subgroup list while assuring local statistically robust subgroups.

MDL in pattern mining
In data mining, Krimp (Vreeken et al., 2011) was the first method to apply the MDL principle holistically, i.e., for the whole model selection process. This seminal work used a version of crude MDL, i.e., a not completely optimal 'two-part' encoding of the data, to find the pattern list that compressed a transaction dataset best to address the pattern explosion issue in pattern mining. Recent works have aimed at improving the encoding through refined MDL for encoding the data, i.e., an encoding that enjoys optimal properties at least in expectation (Grünwald, 2007). The first of such approaches was DiffNorm (Budhathoki and Vreeken, 2015), which used a prequential plug-in code to improve the encoding of transaction data, and recently MINT was proposed to mine real-valued pattern sets with a similar encoding (Makhalova et al., 2022). Although Krimp, DiffNorm, and MINT are used to describe data, they aim to find regularities-not deviations-and do not consider a target variable. For an in-depth survey of MDL in pattern mining, please refer to the survey by Galbrun (2020).
MDL for rule learning MDL has been used to find optimal sets of association rules for two-view data (van Leeuwen and Galbrun, 2015) and tabular data (Fischer and Vreeken, 2019). The latter is the most related to our work, as it aims to find rule sets that describe the data well. Like Krimp, it aims to find all associations in the data, though not at identifying deviations as we do, and no specific target variable(s) are defined. As discussed in the contributions of our work in Section 1, this work builds on top of MDL-based rule lists for classification . Compared to our work,  focuses solely on classification, can only handle discretised explanatory variables while using a less optimal model and data encoding.

Conclusions
We showed that finding good subgroup lists (ordered sets) that are both nonredundant and statistically robust, i.e., robust subgroup discovery, is computationally feasible. To achieve this, we formally define the subgroup list model class and the problem of robust subgroup discovery. Then, based on these two, we propose an optimal formulation of subgroup lists based on the MDL principle-that includes top-1 subgroup discovery in its definition. As optimally solving the problem is not possible, we proposed a heuristic algorithm dubbed SSD++ that approximates this objective using a greedy search that adds the subgroup that locally minimises the MDL criterion to the list in each consecutive iteration. Moreover, this approximation was shown to be equivalent to Bayesian testing between subgroup and dataset marginal target distributions plus a penalty for multiple hypothesis testing, which guarantees that each subgroup added to the list is statistically sound.
These assertions are supported by empirical evidence obtained on a varied set of 54 datasets. In the case of nominal targets, our method performed on par in terms of subgroup list quality while obtaining smaller lists with fewer conditions. In the case of numeric targets and through the use of a deviation-aware measure, our method dominated in 92% of the cases. Finally, we evaluated statistical robustness by testing the generalisation on unseen data. Our MDLbased formulations obtained the lowest generalisation error 86% of the time when compared to KL and WKL quality measures.
Through a case study relating the socioeconomic background and national exam grades of Colombia engineering university students, we showed that SSD++ could be flexibly adapted to different goals of the user. In particular, it can change from a fine-grained perspective of the data that finds many subgroups covering small parts of the data well, to a coarse perspective that finds few subgroups covering large parts of the data. Also, it was shown that our method is robust to mild violations of our model assumptions.
In short, our approach can find interpretable, non-redundant, and statistically robust ordered lists of subsets' descriptions that largely deviate from 'normality' for (selected) target variables-deviation from 'normality' is measured as a divergence between the subset and the dataset/background distributions for those variables-based on the user-specified interests on coveragedescriptions that cover a small or large portion of the data.
Limitations. Even though the SSD++ algorithm has some appealing local statistical properties, we do not know how far the found models are from the optimal subgroup lists as defined by the global MDL criteria we proposed. Also, it does not scale very well for numeric targets, which was to be expected from the time complexity analysis. At the moment, multiple target variables are assumed to be independent, which can produce erroneous results when this assumption is violated. Preliminary experiments show that for moderately correlated variables (e.g., with a correlation of 0.5) this does not seem to be an issue, but there is no quantification of its implications. Similarly, for numeric targets, we use a normal distribution, and several datasets violate this assumption, either by behaving like a multi-modal or truncated distribution.
Future work. The main lines of research for future work can be divided into three categories: 1) extending subgroup lists to other target variables and/or distributions; 2) algorithmic developments; and 3) generalise this framework to other model classes. In the first category, an obvious extension would be to distributions that take into account multiple dependent target variables, such as multivariate-normal distributions for numeric targets and over itemsets for the nominal case. Another interesting and straightforward development would be the extension of our work to mixed targets, combining both nominal and numeric variables. In the second category, algorithmic developments could go from mere upper-and-lower bounds to improvements in search methods and to study the feasibility of global search such as Markov Chain Monte Carlo methods used by Yang et al. (2017) or branch-and-bound algorithms used by Boley et al. (2021). In the third category, our approach could be formalised for subgroup sets, allowing for overlap between the subgroups.

Appendices
A Normalised maximum likelihood independence for non-overlapping multinomials For this section, let us assume that we have a dataset D = {X, Y } and model M that forms a partition over the whole data. The model M divides the data D in ω parts, of the form {(X 1 , Y 1 ), · · · , (X ω , Y ω )}. Each part has an associated categorical distribution with estimated parametersΘ i over the target part Y i (as defined in Section 2). Our goal in this section is to show that the NML encoding of a partition equals to the sum of the NML encoding of its parts: Note that in the case of a subgroup list, as the default rule does not require NML encoding, the M used in this section represents the subgroups S, and D means the data covered by these. In the case of a tree or rule list, M represents the model that partitions the data at the level of leaves and rules (including default rule), respectively, and D the whole dataset. This is done without any loss of generality as the separation property allows us to separate the encoding of the default rule for a subgroup list. First, let us recall the definition of the NML probability distribution (Shtarkov, 1987): where Y n is the set of all possible sequences of n points with k = |Y| categories,M (Y | X) andM (Z | X) are the models with parameters estimated according to the maximum likelihood over the data Y and Z, respectively. Taking into account that our data is independent and identically distributed (i.i.d.), and that our model M partitions the data into ω parts, we can further develop the previous formula to: where l(Θ i | Y i ) is the likelihood function for each of the ω parts and g(Y, X, M ) is a complexity function that depends on these 3 variables. The first term is already independent for each part; however, the second is not. Let us now look at g(Y, X, M ) in the case where we only have one part in the dataset, i.e., D 1 . We will call this term the NML complexity of a multinomial distribution and denote it by C(n 1 , k) of one part D 1 = {Y 1 , X 1 }, with n 1 = |D 1 | and k = Y C(n 1 , k) = log Pr(z i ;Θ(Z 1 ))   = log   n 11 +n 12 +...+n 1k =n 1 n 1 ! n 11 !n 12 !...n 1k ! c∈Y n 1c n 1 where n 1c is the number of points of category c in Y 1 , and the passage from the second equality to the last is a property of multinomial distributions commonly used to make the computation of C(na, k) simpler (Grünwald, 2007). It is interesting to note that C(na, k) only depends on the number of points in Y 1 and its cardinality, not on the actual values. This term, i.e., the complexity of a multinomial distribution over n 1 points with k possible values, measures the likelihood of each possible sequence. Table 6 All possible sequences of a partition of fixed length of the data in three parts. Fixed length means that all possible parts always have the same amount of points, as e.g. |A 1 | = |A 2 | = · · · = |A a | = n A .

Aa B b Cc
Now we must generalise from a part to the whole partition of the dataset. To illustrate how to do this, let us first look at Table 6, which shows an example of all the possible sequences in a fixed-length three-part partition of the data. Then, taking into account those three parts, let us look at how the probabilities of all those sequences could be computed: where this follows naturally from the distributive property of the multiplication. It is easy to see that this generalises to partitions of any number of parts. Thus, going back to the complexity term g(Y, X, M ), we can see that log g(Y, X, M ) = log Substituting this back into Eq. (41), we obtain what we wanted: B Bayesian encoding of a normal distribution with mean and standard deviation unknown For encoding a sequence of numeric valued i.i.d. observations such as Y = {y 1 , ...., yn}, the Bayesian encoding takes the following form: where f is the probability density function (pdf), Θ is the set of parameters of the distribution, and w(Θ) the prior over the parameters. In the case of a normal distribution Θ = {µ, σ}, with µ and σ being its mean and standard deviation, respectively, the pdf f (Y | Θ) over a sequence Y is the multiplication of the individual pdfs, thus: In order not to bias the encoding for specific values of the parameters, we choose to use the constant Jeffrey's prior of 1/σ 2 for the unknown parameters µ and σ, and add an extra. Thus, our prior is given by: where 1/ √ 2π was added for normalisation reasons. Putting everything together, one obtains: The integrals over the whole space of the parameters µ and σ allow us to penalise the fact that we do not know the statistics a priori, thus penalising the fact that distribution over n points could, by chance, have the same statistics like the one found in the data.
Note that using an improper prior requires that we somehow make it proper, i.e., we need to find a way to make the integration over the prior finite w(µ, σ) = K, where K is a constant value. The usual way to make an improper prior finite is to condition on the k minimum number observations Y |k ∈ Y needed to make the integral proper (Grünwald, 2007), which in the case of two unknowns (µ and σ) is k = 2. Thus, instead of using w(µ, σ) we will in practice be using w(µ, σ | Y |2 ), and using the chain rule and the Bayesian formula returns a total encoding of Y equal to where P (Y |2 ) is a non-optimal probability used to define Y |2 = {y 1 , y 2 } that we will define later and y 1 , y 2 chosen in a way that maximises P (Y ). Now that we have all the ingredients to define P (Y ) we will start by defining P Bayes (Y ) and then choose the appropriate probability for P (Y |2 ).
To solve the first integral of P Bayes (Y ) in Eq. (48), we integrate in σ and note that the formula is an instance of the gamma function, with the corresponding variable transformation: Performing the variable transformation and noting that the minus sign of dz cancels with the reversing of the integral limits, we get: To solve the integral in µ we need to introduce the statisticsμ andσ as the values estimated from the data. We define these quantities as: whereμ is the mean estimator over n data points andσ 2 is the estimator of the variance. Note that for the variance the biased version with n was used instead of with n − 1 as it allows to compute the Residual Sum of Squares (RSS) directly by RSS = nσ.
Focusing now on the interior part of the integral of Eq. 52 and rewriting it in order to resemble the t-student distribution, we obtain: where s 2 s =σ 2 /n is the "sampling" variance. Now, taking into account the fact that the integral of the t-student distribution over the whole space is equal to one, and reshuffling around its terms we get Inserting this back in Eq. 48 we obtain: Returning to the the conditional probability of Eq. (49), we see that we still need to define P (Y |2 ), the non-optimal probability of the first two-points. As in the case of our model class we assume that the dataset overall statistics are known, i.e., Θ = {μ d ,σ d }, we will use this distribution to find the probability of the points Y |2 = {y 1 , y 2 } as : Finally, applying the minus logarithm base 2 to all the terms in Eq (49) to obtain the total code length in bits, whereμ 2 is the estimated mean of y 1 , y 2 and Lcost(Y |2 ) is the extra cost incurred of not being able to use a refined encoding for Y |2 . Now that the encoding length is defined, we need to choose the two points. i.e., y 1 , y 2 . Because we want to minimise this length, we notice that there are only two terms that contribute to it in Lcost(Y |2 ), and thus by choosing the two observations close toμ d minimises both the encoding of P (Y |2 ) and maximise P Bayes (Y |2 ) for most cases. There are exceptions to this, depending on the respective values of µ d and y 1 , y 2 , but these are not significant to change the values too much and require less computational search to find the points.

B.1 Convergence to BIC for large n
In this section, it is shown that for a large number of instances n, the Bayesian encoding of a normal distribution with unknown mean and standard deviation (Eq. (58)) converges to the encoding of a normal distribution with mean and standard deviation known plus log n, i.e., proportional to the definition of the Bayes Information Criterion (BIC). First, the encoding of a normal distribution with mean and standard deviation known over n i.i.d. points is equal to the sum of the individual encodings: Second, we need to use the Stirling approximation of the Gamma function for large n: and finally we insert it into Eq. (58) and assume τ = 1 to obtain: where from the second to the third line, we assumed large n, making some of the terms disappear, while the definition nσ 2 = n i (y i − µ) 2 is used for making the third term of the third expression appear. From the fourth to the fifth expressions, it was assumed that Lcost(Y |2 ) is negligible, as it is the cost of not being able to encode the first two points optimally. For the Bayes information criterion, we used its standard definition, where (Θ | Y ) is the likelihood as estimated from the data, and k is the number of parameters, which in our case is 2.

C Derivation of MDL-based optimal subgroup lists equivalence to WKL-based SD
In this appendix we derive the formula that relates the MDL-based subgroup lists with WKL-based subgroup discovery for categorical and normal distributions. This arises as the solution of the maximisation problem (equivalent to the standard MDL minimisation) of: Categorical distribution derivation: where naKL(Θ a ;Θ d ) is the Weighted Kullback-Leibler divergence fromΘ a toΘ d .
Normal distribution derivation: Using the Stirling approximation of the gamma function: Γ (n + 1) ∼ √ 2πn n e n ; in Appendix B.1, the derivation is as follows: where naKL(Θ a ;Θ d ) is the usage-weighted Kullback-Leibler divergence between the normal distributions specified by the respective parameter vectors.

D Difference between subgroup discovery and rule-based predictive models
This appendix shows the difference between the objective being maximised for subgroup discovery and for predictive rules. We do this through the comparison of the equivalent maximisation MDL scores for subgroup lists and classification rule lists  with only one rule/subgroup-without loss of generality for greater sizes or for regression tasks. To differentiate both model classes, SL and RL will be used for subgroup lists and classification rule lists, respectively. First, lets recall the form of a subgroup list SL as given in Figure  whereΘ ¬a was used to emphasise that the default rule of a rule list is not fixed, and is equivalent to the 'not rule 1'. This is the key difference between these two types of models, the default rule is fixed to the marginal distribution of the dataset for subgroup lists, and the default rule has the distribution of the negative set of the rules in the list for rule lists. It should be noted that there are many definitions of rule lists that use a fixed rule; however, having a variable default rule that maximises the prediction quality is the best representative of rule lists and of the objective of finding the best machine learning model, i.e., returning the best partition of the data with the smallest error possible. Note that a decision tree is also part of this family of models, as any path starting at the tree's root to one of its leaves also forms a rule. Thus, a decision tree is equivalent to a set of disjoint rules, i.e., none of the rules described in this way overlap on a dataset. For the type of classification rule lists defined above, the encoding of the first rule and default rule is given by Eq. 27 as for both rules; the parameters are unknown. Thus the MDL score of a rule list is given by: and note that the model encoding L(RL) = L(SL), has both lists can be described in the same manner. Following the same steps as in Section 7.1 by turning the MDL score objective from a minimisation to maximisation by multiplying by minus one and adding the constant L(Y d | Θ d ), we obtain the same objective as in Eq. 7.1: where r is the rule that maximises the objective. Working out this equation, maximisation objective of a classification rule list for a target variable of k class labels is given by: This should be contrasted with the maximisation objective of subgroup list of Eq. 36, which is given by: Comparing both of the last equations, we can notice the crucial distinction between subgroup discovery and classification: the local nature of subgroup discovery and the global nature of the classification task. In other words, subgroup discovery aims at finding subgroups that locally maximise their quality, independently of the rest of the dataset, and even though rules for classification try to maximise their local quality also, they have to take into account the quality of their negative set, i.e., a classification rule cannot be considered by its quality alone, it has to be considered in terms of its global impact in the dataset. On the other hand, this result also shows the similarity between both tasks and where the confusion sometimes arises, i.e., in some cases, the best subgroup can also be the best rule. An example of this would be a very large dataset (relatively to the number of observations covered by the rule).
Here, the best rule/subgroup would cover a small number of observations compared to the rule formed by the negative set of that rule, i.e., D ¬a , as a similar distribution toΘ d , makinĝ Θ ¬a ∼Θ d . Nonetheless, this similarity decreases in the case of larger lists, as the default rule will always represent what is left. In contrast, in a subgroup list, it remains constant and represents what we consider uninteresting. The same result can be obtained for regression rule lists.

F Empirical results of non-sequential subgroup discovery algorithms
The comparison of SSD++ with subgroup set discovery algorithms that return sets (and not lists) can be seen in Table 9. G Statistical robustness results of applying SSD++ and our MDL approach on unseen data.
In Table 10 we show the statistical robustness analysis of the SSD++ algorithm and our MDL approach by seeing how its training performance translates to unseen data performance. The measure used to evaluate the quality and generalisation of the found subgroups is |LogLossRatio(train) − LogLossRatio(test)|, i.e., the absolute difference between the log loss ratio in the train and test sets. First, the log loss (LogLoss) is defined as follows: LogLoss(Y|X, SL(D train )) = (y,x)∈{Y,X} − log Pr(y | x, SL(D train )), where SL(D train ) is a subgroup list selected from the train dataset D train , Y and X correspond to the target and exploratory data for which we want to know the log loss (it can be the same as D train for the training set, or different in case of the test set), and Pr(y i | SL) denotes the probabilities based on a categorical or a normal distribution for nominal or numeric targets, respectively. For the numeric case we use the probability density function instead. Now, the LogLossRatio is the ratio between the LogLoss of subgroup list SL and the LogLoss of the marginal distribution (equal to the default rule of SL): LogLossRatio( * ) = LogLoss(Y|X, SL(D train )) whereΘ d is the dataset's marginal distribution. To avoid having probabilities equal to zero and infinite log losses, for nominal targets, we added a pseudo-count of 0.5 to every subgroup distribution, i.e., the Jeffrey's prior for the multinomial distribution (Grünwald, 2007).  (39)), i.e., M DL β=1 and M DL β=0 -generalises to unseen data, tested in a 50%-50% train-test split. As baselines, we run SSD++ with KL and WKL divergence as quality measures, i.e., the same as our approach but without accounting for multiple hypothesis testing or distribution complexity (without L(M ) and COMP(n a ) in Eq. (39)). The properties of the datasets can be seen in Table 7, and are ordered in ascending number of: 1) number of classes; and 2) number of samples. The evaluation measures are {Log Loss ratio between subgroup list and dataset marginal distribution (equal to dataset rule) for train (LL train ) and test (LL test ) sets; and number of subgroups |S|}.

Table 11
Statistical robustness analysis for numeric target datasets. This table shows how our approach-SSD++ with normalised and absolute gain (Eq. (39)), i.e., M DL β=1 and M DL β=0 -generalises to unseen data, tested in a 50%-50% train-test split. As baselines, we run SSD++ with KL and WKL divergence as quality measures, i.e., the same as our approach but without accounting for multiple hypothesis testing or distribution complexity (without L(M ) and COMP(n a ) in Eq. (39)). The properties of the datasets can be seen in Table 8

H Empirical analysis of compression gain
In this section we present a thorough comparison of the normalisation terms β of SSD++, where β = 1 is the normalised gain and β = 0 the absolute gain. SSD++ is executed with the same parameters (beam width, number of cut points for numerical variables, and maximum depth of search) as in the experiments section, i.e., w b = 100, ncut = 5, dmax = 5. The different types of gain are compared for all the benchmark datasets described in the paper in terms of their compression ratio (defined later) in Figure 12, Sum of Weighted Kullback-Leibler divergency (SWKL) in Figure 13, and number of rules in Figure 14. The compression ratio is the length of the found model L(D, M ) divided by the length of encoding the data with the dataset distribution (a model without subgroups) L(D |Θ d ), and formally it has the following form: s o n a r h a b e r m a n b r e a s t C a n c e r a u s t r a li a n T ic T a c T o e g e r m a

I Empirical analysis of the influence of the beam search hyperparameters
This section presents a thorough comparison of the influence of the hyperparameters of the beam search of SSD++ on its results. As a complete search over the whole combination of parameters is unfeasible, we present here an exploration of the hyperparameters used for the experimental comparison in the paper (w b = 100, ncut = 5, dmax = 5), i.e., we fix two of the parameters on the values above and then proceed to change the selected parameter of interest. We do this for all the 3 parameters. The line between the dots of the same colour does not represent an interpolation and is merely used to aid visualisation and suggest trends. Note on relative compression. It may seem that the values of the relative compression remain constant, but that is an illusion due to the scale of the y axis. Moreover, as the compression ratio is given by dividing large values (usually above the thousands), its value with two decimal digits can be misleading. Nonetheless, in general, when zooming over the figures, one can discern a slight improvement (smaller values) for larger values of the hyperparameters.