1 Introduction

Rules are understandable and provide a powerful means of expressing patterns present in many types of data, with applications ranging from mechanical fault diagnosis [18] to genome analysis [49]. With over 50 years of research (e.g., [31]), the induction of rules has been studied for almost as long as the learning of neural networks (NNs). Even though it was repeatedly shown that rule models can perform on par with or even better than black-box models in some tasks (e.g., [55]), the uptake of rules has remained low. For example, as of writing, no rule learning algorithm is included in the popular scikit-learn machine learning toolkit.

One possible explanation for this is that NNs, a more successful family of machine learning algorithms, have benefited from a modular architecture that enabled many different improvements to be combined. For example, an authoritative review [57] of deep learning covers hundreds of articles presenting incremental improvements such as better gradient descent, the balance between complexity and high generalisation capacity, different layer architectures, etc. Over the same period, progress in rule learning research did not stall, but the character of the contributions was largely different. Many rule learning papers described a ‘monolithic’ method, which shared only some elementary principles with the prior art (see [24] for an overview). From our perspective, the absence of a commonly recognised baseline method (such as the multilayer perceptron and backpropagation in NNs), the absence of a common platform (such as TensorFlow), and the resulting difficulty in integrating advances is at least partly responsible for the low uptake of rule learning.

In our work, we attempt to adopt the incremental paradigm, which helped to propel neural networks to their current success. We describe multiple postprocessing methods, each yielding an incremental improvement when applied to models learned with several existing rule learners. As the baseline method, we use the association rule classification group of algorithms. These provide a key benefit: the generation of association rules is a standard task for which multiple high-performance algorithms (and their implementations) exist. For performance reasons, our implementation is written in Java; however, as the software interface we use the arules ecosystem in R [29], which enables our approach to integrate with multiple existing rule-based classifiers. A third-party Python package with a scikit-learn-like interface is also available.

A major impediment to existing association rule-based classification algorithms is that they work only on nominal data. Existing approaches typically address this limitation by discretisation, which is performed as part of preprocessing. For example, the “temperature” attribute would be automatically split into intervals such as [10,20), [20,30), …, which are then used as independent nominal categories during rule mining. In this article, we propose several rule tuning steps aiming to recover the information lost in the discretisation of quantitative attributes. We also propose a new rule pruning strategy, which further reduces the size of the classification model.

The work presented here was initially inspired by the Classification Based on Associations (CBA) algorithm [46], and since the aim is to incorporate quantitative information, we call the resulting framework “Quantitative CBA”. The framework can also be used with other rule learning approaches that rely on prediscretised data, which we demonstrate on the recently proposed Interpretable Decision Sets (IDS) algorithm by Lakkaraju et al., 2016 [42], the Scalable Bayesian Rule Lists (SBRL) algorithm by Yang et al., 2017 [69], and four other algorithms.

The rest of this article is organised as follows: Section 2 provides a brief introduction to association rule classification. Section 3 describes the proposed rule tuning and pruning steps. Section 4 contains the experimental validation. Section 5 provides a comparison of the presented approach with related work and Section 6 presents the limitations of our proposal. The conclusions summarise the contributions. The Appendix presents additional measures of quality of rule classifiers and a supplementary pseudocode listing.

2 Preliminaries: association rule learning and classification

In this section, we introduce the association rule learning and classification tasks, and then we briefly review and compare association rule classification algorithms. This serves as the basis for the selection of representative algorithms that are used as baselines.

2.1 Class association rule learning

Association rule learning is an algorithmic approach that was originally designed to discover interesting patterns in very large and sparse instance spaces [3]. These patterns are represented as rules.

In our work, we focus on class association rule learning as a specific type of association rule learning task, where the learned rules are constrained to contain the target label in the consequent. Class association rule learning is performed on similar datasets as those used for other supervised machine learning tasks.

Definition 1

(dataset) A dataset \(T\) is a set of instances. Each instance \(o_i \in T\) is described by a vector \(\langle o_{i,1},\ldots,o_{i,n}\rangle\), where \(o_{i,1},\ldots,o_{i,n-1}\) are values of predictor attributes \(A_1,\ldots,A_{n-1}\), and \(o_{i,n}\) is the value of the class (target) attribute \(A_n\). The value of a predictor attribute \(A\) in instance \(o_i\) is denoted as \(A(o_i)\) and the value of the class attribute is denoted as \(class(o_i)\).

Example

In order to illustrate the key definitions and consequently the steps of the proposed algorithms, we will use the humtemp synthetic dataset, denoted as \(T\). There are two predictor attributes: Temperature, also referred to as \(A_1\), and Humidity, also referred to as \(A_2\). The target attribute \(A_3\) is Class and corresponds to a subjective comfort level with values ranging from 1 (worst) to 4 (best). The dataset contains \(|T| = 36\) instances. An example instance \(o_1 \in T\), \(o_1 = \langle 32,50,3\rangle\), corresponds to Temperature\((o_1) = A_1(o_1) = 32\), Humidity\((o_1) = A_2(o_1) = 50\) and Class\((o_1) = A_3(o_1) = class(o_1) = 3\).

Definition 2

(literal) A literal expresses the condition that an instance has a specific value of a given attribute. A literal \(A = V\) evaluates to true for instance \(o_i\) if and only if \(A(o_i) = V\); otherwise, it evaluates to false. A literal \(A = V\) can also be represented using the notation \(A(V)\). The attribute \(A\) in a literal \(l = A(V)\) can also be referred to as \(attr(l)\).

A literal is a basic building block of association rules. It corresponds to a Boolean attribute, which is either true or false for a given instance. In our work, we do not consider negative literals.

In terms of terminology, a literal in rule learning is sometimes referred to as a condition or as an item. The conditions of rules in machine learning are typically represented using the attribute = value notation, e.g., \(A_1 = 32\). However, in the domain of inductive logic programming and relational rule learning, a literal is typically represented as attribute(value), e.g., \(A_1(32)\). In the following, we adopt the former notation for practical examples of rules, and since the latter notation is more convenient for symbolic manipulations, we use it in pseudocode listings and definitions.

Example

Literal \(l = A_2(50)\) evaluates to true for instance \(o_1\) because \(A_2(o_1) = 50\), as follows from the previous example. To refer to the name of the attribute in \(l\), we use \(attr(l) =\) Humidity.

Definition 3

(class association rule) A rule takes the form \(l_{1} \land l_{2} \land {\ldots} \land l_{m} \rightarrow c\). The antecedent of a rule, denoted as \(ant(r)\), consists of a conjunction of literals \(l_1, l_2, \ldots, l_m\), where \(m \geq 0\) and \(\forall i,j, i \neq j: attr(l_i) \neq attr(l_j)\), \(attr(l_i) \neq attr(c)\). The consequent of a rule consists of a single literal \(c\), denoted as \(cons(r)\).

A class association rule is an implication stating that if the conjunction of literals in the antecedent of a rule evaluates to true for an instance, then the consequent of the rule is also true for this instance. In the rest of the article, class association rules are referred to simply as association rules or only as rules. For conciseness or readability reasons, the logical conjunction ∧ is sometimes replaced by comma (,) or the word and in rule listings.

Example

Consider rule \(r_1\): Temperature= 32 and Humidity= 50 → Class= 3. This rule contains literals \(l_1 =\) Temperature(32) and \(l_2 =\) Humidity(50) in the antecedent and one literal \(l_3 =\) Class(3) in the consequent. The rule meets the condition that the names of attributes in all literals are different, as \(attr(l_1) =\) Temperature, \(attr(l_2) =\) Humidity and \(attr(l_3) =\) Class.

A special type of rule is a default rule, which is typically used as a catch-all rule for instances that are not covered by any of the other rules.

Definition 4

(default rule) If rule r is a class association rule and ant(r) is empty then r is a default rule.

Example

The rule \(r_{d}: \emptyset \rightarrow \) Class= 4 is a default rule.

The fact that the antecedent of a rule applies to an instance (covers it) does not necessarily mean that the rule correctly classifies the instance.

Definition 5

(rule covering an instance) A rule r covers instance o if and only if all literals in the antecedent of r evaluate to true for instance o. If r is a default rule, then r covers any instance.

Example

Both rules r1 and rd introduced in the previous examples cover instance o1.

Definition 6

(rule correctly classifying an instance) Rule r correctly classifies instance o if and only if r covers o and the consequent of r evaluates to true for o.

Example

Rule \(r_1\) correctly classifies instance \(o_1\). Rule \(r_d\) does not correctly classify instance \(o_1\) because its consequent predicts Class= 4, whereas \(class(o_1) = 3\).

Definition 7

(support and confidence of a rule) Let r be a rule, and T be a dataset.

Absolute support is computed as follows:

$$ supp_{abs}(r) = |\{o \in T:r \text{ correctly classifies } o\}|. $$
(1)

Support is computed as follows:

$$ supp(r) = \frac{supp_{abs}(r)}{|T|}. $$
(2)

Confidence is computed as follows:

$$ \text{\textit{conf}}(r) = \frac{supp_{abs}(r)}{|\{o \in T: r \text{ covers } o\}|}. $$
(3)

Association rule learning algorithms require two parameters: minimum support and minimum confidence thresholds. Only rules with support and confidence equal to or above these thresholds are output.

Example

Let \(o_1\) be the only instance covered by \(r_1\) in \(T\). As shown in the previous example, instance \(o_1\) is correctly classified by \(r_1\), therefore \(supp_{abs}(r_1) = 1\). The confidence \(conf(r_{1}) = \frac {1}{1}=1\). The support of \(r_1\) in \(T\) is \(supp(r_{1})=\frac {1}{36}=2.8\%\).
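To make Definitions 5–7 concrete, the following minimal Python sketch computes absolute support, support and confidence of a class association rule. The rule representation (a dict of antecedent literals plus a target class) and the four-instance toy subset are simplifications introduced only for this illustration; they are not the humtemp data or the actual implementation.

```python
# Minimal sketch of Definitions 5-7: support and confidence of a class
# association rule on a toy subset of humtemp-style instances.
from fractions import Fraction

# instances: (Temperature, Humidity, Class)
dataset = [(32, 50, 3), (26, 42, 4), (30, 58, 4), (28, 45, 4)]

def covers(antecedent, instance):
    """A rule covers an instance if all antecedent literals are true for it."""
    return all(instance[attr] == value for attr, value in antecedent.items())

def supp_abs(antecedent, target_class, data):
    """Number of instances covered by the rule AND carrying the rule's class."""
    return sum(1 for o in data if covers(antecedent, o) and o[2] == target_class)

def supp(antecedent, target_class, data):
    return Fraction(supp_abs(antecedent, target_class, data), len(data))

def conf(antecedent, target_class, data):
    covered = sum(1 for o in data if covers(antecedent, o))
    return Fraction(supp_abs(antecedent, target_class, data), covered)

# r1: Temperature=32 and Humidity=50 -> Class=3
r1_ant, r1_class = {0: 32, 1: 50}, 3
print(supp_abs(r1_ant, r1_class, dataset))  # 1
print(conf(r1_ant, r1_class, dataset))      # 1
print(supp(r1_ant, r1_class, dataset))      # 1/4 on this toy subset (1/36 on the full humtemp data)
```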

2.2 Handling numerical data

For numerical attributes, the standard definition of a literal (Def. 2) results in problems with sparsity. For example, single values of temperature, such as “Temperature= 32”, could individually have too low support; hence, any rule containing such a literal would not meet the prespecified support threshold. For this reason, quantitative attributes typically need to be discretised prior to the execution of association rule learning. The discretisation replaces the precise numerical value with an interval to increase the number of instances in the data that are covered by literals. For example, when the temperature attribute is discretised, e.g., ‘Temperature= 32’ is replaced by ‘Temperature=(30,35]’, more instances can match the latter value (an interval) than the former value (a specific number).
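As a minimal illustration of this preprocessing step, the sketch below maps raw temperature readings to equidistant interval labels of the kind that association rule learning then treats as nominal values. The bin boundaries are hypothetical; the experiments in this article use MDLP-based supervised discretisation instead.

```python
# Hypothetical equidistant discretisation of a numeric attribute into
# nominal interval labels such as "(30,35]".
def discretise(value, lower=10, upper=40, width=5):
    """Map a numeric value to the label of its (lo, lo+width] bin."""
    lo = lower
    while lo + width < upper and value > lo + width:
        lo += width
    return f"({lo},{lo + width}]"

temperatures = [12, 26, 32, 38]
print([discretise(t) for t in temperatures])
# ['(10,15]', '(25,30]', '(30,35]', '(35,40]']
```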

Once preprocessing has been applied, the learned rules correspond to high-density regions in data with boundaries aligned to the discretisation breakpoints. This can impair precision but improve computational efficiency on high-dimensional data, allowing association rule learning to be applied to much larger data than is amenable to other types of analyses [21, p. 492].

2.3 Building an association rule classifier

The first Association Rule Classification (ARC) algorithm, dubbed CBA (Classification Based on Associations), was introduced in 1998 [46]. While there were multiple successor algorithms, the structure of most ARC algorithms followed that of CBA [64]: 1. Learn class association rules, 2. Select a subset of the association rules, 3. Classify new instances.

In the following, we briefly describe the individual steps.

I. Rule learning :

In the first step of an ARC framework, standard association rule learning algorithms are used to learn conjunctive classification rules from data.

The Apriori [3] association rule learning algorithm was used in CBA and in the recently proposed Interpretable Decision Sets (IDS). Some other approaches, such as Bayesian rule sets (BRS) [66], use FP-Growth [32], while other algorithms such as CORELS [5] and Scalable Bayesian rule lists (SBRL) [69] are explicitly agnostic about the underlying rule learning algorithm and could therefore be used in conjunction with any approach capable of generating regular association rules (e.g., [13, 17]).

There are two main obstacles to the straightforward use of the discovered association rules as a classifier: the excessive number of rules discovered even on small datasets, and the fact that the generated rules can be contradictory and incomplete (some instances are covered by no rule).

II. Rule selection (also called rule pruning) :

A qualitative review of rule selection (pruning) algorithms used in ARC is presented in [62, 64]. The most commonly used method according to these surveys is data coverage pruning. This type of pruning processes the rules in the order of their strength: if a rule correctly classifies at least one of the remaining training instances, it is kept and the instances it covers are removed from the training set used for pruning; otherwise, the rule is deleted and the instances it covers are retained. In CBA, data coverage pruning is combined with “default rule pruning”: the algorithm replaces all rules with lower precedence than the current rule with a default rule if the default rule inserted at that place reduces the number of errors. The effect of pruning on the size of the rule list is reported in [46]. Based on experiments on 26 datasets, the following effect of data coverage pruning was observed: the average number of rules per dataset without pruning was 35,140; with pruning, the average number of rules was reduced to 69 without effectively impacting accuracy.
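A minimal sketch of data coverage pruning as described above is given below; the `covers` and `correct` predicates are placeholders supplied by the caller, and the sketch is an illustration rather than the actual CBA implementation.

```python
# Sketch of data coverage pruning: rules are scanned in precedence order;
# a rule is kept only if it correctly classifies at least one instance not
# yet covered by a previously kept rule, and the instances it covers are
# then removed from the pruning set.
def data_coverage_pruning(sorted_rules, instances, covers, correct):
    """sorted_rules: rules in decreasing precedence; covers(rule, o) and
    correct(rule, o) are predicates supplied by the caller."""
    remaining = list(instances)
    kept = []
    for rule in sorted_rules:
        if any(correct(rule, o) for o in remaining):
            kept.append(rule)
            remaining = [o for o in remaining if not covers(rule, o)]
        # otherwise the rule is dropped and its instances stay in `remaining`
        if not remaining:
            break
    return kept
```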

Some ARC algorithms use optimisation algorithms to select a subset of the candidate rules generated by association rule learning. For example, the IDS algorithm optimises an objective function, which reflects the accuracy of the individual rules as well as multiple facets of model interpretability, including the number of rules.

III. Classification :

There are two principal types of ARC algorithms that differ in the way classification is performed: rule sets and rule lists.

Definition 8

(rule list) A rule list is an ordered sequence of rules \(R = \langle r_1,\ldots,r_i,\ldots,r_m\rangle\), where the antecedents of \(r_1,\ldots,r_{m-1}\) are nonempty and \(r_m\) is a default rule with an empty antecedent. A rule \(r_i\) has a higher precedence than rule \(r_j\) in the rule list \(R\) if \(i < j\) and a lower precedence if \(i > j\). An instance is classified by the highest precedence rule covering the instance.

Example

An example of a rule list is shown in Table 3.

Algorithms generating rule lists include CBA, SBRL and CORELS. While IDS uses the term “sets” in its title (Interpretable decision sets), the generated models are also rule lists according to the definition presented above [19]. The advantage of rule lists is that they make it easy to explain the classification of a particular instance because there is always only one rule that is accountable for it [40].

Rule sets, also called rule ensembles, provide an alternative approach, where all rules with antecedents covering the instance are used to classify it. Examples of ARC algorithms combining multiple rules to perform classification include CPAR [70] and the more recent BRS.
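The difference between the two classification schemes can be summarised in a few lines of Python. The `covers` predicate and the rule attributes `predicted_class` and `confidence` are illustrative assumptions, and the confidence-weighted vote shown for rule sets is only one possible aggregation scheme, not the exact scheme used by CPAR or BRS.

```python
from collections import defaultdict

# Rule list: the single highest-precedence covering rule decides the class.
def classify_rule_list(rule_list, instance, covers):
    for rule in rule_list:  # ordered by precedence, last rule is the default rule
        if covers(rule, instance):
            return rule.predicted_class

# Rule set (ensemble): all covering rules vote, here weighted by confidence.
def classify_rule_set(rule_set, instance, covers, default_class):
    votes = defaultdict(float)
    for rule in rule_set:
        if covers(rule, instance):
            votes[rule.predicted_class] += rule.confidence
    return max(votes, key=votes.get) if votes else default_class
```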

2.4 Overview of association rule classification algorithms

The main benefit of using a rule-based classifier, as opposed to a state-of-the-art sub-symbolic (“black-box”) method such as a deep neural network, should be the comprehensibility of the rule-based model combined with fast execution on large and sparse datasets and a competitive predictive performance. Individual rule learning algorithms meet these aspirations to a different degree.

Table 1 summarises which properties of rule models we found desirable. In Table 2, these criteria are used to compare common association rule classification algorithms. In this table, single refers to single rule (one rule) classification; crisp refers to whether the rules comprising the classifier are crisp (as opposed to fuzzy); det. refers to whether the algorithm is deterministic, with no random element such as genetic optimisation; assoc corresponds to whether the method is based on association rules; acc, rules and time are the average accuracy, average rule count, and average time across 26 datasets as reported by [4].

Table 1 Desirable comprehensibility traits of rule models
Table 2 A comparison of association rule-based classifiers and closely related approaches

The results in Table 2 show that CBA produces more comprehensible models than any of its successors considering most characteristics (single rule classification, deterministic, crisp rules). However, CBA tends to produce larger models than some of the newer algorithms. It should be noted that while smaller models are currently typically preferred by algorithm designers in terms of explainability, there is mixed evidence as to whether smaller models (fewer rules, fewer conditions) are always more comprehensible. Small models can be sufficient for discriminating the classes but they may not provide a sufficient explanation [25].

CBA also maintains high accuracy and fast execution times. In terms of accuracy, CBA is outperformed only by FARC-HD (by 4 percentage points) and CPAR (by 2 percentage points). However, CPAR outputs four times more rules and performs multi-rule classification, which is possibly less comprehensible than the one-rule classification in CBA. While FARC-HD outperforms CBA in terms of accuracy, this fuzzy rule learning algorithm is more than 100 times slower than CBA.

It should be noted that Table 2, which is based on data from [4], excludes several relevant recently proposed algorithms. These typically subject the input rule set generated by association rule learning to a sophisticated selection process, involving optimisation techniques such as Markov chain Monte Carlo (in SBRL), branch-and-bound (in CORELS), submodular optimisation (in IDS) or simulated annealing (in BRS). For these algorithms, a comprehensive previously published benchmark on a larger number of datasets is not, to our knowledge, available. Nevertheless, these algorithms are widely considered state-of-the-art in the area of association rule classification (e.g., in [65]). We select SBRL and IDS as two additional algorithms, which are postprocessed with our approach and evaluated. CORELS is also included among the benchmarked algorithms.

3 Proposed approach

Quantitative classification based on associations (QCBA) is a collection of rule tuning steps that use the original continuous data to “edit” the rules, changing the scope of literals in the antecedent of the rules. As a consequence, the fit of the individual rules to the data improves. The second group of tuning steps aims to remove unnecessary literals and rules, making the classifier smaller and possibly more accurate. The resulting models are ordered rule lists with favourable comprehensibility properties, such as one-rule classification and crisp rules.

The method takes two inputs:

  • the training dataset T with numerical attributes (before discretisation) that was used to learn the rules, and

  • the set of rules learned on a discretised version of T.

The input to the algorithm is an unordered set of rules, but it can also be an ordered rule list. In the latter case, the input order is ignored. The output is an ordered classification rule list, which has the same or a smaller number of rules (a default rule may be added). The individual rules have the boundaries of numerical conditions (literals) adjusted. Some may also be completely removed. The first phase consists of the following rule tuning (optimisation) steps:

  1. Refitting rules to the value grid. Literals initially aligned to the borders of the discretised regions are refit to a finer grid with steps corresponding to all unique attribute values appearing in the training data.

  2. Literal pruning. Redundant literals are removed from the rules.

  3. Trimming. Boundary regions with no correctly classified instances are removed.

  4. Extension. Ranges of literals in the antecedents are extended, preserving or improving rule confidence.

Once the rules have been processed, their coverage changes, which can make some of the rules redundant.

  5. Data coverage pruning and default rule pruning. These two pruning steps, proposed in [46], correspond to the classifier builder phase of CBA.

  6. Default rule overlap pruning. Rules that classify into the same class as the default rule at the end of the classifier can be removed if there is no other rule between the removed rule and the default rule that would change the classification of instances initially classified by the removed rule.

Procedure 1 depicts the succession of tuning steps in QCBA and provides pointers to the algorithms described in detail in the following subsections. For each proposed step, there is a brief description of the effects on selected measures of classifier quality. For tuning steps 1–4, these are measures applying to single rules as “local classifiers”. Pruning steps 5–6 are evaluated in terms of their effect on the complete rule list – the “global classifier”.

Procedure 1: QCBA qcba().
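As a complement to Procedure 1, the following schematic Python sketch shows the succession of the tuning steps. The step callables stand in for Procedures 2–10 described in the following subsections; this is an illustration of the control flow under the assumption that steps 1–4 are applied rule by rule, not the actual implementation in the qCBA package.

```python
# Schematic outline of the QCBA tuning pipeline (Procedure 1).
def qcba(rules, train_data, refit, prune_literals, trim, extend,
         post_pruning, default_rule_overlap_pruning):
    tuned = []
    for rule in rules:                                   # per-rule tuning, steps 1-4
        rule = refit(rule, train_data)                   # Section 3.2
        rule = prune_literals(rule, train_data)          # Section 3.3
        rule = trim(rule, train_data)                    # Section 3.4
        rule = extend(rule, train_data)                  # Section 3.5
        tuned.append(rule)
    rule_list = post_pruning(tuned, train_data)          # Section 3.6, steps 5
    return default_rule_overlap_pruning(rule_list, train_data)  # Section 3.7, step 6
```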

3.1 Example

To illustrate the rule tuning steps, we use the humtemp synthetic dataset. The dataset is plotted in Fig. 1a.

Fig. 1: Illustration of refit, trimming and extension tuning steps

The grid depicted with the dashed lines denotes the result of a discretisation algorithm, which is performed as part of preprocessing. In this case, equidistant binning is applied. Figure 1b and Table 3 show an input list of rules learned on the humtemp data. In our example, these rules are learned by CBA, but another algorithm generating rule lists could have also been applied.

Table 3 Input rule list depicted in Fig. 1b

A QCBA model generated after all of the tuning steps is shown in Fig. 1c. Figure 1e–h correspond to the individual tuning steps, which transform the original rule R#3 (see Fig. 1d, Table 3) from the CBA model to its final form in the QCBA model. These figures will be referred to again from the detailed description included below.

3.2 Refit

The refit step is inspired by the way the C4.5 decision tree learning algorithm [51] selects splitting points for numerical attributes. Figure 1d shows rule “Rule #3: Temperature=(25,30] and Humidity=(40,60] => Class= 4” contained in the CBA model. Note the “padding” between the rule boundary and the covered instances that are nearest to the boundary.

The refit tuning step (Procedure 2) contracts interval boundaries to a finer grid, which corresponds to the raw, “undiscretised” attribute values, ensuring that the refitted rule covers the same instances as the original rule. The standard definition of a literal truth value (Def. 2) requires that the value in the literal exactly matches the value of the referenced attribute in the instance. We therefore amend this definition so that a literal can also match the values in the training data that belong to an interval.

Procedure 2: Refit rule refit().

Definition 9

(interval-valued literal) Literal A(I) is called an interval-valued literal if I is an interval. An interval-valued literal l evaluates to true for instance o if and only if A(o) belongs to the interval I.

Example

Literal Temperature = (25,30] is true according to Def. 2 only for instances that have the exact string value ‘(25,30]’ in the Temperature attribute. Since QCBA works with the original values before discretisation, according to Definition 9, the literal Temperature = (25,30] is also true for an instance with the numeric value 26 in the Temperature attribute.

Definition 10

(value set of a literal) Let \(l = A(V)\) be a literal defined on attribute \(A\) and let \(T\) be a dataset. The value set of \(l\) is defined as \(vals(l,T) = \{A(o) : o \in T, l \text{ evaluates to true for } o\}\).

Example

For dataset T depicted in Fig. 1a, the value set of literal l: Temperature= (25,30] is the set of five temperature readings that are actually recorded in the dataset: vals(l,T) = {26,27,28,29,30}.

The Refit step of the proposed QCBA method needs to retrieve the source values that are merged into an interval. These are not defined as all real numbers included in the interval, but rather as values actually appearing in the referenced attribute in the instances in a supplied dataset. This definition will be referenced from the extension tuning step.

Definition 11

(value set of an attribute) The value set of attribute \(A\) in dataset \(T\) is defined as \(vals(A,T) = \{A(o) : o \in T\}\).

Example

The red boxes in Fig. 1e mark the instances that are used as “anchors” for the refit of Rule #3 from Table 3. For the literal “Temperature=(25,30]”, the upper boundary corresponds to an existing instance; therefore there is no change. Since the lower boundary is exclusive, it is adjusted to the nearest value of a real instance, which is 26. Likewise, the boundaries of Humidity in Rule #3 are adjusted to values of the nearest instances within the original boundaries. The resulting rule is “Temperature=[26,30], Humidity=[42,58] → Class= 4” shown in Fig. 1e.

Effect on measures of classifier quality Since refit does not affect the covered or classified instances on the training dataset, the confidence and the support of the rule after refit are unchanged (as computed from the training data). The main effect of the refit step is the increase in the density of the region covered by the rule: the same instances as covered by the original rule are covered by a smaller region corresponding to the refitted rule. Density is computed as the number of correctly classified instances divided by the volume covered by the antecedent of the rule (see Definitions 14 and 15 in Appendix A).
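A minimal sketch of the refit step for a single interval-valued literal is shown below. It assumes access to the value set of the literal (Definition 10), i.e., the raw attribute values in the training data that fall within the literal's interval, and it ignores open/closed boundary bookkeeping.

```python
# Sketch of the refit step for one interval-valued literal: the boundaries are
# contracted to the minimum and maximum raw attribute values that fall within
# the literal's interval in the training data, yielding a closed interval that
# covers exactly the same training instances as the original literal.
def refit_literal(literal_value_set):
    """literal_value_set: vals(l, T), the raw values matched by the literal."""
    return (min(literal_value_set), max(literal_value_set))  # closed interval [lo, hi]

# Rule #3 from Table 3: Temperature=(25,30] refit on the humtemp value grid
print(refit_literal([26, 27, 28, 29, 30]))  # (26, 30) -> literal Temperature=[26,30]
```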

3.3 Literal pruning

Some association rule learning algorithms can output models containing rules with redundant literals. One example is IDS, where rules are generated without checking the confidence threshold. Since the IDS algorithm contains a stochastic element, there is no guarantee that even if a shorter rule covering the same instances is included among the candidate rules, it will be selected into the final classifier instead of the longer rule. As a consequence, the classifier can contain rules with unnecessary literals.

The applicability of literal pruning is not limited to IDS. Since QCBA is modular, literal pruning can also be used after other steps in QCBA have been performed. For example, while candidate rules used in CBA do not generally contain redundant conditions, after the literals in these rules have been subject to other QCBA tuning steps, especially trimming and extension, some of the conditions may become redundant.

The proposed literal pruning algorithm attempts to remove literals from a rule. If the confidence of the rule does not drop, the shorter rule is kept and becomes a seed for further attempts at literal pruning. Literal pruning is depicted in Procedure 3.

Procedure 3: Literal pruning pruneLiterals().

In our experiments, we observed that in most rules no literal can be pruned. Therefore, iterating over literals in their order of appearance is more computationally efficient than a full greedy procedure that would evaluate all candidate literals for removal and then select the best one. Additionally, literal pruning can result in duplicate rules in the rule list. These are removed as part of the post-pruning step.

Effect on measures of classifier quality Literal pruning improves (or does not affect) the quality measures of individual rules as local classifiers, since rule confidence and rule support can only stay the same or increase (on the training data). The rule length stays the same or decreases, making the rules shorter. As each omitted condition means that the rule spans all values of the corresponding attribute indiscriminately, the density of instances covered by the rule decreases when one or more literals are removed.
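A simplified sketch of literal pruning follows; the `confidence` callable is an assumption standing in for the recomputation of rule confidence on the training data after a literal is removed.

```python
# Sketch of literal pruning (Procedure 3): literals are tried for removal in
# their order of appearance; a removal is kept whenever rule confidence does
# not drop, and the shorter rule becomes the seed for further attempts.
def prune_literals(antecedent, target_class, confidence):
    """antecedent: list of literals; confidence(ant, cls): rule confidence."""
    current = list(antecedent)
    baseline = confidence(current, target_class)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if candidate and confidence(candidate, target_class) >= baseline:
            current = candidate                      # shorter rule becomes the new seed
            baseline = confidence(current, target_class)
        else:
            i += 1                                   # literal kept, try the next one
    return current
```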

3.4 Trimming

This step adjusts the interval boundaries to the actual values in the covered instances. If there is a covered but misclassified instance on the boundary, the boundary is adjusted so that the instance is no longer covered, and this process is repeated until no further instance is removed (Procedure 4).

Procedure 4: Rule trimming trim().

Example

In Fig. 1f, trimming has been applied and the rule is shaved of boundary regions that do not cover any correctly classified instances: one instance with class 3 (denoted by a red box) on the rule boundary is initially covered by Rule #3, but also misclassified by it. As part of the trimming, the lower boundary of the Temperature literal in Rule #3 is increased so that this instance is no longer covered.

Effect on measures of classifier quality Since only regions containing no correctly classified instances are removed, the trim operation does not change the support of the trimmed rule, and rule confidence can only increase or stay the same. The density of the region covered by the rule also increases (or is unchanged), since the region covered by the rule stays the same or shrinks, while the number of correctly classified instances stays the same.
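The trimming of a single numeric literal can be sketched as follows; `covered` is an assumed helper structure holding, for one attribute, the (value, correctly classified) pairs of the instances covered by the rule.

```python
# Sketch of trimming (Procedure 4) for one numeric literal: boundary values
# whose covered instances are all misclassified are shaved off until both
# boundaries touch a correctly classified instance.
def trim_literal(covered):
    """covered: list of (attribute_value, correctly_classified) pairs for the
    instances covered by the rule, restricted to one attribute."""
    values = sorted(covered)
    # drop boundary values at the lower end that carry no correctly classified instance
    while values and not any(ok for v, ok in values if v == values[0][0]):
        values = [p for p in values if p[0] != values[0][0]]
    # drop boundary values at the upper end that carry no correctly classified instance
    while values and not any(ok for v, ok in values if v == values[-1][0]):
        values = [p for p in values if p[0] != values[-1][0]]
    return (values[0][0], values[-1][0]) if values else None  # new [lo, hi]
```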

3.5 Extension

The purpose of this hill-climbing process is to move literal boundaries outwards, increasing rule coverage. Each extension generated by Procedure 5 corresponds to a new rule that is derived from the current rule by extending the boundaries of literals in its antecedent in one direction with steps corresponding to breakpoints on the finer grid.

Procedure 5: Rule extension getRuleExtension().

The algorithm refers to the notion of the direct extension of a literal presented in Definition 12.

Definition 12

(direct extension of an interval-valued literal) Let \(l = A(][x_i,x_j][)\) be an interval-valued literal and \(T\) a dataset, where “][” denotes that the interval boundary is either open or closed. A direct extension of type lower of literal \(l\) is a literal \(A([x_l,x_j][)\), where the lower boundary is \(x_{l}= \max (\{x\in vals(A,T): x < \min (vals(l,T)) \})\) and the type of the upper boundary (open or closed) is the same as in \(l\). A direct extension of type higher of literal \(l\) is a literal \(A(][x_i,x_h])\), where the upper boundary is \(x_{h}= \min (\{x\in vals(A,T): x > \max (vals(l,T)) \})\) and the type of the lower boundary is the same as in \(l\). If the set defining \(x_l\) is empty, the direct extension of type lower does not exist; analogously, if the set defining \(x_h\) is empty, the direct extension of type higher does not exist.
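A Python sketch of Definition 12 follows, computing the new lower or upper boundary from the attribute's value set; open/closed boundary bookkeeping is omitted for brevity and the function names are illustrative only.

```python
# Sketch of the direct extension of an interval-valued literal (Definition 12):
# the boundary is moved outwards by exactly one step of the finer value grid
# given by the attribute's value set in the training data.
def direct_extension(literal_values, attribute_values, direction):
    """literal_values: vals(l, T); attribute_values: vals(A, T)."""
    if direction == "lower":
        candidates = [x for x in attribute_values if x < min(literal_values)]
        return max(candidates) if candidates else None   # new lower boundary
    if direction == "higher":
        candidates = [x for x in attribute_values if x > max(literal_values)]
        return min(candidates) if candidates else None   # new upper boundary

vals_A = [20, 22, 25, 26, 28, 30, 31, 34]   # value set of the attribute
vals_l = [26, 28, 30]                        # literal currently covers [26, 30]
print(direct_extension(vals_l, vals_A, "lower"))   # 25
print(direct_extension(vals_l, vals_A, "higher"))  # 31
```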

The extension process for a given rule can generate multiple extension candidates (Procedure 6).

Procedure 6: Get all direct candidate extensions getAllDirectExtensions().

In Procedure 7, the candidate extensions are iterated in the order specified by CBA:

Procedure 7: Rule extension extendRule().

Definition 13

(rule sorting criteria) The rule \(r_i\) is ranked higher than the rule \(r_j\) if

  1. \(conf(r_i) > conf(r_j)\) (higher confidence rules are preferred),

  2. \(supp(r_i) > supp(r_j)\) (higher support rules are preferred),

  3. \(|ant(r_i)| < |ant(r_j)|\) (shorter rules are preferred).

The first applicable condition is used.
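The sorting criteria can be expressed compactly as a sort key; the rule attributes used below (confidence, support, antecedent) are assumed fields of a simple rule object introduced only for illustration.

```python
# Sketch of the CBA rule sorting criteria (Definition 13): higher confidence
# first, then higher support, then shorter antecedent.
def cba_sort(rules):
    return sorted(
        rules,
        key=lambda r: (-r.confidence, -r.support, len(r.antecedent)),
    )
```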

An extension is accepted if the confidence improves over the last confirmed extension by at least a user-set minImprovement threshold. If this crisp extension has not been reached, a conditional extension is attempted (Procedure 7, line 13). This picks up on the last extension attempt that resulted in a rule that does not meet the criteria for a crisp extension but meets the criteria for a conditional extension. The procedure iteratively generates new candidate rules by extending a literal that was subject to extension in this last extension attempt. If the extension meets the criteria for crisp acceptance, the conditional extension finishes successfully, with the extension replacing the current rule.

Example

The red boxes in Fig. 1g demonstrate the confirmed extension steps for the seed rule depicted in Fig. 1f. In the first confirmed extension, the upper Temperature boundary is increased to 30 by one step of the finer grid. Since the confidence of the extended rule does not decrease, this extension meets the conditions for crisp acceptance. Subsequent extensions result in the rule depicted by the blue region in Fig. 1g. This rule cannot be further extended in any of the directions without a drop in confidence. The procedure therefore tries a conditional extension (red region in Fig. 1h) by lowering the boundary of the Humidity literal to 38. This makes the rule cover two more instances, one correctly and one incorrectly, decreasing the confidence below the initial 75%. Further extensions of Humidity in the same direction (by lowering the lower boundary) make the rule cover two more instances correctly (marked with red circles in Fig. 1h). This brings the confidence back to 75% and results in crisp acceptance. The algorithm tries additional conditional extensions, but none are successful. Figure 1c shows the final rule, which has both higher confidence and higher support than the original rule in Fig. 1d.

Effect on measures of classifier quality Assuming minImprovement = 0 (the default setting), the procedure outputs only rules with confidence higher than or equal to that of the original rule. Since the algorithm does not have any “shrink” step and can only extend the coverage of the rule by adding new regions and instances, the support of the rule can only increase or remain unchanged. The density of the rule can increase, decrease or stay the same as a result of extension, depending on the ratio between the increase in the number of correctly classified instances and the size of the newly covered regions.

3.6 Post-pruning

The previous steps affected individual rules, changing their coverage. The number of rules can now be reduced using an adaptation of CBA’s data coverage pruning and the default rule pruning. The latter also adds a default rule to the end of the rule list that ensures that the rule list covers all possible test instances. As a first step of post-pruning, rules are sorted according to the criteria in Definition 13. The post-pruning algorithm is depicted in Procedure 8 and corresponds to the core of the CBA algorithm described in Section 2.

Procedure 8: Post-pruning post-pruning().

Effect on measures of classifier quality Unlike all the preceding steps, the standard CBA data coverage pruning does not edit the individual rules, but by selecting a subset of rules, it forms the final “global” classifier. The data coverage pruning algorithm removes a rule only if it does not correctly classify any instance from those uncovered by nonremoved rules with higher precedence. The pruning, therefore, removes only rules that do not contribute to the number of correctly classified instances, and as such, it does not increase the number of misclassified instances in the training dataset. Similarly, the default rule pruning ensures that the replacement of the tail of the rules in the base classifier by the default rule does not increase the count of misclassified instances in the training dataset.

3.7 Default rule overlap pruning

Default rule overlap pruning checks whether some of the rules classifying into the same class as the default rule can be removed without changing any classification: the presence of the default rule then ensures that the affected instances are still classified correctly. A pruning candidate – a rule that is a candidate for removal – can be removed only if the instances it correctly classifies would not be classified differently by rules with a lower precedence. We consider two versions of default rule overlap pruning: instance-based and range-based.

The instance-based version, described in Procedure 9, removes a rule if there is no instance in the training data that would be misclassified as a result of the removal.

Procedure 9: Default rule overlap pruning (Instance-based) drop-in().
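A sketch of the instance-based variant is given below. It assumes a rule list ordered by precedence whose last element is the default rule; the `covers` predicate and the `predicted_class` attribute are placeholders, and the check shown is a conservative simplification of the condition described above, not the actual Procedure 9.

```python
# Sketch of instance-based default rule overlap pruning: a rule predicting the
# same class as the default rule is removed if every training instance it
# covers would still receive the same class from the remaining lower-precedence
# rules (ultimately from the default rule itself).
def default_rule_overlap_pruning(rule_list, instances, covers):
    default_class = rule_list[-1].predicted_class
    kept = list(rule_list)
    for rule in rule_list[:-1]:
        if rule.predicted_class != default_class:
            continue                                 # only candidates matching the default class
        later = kept[kept.index(rule) + 1:]          # lower-precedence rules incl. default
        safe = True
        for o in instances:
            if not covers(rule, o):
                continue
            # class assigned if `rule` were removed: first later rule covering o
            fallback = next(r for r in later if covers(r, o))  # default rule covers all
            if fallback.predicted_class != rule.predicted_class:
                safe = False
                break
        if safe:
            kept.remove(rule)
    return kept
```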

Example

Referring to Fig. 2, Rule #6 is the only pruning candidate since other rules classify to different classes than the default rule (Rule #11). Because none of the rules between #6 and #11 would cause misclassification of training instances covered by #6 if #6 is removed, Rule #6 is removed by instance-based pruning.

Fig. 2: Illustration of default rule overlap pruning. This figure uses a different rule list than the previous examples. The rule list is shown from rule #6 (previous rules are not relevant)

The range-based version, depicted in Procedure 10 (located in the Appendix), analyses overlaps in the range of literals between the pruning candidate and all potentially clashing rules. A potential clashing rule is a rule with lower precedence than the pruning candidate with respect to the criteria in Definition 13 that has a different class in the consequent than the pruning candidate. The pruning candidate is removed only if none of the potential clashing rules overlaps the region covered by the pruning candidate.

Procedure 10: Default rule overlap pruning (Range-based) drop-ra().

Range-based pruning imposes more stringent conditions for the removal of a rule than instance-based pruning, as it requires that no potentially clashing rule overlaps the region covered by the pruning candidate (as opposed to checking the overlap only on training instances). As seen in our evaluation (Section 4), as a result of these stricter conditions it prunes many fewer rules.

A discussion of the limitations of the presented default rule pruning algorithms is provided in Section 6.

Example

Referring to data in Fig. 2 left, rules between #6 and #11 seem to cover different geometric regions, which would be an argument for removing #6. After closer inspection (Fig. 2 right) we note that #6 overlaps in Humidity and shares an inclusive boundary on Temperature with Rule #8, which also includes Temperature= 34 but classifies to a different class. Thus, Rule #8 is confirmed as a clashing rule for Rule #6, and since a clashing rule exists, #6 is not removed using range-based pruning.

Effect on measures of classifier quality In the range-based version, only rules that cover redundant regions (in terms of attribute ranges) are removed; thus, the pruning does not affect the accuracy of the pruned rule list on either training or unseen data. The more relaxed instance-based version determines rule redundancy based on the training instances, which guarantees unchanged accuracy only on the training data.

4 Experiments

In this section, we present an evaluation of the presented rule tuning steps comprising the QCBA framework. The scope of the evaluation, in terms of the number and character of the included datasets and reference baselines, follows the approach taken in related research.

In [42], the results of IDS were compared against seven other algorithms. Of these, five were general machine learning algorithms (such as random forests and decision trees), and two were closely related – BDL [44] and CBA. In [69], the results of SBRL are compared against nine other algorithms; of these, seven were general machine learning algorithms (including C4.5 and RIPPER), and two were closely related (CBA and CMAR).

We perform two types of benchmarks. The first type executes the baseline algorithm and then postprocesses the resulting models with QCBA, comparing the results. In this way, we benchmark CBA, two closely related recent ARC algorithms (IDS and SBRL), two other ARC algorithms (CPAR, CMAR), and two related inductive rule learning algorithms (FOIL2 [53] and the FOIL enhancement PRM [70]). In the second type of benchmark, we do not perform the postprocessing with QCBA. This covers four standard symbolic – intrinsically explainable – learning algorithms (FURIA, RIPPER, PART, C4.5/J48) as well as the state-of-the-art rule learner CORELS.

As for the number of datasets, IDS was evaluated on three proprietary datasets [42], SBRL on seven publicly available datasets [69], and CORELS on three datasets [5]. In our approach, we used 23 open datasets (some algorithms are evaluated only on a subset of these).

4.1 Datasets and setup

The University of California, Irvine, provides a collection of publicly available datasets that are commonly used for benchmarking machine learning algorithms at https://archive.ics.uci.edu. We chose 22 datasets for the main evaluation. The selection criteria were a) at least one numerical predictor attribute, and b) the dataset having previously been used in the evaluation of symbolic learning algorithms in one of the following seminal papers: [4, 35, 46, 52] (ordered by publication date).

Details of the 22 datasets selected for the main benchmark are given in Table 4, where att. denotes the number of attributes, inst. denotes the number of instances, and miss. denotes whether the dataset contains missing observations. As follows from the table, several datasets come from the visual information processing or signal processing domains (ionosphere, letter, segment, sonar). The second most strongly represented is the medical domain (colic, breast-w, diabetes, heart-statlog, lymph). Eleven datasets are binary classification problems, nine datasets are multinomial (more than two classes), and two datasets have an ordinal class attribute (autos and labour).

Table 4 Overview of 22 datasets involved in the main benchmark

In addition, the 23rd dataset (intrusion detection) is used for the scalability analysis in Section 4.6.

Numerical (quantitative) explanatory attributes with three or more distinct values are subject to discretisation using the MDLP algorithm [15]. We use the MDLP implementation wrapped in our arc package. Prediscretised data is used only for association rule classification algorithms (CBA, IDS, and SBRL). The remaining algorithms involved in the benchmark do not require prediscretisation. The evaluation is performed using 10-fold stratified cross-validation. All evaluations used the same folds.

The results are obtained using the open-source CBA and QCBA implementations in our arc and qCBA packages, which we made available in the CRAN repository of the R language. The core of QCBA is implemented in Java 8. For CMAR, CPAR, PRM and FOIL we used the arulesCBA package [30], which is based on the LUCS-KDD Java implementation of these algorithms.

The evaluations were performed on a computer running Ubuntu 20.04 equipped with 32 GB of RAM, a Samsung 960 EVO SSD, and an Intel(R) Core(TM) i5-7440HQ CPU @ 2.80GHz. For the evaluations in Tables 5 and 8, the software was externally restricted to one CPU core. The scalability benchmark (Figs. 6 and 7) was run on an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz with six virtualised cores and 22 GB of RAM.

Table 5 Effect of individual steps – aggregated results for 22 UCI datasets

Metaparameter setup

The CBA algorithm has three main hyperparameters – the minimum confidence threshold, the minimum support threshold, and the total number of candidate rules. In [46], it was recommended to use 50% as the minimum confidence and 1% as the minimum support. For our experiments, we used these thresholds. In [46], the total number of rules used was 80,000; however, it was noted that the performance starts to stabilise at approximately 60,000 rules. According to our experiments (not included in this article), there is virtually no difference between the 80,000 and 50,000 thresholds apart from the higher computation time for the former; therefore, we used 50,000. We also limited the maximum number of literals in the antecedent to 5. For the standalone CBA run, default rule pruning is used. For the other runs, it is not performed within CBA.

4.2 Effects of individual steps

The proposed tuning steps do not have any mandatory thresholds. The extension process contains two numerical parameters, which are left set to their default values: minImprovement= 0 and minCI=-1. These default values are justified in Section 3.5. The effect of varying the latter threshold is studied in Section 4.6.

Evaluation methodology

We evaluate the individual rule tuning steps. As a baseline, we use a CBA run with default parameters corresponding to those recommended in [46] (with small deviations justified above). Classification performance is measured by accuracy, as in several recent studies of ARC classifier performance, such as [5, 65, 69]. All results are reported using tenfold cross-validation with macro averaging. All evaluations used the same folds. The average accuracy over all 22 datasets is reported as an indicative comparison measure. We also include a won-tie-loss matrix, which compares two classifiers by reporting the number of datasets where the reference classifier wins, loses, or the two classifiers perform equally well. For a more reliable comparison, we include the p-value of the Wilcoxon signed-rank test computed on accuracies. This test is preferred over Friedman’s test [8].

We use three metrics to measure the size of the model: the average antecedent length (number of conditions in the rule), the average number of rules per model, and the average number of conditions per model, which is computed as the average number of rules × average antecedent length. These are the most common measures in recent related research [5, 42, 65].

We also include a benchmark indicating the computational requirements of the individual proposed postprocessing steps. The build times reported in Table 5 are computed as an average of the classifier learning time for 220 models (10 folds for each of the 22 datasets). In addition to the absolute run time, which can be volatile across software and hardware platforms, we include the execution time relative to the CBA baseline, which is assigned a score of 1.0.

The CBA baseline includes the discovery of candidate association rules, data coverage pruning, and default rule pruning. It should be noted that the implementation of the tuning steps is not heavily optimised for speed.

Overview of the results

As a first step in determining the effect of the individual tuning steps, we chose the ARC algorithm that would be used to generate the base models to be postprocessed. We chose CBA, which is the most commonly used reference algorithm in related research and is still considered state-of-the-art [42].

Table 5 demonstrates the effect of the individual tuning steps comprising our postprocessing framework when applied to models learned with CBA. The performance is compared with CBA as a baseline.

Configuration #1 corresponds to the refit tuning step that is performed on top of CBA, configuration #2 corresponds to the refit tuning step and the literal pruning, etc. Configurations #6 and #7 correspond to all proposed tuning steps performed (#6 uses instance-based pruning and #7 uses range-based default rule overlap pruning).

The QCBA setup that achieves the maximum reduction in the size of the classifier is configuration #6, which includes all the tuning steps. A compromise between accuracy and size is provided by #5, which compared to #6 excludes the default overlap pruning. As follows from a comparison of #5 and #7, range-based pruning is ineffective (for CBA as a base learner) on this collection of datasets.

The literal pruning step appears redundant when CBA models are postprocessed. However, for the postprocessing of the models generated by the IDS algorithm, this processing step has a positive effect as further elaborated in Section 4.4.

Accuracy vs. AUC We experimented with two ways of computing the confidence of classification, which is the basis for computing the AUC values. In both cases, the comparison of AUC values had a similar outcome to the comparison based on accuracy.

Model size vs. predictive performance

The full QCBA postprocessing (#6) reduces the model size from an average of 235.3 conditions in the original CBA model by 44% to 132.0 conditions, incurring a drop of only 1 percentage point in the average accuracy. QCBA configuration #5 results in a reduction in size of 22%. Additionally, configuration #5 has a better accuracy on 10 datasets, while CBA wins only on 3 datasets, leaving 9 results in a draw. The reduction in model size is due to the removal of rules. The average rule length of the postprocessed CBA models remains the same. However, as we show later, when models produced by IDS and SBRL are postprocessed, the average rule length also drops.

It should be noted that the CBA baseline is generated with default rule pruning, while for postprocessing with QCBA, default rule pruning within CBA is disabled, since QCBA performs default rule pruning as its last pruning step. As a result, the number of rules in QCBA configurations #1 to #4 increases when compared to CBA.

Runtime

The results for the runtime are reported in the last two rows of Table 5. It can be seen that refit, literal pruning and trimming together take less time than learning a CBA model alone. The most computationally intensive operation is the extension. If we look at the median build times, we can observe that the postprocessing takes about as long as building the input CBA model.

Figure 3 shows the breakdown results for individual datasets. In this figure, the size of the hexagonal symbol denotes the size of the model, computed as the number of rules × average number of conditions (smaller is better). The QCBA learning time reported is for postprocessing only. The purple line denotes the number of unique values across all attributes (except the target) in the given dataset and the datasets (on the X-axis) are sorted by this value.

Fig. 3: CBA + QCBA#5 on individual datasets

The longer run time for several datasets in Fig. 3 can be attributed to the extension step. The graph also shows that the impact is largest for datasets with many distinct values or rows, namely the segment, letter and spambase datasets. The segment and letter datasets contain various image metrics, and spambase contains word frequency attributes. Such datasets are not typical representatives of the use cases where interpretable machine learning models are needed. Nevertheless, the evaluation of the runtime indicates that the computational optimisation of the extension algorithm is one of the most important areas for further work.

4.3 Postprocessing SBRL models

The Scalable Bayesian Rule Lists (SBRL) [69] is a recently proposed rule learning algorithm. As with most association rule learning approaches, the SBRL algorithm can process only nominal attributes. In this experiment, we postprocess models generated by the SBRL with the proposed rule tuning steps.

Since SBRL is limited to datasets with binary classes, we process all eleven datasets with binary class labels from Table 4. The R implementation sbrl (CRAN version 1.2) from the SBRL authors was used. The postprocessed models are generated using the QCBA package referenced earlier. For prediction, rules are applied in the order output by QCBA. We evaluated the two best QCBA configurations from the CBA evaluation (QCBA #5 and QCBA #6 from Table 5).

Metaparameter setup

SBRL is run with the following default parameter settings: iters= 30000, pos_sign= 0, neg_sign= 1, rule_minlen= 1, rule_maxlen= 1, eta= 1.0, minsupport_pos= 0.10, lambda= 10.0, minsupport_neg= 0.10, alpha={1,1}, nchain= 10. With this setting, which limits the antecedent length to 1, SBRL produces very simple models with a somewhat lower accuracy. The second evaluated setting for SBRL differs only in the increased maximum rule length, which allows the algorithm to learn more expressive models. The rule_maxlen parameter is set to 10 for all datasets, except those where this setting resulted in an out-of-memory error in the first rule generation phase within SBRL.

Results

Table 6 shows the predictive performance and comprehensibility metrics for SBRL only, the SBRL model postprocessed with all the proposed tuning steps except default rule overlap pruning (SBRL+QCBA#5), and the SBRL model postprocessed with all proposed tuning steps including instance-based default rule overlap pruning (SBRL+QCBA#6). The table shows results for two SBRL configurations: rules of maximum length 1 and rules of maximum length 10 (with the exceptions noted above). The p-value is for the Wilcoxon signed-rank test computed on accuracies. The won/tie/loss record denotes the number of times that the postprocessed model has a better, same, or lower accuracy than the input SBRL model. The model size is computed as the product of the average rule count and the average rule length. The build times for postprocessing do not include the time required to build the input SBRL model. The normalised build time is reported relative to the build time of SBRL only.

Table 6 Postprocessing Scalable Bayesian rule list (SBRL) models

Similar to the study of the effect of individual steps performed for CBA, Table 6 shows that configuration QCBA#6 with all tuning steps enabled generates the smallest models, while configuration #5, with all tuning steps except default rule overlap pruning, provides a balance between accuracy and size.

Configuration QCBA#6 results in a reduction in size, as represented by the average number of conditions in the models, by 33% to 41%, while configuration #5 results in a reduction by 25% to 32% but has higher gains in predictive performance. According to the won/tie/loss record in Table 6, the SBRL models postprocessed by the proposed tuning steps have a higher accuracy on most datasets compared to the SBRL-only models. The postprocessing is most effective on models composed of short rules: there, the improvement is statistically significant (p < 0.05).

Overall, the evaluation shows that postprocessing the SBRL models with the proposed tuning steps results in a reduced model size and an improved accuracy. The computationally most intensive part is the generation of input rules performed within the SBRL. As Table 6 and Fig. 4 show, the additional computational cost of postprocessing the SBRL models is relatively low, even when the SBRL setup involves learning of longer input rules. The reason is that the models on the SBRL output consistently contain a small number of rules, which eases their subsequent tuning. As Fig. 4 shows, there is a dependency between the number of distinct numerical values in the dataset and the learning time.

Fig. 4: The SBRL (long rules)+QCBA#5 on individual datasets

4.4 Postprocessing IDS models

Similar to SBRL and CBA, Interpretable Decision Sets (IDS) [42] is an association rule classification algorithm. In the first step, candidate association rules are generated from frequent itemsets returned by the Apriori algorithm. In the rule optimisation step, a subset of the generated rules is selected to form the final classifier. The rule selection is optimised using the smooth local search (SLS) algorithm [16], which guarantees a near-optimal solution attaining at least 2/5 of the optimal objective value. IDS uses a compound objective function composed of several interpretability subobjectives (minimising the number of rules, rule length and rule overlap) and accuracy subobjectives (maximising precision and recall).

Setup

For evaluation purposes, we use our reimplementation of the IDS algorithm described in [19]. The IDS authors have made a reference implementation available on GitHub; nevertheless, the evaluation reported in [19] shows that the reference implementation is too slow to be applied to the benchmark datasets introduced in Table 4. While the implementation described in our paper [19] is faster, and additional performance gains were made after [19] was published, there are still performance issues. For the optimisation to finish in a reasonable time (within minutes per dataset fold), we limit the input rule list to 50 rules. Rules are mined with a setting similar to that used for CBA. The minimum support threshold is set to 1%, and we also apply a minimum confidence threshold of 50%, which improves the results. The maximum rule length is set to 5 (4 for the spambase dataset). All seven lambda parameters, which in IDS essentially correspond to the accuracy–interpretability trade-off, are set to equal values (1.0).

For prediction on IDS models, rules are sorted by a harmonic mean of support and confidence as specified in [42] and applied in this order; the first firing rule classifies the instance. The rule order for prediction on models postprocessed by QCBA is determined by the order output by QCBA (criteria in Def. 13).
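As an illustration of this prediction scheme, the following sketch orders rules by the harmonic mean of support and confidence and returns the consequent of the first rule whose antecedent matches the instance; the Rule structure and the default class are simplifying assumptions, not the actual IDS implementation.

# Sketch: first-match prediction with rules ordered by the harmonic mean
# of support and confidence (rule structure and default class are assumed).
from dataclasses import dataclass

@dataclass
class Rule:
    antecedent: dict    # attribute -> required value, e.g. {"temp": "[20,30)"}
    consequent: str     # predicted class label
    support: float
    confidence: float

def harmonic_mean(a: float, b: float) -> float:
    return 2 * a * b / (a + b) if a + b > 0 else 0.0

def predict(rules, instance, default_class="majority"):
    ordered = sorted(rules,
                     key=lambda r: harmonic_mean(r.support, r.confidence),
                     reverse=True)
    for rule in ordered:
        if all(instance.get(attr) == val for attr, val in rule.antecedent.items()):
            return rule.consequent   # the first firing rule classifies the instance
    return default_class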

Results

The results reported in Table 7 indicate that the proposed tuning steps consistently reduce the size of models generated by IDS, while also improving predictive performance on the majority of datasets. Each IDS+QCBA configuration is run in two variations: without literal pruning and with it (the default). On average, the accuracy improves by 2% (absolute), and the model size is reduced by 85% (for QCBA#5). Literal pruning makes a modest contribution to this reduction: models built without literal pruning are 35% (QCBA#5) and 36% (QCBA#6) larger than models built with literal pruning (computed from the average number of conditions per model).

Table 7 Postprocessing Interpretable Decision Sets (IDS) models

The Wilcoxon signed-rank test shows a statistically significant improvement over IDS at p < 0.01. The median time required for postprocessing with QCBA is relatively small: about one-tenth of the time needed to learn the IDS models.

An interesting observation, which is not apparent from the results for individual datasets in Fig. 5, is that for four datasets, the QCBA postprocessing generated rule lists composed of only the default rule (in all folds). Such a rule list has accuracy equal to that of the IDS-generated rule list.Footnote 13

Fig. 5 IDS+QCBA#5 on individual datasets

4.5 Postprocessing CPAR, CMAR, FOIL2 and PRM models

In this section, we evaluate QCBA postprocessing against four reference algorithms widely established in the domain of rule learning. Note that some of the baseline algorithms, such as CPAR, combine multiple rules for prediction. As part of the postprocessing, the models were converted to a QCBA model (sorted according to the CBA criteria in Def. 13), and the prediction was based on the first rule in this list that matched the test instance.

Setup

For CMAR, CPAR, PRM and FOIL2, the default metaparameter values from [30] were used.

Results

Table 8 describes the results of postprocessing for the four rule learning algorithms. The results show that QCBA reduces the average rule count for all four algorithms and the average rule length for CMAR and FOIL2. The reductions in the aggregate model size (average number of conditions in the model) range from 22% for PRM to 79% for CMAR. For CPAR, FOIL2 and PRM, the postprocessed models had a better accuracy and won-tie-loss score; the difference is statistically significant (p < 0.05). The best overall result in terms of accuracy and the won-tie-loss record across all evaluated methods is obtained by FOIL2 postprocessed by QCBA.

Table 8 Effect of postprocessing CPAR, CMAR, FOIL2 and PRM (base datasets) by QCBA#5 (Q#5 in the table)

4.6 Head-to-head scalability analysis

In this section, we focus on the effect of increasing the training set size on the performance of the proposed methods. As the input dataset, we use the intrusion detection dataset from KDD Cup’99.Footnote 14 This dataset is suitable for benchmarking since it contains the binary class label required by the SBRL algorithm as well as multiple numerical attributes with a high number of distinct values. We generate subsets of the dataset containing 1,000, 10,000, 20,000, 30,000 and 40,000 rows, with 40,000 being roughly the maximum number of rows for which we are able to obtain results for all implementations included in this benchmark, given the available hardware.

The individual subsets are preprocessed in the same way as in the main evaluation on the 22 UCI datasets reported above. Data is divided into training and testing partitions using stratified tenfold crossvalidation, and the generated partitions are materialised to ensure that all evaluated configurations are compared on exactly the same data.Footnote 15 Numerical attributes are discretised with MDLP.
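A minimal sketch of the partition materialisation, assuming a pandas data frame with the target stored in a "class" column and using scikit-learn's StratifiedKFold; the file naming is illustrative only.

# Sketch: materialising stratified ten-fold cross-validation partitions so that
# every evaluated configuration is compared on exactly the same train/test splits.
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def materialise_folds(df: pd.DataFrame, out_prefix: str, n_splits: int = 10):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for fold, (train_idx, test_idx) in enumerate(skf.split(df, df["class"])):
        df.iloc[train_idx].to_csv(f"{out_prefix}_train{fold}.csv", index=False)
        df.iloc[test_idx].to_csv(f"{out_prefix}_test{fold}.csv", index=False)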

The first analysis focuses on the effect of the training set size on the time required for applying the tuning steps in QCBA. Similar to Table 5, this analysis is conceived as an “ablation study”, where the effect of individual tuning steps on the training time is reported cumulatively. The results are included in Fig. 6. The baseline is the CBA data coverage pruning (without default rule pruning). The reported QCBA times exclude the CBA training time but include all preceding QCBA steps; e.g., the times for trimming also include literal pruning and refitting. Two alternative versions of the extension step are shown, differing in whether conditional accept is enabled (minCI = −1) or not (minCI = 0). The results show that conditional accept substantially increases the run time; there is, however, no effect on accuracy (not shown in the figure). We therefore opted for minCI = 0 for the remaining experiments on the KDD Cup’99 dataset. The additional time added by post-pruning and default rule overlap pruning is small compared to the computational costs of the extension step.

Fig. 6 Ablation study of training time for individual tuning steps in QCBA applied on a CBA model for subsets of the KDD dataset

A subsequent analysis summarised in Fig. 7 focuses on the effect of postprocessing models generated by three baseline algorithms by QCBA (#5, minCI = 0). The settings of the baselines are similar to those in the previous experiments. For the SBRL, the maximum rule length has to be limited to 3; otherwise, the SBRL would not finish.

Fig. 7 Postprocessing on differently sized subsets of the KDD’99 anomaly dataset. The size of the hexagonal symbol denotes the average model size

The results indicate that postprocessing has the largest effect on the IDS, while the best results are obtained for the SBRL. Additionally, for the SBRL, QCBA consistently reduces the model size while maintaining accuracy. In our evaluation, the extension step becomes computationally very intensive at approximately 40,000 rows and 40,000 distinct values. Since the full dataset contains nearly 500,000 rows, the evaluation can be extended in future work to track performance improvements.

4.7 Comparison with related symbolic classifiers

In this section, we compare the predictive performance of CBA postprocessed by QCBA against several additional reference algorithms.

The selection of reference algorithms

C4.5 [51] and RIPPER [11] are well-established interpretable classifiers that are widely used as standard reference algorithms. FURIA [35], also covered in our related work section, is a state-of-the-art rule learning algorithm that outputs fuzzy rules. PART [20] is a rule learning algorithm designed to address some of the shortcomings of C4.5 and RIPPER. Although PART is an older algorithm, it is still often used in benchmarks of recently proposed rule learning methods, such as CORELS, which is also included in our benchmark. As the QCBA baseline, we choose the instance-based version of the default rule overlap pruning (configuration QCBA#5 from Table 5).

Implementations

For standard learners (C4.5, FURIA, PART, RIPPER), we use the implementations available in the Weka framework.Footnote 16 For CORELS, we use the implementation from the authors.Footnote 17

Metaparameter tuning

For C4.5, FURIA, PART, and RIPPER we perform hyperparameter optimisation using the MultiSearch package,Footnote 18 which implements a variation of grid search over multi-parameter spaces. We choose accuracy as the evaluation measure. Parameter tuning is performed separately for each fold on the training data using internal crossvalidation. The best setting found is used for the evaluation of the test data. The grid for individual learners is defined in Table 9.

Table 9 Grid definition for hyperparameter optimisation
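The experiments rely on the Weka MultiSearch package; purely as an analogous illustration, the sketch below shows per-fold tuning with internal cross-validation using scikit-learn's GridSearchCV on a hypothetical decision-tree grid that does not reproduce Table 9.

# Sketch (illustration only): per-fold hyperparameter tuning via internal
# cross-validation, analogous to the MultiSearch setup used in the paper.
# The grid below is hypothetical and does not reproduce Table 9.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

param_grid = {"min_samples_leaf": [1, 2, 4, 8],
              "ccp_alpha": [0.0, 0.001, 0.01]}

def tune_and_evaluate(X_train, y_train, X_test, y_test):
    search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                          param_grid,
                          scoring="accuracy",
                          cv=StratifiedKFold(n_splits=5))
    search.fit(X_train, y_train)                          # internal CV on the training fold
    return search.best_estimator_.score(X_test, y_test)   # evaluate the best setting on the test fold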

For CORELS, it is reported that metaparameter tuning has a limited effect [61]. Similar to CBA, we use the default values. In this case, we use the settings from pycorels: c = 0.01, n_iter = 10000, policy = lower_bound, ablation = NA, max_card = 2, min_support = 0.01. Note that the minimum support is set to the same value as for CBA. For CORELS, we experimented with varying the regularisation threshold c, where higher values should penalise longer rule lists, but this had only a small effect; therefore, the default value was used for the experiments.
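A minimal usage sketch with these settings, assuming the CorelsClassifier interface of the pycorels package, an already binarised feature matrix X and a binary target y; the ablation parameter, reported as NA above, is left at the package default.

# Sketch: running CORELS with the settings listed above, assuming the
# pycorels CorelsClassifier interface; X holds hypothetical binarised
# features (one column per pre-mined condition) and y a binary target.
import numpy as np
from corels import CorelsClassifier

X = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
y = np.array([1, 0, 1, 0])

clf = CorelsClassifier(c=0.01,             # regularisation strength
                       n_iter=10000,       # maximum number of search iterations
                       policy="lower_bound",
                       max_card=2,         # maximum cardinality of antecedents
                       min_support=0.01)   # same minimum support as for CBA
clf.fit(X, y)
print(clf.predict(X))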

Results

The comparison against the additional reference classifiers is presented in Table 10. For CORELS, only the 11 datasets with binary targets are processed. For FURIA, one dataset (“letter”) is omitted due to excessive run time (the algorithm did not finish within several hours). The results show that CBA postprocessed by the proposed tuning steps does not have a statistically significantly better accuracy than the reference classifiers (p < 0.05). However, QCBA wins on more datasets than C4.5 (J48) and RIPPER and has the same number of wins and losses as PART. QCBA outperforms CORELS on all datasets except two; however, it should be noted that CORELS produces extremely small models, typically composed of only a few short rules. The only algorithm that, in a pairwise comparison, performs better than QCBA is FURIA, a rule classifier that extends RIPPER but generates fuzzy rule models. A detailed discussion of the similarities and differences between the proposed approach and FURIA and CORELS is presented in the following section.

Table 10 Comparison between CBA postprocessed by QCBA (#5) and five reference classifiers

4.8 Analysis of wins by dataset

Table 11 shows the results of individual classifiers and their combinations with QCBA per individual dataset. The entries are sorted by the QCBA wins column, which indicates the percentage of results where QCBA has a higher accuracy in the pairwise comparisons (NA results were excluded). The last two columns show the overall best algorithm and its accuracy. Entries where the highest result was obtained by QCBA are in bold. The first part of the table shows on which datasets the postprocessing by QCBA#5 results in an improvement over the seven reference classification algorithms. The second part of the table compares CBA+QCBA#5 with five other reference classification algorithms for which postprocessing by QCBA is not available. Overall, QCBA was better in the majority of pairwise comparisons against the reference methods on 12 datasets. QCBA combined with a baseline method obtained the best overall accuracy on five datasets. With the highest accuracy on two datasets, FOIL2 is the most successful method for combination with QCBA. The overall heterogeneity of the results indicates that the best algorithm or combination of algorithms depends on the dataset.

Table 11 Results by individual datasets. Baseline and reference classifiers are coded as: 1: CBA, 2: CMAR, 3: CPAR, 4: IDS, 5: FOIL, 6: PRM, 7: SBRL, 8: CORELS, 9: J48, 10: PART, 11: RIPPER and 12: FURIA

5 Related work

Separate-and-conquer is possibly the most common approach to rule learning. This strategy finds rules that explain part of the training instances, separates these instances, and iteratively uses the remaining examples to find additional rules until no instances remain [22]. Separate-and-conquer provides a basis, for example, for the seminal RIPPER algorithm, its fuzzy rule extension FURIA [35], or GuideR [59], a state-of-the-art algorithm allowing the user’s preferences to be introduced into the mining process. Numerical attributes in separate-and-conquer approaches are typically supported through selectors (≠, ≤, ≥) and range operators (intervals). Multiple extensions to separate-and-conquer approaches have been proposed; among them, a notable strategy is boosting (the weighted covering strategy) [23, pp. 175, 178].

Association rule classification is a principally different approach where many rules are first generated with fast association rule learning algorithms, and a subset of these rules is then selected to form the final classifier. To our knowledge, all previously proposed association rule classification algorithms support only categorical inputs. However, there has been some work on learning standard association rules or subgroups (as a nugget discovery rather than a classification task) from numerical data. Additionally, quantitative information is processed in fuzzy association rule classification approaches. In the following, we discuss the differences between the new approach to supporting quantitative attributes presented in this article, which is based on postprocessing of already learned ARC models, and existing approaches for quantitative association rule learning (including those derived from formal concept analysis), fuzzy rule classification, and sequential covering (separate and conquer).

5.1 Quantitative association rule mining

Several quantitative association rule learningFootnote 19 algorithms have been proposed (see [1] for a recent review). Two representative and widely referenced approaches are QuantMiner [56] and NAR-Discovery [60]. The earlier proposed QuantMiner is an evolutionary algorithm that optimises a multi-objective fitness function combining support and confidence. The essence of QuantMiner is that multiple seed rules are subject to optimisation using standard evolutionary operators; for example, mutation corresponds to an increase or a decrease in the lower/upper bound of an interval in a rule. NAR-Discovery takes a different, two-stage approach. Similar to QCBA, in the first stage a set of “coarse” association rules is generated on prediscretised data with standard association rule generation algorithms. In the second stage, for each coarse-grained rule, several rules are generated using fine bins. The granularity of the bins is a parameter of the algorithm. A notable property of NAR-Discovery is that it produces at least an order of magnitude more rules than QuantMiner.

Table 12 compares our QCBA framework with NAR-Discovery and QuantMiner. Justifications for individual values in the table follow:

1. Classification models: neither QuantMiner nor NAR-Discovery was designed for classification.

2. Deterministic: QuantMiner is an evolutionary algorithm.

3. Number of rules: generating too many rules is one of the biggest issues facing association rule generation algorithms. Neither NAR-Discovery nor QuantMiner contains procedures for limiting the number of rules, while QCBA contains several rule pruning algorithms.

4. Precision of intervals: for QuantMiner, the precision of the intervals depends on the setting of the evolutionary process, and for NAR-Discovery on the discretisation setting. QCBA generates interval boundaries exactly corresponding to values in the input continuous data.

5. Externally set parameters (in addition to confidence and support thresholds): NAR-Discovery requires two granularity settings (fine/coarse), and QuantMiner requires multiple parameters, such as the population size and the mutation and crossover rates of the evolutionary process, which have a considerable effect on the result. QCBA does not require any externally set parameters, except for several optional parameters that can speed up the algorithm in exchange for a lower accuracy of the generated models.

Table 12 Comparison with methods for quantitative association rule generation

The notion of trimming closely corresponds to the closure operator introduced in [38]. Finally, it should be noted that quantitative association rule learning algorithms address a different task than QCBA does. Unlike QCBA, these are unsupervised algorithms that do not have the class information available. QCBA exploits the class information, among other things, to perform rule pruning, which reduces the number of rules in an informed way.

5.2 Numerical pattern mining with formal concept analysis

One approach in this area is represented by algorithms adapting Formal Concept Analysis to pattern mining on numerical data. In [38], an interval pattern was defined as an M-dimensional vector of intervals that can be represented as a hyperrectangle. The concept of an interval pattern is closely related to our definition of a rule, as visualised in Fig. 1, with the difference that the problem addressed by [38] was unsupervised. The essence of the approach in [38] is that it finds all minimal interval patterns that contain a given set of instances. These interval patterns are called closures in [38]; they essentially correspond to a rule after our trimming operation (Fig. 1f corresponds to a closure). As a dual notion, the same article [38] introduced the notion of a generator, which is the largest region covering the same set of instances. While we do not explicitly work with generators, they are output by QCBA as part of the extension operation (Fig. 1g corresponds to a generator). The primary difference between our approach and [38] is that the latter’s goal was to create all possible closures (generators) within a given dataset, while we use the corresponding notions as components of an approach focused on processing individual, already discovered rules. According to [63], approaches based on Formal Concept Analysis do not scale to large data. As seen from our evaluation results, by focusing on pre-mined rules we have largely (but not completely) circumvented the performance issues this approach can bring.
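To make the correspondence concrete, the following sketch computes, for a set of instances covered by a rule, the tightest enclosing hyperrectangle, i.e., the closure-like interval pattern that the trimming operation produces; the data are illustrative.

# Sketch: the tightest interval pattern (hyperrectangle) enclosing a set of
# covered instances, corresponding to a closure / trimmed rule body.
import numpy as np

covered = np.array([[14.0, 1.2],      # hypothetical instances covered by a rule,
                    [17.5, 0.9],      # columns = numerical attributes
                    [16.0, 1.0]])

closure = list(zip(covered.min(axis=0), covered.max(axis=0)))
# e.g. [(14.0, 17.5), (0.9, 1.2)] -> literals 14.0 <= a1 <= 17.5 and 0.9 <= a2 <= 1.2
print(closure)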

5.3 Subgroup discovery

Subgroup discovery is an alternative method for pattern mining. Similar to association rule mining, subgroup discovery has developed a range of methods for dealing with numerical inputs, including the discretisation of input data as well as various heuristic approaches. A recent advance in this field is the RefineAndMine algorithm presented in [7]. This algorithm has the anytime property, which means that its search converges to an exhaustive search if given sufficient time; however, at any time before this, a solution is available. Its main principle is somewhat similar to that of the NAR-Discovery quantitative association rule learning algorithm described earlier: it first performs a coarse discretisation, which is iteratively refined as patterns are mined. However, unlike NAR-Discovery, RefineAndMine provides guarantees on the remaining distance to the exhaustive search. Mining for subgroups with numerical target attributes (also with guarantees) was addressed in [43]. These approaches are conceptually different from ours in two respects. First, they are primarily nugget-mining algorithms intended for the discovery of a diverse set of patterns, not for generating concise classifiers. Second, they are “all-in-one” algorithms, while QCBA aims at postprocessing models generated by existing approaches.

5.4 Fuzzy approaches

Several ARC approaches adopt fuzzy logic to deal with numerical attributes. A notable representative of this class of algorithms is FARC-HD, and its evolved version FARC-HD-OVO [14]. In the following, we will focus on the FURIA algorithm [35]. This is not an ARC algorithm but is conceptually closer to our approach and is frequently referenced as the state-of-the-art in terms of the accuracy a rule learning algorithm can achieve [6, 50]. In our benchmark, FURIA obtained the best results.

FURIA postprocesses rules generated by the RIPPER algorithm for induction of crisp rules. RIPPER produces ordered rule lists, and the first matching rule in the list is used for classification. A default rule is added to the end of the list by RIPPER, ensuring that a query instance is always covered. As is typical for fuzzy approaches, FURIA outputs an unordered rule set, where multiple rules may need to be combined to obtain the classification.

A byproduct of the transition to a rule set performed by FURIA is the removal of the default rule. To ensure coverage of each instance, FURIA implements rule stretching, which stores multiple generalisations (i.e., with one or more literals in the antecedent omitted) of each rule and uses these for classification.

The most important element in FURIA is the fuzzification of the input rules. The original intervals on quantitative attributes in rules produced by RIPPER are used as the lower and upper bounds of the core \([\phi_{i}^{c,L},\phi_{i}^{c,U}]\), and FURIA determines the optimal lower and upper bounds of the fuzzy support \([\phi_{i}^{s,L},\phi_{i}^{s,U}]\). It should be noted that in fuzzy set theory, the support bound denotes an “upper” rule boundary, while in the scope of association rule learning it denotes the number or fraction of correctly classified instances. When searching for \(\phi_{i}^{s,L}\) and \(\phi_{i}^{s,U}\), FURIA proceeds greedily; it evaluates fuzzifications for all antecedents and fuzzifies the one that produces the best purity. As fuzzification progresses, the training instances are gradually removed; therefore, the order in which the rules are fuzzified is important.
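As an illustration only (not FURIA's actual implementation), a trapezoidal membership function with core [c_lo, c_hi] and fuzzy support [s_lo, s_hi] can be sketched as follows.

# Sketch: trapezoidal membership for one fuzzified interval literal with
# core [c_lo, c_hi] (membership 1) and fuzzy support [s_lo, s_hi] (membership > 0).
def trapezoidal_membership(x, s_lo, c_lo, c_hi, s_hi):
    if x < s_lo or x > s_hi:
        return 0.0
    if c_lo <= x <= c_hi:
        return 1.0
    if x < c_lo:                       # rising edge between s_lo and c_lo
        return (x - s_lo) / (c_lo - s_lo)
    return (s_hi - x) / (s_hi - c_hi)  # falling edge between c_hi and s_hi

# e.g. a crisp interval [20, 30] fuzzified to the support [15, 35]
print(trapezoidal_membership(17.5, 15, 20, 30, 35))  # -> 0.5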

Comparing FURIA with QCBA, both algorithms postprocess input rules, adjusting the boundaries of intervals on numerical attributes; beyond that, there are numerous differences. In QCBA, we retain the default rule (although it may be updated); therefore, even if no other rule matches, a model postprocessed with QCBA is able to provide a classification. A procedure somewhat similar to fuzzification in FURIA is the QCBA extension; unlike in FURIA, rules are extended independently of one another, which eases parallelisation. In summary, FURIA produces fuzzy rules and the resulting models are rule sets, while QCBA produces rule lists composed of crisp rules.

The results for another state-of-the-art algorithm, FARC-HD, show that this fuzzy rule learning approach is much slower than crisp rule learning approaches; in reference to Table 2, FARC-HD was on average more than 100x slower than CBA [14]. The two other fuzzy associative classifiers (LAFAR [34] and CFAR [10]) included in the benchmark in [14] are reported to have even more severe scalability problems, as they could not be run on all datasets in the benchmark.

Regarding explainability (interpretability), we were unable to find a study evaluating the interpretability of automatically learned fuzzy rule classifiers. In principle, however, fuzzy rule algorithms generate rule sets, as multiple rules are combined using membership functions to classify an instance, while models generated by QCBA are intended to be used as rule lists, with only the highest-ranked matching rule being used to classify an instance. Rule lists and rule sets have clearly different properties in terms of interpretability, and future research will show for what purposes each representation is most suitable.

5.5 Non-fuzzy ARC algorithms

In the following, we focus on the comparison of CBA (and QCBA) with CORELS [5], which is a state-of-the-art algorithm belonging to the ARC family. We describe how CORELS works, compare it with CBA and QCBA, discuss its performance in our experiments, and finally provide a scalability comparison.

CORELS - certifiably optimal rule lists

The CORELS algorithm supports only binary class datasets, which limits the comparison reported in our evaluation to datasets with a binary class attribute. Similar to QCBA, CORELS processes rules generated by standard association rule mining algorithms. CORELS uses a branch-and-bound method to choose a subset of the rule antecedents, transforming them into a rule list. CORELS guarantees that, considering only the input rule list, the generated classifier (rule list) is optimal, where ‘optimality’ is defined through a compound objective function R consisting of the misclassification error on the training data (lower is better) and a regularisation term derived from the number of rules in the rule list (lower is better). When the regularisation strength is set to zero, CORELS guarantees that the produced rule list has the highest accuracy of all rule lists that can be generated by selecting antecedents from the input list of pre-mined rules.
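Following this description and the notation of [5], the objective can be written as

\[ R(d, \mathbf{x}, \mathbf{y}) = \operatorname{misc}(d, \mathbf{x}, \mathbf{y}) + \lambda \, |d|, \]

where \(\operatorname{misc}\) denotes the misclassification error of the rule list \(d\) on the training data \((\mathbf{x}, \mathbf{y})\), \(|d|\) is the number of rules in \(d\), and \(\lambda\) is the regularisation strength (the parameter c in the implementation used in Section 4.7).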

Comparison of CORELS and CBA

For rule pruning and sorting, QCBA uses the CBA algorithm, which is based on a combination of heuristics. Rules are first sorted by their confidence, support and length and then considered in this order for addition to the final rule list. In contrast, CORELS does not use these sorting heuristics. Instead, it performs a branch-and-bound search. Starting with an empty rule list, it iteratively finds the best rule (generated from the list of antecedents of pre-generated rules) to be added to the final rule list. CORELS exploits several properties to effectively search this very large state space. The key property is called the hierarchical lower bound by the CORELS authors. If the currently considered rule list (without a default rule) dp has the same or higher misclassification error than the objective value R of the best rule list generated thus far, then the rule list dp and all extensions of it generated by appending rules at the end of dp can be omitted from the search. The hierarchical lower bound property is very close to the default rule pruning already used in CBA (see Section 2.3).
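For concreteness, the confidence-support-length ordering described above can be sketched as a Python sort key over pre-mined rules; the rule values below are hypothetical, and further tie-breaking is omitted.

# Sketch: the confidence-support-length rule ordering described above,
# expressed as a Python sort key (tie-breaking beyond length is omitted).
from collections import namedtuple

Rule = namedtuple("Rule", "antecedent consequent support confidence")

rules = [Rule(("a=1", "b=2"), "yes", 0.10, 0.90),   # hypothetical rules
         Rule(("a=1",),       "no",  0.20, 0.90),
         Rule(("c=3",),       "yes", 0.15, 0.95)]

# higher confidence first, then higher support, then shorter antecedent
ordered = sorted(rules, key=lambda r: (-r.confidence, -r.support, len(r.antecedent)))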

Which element of the CORELS algorithm is responsible for producing much shorter rule lists than CBA is a matter for future research. We hypothesise that the main factor may be the replacement of the ‘confidence-support-rule length’ sorting heuristic (see Definition 13) used by CBA with the branch-and-bound method in CORELS. This is supported by prior research, which shows that this sorting step in CBA could be made more efficient [67].

Comparison of CORELS and QCBA

CORELS treats the individual rule antecedents generated by rule mining as atomic (unmodifiable) components. Since CORELS is intended only for categorical data, its application to datasets with numerical attributes requires prediscretisation. In contrast, QCBA primarily focuses on the optimisation of the individual candidate rules, editing both the rules and the value sets of individual literals.

Why does CORELS – despite its optimality guarantee – underperform other rule learners?

The results that we obtain for CORELS are somewhat unexpected, but they are congruent with a detailed evaluation of CORELS performed in [61], where CORELS was also outperformed by both PART and RIPPER while producing the smallest models. It should be noted that this difference cannot be accounted for by the user-set bias towards concise models allowed by CORELS, since setting the regularisation penalty to zero did not consistently increase the accuracy of the CORELS models.

As follows from the principles of the CORELS operation laid out above, we hypothesise that the reason why rule lists generated by other rule learning algorithms often had a higher accuracy than the “certifiably optimal” CORELS rule lists is that the CORELS guarantee of optimality applies only to the original candidate set of antecedents pre-mined with association rule learning from prediscretised data. This applies not only to PART and RIPPER, which do not rely on pre-mined rules, but also to QCBA, which uses pre-mined rules yet is able to edit them, whereas they remain atomic for CORELS. Thus, QCBA may work with “better” rules than are available to CORELS for selection into the final rule list.

5.6 Separate-and-conquer and rule ensembles

We expect that state-of-the-art rule ensemble algorithms derived from RIPPER would outperform algorithms generating rule lists, even after the latter are postprocessed by QCBA. For example, a comparison of the results published for ENDER [12] with our results on several UCI datasets indicates that ENDER has a higher predictive accuracy on most datasets. However, there is an interpretability-accuracy trade-off: such rule sets are arguably less interpretable, since they are produced by ensemble algorithms, which are generally considered problematic from the explainability perspective [58]. Furthermore, some algorithms producing rule ensembles tend to generate larger models (the rule count for CPAR in Table 2 is 5x higher than the rule count of CBA).

Scalability comparison: ARC algorithms vs. separate-and-conquer

Association rule learning, which is the crucial computationally intensive step in ARC algorithms, is widely considered a highly scalable approach for large and sparse data (e.g., [33]). This is confirmed by a benchmark reported in [41] comparing the performance of RIPPER (as a separate-and-conquer algorithm) and CBA on a larger dataset with a high number of distinct values, which shows that, unlike CBA, RIPPER is unable to process the complete dataset. In a benchmark of rule learning algorithms included in [66], RIPPER was the only algorithm unable to finish on all datasets, due to an out-of-memory problem on the connect-4 dataset with 67,000 instances and 42 attributes. This indicates that while the scalability of association rule learning may face considerable challenges, especially on datasets with a high number of unique values, the ARC approach as a whole shows promise and can outperform established separate-and-conquer rule learning algorithms such as RIPPER.

6 Limitations

In this section, we first discuss the general disadvantages of building classifiers from pre-mined rules. After that, we proceed to discuss the known trade-offs in the proposed QCBA tuning steps.

Building classifiers from pre-mined rules

ARC algorithms rely on association rule mining for the provision of candidate rules (antecedents). This can miss some rules that could improve the quality of the classifier. One of the reasons is that association rule mining generates only rules that exceed globally valid, prespecified confidence and support thresholds. This is, for some purposes, a desirable property, since rule confidence and support are computed from all instances in the dataset, and thus the generated rules are valid independently of each other. However, once placed into a rule list, these independently learned rules cease to be independent: in a rule list, only instances not covered by rules with higher precedence reach a particular rule. The generation of candidates through association rule learning on all instances using globally set thresholds therefore does not match the way rules are eventually used in the classifier. Since ARC algorithms such as CORELS and CBA use these independently learned rules, the resulting rule lists can be suboptimal when all possible rules are considered.

Order of processed literals

On line 2 in Procedure 6 (Extension), the literals are considered for extension in the order of their appearance in the rule. A similar approach is taken in the literal pruning algorithm, where the order of processed literals can also affect the result. Future work could explore an alternative, fully greedy approach, where all literals are first evaluated and the best one is then selected for processing (see the sketch below). Such a change should be made carefully to avoid adverse effects on the run time of the extension algorithm, which the benchmarks already show to be the most time-consuming step. For literal pruning, the expected benefits of a fully greedy approach seem limited, since in our experience it is rarely the case that more than one literal can be removed from a rule.
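A minimal sketch of the two strategies, assuming a hypothetical score function that estimates the benefit of processing a given literal; it contrasts the current in-order processing with the fully greedy variant discussed above.

# Sketch: in-order vs. fully greedy literal processing (the score and process
# callables are hypothetical placeholders for the evaluation and tuning logic).
def process_in_order(literals, process):
    # current behaviour: literals handled in their order of appearance in the rule
    for lit in literals:
        process(lit)

def process_greedily(literals, score, process):
    # alternative: evaluate all remaining literals, handle the best-scoring one first
    remaining = list(literals)
    while remaining:
        best = max(remaining, key=score)
        process(best)
        remaining.remove(best)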

Selection and ordering of rules in the post-pruning step

For the post-pruning step, QCBA adopts the pruning algorithm from CBA. This is fast, but it does not provide optimal results, as it uses a heuristic to sort the rules. A possible solution would be to combine QCBA with CORELS, which provides a guarantee that the rule list is optimal with respect to the user-set trade-off between accuracy and the number of rules. Since CORELS operates on prediscretised data, postprocessing CORELS models by QCBA to “edit” the pre-mined rules could yield an additional reduction in the size of the generated models or improvements in their predictive performance. A combination of QCBA and CORELS could take advantage of the modular architecture of QCBA, where individual tuning steps can be performed independently.

QCBA default rule pruning step

As follows from the benchmarks included in the previous section, neither version of default rule pruning constitutes a performance impediment. This speed comes at the expense of excluding several checks that could remove additional redundant rules. In the original Default Rule Overlap Pruning algorithm (Proc. 9, line 3), the set \(T_{r}^{corr}\) includes all instances in the training data correctly classified by the pruning candidate r. However, instances covered by rules with higher precedence than r could be excluded from \(T_{r}^{corr}\). Additional checks could also be introduced to ensure that the set of instances covered by each of the candidate clashes considers only instances reaching the respective candidate clash rule. Similar adjustments can be made to the range-based version of the algorithm.

7 Conclusion

This research aims to ameliorate one of the substantial limitations of association rule classification: the adherence of the rules composing the classifier to the multidimensional grid created by the discretisation of numerical attributes. Quantitative classification based on associations (QCBA) is, to the authors’ knowledge, the first non-fuzzy association rule classification approach that recovers part of the information lost in prediscretisation. QCBA is not a standalone learning algorithm but rather a palette of postprocessing steps, from which some can be selected to enhance the results of an arbitrary rule learning algorithm.

Prior research shows that learning algorithms that work directly with numerical attributes often do not have a better predictive performance than algorithms for which discretisation is performed as part of preprocessing [9, 26]. Our results obtained for classification rules are partly in line with these conclusions, which were obtained for decision trees and probabilistic approaches.

As part of the evaluation, we postprocessed models generated by CBA, CMAR, CPAR, IDS, FOIL2, PRM and SBRL and compared the results of the postprocessing with the baseline versions. There was an improvement in predictive accuracy or the won-tie-loss record for all baseline methods except CMAR. While the effect on predictive accuracy was typically relatively small, larger and consistent improvements were observed for model size. For example, baseline CMAR generated by far the largest models, which the QCBA algorithm reduced by nearly 80% without negatively affecting the predictive performance.

An interesting area of future work would be combining QCBA with a broader range of base rule learning algorithms and problems, such as multilabel classification [36, 54]. The incorporation of a wider range of rule quality measures could also have a positive impact on predictive performance [68]. Real-world datasets typically contain both numerical and categorical attributes, while the presented work addresses only quantitative attributes. There is a complementary line of research on merging the categorical attributes by creating multi-value rule sets [65], with very promising results in terms of gains in accuracy and comprehensibility. Since QCBA has a modular architecture, some of its tuning steps can become building blocks in a combined approach that would generate small, yet accurate models on data containing mixed attribute types.