1 Introduction

In the wake of the success of machine learning algorithms that learn inscrutable black-box models—most notably deep neural networks—explainable and interpretable models have regained importance. A popular line of work, pioneered by the algorithms Lime (Ribeiro et al., 2016) and Shap (Lundberg & Lee, 2017), is concerned with finding local white-box explanations that approximate the learned global black-box model in a given neighborhood of an example. Lore (Guidotti et al., 2018) is such an approach, with a particular focus on rule-based explanations.

However, there is also considerable criticism of this approach. Most notably, Rudin (2019) argues that instead of devoting more effort to explaining black-box models, it might be preferable to focus on improving algorithms that learn white-box models in the first place, most notably rule learning algorithms. Consequently, several new rule learning algorithms have recently emerged, which we briefly review in Sect. 2.4. In fact, as the extracted local white-box models are typically not perfectly aligned with the underlying black-box models, not even in their defined local neighborhood, it is not entirely clear that such approximations can serve as valid explanations. Algorithms such as GLocalX (Setzu et al., 2021) or TreeExplainer (Lundberg et al., 2019) aim at constructing global white-box models out of local explanations, but these models are typically less accurate than their underlying black-box models, and the question of whether they outperform white-box models that have been directly learned from the data is, in our opinion, still open.

In this paper, we argue that one of the key insights resulting from local explanation algorithms like Lore is that each example gets its own individual explanation. This property has so far hardly been exploited in predictive rule learning, which typically tries to find a concise set of rules that explains the examples with as few rules as possible. A notable exception is the Harmony algorithm (Wang & Karypis, 2006), which takes an instance-centric view in that it aims at finding the best classification rule for each example. Moreover, historically, the first rule learning algorithms were example-based: AQ (Michalski, 1969, 1973) selected a random example and found the best rule covering this example. However, mostly for computational reasons, this was not repeated for every possible example; instead, all examples that could be explained by the same rule were removed before the selection of the next example. CN2 (Clark & Niblett, 1989) was among the first algorithms that explicitly changed this strategy, from finding the best rule for a given example to finding a rule that explains as many examples as possible. Note that for any given rule evaluation measure, the rule found by such a strategy is not necessarily the best rule for any of the examples it covers.

Motivated by these observations, we propose a rule learning algorithm, named Lord (Locally Optimal Rules Discoverer), that attempts to find the best rule for every training example. It does so very efficiently, using N-lists (Deng & Lv, 2015), a state-of-the-art data structure for frequent itemset mining. From this data structure, the best rule covering each training example is extracted using greedy search. The found locally optimal rules are collected and filtered into a rule-based classifier, and new examples are classified by selecting the best of the covering rules in the classifier. While the found rule may still only be a local optimum for covering an example, we nevertheless claim that the objective of finding the best rule for each example already makes a difference in comparison to the common goal of finding a simple rule that is good for as many examples as possible, even if the found rule is not necessarily the globally best one. Our experiments demonstrate that the algorithm compares well to state-of-the-art rule learning algorithms, not only in terms of accuracy, but also with respect to efficiency, which we demonstrate by evaluating it on very large datasets.

The remainder of the paper is organized as follows. Section 2 provides brief foundations and a survey of recent work on rule learning. Section 3 presents our approach, which is followed by experiments and discussions in Sect. 4 and conclusions in Sect. 5.

2 Locally optimal rules

In this section, we briefly recall the foundations of rule learning and our notational conventions (Sect. 2.1), as well as the basic principles of classic rule learning algorithms such as AQ and CN2 (Sect. 2.2), which serve as the basis for the key idea behind our approach, introduced in Sect. 2.3. The section concludes with a brief review of relevant work in inductive rule learning (Sect. 2.4), before we turn our attention to our efficient implementation of the Lord algorithm in the next section.

2.1 Problem definition and notational conventions

The problem of rule learning assumes a number of labeled training examples \(E = \{e_1, \dots , e_n\} = \{\langle \textbf{x}_1,y_1\rangle , \dots , \langle \textbf{x}_n,y_n\rangle \}\). Each training example \(e_i = \langle \textbf{x}_i,y_i\rangle\) consists of an instance \(\textbf{x}_i\) with its corresponding label \(y_i \in C\), where C is a nominal class attribute. The instances \(\textbf{x}_i\) are characterized by a set F of k binary features, i.e., \(\textbf{x}_i = \langle f_{i,1}, \dots , f_{i,k} \rangle \in \{0,1\}^k\).

A rule r maps a subset \(F_r \subset F\) of the features to a label \(y \in C\), i.e., it has the form

$$\begin{aligned} r: F_r \rightarrow y \end{aligned}$$
(1)

with the semantics that every example \(\textbf{x}\) for which the features in \(F_r \subset F\) (the body of the rule) are present (i.e., have the value 1) should be assigned the label y.

Definition 1

(Rule Coverage) A rule r is said to cover an example \(\textbf{x}\) iff \(F_r \subset F_\textbf{x}\), where \(F_r \subset F\) are the conditions in the body of the rule, and \(F_\textbf{x} \subset F\) are the features of example \(\textbf{x}\). We denote the set of examples in E that are covered by the rule r with \(E_r \subset E\).

The learning problem consists of finding a set of rules R, which can be used to assign the correct labels to new, unseen data. Examples assigned to the correct class are called true positives and those assigned to a different class false positives.

In the following, we will assume tabular datasets that are based on a set of m categorical attributes \(A_i\), \(i \in \{1, \dots , m\}\), each of which has a fixed set of possible values \(a_{ij}\), \(j \in \{1, \dots , |A_i|\}\). Note that this assumption is not crucial for the key idea of locally optimal rule induction, but it will be exploited by the efficient implementation introduced in Sect. 3. The set of features is then defined via all possible selectors.

Definition 2

(Selector) A selector \(s_{ij}\) is a single condition, represented by \(A_i = a_{ij}\), which selects from the input dataset the examples (data rows) having value \(a_{ij}\) for attribute \(A_i\).

The term selector goes back to Michalski (1973, 1983), who used it as a generalization of any type of comparison of an attribute value with a constant. It can be thought of as a binary feature, similar to the notion of an item in frequent pattern mining, but maintaining the association to its defining attribute \(A_i\). In the following, we will nevertheless often use these two terms interchangeably.

For simplicity, we ignore missing values, i.e., no selector will cover such a value; other, more elaborate techniques are possible (Wohlrab & Fürnkranz, 2011). Numerical attributes can be handled by discretization (García et al., 2013).

2.2 Learning rule sets

Covering (aka separate-and-conquer) rule learning algorithms learn one rule at a time by searching for a rule that optimizes some quality criterion. When a new rule is found, all examples that satisfy the conditions in the body of the rule are removed from the training set, and rule learning continues with another rule until all training examples are covered or the given stopping criteria are met. The algorithms in this family differ mainly in the way a new rule is found, which crucially depends on the search strategy (e.g., hill-climbing, beam search, or exhaustive search) and the heuristic criterion h(.) that is used for evaluating rules (Fürnkranz, 1999).

[Algorithm 1: Abstract pseudo-code of the AQ-style covering algorithm; the line numbers referenced in the text refer to this algorithm]

Historically, AQ (Michalski, 1969) can be considered the ancestor of this family of algorithms. It proceeds by selecting a random example that is not yet covered by any of the previously found rules, and searching the space of all generalizations of this example with a top-down beam search for the best rule. An abstract pseudo-code of the general idea behind AQ is shown in Algorithm 1. Note that line 5 aims at finding the rule that optimizes the heuristic function h(.) for a randomly selected seed example. However, once such an optimal rule has been found for the example \(\textbf{x}\), all examples covered by this rule will also be classified using this rule, and are thus removed from the training set (line 7). Note that for any such example \(\textbf{x}' \ne \textbf{x}\), a better rule than r may exist, but will not be searched for unless it also happens to cover \(\textbf{x}\). In summary, AQ strives to find optimal rules for some of the given training examples, and uses those for classifying all other examples to which they apply, regardless of whether they are the optimal rules for these examples or not.
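To make this control flow concrete, the following is a minimal Python sketch of the covering loop just described; it is a paraphrase of Algorithm 1, not the original pseudo-code, and it assumes hypothetical rule objects with a covers() method and a function best_rule_for that performs the search for a given seed example.

```python
import random

# Minimal sketch of the AQ-style covering loop (cf. Algorithm 1), assuming a
# hypothetical best_rule_for(seed, examples, h) that searches the space of
# generalizations of the seed for the rule maximizing the heuristic h.
def aq_covering(examples, h, best_rule_for):
    rules = []
    remaining = list(examples)
    while remaining:
        seed = random.choice(remaining)           # random, still uncovered seed
        rule = best_rule_for(seed, remaining, h)  # best rule covering the seed
        rules.append(rule)
        # Remove ALL examples covered by the rule, not only the seed: for these
        # examples, no dedicated rule search will ever be performed.
        remaining = [e for e in remaining if not rule.covers(e)]
    return rules
```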

CN2 (Clark & Niblett, 1989) took this one step further by combining the covering loop of AQ with ideas from decision tree learning, which are optimized collectively over all examples. The resulting algorithm is sketched as Algorithm 2. Instead of optimizing rules for individual examples, the algorithm strives for finding the best overall rule for the current set of training examples. The crucial difference is that no seed examples are selected (line 4 of Algorithm 1), and the best rule is searched over all possible rules that can be formed from all possible features and all possible classes (compare line 4 of Algorithm 2 to line 5 of Algorithm 1). Again, the details of the algorithm differ (e.g., later many of its successors optimize rules for each individual class \(y\in C\) instead of optimizing over all classes, and the search for the best rule is in many cases greedy), but the key idea is to find a generally good rule, as opposed to finding the optimal rule for each individual example.

[Algorithm 2: Abstract pseudo-code of the CN2-style covering algorithm]
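For contrast, a sketch of the CN2-style loop under the same assumptions as above; note that no seed example is drawn, and the hypothetical search_best_rule optimizes over all rules that can be formed for the current training set.

```python
# Sketch of the CN2-style covering loop (cf. Algorithm 2): the best rule is
# optimized over the whole remaining training set instead of a seed example.
def cn2_covering(examples, h, search_best_rule, stopping_criterion):
    rules = []
    remaining = list(examples)
    while remaining and not stopping_criterion(rules, remaining):
        rule = search_best_rule(remaining, h)     # best overall rule, no seed
        rules.append(rule)
        remaining = [e for e in remaining if not rule.covers(e)]
    return rules
```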

This strategy is essentially still in use by many state-of-the-art rule learning algorithms. Most notably, the well-known Ripper rule learning algorithm (Cohen, 1995) follows this strategy with some important enhancements. In particular, it does not solely rely on the choice of a suitable heuristic for fighting overfitting, but employs additional pruning and optimization techniques. Inspired by incremental reduced error pruning (Fürnkranz & Widmer, 1994), Ripper effectively deals with the overfitting problem by simplifying rules on a separate pruning set, and with additional post-processing loops for optimizing a rule set. In particular, the key idea is to examine whether to replace a rule from the previously learned rule set with a revised one, which is formed by a growing and then a pruning phase aiming at reducing the error of the entire rule set. Ripper can still be considered the state of the art in inductive rule learning, and is hard to beat in both predictive accuracy and the simplicity of the learned rule sets.

In general, the drawback of this family of techniques is that later rules are found based on gradually reduced parts of the training set, so that these rules are discovered from insufficient statistical information. Moreover, the inherently sequential rule search makes it harder for these techniques to tackle big datasets, which usually need to be processed in parallel.

2.3 Locally optimal rule learning

As we have seen in the previous section, common rule learning algorithms strive to find rules that classify many examples well, as opposed to finding rules that are optimal for each individual example. Locally optimal rule discovery, as proposed in this paper, aims at solving this problem by reverting to the basic AQ algorithm. But instead of being satisfied with finding the best rule for some of the examples, we compose an ensemble of rules by finding the best rule for each training example. The resulting basic idea is sketched in Algorithm 3.

[Algorithm 3: Abstract pseudo-code of locally optimal rule learning]

Note that there is no covering or removal of examples (as in line 7 of Algorithm 1 or line 6 of Algorithm 2), and no selection of seed examples (as in line 4 of Algorithm 1). Instead, one optimal rule is learned for each individual example. The resulting rule set thus, in principle, consists of rules that are each locally optimal for one of the training examples.
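In the same sketch style as above, the basic idea reduces to a single loop; duplicate rules collapse because a set is used, and best_rule_for is again a hypothetical placeholder for the greedy search introduced in Sect. 3.

```python
# Sketch of the basic idea behind Lord (cf. Algorithm 3): one locally optimal
# rule per training example, with no covering and no removal of examples.
def locally_optimal_rules(examples, h, best_rule_for):
    rules = set()                       # a duplicate rule is added only once
    for e in examples:
        rules.add(best_rule_for(e, examples, h))
    return rules
```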

This basic idea has some obvious disadvantages. Most notably, it seems very inefficient to search for the best rule for each example. To that end, we will propose an efficient search technique, based on ideas from association rule discovery, which greedily finds a best rule for each example in very much the same way as conventional rule learners like CN2 or Ripper, but, by exploiting efficient data structures, can strive to find the optimal rule for every example instead of for only a small subset of the examples. Our experiments will demonstrate that the resulting algorithm is able to deal with very large datasets, which cause problems for state-of-the-art rule learning algorithms.

The other key problem is that the resulting rule set will be considerably larger than the rule sets found by conventional rule learning algorithms. We note in passing that the size of the rule set is not necessarily the same as the size of the example set, because the same rule may be optimal for multiple examples, and will consequently only be added once to the set. We will also introduce some mild pruning techniques for further reducing the size of the rule set. Nevertheless, the interpretability of the remaining, large rule sets is still problematic. However, we argue that this does not hold for the individual rules. Unlike, e.g., large rule sets derived from random forests, our rule sets consist of locally optimal rules, which are reminiscent of the locally optimal explanations that have recently been proposed in explainable machine learning. Algorithms such as Lime (Ribeiro et al., 2016) or Shap (Lundberg & Lee, 2017) strive to find local white-box explanations that approximate a learned global black-box model in a given neighborhood of an example. In particular, Lore (Guidotti et al., 2018) learns rule-based explanations from the data. While each of the found rules can be used for explaining a single example, collectively these explanations cannot be considered an interpretable, global explanation for the domain, and substantial efforts are required to extract an interpretable global rule-based model from local rule-based explanations (Setzu et al., 2021). While we do not explicitly address the issue of interpretability in this paper, our approach was motivated by these algorithms. The meaning of the entire rule set will be hard to grasp, but the individual rules may serve as explanations for the examples for which they are optimal.

2.4 Related work

In this section, we briefly review additional work in inductive rule learning and relate it to our approach. It is not necessary for understanding the key contribution of this paper; the quick reader may safely skip forward to Sect. 3, where we introduce our efficient implementation of the Lord algorithm.

Rule-based methods are among the most popular classes of techniques in data mining and machine learning (Fürnkranz et al., 2012). They can generally be categorized into two types, descriptive and predictive rule learning, based on the intended use of the discovered rules. Descriptive rule learning aims at discovering patterns that capture relations between a target variable and a set of explaining variables, known as subgroup discovery (Klösgen, 1996; Atzmüller, 2015), or co-occurrences of sets of items with other sets of items, known as association rule discovery (Agrawal et al., 1993; Hipp et al., 2000). Predictive rule learning tries to generalize the training data into a collection of rules that can make predictions for new examples. While descriptive rule learning aims at the statistical validity of the found rules, predictive rule learning focuses on predictive performance. Our method belongs to the category of predictive rule learning, which can be divided into two main families: rule set construction, where rules are incrementally added to a target theory, and rule set selection, where first a large set of rules is learned, which is then filtered into a small set of classification rules.

2.4.1 Covering algorithms

Covering algorithms, aka separate-and-conquer learning, construct a rule set by learning one rule at a time, removing all covered examples, and repeating until all examples are covered. The AQ, CN2, and Ripper algorithms discussed above are their main proponents, but the family of algorithms is very large (Fürnkranz, 1999). We note that the covering approach may be viewed as a special case of additive learning algorithms such as boosting, where the weights of the examples are not restricted to be 1 (uncovered) or 0 (covered), but can take arbitrary values. This approach was pioneered by algorithms such as LRI (Weiss & Indurkhya, 2000) and Slipper (Cohen & Singer, 1999); the general framework of gradient boosting for rule learning was most clearly defined in Ender (Dembczyński et al., 2010). Recent additions to this family include Boomer (Rapp et al., 2020), which generalizes this approach to learning multi-label rules, and the algorithm of Boley et al. (2021), which replaced the greedy search for the best addition to the rule set with an efficient exhaustive search.

2.4.2 Associative classification

Most prominent among the rule set selection techniques is associative classification (see, e.g., Liu et al., 1998, 2000; Li et al., 2001; Yin & Han, 2003). These algorithms generally search for association rules with target classes in their heads, which are then filtered by certain pruning conditions or heuristic measures to form a predictive rule set for classification. Note that this filtering is often essentially equivalent to a covering loop. For example, CBA sorts all rules according to a heuristic h(.), and then selects one rule at a time until all examples are covered. A common disadvantage of this family of algorithms is that the classification performance depends strongly on the minimum support parameter. In principle, smaller support values provide higher classification accuracy, but the number of patterns, and with it the time complexity and the number of found rules, can explode for some datasets. The diversity in the number of frequent patterns among datasets makes it hard to select an appropriate support value for the trade-off between classification performance and computational resources.

DDPMine (Cheng et al., 2008) uses a modified version of FP-Growth to search for a set of discriminative patterns, as measured by information gain. The algorithm applies a pruning method in which the search on a conditional database can be skipped if the upper bound on the information gain from the conditional database does not exceed the information gain of the currently best frequent itemset. All examples supporting the resulting itemset are removed from the current FP-tree, and another iteration continues until the current FP-tree is empty. The resulting discriminative itemsets can then be used as input features for training a subsequent classifier, such as a support vector machine. It remains unclear whether they can also be used as a stand-alone rule set.

The Harmony algorithm (Wang & Karypis, 2006) is quite similar to our approach in that it shares the general idea of finding the best rule for each training example. However, while our work is based on ideas in classical classification rule learning, Harmony is firmly rooted in association rule discovery. At its core, Harmony exhaustively enumerates all possible rules that satisfy a given minimum support threshold, and checks for each rule whether it has a higher confidence than the current best rule for each of its covered examples. By introducing several effective pruning methods, Harmony has shown efficient execution and higher classification accuracy than the covering algorithm Foil (Quinlan, 1990) and the associative classification algorithm CPAR (Yin and Han, 2003). However, like all associative classification algorithms, the performance of Harmony crucially depends on the chosen support threshold, which essentially defines a trade-off between the optimization quality and the efficiency of algorithm: lower support values will lead to an exponential growth in the search space that is covered exhaustively whereas higher minimum support values may miss optimal rules that have a low coverage (the importance of low-coverage rule was first observed by Holte et al. (1989)). Thus, Harmony copes with the exponential size of the hypothesis space by reducing it with a suitable choice of the minimum support threshold and finding instance-based global optima in this reduced space, whereas our approach, Lord, deals with the complexity by replacing the exhaustive search for the global optimum with an efficient greedy search for a local optimum. Obviously, both approaches may miss the globally best rule, but in very different ways, and it will eventually depend on the domain which one is more effective. However, we argue that Lord is considerably more flexible, in that it, e.g., allows for easy parallelization, as the search of the optimum for each example is independent of all other examples, or facilitates the use of arbitrary rule learning heuristics whereas Harmony’s pruning heuristics depend on the use of support and confidence, which are not particularly well suited for classification rule learning (Fürnkranz & Flach, 2005).

2.4.3 Modern rule learning algorithms

In the wake of the success of deep learning, some effort has also gone into the design of efficient rule learning algorithms that optimize a given loss function. Many recent approaches (Dash et al., 2018; Su et al., 2016; Wang et al., 2017; Wang & Rudin, 2015; Yang et al., 2017; Letham et al., 2015) limit their work to binary classification rules, but instead of finding rules based on heuristic measures, they optimize the learned rule collection (set or list) according to an objective function, with a search procedure aiming at finding a sparse rule set. Because the proposed optimization methods are computationally expensive and the size of the search space explodes, they often operate on truncated search spaces in a greedy fashion, which forfeits the guarantee that the output rule collections are globally optimal. Also, the search for Boolean rules can take very long on large datasets, which limits their applicability to big data.

IDS (Lakkaraju et al., 2016) forms a rule set (a decision set in their terminology) by sampling from a candidate rule space corresponding to the cross product between the set of frequent itemsets and the set of classes. A candidate rule is selected by optimizing a joint objective function that combines multiple criteria for the learned rule set, such as accuracy, rule count, rule length, and coverage. Like IDS, other methods are also related to associative classification, but use alternative techniques for searching for an optimized rule set (or list) from pre-mined rules. BRL (Letham et al., 2015) produces a Bayesian posterior distribution over permutations of a set of Bayesian association rules, each of which defines a prior distribution over classes for its rule head rather than a single class. From this, decision lists with high posterior probability are selected to classify new examples. Its successor, SBRL (Yang et al., 2017), further improves BRL's computational efficiency. Highly similar to BRL, FRL (Wang & Rudin, 2015) also learns a rule list. BRS (Wang et al., 2017) searches for an optimal rule set by first generating association rules from the positive examples w.r.t. one of the two classes of the binary target attribute. To reduce the number of candidate rules, the association rules are filtered with criteria such as an upper bound on the rule length, a false positive rate smaller than the true positive rate, or maximal information gain. A simulated annealing algorithm with a prior probability-based objective function is applied to search for an optimal rule set in the space of subsets of the candidate rules.

The approach by Su et al. (2016) applies integer programming to formulate a Hamming-distance-based objective function and performs block coordinate descent or linear programming relaxation to search for an optimized Boolean rule set. In a similar way, the CG rule learning algorithm (Dash et al., 2018) finds a Boolean rule set which minimizes the sum of the number of positive examples classified incorrectly and the number of clauses (rule bodies) in the space of all possible clauses covering negative examples. The complexity of the resulting rule sets, quantified as the sum of the lengths of the rules, is bounded by a given input parameter to control the complexity and avoid overfitting. Since this approach is only suitable for very small datasets, column generation (Barnhart et al., 1998) is used for tackling larger datasets, as it allows generating only a small subset of all possible rules explicitly.

All these recent techniques have in common that they strive for a minimal rule set for the sake of interpretability. As argued at the end of the previous section, this is not our main objective: instead, we aim for high predictive accuracy and scalability to large datasets. For this, we also build on ideas from association rule learning, which are discussed in the next section.

3 The LORD algorithm

In this section, we describe Lord, our efficient implementation of locally optimal rule discovery, which allows us to find the best rule for each training example.

Like many other state-of-the-art algorithms (cf. Sect. 2.4), Lord draws upon some ideas from association rule learning. In particular, we make use of PPC-trees and N-lists, which can efficiently summarize counts of conjunctive expressions. N-lists are a data structure that was originally proposed for efficiently discovering frequent itemsets from a dataset of transactions via the state-of-the-art algorithm PrePost\(^+\) (Deng et al., 2012; Deng & Lv, 2015). We adapt the N-list structure to tabular datasets with attributes \(A_i\), \(i \in \{1, \dots , m\}\), by using selectors as features, and to classification problems with the last attribute \(A_m\) as the nominal class attribute. We also assume that the first \(m-1\) predictive attributes \(A_i\), \(i < m\), are nominal, i.e., numeric attributes will be discretized beforehand, but their values can be missing (value null).

In the following, we describe our adaptation of PPC-trees (Sect. 3.2) and N-lists (Sect. 3.3) to a classification setting. In Sect. 3.4, we show how these structures can be used for implementing an efficient rule learning algorithm, which combines advanced ideas from algorithms such as CN2 or Ripper to learn, prune, and optimize large rule sets, for which we present an efficient representation in Sect. 3.5. All introduced definitions and concepts are illustrated with an example in Sect. 3.6 and Fig. 1. Finally, the complexity of the algorithm is analyzed in Sect. 3.7.

3.1 Initialization

In a first pass, the dataset is scanned to count the frequency of all distinct selectors. Selectors from the predictive attributes (group 1) and those from the class attribute (group 2) are sorted locally in ascending order of their frequencies. The sorted group 2 is then appended to the end of the sorted group 1, yielding a global order O of selectors. We use the symbol \(\prec\) to express the order relationship between two selectors, e.g., \(s_1 \prec s_2\) means \(s_1\) precedes \(s_2\) in the order O. We assume that sets of selectors are always ordered according to O.
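As an illustration, the following is a small Python sketch of this initialization pass under our notation; the representation of a selector as an (attribute index, value) pair and the function name are our own assumptions, and rows are tuples of attribute values with None for missing values.

```python
from collections import Counter

# Sketch of the first pass: count selector frequencies and build the global
# order O (group 1 = predictive selectors, group 2 = class selectors).
def build_selector_order(rows, class_index):
    counts = Counter()
    for row in rows:
        for i, v in enumerate(row):
            if v is not None:                         # missing values ignored
                counts[(i, v)] += 1                   # selector = (attr, value)
    group1 = sorted((s for s in counts if s[0] != class_index),
                    key=counts.__getitem__)           # ascending frequency
    group2 = sorted((s for s in counts if s[0] == class_index),
                    key=counts.__getitem__)           # ascending frequency
    order = group1 + group2                           # the global order O
    return {s: rank for rank, s in enumerate(order)}  # s1 ≺ s2 iff lower rank
```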

Definition 3

(Selector-set) A selector-set is a set of selectors which are kept in a predefined order O and in which no two selectors come from the same attribute. A k-selector-set \(s_1s_2 \cdots s_{k}\) is a selector-set with k selectors in the order O, \(s_1 \prec s_2 \prec \cdots \prec s_{k}\).

The selected order O of selectors does not influence the correctness of calculating the support counts of selector-sets, but it affects the memory efficiency of the PPC-tree and the N-lists (presented in the next sections). Because the support count of a rule is always calculated after the support count of the rule body, it is helpful to place the selectors of group 1 before the selectors of group 2, so that the support count calculation for a rule can utilize the previously computed results for its rule body (cf. also Definition 7 below).

3.2 PPC-trees

In a second pass, the dataset is scanned to construct the so-called PPC-tree, a prefix tree. The PPC-tree consists of so-called PPC-nodes, each of which is associated with a distinct selector and contains frequency information accumulating the number of examples (selector-sets) that pass through the node when they are inserted into the tree.

Definition 4

(PP-code and PPC-node) A PPC-node stores the following components:

  • the selector with which the node is associated

  • a PP-code \(\langle \textit{pre},\textit{post}\rangle\), which encodes the positions at which the node is encountered in a pre-order and a post-order traversal of the tree, respectively

  • a frequency count (freq), which encodes how many examples pass through this node in the tree

In addition, it also needs to store pointers to its parent and child nodes.

The PPC-tree is built up incrementally. Each example e is represented by a sorted selector-set \(S_e\). Note that it is possible that \(|S_e| < m\) because e can contain null values for some attributes, which are simply ignored. Starting from the tree root, the selectors in \(S_e\) are inserted sequentially into the tree structure in the reverse order of O. When inserting a selector at a tree node, the child node registering the same selector is located, and its frequency is increased by one. If there is no child node for the next selector, a new child node registering this selector is created with its frequency initialized to 1. The insertion process then continues with the next selector, with the current node changed to the child node.
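A minimal sketch of this construction, assuming each example is already given as its selector-set sorted according to O; the class name PPCNode and the function insert_example are ours.

```python
# Sketch of PPC-tree construction: a prefix tree in which more frequent
# selectors (later in O) end up closer to the root.
class PPCNode:
    def __init__(self, selector=None):               # None for the root
        self.selector = selector
        self.freq = 0
        self.pre = self.post = -1                    # PP-codes, assigned later
        self.children = {}                           # selector -> PPCNode

def insert_example(root, selector_set):
    node = root
    for s in reversed(selector_set):                 # reverse order of O
        child = node.children.get(s)
        if child is None:                            # new node for this selector
            child = PPCNode(s)
            node.children[s] = child
        child.freq += 1                              # count the passing example
        node = child
```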

After all examples have been inserted into the tree, the PP-codes are computed from a pre-order and a post-order traversal of the tree. They make it possible to efficiently determine whether two PPC-nodes are on the same path, using the following property of PP-codes (Deng & Lv, 2015):

Property 1

Given two nodes \(N_1\) and \(N_2\) with their respective PP-codes \(\langle \textit{pre}_1,\textit{post}_1\rangle\) and \(\langle \textit{pre}_2,\textit{post}_2\rangle\), \(N_2\) is an ancestor of \(N_1\) (and thus \(N_1\) is a descendant of \(N_2\)) iff \(\textit{pre}_2 < \textit{pre}_1\) and \(\textit{post}_2 > \textit{post}_1\).

We will denote the sets of ancestors and descendants of a node N with \(\textsc {Anc}(N)\) and \(\textsc {Desc}(N)\), respectively.

PPC-trees are similar to FP-trees (Han et al., 2004), which have also been adapted to supervised learning (Atzmüller & Puppe, 2006). Both are prefix trees constructed from a list of items ordered by increasing global frequency, with a similar way of inserting an example into the tree. The nodes of both trees contain an associated item and a frequency count. However, while each node of a PPC-tree contains a PP-code that allows determining whether two nodes lie on the same path without looking at the tree structure, each node of an FP-tree contains a link to the next node of the same item, forming a chain of nodes for each distinct item. Thus, an FP-tree has to maintain a header table for accessing the first node of each chain, which is needed for creating conditional FP-trees. For this reason, FP-trees must also be retained throughout the entire process, whereas the memory for PPC-trees can be freed after computing the more compact N-lists, as described in the following section.

3.3 N-list generation

The PPC-tree is used for constructing the N-list of each selector, which effectively summarizes the frequency of occurrence of this selector. Thus, we associate with each selector a list of N-nodes, which collectively capture all matches of this selector in the data. After the generation of the N-lists, the PPC-tree is no longer needed, and its memory can be freed.

Definition 5

(N-node) An N-node is a reduced version of a PPC-node which only retains the PP-code and freq, denoted as \(\langle \textit{pre},\textit{post}\rangle \!:\!\textit{freq}\).

Definition 6

(N-list of a selector) The N-list of a selector is a list of all N-nodes associated with this selector. The N-nodes are sorted in increasing order of pre.

By a pre-order traversal of the constructed PPC-tree, an N-node is created from each visited tree node and added to the end of the N-list of the selector registered at that tree node. After completing the tree traversal, the N-list of each distinct selector has been generated. No rearrangement of the N-nodes in the N-lists is necessary, since the N-nodes were added in ascending order of pre while traversing the tree.
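Continuing the PPC-tree sketch from Sect. 3.2, the following illustrates how PP-codes and N-lists can be obtained in a single recursive traversal: pre is assigned when a node is entered and post when it is left, and each visited node contributes one N-node \(\langle \textit{pre},\textit{post}\rangle \!:\!\textit{freq}\) to the N-list of its selector. The function name is ours.

```python
from collections import defaultdict

# Sketch: assign PP-codes and collect N-lists in one recursive traversal of
# the PPC-tree built by insert_example above.
def build_nlists(root):
    nlists = defaultdict(list)                       # selector -> [PPCNode]
    counter = {"pre": 0, "post": 0}

    def visit(node):
        node.pre = counter["pre"]; counter["pre"] += 1
        if node.selector is not None:                # skip the artificial root
            nlists[node.selector].append(node)       # pre-order: sorted by pre
        for child in node.children.values():
            visit(child)
        node.post = counter["post"]; counter["post"] += 1

    visit(root)
    # Materialize N-nodes as (pre, post, freq) triples; the tree can be freed.
    return {s: [(n.pre, n.post, n.freq) for n in nodes]
            for s, nodes in nlists.items()}
```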

Property 2

Sorting the N-nodes in an N-list according to increasing pre, also sorts them according to increasing post.

To see this, assume that the pre of two nodes in the N-list increases, but their post decreases. According to Property 1, the two nodes then have an ancestor-descendant relationship, i.e., they are on the same path in the PPC-tree. However, all nodes in an N-list share the same distinct selector, and therefore they all have to be on different paths of the PPC-tree, refuting the assumption.

The N-lists for single selectors, which can also be called 1-selector-sets, essentially correspond to the 1-itemsets in an Apriori-like association rule learning algorithm. However, as we will see in the following, they contain all the information necessary for constructing the N-lists of combinations of selectors.

For computing the N-list of a k-selector-set, we use two different approaches:

Definition 7

(N-list of k-selector-set, \(k \ge 2\)) The N-list NL of the k-selector-set \(s_1s_2 \cdots s_{k-2} s_{k-1}s_k\) is calculated from the N-lists \(NL_1\) of the \((k-1)\)-selector-set \(s_1s_2 \cdots s_{k-2}s_{k-1}\), and \(NL_2\) of either

  (i) the \((k-1)\)-selector-set \(s_1s_2 \cdots s_{k-2}s_{k}\), or

  (ii) the selector \(s_k\),

as follows

$$\begin{aligned} NL = \bigg \{ \langle N_2.\textit{pre},N_2.\textit{post}\rangle \!:\! \sum _{N_1 \in NL_1 \cap \textsc {Desc}(N_2)} N_1.\textit{freq} \;\;\bigg \vert \;\; N_2 \in NL_2 \bigg \} \end{aligned}$$
(2)

Thus, by this definition, N-nodes in NL sharing the same PP-code are combined into a single N-node with that PP-code and the sum of their frequencies.

The N-list calculation in Definition 7(i) provides sufficient information for an exhaustive search for frequent itemsets, in which the itemsets are enumerated in a unique order satisfying the input conditions of the calculation. However, in our greedy approach, we need to be able to query and compute the support counts of arbitrary selector-sets, without the complete layer-wise enumeration of all itemsets that is typical for association rule discovery. For this reason, we use case (i) only if the N-list of the \((k-1)\)-selector-set \(s_1s_2 \cdots s_{k-2}s_{k}\) has already been computed in a previous step. If it is not yet available, we use case (ii), the N-list of the selector \(s_k\), which is always available and better suited for greedy search.

In (2), a new N-node N is added to the result N-list NL for each pair of nodes \(N_1 \in NL_1\) and \(N_2 \in NL_2\) that satisfies the ancestor-descendant relationship (cf. Property 1). \(N_2\) is always an ancestor of \(N_1\), because \(N_1\) and \(N_2\) are associated with the selectors \(s_{k-1}\) and \(s_k\) respectively, and \(s_{k-1} \prec s_k\) (cf. Definition 3). The new node N receives the frequency count of \(N_1\) (because it is the number of paths containing both \(N_1\) and \(N_2\)) and the PP-code of \(N_2\) (so that the N-list of the selector-set shrinks quickly, thanks to node combinations at ancestor nodes while the selector-set grows, consequently reducing both memory consumption and computation time).

For implementing this, we define a recursive function CalculateNList (Algorithm 4), which can calculate the N-list of any selector-set \(s_1s_2 \cdots s_k\) as follows. NListSet is a hash map that caches generated N-lists, mapping a selector-set to its N-list for fast access. It initially contains the N-lists of all distinct selectors (Definition 6). The function first checks whether the N-list of the input selector-set has already been calculated and, if so, returns it. Otherwise, it recursively calculates the N-list \(NL_1\) of the sub-selector-set \(s_1s_2 \cdots s_{k-2}s_{k-1}\) in lines 6-9. In lines 10-13, the N-list \(NL_2\) is set to the N-list of the sub-selector-set \(s_1s_2 \cdots s_{k-2}s_k\) (Definition 7(i)) if it has been calculated and cached in NListSet before; otherwise, \(NL_2\) is assigned the N-list of the selector \(s_k\) (Definition 7(ii)), which is usually longer than the N-list of the selector-set \(s_1s_2 \cdots s_{k-2}s_k\) but avoids another branch of recursive calculations. In line 14, the function GenerateNList generates the N-list NL from \(NL_1\) and \(NL_2\) based on Equation (2). All intermediate results are stored in NListSet, so that the recursion stops as soon as a cached N-list is encountered.

For the correct and efficient calculation of the heuristic values for rule evaluations without the need of re-counting frequencies in the database, the following property is crucial:

Property 3

The support count of a k-selector-set constructed according to Definition 7 is the sum of frequencies of the N-nodes in its N-list.

Note that this property is not trivial and does not hold for arbitrary ways of combining N-lists. Counter-examples, as well as a proof for Definition 7(ii) that builds upon a previous proof of Definition 7(i) by Deng et al. (2012, Proposition 4), can be found in Appendix 1, so that the property generally holds for Lord.

[Algorithm 4: The recursive function CalculateNList]
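The core of the computation is the combination of two N-lists according to Eq. (2). Below is a Python sketch of this merge step, with N-nodes represented as (pre, post, freq) triples and both input lists sorted by pre, so that a linear scan suffices (Properties 1 and 2); the recursion and the caching of intermediate results in NListSet from Algorithm 4 are omitted here, and the function name is ours.

```python
# One merge step of Definition 7 / Eq. (2): nl1 belongs to the (k-1)-selector-
# set, nl2 to the added selector or sub-selector-set. For each N2 in nl2, the
# frequencies of its descendants in nl1 are summed (Property 1), and the
# resulting N-node keeps N2's PP-code.
def combine_nlists(nl1, nl2):
    result, i = [], 0
    for pre2, post2, _ in nl2:
        while i < len(nl1) and nl1[i][0] < pre2:     # skip nodes preceding N2
            i += 1
        j, freq = i, 0
        # Descendants of N2 form a contiguous run: pre > pre2 and post < post2.
        while j < len(nl1) and nl1[j][1] < post2:
            freq += nl1[j][2]
            j += 1
        i = j
        if freq > 0:
            result.append((pre2, post2, freq))       # N2's PP-code, summed freq
    return result
```

On the running example of Sect. 3.6, combining the N-list of \(s_{31}\) (containing \(\langle 10,6\rangle \!:\!1\)) with the N-list of \(s_1\) (containing \(\langle 7,11\rangle \!:\!3\)) yields the N-node \(\langle 7,11\rangle \!:\!1\), as described there.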

3.4 Rule learning

This subsection presents the proposed rule learning algorithm Lord (Locally Optimal Rules Discoverer). It first builds the N-list structure for each distinct selector from the training set E, as discussed in the previous sections, which allows it to efficiently obtain the N-list, and thus the coverage counts, of an arbitrary rule body (Algorithm 4). This is used in a greedy search for a locally optimal rule, in a similar way as in classical algorithms such as CN2 and Ripper, as shown in Algorithm 5. In particular, the algorithm searches for a locally best rule for each training example in two phases: lines 6-13 for rule growth and lines 14-21 for rule pruning.

[Algorithm 5: The Lord rule learning algorithm]

Rule evaluation. For comparing the quality of candidate rules, we use a heuristic function h, with coverage and class frequency for tie-breaking.

Definition 8

(Rule comparison) Given two rules \(r_1\) and \(r_2\), \(r_1\) is better than \(r_2\), denoted as \(r_{1} \succ r_{2}\), if (i) \(h(r_{1}) > h(r_{2})\), or (ii) \(h(r_{1}) = h(r_{2})\) and \(r_{1}.p > r_{2}.p\), or (iii) \(h(r_{1}) = h(r_{2})\) and \(r_{1}.p = r_{2}.p\) and \(r_{1}.head \prec r_{2}.head\),

where h(r) is the heuristic value, r.p is the number of covered positive examples (true positives) of rule r, and r.head is the head of rule r. In the rare case that both the heuristic value and the number of true positives are equal, we favor the rule with the minority class in its head, because it covers a higher percentage of the positive examples of that class.

As a heuristic, we use the m-estimate, which has been proposed by Cestnik (1990) and used in CN2 (Clark and Boswell, 1991) and related algorithms (Džeroski et al., 1993). The m-estimate value of a rule \(r:B \rightarrow c\) is calculated as

$$\begin{aligned} h_m(r) = \frac{r.p+m\frac{P}{P+N}}{r.p+r.n+m} \end{aligned}$$
(3)

where

m = a settable parameter in the range \([0, +\infty )\)

r.p = the number of true positives of rule r

r.n = the number of false positives of rule r

P = the number of positive examples in E w.r.t. class c

N = the number of negative examples in E w.r.t. class c

The m-value is a tunable parameter that provides an excellent trade-off between weighted relative accuracy, which is frequently used in descriptive rule learning, and precision, the main target for predictive learning (Fürnkranz & Flach, 2005). It has also been shown to perform very well in a broad empirical comparison of various rule learning heuristics (Janssen & Fürnkranz, 2010). In Lord, we use comparably low values of m (the default is 0.1), which results in a bias towards more specific and less general rules. This is analyzed in somewhat more depth in Sect. 4.7.

All counts can be efficiently obtained from the corresponding N-lists (Algorithm 4): for example, the number of true positives (r.p) and the total coverage (\(r.p + r.n\)) of a rule r can be computed efficiently from the support counts of the selector-sets \(B \cup \{c\}\) and B, which are derived from their corresponding N-list structures according to Property 3. Similarly, P, the count of positive examples w.r.t. class c, can be derived from the N-list of the selector c.
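For illustration, the heuristic of Eq. (3) reduces to a one-line function over these four counts; the function name is ours, and the default m corresponds to the value mentioned above.

```python
# m-estimate (Eq. 3) from the counts p = r.p, n = r.n, P, N; all four are
# obtained as support counts of N-lists according to Property 3.
def m_estimate(p, n, P, N, m=0.1):
    return (p + m * P / (P + N)) / (p + n + m)
```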

The CalculateNList function defined above (Algorithm 4) is called frequently in lines 7 and 15 of Algorithm 5 to determine the support counts of selector-sets. In our implementation, the scope of the NListSet input of the function is limited to the search for the best rule of each example and is not re-used across examples. However, these additional computational costs are compensated by the memory saved and by the completely independent searches for the best rules of different examples, which allows for easy parallelization. This is empirically analyzed below, in Sect. 4.6.

Rule growth. In the rule growth phase, the rule body is initialized with the empty set and iteratively extended with the selector among the remaining selectors in \(S_e\) that results in the best improvement according to Definition 8. Note that, contrary to the similar search in algorithms like CN2 or Ripper, the set of possible selectors does not contain all possible features, but only those that are pertinent to the current example, so that the complexity of this phase is bounded by the number of attributes m, and not by the considerably larger number of selectors that are derived from these attributes. The growth process finishes when the rule can no longer be improved or \(S_e\) is empty.

Rule pruning. In the pruning phase, selectors are iteratively pruned from the rule body such that the improvement in each step is maximized according to Definition 8. This is analogous to the incremental reduced error pruning technique (Fürnkranz & Widmer, 1994) used in Ripper, with the differences that all possible conditions are considered for pruning (not only final sequences), and that the resulting rules are re-evaluated on the training set, not on a separate pruning set. The pruning phase completes when the rule cannot be further improved or when only two selectors remain in the rule body. In this case, we do not have to prune further, because all rules with a single condition have already been evaluated in the growing phase, and can therefore not have a higher heuristic value than the current rule.
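Putting the two phases together, the following is a sketch of the greedy search for one example, the core of Algorithm 5. Here evaluate is a hypothetical function that scores a candidate body/head pair and returns a quality object that is comparable according to Definition 8, e.g., a tuple of (heuristic value, true positives, minority rank); the function name is ours.

```python
# Sketch of the two-phase greedy search of Algorithm 5 for one example e,
# where S_e is the set of selectors present in e and head is e's class.
def best_rule_for_example(S_e, head, evaluate):
    body, best = [], evaluate([], head)           # start from the empty body
    candidates = set(S_e)
    while candidates:                             # growth phase
        s, q = max(((s, evaluate(body + [s], head)) for s in candidates),
                   key=lambda t: t[1])
        if q <= best:                             # no improvement: stop growing
            break
        body.append(s); best = q; candidates.remove(s)
    while len(body) > 2:                          # pruning phase
        s, q = max(((s, evaluate([t for t in body if t != s], head))
                    for s in body), key=lambda t: t[1])
        if q <= best:                             # no improvement: stop pruning
            break
        body.remove(s); best = q
    return body, best
```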

As one of our astute reviewers has suggested, rules could be further improved by repeating multiple phases of rule specialization (growing) and generalization (pruning), or even by a general bi-directional hill-climbing, as has been tried, e.g., in the JoJo algorithm (Fensel & Wiese, 1993). Empirically, however, this may not be necessary. The rules found in the growth phase often already achieve the highest heuristic value, so the chance of pruning a selector from a rule in the pruning phase is small, and consequently, the chance of further improving the rule in a second growing phase (after the pruning phase) is even smaller. We will return to this issue in the experiments reported in Sect. 4.8.

Efficient Forward Rule Selection. In order to reduce the computational complexity of Lord, we also implemented and tested a variant, Lord*, which only learns a rule when a training example is not correctly classified by the existing rule set. More precisely, Lord* does the following for each training example:

  • If no rule in the current rule set covers the example, a new rule for the training example is learned and added to the current rule set.

  • If the training example is mis-classified by the current rule set, a new rule is learned; if this newly found rule is better than the rule that classified the example, it is added to the current rule set.

  • If the training example is correctly classified, no new rule is learned.

Note that, as a consequence, the final rule set learned by Lord* is somewhat affected by the input order of the training examples. However, under the assumption of a random order, the technique works well, as we will see in Sect. 4.
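A sketch of this forward selection, assuming a hypothetical classify that returns the best covering rule of the current rule set (or None for an uncovered example) and rule objects with head and label attributes that are comparable according to Definition 8:

```python
# Sketch of the Lord* variant: a rule is only learned when the current rule
# set does not already classify the training example correctly.
def lord_star(examples, classify, learn_rule):
    rules = []
    for e in examples:                     # result depends on this input order
        best = classify(rules, e)          # best covering rule, or None
        if best is None:
            rules.append(learn_rule(e))    # uncovered example: always learn
        elif best.head != e.label:         # mis-classified example
            r = learn_rule(e)
            if r > best:                   # better w.r.t. Definition 8
                rules.append(r)
        # correctly classified example: no new rule is learned
    return rules
```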

Rule Filtering. Ripper introduced a very effective but rather expensive rule optimization phase. Lord also implements an optimization phase, in the form of a very simple filter that removes rules which have been superseded by better rules. All found best rules are collected in a rule set R, which is then filtered to the final rule set \(R^{'}\) in lines 24-28. ClassCoveringRules computes the set \(R_e \subset R\) of all rules that cover e and whose head equals the class label of e. The best rule from a group of rules is selected according to Definition 8. Selecting the best rule r from \(R_e\) further reduces the number of rule variants and therefore also the number of rules. Assume that there is a group of k training examples sharing a globally best rule. Because of the greedy search, a different locally optimal rule may be found for each of the k examples in the search phase, say a total of l rules (\(l \le k\)) with the best rule \(r_0\) among them. In the filter phase, all k examples then adopt \(r_0\) as their best covering rule, and the other \(l-1\) rules are eliminated. Our experiments (Table 8) will show empirically that this hypothesis seems to hold, and will underline the effectiveness of the rule filter. Generally, the larger the number of examples in the training set, the higher the chance that such a rule \(r_0\) is actually the best rule. Finally, the default rule, which has an empty body and predicts the majority class, is added to \(R^{'}\) to guarantee that every test example is covered by at least one rule.
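A sketch of the filter, again assuming rule objects that provide covers() and head and are comparable according to Definition 8; the default rule is appended separately afterwards.

```python
# Sketch of the rule filter (lines 24-28 of Algorithm 5): each example adopts
# the best covering rule with a matching head, and only adopted rules survive.
def filter_rules(rules, examples):
    kept = set()
    for e in examples:
        R_e = [r for r in rules if r.covers(e) and r.head == e.label]
        if R_e:
            kept.add(max(R_e))             # best covering rule w.r.t. Def. 8
    return kept                            # a default rule is added afterwards
```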

Classification with the Learned Rule Set. In the classification phase, an unseen example is classified by choosing, according to Definition 8, the best rule from the group of rules covering the example. If no covering rule is found, the default rule is used.

3.5 R-tree

For classifying a new example, Lord looks for the best rule among all rules that cover this example. In order to speed up this process, we index the resulting rule set via a so-called R-tree, a prefix tree of rule bodies whose selectors are in the same predefined order O as the selector-sets (Definition 3). Each node of an R-tree (excluding the root node) is associated with a distinct selector and may or may not contain a reference to a rule. Its child nodes are also kept in the order O. The selectors of a rule body are inserted into the tree in the reverse order of O. Thus, the tree serves as an index structure that allows efficiently finding the covering rules of an example whose corresponding selectors are also in the order O. The structure of an R-tree is quite similar to that of an FPO-tree (Huynh & Küng, 2020), which provides optimal compactness and efficient aggregation for very large numbers of local frequent itemsets. A minor difference is that the R-tree indexes references to rules, whereas the FPO-tree records the support counts of itemsets. Figure 1e shows an example of an R-tree.

3.6 Example

Figure 1 illustrates the entire process from an example dataset in Fig. 1a to the resulting rule set in Fig. 1e. The example dataset includes three predictive attributes \(A_1\), \(A_2\) and \(A_3\), and the class attribute C. The right-most column in Fig. 1a depicts the selector-sets corresponding to examples. Figure 1b shows all distinct selectors in the predefined order O (Sect. 3.1) after the first data scan. Figure 1c depicts the corresponding PPC-tree (Sect. 3.2) built from the example dataset shown in Fig. 1a.

[Fig. 1: A running example from an example dataset to the result rule set]

The left column in Fig. 1d enumerates the N-lists (Sect. 3.3) of all single selectors, which can be derived directly from the example PPC-tree in Fig. 1c. These basic N-lists are then combined into N-lists of selector-sets with their corresponding support counts. The middle and right columns in Fig. 1d respectively show the generated N-lists and the corresponding candidate rules evaluated while finding the best rule for the first example, with its representative set of selectors {\(s_{31}, s_{21}, s_{11}\)} and the class selector \(s_1\). The best rule is initialized with an empty body (\(\emptyset \rightarrow s_1\)) and then extended with each of the three possible selectors \(s_{31}, s_{21}, s_{11}\). For the selector \(s_{31}\), with the candidate rule \(s_{31} \rightarrow s_1\), the calculation is as follows:

  1. Compute the N-list of the selector-set \(s_{31}s_1\) from those of \(s_{31}\) and \(s_1\) (Definition 7). The N-node \(\langle 10,6\rangle \!:\!1\) is a descendant of the N-node \(\langle 7,11\rangle \!:\!3\) because it comes after \(\langle 7,11\rangle \!:\!3\) in the pre-order traversal (\(10 > 7\)) and before it in the post-order traversal (\(6 < 11\)). Consequently, a new N-node \(\langle 7,11\rangle \!:\!1\) is formed from the PP-code of the ancestor and the frequency of the descendant.

  2. Compute the m-estimate (3) of the rule, yielding \(h_{1}(s_{31} \rightarrow s_1) = 0.6875\) for \(m = 1\).

The same calculation is applied for the selectors \(s_{21}\) and \(s_{11}\), and eventually the rule \(s_{21} \rightarrow s_1\) is selected as the currently best rule because its heuristic value of 0.8437 is the highest. \(s_{31}\) and \(s_{11}\) are then considered as extensions. The evaluation of \(h_1\) for the extended rules triggers the computation of the N-lists of \(s_{31}s_{21}\) and \(s_{31}s_{21}s_1\) (for the extended rule \(s_{31} s_{21} \rightarrow s_1\)) and of \(s_{21}s_{11}\) and \(s_{21}s_{11}s_1\) (for \(s_{21}s_{11} \rightarrow s_1\)). Their heuristic values, 0.6875 and 0.8437 respectively, do not exceed \(h_1\)(\(s_{21} \rightarrow s_1\)). As a consequence, the rule growth stops. The rule pruning is also skipped in this case because the rule body length is 1.

In this way, the search for a locally best rule is performed for each of the remaining examples. All found rules are listed in the left two columns of Fig. 1e. Note that the first three examples share the same locally best rule \(s_{21} \rightarrow s_1\). A filter step is then applied to the rule set, in which the rule \(s_{12}s_{32} \rightarrow s_2\) of example #7 is removed because the example adopts the rule \(s_{32}s_{22} \rightarrow s_2\) of example #8 as its new best rule. The filtered rule set is then complemented with a default rule, which has an empty body and predicts the majority class \(s_2\).

The right-most column in Fig. 1e illustrates the R-tree constructed from the result rule set to its left. If none of these rules covers a new example, the default rule is used to classify the example.

3.7 Complexity analysis

For the following analysis of Lord’s computational complexity, we assume a dataset with n examples and m attributes. Furthermore, we use k to denote the maximum length of a rule. Note that k can be bounded in various ways. For example, obviously \(k \le m\), as each attribute is tested at most once. We can also assume that \(O(k) \le O(\log n)\), assuming that each condition reduces the number of examples covered by the rule by a certain fraction. Finally, k can of course be bounded in practice by allowing only rules up to a certain fixed length k.

The number of rule evaluations in the growth phase is obviously \(\sum _{j=0}^{k-1}(m-j) = m\cdot k - (k-1)k/2\), as the \(j^{th}\) selector added to the rule body has \((m-j)\) options. Similarly, the pruning phase requires at most \(\sum _{j=0}^{k-2}(k-j) = (k-1)k - (k-2)(k-1)/2 = (k-1)(k+2)/2\) rule evaluations. Therefore, the total number of rule evaluations is \(O(m\cdot k)\), which is considerably less than \(O(2^m)\), the number of all possible rules.

A key factor for Lord’s efficiency is that the rule evaluations do not have to be performed on the data, but on the N-lists stored in memory. For computing the heuristic value of a candidate rule \(s_1 s_2 \dots s_k \rightarrow c\), we need the N-lists of the two selector-sets \(s_1 s_2 \dots s_k\) and \(s_1 s_2 \dots s_k c\). First, the N-list of \(s_1 s_2 \dots s_k\) is calculated recursively by Algorithm 4, which, in the worst case, must calculate the N-lists of the \(k-1\) selector-sets \(s_1 s_2\), ..., \(s_1 s_2 \dots s_{k-1}\), \(s_1 s_2 \dots s_k\) based on Definition 7. Since the PP-codes of the N-nodes in an N-list are in order, the calculation of the N-list of a k-selector-set in Definition 7 is linear in the length l of the N-lists. Second, the N-list of \(s_1 s_2 \dots s_k c\) can be calculated directly from the N-lists of \(s_1 s_2 \dots s_k\) and c based on Definition 7(ii) (note that the N-list of c has only one N-node). Overall, the computational complexity of evaluating a rule \(s_1 s_2 \dots s_k \rightarrow c\) is \(O(k\cdot l)\). Note that l is bounded by the number of examples covered by the body of the rule and therefore depends on the density of the corresponding selectors in the dataset.

The overall computational complexity of Lord is therefore \(O(n\cdot m \cdot k^2 \cdot i \cdot l)\). Note that the search for the locally best rule of a training example is inherently independent of the searches for the other examples. This allows the rule search to be massively parallelized across workers, e.g., shared-memory threads and/or a distributed environment, in a work-pool model with high load balance. Thus, the total time complexity of Lord is \(O(\frac{1}{q} \cdot n\cdot m \cdot k^2 \cdot i \cdot l)\), where q is the number of parallel rule search workers.

4 Experiments

In this section, we report on the experimental evaluation of the Lord algorithm. The main goal of this study is to show that the algorithm is able to efficiently learn rule sets for very large databases, with an accuracy that is not worse, and often better, than that of other state-of-the-art algorithms (Sect. 4.2). We also compare the algorithms w.r.t. runtime (Sect. 4.3) and rule complexity (Sect. 4.4), and ensure that Lord's discretization does not give the algorithm an unfair advantage (Sect. 4.5). Furthermore, we perform extensive experiments to analyze the scalability of Lord on very large datasets (Sect. 4.6), and investigate the impact of the parameter m of the m-estimate heuristic on the classification accuracy (Sect. 4.7), the potential of using multiple successive growing and pruning phases (Sect. 4.8), and the effect and potential of rule filtering on the rule sets learned by Lord (Sect. 4.9).

4.1 Experimental setup

Table 1 gives a brief characterization of the 24 datasets from the UCI Machine Learning Repository (2017) used in the experiments. The first 12 datasets are small, and the remaining 12 range from medium to big volumes of up to several million examples. The datasets are also quite diverse in terms of the number and type of attributes, as well as their completeness and class distributions. The gas-sensor-12 (Huerta et al., 2016) dataset contains data of gas sensors reacting to stimuli (wine, banana); gas-sensor-11 is a version of gas-sensor-12 without the time-offset attribute, which records the duration between the start of the stimulus emission and the sensor operation and cannot be recorded as an input in the practice of gas detection. The dataset in row 23 is a cleaned version of pamap2 (Reiss & Stricker, 2012) for physical activity monitoring, with 3,850,505 instances and 54 attributes, which was processed according to the recommendations of the dataset's authors. The other datasets are used as they are.

Table 1 Datasets used in experiments

In principle, any discretization method could be used to discretize numeric attributes, such as Fusinter (Zighed et al., 1998) or the well-known MDLP (Fayyad & Irani, 1993). We selected the former because, in our experiments, it ran faster than MDLP, and Lord achieved slightly higher predictive performance on datasets discretized with Fusinter than on those discretized with MDLP. In Sect. 4.5, we double-check that the discretization did not give Lord an unfair advantage over the other algorithms.

We have compared Lord, implemented in Java, with the classification association rule learning algorithm CMAR (Li et al., 2001), the classic heuristic rule learning algorithm Ripper (Cohen, 1995), and the two recent algorithms IDS (Lakkaraju et al., 2016) and CG (Dash et al., 2018). Despite its age, Ripper is still among the state-of-the-art algorithms and is well known for its high accuracy, the simplicity of the learned rule sets, and its fast execution times. We use JRip, the implementation of Ripper in the Weka library. IDS and CG are two recently proposed rule learning algorithms. For CG, we used the source code provided by the authors. For IDS, we also started with the authors’ source code, but soon found that the algorithm ran too slowly on our machine: it could not complete a single cross-validation fold within the time-out, even on the small datasets. This is because the calculation for selecting a rule formed from frequent itemsets and classes makes IDS sensitive to the total number of frequent itemsets. A similar observation was made by Filip and Kliegr (2019), who customized IDS so that it selects only the top k association rules instead of all of them. As their customized algorithm, PyIDS, runs much faster, we have used it in our experiments, even though it cannot guarantee the same prediction performance as the original. For CMAR, we considered two implementations, from the SPMF and LUCS-KDD libraries respectively, and report the results of the latter, which performed better in both runtime and accuracy.

We have performed a 10-fold cross-validation with a time-out of 72 hours per algorithm execution on a dataset. For the datasets pamap2 and susy, the algorithms were only tested on the first fold of the 10-fold cross-validation. The same train-test splits are used for all algorithms. CMAR and PyIDS run on the same discretized datasets as Lord. The input order of the training examples for Lord* is kept as it is in each cross-validation fold. All experiments were run on two Xeon quad-core X5570 CPUs @2.93GHz, hence eight cores, with 46 GB of available memory.

Besides the above rule learners, we also compared Lord with a black-box approach, namely SMO (Platt, 1998), Weka’s sequential minimal optimization algorithm for training an SVM classifier. Both the default setting and the best setting on each dataset were used for the comparison, where the best setting was selected among multiple values of the complexity parameter C of SMO. In terms of accuracy, Lord beats SMO on the larger datasets (12–23), but loses on most of the smaller datasets (7 out of 1–10), and ties on the mushroom dataset. With respect to runtime, there is not much difference between Lord and SMO on the small datasets, since both run in less than 1 s, but the runtime of SMO grows enormously on the larger datasets; for example, SMO could not complete its learning on the susy dataset within the time-out. On average over the datasets (1–23), Lord is better than SMO in both runtime (in s) and accuracy: with the default settings, (354, 0.9297) for Lord vs. (7825, 0.8749) for SMO, and with the best settings, (354, 0.9339) vs. (21019, 0.8773).

4.2 Predictive accuracy

Table 2 The accuracy of the algorithms on the datasets

Table 2 shows the classification accuracy of the algorithms. Lord executes the rule discovery phase in parallel with a thread count equal to the number of detected cores of the machine. Column 3 shows its results with the fixed default m-estimate parameter setting of \(m = 0.1\). We also show the results of the setting for which Lord achieved its best accuracy, as well as, in column 5, the variant for enhanced execution performance, Lord* (Sect. 3.4), with the default \(m = 0.1\).

Ripper is generally used with its default settings, but we tried two settings for the number of optimization runs o, namely 0 and 2 (the default value), shown in columns 6 and 7, respectively. This parameter is important, as the optimization runs have a large positive effect on Ripper’s accuracy, but are also quite expensive. CMAR (column 8) and CG (column 9) were used with the default settings recommended by their authors, and PyIDS with the two settings \(k = 50\) and \(k = 150\) association rules is shown in columns 10 and 11, respectively. CG supports binary classification only; therefore, column 9 contains no results (//) for this algorithm on the multi-class datasets.
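For instance, assuming Weka’s standard option-handling interface, a JRip run with a given number of optimization runs can be configured as in the following sketch (the loading of the training Instances is omitted):

import weka.classifiers.rules.JRip;
import weka.core.Instances;

class RipperSetup {
    static JRip buildRipper(Instances train, int o) throws Exception {
        JRip ripper = new JRip();
        ripper.setOptions(new String[]{"-O", Integer.toString(o)}); // -O: number of optimization runs
        ripper.buildClassifier(train);
        return ripper;
    }
}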

In order to compare the algorithms, we group the values in Table 2 according to the basic algorithm, and highlight a value in bold if it outperforms the best values of the other algorithms, but not necessarily other parameter settings of the same algorithm. This also allows comparing, e.g., the parameter setting of Lord with the fixed parameter \(m=0.1\) to all competitors, which would not be possible if we had only marked the best value in each line, because Lord (best m) is always at least as good as Lord (\(m= 0.1\)). Because many experiments with the competing algorithms could not be completed on the two largest datasets pamap2 and susy, the average accuracy is derived from a main group of the first 22 datasets. Another average accuracy is calculated for the subgroup of datasets that CG can process successfully. Lord achieves the highest average rank and accuracy in all three settings, winning on 11/22 (\(m= 0.1\)), 13/22 (best m), and 11/22 (Lord*) of the datasets against the competitors. For pamap2 and susy, the performance of Lord is also superior to that of the competitors which could complete these tasks. The differences in average accuracy among the algorithms grow when moving from the subgroup to the main group of datasets. For example, the performance advantages of the three settings of Lord over the second-best Ripper (\(o = 2\)) increase from [0.42%, 0.62%, 0.41%] to [1.16%, 1.59%, 1.14%], and even more so for the other algorithms.

The second-best algorithm, Ripper, wins 4/22 and 6/22 with its two settings. CG and CMAR, each winning 2/22, come in third and fourth place, respectively, and PyIDS, with no wins, comes last. On average, PyIDS achieves a higher accuracy for larger values of k, but this is not consistent across all datasets (e.g., breast, german).

In order to assess whether these differences are statistically significant, we perform a Friedman test (Demšar, 2006) based on the average ranks on the first 22 datasets. CG is not ranked because it could not complete many of the datasets. The test shows a significant difference in accuracy (p-value \(= 2.027 \times 10^{-18}\)), so the null hypothesis that all algorithms perform equally can be confidently rejected. The post-hoc Nemenyi test is visualized in Fig. 2 with the critical distance \(CD = 2.24\) at significance level \(\alpha = 0.05\). The group with the highest accuracy comprises Lord in its three settings and Ripper (\(o = 2\)); the accuracy of Lord (best m) is consistently higher than that of Ripper (\(o = 2\)), but not by enough for the test to indicate a significant difference. Lord (best m) is significantly more accurate than the remaining algorithms: Ripper without its optimization runs (\(o = 0\)), CMAR, and PyIDS.
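For reference, the critical distance follows Demšar (2006): assuming that the \(k = 8\) ranked variants (three Lord settings, two Ripper settings, CMAR, and two PyIDS settings) are compared over \(N = 22\) datasets, and with the Studentized range value \(q_{0.05} = 3.031\) for \(k = 8\), the reported value is reproduced by \(CD = q_{0.05}\sqrt{k(k+1)/(6N)} = 3.031\sqrt{(8 \cdot 9)/(6 \cdot 22)} \approx 2.24\).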

Fig. 2 Nemenyi test on accuracy of the algorithms on the first 22 datasets, \(CD = 2.24, \alpha = 0.05\)

Table 3 The runtime (in s) of the algorithms on the datasets

4.3 Runtime comparison

Table 3 shows the average runtime (derived from the 10-fold cross-validation) of all algorithms on each of the datasets. The last lines show the average of these values and the average rank of each method (after rounding the runtimes to a precision of 0.1 s) for the first 22 datasets. PyIDS is clearly the slowest algorithm, and its runtime increases quickly with the number of considered rules k. On the small datasets, Lord and Ripper run very fast, in less than 1 s, while CMAR and CG typically take a bit longer.

On the larger datasets, Lord is much faster than both Ripper and CG. Without the optimization runs, Ripper (\(o = 0\)) is faster than Lord in some cases, but slower than Lord* (\(m = 0.1\)) on all datasets. The optimization runs of Ripper are necessary to improve its accuracy, and they make Ripper (\(o = 2\)) considerably slower than Lord in all cases. Note that the average accuracy and rank of Lord* (\(m = 0.1\)) are also higher than those of Ripper (\(o = 2\)). Lord* runs faster than Lord, and much faster on the three largest datasets, without losing much of the predictive accuracy of its counterpart Lord (\(m = 0.1\)).

In terms of the average runtime over the first 22 datasets, Lord and Lord* are clearly the fastest, but over all 24 datasets, CMAR is faster than Lord* and Lord. This change comes mainly from the long runtime of Lord and Lord* on the susy dataset. CMAR runs the fastest on some large datasets (gas-sensor-11, gas-sensor-12, pamap2, and susy), but its accuracy on these datasets is much lower than that of Lord*. The reason is that these datasets are very sparse, resulting in a small number of generated frequent patterns and class association rules. Moreover, the chosen LUCS-KDD implementation of CMAR applies additional limits on the number of found frequent patterns and rules and on their length. The original version by Li et al. (2001), which uses FP-Growth to discover frequent patterns, as implemented in SPMF, can result in long runtimes or memory overflow caused by the large numbers of conditional pattern trees for dense datasets mined at the default minimum support of 0.01, e.g. census, connect-4, and kr-vs-kp.

In summary, Lord* is the fastest, followed by Ripper (\(o = 0\)), Lord, CMAR, Ripper (\(o = 2\)), and PyIDS. Thus, Lord and Lord* find a better balance between accuracy and runtime than their competitors.

Table 4 The rule complexity of the algorithms on the datasets, with the rule count on the left and the average rule length on the right of each cell

4.4 Rule complexity

In terms of the rule complexity reported in Table 4, and leaving CG aside, PyIDS produces the simplest rule sets; however, it sacrifices too much classification performance for this advantage. Ripper comes in second place, with a much better balance between rule complexity and performance. The rule sets of Lord and Lord* are larger than those found by Ripper, CG, and PyIDS, but are competitive with those of CMAR. For example, the rule sets found by Lord* are even smaller than those of CMAR on 14 out of the 24 datasets. Although the average rule lengths of Lord are shorter than those of Ripper on some large datasets, the sizes of the found rule sets are, again, orders of magnitude larger than those of Ripper. This was to be expected: because Lord searches for a locally optimal rule for each training example, its rule sets are likely to contain many groups of rule variants and are typically much larger than the sparse rule sets learned by conventional rule learners. However, the more complex rule sets of Lord are often compensated for by more accurate classifications.

Table 5 Experimental results of the Ripper and CG algorithms on mixed and numerical datasets discretized with Fusinter. The numbers in parentheses repeat the evaluations on the original datasets

4.5 Influence of discretization

A possible trivial reason for the performance difference between Lord and the other algorithms on mixed or numerical datasets could be that Lord uses pre-discretized data (as do CMAR and PyIDS), whereas Ripper and CG use their own internal discretization. In order to investigate this possibility, we also performed experiments where Ripper and CG were used on the same discretized datasets as Lord.

Table 5 reports the results. Each cell shows the average execution time (in s) on top and the average accuracy below. The values in parentheses show the corresponding values on the original datasets, which we duplicate here for convenience. The better values for the discretized datasets are highlighted in bold. After the discretization, the accuracy of Ripper decreases on most of the datasets, except for adult and bank with \(o = 0\), and for sec-mushroom. Similarly, CG loses accuracy on 5 out of 7 datasets, and gains on adult and hypo. For census, it produces a similar error message as on the original dataset. CG runs slightly faster on some discretized datasets, but much slower on skin and sec-mushroom. Ripper also takes longer on most of the discretized datasets. In 4 cases, it even ran out of memory (in particular in the default setting with 2 optimization runs).

Table 6 Data discretization time (in s) included in the runtime of Lord and Lord* in comparison with the runtimes of the other algorithms

Table 6 shows the runtime of the Lord algorithm, including the data discretization time, compared to the runtime of Ripper and CG on 4 datasets. For the other datasets, the discretization time is negligible, typically tens to hundreds of milliseconds. Even with the additional discretization time, Lord retains its advantage in execution time over Ripper’s default configuration with two optimization runs.

In summary, discretization does not generally improve the performance of CG and Ripper, so that we can conclude that the discretization does not provide an unfair advantage to Lord. Note that we also have not spent much effort on optimizing the discretization.

4.6 Scalability analysis

This section analyzes the scalability of Lord, which is potentially affected by two factors: the data size and the number of threads running in the rule learning phase. For these experiments, we use the susy dataset with subsets increasing in size, up to 5,000,000 examples. Figure 3a and b show the memory consumption and the runtime of Lord w.r.t. the number of examples, respectively. While the memory consumed in the first phase (before the rule learning phase) and by the data structures increases approximately linearly with the data size, the total memory peak and the runtime show a super-linear increase. The runtime increase between 1 and 2 million examples is approximately 2.17 times smaller than the corresponding increase between 4 and 5 million. For data sizes from 1 to 3 million, the total memory peak is the memory peak of the rule learning phase, which is greater than that of the first phase; for data sizes from 4 to 5 million, the total memory peak is the memory peak of the first phase, where, in addition to the PPC-tree, we also need to store the N-lists of selectors and re-code the training examples as arrays of selector IDs.

In the second phase, the memory for the PPC-tree is freed, and the memory required for storing the N-lists of selectors and the R-tree is much smaller and also increases at a slower rate than the memory consumed by the PPC-tree. The main reason for this is that the N-nodes in N-lists contain only a PP-code and the corresponding frequency (Definition 5), whereas the PPC-nodes additionally store a selector, a children list, and a reference to a parent to maintain the tree structure (Definition 4). This also allows for a more efficient implementation as a 2-dimensional array (\(3 \times l\)), where each of the three components of the l N-nodes in the list is stored in a separate row.
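A minimal sketch of this layout, with illustrative names: instead of l node objects, one \(3 \times l\) array holds the three components row-wise.

class CompactNList {
    final int[][] data;                      // 3 x l: rows hold pre-order, post-order, frequency
    CompactNList(int l) { data = new int[3][l]; }
    int pre(int i)  { return data[0][i]; }   // pre-order rank of the i-th N-node
    int post(int i) { return data[1][i]; }   // post-order rank of the i-th N-node
    int freq(int i) { return data[2][i]; }   // frequency of the i-th N-node
}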

Fig. 3 Influence of data sizes and thread counts on the consumed memory and the runtime

The memory used by the data structures (i.e., PPC-tree, N-lists of selectors, R-tree) depends only on the data size, not on the number of threads used for learning rules. Even though there is only a single master copy of the distinct selectors, which is read-only and shared among the threads, each rule learning thread maintains a local NListSet for caching the N-lists of the selector-sets that it encounters while finding the best rule for each single example. Thus, the memory peak of the rule learning phase increases with the number of rule-finding threads and may eventually exceed the memory peak of the first phase. We verify this assumption with experiments on the full susy dataset, shown in Fig. 3c. The memory peak of the learning phase increases linearly with the number of threads, but remains smaller than the memory occupied by the PPC-tree. Therefore, for a modest number of threads (we could experiment with up to 8 threads), the total memory peak of Lord is the memory peak during the construction of the PPC-tree and the N-lists of selectors, which does not depend on the number of threads.
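A hedged sketch of such a per-thread cache (the key type, a canonical string encoding of a selector-set, is a simplifying assumption):

import java.util.HashMap;
import java.util.Map;

class NListCache {
    // Each worker thread keeps a private map from selector-sets to their N-lists
    // (the per-thread NListSet described above), so no synchronization is needed.
    static final ThreadLocal<Map<String, int[][]>> CACHE =
            ThreadLocal.withInitial(HashMap::new);
}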

Figure 3d shows the runtime of Lord on the full susy dataset w.r.t. the number of threads. We can see that it is very close to the optimal speed-up, in which doubling the number of threads roughly halves the runtime. This comes from the fact that the single-threaded construction of the PPC-tree and N-lists runs comparably fast, so that the parallel rule learning phase takes most of the runtime of the Lord algorithm. This phase, in turn, can run in parallel with very high load balance thanks to the inherent independence of finding the best rule for each single example.

In summary, the memory-related scalability of Lord is essentially determined by the construction of the PPC-tree, so that Lord scales as well as approaches based on PPC-trees or similar tree structures such as FP-trees.

4.7 Influence of the m-estimate heuristic

Figure 4 shows the impact of the parameter m of the m-estimate heuristic on the classification accuracy for a representative subset of the 24 datasets from our main experiments. The optimal value range of the m parameter varies among the datasets, but all accuracy curves share the same characteristic: a single peak with a gradual reduction of the accuracy on both sides. This is not surprising, as it is known that the m parameter provides a flexible trade-off between precision, which tends to overfit, and weighted relative accuracy, which tends to overgeneralize in predictive rule learning (Fürnkranz & Flach, 2005; Janssen & Fürnkranz, 2010). A special case is mushroom, which is noise-free and thus has a flat peak, retaining a perfect accuracy of 1 over the value range [0, 10] of m. In general, the accuracy varies with m in a simple, unimodal shape, so that a good value of m is easy to find.
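For reference, for a rule covering p positive and n negative examples, where P and N denote the total numbers of positive and negative examples, the m-estimate is commonly defined as \(h_m = (p + m \cdot \frac{P}{P+N})/(p + n + m)\) (Fürnkranz & Flach, 2005): \(m = 0\) yields precision \(p/(p+n)\), while \(m \rightarrow \infty\) drives the value towards the class prior \(P/(P+N)\) and orders rules like weighted relative accuracy, mirroring the trade-off described above.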

Fig. 4 Influence of the m-estimate heuristic on the classification accuracy. The x-axes show the value of the parameter m of the m-estimate heuristic, the y-axes the classification accuracy

4.8 Analysis of hill-climbing variants

Recall from Sect. 3.4 that Lord’s rule refinement process consists of a single greedy rule growing phase, followed by a single greedy rule pruning phase, as is also realized in many other rule learning algorithms such as Ripper. Lord thus finds a local optimum with respect to this 2-phase optimization procedure. However, one may rightfully argue that this local optimum could be further improved with additional pruning and growing steps, and that, in fact, a 2-phase local optimum may not necessarily be optimal in that case.

To check this, we have implemented OverLord, a variant that adds another greedy refinement phase in the opposite direction whenever the previous phase has improved the current local optimum. Thus, if the pruning phase of Lord improves the local optimum of the previous growing phase, OverLord follows up with another growing phase, which, in case it further improves the local optimum, is followed by another pruning phase, and so on. The process terminates when the last phase did not achieve any improvement. As the phase before the last one had itself reached a point where no further improvement in its own direction was possible (which is why the direction was switched), the found rule is locally optimal w.r.t. both specialization and generalization.
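A minimal sketch of this alternating refinement loop (illustrative names; grow and prune each stand in for one full greedy phase that runs until no further improvement in its own direction):

class OverLordRefinement {
    interface Rule {}                                  // hypothetical placeholders for Lord's
    static Rule grow(Rule r)  { return r; }            // greedy growing phase and
    static Rule prune(Rule r) { return r; }            // greedy pruning phase
    static double evaluate(Rule r) { return 0.0; }     // heuristic value of a rule

    // Alternate pruning and growing until a phase yields no improvement.
    static Rule refine(Rule seed) {
        Rule rule = grow(seed);                        // initial growing phase (as in Lord)
        double best = evaluate(rule);
        boolean pruneNext = true;                      // Lord's single pruning phase comes next
        while (true) {
            Rule candidate = pruneNext ? prune(rule) : grow(rule);
            double h = evaluate(candidate);
            if (h <= best) return rule;                // last phase brought no improvement: stop
            rule = candidate;                          // improvement: keep the result and
            best = h;                                  // switch the refinement direction
            pruneNext = !pruneNext;
        }
    }
}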

Table 7 Comparison between Lord and OverLord on the 11 datasets where performance differences were noticeable. Each cell shows (from top to bottom) the runtime (in s), the average accuracy, the number of generated rules, and the average rule length

We have experimentally compared Lord and OverLord on all 24 datasets of Table 1. On 13 datasets, the results were exactly the same (except for minor differences in runtime). Table 7 summarizes the results in terms of accuracy, rule complexity, and runtime on the remaining 11 (medium to very large) datasets. We show the results of two settings, the default \(m=0.1\) and the best m for each dataset, with the better accuracy highlighted in bold. The runtime of the two versions is similar on the first 9 datasets, but on the last 2 very big ones, the additional runtime of OverLord is noticeable.

In cases where a difference is noticeable, it is typically less than \(1\%\). Also, maybe somewhat surprisingly, the improved local optimum on the training data does not consistently improve the performance on the test data. For \(m=0.1\), we observed 4 wins for Lord and 5 wins for OverLord (with a total of 15 ties, including the 13 datasets that are not shown); for the best m on each dataset, Lord was ahead on 6 datasets and OverLord on 4, with 14 ties. We explain this with a somewhat increased tendency towards overfitting: the rule lengths of OverLord are typically slightly longer than those of Lord, because the rules that are further refined after the first pruning phase naturally become more complex, thus apply to fewer data points, and are rarely further improved by a second pruning phase. As this is a result of the search procedure, it may also be viewed as an instance of over-searching (Quinlan & Cameron-Jones, 1995; Janssen & Fürnkranz, 2009). In any case, the differences are very small and occur without a clear pattern.

In general, we conclude from these experiments that two optimization phases, one for growing and one for pruning, are sufficient, and that additional phases can be applied as an optional setting for slightly better performance in some cases, but may also reduce the performance in others.

4.9 Analysis of rule filtering

Finally, we take a closer look at the high number of rules that are learned by Lord. In principle, Lord tries to identify the best rule for every single example. However, many of these rules are duplicates already when they are found, and others are removed in a post-processing phase. We look at the magnitude of these reductions and evaluate whether such a high number of rules is necessary.

Columns 3 and 4 of Table 8 show the number of instances and the number of best rules discovered from these instances. We see that, on average, the number of rules is about 18% of the number of examples, which means that approximately five examples share the same best rule after our first greedy search phase. However, the exact values vary considerably, ranging from less than 1% on the easy mushroom dataset to more than half on the large susy dataset. This seems to relate to semantic or distributional properties of the datasets, as no clear relation to characteristics such as dataset size is recognizable.

Furthermore, as mentioned in Sect. 3.4, Lord has an additional filtering phase, which compacts the resulting rule sets by removing every rule that is not the best covering rule of any training example. The last two columns of Table 8 show the effect of this strategy. The highest reduction happens for pamap2, where 86.77% of the learned rules are filtered out. On average, about 38% of the rules found in the learning phase are eliminated in the filtering phase, even though every rule was a local optimum for at least one training example in the rule search phase. This means that a large portion of training examples abandon their own locally best rule and adopt the locally best rule of another training example as a better one. Thus, the negative effects of a greedy search for local optima are greatly reduced by this additional simple filter.
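The filtering phase can be sketched as follows (a minimal sketch with hypothetical types, not the authors’ implementation): only rules that are the best covering rule of at least one training example are kept.

import java.util.*;

class RuleFilter {
    interface Example {}
    interface Rule {
        boolean covers(Example e);
        double heuristic();
    }

    // Keep only rules that are the best covering rule of at least one training
    // example; all other rules are dominated and can be dropped.
    static Set<Rule> filter(List<Example> examples, Collection<Rule> rules) {
        Set<Rule> kept = new HashSet<>();
        for (Example e : examples) {
            Rule best = null;
            for (Rule r : rules)
                if (r.covers(e) && (best == null || r.heuristic() > best.heuristic()))
                    best = r;
            if (best != null) kept.add(best);
        }
        return kept;
    }
}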

Table 8 Reduction of the number of rules in the learning and the filtering phases
Fig. 5 Classification performance of Lord against the percentage of best rules retained in the rule sets produced from the medium and big datasets

However, the remaining numbers of rules are still extremely large. For example, for the 5 million examples of susy, more than 2.8 million different rules are found in the first place, which are reduced to ca. 1.6 million rules after filtering. An obvious question is whether this large number of rules can be further reduced. While we cannot give a definite answer, we try to shed some light on this question by looking at the prediction quality with a varying rule quality threshold.

Figure 5 depicts the classification performance of Lord against the percentage of best rules retained in the rule sets for the 12 medium and big datasets. Every data point in the figure shows the accuracy of a rule set consisting of the \(p\%\) of rules with the highest evaluation. Even though the accuracy gains in the first steps are larger than those in the later steps, in most cases the accuracy of Lord continues to increase with increasing rule set size, as can be clearly observed, e.g., for the two gas-sensor datasets as well as for cover-type and PUC-Rio; the most notable exceptions are census and bank. This implies that, even though better rules seem to contribute more to the overall predictive accuracy, the rules that contribute to improving the classification performance are distributed throughout the whole range of the ordered rule set, and that seemingly weaker rules are also necessary for maintaining a high classification performance.
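Each data point in Fig. 5 can be reproduced by a simple cut-off of this kind (a sketch with a hypothetical Rule type; the evaluation of the reduced rule set on the test data is omitted):

import java.util.*;

class TopRulesEvaluation {
    interface Rule { double heuristic(); }

    // Retain the p% of rules with the highest heuristic evaluation.
    static List<Rule> topPercent(List<Rule> rules, double p) {
        List<Rule> sorted = new ArrayList<>(rules);
        sorted.sort(Comparator.comparingDouble(Rule::heuristic).reversed());
        int keep = (int) Math.ceil(sorted.size() * p / 100.0);
        return sorted.subList(0, keep);
    }
}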

This result seems to reflect that higher-quality rules, which recognize usual and common cases, are not enough to discriminate all cases well. Lower-quality rules (with low coverage, but in large quantity) are responsible for recognizing rare and exceptional cases, and should therefore also be included in rule sets for better classification accuracy. This is reminiscent of the result of Holte et al. (1989), who observed that small disjuncts (i.e., rules with a low coverage) contribute a fair percentage of the overall accuracy of a classifier, so that their removal may be futile.

In any case, we note that the number of rules covering an unseen example is much smaller than the total number of rules in the rule set, and examining the set of rules that cover an example is much easier and more focused than examining the complete rule set. When it comes to interpretation, one only has to deal with a small percentage of best rules, which recognize common cases and reflect general principles. In this way, the complexity of interpreting rule sets can be reduced. In fact, unseen examples are always classified with a single rule (cf. Definition 8), which yields a natural justification for the predicted class. Nevertheless, interpretability is clearly not the main focus of the Lord rule learner.

5 Conclusion

With the Lord algorithm, we have introduced a new approach to predictive rule learning in which a locally optimal rule is found for each training example, and the best covering rule is chosen for the classification of a new example. Although a large percentage of training examples share their local optima after a final filtering phase, the number of rules discovered by Lord is larger than that of conventional rule learners. Nevertheless, the rule sets learned by Lord outperform those of state-of-the-art rule learners in terms of predictive accuracy. More importantly, despite the large number of found rules, Lord is very efficient and can be applied to very large databases that cause problems for its competitors. This is primarily due to the adaptation of efficient data structures, originally developed for association rule discovery, to a classification learning setting. Lord also features an inherently parallel search for rules, which could be transferred to a massively parallel architecture for dealing with big data.

We have also seen that, even though better rules contribute more to the predictive accuracy, the rules that contribute to improving the classification performance are distributed throughout the range of an ordered rule set, so that a further filtering of the rules appears to be non-trivial. Although each rule remaining in a rule set after the filtering step is a local optimum for at least one training example, it is still possible that rules can be removed without changing the performance on a given test set. A closer investigation of this issue, as well as a method for detecting such redundant rules, would help to reduce the rule set complexity and improve the interpretability of the found rule sets.

So far, we have not given much thought to the classification phase of the algorithm. Currently, we use the best rule (according to the training set performance) among all rules that cover a test example. Other rule selection methods, or possibly also voting techniques, might further improve the prediction performance. In particular, we are also considering adapting the method to a transductive setting, where the best rule for each test example is learned on the fly and, if it is a new local optimum, added to the rule set. However, this problem turns out to be harder than expected, and a straightforward adaptation of Lord, which forms the best rule covering the test example for each possible class, performed worse than Lord.