Re-interpreting Rules Interpretability

Trustworthy machine learning requires a high level of interpretability of machine learning models, yet many models are inherently black-boxes. Training interpretable models instead, or using them to mimic the black-box model, seems like a viable solution. In practice, however, these interpretable models are still unintelligible due to their size and complexity. In this paper, we present an approach to explain the logic of large interpretable models that can be represented as sets of logical rules by a simple, and thus intelligible, descriptive model. The coarseness of this descriptive model and its fidelity to the original model can be controlled, so that a user can understand the original model in varying levels of depth. We showcase and discuss this approach on three real-world problems from healthcare, material science, and finance.


Introduction
One of the key challenges for machine learning (ML) models to be adopted in critical applications, such as autonomous driving and healthcare, is that the model must be explainable [1]. The explainability is not only demanded by practitioners, but is in fact required by law in the EU with the European Parliament's General Data Protection Regulation (GDPR) introducing the right to receive explanations of decisions made by AI systems. There are two different types of explanations: (i) local explanations, i.e., a justification for an individual decision, also termed post-hoc explanation [1], and (ii) a global explanation of the overall logic and behavior of a model. The latter one is often a generalization of the former, since from such an understanding of the model individual decisions can be justified as well.
The usual way to obtain a global explanation is to use a model that inherently allows such an understanding. Typical examples are decision trees and rule ensembles. Studying the rules, or equivalently the paths in the decision tree, allows a user to understand the logic of the model, as well as to justify individual predictions. If instead an existing model that is not inherently explainable (e.g., a deep neural network) needs a global explanation, then it can be obtained by training an explainable mimic model [2] that approximates the black-box model's behavior.
To achieve high predictive quality, explainable models often need to be very large in terms of their number of rules. This also holds for mimic models that aspire to achieve high fidelity to the original black-box model. For example, a tree may have hundreds of nodes and tens of levels, and a rule ensemble may consist of hundreds of rules with complex conditions. Therefore, models that are interpretable in principle often remain beyond human perceptual and cognitive capabilities due to their size [3].
A lot of prior research has focused on training compact explainable models, e.g., via special tuning or post-processing [4], in order to reduce the model size and improve its understandability. However, the achievable degree of reduction is significantly constrained by the need to preserve the prediction accuracy. Even more importantly, such a mimic model will have a different logic than the original model, and its relationship to the original model will be unclear. Hence, instead of explaining how the original model comes to its predictions, the mimic model demonstrates alternative ways to come to the same or similar predictions. While this is, perhaps, the only viable possibility when the original model is a true black box, it may be less desirable when the model logic can, in principle, be understood by a human. In the latter case, a preferable approach would be to facilitate the comprehension of the original model logic rather than to substitute it with another logic.
In this paper, we present an approach to facilitate the comprehension of an existing model that is representable as a set of conjunctive rules (e.g., rule ensembles themselves, decision trees, random forests, tree ensembles) and that is explainable in principle but not in practice due to its size. The idea is to extract the general logic from the model by uniting its rules based on their similarity. A union rule not only substitutes multiple original rules, but also typically consists of fewer logical conditions than each of the original rules. Thus, the resulting set of union rules, even with additionally possible exceptions from them, becomes more comprehensible.
Unlike the existing methods that aim at reducing the size of a model while preserving its accuracy on the data, our method creates a new model that describes the original model at hand. The original model serves as an input for the algorithm, and the output is a descriptive model of the original model, which also has the form of a set of conjunctive rules. The purpose of this descriptive model is not to make predictions for data instances but to tell how the original model works. Therefore, the descriptive model is not evaluated in terms of the accuracy of its predictions but in terms of its correspondence to the original model. For this purpose, we introduce a novel measure called Coherence Coefficient, which shows how consistent the descriptive rules are with the rules they are intended to describe. This measure allows a user to regulate the degree of inconsistency of the descriptive model with the model at hand.
Hence, the very idea of our approach is principally different from the ideas behind the existing methods for model simplification that strive to preserve and improve the performance. Our contributions thus are:
• Introduction of an approach that produces a descriptive model of a model that is explainable in principle but too large for comprehension for the purpose of facilitating the understanding of the model logic.
• The achievable degree of simplification is not restricted by the requirement to preserve the prediction quality of the original model, different from the multitude of known approaches for training more compact rule-based models.
• Union rules of the descriptive model can be explored in detail by tracing the hierarchy of more specific rules that were involved in the derivation of the union rules.
• The construction of the descriptive model is fully transparent, and its relationship to the original model is absolutely clear.
The remainder of this paper is organized as follows. We first discuss the relation of the proposed approach to training compact (mimic) models in Section 2. We then present our approach in Section 3, followed by exploring the algorithm via visualizations on a typical application in Section 4. We empirically evaluate the approach in Section 5 and conclude by a discussion of the contribution and its limitations, as well as future work in Section 6.

Related work discussion
In ML, certain types of models, namely, decision trees, decision tables, and rules are considered to be inherently interpretable [2], as they can be represented in a human-readable form. However, the actual comprehensibility of such a model greatly depends on its complexity [5,6], which is typically roughly estimated in terms of the model size [2]. Therefore, the existing ML algorithms that generate decision trees or rules usually strive to reduce the model size by pruning the tree or compressing the set of rules (e.g., RuleFit [7]) so that the smaller model is still as accurate as the big one. Making models more compact is a vast area of research mainly due to the expected improvements in generalization and stability properties of the obtained solutions. Al-Akhras et al. [8] discuss a popular approach to avoiding overfitting in decision trees in which a more compact tree is produced by reducing the number of instances used to build the model. The approaches directed at reducing the number of instances were surveyed by Wilson and Martinez [9]. Pruning of decision trees is another popular way to achieve higher stability [10]. Helmbold and Schapire [11] propose an alternative algorithm that avoids pruning. Compactness of sets of rules is also a matter of concern. Dash et al. [12] propose an algorithm for creating compact whilst sufficiently accurate sets of rules using integer programming. In general, enforcing sparseness of the learned rules is a popular problem addressed by, e.g., Su et al. [13]. Alternative approaches propose a different interpretable class of models that is trained in a way to be sparse [14,15].
The research on compressing intelligible models is mostly based on regularization techniques applied during training. Thus, Joly et al. [16] propose to use L1 compression for random forests in order to decrease the prohibitively long computation time of big forests. Alternatively, motivated by limited storage space, Painsky and Rosset [17] propose to encode a random forest in a lossless way, or in a lossy way with guarantees, which avoids storing the full models. Sometimes a model that is interpretable by design is even compressed into a black-box model, such as a neural network [18], in order to keep the storage space small and the performance high. In general, the aforementioned works aim at obtaining more stable, smaller, and better generalizing intelligible models during training. The achievable degree of compactness is always restricted by the final performance of the model [19].
On the other hand, while the creation of rule-based mimic models is a typical approach to explaining the behavior of black-box models, such as neural networks [2], the research on improving the explanations obtained in this way is ongoing. It is clear that applying pruning or other compression techniques during training when trying to mimic a complex black-box model will lead to a loss in fidelity. To achieve both goals, Qiao et al. [20] recently proposed a novel approach in which a set of decision rules is generated by a neural network with a special two-layer architecture. The authors also proposed a sparsity-based regularisation approach to balance between classification accuracy and the simplicity of the derived rules. For now, this approach is limited, as it cannot be applied to an arbitrary black-box model at hand. So, current research, on the one hand, acknowledges the problem of comprehending large rules sets or decision trees but, on the other hand, does not consider the possibility of creating approximate simpler descriptive models instead of directly training more compact ones. Note that creating a descriptive model that helps to interpret the original model is fundamentally different from training a more compact model with similar accuracy. The descriptive model seeks to explain a given model at hand, while training a different, more compact model seeks to replace it and makes it much harder to connect the functionality of the initial black box with the interpretable and compact mimic model.
Freitas [21] discusses that decision trees and rules have different properties in terms of interpretability and that decision trees are usually perceived better when transformed to rules sets. This is also confirmed by Quinlan [22], who considers multiple approaches to pruning decision trees and finalises with the transformation to rules as a help for understanding. A random forest model consisting of multiple trees can also be transformed to a set of rules, for example, using a novel approach from Bénard et al. [23], which is close to the RuleFit [7]. Furthermore, it is argued that a representation in the form of rules can be more compact than a decision tree, because rules can include only significant clauses and have no repeated occurrences of the same variable [2]. Another work [24] discusses high redundancy in decision trees and proposes a method for extracting non-redundant rule-like explanations from a decision tree. The arguments about advantages of rules over trees substantiate the focus of our research on sets of rules.
Since our approach involves unions of rules, it is partly related to the works where rules or decision trees are merged for various purposes. Hierarchical merging of several trees was addressed in the context of learning decision trees from multiple sources of data, where the challenge is to produce one tree that covers the decisions of the others [25]. Another problem that has been addressed is the construction of consensus trees from different ones with the goal of producing a more stable model [26]. A framework for combining multiple rule-based models that have been created for different subproblems is proposed by Strecht et al. [27]. Rules from different models are combined by computing their intersections. After resolving conflicts, the resulting rules set is minimised by uniting nearly-identical rules. A similar approach to joining rules is taken by Andrzejak et al. [28]. Our approach also involves an operation of rule union, but, unlike others, it allows a controlled decrease of rule accuracy for achieving a higher degree of simplification.
Our research involves not only the development of an algorithmic method to obtain a descriptive model, but also the creation of interactive visual techniques for exploring sets of rules and investigating the behavior and results of the algorithm. Combining computational methods with interactive visual interfaces is at the core of Visual Analytics (VA) [29]. In particular, VA techniques allow human experts to be involved in the creation of ML models [30]. This way, humans can contribute not only their background knowledge, but also new knowledge gained in the process of interactive data analysis [31] through discovery and abstraction of patterns existing in data [32]. Currently, the problem of explaining ML/AI models is receiving much attention in VA [33]; however, the techniques proposed so far address mostly the needs of model developers rather than domain experts. As an exception, RuleMatrix [34] visualizes rule sets for users with little machine learning experience, but it does not address the problem of model simplification and is severely limited by the size of the rule set.
In the area of visualization research, a comparative evaluation of four basic techniques for visual representation of rules sets, namely, symbolic and graphical encoding of conditions with and without vertical alignment of conditions referring to the same features, has been conducted recently [35]. The experiments showed the superiority of the representations that use feature alignment, which is valid for our table view. Graphical encoding is advantageous to textual, although the effect is less pronounced compared to that of feature alignment. However, the experiments were conducted using small sets of rules, whereas effective visualization of large models is still a challenging task.

Main concepts
In the following, we define the terms that we will be using throughout the paper. We assume that one wants to interpret a predictive model $h : \mathcal{X} \to \mathcal{Y}$ at hand, with input space $\mathcal{X}$ and output space $\mathcal{Y}$, that can be rewritten as a collection of rules, i.e., $h = \mathcal{R}$, where each rule $R \in \mathcal{R}$ consists of an antecedent, which is a conjunction of conditions, and a consequent, which is the prediction $r$ of the rule. The input space $\mathcal{X}$ consists of instances with $d$ features $f_i$, $i \in [d]$, which can be numerical or categorical. The output space $\mathcal{Y}$ can be either categorical for classification or numerical for regression tasks.
A condition $c$ is a logical expression of the form $f_i \in V$, where $V$ is a set of values (for a categorical $f_i$) or an interval (for a numeric $f_i$) that restricts the values that $f_i$ can take. Such a condition can be a splitting condition from a decision tree node or a part of a conjunctive logical rule.
Definition 1. A rule $R_1$ covers a rule $R_2$, denoted $R_1 \supseteq R_2$, if for every feature $f_i$ the set of values admitted by $R_1$ includes the set of values admitted by $R_2$, where a feature that is not used in a rule admits its whole range of values.
By definition, any rule covers itself. When $R_1 \supseteq R_2$ and $R_1 \neq R_2$, then $R_1$ is more general and $R_2$ is more specific. Note that $R_2$ may include conditions involving features that do not appear in conditions of $R_1$, i.e., $R_1$ may have fewer conditions than $R_2$. For each rule $R \in \mathcal{R}$, where $\mathcal{R}$ is the set of all rules in the model, we can identify the set of rules $\mathcal{R}_{\supseteq}$ that are covered by it. When the set of rules $\mathcal{R}$ is optimal in the sense of our approach, such sets are trivial: $\mathcal{R}_{\supseteq} = \{R\}$.
Definition 2. Predictions of two rules $R_1$ and $R_2$, denoted by $r_1$ and $r_2$, are congruent, $r_1 \cong r_2$, if one of the following conditions holds:
• $r_1 = r_2$;
• $|r_1 - r_2| \leq \epsilon$ when $r_1$ and $r_2$ are numbers;
• $\max(r^{up}_1, r^{up}_2) - \min(r^{low}_1, r^{low}_2) \leq \epsilon$ when $r_1$ and $r_2$ are numeric intervals,
where $\epsilon$ is a tolerance threshold used during the run of the algorithm.
Note that the last case of intervals is needed for the further work of the algorithm with union rules (defined in the following).
Definition 3. We say that $R_1 \supseteq R_2$ correctly if $r_1 \cong r_2$. In this case the coverage of $R_2$ by $R_1$ is correct; otherwise, the coverage is wrong. If $R_1 \supseteq R_2$ wrongly, then $R_2$ is an exception of the covering rule $R_1$.
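The notions of congruent predictions and correct coverage can be illustrated by a minimal sketch. It assumes that a rule antecedent is stored as a dictionary mapping feature names to numeric intervals and that an explicit condition always restricts its feature; the names `congruent`, `covers`, and `covers_correctly` are hypothetical and not taken from the authors' implementation.

```python
from typing import Dict, Tuple, Union

Interval = Tuple[float, float]
Prediction = Union[str, float, Interval]
Conditions = Dict[str, Interval]  # assumed encoding: feature name -> (low, high)


def congruent(r1: Prediction, r2: Prediction, eps: float = 0.0) -> bool:
    """Congruence of two predictions (Definition 2)."""
    if r1 == r2:
        return True
    if isinstance(r1, tuple) and isinstance(r2, tuple):
        # Numeric intervals: the joint span must not exceed the tolerance.
        return max(r1[1], r2[1]) - min(r1[0], r2[0]) <= eps
    if isinstance(r1, (int, float)) and isinstance(r2, (int, float)):
        # Plain numbers: the absolute difference must not exceed the tolerance.
        return abs(r1 - r2) <= eps
    return False


def covers(c1: Conditions, c2: Conditions) -> bool:
    """True if every interval of the first antecedent encloses the
    interval of the second antecedent for the same feature."""
    for feature, (lo1, hi1) in c1.items():
        if feature not in c2:
            # Assumption: an explicit condition is a real restriction, so it
            # cannot enclose the unrestricted range of a missing feature.
            return False
        lo2, hi2 = c2[feature]
        if lo2 < lo1 or hi2 > hi1:
            return False
    return True


def covers_correctly(c1: Conditions, r1: Prediction,
                     c2: Conditions, r2: Prediction, eps: float = 0.0) -> bool:
    """Correct coverage (Definition 3): coverage plus congruent predictions."""
    return covers(c1, c2) and congruent(r1, r2, eps)
```

In this sketch, a covered rule for which `covers` holds but `covers_correctly` does not would be an exception of the covering rule.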
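Definition 4. The coherence coefficient (CC) of a rule $R$ with prediction $r$ is the ratio of the number of correctly covered rules to the total number of covered rules:
$$CC(R) = \frac{|\{R' \in \mathcal{R}_{\supseteq} : r' \cong r\}|}{|\mathcal{R}_{\supseteq}|},$$
where $r'$ denotes the prediction of a covered rule $R'$.
Definition 5. A rule whose CC < 1, i.e., a rule having at least one exception, is called a rough rule.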

Definition 6. A roughness threshold $\rho \in [0, 1]$ defines the minimal acceptable value of CC of a rule included in a descriptive model during the run of the algorithm.
So, specifying ρ = 1 means that no rough rules are allowed, and the smaller ρ gets, the more exceptions rough rules are allowed to have.
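The coherence coefficient and the notion of a rough rule can be sketched as follows. This is only an illustration under simplifying assumptions: predictions are class labels (so congruence reduces to equality), conditions are numeric intervals, and the `Rule` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Rule:
    conds: Dict[str, Tuple[float, float]]  # feature name -> (low, high)
    pred: str                              # class label


def covers(a: Rule, b: Rule) -> bool:
    """True if each interval of `a` encloses the interval of `b` for that feature."""
    return all(
        f in b.conds and lo <= b.conds[f][0] and b.conds[f][1] <= hi
        for f, (lo, hi) in a.conds.items()
    )


def coherence(rule: Rule, originals: List[Rule]) -> float:
    """Share of the covered original rules whose prediction equals the rule's."""
    covered = [r for r in originals if covers(rule, r)]
    if not covered:
        return 1.0  # assumption: a rule that covers nothing has no exceptions
    correct = sum(r.pred == rule.pred for r in covered)
    return correct / len(covered)


def is_rough(rule: Rule, originals: List[Rule]) -> bool:
    """A rule is rough if it has at least one exception, i.e. CC < 1."""
    return coherence(rule, originals) < 1.0
```

A rule would then be acceptable for a given roughness threshold `rho` whenever `coherence(rule, originals) >= rho`.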
For a better understanding of the concept of rule coverage, imagine the multidimensional space of the features (assuming, for simplicity, that all features are numeric). Conditions of a rule antecedent define a multidimensional shape (namely, a rectangular hyper-parallelepiped) in this space. When some feature $f_i$ is not used in a rule explicitly, it can be treated as being involved in an implicit condition $f_i \in V$ where $V$ is the whole range of possible feature values. A rule $R_1$ covers rule $R_2$ (Definition 1) when the shape $p_1$ defined by $R_1$ includes the shape $p_2$ defined by $R_2$. Please note that any rectangular parallelepiped $p$ in this space corresponds to some conjunction of conditions, even if there is no rule with such an antecedent. For two or more shapes, it is possible to create a rectangular parallelepiped that encloses all these shapes. The smallest parallelepiped $p_\cup$ enclosing the shapes $p_1$ and $p_2$ defined by the conditions of rules $R_1$ and $R_2$ represents the union of the antecedents of $R_1$ and $R_2$.
When we apply the union operation also to the predictions $r_1$ and $r_2$ of the rules $R_1$ and $R_2$, we obtain a new rule $R_\cup$, which is the union of the rules $R_1$ and $R_2$. The rule $R_\cup$ is meaningful only when the predictions $r_1$ and $r_2$ are congruent (Definition 2); so, our algorithm makes unions only from rules with congruent predictions.
Incidentally, $p_\cup$, apart from $p_1$ and $p_2$, may also include parallelepipeds corresponding to antecedents of some other rules; hence, a union $R_\cup$ of two rules $R_1$ and $R_2$ may additionally cover other rules. Some of those other rules may have predictions incongruent to the prediction of $R_\cup$. In such a case, $R_\cup$ is a rough rule (Definition 5), and the rules with incongruent predictions are its exceptions (Definition 3).
Let us now define the union of two rules more formally.
Definition 8. A union of two predictions $r_1$ and $r_2$, denoted $r_\cup$, is defined as
• $r_\cup = r_1 \cup r_2$ when $r_1$ and $r_2$ are distinct sets of discrete values,
• $r_\cup = [\min(r^{low}_1, r^{low}_2), \max(r^{up}_1, r^{up}_2)]$ when $r_1$ and $r_2$ are numeric intervals.
Definition 9. A union of two rules $R_1$ and $R_2$ with congruent predictions $r_1$ and $r_2$ is a rule $R_\cup$ where each condition is a union of conditions from $R_1$ and $R_2$ according to Definition 7 and the prediction $r_\cup$ is the union of $r_1$ and $r_2$ according to Definition 8. Since union is defined for congruent rules, it follows that $R_\cup \supseteq R_1$ and $R_\cup \supseteq R_2$ correctly.
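A small sketch may help to make the union operation concrete. It assumes numeric conditions stored as intervals, takes the union of two conditions on the same feature to be the smallest enclosing interval (following the geometric description of the enclosing parallelepiped), and drops a feature that is constrained in only one of the two rules, since its implicit condition in the other rule is the whole value range. The function names and data encodings are hypothetical.

```python
from typing import Dict, FrozenSet, Tuple, Union

Interval = Tuple[float, float]
Conditions = Dict[str, Interval]
Prediction = Union[Interval, FrozenSet[str]]


def union_conditions(c1: Conditions, c2: Conditions) -> Conditions:
    """Smallest box enclosing both antecedents.

    Only features constrained in both rules keep a condition; a feature
    missing from one rule is unrestricted there, so it is unrestricted
    in the union as well.
    """
    return {
        f: (min(c1[f][0], c2[f][0]), max(c1[f][1], c2[f][1]))
        for f in c1.keys() & c2.keys()
    }


def union_prediction(r1: Prediction, r2: Prediction) -> Prediction:
    """Union of two predictions (Definition 8)."""
    if isinstance(r1, tuple) and isinstance(r2, tuple):
        # Numeric intervals: span from the lowest lower to the highest upper bound.
        return (min(r1[0], r2[0]), max(r1[1], r2[1]))
    # Sets of discrete values: ordinary set union.
    return frozenset(r1) | frozenset(r2)
```

The caller is expected to check that the two predictions are congruent before forming the union, as the text requires.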

Distance function
In order to perform the hierarchical merging of rules, we define a distance function on the space of rule antecedents. We set the distance between two rule antecedents to be the sum of the distances between the value intervals $V$ of the same feature $f_i$ in the conditions of the rules. So if $c_1 = f_i \in [v^{low}_1, v^{up}_1]$ and $c_2 = f_i \in [v^{low}_2, v^{up}_2]$, then the distance between $c_1$ and $c_2$ is
$$d_{f_i}(c_1, c_2) = \frac{\max(|v^{low}_1 - v^{low}_2|, |v^{up}_1 - v^{up}_2|)}{v_{max} - v_{min}},$$
where $v_{max}$ and $v_{min}$ are the absolute maximal and minimal values, respectively, of the feature $f_i$ that may occur in practice. This distance metric is, in fact, a specific formulation of the Hausdorff distance [36] for numeric intervals. The division by $(v_{max} - v_{min})$ normalizes all distances between conditions to the interval [0, 1].
For categorical features, rule conditions contain discrete sets of categorical values instead of numeric intervals. In this case, the distance between two conditions can be defined as the Jaccard similarity index [37] subtracted from 1, i.e., if $c_1 = f_i \in A$ and $c_2 = f_i \in B$, where $A$ and $B$ are sets, then
$$d_{f_i}(c_1, c_2) = 1 - \frac{|A \cap B|}{|A \cup B|}.$$
The distance equals 0 when $A$ and $B$ are identical and 1 when the sets have no common elements.
Based on the distances between corresponding conditions, the distance between the rules $R_1$ and $R_2$ is $d(R_1, R_2) = \sum_{f_i} d_{f_i}$, where $f_i \in$ {features used in $R_1$ and $R_2$}. It corresponds to the definition of the Manhattan distance. The interval endpoints are normalised to values between 0 and 1: $v' = (v - v_{min}) / (v_{max} - v_{min})$. When some feature is absent in the conditions of one of the rules, it is assumed to have an interval from 0 to 1. Note that since we are not aiming at creating a new compact model that will be used on novel data, this assumption makes sense.
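The distance computation can be sketched as follows. The sketch assumes that the distance between two numeric intervals is the larger of the two endpoint differences (the usual Hausdorff distance between intervals), that endpoints are min-max scaled with known feature ranges, and that the sum runs over all features used in either rule. All function names and the `ranges` argument are hypothetical.

```python
from typing import Dict, Set, Tuple

Interval = Tuple[float, float]


def normalise(iv: Interval, vmin: float, vmax: float) -> Interval:
    """Min-max scale both endpoints of an interval to [0, 1]."""
    span = vmax - vmin
    return ((iv[0] - vmin) / span, (iv[1] - vmin) / span)


def interval_distance(a: Interval, b: Interval) -> float:
    """Hausdorff distance between two intervals: the larger endpoint shift."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """One minus the Jaccard index of two sets of categorical values."""
    joint = a | b
    return 1.0 - len(a & b) / len(joint) if joint else 0.0


def rule_distance(c1: Dict[str, Interval], c2: Dict[str, Interval],
                  ranges: Dict[str, Interval]) -> float:
    """Sum of per-feature distances over the features used in either rule."""
    total = 0.0
    for f in c1.keys() | c2.keys():
        vmin, vmax = ranges[f]
        # A feature missing from a rule is treated as the full interval [0, 1].
        i1 = normalise(c1[f], vmin, vmax) if f in c1 else (0.0, 1.0)
        i2 = normalise(c2[f], vmin, vmax) if f in c2 else (0.0, 1.0)
        total += interval_distance(i1, i2)
    return total
```

Only numeric features enter `rule_distance` here; a mixed variant would add `jaccard_distance` for the categorical features in the same sum.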
Note that the distance metric is defined solely for the rule antecedents and does not take into account the rule predictions. Since merging is applied only to the rules with congruent predictions, there is no need to include the predictions in the calculation of the rule similarity. Besides, the distance metric defined in this way can be used for detection of similar rules with incongruent predictions, which may be useful in examining the quality of a rule set.

Input
• A classification or regression model in the form of a set of rules or a decision tree. In the latter case, the tree is transformed to an equivalent set of rules by one of the existing methods (e.g., [38]).
• A roughness threshold ρ (Definition 6).

Output
• A set of rules such that:
1. each original rule is correctly covered by some resulting rule (Definitions 1, 3);
2. the resulting set of rules has smaller cardinality than the original one. In case when the resulting set of rules has the same cardinality as the original one, we say that the algorithm failed;
3. the coherence coefficient (Definition 4) of any union rule in the resulting set is not less than the roughness threshold, i.e., CC ≥ ρ (Definition 6).
The pseudocode of the rules set generalization algorithm is given below (Alg. 1). The algorithm repeatedly finds the closest (according to the defined distance metric) pair of rules whose predictions are congruent by Definition 2 and applies the operation of rule union (Definition 9). If the united rule has CC ≥ ρ (Definitions 4, 6), it substitutes the two rules it was produced from; otherwise, it is discarded. After accepting a new rule, the algorithm searches for the other rules that are correctly covered by this rule (Definitions 1-3) and, if found, removes them from the resulting set. The algorithm terminates when no new union rule was accepted during an iteration.
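The loop described above can be approximated by the following sketch. It is not the authors' Algorithm 1 but a simplified reading of the textual description, under several assumptions: predictions are class labels, conditions are numeric intervals whose endpoints are already scaled to [0, 1], and the coherence coefficient of a union is measured against the original rules. All names are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Dict, List, Tuple

Interval = Tuple[float, float]


@dataclass
class Rule:
    conds: Dict[str, Interval]  # feature -> interval scaled to [0, 1]
    pred: str                   # class label; congruence is equality here


def covers(a: Rule, b: Rule) -> bool:
    return all(f in b.conds and lo <= b.conds[f][0] and b.conds[f][1] <= hi
               for f, (lo, hi) in a.conds.items())


def distance(a: Rule, b: Rule) -> float:
    full = (0.0, 1.0)  # implicit condition of an unused feature
    total = 0.0
    for f in a.conds.keys() | b.conds.keys():
        (l1, h1), (l2, h2) = a.conds.get(f, full), b.conds.get(f, full)
        total += max(abs(l1 - l2), abs(h1 - h2))
    return total


def union(a: Rule, b: Rule) -> Rule:
    shared = a.conds.keys() & b.conds.keys()
    conds = {f: (min(a.conds[f][0], b.conds[f][0]),
                 max(a.conds[f][1], b.conds[f][1])) for f in shared}
    return Rule(conds, a.pred)


def coherence(rule: Rule, originals: List[Rule]) -> float:
    covered = [r for r in originals if covers(rule, r)]
    if not covered:
        return 1.0
    return sum(r.pred == rule.pred for r in covered) / len(covered)


def generalize(originals: List[Rule], rho: float) -> List[Rule]:
    rules = list(originals)
    changed = True
    while changed:  # stop when an iteration accepts no new union
        changed = False
        # Candidate pairs with congruent predictions, closest first.
        pairs = sorted(
            (distance(a, b), i, j)
            for (i, a), (j, b) in combinations(enumerate(rules), 2)
            if a.pred == b.pred
        )
        for _, i, j in pairs:
            merged = union(rules[i], rules[j])
            if coherence(merged, originals) < rho:
                continue  # too many exceptions: discard this union
            # Replace the pair and drop the other rules the union covers correctly.
            rules = [r for k, r in enumerate(rules)
                     if k not in (i, j)
                     and not (covers(merged, r) and r.pred == merged.pred)]
            rules.append(merged)
            changed = True
            break
    return rules
```

Recomputing all pairwise distances after every accepted union keeps the sketch short; an implementation meant for large rule sets would presumably update the distances incrementally.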

Checking Fidelity in Terms of Data Predictions
Since a union rule is more general than the rules it has been derived from, it may be applicable to additional data instances not described by the original rules. For some of these additional instances, the prediction of the union rule may be incongruent with the predictions of the corresponding rules from the original model. Given some reference dataset at hand (not necessarily the training dataset used for the original black-box model), we can define the following notion of fidelity.
Definition 10. The fidelity of a union rule with respect to the original rules set is the ratio of the number of data instances in some reference dataset (e.g., the set from which the model was derived) for which the union rule gives predictions congruent to the predictions of the original model to the total number of data instances this rule is applicable to.
Definition 11. The overall fidelity of a descriptive model, i.e., a generalized rules set, with respect to the original model is the ratio of the number of data instances in the reference dataset for which the descriptive model gives predictions congruent to the predictions of the original model to the total number of data instances both models are applicable to.
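The fidelity of a single rule (Definition 10) can be sketched as follows, assuming that instances are dictionaries of numeric feature values, that interval bounds are inclusive, and that predictions are class labels so that congruence reduces to equality. The function names are hypothetical, and the predictions of the original model are passed in as a precomputed list.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]


def applies(conds: Dict[str, Interval], x: Dict[str, float]) -> bool:
    """True if the instance satisfies every condition (bounds taken as inclusive)."""
    return all(lo <= x[f] <= hi for f, (lo, hi) in conds.items())


def rule_fidelity(conds: Dict[str, Interval], pred: str,
                  instances: List[Dict[str, float]],
                  original_preds: List[str]) -> float:
    """Share of the instances the rule applies to on which its prediction
    agrees with the prediction of the original model."""
    agree = [p == pred
             for x, p in zip(instances, original_preds)
             if applies(conds, x)]
    if not agree:
        return 1.0  # assumption: a rule applicable to no instance cannot disagree
    return sum(agree) / len(agree)
```

For instance, a rule that agrees with the original model on 963 of the 971 instances it applies to would obtain a fidelity of 963/971, which rounds to 0.99.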
When some set of data instances described by the original rules set is available, the fidelity of the derived union rules to the original predictions can be additionally checked. A reasonable requirement is that the fidelity must not be less than ρ. A condition for checking the fidelity should be added to the "if" statement on line 16 of Alg. 1, i.e., the extended condition is CC(

Iterative lowering of the roughness threshold
There is a possibility to apply Algorithm 1 in an iterative manner. For this purpose, the user specifies an interval $[\rho_{low}, \rho_{up}]$ and a step $\Delta(\rho)$, where $\Delta(\rho) < \rho_{up} - \rho_{low}$. Algorithm 1 is executed several times, consecutively setting the roughness threshold $\rho$ to $\rho_{up}, \rho_{up} - \Delta(\rho), \ldots, \rho_{low}$, i.e., starting from $\rho_{up}$ and decreasing the threshold in each following run by $\Delta(\rho)$. The output of run $i$ is used as the input of run $i + 1$.
[Algorithm 1: rules set generalization. Only fragments of the pseudocode survive: "if congruent($r_i$, $r_j$) then"; "changed ← False"; "while PD ≠ ∅ ∧ ¬changed do"; "(i, j) ← argmin$_{d_{i,j}}$ PD ▷ find the minimal distance pair"; "changed ← True".]
This extension of the method prioritises more coherent rules, i.e., it will strive to produce united rules with higher CC before attempting to achieve higher compression at the cost of reducing the coherence.
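The multi-step variant amounts to a simple outer loop, sketched below. The sketch treats a single run of the generalization algorithm as a callable `run(rules, rho)` supplied by the caller; this callable, as well as the function names, are assumptions made for illustration.

```python
from typing import Callable, List, TypeVar

R = TypeVar("R")


def thresholds(rho_up: float, rho_low: float, step: float) -> List[float]:
    """Thresholds rho_up, rho_up - step, ..., not going below rho_low."""
    # The small offset guards against floating-point error in the division.
    n = int((rho_up - rho_low) / step + 1e-9)
    return [round(rho_up - k * step, 10) for k in range(n + 1)]


def iterative_generalize(rules: List[R],
                         run: Callable[[List[R], float], List[R]],
                         rho_up: float, rho_low: float,
                         step: float) -> List[R]:
    """Apply `run` with decreasing thresholds; each output feeds the next run."""
    for rho in thresholds(rho_up, rho_low, step):
        rules = run(rules, rho)
    return rules
```

Because the strictest threshold is applied first, unions with high coherence are formed before rougher ones are attempted, which matches the prioritisation described above.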
To demonstrate possible differences between the results of the basic algorithm and its multistep variant, Fig. 1 shows two projection displays where rules are represented by dots. The dots are arranged on a plane based on the distances between the rules. The projections have been obtained using the method t-SNE [39]. The dot colours encode the predictions and the sizes are proportional to the number of the data instances the rule applies to. The lines connect dots representing rules that were united by the generalization algorithm. The display on the left corresponds to the base algorithm and the one on the right to the multi-step variant. The dots marked in black represent a group of original rules that were united in a single rule by the multi-step variant and included in three different unions by the base variant.
The illustrations refer to an example classification model consisting of 109 rules including in total 818 conditions. With the roughness threshold of 0.6, the base variant reduces the original set to 54 rules with 342 conditions, of which 33 rules are the same as in the original set (i.e., the algorithm cannot generalize them) and 21 rules are unions obtained from 76 original rules. The coherence coefficient of the union rules ranges from 0.6 to 1; however, only one union has CC = 1, three rules

Visualizations
We have designed and implemented several visualizations that allow researchers to explore rules sets and to investigate the work of our algorithm. This helps to see and interpret the working of the algorithm. Please note that the visualizations are not a part of the rule generalization method and that our prototype implementation of the proposed approach is limited to dealing with rules in which the conditions involve only numeric features. However, this limitation does not pertain to the approach itself.
We display a set of rules in the form of a table, as shown in Fig. 2. Each table row corresponds to one rule, and each table column corresponds to one feature. Value intervals of the conditions are represented by horizontal bars, which show the relative position of the interval between the minimal and maximal feature values. If a feature is not used in a rule, the corresponding cell is empty. Besides, there is a column entitled "Rule", where each rule is represented as a whole by a glyph with vertical axes corresponding to all available features and vertical bars corresponding to the features used in the rule.
A table showing the results of the rule generalization algorithm (Fig. 3) includes additional columns containing (1) counts of correct and wrong applications of the rule to data instances, (2) counts of correctly and wrongly covered original rules, (3) fidelity, (4) coherence coefficient, (5) number of rules in the derivation hierarchy, and (6) depth of the hierarchy. Detailed information about a rule is provided in a popup window (Fig. 4).
Another type of display we use is a panel with multiple rules represented by glyphs, as shown in Fig. 5. A popup window with rule details, as in Fig. 4, can be obtained by pointing at a glyph. Comparison of the rules is supported by a mode of glyph drawing in which conditions of one rule are represented by hollow bars with black frames and conditions of one or more previously selected rules are represented by filled semi-transparent bars without frames. The bar shading is darker where feature value intervals from two or more selected rules overlap. For example, in Fig. 5, the conditions of three selected rules (their glyphs have black frames around them) are represented within all glyphs in the display. Rules are selected by clicking on their glyphs or rows in a table view.
Fig. 6 illustrates the concept of rule coverage. Here, the rule shown on the top left (it is selected, so that the glyph is marked with a black frame) covers the remaining rules represented in the image. The conditions of the covering rule are represented in all glyphs by cyan-shaded bars. The colours of the glyph frames encode the predicted classes of the rules. Five of the nine covered rules have the same prediction as the covering rule and three rules have a different one. Hence, five rules are covered correctly and three rules incorrectly.
Fig. 7 illustrates the operation of rule union. Two original rules are shown on the right and their union on the left. The original rules are selected, and their conditions are represented by cyan-shaded bars in all three glyphs. Darker bar shading signifies overlapping conditions. The first four conditions and the seventh condition are identical in the two original rules; so, the same conditions are included in the union rule. In the fifth and eighth conditions, the value interval of one rule includes the value interval of the other rule; so, the union rule includes the larger intervals. In the sixth and ninth conditions, the value intervals do not overlap; so, the union contains the interval from the lower end of the lower interval to the upper end of the higher interval. There are two conditions with features appearing only in one of the original rules. For these features, the union has no conditions. The numbers 0.80 and 0.99 above and below the glyph of the union rule represent the coherence coefficient and the fidelity of the union rule, respectively. In this example, the union rule covers four original rules correctly and one original rule incorrectly; so, it is a rough rule with CC = 4/(4 + 1) = 0.8. This union rule gives the same predictions as the original model for 963 data instances and different predictions for 8 data instances; hence, its fidelity is 963/(963 + 8) = 0.99.

Experiments
We describe our investigation of 4 models from three real-world tasks. For each rules set, we ran the basic rule generalization algorithm 9 times, setting the parameter ρ to 1.00, 0.95, 0.90, 0.85, ..., 0.60. Each run was applied to the original rules set. For each run, we analyse the statistics that describe the comprehensibility of the compressed model (the number of the resulting rules, the total number of conditions in all the rules, the mean number of conditions per rule, and the number and percentage of the rules including more than 5 conditions, which are considered complex), the roughness of the descriptive model (the minimal CC that was actually achieved, the minimal fidelity of a rule, the total fidelity of the whole rules set, and the number of rough rules), and the characteristics of the algorithm work (the number of generated union rules and the maximal depth of a rule derivation hierarchy). While the first group of results allows us to assess the interpretability, the second and third ones give a deeper understanding of the mechanics that lead to such a compressed descriptive model. It is important to note once again that our aim is to interpret the global logic of a model at hand, not to understand the data. So we achieve our goal if we can explain the main rules learned by the model, the main features that affect its decisions, the possible outliers that require highly specific rules, etc.

Cardiotocography dataset
Since the medical domain is of high interest for interpretability, we looked at a medical dataset as our first experiment. It is a UCI [41] dataset of cardiotocography records [42]. It contains 2126 fetal cardiotocograms for which various diagnostic features were measured. They were classified with respect to a morphologic pattern into 10 classes and with respect to the fetal state into 3 classes. For both cases, we directly learned a decision tree and analysed it using our algorithm.

3 classes task
The 3-class model consists of 109 rules describing 1700 data instances of the training dataset. The statistical characteristics of the generalized rules sets obtained for different settings of the parameter ρ are presented in Fig. 9.
It can be noticed that decreasing the roughness threshold ρ from 1 to 0.85 does not lead to the generation of any rough rule, i.e., none of the resulting rules has exceptions. However, the union rules, even when their CC = 1, are more general than the original rules and applicable to larger subsets of the data instances, which may include instances with incongruent predictions. Hence, union rules may be fully coherent with regard to the covered original rules while their fidelity is nevertheless less than 1.
Another observation is that there are many original rules that cannot be united with others and remain standalone even when the roughness threshold is low. Thus, for ρ = 0.60, only 21 out of 54 rules in the resulting model are union rules. Nevertheless, the achievable degree of simplification can be judged as quite high, especially in terms of the number of conditions and the proportion of complex rules with more than 5 conditions. Moreover, the standalone rules help to identify outlier instances that require different logic than most of the others. For example, the rule on the right of Fig. 4 describes only one data instance, and it could not be united with any other rule.
An important property of the generalized rules set is that simpler rules (i.e., including fewer conditions) describe a much larger proportion of the data instances than in the original model. The minimal number of conditions in one rule is 3 in the original model and 1 in the simplified versions obtained with ρ = 0.65 and ρ = 0.60 (the maximal number of conditions per rule is 12 in all models). The original model contains 4 rules with 3 conditions describing 47 data instances, 7 rules with 4 conditions describing 62 instances, and 23 rules with 5 conditions describing 163 instances. Taken together, the 34 simpler rules describe 272 data instances out of 1700, i.e., only 16%. In the model obtained with ρ = 0.65, the numbers of rules including from 1 to 5 conditions are, respectively, 2, 1, 6, 9, and 11, and these 29 rules describe 1009, 36, 201, 200, and 48 data instances, respectively, i.e., 1494 instances in total. As two or more rules from a generalized model may be applicable to the same data instances, the cumulative number of the data instances correctly (i.e., in congruence with the original model) described by the model with ρ = 0.65 is 2709, and thus the simplest rules make up 55% (1494/2709 × 100) of the correct descriptions. However, these rules describe 96 data instances incorrectly, i.e., their joint fidelity is 1494/(1494 + 96) = 0.94.
Hence, there are multiple aspects of simplification: the number of rules, the number of conditions, the proportion of simple rules, and the proportion of the data described by these simple rules. Moreover, the conditions of the simplest rules applicable to a large number of instances indicate which features have higher importance than others. For example, the model with ρ = 0.65 contains a rule with a single condition "If histogram mode < 148.5 then class = 1" correctly describing 881 data instances and having fidelity 0.97. This rule reveals the importance of the feature "histogram mode". Another example is that the percentage of time with abnormal short-term variability is rather low for class 1 (healthy) but higher for classes 2 and 3 (suspect and pathology), while the histogram mean is lower for the pathology class than for the other two.
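The coverage figures quoted for the simple rules of the 3-class model can be re-derived with a few lines of Python. The sketch only restates the numbers given in the text; the variable names are ours.

```python
# Original model: number of conditions -> (number of rules, instances described).
orig_simple = {3: (4, 47), 4: (7, 62), 5: (23, 163)}
orig_rule_count = sum(n for n, _ in orig_simple.values())
orig_instances = sum(m for _, m in orig_simple.values())
orig_share = orig_instances / 1700

# Model generalized with rho = 0.65: rules with 1..5 conditions.
gen_rule_counts = [2, 1, 6, 9, 11]
gen_instances = [1009, 36, 201, 200, 48]
gen_total = sum(gen_instances)

share_of_correct = gen_total / 2709          # share of all correct descriptions
joint_fidelity = gen_total / (gen_total + 96)  # 96 incorrect descriptions

print(orig_rule_count, orig_instances, round(orig_share, 2))
print(sum(gen_rule_counts), gen_total,
      round(share_of_correct, 2), round(joint_fidelity, 2))
```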
Additionally, we investigated the effects of model pruning on the performance of our algorithm; the results are presented in Section 5.4.

10 classes task
This dataset allows us to see the difference in interpretability between a simpler and a more complex task on the same data features. The decision tree for 10 classes consists of 202 rules describing the same 1700 data instances of the training dataset. The descriptive statistics of the results of the experiments are shown in Fig. 10. As could be expected, the potential for compression and generalization is lower when the number of classes is higher, due to the congruence requirement. Compared to the 3-class model, the 10-class model also consists of more complex rules, i.e., ones that have more conditions. The generalization increases the proportion of simpler rules having up to 5 conditions, which contributes to better comprehensibility, along with the decrease in the number of rules. The fact that with ρ = 0.60 we see many more union rules that are not rough, compared to the 3-class model, also confirms that the global logic of the model is more complex.

Home Equity Line of Credit (HELOC), 2 classes
This example application is based on the Explainable Machine Learning Challenge organised by a group of commercial and academic organisations 2. Based on an anonymised dataset of applications made by homeowners, the challenge requires creation of a readily explainable model predicting the value of the variable Risk Performance, which may be either "bad" or "good". To allow correct creation of a decision tree, we excluded records with special values and two categorical features. We first created an obviously incomprehensible random forest model with 50 trees without depth restriction, which achieves perfect accuracy, and then generated a mimic model approximating the behavior of the random forest model. The mimic model consists of 384 rules containing in total 3019 conditions, which involve 21 features with numeric value domains. The statistics describing the results of the generalization are presented in Fig. 11. It can be seen that the mimic model can be slightly simplified even with ρ = 1, which means that the model has some redundancies. As the degree of simplification increases, the total fidelity of the simplified model to the original one decreases gradually but more substantially than in the other experiments. A probable reason is the high similarity between rules giving opposite predictions: when a rule gets more general, it may become applicable to additional data instances that are described by other rules, even if it does not cover those other rules (i.e., the conditions of the rules partly overlap). The projection plot on the left of Fig. 12 supports this guess: blue and red dots representing rules with negative and positive outcomes, respectively, tend to be very close in the plot. An interesting side effect of the simplification is that it increases the separation, i.e., the dissimilarity between the rules with the positive and negative outcomes. This can be seen from comparing the projection of the original rules set on the left of Fig. 12 to the projections of the simplified rules sets obtained with ρ = 0.85 (Fig. 12, center) and with ρ = 0.75 (Fig. 12, right).
The similarities between rules are demonstrated in Fig. 13, where a table displays a group of rules represented by a cluster of closely positioned dots in the projection plot shown in Fig. 12, left. The cluster has been interactively selected by dragging a frame around it. The table shows that the rules with negative results (Action = 0) differ from the closest rules with positive results (Action = 1) by just one condition.
Using this example, we can demonstrate how our techniques can be used to answer the question of the challenge organizers: if an applicant has got a negative result ("bad"), can the model easily explain what should be changed to turn the result to positive ("good")? For this purpose, the rule R0 that gave the negative result needs to be identified in the projection plot (the localisation of rules is supported by highlighting), and the rules with positive results having close positions in the plot need to be selected, for example, as shown in Fig. 12. The selected rules can then be compared using the table display shown in Fig. 13 or a glyph representation, as in Fig. 14. In this figure, rule no. 41 is selected as a reference for comparison. Its conditions are represented by cyan-filled bars in all glyphs. It is easy to see that a small increase of the value of the third feature (Percent Installment Trades) will make rule no. 271 with a positive outcome applicable to this case instead of rule no. 41. Other possibilities are to make rule no. 241 applicable by increasing the value of the 7th feature (Net Fraction Installment Burden), or rule no. 340 by increasing the value of the 8th feature (Percent Trades Never Delinquent), or rule no. 254 by increasing the value of the second feature (Consolidated Version of Risk Markers), or rule no. 79 by decreasing the value of the 5th feature (Number Trades 60+ ever). In a similar way, one can use the more general rules of a simplified model version, in which critical features like these can be identified more easily.
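The lookup of nearby rules with the desired outcome can be sketched in Python as follows. This is a simplified stand-in: instead of the distance metric and the interactive selection in the projection plot, it merely counts the features whose conditions differ between two rules. The function names (`differing_features`, `closest_rules_with_outcome`), the rule identifiers, and all values are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]
Conditions = Dict[str, Interval]


def differing_features(a: Conditions, b: Conditions) -> List[str]:
    """Features whose conditions are not identical in the two rules."""
    return sorted(f for f in a.keys() | b.keys() if a.get(f) != b.get(f))


def closest_rules_with_outcome(reference: Conditions,
                               rules: Dict[str, Tuple[Conditions, int]],
                               outcome: int,
                               max_diff: int = 1):
    """Rules with the desired outcome differing from `reference` in few features."""
    hits = []
    for rule_id, (conditions, cls) in rules.items():
        diff = differing_features(reference, conditions)
        if cls == outcome and 0 < len(diff) <= max_diff:
            hits.append((rule_id, diff))
    return sorted(hits)


# Made-up rules: "B" gives the negative result (0), the others a positive one (1).
rules = {
    "A": ({"f1": (0, 10), "f2": (0, 50), "f3": (30, 60)}, 1),
    "B": ({"f1": (0, 10), "f2": (0, 50), "f3": (0, 30)}, 0),
    "C": ({"f1": (10, 20), "f2": (50, 90), "f3": (0, 30)}, 1),
    "D": ({"f1": (0, 10), "f2": (50, 90), "f3": (0, 30)}, 1),
}
candidates = closest_rules_with_outcome(rules["B"][0], rules, outcome=1)
print(candidates)
```

Each returned pair names a positive rule together with the single feature whose value would have to change for that rule to become applicable.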

Material science, regression
This example is taken from the NOMAD 2018 Kaggle challenge to predict the formation energies and bandgap energies of alloys from transparent conductors 3. In contrast to the previous examples, this is a regression task.
In material science, a state-of-the-art prediction method is RuleFit [7], which trains a rule ensemble combined with a linear model. Note that the winning methods of the Kaggle challenge (n-gram [43], SOAG [44], MBTR [45]) do not substantially outperform RuleFit on the entirety of the dataset ([46] have shown that they do, however, perform well on well-defined subsets of the data). A prediction is obtained using a weighted sum of rule outputs and feature values. We used RuleFit to train a rule ensemble on the 402 data instances using formation energy as target. The resulting rule ensemble consists of 396 rules. Analyzing these rules, we find that most of them describe single data instances, which is not surprising given that the number of rules is nearly equal to the number of data instances. This indicates some potential for compression.
3 https://www.kaggle.com/c/nomad2018-predict-transparent-conductors
The target values range from 0 to 0.7676. We have set the tolerance threshold ε to 0.010, 0.020 and 0.050, corresponding to the 2%, 5%, and 10% percentiles of target values. The collected statistics are presented in Fig. 15.
The second rows in all three tables demonstrate that the largest part of the simplification is achieved at the cost of decreasing the precision of the predictions from specific numbers to intervals. For example, a union rule predicts that the result will be from 0.2179 to 0.2214 instead of predicting a fixed number like 0.22. Hence, the chosen value of the tolerance threshold ε has the highest impact on the resulting degree of simplification and generalization, whereas the impact of the roughness threshold ρ is quite small: the decrease in the number of rules due to decreasing ρ from 1 to 0.6 ranges from only 6% for ε = 0.010 to 14% for ε = 0.050, and the decrease in the number of conditions ranges from 9% for ε = 0.010 to 20% for ε = 0.050.
Nevertheless, the potential of our method for simplifying the explanation of regression models can be considered as high. It is quite reasonable to posit that a user rarely needs an exact explanation for each individual numeric value that can be predicted by a model. Rather, the user can be satisfied with a model description telling what combinations of conditions lead to model results fitting in different ranges of values (e.g., high and low). The user-controlled value of ε determines how narrow or wide these intervals will be. Thus, by choosing a larger value, a user can obtain a compact description of model behavior even without decreasing the coherence coefficient of the rules. For example, the same material science model generalized with ε = 0.075 is described by 80 rules with 494 conditions, and ε = 0.25 gives only 22 rules with 94 conditions in total and from 3 to maximum 6 conditions in each individual rule. Like in the other cases, such combinations of conditions, as well as the features they involve, can be considered the most influential for the model result.
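The effect of the tolerance threshold can be illustrated with a deliberately simplified, one-dimensional Python sketch. It assumes that predicted values may share one interval only if the interval is no wider than the threshold, and it ignores the rule conditions entirely, so it is not the generalization algorithm itself. The function name `group_by_tolerance` is hypothetical, and apart from 0.2179 and 0.2214 the values are made up.

```python
def group_by_tolerance(values, eps):
    """Greedily split sorted predictions into intervals no wider than eps."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][0] <= eps:
            groups[-1][1] = v          # extend the current interval
        else:
            groups.append([v, v])      # start a new interval
    return [tuple(g) for g in groups]


predictions = [0.2179, 0.2190, 0.2214, 0.2500, 0.2580, 0.4000]
narrow = group_by_tolerance(predictions, eps=0.010)
wide = group_by_tolerance(predictions, eps=0.050)
print(len(narrow), narrow)
print(len(wide), wide)
```

A larger threshold yields fewer and wider prediction intervals, which is the mechanism behind the stronger simplification reported for larger tolerance values.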
This example suggests an interesting possible way of using the generalization method for regression tasks. First, a high-level overview of the behavior of a model is gained by obtaining a very rough (large ε) generalized representation of it. Then, subsets of the original rules that have been unified in the result of the generalization are investigated in more detail by applying the generalization method separately to these subsets. For example, the t-SNE projection plot in Fig. 16 shows how the original rules of the material science model were joined in a highly generalized version of the model. One group of 27 linked rules has been selected and generalized with ε = 0.05 to 5 simple rules. The latter are shown in a table view, where the first two columns represent the intervals of the predicted values: r ∈ [minQ, maxQ].

Experiments with pruned models
In order to investigate the effect of pruning the original model on the compression that can be achieved with our algorithm, we created decision trees with different levels of cost-complexity pruning. In Fig. 17, pruning degree 1 denotes the least compressed model and pruning degree 3 the most compressed model. An immediate observation is that the result of our algorithm is highly dependent on the pruning performed: the strongly pruned model is highly resistant to generalization. This means that our method can also help practitioners to judge whether a model is compact enough and free of redundancies. Another interesting observation is that the model with pruning degree 2 can be compressed only with significant roughness: the models obtained with ρ = 0.90 and ρ = 0.80 are identical, while the simplification attempt with ρ = 1 fails (no union rules could be produced). This is even more pronounced for the strongest pruning. It should be noted that the accuracy of these pruned models declines progressively with the pruning degree, which showcases the difference between our approach and pruning: while keeping a required degree of coherence and fidelity to the original model, our method gives a simplified description without any effect on the accuracy of the original model.

Discussion
We proposed an approach to facilitating comprehension of models that are interpretable by design, but too large to be actually intelligible by a human due to cognitive limitations. For this, we explain the logic of a large (in principle) interpretable model by a simplified descriptive model that suits human cognitive properties: while averse to large volumes of information, humans are good at dealing with vague concepts, approximate statements, and fuzzy reasoning. One can think of a data mart 4 as an example of a descriptive model widely used in the real world: instead of giving a human the full data of a business, a special high-level view is formed in order to understand the processes happening in it.
Our approach differs from regularization or compression techniques for obtaining simpler yet accurate enough predictive models, since our goal is not to retrain or improve a model, but to explain it. That is, our aim is to represent the logic of a complicated pseudo-interpretable model at hand. This is independent of whether the model was derived from data or is an interpretable model mimicking the behavior of some black-box model. A descriptive model is not meant to be used as a substitute for the model at hand (i.e., it is not used for making predictions), but its purpose is to provide an explanation for the global logic of that model. The cost of high simplification is loss of predictive accuracy. The more complex the global logic of a model, the harder it is to generalize and represent it by a simple descriptive model of sufficient fidelity. Our algorithm for rule generalization allows users to control how similar to the original model the descriptive model must be in terms of predictions. Besides, the possibility to see the exceptions and the hierarchy of rule generalization allows a human to increase the fidelity (and, hence, the complexity) of the description as desired. In addition to model description, our approach also supports model exploration in terms of important features, their impact on predictions, and which feature combinations would create outliers.
Similar to a mimic model [2], there is a trade-off between interpretability, i.e., the size of the descriptive model, and accuracy of the description, i.e., the similarity of the descriptive model to the original one. The goal typically is to have the most concise descriptive model that is still sufficiently similar. The similarity, however, is in general hard to assess: a meaningful similarity measure depends not only on the functional similarity, e.g., as measured by a suitable norm on the function space, but also on the expected difference given the data distribution. We use two measures as a surrogate: fidelity to measure the difference in predictions and the coherence coefficient to measure structural similarity. Fidelity is widely used in explainable AI [2] and measures the proportion of data instances in a reference dataset for which the predictions of the original and descriptive models do not differ by more than ε. Fidelity can be inaccurate for two reasons: (i) using a reference dataset as an empirical sample of the data distribution is only an approximation, and (ii) the difference of the two models on those data samples where they do not agree is unbounded.
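Fidelity with a tolerance threshold can be sketched in Python as the share of reference instances on which the two predictions agree up to that threshold, consistent with the ratio 963/(963 + 8) used earlier in the paper. The function name `fidelity` and the toy prediction lists are hypothetical; with a threshold of zero and numeric class labels the same function covers the classification case.

```python
def fidelity(original_preds, descriptive_preds, eps=0.0):
    """Share of reference instances whose two predictions differ by at most eps."""
    pairs = list(zip(original_preds, descriptive_preds))
    agree = sum(1 for o, d in pairs if abs(o - d) <= eps)
    return agree / len(pairs)


# Classification: 963 agreeing and 8 disagreeing instances, as in the union rule example.
cls_original = [1] * 971
cls_descriptive = [1] * 963 + [2] * 8
f_cls = fidelity(cls_original, cls_descriptive)

# Regression (made-up values): the second prediction deviates by more than eps.
f_reg = fidelity([0.20, 0.30, 0.50], [0.205, 0.33, 0.50], eps=0.01)
print(round(f_cls, 2), f_reg)
```

As noted in the text, such a ratio says nothing about how large the deviation is on the instances that disagree.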
We introduced the coherence coefficient to measure the number of rules subsumed coherently by a generalization in the description. Rules not covered coherently are kept in the description as exception rules so that no structural parts of the original model are discarded. This guarantees that for all data instances where at least one rule in the original model is satisfied, at least one rule in the descriptive model will be satisfied and their predictions do not differ by more than ε. Therefore, the set of points for which at least one rule in the descriptive model is satisfied is a superset of the points for which at least one rule in the original model is satisfied. This measure can still be inaccurate on the points for which a rule in the descriptive model is satisfied but no rule in the original model is. For those points, the difference in prediction is in general unbounded.
For exploring the properties of the algorithm, we created a visualization interface and performed a series of experiments applying the rules generalization to four different models. Our case studies showed that human interaction for setting the acceptable level of description roughness is very helpful: while significant roughness makes the result easier to comprehend, obtaining several descriptive models with different degrees of roughness can help to refine the understanding of the prediction logic. Interestingly, the experiment with a regression model showed that in this case imprecision of predictions allows higher simplification than control of the rules roughness. Based on this observation, we propose a method for focused exploration of selected parts of a regression model at hand: starting from a very simple but very imprecise descriptive model, a user selects one of the generalized rules, extracts the subset of the original rules it covers, and obtains a more precise descriptive model for this subset. In this way, the understanding of the model logic can be gradually refined and deepened.
We also found that the distance metric we introduced can be used for answering prediction justification questions, i.e., determining which features should be changed, and how, to make a model change its prediction. Knowing the rule by which the current prediction was made, one uses the distance metric to select the closest rules giving the desired outcome and inspects how their conditions differ from the conditions of the rule that was applied.
Our algorithm allows two variants of use (see Sec. 3) and is open to further extensions. For example, it can take into account possible overlaps (partial coverage) between a generalized rule and the original rules. Currently, the coherence coefficient of a general rule is calculated only from the rules fully covered by it. This definition can be extended in an obvious way to include partial coverage by an appropriate change in the computation of CC.
An interesting direction for future work is to combine generalization of the rules with merging and generalization of the features involved in the rule conditions, which is expected to enable much higher degrees of model logic simplification. Examples of feature merging can be seen in the award-winning solution of the HELOC Challenge 5 [47], where 6 groups of semantically related original features were integrated into composite features, thereby reducing the original 23 features to 10 features. Such feature merging is usually hard to perform in an interpretable manner without domain knowledge and human reasoning. However, it may be possible to detect automatically (by analyzing a rule set) which features are likely to be related and to propose groups of such features to a human expert for consideration and controlled integration. This can significantly strengthen the comprehensibility of a descriptive model.