On cognitive preferences and the plausibility of rule-based models

It is conventional wisdom in machine learning and data mining that logical models such as rule sets are more interpretable than other models, and that among such rule-based models, simpler models are more interpretable than more complex ones. In this position paper, we question this latter assumption by focusing on one particular aspect of interpretability, namely the plausibility of models. Roughly speaking, we equate the plausibility of a model with the likelihood that a user accepts it as an explanation for a prediction. In particular, we argue that, all other things being equal, longer explanations may be more convincing than shorter ones, and that the predominant bias for shorter models, which is typically necessary for learning powerful discriminative models, may not be suitable when it comes to user acceptance of the learned models. To that end, we first recapitulate evidence for and against this postulate, and then report the results of an evaluation in a crowdsourcing study based on about 3000 judgments. The results do not reveal a strong preference for simple rules, whereas we can observe a weak preference for longer rules in some domains. We then relate these results to well-known cognitive biases such as the conjunction fallacy, the representativeness heuristic, or the recognition heuristic, and investigate their relation to rule length and plausibility.


Introduction
In their classical definition of the field, Fayyad et al. (1996) defined knowledge discovery in databases as "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." Research has since progressed considerably in all of these dimensions in a mostly data-driven fashion. The validity of models is typically addressed with predictive evaluation techniques such as significance tests, hold-out sets, or cross validation (Japkowicz & Shah, 2011), techniques which are now also increasingly used for pattern evaluation (Webb, 2007). The novelty of patterns is typically assessed by comparing their local distribution to expected values, in areas such as novelty detection (Markou & Singh, 2003a,b), where the goal is to detect unusual behavior in time series, subgroup discovery (Kralj Novak et al., 2009), which aims at discovering groups of data that have unusual class distributions, or exceptional model mining (Duivesteijn et al., 2016), which generalizes this notion to differences with respect to data models instead of data distributions. The search for useful patterns has mostly been addressed via optimization, where the utility of a pattern is defined via a predefined objective function (Hu & Mojsilovic, 2007) or via cost functions that steer the discovery process into the direction of low-cost or high-utility solutions (Elkan, 2001). To that end, Kleinberg et al. (1998) formulated a data mining framework based on utility and decision theory.
Arguably, the last dimension, understandability or interpretability, has received the least attention in the literature. The reason why interpretability has rarely been explicitly addressed is that it is often equated with the presence of logical or structured models such as decision trees or rule sets, which have been extensively researched since the early days of machine learning. In fact, much of the research on learning such models has been motivated by their interpretability. For example, Fürnkranz et al. (2012) argue that rules "offer the best trade-off between human and machine understandability". Their main advantage is the simple logical structure of a rule, which can be directly interpreted by experts not familiar with machine learning or data mining concepts. Moreover, rule-based models are highly modular, in the sense that they may be viewed as a collection of local patterns (Fürnkranz, 2005; Knobbe et al., 2008; Fürnkranz & Knobbe, 2010), whose individual interpretations are often easier to grasp than the complete predictive theory. For example, Lakkaraju et al. (2016) argued that rule sets (which they call decision sets) are more interpretable than decision lists, because they can be decomposed into individual local patterns.
Only recently, with the success of highly precise but largely inscrutable deep learning models, has the topic of interpretability received serious attention, and several workshops in various disciplines have been devoted to the topic of learning interpretable models at conferences like ICML (Kim et al., 2016, 2017, 2018), NIPS (Wilson et al., 2016; Tosi et al., 2017; Müller et al., 2017) or CHI (Gillies et al., 2016). Moreover, a book on explainable and interpretable machine learning is in preparation (Jair Escalante et al., 2018), funding agencies like DARPA have recognized the need for explainable AI, and the General Data Protection Regulation of the EC includes a "right to explanation", which may have a strong impact on machine learning and data mining solutions (Piatetsky-Shapiro, 2018). The strength of many recent learning algorithms, most notably deep learning (LeCun et al., 2015; Schmidhuber, 2015), feature learning (Mikolov et al., 2013), fuzzy systems (Alonso et al., 2015) or topic modeling (Blei, 2012), is that latent variables are formed during the learning process. Understanding the meaning of these hidden variables is crucial for transparent and justifiable decisions. Consequently, visualization of such model components has recently received some attention (Chaney & Blei, 2012; Zeiler & Fergus, 2014; Rothe & Schütze, 2016). Alternatively, some research has been devoted to trying to convert such arcane models to more interpretable rule-based or tree-based theories (Andrews et al., 1995; Craven & Shavlik, 1997; Schmitz et al., 1999; Zilke et al., 2016; Ribeiro et al., 2016) or to develop hybrid models that combine the interpretability of logic with the predictive strength of statistical and probabilistic models (Besold et al., 2017; Tran & d'Avila Garcez, 2018; Hu et al., 2016). Following a similar goal, Ribeiro et al. (2016) introduced a method for learning local explanations for inscrutable models that makes it possible to trade off fidelity to the original model against the interpretability and complexity of the local model.
Nevertheless, in our view, many of these approaches fall short in that they take the interpretability of rule-based models for granted. Interpretability is often considered to correlate with complexity, with the intuition that simpler models are easier to understand. Principles like Occam's Razor (Blumer et al., 1987) or Minimum Description Length (MDL) (Rissanen, 1978) are commonly used heuristics for model selection, and have been shown to be successful in overfitting avoidance. As a consequence, most rule learning algorithms have a strong bias towards simple theories. Despite the necessity of a bias for simplicity for overfitting avoidance, we argue in this paper that simpler rules are not necessarily more interpretable, not even if all other things (such as coverage and precision) are equal. This implicit equation of comprehensibility and simplicity was already criticized by, e.g., Pazzani (2000), who argued that "there has been no study that shows that people find smaller models more comprehensible or that the size of a model is the only factor that affects its comprehensibility." There are also a few systems that explicitly strive for longer rules, and recent evidence has cast some doubt on the assumption that shorter rules are indeed preferred by human experts. We will discuss the relation of rule complexity and interpretability at length in Section 3.
Criteria other than accuracy and model complexity have rarely been considered in the learning process. For example, Gabriel et al. (2014) proposed to consider the semantic coherence of a rule's conditions when formulating the rule. Pazzani et al. (2001) show that rules that respect monotonicity constraints are more acceptable to experts than rules that do not. As a consequence, they modify a rule learner to respect such constraints by ignoring attribute values that generally correlate well with other classes than the predicted class. Freitas (2013) reviews these and other approaches, compares several classifier types with respect to their comprehensibility, and points out several drawbacks of model size as a single measure of interpretability.
In his pioneering theoretical framework for inductive learning, Michalski (1983) stressed its links with cognitive science, noting that "inductive learning has a strong cognitive science flavor", and postulated that "descriptions generated by inductive inference bear similarity to human knowledge representations" with reference to Hintzman (1978), an elementary text from psychology on human learning. Michalski (1983) considers adherence to the comprehensibility postulate to be "crucial" for inductive rule learning, yet, as discussed above, it is rarely ever explicitly addressed beyond equating it with model simplicity.
In this paper, we primarily intend to highlight this gap in machine learning and data mining research. In particular, we focus on the plausibility of rules, which, in our view, is an important aspect that contributes to interpretability (Section 2). In addition to the comprehensibility of a model in the sense that the model can be applied to new data, we argue that a good model should also be plausible, i.e., be convincing and acceptable to the user. For example, as an extreme case, a default model that always predicts the majority class is very interpretable, but in most cases not very plausible. We will argue that different models may have different degrees of plausibility, even if they have the same discriminative power. Moreover, we believe that the plausibility of a model is, all other things being equal, unrelated to, or in some cases even positively correlated with, the complexity of a model (Section 3).
To that end, we also report the results of a crowdsourcing evaluation of learned rules in four domains. Overall, the performed experiments are based on nearly 3,000 judgments collected from 390 distinct participants. The results show that there is indeed no evidence that shorter rules are preferred by human subjects. On the contrary, we could observe a preference for longer rules in two of the studied domains (Section 4). In the following, we then relate this finding to related results in the psychological literature, such as the conjunction fallacy (Section 5) and insensitivity to sample size (Section 6). Section 7 is devoted to the relevance of conditions in rules, which, according to the recently described weak evidence effect, may not always have the expected influence on preference. The remaining sections focus on the interplay of cognitive factors and machine-readable semantics: Section 8 covers the recognition heuristic, Section 9 discusses the effect of semantic coherence on interpretability, and Section 10 briefly highlights the lack of methods for learning structured rule-based models.

Interpretability, Comprehensibility, and Plausibility
Interpretability is a very elusive concept which we use in an intuitive sense. Kodratoff (1994) has already observed that it is an ill-defined concept, and has called upon several communities from both academia and industry to tackle this problem, to "find objective definitions of what comprehensibility is", and to open "the hunt for probably approximate comprehensible learning". Since then, not much has changed: the concept can be found under different names in the literature, including understandability, interpretability, comprehensibility, plausibility, trustworthiness, justifiability and others. They all have slightly different semantic connotations, which have, e.g., been reviewed in Bibal & Frénay (2016). Similarly, Lipton (2016) suggests that the term interpretability is ill-defined, and its use in the literature refers to different concepts.
QOL = High :- Many events take place.
QOL = High :- Host City of Olympic Summer Games.
QOL = Low :- African Capital.
(b) rated lowly by users
Figure 1: Good discriminative rules for the quality of living of a city (Paulheim, 2012)
One of the few attempts at an operational definition of interpretability is given by Schmid et al. (2017) and Muggleton et al. (2018), who related the concept to objective measurements such as the time needed for inspecting a learned concept, for applying it in practice, or for giving it a meaningful and correct name. This gives interpretability a fundamental notion of syntactic comprehensibility: a model is interpretable if it can be understood by humans in the sense that it can be correctly applied to new data. Following Muggleton et al. (2018), we refer to this type of syntactic interpretability as comprehensibility, and define it as follows:
Definition 1 (Comprehensibility) A model m_1 is more "comprehensible" than a model m_2 with respect to a given task, if a user can apply model m_1 with greater accuracy than model m_2 to new samples drawn randomly from the task domain.
Muggleton et al. (2018) study various related, measurable quantities, such as the inspection time, the rate with which the meaning of the predicate is recognized from its definition, or the time used for coming up with a suitable name for a definition, which all capture different aspects of how a shown definition of a model can be related to the user's background knowledge. Piltaver et al. (2016) use a very similar definition when they study how the response time for various data- and model-related tasks such as "classify", "explain", "validate", or "discover" varies with changes in the structure of learned decision trees. Another variant of this definition was suggested by Dhurandhar et al. (2017; 2018), who consider interpretability relative to a target model, typically (but not necessarily) a human user. More precisely, they define a learned model as δ-interpretable relative to a target model if the target model can be improved by a factor of δ (e.g., w.r.t. predictive accuracy) with information obtained by the learned model. All these notions have in common that they relate interpretability to a performance aspect, in the sense that a task can be performed better or performed at all with the help of the learned model.
Note, however, that Definition 1 does not address how convincing a model is as a possible explanation for data. For example, an empty model or a default model, classifying all examples as positive, is very simple to interpret, comprehend and apply, but it is neither very useful for applying it to new data, nor does it provide a convincing explanation to the user. As a more practical example, consider the rules shown in Figure 1, which have been derived by the Explain-a-LOD system (Paulheim & Fürnkranz, 2012). The rules provide several possible explanations for why a city has a high quality of living, using Linked Open Data as background knowledge. Clearly, all rules are comprehensible, and can be easily applied in practice. Even though all of them are good discriminators on the provided data and can be equally well applied by a human or an automated system, the first three appear to be more convincing to a human user, which was also confirmed in an experimental study (Paulheim, 2012).
Thus, one also needs to make some assumptions about the correctness of the models that are compared. For example, Muggleton et al. (2018) only compare different complete solutions for a given task, and identify the most interpretable one among them according to Definition 1. This essentially means that the user is assumed to trust all models equally provided that s/he is able to comprehend them. However, in practice users are often skeptical towards learned models, and need to be convinced of their trustworthiness.
Table 1: Comprehensibility versus plausibility
comprehensibility | plausibility
objective | subjective
• can an explanation help to solve a task? | • does the user think it can help to solve a task?
syntactic | semantic
• can an explanation be successfully applied? | • how consistent is it with the user's knowledge?
possible measures | possible measures
• efficiency or effectiveness in solving a task | • user's willingness to accept the explanation
typical errors | typical errors
• failure to perform a task | • over- or underconfidence in an explanation's validity
In this paper, we would thus like to focus on a different aspect of interpretability, which we refer to as plausibility. We view this notion in the sense of "user acceptance" or "user preference", i.e., a model is plausible if a user is willing to accept it as a possible explanation for a prediction. For the purposes of this paper, we thus define plausibility as follows:
Definition 2 (Plausibility) A model m_1 is more "plausible" than a model m_2 if m_1 is more likely to be accepted by a user than m_2.
Within this definition, the word "accepted" bears the meaning specified by the Cambridge English Dictionary as "generally agreed to be satisfactory or right".
Note that plausibility presupposes comprehensibility in that the latter is a prerequisite for a user's ability to judge the plausibility or trustworthiness of a rule. Our definition of plausibility is perhaps less objective than the above definition of comprehensibility because it always relates to the subject's perception of the utility of a given explanation instead of its clearly measurable value. Table 1 tries to highlight some of the differences between comprehensibility and plausibility. For example, plausibility is, in our understanding, necessarily subjective, because it involves a user's estimate of how well the given explanation explains the data, whereas comprehensibility is more objective in the sense that it can be measured whether the user is able to successfully apply the provided explanation to new data. In that sense, we also perceive comprehensibility as focusing more on syntactic aspects ("is the user able to follow the provided instructions?") whereas plausibility is more semantic because it implicitly relates to the user's prior knowledge about the task domain.
In this respect, it is closely related to the notion of justifiability as introduced by Martens & Baesens (2010). They consider a model to be more justifiable if it better conforms to domain knowledge, which is provided in the form of external constraints such as monotonicity constraints. They also define a measure for justifiability, which essentially corresponds to a weighted sum over the fractions of cases where each variable is needed in order to discriminate between different class values. Our notion of plausibility differs from justifiability in that we do not want to assume explicit domain knowledge in the form of constraints but would like to rely on the user's own general knowledge that allows her to assess whether an explanation is convincing or not.
Of course, the differences shown in Table 1 are soft. For example, comprehensibility is also semantic and not only syntactic. In fact, Muggleton et al. (2018) directly address this by also measuring whether their subjects can give meaningful names to the explanations they deal with, and whether these names are helpful in applying the knowledge. However, their experiments clearly put more weight on whether the provided logical theories can be applied in practice than on how relevant the subjects thought they are for the given task. Thus, plausibility, in our view, needs to be evaluated in introspective user studies, where the users explicitly indicate how plausible an explanation is, or which of two explanations is more plausible. Two explanations that can equally well be applied in practice may nevertheless be perceived as having different degrees of plausibility.
In the remainder of the paper, we will therefore typically talk about "plausibility" in the above-mentioned sense, but we will sometimes use terms like "interpretability" as a somewhat more general term. We also use "comprehensibility", mostly when we refer to syntactic interpretability, as discussed and defined above. However, all terms are meant to be interpreted in an intuitive and non-formal way.

Interpretability and Model Complexity
The rules shown in Figure 1 may suggest that simpler rules are more acceptable than longer rules because the highly rated rules (a) are shorter than the lowly rated rules (b). In fact, there are many good reasons why simpler models should be preferred over more complex models. Obviously, a shorter model can be interpreted with less effort than a more complex model of the same kind, in much the same way as reading one paragraph is quicker than reading one page. Nevertheless, a page of elaborate explanations may be more comprehensible than a single dense paragraph that provides the same information (as we all know from reading research papers). Other reasons for preferring simpler models include that they are easier to falsify, that there are fewer simple theories than complex theories, so the a priori chances that a simple theory fits the data are lower, or that simpler rules tend to be more general, cover more examples and their quality estimates are therefore statistically more reliable.
However, one can also find results that cast doubt on this claim. In the following, we discuss this issue in some depth, by first discussing the use of a simplicity bias in machine learning (Section 3.1), then taking the alternative point of view and recapitulating works where more complex theories are preferred (Section 3.2), and then summarizing the conflicting past evidence for either of the two views (Section 3.3).
Michalski (1983) already states that inductive learning algorithms need to incorporate a preference criterion for selecting hypotheses to address the problem of the possibly unlimited number of hypotheses, and that this criterion is typically simplicity, referring to philosophical works on simplicity of scientific theories by Kemeny (1953) and Post (1960), which refine the initial postulate attributed to Ockham. According to Post (1960), judgments of simplicity should not be made "solely on the linguistic form of the theory". This type of simplicity is referred to as linguistic simplicity. A related notion of semantic simplicity is described through the falsifiability criterion (Popper, 1935, 1959), which essentially states that simpler theories can be more easily falsified. Third, Post (1960) introduces pragmatic simplicity, which relates to the degree to which the hypothesis can be fitted into a wider context.

The Bias for Simplicity
Machine learning algorithms typically focus on linguistic or syntactic simplicity, by referring to the description length of the learned hypotheses. The complexity of a rule-based model is typically measured with simple statistics, such as the number of learned rules and their length, or the total number of conditions in the learned model. Inductive rule learning is typically concerned with learning a set of rules or a rule list which discriminates positive from negative examples (Fürnkranz et al., 2012; Fürnkranz & Kliegr, 2015). For this task, a bias towards simplicity is necessary because for a contradiction-free training set, it is trivial to find a rule set that perfectly explains the training data, simply by converting each example to a maximally specific rule that covers only this example. Obviously, although the resulting rule set is clearly within the hypothesis space, it is not useful because it, in principle, corresponds to rote learning and does not generalize to unseen examples. Essentially for this reason, Mitchell (1980) has noted that learning and generalization need a bias in order to avoid such elements of the version space.
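The rote-learning argument above can be made concrete with a small sketch. The toy mushroom-style data, the attribute names, and all function names (`rote_rules`, `covers`, `predict`, `complexity`) are illustrative assumptions and not part of any system discussed here. The sketch converts every training example into one maximally specific rule, reports the two complexity statistics mentioned above (number of rules and total number of conditions), and shows that the resulting rule set fits the training data perfectly but covers no unseen example.

```python
from typing import Dict, List, Tuple

Example = Dict[str, str]
Rule = Tuple[Dict[str, str], str]  # (conditions, predicted class)


def rote_rules(examples: List[Example], labels: List[str]) -> List[Rule]:
    """Turn every training example into one maximally specific rule."""
    return [(dict(x), y) for x, y in zip(examples, labels)]


def covers(conditions: Dict[str, str], example: Example) -> bool:
    """A rule covers an example if all of its conditions hold."""
    return all(example.get(a) == v for a, v in conditions.items())


def predict(rules: List[Rule], example: Example, default: str = "?") -> str:
    """Return the class of the first covering rule, else the default."""
    for conditions, label in rules:
        if covers(conditions, example):
            return label
    return default


def complexity(rules: List[Rule]) -> Tuple[int, int]:
    """Number of rules and total number of conditions in the rule set."""
    return len(rules), sum(len(c) for c, _ in rules)


# Hypothetical, contradiction-free training set.
train_x = [
    {"odor": "foul", "cap": "flat"},
    {"odor": "none", "cap": "flat"},
    {"odor": "none", "cap": "bell"},
]
train_y = ["poisonous", "edible", "edible"]

rules = rote_rules(train_x, train_y)
train_acc = sum(predict(rules, x) == y for x, y in zip(train_x, train_y)) / len(train_x)
unseen = {"odor": "foul", "cap": "bell"}

print("training accuracy:", train_acc)
print("(rules, conditions):", complexity(rules))
print("prediction for unseen example:", predict(rules, unseen))
```

A single shorter rule such as "poisonous if odor = foul" would classify the unseen example, which is the kind of generalization that a simplicity bias is meant to encourage.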
Occam's Razor, "Entia non sunt multiplicanda sine necessitate", which is attributed to English philosopher and theologian William of Ockham (c. 1287-1347), has been put forward as support for a principle of parsimony in the philosophy of science (Hahn, 1930). In machine learning, this principle is generally interpreted as "given two explanations of the data, all other things being equal, the simpler explanation is preferable" (Blumer et al., 1987), or simply "choose the shortest explanation for the observed data" (Mitchell, 1997). While it is well-known that striving for simplicity often yields better predictive results (mostly because pruning or regularization techniques help to avoid overfitting), the exact formulation of the principle is still subject to debate (Domingos, 1999), and several cases have been observed where more complex theories perform better (Murphy & Pazzani, 1994; Webb, 1996; Bensusan, 1998).
Much of this debate focuses on the aspect of predictive accuracy. When it comes to understandability, the idea that simpler rules are more comprehensible is typically unchallenged. A nice counterexample is due to Munroe (2013), who observed that route directions like "take every left that doesn't put you on a prime-numbered highway or street named for a president" could be most compressive but considerably less comprehensible. Although Domingos (1999) argues in his critical review that it is theoretically and empirically false to favor the simpler of two models with the same training-set error on the grounds that this would lead to lower generalization error, he concludes that Occam's Razor is nevertheless relevant for machine learning but should be interpreted as a preference for more comprehensible (rather than simple) models. Here, the term "comprehensible" clearly does not refer to syntactical length. In the same direction, we argue that the Occam's Razor principle can be redefined in terms of semantic comprehensibility that goes beyond mere syntactic model size and "mechanical understanding".
A particular implementation of Occam's Razor in machine learning is the minimum description length (MDL; Rissanen, 1978) or minimum message length (MML; Wallace & Boulton, 1968) principle, which is an information-theoretic formulation of the principle that smaller models should be preferred (Grünwald, 2007). The description length that should be minimized is the sum of the complexity of the model plus the complexity of the data encoded given the model. In this way, both the complexity and the accuracy of a model can be traded off: the description length of an empty model consists only of the data part, and it can be compared to the description length of a perfect model, which does not need additional information to encode the data. The theoretical foundation of this principle is based on the Kolmogorov complexity (Li & Vitányi, 1993), the essentially uncomputable length of the smallest model of the data. In practice, different coding schemes have been developed for encoding models and data and have, e.g., been used as pruning criterion (Quinlan, 1990; Cohen, 1995; Mehta et al., 1995) or for pattern evaluation (Vreeken et al., 2011). However, we are not aware of any work that relates MDL to interpretability.
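To illustrate the two-part trade-off, the following sketch computes a description length as model bits plus data bits. The coding scheme is a deliberately crude assumption made only for this illustration and is not one of the schemes cited above: each condition costs log2 of the number of candidate conditions, and the data part costs the bits needed to say which of the training examples the model misclassifies. All function names and numbers are hypothetical.

```python
import math


def model_bits(num_conditions: int, num_candidate_conditions: int) -> float:
    """Bits to name each condition among the candidate conditions (toy code)."""
    return num_conditions * math.log2(num_candidate_conditions)


def data_bits(num_examples: int, num_errors: int) -> float:
    """Bits to identify which examples are misclassified by the model (toy code)."""
    return math.log2(math.comb(num_examples, num_errors))


def description_length(num_conditions: int, num_candidate_conditions: int,
                       num_examples: int, num_errors: int) -> float:
    """Two-part description length: L(model) + L(data | model)."""
    return (model_bits(num_conditions, num_candidate_conditions)
            + data_bits(num_examples, num_errors))


# Hypothetical setting: 100 training examples, 16 candidate conditions.
empty = description_length(0, 16, 100, 30)    # no conditions, 30 errors
medium = description_length(3, 16, 100, 5)    # 3 conditions, 5 errors
perfect = description_length(12, 16, 100, 0)  # 12 conditions, no errors

for name, bits in [("empty", empty), ("medium", medium), ("perfect", perfect)]:
    print(f"{name:8s} {bits:6.1f} bits")
```

Under these assumed numbers, the empty model pays only for the data part, the perfect model pays only for the model part, and the medium-sized model has the shortest total description, which is the trade-off described above.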
Nevertheless, many works assume that the comprehensibility of a rule-based model can be measured by measures that relate to the complexity of the model, such as the number of rules or the number of conditions. A perhaps prototypical example is the Interpretable Classification Rule Mining (ICRM) algorithm, which "is designed to maximize the comprehensibility of the classifier by minimizing the number of rules and the number of conditions" via an evolutionary process (Cano et al., 2013). Similarly, Minnaert et al. (2015) investigate a rule learner that is able to optimize multiple criteria, and evaluate it by investigating the Pareto front between accuracy and comprehensibility, where the latter is coarsely measured with the number of rules. Lakkaraju et al. (2016) propose a method for learning rule sets that simultaneously optimizes accuracy and interpretability, where the latter is again measured by several conventional data-driven criteria such as rule overlap, coverage of the rule set, and the number of conditions and rules in the set.

The Bias for Complexity
Even though most systems have a bias toward simpler theories for the sake of overfitting avoidance and increased accuracy, some rule learning algorithms strive for more complex rules, and have good reasons for doing so. Michalski (1983) already noted that there are two different kinds of rules, discriminative and characteristic. Discriminative rules can quickly discriminate an object of one category from objects of other categories. A simple example is the rule
elephant :- trunk.
which states that an animal with a trunk is an elephant. This implication provides a simple but effective rule for recognizing elephants among all animals. However, it does not provide a very clear picture of the properties of the elements of the target class. For example, from the above rule, we do not understand that elephants are also very large and heavy animals with a thick grey skin, tusks and big ears.
Characteristic rules, on the other hand, try to capture all properties that are common to the objects of the target class. A rule for characterizing elephants could be
heavy, large, grey, bigEars, tusks, trunk :- elephant.
Note that here the implication sign is reversed: we list all properties that are implied by the target class, i.e., by an animal being an elephant. From the point of view of understandability, characteristic rules are often preferable to discriminative rules. For example, in a customer profiling application, we might prefer to not only list a few characteristics that discriminate one customer group from the other, but are interested in all characteristics of each customer group.
Characteristic rules are very much related to formal concept analysis (Wille, 1982; Ganter & Wille, 1999). Informally, a concept is defined by its intent (the description of the concept, i.e., the conditions of its defining rule) and its extent (the instances that are covered by these conditions). A formal concept is then a concept where the extent and the intent are Pareto-maximal, i.e., a concept where no conditions can be added without reducing the number of covered examples. In Michalski's terminology, a formal concept is both discriminative and characteristic, i.e., a rule where the head is equivalent to the body.
It is well-known that formal concepts correspond to closed itemsets in association rule mining, i.e., to maximally specific itemsets (Stumme et al., 2002). Closed itemsets have been mined primarily because they are a unique and compact representative of equivalence classes of itemsets, which all cover the same instances (Zaki & Hsiao, 2002). However, while all itemsets in such an equivalence class are equivalent with respect to their support, they may not be equivalent with respect to their understandability or interestingness.
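The relation between an itemset and its closure can be sketched in a few lines. The transaction data and the function names `extent` and `closure` are hypothetical and chosen only for illustration; the sketch computes the closure as the intersection of all transactions that contain the itemset, and checks that the itemset and its closure cover exactly the same instances.

```python
from typing import AbstractSet, FrozenSet, List, Set

# Hypothetical shopping baskets.
transactions: List[Set[str]] = [
    {"diapers", "beer", "milk", "bread"},
    {"diapers", "beer", "milk", "bread", "soda"},
    {"beer", "soda"},
    {"milk", "bread"},
]


def extent(itemset: AbstractSet[str]) -> List[int]:
    """Indices of the transactions that contain every item of the itemset."""
    return [i for i, t in enumerate(transactions) if itemset <= t]


def closure(itemset: AbstractSet[str]) -> FrozenSet[str]:
    """Largest itemset shared by all transactions that cover `itemset`."""
    covered = [transactions[i] for i in extent(itemset)]
    if not covered:
        # Nothing is covered: by convention the closure is the set of all items.
        return frozenset(set().union(*transactions))
    return frozenset(set.intersection(*covered))


short = {"diapers", "beer"}
closed = closure(short)

print(sorted(closed))
print(extent(short) == extent(closed))  # same covered instances, same support
```

Here the short itemset and its closure have identical support, yet the closed itemset lists every item the covered baskets have in common, which is the sense in which members of one equivalence class can differ in how informative they appear.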
Consider, e.g., the infamous {diapers, beer} itemset that is commonly used as an example for a surprising finding in market basket analysis. A possible explanation for this finding is that this rule captures the behavior of young fathers who are sent to shop for their youngster and have to reward themselves with a six-pack. However, if we consider that a young family may not only need beer and diapers, the closed itemset of this particular combination may also include baby lotion, milk, porridge, bread, fruits, vegetables, cheese, sausages, soda, etc. In this extended context, diapers and beer appear to be considerably less surprising. Conversely, an association rule
beer :- diapers. (1)
with an assumed confidence of 80%, which at first sight appears interesting because of the unexpectedly strong correlation between buying two seemingly unrelated items, becomes considerably less interesting if we learn that 80% of all customers buy beer, irrespective of whether they have bought diapers or not. In other words, the association rule (1) is considerably less plausible than the association rule
beer :- diapers, baby lotion, milk, porridge, bread, fruits, vegetables, cheese, sausages, soda.
even if both rules may have very similar properties in terms of support and precision.Gamberger & Lavrač (2003) introduce supporting factors as a means for complementing the explanation delivered by conventional learned rules.Essentially, they are additional attributes that are not part of the learned rule, but nevertheless have very different distributions with respect to the classes of the application domain.In a way, enriching a rule with such supporting factors is quite similar to computing the closure of a rule.In line with the results of Kononenko (1993), medical experts found that these supporting factors increase the plausibility of the found rules.2014) introduced so-called inverted heuristics for inductive rule learning.The key idea behind them is a rather technical observation based on a visualization of the behavior of rule learning heuristics in coverage space (Fürnkranz & Flach, 2005), namely that the evaluation of rule refinements is based on a bottom-up point of view, whereas the refinement process proceeds top-down, in a general-tospecific fashion.As a remedy, it was proposed to "invert" the point of view, resulting in heuristics that pay more attention to maintaining high coverage on the positive examples, whereas conventional heuristics focus more on quickly excluding negative examples.Somewhat unexpectedly, it turned out that this results in longer rules, which resemble characteristic rules instead of the conventionally learned discriminative rules.For example, Figure 2 shows the two decision lists that have been found for the Mushroom dataset with the conventional Laplace heuristic h Lap (top) and its inverted counterpart 4 Lap (bottom).Although fewer rules are learned with 4 Lap , and thus the individual rules are more general on average, they are also considerably longer.Intuitively, these rules also look more convincing, because the first set of rules often only uses a single criterion (e.g., odor) to discriminate between edible and 
poisonous mushrooms.Stecher et al. (2016) and Valmarska et al. (2017) investigated the suitability of such rules for subgroup discovery, with somewhat inconclusive results.
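The beer and diapers argument above can be made concrete with a small calculation. The following sketch uses hypothetical transaction counts (they are not taken from any dataset in this paper) chosen so that the rule beer :- diapers has a confidence of 80% while 80% of all customers buy beer. The function names are illustrative. Lift, i.e., the ratio of a rule's confidence to the base rate of its head, then equals 1, which means that buying diapers carries no information about buying beer.

```python
def confidence(n_body_and_head: int, n_body: int) -> float:
    """Fraction of transactions matching the rule body that also contain the head."""
    return n_body_and_head / n_body


def lift(n_body_and_head: int, n_body: int, n_head: int, n_total: int) -> float:
    """Confidence of the rule divided by the base rate of its head."""
    return confidence(n_body_and_head, n_body) / (n_head / n_total)


# Hypothetical counts, chosen only to mirror the 80% figures in the text.
N_TOTAL = 1000         # all transactions
N_BEER = 800           # transactions containing beer (80% base rate)
N_DIAPERS = 100        # transactions containing diapers
N_DIAPERS_BEER = 80    # transactions containing both items

rule_confidence = confidence(N_DIAPERS_BEER, N_DIAPERS)
rule_lift = lift(N_DIAPERS_BEER, N_DIAPERS, N_BEER, N_TOTAL)
print(f"confidence = {rule_confidence:.2f}, lift = {rule_lift:.2f}")
```

A lift close to 1 is one common way to flag rules whose confidence merely reflects the base rate of the head.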

Conflicting Evidence
The above-mentioned examples should help to motivate that the complexity of rules may have an effect on the interpretability and plausibility of a rule. Even in cases where a simpler and a more complex rule cover the same number of examples, shorter rules are not necessarily more understandable. There are a few isolated empirical studies that add to this picture. However, the results on the relation between the size of representation and interpretability are limited and conflicting.
Larger Models are Less Interpretable. Huysmans et al. (2011) were among the first who actually tried to empirically validate the often implicitly made claim that smaller models are more interpretable. In particular, they related increased complexity to measurable events such as a decrease in answer accuracy, an increase in answer time, and a decrease in confidence. From this, they concluded that smaller models tend to be more interpretable, proposing that there is a certain complexity threshold that limits the practical utility of a model. However, they also noted that in parts of their study, the correlation of model complexity with utility was less pronounced. The study also does not report on the domain knowledge that its participants had relating to the data used, so it cannot be ruled out that the obtained result was caused by a lack of domain knowledge. A similar study was later conducted by Piltaver et al. (2016), who found a clear relationship between model complexity and interpretability in decision trees.
Larger Models are More Interpretable. A direct evaluation of the perceived understandability of classification models has been performed by Allahyari & Lavesson (2011). They elicited preferences on pairs of models which were generated from two UCI datasets: Labor and Contact Lenses. What is unique to this study is that the analysis took into account the estimated domain knowledge of the participants on each of the datasets. On Labor, participants were expected to have good domain knowledge, but not so for Contact Lenses. The study was performed with 100 student subjects and involved several decision tree induction algorithms (J48, RIDOR, ID3) as well as rule learners (PRISM, REP, JRIP). It was found that larger models were considered more comprehensible than smaller models on the Labor dataset, whereas the users showed the opposite preference for Contact Lenses. Allahyari & Lavesson (2011) explain the discrepancy with the lack of prior knowledge for Contact Lenses, which makes it harder to understand complex models, whereas in the case of Labor, ". . . the larger or more complex classifiers did not diminish the understanding of the decision process, but may have even increased it through providing more steps and including more attributes for each decision step." In an earlier study, Kononenko (1993) found that medical experts rejected rules learned by a decision tree algorithm because they found them to be too short. Instead, they preferred explanations that were derived from a Naïve Bayes classifier, which essentially showed weights for all attributes, structured into confirming and rejecting attributes. We are not aware of any studies that explicitly addressed the aspect of complexity and plausibility.

An Experiment on Rule Complexity and Plausibility
In this section, we report on experiments that aimed at testing whether rule length has an influence on the interpretability or plausibility of found rules at all, and, if so, whether people tend to prefer longer or shorter rules. As a basis we used pairs of rules generated by machine learning systems, one rule representing a shorter, and the other a longer explanation. Participants were then asked to indicate which one of the pair they preferred. Using crowd-sourcing as a means of acquiring data allows us to gather thousands of responses in a manageable time frame while at the same time ensuring that our results can be easily replicated. To this end, source datasets, preprocessing code, the responses obtained with crowdsourcing, and the code used to analyze them were made available at https://github.com/kliegr/rule-length-project.

Rule Generation
For the experiment, we generated several rule pairs consisting of a long and a short rule that have the same or a similar degree of generality. Two different approaches were used to generate rules:
Class Association Rules: We used a standard implementation of the APRIORI algorithm for association rule learning (Agrawal et al., 1993; Hahsler et al., 2011) and filtered the output for class association rules with a minimum support of 0.01, minimum confidence of 0.5, and a maximum length of 5. Pairs were formed between all rules that correctly classified at least one shared instance. Although other more sophisticated approaches (such as a threshold on the Dice coefficient) were considered, it turned out that the process outlined above produced rule pairs with quite similar values of confidence (i.e., most equal to 1.0), except for the Movies dataset.
2), this results in rule pairs that have approximately the same degree of generality but different complexities.
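The filtering and pairing step for class association rules can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the `Rule` class, the representation of instances as dictionaries, the `class` attribute name, and the toy Mushroom-like data are all hypothetical, and rule length is assumed to count antecedent conditions only. Only the three thresholds and the pairing criterion come from the text. The sketch does not mine rules itself; it assumes they were already produced by an APRIORI implementation.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rule:
    antecedent: frozenset  # set of (attribute, value) conditions
    consequent: str        # predicted class label
    support: float
    confidence: float


def correctly_classifies(rule, instance, class_attr="class"):
    """True if all conditions hold and the predicted class matches the instance."""
    covered = all(instance.get(attr) == value for attr, value in rule.antecedent)
    return covered and instance[class_attr] == rule.consequent


def filter_rules(rules, min_support=0.01, min_confidence=0.5, max_length=5):
    """Keep rules meeting the thresholds stated in the text."""
    return [
        r for r in rules
        if r.support >= min_support
        and r.confidence >= min_confidence
        and len(r.antecedent) <= max_length  # assumption: length = number of conditions
    ]


def form_pairs(rules, instances):
    """Pair all rules that correctly classify at least one shared instance."""
    return [
        (r1, r2)
        for r1, r2 in combinations(rules, 2)
        if any(correctly_classifies(r1, x) and correctly_classifies(r2, x)
               for x in instances)
    ]


# Hypothetical toy data.
instances = [
    {"odor": "foul", "gill_size": "narrow", "class": "poisonous"},
    {"odor": "none", "gill_size": "broad", "class": "edible"},
]
rules = [
    Rule(frozenset({("odor", "foul")}), "poisonous", 0.5, 1.0),
    Rule(frozenset({("odor", "foul"), ("gill_size", "narrow")}), "poisonous", 0.5, 1.0),
    Rule(frozenset({("odor", "none")}), "edible", 0.5, 1.0),
    Rule(frozenset({("gill_size", "broad")}), "edible", 0.005, 1.0),  # below min support
]

kept = filter_rules(rules)
pairs = form_pairs(kept, instances)
print(len(kept), "rules kept,", len(pairs), "pair(s) formed")
```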
We used these algorithms to learn rules for four publicly available datasets, namely the Mushroom dataset from the UCI repository, and three datasets derived from the Linked Open Data (LOD) cloud relating to traffic accidents, movies, and the quality of living index (Ristoski et al., 2016). The goal behind these selections was that the respondents are able to comprehend a given rule without the need for additional explanations, but are not able to reliably judge the validity of a given rule. Thus, respondents will need to rely on their common sense in order to judge which of two rules appears to be more convincing. This also implies that we specifically did not expect the users to have expert knowledge in these domains. An overview of the datasets is shown in Table 2.
The Mushroom dataset contains mushroom records drawn from Field Guide to North American Mushrooms (Lincoff, 1981). It is arguably one of the most frequently used datasets in rule learning research, its main advantage being discrete, understandable attributes. The three LOD-based datasets, Traffic, Movies, and Quality, originally only consisted of a name and a target variable. The names were then linked to entities in the public LOD dataset DBpedia, using the method described by Paulheim & Fürnkranz (2012). From that dataset, we extracted the classes of the entities, using the deep classification of YAGO, which defines a very fine-grained class hierarchy of several thousand classes. Each class was added as a binary attribute. For example, the entity for the city of Vienna would get the binary features European Capitals, UNESCO World Heritage Sites, etc. The sources of the three datasets are as follows:
Traffic is a statistical dataset of death rates in traffic accidents by country, obtained from the WHO (http://www.who.int/violence_injury_prevention/road_traffic/en/).
Quality is a dataset derived from the Mercer Quality of Living index, which collects the perceived quality of living in cities worldwide (http://across.co.nz/qualityofliving.htm).
Movies is a dataset of movie ratings obtained from MetaCritic (http://www.metacritic.com/movie).
For the final selection of the rule pairs, we categorized the differences into several groups according to the perceived differences, such as differences in rule length. The criteria used are shown in Table 3. However, only in the Traffic dataset did we have a sufficiently large number of candidate rule pairs to choose from, so that we could sample each of these groups equally. For Quality and Movies, all rule pairs were used. For the Mushroom dataset, we selected rule pairs so that every difference in length (one to five) is represented.
As a final step, we automatically translated all rule pairs into human-friendly HTML-formatted text, and randomized the order of the rules in the rule pair. Example rules for the four datasets are shown in Figure 3. The first column of Table 2 shows the final number of rule pairs generated in each domain.
Table 3: Rule selection groups
subsuming: different-length rules, either antecedent of rule 1 is subset of antecedent of rule 2, or antecedent of rule 2 is subset of antecedent of rule 1
different length rules with disjunct attributes: different-length rules, the antecedent of rule 1 is disjunct with antecedent of rule 2
same length rules non disjunct attributes: same-length rules, antecedent of rule 1 is not disjunct with antecedent of rule 2
same length rules disjunct attributes: same-length rules, antecedent of rule 1 is disjunct with antecedent of rule 2
different length rules neither disjunct nor subsuming attributes: different-length rules, the antecedent of rule 1 is not disjunct with antecedent of rule 2, antecedent of rule 1 is not subset of antecedent of rule 2, antecedent of rule 2 is not subset of antecedent of rule 1
large difference in rule length: the difference between the lengths of the rules had to be at least 2 (selected only from inverted heuristic pairs)
one difference in rule length: the difference between the lengths of the rules had to be exactly 1 (selected only from inverted heuristic pairs)
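The first five group definitions in Table 3 amount to simple set comparisons on the rule antecedents. The sketch below illustrates this under the assumption that each antecedent is represented as a set of attribute names; the function name and the returned labels are hypothetical. The two groups defined purely by length difference (used only for the inverted heuristic pairs) are not covered.

```python
def selection_group(antecedent_1: frozenset, antecedent_2: frozenset) -> str:
    """Assign a rule pair to one of the first five groups of Table 3."""
    same_length = len(antecedent_1) == len(antecedent_2)
    disjunct = antecedent_1.isdisjoint(antecedent_2)

    if same_length:
        if disjunct:
            return "same length, disjunct attributes"
        return "same length, non disjunct attributes"

    # Different lengths: check for proper subset in either direction first.
    if antecedent_1 < antecedent_2 or antecedent_2 < antecedent_1:
        return "subsuming"
    if disjunct:
        return "different length, disjunct attributes"
    return "different length, neither disjunct nor subsuming"


# Hypothetical antecedents built from Mushroom-like attribute names.
short_rule = frozenset({"odor"})
long_rule = frozenset({"odor", "gill_size", "ring_type"})
print(selection_group(short_rule, long_rule))
```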

The CrowdFlower Platform
As the experimental platform we used the CrowdFlower crowd-sourcing service. Similar to the better-known Amazon Mechanical Turk, CrowdFlower allows experimenters to distribute questionnaires to participants around the world, who complete them for remuneration. The remuneration is typically a small payment in US dollars, but some participants may receive the payment in other currencies, including in-game currencies ("coins").
Specification of a CrowdFlower Task. A crowdsourcing task performed in CrowdFlower consists of a sequence of steps:
1. The CrowdFlower platform recruits subjects for the task from among the cohort of its workers who match the level and geographic requirements set by the experimenter. The workers decide to participate in the task based on the payment offered and the description of the task.
2. Subjects are presented with an assignment containing an illustrative example.
3. If the task contains test questions, each subject has to pass a quiz mode with test questions. Subjects learn about the correct answer after they pass the quiz mode. Subjects have the option to contest the correct answer if they consider it incorrect.
4. Subjects proceed to the work mode, where they complete the task they have been assigned by the experimenter. The task typically has the form of a questionnaire. If test questions were defined by the experimenter, the CrowdFlower platform randomly inserts test questions into the questionnaire. Failing a predefined proportion of hidden test questions results in removal of the subject from the task. Failing the initial quiz or failing a task can also reduce subjects' accuracy on the CrowdFlower platform. Based on the average accuracy, subjects can reach one of three levels. A higher level gives a user access to additional, possibly better paying tasks.
5. Subjects can leave the experiment at any time. To obtain payment for their work, subjects need to submit at least one page of work. After completing each page of work, the subject can opt to start another page. The maximum number of pages that a subject can complete is set by the experimenter. As a consequence, two subjects can contribute a different number of judgments to the same task.
6. If a bonus was promised, the qualifying subjects receive extra credit.

Representativeness of Crowd-Sourcing Experiments.
There are a number of differences between crowdsourcing and the controlled laboratory environment previously used to run psychological experiments. The central question is to what extent the cognitive abilities and motivation of subjects differ between the crowdsourcing cohort and the controlled laboratory environment. Since there is little research specifically focusing on the population of the CrowdFlower platform, which we use in our research, we present data related to Amazon Mechanical Turk (AMT), under the assumption that the descriptions of the populations will not differ substantially. This is also supported by previous work such as Wang et al. (2015), which has indicated that the user distributions of CrowdFlower and AMT are comparable. The population of crowdsourcing workers is a subset of the population of Internet users, which is described in a recent meta study by Paolacci & Chandler (2014) as follows: "Workers tend to be younger (about 30 years old), overeducated, underemployed, less religious, and more liberal than the general population." While there is limited research on workers' cognitive abilities, Paolacci et al. (2010) found "no difference between workers, undergraduates, and other Internet users on a self-report measure of numeracy that correlates highly with actual quantitative abilities." According to a more recent study by Crump et al. (2013), workers learn more slowly than university students and may have difficulties with complex tasks. Possibly the most important observation related to the focus of our study is that according to Paolacci et al. (2010) crowdsourcing workers "exhibit the classic heuristics and biases and pay attention to directions at least as much as subjects from traditional sources."
Statistical Information about Participants. CrowdFlower does not publish demographic data about its base of workers. Nevertheless, for all executed tasks, the platform makes available the location of the worker submitting each judgment. In this section, we use this data to elaborate on the number and geographical distribution of workers participating in Experiments 1-5 described later in this paper. Table 4a reports on workers participating in Experiments 1-3, where three types of guidelines were used in conjunction with four different datasets, resulting in 9 tasks in total (not all combinations were tried). Experiments 4-5 involved different guidelines (for determining attribute and literal relevance) and the same datasets. The geographical distribution is reported in Table 4b. In total, the reported results are based on 2958 trusted judgments. In reality, more judgments were collected, but some were excluded due to automated quality checks.
In order to reduce possible effects of language proficiency, we restricted our respondents to English-speaking countries. Most judgments (1417) were made by workers from the United States, followed by the United Kingdom (837) and Canada (704). The number of distinct participants for each crowdsourcing task is reported in detailed tables describing the results of the corresponding experiments (part column in Tables 5-9). Note that some workers participated in multiple tasks. The total number of distinct participants across all tasks reported in Tables 4a and 4b is 390.

Experiment 1: Are Shorter Rules More Plausible?
In the following, we cover the first of a series of empirical experiments performed to support the hypotheses presented in this paper. Most of the setup is shared with the subsequent experiments and will not be repeated. Cognitive science research has different norms for describing experiments than are used in machine learning research. We tried to respect these by dividing the experiment description into subsections entitled "Material", "Subjects", "Methodology", and "Results", which correspond to the standard outline of an experimental account in cognitive science. Also, the setup of the experiments is described in somewhat greater detail than is usual in machine learning, which is motivated by the general sensitivity of human subjects to conditions such as the amount of payment.
Material. The participants were briefed with task instructions, which described the purpose of the task, gave an example rule, and explained plausibility as the elicited quantity (cf. Figure 4). As part of the explanation, the subjects were given definitions of "plausible" sourced from the Oxford Dictionary and the Cambridge Dictionary (British and American English). The questionnaires presented pairs of rules as described in section 4.1, and asked the participants to give a) a judgment of which rule in each pair is preferred and b) optionally a textual explanation for the judgment. A sample question is shown in Figure 5. The judgments were elicited using a drop-down box, where the subjects could choose from the following five options: "Rule 1 (strong preference)", "Rule 1 (weak preference)", "No preference", "Rule 2 (weak preference)", "Rule 2 (strong preference)". As shown in Figure 5, the definition of plausibility was accessible to participants at all times, since it was featured below the drop-down box. As optional input, the participants could provide a textual explanation of their reasoning behind the assigned preference, which we informally evaluated but which is not further considered in the analyses reported in this paper.
The workers in the CrowdFlower platform were invited to participate in individual tasks. For one judgment relating to one rule we paid 0.07 USD. The number of judgments per rule pair for this experiment was 5 for the Traffic, Quality, and Movies datasets. The Mushroom dataset had only 10 rule pairs; therefore, we opted to collect 25 judgments for each rule pair in this dataset.
The task was available to Level 2 workers residing in the U.S., Canada, and the United Kingdom. In order to avoid spurious answers, we also employed a minimum threshold of 180 seconds for completing a page; subjects taking less than this amount of time to complete a page were removed from the job. A maximum time required to complete the assignment was not specified, and the maximum number of judgments per contributor was not limited.
For quality assurance, each subject who decided to accept the task first faced a quiz consisting of a random selection of previously defined test questions. These had the same structure as regular questions but additionally contained the expected correct answer (or answers) as well as an explanation for the answer. We used swap test questions where the order of the conditions was randomly permuted in each of the two pairs, so that the subject should not have a preference for either of the two versions. The correct answer and explanation were only shown after the subject had responded to the question. Only subjects achieving at least 70% accuracy on test questions could proceed to the main task.
Methodology. Evaluations were performed at the level of individual judgments, also called micro-level, i.e., each response was considered to be a single data point, and multiple judgments for the same pair were not aggregated prior to the analysis. By performing the analysis at the micro-level, we avoided the possible loss of information as well as the aggregation bias (Clark & Avery, 1976). Also, as shown for example by Robinson (1950), the ecological (macro-level) correlations are generally larger than the micro-level correlations; therefore, by performing the analysis on the individual level we obtain more conservative results.
We report the rank correlation between rule length and the observed evaluation (Kendall's τ, Spearman's ρ) and test whether the coefficients are significantly different from zero. We will refer to the values of Kendall's τ as the primary measure of rank correlation, since according to Kendall & Gibbons (1990) and Newson (2002), the confidence intervals for Spearman's ρ are less reliable than confidence intervals for Kendall's τ.
For all obtained correlation coefficients we compute the p-value, which is the probability of obtaining a correlation coefficient at least as extreme as the one that was actually observed, assuming that the null hypothesis (that there is no correlation between the two variables) holds. The typical cutoff value for rejecting the null hypothesis is α = 0.05.
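To illustrate the primary measure, the following sketch computes Kendall's τ on micro-level data in plain Python. Several points are assumptions rather than details taken from the study: the tie-corrected τ-b variant is used because a five-level preference scale produces many ties, the preference options are coded as integers from -2 to 2, rule length enters as the length difference between the two rules of a pair, and the data are invented. In practice, a statistics library such as SciPy (`scipy.stats.kendalltau`, `scipy.stats.spearmanr`) would also return the p-value.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b: rank correlation with a correction for ties."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    denominator = sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denominator == 0:
        raise ValueError("tau-b is undefined when one variable is constant")
    return (concordant - discordant) / denominator


# Invented micro-level data: one entry per individual judgment.
length_difference = [1, 2, -1, 0, 3, -2, 1, 1]   # length(rule 2) - length(rule 1)
preference = [1, 2, -1, 0, 1, -2, 0, 2]          # -2 = strong for rule 1 ... 2 = strong for rule 2

tau = kendall_tau_b(length_difference, preference)
print(f"Kendall's tau-b = {tau:.3f}")
```

Each judgment is a separate data point here, which matches the micro-level analysis described above; aggregating judgments per rule pair before computing τ would correspond to the macro-level alternative.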
Figure 6: The Linda problem. "Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable? (a) Linda is a bank teller. (b) Linda is a bank teller and is active in the feminist movement."
Results. Table 5 summarizes the results of this crowd-sourcing experiment. In total, we collected 1002 responses, which is on average 6.3 judgments for each of the 158 rule pairs. On two of the datasets, Quality and Mushroom, there was a strong, statistically significant positive correlation between rule length and the observed plausibility of the rule, i.e., longer rules were preferred. In the other two datasets, Traffic and Movies, no significant difference could be observed in either direction.
In any case, these results do not reveal a negative correlation between rule length and plausibility on any dataset. In fact, in two of the four datasets, we even observed a positive correlation, meaning that in these cases longer rules were preferred.

The Conjunction Fallacy
Human-perceived plausibility of a hypothesis has been extensively studied in cognitive science. The best-known cognitive phenomenon related to our focus area, the influence of the number of conditions in a rule on its plausibility, is the conjunction fallacy. This fallacy falls into the research program on cognitive biases and heuristics carried out by Amos Tversky and Daniel Kahneman since the 1970s. The outcome of this research program can be succinctly summarized by a quotation from Kahneman's Nobel Prize lecture at Stockholm University on December 8, 2002: "... it is safe to assume that similarity is more accessible than probability, that changes are more accessible than absolute values, that averages are more accessible than sums, and that the accessibility of a rule of logic or statistics can be temporarily increased by a reminder." (Kahneman, 2003) In this section, we will briefly review some aspects of this program, highlighting those that seem to be important for inductive rule learning. For a more thorough review we refer to Kahneman et al. (1982) and Gilovich et al. (2002); a more recent, very accessible introduction can be found in Kahneman (2011).

The Linda Problem
The conjunction fallacy is often defined in the literature via the "Linda" problem. In this problem, participants are asked whether they consider it more probable that a person named Linda is (a) a bank teller or (b) a feminist bank teller (Figure 6). Tversky & Kahneman (1983) report that based on the provided characteristics of Linda, 85% of the participants indicate (b) as the more probable option. This was essentially confirmed by various independent studies, even though the actual proportions may vary. In particular, similar results could be observed across multiple settings (hypothetical scenarios, real-life domains), as well as for various kinds of respondents (university students, children, experts, as well as statistically sophisticated individuals) (Tentori & Crupi, 2012). However, it is easy to see that the preference for (b) is in conflict with elementary laws of probability. Essentially, in this example, respondents are asked to compare the conditional probabilities Pr(F ∧ B | L) and Pr(B | L), where B refers to "bank teller", F to "active in feminist movement", and L to the description of Linda. Of course, the probability of a conjunction, Pr(A ∧ B), cannot exceed the probability of its constituents, Pr(A) and Pr(B) (Tversky & Kahneman, 1983). In other words, as it always holds for the Linda problem that Pr(F ∧ B | L) ≤ Pr(B | L), the preference for alternative F ∧ B (option (b) in Figure 6) is a logical fallacy.
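The inequality follows in one step from the law of total probability. The short derivation below uses only the symbols B, F, and L defined above and is added here as an illustration:

```latex
\begin{align*}
\Pr(B \mid L) &= \Pr(F \wedge B \mid L) + \Pr(\neg F \wedge B \mid L) \\
              &\geq \Pr(F \wedge B \mid L),
              \qquad \text{since } \Pr(\neg F \wedge B \mid L) \geq 0 .
\end{align*}
```

Equality holds only if Linda cannot be a bank teller without being active in the feminist movement, i.e., if Pr(¬F ∧ B | L) = 0.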

The Representativeness Heuristic
According to Tversky & Kahneman (1983), the results of the conjunction fallacy experiments demonstrate that "a conjunction can be more representative than one of its constituents". It is a symptom of a more general phenomenon, namely that people tend to overestimate the probabilities of representative events and underestimate those of less representative ones. The reason is attributed to the application of the representativeness heuristic. This heuristic provides humans with a means for assessing the probability of an uncertain event. According to the representativeness heuristic, the probability that an object A belongs to a class B is evaluated "by the degree to which A is representative of B, that is by the degree to which A resembles B" (Tversky & Kahneman, 1974).
This heuristic relates to the tendency to make judgments based on similarity, following the rule "like goes with like". According to Gilovich & Savitsky (2002), the representativeness heuristic can be held accountable for a number of widely held false and pseudo-scientific beliefs, including those in astrology or graphology. It can also inhibit valid beliefs that do not meet the requirements of resemblance.
A related phenomenon is that people often tend to misinterpret the meaning of the logical connective "and". Hertwig et al. (2008) hypothesized that the conjunction fallacy could be caused by "a misunderstanding about conjunction", i.e., by a different interpretation of "probability" and "and" by the subjects than assumed by the experimenters. They discussed that "and" in natural language can express several relationships, including temporal order and causal relationship, and, most importantly, can also indicate a collection of sets instead of their intersection. For example, the sentence "He invited friends and colleagues to the party" does not mean that all people at the party were both colleagues and friends. According to Sides et al. (2002), "and" ceases to be ambiguous when it is used to connect propositions rather than categories. The authors give the following example of a sentence which is not prone to misunderstanding: "IBM stock will rise tomorrow and Disney stock will fall tomorrow". A similar wording of rule learning results may be preferred, despite its verbosity. We further conjecture that representations that visually express the semantics of "and", such as decision trees, may be preferred over rules, which do not provide such visual guidance.

Experiment 2: Misunderstanding of "and" in Inductively Learned Rules
Given its omnipresence in rule learning results, it is vital to assess to what degree the "and" connective is misunderstood when rule learning results are interpreted. In order to gauge the effect of the conjunction fallacy, we carried out a separate set of crowdsourcing tasks. To control for misunderstanding of "and", the group of subjects approached in Experiment 2 additionally received intersection test questions, which were intended to ensure that all respondents understand the "and" conjunction the same way it is defined in the probability calculus. In order to answer these correctly, the respondent had to realize that the antecedent of one of the rules contains mutually exclusive conditions. The correct answer was a weak or strong preference for the rule that did not contain the mutually exclusive conditions.
Material. The subjects were presented with the same rule pairs as subjects in Experiment 1 (Group 1). The difference between Experiment 1 and Experiment 2 was only one manipulation: instructions in Experiment 2 additionally contained the intersection test questions, which were not present in Experiment 1. We refer to subjects that received these test questions as Group 2.
Figure 7: Example of the additional information on rule quality shown to subjects. "Additional Information: In our data, there are 76 movies which match the conditions of this rule. Out of these 72 are predicted correctly as having good rating. The confidence of the rule is 95%. In other words, out of the 76 movies that match all the conditions of the rule, the number of movies that are rated as good as predicted by the rule is 72. The rule thus predicts correctly the rating in 72/76=95 percent of cases."
Results. We state the following proposition: The effect of higher perceived interpretability of longer rules goes away when it is ensured that subjects understand the semantics of the "and" conjunction. The corresponding null hypothesis is that the correlation between rule length and plausibility is no longer statistically significantly different from zero for participants who successfully completed the intersection test questions (Group 2). We focus the analysis on the Mushroom and Quality datasets, on which we had initially observed a higher plausibility of longer rules.
The results presented in Table 6 show that the correlation coefficient is still statistically significantly different from zero for the Mushroom dataset, with Kendall's τ at 0.28 (p < 0.0001), but not for the Quality dataset, for which τ is not significantly different from zero at the 0.05 level (albeit at a much higher variance). This suggests that at least on the Mushroom dataset, there are other factors apart from the misunderstanding of "and" that cause longer rules to be perceived as more plausible.

Insensitivity to Sample Size
In the previous sections, we have motivated that rule length is by itself not an indicator for the plausibility of a rule if other factors such as the support and the confidence of the rule are equal. In this and the following sections, we will discuss the influence of these and a few alternative factors, partly motivated by results from the psychological literature. The goal is to motivate some directions for future research on the interpretability and plausibility of learned concepts.
In the previous experiments, we controlled the rules selected into the pairs so that they mostly had identical or nearly identical confidence and support. Furthermore, the confidence and support values of the shown rules were not revealed to the respondents during the experiments. However, in real situations, rules produced by inductive rule learning have varying quality, which is communicated mainly by the values of confidence and support.
Table 6: Effect of intersection test questions that are meant to ensure that participants understand the logical semantics of "and". pairs refers to the distinct number of rule pairs, judg to the number of trusted judgments, the quiz failure rate qfr to the percentage of participants that did not pass the initial quiz as reported by the CrowdFlower dashboard, part to the number of trusted distinct survey participants (workers), and τ to the observed correlation values with p-values in parentheses.
In the terminology used within the scope of cognitive science (Griffin & Tversky, 1992), confidence corresponds to the strength of the evidence and support to the weight of the evidence. Results in cognitive science for the strength and weight of evidence suggest that the weight is systematically undervalued while the strength is overvalued. According to Camerer & Weber (1992), this was, e.g., already mentioned by Keynes (1922), who drew attention to the problem of balancing the likelihood of the judgment and the weight of the evidence in the assessed likelihood. In particular, Tversky & Kahneman (1971) have argued that human analysts are unable to appreciate the reduction of variance and the corresponding increase in reliability of the confidence estimate with increasing values of support. This bias is known as insensitivity to sample size, and essentially describes the human tendency to neglect the following two principles: a) more variance is likely to occur in smaller samples, b) larger samples provide less variance and better evidence. Thus, people underestimate the increased benefit of higher robustness of estimates made on a larger sample.
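The two principles can be illustrated by attaching an interval estimate to the confidence of a rule. The sketch below uses the Wilson score interval for a binomial proportion, which is our choice for illustration and not a method used in the experiments; the two rules and their counts are hypothetical. The rule with the higher observed confidence but much smaller coverage ends up with a far wider interval and a lower guaranteed confidence.

```python
from math import sqrt


def wilson_interval(correct: int, covered: int, z: float = 1.96):
    """Wilson score interval for a rule's confidence (z = 1.96 gives about 95%)."""
    if covered <= 0:
        raise ValueError("the rule must cover at least one instance")
    p = correct / covered
    denominator = 1 + z * z / covered
    center = (p + z * z / (2 * covered)) / denominator
    half_width = z * sqrt(p * (1 - p) / covered + z * z / (4 * covered ** 2)) / denominator
    return center - half_width, center + half_width


# Hypothetical rules: a long rule with high confidence but low support,
# and a short rule with slightly lower confidence but high support.
low_long, high_long = wilson_interval(correct=9, covered=10)         # confidence 0.90
low_short, high_short = wilson_interval(correct=850, covered=1000)   # confidence 0.85

print(f"long rule:  [{low_long:.3f}, {high_long:.3f}]")
print(f"short rule: [{low_short:.3f}, {high_short:.3f}]")
```

An analyst who is insensitive to sample size would compare only 0.90 with 0.85 and ignore the difference in interval width.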
Given that longer rules can fit the data better, they tend to be higher on confidence and lower on support. This implies that if confronted with two rules of different length, where the longer has a higher confidence and the shorter a higher support, the analyst may prefer the longer rule with higher confidence (all other factors equal). These deliberations lead us to the following proposition: When both confidence and support are explicitly revealed, confidence but not support will positively increase rule plausibility.
6.1 Experiment 3: Is rule confidence perceived as more important than support?
To evaluate the effect of explicitly revealed confidence (strength) and support (weight) on rule preference, we performed an additional experiment.
Material. The subjects were presented with rule pairs as in the previous two experiments. However, Experiment 3 involved only rule pairs generated for the Movies dataset, where the differences in confidence and support between the rules in the pairs were largest. The difference between Experiment 1 and Experiment 3 was one manipulation: pairs in Experiment 3 additionally contained the information on how many good and bad instances were covered by a rule (see Figure 7). Subjects that received this extra information are referred to as Group 3.
Subjects and Remuneration. This setup was the same as for the preceding two experiments.
Results. Table 7 shows the correlations of the rule quality measures confidence and support with plausibility. It can be seen that there is a relation to confidence but not to support, even though both were explicitly present in the descriptions of rules for Group 3. The results also show that the relationship between revealed rule confidence and plausibility is causal. This follows from confidence not being correlated with plausibility in the original experiment (Group 1 in Table 7), which differed only via the absence of the explicitly revealed information about rule quality. While such a conclusion is intuitive, to our knowledge it has not been empirically confirmed before. Thus, our result supports the hypothesis that the insensitivity to sample size effect is applicable to the interpretation of inductively learnt rules. In other words, when both confidence and support are stated, confidence positively affects the preference for a rule whereas support tends to have no impact.
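The rank correlations reported here can be illustrated with a minimal sketch of Kendall's τ in its tie-corrected form (τ-b). The choice of the τ-b variant, the function name, and the data are assumptions made for illustration; the values below are invented and are not the judgments collected in the experiment.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(xs, ys):
    """Kendall's tau-b: rank correlation with a correction for tied pairs."""
    concordant = discordant = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            tied_x += 1          # pair is tied on the first variable
        if dy == 0:
            tied_y += 1          # pair is tied on the second variable
        if dx != 0 and dy != 0:
            if dx * dy > 0:
                concordant += 1  # both variables order the pair the same way
            else:
                discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    denominator = sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    return (concordant - discordant) / denominator if denominator else 0.0

# Hypothetical rule pairs: difference in confidence between the two rules
# and the elicited preference (positive values favour the first rule).
confidence_delta = [0.10, 0.30, -0.20, 0.40, 0.00]
plausibility = [1, 2, -1, 2, 0]

tau = kendall_tau_b(confidence_delta, plausibility)
print(f"Kendall's tau-b: {tau:.3f}")
```

A positive value indicates that rule pairs with a larger confidence difference also tend to receive a stronger preference for the first rule.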
[Figure 8: Attribute relevance task, instructions for the Mushroom dataset: "We kindly ask you to assist us in an experiment that will help researchers understand which properties influence mushroom being considered as poisonous/edible."]
[Figure 9: Literal relevance task for the Movies dataset. Instructions: "We kindly ask you to assist us in an experiment that will help researchers understand which factors can influence movie ratings." Example task: "Condition: Academy Award Winner or Nominee. The condition listed above will contribute to a movie being rated as: (select one option)."]

Relevance of Conditions in Rule
An obvious factor that can determine the perceived plausibility of a proposed rule is how relevant it appears to be. Of course, rules that contain more relevant conditions will be considered to be more acceptable. One way of measuring this could be the strength of the connection between the condition (or a conjunction of conditions) and the conclusion. However, in our crowdsourcing experiments we only showed sets of conditions that are equally relevant in the sense that their conjunction covers about the same number of examples in the shown rules or that the rules have a similar strength of connection. Nevertheless, the perceived or subjective relevance of a condition may be different for different users.
There are several cognitive biases that can distort the correlation between the relevance of conditions and the judgment of plausibility. One of the most recently proposed ones is the weak evidence effect, according to which evidence in favour of an outcome can actually decrease the probability that a person assigns to it. In an experiment in the area of forensic science reported by Martire et al. (2013), it was shown that participants presented with evidence weakly supporting guilt tended to "invert" the evidence, thereby counterintuitively reducing their belief in the guilt of the accused.
In order to analyze the real effect of relevance in the rule learning domain, we decided to enrich our input data with two supporting crowdsourcing tasks, which aimed at collecting judgments of attribute and literal relevance.

Experiment 4: Attribute and Literal Relevance
The experiments, described on a conceptual level in the following, were performed using crowdsourcing, similarly to the previous ones. Since the relevance experiments did not elicit preferences for rule pairs, there are multiple differences from the setup described earlier. We summarize the experiments in the following, but refer the reader to Kliegr (2017) for additional details.
Attribute Relevance. Attribute relevance corresponds to human perception of the ability of a specific attribute to predict values of the attribute in the rule consequent. For example, in the Movies data, the release date of a film may be perceived as less relevant for determining the quality of a film than its language. Attribute relevance also reflects a level of recognition of the explanatory attribute (cf. also Section 8), which is a prerequisite to determining the level of association with the target attribute. As an example of a specific attribute that may not be recognized, consider "Sound Mix" for a movie rating problem. This would contrast with attributes such as "Oscar winner" or "year of release", which are equally well recognized, but clearly associated to a different degree with the target. The attribute relevance experiments were prepared for the Mushroom and Traffic datasets. An example wording of the attribute relevance elicitation task for the Mushroom dataset is shown in Figure 8.
Literal Relevance. Literal relevance goes one step further than attribute relevance by measuring human perception of the ability of a specific condition to predict a specific value of the attribute in the rule consequent. It should be noted that we consider the literal relevance to also embed attribute relevance to some extent. For example, the literal ("film released in 2001") also conveys the attribute ("year of release"). However, in addition to the attribute name, the literal also conveys a specific value, which may not be recognized by itself. This again raises the problem of recognition as a prerequisite to association.
An example wording of the literal relevance elicitation task for the Movies dataset is shown in Figure 9. In this case, there was a small difference in setup between the experiments on the LOD datasets and the Mushroom dataset. The tasks for the LOD datasets contained links to Wikipedia for individual literals, as these were naturally available from the underlying datasets. For the Mushroom dataset no such links were available and thus these were not included in the task.
Enriching data with literal and attribute relevance. The data collected within Experiments 1-3 were enriched with variables denoting the relevance of attributes and literals of the individual rules. Given that in Experiments 1-3 plausibility was elicited for rule pairs, the variables representing relevance were computed as differences of values obtained for the rules in the pair.
Each rule pair was enriched with four variables according to the following pattern: "[Literal|Attribute]Rel[Avg|Max]∆". To compute the enrichment variable, the value of the relevance metric for the second rule in the pair (r2) was subtracted from the value for the first rule (r1). For example, LiteralRelAvg∆ = LiteralRelAvg(r1) − LiteralRelAvg(r2), where LiteralRelAvg(r1) and LiteralRelAvg(r2) represent the average relevance of the literals (conditions) present in the antecedent of rule r1 and r2, respectively.
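The computation of the four enrichment variables can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name, the data layout (rules as lists of attribute-value pairs), the ASCII suffix "Delta" in place of ∆, and all relevance scores are hypothetical.

```python
from statistics import mean

def relevance_deltas(rule1, rule2, literal_rel, attribute_rel):
    """Compute the four enrichment variables for a rule pair (r1, r2).

    Each rule is a list of (attribute, value) conditions. The two dictionaries
    map literals and attributes to (hypothetical) elicited relevance scores.
    """
    def scores(rule, kind):
        if kind == "Literal":
            return [literal_rel[condition] for condition in rule]
        return [attribute_rel[attribute] for attribute, _ in rule]

    deltas = {}
    for kind in ("Literal", "Attribute"):
        for name, aggregate in (("Avg", mean), ("Max", max)):
            # The value for the second rule is subtracted from that of the first.
            deltas[f"{kind}Rel{name}Delta"] = (
                aggregate(scores(rule1, kind)) - aggregate(scores(rule2, kind))
            )
    return deltas

# Invented example pair and relevance scores, for illustration only.
r1 = [("Genre", "Drama"), ("Award", "Oscar winner")]
r2 = [("Sound Mix", "Mono")]
literal_rel = {("Genre", "Drama"): 0.5, ("Award", "Oscar winner"): 1.0,
               ("Sound Mix", "Mono"): 0.25}
attribute_rel = {"Genre": 0.5, "Award": 0.75, "Sound Mix": 0.25}

deltas = relevance_deltas(r1, r2, literal_rel, attribute_rel)
for variable, value in deltas.items():
    print(variable, value)
```

A positive value means that the first rule in the pair contains, on average or at its maximum, more relevant conditions than the second rule.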
Results. Table 8 shows the correlations between plausibility and the added variables representing attribute and literal relevance on the data collected for Group 1 from the previous experiments. The results confirm that literal relevance has a strong correlation with the judgment of the plausibility of a rule. A rule which contained (subjectively) more relevant literals than the second rule in the pair was more likely to be evaluated favorably than rules that do not contain such conditions. This pattern was found with varying levels of statistical significance across all evaluation setups in Table 8, with the exception of the average for the smallest Mushroom dataset. Note that the effect is strongest for the maximum relevance, which means that it is not necessary that all the literals are deemed important, but it suffices if a few conditions (or even a single one) are considered to be relevant. Traffic was the only dataset where such effects could not be observed, but this may have to do with the fact that the used attributes (mostly geographic regions) strongly correlate with traffic accidents but do not show a causal relationship. The examination of the relation between the objective relevance of conditions in a rule and their impact on the subjective perception of the rule is an interesting yet challenging area of further study. The perception can be influenced by multiple cognitive phenomena, such as the weak evidence effect.

Recognition Heuristic
The recognition heuristic (Goldstein & Gigerenzer, 1999, 2002) is the best-known of the fast and frugal heuristics that have been popularized in several books, such as Gigerenzer et al. (1999, 2011) and Gigerenzer (2015). It essentially states that when you compare two objects according to some criterion that you cannot directly evaluate, and "one of two objects is recognized and the other is not, then infer that the recognized object has the higher value with respect to the criterion." Note that this is independent of the criterion that should be maximized; it only depends on whether there is an assumed positive correlation with the recognition value of the object. For example, if asked whether Hong Kong or Chongqing is the larger city, people tend to pick Hong Kong because it is better known (at least in the western hemisphere), even though Chongqing has about four times as many inhabitants. Thus, it may be viewed as being closely associated with relevance, where, in the absence of knowledge about a fact, the city's relevance is estimated by how well it is recognized.
The recognition heuristic can manifest itself as a preference for rules containing a recognized literal or attribute in the antecedent of the rule. Since the odds that a literal will be recognized increase with the length of the rule, it seems plausible that the recognition heuristic generally increases the preference for longer rules. One could argue that for longer rules, the odds of occurrence of an unrecognized literal will also increase. The counterargument is the empirical finding that, under time pressure, analysts assign a higher value to recognized objects than to unrecognized objects. This also happens in situations when recognition is a poor cue (Pachur & Hertwig, 2006).

Experiment 5: Modeling Recognition Heuristic using PageRank
In an attempt to measure recognition, we resort to measuring the centrality of a concept using its PageRank (Page et al., 1999) in a knowledge graph. In three of our datasets, the literals correspond to Wikipedia articles, which allowed us to use PageRank computed from the Wikipedia connection graph for these literals. As in the previous experiment, each rule pair was enriched with two additional variables corresponding to the difference in the average and maximum PageRank associated with literals in the rules in the pair. We refer the reader to Kliegr (2017) for additional details regarding the experimental setup.
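The idea can be sketched on a toy link graph. The following is a plain power-iteration PageRank with uniform handling of dangling nodes, not the computation on the Wikipedia graph used in the study; the graph, the damping factor, the function names, and the variable names PageRankAvgDelta and PageRankMaxDelta are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain power-iteration PageRank on a dict {node: [linked nodes]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank

def pagerank_deltas(rule1, rule2, rank):
    """Difference (first rule minus second rule) in average and maximum PageRank."""
    def average(rule):
        return sum(rank[literal] for literal in rule) / len(rule)
    def maximum(rule):
        return max(rank[literal] for literal in rule)
    return {"PageRankAvgDelta": average(rule1) - average(rule2),
            "PageRankMaxDelta": maximum(rule1) - maximum(rule2)}

# Invented toy link graph; the literals of a rule are identified with nodes.
toy_links = {
    "Hong Kong": ["China"],
    "Chongqing": ["China"],
    "China": ["Hong Kong"],
    "Asia": ["China", "Hong Kong"],
}
rank = pagerank(toy_links)
print(pagerank_deltas(["Hong Kong"], ["Chongqing"], rank))
```

In this toy graph the node with more incoming links obtains the higher PageRank, so both differences are positive for a rule pair whose first rule mentions the better-linked concept.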
Table 9 shows the correlations between plausibility and the difference in PageRank as a proxy for the recognition heuristic. While we have not obtained statistically significant correlations in the datasets, for two of the datasets (Quality and Traffic) the direction of the correlation is in line with the expectation: plausibility rises with increased recognition. More research to establish the relation between the degree of actual recognition and PageRank values is thus needed. Nevertheless, to our knowledge, this is the first experiment that attempted to use PageRank to model recognition.

Semantic Coherence
In previous work (Paulheim, 2012), we conducted experiments with various statistical datasets enriched with Linked Open Data, one being the already mentioned Quality of Living dataset, another one denoting the corruption perceptions index (CPI) in different countries worldwide. For each of those, we created rules and had them rated in a user study.
From that experiment, we learned that many people tend to trust rules more if there is a high semantic coherence between the conditions in the rule. For example, a rule stating that the quality of living in a city is high if it is a European capital of culture and is the headquarters of many book publishers would be accepted since both conditions refer to cultural topics, whereas a rule involving European capital of culture and many airlines founded in that city would not be as plausible.
Figure 10 depicts a set of results obtained on an unemployment statistic for French departments, enriched with data from DBpedia (Ristoski & Paulheim, 2013). There are highly coherent rules combining attributes such as latitude and longitude, or population and area, as well as lowly coherent rules, combining geographic and demographic indicators. Interestingly, all those combinations perform a similar split of the dataset, i.e., into the continental and overseas departments of France.
At first glance, semantic coherence and discriminative power of a rule look like a contradiction, since semantically related attributes may also correlate: as in the example above, attributes describing the cultural life in a city can be assumed to correlate more strongly than, say, cultural and economic indicators. Hence, it is likely that a rule learner, without any further modifications, will produce semantically incoherent rules with a higher likelihood than semantically coherent ones.
However, in Gabriel et al. (2014), we have shown that it is possible to modify rule learners in a way that they produce more coherent rules. To that end, attribute labels are linked to a semantic resource such as WordNet (Fellbaum, 1998), and for each pair of attributes, we measure the distance in that semantic network. In the first place, this provides us with a measure for semantic coherence within a rule. Next, we can explicitly use that heuristic in the rule learner, and combine it with traditional heuristics that are used for adding conditions to a rule. Thereby, a rule learner can be modified to produce rules that are semantically coherent.
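A minimal sketch of this idea is given below. It is not the method of Gabriel et al. (2014): the small hand-made semantic network stands in for WordNet, the coherence measure (inverse of the average pairwise path length), the use of the Laplace estimate as the traditional heuristic, and the weighted sum with its weight are all assumptions made for illustration.

```python
from collections import deque
from itertools import combinations

# Hypothetical toy semantic network (undirected edges), a stand-in for WordNet.
EDGES = [
    ("indicator", "location"), ("indicator", "size"),
    ("location", "latitude"), ("location", "longitude"),
    ("size", "population"), ("size", "area"),
]
NEIGHBOURS = {}
for a, b in EDGES:
    NEIGHBOURS.setdefault(a, set()).add(b)
    NEIGHBOURS.setdefault(b, set()).add(a)

def distance(start, goal):
    """Length of the shortest path between two labels (breadth-first search)."""
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in NEIGHBOURS[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    raise ValueError("labels are not connected")

def coherence(attributes):
    """Coherence of a rule: inverse of the average pairwise distance (assumed form)."""
    pairs = list(combinations(attributes, 2))
    if not pairs:
        return 1.0
    average = sum(distance(a, b) for a, b in pairs) / len(pairs)
    return 1.0 / (1.0 + average)

def laplace(pos, neg):
    """Laplace estimate of rule precision, used here as the traditional heuristic."""
    return (pos + 1) / (pos + neg + 2)

def combined_heuristic(attributes, pos, neg, weight=0.3):
    """Weighted combination of discriminative power and coherence (assumed form)."""
    return (1 - weight) * laplace(pos, neg) + weight * coherence(attributes)

# Two candidate rules with identical (invented) coverage but different coherence.
print(combined_heuristic(["latitude", "longitude"], pos=90, neg=5))
print(combined_heuristic(["latitude", "population"], pos=90, neg=5))
```

With equal coverage, the candidate whose attributes are closer in the semantic network receives the higher score, so a learner guided by such a heuristic would prefer the more coherent refinement.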
The most interesting finding of the above work was that semantically coherent rules can be learned without significantly sacrificing accuracy of the overall rule-based model. This is possible in cases with lots of attributes that a rule learner can exploit for achieving a similar split of the dataset. In the above example with the French departments, any combination of latitude, longitude, population and area can be used to discriminate continental and overseas departments; therefore, the rule learner can pick a combination that has both a high discriminative power and a high coherence.

Structure
[Figure 10 content: Unemployment = low :- area > 6720, population > 607430. Unemployment = high :- latitude <= 44.1281, longitude <= 6.3333, longitude > 1.8397.]
Another factor which, in our opinion, contributes strongly to the interpretability of a rule-based model is its internal logical structure. Rule learning algorithms typically provide flat lists that directly relate the input to the output. Consider, e.g., the extreme case of learning a parity concept, which checks whether an odd or an even number of attributes is true, and which can be represented either by an unstructured or by a structured rule set (Figure 11). We argue that the parsimonious structure of the latter is much easier to comprehend because it uses only a linear number of rules, and slowly builds up the complex target concept parity from the smaller subconcepts parity2345, parity345, and parity45. This is in line with the criticism of Hüllermeier (2015), who argued that the flat structure of fuzzy rules is one of the main limitations of current fuzzy rule learning systems. However, we are not aware of psychological work that supports this hypothesis.
The results of a small empirical validation were recently reported by Schmid et al. (2017), who performed a user study in which the subjects were shown differently structured elementary theories from logic programming, such as definitions for grandfather, greatgrandfather, or ancestor, and it was observed how quickly queries about a certain ancestry tree could be answered using these predicates. Among others, the authors posed and partially confirmed the hypothesis that logical programs are more comprehensible if they are structured in a way that leads to a compression in length. In our opinion, further work is needed in order to see whether compression is indeed the determining factor here. It also seems natural to assume that an important prerequisite for structured theories to be more comprehensible is that the intermediate concepts are by themselves meaningful to the user. Interestingly, this was not confirmed in the experiments by Schmid et al. (2017), where the so-called "public" setting, in which all predicates had meaningful names, did not lead to consistently lower answer times than the "private" setting, in which the predicates did not have meaningful names. They also could not confirm the hypothesis that comprehensibility was furthered when their subjects were explicitly encouraged to think about meaningful names for intermediate concepts.
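The difference between a flat and a structured representation of the parity concept can be sketched as follows. The encoding is an assumption made for illustration and need not match Figure 11: five Boolean attributes are used, the flat rule set contains one conjunctive rule per positive assignment, and each structured subconcept is counted as two rules (one for each truth value of the newly added attribute).

```python
from itertools import product

R = 5  # number of relevant Boolean attributes x1..x5 (illustrative choice)

# Flat representation: one conjunctive rule for every assignment
# in which an odd number of attributes is true.
flat_rules = [bits for bits in product([False, True], repeat=R)
              if sum(bits) % 2 == 1]

def parity_flat(x):
    """The concept holds if any of the flat rules matches the example exactly."""
    return tuple(x) in flat_rules

# Structured representation: each subconcept combines one further attribute
# with the previously defined subconcept (exclusive or on Booleans).
def parity45(x):
    return x[3] != x[4]

def parity345(x):
    return x[2] != parity45(x)

def parity2345(x):
    return x[1] != parity345(x)

def parity(x):
    return x[0] != parity2345(x)

# Two rules per subconcept under the encoding assumed above.
structured_rule_count = 2 * (R - 1)

# Both representations define the same concept on all 2^R examples.
assert all(parity_flat(x) == parity(x)
           for x in product([False, True], repeat=R))
print(len(flat_rules), structured_rule_count)
```

Under these assumptions the flat rule set needs 16 rules for five attributes and doubles with every additional attribute, whereas the structured rule set needs 8 rules and grows only linearly.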
In their experiments, Schmid et al. (2017) used manually constructed logic programs. In fact, research in machine learning has not yet produced a system that is powerful enough to learn deeply structured logic theories for realistic problems, on which we could rely for experimentally testing this hypothesis. In machine learning, this line of work has been known as constructive induction (Matheus, 1989) or predicate invention (Stahl, 1996), but surprisingly, it has not received much attention since the classical works in inductive logic programming in the 1980s and 1990s. One approach is to use a wrapper to scan for regularly co-occurring patterns in rules, and use them to define new intermediate concepts which allow compressing the original theory (Wnek & Michalski, 1994; Pfahringer, 1994). Alternatively, one can directly invoke so-called predicate invention operators during the learning process, as, e.g., in Duce (Muggleton, 1987), which operates in propositional logic, and its successor systems in first-order logic (Muggleton & Buntine, 1988; Kijsirikul et al., 1992; Kok & Domingos, 2007). One of the few recent works in this area is by Muggleton et al. (2015), who introduced a technique that employs user-provided meta rules for proposing new predicates.
None of these works performed a systematic evaluation of the generated structured theories from the point of view of interpretability. Systems like MOBAL (Morik et al., 1993), which not only tried to learn theories from data, but also provided functionalities for reformulating and restructuring the knowledge base (Sommer, 1996), have not received much attention in recent years. We believe that providing functionalities and support for learning structured knowledge bases is crucial for the acceptance of learned models in complex domains. In a way, the recent success of deep neural networks needs to be carried over to the learning of deep logical structures. Recent work on so-called sum-product nets, which combine deep learning with graphical models and generate new concepts in their latent variables (Peharz et al., 2017), may be viewed as a step in this direction.

Conclusion
The main goal of this paper was to motivate that interpretability of rules is an important topic that has received far too little serious attention in the literature. Its main contribution lies in highlighting that plausibility is an important aspect of interpretability, which, to our knowledge, has not been investigated before.
In particular, we observed that even rules that have the same predictive quality in terms of conventional measures such as support and confidence, and will thus be considered as equally good explanations by conventional rule learning algorithms, may be perceived with different degrees of plausibility.
More concretely, we reported on five experiments conducted in order to gain first insight into plausibility of rule learning results. Users were confronted with pairs of learned rules with approximately the same discriminative power (as measured by conventional heuristics such as support and confidence), and were asked to indicate which one seemed more plausible. The experiments were performed in four domains, which were selected so that respondents can be expected to be able to comprehend the given explanations (rules), but not to reliably judge their validity without obtaining additional information. In this way, users were guided to give an intuitive assessment of the plausibility of the provided explanation.
Experiment 1 explored whether the Occam's razor principle holds for the plausibility of rules, by investigating whether people consider shorter rules to be more plausible than longer rules. The results obtained for four different domains showed that this is not the case; in fact, we observed a statistically significant preference for longer rules on two datasets. In Experiment 2, we found support for the hypothesis that the elevated preference for longer rules is partly due to the misunderstanding of "and" that connects conditions in the presented rules: people erroneously find rules with more conditions to be more general. In Experiment 3, we focused on another ingredient of rules: the values of confidence and support metrics. The results show that when both confidence and support are stated, confidence positively affects plausibility and support is largely ignored. This confirms a prediction following from previous psychological research studying the insensitivity to sample size effect. As a precursor to a follow-up study focusing on the weak evidence effect, Experiment 4 evaluated the relation between perceived plausibility and strength of conditions in the rule antecedent. The results indicated that rule plausibility is affected already if a single condition is considered to be relevant. Another contribution of this experiment is in the methodology, since it explored multiple ways of considering evidence (attributes, or attribute-value pairs), aggregation on a per-rule basis, as well as incentivizing participants. Recognition is a powerful principle underlying many human reasoning patterns and biases. In Experiment 5, we attempted to use PageRank computed from the Wikipedia graph as a proxy for how well a given condition is recognized. The results, albeit statistically insignificant, suggest the expected pattern of positive correlation between recognition and plausibility. This experiment is predominantly interesting from the methodological perspective, as it offers a possible approach to approximating the recognition of rule conditions.
In our view, a research program that aims at a thorough investigation of interpretability in machine learning needs to resort to results in the psychological literature, in particular to cognitive biases and fallacies. We summarized some of these hypotheses, such as the conjunction fallacy, and started to investigate to what extent these can serve as explanations for human preferences between different learned hypotheses. There are numerous other cognitive effects that can demonstrate how people assess rule plausibility, some of which are briefly listed in Appendix A and discussed more extensively in Kliegr et al. (2018). Clearly, more work along these lines is needed.
Moreover, it needs to be considered how cognitive biases can be incorporated into machine learning algorithms. Unlike loss functions, which can be evaluated on data, it seems necessary that interpretability is evaluated in user studies. Thus, we need to establish appropriate evaluation procedures for interpretability, and develop appropriate heuristic surrogate functions that can be quickly evaluated and optimized in learning algorithms. In cases which require additional knowledge (e.g., for assessing the recognizability of a literal), which cannot be obtained from the data directly, a promising research direction is infusing semantic metadata into the learning process and exploiting it for enforcing the output of rules that are more likely to be accepted by the end user.
• Confirmation bias and positive test strategy (Nickerson, 1998). Seeking or interpretation of evidence so that it conforms to existing beliefs, expectations, or a hypothesis in hand.
• Conjunction fallacy and representativeness heuristic (Tversky & Kahneman, 1983). The conjunction fallacy occurs when a person assumes that a specific condition is more probable than a single general condition in case the specific condition seems more representative of the problem at hand.
Judgment.
• Availability heuristic (Tversky & Kahneman, 1973). The easier it is to recall a piece of information, the greater the importance of the information.
• Effect of difficulty (Griffin & Tversky, 1992). If it is difficult to tell which one of two mutually exclusive alternative hypotheses is better because both are nearly equally probable, people will grossly overestimate the confidence associated with their choice. This effect is also sometimes referred to as the overconfidence effect (Pohl, 2017).
• Mere-exposure effect (Zajonc, 1968). Repeated encounter of a hypothesis results in increased preference.
Other.
• Ambiguity aversion (Ellsberg, 1961). People tend to favour options for which the probability of a favourable outcome is known over options where the probability of a favourable outcome is unknown. Some evidence suggests that ambiguity aversion has a genetic basis (Chew et al., 2012).
• Averaging heuristic (Fantino et al., 1997). The joint probability of two events is estimated as an average of the probabilities of the component events. This fallacy corresponds to believing that P(A, B) = (P(A) + P(B)) / 2 instead of P(A, B) = P(A) * P(B), which holds for independent events.
• Confusion of the inverse (Plous, 1993). A conditional probability is confused with its inverse. This fallacy corresponds to believing that P(A|B) = P(B|A).
• Context and trade-off contrast (Tversky & Simonson, 1993). The tendency to prefer alternative x over alternative y is influenced by the context, i.e., by the other available alternatives.
• Disjunction fallacy (Bar-Hillel & Neter, 1993). People tend to think that it is more likely for an object to belong to a more characteristic subgroup than to its supergroup.
• Information bias (Baron et al., 1988). People tend to believe that the more information the better, even if the extra information is irrelevant for their decision.
• Insensitivity to sample size (Tversky & Kahneman, 1974). Neglect of the following two principles: a) more variance is likely to occur in smaller samples, b) larger samples provide less variance and better evidence.
• Recognition heuristic (Goldstein & Gigerenzer, 1999). If one of two objects is recognized and the other is not, then infer that the recognized object has the higher value with respect to the criterion.
• Negativity bias (Kanouse & Hanson Jr, 1987). People weigh negative aspects of an object more heavily than positive ones.
• Primacy effect (Thorndike, 1927). This effect can be characterized by the words of Edward Thorndike (1874-1949), one of the founders of modern educational psychology, as follows: "other things being equal the association first formed will prevail" (Thorndike, 1927).
• Reiteration effect (Hasher et al., 1977). Frequency of occurrence is a criterion used to establish the validity of a statement.
Figure 2: Two decision lists learned for the class poisonous in the Mushroom dataset.

Figure 3: Example translated rules for the four datasets

Figure 4: Example instructions for experiments 1-3. The example rule pair was adjusted based on the dataset. For Experiment 3, the box with the example rule additionally contained values of confidence and support, formatted as shown in Figure 7.

Figure 5: Example rule pair used in experiments 1-3. For Experiment 3, the description of the rule also contained values of confidence and support, formatted as shown in Figure 7.

Figure 7: Sample information provided for clarifying support and confidence.

Figure 9: Literal relevance test question for Movies.

Figure 10: Example rules for unemployment in different French regions
Figure 11: Unstructured and structured rule sets for the parity concept.

Table 1: Comprehensibility and plausibility - Two aspects of interpretability

Table 2: Datasets used for generating rule pairs
We used a simple top-down greedy hill-climbing algorithm that takes a seed example and generates a pair of rules, one with a regular heuristic (Laplace) and one with its inverted counterpart. As shown by Stecher et al. (2016) (and illustrated in Figure

Table 4: Geographical distribution of collected judgments

Table 5: Rule-length experiment statistics. pairs refers to the distinct number of rule pairs, judg to the number of trusted judgments, the quiz failure rate qfr to the percentage of participants that did not pass the initial quiz as reported by the CrowdFlower dashboard, part to the number of trusted distinct survey participants (workers), τ and ρ to the observed correlation values with p-values in parentheses.
Subjects and Remuneration. CrowdFlower divides the available workforce into three levels depending on the accuracy they obtained on earlier tasks. As the level of the CrowdFlower workers we chose Level 2, which was described as follows: "Contributors in Level 2 have completed over a hundred Test Questions across a large set of Job types, and have an extremely high overall Accuracy."

Table 7: Kendall's τ on the Movies dataset without (Group 1) and with (Group 3) additional information about the number of covered good and bad examples. pairs refers to the distinct number of rule pairs, judg to the number of trusted judgments, the quiz failure rate qfr to the percentage of participants that did not pass the initial quiz as reported by the CrowdFlower dashboard, part to the number of trusted distinct survey participants (workers), and ρ to the observed correlation values with p-values in parentheses.

Table 8: Attribute and Literal Relevance (Group 1, Kendall's τ). Column att refers to the number of distinct attributes, lit to the number of distinct literals (attribute-value pairs), judg to the number of trusted judgments, excl to the percentage of participants that were not trusted on the basis of giving justifications shorter than 11 characters, and part to the number of trusted distinct survey participants (workers).

Table 9: Correlation of PageRank in the knowledge graph with plausibility (Group 1, Kendall's τ). Column lit refers to the number of distinct literals (attribute-value pairs), judg to the number of trusted judgments, qfr to the percentage of non-trusted participants, and part to the number of trusted distinct survey participants (workers).