Beneficial and Harmful Explanatory Machine Learning

Given the recent successes of Deep Learning in AI there has been increased interest in the role of, and need for, explanations in machine learned theories. A distinct notion in this context is that of Michie's definition of Ultra-Strong Machine Learning (USML). USML is demonstrated by a measurable increase in human performance of a task following provision to the human of a symbolic machine learned theory for task performance. A recent paper demonstrates the beneficial effect of a machine learned logic theory for a classification task, yet no existing work has examined the potential harmfulness of the machine's involvement in human learning. This paper investigates the explanatory effects of a machine learned theory in the context of simple two person games and proposes a framework for identifying the harmfulness of machine explanations based on the Cognitive Science literature. The approach involves a cognitive window consisting of two quantifiable bounds and it is supported by empirical evidence collected from human trials. Our quantitative and qualitative results indicate that human learning aided by a symbolic machine learned theory which satisfies a cognitive window achieves significantly higher performance than human self learning. Results also demonstrate that human learning aided by a symbolic machine learned theory that fails to satisfy this window leads to significantly worse performance than unaided human learning.


Introduction
In a recent paper [34] the authors provided an operational definition for comprehensibility of logic programs and used this, in experiments with humans, to provide the first demonstration of Michie's Ultra-Strong Machine Learning (USML). The authors demonstrated USML via empirical evidence that humans improve out-of-sample performance in concept learning from a training set E when presented with a first-order logic theory which has been machine learned from E. The improvement of human performance indicates a beneficial effect of comprehensible machine learned models on human skill acquisition. The present paper investigates the explanatory effects of the machine's involvement in human skill acquisition of simple games. Our results indicate that when a machine learned theory is used to teach strategies to humans, in some cases the human's out-of-sample performance is reduced. We take this degradation of human performance to indicate the existence of harmful explanations.
In the current paper, which extends our previous work on the phenomenon of USML, both beneficial and harmful effects of a machine learned theory are explored in the context of simple games. Our definition of explanatory effects is based on human out-of-sample performance in the presence of natural language explanation generated from a machine learned theory (Figure 1). The analogy between understanding a logic program via declarative reading and understanding a piece of natural language text allows the explanatory effects of a machine learned theory to be investigated. The results of relevant Cognitive Science literature allow the properties of a logic theory which are harmful to human comprehension to be characterised. Our approach is based on developing a framework describing a cognitive window which involves bounds with regard to 1) descriptive complexity of a theory and 2) execution stack requirements for knowledge application. We hypothesise that a machine learned theory provides a harmful explanation to humans when theory complexity is high and execution is cognitively challenging. Our proposed cognitive window model is supported by empirical evidence collected from multiple experiments involving human participants of various backgrounds.
We summarise our main contributions as follows:
- We define a measure to evaluate beneficial/harmful explanatory effects of a machine learned theory on human comprehension.
- We develop a framework to assess a cognitive window of a machine learned theory. The approach encompasses theory complexity and the required execution stack.
- Our quantitative and qualitative analyses of the experimental results demonstrate that a machine learned theory has a harmful effect on human comprehension when its search space is too large for human knowledge acquisition and it fails to incorporate executional shortcuts.
This paper is arranged as follows. In Section 2, we discuss existing work relevant to the paper. The theoretical framework with relevant definitions is presented in Section 3. We describe our experimental framework and the experimental hypotheses in Section 4. Section 5 describes several experiments involving human participants on two simple games. We examine the impact of a cognitive window on the explanatory effects of a machine learned theory based on human performance and verbal input. In Section 6, we conclude our work and comment on our analytical results: only a short and simple-to-execute theory can have a beneficial effect on human comprehension. We also discuss potential extensions to the current framework, namely curriculum learning and behavioural cloning, for enhancing the explanatory effects of a machine learned theory.

Related work
This section summarises related research on game learning and familiarises the reader with the core motivations for our work. We first present a short overview of related investigations in explanatory machine learning of games. Subsequently, we cover various approaches for teaching and learning between humans and machines.

Explanatory machine learning of games
Early approaches to learning game strategies [47,41] used the decision tree learner ID3 to classify minimax depth-of-win for positions in chess end games. These approaches used carefully selected board attributes as features. However, chess experts had difficulty understanding the learned decision tree due to its high complexity [26]. Methods for simplifying decision trees without compromising their accuracy have been investigated [42] on the basis that simpler models are more comprehensible to humans. An early Inductive Logic Programming (ILP) [35] approach learned optimal chess endgame strategies at depth 0 or 1 [5]. An informal complexity constraint was applied which limits the number of clauses used in any predicate definition to 7 ± 2 clauses. This number is based on the hypothesised limit on human short term memory capacity of 7 ± 2 chunks [29]. A different approach involving the augmentation of training data with high-level annotations was explored in [18]. Initialisation requires explanations to be provided for the target data set and the predictive accuracy of explanations is evaluated similarly to the predictive accuracy of labels.
The earliest reinforcement learning system MENACE (Matchbox Educable Noughts And Crosses Engine) [25] was specifically designed to learn an optimal agent policy for Noughts and Crosses. Later, Q-Learning [54] and Deep Reinforcement Learning emerged and have led to a variety of applications including the Atari 2600 games [33] and the game of Go [50]. While these systems defeated the strongest human players, they are not human-like since they lack the ability to explain the encoded knowledge to humans. Recent approaches such as [55] have aimed to explain the policies learned by these models, but the learned strategy is implicitly encoded into the continuous parameters of the policy function which makes their operation opaque to humans. Relational Reinforcement Learning [14] and Deep Relational Reinforcement Learning [56] have attempted to address these drawbacks by incorporating the use of relational biases to ensure human understandability.
In [30,31], the author provided a survey of the most relevant work in explainable AI and argued that explanatory functionalities were mostly subject to the developer's view. While explanatory effects are rarely demonstrated through empirical trials, no existing framework accounts for the explanatory harmfulness of machine learned models.

Two-way learning between human and machine
As an emerging sub-field of AI, Machine Teaching [16] provides an algorithmic model for quantifying the teaching effort and a framework for identifying an optimized teaching set of examples to allow maximum learning efficiency for the learner. The learner is usually a machine learning model of a human in a hypothesised setting. In education, machine teaching has been applied to devise intelligent tutoring systems to select examples for teaching [59,43]. On the other hand, rule-based logic theories are important mechanisms of explanation. Rule-based knowledge representations are generalised means of concept encoding and have a structure analogous to human conception. Mechanisms of logical reasoning, induction and abduction, have long been shown to be highly related to human concept attainment and information processing [23,19]. Additionally, humans' ability to apply recursion plays a key role in understanding of relational concepts and semantics of language [17] which are important for communication.
The process of reconstructing implicit target knowledge which is easy to operate but difficult to describe via machine learning has been explored under the topic of Behavioural Cloning. The cloning of human operation sequence has been applied in various domains such as piloting [28] and crane operation [53]. The cloned human knowledge and experience are more dependable and less error-prone due to perceptual and executional inconsistency being averaged across the original behavioural trace. To our knowledge, no existing work has attempted to estimate human errors and target these mistakes in interactive teaching sessions for achieving a measurable "clean up" effect [27] from machine explanations.

Meta-interpretive learning of simple games
Meta-Interpretive Learning (MIL) [37,38] is a sub-field of ILP which supports predicate invention, dependent learning [24], learning of recursions and higher-order programs. The meta-rules (Figure 3) contain existentially quantified second-order variables and universally quantified first-order variables. They clarify the declarative bias employed for substitutions of second-order Skolem constants. The resulting first-order theories are thus strictly logical generalisations of the meta-rules.
The MIL game learning framework MIGO [36] is a purely symbolic system based on the adapted Prolog meta-interpreter Metagol [12]. MIGO learns exclusively from positive examples by playing against the optimal opponent. MIGO is provided with a set of three relational primitives, move/2, won/1 and drawn/1, which are a move generator, a won classifier and a drawn classifier respectively. These primitives represent the minimal information a human would expect to know before playing a two-person game. For Noughts and Crosses and Hexapawn, MIGO learns a rule-like symbolic game strategy (Table 1) that supports human understanding and was demonstrated to converge using less training data compared to Deep and classical Q-Learning. For successive values of k, MIGO learns a series of inter-related definitions for predicates win_k/2. These predicates define maintenance of minimax win in k-ply.
We introduce MIPlain, a variant of MIGO which focuses on learning the task of winning for the game of Noughts and Crosses. In addition to learning from positive examples, MIPlain identifies moves which are negative examples for the task of winning. When a game is drawn or lost for the learner, the corresponding path in the game tree is saved for later backtracking following the most updated strategy. MIPlain performs a selection of hypotheses based on the efficiency of hypothesised programs using Metaopt [13].
An additional primitive number_of_pairs/3 is provided to MIPlain which gives the number of pairs for a player (X or O) on a given board. A pair is the alignment of two marks of one player, the third square of this line being empty. An example of pairs is shown in Figure 2. This additional primitive serves as an executional shortcut that reduces the depth of the search when executing the learned strategy. Furthermore, MIPlain is given the meta-rules described in Figure 3, which are two variants of the postcon meta-rule with monadic or dyadic head, and two variants of the conjunction meta-rule with currying in either the first or both body literals. Currying allows the learning of programs with higher-arity predicates where existentially quantified argument variables are bound to constants. The learned strategy presented in Table 2 describes, in a rule-like manner, conditions that the player's optimal move has to satisfy.
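The definition of a pair given above can be illustrated with a short sketch. The board encoding (a row-major tuple of nine cells holding "x", "o" or "e" for empty) and the Python function signature are assumptions made for illustration only; they are not the representation used by MIPlain, where number_of_pairs/3 is a Prolog primitive.

```python
# Hypothetical encoding (an assumption, not MIPlain's own): a board is a
# tuple of 9 cells in row-major order, each "x", "o" or "e" (empty).
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def number_of_pairs(board, player):
    """Count lines with two marks of `player` and an empty third square."""
    count = 0
    for line in LINES:
        cells = [board[i] for i in line]
        if cells.count(player) == 2 and cells.count("e") == 1:
            count += 1
    return count


example_board = (
    "x", "x", "e",
    "o", "x", "e",
    "o", "e", "e",
)
pairs_x = number_of_pairs(example_board, "x")  # top row, middle column, main diagonal
pairs_o = number_of_pairs(example_board, "o")  # the left column is blocked by an x
```

Counting pairs inspects only the eight lines of the current board, which is why such a primitive can act as a shortcut compared with searching several plies ahead in the game tree.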

Explanatory effectiveness of a machine learned theory
We extend the definition of machine-aided human comprehension of examples in [34]. C(D, H, E) denotes the unaided human comprehension of examples, where D is a target definition, H is a group of humans and E is a set of examples. Based on the analogy between declarative understanding of a logic program and understanding of a natural language explanation, we describe measures for estimating the degree to which the output of a symbolic machine learning algorithm, used as an explanation, can aid human comprehension.
In the scope of this work, we relate the explanatory effectiveness of a theory to performance which means that a harmful explanation provided by the machine degrades comprehension of the task and therefore reduces performance.
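As a rough illustration of relating explanatory effectiveness to performance, the sketch below assumes that comprehension is estimated by the mean out-of-sample accuracy of a group and that the explanatory effect is the difference between machine-explained and unaided comprehension. Both choices, the function names and the accuracy values are assumptions made for illustration; the values are invented and are not experimental results.

```python
from statistics import mean


def comprehension(accuracies):
    """Assumed estimate of comprehension: mean accuracy over a group."""
    return mean(accuracies)


def explanatory_effect(explained_accuracies, unaided_accuracies):
    """Assumed effect: explained comprehension minus unaided comprehension.

    A positive value would indicate a beneficial explanation and a negative
    value a harmful one.
    """
    return comprehension(explained_accuracies) - comprehension(unaided_accuracies)


# Invented per-participant post-test accuracies, for illustration only.
explained = [0.8, 0.6, 0.7]
unaided = [0.5, 0.6, 0.7]
effect = explanatory_effect(explained, unaided)
label = "beneficial" if effect > 0 else "harmful" if effect < 0 else "neutral"
```

In practice the sign of the effect would be assessed with a significance test rather than by comparing raw means.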

Cognitive window of a machine learned theory
In this section, we suggest a window of a machine learned theory that constrains its explanatory effectiveness. A basic assumption of cognitive psychology and artificial intelligence is that human information processing can be modelled in analogy to symbol manipulation by computers, or more precisely to its formal characterisation as a Turing Machine [29,21,39]. More specifically, computational models of cognition share the view that intelligent action is based on manipulation of representations in working memory. In consequence, human inferential reasoning is limited by working memory capacity which corresponds to limitations of tape length and instruction complexity in Turing Machines.
Besides general restrictions of human information processing, performance can be influenced by internal or environmental disruptions such that the given competencies of a human in a specific domain are not always reflected in observable actions [11,49]. However, it can be assumed that humans, at least in domains of higher cognition, are able to explain their actions by verbalising the rules which they applied to produce a given result [46]. Although rules in general can be classified as procedural knowledge, the ability to verbalise rules makes them part of declarative memory [3,46]. For complex domains, the rules which govern action generation will typically be computationally complex as measured by the Kolmogorov complexity [22]. One can assume that an increase in complexity can have a negative effect on performance.
In language processing and in general problem solving, hierarchisation of complex action sequences can make information processing more efficient. Typically, a general goal is broken down into sub-goals as has been proposed in production system models [39] as well as in the context of analogical problem solving [9]. Rules which guide problem solving behaviour, for instance in puzzles such as Tower of Hanoi or games such as Noughts and Crosses, might be learned. From a declarative perspective, such learned rules correspond to explicit representations of a concept such as the win-in-two-steps move introduced above. Studies of rule-based concept acquisition suggest that human concept learning can be characterised as search in a pool of possible hypotheses which are explored in some order of preference [8]. This observation relates to the concept of version space learning introduced in machine learning [32].
Based on these different observations concerning human information processing, we propose that a) human learners are version space learners with limited hypothesis space search capability that use meta-rules to learn sub-goal structure and primitives as background knowledge. This allows us to compute a bound on the human hypothesis space size based on the MIL complexity analysis in [24]. We assume that b) rules can be represented explicitly in a declarative, verbalisable form. Finally, we postulate the existence of a cognitive window such that a machine learned theory can be an effective explanation if it satisfies two constraints: 1) a hypothesised human learning procedure which has a limited search space and 2) a knowledge application model based on the Kolmogorov complexity [22]. For the following definitions, we restrict ourselves to learning datalog programs which do not include function symbols. When learned knowledge is cognitively challenging, execution overflows human working memory and instruction stack. We then expect decision making to be more error prone and the task performance of human learners to be less dependable. To account for the cognitive complexity of applying a machine learned theory, we define the cognitive resource of a logic term and atom.
Definition 5 (Cognitive cost of a logic term and atom, C(T)): Given a logic term or atom T, the cost C(T) can be computed as follows: [...] as a data structure used by MIGO and MIPlain has cost [...]
Note that we compute cognitive costs of programs without redundancy, since repeated literals in programs learned by MIGO and MIPlain were removed after unfolding to generate the explanations presented to human populations. Also, a game position can be represented by different data types. We ignore the cost due to implementation and only count digits and marks. We model the inferential process of evaluating training and testing examples by the run-time execution stack of a datalog program. The resolution of a query represents a mental application of a piece of knowledge given a training or testing example. In this work, for simplicity, we neglect the cost of computing the sub-goals of a primitive and compute its cost as if it were a normal predicate.

Example 1 The Noughts and Crosses position in
Example 3 A primitive move(S1, S2) which is an atom with variables S1 and S2 has a cognitive cost C(move(S1, S2)) = 3.
Definition 6 (Execution stack of a datalog program, S(P, q)): Given a query q, the execution stack S(P, q) of a datalog program P is a set of atoms or terms evaluated during the execution of P to compute q. Each exit point of the execution is replaced with the value ⊤, and each backtrack point has the value ⊥.
Definition 7 (Cognitive cost of a datalog program, Cog(P, q)): Given a query q, and letting St represent S(P, q), the cognitive cost of a datalog program P is Cog(P, q) = Σ_{t ∈ St} C(t).
Example 4 The primitive move/2 outputs a valid Noughts and Crosses state from a given input game state; the query is move(s1, B). The execution stack contains move(s1, B) and move(s1, s2), and Cog(P, move(s1, B)) is 10.
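A minimal sketch of how these costs could be computed is given below. The per-case costs are assumptions chosen so that the two worked examples are reproduced: a variable and an exit or backtrack marker each cost 1, a constant costs its number of characters, a list costs the sum of its elements, an atom costs 1 plus the cost of its arguments, and the cost of a program is the sum over its execution stack. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A logic variable (hypothetical helper type)."""
    name: str


@dataclass(frozen=True)
class Atom:
    """An atom: predicate symbol plus a tuple of argument terms."""
    pred: str
    args: tuple


TOP = object()     # marker for an exit point of the execution
BOTTOM = object()  # marker for a backtrack point


def cost(term):
    """Assumed cognitive cost C(T) of a term or atom."""
    if term is TOP or term is BOTTOM:
        return 1
    if isinstance(term, Var):
        return 1
    if isinstance(term, str):    # constant: one unit per character
        return len(term)
    if isinstance(term, list):   # list: sum of element costs
        return sum(cost(t) for t in term)
    if isinstance(term, Atom):   # atom: 1 for the predicate plus arguments
        return 1 + sum(cost(a) for a in term.args)
    raise TypeError(f"unsupported term: {term!r}")


def cog(stack):
    """Assumed cognitive cost of a program: sum over its execution stack."""
    return sum(cost(t) for t in stack)


# Example 3: move(S1, S2) with two variables.
example_3 = cost(Atom("move", (Var("S1"), Var("S2"))))

# Example 4: the stack for the query move(s1, B), assumed to end with one exit marker.
stack = [Atom("move", ("s1", Var("B"))), Atom("move", ("s1", "s2")), TOP]
example_4 = cog(stack)
```

Under these assumptions the two atoms in Example 4 cost 4 and 5, and the exit marker contributes the remaining unit of the total of 10.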
The maintenance cost of task goals in working memory affects performance of problem solving [10]. Background knowledge provides key mappings from solutions obtained in other domains or past experience [4,40] and grants shortcuts for the construction of the current solution process. We expect that when knowledge that provides executional shortcuts is comprehended, the efficiency of human problem solving could be improved due to a lower demand for cognitive resource. Contrarily, in the absence of informative knowledge, performance would be limited by human operational error and would not be better than solving the problem directly. To account for the latter case, we define the cognitive cost of a problem solution that involves the minimum amount of information from the task.
We give a definition of human cognitive window based on theory complexity during knowledge acquisition and theory execution cost during knowledge application. A machine learned theory has 1) a harmful explanatory effect when its hypothesis space size exceeds the cognitive bound and 2) no beneficial explanatory effect if its cognitive cost is not sufficiently lower than the cognitive cost of the problem solution.

Experimental framework
In this section, we describe an experimental framework for assessing the impact of a cognitive window on the explanatory effects of a machine learned theory. Our experimental framework involves 1) a set of criteria for evaluating the participants' learning quality from their own verbal descriptions of learned strategies and 2) an outline of experimental hypotheses. For game playing, we assume humans are able to explain actions by verbalising procedural rules of strategy. We expect verbal responses to provide insights about human decision making and knowledge acquisition. The quality of verbal responses can be affected by multiple factors such as motivation, familiarity with the introduced concepts and understanding of the game rules. We take these factors into account in the evaluation criteria.

Definition 11 (Primitive coverage of a verbal response):
A verbal response correctly describes a primitive if the semantic meaning of the primitive is unambiguously stated in the response. The primitive coverage is the number of primitives in a symbolic machine learned theory that are described correctly in a verbal response.

Definition 12 (Quality of a verbal response, Q(r)):
A verbal response r is checked against the specifications from Table 3 in an increasing order from criteria level 1 to level 4. Q(r) is the highest level i that r can satisfy. When a response does not satisfy any of the higher levels, the quality of this response is the lowest level 0.
To illustrate, we consider the predicate win_2/2 learned by MIPlain (Table 2). Primitive predicates are move/2 and number_of_pairs/3. We present in Table 3 a number of examples of verbal responses. A high quality response reflects a high motivation and good understanding of game concepts and strategy. On the other hand, a poor quality response demonstrates a lack of motivation or poor understanding.
We define the following null hypotheses to be tested in Section 5 and describe the motivations. Let M denote a symbolic machine learning algorithm. E stands for examples, D is a target definition, H is a group of participants sampled from a human population. M(E) denotes a machine learned theory which belongs to a definite clause program class with hypothesis space S. First, we are interested in demonstrating whether 1) the verbal response quality of learned knowledge reflects comprehension, 2) there exist cognitive bounds for humans to provide verbal responses of higher quality and 3) the machine learned theory helps improve the quality of verbal responses.

H1: Unaided human comprehension C(D, H, E) and machine-explained human comprehension C_ex(D, H, M(E)) manifest in verbal response quality Q(r).
We examine if high post-test accuracy correlates with high response quality and high primitive coverage of each question category.

H2: Difficulty for human participants to provide verbal response increases with quality Q(r).
We examine if the proportion of verbal responses reduces with respect to high response quality and high primitive coverage of each question category.

H3: Machine learned theory M(E) improves verbal response quality Q(r).
We examine if machine-aided learning results in more high quality (HQ) responses.
The impact of a cognitive window on explanatory effects is tested via the following hypotheses. φ is a set of primitives introduced to H. Let x denote the set of questions that human h ∈ H answers after learning.

H4: Learning a complex theory (|S| > B(M(E), H)) exceeding the cognitive bound leads to a harmful explanatory effect (E_ex(D, H, M(E)) < 0).
We examine if the post-test accuracy, after studying a machine learned theory that participants cannot recall fully, is worse than the accuracy following self-learning.

H5: Applying a theory without a low cognitive cost (Cog(M(E), x) ≥ CogP(E, φ, x)) does not lead to a beneficial explanatory effect (E_ex(D, H, M(E)) ≤ 0).
We examine if the post-test accuracy, after studying a machine learned theory that is cognitively costly, is equal to or worse than the accuracy following self-learning.
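The two conditions in H4 and H5 can be read as a simple decision rule. The sketch below is only an illustration of that reading: the function name, argument names and returned labels are hypothetical, and the numeric inputs stand for quantities (|S|, the bound B, Cog and CogP) that would have to be computed from the definitions in Section 3.

```python
def expected_explanatory_effect(hypothesis_space_size, cognitive_bound,
                                theory_cost, problem_solution_cost):
    """Classify the effect predicted by the cognitive window (illustrative)."""
    # H4: a hypothesis space larger than the cognitive bound is expected
    # to make the explanation harmful.
    if hypothesis_space_size > cognitive_bound:
        return "harmful"
    # H5: a theory that is not cheaper to execute than solving the problem
    # directly is not expected to be beneficial.
    if theory_cost >= problem_solution_cost:
        return "not beneficial"
    # Both constraints of the cognitive window are satisfied.
    return "within cognitive window"


# Placeholder numbers, chosen only to exercise each branch.
outcomes = [
    expected_explanatory_effect(1000, 100, 10, 50),
    expected_explanatory_effect(50, 100, 60, 50),
    expected_explanatory_effect(50, 100, 10, 50),
]
```

The order of the checks reflects the assumption that knowledge acquisition precedes knowledge application, so the bound on hypothesis space size is tested first.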

Experiments
This section introduces the materials and experimental procedure which we designed to examine the explanatory effects of a machine learned theory on human learners. Afterwards, we describe the experiment interface and present experimental results.

Materials
We assume that Noughts and Crosses is a widely known game with which many participants in the experiments are familiar. This might result in many participants already playing optimally before receiving explanations, leaving no room for potential performance increase. To address this issue, the Island Game was designed as a problem isomorphic to Noughts and Crosses. [51] define isomorphic problems as "problems whose solutions and moves can be placed in one-to-one relation with the solutions and moves of the given problem". This changes the superficial presentation of a problem without modifying the underlying structure. Several findings imply that this does not impede solving the problem via analogical inference if the original problem is consciously recognized as an analogy; on the other hand, the prior step of initially identifying a helpful analogy via analogical access is highly influenced by superficial similarity [15,20,44]. Given that the Island Game presents a major re-design of the game surface, we expect that participants are less likely to recall prior experience of Noughts and Crosses that would facilitate problem solving, leading to less optimal play initially and more potential for performance increase. The Island Game (Figure 4) contains three islands, each with three territories on which one or more resources are marked. The winning condition is met when a player controls either all territories on one island or three instances of the same resource. The nine territories resemble the nine fields in Noughts and Crosses and the structure of the original game is maintained with regard to players' turns, possible moves, board states and win conditions. This isomorphism masks a number of spatial relations that represent the membership of a field to a win condition. In this way, the fields can be rearranged in an arbitrary order without changing the structure of the game.
Fig. 4: Example of a pre- and post-test question for the Island Game. A board is presented to the participant, who has to select the move that he or she thinks is optimal.

Methods and design
We use two experiment interfaces, one for Noughts and Crosses and another one for the Island Game. For both, we adopt a two-group pre-test post-test design (Table 4). In the pre-test, the performance of participants in both the self-learning and machine-aided learning groups is measured in an identical way. During training, we introduce the concept of pairs to participants and they are able to see the correct answers for some game positions. In the post-test, the performance of both the self-learning and machine-aided groups is evaluated in exactly the same way as in the pre-test. This experiment setting allows us to evaluate the degree of change in performance as the result of explanations. Each question in the pre- and post-test is the presentation of a board for which it is the participant's turn to play. Participants are asked to select what they consider to be the optimal move. A question category win_i denotes a game position winnable in i moves of the human player. An exemplary question is shown in Figure 4. The post-test questions are rotated and flipped versions of the pre-test questions. In each test, only 15 questions are given to limit the experiment duration to one hour. The response time of participants was recorded for each pre-test and post-test question.
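Deriving post-test positions from pre-test positions by rotating and flipping can be sketched for a 3×3 Noughts and Crosses board as follows. The row-major tuple encoding and the function names are assumptions for illustration; the text does not state which transformation was applied to which question, so the sketch simply enumerates all eight symmetric variants.

```python
def rotate(board):
    """Rotate a row-major 3x3 board by 90 degrees clockwise."""
    return tuple(board[3 * (2 - c) + r] for r in range(3) for c in range(3))


def flip(board):
    """Mirror a row-major 3x3 board left to right."""
    return tuple(board[3 * r + (2 - c)] for r in range(3) for c in range(3))


def symmetries(board):
    """Return the four rotations of the board, each with its mirror image."""
    variants = []
    current = board
    for _ in range(4):
        variants.append(current)
        variants.append(flip(current))
        current = rotate(current)
    return variants


# Cells labelled 0..8 make the effect of each transformation visible.
labelled = tuple(range(9))
rotated_once = rotate(labelled)
all_variants = symmetries(labelled)
```

Because these transformations preserve rows, columns and diagonals as a set, a transformed question keeps the same category win_i and the same optimal moves up to the transformation, while looking different to the participant.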
The treatment was applied to the machine-aided group. During treatment, we present both visual and textual explanations to spare participants unnecessary effort in associating textual explanations with game positions and concepts. This is based on the consideration that a direct association between textual explanations and game states can be abstract for participants who are not familiar with the designed game domain. Learned first-order theories have been translated with manual adjustments based on primitives provided to all participants and to MIPlain. An exemplary explanation is shown in Figure 1. Both visual and textual explanations preserve the structure of hypotheses to account for the reasons that make one move right and another move wrong. Conversely, during training, the self-learning group was presented with similar game positions without the corresponding visual and textual explanations. For the Island Game experiments, we recorded an English description of the strategy participants used for each of the selected post-test questions. Participants are presented with previously submitted answers, one at a time, along with a text input box for written answers. Moves for these open questions are selected from the post-test with a preference order from wrong and hesitant moves to consistently correct moves. We associate hesitant answers with higher response times. A total of six questions are selected based on individual performance during the post-test.

Experiment results
We conducted three experiments using the interface with Noughts and Crosses questions and explanations. These experiments were carried out on three samples: an undergraduate student group from Imperial College London, a junior student group from a German middle school and a mixed background group from Amazon Mechanical Turk (AMT). No consistent explanatory effects could be observed for any of the mentioned samples. The problem solving strategy that humans apply can be affected by factors such as task familiarity, problem difficulty, and motivation. For instance, [45] suggested that a rather superficial analogical transfer of a strategy is applied when a problem is too difficult or when there is no reason to gain a more general understanding of a problem. Given that the majority of subjects achieved reasonable initial performance, we attribute these results to experience with the game and the complexity of explanations. The game familiarity of the adult groups led to less potential for performance improvement. Early middle school students had limited attention and were overwhelmed by information intake. We therefore focused on specially designed experiment materials in the subsequent experiments.

Island Game with open questions
A sample from Amazon Mechanical Turk and a student sample from the University of Bamberg participated in experiments that used the interface with Island Game questions and explanations. To test hypotheses H1 to H5, we employed a quantitative analysis of test performance and a qualitative analysis of verbal responses. A sub-sample with a mediocre initial performance, within one standard deviation of the mean, was selected for the performance analysis. This aims to discount the ceiling effect (initial performance too high) and outliers (e.g. struggling to use the interface).
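The sub-sample selection described above can be sketched as a simple filter on pre-test scores. The sketch assumes the sample standard deviation and inclusive bounds, neither of which is stated in the text; the participant identifiers and scores are invented for illustration.

```python
from statistics import mean, stdev


def mediocre_subsample(pre_test_scores):
    """Keep participants whose pre-test score is within one SD of the mean.

    `pre_test_scores` maps a participant id to a pre-test score. The sample
    standard deviation and inclusive bounds are assumptions.
    """
    values = list(pre_test_scores.values())
    mu = mean(values)
    sigma = stdev(values)
    return {
        pid: score
        for pid, score in pre_test_scores.items()
        if abs(score - mu) <= sigma
    }


# Invented scores: p1 performs very poorly and p5 is close to ceiling.
scores = {"p1": 2, "p2": 7, "p3": 8, "p4": 9, "p5": 15}
selected = mediocre_subsample(scores)
```

With these invented scores the two extreme participants are excluded, which mirrors the stated aim of discounting ceiling effects and outliers.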
From AMT sample, we had 90 participants who were 18 to above 65 years old. A sub-sample of 58 participants with a mediocre initial performance was randomly partitioned into two groups, MS (Mixed background Self learning n = 29) and MM (Mixed background Machine-aided learning, n = 29). A different sub-sample of 30 participants completed open questions and was randomly split into two groups, MSR (Mixed background Self learning and strategy Recall, n = 15) and MMR (Mixed background Machine-aided learning and strategy Recall, n = 15). As shown in Figure  5a, in category win_2, MM post-test had a better comprehension (p = 0.028) than MS post-test while MM and MS had similar pre-test performance (p > 0.1) in this category. Results in category win_2 indicate that explanations have a beneficial effect on MM. However, MM did not have a better comprehension on win_1 than MS given the same initial performance (p > 0.1). In addition, MM had the same initial performance as MS on win_3 (p > 0.1) but MM's performance reduced after receiving explanations of win_3 (p = 0.005).
From a group of students involved in a Cognitive Systems course at the University of Bamberg, we had 13 participants, most aged 18 to 24, with a few outliers aged between 25 and 54. All participants were asked to complete open questions and were randomly split into two groups, SSR (Student Self learning and strategy Recall, n = 4) and SMR (Student Machine-aided learning and strategy Recall, n = 9). A sub-sample of 9 with a mediocre initial performance was randomly divided into SS (Student Self learning, n = 2) and SM (Student Machine-aided learning, n = 7). The imbalance in the student sample was caused by a number of participants leaving during the experiment. The machine-aided learning results show large performance variances in the post-test, so the observed performance degradation does not reach significance.
In Table 5, we identified that participants who were able to provide high quality (HQ) responses for their test answers scored higher on these questions. This is not the case for win_3, however, due to the high difficulty of providing a good description of the strategy for the win_3 category. Additionally, in the win_2 category, both machine-aided groups (MMR: 2/(2+35), SMR: 9/(9+14)) have greater proportions of high quality responses than the self learning groups (MSR: 1/(1+32), SSR: 1/(1+8)). We also observed a pattern in which there are fewer HQ responses than low quality (LQ) responses in the win_1 and win_2 categories. This pattern is more pronounced in the win_2 category. Figure 6 illustrates the difficulty of providing a good quality verbal response for the non-trivial category win_3. Since win_1 contains only two predicates, we examined primitive coverage of the non-trivial categories win_2 and win_3. However, for clarity of presentation, we only show category win_3, which has more remarkable trends. When counting primitives based on Definition 11, we only consider the constraint number_of_pairs/3 and ignore the move generator move/2, as participants were required to make a move when they answered a question.
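As a quick check of the win_2 proportions quoted above, the following sketch evaluates the four fractions. It assumes that each pair is to be read as (HQ count, LQ count), which is our reading of the notation 2/(2+35) rather than something stated explicitly.

```python
from fractions import Fraction

# (HQ, LQ) response counts for win_2 as quoted in the text;
# reading the second number as the LQ count is an assumption.
counts = {"MMR": (2, 35), "SMR": (9, 14), "MSR": (1, 32), "SSR": (1, 8)}


def hq_proportion(hq, lq):
    """Proportion of high quality responses among all responses."""
    return Fraction(hq, hq + lq)


proportions = {group: hq_proportion(*c) for group, c in counts.items()}
for group, p in proportions.items():
    print(f"{group}: {p.numerator}/{p.denominator} = {float(p):.3f}")
```

Under this reading, each machine-aided group has a higher HQ proportion than its self learning counterpart (MMR against MSR, SMR against SSR), although the absolute proportions in the mixed background groups are small.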
In Figure 6a, we plotted primitive coverage against the accuracy of post-test answers that were selected as open questions. We observed a predominantly monotonically increasing trend in accuracy with respect to primitive coverage. This indicates that a close match between verbal responses and the machine learned theory correlates with high performance. In Figure 6b, we observed downward curves for MSR and MMR in the number of verbal responses from the lower to the higher primitive coverage. SSR and SMR provided more responses covering one primitive than MSR and MMR. Participants gave very few responses that cover more than two primitives. Based on the learned theory in Table 2, the results suggest an increasing difficulty to provide more complete strategy descriptions beyond two (mixed background groups) and four (student groups) clauses of win_3.

Figure 6 (caption, partially recovered): In Figure 6b, both mixed background groups (MSR and MMR) had lower proportions of responses covering one predicate than student groups (SSR and SMR). Mixed background and student groups could not provide a significant proportion of responses covering more than one and two primitives respectively (Figure 6a).

Discussion
Results concerning null hypotheses H1 to H5 are summarised in Tables 6 and 7. First, we assume that (H1 Null) comprehension does not correlate with verbal response quality. Results of HQ responses in two categories (Table 5) suggest that being able to provide better verbal descriptions of a strategy corresponds to high comprehension. We also examined the coverage of primitives (specifically for LQ responses of win_3) in verbal responses (Figure 6a). Evidence in all categories shows a correlation between comprehension and the degree to which verbal responses match the explanations. We reject the null hypothesis in all categories, which confirms H1.
In addition, we assume that (H2 Null) the difficulty for human participants to provide verbal response is not affected by verbal response quality. Since high response quality is difficult to achieve (Table 5) and it is challenging to correctly describe all primitives (Figure 6b), we reject this null hypothesis for all categories and confirm H2, as it is increasingly difficult for participants to provide higher quality verbal responses. Moreover, two additional trends we observed from the same figure suggest two mental barriers of learning. As we assume a human sample is a collection of version space learners, the search space of participants is limited to programs of size two (mixed background groups) and four (student groups). When H is taken as the student sample and P as the machine learned theory on winning the Island Game, the cognitive bound $B(P, H) = m^{4} \cdot p^{4(j+1)} = 4^{4} \cdot 2^{12}$ corresponds to the hypothesis space size for programs with four clauses (four metarules are used with at most two body literals in each clause; the primitives are move/2 and number_of_pairs/3). Furthermore, we assume that (H3 Null) machine learned theory does not improve verbal response quality. Results (Table 5) show a higher proportion of HQ responses for machine-aided learning than for self-learning in category win_2. Thus, for win_2, we reject this null hypothesis, which means H3 is confirmed in category win_2, where the machine explanations result in more high quality verbal responses being provided.
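The arithmetic behind the cognitive bound for the student sample can be checked with a short sketch. The function name and the generalisation of the clause count to a parameter k are illustrative assumptions; the text itself only states the instance with four clauses, m = 4 metarules, p = 2 primitives and j = 2 body literals.

```python
def hypothesis_space_bound(m, p, j, k):
    """Evaluate m^k * p^(k*(j+1)).

    m: number of metarules, p: number of primitives,
    j: maximum body literals per clause, k: number of clauses.
    Treating k as a free parameter is an assumption made here;
    the text gives only the case k = 4.
    """
    return m ** k * p ** (k * (j + 1))


# Student sample: four clauses, four metarules, two primitives
# (move/2 and number_of_pairs/3), at most two body literals.
student_bound = hypothesis_space_bound(m=4, p=2, j=2, k=4)
print(student_bound)  # 4^4 * 2^12 = 256 * 4096 = 1048576
```

If the same expression is applied to programs of size two, as observed for the mixed background groups, it evaluates to 4^2 * 2^6 = 1024. This extrapolation is ours and is not stated in the text.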
We assume that (H4 Null) learning a descriptively complex theory does not affect comprehension harmfully. When P is the program learned by MIPlain, B(P, H) for the two samples corresponds to program classes with size no larger than 4. Only win_3, which has a larger size of seven after unfolding, exceeds these cognitive bounds. As harmful effects (Figures 5a and 5b) have been observed in category win_3, this null hypothesis is rejected and H4 is confirmed, as learning a complex machine learned theory has a harmful effect on comprehension. We also assume that (H5 Null) applying a theory without a sufficiently low cognitive cost has a beneficial effect on comprehension. Given that the predicate win_1 in MIPlain's learned theory does not have a low cognitive cost and no significant beneficial effect has been observed for it, we reject this null hypothesis and confirm H5: knowledge application requiring substantial cognitive resources does not result in better comprehension.
The performance analysis (Figure 5a) demonstrates a comprehension difference between self learning and machine-aided learning in category win_2. An explanatory effect has not been observed for the student sample. While the conflicting results suggest that a larger sample size would likely ensure consistency of statistical evidence, the patterns indicate more significant results in category win_2 than in win_1 and win_3. The predicate win_2 in the program learned by MIPlain satisfies both constraints: the hypothesis space bound for knowledge acquisition and the cognitive cost for knowledge application. In addition, the cognitive window explains the lack of beneficial effects of predicates win_1 and win_3. The former does not have a lower cognitive cost for execution, so operational errors cannot be reduced and no effects have been observed. The latter is a complex rule with a larger hypothesis space for human participants to search, and harmful effects have been observed due to partial knowledge being learned. The learned program denotes a strategy of finding a pair rather than going for a direct win, which is a mismatch between taught and learned knowledge.
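The way the cognitive window accounts for the three predicates can be summarised in a small sketch. It is a simplified illustration of the reasoning in this discussion, not the formal definition: the class, field and function names are hypothetical, and the two constraints are encoded as flags taken from the text rather than computed from the learned program.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PredicateProfile:
    """Hypothetical summary of one predicate of the learned theory."""
    name: str
    within_hypothesis_bound: bool          # size within the cognitive bound
    lower_cognitive_cost: Optional[bool]   # None: not stated in this section


def expected_effect(pred):
    """Map the two cognitive-window constraints to an expected effect."""
    if not pred.within_hypothesis_bound:
        return "harmful"
    if not pred.lower_cognitive_cost:
        return "no observable effect"
    return "beneficial"


# Flags as described in the discussion; win_3 has size seven after
# unfolding, which exceeds the bound of four.
profiles = [
    PredicateProfile("win_1", within_hypothesis_bound=True, lower_cognitive_cost=False),
    PredicateProfile("win_2", within_hypothesis_bound=True, lower_cognitive_cost=True),
    PredicateProfile("win_3", within_hypothesis_bound=False, lower_cognitive_cost=None),
]
effects = {p.name: expected_effect(p) for p in profiles}
print(effects)
```

The output matches the pattern reported for the mixed background sample: no observable effect for win_1, a beneficial effect for win_2 and a harmful effect for win_3.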

Conclusions and further work
While the focus of explainable AI approaches has been on explanations of classifications [1], we have investigated explanations in the context of game strategy learning. In addition, we have explored both beneficial and harmful sides of the AI's explanatory effect on human comprehension. Our theoretical framework involves a cognitive window to account for the properties of a machine learned theory that lead to improvement or degradation of human performance. The presented empirical studies have shown that explanations are not helpful in general but only if they are of appropriate complexity, being neither informatively overwhelming nor more cognitively expensive than the solution to a problem itself. It would appear that complex machine learning models, and models which cannot provide abstract descriptions of internal decisions, are difficult to explain effectively. However, we acknowledge the limitation of our empirical studies in terms of consistency of statistical evidence, as groups vary greatly in sample size; this might be addressed with further experimentation.
To explain a strategy, typically goals or sub-goals must be related to actions which can fulfill these goals. If the strategy requires keeping in mind a stack of open sub-goals, as for example in the Tower of Hanoi [2,46], explanations might become more complex than figuring out the action sequence. Based on [8], knowledge is learned by humans in an incremental way, which was recently emphasized by [58] on human category learning. A potential approach to improve the explanatory effectiveness of a machine learned theory is to process complex concepts into smaller chunks by initially providing simple-to-execute and short sub-goal explanations. Mapping input to another sub-goal output thus consumes fewer cognitive resources and improvement in performance is more likely. It is worth investigating in future work a teaching procedure involving a sequence of teaching sessions that issues increasingly difficult tasks and explanations. Abstract descriptions might be generated in the form of invented predicates, as has been shown in previous work on ILP as an approach to USML [34]. An example of such an abstract description for the investigated game is the predicate number_of_pairs/3. Therefore, learning might be organised incrementally, guided by a curriculum [6,52].
In addition, the current teaching procedure, which only specifies humans as learners, could be augmented to enable two-way learning between human and machine. Human decisions might be machine learned and explanations would be provided based on estimation of human errors during the course of training. A simple demonstration of this idea is presented in Figure 7. We would like to explore, in the future, an interactive procedure in which a machine iteratively re-teaches human learners by targeting human learning errors via specially tailored explanations. [7] suggested it is crucial for machine produced clones to be able to represent goal-oriented knowledge in a form that is similar to human conceptual structure. Hence, MIL is an appropriate candidate for cloning since it is able to iteratively learn complex concepts by inventing sub-goal predicates. We hope to incorporate cloning to predict and target mistakes in human learned knowledge from answers in a sequence of re-training sessions. We expect that presenting appropriate explanations in re-training will "clean up" operational errors in human behaviour in empirical experiments. Such corrections and improvements guided by identified errors in a human strategy are also helpful in the context of intelligent tutoring [57], where classic strategies such as algorithmic debugging [48] can be applied to make humans and machines learn from each other.