Learning programs by learning from failures

We introduce learning programs by learning from failures. In this approach, an inductive logic programming (ILP) system (the learner) decomposes the learning problem into three separate stages: generate, test, and constrain. In the generate stage, the learner generates a hypothesis (a logic program) that satisfies a set of hypothesis constraints (constraints on the syntactic form of hypotheses). In the test stage, the learner tests the hypothesis against training examples. A hypothesis fails when it does not entail all the positive examples or entails a negative example. If a hypothesis fails, then, in the constrain stage, the learner learns constraints from the failed hypothesis to prune the hypothesis space, i.e. to constrain subsequent hypothesis generation. For instance, if a hypothesis is too general (entails a negative example), the constraints prune generalisations of the hypothesis. If a hypothesis is too specific (does not entail all the positive examples), the constraints prune specialisations of the hypothesis. This loop repeats until (1) the learner finds a hypothesis that entails all the positive and none of the negative examples, or (2) there are no more hypotheses to test. We implement our idea in Popper, an ILP system which combines answer set programming and Prolog. Popper supports infinite domains, reasoning about lists and numbers, learning optimal (textually minimal) programs, and learning recursive programs. Our experimental results on three diverse domains (number theory problems, robot strategies, and list transformations) show that (1) constraints drastically improve learning performance, and (2) Popper can substantially outperform state-of-the-art ILP systems, both in terms of predictive accuracies and learning times.

Given positive and negative examples of a target predicate and background knowledge (BK), the ILP problem is to induce a hypothesis which, together with the BK, entails as many positive and as few negative examples as possible. ILP represents the examples, BK, and hypotheses as logic programs (sets of logical rules).
Compared to most machine learning approaches, ILP has several advantages. ILP systems can generalise from small numbers of examples, often a single example (Lin et al., 2014). Because hypotheses are logic programs, they can be read by humans, crucial for explainable AI and ultra-strong machine learning (Michie, 1988). Moreover, because ILP systems learn logic programs, ILP is also a form of program synthesis (Shapiro, 1983), where the goal is to automatically generate computer programs from specifications, typically input/output examples. Finally, because of their symbolic nature, ILP systems naturally support lifelong and transfer learning (Cropper, 2019a), which is considered essential for human-like AI (Lake et al., 2016).
The fundamental problem in ILP is to efficiently search a huge (potentially infinite) hypothesis space (the set of all hypotheses). For instance, in our simplest experiment (Section 5.1), the hypothesis space contains approximately 10^13 hypotheses. A popular ILP approach is to use a set covering algorithm to learn hypotheses one clause at a time (Quinlan, 1990; Muggleton, 1995; Blockeel and Raedt, 1998; Srinivasan, 2001; Ahlgren and Yuen, 2013). Systems that implement this approach are often very efficient because they are example-driven. However, these systems tend to learn overly specific solutions and struggle to learn recursive programs (Bratko, 1999). An alternative, but increasingly popular, approach is to encode the ILP problem as a SAT problem (Corapi et al., 2011; Law et al., 2014; Kaminski et al., 2018; Evans and Grefenstette, 2018; Evans et al., 2019). Systems that implement this approach can often learn optimal and recursive programs. Moreover, they can use efficient SAT solvers based on conflict-driven clause learning. However, the major limitation of these systems is scalability, especially in terms of the domain size.
In this paper, we introduce an ILP approach called learning programs by learning from failures, largely inspired by Karl Popper's idea of falsification (Popper, 2005) and Shapiro's seminal program synthesis work (Shapiro, 1983). In our approach, the learner (an ILP system) decomposes the ILP problem into three separate stages: generate, test, and constrain. In the generate stage, the learner generates a hypothesis (a logic program) that satisfies a set of hypothesis constraints (constraints on the syntactic form of hypotheses). Importantly, in this stage, the learner ignores the BK and examples, and instead focuses on finding a constraint-satisfying hypothesis. In the test stage, the learner tests a hypothesis against the training examples. A hypothesis fails when it does not entail all the positive examples or entails a negative example. If a hypothesis fails, then, in the constrain stage, the learner learns hypothesis constraints from the failed hypothesis to prune the hypothesis space, i.e. to constrain subsequent hypothesis generation. For instance, if a hypothesis is too general (entails a negative example), the constraints prune generalisations of the hypothesis. If a hypothesis is too specific (does not entail all the positive examples), the constraints prune specialisations of the hypothesis. To illustrate, suppose the learner generates a hypothesis h1 and, in the test stage, finds that h1 fails because it does not entail any positive example and is therefore too specific. In the constrain stage, the learner learns hypothesis constraints to prune specialisations of h1 (such as h2 and h3) from the hypothesis space, and in the next generate stage it generates another hypothesis from the reduced space. There are two key ideas to our approach. Rather than refine a clause (Quinlan, 1990; Muggleton, 1995; Raedt and Bruynooghe, 1993; Blockeel and Raedt, 1998; Srinivasan, 2001; Ahlgren and Yuen, 2013), or refine a hypothesis (Shapiro, 1983; Bratko, 1999; Athakravi et al., 2013; Cropper and Muggleton, 2016), our first key idea is to refine the hypothesis space through learned hypothesis constraints. In other words, our key idea is to continually build a set of meta-constraints to constrain the hypothesis space. The more constraints we learn, the more we reduce the hypothesis space. By reasoning about the hypothesis space, our approach can drastically prune large parts of the hypothesis space by testing a single hypothesis. Our second key idea is to decompose the ILP problem into entirely separate tasks: generate, test, and constrain. This idea allows for flexibility in how to implement our approach. Moreover, decomposing the problem allows for greater scalability with respect to the problem size (particularly the domain size and the number of training examples). In other words, decomposing the problem alleviates the combinatorial explosion faced by approaches that frame the ILP problem as a single SAT problem (Corapi et al., 2011; Law et al., 2014; Kaminski et al., 2018; Evans and Grefenstette, 2018; Evans et al., 2019).
We implement our idea in Popper, a new ILP system which combines answer set programming (ASP) (Gebser et al., 2012) and Prolog. In the generate stage, Popper uses ASP to declaratively define, constrain, and search the hypothesis space. The idea is to define an ASP problem where an answer set (a model) corresponds to a definite program. By later learning hypothesis constraints, we eliminate answer sets and thus prune the hypothesis space. Importantly, this stage ignores the examples and BK so that the search is focused on finding a constraint-satisfying hypothesis. Our first motivation for using ASP is its declarative nature, which allows us to, for instance, define constraints to enforce Datalog and type restrictions, constraints to prune recursive hypotheses that do not contain base cases, and constraints to prune generalisations and specialisations of a failed hypothesis. Our second motivation is to use state-of-the-art ASP systems (Gebser et al., 2014) to efficiently solve our complex constraint problem. In the test stage, Popper uses Prolog to test hypotheses against the examples and BK. Our main motivation for using Prolog in this stage is to learn programs that use lists, numbers, and infinite domains. In the constrain stage, Popper learns hypothesis constraints (in the form of ASP constraints) from failed hypotheses to prune the hypothesis space, i.e. to constrain subsequent hypothesis generation. To efficiently combine the three stages, Popper uses ASP's multi-shot solving (Gebser et al., 2019) to maintain state between the three stages, e.g. to remember learned conflicts on the hypothesis space.
To give a clear overview of Popper, Table 1 compares Popper to Progol (Muggleton, 1995), a classical ILP system, and to Metagol (Cropper and Muggleton, 2016), ILASP (Law et al., 2014), and ∂ILP (Evans and Grefenstette, 2018), three state-of-the-art ILP systems based on Prolog, ASP, and neural networks respectively. Compared to Progol, Popper can learn optimal and recursive programs. Compared to Metagol, Popper does not need metarules (Cropper and Tourret, 2019), and can thus learn programs with predicates of any arity. Compared to ILASP and ∂ILP, Popper supports large and infinite domains. Compared to all these systems, Popper supports hypothesis constraints, such as disallowing the co-occurrence of predicate symbols in a program or disallowing recursive hypotheses that do not contain base cases.
Overall, our specific contributions in this paper are:

- We define our problem setting, introduce our simple language bias called predicate declarations, introduce hypothesis constraints, calculate the size of the hypothesis space, define hypothesis generalisations and specialisations, and introduce the idea of learning from failures (Section 3).

- We introduce Popper, an ILP system that learns definite programs (Section 4). Popper uses ASP to declaratively define, constrain, and search the hypothesis space, and Prolog to test hypotheses. Popper supports types, learning optimal (textually minimal) solutions, learning recursive programs, reasoning about lists and infinite domains, and the novel feature of hypothesis constraints. We show that Popper is sound and complete with respect to optimal solutions (Theorem 1).

- We experimentally show (Section 5) on three diverse domains (number theory problems, robot strategies, and list transformations) that (1) constraints drastically reduce the hypothesis space, (2) Popper can substantially outperform the state-of-the-art ILP systems Metagol, ILASP, and FastLAS (Law et al., 2020), both in terms of predictive accuracies and learning times, and (3) Popper is reasonably robust to its parameters.

Program synthesis
The goal of program synthesis is to automatically generate a computer program from a specification. Program synthesis from examples (Summers, 1977; Shapiro, 1983) interests researchers from many areas of computer science, notably machine learning (ML) and programming languages (PL). The major difference between ML and PL approaches is the generality of solutions (synthesised programs). PL approaches often aim to find any program that fits the specification, regardless of whether it generalises. Indeed, PL approaches rarely evaluate the ability of their systems to synthesise solutions that generalise, i.e. they do not measure predictive accuracy (Polikarpova et al., 2016; Albarghouthi et al., 2017; Feng et al., 2018; Raghothaman et al., 2020). By contrast, the major challenge in ML is learning hypotheses that generalise to unseen examples. Indeed, it is often trivial for an ML system to learn an overly specific solution for a given problem. For instance, an ILP system can trivially construct the bottom clause (Muggleton, 1995) for each example. Because of this major difference, in the rest of this section, we focus on ML approaches to program synthesis. We first, however, briefly cover two PL approaches which share similarities with our learning from failures idea. Neo (Feng et al., 2018) synthesises non-recursive programs using SAT and SMT solvers. Neo inherently requires SMT specifications for domain-specific background functions and predicates (i.e. background knowledge). For instance, the specification for head, taking an input list and returning an output list, is the formula input.size ≥ 1 ∧ output.size = 1 ∧ output.max ≤ input.max. Our approach does not need such definitions for the BK. We only need to evaluate hypotheses to determine their truth or falsity with respect to examples. Neo cannot synthesise recursive programs, nor is it guaranteed to synthesise optimal (textually minimal) programs. By contrast, Popper can learn optimal and recursive logic programs.
ProSynth (Raghothaman et al., 2020) takes as input a set of candidate Datalog rules and returns a subset of them. ProSynth learns constraints that disallow certain clause combinations, e.g. to prevent clauses that entail a negative example from occurring together. Popper differs from ProSynth in several ways. ProSynth takes as input the full hypothesis space (the set of candidate rules). By contrast, Popper does not fully construct the hypothesis space. This difference is important because it is often infeasible to pre-compute the full hypothesis space. For instance, the largest number of candidate rules considered in the ProSynth experiments is 1000. By contrast, in our simplest experiment (Section 5.1), the hypothesis space contains approximately 10^13 rules. ProSynth provides no guarantees about solution size. By contrast, Popper is guaranteed to learn an optimal (smallest) solution (Theorem 1). Moreover, whereas ProSynth synthesises Datalog programs, Popper additionally learns definite programs, and thus supports learning programs with infinite domains.

Inductive logic programming
There are various ML approaches to program synthesis, including neural approaches (Balog et al., 2017; Ellis et al., 2018). We focus on inductive logic programming (ILP) (Muggleton, 1991). As with other forms of ML, given positive and negative examples, the goal of an ILP system is to learn a hypothesis which correctly explains as many positive and as few negative examples as possible. However, whereas most forms of ML represent data (examples and hypotheses) as tables (i.e. vectors), ILP represents data as logic programs. Moreover, whereas most forms of ML learn functions, ILP learns relations.

Recursion
Learning recursive programs has long been considered a difficult problem in ILP (Muggleton et al., 2012). Without recursion, it is often difficult for an ILP system to generalise from small numbers of examples. Indeed, many popular ILP systems, such as FOIL (Quinlan, 1990), Progol (Muggleton, 1995), TILDE (Blockeel and Raedt, 1998), and Aleph (Srinivasan, 2001), struggle to learn recursive programs. The reason is that they employ a set covering approach to build a hypothesis clause by clause. Each clause is usually found by searching an ordering over clauses. A common approach is to pick an uncovered example, generate the bottom clause (Muggleton, 1995) for this example, i.e. the logically most specific clause that entails the example, and then search the subsumption lattice (either top down or bottom up) bounded by this bottom clause. Systems that implement this approach are often very efficient because the hypothesis search is example driven. However, these systems tend to learn overly specific solutions and struggle to learn recursive programs (Bratko, 1999). To overcome this limitation, Popper searches over logic programs (sets of clauses), a technique used by other ILP systems (Bratko, 1999; Athakravi et al., 2013; Law et al., 2014; Cropper and Muggleton, 2016; Evans and Grefenstette, 2018; Kaminski et al., 2018).

Optimality
There are often multiple (sometimes infinitely many) hypotheses that explain the data. Deciding which hypothesis to choose is a difficult problem. Progol, Aleph, TILDE, and XHAIL (Ray, 2009) are not guaranteed to learn optimal solutions, where optimal typically means the smallest program or the program with the minimal description length. The claimed advantage of learning optimal solutions is better generalisation. Recent ILP approaches, especially those that encode the ILP problem as a SAT problem, learn optimal solutions, such as programs with the fewest clauses (Cropper and Muggleton, 2016; Kaminski et al., 2018) or the fewest literals (Corapi et al., 2011; Law et al., 2014). Popper also learns optimal solutions, measured as the total number of literals in the hypothesis.

Language bias
ILP approaches use a language bias (Nienhuys-Cheng and Wolf, 1997) to restrict the hypothesis space. Language bias can be categorised as syntactic bias, which restricts the syntax of hypotheses, such as the number of variables allowed in a clause, and semantic bias, which restricts hypotheses based on their semantics, such as whether they are functional, irreflexive, etc.
Mode declarations (Muggleton, 1995) are a popular language bias (Blockeel and Raedt, 1998; Srinivasan, 2001; Ray, 2009; Corapi et al., 2010; Athakravi et al., 2013; Ahlgren and Yuen, 2013; Law et al., 2014). Mode declarations state which predicate symbols may appear in a clause, how often they may appear, what their argument types are, and whether their arguments must be ground. We do not use mode declarations. We instead use a simple language bias which we call predicate declarations (Section 3), where a user needs only state whether a predicate symbol may appear in the head and/or body of a clause, similar to determinations in Aleph (Srinivasan, 2001). In our approach, a user can additionally provide other language biases, such as type information, as hypothesis constraints (Section 2.8).
Metarules (Cropper and Tourret, 2019) are another popular syntactic bias used by many program synthesis approaches (Raedt and Bruynooghe, 1992; Wang et al., 2014; Albarghouthi et al., 2017; Kaminski et al., 2018), including Metagol (Cropper et al., 2019b; Cropper and Muggleton, 2016) and, to an extent, ∂ILP (Evans and Grefenstette, 2018). A metarule is a higher-order clause which defines the exact form of clauses in the hypothesis space. For instance, the chain metarule is of the form P(A,B) ← Q(A,C), R(C,B), where P, Q, and R denote predicate variables, and allows for instantiated clauses such as last(A,B):-reverse(A,C),head(C,B). Compared with predicate (and mode) declarations, metarules are a much stronger inductive bias because they specify the exact form of clauses in the hypothesis space. However, the major problem with metarules is determining which ones to use (Cropper and Tourret, 2019). A user must either (1) provide a set of metarules, or (2) use a set of metarules restricted to a certain fragment of logic, e.g. dyadic Datalog (Cropper and Tourret, 2019). This limitation means that ILP systems that use metarules are difficult to use, especially when the BK contains predicate symbols with arity greater than two. If suitable metarules are known then, as we show in Appendix A, Popper can simulate metarules through hypothesis constraints.
Datalog is the target language of many ILP systems (Muggleton et al., 2014; Kaminski et al., 2018; Evans and Grefenstette, 2018; Evans et al., 2019). One motivation for learning Datalog, rather than Prolog, programs is to allow the ILP problem to be encoded as a SAT problem, particularly to leverage recent developments in SAT and SMT. This encoding is possible because a Datalog query is guaranteed to terminate, although this termination guarantee comes at the expense of Datalog not being a Turing-complete language.
A major limitation with these approaches is that they mostly encode the ILP problem as a single (often very large) SAT problem and thus struggle to scale to large problems.
Recent work in ILP uses ASP to learn Datalog (Evans et al., 2019), definite (Muggleton et al., 2014; Kaminski et al., 2018), normal (Ray, 2009; Corapi et al., 2011; Athakravi et al., 2013), and answer set programs (Law et al., 2014). Like Datalog, ASP is a truly declarative language. However, compared to Datalog, ASP is more expressive, allowing, for instance, aggregates, a form of disjunction in the head of a clause, and hard and weak constraints. Most ASP solvers only work on ground programs (Gebser et al., 2014). Therefore, a major limitation of pure ASP-based ILP systems is the intrinsic grounding problem, especially on large domains, such as when reasoning about lists or numbers; most ASP implementations support neither lists nor real numbers. For instance, ILASP (Law et al., 2014) can represent real numbers as strings and delegate the reasoning to Python via Clingo's scripting feature (Gebser et al., 2014). However, in this approach, the numeric computation is performed when grounding the inputs, so the grounding must be finite. This grounding problem also implies that such systems do not support infinite domains. Difficulty handling large (or infinite) domains is not specific to ASP; it applies to other pure SAT-based approaches, even those based on neural networks, such as ∂ILP, which only works on BK formed of a finite set of ground atoms. To overcome this limitation, Popper combines ASP and Prolog. Popper uses ASP to generate definite programs, which allows it to reason about large and infinite domains, such as reasoning about lists and numbers.

Generate, test, and constrain
A key idea of our approach is to reason about the hypothesis space. Rather than refine a clause (Quinlan, 1990;Muggleton, 1995;Raedt and Bruynooghe, 1993;Blockeel and Raedt, 1998;Srinivasan, 2001;Ahlgren and Yuen, 2013), or a hypothesis (Shapiro, 1983;Bratko, 1999;Athakravi et al., 2013;Cropper and Muggleton, 2016), we refine the hypothesis space through learned hypothesis constraints. In other words, our key idea is to continually build a set of meta-constraints to constrain the hypothesis space. The more constraints we learn, the more we reduce the hypothesis space. By reasoning about the hypothesis space, our approach can drastically prune large parts of the hypothesis space by testing a single hypothesis.
Atom (Ahlgren and Yuen, 2013) also learns definite programs using SAT solvers and learns constraints. However, because it builds on Progol (Muggleton, 1995), and thus employs inverse entailment, Atom struggles to learn recursive programs because it needs examples of both the base and step case (in that order) of a recursive program. Moreover, for the same reason, Atom struggles to learn optimal solutions. By contrast, Popper imposes no such conditions because it learns programs rather than individual clauses.
The ILASP systems (Law et al., 2014), notably ILASP3 (Law, 2018), also follow a generate, test, and constrain loop. We focus on ILASP3, which is a pure ASP-based ILP system. ILASP3 takes as input the full hypothesis space of ground clauses defined by given mode declarations. Each clause is given a unique identifier. The ILASP3 task is to find a subset of the clauses which covers as many positive and as few negative examples as possible. ILASP3 also tests hypotheses to generate constraints. If a hypothesis is not an optimal solution, ILASP3 translates an example into a set of coverage constraints over the hypothesis space. We refer the reader to the work of Law (2018) for a detailed description, but, at a very high level, a coverage constraint states that specific clauses must or must not be in a hypothesis (remember that ILASP3 precomputes the hypothesis space and assigns each clause a unique identifier).
Popper is similar to ILASP3 in that it follows a generate, test, and constrain loop. However, Popper differs from ILASP3 in several ways. ILASP3 learns unstratified ASP programs, including programs with normal rules, choice rules, and both hard and weak constraints. By contrast, Popper learns definite programs, typically described as Prolog programs, including programs with function symbols, real numbers, and infinite domains. ILASP3 requires the full hypothesis space of pre-generated clauses as input. By contrast, Popper never fully constructs the hypothesis space, which allows it to scale better to larger programs (Section 5). If a hypothesis is non-optimal, ILASP3 finds a relevant example which it translates into a set of coverage constraints over the hypothesis space. By contrast, in our approach, when a hypothesis fails, we translate the hypothesis into a set of hypothesis constraints. Our hypothesis constraints are different because they do not reason about specific clauses (we do not precompute the hypothesis space) but instead reason about the structure of hypotheses, i.e. they are meta-constraints. Finally, ILASP3 is based entirely on ASP and its generate, test, and constrain stages are closely aligned. By contrast, Popper completely separates the generate, test, and constrain stages, where the generate stage ignores the examples and BK to alleviate the inherent grounding problem faced by ILASP3, which limits it to small domains (as we show experimentally in Section 5).
FastLAS (Law et al., 2020) builds on ILASP. The key difference is that FastLAS does not take the full hypothesis space as input. Instead it uses something similar to bottom clause construction (Muggleton, 1995) to find a subset of the hypothesis space. FastLAS does not, however, support recursion.
The general generate, test, and constrain approach can be traced back to Shapiro's seminal program synthesis work on the model inference system (MIS) (Shapiro, 1983), which, like our approach, was heavily inspired by Karl Popper's idea of falsification (Popper, 2005). MIS is a top-down, incremental, and interactive ILP approach which specialises and generalises a theory until it covers all of the positive and none of the negative examples. However, whereas MIS refines a hypothesis, by either deleting incorrect clauses or specialising clauses, our approach works at the meta-level and refines the hypothesis space through learned hypothesis constraints.

Hypothesis constraints
Constraints are fundamental to our idea. Many ILP systems allow a user to constrain the hypothesis space through clause constraints (Muggleton, 1995; Srinivasan, 2001; Blockeel and Raedt, 1998; Ahlgren and Yuen, 2013; Law et al., 2014). For instance, Progol, Aleph, and TILDE allow a user to provide constraints on clauses that should not be violated. Popper also allows a user to provide clause constraints. Popper additionally allows a user to provide hypothesis constraints (or meta-constraints), which are constraints over a whole hypothesis (a set of clauses), not an individual clause. As a trivial example, suppose you want to disallow two predicate symbols p/2 and q/2 from both simultaneously appearing in a program (in any body literal in any clause). Then, because Popper reasons at the meta-level, this restriction is trivial to express: :- body_literal(_,p,2,_), body_literal(_,q,2,_). We introduce this meta-level encoding in Section 4, but the constraint prunes hypotheses where the predicate symbols p/2 and q/2 both appear in the body of a hypothesis (possibly in different clauses). The key thing to notice is the ease, uniformity, and succinctness of expressing constraints. We argue that declarative hypothesis constraints have many advantages. For instance, through hypothesis constraints, Popper can enforce (optional) type, metarule, recall, and functionality restrictions. Moreover, hypothesis constraints allow us to prune recursive programs without a base case and subsumption redundant programs. Finally, and most importantly, hypothesis constraints allow us to prune generalisations and specialisations of failed hypotheses, which we discuss in the next section.
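To make the clause/hypothesis distinction concrete, the following sketch contrasts the two kinds of constraint in the meta-level encoding of Section 4 (the restriction on p/2 and q/2 is purely illustrative):

    % clause constraint: p/2 and q/2 may not appear together in the same clause C
    :- body_literal(C,p,2,_), body_literal(C,q,2,_).

    % hypothesis constraint: p/2 and q/2 may not appear together anywhere in
    % the hypothesis, even in different clauses
    :- body_literal(_,p,2,_), body_literal(_,q,2,_).

The first constraint shares the clause variable C between the two literals, so it only prunes programs where both symbols occur in a single clause; the second uses anonymous clause variables and therefore prunes programs where the symbols occur anywhere in the hypothesis.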

Problem setting
We now define our problem setting, introduce our simple language bias called predicate declarations, introduce hypothesis constraints, calculate the size of the hypothesis space, define hypothesis generalisations and specialisations, and introduce our idea of learning from failures.

Logic preliminaries
We assume familiarity with logic programming notation (Lloyd, 2012) but restate some key terminology. All sets are finite unless otherwise stated. A clause is a set of literals. A clausal theory is a set of clauses. A Horn clause is a clause with at most one positive literal. A Horn theory is a set of Horn clauses. A definite clause is a Horn clause with exactly one positive literal. A definite theory is a set of definite clauses. A Horn clause is a Datalog clause if (1) it contains no function symbols, and (2) every variable that appears in the head of the clause also appears in the body of the clause. A Datalog theory is a set of Datalog clauses. Simultaneously replacing the variables v1, ..., vn in a clause with the terms t1, ..., tn is a substitution, denoted θ = {v1/t1, ..., vn/tn}. A substitution θ unifies atoms A and B when Aθ = Bθ. For example, the substitution θ = {X/alice} unifies the atoms knows(X,bob) and knows(alice,bob). We often use program as a synonym for theory, e.g. a definite program as a synonym for a definite theory.

Problem setting
Our problem setting is based on the ILP learning from entailment setting (Raedt, 2008). Our goal is to take as input positive and negative examples of a target predicate and background knowledge (BK), and to return a hypothesis (a logic program) that, together with the BK, entails all the positive and none of the negative examples. In this paper, we focus on learning definite programs. We will generalise the approach to non-monotonic programs in future work.
ILP approaches search a hypothesis space, the set of all learnable hypotheses. ILP approaches restrict the hypothesis space through a language bias (Section 2.5). Several forms of language bias exist, such as mode declarations (Muggleton, 1995), grammars (Cohen, 1994), and metarules (Cropper and Tourret, 2019). We use a simple language bias which we call predicate declarations, which are similar to Aleph's determinations (Srinivasan, 2001). A predicate declaration simply states which predicate symbols may appear in the head (head declarations) or body (body declarations) of a clause in a hypothesis:

Definition 1 (Head declaration) A head declaration is a ground atom of the form head_pred(p,a) where p is a predicate symbol of arity a.

Definition 2 (Body declaration) A body declaration is a ground atom of the form body_pred(p,a)
where p is a predicate symbol of arity a.
We define a declaration consistent clause:

Definition 3 (Declaration consistent clause) Let D = (Dh, Db) be a declaration bias formed of a set of head declarations Dh and a set of body declarations Db. A definite clause C is declaration consistent with D when the predicate symbol and arity of the head literal of C appear in Dh and the predicate symbol and arity of every body literal of C appear in Db.

We define a declaration consistent hypothesis:

Definition 4 (Declaration consistent hypothesis) A declaration consistent hypothesis
H is a set of definite clauses where each C ∈ H is declaration consistent with D.
Example 3 (Declaration consistent hypothesis) Let D be the declaration bias {head_pred(last,2), body_pred(head,2), body_pred(tail,2)}. Then two declaration consistent hypotheses are {last(A,B):-head(A,B)} and {last(A,B):-tail(A,C),head(C,B)}. In addition to a declaration bias, we restrict the hypothesis space through hypothesis constraints.
We first clarify what we mean by a constraint:

Definition 5 (Constraint)
A constraint is a Horn clause without a head, i.e. a denial. We say that a constraint is violated if all of its body literals are true.
Rather than define hypothesis constraints for a specific encoding (e.g. the encoding we use in Section 4), we use a more general definition:

Definition 6 (Hypothesis constraint) Let ℒ be a language that defines hypotheses, i.e. a meta-language. Then a hypothesis constraint is a constraint expressed in ℒ.
Example 4 (Hypothesis constraint) In Section 4, we introduce a meta-language for definite programs. In our encoding, the atom head_literal(Clause,Pred,Arity,Vars) denotes that the clause Clause has a head literal with the predicate symbol Pred, of arity Arity, and with the arguments Vars. An example hypothesis constraint in this language is:

:- head_literal(_,p,2,_).

This constraint states that a predicate symbol p of arity 2 cannot appear in the head of any clause in a hypothesis.
Example 5 (Hypothesis constraint) Another example hypothesis constraint is:

:- head_literal(_,p,_,_), body_literal(_,p,_,_).

This constraint states that the predicate symbol p cannot appear in the body of a clause if it appears in the head of a clause (not necessarily the same clause).
We define a constraint consistent hypothesis:

Definition 7 (Constraint consistent hypothesis) Let C be a set of hypothesis constraints written in a language ℒ. A set of definite clauses H is consistent with C if, when written in ℒ, H does not violate any constraint in C.
We now define our hypothesis space:

Definition 8 (Hypothesis space) Let D be a declaration bias and C be a set of hypothesis constraints. Then the hypothesis space ℋD,C is the (possibly infinite) set of all declaration and constraint consistent hypotheses. We refer to any element of ℋD,C as a hypothesis.
We define our problem input:

Definition 9 (Problem input) Our problem input is a tuple (B, D, C, E+, E−) where B is a definite program denoting background knowledge, D is a declaration bias, C is a set of hypothesis constraints, and E+ and E− are sets of ground atoms denoting positive and negative examples respectively.

We assume that no predicate symbol in the body of a clause in B appears in a head declaration of D. In other words, we assume that the BK does not depend on any hypothesis. For convenience, we define different types of hypotheses, mostly using standard ILP terminology (Nienhuys-Cheng and Wolf, 1997):

Definition 10 (Hypothesis types) Let (B, D, C, E+, E−) be an input tuple and H ∈ ℋD,C be a hypothesis. Then H is:

- complete when ∀e ∈ E+, H ∪ B ⊨ e
- consistent when ∀e ∈ E−, H ∪ B ⊭ e
- incomplete when ∃e ∈ E+, H ∪ B ⊭ e
- inconsistent when ∃e ∈ E−, H ∪ B ⊨ e
- totally incomplete when ∀e ∈ E+, H ∪ B ⊭ e

We define a solution, i.e. our problem output:

Definition 11 (Solution) Given an input tuple (B, D, C, E+, E−), a hypothesis H ∈ ℋD,C is a solution when H is complete and consistent.
Conversely, we define a failed hypothesis:

Definition 12 (Failed hypothesis) Given an input tuple (B, D, C, E+, E−), a hypothesis H ∈ ℋD,C is a failed hypothesis when H is incomplete or inconsistent.
There may be multiple (sometimes infinitely many) solutions. We want to find the smallest solution:

Definition 13 (Hypothesis size) The function size(H) returns the total number of literals in the hypothesis H.
We define an optimal solution:

Definition 14 (Optimal solution) Given an input tuple (B, D, C, E+, E−), a hypothesis H ∈ ℋD,C is an optimal solution when two conditions hold: (1) H is a solution, and (2) for every solution H′ ∈ ℋD,C, size(H) ≤ size(H′).

In Section 4, we introduce Popper, which, given the problem input, is guaranteed to return an optimal solution (Theorem 1).

Hypothesis space
One of the main ideas of our learning from failures approach is to reduce the size of the hypothesis space through learned hypothesis constraints. The size of the unconstrained hypothesis space is a function of a declaration bias and additional bounding variables:

Proposition 1 (Hypothesis space size) Let D = (Dh, Db) be a declaration bias with a maximum arity a, v be the maximum number of unique variables allowed in a clause, m be the maximum number of body literals allowed in a clause, and n be the maximum number of clauses allowed in a hypothesis. Then the maximum number of hypotheses in the unconstrained hypothesis space is:

$$\sum_{j=1}^{n} \binom{|D_h|\,v^a \sum_{k=1}^{m} \binom{|D_b|\,v^a}{k}}{j}$$

Proof Let C be an arbitrary clause in the hypothesis space. There are $|D_h|\,v^a$ ways to define the head literal of C. There are $|D_b|\,v^a$ ways to define a body literal in C. The body of C is a set of literals, so there are $\binom{|D_b|\,v^a}{k}$ ways to choose k body literals. We bound the number of body literals by m, so there are $c = |D_h|\,v^a \sum_{k=1}^{m} \binom{|D_b|\,v^a}{k}$ ways to define C. A hypothesis is a set of definite clauses, so there are $\binom{c}{j}$ ways to choose j clauses to form a hypothesis. Therefore, there are $\sum_{j=1}^{n} \binom{c}{j}$ ways to define a hypothesis with at most n clauses.
As this result shows, the hypothesis space is huge for non-trivial inputs, which motivates using learned constraints to prune the hypothesis space.
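As a worked instance with purely illustrative values (not those of any experiment), let |Dh| = 1, |Db| = 10, a = 2, v = 4, m = 3, and n = 2:

$$
\begin{aligned}
|D_b|\,v^a &= 10 \cdot 4^2 = 160\\
c &= |D_h|\,v^a \sum_{k=1}^{3} \binom{160}{k} = 16\,(160 + 12720 + 669920) \approx 1.1 \times 10^{7}\\
\sum_{j=1}^{2} \binom{c}{j} &\approx c + \frac{c^2}{2} \approx 6 \times 10^{13}
\end{aligned}
$$

Even with these modest bounds, the unconstrained hypothesis space already contains tens of trillions of hypotheses.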

Generalisations and specialisations
To prune the hypothesis space, we learn constraints to remove generalisations and specialisations of failed hypotheses. We reason about the generality of hypotheses syntactically through θ-subsumption (or subsumption for short) (Plotkin, 1971):

Definition 15 (Clausal subsumption) A clause C1 subsumes a clause C2 if and only if there exists a substitution θ such that C1θ ⊆ C2.
Example 6 (Clausal subsumption) Let C1 and C2 be the clauses:

C1 = last(A,B):-head(A,B)
C2 = last(A,B):-head(A,B),tail(A,C)

Then C1 subsumes C2 because C1θ ⊆ C2 with the empty substitution θ = {}. If a clause C1 subsumes a clause C2 then C1 entails C2 (Nienhuys-Cheng and Wolf, 1997). However, if C1 entails C2 then it does not necessarily follow that C1 subsumes C2; subsumption is therefore weaker than entailment. Whereas checking entailment between clauses is undecidable (Church, 1936), checking subsumption between clauses is decidable, although, in general, deciding subsumption is an NP-complete problem (Nienhuys-Cheng and Wolf, 1997). Midelfart (1999) extends subsumption to clausal theories:

Definition 16 (Theory subsumption) A clausal theory T1 subsumes a clausal theory T2, denoted T1 ⪯ T2, if and only if ∀C2 ∈ T2, ∃C1 ∈ T1 such that C1 subsumes C2.

Example 7 (Theory subsumption) Let T1 = {C1} and T2 = {C2}, where C1 and C2 are the clauses from Example 6. Then T1 ⪯ T2 because C1 subsumes C2.
Theory subsumption also implies entailment:

Proposition 2 (Subsumption implies entailment) Let T1 and T2 be clausal theories. If T1 ⪯ T2 then T1 ⊨ T2.

Proof Follows trivially from the definitions of clausal subsumption (Definition 15) and theory subsumption (Definition 16).
We use theory subsumption to define a generalisation:

Definition 17 (Generalisation) A clausal theory T1 is a generalisation of a clausal theory T2 if and only if T1 ⪯ T2.
We likewise define our notion of a specialisation:

Definition 18 (Specialisation) A clausal theory T1 is a specialisation of a clausal theory T2 if and only if T2 ⪯ T1.
In the next section, we use these definitions to define constraints to prune the hypothesis space.

Learning constraints from failures
In the test stage of our learning from failures approach, a learner tests a hypothesis against the examples. A hypothesis fails when it is incomplete or inconsistent. If a hypothesis fails, a learner learns hypothesis constraints from the different types of failure. We define two general types of constraints, generalisation and specialisation, which apply to any clausal theory, and show that they are sound in that they do not prune solutions. We also define an elimination constraint, specific to learning non-recursive definite programs, which we show is sound in that it does not prune optimal solutions. We describe these constraints in turn.

Generalisations and specialisations
Testing a hypothesis against the examples has one of three outcomes on the positive examples: P_all (the hypothesis entails all the positive examples), P_some (it entails at least one, but not all, of the positive examples), or P_none (it entails none of the positive examples); and one of two outcomes on the negative examples: N_none (it entails no negative example) or N_some (it entails at least one negative example).

Suppose the outcome is N_some, i.e. a hypothesis h entails a negative example. Then h is too general, so we can prune generalisations of it, because, by Proposition 2, every generalisation of h also entails the negative example.

Example 8 (Generalisation pruning) Suppose the hypothesis h = {last(A,B):-head(A,B)} entails a negative example. Then we can prune generalisations of h, such as h1 = {last(A,B):-head(A,C)} and h2 = {last(A,B):-head(A,B); last(A,B):-tail(A,C),last(C,B)}.

We capture this pruning with a generalisation constraint:

Definition 19 (Generalisation constraint) A generalisation constraint only prunes generalisations of a hypothesis from the hypothesis space.

Pruning generalisations of an inconsistent hypothesis is sound in that it only prunes inconsistent hypotheses, i.e. does not prune consistent hypotheses:

Proposition 3 (Generalisation soundness) Let H be an inconsistent hypothesis. Then a generalisation constraint for H never prunes a consistent hypothesis.

Conversely, suppose the outcome is P_some, i.e. a hypothesis h entails some, but not all, of the positive examples. Then h is too specific, so we can prune specialisations of it, because no specialisation of h can entail a positive example that h itself does not entail.

Example 9 (Specialisation pruning) Suppose the hypothesis h = {last(A,B):-head(A,B)} entails the first but not the second positive example. Then we can prune specialisations of h, such as h1 = {last(A,B):-head(A,B),tail(A,C)}.

We capture this pruning with a specialisation constraint:

Definition 20 (Specialisation constraint) A specialisation constraint only prunes specialisations of a hypothesis from the hypothesis space.

Pruning specialisations of an incomplete hypothesis is sound in that it only prunes incomplete hypotheses, i.e. does not prune complete hypotheses:

Proposition 4 (Specialisation soundness) Let H be an incomplete hypothesis. Then a specialisation constraint for H never prunes a complete hypothesis.

Eliminations
Suppose the outcome is P_none, i.e. H is totally incomplete. Then H is too specific so, as with P_some, we can prune specialisations of H. However, because H is totally incomplete (i.e. it does not entail any positive example), under certain assumptions, we can prune more. If H is totally incomplete then there is no need for H to appear in a complete non-recursive hypothesis (we illustrate why recursion matters in a moment). In other words, if H does not entail any positive example, then no specialisation of H can appear in an optimal non-recursive solution. We can therefore prune non-recursive hypotheses that contain specialisations of H. We call such a constraint an elimination constraint:

Definition 21 (Elimination constraint) An elimination constraint only prunes non-recursive hypotheses that contain specialisations of a hypothesis from the hypothesis space.
Example 10 (Elimination constraint) Suppose we have the positive examples E+ = {last([1,2],2), last([1,3],3)} and the hypothesis h = {last(A,B):-head(A,B)}. Because h does not entail any positive example, there is no reason for h (nor its specialisations) to appear in a non-recursive hypothesis. We can therefore prune non-recursive hypotheses which contain specialisations of h, such as {last(A,B):-head(A,B),tail(A,C)}. Elimination constraints are not sound in the same way as the generalisation and specialisation constraints because they can prune solutions (Definition 11) from the hypothesis space.
Example 11 (Elimination solution unsoundness) Suppose we have the positive example E+ = {last([1,2],2)} and the hypothesis h1 = {last(A,B):-head(A,B)}, which is totally incomplete. Then an elimination constraint for h1 would prune the complete hypothesis h2 = {last(A,B):-head(A,B); last(A,B):-tail(A,C),head(C,B)}, which is a solution, although not an optimal one, because its second clause alone is a smaller solution.

However, for non-recursive definite programs, elimination constraints are sound with respect to optimal solutions, i.e. they only prune non-optimal solutions from the hypothesis space. To show this result, we first introduce a lemma:

Lemma 1 Let B be a definite program, H be a non-recursive hypothesis, and e be an atom such that H ∪ B ⊨ e and B ⊭ e. Then there is a clause C ∈ H such that {C} ∪ B ⊨ e.

We use this result to show that elimination constraints are sound with respect to optimal solutions:

Proposition 5 (Elimination soundness) Let H be a totally incomplete hypothesis. Then an elimination constraint for H never prunes an optimal solution.

Proof Assume, for contradiction, that an elimination constraint for H prunes an optimal solution, i.e. that there is an optimal non-recursive solution H′ = G ∪ S, where S is a non-empty set of specialisations of clauses in H. By Definition 14, two conditions hold: (1) H′ is a solution, and (2) there is no smaller solution. By Lemma 1, every positive example entailed by H′ ∪ B, but not by B alone, is entailed by {C} ∪ B for some clause C ∈ H′. No clause C ∈ S can entail a positive example with B, because H subsumes, and thus entails, C, yet H is totally incomplete. Therefore, G ∪ B entails every positive example. Moreover, because definite programs are monotonic, G cannot entail a negative example, as otherwise H′ would. Therefore, G is a complete and consistent hypothesis that is smaller than H′ (Definition 13). Therefore, condition (2) cannot hold, which contradicts the assumption and completes the proof.
This proof relies on a hypothesis H being (1) a definite program, and (2) non-recursive (i.e. no predicate in the body of a clause in H appears in the head of a clause in H).
Condition (1) is clear because the proof relies on the monotonicity of definite programs. To illustrate condition (2), we give a counter-example to show why we can only use elimination constraints to prune non-recursive hypotheses.
Example 12 (Non-elimination for recursive hypotheses) Suppose we have the positive example E+ = {last([1,2],2)} and the recursive hypothesis h = {last(A,B):-tail(A,C),last(C,B)}. Then h is totally incomplete because, lacking a base case, it does not entail any positive example. However, h appears in the recursive solution {last(A,B):-tail(A,C),empty(C),head(A,B); last(A,B):-tail(A,C),last(C,B)}, so pruning every hypothesis that contains a specialisation of h would prune this solution, which may be optimal. Elimination constraints therefore only prune non-recursive hypotheses.

Constraints summary
To summarise, combinations of these different outcomes imply different combinations of constraints, shown in Table 2. In the next section we introduce Popper, which uses these constraints to learn definite programs.
Outcome  | N_none                       | N_some
P_all    | n/a                          | Generalisation
P_some   | Specialisation               | Specialisation, Generalisation
P_none   | Specialisation, Elimination  | Specialisation, Elimination, Generalisation

Table 2: The constraints we can learn from testing a hypothesis. The combination of the P_all and N_none outcomes denotes that we have found a solution.

Popper
Popper is an implementation of our learning from failures idea. Popper works in three separate stages: generate, test, and constrain, as described in Section 1. Algorithm 1 sketches the Popper algorithm which combines the three stages. To learn optimal solutions (Definition 14), Popper searches for programs of increasing size. We describe the generate, test, and constrain stages in detail, how we use ASP's multi-shot solving (Gebser et al., 2019) to maintain state between the three stages, and then prove the soundness and completeness of Popper.

The aggregate #count calculates the number of elements of a set. For example, the expression #count{X : knows(X,alice)} == N counts how many unique values X hold for knows(X,alice) and checks that it is equal to N.

Generate
The generate stage of Popper takes as input (1) predicate declarations, (2) hypothesis constraints, and (3) a bound on the total number of literals in a hypothesis, and returns an answer set which represents a definite program, if one exists. There are also implicit input parameters that bound the number of unique variables, literals, and clauses allowed in a hypothesis. The idea is to define an ASP problem where an answer set (a model) corresponds to a definite program. In other words, we define a meta-language in ASP to represent definite programs. Popper uses ASP constraints to ensure that a definite program is declaration consistent and obeys hypothesis constraints, such as enforcing type restrictions or disallowing mutual recursion. By later adding learned hypothesis constraints, we eliminate answer sets and thus reduce the hypothesis space. In other words, the more constraints we learn, the more we reduce the hypothesis space. Figure 2 shows the base ASP program to generate programs. The key idea is to find an answer set with suitable head and body literals, which both have the arguments (Clause,Pred,Arity,Vars) to denote that there is a literal in the clause Clause, with the predicate symbol Pred, arity Arity, and variables Vars. For instance, head_literal(0,p,2,(0,1)) denotes that clause 0 has a head literal with the predicate symbol p, arity 2, and variables (0,1), which we interpret as (A,B). Likewise, body_literal(1,q,3,(0,0,2)) denotes that clause 1 has a body literal with the predicate symbol q, arity 3, and variables (0,0,2), which we interpret as (A,A,C). Head and body literals are restricted by head_pred and body_pred declarations respectively. Table 3 shows examples of the correspondence between an answer set and a definite program, which we represent as a Prolog program.
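To give a flavour of this encoding, below is a minimal self-contained sketch, not Popper's actual base encoding (Figure 2), in which each answer set corresponds to a program of one or two dyadic clauses over the running example's declarations:

    % illustrative sketch of a generate encoding (simplified, not Figure 2)
    var(0..3).                  % variable indices: 0=A, 1=B, 2=C, 3=D
    clause(0). { clause(1) }.   % clause 0 always exists; clause 1 is optional
    head_pred(last,2).          % predicate declarations
    body_pred(head,2). body_pred(tail,2).

    % each clause has exactly one declaration consistent head literal
    1 { head_literal(C,P,2,(V0,V1)) : head_pred(P,2), var(V0), var(V1) } 1 :- clause(C).
    % and between one and three declaration consistent body literals
    1 { body_literal(C,P,2,(V0,V1)) : body_pred(P,2), var(V0), var(V1) } 3 :- clause(C).

For instance, the answer set containing head_literal(0,last,2,(0,1)), body_literal(0,tail,2,(0,2)), and body_literal(0,head,2,(2,1)) encodes the Prolog program last(A,B):-tail(A,C),head(C,B).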

Validity, redundancy, and efficiency constraints
Popper uses hypothesis constraints (in the form of ASP constraints) to eliminate answer sets, i.e. to prune the hypothesis space. Popper uses constraints to prune invalid programs. For instance, Figure 3 shows constraints specifically for recursive programs, such as preventing recursion without a base case. Popper also uses constraints to reduce redundancy. For instance, Popper prunes subsumption redundant programs, such as a program in which the first clause subsumes the second, e.g. {f(A,B):-head(A,B); f(A,B):-head(A,B),tail(A,C)}.
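As an illustration, a simplified version of a base-case constraint (a sketch in the spirit of Figure 3, not its exact contents) might look as follows:

    % sketch: prune recursive hypotheses that lack a base case
    recursive(C) :- head_literal(C,P,A,_), body_literal(C,P,A,_).
    has_base     :- head_literal(C,_,_,_), not recursive(C).
    :- recursive(_), not has_base.

The first rule marks a clause as recursive when its head predicate reappears in its body; the constraint then rejects any answer set that contains a recursive clause but no non-recursive (base case) clause.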

Language bias constraints
A key feature of Popper is that it supports optional hypothesis constraints to prune the hypothesis space. Figure 4 shows example language bias constraints, such as constraints to prevent singleton variables and to enforce Datalog restrictions (where head variables must appear in the body). Declarative constraints have many benefits, notably the ease of defining them. For instance, adding simple types to Popper requires only the single constraint shown in Figure 4. Through constraints, Popper also supports the standard notions of recall and input/output arguments of mode declarations (Muggleton, 1995). Popper also supports functional and irreflexive constraints, and constraints on recursive programs, such as disallowing left recursion or mutual recursion. Finally, as we show in Appendix A, Popper can also use constraints to impose metarules, clause templates used by many ILP systems (Cropper and Tourret, 2019), which ensures that each clause in a program is an instance of a metarule.
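For illustration, the following sketches show how such biases can be written as constraints; var_type/3, head_var/2, and body_var/2 are assumed helper predicates for this sketch, not necessarily those of Popper's actual Figure 4 encoding:

    % sketch: simple types, each variable in a clause has at most one type
    :- var_type(C,V,T1), var_type(C,V,T2), T1 != T2.

    % sketch: Datalog restriction, every head variable must appear in the body
    :- head_var(C,V), not body_var(C,V).

    % sketch: recall-style restriction, at most two occurrences of tail/2 per clause
    :- clause(C), #count{Vars : body_literal(C,tail,2,Vars)} > 2.

The last constraint uses the #count aggregate described earlier to bound how often a predicate symbol may occur in a single clause.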

Hypothesis constraints
As with many ILP systems (Muggleton, 1995; Srinivasan, 2001; Law et al., 2014), Popper supports clause constraints, which allow a user to prune specific clauses from the hypothesis space. Popper additionally supports the more general concept of hypothesis constraints (Definition 6), which are defined over a whole program (a set of clauses) rather than a single clause. For instance, hypothesis constraints allow us to prune recursive programs that do not contain a base case clause (Figure 3), to prune left recursive or mutually recursive programs, or to prune programs which contain subsumption redundancy between clauses. As a toy example, suppose you want to disallow two predicate symbols p/2 and q/2 from both appearing in a program. Then this hypothesis constraint is trivial to express with Popper: :- body_literal(_,p,2,_), body_literal(_,q,2,_).
As we show in Appendix A, Popper can simulate metarules through hypothesis constraints. We are unaware of any other ILP system that supports hypothesis constraints, at least with the same ease and flexibility as Popper.

Test
In the test stage, Popper converts an answer set to a definite program and tests it against the training examples. As Table 3 shows, this conversion is straightforward, except if input/output argument directions are given, in which case Popper orders the body literals of a clause. To evaluate a hypothesis, we use a Prolog interpreter. For each example, Popper checks whether the example is entailed by the hypothesis and background knowledge. We enforce a timeout to halt non-terminating programs. In addition to evaluating a whole hypothesis, Popper also individually evaluates each non-recursive clause in a hypothesis. This extra check allows us to identify additional elimination constraints. If a hypothesis fails, then Popper identifies what type of failure has occurred and what constraints to generate (using the failures and constraints from Section 3.5).

Constrain
If a hypothesis fails, then, in the constrain stage, Popper generates ASP constraints to prune the hypothesis space, and thus constrain subsequent hypothesis generation. Specifically, we describe how we transform a failed hypothesis (a definite program) to a hypothesis constraint (an ASP constraint written in the encoding from Section 4.1). We describe the generalisation, specialisation, and elimination constraints that Popper uses, based on the definitions in Section 3.5. As a version of Popper without these constraints is considered in the experiments, we also describe the banish constraint, which prunes one specific hypothesis. To distinguish between Prolog and ASP code, we represent the code of definite programs in typewriter font and ASP code in bold typewriter font.

Encoding atoms
Consider encoding the atom f(A,B). An atom is either in the head or the body of a clause, so in our encoding the atom is represented either as head_literal(Clause,f,2,(V0,V1)) or as body_literal(Clause,f,2,(V0,V1)). The relevant clause is indicated by Clause and the 2 indicates the predicate's arity. Two functions encode atoms into ASP literals: encodeHead encodes a head atom and encodeBody encodes a body atom. Their first argument specifies the clause an atom belongs to; their second argument is the atom. A hypothesis variable is converted to a variable in our ASP encoding by the encodeVar function.

Encoding clauses
Using the encoding of atoms as ASP literals, we can encode clauses. Consider the clause last(A,B):-reverse(A,C),head(C,B). Supposing Ci identifies the clause, the following ASP literals capture where the atoms occur: head_literal(Ci,last,2,(V0,V1)), body_literal(Ci,reverse,2,(V0,V2)), body_literal(Ci,head,2,(V2,V1)). Note that the ASP variables V0, V1, and V2 will be instantiated by indices representing variables of hypotheses, e.g. 0 for A, 1 for B, etc. Note also that the above encoding allows for V0 = V1 = V2 = 0, which represents the clause with all variables as A. To ensure that these variables remain distinct we need to impose V0 != V1, V0 != V2, and V1 != V2. The function encodeClause produces this encoding while allowing additional literals to be added to the provided clause; the function encodeSizedClause additionally fixes the clause's size so that no further literals can be added. With the clause encoding functions defined, we can now use them to define our constraints.

Generalisation constraints
Given a hypothesis H, by Definition 17, any hypothesis that includes all of H's clauses exactly, i.e. not specialised, is a generalisation of H. We use this fact to define the function generalisationConstraint, which converts a set of clauses into an ASP encoded generalisation constraint (Definition 19). We use encodeSizedClause to impose that a clause is not specialised. Each clause gets its own ASP variable Ci, meaning the clauses can occur in any order. For example, for the hypothesis h = {last(A,B):-head(A,B)}, with A and B distinct, generalisationConstraint derives the constraint shown in Figure 5:

:- head_literal(C0,last,2,(C0V0,C0V1)),
   body_literal(C0,head,2,(C0V0,C0V1)),
   C0V0 != C0V1,
   clause_size(C0,1).

Fig. 5: The ASP encoded generalisation constraint for the hypothesis h = {last(A,B):-head(A,B)}.

Specialisation constraints
Given a hypothesis H, by Definition 18, any hypothesis in which every clause of H occurs, where each clause may be specialised, and which includes no other clauses, is a specialisation of H. The function specialisationConstraint uses this fact to derive an ASP encoded specialisation constraint (Definition 20). We use the fact that encodeClause allows additional literals to be added to a provided clause. The literal not clause(n) ensures that no additional clause is added to the n distinct clauses of the provided hypothesis. To illustrate why asserting that the specialised clauses are distinct is necessary, consider the hypotheses h1 and h2:

h1 = {last(A,B):-head(A,B); last(A,B):-tail(A,C),head(C,B)}
h2 = {last(A,B):-head(A,B),tail(A,C),head(C,B); last(A,B):-tail(A,B)}

The first clause of h2 specialises both clauses in h1, yet h2 is not a specialisation of h1: according to Definition 18, each clause needs to be subsumed by a provided clause, and no clause in h1 subsumes the second clause of h2. Note that specialisationConstraint only considers hypotheses with at most n clauses. It is not possible for one of these clauses to be non-specialising, as each of the original n clauses is required to be specialised by a distinct clause. Figure 6 illustrates a specialisation constraint derived by specialisationConstraint.
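For instance, for the single-clause hypothesis h = {last(A,B):-head(A,B)}, a specialisation constraint in the style derived by specialisationConstraint might look like the following sketch (illustrative, not Figure 6's exact output):

    :- head_literal(C0,last,2,(V0,V1)),
       body_literal(C0,head,2,(V0,V1)),
       not clause(1).

The first two literals match any clause whose body contains head(A,B) under some variable assignment, i.e. any specialisation of h's single clause; not clause(1) ensures that the hypothesis contains no second clause, so only specialisations of h are pruned.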

Elimination constraints
By Proposition 5, given a totally incomplete hypothesis H, any non-recursive hypothesis which includes all of H's clauses, where each clause may be specialised, cannot be an optimal solution. The function eliminationConstraint uses this fact to derive an ASP encoded elimination constraint (Definition 21). As in specialisationConstraint, encodeClause is used to allow additional literals in clauses, ensuring that provided clauses are included or specialised. However, eliminationConstraint does not require that every clause is a specialisation of a provided clause. Instead, all that is required is that the hypothesis is non-recursive.
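Continuing the example, if h = {last(A,B):-head(A,B)} is totally incomplete, an elimination constraint could take roughly the following form (a sketch; recursive_hyp is an assumed helper marking recursive hypotheses):

    recursive_hyp :- head_literal(_,P,A,_), body_literal(_,P,A,_).

    :- head_literal(C0,last,2,(V0,V1)),
       body_literal(C0,head,2,(V0,V1)),
       not recursive_hyp.

Unlike the specialisation constraint, there is no not clause(n) literal, so hypotheses with additional clauses are also pruned, but only when the whole hypothesis is non-recursive.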

Banish constraints
In the experiments section, we compare Popper against itself without constraint pruning.
To do so, we need to remove single hypotheses from the hypothesis space. We introduce the banish constraint for this purpose. To prune one specific hypothesis, and no other, hypotheses that use different variables should not be pruned. We accomplish this by changing the behaviour of the encodeVar function. Normally encodeVar returns ASP variables which are then grounded to indices that correspond to the variables of hypotheses. For a banish constraint, encodeVar instead directly assigns the corresponding index for each hypothesis variable.
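For instance, a banish constraint for the single-clause hypothesis last(A,B):-head(A,B), with A and B assigned the indices 0 and 1, could look like this sketch:

    :- head_literal(0,last,2,(0,1)),
       body_literal(0,head,2,(0,1)),
       clause_size(0,1),
       not clause(1).

Because every term is ground, the constraint matches exactly one answer set: clause_size(0,1) rules out additional body literals and not clause(1) rules out additional clauses, so precisely this hypothesis, and no other, is pruned.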
Multi-shot solving

Popper uses multi-shot solving as follows. The initial ASP program is the encoding described in Section 4.1. Popper starts a Clingo instance and asks it to solve this program, which grounds it and then calls the ASP solver, which returns an answer set (if the problem is satisfiable). Popper converts the answer set to a definite program and tests it against the examples. If a hypothesis fails, Popper generates ASP constraints using the functions in Section 4.3 and adds them to the running Clingo instance, which grounds the constraints and adds the new (propositional) rules to the running solver. The solver knows which parts of the search space (i.e. hypothesis space) have already been considered and will not revisit them. This loop repeats until either (1) Popper finds an optimal solution, or (2) there are no more hypotheses to test.

Correctness
We now show the correctness of Popper. We first show that Popper's base encoding (Figure 2) can generate every declaration consistent hypothesis (Definition 4):

Proposition 6 The base encoding of Popper has a model for every declaration consistent hypothesis.
Proof Let D = (Dh, Db) be a declaration bias, Nvar be the maximum number of unique variables, Nbody be the maximum number of body literals, Nclause be the maximum number of clauses, H be any hypothesis declaration consistent with D and these parameters, and C be any clause in H. Our encoding represents the head literal ph(H1,...,Hn) of C as a choice literal head_literal(i,ph,n,(H1,...,Hn)) guarded by the condition head_pred(ph,n) ∈ Dh, which clearly holds. Our encoding represents a body literal pb(B1,...,Bm) of C as a choice literal body_literal(i,pb,m,(B1,...,Bm)) guarded by the condition body_pred(pb,m) ∈ Db, which clearly holds. The base encoding only constrains the above guesses by three conditions: (i) at most Nvar unique variables per clause, (ii) at least 1 and at most Nbody body literals per clause, and (iii) at most Nclause clauses. As both the hypothesis and the guessed literals satisfy the same conditions, we conclude that there exists a model representing H.

Proposition 7 (Soundness) Any hypothesis returned by Popper is a solution.
Proof Any returned hypothesis has been tested against the training examples and confirmed as a solution.
To make the next two results shorter, we introduce a lemma to show that Popper never prunes optimal solutions (Definition 14):

Lemma 2 Popper never prunes optimal solutions.

Proof Popper only learns constraints from a failed hypothesis, i.e. a hypothesis that is incomplete or inconsistent. Let H be a failed hypothesis. If H is incomplete, then, as described in Section 4.3, Popper prunes specialisations of H. Proposition 4 shows that a specialisation constraint never prunes complete hypotheses, and thus never prunes optimal solutions. If H is inconsistent, then, as described in Section 4.3, Popper prunes generalisations of H. Proposition 3 shows that a generalisation constraint never prunes consistent hypotheses, and thus never prunes optimal solutions. Finally, if H is totally incomplete, then, as described in Section 4.3, Popper uses an elimination constraint to prune all non-recursive hypotheses that contain specialisations of H. Proposition 5 shows that an elimination constraint never prunes optimal solutions. Since Popper only uses these three constraints, it never prunes optimal solutions.

Proposition 8 (Completeness) Popper returns a solution if one exists.
Proof Assume, for contradiction, that Popper does not return a solution when one exists, which implies that (1) Popper returned a hypothesis that is not a solution, or (2) Popper did not return any hypothesis. Case (1) cannot hold because Proposition 7 shows that every hypothesis returned by Popper is a solution. For case (2), by Proposition 6, Popper can generate every hypothesis, so it must be the case that (i) Popper did not terminate, (ii) a solution did not pass the test stage, or (iii) every solution was incorrectly pruned. Case (i) cannot hold because Proposition 1 shows that the (bounded) hypothesis space is finite, so there are finitely many hypotheses to generate and test. Case (ii) cannot hold because a solution is by definition a hypothesis that passes the test stage. Case (iii) cannot hold because Lemma 2 shows that Popper never prunes optimal solutions. These cases are exhaustive, so the assumption cannot hold, and thus Popper returns a solution if one exists.
We show that Popper returns an optimal solution if one exists:

Theorem 1 (Optimality) Popper returns an optimal solution if one exists.
Proof By Proposition 8, Popper returns a solution if one exists. Let H be the solution returned by Popper. Assume, for contradiction, that H is not an optimal solution. By Definition 14, this assumption implies that either (1) H is not a solution, or (2) H is a non-optimal solution. Case (1) cannot hold because H is a solution. Therefore, case (2) must hold, i.e. there must be at least one solution smaller than H. Let H′ be an optimal solution, for which we know size(H′) < size(H). By Proposition 6, Popper generates every hypothesis, and Popper generates hypotheses of increasing size (Algorithm 1), so the smaller solution H′ must have been considered before H, which implies that H′ must have been pruned by a constraint. However, Lemma 2 shows that Popper never prunes optimal solutions, so H′ could not have been pruned. This contradicts the assumption, which completes the proof.

Experiments
We now evaluate our learning from failures idea. A key idea of our approach is to learn constraints from failed hypotheses to prune the hypothesis space and thus improve learning performance. We therefore claim that, compared to unconstrained learning, constraints can improve learning performance. One may think that this improvement is obvious, i.e. that constraints will definitely improve performance. However, it is unclear in practice whether, and if so by how much, constraints improve learning performance, because Popper needs to (1) analyse failed hypotheses, (2) generate constraints from them, and (3) pass the constraints to the ASP system, which then needs to ground and solve them, all of which may have non-trivial computational overheads. Our experiments therefore aim to answer the question:

Q1 Can constraints improve learning performance compared to unconstrained learning?
To answer this question, we compare Popper with and without the constrain stage. In other words, we compare Popper against a brute-force generate and test approach. To do so, we use a version of Popper with only banish constraints enabled to prevent repeated generation of a failed hypothesis. We call this system Enumerate.
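For intuition, a banish constraint can be written as a single ASP constraint that rules out one exact program and nothing else. The following is a hedged sketch for a failed hypothesis f(A):- divisible2(A), reusing the head_literal/body_literal atoms of the base encoding; the body_size/2 and num_clauses/1 atoms are illustrative assumptions:

% banish exactly the single-clause program f(V0):- divisible2(V0)
:- head_literal(0,f,1,(0,)),
   body_literal(0,divisible2,1,(0,)),
   body_size(0,1),      % assumed atom: clause 0 has exactly 1 body literal
   num_clauses(1).      % assumed atom: the program has exactly 1 clause

Because a banish constraint prunes only the hypothesis it was learned from, Enumerate degenerates into brute-force enumeration of the hypothesis space.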
As mentioned in Section 2, a major limitation of existing pure ASP-based ILP approaches is that they struggle to handle large domains and cannot support infinite domains (Corapi et al., 2011; Athakravi et al., 2013; Law et al., 2014; Kaminski et al., 2018; Evans et al., 2019). To address this limitation, our approach decomposes the ILP problem into separate hypothesis generation and testing stages. In our implementation, Popper uses ASP to generate programs and then uses Prolog to test programs against the examples. We therefore claim that Popper can outperform pure ASP-based ILP systems on large domains (we do not consider infinite domains because pure ASP-based ILP systems need a finite grounding). In addition, because we learn constraints to avoid repeated search, we claim that Popper can outperform existing pure Prolog-based ILP systems. Our experiments therefore aim to answer the question:

Q2 Can Popper outperform state-of-the-art ILP systems?
Proposition 1 shows that the size of the learning from failures hypothesis space is a function of many parameters, including the number of predicate declarations, the number of unique variables in a clause, and the number of clauses in a hypothesis. To explore this result, our experiments aim to answer the question:

Q3 How well does Popper scale?
To answer this question, we evaluate Popper on several problems where we vary (1) the size of the target program, (2) the number of predicate declarations, (3) the number of constants in the problem, (4) the number of unique variables in a clause, (5) the maximum number of literals in a clause, and (6) the maximum number of clauses allowed in a hypothesis.

Primorials
The purpose of this first experiment is to evaluate how well Popper scales with respect to the optimal solution size (i.e. the total number of literals in the optimal solution). We therefore need a problem where we can control the optimal solution size. We consider a number theory problem. Let p_k denote the kth prime number. Then the primorial p_n# is defined as the product of the first n primes:

p_n# = p_1 × p_2 × · · · × p_n

For instance, p_5# is the product of the first 5 primes:

p_5# = 2 × 3 × 5 × 7 × 11 = 2310

The goal of this experiment is to classify primorial numbers. We vary the solution size by varying the primorial number p_n#. The primorial p_n# requires n body literals. For instance, for p_2#, the solution is:

f(A):- divisible2(A),divisible3(A).

For p_5#, the solution is:

f(A):- divisible2(A),divisible3(A),divisible5(A),divisible7(A),divisible11(A).

Materials
To evaluate how well Popper scales given more predicate declarations, we compare two sets of BK (small and big). In the first set (small), we provide as BK a monadic predicate divisible_i for each prime number i in {1, 2, . . . , 100}, which holds when a number is evenly divisible by i. In the second set (big), we augment the small dataset with dummy monadic predicates which always evaluate to false, one predicate dummy_i for each non-prime number i in {1, 2, . . . , 100}. Note that this problem representation is not necessarily the most compact. We purposely designed it so that the only variable in the experiment (besides the ILP system) is the optimal solution size, which we progressively increase to evaluate how well the systems scale.
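As a concrete illustration, the small BK can be written in Prolog along the following lines; this is a minimal sketch under the representation described above, not the exact experimental files:

% divisibility relations for the primes (small BK)
divisible2(A):- 0 is A mod 2.
divisible3(A):- 0 is A mod 3.
divisible5(A):- 0 is A mod 5.

% dummy relations that always fail (added in the big BK)
dummy4(_):- fail.
dummy6(_):- fail.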
We compare Popper, Enumerate, Metagol, ILASP, and FastLAS. To compare the systems, we try to use settings so that each system considers approximately the same hypothesis space.
Popper and Enumerate settings We set Popper and Enumerate to use at most 1 unique variable, at most 10 body literals, and at most 1 clause.
Metagol settings Metagol needs metarules (Section 2.5) to guide the proof search. We provide Metagol with the following two metarules:

P(A):-Q(A).
P(A):-Q(A),R(A).
These metarules match the Popper settings in that only one variable is used.
ILASP2 and ILASP3 settings We run both ILASP2 and ILASP3 with the same settings, so we simply refer to both as ILASP. We run ILASP with the '-no-constraints' and '-no-aggregates' flags. We additionally ran ILASP3 with the 'disable implication' and 'disable propagation' flags. We tell ILASP that each BK relation is positive, which prevents it from generating body literals using negation. We set ILASP to use at most 1 unique variable and at most 2 body literals ('-ml=2' and '-max-rule-length=3'). When we tried to use at most 3 and 4 body literals, it took ILASP 42 seconds (3 body literals) and 41 minutes (4 body literals) to generate the hypothesis space, i.e. to generate the SAT problem. This bound implies that the largest primorial number learnable by ILASP is p_2#. ILASP does not support infinite domains so requires a bound on the number of integers. We found that it took Clingo 2 seconds, 48 seconds, and 8 minutes to ground the BK for the bounds for p_7#, p_8#, and p_9# respectively. We therefore set the maximum integer bound to p_7#+1. This bound implies that the largest primorial number learnable by ILASP is p_7# (ignoring the maximum literal bound).

FastLAS settings
We set FastLAS to run identically to ILASP, except we do not enforce a maximum body literal size because FastLAS does not need such a bound. Note that when we set the maximum integer bound to p_8#+1, FastLAS could not find any solutions in the allocated time.

Methods
For each n in {1, 2, . . . , 10}, we generate the single positive example corresponding to p_n#. We uniformly sample 20 negative examples from the set {2, . . . , p_n}. We measure learning time as the time to learn a solution. We enforce a timeout of 2 minutes per task. We repeat each experiment 10 times and plot the standard error.

Results

Figure 9 shows the results. Popper clearly outperforms Enumerate (the unconstrained approach) on both datasets. On the small dataset, Enumerate can only learn a program for the second primorial number, i.e. a program with two body literals. On the big dataset, Enumerate can only learn a program for the first primorial number, i.e. a program with one body literal. By contrast, on both datasets, Popper can learn a program for the 10th primorial number, i.e. a program with 10 body literals. This result strongly suggests that the answer to Q1 is yes: constraints can drastically improve learning performance.

Fig. 9: Primorials experimental results when varying the primorial number, which corresponds to the size of the optimal solution (panels: small BK and big BK). Note that FastLAS cannot solve any problems for p_8#, p_9#, and p_10# because of a maximum integer bound.

Why does Popper perform much better than Enumerate? Enumerate tests every hypothesis, i.e. every combination of literals. By contrast, Popper learns constraints from failed hypotheses to prune the hypothesis space, i.e. to remove certain combinations of literals. For instance, consider learning a program for p_3# = 2 × 3 × 5 = 30, and consider a tiny subset h_1, . . . , h_8 of the hypothesis space for this problem (the full hypothesis space for the big BK problem contains approximately 10^13 hypotheses). When Popper tests h_1, it fails because it is too specific, i.e. divisible53(30) fails. Popper therefore generates a constraint to remove the specialisations of h_1 (h_1–h_8) from the hypothesis space. From testing this single hypothesis, Popper drastically reduces the size of the hypothesis space.
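To make the pruning concrete, suppose h_1 is the hypothesis f(A):- divisible53(A) (the concrete forms of h_1, . . . , h_8 come from an omitted figure, so this instance is a hedged guess). Because this experiment allows at most one clause and one unique variable, every specialisation of h_1 is a single clause whose body contains divisible53(A), so one ASP constraint suffices:

% prune all specialisations of f(V0):- divisible53(V0)
% (sound in this form only because max_clauses = 1 and max_vars = 1)
:- body_literal(0,divisible53,1,(0,)).

In the general setting (multiple clauses and variables), Popper's specialisation constraints are more involved, since they must account for variable renamings and for clauses being spread across a program.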
Popper outperforms Metagol. The highest primorial number for which Metagol can learn a solution is p_6#, which takes 35 seconds to learn. By contrast, it takes Popper 2 seconds to learn the solution for p_6#. We think the performance difference is because of Metagol's inefficient search. Metagol performs iterative deepening over the number of clauses allowed in a solution. However, if a clause or literal fails during the search, Metagol does not remember this failure, and will retry already failed clauses and literals at each depth (and even multiple times at the same depth). By contrast, if a clause fails, Popper learns constraints from the failure so it never tries that clause (or its specialisations) again.
Popper outperforms ILASP2, ILASP3, and FastLAS. ILASP2 and ILASP3 cannot solve any problem, even for p_1#, because they both pre-compute the hypothesis space. FastLAS performs much better than both. For p_7#, it takes FastLAS 8 seconds to learn a solution. By contrast, it takes Popper 2 seconds. FastLAS cannot learn solutions for p_8#, p_9#, or p_10# because of the maximum integer bound. Note that when given a larger bound, FastLAS could not learn a solution for any primorial number.
Overall, the results from this experiment suggest that the answers to questions Q1 and Q2 are both yes, and that the answer to Q3 is that Popper scales better than state-of-the-art ILP systems with respect to the optimal solution size.

Robots
The purpose of this second experiment is to evaluate how well Popper scales with respect to the domain size (i.e. the constant signature). We therefore need a problem where we can control the domain size. We consider a robot strategy learning problem. There is a robot in an n × n grid world. Given an arbitrary start position, the goal is to learn a general strategy to move the robot to the topmost row in the grid. For instance, for a 10 × 10 world and the start position (2, 2), the goal is to move to position (2, 10). The domain contains all possible robot positions. We therefore vary the domain size by varying n, the size of the world. The optimal solution is a recursive strategy that keeps moving the robot upwards until it cannot move upwards any more. To reiterate, we purposely fix the optimal solution so that the only variable in the experiment is the domain size (i.e. the grid world size), which we progressively increase to evaluate how well the systems scale.

Materials
An example is an atom of the form f(s_1, s_2), where s_1 and s_2 represent start and end states. A state is a pair of discrete coordinates (x, y) denoting the column (x) and row (y) position of the robot. We provide four dyadic relations as BK: move_right, move_left, move_up, and move_down, which change the state, e.g. move_right((2,2),(3,2)). Again, note that this problem representation is not necessarily the most compact and may not be the best representation for certain systems.
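For concreteness, the BK and the kind of strategy the systems should learn can be sketched in Prolog as follows. The grid bound and the exact clause bodies are assumptions for illustration; the solution Popper actually learns may differ syntactically:

% background knowledge for an n x n grid (here n = 10)
grid_size(10).
move_up((X,Y),(X,Y1)):- grid_size(N), Y < N, Y1 is Y+1.
move_down((X,Y),(X,Y1)):- Y > 1, Y1 is Y-1.
move_right((X,Y),(X1,Y)):- grid_size(N), X < N, X1 is X+1.
move_left((X,Y),(X1,Y)):- X > 1, X1 is X-1.

% one plausible optimal strategy: move up once, or move up and recurse
f(A,B):- move_up(A,B).
f(A,B):- move_up(A,C),f(C,B).

This two-clause recursive program fits the settings below (at most 3 unique variables, 2 body literals, and 2 clauses): it covers the positive examples (the topmost position in the start column) while rejecting end states in the wrong column.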
We compare Popper, Enumerate, Metagol, ILASP2, and ILASP3. We do not use FastLAS because it does not support recursion. To fairly compare the systems, we again try to use settings so that each system considers approximately the same hypothesis space.

Popper and Enumerate settings
We allow Popper and Enumerate to use at most 3 unique variables, at most 2 body literals, and at most 2 clauses. Because Popper and Enumerate can generate non-terminating Prolog programs, we set both systems to use a testing timeout of 0.1 seconds per example.
Metagol settings We provide Metagol with the metarules in Figure 10. These metarules constitute an almost complete set of metarules for a singleton-free fragment of monadic and dyadic Datalog (Cropper and Tourret, 2019).

ILASP2 and ILASP3 settings
We again run both ILASP2 and ILASP3 with the same settings, so we simply refer to both as ILASP. We run ILASP with the '-no-constraints' and '-no-aggregates' flags. We tell ILASP that each predicate is positive, which prevents ILASP from generating body literals using negation. We set ILASP to use at most 3 unique variables and at most 2 body literals ('-ml=2' and '-max-rule-length=3'). As in the primorials experiment, when we increased these parameters, ILASP struggled to find any solutions in the given time.

Methods
We run the experiment with an n × n grid world for each n in {4, 6, 8, . . . , 28, 30}. To generate examples, for start states, we uniformly sample positions that are not at the top of the world. For the positive examples, the end state is the topmost position, e.g. (x, n) where n is the grid size. For negative examples, the end state is in the topmost row but has the wrong horizontal coordinate, e.g. (4, n) when starting at (2, 3). We sample with replacement 5 positive and 5 negative training examples, and 1000 positive and 1000 negative testing examples. The default predictive accuracy is therefore 50%. We measure predictive accuracies and learning times. We enforce a timeout of 2 minutes per task. We repeat each experiment 10 times and plot the standard error.
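As an illustration, for a 30 × 30 grid a training pair could look as follows (a hedged example following the representation above):

% positive: from (2,3) the robot should reach the top of its own column
pos(f((2,3),(2,30))).
% negative: the top row is reached, but in the wrong column
neg(f((2,3),(4,30))).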

Results

Figure 11 shows the results. Enumerate achieves the best predictive accuracy of all the systems. For small hypothesis spaces, this result is unsurprising because Enumerate tests every hypothesis. However, the predictive accuracy difference between Enumerate and Popper is negligible, and Popper is 5 times quicker than Enumerate.
The learning times of Popper and Enumerate remain almost constant as the grid size grows. The reason is that the domain size has no influence on the size of the learning from failures hypothesis space (Proposition 1). The only influence the grid size has on the learning time of Popper and Enumerate is any overhead in executing the induced Prolog program on larger grids. This result suggests that Popper can scale well with respect to the domain size.
Metagol slightly outperforms Popper in terms of learning times for grid worlds smaller than 14 × 14, but has worse predictive accuracy. However, as the grid size grows, Metagol's performance quickly degrades. Metagol's predictive accuracy drops because of learning timeouts, i.e. if Metagol fails to learn a solution then it only achieves the default predictive accuracy (50%). For a grid size of 30, Metagol almost always times out before finding a solution. The reason is that Metagol searches for a hypothesis by inducing and executing partial programs over the examples. In other words, Metagol uses the examples to guide the hypothesis search. As the grid size grows, there are more partial programs to construct, so its performance suffers.

Popper outperforms ILASP2 both in terms of predictive accuracies and learning times. ILASP2 struggles because it grounds the rules in the hypothesis space with respect to the examples and BK, which is infeasible for non-trivial grid sizes, and is why its performance suffers as the domain size grows. ILASP2 outperforms ILASP3 because once ILASP2 finds a solution it terminates. By contrast, ILASP3 finds one hypothesis schema that guarantees coverage of the example (which, in this special case, also implies finding a solution), then carries on to find alternative hypothesis schemas. The extra work done by ILASP3 is needed when learning general ASP programs, but in this special case (where there is only a single ILASP positive example and no negative examples) it is unnecessary and computationally expensive. We refer the reader to Law's thesis (Law, 2018) for a detailed comparison of ILASP2 and ILASP3.
To show the versatility of Popper, we modified Popper to test programs using ASP rather than Prolog. In other words, instead of learning Prolog programs, we set Popper to learn Datalog programs. Figure 11 shows the results as Popper_ASP. As expected, there is no difference in terms of predictive accuracies, but Popper_ASP can learn programs quicker than Popper because, in this problem, testing hypotheses using ASP is quicker than with Prolog.
The results from this experiment suggest that the answers to questions Q1 and Q2 are yes. The results also suggest that the answer to Q3 is that Popper scales well, and better than state-of-the-art ILP systems, with respect to the domain size.

List transformation problem
The purpose of this third experiment is to evaluate how well Popper performs on difficult (mostly recursive) list transformation problems. Learning recursive programs has long been considered a difficult problem in ILP (Muggleton et al., 2012) and most ILP and program synthesis systems cannot learn recursive programs. Metagol, ILASP2, and ILASP3 can learn recursive programs. However, as the previous experiment showed, ILASP2 and ILASP3 struggle on large domains. We therefore compare Popper against Enumerate and Metagol.

Materials
We evaluate the systems on the ten list transformation tasks shown in Table 4. These tasks include a mix of monadic (e.g. evens and sorted), dyadic (e.g. droplast and finddup), and triadic (dropk) target predicates. The tasks also contain a mix of functional (e.g. last and len) and relational (e.g. finddup and member) problems. These tasks are extremely difficult for ILP systems. To learn solutions that generalise, an ILP system needs to support recursion and large domains. As far as we are aware, no existing ILP system can learn optimal solutions for all of these tasks without being provided with a very strong inductive bias. (As mentioned in Section 2.3, some inverse entailment methods (Muggleton, 1995) might sometimes learn solutions for them. However, these approaches would need a 'base case' example to learn the base case of a recursive program, and then an example to learn the inductive case, preferably in that order. Moreover, these approaches would not be guaranteed to learn the optimal solution. Metagol could possibly learn solutions for them if given the exact metarules needed, but that requires knowing the solution before trying to learn it.) We give each system the predicate declarations shown in Figure 12. Note that we use increment/2 only in the len experiment. We had to remove this relation from the BK for the other experiments because, when given this relation, Metagol runs into infinite recursion on almost every problem and could not find any solutions. (Because Metagol induces hypotheses by partially constructing and evaluating them, it is very difficult to impose a timeout on a particular hypothesis, which we can easily do with Popper.)
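Figure 12 is not reproduced here; its caption states that we also provide head_pred(P,A) and body_pred(P,A) declarations, where P and A are the target predicate symbol and arity respectively. The declarations therefore plausibly take the following form for the droplast task (a hedged sketch: the exact relation set is an assumption, apart from increment/2, which is named above):

% declarations for the droplast task (relation set assumed)
head_pred(droplast,2).
body_pred(head,2).
body_pred(tail,2).
body_pred(empty,1).
body_pred(cons,3).
% increment/2 is provided only in the len experiment
body_pred(increment,2).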

Popper and Enumerate settings
We set Popper and Enumerate to use at most 6 unique variables, at most 5 body literals, and at most 2 clauses. For each BK relation, we also provide both systems with simple types and argument directions (whether input or output). In Section 5.5, we evaluate how sensitive Popper is to these parameters. Because Popper and Enumerate can generate non-terminating Prolog programs, we set both systems to use a testing timeout of 0.1 seconds per example.
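These annotations might look as follows; this is a hedged sketch, since the concrete syntax Popper uses for type and direction declarations is an assumption here:

% simple types and argument directions for droplast (syntax assumed)
type(droplast,(list,list)).
direction(droplast,(in,out)).
type(head,(list,element)).
direction(head,(in,out)).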
Metagol settings For Metagol, we use almost the same metarules as in the previous robot experiment (Figure 10). However, when given the inverse metarule P(A,B) ← Q(B,A), Metagol could not learn any solution, again because of infinite recursion. Note that if we picked specific metarules for each task, then Metagol would perform better. To aid Metagol, we therefore replace the inverse metarule with the identity metarule, i.e. P(A,B) ← Q(A,B). In addition, when we first ran the experiment with randomly ordered examples, we found that Metagol struggled to find solutions for all the problems (except member). The reason is that Metagol is sensitive to the order of examples because it is example-driven. Therefore, to aid Metagol, we provide the examples in order of increasing size (i.e. the length of the input lists).

Methods
For each problem, we generate 10 positive and 10 negative training examples, and 1000 positive and 1000 negative testing examples. The default predictive accuracy is therefore 50%. Each list is randomly generated and has a maximum length of 50. We sample the list elements uniformly at random from the set {1, 2, . . . , 100} (this choice is arbitrary and Popper and Metagol can handle much larger values). We measure predictive accuracies and learning times. We enforce a timeout of 2 minutes per task. We repeat each experiment 10 times and plot the standard error.

Results

Table 5 shows the results. Popper equals or outperforms Enumerate on all the tasks in terms of predictive accuracies. Popper outperforms Enumerate on all but one of the tasks in terms of learning times. The exception is the last problem, where it is easier to simply enumerate all programs than to use constraints; however, this difference is negligible. This result again suggests that the answer to Q1 is yes. Popper equals or outperforms Metagol on all but one task in terms of predictive accuracy. The exception is the finddup problem, where there is only a 1% difference. Popper outperforms Metagol in terms of learning times in almost all cases. Note that Metagol could never learn a solution for dropk because its metarule constraints prevent it from using triadic literals. This result again suggests that the answer to Q2 is yes.
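For a flavour of the learned programs, a plausible optimal solution for the last task, under the declarations assumed above, is the following recursive Prolog program (a sketch, not necessarily the exact program Popper returns):

% the last element of a singleton list is its head
last(A,B):- tail(A,C),empty(C),head(A,B).
% otherwise, recurse on the tail
last(A,B):- tail(A,C),last(C,B).

This program fits the settings above: it uses at most 3 unique variables per clause, at most 3 body literals, and 2 clauses.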

Scalability
Our primorials experiment showed that Popper scales well in the optimal solution size compared to Enumerate, ILASP, FastLAS, and Metagol. Our robot experiment showed that Popper scales well in the domain size compared to ILASP and Metagol. The purpose of this experiment is to evaluate how well Popper scales in terms of (1) the number of examples and (2) the size of the examples. To do so, we repeat the last experiment from Section 5.3, where Popper and Metagol achieved similar performance.

Materials
We use the same materials as Section 5.3.

Settings
We run two experiments. In the first experiment we vary the number of examples. In the second experiment we vary the size of the examples (the size of the input list). For each experiment, we measure the predictive accuracy and learning times averaged over 10 repetitions.

Sensitivity
The learning from failures hypothesis space (Proposition 1) is a function of the number of predicate declarations and three other parameters: (1) the maximum number of unique variables in a clause, (2) the maximum number of body literals allowed in a clause, and (3) the maximum number of clauses allowed in a hypothesis. The purpose of this experiment is to evaluate how sensitive Popper is to these parameters. To do so, we repeat the len experiment from Section 5.3 with the same BK, settings, and method, except we run three separate experiments where we vary the three aforementioned parameters.

Results

Figure 15 shows the experimental results. The results show that Popper is sensitive to the maximum number of unique variables, which has a strong influence on learning times. This result follows from Proposition 1 because more variables implies more ways to form literals in a clause. Somewhat surprisingly, however, doubling the number of variables from 4 to 8 makes little difference to performance, which suggests that Popper is robust to imperfect parameters.
The results show that Popper is mostly insensitive to the maximum number of body literals in a clause. The main reason is that Popper does not pre-compute every possible clause in the hypothesis space, which is, for instance, the case with ILASP and many program synthesis systems, especially SAT approaches.
The results show that Popper is mostly insensitive to the maximum number of clauses. The main reason is the way Popper searches for programs of increasing size. For instance, due to the constraints on the hypothesis space (Section 4.1), it is impossible to generate a program of size 4 (i.e. with four literals) that has three clauses, since each clause must have at least a head literal and one body literal, so three clauses require at least six literals.
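In the encoding, this effect can be captured by a single constraint of the following form; this is a hedged sketch, where program_size/1 and num_clauses/1 are assumed atom names:

% each clause needs >= 2 literals, so N clauses need >= 2*N literals
:- program_size(S), num_clauses(N), S < 2*N.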
Overall, these results suggest that Popper scales well with respect to the maximum number of clauses and the maximum number of body literals, but struggles with large values for the maximum number of unique variables.

Conclusions and limitations
We have introduced an ILP approach called learning programs by learning from failures. Our approach decomposes the ILP problem into three separate stages: generate, test, and constrain. In the generate stage, the learner generates a hypothesis that satisfies a set of hypothesis constraints (Definition 6). In the test stage, the learner tests a hypothesis against training examples. If a hypothesis fails, then, in the constrain stage, the learner learns hypothesis constraints from the failed hypothesis to prune the hypothesis space, i.e. to constrain subsequent hypothesis generation. In Section 3.5, we introduced three types of constraints (generalisation, specialisation, and elimination) and proved that they are sound in the sense that they do not prune optimal solutions (Definition 14). This loop repeats until (1) the learner finds an optimal solution, or (2) there are no more hypotheses to test. We implemented our idea in Popper, an ILP system that learns definite programs. Popper combines ASP and Prolog to support types, learning optimal solutions, learning recursive programs, reasoning about lists and infinite domains, and hypothesis constraints. To improve efficiency, Popper uses multi-shot solving to combine the three stages. We showed that Popper is sound and complete with respect to optimal solutions (Theorem 1).
We evaluated our approach on three diverse domains (number theory problems, robot strategies, and list transformations). Our experimental results show that (1) constraints drastically reduce the hypothesis space, (2) Popper can substantially outperform the state-of-the-art ILP systems Metagol, ILASP2, ILASP3, and FastLAS, both in terms of predictive accuracies and learning times, (3) Popper scales well with respect to the domain size, the number of training examples, and the size of the training examples, and (4) Popper is reasonably robust to its parameters.

Limitations and future work
Popper, as implemented in this paper, has several limitations that future work should address.

Predicate invention
Predicate invention has been shown to help reduce the size of target programs, which in turn reduces sample complexity and improves predictive accuracy (Cropper, 2019b; Dumancic et al., 2019). Popper does not currently support predicate invention, but we plan to support it in future work. There are two straightforward ways to do so. Popper could mimic Metagol by imposing metarules to restrict the form of clauses in a hypothesis and to guide the invention of new predicate symbols. Alternatively, Popper could mimic ILASP by supporting prescriptive predicate invention, where the arity and (in ILASP's case) the argument types of invented predicates are pre-specified by the language bias. Most of the results in this paper should extend to both approaches.

Noise
Most ILP systems handle noisy (misclassified) examples (Table 1). Popper does not currently support noisy examples. Our initial results suggest that we can address this issue by relaxing when to apply learned hypothesis constraints and by maintaining the best hypothesis tested during learning, i.e. the hypothesis that entails the most positive and the fewest negative examples. However, our early results also suggest that noise handling increases learning times, which future work should explore.

Hypotheses
In most of our experiments Popper learns definite programs and tests them using Prolog. However, in Section 5.2, Popper learns Datalog programs and tests them using ASP. In future work, we want to consider learning other types of programs. For instance, most of our pruning techniques (except the elimination constraint) should extend to learning non-monotonic programs, such as Datalog with stratified negation.

Better search
Popper is only one implementation of our learning from failures idea. An advantage of our three-stage approach is that it allows for a variety of algorithms and implementations. Moreover, each stage can be improved independently of the others. For instance, any improvement to the Popper ASP encoding that generates programs would have a major influence on learning times because it would reduce the number of programs to test. Likewise, we can also optimise the testing stage. For instance, in Section 5.2, we used ASP, rather than Prolog, to test hypotheses, which, in some cases, reduced learning times by 50%. Moreover, by decomposing the ILP problem into three stages, our approach might mitigate the combinatorial and grounding problems faced by systems that solve the ILP problem as a single (and often very large) SAT problem (Corapi et al., 2011; Law et al., 2014; Kaminski et al., 2018; Evans and Grefenstette, 2018; Evans et al., 2019).

Better constraints
Hypothesis constraints are central to our idea. Popper uses predefined constraints to prune redundant programs from the hypothesis space (Section 4.1), such as recursive programs without a base case and subsumption redundant programs. A key idea of our approach is to learn constraints from failures. We think the most promising direction for future work is to improve both types of constraints (predefined and learned).
Types. Like many ILP systems (Muggleton, 1995;Blockeel and Raedt, 1998;Srinivasan, 2001;Law et al., 2014;Evans and Grefenstette, 2018), Popper supports simple types to prune the hypothesis space. However, more complex types, such as polymorphic types (parameterised types), can achieve better pruning for programs over structured data (Morel et al., 2019). For instance, polymorphic types would allow us to distinguish between using a predicate on a list of integers and on a list of characters. Refinement types (Polikarpova et al., 2016), i.e. types annotated with restricting predicates, could allow a user to specify stronger program properties (other than examples), such as requiring that a reverse program provably has the property that the lengths of the input and output are the same. In future work we want to explore whether we can express such complex types as hypothesis constraints.
Learned constraints. The constraints described in Section 3.5 prune specialisations and generalisations of a failed hypothesis. However, we have only briefly analysed the properties of these constraints. We showed that these constraints are sound (Propositions 3 and 4), in that they do not prune optimal solutions. We have not, however, considered their completeness, in the sense of pruning all non-optimal solutions. Indeed, our elimination constraint, for the special case of non-recursive definite programs, prunes hypotheses that the generalisation and specialisation constraints miss. In other words, the theory regarding which constraints to use is yet to be developed, and there may be many more constraints to be learned from failed hypotheses, all of which could drastically improve learning performance. By contrast, refinement operators for clauses (Shapiro, 1983; Raedt and Bruynooghe, 1993; Nienhuys-Cheng and Wolf, 1997) and theories (Nienhuys-Cheng and Wolf, 1997; Midelfart, 1999; Badea, 2001) have been studied in detail in ILP. We therefore think that this paper opens a new direction of research into identifying and analysing different constraints that we can learn from failed hypotheses.

A.1 Metarules
Let M be an arbitrary metarule, i.e. a second-order Horn clause which quantifies over predicate symbols. For example, P(A,B):-Q(A,C),R(C,B) is known as the chain metarule. All letters are quantified variables, where P, Q, and R are second-order, i.e. must be substituted by predicate symbols.

A.2 From a metarule to literals
Let M = head:-body_1, . . . , body_m be a metarule. We use the clause encoding function encodeSizedClause from Section 4.3.2 to derive an encoding of a metarule. We introduce two kinds of rule to ensure that every clause of a generated program is an instance of at least one metarule. The first kind identifies when there exists some metarule for which the clause is an instance. The second kind is a constraint expressing that every clause of a program must be identified as an instance of at least one metarule. For each M ∈ Ms, we generate the following rule of the first kind: meta_clause(Clause):-encodeSizedClause(Clause, M).
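As an illustration, for the chain metarule P(A,B):-Q(A,C),R(C,B), the two kinds of rule might look as follows. This is a hedged sketch: we inline what encodeSizedClause would produce using the head_literal/body_literal atoms, and the clause/1 and body_size/2 atoms are assumed names:

% a clause C is an instance of the chain metarule
meta_clause(C):-
    head_literal(C,_,2,(V0,V1)),
    body_literal(C,_,2,(V0,V2)),
    body_literal(C,_,2,(V2,V1)),
    body_size(C,2).

% every clause of the program must be an instance of some metarule
:- clause(C), not meta_clause(C).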