Learning higher-order logic programs

A key feature of inductive logic programming (ILP) is its ability to learn first-order programs, which are intrinsically more expressive than propositional programs. In this paper, we introduce techniques to learn higher-order programs. Specifically, we extend meta-interpretive learning (MIL) to support learning higher-order programs by allowing for \emph{higher-order definitions} to be used as background knowledge. Our theoretical results show that learning higher-order programs, rather than first-order programs, can reduce the textual complexity required to express programs which in turn reduces the size of the hypothesis space and sample complexity. We implement our idea in two new MIL systems: the Prolog system \namea{} and the ASP system \nameb{}. Both systems support learning higher-order programs and higher-order predicate invention, such as inventing functions for \tw{map/3} and conditions for \tw{filter/3}. We conduct experiments on four domains (robot strategies, chess playing, list transformations, and string decryption) that compare learning first-order and higher-order programs. Our experimental results support our theoretical claims and show that, compared to learning first-order programs, learning higher-order programs can significantly improve predictive accuracies and reduce learning times.

Caesar cipher with a shift of +1. Given these examples, most inductive logic programming (ILP) approaches, such as meta-interpretive learning (MIL) [34,35], would learn a recursive first-order program, such as the one shown in Figure 2a. Although correct, this first-order program is overly complex in that most of the program is concerned with manipulating the input and output, such as getting the head and tail elements. In this paper, we introduce techniques to learn higher-order programs that abstract away this boilerplate code. Specifically, we extend MIL to support learning higher-order programs that use higher-order constructs such as map/3, until/4, and ifthenelse/5. Using this new approach, we can learn an equivalent yet smaller decryption program, such as the one shown in Figure 2b, which uses map/3 to abstract away the recursion and list manipulation.
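To make the abstraction concrete, the following Python sketch mirrors the structure of the higher-order program in Figure 2b. This is an illustrative analogue, not the learned Prolog program; the function names mirror the predicates prec/2, decrypt1/2, and decrypt/2:

```python
# Python analogue (illustrative only) of the higher-order decryption
# program in Figure 2b: decrypt maps the invented decrypt1 over the input.
def prec(c):
    """prec/2: the inverse of successor/2, i.e. shift a character back by one."""
    return chr(ord(c) - 1)

def decrypt1(c):
    """Invented predicate decrypt1/2: here simply the predecessor character."""
    return prec(c)

def decrypt(s):
    """decrypt/2: map decrypt1 over the encrypted string."""
    return ''.join(map(decrypt1, s))

print(decrypt('joevdujwf'))  # -> inductive
```

The higher-order `map` plays the same role as map/3 in Figure 2b: it hides the recursion and the head/tail manipulation that dominate the first-order program in Figure 2a.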

[Figure: example encrypted/decrypted string pairs: joevdujwf ↦ inductive, mphjd ↦ logic, qsphsbnnjoh ↦ programming.]
[Caption (Figure 2):] The higher-order program, where decrypt1/2 is an invented predicate symbol. The predicate prec/2 represents preceding/2, i.e. the inverse of successor/2. The programs are success set equivalent when restricted to the target predicate decrypt/2, but the higher-order program is much smaller and requires half the number of literals (6 vs 12).
We claim that, compared to learning first-order programs, learning higher-order programs can improve learning performance. We support our claim by showing that learning higher-order programs can reduce the textual complexity required to express programs which in turn reduces the size of the hypothesis space and sample complexity.
We implement our idea in Metagol_ho, which extends Metagol [9], a MIL implementation based on a Prolog meta-interpreter. Metagol_ho extends Metagol to support interpreted BK (IBK). In this approach, meta-interpretation drives both the search for a hypothesis and predicate invention, allowing for higher-order arguments to be invented, such as the predicate decrypt1/2 in Figure 2b. The key novelty of Metagol_ho is the combination of abstraction (learning higher-order programs) and invention (predicate invention), i.e. inventions inside of abstractions. Metagol_ho supports the invention of conditions and functions to an arbitrary depth, which goes beyond anything in the literature. We also introduce HEXMIL_ho, which likewise extends HEXMIL [23], an answer set programming (ASP) MIL implementation, to support learning higher-order programs. As far as we are aware, HEXMIL_ho is the first ASP-based ILP system that has been demonstrated capable of learning higher-order programs.
We further support our claim that learning higher-order programs can improve learning performance by conducting experiments in four domains: robot strategies, chess playing, list transformations, and string decryption. The experiments compare the predictive accuracies and learning times when learning first-order and higher-order programs. In all cases, learning higher-order programs leads to substantial increases in predictive accuracies and lower learning times, in agreement with our theoretical results.
Our main contributions are:
- We extend the MIL framework to support learning higher-order programs by extending it to support higher-order definitions (Section 3.2).
- We show that the new higher-order approach can reduce the textual complexity of programs which in turn reduces the size of the hypothesis space and also sample complexity (Section 3.3).
- We introduce Metagol_ho and HEXMIL_ho which extend Metagol and HEXMIL respectively. Both systems support learning higher-order programs with higher-order predicate invention (Section 4).
- We show that the ASP-based HEXMIL and HEXMIL_ho have an additional factor determining the size of their search space, namely the number of constants (Section 4.5).
- We conduct experiments in four domains which show that, compared to learning first-order programs, learning higher-order programs can substantially improve predictive accuracies and reduce learning times (Section 5).

Program induction
Program synthesis is the automatic generation of a computer program from a specification. Deductive approaches [27] deduce a program from a full specification which precisely states the requirements and behaviour of the desired program. By contrast, program induction approaches induce (learn) a program from an incomplete specification, usually input/output examples. Many program induction approaches learn specific classes of programs, such as string transformations [20]. By contrast, MIL is general-purpose and has been shown capable of grammar induction [34], learning robot strategies [7], and learning efficient algorithms [10]. In addition, MIL supports predicate invention, which has been repeatedly stated as an important challenge in ILP [32,42,33]. The idea behind predicate invention is for an ILP system to introduce new predicate symbols to improve learning performance. In program induction, predicate invention can be seen as inventing auxiliary functions/predicates, as one does when manually writing a program, for example to reduce code duplication or to improve the readability of a program.

Inductive functional programming
Functional program induction approaches often support learning higher-order programs.
MagicHaskeller [24] is a general-purpose system which learns Haskell functions by selecting and instantiating higher-order functions from a pre-defined vocabulary. Igor2 [25] also learns recursive Haskell programs and supports auxiliary function invention but is restricted in that it requires the first k examples of a target theory to generalise over a whole class. The L2 system [16] synthesises recursive functional algorithms. The MYTH [36] and MYTH2 [18] systems use type systems to synthesise programs. Frankle et al. [18] show how example-based specifications can be turned into type specifications.
In this work we go beyond these approaches by (1) learning higher-order programs with invented predicates, (2) giving theoretical justifications and conditions for when learning higher-order programs can improve learning performance (Section 3.3), and (3) experimentally demonstrating that learning higher-order programs can improve learning performance.

Inductive logic programming
ILP systems, including the popular systems FOIL [37], Progol [31], ALEPH [41], and TILDE [1], usually learn first-order programs. Given appropriate mode declarations [31] for higher-order predicates such as map/3, Progol and Aleph could learn higher-order programs such as f(A,B):-map(A,B,f1). However, because Progol and Aleph do not support predicate invention they would be unable to invent the predicate f1/2 in the above example. Similarly, existing MIL implementations, such as Metagol, could learn a similar program to the one above when map/3 is provided as background knowledge. However, even though Metagol supports predicate invention, it is unable to invent the predicate f1/2 in the example above because Metagol deductively proves BK by delegating the proofs to Prolog. To overcome this limitation we introduce the notion of interpreted BK (IBK), where map/3 can be defined as IBK. The new MIL system Metagol ho proves IBK through meta-interpretation, which allows for predicate arguments such as f1/2 to be invented.

Meta-interpretive learning
MIL was originally based on a Prolog meta-interpreter, although the MIL problem has also been encoded as an ASP problem [23]. The key difference between a MIL learner and a standard Prolog meta-interpreter is that whereas a standard Prolog meta-interpreter attempts to prove a goal by repeatedly fetching first-order clauses whose heads unify with a given goal, a MIL learner additionally attempts to prove a goal by fetching higher-order existentially quantified formulas, called metarules, supplied as BK, whose heads unify with the goal. The resulting predicate substitutions are saved and can be reused later in the proof. Following the proof of a set of goals, a logic program is induced by projecting the predicate substitutions onto their corresponding metarules. A key feature of MIL is the support for predicate invention. MIL uses predicate invention for automatic problem decomposition. As we will demonstrate, the combination of predicate invention and abstraction leads to compact representations of complex programs. Cropper and Muggleton [8] introduced the idea of using MIL to learn higher-order programs by using IBK. This paper is an extended version of that paper. In addition, we go beyond that work in several ways. First, we generalise their preliminary theoretical results, principally in Section 3.3. We also provide more explanation as to why abstracted MIL can improve learning performance compared to unabstracted MIL (end of Section 3.3). Second, we introduce the HEXMIL_ho system, which, as mentioned, extends HEXMIL to support learning higher-order programs with higher-order predicate invention. Our motivation for this extension is to show the generality of our work, i.e. to demonstrate that it is not specific to Metagol and Prolog. We also study the computational complexity of both Metagol_ho and HEXMIL_ho. We show that the ASP approach is highly sensitive to the number of constant symbols, which leads to scalability issues.
Furthermore, we corroborate the experimental results of Cropper and Muggleton by repeating the robot waiter, chess, and list transformation experiments with Metagol_ho. We provide additional experimental evidence by repeating the experiments with HEXMIL_ho. Finally, we add further evidence by conducting a new experiment on the string decryption problem mentioned in the introduction.

Higher-order logic
McCarthy [28] advocated using higher-order logic to represent knowledge. Similarly, Muggleton et al. [33] argued that using higher-order representations in ILP provides more flexible ways of representing BK. Lloyd [26] used higher-order logic in the learning process but the approach focused on learning functional programs and did not support predicate invention. Early work in ILP [17,38,14] used higher-order formulae to specify the overall form of programs to be learned, similar to how MIL uses metarules. However, these works did not consider learning higher-order programs. By contrast, we use higher-order logic as a learning representation and to represent learned hypotheses. Feng and Muggleton [15] investigated inductive generalisation in higher-order logic using a restricted form of lambda calculus. However, their approach supports neither first-order nor higher-order predicate invention. By contrast, we introduce higher-order definitions which allow for invented predicate symbols to be used as arguments in literals.

Abstraction and invention
Predicate invention has been repeatedly stated as an important challenge in ILP [32,42,33]. Popular ILP systems, such as FOIL, Progol, and ALEPH, do not support predicate invention, nor do most program induction systems. Meta-level abduction [22] uses abduction and meta-level reasoning to invent predicates that represent propositions. By contrast, MIL uses abduction to invent predicates that represent relations, i.e. relations that are not in the initial BK nor in the examples. For instance, MIL was shown [35] able to invent a predicate corresponding to the parent/2 relation when learning a grandparent/2 relation. In this paper we extend MIL and the associated Metagol implementation to support higher-order predicate invention for use in higher-order constructs, such as map/3, reduce/3, and fold/4. This approach supports a form of abstraction which goes beyond typical first-order predicate invention [39] in that the use of higher-order definitions combined with meta-interpretation drives both the search for a hypothesis and predicate invention, leading to more accurate and compact programs.

Preliminaries
We assume familiarity with logic programming. However, we restate key terminology. Note that we focus on learning function-free logic programs, so we ignore terminology to do with function symbols. We denote the predicate and constant signatures as P and C respectively. A variable is first-order if it can be bound to a constant symbol or another first-order variable. A variable is higher-order if it can be bound to a predicate symbol or another higher-order variable. We denote the sets of first-order and higher-order variables as V1 and V2 respectively. A term is a variable or a constant symbol. A term is ground if it contains no variables. An atom is a formula p(t1, …, tn), where p is a predicate symbol of arity n and each ti is a term. An atom is ground if all of its terms are ground. A higher-order term is a higher-order variable or a predicate symbol. An atom is higher-order if it has at least one higher-order term. A literal is an atom A (a positive literal) or its negation ¬A (a negative literal). A clause is a disjunction of literals. The variables in a clause are universally quantified. A Horn clause is a clause with at most one positive literal. A definite clause is a Horn clause with exactly one positive literal. A clause is higher-order if it contains at least one higher-order atom. A logic program is a set of Horn clauses. A logic program is higher-order if it contains at least one higher-order Horn clause.
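To illustrate the definitions above, the following Python sketch checks whether an atom is higher-order by testing whether any of its argument terms is a higher-order term. The encoding and the signature contents are illustrative assumptions, not part of the formalism:

```python
from dataclasses import dataclass

PREDICATES = {'map', 'succ', 'f'}   # predicate signature (assumed for the example)
HO_VARS = {'P', 'Q', 'R'}           # higher-order variables (assumed)

@dataclass(frozen=True)
class Atom:
    pred: str
    args: tuple

def is_higher_order_term(t):
    """A higher-order term is a predicate symbol or a higher-order variable."""
    return t in PREDICATES or t in HO_VARS

def is_higher_order_atom(a):
    """An atom is higher-order if at least one of its argument terms is higher-order."""
    return any(is_higher_order_term(t) for t in a.args)

# The body atom of f(A,B) <- map(A,B,succ) is higher-order because its
# final argument is the predicate symbol succ; succ(A,B) is not.
assert is_higher_order_atom(Atom('map', ('A', 'B', 'succ')))
assert not is_higher_order_atom(Atom('succ', ('A', 'B')))
```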

Abstracted meta-interpretive learning
We extend MIL to the higher-order setting. We first restate the notion of a metarule [6]: a metarule is a higher-order formula of the form ∃π∀μ P(s1, …, sm) ← Q1(t1,1, …, t1,n1), …, Qk(tk,1, …, tk,nk), where π and μ are disjoint sets containing the existentially and universally quantified variables respectively.
In contrast to a higher-order Horn clause, in which all the variables are universally quantified, the variables in a metarule can be quantified universally or existentially. When describing metarules, we omit the quantifiers. Instead, we denote existentially quantified higher-order variables as uppercase letters starting from P and universally quantified first-order variables as uppercase letters starting from A. Figure 3 shows example metarules.
To extend MIL to support learning higher-order programs we introduce higher-order definitions: a higher-order definition is a set of higher-order Horn clauses whose head atoms share the same predicate symbol. We frequently refer to abstractions. In computer science, code abstraction [4] involves hiding complex code to provide a simpler interface. In this work, we define an abstraction as a higher-order Horn clause that contains at least one atom which takes a predicate symbol as an argument. In the following abstraction example, the final argument of map/3 is ground to the predicate symbol succ/2:

f(A,B) ← map(A,B,succ)

Likewise, in the higher-order decryption program in the introduction (Figure 2b), the final argument of map/3 is ground to the predicate symbol decrypt1/2. We define the abstracted MIL input, which extends a standard MIL input [6] (and problem) to support higher-order definitions. An abstracted MIL input is a tuple (B, E+, E−, M) where:
- B = B_C ∪ B_I, where B_C is a set of Horn clauses and B_I is (the union of) a set of higher-order definitions
- E+ and E− are disjoint sets of ground atoms representing positive and negative examples respectively
- M is a set of metarules.
There is little declarative difference between B_C and B_I. There is, however, a procedural difference between the two. We therefore call B_C compiled BK and B_I interpreted BK (IBK). The procedural distinction between B_C and B_I is that whereas a clause from B_C is proved deductively (by calling Prolog), a clause from B_I is proved through meta-interpretation, which allows for predicate invention to be combined with abstractions to invent higher-order predicates. The distinction between B_I and M is that the clauses in B_I are all universally quantified, whereas the metarules in M contain existentially quantified variables whose substitutions form the induced program. We discuss these distinctions in more detail in Section 4 when we describe the MIL implementations.
We define the abstracted MIL problem:

Definition 4 (Abstracted MIL problem) Given an abstracted MIL input (B, E+, E−, M), the abstracted MIL problem is to return a logic program hypothesis H such that:
- ∀h ∈ H, ∃m ∈ M such that h = mθ, where θ is a substitution that grounds all the existentially quantified variables in m
- H ∪ B ⊨ E+
- H ∪ B ⊭ e, for every e ∈ E−

We call H a solution to the MIL problem.
The first condition ensures that a logic program hypothesis is an instance of the given metarules. It is this condition that enforces the strong inductive bias in MIL.
MIL supports inventions. For example, given a program containing the clause f(A,B) ← map(A,B,f1), a MIL learner has invented the predicate f1/2 for use in a map/3 definition. Likewise, in the higher-order decryption program in the introduction (Figure 2b), the final argument of map/3 is ground to the invented predicate symbol decrypt1/2.

Language classes, expressivity, and complexity
We claim that increasing the expressivity of MIL from learning first-order programs to learning higher-order programs can improve learning performance. We support this claim by showing that learning higher-order programs can reduce the size of the hypothesis space, which in turn reduces sample complexity and expected error. In MIL the size of the hypothesis space is a function of the number of metarules m and their form, the number of background predicate symbols p, and the maximum program size n (the maximum number of clauses allowed in a program). We restrict metarules by their body size and literal arity: a metarule is in the fragment H^i_j if it has at most j literals in the body and each literal has arity at most i.
For instance, the chain metarule in Figure 3 restricts clauses to be definite with two body literals of arity two, i.e. it is in the fragment H^2_2. By restricting the form of metarules we can calculate the size of a MIL hypothesis space. The following result is essentially the same as in [12]. The only difference is that we drop the redundant big O notation: Proposition 1 (MIL hypothesis space) Given p predicate symbols and m metarules in H^i_j, the number of programs expressible with n clauses is at most (mp^{j+1})^n.
Proof The number of clauses which can be constructed from an H^i_j metarule given p predicate symbols is at most p^{j+1} because for a given metarule there are at most j+1 predicate variables with at most p^{j+1} possible substitutions. Therefore the number of clauses that can be formed from m distinct metarules in H^i_j using p predicate symbols is at most mp^{j+1}. It follows that the number of programs which can be formed from a selection of n such clauses is at most (mp^{j+1})^n.
Proposition 1 shows that the MIL hypothesis space grows exponentially both in the size of the target program and in the number of body literals in a clause. For instance, for the H^2_2 fragment, the MIL hypothesis space contains at most (mp^3)^n programs, where m is the number of metarules and n is the number of clauses in the target program.
We update this bound for the abstracted MIL framework:

Proposition 2 (Number of abstracted H^i_j programs) Given p predicate symbols and m metarules in H^i_j with at most k additional existentially quantified higher-order variables, the number of abstracted H^i_j programs expressible with n clauses is at most (mp^{j+1+k})^n.
Proof As with Proposition 1, the number of clauses which can be constructed from an H^i_j metarule given p predicate symbols is at most p^{j+1} because for a given metarule there are at most j+1 predicate variables with at most p^{j+1} possible substitutions. Given a metarule in H^i_j with at most k additional existentially quantified higher-order variables there are therefore potentially j+1+k predicate variables with p^{j+1+k} possible substitutions. Therefore the number of clauses expressible with m such metarules is at most mp^{j+1+k}. By the same reasoning as for Proposition 1, it follows that the number of programs which can be formed from a selection of n such clauses is at most (mp^{j+1+k})^n.
We use this result to develop sample complexity [29] results for unabstracted MIL:

Proposition 3 (Sample complexity of unabstracted MIL) Given p predicate symbols, m metarules in H^i_j, and a maximum program size n_u, unabstracted MIL has sample complexity:

s_u ≥ (1/ε)(n_u ln(m) + (j+1) n_u ln(p) + ln(1/δ))

Proof According to the Blumer bound, which appears as a reformulation of Lemma 2.1 in [2], the error of consistent hypotheses is bounded by ε with probability at least 1 − δ given at least (1/ε)(ln|H| + ln(1/δ)) examples, where |H| is the size of the hypothesis space. From Proposition 1, |H| = (mp^{j+1})^{n_u} for unabstracted MIL. Substituting and applying logs gives the result. We likewise develop sample complexity results for abstracted MIL:

Proposition 4 (Sample complexity of abstracted MIL) Given p predicate symbols, m metarules in H^i_j augmented with at most k higher-order variables, and a maximum program size n_a, abstracted MIL has sample complexity:

s_a ≥ (1/ε)(n_a ln(m) + (j+1+k) n_a ln(p) + ln(1/δ))

Proof Analogous to Proposition 3 using Proposition 2.
We compare these bounds:

Theorem 1 (Unabstracted and abstracted bounds)
Let m be the number of H^i_j metarules, n_u and n_a be the minimum numbers of clauses necessary to express a target theory with unabstracted and abstracted MIL respectively, s_u and s_a be the bounds on the number of training examples required to achieve error less than ε with probability at least 1 − δ with unabstracted and abstracted MIL respectively, and k ≥ 1 be the number of additional higher-order variables used by abstracted MIL. Then s_u > s_a when:

n_u − n_a > (k/(j+1)) n_a

Proof From Proposition 3 it holds that:

s_u ≥ (1/ε)(n_u ln(m) + (j+1) n_u ln(p) + ln(1/δ))

From Proposition 4 it holds that:

s_a ≥ (1/ε)(n_a ln(m) + (j+1+k) n_a ln(p) + ln(1/δ))

If we cancel 1/ε and the common term ln(1/δ), then s_u > s_a follows from:

n_u ln(m) + (j+1) n_u ln(p) > n_a ln(m) + (j+1+k) n_a ln(p)

Because k ≥ 1, the inequality s_u > s_a holds when:

n_u ln(m) > n_a ln(m)   (1)

and:

(j+1) n_u ln(p) > (j+1+k) n_a ln(p)   (2)

Because k ≥ 1 the inequality (2) implies the inequality (1). The inequality (2) holds when (j+1) n_u > (j+1+k) n_a. Therefore s_u > s_a follows from (j+1) n_u > (j+1+k) n_a. Rearranging terms leads to s_u > s_a when n_u − n_a > (k/(j+1)) n_a.
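These bounds are straightforward to evaluate numerically. The following Python sketch computes the two sample complexity bounds and checks the condition of Theorem 1; the parameter values (m = 4, p = 6, n_u = 7, n_a = 3, k = 1, ε = δ = 0.1) are illustrative:

```python
import math

def sample_bound(m, p, j, n, eps, delta, k=0):
    """Blumer-style bound: (1/eps) * (n*ln(m) + (j+1+k)*n*ln(p) + ln(1/delta)).

    With k = 0 this is the unabstracted bound (Proposition 3); with k >= 1
    it is the abstracted bound (Proposition 4).
    """
    return (n * math.log(m) + (j + 1 + k) * n * math.log(p)
            + math.log(1 / delta)) / eps

# Illustrative values for the H^2_2 fragment (j = 2).
m, p, j, eps, delta = 4, 6, 2, 0.1, 0.1
n_u, n_a, k = 7, 3, 1   # program sizes for the two representations

s_u = sample_bound(m, p, j, n_u, eps, delta)
s_a = sample_bound(m, p, j, n_a, eps, delta, k=k)

# Theorem 1: s_u > s_a whenever n_u - n_a > (k / (j + 1)) * n_a.
print(n_u - n_a > (k / (j + 1)) * n_a, s_u > s_a)  # -> True True
```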
The results from this section motivate the use of abstracted MIL, and help explain the experimental results (Section 5). To illustrate these theoretical results, reconsider the decryption programs shown in Figure 2. Consider representing these programs in H^2_2. Figure 4a shows that the first-order program would require seven clauses. By contrast, Figure 4b shows that the higher-order program requires only three clauses and one extra higher-order variable. Let m_u = 4, p_u = 6, and n_u = 7 be the number of metarules, background relations, and clauses needed to express the first-order program shown in Figure 4a. Plugging these values into the formula in Proposition 1 shows that the hypothesis space of unabstracted MIL contains approximately 10^21 programs. By contrast, suppose we allow an abstracted MIL learner to additionally use the higher-order definition map/3 and the corresponding curry metarule P(A,B) ← Q(A,B,R). Therefore m_a = m_u + 1, p_a = p_u + 1, n_a = 3, and k = 1, where k is the number of additional higher-order variables used in the curry metarule. Then plugging these values into the formula from Proposition 2 shows that the hypothesis space of abstracted MIL contains approximately 10^13 programs, which is substantially smaller than the first-order hypothesis space, despite using more metarules and more background relations. The Blumer bound [2] says that given two hypothesis spaces of different sizes, then searching the smaller space will result in less error compared to the larger space, assuming that the target hypothesis is in both spaces. In this example, the target hypothesis, or a hypothesis that is equivalent to the target hypothesis, is in both hypothesis spaces but the abstracted MIL space is smaller. Therefore, our results imply that in this scenario, given a fixed number of examples, abstracted MIL should improve predictive accuracies compared to unabstracted MIL. In Section 5.5 we experimentally explore whether this result holds.
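The worked comparison above can be reproduced directly. The following Python sketch evaluates the bounds from Propositions 1 and 2 with the values stated above (m_u = 4, p_u = 6, n_u = 7 and m_a = 5, p_a = 7, n_a = 3, k = 1):

```python
def hyp_space(m, p, j, n, k=0):
    """Upper bound (m * p^(j+1+k))^n from Propositions 1 (k = 0) and 2 (k >= 1)."""
    return (m * p ** (j + 1 + k)) ** n

unabstracted = hyp_space(m=4, p=6, j=2, n=7)        # about 3.6 * 10^20
abstracted   = hyp_space(m=5, p=7, j=2, n=3, k=1)   # about 1.7 * 10^12
print(unabstracted > abstracted)  # -> True: the abstracted space is far smaller
```

Note that the abstracted space is smaller despite the extra metarule, the extra background relation, and the extra higher-order variable, because the exponent n drops from 7 to 3.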

Algorithms
We now introduce Metagol_ho and HEXMIL_ho, both of which implement abstracted MIL and extend Metagol and HEXMIL respectively. To keep the paper self-contained, we also describe Metagol and HEXMIL.

Metagol
Metagol [9] is a MIL learner based on a Prolog meta-interpreter. Figure 5 shows Metagol's learning procedure described using Prolog. Metagol works as follows. Given a set of atoms representing positive examples, Metagol tries to prove each atom in turn. Metagol first tries to deductively prove an atom using compiled BK by delegating the proof to Prolog (call(Atom)), where the compiled BK contains standard Prolog definitions. Metagol uses prim statements to allow a user to specify what predicates are part of the compiled BK. Prim statements have the form prim(P/A), where P is a predicate symbol and A is the associated arity; they are similar to determinations used by Aleph [41], except that Metagol only requires prim statements for predicates that may appear in the body. If this deductive step fails, Metagol tries to unify the atom with the head of a metarule (metarule(Name,Subs,(Atom:-Body))) and tries to bind the existentially quantified higher-order variables in a metarule to symbols in the predicate signature, where Subs contains the substitutions. Metagol saves the resulting substitutions and tries to prove the body of the metarule. After proving all atoms, a Prolog program is formed by projecting the substitutions onto their corresponding metarules. Metagol checks the consistency of the learned program with the negative examples. If the program is inconsistent, then Metagol backtracks to explore different branches of the SLD-tree.
Metagol uses iterative deepening to ensure that the first consistent hypothesis returned has the minimal number of clauses. The search starts at depth 1. At depth d the search returns a consistent hypothesis with at most d clauses if one exists; otherwise it continues to depth d + 1. At each depth d, Metagol introduces d − 1 new predicate symbols.

Figure 6 shows the Prolog code for Metagol_ho. The key difference between Metagol_ho and Metagol is the introduction of the second prove_aux/3 clause in the meta-interpreter, denoted in boldface. This clause allows Metagol_ho to prove an atom by fetching a clause from the IBK (such as map/3) whose head unifies with a given atom. The distinction between compiled and interpreted BK is that whereas a clause from the compiled BK is proved deductively by calling Prolog, a clause from the IBK is proved through meta-interpretation. Meta-interpretation allows for predicate invention to be driven by the proof of conditions (as in filter/3) and functions (as in map/3). IBK is different from metarules because the clauses in IBK are all universally quantified and, importantly, do not require any substitutions. By contrast, metarules contain existentially quantified variables whose substitutions form the hypothesised program. Figure 7 shows examples of the three forms of BK used by Metagol_ho. Metagol_ho works in the same way as Metagol except for the use of IBK. Metagol_ho first tries to prove an atom deductively using compiled BK by delegating the proof to Prolog (call(Atom)), exactly how Metagol works. If this step fails, Metagol_ho tries to unify the atom with the head of a clause in the IBK (ibk((Atom:-Body))) and tries to prove the body of the matched definition. Metagol does not perform this additional step. Failing this, Metagol_ho continues to work in the same way as Metagol. Metagol_ho uses negation as failure [5] to negate predicates in the compiled BK.
Negation of invented predicates is unsupported and is left for future work.

Metagol_ho
To illustrate the difference between Metagol and Metagol_ho, suppose you have compiled BK containing the succ/2, int_to_char/2, and map/3 predicates and the curry1 metarule P(A,B) ← Q(A,B,R). Given an example atom such as f([1,2,3],[c,d,e]), Metagol cannot invent the predicate needed for the third argument of map/3, but Metagol_ho, with map/3 given as IBK, can. As this scenario illustrates, the real power and novelty of Metagol_ho is the combination of abstraction (learning higher-order programs) and invention (predicate invention). In this scenario, abstraction has allowed the atom Q([1,2,3],[c,d,e],R) to be decomposed into the sub-problems R(1,c), R(2,d), R(3,e). Further abstraction and invention allows Metagol_ho to solve these sub-problems by inventing and defining the necessary predicate for R. By successively interleaving these two steps, Metagol_ho supports the invention of conditions and functions to an arbitrary depth, which goes beyond anything in the literature.
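The decomposition described above can be sketched in Python. This is an illustrative model, not Metagol_ho itself: compiled BK predicates are modelled as partial functions (int_to_char/2 is assumed here to map 1 to 'a', 2 to 'b', and so on), and invention is modelled as a search for a composition of BK predicates consistent with every sub-problem produced by the abstraction:

```python
from itertools import product

# Compiled BK modelled as partial functions (None = predicate fails).
BK = {
    'succ':        lambda x: x + 1 if isinstance(x, int) else None,
    'int_to_char': lambda x: chr(ord('a') + x - 1) if isinstance(x, int) else None,
}

def invent(pairs, max_len=3):
    """Search for a chain of BK predicates (an invented predicate for R)
    that satisfies every sub-problem R(x, y) in `pairs`."""
    for n in range(1, max_len + 1):
        for chain in product(BK, repeat=n):
            def apply(x, chain=chain):
                for name in chain:
                    x = BK[name](x)
                    if x is None:
                        return None
                return x
            if all(apply(x) == y for x, y in pairs):
                return chain
    return None

# Abstraction: proving map([1,2,3],[c,d,e],R) reduces to the sub-problems
# R(1,'c'), R(2,'d'), R(3,'e'); invention then searches for R.
subgoals = list(zip([1, 2, 3], ['c', 'd', 'e']))
print(invent(subgoals))  # -> ('succ', 'succ', 'int_to_char')
```

The invented chain corresponds to a predicate whose body composes succ/2 twice with int_to_char/2; in Metagol_ho, such a definition is built by binding metarule variables rather than by exhaustive chaining, but the abstraction/invention interplay is the same.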

HEXMIL
Before describing HEXMIL_ho, which supports learning higher-order logic programs, we first discuss HEXMIL, on which HEXMIL_ho is based.
HEXMIL is an answer set programming (ASP) encoding of MIL introduced by Kaminski et al. [23]. Whereas Metagol searches for a proof (and thus a program) using a meta-interpreter and SLD-resolution, HEXMIL searches for a proof by encoding the MIL problem as an ASP problem. As argued by Kaminski et al., an ASP implementation can be more efficient than a Prolog implementation because ASP solvers employ efficient conflict propagation, which is important for detecting the derivability of negative examples early during ASP search.
The HEXMIL encoding specifies constraints on possible hypotheses derived from the examples, in addition to rules specifying the available BK. An ASP solver performs a combinatorial search for solutions satisfying these constraints. ASP solvers typically work in two phases: (1) a grounding phase, where rules are grounded, and (2) a solving phase, where reasoning on (propositional) rules leads to answer sets [19]. A straightforward ASP encoding of the MIL problem is infeasible in many cases, for reasons such as the grounding bottleneck of ASP and the difficulty in manipulating complex structures such as lists [23]. To mitigate these difficulties HEXMIL uses the HEX formalism [13] which allows ASP programs to interface with external sources. External sources are predicate definitions given by programs outside of the ASP language. For instance, HEXMIL interfaces with external sources described as a Python program. HEX programs can access these definitions via external atoms. HEXMIL benefits from external atoms by allowing for arbitrary encodings of complex structures (e.g. we encode lists as strings, thereby reducing the number of variables needed in the encoding). Another benefit is that external atoms allow for the incremental introduction of new constants (i.e. symbols not in the initial ASP program).
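As a rough illustration of the external-source idea (the function name and behaviour here are assumptions for the sketch, not HEXMIL's actual API), an external source for &bkBinary[P,A](B) can be modelled as a Python function from a predicate name and an input term to the list of output terms. Because lists are encoded as strings, outputs can be constants that did not appear in the initial ASP program:

```python
# Hypothetical Python external source modelling &bkBinary[P,A](B):
# given predicate name `pred` and input term `a`, return all outputs B.
def bk_binary(pred, a):
    if pred == 'tail' and len(a) > 0:
        return [a[1:]]   # lists encoded as strings, as in HEXMIL
    if pred == 'head' and len(a) > 0:
        return [a[0]]
    return []

# 'bc' is a new constant introduced incrementally by the external atom.
print(bk_binary('tail', 'abc'))  # -> ['bc']
```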
To improve efficiency, Kaminski et al. introduced a forward-chained HEXMIL-encoding which requires forward-chained metarules:

Definition 7 (Forward-chained metarule)
A metarule is forward-chained when it can be written in the form:

P(A, B) ← Q_1(A, C_1), Q_2(C_1, C_2), …, Q_i(C_{i−1}, B), R_1(D_1), …, R_j(D_j)

where D_1, …, D_j are all contained in {A, C_1, …, C_{i−1}, B}.
In the forward-chained HEXMIL encoding, compiled (first-order) BK is encoded using the external atoms &bkUnary[P,A]() and &bkBinary[P,A](B). These two atoms represent all BK predicates of the form P(A) and P(A,B), where P and A are input arguments to the external source and B is an output argument. Using the input/output ordering of the external binary atoms, grounding of variables in forward-chained metarules occurs from left to right. HEXMIL uses the forward-chained encoding:

deduced(P,A)   ← &bkUnary[P,A](), state(A)
deduced(P,A,B) ← &bkBinary[P,A](B), state(A)
state(A)         (one fact for each P(A,B) ∈ E+ ∪ E−)
state(B) ← deduced(P,A,B)
HEXMIL uses the deduced predicate to represent facts that hypotheses could entail. In this encoding, the import of BK is guarded by the predicate state/1. A solution for the MIL problem (Definition 4) must entail all positive examples (i.e. ground atoms). Therefore, in HEXMIL, every positive example must appear in the head of a grounded metarule. It follows that the ground terms in atoms can be seen as the states reachable from the examples. Therefore, HEXMIL initially marks the ground terms that appear in the examples as states. As new ground terms are introduced by the external atoms, HEXMIL marks these values as states as well.
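The state-guarded import described above can be sketched in Python. This is an illustrative re-implementation, not HEXMIL itself: the BK relation and the examples are hypothetical, and binary BK is given as a function from a term to its successors.

```python
# Sketch of forward-chained deduction with incremental state marking:
# facts may only be derived at ground terms already marked as states,
# and newly derived terms become states in turn.

def forward_chain(bk_binary, examples, max_iters=100):
    """bk_binary: dict mapping predicate name -> function A -> list of B."""
    # Initially, only ground terms appearing in the examples are states.
    states = {t for (a, b) in examples for t in (a, b)}
    deduced = set()
    for _ in range(max_iters):
        new = set()
        for p, f in bk_binary.items():
            for a in list(states):
                for b in f(a):
                    if (p, a, b) not in deduced:
                        new.add((p, a, b))
        if not new:
            break
        deduced |= new
        # New ground terms introduced by deduction become states too.
        states |= {b for (_, _, b) in new}
    return deduced, states

# Hypothetical BK: successor on small integers.
bk = {"succ": lambda a: [a + 1] if a < 5 else []}
facts, states = forward_chain(bk, [(0, 3)])
```

Starting from the example terms 0 and 3, only constants actually reachable from the examples are ever used for grounding, which is the point of the state/1 guard.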
To support metarules, HEXMIL employs two encoding rules. The first rule encodes the possible instantiations of a metarule. Let mr be the name of an arbitrary forward-chained metarule (Def. 7); then for each such metarule, the first encoding rule is:

meta(mr, P, Q_1, ..., Q_i, R_1, ..., R_j) ∨ neg_meta(mr, P, Q_1, ..., Q_i, R_1, ..., R_j) ←
    sig(P), sig(Q_1), ..., sig(Q_i), sig(R_1), ..., sig(R_j),
    ord(P, Q_1), ..., ord(P, Q_i), ord(P, R_1), ..., ord(P, R_j)

Note that the head of this rule allows for choosing whether to deduce the metarule instantiation. Also note that the disjunction in the head means that this is not a Horn clause, yet it encodes a Horn clause metarule. This encoding rule relies on two auxiliary relations. The sig relation denotes the available predicate symbols, both invented and given as part of the BK. The ord relation denotes an ordering over the predicate symbols; this ordering disallows certain instantiations, e.g. recursive instantiations.
The second encoding rule allows chosen metarule instantiations to be used to derive facts.

HEXMIL ho
We now describe the extension of HEXMIL to HEXMIL ho, which adds support for higher-order definitions, i.e. interpreted background knowledge (IBK). This extension allows HEXMIL to search for programs in abstracted forward-chained hypothesis spaces. To extend HEXMIL, we introduce a new predicate ibk to encode the higher-order atoms that occur in IBK. Note that ibk is a normal ASP predicate and not an external atom. This predicate allows us to encode higher-order clauses as a mix of deduced atoms for first-order predicates and ibk atoms for those that involve predicates as arguments. Let the following be a clause of an arbitrary (forward-chained) higher-order definition:

h(A, B, P_{0,1}, ..., P_{0,k_0}) ← h_1(A, C_1, P_{1,1}, ..., P_{1,k_1}), ..., h_j(C_{j−1}, B, P_{j,1}, ..., P_{j,k_j})

Every atom in this clause can have 0 ≤ k_i higher-order terms, and the higher-order clauses of the definition have at least one atom with k_i > 0. For each clause in a higher-order definition we give a rule encoding the clause, where C_0 = A and C_j = B. The head of the encoding rule is ibk(h, A, B, P_{0,1}, ..., P_{0,k_0}); its body contains state(A), sig(P_{0,1}), ..., sig(P_{0,k_0}), and, for each body atom h_l, a deduced atom (when k_l = 0) or an ibk atom (when k_l > 0). Figure 8 shows an example of this encoding for the until/4 predicate. Figure 8 also contains a definition for map/3 (which is slightly more involved). This approach to higher-order definitions also applies to metarules involving higher-order atoms. For instance, Figure 8 also shows the encoding of the curry2 metarule. Our extension is sufficient to learn higher-order programs. Note that in this setting higher-order definitions are required to be forward-chained in their first-order arguments, meaning that left-to-right grounding of these arguments is still valid. The remaining (higher-order) arguments can be ground by the sig predicate, which contains all the predicate names. As predicate symbols were already arguments in the HEXMIL encoding, we can easily make a predicate argument occur as an atom's predicate symbol; see the variable F in until/4 and map/3 in Figure 8.
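For readers unfamiliar with the higher-order definitions used as IBK throughout the paper, the following sketch gives their intended semantics as plain Python functions (relations specialised to functions; a simplification of the relational Prolog definitions, not the actual IBK).

```python
# Functional readings of the three higher-order definitions used as IBK.

def map3(xs, f):
    """map/3: the output is F applied to every element of the input."""
    return [f(x) for x in xs]

def until4(x, cond, f):
    """until/4: repeatedly apply F to the state until Cond holds."""
    while not cond(x):
        x = f(x)
    return x

def ifthenelse5(x, cond, then_f, else_f):
    """ifthenelse/5: apply Then or Else to the state depending on Cond."""
    return then_f(x) if cond(x) else else_f(x)
```

In the relational Prolog versions, the function arguments are predicate symbols invoked via call/N, which is exactly what the sig-grounded higher-order arguments supply in the HEXMIL ho encoding.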

Complexity of the search
The experiments in the next section use both Metagol and HEXMIL, and their higher-order extensions. The purpose of the experiments is to test our claim that learning higher-order programs, rather than first-order programs, can improve learning performance. Although we do not directly compare them, the experimental results show a significant difference in the learning performances of Metagol and HEXMIL, and their higher-order variants. The experimental results also show that HEXMIL and HEXMIL ho do not scale well, both in terms of the amount of BK and the number of training examples. To help explain these results, we now contrast the theoretical complexity of Metagol and HEXMIL. For simplicity we focus on the H^2_2 hypothesis space, although our results can easily be generalised. Our main observation is that the performance of HEXMIL is a function of the number of constant symbols, which is not the case for Metagol.
From Proposition 1 it follows that the search complexity of Metagol is a function of the number of metarules and predicate symbols. Metagol only considers constants when it evaluates whether a hypothesis covers an example, in which case it only considers the constant symbols pertaining to that particular example (in fact it delegates this step to Prolog). It follows that the search complexity of Metagol is independent of the number of constant symbols and is the same as in Proposition 1. By contrast, HEXMIL searches for a program by instantiating metarules in a bottom-up manner where the body atoms of metarules need to be grounded. This approach means that the number of options that HEXMIL considers is not only a function of the number of metarules and predicate symbols (as is the case for Metagol), but also a function of the number of constant symbols, because it needs to ground the first-order variables in a metarule. Even in the more efficient forward-chained MIL encoding, which incrementally imports new constants, body atoms are ground using many constant symbols unrelated to the examples: any constant that can be marked as a state will be used to ground atoms. Therefore, the search complexity of HEXMIL is bounded by (m p^3 c^6)^n, where m is the number of metarules, p is the number of predicate symbols, n is a maximum program size, and c is the number of constant symbols.
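The contrast between the two bounds can be made concrete with a small calculation. Here we take Proposition 1's Metagol bound for H^2_2 to be (m p^3)^n, the standard MIL hypothesis-space bound; this is an assumption since Proposition 1 is stated earlier in the paper. The numbers below are illustrative only.

```python
# Back-of-the-envelope comparison of the two search-space bounds:
# Metagol's bound is independent of the number of constants c,
# HEXMIL's bound (m p^3 c^6)^n is not.

def metagol_bound(m, p, n):
    # (m * p^3)^n: metarules m, predicate symbols p, program size n.
    return (m * p**3) ** n

def hexmil_bound(m, p, c, n):
    # (m * p^3 * c^6)^n: additionally a function of constant symbols c.
    return (m * p**3 * c**6) ** n
```

Even with only 10 constants, HEXMIL's bound picks up a factor of 10^6 per clause, which is consistent with the grounding blow-ups observed in the experiments.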
For simplicity, the above complexity reasoning was for the first-order systems. We can easily apply the same reasoning to the abstracted MIL setting.

Experiments
Our main claim is that compared to learning first-order programs, learning higher-order programs can improve learning performance. Theorem 1 supports this claim and shows that, compared to unabstracted MIL, abstraction in MIL reduces sample complexity proportional to the reduction in the number of clauses required to represent hypotheses. We now experimentally explore this result. We describe four experiments which compare the performance when learning first-order and higher-order programs. We test two null hypotheses:

Null hypothesis 1: Learning higher-order programs cannot improve predictive accuracies.
Null hypothesis 2: Learning higher-order programs cannot reduce learning times.

To test these hypotheses we compare Metagol with Metagol ho and HEXMIL with HEXMIL ho, i.e. we compare unabstracted MIL with abstracted MIL.

Common materials
In the Prolog experiments we use the same metarules and IBK in each experiment, i.e. the only variable in the Prolog experiments is the system (Metagol or Metagol ho ). We use the metarules shown in Figure 7. We use the higher-order definitions map/3, until/4, and ifthenelse/5 as IBK. We run the Prolog experiments using SWI-Prolog 7.6.4 [43].
We tried to use the same experimental methodology in the ASP HEXMIL experiments as in the Prolog experiments, but HEXMIL failed to learn any programs (first- or higher-order) because of scalability issues. Therefore, in each ASP experiment we use the exact metarules and background relations necessary to represent the target hypotheses. We run the ASP experiments using Hexlite 1.0.0. We run Hexlite with the flpcheck disabled. We also set Hexlite to enumerate a single model.

Robot waiter
Imagine teaching a robot to pour tea and coffee at a dinner table, where each setting has an indication of whether the guest prefers tea or coffee. Figure 9 shows an example in terms of initial and final states. This experiment focuses on learning a general robot waiter strategy [7] from a set of examples.

Fig. 9: Figures (a) and (b) show initial/final state waiter examples respectively. In the initial state, the cups are empty and each guest has a preference for tea (T) or coffee (C). In the final state, the cups are facing up and are full with the guest's preferred drink.

Materials
Examples are f/2 atoms where the first argument is the initial state and the second is the final state. A state is a list of ground Prolog atoms. In the initial state, the robot starts at position 0; there are d cups facing down at positions 0, ..., d − 1; and for each cup there is a preference for tea or coffee. In the final state, the robot is at position d; all the cups are facing up; and each cup is filled with the preferred drink. The robot can use the fluents and perform the actions (defined as compiled BK) shown in Figure 10.
We generate positive examples as follows. For the Prolog experiments, for the initial state we select a random integer d from the interval [1,20] as the number of cups. For the ASP experiments the interval is [1,5]. For each cup, we randomly select whether the preferred drink is tea or coffee and set it facing down. For the final state, we update the initial state so that each cup is facing up and is filled with the preferred drink. To generate negative examples, we repeat the aforementioned procedure but we modify the final state so that the drink choice is incorrect for a random subset of k > 0 drinks.
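The generation procedure above can be sketched in Python. This is a hypothetical re-implementation under stated assumptions: the state representation (dicts with pos/up/drink/pref fields) is a simplified stand-in for the lists of ground Prolog atoms actually used.

```python
import random

def positive_example(max_cups=20):
    """Initial state: robot at 0, d cups facing down with random preferences.
    Final state: robot at d, every cup up and filled with its preferred drink."""
    d = random.randint(1, max_cups)
    prefs = [random.choice(["tea", "coffee"]) for _ in range(d)]
    initial = {"robot": 0,
               "cups": [{"pos": i, "up": False, "drink": None, "pref": p}
                        for i, p in enumerate(prefs)]}
    final = {"robot": d,
             "cups": [{"pos": i, "up": True, "drink": p, "pref": p}
                      for i, p in enumerate(prefs)]}
    return initial, final

def negative_example(max_cups=20):
    """Same procedure, but corrupt a non-empty random subset of drinks."""
    initial, final = positive_example(max_cups)
    k = random.randint(1, len(final["cups"]))
    for cup in random.sample(final["cups"], k):
        cup["drink"] = "tea" if cup["drink"] == "coffee" else "coffee"
    return initial, final
```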

Method
Our experimental method is as follows. For each learning system s and for each number of training examples m, we learn a program from the training examples and measure its predictive accuracy on the testing examples. If no program is found in 10 minutes then we deem that every testing example is false. We measure mean predictive accuracies, mean learning times, and standard errors of the mean over 10 repetitions.

Results

Figure 11 shows that in all cases Metagol ho learns programs with higher predictive accuracies and lower learning times than Metagol. Figure 12 shows similar results when comparing HEXMIL with HEXMIL ho. We can explain these results by looking at example programs learned by Metagol and Metagol ho, shown in Figures 13 and 14 respectively. Although both programs are general and handle any number of guests and any assignment of drink preferences, the program learned by Metagol ho is smaller than the one learned by Metagol. Whereas Metagol learns a recursive program, Metagol ho avoids recursion and uses the higher-order abstraction until/4, which removes the need to learn a recursive two-clause definition to move along the dinner table. Likewise, Metagol ho uses the abstraction ifthenelse/5 to remove the need to learn two clauses to decide which drink to pour. The compactness of the higher-order program affects predictive accuracies: whereas Metagol ho almost always finds the target hypothesis in the allocated time, Metagol often struggles because the programs are too large, as explained by our theoretical results in Section 3.3. The results from this experiment suggest that we can reject null hypotheses 1 and 2.
Although we are not directly comparing the Prolog and ASP implementations of MIL, it is interesting to note that, despite having more irrelevant BK, more irrelevant metarules, and larger training instances, Metagol ho outperforms HEXMIL ho in all cases, both in terms of predictive accuracies and learning times. Figure 12 also shows that both HEXMIL and HEXMIL ho do not scale well in the number of training examples, especially in terms of learning times. Our results in Section 4.5 help explain this poor scalability: more training examples typically mean more constant symbols, which in turn mean a larger search complexity for both HEXMIL and HEXMIL ho, although this issue can be mitigated using state abstraction [23].

Chess strategy
Programming chess strategies is a difficult task for humans [3]. For example, consider maintaining a wall of pawns to support promotion [21]. In this case, we might start by trying to inductively program the simple situation in which a black pawn wall advances without interference from white. Figure 15 shows such an example, where in the initial state the pawns are at different ranks and in the final state all the pawns have advanced to rank 8 but the other pieces have remained in the initial positions. In this experiment, we try to learn such strategies.

Materials
Examples are f/2 atoms where the first argument is the initial state and the second is the final state. A state is a list of pieces, where a piece is denoted as a tuple of the form (Type,Id,X,Y), where Type is the type (king=k, pawn=p, etc.), Id is a unique identifier, X is the file, and Y is the rank. We generate a positive example as follows. For the initial state in the Prolog experiments, we select a random number of pieces n from the interval [1,16] and randomly place them on the board. For the ASP experiments the interval is [1,5]. For the final state, we update the initial state so that each pawn finishes at rank 8. To generate negative examples, we repeat the aforementioned procedure but we randomise the final state positions whilst ensuring that the input/output pair is not a positive example. We use the compiled BK shown in Figure 16.

Method
The experimental method is the same as in Experiment 1.

Results

Figure 17 shows that in all cases Metagol ho learns programs with higher predictive accuracies and lower learning times than Metagol. Metagol ho learns programs approaching 100% accuracy after around six examples; by contrast, Metagol learns programs with around default accuracy. Figure 18 shows similar results when comparing HEXMIL with HEXMIL ho. The poor performance of Metagol and HEXMIL is because they both rarely find solutions in the allocated time. By contrast, Metagol ho and HEXMIL ho typically learn programs within two seconds. We can again explain the performance discrepancies by looking at example learned programs in Figure 19. Figure 19b shows the compact higher-order program typically learned by Metagol ho. This program is compact because it uses the abstractions map/3 and until/4: map/3 decomposes the problem into smaller sub-goals of moving a single piece to rank 8, and until/4 solves the sub-problem of moving a pawn to rank 8. These sub-goals are solved by the invented f1/2 predicate. By contrast, Figure 19a shows the large target first-order program that Metagol struggled to learn. As shown in Proposition 1, the MIL hypothesis space grows exponentially in the size of the target hypothesis, which is why the larger first-order program is more difficult to learn. The results from this experiment suggest that we can reject null hypotheses 1 and 2.
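A functional reading of the learned strategy may help: map over the pieces, and for each pawn apply a step function until it reaches rank 8. This is an illustrative Python analogue, not the actual learned Prolog program.

```python
# Pieces are (type, id, file, rank) tuples as in the Materials section.

def to_rank8(piece):
    # until/4 analogue: advance a pawn one rank at a time until rank 8.
    t, i, x, y = piece
    if t != "p":
        return piece
    while y < 8:
        y += 1
    return (t, i, x, y)

def f(state):
    # map/3 analogue: apply the sub-goal to every piece in the state.
    return [to_rank8(p) for p in state]
```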

Droplast
In this experiment, the goal is to learn a program that drops the last element from each sublist of a given list of lists, a problem frequently used to evaluate program induction systems [25]. Specifically, we try to learn a program that drops the last character from each string in a list of strings. Figure 20 shows input/output examples for this problem described using the f/2 predicate.

Materials
Examples are f/2 atoms where the first argument is the initial list and the second is the final list. We generate positive examples as follows. For the Prolog experiments, to form the input, we select a random integer i from the interval [1,10] as the number of sublists.

Fig. 19: Figure (a) shows the target first-order chess program, which Metagol could not learn within 10 minutes. Figure (b) shows the higher-order program often learned by Metagol ho. The higher-order program is clearly smaller than the first-order program, which is why Metagol ho could typically learn it within a couple of seconds.
For each of the i sublists, we select a random integer k from the interval [1,10] and then sample with replacement a sequence of k letters from the alphabet a-z to form the sublist. To form the output, we wrote a Prolog program to drop the last element from each sublist. For the ASP experiments, the interval for i and k is [1,5]. We generate negative examples using a similar procedure, but instead of dropping the last element from each sublist, we drop j random elements (but not the last one) from each sublist, where 1 < j < k. We use the compiled BK shown in Figure 21.

Method
The experimental method is the same as in Experiment 1. Figure 22 shows that Metagol ho achieved 100% accuracy after two examples, at which point it learned the program shown in Figure 24a. This program again uses abstractions to decompose the problem. The predicate f/2 maps over the input list and applies f1/2 to each sublist to form the output list, thus abstracting away the reasoning for iterating over a list. The invented predicate f1/2 drops the last element from a single list by reversing the list, calling tail/2 to drop the head element, and then reversing the shortened list back to the original order. By contrast, Metagol was unable to learn any solutions because the corresponding first-order program is too long and the search is impractical, similar to the issues in the chess experiment. Figure 23 shows slightly unexpected results for the ASP experiment: HEXMIL ho learns programs with higher predictive accuracies than HEXMIL when given up to 14 training examples, but the predictive accuracies of HEXMIL ho progressively decrease given more examples. This performance degradation is because, as we have already explained, HEXMIL and HEXMIL ho do not scale well given more examples. This inability to scale is clearly shown in Figure 23: the learning times of HEXMIL ho increase significantly given more training examples.
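The reverse-tail-reverse decomposition described above reads naturally as a functional program. This is a Python sketch of that reading, not the learned Prolog program itself.

```python
def reverse(xs):
    return xs[::-1]

def tail(xs):
    # tail/2 analogue: drop the head element.
    return xs[1:]

def f1(xs):
    # Drop the last element of one sublist: reverse, drop head, reverse back.
    return reverse(tail(reverse(xs)))

def f(xss):
    # map/3 analogue: apply f1 to every sublist.
    return [f1(xs) for xs in xss]
```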

Results
We repeated the droplast experiment but replaced reverse/2 in the BK with the higher-order definition reduceback/3 and the compiled clause concat/3. In this scenario, Metagol ho learned the higher-order program shown in Figure 24b. This program now includes the invented predicate f3/2 which reverses a given list and is used twice in the program. This more complex program highlights invention through the repeated calls to f3/2 and abstraction through the use of higher-order functions.
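Reconstructing reverse/2 from a fold plus concatenation, as the invented f3/2 does, can be sketched as follows. The exact semantics of reduceback/3 are an assumption here: we read it as a fold over the list, combining with concat.

```python
from functools import reduce

def concat(xs, ys):
    # concat/3 analogue: list concatenation.
    return xs + ys

def reverse_via_fold(xs):
    # Fold over the list, concatenating each element onto the front
    # of the accumulator, which yields the reversed list.
    return reduce(lambda acc, x: concat([x], acc), xs, [])
```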

Further discussion
To further demonstrate invention and abstraction, consider learning a double droplast program which extends the droplast problem so that, in addition to dropping the last element from each sublist, it also drops the last sublist. Figure 25 shows examples of this problem, again represented as the target predicate f/2. Given two examples of this problem, Metagol ho learns the program shown in Figure 26a. For readability, Figure 26b shows the folded program where non-reused invented predicates are removed. This program is similar to the program shown in Figure 24b but it makes an additional final call to the invented predicate f1/2, which is used twice in the program, once as a higher-order argument in map/3 and again as a first-order predicate. This form of higher-order abstraction and invention goes beyond anything in the existing literature.

Fig. 24: Figure (b) shows a more complex program learned by Metagol ho when we repeated the experiment but disallowed Metagol ho from using reverse/2 and instead gave it reduceback/3 and concat/3.

Encryption
In this final experiment, we revisit the encryption example from the introduction.

Materials
Examples are f/2 atoms where the first argument is the encrypted string and the second is the unencrypted string. For simplicity we only allow the letters a-z. We generate a positive example as follows. For the Prolog experiments we select a random integer k from the interval [1,20] to denote the unencrypted string length. For the ASP experiments we select k from the interval [1,5]. We sample with replacement a sequence y of length k from the set {a, b, ..., z}; the sequence y denotes the unencrypted string. We form the encrypted string x by shifting each character in y two places to the right, e.g. a → c, b → d, ..., z → b. The atom f(x,y) thus represents a positive example.
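The generation procedure can be sketched in Python; this is a hypothetical re-implementation of the procedure described above, not the code used in the experiments.

```python
import random
import string

def shift2(ch):
    # Shift a lowercase letter two places to the right, wrapping z -> b.
    return chr((ord(ch) - ord("a") + 2) % 26 + ord("a"))

def positive_example(max_len=20):
    k = random.randint(1, max_len)
    y = "".join(random.choice(string.ascii_lowercase) for _ in range(k))
    x = "".join(shift2(c) for c in y)
    return x, y   # f(x, y): x is the encrypted string, y the plaintext
```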

Method
The experimental method is the same as in Experiment 1. Figure 27 shows that, as with the other experiments, Metagol ho learns programs with higher predictive accuracies and lower learning times than Metagol. These results are as expected because, as shown in Figure 4a, representing the target encryption hypothesis as a first-order program in H^2_2 requires seven clauses. By contrast, as shown in Figure 4b, representing the target hypothesis as a higher-order program in H^2_2 requires only three clauses, with one additional higher-order variable in the map/3 abstraction.

Results
We attempted to run the experiment using HEXMIL and HEXMIL ho. However, both systems failed to find any programs within the time limit. In fact, even in an extremely simple version of the experiment (where the alphabet contained only 10 letters, each string had at most 3 letters, and the character shift was +1), both systems failed to learn anything in the allocated time. Our theoretical results in Section 4.5 explain these empirical results: in this scenario, the number of ways that the BK predicates can be chained together and instantiated is no longer tractable for HEXMIL. The experiment suggests that HEXMIL needs to be better at determining which groundings are relevant to consistent hypotheses.

Discussion
Our main claim is that compared to learning first-order programs, learning higher-order programs can improve learning performance. Our experiments support this claim and show that learning higher-order programs can significantly improve predictive accuracies and reduce learning times.
Although it was not our purpose, our experiments also implicitly (implicitly because we do not directly compare the systems) show that Metagol outperforms HEXMIL, and similarly Metagol ho outperforms HEXMIL ho . Our empirical results contradict those by Kaminski et al. [23], but support those by Morel et al. [30]. There are multiple explanations for this discrepancy. We think that the main problem is the ASP grounding problem faced by HEXMIL: in most of our experiments, HEXMIL timed out during the grounding (and not solving) stage. To alleviate this issue, future work could consider using state abstraction [23] to mitigate the grounding issues.
Also, by adjusting the experimental methodology, some of the results may change. For instance, Kaminski et al. showed that HEXMIL can sometimes learn solutions quicker than Metagol because of conflict propagation in ASP. They claim that this performance improvement is because Metagol only considers negative examples after inducing a program from the positive examples (as described in Section 4.1). Therefore, HEXMIL should benefit from more negative examples, but may suffer from fewer.
To summarise, although our empirical results suggest that Metagol outperforms HEXMIL, future work should more rigorously compare the two approaches on multiple domains along multiple dimensions (e.g. varying the numbers of examples, size of BK, etc.).

Conclusions and further work
We have extended MIL to support learning higher-order programs by allowing for higher-order definitions to be included as background knowledge. We showed that learning higher-order programs can reduce the textual complexity required to express target classes of programs, which in turn reduces the hypothesis space. Our sample complexity results show that learning higher-order programs can reduce the number of examples required to reach high predictive accuracies. To learn higher-order programs, we introduced Metagol ho, a MIL learner which also supports higher-order predicate invention, such as inventing predicates for the higher-order abstractions map/3 and until/4. We also introduced HEXMIL ho, an ASP implementation of MIL that also supports learning higher-order programs. Our experiments showed that, compared to learning first-order programs, learning higher-order programs can significantly improve predictive accuracies and reduce learning times.

Metarules
There are at least two limitations with our work regarding the choice of metarules when learning higher-order programs.
One issue is deciding which metarules to use. Figure 7 shows the 11 metarules used in our experiments. Of these metarules, eight (those with only monadic or dyadic literals) are a subset of a derivationally irreducible set of monadic and dyadic metarules [12]. We can therefore justify their selection because they are sufficient to learn any program in a slightly restricted subset of Datalog. However, we additionally used three curry metarules with arities three, four, and five, which were not considered in the work on identifying derivationally irreducible metarules. In addition, the curry metarules include existentially quantified predicate arguments (e.g. R in P(A,B) ← Q(A,B,R)). Although these metarules seem intuitive and sensible to use, we have no theoretical justification for using them. Future work should address this issue, such as by extending the existing work [12] to include such metarules.
A second issue regarding the curry metarules is that, when used with abstractions, they each require an extra clause in the learned program. Our motivation for learning higher-order programs was to reduce the number of clauses necessary to express a target theory. Although our theoretical and experimental results support this claim, further improvements can be made. For instance, suppose you are given examples of the concept f(x,y) where x is a list of integers and y is x but reversed, where each element has had one added to it, and then doubled, such as f([1,2,3],[8,6,4]). Given the metarules in Figure 7, Metagol ho could learn this concept, but each curry metarule used would contribute an extra clause. A more compact program, formed of a single clause and four literals, such as f(A,B) ← map(A,C,succ), map(C,D,double), reverse(D,B), should therefore be easier to learn. Future work should try to address this limitation of the current approach.
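The single-clause solution to the reverse/plus-one/double example can be read functionally as two maps followed by a reverse. This is an illustrative Python sketch; the helper names succ and double stand in for the corresponding BK predicates.

```python
def succ(n):
    return n + 1

def double(n):
    return 2 * n

def f(xs):
    step1 = [succ(x) for x in xs]       # map(A, C, succ)
    step2 = [double(x) for x in step1]  # map(C, D, double)
    return step2[::-1]                  # reverse(D, B)
```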

Higher-order definitions
Our experiments rely on a few higher-order definitions, mostly based on higher-order programming concepts, such as map/3 and until/4. Future work should consider other higher-order concepts. For instance, consider learning regular grammars, such as a*b*c*.
To improve learning efficiency it would be desirable to encode the Kleene star operator as a higher-order definition, such as:

kstar(P,A,A).
kstar(P,A,B) :- call(P,A,C), kstar(P,C,B).
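The kstar/3 definition above is the reflexive-transitive closure of the relation P: kstar(P,A,B) holds when B is reachable from A by zero or more P-steps. A Python analogue, with the relation P given as a function from a state to its successor states (an assumption of this sketch), is:

```python
def kstar(p, a):
    """All states reachable from a by zero or more p-steps."""
    reached, frontier = {a}, [a]
    while frontier:
        s = frontier.pop()
        for t in p(s):
            if t not in reached:
                reached.add(t)
                frontier.append(t)
    return reached
```

For example, with p mapping a state to its successor (up to a bound), kstar computes every state the zero-or-more iteration can reach, which is exactly what the two Prolog clauses enumerate on backtracking.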
Similarly, we have used abstracted MIL to invent functional constructs. Future work could consider inventing relational constructs. For instance, consider this higher-order definition of a closure: