Top program construction and reduction for polynomial time Meta-Interpretive learning

Meta-Interpretive Learners, like most ILP systems, learn by searching for a correct hypothesis in the hypothesis space, the powerset of all constructible clauses. We show how this exponentially-growing search can be replaced by the construction of a Top program: the set of clauses in all correct hypotheses that is itself a correct hypothesis. We give an algorithm for Top program construction and show that it constructs a correct Top program in polynomial time and from a finite number of examples. We implement our algorithm in Prolog as the basis of a new MIL system, Louise, that constructs a Top program and then reduces it by removing redundant clauses. We compare Louise to the state-of-the-art search-based MIL system Metagol in experiments on grid world navigation, graph connectedness and grammar learning datasets and find that Louise improves on Metagol’s predictive accuracy when the hypothesis space and the target theory are both large, or when the hypothesis space does not include a correct hypothesis because of “classification noise” in the form of mislabelled examples. When the hypothesis space or the target theory are small, Louise and Metagol perform equally well.


Introduction
Meta-Interpretive Learning (MIL) (Muggleton et al., 2014) is a new setting for Inductive Logic Programming (ILP) (Muggleton, 1991). ILP algorithms learn logic theories from examples and background knowledge. MIL learners additionally restrict the set, L, of clauses that can be constructed from the symbols in the background knowledge and examples (the hypothesis language), by means of second-order clauses called metarules (Muggleton et al., 2014). Each clause in L is an instantiation of a metarule with existentially quantified variables substituted with predicate symbols and constants, in a process called metasubstitution (examples of metarules from the MIL literature are listed in table 3 in section 3).
Fig. 1: Searching a hypothesis space H is an exponentially more complex task than constructing a hypothesis language L.
Like other ILP learners, the state-of-the-art MIL system, Metagol (Muggleton et al., 2014), searches the set of hypotheses that can be expressed as subsets of L for a correct hypothesis that entails all positive examples and no negative examples. The set of hypotheses expressible in L is the hypothesis space, denoted with H. Each hypothesis in H is a set of clauses in L, therefore H is the powerset of L and searching H for a correct hypothesis takes, in the worst case, time exponential in the cardinality of L.
On the other hand, enumerating the clauses in L need only take time polynomial in the cardinality of L (see Figure 1). Further, the subset of L that includes only the clauses in correct hypotheses in H is itself a correct hypothesis: it is the union of all correct hypotheses in H, and, therefore, the most general, correct set of clauses that entails each other correct set of clauses in H. We will call this set of clauses in correct hypotheses the Top program and denote it by $\top$.
In the following sections we develop the framework of the Top program for MIL and give a polynomial-time algorithm for its construction in Algorithm 1 that is capable of learning recursive hypotheses and performing predicate invention as described in section 6.2. We then present a new MIL system, Louise, that implements Algorithm 1 in Prolog and learns by Top program construction and reduction to remove logically redundant clauses by application of Gordon Plotkin's program reduction algorithm (Plotkin, 1972). Tables 1 and 2 illustrate the inputs and outputs of Top program construction and reduction as implemented in Louise.
-Empirical comparison of Louise to the state-of-the-art MIL system, Metagol.
Structure. In section 2 we place our work in the context of the ILP and MIL literature. In section 3 we describe our Top program construction algorithm and prove its correctness, convergence and polynomial time complexity. In section 4 we describe Louise. In section 5 we experimentally compare Louise to Metagol. We conclude in section 6 with a summary of our findings and proposed future work.

Related work
The cardinality of H for MIL is upper-bounded by an exponential function of the size of the target theory, Θ, (Lin et al., 2014) and when the true cardinality of H approaches this upper bound, a classical search of H becomes computationally infeasible on modern hardware. As a result most single-predicate programs learned by Metagol as reported in the MIL literature have at most 5 clauses. See e.g. (Muggleton et al., 2014; Lin et al., 2014; Muggleton and Lin, 2015; Cropper and Muggleton, 2015; Cropper et al., 2016; Cropper and Muggleton, 2016; Muggleton et al., 2018; Morel et al., 2019). Much of the MIL literature is preoccupied with reducing the size of Θ as a means of reducing the maximum size of H and thereby the cost of a search for a correct hypothesis. In the Episodic learning (Muggleton et al., 2014) and Dependent learning (Lin et al., 2014) settings, Metagol learns larger multi-predicate programs by incrementally learning small sub-programs while the variant Metagol AI learns from abstractions and higher-order background knowledge (Cropper and Muggleton, 2016). Such techniques take advantage of the theory reformulation (Stahl, 1993) aspect of predicate invention to reduce the size of Θ to fewer than 5 clauses and allow learning to proceed when the complexity of a search of H would otherwise be overwhelming. Top program construction is efficient when Θ is large and when H is large and does not require predicate invention for the purpose of learning programs larger than 5 clauses.
Metarules, central to MIL, were originally proposed in (Emde et al., 1983), where the metarules named Chain, Inverse and Identity in table 3, representing, respectively, the concepts of transitivity, reflexivity and symmetry between binary relations, formed the basis of a mechanism for concept discovery. This approach was further developed in systems like METAXA.3 (Emde, 1987), BLIP (Wrobel, 1988) and MOBAL (Morik, 1993; Kietz and Wrobel, 1992).
The Top program construction procedure described in Algorithm 1 can be contrasted to the Rule Discovery Tool (RDT) in MOBAL. RDT employs a generate-and-test algorithm that conducts a general-to-specific search for a hypothesis that satisfies a user-defined criterion, guided by a subsumption order over metarules. By contrast, Algorithm 1 does not conduct a search and is not a generate-and-test procedure, but a resolution-based proof procedure that restricts the set of constructed clauses by means of the positive examples then further refines this set by means of the negative examples. Unlike RDT, Algorithm 1 can construct recursive hypotheses, including left-recursive and mutually recursive ones as discussed in section 4.
Other ILP systems using metarules (also called program schemata) have been proposed for the specific purpose of learning recursive logic programs, like CRUSTACEAN (Aha et al., 1994), CILP (Lapointe et al., 1993), Force2 (Marcinkowski and Pacholski, 1992), Sieres (Wirth and O'Rorke, 1992), TIM (Idestam-Almquist, 1996), Synapse (Flener and Deville, 1993), Dialogs (Flener, 1997) and MetaInduce (Hamfelt and Nilsson, 1994). Such systems learn by a subsumption-order search of H and are typically limited to recursive programs of restricted form (e.g. exactly one base case and one recursive clause), or require additional inductive biases, only accept examples of one target predicate at a time, cannot use background knowledge or require ground background knowledge, cannot perform predicate invention etc. (Flener and Yilmaz, 1999). More recent systems ILASP, (Law et al., 2014),
A Top theory is used by some ILP systems as e.g. in TopLog (Muggleton et al., 2008) and MC-TopLog (Muggleton et al., 2012) and in the non-monotonic setting in the ASP-learning systems TAL (Corapi et al., 2010), ASPAL (Corapi et al., 2011) and RASPAL (Athakravi et al., 2014). A Top theory is an instance of strong inductive bias used to direct the search of H, which remains expensive and which Top program construction avoids altogether.
The Top program is a unique object in H that can be constructed without an expensive search. It is comparable to the Least General Generalisation (LGG) (Plotkin, 1970, 1971), or the Bottom clause (Muggleton, 1995), which are also unique, directly constructible objects. The Top program differs from the LGG and the Bottom clause in that it is not a clause but a correct hypothesis, i.e. a set of clauses.

Background
We follow the Logic Programming and ILP terminology established in (Nienhuys-Cheng and de Wolf, 1997) which we extend with MIL-specific terms and terminology for second-order definite clauses and programs, as follows.

Logical notation
C is the set of constants and P the set of predicate symbols. First-order variables are quantified over C and second-order variables are quantified over P. An atom or literal is second-order if it contains at least one second-order variable, or a predicate symbol, as a term, or as an argument of a term. A definite clause is second-order if it contains at least one second-order literal. A literal is datalog (Ceri et al., 1989) if it contains no function symbols of arity more than 0. A definite clause is datalog if it contains only datalog literals. A logic program is definite datalog if it contains only definite datalog clauses.

Meta-Interpretive Learning
MIL is a form of ILP where the first-order language of hypotheses, L (a set of clauses), is defined by a set of metarules, second-order definite clauses with existentially quantified variables in the place of predicate symbols and constants.
The $H^2_2$ language of definite datalog metarules with at most two body literals of arity at most 2 has Universal Turing Machine expressivity and is decidable when P and C are finite (Muggleton and Lin, 2015). Examples of $H^2_2$ metarules found in the MIL literature are given in table 3.
Each clause in L is an instantiation of a metarule with second-order existentially quantified variables replaced by symbols in P and first-order existentially quantified variables replaced by constants in C. A substitution of the existentially quantified variables in a metarule M is a metasubstitution, denoted as µ/M.
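As an illustration of metasubstitution, the following Python sketch applies a metasubstitution to the Chain metarule. It assumes the form of Chain usual in the MIL literature, P(x,y) ← Q(x,z), R(z,y). The tuple representation of literals and the names CHAIN, apply_metasubstitution and show are hypothetical and are not part of Louise or Metagol.

```python
# Literals are tuples (symbol, arg1, ..., argn). In CHAIN the symbols
# P, Q, R are existentially quantified second-order variables and
# x, y, z are universally quantified first-order variables.
# Assumed form of Chain: P(x,y) <- Q(x,z), R(z,y).
CHAIN = [("P", "x", "y"), ("Q", "x", "z"), ("R", "z", "y")]


def apply_metasubstitution(metarule, mu):
    """Replace every term bound in mu; leave all other terms unchanged."""
    return [tuple(mu.get(term, term) for term in literal)
            for literal in metarule]


def show(clause):
    """Render a clause (head first, then body literals) in Prolog-like form."""
    def fmt(literal):
        return f"{literal[0]}({','.join(literal[1:])})"
    head, *body = clause
    return f"{fmt(head)} :- {', '.join(fmt(b) for b in body)}."


mu = {"P": "grandparent", "Q": "parent", "R": "parent"}
print(show(apply_metasubstitution(CHAIN, mu)))
# prints: grandparent(x,y) :- parent(x,z), parent(z,y).
```

The same function would also bind first-order existentially quantified variables to constants, since it substitutes any term that appears as a key of mu.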
A system that performs MIL is a Meta-Interpretive Learner, or MIL-learner (with a slight abuse of abbreviation to support a natural pronunciation). A MIL-learner is given the elements of a MIL problem and returns a hypothesis as a solution to the MIL problem. A MIL problem is a quintuple, $T = \langle E^+, E^-, B, M, H \rangle$ where: a) positive examples, $E^+$, are ground definite atoms and negative examples, $E^-$, are ground Horn goals, having the symbol and arity of one or more target predicates; b) the background knowledge, B, is a set of program clause definitions with definite datalog heads; c) M is a set of metarules; and d) H is the hypothesis space, a set of hypotheses.
Each hypothesis in H is a set of clauses in L. Each hypothesis in H is a definition of a target predicate in $E^+$ and may include definitions of one or more invented predicates, predicates other than a target predicate and not defined in B. For each
Typically a MIL learner is not explicitly given H or L, rather those are implicitly defined by M and the constants C and symbols P in $E^+$, $E^-$, B and any invented predicates. The original MIL-learner, Metagol, searches H for a correct hypothesis by iterative deepening on the cardinality of hypotheses. Our new MIL-learner Louise does not search H and instead constructs, and then reduces, the Top program for T, the set of clauses in all correct hypotheses in H, defined below:
Definition 1 Let $T = \langle E^+, E^-, B, M, H \rangle$ be a MIL problem and L the hypothesis language. $\top$ is the Top program for T iff for all C ∈ L where ∃e
Proof Assume Theorem 1 is false and H includes a correct hypothesis. Then either
Thus the assumption is contradicted and Theorem 1 holds.

Algorithm 1 Top program construction
Output: $\top$, the Top program for T, as a set of metasubstitutions µ/M, where M ∈ M.

Top program construction
Algorithm 1 lists our algorithm for Top program construction. To clarify, the name of Algorithm 1 is "Top program construction". Section 4 describes our Prolog implementation of Algorithm 1 as the basis of a new MIL system called "Louise".
In the following sections we prove that Algorithm 1 correctly constructs the Top program for a MIL problem in polynomial time and after processing only a finite number of examples.

Preliminaries
Finite MIL problem. In the following sections, let where k is the finite maximum number of body literals in each $M \in M_k$. Let $C_k$ and $P_k$ be the finite sets of constants and predicate symbols in $E^+_k$, $E^-_k$, $B_k$; and let $L_k$ be the hypothesis language of clauses constructible with $M_k$, $C_k$, $P_k$.
Target theory. For each target predicate $P \in T_k$, let $\Theta_P$, a definition of P, be the target theory of P such that each clause in $\Theta_P$ is an instance of a metarule $M \in M_k$. For each P, $B_P$ is the Herbrand base of P; $SS(\Theta_P) \subseteq B_P$ is the success set of $\Theta_P \cup B_k$ restricted to atoms of P; and $FF(\Theta_P) = B_P \setminus SS(\Theta_P)$ is the finite failure set of $\Theta_P$, restricted to atoms of P (i.e. the set of atoms p of P such that there exists a finitely-failed resolution tree for
Inductive soundness and completeness. An inductive inference procedure is a) inductively sound, or simply sound, when it derives no clauses that entail one or more negative examples with respect to background knowledge, and b) inductively complete, or simply complete, when it derives all clauses that entail one or more positive examples with respect to background knowledge.

Inductive soundness and completeness of Algorithm 1
3.4.1 Learning in the limit
Proof Follows from the finiteness of $P_k$, $C_k$ and the soundness and completeness of SLD resolution for definite programs (Nienhuys-Cheng and de Wolf, 1997). The completeness of SLD resolution ensures that procedure Generalise will derive all clauses in $L_k$ that entail at least one positive example in $E^+_k$ and the soundness of SLD resolution ensures that procedure Generalise will derive no clauses in $L_k$ that do not entail any positive examples in
Proof Same as for Lemma 1. The completeness of SLD resolution ensures that procedure Specialise will derive all clauses in $L_k$ that entail at least one negative example in $E^-_k$ and the soundness of SLD resolution ensures that procedure Specialise will derive no clauses in $L_k$ that entail no negative examples in $E^-_k$.
Theorem 2 Algorithm 1 is inductively sound and complete.

Finite example sets
In this section we show that Algorithm 1 can construct
In this case, $\top_k \models e^+$, which contradicts Theorem 1. Therefore the assumption is false and Lemma 3 holds.
Proof Assume Lemma 4 is false. In this case,
Therefore the assumption is false and Lemma 4 holds.
Proof Follows directly from Lemmas 3, 4.
Note that
We do not know how to calculate the cardinality of $\top_k$ exactly; however, in the worst case $\top_k = L_k$. It is possible to place a finite upper bound on the cardinality of $L_k$ and therefore of $\top_k$, as follows.
(Cropper and Tourret, 2018). This number is finite because p, m, k are finite.
Theorem 3 Algorithm 1 constructs $\top_k$ after processing a finite number of positive and negative examples.

Time complexity of Algorithm 1
In this section we show that the time complexity of Algorithm 1 is polynomial.
k ). This is the worst case because in that case, procedure Generalise in Algorithm 1 derives all clauses in $L_k$ from each example in $E^+_k$, i.e. the maximum number of computations is performed for each example in
Remark 1 The number of hypotheses of at most n clauses in $H_k$ is $(mp^{k+1})^n$ (Cropper and Tourret, 2018). Therefore, the time complexity of a classical search of
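The gap between the size of the hypothesis language and the size of the hypothesis space can be made concrete with a short calculation. The Python sketch below evaluates the expression of Remark 1, assuming that m denotes the number of metarules, p the number of predicate symbols and k the maximum number of body literals in a metarule. Reading the case n = 1 as a bound on the number of clauses is our own simplification for illustration, and the function names are hypothetical.

```python
def max_clauses(m: int, p: int, k: int) -> int:
    """Value of (m * p^(k+1))^n for n = 1, read here as a clause count bound."""
    return m * p ** (k + 1)


def max_hypotheses(m: int, p: int, k: int, n: int) -> int:
    """Number of hypotheses of at most n clauses, as stated in Remark 1."""
    return max_clauses(m, p, k) ** n


# Illustrative values only; they are not taken from the experiments.
m, p, k = 4, 10, 2
for n in (1, 2, 5):
    print(n, max_clauses(m, p, k), max_hypotheses(m, p, k, n))
```

With these illustrative values the clause bound stays fixed at 4000 while the hypothesis count grows as a power of n, which is the contrast that motivates constructing the Top program instead of searching.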

Implementation
In this section we present a new MIL-learner, Louise (Patsantzis and Muggleton, 2019), written in Prolog, that learns by Top program construction and reduction¹.
¹ Louise was created alongside a new version of Metagol called Thelma, an acronym for Theory Learning Machine. Louise was named as a play on words with Thelma, referencing Thelma and Louise (Scott et al., 1991).
Table 4: Encapsulation of atoms and clauses, including metarules. The excapsulation of an encapsulated atom or clause is the same as its un-encapsulated form.

Louise's learning procedure
Louise's learning procedure is outlined in Algorithm 2. Line numbers listed in this section refer to the numbered lines in the listing of Algorithm 2.
Learning begins with the encapsulation of a MIL problem (line 1). An encapsulation e(L) of a literal $L = p(s_1, ..., s_n)$ is a first-order atom $m(p, s_1, ..., s_n)$ where m is an encapsulation predicate. The symbol m is chosen arbitrarily and has no special meaning. The arity of each encapsulation predicate is n + 1 where n is the arity of the encapsulated predicate(s). Therefore, a literal of a predicate p/n is encapsulated by a literal of m/(n + 1). An encapsulation e(C) of a definite clause $C = \{L_1, ..., L_n\}$ is the set of encapsulations of literals in C, $\{e(L_1), ..., e(L_n)\}$. An encapsulation e(Π) of a definite program $\Pi = \{C_1, ..., C_n\}$ is the set of encapsulations of clauses in Π, $\{e(C_1), ..., e(C_n)\}$. Table 4 illustrates encapsulation for first order atoms and clauses, and metarules. Encapsulation of metarules ensures the decidability of unification between metarule literals and literals of first-order clauses (Muggleton and Lin, 2015). Encapsulation of a MIL problem facilitates the efficient and simple construction of the Top program, $\top_e$ (line 2), by resolution as described below.
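The definitions of encapsulation above, together with their inverse, can be sketched in a few lines of Python. This is an illustration only: literals are represented as tuples whose first element is the predicate symbol, clauses as lists of literals and programs as lists of clauses, and all function names are hypothetical rather than taken from Louise.

```python
ENCAPSULATION_SYMBOL = "m"  # chosen arbitrarily, as in the text


def encapsulate_literal(literal):
    """p(s1, ..., sn) becomes m(p, s1, ..., sn): the arity grows by one."""
    return (ENCAPSULATION_SYMBOL, *literal)


def excapsulate_literal(encapsulated):
    """m(p, s1, ..., sn) becomes p(s1, ..., sn)."""
    symbol, *rest = encapsulated
    if symbol != ENCAPSULATION_SYMBOL:
        raise ValueError(f"not an encapsulated literal: {encapsulated!r}")
    return tuple(rest)


def encapsulate_clause(clause):
    return [encapsulate_literal(literal) for literal in clause]


def excapsulate_clause(clause):
    return [excapsulate_literal(literal) for literal in clause]


def encapsulate_program(program):
    return [encapsulate_clause(clause) for clause in program]


def excapsulate_program(program):
    return [excapsulate_clause(clause) for clause in program]


# A first-order clause and a metarule (head first, then body literals).
clause = [("ancestor", "X", "Y"), ("parent", "X", "Y")]
metarule = [("P", "X", "Y"), ("Q", "X", "Y")]
print(encapsulate_clause(clause))
print(encapsulate_clause(metarule))
```

After encapsulation the second-order variables P and Q of the metarule appear as ordinary arguments of m/3, so a metarule literal and a first-order literal can be unified as first-order terms.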
Our implementation of procedures Generalise and Specialise in Louise unifies the encapsulation of each (positive or negative) example atom to the encapsulated head literal of each metarule and resolves the metarule's encapsulated body literals with e(B) and $e(E^+)$. Resolution with $e(E^+)$ permits the derivation of clauses that have body literals with the symbol of a target predicate and therefore the construction of a recursive Top program. Because $e(E^+)$ is a set of ground atoms, each encapsulated body literal with the symbol of a target predicate has a finite refutation sequence so recursive clauses can be derived without resolution entering an infinite recursion. When $e(E^+)$ includes multiple target predicates mutually recursive clauses can be derived. Table 5 lists an example of a Top program with mutually recursive clauses derived from resolution with the encapsulation of the background predicate predecessor/2 in e(B) and the encapsulated examples of the two target predicates, odd/1 and even/1 in $e(E^+)$.
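The two steps described above can be sketched in Python under strong simplifying assumptions: the background knowledge is a set of ground facts, and body literals are resolved by matching against those facts and the positive examples, whereas Louise resolves against arbitrary definite clauses with Prolog's SLD resolution. The metarules are written in the forms of Identity and Tailrec usual in the MIL literature, and every name in the sketch is hypothetical.

```python
def is_var(term):
    return term[0].isupper()  # Prolog convention: variables are capitalised


def match(literal, fact, binding):
    """Extend binding so that literal equals the ground fact, or return None."""
    if len(literal) != len(fact):
        return None
    binding = dict(binding)
    for term, value in zip(literal, fact):
        if is_var(term):
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def prove_body(body, facts, binding):
    """Yield every binding under which all body literals match some fact."""
    if not body:
        yield binding
        return
    first, *rest = body
    for fact in facts:
        extended = match(first, fact, binding)
        if extended is not None:
            yield from prove_body(rest, facts, extended)


def metasubstitutions(example, metarules, facts):
    """Metasubstitutions of the metarules whose instances entail the example."""
    found = set()
    for name, (head, body) in metarules.items():
        start = match(head, example, {})
        if start is None:
            continue
        second_order = sorted({l[0] for l in (head, *body) if is_var(l[0])})
        for b in prove_body(body, facts, start):
            found.add((name, tuple((v, b[v]) for v in second_order)))
    return found


def top_program(pos, neg, background, metarules):
    facts = set(background) | set(pos)   # resolve with B and E+
    top = set()
    for example in pos:                  # Generalise
        top |= metasubstitutions(example, metarules, facts)
    for example in neg:                  # Specialise
        top -= metasubstitutions(example, metarules, facts)
    return top


METARULES = {  # (head, body); assumed forms of Identity and Tailrec
    "identity": (("P", "X", "Y"), [("Q", "X", "Y")]),
    "tailrec": (("P", "X", "Y"), [("Q", "X", "Z"), ("P", "Z", "Y")]),
}
background = [("parent", "a", "b"), ("parent", "b", "c"),
              ("knows", "a", "b"), ("knows", "b", "a")]
pos = [("ancestor", "a", "b"), ("ancestor", "b", "c"), ("ancestor", "a", "c")]
neg = [("ancestor", "b", "a"), ("ancestor", "b", "b")]
top = top_program(pos, neg, background, METARULES)
for metasubstitution in sorted(top):
    print(metasubstitution)
```

In this toy problem Generalise derives instances of both metarules with Q bound to knows, and Specialise removes them because each entails a negative example. The recursive instances of Tailrec are derived because body literals of ancestor/2 are resolved against the positive examples.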
$\top_e$, the result of resolving the body literals of encapsulated metarules with e(B) and $e(E^+)$, is a set of metasubstitutions. Metasubstitutions in $\top_e$ are applied to the
Redundant clauses are removed from the encapsulated Top program by Algorithm 3 (line 3). The set of clauses remaining after reduction, $\top_r$, is then excapsulated and returned as the learned hypothesis, a definition of the target predicates in $E^+$ (line 4). Excapsulation is the opposite process of encapsulation. An excapsulation, x(e(L)) = L, of an encapsulated literal, $e(L) = m(p, s_1, ..., s_n)$, is a first order literal $L = p(s_1, ..., s_n)$. An excapsulation, x(e(C)) = C, of an encapsulated clause $e(C) = \{e(L_1), ..., e(L_n)\}$, is a first order definite clause $C = \{L_1, ..., L_n\}$ where each $L_i$ is the excapsulation of a literal in e(C). An excapsulation, x(e(Π)) = Π, of an encapsulated program, e(Π), is a set of first order definite clauses $\Pi = \{C_1, ..., C_n\}$ where each $C_i$ is the excapsulation of a clause in e(Π).

Plotkin's program reduction
In Algorithm 2, the Top program, $\top_e$, is reduced by Gordon Plotkin's program reduction algorithm, described in (Plotkin, 1972) as Theorem 3.3.1.2, reproduced here as Algorithm 3 in Plotkin's original notation.
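The idea of removing logically redundant clauses can be illustrated with a propositional analogue in Python. The sketch below is not Plotkin's first-order algorithm and not the implementation in Louise: it assumes propositional definite clauses without duplicates, represented as (head, body) pairs, and decides entailment by forward chaining. A clause is treated as redundant when the remaining clauses entail it.

```python
def closure(clauses, facts):
    """All atoms derivable from facts by forward chaining over clauses."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for head, body in clauses:
            if head not in known and set(body) <= known:
                known.add(head)
                changed = True
    return known


def entails(clauses, clause):
    """A definite clause is entailed iff its head follows from its body."""
    head, body = clause
    return head in closure(clauses, body)


def reduce_program(program):
    """Drop each clause that is entailed by the clauses still retained."""
    reduced = list(program)
    for clause in program:
        rest = [c for c in reduced if c != clause]
        if entails(rest, clause):
            reduced = rest
    return reduced


program = [
    ("a", ("b",)),   # a <- b
    ("b", ("c",)),   # b <- c
    ("a", ("c",)),   # a <- c, follows from the two clauses above
    ("a", ("a",)),   # a <- a, a tautology
]
print(reduce_program(program))
```

The result depends on the order in which clauses are considered, so a reduction obtained this way is one irredundant subset of the program and not necessarily the only one.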
In Algorithm 3, Φ Ψ means that "Φ generalises Ψ". The generalisation of Ψ by Φ is considered with respect to a theorem, Th (sic). In the context of Algorithm 2, Th is the union of the encapsulated $E^+$, B and M and the applied Top program. In our implementation of Plotkin's algorithm in Louise, Φ Ψ is true iff Ψ can be derived from Φ by Prolog's SLD-Resolution.
When H does not contain a correct hypothesis, e.g. when $E^+$, $E^-$ have mislabelled examples ("classification noise"), a search-based MIL system must exit with failure and its accuracy is minimal. In the worst case, H is additionally large and the MIL system must perform an exhaustive search before returning with failure. Algorithm 1 constructs as much of $\top$ as possible given the elements of a MIL problem and so returns an approximately correct hypothesis when a correct hypothesis does not exist. In such situations we should expect Louise to outperform Metagol. We formalise this expectation as Experimental Hypothesis 2:
Experimental Hypothesis 2 Louise outperforms Metagol when H does not include a correct hypothesis.
When H or Θ is small, Louise should not have an advantage over Metagol. A special case of this is when H includes a single hypothesis which is, tautologically, the set of clauses in all correct hypotheses, i.e. the Top program. In that special case, Louise and Metagol should perform equally well. We formalise this expectation as Experimental Hypothesis 3:
Experimental Hypothesis 3 Louise and Metagol perform equally when $H = \{\top\}$.
To test these three experimental hypotheses we compare Metagol and Louise on a real-world dataset and two synthetic datasets summarised in table 6. The synthetic Coloured graph dataset can be configured to include "noise" in the form of mislabelled examples and has two variants with a small and large H, marked with (1) and (2) respectively in table 6.
Each learning curve experiment proceeds for k = 100 steps. In each step we sample, at random and without replacement, a proportion, s, of $E^+$, $E^-$ to form a training partition. Remaining examples form the testing partition. The sampling ratio s is taken from the sequence S = ⟨0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9⟩. At each step, we train each learner on the training partition and measure the accuracy of the returned hypothesis on the testing partition and the duration of training in seconds. We set a time limit of 300 sec. for each training step. If a training step exhausts this time limit, we calculate the accuracy of the empty hypothesis on the testing partition. Finally, we return the mean and standard error of the accuracy and duration for the same sampling ratio s at each step.
All experiments were run on a PC with 32 8-core Intel Xeon E5-2650 v2 CPUs clocked at 2.60GHz, with 251 GB of RAM, running Ubuntu 16.04.6. Running each instance of the learning curve experiment (one instance per dataset) occupied one core of the machine at 100% of capacity (experiments were run in parallel as background Linux jobs). The longest-running experiment was on the Coloured Graph with False Negatives dataset with large H (described in section 5.4) and took three days for Metagol (but only a few hours for Louise) to complete. The shortest-running experiment was on the M:tG Fragment dataset (described in section 5.5) and took both systems about 11 minutes to complete. Other experiments were completed in about 8 hours on average.
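The learning curve protocol can be sketched in Python as follows. The sketch is an illustration and not the experiment code used for the results reported here: the train and accuracy callables, the empty_hypothesis argument and all function names are hypothetical, and the time limit is applied after a training call returns instead of interrupting it.

```python
import math
import random
import statistics
import time


def split(examples, ratio, rng):
    """Sample a proportion of examples without replacement; the rest is for testing."""
    size = round(ratio * len(examples))
    chosen = set(rng.sample(range(len(examples)), size))
    train = [e for i, e in enumerate(examples) if i in chosen]
    test = [e for i, e in enumerate(examples) if i not in chosen]
    return train, test


def mean_and_stderr(values):
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, 0.0
    return mean, statistics.stdev(values) / math.sqrt(len(values))


def learning_curve(pos, neg, train, accuracy, empty_hypothesis,
                   ratios, steps, time_limit, seed=0):
    rng = random.Random(seed)
    accuracies = {s: [] for s in ratios}
    durations = {s: [] for s in ratios}
    for _ in range(steps):
        for s in ratios:
            train_pos, test_pos = split(pos, s, rng)
            train_neg, test_neg = split(neg, s, rng)
            start = time.perf_counter()
            hypothesis = train(train_pos, train_neg)
            elapsed = time.perf_counter() - start
            if elapsed > time_limit:
                # Simplification: score the empty hypothesis after the fact.
                hypothesis = empty_hypothesis
            accuracies[s].append(accuracy(hypothesis, test_pos, test_neg))
            durations[s].append(elapsed)
    return {s: {"accuracy": mean_and_stderr(accuracies[s]),
                "seconds": mean_and_stderr(durations[s])} for s in ratios}


# Toy run with a placeholder learner that always scores perfectly.
curve = learning_curve(
    pos=list(range(10)), neg=list(range(10, 20)),
    train=lambda p, n: set(p), accuracy=lambda h, p, n: 1.0,
    empty_hypothesis=set(), ratios=[0.1, 0.5], steps=3, time_limit=300.0)
print(curve[0.5]["accuracy"])
```

A faithful reproduction of the protocol would need to interrupt a learner that exceeds the limit, for example by running it in a separate process, which this sketch omits for brevity.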

A note on metarule selection
In MIL practice, metarules are typically selected manually, according to user intuition or domain knowledge, although minimal sets of metarules for language fragments such as $H^2_2$ (see section 3.1.2) have been identified, e.g. in (Cropper and Muggleton, 2015; Cropper and Tourret, 2018). For the experiments described in the following sections, we have manually selected metarules as follows.
For the Coloured Graph (section 5.4) and M:tG Fragment (section 5.5) datasets where Θ was known, we extracted metarules from the clauses of Θ with Louise's metarule extraction module. This defines Prolog predicates to "lift" sets of program clauses to the second order, by variabilisation of their predicate symbols and constants, and encapsulate them as metarules.
For the Grid World dataset in section 5.3, where Θ was not known, we initially selected the Chain metarule (table 3), that represents transitivity, such as the relation between consecutive moves over contiguous "cells" in a grid world, reflecting our intuition about the likely structure of Θ. Algorithm 1 can construct recursive instances of metarules without restriction, but Metagol imposes a lexicographic ordering on the predicate symbols in metasubstitutions (Muggleton and Lin, 2015) which precludes recursive instances of Chain and in general requires recursive metarules to be specified explicitly. Adding one metarule for each recursive variant of Chain would increase the size of H and penalise Metagol's time complexity; but omitting any recursive metarules would penalise the expressivity of L only for Metagol. We elected to add the tail-recursive version of Chain, Tailrec (table 3), as the only explicitly recursive metarule, by way of a compromise. Finally, we defined three variants of Chain, listed in table 7, each with one or two body literals of arity 3, to allow the use of higher-order moves defined as arity-3 predicates.

Experiment 1: Grid world
We create a generator for navigation problems where an agent must move to a goal location on an empty grid world represented as a Cartesian plane with the origin at (0, 0) and extending to a point (w, h). Our generator takes as parameters the w, h dimensions of the grid world and generates a) all navigation tasks between pairs of locations in the grid world as atoms of the target predicate, move/2 and b) a set of primitive moves that move the agent up, down, left or right. We define a set of composite moves that each combine two primitive moves and two higher-order moves that repeat a primitive or composite move twice or thrice. To form a MIL problem for this dataset we give all move/2 atoms as positive examples, all primitive, composite and higher-order moves as background knowledge and as metarules Chain and Tailrec from Table 3, and three arity-3 variants of Chain necessary for the use of higher-order moves. No navigation task is impossible on an empty grid world, therefore there are no negative examples. Table 7 illustrates the elements of the MIL problem.
We do not know the target theory for this problem but in preliminary experiments Louise learns a hypothesis of 2,567 clauses from all examples and Metagol a hypothesis equal in size to a small training sample of 5 examples, indicating a large H and Θ. We run our experiment in a 4 × 4 world for only 10 steps after Metagol runs for more than a day when trained on 6 examples in a larger world.
Table 7: Grid world dataset. In navigation tasks and primitive moves each list of the form [R, G, W × H] is a grid world-state listing the location of the agent (R), its goal (G) and the world dimensions W × H. In primitive moves, G is a variable binding to the coordinates of the task's goal (which remains unchanged during a move). In composite and higher-order moves Ss and Gs are variables binding to the world states at the start and end of a move, respectively. In higher-order moves the literal move(M) nondeterministically generates the predicate symbols of primitive and composite moves. In variants of Chain, existentially quantified variables {Q, R} of literals with arity 3 can only take values from the set of predicate symbols of higher-order moves that also have arity 3, whereas existentially quantified variables {M, $M_1$, $M_2$} of literals with arity 2 can only take values from the set of symbols of primitive and composite moves that have arity 2. For example, in the first body literal in Tri-Chain 1, a possible metasubstitution is {Q/double_move, M/move_down} resulting in a literal double_move(move_down, x, z) i.e. a double-move downwards.

Grid world -results
Figures 2a and 3a plot the accuracy and training time results of the Grid world experiment, respectively. Louise quickly learns a correct hypothesis that generalises well on the testing partition whereas Metagol exhausts the training time limit of 300 sec. early in the experiment, when the training partition includes only 62 examples. This confirms Experimental Hypothesis 1.
Table 8: Target theory and (partial) BK definitions of Coloured graph datasets.

Experiment 2: Coloured graph
To test Experimental Hypothesis 2 we create a generator for MIL problems with a definition of the predicate connected/2, illustrated in table 8, as a target theory, representing the connectedness relation on a directed, acyclic, two-colour graph. Our generator can produce three datasets with different kinds of mislabelled examples: False Positives (with negative examples mislabelled as positive), False Negatives (with positive examples mislabelled as negative) and Ambiguities (with examples simultaneously labelled positive and negative). A fourth dataset, No Noise, is noise-free. We "label" examples as positive or negative by inclusion in $E^+$ or $E^-$, respectively. Table 9 outlines the mislabelling process.
To create a MIL problem for each dataset we begin by generating all positive and negative atoms of connected/2 forming the initial $E^+$, $E^-$. We select a proportion N of each set of examples at random and without replacement and mislabel them as described above. N = 0.2 for each "noisy" dataset and N = 0 for the No Noise dataset. We give as background knowledge the definitions of the three arity-2 predicates used to define the target theory, ancestor/2, red_parent/2 and blue_parent/2 and additional definitions (omitted for brevity) of the predicates red_child/2, blue_child/2, parent/2, child/2. We give as metarules Identity, Inverse, Stack, Queue from Table 3, that match the clauses of the target theory. The background knowledge and metarules suffice to reconstruct the target theory, but mislabelled examples allow a correct hypothesis to be formed only for the No Noise problem.
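One way to realise the three kinds of mislabelling is sketched below in Python. The sketch rests on assumptions of our own: mislabelled examples are moved to the other set for False Positives and False Negatives and copied to the other set for Ambiguities, and the function name, the kind labels and the seed parameter are hypothetical. It is not the generator used to produce the datasets.

```python
import random


def mislabel(pos, neg, kind, proportion, seed=0):
    """Return (pos, neg) with a proportion of examples mislabelled."""
    rng = random.Random(seed)
    pos, neg = list(pos), list(neg)
    # Sample without replacement from each set of examples.
    flip_pos = set(rng.sample(pos, round(proportion * len(pos))))
    flip_neg = set(rng.sample(neg, round(proportion * len(neg))))
    moved_pos = [e for e in pos if e in flip_pos]
    moved_neg = [e for e in neg if e in flip_neg]
    if kind == "no_noise":
        return pos, neg
    if kind == "false_positives":    # negatives mislabelled as positive
        return pos + moved_neg, [e for e in neg if e not in flip_neg]
    if kind == "false_negatives":    # positives mislabelled as negative
        return [e for e in pos if e not in flip_pos], neg + moved_pos
    if kind == "ambiguities":        # labelled both positive and negative
        return pos + moved_neg, neg + moved_pos
    raise ValueError(f"unknown kind of noise: {kind}")


pos = [f"p{i}" for i in range(10)]
neg = [f"n{i}" for i in range(10)]
for kind in ("no_noise", "false_positives", "false_negatives", "ambiguities"):
    noisy_pos, noisy_neg = mislabel(pos, neg, kind, 0.2)
    print(kind, len(noisy_pos), len(noisy_neg))
```

With a proportion of 0.2 and ten examples in each set, two examples per set are affected, so the Ambiguities variant leaves four examples labelled both positive and negative.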

Coloured graph -results
Figures 2c and 3c plot the accuracy and training time results of the Coloured graph experiment, respectively. In the three "noisy" datasets a correct hypothesis does not exist in H and so Metagol's accuracy is that of the empty hypothesis (varying according to mislabelled examples). Metagol tests a learned hypothesis against the negative examples only once the hypothesis is completed, then backtracks to try a new hypothesis if the test fails. This causes much backtracking in the False Negatives dataset, so much so that Metagol exhausts the training time limit of 300 sec. for most of the experiment. Louise outperforms Metagol in all but the No Noise dataset, although its performance fluctuates as the chance of processing mislabelled examples increases with the size of the training partition. In the No Noise dataset a short, correct hypothesis exists (the target theory) and Metagol finds it earlier in the experiment than Louise. The hypothesis space for this problem includes many over-general hypotheses formed with predicates other than ancestor/2, which suffices to express the target theory. Additional background predicates may be seen as, in a sense, "redundant" and it is this redundancy that leads to Louise's reduced early accuracy with No Noise.
We repeat the experiment with the redundant predicates removed, leaving ancestor/2 as the only background predicate. Figures 2d and 3d plot the accuracy and training time results, respectively. The size of H is now reduced by several orders of magnitude (see table 6). Metagol's predictive accuracy remains unchanged but it can exhaustively search H and exit with failure much more quickly in the "noisy" datasets. Louise's accuracy improves on the No Noise dataset but deteriorates in the False Negatives dataset. Louise performs worse than the empty hypothesis in the Ambiguities dataset, where the combination of mislabelled positive and negative examples forces Algorithm 1 to form a Top program that entails not only few positive, but also many negative examples.
The results in this section support Experimental Hypothesis 2.

Experiment 3: M:tG Fragment
When each positive example in a MIL problem is entailed by exactly one clause in Θ, H "collapses" to a single correct hypothesis. This permits us to test Experimental Hypothesis 3. Magic: the Gathering (M:tG) is a Collectible Card Game played with cards printed with instructions in a Controlled Natural Language (CNL) for which no complete formal specification is published. We hand-craft a grammar in Definite Clause Grammar form for a simple fragment of the M:tG CNL that includes only expressions beginning with one of the three "keyword actions" destroy, exile and return. We manually extract the rules of the grammar from two sources: a) examples of strings on cards and b) semi-formal specifications of expressions provided in the game's rulebook (Wizards of the Coast LLC, 2018). Such specifications are provided for only a few expressions in the language, most of which are pre-terminals denoting card types (e.g. permanent_type//0 in table 10). Each example string has a single parse tree and so is entailed by exactly one rule in our grammar.
Example strings (see table 10):
ability([return, an, artifact, from, a, graveyard, to, its, 'owner\'s', hand], []).
ability([return, target, planeswalker, to, its, 'owner\'s', hand], []).
ability([return, all, creatures, from, your, graveyard, to, the, battlef...
To set up a MIL problem for this dataset we generate all 1348 strings entailed by our grammar to use as positive examples of the predicate ability/2 (the start symbol of the grammar). We use the 60 nonterminals and pre-terminals in our hand-crafted grammar as background knowledge and use Chain as the only metarule. The 36 productions of our grammar in which the start symbol, ability/2, is the nonterminal on the left-hand side are all instances of Chain, therefore Chain is sufficient to construct a correct representation of our grammar. Examples of the elements of the MIL problem for this dataset are given in table 10. We note that the hypothesis learned by Metagol and Louise in this experiment is exactly the target theory for the M:tG Fragment MIL problem and the size of this target theory is 36 clauses, just over 7 times larger than any program learned by Metagol previously reported in the MIL literature. This further supports Experimental Hypothesis 1. When H is small, even when Θ is large, Louise does not have a clear advantage over Metagol.

Discussion
The results in the previous sections show that Louise outperforms Metagol when Metagol cannot find a correct hypothesis within the training time limit. This is most evident in Experiment 1 (Figures 2a and 3a) where both H and Θ are large and Metagol's search is at its most expensive, and in the noisy datasets in Experiment 2 (Figures 2c, 2d, 3c, 3d) where no correct hypothesis exists in H.
Metagol learns in two stages: first it finds a hypothesis, H, that is not too specific (i.e. $H \wedge B \models E^+$); then it tests H against $E^-$. If H is over-general (i.e. if $H \wedge B \models e^- \in E^-$) Metagol backtracks and searches for a new H. False positives in $E^+$ cause Metagol to find over-general hypotheses that lead to much backtracking, increasing training times early in the False Positives and Ambiguities experiments with a large hypothesis space (Figure 3c). Later in the same experiments, the number of false positives sampled increases and the number of not-too-specific hypotheses diminishes, allowing Metagol to exit quickly with failure. False negatives in $E^-$ cause many hypotheses to appear over-general, causing much backtracking in the False Negatives experiment with a large hypothesis space (Figure 3c False Negatives). In the experiments with a small hypothesis space, the space is small enough that Metagol's search can exit quickly with failure (Figure 3d).
Louise does not test hypotheses for generality and instead returns the best possible Top program without performing a search or backtracking, so its training times stay short with both large and small H (Figures 3c, 3d), with small fluctuations caused by redundancies in $E^+$, $E^-$. Louise's accuracy suffers when H includes many over-general hypotheses because of irrelevant background knowledge (Figure 2c No Noise). However, Louise can complete a learning attempt and return a result in situations where Metagol continues to search for a very long time (Figures 3a, 3c False Negatives). These observations indicate that Louise is better suited than Metagol to learning in large, complex problem domains with classification noise.
6 Conclusions and future work

Conclusions
We have shown that a costly search of the MIL hypothesis space, H, for a correct hypothesis can be replaced by the construction of a Top program, $\top$, the set of clauses in all correct hypotheses, which is itself a correct hypothesis that can be constructed without search, from a finite number of examples and in polynomial time with Algorithm 1.
Table 11: Definite Clause Grammar hypothesis for the $a^n b^n$ language learned by Louise with Dynamic Learning. The definition of predicate '$1' is invented.
We have implemented Algorithm 1 in Prolog as the basis of a new MIL system, called Louise, that learns by Top program construction and reduction. We have compared Louise to the state-of-the-art search-based MIL system, Metagol, and shown that Louise outperforms Metagol when the size of H and the target theory, Θ, are both large, because of Metagol's exponential time complexity, or when the hypothesis space does not include a correct hypothesis. The latter is the case, for example, when a MIL problem includes classification noise, and we have shown that Louise is more robust to certain kinds of noise than Metagol. Louise does not have an advantage over Metagol when H or Θ are small. We have also found, to our surprise, that Metagol can learn a hypothesis 7 times larger than any program previously learned by Metagol, as reported in the MIL literature, when H includes a single hypothesis which is, tautologically, $\top$.

Future work
An important limitation of our approach, demonstrated in section 5.4.1, is that Algorithm 1 is forced to learn an over-general Top program when H includes many over-general hypotheses and there are insufficient negative examples to eliminate over-general clauses. In addition, Plotkin's algorithm may not always remove clauses that are not logically redundant but entail overlapping sets of examples. Louise implements two additional program reduction procedures that address these limitations by selecting subsets of the Top program that comprise correct hypotheses of minimal size (and with clauses entailing non-overlapping sets of examples).
Louise is capable of predicate invention by recursive Top program construction in an incremental learning setting named Dynamic Learning (an example of predicate invention in Louise's Dynamic Learning setting is listed in Table 11). Finally, Louise implements a form of examples invention by semi-supervised learning similar to (Dumancic et al., 2019). We have omitted discussion of these features for the sake of brevity but plan to include them in upcoming work.
As a MIL system, Louise relies on the selection of relevant metarules, which is currently left to user expertise. Selection of strong inductive biases by user expertise (or intuition) is common in machine learning, e.g. in the selection and careful fine-tuning of a neural network architecture, priors in Bayesian learning, kernels in Support Vector Machines, etc. Previous work in the MIL literature has addressed the issue of automatic selection of metarules, e.g. (Cropper and Muggleton, 2015) and (Cropper and Tourret, 2018). Louise includes libraries for metarule extraction from arbitrary Prolog programs (including background knowledge definitions), as described in section 5.2; for metarule generation; and for metarule combination by unfolding. Finally, predicate invention can effectively extend the set of metarules in a MIL problem beyond those given initially by a user, as first noted in (Cropper and Muggleton, 2015) and investigated further in our upcoming work on the Dynamic Learning setting. A more complete discussion of automatic selection of metarules is left for future work.
The observation noted in section 5.5 that when each positive example is entailed by exactly one clause in the target theory, the MIL hypothesis space includes a single program, merits further theoretical and empirical investigation.
We have shown the existence of finite upper bounds on the numbers of examples necessary for Top program construction with Algorithm 1, but we have not derived sample complexity results. Previous work in the MIL literature, e.g. (Cropper and Muggleton, 2016), has derived sample complexity results for a search of H under PAC Learning assumptions (Valiant, 1984) and according to the Blumer Bound (Blumer et al., 1987). Such results can also be derived for Top program construction.
We have situated the Top program construction framework in the context of MIL but a Top program should exist in any ILP setting. A more general description of our framework along these lines remains to be done. Similarly, it should be possible to implement Top program construction in a language other than Prolog, such as Answer Set Programming (ASP). Indeed, MIL has also been implemented in ASP, as hexmil in (Kaminski et al., 2018), and future work should compare our Prolog implementation of Louise against this MIL implementation.
Finally, we are eager to test Louise's mettle on novel experimental applications, particularly real-world applications in domains that have traditionally proven hard for ILP because of the size of H, as e.g. in machine vision.

Lemma 5
Algorithm 1 must process at most $|\top_k|$ positive examples and at most $|L_k \setminus \top^0_k| - |\top_k|$ negative examples before constructing $\top_k$.

Theorem 4
The time complexity of Algorithm 1 is a polynomial function of $|L_k|$.
Proof Let $c = |E^+_k|$. The worst case for the time complexity of Algorithm 1 is when $\top_k = L_k$ and each clause in $\top_k$ entails each positive example in $E^+_k$ (and 0 examples in $E^-$

Fig. 3: Learning curve experiment results (training times). Red circles: Metagol. Blue triangles: Louise. x-axis: size of training partition; y-axis: mean time of a training step. Error bars: standard error.
Composition of positive and negative example sets in Coloured graph datasets. $E^+_m \subseteq E^+$ and $E^-_m \subseteq E^-$ are sets of "mislabelled" examples selected at random and without replacement. Examples are mislabelled by including them in the opposite set of examples. For the Ambiguities dataset, mislabelled examples are included in both $E^+$ and $E^-$. For the false positive and false negative examples, mislabelled examples are removed from one set and added to the other.
Figures 2b and 3b plot the accuracy and training time results, respectively, of the M:tG Fragment experiment. Louise and Metagol learn identical hypotheses (i.e. the Top program) and their accuracy curves coincide. Louise is slightly faster for most of the experiment but its training time "spikes" towards the end of the experiment, likely because of redundancy in the examples set that causes the same clauses to be derived from different examples, multiple times. Metagol only learns a single clause from each example, thereby avoiding this duplication of effort. Even
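The composition of the noisy datasets described above can be sketched as follows. This is a minimal Python illustration of our reading of the caption, not the code used in the experiments: the function name `mislabel`, the `mode` strings and the use of a single count `n` for both mislabelled sets are assumptions made for the example.

```python
import random


def mislabel(positive, negative, n, mode, seed=0):
    """Return (E+, E-) after mislabelling n examples per set (hypothetical helper)."""
    rng = random.Random(seed)
    pos, neg = set(positive), set(negative)
    # E+_m and E-_m: sampled at random and without replacement.
    pos_m = set(rng.sample(sorted(pos), n))
    neg_m = set(rng.sample(sorted(neg), n))
    if mode == "false_positives":
        # Negative examples are removed from E- and added to E+.
        return pos | neg_m, neg - neg_m
    if mode == "false_negatives":
        # Positive examples are removed from E+ and added to E-.
        return pos - pos_m, neg | pos_m
    if mode == "ambiguities":
        # Mislabelled examples end up in both E+ and E-.
        return pos | neg_m, neg | pos_m
    if mode == "no_noise":
        return pos, neg
    raise ValueError(f"unknown mode: {mode}")
```

In this sketch only the Ambiguities setting produces examples that appear in both sets, which is what makes a correct hypothesis impossible for those examples.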

Table 1: Top program construction. $E^+$: positive examples. $E^-$: negative examples. B: background knowledge; M: metarules. Clauses marked with * in the Generalisation step are removed in the Specialisation step because they entail negative examples. The Top program is completed in the Specialisation step.
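The Generalisation and Specialisation steps named in this caption can be illustrated with a small Python sketch. It is not Louise's Prolog implementation: the grandparent/2 problem, the background facts and the function names are hypothetical, only the Chain metarule P(x, y) ← Q(x, z), R(z, y) is used, and clauses are tested against background facts alone, so recursive clauses are out of scope.

```python
from itertools import product

# Hypothetical toy MIL problem; every name and fact here is illustrative.
BACKGROUND = {
    "parent": {("ann", "bob"), ("bob", "cat"), ("bob", "dan")},
    "friend": {("ann", "eve"), ("eve", "cat"), ("cat", "eve"), ("eve", "ann")},
}
POSITIVE = [("grandparent", "ann", "cat"), ("grandparent", "ann", "dan")]
NEGATIVE = [("grandparent", "cat", "ann")]


def chain_entails(clause, example, background):
    """True if the Chain instance (P, Q, R) entails the example w.r.t. background."""
    p, q, r = clause
    target, x, y = example
    if p != target:
        return False
    # Look for a binding of z such that Q(x, z) and R(z, y) both hold.
    return any(a == x and (z, y) in background[r] for a, z in background[q])


def top_program(positive, negative, background):
    """Construct the set of Chain instances kept after both steps."""
    symbols = sorted(background)
    # Generalisation: keep every Chain instance that entails a positive example.
    generalised = {
        (e[0], q, r)
        for e in positive
        for q, r in product(symbols, repeat=2)
        if chain_entails((e[0], q, r), e, background)
    }
    # Specialisation: discard instances that entail any negative example.
    return {
        c for c in generalised
        if not any(chain_entails(c, e, background) for e in negative)
    }
```

With the facts above, the instance built from friend/2 twice entails a positive example but also the negative one, so it is kept by the first step and removed by the second, while the instance built from parent/2 twice survives.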

Table 2: Reduction of the Top program in table 1 by Plotkin's program reduction algorithm (Algorithm 3).

Table 3: Examples of second-order metarules from the MIL literature. As is common in the literature, quantifiers are omitted and quantification is instead denoted by capitalisation; P, Q, R: existentially quantified second-order variables; X, Y: existentially quantified first-order variables; x, y, z: universally quantified first-order variables.
the set of clauses that entail exactly 0 positive examples in $E^+_k$ with respect to $B_k$; let $\top^+_k \subseteq L_k$ be the set of clauses that entail at least one positive example in $E^+_k$ with respect to $B_k$; and let $\top^-_k \subseteq L_k$ be the set of clauses that entail at least one positive example in $E^+_k$ and at least one negative example in $E^-_k$ with respect to $B_k$. Let $\top_k$ be the Top program for $T_k$. Note that
Lemma 6 The cardinalities of $L_k$, $\top_k$ are finite.
Proof $L_k$ is the set of clauses constructible with $p = |P_k|$ predicate symbols and $m = |M_k|$ metarules of at most $k$ body literals. The cardinality of this set is at most $mp^{k+1}$
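The bound in the proof of Lemma 6 is easy to evaluate numerically. The Python sketch below computes it, together with a bound on the number of hypotheses of a given size under the assumption that such a hypothesis is a choice of n clauses from the language; the function names and the input values are illustrative and are not taken from table 6.

```python
def max_language_size(num_metarules: int, num_predicates: int, body_literals: int) -> int:
    """Upper bound m * p**(k + 1) on the number of constructible clauses."""
    return num_metarules * num_predicates ** (body_literals + 1)


def max_hypothesis_space_size(language_size: int, theory_size: int) -> int:
    """Assumed bound |L|**n on hypotheses of n clauses drawn from the language."""
    return language_size ** theory_size


# Illustrative values only: one metarule, 60 predicate symbols, 2 body literals.
language_bound = max_language_size(1, 60, 2)
print(language_bound)  # 216000
# Number of decimal digits in the bound for hypotheses of 36 clauses.
print(len(str(max_hypothesis_space_size(language_bound, 36))))
```

The first quantity is polynomial in the number of predicate symbols for fixed k, whereas the second grows exponentially with the size of the hypothesis, which is the contrast the Top program construction exploits.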

Table 5: Multi-predicate MIL problem for odd/1 and even/1 and mutually recursive hypotheses learned by Louise.
Reduction of a set of clauses (Gordon Plotkin)
Given: A set of clauses $H$.
Return: A reduction, $H'$, of $H$.
1: Set $H'$ to $H$.
2: Stop if every clause in $H'$ is marked [and return $H'$].
3: Choose an unmarked clause, $C$, in $H'$.
4: If $H' \setminus \{C\} \preceq \{C\}$ then change $H'$ to $H' \setminus \{C\}$. Otherwise, mark $C$.
5: Go to (2).
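The marking loop of the reduction procedure above can be sketched in Python for the special case of propositional definite clauses. Treating the redundancy test in step 4 as entailment, checked by forward chaining, is an assumption of this sketch, as are the clause representation (a head paired with a tuple of body atoms) and the function names; Louise's own reduction operates on first-order clauses in Prolog.

```python
def entails(program, clause):
    """True if the program entails the clause (propositional definite clauses)."""
    head, body = clause
    # Assume the body atoms, then forward-chain to a fixpoint.
    known = set(body)
    changed = True
    while changed:
        changed = False
        for h, b in program:
            if h not in known and set(b) <= known:
                known.add(h)
                changed = True
    return head in known


def reduce_program(clauses):
    """Remove clauses entailed by the rest, following steps 1 to 5."""
    reduced = list(dict.fromkeys(clauses))  # step 1, dropping exact duplicates
    marked = set()
    while True:
        unmarked = [c for c in reduced if c not in marked]
        if not unmarked:  # step 2: every remaining clause is marked
            return reduced
        clause = unmarked[0]  # step 3
        rest = [d for d in reduced if d != clause]
        if entails(rest, clause):  # step 4: the clause is redundant
            reduced = rest
        else:
            marked.add(clause)
        # step 5: loop back to step 2
```

For instance, in the program a ← b, b ← c, a ← c the last clause follows from the first two, so the sketch returns only the first two clauses.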

Table 6: Dataset summary. $|B|$: number of BK definitions. $\Theta$: target theory. Grid world $\Theta$ is not known but $|E^+|$ approximates its cardinality. $\max|L|$ is calculated as $|M||B|^{k+1}$, where $k$ is the number of literals in metarules: 3 for Grid world, otherwise 2. $\max|H|$ is calculated as $\max|L|^{|\Theta|}$. See Lemma 6 and Remark 1.
MIL system when the complexity of a search of H is maximised. Metagol's iterative deepening search orders H by hypothesis size and the complexity of its search is maximised when H and the target theory, Θ, are both large, therefore Louise should outperform Metagol when both these conditions hold. We formalise this expectation as Experimental Hypothesis 1:
Experimental Hypothesis 1 Louise outperforms Metagol when H and Θ are large.

Table 10: M:tG fragment dataset: examples of positive example strings and background knowledge comprised of grammar productions in Definite Clause Grammar form. Tokens in square brackets are terminals, other tokens are nonterminals. "−→" can be read as "expands to".
$H \wedge B \models E^+$); then it tests H against $E^-$. If H is over-general (i.e. if $H \wedge B \models e^- \in E^-$) Metagol backtracks and searches for a new H. False positives in $E^+$ cause Metagol to find over-general hypotheses that lead to much backtracking, increasing training times early in the False Positives and Ambiguities experiments with large H (Figure