Inductive general game playing

General game playing (GGP) is a framework for evaluating an agent's general intelligence across a wide range of tasks. In the GGP competition, an agent is given the rules of a game (described as a logic program) that it has never seen before. The task is for the agent to play the game, thus generating game traces. The winner of the GGP competition is the agent that gets the best total score over all the games. In this paper, we invert this task: a learner is given game traces and the task is to learn the rules that could produce the traces. This problem is central to inductive general game playing (IGGP). We introduce a technique that automatically generates IGGP tasks from GGP games. We introduce an IGGP dataset which contains traces from 50 diverse games, such as Sudoku, Sokoban, and Checkers. We claim that IGGP is difficult for existing inductive logic programming (ILP) approaches. To support this claim, we evaluate existing ILP systems on our dataset. Our empirical results show that most of the games cannot be correctly learned by existing systems. The best performing system solves only 40% of the tasks perfectly. Our results suggest that IGGP poses many challenges to existing approaches. Furthermore, because we can automatically generate IGGP tasks from GGP games, our dataset will continue to grow with the GGP competition, as new games are added every year. We therefore think that the IGGP problem and dataset will be valuable for motivating and evaluating future research.


Introduction
General game playing (GGP) [28] is a framework for evaluating an agent's general intelligence across a wide variety of games. In the GGP competition, an agent is given the rules of a game that it has never seen before. The rules are described in a first-order logic-based language called the game description language (GDL) [54]. The rules specify the initial game state, what constitutes legal moves, how moves update the game state, and how the game terminates [4]. Before the game begins, the agent is given a few seconds to think, to process the rules, and devise a game-specific strategy. The agent then starts playing the game, thus generating game traces. The winner of the competition is the agent that gets the best total score over all the games. Figure 1 shows six example GGP games. Figure 2 shows a selection of rules, written in GDL, for the game Rock Paper Scissors. In this paper, we invert the GGP competition task: the learner (a machine learning system) is given game traces and the task is to induce (learn) the rules that could have produced the traces. In other words, the learner must learn the rules of a game by observing others play. This problem is a core part of inductive general game playing (IGGP) [29], the task of jointly learning the rules of a game and playing the game successfully. We focus exclusively on the first task. Once the rules of the game have been learned then existing GGP techniques [25,40,41] can be used to play the games. Figure 3 shows an example IGGP task, described as a logic program, for the game Rock Paper Scissors. In this task, a learner is given a set of ground atoms representing background knowledge (BK) and sets of disjoint ground atoms representing positive (E+) and negative (E−) examples of target concepts. The task is for the learner to induce a set of general rules (a logic program) that explains all of the positive but none of the negative examples.
In this scenario, the examples are observations of the next_score and next_step predicates, and the task is to learn the rules for these predicates, such as the rules shown in Figure 4.

Fig. 4
The GGP reference solution for the Rock Paper Scissors game described as a logic program. Note that the predicates draws, loses, and wins are not given as background knowledge and the learner must discover these.
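As a concrete illustration, the three concepts named in the caption can be sketched in Python (an illustrative rendering, not the GDL reference solution; the BEATS relation and all names here are ours):

```python
# Sketch of the invented concepts wins/loses/draws for Rock Paper Scissors.
# In the GDL reference solution these are logic-program predicates; here we
# render them as Boolean functions over a "beats" relation.

BEATS = {("scissors", "paper"), ("paper", "stone"), ("stone", "scissors")}

def wins(my_move, their_move):
    return (my_move, their_move) in BEATS

def loses(my_move, their_move):
    return (their_move, my_move) in BEATS

def draws(my_move, their_move):
    return my_move == their_move
```

The point of the reference solution is that these three concepts are not in the background knowledge: an ILP system must invent them (or an equivalent) to express the next_score rules compactly.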
The games vary in complexity. Some of the games are turn-taking (Alquerque) while others (Rock Paper Scissors) are simultaneous. Some of the games are classic board games (Checkers and Hex); some are puzzles (Sokoban and Sudoku); some are dilemmas from game theory (Prisoner's Dilemma and Chicken); others are simple implementations of classic video games (Centipede and Tron). Figure 5 lists the 50 games and also shows, for each game, the number of dimensions, the number of players, and, as an estimate of the game's complexity, the number of rules and literals in the GGP reference solution. Each game is described as four relational learning tasks goal, next, legal, and terminal with varying arities, although flattening the dataset to remove function symbols leads to more relations, as illustrated in Figure 3, where the next predicate is flattened to the relations next_score/2 and next_step/2. For each game, we provide (1) training/validation/test data composed of sets of ground atoms in a 4:1:1 split, (2) a type signature file describing the arities of the predicates and the types of their arguments, and (3) a reference solution in GDL. It is important to note that we have not designed these games: the games were designed independently from our IGGP problem without this induction task in mind. Our second contribution is a mechanism to continually expand the dataset. The GGP competition produces new games each year, which provides a continual rich source of challenges to the GGP participants. Our technical contribution allows us to easily add these new games to our dataset. We implemented an automatic procedure for producing a new learning task from a game. When a new game is added to the GGP competition, our system can read the GDL description, generate traces of sample play, and extract an IGGP task from those traces (see Section 4.3 for technical details). This automatic procedure means that our dataset can expand each year as new games are added to the GGP competition.
We again stress that the GGP games were not designed with this induction task in mind. The games were designed to be challenging for GGP systems. Thus, this induction task is based on a challenging "real world" problem, not a task that was designed to be the appropriate level of difficulty for current ILP systems.
Our third contribution is an empirical evaluation of existing ILP approaches, to test our claim that IGGP is difficult for current ILP approaches. We evaluate the classical ILP system Aleph [70] and the more recent systems ASPAL [8], Metagol [14], and ILASP [44]. Although non-exhaustive, these systems cover a breadth of ILP approaches and techniques. We also compare non-ILP approaches in the form of simple baselines and clustering (KNN) approaches. Figure 6 summarises the results. Although some systems can solve some of the simpler games, most of the games cannot be solved by existing approaches. In terms of balanced accuracy (Section 6.1.1), the best performing system, ILASP, achieves 86%. However, in terms of our perfectly solved metric (Section 6.1.2), the best performing system, ILASP, achieves only 40%. Our empirical results suggest that our current IGGP dataset poses many challenges to existing ILP approaches. Furthermore, because of our second contribution, our dataset will continue to grow with the GGP competition, as new games are added every year. We therefore think that the IGGP problem and dataset will be valuable for motivating and evaluating future research.  
Game                    R    L  D  P
Puzzle                 17   60  2  1
Lightboard             18   69  2  2
Knights Tour           18   46  2  1
Sukoshi                19   49  1  2
Walkabout              22   66  2  2
Horseshoe              22   59  2  2
GT Ultimatum           22   67  0  2
Tron                   23   76  2  2
9x Buttons and Lights  24   77  2  1
Hunter                 24   69  2  1
GT Centipede           24   69  0  2
Fizz Buzz              25   74  0  1
Untwisty Corridor      27   68  0  1
Don't Touch            29   84  2  2
Tiger vs Dogs          30   88  2  2
Sheep and Wolf         30   89  2  2
Duikoshi               31   76  2  2
TicTacToe              32   92  2  2
HexForThree            35  130  2  3
Connect 4              36  124  2  4
Breakthrough           36  126  2  2
Centipede              37  134  2  1
Forager                40  106  2  1
Sudoku                 41  101  2  1
Sokoban                41  172  2  1
9x TicTacToe           42  149  2  2
Switches               44  183  2  1
Battle of Numbers      44  134  2  2
Free For All           46  130  2  2
Alquerque              49  134  2  2
Kono                   50  134  2  2
Checkers               52  167  2  2
Pentago                53  188  2  2
Platform Jumpers       62  168  2  2
Pilgrimage             80  240  2  2
Firesheep              85  290  2  2
Farming Quandries      88  451  2  2
TTCC4                  94  301  2  2
Frogs and Toads        97  431  2  2
Asylum                101  273  2  2

Fig. 5 The IGGP dataset. We list the number of rules (clauses) R, the number of literals L, the number of dimensions D, and the number of players P.

The rest of the paper is organised as follows. Section 2 describes related work and further motivates this new problem and dataset. Section 3 describes the IGGP problem, the game description language (GDL) in which GGP games are described, and how IGGP games are Markov games. Section 4 introduces a technique to produce an IGGP task from a GGP game and provides specific details on how we generated our initial IGGP dataset. Section 5 describes the baselines and ILP systems used in the evaluation of current ILP techniques. Section 6 details the results of the evaluation and also describes why IGGP is so challenging for existing approaches. Finally, Section 7 concludes the paper and details future work.

Related work

General game playing
As Björnsson states [4], from the inception of AI, games have played a significant role as a test-bed for advancing the field. Although the early focus was on developing general problem-solving approaches, the focus shifted towards developing problem-specific approaches, such as approaches to play chess [6] or checkers [68] very well. One motivation of the GGP competition is to reverse this shift, so as to encourage work on developing general AI approaches that can solve a variety of problems.
Our motivation for introducing the IGGP problem and dataset is similar. As we will discuss in the next section, there is much work in ILP on learning rules for specific games, or for specific patterns in games. However, there is little work on demonstrating general techniques for learning rules for a wide variety of games (i.e. the IGGP problem). We want to encourage such work by showing that current ILP systems struggle on this problem.

Inducing game rules
Inducing game rules has a long history in ILP, where chess has often been the focus. Bain [2] studied inducing first-order Horn rules to determine the legality of moves in the chess KRK (king-rook-king) endgame, which is similar to the problem of learning the legal predicate in the IGGP games. Bain also studied inducing rules to optimally play the KRK endgame. Other works on chess include Goodacre [30] and Morales [55], who induced rules to play the KRK endgame and rules to describe the fork pattern, and Muggleton et al. [58].
Besides chess, Castillo and Wrobel [7] used a top-down ILP system and active learning to induce a rule for when a square is safe in the game minesweeper. Law et al. [44] used an ASP-based ILP approach to induce the rules for Sudoku and showed that this more expressive formalism allows for game rules to be expressed more compactly.
Kaiser [37] learned the legal moves and the win condition (but not the state transition function) for a variety of board games (breakthrough, connect4, gomoku, pawn whopping, and tictactoe). This system represents game rules as formulas of first-order logic augmented with a transitive closure operator TC; it learns by enumerative search, starting with the guarded fragment before proceeding to full first-order logic with TC. Unusually, their system learns the game rules from videos of correct and incorrect play: before it can start learning the rules, it has to parse the video, converting a sequence of pixel arrays into a sequence of sets of ground atoms.
Relatedly, Grohe and Ritzert [32] also use enumerative search, searching through the space of first-order formulas. They exploit Gaifman's locality theorem to search through a restricted set of local formulas. They show, remarkably, that if the max degree of the Gaifman graph is polylogarithmic in the number n of objects, then the running time of their enumerative learning algorithm is also polylogarithmic in n. This intriguing result does not, however, suggest a practical algorithm as the constants involved are very large.
GRL [31] builds on SGRL [4] and LOCM [10] to learn game dynamics from traces. In these systems, the game dynamics are modelled as finite deterministic automata. They do not learn the legal predicate (determining which subset of the possible moves are available in the current state) or the goal predicate.
As is clear from these works, there is little work in ILP demonstrating general techniques for learning rules for a wide variety of games. This limitation partially motivates the introduction of the IGGP problem and dataset.

Existing datasets
One of our main contributions is the introduction of an IGGP dataset. In contrast to the existing datasets, our dataset introduces many new challenges.

Size and diversity
Our dataset is larger and more diverse than most existing ILP datasets, especially on learning game rules. Commonly used ILP datasets, such as kinship data [34], Michalski trains [42], Mutagenesis [21], Carcinogenesis [71], string transformations [52], and chess positions [57], typically contain a single predicate to be learned, such as eastbound/1 or westbound/1 in the Michalski trains dataset or active/1 in the Mutagenesis dataset. By contrast, our dataset contains 50 distinct games, each described by at least four target predicates, where flattening leads to more relations as illustrated in Figure 3. In addition, whereas some datasets use only dyadic concepts, such as kinship or string transformations, our dataset also requires learning programs with a mixture of predicate arities, such as input_jump/8 in Checkers and next_cell/4 in Sudoku. Learning programs with high-arity predicates is a challenge for some ILP approaches [14,38,24]. Moreover, because of our second main contribution, we can continually and automatically expand the dataset as new games are introduced into the GGP competition. Therefore, our IGGP dataset will continue to expand to include more games.

Inductive bias
Our IGGP games come from the GGP competition. As stated in the introduction, the games were not designed with this induction task in mind. One key challenge posed by the IGGP problem is the lack of inductive bias provided. Most existing work on inducing game rules has assumed as input a set of high-level concepts. For instance, Morales [55] assumed as input a predicate to determine when a chess piece is in check. Likewise, Law et al. [44] assumed high-level concepts such as same_row/2 and same_col/2 as background knowledge when learning whether a Sudoku board was valid. Moreover, most existing ILP work on learning game rules (and learning in general) involves the designers of the system choosing the appropriate representation of the problem for their system. By contrast, in our IGGP problem the representation is fixed: it is the GDL provided by the GGP.
Many existing ILP techniques assume a task-specific language bias, specifying a hypothesis space which contains at least one correct representation of the target concept. When available, language biases are extremely useful, as a smaller hypothesis space can mean that fewer examples and less computational resources are needed by the ILP systems. In many practical situations, however, task-specific language biases are either not available, or are extremely wide, as very little information is known about the structure of the target concept.
In our IGGP dataset we only provide the simplest (or most primitive) low-level concepts, which come directly from the GGP competition, i.e. our IGGP dataset does not provide any task-specific language biases. For each game, the only language bias given is the type schema of each predicate in the language of the background knowledge. For instance, in Sudoku the higher-level concepts of same row and same col are not given. Likewise, to learn the terminal predicate in Connect Four, a learner must learn the concept of a line, which in turn requires learning rules for vertical, horizontal, and diagonal lines. This means that for an approach to solve the IGGP problem in general (and to be able to accept future games without changing its method), it must be able to learn without a game-specific bias, or be able to generate this game-specific bias from the type schemas in the task. In addition, a learner must learn concepts from only primitive low-level background predicates, such as cell(X,Y,Filled). Because such high-level concepts are often reusable, it would be advantageous to perform predicate invention, which has long been a key challenge in ILP [59,60]. Popular ILP systems, such as FOIL [64] and Progol [56], do not support predicate invention, and although recent work [35,61,13] has tackled this challenge, predicate invention is still a difficult problem.

Large programs
Many reference solutions for IGGP games are large, both in terms of the number of literals and the number of clauses. For instance, the GGP reference solution for the goal predicate for Connect Four uses 14 clauses and a total of 72 literals. This solution uses predicate invention to essentially compress the solution, where the auxiliary predicates include the concept of a line, which in turn uses the auxiliary predicates for the concepts of columns, rows, and diagonals. If we unfold the reference solution so as to remove auxiliary predicates, then the total number of literals required to learn a solution for this single predicate easily exceeds 400. However, learning large programs is a challenge for most ILP systems [11], which typically struggle to learn programs with hundreds of clauses or literals.

ILP2016 competition
The work most similar to ours is the ILP 2016 Competition [50]. The ILP 2016 competition was based on a single type of task (with various hand-crafted target hypotheses) aimed at learning the valid moves of an agent as it moved through a grid. In some ways this is similar to our legal tasks, although many tasks required learning invented predicates representing changes in state, similar to our next tasks. By contrast, our IGGP problem and dataset are based on a variety of real games, which we did not design. Furthermore, the ILP 2016 dataset provides restricted inductive biases to aid the ILP systems, whereas we (deliberately) do not give such help.

Model learning
AlphaZero [69] has shown the power of combining tree search with a deep neural network, distilling the search policy into the network. But this technique presupposes that we have been given a model of the game dynamics: we must already know the state transition function and the reward function. Suppose we want to extend AlphaZero-style techniques to domains where we are not given an explicit model of the environment. We would need some way of learning a model of the environment from traces. Ideally, we would like to learn data-efficiently, without needing hundreds of thousands of traces.
Model-free reinforcement learning agents have high sample complexity: they often require millions of episodes before they can learn a reasonable policy. Model-based agents, by contrast, are able to use their understanding of the dynamics of the environment to learn much more efficiently [23,22,33]. Whether, and to what extent, model-based methods are more sample efficient than model-free methods depends on the complexity of the particular MDP. Sometimes, in simple environments, one needs less data to learn a policy than to learn a model. It has also been shown that, for Q-learning, the worst-case asymptotics for model-based and model-free methods are the same [39]. But these qualifications do not, of course, undermine the claim that in complex environments that require anticipation or planning, a model-based agent will be significantly more sample-efficient than its model-free counterpart.
The IGGP dataset was designed to test an agent's ability to learn a model that can be useful in planning. The most successful GGP algorithms, e.g. Cadiaplayer [25], Sancho [40], and WoodStock [41], use Monte Carlo Tree Search (MCTS). MCTS relies on an accurate forward model of the Markov decision process. The further into the future we search, the more important it is that our forward model is accurate, as errors compound. To avoid having to give our MCTS agents a hand-coded model of the game dynamics, the agents must be able to learn an accurate model of the dynamics from a handful of behaviour traces.
Two things make the IGGP dataset an appealing task for model learning. First, hundreds of games have already been designed for the GGP competition, with more being added each year. Second, each game comes with 'ground truth': a set of rules that completely describes the game. From these rules, we know the learning problem is solvable, and we have a good measure of how hard it is (by measuring the complexity of the ground-truth program).

IGGP dataset
In this section, we describe the Game Description Language (GDL) in which GGP games are described, the IGGP problem setting, and finally an illustrative example of a typical IGGP task.
Game description language
GGP games are described using GDL. This language describes the state of a game as a set of facts and the game mechanics as logical rules. GDL is a variant of Datalog with two syntactic extensions (stratified negation and restricted function symbols) and with a small set of distinguished predicates that have a special meaning [54] (shown in Figure 7).
The first syntactic extension is stratified negation. Standard Datalog (lacking negation altogether) has the useful property that there is a unique minimal model [18]. If we add unrestricted negation, we lose this attractive property: now there can be multiple distinct minimal models. To maintain the property of having a unique minimal model, GDL adds a restricted form of negation called stratified negation [1]. The dependency graph of a set of rules is formed by creating an edge from predicate p to predicate q whenever there is a rule whose head is p(...) and that contains an atom q(...) in the body. The edge is labelled with a negation if the body atom is negated. A set of rules is stratified if the dependency graph contains no cycle that includes a negated edge.
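The stratification test just described can be sketched in Python (a minimal rendering; the encoding of rules as (head, body) pairs with (predicate, negated) body atoms is our own):

```python
# A program is stratified iff its dependency graph has no cycle that
# passes through a negated edge. Rules are (head, body) pairs; each body
# atom is a (predicate, negated?) tuple.

def is_stratified(rules):
    edges = set()  # (p, q, negated): head p depends on body atom q
    for head, body in rules:
        for pred, negated in body:
            edges.add((head, pred, negated))

    def reaches(src, dst):  # is dst reachable from src via any edges?
        seen, stack = set(), [src]
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(q for p, q, _ in edges if p == node)
        return False

    # a negated edge p -> q lies on a cycle iff p is reachable from q
    return not any(neg and reaches(q, p) for p, q, neg in edges)
```

For example, a program defining odd in terms of negated even is stratified, while two predicates that each negate the other are not.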
GDL's second syntactic extension to Datalog is restricted function symbols. The Herbrand base of a standard Datalog program is always finite. If we add unrestricted function symbols, the Herbrand base can be infinite. To maintain the property of having a finite Herbrand base, GDL restricts the use of function symbols in recursive rules [54].
The two syntactic extensions of GDL, stratified negation and restricted function symbols, mean we extend the expressive power of Datalog without essentially changing its key attractive property: there is always a single, finite minimal model [54].

Predicate        Description
distinct(?x,?y)  Two terms are syntactically different
does(?r,?m)      Player ?r performs action ?m in the current game state
goal(?r,?n)      Player ?r has reward ?n (usually a natural number) in the current state
init(?f)         Atom ?f is true in the initial game state
legal(?r,?m)     Action ?m is a legal move for player ?r in the current state
next(?f)         Atom ?f will be true in the next game state
role(?n)         Constant ?n denotes a player
terminal         The current state is terminal
true(?f)         Atom ?f is true in the current game state

Problem setting
We now define the IGGP problem. Our problem setting is based on the ILP learning from entailment setting [65], where an example corresponds to an observation about the truth or falsity of a formula F and a hypothesis H covers F if H entails F. We assume languages of background knowledge and examples each formed of function-free ground atoms. The atoms are function-free because we flatten the GDL atoms. For example, in Figure 9, the atom true(count(9)) has been flattened into true_count(p9). We flatten atoms because some ILP systems do not support function symbols. We likewise assume a language of hypotheses formed of Datalog programs with stratified negation. Stratified negation is not necessary but in practice allows significantly more concise programs, and thus often makes the learning task computationally easier. Note that GDL also supports recursion, but in practice most GGP games do not use recursion. In future work we intend to contribute recursive games to the GGP competition.
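The flattening step can be sketched as follows (an illustrative Python rendering; the encoding of atoms as nested tuples is ours, while the p-prefix for numeric constants follows the true_count(p9) example above):

```python
# Flatten a ground GDL atom with one level of function nesting,
# e.g. true(count(9)) -> true_count(p9), removing the function symbol.

def flatten(atom):
    """atom: ('true', ('count', 9)) -> ('true_count', 'p9')"""
    outer, inner = atom
    functor, *args = inner if isinstance(inner, tuple) else (inner,)
    # numeric constants are rendered with a "p" prefix, as in the dataset
    flat_args = tuple(f"p{a}" if isinstance(a, int) else a for a in args)
    return (f"{outer}_{functor}",) + flat_args
```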
We now define the IGGP input: an IGGP input is a set of triples {(B_i, E_i^+, E_i^-)}, where each B_i is a set of ground atoms representing background knowledge, and E_i^+ and E_i^- are disjoint sets of ground atoms representing positive and negative examples respectively.

An IGGP input forms the IGGP problem: induce a hypothesis (a logic program) H such that, for each triple (B_i, E_i^+, E_i^-), H together with B_i entails every atom in E_i^+ and no atom in E_i^-. Note that a single hypothesis should be consistent with all given triples.

Illustrative example: Fizz Buzz
To give the reader an intuition for the IGGP problem and the GGP games, we now describe example scenarios for the game Fizz Buzz. Although typically a multi-player game, in our IGGP dataset Fizz Buzz is a single-player game. The aim of the game is for the player to replace any number divisible by three with the word fizz, any number divisible by five with the word buzz, and any number divisible by both three and five with fizzbuzz. For example, a game of Fizz Buzz up to the number 17 would go: 1, 2, fizz, 4, buzz, fizz, 7, 8, fizz, buzz, 11, fizz, 13, 14, fizzbuzz, 16, 17. Figures 9, 10, 11, and 12 show example IGGP problems and solutions for the target predicates legal, next, goal, and terminal respectively. For simplicity each example is a single (B, E^+, E^-) triple, although in the dataset each learning task is often a set of multiple triples, where a single hypothesis should explain all the triples. In all cases the BK shown in Figure 8 holds, so we omit it from the individual examples for brevity. Note that the game only runs to the number 31.
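The sequence above follows the usual Fizz Buzz rule; as a quick sketch (plain Python, nothing game-specific):

```python
# Generate the Fizz Buzz sequence up to n, as described in the text.

def fizzbuzz(n):
    out = []
    for i in range(1, n + 1):
        if i % 15 == 0:        # divisible by both three and five
            out.append("fizzbuzz")
        elif i % 3 == 0:
            out.append("fizz")
        elif i % 5 == 0:
            out.append("buzz")
        else:
            out.append(str(i))
    return out
```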

Generating the GGP Dataset
In this section, we describe our procedure to automatically generate IGGP tasks from GGP game descriptions. We first explain how GGP games fit inside the framework of multi-agent Markov decision processes. We also explain the need for a type-signature for each game.
Fig. 9
In this Fizz Buzz scenario, the learner is given positive examples of the legal_say/2 predicate, such as legal_say(player,9) and legal_say(player,buzz). The column H shows the reference GGP solution described as a logic program. In Fizz Buzz, the player can always make three legal moves in any state: saying fizz, buzz, or fizzbuzz. The player can additionally say the current number (the counter).
A GGP game can be viewed as a multi-agent Markov decision process (MDP) defined by a tuple (S, A, T, R), where:

- S is a finite set of states
- A is a finite set of actions
- T is a transition function T : S × A → S
- R is a reward function

We describe these elements in turn for a GGP game.

States
Each state s ∈ S is a set of ground atoms representing fluents (propositions whose truth-value can change from one state to another). The true predicate indicates which fluents are true in the current state. For instance, one state of a best-of-three game of Rock Paper Scissors is: true(score(p1,0)). true(score(p2,2)). true(step(2)). This state represents that the current score is 0 to 2 in favour of player p2, and 2 time-steps have been performed.

Fig. 10
In this Fizz Buzz scenario, the learner is given one positive example of the next_count/1 predicate, one positive example of the next_success/1 predicate, and many negative examples of both predicates. These predicates represent the change of game state. The column H shows the reference GGP solution described as a logic program, which may not necessarily be the most textually compact solution. The next_count/1 relation represents the count in the game. This relation has a single-clause, two-literal definition, which says that the count increases by one after each step in the game. The next_success/1 relation requires two clauses with many literals. This relation counts how many times a player says the correct output. The reference GGP solution for this relation includes the correct/0 predicate, which is not provided as BK but which is reused in both clauses of next_success/1. For an ILP system to learn the reference solution it would need to invent this predicate. Also note that this solution uses negation in the body, including the negation of the invented predicate correct/0.

Actions
Each action a ∈ A is a set of ground atoms representing the set of all joint actions for agents 1..n. The does predicate indicates which agents perform which actions. For instance, one set of joint actions for Rock Paper Scissors is: does(p1,paper). does(p2,stone).

Fig. 11
In this Fizz Buzz scenario the learner is given one positive example of the goal/2 predicate and four negative examples. This predicate represents the reward for a move. In Fizz Buzz the reward is based on the value of true_success/1. The column H shows the reference GGP solution described as a logic program. The reference solution requires five clauses, which means that it would be difficult for ILP systems that only support learning single-clause programs [56,64].

Fig. 12
In this Fizz Buzz scenario the learner is given a single negative example of the terminal/0 predicate. This predicate indicates when the game has finished. In this scenario the game has not terminated.
In the dataset the Fizz Buzz game runs until the count is 31, so the learner must learn a rule such as the one shown in column H.

Transition function
In a stochastic MDP, the transition function T has the signature T : S × A × S → [0, 1]. By contrast, in a deterministic MDP, such as a GGP game, the transition function is T : S × A → S. Given a current state s and a set of actions a, the next predicate indicates which fluents are true in the (unique) next state s′. For instance, in Rock Paper Scissors, given the current state s and actions a above, the next state s′ is: next(score(p1,1)). next(score(p2,2)). next(step(3)).
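The worked example above can be sketched as a deterministic transition function (a Python sketch over flat dictionaries rather than ground atoms; the state and action encodings are ours):

```python
# Deterministic transition T : S x A -> S for the Rock Paper Scissors
# fragment above: the round winner's score increases and the step counter
# advances; a draw leaves the scores unchanged.

BEATS = {("scissors", "paper"), ("paper", "stone"), ("stone", "scissors")}

def transition(state, actions):
    """state: {'score': {player: int}, 'step': int}; actions: {player: move}."""
    (p1, m1), (p2, m2) = sorted(actions.items())
    score = dict(state["score"])
    if (m1, m2) in BEATS:
        score[p1] += 1
    elif (m2, m1) in BEATS:
        score[p2] += 1
    return {"score": score, "step": state["step"] + 1}
```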

Reward function
In a continuous multi-agent MDP, the reward function has the signature R : S → R^n. In a discrete MDP, such as a GGP game, we assume a small fixed set of k discrete rewards {r_1, ..., r_k}, where r_i is not necessarily numeric. Let G[i] be the set of atoms representing that player i has one of the k rewards, and let G = G[1] × ... × G[n] be the joint rewards for agents 1..n. In our GGP dataset, the reward function has the signature R : S → G. Note that, in this framework, learning the reward function becomes a classification problem rather than a regression problem. For example, in the Rock Paper Scissors state above, the reward for state s depends only on the score and is: goal(p1,1). goal(p2,2).

Legal
In the GGP framework, actions are sometimes unavailable. It is not that all possible actions from A can be performed but some of them have no effect; rather, only a subset of actions are available in a particular state. The legal function L determines which actions are available in which states: L : S → 2^A. Recall that an element of A is not an individual action performed by a single player, but rather a set of simultaneous joint actions, one for each player. For example, one element of A is {does(p1,paper), does(p2,stone)}. Note that the availability of an action for one agent does not depend on what other actions are being performed concurrently by other agents; it only depends on the state s.
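Because legality is per-player and depends only on the state, the joint legal set factorises into a Cartesian product; a sketch (the helper and its signature are our own, with an illustrative per-player legality function supplied by the caller):

```python
# L(s) as the Cartesian product of each player's individually legal moves.

from itertools import product

def joint_legal(state, players, legal_moves):
    """legal_moves(state, player) -> list of legal moves for that player.
    Returns all joint actions as {player: move} dictionaries."""
    per_player = [legal_moves(state, p) for p in players]
    return [dict(zip(players, combo)) for combo in product(*per_player)]
```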

Terminal
The GDL language contains a distinguished predicate, the nullary terminal predicate, that indicates when an episode has terminated (i.e. when the game is over).

Preliminaries: the type-signature for a GGP game
In order to calculate the complete set of ground atoms for a game, we use a type signature Σ. The type signature defines the types of constants, functions, and predicates used in the GDL description. Our type signatures include a simple subtyping mechanism for inclusion polymorphism. For example:

true, next :: prop -> bool.
at :: pos -> pos -> cell -> prop.
red, black :: agent.
1, 2, 3, 4, 5 :: pos.
blank :: cell.
agent :> cell.
In this example, true and next are predicates, at is a function that takes an (x, y) coordinate and a cell-type and returns a fluent (prop). A cell is either blank or one of the agents. The expression agent :> cell means that an agent is a subtype of cell.
Let ≼ be the reflexive transitive closure of :>. Let Σ(f) be the type assigned to element f by the signature Σ. Then f(k_1, ..., k_n) is a well-formed term of type t if Σ(f) = t_1 -> ... -> t_n -> t and each k_i is a well-formed term of some type t_i' with t_i' ≼ t_i. Predicates are functions that return a bool, and constants are functions with no arguments. For example, using the type signature above, true(at(3, 4, black)) is a well-formed term of type bool, i.e. a well-formed ground atom.
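The typing rule can be sketched as follows; the tuple encoding of Σ and of terms is an assumption made for illustration:

```python
# Σ as a map from each symbol to (argument types, return type); SUBTYPE_EDGES
# holds the declared subtype pairs, so agent :> cell becomes ("agent", "cell").
SIGMA = {
    "true": (("prop",), "bool"), "next": (("prop",), "bool"),
    "at": (("pos", "pos", "cell"), "prop"),
    "red": ((), "agent"), "black": ((), "agent"),
    "1": ((), "pos"), "2": ((), "pos"), "3": ((), "pos"),
    "4": ((), "pos"), "5": ((), "pos"), "blank": ((), "cell"),
}
SUBTYPE_EDGES = {("agent", "cell")}

def is_subtype(sub, sup):
    """The reflexive transitive closure of the declared :> pairs."""
    return sub == sup or any(
        a == sub and is_subtype(b, sup) for (a, b) in SUBTYPE_EDGES)

def term_type(term):
    """Return the type of a well-formed term, or None. A term is either a
    constant symbol or a tuple (f, arg_1, ..., arg_n)."""
    f, args = (term, ()) if isinstance(term, str) else (term[0], term[1:])
    if f not in SIGMA or len(args) != len(SIGMA[f][0]):
        return None
    for arg, expected in zip(args, SIGMA[f][0]):
        actual = term_type(arg)
        if actual is None or not is_subtype(actual, expected):
            return None
    return SIGMA[f][1]
```

For instance, term_type(("true", ("at", "3", "4", "black"))) yields "bool", since black has type agent and agent is a subtype of cell.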

Automatically generating induction tasks for a GGP game
Given a GGP game Γ written in GDL, and a type signature Σ for that game, our system automatically generates an IGGP induction task. Before presenting the details, we summarise the general approach. To generate the GGP dataset, we built a simple forward-chaining GDL interpreter. We used the GDL interpreter to calculate the initial state, the currently valid moves, the transition function, and the reward. When generating traces, we first calculate the actions that are currently available for each player. Then we let each player choose an action uniformly at random. We record the state trace (s_1, ..., s_n), and extract a set of (B_i, E+_i, E−_i) triples from each trace. The target predicates we wish to learn are legal, next, goal, and terminal. The (B_i, E+_i, E−_i) triples for the predicates legal, goal, and terminal are calculated from a single state, while the triples for next are calculated from a pair of consecutive states (s_i, s_i+1).
We generated multiple traces for each game: 1000 episodes with a maximum of 100 time-steps each. We chose these numbers somewhat arbitrarily, because there is a complex tradeoff in how much data to generate. We want to generate enough data to capture the diversity of a game, so that a learner can (in theory) learn the correct game rules. However, we do not want to generate so much data that every game state is provided, as a learner would then not need to learn anything and could instead simply memorise game situations. Nor do we want to generate so much data that it becomes expensive to compute or store. It is, however, unclear where the boundary between too little and too much data lies. Whether such a boundary even exists is unclear, because by imposing different biases, different learners may need more or less information on the same task. In future work we would like to expand the dataset and repeat the experiments with different amounts of training data.
Our approach is presented in Algorithm 1. Its inputs are Γ, a GGP game written in the GDL language; Σ, a type signature for Γ; max_traces, the number of traces to generate; and max_time, the maximum number of time-steps in a trace. Its output is a set of (B_i, E+_i, E−_i) triples. The procedure generates a number of traces. Each trace is a sequence of game states, and each game state is represented by a set of ground atoms. We use the extract function (described in Section 4.3.1) to produce a set of (B_i, E+_i, E−_i) triples from a trace. We add this set of triples to Λ. At the end, when we have finished all the traces, we return Λ, the set of triples. The variable s stores the current state (a set of ground atoms). Initially, s is set to the initial state: initial(Γ) produces the initial state from the GDL description. Then, for each time-step, we calculate the next state via next(Γ, s). This function involves three steps. First, we calculate the available actions for each player. Second, we let each player take a move uniformly at random. Third, we use the transition function T to calculate the next state from the current state s and the actions of the players. Once we have calculated the new state, we append it to the end of t, where t is a trace, i.e. a sequence of states. Then we check whether the new state is terminal. If it is, we finish the episode; otherwise, we continue for another time-step. Once the episode is finished, we extract the set of (B_i, E+_i, E−_i) triples from the sequence of states, and continue to the next trace. Note that we need the type signature Σ to extract the triples from a trace, but we do not need it to generate the trace itself. For our experiments, we generated 1000 traces for each game, and ran for a maximum of 100 time-steps per game.
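The loop of Algorithm 1 can be sketched as below. The game object's methods (initial, legal, transition, terminal, players) stand in for the GDL interpreter, and the extract parameter stands in for the extract function of Section 4.3.1; both are assumptions of this sketch:

```python
import random

def generate_tasks(game, sigma, extract, max_traces, max_time, seed=0):
    """Generate max_traces random traces and collect the induction
    triples extracted from each one (the set called Λ in the text)."""
    rng = random.Random(seed)
    tasks = set()  # Λ
    for _ in range(max_traces):
        s = game.initial()
        trace = [s]
        for _ in range(max_time):
            # each player picks uniformly at random among its legal moves
            actions = {p: rng.choice(game.legal(s, p)) for p in game.players}
            s = game.transition(s, actions)
            trace.append(s)
            if game.terminal(s):
                break
        tasks |= extract(trace, sigma)
    return tasks
```

Note that the episode ends either when a terminal state is reached or when max_time time-steps have elapsed, matching the text above.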

The extract function
The extract(t, Σ) function in Algorithm 1 takes a trace t = (s_1, ..., s_n) (a sequence of sets of ground atoms) and a type signature Σ, and produces a set of (B_i, E+_i, E−_i) triples. This set of triples represents a set of induction tasks for the distinguished predicates legal, goal, terminal, and next. It is defined as:

extract(t, Σ) = {triple_1(s_i, p) | 1 ≤ i ≤ n, p ∈ {legal, goal, terminal}} ∪ {triple_2(s_i, s_i+1) | 1 ≤ i < n}

Before we define the triple_1 and triple_2 functions, we introduce the relevant notation. If s is a set of ground atoms and p is a predicate, let s_p be the subset of atoms in s that use the predicate p. If Σ is a type signature and p is a predicate, then ground(Σ, p) is the set of all ground atoms generated by Σ that use the predicate p. Given this notation, we define:

triple_1(s, p) = (s, s_p, ground(Σ, p) \ s_p)
triple_2(s_i, s_i+1) = (s_i, s_i+1[true/next], ground(Σ, next) \ s_i+1[true/next])

To calculate the negative instances E−_i, we use the closed-world assumption: all p-atoms not known to be true in E+ are assumed to be false in E−. Given a type signature Σ, we generate the set ground(Σ, p) of all possible ground atoms whose predicate is the distinguished predicate p. For example, in a one-player game, if ground(Σ, legal) = {legal(p1, up), legal(p1, down), legal(p1, left), legal(p1, right)}, and s_legal only contains legal(p1, up) and legal(p1, down), then:

E+ = {legal(p1, up), legal(p1, down)}
E− = {legal(p1, left), legal(p1, right)}

When learning next, we use the facts at the earlier time-step s_i as the background facts, we use the facts at the later time-step s_i+1 as the positive facts E+ to be learned (with the predicate true replaced by next), and we use all the rest of the ground atoms involving next as the negative facts E−. Note, again, the use of the closed-world assumption: we assume all next atoms not known to be in E+ to be in E−.
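A sketch of the two extraction functions, assuming ground atoms are encoded as nested tuples such as ("legal", "p1", "up") and ("true", ("at", 1, 1, "x")); the encoding is ours, not the paper's:

```python
def triple1(s, p, ground_p):
    """Induction triple for legal, goal or terminal from a single state s.
    ground_p is ground(Σ, p): every possible p-atom."""
    pos = frozenset(a for a in s if a[0] == p)   # s_p
    neg = frozenset(ground_p) - pos              # closed-world assumption
    return (frozenset(s), pos, neg)

def triple2(s1, s2, ground_next):
    """Induction triple for next from consecutive states (s1, s2):
    the true atoms of s2 become next atoms in E+."""
    pos = frozenset(("next", a[1]) for a in s2 if a[0] == "true")
    neg = frozenset(ground_next) - pos           # closed-world assumption
    return (frozenset(s1), pos, neg)

# The one-player legal example from the text:
ground_legal = {("legal", "p1", m) for m in ("up", "down", "left", "right")}
state = {("true", ("pos", 1)), ("legal", "p1", "up"), ("legal", "p1", "down")}
B, E_pos, E_neg = triple1(state, "legal", ground_legal)
```

The closed-world assumption appears as the set difference: every possible p-atom that is not observed in the state is emitted as a negative example.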

Baselines and ILP systems
We claim that IGGP is challenging for existing ILP approaches. To support this claim we evaluate existing ILP systems on our IGGP dataset. We compare the ILP systems against simple baselines. We first describe the baselines and then each ILP system.

Baselines
Figure 13 shows the four baselines. Each baseline is a Boolean function f : 2^G × G → {⊤, ⊥}, where G is the set of ground atoms; that is, a function that takes background knowledge and an example atom and returns true (⊤) or false (⊥). In the definitions that follow, ∆ denotes the set of training triples {(B_1, E+_1, E−_1), ..., (B_m, E+_m, E−_m)}, and the syntax a[next/true] means to replace the predicate symbol next with true in the atom a. We describe the baselines in detail. Our first two baselines ignore the training data:
-True deems that every atom is true:

True(B, a) = ⊤
-Inertia is the same as True for atoms with the target predicates goal, legal, and terminal, but for the next predicate an atom is true if and only if the corresponding true atom is in B. For instance, the atom next(at(1,4,x)) is true if and only if true(at(1,4,x)) is in B:

Inertia(B, a) = a[next/true] ∈ B
The intuition behind this baseline is the empirical observation that, in most games, most ground atoms retain their truth value from one time-step to the next. Of course, it is possible to design games in which most or all of the atoms change their truth value at each time-step; but in typical games, such radical changes are unusual.

Our next two baselines consider the training data:

- Mean deems that a testing atom a is true if and only if a is true more often than not in the positive training examples:

Mean(B, a) = |{i | a ∈ E+_i}| > |∆|/2

- KNN_k is based on clustering the data. In KNN_k(B, a) we find the k triples in ∆, denoted κ_k(∆, B), whose backgrounds are most 'similar' to the background B. To assess the similarity of two sets A and B of ground atoms, we look at the size of their symmetric difference:

d(A, B) = |(A \ B) ∪ (B \ A)|

It is straightforward to show that the d function satisfies the conditions for a distance metric:

d(A, B) ≥ 0
d(A, B) = 0 if and only if A = B
d(A, B) = d(B, A)
d(A, C) ≤ d(A, B) + d(B, C)

We set the closest k triples κ_k(∆, B) to be the k triples with the smallest d distance between B_i and B. Given the k closest triples κ_k(∆, B), the KNN baseline outputs ⊤ if a appears in E+ in at least half of the closest k triples. More formally:

KNN_k(B, a) = |{(B_i, E+_i, E−_i) ∈ κ_k(∆, B) | a ∈ E+_i}| ≥ k/2

One potential limitation of the KNN approach is that, in contrast to the ILP approaches, it learns at the propositional level and is unable to learn general first-order rules. To illustrate this limitation, suppose we are trying to learn the target predicate p/1 given the background predicate q/1, and that the underlying target rule is p(X) ← q(X). Suppose there are only two training triples of the form (B, E+, E−), for example ({q(a)}, {p(a)}, {p(b), p(c)}) and ({q(b)}, {p(b)}, {p(a), p(c)}). Given the test triple ({q(c)}, {p(c)}, {p(a), p(b)}), a KNN approach will deem that p(c) is false, because it has not seen a positive instance of this particular ground atom and has no representational resources for generalising.
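The four baselines can be implemented in a few lines. Below, delta is the training data ∆ as a list of (B, E+, E−) triples of atom sets, and the tuple encoding of atoms is an assumption of the sketch:

```python
def true_baseline(B, a):
    return True  # every atom is deemed true

def inertia(B, a):
    # a[next/true] ∈ B: a next atom holds iff the matching true atom held
    return ("true",) + a[1:] in B if a[0] == "next" else True

def mean(delta, a):
    # a is true iff it appears in E+ in more than half of the triples
    return sum(a in e_pos for (_, e_pos, _) in delta) > len(delta) / 2

def d(A, B):
    return len(A ^ B)  # size of the symmetric difference

def knn(delta, B, a, k):
    # output true iff a appears in E+ in at least half of the k triples
    # whose backgrounds are closest to B under the distance d
    closest = sorted(delta, key=lambda t: d(t[0], B))[:k]
    return sum(a in e_pos for (_, e_pos, _) in closest) >= k / 2
```

On the propositional example above, knn finds the training triple whose background is closest to {q(c)}, but neither training triple contains p(c) in E+, so p(c) is deemed false.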

ILP systems
We evaluate four ILP systems on our dataset. It is important to note that we are not trying to directly compare the ILP systems, or to demonstrate that any particular ILP system is better than another. We are instead trying to show that the IGGP problem is challenging for existing systems, and that it (and the dataset) will provide a challenging problem for evaluating future research. Indeed, a direct comparison of ILP systems is often difficult [11], largely because different systems excel at certain classes of problems. For instance, directly comparing the Prolog-based Metagol against ASP-based systems, such as ILASP and HEXMIL [38], is difficult because Metagol is often used to learn recursive list manipulation programs, including string transformations and sorting algorithms [15], whereas many ASP solvers, such as the popular Clingo system [26], disallow explicit lists. Likewise, ASP-based systems can learn non-deterministic specifications represented through choice rules and preferences modelled as weak constraints [48], which is not necessarily the case for Prolog-based systems. In addition, because many of the systems have learning parameters, it is often possible to show that there exist parameter settings for which system X performs better than system Y on a particular dataset. The relative performances of the systems should therefore largely be ignored. We compare the ILP systems Aleph, ASPAL, Metagol, and ILASP. We describe these systems in turn.

Aleph
Aleph is an ILP system written in Prolog based on Progol [56]. Aleph uses the following procedure to induce a logic program hypothesis (paraphrased from the Aleph website):
1. Select an example to be generalised. If none exist, stop; otherwise proceed to the next step.
2. Construct the most specific clause (also known as the bottom clause [56]) that entails the selected example and is within the language restrictions provided.
3. Search for a clause more general than the bottom clause, by searching for some subset of the literals in the bottom clause that has the 'best' score.
4. Add the clause with the best score to the current theory, remove all the examples it makes redundant, and return to step 1.
To restrict the hypothesis space (mainly at step 2), Aleph uses both mode declarations [56] and determinations to denote how and when a literal can appear in a clause. In the mode language, modeh are declarations for head literals and modeb are declarations for body literals. An example modeb declaration is modeb(2,mult(+int,+int,-int)).
The first argument of a mode declaration is an integer denoting how often a literal may appear in a clause. The second argument denotes that the literal mult/3 may appear in the body of a clause and specifies the types of its arguments. The symbols + and − denote whether an argument is an input or an output argument, respectively. Determinations declare which predicates can be used to construct a hypothesis and are of the form determination(TargetName/Arity,BackgroundName/Arity). The first argument is the name and arity of the target predicate. The second argument is the name and arity of a predicate that can appear in the body of such clauses. Typically there will be many determination declarations for a target predicate, corresponding to the predicates thought to be relevant in constructing hypotheses. If no determinations are present, Aleph does not construct any clauses.
Aleph assumes that modes will be declared by the user. For the IGGP tasks this is quite a burden: it would require us to create mode declarations for each game, and would also require some knowledge of the target hypothesis we want to learn. Fortunately, Aleph can extract mode declarations from determinations, and determinations are straightforward to supply: for each target predicate, we supply one determination per background predicate. Therefore, for each game, we give Aleph all the predicates available for that game as determinations and allow Aleph to induce the necessary mode declarations.
There are many parameters in Aleph which greatly influence the output, such as parameters that change the search strategy when generalising a bottom clause (step 3) and parameters that change the structure of learnable programs (such as limiting the number of literals in the bottom clause). We run Aleph using the default parameters. Therefore, there will most likely exist some parameter settings for which Aleph will perform better than we present.

ASPAL
ASPAL [8] is a system for brave induction under the answer set programming (ASP) [51] semantics. Brave induction systems aim to find a hypothesis H such that there is at least one answer set of B ∪ H that covers the examples.
ASPAL works by transforming a brave induction task T into a meta-level ASP program, which we denote by τ(T), such that the answer sets of τ(T) correspond to the inductive solutions of T. The first step of state-of-the-art ASP solvers, such as Clingo [27], is to compute the grounding of the program. Systems which follow this approach therefore have scalability issues with respect to the size of the hypothesis space, as every ground instance of every rule in the hypothesis space (i.e. of every rule that has the potential to be learned) is computed when the ASP solver solves τ(T). Similarly to Aleph, ASPAL has several input parameters which influence the size of the hypothesis space, such as the maximum number of body literals. For most of these we used the default value, but we increased the maximum number of body literals from 3 to 5 and the maximum number of rules in the hypothesis space from 3 to 15. Our initial experiments showed that the maximum number of rules had very little effect on the feasibility of the ASPAL approach (as the size of the grounding of τ(T) is unaffected by this change), whereas the maximum number of body literals can make a significant difference to the size of the grounding of τ(T). It is possible that there is a set of parameters for ASPAL that performs better than those we chose.
Predicate invention is supported in ASPAL by allowing new predicates (which do not occur in the rest of the task) to appear in the mode declarations. This predicate invention is prescriptive rather than automatic, as the schema of the new predicates (i.e. the arity, and argument types) must be specified in the mode declarations. As how to guess the structure of predicates which should be invented is unclear for this problem setting, we did not allow ASPAL to use predicate invention on this dataset. It should be noted that when programs are stratified, hypotheses containing predicate invention can always be translated into equivalent hypotheses with no predicate invention. Of course, as such hypotheses may be significantly longer than the compact hypotheses which are possible through predicate invention, they may require more examples to be learned accurately by ASPAL.
Similarly, although ASPAL does enable learning recursive hypotheses, we did not permit recursion in these experiments. Recursive hypotheses can also be translated into non-recursive hypotheses over finite domains. Our initial experiments with ASPAL showed that, in addition to increasing the size of the hypothesis space, allowing recursion significantly increased the grounding of ASPAL's meta-level ASP program.

Metagol
Metagol [61,13,14] is an ILP system based on a Prolog meta-interpreter. The key difference between Metagol and a standard Prolog meta-interpreter is that whereas a standard Prolog meta-interpreter attempts to prove a goal by repeatedly fetching first-order clauses whose heads unify with a given goal, Metagol additionally attempts to prove a goal by fetching higher-order metarules (Figure 14), supplied as background knowledge, whose heads unify with the goal. The resulting meta-substitutions are saved and can be reused in later proofs. Following the proof of a set of goals, Metagol forms a logic program by projecting the meta-substitutions onto their corresponding metarules.
Metagol is notable for its support for (non-prescriptive) predicate invention and learning recursive programs. Metarules define the structure of learnable programs, which in turn defines the hypothesis space. Deciding which metarules to use for a given task is an unsolved problem [11,17]. To compute the benchmark, we set Metagol to use the same metarules for all games and tasks. This set is composed of 9 derivationally irreducible metarules [16,17], a set of metarules to allow for constants in a program, and a set of nullary metarules (to learn the terminal predicates). Full details on the metarules used can be found in the code repository.
For each game, we allow Metagol to use all the predicates available for that game. We also allow Metagol a primitive form of negation by additionally using the negations of predicates. For instance, in Firesheep we allow Metagol to use the rule not_does_kill(A,B) :- not(does_kill(A,B)). To allow Metagol to induce a program given all triples, we prefix each atom with an extra argument to denote which triple the atom belongs to. For instance, in the first minimal even triple, the atom does_choose(player,1) becomes does_choose(triple1,player,1), and in the second triple the same atom becomes does_choose(triple2,player,1). To account for this extra argument, we also add an extra argument to each literal in a metarule.
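The transformation that prefixes each atom with its triple identifier can be sketched as follows (the tuple encoding of atoms is an assumption of the sketch):

```python
def tag_atoms(triple_id, atoms):
    """Insert an extra first argument recording which (B, E+, E-) triple
    each ground atom (name, arg_1, ..., arg_n) belongs to."""
    return {(a[0], triple_id) + a[1:] for a in atoms}

# does_choose(player,1) in the first triple becomes
# does_choose(triple1,player,1):
tagged = tag_atoms("triple1", {("does_choose", "player", 1)})
```

Applying the same transformation with a different identifier to each triple keeps atoms from different triples distinct, so a single Metagol run can learn from all triples at once.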

ILASP
ILASP (Inductive Learning of Answer Set Programs) [44,45,46] is a collection of ILP systems which are capable of learning ASP programs consisting of normal rules, choice rules, and hard and weak constraints. Unlike many other ILP approaches, ILASP guarantees the computation of an optimal inductive solution (where optimality is defined in terms of the length of a hypothesis). Similarly to ASPAL, early ILASP systems, such as ILASP1 [44] and ILASP2 [46], work by representing an ILP task (i.e. every example and every rule in the hypothesis space) as a meta-level ASP program whose optimal answer sets correspond to the optimal inductive solutions of the task. The ILASP systems each target learning unstratified ASP programs with normal rules, choice rules, and both hard and weak constraints. Therefore, the stratified normal logic programs targeted in this paper do not require the full generality of ILASP; in fact, on this dataset, the meta-level ASP programs used by both ILASP1 and ILASP2 are isomorphic to the meta-level program used by ASPAL.
ILASP2i [47] addresses the scalability with respect to the number of examples by iteratively computing a subset of the examples, called relevant examples, and only representing the relevant examples in the ASP program. In each iteration, ILASP2i uses ILASP2 to find a hypothesis H that covers the set of relevant examples and then searches for a new relevant example which is not covered by H. When no further relevant examples exist, the computed H is guaranteed to be an optimal inductive solution of the full task.
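The relevant-example loop is a form of counterexample-guided refinement. A schematic version is sketched below, with learn standing in for ILASP2 and covers for the coverage check (both are assumptions of the sketch):

```python
def ilasp2i_loop(examples, learn, covers):
    """Iteratively grow a set of relevant examples: learn a hypothesis
    from the current relevant set, then add any example it fails to
    cover. Terminate when the hypothesis covers every example."""
    relevant = []
    while True:
        H = learn(relevant)
        uncovered = [e for e in examples if not covers(H, e)]
        if not uncovered:
            return H  # no further relevant examples exist
        relevant.append(uncovered[0])  # a new relevant example
```

The benefit is that the expensive learner only ever sees the (typically small) relevant subset, while coverage of the full example set is checked cheaply in each iteration.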
Although ILASP2i significantly improves on the scalability of ILASP1 and ILASP2 with respect to the examples, on tasks with large hypothesis spaces ILASP2i still suffers from the same grounding bottleneck as ASPAL, ILASP1, and ILASP2. As the size of the hypothesis space is one of the major challenges of the dataset in this paper, ILASP2i would likely not perform significantly better than ASPAL. To scale the ILASP framework up to the GGP dataset, we used an extended version of ILASP2i which computes, at each iteration, a relevant hypothesis space using the type signature and the current set of relevant examples, and then uses ILASP2 to solve a learning task with the current relevant examples and relevant hypothesis space. Throughout the rest of the paper, we refer to this extended ILASP algorithm as ILASP*. Specifically, rules that entail negative examples, or that do not cover at least one relevant positive example, are omitted from the relevant hypothesis space. A rule is also omitted if there is another rule which is shorter and covers the same (or more) relevant positive examples. Similarly to ASPAL, ILASP* takes a parameter for the maximum number of body literals. Our preliminary experiments showed that the method for computing the relevant hypothesis space performed best with this parameter set to 5, so this value was used for the experiments.
The construction of a relevant hypothesis space was made significantly easier by forbidding recursion and predicate invention in ILASP * . Although the standard ILASP algorithms do support recursion and (prescriptive) predicate invention, these two features mean that the usefulness of a rule in covering examples cannot be evaluated independently, and thus constructing the relevant hypothesis space is much more challenging. In future work, we hope to generalise the method of relevant hypothesis space construction to relax these two constraints.

Results
We now describe the results of running the baselines and ILP systems on our dataset. All the experimental data is available at https://github.com/andrewcropper/mlj19-iggp. When running the ILP systems, we allowed each system the same amount of time, 30 minutes, to learn each target predicate.

Evaluation metrics
We use two evaluation metrics: balanced accuracy and perfectly solved.

Balanced accuracy
In our dataset the majority of examples are negative. To account for this class imbalance, we use balanced accuracy [5] to evaluate the approaches. Given background knowledge B, disjoint sets of positive E+ and negative E− testing examples, and a logic program H, we define the number of positive examples as p = |E+|, the number of negative examples as n = |E−|, the number of true positives as tp = |{e ∈ E+ | B ∪ H |= e}|, the number of true negatives as tn = |{e ∈ E− | B ∪ H ⊭ e}|, and the balanced accuracy as ba = (tp/p + tn/n)/2.
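The metric can be sketched as follows, with predict(e) standing in for the entailment test B ∪ H |= e:

```python
def balanced_accuracy(e_pos, e_neg, predict):
    """(tp/p + tn/n) / 2, where predict(e) stands in for B ∪ H |= e."""
    tp = sum(1 for e in e_pos if predict(e))      # true positives
    tn = sum(1 for e in e_neg if not predict(e))  # true negatives
    return (tp / len(e_pos) + tn / len(e_neg)) / 2

# A hypothesis that accepts everything scores 0.5, however many
# negative examples there are, which is why plain accuracy would be
# misleading on this class-imbalanced dataset:
ba_accept_all = balanced_accuracy([1, 2], [3, 4, 5, 6], lambda e: True)
```

Plain accuracy on the same data would be 2/6 for a reject-everything hypothesis but 4/6 for accept-everything; balanced accuracy scores both at 0.5.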

Perfectly solved
We also consider a perfectly solved metric: the number (or percentage) of tasks that an approach solves with 100% accuracy. The perfectly solved metric is important in IGGP because we know that every game has at least one perfect solution: the GDL description from which the traces were generated is a perfectly accurate model of the deterministic MDP. Perfect accuracy matters because even slightly inaccurate models compound their errors as the game progresses. Figure 15 summarises the results and shows, for each approach, the balanced accuracy and the percentage of perfectly solved tasks. The full results are in the appendix. As the results show, the ILP and KNN approaches perform better than the simple baselines (True, Inertia, and Mean). In terms of balanced accuracy, the KNN approaches often perform better than the ILP systems. However, in terms of the important perfectly solved metric, the ILP methods easily outperform the baselines and the KNN approaches. The most successful system, ILASP*, perfectly solves 40% of the tasks. It should be noted that 4% of test cases have no positive instances in either the training set or the test set, meaning that a perfect score can be achieved with the empty hypothesis. Each of the ILP systems achieved a perfect score on these tasks. Without these trivial cases, the score of each system on the perfectly solved metric would be even lower.

Results summary
As Figure 16 shows, in terms of balanced accuracies, the most difficult task is the terminal predicate, although the margin of difference between the predicates is small. As Figure 17 shows, in terms of the important perfectly solved metric, the most difficult task is the next predicate. The mean number of perfectly solved tasks is a measly 3%. Even if we exclude the baselines and only consider the ILP systems, the mean is still only 10%. Figure 18 shows the balanced accuracies for the next predicate on the alphabetically first ten games. This predicate corresponds to the state transition function (Section 4.1). The next atoms are the most difficult to learn, and there is only one of the first ten games, Buttons and Lights, for which any of the methods find a perfect solution. The next predicate is the most difficult to learn because it has the highest mean complexity in terms of the number of dependent predicates in the dependency graph (Section 3.1) in the reference GDL game definitions.

Fig. 15 Results summary. The baseline represents accepting everything. The results show that all of the approaches struggle in terms of the perfectly solved metric (which represents how many tasks were solved with 100% accuracy).

Fig. 18 Balanced accuracies for the next target predicate for the alphabetically first ten games.
In the following sections we analyse the results for each system and discuss the relative limitations of the respective systems on this dataset.

KNN
As Figure 15 shows, the KNN approaches perform well in terms of balanced accuracy but poorly in terms of perfectly solved. Note that KNN_1 occasionally scores higher than KNN_5, which is to be expected, because sometimes looking at additional triples gives misleading information. As already mentioned, the KNN approaches learn at the propositional level. This limitation is evident in the results, which show that the KNN_1 and KNN_5 approaches only perform well when the target predicate can be learned by memorising particular atoms. For some of the simpler games (e.g. Coins), the KNN approach is often able to learn the goal predicate, because the reward can be extracted directly from the value of an internal state variable representing the score. Similarly, the KNN approach sometimes learns the legal predicate when the set of legally valid actions is static and does not depend on the current state. But the KNN approach is not able to perfectly learn the next rules for any of the games in our dataset. In addition, the KNN approaches are expensive to compute: obtaining these results took 3 days on a 3.6 GHz machine.

Aleph
As Figure 15 shows, Aleph performs reasonably well, and outperforms most of the baselines in terms of the perfectly solved metric. However, after inspecting the learned programs, we found that Aleph was rarely learning general rules for the games, and instead typically learned facts to explain the specific examples. In other words, on this task, Aleph tends to learn overly specific programs. There are several potential explanations for this limitation. First, as we stated in Section 5.2.1, we did not provide mode declarations to Aleph, and instead allowed Aleph to infer them from the determinations. Second, we ran Aleph with the default parameters. However, as stated in Section 5.2.1, Aleph has many learning parameters which greatly influence the learning performance.
It is reasonable to assume that Aleph could perform even better with a different set of parameters. Third, to learn a program Aleph must first construct the most specific clause (the bottom clause) that entails an example. However, constructing the bottom clause requires exponential time in the depth of variables in the target theory [56]. Therefore, learning large and complex clauses is intractable.

ASPAL
As Figure 15 shows, ASPAL performs quite poorly on this dataset. It is outperformed by the mean baseline, both in terms of the perfectly solved metric, and the average balanced accuracy. ASPAL timed out on the majority of the test problems, which was caused by the size of the hypothesis space, and therefore the grounding of ASPAL's meta-level ASP program. It is possible that by using different parameters to control the size of the hypothesis space, or using a different representation of the problem, with a smaller grounding, ASPAL could perform better.
The results of ASPAL also help to explain the need for a specialised version of the ILASP algorithm for this dataset. On this constrained problem domain, where we aim to learn only stratified programs (which are guaranteed to have a single answer set), ILASP2 and ASPAL are almost identical in their approaches. Both map the input ILP task into a meta-level ASP program, and use the Clingo ASP solver to find an optimal answer set, corresponding to an optimal inductive solution of the input task. The specialised ILASP* algorithm presented in Section 5.2.4 can overcome this problem in some cases by reducing the size of the hypothesis space being considered, and thus the size of the grounding of the meta-level program. In principle, this specialisation (along with ILASP2i's relevant example method) could be applied to ASPAL, to create an ASPAL*, which would likely have performed better.

Metagol
Although Metagol outperforms the baselines in the perfectly solved metric (34%), it is outperformed in terms of balanced accuracy.
One of the main limitations of Metagol on this dataset is that it only returns a program if that program covers all of the positive examples and none of the negative examples. However, in some of the games, Metagol could learn a single simple rule that explains 99% of the training examples (and perhaps 99% of the testing examples) but may need an additional complex rule to cover the remaining 1%. If this extra rule is too complex to learn, then Metagol will not learn anything. To explore this limitation, we ran a modified version of Metagol that relaxes this constraint by sampling training examples rather than learning from all of them. This stochastic version of Metagol improved balanced accuracy from 69% to 76%. In future work we intend to develop more sophisticated versions of stochastic Metagol.
Metagol can generalise from few examples because of the strong inductive bias enforced by the metarules. However, this strong bias is also a key reason why Metagol struggles to learn programs for many of the games. Given insufficient metarules, Metagol cannot induce the target program. For instance, given only monadic metarules, Metagol can only learn monadic programs. Although there is work studying which metarules to use for monadic and dyadic logics [12,16,17], there is no work on determining which metarules to use for higher-arity logic. Therefore, when computing the benchmarks, Metagol could not learn some of the higher-arity target predicates, such as the next_cell/4 predicate in Sudoku. Similarly Metagol could often not use higher-arity predicates, such as does_move/5 and triplet/6 in Alquerque.
Another issue with the metarules is in that, as described in Section 5.2.3, we used the same set of metarules for all games. This approach is inefficient because in almost all cases this approach meant that we were using irrelevant metarules, which added unnecessary search to the learning task. We expect that a simple preprocessing step to remove unusable metarules would improve learning performance, although probably not by any considerable margin.
Another reason why Metagol struggles to solve certain games is that, as with most ILP systems, it struggles to learn large and complex programs. For Metagol the bottleneck is the size of the target program, because the search space grows exponentially with the number of clauses in the target program [17]. Although there is work on mitigating this issue [13], developing approaches that can learn large and complex programs is a major challenge for MIL and ILP in general [11].

ILASP*
The system with the highest percentage of completely accurate models (see Figure 15) is ILASP*, with 40% of the tasks completely solved. In most of the cases where ILASP* terminated with a solution within the time limit of 30 minutes, a perfect solution was returned. On the rare occasion that ILASP* terminated but learned an imperfect solution, the solution did cover the training examples but performed imperfectly on the test set. For example, the terminal training set for Untwisty Corridor contains no positive examples, so ILASP* returns the empty hypothesis (which entails no atoms, and therefore covers none of the negative examples). However, there is a positive instance of terminal in the test set, meaning that ILASP* (and all other approaches) scores a balanced accuracy of 50% on this problem.
In some cases, the restriction on the number of body literals meant that the task had no solutions. In these unsatisfiable cases, ILASP* returned the hypothesis from the last satisfiable iteration. In principle, the maximum number of body literals could have been iteratively increased until the task became satisfiable, but our initial experiments showed that this made little or no difference to the number of perfectly solved tasks. Some of the unsatisfiable cases may have been caused by the restriction forbidding predicate invention for ILASP* on this dataset: although there will always be an equivalent hypothesis that does not contain predicate invention, the equivalent hypothesis may have rules with more than 5 body literals.
Similarly to the unsatisfiable cases, in the timeout cases, the hypothesis found in ILASP*'s final iteration was used to compute the accuracy. Returning the hypothesis found in the last iteration explains ILASP*'s much higher average balanced accuracy compared to Metagol, which either returns a perfect solution over the test set or no solution at all. ILASP* is able to perfectly solve some tasks that are not perfectly solved by any of the baselines or other ILP systems. One example is the next learning task for Rock Paper Scissors. In this case, the raw hypothesis returned by ILASP* is shown in Figure 19, which is equivalent to the (more readable) hypothesis shown in Figure 20. Note that this hypothesis is slightly more complicated than necessary. If ILASP* had been permitted to use != to check that two player variables did not represent the same player, it is possible that the last three rules would have been replaced with:

    next_score(Player1, Score) :-
        true_score(Player1, Score),
        does(Player1, Move1),
        does(Player2, Move2),
        not beats(Move1, Move2),
        Player1 != Player2.
It is possible to learn hypotheses with != (and other binary comparison operators) in ILASP, but this would have increased the size of the hypothesis space, so in these experiments we only allowed ILASP* to construct hypothesis spaces using the language of the input task. In future work, we may consider extending the relevant hypothesis space construction method to allow binary comparison operators. The increase in the size of the hypothesis space may be outweighed by the fact that the final hypothesis can be shorter, and shorter hypotheses tend to need fewer iterations to learn.

Fig. 19 The raw hypothesis returned by ILASP* for the next learning task for Rock Paper Scissors.

    next_score(p1, Score) :- true_score(p1, Score), does(p2, Action), does(p1, Action).

Fig. 20 A more readable version of the hypothesis returned by ILASP* for the next learning task for Rock Paper Scissors.

Discussion
As Figure 15 shows, most of the IGGP tasks cannot be perfectly learned by existing ILP systems. The best performing system (ILASP*) solves only 40% of the tasks perfectly. Our results suggest that the IGGP problem poses many challenges to existing approaches. As mentioned in Section 4.3, we are unsure whether the dataset contains sufficient training examples for each approach to perfectly solve all of the tasks. Moreover, determining whether there is sufficient data is especially difficult because the different systems employ different biases. However, in most cases the ILP systems simply timed out, rather than learning an incorrect solution. The key issue is that the ILP systems we have considered do not scale to the large problems in the IGGP dataset. In the previous section we discussed the limitations of each system. We now summarise these limitations to help explain what makes IGGP difficult for existing approaches.
Large programs As discussed in Section 2, many reference solutions for IGGP games are large, both in the number of clauses and in the total number of literals they contain. For instance, the GGP reference solution for the goal predicate for Connect Four uses 14 clauses and a total of 72 literals. However, learning large programs is a challenge for most ILP systems [11], which typically struggle to learn programs with hundreds of clauses or literals. Metagol, for instance, struggles to learn programs with more than 8 clauses.

Predicate invention
The reference solution for goal in Connect Four uses auxiliary predicates (goal is defined in terms of lines, which are defined in terms of columns, rows, and diagonals). These auxiliary predicates are not strictly required, as any stratified definition with auxiliary predicates can be translated into an equivalent program with no auxiliary predicates; however, such equivalent programs are often significantly longer. If we unfold the reference solution to remove auxiliary predicates, the resulting equivalent unfolded program contains over 400 literals. For ILP approaches that do not support the learning of programs containing auxiliary predicates (such as Progol, Aleph, and FOIL), it is infeasible to learn such a large program. More modern ILP approaches support predicate invention, enabling the learning of auxiliary predicates which are not in the language of the background knowledge or the examples; however, predicate invention is far from easy, and there are significant challenges associated with it, even for state-of-the-art ILP systems. ASPAL and ILASP support prescriptive predicate invention, where the schema of the auxiliary predicates (i.e. the arity and argument types) must be specified in the mode declarations [43]. By contrast, Metagol supports automatic predicate invention, where Metagol invents auxiliary predicates without the need for user-supplied arities or type information. However, Metagol's approach can still often lead to inefficiencies in the search, especially when multiple new predicate symbols are introduced.
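To illustrate why auxiliary predicates compress a program, a definition along these lines might look as follows. This is our own simplified sketch, not the actual GGP reference solution; the predicate names line/1, row/1, column/1, and diagonal/1 are illustrative:

    % A win is any completed line; a line is a row, column, or diagonal.
    goal(Player, 100) :- line(Player).
    line(Player) :- row(Player).
    line(Player) :- column(Player).
    line(Player) :- diagonal(Player).

Unfolding line into goal replaces one clause with three, and unfolding row, column, and diagonal in turn (each itself defined over many board cells) is what blows the equivalent auxiliary-free program up to over 400 literals.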

Conclusion
In this paper, we have expanded on the Inductive General Game Playing task proposed by Genesereth. We claimed that learning the rules of the GGP games is difficult for existing ILP techniques. To support this claim, we introduced an IGGP dataset based on 50 games from the GGP competition and we evaluated existing ILP systems on the dataset. Our empirical results show that most of the games cannot be perfectly learned by existing systems. The best performing system (ILASP*) solves only 40% of the tasks perfectly. Our results suggest that the IGGP problem poses many challenges to existing approaches. We think that the IGGP problem and dataset will provide an exciting challenge for future research, especially as we have introduced techniques to continually expand the dataset with new games.

Limitations and future work
Better ILP systems Our primary motivation for introducing this dataset is to encourage future research in ILP, especially on general ILP systems able to learn rules for a diverse set of tasks. In fact, we have already demonstrated two advancements in this paper: (1) a stochastic version of Metagol (Section 6.2.4), and (2) ILASP* (Section 5.2.4), which scales up ILASP2 for the GGP dataset. In future work we intend to develop better ILP systems.
More games One of the main advantages of the IGGP problem is that the games are based on the GGP competition. As mentioned in the introduction, the GGP competition produces new games each year. These games are introduced independently from our dataset without any particular ILP system in mind. Therefore, because of our second contribution, we can continually expand the IGGP dataset with these new games. In future work we intend to automate this whole process and to ensure that all the data is publicly available.
More systems We have evaluated four ILP systems (Aleph, ASPAL, Metagol, and ILASP). In future work we would like to evaluate more ILP systems. We would also like to consider non-ILP systems (i.e. systems that may not necessarily learn explicit human-readable rules).
More evaluation metrics We have evaluated ILP systems according to two metrics: balanced accuracy and the percentage of perfectly solved tasks. However, there are other dimensions on which to evaluate the systems. We have not, for instance, considered the learning times of the systems (although they all had the same maximum time to learn during the evaluation).
Nor have we considered the sample complexity of the approaches. In future work it would be valuable to evaluate approaches when varying the number of game traces (i.e. observations) available, so as to identify the most data-efficient approaches.
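For reference, balanced accuracy as used above follows the standard definition, the mean of sensitivity and specificity over the test examples:

    BA = 1/2 * ( TP / (TP + FN)  +  TN / (TN + FP) )

This is why an empty hypothesis on a task with at least one positive test example scores 50%: it classifies every negative correctly (specificity 1) and every positive incorrectly (sensitivity 0).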
More challenges The main challenge in using existing systems on this dataset is the deliberate lack of game-specific language biases, meaning that for many games the hypothesis space that each system must consider is extremely large. This reflects a major current issue in ILP, where systems are often given well-crafted language biases to ensure feasibility; however, this is not the only current challenge in ILP. For example, some ILP approaches target challenges such as learning from noisy data [62,24,49], probabilistic reasoning [19,20,66,3,67], non-determinism expressed through unstratified negation [63,48], and preference learning [46]. Future versions of this dataset could be extended to contain these features.
Competitions SAT competitions have been held since 1992 with the aim of providing an objective evaluation of contemporary SAT solvers [36]. The competitions have significantly contributed to the progress of developing ever more efficient SAT techniques [36]. In addition, the competitions have motivated the SAT community to develop more robust, reliable, and general-purpose SAT solvers (i.e. implementations). We believe that the ILP community stands to benefit from an equivalent competition, to focus and motivate research. We hope that this new IGGP problem and dataset will become a central component in such a competition.