1 Introduction

General game playing (GGP) (Genesereth and Thielscher 2014) is a framework for evaluating an agent’s general intelligence across a wide variety of games. In the GGP competition, an agent is given the rules of a game that it has never seen before. The rules are described in a first-order logic-based language called the game description language (GDL) (Love et al. 2008). The rules specify the initial game state, what constitutes legal moves, how moves update the game state, and how the game terminates (Björnsson 2012). Before the game begins, the agent is given a few seconds of thinking time to process the rules and devise a game-specific strategy. The agent then starts playing the game, thus generating game traces. The winner of the competition is the agent with the best total score over all the games. Figure 1 shows six example GGP games. Figure 2 shows a selection of rules, written in GDL, for the game Rock Paper Scissors.

Fig. 1
figure 1

Sample GGP games described in clockwise order starting from the top left: Alquerque, Chinese Checkers, Eight Puzzle, Farming Quandries, Knights Tour, and Tic Tac Toe

Fig. 2
figure 2

A selection of rules for the game Rock Paper Scissors. The rules are written in the game description language, a variant of Datalog that is usually written in prefix notation. The relation (succ 0 1) means succ(0,1), i.e. 1 is the successor of 0. Variables begin with “?”. The relation (<= (next (step ?n)) (true (step ?m)) (succ ?m ?n)) can be rewritten in Prolog notation as next(step(N)):- true(step(M)),succ(M,N).

In this paper, we invert the GGP competition task: the learner (a machine learning system) is given game traces and the task is to induce (learn) the rules that could have produced the traces. In other words, the learner must learn the rules of a game by observing others play. This problem is a core part of inductive general game playing (IGGP) (Genesereth and Björnsson 2013), the task of jointly learning the rules of a game and playing the game successfully. We focus exclusively on the first task. Once the rules of the game have been learned then existing GGP techniques (Finnsson 2012; Koriche et al. 2016, 2017) can be used to play the games.

Figure 3 shows an example IGGP task, described as a logic program, for the game Rock Paper Scissors. In this task, a learner is given a set of ground atoms representing background knowledge (BK) and sets of disjoint ground atoms representing positive (\(E^+\)) and negative (\(E^-\)) examples of target concepts. The task is for the learner to induce a set of general rules (a logic program) that explains all of the positive but none of the negative examples. In this scenario, the examples are observations of the next_score and next_step predicates, and the task is to learn the rules for these predicates, such as the rules shown in Fig. 4.

Fig. 3
figure 3

An example learning task for the game Rock Paper Scissors. The input is a set of ground atoms representing background knowledge (BK) and sets of ground atoms representing positive (\(E^+\)) and negative (\(E^-\)) examples. In this task, the examples are observations of the next_score and next_step predicates. The task is to learn the rules for these predicates, such as the rules shown in Fig. 4

Fig. 4
figure 4

The GGP reference solution for the Rock Paper Scissors game described as a logic program. Note that the predicates draws, loses, and wins are not given as background knowledge and the learner must discover these

In this paper, we expand on the idea proposed by Genesereth and Björnsson (2013) and we introduce the IGGP problem (Sect.  3.2). Our main claim is that IGGP is difficult for existing inductive logic programming (ILP) techniques, and in Sect. 2 we outline the reasons why we think IGGP is difficult, such as the lack of task-specific language biases. To support our claim, we make three key contributions.

Our main contribution is a new IGGP dataset.Footnote 1 The dataset is based on game traces from 50 games from the GGP competition. The games vary across a number of dimensions, including the number of players (1–4), the number of spatial dimensions (0–2), the reward structure (whether the rewards are zero-sum, cooperative, or orthogonal), and complexity. Some of the games are turn-taking (Alquerque) while others (Rock Paper Scissors) are simultaneous. Some of the games are classic board games (Checkers and Hex); some are puzzles (Sokoban and Sudoku); some are dilemmas from game theory (Prisoner’s Dilemma and Chicken); others are simple implementations of classic video games (Centipede and Tron). Table 1 lists the 50 games and also shows, for each game, the number of dimensions, the number of players, and, as an estimate of the game’s complexity, the number of rules and literals in the GGP reference solution. Each game is described by four relational learning tasks (goal, next, legal, and terminal) with varying arities, although flattening the dataset to remove function symbols leads to more relations, as illustrated in Fig. 3, where the next predicate is flattened to the relations next_score/2 and next_step/2. For each game, we provide (1) training/validation/test data composed of sets of ground atoms in a 4:1:1 split, (2) a type signature file describing the arities of the predicates and the types of their arguments, and (3) a reference solution in GDL. It is important to note that we have not designed these games: the games were designed independently from our IGGP problem without this induction task in mind.

Our second contribution is a mechanism to continually expand the dataset. The GGP competition produces new games each year, which provides a continual rich source of challenges to the GGP participants. Our technical contribution allows us to easily add these new games to our dataset. We implemented an automatic procedure for producing a new learning task from a game. When a new game is added to the GGP competition, our system can read the GDL description, generate traces of sample play, and extract an IGGP task from those traces (see Sect. 4.3 for technical details). This automatic procedure means that our dataset can expand each year as new games are added to the GGP competition. We again stress that the GGP games were not designed with this induction task in mind. The games were designed to be challenging for GGP systems. Thus, this induction task is based on a challenging “real world” problem, not a task that was designed to be the appropriate level of difficulty for current ILP systems.

Table 1 The IGGP dataset. We list the number of rules (clauses) R, the number of literals L, the number of dimensions D, and the number of players P

Our third contribution is an empirical evaluation of existing ILP approaches, to test our claim that IGGP is difficult for current ILP approaches. We evaluate the classical ILP system Aleph (Srinivasan 2001) and the more recent systems ASPAL (Corapi et al. 2011), Metagol (Cropper and Muggleton 2016b), and ILASP (Law et al. 2014). Although non-exhaustive, these systems cover a breadth of ILP approaches and techniques. We also compare non-ILP approaches in the form of simple baselines and clustering (KNN) approaches. Table 2 summarises the results. Although some systems can solve some of the simpler games, most of the games cannot be solved by existing approaches. In terms of balanced accuracy (Sect. 6.1.1), the best performing system, ILASP, achieves 86%. However, in terms of our perfectly solved metric (Sect. 6.1.2), the best performing system, ILASP, achieves only 40%. Our empirical results suggest that our current IGGP dataset poses many challenges to existing ILP approaches. Furthermore, because of our second contribution, our dataset will continue to grow with the GGP competition, as new games are added every year. We therefore think that the IGGP problem and dataset will be valuable for motivating and evaluating future research.

Table 2 Results summary. The baseline represents accepting everything. The results show that all of the approaches struggle in terms of the perfectly solved metric (which represents how many tasks were solved with 100% accuracy)

The rest of the paper is organised as follows. Section 2 describes related work and further motivates this new problem and dataset. Section 3 describes the IGGP problem, GDL (the language in which GGP games are described), and how IGGP games are Markov games. Section 4 introduces a technique to produce an IGGP task from a GGP game and provides specific details on how we generated our initial IGGP dataset. Section 5 describes the baselines and ILP systems used in the evaluation of current ILP techniques. Section 6 details the results of the evaluation and also describes why IGGP is so challenging for existing approaches. Finally, Sect. 7 concludes the paper and details future work.

2 Related work

2.1 General game playing

As Björnsson states (Björnsson 2012), games have, since the inception of AI, played a significant role as a test-bed for advancing the field. Although the early focus was on developing general problem-solving approaches, the focus shifted towards developing problem-specific approaches, such as approaches to play chess (Campbell et al. 2002) or checkers (Schaeffer et al. 1996) very well. One motivation of the GGP competition is to reverse this shift, so as to encourage work on developing general AI approaches that can solve a variety of problems.

Our motivation for introducing the IGGP problem and dataset is similar. As we will discuss in the next section, there is much work in ILP on learning rules for specific games, or for specific patterns in games. However, there is little work on demonstrating general techniques for learning rules for a wide variety of games (i.e. the IGGP problem). We want to encourage such work by showing that current ILP systems struggle on this problem.

2.2 Inducing game rules

Inducing game rules has a long history in ILP, where chess has often been the focus. Bain (1994) studied inducing first-order Horn rules to determine the legality of moves in the chess KRK (king-rook-king) endgame, which is similar to the problem of learning the legal predicate in the IGGP games. Bain also studied inducing rules to optimally play the KRK endgame. Other works on chess include Goodacre (1996), Morales (1996), who induced rules to play the KRK endgame and rules to describe the fork pattern, and Muggleton et al. (2009).

Besides chess, Castillo and Wrobel (2003) used a top-down ILP system and active learning to induce a rule for when a square is safe in the game minesweeper. Law et al. (2014) used an ASP-based ILP approach to induce the rules for Sudoku and showed that this more expressive formalism allows for game rules to be expressed more compactly.

Kaiser (2012) learned the legal moves and the win condition (but not the state transition function) for a variety of board games (breakthrough, connect4, gomoku, pawn whopping, and tictactoe). The system represents game rules as formulas of first-order logic augmented with a transitive closure operator TC; it learns by enumerative search, starting with the guarded fragment before proceeding to full first-order logic with TC. Unusually, the system learns the game rules from videos of correct and incorrect play: before it can start learning the rules, it has to parse the video, converting a sequence of pixel arrays into a sequence of sets of ground atoms.

Relatedly, Grohe and Ritzert (2017) also use enumerative search, searching through the space of first-order formulas. They exploit Gaifman’s locality theorem to search through a restricted set of local formulas. They show, remarkably, that if the max degree of the Gaifman graph is polylogarithmic in the number n of objects, then the running time of their enumerative learning algorithm is also polylogarithmic in n. This intriguing result does not, however, suggest a practical algorithm as the constants involved are very large.

GRL (Gregory et al. 2015) builds on SGRL (Björnsson 2012) and LOCM (Cresswell et al. 2009) to learn game dynamics from traces. In these systems, the game dynamics are modelled as finite deterministic automata. They do not learn the legal predicate (determining which subset of the possible moves are available in the current state) or the goal predicate.

As is clear from these works, there is little work in ILP demonstrating general techniques for learning rules for a wide variety of games. This limitation partially motivates the introduction of the IGGP problem and dataset.

2.3 Existing datasets

One of our main contributions is the introduction of an IGGP dataset. In contrast to existing datasets, our dataset introduces many new challenges.

2.3.1 Size and diversity

Our dataset is larger and more diverse than most existing ILP datasets, especially those for learning game rules. Commonly used ILP datasets, such as kinship data (Hinton 1986), Michalski trains (Larson and Michalski 1977), Mutagenesis (Debnath et al. 1991), Carcinogenesis (Srinivasan et al. 1997), string transformations (Lin et al. 2014), and chess positions (Muggleton et al. 1989), typically contain a single predicate to be learned, such as eastbound/1 or westbound/1 in the Michalski trains dataset or active/1 in the Mutagenesis dataset. By contrast, our dataset contains 50 distinct games, each described by at least four target predicates, where flattening leads to more relations, as illustrated in Fig. 3. In addition, whereas some datasets use only dyadic concepts, such as kinship or string transformations, our dataset also requires learning programs with a mixture of predicate arities, such as input_jump/8 in Checkers and next_cell/4 in Sudoku. Learning programs with high-arity predicates is a challenge for some ILP approaches (Cropper and Muggleton 2016b; Kaminski et al. 2018; Evans and Grefenstette 2018). Moreover, because of our second main contribution, we can continually and automatically expand the dataset as new games are introduced into the GGP competition. Therefore, our IGGP dataset will continue to expand to include more games.

2.3.2 Inductive bias

Our IGGP games come from the GGP competition. As stated in the introduction, the games were not designed with this induction task in mind. One key challenge posed by the IGGP problem is the lack of inductive bias provided. Most existing work on inducing game rules has assumed as input a set of high-level concepts. For instance, Morales (1996) assumed as input a predicate to determine when a chess piece is in check. Likewise, Law et al. (2014) assumed high-level concepts such as same_row/2 and same_col/2 as background knowledge when learning whether a Sudoku board was valid. Moreover, most existing ILP work on learning game rules (and learning in general) involves the system designers choosing an appropriate representation of the problem for their system. By contrast, in our IGGP problem the representation is fixed: it is the GDL representation provided by the GGP competition.

Many existing ILP techniques assume a task-specific language bias that defines a hypothesis space containing at least one correct representation of the target concept. When available, language biases are extremely useful because a smaller hypothesis space can mean that the ILP systems need fewer examples and fewer computational resources. In many practical situations, however, task-specific language biases are either not available or are extremely wide, as very little is known about the structure of the target concept.

In our IGGP dataset we provide only the simplest (most primitive) low-level concepts, which come directly from the GGP competition, i.e. our IGGP dataset does not provide any task-specific language biases. For each game, the only language bias given is the type schema of each predicate in the language of the background knowledge. For instance, in Sudoku the higher-level concepts of same row and same col are not given. Likewise, to learn the terminal predicate in Connect Four, a learner must learn the concept of a line, which in turn requires learning rules for vertical, horizontal, and diagonal lines. This means that for an approach to solve the IGGP problem in general (and to be able to accept future games without changing its method), it must be able to learn without a game-specific bias, or be able to generate this game-specific bias from the type schemas in the task. In addition, a learner must learn concepts from only primitive low-level background predicates, such as cell(X,Y,Filled). If these high-level concepts are reusable, then it would be advantageous to perform predicate invention, which has long been a key challenge in ILP (Muggleton et al. 2012, 2014). Popular ILP systems, such as FOIL (Ross Quinlan 1990) and Progol (Muggleton 1995), do not support predicate invention, and although recent work (Inoue et al. 2013; Muggleton et al. 2015; Cropper and Muggleton 2016a) has tackled this challenge, predicate invention is still a difficult problem.

2.3.3 Large programs

Many reference solutions for IGGP games are large, both in the number of clauses and in the number of literals. For instance, the GGP reference solution for the goal predicate for Connect Four uses 14 clauses and a total of 72 literals. This solution uses predicate invention to essentially compress the solution, where the auxiliary predicates include the concept of a line, which in turn uses the auxiliary predicates for the concepts of columns, rows, and diagonals. If we unfold the reference solution so as to remove auxiliary predicates, then the total number of literals required to learn a solution for this single predicate easily exceeds 400. However, learning large programs is a challenge for most ILP systems (Cropper 2017), which typically struggle to learn programs with hundreds of clauses or literals.

2.3.4 ILP2016 competition

The work most similar to ours is the ILP 2016 competition (Law et al. 2016). The ILP 2016 competition was based on a single type of task (with various hand-crafted target hypotheses) aimed at learning the valid moves of an agent as it moved through a grid. In some ways this is similar to our legal tasks, although many tasks required learning invented predicates representing changes in state, similar to our next tasks. By contrast, our IGGP problem and dataset are based on a variety of real games, which we did not design. Furthermore, the ILP 2016 dataset provides restricted inductive biases to aid the ILP systems, whereas we (deliberately) do not give such help.

2.4 Model learning

AlphaZero (Silver et al. 2017) has shown the power of combining tree search with a deep neural network for distilling a search policy into a neural net. But this technique presupposes that we have been given a model of the game dynamics: we must already know the state transition function and the reward function. Suppose we want to extend AlphaZero-style techniques to domains where we are not given an explicit model of the environment. We would need some way of learning a model of the environment from traces. Ideally, we would like to learn data-efficiently, without needing hundreds of thousands of traces.

Model-free reinforcement learning agents have high sample complexity: they often require millions of episodes before they can learn a reasonable policy. Model-based agents, by contrast, are able to use their understanding of the dynamics of the environment to learn much more efficiently (Džeroski et al. 2001; Duff and Barto 2002; Guez et al. 2012). Whether, and to what extent, model-based methods are more sample efficient than model-free methods depends on the complexity of the particular MDP. Sometimes, in simple environments, one needs less data to learn a policy than to learn a model. It has also been shown that, for Q-learning, the worst-case asymptotics for model-based and model-free methods are the same (Kearns and Singh 1999). But these qualifications do not, of course, undermine the claim that in complex environments that require anticipation or planning, a model-based agent will be significantly more sample-efficient than its model-free counterpart.

The IGGP dataset was designed to test an agent’s ability to learn a model that can be useful in planning. The most successful GGP algorithms, e.g. Cadiaplayer (Finnsson 2012), Sancho (Koriche et al. 2016), and WoodStock (Koriche et al. 2017), use Monte Carlo Tree Search (MCTS). MCTS relies on an accurate forward model of the Markov decision process. The further into the future we search, the more important it is that our forward model is accurate, as errors compound. To avoid having to give our MCTS agents a hand-coded model of the game dynamics, they must be able to learn an accurate model of the dynamics from a handful of behaviour traces.

Two things make the IGGP dataset an appealing task for model learning. First, hundreds of games have already been designed for the GGP competition, with more being added each year. Second, each game comes with ‘ground truth’: a set of rules that completely describe the game. From these rules, we know the learning problem is solvable, and we have a good measure of how hard it is (by measuring the complexity of the ground-truth programFootnote 2).

3 IGGP dataset

In this section, we describe the Game Description Language (GDL) in which GGP games are described, the IGGP problem setting, and finally an illustrative example of a typical IGGP task.

3.1 Game description language

GGP games are described using GDL. This language describes the state of a game as a set of facts and the game mechanics as logical rules. GDL is a variant of Datalog with two syntactic extensions (stratified negation and restricted function symbols) and with a small set of distinguished predicates that have a special meaning (Love et al. 2008) (shown in Fig. 5).

The first syntactic extension is stratified negation. Standard Datalog (lacking negation altogether) has the useful property that there is a unique minimal model (Dantsin et al. 2001). If we add unrestricted negation, we lose this attractive property: now there can be multiple distinct minimal models. To maintain the property of having a unique minimal model, GDL adds a restricted form of negation called stratified negation (Apt et al. 1988). The dependency graph of a set of rules is formed by creating an edge from predicate p to predicate q whenever there is a rule whose head is \(p(\ldots )\) and that contains an atom \(q(\ldots )\) in the body. The edge is labelled with a negation if the body atom is negated. A set of rules is stratified if the dependency graph contains no cycle that includes a negated edge.
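As a rough illustration of this check, the following Python sketch builds the dependency graph for a small hypothetical rule set and tests whether any cycle passes through a negated edge. The rule encoding and the example program are assumptions made for illustration; they are not part of GDL or of the tooling described in this paper.

```python
# Minimal sketch of the stratification check (illustrative only).
# A rule is encoded as (head_predicate, [(body_predicate, is_negated), ...]).
from collections import defaultdict, deque

def is_stratified(rules):
    """True iff no dependency cycle passes through a negated edge."""
    edges = defaultdict(set)      # head predicate -> body predicates it depends on
    negated_edges = []            # (p, q) edges labelled with negation
    for head, body in rules:
        for pred, is_neg in body:
            edges[head].add(pred)
            if is_neg:
                negated_edges.append((head, pred))

    def reaches(src, dst):        # BFS reachability in the dependency graph
        seen, queue = {src}, deque([src])
        while queue:
            node = queue.popleft()
            if node == dst:
                return True
            for nxt in edges[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        return False

    # A cycle through the negated edge p -> q exists iff q can reach p.
    return not any(reaches(q, p) for p, q in negated_edges)

# p(X) :- q(X), not r(X).    r(X) :- p(X).    -- not stratified
print(is_stratified([("p", [("q", False), ("r", True)]),
                     ("r", [("p", False)])]))       # False
```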

GDL’s second syntactic extension to Datalog is restricted function symbols. The Herbrand base of a standard Datalog program is always finite. If we add unrestricted function symbols, the Herbrand base can be infinite. To maintain the property of having a finite Herbrand base, GDL restricts the use of function symbols in recursive rules (Love et al. 2008).

The two syntactic extensions of GDL, stratified negation and restricted function symbols, extend the expressive power of Datalog without sacrificing its key attractive property: there is always a single, finite minimal model (Love et al. 2008).

Fig. 5
figure 5

Main predicates in GDL where variables begin with a “?” symbol

Fig. 6
figure 6

In this Fizz Buzz scenario the learner is given four positive examples of the legal_say/2 predicate and many negative examples. This predicate represents what legal moves a player can make in the game. The column H shows the reference GGP solution described as a logic program. In Fizz Buzz, the player can always make three legal moves in any state, saying fizz, buzz, or fizzbuzz. The player can additionally say the current number (the counter)

3.2 Problem setting

We now define the IGGP problem. Our problem setting is based on the ILP learning from entailment setting (De Raedt 2008), where an example corresponds to an observation about the truth or falsity of a formula F and a hypothesis H covers F if H entails F. We assume languages of background knowledge \({\mathscr {B}}\) and examples \({\mathscr {E}}\), each formed of function-free ground atoms. The atoms are function-free because we flatten the GDL atoms. For example, in Fig. 6, the atom true(count(9)) has been flattened into true_count(p9). We flatten atoms because some ILP systems do not support function symbols. We likewise assume a language of hypotheses \({\mathscr {H}}\) formed of Datalog programs with stratified negation. Stratified negation is not strictly necessary, but in practice it allows significantly more concise programs, and thus often makes the learning task computationally easier. Note that GDL also supports recursion, but in practice most GGP games do not use it. In future work we intend to contribute recursive games to the GGP competition.
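To make the flattening step mentioned above concrete, the following Python sketch shows one way such a transformation could work. The atom encoding and the constant-renaming convention (prefixing numbers with “p”, mirroring true_count(p9) in Fig. 6) are assumptions for illustration, not the exact procedure used to produce the dataset.

```python
# Minimal sketch of flattening a GDL atom into a function-free atom.
# An atom is (predicate, [args]); a nested function symbol is a (name, [args]) tuple.
def flatten(pred, args):
    names, constants = [pred], []
    for arg in args:
        if isinstance(arg, tuple):            # one level of nesting, e.g. count(9)
            inner_name, inner_args = arg
            names.append(inner_name)
            constants.extend(inner_args)
        else:
            constants.append(arg)
    constants = ["p%s" % c if isinstance(c, int) else c for c in constants]
    return "_".join(names), constants

print(flatten("true", [("count", [9])]))      # ('true_count', ['p9'])
print(flatten("does", ["p1", "paper"]))       # ('does', ['p1', 'paper'])
```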

We now define the IGGP input:

Definition 1

(IGGP input) An IGGP input \(\Delta \) is a set of m triples \(\{(B_i,E^+_i,E^-_i)\}^m_{i=1}\) where

  • \(B_i \subset {\mathscr {B}}\) represents background knowledge

  • \(E_i^+\subseteq {\mathscr {E}}\) and \(E_i^-\subseteq {\mathscr {E}}\) represent positive and negative examples respectively.

An IGGP input forms the IGGP problem:

Definition 2

(IGGP problem) Given an IGGP input \(\Delta \), the IGGP problem is to return a hypothesis \(H \in {\mathscr {H}}\) such that \(\text {for all} \;\; (B_i,E^+_i,E^-_i) \in \Delta \) it holds that \(H \cup B_i \models E^+_i\) and \(H \cup B_i \not \models E_i^-\).

Note that a single hypothesis should be consistent with all given triples.
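The following Python sketch spells out this coverage condition. A hypothesis is represented as a function from a set of background atoms to the set of atoms entailed together with that background; in practice this stand-in would be an evaluation of the learned Datalog program, so the encoding here is an assumption for illustration only.

```python
# Minimal sketch of Definition 2: H solves the task iff, for every triple,
# it entails all positive examples and no negative examples.
def solves(hypothesis, triples):
    for bk, pos, neg in triples:
        entailed = hypothesis(bk)
        if not pos <= entailed or entailed & neg:
            return False
    return True

# Hypothetical hypothesis corresponding to the rule p(X) :- q(X).
hyp = lambda bk: {("p", x) for (pred, x) in bk if pred == "q"}
triples = [({("q", "a")}, {("p", "a")}, {("p", "b")})]
print(solves(hyp, triples))                   # True
```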

3.2.1 Illustrating example: Fizz Buzz

To give the reader an intuition for the IGGP problem and the GGP games, we now describe example scenarios for the game Fizz Buzz. Although typically a multi-player game, in our IGGP dataset Fizz Buzz is a single-player game. The aim of the game is for the player to replace any number divisible by three with the word fizz, any number divisible by five with the word buzz, and any number divisible by both three and five with fizzbuzz. For example, a game of Fizz Buzz up to the number 17 would go: 1, 2, fizz, 4, buzz, fizz, 7, 8, fizz, buzz, 11, fizz, 13, 14, fizzbuzz, 16, 17.

Fig. 7
figure 7

In this Fizz Buzz scenario, the learner is given one positive example of the next_count/1 predicate, one positive example of the next_success/1 predicate, and many negative examples of both predicates. These predicates represent the change of game state. The column H shows the reference GGP solution described as a logic program, which is not necessarily the most textually compact solution. The next_count/1 relation represents the count in the game. This relation has a single-clause, two-literal definition, which says that the count increases by one after each step in the game. The next_success/1 relation requires two clauses with many literals. This relation counts how many times a player says the correct output. The reference GGP solution for this relation includes the correct/0 predicate, which is not provided as BK but which is reused in both clauses of next_success/1. For an ILP system to learn the reference solution it would need to invent this predicate. Also note that this solution uses negation in the body, including the negation of the invented predicate correct/0

Figures 6, 7, 8, and 9 show example IGGP problems and solutions for the target predicates legal, next, goal, and terminal respectively. For simplicity each example is a single \((B,E^+,E^-)\) triple, although in the dataset each learning task is often a set of multiple triples, where a single hypothesis should explain all the triples. In all cases the BK shown in Fig. 10 holds, so we omit it from the individual examples for brevity. Note that the game only runs to the number 31.

Fig. 8
figure 8

In this Fizz Buzz scenario the learner is given one example of the goal/2 predicate and four negative examples. This predicate represents the reward for a move. In Fizz Buzz the reward is based on the value of true_success/1. The column H shows the reference GGP solution described as a logic program. The reference solution requires five clauses, which means that it would be difficult for ILP systems that only support learning single-clause programs (Muggleton 1995; Ross Quinlan 1990)

Fig. 9
figure 9

In this Fizz Buzz scenario the learner is given a single negative example of the terminal/0 predicate. This predicate indicates when the game has finished. In this scenario the game has not terminated. In the dataset the Fizz Buzz game runs until the count is 31, so the learner must learn a rule such as the one shown in column H

Fig. 10
figure 10

Common BK for Fizz Buzz

4 Generating the GGP dataset

In this section, we describe our procedure to automatically generate IGGP tasks from GGP game descriptions. We first explain how GGP games fit inside the framework of multi-agent Markov decision processes. We also explain the need for a type-signature for each game.

4.1 Preliminaries: Markov games

GGP games are Markov games (Littman 1994), a strict superset of multi-agent Markov decision processes (MDPs) that allows simultaneous moves.Footnote 3 The four components \((S, A, T, R)\) of an MDP are:

  • S is a finite set of states

  • A is a finite set of actions

  • T is a transition function \(T: S \times A \rightarrow S\)

  • R is a reward function

We describe these elements in turn for a GGP game.

4.1.1 States

Each state \(s \in S\) is a set of ground atoms representing fluents (propositions whose truth-value can change from one state to another). The true predicate indicates which fluents are true in the current state. For instance, one state of a best-of-three game of Rock Paper Scissors is:

$$\begin{aligned}&\texttt {true(score(p1,0)).} \\&\texttt {true(score(p2,2)).} \\&\texttt {true(step(2)).} \\ \end{aligned}$$

This state represents that the current score is 0 to 2 in favour of player p2, and 2 time-steps have been performed.

4.1.2 Actions

Each action \(a \in A\) is a set of ground atoms representing a joint action for agents 1..n. The does predicate indicates which agents perform which actions. For instance, one joint action for Rock Paper Scissors is:

$$\begin{aligned}&\texttt {does(p1,paper).} \\&\texttt {does(p2,stone).} \\ \end{aligned}$$

4.1.3 Transition function

In a stochastic MDP, the transition function T has the signature \(T : S \times A \times S \rightarrow \{0,1\}\). By contrast, in a deterministic MDP, such as a GGP game, the transition function is \(T : S \times A \rightarrow S\). Given a current state s and a set of actions a, the next predicate indicates which fluents are true in the (unique) next state \(s'\). For instance, in Rock Paper Scissors, given the current state s and actions a above, the next state \(s'\) is:

$$\begin{aligned}&\texttt {next(score(p1,1)).} \\&\texttt {next(score(p2,2)).} \\&\texttt {next(step(3)).} \\ \end{aligned}$$

The transition function is a set of definite clauses defining next in terms of true. For instance, the following two clauses define part of the transition function for Rock Paper Scissors:

figure a

4.1.4 Reward function

In a continuous multi-agent MDP, the reward function has the signatureFootnote 4\(R : S \rightarrow {\mathbb {R}}^n\). In a discrete MDP, such as a GGP game, we assume a small fixed set of k discrete rewards \(\{r_1,\dots ,r_k\}\), where \(r_i\) is not necessarily numeric. Let G[i] be the set of atoms representing that player i has one of the k rewards \(G[i] = \{ goal(i, r_j) \mid j = 1 .. k \}\). Let \(G = G[1] \times \cdots \times G[n]\) be the joint rewards for agents 1..n. In our GGP dataset, the reward function has the signature \(R : S \rightarrow G\). Note that, in this framework, learning the reward function becomes a classification problem rather than a regression problem. For example, in the Rock Paper Scissors state above, the reward for state \(s'\) depends only on the score and is:

$$\begin{aligned}&\texttt {goal(p1,1).} \\&\texttt {goal(p2,2).} \\ \end{aligned}$$

4.1.5 Legal

In the GGP framework, actions are sometimes unavailable. It is not that all possible actions from A can be performed but some of them have no effect; rather, only a subset of the actions is available in a particular state.

The legal function L determines which actions are available in which states: \(L : S \rightarrow 2^A\). Recall that an element of A is not an individual action performed by a single player, but rather a joint action: a set of simultaneous actions, one for each player. For example, one element of A is \(\{\texttt {does(p1,paper).} , \texttt {does(p2,stone).} \}\). Note that the availability of an action for one agent does not depend on what other actions are being performed concurrently by other agents; it depends only on the current state.

4.1.6 Terminal

The GDL language contains a distinguished predicate, the nullary terminal predicate, that indicates when an episode has terminated (i.e. when the game is over).

4.2 Preliminaries: the type-signature for a GGP game

In order to calculate the complete set of ground atoms for a game,Footnote 5 we use a type signature \(\Sigma \). The type signature defines the types of constants, functions, and predicates used in the GDL description. Our type signatures include a simple subtyping mechanism for inclusion polymorphism. For example:

figure b

In this example, true and next are predicates, at is a function that takes an (x, y) coordinate and a cell type and returns a fluent (prop). A cell is either blank or one of the agents. The expression agent :> cell means that an agent is a subtype of cell.

Let \(\sqsubseteq \) be the reflexive transitive closure of :> . Let \(\Sigma (f)\) be the type assigned to element f by signature \(\Sigma \). Then \(f(k_1,\ldots , k_n)\) is a well-formed term of type t if:

  • \(\Sigma (f) = (t_1,\ldots , t_n) \rightarrow t\)

  • \(\Sigma (k_i) \sqsubseteq t_i\) for all \(i = 1\ldots n\)

Predicates are functions that return a bool and constants are functions with no arguments. For example, using the type signature above, true(at(3, 4, black)) is a well-formed term of type bool, i.e. a well-formed ground atom.
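The following Python sketch implements this well-formedness check for the example signature above. The encoding of the signature, and the use of int as the coordinate type, are assumptions made for illustration (the actual signature syntax is shown in the figure above).

```python
# Minimal sketch of the well-formedness check for typed ground terms.
SUBTYPES = {("agent", "cell")}        # agent :> cell: an agent is a subtype of cell

def subtype_of(t1, t2):
    """The reflexive transitive closure of :> ."""
    return t1 == t2 or any(a == t1 and subtype_of(b, t2) for a, b in SUBTYPES)

SIG = {                               # symbol: (argument types, result type)
    "true":  (("prop",), "bool"),
    "next":  (("prop",), "bool"),
    "at":    (("int", "int", "cell"), "prop"),
    "black": ((), "agent"),
    "blank": ((), "cell"),
    3: ((), "int"), 4: ((), "int"),
}

def type_of(term):
    """The type of a well-formed term, or None if the term is ill-formed."""
    if not isinstance(term, tuple):                       # a constant
        arg_types, result = SIG[term]
        return result if not arg_types else None
    f, args = term[0], term[1:]
    arg_types, result = SIG[f]
    if len(args) == len(arg_types) and all(
            subtype_of(type_of(a), t) for a, t in zip(args, arg_types)):
        return result
    return None

# true(at(3, 4, black)) is a well-formed ground atom
print(type_of(("true", ("at", 3, 4, "black"))))           # bool
```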

4.3 Automatically generating induction tasks for a GGP game

Given a GGP game \(\Gamma \) written in GDL, and a type signature \(\Sigma \) for that game, our system automatically generates an IGGP induction task. Before presenting the details, we summarise the general approach. To generate the GGP dataset, we built a simple forward-chaining GDL interpreter. We used the GDL interpreter to calculate the initial state, the currently valid moves, the transition function, and the reward. When generating traces, we first calculate the actions that are currently available for each player. Then we let each player choose an action uniformly at random. We record the state trace \((s_1,\ldots , s_n)\), and extract a set of \((B_i, E^+_i, E^-_i)\) triples from each trace. The target predicates we wish to learn are legal, next, goal, and terminal. The \((B_i, E^+_i, E^-_i)\) triples for the predicates legal, goal, and terminal are calculated from a single state, while the triples for next are calculated from a pair of consecutive states \((s_i, s_{i+1})\).

We generated multiple traces for each game: 1000 episodes with a maximum of 100 time-steps. However, we chose these numbers somewhat arbitrarily because there is a complex tradeoff in how much data to generate. We want to generate enough data to capture the diversity of a game, so that a learner can (in theory) learn the correct game rules. However, we do not want to generate so much data that we provide every game state, as this would mean that a learner would not need to learn anything, and could instead simply memorise game situations. We also do not want to generate so much data that it becomes expensive to compute or store. It is, however, unclear where the boundary is between too little and too much data. Whether such a boundary even exists is unclear because, by imposing different biases, different learners may need more or less information on the same task. In future work we would like to expand the dataset. We then intend to repeat the experiments with different amounts of training data.

figure c

Our approach is presented in Algorithm 1. This procedure generates a number of traces. Each trace is a sequence of game states, and each game state is represented by a set of ground atoms. We use the \( extract \) function (described in Sect. 4.3.1) to produce a set of \((B_i,E^+_i,E^-_i)\) triples from a trace. We add this set of triples to \(\Lambda \). At the end, when we have finished all the traces, we return \(\Lambda \), the set of triples. The variable s stores the current state (a set of ground atoms). Initially, s is set to the initial state: \( initial (\Gamma )\) produces the initial state from the GDL description. Then for each time-step, we calculate the next state via \( next (\Gamma , s)\). This function \( next (\Gamma , s)\) involves three steps. First, we calculate the available actions for each player. Second, we let each player take a uniformly random move. Third, we use the transition function T to calculate the next state from the current state s and the actions of the players. Once we have calculated the new state, we append it to the end of t. Here, t is a trace, i.e. a sequence of states. Then we check if the new state is terminal. If it is terminal, we finish the episode; otherwise, we continue for another time-step. Once the episode is finished, we extract the set of \((B_i,E^+_i,E^-_i)\) triples from the sequence of states, and continue to the next trace. Note that we need the type signature \(\Sigma \) to extract the triples from the trace, but we do not need it to generate the trace itself. For our experiments, we generated 1000 traces for each game, and ran each trace for a maximum of 100 time-steps.
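The following Python sketch mirrors the procedure just described. The helper functions initial, legal_actions, transition, is_terminal, and extract stand in for the forward-chaining GDL interpreter and the extraction step of Sect. 4.3.1; their names and the state encoding are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of Algorithm 1: generate traces of random play and
# extract (B, E+, E-) triples from each trace.
import random

def generate_tasks(game, signature, num_traces=1000, max_steps=100):
    tasks = []                                      # the set of extracted triples
    for _ in range(num_traces):
        state = initial(game)                       # initial state: a set of ground atoms
        trace = [state]
        for _ in range(max_steps):
            actions = {player: random.choice(list(moves))     # uniformly random move per player
                       for player, moves in legal_actions(game, state).items()}
            state = transition(game, state, actions)          # the transition function T
            trace.append(state)
            if is_terminal(game, state):
                break
        tasks.extend(extract(trace, signature))     # Sect. 4.3.1
    return tasks
```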

4.3.1 The \( extract \) function

The \( extract (t, \Sigma )\) function in Algorithm 1 takes a trace \(t = (s_1,\ldots , s_n)\) (a sequence of sets of ground atoms), and a type signature \(\Sigma \) and produces a set of \((B_i,E^+_i,E^-_i)\) triples. This set of triples represents a set of induction tasks for the distinguished predicates \( legal \), \( goal \), \( terminal \), and \( next \). It is defined as:

$$\begin{aligned} extract ((s_1,\ldots , s_n), \Sigma ) = \Lambda _1 \cup \Lambda _2 \cup \Lambda _3 \cup \Lambda _4 \end{aligned}$$

where:

$$\begin{aligned} \Lambda _1= & {} \{ triple _1(s_i, legal , \Sigma ) \mid i = 1 .. n\} \\ \Lambda _2= & {} \{ triple _1(s_i, goal , \Sigma ) \mid i = 1 .. n\} \\ \Lambda _3= & {} \{ triple _1(s_i, terminal , \Sigma ) \mid i = 1 .. n\} \\ \Lambda _4= & {} \{ triple _2(s_i, s_{i+1}, \Sigma ) \mid i = 1 .. n-1\} \end{aligned}$$

Before we define the \( triple _1\) and \( triple _2\) functions, we introduce the relevant notation. If s is a set of ground atoms and p is a predicate, let \(s_p\) be the subset of atoms in s that use the predicate p. If \(\Sigma \) is a type signature and p is a predicate, then \( ground (\Sigma , p)\) is the set of all ground atoms generated by \(\Sigma \) that use predicate p. Given this notation, we define \( triple _1(s, p, \Sigma ) = (B, E^+, E^-)\) where:

$$\begin{aligned}&B = s - s_p \\&E^+ = s_p \\&E^- = ground (\Sigma , p) - E^+ \end{aligned}$$

To calculate the negative instances \({E}^-_i\), we use the closed-world assumption: all p-atoms not known to be true in \(E^+\) are assumed to be false in \(E^-\). Given a type signature \(\Sigma \), we generate the set \( ground (\Sigma , p)\) of all possible ground atoms whose predicate is the distinguished predicate p. For example, in a one player game, if \( ground (\Sigma , legal ) = \{\)legal(p1, up), legal(p1, down), legal(p1, left), and legal(p1, right)\(\}\), and \(s_ legal \) only contains legal(p1, up) and legal(p1, down), then:

$$\begin{aligned} {E}^+_i= & {} \{{\texttt {legal(p1, up)}}, {\texttt {legal(p1, down)}}\} \\ {E}^-_i= & {} ground (\Sigma , legal ) - {E}^+_i = \{{\texttt {legal(p1, left)}}, {\texttt {legal(p1, right)}}\} \end{aligned}$$

We define \( triple _2(s_i, s_{i+1}, \Sigma ) = (B, E^+, E^-)\) where:

$$\begin{aligned}&B = s_i \\&E^+ = s_{i+1} [ true / next ] \\&E^- = ground (\Sigma , next ) - E^+ \end{aligned}$$

When learning \( next \), we use the facts at the earlier time-step \(s_i\) as background facts, we use the facts at the later time-step \(s_{i+1}\) as the positive facts \(E^+\) to be learned (with the predicate \( true \) replaced by \( next \)), and we use all the rest of the ground atoms involving \( next \) as the negative facts \(E^-\). Note, again, the use of the closed-world assumption: we assume all \( next \) atoms not known to be in \(E^+\) to be in \(E^-\).
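The following Python sketch follows these definitions directly. Ground atoms are encoded as (predicate, arguments) tuples, states are sets of such atoms, and ground_atoms(signature, p) stands in for \( ground (\Sigma , p)\); this encoding is an assumption for illustration.

```python
# Minimal sketch of extract, triple_1, and triple_2 (illustrative encoding).
def atoms_with(state, p):
    """The subset s_p of atoms in state s whose predicate is p."""
    return {a for a in state if a[0] == p}

def triple_1(state, p, signature):
    pos = atoms_with(state, p)
    return (state - pos, pos, ground_atoms(signature, p) - pos)

def triple_2(state, next_state, signature):
    # E+ is the next state's fluents with the predicate true renamed to next.
    pos = {("next", args) for (_, args) in atoms_with(next_state, "true")}
    return (state, pos, ground_atoms(signature, "next") - pos)

def extract(trace, signature):
    tasks = []
    for i, state in enumerate(trace):
        for p in ("legal", "goal", "terminal"):
            tasks.append(triple_1(state, p, signature))
        if i + 1 < len(trace):
            tasks.append(triple_2(state, trace[i + 1], signature))
    return tasks
```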

5 Baselines and ILP systems

We claim that IGGP is challenging for existing ILP approaches. To support this claim we evaluate existing ILP systems on our IGGP dataset. We compare the ILP systems against simple baselines. We first describe the baselines and then each ILP system.

5.1 Baselines

Figure 11 shows the four baselines. Each baseline is a Boolean function \(f :2^{{\mathscr {B}}} \times {\mathscr {E}} \rightarrow \{\top ,\bot \}\), i.e. a function that takes background knowledge and an example and returns true (\(\top \)) or false \((\bot )\). We describe these baselines in detail.

Fig. 11
figure 11

Baselines where \(\Delta = \{(B_i,E^+_i,E^-_i)\}^m_{i=1}\) represents training data. The syntax a[next/true] means to replace the predicate symbol next with true in the atom a

Our first two baselines ignore the training data:

  • True deems that every atom is true:

    $$\begin{aligned} True(B,a) = \top \end{aligned}$$
  • Inertia is the same as True for atoms with the target predicates goal, legal, and terminal, but for the next predicate an atom is true if and only if the corresponding true atom is in B. For instance, the atom next(at(1,4,x)) is true if and only if true(at(1,4,x)) is in B:

    $$\begin{aligned} Inertia(B,a) = a[next/true] \in B \end{aligned}$$

    The intuition behind this baseline is the empirical observation that, in most of the games, most ground atoms retain their truth value from one time-step to the next. Of course, it is possible to design games in which most or all of the atoms change their truth value each time-step; but in typical games, such radical changes are unusual.

Our next two baselines consider the training data \(\Delta = \{(B_i,E^+_i,E^-_i)\}^m_{i=1}\):

  • Mean deems that a testing atom a is true if and only if a appears in at least half of the positive training example sets:

    $$\begin{aligned} Mean(B, a) = |\{(B_i, E^+_i,E^-_i) \in \Delta \mid a \in E^+_i\}| \ge \frac{|\Delta |}{2} \end{aligned}$$
  • KNN\(_k\) is based on clustering the data. In \(KNN_k(B,a)\) we find the k triples in \(\Delta \), denoted as \(\kappa _k(\Delta , B)\), whose backgrounds are most ‘similar’ to the background B. To assess the similarity of two sets A and B of ground atoms, we look at the size of the symmetric differenceFootnote 6 between A and B:

    $$\begin{aligned} d(A, B) = |A - B| + |B - A| \end{aligned}$$

    It is straightforward to show that the d function satisfies the conditions for a distance metric:

    • \(d(A, B) \ge 0\)

    • \(d(A, B) = d(B, A)\)

    • \(d(A, B) = 0\) iff \(A = B\)

    • \(d(A, C) \le d(A, B) + d(B, C)\)

    We set the closest k triples \(\kappa _k(\Delta , B)\) to be the k triples \(\{(B_i,E^+_i,E^-_i)\}^k_{i=1}\) with the smallest d distance between \(B_i\) and B. Given the k closest triples \(\kappa _k(\Delta , B)\) the KNN baseline outputs \(\top \) if a appears in \(E^{+'}\) in at least half of the closest k triples. More formally:

    $$\begin{aligned} KNN_k(B, a) = |\{ (B', E^{+'}, E^{-'}) \in \kappa _k(\Delta , B) \mid a \in E^{+'}\}| \ge \frac{k}{2} \end{aligned}$$

One potential limitation of the KNN approach is that, in contrast to the ILP approaches, the KNN approaches learn at the propositional level and are unable to learn general first-order rules. To illustrate this limitation, suppose we are trying to learn the target predicate p/1 given the background predicate q/1 and that the underlying target rule is \(p(X) \leftarrow q(X)\). Suppose there are only two training triples of the form \((B,E^+,E^-)\):

$$\begin{aligned} T_1= & {} (\{q(a)\}, \{ p(a) \}, \{ p(b), p(c) \})\\ T_2= & {} (\{q(b)\}, \{ p(b) \}, \{ p(a), p(c) \}) \end{aligned}$$

Given the test triple \((\{ q(c) \}, \{ p(c) \}, \{ p(a), p(b) \})\), a KNN approach will deem that p(c) is false because it has not seen a positive instance of this particular ground atom and has no representational resources for generalising.
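The following Python sketch gives one possible implementation of the four baselines in Fig. 11. Atoms are encoded as (predicate, arguments) tuples with flattened predicate names (e.g. next_score), and the training data is a list of (B, E+, E-) triples; this encoding and the helper names are assumptions for illustration.

```python
# Minimal sketch of the True, Inertia, Mean, and KNN_k baselines.
def true_baseline(bk, atom):
    return True

def inertia_baseline(bk, atom):
    pred, args = atom
    if pred.startswith("next"):                     # a[next/true] in B
        return ("true" + pred[len("next"):], args) in bk
    return True

def mean_baseline(train, atom):
    # True iff the atom appears in at least half of the positive training sets.
    return sum(atom in pos for _, pos, _ in train) >= len(train) / 2

def knn_baseline(train, bk, atom, k=1):
    def dist(a, b):                                 # size of the symmetric difference
        return len(a ^ b)
    nearest = sorted(train, key=lambda t: dist(t[0], bk))[:k]
    return sum(atom in pos for _, pos, _ in nearest) >= k / 2
```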

5.2 ILP systems

We evaluate four ILP systems on our dataset. It is important to note that we are not trying to directly compare the ILP systems, or to demonstrate that any particular ILP system is better than another. We are instead trying to show that the IGGP problem is challenging for existing systems, and that it (and the dataset) will provide a challenging problem for evaluating future research. Indeed, a direct comparison of ILP systems is often difficult (Cropper 2017), largely because different systems excel at certain classes of problems. For instance, directly comparing the Prolog-based Metagol against ASP-based systems, such as ILASP and HEXMIL (Kaminski et al. 2018), is difficult because Metagol is often used to learn recursive list manipulation programs, including string transformations and sorting algorithms (Cropper and Muggleton 2019). By contrast, many ASP solvers disallow explicit lists, such as the popular Clingo system (Gebser et al. 2014), and thus a direct comparison is difficult. Likewise, ASP-based systems can be used to learn non-deterministic specifications represented through choice rules and preferences modelled as weak constraints (Law et al. 2018), which is not necessarily the case for Prolog-based systems. In addition, because many of the systems have learning parameters, it is often possible to show that there exist some parameter settings for which system X performs better than system Y on a particular dataset. Therefore, the relative performances of the systems should largely be ignored.

We compare the ILP systems Aleph, ASPAL, Metagol, and ILASP. We describe these systems in turn.

5.2.1 Aleph

Aleph is an ILP system written in Prolog based on Progol (Muggleton 1995). Aleph uses the following procedure to induce a logic program hypothesis (paraphrased from the Aleph websiteFootnote 7):

  1. 1.

    Select an example to be generalised. If none exist, stop, otherwise proceed to the next step.

  2. 2.

    Construct the most specific clause (also known as the bottom clause (Muggleton 1995)) that entails the selected example and is within the language restrictions provided.

  3. 3.

    Search for a clause more general than the bottom clause. This step is done by searching for some subset of the literals in the bottom clause that has the ‘best’ score.

  4. 4.

    The clause with the best score is added to the current theory and all the examples made redundant are removed. Return to step 1.

To restrict the hypothesis space (mainly at step 2), Aleph uses both mode declarations (Muggleton 1995) and determinations to denote how and when a literal can appear in a clause. In the mode language, modeh declarations are for head literals and modeb declarations are for body literals. An example modeb declaration is modeb(2,mult(+int,+int,-int)). The first argument of a mode declaration is an integer denoting how often a literal may appear in a clause. The second argument denotes that the literal mult/3 may appear in the body of a clause and specifies the types of its arguments. The symbols \(+\) and − denote whether the arguments are input or output arguments respectively. Determinations declare which predicates can be used to construct a hypothesis and are of the form determination(TargetName/Arity,BackgroundName/Arity). The first argument is the name and arity of the target predicate. The second argument is the name and arity of a predicate that can appear in the body of such clauses. Typically there will be many determination declarations for a target predicate, corresponding to the predicates thought to be relevant in constructing hypotheses. If no determinations are present, Aleph does not construct any clauses.

Aleph assumes that modes will be declared by the user. For the IGGP tasks this is quite a burden because it requires that we create them for each game, and it also requires some knowledge of the target hypothesis we want to learn. Fortunately, Aleph can extract mode declarations from determinations, and determinations are straightforward to supply: for each target predicate we simply supply a determination for each background predicate. Therefore, for each game, we allow Aleph to use all the predicates available for that game as determinations and allow Aleph to induce the necessary mode declarations.

There are many parameters in Aleph which greatly influence the output, such as parameters that change the search strategy when generalising a bottom clause (step 3) and parameters that change the structure of learnable programs (such as limiting the number of literals in the bottom clause). We use Aleph 5 with YAP 6.2.2 (Costa et al. 2012), keeping the default parameters throughout. Therefore, there will most likely exist some parameter settings for which Aleph performs better than the results we present.

5.2.2 ASPAL

ASPAL (Corapi et al. 2011) is a system for brave induction under the answer set programming (ASP) (Lifschitz 2008) semantics. Brave induction systems aim to find a hypothesis H such that there is at least one answer set of \(B\cup H\) that covers the examples.Footnote 8

ASPAL works by transforming a brave induction task T into a meta-level ASP program \({\mathscr {M}}(T)\) such that the answer sets of \({\mathscr {M}}(T)\) correspond to the inductive solutions of T. The first step of state-of-the-art ASP solvers, such as clingo (Gebser et al. 2011), is to compute the grounding of the program. Systems which follow this approach therefore have scalability issues with respect to the size of the hypothesis space, as every ground instance of every rule in the hypothesis space—i.e. the ground instances of every rule that has the potential to be learned—is computed when the ASP solver solves \({\mathscr {M}}(T)\).

Similarly to Aleph, ASPAL has several input parameters, which influence the size of the hypothesis space, such as the maximum number of body literals. For most of these, we used the default value, but we increased the maximum number of body literals from 3 to 5 and the maximum number of rules in the hypothesis space from 3 to 15. Our initial experiments showed that the maximum number of rules had very little effect on the feasibility of the ASPAL approach (as the size of the grounding of \({\mathscr {M}}(T)\) is unaffected by this change), whereas the maximum number of body literals can make a significant difference to the size of the grounding of \({\mathscr {M}}(T)\). It is possible that there is a set of parameters for ASPAL that performs better than those we have chosen.

Predicate invention is supported in ASPAL by allowing new predicates (which do not occur in the rest of the task) to appear in the mode declarations. This predicate invention is prescriptive rather than automatic, as the schema of the new predicates (i.e. the arity and argument types) must be specified in the mode declarations. As it is unclear in this problem setting how to guess the structure of the predicates that should be invented, we did not allow ASPAL to use predicate invention on this dataset. It should be noted that when programs are stratified, hypotheses containing predicate invention can always be translated into equivalent hypotheses with no predicate invention. Of course, as such hypotheses may be significantly longer than the compact hypotheses which are possible through predicate invention, they may require more examples to be learned accurately by ASPAL.

Similarly, although ASPAL does enable learning recursive hypotheses, we did not permit recursion in these experiments. Recursive hypotheses can also be translated into non-recursive hypotheses over finite domains. Our initial experiments using ASPAL showed that in addition to increasing the size of the hypothesis space, allowing recursion also significantly increased the grounding of ASPAL’s meta program, \({\mathscr {M}}(T)\).

5.2.3 Metagol

Metagol (Muggleton et al. 2015; Cropper and Muggleton 2016a, b) is an ILP system based on a Prolog meta-interpreter. The key difference between Metagol and a standard Prolog meta-interpreter is that whereas a standard Prolog meta-interpreter attempts to prove a goal by repeatedly fetching first-order clauses whose heads unify with a given goal, Metagol additionally attempts to prove a goal by fetching higher-order metarules (Fig. 12), supplied as background knowledge, whose heads unify with the goal. The resulting meta-substitutions are saved and can be reused in later proofs. Following the proof of a set of goals, Metagol forms a logic program by projecting the meta-substitutions onto their corresponding metarules. Metagol is notable for its support for (non-prescriptive) predicate invention and learning recursive programs.

Metarules define the structure of learnable programs, which in turn defines the hypothesis space. Deciding which metarules to use for a given task is an unsolved problem (Cropper 2017; Cropper and Tourret 2019). To compute the benchmark, we set Metagol to use the same metarules for all games and tasks. This set is composed of 9 derivationally irreducible metarules (Cropper and Tourret 2018, 2019), a set of metarules to allow for constants in a program, and a set of nullary metarules (to learn the terminal predicates). Full details on the metarules used can be found in the code repository.

For each game, we allow Metagol to use all the predicates available for that game. We also allow Metagol to support a primitive form of negation by additionally using the negation of predicates. For instance, in Firesheep we allow Metagol to use the rule not_does_kill(A,B) :- not(does_kill(A,B)). To allow Metagol to induce a program given all \((B_i,E^+_i,E^-_i)\) triples, we prefix each atom with an extra argument to denote which triple each atom belongs to. For instance, in the first minimal even triple, the atom does_choose(player,1) becomes does_choose(triple1,player,1), and in the second triple the same atom becomes does_choose(triple2,player,1). To account for this extra argument, we also add an extra argument to each literal in a metarule. For instance, the ident metarule becomes \(P(I,A) \leftarrow Q(I,A)\) and the chain metarule becomes \(P(I,A,B) \leftarrow Q(I,A,C), R(I,C,B)\).

We use Metagol 2.2.3 with YAP 6.2.2.

Fig. 12
figure 12

Example metarules. The letters P, Q, R denote existentially quantified variables. The letters A, B, and C denote universally quantified variables

5.2.4 ILASP

ILASP (Inductive Learning of Answer Set Programs) (Law et al. 2014, 2015a, b) is a collection of ILP systems, which are capable of learning ASP programs consisting of normal rules, choice rules, hard and weak constraints. Unlike many other ILP approaches, ILASP guarantees the computation of an optimal inductive solution (where optimality is defined in terms of the length of a hypothesis). Similarly to ASPAL, early ILASP systems, such as ILASP1 (Law et al. 2014) and ILASP2 (Law et al. 2015b), work by representing an ILP task (i.e. every example and every rule in the hypothesis space) as a meta-level ASP program whose optimal answer sets correspond to the optimal inductive solutions of the task. The ILASP systems each target learning unstratified ASP programs with normal rules, choice rules and both hard and weak constraints. Therefore, the stratified normal logic programs which are targeted in this paper do not require the full generality of ILASP; in fact, on this dataset, the meta-level ASP programs used by both ILASP1 and ILASP2 are isomorphic to the meta-level program used by ASPAL.

ILASP2i (Law et al. 2016) addresses the scalability with respect to the number of examples by iteratively computing a subset of the examples, called relevant examples, and only representing the relevant examples in the ASP program. In each iteration, ILASP2i uses ILASP2 to find a hypothesis H that covers the set of relevant examples and then searches for a new relevant example which is not covered by H. When no further relevant examples exist, the computed H is guaranteed to be an optimal inductive solution of the full task.

Although ILASP2i significantly improves on the scalability of ILASP1 and ILASP2 with respect to the number of examples, on tasks with large hypothesis spaces ILASP2i still suffers from the same grounding bottleneck as ASPAL, ILASP1, and ILASP2. As the size of the hypothesis space is one of the major challenges of the dataset in this paper, ILASP2i would likely not perform significantly better than ASPAL. To scale the ILASP framework up to the GGP dataset, we used an extended version of ILASP2i that, in each iteration, computes a relevant hypothesis space using the type signature and the current set of relevant examples, and then uses ILASP2 to solve a learning task restricted to the current relevant examples and relevant hypothesis space. Throughout the rest of the paper, we refer to this extended ILASP algorithm as \(\hbox {ILASP}^{*}\). Specifically, rules that entail negative examples or do not cover at least one relevant positive example are omitted from the relevant hypothesis space. A rule is also omitted if there is another rule that is shorter and covers the same (or more) relevant positive examples. Similarly to ASPAL, \(\hbox {ILASP}^{*}\) takes a parameter for the maximum number of literals in the body of a rule. Our preliminary experiments showed that the method for computing the relevant hypothesis space performed best with this parameter set to 5, so this value was used in the experiments.
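
The omission criteria just described can be sketched as follows; this is an illustration of the filtering conditions rather than the \(\hbox {ILASP}^{*}\) source, with covers/2 and rule_length/2 as assumed black boxes.

% Illustrative sketch: keep a candidate rule only if it entails no negative
% example, covers at least one relevant positive example, and is not
% dominated by a shorter rule covering at least the same relevant positives.
keep_rule(Rule, Pos, Neg, OtherRules) :-
    \+ (member(N, Neg), covers(Rule, N)),
    once((member(P, Pos), covers(Rule, P))),
    \+ dominated(Rule, Pos, OtherRules).

dominated(Rule, Pos, OtherRules) :-
    member(Other, OtherRules),
    rule_length(Other, LenOther),
    rule_length(Rule, LenRule),
    LenOther < LenRule,
    forall((member(P, Pos), covers(Rule, P)), covers(Other, P)).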

The construction of a relevant hypothesis space was made significantly easier by forbidding recursion and predicate invention in \(\hbox {ILASP}^{*}\). Although the standard ILASP algorithms do support recursion and (prescriptive) predicate invention, these two features mean that the usefulness of a rule in covering examples cannot be evaluated independently of the rest of the hypothesis, which makes constructing the relevant hypothesis space much more challenging. In future work, we hope to generalise the method of relevant hypothesis space construction to relax these two constraints.

6 Results

We now describe the results of running the baselines and ILP systems on our dataset. All the experimental data is available at https://github.com/andrewcropper/mlj19-iggp. When running the ILP systems, we allowed each system the same amount of time (30 minutes) to learn each target predicate.

6.1 Evaluation metrics

We use two evaluation metrics: balanced accuracy and perfectly solved.

6.1.1 Balanced accuracy

In our dataset the majority of examples are negative. To account for this class imbalance, we use balanced accuracy (Brodersen et al. 2010) to evaluate the approaches. Given background knowledge B, disjoint sets of positive \(E^+\) and negative \(E^-\) testing examples, and a logic program H, we define the number of positive examples as \(p=|E^+|\), the number of negative examples as \(n=|E^-|\), the number of true positives as \(tp=|\{e \in E^+ | B \cup H \models e\}|\), the number of true negatives as \(tn=|\{e \in E^- | B \cup H \not \models e\}|\), and the balanced accuracy \(ba = (tp/p + tn/n)/2\).
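For illustration, with \(p=10\) and \(n=90\), a hypothesis that rejects every example achieves a standard accuracy of \(90/100 = 0.9\) but a balanced accuracy of only \((0/10 + 90/90)/2 = 0.5\).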

6.1.2 Perfectly solved

We also consider a perfectly solved metric, which is the number (or percentage) of tasks that an approach solves with 100% accuracy. The perfectly solved metric is important in IGGP because we know that every game has at least one perfect solution: the GDL description from which the traces were generated is a perfectly accurate model of the deterministic MDP. Perfect accuracy is important because even slightly inaccurate models compound their errors as the game progresses.

6.2 Results summary

Table 3 summarises the results and shows, for each approach, the balanced accuracy and the percentage of perfectly solved tasks. The full results are in the “Appendix”. As the results show, the ILP and KNN approaches perform better than the simple baselines (True, Inertia, and Mean). In terms of balanced accuracy, the KNN approaches often perform better than the ILP systems. However, in terms of the important perfectly solved metric, the ILP methods easily outperform the baselines and the KNN approaches. The most successful system, \(\hbox {ILASP}^{*}\), perfectly solves 40% of the tasks. It should be noted that 4% of test cases have no positive instances in either the training set or the test set, meaning that a perfect score can be achieved with the empty hypothesis. Each of the ILP systems achieved a perfect score on these tasks. Without these trivial cases, the perfectly solved score of each system would be even lower.

Table 3 Results summary. The baseline represents accepting everything. The results show that all of the approaches struggle in terms of the perfectly solved metric (which represents how many tasks were solved with 100% accuracy)
Table 4 Balanced accuracy results for each target predicate
Table 5 Perfectly solved percentage for each target predicate
Table 6 Balanced accuracies for the next target predicate for the alphabetically first ten games

As Table 4 shows, in terms of balanced accuracy, the most difficult task is the terminal predicate, although the margin between the predicates is small. As Table 5 shows, in terms of the important perfectly solved metric, the most difficult task is the next predicate. The mean percentage of perfectly solved tasks is a mere 3%; even if we exclude the baselines and only consider the ILP systems, the mean is still only 10%. Table 6 shows the balanced accuracies for the next predicate on the alphabetically first ten games. This predicate corresponds to the state transition function (Sect. 4.1). The next atoms are the most difficult to learn: there is only one of the first ten games, Buttons and Lights, for which any of the methods finds a perfect solution. The next predicate is the most difficult to learn because it has the highest mean complexity in terms of the number of dependent predicates in the dependency graph (Sect. 3.1) in the reference GDL game definitions.

In the following sections we analyse the results for each system and discuss the relative limitations of the respective systems on this dataset.

6.2.1 KNN

As Table 3 shows, the KNN approaches perform well in terms of balanced accuracy but poorly in terms of perfectly solved. Note that \(\hbox {KNN}_1\) occasionally scores higher than \(\hbox {KNN}_5\), which is to be expected because looking at additional triples sometimes gives misleading information. As already mentioned, the KNN approaches learn at the propositional level. This limitation is evident in the results, which show that the \(\hbox {KNN}_1\) and \(\hbox {KNN}_5\) approaches only perform well when the target predicate can be learned by memorising particular atoms. For some of the simpler games (e.g. Coins), the KNN approach is often able to learn the goal predicate because the reward can be extracted directly from the value of an internal state variable representing the score. Similarly, the KNN approach sometimes learns the legal predicate when the set of legal actions is static and does not depend on the current state. But the KNN approach is not able to perfectly learn any of the next rules for any of the games in our dataset. In addition, the KNN approaches are expensive to compute: obtaining these results took three days on a machine with an Intel Xeon 3.6 GHz CPU (6 cores), 62 GB of RAM, and a 425 GB hard drive.

6.2.2 Aleph

As Table 3 shows, Aleph performs reasonably well, and outperforms most of the baselines in terms of the perfectly solved metric. However, after inspecting the learned programs, we found that Aleph rarely learned general rules for the games, and instead typically learned facts to explain the specific examples. In other words, on this task, Aleph tends to learn overly specific programs. There are several potential explanations for this limitation. First, as we stated in Sect. 5.2.1, we did not provide mode declarations to Aleph, and instead allowed Aleph to infer them from the determinations. Second, we ran Aleph with its default parameters. However, as stated in Sect. 5.2.1, Aleph has many learning parameters which greatly influence learning performance, and it is reasonable to assume that Aleph could perform better with a different set of parameters. Third, to learn a program Aleph must first construct the most specific clause (the bottom clause) that entails an example. However, constructing the bottom clause requires time exponential in the depth of variables in the target theory (Muggleton 1995), so learning large and complex clauses is intractable.
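
As a toy illustration of a bottom clause (our own example, not one from the dataset): given the example grandparent(ann,carl), background facts parent(ann,bob) and parent(bob,carl), and suitable mode declarations, the variabilised bottom clause would be

% the most specific clause, built from the background knowledge,
% that entails the single example grandparent(ann,carl)
grandparent(A, C) :- parent(A, B), parent(B, C).

Aleph then searches the space of clauses that subsume this bottom clause, so the deeper the chains of variables needed, the larger the bottom clause and the harder the search.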

6.2.3 ASPAL

As Table 3 shows, ASPAL performs quite poorly on this dataset. It is outperformed by the mean baseline, both in terms of the perfectly solved metric and the average balanced accuracy. ASPAL timed out on the majority of the test problems; the timeouts were caused by the size of the hypothesis space, and therefore of the grounding of ASPAL’s meta-level ASP program. It is possible that with different parameters to control the size of the hypothesis space, or a different representation of the problem with a smaller grounding, ASPAL could perform better.

The results of ASPAL also help to explain the need for a specialised version of the ILASP algorithm for this dataset. On this constrained problem domain, where we only aim to learn stratified programs (which are guaranteed to have a single answer set), ILASP2 and ASPAL are almost identical in their approaches: both map the input ILP task into a meta-level ASP program and use the Clingo ASP solver to find an optimal answer set, corresponding to an optimal inductive solution of the input task. The specialised \(\hbox {ILASP}^*\) algorithm presented in Sect. 5.2.4 can overcome this grounding problem in some cases by reducing the size of the hypothesis space being considered, and thus the size of the grounding of the meta-level program. In principle, this specialisation (along with ILASP2i’s relevant example method) could be applied to ASPAL, to create an \(\hbox {ASPAL}^*\), which would likely perform better.

6.2.4 Metagol

Although Metagol outperforms the baselines in terms of the perfectly solved metric (34%), it is outperformed in terms of balanced accuracy.

One of the main limitations of Metagol on this dataset is that it only returns a program if that program covers all of the positive examples and none of the negative examples. However, in some of the games, Metagol could learn a single simple rule that explains 99% of the training examples (and perhaps 99% of the testing examples) but may need an additional complex rule to cover the remaining 1%. If this extra rule is too complex to learn, then Metagol will not learn anything. To explore this limitation, we ran a modified version of Metagol that relaxes this constraint: the modified version simply samples the training examples rather than learning from all of them. This stochastic version of Metagol improved the balanced accuracy from 69% to 76%. In future work we intend to develop more sophisticated versions of stochastic Metagol.
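
As a rough sketch (ours, not the modified system's code), the sampling wrapper amounts to something like the following, assuming SWI-Prolog's random_permutation/2 and Metagol's usual learn/3 entry point:

% Illustrative example-sampling wrapper around Metagol (hypothetical).
:- use_module(library(random)).   % for random_permutation/2 (SWI-Prolog)

take(0, _, []) :- !.
take(_, [], []) :- !.
take(K, [X|Xs], [X|Ys]) :- K1 is K - 1, take(K1, Xs, Ys).

% Sample at most K positive and K negative examples, then call Metagol's
% learn/3 (assumed interface: learn(+Pos, +Neg, -Program)).
sample_and_learn(Pos, Neg, K, Prog) :-
    random_permutation(Pos, ShuffledPos),
    take(K, ShuffledPos, SampledPos),
    random_permutation(Neg, ShuffledNeg),
    take(K, ShuffledNeg, SampledNeg),
    learn(SampledPos, SampledNeg, Prog).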

Metagol can generalise from few examples because of the strong inductive bias enforced by the metarules. However, this strong bias is also a key reason why Metagol struggles to learn programs for many of the games. Given insufficient metarules, Metagol cannot induce the target program. For instance, given only monadic metarules, Metagol can only learn monadic programs. Although there is work studying which metarules to use for monadic and dyadic logics (Cropper and Muggleton 2014; Cropper and Tourret 2018, 2019), there is no work on determining which metarules to use for higher-arity logics. Therefore, when computing the benchmarks, Metagol could not learn some of the higher-arity target predicates, such as the next_cell/4 predicate in Sudoku. Similarly, Metagol often could not use higher-arity background predicates, such as does_move/5 and triplet/6 in Alquerque.

Another issue with the metarules is that, as described in Sect. 5.2.3, we used the same set of metarules for all games. This approach is inefficient because in almost all cases it meant using irrelevant metarules, which added unnecessary search to the learning task. We expect that a simple preprocessing step to remove unusable metarules would improve learning performance, although probably not by any considerable margin.

Another reason why Metagol fails to solve certain games is that, as with most ILP systems, it struggles to learn large and complex programs. For Metagol the bottleneck is the size of the target program, because the search space grows exponentially with the number of clauses in the target program (Cropper and Tourret 2019). Although there is work on mitigating this issue (Cropper and Muggleton 2016a), developing approaches that can learn large and complex programs is a major challenge for MIL and ILP in general (Cropper 2017).

6.2.5 \(\hbox {ILASP}^{*}\)

The system with the highest percentage of perfectly solved tasks (see Table 3) is \(\hbox {ILASP}^{*}\), which perfectly solves 40% of the tasks. In most of the cases where \(\hbox {ILASP}^{*}\) terminated with a solution within the time limit of 30 minutes, a perfect solution was returned. On the rare occasions that \(\hbox {ILASP}^{*}\) terminated but learned an imperfect solution, the hypothesis did cover the training examples but performed imperfectly on the test set. For example, the terminal training set for Untwisty Corridor contains no positive examples, meaning that \(\hbox {ILASP}^{*}\) returns the empty hypothesis (which is consistent with all of the negative examples); however, there is a positive instance of terminal in the test set, meaning that \(\hbox {ILASP}^{*}\) (and every other approach) scores a balanced accuracy of 50% on this problem.

In some cases, the restriction on the number of body literals meant that the task had no solutions. In these unsatisfiable cases, \(\hbox {ILASP}^{*}\) returned the hypothesis from the last satisfiable iteration. In principle, the maximum number of body literals could have been iteratively increased until the task became satisfiable, but our initial experiments showed that this made little or no difference to the number of perfectly solved cases. Some of the unsatisfiable cases may have been caused by the restriction forbidding predicate invention for \(\hbox {ILASP}^{*}\) on this dataset: although there will always be an equivalent hypothesis that does not contain predicate invention, that hypothesis may have rules with more than 5 body literals.

Fig. 13
figure 13

The raw hypothesis returned by \(\hbox {ILASP}^*\) for the next learning task for Rock Paper Scissors

Similarly to the unsatisfiable cases, in the timeout cases the hypothesis found in \(\hbox {ILASP}^{*}\)’s final iteration was used to compute the accuracy. Returning the hypothesis found in the last iteration explains \(\hbox {ILASP}^{*}\)’s much higher average balanced accuracy compared to Metagol, which either returns a solution covering all of the training examples or no solution at all.

\(\hbox {ILASP}^*\) is able to perfectly solve some tasks that are not perfectly solved by any of the baselines or other ILP systems. One example is the \(\mathtt {next}\) learning task for Rock Paper Scissors. In this case, the raw hypothesis returned by \(\hbox {ILASP}^*\) is shown in Fig. 13, which is equivalent to the (more readable) hypothesis shown in Fig. 14. Note that this hypothesis is slightly more complicated than necessary. If \(\hbox {ILASP}^*\) had been permitted to use \(!=\) to check that two player variables did not represent the same player, it is possible that the last three rules would have been replaced with:

figure d

It is possible to learn hypotheses with \(!=\) (and other binary comparison operators) in ILASP, but this would have increased the size of the hypothesis space, so in these experiments, we only allowed \(\hbox {ILASP}^*\) to construct hypothesis spaces using the language of the input task. In future work, we may consider extending the relevant hypothesis space construction method to allow binary comparison operators. The increase in the size of the hypothesis space may be outweighed by the fact that the final hypothesis can be shorter—shorter hypotheses tend to need fewer iterations to learn.

Fig. 14
figure 14

A more readable version of the hypothesis returned by \(\hbox {ILASP}^*\) for the next learning task for Rock Paper Scissors

6.3 Discussion

As Table 3 shows, most of the IGGP tasks cannot be perfectly learned by existing ILP systems. The best performing system (\(\hbox {ILASP}^{*}\)) solves only 40% of the tasks perfectly. Our results suggest that the IGGP problem poses many challenges to existing approaches.

As mentioned in Sect. 4.3, we are unsure whether the dataset contains sufficient training examples for each approach to perfectly solve all of the tasks. Moreover, determining whether there is sufficient data is especially difficult because the different systems employ different biases. However, in most cases the ILP systems simply timed out, rather than learning an incorrect solution. The key issue is that the ILP systems we have considered do not scale to the large problems in the IGGP dataset. In the previous sections we discussed the limitations of each system; we now summarise these limitations to help explain what makes IGGP difficult for existing approaches.

Large programs

As discussed in Sect. 2, many reference solutions for IGGP games are large, both in the number of literals and in the number of clauses they contain. For instance, the GGP reference solution for the goal predicate for Connect Four uses 14 clauses and a total of 72 literals. However, learning large programs is a challenge for most ILP systems (Cropper 2017), which typically struggle to learn programs with hundreds of clauses or literals. Metagol, for instance, struggles to learn programs with more than 8 clauses.

Predicate invention The reference solution for goal in Connect Four uses auxiliary predicates (goal is defined in terms of lines, which are defined in terms of columns, rows, and diagonals). These auxiliary predicates are not strictly required, as any stratified definition with auxiliary predicates can be translated into an equivalent program with no auxiliary predicates; however, such equivalent programs are often significantly longer. If we unfold the reference solution to remove the auxiliary predicates, the resulting equivalent unfolded program contains over 400 literals. For ILP approaches that do not support the learning of programs containing auxiliary predicates (such as Progol, Aleph, and FOIL), it is infeasible to learn such a large program. More modern ILP approaches support predicate invention, enabling the learning of auxiliary predicates which are not in the language of the background knowledge or the examples; however, predicate invention is far from easy, and there are significant challenges associated with it, even for state-of-the-art ILP systems. ASPAL and ILASP support prescriptive predicate invention, where the schema of the auxiliary predicates (i.e. the arity and argument types) must be specified in the mode declarations (Law 2018). By contrast, Metagol supports automatic predicate invention, where Metagol invents auxiliary predicates without the need for user-supplied arities or type information. However, Metagol’s approach can still often lead to inefficiencies in the search, especially when multiple new predicate symbols are introduced.
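
As a schematic illustration (our own toy example, not the actual Connect Four solution) of why auxiliary predicates keep programs short, compare a definition that uses an auxiliary line/1 predicate with its unfolded equivalent:

% With an auxiliary predicate:
goal(P, 100) :- line(P).
line(P) :- row(P).
line(P) :- column(P).
line(P) :- diagonal(P).

% Unfolded equivalent without the auxiliary predicate: each alternative of
% line/1 is copied into the caller. If row/1, column/1, and diagonal/1 were
% themselves auxiliary, each of their alternatives would be copied in turn,
% quickly blowing up the program size.
goal(P, 100) :- row(P).
goal(P, 100) :- column(P).
goal(P, 100) :- diagonal(P).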

7 Conclusion

In this paper, we have expanded on the Inductive General Game Playing task proposed by Genesereth. We claimed that learning the rules of the GGP games is difficult for existing ILP techniques. To support this claim, we introduced an IGGP dataset based on 50 games from the GGP competition and we evaluated existing ILP systems on the dataset. Our empirical results show that most of the games cannot be perfectly learned by existing systems. The best performing system (\(\hbox {ILASP}^{*}\)) solves only 40% of the tasks perfectly. Our results suggest that the IGGP problem poses many challenges to existing approaches. We think that the IGGP problem and dataset will provide an exciting challenge for future research, especially as we have introduced techniques to continually expand the dataset with new games.

7.1 Limitations and future work

Better ILP systems

Our primary motivation for introducing this dataset is to encourage future research in ILP, especially on general ILP systems able to learn rules for a diverse set of tasks. In fact, we have already demonstrated two advancements in this paper: (1) a stochastic version of Metagol (Sect. 6.2.4), and (2) \(\hbox {ILASP}^{*}\) (Sect. 5.2.4), which scales ILASP2 up to the GGP dataset. In future work we intend to develop better ILP systems.

More games One of the main advantages of the IGGP problem is that the games are based on the GGP competition. As mentioned in the introduction, the GGP competition produces new games each year. These games are introduced independently from our dataset without any particular ILP system in mind. Therefore, because of our second contribution, we can continually expand the IGGP dataset with these new games. In future work we intend to automate this whole process and to ensure that all the data is publicly available.

More systems We have evaluated four ILP systems (Aleph, ASPAL, Metagol, and ILASP). In future work we would like to evaluate more ILP systems. We would also like to consider non-ILP systems (i.e. systems that may not necessarily learn explicit human-readable rules).

More evaluation metrics We have evaluated ILP systems according to two metrics: balanced accuracy and perfectly solved. However, there are other dimensions on which to evaluate the systems. We have not, for instance, considered the learning times of the systems (although they all had the same maximum time to learn during the evaluation), nor have we considered the sample complexity of the approaches. In future work it would be valuable to evaluate approaches while varying the number of game traces (i.e. observations) available, so as to identify the most data-efficient approaches.

More challenges The main challenge in using existing systems on this dataset is the deliberate lack of game-specific language biases, meaning that for many games the hypothesis space that each system must consider is extremely large. This reflects a major current issue in ILP, where systems are often given well-crafted language biases to ensure feasibility. However, this is not the only current challenge in ILP. For example, some ILP approaches target challenges such as learning from noisy data (Oblak and Bratko 2010; Evans and Grefenstette 2018; Law et al. 2018), probabilistic reasoning (Raedt et al. 2007; De Raedt and Thon 2010; Riguzzi et al. 2014; Bellodi and Riguzzi 2015; Riguzzi et al. 2016), non-determinism expressed through unstratified negation (Otero 2001; Law et al. 2018), and preference learning (Law et al. 2015b). Future versions of this dataset could be extended to contain these features.

Competitions SAT competitions have been held since 1992 with the aim of providing an objective evaluation of contemporary SAT solvers (Järvisalo et al. 2012). The competitions have significantly contributed to the progress of developing ever more efficient SAT techniques (Järvisalo et al. 2012). In addition, the competitions have motivated the SAT community to develop more robust, reliable, and general-purpose SAT solvers (i.e. implementations). We believe that the ILP community stands to benefit from an equivalent competition to focus and motivate research. We hope that this new IGGP problem and dataset will become a central component of such a competition.