Abstract
We study planning in relational Markov decision processes involving discrete and continuous states and actions, and an unknown number of objects. The resulting hybrid relational domains have so far received little attention. While both relational and hybrid approaches have been studied separately, planning in such domains is still challenging and often requires restrictive assumptions and approximations. We propose HYPE: a sample-based planner for hybrid relational domains that combines model-based approaches with state abstraction. HYPE samples episodes and uses the previous episodes as well as the model to approximate the Q-function. In addition, abstraction is performed for each sampled episode, which removes the complexity of symbolic approaches for hybrid relational domains. In our empirical evaluations, we show that HYPE is a general and widely applicable planner in domains ranging from strictly discrete to strictly continuous to hybrid ones, and that it handles intricacies such as unknown objects and relational models. Moreover, the empirical results show that abstraction provides significant improvements.
Keywords
MDP · Probabilistic planning · Logic programming · Relational MDP · Hybrid MDP · Hybrid relational MDP · Probabilistic programming · Abstraction · Logical regression · Importance sampling

1 Introduction
Markov decision processes (MDPs) (Sutton and Barto 1998) are a natural and general framework for modeling probabilistic planning problems. Since the world is inherently relational, an important extension is that of relational MDPs (Wiering and van Otterlo 2012), where the state is represented in terms of first-order logic, that is, as objects and relations between them. However, while significant progress has been made in developing robust planning algorithms for discrete, relational and continuous MDPs separately, the more intricate combination of the two (hybrid relational MDPs), as well as settings with an unknown number of objects, has received less attention.
The recent advances in probabilistic programming languages [e.g., BLOG (Milch et al. 2005a), Church (Goodman et al. 2008), ProbLog (Kimmig 2008), Anglican (Wood et al. 2014), distributional clauses (Gutmann et al. 2011)] have significantly improved the expressive power of formal representations for probabilistic models.
While it is known that these languages can express decision problems (Srivastava et al. 2014; Van den Broeck et al. 2010), including hybrid relational MDPs, it is less clear whether the built-in general-purpose inference systems can cope with the challenges (e.g., scale, time constraints) posed by actual planning problems, and compete with existing state-of-the-art planners.
In this paper, we consider the problem of effectively planning in propositional and relational domains where reasoning and handling unknowns may be needed in addition to coping with mixtures of discrete and continuous variables. In particular, we adopt dynamic distributional clauses (DDC) (Nitti et al. 2013, 2014) [an extension of distributional clauses (Gutmann et al. 2011), based on the distribution semantics (Sato 1995)] to describe the MDP and perform inference. In such general settings, exact solutions may be intractable, and so approximate solutions are the best we can hope for. Popular approximate solutions include Monte Carlo methods to estimate the expected reward of a policy (i.e., policy evaluation).
Monte Carlo methods provide state-of-the-art results in probabilistic planners (Kocsis and Szepesvári 2006; Keller and Eyerich 2012). Monte Carlo planners have mainly been applied in discrete domains [with some notable exceptions for continuous domains, such as Mansley et al. (2011) and Couetoux (2013)]. Typically, for continuous states, function approximation (e.g., linear regression) is applied. In that sense, one of the few Monte Carlo planners that works in arbitrary MDPs with no particular assumptions is sparse sampling (SST) (Kearns et al. 2002); but as we demonstrate later, it is often slow in practice. We remark that most, if not all, Monte Carlo methods require only a way to sample from the model of interest. While this property seems desirable, it prevents us from exploiting the actual probabilities of the model, as discussed (but left unaddressed) in Keller and Eyerich (2012). In this paper we address this issue by proposing a planner that exploits knowledge of the model via importance sampling to perform policy evaluation.
The first contribution of this paper is HYPE: a conceptually simple but powerful planning algorithm for a given (hybrid relational) MDP specified in DDC; HYPE can, however, be adapted to other languages, such as RDDL (Sanner 2010). The proposed planner exploits knowledge of the model via importance sampling to perform policy evaluation and, thus, policy improvement. Importance sampling has been used in off-policy Monte Carlo methods (Peshkin and Shelton 2002; Shelton 2001a, b), where policy evaluation is performed using trajectories sampled from another policy. We remark that standard off-policy Monte Carlo methods have been used in reinforcement learning, an essentially model-free setting. In our setting, given a planning domain, the proposed planner introduces a new off-policy method that exploits the model and works, under weak assumptions, in discrete, relational, continuous, and hybrid domains, as well as in those with an unknown number of objects.
The second contribution of this paper is a sample-based abstraction algorithm for HYPE. In particular, using individual samples of trajectories, it removes irrelevant facts from the sampled states with an approach based on logical regression. There exist several exact methods that perform symbolic dynamic programming, that is, dynamic programming at the level of abstract states (sets of states). Those methods have been successfully used in relational domains (Kersting et al. 2004; Wang et al. 2008). However, abstraction is more challenging in hybrid relational domains, even though some attempts have been made (Sanner et al. 2011; Zamani et al. 2012) in propositional domains, under expressivity restrictions. To overcome the complexity of logical regression in general hybrid relational domains, we perform abstraction at the level of sampled episodes. Such an approach carries over the benefits of symbolic methods to sampling approaches. We provide detailed derivations behind this abstraction, and show that it comes with a significant performance improvement.
The first contribution is based on our previous paper Nitti et al. (2015b) and the second on the workshop paper Nitti et al. (2015a). This paper extends that work with an algorithm for logical regression to abstract samples, a more detailed theoretical justification for abstraction, and additional experiments.
2 Background
2.1 Markov decision processes
In an MDP, a putative agent is assumed to interact with its environment, described using a set \( S \) of states, a set \( A \) of actions that the agent can perform, a transition function \( p: S \times A \times S \rightarrow [0,1] \), and a reward function \( R: S \times A \rightarrow {\mathbb {R}} \). That is, when in state \( s \) and on doing \( a \), the probability of reaching \( s' \) is given by \( p(s' \mid s, a) \), for which the agent receives the reward \( R(s,a) \). The agent is taken to operate over a finite number of time steps \( t = 0, 1, \ldots , T \), with the goal of maximizing the expected discounted reward: \({{\mathbb {E}}}[\sum _{t=0}^T \gamma ^ t R(s_t,a_t)] = {{\mathbb {E}}}[G(E)]\), where \(\gamma \in [0,1]\) is a discount factor, \(E=\langle s_0,a_0,s_1,a_1,\ldots ,s_T,a_T\rangle \) is the state and action sequence, called an episode, and \(G(E)=\sum _{t=0}^T \gamma ^{t}R(s_t,a_t)\) is the total discounted reward of \(E\).
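The total discounted reward \(G(E)\) can be computed with a one-line helper; a minimal sketch in Python (the function name is ours, for illustration only):

```python
def discounted_return(rewards, gamma=0.9):
    """Total discounted reward G(E) = sum_t gamma^t * R(s_t, a_t)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# An episode whose only nonzero reward (10) arrives at t = 2:
g = discounted_return([0.0, 0.0, 10.0], gamma=0.9)  # 0.9^2 * 10 = 8.1
```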
Alternatively, sample-based planners use Monte Carlo methods to solve an MDP and find a (near-)optimal policy. Such planners simulate (by sampling) interaction with the environment in episodes \(E^m=\langle s^m_0,a^m_0,s^m_1,a^m_1,\ldots ,s^m_T,a^m_T\rangle \), following some policy \(\pi \). Each episode is a trajectory of \( T \) time steps, and we let \( s_t ^ m \) denote the state visited at time \( t \) during episode \( m \). (So, after \( M \) episodes, \( M \times T \) states would be explored.) After or during episode generation, the sample-based planner updates \(Q_d(s^m_t,a^m_t)\) for each \(t\) according to a backup rule, for example, by averaging the total rewards obtained starting from \((s^m_t,a^m_t)\) until the end. The policy is improved using a strategy that trades off exploitation and exploration, e.g., the \(\epsilon \)-greedy strategy. In this case the policy used to sample the episodes is not deterministic; we write \(\pi (a_{t} \mid s_{t})\) for the probability of selecting action \(a_{t}\) in state \(s_{t}\) under the policy \(\pi \). Under certain conditions, after a sufficiently large number of episodes, the policy converges to a (near-)optimal policy, and the planner can execute the greedy policy \(argmax_a {Q}_d(s_t,a)\).
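The generic loop above can be sketched in a few lines of Python; this is a toy every-visit Monte Carlo backup with an \(\epsilon\)-greedy action choice (the function names and table layout are ours, not from the paper):

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore; otherwise pick argmax_a Q(s, a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def monte_carlo_backup(Q, counts, episode, gamma=0.9):
    """Every-visit backup: average the total discounted reward G(E_t)
    obtained from each (s_t, a_t) until the end of the episode."""
    G = 0.0
    for s, a, r in reversed(episode):  # episode: list of (s_t, a_t, r_t)
        G = r + gamma * G
        counts[(s, a)] += 1
        Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]  # incremental average
```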
2.2 Logic programming
In this section we briefly introduce logic programming concepts. See Nilsson and Małuszyński (1996), Apt (1997), Lloyd (1987) for an extensive introduction.
A Herbrand interpretation I is a set of ground atomic formulas that are assumed to be true. The facts not in I are assumed to be false. In this paper, Herbrand interpretations represent states. For example, \(I={\mathtt{\{}}{\mathtt{inside}}{\mathtt{(}}{\mathtt{1,3}}{\mathtt{)}},{\mathtt{inside}}{\mathtt{(}}{\mathtt{2}},{\mathtt{3}}{\mathtt{)}}{\mathtt{\}}}\) represents a world state where 1 and 2 are inside 3; any other fact is false.
In this paper we refer to complete (full) states and partial (abstract) states. In a full state every fact has a true/false assignment, and every other variable has a value. In a partial (abstract) state only some of the facts or variables have an assignment; the remaining facts and variables are left undefined. Formally, an abstract state is a conjunctive formula \({\mathcal {F}}\) that represents the set of complete states that satisfy \({\mathcal {F}}\), that is, \({\mathcal {F}} = l_1 \wedge \ldots \wedge l_n\) where all variables are existentially quantified and each literal \(l_i\) is either an atom or a negated atom. We will extend the usual notions of substitution, unification and subsumption to these expressions. In addition, an abstract state \({\mathcal {F}}\) subsumes a state s (notation \(s \preceq {\mathcal {F}}\)) if and only if there exists a substitution \(\theta \) such that \({\mathcal {F}}\theta \subseteq s\). For example, the abstract state \({\mathtt{on}}_{\mathtt{t}}{\mathtt{(}}{\mathtt{1,2}}{\mathtt{)}},{\mathtt{not}}{\mathtt{(}}{\mathtt{on}}_{\mathtt{t}} {\mathtt{(}}{\mathtt{2,table}}{\mathtt{)}}{\mathtt{)}}\) represents all the states where object 1 is on top of object 2 and 2 is not on a table. An abstract state might contain logical variables, e.g., \({\mathtt{on}}_{\mathtt{t}}{\mathtt{(}}{\mathtt{1,A}}{\mathtt{)}}\) represents the set of all the states where 1 is on top of an arbitrary object. An example of such a state is \({\mathtt{on}}_{\mathtt{t}}{\mathtt{(}}{\mathtt{1,2}}{\mathtt{)}},{\mathtt{on}}_{\mathtt{t}}{\mathtt{(}}{\mathtt{2,3}}{\mathtt{)}}\), subsumed by \({\mathtt{on}}_{\mathtt{t}}{\mathtt{(}}{\mathtt{1,A}}{\mathtt{)}}\): \({\mathtt{on}}_{\mathtt{t}} {\mathtt{(}}{\mathtt{1,2}}{\mathtt{)}},{\mathtt{on}}_{\mathtt{t}}{\mathtt{(}}{\mathtt{2,3}}{\mathtt{)}} \preceq {\mathtt{on}}_{\mathtt{t}}{\mathtt{(}}{\mathtt{1,A}}{\mathtt{)}}\). In this paper we consider only ground abstract states, that is, abstract states without logical variables.
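For ground abstract states (no logical variables, the case considered here), subsumption reduces to a membership check per literal; a minimal sketch in Python, with literals encoded as strings (our encoding, not the paper's):

```python
def subsumes(abstract_state, state):
    """s ⪯ F for a ground abstract state F: every positive literal of F must
    be in the full state s, and every literal under not(...) must be absent."""
    for lit in abstract_state:
        if lit.startswith("not(") and lit.endswith(")"):
            if lit[4:-1] in state:   # negated literal must not hold in s
                return False
        elif lit not in state:       # positive literal must hold in s
            return False
    return True

s = {"on_t(1,2)", "on_t(2,3)"}
subsumes({"on_t(1,2)", "not(on_t(2,table))"}, s)  # True: 1 on 2, 2 not on table
```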
2.3 Relational MDPs
In first-order (relational) MDPs, the state is represented in terms of logic formulas. In particular, in relational MDPs based on logic programming, a state is a Herbrand interpretation and the actions are described as facts. The state transition model and the reward function are compactly defined in terms of probabilistic rules exploiting first-order logic. For example, in a blocks world we can say that if \({\mathtt{on}}({\mathtt{A,C}}), {\mathtt{clean}}({\mathtt{B}})\) holds, then action \({\mathtt{move}}({\mathtt{A,B}})\) will, with probability 0.9, add \({\mathtt{on}}({\mathtt{A,B}})\) to the state and remove \({\mathtt{on}}({\mathtt{A,C}}),{\mathtt{clean}}({\mathtt{B}})\); otherwise, with probability 0.1, the state will remain unchanged. In addition, it is often convenient to define when an action is applicable in a given state. This can again be specified in terms of rules (clauses). The conditions that make an action applicable are often called preconditions.
A relational MDP can be solved using the Bellman equation applied to abstract states with logical regression, instead of to single states individually. This method is called symbolic dynamic programming (SDP), and it has been successfully used to solve large MDPs efficiently (Kersting et al. 2004; Wang et al. 2008; Joshi et al. 2010; Hölldobler et al. 2006). Similar principles have been applied in (propositional) continuous and hybrid domains (Sanner et al. 2011; Zamani et al. 2012). Despite the effectiveness of such approaches, they make restrictive assumptions (e.g., a deterministic transition model for continuous variables) to keep exact inference tractable. For more general domains, approximations are needed (Zamani et al. 2013). Another issue of SDP is keeping the structures that represent the V-function compact. Despite recent progress, and the availability of regression methods for inference in hybrid domains (Belle and Levesque 2014), SDP remains a challenging approach in general hybrid relational domains, including MDPs where the number of variables can change over time.
In Sect. 5 we will show how to simplify abstraction by performing regression on the sampled episodes.
3 Dynamic distributional clauses
Standard relational MDPs cannot handle continuous variables. To overcome this limitation we consider hybrid relational MDPs formulated using probabilistic logic programming (Kimmig et al. 2010; Gutmann et al. 2011; Nitti et al. 2013). In particular, we adopt (dynamic) distributional clauses (Nitti et al. 2013; Gutmann et al. 2011), an expressive probabilistic language that supports discrete and continuous variables and an unknown number of objects, in the spirit of BLOG (Milch et al. 2005a).
A distributional clause (DC) is of the form \({\mathtt {h}}\sim {\mathcal {D}} \leftarrow {\mathtt {b_1,\ldots ,b_n}}\), where the \({\mathtt {b_i}}\) are literals and \(\sim \) is a binary predicate written in infix notation. The intended meaning of a distributional clause is that each ground instance of the clause \(({\mathtt {h}}\sim {\mathcal {D}} \leftarrow {\mathtt {b_1,\ldots ,b_n}})\theta \) defines the random variable \({\mathtt {h}}\theta \) as being distributed according to \({\mathcal {D}}\theta \) whenever all the \({\mathtt {b_i}} \theta \) hold, where \(\theta \) is a substitution. Furthermore, a term \(\simeq \!\!(d)\) constructed from the reserved functor \(\simeq \!\!/1\) represents the value of the random variable d.
Example 1
A distributional program is a set of distributional clauses (some of which may be deterministic) that defines a distribution over possible worlds, which in turn defines the underlying semantics. A possible world is generated starting from the empty set \(S=\emptyset \); for each distributional clause \({\mathtt {h}} \sim {\mathcal {D}} \leftarrow \mathtt {b_1, \ldots , b_n}\), whenever the body \(\{\mathtt {b_1}\theta , \ldots , \mathtt {b_n}\theta \}\) is true in the set S for the substitution \(\theta \), a value v for the random variable \({\mathtt {h}}\theta \) is sampled from the distribution \({\mathcal {D}}\theta \) and \(\simeq \!\!(h\theta )=v\) is added to S. This is repeated until a fixpoint is reached, i.e., no further variables can be sampled. Dynamic distributional clauses (DDC) extend distributional clauses to temporally extended domains by associating a time index with each random variable.
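This generative semantics can be mimicked in a few lines of Python. The sketch below is a toy forward sampler over a hand-rolled clause encoding (a triple per clause); it is our illustration, not DDC syntax or the actual DDC engine:

```python
import random

# Toy encoding of a distributional program: each clause is a triple
# (head, sampler, body) standing for  h ~ D <- b1, ..., bn.
def sample_possible_world(clauses):
    """Forward-sample a possible world: whenever a clause body holds in the
    partial world and the head is still unsampled, draw its value; repeat
    until a fixpoint is reached (no further variables can be sampled)."""
    world = {}
    changed = True
    while changed:
        changed = False
        for head, sampler, body in clauses:
            if head not in world and all(b(world) for b in body):
                world[head] = sampler(world)
                changed = True
    return world

clauses = [
    # n_obj ~ uniform{1, 2, 3} <- true.
    ("n_obj", lambda w: random.randint(1, 3), []),
    # pos(1) ~ gaussian(0, 1) <- n_obj >= 1.
    ("pos(1)", lambda w: random.gauss(0.0, 1.0),
     [lambda w: w.get("n_obj", 0) >= 1]),
]
world = sample_possible_world(clauses)
```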
Example 2
4 HYPE: planning by importance sampling
In this section we introduce HYPE (= hybrid episodic planner), a planner for hybrid relational MDPs described in DDC. HYPE adopts an off-policy strategy (Sutton and Barto 1998) based on importance sampling and derived from the transition model. Related work is discussed more comprehensively in Sect. 6, but as we note later, sample-based planners typically require only a generative model (a way to generate samples) and do not exploit the model of the MDP (i.e., the actual probabilities) (Keller and Eyerich 2012). In our case, this knowledge leads to an effective planning algorithm that works in discrete, continuous, and hybrid domains, and/or domains with an unknown number of objects, under weak assumptions. Moreover, HYPE performs abstraction of sampled episodes. In this section we introduce HYPE without abstraction; the latter will be introduced in Sect. 5.
4.1 Basic algorithm

\( \tilde{Q} \) and \( \tilde{V} \) denote approximations of the \( Q \)- and \( V \)-functions, respectively.

Line 8 selects an action according to a given strategy.

Lines 9–12 sample the next step and, recursively, the remaining episode of total length \( T \), then store the total discounted reward \(G(E^m_{t})\) starting from the current state \(s^m_t\). This quantity can be interpreted as a sample of the expectation in formula (1), and thus as an estimator of the V-function. For this and other reasons explained later, \(G(E^m_{t})\) is stored as \(\tilde{V}^m_d(s^m_t)\).
Most significantly, line 6 approximates the \( Q \)-function using the weighted average of the stored \(\tilde{V}^i_{d-1}(s^i_{t+1})\) points:
$$\begin{aligned} \displaystyle \tilde{Q}^m_d\left( s^m_t,a\right) \leftarrow R\left( s^m_t,a\right) + \gamma \frac{\sum _{i=0}^{m-1} w^i \tilde{V}^i_{d-1}\left( s^i_{t+1}\right) }{\sum _{i=0}^{m-1} w^i}, \end{aligned}$$
(10)
where \( w^i \) is a weight function for episode \( i \) at state \( s^i_{t+1} \). The weight exploits the transition model and is defined as:
$$\begin{aligned} \displaystyle w^i=\frac{p\left( s^i_{t+1}\mid s^m_{t},a\right) }{q\left( s^i_{t+1}\right) } \alpha ^{(m-i)}. \end{aligned}$$
(11)
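This update can be sketched directly in Python, with the target likelihood \(p(\cdot \mid s^m_t, a)\) and the proposal density \(q\) passed in as callables; the helper below is illustrative, not the paper's implementation:

```python
def q_estimate(reward, v_points, p, q, m, alpha=0.9, gamma=0.9):
    """Eq. (10): Q(s, a) = R(s, a) + gamma * weighted average of stored V
    points, with weights from Eq. (11):
    w^i = p(s^i_{t+1} | s^m_t, a) / q(s^i_{t+1}) * alpha^(m - i)."""
    num = den = 0.0
    for i, s_next, v in v_points:   # stored (episode index, state, V value)
        w = p(s_next) / q(s_next) * alpha ** (m - i)
        num += w * v
        den += w
    if den == 0.0:
        return None                 # no usable samples for this action
    return reward + gamma * num / den
```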
Example 3
Let us assume we previously sampled some episodes of length \(T=10\), and we want to sample the \(m=4\)th episode starting from \(s_0=(0,0)\). We compute \(\tilde{Q}^m_{10}((0,0),a)\) for each action a (line 6). Thus, we compute the weights \(w^i\) using (11) for each stored sample \(\tilde{V}^i_{9}(s^i_1)\). For example, Fig. 1 shows the computation of \(\tilde{Q}^m_{10}((0,0),a)\) for actions \(a'=(0.4,0.3)\) and \(a''=(0.9,0.5)\), where we have three previous samples \(i=\{1,2,3\}\) at depth 9. The shading represents the likelihood \(p(s^i_1\mid s_0=(0,0),a)\) (left for \(a'\) and right for \(a''\)). The weight \(w^i\) (11) for each sample \(s^i_1\) is obtained by dividing this likelihood by \(q(s^i_1)\) (with \(\alpha =1\)). If \(q(s^i_1)\) is uniform over the three samples, sample \(i=2\) with total reward \(\tilde{V}^2_9(s^2_1)=98\) will have a higher weight than samples \(i=1\) and \(i=3\). The situation is reversed for \(a''\). Note that we can estimate \(\tilde{Q}^m_d(s^m_t,a)\) using episodes \(i\) that may never encounter \((s^{m}_t,a_t)\), provided that \(p(s^i_{t+1}\mid s^{m}_{t},a_{t})>0\).
4.2 Computing the (approximate) Q-function
To summarize, for each state \(s^m_t\), \(Q(s^m_t,a_t)\) is evaluated as the immediate reward plus the weighted average of the stored \(G(E^i_{t+1})\) points. In addition, for each state \(s^m_t\) the total discounted reward \(G(E^m_{t})\) is stored. We would like to remark that we can estimate the Q-function also for states and actions that have never been visited, as shown in Example 3. This is possible without using function approximations (beyond importance sampling).
4.3 Extensions
Instead of choosing between the two approaches we can use a linear combination, i.e., we replace line 11 with \(\tilde{V}^m_d(s^m_t)\leftarrow \lambda G(E^m_{t})+(1-\lambda ) max_a \tilde{Q}^m_d(s^m_t,a)\). The analysis from earlier applies by letting \( \lambda = 1 \). However, for \(\lambda = 0 \), we obtain a local value iteration step, where the stored \( \tilde{V} \) is obtained by maximizing the estimated \( \tilde{Q} \) values. Any intermediate value balances the two approaches [this is similar to, and inspired by, \(\hbox {TD}(\lambda )\) (Sutton and Barto 1998)]. Another strategy consists in storing the maximum of the two: \(\tilde{V}^m_d(s^m_t)\leftarrow max(G(E^m_{t}),max_a \tilde{Q}^m_d(s^m_t,a))\). In other words, we alternate Monte Carlo and Bellman backups according to which one has the higher value. This strategy often works well in practice; indeed, it avoids a typical issue in (on-policy) Monte Carlo methods: bad policies or exploration lead to low rewards, which are averaged into the estimated Q/V-function.
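The backup variants above fit in one small helper; a sketch (names and the `mode` switch are ours):

```python
def v_backup(G, q_max, lam=0.5, mode="lambda"):
    """Stored V value for a visited state: the Monte Carlo return G, the
    Bellman value max_a Q(s, a), a lambda-weighted mix of the two, or
    (mode="max") whichever of the two backups is larger."""
    if mode == "lambda":
        return lam * G + (1.0 - lam) * q_max
    return max(G, q_max)
```

With `lam=1.0` this reduces to the pure Monte Carlo backup, with `lam=0.0` to a local value iteration step, and `mode="max"` implements the max-of-both strategy.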
4.4 Practical improvements
In this section we briefly discuss some practical improvements of HYPE. To evaluate the Q-function the algorithm needs to query all the stored examples, making it potentially slow. This issue can be mitigated with solutions used in instance-based learning, such as hashing and indexing. For example, in discrete domains we avoid repeated computations of the likelihood and the proposal distribution for samples of the same state. In addition, assuming policy improvement over time, only the \(N_{{\textit{store}}}\) most recent episodes are kept, since older episodes were generally sampled with a worse policy.
HYPE relies on importance sampling to estimate the Q-function; thus we should guarantee that \(p>0 \Rightarrow q>0\), where p is the target and q the proposal distribution. This is not always the case, for example when we sample the first episode. Nonetheless, we can obtain an indication of the estimation reliability. In our algorithm we use \(\sum w^i\), whose expectation equals the number of samples: \({\mathbb {E}}[\sum w^i]=m\). If \(\sum w^i<{\textit{thres}}\), the samples available are considered insufficient to compute \(Q^m_d(s^m_t,a)\), and action \(a\) can instead be selected according to an exploration policy. It is also possible to add a fictitious weighted point in line 6 that represents the initial \(Q^m_d(s^m_t,a)\) guess. This can be used to exploit heuristics during sampling.
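The threshold test can be folded into action selection; a sketch under our own naming (the paper only specifies the \(\sum w^i < thres\) criterion):

```python
def select_action(q_by_action, weight_sums, thres, explore):
    """Treat a Q estimate as unreliable when its summed importance weights
    fall below a threshold; fall back to an exploration policy when no
    action has a reliable estimate."""
    reliable = {a: q for a, q in q_by_action.items()
                if weight_sums[a] >= thres}
    if not reliable:
        return explore()                    # insufficient samples everywhere
    return max(reliable, key=reliable.get)  # greedy among reliable estimates
```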
A more problematic situation is when, for some action \(a_{t}\) in some state \(s_{t}\), we always obtain null weights, that is, \(p(s^i_{t+1}\mid s_{t},a_{t})=0\) for each of the previous episodes i, no matter how many episodes are generated. This issue is solved by adding noise to the state transition model, e.g., Gaussian noise for continuous random variables. This effectively ‘smooths’ the V-function. Indeed, the Q-function is a weighted sum of V-function points, where the weights are proportional to a noisy version of the state transition likelihood.
5 Abstraction

Q-function estimation from abstracted states (line 6)

regression of the current state (line 11)

the procedure returns the abstract state and its V-function, instead of only the latter (line 15). This is required for recursive regression.
5.1 Basic principles of abstraction
Before describing abstraction formally, let us consider the blocks-world example to give an intuition. Figure 2 shows a sampled episode from the first state (bottom left) to the last state (top left) that ends in the goal state \({\mathtt{on(2,1)}}\). Informally, the relevant part of the episode is the set of facts responsible for reaching the goal or, more generally, for obtaining a given total reward. This relevant part is called the abstracted episode. Figure 2 shows the abstract states (circled) that together define the abstract episode. Intuitively, objects 3, 4, 5 and their relations are irrelevant for reaching the goal \({\mathtt{on(2,1)}}\), and thus do not belong to the abstracted episode.
Abstraction helps to exploit the previous episodes in more cases, speeding up convergence. For example, Fig. 2 shows the computation of a weight \(w^i\) [using (11)] to compute the Q-function of the (full) state \(s^m_t\) depicted on the right, exploiting the abstract state \(\hat{s}^i_{t+1}\) indicated by the arrow (from episode i). If the action is moving 4 on top of 5, we have \(p(\hat{s}^i_{t+1}\mid s^m_t,a)>0 \Rightarrow w^i>0\). Thus, the Q-function estimate \(\tilde{Q}^m_d(s_t,a)\) will include \(w^1\cdot 99\) in the weighted average (line 6 in Algorithm 2), making the action appealing. In contrast, without abstraction all actions get weight 0, because the full state \(s^i_{t+1}\) is not reachable from \(s^m_t\) (i.e. \(p(s^i_{t+1}\mid s^m_t,a)=0\)). Therefore, episode i cannot be used to compute the Q-function. For this reason abstraction requires fewer samples to converge to a near-optimal policy.
This idea also applies to continuous domains. For example, in the objpush scenario, the goal is to put any object in a given region; if the goal is reached, only one object is responsible, and any other object is irrelevant in that particular state.
5.2 Mathematical derivation
In this section we formalize sample-based abstraction and describe the assumptions that justify the Q-function estimation on abstract states (line 6 of Algorithm 2).
5.2.1 Abstraction applied to importance sampling
5.2.2 Importance weights for abstract episodes
We will now explain the weight derivation and motivate the approximations adopted. Until formula (22), the only assumption made is the Markov property on abstract states. No assumptions are made about the action distributions (policies) \(\pi ,\pi ^i\); thus the probability of an action \(a_t\) might depend on abstracted states at previous steps. Then (22) is replaced by (23), as discussed later. Finally, the policy ratio in (23) is replaced in (24), as in HYPE without abstraction.
Now let us focus on (24), derived from (23). Since the policies \(\pi ^i\) used in the episodes are assumed to improve over time, we replaced the policy ratio in (23) with a quantity that favors recent episodes, as in the propositional case [formula (18)]. Another way of justifying (24) is to estimate, for each stored abstract episode i, the Q-function \(Q_d^{\pi ^i}(s^{m}_t,a)\) with target policy \(\pi =\pi ^i\), using only the ith sample. With a marginalized target policy given by (26), the single weight of each estimate \(Q_d^{\pi ^i}(s^{m}_t,a)\) is exactly (27). The Q-function estimate used can be a weighted average of the \(Q_d^{\pi ^i}(s^{m}_t,a)\), where recent estimates (higher index \(i\)) receive higher weights because the policy is assumed to improve over time. Thus, the final weights are given by (24).
HYPE with abstraction adopts formula (21) and weights (24) for Q-function estimation. Note that during episode sampling the states are complete; nonetheless, to compute \(Q_d^\pi (s^{m}_t,a)\) at episode \(m\), all previously abstracted episodes \(i<m\) are considered. Finally, when the sampling of episode \(m\) is terminated, it is abstracted (line 11) and stored (line 14).
5.2.3 Ineffectiveness of lazy instantiation
Before explaining the proposed abstraction in detail, let us consider an alternative solution that samples abstract episodes directly, instead of sampling a complete episode and performing abstraction afterwards. If we are able to determine and sample partial states \(\hat{s}^m_t\), we can sample abstract episodes directly and perform Q-function estimation. Sampling the relevant partial episode \(\hat{E}_{t}\) can easily be performed using lazy instantiation, where, given the query \(G(E_t)\), only relevant random variables are sampled until the query can be answered. Lazy instantiation can exploit context-specific independencies and be extended to distributions with a countably infinite number of variables, as in BLOG (Milch et al. 2005a, b). Similarly, distributional clauses search for relevant random variables (or facts) using backward reasoning, while sampling is performed forward. For example, to prove \(R_t\) the algorithm needs to sample the variables \(\hat{s}_t\) relevant to \(R_t\); \(\hat{s}_t\) depends on \(\hat{s}_{t-1}\), and the action \(a_{t-1}\) depends on the admissible actions, which again depend on \(\hat{s}_{t-1}\), and so on. At some point variables can be sampled because they depend on known facts (e.g., the initial state \(s_0\)). This procedure guarantees that \(G( E_{t})=G(\hat{E}_{t})\), \(p(\hat{s}_{t+1} \mid s_{0:t},a_{t})=p(\hat{s}_{t+1} \mid \hat{s}_{t},a_{t})\) and \(\pi ^i(a\mid s_t)=\pi ^i(a\mid \hat{s}_t)\); thus (22) is exactly equal to (23) and it simplifies to \(\frac{ p(\hat{s}_{t+1} \mid s^{m}_{t},a)}{ q^i(\hat{s}_{t+1})} \frac{\prod _{k=t}^{T-1} \pi (a_{k+1} \mid \hat{s}_{k+1})}{\prod _{k=t}^{T-1} \pi ^i(a_{k+1} \mid \hat{s}_{k+1})}\). Finally, the approximation (24) can be used. Unfortunately, this method avoids sampling only those variables that are completely irrelevant; therefore, in many practical domains it will sample (almost) the entire state. Indeed, evaluating the admissible actions often requires sampling the entire state.
In other words, the abstract state \(\hat{s}_t\subseteq s_t \) that guarantees \(\pi ^i(a\mid s_t)=\pi ^i(a\mid \hat{s}_t)\) is often equal to \(s_t\). The solution adopted in this paper is to ignore the requirement \(\pi ^i(a\mid s_t)=\pi ^i(a\mid \hat{s}_t)\) and approximate (22) with (23), or equivalently to use (26) as the marginalized target policy distribution.
5.3 Samplebased abstraction by logical regression
In this section we describe how to implement the proposed sample-based abstraction. The implementation is based on dynamic distributional clauses for two reasons: DDC can represent complex hybrid relational domains, and it provides backward reasoning procedures useful for abstraction, as we will now describe.
The algorithm REGRESS for regressing a query (formula) using a set of facts is shown in Algorithm 3. The algorithm repeatedly tries to find literals in the query that could have been generated from the set of facts by a distributional clause. If it finds such a literal, the literal is replaced in the query by the body (condition part) of that clause. If not, the literal is added to the abstract state to be returned.
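A much-simplified, ground-only sketch of this loop in Python (the real REGRESS handles substitutions and distributional clauses; the string encoding and names here are ours):

```python
def regress(query, facts, clauses):
    """One regression step over ground clauses (head <- body): a query
    literal that some clause could have derived (its head matches and its
    body holds in `facts`) is replaced by that body; literals with no
    applicable clause are kept as-is in the returned abstract state."""
    state = set()
    for lit in query:
        for head, body in clauses:
            if lit == head and all(b in facts for b in body):
                state.update(body)   # regress the literal through the clause
                break
        else:
            state.add(lit)           # no clause applies: keep the literal
    return state
```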
Example 4
To illustrate the algorithm, consider the blocks-world example in Fig. 2, and the abstraction of the episode on the left. To prove the last reward we need to prove the goal; thus \(\hat{s}_2={\mathtt{{on(2,1)_{2}}}}\). Now consider time step 1: the proof for the immediate reward is \({\mathtt{{not(on(2,1)_{1})}}}\), while the proof for the next abstract state \(\hat{s}_2\) is \({\mathtt{{on(2,table)_{1}, clear(1)_{1},clear(2)_{1}}}}\); therefore the abstract state becomes \(\hat{s}_1={\mathtt{on(2,table)_{1}, clear(1)_{1},clear(2)_{1}}}\), \({\mathtt{not(on(2,1)_{1})}}\). Analogously, \(s'_0 = {\mathtt{{on(1,2)_{0}, on(2,table)_{0}, clear(1)_{0}, not(on(2,1)_{0})}}}\). The same procedure is applicable to continuous variables.
6 Related work
6.1 Nonrelational planners
There is an extensive literature on MDP planners; we focus mainly on Monte Carlo approaches. The most notable sample-based planners include sparse sampling (SST) (Kearns et al. 2002), UCT (Kocsis and Szepesvári 2006) and their variations. SST creates a lookahead tree of depth D, starting from state \(s_0\). For each action in a given state, the algorithm samples the next state C times. This produces a near-optimal solution with theoretical guarantees. In addition, the algorithm works in continuous and discrete domains with no particular assumptions. Unfortunately, the number of samples grows exponentially with the depth D, so the algorithm is extremely slow in practice. Some improvements have been proposed (Walsh et al. 2010), although the worst-case performance remains exponential. UCT (Kocsis and Szepesvári 2006) uses upper confidence bounds for multi-armed bandits to trade off exploration and exploitation in the tree search, and inspired successful Monte Carlo tree search methods (Browne et al. 2012). Instead of building the full tree, UCT chooses the action a that maximizes an upper confidence bound of Q(s, a), following the principle of optimism in the face of uncertainty. Several improvements and extensions of UCT have been proposed, including handling continuous actions (Mansley et al. 2011) [see Munos (2014) for a review] and continuous states (Couetoux 2013) with a simple Gaussian distance metric; however, knowledge of the probabilistic model is not directly exploited. For continuous states, parametric function approximation is often used (e.g., linear regression); nonetheless, the model needs to be carefully tailored to the domain to solve (Wiering and van Otterlo 2012).
There exist algorithms that exploit instance-based methods (e.g., Forbes and Andre 2002; Smart and Kaelbling 2000; Driessens and Ramon 2003) for model-free reinforcement learning. They basically store Q-point estimates and then use, e.g., neighborhood regression to evaluate Q(s, a) given a new point (s, a). While these approaches are effective in some domains, they require the user to design a distance metric that takes the domain into account. This is straightforward in some cases (e.g., in Euclidean spaces), but it can be harder in others. We argue that knowledge of the model can avoid (or simplify) the design of a distance metric in several cases, where the importance sampling weights and the transition model can be considered as a kernel.
The closest related works include Shelton (2001a, b), Peshkin and Shelton (2002) and Precup et al. (2000); they use importance sampling to evaluate a policy from samples generated with another policy. Nonetheless, they adopt importance sampling differently, without knowledge of the MDP model. Although this property seems desirable, the availability of the actual probabilities cannot be exploited, apart from sampling, in their approaches. The same conclusion is valid for practically any sample-based planner, which only needs a sample generator of the model. The work of Keller and Eyerich (2012) made a similar statement regarding PROST, a state-of-the-art discrete planner based on UCT, without providing a way to use the state transition probabilities directly. Our algorithm tries to alleviate this, exploiting the probabilistic model in a sample-based planner via importance sampling.
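A minimal sketch of the kind of off-policy importance sampling used in these works may help: the return of each episode is reweighted by the likelihood ratio of its action choices under the target and behaviour policies. The function name and episode encoding are ours; note that HYPE additionally exploits the state transition probabilities in its weights, which this sketch does not.

```python
def is_estimate(episodes, pi, b, gamma=1.0):
    """Ordinary importance sampling: evaluate a target policy `pi` from
    episodes generated by a behaviour policy `b`. Each episode is a list
    of (state, action, reward) triples; pi(a, s) and b(a, s) return the
    probability of choosing action a in state s."""
    total = 0.0
    for ep in episodes:
        w, ret, g = 1.0, 0.0, 1.0
        for (s, a, r) in ep:
            w *= pi(a, s) / b(a, s)   # likelihood ratio of the action choices
            ret += g * r              # discounted return of the episode
            g *= gamma
        total += w * ret              # reweighted return
    return total / len(episodes)
```

Episodes whose actions the target policy would never take receive weight zero; the remaining episodes are up-weighted accordingly.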
For more general domains that contain discrete and continuous (hybrid) variables, several approaches have been proposed under strict assumptions. For example, Sanner et al. (2011) provide exact solutions, but assume that the continuous aspects of the transition model are deterministic. In a related effort (Feng et al. 2004), hybrid MDPs are solved using dynamic programming, but assuming that the transition model and reward are piecewise constant or linear. Another planner, HAO* (Meuleau et al. 2009), uses heuristic search to find an optimal plan in hybrid domains with theoretical guarantees. However, it assumes that the Bellman equation integral can be computed.
For domains with an unknown number of objects, some probabilistic programming languages such as BLOG (Milch et al. 2005a), Church (Goodman et al. 2008), Anglican (Wood et al. 2014), and DC (Gutmann et al. 2011) can cope with such uncertainty. To the best of our knowledge, DTBLOG (Srivastava et al. 2014) and Vien and Toussaint (2014) are the only proposals able to perform decision making in such domains, using a POMDP framework. Furthermore, BLOG is one of the few languages that explicitly handles data association and identity uncertainty. The current paper does not focus on POMDPs, nor on identity uncertainty; however, interesting domains with an unknown number of objects can easily be described as an MDP that HYPE can solve.
Among the mentioned sample-based planners, one of the most general is SST, which does not make any assumption on the state and action space and relies only on Monte Carlo approximation. In addition, it is one of the few planners that can easily be applied to any DDC program, including MDPs with an unknown number of objects. For this reason SST was implemented for DDC and used as the baseline for our experiments.
6.2 Relational planners and abstraction
There exist several modeling languages for planning; the most recent is RDDL (Sanner 2010), which supports hybrid relational domains. An RDDL domain can be mapped to DDC and solved with HYPE. Nonetheless, RDDL does not support a state space with an unknown number of variables as in Example 2.
Relational MDPs can be solved using model-free approaches based on relational reinforcement learning (Džeroski et al. 2001; Tadepalli et al. 2004; Driessens and Ramon 2003), or model-based methods such as ReBel (Kersting et al. 2004), FODD (Wang et al. 2008), PRADA (Lang and Toussaint 2010), FLUCAP (Hölldobler et al. 2006) and many others. However, those approaches only support discrete (relational) state and action spaces.
Among model-based approaches, several symbolic methods have been proposed to solve MDPs exactly in propositional [see Mausam and Kolobov (2012) for a review] and relational domains (Kersting et al. 2004; Wang et al. 2008; Joshi et al. 2010; Hölldobler et al. 2006). They perform dynamic programming at the level of abstract states; this approach is generally called symbolic dynamic programming (SDP). Similar principles have been applied in (propositional) continuous and hybrid domains (Sanner et al. 2011; Zamani et al. 2012), where compact structures (e.g., ADD and XADD) are used to represent the V-function. Despite the effectiveness of such approaches, they make restrictive assumptions (e.g., a deterministic transition model for continuous variables) to keep exact inference tractable. For more general domains approximations are needed, for example sample-based methods or confidence intervals (Zamani et al. 2013). Another issue of SDP is keeping the structures that represent the V-function compact. Some solutions are available in the literature, such as pruning or real-time SDP (Vianna et al. 2015). Despite the recent progress, and the availability of regression methods for inference in hybrid domains (Belle and Levesque 2014), SDP remains a challenging approach in general hybrid relational domains.
Recently, abstraction has received a lot of attention in the Monte Carlo planning literature. As in our work, the aim is to simplify the planning task by aggregating states that behave similarly. There are several ways to define state equivalence; see Li et al. (2006) for a review. Some approaches adopt model equivalence: states are equivalent if they have the same reward and the same probabilities of ending up in other abstract states. Other approaches define the equivalence in terms of the V/Q-function. In particular, we take note of the following advances: (a) Givan et al. (2003), who compute equivalence classes of states based on exact model equivalence; (b) Jiang et al. (2014), who appeal to approximate local homomorphisms derived from a learned model; (c) Anand et al. (2015), who extend the approaches of Jiang and Givan by grouping state-action pairs; and (d) Hostetler et al. (2014), who aggregate states considering the V/Q-function with tight loss bounds.
In our work, in contrast, we consider equivalence (abstraction) at the level of episodes, not states: two episodes are equivalent if they have the same total reward. In addition, a Markov property condition on abstract states is imposed to make the weights in (21) easier to compute. Abstraction is performed independently in each episode, determining by logical regression the set of facts (or random variables) sufficient to guarantee these conditions. Note that the same full state \(s_t\) might have different abstractions in different episodes, even for the same action \(a_t\); this is generally not the case in other works. The proposed abstraction directly exploits the structure of the model and therefore relies on the (context-specific) independence assumptions explicitly encoded in it. However, it is possible to discover independence assumptions not explicitly encoded and include them in the model (e.g., using independence tests).
7 Experiments
In the experiments we investigate the following questions:
(Q1) Does HYPE without abstraction obtain the correct results?
(Q2) How is the performance of HYPE in different domains?
(Q3) How does HYPE compare with state-of-the-art planners?
(Q4) Is abstraction beneficial?
7.1 HYPE without abstraction
In this section we consider HYPE without abstraction. The algorithm HYPE and its theoretical foundations are based on approximations (e.g., Monte Carlo). For this reason we tested the correctness of HYPE's results in different planning domains (Q1). In particular, we tested HYPE on a nonlinear version of the hybrid mars rover domain (Sanner et al. 2011) (called simplerover1) for which the exact V-function is available. In this domain there is a rover that needs to take pictures. The state consists of a two-dimensional continuous rover position (x, y) and one discrete variable h to indicate whether the picture at a target point was taken. In this simplified domain we consider two actions: move with reward \(-1\), which moves the rover towards the target point (0, 0) by 1/3 of the current distance, and takepic, which takes the picture at the current location. The reward of takepic is \(\max(0, 4 - x^2 - y^2)\) if the picture has not already been taken (\(h={\textit{false}}\)) and 0 otherwise. In other words, the agent has to minimize the movement cost and take a picture as close as possible to the target point (0, 0). We chose 31 initial rover positions and ran the algorithm with depth \(d=3\) for 100 episodes each. An experiment took on average 1.4 s. Figure 3 shows the results, where the line is the exact V provided by Sanner et al. (2011) and the dots are estimated V points. The results show that the algorithm converges to the optimal V-function with a negligible error.
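The simplified rover domain can be sketched as a transition function. This is our reading of the description above, assuming the takepic reward is \(\max(0, 4 - x^2 - y^2)\) and that moving 1/3 of the distance towards (0, 0) scales the position by 2/3; the function name is illustrative.

```python
def step(x, y, h, action):
    """One transition of the simplified rover domain (simplerover1).
    `move` shifts the rover 1/3 of the way towards the target (0, 0)
    at cost -1; `takepic` earns max(0, 4 - x^2 - y^2) if the picture
    has not been taken yet (h is False), and 0 otherwise."""
    if action == "move":
        return (x * 2 / 3, y * 2 / 3, h), -1.0
    if action == "takepic":
        reward = max(0.0, 4 - x**2 - y**2) if not h else 0.0
        return (x, y, True), reward
    raise ValueError(f"unknown action: {action}")
```

The trade-off the planner must resolve is visible here: each move costs 1 but brings the rover closer to (0, 0), where the one-off picture reward peaks at 4.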
Experiments without abstraction: d is the horizon used by the planner, T the total number of steps, M the maximum number of episodes sampled for HYPE, and C the SST parameter (number of samples for each state and action). Time refers to the plan execution of one instance, from the starting state until the goal or the maximum number of steps is reached, with a timeout of 1800 s. PROST results refer to the IPPC 2011 planning competition
Domain  Planner  d  T  Param.  Reward  Time (s)  Size 

game1  HYPE  5  40  M = 1200  0.87 ± 0.11  662  9 discrete variables 10 actions 
SST  5  40  C = 1  0.34 ± 0.15  986  
HYPE  4  40  M = 1200  0.89 ± 0.07  312  
SST  4  40  C = 2  0.79 ± 0.08  1538  
PROST  0.99 ± 0.02  
game2  HYPE  5  40  M = 1200  0.67 ± 0.18  836  9 discrete variables 10 actions 
SST  5  40  C = 1  0.14 ± 0.20  1000  
HYPE  4  40  M = 1200  0.76 ± 0.19  582  
SST  4  40  C = 2  0.27 ± 0.22  1528  
PROST  1.00 ± 0.19  
sysadmin1  HYPE  5  40  M = 1200  0.94 ± 0.07  422  10 discrete variables 11 actions 
SST  5  40  C = 1  0.47 ± 0.13  1068  
HYPE  4  40  M = 1200  0.98 ± 0.06  346  
SST  4  40  C = 2  0.66 ± 0.08  1527  
PROST  1.00 ± 0.05  
sysadmin2  HYPE  5  40  M = 1200  0.87 ± 0.11  475  10 discrete variables 11 actions 
SST  5  40  C = 1  0.31 ± 0.12  1062  
HYPE  4  40  M = 1200  0.86 ± 0.11  392  
SST  4  40  C = 2  0.46 ± 0.12  1532  
PROST  0.98 ± 0.09  
objpush  HYPE  9  30  M = 4500  83.7 ± 7.6  472  2 continuous variables 4 actions 
SST  9  30  C = 1  82.7 ± 2.7  330  
HYPE  10  30  M = 4500  86.4 ± 1.0  1238  
SST  10  30  C = 1  82.4 ± 1.9  1574  
HYPE  12  30  M = 2000  87.5 ± 0.5  373  
SST  \(\ge \)11  30  C = 1  N/A  Timeout  
simplerover2  HYPE  8  8  M = 200  11.8 ± 0.2  38  2 continuous variables 1, 2, 3, 4 discrete variables 2, 3, 4, 5 actions 
SST  8  8  C = 1  11.4 ± 0.3  48  
HYPE  9  9  M = 500  11.7 ± 0.2  195  
SST  9  9  C = 1  11.3 ± 0.3  238  
HYPE  10  10  M = 500  11.9 ± 0.3  218  
SST  10  10  C = 1  11.2 ± 0.3  1043  
marsrover  HYPE  6  40  M = 6000  249.8 ± 33.5  985  2 continuous variables 5 discrete variables 10 actions 
SST  6  40  C = 1  227.7 ± 27.3  787  
HYPE  7  40  M = 6000  269.0 ± 29.4  983  
SST  7  40  C = 1  N/A  Timeout  
HYPE  10  40  M = 4000  296.3 ± 19.5  1499  
SST  \(\ge \) 8  40  C = 1  N/A  Timeout  
objsearch  HYPE  5  5  M = 500  2.53 ± 1.03  13  Variable size 
SST  5  5  C = 5  1.46 ± 1.00  45  
HYPE  5  5  M = 600  3.64 ± 1.09  17  
SST  5  5  C = 6  2.48 ± 1.00  138  
HYPE  6  6  M = 600  3.30 ± 1.60  20  
SST  6  6  C = 5  0.58 ± 1.40  889 
To answer (Q2) and (Q3) we studied the planner in a variety of settings: discrete, continuous and hybrid domains, as well as domains with an unknown number of objects. We performed experiments in a more realistic mars rover domain that is publicly available,^{2} called marsrover (Fig. 3). In this domain we consider one robot and 5 points at which pictures need to be taken. The state is similar to the simplerover domains: a continuous two-dimensional position and a binary variable for each picture point. However, in marsrover the robot can move an arbitrary displacement along the two dimensions. The continuous action space is discretized, as required by HYPE and SST. The movement of the robot incurs a negative reward proportional to the displacement, and pictures can be taken only close to the interest point. Each taken picture provides a different reward.
Other experiments were performed in the continuous objpush MDP described in Sect. 4 (Fig. 1), and in discrete benchmark domains of the IPPC 2011 competition. In particular, we tested a pair of instances of Game of life (called game in the experiments) and of the sysadmin domain. Game of life consists of a grid of \(X \times Y\) cells, where each cell can be dead or alive. The state of each cell changes over time and depends on the neighboring cells. In addition, the agent can set a cell to alive or do nothing. The reward depends on the number of cells alive. We consider instances with \(3 \times 3\) cells, and thus a state of 9 binary variables and 10 actions. The sysadmin domain consists of a network of computers. Each computer may be either running or crashed. The probability of a computer crashing depends on how many computers it is connected to. The agent can choose at each step to reboot a computer or do nothing. The goal is to maximize the number of computers running and minimize the number of reboots required. We consider instances with 10 computers (i.e., 10 binary random variables), and so there are 11 actions. The results in the discrete IPPC 2011 domains are compared with PROST (Keller and Eyerich 2012), the IPPC 2011 winner, and shown in Table 1 in terms of scores, i.e., the average reward normalized with respect to the IPPC 2011 results; score 1 is the highest result obtained (on average), score 0 is the maximum between the random and the no-operation policy.
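The scoring scheme can be written down explicitly. This is a sketch under our reading of the description above (clipping the score to [0, 1] is our assumption):

```python
def ippc_score(reward, best, random_r, noop_r):
    """IPPC-2011-style normalisation as used in Table 1: 1.0 for the best
    average reward, 0.0 for the better of the random and no-op policies."""
    baseline = max(random_r, noop_r)
    if best == baseline:
        return 0.0
    # linear interpolation between baseline (0) and best result (1)
    return max(0.0, min(1.0, (reward - baseline) / (best - baseline)))
```

For example, a planner whose average reward lies halfway between the baseline and the best competitor gets a score of 0.5.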
As suggested by Keller and Eyerich (2012), limiting the horizon of the planner increases performance in several cases. We exploited this idea for HYPE as well as SST (simplerover2 excluded). For SST we were forced to use small horizons to keep the plan time under 30 min. In all experiments we followed the IPPC 2011 schema, that is, each instance is repeated 30 times (objectsearch excluded), the results are averaged, and the \(95\%\) confidence interval is computed. However, for every instance we replan from scratch for a fair comparison with SST. In addition, time and number of samples refer to the plan execution of one instance.
The results (Table 1) highlight that our planner obtains generally better results than SST, especially at higher horizons. HYPE obtains good results in discrete domains but does not reach state-of-the-art results (score 1) for two main reasons. The first is the lack of a heuristic, which can dramatically improve performance; indeed, heuristics are an important component of PROST (Keller and Eyerich 2012), the IPPC winning planner. The second reason is the time performance, which allows us to sample only a limited number of episodes and would not allow all the IPPC 2011 domains to finish in 24 h. This is caused by a non-optimized Prolog implementation and by the expensive Q-function evaluation; however, we are confident that heuristics and other improvements will significantly improve performance and results. In particular, the weight computation can be performed on a subset of the stored V-state points, excluding points for which the weight is known to be small or zero. For example, in the objpush domain, points too far from the current position plus action displacement can be discarded because their weight will be negligible. Thus, a nearest-neighbor search can be used for fast retrieval of the relevant stored V-points.
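The nearest-neighbor retrieval idea can be sketched as follows. The function and data layout are hypothetical, and the fixed radius stands in for "weight known to be small or zero"; a spatial index (e.g., a k-d tree) would replace the linear scan in a real implementation.

```python
import heapq
import math

def relevant_vpoints(stored, query, k, radius):
    """Keep only the k stored V-points closest to the query position
    (current position plus action displacement) within `radius`; points
    further away would get negligible importance weights and are skipped.
    `stored` is a list of ((x, y), value) pairs."""
    near = [(math.dist(pos, query), pos, v)
            for pos, v in stored
            if math.dist(pos, query) <= radius]
    # nsmallest orders by distance (the first tuple element)
    return heapq.nsmallest(k, near)
```

Only the returned subset would enter the weight computation, instead of all stored points.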
Moreover, we performed experiments in the objectsearch scenario (Sect. 3), where the number of objects is unknown, even though the domain is modeled as a fully observable MDP. The results are averaged over 400 runs and confirm better performance for HYPE with respect to SST.
7.2 HYPE with abstraction
Experiments with and without abstraction
Domain  Abstract  d  T  M  Reward  Success (%)  Time (s)  Size 

BW 4  No  10  10  200  74.6 ± 7.4  82  38  4 objects 
Yes  10  10  200  80.3 ± 7.2  88  32  
BW 6  No  16  16  200  −16.0 ± 0.0  0  112  6 objects 
Yes  16  16  200  54.8 ± 13.4  68  42  
BWC 4  No  10  10  200  11.8 ± 2.7  84  52  4 objects 
Yes  10  10  200  14.2 ± 1.9  94  24  1 cont. variable  
BWC 6  No  18  18  200  −18.0 ± 0.0  0  186  6 objects 
Yes  18  18  200  8.4 ± 2.4  94  70  1 cont. variable  
push1  No  20  30  1000  67.6 ± 11  86  734  4 cont. variables 
Yes  20  30  1000  84.0 ± 4.7  98  652  
push2  No  20  30  1000  −30.0 ± 0.0  0  1963  6 cont. variables 
Yes  20  30  1000  30.5 ± 14.0  58  910  
push3  No  20  40  500  −17.4 ± 12.6  20  638  6 cont. variables 
Yes  20  40  500  89.3 ± 1.5  100  122  
mars1  No  30  40  1500  280.0 ± 11.8  90  1492  2 cont. variables 
Yes  30  40  1500  273.3 ± 11.0  86  780  5 discrete variables  
mars2  No  30  40  1000  209.3 ± 27.7  37  2817  4 cont. variables 
Yes  30  40  1000  287.7 ± 23.6  87  902  5 discrete variables 
The current implementation supports negation only for ground formulas. Regression of non-ground formulas is possible when the domain is purely relational; however, it becomes challenging when continuous random variables and logical variables occur in a negated formula. If we assume that the domain is fixed (e.g., the number of objects is known), logical variables can be replaced with the objects in the domain, making the formulas ground. For this reason we do not consider domains with an unknown number of objects here, which HYPE without abstraction can solve.
The experiments are shown in Table 2. The rewards are averaged over 50 runs and a \(95\%\) confidence interval is computed. The results highlight that abstraction either improves the expected total reward for the same number of samples or achieves comparable results. In addition, HYPE with abstraction is always faster. The latter is probably due to faster weight computation with abstract states and to the generation of better plans, which are generally shorter and thus faster to execute. This suggests that the overhead caused by the abstraction procedure is negligible and worthwhile. We do remark that in domains where the whole state is always relevant, abstraction adds no value. For example, the reward in Game of life and sysadmin always depends on the full state, so abstraction is not useful because abstract state and full state coincide. Nonetheless, other types of abstraction can be beneficial. Indeed, the proposed abstraction (Algorithm 3) produces ground abstract states (i.e., states whose facts contain no logical variables); this is required to allow abstraction in complex domains. In more restricted domains (e.g., discrete ones) more effective abstractions can improve performance. For example, the abstractions used in SDP (Kersting et al. 2004; Wang et al. 2008; Joshi et al. 2010; Hölldobler et al. 2006) produce abstract states with logical variables.
8 Conclusions
We proposed a sample-based planner for MDPs described in DDC under weak assumptions, and showed how the state transition model can be exploited in off-policy Monte Carlo estimation. The experimental results show that the algorithm produces good results in discrete, continuous and hybrid domains, as well as in those with an unknown number of objects. Most significantly, it challenges and outperforms SST. Moreover, we extended HYPE with abstraction. We formally described how (context-specific) independence assumptions can be exploited to perform episode abstraction; this is valid for propositional as well as relational domains. A theoretical derivation justifies the assumptions and approximations used. Finally, empirical results showed that abstraction provides significant improvements.
References
Anand, A., Grover, A., & Singla, P. (2015). ASAP-UCT: Abstraction of state-action pairs in UCT. In Proceedings of IJCAI (pp. 1509–1515).
Apt, K. (1997). From logic programming to Prolog. Prentice-Hall international series in computer science. Upper Saddle River: Prentice Hall.
Belle, V., & Levesque, H. J. (2014). PREGO: An action language for belief-based cognitive robotics in continuous domains. In Proceedings of the twenty-eighth AAAI conference on artificial intelligence, July 27–31, 2014, Québec City, Québec, Canada (pp. 989–995).
Browne, C., Powley, E. J., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., et al. (2012). A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1), 1–43. http://dblp.uni-trier.de/db/journals/tciaig/tciaig4.html.
Couetoux, A. (2013). Monte Carlo tree search for continuous and stochastic sequential decision making problems. Thesis, Université Paris-Sud (Paris XI).
Driessens, K., & Ramon, J. (2003). Relational instance based regression for relational reinforcement learning. In Proceedings of the ICML (pp. 123–130).
Džeroski, S., De Raedt, L., & Driessens, K. (2001). Relational reinforcement learning. Machine Learning, 43(1–2), 7–52.
Feng, Z., Dearden, R., Meuleau, N., & Washington, R. (2004). Dynamic programming for structured continuous Markov decision problems. In Proceedings of the UAI (pp. 154–161).
Forbes, J., & Andre, D. (2002). Representations for learning control policies. In Proceedings of the ICML workshop on development of representations (pp. 7–14).
Givan, R., Dean, T., & Greig, M. (2003). Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence, 147(1–2), 163–223.
Goodman, N., Mansinghka, V. K., Roy, D. M., Bonawitz, K., & Tenenbaum, J. B. (2008). Church: A language for generative models. In Proceedings of the UAI (pp. 220–229).
Gutmann, B., Thon, I., Kimmig, A., Bruynooghe, M., & De Raedt, L. (2011). The magic of logical inference in probabilistic programming. Theory and Practice of Logic Programming, 11, 663–680.
Hölldobler, S., Karabaev, E., & Skvortsova, O. (2006). FluCaP: A heuristic search planner for first-order MDPs. Journal of Artificial Intelligence Research, 27, 419–439.
Hostetler, J., Fern, A., & Dietterich, T. (2014). State aggregation in Monte Carlo tree search. In Proceedings of AAAI.
Jiang, N., Singh, S., & Lewis, R. (2014). Improving UCT planning via approximate homomorphisms. In Proceedings of the 2014 international conference on autonomous agents and multiagent systems (pp. 1289–1296).
Joshi, S., Kersting, K., & Khardon, R. (2010). Self-taught decision theoretic planning with first order decision diagrams. In ICAPS (pp. 89–96).
Kearns, M., Mansour, Y., & Ng, A. Y. (2002). A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2–3), 193–208.
Keller, T., & Eyerich, P. (2012). PROST: Probabilistic planning based on UCT. In Proceedings of the ICAPS.
Kersting, K., Van Otterlo, M., & De Raedt, L. (2004). Bellman goes relational. In Proceedings of the ICML (p. 59).
Kimmig, A., Demoen, B., De Raedt, L., Santos Costa, V., & Rocha, R. (2010). On the implementation of the probabilistic logic programming language ProbLog. Theory and Practice of Logic Programming (TPLP), 11, 235–262.
Kimmig, A., Santos Costa, V., Rocha, R., Demoen, B., & De Raedt, L. (2008). On the efficient execution of ProbLog programs. In Logic programming. Lecture notes in computer science (pp. 175–189). Berlin: Springer.
Kocsis, L., & Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In Proceedings of the ECML.
Lang, T., & Toussaint, M. (2010). Planning with noisy probabilistic relational rules. Journal of Artificial Intelligence Research, 39, 1–49.
Li, L., Walsh, T. J., & Littman, M. L. (2006). Towards a unified theory of state abstraction for MDPs. In ISAIM.
Lloyd, J. (1987). Foundations of logic programming. New York: Springer.
Mansley, C. R., Weinstein, A., & Littman, M. L. (2011). Sample-based planning for continuous action Markov decision processes. In Proceedings of the ICAPS.
Mausam, & Kolobov, A. (2012). Planning with Markov decision processes: An AI perspective. San Rafael: Morgan & Claypool Publishers.
Meuleau, N., Benazera, E., Brafman, R. I., Hansen, E. A., & Mausam (2009). A heuristic search approach to planning with continuous resources in stochastic domains. Journal of Artificial Intelligence Research, 34(1), 27.
Milch, B., Marthi, B., Russell, S., Sontag, D., Ong, D., & Kolobov, A. (2005a). BLOG: Probabilistic models with unknown objects. In Proceedings of the IJCAI.
Milch, B., Marthi, B., Sontag, D., Russell, S., Ong, D. L., & Kolobov, A. (2005b). Approximate inference for infinite contingent Bayesian networks. In Proceedings of the 10th international workshop on artificial intelligence and statistics.
Munos, R. (2014). From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends in Machine Learning, 7, 1–129.
Nilsson, U., & Małuszyński, J. (1996). Logic, programming and Prolog (2nd ed.). Hoboken: Wiley.
Nitti, D., Belle, V., De Laet, T., & De Raedt, L. (2015). Sample-based abstraction for hybrid relational MDPs. In European workshop on reinforcement learning (EWRL 2015).
Nitti, D., Belle, V., & De Raedt, L. (2015). Planning in discrete and continuous Markov decision processes by probabilistic programming. In Proceedings of the European conference on machine learning and knowledge discovery in databases (ECML/PKDD), 2015.
Nitti, D., De Laet, T., & De Raedt, L. (2013). A particle filter for hybrid relational domains. In Proceedings of the IROS.
Nitti, D., De Laet, T., & De Raedt, L. (2014). Relational object tracking and learning. In Proceedings of the ICRA.
Owen, A. B. (2013). Monte Carlo theory, methods and examples. http://statweb.stanford.edu/~owen/mc/.
Peshkin, L., & Shelton, C. R. (2002). Learning from scarce experience. In Proceedings of the ICML (pp. 498–505).
Precup, D., Sutton, R. S., & Singh, S. P. (2000). Eligibility traces for off-policy policy evaluation. In Proceedings of the ICML.
Sanner, S. (2010). Relational dynamic influence diagram language (RDDL): Language description. Unpublished paper.
Sanner, S., Delgado, K. V., & de Barros, L. N. (2011). Symbolic dynamic programming for discrete and continuous state MDPs. In Proceedings of the UAI (pp. 643–652).
Sato, T. (1995). A statistical learning method for logic programs with distribution semantics. In Proceedings of the twelfth international conference on logic programming (pp. 715–729). MIT Press.
Shelton, C. R. (2001a). Importance sampling for reinforcement learning with multiple objectives. Ph.D. thesis, MIT.
Shelton, C. R. (2001b). Policy improvement for POMDPs using normalized importance sampling. In Proceedings of the UAI (pp. 496–503).
Smart, W. D., & Kaelbling, L. P. (2000). Practical reinforcement learning in continuous spaces. In Proceedings of the ICML.
Srivastava, S., Russell, S., Ruan, P., & Cheng, X. (2014). First-order open-universe POMDPs. In Proceedings of the UAI.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press.
Tadepalli, P., Givan, R., & Driessens, K. (2004). Relational reinforcement learning: An overview. In Proceedings of the ICML-2004 workshop on relational reinforcement learning (pp. 1–9).
Van den Broeck, G., Thon, I., van Otterlo, M., & De Raedt, L. (2010). DTProbLog: A decision-theoretic probabilistic Prolog. In Proceedings of the AAAI (pp. 1217–1222).
Vianna, L. G. R., de Barros, L. N., & Sanner, S. (2015). Real-time symbolic dynamic programming. In Proceedings of the twenty-ninth AAAI conference on artificial intelligence, January 25–30, 2015, Austin, Texas, USA (pp. 3402–3408).
Vien, N. A., & Toussaint, M. (2014). Model-based relational RL when object existence is partially observable. In Proceedings of the ICML.
Walsh, T. J., Goschin, S., & Littman, M. L. (2010). Integrating sample-based planning and model-based reinforcement learning. In Proceedings of the AAAI.
Wang, C., Joshi, S., & Khardon, R. (2008). First order decision diagrams for relational MDPs. Journal of Artificial Intelligence Research (JAIR), 31, 431–472.
Wiering, M., & van Otterlo, M. (2012). Reinforcement learning: State-of-the-art. Adaptation, learning, and optimization. Berlin: Springer.
Wood, F., van de Meent, J. W., & Mansinghka, V. (2014). A new approach to probabilistic programming inference. In Proceedings of the 17th international conference on artificial intelligence and statistics (pp. 1024–1032).
Zamani, Z., Sanner, S., Delgado, K. V., & de Barros, L. N. (2013). Robust optimization for hybrid MDPs with state-dependent noise. In Proceedings of the 23rd international joint conference on artificial intelligence (IJCAI 2013), Beijing, China, August 3–9, 2013.
Zamani, Z., Sanner, S., & Fang, C. (2012). Symbolic dynamic programming for continuous state and action MDPs. In Proceedings of the AAAI.