1 Introduction

Partially observable Markov decision processes (POMDPs) [11] have become a very popular approach to agent reasoning and planning, such as for robotics (e.g., [13, 23]) and human-agent interactions (e.g., [4, 8, 24]). POMDPs explicitly model complex environment dynamics, such as partial observability of environment states revealed through actions, as well as changes to environment state resulting from actions. Using such information, agents can (1) discover the true environment state hidden by partial observability in order to reduce the uncertainty in their beliefs and make more informed decisions, and (2) plan action sequences that maximize expected rewards given their uncertain beliefs.

Reducing the time spent (i.e., the computational complexity) on planning with POMDPs has been a topic of much research in the literature (e.g., [12, 15–17, 19, 20, 25]). This is especially important for online POMDP planning [18], where an agent interleaves planning and execution as it operates in the environment and must therefore plan quickly due to real-time constraints. Ultimately, the agent’s goal when planning is to calculate a good estimate of the cumulative, future rewards from its current situation dependent on the different actions it could take, in order to choose how to behave in the environment. In most problems, this requires being able to plan many steps in advance in order to form good estimates of future rewards. Unfortunately, the complexity of optimal planning is exponential in the planning horizon (i.e., the number of steps the agent looks ahead during planning). Moreover, the complexity is also polynomial in the size of the state space, which is often quite large (necessary to adequately capture and reflect the nuances of real-world environments). Therefore, planning far enough in advance across all possible future situations is prohibitively expensive under time constraints, and agents are thus commonly restricted to forming approximately best plans rather than acting optimally, which reduces their ability to maximize long-term rewards and achieve correct, goal-directed behavior.

In order to provide the most useful cumulative, future reward estimations, many of the state-of-the-art approaches to online planning sacrifice the breadth of planning in order to enable the agent to plan farther in advance for certain situations, thereby forming better estimations of the rewards (and thus better understanding how to act) in those situations. The success of this type of approach depends on the agent’s ability to select (in advance) the correct scenarios it will indeed face. Two common approaches of this type include (1) selectively expanding plans along attractive belief states (according to some heuristic function) using heuristic search (e.g., AEMS2 [17]), or (2) sparse random sampling of situations biased towards highly probable state/action/observation sequences and high estimated rewards using Monte Carlo search techniques (e.g., DESPOT [21]). So long as the heuristic chosen in heuristic search methods or the sampling performed in Monte Carlo methods expands plans along the correct situations towards high future rewards and goal accomplishment, these approaches have demonstrated an ability to form plans as good as those of state-of-the-art offline planners, where time constraints are more relaxed and agents can afford greater breadth and depth of planning [18, 19, 21, 25].

However, it would be ideal for a POMDP planning algorithm to achieve accurate cumulative, future reward estimations without having to sacrifice the breadth of planning. Indeed, sacrificing breadth can be inherently detrimental to the agent’s behavior in several ways. For example, depth-focused planning algorithms can cause an agent to fail to adequately consider scenarios it might actually encounter in the near future when executing the plan (i.e., if they are unattractive according to the chosen heuristic in heuristic search algorithms or if they are not quite as likely as other scenarios in Monte Carlo methods), and thus the agent could end up in a position where it does not know what to do in order to adequately achieve its goals. In complex, real-world applications of intelligent agents and multiagent systems, such a predicament could even pose imminent danger to the agent (e.g., a search and rescue robot exploring a damaged building in a section about to collapse) or affect the quality of the system (e.g., increased human user frustration caused by improper interactions from a mixed-initiative software agent). Additionally, in problems requiring long action sequences to achieve large rewards (e.g., highly uncertain environments requiring large quantities of information gathering), even depth-focused planning algorithms might fail to adequately plan far enough ahead to discover large future rewards and thus underestimate the value of the best actions, leaving the agent unsure how best to act, or even overvalue suboptimal actions (that achieve greater intermediate rewards but lower cumulative rewards in the long run). This, too, can cause the agent to reach undesirable situations that make it difficult for the agent to achieve its goals in the long run.

Overall, it would be advantageous for an agent if it could implicitly estimate cumulative, future rewards without requiring time-consuming, explicit, depth-based calculations so that it can achieve the best of both worlds: allowing time for full breadth of planning—to avoid the potential pitfalls described above—and also creating better estimations of cumulative rewards over the long term. This should produce a planner that is both safer to use in complex environments and still achieves high rewards over time and ultimately goal achievement. In this paper, we explore how to perform implicit future reward estimation within full breadth planning.

In particular, we consider potential-based reward shaping (PBRS) [2, 6, 7, 14], a popular technique from the related field of reinforcement learning for implicitly guiding agents towards large future rewards, and apply this technique to online POMDP planning. In this context, PBRS uses additional information about the agent’s current situation (represented by belief states in POMDPs), measured by potential functions reflecting the potential of earning large future rewards from any particular situation, in order to shape the rewards maximized by the agent. That is, this additional information guides the agent to optimistically take actions leading to situations (i.e., belief states) likely to earn large future rewards beyond its planning horizon, thereby enjoying the benefits of deeper planning without suffering from the would-be computational costs.

Although PBRS has previously been applied to planning in less complex, fully observable Markov decision processes (MDPs) [22] and can be seen as an extension of leaf evaluation heuristics (e.g., [18, 22]; see Footnote 1) to anytime planning, this first application of PBRS to online POMDP planning provides additional insights and benefits previously unreported. Specifically, we discover and provide several novel contributions to both the PBRS and online POMDP planning literature:

  1.

    A novel characterization of different categories of potential functions that provide different indications of which situations are favorable to the agent (beyond its available planning horizon) for earning greater quantities of cumulative, future rewards, including both domain-specific and domain-independent expertise. Previous research has not distinguished between different types of potential functions, and this categorization helps us understand what types of potential functions might be useful in different problems.

  2.

    Two novel types of potential functions unique to POMDPs exploiting different properties of belief states: (a) the agent’s knowledge about the environment represented as a probability distribution, and (b) a sufficient statistic representing the history of interactions by the agent with its environment. Such types capture and exploit information not considered previously in the use of PBRS or leaf evaluation heuristics for planning, enable agent metareasoning with POMDP planning, and prove to be very useful for earning large rewards by agents in an empirical study.

  3.

    Several theoretical results describing the benefits of using PBRS during online POMDP planning, including (a) for any finite horizon of planning depth, PBRS can result in different plans found than the approximately best plan found without PBRS, making it possible to achieve plans closer to the actions within the (infinite horizon) optimal policy when using a potential function that is a good indicator of future rewards; (b) PBRS has the greatest ability to produce plans that are better in the long term when using the shortest horizons, making it a good choice for online planning with real-time constraints; (c) even though PBRS modifies the reward function maximized by the agent, the (infinite horizon) optimal policy under PBRS is the same as the (infinite horizon) optimal policy to the original reward function, so using PBRS still targets plans that optimize the agent’s goals and task accomplishment (i.e., using PBRS is still working towards the same objective, even if it finds different, and hopefully better, policies when using finite horizon planning); and (d) so long as the potential function is convex, the shaped reward calculations remain convex and can thus be solved by a wide range of popular POMDP solvers.

  4.

    A comprehensive experimental study investigating the empirical performance of PBRS for online POMDP planning using 20 different potential functions across multiple benchmark problems with different properties, as well as an identification of the benefits and weaknesses of PBRS when compared against state-of-the-art heuristic search and Monte Carlo planning approaches commonly used for online POMDP planning. In particular, we discover that combinations of potential functions including both (a) domain-specific information (as done elsewhere in the PBRS literature) and (b) forms of metareasoning about agent knowledge and/or histories of agent interactions with the environment (both novel for POMDPs and proposed in this research) result in improved full breadth planning by implicitly estimating cumulative, future rewards, and perform very competitively with (and often exceed) depth-focused state-of-the-art online POMDP planning algorithms.

Overall, these contributions demonstrate the usefulness of employing PBRS to improve online POMDP planning. PBRS enables full breadth planning (for more comprehensive planning by considering all nearby reachable situations from the current one) to achieve greater cumulative reward estimation implicitly, as other approaches intend to do explicitly at the cost of needing to sacrifice breadth of coverage due to limited time constraints on planning. These contributions also provide additional insights into the types of information measurable by potential functions that can be useful to improve agent reward accumulation, which could be used to improve the use of PBRS in other settings (beyond online POMDP planning, e.g., partially observable reinforcement learning). Of note, this article is a significant extension of a previously published extended abstract [9].

The rest of this paper is organized as follows. Section 2 provides important background for understanding our approach, including a discussion of POMDPs, online planning, and PBRS as originally formulated for RL. Section 3 introduces our approach and contains proofs for several important theoretical properties of the policies found during online POMDP planning with PBRS. Section 4 describes the experimental setup used to empirically evaluate the performance of online POMDP planning with PBRS on several benchmark POMDP problems, followed by the analysis of our results and a discussion of the broader implications of this work in Sect. 5. Section 6 concludes with a summary of our approach and findings, as well as additional suggestions for future work that we intend to explore.

2 Background

2.1 POMDP model

First, we briefly introduce the partially observable Markov decision process (POMDP) [11], formally defined as the tuple \(\langle S,A,T, \Omega , O,b_0, R\rangle \). Here, \(S=\left\{ s \right\} \) is a set of (hidden) states of the environment in which the agent chooses actions from a set \(A=\left\{ a \right\} \). Each action causes a probabilistic state transition according to a function \(T\left( {s_t ,a,s_{t+1} } \right) =P(s_{t+1} |s_t ,a)\in [0,1]\) representing the probability that action \(a\) changes the environment state from \(s_t\) to \(s_{t+1}\). Actions also produce observations from a set \(\Omega =\left\{ o \right\} \) that are used to estimate the hidden state of the environment. These observations occur according to another probabilistic observation function \({O}\left( {s_{t+1} ,a,o} \right) =P(o|s_{t+1} ,a)\in [0,1]\) representing the probability that observation \(o\) is observed after action \(a\) leads to (hidden) state \(s_{t+1}\).

To estimate the hidden state, the agent maintains a belief state \(b\in \varPi (S)\) representing a probability distribution describing the probability for each state \(s\in S\) that \(s\) is the current unobservable state of the environment. Here, we use \(\varPi (S)\) to denote the set of all probability distributions over \(S\). The distribution \(b\) represents the agent’s beliefs (or knowledge) about its current situation and is updated through belief revision based on recent observation \(o\) after action \(a\):

$$\begin{aligned} b_{t+1}^{a,o}\left( {s_{t+1} } \right) =\frac{1}{\eta } O(s_{t+1} ,a,o){\sum }_{s_t \in S} T\left( {s_t, a,s_{t+1}}\right) b_t (s_t) \end{aligned}$$
(1)

where \(\eta \) is a normalization factor ensuring the new belief state \(b^{a,o}\) remains a valid probability distribution in \(\varPi (S)\). The initial belief state is denoted by \(b_0\).
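To make the update concrete, here is a minimal NumPy sketch of Eq. 1; the array layout (T[s, a, s2] = P(s2 | s, a) and O[s2, a, o] = P(o | s2, a)) is an assumption chosen purely for illustration rather than a prescribed implementation.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Belief revision (Eq. 1): new belief over states after action a and observation o.

    Assumed layout (illustrative): b is a length-|S| probability vector,
    T[s, a, s2] = P(s2 | s, a), and O[s2, a, o] = P(o | s2, a).
    """
    # Prediction step: P(s2 | b, a) = sum_s T(s, a, s2) * b(s)
    predicted = T[:, a, :].T @ b
    # Correction step: weight each successor state by its observation likelihood
    unnormalized = O[:, a, o] * predicted
    eta = unnormalized.sum()              # normalization factor eta from Eq. 1
    return unnormalized / eta if eta > 0 else b.copy()
```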

Finally, \(R\left( {s,a}\right) \in {\mathbb {R}}\) is a function modeling the rewards received by the agent for taking an action dependent on the state of the environment. Since the agent has uncertain beliefs over the true state of the environment, it commonly computes expected rewards over its uncertain beliefs:

$$\begin{aligned} R\left( {b,a} \right) =\sum \limits _{s\in S} {b(s)R(s,a)} \end{aligned}$$
(2)

The goal of the agent is to build a plan of actions \(\pi (b)\) called a policy based on its current belief state that maximizes expected discounted, long-term rewards:

$$\begin{aligned} E\left[ {\sum \limits _{t=0}^{n-1} \gamma ^{t}r_t } \right] \end{aligned}$$
(3)

where \(r_t\) is the reward received at time \(t\), \(n\) is the planning horizon (i.e., number of steps to plan ahead), and \(\gamma \in [0,1)\) is a discount factor for weighting future, uncertain rewards. We define the value of a policy \(\pi \) from a belief state \(b_0\) as a set of Bellman equations with \(a=\pi (b_t )\):

$$\begin{aligned}&\displaystyle V\left( {b_0 ,\pi }\right) =E\left[ {\sum \limits _{t=0}^{n-1} \gamma ^{t}r_t } \right] =V_0 \left( {b_0 } \right) \end{aligned}$$
(4)
$$\begin{aligned}&\displaystyle V_t (b_t )=Q_t (b_t ,\pi (b_t ))\end{aligned}$$
(5)
$$\begin{aligned}&\displaystyle Q_t \left( {b_t ,a} \right) =R\left( {b_t ,a} \right) +\gamma \sum _{s_t \in S} b_t (s_t )\sum _{s_{t+1} \in S} T\left( {s_t ,a,s_{t+1} } \right) \sum _{o\in \varOmega } O\left( {s_{t+1} ,a,o} \right) V_{t+1} \left( b_{t+1}^{a,o}\right) \end{aligned}$$
(6)
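As a rough illustration of Eqs. 2 and 4–6, the sketch below evaluates a fixed policy over a finite horizon by recursing over observations. It reuses belief_update from the sketch above, and the policy callable and array layout are illustrative assumptions.

```python
def expected_reward(b, a, R):
    """Eq. 2: expected immediate reward under belief b, with R[s, a] = R(s, a)."""
    return float(b @ R[:, a])

def q_value(b, a, t, n, T, O, R, gamma, policy):
    """Finite-horizon Q_t(b, a) of Eq. 6 under a fixed policy pi (Eqs. 4-5).

    policy(b, t) is assumed to return pi's action at belief b and step t;
    reuses belief_update from the earlier sketch.
    """
    value = expected_reward(b, a, R)
    if t == n - 1:                            # last planning step: no future value
        return value
    n_obs = O.shape[2]
    for o in range(n_obs):
        # P(o | b, a) = sum_{s, s2} b(s) T(s, a, s2) O(s2, a, o)
        p_o = float(b @ T[:, a, :] @ O[:, a, o])
        if p_o <= 0.0:
            continue
        b_next = belief_update(b, a, o, T, O)
        value += gamma * p_o * q_value(b_next, policy(b_next, t + 1),
                                       t + 1, n, T, O, R, gamma, policy)
    return value
```

The policy value of Eq. 4 is then q_value(b0, policy(b0, 0), 0, n, T, O, R, gamma, policy).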

2.2 Online POMDP planning

Online planning is one approach to policy construction. In online planning, an agent iteratively (1) plans a policy \(\pi \) from its current belief state \(b\) while operating in the environment, then (2) executes that policy for a while before returning to (1) and repeating the process. By interleaving planning and execution, the agent focuses its planning efforts on beliefs it actually encounters in the environment, allowing it to adapt to unlikely and unexpected situations, as well as not waste valuable resources planning for many unencountered beliefs. These properties are especially beneficial in real-world applications where agents operate in real-time and cannot estimate in advance all possible encountered beliefs (e.g., robotic exploration).

Because the agent interleaves planning and execution while operating in the environment, online planning is usually restricted to limited amounts of time it can afford for planning. This requirement of quick planning requires the agent to plan for a limited number of steps ahead (i.e., limited depth) and/or a limited number of possible belief states imminently reachable from the current belief state (i.e., limited breadth).

Among online planning approaches, several different methods have been proposed that deal with time constraints during planning in different ways in order to produce the best estimates of cumulative, future rewards (cf. Ross et al. [18] for a recent survey of online planning methods). Generally, these approaches represent the agent’s policy as a tree with belief states represented by nodes, whereas actions and observations are represented by branches between belief states (where an action and observation from one belief state produces another belief state, as in Eq. 1). As the tree is expanded, the algorithms use the new actions and belief states added to the tree to update the estimated cumulative rewards from the agent’s current belief state (using Eqs. 4–6). Thus, planning has two parts: (1) constructing the tree by expanding nodes as time permits, and (2) evaluating the value of action sequences within the tree according to the agent’s reward function to form the policy of actions to take. Different existing algorithms for online POMDP planning primarily differ in how they choose to expand the tree to best estimate cumulative rewards within the limited amount of time allotted for online planning.

Two of the most popular categories of online planning algorithms include heuristic search methods and Monte Carlo search methods. First, heuristic search methods (e.g., AEMS2 [17], FHHOP [25]) focus planning on the most attractive beliefs. Iteratively, heuristic search methods choose to expand the plan from the leaf belief state in the policy tree that maximizes some heuristic function. This heuristic function measures how informative each leaf belief state is towards improving the quality of the plan. For example, state-of-the-art heuristic search algorithms (e.g., AEMS2 [17]) rely on heuristics measuring both (1) the error bounds on the value function \(V\) as leaf evaluation heuristics (i.e., additional upper and lower bounds on future rewards added to the value of a belief state), reflecting the uncertainty introduced by the belief state into the agent’s overall plan, as well as (2) whether or not the belief state is reached by actions that optimistically maximize the upper bound on future rewards.

Second, Monte Carlo search methods (e.g., Rollout [3], POMCP [19], DESPOT [21]), also called Monte Carlo Tree Search (MCTS) when used with tree-based policy representations, perform sparse random sampling of future belief states to estimate cumulative, future rewards. In particular, these methods expand plans by sampling situations that have (1) high probabilities in the state transition and observation functions to focus planning on the most likely sequences of agent beliefs, and (2) earn greater rewards under the current reward estimations.

Both heuristic search methods and Monte Carlo search methods commonly result in depth-focused planning since (1) heuristics like AEMS2 favor expanding belief states along optimistically optimal sequences of actions (determined by the upper bound on future rewards), and (2) biased sparse random sampling prefers expanding sequences of belief states that have the greatest likelihood of occurrence. As discussed in Sect. 1, this focus on depth is advantageous because it allows agents to form more accurate estimations of the cumulative, future rewards along the deep expansion paths by recalculating Eqs. 4–6 repeatedly for the parent belief states along these paths. That is, it suffers less from over- and under-estimation of future rewards on chosen action/belief sequences by explicitly searching many steps in advance. So long as the heuristic function or biased random sampling identifies the correct belief states for which to plan between the agent’s current belief state and its goal, then the heuristic search or Monte Carlo search methods should work quite well in practice, as indeed shown through several experimental studies (e.g., [18, 19, 21, 25]).

However, increasing the depth of planning along select paths in the policy tree requires the agent to sacrifice the breadth of planning within the tree due to limited time constraints. Specifically, heuristic search methods neglect belief states with high (but not quite maximum) heuristic values, and random sampling in Monte Carlo search methods avoids less likely but certainly possible belief state sequences. In many situations, especially in complex environments, planning for these other belief states could be very beneficial to improving the overall quality of the agent’s plan and its estimation of cumulative, future rewards. That is, sacrificing breadth can also lead to suboptimal policies within the (deeper) finite horizon used for depth-focused planning due to over- or underestimation of the value of the computed policy since the agent fails to explore all possible belief state transitions within the policy tree, possibly missing unexpected high rewards that follow from actions and belief state transitions that are myopically suboptimal and not chosen for expansion. As discussed in Sect. 1, sacrificing the breadth of planning can also cause the agent to reach dangerous or undesirable situations with no forethought on what to do or how to reach a better situation in order to eventually achieve its goals.

Additionally, heuristic search methods (and some Monte Carlo search methods) generally require the agent to have computed rough policies offline before using online planning in order to calculate the upper and lower bounds on the value of actions in belief states that are used to guide planning. However, if the agent is placed in a complex environment (e.g., robotic exploration) where the agent has high uncertainty in what situations it will face or if the size of the POMDP is very large, appropriate pre-planning might be prohibitively expensive.

In Sect. 3, we explore an approach to online POMDP planning that does not require sacrificing breadth of coverage during planning, yet improves the ultimate actions chosen from planning by enabling the agent to implicitly look beyond a limited planning horizon when valuing the actions and belief state transitions within the planning horizon, enabling better long-term reward maximization. Our approach is most similar to heuristic search methods for online planning in that it evaluates the quality of belief states for more than just immediate rewards. However, our approach does not limit expanding plans only along selected belief states with high heuristic value. Instead, the approach modifies the rewards considered at each belief state to bias the agent to place higher value during short, finite horizon planning on policies with greater long term cumulative rewards (even if such policies are otherwise suboptimal within the short, finite horizon). Furthermore, our approach does not require information from precomputed plans, although it can exploit such information if available. We describe the fundamental differences between our approach and those described previously in this section in more detail in Sect. 3.1.

2.3 Potential-based reward shaping

Potential-based reward shaping (PBRS) was originally proposed by Ng et al. [14] as a method to provide hints on how to achieve greater long-term rewards as the agent learns the value function in RL. PBRS addresses one important challenge within RL commonly known as the exploration-exploitation problem: determining how to best improve the agent’s learned knowledge whilst simultaneously maximizing long-term reward (Eq. 3). PBRS handles this challenge by embedding a priori information about the potential of states to provide the agent with more valuable rewards. Using this information, the agent is encouraged to choose actions that explore states of high potential in order to learn about these states and hopefully earn greater future rewards while operating in the environment.

Within PBRS, a potential function \(\phi (s)\) defined over states encodes or measures such a priori information. For example, in a path finding application (e.g., [2]), a good potential function might evaluate the inverse of the agent’s distance from the goal location, which returns greater values for states (i.e., agent locations in the maze) closer to where the agent earns large rewards (the goal location).

In order to guide the agent during RL, PBRS shapes the rewards considered during action selection in Eq. 3 by adding an additional amount determined by the potential function. Specifically, PBRS considers the following reward:

$$\begin{aligned} r_t =R\left( {s_t ,a} \right) +F(s_t ,a,s_{t+1} ) \end{aligned}$$
(7)

where

$$\begin{aligned} F\left( {s_t ,a,s_{t+1} } \right) =\gamma \phi \left( {s_{t+1} } \right) -\phi (s_t ) \end{aligned}$$
(8)

Here, Eq. 8 represents the difference in potential future rewards due to moving from state \(s_t\) to \(s_{t+1}\). Shaping \(r_t\) by adding this value provides additional motivation to the agent to choose actions that increase the potential of earning future rewards. Therefore, by maximizing this representation of \(r_t\) in Eq. 3, the agent targets actions that improve its learning and are more likely to lead to larger rewards. Once the rewards are learned for those high potential states, the agent can then exploit its learned knowledge to maximize long-term rewards.
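As a small illustration of Eqs. 7 and 8, the sketch below shapes rewards with the inverse-distance potential mentioned above for a hypothetical grid pathfinding task; the state representation, goal location, and reward callable are assumptions made for the example.

```python
def phi_distance(state, goal):
    """Illustrative potential: inverse Manhattan distance to a hypothetical goal cell,
    larger for states closer to where the agent earns its big reward."""
    d = abs(state[0] - goal[0]) + abs(state[1] - goal[1])
    return 1.0 / (1.0 + d)

def shaped_reward(R, phi, s_t, a, s_next, gamma):
    """Eqs. 7-8: original reward plus the potential difference F(s_t, a, s_{t+1})."""
    F = gamma * phi(s_next) - phi(s_t)
    return R(s_t, a) + F
```

For instance, shaped_reward(lambda s, a: -1.0, lambda s: phi_distance(s, (9, 9)), (2, 3), 0, (2, 4), 0.95) shapes a unit step cost towards a hypothetical goal at cell (9, 9), slightly rewarding the move that reduces the distance.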

Furthermore, it can be shown (see the proof for Theorem 4 for similar details) that when planning over an infinite horizon, the same policy optimizes rewards with and without PBRS [2, 14]. Therefore, using PBRS does not change the (infinite horizon) optimal policy and, due to targeted exploration, results in faster learning convergence to the optimal policy and higher cumulative unshaped rewards than only using the original reward function \(R\). This equivalence property of the (infinite horizon) optimal policy is one of the primary advantages of using PBRS to guide exploration in RL [2, 6, 7, 14].

Extending beyond RL, PBRS has also been used to improve planning in fully observable domains using a Markov decision process (MDP) (e.g., [22]), which has the same mathematical framework as RL but knows the model parameters a priori. In the context of MDPs, PBRS uses a potential function to guide the agent to favor policies found during planning that are likely to lead to large future rewards (equivalent to the use of leaf evaluation heuristics [22]). This prior work inspired our own extension of PBRS (which is the first to formally consider partial observability) to POMDPs, where guiding planning towards future rewards is especially important when working with limited planning time due to the increased complexity caused by handling partial observability, as motivated previously.

Of note, POMDPs can be viewed as a special case of the MDP called a (continuous state) belief MDP [11], where the state of the MDP represents the current belief state and the state transition function encompasses all the necessary details of belief state changes (e.g., factoring in observation probabilities). Thus, upon first glance, using PBRS for planning with POMDPs is a relatively straightforward extension of the prior research employing this technique with MDPs. However, the novelty of the research presented here is not in the extension itself (for which we supply the necessary details), but in realizations about the characteristics of potential functions and the discovery of different types of information useful to evaluate the value of plans in POMDPs for finding better approximations of the (infinite horizon) optimal policy when planning with only small, finite horizons. In previous PBRS research, only a single type of potential function has been defined: potentials over individual states (Eqs. 7, 8, cf. Type 1 in Sect. 3.1), whereas in leaf evaluation heuristics research, another type (cf. Type 4 in Sect. 3.1) is commonly used. However, the richness of belief states as probability distributions representing both agent knowledge about the environment, as well as histories of agent interactions with the environment, opens up additional exploitable opportunities available when using PBRS with more complex POMDPs, rather than simpler, fully observable MDPs. In particular, we identify two novel types of information measurable by potential functions in POMDPs not achievable in MDPs or fully observable reinforcement learning, including opportunities for metareasoning through reflecting upon the quality of agent knowledge or the history of the agent’s actions in order to guide improved action selection. Indeed, we rely on a feature of POMDPs that makes planning more complicated in general (handling partial observability through probabilistic beliefs) and turn it instead into an advantage in designing good potential functions that improve planning. Ultimately, both the identification of the existence of different types of available potential functions, and the consideration of the types of information used in our novel potential functions, could inspire better usage of PBRS in other settings (especially the very complicated setting of partially observable reinforcement learning).

3 Potential-based POMDP planning

In this section, we describe the extension of PBRS to online POMDP planning. Whereas PBRS has been considered previously for planning in fully observable MDPs [22], this is the first consideration of PBRS for planning within POMDPs. Thus, we first briefly explain the general thought process behind the extension and the transformative steps from prior usage of PBRS with MDPs required to use PBRS with POMDPs. We next identify several different types of potential functions possible with POMDPs and introduce several novel types that exploit the nature of belief states to provide a richer set of information than considered previously with PBRS. We also prove several important results describing the impact of planning with PBRS on both (1) the policies favored during online POMDP planning, and (2) the optimality of planning.

3.1 Extending PBRS to online POMDP planning

3.1.1 Overview

We begin by noting that in RL (or MDPs), the agent makes decisions based on the environment state \(s\). This is why the potential function \(\phi (s)\) is defined over states. In POMDPs, the environment is only partially observable, and thus the agent rarely knows the true state of the environment. Instead, the agent makes decisions based on its uncertain belief state \(b\), which represents the agent’s probabilistic beliefs over which possible state is the correct one. Therefore, since decisions are made over belief states in POMDPs, the first fundamental step of our extension is to define potential functions over belief states: \(\phi (b)\).

Here, the potential function represents a priori information about the potential of an agent to reach high future rewards from any particular belief state \(b\). Shortly, we will detail different classes of information such a potential function can encode or measure, including novelties to using PBRS with POMDPs (as opposed to fully observable RL and MDPs, as previously considered).

To include \(\phi (b)\) in POMDP planning, we define analogous equations to Eqs. 7 and 8 for POMDP rewards:

$$\begin{aligned} r_t =R\left( {b_t ,a} \right) +F(b_t ,a,b_{t+1} ) \end{aligned}$$
(9)

where

$$\begin{aligned} F\left( {b_t ,a,b_{t+1}}\right) =\gamma \phi \left( {b_{t+1}}\right) -\phi (b_t ) \end{aligned}$$
(10)

As in RL, the reward \(r_t\) for Eq. 3 is shaped by adding the difference in potential caused by changing belief from \(b_t\) to \(b_{t+1}\).

Within the context of POMDPs, we now establish several different ways that the potential function can measure different classes of information based on belief states, each an indicator of future rewards. In addition to considering domain-dependent information about individual states (as done previously with PBRS in both fully observable RL and MDPs), an agent can also consider information based on the nature of belief states as probability distributions representing an agent’s knowledge about the environment. That is, an agent can directly reason about what it knows (or does not know) and/or the quality of its knowledge through evaluating these probability distributions as a form of reflective, deliberative metareasoning. The agent can then relate its current knowledge to its task at hand in a potential function to predict the future rewards it will earn. As we will explain below, this has two key implications: (1) extending PBRS to POMDPs enables a richer set of information to be considered by potential functions during planning to result in better plans, and (2) this information can be abstracted beyond the agent’s particular domain and can be reused across applications in characteristically different domains, which is in stark contrast to PBRS for fully observable environments where potential functions have traditionally been tailor-made for the agent’s particular domain. We summarize our categorization of four proposed types of potential functions in Table 1.

Table 1 Types of potential functions for POMDPs

3.1.2 Potential function type 1 (domain-dependent information from expected state potential)

First, the information encoded in a potential function might be domain-dependent information about environment states, similar to the usage of PBRS in fully observable RL and MDPs. In this case, an extension of the potential function to belief states would measure the expected potential over states (analogous to Eq. 2), based on the probabilities assigned to each environment state in the belief state:

$$\begin{aligned} \phi \left( b \right) =\sum \limits _{s\in S} b(s)\phi (s) \end{aligned}$$
(11)

This type of potential function is a simple extension of prior potential functions to handle the uncertainty present in partially observable domains. It retains the benefits of exploiting domain-dependent expertise about individual states that have led to the success of PBRS in fully observable RL and MDPs. However, this type of potential function is limited in that each potential function must be carefully constructed for the application and domain at hand, limiting reuse across domains. It is also difficult to apply to a new domain where little domain expertise is known, or domains that are very complicated with many possible environment states (as common to many real-world applications of POMDPs, e.g., robotic exploration).
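A minimal sketch of Eq. 11, assuming the belief and the hand-crafted per-state potentials are both NumPy vectors indexed by state:

```python
import numpy as np

def phi_type1(b, phi_s):
    """Type 1 (Eq. 11): expected domain-dependent state potential under belief b,
    where phi_s[s] holds a hand-crafted potential phi(s) for each state."""
    return float(np.dot(b, phi_s))
```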

3.1.3 Potential function type 2 (domain-independent information)

On the other hand, by reflecting upon a belief state as a probability distribution representing the agent’s current knowledge about the environment (i.e., beliefs about the likelihood that any particular environment state is the correct one), we can produce additional types of potential functions unique to POMDPs that relate additional classes of information to the potential of the agent to earn future rewards. Improving upon the first type of potential function described above, this information can be domain-independent and apply across multiple applications and domains with differing characteristics, allowing for generalized solutions having applicability to any domain (especially useful when domain expertise is limited or difficult to capture within especially large POMDPs, such as those with many possible hidden states).

In particular, a POMDP potential function might measure some quality or property of the probabilities in a belief state to predict future rewards. Such behavior is independent of any particular environment state (differing from traditional potential functions) and can also be independent of the domain where the POMDP is being employed for planning. For example, in many domains and applications of POMDPs (e.g., active sensing [4, 23]), one of the primary goals of the agent is to discover the environment’s hidden state before it acts on its beliefs to achieve tasks and goals. In such an application, it does not matter which particular state is the hidden one, only that the agent discovers the hidden state. Therefore, an important property of a belief state related to the ability of the agent to accomplish its goals and earn large future rewards is the certainty in its distribution. That is, when an agent is more certain, it is closer to discovering the true state of the environment and can soon earn large rewards for accomplishing its goal. Considering agent certainty in this manner enables the agent to self-reflect on its own beliefs and metacognitively choose actions that will best revise its knowledge, using potential functions as a form of metareasoning to improve agent behavior. Certainty in a belief state can be measured in several ways, each representing a domain-independent potential function leading the agent towards large future rewards. One method for measuring certainty is to consider the entropy in the agent’s belief state, more specifically by using the negative entropy (see Footnote 2) in the belief state (e.g., [1]):

$$\begin{aligned} \phi \left( b \right) =1.0+\sum \limits _{s\in S} b(s)\log _{\left| \mathrm{S} \right| } b(s) \end{aligned}$$
(12)

Alternatively, an agent can quickly estimate its overall certainty by considering the probability assigned to the most likely environment state in the belief state:

$$\begin{aligned} \phi \left( b \right) ={\max }_{s\in S} b(s) \end{aligned}$$
(13)

As the agent’s overall certainty increases, so too does the probability assigned to the most likely state, so this potential function can serve as a good proxy for overall certainty. This potential function exploits another possible property of the POMDP and belief state in order to speed up computation. That is, this function is especially advantageous in large, complicated domains where the state space in a POMDP is represented as a factored state space comprised of multiple state variables: \(S=S_1 \times S_2 \times \cdots \times S_m\) (c.f., Sect. 4.1.2 for an example used in our experiments). In a factored state space, a belief state can be represented more compactly by a set of conditional probability distributions between variables. Exploiting the structure of these conditional probability distributions can sometimes be more efficient than dealing with the entire joint probability distribution, allowing the most likely state to be identified with lower computational complexity than finding the entropy of the belief state (Eq. 12) or some other property of a belief state that requires iterating over all possible states.
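Both certainty measures are straightforward to compute from a (flat) belief vector; the sketch below implements Eqs. 12 and 13, treating 0 log 0 as 0.

```python
import numpy as np

def phi_negative_entropy(b):
    """Type 2 (Eq. 12): 1 plus the negative entropy of b with logs in base |S|,
    so a uniform belief scores 0 and a fully certain belief scores 1."""
    b = np.asarray(b, dtype=float)
    nz = b[b > 0.0]                          # treat 0 * log 0 as 0
    return 1.0 + float(np.sum(nz * np.log(nz)) / np.log(len(b)))

def phi_max_belief(b):
    """Type 2 (Eq. 13): probability of the most likely state, a cheap certainty proxy."""
    return float(np.max(b))
```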

Of note, this type of potential function is very closely related to belief-based rewards proposed by Araya-Lopez et al. [1], which directly reward the agent based on measurable qualities of belief states (including Eq. 12). However, there is both (1) a lack of theoretical understanding of the impact on agent policies from belief-based rewards, which we provide (in the next section) by including such measures as potential functions within PBRS, and (2) a lack of empirical evidence of their usefulness on POMDP benchmarks, which we provide in the context of PBRS in Sect. 5.

3.1.4 Potential function type 3 (belief prioritization)

Additionally, since belief states represent both (1) an agent’s knowledge about the current state of the environment, and (2) a sufficient statistic describing an agent’s history of observations [11], they can be used to determine preferential orderings on an agent’s actions and beliefs, which can be encoded in a potential function. In some applications, a domain expert might have some knowledge about strategies for plans that could be used to achieve an agent’s goals, but specific details about how to implement those strategies could be lacking. That is, an expert might know that to achieve its goal, the agent needs particular knowledge about particular states (e.g., that the state is either highly likely or unlikely) before it can complete its task or learn about another particular state. Or the expert might know that certain observations are beneficial, but it is unknown how to achieve those observations. In either case, a potential function can assign higher value to belief states that include certain knowledge (e.g., a particular state is highly likely or unlikely) or are only reachable after certain observations.

This is a way of encoding domain expertise about agent beliefs that strategically guides the agent to achieve certain beliefs before others, without necessarily requiring prior knowledge about how to tactically achieve those beliefs. In turn, this approach possibly speeds up an agent’s knowledge acquisition so that it can accomplish tasks and goals faster, requiring less planning and achieving faster and greater reward accumulation.

For example, consider a robotic agent (see Footnote 3) responsible for gathering information about the quality of a set of rocks \(r\in R\). The agent’s goal is to determine with near perfect certainty whether each rock is good or bad before moving on to another area of interest. In this situation, a potential function could assign higher priority to belief states that reflect histories where the agent has tested every rock and determined whether each is good or bad in order to guide the agent to take actions that perform the necessary sensing as quickly as possible. Assuming a binary state variable for each rock (representing a good or bad state), the agent’s belief state would be almost perfectly certain a rock was good if \(b\left( r \right) >0.99\) and almost perfectly certain the rock was bad if \(b\left( r \right) <0.01\). Then the potential function:

$$\begin{aligned} \phi \left( b\right) =\left\{ \begin{array}{ll} -1000 &{} \hbox {if }\{r\in R|0.01<b\left( r \right) <0.99\}\ne \emptyset \\ 0 &{} \hbox {else} \end{array}\right. \end{aligned}$$
(14)

represents a potential function that prioritizes beliefs (by penalizing beliefs representing histories where the agent has not tested and determined the state of every rock), thereby encouraging the agent to perform its sensing as soon as possible. Moreover, it does so without directly explaining to the agent how to do so, and thus represents strategic (instead of tactical) advice.
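A minimal sketch of Eq. 14 for this example, assuming a factored belief in which b_rocks[r] is the marginal probability that rock r is good:

```python
def phi_rock_priority(b_rocks, lo=0.01, hi=0.99, penalty=-1000.0):
    """Type 3 (Eq. 14): penalize any belief whose history has not yet determined
    every rock to be (almost certainly) good or bad."""
    undetermined = any(lo < p < hi for p in b_rocks)
    return penalty if undetermined else 0.0
```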

3.1.5 Potential function type 4 (approximation of optimal value function)

Finally, since potential functions are equivalent to leaf evaluation heuristics in planning [22], the optimal potential function is the (domain-dependent, infinite horizon) optimal value function \(V^{*}\left( b \right) =V(b,\pi ^{*})\) under the (infinite horizon) optimal policy \(\pi ^{*}\), since this function exactly measures the future rewards earned from a belief state when following the optimal policy in the agent’s particular application. Thus, such a potential function contains exactly the information missing from approximate planning, overcoming the problems addressed in this paper. However, such optimal policies and value functions are rarely computable or known in practice (or else we would not need techniques such as PBRS in the first place), so the best we can often do is to approximate these values.

Within the heuristic search online POMDP algorithm literature (e.g., [17, 18, 25]), it is common to approximate \(V^{*}\left( b\right) \) using upper and lower bounds on the value function: \({\overline{V}}\left( b\right) \) and \({\underline{V}} (b)\), respectively, with \({\underline{V}}\left( b\right) \le V^{*}\left( b \right) \le {\overline{V}} (b)\), frequently employed as leaf evaluation heuristics (e.g., [18]). These approximations are calculated using policies \(\pi _{FIB} \) and \(\pi _{Blind} \) formed offline using algorithms such as Fast Informed Bound (FIB) and Blind [10], such that \({\overline{V}}\left( b\right) =V\left( b, \pi _{FIB}\right) \) and \({\underline{V}} \left( b\right) =V(b,\pi _{Blind} )\). With these approximations, we can then define potential functions \(\phi \left( b\right) ={\overline{V}}\left( b\right) \) and \(\phi \left( b\right) ={\underline{V}} (b)\). The tighter the bounds (depending on the application), the better these approximations estimate the optimal value function and thus better guide the agent to optimal rewards.

By using \({\overline{V}}\left( b\right) \) and/or \({\underline{V}}\left( b\right) \) as potential functions, PBRS is able to include the key heuristic information used to guide planning in state-of-the-art heuristic functions without limiting the breadth of planning, and thus not leave the agent in possibly dangerous situations where it reaches a belief state for which it has performed minimal advance planning. Of note, this type of potential function does require offline computations, so this type has the same pre-deployment costs associated with other online POMDP planning approaches discussed in Sect. 2.2, which could be problematic in large, complex real-world problems.
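If bound policies such as FIB or Blind have been computed offline and are represented, as is common, by sets of alpha-vectors, the corresponding potential function reduces to a maximum of dot products; the sketch below assumes that representation.

```python
import numpy as np

def phi_from_bound(b, alpha_vectors):
    """Type 4: potential from a precomputed value-function bound (e.g., FIB for an
    upper bound or Blind for a lower bound) represented as a set of alpha-vectors,
    so phi(b) = max_i alpha_i . b."""
    return float(max(np.dot(alpha, b) for alpha in alpha_vectors))
```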

3.1.6 Discussion

Overall, potential functions over belief states can include information (1) about individual states (Type 1, as previously considered with PBRS in RL and MDP planning), (2) about direct estimations of future rewards from a belief state (Type 4, as previously considered with leaf evaluation heuristics), and/or (3) about belief states themselves independent of individual states, in both domain-independent and domain-dependent manners (Types 2 and 3). This enables a richer set of information to be embedded during reward shaping for guiding online POMDP planning towards greater future rewards than previously considered in the PBRS literature.

Moreover, amongst the two novel types of potential functions (Types 2 and 3) discovered in this research, reflecting on (1) agent knowledge to determine how to act (e.g., measuring the quality of knowledge about the current state of the environment as indicated by certainty measures, Eqs. 12 and 13) and (2) the history of the agent’s interactions with the environment (e.g., through priority orderings on belief states both currently experienced and soon reachable) both represent metareasoning methods for improving general reasoning in POMDPs with interesting potential applications in many domains (e.g., better information gathering in active sensing applications).

Comparing PBRS with other types of approaches to online POMDP planning, we see that shaping rewards is advantageous because the shaped amount encourages the agent to place higher value on action sequences that can potentially lead to higher future rewards, including beyond the planning horizon. Thus, planning with a potential function can allow the agent to estimate cumulative, future rewards (or at least maximize indicators possibly correlated to large future rewards, such as belief certainty) in order to better evaluate the long term values of taking different actions while planning only within short finite horizons without having to spend the limited time on deep planning. As a result of these time savings, the agent can instead maintain a breadth of planning to avoid the pitfalls identified in Sect. 1, such as suboptimal finite horizon planning due to not considering all belief states, and avoiding reaching dangerous or undesirable situations with no forethought on what to do or how to reach a better situation in order to eventually achieve its goals. Moreover, implicitly estimating future, cumulative rewards can possibly achieve superior action selection than spending time explicitly building such estimates with depth-focused planning, if the agent faces a problem where very long sequences of actions are required to reach the goal from its current situation, and there is not enough time to plan for such a long sequence, even with depth-focused approaches.

Additionally, when comparing our proposed PBRS approach to other types of online POMDP planning, we note that there is a distinct difference in the way the potential function values are considered versus (1) how heuristic function values are used in heuristic search methods, or (2) how probabilities and reward estimations are used in Monte Carlo search methods. In our proposed approach, potential function values are never used to control planning—they are not used to guide which belief states are expanded in the policy tree at any point in time during planning. In heuristic search methods, on the other hand, the heuristic values calculated for each belief state do indeed determine which belief state is expanded next, in order to guide depth-focused planning, by selecting some belief states for which to plan and excluding others. Likewise, in Monte Carlo search methods, the calculated probabilities for transitions between belief states and reward estimations are used to control how the plan is expanded in a depth-focused fashion. Instead, in our approach, we propose performing a simple breadth-first search (BFS) to consider all belief states within the short, finite horizon, which does not require special control of plan expansion, in order to maintain the breadth of planning and achieve the benefits previously described.
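To make the contrast concrete, the sketch below performs full-breadth, fixed-horizon planning with shaped rewards (Eqs. 9 and 10): every action/observation branch within the horizon is evaluated (here via exhaustive recursion, which covers the same tree a breadth-first expansion would), and the potential function only reshapes values, never prunes branches. It reuses belief_update and expected_reward from the earlier sketches and is exponential in the horizon, so it is intended only for the short horizons discussed here.

```python
def plan_full_breadth_pbrs(b0, n, T, O, R, gamma, phi):
    """Return the first action of the horizon-n policy maximizing shaped value."""
    n_actions, n_obs = R.shape[1], O.shape[2]

    def q_shaped(b, a, t):
        v = expected_reward(b, a, R)                  # unshaped immediate reward (Eq. 2)
        for o in range(n_obs):
            p_o = float(b @ T[:, a, :] @ O[:, a, o])  # P(o | b, a)
            if p_o <= 0.0:
                continue
            b_next = belief_update(b, a, o, T, O)
            v += p_o * (gamma * phi(b_next) - phi(b))  # shaping term F (Eq. 10)
            if t + 1 < n:
                v += p_o * gamma * max(q_shaped(b_next, a2, t + 1)
                                       for a2 in range(n_actions))
        return v

    return max(range(n_actions), key=lambda a: q_shaped(b0, a, 0))
```

Setting phi to the zero function recovers ordinary finite-horizon planning with the original rewards, so the potential function is the only difference between the two evaluations.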

That is, the reward shaping performed by our inclusion of potential functions does not cause some belief states to be considered or excluded during planning (as controlled by heuristic functions and random sampling; see Footnote 4), but instead changes the evaluation of the value of action sequences by adding domain-dependent or domain-independent information about belief states reached by those action sequences in order to place greater value on policies that have the potential to achieve greater long term, cumulative rewards, even if those action sequences would not be considered optimal under the short, finite horizon used for planning with only the original reward function. In the next subsection, we provide theoretical results illustrating how the evaluation of the value of policies is changed with reward shaping, as well as the benefits of this change.

Finally, comparing PBRS to the leaf evaluation heuristics, we note that although the two approaches are functionally equivalent [22], there are still advantages to studying and employing PBRS for online POMDP planning. First, PBRS and its mathematical framework (especially Eqs. 9, 10) are the natural extension of leaf evaluation heuristics to anytime online planning algorithms. That is, such algorithms might not know in advance how long they will have to run, and instead must be capable of both (1) returning a plan at any point in time, and (2) continually running as more time is allotted to improve the quality of the plan calculated. Thus, an anytime online planning algorithm might not know in advance when it will stop. In turn, it will not know in advance which nodes will be leaves in the final policy tree, so it will not necessarily know where to apply the leaf evaluation heuristics. The difference function (Eq. 10) in PBRS incrementally considers each node to be a leaf (and is evaluated with a potential function as a leaf evaluation heuristic), then removes that additional shaped value when a node in the policy tree ceases to be a leaf (as the tree is expanded while time is still allocated for planning). Therefore, the mathematical framework for PBRS defines the calculation procedure for employing leaf evaluation heuristics in anytime online planning algorithms, and the theoretical analyses below inform us on how both PBRS and leaf evaluation heuristics would perform in anytime online planning. Second, unlike the leaf evaluation heuristics commonly used in the literature (our Type 4 potential functions), the first three potential function types proposed above do not require any precomputation before operating in the environment. Thus, an agent using PBRS can operate without having to do any work in advance, which is important when (1) the problem domain is very large and precomputations are prohibitively expensive, or (2) the agent must be quickly reconfigured to deploy to multiple environments (e.g., search and rescue robotics).

3.2 Impact of PBRS on online planning

Because incorporating PBRS into online POMDP planning involves shaping the rewards the agent wants to earn, the policies formed using shaped or unshaped rewards could be different. This provides us with a dilemma. On the one hand, due to time constraints in online planning, we want to find better policies with PBRS since any policy found is only optimal over the finite horizon used for planning, and thus only approximately optimal over the infinite horizon. As such, the policies found during planning can suffer from over- and under-estimation problems (which PBRS is intended to address), as described in Sect. 2.2. On the other hand, since PBRS entails maximizing shaped rewards with the addition of the potential function, we do not want to sacrifice the ability to optimize the original reward function \(R\) over the long run (i.e., infinite horizon), which is, after all, the ultimate goal of the agent.

To better understand the relationship between the value of policies with respect to shaped (with PBRS) and unshaped (original) rewards, we evaluate these values from the theoretical perspective. We follow a similar approach taken to understand the values of policies with and without PBRS in RL (e.g., [2]).

In the following, we develop several key results. First, Lemma 1 derives the difference in the valuations of an arbitrary policy both with and without reward shaping over the finite horizons used for planning. This represents the difference between how good a policy looks under one approach or the other. Next, Theorem 2 establishes the conditions (Eq. 16) for which PBRS can lead the agent to a different policy than the original reward function when performing finite horizon planning, based on the results of Lemma 1. In conjunction, Remark 3 observes the condition (small planning horizons \(n)\) when a greater number of potential functions might lead PBRS to different policies than planning without reward shaping. Afterwards, Theorem 4 considers the relationship between (infinite horizon) optimal policies with and without reward shaping to establish that reward shaping still causes the agent to optimize its original reward function over the infinite horizon, in spite of working on a modified objective function. Remark 5 then extends this result (based partly on the proof to Theorem 4) to observe that PBRS also performs well as the planning horizon increases, regardless of the potential function chosen. Finally, Theorem 6 establishes a sufficient condition for the objective function (Eq. 4) with shaped rewards (Eq. 9) to remain convex and thus still be solvable by a wide range of POMDP solvers.

We begin by computing the difference between the values of a policy for a finite horizon \(n\). This captures the impact of using PBRS with online planning for short horizons required due to time constraints.

Lemma 1

Let \(S,A,\varOmega ,T,O,R,b_0 ,\gamma \) from the definition of a POMDP be given, and let \(n\in {\mathbb {N}}\) be a fixed planning horizon, \(\phi \) be a potential function over belief states, and \(\pi \) be a policy of action. Then the difference between the value with PBRS \(V^{PBRS}(b_0, \pi )\) of \(\pi \) starting at \(b_0\) and the value using unshaped rewards \(V^{orig}(b_0 ,\pi )\) is given by:

$$\begin{aligned} V^{PBRS}\left( {b_0 ,\pi } \right) -V^{orig}\left( {b_0 ,\pi } \right) =\gamma ^{n}{\sum }_{b_n \in \varPi (S)} P(b_n |\pi ,b_0 )\phi (b_n )-\phi \left( {b_0 } \right) \end{aligned}$$
(15)

Proof

For notational convenience, we denote the unshaped reward earned at each step \(t\in \left\{ {0,1,\ldots ,n-1} \right\} \) as \(R_t\):

$$\begin{aligned} R_t =R\left( {b_t ,a_t } \right) =r_t^{orig} \end{aligned}$$

and the shaped reward earned at each step \(t\) as \(R_t +F_t \):

$$\begin{aligned} R_t +F_t =R\left( {b_t ,a_t } \right) +F\left( {b_t ,a_t ,b_{t+1} } \right) =r_t^{PBRS} \end{aligned}$$

where \(b_t\) denotes the belief state after performing \(t\) actions and \(a_t =\pi \left( {b_t } \right) \) is the action chosen according to policy \(\pi \).

As an intermediate result, consider an arbitrary history \(H=\left\{ {b_0 ,a_0 ,o_1 ,b_1 ,\ldots ,b_n } \right\} \) (i.e., a fixed sequence for a particular experience in the environment) consisting of (1) the actions taken by the agent according to policy \(\pi \), (2) the resulting observations, and (3) the sequence of beliefs after making those observations. For fixed \(n\), the value using unshaped rewards of any policy \(\pi \) according to particular history \(H\) can be computed as the cumulative reward series:

$$\begin{aligned} V^{orig}\left( {b_0 ,\pi ,H} \right) =\sum \limits _{t=0}^{n-1} \gamma ^{t}r_t^{orig} ={\sum }_{t=0}^{n-1} \gamma ^{t}R_t \end{aligned}$$

and the value using shaped rewards of the same policy \(\pi \):

$$\begin{aligned} V^{PBRS}\left( {b_0 ,\pi ,H} \right)= & {} \sum \limits _{t=0}^{n-1} \gamma ^{t}r_t^{PBRS}\\= & {} \sum \limits _{t=0}^{n-1} \gamma ^{t}\left( {R_t +F_t } \right) \\= & {} \sum \limits _{t=0}^{n-1} \gamma ^{t}\left( {R_t +\gamma \phi \left( {b_{t+1} } \right) -\phi \left( {b_t } \right) } \right) \\= & {} \sum \limits _{t=0}^{n-1} \gamma ^{t}R_t +\sum \limits _{t=0}^{n-1} \gamma ^{t+1}\phi \left( {b_{t+1} } \right) -\sum \limits _{t=0}^{n-1} \gamma ^{t}\phi \left( {b_t } \right) \\= & {} V^{orig}\left( {b_0 ,\pi ,H} \right) +\left[ {\sum \limits _{t=1}^{n-1} \gamma ^{t}\phi \left( {b_t } \right) +\gamma ^{n}\phi \left( {b_n } \right) } \right] \\&-\left[ {\sum \limits _{t=1}^{n-1} \gamma ^{t}\phi \left( {b_t } \right) +\phi \left( {b_0 } \right) } \right] \\= & {} V^{orig}\left( {b_0 ,\pi ,H} \right) +\gamma ^{n}\phi \left( {b_n } \right) -\phi \left( {b_0 } \right) \end{aligned}$$

Because this result holds for arbitrary history \(H\) starting at arbitrary \(b_0\), it will hold for any sequence of beliefs when following policy \(\pi \). Therefore, since the valuation of a policy from a belief state is the expected value over all possible histories (Eq. 4), we find that:

$$\begin{aligned} V^{PBRS}\left( {b_0 ,\pi } \right)= & {} E\left[ {V^{PBRS}\left( {b_0 ,\pi ,H} \right) } \right] \\= & {} E\left[ {V^{orig}\left( {b_0 ,\pi ,H} \right) +\gamma ^{n}\phi \left( {b_n } \right) -\phi \left( {b_0 } \right) } \right] \\= & {} E\left[ {V^{orig}\left( {b_0 ,\pi ,H} \right) } \right] +\gamma ^{n}E\left[ {\phi \left( {b_n } \right) } \right] -E\left[ {\phi \left( {b_0 } \right) } \right] \\= & {} V^{orig}\left( {b_0 ,\pi } \right) +\gamma ^{n}\sum \limits _{b_n \in \varPi (S)} P(b_n |\pi ,b_0 )\phi (b_n )-\phi \left( {b_0 } \right) \end{aligned}$$

where \(P(b_n |\pi ,b_0 )\) is the probability of transitioning to \(b_n\) when following policy \(\pi \) from initial belief \(b_0\), accounting for the probabilities of the state transitions and observations required to reach \(b_n\). \(\square \)

From this result, we obtain the following theorem:

Theorem 2

Let \(S,A,\varOmega ,T,O,R,b_0 ,\gamma \) from the definition of a POMDP be given, and let \(n\in {\mathbb {N}}\) be a fixed (finite) planning horizon and \(\phi \) be a potential function over belief states. Then, the policy \(\pi ^{{\prime }}\) optimizing \(V^{PBRS}\) will differ from the policy \(\pi \) optimizing \(V^{orig}\) over the fixed horizon \(n\), provided that

$$\begin{aligned} V^{orig}\left( {b_0 ,\pi } \right) -V^{orig}\left( {b_0 ,\pi ^{{\prime }}} \right) <\gamma ^{n}\sum \limits _{b_n \in \varPi (S)} \phi \left( {b_n } \right) \left[ {P\left( {b_n |\pi ^{{\prime }},b_0 } \right) -P\left( {b_n |\pi ,b_0 } \right) } \right] \end{aligned}$$
(16)

Proof

Consider policy \(\pi \) that optimizes unshaped rewards \(V^{orig}\) over finite horizon \(n\). If there is another policy \(\pi ^{\prime }\) satisfying Eq. 16, meaning that the difference in the value of \(\pi \) and \(\pi ^{{\prime }}\) under the original reward function \(R\) is less than the difference in the expected (discounted) potential values along the planning horizon, then:

$$\begin{aligned} V^{PBRS}\left( {b_0 ,\pi ^{{\prime }}} \right)- & {} V^{PBRS}\left( {b_0 ,\pi } \right) \\= & {} \left[ {V^{orig}\left( {b_0 ,\pi ^{{\prime }}} \right) +\gamma ^{n}\sum \limits _{b_n \in \varPi (S)} P\left( {b_n |\pi ^{{\prime }},b_0 } \right) \phi \left( {b_n } \right) -\phi \left( {b_0 } \right) } \right] \\&-\left[ {V^{orig}\left( {b_0 ,\pi } \right) +\gamma ^{n}\sum \limits _{b_n \in \varPi (S)} P\left( {b_n |\pi ,b_0 } \right) \phi \left( {b_n } \right) -\phi \left( {b_0 } \right) } \right] \\= & {} \left[ {V^{orig}\left( {b_0 ,\pi ^{{\prime }}} \right) -V^{orig}\left( {b_0 ,\pi } \right) } \right] +\gamma ^{n}\sum \limits _{b_n \in \varPi (S)} P\left( {b_n |\pi ^{{\prime }},b_0 } \right) \phi \left( {b_n } \right) \\&-\gamma ^{n}\sum \limits _{b_n \in \varPi (S)} P\left( {b_n |\pi ,b_0 } \right) \phi \left( {b_n } \right) \\= & {} \left[ {V^{orig}\left( {b_0 ,\pi ^{{\prime }}} \right) -V^{orig}\left( {b_0 ,\pi } \right) } \right] \\&+\gamma ^{n}\sum \limits _{b_n \in \varPi (S)} \phi \left( {b_n } \right) \left[ {P\left( {b_n |\pi ^{{\prime }},b_0 } \right) -P\left( {b_n |\pi ,b_0 } \right) } \right] \\> & {} \gamma ^{n}\sum \limits _{b_n \in \varPi (S)} \phi \left( {b_n } \right) \left[ {P\left( {b_n |\pi ,b_0 } \right) -P\left( {b_n |\pi ^{{\prime }},b_0 } \right) } \right] \\&+\gamma ^{n}\sum \limits _{b_n \in \varPi (S)} \phi \left( {b_n } \right) \left[ {P\left( {b_n |\pi ^{{\prime }},b_0 } \right) -P\left( {b_n |\pi ,b_0 } \right) } \right] \\= & {} 0 \end{aligned}$$

Thus, \(\pi ^{\prime }\) achieves higher \(V^{PBRS}\) than \(\pi \), so \(\pi \) cannot optimize \(V^{PBRS}\) over the finite horizon \(n\). Therefore, planning with PBRS can result in a different policy using a finite horizon. Moreover, provided the potential function guides the agent towards beliefs that earn higher rewards beyond the planning horizon, PBRS could improve upon finite horizon policies that would be found without reward shaping. \(\square \)

Furthermore, the impact of the potential function on the valuation of a policy using shaped rewards depends on the size of the planning horizon \(n\). This leads us to the following remark:

Remark 3

The upper bound (Eq. 16) on the permissible difference in the valuations of the (finite horizon) optimal policies with and without reward shaping is greater as the finite planning horizon \(n\) decreases, making it easier to find a potential function \(\phi \) that satisfies Eq. 16 when the planning horizon is small.

Recall that the discount factor is restricted such that \(\gamma \in [0,1)\). Thus, as \(n\) decreases, \(\gamma ^{n}\) increases (e.g., with \(\gamma =0.95\), \(\gamma ^{5}\approx 0.77\) whereas \(\gamma ^{50}\approx 0.08\)). Hence, the upper bound in Eq. 16 grows, making the condition easier to satisfy, so planning with PBRS is more likely to favor a different policy than planning without reward shaping when the horizon is short (Eqs. 15, 16 and Lemma 1). Therefore, provided a suitable potential function, PBRS can be most beneficial when it is most necessary (i.e., when planning without PBRS is at greatest risk of being suboptimal over the infinite horizon due to short horizons and limited planning time).

Next, we prove that planning with PBRS does not sacrifice optimality over the infinite horizon with respect to the original reward function \(R\), which ultimately the agent wants to maximize. That is, a policy is optimal (without finite horizon approximation) with PBRS if and only if it is also optimal without reward shaping using just the original rewards. Therefore, even though using shaped or unshaped rewards can find different policies for short horizons, using PBRS also optimizes the original reward function \(R\) (over the infinite horizon) and is working towards the agent’s ultimate goal.

Theorem 4

Let \(S,A,\varOmega ,T,O,R,b_0 ,\gamma \) from the definition of a POMDP be given, and let \(\phi \) be a potential function over belief states. Then, a policy \(\pi ^{*}\) is optimal (over the infinite horizon) with reward shaping using PBRS if and only if \(\pi ^{*}\) is also optimal (over the infinite horizon) without reward shaping.

Proof

Let \(\pi \) be any policy. From Lemma 1, the value of this policy with PBRS over the infinite horizon is:

$$\begin{aligned} V^{PBRS}\left( {b_0 ,\pi } \right)= & {} E\left[ {\sum \limits _{t=0}^\infty \gamma ^{t}r_t^{PBRS} } \right] ={\lim }_{n\rightarrow \infty } E\left[ {\sum \limits _{t=0}^{n-1} \gamma ^{t}r_t^{PBRS} } \right] \\= & {} {\lim }_{n\rightarrow \infty } \left[ {V^{orig}\left( {b_0 ,\pi } \right) +\gamma ^{n}E\left[ {\phi \left( {b_n } \right) } \right] -\phi \left( {b_0 } \right) } \right] \\= & {} V^{orig}\left( {b_0 ,\pi } \right) -\phi \left( {b_0 } \right) +{\lim }_{n\rightarrow \infty } \gamma ^{n}E\left[ {\phi \left( {b_n } \right) } \right] \\= & {} V^{orig}\left( {b_0 ,\pi } \right) -\phi (b_0 ) \end{aligned}$$

since \(\gamma \in [0,1)\) and thus \({\lim }_{n\rightarrow \infty } \gamma ^{n}=0\). Moreover, \(\phi \left( {b_0 } \right) \) is constant since initial belief state \(b_0 \) is fixed. Thus, any policy \(\pi ^{*}\) that optimizes \(V^{PBRS}\) over the infinite horizon also optimizes \(V^{orig}\) over the infinite horizon, and vice-versa. Therefore, \(\pi ^{*}\) is optimal over the infinite horizon with PBRS if and only if it is also optimal over the infinite horizon for the original rewards. \(\square \)

From the perspective of finite horizon policies (which the agent is required to calculate to approximate the infinite horizon due to computational constraints), Theorem 4 and its proof result in the following important implication:

Remark 5

Planning with PBRS also results in earning greater (unshaped) reward as the planning horizon increases (or equivalently, with more planning time), even though it is optimizing a different objective function than the original reward function.

Both the proof for Theorem 4 and Lemma 1 imply that the valuations of policies with and without reward shaping become closer and closer as the planning depth increases. Thus, the policies chosen by each method (with or without reward shaping) also become more similar since these policies maximize their respective valuations. Because approximate planning without reward shaping generally results in better policies as the planning depth increases (since more information is added to the estimation of cumulative, future rewards), this implies that the policies formed with PBRS will also improve with respect to maximizing the original reward function.

Combined with Remark 3, this implies that PBRS is beneficial to the agent not only when the planning horizon is small (provided a good potential function), but also as the planning horizon increases (regardless of potential function).

Finally, we derive the following theorem that is important for determining when pre-existing POMDP planning solvers are compatible with PBRS.

Theorem 6

Let \(S,A,\varOmega ,T,O,R,b_0 ,\gamma \) from the definition of a POMDP be given, and let \(\phi \) be a potential function over belief states. Provided that \(\phi \) is convex, the objective function solved by the agent (Eq. 4) remains convex and can be solved by the traditional set of POMDP solvers.

Proof

Assume that \(\phi \) is indeed convex. Then, Eq. 9 is the linear combination of convex functions (Eq. 2 is also convex) [5]. Thus, the valuation function (Eq. 4) remains convex, as proven by Araya-Lopez et al. [1, Theorem 3.1] (originally established outside the context of PBRS). Therefore, shaped rewards with PBRS can also be optimized by a wide range of POMDP solvers relying on convexity, not just those considered in this paper. \(\square \)

We note here that many of the potential functions provided as examples in this paper (e.g., Eqs. 12 and  13 above) are indeed convex.

Summary To summarize our theoretical results, Lemma 1 defines the difference in the evaluation of a policy both (1) with reward shaping using PBRS \((V^{PBRS})\) and (2) without reward shaping, considering only the original reward function \(R\) \((V^{orig})\). In turn, Theorem 2 provides a condition (Eq. 16) under which a policy is evaluated as having higher value with PBRS than without. That is, this condition establishes when a different policy might be favored and returned by the planning algorithm, instead of the policy that is optimal for the small, finite horizon \(n\) (considering only the original reward function) yet possibly suboptimal over the long run. Remark 3 then notes that the condition of Theorem 2 is looser for the smallest planning horizons, making it easier for PBRS to favor a different policy that could be closer to optimal over the long run than the small, finite-horizon optimal policy. This should cause us to observe the most impactful benefits on agent performance from PBRS under the tightest time constraints on planning. Theorem 4 and Remark 5, on the other hand, explore the opposite direction and establish that as the planning horizon increases, the favored policies found with PBRS also optimize the long-term, cumulative rewards of the agent, which is the agent's ultimate goal. This holds even though the agent is directly optimizing a slightly different objective function, and should cause us to observe continued good performance from PBRS as planning constraints are relaxed. Finally, Theorem 6 establishes that POMDP planning algorithms relying on convexity in the value function to efficiently find optimal policies will also efficiently find optimal policies under PBRS.

Of note, most of these theoretical results exploit the fact that reward shaping under PBRS takes the form of a difference of potential functions (Eqs. 9, 10). With arbitrary reward shaping instead (e.g., simply adding additional value at each node of the policy tree), the telescoping sums would disappear from the proofs. Without the telescoping sums, (1) we could not bound the difference between the evaluations of a policy with and without reward shaping (Theorem 2); this bound underlies Remark 3 on the usefulness of PBRS with small planning horizons, which matters because we consider time constrained, finite horizon planning that must stop before finding an optimal (infinite horizon) policy. Moreover, (2) we could not establish that, as the planning horizon increases, the policy optimizing PBRS also optimizes the original reward function, which would in turn affect the ability of planning with PBRS to prefer policies that maximize long-term, cumulative rewards.
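To make the role of the telescoping sums concrete, the following minimal sketch (hypothetical code, not part of the original experiments; the belief representation and the negative-entropy potential are illustrative stand-ins) numerically verifies the identity of Lemma 1 on a randomly generated history: the shaped return differs from the unshaped return by exactly \(\gamma ^{n}\phi (b_n )-\phi (b_0 )\), regardless of the rewards or the potential function, and this difference approaches \(-\phi (b_0 )\) as \(n\) grows, which is the content of Theorem 4.

```python
import numpy as np

# Hypothetical verification of Lemma 1 (not part of the paper's experiments):
# for any potential function phi, the shaped return minus the unshaped return
# along a fixed history telescopes to gamma^n * phi(b_n) - phi(b_0).

rng = np.random.default_rng(0)
gamma, n, num_states = 0.95, 10, 4

# A random history: belief states b_0,...,b_n (points on the simplex) and
# arbitrary per-step rewards R_0,...,R_{n-1}.
beliefs = rng.dirichlet(np.ones(num_states), size=n + 1)
rewards = rng.uniform(-1.0, 1.0, size=n)

def phi(b):
    # Any potential over beliefs works; here, a negative-entropy certainty
    # measure in the spirit of Eq. 12 (an illustrative stand-in).
    return float(np.sum(b * np.log(b + 1e-12)))

# Unshaped and shaped finite-horizon returns along this history.
v_orig = sum(gamma**t * rewards[t] for t in range(n))
v_pbrs = sum(
    gamma**t * (rewards[t] + gamma * phi(beliefs[t + 1]) - phi(beliefs[t]))
    for t in range(n)
)

difference = v_pbrs - v_orig
expected = gamma**n * phi(beliefs[n]) - phi(beliefs[0])
assert np.isclose(difference, expected)
print(f"shaped - unshaped = {difference:.6f}; expected = {expected:.6f}")
```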

4 Experimental setup

To evaluate the performance of using PBRS to improve online POMDP planning, we conducted an empirical study that compares agent performance with and without PBRS (using the potential functions summarized in Table 2) in three benchmark POMDP planning problems described below: (1) Tag [16], (2) RockSample [20], and (3) AUVNavigation [15].

These three benchmarks were chosen for our experimental study for the following reasons. First, they are commonly used across the POMDP literature, either together (e.g., [15, 25]) or at least in some combination (e.g., [16, 18–21]). Thus, they are relatively well understood. Second, they represent a varying range of problems: (1) Tag is a relatively small problem (i.e., a low number of states, actions, and observations) with high levels of uncertainty, but a relatively simple required behavior to solve the problem, (2) RockSample is a larger problem than Tag and one for which upper and lower bound estimates provide strong clues on how to behave, and (3) AUVNavigation is an even larger problem (especially with an observation space two orders of magnitude larger than Tag or RockSample) with a very high amount of uncertainty and a difficult sequence of behavior required to solve the problem. Thus, they represent very different environments. Moreover, AUVNavigation both: (a) requires a long sequence of information gathering then movement actions to reach the ultimate goal state, and (b) contains dangerous situations that cause the agent to be unable to ever accomplish its goal, both of which were hypothesized in Sect. 1 to be problematic for depth-focused planning algorithms and which could benefit from breadth-focused planning with implicit future reward estimations, as accomplished by PBRS for online POMDP planning. We limit our study to three benchmarks for two reasons: (1) much of the POMDP literature considers a similar number of benchmarks (e.g., [18, 21, 25]), and (2) our experimental setup for each benchmark is comprehensive, requiring considerable time to both (i) run the experiments for each benchmark (cf. the start of Sect. 5) and (ii) implement and test many different potential functions on each benchmark. For comparison and easy reference, we summarize the potential functions considered in each benchmark in Table 2.

Table 2 Summary of potential functions used in each benchmark problem

4.1 Benchmark problems

4.1.1 Tag

The first benchmark problem we consider is Tag [16], in which a robotic agent (the tagger) plays laser tag with an opponent. Both agents are randomly placed in a 2D grid consisting of 29 locations and the tagger agent’s task is to find and tag the opponent, whereas the opponent tries to prolong the game by moving away from the tagger. Both agents always know their own location and the opponent knows where the tagger is at all times, but the tagger can only observe the opponent when they are in the same cell. The tagger agent earns a penalty of \(-\)1 for moving in each cardinal direction (North, South, East, and West) to find its prey, a larger penalty of \(-\)10 for trying to tag the opponent without being in the same cell, and a reward of +10 for successfully tagging the opponent, which ends the game. The tagger agent’s discounted rewards are maximized by finding and tagging the opponent as fast as possible.

Altogether, Tag represents a relatively small benchmark problem, consisting of only 870 states, 5 actions (movement and tagging), and 2 observations (\(True\) if the tagger and opponent are in the same cell, else \(False\)). However, the problem is highly uncertain: the tagger can only identify the opponent’s location if they are in the same cell, and otherwise must estimate where the dynamic opponent is as it moves away from the tagger. As such, the number of steps remaining until the game ends can be quite large and changes dynamically as both agents move through the grid. Therefore, the actual horizon of the problem can be particularly long, and time constrained planning can lead to suboptimal actions.

To improve online planning in Tag, we consider seven potential functions representing different domain-independent and domain-dependent knowledge pointing the agent to future rewards beyond the planning horizon (a brief computational sketch of several of these functions follows the list):

  • Entropy using a domain-independent measure of the certainty in the agent’s belief, following Eq. 12

  • TopBelief using another domain-independent measure of the certainty in the agent’s belief represented by Eq. 13, which is similar to Eq. 12, but (1) focuses on certainty in a single state (the most believed state), rather than across the entire belief state and (2) exploits the factored state space (fully observable tagger location vs. partially observable opponent location) to reduce computation

  • MaxBeliefDistance (MBD) using domain-dependent information to assign greater potential to belief states closer to the most likely location of the opponent, thus motivating the agent to move towards the opponent and end the game as fast as possible, hopefully minimizing incurred penalties and maximizing rewards:

    $$\begin{aligned} \phi \left( b \right) =\frac{1}{E[d\left( {o,l} \right) ]+1} \end{aligned}$$
    (17)

where \(o\) is a possible opponent location, \(l\) is the agent’s location, \(d\) measures Euclidean distance between \(o\) and \(l\), and \(E[d\left( {o,l} \right) ]\) is the expected distance based on all possible opponent locations in belief state \(b\).

  • EMBD which sums Entropy (Eq. 12) and MaxBeliefDistance (Eq. 17) to combine domain-independent and domain-dependent information in the same potential function

  • TBMBD which sums TopBelief (Eq. 13) and MaxBeliefDistance (Eq. 17) to also combine domain-independent and domain-dependent information in the same potential function

  • Upper which uses \({\overline{V}}\left( b\right) \), calculated from the policy \(\pi _{FIB} \) produced by the Fast Informed Bound algorithm [10], as an approximation of the optimal value function, and

  • Lower which uses \({\underline{V}} (b)\), calculated from the policy \(\pi _{Blind} \) produced by the Blind algorithm [10], as another approximation of the optimal value function
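As a concrete illustration of the list above, the following sketch (hypothetical Python, not our implementation) computes the MaxBeliefDistance potential (Eq. 17), a negative-entropy certainty measure standing in for Eq. 12, and their sum (EMBD). The dictionary-based belief representation, grid coordinates, and function names are illustrative assumptions.

```python
import math

# Hypothetical sketch of Tag potential functions (illustrative, not the
# paper's implementation). The opponent-location belief is represented as a
# dict mapping grid cells (x, y) to probabilities; agent_loc is the tagger's
# fully observable cell. certainty_potential stands in for Eq. 12.

def certainty_potential(opponent_belief):
    # Negative entropy: higher potential for more certain beliefs.
    entropy = -sum(p * math.log(p) for p in opponent_belief.values() if p > 0)
    return -entropy

def max_belief_distance_potential(opponent_belief, agent_loc):
    # MaxBeliefDistance (Eq. 17): inverse of the expected Euclidean distance
    # between the agent and the opponent under the current belief.
    expected_dist = sum(
        p * math.dist(cell, agent_loc) for cell, p in opponent_belief.items()
    )
    return 1.0 / (expected_dist + 1.0)

def embd_potential(opponent_belief, agent_loc):
    # EMBD: sum of the domain-independent and domain-dependent terms.
    return (certainty_potential(opponent_belief)
            + max_belief_distance_potential(opponent_belief, agent_loc))

# Example: the opponent is believed to be in one of two adjacent cells.
belief = {(2, 3): 0.7, (2, 4): 0.3}
print(embd_potential(belief, agent_loc=(0, 0)))
```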

4.1.2 RockSample

The second benchmark problem considered in our experimental setup is RockSample [20]. In RockSample, an agent navigates a remote world represented by a 2D grid of size \(g\times g\) to sample from \(k\) rocks. The goal of the agent is to determine which rocks are good, then sample only those rocks. Afterwards, the agent exits by moving to a special location off the grid. To accomplish its goals, the agent can perform \(k+5\) actions: move in any of the four cardinal directions (North, South, East, West), check the quality of any one of the \(k\) rocks, or sample the rock at its current location. To determine which actions to take, the agent considers a factored state space consisting of: (1) its fully observable current location, and (2) the hidden quality of each rock (from the set \(\left\{ {Good,Bad} \right\} )\). Checking a rock returns a noisy observation about the quality of the rock (also from the set \(\left\{ {Good,Bad} \right\} \)), where the observation’s accuracy is greater the closer the agent is to the rock. Sampling a rock changes the state of the rock to \(Bad\) (indicating it can no longer be sampled). The agent earns a reward of +10 for sampling a good rock, \(-\)10 for sampling a bad rock, and +10 for exiting the grid. All other actions earn zero reward. The agent’s discounted rewards are maximized by sampling all (and only) good rocks and exiting as fast as possible.

We use the common setting \(g=7\) and \(k=8\) (e.g., [18, 21, 25]) that results in a POMDP with 12,585 states, 13 actions, and 2 observations. This problem is larger than Tag, but less dynamic: the problem always ends with the agent reaching the same state (exiting the grid), and the environment does not change as the agent moves around. Thus, it presents a different set of challenges for time constrained planning, including a broader search tree (due to more possible actions) and deeper required activity to accomplish all the agent’s goals (sampling as many good rocks as exist in the environment), but identifying the goal state is less challenging, making it easier to achieve goal directed behavior.

To improve online planning in RockSample, we consider 13 potential functions representing different domain-independent and domain-dependent knowledge pointing the agent to future rewards beyond the planning horizon. Some are reused from Tag (Entropy, TopBelief, Upper, and Lower), whereas others are unique to RockSample (a sketch combining several of these appears after the list):

  • ClosestDistance (CD) using domain-dependent information to assign greater potential to belief states closer to uncertain rocks where the agent will achieve greater accuracy and thus most immediate belief improvement:

    $$\begin{aligned} \phi \left( b\right) =\left\{ \begin{array}{ll} -\frac{1}{2g}{\min }_{r\in R} [d\left( {r,l} \right) +1]&{}\quad \hbox {if }R\ne \emptyset \\ 0 &{}\quad \hbox {if } R=\emptyset \end{array}\right. \end{aligned}$$
    (18)

    where \(R=\{r|0.01<b\left( r \right) <0.99\}\) is the set of rocks with uncertain quality, \(l\) is the agent’s location, and \(d\) measures Euclidean distance between \(r\) and \(l\).

  • NoExit penalizing beliefs in which the agent exits while the quality of some rocks remains uncertain, thereby prioritizing more certain knowledge about rocks before exiting and avoiding neglected sampling due to myopic planning (similar to the Eq. 14 example from Sect. 3.1):

    $$\begin{aligned} \phi \left( b\right) =\left\{ \begin{array}{ll} -1000 &{} \hbox {if }R\ne \emptyset \wedge l=\hbox {exit}\\ 0 &{} \hbox {else} \end{array}\right. \end{aligned}$$
    (19)
  • ECD which sums Entropy (Eq. 12) and ClosestDistance (Eq. 18) to combine domain-independent and domain-dependent information in the same potential function

  • TBCD which sums TopBelief (Eq. 13) and ClosestDistance (Eq. 18) to also combine domain-independent and domain-dependent information in the same potential function

  • NoExitE which sums Entropy (Eq. 12) and NoExit (Eq. 19) to combine domain-independent information and belief prioritization in the same potential function

  • NoExitTB which sums TopBelief (Eq. 13) and NoExit (Eq. 19) to also combine domain-independent information and belief prioritization in the same potential function

  • NoExitCD which sums ClosestDistance (Eq. 18) and NoExit (Eq. 19) to combine domain-dependent information and belief prioritization in the same potential function

  • NoExitECD which sums Entropy (Eq. 12), ClosestDistance (Eq. 18), and NoExit (Eq. 19) to combine domain-independent and domain-dependent information, as well as belief prioritization, in the same potential function

  • NoExitTBCD which sums TopBelief (Eq. 13), ClosestDistance (Eq. 18), and NoExit (Eq. 19) to also combine domain-independent and domain-dependent information, as well as belief prioritization, in the same potential function
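To illustrate how these pieces fit together, the following sketch (hypothetical code; the belief representation, exit test, and example values are assumptions for illustration) computes the ClosestDistance (Eq. 18) and NoExit (Eq. 19) potentials and their sum (NoExitCD) for the \(g=7\) setting.

```python
import math

# Hypothetical sketch of the RockSample ClosestDistance (Eq. 18) and NoExit
# (Eq. 19) potentials and their combination (NoExitCD), for the g = 7 setting.
# The rock belief maps each rock's grid cell to the probability that the rock
# is Good; agent_loc is the agent's fully observable cell. The exit test and
# example values are illustrative assumptions.

GRID_SIZE = 7  # g in Eq. 18

def uncertain_rocks(rock_belief, lo=0.01, hi=0.99):
    # The set R in Eq. 18: rocks whose quality is still uncertain.
    return [cell for cell, p in rock_belief.items() if lo < p < hi]

def closest_distance_potential(rock_belief, agent_loc):
    # Eq. 18: greater (less negative) potential the closer the agent is to
    # the nearest uncertain rock; zero once every rock has been resolved.
    rocks = uncertain_rocks(rock_belief)
    if not rocks:
        return 0.0
    nearest = min(math.dist(r, agent_loc) for r in rocks)
    return -(nearest + 1.0) / (2.0 * GRID_SIZE)

def no_exit_potential(rock_belief, agent_loc):
    # Eq. 19: a large penalty for exiting while any rock remains uncertain.
    at_exit = agent_loc[0] >= GRID_SIZE  # assumed: exit lies off the east edge
    return -1000.0 if (uncertain_rocks(rock_belief) and at_exit) else 0.0

def no_exit_cd_potential(rock_belief, agent_loc):
    # NoExitCD: sum of Eqs. 18 and 19.
    return (closest_distance_potential(rock_belief, agent_loc)
            + no_exit_potential(rock_belief, agent_loc))

belief = {(1, 2): 0.5, (4, 5): 0.95, (6, 0): 0.995}
print(no_exit_cd_potential(belief, agent_loc=(3, 3)))
```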

4.1.3 AUVNavigation

The final benchmark problem considered in our experimental setup is AUVNavigation [15]. In AUVNavigation, a robotic submarine agent is randomly placed on one side of a \(20\times 7\times 4\) 3D underwater grid and must navigate through a set of rock obstacles to either of two known goal locations on the other side of the grid. The agent can Stay in its current position, turn Left, Right, Up, or Down to change its orientation, or it can move Forward along its orientation towards a desired location. Underwater currents also move the agent with low probability, resulting in stochastic location changes, whether or not the agent intended to move. The agent has sensors that always perfectly observe its depth and orientation in the grid, but its location in the 2D plane is uncertain. Thus, navigating through the rocks to reach the goal is quite challenging. The agent can move to the surface of the water, where it automatically uses a GPS sensor to perfectly determine its location, but this incurs a moderate cost of \(-\)50. Otherwise, moving through the grid incurs a penalty of \(-\)1, \(-\)1.44, or \(-\)1.73, depending on its orientation (with higher cost for moving diagonally and changing depths in the grid), whereas Staying or changing orientation earns zero reward. The agent incurs a large penalty of \(-\)500 for hitting a rock and an even larger reward of +5000 for reaching a goal location, each of which results in a terminal state that ends execution. The agent’s discounted rewards are maximized by reaching the goal location as fast as possible while minimizing costs incurred for spending time on the surface.

Altogether, AUVNavigation represents a very challenging benchmark problem compared to the other two benchmarks. Whereas the number of states and actions (13,537 and 6, respectively) in this problem is similar to RockSample, the number of observations (144) is much greater, increasing the size of the POMDP and the breadth of the planning tree, and the uncertainty is also much greater due to the lack of full observability of the agent’s location. Thus, AUVNavigation is the largest and most complex benchmark considered in our experiments. Due to this uncertainty and complexity, AUVNavigation can be viewed as containing three sub-problems in three stages: (1) determining the agent’s location on the far side of the grid, (2) navigating through the many dangerous rock obstacles (requiring high certainty in the agent’s location), and (3) finding a path beyond the obstacles to one of the goal locations. Furthermore, the actual horizon for this problem is quite long: over 20 actions are needed just to move the agent from its initial location to a goal location, not counting the actions required to resolve its initial location uncertainty, so planning over the full horizon with full breadth requires more memory than an agent can afford (due to exponential growth in the planning tree). Since a positive reward signaling a good planning path to the agent only occurs when it reaches the goal (after at least 20 steps), time constrained planning is very difficult in this domain because there are no intermediate positive signals to guide the agent towards the goal state. As a result, PBRS is a potentially beneficial approach for this benchmark problem since potential functions can provide such intermediate positive signals, but the potential functions need to account for the different stages of the problem to successfully guide the agent towards its goal, which could require more complex potential functions than in the other two benchmark problems.

To improve online planning in AUVNavigation, we consider eight potential functions representing different domain-independent and domain-dependent knowledge pointing the agent to future rewards beyond the planning horizon. Some are reused from Tag and RockSample (Entropy, TopBelief, Upper, and Lower), whereas others are unique to AUVNavigation (a sketch of the goal-distance based functions follows the list):

  • GoalDistance (GD) using domain-dependent information to assign greater potential to belief states closer to the nearest of the two goal locations, where the agent has less distance to travel (and less movement cost to incur) to reach its goal:

    $$\begin{aligned} \phi \left( b \right) =\frac{1}{E[d\left( {g,l} \right) ]+1} \end{aligned}$$
    (20)

    where \(l\) is a possible agent location, \(g\) is the nearest goal location to \(l\), \(d\) measures Euclidean distance between \(l\) and \(g\) (equal to the maximum possible distance if \(l\) is also a rock location to encourage the agent to avoid rocks), and \(E[d\left( {g,l} \right) ]\) is the expected distance based on all possible agent locations in belief state \(b\).

  • EGD which sums Entropy (Eq. 12) and GoalDistance (Eq. 20) to combine domain-independent and domain-dependent information in the same potential function

  • TBGD which sums TopBelief (Eq. 13) and GoalDistance (Eq. 20) to also combine domain-independent and domain-dependent information in the same potential function

  • HighBeliefGoalDistance (HBGD) which combines prioritization of beliefs with high certainty in a single state (reflecting more certain knowledge about the agent’s current location) with the domain-dependent GoalDistance potential function (Eq. 20) to help the agent navigate towards a goal location after resolving its own location uncertainty:

    $$\begin{aligned} \phi \left( b\right) =\left\{ \begin{array}{ll} \frac{1}{E[d\left( {g,l} \right) ]+1}&{}\quad \hbox {if }{\max }_{s\in S} b(s)>0.6\\ 0&{}\quad \hbox {else}\\ \end{array}\right. \end{aligned}$$
    (21)
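The following sketch (hypothetical code; the belief representation, goal and rock coordinates, and the maximum-distance constant are illustrative assumptions) shows how the GoalDistance potential (Eq. 20) and its thresholded HBGD variant (Eq. 21) can be computed.

```python
import math

# Hypothetical sketch of the AUVNavigation GoalDistance (Eq. 20) and
# HighBeliefGoalDistance (Eq. 21) potentials. The belief over the agent's own
# location maps 3D grid cells to probabilities; the goal cells, rock cells,
# and maximum-distance constant are illustrative assumptions.

MAX_DIST = math.dist((0, 0, 0), (19, 6, 3))  # assumed extent of the 20 x 7 x 4 grid

def goal_distance_potential(location_belief, goals, rocks):
    # Eq. 20: inverse of the expected distance to the nearest goal, with rock
    # cells treated as maximally distant to discourage collisions.
    expected = 0.0
    for cell, p in location_belief.items():
        d = MAX_DIST if cell in rocks else min(math.dist(cell, g) for g in goals)
        expected += p * d
    return 1.0 / (expected + 1.0)

def hbgd_potential(location_belief, goals, rocks, threshold=0.6):
    # Eq. 21: apply GoalDistance only once the agent is sufficiently certain
    # of its own location (maximum belief above 0.6), else zero potential.
    if max(location_belief.values()) > threshold:
        return goal_distance_potential(location_belief, goals, rocks)
    return 0.0

belief = {(3, 2, 1): 0.8, (3, 3, 1): 0.2}
goals = {(19, 1, 0), (19, 5, 0)}
rocks = {(10, 3, 1), (11, 3, 2)}
print(hbgd_potential(belief, goals, rocks))
```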

5 Results

In this section, we analyze the results of our experiments using the benchmark problems and potential functions outlined in the previous section and evaluate the empirical performance of using PBRS to improve online POMDP planning.

Specifically, we evaluate performance by comparing the (infinite horizon) cumulative, discounted rewards earned by the agent while operating in each benchmark:

$$\begin{aligned} \sum \limits _{t=0}^\infty \gamma ^{t}r_t^{orig} \end{aligned}$$
(22)

since this is the function the agent intends to optimize (even if it must rely on finite horizon approximations during planning) and is the traditional measure for evaluating POMDP planning. Please note that this measurement does not include the additional rewards from any potential function in order to provide a fair comparison between approaches with and without reward shaping.
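For clarity, the measure in Eq. 22 is simply the discounted sum of the unshaped rewards observed during a run, truncated at the run length; a minimal sketch (assuming a discount factor of 0.95 and a hypothetical reward trace) is:

```python
# Minimal sketch of the evaluation measure in Eq. 22: the discounted sum of
# the original (unshaped) rewards observed during one run, truncated at the
# run length. The discount factor and reward trace below are assumed values.

def discounted_return(rewards, gamma=0.95):
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Example: a hypothetical Tag run with three movement penalties, one failed
# tag attempt, and a final successful tag.
print(discounted_return([-1, -1, -1, -10, 10]))
```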

For PBRS, we performed full breadth planning with a randomized BFS expansion of the planning tree under different amounts of time \(\tau \) allotted for online planning, representing different time constraints imposed on the agent’s reasoning (common to real-world environments): \(\tau \in \left\{ {5,10,50,100} \right\} \) milliseconds for Tag and RockSample and \(\tau \in \left\{ {100,500,1000,5000} \right\} \) milliseconds for the larger and more complex AUVNavigation.

Within each benchmark, we compared for each amount of allotted time \(\tau \) the performance of planning (1) without reward shaping (Original), (2) with reward shaping using different potential functions for each benchmark problem (summarized in Table 2 and described above), (3) using AEMS2 [17], a state-of-the-art heuristic search algorithm, and (4) using ABDESPOT and ARDESPOT, two online variants of a state-of-the-art Monte Carlo tree search algorithm called DESPOT [21]. Any offline planning required by the algorithms is not included in \(\tau \).

Our results were averaged over 1000 runs of each problem for each planning approach and allotted time combination (except for AUVNavigation, where we only employed 100 runs due to its higher range of \(\tau \) values). To speed up computation in each benchmark, we used the state-of-the-art equivalent MOMDP representation [15] for the POMDP model, as also done in the recent online POMDP planning literature (e.g., [25]). We limited each run to 200 time steps, which should be ample time for the agent to solve each problem (else the agent was acting randomly and not in a goal directed fashion, and thus would probably never accomplish its goal if left to run longer).

Because we limited planning to fixed amounts of time, all experiments per benchmark were conducted on a fixed computer to avoid introducing variance into the results due to differences between computers, instead of due to differences in the algorithms’ performances that we intended to measure. Two computers were chosen for this purpose: each possessing an Intel i5 (Haswell) 3.4 GHz Quad Core processor with 8GB of RAM (limited to one thread and 3 GB of RAM per experiment run). One computer ran all of the Tag and RockSample experiments, while the other ran the lengthier AUVNavigation experiments.

In the following, we analyze performance in each of the benchmarks separately: first Tag, then RockSample, and finally AUVNavigation. Afterwards, we provide discussions generalizing our results across benchmarks to provide a more abstract identification of the strengths and weaknesses of each approach to online planning, especially focusing on using PBRS.

For each problem, we first compare the performance of full breadth planning with PBRS using the different potential functions against Original (i.e., full breadth planning without reward shaping) to explore whether or not the different types of potential functions truly provide implicit clues of what actions the agent should take to earn large cumulative, future rewards beyond the agent’s planning horizon. Second, we compare the performances of each type of potential function to try to gain insights into which might be most advantageous to improve agent planning. Finally, we compare the performances of the best and worst potential functions (and Original) against the three depth-focused state-of-the-art online POMDP planning algorithms in order to determine how well our proposed approach compares to the best known approaches and to see what benefits we gain from maintaining full breadth planning with implicit estimations of future rewards.

5.1 Tag results

5.1.1 Comparison of full breadth planning with and without reward shaping

We begin our results analysis by comparing the performance of full breadth planning with (PBRS) and without (Original) reward shaping on the Tag benchmark problem to discover the benefits of implicitly estimating future rewards without explicit calculations. We present in Table 3 the cumulative, discounted reward results earned by the agent on this benchmark for each solution.

Table 3 Results from Tag benchmark problem with 95 % confidence intervals

From these results, we make several important observations. First, the majority of the potential functions resulted in improved performance across the various planning times when compared to breadth-first planning without reward shaping (Original): 18 of 28 (64.3 %) potential function and time constraint pairs yielded higher cumulative reward in Tag. Indeed, several of the potential functions (MBD, EMBD, TBMBD, and Upper) even achieved quite significant improvements over full breadth planning with no reward shaping: improvements of 31.5, 29.7, 17.6, and 36.0 % in cumulative reward across the four different time constraints for planning (\(\tau =5,10,50,100\) ms), respectively. Moreover, the best potential functions (MBD, EMBD, TBMBD, Upper) led to better performance with only 10 ms of planning time, compared with employing an order of magnitude more time for planning (up to 100 ms) with Original. Thus, reward shaping can yield improved performance while even using less planning time.

Overall, we conclude from these results that using PBRS to shape rewards with potential functions often resulted in better planning and subsequent performance by the agent through considering implicit estimates of future rewards, as intended. So, we have evidence that using potential functions is a good approach for improving the quality of plans formed during full breadth planning.

However, not every potential function achieved better performance than Original. Namely, the Entropy, TopBelief, and Lower potential functions achieved worse (or similar) performance on many of the time constraints used for planning. Thus, we have evidence that not every potential function (or indicator of future rewards) is beneficial to planning, and care must be taken when choosing an appropriate potential function for the agent’s problem. In the next subsection, we will investigate further why these potential functions might have been a bad choice on Tag, and we will provide a more general discussion on this topic in Sect. 5.4.

5.1.2 Comparison between potential function types

Next, we try to better understand the differences between the performances resulting from each of the potential function types on the Tag benchmark problem. From the results in Table 3, we observe that the domain-dependent potential function (Type 1, based on expected state-based information: MBD) generally outperformed the domain-independent potential functions (Type 2, based on measures of the quality of agent knowledge: Entropy, TopBelief) when each was used on its own. Considering that Type 1 potential functions on POMDPs are a direct extension of the type of potential functions used elsewhere in the literature, we find that this extension is in fact still beneficial in POMDPs. On the other hand, combining the two types (Types 1 and 2 in the EMBD and TBMBD potential functions) generally resulted in better performance than either type alone. Therefore, we observe an added benefit of considering different types of potential functions, including those novel to POMDPs and proposed in this research (Type 2). In other words, the types of information provided by both together form a stronger indicator or estimator of the cumulative, future rewards the agent will earn from the belief states with higher potential under these functions.

The approximations of the optimal value function (Type 4 potential functions, commonly used in leaf evaluation heuristics), on the other hand, provided mixed results. On the one hand, the Upper bound approximation (from FIB [10]) outperformed Original and was the best potential function overall, with the greatest performance amongst potential functions for three of the four planning times considered (\(\tau =5,50,100\) ms). On the other hand, the Lower bound approximation (from Blind [10]) was one of the worst performers of all potential functions, regardless of the amount of planning time allotted. Thus, this particular potential function (commonly used in practice as a leaf evaluation heuristic, e.g., [18]) is possibly not as good a choice as other types of information for guiding agent action selection, at least on the Tag benchmark.

5.1.3 Comparison of PBRS with depth-focused, state-of-the-art planning algorithms

Now, we compare full-breadth planning with and without PBRS against the three state-of-the-art algorithms—AEMS2 heuristic search, as well as ABDESPOT and ARDESPOT MCTS algorithms. Our goal is to determine whether maintaining full breadth planning with implicit estimations of future rewards is beneficial in comparison to depth-focused approaches that explicitly calculate the cumulative, future rewards the agent intends to maximize. For this analysis, we plot in Fig. 1 the performance as planning time increased for the best (Upper) and worst (Lower) potential functions, as well as Original and the state-of-the-art algorithms.

Fig. 1
figure 1

Performance of planning algorithms as planning time increased on the Tag benchmark problem for select approaches

From these results, we first observe that full breadth planning (with and without reward shaping) was advantageous for the smallest amounts of planning time (\(\tau =5,10\) ms) in comparison to the MCTS algorithms. This was because the depth-focused MCTS algorithms did not have enough time to find a path of actions to the agent’s goal using biased random sampling, and thus they suffered the drawbacks of sacrificing breadth without gaining the benefits of focusing on depth during planning. In fact, for these amounts of planning time, the MCTS algorithms had the worst overall planning performance on this benchmark (as seen in Table 3).

Moreover, as planning time increased, the best potential function (Upper) remained competitive with the MCTS algorithms as their performance increased (for MCTS, due to better depth-focused planning with more planning time). These results imply that maintaining breadth-focused planning enhanced by implicit estimates of large future rewards achieved performance close to that of good explicit estimates of cumulative, future rewards. Therefore, implicit estimates can be as useful in at least some domains (like Tag) as explicitly calculating those rewards (under limited time constraints for planning).

However, the best state-of-the-art algorithm (AEMS2 heuristic search) outperformed the best potential function (Upper). Here the PBRS performance was not quite as good, indicating that, for the Tag benchmark, depth-focused planning providing explicit cumulative reward estimates was still the best approach. That is, the heuristic used by AEMS2 (based on the error between upper and lower bounds on agent rewards and optimistically biased towards the upper bound) indeed selected appropriate belief states to expand during planning. Therefore, implicit future reward estimations are not always as good as explicit calculations, even with limited time constraints and having to sacrifice breadth to achieve such depth during planning.

5.2 RockSample results

5.2.1 Comparison of full breadth planning with and without reward shaping

We continue our results analysis by comparing the performance of full breadth planning with (PBRS) and without (Original) reward shaping on the RockSample benchmark problem so that we can gain additional insights into the benefits of implicitly estimating future rewards without explicit calculations. We present in Table 4 the cumulative, discounted reward results earned by the agent on this benchmark for each solution.

Table 4 Results from RockSample benchmark problem with 95 % confidence intervals

As in the Tag benchmark problem, we again observe that many of the potential functions resulted in improved performance across the various time constraints on planning when compared to full breadth planning without reward shaping (Original): 34 of 52 (65.4 %) potential function and time constraint pairs yielded higher reward in RockSample. Therefore, we have additional evidence that implicit estimators of cumulative, future rewards can improve full breadth planning.

Interestingly, the majority of these improved performances occurred for the three smallest amounts of time allotted for planning (\(\tau =5,10,50\) ms) where 27 of 39 (69.2 %) potential function and time constraint pairs yielded higher cumulative reward than Original. This observation supports Remark 3 (c.f., Sect. 3.2) that PBRS can be most beneficial when the amount of time allowed for planning is smallest.

For the largest amount of planning time, on the other hand, fewer than half of the potential functions (ECD, NoExitTB, NoExitCD, NoExitECD, NoExitTBCD, Lower) outperformed Original. This again indicates that planning with PBRS is not beneficial with just any potential function, and that it can be less useful as time constraints are relaxed (i.e., when there is more time for planning and less need for implicit estimators of rewards beyond the planning horizon).

5.2.2 Comparison between potential function types

Comparing between potential function types, we make many of the same observations for the RockSample benchmark as we did for the Tag benchmark in Sect. 5.1.2: the domain-dependent potential function (Type 1, CD) generally outperformed the domain-independent potential functions (Type 2, Entropy and TopBelief) individually. Indeed, the Entropy potential function yielded some of the worst performances amongst all approaches used in our experimental study. Upon further investigation, this was due to the potential function leading the agent to overly conservative behavior: it sensed too frequently in order to reach overly high confidence values before sampling rocks, resulting in less efficient behavior than the other approaches. However, combining potential function Types 1 and 2 (especially ECD) performed better than either type alone. Again, this demonstrates the advantages of exploiting information only available in POMDPs (Type 2 potential functions), and not in the fully observable settings previously studied.

Furthermore, we observe that our other proposed novel type of potential function, belief prioritization (Type 3), also does not perform as well on its own as some of the other types, but combining Types 1, 2, and 3 yielded the best performance amongst all potential function types. In particular, planning with the NoExitECD potential function had the best performance amongst all potential functions. Thus, like Type 2, this third type of potential function (also novel to POMDPs and introduced by this research) is a beneficial form of metareasoning for the agent within a POMDP planning framework, but it requires other types of information (especially the domain-specific information measured by Type 1 potential functions) to best improve agent planning.

Finally, as in the Tag benchmark problem, the approximations of the optimal value function (Type 4, commonly used as leaf evaluation heuristics) provided mixed results. Whereas the Upper bound (calculated using FIB [10]) again generally provided improved behavior, the Lower bound potential function led to lower performance than planning without reward shaping (Original) for the smallest planning times (\(\tau =5,10\) ms). Thus, potential functions of the type commonly used for leaf evaluation heuristics still provided some benefit on this problem, but were less beneficial overall than other potential function types providing other indicators of which belief states yield high cumulative, future rewards.

5.2.3 Comparison of PBRS with depth-focused, state-of-the-art planning algorithms

To better understand the relative performance of PBRS performing full breadth planning with implicit estimation of cumulative, future rewards against depth-focused state-of-the-art algorithms on the RockSample benchmark problem, we plot in Fig. 2 the performance as planning time increased for the best (NoExitECD) and worst (NoExitE) potential functions, as well as Original and the state-of-the-art online POMDP planning algorithms.

From these results, we observe that for each planning time, full-breadth planning with the NoExitECD potential function performed favorably to the three state-of-the-art, depth-focused planning algorithms. Namely, NoExitECD outperformed the state-of-the-art heuristic search algorithm AEMS2 for the most constrained amount of planning time (\(\tau =5\) ms) and the state-of-the-art Monte Carlo search DESPOT algorithms as planning time increased (\(\tau =50\) ms), and was comparable to the state-of-the-art algorithms for the other planning times. This is a very interesting result because unlike in the Tag benchmark problem, Table 4 shows that in RockSample all of the depth-focused approaches—the heuristic search algorithm (AEMS2) and the MCTS algorithms (ABDESPOT, ARDESPOT)—generally outperformed full-breadth planning (especially compared to Original), even for the lowest amounts of planning time. Thus, in this particular problem, depth-focused planning appears to generally be a better approach than full-breadth planning. However, the indicators of future rewards measured by NoExitECD (combining both a Type 1 potential function as commonly used elsewhere in the PBRS literature, as well as our novel Type 2 and 3 potential functions exploiting metareasoning about agent knowledge and histories) sometimes led the agent to select better actions using implicit estimates of cumulative, future rewards instead of spending time explicitly calculating such rewards with depth-focused planning. Combined with the Tag benchmark results, this is additional evidence that using the novel types of potential functions for planning is very advantageous for improving agent performance in partially observable environments.

Fig. 2
figure 2

Performance of planning algorithms as planning time increased on the RockSample benchmark problem for select approaches

5.3 AUVNavigation results

5.3.1 Comparison of full breadth planning with and without reward shaping

Finally, we evaluate the results from the most complicated AUVNavigation benchmark, where time constrained planning is generally very difficult without some estimations of future rewards along very deep planning paths due to the long sequence of actions required to reach the goal state (which is the only state to provide positive reward to guide planning). As before, we begin our analysis of the results from this benchmark by comparing the performance of full breadth planning with (PBRS) and without (Original) reward shaping to evaluate the benefits of implicitly estimating future rewards without explicit calculations. We present in Table 5 the cumulative, discounted reward results earned by the agent on this benchmark for each solution.

In AUVNavigation, we observe far different results than in the simpler Tag and RockSample benchmarks. At first glance, PBRS often appears to have resulted in worse performance than planning without reward shaping (Original): 17 of 32 (53.1 %) of the potential function and time allocation pairs resulted in worse performance than planning without reward shaping.

However, upon deeper investigation, these results are a consequence of an interesting quirk in the reward function optimized by the agent, rather than truly worse performance when using PBRS. In particular, recall that the agent received zero penalty for either doing nothing with the Stay action or for changing its orientation (using the Up, Down, Left, and Right actions). Otherwise, the agent received a small penalty for moving using the Forward action. Thus, for time constrained full breadth planning without PBRS, the agent rarely calculated any benefit to moving Forward and instead chose actions that yielded zero reward (and thus no cost). As a result, the agent without PBRS never reached the goal location and sat aimlessly, sometimes eventually drifting into a rock (due to the dynamic currents underwater), resulting in a penalty of \(-\)500. Thus, the cumulative, discounted rewards earned by the agent without PBRS were close to 0 (any penalty of \(-\)500 occurred after many steps and was heavily discounted). Therefore, planning without PBRS resulted in random, uneventful behavior (stuck in Stage 1 of the problem, cf. Sect. 4.1.3) rather than the necessary goal-directed behavior.

Table 5 Results from AUVNavigation benchmark problem with 95 % confidence intervals

On the other hand, the agents using PBRS received incentive from their shaped rewards for moving Forward, and thereby willingly incurred the costs of movement. As a result, the agent usually achieved worse cumulative, discounted rewards, but more goal-directed behavior. In particular, the potential functions combining domain-dependent and domain-independent information (EGD, TBGD) chose actions that successfully completed Stage 1 (uncertainty reduction) and Stage 2 (navigating through the rock obstacles) of the problem, but incurred large costs (\(-\)50 per step) by moving along the surface of the water, where the agent always updated its location with perfect accuracy. Thus, including potential functions resulted in better behavior towards goal accomplishment than full breadth planning without reward shaping (Original), because they supplied the required intermediate positive signals that allowed the agent, within time constrained planning, to find a plan leading towards the goal state.

To better evaluate goal achievement in the challenging AUVNavigation benchmark problem, we present in Table 6 the proportion of the 100 runs in which the agent successfully reached a goal location. From these results, we observe that planning with PBRS was much more successful: 18 of 32 (56.3 %) of the potential function and time pairs resulted in more goal achievement than planning without PBRS (Original), whereas PBRS never performed worse, regardless of the potential function used. Thus, we also find evidence in very complicated environments that potential functions can produce improved planning in a full breadth scenario using implicit estimations of cumulative, future rewards.

5.3.2 Comparison between potential function types

In particular, potential functions that combine domain-dependent location information (Type 1, used for rock obstacle avoidance and movement towards the goal in Stages 2 and 3) with either domain-independent information (Type 2, encouraging belief improvement in Stage 1: EGD, TBGD) or belief prioritization (Type 3, also prioritizing belief improvement in Stage 1: HBGD) achieved much better performance than planning without PBRS. Domain-dependent location information alone (Type 1) also performed very favorably compared to planning without PBRS, although not quite as well as adding metareasoning by combining Type 1 with Type 2 or Type 3 potential functions. Overall, this level of performance is quite significant since successful time constrained planning is generally very difficult for such a complex problem!

Table 6 Proportion of AUVNavigation runs successfully ending at a goal location with 95 % confidence intervals

Moreover, for each successful potential function, performance often increased as the planning horizon increased, with HBGD eventually achieving the goal in nearly all (92 %) runs. Therefore, planning with PBRS was also very beneficial in AUVNavigation, and was able to guide the agent to goal achievement even with time constrained planning in a very complex domain—containing multiple stages with different objectives and long sequences of actions required to reach the goal state—so long as the potential function considered adequate information to guide the agent through the complex domain (here, combinations of information about domain-dependent location and domain-independent certainty or belief prioritization).

Interestingly, potential functions based on approximations of the optimal value function (Upper, Lower) were not as beneficial in this domain (although Upper did improve performance for the two largest amounts of planning time considered, \(\tau =1000,5000\,\)ms). This is a direct consequence of the complexity of the domain, causing the upper and lower bounds on the value function \({\overline{V}}\left( b\right) \) and \({\underline{V}} (b)\) from Fast Informed Bound and Blind [10] to be quite loose (ranging from over 2000 to less than 0 for most belief states), not helping agent performance (as previously observed in Tag).

5.3.3 Comparison of PBRS with depth-focused, state-of-the-art planning algorithms

As a final analysis, in order to better understand the relative performance of full breadth planning with PBRS on the AUVNavigation benchmark problem against depth-focused state-of-the-art approaches, we plot in Fig. 3 the performance as planning time increased for the best (HBGD) and worst (TBGD on rewards, Entropy on proportion of successful runs) potential functions, as well as Original and the state-of-the-art online POMDP planning algorithms. We also plot in Fig. 4 the proportion of runs successfully ending at the goal location as a function of planning time and approach.

Fig. 3 Performance of planning algorithms as planning time increased on the AUVNavigation benchmark problem for select approaches

Fig. 4 Proportion of AUVNavigation runs successfully ending at a goal location as planning time increased for select approaches

From these figures, we again observe very successful performance by PBRS with the best potential function: HBGD achieved the highest discounted, cumulative rewards in all but the lowest amount of time for planning (\(\tau =100\) ms) and the highest proportion of goal achievement across all planning times. This is an interesting result because, on the one hand, AUVNavigation requires long sequences of actions to accomplish its goal, so depth-focused planning approaches like AEMS2 or the MCTS algorithms (ABDESPOT, ARDESPOT) should have an inherent advantage. However, because the required sequences are so long (more than 20 actions to find positive future rewards), even depth-focused planning could not find a path from the agent’s starting belief state to the goal location under time constrained planning. Instead, such depth-focused approaches wasted planning time exploring paths that earn higher intermediate rewards (either not incurring costs for moving forward, or moving along dangerous routes on the bottom of the grid near rocks without incurring the high cost of surfacing to determine the agent’s true location), overestimating the value of those paths and underestimating the value of the truly best action sequences (which were either unexplored or under sampled during planning). PBRS with HBGD, on the other hand, followed an indicator of high future rewards beyond what depth-focused planning could reach under such limited time constraints, while performing full breadth planning to minimize the risk of initially following a wrong path in the planning tree and thereby underestimating the value of the best action sequences. Therefore, full breadth planning with PBRS is very beneficial over state-of-the-art approaches on the type of problem represented by the AUVNavigation benchmark: agents suffering from high uncertainty and requiring long action sequences to find positive future rewards.

Interestingly, the AEMS2 heuristic search algorithm that performed so admirably on the other two benchmark problems (generally better than MCTS and at least competitive with the best potential function using PBRS) performed very poorly on AUVNavigation. Like full breadth planning without reward shaping (Original), the agent never accomplished the goal and generally exhibited random, non-goal-directed behavior when planning with AEMS2 for all amounts of time allocated for planning. Unlike in Tag, the heuristic used in AEMS2 was not informative in this problem for choosing how to best expand the agent’s plan and led to many bad paths and wasted planning time, making AEMS2 unable to achieve the expected benefits of depth-focused planning and resulting in behavior closer to full breadth planning without implicit estimations of cumulative, future rewards (and overall performance similar to such a planner, Original). Specifically, on this benchmark, the Upper bound rewards (calculated using FIB [10]) guided the agent as if it had near-certain knowledge of the true state of the environment (namely, its current location), but this biased the agent to explore actions maximizing rewards under such conditions (namely, attempting to navigate through the rocks). In turn, this led the agent away from exploring action sequences that achieved Stage 1 of the problem (determining the agent’s location), and thus left the agent ultimately confused about how to act since its uncertainty was never actually resolved.

5.4 Discussion

Considering our results across all three benchmark problems, we now draw some general conclusions about the benefits and drawbacks of using PBRS to improve online POMDP planning. Overall, our experimental results demonstrate that PBRS can be very beneficial to online planning for POMDPs.

First, more often than not, the potential functions employed led to better performance than similar full breadth planning without reward shaping, demonstrating that implicit estimations of cumulative, future rewards (indicated by different types of information) can indeed improve the quality of plans and subsequent action selection in a wide range of environments. Thus, PBRS is worth considering in environments where full breadth planning might be useful, since it retains some of the benefits of depth-focused planning without spending the computational costs of explicitly calculating cumulative, future rewards. This is especially true in environments where the agent must take care to avoid reaching dangerous or undesirable situations with no forethought on what to do or how to reach a better situation in order to eventually achieve its goals, as discussed in Sect. 1.

Second, we also gained insights into which types of information measured by potential functions are most beneficial to improve agent action selection. In each of the three benchmarks, we observed that domain-dependent information (Type 1, often in the form of goal-directed movement for agents in grid-worlds like our three benchmarks) yielded better performance than either of the two novel types of potential functions proposed in this paper that exploit properties unique to POMDPs: domain-independent information providing metareasoning about agent knowledge (Type 2), and belief prioritization providing metareasoning about histories of agent interaction with the environment (Type 3). However, we also observed in each environment that combining these types of potential functions yielded some of the best performances of any potential function type, allowing metareasoning from Types 2 and 3 to boost performance beyond that achieved by Type 1 alone. Specifically, combinations such as NoExitECD (Type 1 + Type 2 + Type 3) in RockSample and HBGD (Type 1 + Type 3) in AUVNavigation produced the best performances across all potential functions (and generally across almost all considered approaches to online planning), and EMBD and TBMBD (Type 1 + Type 2) in Tag also performed well. Approximations of optimal value functions (Type 4), commonly used as leaf evaluation heuristics, produced more mixed results. On the one hand, using an approximation of the Upper bound on the value function (from FIB [10]) as a potential function led to the best results on Tag and moderately good results on RockSample and AUVNavigation. On the other hand, using an approximation of the Lower bound on the value function (from Blind [10], which is also used in some online POMDP planning algorithms as a leaf evaluation heuristic, e.g., [18]) generally led to some of the worst performances, occasionally worse than full breadth planning without PBRS (Original). Overall, we conclude that metareasoning about agent knowledge (using standard measures of certainty like Entropy or TopBelief, Eqs. 12, 13, Type 2) and/or about histories of agent interactions with the environment (belief prioritization, Type 3), combined with any available domain-specific information (e.g., distances to goals, whether measured in a grid space or in some other fashion as originally observed by Ng et al. [14]), was generally the most beneficial type of potential function to use for PBRS with online POMDP planning. Thus, we recommend starting with such combinations when trying to identify how to best use PBRS on a new POMDP problem. Given that standardized measures exist for Type 2, this hopefully only requires identifying relevant domain-specific information to improve planning, which is already a requirement for applying PBRS in any domain, since domain-specific information is generally the only type of information previously considered in the PBRS literature.
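
To make the combinations discussed above concrete, the following sketch (in Python) shows one plausible way to implement Type 1 and Type 2 potentials over a discrete belief state and combine them linearly. It is illustrative only: the function names are ours, and the exact forms of the paper's Entropy, TopBelief, and distance-based potentials (Eqs. 12, 13, 20, 21) may differ in their normalization and weighting.

    import math

    def entropy_potential(b):
        # Type 2 (certainty): 1 minus the normalized belief entropy, so that more
        # certain beliefs receive higher potential. b maps states to probabilities.
        h = -sum(p * math.log(p) for p in b.values() if p > 0.0)
        h_max = math.log(len(b)) if len(b) > 1 else 1.0
        return 1.0 - h / h_max

    def top_belief_potential(b):
        # Type 2 (certainty): probability assigned to the most likely state.
        return max(b.values())

    def goal_distance_potential(b, distance_to_goal):
        # Type 1 (domain-dependent): expected inverse distance to the goal, where
        # distance_to_goal(s) is supplied by domain knowledge (e.g., grid distance).
        return sum(p / max(distance_to_goal(s), 1.0) for s, p in b.items())

    def combined_potential(b, distance_to_goal, w_goal=1.0, w_certainty=1.0):
        # Linear combination of Type 1 and Type 2 potentials, in the spirit of the
        # EGD/TBGD-style hybrids evaluated in the experiments.
        return (w_goal * goal_distance_potential(b, distance_to_goal)
                + w_certainty * entropy_potential(b))

    # Example with a toy two-state belief and a hypothetical distance function:
    # combined_potential({"s1": 0.7, "s2": 0.3}, lambda s: 5 if s == "s1" else 8)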

Finally, we also compared full breadth planning using PBRS against three depth-focused, state-of-the-art online POMDP planning algorithms (the AEMS2 heuristic search algorithm [17] and the DESPOT MCTS algorithms [21]) and observed very favorable agent performance. On the largest and most complicated benchmark problem (AUVNavigation), the best potential function (combining Types 1 and 3 for domain-specific information and metareasoning about histories) outperformed each of the state-of-the-art algorithms for most of the planning times considered as our time constraints. On the other two benchmarks (Tag and RockSample), the best potential function (Type 4, using approximations of the Upper bound on the value function, for Tag, and a combination of Types 1, 2, and 3, for domain-specific information and metareasoning, in RockSample) also outperformed at least one of the state-of-the-art algorithms for some of the amounts of time allotted for planning, and was generally competitive on the rest. Thus, it appears overall that some combination of metareasoning (novel to POMDP applications of PBRS) and domain-specific information often provides good enough implicit estimations (or signal indicators) of cumulative, future rewards to allow the agent to save the time of explicitly calculating such estimations through depth-focused planning, enabling more time for full breadth planning and avoiding the potential pitfalls identified in Sect. 1 that stem from a lack of breadth in planning. Especially noteworthy is that such potential function types do not require precomputation and generally scale well with the size of the POMDP, unlike Type 4 (representing domain information also used by the state-of-the-art algorithms, as explained in the following paragraph), which can be prohibitively expensive to calculate in large POMDPs (especially those with very large state spaces). Therefore, metareasoning with PBRS might be even more advantageous in larger planning problems, which we intend to explore in the future (noting again that it already performed the best in our largest, most complicated problem: AUVNavigation).

Although PBRS does add some (domain-specific or domain-independent) information to the agent’s planning beyond the original reward function \(R\), this is similar to the behavior of the state-of-the-art algorithms. Namely, the state-of-the-art heuristic search algorithm AEMS2 and the state-of-the-art Monte Carlo search DESPOT algorithms each consider upper \({\overline{V}}\) and lower \({\underline{V}}\) bounds on the value function, which are either precalculated offline (e.g., using the FIB or Blind algorithms [10]) or calculated directly on the agent’s belief state, just like our proposed potential functions. These bounds then indirectly provide the agent with information about its domain that further informs its evaluation of policies while planning. For example, in RockSample, the bounds inform the agent about the locations of rocks, as these are the only locations where the largest positive cumulative rewards exist. Likewise, in AUVNavigation, these bounds inform the agent about the locations of obstacles and the goals, as these are the only locations where the upper bound on the value function and the immediate reward are equal (since both types of locations are terminal). Our potential function framework instead provides a principled, mathematical vehicle for considering additional types of information to inform policy evaluation during finite horizon planning, with several established theoretical results. The goal of this research is not necessarily to produce a new planning algorithm superior to all state-of-the-art algorithms, but instead: (1) to provide such a vehicle for embedding additional domain-specific or domain-independent information to further improve online planning for POMDPs, and (2) to explore what types of such information may or may not be useful across different types of planning problems. Identifying valuable types of information could then even be used to create better heuristic search algorithms and further improve the state of the art in online POMDP planning.

However, PBRS is not an approach that works with any potential function on any problem, as it is possible for a potential function to bias policy evaluation in a detrimental way. Based on our results, we conclude that some forethought is certainly necessary to identify a good potential function for a particular problem. One necessary component of a good potential function appears to be domain-specific information leading the agent towards its ultimate goal (e.g., distances in grid-based worlds). In environments where such domain expertise is difficult to encode or unknown, PBRS might not be a good choice, as this type of information was generally a prerequisite for the combinations that yielded the best performance, competitive with depth-focused, state-of-the-art online POMDP planning algorithms. Indeed, considering the other components (Type 2 and/or Type 3 metareasoning) individually generally hurt agent performance (compared to full breadth planning without reward shaping). In the future, we intend to explore additional types of problem domains where these types of potential functions might be more useful on their own, which we suspect might include (1) environmental monitoring applications (e.g., sensor tracking) where the agent’s sole goal is to have high belief certainty, making Type 2 potential functions more useful alone, as well as (2) problems with multiple subtasks required to complete the agent’s ultimate task, where belief state prioritization (Type 3) might be more useful to identify general strategies for accomplishing subtasks individually.

Furthermore, we note that the complexity of the potential functions necessary for improving planning increases with the complexity of the problem modeled by the POMDP. That is, in the challenging AUVNavigation problem, simple linear combinations of different types of potential functions were less effective in improving agent performance than in the simpler Tag and RockSample domains. Instead, we had to rely on a more complicated combination of belief prioritization (Type 3) and domain-dependent expected state-based potential (Type 1), namely HBGD, in order to best guide the agent through the three subproblems represented by different stages and thereby maximize goal achievement and cumulative, discounted rewards. However, even in complex AUVNavigation, simple linear combinations of potential functions still yielded significant improvements in agent performance compared to both full breadth planning without PBRS (Original) and at least some of the state-of-the-art online planning algorithms. In addition, for the simpler benchmark problems (which are still reasonably complex with up to tens of thousands of states, cf. Sect. 4.1), linear combinations of different types of simple potential functions resulted in significantly improved planning, demonstrating that even simpler potential functions can still boost planning performance.

Moreover, potential functions in complex domains might also require a bit more insight to fine-tune. For example, in the AUVNavigation problem, we eventually added a coefficient of 100,000 (rather than the uniform coefficient of 1 used in the simpler Tag and RockSample problems) to the potential functions to properly guide the agent to the goal state from its initial uncertainty. Recall that the successful potential functions (EGD, TBGD, HBGD) reshaped rewards partially based on the multiplicative inverse of the agent’s distance from the goal, and thus changes in these functions (Eqs. 20, 21) were quite small when the agent was highly uncertain (since it was very far from the goal). This meant that the additional signal from the potential function was easily outweighed by the costs of gathering information (namely, a cost of at most \(-\)1.73 per move towards better location information, or \(-\)50 for surfacing to discover the agent’s exact location and resolve all uncertainty). To “boost” the potential function’s signal toward cumulative, future rewards, we had to multiply the signal by a large constant in order to offset the order-of-magnitude differences between potential differences and reward costs. In other domains with high costs for gathering information, or for otherwise completing necessary intermediate steps towards the agent’s ultimate goal, large coefficients might also be necessary. Determining an appropriate coefficient can be done either through experimental investigation or by analytically comparing the additional shaped reward (from the difference in potential values, Eqs. 9 and 10) against the costs associated with actions that maximize or quickly increase shaped rewards. We took a combination of both approaches to set our coefficient for AUVNavigation, although other coefficients might have also been appropriate and led to similar performance. In the future, we intend to develop a greater theoretical understanding of how such coefficients can and should be determined based on the original shape of the reward function and the signals in the potential function. Of note, the state-of-the-art Monte Carlo DESPOT algorithm also relies on some parameter hand-tuning with respect to the problem domain, most notably the regularization parameter \(\lambda \) used by the ARDESPOT variant [21]. To provide a fair comparison, we also tuned this parameter for each of our experimental benchmarks, reusing the \(\lambda \) value suggested by Somani et al. in the documentation of the implementation of their algorithm (see footnote 9) for the Tag and RockSample benchmarks, and empirically searching for an appropriate value ourselves on AUVNavigation.
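
To illustrate the order-of-magnitude argument behind the coefficient (with purely hypothetical numbers, since the exact AUVNavigation distances are not reproduced here), suppose \(\Phi(b)\) is roughly the inverse of the agent’s expected distance to the goal and the agent is about 40 steps away. Ignoring the discount factor for simplicity, a one-step improvement changes the potential by approximately
\[
\Phi(b') - \Phi(b) \approx \frac{1}{39} - \frac{1}{40} = \frac{1}{1560} \approx 6.4\times 10^{-4},
\]
so a coefficient of \(c = 100{,}000\) yields a shaped bonus of roughly \(c\,(\Phi(b') - \Phi(b)) \approx 64\), which is on the same order as the \(-\)50 surfacing cost and can therefore compete with the immediate costs of information gathering.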

6 Conclusions and future work

In conclusion, we have explored how extending potential-based reward shaping (PBRS) from reinforcement learning (RL) to online planning with POMDPs can improve approximate planning and agent performance given the computational complexity of planning and limited time constraints. In particular, our aim was to improve long term, cumulative reward estimations in full breadth planning to avoid the problems with depth-focused planning identified in Sect. 1. Our approach entails defining a potential function over the agent’s belief states that indicates the ability of the agent to earn future rewards. The agent’s reward function is then shaped by adding value from this potential function, which biases the agent towards choosing actions during plan execution that lead to belief states earning larger rewards beyond the planning horizon. We categorize four types of potential functions (with examples), along with hybrid combinations: (1) domain-dependent information from expected state potential (extending directly from the prior use of PBRS with RL and MDPs), (2) domain-independent information measuring a quality or property of a belief state (e.g., certainty), (3) belief prioritization (e.g., a priority ordering on belief states), and (4) approximations of the optimal value function. The second and third of these types are novel to POMDPs and offer forms of metareasoning (about agent knowledge and about histories of agent interactions with the environment, respectively) to improve POMDP planning.
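
For reference, the shaping takes the standard potential-based form of Ng et al. [14], written here over belief states as a paraphrase of the construction described above (rather than a verbatim restatement of Eqs. 9 and 10):
\[
F(b, a, b') = \gamma\,\Phi(b') - \Phi(b), \qquad R'(b, a, b') = R(b, a) + F(b, a, b'),
\]
where \(b'\) is the belief state reached after taking action \(a\) in belief state \(b\) and incorporating the resulting observation, \(\gamma\) is the discount factor, and \(\Phi\) is a potential function of any of the four types above (or a combination thereof).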

We established from a theoretical perspective that (1) planning with PBRS can, given a finite horizon, lead to different policies than planning with the original unshaped rewards, which in turn enables the agent to earn greater future rewards assuming a good potential function; (2) PBRS can most improve planning when planning horizons are shortest; and (3) even though the agent’s reward function is modified, planning with PBRS still optimizes (over the infinite horizon) the agent’s original reward function. Finally, we verified these results in practice through an empirical study employing three classic POMDP benchmark problems, demonstrating that under limited time constraints, an agent planning with PBRS better maximized its cumulative, unshaped rewards than planning without PBRS, especially when combining various forms of metareasoning and domain-specific information (Types 1–3). In the most difficult benchmark, we also discovered that PBRS can enable time constrained online POMDP planning to successfully reach the target goal state when such behavior is otherwise incredibly difficult without reward shaping. In particular, time limited planning requires intermediate positive signals indicating appropriate action sequences towards a goal state that are otherwise only discoverable with very deep planning identifying long sequences of actions reaching positive rewards. For complex environments where the only positive reward is earned for reaching the goal state, PBRS can provide such intermediate signals missing from the original reward function to properly guide the agent, making this form of online planning a viable approach. We also compared the performance of PBRS for online POMDP planning against three state-of-the-art online planning algorithms and discovered that PBRS using the best combination of potential functions (Types 1–3 on two benchmarks, Type 4 on the other) performed comparably to or better than each of the state-of-the-art algorithms on all benchmarks tested.
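
The infinite-horizon invariance in result (3) follows from the standard telescoping argument of Ng et al. [14], sketched here over belief states under the assumption of a bounded potential function:
\[
\sum_{t=0}^{\infty} \gamma^{t}\,\bigl(\gamma\,\Phi(b_{t+1}) - \Phi(b_t)\bigr) = \lim_{T\rightarrow\infty}\gamma^{T}\Phi(b_T) - \Phi(b_0) = -\,\Phi(b_0),
\]
so the shaped and unshaped infinite-horizon returns of any policy differ only by the policy-independent constant \(\Phi(b_0)\). Under a finite planning horizon \(h\), the residual term \(\gamma^{h}\Phi(b_h)\) does not vanish, which is consistent with results (1) and (2): shorter horizons are precisely where shaping can change, and potentially improve, the chosen policy.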

Furthermore, whilst the focus of this paper has been on planning, the theoretical results on how to extend PBRS to POMDPs, the novel types of potential functions, and the effect of finite horizons on PBRS are also applicable to partially observable RL.

In the future, we plan to continue this line of research in several directions. First, we intend to further study potential functions to determine what additional qualities or properties of belief states are useful indicators of future rewards, in order to better determine how to choose appropriate potential functions given the properties of complex environments (and to consider other forms of metareasoning that might be usefully added to other potential functions to further improve agent behavior). Second, we intend to explore the application of PBRS to other planning settings, including (1) decentralized POMDPs, where planning amongst multiple agents is even more complex than planning with a standard POMDP and managing that complexity is still an open problem, and (2) offline POMDP planning, where concepts from PBRS such as the potential function could be used to better guide the selection of which belief states to plan around in order to create better plans focused on the most important belief states. Third, PBRS could potentially be included in other types of online POMDP planning algorithms (e.g., employed in Monte Carlo search methods to bias sampling towards large cumulative, future rewards), in which case both PBRS and related optimal reward functions [22] would be of interest to study in order to potentially further improve online POMDP planning.