Abstract
Uncertainty in motion planning often stems from three main sources: motion error, sensing error, and an imperfect environment map. Despite the significant effect of all three sources of uncertainty on motion planning problems, most planners take into account only one, or at most two, of them. We propose a new motion planner, called Guided Cluster Sampling (GCS), that takes all three sources of uncertainty into account for robots with active sensing capabilities. GCS uses the Partially Observable Markov Decision Process (POMDP) framework and the point-based POMDP approach. Although point-based POMDP solvers have made impressive progress over the past few years, they perform poorly when the environment map is imperfect. This poor performance is due to the extremely high-dimensional state space, which translates into an extremely large belief space B.
We alleviate this problem by constructing a more suitable sampling distribution, based on two observations: when the robot has active sensing capability, B can be partitioned into a collection of much smaller sub-spaces, and an optimal policy can often be generated by sufficient sampling of a small subset of this collection. Utilizing these observations, GCS samples B in two stages: a subspace is sampled from the collection, and then a belief is sampled from that subspace.
It uses information from the set of sampled sub-spaces and sampled beliefs to guide subsequent sampling. Simulation results on marine robotics scenarios suggest that GCS can generate reasonable policies for motion planning problems with uncertain motion, sensing, and environment map that are unsolvable by the best point-based POMDP solvers today. Furthermore, GCS handles POMDPs with continuous state, action, and observation spaces. We show that for a class of POMDPs that often occurs in robot motion planning, given enough time, GCS converges to the optimal policy.
To the best of our knowledge, this is the first convergence result for point-based POMDPs with continuous action space.
Acknowledgements
The authors thank D. Hsu for the discussion and cluster computing usage, L.P. Kaelbling and J.J. Leonard for the discussion, and S. Ong for reading the early drafts of this paper. This work is funded by the Singapore NRF through SMART, CENSAM.
Additional information
Most of the work was done while H. Kurniawati was with the Center for Environmental Sensing and Modeling, Singapore–MIT Alliance for Research and Technology.
Appendix
To prove Theorem 1 and Theorem 2, we use the α-function as the policy representation. The policy π is represented by a set Γ of α-functions, where \(\pi(b) = \arg\max_{\alpha\in\Gamma} \int_{s\in S} \alpha(s)\, b(s)\, ds\). Each α-function corresponds to a policy tree \(T_{\alpha}\). Each node in \(T_{\alpha}\) corresponds to an action and each edge corresponds to an observation. The value α(s) is the expected total reward of executing \(T_{\alpha}\) from s. Let \(a_{0}\) denote the root of \(T_{\alpha}\), and let us use the same notation to denote a node and its corresponding action. Then, executing \(T_{\alpha}\) starting from s means that the robot at state s starts execution by performing \(a_{0}\). An arc from \(a_{0}\) to a node at the next level of \(T_{\alpha}\) is followed, based on the observation perceived. Suppose the arc points to node \(a_{1}\); then at the next step, the robot performs \(a_{1}\). This process is repeated until a leaf node is reached. The value α(s) can be written as
\[ \alpha(s) = R(s, a_{0}) + \gamma \int_{s'\in S} T(s, a_{0}, s') \int_{o\in O} Z(s', a_{0}, o)\, \alpha_{a_{0}o}(s')\, do\, ds' \tag{7} \]
where \(\alpha_{a_{0}o}\) is the α-function that corresponds to the sub-tree of \(T_{\alpha}\) whose root is the child of \(a_{0}\) via edge o.
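As a concrete illustration of this representation, the following is a minimal sketch (all names hypothetical, not the paper's code) of selecting an action from a set of α-functions, with the integral \(\int_{s\in S}\alpha(s)\,b(s)\,ds\) approximated by a weighted-particle belief:

```python
# Hedged sketch: a policy as a set of alpha-functions, each evaluated
# against a particle-approximated belief. Names are illustrative only.

def policy_action(alphas, actions, particles, weights):
    """Return the action whose alpha-function maximizes
    sum_i w_i * alpha(s_i), a particle approximation of
    the integral of alpha(s) * b(s) over S."""
    values = [sum(w * alpha(s) for s, w in zip(particles, weights))
              for alpha in alphas]
    best = max(range(len(values)), key=lambda i: values[i])
    return actions[best], values[best]

# Toy 1-D example: two alpha-functions over a scalar state.
alphas = [lambda s: 1.0 - abs(s),   # value of a hypothetical "go" policy tree
          lambda s: 0.5]            # value of a hypothetical "stay" policy tree
actions = ["go", "stay"]
particles = [-0.1, 0.0, 0.1]        # sampled states
weights = [0.25, 0.5, 0.25]         # belief weights, sum to 1

a, v = policy_action(alphas, actions, particles, weights)
```

Here the "go" α-function wins with expected value 0.95, mirroring the argmax over Γ above.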
A.1 Proof of Theorem 1
To prove the theorem, we first show that for any α-function, α(s)=α(g(s)). For this purpose, we show that for any function f:S→S that does not change the robot's relative configuration with respect to the environment, α(s)=α(f(s)). Since g is an instance of such a function, α(s)=α(g(s)) too.
We prove α(s)=α(f(s)) by induction on the levels of \(T_{\alpha}\). When \(T_{\alpha}\) has only one level, \(\alpha(s)=R(s,a_{0})\). Since the reward function depends only on the relative configuration of the robot, and since applying f to s does not change this relative configuration, \(\alpha(s)=R(s,a_{0})=R(f(s),a_{0})=\alpha(f(s))\). Assume that for any f, any α, and any s∈S, α(s)=α(f(s)) when \(T_{\alpha}\) has i levels. Now, we show that α(s)=α(f(s)) when \(T_{\alpha}\) has (i+1) levels. The key is to show that applying f to s does not change the integration term in (7). Let us first look at the transition function. Based on property-1 of LS-continuous, we have \(T(s, a_{0}, s_{1}) = T(f(s), a_{0}, s_{1}')\), where \(s_{1}' = s_{1} + (f(s)-s)\). Since the displacement vector (f(s)−s) does not change the robot's relative configuration with respect to the environment, the robot's relative configuration at \(s_{1}\) is the same as that at \(s_{1}'\). Since Z depends only on the robot's relative configuration, \(Z(s_{1}, a_{0}, o) = Z(s_{1}', a_{0}, o)\) for any o∈O. Using the result from level i, \(\alpha(s_{1}) = \alpha(s_{1}')\). Hence, the integration term of (7) is the same for α(s) and α(f(s)). This result, together with the fact that the reward function depends only on the robot's relative configuration, gives us α(s)=α(f(s)).

Now, we prove that for any policy π, \(V_{\pi}(b)=V_{\pi}(\mathrm{Transform}(b))\). Let \(\mathcal{R}\) be a partition of S such that each set \(R_{s} \in\mathcal{R}\) consists of all states in S where the robot's relative configuration with respect to the environment is the same as that of s. This means that each state in the same set of \(\mathcal{R}\) has the same α-value, and hence we can write,
From the definition of Transform in (3),
Therefore, (8) can be rewritten as
which is the result we want.
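The base case of the induction can be checked on a toy example. The sketch below (illustrative names only, not the paper's code) uses a 1-D "environment frame": a reward that depends only on the robot's configuration relative to the environment is unchanged by a map f that shifts the robot and the environment frame together.

```python
# Toy check of the invariance argument: if the reward depends only on the
# robot's configuration *relative* to the environment, applying a map f
# that preserves the relative configuration leaves the value unchanged.
# All names and constants are illustrative, not the paper's code.

GOAL_OFFSET = 2.0                       # hypothetical goal, in relative coords

def reward(s, env_origin):
    """Reward depends only on s - env_origin (relative configuration)."""
    rel = s - env_origin
    return 1.0 if abs(rel - GOAL_OFFSET) < 0.5 else 0.0

def f(s, shift=5.0):
    """A map f: S -> S shifting the robot (with the environment frame
    shifted by the same amount, so the relative configuration is fixed)."""
    return s + shift

s, env = 1.8, 0.0
same = reward(s, env) == reward(f(s), env + 5.0)   # alpha(s) == alpha(f(s)) analogue
```

This is exactly the one-level case α(s)=R(s,a₀)=R(f(s),a₀); the induction extends it level by level through T and Z.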
A.2 Proof of Theorem 2
To prove Theorem 2, we first need the following lemma.
Lemma 1
In an LS-continuous POMDP with parameter \((K_{RS}, K_{Z})\) and a normalized observation space, for any α-function and any states s,s′∈S, \(|\alpha(s)-\alpha(s')| \leq (\frac{K_{RS}}{1-\gamma} + \frac{\gamma K_{Z} R_{\mathit{max}}}{(1-\gamma)^{2}} ) D_{S}(s, s')\).
Proof
Using the definition of α value in (7) and the triangle inequality, we have
Based on property-3 of LS-continuous, we can bound the first absolute term on the right-hand side of (9) as \(|R(s,a_{0})-R(s',a_{0})| \leq K_{RS} D_{S}(s,s')\).
Now, we bound the second absolute term on the right-hand side of (9). Let d=s′−s. Property-1 of LS-continuous gives us \(T(s,a_{0},s_{1})=T(s',a_{0},s_{1}+d)\). Hence, we can rewrite the last absolute term in (9) as,
Using property-2 of LS-continuous, we have \(Z(s_{1}+d,a_{0},o) \geq Z(s_{1},a_{0},o) - K_{Z} \cdot D_{S}(s_{1},s_{1}+d)\).
Substituting the above bounds to (9) gives
The last inequality holds after \(\alpha_{a_{0}o}\) is expanded recursively, assuming that O is normalized. □
Now, we prove Theorem 2. Let \(V_{\pi}(b)=\alpha\cdot b\) and \(V_{\pi}(b')=\alpha'\cdot b'\). Then, there must always be a point \(b_{c}=ab+(1-a)b'\) such that \(\alpha\cdot b_{c}=\alpha'\cdot b_{c}\), since \(\alpha\cdot b\geq\alpha'\cdot b\) and \(\alpha'\cdot b'\geq\alpha\cdot b'\).
Suppose f is the joint density function used in computing \(W_{D}(b,b_{c})\), with \(b(s)=\int_{s'\in S} f(s,s')\,ds'\) and \(b_{c}(s')=\int_{s\in S} f(s,s')\,ds\). Similarly, suppose g is the joint density function used in computing \(W_{D}(b_{c},b')\), with \(b_{c}(s)=\int_{s'\in S} g(s,s')\,ds'\) and \(b'(s')=\int_{s\in S} g(s,s')\,ds\). Then, we can rewrite (10) as,
Substituting the difference between α values in the above inequality with the result of Lemma 1, and using the definition of the Wasserstein distance, gives us,
Using the convexity property of W D , we get the desired result.
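For concreteness, the Wasserstein (earth mover's) distance \(W_{D}\) between two discrete 1-D beliefs can be sketched via the CDF formula \(W_{1}=\int |F_{b}(x)-F_{b'}(x)|\,dx\). This is a minimal illustration with a hypothetical support, not the paper's implementation:

```python
# Hedged sketch: 1-D Wasserstein distance between two discrete beliefs
# on a shared, sorted support, via the cumulative-distribution formula.

def wasserstein_1d(xs, p, q):
    """xs: sorted support points; p, q: weights, each summing to 1."""
    dist, cdf_gap = 0.0, 0.0
    for i in range(len(xs) - 1):
        cdf_gap += p[i] - q[i]                 # running CDF difference
        dist += abs(cdf_gap) * (xs[i + 1] - xs[i])
    return dist

xs = [0.0, 1.0, 2.0]
b  = [0.5, 0.5, 0.0]     # belief b
bp = [0.0, 0.5, 0.5]     # belief b'
d = wasserstein_1d(xs, b, bp)
```

Here half the probability mass moves a distance of 2, so d = 1.0; Theorem 2 scales such a distance by the Lipschitz constant from Lemma 1 to bound the value difference.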
A.3 Proof of Theorem 3

To prove Theorem 3, we first need the following lemma, which bounds the error generated by a single backup operation.
Lemma 2
Consider a POMDP that satisfies LS-continuous with parameter \((K_{RS},K_{Z})\) and LA-continuous with parameter \((K_{RA},h)\). Suppose the sampling dispersion in each element of \(\mathcal{P}\) is \(\leq\delta_{A}\). Then, the error generated by a single simplified GCS backup at a belief b is bounded as \(|H V(b)-\hat{H_{b}} V(b)| \leq K_{RA} \delta_{A} + \gamma (\frac{K_{RS}}{1-\gamma}+\frac{K_{Z} R_{\mathit{max}}}{(1-\gamma)^{2}} ) h(\delta_{A})\).
Proof
To shorten the proof, let us use the Q-value notation. For any b∈B and any a∈A,
The single backup error is then,
where \(\operatorname{Samp}(P)\) is the sampled representation of \(P \in\mathcal{P} \),
Let us compute \(Q(b, a_{P}^{*})-Q(b, \hat{a}_{P}^{*})\) for an element P of \(\mathcal{P}\). For compactness, we drop the subscript P. Using (11) and the triangle inequality, we get
Using property-3 of LA-continuous, we bound the first term on the right-hand side as \(\int_{s\in S} |R(s,a^{*})-R(s,\hat{a}^{*})|\,b(s)\,ds \leq K_{RA}\delta_{A}\). Using property-1 of LA-continuous, \(T(s, a^{*}, s') = T(s, \hat{a}^{*}, s'+f(a^{*}, \hat{a}^{*}))\). Therefore, (13) can be rewritten as,
Since a ∗ and \(\hat{a}^{*}\) belong to the same element of \(\mathcal{P}\), using property-2 of LA-continuous, we get \(Z(s'+f(a^{*}, \hat{a}^{*}),\allowbreak \hat{a}^{*}, o) = Z(s'+f(a^{*}, \hat{a}^{*}), a^{*}, o)\). Using property-2 of LS-continuous, we get \(Z(s'+f(a^{*}, \hat {a}^{*}), a^{*}, o) \geq Z(s', a^{*}, o) - {K_{Z}} {D_{S}(s', s'+f(a^{*}, \hat{a}^{*}))} \). Using these properties and the assumption that O is normalized, rearranging the above inequality gives us
The last inequality holds based on three properties: (1) any α value does not exceed \(\frac{{R_{\mathit{max}}}}{1-\gamma }\), (2) property-1 of LA-continuous, i.e., \({D_{S}(s', s'+f(a^{*}, \hat {a}^{*}))} \leq h({D_{P}(a^{*}, \hat{a}^{*})})\), and (3) Lemma 1.
Using the above inequality, the fact that for any \({P} \in\mathcal{P}\), \(D_{P}(a_{P}^{*}, \hat{a}_{P}^{*}) \leq{\delta_{A}}\), and the fact that h is an increasing function, we get \(\vert H {V(b)}-{\hat{H_{b}}} {V(b)} \vert \leq K_{RA} \delta_{A} + \gamma ( \frac{K_{RS}}{1-\gamma} + \frac{ K_{Z} R_{\mathit{max}}}{(1-\gamma)^{2}} ) h(\delta_{A})\). □
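The \(K_{RA}\delta_{A}\) term has a simple interpretation: maximizing a Lipschitz Q-function over a sampled action set with dispersion \(\delta_{A}\) loses at most \(K_{RA}\delta_{A}\). The sketch below illustrates this with a hypothetical 1-D Q-function (not the paper's planner):

```python
# Illustrative sketch: approximating a max over a continuous action set
# by a finite sample with dispersion delta_A. For a K-Lipschitz Q(b, .),
# the sampled max is within K * delta_A of the true max -- the
# K_RA * delta_A term in Lemma 2. All names are hypothetical.

def sampled_max(q, lo, hi, n):
    """Maximize q over n evenly spaced sample actions in [lo, hi].
    Returns the best sampled action, its value, and the dispersion."""
    step = (hi - lo) / (n - 1)
    best_a = max((lo + i * step for i in range(n)), key=q)
    return best_a, q(best_a), step / 2.0     # dispersion delta_A = step/2

q = lambda a: 1.0 - abs(a - 0.37)            # hypothetical Q(b, .), K = 1
a_hat, q_hat, delta = sampled_max(q, 0.0, 1.0, 11)
gap = 1.0 - q_hat                            # true max of q is 1.0 at a = 0.37
```

With 11 samples the dispersion is 0.05, and the observed gap (0.03) respects the K·δ_A bound.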
Now, we can prove Theorem 3. The difference between the optimal value V^∗ and the value V_t computed by the simplified GCS after t steps is,
Applying Theorem 2 to V ∗ and V t , and bounding |V ∗(b′)−V t (b′)|≤ε t , we get
To compute ε t , notice that V ∗(b′)=HV ∗(b′) and \(V_{t}(b') \leq\hat{H_{b}} V_{t-1}(b')\). Hence, \(\vert V^{*}(b') - V_{t}(b')\vert \leq \vert HV^{*}(b')- \hat{H_{b}} V_{t-1}(b')\vert \) and the following holds
Using the contraction property of H and (14), we can bound the first absolute term on the right hand side of (15) as \(\vert HV^{*}(b')-HV_{t-1}(b')\vert \leq\gamma (4 (\frac {{K_{RS}}}{1-\gamma}+\frac{K_{Z} R_{\mathit{max}}}{(1-\gamma)^{2}} )\delta_{B} + \varepsilon _{t-1} )\). The last absolute term of (15) can be bounded using Lemma 2. As a result, we get
Expanding the recursion gives us,
which is the result we want.
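The geometric expansion at the end of the proof can be checked numerically. The constants below are arbitrary illustrative values, with c standing in for the per-step sampling-error terms (the \(\delta_{B}\) and \(\delta_{A}\) terms):

```python
# Numeric check of the expanded recursion eps_t = gamma * eps_{t-1} + c:
# iterating contracts eps_t toward the fixed point c / (1 - gamma), so
# the total error settles at a constant multiple of the per-step
# sampling error. gamma, c, and eps_0 are illustrative values only.

gamma, c = 0.9, 0.05
eps = 1.0                          # eps_0: initial value-function error
for t in range(200):
    eps = gamma * eps + c
fixed_point = c / (1.0 - gamma)    # limit of the geometric series
```

After 200 iterations eps is within 1e-6 of c/(1−γ) = 0.5, matching the closed-form limit of the expansion.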
Kurniawati, H., Bandyopadhyay, T. & Patrikalakis, N.M. Global motion planning under uncertain motion, sensing, and environment map. Auton Robot 33, 255–272 (2012). https://doi.org/10.1007/s10514-012-9307-y