Autonomous Robots

, Volume 41, Issue 4, pp 1013–1026 | Cite as

A confidence-based roadmap using Gaussian process regression

  • Yuya Okadome
  • Yutaka Nakamura
  • Hiroshi Ishiguro


Recent advances in high performance computing have allowed sampling-based motion planning methods to be successfully applied to practical robot control problems. In such methods, a graph representing the local connectivity among states is constructed using a mathematical model of the controlled target. The motion is planned using this graph. However, it is difficult to obtain an appropriate mathematical model in advance when the behavior of the robot is affected by unanticipated factors. Therefore, it is crucial to be able to build a mathematical model from the motion data gathered by monitoring the robot in operation. However, when these data are sparse, uncertainty may be introduced into the model. To deal with this uncertainty, we propose a motion planning method using Gaussian process regression as a mathematical model. Experimental results show that satisfactory robot motion can be achieved using limited data.


Sampling-based motion planning Gaussian process regression Probabilistic roadmap 

1 Introduction

For several decades, robots have been expected to work in real environments. Recent advances in high performance computing have allowed sampling-based motion planning methods to be successfully applied to practical robot control problems (Kavraki et al. 1996; LaValle 2006; LaValle and Kuffner 2001; Choset 2005) , including an autonomous self-driving cars which can negotiate city roads (Kuwata et al. 2009). Two representative sampling-based methods have been used: rapidly-exploring random tree (RRT) (Dalibard et al. 2013; LaValle and Kuffner 2001) and probabilistic roadmap (PRM) (Prentice and Roy 2009; Kavraki et al. 1996). In these approaches, an initial state space is constructed according to the problem setting. We derive a graph in which each node corresponds to a point in the state space and each link represents the reachability of the neighboring nodes. The commonly used methods for estimating reachability require mathematical modeling of the controlled target. For example, node pairs are connected when there exists an appropriate action-inducing state transition from one node to another. Once an adequate graph has been derived, a path on the graph is extracted by a search algorithm such as Dijkstra’s algorithm (Dijkstra 1959) or the A\(^{*}\) algorithm (Dechter and Pearl 1985) to satisfy the task objective.

Many factors affect the behavior of a robot. For example, unanticipated disturbances may be induced by contact with a physical object in an actual environment, or the robot itself may have a complex structure with many degrees of freedom (DoFs). One of the difficulties of motion planning under such constraints is the absence of an appropriate mathematical-model. Since it is difficult to construct an appropriate mathematical model in advance, it is necessary to build one from the motion data obtained by monitoring the robot while it is in operation. Even when a sophisticated system identification technique is used, however, insufficient data can introduce uncertainty into the model, making it difficult to estimate the model parameters (Bishop 2006). Since existing sampling-based motion planning methods rely on the mathematical model, inappropriate motions may be generated if this uncertainty is not taken account.

In this paper, we propose a sample-based motion planning method using Gaussian process regression (GPR) (Rasmussen and Williams 2005; Foster et al. 2009) as the mathematical model. GPR is a nonparametric Bayesian inference method, and has been successfully applied to many practical tasks (Bishop 2006). GPR outputs both the most probable estimate (mean) and the confidence interval, where the confidence interval corresponds to the uncertainty in the model. In this research, the confidence interval is used to evaluate the reachability between nodes. That is, node pairs are connected only if the state transition from one to another by a certain action is expected with “strong” confidence. Paths connected with confident links are assumed to be reliable.

Since a sampling-based motion planning method requires reachability to be calculated between massive numbers of node pairs, a computational model with a high calculation cost is impractical. While the calculation cost of a GPR increases with the sample size at \(O(N^3)\), its accuracy also depends on the sample size. To overcome this dilemma, we employed the fast approximation method for GPR (Okadome et al. 2013) proposed in our previous work. In this method, locality-sensitive hashing (LSH) (Indyk and Motwani 1998) is combined with a product of experts model (PoEs) (Hinton 2002) to balance speed and accuracy.

To evaluate our method, we applied it to three control tasks. The first application, using a simple robot, showed that a feasible motion can be obtained using our confidence-based motion planning method, and that sensitivity to risk can be dealt with by penalizing longer paths. The second application, using a robot with a more complex structure, showed that a highly nonlinear underactuated robot with many DOFs can be controlled using our method. Finally, the control of a robotic arm was achieved in a partially observable setting, as is often required in practice. These results suggest that planning can be reliably established using an uncertainty model, and that satisfactory motion can be obtained using limited data.

2 Sampling-based motion planning

In this research, we assumed that the controlled target was a discrete-time Markov system with continuous state and action spaces. The state of a physical system is represented by a vector (point) in the configuration space and its time derivative (e.g., velocity). For example, the state of a robot is represented by a vector \({\varvec{s}}\) whose components are the joint angles and angular velocities, and the action \({\varvec{a}}\) consists of a torque applied to each joint specifically, the electric voltage or the duty ratio of the PWM. The forward dynamics are defined by the state transition probability \(p({\varvec{s}}' | {\varvec{s}}, {\varvec{a}})\), where \({\varvec{s}}'\) is the successive state. At time t, the controller observes the current state \({\varvec{s}}(t)\) and then outputs an action \({\varvec{a}}(t)\) based on a control policy. Then, the robot changes its state to \({\varvec{s}}(t + 1)\) according to the state transition probability. The cost c(t) for the state transition \(s(t)\rightarrow s(t+1)\) is determined according to its fitness. In a typical Markov decision process, the objective is to obtain the optimal stationary policy that minimizes the accumulated cost:
$$\begin{aligned} \sum _{\tau =0}^{\infty }\gamma ^{\tau }c(t+\tau ), \end{aligned}$$
where \(\gamma \) is a discount factor. A schematic of the procedure is shown in Fig. 1.
Fig. 1

Schematic of Markov decision process. Strictly, the state \({\varvec{s}}\) and observation differ. However, in practice, the state can be approximated by embedding many sensors

In this research we made assumptions as follows. The target system is ergodic for any stationary stochastic policy \(p({\varvec{a}}|{\varvec{s}})\), and has a stationary distribution \(p_{\mathrm{sd}}({\varvec{s}})\) according to the policy. The goal is defined by an area \({\mathcal G}\) where the probability to stay is not 0, i.e., \(\int _{\mathcal G} p_{\mathrm{sd}}({\varvec{s}})d{\varvec{s}}\ne 0\). The cost is −1 if the system reaches the goal area or 0 otherwise. The purpose of our research was not to find the optimal control policy using an exact system model given in advance, but rather was to to find an action sequence to reach the goal on the basis of previous experiences without such a model. From this point of view, the discount factor \(\gamma \) was set to 1; hence, the “optimality” of the control with respect to this “external” cost was not considered1.

From here, we provide a simple explanation of the probabilistic roadmap by Kavraki et al. (1996) that is the basis of our proposed method; however, the system model was assumed to be given in advance in this case. We then discuss about our problem settings.
Fig. 2

A series of state. This example shows the left arm raised up. A motion is divided into a state sequence

Fig. 3

An example of a probabilistic roadmap. Each circle (node) denotes a milestone, and each arrow denotes the state transition (action)

2.1 Probabilistic roadmap

A motion is defined as a sequence of postures and their time derivatives, as shown in Fig. 2. To reproduce a motion, appropriate control signals to induce the necessary transitions must be input to the robot in the correct order. The purpose of motion planning is to find an executable path (motion) from the given start state to the goal state and to generate the required action sequence.

In sampling-based motion planning, a graph that covers the whole state space is constructed, as shown in Fig. 3. In a PRM (Kavraki et al. 1996) algorithm, several states called milestones are randomly sampled from the whole state space, and each milestone pair is connected by a directed link when the robot at the source milestone changes its state to the destination milestone by a certain action; the milestone pair is then reachable. To obtain an arbitrary path, an infinite number of milestones is necessary to fill out the whole state space when the state space is continuous. However, they can be approximated by a finite number of randomly sampled milestones. Milestone pairs are not connected if they are not reachable, because of physical constraints. This may be because the robot is blocked from moving from one to the other by an obstacle between them, or because the thrust force generated by the robot’s actuator is insufficient to move its arm. Feasible motion can be obtained by extracting a sequence of milestones from the start to the goal.

A typical way of creating a roadmap is as follows: i) Sample several points from the whole state space. ii) Evaluate the connectivity between milestone pairs using a “local” controller, e.g., by calculating the inverse dynamics. iii) Connect reachable milestone pairs by links. The problem with this method is that the sampled states are sparsely distributed in the state space if the system has a large number of DoFs, and the connectivity among the milestone pairs cannot be evaluated the local controller if the system model is unknown.

2.2 Roadmap construction without system model

These problems can be solved using the exact inverse dynamics model; however, estimation of the inverse dynamics only from the previously observed data was necessary in our case. In our planning procedure, it was assumed that, (i) the state transition was deterministic, i.e., the same behavior was always generated by the same action sequence; (ii) the same transition was induced by a unique action, i.e., different actions induced different motions. (iii) the observation included sufficient information to avoid perceptual aliasing(Chrisman 1992; Spaan and Vlassis 2004) and the state \({\varvec{s}}\) only consisted of observed sensor values. Furthermore, we assumed that some observed data were available before planning. Note that the motion reproduced by the planned action sequence is not guaranteed to generate the planned motion, although the reproduced and planned motions are expected to be similar if the estimation is sufficiently accurate.

To find a feasible motion that reached to the goal, we introduced “internal” costs for each state transition; these costs indicated the uncertainty of the estimation. Bayesian methods can be used to calculate the uncertainty of the estimation. For example, GPR outputs both the mean value and the confidence interval, and the latter indicates the uncertainty. In this research, the internal costs were calculated from this uncertainty.

3 Roadmap construction using GPR

In this research, the system model was estimated using a Bayesian nonparametric model—specifically a GPR—since the inverse dynamics were not given. If insufficient data were available to enable precise estimation of the inverse dynamics for the tentative state transition \(({\varvec{s}} \rightarrow {\varvec{s}}')\), the confidence interval of the GPR estimation would becomes large. In such a case, the action could not be determined precisely and the state transition by the estimated action would be uncertain. The planned motion should exclude the link corresponding to such state transitions. To give penalties to these links, we assign the “internal” cost based on the confidence interval. By constructing roadmap based on this cost and finding the best path (minimum cost), the least uncertain path was obtained. That is, we expect that feasible motions can be planned by our method.

The main steps in our motion planning method are as follows:
  1. 1.

    Gathering motion data

  2. 2.

    Constructing the initial roadmap

  3. 3.

    Improving the roadmap by connecting node pairs

  4. 4.

    Motion planning using Dijkstra’s algorithm (Dijkstra 1959).

Fig. 4

A randomly generated motion sequence a A randomly generated motion sequence b A motion sequence projected on the state space

3.1 Data gathering

Figure 4a shows a schematic example of motion data gathering. The robot was controlled by a randomly generated control signal, and the sensor values and control signals were recorded. We obtained the time series of states and actions \(\mathcal M = \left\{ {\varvec{s}}(1),{\varvec{a}}(1), {\varvec{s}}(2), \ldots \right\} \). These data were used for both creating the initial roadmap and calculating the inverse dynamics of the target system.

3.2 Constructing the initial roadmap

An initial roadmap was constructed from the time series data. Instead of milestones randomly sampled from whole state space, we used all the states in the data \(\mathcal M\) as the milestones. The t th state \({\varvec{s}}(t)\) was mapped onto the state space as the t th milestone \({\varvec{s}}_{t}\) and connected from \(t-1\) th to milestone \({\varvec{s}}_{t-1}\). Since a state transition from the t th milestone to the \(t+1\) th one could be induced by the action \({\varvec{a}}(t)\), this milestone pair was assumed to be reachable. As a result, a roadmap with a chain of milestones was generated as in Fig. 4b.

We then grew the roadmap by adding links between these milestones. Our method attempts to add links between nearby milestone pairs by using crossing points as a shortcut. The Bayesian inference framework allows the reliability of the estimation to be calculated, as described below. Our method tries to reproduce the recorded motion while avoiding roundabout or low reproducibility motions. To increase the moving area, it is necessary to collect more data during operation. We will return to this in a future study.

Note that the reachability between two states separate from those in the dataset \(\mathcal M\) becomes unreliable when using mathematical model trained on the data \(\mathcal M\) without a priori knowledge of the robot. When using randomly sampled milestones, as in an existing methods, most of the milestones might be isolated and thus ignorable for path planning, owing to the sparse data.

3.3 Connecting node pairs

The inverse dynamics were defined as the probability distribution of an action \(\varvec{a}\) given the originating state \(\varvec{s}\) and the difference \(\varDelta \varvec{s} = {\varvec{s}}'-{\varvec{s}}\): \(p({\varvec{a}}|{\varvec{s}}, \varDelta {\varvec{s}})\). The probability was denoted by \(p({\varvec{a}}|{\varvec{x}})\), where \({\varvec{x}}\) is the input of the function defined by a tuple of the state and difference \([{\varvec{s}}, \varDelta {\varvec{s}}]\), and \({\varvec{a}}\) is the action that induces the difference \(\varDelta {\varvec{s}}\) at the state \({\varvec{s}}\), in accordance with a regression model. This probability was calculated using a mathematical model trained on the motion data \(\mathcal M\). A dataset was constructed from the motion data \(\mathcal M\):
$$\begin{aligned} \mathcal D=\{{\varvec{x}}_{t}, {\varvec{a}}_{t} | t=1,\ldots ,N \}, \end{aligned}$$
where \( {\varvec{x}}_{t} = [{\varvec{s}}(t), \varDelta {\varvec{s}}(t)={\varvec{s}}(t+1)-{\varvec{s}}(t)]\) and \({\varvec{a}}_{t} = {\varvec{a}}(t)\).
In this research, reachability was evaluated by the inverse dynamics \(p(\varvec{a}|\varvec{x})\) calculated using a GPR, which can not only estimate the expectation of an action but also attach a confidence level to that expectation. Given dataset \(\mathcal D\), the probability distribution was calculated as
$$\begin{aligned} p({\varvec{a}}|{\varvec{x}},\mathcal D)= {\mathcal N}({\varvec{a}}|\mu (\varvec{x}, \mathcal D),\sigma ^{2}(\varvec{x}, \mathcal D)), \end{aligned}$$
where \(\mathcal N (\cdot |\mu ,\sigma ^{2})\) denotes a Gaussian distribution whose mean and variance are \(\mu \) and \(\sigma ^{2}\), respectively. The calculation of \(\mu \) and \(\sigma ^{2}\) will be described in the next section.

This gives the variance of the estimation, since GPR is a Bayesian inference method (Bishop 2006; Rasmussen and Williams 2005). When the dataset \(\mathcal D\) represents a large sample in terms of the input query \(\varvec{x}\), the variance becomes small, and vice versa. This value indicates the reliability of the estimation, and is called the confidence interval. In this research, the confidence interval was used to evaluate the reachability of each milestone pair, and the cost of each link was determined from this value.

In our method, the roadmap was grown as follows: (i) For each milestone, \(|\mathcal K|\) tentative destination milestones were identified. \(\mathcal K\) and \(|\mathcal K|\) represent the set of tentative milestones and the size of the set, respectively. For example, the \(\mathcal K\) for \({\varvec{s}}_{t}\) comprised the \(|\mathcal K|\) nearest milestones to \({\varvec{s}}_{t+1}\), since the destination state was expected to be close to the successive state in the motion data \(\mathcal M\). By this procedure, the massive iterations in the GPR calculation could be reduced. (ii) The reachability from the “source” milestone to each tentative destination was evaluated by the GPR, and the cost of each milestone pair was calculated from the confidence interval. When the cost was too large, the milestone pair was not connected, in order to limit the size of the roadmap (Fig. 5a). A directed roadmap \(\mathcal R\) was created by repeating this procedure for all neighboring milestone pairs as shown in Fig. 5b.
Fig. 5

Roadmap construction by our method a A milstone connection. The red and blue arrows show the connected and the unconnected milestone pairs. Note that the confidence interval is direcly used as the cost function in this case b A constructed roadmap. The red arrows show the newly connected milestone pairs

Evaluation of each link Each link had two attributions: cost and associated action. The cost of a milestone pair (\({\varvec{s}}_{\mathrm{so}} \rightarrow {\varvec{s}}_{\mathrm{de}}\)) was calculated as shown below. The input of the inverse dynamics \({\varvec{x}}_{*}\) became \( [{\varvec{s}}_{\mathrm{so}}, {\varvec{s}}_{\mathrm{de}}-{\varvec{s}}_{\mathrm{so}}]\), and the confidence interval was calculated by \(\sigma ^{2}({\varvec{x}}_{*},\mathcal D)\). The cost of link \(c({\varvec{s}}_{\mathrm{so}}\rightarrow {\varvec{s}}_{\mathrm{de}})\), for example, was defined by a monotonically increasing function of \(\sigma ^{2}({\varvec{x}}_{*},\mathcal D)\) such that \(c({\varvec{s}}_{\mathrm{so}}\rightarrow {\varvec{s}}_{\mathrm{de}})=\exp (\sigma ^{2}({\varvec{x}}_{*},\mathcal D))\). The associated action was calculated as \(\mu ({\varvec{x}}_{*},\mathcal D)\). These attributions were used in the planning procedure.

If the cost is too large, the motions including such a link are expected to be selected rarely, and the search on the roadmap may become heavy as a result. To keep the size of the roadmap smaller, links with cost greater than the threshold value \(T_{c}\) were not generated (Fig. 5a).

Pseudocode of confidence-based roadmap A pseudocode for the roadmap construction is shown in Algorithm 1. In the first line, the graph \(G(\forall {\varvec{s}} \in S,E_{\mathrm{init}})\) is initialized, where \(\forall {\varvec{s}} \in S\) are connected to the successive node. The connection matrix E indicates the connectivities between nodes by storing the cost of each link to the corresponding element, while \({\hat{A}}\) is the matrix recording the corresponding action for each link. Initial costs are stored to \(E_{\mathrm{init}}\).

3.4 Motion planning

The sensory input from the target was measured, and the start milestone \({\varvec{s}}_{\mathrm{start}}\) was selected by extracting the nearest state included in \(\mathcal R\). The goal milestone \({\varvec{s}}_{\mathrm{goal}}\) consistent with the objective of the task was selected. When a motion sequence \({\mathcal M}^{p}\) from \({\varvec{s}}_{\mathrm{start}}\) to \({\varvec{s}}_{\mathrm{goal}}\) was given, the cost of the sequence \({\mathcal M}^{p}\) is calculated as
$$\begin{aligned} C(\mathcal M^{p}) = \sum _{\tau =1}^{|\mathcal M^{p}|}c({\varvec{s}}^{p}(\tau )\rightarrow {\varvec{s}}^{p}(\tau +1)), \end{aligned}$$
where \(|\mathcal M^{p}|\) denotes the length of the sequence. The purpose was to derive the sequence \(M^{p*}\) that minimize the total cost \(C(\mathcal M^{p*})\).
To obtain this optimal path, Dijkstra’s algorithm with a start state \({\varvec{s}}_{\mathrm{start}}\) was used. The derived motion sequence was described by
$$\begin{aligned} {\mathcal M}^{p*} = \{{\varvec{s}}^{p}(1), {\varvec{a}}^{p}(1),\ldots , {\varvec{s}}^{p}(\tau )\}. \end{aligned}$$
When the planned motion is reproduced, an estimated action \({\varvec{a}}^{p}(\tau )\) is output in an order that does not consider the observed state of the robot.

In the experiments (Sect. 5.2), the estimation of the forward dynamics was also used for the cost function. The planned state transition and the estimated state transition induced by the planned action were not necessarily identical because of the uncertainties in the estimation. Since the error between them might indicate another type of uncertainty in the state transition, we used this error value for the cost function. We can also add constant value to the cost of each link. This additional cost cannot be an exact external cost that depends on the state transition, but can be an approximation of the external cost. For details, see Sect. 5.1.

3.5 Design of the cost functions

In this section, we explain an intuitive interpretation of our cost function. We employ 2 types of cost for each link. One corresponds to the uncertainty in the estimation of the inverse dynamics, while the other to the forward dynamics.

Let the forward dynamics model of the target system and the action selection model (like a stochastic policy in reinforcement learning (Sutton et al. 2000)) be given as probabilistic density functions as \(p({\varvec{s}}'|{\varvec{s}},{\varvec{a}})\) and \(p({\varvec{a}}|{\varvec{s}})\) respectively. The occurrence probability of a motion \(\mathcal M =\{{\varvec{s}}(1), {\varvec{a}}(1), \cdots , {\varvec{a}}(T-1), {\varvec{s}}(T)\} \) can be calculated as:
$$\begin{aligned} p({\mathcal M}) = \prod _{t=1}^{T-1} p({\varvec{a}}(t)|{\varvec{s}}(t))p({\varvec{s}}(t+1)|{\varvec{s}}(t), {\varvec{a}}(t)). \end{aligned}$$
By taking the logarithm of both sides,
$$\begin{aligned} \log (p({\mathcal M}))= & {} \sum _{t=1}^{T-1}\log (p({\varvec{a}}(t)| {\varvec{s}}(t))) \nonumber \\&+\sum _{t=1}^{T-1}\log (p({\varvec{s}}(t+1)|{\varvec{s}}(t), {\varvec{a}}(t))), \end{aligned}$$
the equation becomes an additive form.

The former term corresponds to the uncertainty in the action selection. Suppose that the state transition is deterministic and the action inducing any transition is unique. If the inverse dynamics is known, the action can be determined without any uncertainty, and the cost in this case should be small. In our problem setting, however, the inverse dynamics is not given and the corresponding action is estimated only from the obtained data set. Thus, the uncertainty is inevitable.

The input to the inverse dynamics is the state transition \({\varvec{s}} \rightarrow {\varvec{s}}'\) and the output is the action to induce the transition. This function is estimated by the Gaussian process regression in this research. Since GPR is one of Bayesian methods, it calculates the probability density function (PDF) of the output variable in the sense of Bayesian probability. By using the data set \(\mathcal D\), the PDF becomes a Gaussian function as \(p({\varvec{a}}|{\varvec{s}}', {\varvec{s}}; {\mathcal D})\) \(={\mathcal N}({\varvec{a}}|\bar{\varvec{a}}, \sigma ^2)\), where \(\bar{\varvec{a}}\) and \(\sigma \) are the mean and the width (confidence interval) parameters respectively, and are determined by the input variables. The value of the PDF denotes the preference over the action space such that the preference of the action \({\varvec{a}}_*\) is calculated as \({\mathcal N}({\varvec{a}}={\varvec{a}}_*| \bar{\varvec{a}}, \sigma ^2)\).

The motion planner in this research selects the maximum a posteriori (MAP) estimate: \(\bar{\varvec{a}}={\arg \max }_{\varvec{a}} {\mathcal N}({\varvec{a}}|\bar{\varvec{a}}, \sigma ^2)\). The preference of the MAP estimate is proportional to \(1/\sigma \) which is the reciprocal of the confidence interval. Therefore, the confidence interval is used as a cost for each link. Note that, the cost becomes close to 0 when the estimated action to induce the given state transition is exactly obtained in the sense of the subjective Bayesian probability.

The second term corresponds to the uncertainty in the state transition. When the action sequence is given, the occurrence probability of a motion by the open loop control can be calculated based on the forward dynamics model. Since the dynamics model is also unavailable in our problem setting, it must be estimated from the obtained data set. The state transition probability \(p({\varvec{s}}'|{\varvec{a}}, {\varvec{s}})\) is calculated by using a Gaussian process regression.

Strictly speaking, the total cost might not correspond to the occurrence probability of the motion \(\mathcal M\). We also adopt heuristic approaches such as employing a monotonically increasing function. However, the total cost would be small when the estimation becomes accurate and precise. That is, it is expected that a feasible motion can be chosen by finding the best path in the roadmap.

4 Fast calculation method for GPR

The action consists of the control signal for each actuator, i.e., \({\varvec{a}} = [a_{1},\ldots ,a_{d}]^{\top }\). For simplicity, in this research, each element of the action vector was calculated independently by a single output GPR. As a result, the probability of the action was calculated as
$$\begin{aligned} p(\varvec{a})=\prod _{d}p(a_{d}). \end{aligned}$$
When the query input for GPR \({\varvec{x}}_{*}\) and the dataset \(\mathcal D\) were given, the output of the d th element of action \(\varvec{a}\) was calculated as
$$\begin{aligned} P(a_{d}|{\varvec{x}}_{*}, \mathcal D)= & {} \mathcal {N}(\mu _{d}({\varvec{x}}_{*}, \mathcal D),\sigma ^{2} ({\varvec{x}}_{*}, \mathcal D)) \nonumber \\= & {} \frac{1}{Z} \exp \left( -\frac{1}{2}\frac{({\varvec{a}}-\mu _{d}({\varvec{x}}_{*}, \mathcal D))^2}{\sigma ^{2}( {\varvec{x}}_{*}, \mathcal D}) \right) , \nonumber \\ \mu _{d}({\varvec{x}}_{*}, \mathcal D)= & {} {\varvec{k}}^{\top }{\varvec{C}}^{-1}\mathcal D_{a_{d}},\nonumber \\ \sigma ^{2}({\varvec{x}}_{*}, \mathcal D)= & {} K({\varvec{x}}_{*},{\varvec{x}}_{*}) - {\varvec{k}}^{\top }{\varvec{C}}^{-1}{\varvec{k}}, \end{aligned}$$
where \({\mathcal D}_{a_{d}}\) is the vector of the d th element of action included in the dataset. Here, \({\varvec{C}}\) is a covariance matrix whose elements are \(C_{lm}=K({\varvec{x}}_{l},{\varvec{x}}_{m})+\delta _{lm}\sigma _{n}^{2}\). \(K(\cdot ,\cdot )\), \(\delta _{lm}\), and \(\sigma _{n}^{2}\) denote a kernel function, a delta function, and a noise term, respectively. \({\varvec{k}}\) is defined as \({\varvec{k}}=[(K({\varvec{x}}_{1},{\varvec{x}}_{*}),\ldots ,K({\varvec{x}}_{N},{\varvec{x}}_{*}))]^{\top }\). Z denotes the normalized term.

In order to calculate Eq. (9), it is necessary to calculate the matrix inverse \((\propto O(N^{3}))\) for learning and the matrix multiplication \((\propto O(N^{2}))\) for each estimation. The computational time increases with the sample size, while a large volume of data is necessary to make the approximation accurate.

To overcome this problem, in a previous study, we proposed a fast approximation method for GPR (Okadome et al. 2013) that combined a dataset division technique, LSH (Indyk and Motwani 1998), with an integration technique for multiple probability distributions, PoEs (Hinton 2002). We named this method LSH-GPR. In LSH, a B-bits binary hash-key composed of B binary hash functions is used to divide the dataset into \(2^{B}\) subsets. Since the calculation of the hash-key value does not depend on the sample size and the calculation of the “decomposed” GPR is performed using only the corresponding subset, the computational cost is reduced dramatically. Although the performance of each decomposed GPR degrades significantly at the borders, this can be ameliorated by integrating L decomposed GPR using PoEs.

In an ideal case, when the dataset is divided into equal size subsets, the calculation cost of this method is \(O(L2^{B}(N/2^{B})^{3})\) for the matrix inversion and \(O(L(N/2^{B})^{2})\) for the computation of the confidence interval. For example, when \(B=3\) and \(L=2\), the calculation costs for both matrix inversion and confidence interval are 1/32nd of those of a “naïve” GPR. As B increases, processing becomes faster, while the accuracy is improved by increasing the size of L. The amount of memory is also small, (\(O(L2^{B}(N/2^{B})^{2})\)), compared with \(O(N^{2})\) for the naïve GPR.

5 Experiments

We conducted three experiments to investigate the properties of our motion planning method. The first experiment used a simple pendulum which is a popular benchmark task with small state space and nonlinear dynamics. We conducted a swinging up task and investigated the sensitivity to cost by adding an additional cost to each state transition. The second experiment used a complex underactuated robot with many DoFs, and reaching motions with different cost functions were compared. We investigated whether a highly nonlinear robot can achieve satisfactory motion using our method. The last experiment investigated whether our method is applicable to the control of a robotic arm when only partial state information with sensory noise is available, due to the limitations of the sensing system. This is inevitable for a robotic system in a real environment.
Fig. 6

A block diagram of control system

The trajectory in operation may diverge from the planned motion due to model uncertainty and stochastic dynamics. Therefore, we employed a feedback controller to compensate for the error between the planned state and the observed state. This value was also calculated using a GPR, and the output of the feedback controller was calculated using the inverse dynamics (1) as \(p({\varvec{a}}(t)|{\varvec{s}}(t), {\varvec{s}}^{p}(t+1)-{\varvec{s}}(t))\) where \({\varvec{s}}^{p}(t+1)\) is the planned state at the next time step. The output of this controller was calculated as
$$\begin{aligned} {\varvec{a}}_{FB}(t)= & {} \mu ({\varvec{a}}| {\varvec{x}}_{*}(t),\mathcal D), \nonumber \\ {\varvec{x}}_{*}(t)= & {} [{\varvec{s}}(t), {\varvec{s}}^{p}(t+1)-{\varvec{s}}(t)]^{\top }. \end{aligned}$$
The planned action \({\varvec{a}}^{p}(t)\) and the feedback action \({\varvec{a}}_{FB}(t)\) were fused as
$$\begin{aligned} {\varvec{a}}(t)=\alpha {\varvec{a}}^{p}(t)+(1-\alpha ) {\varvec{a}}_{FB}(t), \end{aligned}$$
and \({\varvec{a}}(t)\) was input to the robot. Here, \(\alpha \in (0,1]\) is the mixing rate, which was hand tuned. A block diagram of the control system is shown in Fig. 6.

5.1 Simple pendulum

The goal of this task was to make the pendulum swing up and keep it in the upright position. The state of the pendulum was represented by its angle \(\theta \) and its angular velocity \({\dot{\theta }}\) (Fig. 7a). Torque a was applied at the rotation center. The dynamics of the pendulum were calculated as follows:
$$\begin{aligned} {\ddot{\theta }} = -\frac{mg\sin \theta }{ml}-\frac{\mu {\dot{\theta }}}{m}+\frac{a}{m}, \end{aligned}$$
where l, m, g, and \(\mu \) are the pendulum’s length (1.0 m), weight (1.0 kg), gravitational acceleration (9.8 m/s\(^{2}\)), and coefficient of friction (0.10), respectively. No prior knowledge was assumed on the part of the robot, and the mathematical model was constructed only from observed data.
Fig. 7

Structure of the simple pendulum and the collected motion data a Simple pendulum b Dataset

Fig. 8

Planned motion by using the confidence-based roadmap. Root mean squared errors between planned and reproduced motions (joint angle) were a 0.211 and b 0.247 a The motion generated by using CRM\(_{C}\) b The motion generated by using CRM\(_{T}\)

Experimental setting In this experiment, the input torque was a continuous value (\([ -5, 5 ]\) Nm) and the control started at the state \({\varvec{s}}=(\theta ,{\dot{\theta }})\) where the pendulum was hanging down and at rest (\(\theta =\pi , {\dot{\theta }}=0\)). The control cycle was set to 100 ms.

The number of candidates for checking the reachability \(|\mathcal K|\) was 5, a value that was determined empirically. The bit-length of hash-key B and the number of integrated GPR L were set to 3 and 2, respectively.

Data gathering The state of the pendulum was reset randomly \((-0.5\le \theta \le 0.5,-0.5 \le {\dot{\theta }} \le 0.5)\) every 30 s. We constructed the dataset by repeating this process for 200 s, giving 2000 data points. Figure 7b shows the distribution of the samples stored in the dataset.

Cost function We compared two motion planners, CRM\(_{C}\) and CRM\(_{T}\), with different cost functions. The cost functions of CRM\(_{C}\) and CRM\(_{T}\) were defined as
$$\begin{aligned} -c({\varvec{s}}_{\mathrm{so}}\rightarrow {\varvec{s}}_{\mathrm{de}})= & {} \sigma ^{2}({\varvec{x}},\mathcal D),\\ -c({\varvec{s}}_{\mathrm{so}}\rightarrow {\varvec{s}}_{\mathrm{de}})= & {} \beta \sigma ^{2}({\varvec{x}},\mathcal D)+t_{c}, \end{aligned}$$
where \(\beta \) and \(t_{c}\) were positive constant values set to 25 and 0.1. The CRM\(_{C}\) tried to identify the most confident path, while the CRM\(_{T}\) tried to find a shorter path than CRM\(_{C}\), to reduce the additional cost to penalize each control step.

The red points \((\cdot )\) and blue points \((\star )\) in Fig. 8 show the planned motion and the realized motion, respectively. Both CRM\(_{C}\) and CRM\(_{T}\) could generate the desired motion. The path lengths of CRM\(_{C}\) and CRM\(_{T}\) were 110 and 78, respectively. The shorter path was generated by adding the additional term.

However, CRM\(_{T}\) produced larger errors than CRM\(_{C}\) as the robot approached the goal state (\(\theta =0, {\dot{\theta }}=0\)). This suggested that actions with low confidence levels were selected by CRM\(_{T}\) to reduce cost.

The computational time for roadmap construction was about 73 s, and the computational time for Dijkstra’s algorithm was 0.15 s. The computational time for roadmap construction with a “naïve” GPR was 1581 s at the sample size of 2000. By using LSH-GPR, the computational time for roadmap construction was dramatically reduced.

When the feedback controller was employed, it was necessary to calculate the inverse dynamics to correct the divergence from the planned path. The computational time for each feedback signal was 156 ms for the naïve GPR, compared with 6.60 ms for the LSH-GPR. This result suggested that LSH-GPR can achieve compensation control with a higher control cycle.

5.2 Underactuated redundant robot

The underactuated robot used in this experiment (Fig. 9) had redundant joints, and only the joint closest to the body (the green joint in Fig. 9) was controllable. The body was fixed. Each link was connected by a hinge joint, and the movement range of each hinge was \(-\pi /2 \le \theta \le \pi /2\). The total weight of each joint and link was 0.25 kg, giving a total weight of 1 kg. The goal of the task was to make the end-effector of the underactuated robot reach the target position.

Experimental setting The state of the robot was represented by the angle and angular velocity of each joint \(({\varvec{\theta }}, {\dot{\varvec{\theta }}})\) and by two dimensional positions of the end-effector (s1, s2). The input torque range for the controllable joint was \(-5 \le \tau \le 5\), and weak gravity (−1 \(m/s^{2}\)) was added. The control cycle was set to 100 ms.

The number of candidates to check the reachability \(|\mathcal K|\) was 5, a value that was determined empirically. The bit-length of hash-key B and the number of integrated GPR L were set to 5 and 4, respectively.

Data gathering Data were gathered by monitoring a range of motions. We employed a rhythmic control signal similar to a CPG-based controller for a bio-inspired robot (Ijspeert 2008). It was expected that the motion resonance of the robot body would be induced by a rhythmic signal with a certain frequency. The control signal was designed such that the frequency of the rhythmic control signal changed over time. The input torque was determined as \(a(t) = 5\sin (2 \pi \omega (t)t)\) where \(\omega (t)\) is the angular velocity at time t.

For each trial, the robot was set to the initial posture shown in Fig. 9. This angular velocity was initialized to \(\omega (0)={\mathcal U}(0.01,0.02)\) where \({\mathcal U}\) is a random number generated from the uniform distribution in [0.01, 0.02]. The value of \(\omega (t)\) was decreased by multiplying by 0.9 every 500 control steps (50 s). The robot was controlled for 2500 control steps (250 s) and the state of the robot and input torque (action) were recorded. This trial was repeated 5 times, and a total of 12500 samples were obtained. Figure 9b shows the positions of the end-effector stored in the dataset.
Fig. 9

The structure of the underactuated robot and the collected motion data a Underactuated robot b Dataset

Fig. 10

Planned path and realized motion for an underactuated robot using the confidence-based roadmap. RMSEs of the motion a, b, and c are 0.366, 1.36, and 0.269, respectively a Result by the planner CRM\(_{c1}\) b Result by the planner CRM\(_{c2}\) c Result by the planner CRM\(_{c3}\)

Cost function In this experiment, we used three motion planners- CRM\(_{c1}\), CRM\(_{c2}\) and CRM\(_{c3}\)- with different cost functions. The cost functions were defined as
$$\begin{aligned} c_{1}({\varvec{s}}_{\mathrm{so}}\rightarrow {\varvec{s}}_{\mathrm{de}})= & {} \exp (\beta _{1} \sigma ^{2}({\varvec{x}},\mathcal D)), \end{aligned}$$
$$\begin{aligned} c_{2}({\varvec{s}}_{\mathrm{so}}\rightarrow {\varvec{s}}_{\mathrm{de}})= & {} \exp (\beta _{2} e({\varvec{x}}, \mathcal D)), \end{aligned}$$
$$\begin{aligned} c_{3}({\varvec{s}}_{\mathrm{so}}\rightarrow {\varvec{s}}_{\mathrm{de}})= & {} \exp (\beta _{3} \sigma ^{2}({\varvec{x}},\mathcal D)+\beta _{4} e({\varvec{x}}, \mathcal D)), \end{aligned}$$
where \(\beta _{1}\), \(\beta _{2}\), \(\beta _{3}\) and \(\beta _{4}\) denote the coefficients and were set empirically to 5000, 400, 5000 and 400, respectively. In \(c_{1}\), the cost was directly determined from the confidence of action estimation. \(e({\varvec{x}}, \mathcal D)\) in \(c_{2}\) and \(c_{3}\) represented the error related to the forward dynamics \(p({\varvec{s}}'|{\varvec{s}},{\varvec{a}})\) and was calculated as follows: i) Initially, the control signal was calculated as \({\bar{a}} = \mu ({\varvec{s}}_{\mathrm{so}}, {\varvec{s}}_{\mathrm{de}}-{\varvec{s}}_{\mathrm{so}},\mathcal D)\) using Eq. (9). ii) Then, the expected destination state \({\bar{s}}'\) transited from \({\varvec{s}}_{\mathrm{so}}\) with \({\bar{\varvec{a}}}\) was calculated by \({\varvec{\bar{s}}}' = E[{\varvec{s}}'p({\varvec{s}}'|{\varvec{s}}_{\mathrm{so}},{\bar{\varvec{a}}})]\). \(p({\varvec{s}}'|{\varvec{s}}_{\mathrm{so}},{\bar{\varvec{a}}})\) was calculated by another GPR using the dataset \(\mathcal M\). iii) Finally, the error was calculated as \(e({\varvec{s}}_{\mathrm{so}},{\varvec{s}}_{\mathrm{de}};\mathcal D)= ({\varvec{s}}_{\mathrm{de}}-{\bar{\varvec{s}}}')^{2}\). \(c_{3}\) was the combination of cost functions \(c_{1}\) and \(c_{2}\). In each cost function, the exponential function was used to adjust the sensitivity.

Figure 10 shows the results of the reaching task. The red points \((\cdot )\) and blue points \((\star )\) in Fig. 10 show the position of the end-effector in the planned motion and the achieved motion, respectively. When CRM\(_{c1}\) or CRM\(_{c3}\) was used, the end-effector arrived at the goal position \((-0.25, 1.0)\) without diverging from the planned motion. In contrast, when CRM\(_{c2}\) was used, the motion of the end-effector is deviated from the planned route. The divergence between the planned and reproduced motion was evaluated by the root mean squared error (RMSE) between these two motions in state space2.

The computational time for creating the roadmap by CRM\(_{c1}\) was 452 s, while that for CRM\(_{c2}\) and CRM\(_{c3}\) was 954 s. That is because CRM\(_{c2}\) and CRM\(_{c3}\) must estimate the forward dynamics to calculate the cost function. The graph search by Dijkstra’s algorithm took 8.54 s for each motion planner.
Fig. 11

Selected data for the planned motion by CRM\(_{c1}\)

Fig. 12

The displacement of the end-effector during the reproduction by CRM\(_{c1}\)

Figure 11 shows indexes of the data used for the planned motion by CRM\(_{1}\) (Fig. 10a). It is shown that segmentalized motions are connected to generate the planned motion.

Figure 12 shows the divergence of the end-effector from planned motion to realized motion during operation. The solid line and dashed line show the error with and without the feedback controller, respectively. When the error was small, the feedback controller could compensate for it. It could not compensate for large errors, since the transition from the observed state \({\varvec{s}}(t)\) to the planned successive state \({\varvec{s}}^{p}(t+1)\) became impossible due to physical constraints. Note that such cases are improbable as the confidence interval would need to be large due to the absence of comparable samples in the dataset. It would be possible to improve the output of the feedback controller by employing an adequate mixing rate according to the confidence interval.
Fig. 13

A human like robotic arm

Fig. 14

Motion generation of an actual robotic arm a Planned and reproduced motion (\(p_{1}\)-\(p_{2}\) plane) b Planned and reproduced motion (\(p_{2}\)-\(p_{3}\) plane)

5.3 Control of a human-like robotic arm

We applied the proposed method to the motion of the robotic arm shown in Fig. 13 (Okadome et al. 2014). This robotic arm has a 3 DOF shoulder and a 1 DOF elbow. Three pairs of air-cylinders drive the shoulder and one drives the elbow. The working areas of both joints are similar to those of a human. Each air-cylinder is controlled by two computerized control valves, and each valve is controlled by the output of a D/A board installed on a PC. Since each actuator pair for the shoulder is controlled by two valves, the control signal for the robot has 8-dimensional continuous values.

Experimental setting The state space was defined by the position and velocity of the end-effector of the robotic arm (\({\varvec{s}}=(p_{1},p_{2},p_{3},{\dot{p}_{1}},{\dot{p}_{2}},{\dot{p}_{3}})\)) measured by a 3D-position sensor (Microsoft Kinect for Windows). The velocity was calculated from the difference between the current and previous positions. The purpose of the tracking task was to move the end-effector to a target position, following a planned path.

Data gathering During data gathering, several actuators were randomly selected and their input values were altered randomly every 200 ms. We constructed the dataset by repeating this process 3000 times.

The roadmap was constructed using LSH-GPR. The cost function for each link was \(c(\sigma ^{2}({\varvec{a}}))=\sigma ^{2}({\varvec{a}})\), and the number of candidates to check the reachability \(|\mathcal K|\) was 5. The bit-length of hash-key B and the number of integrated GPR L were set to 4 and 2, respectively. During the control phase, the target position for the feedback controller was set to the position of the next time step on the planned path.

Figure 14 shows the planned path and control results. The gray, red, and blue points \((\star )\) denote the dataset, planned path, and observed trajectory, respectively. Here, the points with \(p_{3}=0\) are cases where the depth value of the 3D sensor was not available. These points are used since failures might occur in similar situations, and such information can be utilized in a “sample-based” motion planning method.

The computational times for calculating the roadmap and Dijkstra’s algorithm are 307 and 0.3 s, respectively. These times could be improved by using a parallel computing technique since each decomposed GPR and output dimension can be performed independently.

Figure 14 indicates that the tracking task succeeded in following the planned path. Since only the confidence interval was used as the cost function, the planned path was not smooth. The development of a smoothing algorithm for the planned path will be undertaken in future work.

6 Conclusion

In this research, we proposed a motion planning method using GPR. We applied our motion planning method to three control tasks. The results suggest that a motion planning method taking into account of uncertainty and using a confidence interval based on Bayesian inference provides satisfactory robot motion. In the proposed method, the recorded motion is segmentalized and the fragmented movements are then connected based on the model uncertainty. The planned motion is simply the accumulation of experience.

The size of the data set for the simple pendulum control was 2000. Since the swing-up task of a pendulum is a highly nonlinear problem, it is commonly used in RL literatures (Grondman et al. 2012; Doya 2000) The number of state action pairs used to construct the controller by our methods is almost similar to RL methods (Grondman et al. 2012). Furthermore, the sample size of the data set for the control of the underactuated redundant robot, and the human-like robotic arm were 12500, and 3000, respectively. Although the dimensionality of the state and action spaces increases compared to the control of the simple pendulum, the experimental results show that a roadmap for the control can be constructed in a practical time.

In our current method, the evaluation of “one-step” inverse dynamics was used. It seems better that the reachability of long-distance milestone pairs was investigated, since such pairs can be connected using intermediate virtual milestones and can improve performance. The connectivity of a long-distance milestone pair can be calculated using a GP-Bayes filtering method (Ko and Fox 2009). In such a case, the value of milestone pairs to be assessed is also important since useless links enlarge the roadmap size without improving performance.

The development of a framework that can build the roadmap that covers a sufficient state space during the operation phase in an online manner is important for realizing large motion ranges and various functions. In such a case, autonomous exploration-exploitation balancing (Thrun 1992) is important. Since the confidence interval indicates novelty, planning by our method corresponds to exploitation and selecting an unknown action corresponds to the exploration. Operating the robot according to this criterion can be a naïve means of achieving balancing, and Bayesian model based reinforcement learning can provide a better solution for this dilemma (Asmuth and Littman 2011; Ross et al. 2011).

It may also be important to consider the measurement uncertainty (Bry and Roy 2011; Berg et al. 2011; Prentice and Roy 2009). For example, Prentice and Roy developed a path planning method that addressed uncertainty in the localization on the pathway, using an extended Kalman filter (Prentice and Roy 2009). Both the measurement and control uncertainty3 are difficulties caused by the perceptual aliasing inevitable in a real environment (Chrisman 1992; Spaan and Vlassis 2004). However, the human like robotic arm can still accomplish a given task in an unstructured environment with an insufficient sensor system. By considering model uncertainty, the planned motion toward sequences of sensory information that are reproduced with high probability. Thus, satisfactory motion can be obtained by our method.

There are two remaining issues. First, our current cost function lacks theoretical support. The accumulated “confidence interval” may indicate the uncertainty of the planned motion, but the planned action sequence does not necessarily generate the planned motion. Especially when the cost depends only on the inverse dynamics estimation, the uncertainty of the forward dynamics is not considered at all. Including the cost from the forward dynamics estimation can be a naïve solution to this problem, as in Eq. (15), but it does not guarantee successful control.

Second, an elaborated feedback control framework is necessary to compensate for error during motion reproduction. When a robot is controlled, the generated motion will diverge from the planned motion because of the stochastic property of the dynamics, model uncertainties, and obstacles in the nonstationary environment. In this research, we employed a simple feedback controller; however, recovery failed when the divergence of the trajectory from the planned path became large and difficult to compensate for in a single step. A possible improvement is to modify the mixing procedure (Eq. (11)), such as by using a PoEs framework (Hinton 2002), so that the output of the feedback controller is ignored when the confidence interval becomes large. The development of either a real-time and “short step” motion planning method to recover the trajectory, or an online re-planning method, is also crucial to return to the planned motion.

However, these approaches cannot be the final solution to this problem. The likelihood of the whole motion calculated from the forward dynamics model, e.g., GP-DM (Wang et al. 2008) or GP-LVM (Lawrence 2004), appears to be a better criterion for evaluating the feasibility of a planned motion. These methods mainly focus on an analysis of a single motion and it might not be feasible to apply one of them to construct a roadmap since massive repetitions are required to evaluate every possible motions.

There are several methods based on the optimization such as reinforcement learning or control theory (Todorov and Li 2005; Peters and Schaal 2008; Deisenroth and Rasmussen 2011; Theodorou et al. 2010; Marco et al. 2016; Mukadam et al. 2016). In the field of robotics research, path based methods as represented by iterative Linear Quadratic Gaussian (iLQG) are commonly used to cope with large state and action spaces with non-linear dynamics. In these methods, a given sequence of actions is updated to decrease the total ‘external’ cost and the motion is optimized by repeating the procedure. To calculate the update, some methods use a forward model (Todorov and Li 2005) and some other methods use rollouts (Theodorou et al. 2010). In contrast to sampling based methods which try to construct a graph to cover the state space, the given path is locally optimized around the path. Thus, a smooth and efficient motion can be generated by an optimization based method while a large state space could be explored by a sampling based method. Combination with such an optimization based method is our future work.


  1. 1.

    The accumulated cost becomes -1 for any control policy if \(\gamma =1\) under our assumption since the system reaches the goal with probability 1 after an infinite number of time steps. When the purpose is to obtain the optimal control policy that leads the system to reach the goal in the fewest time steps, the discount factor \(\gamma \) is usually set to less than 1.

  2. 2.

    This RMSE is calculated as \({\frac{1}{|\mathcal M^{p}|} \sum _{\tau =1}^{|\mathcal M^{p}|} \root \of {({\varvec{s}}^{p}(t)-{\varvec{s}}(t))^{\top }({\varvec{s}}^{p}(t)-{\varvec{s}}(t))}}\).

  3. 3.

    The control uncertainty is mentioned in (Berg et al. 2011), and it corresponds the uncertainty in the state transition model or inverse dynamics model.



This work was partly supported by Grant-in-Aid for Young Scientists 14444719 and for JSPS Fellows A15J01499.


  1. Asmuth, J., & Littman, M. L. (2011). Learning is planning: Near Bayes-optimal reinforcement learning via Monte-Carlo tree search. In Uncertainty in Artificial Intelligence.Google Scholar
  2. Berg, J. V. D., et al. (2011). LQG-MP: Optimized path planning for robots with motion uncertainty and imperfect state information. The International Journal of Robotics Research, 30(7), 895–913.CrossRefGoogle Scholar
  3. Bishop, C. M. (2006). Pattern recognition and machine learning (1st ed.). Springer. corr. 2nd printing edn.Google Scholar
  4. Bry, A., & Roy, N. (2011). Rapidly-exploring Random Belief Trees for motion planning under uncertainty. In 2011 IEEE international conference on robotics and automation (ICRA) (pp. 723–730).Google Scholar
  5. Choset, H . M. (2005). Principles of robot motion: Theory, algorithms, and implementations. Cambridge: MIT Press.MATHGoogle Scholar
  6. Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In AAAI (pp. 183–188).Google Scholar
  7. Dalibard, S., et al. (2013). Dynamic walking and whole-body motion planning for humanoid robots: An integrated approach. The International Journal of Robotics Research, 32(9–10), 1089–1103.CrossRefGoogle Scholar
  8. Dechter, R., & Pearl, J. (1985). Generalized best-first search strategies and the optimality of A*. Journal of the ACM (JACM), 32(3), 505–536.MathSciNetCrossRefMATHGoogle Scholar
  9. Deisenroth, M. P., & Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. In In Proceedings of the International Conference on Machine Learning.Google Scholar
  10. Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. Numerische mathematik, 1(1), 269–271.MathSciNetCrossRefMATHGoogle Scholar
  11. Doya, K. (2000). Reinforcement learning in continuous time and space. Neural Computation, 12(1), 219–245.CrossRefGoogle Scholar
  12. Foster, L., et al. (2009). Stable and efficient Gaussian Process calculations. Journal of Machine Learning Research, 10, 857–882.MathSciNetMATHGoogle Scholar
  13. Grondman, I., et al. (2012). Efficient model learning methods for actor critic control. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 42(3), 591–602.CrossRefGoogle Scholar
  14. Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural computation, 14, 1771–1800.Google Scholar
  15. Ijspeert, A. J. (2008). Central pattern generators for locomotion control in animals and robots: A review. Neural Networks 21(4), 642–653. Robotics and Neuroscience.Google Scholar
  16. Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on theory of computing, STOC ’98 (pp. 604–613). New York, ACM.Google Scholar
  17. Kavraki, L. E., et al. (1996). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Transactions on Robotics and Automation, 12(4), 566–580.CrossRefGoogle Scholar
  18. Ko, J., & Fox, D. (2009). GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Autonomous Robots, 27, 75–90.CrossRefGoogle Scholar
  19. Kuwata, Y., et al. (2009). Real-time motion planning with applications to autonomous urban driving. IEEE Transactions on Control Systems Technology, 17(5), 1105–1118.CrossRefGoogle Scholar
  20. LaValle, S . M. (2006). Planning algorithms. Cambridge: Cambridge University Press.CrossRefMATHGoogle Scholar
  21. LaValle, S. M., & Kuffner, J. J. (2001). Randomized kinodynamic planning. The International Journal of Robotics Research, 20(5), 378–400.CrossRefGoogle Scholar
  22. Lawrence, N. D. (2004). Gaussian process latent variable models for visualisation of high dimensional data. Advances in neural information systems, 16, 329–336.Google Scholar
  23. Marco, A., et al. (2016). Automatic LQR tuning based on Gaussian process global optimization. In 2016 IEEE international conference on robotics and automation (ICRA), (pp. 270–277).Google Scholar
  24. Mukadam, M., et al. (2016). Gaussian process motion planning. In 2016 IEEE international conference on robotics and automation (ICRA) (pp. 9–15).Google Scholar
  25. Okadome, Y., et al. (2013). Fast approximation method for Gaussian process regression using hash function for non-uniformly distributed data. In Artificial neural networks and machine learning (pp. 17–25).Google Scholar
  26. Okadome, Y., et al. (2014). Confidence-based roadmap using Gaussian process regression for a robot control. In 2014 IEEE/RSJ international conference on intelligent robots and systems (IROS 2014) (pp. 661–666).Google Scholar
  27. Peters, J., & Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4), 682–697.CrossRefGoogle Scholar
  28. Prentice, S., & Roy, N. (2009). The belief roadmap: Efficient planning in belief space by factoring the covariance. The International Journal of Robotics Research, 28(11–12), 1448–1465.CrossRefGoogle Scholar
  29. Rasmussen, C . E., & Williams, C . K . I. (2005). Gaussian processes for machine learning. Cambridge: the mit press.MATHGoogle Scholar
  30. Ross, S., et al. (2011). A Bayesian approach for learning and planning in partially observable Markov decision processes. The Journal of Machine Learning Research, 12, 1729–1770.MathSciNetMATHGoogle Scholar
  31. Spaan, M. T. J., & Vlassis, N. (2004). ‘A point-based POMDP algorithm for robot planning’. In IEEE international conference on robotics and automation (Vol. 3, pp. 2399–2404). IEEE.Google Scholar
  32. Sutton, R. S., et al. (2000). Policy gradient method for reinforcement learning with function approximation. In Advances in neural information processing systems (Vol. 12, pp. 1057–1063).Google Scholar
  33. Theodorou, E., et al. (2010). ‘Reinforcement learning of motor skills in high dimensions: A path integral approach’. In 2010 IEEE international conference on robotics and automation (ICRA), (pp. 2397 –2403).Google Scholar
  34. Thrun, S. B. (1992). Efficient exploration in reinforcement learning. Tech. rep., Pittsburgh.Google Scholar
  35. Todorov, E., & Li, W. (2005). A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic In Proceedings of the 2005, American control conference (Vol. 1, pp. 300–306).Google Scholar
  36. Wang, J. M., et al. (2008). Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2), 283–298.CrossRefGoogle Scholar

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Yuya Okadome
    • 1
  • Yutaka Nakamura
    • 2
  • Hiroshi Ishiguro
    • 2
  1. 1.Intelligent Information Research DepartmentKokubunji CityJapan
  2. 2.Department of System Innovation, Graduate School of Engineering ScienceOsaka UniversityToyonakaJapan

Personalised recommendations