# A confidence-based roadmap using Gaussian process regression

## Abstract

Recent advances in high performance computing have allowed sampling-based motion planning methods to be successfully applied to practical robot control problems. In such methods, a graph representing the local connectivity among states is constructed using a mathematical model of the controlled target. The motion is planned using this graph. However, it is difficult to obtain an appropriate mathematical model in advance when the behavior of the robot is affected by unanticipated factors. Therefore, it is crucial to be able to build a mathematical model from the motion data gathered by monitoring the robot in operation. However, when these data are sparse, uncertainty may be introduced into the model. To deal with this uncertainty, we propose a motion planning method using Gaussian process regression as a mathematical model. Experimental results show that satisfactory robot motion can be achieved using limited data.

## Keywords

Sampling-based motion planning · Gaussian process regression · Probabilistic roadmap

## 1 Introduction

For several decades, robots have been expected to work in real environments. Recent advances in high performance computing have allowed sampling-based motion planning methods to be successfully applied to practical robot control problems (Kavraki et al. 1996; LaValle 2006; LaValle and Kuffner 2001; Choset 2005), including autonomous self-driving cars that can negotiate city roads (Kuwata et al. 2009). Two representative sampling-based methods have been used: the rapidly-exploring random tree (RRT) (Dalibard et al. 2013; LaValle and Kuffner 2001) and the probabilistic roadmap (PRM) (Prentice and Roy 2009; Kavraki et al. 1996). In these approaches, an initial state space is constructed according to the problem setting, and a graph is derived in which each node corresponds to a point in the state space and each link represents the reachability between neighboring nodes. The commonly used methods for estimating reachability require a mathematical model of the controlled target. For example, a node pair is connected when there exists an appropriate action inducing a state transition from one node to the other. Once an adequate graph has been derived, a path on the graph satisfying the task objective is extracted by a search algorithm such as Dijkstra’s algorithm (Dijkstra 1959) or the A\(^{*}\) algorithm (Dechter and Pearl 1985).

Many factors affect the behavior of a robot. For example, unanticipated disturbances may be induced by contact with a physical object in an actual environment, or the robot itself may have a complex structure with many degrees of freedom (DoFs). One of the difficulties of motion planning under such constraints is the absence of an appropriate mathematical model. Since it is difficult to construct such a model in advance, it must be built from the motion data obtained by monitoring the robot while it is in operation. Even when a sophisticated system identification technique is used, however, insufficient data can introduce uncertainty into the model, making it difficult to estimate the model parameters (Bishop 2006). Since existing sampling-based motion planning methods rely on the mathematical model, inappropriate motions may be generated if this uncertainty is not taken into account.

In this paper, we propose a sampling-based motion planning method using Gaussian process regression (GPR) (Rasmussen and Williams 2005; Foster et al. 2009) as the mathematical model. GPR is a nonparametric Bayesian inference method that has been successfully applied to many practical tasks (Bishop 2006). GPR outputs both the most probable estimate (mean) and the confidence interval, where the confidence interval corresponds to the uncertainty in the model. In this research, the confidence interval is used to evaluate the reachability between nodes. That is, node pairs are connected only if the state transition from one to the other by a certain action is expected with “strong” confidence. Paths connected by confident links are assumed to be reliable.

Since a sampling-based motion planning method requires reachability to be calculated between massive numbers of node pairs, a computational model with a high calculation cost is impractical. While the calculation cost of a GPR increases with the sample size at \(O(N^3)\), its accuracy also depends on the sample size. To overcome this dilemma, we employed the fast approximation method for GPR (Okadome et al. 2013) proposed in our previous work. In this method, locality-sensitive hashing (LSH) (Indyk and Motwani 1998) is combined with a product of experts model (PoEs) (Hinton 2002) to balance speed and accuracy.

To evaluate our method, we applied it to three control tasks. The first application, using a simple robot, showed that a feasible motion can be obtained using our confidence-based motion planning method, and that sensitivity to risk can be dealt with by penalizing longer paths. The second application, using a robot with a more complex structure, showed that a highly nonlinear underactuated robot with many DoFs can be controlled using our method. Finally, the control of a robotic arm was achieved in a partially observable setting, as is often required in practice. These results suggest that planning can be reliably established using an uncertainty model, and that satisfactory motion can be obtained using limited data.

## 2 Sampling-based motion planning

At each time step *t*, the controller observes the current state \({\varvec{s}}(t)\) and then outputs an action \({\varvec{a}}(t)\) based on a control policy. The robot then changes its state to \({\varvec{s}}(t + 1)\) according to the state transition probability. The cost *c*(*t*) for the state transition \({\varvec{s}}(t)\rightarrow {\varvec{s}}(t+1)\) is determined according to its fitness. In a typical Markov decision process, the objective is to obtain the optimal stationary policy that minimizes the accumulated cost \(\sum _{t=0}^{\infty } \gamma ^{t} c(t)\), where \(\gamma \) is the discount factor.

In this research, we made the following assumptions. The target system is ergodic for any stationary stochastic policy \(p({\varvec{a}}|{\varvec{s}})\), and has a stationary distribution \(p_{\mathrm{sd}}({\varvec{s}})\) under the policy. The goal is defined by an area \({\mathcal G}\) in which the probability of staying is nonzero, i.e., \(\int _{\mathcal G} p_{\mathrm{sd}}({\varvec{s}})d{\varvec{s}}\ne 0\). The cost is −1 if the system reaches the goal area and 0 otherwise. The purpose of our research was not to find the optimal control policy using an exact system model given in advance, but rather to find an action sequence reaching the goal on the basis of previous experiences without such a model. From this point of view, the discount factor \(\gamma \) was set to 1; hence, the “optimality” of the control with respect to this “external” cost was not considered^{1}.

### 2.1 Probabilistic roadmap

A motion is defined as a sequence of postures and their time derivatives, as shown in Fig. 2. To reproduce a motion, appropriate control signals to induce the necessary transitions must be input to the robot in the correct order. The purpose of motion planning is to find an executable path (motion) from the given start state to the goal state and to generate the required action sequence.

In sampling-based motion planning, a graph that covers the whole state space is constructed, as shown in Fig. 3. In a PRM (Kavraki et al. 1996) algorithm, several states called *milestone*s are randomly sampled from the whole state space, and each milestone pair is connected by a directed link when the robot at the *source milestone* changes its state to the *destination milestone* by a certain action; the milestone pair is then *reachable*. To obtain an arbitrary path, an infinite number of milestones is necessary to fill out the whole state space when the state space is continuous. However, they can be approximated by a finite number of randomly sampled milestones. Milestone pairs are not connected if they are not reachable, because of physical constraints. This may be because the robot is blocked from moving from one to the other by an obstacle between them, or because the thrust force generated by the robot’s actuator is insufficient to move its arm. Feasible motion can be obtained by extracting a sequence of milestones from the start to the goal.

A typical way of creating a roadmap is as follows: i) Sample several points from the whole state space. ii) Evaluate the connectivity between milestone pairs using a “local” controller, e.g., by calculating the inverse dynamics. iii) Connect reachable milestone pairs by links. The problem with this method is that the sampled states are sparsely distributed in the state space if the system has a large number of DoFs, and the connectivity among the milestone pairs cannot be evaluated by the local controller if the system model is unknown.
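The three steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `is_reachable` is a hypothetical stand-in for the local controller check (e.g., an inverse-dynamics feasibility test), and the unit-cube state space and distance-based toy check are assumptions.

```python
import numpy as np

# Minimal PRM construction sketch (steps i-iii). `is_reachable` is a
# hypothetical local-controller check standing in for, e.g., inverse dynamics.
def build_prm(n_milestones, state_dim, is_reachable, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    # i) Sample milestones uniformly from the (unit-cube) state space.
    milestones = rng.uniform(0.0, 1.0, size=(n_milestones, state_dim))
    # ii)-iii) Evaluate connectivity for each ordered pair; keep reachable links.
    links = [(i, j) for i in range(n_milestones)
                    for j in range(n_milestones)
                    if i != j and is_reachable(milestones[i], milestones[j])]
    return milestones, links

# Toy local controller: a pair is "reachable" when the states are close.
close = lambda a, b: np.linalg.norm(a - b) < 0.3
ms, links = build_prm(50, 2, close)
```

With sparse data or an unknown system model, the `is_reachable` test is exactly the part that breaks down, which motivates the confidence-based evaluation in Sect. 3.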

### 2.2 Roadmap construction without system model

These problems can be solved using the exact inverse dynamics model; however, in our case the inverse dynamics had to be estimated only from the previously observed data. In our planning procedure, it was assumed that (i) the state transition was deterministic, i.e., the same behavior was always generated by the same action sequence; (ii) the same transition was induced by a unique action, i.e., different actions induced different motions; and (iii) the observation included sufficient information to avoid perceptual aliasing (Chrisman 1992; Spaan and Vlassis 2004), and the state \({\varvec{s}}\) consisted only of observed sensor values. Furthermore, we assumed that some observed data were available before planning. Note that the planned action sequence is not guaranteed to reproduce the planned motion, although the reproduced and planned motions are expected to be similar if the estimation is sufficiently accurate.

To find a feasible motion that reached the goal, we introduced “internal” costs for each state transition; these costs indicated the uncertainty of the estimation, which can be calculated with Bayesian methods. For example, GPR outputs both the mean value and the confidence interval, the latter of which indicates the uncertainty. In this research, the internal costs were calculated from this uncertainty.

## 3 Roadmap construction using GPR

In this research, the system model was estimated using a Bayesian nonparametric model, specifically a GPR, since the inverse dynamics were not given. If insufficient data were available to enable precise estimation of the inverse dynamics for a tentative state transition \(({\varvec{s}} \rightarrow {\varvec{s}}')\), the confidence interval of the GPR estimation would become large. In such a case, the action could not be determined precisely and the state transition induced by the estimated action would be uncertain. The planned motion should exclude the links corresponding to such state transitions. To penalize these links, we assigned an “internal” cost based on the confidence interval. By constructing a roadmap based on this cost and finding the best (minimum-cost) path, the least uncertain path was obtained. That is, we expect that feasible motions can be planned by our method.

The proposed method consists of the following four steps:

1. Gathering motion data
2. Constructing the initial roadmap
3. Improving the roadmap by connecting node pairs
4. Motion planning using Dijkstra’s algorithm (Dijkstra 1959)
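The final step searches the roadmap for the minimum-cost (most confident) milestone sequence. A minimal Dijkstra sketch over a cost-weighted adjacency list (the graph representation here is illustrative, not the paper's data structure):

```python
import heapq

# Minimal Dijkstra sketch: find the least-cost milestone sequence on the
# roadmap. `graph` maps node -> list of (neighbor, link cost).
def dijkstra(graph, start, goal):
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, c in graph.get(u, []):
            nd = d + c
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return None, float("inf")
    # Reconstruct the path by walking the predecessor map backwards.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[goal]

graph = {0: [(1, 1.0), (2, 4.0)], 1: [(2, 1.5)], 2: []}
path, cost = dijkstra(graph, 0, 2)  # path [0, 1, 2], cost 2.5
```

Because link costs encode estimation uncertainty, the cheapest path is the one the model is most confident it can reproduce.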

### 3.1 Data gathering

Figure 4a shows a schematic example of motion data gathering. The robot was controlled by a randomly generated control signal, and the sensor values and control signals were recorded. We obtained the time series of states and actions \(\mathcal M = \left\{ {\varvec{s}}(1),{\varvec{a}}(1), {\varvec{s}}(2), \ldots \right\} \). These data were used for both creating the initial roadmap and calculating the inverse dynamics of the target system.

### 3.2 Constructing the initial roadmap

An initial roadmap was constructed from the time series data. Instead of milestones randomly sampled from the whole state space, we used all the states in the data \(\mathcal M\) as the milestones. The *t*th state \({\varvec{s}}(t)\) was mapped onto the state space as the *t*th milestone \({\varvec{s}}_{t}\) and connected from the \((t-1)\)th milestone \({\varvec{s}}_{t-1}\). Since the state transition from the *t*th milestone to the \((t+1)\)th one could be induced by the action \({\varvec{a}}(t)\), this milestone pair was assumed to be reachable. As a result, a roadmap consisting of a chain of milestones was generated, as in Fig. 4b.
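A sketch of this chain construction: every recorded state becomes a milestone, and consecutive milestones are linked by the recorded action that induced the transition (the dictionary representation is an illustrative assumption).

```python
# Sketch of Sect. 3.2: each recorded state becomes a milestone, and
# consecutive states are linked by the recorded action a(t) that induced
# s(t) -> s(t+1), yielding a chain-shaped initial roadmap.
def initial_roadmap(states, actions):
    nodes = list(range(len(states)))
    # Link (t, t+1) stores the associated action for later reproduction.
    links = {(t, t + 1): actions[t] for t in range(len(states) - 1)}
    return nodes, links

states = [(0.0, 0.0), (0.1, 0.2), (0.3, 0.1)]
actions = [0.5, -0.2]
nodes, links = initial_roadmap(states, actions)
```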

We then grew the roadmap by adding links between these milestones. Our method attempts to add links between nearby milestone pairs by using crossing points as a *shortcut*. The Bayesian inference framework allows the reliability of the estimation to be calculated, as described below. Our method tries to reproduce the recorded motion while avoiding *roundabout* or low reproducibility motions. To increase the moving area, it is necessary to collect more data during operation. We will return to this in a future study.

Note that the reachability between two states distant from those in the dataset \(\mathcal M\) becomes unreliable when using a mathematical model trained on \(\mathcal M\) without *a priori* knowledge of the robot. When using randomly sampled milestones, as in existing methods, most of the milestones might be isolated, and thus ignorable for path planning, owing to the sparsity of the data.

### 3.3 Connecting node pairs

GPR also gives the variance of the estimation, since it is a Bayesian inference method (Bishop 2006; Rasmussen and Williams 2005). When the dataset \(\mathcal D\) contains many samples close to the input query \(\varvec{x}\), the variance becomes small, and *vice versa*. This value indicates the reliability of the estimation and is called the *confidence interval*. In this research, the confidence interval was used to evaluate the reachability of each milestone pair, and the cost of each link was determined from this value.

*Evaluation of each link* Each link had two attributes: a cost and an associated action. The cost of a milestone pair (\({\varvec{s}}_{\mathrm{so}} \rightarrow {\varvec{s}}_{\mathrm{de}}\)) was calculated as shown below. The input of the inverse dynamics \({\varvec{x}}_{*}\) became \( [{\varvec{s}}_{\mathrm{so}}, {\varvec{s}}_{\mathrm{de}}-{\varvec{s}}_{\mathrm{so}}]\), and the confidence interval was calculated by \(\sigma ^{2}({\varvec{x}}_{*},\mathcal D)\). The cost of link \(c({\varvec{s}}_{\mathrm{so}}\rightarrow {\varvec{s}}_{\mathrm{de}})\), for example, was defined by a monotonically increasing function of \(\sigma ^{2}({\varvec{x}}_{*},\mathcal D)\) such that \(c({\varvec{s}}_{\mathrm{so}}\rightarrow {\varvec{s}}_{\mathrm{de}})=\exp (\sigma ^{2}({\varvec{x}}_{*},\mathcal D))\). The associated action was calculated as \(\mu ({\varvec{x}}_{*},\mathcal D)\). These attributes were used in the planning procedure.
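The link evaluation can be sketched as follows; `mu` and `sigma2` stand in for the GPR mean and confidence interval on the inverse-dynamics query \({\varvec{x}}_{*}=[{\varvec{s}}_{\mathrm{so}}, {\varvec{s}}_{\mathrm{de}}-{\varvec{s}}_{\mathrm{so}}]\), and the numeric values here are hypothetical.

```python
import math

# Sketch of Sect. 3.3 link evaluation. mu / sigma2 are hypothetical
# precomputed GPR outputs for the inverse-dynamics query of this pair.
def evaluate_link(s_so, s_de, mu, sigma2, T_c):
    cost = math.exp(sigma2)   # monotonically increasing in sigma^2
    if cost > T_c:            # prune low-confidence links (cf. Fig. 5a)
        return None
    return {"cost": cost, "action": mu}

link = evaluate_link((0.0,), (0.1,), mu=0.4, sigma2=0.05, T_c=2.0)
pruned = evaluate_link((0.0,), (0.9,), mu=0.1, sigma2=1.5, T_c=2.0)
```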

If the cost is too large, the motions including such a link are rarely expected to be selected, and such links merely make the search on the roadmap heavier. To keep the roadmap small, links with cost greater than the threshold value \(T_{c}\) were not generated (Fig. 5a).

*Pseudocode of confidence-based roadmap* Pseudocode for the roadmap construction is shown in Algorithm 1. In the first line, the graph \(G(\forall {\varvec{s}} \in S,E_{\mathrm{init}})\) is initialized, where each \({\varvec{s}} \in S\) is connected to its successive node. The connection matrix *E* indicates the connectivity between nodes by storing the cost of each link in the corresponding element, while \({\hat{A}}\) is the matrix recording the corresponding action for each link. The initial costs are stored in \(E_{\mathrm{init}}\).

### 3.4 Motion planning

In the experiments (Sect. 5.2), the estimation of the forward dynamics was also used for the cost function. The planned state transition and the estimated state transition induced by the planned action were not necessarily identical because of the uncertainties in the estimation. Since the error between them might indicate another type of uncertainty in the state transition, we used this error value for the cost function. We can also add a constant value to the cost of each link. This additional cost cannot be an exact external cost that depends on the state transition, but it can be an approximation of the external cost. For details, see Sect. 5.1.

### 3.5 Design of the cost functions

In this section, we give an intuitive interpretation of our cost function. We employ two types of cost for each link: one corresponds to the uncertainty in the estimation of the inverse dynamics, and the other to that of the forward dynamics.

The former term corresponds to the uncertainty in the action selection. Suppose that the state transition is deterministic and the action inducing any transition is unique. If the inverse dynamics is known, the action can be determined without any uncertainty, and the cost in this case should be small. In our problem setting, however, the inverse dynamics is not given and the corresponding action is estimated only from the obtained data set. Thus, the uncertainty is inevitable.

The input to the inverse dynamics is the state transition \({\varvec{s}} \rightarrow {\varvec{s}}'\) and the output is the action inducing the transition. This function is estimated by Gaussian process regression in this research. Since GPR is a Bayesian method, it calculates the probability density function (PDF) of the output variable in the sense of Bayesian probability. Given the dataset \(\mathcal D\), the PDF becomes a Gaussian function \(p({\varvec{a}}|{\varvec{s}}', {\varvec{s}}; {\mathcal D})\) \(={\mathcal N}({\varvec{a}}|\bar{\varvec{a}}, \sigma ^2)\), where \(\bar{\varvec{a}}\) and \(\sigma \) are the mean and the width (confidence interval) parameters, respectively, and are determined by the input variables. The value of the PDF denotes the preference over the action space, such that the preference of the action \({\varvec{a}}_*\) is calculated as \({\mathcal N}({\varvec{a}}={\varvec{a}}_*| \bar{\varvec{a}}, \sigma ^2)\).

The motion planner in this research selects the maximum a posteriori (MAP) estimate: \(\bar{\varvec{a}}={\arg \max }_{\varvec{a}} {\mathcal N}({\varvec{a}}|\bar{\varvec{a}}, \sigma ^2)\). The preference of the MAP estimate is proportional to \(1/\sigma \), the reciprocal of the confidence interval. Therefore, the confidence interval is used as a cost for each link. Note that the cost becomes close to 0 when the action inducing the given state transition is estimated exactly, in the sense of subjective Bayesian probability.
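The proportionality can be made explicit by evaluating the Gaussian density at its own mean, where the exponential term vanishes:

```latex
% Density of the action PDF evaluated at the MAP estimate \bar{a}:
\mathcal{N}(\bar{\boldsymbol{a}} \mid \bar{\boldsymbol{a}}, \sigma^2)
  = \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left(-\frac{(\bar{\boldsymbol{a}}-\bar{\boldsymbol{a}})^2}{2\sigma^2}\right)
  = \frac{1}{\sqrt{2\pi}\,\sigma}
  \;\propto\; \frac{1}{\sigma}.
```

Hence a wide confidence interval directly lowers the preference of even the best available action, which justifies using \(\sigma\) (or a monotone function of \(\sigma^2\)) as the link cost.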

The second term corresponds to the uncertainty in the state transition. When the action sequence is given, the occurrence probability of a motion by the open loop control can be calculated based on the forward dynamics model. Since the dynamics model is also unavailable in our problem setting, it must be estimated from the obtained data set. The state transition probability \(p({\varvec{s}}'|{\varvec{a}}, {\varvec{s}})\) is calculated by using a Gaussian process regression.

Strictly speaking, the total cost might not correspond to the occurrence probability of the motion \(\mathcal M\). We also adopt heuristic approaches such as employing a monotonically increasing function. However, the total cost would be small when the estimation becomes accurate and precise. That is, it is expected that a feasible motion can be chosen by finding the best path in the roadmap.

## 4 Fast calculation method for GPR

The mean estimate of the *d*th element of action \(\varvec{a}\) was calculated as \(\mu _{d}({\varvec{x}}_{*},\mathcal D)={\varvec{k}}^{\top }{\varvec{C}}^{-1}{\varvec{a}}_{d}\), where \({\varvec{a}}_{d}\) denotes the vector collecting the *d*th element of each action included in the dataset. Here, \({\varvec{C}}\) is a covariance matrix whose elements are \(C_{lm}=K({\varvec{x}}_{l},{\varvec{x}}_{m})+\delta _{lm}\sigma _{n}^{2}\). \(K(\cdot ,\cdot )\), \(\delta _{lm}\), and \(\sigma _{n}^{2}\) denote a kernel function, a delta function, and a noise term, respectively. \({\varvec{k}}\) is defined as \({\varvec{k}}=[K({\varvec{x}}_{1},{\varvec{x}}_{*}),\ldots ,K({\varvec{x}}_{N},{\varvec{x}}_{*})]^{\top }\), and the confidence interval was calculated as \(\sigma ^{2}({\varvec{x}}_{*},\mathcal D)=K({\varvec{x}}_{*},{\varvec{x}}_{*})+\sigma _{n}^{2}-{\varvec{k}}^{\top }{\varvec{C}}^{-1}{\varvec{k}}\).
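These formulas can be sketched from scratch with numpy. The RBF kernel and the hyperparameter values are illustrative assumptions, not the paper's settings; `np.linalg.solve` is used in place of an explicit matrix inverse.

```python
import numpy as np

# From-scratch GPR sketch: mean_d = k^T C^{-1} a_d and
# var = K(x*,x*) + sigma_n^2 - k^T C^{-1} k  (RBF kernel assumed).
def gpr_predict(X, A, x_star, sigma_n=0.1, ell=1.0):
    K = lambda a, b: np.exp(-np.sum((a - b) ** 2) / (2 * ell ** 2))
    N = len(X)
    C = np.array([[K(X[l], X[m]) for m in range(N)] for l in range(N)])
    C += sigma_n ** 2 * np.eye(N)   # C_lm = K(x_l, x_m) + delta_lm sigma_n^2
    k = np.array([K(X[l], x_star) for l in range(N)])
    alpha = np.linalg.solve(C, k)   # C^{-1} k without forming the inverse
    mean = A.T @ alpha              # one mean per action dimension d
    var = K(x_star, x_star) + sigma_n ** 2 - k @ alpha
    return mean, var

X = np.array([[0.0], [1.0], [2.0]])   # inputs (state-transition queries)
A = np.array([[0.0], [1.0], [0.5]])   # recorded actions (1-D here)
mean, var = gpr_predict(X, A, np.array([0.9]))
```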

In the product-of-experts integration, the combined density takes the form \(p({\varvec{a}}|{\varvec{x}}_{*})=\frac{1}{Z}\prod _{l=1}^{L}p_{l}({\varvec{a}}|{\varvec{x}}_{*})\), where *Z* denotes the normalization term.

In order to calculate Eq. (9), it is necessary to calculate the matrix inverse \((\propto O(N^{3}))\) for learning and the matrix multiplication \((\propto O(N^{2}))\) for each estimation. The computational time increases with the sample size, while a large volume of data is necessary to make the approximation accurate.

To overcome this problem, in a previous study, we proposed a fast approximation method for GPR (Okadome et al. 2013) that combined a dataset division technique, LSH (Indyk and Motwani 1998), with an integration technique for multiple probability distributions, PoEs (Hinton 2002). We named this method LSH-GPR. In LSH, a *B*-bit binary hash-key composed of *B* binary hash functions is used to divide the dataset into \(2^{B}\) subsets. Since the calculation of the hash-key value does not depend on the sample size and the calculation of each “decomposed” GPR is performed using only the corresponding subset, the computational cost is reduced dramatically. Although the performance of each decomposed GPR degrades significantly at the subset borders, this can be ameliorated by integrating *L* decomposed GPRs using PoEs.

In an ideal case, when the dataset is divided into equal-size subsets, the calculation cost of this method is \(O(L2^{B}(N/2^{B})^{3})\) for the matrix inversion and \(O(L(N/2^{B})^{2})\) for the computation of the confidence interval. For example, when \(B=3\) and \(L=2\), the calculation costs for both the matrix inversion and the confidence interval are 1/32nd of those of a “naïve” GPR. As *B* increases, processing becomes faster, while accuracy is improved by increasing *L*. The memory requirement, \(O(L2^{B}(N/2^{B})^{2})\), is also small compared with \(O(N^{2})\) for the naïve GPR.
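The two ingredients can be sketched independently. Random-hyperplane hashing is one standard choice of LSH bit function (an assumption here; the paper does not specify the hash family), and the PoE fusion of Gaussian experts is the standard precision-weighted product.

```python
import numpy as np

# Sketch of the LSH-GPR idea: B random-hyperplane bits split the dataset
# into 2^B buckets, and L experts' Gaussian predictions are fused by a
# product of experts (precisions add; means are precision-weighted).
def lsh_keys(X, planes):
    # One bit per hyperplane: the sign of the projection.
    bits = (X @ planes.T > 0).astype(int)
    return bits @ (1 << np.arange(planes.shape[0]))  # bucket in [0, 2^B)

def poe_combine(means, variances):
    prec = sum(1.0 / v for v in variances)
    mean = sum(m / v for m, v in zip(means, variances)) / prec
    return mean, 1.0 / prec

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
planes = rng.normal(size=(3, 2))       # B = 3 hyperplanes
keys = lsh_keys(X, planes)             # bucket index per sample
mean, var = poe_combine([0.2, 0.4], [0.1, 0.3])
```

Each bucket holds roughly \(N/2^{B}\) samples, so the cubic-cost inversion runs on much smaller matrices, matching the complexity argument above.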

## 5 Experiments

In the control phase, we also used a *feedback* controller to compensate for the error between the planned state and the observed state. The feedback signal was also calculated using a GPR: the output of the feedback controller was obtained from the inverse dynamics (1) as \(p({\varvec{a}}(t)|{\varvec{s}}(t), {\varvec{s}}^{p}(t+1)-{\varvec{s}}(t))\), where \({\varvec{s}}^{p}(t+1)\) is the planned state at the next time step. The output of this controller was calculated as the mean of this distribution.
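A minimal sketch of this feedback step: at each cycle the inverse dynamics is queried with the observed state and the divergence from the plan. `inverse_dynamics` is a hypothetical stand-in for the GPR mean \(\mu({\varvec{x}}_{*},\mathcal D)\); the toy linear system is an assumption for illustration.

```python
import numpy as np

# Feedback sketch: query the inverse dynamics with the observed state and
# the required transition toward the planned next state s^p(t+1).
def feedback_action(s_obs, s_planned_next, inverse_dynamics):
    x_star = np.concatenate([s_obs, s_planned_next - s_obs])
    return inverse_dynamics(x_star)   # MAP/mean estimate of the action

# Toy linear system s' = s + a, so the exact inverse dynamics is a = s' - s.
inv_dyn = lambda x: x[len(x) // 2:]
a = feedback_action(np.array([0.0, 0.0]), np.array([0.2, -0.1]), inv_dyn)
```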

### 5.1 Simple pendulum

The controlled target was a simple pendulum with a single joint; torque *a* was applied at the rotation center. The dynamics of the pendulum were calculated as \(ml^{2}\ddot{\theta }=-\mu {\dot{\theta }}+mgl\sin \theta +a\), where *l*, *m*, *g*, and \(\mu \) are the pendulum’s length (1.0 m), weight (1.0 kg), gravitational acceleration (9.8 m/s\(^{2}\)), and coefficient of friction (0.10), respectively. No prior knowledge was assumed on the part of the robot, and the mathematical model was constructed only from observed data.
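A simulation sketch of these dynamics under Euler integration (the sign convention, with \(\theta=0\) upright and \(\theta=\pi\) hanging down, follows the experimental setting; the substep count is a numerical-stability assumption):

```python
import numpy as np

# Euler-integration sketch of the pendulum dynamics assumed above:
#   m l^2 theta_ddot = -mu theta_dot + m g l sin(theta) + a
# The 100 ms control cycle is subdivided for numerical stability.
def pendulum_step(theta, theta_dot, a, dt=0.1, substeps=100,
                  l=1.0, m=1.0, g=9.8, mu=0.10):
    h = dt / substeps
    for _ in range(substeps):
        theta_ddot = (-mu * theta_dot + m * g * l * np.sin(theta) + a) / (m * l ** 2)
        theta_dot += h * theta_ddot
        theta += h * theta_dot
    return theta, theta_dot

# Hanging down at rest (theta = pi) is an equilibrium: zero torque leaves
# the state unchanged, while a nonzero torque starts the swing.
th0, td0 = pendulum_step(np.pi, 0.0, 0.0)
th1, td1 = pendulum_step(np.pi, 0.0, 5.0)
```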

*Experimental setting* In this experiment, the input torque was a continuous value (\([ -5, 5 ]\) Nm) and the control started at the state \({\varvec{s}}=(\theta ,{\dot{\theta }})\) where the pendulum was hanging down and at rest (\(\theta =\pi , {\dot{\theta }}=0\)). The control cycle was set to 100 ms.

The number of candidates for checking the reachability \(|\mathcal K|\) was 5, a value that was determined empirically. The bit-length of hash-key *B* and the number of integrated GPR *L* were set to 3 and 2, respectively.

*Data gathering* The state of the pendulum was reset randomly \((-0.5\le \theta \le 0.5,-0.5 \le {\dot{\theta }} \le 0.5)\) every 30 s. We constructed the dataset by repeating this process for 200 s, giving 2000 data points. Figure 7b shows the distribution of the samples stored in the dataset.

*Cost function* We compared two motion planners, CRM\(_{C}\) and CRM\(_{T}\), with different cost functions. The cost functions of CRM\(_{C}\) and CRM\(_{T}\) were defined as \(c_{C}=\exp (\sigma ^{2}({\varvec{x}}_{*},\mathcal D))\) and \(c_{T}=\exp (\sigma ^{2}({\varvec{x}}_{*},\mathcal D))+c_{\mathrm{step}}\), respectively, where \(c_{\mathrm{step}}\) denotes a constant additional cost penalizing each control step.

The CRM\(_{C}\) planner tried to find the most *confident* path, while CRM\(_{T}\) tried to find a shorter path than CRM\(_{C}\), to reduce the additional cost penalizing each control step.

The red points \((\cdot )\) and blue points \((\star )\) in Fig. 8 show the planned motion and the realized motion, respectively. Both CRM\(_{C}\) and CRM\(_{T}\) could generate the desired motion. The path lengths of CRM\(_{C}\) and CRM\(_{T}\) were 110 and 78, respectively. The shorter path was generated by adding the additional term.

However, CRM\(_{T}\) produced larger errors than CRM\(_{C}\) as the robot approached the goal state (\(\theta =0, {\dot{\theta }}=0\)). This suggested that actions with low confidence levels were selected by CRM\(_{T}\) to reduce cost.

The computational time for roadmap construction was about 73 s, and the computational time for Dijkstra’s algorithm was 0.15 s. The computational time for roadmap construction with a “naïve” GPR was 1581 s at the sample size of 2000. By using LSH-GPR, the computational time for roadmap construction was dramatically reduced.

When the feedback controller was employed, it was necessary to calculate the inverse dynamics to correct the divergence from the planned path. The computational time for each feedback signal was 156 ms for the naïve GPR, compared with 6.60 ms for the LSH-GPR. This result suggested that LSH-GPR can achieve compensation control with a higher control cycle.

### 5.2 Underactuated redundant robot

The underactuated robot used in this experiment (Fig. 9) had redundant joints, and only the joint closest to the body (the green joint in Fig. 9) was controllable. The body was fixed. Each link was connected by a hinge joint, and the movement range of each hinge was \(-\pi /2 \le \theta \le \pi /2\). The total weight of each joint and link was 0.25 kg, giving a total weight of 1 kg. The goal of the task was to make the end-effector of the underactuated robot reach the target position.

*Experimental setting* The state of the robot was represented by the angle and angular velocity of each joint \(({\varvec{\theta }}, {\dot{\varvec{\theta }}})\) and by the two-dimensional position of the end-effector \((s_{1}, s_{2})\). The input torque range for the controllable joint was \(-5 \le \tau \le 5\), and weak gravity (\(-1\) m/s\(^{2}\)) was applied. The control cycle was set to 100 ms.

The number of candidates to check the reachability \(|\mathcal K|\) was 5, a value that was determined empirically. The bit-length of hash-key *B* and the number of integrated GPR *L* were set to 5 and 4, respectively.

*Data gathering* Data were gathered by monitoring a range of motions. We employed a rhythmic control signal similar to a CPG-based controller for a bio-inspired robot (Ijspeert 2008). It was expected that the motion resonance of the robot body would be induced by a rhythmic signal with a certain frequency. The control signal was designed such that the frequency of the rhythmic control signal changed over time. The input torque was determined as \(a(t) = 5\sin (2 \pi \omega (t)t)\) where \(\omega (t)\) is the angular velocity at time *t*.
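The excitation signal \(a(t)=5\sin (2\pi \omega (t)t)\) can be sketched as a swept-frequency sinusoid. The linear sweep of \(\omega(t)\) below is an illustrative assumption; the text only states that the frequency of the rhythmic signal changed over time.

```python
import numpy as np

# Sketch of the rhythmic data-gathering signal a(t) = 5 sin(2 pi omega(t) t),
# with omega(t) swept linearly (an assumption) so the excitation frequency
# changes over time and can hit the body's resonance.
def control_signal(t, omega_start=0.1, omega_end=2.0, duration=100.0):
    omega = omega_start + (omega_end - omega_start) * t / duration
    return 5.0 * np.sin(2.0 * np.pi * omega * t)

ts = np.arange(0.0, 100.0, 0.1)   # 100 ms control cycle over 100 s
a = control_signal(ts)
```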

*Cost function* In this experiment, we used three motion planners, CRM\(_{c1}\), CRM\(_{c2}\), and CRM\(_{c3}\), with different cost functions. The cost functions were defined as

Figure 10 shows the results of the reaching task. The red points \((\cdot )\) and blue points \((\star )\) in Fig. 10 show the position of the end-effector in the planned motion and the achieved motion, respectively. When CRM\(_{c1}\) or CRM\(_{c3}\) was used, the end-effector arrived at the goal position \((-0.25, 1.0)\) without diverging from the planned motion. In contrast, when CRM\(_{c2}\) was used, the motion of the end-effector deviated from the planned route. The divergence between the planned and reproduced motions was evaluated by the root mean squared error (RMSE) between them in state space^{2}.

Figure 11 shows the indexes of the data used for the motion planned by CRM\(_{c1}\) (Fig. 10a). It shows that segmented motions are connected to generate the planned motion.

### 5.3 Control of a human-like robotic arm

We applied the proposed method to the motion of the robotic arm shown in Fig. 13 (Okadome et al. 2014). This robotic arm has a 3 DOF shoulder and a 1 DOF elbow. Three pairs of air-cylinders drive the shoulder and one drives the elbow. The working areas of both joints are similar to those of a human. Each air-cylinder is controlled by two computerized control valves, and each valve is controlled by the output of a D/A board installed on a PC. Since each actuator pair for the shoulder is controlled by two valves, the control signal for the robot has 8-dimensional continuous values.

*Experimental setting* The state space was defined by the position and velocity of the end-effector of the robotic arm (\({\varvec{s}}=(p_{1},p_{2},p_{3},{\dot{p}_{1}},{\dot{p}_{2}},{\dot{p}_{3}})\)) measured by a 3D-position sensor (Microsoft Kinect for Windows). The velocity was calculated from the difference between the current and previous positions. The purpose of the tracking task was to move the end-effector to a target position, following a planned path.

*Data gathering* During data gathering, several actuators were randomly selected and their input values were altered randomly every 200 ms. We constructed the dataset by repeating this process 3000 times.

The roadmap was constructed using LSH-GPR. The cost function for each link was \(c(\sigma ^{2}({\varvec{a}}))=\sigma ^{2}({\varvec{a}})\), and the number of candidates to check the reachability \(|\mathcal K|\) was 5. The bit-length of hash-key *B* and the number of integrated GPR *L* were set to 4 and 2, respectively. During the control phase, the target position for the feedback controller was set to the position of the next time step on the planned path.

Figure 14 shows the planned path and control results. The gray, red, and blue points denote the dataset, planned path, and observed trajectory, respectively. The points with \(p_{3}=0\) are cases where the depth value of the 3D sensor was not available. These points are retained, since failures might occur in similar situations, and such information can be utilized in a “sample-based” motion planning method.

The computational times for roadmap construction and Dijkstra’s algorithm were 307 and 0.3 s, respectively. These times could be reduced by parallel computing, since the computation for each decomposed GPR and each output dimension can be performed independently.

Figure 14 indicates that the tracking task succeeded in following the planned path. Since only the confidence interval was used as the cost function, the planned path was not smooth. The development of a smoothing algorithm for the planned path will be undertaken in future work.

## 6 Conclusion

In this research, we proposed a motion planning method using GPR and applied it to three control tasks. The results suggest that a motion planning method that takes uncertainty into account through a confidence interval based on Bayesian inference provides satisfactory robot motion. In the proposed method, the recorded motion is segmented and the fragmented movements are then connected based on the model uncertainty. The planned motion is thus simply an accumulation of experience.

The size of the dataset for the simple pendulum control was 2000. Since the swing-up task of a pendulum is a highly nonlinear problem, it is commonly used in the RL literature (Grondman et al. 2012; Doya 2000). The number of state-action pairs used by our method to construct the controller is comparable to that of RL methods (Grondman et al. 2012). Furthermore, the sample sizes of the datasets for the control of the underactuated redundant robot and the human-like robotic arm were 12,500 and 3000, respectively. Although the dimensionality of the state and action spaces increased compared with the simple pendulum, the experimental results show that a roadmap for control can be constructed in a practical time.

In our current method, only a “one-step” evaluation of the inverse dynamics was used. It would be preferable to also investigate the reachability of long-distance milestone pairs, since such pairs can be connected through intermediate virtual milestones, which can improve performance. The connectivity of a long-distance milestone pair can be calculated using a GP-Bayes filtering method (Ko and Fox 2009). In that case, the choice of milestone pairs to be assessed is also important, since useless links enlarge the roadmap without improving performance.

The development of a framework that builds, in an online manner during the operation phase, a roadmap covering a sufficient portion of the state space is important for realizing large motion ranges and various functions. In such a case, autonomous exploration-exploitation balancing (Thrun 1992) is important. Since the confidence interval indicates novelty, planning by our method corresponds to exploitation, while selecting an unknown action corresponds to exploration. Operating the robot according to this criterion is a naïve means of achieving balance, and Bayesian model-based reinforcement learning can provide a better solution to this dilemma (Asmuth and Littman 2011; Ross et al. 2011).
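The naïve balancing criterion described above can be sketched as a simple action-selection rule driven by the GP confidence interval; the function, its arguments, and the weighting `beta` are all illustrative assumptions, not part of the proposed method.

```python
import numpy as np

def select_action(candidates, pred_std, goal_dist, beta=1.0, explore=False):
    """Naive exploration-exploitation switch driven by the GP confidence
    interval.  candidates: (N, d) candidate actions; pred_std: predictive
    std of the resulting transition; goal_dist: predicted distance-to-goal.
    """
    if explore:
        # Exploration: the confidence interval indicates novelty, so pick
        # the action the model is most uncertain about.
        return candidates[np.argmax(pred_std)]
    # Exploitation: prefer progress toward the goal, penalized by uncertainty.
    return candidates[np.argmin(goal_dist + beta * pred_std)]

cands = np.array([[0.1], [0.5], [0.9]])
std = np.array([0.05, 0.30, 0.80])
dist = np.array([0.40, 0.20, 0.10])

a_exploit = select_action(cands, std, dist, beta=1.0)   # low-uncertainty choice
a_explore = select_action(cands, std, dist, explore=True)  # novelty-seeking choice
```

A fixed switch like `explore` is exactly the naïve balancing mentioned above; a Bayesian model-based RL agent would instead resolve the dilemma by planning over its posterior.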

It may also be important to consider measurement uncertainty (Bry and Roy 2011; Berg et al. 2011; Prentice and Roy 2009). For example, Prentice and Roy developed a path planning method that addresses localization uncertainty along the pathway using an extended Kalman filter (Prentice and Roy 2009). Both measurement and control uncertainty^{3} are difficulties caused by the perceptual aliasing that is inevitable in a real environment (Chrisman 1992; Spaan and Vlassis 2004). Nevertheless, the human-like robotic arm can still accomplish a given task in an unstructured environment with an insufficient sensor system. By considering model uncertainty, the planned motion is steered toward sequences of sensory information that are reproduced with high probability. Thus, satisfactory motion can be obtained by our method.

There are two remaining issues. First, our current cost function lacks theoretical support. The accumulated “confidence interval” may indicate the uncertainty of the planned motion, but the planned action sequence does not necessarily generate the planned motion. In particular, when the cost depends only on the inverse dynamics estimation, the uncertainty of the forward dynamics is not considered at all. Including the cost from the forward dynamics estimation, as in Eq. (15), is a naïve solution to this problem, but it does not guarantee successful control.
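The accumulation of confidence-interval costs along a candidate path, with a forward-dynamics term added in the spirit of Eq. (15), can be sketched as below. The edge representation, the additive combination, and the weight `w_fwd` are hypothetical illustrations, not the paper’s exact cost function.

```python
def edge_cost(std_inv, std_fwd, w_fwd=1.0):
    """Hypothetical edge cost combining the confidence intervals (predictive
    standard deviations) of the inverse- and forward-dynamics GPs."""
    return float(std_inv + w_fwd * std_fwd)

def path_cost(edges):
    """Accumulated confidence-interval cost along a candidate path;
    edges is a list of (std_inv, std_fwd) pairs."""
    return sum(edge_cost(si, sf) for si, sf in edges)

# Between two candidate paths, the planner prefers the one whose model
# predictions are more certain along every transition.
path_a = [(0.1, 0.2), (0.1, 0.1)]
path_b = [(0.5, 0.6), (0.4, 0.3)]
best = min([path_a, path_b], key=path_cost)
```

As the text notes, a low accumulated cost of this form still does not guarantee that executing the action sequence reproduces the planned motion.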

Second, a more elaborate feedback control framework is necessary to compensate for errors during motion reproduction. When a robot is controlled, the generated motion diverges from the planned motion because of the stochastic nature of the dynamics, model uncertainties, and obstacles in a nonstationary environment. In this research, we employed a simple feedback controller; however, recovery failed when the divergence of the trajectory from the planned path became too large to compensate for in a single step. A possible improvement is to modify the mixing procedure (Eq. (11)), for example by using a products-of-experts (PoE) framework (Hinton 2002), so that the output of the feedback controller is ignored when the confidence interval becomes large. The development of either a real-time, “short-step” motion planning method to recover the trajectory or an online re-planning method is also crucial for returning to the planned motion.

However, these approaches cannot be the final solution to this problem. The likelihood of the whole motion calculated from a forward dynamics model, e.g., GP-DM (Wang et al. 2008) or GP-LVM (Lawrence 2004), appears to be a better criterion for evaluating the feasibility of a planned motion. Such methods, however, mainly focus on the analysis of a single motion, and it might not be feasible to apply them to roadmap construction, since massive repetition would be required to evaluate every possible motion.

There are several optimization-based methods, such as reinforcement learning and optimal control (Todorov and Li 2005; Peters and Schaal 2008; Deisenroth and Rasmussen 2011; Theodorou et al. 2010; Marco et al. 2016; Mukadam et al. 2016). In robotics research, path-based methods, as represented by the iterative Linear Quadratic Gaussian (iLQG) method, are commonly used to cope with large state and action spaces and nonlinear dynamics. In these methods, a given sequence of actions is updated to decrease a total ‘external’ cost, and the motion is optimized by repeating this procedure. To compute the update, some methods use a forward model (Todorov and Li 2005), while others use rollouts (Theodorou et al. 2010). In contrast to sampling-based methods, which try to construct a graph covering the state space, optimization-based methods refine the given path only locally around it. Thus, a smooth and efficient motion can be generated by an optimization-based method, while a large state space can be explored by a sampling-based method. Combining our method with such optimization-based methods is future work.

## Footnotes

- 1.
The accumulated cost becomes -1 for any control policy if \(\gamma =1\) under our assumption since the system reaches the goal with probability 1 after an infinite number of time steps. When the purpose is to obtain the optimal control policy that leads the system to reach the goal in the fewest time steps, the discount factor \(\gamma \) is usually set to less than 1.

- 2.
This RMSE is calculated as \(\frac{1}{|\mathcal{M}^{p}|} \sum _{t=1}^{|\mathcal{M}^{p}|} \sqrt{({\varvec{s}}^{p}(t)-{\varvec{s}}(t))^{\top }({\varvec{s}}^{p}(t)-{\varvec{s}}(t))}\).

- 3.
The control uncertainty is mentioned in (Berg et al. 2011); it corresponds to the uncertainty in the state transition model or the inverse dynamics model.

## Notes

### Acknowledgments

This work was partly supported by a Grant-in-Aid for Young Scientists (14444719) and a Grant-in-Aid for JSPS Fellows (A15J01499).

## References

- Asmuth, J., & Littman, M. L. (2011). Learning is planning: Near Bayes-optimal reinforcement learning via Monte-Carlo tree search. In *Uncertainty in artificial intelligence*.
- Berg, J. V. D., et al. (2011). LQG-MP: Optimized path planning for robots with motion uncertainty and imperfect state information. *The International Journal of Robotics Research*, *30*(7), 895–913.
- Bishop, C. M. (2006). *Pattern recognition and machine learning* (1st ed.). New York: Springer.
- Bry, A., & Roy, N. (2011). Rapidly-exploring Random Belief Trees for motion planning under uncertainty. In *2011 IEEE international conference on robotics and automation (ICRA)* (pp. 723–730).
- Choset, H. M. (2005). *Principles of robot motion: Theory, algorithms, and implementations*. Cambridge: MIT Press.
- Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In *AAAI* (pp. 183–188).
- Dalibard, S., et al. (2013). Dynamic walking and whole-body motion planning for humanoid robots: An integrated approach. *The International Journal of Robotics Research*, *32*(9–10), 1089–1103.
- Dechter, R., & Pearl, J. (1985). Generalized best-first search strategies and the optimality of A*. *Journal of the ACM (JACM)*, *32*(3), 505–536.
- Deisenroth, M. P., & Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. In *Proceedings of the international conference on machine learning*.
- Dijkstra, E. W. (1959). A note on two problems in connexion with graphs. *Numerische Mathematik*, *1*(1), 269–271.
- Doya, K. (2000). Reinforcement learning in continuous time and space. *Neural Computation*, *12*(1), 219–245.
- Foster, L., et al. (2009). Stable and efficient Gaussian process calculations. *Journal of Machine Learning Research*, *10*, 857–882.
- Grondman, I., et al. (2012). Efficient model learning methods for actor-critic control. *IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)*, *42*(3), 591–602.
- Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. *Neural Computation*, *14*, 1771–1800.
- Ijspeert, A. J. (2008). Central pattern generators for locomotion control in animals and robots: A review. *Neural Networks*, *21*(4), 642–653.
- Indyk, P., & Motwani, R. (1998). Approximate nearest neighbors: Towards removing the curse of dimensionality. In *Proceedings of the thirtieth annual ACM symposium on theory of computing, STOC ’98* (pp. 604–613). New York: ACM.
- Kavraki, L. E., et al. (1996). Probabilistic roadmaps for path planning in high-dimensional configuration spaces. *IEEE Transactions on Robotics and Automation*, *12*(4), 566–580.
- Ko, J., & Fox, D. (2009). GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. *Autonomous Robots*, *27*, 75–90.
- Kuwata, Y., et al. (2009). Real-time motion planning with applications to autonomous urban driving. *IEEE Transactions on Control Systems Technology*, *17*(5), 1105–1118.
- LaValle, S. M. (2006). *Planning algorithms*. Cambridge: Cambridge University Press.
- LaValle, S. M., & Kuffner, J. J. (2001). Randomized kinodynamic planning. *The International Journal of Robotics Research*, *20*(5), 378–400.
- Lawrence, N. D. (2004). Gaussian process latent variable models for visualisation of high dimensional data. *Advances in Neural Information Processing Systems*, *16*, 329–336.
- Marco, A., et al. (2016). Automatic LQR tuning based on Gaussian process global optimization. In *2016 IEEE international conference on robotics and automation (ICRA)* (pp. 270–277).
- Mukadam, M., et al. (2016). Gaussian process motion planning. In *2016 IEEE international conference on robotics and automation (ICRA)* (pp. 9–15).
- Okadome, Y., et al. (2013). Fast approximation method for Gaussian process regression using hash function for non-uniformly distributed data. In *Artificial neural networks and machine learning* (pp. 17–25).
- Okadome, Y., et al. (2014). Confidence-based roadmap using Gaussian process regression for a robot control. In *2014 IEEE/RSJ international conference on intelligent robots and systems (IROS 2014)* (pp. 661–666).
- Peters, J., & Schaal, S. (2008). Reinforcement learning of motor skills with policy gradients. *Neural Networks*, *21*(4), 682–697.
- Prentice, S., & Roy, N. (2009). The belief roadmap: Efficient planning in belief space by factoring the covariance. *The International Journal of Robotics Research*, *28*(11–12), 1448–1465.
- Rasmussen, C. E., & Williams, C. K. I. (2005). *Gaussian processes for machine learning*. Cambridge: MIT Press.
- Ross, S., et al. (2011). A Bayesian approach for learning and planning in partially observable Markov decision processes. *The Journal of Machine Learning Research*, *12*, 1729–1770.
- Spaan, M. T. J., & Vlassis, N. (2004). A point-based POMDP algorithm for robot planning. In *IEEE international conference on robotics and automation* (Vol. 3, pp. 2399–2404).
- Sutton, R. S., et al. (2000). Policy gradient methods for reinforcement learning with function approximation. In *Advances in neural information processing systems* (Vol. 12, pp. 1057–1063).
- Theodorou, E., et al. (2010). Reinforcement learning of motor skills in high dimensions: A path integral approach. In *2010 IEEE international conference on robotics and automation (ICRA)* (pp. 2397–2403).
- Thrun, S. B. (1992). Efficient exploration in reinforcement learning. Technical report, Pittsburgh.
- Todorov, E., & Li, W. (2005). A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems. In *Proceedings of the 2005 American control conference* (Vol. 1, pp. 300–306).
- Wang, J. M., et al. (2008). Gaussian process dynamical models for human motion. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, *30*(2), 283–298.