1 Introduction

In model-based reinforcement learning, an optimal controller is derived from an optimal value (cost-to-go) function by solving the Bellman equation, which is often intractable due to its nonlinearity. The linearly solvable Markov decision process (LMDP) is a computational framework that solves the Bellman equation efficiently through an exponential transformation of the value function, under some constraints on the action-dependent cost [11]. The LMDP framework has been applied in domains such as character control for animation [3], optimal assignment of communication resources in cellular telephone systems [8], and real-robot control [7, 12]. The major drawback of the LMDP framework, however, is that the environmental model must be given in advance. Model learning has been integrated with LMDP in discrete problems [2] and in continuous problems [7], but the performance of the obtained controllers is critically affected by the accuracy of the environmental model [7].

One possible way to overcome this problem is to adopt concepts from robust control theory [9], which considers the worst-case adversary and derives an optimal controller using a game-theoretic solution. Recently, the framework of the linearly solvable Markov game (LMG) was proposed as an extension of LMDP [4, 5], in which the optimal value function is obtained as a solution of the Hamilton–Jacobi–Isaacs (HJI) equation. Since LMG linearizes the nonlinear HJI equation under assumptions similar to those of LMDP, an optimal policy can be computed efficiently. While the LMG framework has been shown to promote robustness against disturbances [4], its advantage over the LMDP framework in the face of modeling errors has not been fully investigated.

In this study, we compare the performance of LMDP- and LMG-based controllers on two tasks: a grid world with risky states and a pole swing-up task. We investigate the robustness of the controllers under varying gaps between the state transition model used for controller design and that of the actual environment. Experimental results in the discrete problem show that the LMG-based policy works well when the robustness parameter of LMG is set to its maximum value, while the LMDP-based policy is very sensitive to modeling errors. In contrast, experimental results in the continuous problem show that the robustness parameter must be tuned to obtain the best performance with the LMG-based policy.

2 Linearly solvable Markov game (LMG)

2.1 Markov games

Since LMDP is a special case of LMG, we provide a brief explanation of the LMG framework according to [4]. Let \(x\in \mathcal {X}\) be the state of an agent, and let \(u^c, u^a \in \mathcal {U}\) denote the control by the agent and the disturbance by the adversary, respectively. In the Markov game, the state transition is affected by both the agent and the adversary as \(x^{\prime } \sim p(x^{\prime }{\mid }x, u^c, u^a)\). When \(x\) and \(u\) are continuous, \(p(x^{\prime }{\mid }x, u^c, u^a)\) is given by the Gaussian distribution \(\mathcal {N}(x'{\mid }\varvec{\mu } (x, u^c, u^a), \varvec{\Sigma }),\) where \(\varvec{\mu }\) and \(\varvec{\Sigma }\) denote the mean and the covariance matrix, respectively. In particular, \(\varvec{\mu }\) is assumed to be:

$$\begin{aligned} \varvec{\mu } (x, u^c, u^a) = \varvec{a}(x) + \varvec{B}(x) (u^c + u^a) . \end{aligned}$$

The agent receives an immediate cost \(\ell (x, u^c, u^a)\) at each step. For instance, in the first-exit case, the objective function is the expected cumulative cost [1],

$$\begin{aligned} \mathcal {J}^{\pi } =\mathbb {E}_{ p(x_{t+1}{\mid }x_t,u_t^c,u_t^a)} \left[ \sum ^{\mathcal {T}_{\mathcal {A}}}_{t=0} \ell (x_t, u_t^c, u_t^a) \right] , \end{aligned}$$
(1)

where \(\mathcal {T}_{\mathcal {A}}\) denotes the time when the agent arrives at an absorbing state \(x_{\mathcal {A}} \in \mathcal {X}_{\mathcal {A}} \subseteq \mathcal {X}\). The agent seeks an optimal policy that minimizes the objective function while the adversary acts to maximize it; the corresponding optimal value function satisfies the following HJI equation:

$$\begin{aligned} v(x) = \min _{u^c} \max _{u^a} \{ \ell (x, u^c, u^a) + \mathbb {E}_{p(x^{\prime }{\mid }x, u^c , u^a)} [ v(x^{\prime })] \}, \end{aligned}$$
(2)

where \(v(x)\) denotes the value function. Since Eq. (2) is a nonlinear equation due to the min and max operators, it is not trivial to find an optimal value function.

2.2 Linearization

The key trick of LMDP and LMG is to optimize the state transition probability directly instead of optimizing the policy. In other words, the control and the disturbance are allowed to influence the state transition probability directly. First, a baseline state transition probability called the uncontrolled probability is introduced by

$$\begin{aligned} p^0 (x^{\prime }{\mid }x) = \int p(x^{\prime }{\mid }x, u) \pi ^0 (u{\mid } x) \mathrm{d} u, \end{aligned}$$

where \(\pi ^0 (u{\mid } x)\) denotes a baseline policy. In LMG, the learning agent modifies the state transition probability to \(p^{u^c} (x^{\prime } {\mid } x)\), while the disturber further modifies it to \(p^{\{u^c , u^a \}}(x^{\prime } {\mid } x)\); they are defined by:

$$\begin{aligned} p^{u^c} (x^{\prime } {\mid } x)\propto & {} g^{u^c} (x^{\prime } {\mid } x) p^0(x^{\prime } {\mid } x) \end{aligned}$$
(3)
$$\begin{aligned} p^{\{u^c , u^a \}} (x^{\prime } {\mid } x)\propto & {} g^{u^a} (x' {\mid } x, u^c) p^{u^c} (x^{\prime } {\mid } x), \end{aligned}$$
(4)

where \(g^{u^c} (x^{\prime } {\mid } x)\) and \(g^{u^a} (x' {\mid } x, u^c)\) denote the effect of control and disturbance in the state transition, respectively.

The HJI equation (2) is intractable due to its nonlinearity. However, it can be simplified by introducing the following immediate cost:

$$\begin{aligned} \ell (x, u^c, u^a)= \, & {} q(x) + \frac{1}{\alpha } \mathcal {D}_{\alpha } \left( p^0(x^{\prime } {\mid } x) \parallel p^{u^c}(x^{\prime } {\mid } x) \right) \nonumber \\&- \frac{1}{\alpha } \mathrm {KL} \left( p^{\{u^c, u^a \}} (x^{\prime } {\mid } x) \parallel p^{u^c} (x^{\prime } {\mid } x) \right) , \end{aligned}$$
(5)

where \(q(x)\) is a state-dependent cost function. \(\mathcal {D}_{\alpha } (p^0 \parallel p^{u^c})\) denotes the Rényi divergence between two probability distributions defined by:

$$\begin{aligned} \mathcal {D}_{\alpha } ( p_1 (x) \Vert p_2 (x))= \, & {} \frac{\alpha }{\alpha -1} \log \left( \int p_1 (x)^{\alpha } p_2 (x)^{1-\alpha } dx\right) , \end{aligned}$$

where \(\alpha\) is called the robustness parameter (\(0 \le \alpha \le 1\)). The second term measures the discrepancy between the uncontrolled probability, \(p^0 (x^{\prime } {\mid } x)\), and the probability controlled only by the agent, \(p^{u^c} (x^{\prime } {\mid } x)\); it corresponds to the control cost. The third term, \(\mathrm {KL}\), is the Kullback–Leibler divergence between the probability controlled only by the agent, \(p^{u^c} (x^{\prime } {\mid } x)\), and the probability shaped by both the control and the disturbance, \(p^{\{ u^c, u^a \}} (x^{\prime } {\mid } x)\); it corresponds to the cost reduction caused by the disturbance.
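As a concrete illustration, the sketch below evaluates the two divergence terms of Eq. (5) for discrete next-state distributions; the distributions and the value of \(\alpha\) are arbitrary choices for illustration, not taken from the experiments.

```python
import numpy as np

def renyi_divergence(p1, p2, alpha):
    """Renyi divergence D_alpha(p1 || p2) for discrete distributions, 0 < alpha < 1."""
    return alpha / (alpha - 1.0) * np.log(np.sum(p1 ** alpha * p2 ** (1.0 - alpha)))

def kl_divergence(p1, p2):
    """Kullback-Leibler divergence KL(p1 || p2) for discrete distributions."""
    mask = p1 > 0
    return np.sum(p1[mask] * np.log(p1[mask] / p2[mask]))

# Illustrative next-state distributions over three states (not from the paper).
p0      = np.array([0.5, 0.3, 0.2])   # uncontrolled probability p^0(x'|x)
p_uc    = np.array([0.6, 0.3, 0.1])   # controlled probability p^{u^c}(x'|x)
p_joint = np.array([0.4, 0.4, 0.2])   # probability p^{u^c,u^a}(x'|x) under the disturbance
alpha   = 0.95

control_cost     = renyi_divergence(p0, p_uc, alpha) / alpha   # second term of Eq. (5)
disturbance_term = kl_divergence(p_joint, p_uc) / alpha        # third term (enters with a minus sign)
print(control_cost, disturbance_term)
```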

Under the above assumptions, the HJI equation (2) is transformed into a linear equation. If \(0 \le \alpha < 1\), substituting Eq. (5) into Eq. (2) yields:

$$\begin{aligned} z(x; \alpha ) = \exp ((\alpha - 1) q(x)) \mathbb {E}_{p^0(x' {\mid } x)} [z(x'; \alpha )], \end{aligned}$$
(6)

where \(z(x; \alpha ) = \exp ((\alpha - 1) v(x))\) is called the desirability function. When \(\alpha = 0\), Eq. (6) is identical to the Bellman equation linearized by LMDP [11]. If \(\alpha = 1\), we obtain:

$$\begin{aligned} v(x) = q(x) + \mathbb {E}_{ p^0(x'{\mid }x)} \left[ v(x') \right] . \end{aligned}$$
(7)

In addition, the value and desirability functions are constrained at the absorbing states, \(v(x_{\mathcal {A}}) = q(x_{\mathcal {A}})\). In both cases, the control- and disturbance-dependent costs are eliminated during the linearization. Equations (6) and (7) are linear with respect to the desirability and value functions, respectively. As a result of the linearization, the optimal controlled probabilities are:

$$\begin{aligned} p^{*u^c} (x' {\mid } x)\propto & {} \exp (-v(x')) p^0(x' {\mid } x), \end{aligned}$$
(8)
$$\begin{aligned} p^{*\{u^a, u^c \}} (x' {\mid } x)\propto & {} \exp ((\alpha -1) v(x')) p^0(x' {\mid } x) . \end{aligned}$$
(9)

Thus, the desirability function and the optimal controller of LMG become equivalent to those of LMDP when \(\alpha = 0\).

2.3 Computing optimal policies

2.3.1 Discrete case

Equation (6) is linear with respect to the desirability function. Since it can be regarded as a generalized eigenvalue problem, the desirability function can be computed with a standard matrix computation package. On the other hand, Eq. (7) is linear with respect to the value function and can be regarded as a standard Bellman equation under the uncontrolled probability; the value function can therefore be obtained by the value iteration algorithm [10].
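For the first-exit setting, one possible way to compute the tabular solutions is sketched below: a fixed-point iteration of Eq. (6) for \(0 \le \alpha < 1\) and value iteration for Eq. (7). The transition matrix P0, cost vector q, and set of absorbing states are hypothetical inputs, and the fixed-point scheme is only one of several ways to solve the linear equation.

```python
import numpy as np

def solve_desirability(P0, q, alpha, absorbing, n_iter=100000, tol=1e-12):
    """Iterate Eq. (6), z = exp((alpha - 1) q) * P0 z, with z fixed to
    exp((alpha - 1) q) at the absorbing states (first-exit formulation)."""
    z = np.ones(len(q))
    z[absorbing] = np.exp((alpha - 1.0) * q[absorbing])
    for _ in range(n_iter):
        z_new = np.exp((alpha - 1.0) * q) * (P0 @ z)
        z_new[absorbing] = np.exp((alpha - 1.0) * q[absorbing])
        if np.max(np.abs(z_new - z)) < tol:
            return z_new
        z = z_new
    return z

def solve_value_alpha1(P0, q, absorbing, n_iter=100000, tol=1e-9):
    """Value iteration for Eq. (7), v = q + P0 v, under the uncontrolled probability."""
    v = np.zeros(len(q))
    for _ in range(n_iter):
        v_new = q + P0 @ v
        v_new[absorbing] = q[absorbing]
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

# For alpha < 1 the value function is recovered as v = log(z) / (alpha - 1).
```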

Note that the optimal policy is not computed explicitly in the discrete setup of the LMG framework. The control policy that realizes the optimal state transition probability (8) is obtained by solving the following constrained least-squares problem at each state:

$$\begin{aligned}&\min _{\pi (u {\mid } x)} \sum _{x' \in \mathcal {X}} \left( \sum _{u \in \mathcal {U}} \pi (u {\mid } x) p (x' {\mid } x, u) - p^{*u^c} (x' {\mid } x) \right) ^2 \\&\mathrm {s.t.} \quad \sum _{u \in \mathcal {U}} \pi (u {\mid } x) = 1, \; 0 \le \pi (u {\mid } x) \le 1, \quad \forall \; u \in \mathcal {U} \nonumber . \end{aligned}$$
(10)

To solve this problem, we use the function lsqlin() in the MATLAB Optimization Toolbox®.
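In an open-source setting, a comparable replacement for lsqlin() is, for example, a small constrained minimization with SciPy; the sketch below is one possible formulation of Eq. (10), with the per-action transition rows P_xu and the target distribution p_star as assumed inputs.

```python
import numpy as np
from scipy.optimize import minimize

def recover_policy(P_xu, p_star):
    """Solve Eq. (10) for one state x: find pi(u|x) whose mixture of the
    action-conditional transitions best matches p^{*u^c}(x'|x).

    P_xu   : (n_actions, n_states) array, row u is p(.|x, u)
    p_star : (n_states,) array, the optimal controlled probability p^{*u^c}(.|x)
    """
    n_actions = P_xu.shape[0]

    def objective(pi):
        residual = pi @ P_xu - p_star
        return residual @ residual

    constraints = ({'type': 'eq', 'fun': lambda pi: np.sum(pi) - 1.0},)
    bounds = [(0.0, 1.0)] * n_actions
    pi0 = np.full(n_actions, 1.0 / n_actions)          # start from the uniform policy
    res = minimize(objective, pi0, method='SLSQP',
                   bounds=bounds, constraints=constraints)
    return res.x
```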

2.3.2 Continuous case

To solve the linearized HJI equation (6) for continuous-space problems, we employ a function approximation method. Following previous studies [6, 7], the following linear function approximator is introduced:

$$\begin{aligned} z(x; \varvec{w}, \varvec{\Theta }) = \sum _{i=1}^{N_z} w_i f(x, \varvec{m}_i, \varvec{S}_i), \; \; \varvec{\Theta } = \{ (\varvec{m}_i, \varvec{S}_i) \}_{i=1}^{N_z}, \end{aligned}$$
(11)

where \(\varvec{w} = (w_1, \ldots , w_{N_z})\) is the weight vector to be optimized and \(f(x, \varvec{m}_i, \varvec{S}_i)\) is a basis function defined by:

$$\begin{aligned} f(x; \varvec{m}_i, \varvec{S}_i) = \exp \left( - \frac{1}{2} (x- \varvec{m}_i)^{\mathrm {T}}\varvec{S}_i (x- \varvec{m}_i) \right) , \end{aligned}$$

where \(\varvec{m}_i\) and \(\varvec{S}_i\) denote the center position and the precision matrix of the ith basis function, respectively. In the case of \(\alpha = 1\), it is appropriate to approximate the value function \(v(x)\) rather than the desirability function. To optimize the parameters \(\varvec{w}\), the least-squares method is applied over a set of collocation states \(\{ \varvec{x}_i \}_{i=1}^{N_s}\), for which the objective function is constructed as:

$$\begin{aligned} J = \sum _{i=1}^{N_s} \left\| z(x_i; \alpha ) - e^{(\alpha - 1) q(x_i)} \mathbb {E}_{p^0 (x' {\mid } x_i)} \left[ z(x'; \alpha ) \right] \right\| ^2 . \end{aligned}$$

See [6, 7] for more details.
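A minimal sketch of such a collocation-based fit is given below, assuming \(0 \le \alpha < 1\). The Monte Carlo estimate of the expectation under \(p^0\), the sampling routine sample_next, and the anchor constraint that pins the desirability at a goal state are our own assumptions rather than the exact procedure of [6, 7].

```python
import numpy as np

def gaussian_features(X, centers, precisions):
    """Evaluate the Gaussian bases f(x; m_i, S_i) of Eq. (11) at the rows of X."""
    Phi = np.empty((X.shape[0], centers.shape[0]))
    for i, (m, S) in enumerate(zip(centers, precisions)):
        d = X - m
        Phi[:, i] = np.exp(-0.5 * np.einsum('nj,jk,nk->n', d, S, d))
    return Phi

def fit_desirability(X, q, alpha, sample_next, centers, precisions, x_goal,
                     n_samples=100):
    """Least-squares fit of w over the collocation states X (rows).

    sample_next(x, n) is assumed to return n samples from p^0(x'|x).
    One anchor row enforces z(x_goal) = 1 to exclude the trivial solution w = 0
    (v(x_goal) = 0 in the first-exit formulation)."""
    Phi = gaussian_features(X, centers, precisions)
    Phi_next = np.zeros_like(Phi)
    for i, x in enumerate(X):                       # Monte Carlo estimate of E[f(x')]
        Phi_next[i] = gaussian_features(sample_next(x, n_samples),
                                        centers, precisions).mean(axis=0)
    A = Phi - np.exp((alpha - 1.0) * q)[:, None] * Phi_next
    A = np.vstack([A, gaussian_features(x_goal[None, :], centers, precisions)])
    b = np.zeros(A.shape[0])
    b[-1] = 1.0                                     # anchor: z(x_goal) = 1
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w
```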

Once the desirability or value function is computed, the corresponding optimal control policy \(u^*(x)\) can be derived by the following equation:

$$\begin{aligned} {u}^*(x) = -\sigma ^2 \varvec{B}(x)^{\mathrm {T}}\frac{1}{(\alpha -1) z(x; \alpha )} \frac{\partial z(x; \alpha )}{\partial x} , \end{aligned}$$
(12)

where \(\varvec{B}(x)\) denotes the Jacobian matrix of the system dynamics with respect to the control. Thus, LMG for continuous problems explicitly requires \(\varvec{B}(x)\) and \(p^0 (x' {\mid } x)\) as the environmental model. Note that the optimal action u can be computed directly in continuous problems, whereas the constrained least-squares problem (10) must be solved in discrete problems.
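For \(\alpha < 1\), the gradient in Eq. (12) is available in closed form from the Gaussian bases of Eq. (11), so the control can be evaluated directly from the fitted weights; the sketch below reuses the hypothetical centers and precision matrices from the previous snippet.

```python
import numpy as np

def optimal_control(x, w, centers, precisions, B, sigma, alpha):
    """Evaluate u*(x) of Eq. (12) from the RBF representation of z(x; alpha), alpha < 1.

    For each Gaussian basis, grad f_i(x) = -f_i(x) * S_i (x - m_i)."""
    z = 0.0
    grad_z = np.zeros_like(x, dtype=float)
    for w_i, m, S in zip(w, centers, precisions):
        d = x - m
        f_i = np.exp(-0.5 * d @ S @ d)
        z += w_i * f_i
        grad_z += -w_i * f_i * (S @ d)
    return -sigma ** 2 * B.T @ (grad_z / ((alpha - 1.0) * z))
```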

3 Discrete state-action problem

3.1 Grid-world with risky states

As an example of a discrete state-action problem, we select the simple grid-world navigation problem shown in Fig. 1. When the agent steps into a risky state, it receives a high cost (\(q(x) = 200\)). The agent receives a small cost (\(q(x) = 1\)) in all other states except the goal state, where it receives zero cost and the episode terminates. The goal of the agent is to find the shortest path to the goal state while avoiding the risky states.

Fig. 1 Grid arrangement and state transition: a the start and goal states are marked with "S" and "G", respectively. The risky states lie between the start and goal states and are colored dark gray. b The agent can choose four actions: up, down, right, and left. The probability of the next state depends on the certainty parameter c

The state transition probability is characterized by the certainty parameter c (\(0.5 \le c \le 1\)), as illustrated in Fig. 1b. The agent moves in the desired direction with probability c but is pushed down with probability \(1-c\) due to a north wind. If a move would take the agent beyond the boundary, the agent remains in the same state. A random policy \(\pi ^0 (u {\mid } x)\) is constructed from a discrete uniform distribution and is used to produce the uncontrolled state transition probability \(p^0 (x' {\mid } x)\).
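One possible encoding of this windy grid world and its uncontrolled probability is sketched below; the grid size and state indexing are placeholders and do not reproduce the exact layout of Fig. 1.

```python
import numpy as np

def build_transitions(width, height, c):
    """p(x'|x,u) for the windy grid: the chosen move succeeds with probability c,
    and with probability 1 - c the agent is blown one cell down by the north wind.
    Moves beyond the boundary leave the state unchanged."""
    moves = {'up': (0, -1), 'down': (0, 1), 'right': (1, 0), 'left': (-1, 0)}
    n_states = width * height

    def step(s, move):
        col, row = s % width, s // width
        nc, nr = col + move[0], row + move[1]
        return nr * width + nc if 0 <= nc < width and 0 <= nr < height else s

    P = np.zeros((len(moves), n_states, n_states))
    for a, move in enumerate(moves.values()):
        for s in range(n_states):
            P[a, s, step(s, move)] += c                  # intended move
            P[a, s, step(s, moves['down'])] += 1.0 - c   # pushed down by the wind
    return P

P = build_transitions(width=10, height=10, c=0.9)
p0 = P.mean(axis=0)   # uncontrolled probability under the uniform random policy pi^0
```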

Fig. 2 Value functions and state visitation frequencies. a Results when the training environment is deterministic (\(c = 1\)). The left panels show the value functions for several settings of \(\alpha\). The middle and right panels show the state visitation frequencies when the test environment is deterministic (\(c = 1\)) and stochastic (\(c = 0.7\)), respectively. b Results when the training environment is stochastic (\(c = 0.9\))

3.2 Result

Fig. 3 Performance for different values of the certainty parameter. The left and right panels show the performance of the policies derived under the deterministic setting (\(c = 1\)) and the windy setting (\(c = 0.9\)), respectively. The small windows in the figures magnify the framed lower-left region. The performances are averages over 1000 steps

Figure 2a shows the experimental results in which the optimal policy was computed in the deterministic training environment (\(c = 1\)). The left column of Fig. 2a shows the value functions for four different settings of the robustness parameter \(\alpha \in \{ 0, 0.95, 0.99, 1 \}\). The middle and right panels show the state visitation frequencies obtained by executing the policies in the two test environments (\(c = 1\) and 0.7, respectively). The value at the entrance of the narrow path became higher as \(\alpha\) increased, which suggests that the agent chose the shortest path when \(\alpha = 0\) but avoided the risky states as \(\alpha\) approached 1, as shown in the middle panels. The right panels show that the state visitation frequencies were disturbed by the north wind in the stochastic test environment. Consequently, the cumulative cost increased drastically when \(\alpha\) was set to 0, while the most robust controller, obtained with \(\alpha = 1\), generated similar behaviors in both test environments.

Figure 2b shows the experimental results in which the optimal policy was computed in the stochastic training environment (\(c = 0.9\)). As compared with Fig. 2a, the value functions were skewed by the north wind when \(\alpha \ne 0\), whereas in the case of LMDP the value function trained with \(c = 0.9\) was the same as that trained with \(c = 1.0\). As a result, the LMDP-based policy preferred the shortest path even though it was computed in the stochastic environment. The reason why the stochasticity did not affect the LMDP-based policy is that \(p^0 (x' {\mid } x)\) is invariant with respect to c when \(\pi ^0 (u {\mid } x)\) is uniform. In addition, the controlled probability (8) does not always satisfy the following inequality condition:

$$\begin{aligned} p^{*u^c}(x ' {\mid } x) \le \max _u p(x ' {\mid } x, u). \end{aligned}$$

In fact, we found that \(p^{*u^c}(x ' {\mid } x)\) for an optimal state transition \((x, x')\) approached 1 according to the value function. The upper panels of Fig. 2b are similar to those of Fig. 2a, which means that the stochasticity of the environment was not taken into account appropriately when \(\alpha = 0\). In contrast, conservative behaviors were obtained when \(\alpha \ge 0.99\).

Figure 3 compares the average cumulative costs of the policies derived with five different settings of the robustness parameter \(\alpha\) and with the certainty parameter \(c \in \{ 1, 0.9 \}\) in the training environment. The performance of the policies obtained by LMDP (\(\alpha = 0\)) and by LMG with \(\alpha = 0.5\) was optimal in the deterministic environment (\(c = 1\)), but it deteriorated rapidly as c decreased, that is, as the windiness increased, even when the policies were derived from an uncontrolled transition model that takes the wind into account (\(c=0.9\), left panel). In contrast, the policies obtained with larger \(\alpha\) showed relatively low performance in the deterministic environment, but they performed robustly in the windier environments, even when they were derived from a windless transition model (\(c = 1\), right panel).

4 Continuous state-action problem

4.1 Swing-up pole

Fig. 4 a Arrival time at the desired state. b Success rate of the swing-up. In each plot, each line corresponds to the mean over trials for a particular value of \(\alpha\), and the horizontal axis is the mass of the pole in the test condition

Next, we conduct a simulation of a pole swing-up task as an example of a continuous state-action problem. In the simulation, one end of the pole is fixed and the pole can rotate in a plane around the fixed point. The objective of the task is to bring the pole to the upright position and keep it there. The continuous action u is the torque applied to the pole, while the state is represented by a two-dimensional vector \(x= [\theta , \omega ]\), where \(\theta\) and \(\omega\) denote the angle and the angular velocity, respectively. The equation of motion in discrete time is modeled by:

$$\begin{aligned} \Delta x= \, & {} (\varvec{a}(x) + \varvec{B}(u+\sigma \xi )) \Delta t\\ \varvec{a}(x)= & {} \begin{bmatrix} \dot{\theta } ,\; m \frac{g}{l} \sin (\theta ) - \kappa \dot{\theta } \end{bmatrix}^{\mathrm {T}},\quad \varvec{B} = [0,\, 1] ^{\mathrm {T}}\nonumber , \end{aligned}$$
(13)

where l, m, g and \(\kappa\) denote the length of the pole, mass, gravitational acceleration and coefficient of friction, respectively. \(\xi\) is Gaussian noise with zero mean and unit variance. Note that the passive dynamics \(\varvec{a}(x)\) is a nonlinear vector function of \(x\) while \(\varvec{B}\) is a constant vector. In this simulation, the physical parameters were \(l = 1\) \(\mathrm {m}\), \(g = 9.8\) \({\mathrm{m/s}}^2\), \(\kappa = 0.05\) \({\mathrm{kg\ m}}^2/{\mathrm{s}}\) and \(\Delta t =10\) ms. The noise scale was set to \(\sigma = 4\). The mass of the pendulum was used as a parameter to change the dynamics.
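For reference, a single Euler step of the discretized dynamics (13) with the parameters above might be simulated as follows; this is a sketch written directly from Eq. (13), not the authors' simulation code.

```python
import numpy as np

l, g, kappa, dt, sigma = 1.0, 9.8, 0.05, 0.01, 4.0   # parameters from the text

def step(x, u, m=1.0, rng=None):
    """One Euler step of Eq. (13); x = [theta, omega], u is the applied torque."""
    rng = np.random.default_rng() if rng is None else rng
    theta, omega = x
    a = np.array([omega, m * (g / l) * np.sin(theta) - kappa * omega])  # passive dynamics a(x)
    B = np.array([0.0, 1.0])                                            # input vector B
    xi = rng.standard_normal()                                          # unit Gaussian noise
    return x + (a + B * (u + sigma * xi)) * dt
```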

The state cost was defined so that it was zero at the goal state, using the following unnormalized Gaussian function:

$$\begin{aligned} q(x) = k\left( 1 - \exp \left( -x^{\mathrm {T}}\mathbf {\Sigma }_{\mathrm {cost}}^{-1} x\right) \right) , \end{aligned}$$
(14)

where k and \(\mathbf {\Sigma }_{\mathrm {cost}}\) denote the scale of the state-dependent cost and the covariance matrix of the Gaussian function, respectively. They were set to \(k=2.5\) and \(\mathrm {diag}(\mathbf {\Sigma }_{\mathrm {cost}})= [\pi /4,4\pi /4]^{2}\).
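A direct transcription of Eq. (14) with these settings is sketched below; it reuses only the quantities stated in the text.

```python
import numpy as np

k = 2.5
Sigma_cost = np.diag(np.array([np.pi / 4, 4 * np.pi / 4]) ** 2)  # diag(Sigma_cost) = [pi/4, 4pi/4]^2

def state_cost(x):
    """State cost of Eq. (14): zero at the goal x = 0 and saturating at k far from it."""
    return k * (1.0 - np.exp(-x @ np.linalg.inv(Sigma_cost) @ x))
```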

The set of collocation states was distributed uniformly over the state space (\(N_s = 1806\)). In the simulation, only the weight parameters \(\varvec{w}\) were optimized. The centers \(\varvec{m}_i\) of the basis functions were also distributed uniformly over the state space (\(N_z=441\)), while the precision matrices \(\varvec{S}_i\) were determined empirically and set to \(\mathrm {diag}([\pi /20,\;\pi /20])^{-2}\).

4.2 Experimental results when the training model was different from the test model

We evaluated the robustness of the control policies obtained by the LMG framework when the test model differs from the training model. The mass of the pole was set to 1 kg in the training model, while in the test model it was varied from 1 to 2.5 kg in steps of 0.1 kg. For \(\alpha \in \{0, 0.5, 0.75, 0.9, 0.95, 0.99, 1 \}\), the desirability functions and the corresponding control policies were obtained by solving the linearized HJI equation. We then ran simulations with the obtained control policies under these test conditions. We tested 50 episodes; each episode started from the bottom position of the pendulum and was terminated when the controller brought the pole to the goal state or the elapsed time exceeded 100 s (10,000 steps). The results are summarized in Fig. 4.

As the mass of the pole increased, the time needed for the swing-up increased and the success rate decreased for the control policies obtained with every value of \(\alpha\). In other words, the performance of all obtained control policies deteriorated. However, the rate of deterioration depended on the value of \(\alpha\). In the LMDP setting (\(\alpha = 0\)), the performance of the obtained control policy deteriorated rapidly as the mass of the pole increased; as expected, the obtained control policy could not adapt to the change in the dynamics. On the other hand, contrary to our expectations, LMG with \(\alpha = 1\) did not find the most robust controller. With \(\alpha = 0.75\), LMG found the controller most robust against the change in the mass of the pole in this simulation. Note that the performance was best for \(\alpha = 0.75\) even when the mass of the pole was 1 kg. In contrast to the results of the discrete problem discussed in Sect. 3.2, the performance of the optimal controller obtained by LMDP was worse than that of the LMG-based robust controller. One reason why the LMDP-based controller failed is that the error in the function approximation of the desirability and value functions was not considered. In addition, numerical errors become dominant when \(\alpha\) is close to 1 in the continuous problems: although the log-sum-exp technique can be used in discrete problems, it is difficult to evaluate the expectation in Eq. (6) for continuous problems. Another reason is that the same parameters \(\varvec{\Theta }\) of the function approximator (11) were used even though the linearized HJI equation (7) for \(\alpha = 1\) differs from Eq. (6) for \(0 \le \alpha < 1\).

4.3 Integration with model learning

Next, we investigate how the modeling error of the state transition probability affects the performance. A simple least-squares method is adopted for model learning, in which \(\Delta x\) in Eq. (13) is approximated by \(\Delta x\approx \mathbf {W}\varvec{\phi }(x,u)\), where \(\mathbf {W}\) is a weight matrix to be optimized and \(\varvec{\phi }(x, u)\) is a vector of basis functions. Two types of basis functions were prepared. The first is a simple linear model with respect to \(x\) and \(u\), with \(\varvec{\phi }(x, u) = [x^{\mathrm {T}}, u, 1]^{\mathrm {T}}\). The other is a linear-NRBF (normalized radial basis function) model defined by \(\varvec{\phi }(x, u) = [x^{\mathrm {T}}, \psi _1(x), \ldots , \psi _M(x), u, 1]^{\mathrm {T}}\), where \(\psi _i (x)\) is given by:

$$\begin{aligned} \psi _i (x) = \frac{\exp \left( - \frac{1}{2} (x- \varvec{\mu }_i)^{\mathrm {T}}\varvec{\Sigma }_{\psi _i}^{-1} (x- \varvec{\mu }_i) \right) }{\sum _k \exp \left( - \frac{1}{2} (x- \varvec{\mu }_k)^{\mathrm {T}}\varvec{\Sigma }_{\psi _k}^{-1} (x- \varvec{\mu }_k) \right) }, \end{aligned}$$

where \(\varvec{\mu }_i\) and \(\varvec{\Sigma }_{\psi _i}\) denote the center and the covariance matrix, respectively. The centers \(\{ \varvec{\mu }_i \}\) were determined by K-means clustering of the training data \(\mathcal {D}\), while \(\{ \varvec{\Sigma }_{\psi _i} \}\) were tuned manually. The training data \(\mathcal {D}\) were extracted from sample data acquired under a random control policy.
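A least-squares fit of the weight matrix \(\mathbf {W}\) for the two feature models might look like the following sketch; for brevity a single shared covariance is used for all NRBF bases, and the centers are assumed to come from a separate K-means step, so the details differ from the manually tuned \(\varvec{\Sigma }_{\psi _i}\) described above.

```python
import numpy as np

def linear_features(X, U):
    """phi(x, u) = [x^T, u, 1]^T for the simple linear model."""
    return np.hstack([X, U[:, None], np.ones((X.shape[0], 1))])

def nrbf_features(X, U, centers, cov):
    """phi(x, u) = [x^T, psi_1(x), ..., psi_M(x), u, 1]^T for the linear-NRBF model."""
    d = X[:, None, :] - centers[None, :, :]                          # (N, M, dim)
    e = np.exp(-0.5 * np.einsum('nmj,jk,nmk->nm', d, np.linalg.inv(cov), d))
    psi = e / e.sum(axis=1, keepdims=True)                           # normalized RBFs
    return np.hstack([X, psi, U[:, None], np.ones((X.shape[0], 1))])

def fit_model(Phi, dX):
    """Least-squares fit so that dX ~ Phi @ W_ls; the text's W corresponds to W_ls.T."""
    W_ls, *_ = np.linalg.lstsq(Phi, dX, rcond=None)
    return W_ls
```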

Figure 5 shows the MSE of the angle and angular velocity components. The angle component was approximated quite accurately by both approximators because its state transition is linear. On the other hand, the angular velocity component was approximated less accurately by the linear model than by the linear-NRBF model because the linear model cannot represent the nonlinear state transition.

Fig. 5 The modeling error of each component: we randomly extracted N = 500 samples as a test data set and calculated the approximated state transition \(\Delta x\) with each of the two models. We then computed the mean squared error (MSE) of each component

The optimal policies derived from the models described above were compared with those derived from the true model for \(\alpha \in \{ 0, 0.75, 0.95, 1 \}\), as in Sect. 4.2. Figure 6a compares the estimated desirability functions of the three models for \(0 \le \alpha < 1\) and the estimated value functions for \(\alpha = 1\). As \(\alpha\) approached 1, the obtained desirability functions became flatter, because the coefficient of the state cost in Eq. (6) becomes small as \(\alpha\) approaches 1. Since the linear-NRBF model was sufficiently accurate, as shown in Fig. 5, there were no significant differences in the desirability and value functions between the true model and the linear-NRBF model for any \(\alpha\). On the other hand, the linear model produced slightly different functions. Figure 6b shows the optimal policies and typical trajectories. For the true and linear-NRBF models, the number of swings gradually increased as \(\alpha\) approached 1, while the controller derived from the linear model showed different behaviors.

Fig. 6 a Solutions of the linearized HJI equation. The obtained value function \(v(x)\) is shown for \(\alpha = 1\) and the desirability function \(z(x; \alpha )\) for \(0 \le \alpha < 1\). b Control policies derived by Eq. (12); the lines plotted on the panels are trajectories generated by the resulting control policies starting from the bottom position of the pendulum. The left, middle, and right panels show the results for the true model, the linear model, and the linear-NRBF model, respectively

To evaluate the relationship between the control performance and the value of \(\alpha\) in more detail, we ran simulations with the obtained control policies. We tested 50 episodes; each episode started from the bottom position of the pendulum and was terminated when the controller brought the pole to the goal state or the elapsed time exceeded 200 s (20,000 steps).

The experimental results are summarized in Fig. 7. In the LMDP setting (\(\alpha = 0\)), the control policy obtained from the linear model, which had a large approximation error, took a much longer time to swing up than the other control policies. Surprisingly, the performance with the true model was slightly worse than that with the linear-NRBF model. As discussed in Sect. 4.2, the error in the function approximation must be considered even when the true model is used; the control policy obtained with \(\alpha = 0\) deteriorated due to this approximation error. Nevertheless, this gap in the control performance became small as the value of \(\alpha\) increased. Consequently, the time needed for the swing-up was almost the same among all obtained control policies when \(\alpha = 0.75\) or 0.9. However, the gap became large again when \(\alpha\) exceeded 0.9, and for \(\alpha = 1\) the linear model showed the worst performance.

5 Conclusion

We evaluated the robustness of the control policies obtained by LMDP and LMG on discrete and continuous state-action tasks. Our simulation results suggest that LMDP is useful only when the environment can be regarded as deterministic and discrete, whereas LMG was effective even when the training environment differed from the test environment. Furthermore, according to the results of the swing-up task, the performance of the LMG-based controller was improved by choosing \(\alpha\) appropriately when an inaccurate linear model was used to approximate the environmental dynamics.

As discussed above, the error in function approximation matters in continuous problems, whereas the desirability function can be computed precisely in discrete problems. In the discrete case, the desirability function is represented exactly in tabular form and is computed by solving a generalized eigenvalue problem, which is numerically stable. Therefore, it was appropriate to choose the largest value, \(\alpha =1\), to obtain a control policy that is robust to modeling errors. In the continuous case, on the other hand, we must consider the effects of the approximation and estimation errors of the desirability function, and the value of \(\alpha\) should therefore be chosen carefully to obtain a robust control policy. How to choose an appropriate value of \(\alpha\) remains as future work.

Fig. 7 Elapsed time steps needed for the swing-up. Each line corresponds to the mean over trials for one of the models used to solve the HJI equation and derive the control policy