# Probabilistic inference for determining options in reinforcement learning

## Abstract

Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi Markov decision process setting (SMDP) and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler sub-policies as well as the initiation and termination probabilities for each of those sub-policies. While existing option learning algorithms frequently require manual specification of components such as the sub-policies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple sub-policies to solve complex tasks and can improve learning performance on simpler tasks.

### Keywords

Reinforcement learning Robot learning Options Semi Markov decision process## 1 Introduction

Solving tasks which require long decision sequences or complex policies is an important challenge in reinforcement learning (RL). The option framework (Sutton et al. 1999) is a promising approach to simplify the complexity of such tasks. In the option framework, a reinforcement learning (RL) agent can choose between actions and macro-actions, which are carried out over multiple time steps (Parr and Russell 1998; Sutton et al. 1999).

Using these macro-actions, the agent has to make less decisions to solve a task. Furthermore, even if macro-actions are based on simple policies, the combination of multiple macro-actions can represent more complex solutions than the simple policies would allow for on their own. For example, if a given task requires a nonlinear policy, a combination of multiple linear sub-policies might still be able to solve this task. Such an automated decomposition of complex solutions can simplify the learning problem in many domains.

Based on the SMDP setting (Puterman 1994), the option framework (Sutton et al. 1998) incorporates such macro actions. An option consists of a sub-policy, an initiation set and a termination probability. After an option is initiated, actions are generated by the sub-policy until the option is terminated and a new option is activated. While the option framework has received considerable attention (Parr and Russell 1998; Sutton et al. 1999; Dietterich 2000), to date most algorithms either require the manual specification of the activation policies or sub-policies. Algorithms for autonomous option discovery typically depend on discrete state-action spaces (McGovern and Barto 2001b; Mann and Mannor 2014; Simsek et al. 2005) with exceptions such as the work of Konidaris and Barto (2009). Furthermore, many existing algorithms first explore a given MDP and learn suitable options afterwards (Menache et al. 2002; Simsek and Barto 2008). Hence, they are not aimed at leveraging the efficiency of options in the initial stages of learning but rather aim at transferring the options to new tasks. These approaches are powerful in scenarios where options can be transferred to similar tasks in the future. In contrast, the approach suggested in this paper aims at directly learning suitable options for the problem at hand while also being applicable in continuous state-action spaces.

In continuous state-action spaces, policy search (PS) methods which optimize parametrized policies have been shown to learn efficiently in simulated and real world tasks (Ng et al. 1998; Kober and Peters 2010). Thus, the compatibility of the proposed option discovery framework with PS methods such as PoWER (Kober and Peters 2010) and REPS (Peters et al. 2010) is an important goal of this paper. In the discrete setting, the framework can equally be combined with a wide range of methods such as as Q-Learning (Christopher 1992) and LSPI (Lagoudakis and Parr 2003). Furthermore, many complex tasks can be solved through combinations of simple behavior patterns. The proposed framework can combine multiple simple sub-policies to achieve complex behavior if a single sub-policy would be insufficient to solve the task. These simpler sub-policies are easier to learn which can improve the overall learning speed.

- 1.
It infers a data-driven segmentation of the state-space to learn the initialization and termination probability for each option.

- 2.
It is applicable to discrete as well as continuous state-action spaces.

- 3.
It outperforms monolithic algorithms on discrete tasks and can solve complex continuous tasks by combining simpler sub-policies.

### 1.1 Problem statement

*b*encodes a binary termination event and \({ \bar{ o } }\) is the index of the previously active option. Furthermore, one global activation policy \({\pi ( o | {\varvec{s}} ) }\) governs the activation of options following a termination event. This activation policy \({\pi ( o | {\varvec{s}} ) }\) replaces the initiation set used in classical algorithms and defines a probability of initiating an option in a given state. A new option can only be activated if the previously active option \({ \bar{ o } }\) is terminated. Thus, the agent uses the same option for multiple time steps if it is not terminated. The option transition model is given as

In Sect. 2, we show how to learn a hierarchical policy defined by the options given a set of demonstrated trajectories, i.e., how to solve the imitation learning problem. We formulate the option framework as a graphical model where the index of the executed option is treated as latent variable. Subsequently, we show how an EM algorithm can be used to infer the parameters of the option components. In Sect. 3 we extend the imitation learning solution to allow for reinforcement learning, i.e., for iteratively improving the hierarchical policy such that it maximizes a reward function.

## 2 Learning options from data

The goal of this paper is to determine sub-policies, termination policies and the activation policy from data with minimal prior knowledge. All option components are represented by parametrized distributions, governed by the parameter vector \({\varvec{\theta }}= \{ {\varvec{\theta }}_A, {\varvec{\theta }}_O, {\varvec{\theta }}_B \}\). The individual components are given as binary classifiers for the termination policies for each option \(\pi (b| {\varvec{s}} , o =i; {\varvec{\theta }}_B^i)\), one global multi-class classifier for the activation policies \({ \pi ( o | {\varvec{s}} ; {\varvec{\theta }}_O)} \) and the individual sub-policies \(\pi ( {\varvec{a}} | {\varvec{s}} , o =i;{\varvec{\theta }}_A^i)\). We use the notation \(\pi (\cdot )\) to denote option components, which we will ultimately aim to learn and \(p(\cdot )\) for all other distributions. Our goal is to estimate the set of parameters \({\varvec{\theta }}= \{{\varvec{\theta }}_A, {\varvec{\theta }}_O, {\varvec{\theta }}_B \}\), which explain one or more demonstrated trajectories \(\tau = { \{\tau _1, \dots , \tau _T \} }\) with \(\tau _t = \{ {\varvec{s}} _t, {\varvec{a}} _t \}\). Crucially, the observations \(\tau \) do not contain the option indices \( o \) nor the termination events *b* but only states \( {\varvec{s}} \) and actions \( {\varvec{a}} \). Both, the option index and the termination events are latent variables. While the proposed method learns all option components from data, it requires a manual selection of the total number of desired options. Future work could replace this requirement by, for example, sampling the number of options through a Dirichlet process.

### 2.1 The graphical model for options

*b*. According to Eq. (1) the previously active option \({ \bar{ o } }\) remains active if no termination occurs. Otherwise, the activation policy \({ \pi ( o | {\varvec{s}} ; {\varvec{\theta }}_O)} \) samples a new option index

*o*based on the current state \( {\varvec{s}} \). Finally, the sub-policy \(\pi ( {\varvec{a}} | {\varvec{s}} , o ;{\varvec{\theta }}_A)\) samples an action \( {\varvec{a}} \) based on the state \( {\varvec{s}} \).

### 2.2 Expectation maximization for options

The graphical model for the option framework is a special case of a hidden Markov model (HMM). The Baum–Welch algorithm (Baum 1972) is an EM algorithm for estimating the parameters of a HMM. We will now state the Baum–Welch algorithm for our special case of the option model, where we consider the special case of a single trajectory for improved clarity. The extension to multiple trajectories, however, is straightforward.

#### 2.2.1 Expectation step

#### 2.2.2 Maximization step

Given the distributions over latent variables and the observed state-action samples, the parameters \({\varvec{\theta }}\) can be determined by maximizing Eq. (5). Since \(Q({\varvec{\theta }}, {\varvec{\theta }}^\text {old} )\) is decoupled, independent optimization can be performed for the sub-policies, termination policies and the activation policy.

*Termination policies*

*t*and a single option \(o_t = i\). This term can be rewritten as

*Activation policy*The case of the activation policy \({ \pi ( o | {\varvec{s}} ; {\varvec{\theta }}_O)} \) follows the argument of the termination policies. Similarly to the termination policies, we consider the relevant term of the lower bound

*Sub-policies*

*Feature representations of the state* The above derivations are given in their most general form, where each option component depends directly on the state \( {\varvec{s}} \). In practice, it may often be beneficial to train the individual components on a feature transformation \(\varvec{\phi } ( {\varvec{s}} )\) of the state. Such features might, for example, be polynomial expansion of the state variable or a kernelized representation. When using feature representations, different representations can be chosen for the individual option components.

## 3 Probabilistic reinforcement learning for option discovery

### 3.1 Probabilistic reinforcement learning algorithms

There exist several algorithms which use probabilistic inference techniques for computing the policy update in reinforcement learning (Dayan and Hinton 1993; Theodorou et al. 2010; Kober and Peters 2010; Peters et al. 2010). More formally, they either re-weight state-action trajectories or state-action pairs according to the estimated quality of the state-action pair and, subsequently, use a weighted maximum likelihood estimate to obtain the parameters of a new policy \(\pi ^*\).

For discrete environments, we can equally employ standard reinforcement learning techniques to obtain the Q-function \(Q( {\varvec{s}} , {\varvec{a}} )\) and value function \(V( {\varvec{s}} )\). In our experiments, we employed standard Q-learning (Christopher 1992) and LSPI (Lagoudakis and Parr 2003) to obtain those quantities. In the discrete case, the temperature parameter \(\eta \) was set to 1.

### 3.2 Combining EM and probabilistic reinforcement learning

As we have seen, the only difference between imitation learning and probabilistic reinforcement learning algorithms is the use of a weighted maximum likelihood (ML) estimate instead of a standard ML estimate. We can now combine the expectation maximization algorithm for discovering parametrized options with probabilistic reinforcement learning algorithms by weighting each time step in the maximization step of the EM algorithm.

Learning options from experience. Termination events, options and actions are sampled from from the current policies. Subsequently, the distribution over latent variables is computed and weights \(f_\text {RL}\) are proposed by the RL algorithm The next policies are determined according to the update equations in the method Sect.

The information flow of the proposed algorithm is shown in Table 1.

## 4 Related work

Options as temporally extended macro-actions were introduced by Sutton et al. (1999). While previous research leveraged the power of temporal abstraction (Kaelbling 1993; Parr and Russell 1998; Sutton et al. 1999), such efforts did not improve the sub-policies themselves. Improving the sub-policies based on the observed state-action-reward sequences is known as intra-option learning. Intra-option learning is a consequence of having Markov options and allows for updating all options that are consistent with an experience. While it is a desired property of option learning methods, it is not realized by all existing methods. Sutton et al. (1998) showed that making use of intra-option learning can drastically improve the overall learning speed. Yet, the algorithms presented by Sutton et al. (1998) relied on hand coded options and were presented in the discrete setting.

Options are also used in many hierarchical RL approaches, where they either extend the action space or are directly extended to sub-tasks, where the overall problem is broken up into potentially simpler sub-problems. Dietterich (2000) proposed the MAXQ framework which uses several layers of such sub-tasks. However, the structure of these sub-tasks needs to be either specified by the user Dietterich (2000), or they rely on the availability of a successful trajectory (Mehta et al. 2008). Barto et al. (2004) rely on artificial curiosity to define the reward signal of individual sub-tasks, where the agent aims to maximize its knowledge of the environment to solve new tasks quicker. This approach relies on salient events which effectively define the sub-tasks.

Stolle and Precup (2002) first learn a flat solution to the task at hand and, subsequently, use state visitation statistics to build the option’s initiation and termination sets. Mann and Mannor (2014) apply the options framework to value iteration and show that it can speed up convergence.

Option discovery approaches often aim to identify so called bottleneck states, i.e., states the agent has to pass pass through on its way from start to goal. McGovern and Barto proposed to formulate this intuition as a multiple-instance learning problem and solve it using a diverse density method (McGovern and Barto 2001a). Other approaches aim to find such bottleneck states using graph theoretic algorithms. The Q-Cut (Menache et al. 2002) and L-Cut (Simsek and Barto 2008) build transition graphs of the MDP and solve a min cut problem to find bottleneck states. Silver and Ciosek (2012) assume a known MDP model to propose an option model composition framework, which can be used for planning while discovering new options. Niekum and Barto (2011) present a method to cluster subgoals discovered by existing subgoal discovery methods to find skills that generalize across tasks. In the presented paper, we do not assume knowledge of the underlying MDP and, further, present a framework which is also suitable for the continuous setting.

In continuous state-action settings, several sub-task based approaches have been proposed. Ghavamzadeh and Mahadevan (2003) proposed the use of a policy gradient method to learn subtasks while the selection of the sub-tasks is realized through Q-Learning. Morimoto and Doya (2001) proposed to learn how to reach specific joint configurations of a robot as sub-tasks, such that these options can later be combined to learn more complicated tasks. In both approaches, the sub-tasks have to be pre-specified by the user. Wingate et al. (2011) use policy priors to encode desired effects like temporal correlation. Levy and Shimkin (2012) propose to extend the state space by the option index, which allows for the use of policy gradient methods. In our proposed method, this option index is a latent variable which is inferred from data. The use of a latent variable allows our methods to update all options with all relevant data points. Konidaris and Barto (2009) use the option framework to learn chaining of skills. This approach requires that the agent can reach the goal state before constructing the series of options leading to the goal state.

A concept similar to the options framework has been widely adapted in the field of robot learning. There, temporal abstraction is achieved through the use of movement primitives (Paraschos et al. 2013; Da Silva et al. 2012). Instead of learning policies over state-torque mappings to control robots, the agent learns parameters of a trajectory generator (Kober and Peters 2010; Kajita et al. 2003). Based on a Beta-Process Autoregressive HMM proposed by Fox et al. (2009), Niekum et al. (2012) proposed a method to segment demonstrated trajectories into a sequence of primitives, addressing the imitation learning problem. Rosman and Konidaris (2015) extend the work of Fox et al. (2009) to allow skill discovery in the inverse reinforcement learning setting. There, the task is to recover reward functions which lead to a skill based solution of a task. Compared to the proposed method, methods based on or similar to the Beta-Process Autoregressive HMM allow to extract skills or segments without a priori knowledge of the total number of skills. However, such methods are computationally expensive and have not been shown to work in the loop together with reinforcement learning methods.

## 5 Evaluation

The evaluation of the proposed framework is separated in two parts. We first evaluated the imitation learning capabilities and, subsequently, proceeded to evaluate different reinforcement learning tasks as well as comparing the proposed methods to other option learning frameworks.

### 5.1 Imitation learning

We started our evaluations with an imitation learning task. The evaluation of the imitation learning capabilities allowed us to ensure that the foundation of the proposed framework performs as expected.

To generate observations, we provided a hand-coded policy which is shown in Fig. 2a. This policy was designed to generate noisy actions within \(\pm 300\) Nm. However, the torques on the system were capped to the torque limit of 30 Nm. The much larger range of desired torques was chosen to simplify the programming of the controller. This policy could successfully perform a pendulum swing-up if the pendulum was initially hanging down with zero velocity. An example trajectory generated by this hand coded policy is shown in Fig. 2b.

Based on five observed trajectories, we used the proposed framework to learn a policy with three options. However, the resulting policy, shown in Fig. 3a, learned to reproduce an effective policy using only two of the three available policies. Generally, we would not necessarily expect that the options recovered by the proposed algorithm exactly match the options of the demonstrator. The algorithm only aims at reproducing the observed behavior but is free to choose the internal structure of the hierarchical policy. Equally, in the RL case, we would not expect the proposed algorithm to learn options that are ‘human-like’. Specifically, we would not expect that a robotic agent would use the same solution decomposition as a human operator.

Figure 3a shows that the trajectories generated with this imitated policy closely resemble the observed trajectories. Figure 4 shows the inferred termination policies of the different options. The results show that options will only be terminated once they are outside of their region of ‘expertise’. Option one has a high termination probability in most regions, however, Fig. 3a shows that the final policy is actually not using this option. Figure 3b shows the development of the log likelihood of the observed data under the imitated policy. The results show that convergence typically has been reached after about five iterations of the EM algorithm. In the imitation learning case, the learned solution is only valid if the system is initiated in a state which is similar to those states that were previously demonstrated. Outside of this region, the imitation learning solution will not be able to successfully solve the task and a reinforcement learning solution is required. In the proposed method, the activation policy will learn to initiate sub-policies according to their responsibility of state-space region. If, for example, the sub-policies are modeled as Gaussians and have infinite support, they have non-zero responsibility for all states and could, theoretically, be activated. If, however, a different class of probability distributions is more task-appropriate and has limited support, it would not be activated outside of its support region.

### 5.2 Reinforcement learning

For all evaluations, we tested each setting ten times and report mean and standard deviation of the results.

#### 5.2.1 Discrete tasks

The discrete environments were given by three different gridworlds as shown in Figs. 5b, d and f. The first world shown in Fig. 5b represents the two rooms scenario (McGovern and Barto 2001a), where the agent has to find a doorway to travel between two rooms. In the second world shown in Fig. 5b, the agent has to traverse two elongated corridors before entering a big room in which the target is located. Finally, in the third world show in Fig. 5f no traditional bottleneck states appear, but the agent has to navigate around two obstacles. Furthermore, in this world optimal paths can lead around either side of the first obstacle.

In all experiments, the agent started in the lower level corner of the respective grid and had to traverse the grid while avoiding two obstacles to reach the goal at the opposite end. The actions available to the agent were going up, left, right and down. Transitions were successful with a probability of 0.8. Unsuccessful transitions had a uniform probability of ending up in any of the neighboring cells. The transition to each accessible field but the goal field generated a reward signal of \(-1 \). After reaching the goal, the agent received a reward of \(+1\) for every remaining step of the episode, where each episode had a length of 500 time steps. If the agent tried to walk into an obstacle or to leave the field, it remained in the current position.

For the discrete tasks we used a tabular feature encoding for the flat policy. For the hierarchical policy, we used the tabular features for the activation and termination policies. The sub-policies were state-independent multinomial distributions over the four actions.

*Comparison to existing methods* In the comparative results to related work we follow the therein established method of reporting ‘steps to goal’ as qualitative measure. In the remaining evaluations of our algorithm we report the average return, which may in some cases be more informative since our reward functions also punish ‘falling off’ the board. The results in Fig. 5 show that in all experiments the proposed framework learned solutions faster than both the Q-Cut as well as the L-Cut methods. Comparing the use of Q-Learning and LSPI in the proposed framework, the results show that LSPI leads to convergence considerably faster than Q-Learning. Since the structure of the individual action policies learned by the proposed approach was given simply as a distribution over the four possible actions, the converged sub-policies usually always select only one action. This simplicity of the sub-policies is a key factor to accelerate the overall learning speed. While we do not present results of comparisons to primitive-based methods such as, for example, using Q-Learning directly, both of the methods that we did compare to have shown to outperform Q-Learning. Thus, we compared to such primitive-based methods indirectly. In our experience, both Q-Cut and L-Cut outperform Q-Learning when not using experience replay. However, in our internal evaluations on the tasks presented in this paper, Q-Learning with experience replay resulted in performance levels similar to Q-Cut and L-Cut, but worse than the proposed method.

*Influence of available options* After comparing to existing methods, we further evaluated the properties of the proposed framework. All remaining evaluations were performed using LSPI in the obstacle world. In our experience, these results were representative of both using Q-Learning as well as performing them in different tasks. Figure 6a shows the influence of available options. In theory this task can be solved optimally with only two options, where one will always go right and one will always go up. However, the results show that making more options available to the algorithm improved both asymptotic performance as well as speed of convergence. Adding more than 20 options did not further increase the performance.

*Influence of EM iterations*The proposed EM style of computing policy updates can be costly and, generally, requires several iterations as also seen in the imitation learning case in Fig. 3b. However, in the RL setting, subsequent policies are expected to be similar and, thus, a small number of EM iterations should be sufficient. The results in Fig. 6b show that this intuition holds true in our evaluations. In our experience, performing even more EM iterations did not change the result. However, when using very few rollouts per iteration, the effect of performing multiple EM iterations can be more pronounced. However, even when a small number of EM iterations is sufficient, introducing these additional computations results in a noticeable computational overhead in continuous task. Here, especially the number of options is an important factor influencing the runtime of the EM algorithm. In our experience, the addition of the EM algorithm would lead to approximately twice the overall computational requirements. However, this factor is of course highly dependent on the RL method used. In the discrete tasks that we evaluated, the computational overhead due to the EM iterations was negligible.

*Influence of termination events* Finally, we evaluated the influence of the probabilistic terminations. In the proposed framework, the sub-policies have to be initialized and, thus, a prior for the termination policies has to be set. Figure 7a shows the effect of changing this prior. The results show that the proposed framework is robust to wide range of these initializations.

We also evaluated the effect of disabling the probabilistic termination sub-policies. In this case, the algorithm could still learn multiple options but no termination policies. Thus, each option could not be active for more than one time step but terminated after every step. The results in Fig. 7b show that learning without terminations slowed down the convergence speed. In our experience, this effect was strongly linked to the stochasticity of the transitions. The higher this stochasticity was, the stronger the benefit of the termination policies became.

#### 5.2.2 Continuous task

The results in Fig. 8a show that while a single linear policy was insufficient to solve this task, it could be solved using two options. Adding more options further improved the resulting hierarchical policy. The visualization of the resulting policy in Fig. 8b shows that with more options, the algorithm learned a control scheme where options two and three were used to swing up the pendulum, and options one and four incorporated a linear stabilization scheme around the upright position of the pendulum. Figure 9a shows a trajectory generated by the resulting policy. Starting from the bottom, the pendulum was first accelerated by options two and three. The plot shows that in-between options two and three, option four was active for a few time steps. However, the kinetic energy at that point was insufficient to fully swing up the pendulum. After the pendulum almost reached the upright position around time step 40, the stabilizing options took over control. Since all option components are stochastic distributions, some option switches still occur even after the pendulum is stabilized. Since the effect of switching into a different option in the stable position for a single time step could easily be balanced by activating the stabilizing option in the next time step, the agent did not have a strong incentive to learn to avoid such behavior. Letting the algorithm run for more iterations might further improve this behavior.

#### 5.2.3 Limitations

While the experiments show that the presented method worked well in the scenarios that were evaluated, we also want to make explicit the assumptions made in this paper. Primarily, the proposed method expects that the number of required options is known a priori. In our experience, this requirement is rather benign in practice, as the algorithm can be initialized with an excessive amount without deterioration in quality of the solution. However, adding more options does increase the computational requirements. Thus, approaches for automatically generating a task-appropriate number of options is an important aspect of future work. Furthermore, we introduced a damping factor \(\alpha =0.1\) on the policy update for the discrete setting, which we found to be especially important when using Q-Learning as the underlying RL method. In our experience, the recommended value of \(\alpha \) depends on the RL method used as well as the task under consideration. Methods such as LSPI will generally work well with larger values of \(\alpha \).

## 6 Conclusion and future work

In this paper, we presented a method to estimate the components of the option framework from data. The results show that the proposed method is able to learn options in the discrete and continuous setting. In the discrete setting, the algorithm performs better than two related option-discovery algorithms which are based on exploiting bottle-neck states. Instead of relying on bottle-neck states, the proposed algorithm achieves its performance by combining options with simpler sub-policies.

In the continuous setting, the results show that the algorithm is able to solve a non-linear task using a combination of options with only linear sub-policies. In this setting, a single linear policy is insufficient for solving the task. Furthermore, the framework allows for parametrized policies and, thus, state-of-the-art policy search methods developed for flat policies can be used to learn hierarchical policies.

The presented approach infers the option’s structure, such as the activation policy and termination policies, from data. However, the number of options still has to be set a-priori by the practitioner. While the results show that setting a relatively large number of options typically yields good performance, learning the number of options is an important aspect of future work. Finally, while the presented framework estimates the most likely termination policies, finding a way of enforcing fewer terminations might further improve learning performance.

## Notes

### Acknowledgments

The research leading to these results has received funding from the DFG SPP ‘autonomous learning’ under the project ‘LearnRobots’.

### References

- Barto, A.G., Singh, S, and Chentanez, N. (2004). Intrinsically motivated learning of hierarchical collections of skills.
*Proceedings of the International Conference on Developmental Learning (ICDL)*.Google Scholar - Baum, L. E. (1972). An equality and associated maximization technique in statistical estimation for probabilistic functions of markov processes.
*Inequalities*,*3*, 1–8.Google Scholar - Bishop, Christopher M. (2006).
*Pattern recognition and machine learning (information science and statistics)*. New York: Springer.MATHGoogle Scholar - Da Silva, B., Konidaris, G., and Barto, A.G. (2012). Learning parameterized skills. In
*Proceedings of the International Conference on Machine Learning (ICML)*.Google Scholar - Daniel, C., Neumann, G., and Peters, J. (2012). Hierarchical relative entropy policy search. In
*Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS)*.Google Scholar - Daniel, C., Neumann, G., Kroemer, O., and Peters, J. (2013). Learning sequential motor tasks. In
*Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)*.Google Scholar - Dayan, P., and Hinton, G. E. (1993). Feudal reinforcement learning. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.),
*Advances in neural information processing systems*(pp. 271–278). Los Altos: Morgan Kaufmann Publishers.Google Scholar - Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition.
*Journal of Artificial Intelligence Research (JAIR)*,*13*, 227–303.MathSciNetMATHGoogle Scholar - Fox, E. B., Jordan, M. I., Sudderth, E. B., and Willsky, A. S. (2009). Sharing features among dynamical systems with beta processes. In
*Advances in Neural Information Processing Systems (NIPS)*, pp. 549–557.Google Scholar - Ghavamzadeh, M., and Mahadevan, S. (2003). Hierarchical policy gradient algorithms. In
*Proceedings of the International Conference for Machine Learning (ICML)*.Google Scholar - Kaelbling, L. P., (1993). Hierarchical learning in stochastic domains: Preliminary results. In
*Proceedings of the International Conference on Machine Learning (ICML)*.Google Scholar - Kajita, S., Kanehiro, K., and Kaneko, F., Fujiwara, K., Harada, K., Yokoi, K., and Hirukawa, H. (2003). Biped walking pattern generation by using preview control of zero-moment point. In
*Proceedings of the IEEE International Conference of Robotics and Automation (ICRA)*.Google Scholar - Kober, J., and Peters, J. (2010). Policy search for motor primitives in robotics.
*Machine Learning*,*84*, 1–33.Google Scholar - Konidaris, G., and Barto, A. (2009). Skill discovery in continuous reinforcement learning domains using skill chaining. In
*Advances in Neural Information Processing Systems (NIPS)*.Google Scholar - Konidaris, G., Osentoski, S., and Thomas, P.s. (2011). Value function approximation in reinforcement learning using the fourier basis.
*Conference on Artificial Intelligence (AAAI)*.Google Scholar - Lagoudakis, M., & Parr, R. (2003). Least-squares policy iteration.
*Journal of Machine Learning Research*,*4*, 1107–1149.MathSciNetMATHGoogle Scholar - Levy, K. Y., and Shimkin, N. (2012). Unified inter and intra options learning using policy gradient methods. In S. Sanner & M. Hutter (Eds.),
*Recent advances in reinforcement learning*(pp. 153–164). New York: Springer.Google Scholar - Mann, T. A., and Mannor, S. (2014) Scaling up approximate value iteration with options: Better policies with fewer iterations. In
*Proceedings of the International Conference on Machine Learning (ICML)*.Google Scholar - McGovern, A., and Barto, A. G. (2001a). International conference on machine learning (icml).
*Computer Science Department Faculty Publication Series*,*8*, 361–368.Google Scholar - McGovern, A., and Barto, A. G. (2001b). Automatic discovery of subgoals in reinforcement learning using diverse density.
*International Conference on Machine Learning (ICML)*, pp. 8.Google Scholar - Mehta, N., Ray, S., Tadepalli, P., and Dietterich, T. G. (2008). Automatic discovery and transfer of MAXQ hierarchies. In
*Proceedings of the International Conference on Machine Learning (ICML)*.Google Scholar - Menache, I., Mannor, S., and Shimkin, N. (2002). Q-cut-dynamic discovery of sub-goals in reinforcement learning. In
*Proceedings of the European Conference on Machine Learning (ECML)*.Google Scholar - Morimoto, J., & Doya, K. (2001). Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning.
*Robotics and Autonomous Systems*,*36*(1), 37–51.CrossRefMATHGoogle Scholar - Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., et al. (2006). Autonomous inverted helicopter flight via reinforcement learning. In M. H. Ang & O. Khatib (Eds.),
*Experimental robotics IX: The 9th international symposium on experimental robotics*(pp. 363–372). Berlin, Heidelberg: Springer.Google Scholar - Niekum, S., and Barto, A. G. (2011). Clustering via dirichlet process mixture models for portable skill discovery. In
*Advances in Neural Information Processing Systems (NIPS)*.Google Scholar - Niekum, S., Osentoski, S., Konidaris, G.D., and Barto, A.G. (2012). Learning and generalization of complex tasks from unstructured demonstrations. In
*Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)*.Google Scholar - Paraschos, A., Daniel, C., Peters, J., and Neumann, G., (2013). Probabilistic movement primitives. In
*Advances in Neural Information Processing Systems (NIPS)*.Google Scholar - Parr, R., and Russell, S. (1998) Reinforcement learning with hierarchies of machines.
*Advances in Neural Information Processing Systems (NIPS)*.Google Scholar - Peters, J., Mülling, K., and Altun, Y. (2010). Relative entropy policy search. In
*Proceedings of the National Conference on Artificial Intelligence (AAAI)*.Google Scholar - Puterman, M. L. (1994).
*Markov decision processes: Discrete stochastic dynamic programming*. New York: Wiley.CrossRefMATHGoogle Scholar - Ranchod, P., Rosman, B., and Konidaris, G. (2015). Nonparametric bayesian reward segmentation for skill discovery using inverse reinforcement learning. In
*Intelligent Robots and Systems (IROS) IEEE*, 471–477.Google Scholar - Silver, D., and Ciosek, K. (2012). Compositional planning using optimal option models. In
*Proceedings of the International Conference on Machine Learning (ICML)*.Google Scholar - Simsek, Ö., and Barto, A. G. (2008). Skill characterization based on betweenness. In
*Advances in Neural Information Processing Systems (NIPS)*.Google Scholar - Simsek, Ö., Wolfe, A. P., and Barto, A. G. (2005). Identifying useful subgoals in reinforcement learning by local graph partitioning. In
*Proceedings of the International Conference on Machine Learning (ICML)*.Google Scholar - Stolle, M., & Precup, D. (2002). Learning options in reinforcement learning.
*Abstraction, Reformulation, and Approximation*(pp. 212–223). New York City: Springer.Google Scholar - Stulp, F., and Schaal, S. (2012). Hierarchical reinforcement learning with movement primitives. In
*Proceedings of the IEEE International Conference on Humanoid Robots (HUMANOIDS)*.Google Scholar - Sutton, R. S., Precup, D., and Singh, S. (1998). Intra-option learning about temporally abstract actions. In
*Proccedings of the International Conference on Machine Learning (ICML)*.Google Scholar - Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and Semi-MDPs: A framework for temporal abstraction in reinforcement learning.
*Artificial Intelligence*,*112*, 181–211.MathSciNetCrossRefMATHGoogle Scholar - Theodorou, E., Buchli, J., & Schaal, S. (2010). A generalized path integral control approach to reinforcement learning.
*Journal of Machine Learning Research*,*11*, 3137–3181.MathSciNetMATHGoogle Scholar - van Hoof, H., Peters, J., and Neumann, G. (2015). Learning of non-parametric control policies with high-dimensional state features. In
*Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS)*.Google Scholar - Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning.
*Machine Learning*,*8*(3–4), 279–292.MATHGoogle Scholar - Wingate, D., Goodman, N. D., Roy, D. M, Kaelbling, L. P., and Tenenbaum, J. B. (2011). Bayesian policy search with policy priors. In
*Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI)*.Google Scholar