Probabilistic inference for determining options in reinforcement learning
Abstract
Tasks that require many sequential decisions or complex solutions are hard to solve using conventional reinforcement learning algorithms. Based on the semi-Markov decision process (SMDP) setting and the option framework, we propose a model which aims to alleviate these concerns. Instead of learning a single monolithic policy, the agent learns a set of simpler subpolicies as well as the initiation and termination probabilities for each of those subpolicies. While existing option learning algorithms frequently require manual specification of components such as the subpolicies, we present an algorithm which infers all relevant components of the option framework from data. Furthermore, the proposed approach is based on parametric option representations and works well in combination with current policy search methods, which are particularly well suited for continuous real-world tasks. We present results on SMDPs with discrete as well as continuous state-action spaces. The results show that the presented algorithm can combine simple subpolicies to solve complex tasks and can improve learning performance on simpler tasks.
Keywords
Reinforcement learning, Robot learning, Options, Semi-Markov decision process
1 Introduction
Solving tasks which require long decision sequences or complex policies is an important challenge in reinforcement learning (RL). The option framework (Sutton et al. 1999) is a promising approach to reduce the complexity of such tasks. In the option framework, an RL agent can choose between actions and macro-actions, which are carried out over multiple time steps (Parr and Russell 1998; Sutton et al. 1999).
Using these macro-actions, the agent has to make fewer decisions to solve a task. Furthermore, even if macro-actions are based on simple policies, the combination of multiple macro-actions can represent more complex solutions than the simple policies would allow for on their own. For example, if a given task requires a nonlinear policy, a combination of multiple linear subpolicies might still be able to solve this task. Such an automated decomposition of complex solutions can simplify the learning problem in many domains.
Based on the SMDP setting (Puterman 1994), the option framework (Sutton et al. 1998) incorporates such macro-actions. An option consists of a subpolicy, an initiation set and a termination probability. After an option is initiated, actions are generated by the subpolicy until the option is terminated and a new option is activated. While the option framework has received considerable attention (Parr and Russell 1998; Sutton et al. 1999; Dietterich 2000), to date most algorithms require the manual specification of either the activation policies or the subpolicies. Algorithms for autonomous option discovery typically depend on discrete state-action spaces (McGovern and Barto 2001b; Mann and Mannor 2014; Simsek et al. 2005), with exceptions such as the work of Konidaris and Barto (2009). Furthermore, many existing algorithms first explore a given MDP and learn suitable options afterwards (Menache et al. 2002; Simsek and Barto 2008). Hence, they do not leverage the efficiency of options in the initial stages of learning but rather aim at transferring the options to new tasks. These approaches are powerful in scenarios where options can be transferred to similar tasks in the future. In contrast, the approach suggested in this paper aims at directly learning suitable options for the problem at hand while also being applicable in continuous state-action spaces.
In continuous state-action spaces, policy search (PS) methods which optimize parametrized policies have been shown to learn efficiently in simulated and real-world tasks (Ng et al. 1998; Kober and Peters 2010). Thus, the compatibility of the proposed option discovery framework with PS methods such as PoWER (Kober and Peters 2010) and REPS (Peters et al. 2010) is an important goal of this paper. In the discrete setting, the framework can equally be combined with a wide range of methods such as Q-learning (Christopher 1992) and LSPI (Lagoudakis and Parr 2003). Furthermore, many complex tasks can be solved through combinations of simple behavior patterns. The proposed framework can combine multiple simple subpolicies to achieve complex behavior if a single subpolicy would be insufficient to solve the task. These simpler subpolicies are easier to learn, which can improve the overall learning speed.
1. It infers a data-driven segmentation of the state space to learn the initiation and termination probabilities for each option.
2. It is applicable to discrete as well as continuous state-action spaces.
3. It outperforms monolithic algorithms on discrete tasks and can solve complex continuous tasks by combining simpler subpolicies.
1.1 Problem statement
In Sect. 2, we show how to learn a hierarchical policy defined by the options given a set of demonstrated trajectories, i.e., how to solve the imitation learning problem. We formulate the option framework as a graphical model where the index of the executed option is treated as a latent variable. Subsequently, we show how an EM algorithm can be used to infer the parameters of the option components. In Sect. 3, we extend the imitation learning solution to allow for reinforcement learning, i.e., for iteratively improving the hierarchical policy such that it maximizes a reward function.
2 Learning options from data
The goal of this paper is to determine subpolicies, termination policies and the activation policy from data with minimal prior knowledge. All option components are represented by parametrized distributions, governed by the parameter vector \({\varvec{\theta }}= \{ {\varvec{\theta }}_A, {\varvec{\theta }}_O, {\varvec{\theta }}_B \}\). The individual components are given as binary classifiers for the termination policies for each option \(\pi (b \mid {\varvec{s}} , o =i; {\varvec{\theta }}_B^i)\), one global multiclass classifier for the activation policy \(\pi ( o \mid {\varvec{s}} ; {\varvec{\theta }}_O)\) and the individual subpolicies \(\pi ( {\varvec{a}} \mid {\varvec{s}} , o =i;{\varvec{\theta }}_A^i)\). We use the notation \(\pi (\cdot )\) to denote option components, which we ultimately aim to learn, and \(p(\cdot )\) for all other distributions. Our goal is to estimate the set of parameters \({\varvec{\theta }}= \{{\varvec{\theta }}_A, {\varvec{\theta }}_O, {\varvec{\theta }}_B \}\) which explains one or more demonstrated trajectories \(\tau = \{\tau _1, \dots , \tau _T \}\) with \(\tau _t = \{ {\varvec{s}} _t, {\varvec{a}} _t \}\). Crucially, the observations \(\tau \) contain neither the option indices \( o \) nor the termination events \(b\), but only states \( {\varvec{s}} \) and actions \( {\varvec{a}} \). Both the option index and the termination events are latent variables. While the proposed method learns all option components from data, it requires a manual selection of the total number of desired options. Future work could replace this requirement by, for example, sampling the number of options through a Dirichlet process.
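To make the interplay of the three option components concrete, the following minimal sketch samples a trajectory from a hierarchical policy. It assumes that a new option is drawn from the activation policy whenever the current option terminates, as described in the introduction. The callables `activation`, `termination`, `subpolicy` and `env_step` are hypothetical interfaces introduced only for this illustration.

```python
import numpy as np

def rollout(env_step, s0, activation, termination, subpolicy, horizon, rng):
    """Sample a trajectory from a hierarchical option policy (illustrative sketch).

    activation(s)        -> probability vector over options, pi(o | s)
    termination(s, o)    -> probability of terminating, pi(b = 1 | s, o)
    subpolicy(s, o, rng) -> action sampled from pi(a | s, o)
    env_step(s, a, rng)  -> next state (hypothetical environment interface)
    """
    s, o = s0, None
    trajectory = []
    for _ in range(horizon):
        # A new option is selected at the first step or after a termination event.
        b = o is None or bool(rng.random() < termination(s, o))
        if b:
            p = activation(s)
            o = int(rng.choice(len(p), p=p))
        a = subpolicy(s, o, rng)
        # The learner later observes only (s, a); o and b are latent.
        trajectory.append((s, a, o, b))
        s = env_step(s, a, rng)
    return trajectory
```

In the imitation learning setting only the states and actions of such a trajectory would be handed to the learner; the option indices and termination events are stored here purely for inspection.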
2.1 The graphical model for options
2.2 Expectation maximization for options
The graphical model for the option framework is a special case of a hidden Markov model (HMM). The Baum–Welch algorithm (Baum 1972) is an EM algorithm for estimating the parameters of an HMM. We now state the Baum–Welch algorithm for the option model, considering a single trajectory for clarity. The extension to multiple trajectories is straightforward.
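The HMM view can be illustrated with a scaled forward-backward pass over the latent option index. The sketch below is not a transcription of the paper's equations. It assumes that an option either terminates, after which the activation policy selects the next option, or stays active, which gives the transition model \(p(o_t = j \mid o_{t-1} = i, {\varvec{s}}_t) = \pi (b=1 \mid {\varvec{s}}_t, i)\, \pi (j \mid {\varvec{s}}_t) + \pi (b=0 \mid {\varvec{s}}_t, i)\, \delta _{ij}\). The array names `lik`, `term` and `act` are hypothetical, and only the option marginals are computed, not the posterior over termination events.

```python
import numpy as np

def option_posteriors(lik, term, act):
    """Scaled forward-backward pass for the latent option index.

    lik[t, i]  = pi(a_t | s_t, o = i)   action likelihood under option i
    term[t, i] = pi(b = 1 | s_t, o = i) termination probability of option i
    act[t, i]  = pi(o = i | s_t)        activation probability of option i
    Returns the marginals p(o_t = i | trajectory) and the data log likelihood.
    """
    T, K = lik.shape
    # trans[t, i, j] = p(o_t = j | o_{t-1} = i, s_t) under the assumed model
    trans = term[:, :, None] * act[:, None, :] + (1.0 - term)[:, :, None] * np.eye(K)

    alpha = np.zeros((T, K))
    c = np.zeros(T)  # per-step normalizers, used for numerical stability
    a = act[0] * lik[0]
    c[0] = a.sum()
    alpha[0] = a / c[0]
    for t in range(1, T):
        a = lik[t] * (alpha[t - 1] @ trans[t])
        c[t] = a.sum()
        alpha[t] = a / c[t]

    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = trans[t + 1] @ (lik[t + 1] * beta[t + 1]) / c[t + 1]

    gamma = alpha * beta  # each row sums to one
    return gamma, float(np.log(c).sum())
```

The returned marginals play the role of responsibilities in the maximization step, and the log likelihood can be monitored across EM iterations to check convergence.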
2.2.1 Expectation step
2.2.2 Maximization step
Given the distributions over latent variables and the observed state-action samples, the parameters \({\varvec{\theta }}\) can be determined by maximizing Eq. (5). Since \(Q({\varvec{\theta }}, {\varvec{\theta }}^\text {old} )\) is decoupled, independent optimization can be performed for the subpolicies, termination policies and the activation policy.
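Because the objective decouples, each subpolicy can be fitted on its own using the posterior responsibilities as sample weights. As one possible instance, the sketch below performs a weighted maximum likelihood fit of a linear-Gaussian subpolicy. The linear-Gaussian form, the small ridge term and the function name `fit_linear_gaussian` are assumptions made for this illustration.

```python
import numpy as np

def fit_linear_gaussian(Phi, A, w, ridge=1e-8):
    """Weighted ML estimate for a subpolicy a ~ N(W^T phi(s), Sigma).

    Phi: (T, d) state features, A: (T, m) actions,
    w:   (T,)   nonnegative weights, e.g. the responsibilities of one option.
    """
    WPhi = Phi * w[:, None]
    # Weighted least squares; the ridge term only guards against singular systems.
    gram = Phi.T @ WPhi + ridge * np.eye(Phi.shape[1])
    W = np.linalg.solve(gram, WPhi.T @ A)
    residual = A - Phi @ W
    Sigma = (residual * w[:, None]).T @ residual / w.sum()
    return W, Sigma
```

Samples with zero responsibility for an option do not influence that option's parameters, which is what allows several simple subpolicies to specialize on different regions of the state space.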
Termination policies
Subpolicies
Feature representations of the state
The above derivations are given in their most general form, where each option component depends directly on the state \( {\varvec{s}} \). In practice, it may often be beneficial to train the individual components on a feature transformation \(\varvec{\phi } ( {\varvec{s}} )\) of the state. Such features might, for example, be a polynomial expansion of the state variable or a kernelized representation. When using feature representations, different representations can be chosen for the individual option components.
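As a small example of such a feature transformation, the sketch below computes a polynomial expansion \(\varvec{\phi } ( {\varvec{s}} )\) containing a bias term and all monomials up to a given degree. The function name and the ordering of the monomials are choices made for this illustration only.

```python
import numpy as np
from itertools import combinations_with_replacement

def polynomial_features(s, degree=2):
    """Return a bias term plus all monomials of the state up to `degree`."""
    s = np.atleast_1d(np.asarray(s, dtype=float))
    feats = [1.0]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(s.size), d):
            feats.append(float(np.prod(s[list(idx)])))
    return np.array(feats)

# For s = (2, 3) and degree 2 the features are 1, s1, s2, s1^2, s1*s2, s2^2.
phi = polynomial_features([2.0, 3.0], degree=2)  # -> [1, 2, 3, 4, 6, 9]
```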
3 Probabilistic reinforcement learning for option discovery
3.1 Probabilistic reinforcement learning algorithms
There exist several algorithms which use probabilistic inference techniques for computing the policy update in reinforcement learning (Dayan and Hinton 1993; Theodorou et al. 2010; Kober and Peters 2010; Peters et al. 2010). More formally, they reweight either state-action trajectories or state-action pairs according to their estimated quality and, subsequently, use a weighted maximum likelihood estimate to obtain the parameters of a new policy \(\pi ^*\).
For discrete environments, we can equally employ standard reinforcement learning techniques to obtain the Q-function \(Q( {\varvec{s}} , {\varvec{a}} )\) and value function \(V( {\varvec{s}} )\). In our experiments, we employed standard Q-learning (Christopher 1992) and LSPI (Lagoudakis and Parr 2003) to obtain those quantities. In the discrete case, the temperature parameter \(\eta \) was set to 1.
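One common way to turn \(Q( {\varvec{s}} , {\varvec{a}} )\), \(V( {\varvec{s}} )\) and a temperature \(\eta \) into sample weights is an exponential transformation of the advantage, \(\exp \big ((Q - V)/\eta \big )\). The sketch below uses this form as an assumption; the exact weighting depends on the underlying RL method, and the function name is hypothetical.

```python
import numpy as np

def exponential_weights(q, v, eta=1.0):
    """Weights proportional to exp((Q(s, a) - V(s)) / eta) for each sample."""
    adv = (np.asarray(q, dtype=float) - np.asarray(v, dtype=float)) / eta
    # Subtracting the maximum avoids overflow; rescaling all weights by a
    # common constant does not change a weighted ML estimate.
    return np.exp(adv - adv.max())
```

A small \(\eta \) concentrates the weight on the best samples, while a large \(\eta \) approaches uniform weights, in which case the weighted estimate approaches the standard ML estimate.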
3.2 Combining EM and probabilistic reinforcement learning
As we have seen, the only difference between imitation learning and probabilistic reinforcement learning algorithms is the use of a weighted maximum likelihood (ML) estimate instead of a standard ML estimate. We can now combine the expectation maximization algorithm for discovering parametrized options with probabilistic reinforcement learning algorithms by weighting each time step in the maximization step of the EM algorithm.
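The combination can be sketched for state-independent multinomial subpolicies: every time step contributes to every option in proportion to the product of its posterior responsibility and its RL weight. The array names `gamma` and `f_rl` and the small `prior` pseudo-count are assumptions made for this illustration.

```python
import numpy as np

def update_multinomial_subpolicies(actions, gamma, f_rl, n_actions, prior=1e-3):
    """Weighted M-step for state-independent multinomial subpolicies.

    actions: (T,)   integer action indices
    gamma:   (T, K) responsibilities p(o_t = i | trajectory) from the E-step
    f_rl:    (T,)   nonnegative weights proposed by the RL algorithm
    Returns probs with probs[i, a] = pi(a | o = i).
    """
    T, K = gamma.shape
    w = gamma * f_rl[:, None]                 # per-step, per-option weight
    counts = np.full((n_actions, K), prior)   # pseudo-count avoids empty options
    np.add.at(counts, actions, w)             # accumulate weighted action counts
    return counts.T / counts.sum(axis=0)[:, None]
```

Setting all entries of `f_rl` to one recovers the imitation learning update, which mirrors the observation that the two settings differ only in the weighting.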
Table 1 Learning options from experience. Termination events, options and actions are sampled from the current policies. Subsequently, the distribution over latent variables is computed and weights \(f_\text {RL}\) are proposed by the RL algorithm. The next policies are determined according to the update equations in the method Sect.

The information flow of the proposed algorithm is shown in Table 1.
4 Related work
Options as temporally extended macro-actions were introduced by Sutton et al. (1999). While previous research leveraged the power of temporal abstraction (Kaelbling 1993; Parr and Russell 1998; Sutton et al. 1999), such efforts did not improve the subpolicies themselves. Improving the subpolicies based on the observed state-action-reward sequences is known as intra-option learning. Intra-option learning is a consequence of having Markov options and allows for updating all options that are consistent with an experience. While it is a desired property of option learning methods, it is not realized by all existing methods. Sutton et al. (1998) showed that making use of intra-option learning can drastically improve the overall learning speed. Yet, the algorithms presented by Sutton et al. (1998) relied on hand-coded options and were presented in the discrete setting.
Options are also used in many hierarchical RL approaches, where they either extend the action space or are directly extended to subtasks, where the overall problem is broken up into potentially simpler subproblems. Dietterich (2000) proposed the MAXQ framework, which uses several layers of such subtasks. However, the structure of these subtasks either needs to be specified by the user (Dietterich 2000) or relies on the availability of a successful trajectory (Mehta et al. 2008). Barto et al. (2004) rely on artificial curiosity to define the reward signal of individual subtasks, where the agent aims to maximize its knowledge of the environment to solve new tasks more quickly. This approach relies on salient events which effectively define the subtasks.
Stolle and Precup (2002) first learn a flat solution to the task at hand and, subsequently, use state visitation statistics to build the option’s initiation and termination sets. Mann and Mannor (2014) apply the options framework to value iteration and show that it can speed up convergence.
Option discovery approaches often aim to identify so-called bottleneck states, i.e., states the agent has to pass through on its way from start to goal. McGovern and Barto proposed to formulate this intuition as a multiple-instance learning problem and solve it using a diverse density method (McGovern and Barto 2001a). Other approaches aim to find such bottleneck states using graph-theoretic algorithms. The Q-Cut (Menache et al. 2002) and L-Cut (Simsek and Barto 2008) algorithms build transition graphs of the MDP and solve a min-cut problem to find bottleneck states. Silver and Ciosek (2012) assume a known MDP model to propose an option model composition framework, which can be used for planning while discovering new options. Niekum and Barto (2011) present a method to cluster subgoals discovered by existing subgoal discovery methods to find skills that generalize across tasks. In the presented paper, we do not assume knowledge of the underlying MDP and, further, present a framework which is also suitable for the continuous setting.
In continuous state-action settings, several subtask-based approaches have been proposed. Ghavamzadeh and Mahadevan (2003) proposed the use of a policy gradient method to learn subtasks, while the selection of the subtasks is realized through Q-learning. Morimoto and Doya (2001) proposed to learn how to reach specific joint configurations of a robot as subtasks, such that these options can later be combined to learn more complicated tasks. In both approaches, the subtasks have to be prespecified by the user. Wingate et al. (2011) use policy priors to encode desired effects like temporal correlation. Levy and Shimkin (2012) propose to extend the state space by the option index, which allows for the use of policy gradient methods. In our proposed method, this option index is a latent variable which is inferred from data. The use of a latent variable allows our method to update all options with all relevant data points. Konidaris and Barto (2009) use the option framework to learn chaining of skills. This approach requires that the agent can reach the goal state before constructing the series of options leading to the goal state.
A concept similar to the options framework has been widely adopted in the field of robot learning. There, temporal abstraction is achieved through the use of movement primitives (Paraschos et al. 2013; Da Silva et al. 2012). Instead of learning policies over state-torque mappings to control robots, the agent learns parameters of a trajectory generator (Kober and Peters 2010; Kajita et al. 2003). Based on a Beta-Process Autoregressive HMM proposed by Fox et al. (2009), Niekum et al. (2012) proposed a method to segment demonstrated trajectories into a sequence of primitives, addressing the imitation learning problem. Rosman and Konidaris (2015) extend the work of Fox et al. (2009) to allow skill discovery in the inverse reinforcement learning setting. There, the task is to recover reward functions which lead to a skill-based solution of a task. Compared to the proposed method, methods based on or similar to the Beta-Process Autoregressive HMM can extract skills or segments without a priori knowledge of the total number of skills. However, such methods are computationally expensive and have not been shown to work in the loop together with reinforcement learning methods.
5 Evaluation
The evaluation of the proposed framework is separated into two parts. We first evaluated the imitation learning capabilities and, subsequently, evaluated different reinforcement learning tasks and compared the proposed method to other option learning frameworks.
5.1 Imitation learning
We started our evaluations with an imitation learning task. The evaluation of the imitation learning capabilities allowed us to ensure that the foundation of the proposed framework performs as expected.
To generate observations, we provided a hand-coded policy which is shown in Fig. 2a. This policy was designed to generate noisy actions within \(\pm 300\) Nm. However, the torques on the system were capped to the torque limit of 30 Nm. The much larger range of desired torques was chosen to simplify the programming of the controller. This policy could successfully perform a pendulum swing-up if the pendulum was initially hanging down with zero velocity. An example trajectory generated by this hand-coded policy is shown in Fig. 2b.
Based on five observed trajectories, we used the proposed framework to learn a policy with three options. However, the resulting policy, shown in Fig. 3a, learned to reproduce an effective policy using only two of the three available options. Generally, we would not necessarily expect that the options recovered by the proposed algorithm exactly match the options of the demonstrator. The algorithm only aims at reproducing the observed behavior but is free to choose the internal structure of the hierarchical policy. Equally, in the RL case, we would not expect the proposed algorithm to learn options that are ‘human-like’. Specifically, we would not expect that a robotic agent would use the same solution decomposition as a human operator.
Figure 3a shows that the trajectories generated with this imitated policy closely resemble the observed trajectories. Figure 4 shows the inferred termination policies of the different options. The results show that options will only be terminated once they are outside of their region of ‘expertise’. Option one has a high termination probability in most regions. However, Fig. 3a shows that the final policy is actually not using this option. Figure 3b shows the development of the log likelihood of the observed data under the imitated policy. The results show that convergence was typically reached after about five iterations of the EM algorithm.
In the imitation learning case, the learned solution is only valid if the system is initiated in a state which is similar to those states that were previously demonstrated. Outside of this region, the imitation learning solution will not be able to successfully solve the task and a reinforcement learning solution is required. In the proposed method, the activation policy will learn to initiate subpolicies according to their responsibility for each region of the state space. If, for example, the subpolicies are modeled as Gaussians and have infinite support, they have nonzero responsibility for all states and could, theoretically, be activated anywhere. If, however, a different class of probability distributions is more task-appropriate and has limited support, the corresponding subpolicy would not be activated outside of its support region.
5.2 Reinforcement learning
For all evaluations, we tested each setting ten times and report mean and standard deviation of the results.
5.2.1 Discrete tasks
The discrete environments were given by three different gridworlds as shown in Figs. 5b, d and f. The first world, shown in Fig. 5b, represents the two rooms scenario (McGovern and Barto 2001a), where the agent has to find a doorway to travel between two rooms. In the second world, shown in Fig. 5d, the agent has to traverse two elongated corridors before entering a big room in which the target is located. Finally, in the third world, shown in Fig. 5f, no traditional bottleneck states appear, but the agent has to navigate around two obstacles. Furthermore, in this world optimal paths can lead around either side of the first obstacle.
In all experiments, the agent started in the lower left corner of the respective grid and had to traverse the grid while avoiding two obstacles to reach the goal at the opposite end. The actions available to the agent were going up, left, right and down. Transitions were successful with a probability of 0.8. Unsuccessful transitions had a uniform probability of ending up in any of the neighboring cells. The transition to each accessible field but the goal field generated a reward signal of \(-1\). After reaching the goal, the agent received a reward of \(+1\) for every remaining step of the episode, where each episode had a length of 500 time steps. If the agent tried to walk into an obstacle or to leave the field, it remained in the current position.
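The transition and reward structure described above can be sketched as a single step function. Several details are assumptions made for this illustration: an unsuccessful transition is modeled as a move in a uniformly drawn direction, the step reward is taken to be \(-1\), a blocked move also incurs the step reward, and the goal is treated as absorbing. The function name `gridworld_step` and the boolean obstacle grid are hypothetical.

```python
import numpy as np

# Action indices: 0 = up, 1 = down, 2 = left, 3 = right
MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def gridworld_step(pos, action, grid, goal, rng, p_success=0.8):
    """One stochastic gridworld transition; grid[r, c] is True for obstacles.

    Returns (next_pos, reward).
    """
    if pos == goal:
        # Assumption: the goal is absorbing and yields +1 for every remaining step.
        return pos, 1.0
    if rng.random() < p_success:
        move = MOVES[action]
    else:
        # Assumption: a failed transition moves in a uniformly drawn direction.
        move = MOVES[int(rng.integers(4))]
    r, c = pos[0] + move[0], pos[1] + move[1]
    inside = 0 <= r < grid.shape[0] and 0 <= c < grid.shape[1]
    # Walking into an obstacle or off the field leaves the agent in place.
    nxt = (r, c) if inside and not grid[r, c] else pos
    reward = 1.0 if nxt == goal else -1.0
    return nxt, reward
```

An episode in this sketch would call `gridworld_step` 500 times and sum the rewards to obtain the return.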
For the discrete tasks, we used a tabular feature encoding for the flat policy. For the hierarchical policy, we used the tabular features for the activation and termination policies. The subpolicies were state-independent multinomial distributions over the four actions.
Comparison to existing methods
In the comparison to related work, we follow the method established therein of reporting ‘steps to goal’ as the quality measure. In the remaining evaluations of our algorithm, we report the average return, which may in some cases be more informative since our reward functions also punish ‘falling off’ the board. The results in Fig. 5 show that in all experiments the proposed framework learned solutions faster than both the Q-Cut and the L-Cut methods. Comparing the use of Q-learning and LSPI in the proposed framework, the results show that LSPI leads to convergence considerably faster than Q-learning. Since the structure of the individual action policies learned by the proposed approach was given simply as a distribution over the four possible actions, the converged subpolicies usually select only one action. This simplicity of the subpolicies is a key factor in accelerating the overall learning speed. While we do not present results of comparisons to primitive-based methods such as, for example, using Q-learning directly, both of the methods that we did compare to have been shown to outperform Q-learning. Thus, we compared to such primitive-based methods indirectly. In our experience, both Q-Cut and L-Cut outperform Q-learning when not using experience replay. However, in our internal evaluations on the tasks presented in this paper, Q-learning with experience replay resulted in performance levels similar to Q-Cut and L-Cut, but worse than the proposed method.
Influence of available options
After comparing to existing methods, we further evaluated the properties of the proposed framework. All remaining evaluations were performed using LSPI in the obstacle world. In our experience, these results were representative of those obtained with Q-learning and on the other tasks. Figure 6a shows the influence of the number of available options. In theory, this task can be solved optimally with only two options, where one will always go right and one will always go up. However, the results show that making more options available to the algorithm improved both asymptotic performance and speed of convergence. Adding more than 20 options did not further increase the performance.
Influence of termination events
Finally, we evaluated the influence of the probabilistic terminations. In the proposed framework, the subpolicies have to be initialized and, thus, a prior for the termination policies has to be set. Figure 7a shows the effect of changing this prior. The results show that the proposed framework is robust to a wide range of these initializations.
We also evaluated the effect of disabling the probabilistic termination policies. In this case, the algorithm could still learn multiple options but no termination policies. Thus, no option could be active for more than one time step; each option terminated after every step. The results in Fig. 7b show that learning without terminations slowed down convergence. In our experience, this effect was strongly linked to the stochasticity of the transitions. The higher this stochasticity was, the stronger the benefit of the termination policies became.
5.2.2 Continuous task
The results in Fig. 8a show that while a single linear policy was insufficient to solve this task, it could be solved using two options. Adding more options further improved the resulting hierarchical policy. The visualization of the resulting policy in Fig. 8b shows that with more options, the algorithm learned a control scheme where options two and three were used to swing up the pendulum, and options one and four incorporated a linear stabilization scheme around the upright position of the pendulum. Figure 9a shows a trajectory generated by the resulting policy. Starting from the bottom, the pendulum was first accelerated by options two and three. The plot shows that in between options two and three, option four was active for a few time steps. However, the kinetic energy at that point was insufficient to fully swing up the pendulum. After the pendulum almost reached the upright position around time step 40, the stabilizing options took over control. Since all option components are stochastic, some option switches still occur even after the pendulum is stabilized. Since the effect of switching into a different option in the stable position for a single time step could easily be balanced by activating the stabilizing option in the next time step, the agent did not have a strong incentive to learn to avoid such behavior. Letting the algorithm run for more iterations might further improve this behavior.
5.2.3 Limitations
While the experiments show that the presented method worked well in the scenarios that were evaluated, we also want to make explicit the assumptions made in this paper. Primarily, the proposed method expects that the number of required options is known a priori. In our experience, this requirement is rather benign in practice, as the algorithm can be initialized with an excessive number of options without deterioration in the quality of the solution. However, adding more options does increase the computational requirements. Thus, approaches for automatically generating a task-appropriate number of options are an important aspect of future work. Furthermore, we introduced a damping factor \(\alpha =0.1\) on the policy update for the discrete setting, which we found to be especially important when using Q-learning as the underlying RL method. In our experience, the recommended value of \(\alpha \) depends on the RL method used as well as the task under consideration. Methods such as LSPI will generally work well with larger values of \(\alpha \).
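The exact form of the damped update is not spelled out here. One plausible reading, used in the sketch below purely as an assumption, is a convex combination of the old parameters and the new weighted ML estimate, where \(\alpha \) acts as a step size. The function name `damped_update` is hypothetical.

```python
import numpy as np

def damped_update(theta_old, theta_ml, alpha=0.1):
    """Move a fraction alpha from the old parameters towards the new estimate."""
    theta_old = np.asarray(theta_old, dtype=float)
    theta_ml = np.asarray(theta_ml, dtype=float)
    return (1.0 - alpha) * theta_old + alpha * theta_ml
```

For multinomial policies the convex combination of two probability vectors is again a probability vector, so no renormalization is needed under this reading.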
6 Conclusion and future work
In this paper, we presented a method to estimate the components of the option framework from data. The results show that the proposed method is able to learn options in the discrete and continuous setting. In the discrete setting, the algorithm performs better than two related option-discovery algorithms which are based on exploiting bottleneck states. Instead of relying on bottleneck states, the proposed algorithm achieves its performance by combining options with simpler subpolicies.
In the continuous setting, the results show that the algorithm is able to solve a nonlinear task using a combination of options with only linear subpolicies. In this setting, a single linear policy is insufficient for solving the task. Furthermore, the framework allows for parametrized policies and, thus, state-of-the-art policy search methods developed for flat policies can be used to learn hierarchical policies.
The presented approach infers the structure of the options, such as the activation policy and termination policies, from data. However, the number of options still has to be set a priori by the practitioner. While the results show that setting a relatively large number of options typically yields good performance, learning the number of options is an important aspect of future work. Finally, while the presented framework estimates the most likely termination policies, finding a way of enforcing fewer terminations might further improve learning performance.
Notes
Acknowledgments
The research leading to these results has received funding from the DFG SPP ‘autonomous learning’ under the project ‘LearnRobots’.
References
 Barto, A.G., Singh, S, and Chentanez, N. (2004). Intrinsically motivated learning of hierarchical collections of skills. Proceedings of the International Conference on Developmental Learning (ICDL).Google Scholar
 Baum, L. E. (1972). An equality and associated maximization technique in statistical estimation for probabilistic functions of markov processes. Inequalities, 3, 1–8.Google Scholar
 Bishop, Christopher M. (2006). Pattern recognition and machine learning (information science and statistics). New York: Springer.zbMATHGoogle Scholar
 Da Silva, B., Konidaris, G., and Barto, A.G. (2012). Learning parameterized skills. In Proceedings of the International Conference on Machine Learning (ICML).Google Scholar
 Daniel, C., Neumann, G., and Peters, J. (2012). Hierarchical relative entropy policy search. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).Google Scholar
 Daniel, C., Neumann, G., Kroemer, O., and Peters, J. (2013). Learning sequential motor tasks. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).Google Scholar
 Dayan, P., and Hinton, G. E. (1993). Feudal reinforcement learning. In S. J. Hanson, J. D. Cowan, & C. L. Giles (Eds.), Advances in neural information processing systems (pp. 271–278). Los Altos: Morgan Kaufmann Publishers.Google Scholar
 Dietterich, T. G. (2000). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research (JAIR), 13, 227–303.MathSciNetzbMATHGoogle Scholar
 Fox, E. B., Jordan, M. I., Sudderth, E. B., and Willsky, A. S. (2009). Sharing features among dynamical systems with beta processes. In Advances in Neural Information Processing Systems (NIPS), pp. 549–557.Google Scholar
 Ghavamzadeh, M., and Mahadevan, S. (2003). Hierarchical policy gradient algorithms. In Proceedings of the International Conference for Machine Learning (ICML).Google Scholar
 Kaelbling, L. P., (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the International Conference on Machine Learning (ICML).Google Scholar
 Kajita, S., Kanehiro, K., and Kaneko, F., Fujiwara, K., Harada, K., Yokoi, K., and Hirukawa, H. (2003). Biped walking pattern generation by using preview control of zeromoment point. In Proceedings of the IEEE International Conference of Robotics and Automation (ICRA).Google Scholar
 Kober, J., and Peters, J. (2010). Policy search for motor primitives in robotics. Machine Learning, 84, 1–33.Google Scholar
 Konidaris, G., and Barto, A. (2009). Skill discovery in continuous reinforcement learning domains using skill chaining. In Advances in Neural Information Processing Systems (NIPS).Google Scholar
 Konidaris, G., Osentoski, S., and Thomas, P.s. (2011). Value function approximation in reinforcement learning using the fourier basis. Conference on Artificial Intelligence (AAAI).Google Scholar
 Lagoudakis, M., & Parr, R. (2003). Leastsquares policy iteration. Journal of Machine Learning Research, 4, 1107–1149.MathSciNetzbMATHGoogle Scholar
 Levy, K. Y., and Shimkin, N. (2012). Unified inter and intra options learning using policy gradient methods. In S. Sanner & M. Hutter (Eds.), Recent advances in reinforcement learning (pp. 153–164). New York: Springer.Google Scholar
 Mann, T. A., and Mannor, S. (2014) Scaling up approximate value iteration with options: Better policies with fewer iterations. In Proceedings of the International Conference on Machine Learning (ICML).Google Scholar
 McGovern, A., and Barto, A. G. (2001a). International conference on machine learning (icml). Computer Science Department Faculty Publication Series, 8, 361–368.Google Scholar
McGovern, A., and Barto, A. G. (2001b). Automatic discovery of subgoals in reinforcement learning using diverse density. International Conference on Machine Learning (ICML), pp. 8.
Mehta, N., Ray, S., Tadepalli, P., and Dietterich, T. G. (2008). Automatic discovery and transfer of MAXQ hierarchies. In Proceedings of the International Conference on Machine Learning (ICML).
Menache, I., Mannor, S., and Shimkin, N. (2002). Q-cut: Dynamic discovery of sub-goals in reinforcement learning. In Proceedings of the European Conference on Machine Learning (ECML).
Morimoto, J., and Doya, K. (2001). Acquisition of stand-up behavior by a real robot using hierarchical reinforcement learning. Robotics and Autonomous Systems, 36(1), 37–51.
Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J., Tse, B., et al. (2006). Autonomous inverted helicopter flight via reinforcement learning. In M. H. Ang & O. Khatib (Eds.), Experimental robotics IX: The 9th international symposium on experimental robotics (pp. 363–372). Berlin, Heidelberg: Springer.
Niekum, S., and Barto, A. G. (2011). Clustering via Dirichlet process mixture models for portable skill discovery. In Advances in Neural Information Processing Systems (NIPS).
Niekum, S., Osentoski, S., Konidaris, G. D., and Barto, A. G. (2012). Learning and generalization of complex tasks from unstructured demonstrations. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
Paraschos, A., Daniel, C., Peters, J., and Neumann, G. (2013). Probabilistic movement primitives. In Advances in Neural Information Processing Systems (NIPS).
Parr, R., and Russell, S. (1998). Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems (NIPS).
Peters, J., Mülling, K., and Altun, Y. (2010). Relative entropy policy search. In Proceedings of the National Conference on Artificial Intelligence (AAAI).
Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York: Wiley.
Ranchod, P., Rosman, B., and Konidaris, G. (2015). Nonparametric Bayesian reward segmentation for skill discovery using inverse reinforcement learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 471–477.
Silver, D., and Ciosek, K. (2012). Compositional planning using optimal option models. In Proceedings of the International Conference on Machine Learning (ICML).
Simsek, Ö., and Barto, A. G. (2008). Skill characterization based on betweenness. In Advances in Neural Information Processing Systems (NIPS).
Simsek, Ö., Wolfe, A. P., and Barto, A. G. (2005). Identifying useful subgoals in reinforcement learning by local graph partitioning. In Proceedings of the International Conference on Machine Learning (ICML).
Stolle, M., and Precup, D. (2002). Learning options in reinforcement learning. In Abstraction, Reformulation, and Approximation (pp. 212–223). New York City: Springer.
Stulp, F., and Schaal, S. (2012). Hierarchical reinforcement learning with movement primitives. In Proceedings of the IEEE International Conference on Humanoid Robots (HUMANOIDS).
Sutton, R. S., Precup, D., and Singh, S. (1998). Intra-option learning about temporally abstract actions. In Proceedings of the International Conference on Machine Learning (ICML).
Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112, 181–211.
Theodorou, E., Buchli, J., and Schaal, S. (2010). A generalized path integral control approach to reinforcement learning. Journal of Machine Learning Research, 11, 3137–3181.
van Hoof, H., Peters, J., and Neumann, G. (2015). Learning of non-parametric control policies with high-dimensional state features. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
Watkins, C. J. C. H., and Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
Wingate, D., Goodman, N. D., Roy, D. M., Kaelbling, L. P., and Tenenbaum, J. B. (2011). Bayesian policy search with policy priors. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).