Progress in Artificial Intelligence

Volume 2, Issue 1, pp 13–27

Learning domain structure through probabilistic policy reuse in reinforcement learning

Regular Paper

DOI: 10.1007/s13748-012-0026-6

Cite this article as:
Fernández, F. & Veloso, M. Prog Artif Intell (2013) 2: 13. doi:10.1007/s13748-012-0026-6

Abstract

Policy Reuse is a transfer learning approach to improve a reinforcement learner with guidance from previously learned similar policies. The method uses the past policies as a probabilistic bias where the learner chooses among the exploitation of the ongoing learned policy, the exploration of random unexplored actions, and the exploitation of past policies. In this work, we demonstrate that Policy Reuse further contributes to the learning of the structure of a domain. Interestingly and almost as a side effect, Policy Reuse identifies classes of similar policies revealing a basis of core-policies of the domain. We demonstrate theoretically that, under a set of conditions to be satisfied, reusing such a set of core-policies allows us to bound the minimal expected gain received while learning a new policy. In general, Policy Reuse contributes to the overall goal of lifelong reinforcement learning, as (i) it incrementally builds a policy library; (ii) it provides a mechanism to reuse past policies; and (iii) it learns an abstract domain structure in terms of core-policies of the domain.

Keywords

Probabilistic Policy Reuse; Transfer learning; Reinforcement learning; Domain structure learning

1 Introduction

Reinforcement Learning (RL) [1, 2] is a powerful technique for learning to solve different kinds of tasks. Solving a task consists of learning a policy for it that, in the best case, is near-optimal, i.e., approximately maximizes the long-term sum of the rewards obtained. The learning process is based on trial and error, guided by reward signals received from the environment. Classical RL algorithms such as Q-Learning [3] rely on an intensive exploration of the action and state spaces. Due to the “curse of dimensionality” of such spaces in complex domains, solving a task typically requires extensive interaction between the learning agent and the environment.

Although the cost (time, resources, etc.) of such a learning process may be very high, sometimes the task can be tackled and successfully solved [4, 5]. There have been many different efforts to address the complexity of learning. An appealing idea is to reuse the knowledge acquired in the current learning process when solving future problems, so that the cost of future learning processes is reduced. In RL, several efforts have been made along this line, such as the transfer of value functions [6], the reuse of options [7] and the learning of hierarchical decompositions of factored Markov Decision Processes (MDPs) [8].

In this manuscript, we report on Probabilistic Policy Reuse, an approach for transfer learning based on the reuse of similar action policies. It is based on our research in the related areas of Symbolic Plan Reuse [9] and Extended Rapidly-exploring Random Trees (E-RRT) [10]. Planning by analogical reasoning provides a method for symbolic plan reuse. However, when reusing a past plan, if a step becomes invalid to use in the new situation, the traditional reuse questions are either (i) to resolve the locally failed step and direct the search to return back to another past plan step, or (ii) to completely abandon the past plan and re-plan from scratch from the failed step directly toward the goal. E-RRT solves this general reuse question by guiding a new plan probabilistically with a past plan. The past experience is effectively used as a bias in the new search, and thus solves the general reuse problem in a probabilistic manner.

Learning structure in complex domains is a key challenge for scaling up applications, in particular because of the difficulty in finding similarity metrics to determine commonalities in a complex domain. In this article, we contribute a method to identify equivalence classes of domain states through our developed Policy Reuse. When solving a new problem, Policy Reuse utilizes the past policies as a probabilistic bias where the learner faces three choices: the exploitation of the ongoing learned policy, the exploration of random unexplored actions, and the exploitation of past policies. As a past policy becomes relevant to solving a new task, such effective reuse reveals the similarity between the past and new task. Domain structure is then incrementally learned through Policy Reuse, as we present.

Therefore, a side-effect of Policy Reuse is its capability to identify classes of similar policies, revealing a basis of core-policies of the domain. This makes it possible to build a library of policies to be reused in the future, using the PLPR algorithm (Policy Library through Policy Reuse). In this work, we contribute new theoretical results, and we show that, under a set of conditions to be satisfied, reusing such a set of core-policies allows us to bound the minimal expected gain received while learning a new policy. We introduce new definitions, such as the \(\delta \)-Basis-Library of a domain, which is a library of core policies large enough to obtain accurate results in a Policy Reuse process.

In this paper, we also include additional evaluations over classical exploration strategies (like \(\epsilon \)-greedy and Boltzmann) to show the advantages of Policy Reuse in the grid domain. Policy Reuse can also be applied to potentially more complex domains, such as the Keepaway domain [5]. The challenge in Keepaway is to transfer learned knowledge from simpler (although continuous) to larger state and action spaces, e.g., from a Keepaway problem with some number of teammates and opponents to a new one with a larger number of agents. The use of Policy Reuse for transfer learning among different state and action spaces (typically called inter-task transfer [11]), and its evaluation in Keepaway, can be found in the literature [12–14]. Variations of Policy Reuse algorithms can also be found for multi-robot reconfiguration [15] and learning from demonstration, also in Keepaway [16]. In this work, we use a grid-based domain that allows us to highlight some properties which are more difficult to represent in other domains. The same domain has been used in other works [17].

In summary, the main contributions of this manuscript are twofold. First, we show empirical and theoretical results about how the similarity metric among policies works, why it is useful for selecting the policy to reuse from a set of past policies, and how to use it to bound the gain that can be obtained by reusing a policy library. Second, we demonstrate how Policy Reuse learns the structure of a domain in terms of libraries of core-policies, which can be used in future learning tasks.

This manuscript is organized as follows. Section 2 summarizes relevant related work. Section 3 introduces Policy Reuse in the scope of Reinforcement Learning, and formalizes the concepts of task, domain, and gain. Section 4 defines the \(\pi \)-reuse exploration strategy, a similarity metric among policies, and the PRQ-Learning algorithm. Section 5 presents the PLPR algorithm, and provides theoretical and empirical results that demonstrate the capability of the algorithm to build the basis of a domain as a set of core-policies, and bound the sub-optimality of the transfer learning. Section 6 shows the empirical results. Finally, Sect. 7 summarizes the main conclusions of this work.

2 Related work

Policy Reuse is a transfer learning method. It uses past policies to balance among exploitation of the ongoing learned policy, exploration of random actions, and exploration toward the past policies. The exploration versus exploitation problem concerns whether to explore new knowledge or exploit the knowledge already acquired. The limits are defined by the random and the greedy strategies, and several strategies can be found in between, such as \(\epsilon \)-greedy and Boltzmann [2]. Directed exploration strategies memorize exploration-specific knowledge that is used for guiding the exploration search [18]. These strategies are based on heuristics that bias the learning so that unexplored states tend to have a higher probability of being explored than recently visited ones. These strategies only use knowledge obtained in the current learning process.

Several methods aim at improving learning by introducing additional knowledge into the exploration process. Advice rules [19] define the actions to be preferred in different sets of states. In this case, the source of the advice rules is the user, who is the source of exploration knowledge in many other approaches [20]. Other knowledge sources can be used, such as a mentor, from which policies can be learned by imitation [21]. In the previous cases, as in Policy Reuse, the advice is about policies rather than Q values.

Transfer learning refers to the injection of knowledge from previously solved tasks. Memory guided exploration [22] incorporates knowledge from a past policy in a new exploration process by weighting the Q values associated with the new and the past policy. However, that requires that the values of both Q functions be homogeneous, and that there be a perfect mapping between the past and the new Q function. The problem can be solved by weighting the probability of selecting each action, instead of the actual Q values [23]. In any case, the choice of a weight decay that correctly balances the use of the past and the new policy is left to the designer.

Transfer learning, as knowledge reuse across different learning tasks, can be performed by initializing the Q-values of a new episode with previously learned Q-values [24, 25]. However, if the source and target tasks are very different, transfer learning may require expert knowledge to decide on the feasibility of the transfer, and on the mapping between actions and states from the source and target tasks [26]. Some methods try to solve this problem through a study of action correlations [27], through state abstraction [28], or by defining the relationships between the state variables of the source and target MDPs [29]. Value function transfer is an alternative, but it is restricted to previous learning processes that were also performed through a value function. Furthermore, these methods do not focus on the case where several tasks have been previously solved (several value functions have been learned) and are candidates for reuse.

A different way of introducing previous knowledge is by executing macro-actions or sub-policies. For instance, some algorithms use macro-actions to learn new action policies in Semi-Markov Decision Processes (SMDPs), as is the case of TTree [30]. These macros can also be defined using a relational language, and learned using Inductive Logic Programming (ILP) techniques [31]. Options can also be used in SMDPs [7]. They require the set of states from which they can be executed, an end condition and the behavior of the option. Such a behavior can be learned on line [32], as well as the other components of the option [33]. Other ways to transfer knowledge are through the use of sets of rules that summarize policies [34] or by composing solutions of elemental sequential tasks [35].

Hierarchical RL uses different abstraction levels to organize subtasks [36], and some approaches are able to learn such a hierarchy [37]. The methods for learning hierarchies or options capture the structure of the domain. Some related algorithms are SKILL [38], which discovers partially defined policies that arise in the context of multiple tasks in the same domain, and L-Cut, which discovers subgoals and corresponding sub-policies [39]. Sub-policies can suboptimally solve a task with computable bounds [40]. Other methods incrementally build a cache of policies for a decomposed MDP [41], but also following a hierarchical approach.

Probabilistic Policy Reuse differs substantially from previous works based on options, macro-actions or hierarchical RL. Those methods are built on a basis where, once a sub-policy is selected, it is followed until an end condition associated with the sub-policy is satisfied or it suffers an external interruption. In our case, past policies provide a bias, and the learning agent interlaces the execution of actions suggested by the new and past policies probabilistically. Policy Reuse never executes complete, or even partial, policies; in each step it decides whether to execute an action suggested by one of the past or new policies. This avoids having to define both the conditions under which a sub-policy must be executed and the conditions under which its execution must be interrupted.

A primary difference between Policy Reuse and the use of macro-actions, assuming flat macro-actions similar to the exploratory actions used in Policy Reuse, is that Policy Reuse does not learn values for such exploratory actions; it learns values for the primitive actions. From the values of the primitive actions, the ground policy is derived.

A main contribution of Policy Reuse with respect to previous approaches is that Policy Reuse does not assume that the transferred knowledge is positive. Other methods rely on this assumption and take for granted that the transferred knowledge will be useful, as highlighted in a previous survey [42]. Policy Reuse has mechanisms to measure the utility of the transferred policies, and to decide when to reuse them or not.

3 Policy Reuse in reinforcement learning

Reinforcement Learning problems are typically formalized using Markov Decision Processes (MDPs). An MDP is a tuple \(\langle \mathcal S, \mathcal A, \mathcal T, \mathcal R \rangle \), where \(\mathcal S\) is the set of states, \(\mathcal A\) is the set of actions, \(\mathcal T \) is a stochastic state transition function, \(\mathcal T : \mathcal S \times \mathcal A \times \mathcal S \rightarrow \mathfrak R \), and \(\mathcal R \) is a stochastic reward function, \(\mathcal R : \mathcal S \times \mathcal A \rightarrow \mathfrak R \). RL assumes that \(\mathcal T \) and \(\mathcal R \) are unknown.

We focus on RL domains where different tasks can be solved. The MDP formalism is not expressive enough to represent all the concepts involved in knowledge transfer [43], so we define domain and task separately to handle different tasks executed in the same domain. We introduce a task as a specific reward function, while the other concepts, \(\mathcal S \), \(\mathcal A \) and \(\mathcal T \), stay constant for all the tasks in the same domain.

Definition 1

A Domain \(\mathcal D\) is a tuple \(\langle \mathcal S, \mathcal A, \mathcal T \rangle \), where \(\mathcal S\) is the set of all states; \(\mathcal A\) is the set of all actions; and \(\mathcal T \) is a state transition function, \(\mathcal T : \mathcal S \times \mathcal A \times \mathcal S \rightarrow \mathfrak R \).

Definition 2

A task \(\Omega \) is a tuple \(\langle \mathcal D , \mathcal R _\Omega \rangle \), where \(\mathcal D\) is a domain; and \(\mathcal R _\Omega \) is the reward function, \(\mathcal R _\Omega : \mathcal S \times \mathcal A \rightarrow \mathfrak R \).

We assume that we are solving episodic tasks. A trial or episode starts by locating the learning agent in a random position in the environment. Each episode finishes when the agent reaches a goal state or when it executes a maximum number of steps, \(H\). The agent’s goal is to maximize the expected average reinforcement per episode, \(W\), as defined in Eq. 1:
$$\begin{aligned} W=\frac{1}{K}\sum _{k=0}^{K} \sum _{h=0}^{H} \gamma ^h r_{k,h} \end{aligned}$$
(1)
where \(\gamma \) (\(0\le \gamma \le 1\)) reduces the importance of future rewards, and \(r_{k,h}\) defines the immediate reward obtained in the step \(h\) of the episode \(k\), in a total of \(K\) episodes.
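
As a small illustration of Eq. 1, the following sketch computes the average discounted reinforcement from logged rewards. The function name `average_gain` and the nested-list layout of the rewards are assumptions made for this example only; the sketch averages over the number of episodes actually supplied.

```python
def average_gain(rewards, gamma):
    """Average discounted return per episode, in the spirit of Eq. 1.

    rewards[k][h] is the immediate reward r_{k,h} obtained at step h of
    episode k. Episodes may have different lengths (an episode stops early
    when a goal state is reached).
    Assumption: the average is taken over len(rewards) episodes.
    """
    if not rewards:
        raise ValueError("at least one episode is required")
    total = 0.0
    for episode in rewards:
        # Discount the reward at step h by gamma**h and accumulate.
        total += sum((gamma ** h) * r for h, r in enumerate(episode))
    return total / len(rewards)
```

For instance, with \(\gamma = 0.5\), an episode that receives a reward of 1 at step 2 contributes 0.25, and an episode that receives a reward of 1 at step 0 contributes 1, so the average over these two episodes is 0.625.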

An action policy, \(\Pi \), is a function \(\Pi : \mathcal S\rightarrow \mathcal A\) that defines how the agent behaves. If the action policy was created to solve a defined task, \(\Omega \), we call that action policy \(\Pi _\Omega \). The gain, or average expected reward, received when executing an action policy \(\Pi \) in the task \(\Omega \) is called \(W^{\Pi }_\Omega \). Finally, an optimal action policy for solving the task \(\Omega \) is called \(\Pi _\Omega ^*\). The action policy \(\Pi _\Omega ^*\) is optimal if \(W^{\Pi ^*_\Omega }_{\Omega }\ge W^{\Pi }_{\Omega }\) for every policy \(\Pi \) in the space of all possible policies when \(K\rightarrow \infty \). Action policies can be represented using the action-value function, \(Q^{\Pi }(s,a)\), which defines, for each state \(s\in \mathcal S\) and action \(a\in \mathcal A\), the expected reward that will be obtained if the agent starts to act from \(s\), executes \(a\), and thereafter follows the policy \(\Pi \). So, the RL problem is mapped to learning the function \(Q^{\Pi }(s,a)\) that maximizes the expected gain. The learning can be performed using different algorithms, such as Q-Learning [3].

The goal of Policy Reuse is to use different policies, which solve different tasks, to bias the exploration process when learning the action policy of another similar task in the same domain. We call the set of past policies a Policy Library, as defined next.

Definition 3

A Policy Library, \(L\), is a set of \(n\) policies \(\{\Pi _1, \ldots , \Pi _n\}\). Each policy \(\Pi _i\in L\) solves a task \(\Omega _i=\langle \mathcal D , \mathcal R _{\Omega _i}\rangle \), i.e., each policy solves a task in the same domain.

The previous definition does not restrict the characteristics of the tasks (they may be repeated), nor the characteristics of the policies (they may be sub-optimal), although optimality or near-optimality could affect the reuse process. The scope of Policy Reuse is summarized as follows: we want to solve the task \(\Omega \), i.e., learn \(\Pi ^*_\Omega \); we have previously solved the set of tasks \(\{\Omega _1,\ldots ,\Omega _n\}\) with \(n\) policies stored as a Policy Library, \(L=\{\Pi _{1},\ldots ,\Pi _{n}\}\); how can we use the policy library, \(L\), to learn the new policy, \(\Pi ^*_{\Omega }\)?

Policy Reuse answers this question by adding the past policies into a learning episode as a probabilistic exploration bias. We define an exploration strategy able to bias the exploration process toward the policies of the Policy Library, and a method to estimate the utility of reusing each of them and to decide whether to reuse them or not. Furthermore, Policy Reuse provides an efficient method to construct the Policy Library. We now detail the Policy Reuse approach.

4 Reusing past policies

In this section, we describe the basic algorithms of Policy Reuse. We first describe how to reuse just one past policy. Then, we show how to reuse a set of past policies. Finally, in this section, we describe in depth the results obtained in a grid navigation domain.

4.1 The \(\pi \)-reuse exploration strategy

The \(\pi \)-reuse strategy is an exploration strategy able to bias a new learning process with a past policy. Let \(\Pi _\mathrm{past}\) be the past policy to reuse and \(\Pi _\mathrm{new}\) the new policy to be learned. We assume that we are using a direct RL method to learn the action policy, so we are learning the related \(Q\) function. Any RL algorithm can be used to learn the \(Q\) function, and Sarsa(\(\lambda \)) and Q(\(\lambda \)) have been applied [13, 14].

The goal of \(\pi \)-reuse is to balance random exploration, exploitation of the past policy, and exploitation of the new policy, as represented in Eq. 2.
$$\begin{aligned} a= \left\{ \begin{array}{ll} \Pi _\mathrm{past}(s)&\mathrm{w/prob. } \psi \\ \epsilon -\mathrm{greedy}(\Pi _\mathrm{new}(s))&\mathrm{w/prob. } (1-\psi ) \\ \end{array} \right. \end{aligned}$$
(2)
The \(\pi \)-reuse strategy follows the past policy with probability \(\psi \), and it exploits the new policy with probability of \(1-\psi \). As random exploration is always required, it follows the new policy using an \(\epsilon \)-greedy strategy.
Table 1 shows a procedure describing the \(\pi \)-reuse strategy integrated with the Q-Learning algorithm. The procedure gets as an input the past policy \(\Pi _\mathrm{past}\), the number of episodes \(K\), the maximum number of steps per episode \(H\), and the \(\psi \) parameter. An additional \(\upsilon \) parameter is added to decay the value of \(\psi \) in each step of the learning episode. The procedure outputs the Q function, the policy, and the average gain obtained in the execution, \(W\), which will play an important role in similarity assessment, as the next sections present. The variable \(\psi _h\) keeps the value of \(\upsilon ^h\psi \) in each step of each episode.
Table 1

\(\pi \)-reuse exploration strategy

https://static-content.springer.com/image/art%3A10.1007%2Fs13748-012-0026-6/MediaObjects/13748_2012_26_Figa1_HTML.gif
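
The procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a transcription of Table 1: the `Corridor` toy domain, the environment interface (`reset`, `step`, `actions`), the fixed `epsilon` value, and the running-average update of \(W\) are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict


class Corridor:
    """Hypothetical toy domain: cells 0..n-1 on a line, goal at the right end."""

    actions = (-1, +1)

    def __init__(self, n=6):
        self.n = n

    def reset(self, rng):
        return rng.randrange(self.n - 1)  # random non-goal start cell

    def step(self, s, a):
        s2 = min(max(s + a, 0), self.n - 1)
        done = s2 == self.n - 1
        return s2, (1.0 if done else 0.0), done


def pi_reuse(env, past_policy, K, H, psi, upsilon,
             gamma=0.95, alpha=0.1, epsilon=0.1, seed=0):
    """Q-Learning biased by a past policy, following Eq. 2."""
    rng = random.Random(seed)
    Q = defaultdict(float)          # Q[(state, action)], initialized to 0
    W = 0.0                         # average gain over episodes
    for k in range(K):
        s = env.reset(rng)
        psi_h = psi                 # psi_h holds upsilon**h * psi
        ret = 0.0
        for h in range(H):
            if rng.random() < psi_h:
                a = past_policy(s)                    # follow the past policy
            elif rng.random() < epsilon:
                a = rng.choice(env.actions)           # random exploration
            else:
                a = max(env.actions, key=lambda b: Q[(s, b)])  # new policy
            s2, r, done = env.step(s, a)
            best_next = max(Q[(s2, b)] for b in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            ret += (gamma ** h) * r
            psi_h *= upsilon
            s = s2
            if done:
                break
        W += (ret - W) / (k + 1)    # running average of episode returns

    def policy(state):
        return max(env.actions, key=lambda b: Q[(state, b)])

    return Q, policy, W
```

A call such as `pi_reuse(Corridor(), lambda s: +1, K=200, H=20, psi=1.0, upsilon=0.95)` reuses a hypothetical "always move right" policy. With `upsilon < 1` the bias toward the past policy fades within each episode, so later steps rely increasingly on the policy being learned.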

 

4.2 A similarity function between policies

The exploration strategy \(\pi \)-reuse, as defined in Table 1, returns the learned policy \(\Pi _\mathrm{new}\), and the average gain obtained in its learning process, \(W\). Let \(W_i\) be the gain obtained while executing the \(\pi \)-reuse exploration strategy, reusing the past policy \(\Pi _i\), and using a parameter vector \(\varvec{\theta }\) that encapsulates all the parameters of the exploration strategy (\(K\), \(H\), \(\psi \) and \(\upsilon \), defined in Table 1). We can use such value to measure the usefulness of reusing the policy \(\Pi _i\) to learn the new policy \(\Pi _\mathrm{new}\). The next definitions formalize this idea.

Definition 4

Given a policy \(\Pi _i\) that solves a task \(\Omega _i=\langle \mathcal D , R_i\rangle \), and a new task \(\Omega =\langle \mathcal D , R_\Omega \rangle \), the Reuse Gain of the policy \(\Pi _i\) on the task \(\Omega \), \(W_i^{\varvec{\theta }}\), is the gain obtained when applying the \(\pi \)-reuse exploration strategy with the policy \(\Pi _i\) and a parameter vector \(\varvec{\theta }\) to learn the policy \(\Pi \).

Vector \(\varvec{\theta }\) plays an important role, since the reuse gain obtained when reusing a policy depends on such a vector. However, we can assume that such a parameter vector is fixed a priori or after some tuning. Therefore, in the rest of the paper we will assume that such a vector is fixed. To simplify the notation, we will also eliminate it from the formulation, and we will use \(W_i\) instead of \(W_i^{\varvec{\theta }}\).

Then, given a parameter vector \(\varvec{\theta }\), the most useful policy to reuse, \(\Pi _k\), from a Policy Library, \(L=\{\Pi _1, \ldots , \Pi _n\}\), is the one that maximizes the Reuse Gain when learning such a task, as defined in Eq. 3:
$$\begin{aligned} \Pi _k=\arg _{\Pi _i} \max (W_i) , \quad i=1, \ldots , n \end{aligned}$$
(3)
To solve this equation, we need to compute the Reuse Gain for all the past policies. Interestingly, such a gain can be estimated on-line at the same time that the new policy is computed. This idea is formalized in the PRQ-Learning algorithm.

4.3 The PRQ-learning algorithm

The goal of the PRQ-learning algorithm is to solve a task \(\Omega \), i.e., to learn an action policy \(\Pi _\Omega \). We have a Policy Library \(L=\{\Pi _1, \ldots , \Pi _n\}\) composed of \(n\) past optimal policies that solve \(n\) different tasks, respectively. Then two main questions need to be answered: (i) given the set of policies \(\{\Pi _{\Omega }, \Pi _1, \ldots , \Pi _n\}\), which consists of the policies in the Policy Library plus the ongoing learned policy, what policy (say \(\Pi _k\)) is exploited? (ii) once a policy is selected, what exploration/exploitation strategy is followed?

The answer to the first question is as follows: let \(W_i\) be the Reuse Gain of the policy \(\Pi _i\) on the task \(\Omega \). Also, let \(W_{\Omega }\) be the average reward that is received when following the policy \(\Pi _{\Omega }\) greedily. The solution we introduce consists of following a softmax strategy using the values \(W_{\Omega }\) and \(W_{i}\), as defined in Eq. 4, with a temperature parameter \(\tau \). This value is also computed for \(\Pi _0\), which we assume to be \(\Pi _\Omega \). Equation 4 provides a way of deciding which policy, \(\Pi _k\), to select for exploitation.
$$\begin{aligned} \Pi _k&= \arg _{\Pi _j, 0\le j\le n}\max P(\Pi _j), \quad \text{where}\\ P(\Pi _j)&=\frac{\mathrm{e}^{\tau W_j}}{\sum _{p=0}^n \mathrm{e}^{\tau W_p} }, \quad \forall \Pi _j, 0\le j\le n \end{aligned}$$
(4)
The problem of selecting what policy to reuse in the PRQ-Learning is similar to a non-stationary k-armed bandit problem. Most works in non-stationary k-armed bandit problems try to detect when the change in the distributions occurs, and then to re-learn with classical stationary approaches [45].
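
The probabilities \(P(\Pi _j)\) of Eq. 4 can be computed as in the sketch below. Drawing a policy at random according to these probabilities is our reading of the softmax strategy mentioned in the text; the function names, and the subtraction of the maximum gain for numerical stability, are implementation choices made for this example and do not come from the paper.

```python
import math
import random


def softmax_probabilities(gains, tau):
    """P(Pi_j) = exp(tau * W_j) / sum_p exp(tau * W_p), as in Eq. 4.

    gains[0] is W for the policy being learned; gains[1:] are the reuse
    gains of the library policies. Subtracting max(gains) leaves the
    probabilities unchanged and avoids overflow for large tau.
    """
    m = max(gains)
    weights = [math.exp(tau * (w - m)) for w in gains]
    z = sum(weights)
    return [w / z for w in weights]


def select_policy(gains, tau, rng=random):
    """Sample a policy index according to the softmax probabilities."""
    probs = softmax_probabilities(gains, tau)
    return rng.choices(range(len(gains)), weights=probs, k=1)[0]
```

Note that \(\tau \) multiplies the gains in Eq. 4, so \(\tau = 0\) yields a uniform choice, and larger values of \(\tau \) concentrate the probability on the policy with the highest gain.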

The answer to the second question (what exploration strategy to follow once a policy is chosen) is a heuristic that depends on the selected policy. If the policy chosen is \(\Pi _{\Omega }\), the algorithm follows a completely greedy strategy. However, if the policy chosen is \(\Pi _i\) (for \(i=1, \ldots , n\)), the \(\pi \)-reuse action selection strategy, defined in the previous section, is followed instead. In this way, the Reuse Gain of each of the past policies can be estimated on-line with the learning of the new policy. Thus, the values required in Eq. 4 are continuously updated each time a policy is used.

All these ideas are formalized in the PRQ-Learning algorithm (Policy Reuse in Q-Learning) shown in Table 2. The algorithm gets as input: a new task to solve \(\Omega \); the policy library \(L\); the temperature parameter of the softmax policy selection equation \(\tau \), and a decay parameter \(\Delta \tau \); and a set of previously defined parameters: \(K, H, \psi , \upsilon , \gamma , \alpha \).
Table 2

PRQ-Learning

https://static-content.springer.com/image/art%3A10.1007%2Fs13748-012-0026-6/MediaObjects/13748_2012_26_Figa2_HTML.gif

The algorithm initializes the new Q function to 0, as well as the estimated reuse gain of the policies in the library. Then the algorithm executes the \(K\) episodes iteratively. In each episode, the algorithm decides which policy to follow. In the first iteration, all the policies have the same probability to be chosen, given that all \(W_i\) values are initialized to 0. Once a policy is chosen, the algorithm uses it to solve the task, updating the Reuse Gain for such a policy with the reward obtained in the episode, and therefore, updating the probability to follow each policy. The policy being learned can also be chosen, although in the initial steps it behaves as a random policy, given that the Q values are initialized to 0. While new updates are performed over the Q function, it becomes more accurate, and receives higher rewards when executed. After executing several episodes, it is expected that the new policy obtains higher gains than reusing the past policies, so it will be chosen most of the time.
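
The loop described above can be sketched as follows. This is an illustrative reading of the text, not a transcription of Table 2: the `Corridor` toy domain, the environment interface, the fixed `epsilon`, the random tie-breaking of the greedy policy, the incremental-average update of each \(W_j\), and the rule \(\tau \leftarrow \tau + \Delta \tau \) after each episode are all assumptions made for this example.

```python
import math
import random
from collections import defaultdict


class Corridor:
    """Hypothetical toy domain: cells 0..n-1, reward 1 on reaching `goal`."""

    actions = (-1, +1)

    def __init__(self, n=7, goal=6):
        self.n, self.goal = n, goal

    def reset(self, rng):
        return rng.choice([s for s in range(self.n) if s != self.goal])

    def step(self, s, a):
        s2 = min(max(s + a, 0), self.n - 1)
        done = s2 == self.goal
        return s2, (1.0 if done else 0.0), done


def prq_learning(env, library, K, H, psi=1.0, upsilon=0.95, gamma=0.95,
                 alpha=0.1, epsilon=0.1, tau=0.0, delta_tau=0.05, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)                 # new Q function, initialized to 0

    def greedy(s):
        # Ties are broken at random, so an all-zero Q acts as a random policy.
        best = max(Q[(s, b)] for b in env.actions)
        return rng.choice([b for b in env.actions if Q[(s, b)] == best])

    W = [0.0] * (len(library) + 1)         # W[0]: new policy, W[j]: library
    U = [0] * (len(library) + 1)           # times each policy was chosen
    for _ in range(K):
        # Softmax choice among the new policy and the past policies (Eq. 4).
        m = max(W)
        weights = [math.exp(tau * (w - m)) for w in W]
        k = rng.choices(range(len(W)), weights=weights)[0]
        s, psi_h, ret = env.reset(rng), psi, 0.0
        for h in range(H):
            if k > 0 and rng.random() < psi_h:
                a = library[k - 1](s)              # pi-reuse: past policy
            elif k > 0 and rng.random() < epsilon:
                a = rng.choice(env.actions)        # pi-reuse: exploration
            else:
                a = greedy(s)                      # exploit the new policy
            s2, r, done = env.step(s, a)
            target = r + gamma * max(Q[(s2, b)] for b in env.actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            ret += (gamma ** h) * r
            psi_h *= upsilon
            s = s2
            if done:
                break
        U[k] += 1
        W[k] += (ret - W[k]) / U[k]        # update the gain of the used policy
        tau += delta_tau
    return Q, greedy, W
```

With a library such as `[lambda s: +1, lambda s: -1]` in this corridor, the returned list `W` holds the estimated gain of the new policy followed by the estimated reuse gains of the two past policies, which are the quantities later used to decide whether the new policy enters the library.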

5 Building a library of policies

This section describes the \(PLPR\) algorithm (Policy Library through Policy Reuse), an algorithm to build a library of policies. The algorithm is based on an incremental learning of policies that solve different tasks. Notice that we are assuming that the tasks that the algorithm will be asked to solve are unknown a priori, and are given sequentially. Otherwise, a method to learn them in parallel could be applied [46].

5.1 The PLPR algorithm

The algorithm works as follows. Initially, the Policy Library is empty, \(\mathrm{PL}=\emptyset \). Then, the first task, say \(\Omega _1\), needs to be solved, so the first policy, say \(\Pi _1\), is learned. To learn the first policy, any exploration strategy other than \(\pi \)-reuse can be used, given that no policy is yet available to reuse. \(\Pi _1\) is added to the Policy Library, so \(\mathrm{PL}=\{\Pi _1\}\). When a second task needs to be solved, the PRQ-Learning algorithm is applied, reusing \(\Pi _1\). Thus, \(\Pi _2\) is learned. Then, we need to decide whether to add \(\Pi _2\) to the Policy Library or not. This decision is based on how similar \(\Pi _1\) is to \(\Pi _2\), following Eq. 5. In the equation, \(W_2\) is the average gain obtained when following \(\Pi _2\) greedily, and \(W_1\) is the Reuse Gain of \(\Pi _1\) on task \(\Omega _2\). Both values are computed in the execution of the PRQ-Learning algorithm, so no additional computations are required.
$$\begin{aligned} d_\rightarrow (\Pi _1, \Pi _2)= W_2- W_1 \end{aligned}$$
(5)
This distance metric estimates how similar \(\Pi _1\) is to \(\Pi _2\). We define this distance not by direct comparisons between the policies, but comparing the result of applying them. In our case, if \(\Pi _1\) is very similar to \(\Pi _2\), i.e., \(d_\rightarrow (\Pi _1, \Pi _2)\) is close to 0, to include the second policy in the library is unnecessary. However, if the distance is large, \(\Pi _2\) is included. Therefore, we can introduce a new concept, \(\delta \)-similarity, as follows.

Definition 5

Given a policy \(\Pi _i\) that solves a task \(\Omega _i=\langle \mathcal D , R_i\rangle \), a new task \(\Omega =\langle \mathcal D , R_\Omega \rangle \), and its respective optimal policy \(\Pi \), \(\Pi \) is \(\delta \)-similar to \(\Pi _i\) (for \(0\le \delta \le 1\)) if \(W_{i} > \delta W_{\Omega }^*\), where \(W_{i}\) is the Reuse Gain of \(\Pi _i\) on task \(\Omega \) and \(W_{\Omega }^*\) is the average gain obtained in \(\Omega \) when an optimal policy is followed.

The interesting property of this concept is that for any optimal policy \(\Pi \), if we know a past policy which is \(\delta \)-similar to it, we know that such optimal policy can be easily learned just by applying the \(\pi \)-reuse algorithm with the past policy, and that the gain obtained in the learning process (the reuse gain) will be at least \(\delta \) times the maximum gain in such a task. From this definition, we can formalize the concept of \(\delta \)-similarity with respect to a Policy Library, \(L\), as follows.

Definition 6

Given a Policy Library, \(L=\{\Pi _1,\ldots ,\Pi _n\}\) in a domain \(\mathcal D\), a new task \(\Omega =\langle \mathcal D , R_\Omega \rangle \), and its respective optimal policy \(\Pi \), \(\Pi \) is \(\delta \)-similar with respect to \(L\) iff \( \exists \Pi _i\) such that \(\Pi \) is \(\delta \)-similar to \(\Pi _i\), for \(i=1,\ldots , n\).

Thus, if we know that a policy \(\Pi \) is \(\delta \)-similar with respect to a Policy Library \(L\), we know that the policy \(\Pi \) can be easily learned by reusing the policies in \(L\).
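
Definitions 5 and 6 reduce to simple comparisons once the gains are known, as the following sketch shows. The function names and the numeric values used in the usage note are hypothetical and serve only to illustrate the definitions.

```python
def is_delta_similar(reuse_gain, optimal_gain, delta):
    """Definition 5: Pi is delta-similar to Pi_i if W_i > delta * W*_Omega."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    return reuse_gain > delta * optimal_gain


def is_delta_similar_to_library(reuse_gains, optimal_gain, delta):
    """Definition 6: some policy in the library is delta-similar to Pi.

    reuse_gains[i] is the Reuse Gain W_i of library policy Pi_i on the
    new task.
    """
    return any(is_delta_similar(w, optimal_gain, delta) for w in reuse_gains)
```

As a hypothetical example, with reuse gains 0.10 and 0.28, an optimal gain of 0.35 and \(\delta = 0.75\), the threshold is \(0.75 \times 0.35 = 0.2625\); the second policy exceeds it, so the optimal policy of the new task is \(\delta \)-similar with respect to the library.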

The PLPR algorithm is described in Table 3. It is executed each time that a new task needs to be solved. It inputs the Policy Library and the new task to solve, and outputs the learned policy and the updated Policy Library.
Table 3

PLPR Algorithm

https://static-content.springer.com/image/art%3A10.1007%2Fs13748-012-0026-6/MediaObjects/13748_2012_26_Figa3_HTML.gif

Equation 6 is the update equation for the Policy Library, derived from Eq. 5. It requires the computation of the most similar policy, which is the policy \(\Pi _j\) such that \(j=\arg _i\max W_i\), for \(i=1, \ldots , n\). The gain that will be obtained by reusing such a policy is called \(W_\mathrm{max}\). The new policy learned is inserted in the library if \(W_\mathrm{max}\) is lower than \(\delta \) times the gain obtained by using the new policy (\(W_\Omega \)), where \(\delta \in [0,1]\) defines the similarity threshold, i.e., whether the new policy is \(\delta \)-similar with respect to the Policy Library.

The parameter \(\delta \) has an important role. If it receives a value of 0, the Policy Library stores only the first policy learned, given that the average gain obtained by reusing it will be greater than zero in most cases, due to the positive rewards obtained by chance. If \(\delta =1\), most of the policies learned are inserted, due to the fact that \(W_\mathrm{max} < W_{\Omega }\), given that \(W_{\Omega }\) is maximum if the optimal policy has been learned. Different values in the range \((0, 1)\) provide different sizes of the library, as will be demonstrated in the experiments. Thus, \(\delta \) defines the size, and therefore the resolution, of the library.
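
The insertion rule described in the text can be sketched as follows. The function name `plpr_update` and the list-based library are assumptions made for this example; the gains are expected to come from a previous PRQ-Learning run on the new task.

```python
def plpr_update(library, new_policy, reuse_gains, new_gain, delta):
    """Return the updated Policy Library after learning `new_policy`.

    reuse_gains[i] is the Reuse Gain W_i of library[i] on the new task and
    new_gain is W_Omega, the gain of the newly learned policy. The new
    policy is inserted when W_max < delta * W_Omega, i.e., when no stored
    policy is delta-similar to it.
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    if not library:
        # The first policy learned is always stored.
        return [new_policy]
    if len(reuse_gains) != len(library):
        raise ValueError("one reuse gain per stored policy is required")
    w_max = max(reuse_gains)
    if w_max < delta * new_gain:
        return library + [new_policy]
    return list(library)
```

This reproduces the two extremes discussed above: with \(\delta = 0\) the condition \(W_\mathrm{max} < 0\) is never met for non-negative gains, so only the first policy is kept, whereas with \(\delta = 1\) a policy is inserted whenever its own gain exceeds the best reuse gain.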

5.2 Suboptimality of policy reuse

The PLPR algorithm has an interesting “side-effect,” namely the learning of the structure of the domain. As the Policy Library is initially empty, and a new policy is included only if it is different enough with respect to the previously stored ones, depending on the threshold \(\delta \), when the policies stored are sufficiently representative of the domain, no more policies are stored. Thus, the obtained library can be considered as the Basis-Library of the domain, and the stored policies can be considered as the core policies of such domain. In the following, we introduce the formalization of these concepts.

Definition 7

A Policy Library, \(L=\{\Pi _1,\ldots ,\Pi _n\}\) in a domain \(\mathcal D\) with a distribution of tasks \(\mathcal T \), is a \(\delta \)-Basis-Library of the domain \(\mathcal D\) iff: (i) \(\not \exists \Pi _i \in L\), such that \(\Pi _i\) is \(\delta \)-similar with respect to \(L-\Pi _i\); and (ii) the rest of policies \(\Pi \) in the space of all the possible policies in \(\mathcal D\) are \(\delta \)-similar with respect to \(L\).

Here, we introduce the idea of a distribution of tasks, \(\mathcal T \), to limit the distribution of reward functions. This distribution will be important when we define the conditions to build a \(\delta \)-Basis-Library of a domain.

Definition 8

Given a \(\delta \)-Basis-Library, \(L=\{\Pi _1,\ldots ,\Pi _n\}\) in a domain \(\mathcal D\), a new task \(\Omega =\langle \mathcal D , R_\Omega \rangle \), each policy \(\Pi \in L\) is a \(\delta \)-Core Policy of the domain \(\mathcal D\) in \(L\).

The proper computation of the Reuse Gain for each past policy in the PRQ-Learning algorithm plays an important role, since it allows the algorithm to compute the most similar policy, its reuse distance and therefore, to decide whether to add the new policy to the Policy Library or not. If the reuse gain is not correctly computed, the basis library will not be either. Thus, we introduce a new concept that measures how accurate the estimation of the reuse gain is.

Definition 9

Given a Policy Library, \(L=\{\Pi _1,\ldots ,\Pi _n\}\) in a domain \(\mathcal D\), and a new task \(\Omega =\langle \mathcal D , R_\Omega \rangle \), let us assume that the PRQ-Learning algorithm is executed, and it outputs the new policy \(\Pi _\Omega \), the estimation of the optimal gain \(\hat{W}_{\Pi _\Omega }\), and the estimation of the Reuse Gain of the most similar policy, say \(\hat{W}_\mathrm{max}\). We say that the PRQ-Learning algorithm has been Properly Executed with a confidence factor \(\eta \) (\(0\le \eta \le 1\)), if \(\Pi _\Omega \) is optimal to solve the task \(\Omega \), and the error in the estimation of both parameters is lower than a factor of \(\eta \), i.e.:
$$\begin{aligned} \begin{array}{l} \hat{W}_\mathrm{max}\ge \eta W_\mathrm{max}\quad \text{ and} \quad \eta \hat{W}_\mathrm{max}\le W_\mathrm{max} \text{ and} \\ \hat{W}_{\Pi _\Omega }\ge \eta W_{\Pi _\Omega }\quad \text{ and} \quad \eta \hat{W}_{\Pi _\Omega }\le W_{\Pi _\Omega } \end{array} \end{aligned}$$
(7)

where \(W_{\max }\) is the actual value of the Reuse Gain of the most similar policy and \(W_{\Pi _\Omega }\) is the actual gain of the obtained policy.

Thus, if we say that the PRQ-Learning algorithm has been Properly Executed with a confidence of 0.95, we can say, for instance, that the estimated Reuse Gain, \(\hat{W}_\mathrm{max}\), of the most similar policy has a maximum deviation over the actual Reuse Gain of 5 %. The proper execution of the algorithm depends on how accurate the parameter selection is. Such a parameter selection depends on the domain and the task, so no general guidelines can be provided. The definition requires that \(\Pi _\Omega \) is optimal to solve the task \(\Omega \), which theoretically may require an infinite number of episodes. However, in practice, optimal policies may be obtained in a finite number of episodes or, at least, the suboptimality could be bounded.
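The four inequalities of Eq. (7) can be checked mechanically when the actual gains are known, for example in a simulated domain. The following is a minimal sketch under that assumption; the function name `properly_executed` is hypothetical, and the optimality of \(\Pi _\Omega \), which the definition also requires, is not checked here.

```python
def properly_executed(w_hat_max: float, w_max: float,
                      w_hat_omega: float, w_omega: float,
                      eta: float) -> bool:
    """Check the four inequalities of Eq. (7) for a confidence factor eta.

    w_hat_* are the estimates output by the learner; w_max and w_omega are
    the actual values, which are normally unknown outside of simulation.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")

    def within(estimate: float, actual: float) -> bool:
        # estimate >= eta * actual  and  eta * estimate <= actual
        return estimate >= eta * actual and eta * estimate <= actual

    return within(w_hat_max, w_max) and within(w_hat_omega, w_omega)
```

With \(\eta =1\) both inequalities of each pair can only hold if the estimate equals the actual value, which is the situation required by the first condition of Theorem 1.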

The previous definition allows us to enumerate the conditions that make the PLPR algorithm build a \(\delta \)-Basis-Library, as described in the following theorem.

Theorem 1

The PLPR algorithm builds a \(\delta \)-Basis-Library of a domain \(\mathcal D \) for a task distribution \(\mathcal T \) if (i) the PRQ-Learning algorithm is Properly Executed with a confidence of 1; (ii) the Reuse Distance is symmetric; and (iii) the PLPR algorithm is executed infinitely many times over random tasks in the distribution \(\mathcal T \).

Proof

The proper execution of the PRQ-Learning algorithm ensures that the similarity metric, and all the derived concepts, are correctly computed. The first condition of the definition of \(\delta \)-Basis-Library can be demonstrated by induction. The base case is when the library is composed of only one policy, given that no policy is \(\delta \)-similar with respect to an empty library. The inductive hypothesis states that a Policy Library \(L_n=\{\Pi _1, \ldots , \Pi _n\}\) is a \(\delta \)-Basis-Library. Lastly, the inductive step is that the library \(L_{n+1}=\{\Pi _1, \ldots , \Pi _n, \Pi _{n+1}\}\) is also a \(\delta \)-Basis-Library. If the PLPR algorithm has been followed to insert \(\Pi _{n+1}\) in \(L_n\), we ensure that \(\Pi _{n+1}\) is not \(\delta \)-similar with respect to \(L_n\), given that this is the condition to insert a new policy in the library, as described in the PLPR algorithm. Furthermore, \(\Pi _i\) (for \(i=1,\ldots , n\)) is not \(\delta \)-similar with respect to \(L_{n+1}-\Pi _i\), given that (i) \(\Pi _i\) is not \(\delta \)-similar with respect to \(L_{n}-\Pi _i\) (by the inductive hypothesis); and (ii) \(\Pi _i\) is not \(\delta \)-similar to \(\Pi _{n+1}\), because the reuse distance is symmetric (by the second condition of the theorem), which ensures that, since \(\Pi _{n+1}\) is not \(\delta \)-similar to \(\Pi _i\), \(\Pi _i\) is not \(\delta \)-similar to \(\Pi _{n+1}\) either. Finally, the second condition of the definition of \(\delta \)-Basis-Library becomes true if the algorithm is executed infinitely many times, which is satisfied by the third condition of the theorem, which also constrains the distribution of tasks for which policies are computed. \(\square \)

The achievement of the conditions of the theorem depends on several factors. The symmetry of the Reuse Distance depends on the task and the domain. The proper execution of the PRQ-Learning algorithm also depends on the selection of the correct parameters for each domain. However, although the previous theorem requires the PRQ-Learning algorithm to be properly executed with a confidence of 1, a generalized result can be easily derived when the confidence is under 1, say \(\eta \), as the following theorem claims.

Theorem 2

The PLPR algorithm builds a \((2 \eta \delta )\)-Basis-Library if (i) the PRQ-Learning algorithm is Properly Executed with a confidence of \(\eta \); (ii) the Reuse Distance is symmetric; and (iii) the PLPR algorithm is executed infinitely many times over random tasks.

Proof

The proof of this theorem only requires a small consideration over the inductive step of the proof of the previous theorem, where a policy \(\Pi _{n+1}\) is inserted in the \(\delta \)-Basis-Library \(L_n=\{\Pi _1, \ldots , \Pi _n\}\) following the PLPR algorithm. The policy is added only if it is not \(\delta \)-similar with respect to \(L_{n}\). In that case, if the PRQ-Learning algorithm has been properly executed with a confidence of \(\eta \), we can only ensure that the policy \(\Pi _{n+1}\) is not \((2 \eta \delta )\)-similar with respect to \(L_n\), because of the error in the estimation of the gains (reuse gain and optimal gain) in the execution of the PRQ-Learning algorithm. \(\square \)

Finally, we define a lower bound of the learning gain that is obtained when reusing a \(\delta \)-Basis-Library to solve a new task.

Theorem 3

Given a \(\delta \)-Basis-Library, \(L=\{\Pi _1,\ldots ,\Pi _n\}\) of a domain \(\mathcal D\), and a new task \(\Omega =\langle \mathcal D , R_\Omega \rangle \), the average gain obtained, say \(W_\Omega \), when learning a new policy \(\Pi _\Omega \) to solve the task \(\Omega \) by properly executing the PRQ-Learning algorithm over \(\Omega \) reusing \(L\) with a confidence factor of \(\eta \) is greater than \(\eta \delta \) times the optimal gain for such a task, \(W_\Omega ^*\), i.e.,
$$\begin{aligned} W_\Omega > \eta \delta W_\Omega ^* \end{aligned}$$
(8)

Proof

When executing the PRQ-Learning algorithm properly, we reuse all the past policies, obtaining an estimation of their reuse gain. In the definition of Proper Execution of the PRQ-Learning algorithm, the estimated gain generated by the most similar policy, say \(\Pi _i\), was called \(\hat{W}_\mathrm{max}\), which is an estimation of the real one. In the worst case, the gain obtained in the execution of the PRQ-Learning algorithm is generated only by the most similar policy, \(\Pi _i\), and the gain obtained by reusing any other different policy is 0, i.e., \(W_j=0, \forall \Pi _j\ne \Pi _i\). By the definition of \(\delta \)-Basis-Library we know that every policy \(\Pi \) in the domain \(\mathcal D\) is \(\delta \)-similar with respect to \(L\). Thus, the most similar policy in \(L\), \(\Pi _i\), is such that its Reuse Gain, \(W_\mathrm{max}\), satisfies \(W_\mathrm{max}>\delta W_\Omega ^*\) (by definition of \(\delta \)-similarity). However, given that we have executed the PRQ-Learning algorithm with a confidence factor of \(\eta \), the obtained gain \(W_\Omega \) only satisfies \( W_\Omega \ge \eta W_\mathrm{max}\), by definition of proper execution of the PRQ-Learning algorithm. Thus, \(W_\Omega \ge \eta W_\mathrm{max}\), and \(W_\mathrm{max}>\delta W_\Omega ^*\), so \(W_\Omega >\eta \delta W_\Omega ^*\). \(\square \)
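As a purely illustrative instance of the bound, take the hypothetical values \(\eta =0.95\) and \(\delta =0.25\). Chaining the two inequalities used in the proof gives
$$\begin{aligned} W_\Omega \ge \eta W_\mathrm{max} > \eta \delta W_\Omega ^* = 0.95 \times 0.25 \, W_\Omega ^* = 0.2375 \, W_\Omega ^* \end{aligned}$$
so, under these assumed values, the gain obtained while learning would exceed roughly 24 % of the optimal gain for the task.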

6 Empirical results

We use a grid-based robot navigational domain (see Fig. 1) with multiple rooms. The environment is represented by walls, free positions and goal areas, all of them of size \(1\times 1\). The whole domain is \(N \times M\) (\( 24 \times 21\) in our case). The actions that the robot can execute are “North,” “East,” “South,” and “West”, all of size one. The final position after executing an action is modified by adding a random value that follows a uniform distribution in the range \((-0.20,0.20)\).
https://static-content.springer.com/image/art%3A10.1007%2Fs13748-012-0026-6/MediaObjects/13748_2012_26_Fig1_HTML.gif
Fig. 1

Grid-based office domain

Walls block the robot’s motion, i.e., when the robot tries to execute an action that would crash it into a wall, the action keeps the robot in its original position.

The robot knows its location in the space through continuous coordinates \((x, y)\). We assume that we have the optimal uniform discretization of the state space (which consists of \( 24 \times 21\) regions; see footnote 2). The goal in this domain is to reach the area marked with ‘G’, in a maximum of \(H\) actions. When the robot reaches it, it is considered a successful episode, and it receives a reward of 1. Otherwise, it receives a reward of 0.
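The dynamics described above can be sketched as a single transition function. This is a minimal sketch, not the simulator used in the experiments: the axis orientation ("North" as \(+y\)), the noise being added to both coordinates, the outer border acting as a wall, and the names `ACTIONS` and `step` are all assumptions made for illustration.

```python
import math
import random

# Unit displacement of each action; the axis orientation is an assumption.
ACTIONS = {"North": (0.0, 1.0), "East": (1.0, 0.0),
           "South": (0.0, -1.0), "West": (-1.0, 0.0)}


def step(pos, action, walls, goal, width=24, height=21, noise=0.20,
         rng=random):
    """Apply one action and return (new_pos, reward, done).

    pos is a continuous (x, y) pair; walls and goal are sets of integer
    (column, row) cells of size 1 x 1 in a width x height grid.
    """
    dx, dy = ACTIONS[action]
    # Unit move plus uniform noise in (-noise, noise) on each coordinate.
    x = pos[0] + dx + rng.uniform(-noise, noise)
    y = pos[1] + dy + rng.uniform(-noise, noise)
    cell = (math.floor(x), math.floor(y))
    inside = 0 <= x < width and 0 <= y < height
    if not inside or cell in walls:
        return pos, 0.0, False        # blocked: the robot stays in place
    if cell in goal:
        return (x, y), 1.0, True      # successful episode, reward of 1
    return (x, y), 0.0, False
```

An episode would call `step` at most \(H\) times and stop early when `done` is true; the uniform discretization of the state is the `cell` computed inside the function.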

Figure 1 shows six different tasks, \(\Omega _1\), \(\Omega _2\), \(\Omega _3\), \(\Omega _4\), \(\Omega _5\) and \(\Omega \), given that the goal states, and therefore, the reward functions, are different. All these tasks are used in the experiments described in the next sections.

We choose the robot navigation domain for experimentation because it has been widely used in transfer learning papers (e.g., [25, 38, 43]) and provides us an empirical demonstration of the theoretical results. Policy Reuse has also been successfully applied in more complex domains, such as the Keepaway task in robot soccer, which requires: (i) a mapping between tasks that use different state and action spaces; and (ii) function approximation methods since the state space is continuous [14, 44].

The learning process has been first executed following different exploration strategies that do not use any past policy. Specifically, we have used four different strategies: (i) random; (ii) completely greedy; (iii) an \(\epsilon \)-greedy (i.e., with probability \(\epsilon \) follows the greedy strategy, and with probability \((1-\epsilon )\) acts randomly), with an initial value of \(\epsilon =0\), which is incremented by 0.0005 in each episode; (iv) the Boltzmann strategy (\(P(a_j)=\frac{\mathrm{e}^{\tau Q(s,a_j)}}{\sum _{p=1}^n \mathrm{e}^{\tau Q(s,a_p)} }\)), initializing \(\tau =0\), and increasing it by 5 in each learning episode.
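The two parameterized strategies can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, Q-values are passed as a plain list indexed by action, and the optional cap in `schedule` is an assumption, since the text only gives the initial values and the per-episode increments.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Greedy with probability epsilon, uniformly random otherwise.

    Note the convention used in the text: epsilon is the probability of
    acting greedily, so epsilon = 1 is the completely greedy strategy.
    """
    if rng.random() < epsilon:
        return max(range(len(q_values)), key=q_values.__getitem__)
    return rng.randrange(len(q_values))


def boltzmann_probabilities(q_values, tau):
    """P(a_j) = exp(tau * Q(s, a_j)) / sum_p exp(tau * Q(s, a_p)), tau >= 0."""
    shift = max(q_values)  # subtracting the maximum avoids overflow
    weights = [math.exp(tau * (q - shift)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def schedule(episode, increment, initial=0.0, cap=None):
    """Linear schedule initial + increment * episode, optionally capped."""
    value = initial + increment * episode
    return value if cap is None else min(value, cap)
```

For example, `schedule(k, 0.0005, cap=1.0)` would give \(\epsilon \) at episode `k`, and `schedule(k, 5.0)` would give \(\tau \); with \(\tau =0\) the Boltzmann strategy is uniform, and it becomes greedier as \(\tau \) grows.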

Figure 2 shows the results. Each learning process has been executed 10 times; the average value is shown, and error bars show standard deviations. When acting randomly, the average gain in learning is almost 0, given that acting randomly is a very poor strategy. However, when a greedy behavior is introduced (strategy 1-greedy), the curve shows a slow increment, achieving values of almost 0.1. The curve obtained by the Boltzmann strategy does not offer significant improvements. The \(\epsilon \)-greedy strategy seems to compute an accurate policy in the initial episodes, and it corresponds to the highest average gain at the end of the learning.
https://static-content.springer.com/image/art%3A10.1007%2Fs13748-012-0026-6/MediaObjects/13748_2012_26_Fig2_HTML.gif
Fig. 2

Results of the learning process for different exploration strategies that learn from scratch

6.1 Parameter setting

In the \(\pi \)-reuse exploration strategy, there are three probabilities involved: the probability of exploiting the past policy, i.e., \(\psi _h\), the probability of using the current policy, i.e., \(\epsilon (1-\psi _h)\), and the probability of acting randomly, i.e., \((1-\epsilon ) (1-\psi _h)\). These probabilities are shown in Fig. 3, for input values of \(H=100\), \(\psi =1\) and \(\upsilon =0.95\).
https://static-content.springer.com/image/art%3A10.1007%2Fs13748-012-0026-6/MediaObjects/13748_2012_26_Fig3_HTML.gif
Fig. 3

Evolution of the probabilities of exploring and exploiting in an episode for the \(\pi \)-reuse exploration strategy

With this parameter setting, the exploration is biased with the past policy mainly in the initial steps of the episode. Assigning to \(\epsilon \) the value of \((1-\psi _h)\) makes the strategy very greedy in the final steps of each episode, given that we assume that the last steps are the ones that are learned faster (since rewards are also propagated fast from the goal). The figure shows that in the initial steps of each episode, the past policy is exploited. As the number of steps increases, the probabilities of exploiting the new policy and acting randomly increase. In the final steps of the episode, only the new policy is exploited. The transition from exploiting the past policy to exploiting the new one depends on the \(\upsilon \) parameter. If this parameter is low, the transition occurs in the initial steps, while if it is high, the transition is delayed. This parameter setting should be tuned for each domain, as with the parameters of any other exploration strategy.
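The three probabilities can be computed directly for any step of an episode. The sketch below assumes that \(\psi _h\) decays geometrically, \(\psi _h=\psi \upsilon ^h\) with \(h\) counted from 0; the decay rule is not restated in this section, so treat it as an assumption, as is the function name `pi_reuse_probabilities`.

```python
def pi_reuse_probabilities(h, psi=1.0, upsilon=0.95):
    """Return (p_past, p_current, p_random) at step h of an episode.

    Assumes psi_h = psi * upsilon**h (h counted from 0) and
    epsilon = 1 - psi_h, as in the parameter setting described above.
    """
    psi_h = psi * upsilon ** h
    epsilon = 1.0 - psi_h
    p_past = psi_h                              # exploit the past policy
    p_current = epsilon * (1.0 - psi_h)         # exploit the current policy
    p_random = (1.0 - epsilon) * (1.0 - psi_h)  # act randomly
    return p_past, p_current, p_random


# Probabilities over an episode of H = 100 steps.
curve = [pi_reuse_probabilities(h) for h in range(100)]
```

Under this assumption the past policy is followed with probability 1 at the first step, the random component never exceeds 0.25 (reached when \(\psi _h=0.5\)), and the current policy dominates in the last steps.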

In the PRQ-Learning algorithm, the \(\epsilon \) parameter is set to \(1-\psi _h\) in each step. The remaining parameters, \(\tau =0\) and \(\Delta \tau =0.05\), which depend on the number of episodes that we can execute, were obtained empirically after an informal evaluation. Extended analysis on how to set the transfer rate has been performed by different authors [48].

6.2 Computing the Reuse Gain with PRQ-learning

We use the PRQ-Learning algorithm for learning the task \(\Omega \), defined in Fig. 1f. We assume that we have three different libraries of policies, so we distinguish three different cases. In the first one, the policy library is \(L_1=\{\Pi _2,\Pi _3, \Pi _4\}\), assuming that the tasks \(\Omega _2\), \(\Omega _3\) and \(\Omega _4\), defined in Fig. 1b, c and d, respectively, were previously solved. All these tasks are very different from the one we want to solve, so their policies are not supposed to be very useful in learning the new one. In the second case, \(\Pi _1\) is added, so \(L_2=\{\Pi _1,\Pi _2, \Pi _3, \Pi _4\}\). The third case uses the Policy Library \(L_3=\{\Pi _2, \Pi _3, \Pi _4, \Pi _5\}\). The PRQ-Learning algorithm is executed for the three cases. The learning curves are shown in Fig. 4.
https://static-content.springer.com/image/art%3A10.1007%2Fs13748-012-0026-6/MediaObjects/13748_2012_26_Fig4_HTML.gif
Fig. 4

Learning curve when learning the task of Fig. 1f reusing different sets of policies

Two main conclusions can be drawn from Fig. 4. First, when a very similar policy is included in the set of policies to be reused, the improvement on learning is very high. For instance, when reusing \(\Pi _1\) and \(\Pi _5\), the average gain is greater than 0.1 in only 500 iterations, and more than 0.25 at the end of the episode. Secondly, when no similar policy is available, the learning curve is similar to the results obtained when learning from scratch with the 1-greedy strategy, as shown in Fig. 2. Interestingly, that is the strategy followed by PRQ-Learning for the new policy, as defined by the PRQ-Learning algorithm. This demonstrates that the PRQ-learning algorithm has discovered that reusing the past policies is not useful, so it follows the best strategy available, which is to follow the 1-greedy strategy with the new policy.

The process of learning the most similar policy is illustrated in Fig. 5, which reports on the learning process when reusing the Policy Library \(L_3=\{\Pi _5, \Pi _2, \Pi _3, \Pi _4\}\). Figure 5a shows the evolution of the Reuse Gain computed for each policy involved, \(W_5\), \(W_2\), \(W_3\), \(W_4\), and the gain \(W_\Omega \). On the \(x\) axis, the number of episodes is shown, while the \(y\) axis shows the gains. Initially, the Reuse Gain of all the policies is set to 0. After a few episodes, \(W_2\), \(W_3\) and \(W_4\) stabilize below 0.05. However, \(W_5\) increases up to 0.15. These values demonstrate that the most similar policy (\(\Pi _5\)) is correctly computed. The gain of the new policy, \(W_\Omega \), starts to increase around iteration 100, achieving a value higher than 0.3 by iteration 500, demonstrating that the new policy is very accurate.
https://static-content.springer.com/image/art%3A10.1007%2Fs13748-012-0026-6/MediaObjects/13748_2012_26_Fig5_HTML.gif
Fig. 5

Evolution of \(W_i\) and \(P(\Pi _i)\)

The values of the Reuse Gain computed for each policy are used to compute the probability of selecting them in each iteration of the learning process, using the formula introduced in Eq. 4, and the parameters introduced above (initial \(\tau =0\), and \(\Delta \tau =0.05\)). Figure 5b shows the evolution of these probabilities. In the initial steps, all the past policies have the same probability of being chosen (0.2), given that the gain of all of them is initialized to 0. While the gain values are updated, the probability of \(\Pi _5\) grows. For the other past policies, the probability decreases down to 0. The probability of the new policy also increases, and after 400 iterations, it is greater than the rest. After a few more iterations, it achieves the value of 1, given that its gain is the highest, as shown in Fig. 5a.

Figure 5b demonstrates how the balance between exploiting the past policies or the new one is achieved. It shows how, in the initial episodes, the algorithm chooses to reuse the past policies to find the most similar one. Then, it reuses the most similar policy until the new policy is learned and improves on the result of reusing any past policy.
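The evolution of these probabilities can be reproduced qualitatively with a few lines of code. The sketch assumes that Eq. 4 is a softmax over the gains with temperature parameter \(\tau \) (the same form as the Boltzmann strategy used earlier, with gains in place of Q-values) and that \(\tau \) grows by \(\Delta \tau \) per episode; the function names and the gain values below are illustrative, not data read from Fig. 5.

```python
import math


def selection_probabilities(gains, tau):
    """Assumed form of Eq. 4: P(Pi_j) = exp(tau * W_j) / sum_p exp(tau * W_p)."""
    shift = max(gains)  # subtracting the maximum avoids overflow (tau >= 0)
    weights = [math.exp(tau * (w - shift)) for w in gains]
    total = sum(weights)
    return [w / total for w in weights]


def tau_at(episode, initial_tau=0.0, delta_tau=0.05):
    """Temperature parameter after a number of episodes."""
    return initial_tau + delta_tau * episode


# Order of the entries: W_5, W_2, W_3, W_4, W_Omega.
initial = selection_probabilities([0.0, 0.0, 0.0, 0.0, 0.0], tau_at(0))
# Hypothetical gains late in learning, loosely following the text above.
late = selection_probabilities([0.15, 0.04, 0.04, 0.04, 0.30], tau_at(500))
```

With all gains at 0 every entry of `initial` is 0.2, as in the first episodes of Fig. 5b; with the hypothetical late gains, almost all the probability mass moves to the new policy.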

In summary, the PRQ-learning algorithm successfully reuses a predefined set of policies and computes the reuse gain of each of the past policies. The remaining issue consists of demonstrating how the reuse gain is successfully used to build a library of policies and to learn the domain structure.

6.3 Learning the structure of the domain

In this experiment, we want to evaluate the PLPR algorithm. With this purpose, we try to learn the action policies for different tasks in the navigation domain. Performing a task consists of trying to solve it \(K=2{,}000\) times. Each of these times is called an episode. Each episode consists of a sequence of actions until the goal is achieved or until the maximum number of actions, \(H=100\), is executed. Notice that there is no separation between learning and test, so the correct balance between exploration and exploitation must be achieved to maximize the average gain in each performance.

In this domain, the task distribution is represented by 50 different tasks, each of them with a different reward function. The different reward functions are derived from goal states located in different positions of the different rooms of the domain, as shown in Fig. 6. Notice that the figure does not represent a unique task with 50 different goals, but the 50 different goal areas of the 50 different tasks.
https://static-content.springer.com/image/art%3A10.1007%2Fs13748-012-0026-6/MediaObjects/13748_2012_26_Fig6_HTML.gif
Fig. 6

Navigation domain

The results provided are the average of 10 different executions, in which the 50 different tasks are sequentially performed following a random order.

In these experiments, we use the same parameter setting as in previous experiments: for the Q-Learning algorithm, \(\gamma =0.95\) and \(\alpha =0.05\); for the \(\pi \)-reuse exploration strategy, \(\psi =1\), \(\upsilon =0.05\), and \(\epsilon \) is set to \(1-\psi _h\) in each step. In the PRQ-Learning algorithm, \(\tau \) is initially set to 0, and is increased by 0.05 after each trial.

The first element to study is the size of the Policy Library built while performing the tasks with the PLPR algorithm, i.e., the number of core-policies stored in the Policy Library, shown in Fig. 7. The figure shows in the \(y\) axis the size of the Policy Library, and in the \(x\) axis, the number of tasks performed up to that moment. As introduced above, when \(\delta =0\), only 1 policy is stored. When \(\delta =0.25\), the number of core-policies is around 14. Interestingly, this is very close to the number of rooms in the domain (15). While increasing \(\delta \), the number of core-policies increases and when \(\delta =1\), almost all the learned policies are stored.
https://static-content.springer.com/image/art%3A10.1007%2Fs13748-012-0026-6/MediaObjects/13748_2012_26_Fig7_HTML.gif
Fig. 7

Number of core-policies obtained

Figure 8 shows an example of the core-policies obtained in one execution, with \(\delta =0.25\). The figure represents the Policy Library obtained after performing the 50 tasks which, as defined above, is composed of 14 core-policies. In the figure, we assume that a policy is represented by the goal area of the task that it solves. A core-policy is also represented by the goal area, but in this case, the area is shaded. The figure demonstrates that for most of the rooms, one and only one core-policy has been learned. The algorithm has discovered that if two different tasks are given two goal areas in the same room, their respective policies are very similar, so only one of them needs to be stored in the Policy Library. That allows us to say that the structure of the domain has been learned by the PLPR algorithm, and is represented by the core-policies.
https://static-content.springer.com/image/art%3A10.1007%2Fs13748-012-0026-6/MediaObjects/13748_2012_26_Fig8_HTML.gif
Fig. 8

Core-Policies (\(\delta =0.25\))

Figure 9a shows the average gain obtained when performing the 50 different tasks with the PLPR algorithm, for the different values of \(\delta \). In most of the cases, \(\delta =0.25, 0.50, 0.75\) and \(1\), the average gain increases up to more than 0.2, and no significant differences exist between them. Only in the case of \(\delta =0\) does the average gain stay low, around 0.16, given that, as introduced above, \(\delta =0\) generates a Policy Library with only one policy (the first one learned). For comparison, the same learning process has been executed with different exploration strategies that learn from scratch, and the results are summarized in Fig. 9b.
https://static-content.springer.com/image/art%3A10.1007%2Fs13748-012-0026-6/MediaObjects/13748_2012_26_Fig9_HTML.gif
Fig. 9

Results of PLPR

The average gain obtained while new policies are learned stabilizes around 0.12 for all the strategies, without very significant differences. This demonstrates that Policy Reuse can obtain an increment of almost a 100 % gain in the performance of the 50 tasks over the results obtained when the 50 tasks are learned from scratch. Interestingly, when \(\delta =0\), and only one policy is stored, it also obtains improved results over learning from scratch, due to a good behavior of the \(\pi \)-reuse exploration strategy. That confirms that providing the learning process with a bias improves the performance, even when that bias may not be the best for all the learning processes.

7 Conclusions

Policy Reuse is a transfer learning method that contributes to Reinforcement Learning with three main capabilities. First, it provides Reinforcement Learning algorithms with a mechanism to probabilistically bias an exploration process by reusing a Policy Library. Our proposed Policy Reuse algorithm, called PRQ-learning, improves the learning performance over exploration strategies that learn from scratch. Second, Policy Reuse provides an incremental method to build the Policy Library. The library is built at the same time that new policies are learned and past policies are reused. And last, our method to build the Policy Library allows the learning of the structure of the domain in terms of a set of \(\delta \)-core-policies or \(\delta \)-Basis-Library. Reusing this set of policies ensures that a minimum gain will be obtained when learning a new task, as demonstrated theoretically.

Policy Reuse defines a completely new way to reuse previous knowledge. It should be easy to identify policies with classical macro (or SMDP) actions. However, the way we reuse policies is completely different from the way macro-actions or options are used. Let us take the case of an option. An option is defined as a mapping between states and actions, an applicability condition, and an end condition. The first and second components have a direct mapping to a policy, since a policy is a state-action mapping applicable in any situation of the domain. However, options are defined to be executed until an end condition is satisfied, or until the option is interrupted. In contrast to this scheme, Policy Reuse never executes complete policies, nor even partial ones. Instead, Policy Reuse executes individual actions suggested by past policies probabilistically. Thus, past policies are only a bias.

The worst scenario for transferring knowledge through Policy Reuse is trying to reuse a policy library where none of the stored policies is useful to solve the current task, i.e., when none of the stored policies is similar to the one being learned. The evaluation with the PRQ-Learning algorithm demonstrated that when the reused policy library included a similar policy, performance was higher than with other exploration strategies, like \(\epsilon \)-greedy. Interestingly, when the library does not include any similar policy, the algorithm does not perform worse than when learning from scratch.

Another difference between Policy Reuse and macros/options and hierarchical approaches is that Policy Reuse learns policies at the same level as past policies, while hierarchical methods learn at different abstraction levels. Last, we would like to point out that hierarchical methods typically require that the structure of the domain, i.e., the hierarchy of the domain, be known a priori. We have shown that Policy Reuse learns the structure of the domain in terms of a library of core-policies. We believe that such core-policies could be used in the future to support the learning of hierarchies or abstractions of the domain.

In addition, Policy Reuse is novel in that it is able to transfer knowledge, not only from a source task to a target task, but from many tasks to many tasks. We have demonstrated that a Policy Library can be incrementally built. This property is due to the capability of Policy Reuse to decide (i) given a set of policies, which one to reuse, and (ii) given a new policy, whether it is useful to include it in the Policy Library or not, so it can be reused in future tasks. These mechanisms make it possible to discover when policies are useful for solving a new task, minimizing the effects of negative transfer.

Footnotes
1

Constraining Policy Reuse to episodic tasks, or limiting the number of steps of an episode, is a relaxation but not a requirement for applying Policy Reuse, which has been shown to perform accurately in domains with episodes of undefined length [44].

 
2

Different methods for function approximation have been successfully applied on this domain [47]. We have simplified the state space representation to a uniform discretization to focus on the study of Policy Reuse.

 

Acknowledgments

This research was conducted while the first author was visiting Carnegie Mellon from the Universidad Carlos III de Madrid, supported by a generous grant from the Spanish Ministry of Education and Fulbright. This research was partly sponsored by the Spanish Ministerio de Ciencia e Innovación project number TIN2008-06701-C03-03 and by Comunidad de Madrid-UC3M project number CCG08-UC3M/TIC-4141. This research was partly sponsored also by Rockwell Scientific Co., LLC under subcontract no. B4U528968 and by BBNT Solutions under subcontract no. 950008572. The views and conclusions contained in this document are those of the authors only, and should not be interpreted as representing any other entity.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Universidad Carlos III de Madrid, Leganés, Spain
  2. Carnegie Mellon University, Pittsburgh, USA