A taxonomy for similarity metrics between Markov decision processes

Although the notion of task similarity is potentially interesting in a wide range of areas such as curriculum learning or automated planning, it has mostly been tied to transfer learning. Transfer is based on the idea of reusing the knowledge acquired in the learning of a set of source tasks to a new learning process in a target task, assuming that the target and source tasks are close enough. In recent years, transfer learning has succeeded in making reinforcement learning (RL) algorithms more efficient (e.g., by reducing the number of samples needed to achieve (near-)optimal performance). Transfer in RL is based on the core concept of similarity: whenever the tasks are similar, the transferred knowledge can be reused to solve the target task and significantly improve the learning performance. Therefore, the selection of good metrics to measure these similarities is a critical aspect when building transfer RL algorithms, especially when this knowledge is transferred from simulation to the real world. In the literature, there are many metrics to measure the similarity between MDPs, hence, many definitions of similarity or its complement distance have been considered. In this paper, we propose a categorization of these metrics and analyze the definitions of similarity proposed so far, taking into account such categorization. We also follow this taxonomy to survey the existing literature, as well as suggesting future directions for the construction of new metrics.


Introduction
Markov decision processes (MDPs) are a common way of encoding decision-making problems in Reinforcement Learning (RL) tasks (Sutton and Barto, 2011). In RL, an MDP is considered to be solved when a policy (i.e., a way of behaving for each state) has been discovered which maximizes the long-term expected return. However, although RL is known as an effective machine learning technique, it might perform poorly in complex problems, leading to a slow rate of convergence. This issue magnifies when facing realistic continuous problems where the curse of dimensionality is inevitable. Transfer learning in RL is a successful technique to remedy such a problem. Specifically, rather than learning a new policy for every MDP, a policy could be learned on one MDP, then transferred to another, similar MDP, and either used as is, or treated as a starting point from which to learn the new policy. Clearly, this transfer cannot be done successfully between any two MDPs, but only in the case they are similar.
Therefore, in this context, one question arises: when are two MDPs similar? In this paper, we consider the concept of similar to be related to the notion of "positive transfer" (Taylor and Stone, 2009). Formally, positive transfer happens when the knowledge in the source task contributes to the improved performance of learning in the target task, and it is considered a negative transfer otherwise, i.e., when the transfer hurts the learning performance when compared with learning from scratch. Additionally, the greater the improvement in the target task, i.e., the greater the positive transfer, the more similar the tasks have to be considered. It is important to be aware of the fact that, based on this description, the concept of similarity might not be related to the structural similarities between the MDPs. So the correct selection of metrics that allows us to measure the similarity between MDPs is a critical issue in transfer learning, precisely to avoid the negative transfer. Obviously, the use of the positive transfer to measure the similarity between two tasks has a major drawback: the similarity measure between MDPs is obtained after the transfer has been run, when really the ideal would be to compute this similarity before it, particularly if the point is to use the task similarity measure to choose a task to use in transfer. This issue magnifies when the transfer happens between simulation and the real world, where it is imperative for the efficient and safe deployment of previously learned knowledge.
The literature in transfer learning has proposed different metrics to measure the level of similarity between MDPs; hence, different definitions of the concept of similarity have been considered so far. This paper surveys the existing task similarity metrics and contributes a taxonomy that, at its root, classifies them into two clearly distinct categories: model-based and performance-based metrics. We consider this distinction a core contribution, allowing us to categorize metrics in a novel and useful way. Model-based metrics are based on the structural similarities between the MDP models. Such model-based metrics can be computed in different ways depending on which elements of the MDP models come into play to compute the similarity (Ammar et al., 2014; Taylor et al., 2008c; Milner, 1982; Castro and Precup, 2011; Svetlik et al., 2017). The major strength of these approaches is that they can be computed a priori, i.e., before the transfer happens. Therefore, they are independent of the transfer algorithm that is later used to transfer knowledge from one task to another. However, most of them require knowing in advance the exact MDP models or accurate approximations of them. In contrast, performance-based metrics are computed by comparing the performance of the learning agents in the source task and the target task. Such a performance comparison can be done in two different ways: by comparing the policies resulting from learning in the source task and the target task (Carroll and Seppi, 2005; Karimpanal and Bouffanais, 2018) or, from a transfer point of view, by measuring the transfer gain, i.e., the positive transfer (Mahmud et al., 2013; Sinapov et al., 2015; Fernández and Veloso, 2013). Many metrics can be used to measure such a transfer gain (e.g., jumpstart, asymptotic performance, total reward) (Taylor and Stone, 2009). In some ways, this transfer gain could be the best method for measuring similarity between two tasks (Carroll and Seppi, 2005).
Unfortunately, it is often difficult to compute these performance-based measures before actually solving the target task, since most of them must be computed a posteriori, i.e., after the learning processes. However, there are a few exceptions to this rule in which, for example, the similarity is computed on-line, i.e., while solving the target task (Fernández and Veloso, 2013).
Therefore, it is important to bear in mind that, despite their different nature, both model-based and performance-based metrics allow us to measure the similarity between tasks. This survey aims to categorize and discuss the main lines of current research within the computation of similarity metrics between MDPs. Our main purpose is to highlight the advantages and disadvantages of the surveyed metrics, making it easier to identify crossing points and open problems for which research communities could merge their expertise to work on, bridging the gap between the literature and the application of all of these metrics in real-world complex problems. To the best of our knowledge, this is the first survey focused on similarity metrics between MDPs. Previously, Lan et al. (2021) published an extraordinary paper surveying and proposing new similarity metrics, but it is focused on metrics between states and not between MDPs, which is the real purpose of this paper. In other words, Lan et al. (2021) focus on state similarity, whilst this paper focuses on task similarity. Nevertheless, the metrics presented by Lan et al. (2021) are discussed here as a good starting point to compose similarity metrics between MDPs. We hope our taxonomy applies to a wide range of researchers, not just those interested in transfer learning. For example, the proposed metrics could also be a critical step forward to sort the samples and tasks in Curriculum Learning (Narvekar et al., 2020), or could be used to measure the distance between simulation and the real world in a Sim-to-Real context (Zhao et al., 2020). They could also be used to understand the similarities within a set of tasks in Multitask Learning (Shui et al., 2019), or Automated Planning (Fernández et al, 2011).
Since different audiences are expected to read this survey, the following guide provides forward references to key insights and sections for target groups with different needs and motivations:
- If you are familiar with the main concepts of RL and transfer learning, you can skip Sect. 2 and go directly to Sect. 3.
- If you are interested in just an overview of the similarity metrics and how they are organized, go to Sect. 3.
- If you are interested in a deep understanding of the different approaches, you will need to read Sects. 4 and 5.
- If you want a comparative analysis of which method is best suited to a specific task, you will find your answer in Sect. 6.
- If you are interested in this area and willing to go further, see Sect. 7 for the future directions.

Background
This section introduces key concepts required to better understand the rest of the paper. First, some background in RL is introduced (Sect. 2.1), then the main concepts of RL transfer are visited (Sect. 2.2), and finally the concepts of similarity and distance (Sect. 2.3).

Reinforcement learning
Typically, RL tasks are described as Markov Decision Processes (MDPs) represented by tuples of the form $M = \langle S, A, T, R \rangle$, where $S$ is the state space, $A$ is the action space, $T : S \times A \to S$ is the transition function between states, and $R : S \times A \to \mathbb{R}$ is the reward function (Sutton and Barto, 2011). At each step, the agent observes the current state and chooses an action according to its policy $\pi : S \to A$. The goal of the RL agent is to learn an optimal policy $\pi^*$ that maximizes the return $J(\pi)$:

$$J(\pi) = \sum_{k=0}^{K} \gamma^{k} r_k \tag{1}$$

where $r_k$ is the immediate reward obtained by the agent on step $k$, $\gamma$ is the discount factor, which determines how relevant the future is (with $0 \le \gamma \le 1$), and $K$ is a final time step for finite-horizon models (including the possibility of $K = \infty$ and $0 \le \gamma < 1$ for infinite-horizon models). On the one hand, if the task is episodic, the interaction between the agent and the environment is divided into episodes. In finite- and infinite-horizon episodic tasks, an episode always ends when reaching a terminal state but, for finite-horizon tasks, it also ends when a fixed number of steps $K$ has passed. On the other hand, a task can be an infinite-horizon continuing task, which means the task never ends. With the goal of learning the policy $\pi$, Temporal Difference methods (Sutton and Barto, 2011) estimate the sum of rewards represented in Eq. (1). The function that estimates the sum of rewards, i.e., the return for each state $s$ given the policy $\pi$, is called the value function $V^{\pi}(s) = E[J(\pi) \mid s_0 = s]$. Similarly, the action-value function $Q^{\pi}(s, a) = E[J(\pi) \mid s_0 = s, a_0 = a]$ is the estimate of the value of performing a given action $a$ at a state $s$ when $\pi$ is the policy followed. The corresponding value function and action-value function for the optimal policy $\pi^*$ are denoted $V^*$ and $Q^*$, respectively. The Q-learning algorithm (Watkins, 1989) is one of the most widely used for computing the action-value function.
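As a minimal illustration of the quantities above, the following Python sketch computes the discounted return of Eq. (1) for a finite reward sequence and applies one tabular Q-learning update. The function names, the dictionary-based Q-table, and the learning rate `alpha` are assumptions made for this example and are not taken from the cited works.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of rewards: sum over k of gamma**k * r_k."""
    total = 0.0
    # Accumulate from the last reward backwards (Horner-style evaluation).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step on a dict keyed by (state, action).

    Missing entries are treated as 0.0 (an assumption of this sketch).
    """
    best_next = max(q.get((s_next, b), 0.0) for b in actions)
    td_target = r + gamma * best_next
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + alpha * (td_target - old)
    return q[(s, a)]


if __name__ == "__main__":
    # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
    print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))
    q_table = {}
    print(q_learning_update(q_table, "s0", "right", 1.0, "s1", ["left", "right"]))
```

Evaluating the return backwards avoids computing explicit powers of the discount factor, which is convenient for long episodes.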
In small domains with a small number of states and actions, the $Q$ function can be fully represented with a lookup table. However, as the state and action spaces grow, a different approach is required. One way to extend the $Q$ function to a continuous state-action space is to discretize the environment in order to reduce that space, so that a tabular representation of $Q$ is still possible. However, in such continuous scenarios, both the $V$ and $Q$ functions are typically estimated using a universal function approximator such as an artificial neural network (Wiering and van Otterlo, 2014). In this case, the value function is expressed as a linear, $V(s) = \theta^{T} \phi(s)$, or non-linear, $V(s) = V(\phi(s), \theta)$, combination of a parameter vector $\theta$ and a feature vector $\phi(s)$. Equivalently, the $Q$ function can also be expressed in terms of $\theta$ and $\phi$ (Van Hasselt, 2012).
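The linear case can be sketched in a few lines of Python. The polynomial feature map `phi` and the parameter values below are hypothetical choices for a one-dimensional state; they only illustrate how a value estimate is obtained as the inner product of a parameter vector and a feature vector.

```python
def phi(state):
    """Hypothetical feature map for a scalar state: [1, s, s^2]."""
    return [1.0, state, state * state]


def linear_value(theta, features):
    """Linear value estimate: inner product of parameters and features."""
    if len(theta) != len(features):
        raise ValueError("theta and features must have the same length")
    return sum(t * f for t, f in zip(theta, features))


if __name__ == "__main__":
    theta = [0.5, -1.0, 2.0]  # illustrative parameter vector
    # 0.5 * 1 + (-1.0) * 2 + 2.0 * 4 = 6.5
    print(linear_value(theta, phi(2.0)))
```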

Transfer learning for reinforcement learning
In the transfer learning scenario, we assume there is an agent that has previously addressed a set of source tasks represented as a sequence of MDPs, $M_1, \dots, M_n$. If these tasks are somehow "similar" to a new task $M_{n+1}$, then it seems reasonable for the agent to use the knowledge acquired in solving $M_1, \dots, M_n$ to solve the new task $M_{n+1}$ faster than it would be able to from scratch. Transfer learning is the problem of how to obtain, represent and, ultimately, use the previous knowledge of an agent (Torrey and Shavlik, 2010; Taylor and Stone, 2009).
However, transferring knowledge is not an easy endeavour. On the one hand, we can distinguish different transfer settings depending on whether or not the source and the target tasks share the state and action spaces, the transition probabilities and the reward functions. It is common to assume that the tasks share the state space and the action set, but differ in the transition probabilities and/or reward functions. However, if the tasks do not share the state and/or the action spaces, it is necessary to build mapping functions, $X_S(s_t) = s_s$ and $X_A(a_t) = a_s$, able to map a state $s_t$ or action $a_t$ in the target task to a state $s_s$ or action $a_s$ in the source task. Such mapping functions require knowing not only whether two tasks are related, but also how they are related, which adds difficulty. On the other hand, it is necessary to select what type of information is going to be transferred. Different types of information have been transferred so far, ranging from instance transfer (a set of samples collected in the source task) to policy transfer (i.e., the policy learned in the source task). This is not a simple choice either because, depending on how, and how closely, the source and the target tasks are related, one type of information or another may be the appropriate one to transfer.
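To make the role of the mapping functions concrete, the sketch below initializes a target-task Q-table from a source-task Q-table through hand-written state and action mappings. The dictionaries `x_s` and `x_a`, the task labels, and the default value for unmapped pairs are all hypothetical; in practice such mappings may have to be learned or provided by a domain expert.

```python
def transfer_q_values(q_source, target_states, target_actions, x_s, x_a, default=0.0):
    """Initialize a target Q-table by looking up mapped source entries.

    x_s maps target states to source states; x_a maps target actions
    to source actions. Pairs with no source entry receive `default`.
    """
    q_target = {}
    for s_t in target_states:
        for a_t in target_actions:
            source_key = (x_s[s_t], x_a[a_t])
            q_target[(s_t, a_t)] = q_source.get(source_key, default)
    return q_target


if __name__ == "__main__":
    # Source task uses actions "left"/"right"; target task uses "west"/"east".
    q_source = {("cell0", "left"): 0.1, ("cell0", "right"): 0.9}
    x_s = {"room_a": "cell0"}
    x_a = {"west": "left", "east": "right"}
    print(transfer_q_values(q_source, ["room_a"], ["west", "east"], x_s, x_a))
```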
Finally, the most "similar" task among M 1 , … , M n to solve M n+1 should be selected in the hope that it produces the most positive transfer. For this purpose, similarity metrics could be used, which translate into a measurable quantity of how related two tasks are.

Similarity and distance metrics
Similarity metrics are a very important part of transfer learning, as they provide a measure of distance between tasks. A similarity function s(⋅, ⋅) , or its complementary distance function d(⋅, ⋅) , is a mathematical function that assigns a numerical value to each pair of concepts or objects in a given domain. This value measures how similar these two concepts or objects are: if they are very similar, it is assigned a very low distance, and if they are very dissimilar, it is assigned a larger distance (Ontañón, 2020). Intuitively, for each distance function d(⋅, ⋅) we can define its associated similarity function s(⋅, ⋅) = u∕(1 + d(⋅, ⋅)) , where u is the maximum similarity value, usually u = 1 . For simplicity, in this survey we use the distance function d(⋅, ⋅) to formulate the distance between MDPs, knowing that this distance also captures the similarity between tasks.
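The conversion from a distance to its associated similarity can be written directly. The sketch below assumes non-negative distances and uses the default maximum similarity of 1; the function name is illustrative.

```python
def similarity_from_distance(distance, u=1.0):
    """Similarity associated with a distance: u / (1 + distance)."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return u / (1.0 + distance)


if __name__ == "__main__":
    for d in (0.0, 1.0, 3.0):
        # Zero distance gives the maximum similarity u; larger distances decay towards 0.
        print(d, similarity_from_distance(d))
```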

Taxonomy of similarity metrics for MDPs
We consider two tasks, $M_i$ and $M_j$, described formally by the tuples $M_i = \langle S_i, A_i, T_i, R_i \rangle$ and $M_j = \langle S_j, A_j, T_j, R_j \rangle$, which may or may not share the state space, the action space, or the transition and reward dynamics.
Definition 1 Given two tasks $M_i$ and $M_j$, we define a task distance metric as a heuristic $d(M_i, M_j)$ that assigns a numerical distance to the pair of tasks.
Definition 1 allows us to use the function $d(\cdot, \cdot)$ to obtain a partial order between tasks in such a way that we can select the most similar one. Ideally, the concept of similarity should be related to the concept of positive transfer: the smaller the distance $d(M_i, M_j)$, the greater the positive transfer. However, in most cases, similarity metrics do not provide guarantees for this ideal behavior. Additionally, $d(M_i, M_j)$ should be computed before or, at least, during the transfer experiment, in order to select an adequate task to use in transfer. However, the literature proposes different ways to compute $d(M_i, M_j)$.
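Once some distance between tasks is available, using it to order candidate source tasks is straightforward. The following sketch ranks a library of source tasks by their distance to a target task; the toy task descriptions and the `toy_distance` heuristic are purely hypothetical stand-ins for any of the metrics surveyed here.

```python
def rank_source_tasks(target, sources, distance):
    """Return source-task names sorted from closest to farthest from the target.

    `sources` maps a task name to its description; `distance` is any
    callable taking (source, target) and returning a number.
    """
    return sorted(sources, key=lambda name: distance(sources[name], target))


def toy_distance(task_a, task_b):
    """Hypothetical heuristic: absolute difference between goal positions."""
    return abs(task_a["goal"] - task_b["goal"])


if __name__ == "__main__":
    library = {"M1": {"goal": 2}, "M2": {"goal": 9}, "M3": {"goal": 5}}
    new_task = {"goal": 6}
    ranking = rank_source_tasks(new_task, library, toy_distance)
    print(ranking)     # closest task first
    print(ranking[0])  # candidate source task for transfer
```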
In this paper, we consider two main trends for the computation of the distance metric d(M i , M j ) . Such trends are depicted in Table 1. The first one measures the structural or model similarities between the given MDPs. The second measures the similarities by using the performance of the learning agent in both the source and the target tasks.
Model-based metrics. Metrics following the first trend measure the degree of similarity between a source and a target task by using their corresponding MDP models (i.e., states, actions, and transition and reward dynamics). There are several alternatives to these model-based metrics depending on which components of the MDPs are taken into account. In this survey, we categorize these metrics in five groups: (i) transition and reward dynamics, (ii) transitions, (iii) rewards, (iv) state and actions and (v) states:
• Transition and reward dynamics. They require complete knowledge of the MDP models of both the source task and the target task. We distinguish three ways of computing similarity metrics using such complete knowledge: (i) metrics based on state abstraction (or state aggregation) techniques (Li et al., 2006; Ferns et al., 2004, 2012; Castro, 2020), (ii) compliance metrics (Lazaric, 2008; Fachantidis et al., 2015; Fachantidis, 2016) and (iii) metrics based on the construction of MDP graphs (Kuhlmann and Stone, 2007).
  - State abstraction. In RL, it is common practice to aggregate states in order to obtain an abstract description of the problem, i.e., a more compact representation of the task that is easier to work with (Giunchiglia and Walsh, 1992; Li et al., 2006). These approaches are based on the same common principle: if a number of states are considered to be similar, they can be aggregated as a single one. This same principle can also be used to compute the similarity between two states belonging to different MDPs. In fact, the cumulative similarity between each pair of these states could be used to compute a sort of similarity metric between the MDPs. In this paper, we survey two of these methods which have actually been used for transfer in RL: bisimulation (Ferns et al., 2004, 2012; Castro and Precup, 2011; Song et al., 2016), and homomorphism (Ravindran and Barto, 2002; Sorg and Singh, 2009), both of which require complete knowledge or accurate approximations of the MDP models to compute the similarity between states. Although only bisimulation and homomorphism are discussed in detail in this paper, we consider that other state abstraction techniques might also be used as similarity metrics between MDPs.
  - Compliance. The compliance measure is defined as the probability of a sample ⟨s, a, s′, r⟩ in the target task being generated in the source task (Fachantidis, 2016; Fachantidis et al., 2015). Therefore, the compliance between the entire target task and the entire source task allows us to measure the similarity between the two tasks.
  - MDP graphs. They are based on the construction of graphs that represent the transition and the reward functions of both the source task and the target task (Kuhlmann and Stone, 2007; Liu and Stone, 2006). Then, they find structural similarities between tasks based on graph-similarity or graph-matching algorithms. Such approaches are based on an interesting idea for similarity computation: the alternative representation of tasks through descriptive structures that allow an easy comparison between them.
• Transitions. The metrics considered in this category use tuples of the form ⟨s, a, s′⟩ to measure the similarity between MDPs (Taylor et al., 2008c; Ammar et al., 2014). By using such tuples, they model the behavioral dynamics of the two MDPs to be compared, and then they try to find differences between them.
• Rewards. The metrics considered in this category use tuples of the form ⟨s, a, r⟩ to measure the similarity between MDPs. This set includes different techniques (Carroll and Seppi, 2005; Tao et al., 2021; Gleave et al., 2020).
• State & actions. These metrics use pairs of the form ⟨s, a⟩ in both the source task and the target task to compute the similarity between MDPs. Such pairs can be used in different ways, resulting in different similarity metrics (Carroll and Seppi, 2005; Taylor et al., 2008b; Narayan and Leong, 2019).
• States. Finally, metrics in this category use the state space in both the source and the target task to compute the similarity between them (Svetlik et al., 2017). Another area of research that is relevant in this category is that of case-based reasoning (CBR) (Aamodt and Plaza, 1994). RL approaches based on CBR use a similarity function between the states in the target task and the states stored in a case base corresponding to a previous source task (Celiberto Jr et al., 2011). Such a similarity function could be used to measure the similarity between the state spaces and, hence, the similarity between the two tasks.
Table 1 Taxonomy and summary of the similarity metrics considered in this survey
Performance-based metrics. As regards the second major category in the proposed taxonomy, performance-based metrics are based on the performance of the agents in the source task and the target task, where this performance can be related either to the policies learned by the agents in these tasks, or to the transfer gain an agent obtains by reusing the knowledge of a source task in a target task. We therefore distinguish two different approaches to computing such performance-based similarities: (i) by the policy similarity and (ii) by the transfer gain obtained when transferring the knowledge from a source task to the target task.
-Policy similarity They are based on the use of the learned value function V or the action-value function Q , or equivalently, on the behavioral policies obtained in the source task and the target task. Therefore, these metrics require the full (or partial) learning of these policies before the computation of the similarity between tasks. Such a comparison can be conducted in two different ways depending on what is being compared: (i) the policy values, or (ii) the policy parameters.

  - Policy values. In this case, the comparison is conducted by observing the specific values of the Q-function or the V-function corresponding to the source and the target tasks (Carroll and Seppi, 2005; Zhou and Yang, 2020). Therefore, what is really being measured in this case is the degree of similarity of the policies obtained in both tasks.
  - Policy parameters. In RL, the value function V or the action-value function Q is usually represented by a parameter vector θ. Intuitively, the metrics within this category compare the particular weights of the parameter vectors corresponding to the value functions of the source task and the target task in order to measure the similarity between them (Karimpanal and Bouffanais, 2018). Therefore, these metrics are only applicable with parametric representations of policies, and not with tabular representations. Such an approach opens the door to the comparison of other policy representations (Ferrante et al., 2008).
-Transfer gain In these techniques, the level of similarity is an approximation to the advantage gained by using the knowledge in one source task to speed the learning of another target task (Carroll and Seppi, 2005;Carroll, 2005;Taylor and Stone, 2009). So, it is important to bear in mind that these metrics actually require that the transfer experiment be entirely or partially run before measuring the degree of similarity between tasks. Many metrics to measure such a transfer gain are possible, including jumpstart, asymptotic performance, total reward (see (Taylor and Stone, 2009) for a complete listing of metrics for transfer gain). However, regardless of the particular technique used to compute such a transfer gain, the higher the transfer gain, the greater the similarity between the tasks. In this paper, we distinguish two approaches within this category depending on whether the transfer gain is computed after or during the transfer process: (i) off-line transfer gain, and (ii) on-line transfer gain.
  - Off-line transfer gain. The transfer gain is estimated as the difference in performance between the learning process with and without transfer (Carroll, 2005; Mahmud et al., 2013; Sinapov et al., 2015). Therefore, in these approaches, the gain is computed once the learning processes are considered to be finished.
  - On-line transfer gain. In contrast, in these approaches the gain is estimated on-line, at the same time that the policy in the target task is computed (Azar et al., 2013; Fernández and Veloso, 2013; Li and Zhang, 2017). In this way, it is possible to decide on-line which is the closest task within a library of past tasks, so that the knowledge of the selected closest task can have a greater influence on learning the policy in the new task.
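The two families of performance-based metrics can be illustrated with a short Python sketch. The first function compares tabular Q-values over the state-action pairs shared by two tasks using a mean absolute difference, which is only one possible choice and is an assumption of this example. The second computes three common off-line transfer gain measures (jumpstart, asymptotic performance and total reward) from two learning curves; the curve values and the `tail` window are illustrative.

```python
def q_value_distance(q_source, q_target):
    """Mean absolute difference of Q-values over shared (state, action) keys."""
    shared = set(q_source) & set(q_target)
    if not shared:
        raise ValueError("the two Q-tables share no state-action pair")
    return sum(abs(q_source[k] - q_target[k]) for k in shared) / len(shared)


def transfer_gain(curve_transfer, curve_scratch, tail=1):
    """Off-line transfer gain from per-episode performance curves.

    jumpstart:    difference in initial performance
    asymptotic:   difference in mean performance over the last `tail` episodes
    total_reward: difference in accumulated performance (area under the curve)
    """
    def mean(values):
        return sum(values) / len(values)

    return {
        "jumpstart": curve_transfer[0] - curve_scratch[0],
        "asymptotic": mean(curve_transfer[-tail:]) - mean(curve_scratch[-tail:]),
        "total_reward": sum(curve_transfer) - sum(curve_scratch),
    }


if __name__ == "__main__":
    q_a = {("s", "a"): 1.0, ("s", "b"): 3.0}
    q_b = {("s", "a"): 2.0, ("s", "b"): 3.0}
    print(q_value_distance(q_a, q_b))
    print(transfer_gain([2.0, 4.0, 6.0], [0.0, 3.0, 6.0]))
```

Positive values in the returned dictionary indicate that learning with transfer outperformed learning from scratch on that measure; under the view adopted in this survey, larger gains correspond to more similar tasks.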

Model-based metrics
This section presents in detail the model-based metrics considered in this paper, i.e., the metrics that evaluate the similarity between tasks using the components of their respective MDPs. As presented in Sect. 2.1, the structure of an MDP is formally described by a tuple ⟨S, A, T, R⟩ . Different model-based metrics result depending on what components of the MDPs take part in the computation of the similarity. Therefore, this survey categorizes the model-based metrics in five groups: (i) transition and reward dynamics, (ii) transitions, (iii) rewards, (iv) state and actions and (v) states.

Transition and reward dynamics
This first group of techniques makes use of all the components of the MDPs (i.e., the state and action spaces, and the transition and reward dynamics) to compute the similarity metrics. Depending on how this information is used, we can distinguish three different groups of metrics: (i) metrics based on state abstraction (or state aggregation) techniques (Li et al., 2006; Ferns et al., 2004, 2012; Castro, 2020), (ii) compliance metrics (Lazaric, 2008; Fachantidis et al., 2015; Fachantidis, 2016) and (iii) metrics based on the construction of MDP graphs (Kuhlmann and Stone, 2007).

State abstraction
Learning in a high-dimensional state space not only increases the time and memory requirements of learning algorithms, but also degrades performance due to the curse of dimensionality (Kaelbling et al., 1996). This motivates the need for state abstraction, the process of grouping states into abstract representations that are more compact and easier to work with, while preserving the dynamics of the original system. In this paper, we survey two methods for state abstraction: bisimulation (Ferns et al., 2004, 2012; Castro and Precup, 2011; Song et al., 2016), and homomorphism (Ravindran and Barto, 2002; Sorg and Singh, 2009).
Bisimulation. Bisimulation considers two states equivalent (hence, they can be grouped) when, for every action, they achieve the same immediate reward and have the same probability of transitioning to classes of equivalent states (Givan et al., 2003; Phillips, 2006). Figure 1 shows an example of bisimulation reduction by transforming an MDP $M$ of four states into an abstract MDP $M'$ with two states. Based on the equivalence relation of bisimulation between each pair of states in $M$, $s_0, s_1 \in M$ can be grouped as $s_0 \in M'$, and $s_2, s_3 \in M$ can be grouped as $s_1 \in M'$.
However, we are interested in metrics and not in equivalence relations. Such a metric could assign a distance of 1 to states that are not bisimilar, and 0 otherwise, not possessing more distinguishing power than that of bisimulation itself. Therefore, as a desirable property of this metric, it should vary smoothly and proportionally with differences in rewards and transition probabilities. Such a bisimulation metric was first proposed by Ferns et al. (2004) and, although it was originally defined as the distance between states belonging to the same MDP, the definition can be easily extended as the distance between states belonging to different MDPs.
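Definition 2 Given two MDPs, $M_i$ and $M_j$, the distance between two states $s_i \in S_i$ and $s_j \in S_j$ is defined as:

$$d(s_i, s_j) = \max_{a \in A} \left\{ \left| (R_i)^{a}_{s_i} - (R_j)^{a}_{s_j} \right| + c \, T_K(d)\left( (T_i)^{a}_{s_i}, (T_j)^{a}_{s_j} \right) \right\}$$

where $(R_i)^{a}_{s_i}$ and $(T_i)^{a}_{s_i}$ are the immediate reward and the probabilistic transition when action $a \in A$ is taken at state $s_i$ (and analogously for $s_j$ in $M_j$), $c \in (0, 1)$ is a constant weighting the transition term, and $T_K(d)$ is the Kantorovich distance between the two probabilistic transitions.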
The bisimulation metric $d(s_i, s_j)$ is constructed by comparing the transition and reward dynamics of $s_i \in S_i$ and $s_j \in S_j$: the more similar the reward and transition structures of $s_i$ and $s_j$ are, the smaller $d(s_i, s_j)$. The metric $d(s_i, s_j)$ is a (unique) fixed-point metric, and it can be calculated iteratively by starting with a metric that is zero everywhere and iterating until the difference in metric distances between iterations drops below a certain threshold (we refer the reader to Ferns et al. (2004, 2006)). However, bisimulation metrics are difficult to use at scale and to compute online, which is why other bisimulation-inspired metrics have recently appeared, such as π-bisimulation (Castro, 2020), deep bisimulation for control (Zhang et al., 2021), the policy similarity metric (Agarwal et al., 2021), or the MICo distance. Finally, it is worth noting that the model-irrelevance metric (Li et al., 2006) and its approximate version (Abel et al., 2016) share the same principles as bisimulation: two states are considered similar if they have similar transition and reward functions.
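The iterative computation described above can be sketched for the special case of deterministic MDPs that share an action set. In that case the Kantorovich distance between two next-state distributions reduces to the current distance between the two successor states, so no optimal-transport solver is needed. The deterministic restriction, the weighting constant `c`, the stopping tolerance and the nested-dictionary encoding of rewards and transitions are all assumptions of this sketch, not part of the cited definitions.

```python
def bisimulation_distances(R_i, T_i, R_j, T_j, actions, c=0.9, tol=1e-8, max_iter=10000):
    """Fixed-point iteration for a bisimulation-style metric between two
    deterministic MDPs with a shared action set.

    R_x[s][a] is the immediate reward and T_x[s][a] the successor state.
    Returns a dict mapping (s_i, s_j) to the estimated distance.
    """
    states_i, states_j = list(R_i), list(R_j)
    # Start from the metric that is zero everywhere.
    d = {(si, sj): 0.0 for si in states_i for sj in states_j}
    for _ in range(max_iter):
        new_d = {}
        for si in states_i:
            for sj in states_j:
                new_d[(si, sj)] = max(
                    abs(R_i[si][a] - R_j[sj][a]) + c * d[(T_i[si][a], T_j[sj][a])]
                    for a in actions
                )
        change = max(abs(new_d[k] - d[k]) for k in d)
        d = new_d
        if change < tol:  # stop once the metric has stabilized
            break
    return d


if __name__ == "__main__":
    # Two copies of the same two-state chain: "x" leads to the absorbing state "y".
    rewards = {"x": {"go": 0.0}, "y": {"go": 1.0}}
    transitions = {"x": {"go": "y"}, "y": {"go": "y"}}
    dist = bisimulation_distances(rewards, transitions, rewards, transitions, ["go"])
    for pair, value in sorted(dist.items()):
        print(pair, value)
```

In the toy example, each state is at distance zero from its own copy, while "x" and "y" differ by the reward gap of their single action.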
Homomorphism. One of the shortcomings of the bisimulation metric presented in Definition 2 is that it requires both MDPs, $M_i$ and $M_j$, to have the same action sets, i.e., it requires that the behavior matches for exactly the same actions, which is not always the case. In many practical problems, actions with the exact same label may not match, so correspondences between states should be allowed by matching their behavior under different actions. This idea is formalized as MDP homomorphism (Ravindran and Barto, 2002; Sorg and Singh, 2009). The aim of abstraction in MDP homomorphism is to group similar state-action pairs instead of just states (Castro and Precup, 2010). Therefore, MDP homomorphisms do not require behavioral equivalence under the same action labels, and this idea was elegantly extended with the lax bisimulation metric proposed by Taylor et al. (2008a).

Definition 3 Given two MDPs $M_i$ and $M_j$, the distance between the state-action pairs $(s_i, a_i)$ and $(s_j, a_j)$, with $s_i \in S_i$, $a_i \in A_i$, $s_j \in S_j$ and $a_j \in A_j$, is defined as:

$$d((s_i, a_i), (s_j, a_j)) = \left| (R_i)^{a_i}_{s_i} - (R_j)^{a_j}_{s_j} \right| + c \, T_K(d)\left( (T_i)^{a_i}_{s_i}, (T_j)^{a_j}_{s_j} \right) \tag{3}$$

where $(R_i)^{a_i}_{s_i}$ and $(T_i)^{a_i}_{s_i}$ are the immediate reward and the probabilistic transition when action $a_i$ is taken at state $s_i$ (and analogously for $M_j$), $c \in (0, 1)$ is a constant weighting the transition term, and $T_K(d)$ is the Kantorovich distance between the two probabilistic transitions. From the distance between state-action pairs in Eq. (3) we can then define a state metric as:

$$d(s_i, s_j) = \max\left( \max_{a_i \in A_i} \min_{a_j \in A_j} d((s_i, a_i), (s_j, a_j)), \; \max_{a_j \in A_j} \min_{a_i \in A_i} d((s_i, a_i), (s_j, a_j)) \right)$$

Taylor et al. (2008a) demonstrate that the lax bisimulation metric relates more states, allowing for more compression than bisimulation metrics (Ferns et al., 2004), as it captures different regularities and other types of special structure in the environment.
Bisimulation-based metrics have been successfully used as measures of similarity between states, with applications including state aggregation (Li et al., 2006) and representation learning (Comanici et al., 2015). They have also been used to discover regions of the state space that can be transferred from one task to another (Castro and Precup, 2010). However, few works have addressed how the individual distances between states can be used to determine how similar two tasks are in toto (Song et al., 2016). Using the metrics in Definitions 2 and 3, we can compute the distance between all state pairs in $M_i$ and $M_j$. Once these distances are computed, they must be combined to obtain the distance $d(M_i, M_j)$ between the two MDPs.
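The step from state-action distances to a state distance in the lax bisimulation setting can be sketched as a Hausdorff-style matching over actions: every action of one state must find a close counterpart among the actions of the other state, in both directions. The sketch below follows that construction under the assumption that the state-action distances for a fixed pair of states are already available; the action labels and distance values are hypothetical.

```python
def lax_state_distance(pair_distance, actions_i, actions_j):
    """Hausdorff-style distance between two states.

    `pair_distance(a_i, a_j)` returns the distance between the
    state-action pairs (s_i, a_i) and (s_j, a_j) for fixed s_i, s_j.
    """
    # Worst case, over actions of s_i, of the best matching action of s_j.
    forward = max(min(pair_distance(ai, aj) for aj in actions_j) for ai in actions_i)
    # Same in the opposite direction.
    backward = max(min(pair_distance(ai, aj) for ai in actions_i) for aj in actions_j)
    return max(forward, backward)


if __name__ == "__main__":
    # Hypothetical state-action distances: "left" behaves like "west",
    # and "right" behaves almost like "east".
    table = {
        ("left", "west"): 0.0, ("left", "east"): 1.0,
        ("right", "west"): 1.0, ("right", "east"): 0.2,
    }
    distance = lax_state_distance(
        lambda ai, aj: table[(ai, aj)], ["left", "right"], ["west", "east"]
    )
    print(distance)
```

Because actions are matched by behavior rather than by label, the two states in the example end up close even though their action sets have no label in common.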
Definition 4 Given two MDPs $M_i$ and $M_j$, we can define the distance between them as:

$$d(M_i, M_j) = \Delta(S_i, S_j) \quad (5)$$

where $\Delta$ measures the distance between the state spaces $S_i$ and $S_j$ by using the individual distances between the state pairs, $d(s_i, s_j)$.

Simply accumulating or averaging the distances between all state pairs may not be appropriate. For this reason, Song et al. (2016) define $\Delta(\cdot, \cdot)$ as a function that measures the distance between the sets corresponding to the state spaces $S_i$ and $S_j$, by using the Hausdorff and the Kantorovich metrics. In the transfer experiments they conducted, the Kantorovich metric avoids negative transfer by filtering out dissimilar tasks, whereas the Hausdorff metric does not have this property. Additionally, Song et al. (2016) focus only on finite MDPs. Much work remains to determine whether $\Delta(\cdot, \cdot)$ is also computable for continuous tasks, or whether other metrics between sets can be used (Conci and Kubrusly, 2018).
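As a rough illustration of how individual state distances might be composed into a single task distance, the sketch below computes the Hausdorff distance between two finite state spaces from a precomputed table of pairwise state distances. The function name `hausdorff_distance` and the table layout are assumptions made for this example, not part of Song et al. (2016); the Kantorovich variant, which requires solving a transport problem, is not shown.

```python
def hausdorff_distance(dist):
    """Hausdorff distance between two finite sets, given pairwise distances.

    dist[p][q] is the distance between the p-th state of one MDP and the
    q-th state of the other (hypothetical layout chosen for this sketch).
    """
    # For every state of the first MDP, distance to its closest counterpart.
    forward = max(min(row) for row in dist)
    # Same in the other direction: columns index states of the second MDP.
    n_cols = len(dist[0])
    backward = max(min(row[q] for row in dist) for q in range(n_cols))
    return max(forward, backward)


# Toy example: 2 states in the first MDP, 3 states in the second.
pairwise = [
    [0.0, 0.4, 0.9],
    [0.5, 0.1, 0.7],
]
d_tasks = hausdorff_distance(pairwise)
```

In this toy table the third state of the second MDP has no close counterpart, and that single worst case determines the result, which hints at why a worst-case set distance can behave differently from a transport-based one.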

Compliance
Task compliance was first introduced by Lazaric et al. (Lazaric et al., 2008; Lazaric, 2008) with the goal of transferring samples from a source task to a target task. For this transfer to be positive, source tasks must be selected whose samples are similar to those produced in the target task. Such a problem can be stated as a model identification problem in which the goal is to identify a particular task from a distribution of tasks by determining its transition dynamics and reward function (Mendonca et al., 2020). Compliance helps to measure the similarity between tasks by calculating the average probability of the source task generating the target's samples (Fachantidis, 2016; Fachantidis et al., 2015).
Definition 5 Given two MDPs, $M_i$ and $M_j$, and a set of experience tuples generated in $M_i$, $D_{M_i} = \{\langle s_k, a_k, s'_k, r_k \rangle\}$, the probability of an experience tuple $\tau = \langle s, a, s', r \rangle$ in the task $M_j$ of being generated by $M_i$ is defined as:

$$P(\tau \mid M_i) = (T_i)^{a}_{ss'} \, (R_i)^{a}_{sr} \quad (6)$$

where $(T_i)^{a}_{ss'}$ is the probability of transiting to $s'$, and $(R_i)^{a}_{sr}$ is the probability of generating the reward $r$ after executing the action $a$ in state $s$ in $M_i$.
Definition 5 provides a formal description of the probability that a sample generated in an MDP $M_j$ was generated in an MDP $M_i$. If, instead of a single tuple of the target task $M_j$, we have a set of tuples $D_{M_j}$, Definition 5 can be used to compute the compliance between the entire target task $M_j$ and the entire source task $M_i$ by repeating this operation for all samples in $D_{M_j}$.

Definition 6 Given two MDPs, $M_i$ and $M_j$, and a set of tuples $D_{M_j}$ generated in $M_j$, the compliance between them is defined as:

$$\Lambda = \frac{1}{n} \sum_{t=1}^{n} P(\tau_t \mid M_i) \quad (7)$$

where $n$ is the number of samples in $D_{M_j}$, and $\tau_t$ is the $t$-th tuple in $D_{M_j}$.

$\Lambda$ is not strictly a distance metric but a probability: the more likely the samples of the target task are to be generated in the source task, the closer $\Lambda$ is to 1. Therefore, compliance could be used to obtain a distance metric between MDPs like the ones this survey is looking for, e.g., $d(M_i, M_j) = 1 - \Lambda$.
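To make the compliance computation concrete, the following minimal sketch averages, over a set of target tuples, the probability that a tabular source model generates each tuple, and then turns the result into a distance by taking one minus that average. The dictionary-based model, the names `tuple_probability` and `compliance`, and the toy numbers are all assumptions for illustration; in practice the source model would be estimated from source samples.

```python
def tuple_probability(sample, transition, reward):
    """Probability that the source model generates one target tuple.

    sample     -- (s, a, s_next, r)
    transition -- dict mapping (s, a, s_next) to a probability
    reward     -- dict mapping (s, a, r) to a probability
    Entries missing from the model are treated as probability 0.
    """
    s, a, s_next, r = sample
    return transition.get((s, a, s_next), 0.0) * reward.get((s, a, r), 0.0)


def compliance(target_samples, transition, reward):
    """Average probability of the target samples under the source model."""
    total = sum(tuple_probability(x, transition, reward) for x in target_samples)
    return total / len(target_samples)


# Hypothetical source model with a single state-action pair.
T_src = {("s0", "a", "s1"): 0.8, ("s0", "a", "s0"): 0.2}
R_src = {("s0", "a", 1.0): 1.0}

# Hypothetical target samples; the last one is impossible in the source task.
D_target = [
    ("s0", "a", "s1", 1.0),
    ("s0", "a", "s0", 1.0),
    ("s0", "a", "s2", 1.0),
]

lam = compliance(D_target, T_src, R_src)
distance = 1.0 - lam
```

Treating unseen transitions as having probability zero is a design choice of this sketch; a smoothed model would give such tuples a small positive probability instead.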

MDP graphs
The methods in this section are based on an interesting principle: translating the MDPs into alternative representations that make it easier to measure the similarities between them. In the particular case of the approaches in this section, the alternative representation is graph-theoretic, i.e., the states, actions, transition and reward functions are represented as a graph with nodes and edges (Kuhlmann and Stone, 2007). Therefore, these approaches are in a way based on the concepts of bisimulation and homomorphism described in Sect. 4.1.1, with the difference that they use specific techniques of structural similarity between graphs to measure the similarity between MDPs.
Definition 7 Given two MDPs $M_i$ and $M_j$ and their corresponding alternative representations as graphs, $G_{M_i}$ and $G_{M_j}$, the similarity between them is given by $\Gamma(G_{M_i}, G_{M_j})$, where $\Gamma$ is a function that measures the structural similarity between $G_{M_i}$ and $G_{M_j}$.

Therefore, the representation of MDPs as graphs opens the door to using graph-theoretic similarity metrics to measure the similarity between MDPs. One of these metrics is SimRank (Jeh and Widom, 2002), which is based on the intuition that two nodes are similar iff their neighbors are similar. Formally, given a graph $G = \{V, E\}$, where $V$ is the set of nodes and $E$ is the set of directed links between any two nodes in $G$, the SimRank metric between any two nodes $i, j \in V$ with $i \neq j$ is defined as in Eq. (8),

$$\sigma_{ij} = \frac{c}{|I(i)| \, |I(j)|} \sum_{u \in I(i)} \sum_{v \in I(j)} \sigma_{uv} \quad (8)$$

where $I(i) = \{j \mid (j, i) \in E,\ j \in V\}$ denotes the set of neighbours of $i$, and $c$ is called a decay factor. If $i = j$, then $\sigma_{ij} = 1$. In addition, if $i \neq j$ and $I(i) = \emptyset$ or $I(j) = \emptyset$, then $\sigma_{ij} = 0$.
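The SimRank recursion can be approximated by fixed-point iteration. The sketch below is one minimal way to do so on a small directed graph, assuming that the neighbours of a node are its in-neighbours, as in the original SimRank formulation of Jeh and Widom (2002). The function name `simrank`, the default decay factor and the number of iterations are choices made for this example.

```python
def simrank(nodes, edges, decay=0.8, iterations=10):
    """Iteratively approximate SimRank scores for a directed graph.

    nodes -- iterable of hashable node identifiers
    edges -- iterable of (source, target) pairs
    Returns a dict mapping (u, v) to the similarity between u and v.
    """
    nodes = list(nodes)
    # In-neighbours of every node (assumption: neighbours = in-neighbours).
    in_nb = {v: [] for v in nodes}
    for src, dst in edges:
        in_nb[dst].append(src)

    # Start from the identity: a node is only similar to itself.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                elif not in_nb[u] or not in_nb[v]:
                    new[(u, v)] = 0.0
                else:
                    total = sum(sim[(p, q)] for p in in_nb[u] for q in in_nb[v])
                    new[(u, v)] = decay * total / (len(in_nb[u]) * len(in_nb[v]))
        sim = new
    return sim


# Nodes "b" and "c" share the single parent "a".
scores = simrank(["a", "b", "c"], [("a", "b"), ("a", "c")], decay=0.8)
```

Here the two children of the same parent obtain a similarity equal to the decay factor, while the parent, which has no in-neighbours, has zero similarity to both.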
Besides the SimRank metric, other node-to-node proximities in graphs have been proposed, such as RoleSim (Jin et al., 2014) or MatchSim (Lin et al., 2012). Based on the principles of the graph-theoretic similarity metrics, Kuhlmann and Stone (2007) investigate the use of graph isomorphism in the context of MDPs. Specifically, they represent the MDPs as rule graphs instead of bipartite graphs for the particular problem of General Game Playing (Genesereth et al., 2005). Such a rule graph is an accurate abstraction of the MDP problem, and can be properly compared to other rule graphs. In this case, the similarity function $\Gamma(\cdot, \cdot)$ is binary, taking the value 1 if $G_{M_i}$ and $G_{M_j}$ are isomorphic, and 0 otherwise. However, it would be desirable for $\Gamma(\cdot, \cdot)$ to vary smoothly with the difference between the MDPs. Therefore, it may be worth investigating the graph edit distance, which denotes the number of edit steps (insertions, deletions, or updates to nodes or edges) required to transform $G_{M_i}$ into $G_{M_j}$ (Gao et al., 2010). The function $\Gamma(\cdot, \cdot)$ could be inversely related to the number of steps required to transform one graph into the other.

Additionally, Kuhlmann and Stone (2007) assume full knowledge of the transition function. A more general approach is presented by Liu and Stone (2006), where the agent has only a qualitative understanding of the transition function. Liu and Stone (2006) model the problem as a Qualitative Dynamic Bayes Network (QDBN). This assigns types to the nodes and edges, providing additional characteristics to compare.

Transitions
The metrics in this category use tuples in the form ⟨s, a, s ′ ⟩ to measure the similarity between MDPs.

Ammar et al. (2014) use $D_{M_j}$ to build a Restricted Boltzmann Machine (RBM) model which describes the transitions in $M_j$ in a richer feature space. RBMs are stochastic two-layered energy-based models with generative capabilities for unsupervised learning (Ghojogh et al., 2021). The first layer is the visible layer, which represents the input data, whereas the hidden layer is used to discover more informative spaces to describe the input data. RBMs can regenerate visible-layer values given a hidden-layer configuration. Therefore, the learning process consists of several forward and backward passes, in which the RBM tries to reconstruct the input data. Such a reconstruction capability is particularly useful for discovering informative hidden features in unlabeled data (Hinton and Salakhutdinov, 2006). In the specific problem of task similarity, Ammar et al. (2014) first train an RBM model using the tuples in $D_{M_j}$. Then, they propose feeding the tuples $\tau_k \in D_{M_i}$ into this RBM model to obtain a reconstruction $\tau'_k$. Afterward, they compute the Euclidean distance $e_k$ between $\tau_k$ and $\tau'_k$. The distance d  M(s, a). In this case, the distance $d(D_{M_i}, D_{M_j})$ is also computed as the average of the Euclidean distances obtained for each $\tau_k \in D_{M_i}$. Finally, Castro (2020) presents an interesting approach for computing bisimulation metrics (Sect. 4.1.1) via access to transition tuples, thereby circumventing the problems of computing them in large or continuous state spaces. However, the results are limited to deterministic MDPs.
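The reconstruction-based distance can be sketched independently of the model that produces the reconstructions. In the example below, the trained RBM is replaced by an arbitrary callable `reconstruct`, which is a stand-in assumed purely for illustration; the function name `mean_reconstruction_distance` and the toy data are also hypothetical. The sketch only shows how the per-tuple Euclidean errors are averaged into a single task distance.

```python
import math


def mean_reconstruction_distance(tuples, reconstruct):
    """Average Euclidean distance between tuples and their reconstructions.

    tuples      -- list of equal-length numeric sequences (flattened <s, a, s'>)
    reconstruct -- callable standing in for a model trained on the other task
    """
    distances = []
    for x in tuples:
        x_rec = reconstruct(x)
        distances.append(math.sqrt(sum((u - v) ** 2 for u, v in zip(x, x_rec))))
    return sum(distances) / len(distances)


def toy_reconstruct(x):
    """Stand-in model: reproduces the first two features and zeroes the rest."""
    return [x[0], x[1]] + [0.0] * (len(x) - 2)


# Hypothetical flattened transition tuples gathered in the other task.
D_source = [[0.0, 1.0, 3.0], [1.0, 0.0, 4.0]]
d_transitions = mean_reconstruction_distance(D_source, toy_reconstruct)
```

A model that reconstructs the other task's tuples well yields a small average error, which is the sense in which a low value indicates similar transition structure.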

Rewards
The metrics can also measure the similarity between MDPs according to the distance between their reward dynamics.

Definition 9
Given two tasks $M_i$ and $M_j$, we define the distance between them as $d(M_i, M_j) = d(R_i, R_j)$, where $d(R_i, R_j)$ measures the distance between the reward functions $R_i$ and $R_j$.

For instance, Carroll and Seppi (2005) compute $d(R_i, R_j)$ as in Eq. (9),

$$d(R_i, R_j) = \frac{1}{n} \sum_{s, a} \left( R_i(s, a) - R_j(s, a) \right)^2 \quad (9)$$

where $n$ is the total number of state-action pairs in the source and the target task.

Instead, Tao et al. (2021) assume the reward functions are a linear combination of some common features $\phi(\cdot, \cdot)$, $R_i(s, a) = \phi(s, a)^T w_i$ and $R_j(s, a) = \phi(s, a)^T w_j$, and then use the cosine distance between $w_i$ and $w_j$ to compute $d(R_i, R_j)$. Finally, Gleave et al. (2020) introduce the Equivalent-Policy Invariant Comparison (EPIC) pseudometric, which is composed of two steps. First, transitions from an offline dataset are used to convert reward functions to a canonical form. This canonical form is invariant to reward transformations that do not affect the optimal policy. Second, the correlation between reward values on transition samples is computed, yielding a metric capturing reward function similarity. A significant drawback of this approach is that it evaluates the rewards on transitions between all state-state combinations, regardless of whether such combinations are possible in a transition, which in practice leads to unreliable reward values because they lie outside the distribution of the transitions. For this reason, some recent works have focused on improving the EPIC metric by making it consider only feasible state transitions (Wulfe et al., 2022).

In any case, although the metrics in this category may work well in some particular cases, in others they are not a good approximation of the similarity between tasks (Carroll and Seppi, 2005). By way of example, imagine two mazes with the goal in different locations but with everything else left the same. In both tasks, the agent receives a reward of 1 if it reaches the goal, -1 if it hits an obstacle, and 0 elsewhere. These metrics will consider two mazes equally similar whether the goals are in close positions or in very different positions: they cannot capture the distance that a goal is moved. However, they can easily capture the fact that new obstacles or goals have been added to the tasks.
Similarity metrics based on the reward functions can be computed before the policy is learned but in general they are less sensitive to policy trends than those based on policy values (Carroll and Seppi, 2005).
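The maze example can be checked numerically. The sketch below assumes that the reward distance is a mean squared difference between reward tables, and it simplifies the maze to a one-dimensional corridor in which the reward depends only on the cell entered; the helper names `reward_distance` and `corridor` are hypothetical. It is meant only to illustrate that such a distance does not depend on how far the goal is moved.

```python
def reward_distance(r_a, r_b):
    """Mean squared difference between two reward tables with identical keys."""
    return sum((r_a[k] - r_b[k]) ** 2 for k in r_a) / len(r_a)


def corridor(goal, obstacles=(), length=5):
    """Reward per cell: 1 at the goal, -1 at obstacles, 0 elsewhere."""
    rewards = {}
    for cell in range(length):
        if cell == goal:
            rewards[cell] = 1.0
        elif cell in obstacles:
            rewards[cell] = -1.0
        else:
            rewards[cell] = 0.0
    return rewards


base = corridor(goal=0)
near = corridor(goal=1)                    # goal moved by one cell
far = corridor(goal=4)                     # goal moved to the opposite end
extra = corridor(goal=0, obstacles=(2,))   # same goal, one new obstacle

d_near = reward_distance(base, near)
d_far = reward_distance(base, far)
d_extra = reward_distance(base, extra)
```

Under these assumptions `d_near` and `d_far` coincide, because moving the goal changes exactly two table entries regardless of the displacement, whereas adding an obstacle is detected as a non-zero distance.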

State and action spaces
In this case, the distance between MDPs is computed as the distance between the state-action pairs ⟨s, a⟩ in both the source and the target tasks.
Definition 10 Given two MDPs $M_i$ and $M_j$, and their corresponding state-action spaces, $S_i \times A_i$ and $S_j \times A_j$, we define the distance between them as $d(M_i, M_j) = d(S_i \times A_i, S_j \times A_j)$.

For instance, Narayan and Leong (2019) compute this distance as the difference between the corresponding state-action transition distributions of the two tasks. In particular, the proposed metric is based on the Jensen-Shannon Distance (JSD), which measures the difference between two probability distributions (Nielsen, 2019). The distance $d(M_i, M_j)$ is computed as the JSD averaged over all the state-action pairs, so as to obtain a single distance between both tasks. Instead, Taylor et al. (2008b) compute the Euclidean distance between state-action pairs in the source and the target tasks. This work is not focused on the construction of a distance measure between MDPs, but such Euclidean distances between all the state-action pairs could be combined to obtain a sort of similarity metric.
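The averaged Jensen-Shannon computation can be sketched as follows for tabular transition models. The sketch assumes that both tasks share the same state-action pairs and the same ordering of next states, which is a simplification made for this example; the names `js_distance` and `mean_js_distance` are hypothetical. Base-2 logarithms are used so that the distance lies in [0, 1].

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (base-2 logarithm) between two distributions."""
    def kl(x, y):
        # Terms with x_i = 0 contribute nothing to the divergence.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)
    return math.sqrt(max(0.0, divergence))


def mean_js_distance(model_a, model_b):
    """Average JS distance over the state-action pairs present in both models."""
    keys = model_a.keys() & model_b.keys()
    return sum(js_distance(model_a[k], model_b[k]) for k in keys) / len(keys)


# Hypothetical next-state distributions for two state-action pairs.
P_i = {("s0", "a"): [1.0, 0.0], ("s1", "a"): [0.5, 0.5]}
P_j = {("s0", "a"): [0.0, 1.0], ("s1", "a"): [0.5, 0.5]}

d_sa = mean_js_distance(P_i, P_j)
```

In this toy case one state-action pair has completely different dynamics in the two tasks and the other has identical dynamics, so the average lies halfway between the two extremes.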

States
When comparing closely related tasks, it may be sufficient to use only the state space. The metrics in this category use precisely the state space in both the source and the target tasks to compute the similarity between them.

Definition 11
Given two MDPs $M_i$ and $M_j$, we can define the distance between them as $d(M_i, M_j) = d(S_i, S_j)$, where $d(S_i, S_j)$ measures the distance between $S_i$ and $S_j$.

Svetlik et al. (2017) propose to compute $d(S_i, S_j)$ as described in Eq. (10):

$$d(S_i, S_j) = \frac{|S_i \cap S_j|}{1 + |S_j| - |S_i|} \quad (10)$$

Equation 10 computes $d(S_i, S_j)$ as the relation between the applicability of the value function (measured as the number of states in $M_i$ that also appear in $M_j$), and the experience required to learn in $M_j$ (measured as the difference in size between $S_j$ and $S_i$).

Another area of research that is relevant in this category is that of case-based reasoning (CBR) (Aamodt and Plaza, 1994). CBR uses the knowledge of previous cases to solve new problems, by comparing the current state to the previous ones and finding the closest. This results in an action that was already used to solve a very similar case, and which is therefore expected to be useful. RL approaches based on CBR use a similarity function between the states in the target task and the states stored in a case base corresponding to a previous source task (Celiberto Jr et al., 2011; Bianchi et al., 2009). Such a similarity function could be used to measure the similarity between the state spaces and, hence, the similarity between the two tasks.
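The state-space ratio described above can be illustrated with finite sets of states. The sketch assumes the quantity is the number of shared states divided by one plus the difference in size between the target and source state spaces, and that the target space is at least as large as the source space so that the denominator stays positive; the function name `state_space_score` and the grid coordinates are hypothetical.

```python
def state_space_score(states_i, states_j):
    """Shared states divided by (1 + |S_j| - |S_i|) for finite state sets.

    Assumes |S_j| >= |S_i| so that the denominator is positive.
    """
    s_i, s_j = set(states_i), set(states_j)
    shared = len(s_i & s_j)
    return shared / (1 + len(s_j) - len(s_i))


# Hypothetical grid-world states given as (row, column) coordinates.
S_small = {(0, 0), (0, 1), (1, 0)}
S_large = {(0, 0), (0, 1), (1, 0), (1, 1), (2, 2)}

score = state_space_score(S_small, S_large)
score_same = state_space_score(S_large, S_large)
```

Larger values correspond to more shared states relative to the number of additional states that remain to be learned in the target task.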

Performance-based metrics
The metrics proposed in the second major category of this survey evaluate the similarity between tasks by either comparing policies, or by comparing the performance of the agent in the source and the target tasks. So, we distinguish two different approaches to overcoming the problem of computing such performance-based similarities: (i) by the policy similarity and (ii) by the transfer gain obtained transferring the knowledge from a source task to the target task.

Policy similarity
These approaches measure the similarity between MDPs by comparing the learned value function V or the action-value function Q , hence, the behavioral policies obtained in the source and the target tasks. Therefore, these metrics require the full (or partial) learning of these policies before the computation of the similarity between tasks. Such a comparison can be conducted in two different ways depending on what is being compared: (i) the policy values, or (ii) the policy parameters.

Policy values
One approach to compare the similarity between MDPs is by comparing the specific q-values or v-values of their respective value or action-value functions. The approaches in this category aim to capture the difference between the policy trends in two MDPs. The more the policies of the two tasks overlap, the more similar the two tasks become.

Definition 12
Given two tasks $M_i$ and $M_j$ and the q-values of $Q_i$ and $Q_j$ learned in these tasks, we define $d(M_i, M_j) = d(Q_i, Q_j)$.

Carroll and Seppi (2005) propose to compute $d(V_i, V_j)$ as the number of states in $M_i$ and $M_j$ with identical maximum v-value. The most obvious problem with this distance is that it is overly restrictive: it requires two states to have exactly the same v-value to count as an overlap, which is often unrealistic. Carroll and Seppi (2005) also propose to compute $d(Q_i, Q_j)$ as described in Eq. (11):

$$d(Q_i, Q_j) = \frac{1}{n} \sum_{s, a} \left( Q_i(s, a) - Q_j(s, a) \right)^2 \quad (11)$$

where $n$ is the total number of state-action pairs in the source and the target task, and which is less restrictive than $d(V_i, V_j)$. Carroll and Seppi (2005) demonstrate that these similarity metrics are only accurate after the q-values or v-values have been learned. Additionally, similarity metrics based on the number of states with maximum v-value require the task to be more thoroughly learned than those based on the mean squared error of the q-values. Instead, Zhou and Yang (2020) compute $d(Q_i, Q_j)$ by deriving latent structures of tasks and finding matches between $Q_i$ and $Q_j$. Finally, Serrano et al. (2021) compute the similarity between tasks with different but discrete state-action spaces by analyzing the differences between $Q_i$ and $Q_j$. In particular, they identify pairs of state-action pairs that perform similar roles in their respective tasks, based on their Q-values.
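The two value-based comparisons attributed to Carroll and Seppi (2005) can be sketched for tabular Q-functions as follows. The sketch assumes both tasks share the same state-action pairs, and the names `q_mse` and `identical_max_values` as well as the toy Q-tables are hypothetical.

```python
def q_mse(q_i, q_j):
    """Mean squared difference of Q-values over shared state-action pairs."""
    keys = q_i.keys() & q_j.keys()
    return sum((q_i[k] - q_j[k]) ** 2 for k in keys) / len(keys)


def identical_max_values(q_i, q_j):
    """Number of shared states whose maximum Q-value is exactly equal."""
    def max_value_per_state(q):
        best = {}
        for (state, _action), value in q.items():
            best[state] = max(best.get(state, float("-inf")), value)
        return best

    v_i, v_j = max_value_per_state(q_i), max_value_per_state(q_j)
    return sum(1 for s in v_i.keys() & v_j.keys() if v_i[s] == v_j[s])


# Hypothetical learned Q-tables for two tasks with two states and two actions.
Q_i = {("s0", "l"): 1.0, ("s0", "r"): 0.0, ("s1", "l"): 0.5, ("s1", "r"): 2.0}
Q_j = {("s0", "l"): 1.0, ("s0", "r"): 0.5, ("s1", "l"): 0.5, ("s1", "r"): 1.0}

mse = q_mse(Q_i, Q_j)
overlap = identical_max_values(Q_i, Q_j)
```

The exact-equality test in `identical_max_values` shows why the count-based measure is restrictive: any small estimation error in a maximum value removes that state from the overlap, whereas the mean squared error degrades gradually.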
Other works also make use of the action-value function $Q$ to compute whether two states are similar, but they focus on state abstractions (i.e., on aggregating similar states belonging to the same MDP) and not on the definition of similarity metrics between different MDPs. For instance, Li et al. (2006) present some abstractions which consider that two states $s_1, s_2 \in S$ in an MDP $M$ can be aggregated if the condition in Eq. (12) is fulfilled.

$$\left| f(s_1, a) - f(s_2, a) \right| \leq \epsilon \quad \forall a \in A \quad (12)$$

In Eq. (12), if $\epsilon = 0$ we can obtain different forms of exact abstractions depending on the choice of $f$: $Q^\pi$ ($Q^\pi$-irrelevance), $Q^*$ ($Q^*$-irrelevance), or $\max_A Q^*$ ($a^*$-irrelevance). However, similarly to the distance $d(V_i, V_j)$ proposed by Carroll and Seppi (2005), exact abstractions fail to find opportunities for abstraction in tasks where no two situations are exactly alike. For this reason, Abel et al. (2016) investigate approximate state abstractions, which treat nearly-identical situations as equivalent, and where $\epsilon \geq 0$ and $f$ is $Q^*$. From these abstractions, we can construct discrete metrics from any state aggregation as $d(s_1, s_2) = 0$ if Eq. (12) is fulfilled, and $d(s_1, s_2) = 1$ otherwise. In a similar line of research, Lan et al. (2021) present different value-based metrics for computing the similarity between states. The metric $d(s_1, s_2) = \max_{a \in A} |Q^*(s_1, a) - Q^*(s_2, a)|$ is representative of such value-based metrics. It shares the same fundamentals as the policy irrelevance abstraction proposed by Jong and Stone (2005), which considers that two states can be aggregated if they have the same optimal action. Although all of these metrics have been proposed as distances between two states, they could also be used to calculate the distance between two MDPs as described in Sect. 4.1.1.
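A minimal sketch of the approximate aggregation test and of the value-based state metric is given below, assuming a tabular optimal action-value function stored in a dictionary. The names `q_star_gap` and `discrete_metric`, and the toy values, are hypothetical.

```python
def q_star_gap(q_star, s1, s2, actions):
    """Largest absolute difference of Q*(s1, a) and Q*(s2, a) over all actions."""
    return max(abs(q_star[(s1, a)] - q_star[(s2, a)]) for a in actions)


def discrete_metric(q_star, s1, s2, actions, epsilon=0.0):
    """0 if the two states can be aggregated under tolerance epsilon, else 1."""
    return 0 if q_star_gap(q_star, s1, s2, actions) <= epsilon else 1


# Hypothetical optimal Q-values for two states and two actions.
actions = ["l", "r"]
Q_star = {
    ("s0", "l"): 1.0, ("s0", "r"): 0.5,
    ("s1", "l"): 1.25, ("s1", "r"): 0.5,
}

gap = q_star_gap(Q_star, "s0", "s1", actions)
exact = discrete_metric(Q_star, "s0", "s1", actions, epsilon=0.0)
approximate = discrete_metric(Q_star, "s0", "s1", actions, epsilon=0.25)
```

With a tolerance of zero the two states are kept apart, while a tolerance equal to the gap aggregates them, which is the difference between exact and approximate abstraction that the text describes.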

Policy approximation parameters
This section considers that the value function is approximated by $\hat{V}(\phi(s), \theta) \approx V(s)$, where $\theta$ denotes an adaptable parameter vector, $\phi(s)$ is the feature vector of state $s$, and $\hat{V}$ may be a linear function (e.g., a linear combination of features) or a non-linear function (e.g., a neural network) (Van Hasselt, 2012). The action-value function $Q$ can be expressed in similar parameterized terms, $\hat{Q}(\phi(s, a), \theta) \approx Q(s, a)$. The metrics within this category compare the particular weights of the parameter vectors corresponding to the value functions $V$ or $Q$ of the source task and the target task in order to measure the similarity between them. Such metrics reflect the authors' intuition that two tasks are more likely to be similar to each other if they have similar parameter vectors.
Definition 13 Given two tasks $M_i$ and $M_j$ and the parameterized functions $\hat{Q}_i \approx Q_i$ and $\hat{Q}_j \approx Q_j$, with parameter vectors $\theta_i$ and $\theta_j$, we define $d(M_i, M_j) = d(\theta_i, \theta_j)$. The same definition applies if the value functions $V_i$ and $V_j$ are used instead of the action-value functions $Q_i$ and $Q_j$.

For instance, Karimpanal and Bouffanais (2018) compute the distance $d(\theta_i, \theta_j)$ as described in Eq. (13):

$$d(\theta_i, \theta_j) = \frac{\theta_i \cdot \theta_j}{\|\theta_i\| \, \|\theta_j\|} \quad (13)$$

i.e., by using the cosine similarity between two non-zero vectors. Karimpanal and Bouffanais (2018) demonstrate that the better the estimate of the agent's parameter vector, the more accurate the distance $d(M_i, M_j)$. The cosine similarity has some advantages, such as boundedness and the ability to handle parameter vectors with largely different magnitudes. However, Karimpanal and Bouffanais (2018) focus on linear function approximation. Computing the similarity between parameter vectors presents particular challenges for non-linear function approximators, such as neural networks, where neurons could learn the same information at different positions in different runs and still yield identical behaviors. Therefore, the design of $d(\theta_i, \theta_j)$ should consider that two parameter vectors are similar if they lead to similar behaviors, regardless of the magnitude of the weights or of possibly shuffled vectors. Some metrics have been proposed to measure the distance between neural networks (Ashmore, 2015), but they require further investigation in the context of task similarity. In any case, we consider that this alternative representation of policies as parameter vectors opens the door to the comparison of other policy representations (Ferrante et al., 2008).
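The properties of the cosine similarity mentioned above can be illustrated on small parameter vectors. The sketch below is a plain implementation assuming non-zero vectors of equal length; the function name `cosine_similarity` and the example vectors are hypothetical.

```python
import math


def cosine_similarity(theta_i, theta_j):
    """Cosine of the angle between two non-zero parameter vectors."""
    dot = sum(a * b for a, b in zip(theta_i, theta_j))
    norm_i = math.sqrt(sum(a * a for a in theta_i))
    norm_j = math.sqrt(sum(b * b for b in theta_j))
    return dot / (norm_i * norm_j)


theta_a = [1.0, 2.0, 0.0]
theta_scaled = [10.0, 20.0, 0.0]    # same direction, ten times larger
theta_orthogonal = [0.0, 0.0, 3.0]  # no shared non-zero component
theta_shuffled = [2.0, 1.0, 0.0]    # same weights in a different order

sim_scaled = cosine_similarity(theta_a, theta_scaled)
sim_orthogonal = cosine_similarity(theta_a, theta_orthogonal)
sim_shuffled = cosine_similarity(theta_a, theta_shuffled)
```

The scaled vector is judged maximally similar, which reflects the robustness to magnitude, whereas the shuffled vector is not, which illustrates the difficulty raised for approximators whose weights may be permuted without changing behavior.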

Transfer gain
One possible metric to measure the similarity between two MDPs is an approximation to the advantage gained by using one source task to speed up the learning of another target task, which is commonly known as transfer gain (Carroll and Seppi, 2005; Carroll, 2005; Taylor and Stone, 2009). There are many possible ways to measure the advantage $g(M_i, M_j)$. Some of them are graphically represented in Fig. 2:
• Jumpstart The performance at the initial steps of the learning process of an agent learning $M_j$ may be improved by transfer from $M_i$.
• Asymptotic performance The final learned performance of an agent learning $M_j$ may also be improved by transfer from $M_i$.
• Total reward The total reward accumulated by an agent learning $M_j$ may be improved if it uses transfer from $M_i$, compared to learning without transfer.
• Average reward The average reward received within some window of time by an agent learning $M_j$ with transfer from $M_i$ may also be improved.
• Time to convergence A transfer learner is expected to reach the asymptotic performance earlier than a non-transfer learner.
Regardless of the particular technique used to compute $g(M_i, M_j)$, the higher the transfer gain, the greater the similarity between the tasks. While there is a general consensus that $g(M_i, M_j)$ is the best similarity metric between tasks (Carroll, 2005; Carroll and Seppi, 2005), since it accurately measures positive transfer, it also requires that the transfer experiment be entirely or partially run before the degree of similarity between tasks can be measured. This precludes its use in settings where the similarity between tasks must be known before the transfer experiment takes place.
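Several of the transfer gain measures listed above can be read directly off two learning curves. The sketch below assumes each curve is a list of per-episode rewards of equal length and uses a fixed performance threshold as a proxy for convergence; the function name `transfer_gains`, the threshold and the toy curves are hypothetical.

```python
def transfer_gains(with_transfer, without_transfer, threshold):
    """Compare two learning curves given as per-episode rewards.

    Returns the jumpstart, the difference in final performance, the
    difference in total reward, and the first episode at which each
    curve reaches the threshold (None if it never does).
    """
    def first_reach(curve):
        for episode, value in enumerate(curve):
            if value >= threshold:
                return episode
        return None

    return {
        "jumpstart": with_transfer[0] - without_transfer[0],
        "asymptotic": with_transfer[-1] - without_transfer[-1],
        "total_reward": sum(with_transfer) - sum(without_transfer),
        "episodes_to_threshold": (
            first_reach(with_transfer),
            first_reach(without_transfer),
        ),
    }


# Hypothetical per-episode rewards with and without transfer.
curve_transfer = [2.0, 4.0, 6.0, 8.0, 8.0]
curve_scratch = [0.0, 1.0, 3.0, 6.0, 7.0]

gains = transfer_gains(curve_transfer, curve_scratch, threshold=6.0)
```

Both curves must be available before any of these quantities can be computed, which is the practical limitation noted above.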
In this paper, we distinguish two approaches within this category depending on whether the transfer gain is computed after or during the transfer process: (i) off-line transfer gain, and (ii) on-line transfer gain.

Fig. 2 This graph shows benefits to the jumpstart, time to convergence, total reward, average reward received within some window of time, and asymptotic performance

Off-line transfer gain
In this case, the transfer gain $g(M_i, M_j)$ is estimated as the difference in performance between the learning process with and without transfer, once the learning processes are considered to be finished (Carroll, 2005; Mahmud et al., 2013; Sinapov et al., 2015; Zhan et al., 2016). Such gain $g(M_i, M_j)$ can be computed as the jumpstart (Sinapov et al., 2015; Carroll, 2005), the time to convergence (Carroll, 2005), or the asymptotic performance (Mahmud et al., 2013), although other metrics such as the total reward or the transfer ratio could also be used (Taylor and Stone, 2009).

On-line transfer gain
On the contrary, in these approaches, the transfer gain is estimated on-line while the policy in the target task is computed (Fernández and Veloso, 2013; Azar et al., 2013; Li and Zhang, 2017). It is important to bear in mind that the on-line computation of this gain only makes sense if, during the learning process, we have several transfer sources to choose from. At the beginning of the learning process, these approaches have at their disposal the knowledge learned in solving a set of previous tasks $\{M_1, \ldots, M_n\}$ to learn the new task $M_j$. During learning, they compute $g(M_i, M_j)$ for each past task $M_i \in \{M_1, \ldots, M_n\}$. To do that, they transfer the knowledge acquired solving $M_i$ to $M_j$ during a limited number of episodes $m$. Then, $g(M_i, M_j)$ is computed as the average reward obtained during those $m$ episodes (Fernández and Veloso, 2013; Azar et al., 2013). Once all gains are computed, it is possible to decide on-line which is the closest task to $M_j$ within $\{M_1, \ldots, M_n\}$, so that the knowledge of the selected closest task can have a greater influence on learning the policy in $M_j$.
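The on-line selection procedure described above can be sketched as a simple loop over candidate source tasks. The callable `run_episode`, which is assumed to run one episode in the target task while reusing the knowledge of the named source task and to return the reward obtained, is a hypothetical interface, as are the function name `select_source` and the deterministic toy returns used in place of a real learner.

```python
def select_source(sources, run_episode, m):
    """Estimate the gain of each source over m episodes and pick the best.

    sources     -- list of source task identifiers
    run_episode -- callable(source) -> reward of one target-task episode
    m           -- number of evaluation episodes per source
    """
    gains = {}
    for source in sources:
        rewards = [run_episode(source) for _ in range(m)]
        gains[source] = sum(rewards) / m
    best = max(gains, key=gains.get)
    return best, gains


# Deterministic stand-in for episodes run with knowledge from each source.
toy_returns = {"M1": [1.0, 2.0, 3.0], "M2": [4.0, 4.0, 4.0]}
counters = {name: 0 for name in toy_returns}


def toy_episode(source):
    value = toy_returns[source][counters[source]]
    counters[source] += 1
    return value


best, gains = select_source(["M1", "M2"], toy_episode, m=3)
```

In the surveyed approaches the estimated gains are typically used to weight the influence of each source during learning rather than to make a single hard choice; the arg-max here is a simplification.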

Discussion
From a transfer point of view, the ultimate goal of all similarity metrics is in some way to predict the relative advantage that would be gained by using a source task in a target task. The more similar the source and the target tasks are, the greater the positive transfer. However, there is probably no single best universal metric that works with all transfer techniques and problems. Since each metric can capture different types of similarity and each transfer technique induces a different bias in the learning process, the question of selecting the best metric turns into finding the correct metric for a transfer technique to be applied to a particular problem. For this reason, in order to facilitate the selection of the best metric for a particular task, this section analyzes the distance metrics surveyed in this paper across five dimensions (Table 2): (i) the nature of the state-action space (denoted by Spaces in Table 2), (ii) the knowledge required to compute the distance metric (denoted by Knowl.), (iii) the allowed differences between the tasks (Differ.), (iv) the type of information that is transferred from the source tasks to the target task (Transfer), and (v) when the computation takes place (Comp.). Table 3 provides a key for the abbreviations in Table 2. Furthermore, Table 2 and the proposed discussion will serve to analyze the pros and cons of the categories surveyed.

State-action space
Obviously, the selection of the metric depends on the nature of the state-action space. Some approaches require a finite set of states and actions. This is particularly true in the case of bisimulation and homomorphism metrics. These metrics are expensive to compute and typically require enumerating all the state pairs even when using on-the-fly approximations (Comanici et al., 2012), sampling-based approximations (Ferns et al., 2012; Castro and Precup, 2010) or approximations using the structure in the state space (Bacci et al., 2013). Such full state enumeration is impractical for large state spaces, and impossible for continuous state spaces.

Table 2

| Category | Reference | Spaces | Knowl. | Differ. | Transfer | Comp. |
|---|---|---|---|---|---|---|
| | (2010) | s/a | t,r | s,t,r | | b |
| | Song et al. (2016) | s/a | t,r | s,t,r | | b |
| | Ferns et al. (2012) | S/a | t,r | - | | b |
| | Castro et al. (2021) | S/a | I | - | | b |
| Homomorphism | Ravindran and Barto (2002) | s/a | t,r | - | | b |
| | Ravindran and Barto (2003) | s/a | t,r | s,a,t,r | p | b |
| | Sorg and Singh (2009) | S/a | t,r | s,a,t,r | Q | b |
| | Taylor et al. (2008a) | s/a | t,r | - | | b |
| Compliance | Lazaric et al. (2008) | S/A | t,r | t,r | I | b |
| | Fachantidis (2016) | S/A | t,r | s,a,t,r | I | b |
| | Fachantidis et al. (2015) | S/A | t,r | s,a,t,r | I | b |
| MDP graphs | | s/a | t,r | t,r | | b |
| | Liu and Stone (2006) | s/a | t,r | t,r | V | b |
| | Kuhlmann and Stone (2007) | s/a | t,r | t,r | V | b |
| Transitions | Ammar et al. (2014) | S/A | I | s,a,t,r | Q, | b |
| | Taylor et al. (2008c) | S/a | t | s,a,t,r | Q | b |
| | Castro (2020) | S/a | I | - | | b |
| Rewards | Carroll and Seppi (2005) | s/a | r | t,r | Q, | b |
| | Tao | | | | | |
| | | S/A | s | s,a,t,r | | b |
| Section 5: Performance-based | | | | | | |
| Policy values | Carroll (2005) | s/a | Q | t,r | Q, | b |
| | Zhou and Yang (2020) | S/a | Q | r | Q | b |
| | Serrano et al. (2021) | s/a | Q | s,a | Q | b |
| Policy param. | Karimpanal and Bouffanais (2018) | | | | | |
| Offline transfer | Carroll and Seppi (2005) | s/a | ∑ r, c | t,r | Q, | a |
| | Mahmud et al. (2013) | s/a | V | t,r | | a |
| | Sinapov et al. (2015) | S/a | j | s | Q | a |
| Online transfer | Fernández and Veloso (2013) | S/a | ∑ r | t,r | | d |
| | Fernández et al. (2010) | S/a | ∑ r | s,a,t,r | | d |
| | Li and Zhang (2017) | S/a | ∑ r | t,r | | d |
| | Azar et al. (2013) | S/a | ∑ r | t,r | | d |
Additionally, these metrics can be overly pessimistic in the sense that they consider worst-case differences between states, which is overly restrictive for many problems (Castro and Precup, 2010). Although there have been positive steps to circumvent these drawbacks (Castro, 2020;Zhang et al., 2021;Castro et al., 2021), bisimulation-based approaches compute distances between states (belonging to the same MDP or not), and few works have been proposed to compose these distances into a single distance between MDPs (Song et al., 2016). For this reason, significant work needs to be done to determine if such approaches are feasible in practice for this task. Metrics based on MDP graphs have similar limitations on the size of the state and action spaces (Wang and Liang, 2019;Kuhlmann and Stone, 2007;Liu and Stone, 2006). Although they are based on an interesting principle, at the moment all of these graph-based approaches have limited applications. On the one hand, they require MDPs with a finite number of states and actions that can be adequately represented as a graph. On the other hand, graphs are required to be small since the computation of graph similarity metrics is computationally demanding. Regarding the latter, the computation of these measures can be accelerated via parallelism, but such approaches need to be investigated in the context of similarity between MDPs.

Required knowledge
Another issue that needs to be addressed is what and how much information is required to compute the similarity between tasks. As can be seen in Table 2, most model-based approaches require prior full information (or accurate approximations) about the transition and reward dynamics (Castro and Precup, 2010;Lazaric et al., 2008;Wang and Liang, 2019), or about the size of the state space (Svetlik et al., 2017). In this sense, performance-based metrics have a clear advantage over model-based ones: in general, performance-based metrics require less a priori information about the task to be solved, although as a counterpoint they need to fully or partially run the transfer experiment to obtain accurate approximations of the Q-function (Carroll, 2005;Carroll and Seppi, 2005;Zhou and Yang, 2020;Karimpanal and Bouffanais, 2018) or the V-function (Mahmud et al., 2013). In other cases, only the average cumulative reward obtained during a predetermined time window by the ongoing policy is required (Sinapov et al., 2015;Li and Zhang, 2017;Azar et al., 2013;Fernández and Veloso, 2013), which allows the fast computation of the similarity between tasks without the need to fully run the transfer experiment.
In this dimension, the model-based approaches based on the structural comparison of instances in the form ⟨s, a, s′⟩ gathered from the two tasks are highly attractive (Ammar et al., 2014). On the one hand, they are not computationally expensive because they need neither to approximate the dynamics of the environment nor to learn a behavior policy, only to gather instances in both tasks. On the other hand, these instances can be collected with a random policy, but also with a safe suboptimal one. The latter is particularly interesting when dealing with tasks where random behaviors are not allowed or where a single bad action can lead to catastrophic consequences. In this same line of research, the approaches that use only partial information from the instances to carry out similar comparisons, such as state-action pairs ⟨s, a⟩ (Narayan and Leong, 2019; Taylor et al., 2008b), or simply the states ⟨s⟩ (Svetlik et al., 2017; Celiberto Jr et al., 2011), are also interesting. However, these cases require a strong relationship between the tasks to be compared, which is either learned (Narayan and Leong, 2019; Taylor et al., 2008b) or taken for granted (Svetlik et al., 2017). As an example of the latter, Svetlik et al. (2017) assume that the only difference allowed between tasks is between the state spaces.

Allowed tasks differences
Regarding the allowed task differences, distance metrics can be computed between tasks that have different transition and/or reward functions (Ferns et al., 2004; Castro and Precup, 2010; Song et al., 2016; Lazaric et al., 2008), and/or state-action spaces (Fachantidis et al., 2015; Fernández and Veloso, 2013). Most of the methods in Table 2 require that both tasks have the same state-action space (Azar et al., 2013). However, this requirement can be partially alleviated by constructing inter-task mapping functions between state and/or action variables that translate a state $s_i \in S_i$ in one task to its equivalent state $s_j \in S_j$ in another task, $X_S(s_i) = s_j$, and an action $a_i \in A_i$ to its equivalent $a_j \in A_j$, $X_A(a_i) = a_j$. In some cases, the mapping functions $X_S$ and $X_A$ are explicitly provided, but in other cases they are learned (Fachantidis et al., 2015; Taylor et al., 2008c). However, current approaches for autonomously learning such mapping functions require domain knowledge or are inefficient. In this context, bisimulation and homomorphism metrics offer an interesting alternative for performing such a mapping between tasks. The mapping between the states of two MDPs is defined implicitly by the distance metric $d(\cdot, \cdot)$ as described in Definition 2: for each state in one MDP, finding the most similar state in the other MDP is equivalent to mapping the states from one MDP to the other. Instead, homomorphisms allow correspondences to be defined between state-action pairs, rather than just states (Definition 3). Therefore, one possibility is to allow the mapping to be determined automatically from bisimulation or homomorphism (Castro and Precup, 2010; Sorg and Singh, 2009).
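The idea that a state distance implicitly defines an inter-task mapping can be sketched as a nearest-neighbour assignment. The example below assumes the pairwise distances between the states of the two MDPs have already been computed and stored in a nested dictionary; the function name `induced_state_mapping` and the toy distances are hypothetical.

```python
def induced_state_mapping(dist):
    """Map each state of one MDP to its closest state in the other MDP.

    dist -- dict {state_in_first_mdp: {state_in_second_mdp: distance}}
    """
    return {s_i: min(row, key=row.get) for s_i, row in dist.items()}


# Hypothetical precomputed distances between states of two small MDPs.
pairwise_distances = {
    "x0": {"y0": 0.1, "y1": 0.9},
    "x1": {"y0": 0.6, "y1": 0.2},
}

mapping = induced_state_mapping(pairwise_distances)
```

The same construction would apply to state-action pairs if the distances were defined between pairs rather than between states.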

Transferred knowledge
Different kinds of knowledge may transfer better or worse depending on the similarity between the tasks. The surveyed papers in Table 2 primarily transfer two types of knowledge: policies (Castro and Precup, 2010; Tao et al., 2021) and value functions (Sorg and Singh, 2009; Carroll, 2005; Ammar et al., 2014). The type of knowledge transferred does not seem to depend on the way the similarity between tasks is computed. It should be noted that some approaches in Table 2 are not explicitly used for measuring the similarity between tasks. This is the case for some bisimulation and homomorphism approaches that focus on state aggregation but, as discussed in Sect. 4.1.1, can also be used to measure the similarity between MDPs.

Computation
This leads us to the next issue: the moment at which the metric is computed. Ideally, the similarity metric should be computed before or, at least, during the transfer. Off-line transfer gain approaches are undoubtedly the best method for measuring similarity between two tasks: they produce the measure after the transfer experiment has been run, so that the real gain can be computed. However, if the goal is to use the task similarity measure to choose a source task for transfer, these metrics are useless, because they are only available once the transfer has already taken place. In this case, model-based metrics have an advantage over performance-based metrics: they can be computed before the transfer process. These metrics can be used to choose the most similar MDP before the transfer, but as far as we know there are no theoretical guarantees that the most similar MDP is similar enough to produce a positive transfer. By contrast, the metrics based on the on-line transfer gain lie halfway between the two. They allow the similarity metric to be computed during transfer, so that the source task introduces a greater or lesser exploration bias in the learning process of the new task depending on its similarity.
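To make the off-line case concrete, the following sketch computes a transfer gain from two learning curves recorded on the target task, one with transfer and one learned from scratch. The definition used here, the difference in mean return per episode, is only one possible choice and is an assumption of this example; the function name and the toy data are hypothetical, and the surveyed approaches may define the gain differently.

```python
import numpy as np

def offline_transfer_gain(returns_with_transfer, returns_from_scratch):
    """Difference in mean per-episode return with and without transfer.

    Both arguments are learning curves on the target task, one return per
    episode. A positive value indicates positive transfer under this
    (assumed) definition of gain.
    """
    with_t = np.asarray(returns_with_transfer, dtype=float)
    scratch = np.asarray(returns_from_scratch, dtype=float)
    if with_t.shape != scratch.shape:
        raise ValueError("learning curves must cover the same episodes")
    return float(with_t.mean() - scratch.mean())

# Toy learning curves over four episodes.
scratch = [0.0, 1.0, 2.0, 3.0]
transfer = [2.0, 2.5, 3.0, 3.5]
gain = offline_transfer_gain(transfer, scratch)
print(gain)  # 1.25
```

Both curves must already exist before the gain can be evaluated, which is why such a measure cannot guide the choice of the source task in advance.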

Future directions
The previous discussion points to several future directions. On the one hand, since there is no best metric, it would be useful to use several of them. For instance, model-based metrics can be used to return a useful approximation of task similarity before the tasks are learned, and this measure can then be adapted on-line during the learning process so that the bias that the source task induces in the exploration process is adjusted dynamically. On the other hand, given that different metrics compute different types of similarity (i.e., model-based metrics measure structural similarities, whilst performance-based metrics measure performance similarities), the agent can be equipped with the ability not only to determine which source task to use, but also which transfer technique to use given the type of similarity between the source and the target tasks.
The proposed taxonomy also suggests other gaps that should be considered in future work. Sect. 6.2 considers model-based approaches based on the structural comparison of tuples of the form $\langle s, a, s' \rangle$ as highly attractive (Ammar et al., 2014). Surprisingly, these approaches leave the reward out of the tuples. We suggest that the comparison of tuples of the form $\langle s, a, s', r \rangle$ would lead to better estimates of the similarity between two MDPs. Such metrics would be able to capture not only structural differences between MDPs, but also changes in the reward dynamics. The taxonomy also shows metrics that measure the similarity between two tasks as the similarity between their state spaces (Sect. 4.5). We suggest that this idea could also be applied to the action space: a novel metric not covered by the taxonomy would be one that measures the similarity between two tasks by considering only the similarity between their action spaces. Such a metric can be particularly interesting in robotic tasks, where the robot skills could change from one task to another while everything else remains the same.
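The suggestion of including the reward in the compared tuples can be illustrated with a small sketch. It assumes that both tasks share numerically encoded state and action spaces and, as a simple stand-in, uses a symmetric mean nearest-neighbor distance between the two sets of tuples. This is not the RBM-based method of Ammar et al. (2014); the function name and the `reward_weight` parameter are hypothetical.

```python
import numpy as np

def tuple_set_distance(tuples_a, tuples_b, reward_weight=1.0):
    """Symmetric mean nearest-neighbor distance between two tuple sets.

    Each row is a flattened <s, a, s', r> tuple with the reward in the last
    column. Setting reward_weight to 0 ignores the reward, which reduces the
    comparison to <s, a, s'> tuples.
    """
    a = np.array(tuples_a, dtype=float)  # np.array copies the input
    b = np.array(tuples_b, dtype=float)
    a[:, -1] *= reward_weight
    b[:, -1] *= reward_weight
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = pairwise.min(axis=1).mean()
    b_to_a = pairwise.min(axis=0).mean()
    return float(0.5 * (a_to_b + b_to_a))

# Two toy tasks with identical transitions but a different reward
# on the second transition. Columns: s, a, s', r.
task_a = [[0, 0, 1, 0.0], [1, 0, 2, 1.0]]
task_b = [[0, 0, 1, 0.0], [1, 0, 2, -1.0]]

d_structural = tuple_set_distance(task_a, task_b, reward_weight=0.0)
d_with_reward = tuple_set_distance(task_a, task_b, reward_weight=1.0)
print(d_structural, d_with_reward)
```

In this toy case the purely structural comparison cannot tell the tasks apart, whereas the comparison that includes the reward yields a strictly positive distance.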
Another interesting line of research is that based on building semantic representations of the tasks through domain-dependent features. For instance, we can define a particular Pac-Man task from features like the number of ghosts, the behavior of the ghosts, or the type of maze, and use these features to build a similarity metric between different Pac-Man tasks. In fact, one may heuristically combine structural, performance, and semantic similarity aspects into the same metric. In this way, a metric could be obtained that is more aligned with the way in which humans decide what is similar, since humans analyze the similarity between concepts or objects from different perspectives (Kemmerer, 2017).
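A feature-based semantic similarity of this kind could look like the following sketch. The feature names, their encodings, the weights, and the bound `max_ghosts` are all assumptions made for illustration; the metric is simply a weighted average of per-feature similarities and is not taken from the literature surveyed here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PacManTask:
    """Hypothetical semantic description of a Pac-Man task."""
    n_ghosts: int
    ghost_behavior: str  # e.g. "random" or "chasing"
    maze_type: str       # e.g. "small" or "medium"

def semantic_similarity(t1, t2, max_ghosts=4, weights=(0.5, 0.25, 0.25)):
    """Weighted average of per-feature similarities, in [0, 1].

    Assumes n_ghosts lies in [0, max_ghosts] and that the weights sum to 1.
    """
    ghost_sim = 1.0 - abs(t1.n_ghosts - t2.n_ghosts) / max_ghosts
    behavior_sim = 1.0 if t1.ghost_behavior == t2.ghost_behavior else 0.0
    maze_sim = 1.0 if t1.maze_type == t2.maze_type else 0.0
    w_ghosts, w_behavior, w_maze = weights
    return w_ghosts * ghost_sim + w_behavior * behavior_sim + w_maze * maze_sim

task_1 = PacManTask(n_ghosts=2, ghost_behavior="random", maze_type="small")
task_2 = PacManTask(n_ghosts=4, ghost_behavior="random", maze_type="medium")

print(semantic_similarity(task_1, task_1))  # 1.0
print(semantic_similarity(task_1, task_2))  # 0.5
```

The weights encode how much each feature is believed to matter for transfer; in practice they would have to be chosen per domain or tuned against observed transfer results.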
There is also future work in the context of Sim-to-Real. Transferring learned models from simulation to the real world remains one of the hardest problems in control theory (Zhao et al., 2020). In this case, similarity metrics can help to answer how similar simulations and the actual world are. They could be used to provide theoretical guarantees that the learned policies transferred from simulation to the actual world will perform as required, or to define mechanisms to tune or modify the simulated environments so that the gap between the simulated world and the actual one decreases.
In a different setting, recent successes achieved by Deep Learning for learning feature representations have significantly impacted RL, and the combination of both methods (known as Deep RL) has achieved impressive results in recent years (Silver et al., 2016). However, using Deep RL introduces additional challenges, especially its sample complexity. Initial investigations show that Deep RL agents also benefit from the reuse of knowledge, but the effects of negative transfer could be catastrophic if uninformed transfer is performed (Pan et al., 2018; Rusu et al., 2016). Therefore, similarity metrics should also play an important role in the context of Deep RL. Such metrics will have to deal with the particular characteristics of the MDPs in Deep RL, i.e., possibly huge (and continuous) state and action spaces. This complexity makes most of the metrics discussed in this paper not directly usable. However, some of these metrics could take advantage of advances in Deep Learning itself, e.g., using deep architectures instead of multi-layer perceptrons (Taylor et al., 2008c), or Deep RBMs instead of RBMs (Ammar et al., 2014). Therefore, the adaptation or the creation of new similarity metrics for Deep RL is a promising area of research.
Finally, comparing similarity metrics from different publications and recreating results from the literature is problematic because there are few available standard implementations. Researchers who wish to compare their new similarity metrics to existing results must often re-implement everything from scratch using the (sometimes incomplete and/or unclear) parameter settings of the publications. In this scenario, comparisons can be inaccurate and often not empirically verifiable. It is necessary to establish ways to fairly compare the results yielded by similarity metrics that leverage very different methodologies, in order to understand their advantages and disadvantages. We need a benchmark suite consisting of implemented tasks and similarity metrics in order to better gauge the progress and applicability of new metrics. It is worth noting that such a tool can open the door to the standardization of the comparison between transfer learning algorithms: since existing metrics measure the similarity from different perspectives, they can characterize the performance of a transfer learning algorithm depending on how, and how much, the source and the target tasks are similar.

Concluding remarks
This paper contributes a compact and useful taxonomy of similarity metrics for Markov Decision Processes. The leaves of the taxonomy have been used to provide a literature review that surveys the existing work. We differentiated between model-based and performance-based metrics, depending on whether a structural or a performance criterion has been used in their creation. The proposed taxonomy makes it possible to organize the different similarity metrics clearly and to find commonalities between them. This can help readers to choose similarity metrics for their tasks, or even to define their own. We also discussed different selection criteria and some promising future research directions.

Ethics approval Not applicable.

Consent for publication Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.