Preference-based reinforcement learning: a formal framework and a policy iteration algorithm
DOI: 10.1007/s10994-012-5313-8
 Cite this article as:
 Fürnkranz, J., Hüllermeier, E., Cheng, W. et al. Mach Learn (2012) 89: 123. doi:10.1007/s10994-012-5313-8
Abstract
This paper makes a first step toward the integration of two subfields of machine learning, namely preference learning and reinforcement learning (RL). An important motivation for a preference-based approach to reinforcement learning is the observation that in many real-world domains, numerical feedback signals are not readily available, or are defined arbitrarily in order to satisfy the needs of conventional RL algorithms. Instead, we propose an alternative framework for reinforcement learning, in which qualitative reward signals can be directly used by the learner. The framework may be viewed as a generalization of the conventional RL framework in which only a partial order between policies is required instead of the total order induced by their respective expected long-term reward.
Therefore, building on novel methods for preference learning, our general goal is to equip the RL agent with qualitative policy models, such as ranking functions that allow for sorting its available actions from most to least promising, as well as algorithms for learning such models from qualitative feedback. As a proof of concept, we realize a first simple instantiation of this framework that defines preferences based on utilities observed for trajectories. To that end, we build on an existing method for approximate policy iteration based on rollouts. While this approach is based on the use of classification methods for generalization and policy learning, we make use of a specific type of preference learning method called label ranking. Advantages of preference-based approximate policy iteration are illustrated by means of two case studies.
Keywords
Reinforcement learning · Preference learning
1 Introduction
Standard methods for reinforcement learning (RL) assume feedback to be specified in the form of real-valued rewards. While such rewards are naturally generated in some applications, there are many domains in which precise numerical information is difficult to extract from the environment, or in which the specification of such information is largely arbitrary. The quest for numerical information, even if accomplishable in principle, may also compromise efficiency in an unnecessary way. In a game playing context, for example, a short look-ahead from the current state may reveal that an action a is most likely superior to an action a′; however, the precise numerical gains are only known at the end of the game. Moreover, external feedback, which is not produced by the environment itself but, say, by a human expert (e.g., “In this situation, action a would have been better than a′”), is typically of a qualitative nature, too.
In order to make RL more amenable to qualitative feedback, we build upon formal concepts and methods from the rapidly growing field of preference learning (Fürnkranz and Hüllermeier 2010). Roughly speaking, we consider the RL task as a problem of learning the agent’s preferences for actions in each possible state, that is, as a problem of contextualized preference learning (with the context given by the state). In contrast to the standard approach to RL, the agent’s preferences are not necessarily expressed in terms of a utility function. Instead, more general types of preference models, as recently studied in preference learning, can be envisioned, such as total and partial order relations.
Interestingly, this approach is in a sense in between the two extremes that have been studied in RL so far, namely learning numerical utility functions for all actions (e.g., Watkins and Dayan 1992) and, on the other hand, directly learning a policy which predicts a single best action in each state (e.g., Lagoudakis and Parr 2003). One may argue that the former approach is unnecessarily complex, since precise utility degrees are actually not necessary for taking optimal actions, whereas the latter approach is not fully effectual, since a prediction in the form of a single action neither suggests alternative actions nor offers any means for proper exploration. An order relation on the set of actions seems to provide a reasonable compromise, as it supports the exploitation of acquired knowledge (i.e., the selection of presumably optimal actions), but at the same time also provides information about which alternatives are more promising than others.
The main contribution of this paper is a formal framework for preference-based reinforcement learning. Its key idea is the observation that, while a numerical reward signal induces a total order on the set of trajectories, a qualitative reward signal only induces a partial order on this set. This makes the problem considerably more difficult, because crucial steps such as the comparison of policies, which can be realized in a numerical setting by estimating their expected reward, become more complex. In this particular case, we propose a solution based on stochastic dominance between probability distributions on the space of trajectories. Having defined preferences over trajectories, we can also deduce preferences between states and actions. The proposed framework is closely related to the approach of Akrour et al. (2011). As we will discuss in more detail in Sect. 8, the main differences are that their approach works with preferences over policies, and uses this information to directly learn to rank policies, whereas we learn to rank actions.
We will start the paper with a discussion on the importance of qualitative feedback for reinforcement learning (Sect. 2), which we motivate with an example from the domain of chess, where annotated game traces provide a source for feedback in the form of action and state preferences. We then show how such preferences can be embedded into a formal framework for preference-based reinforcement learning, which is based on preferences between trajectories (Sect. 3). For a first instantiation of this algorithm, we build upon a policy learning approach called approximate policy iteration, which reduces the problem to iteratively learning a policy in the form of a classifier that predicts the best action in a state. We introduce a preference-based variant of this algorithm by replacing the classifier with a label ranker, which is able to make better use of the information provided by rollout evaluations of all actions in a state. Preference learning and label ranking are briefly recapitulated in Sect. 4, and their use for policy iteration is introduced in Sect. 5. While the original approach is based on the use of classification methods for generalization and policy learning, we employ label ranking algorithms for incorporating preference information. Advantages of this preference-based approximate policy iteration method are illustrated by means of two case studies presented in Sects. 6 and 7. In the last two sections, we discuss our plans for future work and conclude the paper.^{1}
2 Reinforcement learning and qualitative feedback
In this section, we will informally introduce a framework for reinforcement learning from qualitative feedback. We will start with a brief recapitulation of conventional reinforcement learning (Sect. 2.1) and then discuss our alternative proposal, which can be seen as a generalization that does not necessarily require a numerical feedback signal but is also able to exploit qualitative feedback (Sect. 2.2). Finally, we illustrate this setting using the game of chess as an example domain (Sect. 2.3).
2.1 Reinforcement learning
Conventional reinforcement learning assumes a scenario in which an agent moves through a (finite) state space by taking different actions. Occasionally, the agent receives feedback about its actions in the form of a reward signal. The goal of the agent is to choose its actions so as to maximize its expected total reward. Thus, reinforcement learning may be considered to be halfway between unsupervised learning (where the agent does not receive any form of feedback) and supervised learning (where the agent would be told the correct action in certain states).

Formally, a standard reinforcement learning problem comprises the following components:

a set of states \(S=\{s_1,\ldots,s_n\}\) in which the agent operates; normally, a state does not have an internal structure, though it may be described in terms of a set of features (which allows, for example, functional representations of policies);

a (finite) set of actions \(A=\{a_1,\ldots,a_k\}\) the agent can perform; sometimes, only a subset \(A(s_i)\subset A\) of actions is applicable in a state \(s_i\);

a Markovian state transition function \(\delta: S\times A\rightarrow \mathbb{P}(S)\), where \(\mathbb{P}(S)\) denotes the set of probability distributions over S; thus, \(\tau(\mathbf {s},\boldsymbol {a},\mathbf {s'}) = \delta(\mathbf {s},\boldsymbol {a})(\mathbf {s'})\) is the probability that action a in state s leads the agent to state \(\mathbf {s'}\);

a reward function \(r: S\times A\rightarrow \mathbb{R}\), where r(s,a) is the reward the agent receives for performing action a in state s; the concrete reward may depend on the successor state, in which case r(s,a) is given by the expectation of \(r(\mathbf {s},\boldsymbol {a},\mathbf {s'})\) with respect to δ(s,a).
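The components listed above can be sketched as a small data structure. This is only an illustrative sketch: the class name `MDP`, the field names, and the toy numbers are assumptions made for the example and do not come from the paper. The method `r` computes the expected reward r(s,a) from a successor-dependent reward, as described in the last item.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
Action = str


@dataclass
class MDP:
    """Illustrative container for the components of an RL problem."""
    states: List[State]
    actions: List[Action]
    # delta[(s, a)] is a probability distribution over successor states.
    delta: Dict[Tuple[State, Action], Dict[State, float]]
    # Successor-dependent reward r(s, a, s'); missing entries count as 0.
    reward3: Dict[Tuple[State, Action, State], float]

    def tau(self, s: State, a: Action, s_next: State) -> float:
        """Probability that action a in state s leads to s_next."""
        return self.delta[(s, a)].get(s_next, 0.0)

    def r(self, s: State, a: Action) -> float:
        """Expected reward of (s, a) with respect to delta(s, a)."""
        return sum(p * self.reward3.get((s, a, s_next), 0.0)
                   for s_next, p in self.delta[(s, a)].items())


# Toy instance with made-up numbers.
toy = MDP(
    states=["s1", "s2"],
    actions=["a1"],
    delta={("s1", "a1"): {"s1": 0.25, "s2": 0.75},
           ("s2", "a1"): {"s2": 1.0}},
    reward3={("s1", "a1", "s2"): 4.0},
)
expected = toy.r("s1", "a1")  # 0.25 * 0.0 + 0.75 * 4.0
```

Storing the transition function as a dictionary of distributions keeps the correspondence τ(s,a,s′) = δ(s,a)(s′) explicit, at the price of being practical only for small, finite state spaces.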
Learning from trajectories and accumulated rewards is natural in many reinforcement learning settings. In robotics, for example, each action (for example, a movement) of the robot may cause a certain cost, hence a negative reward, and these cost values are accumulated until a certain goal state is reached (for example, the robot finds itself in a desired spatial position). Another example is reinforcement learning in games, where each state-action sequence is one game. This example is somewhat special in the sense that a true (non-zero) reward signal only comes at the very end, indicating whether the game was won or lost.
With \(V^{*}(s)=\sup_{\pi\in\Pi} V^{\pi}(s)\) the best possible value that can be achieved for (1), a policy is called optimal if it achieves the best value in each state s. Thus, one possibility to learn an optimal policy is to learn an evaluation of states in the form of a value function (Sutton 1988), or to learn a so-called Q-function that returns the expected reward for a given state-action pair (Watkins and Dayan 1992).
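As a reading aid, the quantities in this paragraph can be related as follows, assuming that (1) denotes the usual expected discounted return of a deterministic policy π: S→A. The discount factor γ∈[0,1) and the exact form of the expectation are assumptions here, not taken from the text above.

```latex
\begin{align*}
V^{\pi}(s) &= \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}\, r\big(s_t,\pi(s_t)\big) \;\Big|\; s_0=s\Big],\\
Q^{\pi}(s,a) &= r(s,a) + \gamma \sum_{s'\in S} \tau(s,a,s')\, V^{\pi}(s'),\\
V^{*}(s) &= \sup_{\pi\in\Pi} V^{\pi}(s).
\end{align*}
```

Under this reading, a value function evaluates states, whereas a Q-function evaluates state-action pairs, and acting greedily with respect to an optimal Q-function yields an optimal policy.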
2.2 Learning from qualitative feedback
Existing algorithms for reinforcement learning possess interesting theoretical properties and have been used successfully in a number of applications. However, they also exhibit some practical limitations, notably regarding the type of feedback they can handle.
On the one hand, these methods are rather demanding with regard to the training information requested by the learner. In order to learn a value function or a Q-function, feedback must be specified in the form of real-valued rewards. It is true that, in some applications, information of that type is naturally generated by the environment; for example, the waiting time of people in learning elevator control (Crites and Barto 1998), or the distance covered by a robot learning to walk. In general, however, precise numerical information is difficult to extract from the environment, and designing the reward function is part of framing the problem as a reinforcement learning problem. Sometimes, this may become difficult and largely arbitrary. As a striking though telling example, to which we shall return in Sect. 7, consider assigning a negative reward of −60 to the death of the patient in a medical treatment (Zhao et al. 2009). Likewise, in (robot) soccer, a corner kick is not as good as a goal but better than a throw-in; again, however, it is difficult to quantify the differences in terms of real numbers.
The main objective of this paper is to define a reinforcement learning framework that is not restricted to numerical, quantitative feedback, but is also able to handle qualitative feedback expressed in the form of preferences over states, actions, or trajectories. Feedback of this kind is more natural and less difficult to acquire in many applications. As will be seen later on, comparing pairs of actions, states or trajectories instead of evaluating them numerically also has a number of advantages from a learning point of view. Before describing our framework more formally in Sect. 3, we discuss the problem of chess playing as a concrete example for illustrating the idea of qualitative feedback.
2.3 Example: qualitative feedback in chess
In games like chess, reinforcement learning algorithms have been successfully employed for learning meaningful evaluation functions (Baxter et al. 2000; Beal and Smith 2001; Droste and Fürnkranz 2008). These approaches have all been modeled after the success of TD-Gammon (Tesauro 2002), a learning system that uses temporal-difference learning (Sutton 1988) for training a game evaluation function (Tesauro 1992). However, all these algorithms were trained exclusively on self-play, entirely ignoring human feedback that is readily available in annotated game databases.

Qualitative move evaluation: Each move can be annotated with a postfix that indicates the quality of the move. Six symbols are commonly used, representing different quality levels:

blunder (??),

bad move (?),

dubious move (?!),

interesting move (!?),

good move (!),

excellent move (!!).


Qualitative position evaluation: Each position can be annotated with a symbol that indicates the qualitative value of this position:

white has decisive advantage (+−),

white has the upper hand (±),

white is better (+=),

equal chances for both sides (=),

black is better (=+),

black has the upper hand (∓),

black has decisive advantage (−+),

the evaluation is unclear (∞).

For example, the left-hand side of Fig. 1 shows the game position after the 13th move of white. Here, black made a mistake (13… g6?), but he is already in a difficult position. Of the alternative moves, 13…a5?! is somewhat better, but even here white has the upper hand at the end of the variation (18. ec1!±). On the other hand, 13… ×c2?? is an even worse choice, ending in a position that is clearly lost for black (+−).
It is important to note that this feedback is of a qualitative nature, i.e., it is not clear what the expected reward is in terms of, e.g., percentage of won games from a position with evaluation ±. However, it is clear that positions with evaluation ± are preferable to positions with evaluation += or worse (=, =+, ∓, −+).
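The ordinal character of these position annotations can be illustrated with a short sketch. The function name `white_prefers` is hypothetical, and treating the "unclear" symbol ∞ as incomparable to every other symbol is an assumption made for this example; the text itself only states the order among the remaining symbols.

```python
# Position symbols ordered from best to worst for White (from the list above).
POSITION_SCALE = ["+−", "±", "+=", "=", "=+", "∓", "−+"]
UNCLEAR = "∞"  # assumption: incomparable to every other symbol


def white_prefers(eval_a, eval_b):
    """Compare two position annotations from White's point of view.

    Returns True if eval_a is strictly better than eval_b, False if it is
    equal or worse, and None if the two annotations are incomparable.
    """
    if UNCLEAR in (eval_a, eval_b):
        return None
    return POSITION_SCALE.index(eval_a) < POSITION_SCALE.index(eval_b)


better = white_prefers("±", "+=")        # comparable, strictly better
incomparable = white_prefers("∞", "=")   # no preference can be stated
```

Only the positions in the scale are used, never numerical values attached to them, which mirrors the point that no expected reward is implied by an annotation. With the incomparable symbol included, the annotations induce a partial rather than a total order.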
Also note that the feedback for positions typically applies to the entire sequence of moves that has been played up to reaching this position. The qualitative position evaluations may be viewed as providing an evaluation of the trajectory that led to this particular position, whereas the qualitative move evaluations may be viewed as evaluations of the expected value of a trajectory that starts at this point.
Note, however, that even though there is a certain correlation between these two types of annotations (good moves tend to lead to better positions and bad moves tend to lead to worse positions), they are not interchangeable. A very good move may be the only move that saves the player from imminent doom, but need not lead to a very good position. Conversely, a bad move may be a move that misses a chance to mate the opponent right away, but the resulting position may still be good for the player.
3 A formal framework for preference-based reinforcement learning
In this section, we define our framework in a more rigorous way. To this end, we first introduce a preference relation on trajectories. Based on this relation, we then derive a preference order on policies, and eventually on states and actions.
3.1 Preferences over trajectories
3.2 Preferences over policies
What we are eventually interested in, of course, is preferences over actions or, more generally, policies. Note that a policy π (together with an initial state or an initial probability distribution over states) induces a probability distribution over the set Σ of trajectories. In fact, fixing a policy π means fixing a state transition function, which means that trajectories are produced by a simple Markov process.
1. First, trajectories are mapped to real numbers (accumulated rewards), so that the comparison of probability distributions over Σ is reduced to the comparison of distributions over ℝ.
2. This comparison is further simplified by mapping probability distributions to their expected value. Thus, a policy π is preferred to a policy π′ if the expected accumulated reward of the former is higher than the one of the latter.
Since the first reduction step (mapping trajectories to real numbers) is not feasible in our setting, we have to compare probability distributions over Σ directly. What we can exploit, nevertheless, is the order relation ⊒ on Σ. Indeed, recall that a common way to compare probability distributions on totally ordered domains is stochastic dominance. More formally, if X is equipped with a total order ≥, and P and P′ are probability distributions on X, then P dominates (is preferred to) P′ if \(P(X_a)\geq P'(X_a)\) for all a∈X, where \(X_a=\{x\in X \mid x\geq a\}\). Or, to put it in words, the probability to reach a or something better is always at least as high under P as it is under P′.
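For a finite, totally ordered outcome set, the dominance condition just stated can be checked directly. The following is a minimal sketch; the function name `dominates`, the outcome labels, and the probabilities are assumptions made for illustration.

```python
def dominates(p, q, order, tol=1e-12):
    """Check whether distribution p stochastically dominates q.

    `order` lists the outcomes from worst to best; `p` and `q` map
    outcomes to probabilities. p dominates q if, for every outcome a,
    the probability of reaching a or something better is at least as
    high under p as under q.
    """
    tail_p = tail_q = 0.0
    # Walk from the best outcome down, accumulating P(X_a) and P'(X_a).
    for a in reversed(order):
        tail_p += p.get(a, 0.0)
        tail_q += q.get(a, 0.0)
        if tail_p < tail_q - tol:
            return False
    return True


order = ["loss", "draw", "win"]
P = {"loss": 0.25, "draw": 0.25, "win": 0.5}
Q = {"loss": 0.5, "draw": 0.25, "win": 0.25}
R = {"draw": 1.0}

p_over_q = dominates(P, Q, order)
p_vs_r = (dominates(P, R, order), dominates(R, P, order))
```

Here P dominates Q, while P and R are incomparable: P wins more often, but R never loses. Even on a totally ordered domain, stochastic dominance therefore yields only a partial order on distributions.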
However, since ⊒ is only a partial order, stochastic dominance cannot be applied immediately. What is needed, therefore, is a generalization of this concept to the case of partial orders: What does it mean for P to dominate P′ if both distributions are defined on a partially ordered set? This question is interesting and nontrivial, especially since both distributions may allocate probability mass on elements that are not comparable. Somewhat surprisingly, it seems that the generalization of stochastic dominance to distributions over partial orders has not received much attention in the literature so far, with a few notable exceptions.
Recalling that policies are associated with probability distributions over Σ, it is clear that a dominance relation on such distributions immediately induces a preference relation ⪰ on the class Π of policies: π⪰π′ if the probability distribution associated with π dominates the one associated with π′. It is also obvious that this relation is reflexive and antisymmetric. An important question, however, is whether it is also transitive, and hence a partial order. This is definitely desirable, and in fact already anticipated by the interpretation of ⪰ as a preference relation.
Massey (1987) shows that the above dominance relation is indeed transitive, provided the family \(\mathcal{I}(\varSigma)\) fulfills two technical requirements. First, it must be strongly separating, meaning that whenever \(\boldsymbol {\sigma }\not\sqsupseteq \boldsymbol {\sigma }'\), there is some \(\varGamma \in \mathcal{I}(\varSigma)\) such that σ′∈Γ and \(\boldsymbol {\sigma }\not\in \varGamma\) (that is, if σ′ is not worse than σ, then it can be separated from σ). The second condition is that \(\mathcal{I}(\varSigma)\) is a determining class, which, roughly speaking, means that it is sufficiently rich, so that each probability measure is uniquely determined by its values on the sets in \(\mathcal{I}(\varSigma)\).
In summary, provided these technical conditions are met, the partial order ⊒ on the set of trajectories Σ that we started with induces a partial order ⪰ on the set of policies Π. Of course, since ⪰ is only a partial order, there is not necessarily a unique optimal policy. Instead, optimality ought to be defined in terms of (Pareto) dominance: A policy \(\pi^{*}\in\Pi\) is optimal if there is no other policy π′ such that \(\pi'\succ\pi^{*}\). We denote the set of all optimal policies by \(\Pi^{*}\).
3.3 Preferences on states and actions
From the preference relation ⪰ on policies, a preference relation on actions can be deduced. For an action a and a state s, denote by Π(s,a) the set of policies π such that π(s)=a. It is then natural to say that a is an optimal action in state s if \(\Pi(s,a)\cap\Pi^{*}\neq\emptyset\), that is, if there is an optimal policy \(\pi^{*}\in\Pi^{*}\) such that \(\pi^{*}(s)=a\). Again, note that there is not necessarily a unique optimal action.
The above condition only distinguishes between optimal and non-optimal actions in a given state. Of course, it might be desirable to discriminate between non-optimal actions in a more granular way, that is, to distinguish different levels of non-optimality. This can be done by applying the above choice principle in a recursive manner: With \(\Pi^{**}\) denoting the non-dominated policies in \(\Pi\setminus\Pi^{*}\), the “second-best” actions are those for which \(\Pi(s,a)\cap\Pi^{**}\neq\emptyset\). Proceeding in this way, that is, successively removing the optimal (non-dominated) policies and considering the optimal ones among those that remain, a Pareto ranking (Srinivas and Deb 1995) on the set of actions is produced. As mentioned earlier, this relation might be weak, i.e., there might be ties between different actions.
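The recursive construction described above can be sketched for a finite set of policies. The function names, the four policies, and the strict preference relation `BETTER` are hypothetical and serve only to illustrate the peeling procedure; the relation is assumed to be given explicitly and to be transitive.

```python
def pareto_ranks(items, strictly_better):
    """Rank 1 goes to the non-dominated items; remove them and repeat."""
    remaining = set(items)
    ranks = {}
    rank = 1
    while remaining:
        front = {x for x in remaining
                 if not any(strictly_better(y, x)
                            for y in remaining if y != x)}
        if not front:
            raise ValueError("strict preference relation contains a cycle")
        for x in front:
            ranks[x] = rank
        remaining -= front
        rank += 1
    return ranks


def action_ranks(policy_rank, action_of):
    """An action inherits the best rank among policies choosing it in s."""
    ranks = {}
    for policy, r in policy_rank.items():
        a = action_of[policy]
        ranks[a] = min(r, ranks.get(a, r))
    return ranks


# Hypothetical strict preferences between four policies (pi1 > pi3, ...).
BETTER = {("pi1", "pi3"), ("pi2", "pi3"), ("pi3", "pi4"),
          ("pi1", "pi4"), ("pi2", "pi4")}
policy_rank = pareto_ranks(["pi1", "pi2", "pi3", "pi4"],
                           lambda p, q: (p, q) in BETTER)
# Action chosen by each policy in some fixed state s.
ranks_in_s = action_ranks(policy_rank,
                          {"pi1": "a", "pi2": "b", "pi3": "a", "pi4": "c"})
```

In this toy example, pi1 and pi2 are incomparable and both optimal, so actions a and b are tied at rank 1, which illustrates why the resulting relation on actions is only a weak order.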
In the same way, one can derive a (weak) linear order on states from the preference relation ⊒ on trajectories. Let Σ(s)⊂Σ denote the set of trajectories originating in state s, and let \(\Sigma^{*}\) be the set of trajectories that are non-dominated according to ⊒. Then, a state is optimal (i.e., has Pareto rank 1) if \(\Sigma(s)\cap\Sigma^{*}\neq\emptyset\), second-best (Pareto rank 2) if \(\Sigma(s)\cap\Sigma^{**}\neq\emptyset\), with \(\Sigma^{**}\) the set of non-dominated trajectories in \(\Sigma\setminus\Sigma^{*}\), etc.
An important theoretical question, which is, however, beyond the scope of this paper, concerns the mutual dependence between the above preference relations on trajectories, states and actions, and a simple characterization of these relations. An answer to this question is indeed a key prerequisite for developing qualitative counterparts to methods like value and policy iteration. For the purpose of this paper, this point is arguably less relevant, since our approach of preference-based approximate policy iteration, to be detailed in Sect. 5, is explicitly based on the construction and evaluation of trajectories through rollouts. The idea, then, is to learn the above linear order (ranking) of actions (as a function of the state) from these trajectories. A suitable learning method for doing so will be introduced in Sect. 5.
Finally, we note that learning a ranking function of the above kind is also a viable option in the standard setting where rewards are numerical, and policies can therefore be evaluated in terms of expected cumulative rewards (instead of being compared in terms of generalized stochastic dominance). In fact, the case study presented in Sect. 6 will be of that type.
4 Preference learning and label ranking
The topic of preference learning has attracted considerable attention in machine learning in recent years (Fürnkranz and Hüllermeier 2010). Roughly speaking, preference learning is about inducing predictive preference models from empirical data, thereby establishing a link between machine learning and research fields related to preference modeling and decision making.
4.1 Utility functions vs. preference relations

utility functions evaluating individual alternatives, and

preference relations comparing pairs of competing alternatives.
The first approach is quantitative in nature, in the sense that a utility function is normally a mapping from alternatives to real numbers. This approach is in line with the standard RL setting: A value V(s) assigned to a state s by a value function V can be seen as a utility of that state. Likewise, a Q-function assigns a degree of utility to an action, namely, Q(s,a) is the utility of choosing action a in state s. The second approach, on the other hand, is qualitative in nature, as it is based on comparing alternatives in terms of qualitative preference relations.^{3} The main idea of our paper is to exploit such a qualitative approach in the context of RL.
From a machine learning point of view, the two approaches give rise to two kinds of learning problems: learning utility functions and learning preference relations. The latter deviates more strongly than the former from conventional problems like classification and regression, as it involves the prediction of complex structures, such as rankings or partial order relations, rather than single values. Moreover, training input in preference learning will not, as is usually the case in supervised learning, be offered in the form of complete examples but may comprise more general types of information, such as relative preferences or different kinds of indirect feedback and implicit preference information.
4.2 Label ranking
Among the problems in the realm of preference learning, the task of “learning to rank” has probably received the most attention in the machine learning literature so far. In general, a preference learning task consists of some set of items for which preferences are known, and the task is to learn a function that predicts preferences for a new set of items, or for the same set of items in a different context. The preferences can be among a set of objects, in which case we speak of object ranking (Kamishima et al. 2011), or among a set of labels that are attached to a set of objects, in which case we speak of label ranking (Vembu and Gärtner 2011).
The training data \(\mathcal {E}\) used to induce a label ranker typically consists of a set of pairwise preferences of the form \(y_i \succ_{\mathbf{x}} y_j\), suggesting that, for the instance x, \(y_i\) is preferred to \(y_j\). In other words, a single “observation” consists of an instance x together with an ordered pair of labels \((y_i,y_j)\).
4.3 Learning by pairwise comparison
Several methods for label ranking have already been proposed in the literature; we refer to Vembu and Gärtner (2011) for a comprehensive survey. In this paper, we chose learning by pairwise comparison (LPC; Hüllermeier et al. 2008), but other choices would be possible. The key idea of LPC is to train a separate model \(\mathcal {M}_{i,j}\) for each pair of labels \((y_i,y_j)\in Y\times Y\), \(1\leq i<j\leq k\); thus, a total number of \(k(k-1)/2\) models is needed. At prediction time, a query x is submitted to all models, and each prediction \(\mathcal {M}_{i,j}(\mathbf{x})\) is interpreted as a vote for a label. More specifically, assuming scoring classifiers that produce normalized scores \(f_{i,j}= \mathcal {M}_{i,j}(\mathbf{x}) \in [0,1]\), the weighted voting technique interprets \(f_{i,j}\) and \(f_{j,i}=1-f_{i,j}\) as weighted votes for classes \(y_i\) and \(y_j\), respectively, and orders the labels according to the accumulated voting mass \(F(y_i)=\sum_{j\neq i} f_{i,j}\).
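The prediction step of LPC with weighted voting can be sketched as follows. The pairwise models are stand-ins here: each is a hypothetical function returning a fixed score in [0,1], whereas in practice they would be trained scoring classifiers. The function name `lpc_rank` and the dictionary layout are assumptions made for the example.

```python
from itertools import combinations


def lpc_rank(x, labels, models):
    """Rank labels for query x by weighted voting over pairwise models.

    `models[(yi, yj)]` (defined for each pair in the order given by
    `labels`) returns a score f_ij in [0, 1], read as a weighted vote
    for yi; the complement 1 - f_ij is the vote for yj.
    """
    votes = {y: 0.0 for y in labels}
    for yi, yj in combinations(labels, 2):
        f_ij = models[(yi, yj)](x)
        votes[yi] += f_ij
        votes[yj] += 1.0 - f_ij
    ranking = sorted(labels, key=lambda y: votes[y], reverse=True)
    return ranking, votes


labels = ["a", "b", "c"]
# Hypothetical pairwise scorers with constant outputs, for illustration.
models = {
    ("a", "b"): lambda x: 0.75,
    ("a", "c"): lambda x: 0.75,
    ("b", "c"): lambda x: 0.25,
}
ranking, votes = lpc_rank(x=None, labels=labels, models=models)
```

With k=3 labels, three models are queried, and the voting masses sum to k(k−1)/2=3 because every model distributes exactly one unit of vote between its two labels.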
Note that the total complexity for training the quadratic number of classifiers of the LPC approach is only linear in the number of observed preferences (Hüllermeier et al. 2008). More precisely, the learning complexity of LPC is O(n×d), where n is the number of training examples, and d is the average number of preferences that have been observed per state. In the worst case (for each training example we observe a total order of the labels), d can be as large as k⋅(k−1)/2 (k being the number of labels). However, in many practical problems, it is considerably smaller.
Querying the quadratic number of classifiers can also be sped up considerably so that the best label \(y^{*} = \arg\max_{y}F(y)\) can be determined after querying only approximately k⋅log(k) classifiers (Park and Fürnkranz 2012). Thus, the main obstacle for tackling large-scale problems with LPC is the memory required for storing the quadratic number of classifiers. Nevertheless, Loza Mencía et al. (2010) have shown that it is applicable to multi-label problems with up to half a million examples and up to 1000 labels.
We refer to Hüllermeier et al. (2008) for a more detailed description of LPC in general and a theoretical justification of the weighted voting procedure in particular. We shall use label ranking techniques in order to realize our idea of preference-based approximate policy iteration, which is described in the next section.
5 Preference-based approximate policy iteration
In Sect. 3, we have shown that qualitative feedback based on preferences over trajectories may serve as a theoretical foundation of preference-based reinforcement learning. More specifically, our point of departure is a preference relation on trajectories: σ⊐σ′ indicates that trajectory σ is preferred to σ′, and we assume that preference information of that kind can be obtained from the environment. From preferences over trajectories, we then derived preferences over policies and preferences over actions given states.
Within this setting, different variants of preference-based reinforcement learning are conceivable. For example, tracing observed preferences on trajectories back to preferences on corresponding policies, it would be possible to search the space of policies directly or, more specifically, to train a ranking function that sorts policies according to their preference as in Akrour et al. (2011).
Here, we tackle the problem in a different way. Instead of learning preferences on policies directly, we seek to learn (local) preferences on actions given states. What is needed, therefore, is training information of the kind \(a \succ_{s} a'\), suggesting that in state s, action a is better than a′. Following the idea of approximate policy iteration (Sect. 5.1), we induce such preferences from preferences on trajectories via simulation (called “rollouts” later on): Taking s as an initial state, we systematically compare the trajectories generated by taking action a first (and following a given policy thereafter) with the trajectories generated by taking action a′ first; this is possible thanks to the preference relation ⊐ on trajectories that we assume to be given (Sect. 5.3).
The type of preference information thus produced, \(a \succ_{s} a'\), exactly corresponds to the type of training information assumed by a label ranker (Sect. 4.2). Indeed, our idea is to train a ranker of that kind, that is, a mapping from states to rankings over the set of actions (Sect. 5.2). This can be seen as a generalization of the original approach to approximate policy iteration, in which a classifier is trained that maps states to single actions. As will be shown later on, by replacing a classifier with a label ranker, our preference-based variant of approximate policy iteration enjoys a number of advantages compared to the original version.
5.1 Approximate policy iteration
Choosing the sampling procedure that generates the state sample S′ is not a trivial task, as discussed in Lagoudakis and Parr (2003) and Dimitrakakis and Lagoudakis (2008). Choices of procedures range from simple uniform sampling of the state space to sampling schemes incorporating domain-expert knowledge and other more sophisticated schemes. We emphasize that we do not contribute to this aspect and will use in this work rather simple sampling schemes, which will be described in detail in the experimental sections.
We should note some minor differences between the version presented in Algorithm 2 and the original formulation (Lagoudakis and Parr 2003). Most notably, the training set here is formed as a multiclass training set, whereas in Lagoudakis and Parr (2003) it was formed as a binary training set, learning a binary policy predicate \(\hat{\pi }:\, S \times A \rightarrow \{0,1\}\). We chose the more general multiclass representation because, as we will see in the following, it lends itself to an immediate generalization to a ranking scenario.
5.2 Preferencebased approximate policy iteration
Following our idea of preferencebased RL, we propose to train a label ranker instead of a classifier: Using the notation from Sect. 4.2 above, the instance space X is given by the state space S, and the set of labels Y corresponds to the set of actions A. Thus, the goal is to learn a mapping \(S \rightarrow \mathfrak{S}_{A}\), which maps a given state to a total order (permutation) of the available actions. In other words, the task of the learner is to learn a function that is able to rank all available actions in a state. The training information is provided in the form of binary action preferences of the form (s,a _{ k }≻a _{ j }), indicating that in state s, action a _{ k } is preferred to action a _{ j }.
Note that in practice, looping through all pairs of actions may not be necessary. For example, in our motivating chess example, preferences may only be available for a few action pairs per state, and it might be more appropriate to directly enumerate the preferences. Thus, the complexity of this algorithm is essentially \(O(|S'| \cdot d)\), where d is the average number of observed preferences per state (the constant number of iterations is hidden in the \(O(\cdot)\) notation). This complexity directly corresponds to the complexity of LPC as discussed in Sect. 4.3. If a total order of actions is observed in each state (as is the case in the study in Sect. 6), the complexity may become as bad as linear in the number of visited states and quadratic in the number of actions. However, if only a few preferences per state can be observed (as is, e.g., the case in our motivating chess example), the complexity is only linear in the number of visited states \(|S'|\) and essentially independent of the number of actions (e.g., if we only compare a single pair of actions for each state). Of course, in the latter case, we might need considerably more iterations to converge, but this cannot be solved with the choice of a different label ranking algorithm. Thus, we recommend the use of LPC for settings where very few action preferences are observed per state. Its use is not advisable for problems with very large action spaces, because the storage of a quadratic number of classifiers may become problematic in such cases.
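As a rough illustration of the pairwise loop discussed above, the following Python sketch collects action preferences over a state sample. It is not the paper's Algorithm 2: the function name `collect_preferences`, the callback `evaluate_preference` (a hypothetical stand-in for EvaluatePreference), and its return convention (+1, -1, or None) are assumptions made for this example only.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterable, List, Optional, Tuple

State = Hashable
Action = Hashable
# (s, preferred action, dispreferred action)
Preference = Tuple[State, Action, Action]


def collect_preferences(
    states: Iterable[State],
    actions: List[Action],
    evaluate_preference: Callable[[State, Action, Action], Optional[int]],
) -> List[Preference]:
    """Loop over all unordered action pairs in every sampled state.

    evaluate_preference(s, a_k, a_j) is a hypothetical callback: it returns
    +1 if a_k is preferred, -1 if a_j is preferred, and None (or 0) if no
    clear preference was observed for this pair.
    """
    prefs: List[Preference] = []
    for s in states:
        for a_k, a_j in combinations(actions, 2):
            outcome = evaluate_preference(s, a_k, a_j)
            if outcome == 1:
                prefs.append((s, a_k, a_j))
            elif outcome == -1:
                prefs.append((s, a_j, a_k))
            # otherwise: the pair is skipped, no training example is produced
    return prefs


# Toy usage: larger action index is always preferred (purely illustrative).
toy_prefs = collect_preferences(
    states=[0, 1],
    actions=[0, 1, 2],
    evaluate_preference=lambda s, a, b: 1 if a > b else -1,
)
```

With a full pairwise loop, the number of calls grows quadratically in the number of actions; if preferences are enumerated directly instead, the inner loop would be replaced by an iteration over the observed pairs only.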
5.3 Using rollout-based preferences
The key point of the algorithm is the implementation of EvaluatePreference, which determines the preference between two actions in a state. We follow the work on approximate policy iteration and choose a rollout-based approach. Recall the scenario described at the end of Sect. 5.1, where the agent has access to a generative model E, which takes a state s and an action a as input and returns a successor state s′ and the reward r(s,a). Lagoudakis and Parr (2003) use this setting for generating training examples via rollouts, i.e., by using the generative model and the current policy π for generating a training set \(\mathcal {E}\), which is in turn used for training a multiclass classifier that can be used as a policy. Instead of training a classifier on the optimal action in each state, we can use the PBPI algorithm to train a label ranker on all pairwise comparisons of actions in each state.
Note that generating rollout-based training information for PBPI is no more expensive than generating the training information for API, because in both cases all actions in a state have to be run for a number of iterations. On the contrary, we argue that from a training point of view, a key advantage of this approach is that pairwise preferences are much easier to elicit than examples for unique optimal actions. Our experiments in Sects. 6 and 7 utilize this in different ways.
The preference relation over actions can be derived from the rollouts in different ways. In particular, we can always reduce the conventional utility-based setting to this particular case by defining the preference \(a_k \succ_s a_j\) as “in state s, policy π gives a higher expected reward for \(a_k\) than for \(a_j\)”. This is the approach that we evaluate in Sect. 6.
An alternative approach is to count the successes of each of the two actions \(a_k\) and \(a_j\) in state s, and to perform a sign test for determining the overall preference. This approach makes it possible to omit the aggregation of utility values entirely and to aggregate only preferences. In preliminary experiments, we noted essentially the same qualitative results as those reported below. The results are, however, not directly comparable to the results of approximate policy iteration, because the sign test is more conservative than the t-test, which is used in API. However, the key advantage of this approach is that we can also use it in cases of non-numerical rewards. This is crucial for the experiments reported in Sect. 7, where we take this approach.
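The sign-test variant can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it uses an exact one-sided binomial test with ties dropped, and the function names and the default level `alpha=0.1` are chosen for this example.

```python
from math import comb


def sign_test_p_value(wins_k: int, wins_j: int) -> float:
    """Exact one-sided sign test.

    Returns P(X >= wins_k) for X ~ Binomial(n, 0.5), where n counts only the
    comparisons that were decided for one of the two actions (ties dropped).
    """
    n = wins_k + wins_j
    if n == 0:
        return 1.0
    return sum(comb(n, i) for i in range(wins_k, n + 1)) / 2 ** n


def preference_from_counts(wins_k: int, wins_j: int, alpha: float = 0.1) -> int:
    """Return +1 if a_k is preferred, -1 if a_j is preferred, 0 if undecided."""
    if sign_test_p_value(wins_k, wins_j) <= alpha:
        return 1
    if sign_test_p_value(wins_j, wins_k) <= alpha:
        return -1
    return 0
```

Because only win counts enter the test, the same routine applies whether a single comparison is decided by numerical returns or by a purely qualitative judgment.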
Section 6 demonstrates that a comparison of only two actions is less difficult than “proving” the optimality of one among a possibly large set of actions, and that, as a result, our preference-based approach better exploits the gathered training information. Indeed, the procedure proposed by Lagoudakis and Parr (2003) for forming training examples is very wasteful with this information. An example \((s, a^{*})\) is only generated if \(a^{*}\) is “provably” the best action among all candidates, namely if it is (significantly) better than all other actions in the given state. Otherwise, if this superiority is not confirmed by a statistical hypothesis test, all information about this state is ignored. In particular, no training examples would be generated in states where multiple actions are optimal, even if they are clearly better than all remaining actions.^{4} For the preference-based approach, on the other hand, it suffices if only two possible actions \(a_k\) and \(a_j\) yield a clear preference (either \(a_k \succ a_j\) or \(a_j \succ a_k\)) in order to obtain (partial) training information about that state. Note that a corresponding comparison may provide useful information even if both actions are suboptimal.
In Sect. 7, an example will be shown in which actions are not necessarily comparable, since the agent seeks to optimize multiple criteria at the same time (and is not willing to aggregate them into a one-dimensional target). In general, this means that, while at least some of the actions will still be comparable in a pairwise manner, a unique optimal action does not exist.
Regarding the type of prediction produced, it was already mentioned earlier that a ranking-based reinforcement learner can be seen as a reasonable compromise between the estimation of a numerical utility function (as in Q-learning) and a classification-based approach which provides only information about the optimal action in each state: the agent has enough information to determine the optimal action, but can also rely on the ranking in order to look for alternatives, for example to steer the exploration towards actions that are ranked higher. We will briefly return to this topic at the end of the next section. Before that, we will discuss the experimental setting in which we evaluate the utility of the additional ranking-based information.
6 Case study I: exploiting action preferences
Approximate Policy Iteration (API) generates one training example \((s, a^{*})\) if \(a^{*}\) is the best available action in s, i.e., if \(\tilde{Q}^{\pi }(s,\mathbf {a}^{*}) >_{L} \tilde{Q}^{\pi }(s,\mathbf {a})\) for all \(a \neq a^{*}\). If there is no action that is better than all alternatives, no training example is generated for this state.
Pairwise Approximate Policy Iteration (PAPI) works in the same way as API, but the underlying base learning algorithm is replaced with a label ranker. This means that each training example \((s, a^{*})\) of API is transformed into \(|A|-1\) training examples of the form \((s, a^{*} \succ a)\) for all \(a \neq a^{*}\).
Preference-Based Approximate Policy Iteration (PBPI) is trained on all available pairwise preferences, not only those involving the best action. Thus, whenever \(\tilde{Q}^{\pi }(s,\mathbf {a}_{k}) >_{L} \tilde{Q}^{\pi }(s,\mathbf {a}_{l})\) holds for a pair of actions \((a_k, a_l)\), PBPI generates a corresponding training example \((s, a_k \succ a_l)\).
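The difference between the three variants can be illustrated by the training examples each of them would extract from a single state. The sketch below is only an illustration: the predicate `better(a, b)` is a hypothetical stand-in for the significance relation \(>_L\) on the rollout estimates, and the fixed margin used in the toy example replaces the statistical test.

```python
from itertools import permutations
from typing import Callable, Hashable, List, Sequence, Tuple

State = Hashable
Action = Hashable
Better = Callable[[Action, Action], bool]


def api_examples(s: State, actions: Sequence[Action], better: Better) -> List[Tuple]:
    """API: one example (s, a*) only if a* is better than every other action."""
    for a in actions:
        if all(better(a, b) for b in actions if b != a):
            return [(s, a)]
    return []


def papi_examples(s: State, actions: Sequence[Action], better: Better) -> List[Tuple]:
    """PAPI: turn the API example into |A|-1 preferences (s, a* > a)."""
    return [
        (s, a_star, a)
        for (_, a_star) in api_examples(s, actions, better)
        for a in actions
        if a != a_star
    ]


def pbpi_examples(s: State, actions: Sequence[Action], better: Better) -> List[Tuple]:
    """PBPI: every ordered pair (a_k, a_l) for which a_k is better than a_l."""
    return [(s, a_k, a_l) for a_k, a_l in permutations(actions, 2) if better(a_k, a_l)]


# Toy state with two equally good actions and one clearly worse action.
q_hat = {0: 1.0, 1: 1.0, 2: 0.2}  # made-up rollout estimates
margin_better = lambda a, b: q_hat[a] - q_hat[b] > 0.5  # stand-in for >_L

api_out = api_examples("s", [0, 1, 2], margin_better)
papi_out = papi_examples("s", [0, 1, 2], margin_better)
pbpi_out = pbpi_examples("s", [0, 1, 2], margin_better)
```

In this toy state, API and PAPI produce no training information because no single action beats all others, whereas PBPI still records that actions 0 and 1 are both preferred to action 2.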
The problems we are going to tackle in this section still fall into the standard framework of reinforcement learning. Thus, rewards are numerical, trajectories are evaluated by accumulated rewards, and policies by expected cumulative discounted rewards. The reason is that, otherwise, it is not possible to compare with the original approach to approximate policy iteration. Nevertheless, as will be seen, preference-based learning from qualitative feedback can even be useful in this setting.
6.1 Application domains
Following Dimitrakakis and Lagoudakis (2008), we evaluated these variants on two well-known problems, inverted pendulum and mountain car. We will briefly recapitulate these tasks, which were used in their default setting, unless stated otherwise.
In the inverted pendulum problem, the task is to push or pull a cart so that it balances an upright pendulum. The available actions are to apply a force of fixed strength of 50 Newton to the left (−1), to the right (+1), or to apply no force at all (0). The mass of the pole is 2 kg and that of the cart is 9 kg. The pole has a length of 0.5 m and each time step is set to 0.1 seconds. Following Dimitrakakis and Lagoudakis (2008), we describe the state of the pendulum using only the angle and angular velocity of the pole, ignoring the position and the velocity of the cart. For each time step in which the pendulum is above the horizontal line, a reward of 1 was given, else 0. A policy was considered sufficient if it was able to balance the pendulum for longer than 1000 steps (100 sec). The random samples S in this setting were generated by simulating a uniform random number (max 100) of uniform random actions from the initial state (pole straight up, no velocity for cart and pole). If the pendulum fell within this sequence, the procedure was repeated.
In the mountain car problem, the task is to drive a car out of a steep valley. To do so, it has to repeatedly go up on each side of the hill, gaining momentum by going down and up to the other side, so that eventually it can get out. Again, the available actions are (−1) for left or backward, (+1) for right or forward, and (0) for a fixed level of throttle. The states or feature vectors consist of the horizontal position and the current velocity of the car. Here, the agent received a reward of −1 in each step until the goal was reached. A policy which needed fewer than 75 steps to reach the goal was considered sufficient. For this setting, the random samples S were generated by uniform sampling over valid horizontal positions (excluding the goal state) and valid velocities.
6.2 Experimental setup
In addition to these conventional formulations using three actions in each state, we also used versions of these problems with 5, 9, and 17 actions, because in these cases it becomes less and less likely that a unique best action can be found, and the benefit from being able to utilize information from states where no clear winner emerges increases. The range of the original action set {−1,0,1} was partitioned equidistantly into the given number of actions; e.g., using 5 actions, the set of action signals is {−1,−0.5,0,0.5,1}. Also, a uniform noise term in [−0.2,0.2] was added to the action signal, such that all state transitions are non-deterministic. For training the label ranker we used LPC (cf. Sect. 4.2), and for all three considered variants simple multilayer perceptrons (as implemented in the Weka machine learning library (Hall et al. 2009) with its default parameters) were used as the (base) learning algorithm. The discount factor for both settings was set to 1, and the maximal length of the trajectory was set to 1500 steps for the inverted pendulum task and to 1000 steps for the mountain car task. The policy iteration algorithms terminated if the learned policy was sufficient, if the policy performance decreased, or if the number of policy iterations reached 10. For the evaluation of the policy performance, 100 simulations beginning from the corresponding initial states were utilized. Furthermore, for statistical testing, unpaired t-tests assuming equal variance (homoscedastic t-tests) were used.

five numbers of state samples \(|S| \in \{10, 20, 50, 100, 200\}\),
five maximum numbers of rollouts \(K \in \{10, 20, 50, 100, 200\}\),
three levels of significance \(c \in \{0.025, 0.05, 0.1\}\).
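The equidistant partition of the action range and the uniform noise term described in this section can be sketched in a few lines of Python. The function names are hypothetical and the noise is added directly to the action signal, as stated in the text; how the simulator consumes the noisy signal is not modelled here.

```python
import random
from typing import List


def action_signals(n_actions: int) -> List[float]:
    """Partition the range [-1, 1] equidistantly into n_actions signals."""
    step = 2.0 / (n_actions - 1)
    return [-1.0 + i * step for i in range(n_actions)]


def noisy_signal(signal: float, rng: random.Random, noise: float = 0.2) -> float:
    """Add a uniform noise term in [-noise, noise] to the action signal."""
    return signal + rng.uniform(-noise, noise)


signals_5 = action_signals(5)
rng = random.Random(0)  # fixed seed only to make the illustration repeatable
perturbed = [noisy_signal(a, rng) for a in signals_5]
```

For 3, 5, 9, and 17 actions the step sizes are 1, 0.5, 0.25, and 0.125, so neighbouring actions become increasingly similar relative to the fixed noise range of ±0.2.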
6.3 Evaluation
Our prime evaluation measure is the success rate (SR), i.e., the percentage of learned sufficient policies. Following Dimitrakakis and Lagoudakis (2008), we plot a cumulative distribution of the success rates of all different parameter settings over a measure of learning complexity, where each point (x,y) indicates the minimum complexity x needed to reach a success rate of y. However, while Dimitrakakis and Lagoudakis (2008) simply use the number of rollouts (i.e., the number of sampled states) as a measure of learning complexity, we use the number of performed actions over all rollouts, which is a more fine-grained complexity measure. The two would coincide if all rollouts were performed a constant number of times. However, this is typically not the case, as some rollouts may stop earlier than others. Thus, we generated graphs by sorting all successful runs over all parameter settings (i.e., runs which yielded a sufficient policy) in increasing order of the number of applied actions and by plotting these runs along the x-axis with a y-value corresponding to their cumulative success rate. This visualization can be interpreted roughly as the development of the success rate as a function of the applied learning complexity.
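One way to compute such a curve is sketched below. This is an assumption-laden illustration: each run is represented as a pair (number of performed actions, success flag), and the cumulative success rate is normalized by the total number of runs, which the text does not state explicitly.

```python
from typing import List, Sequence, Tuple


def cumulative_success_curve(
    runs: Sequence[Tuple[int, bool]]
) -> List[Tuple[int, float]]:
    """Return points (x, y): y is the share of all runs that succeeded
    with at most x performed actions."""
    total = len(runs)
    successful_costs = sorted(cost for cost, ok in runs if ok)
    return [(cost, (i + 1) / total) for i, cost in enumerate(successful_costs)]


# Made-up runs: (number of performed actions, learned policy was sufficient)
example_runs = [(300, True), (100, True), (200, False), (400, True)]
curve = cumulative_success_curve(example_runs)
```

Unsuccessful runs do not add points to the curve but still count in the denominator, so the curve ends at the overall success rate.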
6.4 Complete state evaluations
Training information generation for API, PAPI and PBPI (IP: inverted pendulum, MC: mountain car). For each algorithm, we show the fraction of training states that could be used, as well as the fraction of the \(\frac{1}{2}\cdot|A|\cdot(|A|-1)\) possible preferences that could on average be generated from a training state. Note that the values for PAPI are only approximately true (the measures are taken from the API experiments and may differ slightly due to random effects)

| Domain | \(\lvert A\rvert\) | API/PAPI: States | API/PAPI: Preferences | PBPI: States | PBPI: Preferences |
|---|---|---|---|---|---|
| IP | 3 | 0.589±0.148 | 0.393±0.099 | 0.736±0.108 | 0.631±0.119 |
| IP | 5 | 0.400±0.162 | 0.160±0.065 | 0.759±0.101 | 0.581±0.128 |
| IP | 9 | 0.273±0.154 | 0.061±0.034 | 0.776±0.087 | 0.543±0.124 |
| MC | 3 | 0.316±0.218 | 0.211±0.145 | 0.453±0.245 | 0.349±0.229 |
| MC | 5 | 0.231±0.181 | 0.093±0.072 | 0.510±0.263 | 0.311±0.216 |
| MC | 9 | 0.149±0.125 | 0.033±0.028 | 0.539±0.273 | 0.281±0.201 |
However, even if we instead look at the fraction of possible preferences that could be used, we see that there is only a small decay for PBPI. This decay can be explained by the fact that the more actions we have in the mountain car and inverted pendulum problems, the more similar they are to each other and the more likely it is that we detect action pairs that have approximately the same quality and cannot be discriminated via rollouts. For API and PAPI, on the other hand, the decay in the number of generated preferences is the same as with the number of usable states, because each usable training state produces exactly \(|A|-1\) preferences.^{5} Thus, the fourth column differs from the third by a factor of \(|A|/2\).
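The factor relating the two API/PAPI columns follows directly from the counts given above; the short derivation below spells it out, writing \(f_{\text{states}}\) and \(f_{\text{prefs}}\) (names introduced here for illustration) for the fraction of usable states and the fraction of generated preferences:

\[
f_{\text{prefs}} \;=\; f_{\text{states}} \cdot \frac{|A|-1}{\frac{1}{2}\,|A|\,(|A|-1)} \;=\; \frac{2}{|A|}\, f_{\text{states}},
\qquad\text{equivalently}\qquad
f_{\text{states}} \;=\; \frac{|A|}{2}\, f_{\text{prefs}} .
\]

As a check against the reported values, for the inverted pendulum with \(|A| = 5\) we get \(0.400 \cdot \frac{2}{5} = 0.160\), and for the mountain car with \(|A| = 9\) we get \(0.149 \cdot \frac{2}{9} \approx 0.033\).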
6.5 Partial state evaluations
So far, based on the API strategy, we always evaluated all possible actions at each state, and generated preferences from their pairwise comparisons. A possible advantage of the preference-based approach is that it does not need to evaluate all options at a given state. In fact, one could select only two actions for a state and compare them via rollouts. While such a partial state evaluation will, in general, not be sufficient for generating a training example for API, it suffices to generate a training preference for PBPI. Thus, such a partial PBPI strategy also allows for considering a far greater number of states, using the same number of rollouts, at the expense that not all actions of each state will be explored. Such an approach may thus be considered to be orthogonal to recent approaches for rollout allocation strategies (cf. Sect. 8.2).
In order to investigate this effect, we also experimented with three partial variants of PBPI, which only differ in the number of states that they are allowed to visit. The first (PBPI-1) allows the partial PBPI variant to visit only the same total number of states as PBPI. The second (PBPI-2) adjusts the number of visited sample states by multiplying it with \(\frac{|A|}{2}\), to account for the fact that the partial variant performs only 2 action rollouts in each state, as opposed to \(|A|\) action rollouts for PBPI. Thus, the total number of action rollouts in PBPI and PBPI-2 is the same. Finally, for the third variant (PBPI-3), we assume that the number of preferences that are generated from each state is constant. While PBPI generates up to \(\frac{|A|(|A|-1)}{2}\) preferences from each visited state, partial PBPI generates only one preference per state, and is thus allowed to visit \(\frac{|A|(|A|-1)}{2}\) times as many states. These modifications were integrated into Algorithm 2 by adapting Line 5 to iterate only over two randomly chosen actions and changing the number of considered sample states \(|S|\) to the values described above.
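The state budgets of the three partial variants can be written down directly from this description. The helper below is a small sketch with a hypothetical function name; integer division is used for the second variant, which is exact whenever the number of sample states is even.

```python
from typing import Dict


def partial_pbpi_state_budgets(n_states: int, n_actions: int) -> Dict[str, int]:
    """Number of sample states each partial PBPI variant may visit."""
    n_pairs = n_actions * (n_actions - 1) // 2
    return {
        "PBPI-1": n_states,                   # same number of states as PBPI
        "PBPI-2": n_states * n_actions // 2,  # same number of action rollouts
        "PBPI-3": n_states * n_pairs,         # same maximal number of preferences
    }


budgets = partial_pbpi_state_budgets(n_states=100, n_actions=5)
```

For 100 sample states and 5 actions this gives 100, 250, and 1000 states, respectively, which shows how quickly the third budget grows with the size of the action set.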
Nevertheless, the results demonstrate that partial state evaluation is feasible. This may form the basis of novel algorithms for exploring the state space. We will briefly return to this issue in Sect. 8.2.
7 Case study II: learning from qualitative feedback
In a second experiment, we applied preference-based reinforcement learning to a simulation of optimal therapy design in cancer treatment, using a model that was recently proposed in Zhao et al. (2009). In this domain, it is arguably more natural to define preferences that induce a partial order between states than to define an artificial numerical reward function that induces a total order between states.
7.1 Cancer clinical trials domain
7.2 A preference-based approach
The problem is to learn an optimal treatment policy π mapping states (Y,X) to actions in the form of a dosage level D (recall that this is a number between 0 and 1). In Zhao et al. (2009), the authors tackle this problem by means of RL, and indeed obtained interesting results. However, using standard RL techniques, there is a need to define a numerical reward function depending on the tumor size, wellness, and possibly the death of a patient. More specifically, four threshold values and eight utility scores are needed, and the authors themselves notice that these quantities strongly influence the results.
We consider this as a key disadvantage of the approach, since in a medical context, a numerical function of that kind is extremely hard to specify and will always be subject to debate. Just to give a striking example, the authors defined a negative reward of −60 for the death of a patient, which, of course, is a rather arbitrary number. As an interesting alternative, we tackle the problem using a more qualitative approach.
7.3 Experimental setup
Action preferences are generated via the Pareto dominance relation (3) using rollouts. Essentially, this means that for each pair of actions \(a_k\) and \(a_j\) in a state s, we compute 10 pairs of trajectories and compare each pair in terms of the above preference relation, i.e., the first trajectory is preferred to the second one if the latter involves the death of the patient and the former does not or, in case the patient survives in both cases, we have a dominance relation in the sense of (3). Then, we generate a preference for \(a_k\) over \(a_j\) if the number of comparisons in favor of the former is significantly higher than the number of comparisons in favor of the latter (according to a simple sign test at significance level 0.1).
We use LPC and choose a linear classifier, logistic regression, as the base learner (again using the Weka implementation). The policy iteration stops when (i) the difference between two consecutive learned policies is smaller than a predefined threshold, or (ii) the number of policy iterations p reaches 10.
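The trajectory comparison described above can be sketched as follows. The representation is an assumption made for this example: each trajectory is summarized by a flag for the death of the patient and a tuple of criteria for which lower values are better, and strict Pareto dominance on these criteria stands in for relation (3), whose exact form is defined elsewhere in the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical trajectory summary: (patient_died, criteria), lower is better.
Trajectory = Tuple[bool, Sequence[float]]


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Pareto dominance: u is nowhere worse than v and strictly better once."""
    return all(x <= y for x, y in zip(u, v)) and any(x < y for x, y in zip(u, v))


def compare_trajectories(t1: Trajectory, t2: Trajectory) -> int:
    """Return +1 if t1 is preferred, -1 if t2 is preferred, 0 if incomparable."""
    died1, crit1 = t1
    died2, crit2 = t2
    if died1 != died2:
        return 1 if died2 else -1  # the surviving trajectory is preferred
    if died1:
        return 0  # patient dies in both trajectories: no preference
    if dominates(crit1, crit2):
        return 1
    if dominates(crit2, crit1):
        return -1
    return 0


def count_wins(pairs: List[Tuple[Trajectory, Trajectory]]) -> Tuple[int, int]:
    """Count comparisons in favor of the first and of the second action."""
    outcomes = [compare_trajectories(t1, t2) for t1, t2 in pairs]
    return outcomes.count(1), outcomes.count(-1)
```

The two counts returned by `count_wins` would then be passed to a sign test to decide whether a preference between the two actions is generated for the state.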
7.4 Results
Finally, we add the results for two other policies, namely the policy learned by our preference-based approach and a random policy, which, in each state, picks a dose level at random. Although these two policies are again both Pareto-optimal, it is interesting to note that our policy is outside the convex hull of the constant policies, whereas the random policy falls inside. Recalling the interpretation of the convex hull in terms of randomized strategies, this means that the random policy can be outperformed by a randomization of the constant policies, whereas our policy cannot.
8 Related work
In this section, we give an overview of existing work which is, in one way or another, related to the idea of preference-based reinforcement learning as introduced in this paper. In Sect. 8.1, we start with policy search approaches that use the reinforcement signal for directly modifying the policy. The preference-based approach of Akrour et al. (2011) may be viewed in this context. Preference-based policy iteration is, on the other hand, more closely related to approaches that use supervised learning algorithms for learning a policy (Sect. 8.2). In Sect. 8.3, we discuss the work of Maes (2009), which tackles a quite similar learning problem, albeit in a different context and with other goals in mind. Our work is also related to multi-objective reinforcement learning, because in both learning settings, trajectories may not be comparable (Sect. 8.4). Finally, we also discuss alternative approaches for incorporating external advice (Sect. 8.5) and for handling qualitative information (Sect. 8.6).
8.1 Preference-based policy search
Akrour et al. (2011) propose a framework that is quite similar to ours. In their architecture, the learning agent shows a set of policies to a domain expert who gives feedback in the form of pairwise preferences between the policies. This information is then used in order to learn to estimate the value of parametrized policies in a way that is consistent with the preferences provided by the expert. Based on the new estimates, the agent selects another set of policies for the expert, and the process is repeated until a termination criterion is met.
Thus, just like in our approach, the key idea of Akrour et al. (2011) is to combine preference learning and reinforcement learning, taking qualitative preferences between trajectories as a point of departure.^{7} What is different, however, is the type of (preference) learning problem that is eventually solved: Akrour et al. (2011) seek to directly learn a ranking function in the policy space from global preferences on the level of complete trajectories, whereas we propose to proceed from training information in the form of local preferences on actions in a given state. Correspondingly, they solve an object ranking problem, with objects given by parametrized policies, making use of standard learning-to-rank methods (Kamishima et al. 2011).
In a way, the approach of Akrour et al. (2011) may be viewed as a preference-based variant of policy search. Just as so-called policy gradient methods search for a good parameter setting in a space of parametrized policies by using the reinforcement signal to derive the direction into which the policy should be corrected, the above-mentioned approach uses a qualitative preference signal for driving the policy learner towards better policies. Numerical policy gradients can be computed in closed form from a parametrized policy (Ng and Jordan 2000), be estimated empirically from action samples of the policy (Williams 1992), or learned by regression (Kersting and Driessens 2008). In the actor-critic framework, where algorithms learn both the value function (the critic) and an explicit policy (the actor) simultaneously (Barto et al. 1983), the policy gradient can be estimated from the predicted values of the value function (Konda and Tsitsiklis 2003; Sutton et al. 2000). A particularly interesting policy gradient approach is the natural actor-critic (Peters and Schaal 2008a). The key idea of this approach is to fight the large variance in conventional gradient approaches by the use of the natural gradient, i.e., the gradient that does not assume that the parameter space is Euclidean but takes its structure into account (Amari 1998; Kakade 2001). Good surveys of current work in this area can be found in Bhatnagar et al. (2009) and Peters and Schaal (2008b).
Not all approaches to direct policy search use policy gradients. For example, Mannor et al. (2003) suggest the use of the cross-entropy method for finding an optimal policy. Other methods include EM-like methods (Peters and Schaal 2007; Kober and Peters 2011) and the generalized path integral control approach (Theodorou et al. 2010).
8.2 Supervised policy learning
The above-mentioned approaches use the (qualitative or quantitative) reinforcement signal to directly optimize the policy, whereas preference-based approximate policy iteration is closely related to approaches that use supervised learning algorithms to learn a policy. In particular, our work directly builds upon approximate policy iteration (Lagoudakis and Parr 2003), which had the goal of using modern classifiers to learn the policy. In general, the key idea of such approaches is to learn a policy in the form of a P-function, which directly maps states to optimal actions (Tadepalli et al. 2004). The P-function can be represented in commonly used concept representations such as relational decision trees (Džeroski et al. 2001), decision lists (Fern et al. 2006), support vector machines (Lagoudakis and Parr 2003), etc. As the P-function needs to capture less information than the Q-function (it does not need to perfectly fit the Q-values), the hope is that it leads to more compact representations and to faster convergence.
A key issue for such approaches is the strategy used for generating examples for the supervised learning algorithm. The original rollout strategy proposed by Lagoudakis and Parr (2003) is rather wasteful with respect to the performed number of rollouts, and several authors have tried to address this problem (Dimitrakakis and Lagoudakis 2008; Gabillon et al. 2010). First, the number of necessary rollouts in a state, which we assumed to be a constant number K, can be dynamically adjusted so that the rollouts are stopped as soon as a winning action clearly emerges. Hoeffding or Bernstein races can be used to determine the best action with a minimal number of rollouts (Heidrich-Meisner and Igel 2009), elimination algorithms iteratively remove the worst action until a single winner emerges (Even-Dar et al. 2003; Audibert et al. 2010), and the UCB algorithm, which trades off exploration and exploitation in multi-armed bandit problems (Auer et al. 2002), can also be adapted to this setting (Dimitrakakis and Lagoudakis 2008). Recently, Gabillon et al. (2011) proposed to improve finite-horizon rollout estimates by enhancing them with a critic which learns to estimate the value of the iterations beyond the horizon. All these approaches are, in a way, orthogonal to our approach, in that they all focus on the best action. In all of them, additional pairwise comparisons can emerge as a byproduct of sampling for the best action, and the experiments of Sect. 6 show that it can be beneficial to use them.
A second approach for optimizing the use of rollouts is to define a state exploration strategy. For example, it was suggested that a policy-based generation of states may be preferable to a random selection (Fern et al. 2006). While this may clearly lead to faster convergence in some domains, it may also fail to find optimal solutions in other cases (Lagoudakis and Parr 2003). Again, such approaches can be straightforwardly combined with our proposal. Moreover, preference-based approximate policy iteration provides a natural focus on the comparison between pairs of actions instead of sets of actions. This allows the use of fewer rollouts for getting some information (one preference) from a state, and thus makes it possible to move more quickly from state to state. For example, selecting a pair of actions and following the better one may be a simple but effective way of trading off exploration and exploitation for state sampling. Whether and how this additional flexibility can be used for more efficient exploration strategies is a subject of future work.
Finally, Lazaric et al. (2010) propose direct policy iteration, which improves approximate policy iteration (Lagoudakis and Parr 2003), upon which our work builds, by optimizing a loss function that is based on the difference between the rollout estimate of the action chosen by a policy and the maximum value obtainable by any action.
8.3 Reinforcement learning for structured output prediction
Maes (2009) solves a learning problem quite similar to ours, namely “policy learning as action ranking”, and even makes use of pairwise learning techniques. More specifically, he learns a function called “action-ranking function” that ranks actions given states.
The context and the purpose of the approach, however, are quite different. The idea is to tackle structured output prediction problems with reinforcement learning: instead of predicting a structured output (such as a sequence or a tree) right away, the output is constructed step by step. This stepwise construction is modeled as following a path in a properly defined state space and, thus, can be cast as a reinforcement learning problem.
Apart from this difference, Maes (2009) makes use of ranking methods (instead of regression learning of action-value functions) for other reasons. While still making use of quantitative information about the costs of actions in a given state, he is mainly interested in facilitating the learning process and making it more efficient.
8.4 Multi-objective reinforcement learning
In multi-objective reinforcement learning (MORL), the agent seeks to achieve two or more objectives at the same time, each with its own associated reward signal. Thus, unlike in standard RL, where the reward is a scalar, it is now a vector (Vamplew et al. 2010). Typically, the different objectives are in conflict with each other and cannot easily be optimized simultaneously. Instead, a policy must either optimize only one objective while neglecting the others, or try to find a trade-off between the conflicting criteria. What is sought, therefore, is a policy that is optimal in a Pareto sense.
MORL shares an important property with our approach to preference-based reinforcement learning, namely the fact that trajectories (policies) are not necessarily comparable with each other. On the other hand, the reward signals in MORL are still numerical, thus making the problem amenable to other types of learning algorithms. In other words, MORL can be seen as a special case of our general framework; as such, it can be tackled by specialized algorithms that are presumably more effective than general algorithms for preference-based reinforcement learning.
8.5 External advice and offpolicy learning
In addition to numerical reward functions, some authors have investigated ways for incorporating other forms of feedback, most notably external advice. For example, Maclin and Shavlik (1996) proposed an approach in which usergenerated advice in the form of rules is transferred to the same neural network used by the reinforcement learning agent for learning the Qfunction. Maclin et al. (2005) propose userprovided advice in the form of preferences over actions and use them for training a reinforcement learner via kernelbased regression. The constraints compare two actions, but are still quantitative in the sense of specifying a lower bound on the difference of the Qvalues of two actions. These constraints can then be directly incorporated into the kernelized optimization algorithm. This form of advice has later been adapted for transfer learning (Torrey et al. 2005). A good survey of transfer learning in RL and its relation to advicetaking is given by Taylor and Stone (2009).
An alternative technique is to encode advice in the form of a reasonable starting policy, which can then be used to generate training examples for a relational reinforcement learner (Driessens and Džeroski 2004). Such examples could, e.g., also be generated by a human controller. Auer et al. (1995) consider the multiarmed bandit problem in the presence of experts, each of which instructs the learner on a particular arm to pull. Langford and Zhang (2008) consider a more general case where any contextual information can be considered in addition to the reward. For example, in a news recommender system, the context could describe news articles and user information (Li et al. 2010). Another approach for advicetaking with reinforcement learning has been tried by Fürnkranz et al. (2000). The key idea of this paper is tuning the weights of several advicegivers instead of the weights of an evaluation function.
The above is related to off-policy learning, where the learner does not have to follow the policy it tries to optimize. In the simplest case, when the value function is simply represented as a lookup table, off-policy learning is well understood. In fact, Q-learning is an off-policy learner. In the general case, however, when the state space is so complex that the value function has to be approximated, off-policy learning is known to become unstable even for linear function approximators (Precup et al. 2001; Maei et al. 2010). Recently, Langford et al. (2008) discuss how policies can be evaluated using offline experience.
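The off-policy character of Q-learning in the lookup-table case can be illustrated with a minimal sketch. The toy chain environment, the function names, and all parameter values below are assumptions made for illustration only; the point is that the behaviour policy is uniformly random, while the update bootstraps from the greedy action in the successor state.

```python
import random


def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a hypothetical deterministic chain.

    States are 0..n_states-1; the last state is the goal. Action 0 moves
    left (bounded at state 0), action 1 moves right. Reaching the goal
    yields reward 1, every other transition yields reward 0.
    """
    rng = random.Random(seed)
    actions = (0, 1)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Behaviour policy: uniformly random, never greedy.
            a = rng.choice(actions)
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # Off-policy target: value of the greedy action in s_next,
            # regardless of which action the behaviour policy takes next.
            bootstrap = 0.0 if s_next == goal else gamma * max(q[s_next])
            q[s][a] += alpha * (r + bootstrap - q[s][a])
            s = s_next
    return q


def greedy_policy(q):
    """Extract the greedy action for every state from the Q-table."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]
```

Although the agent in this sketch never follows the greedy policy while collecting experience, the greedy policy extracted from the table moves right in every non-goal state.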
Preference-based reinforcement learning is also related to inverse reinforcement learning (Ng and Russell 2000; Abbeel and Ng 2010). In this framework, the idea is to learn a reward function from traces of a presumably optimal policy so that the reward is consistent with the observed policy traces. Preference-based reinforcement learning follows a similar goal, but weakens some of the assumptions: first, we do not assume that the learner has access to observations of an optimal policy but instead only require comparisons between several, possibly suboptimal, actions or trajectories. Second, the objective is not necessarily to learn a reward signal, but it suffices to learn a policy that is consistent with the observed trajectories. In that respect, the work is also related to early work in the area of behavioral cloning (Sammut 1996). Similarly, learning player behavior from game traces is an important topic in computer game research (Fürnkranz 2011).
8.6 Handling qualitative information
The exploitation of qualitative information in RL or, more generally, in MDPs, has received less attention so far. Bonet and Pearl (2002) propose a qualitative version of MDPs and POMDPs (Partially Observable MDPs) based on a so-called order-of-magnitude representation of transition probabilities and rewards. This approach is closely related to the work of Sabbadin (1999), who models uncertainty in terms of possibility instead of probability distributions. Epshteyn and DeJong (2006) present a framework which allows the expert to specify imprecise knowledge of transition probabilities in terms of stochastic dominance constraints.
Reyes et al. (2006) propose a method for learning qualitative MDPs. They argue that, in complex domains, it is not always possible to provide a reasonable state representation and transition model. Their approach is able to automatically produce a state abstraction and can learn a transition function over such abstracted states. A qualitative state, called q-state, is a group of states with similar properties and rewards.
There are also connections to some other fields of AI research dealing with qualitative knowledge. One such field is qualitative reasoning (Kuipers 1994), where the learning of qualitative models has also been considered (Bratko and Suc 2003; Zabkar et al. 2008), even though the focus here is less on control and more on the modeling and simulation of dynamical systems. Another field is qualitative decision making, in which qualitative variants of classical Bayesian decision theory and expected utility theory have been developed (Brafman and Tennenholtz 1997; Doyle and Thomason 1999; Dubois et al. 2003; Fargier and Sabbadin 2005).
9 Current and future work
The work reported in this paper provides a point of departure for extensions along several lines. First of all, there are several important theoretical questions regarding our formal framework in Sect. 3, many of which are relevant for algorithmic approaches to preference-based reinforcement learning. For example, to put our idea of learning preferences on actions given states on firm ground, it would be important to know under what conditions on the preference relation ⊐ (on trajectories) it is possible to guarantee that a globally optimal policy can be reconstructed from local preferences on actions.
While the setting assumed in approximate policy iteration is not uncommon in the literature, the existence of a generative model that can be used for performing rollout simulations is a strong prerequisite. In future work, we will therefore focus on generalizing our approach toward an online learning setting with on-policy updates. One of our goals is to develop a preference-based version of Q-learning. The key missing link to achieve this goal is a preference-based equivalent of the Bellman equation, which would allow information about action preferences to be transferred from one state to another.
A first step in that direction is to define and evaluate a ranking-based exploration strategy. The results on partial state evaluations (Sect. 6.5) indicate that an exploration strategy that is based on picking the better one of a pair of actions may be an interesting approach to try. It seems clear that a ranking provides more information than uninformed exploration strategies such as ϵ-greedy, and we believe that the loss of information that we suffer from only having a ranking instead of expected utilities or action probabilities is only minor. This, however, needs to be properly evaluated.
We are currently working on applying preference-based reinforcement learning to the chess domain that we have used to motivate our framework in Sect. 2.3. The main obstacle that we have to overcome here is that preference-based policy iteration as proposed in this paper is for online learning, whereas the preferences that we want to learn from are only available offline. An interesting way of solving this problem could be an integration of offline advice with online experience. This can be easily done in Algorithm 3 by merging preferences from offline data with preferences that have been generated via rollout analysis with the current policy. A problem here is that rollout analysis with imperfect policies tends to be considerably less reliable for chess than for games like Go (Ramanujan et al. 2010; Arenz 2012), which is one of the motivations for resorting to game annotations in this domain. We are also aiming at lifting inverse reinforcement learning (Ng and Russell 2000; Abbeel and Ng 2010) to a preference-based formulation.
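The merging step mentioned above for Algorithm 3 could look roughly as follows. This is only a sketch under assumptions: the representation of a preference as a triple (state, better action, worse action) and the name `merge_preferences` are hypothetical, and the treatment of contradictory preferences between the two sources is deliberately left open.

```python
from typing import Hashable, Iterable, List, Tuple

# Hypothetical encoding: (state, better_action, worse_action).
Preference = Tuple[Hashable, Hashable, Hashable]


def merge_preferences(
    offline: Iterable[Preference],
    rollout: Iterable[Preference],
) -> List[Preference]:
    """Pool two preference sources into one training set.

    Exact duplicates are dropped; the order of first occurrence is kept.
    Contradictory preferences (a over b and b over a in the same state)
    are NOT resolved here and are passed on to the learner as they are.
    """
    merged: List[Preference] = []
    seen = set()
    for pref in list(offline) + list(rollout):
        if pref not in seen:
            seen.add(pref)
            merged.append(pref)
    return merged
```

The pooled set would then serve as training data for the preference learner in the same way as the rollout-generated preferences alone.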
10 Conclusions
The main contribution of this work is a framework for preference-based reinforcement learning, which allows for lifting RL into a qualitative setting, where reward is not available on an absolute, numerical scale. Instead, comparative reward functions can be used to decide which of two actions is preferable in a given state, or, more generally, which of two trajectories is preferable.
To cope with this type of training information, we proposed an algorithm for preference-based policy iteration, which only depends on the availability of preference information between two actions. As a proof of concept, we instantiated this algorithm into a version of approximate policy iteration, where the preference information is determined via rollouts. Whereas the original algorithm essentially reduces reinforcement learning to classification, we tackle the problem by means of a preference learning method called label ranking. In this setting, a policy is represented by a ranking function that maps states to total orders of all available actions.
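The notion of a policy as a ranking function can be sketched as follows. The pairwise voting scheme shown here is one common way of turning pairwise preferences into a total order and is an assumption of this illustration, as is the hypothetical model `prefers(state, a, b)`, which is taken to return a confidence in [0, 1] that action a is better than action b.

```python
from itertools import combinations


def rank_actions(state, actions, prefers):
    """Return the actions sorted from most to least preferred in `state`.

    `prefers(state, a, b)` is a hypothetical learned pairwise model that
    returns a number in [0, 1]: the confidence that a is better than b.
    """
    votes = {a: 0.0 for a in actions}
    for a, b in combinations(actions, 2):
        p = prefers(state, a, b)
        votes[a] += p
        votes[b] += 1.0 - p
    # sorted() is stable, so tied actions keep their order in `actions`.
    return sorted(actions, key=lambda a: -votes[a])


def policy(state, actions, prefers):
    """Act on the ranking by choosing the top-ranked action."""
    return rank_actions(state, actions, prefers)[0]
```

Unlike a classifier that outputs only the presumably best action, the ranking also retains the order of the remaining actions, which is the additional information exploited in the first case study.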
To demonstrate the feasibility of this approach, we performed two case studies. In the first study, we showed that additional training information about lower-ranked actions can be successfully used for improving the learned policies. The second case study demonstrated one of the key advantages of a qualitative policy iteration approach, namely that a comparison of pairs of actions is often more feasible than the quantitative evaluation of single actions.
In the original formulation as a binary problem, it is still possible to produce negative examples, which indicate that the given action is certainly not the best action (because it was significantly worse than the best action).
Strictly speaking, only the best action for each state is generated and used within API, but for the sake of comparison in terms of preferences this directly relates to |A|−1 preferences involving the best action.
We exclude the value 0, as it is common practice to let the patient keep receiving a certain level of the chemotherapy agent during the treatment in order to prevent the tumor from relapsing.
Acknowledgements
We would like to thank the anonymous reviewers for their careful reading of our manuscript and many suggestions that helped to improve the paper. We also thank Robert Busa-Fekete for several useful comments and discussions, Jan Peters for advice on policy search, and the Frankfurt Center for Scientific Computing for providing computational resources. This research was supported by the German Science Foundation (DFG).