Abstract
Realworld sequential decisionmaking tasks are generally complex, requiring tradeoffs between multiple, often conflicting, objectives. Despite this, the majority of research in reinforcement learning and decisiontheoretic planning either assumes only a single objective, or that multiple objectives can be adequately handled via a simple linear combination. Such approaches may oversimplify the underlying problem and hence produce suboptimal results. This paper serves as a guide to the application of multiobjective methods to difficult problems, and is aimed at researchers who are already familiar with singleobjective reinforcement learning and planning methods who wish to adopt a multiobjective perspective on their research, as well as practitioners who encounter multiobjective decision problems in practice. It identifies the factors that may influence the nature of the desired solution, and illustrates by example how these influence the design of multiobjective decisionmaking systems for complex problems.
1 Introduction
In most realworld decision problems, we care about more than one aspect. For example, if we have a water reservoir with a hydroelectric power plant we may care about maximising energy production, while minimising irrigation deficits as well as minimising the risk of flooding [19, 122, 133]. In medical treatment, we may want to maximise the effectiveness of the treatment, while minimising a variety of side effects [69, 77, 86]. In other words, most realworld decision problems are inherently multiobjective.
While most decision problems actually have multiple objectives, most algorithms dealing with agents that need to interact with sequential decision problems focus on optimising a single objective [163]. To nonetheless deal with the multiple objectives of the real world, a common approach to creating decisiontheoretic agents is to combine all the important aspects together into a single, scalar, additive reward function. This typically involves an iterative process of assigning numerical rewards or penalties to events that can occur in the environment. For example, in the water reservoir setting, we may put a large penalty on a flood occurring, a positive reward on the power output for each time step, and a negative reward for each time step in which the irrigation demand is not met. Then, the singleobjective planning or learning agent is turned on, the resulting policy observed, and then the reward function is reengineered if the behaviour is not satisfactory. This iterative process is then repeated until the behaviour is acceptable to the designer. We argue that this workflow is problematic for several reasons, which we will discuss in detail one by one: (a) it is a semiblind manual process, (b) it prevents people who should take the decisions from making wellinformed tradeoffs, putting an undue burden on engineers to understand the decisionproblem at hand, (c) it damages the explainability of the decisionmaking process, and (d) it cannot handle all types of preferences that users and human decision makers might actually have. Finally, (e) preferences between objectives may change over time, and a singleobjective agent will have to be retrained or updated when this happens.
Firstly (a), if we engineer a scalar reward function through an iterative process until we reach acceptable behaviour, we try out multiple reward functions, each of which is a scalarisation of the actual objectives. However, we do not systematically inspect all possible reward functions. In other words, we may meet our minimal threshold for acceptable behaviours, but we only observe a subset of all possible scalarisations. Therefore, although an acceptable solution may be found, it can be arbitrarily far away from optimal utility – the use we would have received if we could have systematically examined all possible solutions. This automatically brings us to the second point (b). As the reward function is something that needs to be engineered beforehand, we are only guessing as to the effects this might have on the policy. For example, when trying to train an agent in a power production system, we may wish to double the average power output. However, even if the objectives are linearly weighted in the reward function, it is not as simple as just doubling the reward associated with the power output aspect of performance, as the relationship between the reward weights and the actual objective outcomes may well be nonlinear [184]. If, on the other hand, we would be able to inspect all possibly optimal policies – and their values offering different tradeoffs between the objectives – we could have decided in a wellinformed manner on the outcomes, rather than making an educated guess at the scalarisation a priori. This educated guessing is also putting decision power where it does not belong: with the engineers. When an engineer creates a scalar reward function, they are simultaneously making assumptions about the preferences of the actual decision makers (e.g., a government in case of the water reservoir) and making guesses about the behavioural changes resulting from changes to the scalar reward function. This is not a responsibility that can be left to AI engineers – at least not in decision problems that are of significant importance.
We also note an iterative process of trying out different reward functions potentially has a large, but hidden, cost in terms of samplecomplexity and computation time. However, this is typically not reported in the final research paper. We therefore argue that using a multiobjective method straight from the start can in fact save computation time and may have a lower overall samplecomplexity. This is especially so, as multiobjective algorithms can exploit the fact that multiple policies need to be produced in order to reduce computation time [140] and samplecomplexity [4] explicitly.
Another issue with scalar reward functions is the lack of (a posteriori) explainability (c). If we ask “why did the robot collide with and destroy the vase?”, we could try to input an alternative decision, such as swerving away from the vase. An agent with a single allencompassing objective that has learnt a scalar value function will then, for example, tell us there was a 3.451 reduction in value for this other policy, which provides little insight.
If instead, the agent could have told us that in the objective of damage to property the probability of damaging the vase would have dropped to practically 0, but the probability of running into the family dog increased by \(0.5\%\) (a different objective), this would give us insight into what went wrong. We may also disagree for different reasons: we may think that the agent has overestimated the risk of colliding with the dog, which would be an error in the valueestimate for that objective. We might also think that a \(0.5\%\) increase in the likelihood of bumping into to the dog would be so small that it would have been acceptable – especially if the robot bumping into the dog would probably have been an inconvenience for the dog, but not an actual danger to it – if the robot could have definitely avoided destroying the vase. This would have been an error in the utility function we assign to different outcomes. In other words, not taking an explicitly multiobjective approach can rob us of essential information that we might need to evaluate or understand our agents.
Furthermore (d), not all human preferences can be handled by scalar additive reward functions [144]. When a user’s preferences ought to be modelled with a nonlinear rather than a linear utility function, a priori scalarisation becomes mathematically impossible within many reinforcement learning frameworks, as scalarisation would break the additivity of the reward function. For some domains, this might still be acceptable, as the resulting loss of optimality may not have a major impact. However, in important domains where ethical or moral issues become apparent, singleobjective approaches require explicitly combining these factors together with other objectives (such as economic outcomes) in a way that may be unacceptable to many people [191]. Similarly, designing singleobjective rewards may be difficult or impossible for scenarios where we wish to ensure fair or equitable outcomes for multiple participants [157, 177].
Finally (e), humans are known to change their minds from time to time. Therefore, preferences between tradeoffs in the different objectives may well change over time. An explicitly multiobjective system can train agents to be able to handle such preference changes, thereby preempting the need to discover a new policy whenever such changes occur. This increases the applicability of multiobjective decisionmaking agents, as agents do not need to be taken out of operation to be updated and they can simply switch policy to match the new user preferences. We note that this type of change is different from the issue of nonstationary dynamics of the problem which can occur in both singleobjective and multiobjective problems; here the multiobjective Markov decision process (Sect. 3) itself is stationary, but the external preferences have changed.
An insight into the difference between singleobjective and multiobjective approaches to an application can be gained by comparing two different studies applying RL to wet clutch engagement [17, 187]. The task is to control the piston in a wet clutch so as to produce a fast and smooth engagement, by minimising both the time to engagement and the torque loss. The initial study uses a scalar reward with discounting which implicitly captures both aspects of the desired behaviour and achieves acceptable results [187]. However, the subsequent study examines the policies arising from several different utility functions and parameterisations of those functions and demonstrates that some of these are superior to those reported in the original work [17].
By now, we hope we have convinced you, the reader, that taking an explicitly multiobjective approach to planning and learning may be essential to deploying AI in decision problems. To provide further motivation, as well as showcase some difficulties that can arise when modelling problems with multiple objectives, we will provide examples of such multiobjective decision problems in Sect. 2. We then proceed with formalising multiobjective problems (Sect. 3) and recommend an approach to systematically deal with multiobjective decision problems that puts the user’s utility frontandcentre throughout the entire process (Sect. 4). In Sect. 5 we outline which factors should be taken into account in the process from identifying a multiobjective decision problem to deploying a policy for it in practice. We describe the effects of these factors on both this process and on the solution concepts. We then proceed to describe the relationships between multiobjective decision problems and other known decision problems (Sect. 6), and briefly survey both algorithmic approaches (Sect. 7) and the metrics for evaluating the solutions produced by these algorithms (Sect. 8). To help researchers get started in the field, we include a workedout example of a multiobjective decision problem, a water management problem with multiple objectives, in Sect. 9, furthermore, we added a Jupyter notebook [74] with these workedout examples as supplementary material. Finally, we conclude the article and discuss open research challenges in Sect. 10.
Our purpose with this article is to provide an introduction to multiobjective decision making and guide the reader through getting started with modelling and solving such decision problems. This article differs from existing surveys in the literature that aim to provide a comprehensive overview of methods and theory, in that it is designed to be a guide for practitioners and researchers, highlighting the issues that need to be considered and addressed when applying multiobjective agents to practical problems. As a followon reading, we recommend the more technical survey provided by Roijers, Vamplew, Whiteson and Dazeley [144].
2 Motivating examples of modelling complex problems with multiobjective approaches
This section presents examples of complex decisionmaking situations where multiobjective approaches play a role. These examples motivate some of the aspects discussed in later sections.
2.1 Planning a journey
Consider you need to travel from your house to a given destination. Deciding on the modes of transportation along your trip typically involves a number of objectives, such as minimising travel time and cost whereas maximising comfort and reliability [92, 112, 130, 131]. For instance, car trips may be faster and more comfortable than subway ones, at the cost of being more expensive and less reliable (at least in cities that easily get congested due to e.g. an accident). Likewise, planning a journey also involves sequential decisions that need to be made along the trip. For instance, if your trip relies on multiple transportation modes (e.g., subway, bus, bike, or even walking), you may need to promptly switch to another mode when facing delays or malfunctions during the journey. Moreover, given the competitive nature of traffic, your objectives are usually affected by other users, which increases the uncertainties associated with your decision. In spite of such uncertainties, if you can express your preferences over these different objectives as a linear combination, then you can make your decision using conventional optimisation approaches. However, if (as is often the case) you cannot articulate your preferences explicitly in a single formula, or you actually can, but this formula is nonlinear, then you have a genuine multiobjective problem, which requires a multiobjective approach (see details in Sect. 5.3).
In order to select the best multiobjective approach, different factors come into play. If you execute this journey every day, you might be interested in balancing your objectives on average over a longer period. However, your intention might also be to balance the objectives during each of the single journeys, which would require a different approach. Both views would result in one policy that tells you how to plan your journey. Nonetheless, at some occasions, you might want to balance the objectives differently because you have an important meeting or you have someone accompanying you on your journey. If you want to be prepared for this, you can apply a method that provides you with a variety of policies, each of which is optimising a different combination of the objectives involved. In this situation, you could easily adjust each single trip based on your current needs. In contrast, conventional optimisation approaches would need to recompute the policy from scratch in order to handle such changes.
2.2 Water management
Water reservoir operations need to handle multiple competing objectives related to significant socioeconomic impacts [20]. By regulating a system of dam gates placed at the outlet of a lake you can modulate the water release and the level of the lake throughout the year. On the one hand, you will need to supply water to downstream users to meet their agricultural needs. To achieve this, you need to store water during the winter and spring in order to release it during the irrigation season. On the other hand, stakeholders on the shores are interested in keeping the lake level within a certain range to avoid floods and support recreational activities or environmental services. Increasing the lake storage to avoid irrigation deficits means increasing the risk of flooding and therefore some compromise needs to be established. The regulation problem is complicated by the presence of other objectives that interact with the two above: hydropower production, flood mitigation for downstream users, lake navigability, and many others [19, 122, 133]. A multiobjective analysis is a fundamental tool for the human operator and for the representatives of the various stakeholders to properly evaluate the possible tradeoffs among the several conflicting objectives and to support their decisions.
2.3 Military purchasing
The manufacture and purchasing of military equipment requires long term dynamic planning [108]. Each type of equipment takes a varying degree of time to manufacture. For example, a truck may only need a week while a submarine may need more than ten years. Furthermore, the time and cost in setting up a manufacturing pipeline will require items to be produced in larger numbers. Governments need to make decisions now based on a prediction of the types of environments and operations they expect to deploy forces to in the future. These environments typically require unique combinations of equipment to maximise their ability to achieve the outcomes required. Determining this optimum combination of equipment required for operations ten to fifteen years into the future is a multiobjective planning problem – weighingup various factors such as the cost, effectiveness, versatility, and protection provided to personnel. In practice, this becomes a problem with many objectives, when considering details such as the selected features of each piece of equipment. For instance, [13] discusses some seventeen objectives (related to survivability, lethality, mobility, and knowledge) to be considered when simply purchasing a single tank.
Furthermore, in the real world, any initial decision made must also be constantly altered over subsequent years. These alterations may be instigated by a change in: government; national priorities; international dynamics; technology; expected operational environments; and, types of operations. No government can make a decision now and expect it is still optimal in fifteen years. Therefore, new plans are developed periodically that align with new predictions. These new predictions can be represented as selecting a new policy from a precomputed set of solutions. However, governments must be very careful about when to continue; when to cancel; and when to switch manufacturing and purchasing orders. Changing policy directions can incur substantial financial penalties due to rampup and down costs; create periods of unbalanced forces during the switching period; require extra personnel training costs; etc. Therefore, a solution to this situation must be able to ensure that an optimal policy is maintained across objectives during the process of changing from one policy to another. This type of dynamic planning situation across multi and manyobjective problems and over long periods of sequential decisions occurs frequently in realworld strategic decision making domains, such as government, business, energy production, manufacturing, etc. Hence, the development of robust solutions could support many decision makers.
2.4 Wind farm control
The design of traditional wind turbine control systems is typically focused on two objectives. On the one hand, a wind turbine needs to maximise its power production. On the other hand, maximising power output leads to higher fatigue loads (i.e., the stress induced on the turbine’s components), and thus impacts their overall lifetime. Therefore, a tradeoff needs to be made between power output and accumulated damage.
Singleturbine control and design has been wellexplored in the literature [3, 99]. However, as multiple wind turbines are often geographically grouped into wind farms to reduce capital costs, the turbines become dependent on each other due to the wake effect. This effect occurs when upstream turbines extract energy from wind, leaving a cone of reduced available wind for downstream turbines, harming their productivity. This phenomenon, combined with frequently changing wind conditions, makes it challenging to ensure a stable power production with respect to the power demand provided by the grid operator. One method to tackle this issue is through active power control, in which the power production of the turbines is regulated and potentially reduced in order to meet the power demand [5, 188]. Still, it is important to reduce the power production of potentially damaged turbines (or even shut them down) to prevent unnecessary loads on the turbine components. To tackle the nonlinearities and complexities that originate from the wake effect, as well as the multidimensional load spectrum inherent to wind turbine technology, the use of datadriven optimisation methods is necessary to yield optimal wind farm control strategies.
Wind farm control is a sequential decision making problem. For example, during a storm, it is important to predict when and how the front passes through a particular turbine, in order to maximise its operation time, while minimising the probability of emergency braking leading to severe loading [189]. Moreover, to ensure stability of the energy injected in the electricity grid, turbines need to coordinate farmwide to restrict the gradient at which the farm’s power production is reduced over time. Therefore, it is necessary to take timedependent control actions based on the state of the environment, e.g., weather conditions and turbine states.
Finding a good balance between power production and loads is challenging. The link between control actions, the highdimensional load spectrum and future costs is still an open problem [189]. Therefore, although the relationship between control actions and maintenance costs is expected to be complex, a linear scalarisation of power production and loads is often employed (e.g., [180]), where the parameters are decided based on the expertise of operators. Preferably, the operators should receive a set of alternative control strategies to investigate, covering the entire spectrum of objectives ranging from loadfocused to powerfocused strategies.
2.5 Other topics
In addition to the motivating examples discussed above, recent years have seen multiobjective learning and planning methods applied across a wide range of problem domains including: distributed computing [27, 124], drug and molecule design [62, 214], cybersecurity [162], simulation [132], job shop scheduling [98], cognitive radio networks [100, 129], satellite communications [45, 63], recommender systems [78], power systems [34, 35, 97, 193], building management [213], traffic management [70], manufacturing [36, 54, 80], bidding and pricing [76, 207], education [151], and robotics [64]. The scope and variety of these applications supports our assertion that many important problems involve multiple objectives, and are best addressed using explicitly multiobjective methods.
3 Problem setting
First, let us introduce the basic multiobjective sequential decision problem. We formalise this as a multiobjective Markov decision process (MOMDP). We note that more complex models exist, such as a multiobjective partially observable Markov decision process [110, 160, 161, 202] and multiobjective multiagent systems [126]. However, the MOMDP formalisation allows us to study many relevant aspects of multiobjective decision making problems, while also being simple to understand. We therefore use it as the basis for this article. In this section we will restrict discussion to singleagent MOMDPs and defer discussion of the more complex multiagent situation until Sect. 7.2.6.
A MOMDP is represented by the tuple \(\langle S, A, T, \gamma , \mu , {\mathbf {R}} \rangle\), where:

S is the state space

A is the action space

\(T :S \times A \times S \rightarrow \left[ 0, 1 \right]\) is a probabilistic transition function

\(\gamma \in [0, 1)\) is a discount factor

\(\mu :S \rightarrow [0,1]\) is a probability distribution over initial states

\({\mathbf {R}} :S \times A \times S \rightarrow {\mathbb {R}}^d\) is a vectorvalued reward function, specifying the immediate reward for each of the considered \(d\ge 2\) objectives
The crucial difference between a singleobjective MDP and a MOMDP is the vectorvalued reward function \({\mathbf {R}}\), which expresses a numeric feedback signal for each of the considered objectives. This means that the length of the reward vector is equal to the number of objectives.
Like singleobjective MDPs, the state and action sets can in principle be discrete and finite. However, in many realworld problems the statespace is infinite. This happens as soon as some of the state variables describing a state—such as the water levels in a lake (Sect. 2.2)—are continuous. Moreover, even if the state space is discrete, it often is too large to enumerate as states may be described using images, e.g., cameras in an autonomous car. The actionspace can also be infinite in size. For example, in wind farm control (see Sect. 2.4), actions correspond to a specific rotor orientation with respect to the incoming wind direction. This again is a continuous value. Infinite state and actionspaces make the problem considerably harder, and necessitate the use of function approximators to estimate policies and their (vector) values.
3.1 Policies and value functions
In MOMDPs, the agent behaves according to a policy \(\pi \in \Pi\), where \(\Pi\) is the set of all possible (and allowed) policies. A policy is a mapping \({\pi }: S \times A \rightarrow \left[ 0, 1 \right]\), i.e., for any given state, an action is selected according to a certain probability distribution.
The value function of a policy \(\pi\) in a MOMDP is defined as:
where \({\mathbf {r}}_{k+1} = {\mathbf {R}}(s_k, {a}_{k}, s_{k+1})\) is the reward received at timestep \(k+1\). In contrast to singleobjective MDPs, the value function is also vectorvalued, \(\varvec{V}^{\pi } \in {\mathbb {R}}^d\). We can also define the value of a state s, for any timestep t, when \(s_t=s\):
In singleobjective settings, the value functions offer a complete ordering over the policy space, i.e., for any two policies \(\pi\) and \(\pi '\), \(V^{\pi }(s)\) will either be greater than, equal to, or lower than \(V^{\pi '}(s)\). This implies that finding the optimal policy \(\pi ^*\) is equivalent to maximising the expected cumulative discounted reward. For a MOMDP this is not necessarily the case.
If we have access to a utility function (also called a scalarisation function in the literature) \(u:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\), mapping the multiobjective value of a policy to a scalar value,
then this would give us a total ordering over policies and reduce the MOMDP to a singleobjective decision making problem. This however, is not always possible, feasible, or desirable as motivated in the introduction. We illustrate this further in Sect. 5.1.
Thus, when dealing with multiobjective value functions (Equation (2)), it is possible to encounter a situation in which \(V_i^{\pi } > V_i^{\pi '}\) for objective i, while \(V_j^{\pi } < V_j^{\pi '}\) for objective j. As a consequence, in MOMDPs, value functions only allow for a partial ordering over the policy space, so determining the optimal policy is no longer possible without additional information on how to consider or prioritise the objectives to order the policies.
Notice that the formulation of policies described in this section only allows for stationary policies, i.e., we condition only on the current state. While this may be sufficient for fullyobservable, singleobjective MDPs, White [197] demonstrate that for multiobjective tasks it may be beneficial to allow policies to be nonstationary with respect to the current state (i.e., also conditioned on other variables, such as the sum of previously received rewards).
3.2 Solution sets
In singleobjective RL problems, there exist a unique optimal value \(V^*\), and there can be multiple optimal policies \(\pi ^*\) that all have this value. The goal in singleobjective RL is typically to learn an optimal policy.
In the multiobjective case however, without any additional information about the user’s utility, there can now be multiple possibly optimal value vectors \({\mathbf {V}}\). We therefore need to reason about sets of possibly optimal value vectors and policies when thinking about solutions to MORL problems. In the following, we introduce several useful definitions for possibly optimal policies and values. We start by defining the most general set of solutions, i.e., the undominated set. This is the set of policies and associated value vectors for which there is at least one utility function for which this policy is optimal (i.e., there is no other policy for this utility function that has strictly higher utility).
The concepts introduced in this section are defined in terms of policies. However, as each policy \(\pi\) has an associated value vector \({\mathbf {V}}^\pi\), throughout the survey we often relate value vectors to these concepts when the context is clear.
Definition 1
The undominated set, \(U(\Pi )\), is the subset of all possible policies \(\Pi\) and associated value vectors for which there exists a possible utility function u with a maximal scalarised value:
However, the undominated set may well contain excess policies. That is, policies that are optimal for a given (set of) utility function(s), but where other policies exist that have optimal utility for that/those utility function(s) as well. In that case, we do not need to retain all policies to retain optimal utility.
Definition 2
A set \(CS(\Pi )\) is a coverage set if it is a subset of \(U(\Pi )\) and if, for every u, it contains a policy with maximal scalarised value, i.e.,
We note that it is desirable to make a CS as small as possible. However, depending on which utility functions are allowed, determining whether a CS is a minimallysized CS may be a computationally hard problem in itself.
As mentioned, there generally does not exist a total ordering over the values of possible policies in a MORL problem. We can, however, again reason about sets of possibly optimal policy values. Firstly, we note that in multiobjective decision making (including MORL), the utility function, u, can be assumed to be monotonically increasing in all objectives.
Definition 3
A monotonically increasing utility function, u adheres to the constraint that if a policy increases for one or more of its objectives without decreasing any of the objectives, the scalarised value also increases:
A monotonically increasing utility function is able to represent both linear (with nonzero positive weights) and nonlinear user preferences. Monotonicity in the utility function is a minimal assumption for MORL, as it corresponds to the definition of an objective: we always want more value in any of the objectives.^{Footnote 1} For this most general case where u is any (potentially unknown) monotonically increasing utility function (i.e., including nonlinear functions), we define the set of undominated values as follows.
Definition 4
If the utility function u is any monotonically increasing function, then the Pareto Front (PF) is the undominated set [144]:
where \(\succ _{P}\) is the Pareto dominance relation,
In words, the Pareto Front is the set of nondominated policies: for each policy in the Pareto Front, there exists no other policy with value that is equal or better in all objectives. Note that the definition of Pareto dominance corresponds exactly to the definition of monotonically increasing value functions (Definition 3).
Note that for the Pareto front this means we only need to retain one of the policies that have the same value vector. A set of policies whose value functions correspond to the PF is called a Pareto Coverage Set (PCS).
If the (a priori unknown) utility function is a positivelyweighted linear sum, then the undominated set will be the policies corresponding to the convex hull (CH) of value functions \(V^{\pi }\).
Definition 5
A linear utility function computes the inner product of a weight vector \(\mathbf{w}\) and a value vector \(\mathbf{V}^\pi\)
Each element of \(\mathbf{w}\) specifies how much one unit of value for the corresponding objective contributes to the scalarised value. The elements of the weight vector \(\mathbf{w}\) are all positive real numbers and constrained to sum to 1.
Definition 6
The convex hull (CH) is the subset of \(\Pi\) for which there exists a \(\mathbf{w}\) (for a linear u) for which the linearly scalarised value is maximal, i.e., it is the undominated set for linear utility functions:
In words, the convex hull is the set of policies that maximise the weighted sum over objectives for some weight vector \({\mathbf {w}}\in {\mathbb {R}}^d\).
The Pareto Front and the Convex Hull often consist of infinitely many policies, especially when policies can be stochastic.
However, coverage sets can often be significantly smaller. This is particularly so for the convex coverage set.
Definition 7
A set \(CCS(\Pi )\) is a convex coverage set if it is a subset of \(CH(\Pi )\) and if for every \(\mathbf{w}\) it contains a policy whose linearly scalarised value is maximal, i.e., if:
The CCS is not only important for linear utility functions. Specifically if we also allow stochastic policies in \(\Pi\), a CCS is sufficient to construct a CS for all possible (nonlinear) monotonically increasing utility functions as well, i.e., a PCS [171]. We present in Fig. 1 a visual representation of the discussed solution sets, starting from the Pareto Front, since we make the minimal assumption that utility functions in multiobjective planning and learning belong to the class of monotonically increasing functions.
For deterministic stationary policies, the difference between the \(CH(\Pi )\) and a \(CCS(\Pi )\) is often small. Therefore, the terms are often used interchangeably. The key difference however is stochastic policies. Specifically, if the space of deterministic policies is discrete (i.e., there are a finite number of states for which a finite number of actions can be chosen) then there is always a finite CCS, even if stochastic policies are allowed. In contrast, the CH is typically infinite in this case. This is especially important because, as we have already mentioned, this finite CCS can be used as a basis to construct every policy in a PCS. For more detailed information on these sets, and how they interact with deterministic/stochastic policy spaces, please refer to [144].
The choice of solution set is key to the efficiency of the algorithms used to solve multiobjective problems. This is because we have to compute all the policies in these sets. When these sets become too large, we may not be able to compute them anymore, and we need to solicit more information on how to handle or prioritise the objectives. We consider that this optimisation process should be driven by the utility obtained by the user from a proposed solution which can be derived using the utility function. We introduce this perspective and approach in the following section.
4 The utilitybased approach
Early work in multiobjective sequential decisionmaking largely adopted an axiomatic approach in which the optimal solution set is assumed to be the Pareto front (see Definition 4). An advantage of this approach is that it leads to a solution set which will contain an optimal policy for any possible monotonically increasing utility function, and axiomatic methods can derive these solutions without any need to explicitly consider the details of those potential utility functions. However, this set is typically large, and may be prohibitively expensive to retrieve.
In practical applications, a lot more might be known about the utility function of the user, due to domain knowledge. Using an axiomatic approach would make it difficult to exploit this knowledge, and a lot of time and effort might be spent on computing a solution set which contains some members with very low utility for the user/deployment. Therefore there has been a trend in recent literature on multiobjective RL, to adopt a utilitybased approach [126, 127, 144, 145, 215].
Considering the user utility first is key to the successful application of any AI in decision problems. In multiobjective problems, it is especially important, as the properties of the user’s utility may drastically alter the desired solution, what methods are available, and even—in some cases [127]—whether stable solutions even exist. For example, as Vamplew et al. [171] have shown, if stochastic polices are allowed, a much smaller solution set suffices to construct a Paretofront, i.e., we can use stochastic mixtures between the policies in the deterministic stationary convex coverage set (CCS), which is much easier to compute, and allows for algorithms that exploit the properties of the CCS to retrieve the optimal policies, such as outer loop methods [136] as discussed further in Sect. 7.2.3. Therefore we advocate for the adoption of a userbased approach. We note that this does not exclude the use of axiomatic methods as these may still be appropriate. Axiomatic methods also may be appropriate in contexts where it is not possible to establish any constraints on the user’s utility function, or other characteristics of the solution, prior to learning or planning.
The utilitybased approach aims to derive the optimal solution set from the available knowledge about the utility function of the user, and which types of policies are allowed. This knowledge allows constraints to be placed on the solution set, reducing its size and thereby improving learning efficiency and making it easier for users or systems to select their preferred policy [144]. The utilitybased approach entails the following steps:

1.
Collect all a priori available information regarding a user’s utility.

2.
Decide which type of policies (e.g., stochastic or only deterministic) are allowed.

3.
Derive the optimal solution concept from the resulting information of the first two points.

4.
Select or design a MORL algorithm that fits the solution concept. A variety of algorithms suited to differing solution concepts are reviewed in Sect. 7.

5.
When multiple policies are required for the solution, design a method for the user to select the desired policy among these optimal policies.
We note that some of these steps can be done in parallel, as illustrated in the work flow diagram in Fig. 2. Specifically, it is possible to gather information on the user’s utility and decide on which types of policies to allow simultaneously (Steps 1 and 2). However, Steps 1 and 2 need to be completed to be able to derive the solution concept (Step 3), which in turn needs to be completed before being able to select or design an appropriate algorithm (Step 4), and design how the user can select policies (Step 5).
In each of the steps to complete in this process, different factors will come into play. We will briefly discuss which factors must be considered at each step, while referring to later sections for a more detailed discussion.
In Step 1, we aim to collect as much a priori available information about the user’s preferences as possible. This information will help us determine the class of utility functions which we should employ. For example, if we know that all objectives correspond to units of goods that we need to buy or sell on an open market, the utility function will typically be linear (i.e., a sum of prices per unit, times the amount of units we need to buy and can sell).
Another key distinction we have to make here concerns the application of the utility function for deriving the user’s utility [126, 127, 138, 144]. Specifically, if the utility is derived from single outcomes of performing the policy, we need to apply the utility function to the returns, and then optimise the expected utility of the returns. This is the socalled Expected Scalarised Returns (ESR) criterion. For example, in a medical treatment planning setting, the patients will derive their utility from their specific treatment outcomes. Therefore, if the user will only execute the learned policy once or a number of times, the ESR criterion should be optimised. Conversely, if the utility is derived from the average returns over multiple runs we should take the expectation first, and optimise the utility of expected returns. This is called the Scalarised Expected Returns (SER) criterion. In scenarios where the learned policy will be executed multiple times, the SER criterion should be optimised. For a detailed discussion on whether to apply ESR or SER, please refer to Sect. 5.3.
In Step 2, we need to decide what types of policies are allowed. This is important, as in contrast to singleobjective problems, stochastic policies can be strictly better than deterministic policies [171, 197, 198]. However, this does not mean that we should always allow them. For example, in a medical treatment planning setting, the patients would probably object to random selection of different medicines. Furthermore, we need to decide whether to allow nonstationary policies or not [197]. For a detailed discussion on policy types, please refer to Sect. 5.2.3.
Using the information from Step 1 and 2, we need to derive the appropriate solution concept (Step 3). For example, if the utility function is unknown at learning time, but known to be linear, any type of policy is allowed. We need a set of policies that contains at least one optimal policy for every possible set of linear weights. An example where this situation would arise would be where the linear weights correspond to fluctuating market prices of different commodities. Of course if Steps 1 and 2 do not identify any constraints on the utility function or allowable policy types, then the solution concept derived at this stage will be the Pareto front.
In Step 4, we need to either select an existing algorithm from the literature or design one that is suitable for the user’s requirements. The choice of algorithm depends on the solution concept selected in Step 3; one of the main distinctions is between singlepolicy and multipolicy algorithms (see Sect. 7.2). If the user’s utility function is completely known a priori and is not likely to change over time, a singlepolicy algorithm is appropriate. Conversely, if the utility function is unknown or subject to change a multipolicy algorithm is more suitable.
In Step 5, the goal is to help the user select a policy from a solution set produced by the algorithm selected in Step 4, that comes as close as possible to optimal user utility. This might be relatively straightforward if this set is small enough to show all possible policy value vectors to the user. If the set is large, or even continuous, more intricate methods are needed. For example, Zintgraf et al. [216] use Gaussian processes to model the utility function, and use relative preferences queries posed to the user to train this model. Furthermore, they use targeted priors and additional (virtual) data to exploit the fact that utility functions in multiobjective decision problems are monotonic in all objectives. As MORL research advances, we expect the challenge of policy selection to be tackled on a larger scale. An important open question is how to visualise and present highdimensional optimal solutions to users, in a concise and informative manner. To this end, potential approaches to take inspiration from are clustering methods for grouping similar solutions together (i.e., policies or values), and visualisation techniques such as tSNE [183].
Once the desired solution is selected, we can proceed with the policy execution. This phase will depend on how the algorithm applied in Step 4 stores the optimal set of solutions. If the algorithm explicitly stores the policies (e.g., [75]), this becomes a straightforward step. However, in the case of nonstationary policies or infinite horizon problem settings this approach is no longer feasible. In such cases, we are faced with the Paretooptimal policy following problem [137]. Furthermore, in stochastic settings, policy execution is nontrivial, as it leads to a combinatorial optimisation problem at every time step [137]. Van Moffaert and Nowé [182] outline an algorithm for executing policies for stochastic MOMDPs that implicitly assumes that the solution for the combinatorial optimisation problem can be found, without actually identifying it as a hard problem, and performing only limited experiments. Roijers et al. [137] propose two approaches for stochastic settings: a local search algorithm that aims to solve the problem heuristically, as well as a neural networkbased method that entails generating the required data during the planning or learning step and learning to predict which value vector to follow from the current state given the selected vector and last transition. An alternative approach to alleviate this problem is to design and learn a policy representation conditioned on the utility function, similar to the conditioned networks proposed by Abels et. al [4], in the dynamics weights setting.
In some situations it may be possible to merge Steps 4 and 5, by allowing the agent to interact with the user during the learning phase so as to identify their preferences and identify the policy which is optimal with regards to their utility. This will be discussed further in Sect. 7.2.4.
Together, these steps form a complete pipeline to set up a multiobjective reinforcement learning or planning system.
5 Factors influencing the design of multiobjective systems
Multiple factors exist in multiobjective problem domains which do not need to be considered for singleobjective problems, and these can have important implications for the design of a multiobjective agent. In this section, we identify and describe these factors, and explain the impact they may have on the design.
5.1 Scenarios requiring a multiobjective approach
Some researchers would argue that modelling problems as multiobjective is unnecessary and that all rewards can be represented as a single scalar signal [158]. This implies that it is always possible to convert a MOMDP to a MDP. In order for this conversion to take place, an a priori scalarisation function is required. However, Roijers et al. [144] show that in certain situations it may be impossible, infeasible or undesirable to perform this conversion. Roijers et al. [144] present three scenarios in which this can occur as illustrated in (a), (b) and (c) in Fig. 3. Additionally, we propose three new motivating scenarios: the interactive decision support scenario (d), the dynamic utility function scenario (e), and the review and adjust scenario (f). Figure 3 shows that each scenario consists of a planning or learning phase in which either a single policy or a solution set of multiple policies is found, and an execution phase in which a single policy is executed, and, in some scenarios, a selection phase in which the policy to be executed is selected.
In the unknown utility function scenario (a) [127], a priori scalarisation is undesirable as the utility function is unknown at the time when planning or learning occurs. There is too much uncertainty around the utility that could be received. In this scenario it is preferable to compute a coverage set of policies so as to be able to respond quickly whenever more information is available. In the wind farm control example (Sect. 2.4), there are two conflicting objectives. The goal is to maximise power output while minimising the required maintenance costs caused by the stress of operation. Specifying the exact preferences for these objectives is difficult since certain circumstances such as storms, the wake effect, and grid instability can affect the lifespan of turbine components. Since the link between these effects and preventive control measures is insufficiently understood, it is important to learn a set of optimal solutions.
In the decision support scenario (b), the user’s preferences are unknown or difficult to specify. Working with this uncertainty makes it infeasible, if not impossible, to use a priori scalarisation as the user’s utility function is unknown. The decision support scenario is almost identical to the unknown utility function scenario. The only difference is during the selection phase, where a set of policies is presented to a user who selects a policy based on their preferences. In the water management example (Sect. 2.2), the optimal solution for managing a water reservoir depends on many stakeholders and their multiple conflicting objectives. Each stakeholder has their own preferences as to how the water should be managed, with each objective having an effect on different aspects of the businesses operating around the lake as well as the livelihood of those living nearby. Capturing accurate preferences for all stakeholders while taking into account the tradeoffs across all objectives would be difficult, if not impossible. Instead, it would be better to learn a set of optimal policies and then make decisions regarding what policy to follow once a collective decision can be made by a local council or government.
The difference between the two scenarios above lies in the selection phase. The first scenario includes a utility revelation step where the utility function is made explicit. In the second scenario on the other hand, the decision relies on the user(s), and the utility function remains implicit in the decision taken. As defining a utility function explicitly is hard (if not infeasible), user selection typically employs the decision support scenario (b).
In the known utility function scenario (c), the user’s preferences are known. Working with known preferences, we can assume the user’s utility function are known at the time of learning or planning, making scalarisation both possible and feasible. However, it can still be undesirable to do so because performing a priori scalarisation can lead to an intractable problem [127, 144]. In the wind farm control example (Sect. 2.4), based on their preferences a user may want to maximise power output while minimising the stress on the turbine’s components. Since the user’s preferences are known, it is possible to learn a single policy that optimises the user’s preferences.
In the interactive decision support scenario (d), the agent has to learn about both the preferences of the user and the environment [142]. Applying a priori scalarisation in this scenario can be both undesirable and infeasible as the utility function of the user may be unknown or may be uncertain. During learning, the agent can elicit preferences from the user, removing uncertainty about the user’s utility function. In the planning a journey example (Sect. 2.1), a user may not be able to accurately specify their preferences. While cost and travel time are preferences that may be easily specified, objectives such as comfort and reliability may be difficult to specify. When planning a journey other tradeoffs such as taking a direct route over switching trains or having a lay over may have an impact on the user’s utility. At various times during the learning phase a user could be presented with different potential solutions and rank the solutions in order of preference. This will enable the system to get a more accurate representation of the users preferences and learn an optimal solution for the users.
In the dynamic utility function scenario (e), the user’s preferences for certain objectives change over time [107]. In this scenario applying a priori scalarisation would be undesirable. Given that a user’s preferences can change over time, it would be optimal for the algorithm to learn a finite number of policies over time and choose an appropriate nondominated policy for any utility function and improve upon it through further learning for that utility. By reusing information gained from learning for previously encountered utilities, e.g., feature representations and multiobjective value functions, efficiency is improved compared to learning a policy from scratch [4, 107]. Although there is an infinite amount of utility functions, they can be covered by a finite number of policies [107]. In the military purchasing example (Sect. 2.3), current governments must make decisions about military purchasing, but as governments change over time so do the preferences of each government towards military spending. Using a system that can learn optimal policies for changing preferences is the desired approach for this example. While it would be possible to learn a single policy for the initial utility function and then dynamically adapt this as the user’s utility function changes, this would incur a period of suboptimal behaviour as the agent adapts, which need not occur if the agent has learned in advance a suitable set of solutions. When governments change it is crucial that optimal policies are still followed despite the change in preferences.
In the review and adjust scenario (f), a user may be uncertain about their preferences over objectives and their preferences over objectives could change over time. Applying a priori scalarisation in this scenario is unfeasible as there is too much uncertainty around the utility function of the user. In this scenario, learning a coverage set of policies is optimal. Once a coverage set has been learned a user can then select the policy which accurately reflects their preferences. Before execution, the user can review their chosen solution. If the user’s preferences have changed, the user can adjust their selected solution to accurately reflect their updated preferences.
The review process can also update the MOMDP which can alter the set of solutions learned. This may for example occur when a new objective is identified, that was previously missed. For example, imagine an agent is used to control traffic in a part of a city. Initially, the pollution levels are seen as a single objective. However, after inspecting a map of the pollution levels resulting from the policies, it turns out pollution levels around a school is relatively high, while a school is actually an area where it ought to be low. In such a case, the pollution objective should be refined, i.e., split into two objectives: one overall, and one for sensitive/key areas such as schools (and e.g., hospitals).
In the planning a journey example from Sect. 2.1, a user may not be certain about their preferences for comfort and cost, among other objectives. In this scenario a set of optimal solutions is learned and the user selects the solution which accurately reflects their preferences. However, before execution if the user’s preferences have changed, the user can review their chosen solution and select an alternative solution which accurately reflects their updated preferences. The system can then be updated to reflect the newly obtained information about the user’s preferences.
5.2 Problem taxonomy
In [144], Roijers et al. outline a problem taxonomy which discusses what constitutes an optimal solution for a MOMDP. The taxonomy is based on the utility based approach, where the agent’s ultimate goal is to maximise user utility [127]. In the previous section, we highlighted three new motivating scenarios and we have updated the problem taxonomy diagram from Roijers et al. [144] to include the extended scenarios. The taxonomy in Table 1 outlines how each factor can lead to different solution concepts. It is important to carefully consider each factor in the taxonomy before choosing a solution concept. The factors of the problem taxonomy are covered extensively in [144], but we will briefly outline each factor below.
5.2.1 Single versus multiple policies
Whether or not an algorithm learns a single or multiple policies depends on which of the motivating scenarios holds from Sect. 5.1. For example, in the unknown utility function and decision support scenarios the agent needs to learn multiple policies. In both of these scenarios the utility function of the user is unknown at the time of learning or planning, and therefore the agent must return a set of optimal policies. In the known utility function scenario the user’s utility function is known at the time of learning or planning and therefore returning multiple policies is not necessary.
In the planning a journey example (Sect. 2.1), a user may or may not know their exact preferences about getting to their destination. A user may be unsure about how they would like to get to their destination or how much they are willing to spend on the journey. In this case we are in the unknown utility function scenario (a) [144] and learning a coverage set of policies is required. In contrast, a user may want to arrive to their destination at a specific time using a specific mode of transport, and may have a fixed budget. Since the user’s preferences are known we are in the known utility function scenario (c) [144] and a single policy which represents the user’s preferences can be learned.
5.2.2 Linear versus monotonically increasing utility functions
The nature of the utility function has a significant role to play in what constitutes an optimal solution in a MOMDP and which of the motivating scenarios holds. When the utility function is linear the weighted sum for each value of the objectives is computed. In the known utility function scenario, the utility function is known at the time of learning or planning. The utility function can be applied to each reward vector in the MOMDP and an optimal solution can be found. But linear utility functions may not be suitable when trying to express a user’s preferences. If a user’s preferences are nonlinear, a linear utility function is unable to accurately represent these preferences.
A monotonically increasing utility function adheres to the constraint that if a policy increases for one or more of its objectives without decreasing any of the objectives, the scalarised value also increases. A monotonically increasing utility function is able to represents both linear (with nonzero positive weights) and nonlinear user preferences. For example, in the unknown utility function scenario, the agent must learn a set of policies. When the utility function is unknown at the time of learning or planning, Pareto dominance can be used to determine a set of nondominated solutions. Since the utility function is monotonically increasing, policies that are Pareto dominant will be preferred by the user.
In the wind farm control example (Sect. 2.4) there are two objectives: to maximise power and to reduce fatigue loads on the turbine. In the real world, a user will likely have nonlinear preferences over these objectives. If these preferences are known (known utility function scenario) at the time of learning or planning it is crucial they are represented using a nonlinear utility function. If the user’s preferences are represented using a linear utility function a suboptimal solution will be learned. A linear utility function cannot accurately represent nonlinear preferences [144]. In this case, learning a suboptimal solution could negatively impact a wind turbines performance. If a wind turbine is not operating optimally, stress on the turbine’s components would be increased, which impacts the components lifespan and increases maintenance costs. However, if a nonlinear utility function is used to represent the user’s preferences then it is possible to learn an optimal solution.
5.2.3 Deterministic versus stochastic policies
Whether to restrict the agent to policies that are deterministic or to allow stochastic policies has a significant impact on what an optimal solution is in a MOMDP. When the utility function is linear, we can translate the MOMDP to a singleobjective MDP. In an MDP, only deterministic stationary policies apply as the optimal obtainable value is reachable with deterministic stationary policies. This is true for all linear utility functions. But when the utility function is monotonically increasing and nonlinear the situation is much more complex.
For example, in the known utility function scenario, the utility function is known during the learning or planning phase. If the utility function is a linear representation of a user’s preferences, we can then translate the MOMDP to a singleobjective MDP where only deterministic stationary policies hold. As another example, in the unknown utility function scenario where the utility function is assumed to be nonlinear and only deterministic policies are allowed a coverage set of Pareto dominant policies must be learned. In this scenario, nonstationary policies can Pareto dominate stationary policies [197], therefore the Pareto coverage set must include nonstationary policies [144].
In the water management example (Sect. 2.2) there are certain scenarios where stochastic policies should not be considered whatsoever. A stochastic policy where there is a chance the dam gates are opened and all the water in the reservoir is drained should not be considered, even if other outcomes of the policy are optimal. This stochastic policy would have catastrophic outcomes for the nearby town. If the utility function is nonlinear and known at the time of learning or planning (known utility function scenario (c)) it would be optimal to learn one deterministic nonstationary policy. In this case, devastating outcomes like the scenario already mentioned would be avoided.
5.3 Scalarised expected returns and expected scalarised returns
In contrast to singleobjective reinforcement learning, in multiobjective reinforcement learning (MORL) different optimality criteria exist [144]. Optimising under each criterion can lead to significantly different policies being learned [127], where the criterion chosen for optimisation depends on how the policies are used in practice. The two optimisation criteria are known as the scalarised expected returns (SER) and the expected scalarised returns (ESR).
The SER criterion is the most commonly used optimisation criterion in multiobjective RL and planning [146]. The SER criterion is calculated by first computing the expected vector returns of a policy and then applying the utility function to this expectation,
For SER the utility of a user is derived from multiple executions of a policy. SER is the optimal optimisation criterion in scenarios where the user is concerned about achieving an optimal utility over multiple policy executions. For SER, a coverage set is defined as a set of optimal solutions for all possible utility functions.
However, many scenarios exist where only a single execution of a policy may be relevant to a user. In scenarios where a single execution of a policy is used to derive the utility of a user, optimising under the ESR criterion is optimal [138]. For example, in a medical setting a patient may have only one opportunity to select a treatment. Under the ESR criterion the utility function is applied to the returns and the expectation is then computed,
For a linear utility function there is no difference in the policies learned for SER and ESR. However, for a nonlinear utility function the policies learned under SER and ESR are significantly different [126]. Many RL methods cannot be combined with the ESR criterion because nonlinear utility functions in MOMDPs do not distribute across the sum of immediate and future returns which invalidates the Bellman equation [138],
where \(\mathbf {R^{}_{t}}=\sum _{i=0}^{t  1} \gamma ^i \mathbf{r}_i\). ESR is the most commonly used optimality criterion in the game theory literature and literature on multiobjective games [126]. However, the ESR criterion has been extensively understudied in the RL literature. To study the ESR criterion further it is essential that new methods are formulated. It is important to note, for the ESR criterion a coverage set has yet to be defined and this is an open area for research.
In the planning a journey example (Sect. 2.1), we can consider a known nonlinear utility function for a user planning their daily commute to work. If the user’s employer is flexible about what time the user can start work, then the user can sometimes be late as long as the user is usually on time. Under the SER criterion a policy is optimised on the duration of journey objective. Since this policy is executed everyday it is acceptable for the user to be late some days because the days when the user is early compensate. However, if the user’s employer is strict and requires the user to be on time each day or be subject to a fine, it is crucial that the user can plan a daily journey where they arrive on time. In this case optimising under the ESR criterion is optimal since every policy execution must ensure that the user arrives to work on time.
6 The relationship with other problems
Parallels exist between some aspects of multiobjective sequential decisionmaking tasks and the classes of problems considered by other areas of reinforcement learning and planning research. In this section we identify some of the key areas of overlap between fields where we believe there is potential for beneficial exchange of ideas and techniques. We also pinpoint several pitfalls where the application of methods to multiobjective problems may not be consistent with the utilitybased paradigm.
6.1 Partially observable MDPs
A key observation made already in the 1980s, is that if one assumes linear utility functions, POMDPs are a superclass of MOMDPs [198]. To see this, imagine there would be a “true objective” and the linear weights of the utility function would form a “belief” over what the true objective would be. This is a special type of POMDP, where there will never be any observations concerning what the “true objective” is – because after all, it does not actually exist.
Of course multiobjective problems and partially observable problems have significantly different interpretations. However, the fact that POMDPs form a superclass of multiobjective MDPs under linear utility has important consequences for researchers and practitioners alike. Firstly, a lot of theoretical properties are inherited from POMDPs. This means that a lot of theorems do not have to be proven anew for MOMDPs under linear utility. So if you are wondering whether a certain property holds, it is prudent to consult the POMDP literature as well. Secondly, it means that methods that have been invented originally for POMDPs, can often be adapted for usage in MOMDPs [136]. While doing so, it is key to note that the number of objectives in a MOMDP correspond to the number of states in a POMDP (i.e., the dimensionality of the belief and \(\alpha\)vectors) [146]. This means that methods that did not work well in a POMDP context because they scale poorly in the number of states, might be very useful in a MOMDP context. A good example of this is Optimistic Linear Support (OLS) [103, 136, 146], which was based on Cheng’s linear support for POMDPs [24]. Finally, it might mean that some algorithmic improvements may be applicable to both MOMDPs and POMDPs (such as [139]).
6.2 Multiobjective as multiagent problems
Objectives are not agents. Some papers—in our opinion abusively—cast singleagent multiobjective problems as multiagent problems, with each agent representing a competing objective [81, 98]. Then, either through voting rules [168] or (Nash) equilibria [38, 79, 83, 104], a policy is selected. This mechanism however, has no guarantees with respect to user utility. It is unclear whether this “compromise solution” represents a desired tradeoff or not. Specifically, the concepts of voting rules and Nash equilibria have been designed to find tradeoffs between the individual utilities of agents. This is different from tradeoffs for an individual agent between objectives, as objectives can be more or less important and may have nonlinear interactions in the utility function. A voting rule or Nash equilibrium is not able to capture such subtleties and can therefore function as no more than an unfounded heuristic. In fact it is wellknown that Nash equilibria can be Paretodominated [25, 39].
But altruistic agents can see other agents as objectives. On the other hand, if we consider an agent that is explicitly altruistic, i.e., it cares about the other agents in its environment, such an agent could see the utility of these other agents as objectives, and therefore this should be modelled as a multiobjective problem. As such, it is also possible to consider varying levels of altruism. Aoki et al. [8] for example, consider a multistage flow system with multiple agents. Each agent is a service centre, and is represented as a different objective by the other agents. They use a distributed reinforcement learning framework and propose a bidirectional decision making mechanism to address the resulting multiobjective problem.
Modelling other agents as objectives enables explicitly imposing fairness between these objectives, i.e., the utilities of the agents. To this end, one can use the Lorenz dominance ordering, i.e., a refinement of Pareto dominance that introduces the predilection towards a more balanced distribution of values over the objectives. A loose condition for fairness is that a joint policy \(\varvec{\pi }\) is not socalled Lorenz dominated. Lorenz domination is based on the socalled Lorenz vector [121].
The Lorenz vector \(\mathbf{L}(\mathbf{V}^{\varvec{\pi }})\) of a vector \(\mathbf{V}^{\varvec{\pi }}\) is defined as:
where \(v_{(1)} \le v_{(2)} \le ... \le v_{(N)}\) correspond to the values in the vector \(\mathbf{V}^{\varvec{\pi }}\) sorted in increasing order.
Imposing that the Lorenz vector of a policy is undominated leads to the Lorenz optimal set as coverage set. A vector \(\mathbf{V}^{\varvec{\pi }}\) Lorenz dominates (\(\succ _L\)) a vector \(\mathbf{V}^{\varvec{\pi '}}\) when:
i.e., when the Lorenz vector of \(\mathbf{V}^{\varvec{\pi }}\) Pareto dominates the Lorenz vector of \(\mathbf{V}^{\varvec{\pi '}}\). A Lorenz optimal set can then be input for a negotiation of which policy to execute in practice [126]. Lorenz optimal sets have been studied in the context of different problem domains [48, 102, 121].
In short, objectives do not typically correspond to the interests of single agents as agents will care about multiple objectives. However, altruistic agents may see the interests of other agents as objectives, and therefore aim to come up with fair solutions.
6.3 Multi and auxiliary task RL
A highly related problem that has recently gained traction in the RL literature is that of auxiliary tasks and multitask RL. For example, Schaul et al. [154] define multiple goals, which are typically a subset of the states. They then learn a universal value function approximation (UVFA) network, that learns a value with respect to these different goals. From a MORL perspective, UVFA is thus an instance of MORL, with the restriction that goals are closely associated with states, and that the utility function may only select one of these goals to be the goal at the moment. To move to a more general MORL setting, a goal should be generalised to a specific a priori known (parameterised) utility function and as such there is no clear relation between the goal (i.e., the importance of each objective) and the state. This would be an issue in multipolicy settings as (a) it is not clear how many specific utility functions would be needed, and (b) nonlinear utility functions would not be supported. For the dynamic weights setting with linear utility functions, UVFA has been adapted to the MORL setting as a baseline algorithm [4], and shown to perform worse than specific MORL algorithms, but better than more naive baselines.
In their work on successor features (SF), Barreto et al. [11] decompose a scalar reward into a product of state features and task weights to enable transfer learning between tasks. Again, we observe that successor features are in fact a subclass of multiobjective problems with linear weights, i.e., where objectives can be associated with desirable statefeatures. Universal Successor Features Approximators [14] and Universal Successor Representations [87] combine the benefits of SF and UVFA to further generalise across goals. It is important to note though that while state features and task/goals weights are analogous to the multiobjective reward and linear weight vectors, in MORL the decomposition between reward and weight vectors is typically given rather than learnt. This is because successor features are not observing the individual objectives and are only provided with a scalar reward function. One might think that this would make SF more widely applicable than MORL. However, it also restricts the usage of such methods to things that can be inferred from state features. But, more importantly, scalar reward functions are often engineered on the basis of real events, multiple sensor inputs, and endlessly tweaked on the basis of the actual objectives of the users and designers, as we discussed in Sect. 1. Hence, using successor features instead of MORL, would in many realworld problems come down to throwing away information first in order to construct a scalar reward function, to later partially infer it back from data. This is of course suboptimal, and should be avoided if possible.
6.4 Humanaligned agents
As AI systems are increasingly being applied to important realworld tasks, interest has grown in ensuring that the behaviour of autonomous systems is aligned with our own objectives, so as to avoid harmful outcomes either at a general level or with regards to specific individuals. Research within this field focuses on ensuring that the decisions and behaviour of autonomous agents are safe, trustworthy, aligned, interpretable, fair and unbiased. As these add additional considerations beyond maximising the agent’s primary reward, there is a clear link to multiobjective approaches.
Strong parallels exist between multiobjective decision making and risksensitive or safetyaware decision making [93]. An agent making decisions in the context of uncertain risks must aim not just to maximise its expected reward, but also to account for some measure of risk. This measure may be based on the variance of the reward, the worstcase outcome or the probability of entering known error states [49]. As with the multiobjective methods discussed in this paper, the choice of optimal action for a riskaware agent will be based on combining together the expected reward and risk measures for each action using some form of utility function. Therefore it is not surprising that several authors have framed safe reinforcement learning as a multiobjective problem. In [50, 51] and [61], MORL was applied to develop riskaware agents, where the riskrelated reward is based on the probability of the agent visiting an error state. Meanwhile, Elfwing and Seymour [42] argue, based on biological evidence, that computational agents may behave more safely if they learn separate values for rewards and punishments.
As well as a growing interest in safe AI, recent years have also seen an increasing focus on the issues of explainability and interpretability of autonomous systems, as these factors are important for building trust with human users, and in ensuring transparency and lack of bias. It has been argued that a reward which has been decomposed from a scalar into its component terms provides benefits from the perspective of explaining decisions [72], and so several recent papers have explored multiobjective approaches to explainable and interpretable RL agents [26, 28, 29, 111, 211].
In many applications it is also important to ensure that the actions of an agent are fair with regards to multiple stakeholders—a solution which is optimal for many members of society but which significantly disadvantages a subset of the population would, in many cases, be discarded by stakeholders in favour of a perceptively more fair solution. In some contexts this may involve the development of multiagent systems as in our earlier discussion of altruistic agents. In other situations, a single agent may be considering multiple objectives where each objective corresponds to the desires of a particular individual stakeholder, and so the appropriate utility function may be one which maximises the performance with regards to the lowest scoring objective, such as leximin [96] or soft maximin [159]. Alternatively the rewards may correspond to various objectives where each stakeholder may have different preferences over these objectives and the agent must balance the utility obtained by each stakeholder. Multiobjective approaches to fairness have been explored at an abstract level [157], and also within specific applications [63, 119].
In short, following Vamplew et al. [172, 177], we argue that multiobjective agents provide a suitable mechanism for developing humanaligned artificial intelligence, addressing safety constraints as well as other alignment issues such as ethics or legal restrictions.
7 Survey of multiobjective reinforcement learning and planning algorithms
In this section we review the stateoftheart in algorithms for multiobjective planning and reinforcement learning, relating these algorithms back to the design factors identified in Sect. 5. The aim is to aid in identifying which extant algorithms may be best suited for a particular application, based on the properties of that application. Table 2 presents a nonexhaustive list of approaches organised according to the multiobjective taxonomy discussed in Sect. 5.2.
7.1 Multiobjective planning algorithms
Research on planning approaches to MOMDPs has been established for much longer than work on reinforcement learning approaches, dating back to at least the early 1980s [167, 197, 198]. White [198] adapted dynamic programming to develop an algorithm for finding Pareto set policies for infinite horizon discounted MOMDPs. However, as identified by Wiering and De Jong [199], that approach has issues of computational feasibility and finds policies which are nonstationary. To address this, they developed the CONMODP algorithm which invokes a consistency operator to ensure the stationarity of policies.
Bryce et al. [16] demonstrated by example that, in the context of MOMDPs with stochastic state transitions, agents which aim to maximise the SER cannot rely on localised decisionmaking. The information available at any given state is insufficient to determine the optimal action under the SER formulation, and the agent must also take into account the actions which will be selected, and rewards which will be received, at all other states of the MOMDP. They develop the Multiobjective Looping \(AO^*\) (\(MOLAO^*\)) algorithm to address this issue.
The Convex Hull Value Iteration (CHVI) algorithm [12] is amongst the most widelycited works on MOMDP planning. Although it is frequently incorrectly described as a MORL method, it in fact extends Bellman’s value iteration algorithm to estimate and store the convex hull of future rewards for each stateaction pair. This allows CHVI to identify the coverage set of policies, but only under the assumption that the utility function is linear. Because of the linear utility function, CHVI is akin to planning in POMDPs (see also the relation with POMDPs in Sect. 6.1). This has recently been shown in a paper that improves both CHVI and POMDP value iteration methods by reusing information across linear programs in subsequent iterations of these methods [139].
Other planning methods have considered the possibility of specific nonlinear definitions of utility. Perny and Wang [120] address the task of finding the single optimal policy given the goal of minimising the distance between the reward vector received and a target reference point in objective space. They show that the nonlinear nature of this utility prevents direct adaptation of methods like dynamic programming which are based on the Bellman equation, and instead develop a nonlinear programming solution for this task. Meanwhile, Wray et al. [203] identify Lexicographic MDPs as a specific subset of MOMDPs, where there is a specified ordering over objectives. They develop methods based on valueiteration for solving such tasks, allowing the ordering of objectives to be statedependent and incorporating the concept of slack, which allows some degree of loss in the primary objective in order to obtain gains in secondary objectives. This approach has also been extended to POMDPs [202].
7.2 Multiobjective reinforcement learning algorithms
7.2.1 Stateless/bandit algorithms
Algorithms designed for the multiarmed bandit (MAB) domain endeavour to follow an optimal exploration/exploitation strategy for selecting between different actions (arms), so as to minimise the regret (the loss in reward from not selecting the, initially unknown, optimal action on every timestep). Several papers have examined the extension of MAB algorithms to multiobjective tasks, often by adopting the concept of multiobjective regret in which the agent aims to minimise the number of Paretodominated actions which are performed.
Several multiobjective variations to the wellknown UCB1 algorithm are compared in [37], including linear and Chebyshev scalarisations, as well as a version based on Pareto dominance. The empirical results show that Pareto UCB1 outperforms the scalarised versions. Later, Yahyaa et al. [205] demonstrated that a Paretobased variant of the knowledge gradient algorithm could lead to further improvements in performance over Pareto UCB1.
For the interactive decision support setting (Fig. 3d), bandit algorithms have been devised that intertwine learning about the reward functions with preference elicitation. For linear utility functions extensions have been made to both UCB1 and Thompson Sampling [141]. For generalshape utility functions, Gaussianprocess Utility Thompson Sampling (GUTS) [143] combines Thompson sampling with Gaussian processes to learn about the reward vectors and a monotonically increasing utility function simultaneously.
Other work has examined multiobjective extensions to specialised forms of bandits. Van Moffaert and Nowé [182] consider a multiobjective form of the \(\chi\)armed bandit, in which the set of arms is a measurable (potentially infinitely large) set of arms. They propose a modified form of the Hierarchical Optimistic Optimization (HOO) algorithm for this class of bandits. Likewise, Lacerda [78] examines multiobjective extensions of ranked bandits, in which the agent produces a ranking of arms rather than a single choice at each timestep. More recently, Turgay et al. [170] extended the contextual MAB model to incorporate multiple objectives. Unlike conventional MABs, a contextual bandit incorporates some additional state or sideinformation, and so represents a compromise between stateless bandits and fullblown RL scenarios. Their Pareto Contextual Zooming (PCZ) algorithm aims to minimise the Pareto regret while also maintaining a fair distribution over the Paretooptimal arms [15].
7.2.2 Singlepolicy algorithms
Perhaps the simplest and most widelyadopted approach to MORL is to extend existing singleobjective modelfree valuebased methods, such as Qlearning, to handle multiple objectives. This extension requires two changes to the learning algorithm, i.e., the agent must store Qvalues as vectors rather than scalars, and the scalarisation function designed to match the user’s utility function must be used to identify the greedyaction to perform in any given state. This approach naturally gives rise to singlepolicy solutions to the multiobjective problem, as the underlying singleobjective methods are designed to produce a single optimal solution.
Many applications of this approach have used a linear scalarisation function, either weighted or unweighted [6, 56, 117, 156]. This is equivalent to transforming the MOMDP into a corresponding MDP, and so existing proofs of convergence apply [144]. In some domains this will also be a suitable representation of the user’s underlying utility (for example, in problems where the objectives are naturally expressed in monetary terms). However, in many cases this linear function will be inadequate to represent the user’s true utility [174]. Therefore it will often be preferable to use a nonlinear function instead [47, 68, 185, 186]. Nevertheless, this violates the assumption of additive returns in the Bellman equation at the heart of these algorithms [144], and therefore it may be necessary to condition the Qvalues and the agent’s choice of action on an augmented state formed by concatenating the environmental state with the summed rewards previously received by the agent [50]. Additionally these approaches may fail to converge to the optimal policy in environments with stochastic state transitions [178].
An alternative to these valuebased approaches is to adopt a policysearch algorithm. These methods do not rely on the Bellman equation, and so they can directly optimise with regards to any utility function, including nonlinear functions (for example, by calculating gradients relative to the utility only at the end of an episode). In addition, they generally produce stochastic policies, which can be beneficial in the context of multiple objectives as discussed earlier in Sect. 3.2. For example, Pan et al. [113] implement a mixture of longterm policy gradient and shortterm planning to find singlepolicy solutions, while Siddique et al. [157] develop multiobjective forms of the PPO and A2C policy search methods for the task of finding a singlepolicy which is fair with regards to all objectives, as measured by the Generalized Gini social welfare function. A substantial number of further multiobjective policysearch methods have been explored in the literature, but much of this work has been in the context of multipolicy approaches and/or deep RL, and so will be discussed further in the later subsections.
Under the ESR criterion (Sect. 5.3) a nonlinear utility function is assumed. As already highlighted a nonlinear utility function invalidates the assumed additive returns in the Bellman equation. In this case, new methods must be created to efficiently optimise the ESR criterion. Roijers et al. [138] implement an Expected Utility Policy Gradient (EUPG) algorithm which uses Monte Carlo simulations to calculate the sum of the accrued returns and future returns. EUPG optimises over the full returns of an episode as the utility function is applied to the sum of the accrued returns and the future returns. Hayes et al. [59, 60] propose an algorithm known as Distributional Monte Carlo Tree Search (DMCTS) which learns a posterior distribution over the utility of the returns of a full episode. Reymond et al. [135] propose a multiobjective categorical ActorCritic (MOCAC) algorithm. MOCAC learns good policies under the ESR criterion for nonlinear utility functions by utilising a distributional critic that estimates a categorical distribution over the returns.
7.2.3 Multipolicy approaches
Multipolicy approaches can be divided into two classes. Outer loop methods operate on series of singleobjective problems to construct an (approximate) CS, whereas inner loop methods consist of algorithms directly designed to produce multiple policies in one pass [145]. Broadly speaking, outer loop methods tend to have adopted a utilitybased approach, as they need to make assumptions about the utility function in order to apply it on each iteration to produce a scalarised singleobjective problem, whereas inner loop methods may follow either the utilitybased approach (e.g. [4, 18]) or an axiomatic approach (e.g. [115, 152, 182, 200]).
The simplest outer loop methods iterate through a series of different parameter settings for a utility function^{Footnote 2}, and rerun a singlepolicy MORL method for each setting (for example, [114]). The efficiency of outer loop approaches can be improved in two ways. Reusing information from earlier runs rather than discarding this information can reduce learning time [107, 115, 140]. Secondly, naive searches through parameter space may relearn the same policy multiple times, or require a small stepsize to ensure all optimal policies are discovered [140]. More efficient adaptive search methods can reduce the number of iterations of the outer loop [103, 136, 146].
Inner loop methods modify the underlying algorithm to directly identify and store multiplepolicies in parallel rather than sequentially. Both ParetoQLearning (PQL) [182] and PQlearning [152] modify Qlearning to store multiple Paretooptimal values for each stateaction pair. Pruning of dominated values is used to eliminate dominated policies [89]. So far these methods are restricted to tabular representation of Qvalues, limiting their broader applicability, although the Pareto DQN algorithm [134] provides an initial attempt to integrate PQL and deep RL methods. In the batch setting, MultiObjective Fitted QIteration (MOFQI) [18] extends the Fitted QIteration algorithm [43] to the multiobjective case by adding to the state the linear scalarisation weights. MOFQI learns with a single training process an approximation of the optimal Qfunction for all possible combinations of the scalarisation weights.
Multiple authors have developed innerloop multipolicy methods based on multiobjective extensions of Monte Carlo Tree Search. The decision about which branch of the tree to expand at any point is determined based on either the hypervolume metric [192, 194]^{Footnote 3}, or on a measure based on Paretodominance [21, 118, 196].
Modelbased methods have clear benefits in the context of multipolicy learning, as once a model of the environment has been learned, it can be used to derive the optimal policy for any utility function with no requirement for further interaction with the environment. Despite this, there has been surprisingly little research so far in modelbased MORL. Wiering et al. [200] provide an approach which learns all Paretooptimal policies by first learning a model, and then applying the CONMDP multiobjective dynamic programming algorithm [199]. However, this approach is limited to learning stationary, deterministic policies for deterministic environments. The approach of Yamaguchi et al. [206] can be applied to stochastic environments. It learns a model which stores reward occurrence probability (ROP) vectors rather than Qvalues, and then uses the inner product of the ROP vector and a given weight vector to find the expected reward for the optimal policy for that weight vector. This approach avoids the need to perform an extensive search of the weight space to identify optimal policies. However, it is limited to finding deterministic policies under linear scalarisation, and is designed for maximising the average reward rather than the cumulative discounted return.
In order to learn in domains with continuous stateaction spaces and where the state is not fully observable, policy search or actorcritic algorithms are usually considered [32]. In the literature, both outer loop [114] and inner loop [52, 115, 116] approaches have been proposed to extend policy search methods to multiobjective problems. The approach of Parisi et al. [115] is interesting in that it constructs a continuous rather than discrete approximation of the Pareto front. More recently, Abdolmaleki et al. [2] have addressed tasks with highdimensional continuous action spaces using a multiobjective extension of the MPO actorcritic RL algorithm.
Populationbased evolutionary methods are wellsuited to finding multiple policies, as each individual can represent a policy which is optimal for a different set of utility preferences. The field of multiobjective evolutionary optimisation is already very well established [7, 30, 44, 169], and several researchers have applied concepts from this area to MORL tasks. Evolutionary methods can applied by encoding policies as individuals, and executing them within the environment to calculate a vector fitnessmeasure. This can either be done in isolation [24, 115], or combined in a hybrid algorithm where the evolutionary method performs global search with local hillclimbing [160], policygradient [204] or actorcritic methods [23] used to provided localised search or finetuning.
7.2.4 Interactive approaches
The majority of MORL methods take either an a priori approach to policy selection where user’s preferences must be specified prior to learning, or an a posteriori approach where a set of policies are learned and then presented to the user for selection. A third alternative is to allow the user to interactively specify their preferences during the learning process, as first proposed in [175, p. 63]. This allows the user to make a more informed decision based on the agent’s discoveries about the nature of achievable tradeoffs between objectives, while also allowing earlier convergence to the user’s preferences which is important in online learning. An example is the Qsteering algorithm [173, 179]. The user specifies initial preferences in terms of a target point in objective space, and the agent learns a nonstationary mixture of linearscalarised base policies which minimises the distance between the average reward and the target. A visualisation of the returns of the base policies can be provided to the user, who may then revise their choice of target. The agent can immediately adapt to such changes.
Some work builds on methods for singleobjective reinforcement learning with human guidance, and extends those methods to multiobjective problems. Wanigasekara et al. [195] propose an algorithm that learns user utility functions from observations of usersystem interactions for multiobjective contextual bandit based personalized ranking of search results. Ikenaga and Arai [66] propose to use inverse reinforcement learning for elicitation of user preferences in multiobjective sequential decision making, while Saisubramanian et al. [153] use human feedback through random queries, approval, corrections, and demonstrations to learn policies that avoid negative side effects.
A systematic approach to simultaneous learning about the environment and the user was proposed by Roijers et al. for multiobjective multiarmed bandits. Specifically, the interactive Thompson sampling (ITS) algorithm [141] uses queries to solicit preferences resulting from linear utility functions while interacting with the environment. For this, it employs Bayesian logistic regression to learn about the utility function, and uses the uncertainty estimates about the utility function to decide which queries to ask. The Gaussianprocess Utility Thompson Sampling (GUTS) algorithm [143] does the same for any continuous utility function by using Gaussian processes to model the utility function, and estimate the uncertainty about this function.
There are considerable similarities between the learning of user preferences for multiobjective environments, and the tasks performed during preferencebased reinforcement learning [201]. The key difference is that the latter is applied to the task of MOMDPs without reward (MOMDP\(\backslash\)R), and so the only feedback provided to the agent is in the form of user preferences over states, actions and/or trajectories. This means that preferencebased methods may be applicable to tasks where multiobjective methods are not, such as where producing quantifiable numerical rewards is impractical or impossible for some objectives (e.g for objectives related to aesthetics). However as the preferencebased agent can only learn from the preferences provided to it, it cannot undertake multipolicy learning or planning to account for other, alternative preferences in the way that multiobjective agents can. The synthesis of ideas from preferencebased and multiobjective learning is a promising area for future research.
7.2.5 Scaling up to highdimensional states
Singlepolicy and outer loop multipolicy methods can be extended to handle highdimensional input data in much the same way as the corresponding singleobjective algorithms on which they are based. For example, Tesauro et al. [166] combined SARSA, nonlinear utility, and small multilayer perceptrons to learn to control the power consumption and performance of computing clusters.
Deep reinforcement learning methods [85, 101] have shown to scale beyond finite and discrete spaces to problem domains with highdimensional, continuous state and action spaces. Using deep networks as nonlinear function approximators for handling multiobjective optimisation problems has been on the rise the past few years [4, 82, 103, 109, 164]. Most of these methods extend the singleobjective DQN architecture [101] and follow a singlepolicy or a multipolicy approach.
Mossam et al. [103] extended DQN to a multiobjective setting by learning an approximate coverage set of policies (multipolicy). Each policy is represented using a DQN whose output layer has \(\mathcal {A} \times n \) nodes, where \(\mathcal {A}\) represents the size of the action space and n represents the number of objectives. For better efficiency, the authors proposed to reuse the network weights of previously learnt policies for preferences that are similar to each other (\(\varvec{w'} \sim {\varvec{w}}\)). Abels et al. [4] analysed different architectures while extending DQN to a multipolicy, multiobjective setting: the use of multiple DQNs for different user preferences; and a single DQN that can generalise across different user preferences. In scenarios where user preferences change in realtime, the singlepolicy method was most effective. To improve sample efficiency and address bias to recently seen user preferences, they used a diverse experience replay buffer that contained experiences corresponding to different user preferences. Yang et al. [208] also used a singlenetwork approach which generalized across different user preferences, however, they performed envelop updates by using a convex envelope of the solution frontier while updating network parameters. Such envelop updates lead to faster convergence when compared to scalarized updates for a given user preference, which are often sample inefficient, resulting in suboptimal policies.
There has been some recent work on multiobjective deep reinforcement learning for continuous action spaces. Chen et al. [22] combined a multiobjective extension of PPO [155] with modelagnostic metalearning (MAML) [46]. The proposed method first learns a metapolicy, which can then be finetuned in few iterations to find a set of Pareto optimal policies. Compared to learning each policy from scratch, the method was shown to improve performance in terms of training time and optimality of the resulting Pareto front. Xu et al. [204] also used a multiobjective extension of PPO, combined with an evolutionary algorithm to guide learning in the most promising direction. In each generation of evolution, data stored from previous iterations of MORL are used to fit a prediction model, which can help find the pairs of policies and scalarization weights that will improve the solution the most. Each selected policyweight pair is then improved through MORL to produce offspring policies, which are used to create a new generation of policies. The final generation is divided into policy families by clustering, and policy parameters within each family are interpolated to produce a continuous approximation of the Pareto front. Abdelfattah et al. [1] used a twostage approach for learning in environments with nonstationary dynamics and continuous actions. In the first stage, a set of generic skills (e.g., Move Forward, Turn Left, and Turn Around) are learned. In the second stage, the learned skills are used in a hierarchical version of DDPG [85] to produce a policy coverage set for the MOMDP. An intrinsically motivated RL algorithm is used to select which objective preferences to explore to improve the coverage set, and a policy bootstrapping mechanism is used to quickly adapt to changes in the environment dynamics.
While the above approaches use linear scalarization, Tajmajer [164] proposed a nonlinear actionselection mechanism by using n different DQNs corresponding to each objective which are combined using a separate output layer along with the user preferences. Deep reinforcement learning methods for multiobjective partially observable settings have also been proposed [110]. These approaches use action and observation histories along with user preferences as input to the neural network. In general, partially observable settings are much more complex when compared to fully observable settings in terms of training time as well as training stability.
7.2.6 Multiagent algorithms
To formalise multiobjective multiagent decision problems, let us introduce the multiobjective stochastic game (MOSG). A multiobjective stochastic game is a tuple \(M = (S, {\mathcal {A}}, T, \gamma , \mu , {\mathcal {R}})\), with \(n\ge 2\) agents and \(d\ge 2\) objectives, where:

S is the state space

\({\mathcal {A}} = A_1 \times \dots \times A_n\) represents the set of joint actions, with \(A_i\) being the action set of agent i

\(T :S \times {\mathcal {A}} \times S \rightarrow \left[ 0, 1 \right]\) is a probabilistic transition function

\(\gamma \in [0, 1)\) is a discount factor

\(\mu :S \rightarrow [0,1]\) is a probability distribution over initial states

\({\mathcal {R}}={\mathbf {R}}_1 \times \dots \times {\mathbf {R}}_n\) are the reward functions, where \({\mathbf {R}}_i :S \times {\mathcal {A}} \times S \rightarrow {\mathbb {R}}^d\) is the vectorial reward function of agent i for each of the d objectives
An agent behaves according to a policy \({\pi }_i: S \times A_i \rightarrow \left[ 0, 1 \right]\). Optimising \(\pi _i\) is equivalent to maximising the expected discounted longterm reward:
where \(\varvec{\pi } = (\pi _1,\ldots , \pi _n)\) is the joint policy of the agents acting in the environment, and \({\mathbf {R}}_i(s_t, {\mathbf {a}}_{t}, s_{t+1})\) is the vectorial reward obtained by agent i for the joint action \({\mathbf {a}}_{t}\in {\mathcal {A}}\), at state \(s_t \in S\). It is also possible to extend this framework to include the case in which the discount factor is different for each agent i.
The value function is also vectorial, \(\varvec{V}^{\pi _i} \in {\mathbb {R}}^d\). When taking a utilitybased perspective, we consider that each agent also has an individual utility function \(u_i\) to project \(\varvec{V}^{\pi _i}\) to a scalar value, as detailed for the singleagent case in Sect. 5.3.
As explained in Sect. 4, in singleagent multiobjective problems, the shape of the utility function, in conjunction with the allowed policy space, can be used to derive the optimal solution set that a multiobjective decisiontheoretic algorithm should produce. In multiagent settings, the situation is more complex, as each individual agent can represent one or more distinct users (i.e., each agent can have a different utility function). For this reason, Rădulescu et al. [126] proposed a new taxonomy which classifies multiobjective multiagent decision making (MOMADM) settings on the basis of both reward structures and utility functions, as shown in Fig. 4. We note that the case of individual reward–team utility is equivalent to and treated as the individual reward–individual utility case, since the individual return vectors would still lead to different utility values for each agent, despite them having the same utility functions.
In multiobjective multiagent settings the agents’ strategies are interrelated. For this reason, solution concepts, i.e., whether the agents in the system reach outcomes that are of interest, could be used to evaluate the algorithms’ performance. We detail below the solution concepts identified by Rădulescu et al. [126] for the MOMADM setting and present a few algorithmic approaches that employ them.
Coverage sets. The team reward and team utility setting in MOMADM represents a fully cooperative scenario, where all agents share the same rewards and derived utility. Since there is only one true utility function in the execution phase, coverage sets represent the right solution concept for this case, with the same motivation as for singleagent multiobjective decision making. Multiobjective coordination graphs (MOCoGs) represent one of the most studied models for cooperative multiobjective multiagent systems. They exploit the fact that in multiagent systems the rewards or agents can often be factorised into smaller components. Numerous algorithmic approaches focus on finding (approximate) Pareto coverage sets (3.2) like for example multiobjective bucket elimination (MOBE) [147, 150], multiobjective Russian doll search [148], multiobjective (AND/OR) branchandbound tree search [94, 95, 149], Pareto local search [67], and multiobjective maxsum [33]. Another frequently used model is the cooperative multiobjective stochastic game (MOSGs), where reinforcement learning or evolutionary algorithms were used to derive coverage sets (e.g., [90, 91, 209]). Similar methods were proposed for the individual reward and utility setting, where a coverage set can also be a set of possible best responses to the behaviours of the other agents (e.g., [10, 40, 41]). In an individual reward–team utility setting, coverage sets could be used if all agents agree (e.g., through negotiation [71]) upon which alternative joint policy from the coverage set to execute.
Equilibria and stability concepts. In the individual utility scenario, the utility derived by each agent from the received reward is different, regardless if this reward is the same or not for all the agents. Suitable solution concepts for dealing with decision making between selfinterested agents are game theoretic equilibria (e.g., Nash equilibria [106], correlated equilibria [9]). We find here works that study the idea of robust equilibria in multiobjective games [125, 210] or how equilibria are affected by the use of the different optimisation criteria [127]. Furthermore, knowledge transfer [165] and opponent modelling [128, 212] also become more important in this context.
When binding agreements among agents are possible, solution concepts from cooperative game theory can also apply to individual utility settings. Coalition formation can therefore become a central problem in these cases, i.e., finding (sub)groups of agents that are willing to make such a binding agreement with each other [65].
Mechanism design. In game theory, the field of mechanism design takes the system’s perspective for multiagent decision problems. This implies taking as input both the original decision problem (where the agents have individual reward functions that are unknown to the other agents and the “owner” of the game), as well as a social welfare function. The aim is to design a system of additional payments that would (a) force the agents to be truthful about their individual utilities, and (b) lead to solutions that are (approximately) optimal under the social welfare function. In multiobjective settings, the situation is more complex, as the individually received rewards determine the utilities via individual, private utility functions. In general, it can be very challenging, or even impossible to articulate these functions, so being “truthful” about one’s utility might be infeasible from the getgo. Nevertheless, it is possible to design mechanisms for some multiobjective multiagent problems if the individual utilities can be articulated (e.g., [55, 123, 131]).
For an indepth overview of solution concepts for multiobjective multiagent decision making, the interested reader is referred to a recent survey by Rădulescu et al. [126].
8 Evaluating the performance of multiobjective decision making algorithms
Unlike in singleobjective RL, there is not only one optimal solution in multiobjective problem settings. MORL algorithms therefore often produce solution sets (see Sect. 3.2). This complicates the evaluation and comparison procedure of MORL algorithms: When is one solution set better than the other? What properties should a solution set have, and how do we measure those?
In this section, we give an overview of existing evaluation metrics, starting with axiomaticbased ones (Sect. 8.1) then moving on to utilitybased metrics (Sect. 8.2). Axiomatic metrics assume that the optimal solution is the true Pareto front (or convex hull), and try to compare to this in aspects like spread, coverage, or distance. However, these axiomatic metrics are often difficult to interpret from a user perspective. As argued in Sect. 4, the development of MORL solutions should be driven by the perspective on user utility. Similarly, utilitybased evaluation metrics should be used when assessing MORL algorithms.
After giving an overview on evaluation metrics and approaches, we briefly discuss potential pitfalls when using value function approximations in MORL settings in Sect. 8.3. Section 8.4 gives an overview of existing benchmarks and their properties.
8.1 Axiomaticbased evaluation metrics
In this section we give an overview of axiomatic approaches to evaluating solutions to multiobjective decision making problems. Such approaches were widelyused in early literature in the field.
8.1.1 The hypervolume metric
The hypervolume metric [218] has been widely used to evaluate the performance of multiobjective decision making algorithms (e.g., [90, 175, 184, 185, 194, 209]). The hypervolume metric measures the (hyper)volume in valuespace Paretodominated by the set of policies in an approximate coverage set. This correlates with (but is not equal to) the spread of a set of undominated solutions over the possible multiobjective solution space. For this reason, it has been used to compare the sets of solutions produced by multipolicy algorithms (or indeed single policy algorithms run multiple times with different scalarisation/utility function parameters). The accuracy of any set of solutions produced by an algorithm can be evaluated by comparing its hypervolume with that of the nondominated set produced by a competing algorithm, or with that of the true Pareto front of the application domain (if known). In domains where the true Pareto front is known, the hypervolume represents an absolute maximum level of performance that may be achieved in terms of coverage of the set of solutions over the objective space:
where \(\texttt {Volume}({\mathbf{V}}_{\mathrm{ref}}, {\mathbf{V}}^\pi )\) is the volume of the hypercube spanned by the reference vector, \({\mathbf{V}}_{\mathrm{ref}}\), and the vector in the CS, \({\mathbf{V}}^\pi\).
Figure 5 illustrates the hypervolume of a set of undominated solutions with respect to a given reference point, \(\mathbf{V}_{\text {ref}}\), for a 2objective maximisation problem, where both objectives are to be maximised. For convenience, a reference point in the multiobjective space is often used when calculating the hypervolume of a nondominated set. This reference point may be chosen arbitrarily.
Although widely used in the literature, the hypervolume metric has a number of problems. The most significant of these is that hypervolume values are difficult to interpret, as they do not map to any realworld notion of value or utility. When comparing the hypervolume of two competing sets of solutions, the benefit of a certain increase or decrease in hypervolume is not readily apparent to the end user. Adding just one nondominated solution at the extreme ends of the objective ranges could lead to a large increase in the hypervolume of a nondominated set, even if this additional solution is of little interest to the end user. Conversely, adding a new solution that is close to other solutions in the nondominated set can result in a minimal increase in hypervolume, even if the new solution is valuable to the end user. Finally, it is unlikely that the true set of nondominated solutions will be known a priori for any nontrivial multiobjective decision making applications. This invalidates one of the main arguments for the use of the hypervolume metric, i.e., evaluating the coverage of a set of solutions with respect to a reference set. Furthermore, the hypervolume is only applicable to settings where every Paretonondominated policy potentially contributes to the utility. This is not always the case. For example, when the utility function is known to be linear, the hypervolume is not applicable as many policies that would contribute to the hypervolume are known to not improve utility (i.e., all concave regions in the Pareto front). For these reasons, we recommend the use of alternative metrics that better reflect the usefulness of the solutions produced by an algorithm (such as the user’s utility).
8.1.2 Sparsity of coverage sets
The information that is contained by metrics like the hypervolume is rather limited. The only guarantee we have is that if the hypervolume is maximised (unless there are points that contribute 0 hypervolume at the edges that we have missed), then a Pareto Coverage Set has been recovered. This is of course not informative, especially during learning.
One key bit of critique is that if we have two approximate solution sets with equal – or approximate – hypervolume, then we should prefer the set which has more spread over the value space. In other words, the set that contains value vectors that are furthest apart from each other is the better one. From a utilitybased perspective, this is also intuitive, as the user will pick the best vector from a solution set \({\mathcal {S}}\) according to:
so it helps if the user has a larger variety of value vectors to select from.^{Footnote 4}
In multiobjective optimisation, this idea has been used to create algorithms that explicitly look for diverse solutions [31]. In MORL, this same idea has been used to diversify the experience replay buffer, in order to be able to adapt to different utility functions faster [4].
In addition to finding solution sets that are evenly spread over the value space (i.e., sets with high diversity), it is desirable that the solutions provide a dense coverage of the whole Pareto front (i.e., sets with high resolution), so that the user has more options to choose from. For this purpose, Xu et al. [204] proposed to combine the hypervolume metric with a sparsity metric for evaluation of Pareto front approximations. The proposed sparsity metric is defined as:
Here \({\mathcal {S}}\) is the Pareto front approximation for an environment with m objectives, and \(\tilde{{\mathcal {S}}}_j(i)\) is the ith value in the sorted list for the jth objective values in \({\mathcal {S}}\). Pareto front approximations that combine a high hypervolume metric with a low sparsity metric are considered better. Care should be taken to ensure that values of different objectives are at similar scale, e.g., through normalisation, so that one objective is not given higher importance than the others.
8.1.3 The \(\varepsilon \)metric
The \(\varepsilon\) metric [217] measures how closely a solution set \({\mathcal {S}}\) approximates the Pareto front PF. It has been widely used in multiobjective evolutionary optimisation [217] and reinforcement learning [176]. There are two measures, the additive and the multiplicative \(\varepsilon\)indicator.
The additive \(\varepsilon\)indicator is given by
where n is the number of objectives and \({\mathbf {V}}^\pi \in {\mathbb {R}}^d\) is the ddimensional value of policy \(\pi\) (see Sect. 3). In words, a solution set \({\mathcal {S}}\) is an \(\varepsilon\)approximate Pareto front according to this metric if for each value vector \({\mathbf {V}}^\pi\) on the Pareto front PF, there exists at least one value vector \({\mathbf {V}}^{\pi '}\) in the solution set \({\mathcal {S}}\), such that for each objective d, the value in \({\mathbf {V}}^{\pi '}\) is at most \(\varepsilon\) smaller than the values in \({\mathbf {V}}^\pi\).
The (less commonly used) multiplicative \(\varepsilon\)indicator is given by
The difference to the additive indicator is in how the distance is calculated: here, each objective can at most be worse by a multiplicative factor of \(1+\varepsilon\), i.e., this scales with the magnitudes of the individual values (objectives with larger values allow a larger deviation).
The \(\varepsilon\) metric gives an indication of the factor by which an approximate solution set is worse than the Pareto Front, considering all objectives. It can also be used to compare two arbitrary solution sets instead of a solution set and the Pareto front. Unlike the hypervolume metric, it can give an indication of whether one is better than the other (they might, however, be incomparable w.r.t. this metric).
We argue that the \(\varepsilon\) metric is more useful than the hypervolume, since it can directly be used to derive a utility for a given user, see Sect. 8.2.
8.1.4 Metrics from information retrieval
The Coverage Ratio metric is used in [208] as an evaluation metric for comparing different multiobjective algorithms. It is a measure of the count of policies recovered from a finite Coverage Set (CS) which is determined by a comparison between the set \({\mathcal {S}}\) of policies \(\phi\) with value vectors \({\mathbf {V}}^{\pi }\in {\mathbb {R}}^d\) found by a MORL algorithm, and the value vectors corresponding to the policies in the (groundtruth) CS. The measure weights both the precision and recall of finding policies in the CS, the following definition is used when calculating precision and recall such that policies with value vectors within epsilon of the value vector of a policy in CS, are classed in the CS.
The Coverage Ratio (also known as the Fscore) is then calculated as the harmonic mean between the \(\text {precision} ={\mathcal {S}}\cap _{\epsilon } CS/{\mathcal {S}}\) and \(\text {recall} ={\mathcal {S}}\cap _{\epsilon } CS/CS\) measures.
We argue that there are several issues with the Coverage Ratio metric. Firstly, the measure (Equation 23) implies that precision and recall are equally important which in reality is not the case. For example, if the utility function is linear, and \({\mathcal {S}}\) contains excess policies that have a value which is the weighted sum of two other value vectors. Such vectors need not be in the CCS (convex coverage set, see Equation 10) and decreases the precision. However, having this excess policy does not decrease the utility for any linear utility function. Conversely, missing out a policy in the CS typically does decrease the utility for a whole range of utility functions.
Secondly, like the hypervolume metric, the Coverage Ratio does not account for the different levels of utility a user gains from different solutions. In this measure the presence of any solution from the CS is treated as of equal value to any other solution found also in the CS. However, unlike the hypervolume metric the Coverage Ratio is not correlated at all with the spread of the set of nondominated solutions over the possible multiobjective space. Therefore, this measure fails to account for any utility the user gains from the spread of solutions but retains the undesirable properties of the hypervolume metric.
Thirdly, the \(\varepsilon\)parameter that controls the threshold for when we consider a policy in the Convex Coverage Set is a parameter that needs to be chosen and can have a large impact on the Coverage Ratio. Specifically, if a value estimate \(\tilde{\mathbf{V}}\) is within the hypercube surrounding a value in the CS, \(\mathbf{V}_i \pm \varepsilon\), it is assumed to correspond to that value vectors in the CS. Setting \(\varepsilon\) arbitrarily high can lead to all solutions being treated in the CS, while setting it low could lead to an algorithm producing no solutions in the CS. When comparing different algorithms, the choice of \(\varepsilon\) could have a large impact on the final ranking and it is not a priori clear what a fair setting of \(\varepsilon\) would be.
Lastly introducing the notion of \(\varepsilon\) to the recall measure means that multiple counting of policies in the CS can occur as more than one value vector in F could be within \(\varepsilon\) distance of the same policy’s value vector in the CS. Not only can an algorithm thus “recall” more than the ground truth number of policies, but more importantly, “recalling” two policies for the same ground truth policy can obfuscate the missing of another policy in the resulting value of the metric, which is of course highly undesirable.
8.2 Utilitybased evaluation metrics
As argued in Sect. 4, the utilitybased approach is preferable in most scenarios, since here the algorithms are designed and evaluated with respect to the utility that the solution can offer to the user. Accordingly, any evaluation metric should take this into account. Many of the axiomaticbased evaluation metrics are difficult to interpret in terms of user utility, and in addition they often require access to the true Pareto front. If it is possible to assess the utility of the user at time of deployment, then solution sets can be compared based on user utility.
For instance, the user’s utility might correspond to revenue that a deployed solution achieves; in this case, the utility can be measured and compared directly.
For when this is not possible, Zintgraf et al. [215] propose two utilitybased evaluation methods, the expected utility metric (EUM) and maximal utility loss (MUL). Compared to many other metrics such as the hypervolume metric, these are more suitable to compare different algorithms, since they are aimed at directly evaluating an agent’s ability to maximize user utility, which is always our ultimate goal.
For a given solution set, the EUM is defined as the expected utility for a user from this solution set, under some prior distribution over user utility functions. Under the SER optimality criterion, this can be written as:
where \({\mathcal {S}}\) the solution set outputted by an algorithm, u the utility function of the user, \(\mathbf{V}^\pi\) the vectorvalue of the best policy from that set (according to u). The expectation is taken with respect to the distribution over utility functions \(P_u\). This metric is useful in situations where we care about the agent’s ability to do well across many different utility functions, e.g., because many policies from the solution set will be used over time, or because they will be used for different users. This metric does however require a good prior over possible scalarisation functions in order to meaningfully evaluate a given solution set.
The MUL measures the maximal loss in utility that occurs when taking a policy from a given solution set, instead of the full set of possibly optimal solutions. Under the SER optimality criterion, this can be written as:
where \({\mathcal {S}}^*\) is the true optimal solution set (PF or CH, or a very good approximation thereof), \({\mathcal {S}}\) the solution set outputted by an algorithm, \(\mathbf{V}^\pi\) the vectorvalue of said policy, and u the utility function of the user, over which we take the maximisation with respect to the space of possible utility functions \({\mathcal {U}}\). Since it is often infeasible to compute the full set of optimal solutions in order to compute this metric and compare algorithms, a good reference set can be used (such as the union of multiple solution sets, e.g., the final solution sets of all algorithms evaluated in a comparison).
We note that MUL is bounded if an \(\varepsilon\)bound can be given on the accuracy of the set \({\mathcal {S}}\) produced by the algorithms, and the utility function is guaranteed to be (Lipschitz)continuous [215].
8.3 A word of caution regarding value vector approximation
In multiobjective planning, it is often the case that for a given policy, \(\pi\), we know the exact value vector, \(\mathbf{V}^\pi\). When the human decision maker (or even another algorithm) selects such a policy in the selection phase (see Sect. 5.1), we can thus trust these value vectors. This is key as we derive utility from these value vectors by applying the utility function to them (either implicitly or explicitly).
In multiobjective reinforcement learning it is tempting to think we have proper value vectors too, as many algorithms produce value vector estimates. In the literature these are often also denoted \(\mathbf{V}\) or \(\mathbf{Q}\). However, it is essential to note that these are stochastic estimates, that may well have both high variance, or even systematic biases [58]. This issue is exacerbated by the use of function approximators, such as neural networks, which may have their own added variance and/or biases. It would therefore be fairer to explicitly denote such value estimates as estimates by using tildes, \(\tilde{\mathbf{V}}\) or \(\tilde{\mathbf{Q}}\), for example, but this is not common practice.
In multiobjective RL, having inexact value estimates in the coverage set presented to human decision makers (or other algorithms), can lead to missing on two sides: firstly, if the value estimate of the actual best policy is off, that policy may not be selected. Furthermore, the value estimate of the policy that is selected may also be off, leading to a different utility than expected. Combined, these two sources of potential loss can severely affect the user utility.
In order to mitigate the issue of inexact value estimates, and maybe even more importantly, biases in value estimates, we recommend that the coverage sets presented in selection phases do not directly rely on the value vector estimates from the MORL algorithms. Instead, we recommend to extract the policies that constitute the coverage set, and run a separate and thorough policy evaluation, before selecting any policy to execute.
8.4 Benchmark problems for multiobjective decision making
Well established benchmark problems are important for evaluation of reinforcement learning algorithms, since even small variations in the experiment design may have a significant impact on the results. By using common benchmarks a fair comparison of different approaches can be ensured, and by evaluating algorithms on several benchmarks the generality of results can be studied. Table 3 presents an overview of frequently used MORL benchmarks with discrete states and actions, as well as more recent extensions and additions with highdimensional states, partial observability, multiple agents, and continuous actions.
For multiobjective decision problems other than MOMDPs, such as multiobjective coordination graphs [94, 145, 149] and multiobjective normal form games [127, 212], benchmarks are also few and spread out over different papers.
9 An illustrated example
In Sect. 2, we illustrated diverse problems that require multiobjective optimisation. One of them is the water management problem (Sect. 2.2). In that case, as we want to propose a diverse set of solutions to the decision maker—following the taxonomy defined in Sect. 5—we are in a multipolicy scenario. Moreover, the proposed policies should be deterministic instead of stochastic (see Sect. 5.2.3). For example, a stochastic policy with a nonzero probability of emptying the reservoir should not be considered, as it might have catastrophic consequences downstream. Finally, since the policy is executed every day, the utility of the user is derived from multiple executions of the policy. The optimality criterion used is thus the SER criterion (see Sect. 5.3).
In this section, we concretely tackle the water management problem by applying a multiobjective algorithm, and comparing it with its singleobjective counterpart, with both producing deterministic policies. We make no assumptions about the utility function of the decision maker, except that it is monotonically increasing. Thus, the solution set we aim to produce is the Pareto coverage set of deterministic nonstationary policies.
9.1 Setting
In this problem, the goal is to control a dam responsible for dispatching water downstream while avoiding flooding in the region. This environment is modelled as a onedimensional, continuous state, representing the amount of water present in the reservoir. This is subject to change, depending on factors such as rain. At each timestep, the dam can release a specified water amount (a onedimensional, continuous action)^{Footnote 5}.
The dam is responsible for supplying water, and needs to meet the water demand. At the same time, it should be careful not to hold too much water, as this increases the risk of flooding upstream. These objectives are conflicting, since the inflow of water is on average insufficient to cope with the water demand. In order to increase the chance of meeting future demand, the reservoir needs to be filled, thus increasing the risk of flooding upstream.
9.2 Multiobjective natural evolution strategies
With an unknown utility function, which may be nonlinear, we want to propose a set of alternatives to the decision makers by approximating the Paretofront. Since we have no prior information about the preferences of the decision makers, we make no assumptions about the utility function.
The algorithm that is used to approximate the coverage set is MultiObjective Natural Evolution Strategies (MONES) [115]. In essence, a parametric policy is used, where each parameter is represented by a Gaussian distribution. Sampling these distributions results in an executable policy that can be applied on our environment. MONES optimises the mean and standard deviation of each parameter such that, whenever we sample from them, we obtain a policy that leads to a different point on the Pareto front. MONES is an iterative process, that repeats three steps:

1.
Sample a population of policies from our parameters, and execute them on the environment;

2.
Evaluate the quality of these policies using an indicator metric;

3.
Perform a gradient step that improves this indicator using the natural gradient.
Our policy is represented as a small feedforward neural network, where each weight is sampled from its own Gaussian distribution. The network contains a single hidden layer of 50 neurons and, although these weights are correlated with each other, we assume each Gaussian distribution to be independent for the sake of simplicity.
To evaluate the performance of the policies, MONES requires an indicator metric. We closely follow [115], where the metric used is a combination of nondominance ranking and crowding distance. For nondominance ranking, a rank of 0 is applied for all the nondominated points of the discovered returns. By removing these solutions from the population, a new set of policies becomes nondominated. We set the rank for these points to 1. This process is repeated until no points remain to be evaluated. At each iteration the rank is decreased by 1.
This rank is then combined with the crowding distance, which is a metric providing information about the diversity of a frontier:

For all the points of the same rank we compute, for each dimension, the distance between its closest neighbours. As the dimensions might have different ranges (each represents a separate objective), they are normalised using their lowest and highest value. The crowding distance is thus relative to the points of the same rank.

This distance is normalised. Points close to each other will have a crowding distance close to 0, while points at the border of the frontier will have a distance close to 1.
Summing these 2 metrics together provides us with an indicator that encourages points to be on the Pareto front, and be as diverse as possible.
The MONES learner is trained for 30 iterations, sampling 50 policies every time. Each policy is executed 10 times. The average return of each policy is evaluated using the nondominance/crowdingdistance metric.
As can be seen in Fig. 6, after 30 iterations of training, the policies sampled from the Gaussian distributions achieve diverse combinations of returns. The right part of the figure shows 11 nondominated solutions, but the vast majority of the policies (48/50) reach returns reasonably close to the frontier, resulting in a set of diverse, highquality solutions.
9.3 Using singleobjective subroutines
Instead of using a dedicated method (MONES) to discover diverse policies, we use an outer loop method. In this particular case, we use Natural Evolution Strategies (NES) as a single objective subroutine. This subroutine is called a number of times, each time with a different utility function, hopefully resulting in different policies that reach different points of the coverage set.
This requires us to know the distribution over user utility functions. We consider a uniform distribution over linear scalarisation functions, i.e. each utility function is a weighted sum, where the weights are uniformly sampled from a 1simplex (since our problem has two objectives).
Since MONES takes inspiration from NES, both algorithms are quite similar, the main difference being the indicator metric used. While MONES optimises on the combination of ranking and crowding distance, NES optimises on the utility of the return. All other parameters are kept the same as for MONES.
We sampled 30 utility functions, resulting in 30 different NES runs. In Fig. 7, we compare the coverage sets found by MONES and NES. In order to have the same number of points for each method, we sampled 30 new policies from our trained MONES and plotted the resulting returns.
Although the utility functions used by NES were sampled across the whole simplex, it does not result in a spreadout coverage set. The vast majority of returns hover around \((1.0, 10.7)\), where 1.0 refers to flooding and 10.7 to water demand. In comparison, the policies sampled from MONES result in more spreadout returns. This is because the crowding distance is taken into account in the indicator metric used by MONES, encouraging the returns to be diverse. It is also important to mention that, even though hyperparameters are the same for both methods, NES had to be trained 30 separate times, while MONES only once. Compared to NES, MONES is sampleefficient and results in a more evenly spreadout coverage set. Finally, NES requires us to make assumptions about the distribution over user utility functions, while MONES makes no such assumption.
9.4 Comparing evaluation metrics
In order to evaluate the training progression of MONES after each iteration, we use two of the metrics described in Sect. 8. First, we use the hypervolume (Equation 17). This metric requires a reference point, which in this case is set to the worst return found across the whole training process. Secondly, we use the Expected Utility Metric (EUM, Equation 24). This metric requires a good prior distribution over user utility functions, as well as a good approximation of the solution set. As a prior distribution, we choose the one used for NES, e.g., a uniform distribution over linear scalarisation functions.
As we can see in Fig. 8, MONES converges towards our approximation of the optimal solution after 15 iterations, and stays stable for the remainder of the training process. We observe similar trends with the other evaluation metric used (the hypervolume). We also note at iteration 7, that the sudden improvement in EUM is reflected in the hypervolume, although not as drastically. The solution set found in iteration 7 almost doubled the expected user utility, but only resulted in a \(30\%\) increase in hypervolume. This shows that, although an increase of hypervolume is correlated with an improvement of the coverage set, it does not reflect the utility of the user.
In conclusion, we tackled the water management problem with a dedicated multiobjective algorithm. By changing the indicator metric of NES to cope with multiple criteria, we discovered a solution set that is more diverse than repeatedly applying its singleobjective counterpart, and for which the solutions are more evenly spread out. Moreover, MONES only requires minimal assumptions on the utility function (monotonicity). Finally, the number of required interactions with the environment is vastly improved since, in this case, 30 instances of NES needed to be executed, compared to just a single instance of MONES.
10 Conclusion, challenges and open questions
Recent years have seen significant breakthroughs in the capabilities of sequential decision making agents based on planning or reinforcement learning methods. This has led to the increasing applications of these agents to complex, realworld problems. However, as illustrated by the motivating examples in Sect. 2, these realworld tasks frequently require tradeoffs to be made between multiple conflicting objectives. This contrasts with the inherently singleobjective nature of the environments such as board and video games on which the planning and learning algorithms have largely been developed and evaluated. When these singleobjective methods are applied to problems which are multiobjective in nature, either some objectives wind up being excluded from consideration, or the objectives are summed together to form a scalar reward. As discussed in Sects. 1 through 4, the use of singleobjective methods to address multiobjective problems has numerous disadvantages: it forces a priori and uncertain decisions to be made about the desired tradeoffs between objectives, it limits the ability to discover multiple policies so as to rapidly adapt to changing preferences, it shifts the responsibility for managing tradeoffs from the problem stakeholders to the system developer, and it may result in solutions which fail to maximise user utility.
While the last decade or so has seen significant achievements in the development of planning and RL algorithms for multiobjective problems (as reflected in Sect. 7), it remains a niche area compared to the amount of research on singleobjective agents. In addition a number of challenges arise in the context of multiple objectives which do not exist in the singleobjective domain. As such there remain a number of areas where additional research and algorithmic development is required. The remainder of this article will present an overview of the topics which we believe to be the most significant and pressing challenges for multiobjective agent research.
10.1 Lack of multiobjective datasets and benchmarks
Data plays a role in multiobjective decision making (MODM). When solving a given MO problem, data is usually needed to characterise and solve the involved objectives. However, the currently available data may not be sufficient to model some objectives or domains. Whereas this tends not to be an issue for companyoriented research (since companies can usually obtain the required data should it be necessary to achieve its goals), it is often a significant problem for basic research (where the lack of data may make it impossible to study some problems). Some of the challenges faced here include: heterogeneity, availability, and lack of correlation.
As an example, consider the case of traffic authorities aiming to optimise the control of traffic lights as to minimise competitive objectives like travel time, polluting emissions, and discomfort level. To accommodate all these objectives, one needs to deal with all the aforementioned challenges. Data heterogeneity comes into play because the data comes from different sources: data about specific trips come from drivers and passengers; overall traffic statistics come from traffic authorities; fleet demographics come from censoring authorities; \(\hbox {CO}_2\) emission profiles come from manufacturers, based on existing fleet; etc. Availability refers to the fact that the above information is not openly available, either because of privacy concerns, or due to the lack of data release policies. Finally, the data may not be correlated, as is the case of traffic statistics and fleet demographics, which come from different, possibly incompatible sources.
A challenge related to the availability of data is the availability of good benchmark problems for evaluation of MORL algorithms. So far, a limited number of benchmark problems have been proposed for MORL research, and many of those proposed are quite simple (see Sect. 8.4). Some advantages of the existing benchmark problems are, e.g., that they are simple to understand, experiments can be run in a short time, and optimal solutions to the problems are often known. However, while recent work has proposed benchmarks with increased complexity [4, 110, 204], they still do not capture the diversity of realworld problems that deal with multiple conflicting objectives. Referring to our motivating examples for multiobjective reinforcement learning and planning (see Sect. 2), we note that more benchmarks with complex state and action spaces, partial observability, many objectives, multiple agents, and decision making over long time horizons are needed. One approach to quickly increase the number of available MORL benchmarks could be to modify existing singleobjective benchmarks, by making their reward functions multiobjective.
In conclusion, these challenges frequently slow down or even hinder research progress on MODM. Building upon this background, it is of utmost importance for companies and researchers to make their data and benchmarks available. Actions towards this direction include: making your MO problems, data, benchmarks, and baseline implementations available on open platforms; supporting other researchers interested in your problem; negotiating data retention procedures with companies; among others. Without the support of the involved parties, the potential benefits that MODM could bring to our society may not be realised.
10.2 Manyobjective problems
Within the field of multiobjective evolutionary optimisation, the task of handling problems with many objectives (usually defined as four or more objectives) has emerged as a distinct subfield, in recognition that algorithms which work well for a small number of objectives may scale poorly to many objectives [84, 190]. So far there has been only minimal work in planning or RL for manyobjective problems. For instance, Zintgraf et al. [216] consider a traffic regulation problem with 11 objectives (reflecting the delay duration and queue length for different traffic participants and different directions), and how to elicit and model user utility in such a setting using pairwise comparison queries between solutions and Gaussian processes. Giuliani et al. [53] demonstrate a dimensionalityreduction approach in which the original objectives are mapped to a lower dimension using Nonnegative Principal Component Analysis, while Yahyaa et al. [205] examined the performance of bandit algorithms on problems with up to five objectives. However, the development of a broader suite of algorithms for manyobjective problems remains an important direction for future work.
10.3 Multiagent problems
Numerous realworld problems involve both multiple actors and multiple objectives that should be considered when making a decision. Multiobjective multiagent systems represent an ideal setting to study such problems. However, despite its high relevance, it remains an understudied domain, perhaps due to the increasingly complex dimensions involved.
Prior to the recent survey on multiobjective multiagent decision making [126], the literature in this area was rather fragmented and lacked a uniformly accepted framework or set of assumptions to allow for a proper comparison or placement of contributions and to identify gaps in terms of studied settings. Following the taxonomy set out by [126], and the links that have been made to suitable solution concepts for MOMADM (briefly discussed in Sect. 7.2.6), we anticipate that the pace of research on multiobjective multiagent problems will increase in the coming years.
There are countless open challenges brought forward by the MOMADM domain [126], ranging from how to develop negotiation strategies for selecting between multiple potential solutions, how equilibria are affected by the choice of the optimisation criteria (SER vs. ESR, Sect. 5.3) and utility functions of the agents, how to learn about the behaviour or objective preferences of other agents, how to deal with sequential or continuous stateaction settings.
Finally, all the observations and remarks regarding the scarce availability of datasets and benchmarks we make in Sect. 10.1 are even more pressing in the case of multiobjective multiagent settings, rendering the evaluation of MOMADM approaches a challenging task.
10.4 Dynamically identifying and adding objectives
As discussed earlier in Sect. 5.1 under the “review and adjust” scenario, an analysis of the policy found based on an initial formulation of a problem may reveal the need to modify or extend the objectives considered by the agent in order to find a more acceptable solution. While prior work in the singleobjective literature has considered modifying aspects of the problem either during or after learning or planning, such as changes in environmental state dynamics [105], dynamic rewards [181], or introducing new actions [88], obviously the addition of new objectives is unique to multiobjective methods.
Ideally the agent should be able to integrate additional or modified objectives without needing to discard prior learning, and with minimal regret experienced while adjusting its policy. One means by which this might be achieved is to maintain an archive of the agent’s experience under its current policy, which can be used to perform offline learning related to updated specifications of objectives without any further interaction with the actual environment. Alternatively, during learning the agent may be able to identify for itself states that could be associated with potential new objectives (for example, states which are highly different in feature space from other states), and create its own rewards associated with these states such that its policy can be rapidly updated should the user define a new objective associated with these states [73].
10.5 Closing remarks
The aim of this article is to encourage a wider adoption of multiobjective agent technologies in the development of realworld applications. To this end, we have identified a range of factors which need to be considered when designing a multiobjective solution, as well as reviewing how current multiobjective planning and RL algorithms relate to these factors. In addition, we have provided examples demonstrating how existing methods can be applied to discover suitable agent policies for some simple multiobjective tasks. Our hope is that this article will serve to inspire the future growth of applications based on multiobjective agents.
Notes
Note that when an objective is formulated in terms of costs rather than rewards, we can add a minus to the value and maximise, rather than minimise for this objective without loss of generality.
We note that iterating through a series of utility functions might at first glance seem similar to the naive approach of trying out different utility functions until a reasonably satisfactory result is attained, i.e., the singleobjective approach we argued against in the introduction. Where outer loop methods fundamentally differ however, is that they use the series of utility functions as a tool to construct an (approximate) coverage set, that can be subsequently used in a separate selection phase (see Fig. 3). As such, from a user’s point of view, there is no noticeable difference between using an outer or an inner loop method; they both produce a coverage set.
The hypervolume measures the volume of objective space which is dominated by a set of solutions – see Sect. 8.1.1 for more details.
Please note that this intuition implicitly assumes some form of continuity in the user’s utility function, u.
The code used to illustrate this setting can be found at the following link: https://gitlab.ai.vub.ac.be/mreymond/morlguide.
References
Abdelfattah, S., Merrick, K., & Hu, J. (2019). Intrinsically motivated hierarchical policy learning in multiobjective markov decision processes. IEEE Transactions on Cognitive and Developmental Systems.
Abdolmaleki, A., Huang, S., Hasenclever, L., Neunert, M., Song, F., Zambelli, M., Martins, M., Heess, N., Hadsell, R., & Riedmiller, M. (2020). A distributional view on multiobjective policy optimization. In: International Conference on Machine Learning, (pp. 11–22). PMLR.
Abdullah, M., Yatim, A., Tan, C., & Saidur, R. (2012). A review of maximum power point tracking algorithms for wind energy systems. Renewable and Sustainable Energy Reviews, 16(5), 3220–3227.
Abels, A., Roijers, D., Lenaerts, T., Nowé, A., & Steckelmacher, D. (2019). Dynamic weights in multiobjective deep reinforcement learning. In: International Conference on Machine Learning, (pp. 11–20). PMLR.
Aho, J., Buckspan, A., Laks, J., Fleming, P., Jeong, Y., Dunne, F., Churchfield, M., Pao, L., & Johnson, K. (2012). A tutorial of wind turbine control for supporting grid frequency through active power control. In: American Control Conference (ACC), pp. 3120—3131.
Aissani, N., Beldjilali, B., & Trentesaux, D. (2008). Efficient and effective reactive scheduling of manufacturing system using sarsamultiobjective agents. In: MOSIM’08: 7th Conference Internationale de Modelisation et Simulation, pp. 698–707.
Antonio, L. M., & Coello, C. A. C. (2017). Coevolutionary multiobjective evolutionary algorithms: Survey of the stateoftheart. IEEE Transactions on Evolutionary Computation, 22(6), 851–865.
Aoki, K., Kimura, H., & Kobayashi, S. (2004). Distributed reinforcement learning using bidirectional decision making for multicriteria control of multistage flow systems. In: The 8th Conference on Intelligent Autonomous Systems, pp. 281–290.
Aumann, R.J. (1987). Correlated equilibrium as an expression of bayesian rationality. Econometrica: Journal of the Econometric Society, pp. 1–18.
Avigad, G., Eisenstadt, E., & Cohen, M.W. (2011). Optimal strategies for multi objective games and their search by evolutionary multi objective optimization. In: 2011 IEEE Conference on Computational Intelligence and Games (CIG’11), pp. 166–173. IEEE.
Barreto, A., Dabney, W., Munos, R., Hunt, J.J., Schaul, T., van Hasselt, H.P., & Silver, D. (2017). Successor features for transfer in reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 4055–4065.
Barrett, L., & Narayanan, S. (2008). Learning all optimal policies with multiple criteria. In: Proceedings of the 25th International Conference on Machine Learning, pp. 41–47.
Beliakov, G., Bowsell, S., Cao, T., Dazeley, R., MakHau, V., Nguyen, M.T., Wilkin, T., & Yearwood, J. (2019). Aggregation of dependent criteria in multicriteria decision making problems by means of capacities. In: 23rd International Congress on Modelling and Simulation. Modelling and Simulation Society of Australia and New Zealand. https://doi.org/10.36334/modsim.2019.B3.beliakov
Borsa, D., Barreto, A., Quan, J., Mankowitz, D.J., van Hasselt, H., Munos, R., Silver, D., & Schaul, T. (2019). Universal successor features approximators. In: International Conference on Learning Representations.
Bouneffouf, D., Rish, I., & Aggarwal, C. (2020). Survey on applications of multiarmed and contextual bandits. In: 2020 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8. IEEE.
Bryce, D., Cushing, W., & Kambhampati, S. (2007). Probabilistic planning is multiobjective. Arizona State University, Tech. Rep. ASUCSE, 07006.
Brys, T., Van Moffaert, K., Van Vaerenbergh, K., & Nowé, A. (2013). On the behaviour of scalarization methods for the engagement of a wet clutch. In:2013 12th International Conference on Machine Learning and Applications, vol. 1, pp. 258–263. IEEE.
Castelletti, A., Pianosi, F., & Restelli, M. (2012). Treebased Fitted Qiteration for MultiObjective Markov Decision problems. In: IJCNN, pp. 1–8. IEEE.
Castelletti, A., Pianosi, F., & Restelli, M. (2013). A multiobjective reinforcement learning approach to water resources systems operation: Pareto frontier approximation in a single run. Water Resources Research, 49(6), 3476–3486.
Castelletti, A., Pianosi, F., & SonciniSessa, R. (2008). Water reservoir control under economic, social and environmental constraints. Automatica, 44(6), 1595–1607.
Chen, W., & Liu, L. (2019). Pareto monte carlo tree search for multiobjective informative planning. In: Robotics: Science and Systems.
Chen, X., Ghadirzadeh, A., Björkman, M., & Jensfelt, P. (2019). Metalearning for multiobjective reinforcement learning. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 977–983. IEEE.
Chen, D., Wang, Y., & Gao, W. (2020). Combining a gradientbased method and an evolution strategy for multiobjective reinforcement learning. Applied Intelligence.
Cheng, H.T. (1988). Algorithms for partially observable Markov decision processes. Ph.D. thesis, University of British Columbia.
Cohen, J. E. (1998). Cooperation and selfinterest: Paretoinefficiency of nash equilibria in finite random games. Proceedings of the National Academy of Sciences,95(17), 9724–9731. https://doi.org/10.1073/pnas.95.17.9724. URL https://www.pnas.org/content/95/17/9724
Cruz, F., Dazeley, R., & Vamplew, P. (2019). Memorybased explainable reinforcement learning. In: Australasian Joint Conference on Artificial Intelligence, pp. 66–77. Springer.
da Silva Veith, A., de Souza, F.R., de Assunção, M.D., Lefèvre, L., & dos Anjos, J.C.S. (2019). Multiobjective reinforcement learning for reconfiguring data stream analytics on edge computing. In: Proceedings of the 48th International Conference on Parallel Processing, pp. 1–10.
Dazeley, R., Vamplew, P., & Cruz, F. (2021). Explainable reinforcement learning for broadxai: A conceptual framework and survey. arXiv preprint arXiv:2108.09003.
Dazeley, R., Vamplew, P., Foale, C., Young, C., Aryal, S., & Cruz, F. (2021). Levels of explainable artificial intelligence for humanaligned conversational explanations. Artificial Intelligence, 299, 103525.
Deb, K. (2011). Multiobjective optimisation using evolutionary algorithms: an introduction. In: Multiobjective evolutionary optimisation for product design and manufacturing, pp. 3–34. Springer.
Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2002). A fast and elitist multiobjective genetic algorithm: NSGAII. IEEE Transactions on Evolutionary Computation, 6(2), 182–197.
Deisenroth, M.P., Neumann, G., Peters, J., et al. (2013). A survey on policy search for robotics. Foundations and Trends® in Robotics2(1–2), 1–142.
Delle Fave, F., Stranders, R., Rogers, A., & Jennings, N. (2011). Bounded decentralised coordination over multiple objectives. In: Proceedings of the Tenth International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 371–378.
Deng, Z., & Liu, M. (2018). An integrated generationcompensation optimization strategy for enhanced shortterm voltage security of largescale power systems using multiobjective reinforcement learning method. In: 2018 International Conference on Power System Technology (POWERCON), pp. 4099–4106. IEEE.
Deng, Z., Lu, Z., Guo, Z., Yao, W., Zhao, W., Zhou, B., & Hong, C. (2020). Coordinated optimization of generation and compensation to enhance shortterm voltage security of power systems using accelerated multiobjective reinforcement learning. IEEE Access, 8, 34770–34782.
Dornheim, J., & Link, N. (2018). Multiobjective reinforcement learning for reconfigurable adaptive optimal control of manufacturing processes. In: 2018 International Symposium on Electronics and Telecommunications (ISETC), pp. 1–5. IEEE.
Drugan, M.M., & Nowe, A. (2013). Designing multiobjective multiarmed bandits algorithms: A study. In: The 2013 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE.
Duan, R., Prodan, R., & Li, X. (2014). Multiobjective game theoretic scheduling of bagoftasks workflows on hybrid clouds. IEEE Transactions on Cloud Computing, 2(1), 29–42.
Dubey, P., & Rogawski, J. (1990). Inefficiency of smooth market mechanisms. Journal of Mathematical Economics, 19(3), 285–304.
Dusparic, I., & Cahill, V. (2009). Distributed wlearning: Multipolicy optimization in selforganizing systems. In: 2009 Third IEEE International Conference on SelfAdaptive and SelfOrganizing Systems, pp. 20–29. IEEE.
Eisenstadt, E., Moshaiov, A., & Avigad, G. (2015). Coevolution of strategies for multiobjective games under postponed objective preferences. In: 2015 IEEE Conference on Computational Intelligence and Games (CIG), pp. 461–468. IEEE.
Elfwing, S., & Seymour, B. (2017). Parallel reward and punishment control in humans and robots: Safe reinforcement learning using the maxpain algorithm. In: 2017 Joint IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDLEpiRob), pp. 140–147. IEEE.
Ernst, D., Geurts, P., & Wehenkel, L. (2005). Treebased batch mode reinforcement learning. Journal of Machine Learning Research, 6, 503–556.
FalcónCardona, J. G., & Coello, C. A. C. (2020). Indicatorbased multiobjective evolutionary algorithms: A comprehensive survey. ACM Computing Surveys (CSUR), 53(2), 1–35.
Ferreira, P. V. R., Paffenroth, R., Wyglinski, A. M., Hackett, T. M., Bilen, S. G., Reinhart, R. C., & Mortensen, D. J. (2019). Reinforcement learning for satellite communications: from leo to deep space operations. IEEE Communications Magazine, 57(5), 70–75.
Finn, C., Abbeel, P., & Levine, S. (2017). Modelagnostic metalearning for fast adaptation of deep networks. In: International Conference on Machine Learning, pp. 1126–1135.
Gábor, Z., Kalmár, Z., & Szepesvári, C. (1998). Multicriteria reinforcement learning. In: ICML,98, 197–205.
Galand, L., & Lust, T. (2015). Exact methods for computing all lorenz optimal solutions to biobjective problems. In: International Conference on Algorithmic DecisionTheory, pp. 305–321. Springer.
Garcıa, J., & Fernández, F. (2015). A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1), 1437–1480.
Geibel, P. (2006). Reinforcement learning for MDPs with constraints. In: European Conference on Machine Learning, pp. 646–653. Springer.
Geibel, P., & Wysotzki, F. (2005). Risksensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research, 24, 81–108.
Giuliani, M., Castelletti, A., Pianosi, F., Mason, E., & Reed, P. M. (2016). Curses, tradeoffs, and scalable management: Advancing evolutionary multiobjective direct policy search to improve water reservoir operations. Journal of Water Resources Planning and Management, 142(2), 04015050.
Giuliani, M., Galelli, S., & SonciniSessa, R. (2014). A dimensionality reduction approach for manyobjective markov decision processes: Application to a water reservoir operation problem. Environmental Modelling & Software, 57, 101–114.
Govindaiah, S., & Petty, M.D. (2019). Applying reinforcement learning to plan manufacturing material handling part 1: Background and formal problem specification. In: Proceedings of the 2019 ACM Southeast Conference, pp. 168–171.
Grandoni, F., Krysta, P., Leonardi, S., & Ventre, C. (2010). Utilitarian mechanism design for multiobjective optimization. In: Proceedings of the twentyfirst annual ACMSIAM symposium on Discrete Algorithms, pp. 573–584. Society for Industrial and Applied Mathematics.
Guo, Y., Zeman, A., & Li, R. (2009). A reinforcement learning approach to setting multiobjective goals for energy demand management. International Journal of Agent Technologies and Systems (IJATS), 1(2), 55–70.
Hasan, M. M., Lwin, K., Imani, M., Shabut, A., Bittencourt, L. F., & Hossain, M. A. (2019). Dynamic multiobjective optimisation using deep reinforcement learning: benchmark, algorithm and an application to identify vulnerable zones based on water quality. Engineering Applications of Artificial Intelligence, 86, 107–135.
Hasselt, H. (2010). Double qlearning. Advances in Neural Information Processing Systems, 23, 2613–2621.
Hayes, C.F., Reymond, M., Roijers, D.M., Howley, E., & Mannion, P. (2021). Distributional monte carlo tree search for riskaware and multiobjective reinforcement learning. In: Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1530–1532.
Hayes, C.F., Reymond, M., Roijers, D.M., Howley, E., & Mannion, P. (2021). Riskaware and multiobjective decision making with distributional monte carlo tree search. arXiv preprint arXiv:2102.00966.
Horie, N., Matsui, T., Moriyama, K., Mutoh, A., & Inuzuka, N. (2019). Multiobjective safe reinforcement learning. Artificial Life and Robotics pp. 1–9.
Horwood, J., & Noutahi, E. (2020). Molecular design in synthetically accessible chemical space via deep reinforcement learning. arXiv preprint arXiv:2004.14308.
Hu, X., Zhang, Y., Liao, X., Liu, Z., Wang, W., & Ghannouchi, F.M. (2020). Dynamic beam hopping method based on multiobjective deep reinforcement learning for next generation satellite broadband systems. IEEE Transactions on Broadcasting.
Huang, S.H., Zambelli, M., Kay, J., Martins, M.F., Tassa, Y., Pilarski, P.M., & Hadsell, R. (2019). Learning gentle object manipulation with curiositydriven deep reinforcement learning. arXiv preprint arXiv:1903.08542.
Igarashi, A., & Roijers, D.M. (2017). Multicriteria coalition formation games. In: International Conference on Algorithmic Decision Theory, pp. 197–213. Springer.
Ikenaga, A., & Arai, S. (2018). Inverse reinforcement learning approach for elicitation of preferences in multiobjective sequential optimization. In: 2018 IEEE International Conference on Agents (ICA), pp. 117–118. IEEE.
Inja, M., Kooijman, C., de Waard, M., Roijers, D.M., & Whiteson, S. (2014). Queued pareto local search for multiobjective optimization. In: International Conference on Parallel Problem Solving from Nature, pp. 589–599. Springer.
Issabekov, R., & Vamplew, P. (2012). An empirical comparison of two common multiobjective reinforcement learning algorithms. In: Australasian Joint Conference on Artificial Intelligence, pp. 626–636. Springer.
Jalalimanesh, A., Haghighi, H. S., Ahmadi, A., Hejazian, H., & Soltani, M. (2017). Multiobjective optimization of radiotherapy: distributed qlearning and agentbased simulation. Journal of Experimental & Theoretical artificial intelligence, 29(5), 1071–1086.
Jin, J., & Ma, X. (2019). A multiobjective agentbased control approach with application in intelligent traffic signal system. IEEE Transactions on Intelligent Transportation Systems, 20(10), 3900–3912.
Jonker, C.M., Aydoğan, R., Baarslag, T., Fujita, K., Ito, T., & Hindriks, K. (2017). Automated negotiating agents competition (anac). In: ThirtyFirst AAAI Conference on Artificial Intelligence.
Juozapaitis, Z., Koul, A., Fern, A., Erwig, M., & DoshiVelez, F. (2019). Explainable reinforcement learning via reward decomposition. In: IJCAI/ECAI Workshop on Explainable Artificial Intelligence.
Karimpanal, T. G., & Wilhelm, E. (2017). Identification and offpolicy learning of multiple objectives using adaptive clustering. Neurocomputing, 263, 39–47.
Kluyver, T., RaganKelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic, J., Kelley, K., Hamrick, J., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S., Willing, C., & development team, J. (2016). Jupyter notebooks  a publishing format for reproducible computational workflows. In: F. Loizides, B. Scmidt (eds.) Positioning and Power in Academic Publishing: Players, Agents and Agendas, pp. 87–90. IOS Press, Netherlands. URL https://eprints.soton.ac.uk/403913/
Kooijman, C., de Waard, M., Inja, M., Roijers, D., & Whiteson, S. (2015). Pareto local policy search for momdp planning. In: ESANN 2015: Proceedings of the 23rd European Symposium on Artificial Neural Networks, Special Session on Emerging Techniques and Applications in MultiObjective Reinforcement Learning, pp. 53–58. URL http://www.cs.ox.ac.uk/people/shimon.whiteson/pubs/kooijmanesann15.pdf
Krasheninnikova, E., García, J., Maestre, R., & Fernández, F. (2019). Reinforcement learning for pricing strategy optimization in the insurance industry. Engineering Applications of Artificial Intelligence, 80, 8–19.
Laber, E. B., Lizotte, D. J., & Ferguson, B. (2014). Setvalued dynamic treatment regimes for competing outcomes. Biometrics, 70(1), 53–61.
Lacerda, A. (2017). Multiobjective ranked bandits for recommender systems. Neurocomputing, 246, 12–24.
Lee, C. S. (2012). Multiobjective gametheory models for conflict analysis in reservoir watershed management. Chemosphere, 87(6), 608–613.
Lepenioti, K., Pertselakis, M., Bousdekis, A., Louca, A., Lampathaki, F., Apostolou, D., Mentzas, G., & Anastasiou, S. (2020). Machine learning for predictive and prescriptive analytics of operational data in smart manufacturing. In: International Conference on Advanced Information Systems Engineering, pp. 5–16. Springer.
Li, C., & Czarnecki, K. (2019). Urban driving with multiobjective deep reinforcement learning. In: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 359–367. International Foundation for Autonomous Agents and Multiagent Systems.
Li, K., Zhang, T., & Wang, R. (2020). Deep reinforcement learning for multiobjective optimization. IEEE Transactions on Cybernetics.
Li, X., Gao, L., & Li, W. (2012). Application of game theory based hybrid algorithm for multiobjective integrated process planning and scheduling. Expert Systems with Applications, 39(1), 288–297.
Li, B., Li, J., Tang, K., & Yao, X. (2015). Manyobjective evolutionary algorithms: A survey. ACM Computing Surveys (CSUR), 48(1), 1–35.
Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
Lizotte, D.J., Bowling, M.H., & Murphy, S.A. (2010). Efficient reinforcement learning with multiple reward functions for randomized controlled trial analysis. In: Proceedings of the 27th International Conference on Machine Learning (ICML10), pp. 695–702. Citeseer.
Ma, C., Wen, J., & Bengio, Y. (2018). Universal successor representations for transfer reinforcement learning. arXiv preprint arXiv:1804.03758.
Mandel, T., Liu, Y.E., Brunskill, E., & Popovic, Z. (2017). Where to add actions in humanintheloop reinforcement learning. In: AAAI, pp. 2322–2328.
Mandow, L., & Pérezdela Cruz, J.L. (2018). Pruning dominated policies in multiobjective Pareto qlearning. In: Conference of the Spanish Association for Artificial Intelligence, pp. 240–250. Springer.
Mannion, P., Devlin, S., Duggan, J., & Howley, E. (2018). Reward shaping for knowledgebased multiobjective multiagent reinforcement learning. The Knowledge Engineering Review, 33(e23). URL https://doi.org/10.1017/S0269888918000292.
Mannion, P., Devlin, S., Mason, K., Duggan, J., & Howley, E. (2017). Policy invariance under reward transformations for multiobjective reinforcement learning. Neurocomputing, 263.
Mannion, P., Duggan, J., & Howley, E. (2016). An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In: Autonomic Road Transport Support Systems, pp. 47–66. Springer, Cham. https://doi.org/10.1007/9783319258089_4
Mannion, P., Heintz, F., Karimpanal, T.G., & Vamplew, P. (2021). Multiobjective decision making for trustworthy ai. In: Proceedings of the MultiObjective Decision Making (MODeM) Workshop.
Marinescu, R. (2009). Exploiting problem decomposition in multiobjective constraint optimization. In: International Conference on Principles and Practice of Constraint Programming, pp. 592–607. Springer.
Marinescu, R. (2011). Efficient approximation algorithms for multiobjective constraint optimization. In: ADT 2011: Proceedings of the Second International Conference on Algorithmic Decision Theory, pp. 150–164.
Matsui, T. (2019). A study of joint policies considering bottlenecks and fairness. In: ICAART (1), pp. 80–90.
Mello, F., Apostolopoulou, D., & Alonso, E. (2020). Cost efficient distributed load frequency control in power systems. In: 21st IFAC World Congress.
MéndezHernández, B.M., RodríguezBazan, E.D., MartinezJimenez, Y., Libin, P., & Nowé, A. (2019). A multiobjective reinforcement learning algorithm for jssp. In: International Conference on Artificial Neural Networks, pp. 567–584. Springer.
Menezes, E. J. N., Araújo, A. M., & da Silva, N. S. B. (2018). A review on wind turbine control and its associated methods. Journal of Cleaner Production, 174, 945–953.
Messikh, C., & Zarour, N. (2018). Towards a multiobjective reinforcement learning based routing protocol for cognitive radio networks. In: 2018 International Conference on Smart Communications in Network Technologies (SaCoNeT), pp. 84–89. IEEE.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., et al. (2015). Humanlevel control through deep reinforcement learning. Nature, 518(7540), 529–533.
Moghaddam, A., Yalaoui, F., & Amodeo, L. (2011). Lorenz versus pareto dominance in a single machine scheduling problem with rejection. In: International Conference on Evolutionary MultiCriterion Optimization, pp. 520–534. Springer.
Mossalam, H., Assael, Y.M., Roijers, D.M., & Whiteson, S. (2016). Multiobjective deep reinforcement learning. In: NIPS 2016 Workshop on Deep Reinforcement Learning.
Multiobjective routing in integrated services networks. (1991). Economides, A.A., Silvester, J.A., et al. A game theory approach. In: Infocom,91, 1220–1227.
Nagabandi, A., Clavera, I., Liu, S., Fearing, R.S., Abbeel, P., Levine, S., & Finn, C. (2019). Learning to adapt in dynamic, realworld environments through metareinforcement learning. In: Proceedings of Seventh International Conference on Learning Representations.
Nash, J. (1951). Noncooperative games. Annals of Mathematics, 54(2), 286–295.
Natarajan, S., & Tadepalli, P. (2005). Dynamic preferences in multicriteria reinforcement learning. In: Proceedings of the 22nd international conference on Machine learning, pp. 601–608.
Nguyena, M., & Caoa, T. (2017). A hybrid decision making model for evaluating land combat vehicle system. In: 22nd International Congress on Modelling and Simulation, MODSIM2017, Modelling and Simulation Society of Australia and New Zealand, pp. 1399–1405.
Nguyen, T. T., Nguyen, N. D., Vamplew, P., Nahavandi, S., Dazeley, R., & Lim, C. P. (2020). A multiobjective deep reinforcement learning framework. Engineering Applications of Artificial Intelligence, 96, 103915.
Nian, X., Irissappane, A.A., & Roijers, D. (2020). DCRAC: Deep conditioned recurrent actorcritic for multiobjective partially observable environments. In: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pp. 931–938.
Noothigattu, R., Bouneffouf, D., Mattei, N., Chandra, R., Madan, P., Varshney, K., Campbell, M., Singh, M., & Rossi, F. (2018). Interpretable multiobjective reinforcement learning through policy orchestration. arXiv preprint arXiv:1809.08343.
Ortúzar, J.d.D., & Willumsen, L.G. (2011). Modelling transport (4th ed.). Chichester, UK: John Wiley & Sons.
Pan, A., Xu, W., Wang, L., & Ren, H. (2020). Additional planning with multiple objectives for reinforcement learning. KnowledgeBased Systems, 193, 105392.
Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L., Restelli, M. (2014). Policy gradient approaches for multiobjective sequential decision making. In: IJCNN, pp. 2323–2330. IEEE.
Parisi, S., Pirotta, M., & Peters, J. (2017). Manifoldbased multiobjective policy search with sample reuse. Neurocomputing, 263, 3–14.
Parisi, S., Pirotta, M., & Restelli, M. (2016). Multiobjective reinforcement learning through continuous pareto manifold approximation. Journal of Artificial Intelligence Research, 57, 187–227.
Perez, J., GermainRenaud, C., Kégl, B., & Loomis, C. (2009). Responsive elastic computing. In: Proceedings of the 6th International Conference Industry Session on Grids Meets Autonomic Computing, pp. 55–64.
Perez, D., Samothrakis, S., & Lucas, S. (2013). Online and offline learning in multiobjective monte carlo tree search. In: 2013 IEEE Conference on Computational Inteligence in Games (CIG), pp. 1–8. IEEE.
Perez, J., GermainRenaud, C., Kégl, B., & Loomis, C. (2010). Multiobjective reinforcement learning for responsive grids. Journal of Grid Computing, 8(3), 473–492.
Perny, P., & Weng, P. (2010). On finding compromise solutions in multiobjective markov decision processes. In: ECAI, vol. 215, pp. 969–970.
Perny, P., Weng, P., Goldsmith, J., & Hanna, J. (2013). Approximation of lorenzoptimal solutions in multiobjective markov decision processes. In: Proceedings of the 27th AAAI Conference on Artificial Intelligence, pp. 92–94.
Pianosi, F., Castelletti, A., & Restelli, M. (2013). Treebased fitted Qiteration for multiobjective Markov decision processes in water resource management. Journal of Hydroinformatics, 15(2), 258–270.
Pla, A., Lopez, B., & Murillo, J. (2012). Multi criteria operators for multiattribute auctions. In: International Conference on Modeling Decisions for Artificial Intelligence, pp. 318–328. Springer.
Qin, Y., Wang, H., Yi, S., Li, X., & Zhai, L. (2020). An energyaware scheduling algorithm for budgetconstrained scientific workflows based on multiobjective reinforcement learning. The Journal of Supercomputing, 76(1), 455–480.
Qu, S., Ji, Y., & Goh, M. (2015). The robust weighted multiobjective game. PloS one, 10(9), e0138970.
Rădulescu, R., Mannion, P., Roijers, D.M., & Nowé, A. (2020). Multiobjective multiagent decision making: a utilitybased analysis and survey. Autonomous Agents and MultiAgent Systems, 34(10).
Rădulescu, R., Mannion, P., Zhang, Y., Roijers, D. M., & Nowé, A. (2020). A utilitybased analysis of equilibria in multiobjective normalform games. The Knowledge Engineering Review, 35, e32. https://doi.org/10.1017/S0269888920000351.
Rădulescu, R., Verstraeten, T., Zhang, Y., Mannion, P., Roijers, D. M., & Nowé, A. (2021). Opponent learning awareness and modelling in multiobjective normal form games. Neural Computing and Applications. https://doi.org/10.1007/s00521021061843.
Raj, R. N., Nayak, A., & Kumar, M. S. (2020). A survey and performance evaluation of reinforcement learning based spectrum aware routing in cognitive radio ad hoc networks. International Journal of Wireless Information Networks, 27(1), 144–163.
Ramos, G.de.O., da Silva, B.C., Rădulescu, R., Bazzan, A.L.C., & Nowé, A. (2020). Tollbased reinforcement learning for efficient equilibria in route choice. The Knowledge Engineering Review, 35, e8. https://doi.org/10.1017/S0269888920000119.
Ramos, G.de.O., Rădulescu, R., Nowé, A., & Tavares, A.R. (2020). Tollbased learning for minimising congestion under heterogeneous preferences. In: B. An, N. YorkeSmith, A. El Fallah Seghrouchni, G. Sukthankar (eds.) Proc. of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2020), pp. 1098–1106. IFAAMAS, Auckland, New Zealand.
Ravichandran, N.B., Yang, F., Peters, C., Lansner, A., & Herman, P. (2018). Pedestrian simulation as multiobjective reinforcement learning. In: Proceedings of the 18th International Conference on Intelligent Virtual Agents, pp. 307–312.
Reddy, M. J., & Kumar, D. N. (2006). Optimal reservoir operation using multiobjective evolutionary algorithm. Water Resources Management, 20(6), 861–878.
Reymond, M., & Nowé, A. (2019). ParetoDQN: Approximating the pareto front in complex multiobjective decision problems. In: Proceedings of the adaptive and learning agents workshop (ALA19) at AAMAS.
Reymond, M., Hayes, C., Roijers, D.M., Steckelmacher, D., & Nowé, A. (2021). Actorcritic multiobjective reinforcement learning for nonlinear utility functions. In: MultiObjective Decision Making Workshop (MODeM 2021).
Roijers, D.M. (2016). Multiobjective decisiontheoretic planning. Ph.D. thesis, University of Amsterdam.
Roijers, D.M., Röpke, W., Nowé, A., & Rădulescu, R. (2021). On following paretooptimal policies in multiobjective planning and reinforcement learning. In: Proceedings of the MultiObjective Decision Making (MODeM) Workshop.
Roijers, D.M., Steckelmacher, D., & Nowé, A. (2018). Multiobjective reinforcement learning for the expected utility of the return. In: Proceedings of the Adaptive and Learning Agents workshop at FAIM, vol. 2018.
Roijers, D.M., Walraven, E., & Spaan, M.T.J. (2018). Bootstrapping LPs in value iteration for multiobjective and partially observable MDPs. In: Proceedings of the TwentyEighth International Conference on Automated Planning and Scheduling (ICAPS), pp. 218–226.
Roijers, D.M., Whiteson, S., & Oliehoek, F.A. (2015). Pointbased planning for multiobjective pomdps. In: Proceedings of the twentyfourth international joint conference on artificial intelligence (IJCAI), pp. 1666–1672.
Roijers, D.M., Zintgraf, L.M., & Nowé, A. (2017). Interactive thompson sampling for multiobjective multiarmed bandits. In: International Conference on Algorithmic Decision Theory, pp. 18–34. Springer.
Roijers, D., Zintgraf, L., Libin, P., & Nowe, A. (2018). Interactive multiobjective reinforcement learning in multiarmed bandits for any utility function. In: Proceedings of the adaptive and learning agents workshop (ALA18) at AAMAS.
Roijers, D.M., Zintgraf, L.M., Libin, P., Reymond, M., Bargiacchi, E., & Nowé, A. (2020). Interactive multiobjective reinforcement learning in multiarmed bandits with gaussian process utility models. In: ECMLPKDD 2020: Proceedings of the 2020 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, p. 16.
Roijers, D. M., Vamplew, P., Whiteson, S., & Dazeley, R. (2013). A survey of multiobjective sequential decisionmaking. Journal of Artificial Intelligence Research, 48, 67–113.
Roijers, D. M., & Whiteson, S. (2017). Multiobjective decision making. Synthesis Lectures on Artificial Intelligence and Machine Learning, 11(1), 1–129.
Roijers, D. M., Whiteson, S., & Oliehoek, F. A. (2015). Computing convex coverage sets for faster multiobjective coordination. Journal of Artificial Intelligence Research, 52, 399–443.
Rollón, E. (2008). Multiobjective optimization for graphical models. Ph.D. thesis, Universitat Politècnica de Catalunya, Barcelona.
Rollon, E., & Larrosa, J. (2007). Multiobjective russian doll search. In: AAAI, pp. 249–254.
Rollon, E., & Larrosa, J. (2008). Constraint optimization techniques for multiobjective branch and bound search. In: International conference on logic programming, ICLP.
Rollón, E., & Larrosa, J. (2006). Bucket elimination for multiobjective optimization problems. Journal of Heuristics, 12, 307–328.
Rowe, J., Smith, A., Pokorny, B., Mott, B., & Lester, J. (2018). Toward automated scenario generation with deep reinforcement learning in gift. In: Proceedings of the Sixth Annual GIFT User Symposium, pp. 65–74.
RuizMontiel, M., Mandow, L., & Pérezdela Cruz, J. L. (2017). A temporal difference method for multiobjective reinforcement learning. Neurocomputing, 263, 15–25.
Saisubramanian, S., Kamar, E., & Zilberstein, S. (2020). A multiobjective approach to mitigate negative side effects. In: Proceedings of the 29th International Joint Conference on Artificial Intelligence.
Schaul, T., Horgan, D., Gregor, K., & Silver, D. (2015). Universal value function approximators. In: International conference on machine learning, pp. 1312–1320.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Shabani, N. (2009). Incorporating flood control rule curves of the columbia river hydroelectric system in a multireservoir reinforcement learning optimization model. Ph.D. thesis, University of British Columbia.
Siddique, U., Weng, P., & Zimmer, M. (2020). Learning fair policies in multiobjective (deep) reinforcement learning with average and discounted rewards. In: International Conference on Machine Learning.
Silver, D., Singh, S., Precup, D., & Sutton, R. S. (2021). Reward is enough. Artificial Intelligence,299, 103535. https://doi.org/10.1016/j.artint.2021.103535. URL https://www.sciencedirect.com/science/article/pii/S0004370221000862
Smith, B. J., Klassert, R., & Pihlakas, R. (2021). Soft maximin approaches to multiobjective decisionmaking for encoding human intuitive values. In: MultiObjective Decision Making Workshop.
Soh, H., & Demiris, Y. (2011). Evolving policies for multireward partially observable markov decision processes (MRPOMDPs). In: Proceedings of the 13th annual conference on Genetic and evolutionary computation, pp. 713–720.
Soh, H., & Demiris, Y. (2011). Multireward policies for medical applications: Anthrax attacks and smart wheelchairs. In: Proceedings of the 13th annual conference companion on Genetic and evolutionary computation, pp. 471–478.
Sun, Y., Li, Y., Xiong, W., Yao, Z., Moniz, K., & Zahir, A. (2018). Pareto optimal solutions for network defense strategy selection simulator in multiobjective reinforcement learning. Applied Sciences, 8(1), 136.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. Cambridge: MIT Press.
Tajmajer, T. (2018). Modular multiobjective deep reinforcement learning with decision values. In: Federated conference on computer science and information systems (FedCSIS), pp. 85–93. IEEE.
Taylor, A., Dusparic, I., GalvánLópez, E., Clarke, S., & Cahill, V. (2014). Accelerating learning in multiobjective systems through transfer learning. In: Neural Networks (IJCNN), 2014 International Joint Conference on, pp. 2298–2305. IEEE.
Tesauro, G., Das, R., Chan, H., Kephart, J., Levine, D., Rawson, F., & Lefurgy, C. (2008). Managing power consumption and performance of computing systems using reinforcement learning. In: Advances in Neural Information Processing Systems, pp. 1497–1504.
Thomas, L. (1982). Constrained Markov decision processes as multiobjective problems. Department of Decision Theory: University of Manchester.
Tozer, B., Mazzuchi, T., & Sarkani, S. (2017). Manyobjective stochastic path finding using reinforcement learning. Expert Systems with Applications, 72, 371–382.
Trivedi, A., Srinivasan, D., Sanyal, K., & Ghosh, A. (2016). A survey of multiobjective evolutionary algorithms based on decomposition. IEEE Transactions on Evolutionary Computation, 21(3), 440–462.
Turgay, E., Oner, D., & Tekin, C. (2018). Multiobjective contextual bandit problem with similarity information. In: International Conference on Artificial Intelligence and Statistics, pp. 1673–1681.
Vamplew, P., Dazeley, R., Barker, E., & Kelarev, A. (2009). Constructing stochastic mixture policies for episodic multiobjective reinforcement learning tasks. In: Australasian Joint Conference on Artificial Intelligence, pp. 340–349. Springer.
Vamplew, P., Foale, C., Dazeley, R., & Bignold, A. (2021). Potentialbased multiobjective reinforcement learning approaches to lowimpact agents for AI safety. Engineering Applications of Artificial Intelligence100. https://doi.org/10.1016/j.engappai.2021.104186
Vamplew, P., Issabekov, R., Dazeley, R., & Foale, C. (2015). Reinforcement learning of Paretooptimal multiobjective policies using steering. In: Australasian Joint Conference on Artificial Intelligence, pp. 596–608. Springer.
Vamplew, P., Yearwood, J., Dazeley, R., & Berry, A. (2008). On the limitations of scalarisation for multiobjective reinforcement learning of Pareto fronts. In: Australasian Joint Conference on Artificial Intelligence, pp. 372–378. Springer.
Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., & Dekker, E. (2011). Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84(1–2), 51–80.
Vamplew, P., Dazeley, R., & Foale, C. (2017). Softmax exploration strategies for multiobjective reinforcement learning. Neurocomputing, 263, 74–86.
Vamplew, P., Dazeley, R., Foale, C., Firmin, S., & Mummery, J. (2018). Humanaligned artificial intelligence is a multiobjective problem. Ethics and Information Technology, 20(1), 27–40.
Vamplew, P., Foale, C., & Dazeley, R. (2021). The impact of environmental stochasticity on valuebased multiobjective reinforcement learning. Neural Computing and Applications. https://doi.org/10.1007/s00521021058591.
Vamplew, P., Issabekov, R., Dazeley, R., Foale, C., Berry, A., Moore, T., & Creighton, D. (2017). Steering approaches to Paretooptimal multiobjective reinforcement learning. Neurocomputing, 263, 26–38.
van Dijk, M.T., van Wingerden, J.W., Ashuri, T., Li, Y., & Rotea, M.A. (2016). Yawmisalignment and its impact on wind turbine loads and wind farm power output. Journal of Physics: Conference Series, 753(6).
Van Dijk, M.T., van Wingerden, J.W., Ashuri, T., Li, Y., & Rotea, M.A. (2016). Yawmisalignment and its impact on wind turbine loads and wind farm power output. Journal of Physics: Conference Series, 753(6).
Van Moffaert, K., & Nowé, A. (2014). Multiobjective reinforcement learning using sets of pareto dominating policies. The Journal of Machine Learning Research, 15(1), 3483–3512.
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using tSNE. Journal of machine learning research, 9(11).
Van Moffaert, K., Brys, T., Chandra, A., Esterle, L., Lewis, P.R., & Nowé, A. (2014). A novel adaptive weight selection algorithm for multiobjective multiagent reinforcement learning. In: 2014 International joint conference on neural networks (IJCNN), pp. 2306–2314. IEEE.
Van Moffaert, K., Drugan, M. M., & Nowé, A. (2013). Hypervolumebased multiobjective reinforcement learning. In: International Conference on Evolutionary MultiCriterion Optimization, pp. 352–366. Springer.
Van Moffaert, K., Drugan, M. M., & Nowé, A. (2013). Scalarized multiobjective reinforcement learning: Novel design techniques. In: 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 191–199. IEEE.
Van Vaerenbergh, K., Rodríguez, A., Gagliolo, M., Vrancx, P., Nowé, A., Stoev, J., Goossens, S., Pinte, G., & Symens, W. (2012). Improving wet clutch engagement with reinforcement learning. In: The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE.
Verstraeten, T., Daems, P.J., Bargiacchi, E., Roijers, D.M., Libin, P.J., & Helsen, J. (2021). Scalable optimization for wind farm control using coordination graphs. In: Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1362–1370.
Verstraeten, T., Nowé, A., Keller, J., Guo, Y., Sheng, S., & Helsen, J. (2019). Fleetwide dataenabled reliability improvement of wind turbines. Renewable and Sustainable Energy Reviews, 109, 428–437.
Von Lücken, C., Barán, B., & Brizuela, C. (2014). A survey on multiobjective evolutionary algorithms for manyobjective problems. Computational optimization and applications, 58(3), 707–756.
Wallach, W., & Allen, C. (2008). Moral machines: Teaching robots right from wrong. Oxford: Oxford University Press.
Wang, W., & Sebag, M. (2012). Multiobjective MonteCarlo tree search. In: Asian Conference on Machine Learning (pp. 507522). PMLR, Singapore.
Wang, H., Lei, Z., Zhang, X., Peng, J., & Jiang, H. (2019). Multiobjective reinforcement learningbased intelligent approach for optimization of activation rules in automatic generation control. IEEE Access, 7, 17480–17492.
Wang, W., & Sebag, M. (2013). Hypervolume indicator and dominance reward based multiobjective montecarlo tree search. Machine Learning, 92(2–3), 403–429.
Wanigasekara, N., Liang, Y., Goh, S.T., Liu, Y., Williams, J.J., & Rosenblum, D.S. (2019). Learning multiobjective rewards and user utility function in contextual bandits for personalized ranking. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 3835–3841. AAAI Press.
Weng, D., Chen, R., Zhang, J., Bao, J., Zheng, Y., & Wu, Y. (2020). Paretooptimal transit route planning with multiobjective montecarlo tree search. IEEE Transactions on Intelligent Transportation Systems.
White, D. (1982). Multiobjective infinitehorizon discounted markov decision processes. Journal of Mathematical Analysis and Applications, 89(2), 639–647.
White, C. C., & Kim, K. W. (1980). Solution procedures for vector criterion Markov decision processes. Large Scale Systems, 1, 129–140.
Wiering, M. A., & De Jong, E. D. (2007). Computing optimal stationary policies for multiobjective markov decision processes. In: 2007 IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 158–165. IEEE.
Wiering, M. A., Withagen, M., & Drugan, M. M. (2014). Modelbased multiobjective reinforcement learning. In: 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 1–6. IEEE.
Wirth, C., Akrour, R., Neumann, G., Fürnkranz, J., et al. (2017). A survey of preferencebased reinforcement learning methods. Journal of Machine Learning Research, 18(136), 1–46.
Wray, K. H., & Zilberstein, S. (2015). Multiobjective pomdps with lexicographic reward preferences. In: TwentyFourth International Joint Conference on Artificial Intelligence.
Wray, K. H., Zilberstein, S., & Mouaddib, A. I. (2015). Multiobjective mdps with conditional lexicographic reward preferences. In: Twentyninth AAAI conference on artificial intelligence.
Xu, J., Tian, Y., Ma, P., Rus, D., Sueda, S., & Matusik, W. (2020). Predictionguided multiobjective reinforcement learning for continuous robot control. In: Proceedings of the 37th International Conference on Machine Learning.
Yahyaa, S. Q., Drugan, M. M., & Manderick, B. (2014). Knowledge gradient for multiobjective multiarmed bandit algorithms. In: ICAART (1), pp. 74–83.
Yamaguchi, T., Nagahama, S., Ichikawa, Y., Takadama, K. (2019). Modelbased multiobjective reinforcement learning with unknown weights. In: International Conference on HumanComputer Interaction, pp. 311–321. Springer.
Yang, C., Lu, J., Gao, X., Liu, H., Chen, Q., Liu, G., & Chen, G. (2020). MoTiAC: Multiobjective actorcritics for realtime bidding. arXiv preprint arXiv:2002.07408.
Yang, R., Sun, X., & Narasimhan, K. (2019). A generalized algorithm for multiobjective reinforcement learning and policy adaptation. In: Advances in Neural Information Processing Systems, pp. 14636–14647.
Yliniemi, L., & Tumer, K. (2016). Multiobjective multiagent credit assignment in reinforcement learning and nsgaii. Soft Computing, 20(10), 3869–3887.
Yu, H., & Liu, H. (2013). Robust multiple objective game theory. Journal of Optimization Theory and Applications, 159(1), 272–280.
Zhan, H., & Cao, Y. (2019). Relationship explainable multiobjective reinforcement learning with semantic explainability generation. arXiv preprint arXiv:1909.12268.
Zhang, Y., Rădulescu, R., Mannion, P., Roijers, D. M., & Nowé, A. (2020). Opponent modelling for reinforcement learning in multiobjective normal form games. In: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pp. 2080–2082.
Zhang, Z., Chong, A., Pan, Y., Zhang, C., & Lam, K. P. (2019). Whole building energy model for hvac optimal control: A practical framework based on deep reinforcement learning. Energy and Buildings, 199, 472–490.
Zhou, Z., Kearnes, S., Li, L., Zare, R. N., & Riley, P. (2019). Optimization of molecules via deep reinforcement learning. Scientific Reports, 9(1), 1–10.
Zintgraf, L. M., Kanters, T. V., Roijers, D. M., Oliehoek, F., & Beau, P. (2015). Quality assessment of MORL algorithms: A utilitybased approach. In: Benelearn 2015: Proceedings of the 24th Annual Machine Learning Conference of Belgium and the Netherlands.
Zintgraf, L. M., Roijers, D. M., Linders, S., Jonker, C. M., & Nowé, A. (2018). Ordered preference elicitation strategies for supporting multiobjective decision making. In: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1477–1485. International Foundation for Autonomous Agents and Multiagent Systems.
Zitzler, E., Knowles, J., & Thiele, L. (2008). Quality assessment of pareto set approximations. In: Multiobjective Optimization, pp. 373–404. Springer.
Zitzler, E., & Thiele, L. (1999). Multiobjective evolutionary algorithms: a comparative case study and the strength pareto approach. IEEE Transactions on Evolutionary Computation, 3(4), 257–271.
Acknowledgements
This research was supported by funding from the Fonds voor Wetenschappelijk Onderzoek (FWO) through the grant of Eugenio Bargiacchi (#1SA2820N), and by funding from the Flemish Government under the “Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen” for Diederik M. Roijers and Ann Nowé. Roxana Rădulescu was partially supported through the FWO iBOF/21/027 project “DESCARTES”. Conor F. Hayes is funded by the National University of Ireland Galway Hardiman Scholarship. Gabriel Ramos was partially supported by FAPERGS (grant 19/255100012772) and FAPESP (grant 2020/051651). Johan Källström and Fredrik Heintz were partially supported by the Swedish Governmental Agency for Innovation Systems (grant NFFP7/201704885), and the Wallenberg Artificial Intelligence, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation. Matthew Macfarlane was funded by LIFTproject 019.011 which is partly financed by the Dutch Research Council (NWO). Luisa Zintgraf is supported by the 2017 Microsoft Research PhD Scholarship Program, and the 2020 Microsoft Research EMEA PhD Award.
Funding
Open Access funding provided by the IReL Consortium.
Author information
Authors and Affiliations
Corresponding authors
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Hayes, C.F., Rădulescu, R., Bargiacchi, E. et al. A practical guide to multiobjective reinforcement learning and planning. Auton Agent MultiAgent Syst 36, 26 (2022). https://doi.org/10.1007/s1045802209552y
Accepted:
Published:
DOI: https://doi.org/10.1007/s1045802209552y
Keywords
 Multiobjective decision making
 Multiobjective reinforcement learning
 Multiobjective planning
 Multiobjective multiagent systems