1 Introduction

When making decisions in the real world, decision makers must make trade-offs between multiple, often conflicting, objectives [44]. In many real-world settings, a policy is only executed once. For example, consider a municipality that receives the majority of its electricity from local solar farms. To deal with the intermittency of the solar farms, the municipality wants to build a new electricity generation facility and is considering two choices: building a natural gas facility or adding a lithium-ion battery storage facility to the solar farms. Moreover, the municipality wants to minimise CO\(_{2}\) emissions while ensuring energy demand can continuously be met. Given that a new energy generation facility will only be constructed once, a full distribution over the potential outcomes for CO\(_{2}\) emissions and the capacity to meet electricity demand must be considered to make an optimal decision. The current state-of-the-art multi-objective reinforcement learning (MORL) literature focuses almost exclusively on learning policies that are optimal over multiple executions. Given the salience of such problems, to fully utilise MORL in the real world we must develop algorithms that compute a policy, or set of policies, that is optimal given the single-execution nature of the problem.

In multi-objective reinforcement learning (MORL), a user’s preferences over objectives are represented by a utility function. In certain scenarios, a user’s preferences over objectives, and therefore the utility function, may be unknown. In this case, the user is said to be in the unknown utility function or unknown weights scenario [36]. The unknown utility function scenario has three phases: the learning phase, the selection phase and the execution phase. During the learning phase, a multi-objective method [42] is used to compute a set of optimal policies, which is returned to the user. During the selection phase, the utility function of the user becomes known and the policy that best reflects the user’s preferences is selected from the computed set. The selected policy is then executed during the execution phase [18].

In contrast to single-objective reinforcement learning (RL), multiple optimality criteria exist for MORL [36]. In scenarios where the utility of the user is derived from multiple executions of a policy, the scalarised expected returns (SER) must be optimised. However, in scenarios where the utility of a user is derived from a single execution of a policy, the expected utility of the returns (or expected scalarised returns, ESR) must be optimised. The majority of MORL research focuses on the SER criterion and linear utility functions [29], which limits the applicability of MORL to real-world problems. In the real world, a user’s utility may depend on the objectives in a linear or nonlinear manner. For known linear utility functions, single-objective methods can be used to learn an optimal policy [36]. Nonlinear utility functions, however, do not distribute across the sum of the immediate and future returns, which invalidates the Bellman equation [33]. Therefore, to learn optimal policies for nonlinear utility functions, strictly multi-objective methods must be used.

For nonlinear utility functions, the policies learned under the SER criterion and the ESR criterion can be different [29, 30]. The ESR criterion has, to date, received very little attention from the MORL community, with some exceptions [21, 31, 33, 43]. To learn optimal policies in the many real-world scenarios where a policy will be executed only once, the ESR criterion must be optimised. For example, in a medical setting where a user has one opportunity to select a treatment, the user aims to maximise the expected utility of a single outcome. Choosing the wrong optimisation criterion (SER) for such a scenario could lead to a different policy than the one that would be learned under ESR. In the real world, as in the aforementioned scenario, learning a sub-optimal policy could have catastrophic outcomes.

Therefore, it is crucial that the MORL community focuses on developing multi-objective algorithms that can learn optimal policies under the ESR criterion. Recently, a number of multi-objective methods have been implemented that can learn a single optimal policy under the ESR criterion [15, 33]. However, in the current MORL literature, no multi-policy algorithms exist for the ESR criterion. In fact, a set of optimal policies for the ESR criterion has yet to be defined.

Due to the lack of existing research on the ESR criterion, a formal definition of the requirements to learn optimal policies under the ESR criterion has yet to be determined. In Sect. 3, we define the requirements necessary to compute policies under the ESR criterion. The applicability of MORL to many real-world scenarios under the ESR criterion is also limited because no solution set has been defined for scenarios where a user’s utility function is unknown. In Sect. 4, we show how first-order stochastic dominance can be used to define sets of optimal policies under the ESR criterion. In Sect. 5, we extend first-order stochastic dominance to define a new dominance criterion, called expected scalarised returns (ESR) dominance. This work proposes that ESR dominance can be used to compute a set of optimal policies, which we define as the ESR set. Finally, we present a novel multi-objective tabular distributional reinforcement learning algorithm (MOTDRL), which aims to learn the ESR set in scenarios where the utility function of the user is unknown. We apply MOTDRL to two different multi-objective multi-armed bandit settings and show that it is able to learn the ESR set in both.

2 Background

In this section, we introduce necessary background material, including multi-objective reinforcement learning, utility functions, the unknown utility function scenario, multi-objective multi-armed bandits, commonly used optimality criteria in multi-objective decision making, and stochastic dominance.

2.1 Multi-objective reinforcement learning

In multi-objective reinforcement learning (MORL) [18], we deal with decision-making problems with multiple objectives, often modelled as a multi-objective Markov decision process (MOMDP). An MOMDP is a tuple, \(\mathcal {M} = (\mathcal {S}, \mathcal {A}, \mathcal {T}, \gamma , \mathcal {R})\), where \(\mathcal {S}\) and \(\mathcal {A}\) are the state and action spaces, respectively, \(\mathcal {T} :\mathcal {S} \times \mathcal {A} \times \mathcal {S} \rightarrow \left[ 0, 1 \right]\) is a probabilistic transition function, \(\gamma\) is a discount factor determining the importance of future rewards and \(\mathcal {R} :\mathcal {S} \times \mathcal {A} \times \mathcal {S} \rightarrow \mathbb {R}^n\) is an n-dimensional vector-valued immediate reward function. In multi-objective reinforcement learning, \(n>1\).
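For concreteness, a minimal container for this tuple might look as follows. This is an illustrative sketch only; the field names are ours and are not taken from any particular MORL library.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MOMDP:
    """Minimal container for an MOMDP (S, A, T, gamma, R) with finite states and actions."""
    n_states: int           # |S|
    n_actions: int          # |A|
    n_objectives: int       # n, with n > 1 for multi-objective problems
    transition: np.ndarray  # T[s, a, s'] = P(s' | s, a), shape (|S|, |A|, |S|)
    reward: np.ndarray      # R[s, a, s'] is an n-dimensional reward vector, shape (|S|, |A|, |S|, n)
    gamma: float            # discount factor
```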

2.2 Utility functions

In MORL, utility functions are used to model a user’s preferences. In this work, utility functions map vector returns to a scalar value which represents the user’s preferences over the returns,

$$\begin{aligned} u : \mathbb {R}^{n} \rightarrow \mathbb {R}, \end{aligned}$$
(1)

where u is a utility function and \(\mathbb {R}^{n}\) is the space of n-dimensional return vectors. Linear utility functions are widely used to represent a user’s preferences,

$$\begin{aligned} u(\mathbf {r}) = \sum _{i=1}^{n} w_{i}r_{i}, \end{aligned}$$
(2)

where \(w_{i}\) is the preference weight for objective i and \(r_{i}\) is the value at position i of the return vector \(\mathbf {r}\). However, certain scenarios exist where linear utility functions cannot accurately represent a user’s preferences. In such cases, the user’s preferences must be represented using a nonlinear utility function.

In this paper, we consider monotonically increasing utility functions [36], i.e.

$$\begin{aligned} (\forall \, i, V_{i}^{\pi } \ge V_{i}^{\pi '} \wedge \exists \, i, V_{i}^{\pi }> V_{i}^{\pi '}) \,\implies\, (\forall \, u, u(\mathbf{V} ^{\pi }) > u(\mathbf{V} ^{\pi '})), \end{aligned}$$
(3)

where \(\mathbf {V}^{\pi }\) and \(\mathbf {V}^{\pi '}\) are the values of executing policies \(\pi\) and \(\pi '\), respectively.

It is important to note that the class of monotonically increasing utility functions also includes linear utility functions of the form in Eq. 2. In certain scenarios, the utility function may be unknown; therefore, we do not know the shape of the utility function. If we assume the utility function is monotonically increasing, we know that if the value of one of the objectives in the return vector increases, then the utility also increases [36]. This assumption makes it possible to determine a partial ordering over policies when the shape of the utility function is unknown. In this work, we make no assumptions about the shape of the utility function beyond that it is monotonically increasing.
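As an illustration, a minimal sketch of a linear and a nonlinear monotonically increasing utility function applied to vector returns; the particular functions and return values are our own examples, not taken from the paper.

```python
import numpy as np

def linear_utility(returns, weights):
    """Linear utility (Eq. 2): weighted sum of the return vector."""
    return float(np.dot(weights, returns))

def nonlinear_utility(returns):
    """A nonlinear, monotonically increasing utility (for non-negative returns)."""
    return float(np.sum(np.asarray(returns) ** 2))

r_low, r_high = np.array([2.0, 3.0]), np.array([4.0, 3.0])  # r_high dominates r_low on every objective
w = np.array([0.5, 0.5])

# For any monotonically increasing utility, the dominating return vector has higher utility.
assert linear_utility(r_high, w) > linear_utility(r_low, w)
assert nonlinear_utility(r_high) > nonlinear_utility(r_low)
```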

2.3 The unknown utility function scenario

In MORL, a user’s preferences over objectives can be modelled as a utility function [36]. However, a user’s utility function is often unknown at the time of learning or planning. In the taxonomy of multi-objective decision making (MODeM), this is known as the unknown utility function scenario (Fig. 1), where a set of optimal policies must be computed and returned to the user [18]. In the unknown utility function scenario, there are three phases: the learning or planning phase, the selection phase and the execution phase. In the learning or planning phase, a multi-objective planning or learning algorithm is deployed in an MOMDP. Given that the utility function is unknown, the algorithm computes a set of optimal policies during this phase. During the selection phase, the user’s preferences over objectives become known and the user selects the policy from the set of optimal policies that best reflects their preferences. Finally, during the execution phase, the selected policy is executed.

Fig. 1 The unknown utility function scenario [18]

2.4 Multi-objective multi-armed bandits

Multi-objective multi-armed bandits (MOMABs) [11] are a natural extension of multi-armed bandits, where each arm returns an n-dimensional reward vector \(\mathbf{R} \in \mathbb {R}^{n}\), where n is the number of objectives. At each timestep, t, the agent selects an arm, i, and receives a reward vector. The returns in an MOMAB setting can be deterministic [11] or stochastic [3]. Many algorithms focus on the MOMAB setting and learn a set of optimal arms [11, 27, 35, 47].

For example, Pareto UCB1 [11] can learn a set of optimal arms in an MOMAB setting. Pareto UCB1 initially selects each arm once; at each subsequent timestep, the algorithm computes the mean return vector of each of the multi-objective arms and adds an upper confidence bound to each mean return vector. Using this method, Pareto UCB1 can learn the Pareto front in an MOMAB setting.
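A rough sketch of this arm-selection step is shown below. It is an illustration of the idea rather than Drugan et al.'s exact formulation; in particular, we use the standard UCB1 exploration bonus, whereas Pareto UCB1 uses a slightly different confidence term, and the example mean vectors are our own.

```python
import numpy as np

def pareto_dominates(a, b):
    """a Pareto dominates b: at least as good on every objective, strictly better on one."""
    return bool(np.all(a >= b) and np.any(a > b))

def pareto_ucb1_select(mean_vectors, counts, t, rng):
    """Select an arm among the Pareto-optimal UCB-augmented mean return vectors."""
    bonus = np.sqrt(2.0 * np.log(t) / counts)      # standard UCB1 bonus per arm
    ucb_vectors = mean_vectors + bonus[:, None]    # add the bonus to every objective
    n_arms = len(ucb_vectors)
    pareto_arms = [i for i in range(n_arms)
                   if not any(pareto_dominates(ucb_vectors[j], ucb_vectors[i])
                              for j in range(n_arms) if j != i)]
    return int(rng.choice(pareto_arms))            # pick uniformly among non-dominated arms

rng = np.random.default_rng(0)
means = np.array([[0.6, 0.2], [0.3, 0.7], [0.2, 0.2]])  # estimated mean return vector per arm
counts = np.array([5, 5, 5])
arm = pareto_ucb1_select(means, counts, t=15, rng=rng)
```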

2.5 Scalarised expected returns and expected scalarised returns

For MORL, the ability to express a user’s preferences over objectives as a utility function is essential when learning a single optimal policy. In MORL, different optimality criteria exist [36]. Additionally, the utility function can be applied to the expectation of the returns, or the utility function can be applied directly to the returns before computing the expectation. Calculating the expected value of the return of a policy before applying the utility function leads to the scalarised expected returns (SER) optimisation criterion:

$$\begin{aligned} V_{u}^{\pi } = u\left( \mathbb {E} \left[ \sum \limits ^\infty _{t=0} \gamma ^t \mathbf{r }_t \,\big |\, \pi , \mu _0 \right] \right) , \end{aligned}$$
(4)

where \(\mu _0\) is the probability distribution over possible starting states.

SER is the most commonly used criterion in the multi-objective (single agent) planning and reinforcement learning literature [36]. For SER, a coverage set is defined as a set of optimal solutions for all possible utility functions. If the utility function is instead applied to the returns before computing the expectation, this leads to the expected scalarised returns (ESR) optimisation criterion [15, 33, 36]:

$$\begin{aligned} V_{u}^{\pi } = \mathbb {E} \left[ u\left( \sum \limits ^\infty _{t=0} \gamma ^t \mathbf{r }_t \right) \,\big |\, \pi , \mu _0 \right] . \end{aligned}$$
(5)

ESR is the most commonly used criterion in the game theory literature on multi-objective games [29].
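The difference between Eqs. 4 and 5 can be made concrete with a small Monte Carlo sketch: SER applies the utility function to the average return vector, whereas ESR averages the utility of individual return vectors. The utility function and sampled returns below are illustrative only.

```python
import numpy as np

def ser_value(returns, u):
    """Eq. 4: utility of the expected return vector."""
    return u(returns.mean(axis=0))

def esr_value(returns, u):
    """Eq. 5: expected utility of individual return vectors."""
    return float(np.mean([u(r) for r in returns]))

u = lambda r: r[0] ** 2 + r[1] ** 2           # a nonlinear, monotonically increasing utility
returns = np.array([[4.0, 3.0], [2.0, 3.0]])  # sampled vector returns from repeated policy executions

print(ser_value(returns, u))  # u((3, 3)) = 18
print(esr_value(returns, u))  # 0.5 * 25 + 0.5 * 13 = 19
```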

2.6 Stochastic dominance

Stochastic dominance [4, 14] gives a partial order between distributions and can be used when making decisions under uncertainty (Fig. 2). Stochastic dominance is particularly useful when the full distribution, rather than just an expected value, must be taken into consideration when making decisions. It is a prominent dominance criterion in finance, economics and decision theory, and, when making decisions under uncertainty, it can be used to determine the most risk-averse decision. Various degrees of stochastic dominance exist; however, in this paper we focus on first-order stochastic dominance (FSD). FSD imposes a partial ordering over random variables or random vectors, from which an FSD dominant set can be derived.

Fig. 2 For random variables X and Y, X \(\succeq _\mathrm{FSD}\) Y, where \(F_{X}\) and \(F_{Y}\) are the cumulative distribution functions (CDFs) of X and Y, respectively. In this case, X is preferable to Y because higher utilities occur with greater frequency in \(F_{X}\)

In Definition 1, we present the necessary conditions for FSD, and in Theorem 1, we prove that if a random variable is FSD dominant, it has at least as high an expected value as another random variable [46]. We use the work of Wolfstetter [46] to prove Theorem 1.

Definition 1

For random variables X and Y, X \(\succeq _\mathrm{FSD}\) Y if:

$$\begin{aligned} P(X> z) \ge P(Y > z), \forall \, z. \end{aligned}$$

If we consider the cumulative distribution function (CDF) of X, \(F_{X}\), and the CDF of Y, \(F_{Y}\), we can say that X \(\succeq _\mathrm{FSD}\) Y if:

$$\begin{aligned} F_{X}(z) \le F_{Y}(z), \forall \, z. \end{aligned}$$

Theorem 1

If X \(\succeq _\mathrm{FSD}\) Y, then the expected value of X is greater than or equal to the expected value of Y.

$$\begin{aligned} X \succeq _\mathrm{FSD} Y \implies E(X) \ge E(Y). \end{aligned}$$

Proof

By a known property of expected values, the following is true for any non-negative random variable:

$$\begin{aligned} \mathbb {E}(X)&= \int _{0}^{+\infty } (1 - F_{X}(x)) \,{\mathrm {d}}x,\\ \mathbb {E}(Y)&= \int _{0}^{+\infty } (1 - F_{Y}(x)) \,{\mathrm {d}}x. \end{aligned}$$

Therefore, if X \(\succeq _\mathrm{FSD}\) Y, then:

$$\begin{aligned} \int _{0}^{+\infty } (1 - F_{X}(x)) \,{\mathrm {d}}x \ge \int _{0}^{+\infty } (1 - F_{Y}(x)) \,{\mathrm {d}}x \end{aligned}$$

which gives,

$$\begin{aligned} \mathbb {E}(X) \ge \mathbb {E}(Y). \end{aligned}$$

[46] \(\square\)
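As an illustration of Definition 1 and Theorem 1, a minimal sketch that checks FSD between two empirical distributions via their CDFs; the sample values are our own.

```python
import numpy as np

def fsd_dominates(samples_x, samples_y):
    """X >=_FSD Y  iff  F_X(z) <= F_Y(z) for all z (Definition 1), checked on the joint support."""
    grid = np.union1d(samples_x, samples_y)
    F_x = np.array([(samples_x <= z).mean() for z in grid])
    F_y = np.array([(samples_y <= z).mean() for z in grid])
    return bool(np.all(F_x <= F_y))

x = np.array([2.0, 4.0, 6.0])   # samples from X
y = np.array([1.0, 4.0, 5.0])   # samples from Y
print(fsd_dominates(x, y))      # True: F_X lies below (or on) F_Y everywhere
# Consistent with Theorem 1: the empirical mean of x (4.0) exceeds that of y (~3.33).
```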

3 Expected scalarised returns

In contrast to single-objective reinforcement learning (RL), different optimality criteria exist for MORL. In scenarios where the utility of a user is derived from multiple executions of a policy, the agent should optimise the scalarised expected returns (SER) criterion. In scenarios where the utility of a user is derived from a single execution of a policy, the agent should optimise the expected scalarised returns (ESR) criterion. Let us consider, as an example, a power plant that generates electricity for a city and emits CO\(_2\) and other harmful greenhouse gases. City regulations have been imposed which limit the amount of pollution that the power plant can generate. If the regulations require that the emissions from the power plant do not exceed a certain amount over an entire year, the SER criterion should be optimised. In this scenario, the regulations allow the pollution to vary from day to day, as long as the emissions do not exceed the regulated level for a given year. However, if the regulations are much stricter and the power plant is fined every day it exceeds a certain level of pollution, it is beneficial to optimise under the ESR criterion.

The majority of MORL research focuses on linear utility functions. However, in the real world, a user’s utility function can be nonlinear. For example, a utility function is nonlinear in situations where a minimum value must be achieved on each objective [26]. Focusing on linear utility functions limits the applicability of MORL in real-world decision-making problems. For example, linear utility functions cannot be used to learn policies in concave regions of the Pareto front [41]. Furthermore, if a user’s preferences are nonlinear, these are fundamentally incompatible with linear utility functions. In this case, strictly multi-objective methods must be used to learn optimal policies for nonlinear utility functions. In MORL, for nonlinear utility functions, different policies are preferred when optimising under the ESR criterion versus the SER criterion [30]. It is important to note that, for linear utility functions, the distinction between ESR and SER does not exist [29].

For example, consider a decision maker who has to choose between the two lotteries, \(L_{1}\) and \(L_{2}\), shown in Table 1.

Table 1 Lotteries \(L_{1}\) and \(L_{2}\). \(L_{1}\) has two possible returns, (4, 3) and (2, 3), each with a probability of 0.5. \(L_{2}\) has two possible returns, (1, 3) with a probability of 0.9 and (10, 2) with a probability of 0.1

The decision maker has the following nonlinear utility function:

$$\begin{aligned} u(\mathbf{x} ) = x_{1}^2 + x_{2}^2, \end{aligned}$$
(6)

where \(\mathbf{x}\) is a return vector from Table 1 and \(x_{1}\) and \(x_{2}\) are the values of the two objectives. Note that this utility function is monotonically increasing for \(x_{1} \ge 0\) and \(x_{2} \ge 0\). Under the SER criterion, the decision maker will compute the expected value of each lottery, apply the utility function, and select the lottery that maximises the resulting utility. Let us consider which lottery the decision maker will play under the SER criterion:

$$\begin{aligned} L_{1}&: \; \mathbb {E}(L_{1}) = 0.5\,(4, 3) + 0.5\,(2, 3) = (2, 1.5) + (1, 1.5) = (3, 3) \\ L_{1}&: \; u(\mathbb {E}(L_{1})) = 3^{2} + 3^{2} = 9 + 9 = 18 \\ L_{2}&: \; \mathbb {E}(L_{2}) = 0.9\,(1, 3) + 0.1\,(10, 2) = (0.9, 2.7) + (1, 0.2) = (1.9, 2.9) \\ L_{2}&: \; u(\mathbb {E}(L_{2})) = 1.9^{2} + 2.9^{2} = 3.61 + 8.41 = 12.02. \end{aligned}$$

Therefore, a decision maker with the utility function in Eq. 6 will prefer to play lottery \(L_{1}\) under the SER criterion.

Under the ESR criterion, the decision maker will first apply the utility function to the return vectors, then compute the expectation, and select the lottery that maximises the expected utility. Let us consider how a decision maker will choose which lottery to play under the ESR criterion:

$$\begin{aligned} L_{1}&: \; \mathbb {E}(u(L_{1})) = 0.5\,u((4, 3)) + 0.5\,u((2, 3)) = 0.5(4^{2} + 3^{2}) + 0.5(2^{2} + 3^{2}) \\ &\qquad = 0.5(25) + 0.5(13) = 12.5 + 6.5 = 19 \\ L_{2}&: \; \mathbb {E}(u(L_{2})) = 0.9\,u((1, 3)) + 0.1\,u((10, 2)) = 0.9(1^{2} + 3^{2}) + 0.1(10^{2} + 2^{2}) \\ &\qquad = 0.9(10) + 0.1(104) = 9 + 10.4 = 19.4. \end{aligned}$$

Therefore, a decision maker with the utility function in Eq. 6 will prefer to play lottery \(L_{2}\) under the ESR criterion. From the example, it is clear that users with the same nonlinear utility function can prefer different policies, depending on which multi-objective optimisation criterion is selected. Therefore, it is critical that the distinction between ESR and SER is taken into consideration when selecting a MORL algorithm to learn optimal policies in a given scenario. The majority of MORL research focuses on the SER criterion [29]. By comparison, the ESR criterion has received very little attention from the MORL community [15, 29, 33, 36]. Many of the traditional MORL methods cannot be used when optimising under the ESR criterion, given that nonlinear utility functions in MOMDPs do not distribute across the sum of immediate and future returns, which invalidates the Bellman equation [33],

$$\begin{aligned} \begin{aligned} &\max_\pi \mathbb {E} \left[ u\left( \mathbf{R}_t^- + \sum _{i=t}^{\infty } \gamma ^i \mathbf{r}_i\right) \ \big |\ \pi , s_t \right] \\&\qquad \not = u(\mathbf{R}_t^-) + \max _\pi \mathbb {E}\left[ u\left( \sum _{i=t}^{\infty } \gamma ^i \mathbf{r}_i\right) \ \big |\ \pi , s_t \right] , \end{aligned} \end{aligned}$$
(7)

where u is a nonlinear utility function and \(\mathbf {R}^{-}_{t} = \sum _{i=0}^{t - 1} \gamma ^i \mathbf{r}_i\) is the discounted sum of rewards accrued up to timestep t.

An example of an algorithm that can learn policies for nonlinear utility functions and the ESR criterion is distributional Monte Carlo tree search (DMCTS) [15]. Hayes et al. [15] use DMCTS to calculate the returns of a full policy and compute a posterior distribution over the expected utility of individual policy executions. DMCTS achieves state-of-the-art performance under the ESR criterion. Hayes et al. [15] demonstrate that, when optimising under the ESR criterion, making decisions based on a distribution over the utility of the returns is particularly useful when learning in realistic problems where rewards are stochastic.

However, DMCTS and other MORL algorithms that optimise for the ESR criterion [21, 33, 36] require the utility function of a user to be known a priori. In practice, many scenarios exist where a user’s utility function may be unknown at the time of learning or planning. To compute policies under the ESR criterion when a user’s utility function is unknown, a distribution over the returns must be maintained. To highlight why a distribution over the returns must be used when the utility function of a user is unknown, let us consider the following example in Table 2.Footnote 1

Table 2 Lotteries \(L_{3}\) and \(L_{4}\). \(L_{3}\) has two possible returns, (−20, 1) and (20, 3), each with a probability of 0.5. \(L_{4}\) has two possible returns, (0, 2) with a probability of 0.9 and (5, 2) with a probability of 0.1

To determine which lottery to play while optimising for the ESR criterion, the utility function must first be applied and then the expected utility can be computed (see Eq. 5):

$$\begin{aligned} u(L_{3})&= u((-20, 1)) + u((20, 3)) \\ \mathbb {E}(u(L_{3}))&= 0.5(u((-20, 1))) + 0.5(u((20, 3))) \\ u(L_{4})&= u((0, 2)) + u((5, 2)) \\ \mathbb {E}(u(L_{4}))&= 0.9(u((0, 2))) + 0.1(u((5, 2))). \end{aligned}$$

Given the utility function is unknown, it is impossible to compute the expected utility. Instead, a distribution over the returns received from a policy execution must be maintained in order to optimise for the ESR criterion. Maintaining a distribution over the returns ensures the expected utility can be computed once the user’s utility function becomes known at decision time.

As demonstrated above, maintaining a distribution over the returns is critical to learning optimal policies when the utility function of a user is unknown. Therefore, to compute a set of optimal policies under the ESR criterion it is necessary to adopt a distributional approach.

To adopt a distributional approach to multi-objective decision making, we must first introduce a multi-objective version of the return distribution [7]Footnote 2, \(\mathbf{Z} ^{\pi }\). A return distribution, \(\mathbf{Z} ^{\pi }\), is a multivariate distribution with one dimension per objective. The return distribution, \(\mathbf{Z} ^{\pi }\), gives the distribution over the vector returns received when a policy \(\pi\) is executed [40], such that,

$$\begin{aligned} \mathbb {E} \, \mathbf{Z} ^{\pi } = \mathbb {E} \left[ \sum \limits ^\infty _{t=0} \gamma ^t \mathbf{r }_t \,\big |\, \pi , \mu _0 \right] . \end{aligned}$$
(8)

Moreover, a return distribution can be used to represent a policy. Under the ESR criterion, the utility-of-the-return-distribution, \(Z_{u}^{\pi }\), is defined as the distribution over the scalar utilities obtained by applying the utility function to each vector in the return distribution, \(\mathbf{Z} ^{\pi }\). Therefore, \(Z_{u}^{\pi }\) is a distribution over the scalar utilities of the vector returns received from executing a policy \(\pi\), such that,

$$\begin{aligned} \mathbb {E} \, Z_{u}^{\pi } = \mathbb {E} \left[ u\left( \sum \limits ^\infty _{t=0} \gamma ^t \mathbf{r }_t \right) \,\big |\, \pi , \mu _0 \right] . \end{aligned}$$
(9)

The utility-of-the-return-distribution can only be calculated when the utility function is known a priori.
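A minimal sketch of these objects for a finite, MOMAB-style setting: the return distribution \(\mathbf{Z} ^{\pi }\) is stored as vector outcomes with probabilities, and the utility-of-the-return-distribution \(Z_{u}^{\pi }\) is obtained by pushing each outcome through u once the utility function becomes known. The outcomes and utility function below are illustrative only.

```python
import numpy as np

# Return distribution Z^pi: possible return vectors and their probabilities.
outcomes = np.array([[4.0, 3.0], [2.0, 3.0]])
probs = np.array([0.5, 0.5])

expected_return = probs @ outcomes                 # Eq. 8: E[Z^pi] = (3, 3)

# Once the utility function becomes known (here: sum of squares, illustrative),
# Z_u^pi is the induced distribution over scalar utilities.
u = lambda r: r[0] ** 2 + r[1] ** 2
utilities = np.array([u(r) for r in outcomes])     # support of Z_u^pi: [25, 13]
expected_utility = probs @ utilities               # Eq. 9: E[Z_u^pi] = 19
```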

When the utility function of a user is unknown, a set of policies that are optimal for all monotonically increasing utility functions must be learned. However, for the ESR criterion, a set of optimal solutions has yet to be defined. To learn a set of optimal policies under the ESR criterion, we must develop new methods. In Sect. 4, we apply first-order stochastic dominance to determine a partial ordering over return distributions under the ESR criterion.

4 Stochastic dominance for ESR

For MORL, there are two classes of algorithms: single-policy and multi-policy algorithms [36, 42]. When the user’s utility function is known a priori, it is possible to use a single-policy algorithm [15, 33] to learn an optimal solution. However, when the user’s utility function is unknown, we aim to learn a set of policies that are optimal for all monotonically increasing utility functions. The current literature on the ESR criterion focuses only on scenarios where the utility function of a user is known [15, 33], overlooking scenarios where the utility function of a user is unknown. Moreover, a set of solutions under the ESR criterion for the unknown utility function scenario [36] has yet to be defined.

Various algorithms have been proposed to learn solution sets under the SER criterion (see Sect. 2.5), for example [24, 34, 45]. Under the SER criterion, multi-policy algorithms determine optimality by comparing policies based on the utility of expected value vectors (see Eq. 4). In contrast, under the ESR criterion it is crucial to maintain a distribution over the utility of possible vector-valued outcomes. SER multi-policy algorithms therefore cannot be used to learn optimal policies under the ESR criterion, because they compute expected value vectors. It is necessary to develop new methods that can generate solution sets for the ESR criterion with unknown utilities, and methods that determine a partial ordering over return distributions are a promising avenue to address this challenge.

First-order stochastic dominance (see Sect. 2.6) is a method which gives a partial ordering over random variables [20, 46]. FSD compares the cumulative distribution functions (CDFs) of the underlying probability distributions of random variables to determine optimality. When computing policies under the ESR criterion, it is essential that the expected utility is maximised. To use FSD for the ESR criterion, we must show the FSD conditions presented in Sect. 2.6 also hold when optimising the expected utility for unknown monotonically increasing utility functions.

For the single-objective case, Theorem 2 proves that, for random variables X and Y with corresponding CDFs \(F_{X}\) and \(F_{Y}\), if X \(\succeq _\mathrm{FSD}\) Y then the expected utility of X is greater than or equal to the expected utility of Y for all monotonically increasing utility functions. The work of Mas-Colell et al. [23] is used as a foundation for Theorem 2.

Theorem 2

A random variable, X, is preferred to a random variable, Y, by all decision makers with a monotonically increasing utility function if X \(\succeq _\mathrm{FSD}\) Y.

$$\begin{aligned} X \succeq _\mathrm{FSD} Y\; \implies\; \mathbb {E}(u(X)) \ge \mathbb {E}(u(Y)). \end{aligned}$$

Proof

If X \(\succeq _\mathrm{FSD}\) Y, thenFootnote 3,

$$\begin{aligned} F_{X}(z) \le F_{Y}(z), \forall \, z, \end{aligned}$$

and the expected utilities can be written as

$$\begin{aligned} \mathbb {E}(u(X))= & \int _{-\infty }^{\infty } u(z) {\mathrm {d}}F_{X}(z)\\ \mathbb {E}(u(Y))= & \int _{-\infty }^{\infty } u(z) {\mathrm {d}}F_{Y}(z). \end{aligned}$$

When integrating both \(\mathbb {E}(u(X))\) and \(\mathbb {E}(u(Y))\) by parts, the following results are generated:

$$\begin{aligned} \mathbb {E}(u(X))&= [u(z)F_{X}(z)]_{-\infty }^{\infty } - \int _{-\infty }^{\infty } u'(z)F_{X}(z) \,{\mathrm {d}}z\\ \mathbb {E}(u(Y))&= [u(z)F_{Y}(z)]_{-\infty }^{\infty } - \int _{-\infty }^{\infty } u'(z)F_{Y}(z) \,{\mathrm {d}}z. \end{aligned}$$

Given \(F_X(-\infty ) = F_Y(-\infty ) = 0\) and \(F_X(\infty ) = F_Y(\infty ) = 1\), the first terms in \(\mathbb {E}(u(X))\) and \(\mathbb {E}(u(Y))\) are equal, and thus,

$$\begin{aligned} \mathbb {E}(u(X)) -\mathbb {E}(u(Y)) = \int _{-\infty }^{\infty } u'(z)F_{Y}(z) \,{\mathrm {d}}z - \int _{-\infty }^{\infty } u'(z)F_{X}(z) \,{\mathrm {d}}z \end{aligned}$$

Since \(F_{X}(z) \le F_{Y}(z)\) and \(u'(z) \ge 0\) for all monotonically increasing utility functions,

$$\begin{aligned} \mathbb {E}(u(X)) - \mathbb {E}(u(Y)) \ge 0 \end{aligned}$$

and thus,

$$\begin{aligned} \mathbb {E}(u(X)) \ge \mathbb {E}(u(Y)). \end{aligned}$$

\(\square\)

A utility function maps an input (a scalar or vector return) to an output (a scalar utility). Since, for a monotonically increasing utility function, the probability of receiving a utility greater than u(c) is equal to the probability of receiving a return greater than c, for a random variable, X, we can write the following:

$$\begin{aligned} P(X> c) = P(u(X) > u(c)), \end{aligned}$$
(10)

where c is a constant. Using the results shown in Theorem 2 and Eq. 10, the FSD conditions highlighted in Sect. 2.6 can be rewritten to include monotonically increasing utility functions:

$$\begin{aligned} P(u(X)> u(z)) \ge P(u(Y) > u(z)). \end{aligned}$$
(11)

Definition 2

Let X and Y be random variables. X dominates Y for all decision makers with a monotonically increasing utility function if the following is true:

$$\begin{aligned}&X \succeq _\mathrm{FSD} Y \Leftrightarrow \\&\quad \forall u : \forall {v} : P(u(X)> u({v})) \ge P( u(Y) > u({v})). \end{aligned}$$

In MORL, the return from the reward function is a vector, where each element in the return vector represents an objective. To apply FSD to MORL under the ESR criterion, random vectors must be considered. In this case, a random vector (or multivariate random variable) is a vector whose components are scalar-valued random variables on the same probability space. For simplicity, this paper focuses on the case in which a random vector has two random variables, known as the bi-variate case. FSD conditions have been proven to hold for random vectors with n random variables in the works of Sriboonchitta et al. [39], Levhari et al. [19], Nakayama et al. [25] and Scarsini [37]. In Theorem 3, the work of Atkinson and Bourguignon [2] is distilled into a theorem suitable for MORL. Theorem 3 highlights how the conditions for FSD hold for random vectors when optimising under the ESR criterion for a monotonically increasing utility function, u, where \(\frac{\partial ^2 u(x_1, x_2)}{\partial x_1 \partial x_2} \le 0\) [32]. It is important to note that Atkinson and Bourguignon [2] have also shown that the conditions for FSD hold for random vectors for utility functions where \(\frac{\partial ^2 u(x_1, x_2)}{\partial x_1 \partial x_2} \ge 0\); we plan to extend these conditions for MORL in future work. In Theorem 3, \(\mathbf{X}\) and \(\mathbf{Y}\) are random vectors where each random vector consists of two random variables, \(\mathbf{X} = [X_{1}, X_{2}]\) and \(\mathbf{Y} = [Y_{1}, Y_{2}]\), and \(F_{X_{1}X_{2}}\) and \(F_{Y_{1}Y_{2}}\) are the corresponding CDFs.

Theorem 3

Assume that \(u\ :\ \mathbb {R}_{\ge 0} \times \mathbb {R}_{\ge 0} \rightarrow \mathbb {R}_{\ge 0}\) is a monotonically increasing function, with \(\frac{\partial u(x_1, x_2)}{\partial x_1} \ge 0\), \(\frac{\partial u(x_1, x_2)}{\partial x_2} \ge 0\) and \(\frac{\partial ^2 u(x_1, x_2)}{\partial x_1 \partial x_2} \le 0\). If, for random vectors \(\mathbf {X}\) and \(\mathbf {Y}\), \(\mathbf {X} \succeq _\mathrm{FSD} \mathbf {Y}\), then \(\mathbf {X}\) is preferred to \(\mathbf {Y}\) by all decision makers with such a utility function, i.e.

$$\begin{aligned} \mathbf {X} \succeq _\mathrm{FSD} \mathbf {Y} \implies \mathbb {E}(u(\mathbf {X})) \ge \mathbb {E}(u(\mathbf {Y})). \end{aligned}$$

Proof

As \(\mathbf {X} \succeq _\mathrm{FSD} \mathbf {Y}\), \(\forall t, z\) we have

$$\begin{aligned} \begin{aligned}&F_\mathbf {X}(t, z) \le F_\mathbf {Y}(t, z),\\ \text {or }&\Delta _F(t, z) = F_\mathbf {X}(t, z) - F_\mathbf {Y}(t, z) \le 0. \end{aligned} \end{aligned}$$

The expected utility of the random vector \(\mathbf {X}\) can be written as follows:

$$\begin{aligned} \mathbb {E}\left( u(\mathbf{X} )\right) = \int ^{\infty }_{0} \int ^{\infty }_{0} u(t, z)f_\mathbf {X}(t, z){\mathrm {d}}t{\mathrm {d}}z, \end{aligned}$$

where \(f_\mathbf {X}\) is the probability density function of \(\mathbf{X}\). Note that

$$\begin{aligned} \begin{aligned} \Delta _f(t, z)&= f_\mathbf {X}(t, z) - f_\mathbf {Y}(t, z)\\&= \frac{\partial ^2 \Delta _F(t, z)}{\partial t \partial z}. \end{aligned} \end{aligned}$$

Using integration-by-parts (I), and the fact that \(\Delta _F(t, 0) = \frac{\partial \Delta _F(0, z)}{\partial z} = 0\) (Z), we obtain:

$$\begin{aligned}&\begin{aligned}&\mathbb {E}\left( u(\mathbf{X} )\right) - \mathbb {E}\left( u(\mathbf{Y} )\right) \\&= \int ^{\infty }_{0} \int ^{\infty }_{0} u(t, z)\Delta _f(t, z){\mathrm {d}}t{\mathrm {d}}z\\&{\mathop {=}\limits ^{(I)}} \int ^{\infty }_{0} \left[ u(t, z)\frac{\partial \Delta _F(t, z)}{\partial z}\right] ^{\infty }_{t=0} {\mathrm {d}}z - \int ^{\infty }_{0}\int ^{\infty }_{0} \frac{\partial u(t, z)}{\partial t}\frac{\partial \Delta _F(t, z)}{\partial z} {\mathrm {d}}t{\mathrm {d}}z \end{aligned}\\&\begin{aligned}&{\mathop {=}\limits ^{(I)}} \int ^{\infty }_{0} \left[ u(t, z)\frac{\partial \Delta _F(t, z)}{\partial z}\right] ^{\infty }_{t=0} {\mathrm {d}}z - \int ^{\infty }_0\left[ \frac{\partial u(t, z)}{\partial t}\Delta _F(t, z)\right] ^{\infty }_{z=0} {\mathrm {d}}t \\&+\int ^{\infty }_{0}\int ^{\infty }_{0} \frac{\partial ^2 u(t, z)}{\partial t\partial z}\Delta _F(t, z) {\mathrm {d}}t{\mathrm {d}}z\\ \end{aligned}\\&\begin{aligned}&{\mathop {=}\limits ^{(Z)}} \int ^{\infty }_{0} \lim _{t \rightarrow \infty } u(t, z)\frac{\partial \Delta _F(t, z)}{\partial z} {\mathrm {d}}z - \int ^{\infty }_0\lim _{z \rightarrow \infty }\frac{\partial u(t, z)}{\partial t}\Delta _F(t, z) {\mathrm {d}}t \\&+\int ^{\infty }_{0}\int ^{\infty }_{0} \frac{\partial ^2 u(t, z)}{\partial t\partial z}\Delta _F(t, z) {\mathrm {d}}t{\mathrm {d}}z. \end{aligned} \end{aligned}$$

Given that \(\frac{\partial ^2 u(t, z)}{\partial t\partial z} \le 0\), \(\frac{\partial u(t, z)}{\partial t} \ge 0\) and \(\Delta _F(t, z) \le 0\), we know that the last two terms are positive. Therefore, we can state that

$$\begin{aligned}&\begin{aligned}&\mathbb {E}\left( u(\mathbf{X} )\right) - \mathbb {E}\left( u(\mathbf{Y} )\right) \\&= \int ^{\infty }_{0} \lim _{t \rightarrow \infty } u(t, z) \frac{\partial \Delta _F(t, z)}{\partial z} {\mathrm {d}}z - \int ^{\infty }_0 \lim _{z \rightarrow \infty }\frac{\partial u(t, z)}{\partial t}\Delta _F(t, z) {\mathrm {d}}t \\ \end{aligned}\\&\begin{aligned}&+\int ^{\infty }_{0}\int ^{\infty }_{0} \frac{\partial ^2 u(t, z)}{\partial t\partial z}\Delta _F(t, z) {\mathrm {d}}t{\mathrm {d}}z \ge \int ^{\infty }_{0} \lim _{t \rightarrow \infty } u(t, z)\frac{\partial \Delta _F(t, z)}{\partial z} {\mathrm {d}}z. \end{aligned} \end{aligned}$$

According to Lemma 2 (see Appendix), as \(u(t, z)F(t, z)\) is a positive monotonically increasing function in both t and z, we know that:

$$\begin{aligned} \begin{aligned}&\int ^\infty _0 \lim _{t \rightarrow \infty } u(t, z)\frac{\partial F(t, z)}{\partial z} {\mathrm {d}}z = \lim _{t \rightarrow \infty } \int ^\infty _0 u(t, z)\frac{\partial F(t, z)}{\partial z} {\mathrm {d}}z. \end{aligned} \end{aligned}$$

Using integration-by-parts (I), and the fact that \(\Delta _F(t, 0) = 0\) (Z), we have:

$$\begin{aligned} \begin{aligned}&\mathbb {E}\left( u(\mathbf{X} )\right) - \mathbb {E}\left( u(\mathbf{Y} )\right) \\&\ge \lim _{t \rightarrow \infty } \int ^\infty _0 u(t, z) \frac{\partial \Delta _F(t, z)}{\partial z} {\mathrm {d}}z\\&{\mathop {=}\limits ^{(I)}} \lim _{t \rightarrow \infty } \left[ u(t, z)\Delta _F(t, z)\right] ^{\infty }_{0} - \lim _{t \rightarrow \infty } \int ^\infty _0 \frac{\partial u(t, z)}{\partial z} \Delta _F(t, z) {\mathrm {d}}z\\&{\mathop {=}\limits ^{(Z)}} \lim _{t \rightarrow \infty } \lim _{z \rightarrow \infty } u(t, z)\Delta _F(t, z) - \lim _{t \rightarrow \infty } \int ^\infty _0 \frac{\partial u(t, z)}{\partial z}\Delta _F(t, z) {\mathrm {d}}z.\\ \end{aligned} \end{aligned}$$

Finally, given that \(\frac{\partial u(t, z)}{\partial z} \ge 0\) and \(\Delta _F(t, z) \le 0\), we know that:

$$\begin{aligned} \begin{aligned}&\mathbb {E}\left( u(\mathbf{X} )\right) - \mathbb {E}\left( u(\mathbf{Y} )\right) \\&\ge \lim _{t \rightarrow \infty } \lim _{z \rightarrow \infty } u(t, z) \Delta _F(t, z) - \lim _{t \rightarrow \infty } \int ^\infty _0 \frac{\partial u(t, z)}{\partial z}\Delta _F(t, z) {\mathrm {d}}z\\&\ge 0. \end{aligned} \end{aligned}$$

\(\square\)

Using the results from Theorem 3, Eq. 11 can be updated to include random vectors,

$$\begin{aligned} P(u(\mathbf{X} )> u(\mathbf{z} )) \ge P(u(\mathbf{Y} ) > u(\mathbf{z} )). \end{aligned}$$
(12)

Definition 3

For random vectors \(\mathbf{X}\) and \(\mathbf{Y}\), \(\mathbf{X}\) is preferred over \(\mathbf{Y}\) by all decision makers with a monotonically increasing utility function if, and only if, the following is true:

$$\begin{aligned}&\mathbf{X} \succeq _\mathrm{FSD} \mathbf{Y} \Leftrightarrow \\&\quad \forall u : \forall \mathbf{v }: P(u(\mathbf{X} )> u(\mathbf{v })) \ge P( u(\mathbf{Y} ) > u(\mathbf{v })). \end{aligned}$$

Using the results from Theorem 3 and Definition 3, it is possible to extend FSD to MORL. For MORL under the ESR criterion, the return distribution, \(\mathbf{Z} ^{\pi }\), is considered to be the full distribution of the vector returns received when executing a policy, \(\pi\) (see Sect. 3), and return distributions can be used to represent policies. In this case, it is possible to use FSD to obtain a partial ordering over policies. For example, consider two policies, \(\pi\) and \(\pi '\), with underlying return distributions \(\mathbf{Z} ^{\pi }\) and \(\mathbf{Z} ^{\pi '}\), respectively. If \(\mathbf{Z} ^{\pi } \, \succeq _\mathrm{FSD} \mathbf{Z} ^{\pi '}\), then \(\pi\) will be preferred over \(\pi '\).

Definition 4

Policies \(\pi\) and \(\pi '\) have return distributions \(\mathbf{Z} ^{\pi }\) and \(\mathbf{Z} ^{\pi '}\). Policy \(\pi\) is preferred over policy \(\pi '\) by all decision makers with a utility function, u, that is monotonically increasing if, and only if, the following is true:

$$\begin{aligned} \mathbf{Z} ^{\pi } \succeq _\mathrm{FSD} \mathbf{Z} ^{\pi '}. \end{aligned}$$

Now that a partial ordering over policies has been defined under the ESR criterion for the unknown utility function scenario, it is possible to define a set of optimal policies.

5 Solution sets for ESR

Section 4 defines a partial ordering over policies under the ESR criterion for unknown utility functions using FSD. In the unknown utility function scenario, it is infeasible to learn a single optimal policy [36]. When a user’s utility function is unknown, multi-policy MORL algorithms must be used to learn a set of optimal policies. To apply MORL to the ESR criterion in scenarios with unknown utility, a set of optimal policies under the ESR criterion must be defined. In Sect. 5, FSD is used to define multiple sets of optimal policies for the ESR criterion.

First, a set of optimal policies, known as the undominated set, is defined using FSD: each policy in the undominated set has an underlying return distribution that is FSD dominant for some monotonically increasing utility function. The undominated set contains at least one optimal policy for all possible monotonically increasing utility functions.

Definition 5

The undominated set, \(U(\Pi )\), is the subset of all possible policies for which there exists some utility function, u, under which the policy’s return distribution is FSD dominant.

$$\begin{aligned} U(\Pi ) = \left\{ \pi \in \Pi \ \big |\ \exists u, \forall \pi ' \in \Pi : \mathbf{Z} ^{\pi } \succeq _\mathrm{FSD} \mathbf{Z} ^{\pi '}\right\} . \end{aligned}$$

However, the undominated set may contain excess policies. For example, under FSD, if two dominant policies have return distributions that are equal, then both policies will be in the undominated set. Given both return distributions are equal, a user with a monotonically increasing utility function will not prefer one policy over the other. In this case, both policies have the same expected utility. To reduce the number of policies that must be considered at execution time, for each possible utility function we can keep just one corresponding FSD dominant policy; such a set of policies is called a coverage set (CS).

Definition 6

The coverage set, \(CS(\Pi )\), is a subset of the undominated set, \(U(\Pi )\), where, for every utility function, u, the set contains a policy that has an FSD dominant return distribution,

$$\begin{aligned} CS(\Pi ) \subseteq U(\Pi ) \wedge \left( \forall u, \exists \pi \in CS(\Pi ), \forall {\pi ' \in \Pi } : \mathbf {Z}^{\pi } \succeq _\mathrm{FSD} \mathbf {Z}^{\pi '} \right) . \end{aligned}$$

In practice, a decision maker may aim to learn the smallest possible set of optimal policies. However, the FSD criterion considered in this work does not include a strict inequality condition, so the undominated set generated using FSD may contain excess policies. Therefore, to compute a coverage set in practice in which each optimal policy has a unique return distribution, we define expected scalarised returns dominance (ESR dominance). In contrast to FSD, ESR dominance ensures that an explicitly strict inequality exists.

Definition 7

For random vectors X and Y, X \(\succ _\mathrm{ESR}\) Y for all decision makers with a monotonically increasing utility function if, and only if, the following is true:

$$\begin{aligned}&\mathbf{X} \succ _\mathrm{ESR} \mathbf{Y} \Leftrightarrow \\&\quad \forall u : ( \forall \mathbf{v }: P(u(\mathbf{X} )> u(\mathbf{v })) \ge P( u(\mathbf{Y} )> u(\mathbf{v }))\\&\wedge \exists \,\mathbf{v} : P( u(\mathbf{X} )> u(\mathbf{v }))> P( u(\mathbf{Y} ) > u(\mathbf{v }))). \end{aligned}$$

ESR dominance (Definition 7) extends FSD; however, ESR dominance is a stricter dominance criterion. Under FSD, policies that have equal return distributions can both be considered dominant, which is not the case under ESR dominance. Therefore, if a random vector is ESR dominant, it has a strictly greater expected utility than all ESR-dominated random vectors. Theorem 4 proves that ESR dominance guarantees a greater expected utility for all monotonically increasing utility functions, and can therefore be used under the ESR criterion when the utility function of the user is unknown. Theorem 4 focuses on random vectors \(\mathbf{X}\) and \(\mathbf{Y}\) where each random vector has two random variables, such that \(\mathbf{X} = [X_{1}, X_{2}]\) and \(\mathbf{Y} = [Y_{1}, Y_{2}]\), \(F_\mathbf{X }\) and \(F_\mathbf{Y }\) are the corresponding CDFs and \(\mathbf{v} = [t, z]\). However, Theorem 4 can easily be extended to random vectors with n random variables (\(\mathbf{X} = [X_{1}, X_{2}, \ldots , X_{n}]\)).

Theorem 4

A random vector, X, is preferred to a random vector, Y, by all decision makers with a monotonically increasing utility function if X \(\succ _\mathrm{ESR}\) Y:

$$\begin{aligned} \mathbf{X} \succ _\mathrm{ESR} \mathbf{Y} \implies \mathbb {E}(u(\mathbf{X} )) > \mathbb {E}(u(\mathbf{Y} )) \end{aligned}$$

Proof

\(\mathbf{X}\) and \(\mathbf{Y}\) are random vectors, each with two random variables. If X \(\succ _\mathrm{ESR}\) Y, the following two conditions must be met for all u:

  1. \(\forall \mathbf{v }: P(u(\mathbf{X} )> u(\mathbf{v })) \ge P( u(\mathbf{Y} ) > u(\mathbf{v }))\)

  2. \(\exists \,\mathbf{v} : P( u(\mathbf{X} )> u(\mathbf{v }))> P( u(\mathbf{Y} ) > u(\mathbf{v }))\).

From Definition 3, if X \(\succeq _\mathrm{FSD}\) Y, then the following is true:

$$\begin{aligned} \forall u : \forall \mathbf{v }: P(u(\mathbf{X} )> u(\mathbf{v })) \ge P( u(\mathbf{Y} ) > u(\mathbf{v })). \end{aligned}$$

If X \(\succeq _\mathrm{FSD}\) Y, then, from Theorem 3, the following is true:

$$\begin{aligned} \mathbb {E}(u(\mathbf{X} )) \ge \mathbb {E}(u(\mathbf{Y} )) \end{aligned}$$

If condition 1 is satisfied, the expected utility of \(\mathbf{X}\) is at least equal to the expected utility of \(\mathbf{Y}\), where:

$$\begin{aligned} \mathbb {E}(u(\mathbf{X} ))&= \int _{-\infty }^{\infty } \int _{-\infty }^{\infty } u(t, z)f_\mathbf{X }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z\\ \mathbb {E}(u(\mathbf{Y} ))&= \int _{-\infty }^{\infty } \int _{-\infty }^{\infty } u(t, z)f_\mathbf{Y }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z. \end{aligned}$$

In order to satisfy condition 2, some limits must exist to give the following,

$$\begin{aligned} \int _{a}^{b} \int _{c}^{d} u(t, z)f_\mathbf{X }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z > \int _{a}^{b} \int _{c}^{d} u(t, z) f_\mathbf{Y }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z. \end{aligned}$$

The minimum requirement to satisfy condition 1 is:

$$\begin{aligned} \int _{-\infty }^{\infty } \int _{-\infty }^{\infty } u(t, z)f_\mathbf{X }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z = \int _{-\infty }^{\infty } \int _{-\infty }^{\infty } u(t, z)f_\mathbf{Y }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z. \end{aligned}$$

If condition 1 is satisfied, to satisfy condition 2 some limits must exist:

$$\begin{aligned} \int _{a}^{b} \int _{c}^{d} u(t, z)f_\mathbf{X }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z > \int _{a}^{b} \int _{c}^{d} u(t, z)f_\mathbf{Y }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z. \end{aligned}$$

Therefore,

$$\begin{aligned}&\int _{-\infty }^{a} \int _{-\infty }^{c} u(t, z)f_\mathbf{X }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z \, + \int _{a}^{b} \int _{c}^{d} u(t, z)f_\mathbf{X }(t, z) \, {\mathrm {d}}t \,{\mathrm {d}}z \, + \\ \int _{b}^{\infty } \int _{d}^{\infty } u(t, z)f_\mathbf{X }(t, z) \,{\mathrm {d}}t \, {\mathrm {d}}z > \int _{-\infty }^{a} \int _{-\infty }^{c} u(t, z)f_\mathbf{Y }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z \, + \\&\quad \int _{a}^{b} \int _{c}^{d} u(t, z)f_\mathbf{Y }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z \, + \int _{b}^{\infty } \int _{d}^{\infty } u(t, z)f_\mathbf{Y }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z. \end{aligned}$$

Finally,

$$\begin{aligned} \int _{-\infty }^{\infty } \int _{-\infty }^{\infty } u(t, z)f_\mathbf{X }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z > \int _{-\infty }^{\infty } \int _{-\infty }^{\infty } u(t, z)f_\mathbf{Y }(t, z) \, {\mathrm {d}}t \, {\mathrm {d}}z. \end{aligned}$$

If X \(\succ _\mathrm{ESR}\) Y, then

$$\begin{aligned} \mathbb {E}(u(\mathbf{X} )) > \mathbb {E}(u(\mathbf{Y} )). \end{aligned}$$

\(\square\)

In the ESR dominance criterion defined in Definition 7, the utility of different vectors is compared. However, it is not possible to calculate the utility of a vector when the utility function is unknown. In this case, Pareto dominance [28] can be used instead to determine the relative utility of the vectors being compared.

Definition 8

\(\mathbf{A}\) Pareto dominates (\(\succ _{p}\)) \(\mathbf{B}\) if the following is true:

$$\begin{aligned} \mathbf {A} \succ _{p} \mathbf {B} \Leftrightarrow (\forall i: \mathbf {A}_{i} \ge \mathbf {B}_{i}) \wedge (\exists i: \mathbf {A}_i > \mathbf {B}_i). \end{aligned}$$
(13)

For monotonically increasing utility functions, if the value of an element of the vector increases, then the scalar utility of the vector also increases. Therefore, using Definition 8, if vector \(\mathbf {A}\) Pareto dominates vector \(\mathbf {B}\), for a monotonically increasing utility function, \(\mathbf {A}\) has a higher utility than \(\mathbf {B}\). To make ESR comparisons between return distributions, Pareto dominance can be used.

Definition 9

For random vectors X and Y, X \(\succ _\mathrm{ESR}\) Y for all monotonically increasing utility functions if, and only if, the following is true:

$$\begin{aligned}&\mathbf{X} \succ _\mathrm{ESR} \mathbf{Y} \Leftrightarrow \\&\quad \forall \mathbf{v}: P( \mathbf{X} \succ _{p} \mathbf{v}) \ge P( \mathbf{Y} \succ _{p} \mathbf{v}) \wedge \exists \mathbf{v} : P( \mathbf{X} \succ _{p} \mathbf{v})> P( \mathbf{Y} \succ _{p} \mathbf{v}). \end{aligned}$$

It is also possible to calculate ESR dominance by comparing the CDFs of random vectors; a comparison based on the CDFs likewise guarantees that the dominant random vector has a greater expected utility. Using the CDF, we compare the cumulative probabilities for a given vector, where a lower cumulative probability is preferred. ESR dominance with the CDF does not require any information about the utility function of a user and can therefore be used in the unknown utility function scenario.

Definition 10

For random vectors X and Y, X \(\succ _\mathrm{ESR}\) Y for all monotonically increasing utility functions if, and only if, the following is true:

$$\begin{aligned}&\mathbf{X} \succ _\mathrm{ESR} \mathbf{Y} \Leftrightarrow \\&\quad \forall \mathbf{v}: F_\mathbf{X }(\mathbf{v}) \le F_\mathbf{Y }(\mathbf{v}) \wedge \exists \mathbf{v} : F_\mathbf{X }(\mathbf{v}) < F_\mathbf{Y }(\mathbf{v}). \end{aligned}$$

Therefore, we can use either Definition 9 or Definition 10 to calculate ESR dominance to give a partial ordering over policies.
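As an illustration, a minimal sketch of the CDF-based check in Definition 10 for finite, bivariate return distributions, assuming both distributions are represented as joint probability masses on a shared, discretised grid of return vectors; the example distributions are our own.

```python
import numpy as np

def cdf_from_pmf(pmf):
    """Joint CDF of a discrete bivariate distribution: cumulative sum along both axes."""
    return pmf.cumsum(axis=0).cumsum(axis=1)

def esr_dominates(pmf_x, pmf_y):
    """Definition 10: F_X(v) <= F_Y(v) everywhere, with strict inequality somewhere."""
    F_x, F_y = cdf_from_pmf(pmf_x), cdf_from_pmf(pmf_y)
    return bool(np.all(F_x <= F_y) and np.any(F_x < F_y))

# Two return distributions on the same 2x2 grid of return vectors.
Z_pi = np.array([[0.0, 0.2],
                 [0.3, 0.5]])
Z_pi_prime = np.array([[0.2, 0.3],
                       [0.2, 0.3]])
print(esr_dominates(Z_pi, Z_pi_prime))   # True: Z_pi ESR dominates Z_pi_prime
```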

Definition 11

For return distributions \(\mathbf{Z} ^{\pi }\) and \(\mathbf{Z} ^{\pi '}\) for policies \(\pi\) and \(\pi '\), \(\pi\) is preferred over \(\pi '\) by all decision makers with a monotonically increasing utility function if, and only if, the following is true:

$$\begin{aligned} \mathbf{Z} ^{\pi } \succ _\mathrm{ESR} \mathbf{Z} ^{\pi '}. \end{aligned}$$

Using ESR dominance, it is possible to define a set of optimal policies, known as the ESR set.

Definition 12

The ESR set, ESR\((\Pi )\), is the subset of all policies in which no policy is ESR dominated by any other policy,

$$\begin{aligned} {\mathrm {ESR}}(\Pi ) = \{ \pi \in \Pi \ |\ \not \exists \pi '\in \Pi : \mathbf {Z}^{\pi '} \succ _\mathrm{ESR} \mathbf {Z}^\pi \}. \end{aligned}$$
(14)

The ESR set is a set of non-dominated policies: no policy in the ESR set is ESR dominated. The ESR set can be considered a coverage set, given that no excess policies exist in the ESR set. A multi-policy MORL method can therefore use ESR dominance to construct the ESR set.
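Given a pairwise ESR dominance check such as the sketch above, Definition 12 translates directly into a filtering step over candidate policies. The following is again an illustrative sketch under the same assumption of a shared, discretised grid of return vectors.

```python
import numpy as np

def joint_cdf(pmf):
    """Joint CDF of a discrete bivariate return distribution."""
    return pmf.cumsum(axis=0).cumsum(axis=1)

def esr_dominates(pmf_x, pmf_y):
    """Definition 10: F_X <= F_Y everywhere, with strict inequality somewhere."""
    F_x, F_y = joint_cdf(pmf_x), joint_cdf(pmf_y)
    return bool(np.all(F_x <= F_y) and np.any(F_x < F_y))

def esr_set(return_distributions):
    """Eq. 14: keep every policy whose return distribution is not ESR dominated."""
    return [i for i, Z_i in enumerate(return_distributions)
            if not any(esr_dominates(Z_j, Z_i)
                       for j, Z_j in enumerate(return_distributions) if j != i)]

candidates = [np.array([[0.0, 0.2], [0.3, 0.5]]),   # pi_1
              np.array([[0.2, 0.3], [0.2, 0.3]]),   # pi_2 (ESR dominated by pi_1)
              np.array([[0.1, 0.0], [0.4, 0.5]])]   # pi_3
print(esr_set(candidates))   # [0, 2]: indices of the non-dominated policies
```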

6 Multi-objective tabular distributional reinforcement learning

Traditionally in the MORL literature, multi-objective methods learn a set of optimal solutions when the utility function of a user is unknown or hard to specify [18, 36]. However, the current MORL literature focuses only on methods which learn the optimal set of policies under the SER criterion [24, 45]. As already highlighted, the ESR criterion has largely been ignored by the MORL community, with a few exceptions [15, 33, 43]. In Sect. 6, we address this research gap and present a novel multi-objective tabular distributional reinforcement learning (MOTDRL) algorithm that learns the optimal set of policies for the ESR criterion, also known as the ESR set, in multi-objective multi-armed bandit (MOMAB) problems.

MOTDRL learns the return distribution of each policy by sampling the available arms in an MOMAB setting and maintaining a multivariate distribution over the returns received. Given that MOTDRL only considers MOMAB problem domains, it maintains a distribution per arm and updates that distribution at each timestep with the return vector received from executing the sampled arm. When optimising under the ESR criterion, it is critical that a MORL method learns the underlying distribution over the returns. Other distributional MORL methods, such as bootstrap Thompson sampling [15], cannot be used to learn a set of optimal policies under the ESR criterion when the utility function is unknown, because such methods learn a distribution over the mean returns. In scenarios where the utility function is unknown or unavailable, this would invalidate the ESR criterion, as a distribution over mean return vectors would be computed. Given that the full return distribution must be used when learning the ESR set, new distributional MORL methods must be formulated to learn the underlying return distributions.

MOTDRL can learn the underlying return distribution for an arm by maintaining a tabular representation of the underlying multivariate distribution. To maintain a tabular representation of a multivariate distribution, we initialise a Z-table for each arm where the Z-table has an axis per objective. The Z-table maintains a count of the number of times a return vector is received for a given arm. The size of each Z-table is initialised using the parameters \(\mathbf{R} _\mathrm{min}\) and \(\mathbf{R} _\mathrm{max}\) which are the minimum and maximum returns obtainable for any of the objectives in the given environment. Therefore, each axis in the Z-table will use \(\mathbf{R} _\mathrm{min}\) and \(\mathbf{R} _\mathrm{max}\) to define the length of the axis, where each index value of the Z-table is initialised to 0. Using \(\mathbf{R} _\mathrm{min}\) and \(\mathbf{R} _\mathrm{max}\) as initialisation parameters, a Z-table can be constructed which contains an index for all possible return vectors in a given problem domain. Table 3 visualises an initialised Z-table for a multi-objective problem with two objectives where \(\mathbf{R} _\mathrm{min}\) = 0 and \(\mathbf{R} _\mathrm{max}\) = 5.

Table 3 An illustration of an initialised Z-table for a problem domain with two objectives, \(x_{1}\) and \(x_{2}\), with each index value set to 0

Each Z-table can be used to calculate the return distribution, \(\mathbf{Z} ^{\pi }\), of an arm, where pulling an arm can be considered as executing a policy \(\pi\) (see Sect. 3). At each timestep, t, the returns, \(\mathbf{R}\), received from pulling arm, i, are used to update that arm’s Z-table, which maintains a count of the number of times each return, \(\mathbf{R}\), is received. In MOMAB problem domains, the returns received from the execution of an arm represent the full returns of the execution of a policy. To update the Z-table, the value at the index corresponding to the return \(\mathbf{R}\) is incremented by one. To correctly calculate the probability of receiving return \(\mathbf{R}\) when pulling arm i, a counter, \(N_{i}\), which represents the number of times arm i has been pulled, must also be maintained; each time arm i is pulled, \(N_{i}\) is incremented by one. Algorithm 1 outlines how the Z-table for each arm is updated.
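A minimal sketch of this update is shown below; it is our own illustration rather than the authors' exact Algorithm 1, and it assumes integer-valued returns between \(\mathbf{R} _\mathrm{min}\) and \(\mathbf{R} _\mathrm{max}\) so that a return vector maps directly onto an index of the Z-table.

```python
import numpy as np

n_arms, n_objectives = 3, 2
R_min, R_max = 0, 5
side = R_max - R_min + 1

# One Z-table of counts per arm, plus a pull counter N_i per arm.
Z = np.zeros((n_arms,) + (side,) * n_objectives, dtype=int)
N = np.zeros(n_arms, dtype=int)

def update(arm, returns):
    """Increment the count at the index corresponding to the observed return vector."""
    idx = tuple(int(r) - R_min for r in returns)
    Z[arm][idx] += 1
    N[arm] += 1

update(arm=0, returns=(4, 3))   # arm 0 returned the vector (4, 3)
update(arm=0, returns=(2, 3))
```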

MOTDRL is a multi-policy algorithm that can learn the ESR set using ESR dominance. Using ESR dominance, a partial ordering over policies can be determined when the utility function of a user is unknown. Algorithm 2 outlines how MOTDRL learns the ESR set when the utility function of a user is unknown in an MOMAB problem domain. In Algorithm 2, \(\mathcal {A}\) is the set of available arms, E is the ESR set, D is the number of objectives, n is the total number of pulls across all arms, \(N_{l}\) and \(N_{i}\) are the numbers of pulls of arms l and i, and \(|E^{*}|\) is the cardinality of the ESR set, which is known a priori. When learning, the MOTDRL algorithm has a priori knowledge of \(\mathcal {A}\), \(\mathbf{R} _\mathrm{max}\) and \(\mathbf{R} _\mathrm{min}\). The agent must have knowledge of \(\mathbf{R} _\mathrm{max}\) and \(\mathbf{R} _\mathrm{min}\) so that the Z-tables can be correctly initialised, and must know the number of arms in \(\mathcal {A}\) for action selection.


On initialisation, each arm is pulled \(\beta\) times, where the hyperparameter \(\beta\) is chosen to ensure each arm is pulled sufficiently often to build an initial distribution. For optimal performance, \(\beta\) is set greater than 1, so that MOTDRL can build a sufficient initial distribution and then efficiently explore each arm with the UCB1 statistic. At each timestep, the return distribution of the policy associated with the execution of each arm is calculated, and the ESR set, E, is then computed from the resulting return distributions. Therefore, for every non-optimal arm \(l \not \in E\), there exists an arm \(i \in E\) whose return distribution ESR dominates that of arm l.

To calculate the ESR dominance required in Algorithm 2 at Line 5, it is necessary to compute both the PDF and the CDF of the underlying return distribution of a policy. The PDF can be calculated by computing the probability of receiving each individual return. Combining the Z-table and the counter \(N_{i}\) for an arm, i, it is possible to compute the probability of receiving each return in a given problem domain, since the following is true:

$$\begin{aligned} f_\mathbf{X }(x_{1}, x_{2}, \ldots , x_{n}) = P(X_{1} = x_{1}, X_{2} = x_{2}, \ldots , X_{n} = x_{n}) = \frac{Z_{i}(x_{1}, x_{2}, \ldots , x_{n})}{N_{i}}. \end{aligned}$$
(15)

Once the PDF has been computed using Eq. 15, the CDF can be computed as follows:

$$\begin{aligned} F_\mathbf{X }(x_{1}, x_{2}, \ldots , x_{n})&= P(X_{1} \le x_{1}, X_{2} \le x_{2}, \ldots , X_{n} \le x_{n}) \\&= \sum _{x_{a} \le x_{1}} \sum _{x_{b} \le x_{2}}\cdots \sum _{x_{k} \le x_{n}} P(X_{1} = x_{a}, X_{2} = x_{b}, \ldots , X_{n} = x_{k}) \\&= \sum _{x_{a} \le x_{1}} \sum _{x_{b} \le x_{2}}\cdots \sum _{x_{k} \le x_{n}} \frac{Z_{i}(x_{a}, x_{b}, \ldots , x_{k})}{N_{i}}. \end{aligned}$$
(16)

Using the PDF and the CDF of a return distribution, ESR dominance can be calculated using either Definition 9 or Definition 10.
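Assuming, for illustration, that ESR dominance is checked as first-order stochastic dominance on the CDFs (i.e. \(F_{Z^{\pi _i}}(\mathbf{x} ) \le F_{Z^{\pi _l}}(\mathbf{x} )\) for all \(\mathbf{x}\), with a strict inequality for at least one \(\mathbf{x}\); Definitions 9 and 10 are not reproduced in this section), Eqs. 15 and 16 map directly onto the Z-table representation. The function names below are ours.

```python
import numpy as np

def pdf(z_table, n_pulls):
    """Eq. 15: empirical probability of each return vector for one arm."""
    return z_table / n_pulls

def cdf(z_table, n_pulls):
    """Eq. 16: multivariate CDF, obtained by cumulatively summing the PDF
    along every objective axis."""
    F = pdf(z_table, n_pulls)
    for axis in range(F.ndim):
        F = np.cumsum(F, axis=axis)
    return F

def esr_dominates(cdf_i, cdf_l, tol=1e-12):
    """True if distribution i ESR dominates distribution l, read here as
    first-order stochastic dominance on the CDFs (an assumption; see text)."""
    return bool(np.all(cdf_i <= cdf_l + tol) and np.any(cdf_i < cdf_l - tol))

def esr_set(cdfs):
    """Indices of the arms whose distributions are not ESR dominated."""
    return [i for i, F_i in enumerate(cdfs)
            if not any(esr_dominates(F_j, F_i)
                       for j, F_j in enumerate(cdfs) if j != i)]
```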

To efficiently explore all available arms, MOTDRL uses the UCB1 statistic presented by Drugan et al. [11]. MOTDRL uses UCB1 to transform the PDF of the underlying return distribution by adding the UCB1 statistic, computed at Line 5 in Algorithm 2, to the PDF. Summing the UCB1 statistic and the PDF shifts the PDF relative to the value of the computed UCB1 statistic. The CDF can then be calculated from the transformed PDF, and ESR dominance can subsequently be computed.
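A minimal sketch of this transformation is given below, assuming the standard UCB1 bonus \(\sqrt{2 \ln n / N_i}\) and one plausible reading of the shift, namely that the scalar bonus is added to every entry of the empirical PDF before the CDF is computed; the function names are ours.

```python
import numpy as np

def ucb1_bonus(total_pulls, arm_pulls):
    """Standard UCB1 exploration bonus for an arm (assumed form)."""
    return np.sqrt(2.0 * np.log(total_pulls) / arm_pulls)

def shifted_cdf(z_table, arm_pulls, total_pulls):
    """CDF of the UCB1-shifted PDF; the shift decays as the arm is pulled
    more often, so its effect on dominance checks eventually becomes negligible."""
    shifted_pdf = z_table / arm_pulls + ucb1_bonus(total_pulls, arm_pulls)
    F = shifted_pdf
    for axis in range(F.ndim):
        F = np.cumsum(F, axis=axis)
    return F
```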

Transforming the PDF using the UCB1 statistic ensures that there is sufficient exploration of all available arms during experimentation. However, as the number of pulls of a given arm increases, the UCB1 statistic decreases, which decreases exploration. Over time, the UCB1 statistic’s effect on the PDF and CDF becomes negligible. At such a point, MOTDRL can exploit the return distributions learned during exploration and compute the ESR set.

Given MOTDRL is a multi-policy algorithm, MOTDRL can be used in the unknown utility function scenario (Fig. 1). During the learning phase, MOTDRL learns the ESR set by utilising the steps in Algorithm 2. In Sect. 7, we deploy MOTDRL in two multi-objective multi-armed bandit settings to show MOTDRL can learn the ESR set. It is important to note that the experiments presented only consider the learning phase.

7 Experiments

To evaluate the MOTDRL algorithm, we deploy it in multiple settings (see Footnote 4). Before experimentation, we define a metric that can be used to evaluate the performance of multi-policy ESR methods. We then evaluate MOTDRL in a multi-objective multi-armed bandit setting. Finally, we define a new multi-objective multi-armed bandit problem domain, known as the vaccine recommender system (VRS) environment, and evaluate MOTDRL in it.

7.1 Evaluation metrics

The standard metrics for MORL [42, 48, 49] are not suitable for evaluating a multi-policy method under the ESR criterion, since they are specifically designed to evaluate SER methods. To evaluate MORL algorithms under the ESR criterion, we adapt the coverage ratio metric used by Yang et al. [48] to the ESR criterion. The coverage ratio evaluates the agent’s ability to recover optimal solutions in the ESR set, E. If \(\mathcal {F}\) is the set of solutions (return distributions) found by the agent, we define the following:

$$\begin{aligned} \mathcal {F} \cap _{\epsilon } E := \{ Z^{\pi } \in \mathcal {F} \mid \exists Z^{\pi '} \in E \ \text {s.t.} \ \sup \limits _\mathbf{x } |F_{Z^{\pi }}(\mathbf{x} ) - F_{Z^{\pi '}}(\mathbf{x} )| \le \epsilon \}, \end{aligned}$$
(17)

where \(\mathbf{x} = [x_{1}, x_{2}, \ldots , x_{D}]\) and D is equal to the number of objectives. Equation 17 uses the Kolmogorov–Smirnov statistic [10] (Eq. 18), where \(\sup \nolimits _\mathbf{x }\) is the supremum of the set of distances. The Kolmogorov–Smirnov statistic takes the largest absolute difference between the two CDFs across all \(\mathbf{x}\) values,

$$\begin{aligned} \sup \limits _\mathbf{x } |F_{Z^{\pi }}(\mathbf{x} ) - F_{Z^{\pi '}}(\mathbf{x} )|. \end{aligned}$$
(18)

The Kolmogorov–Smirnov statistic returns a minimum value of 0 and a maximum value of 1. If two CDFs are equal, then the Kolmogorov–Smirnov statistic will return a value of 0.

The coverage ratio is then defined as:

$$\begin{aligned} F_{1} = 2 \cdot \frac{{\mathrm {precision}} \cdot {\mathrm {recall}}}{{\mathrm {precision}} + {\mathrm {recall}}}, \end{aligned}$$
(19)

where precision \(= | \mathcal {F} \cap _{\epsilon } E| / |\mathcal {F}|\) is the fraction of retrieved solutions that are optimal, and recall \(= | \mathcal {F} \cap _{\epsilon } E| / |E|\) is the fraction of optimal solutions that have been retrieved [48].
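The metric can be computed directly from the learned and optimal CDFs when all distributions share the same discretised return grid (as with the Z-table representation). The sketch below is ours and simply follows Eqs. 17–19.

```python
import numpy as np

def ks_statistic(cdf_a, cdf_b):
    """Eq. 18: largest absolute difference between two CDFs on a shared grid."""
    return float(np.max(np.abs(cdf_a - cdf_b)))

def coverage_ratio(found_cdfs, optimal_cdfs, epsilon=0.01):
    """Eqs. 17 and 19: F1 score of the retrieved solution set.

    found_cdfs:   CDFs of the solutions returned by the agent (the set F).
    optimal_cdfs: CDFs of the solutions in the ESR set E.
    """
    matched = sum(
        any(ks_statistic(f, e) <= epsilon for e in optimal_cdfs)
        for f in found_cdfs
    )
    if matched == 0:
        return 0.0
    precision = matched / len(found_cdfs)
    recall = matched / len(optimal_cdfs)
    return 2.0 * precision * recall / (precision + recall)
```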

7.2 Multi-objective multi-armed bandit environment

In Sect. 7.2, we evaluate MOTDRL in a MOMAB setting. To evaluate MOTDRL, we consider a bi-objective MOMAB with five arms. Table 4 outlines the possible outcomes obtainable when selecting a given arm and their corresponding probabilities. Table 4 is unknown to the agent; the agent aims to learn each arm's distribution and to prune the ESR-dominated arms from consideration. In the MOMAB setting, the ESR set is known a priori: the return distributions for \(arm_1\) and \(arm_5\) are ESR dominant, and therefore the ESR set contains only \(arm_1\) and \(arm_5\).

To evaluate MOTDRL in a MOMAB environment, we set \(\mathbf{R} _\mathrm{min} = 0\), \(\mathbf{R} _\mathrm{max} = 10\), D = 2, \(\beta\) = 5 and \(|E^{*}|\) = 2. To compute the coverage ratio, we set \(\epsilon = 0.01\). All experiments in this setting are averaged over 10 runs.

Table 4 A MOMAB with five arms where selecting an arm has two outcomes and two objectives
Fig. 3 Results from the MOMAB environment. MOTDRL is able to learn the ESR set, converging to the optimal coverage ratio: the \(F_1\) score reaches the maximum possible value of 1

MOTDRL is able to learn the underlying return distribution of each arm in the MOMAB setting and, using these distributions, can learn the ESR set in the MOMAB environment. In Fig. 3, we plot the coverage ratio as the \(F_{1}\) score. MOTDRL converges to the optimal \(F_1\) score of 1 after 100,000 episodes. An optimal \(F_1\) score can only be achieved when all policies in the ESR set have been learned by the agent.

MOTDRL computes the ESR set for the MOMAB environment during the learning phase. The learned ESR set contains two arms: \(arm_1\) and \(arm_5\). Both \(arm_1\) and \(arm_5\) are ESR dominant; therefore, any user with a monotonically increasing utility function would prefer \(arm_1\) or \(arm_5\) over all other available arms in the MOMAB problem. MOTDRL returns the ESR set to the user during the selection phase. In practice, a user will select a policy from the ESR set which best reflects their preferences, and the selected policy will then be executed.

Fig. 4 Heatmaps for each return distribution in the ESR set learned by MOTDRL in the MOMAB environment. The left heatmap describes the return distribution for \(arm_1\) learned by MOTDRL, and the right heatmap describes the return distribution for \(arm_5\) learned by MOTDRL

Given ESR dominance is a new solution concept, we utilise Figs. 4, 5 and 6 to give the reader some intuition about ESR dominance. Figure 4 displays the return distributions in the ESR set learned by MOTDRL as heatmaps. Each heatmap in Fig. 4 corresponds to the probabilities highlighted for \(arm_1\) (left) and \(arm_5\) (right) in Table 4.

Fig. 5 CDFs for each policy in the ESR set learned by MOTDRL in the MOMAB environment. The left figure describes the CDF for \(arm_1\) learned by MOTDRL, and the right figure describes the CDF for \(arm_5\) learned by MOTDRL

Figure 5 displays the CDFs for each return distribution in the ESR set learned by MOTDRL. The CDF is used to calculate ESR dominance, and the CDFs in Fig. 5 correspond to the CDFs of \(arm_1\) (left) and \(arm_5\) (right) in Table 4.

Fig. 6 The CDFs for \(arm_1\) and \(arm_5\) intersect at multiple points. Therefore, as per Definition 7: \(arm_1\) \(\nsucc _\mathrm{ESR}\) \(arm_5\) and \(arm_5\) \(\nsucc _\mathrm{ESR}\) \(arm_1\)

Figure 6 describes how \(arm_1\) \(\nsucc _\mathrm{ESR}\) \(arm_5\) and \(arm_5\) \(\nsucc _\mathrm{ESR}\) \(arm_1\) given the CDFs for \(arm_1\) and \(arm_5\) intersect at multiple points (see Definition 7).

Fig. 7 The policies on the Pareto front (left) are different from the expectations of the policies in the ESR set (right). In this case, one policy that is in the ESR set is not on the Pareto front. This figure illustrates why SER methods cannot be used to learn the ESR set

Figure 7 highlights why the choice of optimality criterion must be taken into consideration for multi-objective decision making when the utility function of the user is unknown. A number of SER methods use Pareto dominance to determine a partial ordering over policies, and the Pareto dominant policies, or Pareto front, are then returned to the user. To determine the Pareto front [28], the expectation of each arm in the MOMAB setting is calculated and the Pareto dominant policies are determined. In Fig. 7, the policies on the Pareto front (left) are highlighted in red; all other policies are Pareto dominated. In the MOMAB environment outlined in Table 4, the Pareto front consists of a single policy. Figure 7 (right) displays the expected values of the policies in the ESR set, highlighted in green. Comparing both plots in Fig. 7, it is clear that the ESR set contains an extra policy. Therefore, in some settings, policies that are optimal under the ESR criterion are dominated under the SER criterion. Figure 7 highlights the importance of selecting the correct optimality criterion when learning. If SER methods are used to compute a set of optimal policies in scenarios where the ESR criterion should be used, a sub-optimal policy may be selected by the user at decision time. This may have adverse effects when applying multi-policy multi-objective methods in real-world decision-making settings.
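The following self-contained example (with hypothetical distributions, not those of Table 4) illustrates how this can happen: a risky arm whose expectation is Pareto dominated can remain in the ESR set because its CDF crosses that of the arm dominating it in expectation. The dominance checks below use the same illustrative, CDF-based reading of ESR dominance as the earlier sketches.

```python
import numpy as np

# Two hypothetical bi-objective arms on an integer return grid [0, 10]^2.
GRID = 11
pdf_a = np.zeros((GRID, GRID))
pdf_a[0, 10] = 0.5   # arm A: return vector (0, 10) with probability 0.5
pdf_a[10, 0] = 0.5   # arm A: return vector (10, 0) with probability 0.5
pdf_b = np.zeros((GRID, GRID))
pdf_b[6, 6] = 1.0    # arm B: always returns (6, 6)

def to_cdf(pdf):
    F = pdf
    for axis in range(F.ndim):
        F = np.cumsum(F, axis=axis)
    return F

def expectation(pdf):
    xs = np.arange(GRID)
    marginal_1 = pdf.sum(axis=1)   # marginal distribution of objective 1
    marginal_2 = pdf.sum(axis=0)   # marginal distribution of objective 2
    return np.array([(marginal_1 * xs).sum(), (marginal_2 * xs).sum()])

def pareto_dominates(v, w):
    return bool(np.all(v >= w) and np.any(v > w))

def esr_dominates(cdf_i, cdf_l):
    # First-order stochastic dominance on the CDFs (illustrative reading).
    return bool(np.all(cdf_i <= cdf_l) and np.any(cdf_i < cdf_l))

ea, eb = expectation(pdf_a), expectation(pdf_b)
print(ea, eb)                           # [5. 5.] vs [6. 6.]
print(pareto_dominates(eb, ea))         # True: B Pareto dominates A in expectation
Fa, Fb = to_cdf(pdf_a), to_cdf(pdf_b)
print(esr_dominates(Fa, Fb), esr_dominates(Fb, Fa))  # False False: the CDFs cross,
                                                     # so both arms remain in the ESR set
```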

7.3 Vaccine recommender system

To illustrate a potential real-world use case for the ESR criterion and ESR dominance, we define a new multi-objective multi-armed bandit environment known as the vaccine recommender system (VRS). In a medical setting, a doctor may only have one opportunity to select a treatment for a patient; in this case, it is necessary to optimise under the ESR criterion. Consider the following scenario: a patient is travelling to another country where vaccination against a specific disease is required to gain entry. There are five available vaccines; however, each vaccine has varying side effects (safety rating) and effectiveness. This problem has two objectives: safety and effectiveness. Both objectives are rated from 0 to 5, with 0 being the worst rating and 5 being the best rating. None of the available vaccines is 100\(\%\) effective at treating the disease. When taking each vaccine, there is a chance of different outcomes occurring; for example, there is a chance of severe side effects (low safety rating) and a chance of the vaccine providing the required immunity to the disease (high effectiveness rating). Table 5 outlines each vaccine and the probability of each outcome occurring after taking the vaccine. Table 5 is unknown to the agent; the agent aims to learn each vaccine's distribution and to prune the ESR-dominated vaccines from consideration.

Table 5 A group of available vaccines that have varying outcomes

Given the utility function of the user is unknown, the MOTDRL algorithm is used to learn the underlying return distribution of each vaccine in Table 5 and to determine the ESR set. Once MOTDRL has finished learning, a set of optimal policies, in this case the ESR set, is returned to the user. When the user's utility function becomes known, the user can select a vaccine from the ESR set that maximises their utility.

The ESR set for the vaccine recommender system (VRS) environment is known a priori: the return distributions for \(V_{1}\) and \(V_{3}\) are ESR dominant, and therefore \(V_{1}\) and \(V_{3}\) are the only distributions in the ESR set. The VRS environment has five arms, where each arm corresponds to a vaccine in Table 5. To evaluate MOTDRL in the VRS environment, we set \(\mathbf{R} _\mathrm{min} = 0\), \(\mathbf{R} _\mathrm{max} = 10\), D = 2, \(\beta\) = 5 and \(|E^{*}|\) = 2. All experiments in this setting are averaged over 10 runs, and each experiment lasts 200,000 episodes. To compute the coverage ratio, we set \(\epsilon = 0.01\).

Fig. 8 Results from the VRS environment. MOTDRL is able to learn the full ESR set as it converges to the optimal \(F_1\) score of 1

After sufficient sampling, MOTDRL is able to learn the underlying return distribution of each arm in the VRS environment. Given that return distributions can be used to derive a partial ordering over policies, MOTDRL can use the return distributions of each arm to compute the ESR set in the VRS environment. In Fig. 8, we plot the coverage ratio as the \(F_1\) score. MOTDRL converges to the optimal \(F_1\) score after 120,000 episodes; given MOTDRL converges to the optimal \(F_1\) score, it is clear that MOTDRL is able to learn the ESR set.

Fig. 9 Heatmaps for each policy in the ESR set learned by MOTDRL. The left heatmap describes the distribution for \(V_{1}\) learned by MOTDRL, and the right heatmap describes the distribution for \(V_{3}\) learned by MOTDRL

In practice, once learning has completed, MOTDRL returns the learned ESR set for the VRS environment to the user. The learned ESR set contains two vaccines: \(V_{1}\) and \(V_{3}\). Both vaccines in the ESR set are ESR dominant. Moreover, a user with a monotonically increasing utility function will prefer either \(V_{1}\) or \(V_{3}\) over all other vaccines in the VRS environment.

Similar to Sect. 7.2, we utilise Figs. 9 and 10 to give the reader some intuition about ESR dominance. Figure 9 presents heatmaps to represent the policies in the ESR set learned by MOTDRL. Each heatmap represents a return distribution learned by MOTDRL and shows the return vectors and the corresponding probabilities. Each heatmap in Fig. 9 corresponds to the probabilities highlighted for \(V_{1}\) (left) and \(V_{3}\) (right) in Table 5. Figure 10 displays the policies in the ESR set learned by MOTDRL and their corresponding CDFs. Each CDF in Fig. 10 corresponds to the CDFs of the underlying return distributions of \(V_{1}\) and \(V_{3}\) in Table 5.

Fig. 10 CDFs for each policy in the ESR set learned by MOTDRL in the VRS environment. The left figure describes the CDF for \(V_{1}\) learned by MOTDRL, and the right figure describes the CDF for \(V_{3}\) learned by MOTDRL

8 Related work

The various orders of stochastic dominance have been used extensively as a method to determine the optimal decision when making decisions under uncertainty in economics [8], finance [1, 5], game theory [13], and various other real-world scenarios [6]. However, stochastic dominance has largely been overlooked in systems that learn. Cook and Jarret [9] use various orders of stochastic dominance and Pareto dominance with genetic algorithms to compute optimal solution sets for an aerospace design problem with multiple objectives when constrained by a computational budget. Martin et al. [22] use second-order stochastic dominance (SSD) with a single-objective distributional RL algorithm [7]. Martin et al. [22] use SSD to determine the optimal action to take at decision time, and this approach is shown to learn good policies during experimentation.

To learn the ESR set in sequential decision-making processes, like MOMDPs, new distributional MORL methods must be formulated. Distributional Monte Carlo tree search (DMCTS), a state-of-the-art ESR method, uses bootstrap Thompson sampling to approximate a posterior distribution over the returns [15]. However, DMCTS is a single-policy method and relies on the utility function of the user being known at the time of learning or planning; it therefore cannot be applied in the unknown utility function scenario and is unable to learn the ESR set. Distributional methods like the C51 algorithm, proposed by Bellemare et al. [7], could potentially be used to learn the underlying distribution of a random vector. However, C51 is a single-objective method, and defining a multi-objective version of C51 to learn the ESR set could pose significant challenges: replacing the distribution over returns used by C51 with a multivariate distribution could cause computation to increase with the number of objectives. Dedicated multi-objective distributional methods must therefore be formulated so that the ESR set can be learned efficiently under the ESR criterion. We highlight this as a new challenge that must be addressed by the MORL community.

9 Conclusion and future work

MORL has been highlighted as one of several key challenges that need to be addressed in order for RL to be commonly deployed in real-world systems [12]. In order to apply RL to the real world, the MORL community must consider the ESR criterion. However, the ESR criterion has largely been ignored by the MORL community, with the exception of the works of Roijers et al. [33, 36], Hayes et al. [15, 16] and Vamplew et al. [43]. The works of Hayes et al. [15, 16] and Roijers et al. [33] present single-policy algorithms that are suitable for learning policies under the ESR criterion; however, prior to this work, the requirements necessary to compute policies under the ESR criterion had not been formally defined. In Sect. 3, we outline, through examples and definitions, the necessary requirements to optimise under the ESR criterion. The formal definitions outlined in Sect. 3 ensure that an optimal policy can be learned under the ESR criterion when the utility function of the user is known. However, in the real world, a user’s preferences over objectives (or utility function) may be unknown at the time of learning [36].

Prior to this paper, a suitable solution set for the unknown utility function scenario under the ESR criterion had not been defined. This long-standing research gap has restricted the applicability of MORL in real-world scenarios under the ESR criterion. In Sects. 4 and 5, we define the necessary solution sets required for multi-policy algorithms to learn a set of optimal policies under the ESR criterion when the utility function of a user is unknown. In Sect. 6, we present a novel multi-policy algorithm, known as multi-objective tabular distributional reinforcement learning (MOTDRL), that can learn the ESR set in a MOMAB setting when the utility function of a user is unknown at the time of learning. In Sect. 7, we evaluate MOTDRL in two MOMAB settings and show that MOTDRL can learn the ESR set in MOMAB settings. This work aims to answer some of the existing research questions regarding the ESR criterion. Moreover, we aim to highlight the importance of the ESR criterion when applying MORL to real-world scenarios. In order to successfully apply MORL to the real world, we must implement new single-policy and multi-policy algorithms that can learn solutions for nonlinear utility functions in various scenarios.

A promising direction for future work would be to extend the work of Hayes et al. [15] and the work of Wang and Sebag [45]. It may be possible to build on these works to implement a multi-objective distributional Monte Carlo tree search algorithm that can learn a set of optimal policies under the ESR criterion. It is important to note that Hayes et al. [15, 16] use bootstrap Thompson sampling to approximate a posterior distribution; this method cannot learn the ESR set when the utility function of a user is unknown, and therefore a different distributional method must be used to learn the ESR set. Although the distributional method used by Hayes et al. [15] cannot be used to learn the ESR set, this work is still a useful starting point.

Given distributional MORL methods are a new area of research, not much is known about the computational requirements of maintaining a return distribution. Therefore, it is important that a comprehensive computational analysis of distributional MORL methods is undertaken to fully understand the implications of maintaining a return distribution. In a future publication, we plan to perform a computational analysis for distributional MORL methods in both bandit and sequential decision-making settings.

A lack of well-defined benchmarks is a significant challenge associated with implementing any new single-policy or multi-policy algorithm under the ESR criterion. Currently, very few ESR benchmark environments exist (e.g. Fishwood [33]). In order to accurately evaluate single-policy and multi-policy ESR algorithms, a suite of benchmark problem domains needs to be designed. Under the SER criterion, such benchmarks already exist, e.g. deep sea treasure [42]. It is also important to highlight the need to establish new metrics to evaluate multi-policy algorithms under the ESR criterion. As previously mentioned, the existing metrics used to evaluate multi-objective algorithms are designed for the SER criterion; to accurately evaluate multi-policy algorithms under the ESR criterion, new metrics must be defined. We note that extending the work of Zintgraf et al. [49] to the ESR criterion would be a promising starting point.