Multi-objective dynamic programming with limited precision

This paper addresses the problem of approximating the set of all solutions for Multi-objective Markov Decision Processes. We show that in the vast majority of interesting cases, the number of solutions is exponential or even infinite. In order to overcome this difficulty we propose to approximate the set of all solutions by means of a limited precision approach based on White’s multi-objective value-iteration dynamic programming algorithm. We prove that the number of calculated solutions is tractable and show experimentally that the solutions obtained are a good approximation of the true Pareto front.


Introduction
Markov decision processes (MDPs) are a well-known conceptual tool for modelling sequential decision processes, and have been widely used in real-world applications such as adaptive production control (see, e.g., Kuhnle et al. (2020)), equipment maintenance (see Barde et al. (2019), Liu et al. (2019)) or robot planning (see Veeramani et al. (2020)), to name a few.
Usual optimization procedures take into account just a scalar value to be maximized. However, in many cases the objective function is more accurately described by a vector (see, e.g., Gen and Lin (2014), Zhang and Xu (2017)) and multi-objective optimization must be applied.
By merging both concepts (MDPs and multi-objective optimization) we are led to consider multi-objective Markov decision processes (MOMDPs), that is, Markov decision processes in which the rewards given to the agent consist of two or more independent values, i.e., rewards are numerical vectors. In recent years there has been a growing interest in the theory and applications of multi-objective Markov decision processes and multi-objective reinforcement learning, e.g. see Roijers and Whiteson (2017), and Drugan et al. (2017).
As in any multi-objective problem, the solution to a MOMDP is not a singleton but, in general, a set of many nondominated values called the Pareto front. Two approaches can be followed in order to find the solutions of a MOMDP (Roijers et al. (2013)). The single-policy approach (Perny and Weng (2010), Wray et al. (2015)) computes a single optimal policy according to some scalarization of user preferences; of course, the process can be repeated as many times as desired in order to find different solutions. On the other hand, the multi-policy approach tries to compute simultaneously all the values in the Pareto front (Van Moffaert and Nowé (2014), Ruiz-Montiel et al. (2017)).
In any case, the computation of the Pareto front must face a difficulty: since the number of values in the front is usually huge (in fact, it can be infinite), the computation can be infeasible. Therefore, recent multi-objective reinforcement learning techniques avoid approximating the full Pareto front, or are tested on limited problem instances, like deterministic domains (see Drugan et al. (2017)). So it is desirable to have a procedure that, while being feasible, provides a "good" approximation of the front. That is the goal of the proposal presented here.
The work of Drugan (2019) acknowledges that multi-objective reinforcement learning methods have had "slow development due to severe computational problems". Incorporating elements already widely used in multi-objective evolutionary computation, like scalarization and dimensionality reduction through principal component analysis, into reinforcement learning methods is a promising avenue of research to overcome these difficulties (see Giuliani et al. (2014)).
On the other hand, the application of multi-objective evolutionary methods in a multi-policy setting (i.e. to directly approximate the Pareto front of MOMDP solutions) seems to remain largely unexplored. Although computationally efficient in many domains, multi-objective evolutionary techniques generally lack guarantees on the optimality or precision of the solutions found. Zitzler et al. (2003) analyzed different metrics that can be used to assess the performance of multi-objective algorithms against a reference Pareto set. Regrettably, these benchmark references are generally not available even for simple stochastic MOMDPs due to their inherent complexity.
A different way to deal with the exponential nature of MOMDPs is to use mixture policies, as proposed by Vamplew et al. (2009): given a set of deterministic policies, one of them is chosen probabilistically at the start of the process and followed onwards. The initial set of policies could be the set of supported solutions, calculated by linear scalarized versions of reinforcement learning algorithms. The values of the mixture policies form the convex hull of the values of the initial set, and would therefore be nondominated. However, there may be situations where mixture policies are not acceptable, e.g. for ethical reasons (Lizotte et al. (2010)). These considerations lead us to propose in this paper a new approach that approximates deterministic policies without relying on mixtures.
The structure of this paper is as follows: first we define formally the concepts related to MOMDPs and prove that under a very restrictive set of assumptions the number of solutions of a MOMDP is tractable (Proposition 1); but we show that if any of these assumptions does not hold, then that number becomes intractable or even infinite. In the following section we present the basic value-iteration algorithm for solving a MOMDP (Algorithm 1) and the modification that we propose (Algorithm WLP, given by equation 1), and prove that algorithm WLP computes a tractable number of approximate solutions. Algorithm WLP is then applied to benchmark problems to check if it can provide a "good" approximation of the true Pareto front. Finally some conclusions are drawn.

Multi-objective Markov Decision Processes
This section defines multi-objective Markov decision processes (MOMDPs) and their solutions. When possible, notation is consistent with that of Sutton and Barto (2018).
A MOMDP is defined by at least the elements in the tuple (S, A, p, r⃗, γ), where: S is a finite set of states; A(s) is the finite set of actions available at state s ∈ S; p is a probability distribution such that p(s,a,s') is the probability of reaching state s' immediately after taking action a in state s; r⃗ is a function such that r⃗(s,a,s') ∈ ℝ^q is the reward obtained after taking action a in state s and immediately transitioning to state s'; and γ ∈ (0,1] is a discount rate that weights the present value of future rewards. The only apparent difference with scalar finite Markov decision processes (MDPs) (e.g., see Sutton and Barto (2018)) is the use of a vector reward function.
A MOMDP is episodic if the process always starts at a start state s0 ∈ S, terminates when reaching any of a set of terminal states Γ ⊆ S, and before termination there is always a non-zero probability of eventually reaching some terminal state. Otherwise, the process is continuing. A MOMDP is of finite horizon if the process terminates after at most a given finite number of actions n; we say that such a process is n-step bounded. Otherwise, it is of infinite horizon.
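For concreteness, the tuple above could be represented as follows. This is a minimal Python sketch with identifiers of our own choosing, not code from the paper.

```python
# A minimal representation of the MOMDP tuple (S, A, p, r, gamma).
# All identifiers here are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

State = int
Action = int
Vector = Tuple[float, ...]  # a reward vector in R^q

@dataclass
class MOMDP:
    states: List[State]                          # S: finite set of states
    actions: Callable[[State], List[Action]]     # A(s): actions available at s
    p: Callable[[State, Action, State], float]   # p(s, a, s'): transition probability
    r: Callable[[State, Action, State], Vector]  # r(s, a, s'): vector reward in R^q
    gamma: float                                 # discount rate in (0, 1]
    terminals: FrozenSet[State] = frozenset()    # terminal states (episodic case)
```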
MDPs are frequently used to model the interaction of an agent with an environment at discrete time steps. We define the goal of the agent as the maximization of the expected accumulated (additive) discounted reward over time.
Let us now define the concepts related to solving a MOMDP. A decision rule δ is a function that associates each state with an available action. A policy π = (δ1, δ2, ..., δi, ...) is a sequence of decision rules, such that δi(s) determines the action to take if the process is in state s at time step i. If there is some decision rule δ such that δi = δ for all i, then the policy is stationary; otherwise, it is non-stationary. We denote by π^n = (δ1, ..., δn), n ≥ 1, the finite n-step subsequence of policy π.
Let St and R⃗t be two random variables denoting the state and the vector reward received at time step t, respectively. The value v⃗π(s) is defined as the expected accumulated discounted reward obtained starting at state s and applying policy π (analogously to the scalar case, see Sutton and Barto (2018)). Given a set of vectors X ⊂ ℝ^q, we define the subset of nondominated, or Pareto-optimal, vectors as ND(X) = {x⃗ ∈ X | ∄ y⃗ ∈ X : x⃗ ≺ y⃗}, where ≺ denotes Pareto dominance. We denote by V^n(s) and V(s) the set of nondominated values of all possible n-step policies, and of all possible policies at state s, respectively. For an n-step bounded process, V^n(s) = V(s) for all s ∈ S. The solution to a MOMDP is given by the V(s) sets of all states.
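The ND operator admits a direct quadratic-time implementation. The following sketch (our own, written for maximization) is reused as a reference in the later examples.

```python
# ND(X): keep a vector iff no other vector in X Pareto-dominates it.
from typing import Iterable, List, Tuple

Vector = Tuple[float, ...]

def dominates(x: Vector, y: Vector) -> bool:
    """True iff x Pareto-dominates y (componentwise >=, with one strict >)."""
    return all(a >= b for a, b in zip(x, y)) and any(a > b for a, b in zip(x, y))

def nd(vectors: Iterable[Vector]) -> List[Vector]:
    """Return the nondominated (Pareto-optimal) subset of `vectors`."""
    vs = list(set(vectors))  # drop duplicates first
    return [x for x in vs if not any(dominates(y, x) for y in vs)]

# Example: nd([(1, 2), (2, 1), (0, 0), (2, 2)]) == [(2, 2)]
```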

Combinatorial explosion
In this section we analyze some computational difficulties related to solving MOMDPs.
Proposition 1. Let us consider an episodic q-objective MDP with initial state s0 satisfying the following assumptions:
1. The length of every episode is at most d.
2. Immediate rewards r⃗(s,a,s') are integer in every component, and every component lies in the interval [rmin, rmax].
3. Transitions are deterministic, i.e. p(s,a,s') ∈ {0,1}.
4. There is no discounting, i.e. γ = 1.
Then the number of different nondominated values |V(s0)| is at most (R×d+1)^(q−1), where R = rmax − rmin.
Proof: given assumptions 2, 3 and 4, the value of a policy is a vector with integer components. Given assumptions 1 and 2, these components lie in the interval [d × rmin, d × rmax], which contains at most R × d + 1 integers. So there can be at most (R×d+1)^q different value vectors; however, not all of them can be nondominated. Consider the first q−1 components of a vector: there are at most (R×d+1)^(q−1) different possibilities for them, and for each possibility just one vector is nondominated, namely the one having the greatest value in the q-th component. Hence there are at most (R×d+1)^(q−1) nondominated policy values, q.e.d.
Notice that nothing has been assumed about the number of policies (except that it is finite, by assumption 1). So in general there are many more policies (an exponential number of them) than values.
We will now show that if any of the assumptions is not satisfied, then the number of different nondominated values |V(s0)| can also be exponential in d. Let us consider the graph of figure 1 (Hansen's graph): for every subset I ⊆ {1,...,d} there exists a policy with a distinct value v⃗_I for s0, namely the policy that for each state si selects a1 if i ∉ I and a2 if i ∈ I. Notice that all these values v⃗_I are nondominated. So the number of nondominated values grows exponentially with d.
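The blow-up can be checked by brute force. The sketch below assumes the classic biobjective construction in which step i contributes reward (2^i, 0) under a1 and (0, 2^i) under a2; the actual rewards of the instance are those of figure 1, so this is only an illustration.

```python
# Enumerate the values of all 2**d deterministic policies in the assumed
# Hansen-style instance and count how many distinct values appear.
from itertools import product

def hansen_values(d: int):
    values = set()
    for choices in product((0, 1), repeat=d):  # one binary choice per step
        v = [0, 0]
        for i, c in enumerate(choices):
            v[c] += 2 ** i  # a1 pays on objective 0, a2 on objective 1
        values.add(tuple(v))
    return values

for d in range(1, 8):
    # every value is nondominated: the two components always sum to 2**d - 1,
    # so improving one component necessarily worsens the other
    print(d, len(hansen_values(d)))  # prints 2, 4, 8, ... : |V(s0)| = 2**d
```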

Algorithms
This section describes two previous multi-objective dynamic programming algorithms, and proposes a tractable modification.

Recursive backwards algorithm
An early extension of dynamic programming to the multi-objective case was described by Daellenbach and De Kluyver (1980). The authors showed that the standard backwards recursive dynamic programming procedure can be applied straightforwardly to problems with directed acyclic networks and additive positive vector costs. The idea easily extends to other decision problems, including stochastic networks. We denote this first basic multi-objective extension of dynamic programming as algorithm B.

Vector value iteration algorithm
The work of White (1982) considered the general case, applicable to stochastic cyclic networks, and extended the scalar value-iteration method to multi-objective problems.
Algorithm 1 displays a reformulation of the algorithm described by White. The original formulation considers a single reward associated with a transition (s,a), while we consider that rewards r⃗(s,a,si) may depend on the reached state si as well. Otherwise, both procedures are equivalent.
1: Input: size of the state space (N); a discount rate (γ); number of iterations to run (n).
2: Output: Vn, a vector such that Vn(s) = V^n(s)
3: V0 ← vector(size: N, defaultValue: {0⃗})
4: for i = 1 to n do
5:     Vi ← vector(size: N)
       for s ∈ S do
           ⋮
15:        T(a) ← { Σ_{s'} p(s,a,s') · (r⃗(s,a,s') + γ · v⃗(s')) : one v⃗(s') chosen from Vi−1(s') for each reachable s' }
           ⋮
18:        Vi(s) ← ND(∪_{a∈A(s)} T(a))
           ⋮
       end for
       delete(Vi−1)
21: end for
22: return(Vn)

Line 15 calculates a temporary set of updated vector value estimates T(a) for action a. Each new estimate is calculated according to the dynamic programming update rule: the calculation takes one vector estimate for each reachable state, adds the immediate reward to the discounted value of each reachable state, and weights the result by the probability of transition to each such state. All possible estimates are stored in set T(a). Finally, in line 18, the estimates calculated for each action are joined and the nondominated set is used to update the new value of Vi(s). Maximum memory requirements are determined by the sizes of Vi, Vi−1 and T.
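The update described above can be sketched in Python as follows. This is our reading of lines 15 and 18, not the paper's implementation; terminal-state handling is omitted, and p(s,a,·) is assumed to be a proper distribution, so every action has at least one reachable successor.

```python
# One sweep of algorithm W over all states.
from itertools import product

def _dominates(x, y):
    """True iff x Pareto-dominates y (componentwise >=, with one strict >)."""
    return all(a >= b for a, b in zip(x, y)) and any(a > b for a, b in zip(x, y))

def nd(xs):
    """Nondominated subset of a collection of vectors."""
    xs = list(set(xs))
    return [x for x in xs if not any(_dominates(y, x) for y in xs)]

def w_sweep(states, actions, p, r, gamma, V_prev):
    """V_prev maps each state to its current set of value vectors."""
    V_new = {}
    for s in states:
        candidates = set()
        for a in actions(s):
            succ = [t for t in states if p(s, a, t) > 0.0]  # reachable states
            # line 15: build T(a) from every combination of one vector
            # estimate per reachable successor state
            for u in product(*(sorted(V_prev[t]) for t in succ)):
                q = len(u[0])
                est = tuple(
                    sum(p(s, a, t) * (r(s, a, t)[k] + gamma * v[k])
                        for t, v in zip(succ, u))
                    for k in range(q)
                )
                candidates.add(est)
        # line 18: join the T(a) sets of all actions and keep only the
        # nondominated vectors
        V_new[s] = set(nd(candidates))
    return V_new
```

Note that the size of the product in line 15 grows with the cardinality of the Vi−1 sets, which is exactly the combinatorial explosion analyzed in the previous section.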
As n → ∞, the values returned by algorithm W converge to V(s) (see White (1982), theorem 1).
Some new difficulties arise in algorithm W when compared to standard scalar value iteration (White (1982)):

• After n iterations, algorithm W provides V^n(s), i.e. the sets of nondominated values for the set of n-step policies (see White (1982), theorem 2). However, for infinite-horizon problems these are only approximations of the V(s) sets. Let us consider two infinite-horizon policies π1 and π2 with values v⃗π1(s), v⃗π2(s) respectively at a given state, and their n-step sub-policies π1^n, π2^n with values v⃗π1^n(s), v⃗π2^n(s). It is possible to have v⃗π1(s) ≺ v⃗π2(s) and, at the same time, v⃗π1^n(s) ∼ v⃗π2^n(s) (the values are incomparable) for all n. In other words, given two infinite-horizon policies, one of which dominates the other, their n-step approximations may remain mutually nondominated for every finite value of n.

• Nonstationary policies may be nondominated. This is also an important departure from the scalar case, where under reasonable assumptions there is always a stationary optimal policy. White's algorithm converges to the values of nondominated policies, either stationary or nonstationary.

• Finally, if policies with probability mixtures over actions are allowed, these policies may also be nondominated. Therefore, if allowed, they must also be covered in the calculations of the W algorithm.
Despite its theoretical importance, we are not aware of practical applications of White's algorithm to general MOMDPs. This is not surprising, given the result presented in proposition 1.

Vector value iteration with limited precision
In order to overcome the computational difficulties described in the previous sections, we propose a simple modification of algorithm 1 based on the use of limited precision in the vector components calculated by the algorithm. We consider a maximum precision factor ε, and vector components are rounded to the allowable precision in line 18. More precisely, the vectors in the T(a) sets are rounded before being joined. Therefore, all vectors in Vi will be of limited precision.
We denote the W algorithm limited to precision ε by WLP(ε). The new algorithm has a new parameter ε, and line 18 is replaced by

    Vi(s) ← ND( ∪_{a∈A(s)} round(T(a), ε) )        (1)

where round(X, ε) returns a set with all vectors in set X rounded to ε precision.
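A possible implementation of the rounding operator is sketched below, with the resulting replacement of line 18 indicated in a comment. The function name and the snapping rule (rounding each component to the nearest multiple of ε) are our assumptions.

```python
def round_vectors(X, eps):
    """round(X, eps): snap every component of every vector in X to a multiple of eps."""
    return {tuple(round(c / eps) * eps for c in x) for x in X}

# With this operator, line 18 of algorithm W (equation 1) becomes:
#
#     V_new[s] = set(nd(set().union(*(round_vectors(T[a], eps) for a in actions(s)))))
#
# Every vector in Vi now lies on the eps-grid, so the number of distinct
# nondominated vectors per state is bounded by the number of grid points,
# independently of the number of policies.
```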
Experiments

We are interested in analyzing the performance of the algorithms as a function of problem size. Therefore, we consider a set of subproblems obtained from the SDST-RD task. The state space of subproblem i comprises all 11 rows, but only the i leftmost columns of the grid shown in figure 4.
The subproblems were solved with algorithm B, and with WLP(ε) with precision ε ∈ {0.1, 0.05, 0.02, 0.01, 0.001}; smaller ε values provide higher precision and better approximations, and the exact fronts calculated by algorithm B provide a reference for the approximations calculated by WLP(ε). The discount rate was set to γ = 1.0. With WLP, the number of iterations for each subproblem was set to the distance of the most distant treasure.
Figure 5 displays the exact Pareto front for subproblems 1 to 6, as calculated by algorithm B. Larger subproblems exceeded a 96-hour time limit. In general, the Pareto fronts present a non-convex nature, as in the standard deterministic DST. However, the number of nondominated vectors in V(s0) is much larger (in the standard DST, subproblem i has just i nondominated solution vectors).

N-Pyramid
The N-pyramid environment presented here is inspired by the pyramid MOMDP described by Van Moffaert and Nowé (2014). More precisely, we define a family of problems of variable size over N×N grid environments. Each cell in the grid is a different state. Each episode always starts at the start node, situated in the bottom left corner (with coordinates (x,y) = (1,1)). The agent can move in any of the four cardinal directions to an adjacent cell, except for moves that would lead the agent outside of the grid, which are not allowed. Transitions are stochastic: the agent moves in the selected direction with probability 0.95, and with probability 0.05 it actually moves in a random direction from those available at the state. The episode terminates when the agent reaches one of the cells in the diagonal facing the start state. Figure 10 displays a sample grid environment for N = 5. The agent wants to maximize two objectives. For each transition to a non-terminal state, the agent receives a vector reward of (−1,−1). When a transition reaches a terminal cell with coordinates (x,y), the agent receives a vector reward of (10x, 10y). This environment defines a cyclic state space; therefore, its solution cannot be calculated with the B algorithm.
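The following sketch implements the transition dynamics just described, with helper names of our own; we read "the diagonal facing the start state" as the anti-diagonal of cells with x + y = N + 1.

```python
# One stochastic transition of the N-pyramid environment (1-based coordinates).
import random

MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}

def legal_moves(N, x, y):
    """Directions that keep the agent inside the N x N grid."""
    return [d for d, (dx, dy) in MOVES.items()
            if 1 <= x + dx <= N and 1 <= y + dy <= N]

def step(N, x, y, direction, rng=random):
    """Transition from cell (x, y); `direction` must be a legal move."""
    # with probability 0.95 the selected move is executed; with probability
    # 0.05 a random available direction is taken instead
    d = direction if rng.random() < 0.95 else rng.choice(legal_moves(N, x, y))
    nx, ny = x + MOVES[d][0], y + MOVES[d][1]
    if nx + ny == N + 1:  # terminal cell on the anti-diagonal (our reading)
        return (nx, ny), (10 * nx, 10 * ny), True
    return (nx, ny), (-1, -1), False
```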
The N-pyramid environment was solved with WLP(ε) using ε ∈ {1.0, 0.1, 0.05, 0.01, 0.005}. The discount rate was set to γ = 1.0, and the number of iterations to 3 × N. A twelve-hour runtime limit was imposed for all experiments. The largest instance solved was N = 5, for ε = 1.0.
Figure 11 displays the best frontier approximation for each problem instance, and the corresponding ε value. Smaller ε values could not solve the problem within the time limit.
Figure 12 displays the frontiers calculated by the different precision values for the 3-pyramid. This was the largest problem instance solved by all precision values. Figure 13 displays an enlargement of a portion of this frontier. This illustrates the size and shape of the different approximations. As expected, smaller values of ε produce more densely populated policy values.
Finally, table 3 summarizes the cardinality of the V(s0) sets calculated for each problem instance by the different precision values, and table 4 the corresponding hypervolumes using the original reference point (−20,−20).

Table 3: Cardinality of V(s0) for the N-pyramid instances.
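For a biobjective front, the hypervolume with respect to a reference point reduces to a sum of rectangle areas. The sketch below is a generic textbook computation for the maximization case, not the paper's code.

```python
def hypervolume_2d(front, ref=(-20.0, -20.0)):
    """Area dominated by `front` and bounded below by `ref` (maximization)."""
    pts = sorted(p for p in front if p[0] > ref[0] and p[1] > ref[1])
    hv, prev_y = 0.0, ref[1]
    for x, y in reversed(pts):  # sweep from largest x to smallest
        if y > prev_y:          # skip dominated points
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

# Example: hypervolume_2d([(1, 2), (2, 1)], ref=(0, 0)) == 3.0
```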

Conclusions and future work
This paper analyzes some practical difficulties that arise in the solution of MOMDPs. We show that the number of nondominated policy values is tractable (for a fixed number of objectives q) only under a number of limiting assumptions. If any of these assumptions is violated, then in the worst case the number of nondominated values grows intractably with problem size, or even becomes infinite. Multi-policy methods for MOMDPs try to approximate the set of all nondominated policy values simultaneously. Previous works have addressed mostly deterministic tasks satisfying the limiting assumptions that guarantee tractability.
We show that, if policy values are restricted to vectors with limited precision, the number of such values becomes tractable. This idea is applied to a variant of the algorithm proposed by White (1982). This new variant has been analyzed over a set of stochastic benchmark problems. Results show that good approximations of the Pareto front can be obtained. To our knowledge, the results reported deal with the hardest stochastic problems solved to date by multi-policy multi-objective dynamic programming algorithms.
Future work includes the application of the limited precision idea to more general value iteration algorithms, and the generation of additional benchmark problems for stochastic MOMDPs. This should support the exploration and evaluation of other approximate solution strategies, like multi-policy multi-objective evolutionary techniques.

Figure 1: Hansen's graph.

Figure 5: Exact Pareto front for subproblems 1 to 6 of the SDST-RD task calculated with algorithm B.

Figure 6: Best approximation of the Pareto frontier obtained for subproblems 7 to 10 of the SDST-RD task with algorithm WLP. The exact frontier of subproblem 6 is included as a reference.

Figure 7: Memory requirements (# of vectors) of each algorithm for the different subproblems of the SDST-RD (log scale).

Figure 8: Pareto fronts obtained for SDST-RD subproblem 6. The area in the rectangle is shown magnified in figure 9.

Figure 11: Best approximations obtained for the Pareto fronts for the N-pyramid problem instances.

Figure 12: Approximated Pareto fronts for the 3-pyramid instance. An enlarged portion is displayed in figure 13.