Abstract
Concerned with multi-objective reinforcement learning (MORL), this paper presents MOMCTS, an extension of Monte-Carlo Tree Search to multi-objective sequential decision making, embedding two decision rules respectively based on the hypervolume indicator and on the Pareto dominance reward. The MOMCTS approaches are first compared with the MORL state of the art on two artificial problems, the two-objective Deep Sea Treasure problem and the three-objective Resource Gathering problem. The scalability of MOMCTS is also examined on the NP-hard grid scheduling problem, showing that the MOMCTS performance matches the (non-RL based) state of the art, albeit at a higher computational cost.
Introduction
Reinforcement learning (RL) (Sutton and Barto 1998; Szepesvári 2010) addresses sequential decision making in the Markov decision process framework. RL algorithms provide guarantees of finding the optimal policies in the sense of the expected cumulative reward, relying on the thorough exploration of the state and action spaces. The price to pay for these optimality guarantees is the limited scalability of mainstream RL algorithms w.r.t. the size of the state and action spaces.
Recently, Monte-Carlo Tree Search (MCTS), including the famed Upper Confidence Tree algorithm (Kocsis and Szepesvári 2006) and its variants, has been intensively investigated to handle sequential decision problems. MCTS, notably illustrated in the domain of Computer-Go (Gelly and Silver 2007), has been shown to efficiently handle medium-size state and action search spaces through a careful balance between the exploration of the search space and the exploitation of the best results found so far. While providing some consistency guarantees (Berthier et al. 2010), MCTS has demonstrated its merits and wide applicability in the domains of games (Ciancarini and Favini 2009) and planning (Nakhost and Müller 2009), among many others.
This paper is motivated by the fact that many real-world applications, including reinforcement learning problems, are most naturally formulated in terms of multi-objective optimization (MOO). In multi-objective reinforcement learning (MORL), the reward associated to a given state is d-dimensional (e.g. cost, risk, robustness) instead of a single scalar value (e.g. quality). To our knowledge, MORL was first tackled by Gábor et al. (1998); introducing a lexicographic (hence total) order on the policy space, the authors show the convergence of standard RL algorithms under the total order assumption. In practice, multi-objective reinforcement learning is often tackled by applying standard RL algorithms to a scalar aggregation of the objective values (e.g. optimizing their weighted sum; see also Mannor and Shimkin (2004), Tesauro et al. (2007)).
In the general case of antagonistic objectives however (e.g. simultaneously minimizing the cost and the risk of a manufacturing process), two policies might be incomparable (e.g. the cheapest process for a fixed robustness; the most robust process for a fixed cost): solutions are partially ordered, and the set of optimal solutions according to this partial order is referred to as the Pareto front (more in Sect. 2). The goal of the so-called multiple-policy MORL algorithms (Vamplew et al. 2010) is to find several policies on the Pareto front (Natarajan and Tadepalli 2005; Chatterjee 2007; Barrett and Narayanan 2008; Lizotte et al. 2012).
The goal of this paper is to extend MCTS to multi-objective sequential decision making. The proposed scheme, called MOMCTS, basically aims at discovering several Pareto-optimal policies (decision sequences, or solutions) within a single tree. MOMCTS requires one to modify the exploration of the tree to account for the lack of total order among the nodes, and for the fact that the desired result is a set of Pareto-optimal solutions (as opposed to a single optimal one). A first possibility considers the use of the hypervolume indicator (Zitzler and Thiele 1998), which measures the MOO quality of a solution w.r.t. the current Pareto front. Specifically, taking inspiration from Auger et al. (2009), this indicator is used to define a single optimization objective for the current path being visited in each MCTS tree-walk, conditioned on the other solutions previously discovered. MOMCTS thus handles a single-objective optimization problem in each tree-walk, while eventually discovering several decision sequences pertaining to the Pareto front. This approach, first proposed by Wang and Sebag (2012), suffers from two limitations. First, the hypervolume indicator computation cost increases exponentially with the number of objectives. Second, the hypervolume indicator is not invariant under monotone transformations of the objectives. The invariance property (satisfied for instance by comparison-based optimization algorithms) gives robustness guarantees which are most important w.r.t. ill-conditioned optimization problems (Hansen 2006).
Addressing these limitations, a new MOMCTS approach is proposed in this paper, using Pareto dominance to compute the instant reward of the current path visited by MCTS. Compared to the first approach (referred to as MOMCTS-hv in the remainder of this paper), the latter approach (referred to as MOMCTS-dom) has linear computational complexity w.r.t. the number of objectives, and is invariant under monotone transformations of the objectives.
Both MOMCTS approaches are empirically assessed and compared to the state of the art on three benchmark problems. Firstly, both MOMCTS variants are applied to two artificial benchmark problems, using MOQL (Vamplew et al. 2010) as baseline: the two-objective Deep Sea Treasure (DST) problem (Vamplew et al. 2010) and the three-objective Resource Gathering (RG) problem (Barrett and Narayanan 2008). A stochastic transition model is considered for both DST (originally deterministic) and RG, to assess the robustness of both MOMCTS approaches. Secondly, the real-world NP-hard problem of grid scheduling (Yu et al. 2008) is considered to assess the performance and scalability of the MOMCTS methods comparatively to the (non-RL-based) state of the art.
The paper is organized as follows. Section 2 briefly introduces the formal background. Section 3 describes the MOMCTS-hv and MOMCTS-dom algorithms. Section 4 presents the experimental validation of the MOMCTS approaches. Section 5 discusses the strengths and limitations of the MOMCTS approaches w.r.t. the state of the art, and the paper concludes with some research perspectives.
Formal background
Assuming the reader’s familiarity with the reinforcement learning setting (Sutton and Barto 1998), this section briefly introduces the main notations and definitions used in the rest of the paper.
A Markov decision process (MDP) is described by its state and action spaces, respectively denoted \(\mathcal{S}\) and \(\mathcal{A}\). The transition function (\(p : \mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,1]\)) gives the probability p(s,a,s′) of reaching state s′ by executing action a in state s. The (scalar) reward function is defined on the state-action space (\(r: \mathcal{S} \times\mathcal{A} \mapsto\mathbb{R}\)).
Multiobjective optimization
In multi-objective optimization (MOO), each point x in the search space \(\mathcal{X}\) is associated with a d-dimensional reward vector \(r_x\) in \(\mathbb{R}^{d}\), referred to as vectorial reward in the following. With no loss of generality, it is assumed that each objective is to be maximized.
Given two points \(x, x' \in \mathcal{X}\) with associated vectorial rewards \(r_x = (r_1,\ldots,r_d)\) and \(r_{x'} = (r'_{1},\ldots,r'_{d})\), \(r_x\) is said to dominate, or Pareto-dominate, \(r_{x'}\) (noted \(r_x \succeq r_{x'}\)) iff \(r_i\) is greater than or equal to \(r'_i\) for \(i=1,\ldots,d\). The dominance is strict (noted \(r_x \succ r_{x'}\)) if \(r_x \succeq r_{x'}\) and \(r_i > r'_i\) for some i (Fig. 1(a)). As mentioned, Pareto dominance defines a partial order relation on \(\mathbb{R}^{d}\) and thus on \(\mathcal{X}\). The Pareto front is defined as follows:
Definition 1
Given \(A \subset \mathbb{R}^{d}\) a set of vectorial rewards, the set \(P_A\) of non-dominated points in A is defined as:

$$P_A = \bigl\{ r \in A \;:\; \nexists\, r' \in A,\ r' \succ r \bigr\}$$
The Pareto front is made of all non-dominated vectorial rewards. By abuse of language, \(P_A\) is referred to as the set of Pareto optima in A.
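To make the dominance relation concrete, the following Python sketch (an illustrative implementation with our own function names, not the paper's code) tests Pareto dominance under maximization and extracts \(P_A\) from a set of vectorial rewards:

```python
def dominates(r, r2):
    """True iff r Pareto-dominates r2: r_i >= r2_i for every objective i."""
    return all(a >= b for a, b in zip(r, r2))

def strictly_dominates(r, r2):
    """True iff r dominates r2 and is strictly better on some objective."""
    return dominates(r, r2) and any(a > b for a, b in zip(r, r2))

def pareto_front(A):
    """Non-dominated subset P_A of a list A of vectorial rewards."""
    return [r for r in A if not any(strictly_dominates(r2, r) for r2 in A)]
```

For instance, `pareto_front([(1, 3), (2, 2), (0, 0)])` keeps the incomparable points (1,3) and (2,2) and discards the dominated point (0,0).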
Two different categories of MOO problems are distinguished depending on whether they correspond to a convex or a non-convex Pareto front. A convex Pareto front can be identified by solving a set of single-objective optimization problems defined on the weighted sum of the objectives, referred to as linear scalarization of the MOO problem (as done in MOQL, Sect. 4.2.1). When dealing with non-convex Pareto fronts (for instance, the DST problem, Vamplew et al. 2010, and the ZDT2 and DTLZ2 test benchmarks, Deb et al. 2002) however, the linear scalarization approach fails to discover the non-convex parts of the Pareto front (Deb 2001). Although many MOO problems have a convex Pareto front, especially in the two-objective case, the discovery of non-convex Pareto fronts remains the main challenge for MOO approaches (Deb et al. 2000; Beume et al. 2007).
MonteCarlo Tree Search
Let us describe the best-known MCTS algorithm, referred to as Upper Confidence Tree (UCT) (Kocsis and Szepesvári 2006), which extends the Upper Confidence Bound algorithm (Auer et al. 2002) to tree-structured spaces. UCT simultaneously explores and builds a search tree, initially restricted to its root node, along N tree-walks a.k.a. simulations. Each tree-walk involves three phases:
The bandit phase starts from the root node and iteratively selects an action/a child node until arriving in a leaf node. Action selection is handled as a multi-armed bandit problem. The set \(\mathcal{A}_{s}\) of admissible actions a defines the possible child nodes (s,a) of node s; the selected action \(a^{*}\) maximizes the Upper Confidence Bound over a ranging in \(\mathcal{A}_{s}\):

$$a^{*} = \mathop{\operatorname{arg\,max}}_{a \in \mathcal{A}_s}\ \hat{r}_{s,a} + c_e \sqrt{\frac{\log n_s}{n_{s,a}}} \quad (1)$$

where \(n_s\) stands for the number of times node s has been visited, \(n_{s,a}\) denotes the number of times a has been selected in node s, and \(\hat{r}_{s,a}\) is the average reward collected when selecting action a from node s. The first (respectively the second) term in Eq. (1) corresponds to the exploitation (resp. exploration) term, and the exploration vs exploitation tradeoff is controlled by parameter \(c_e\). Upon the selection of \(a^{*}\), the next state is drawn from the transition model depending on the current state and \(a^{*}\). In the remainder of the paper, a tree node is labeled with the sequence of actions followed from the root; the associated reward is the average reward collected over all tree-walks involving this node.
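The UCB selection rule can be sketched as follows (a minimal Python illustration under our own data layout, where per-action statistics are stored as `(visit count, average reward)` pairs and unvisited actions are tried first):

```python
import math

def ucb_select(node_stats, c_e):
    """Return the action maximizing r_hat + c_e * sqrt(log(n_s) / n_{s,a}).
    node_stats: dict mapping action -> (n_sa, r_hat)."""
    n_s = sum(n for n, _ in node_stats.values())  # visits to node s

    def ucb(item):
        _, (n_sa, r_hat) = item
        if n_sa == 0:                 # unvisited actions get priority
            return float("inf")
        return r_hat + c_e * math.sqrt(math.log(n_s) / n_sa)

    return max(node_stats.items(), key=ucb)[0]
```

With `c_e = 0` the rule is purely greedy; increasing `c_e` shifts the choice towards rarely visited actions.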
The tree-building phase takes place upon arriving in a leaf node s; some action a is (uniformly or heuristically) selected and (s,a) is added as a child node of s. Accordingly, the number of nodes in the tree equals the number of tree-walks.
The random phase starts from the new leaf node (s,a) and iteratively (uniformly or heuristically) selects an action until arriving in a terminal state u; at this point the reward \(r_u\) of the whole tree-walk is computed and used to update the average reward estimates in all nodes (s,a) visited during the tree-walk:

$$\hat{r}_{s,a} \leftarrow \frac{n_{s,a}\,\hat{r}_{s,a} + r_u}{n_{s,a}+1};\qquad n_{s,a} \leftarrow n_{s,a}+1 \quad (2)$$
Additional heuristics have been considered, chiefly to prevent over-exploration when the number of admissible arms is large w.r.t. the number of simulations (the so-called many-armed bandit issue (Wang et al. 2008)). The Progressive Widening (PW) heuristic (Coulom 2006) will be used in the following, where the allowed number of child nodes of s is initialized to 1 and increases with its number of visits \(n_s\) like \(\lfloor n_{s}^{1/b} \rfloor\) (with b usually set to 2 or 4). The Rapid Action Value Estimation (RAVE) heuristic is meant to guide the exploration of the search space (Gelly and Silver 2007). In its simplest version, RAVE(a) is set to the average reward taken over all tree-walks involving action a. The RAVE vector can be used to guide the tree-building phase, that is, when selecting a first child node upon arriving in a leaf node s, or when the Progressive Widening heuristic is triggered and a new child node is added to the current node s. In both cases, the selected action is the one maximizing RAVE(a). The RAVE heuristic aims at exploring the most promising regions of the search space earlier; for the sake of convergence speed, it is clearly desirable to consider the best options as early as possible.
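The Progressive Widening schedule above can be sketched as follows (an illustrative helper with names of our own choosing):

```python
import math

def allowed_children(n_s, b=2):
    """Progressive widening: number of child nodes allowed for a node
    visited n_s times, i.e. floor(n_s ** (1/b)), at least 1."""
    return max(1, math.floor(n_s ** (1.0 / b)))

def should_expand(n_children, n_s, b=2):
    """A new child is added when the visit count unlocks a new slot."""
    return n_children < allowed_children(n_s, b)
```

For b=2, a node thus gets its second child after 4 visits, its third after 9 visits, and so on.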
Overview of MOMCTS
The main difference between MCTS and MOMCTS regards the node selection step. The challenge is to extend the single-objective node selection criterion (Eq. (1)) to the multi-objective setting. Since there is no total order between points in the multi-dimensional space, as mentioned, the most straightforward way of dealing with multi-objective optimization is to come back to single-objective optimization by aggregating the objectives into a single one; the price to pay is that this approach yields a single solution on the Pareto front. Two aggregating functions (the hypervolume indicator and the cumulative discounted dominance reward), aimed at recovering a total order among points in the multi-dimensional reward space conditionally to the search archive, will be integrated within the MCTS framework.
The MOMCTS-hv algorithm is presented in Sect. 3.1 and its limitations are discussed in Sect. 3.2. The MOMCTS-dom algorithm, aimed at overcoming these limitations, is introduced in Sect. 3.3.
MOMCTS-hv
Node selection based on hypervolume indicator
The hypervolume indicator (Zitzler and Thiele 1998) provides a scalar measure of solution sets in the multi-objective space, as follows.
Definition 2
Given \(A \subset \mathbb{R}^{d}\) a set of vectorial rewards, and given a reference point \(z \in \mathbb{R}^{d}\) dominated by every r∈A, the hypervolume indicator (HV) of A is the measure of the set of points dominated by some point in A and dominating z:

$$HV(A;z) = \mu\bigl(\bigl\{ x \in \mathbb{R}^{d} \;:\; \exists\, r \in A,\ r \succeq x \succeq z \bigr\}\bigr)$$
where μ is the Lebesgue measure on \(\mathbb{R}^{d}\) (Fig. 1(a)).
It is clear that all dominated points in A can be removed without modifying the hypervolume indicator (HV(A;z)=HV(\(P_A\);z)). As shown by Fleischer (2003), the hypervolume indicator is maximized iff the points in \(P_A\) belong to the Pareto front of the MOO problem. Auger et al. (2009) show that, for d=2 and a number K of points, the hypervolume indicator maps a multi-objective optimization problem defined on \(\mathbb{R}^{d}\) onto a single-objective optimization problem on \(\mathbb{R}^{d \times K}\), in the sense that there exists at least one set of K points in \(\mathbb{R}^{d}\) that maximizes the hypervolume indicator w.r.t. z.
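In the two-objective case, the hypervolume indicator reduces to a sum of rectangle areas swept along the first objective, which the following sketch illustrates (a toy implementation for the maximization setting, not the algorithm used in the paper):

```python
def hypervolume_2d(points, z):
    """Hypervolume indicator for d=2 under maximization, w.r.t. a reference
    point z dominated by every point of interest. Keeps the non-dominated
    points and sums rectangle slices swept along objective 1."""
    # dominated points do not change HV, so drop them first
    front = [p for p in points
             if not any(q[0] >= p[0] and q[1] >= p[1] and q != p
                        for q in points)]
    front.sort()  # ascending in objective 1, hence descending in objective 2
    hv, prev_x = 0.0, z[0]
    for x, y in front:
        hv += (x - prev_x) * (y - z[1])  # slice (prev_x, x] has height y - z_2
        prev_x = x
    return hv
```

For instance, the set {(1,3), (2,2)} with z=(0,0) covers the union of the rectangles [0,1]×[0,3] and [0,2]×[0,2], of measure 5.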
Let P denote the archive of non-dominated vectorial rewards measured for every terminal state u (Sect. 2.2). It then comes naturally to define the value of any MCTS tree node as follows.
Let us associate to each node (s,a) in the tree the vector \(\overline{r}_{s,a}\) of the upper confidence bounds on its rewards:

$$\overline{r}_{s,a} = \biggl( \hat{r}_{s,a,1} + c_1 \sqrt{\frac{\log n_s}{n_{s,a}}},\ \ldots,\ \hat{r}_{s,a,d} + c_d \sqrt{\frac{\log n_s}{n_{s,a}}} \biggr) \quad (3)$$

with \(c_i\) the exploration vs exploitation parameter for the i-th objective (Eq. (1)).

An upper bound V(s,a) on the value of (s,a) is given by considering the hypervolume indicator of \(\overline{r}_{s,a}\) w.r.t. archive P, i.e. \(V(s,a) = HV(P \cup \{\overline{r}_{s,a}\}; z)\).

While V(s,a) does provide a scalar value of a node (s,a) conditioned on the solutions previously evaluated, it takes on a constant value if \(\overline{r}_{s,a}\) is dominated by some vectorial reward in P. In order to differentiate these dominated points, we consider the perspective projection \(\overline{r}^{p}_{s,a}\) of \(\overline{r}_{s,a}\) onto \(\mathcal{P}\), the piecewise linear surface in \(\mathbb{R}^{d}\) including all \(r_u \in P\) (Fig. 1(b)). Let \(\overline{r}^{p}_{s,a}\) denote the (unique) intersection of line \((\overline{r}_{s,a},z)\) with \(\mathcal{P}\) (being reminded that z is dominated by all points in P and by \(\overline{r}_{s,a}\)). The value function associated to (s,a) is then defined as the value of \(\overline{r}_{s,a}\), minus the Euclidean distance between \(\overline{r}_{s,a}\) and \(\overline{r}^{p}_{s,a}\). Finally, the value of (s,a) is defined as:

$$W(s,a) = V(s,a) - \bigl\Vert \overline{r}_{s,a} - \overline{r}^{\,p}_{s,a} \bigr\Vert \quad (4)$$

The Euclidean distance term here sets a penalty for dominated points, increasing with their distance to the linear envelope \(\mathcal{P}\) of P. Note that Eq. (4) sets a total order on all vectorial rewards in \(\mathbb{R}^{d}\), where non-dominated points are ranked higher than dominated ones.
MOMCTS-hv algorithm
MOMCTS-hv differs from MCTS in only three respects (Algorithm 1). Firstly, the selected action \(a^{*}\) is the one maximizing the value function W(s,a) instead of the UCB criterion (Eq. (1)). Secondly, MOMCTS-hv maintains the archive P of all non-dominated vectorial rewards evaluated in previous tree-walks. Upon arriving in a terminal state u, MOMCTS-hv evaluates the vectorial reward \(r_u\) of the tree-walk. It then updates \(\hat{r}_{s,a}\) for all nodes (s,a) visited during the tree-walk, and it updates archive P if \(r_u\) is non-dominated. Thirdly, the RAVE vector (Sect. 2.2) is used to select new nodes in the tree-building phase. Letting RAVE(a) denote the average vectorial reward associated to a, and letting \(\mathit{RAVE}^{p}(a)\) denote the perspective projection of RAVE(a) on the approximated Pareto front \(\mathcal{P}\), the action selected is the one minimizing the distance \(\Vert \mathit{RAVE}(a) - \mathit{RAVE}^{p}(a)\Vert\).
The MOMCTS-hv parameters include (i) the total number of tree-walks N; (ii) the b parameter used in the progressive widening heuristic (Sect. 2.2); (iii) the exploration vs exploitation tradeoff parameter \(c_i\) for every i-th objective; and (iv) the reference point z.
Discussion
Let B denote the average branching factor in the MOMCTS-hv tree, and let N denote the number of tree-walks. As each tree-walk adds a new node, the number of nodes in the tree is N+1 by construction. The average length of a tree path thus is in \(\mathcal{O}(\log{N})\). Depending on the number d of objectives, the hypervolume indicator is computed with complexity \(\mathcal{O}(|P|^{d/2})\) for d>3 (respectively \(\mathcal{O}(|P|)\) for d=2 and \(\mathcal{O}(|P|\log{|P|})\) for d=3) (Beume et al. 2009). The complexity of each tree-walk thus is \(\mathcal{O}(B|P|^{d/2}\log N)\), where the archive size |P| is at most the number N of tree-walks.
By construction, the hypervolume indicator based selection criterion (Eq. (4)) drives MOMCTS-hv towards the Pareto front and favours the diversity of the Pareto archive. On the negative side however, the computational cost of W(s,a) is exponential in the number d of objectives. Besides, the hypervolume indicator is not invariant under monotone transformations of the objective functions, which prevents the approach from enjoying the same robustness as comparison-based optimization approaches (Hansen 2006). Lastly, MOMCTS-hv critically depends on its hyper-parameters. The exploration vs exploitation (EvE) tradeoff parameters \(c_i, i=1,2,\ldots,d\) (Eq. (1)) of each objective have a significant impact on the performance of MOMCTS-hv (likewise, the MCTS applicative results depend on the tuning of the EvE tradeoff parameters (Chaslot et al. 2008)). Additionally, the choice of the reference point z also influences the hypervolume indicator values (Auger et al. 2009).
MOMCTS-dom
This section presents a new MOMCTS approach aimed at overcoming the above limitations, based on the Pareto dominance test. Notably, this test has linear complexity w.r.t. the number of objectives, and is invariant under monotone transformations of the objectives. As the dominance reward depends on the Pareto archive, which evolves along the search, the cumulative discounted dominance (CDD) reward mechanism is proposed to handle the search dynamics.
Node selection based on cumulative discounted dominance reward
Let P denote the archive of all non-dominated vectorial rewards previously gathered during the search process. A straightforward option would be to associate to each tree-walk reward 1 if the tree-walk gets a vectorial reward \(r_u\) which is not strictly dominated by any point in the archive P, and reward 0 otherwise. Formally this boolean dominance reward, called \(r_{u;dom}\), is defined as:

$$r_{u;dom} = \begin{cases} 1 & \text{if } \nexists\, r \in P,\ r \succ r_u\\ 0 & \text{otherwise} \end{cases}$$
The optimization problem defined by dominance rewards is non-stationary as it depends on the archive P, which evolves along time. To cope with non-stationarity, the reward update proceeds along a cumulative discounted (CD) process as follows. Let \(t_{s,a}\) denote the index of the last tree-walk which visited node (s,a), let \(\varDelta t = t - t_{s,a}\) where t is the index of the current tree-walk, and let δ∈[0,1] be a discount factor; the CD update is defined as:

$$\hat{r}_{s,a;dom} \leftarrow \delta^{\varDelta t}\, \hat{r}_{s,a;dom} + r_{u;dom} \quad (7)$$
The reward update in MOMCTS-dom differs from the standard scheme (Eq. (2)) in two respects. Firstly, cumulative instead of average rewards are considered. The rationale for this modification is that only a tiny percentage of the tree-walks, if any, finds a non-dominated vectorial reward. In such cases, average rewards become negligible compared to the exploration term, making MCTS degenerate to pure random search. The use of cumulative rewards instead tends to prevent this degradation.

Secondly, a discount mechanism is used to moderate the cumulative effects, using the discount factor δ (0≤δ≤1) and taking into account the number Δt of tree-walks since this node was last visited. This discount mechanism is meant to cope with the dynamics of multi-objective search by forgetting old rewards, thus enabling the decision rule to reflect up-to-date information.

Indeed, the CD process is reminiscent of the discounted cumulative reward defining the value function in Reinforcement Learning (Sutton and Barto 1998), with the difference that the time step t here corresponds to the tree-walk index, and that the discount mechanism is meant to limit the impact of past (as opposed to future) information.

In a stationary context, \(\hat{r}_{s,a;dom}\) would converge towards \(\frac{1}{1-\delta^{\varDelta t}} \bar{r}\), with Δt the average interval of time between two visits to the node. If the node gets exponentially rarely visited, \(\hat{r}_{s,a;dom}\) goes to \(\bar{r}\). On the contrary, if the node happens to be frequently visited, \(\bar{r}\) is multiplied by a large factor (\(\frac{1}{1-\delta}\)), entailing the over-exploitation of the node. However, the over-exploitation is bound to decrease as soon as the Pareto archive moves towards the true Pareto front. While this CDD reward was found to be empirically suited to the MOO setting (see also Maes et al. 2011), further work is required to analyze its properties.
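The CD update and its stationary fixed point can be illustrated as follows (a minimal sketch, assuming the decay-then-add form described above; function names are ours):

```python
def cdd_update(r_hat, r_dom, delta, dt):
    """One cumulative discounted dominance (CDD) update: the stored value
    decays by delta**dt (dt tree-walks since the last visit), then the
    new boolean dominance reward r_dom is added."""
    return delta ** dt * r_hat + r_dom

def cdd_fixed_point(r_bar, delta, dt):
    """Stationary limit when reward r_bar arrives every dt tree-walks:
    the geometric series sums to r_bar / (1 - delta**dt)."""
    return r_bar / (1 - delta ** dt)
```

Repeatedly applying the update with a constant reward indeed converges to the fixed point, illustrating the over-weighting of frequently visited nodes (small dt) discussed above.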
MOMCTS-dom algorithm
MOMCTS-dom proceeds as standard MCTS except for the update procedure, where Eq. (2) is replaced by Eq. (7). Keeping the same notations B, N and P as above, as the dominance test at the end of each tree-walk is linear in the archive size (\(\mathcal{O}(d|P|)\)), the complexity of each tree-walk in MOMCTS-dom is \(\mathcal{O}(B\log N + d|P|)\), linear w.r.t. the number d of objectives.
Besides the MCTS parameters N and b, MOMCTS-dom involves two additional hyper-parameters: (i) the exploration vs exploitation tradeoff parameter \(c_e\); and (ii) the discount factor δ.
Experimental validation
This section presents the experimental validation of the MOMCTS-hv and MOMCTS-dom algorithms.
Goals of experiments
The first goal is to assess the performance of the MOMCTS approaches comparatively to the state of the art in MORL (Vamplew et al. 2010). Two artificial benchmark problems (Deep Sea Treasure and Resource Gathering) with probabilistic transition functions are considered. The Deep Sea Treasure problem has two objectives which define a non-convex Pareto front (Sect. 4.2). The Resource Gathering problem has three objectives and a convex Pareto front (Sect. 4.3). The second goal is to assess the performance and scalability of the MOMCTS approaches in a real-world setting, that of grid scheduling problems (Sect. 4.4).
All reported results are averaged over 11 runs unless stated otherwise.
Indicators of performance
Two indicators are defined to measure the quality of solution sets in the multi-dimensional space. The first indicator is the hypervolume indicator (Sect. 3.1.1). The second indicator, inspired by the notion of regret, is defined as follows. Let \(P^{*}\) denote the true Pareto front. The empirical Pareto front P found by a search process is assessed from its generational distance (Van Veldhuizen 1999) and inverted generational distance w.r.t. \(P^{*}\). The generational distance (GD) is defined by \(\mathit{GD}(P) = (\sqrt{\sum_{i=1}^{n} d_{i}^{2}})/n\), where n is the size of P and \(d_i\) is the Euclidean distance between the i-th point in P and its nearest point in \(P^{*}\). GD measures the average distance from points in P to the Pareto front. The inverted generational distance (IGD) is likewise defined as the average distance of points in \(P^{*}\) to their nearest neighbour in P. For both generational and inverted generational distances, the smaller, the better.
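Both distances can be sketched in a few lines (an illustrative implementation following the definitions above, with function names of our own choosing):

```python
import math

def _dist(p, q):
    """Euclidean distance between two vectorial rewards."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def generational_distance(P, P_star):
    """GD(P) = sqrt(sum_i d_i^2) / n, with d_i the distance from the
    i-th point of P to its nearest point in the true front P_star."""
    d = [min(_dist(p, q) for q in P_star) for p in P]
    return math.sqrt(sum(x * x for x in d)) / len(P)

def inverted_generational_distance(P, P_star):
    """IGD: average distance from each point of P_star to its nearest
    neighbour in the approximation P."""
    return sum(min(_dist(q, p) for p in P) for q in P_star) / len(P_star)
```

GD penalizes approximations far from the true front, while IGD additionally penalizes approximations that cover only part of it.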
The algorithms are also assessed w.r.t. their computational cost (measured on a PC with Intel dualcore CPU 2.66 GHz).
Deep Sea Treasure
The Deep Sea Treasure (DST) problem, first introduced by Vamplew et al. (2010), is converted into a stochastic sequential decision making problem by introducing noise in its transition function. The state space of DST consists of a 10×11 grid (Fig. 2(a)). The action space of DST includes four actions (up, down, left and right), each sending the agent to the adjacent square in the indicated direction with probability 1−η, and in each of the other three directions with probability η/3, where 0≤η<1 indicates the noise level in the environment. When the selected action would send the agent beyond the grid or the sea borders, the agent stays in the same place. Each policy, with the top left square as initial state, gets a two-dimensional reward: the time spent until reaching a terminal state or reaching the time horizon T=100, and the treasure attached to the terminal state (Fig. 2(a)). The 10 non-dominated vectorial rewards, in the form (−time, treasure), are depicted in the two-dimensional plane in Fig. 2(b). It is worth noting that the Pareto front is non-convex.
Baseline algorithm
As mentioned in the introduction, the state of the art in MORL considers a scalar aggregation (e.g. a weighted sum) of the rewards associated to all objectives. Several multiple-policy MORL algorithms have been proposed (Natarajan and Tadepalli 2005; Tesauro et al. 2007; Barrett and Narayanan 2008; Lizotte et al. 2012) using the weighted sum of the objectives (with several weight settings) as scalar reward, which is optimized using standard reinforcement learning algorithms. The above algorithms differ in how they share information between different weight settings, and in which weight settings they choose to optimize. In the following, MOMCTS-dom is compared to Multi-Objective Q-Learning (MOQL) (Vamplew et al. 2010). Choosing MOQL as baseline is motivated by the fact that it yields all policies found by other linear-scalarization based approaches, provided that a sufficient number of weight settings is considered.
Formally, in the two-objective reinforcement learning case, MOQL independently optimizes m scalar RL problems through Q-learning, where the i-th problem considers reward \(r_i = (1-\lambda_i)\times r_a + \lambda_i \times r_b\), where \(0 \le \lambda_i \le 1,\ i=1,2,\ldots,m\) define the m weight settings of MOQL, and \(r_a\) (respectively \(r_b\)) is the first (resp. the second) objective reward. In its simplest version, the overall computational effort is equally divided between the m scalar RL problems. The computational effort allocated to each weight setting is further equally divided into \(n_{tr}\) training phases; after the j-th training phase, the performance of the i-th weight setting is measured by the two-dimensional vectorial reward, noted \(r_{i,j}\), of the current greedy policy. The m vectorial rewards of all weight settings \(\{r_{1,j}, r_{2,j}, \ldots, r_{m,j}\}\) together compose the Pareto front of MOQL at training phase j.
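The linear scalarization used by MOQL can be sketched as follows (illustrative helper functions, names ours):

```python
def scalarized_reward(r_a, r_b, lam):
    """Scalar reward of the i-th MOQL instance:
    r_i = (1 - lambda_i) * r_a + lambda_i * r_b."""
    return (1.0 - lam) * r_a + lam * r_b

def weight_settings(m):
    """The m weights lambda_i = (i - 1) / (m - 1), i = 1..m,
    evenly spread over [0, 1] (m >= 2 assumed)."""
    return [(i - 1) / (m - 1) for i in range(1, m + 1)]
```

Each weight setting thus defines one standard scalar Q-learning problem; the two extreme settings λ=0 and λ=1 optimize each objective in isolation.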
Experimental setting
We use the same MOQL experimental setting as in Vamplew et al. (2010):
– ϵ-greedy exploration is used with ϵ=0.1.

– The learning rate α is set to 0.1.

– The state-action value table is optimistically initialized (time=0, treasure=124).

– Due to the episodic nature of DST, no discounting is used in MOQL (γ=1).

– The number m of weight settings ranges in {3,7,21}, with \(\lambda_{i} = \frac{i-1}{m-1},\ i=1,2,\ldots,m\).
After a few preliminary experiments, the progressive widening parameter b is set to 2 in both MOMCTS-hv and MOMCTS-dom. In MOMCTS-hv, the exploration vs exploitation (EvE) tradeoff parameters for the time cost and the treasure value objectives are respectively set to \(c_{time}=20{,}000\) and \(c_{treasure}=150\). As the DST problem is concerned with minimizing the search time (maximizing its opposite) and maximizing the treasure value, the reference point used in the hypervolume indicator calculation is set to (−100,0).
In MOMCTS-dom, the EvE tradeoff parameter \(c_e\) is set to 1, and the discount factor δ is set to 0.999.
Experiments are carried out in a DST simulator with the noise level η ranging in {0, 10^{−3}, 10^{−2}, 5×10^{−2}, 0.1}. The training time of MOQL, MOMCTS-hv and MOMCTS-dom is limited to 300,000 time steps (ca. 37,000 tree-walks for MOMCTS-hv and 45,000 tree-walks for MOMCTS-dom). The entire training process is equally divided into \(n_{tr}=150\) phases. At the end of each training phase, the MOQL and MOMCTS solution sets are tested in the DST simulator, and form the Pareto set P. The performance of the algorithms is reported as the hypervolume indicator of P.
Results
Table 1 shows the performance of the MOMCTS approaches and MOQL, measured by the hypervolume indicator with reference point z=(−100,0).
Deterministic setting
Figure 3 displays the hypervolume indicator performance of MOMCTS-hv, MOMCTS-dom and that of MOQL for m=3,7,21 in the deterministic setting (η=0). It is observed that for m=7 or 21, MOQL reaches a performance plateau (10062) within 20,000 time steps. The fact that MOQL does not reach the optimal hypervolume indicator value (10455) is explained by the fact that the DST Pareto front is not convex (Fig. 2(b)). As is widely known (Deb 2001), linear-scalarization based MOO approaches fail to discover solutions in non-convex regions of the Pareto front. In such cases, MOQL is prevented from finding the true Pareto front and thus is inconsistent. Ultimately, MOQL only discovers the extreme points (−19,124) and (−1,1) of the Pareto front (Fig. 4(a)). Meanwhile, the MOMCTS-hv performance dominates that of MOQL throughout the training process. MOMCTS-dom catches up with MOQL after 80,000 time steps. The entire Pareto front is found by MOMCTS-hv in 5 out of 11 runs, and by MOMCTS-dom in 10 out of 11 runs.
Figure 3(b) shows the influence of m on MOQL. For m=7, MOQL reaches the performance plateau earlier than for m=21 (respectively 8,000 vs 20,000 time steps), albeit with some instability. The instability increases as m is set to 3. The fact that MOQL with m=3 fails to reach the MOQL performance plateau is explained by the fact that the extreme point (−19,124) can be missed in some runs, as MOQL uses a discount factor of 1 (after Vamplew et al. 2010); the largest treasure (124) might therefore be discovered later than at time step 19.
The percentage of times out of 11 runs that each non-dominated vectorial reward is discovered for at least one test episode during the training process of MOMCTS-hv, MOMCTS-dom and MOQL with m=21 is displayed in Fig. 4(b). This figure shows that MOQL discovers all strategies (including those lying in the non-convex regions of the Pareto front) during intermediate test episodes. However, these non-convex strategies are eventually discarded as the MOQL solution set gradually converges to the extreme strategies. On the contrary, the MOMCTS approaches discover all strategies in the Pareto front, and keep them in the search tree after they have been discovered. The weakness of MOMCTS-hv is that the longest decision sequences, corresponding to the vectorial rewards (−17,74) and (−19,124), need more time to be discovered. MOMCTS-dom successfully discovers all non-dominated vectorial rewards (in 10 out of 11 runs) and reaches an average hypervolume indicator performance slightly higher than that of MOMCTS-hv.
Stochastic setting
Figure 5 shows the performance of MOMCTS-hv, MOMCTS-dom and MOQL with m=21 in the stochastic environments (η=0.01, 0.1). As could be expected, the performances of MOQL and the MOMCTS approaches decrease and their variances increase with the noise level η, although their performances improve with training time (except for MOQL in the η=0.01 case). In the low-noise case (η=0.01), MOQL reaches its optimal performance after time step 40,000, with a high performance variance; it is outperformed by MOMCTS-hv and MOMCTS-dom, which obtain higher average hypervolume indicators with lower variances. When the noise rate increases (η=0.1), all performances are degraded, but the MOMCTS approaches still outperform MOQL in terms of both relative performance and variance (Table 1), showing a good robustness w.r.t. noise.
In summary, the empirical validation on the artificial DST problem shows both the strengths and the weaknesses of the MOMCTS approaches. On the positive side, the MOMCTS approaches prove able to find solutions lying in the non-convex regions of the Pareto front, as opposed to linear-scalarization-based methods; moreover, they show a reasonably good robustness w.r.t. noise. On the negative side, the MOMCTS approaches are computationally more expensive than MOQL (for 300,000 time steps, MOMCTS-hv takes 147 secs and MOMCTS-dom 49 secs, versus 25 secs for MOQL).
Resource Gathering
The Resource Gathering (RG) task, first introduced in Barrett and Narayanan (2008), is carried out on a 5×5 grid (Fig. 6). The action space of RG includes the same four actions (up, down, left and right) as in the DST problem. Starting from the home location, the goal of the agent is to gather two resources (gold and gems) and bring them back home. Each time the agent reaches a resource location, the resource is picked up; both resources can be carried by the agent at the same time. If the agent steps on one of the two enemy cells (indicated by swords), it may be attacked with 10 % probability, in which case it loses all resources being carried and is immediately returned to the home location. The agent enters a terminal state when it returns home (including the case of being attacked) or when the time horizon T=100 is reached. Five possible immediate reward vectors, ordered as (enemy, gold, gems), can be received upon the termination of a policy:

(−1,0,0) in case of an enemy attack;

(0,1,0) for returning home with only gold;

(0,0,1) for returning home with only gems;

(0,1,1) for returning home with both gold and gems;

(0,0,0) in all other cases.
The RG problem involves a discrete state space of 100 states, corresponding to the 25 agent positions in the grid multiplied by the four possible states of the resources currently held (none, gold only, gems only, both gold and gems). The vectorial reward associated to each policy π is computed as follows. Let r=(enemy, gold, gems) be the vectorial reward obtained by policy π after an L-step episode; the immediate reward of π is set to r _{ π,L }=r/L=(enemy/L, gold/L, gems/L), and the policy is associated its immediate reward averaged over 100 episodes, thereby favoring the discovery of policies with shortest length. Seven policies (Table 2 and Fig. 7), corresponding to the non-dominated average vectorial rewards of the RG problem, were identified by Vamplew et al. (2010). These non-dominated vectorial rewards compose a convex Pareto front in the three-dimensional space (Fig. 8).
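This normalization can be sketched as follows (illustrative Python with hypothetical function names; an episode is represented by its terminal reward vector and its length L):

```python
def immediate_reward(r, L):
    """Per-step reward r/L of an L-step episode, so that shorter
    successful episodes obtain larger (less negative) rewards."""
    return tuple(ri / L for ri in r)

def policy_value(episodes):
    """Average immediate reward over the sampled episodes,
    episodes being a list of (terminal reward vector, length) pairs."""
    rewards = [immediate_reward(r, L) for r, L in episodes]
    return tuple(sum(c) / len(rewards) for c in zip(*rewards))
```

E.g. an episode returning home with gold only after 10 steps yields (0, 0.1, 0), while a 5-step episode ending in an attack yields (−0.2, 0, 0).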
Experimental setting
In the RG problem, the MOMCTS approaches are assessed comparatively with the MOQL algorithm, which independently optimizes weighted sums of the three objectives (enemy, gold, gems) under m weight settings. In the three-dimensional reward space, a weight setting is defined by a pair \((\lambda_{i}, \lambda_{j}^{\prime})\), with \(\lambda _{i},\lambda_{j}^{\prime}\in[0,1]\) and \(0 \leq\lambda_{i} + \lambda _{j}^{\prime}\leq1\). The scalar reward optimized by MOQL is \(r_{i,j} = (1 - \lambda_{i} - \lambda_{j}^{\prime})\times r_{enemy} + \lambda _{i} \times r_{gold} + \lambda_{j}^{\prime}\times r_{gems}\), where the l weights λ _{ i } (respectively \(\lambda_{j}^{\prime}\)) are evenly distributed in [0,1] for the gold (resp. gems) objective, subject to \(\lambda_{i} + \lambda_{j}^{\prime}\leq1\); the total number of weight settings thus is \(m = \frac{l(l-1)}{2}\).
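The weight-setting grid can be enumerated as follows (an illustrative sketch; the strict inequality λ_i + λ'_j < 1 on the grid points is an assumption on our part, being the reading consistent with m = l(l−1)/2):

```python
def weight_settings(l):
    """Pairs (lambda_gold, lambda_gems) on an l-point grid of [0, 1],
    keeping a positive weight on the enemy objective; l(l-1)/2 pairs."""
    return [(i / (l - 1), j / (l - 1))
            for i in range(l) for j in range(l)
            if i + j < l - 1]  # integer test: lambda_i + lambda_j' < 1

def scalarize(r, lam_gold, lam_gems):
    """Scalar reward r_{i,j} optimized by one MOQL instance."""
    enemy, gold, gems = r
    return (1 - lam_gold - lam_gems) * enemy + lam_gold * gold + lam_gems * gems
```

With l = 4, 6, 10 this yields the m = 6, 15, 45 settings used in the experiments.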
The parameters of MOQL and MOMCTS approaches have been selected after preliminary experiments, using the same amount of computational resources for a fair comparison. For the MOQL:

The ϵ-greedy exploration is used, with ϵ=0.2.

Learning rate α is set to 0.2.

The discount factor γ is set to 0.95.

By taking l=4,6,10, the number m of weight settings ranges in {6,15,45}.
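Each MOQL instance thus performs standard tabular Q-learning on its scalarized reward; a minimal sketch with the above parameter values (helper names are ours):

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.2, gamma=0.95):
    """One tabular Q-learning update on the scalarized reward r."""
    q_sa = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)

def epsilon_greedy(Q, s, actions, eps=0.2):
    """Exploring action selection: random with probability eps."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```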
In MOMCTS-hv, the progressive widening parameter b is set to 2. The exploration vs exploitation (EvE) trade-off parameters associated to each objective are set to c _{ enemy }=1×10^{−3} and c _{ gold }=c _{ gems }=1×10^{−4}. The reference point z used in the hypervolume indicator calculation is set to (−0.33,−1×10^{−3},−1×10^{−3}), where −0.33 corresponds to the maximum enemy penalty averaged over the time steps of an episode, and the −1×10^{−3} values on the gold and gems objectives encourage the exploration of solutions with vectorial rewards lying in the hyperplanes gold=0 and gems=0.
In MOMCTS-dom, the progressive widening parameter b is set to 1 (no progressive widening). The EvE trade-off parameter c _{ e } is set to 0.1 and the discount factor δ to 0.99.
The training time of all considered algorithms is 600,000 time steps (ca. 17,200 tree-walks for MOMCTS-hv and 16,700 tree-walks for MOMCTS-dom). As in the DST problem, the training process is divided into 150 equal phases. At the end of each training phase, the MOQL and MOMCTS solution sets are tested in the RG simulator: each solution (strategy) is launched 100 times and is associated its average vectorial reward (which might dominate the theoretical optima due to the limited sample). The vectorial rewards of the solution set provided by each algorithm define its Pareto archive, and the algorithm performance is set to the hypervolume indicator of the Pareto archive with reference point z=(−0.33,−1×10^{−3},−1×10^{−3}). The optimal hypervolume indicator is 2.01×10^{−3}.
Results
Table 3 shows the performance of the MOMCTS-hv, MOMCTS-dom and MOQL algorithms after 600,000 time steps of training, measured by the hypervolume indicator. Figure 9 displays the evolution of the hypervolume indicator for MOMCTS-hv, MOMCTS-dom and MOQL with m=6,15,45, and Fig. 11 displays the percentage of runs (out of 11) in which each non-dominated vectorial reward is discovered in at least one test period during training. With m=6 weight settings, the MOQL performance stops improving after reaching a plateau of 1.9×10^{−3} at 120,000 time steps. Inspection of the Pareto archive shows that the gap between this plateau and the optimal performance (2.01×10^{−3}) is due to the non-discovery of policies π _{2}, π _{4} and π _{5}, whose vectorial rewards are not covered by the 6 weight settings (Fig. 10). MOQL reaches the optimum when m increases (after 240,000 steps for m=15 and 580,000 steps for m=45).
The MOMCTS approaches are outperformed by MOQL: their average hypervolume indicator reaches 1.8×10^{−3} at the end of the training process, which is explained by the fact that the MOMCTS approaches rarely find the risky policies π _{6} and π _{7} (Fig. 11). For instance, policy π _{6} visits an enemy cell twice; the neighbor nodes of this policy thus often get the (−1,0,0) reward (more in Sect. 5).
As shown in Fig. 12, the δ parameter governs the MOMCTS-dom performance. A low value (δ=0.9) leads to quickly forgetting the discovery of non-dominated rewards, turning MOMCTS-dom into pure exploration; quite the contrary, a high value (δ=0.999) limits the exploration and likewise hinders the overall performance.
On the computational cost side, the average execution times of 600,000 training steps in MOMCTS-hv, MOMCTS-dom and MOQL are respectively 944 secs, 47 secs and 43 secs. As the size of the Pareto archive is close to 10 in most tree-walks of MOMCTS-hv and MOMCTS-dom, the fact that MOMCTS-hv is 20 times slower than MOMCTS-dom matches their computational complexities.
As shown in Fig. 13, the cost of a tree-walk in MOMCTS-hv grows up to 20 times higher than that of MOMCTS-dom within the first 500 tree-walks, during which the Pareto archive size P grows. Afterwards, the cost of MOMCTS-hv gradually increases with the depth of the search tree (\(\mathcal{O}(\log N)\)). On the contrary, the computational cost of each tree-walk in MOMCTS-dom remains stable (between 1×10^{−3} and 2×10^{−3} secs) throughout the training process.
Grid scheduling
Pertaining to the domain of autonomic computing (Tesauro et al. 2007), the grid scheduling problem has been selected to investigate the scalability of the MOMCTS approaches; the reader is referred to Yu et al. (2008) for a comprehensive presentation of the field. Grid scheduling at large is concerned with scheduling the different tasks involved in jobs on different computational resources. As tasks are interdependent and resources are heterogeneous, grid scheduling defines an NP-hard combinatorial optimization problem (Ullman 1975).
Grid scheduling naturally aims at minimizing the so-called makespan, that is, the overall job completion time; but other objectives, such as energy consumption, monetary cost, or allocation fairness w.r.t. the resource providers, become increasingly important. In the rest of Sect. 4.4, two objectives are considered: the makespan and the cost of the solution.
In grid scheduling, a job is composed of J tasks T _{1}…T _{ J }, partially ordered through a dependency relation; T _{ i }→T _{ j } denotes that task T _{ i } must be executed before task T _{ j } (Fig. 14(a)). Each task T _{ i } is associated with a unitary load L _{ i }. Each task is assigned one out of M resources R _{1},…,R _{ M }; resource R _{ k } has computational efficiency speed _{ k } and unitary cost cost _{ k }. Grid scheduling achieves the task-resource assignment and orders the tasks executed on each resource; a grid scheduling solution, called an execution plan, is given as a sequence σ of (task, resource) pairs (Fig. 14(b)).
Let ρ(i)=k denote the index of the resource R _{ k } on which T _{ i } is executed, and let \(\mathcal{B}(T_{i})\) denote the set of tasks T _{ j } which either must be executed before T _{ i } (T _{ j }→T _{ i }) or are scheduled before T _{ i } on the same resource R _{ ρ(i)}. The completion time of a task T _{ i } is recursively computed as
\[ C(T_{i}) = \frac{L_{i}}{speed_{\rho(i)}} + \max_{T_{j} \in \mathcal{B}(T_{i})} C(T_{j}), \]
where the first term is the time needed to process T _{ i } on the assigned resource R _{ ρ(i)}, and the second term expresses the fact that all jobs in \(\mathcal{B}(T_{i})\) must be completed prior to executing T _{ i }.
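This recursion can be sketched as follows (an illustrative Python sketch; the data structures — task loads, resource speeds, the assignment ρ, the dependency predecessors and the per-resource execution order — are hypothetical representations of an execution plan):

```python
def completion_times(load, speed, rho, deps, order):
    """Completion time of every task: processing time on its assigned
    resource, plus the latest completion time over B(T_i)."""
    # B(T_i) = dependency predecessors + earlier tasks on the same resource.
    before = {i: set(deps.get(i, ())) for i in load}
    for seq in order.values():
        for pos, i in enumerate(seq):
            before[i].update(seq[:pos])

    memo = {}
    def c(i):
        if i not in memo:
            memo[i] = load[i] / speed[rho[i]] + max(
                (c(j) for j in before[i]), default=0.0)
        return memo[i]
    return {i: c(i) for i in load}
```

The makespan of the plan is then the largest completion time over all tasks.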
Finally, grid scheduling is the two-objective optimization problem of minimizing the overall makespan and cost of the execution plan σ, where C(T _{ i }) denotes the completion time of task T _{ i }:
\[ \mathit{makespan}(\sigma) = \max_{i} C(T_{i}), \qquad \mathit{cost}(\sigma) = \sum_{i} \frac{L_{i}}{speed_{\rho(i)}} \times cost_{\rho(i)}. \]
Baseline algorithms
The state of the art in grid scheduling is achieved by stochastic optimization algorithms (Yu et al. 2008). The two prominent multi-objective variants thereof are NSGA-II (Deb et al. 2000) and SMS-EMOA (Beume et al. 2007).
Both algorithms can be viewed as importance sampling methods: they maintain a population of solutions, initially defined as random execution plans. Iteratively, the solutions with best Pareto rank and best crowding distance (a density estimate of neighboring points, in NSGA-II) or hypervolume indicator (in SMS-EMOA) are selected and undergo unary and binary stochastic perturbations.
Experimental setting
A simulated grid environment containing 3 resources with different unit-time costs and processing capabilities (cost _{1}=20, speed _{1}=10; cost _{2}=2, speed _{2}=5; cost _{3}=1, speed _{3}=1) is defined. We first compare the performance of the MOMCTS approaches and the baseline algorithms on a realistic bioinformatics workflow, EBI_ClustalW2, which performs a ClustalW multiple sequence alignment using the EBI's WSClustalW2 service.^{Footnote 3} This workflow contains 21 tasks and 23 precedence pairs (graph density q=12 %), where all workloads are assumed equal. Secondly, the scalability of the MOMCTS approaches is tested on artificially generated workflows containing respectively 20, 30 and 40 tasks with graph density q=15 %.
As evidenced in the literature (Wang and Gelly 2007), MCTS performance heavily depends on the so-called random phase (Sect. 2.2). Preliminary experiments showed that uniform action selection in the random phase was ineffective; a simple heuristic was thus used to devise a better-suited action selection criterion, as follows.
Let EFT_{ i } denote the expected finish time of task T _{ i } (computed offline).
The heuristic action selection uniformly selects an admissible task T _{ i } and compares EFT_{ i } to EFT_{ j } for all admissible tasks T _{ j }. If EFT_{ i } is maximal, T _{ i } is allocated to the resource due to be free the earliest; if EFT_{ i } is minimal, T _{ i } is allocated to the resource due to be free the latest. The random phase thus implements a default policy which randomly allocates tasks to resources, except for the most (respectively least) critical tasks, which are scheduled with high (resp. low) priority.
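The resulting default policy can be sketched as follows (illustrative; the EFT values and the times at which each resource becomes free are assumed precomputed):

```python
import random

def random_phase_action(admissible, eft, free_time):
    """Pick a random admissible task; schedule the most critical one
    (largest EFT) on the earliest-free resource, the least critical one
    on the latest-free resource, and any other on a random resource."""
    task = random.choice(admissible)
    efts = [eft[t] for t in admissible]
    if eft[task] == max(efts):
        resource = min(free_time, key=free_time.get)  # earliest free
    elif eft[task] == min(efts):
        resource = max(free_time, key=free_time.get)  # latest free
    else:
        resource = random.choice(list(free_time))
    return task, resource
```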
The parameters of all algorithms have been selected after preliminary experiments, using the same amount of computational resources for a fair comparison. The progressive widening parameter b is set to 2 in both MOMCTS-hv and MOMCTS-dom. In MOMCTS-hv, the exploration vs exploitation (EvE) trade-off parameters associated to the makespan and cost objectives, c _{ time } and c _{ cost }, are both set to 5×10^{−3}. In MOMCTS-dom, the EvE trade-off parameter c _{ e } is set to 1 and the discount factor δ to 0.99. NSGA-II (respectively SMS-EMOA) uses a population of 200 (resp. 120) individuals, of which 100 are selected and undergo stochastic unary and binary variations (one-point reordering, and resource exchange between two individuals). For all algorithms, the number N of tree-walks, a.k.a. the evaluation budget, is set to 10,000. The reference point in each experiment is set to (z _{ t }, z _{ c }), where z _{ t } and z _{ c } respectively denote the maximal makespan and cost.
As the true Pareto front of the considered problems is unknown, it is replaced by a reference Pareto front P ^{∗}, gathering all non-dominated vectorial rewards obtained over all runs of all algorithms. The performance indicators are the generational distance (GD) and the inverted generational distance (IGD) between the Pareto front P found in a run and the reference front P ^{∗}. In the grid scheduling experiments, the IGD indicator plays a role similar to that of the hypervolume indicator in the DST and RG problems.
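For concreteness, an averaged-distance variant of these indicators can be sketched as follows (the exact definitions vary in the literature; e.g. Van Veldhuizen (1999) uses a root-mean-square form):

```python
from math import dist

def gd(P, P_star):
    """Generational distance: mean distance from each point of the found
    front P to its nearest neighbor in the reference front P*."""
    return sum(min(dist(p, q) for q in P_star) for p in P) / len(P)

def igd(P, P_star):
    """Inverted generational distance: mean distance from each reference
    point to the found front; a low IGD requires covering all of P*."""
    return gd(P_star, P)
```

A low GD thus rewards fronts close to P ^{∗}, while a low IGD additionally rewards fronts covering all of P ^{∗}.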
Results
Figure 15 displays the GD and IGD of MOMCTS-hv, MOMCTS-dom, NSGA-II and SMS-EMOA on the EBI_ClustalW2 workflow and on artificial jobs with a number J of tasks ranging in {20, 30, 40} and graph density q=15 %. Figure 16 shows the Pareto fronts discovered by the four algorithms on the EBI_ClustalW2 workflow after N=100, 1,000 and 10,000 policy evaluations (tree-walks), compared to the reference Pareto front. In all considered problems, the MOMCTS approaches are outperformed by the baselines in terms of the GD indicator: while they quickly find good solutions, they fail to discover the reference Pareto front. Meanwhile, they yield a better IGD performance than the baselines, indicating that on average a single MOMCTS run reaches a better approximation of the true Pareto front.
Overall, the main weakness of the MOMCTS approaches is their computational runtime: the computational costs of MOMCTS-hv and MOMCTS-dom are respectively 5 and 2.5 times higher than those of NSGA-II and SMS-EMOA.^{Footnote 4} This weakness should however be put in perspective, noting that in real-world problems the evaluation cost dominates the search cost by several orders of magnitude.
Discussion
As mentioned, the state of the art in MORL is divided into single-policy and multiple-policy algorithms (Vamplew et al. 2010). In the former case, the multiple objectives are aggregated into a single one using a set of preferences between objectives, either user-specified or derived from the problem domain (e.g. defining preferred regions (Mannor and Shimkin 2004) or setting weights on the objectives (Tesauro et al. 2007)). The strength of the single-policy approach is its simplicity; its long-known limitation is that it cannot discover policies in non-convex regions of the Pareto front (Deb 2001).
In the multiple-policy case, multiple Pareto-optimal vectorial rewards can be obtained by optimizing different scalarized RL problems under different weight settings. Natarajan and Tadepalli (2005) show that the efficiency of MOQL can be improved by sharing information between different weight settings. A hot topic in multiple-policy MORL is how to design the weight settings and share information among the scalarized RL problems. When the Pareto front is known, the design of the weight settings is made easier, provided that the Pareto front is convex. When the Pareto front is unknown, an alternative proposed by Barrett and Narayanan (2008) is to maintain Q-vectors instead of Q-values for each (state, action) pair. Through an adaptive selection of weight settings corresponding to the vectorial rewards on the boundary of the convex set of the current Q-vectors, this algorithm narrows down the set of selected weight settings, at the expense of a higher complexity of value iteration in each state: the \(\mathcal{O}(SA)\) complexity of standard Q-learning is multiplied by a factor \(\mathcal{O}(n^{d})\), where n is the number of points on the convex hull of the Q-vectors and d is the number of objectives. While the approach provides optimality guarantees (n converges toward the number of Pareto-optimal policies), the number of intermediate solutions can be huge (in the worst case, \(\mathcal{O}(A^{S})\)). Under a convexity and piecewise-linearity assumption on the shape of the convex hull of the Q-vectors, Lizotte et al. (2012) extend Barrett and Narayanan (2008) by narrowing down the set of points on the convex hull, thus keeping n under control.
In the MOMCTS-hv approach, each tree node is associated its average reward w.r.t. each objective, and the selection rule involves a scalar reward based on the hypervolume indicator (Zitzler and Thiele 1998), with complexity \(\mathcal{O}(BP^{d/2}\log N)\). On the one hand, this complexity is lower than that of a value iteration in Barrett and Narayanan (2008) (considering that the size P of the archive is comparable to the number n of non-dominated Q-vectors). On the other hand, it is higher than that of MOMCTS-dom, where the dominance test only needs to be computed at the end of each tree-walk, with complexity linear in the number of objectives and tree-walks; the MOMCTS-dom complexity thus is \(\mathcal{O}(B\log N+dP)\). The price to pay for the improved scalability of MOMCTS-dom is that the dominance reward might favor the diversity of the Pareto archive less than the hypervolume indicator: any non-dominated point gets the same dominance reward, whereas the hypervolume indicator of non-dominated points in sparsely populated regions of the Pareto archive is higher.
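The O(dP) end-of-tree-walk test of MOMCTS-dom can be sketched as follows (an illustrative maximization sketch, not the authors' implementation):

```python
def dominates(u, v):
    """Pareto dominance (maximization): u at least as good on every
    objective and strictly better on at least one."""
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))

def dominance_reward(archive, r):
    """Return the updated Pareto archive and the scalar reward of a
    tree-walk ending with vectorial reward r: 1 if r is non-dominated
    (the archive is then pruned of points r dominates), 0 otherwise."""
    if any(dominates(a, r) for a in archive):
        return archive, 0
    archive = [a for a in archive if not dominates(r, a)]
    if r not in archive:
        archive.append(r)
    return archive, 1
```

Each call costs one dominance test per archive point, i.e. O(dP), matching the complexity discussed above.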
As shown on the Resource Gathering problem, the MOMCTS approaches have difficulties in finding “risky” policies, i.e. nodes with many low-reward nodes in their neighborhood. A tentative explanation, already noted by Coquelin and Munos (2007), is that the UCT algorithm may require exponential time to converge to the optimal node when this node is hidden by nodes with low rewards.
Conclusion and perspectives
This paper has pioneered the extension of MCTS to multi-objective reinforcement learning, based on two scalar rewards measuring the merits of a policy relative to the non-dominated policies in the search tree. These rewards, respectively the hypervolume indicator and the dominance reward, have complementary strengths and weaknesses: the hypervolume indicator is computationally expensive but explicitly favors the diversity of the MOO policies, enforcing a good coverage of the Pareto front; quite the contrary, the dominance test is linear in the number of objectives, and is further invariant under monotonic transformations of the objective functions, a robustness property much appreciated when dealing with ill-posed optimization problems.
These approaches have been validated on three problems: Deep Sea Treasure (DST), Resource Gathering (RG) and grid scheduling.
The experimental results on DST confirm a main merit of the proposed approaches, their ability to discover policies lying in the nonconvex regions of the Pareto front. To our knowledge,^{Footnote 5} this feature is unique in the MORL literature.
On the other hand, the MOMCTS approaches suffer from two weaknesses. Firstly, as shown on the grid scheduling problem, some domain knowledge is required in complex problems to enforce an efficient exploration in the random phase. Secondly, as evidenced on the Resource Gathering problem, the presented approaches hardly discover “risky” policies lying in unpromising regions (the proverbial needle in the haystack).
These first results nevertheless provide a proof of concept for the MOMCTS approaches, which yield performances comparable to the (non-RL-based) state of the art, albeit at the price of a higher computational cost.
This work opens two perspectives for further studies. The main theoretical perspective concerns the properties of the cumulative discounted reward mechanism in the general (single-objective) dynamic optimization context. On the applicative side, we plan to refine the RAVE heuristics used in the grid scheduling problem, e.g. to estimate the reward attached to paired task-allocation and ordering decisions.
Notes
Notably, the chances for a Pareto front to be convex decrease with the number of objectives.
Another option is to use a dynamically weighted combination of the reward \(\hat{r}_{s,a}\) and RAVE(a) in Eq. (1).
The complete description is available at http://www.myexperiment.org/workflows/203.html.
On workflow EBI_ClustalW2, the average execution time of MOMCTShv, MOMCTSdom, NSGAII and SMSEMOA are respectively 142 secs, 74 secs, 31 secs and 32 secs.
A general polynomiality result for MOO has been proposed by Chatterjee (2007), who claims that for all irreducible MDPs with multiple long-run average objectives, the Pareto front can be ϵ-approximated in time polynomial in ϵ. However, this claim relies on the assumption that finding some Pareto-optimal point can be reduced to optimizing a single objective, namely a convex combination of objectives using a set of positive weights (p. 2, Chatterjee 2007), which does not hold for non-convex Pareto fronts. Furthermore, the approach relies on the ϵ-approximation of the Pareto front proposed by Papadimitriou and Yannakakis (2000), which assumes the existence of an oracle telling for each vectorial reward whether it is ϵ-Pareto-dominated (Theorem 2, p. 4, Papadimitriou and Yannakakis 2000).
References
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2), 235–256.
Auger, A., Bader, J., Brockhoff, D., & Zitzler, E. (2009). Theory of the hypervolume indicator: optimal μdistributions and the choice of the reference point. In FOGA’09 (pp. 87–102). New York: ACM.
Barrett, L., & Narayanan, S. (2008). Learning all optimal policies with multiple criteria. In W. W. Cohen, A. McCallum, & S. T. Roweis (Eds.), ICML’08 (pp. 41–47). New York: ACM.
Berthier, V., Doghmen, H., & Teytaud, O. (2010). Consistency modifications for automatically tuned MonteCarlo Tree Search. In C. Blum & R. Battiti (Eds.), LNCS: Vol. 6073. LION4 (pp. 111–124). Berlin: Springer.
Beume, N., Naujoks, B., & Emmerich, M. (2007). SMS-EMOA: multiobjective selection based on dominated hypervolume. European Journal of Operational Research, 181(3), 1653–1669.
Beume, N., Fonseca, C. M., LopezIbanez, M., Paquete, L., & Vahrenhold, J. (2009). On the complexity of computing the hypervolume indicator. IEEE Transactions on Evolutionary Computation, 13(5), 1075–1082.
Chaslot, G., Chatriot, L., Fiter, C., Gelly, S., Hoock, J. B., Perez, J., Rimmel, A., & Teytaud, O. (2008). Combining expert, offline, transient and online knowledge in MonteCarlo exploration (Technical Report). Paris: Lab. Rech. Inform. (LRI). doi:10.1.1.169.8073.
Chatterjee, K. (2007). Markov decision processes with multiple longrun average objectives. In: FSTTCS 2007 foundations of software technology and theoretical computer science (Vol. 4855, pp. 473–484).
Ciancarini, P., & Favini, G. P. (2009). MonteCarlo Tree Search techniques in the game of kriegspiel. In C. Boutilier (Ed.), IJCAI’09 (pp. 474–479).
Coquelin, P. A., & Munos, R. (2007). Bandit algorithms for tree search. Preprint arXiv:cs/0703062.
Coulom, R. (2006). Efficient selectivity and backup operators in MonteCarlo Tree Search. In Proc. computers and games (pp. 72–83).
Deb, K. (2001). Multiobjective optimization using evolutionary algorithms (pp. 55–58). Chichester: Wiley.
Deb, K., Pratap, A., Agarwal, S., & Meyarivan, T. (2000). A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. In M. Schoenauer et al. (Eds.), LNCS: Vol. 1917. PPSN VI (pp. 849–858). Berlin: Springer.
Deb, K., Thiele, L., Laumanns, M., & Zitzler, E. (2002). Scalable multiobjective optimization test problems. In Proceedings of the congress on evolutionary computation (CEC2002) (pp. 825–830). Honolulu, USA.
Fleischer, M. (2003). The measure of Pareto optima: applications to multiobjective metaheuristics. In LNCS: Vol. 2632. EMO’03 (pp. 519–533). Berlin: Springer.
Gábor, Z., Kalmár, Z., & Szepesvári, C. (1998). Multicriteria reinforcement learning. In ICML’98 (pp. 197–205). San Mateo: Morgan Kaufmann.
Gelly, S., & Silver, D. (2007). Combining online and offline knowledge in UCT. In Z. Ghahramani (Ed.), ICML’07 (pp. 273–280). New York: ACM.
Hansen, N. (2006). The CMA evolution strategy: a comparing review. In Towards a new evolutionary computation (pp. 75–102). Berlin: Springer. doi:10.1007/3540324941_4.
Kocsis, L., & Szepesvári, C. (2006). Bandit based MonteCarlo planning. In J. Fürnkranz, T. Scheffer, & M. Spiliopoulou (Eds.), ECML’06 (pp. 282–293). Berlin: Springer.
Lizotte, D. J., Bowling, M., & Murphy, S. A. (2012). Linear fitted-Q iteration with multiple reward functions. Journal of Machine Learning Research, 13, 3253–3295.
Maes, F., Wehenkel, L., & Ernst, D. (2011). Automatic discovery of ranking formulas for playing with multiarmed bandits. In S. Sanner & M. Hutter (Eds.), LNCS: Vol. 7188. Recent advances in reinforcement learning—9th European workshop, EWRL 2011 (pp. 5–17). Berlin: Springer.
Mannor, S., & Shimkin, N. (2004). A geometric approach to multicriterion reinforcement learning. Journal of Machine Learning Research, 5, 325–360. doi:10.1.1.9.5762.
Nakhost, H., & Müller, M. (2009). MonteCarlo exploration for deterministic planning. In C. Boutilier (Ed.), IJCAI’09 (pp. 1766–1771).
Natarajan, S., & Tadepalli, P. (2005). Dynamic preferences in multicriteria reinforcement learning. In ICML’05. New York: ACM.
Papadimitriou, C. H., & Yannakakis, M. (2000). On the approximability of tradeoffs and optimal access of web sources. In FOCS (pp. 86–92). Los Alamitos: IEEE Computer Society.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: an introduction. Cambridge: MIT Press.
Szepesvári, C. (2010). Algorithms for reinforcement learning. San Rafael: Morgan & Claypool.
Tesauro, G., Das, R., Chan, H., Kephart, J., Levine, D., Rawson, F., & Lefurgy, C. (2007). Managing power consumption and performance of computing systems using reinforcement learning. In J. C. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.), NIPS’07 (pp. 1–8).
Ullman, J. D. (1975). NPcomplete scheduling problems. Journal of Computer and System Sciences, 10(3), 384–393.
Vamplew, P., Dazeley, R., Berry, A., Issabekov, R., & Dekker, E. (2010). Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning, 84, 51–80.
Van Veldhuizen, D. A. (1999). Multiobjective evolutionary algorithms: classifications, analyses, and new innovations (Technical report). DTIC Document.
Wang, Y., & Gelly, S. (2007). Modifications of UCT and sequencelike simulations for MonteCarlo Go. In CIG’07 (pp. 175–182). New York: IEEE Press.
Wang, W., & Sebag, M. (2012). Multiobjective MonteCarlo Tree Search. In Asian conference on machine learning.
Wang, Y., Audibert, J., & Munos, R. (2008). Algorithms for infinitely manyarmed bandits. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), NIPS’08 (pp. 1–8).
Yu, J., Buyya, R., & Ramamohanarao, K. (2008). Workflow scheduling algorithms for grid computing. In Studies in computational intelligence (Vol. 146, pp. 173–214). Berlin: Springer.
Zitzler, E., & Thiele, L. (1998). Multiobjective optimization using evolutionary algorithms—a comparative case study. In A. E. Eiben, T. Bäck, M. Schoenauer, & H. Schwefel (Eds.), LNCS: Vol. 1498. PPSN V (pp. 292–301). Berlin: Springer.
Acknowledgements
We wish to thank JeanBaptiste Hoock, Dawei Feng, Ilya Loshchilov, Romaric Gaudel, and Julien Perez for many discussions on UCT, MOO and MORL. We are grateful to the anonymous reviewers for their many comments and suggestions on a previous version of the paper.
Editors: Zhi-Hua Zhou, Wee Sun Lee, Steven Hoi, Wray Buntine, and Hiroshi Motoda.
Wang, W., Sebag, M. Hypervolume indicator and dominance reward based multi-objective Monte-Carlo Tree Search. Mach Learn 92, 403–429 (2013). https://doi.org/10.1007/s10994-013-5369-0
Keywords
 Reinforcement learning
 MonteCarlo Tree Search
 Multiobjective optimization
 Sequential decision making