Model-free Reinforcement Learning for Branching Markov Decision Processes

We study reinforcement learning for the optimal control of Branching Markov Decision Processes (BMDPs), a natural extension of (multitype) Branching Markov Chains (BMCs). The state of a (discrete-time) BMC is a collection of entities of various types that, while spawning other entities, generate a payoff. In comparison with BMCs, where the evolution of each entity of the same type follows the same probabilistic pattern, BMDPs allow an external controller to pick from a range of options. This permits us to study the best/worst behaviour of the system. We generalise model-free reinforcement learning techniques to compute an optimal control strategy of an unknown BMDP in the limit. We present results of an implementation that demonstrate the practicality of the approach.


Introduction
Branching Markov Chains (BMCs), also known as Branching Processes, are natural models of population dynamics and parallel processes. The state of a BMC consists of entities of various types, and many entities of the same type may coexist. Each entity can branch in a single step into a (possibly empty) set of entities of various types while disappearing itself. This assumption is natural, for instance, for annual plants that reproduce only at a specific time of the year, or for bacteria, which either split or die. An entity may spawn a copy of itself, thereby simulating the continuation of its existence.
The offspring of an entity is chosen at random among options according to a distribution that depends on the type of the entity. The type captures significant differences between entities. For example, stem cells are very different from regular cells; parallel processes may be interruptible or have different privileges. The type may reflect characteristics of the entities such as their age or size. Although entities coexist, the BMC model assumes that there is no interaction between them. Thus, how an entity reproduces and for how long it lives is the same as if it were the only entity in the system. This assumption greatly improves the computational complexity of the analysis of such models and is appropriate when the population exists in an environment that has virtually unlimited resources to sustain its growth. This is a common situation that holds when a species has just been introduced into an environment, in an early stage of an epidemic outbreak, or when running jobs in cloud computing.
BMCs have a wide range of applications in modelling various physical phenomena, such as nuclear chain reactions, red blood cell formation, population genetics, population migration, epidemic outbreaks, and molecular biology. Many examples of BMC models used in biological systems are discussed in [12].
Branching Markov Decision Processes (BMDPs) extend BMCs by allowing a controller to choose the branching dynamics for each entity. This choice is modelled as nondeterministic, instead of random. This extension is analogous to how Markov Decision Processes (MDPs) generalise Markov chains (MCs) [24]. Allowing an external controller to select a mode of branching allows us to study the best/worst behaviour of the examined model.
As a motivating example, let us discuss a simple model of cloud computing. A computation may be divided into tasks in order to finish it faster, as each server may have different computational power. Since the computation of each task depends on the previous one, the total running time is the sum of the running times of each spawned task as well as the time needed to split and merge the result of each computation into the final solution. As we shall see, the execution of each task is not guaranteed to be successful and is subject to random delays. Specifically, let us consider the following model with two different types ($T$ and $S$), and two actions ($a_1$ and $a_2$). This BMDP consists of the main task, $T$, that may be split (action $a_1$) into three smaller tasks, for simplicity assumed to be of the same type $S$, and this split and merger of the intermediate results takes 1 hour (1h). Alternatively (action $a_2$), we can execute the whole task $T$ on the main server, but it will be slow (8 hours). Task $S$ can (action $a_1$) be run on a reliable server in 1.6 hours or (action $a_2$) on an unreliable one that finishes after 1 hour (irrespective of whether or not the computation is completed successfully), but with a 40% chance we need to rerun this task due to the server crashing. We can represent this model formally as follows, where each arrow is labelled with the action and its cost in hours, and $\epsilon$ stands for the task finishing without spawning further tasks:
$$T \xrightarrow{a_1,\ 1} SSS \qquad T \xrightarrow{a_2,\ 8} \epsilon \qquad S \xrightarrow{a_1,\ 1.6} \epsilon \qquad S \xrightarrow{a_2,\ 1} \{0.4 : S,\ 0.6 : \epsilon\}$$
We would like to know the infimum of the expected running time (i.e. the expected running time when optimal decisions are made) of task $T$. In this case the optimal control is to pick action $a_1$ first and then action $a_1$ for all tasks $S$, with a total running time of $1 + 3 \cdot 1.6 = 5.8$ hours. The expected running time when picking action $a_2$ for $S$ instead would be $1 + 3 \cdot 1/0.6 = 6$ hours.
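The arithmetic behind these two figures can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and it assumes, as in the text, that the running times of the spawned tasks simply add up.

```python
# Expected running times (in hours) for the two-type cloud example.
# Assumption: costs of spawned tasks add up, as described in the text.
# Function names are illustrative and not taken from the paper.

def expected_cost_S(action: str) -> float:
    """Expected time to complete one S task under a fixed action."""
    if action == "a1":
        # Reliable server: always finishes after 1.6 hours.
        return 1.6
    if action == "a2":
        # Unreliable server: 1 hour per attempt, 40% chance of a rerun.
        # The expectation x solves x = 1 + 0.4 * x, i.e. x = 1 / 0.6.
        return 1.0 / (1.0 - 0.4)
    raise ValueError(f"unknown action {action!r}")


def expected_cost_T(action_T: str, action_S: str) -> float:
    """Expected time to complete T when every S task uses action_S."""
    if action_T == "a1":
        # Split and merge costs 1 hour, then three S tasks are run.
        return 1.0 + 3 * expected_cost_S(action_S)
    if action_T == "a2":
        # Run everything on the slow main server.
        return 8.0
    raise ValueError(f"unknown action {action_T!r}")
```

Calling `expected_cost_T("a1", "a1")` and `expected_cost_T("a1", "a2")` reproduces the two candidate values compared in the text, up to floating-point rounding.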
Let us now assume that the execution of tasks S for action a 1 may be interrupted with probability 30% by a task of higher priority (type H). Moreover, these H tasks may be further interrupted by tasks with even higher priority (to simplify matters, again modelled by type H). The computation time of T is prolonged by 0.1 hour for each H spawned. Our model then becomes: . This is enough for the optimal strategy of running S to become a 2 . Note that if the probability of H being interrupted is at least 50% then the expected running time of H becomes ∞.
When dealing with a real-life process, it is hard to come up with a (probabilistic and controlled) model that approximates it well. This requires experts to analyse all possible scenarios and estimate the probability of outcomes in response to actions based on either complex calculations or the statistical analysis of sufficient observational data. For instance, it is hard to estimate the probability of an interrupt H occurring in the model above without knowing which server will run the task, its usual workload and statistics regarding the priorities of the tasks it executes. Even if we do this estimation well, unexpected or rare events may happen that would require us to recalibrate the model as we observe the system under our control.
Instead of building such a model explicitly first and fixing the probabilities of possible transitions in the system based on our knowledge of the system or its statistics, we advocate the use of reinforcement learning (RL) techniques [27] that were successfully applied to finding optimal control for finite-state Markov Decision Processes (MDPs). Q-learning [30] is a well-studied model-free RL approach to compute an optimal control strategy without knowing anything about the model apart from its initial state and the set of actions available in each of its states. It also has the advantage that the learning process converges to the optimal control while exploiting along the way what it already knows. While the formulation of the Q-learning algorithm for BMDPs is straightforward, the proof that it works is not. This is because, unlike the MDPs with discounted rewards for which the original Q-learning algorithm was defined, our model does not have an explicit contraction in each step, nor does boundedness of the optimal values or one-step updates hold. Similarly, one cannot generalise the result from [11] that estimates the time needed for the Q-learning algorithm to converge within ε of the optimal values with high probability for finite-state MDPs.

Related work
The simplest BMC models are Galton-Watson processes ([31]), discrete-time models where all entities are of the same type. They date as far back as 1845 [14] and were used to explain why some aristocratic family surnames became extinct.
The generalisation of this model to multiple types of entities was first studied in the 1940s by Kolmogorov and Sevast'yanov ([17]). For an overview of the results known for BMCs, see e.g. [13] and [12]. The precise computational complexity of decision problems about the probabilities of extinction of an arbitrary BMC was first established in [9]. The problem of checking if a given BMC terminates almost surely was shown in [5] to be strongly polynomial. The probability of acceptance of a run of a BMC by a deterministic parity tree automaton was studied in [4] and shown to be computable in PSPACE and in polynomial time for probabilities 0 or 1. In [16] a generalisation of BMCs was considered that allowed for limited synchronisation of different tasks. BMDPs, a natural generalisation of BMCs to a controlled setting, have been studied in the OR literature (e.g., [23,26]). Hierarchical MDPs (HMDPs) [10] are a special case of BMDPs where there are no cycles in the offspring graph (equivalently, no cyclic dependency between types). BMDPs and HMDPs have found applications in manpower planning [29], controlled queuing networks [15,2], management of livestock [20], and epidemic control [1,25], among others. The focus of these works was on optimising the expected average or the discounted reward over a run of the process, or on optimising the population growth rate. In [10] the decision problem whether the optimal probability of termination exceeds a threshold was studied: it was shown to be solvable in PSPACE and at least as hard as the square-root sum problem, but one can determine if the optimal probability is 0 or 1 in polynomial time. In [7], it was shown that the approximation of the optimal probability of extinction for BMDPs can be done in polynomial time. The computational complexity of computing the optimal expected total cost before extinction for BMDPs follows from [8], where this cost was shown to be computable in polynomial time via a linear program formulation.
The problem of maximising the probability of reaching a state with an entity of a given type for BMDPs was studied in [6]. In [28] an extension of BMDPs with real-valued clocks and timing constraints on productions was studied.

Summary of the results
We show that an adaptation of the Q-learning algorithm converges almost surely to the optimal values for BMDPs under mild conditions: all costs are positive and each Q-value is selected for update independently at random. We have implemented the proposed algorithm in the tool Mungojerrie [21] and tested its performance on small examples to demonstrate its efficiency in practice. To the best of our knowledge, this is the first time model-free RL has been used for the analysis of BMDPs.

Preliminaries
We denote by $\mathbb{N}$ the set of non-negative integers, by $\mathbb{R}$ the set of reals, by $\mathbb{R}_+$ the set of positive reals, and by $\mathbb{R}_{\geq 0}$ the set of non-negative reals. We let $\overline{\mathbb{R}}_+ = \mathbb{R}_+ \cup \{\infty\}$ and $\overline{\mathbb{R}}_{\geq 0} = \mathbb{R}_{\geq 0} \cup \{\infty\}$. We denote by $|X|$ the cardinality of a set $X$ and by $X^*$ ($X^\omega$) the set of all possible finite (infinite) sequences of elements of $X$. Finite sequences are also called lists.
Vectors and Lists. We use $\bar{x}, \bar{y}, \bar{c}$ to denote vectors and $\bar{x}_i$ or $\bar{x}(i)$ to denote the $i$-th entry of $\bar{x}$. We let $\bar{0}$ denote a vector with all entries equal to 0; its size may vary depending on the context. Likewise, $\bar{1}$ is a vector with all entries equal to 1. For vectors $\bar{x}, \bar{y} \in \overline{\mathbb{R}}^n_{\geq 0}$, $\bar{x} \leq \bar{y}$ means $x_i \leq y_i$ for every $i$, and $\bar{x} < \bar{y}$ means $\bar{x} \leq \bar{y}$ and $x_i \neq y_i$ for some $i$. We also make use of the infinity norm $\|\bar{x}\|_\infty = \max_i |\bar{x}(i)|$.
We use $\alpha, \beta, \gamma$ to denote finite lists of elements. For a list $\alpha = a_1, a_2, \ldots, a_k$ we write $\alpha_i$ for the $i$-th element $a_i$ of list $\alpha$ and $|\alpha|$ for its length. For two lists $\alpha$ and $\beta$ we write $\alpha \cdot \beta$ for their concatenation. The empty list is denoted by $\epsilon$.
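Probability Distributions. A finite discrete probability distribution over a countable set $Q$ is a function $\mu : Q \to [0, 1]$ such that $\sum_{q \in Q} \mu(q) = 1$ and its support set $supp(\mu) = \{q \in Q \mid \mu(q) > 0\}$ is finite. We write $\mathcal{D}(Q)$ for the set of all such distributions over $Q$.
Markov Decision Processes. Markov decision processes ([24]) are a well-studied formalism for systems exhibiting nondeterministic and probabilistic behaviour.
Definition 1. A Markov decision process (MDP) $M$ is a tuple $(S, A, p, c)$, where $S$ is a set of states, $A$ is a set of actions, $p : S \times A \to \mathcal{D}(S)$ is a partial function called the probabilistic transition function, and $c$ is the cost function, which assigns a cost $c(s, a)$ to every state $s$ and action $a$ for which $p(s, a)$ is defined.
We say that an MDP $M$ is finite (discrete) if both $S$ and $A$ are finite (countable). We write $A(s)$ for the set of actions available at $s$, i.e., the set of actions $a$ for which $p(s, a)$ is defined. In an MDP $M$, if the current state is $s$, then one of the actions in $A(s)$ is chosen nondeterministically. If the chosen action is $a$, then the probability of reaching state $s' \in S$ in the next step is $p(s, a)(s')$ and the cost incurred is $c(s, a)$.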

Branching Markov Decision Processes
We are now ready to define (multitype) BMDPs.
Definition 2. A Branching Markov Decision Process (BMDP) is a tuple $B = (P, A, p, c)$, where:
- $P$ is a finite set of types;
- $A$ is a finite set of actions;
- $p : P \times A \to \mathcal{D}(P^*)$ is a partial function called the probabilistic transition function, where every $\mathcal{D}(\cdot)$ is a finite discrete probability distribution; and
- $c$ is the cost function, which assigns a cost $c(q, a)$ to every type $q \in P$ and action $a$ for which $p(q, a)$ is defined.
We write $A(q)$ for the set of actions available to an entity of type $q \in P$, i.e., the set of actions $a$ for which $p(q, a)$ is defined. A Branching Markov Chain (BMC) is simply a BMDP with just one action available for each type.
Let us first describe informally how BMDPs evolve. A state of a BMDP $B$ is a list of elements of $P$ that we call entities. A BMDP starts at some initial configuration, $\alpha_0 \in P^*$, and the controller picks for one of the entities one of the actions available to an entity of its type. In the new configuration $\alpha_1$, this one entity is replaced by the list of new entities that it spawned. This list is picked according to the probability distribution $p(q, a)$ that depends both on the type of the entity, $q$, and the action, $a$, performed on it by the controller. The process proceeds in the same manner from $\alpha_1$, moving to $\alpha_2$, and from there to $\alpha_3$, etc. Once the empty state $\epsilon$ is reached, i.e., when no entities are present in the system, the process stays in that state forever.
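The informal description of how a BMDP evolves can be illustrated by a small simulator. The sketch below is not the paper's implementation: the dictionary encoding of the transition function, the name `run_episode`, and the `max_steps` cut-off are all assumptions made for illustration. It encodes the cloud example from the introduction and expands the first entity of the state in every step.

```python
import random

# Hypothetical encoding of a BMDP:
# (type, action) -> (cost, [(probability, offspring list), ...])
BMDP = {
    ("T", "a1"): (1.0, [(1.0, ("S", "S", "S"))]),
    ("T", "a2"): (8.0, [(1.0, ())]),
    ("S", "a1"): (1.6, [(1.0, ())]),
    ("S", "a2"): (1.0, [(0.4, ("S",)), (0.6, ())]),
}


def run_episode(bmdp, strategy, initial, rng, max_steps=10_000):
    """Simulate one run and return the accumulated cost.

    `strategy` maps each type to an action and is always applied to the
    first entity of the current state.
    """
    state = list(initial)
    total = 0.0
    for _ in range(max_steps):
        if not state:
            # Empty list: no entities are left, the process has terminated.
            break
        entity = state.pop(0)
        cost, outcomes = bmdp[(entity, strategy[entity])]
        total += cost
        # Sample the list of offspring according to p(q, a).
        offspring = rng.choices(
            [o for _, o in outcomes], weights=[p for p, _ in outcomes]
        )[0]
        # The expanded entity is replaced by the entities it spawned.
        state = list(offspring) + state
    return total
```

Averaging `run_episode(BMDP, {"T": "a1", "S": "a2"}, ("T",), rng)` over many runs gives a Monte Carlo estimate of the expected running time when the unreliable server is used for every S task.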

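Definition 3 (Semantics of BMDP). The semantics of a BMDP $B = (P, A, p, c)$ is an MDP $M_B = (States_B, Actions_B, Prob_B, Cost_B)$, where:
- $States_B = P^*$ is the set of states;
- $Actions_B = \mathbb{N} \times A$ is the set of actions, where $(i, a)$ denotes applying action $a$ to the $i$-th entity of the current state;
- $Prob_B(\alpha, (i, a))$ is defined when $1 \leq i \leq |\alpha|$ and $a \in A(\alpha_i)$, in which case $Prob_B(\alpha, (i, a))(\alpha_1 \ldots \alpha_{i-1} \cdot \beta \cdot \alpha_{i+1} \ldots \alpha_{|\alpha|}) = p(\alpha_i, a)(\beta)$ for every $\beta \in P^*$ and 0 in all other cases; and
- $Cost_B(\alpha, (i, a)) = c(\alpha_i, a)$.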
For a given BMDP B and states α ∈ States B , we denote by Actions B (α) the set of actions (i, a) ∈ Actions B , for which Prob B (α, (i, a)) is defined.
Note that our semantics of BMDPs assumes an explicit listing of all the entities in a particular order, similar to [10]. One could, instead, define the state as a multi-set or simply as a vector counting the number of occurrences of each type, as in [23]. As argued in [10], all these models are equivalent to each other. Furthermore, we assume that the controller expands a single entity of their choice at a time rather than all of them being expanded simultaneously. As argued in [32], this makes no difference for the optimal values of the expected total cost that we study in this paper, provided that all transitions' costs are positive.

Strategies
A path of a BMDP $B$ is a finite or infinite sequence $\pi = \alpha_0, ((i_1, a_1), \alpha_1), ((i_2, a_2), \alpha_2), \ldots$ consisting of the initial state and a finite or infinite sequence of action and state pairs, such that $Prob_B(\alpha_j, (i_{j+1}, a_{j+1}))(\alpha_{j+1}) > 0$ for any $0 \leq j < |\pi|$, where $|\pi|$ is the number of actions taken during path $\pi$. ($|\pi| = \infty$ if the path is infinite.) For a path $\pi$, we denote by $\pi_A(j) = (i_j, a_j)$ the $j$-th action taken along path $\pi$, by $\pi_S(j)$ $(= \alpha_j)$ the $j$-th state visited, where $\pi_S(0)$ $(= \alpha_0)$ is the initial state, and by $\pi(j)$ $(= \alpha_0, ((i_1, a_1), \alpha_1), \ldots, ((i_j, a_j), \alpha_j))$ the first $j$ action-state pairs of $\pi$.
We call a path of infinite (finite) length a run (finite path). We write Runs B (FPath B ) for the sets of all runs (finite paths) and Runs B,α (FPath B,α ) for the sets of all runs (finite paths) that start at a given initial state α ∈ States B , i.e., paths π with π S(0) = α. We write last(π) for the last state of a finite path π.
A strategy in BMDP $B$ is a function $\sigma : FPath_B \to \mathcal{D}(Actions_B)$ such that, for all $\pi \in FPath_B$, $supp(\sigma(\pi)) \subseteq Actions_B(last(\pi))$. We write $\Sigma_B$ for the set of all strategies. A strategy is called static if it always applies an action to the first entity in any state and picks the same action for all entities of the same type. A static strategy $\tau$ is essentially a function of the form $\sigma : P \to A$, i.e., for an arbitrary $\pi \in FPath_B$, we have $\tau(\pi) = (1, \sigma(last(\pi)_1))$ whenever $last(\pi) \neq \epsilon$.
A strategy $\sigma \in \Sigma_B$ and an initial state $\alpha$ induce a probability measure over the set of runs of BMDP $B$ in the following way: the basic open sets of $Runs_B$ are of the form $\pi \cdot (Actions_B \times States_B)^\omega$, where $\pi \in FPath_B$, and the measure of this open set is equal to
$$\prod_{j=0}^{|\pi|-1} \sigma(\pi(j))(\pi_A(j+1)) \cdot Prob_B(\pi_S(j), \pi_A(j+1))(\pi_S(j+1))$$
if $\pi_S(0) = \alpha$ and equal to 0 otherwise. It is a classical result of measure theory that this extends to a unique measure over all Borel subsets of $Runs_B$, and we will denote this measure by $P^\sigma_{B,\alpha}$.
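Let $f : Runs_B \to \overline{\mathbb{R}}_+$ be a function measurable with respect to $P^\sigma_{B,\alpha}$. The expected value of $f$ under strategy $\sigma$ when starting at $\alpha$ is defined as $E^\sigma_{B,\alpha}\{f\} = \int_{Runs_B} f \, dP^\sigma_{B,\alpha}$ (which can be $\infty$ even if the probability that the value of $f$ is infinite is 0). The infimum expected value of $f$ in $B$ when starting at $\alpha$ is defined as $E^*_{B,\alpha}\{f\} = \inf_{\sigma \in \Sigma_B} E^\sigma_{B,\alpha}\{f\}$. A strategy $\sigma$ is called optimal if $E^\sigma_{B,\alpha}\{f\} = E^*_{B,\alpha}\{f\}$ and ε-optimal, for ε > 0, if $E^\sigma_{B,\alpha}\{f\} \leq E^*_{B,\alpha}\{f\} + \varepsilon$. Note that ε-optimal strategies always exist by definition. We omit the subscript $B$, e.g., in $States_B$, $\Sigma_B$, etc., when the intended BMDP is clear from the context.
For a given BMDP $B$ and $N \geq 0$ we define $Total_N(\pi)$, the cumulative cost of a run $\pi$ after $N$ steps, as $Total_N(\pi) = \sum_{j=1}^{N} c(q_j, a_j)$, where $\pi_A(j) = (i_j, a_j)$ and $q_j$ is the type of the $i_j$-th entity of $\pi_S(j-1)$. For a configuration $\alpha \in States$ and a strategy $\sigma \in \Sigma$, let $ETotal_N(B, \alpha, \sigma)$ be the $N$-step expected total cost defined as $ETotal_N(B, \alpha, \sigma) = E^\sigma_{B,\alpha}\{Total_N\}$, and let the expected total cost be $ETotal^*(B, \alpha, \sigma) = \lim_{N \to \infty} ETotal_N(B, \alpha, \sigma)$. This last value can potentially be $\infty$. For each starting state $\alpha$, we compute the optimal expected cost over all strategies of a BMDP starting at $\alpha$, denoted by $ETotal^*(B, \alpha)$, i.e., $ETotal^*(B, \alpha) = \inf_{\sigma \in \Sigma} ETotal^*(B, \alpha, \sigma)$. As we are going to prove in Theorem 4.b, for any $\alpha \in States$, we have $ETotal^*(B, \alpha) = \sum_{i=1}^{|\alpha|} ETotal^*(B, \alpha_i)$. This justifies focusing on this value for initial states that consist of a single entity only, as we will do in the following section.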

Fixed Point Equations
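Following [8], we define here a linear equation system with a minimum operator whose Least Fixed Point solution yields the desired optimal values for each type of a BMDP with non-negative costs. This system generalises Bellman's equations for finite-state MDPs. We use a variable $x_q$ for each unknown $ETotal^*(B, q)$ where $q \in P$. Let $\bar{x}$ be the vector of all $x_q$, where $q \in P$. The system has one equation of the form $x_q = F_q(\bar{x})$ for each type $q \in P$, defined as
$$x_q = \min_{a \in A(q)} \Big( c(q, a) + \sum_{\alpha \in P^*} p(q, a)(\alpha) \sum_{i \leq |\alpha|} x_{\alpha_i} \Big). \qquad (\spadesuit)$$
We denote the system in vector form by $\bar{x} = F(\bar{x})$. Given a BMDP, we can easily construct its associated system in linear time. Let $\bar{c}^* \in \overline{\mathbb{R}}^n_{\geq 0}$ denote the $n$-dimensional vector of $ETotal^*(B, q)$'s where $n = |P|$. Let us define $\bar{x}^0 = \bar{0}$ and $\bar{x}^{k+1} = F^{k+1}(\bar{0}) = F(\bar{x}^k)$ for $k \geq 0$.
Theorem 4. The following hold:
(a) The map $F : \overline{\mathbb{R}}^n_{\geq 0} \to \overline{\mathbb{R}}^n_{\geq 0}$ is monotone and continuous (and so $\bar{0} \leq \bar{x}^k \leq \bar{x}^{k+1}$ for all $k \geq 0$).
(b) $\bar{c}^* = F(\bar{c}^*)$.
(c) For all $k \geq 0$, $\bar{x}^k \leq \bar{c}^*$.
(d) For any $\bar{c}' \in \overline{\mathbb{R}}^n_{\geq 0}$, if $\bar{c}' = F(\bar{c}')$ then $\bar{c}^* \leq \bar{c}'$.
(e) $\bar{c}^* = \lim_{k \to \infty} \bar{x}^k$.
Proof.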
(a) All equations in the system $F(\bar{x})$ are minima of linear functions with non-negative coefficients and constants, and hence monotonicity and continuity are preserved.
(b) It suffices to show that once action $a$ is taken when starting with a single entity $q$ and, as a result, $q$ is replaced by $\alpha$ with probability $p(q, a)(\alpha)$, then the expected total cost is equal to:
$$c(q, a) + \sum_{i \leq |\alpha|} ETotal^*(B, \alpha_i). \qquad (\clubsuit)$$
This is because then the expected total cost of picking action $a$ when at $q$ is just a weighted sum of these expressions with weights $p(q, a)(\alpha)$ for offspring $\alpha$. And finally, to optimise the cost, one would pick an action $a$ with the smallest such expected total cost, showing that $\bar{c}^* = F(\bar{c}^*)$ indeed holds. Now, to show (♣), consider an ε-optimal strategy $\sigma_i$ for a BMDP that starts at $\alpha_i$. These strategies can easily be composed into a strategy $\sigma$ that starts at $\alpha$ just by executing $\sigma_1$ first until all descendants of $\alpha_1$ die out, before moving on to $\sigma_2$, etc. If one of these strategies, $\sigma_i$, never stops executing then, due to the assumption that all costs are positive, the expected total cost when starting with $\alpha_i$ has to be infinite and so has to be the overall cost when starting with $\alpha$ (as all descendants of $\alpha_i$ have to die out before the overall process terminates), so (♣) holds. This shows that $c(q, a) + \sum_{i \leq |\alpha|} ETotal^*(B, \alpha_i)$ can be achieved when starting at $\alpha$. At the same time, we cannot do better because that would imply the existence of a strategy for one of the entities $\alpha_j$ with a better cost than its optimal cost $ETotal^*(B, \alpha_j)$.
(c) Since $\bar{x}^0 = \bar{0} \leq \bar{c}^*$ and due to (b), it follows by repeated application of $F$ to both sides of this inequality that $\bar{x}^k \leq F(\bar{c}^*) = \bar{c}^*$, for all $k \geq 0$.
(d) Consider any fixed point $\bar{c}'$ of the equation system $F(\bar{x})$. We will prove that $\bar{c}^* \leq \bar{c}'$. Let us denote by $\sigma'$ a static strategy that picks for each type an action with the minimum value of operator $F$ in $\bar{c}'$, i.e., for each entity $q$ we choose $\sigma'(q) = \arg\min_{a \in A(q)} \big( c(q, a) + \sum_{\alpha \in P^*} p(q, a)(\alpha) \sum_{i \leq |\alpha|} \bar{c}'_{\alpha_i} \big)$, where we break ties lexicographically.
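We now claim that, for all $k \geq 0$, $ETotal_k(B, q, \sigma') \leq \bar{c}'_q$ holds. For $k = 0$, this is trivial as $ETotal_0(B, q, \sigma') = 0 \leq \bar{c}'_q$. For $k > 0$, we have that
$$ETotal_k(B, q, \sigma') \leq c(q, \sigma'(q)) + \sum_{\alpha \in P^*} p(q, \sigma'(q))(\alpha) \sum_{i \leq |\alpha|} ETotal_{k-1}(B, \alpha_i, \sigma') \leq c(q, \sigma'(q)) + \sum_{\alpha \in P^*} p(q, \sigma'(q))(\alpha) \sum_{i \leq |\alpha|} \bar{c}'_{\alpha_i} = \bar{c}'_q,$$
where the second inequality uses the induction hypothesis and the final equality holds because $\sigma'(q)$ attains the minimum in $F_q(\bar{c}')$ and $\bar{c}'$ is a fixed point of $F$. Finally, for every $q \in P$, from the definition we have $\bar{c}^*_q = ETotal^*(B, q) \leq ETotal^*(B, q, \sigma') = \lim_{k \to \infty} ETotal_k(B, q, \sigma')$, and each element of the last sequence was just shown to be $\leq \bar{c}'_q$.
(e) We know that $\bar{x}^* = \lim_{k \to \infty} \bar{x}^k$ exists in $\overline{\mathbb{R}}^n_{\geq 0}$ because $(\bar{x}^k)_{k \geq 0}$ is a monotonically non-decreasing sequence (note that some entries may be infinite). In fact we have $\bar{x}^* = \lim_{k \to \infty} F^{k+1}(\bar{0}) = F(\lim_{k \to \infty} F^k(\bar{0}))$, and thus $\bar{x}^*$ is a fixed point of $F$. So from (d) we have $\bar{c}^* \leq \bar{x}^*$. At the same time, due to (c), we have $\bar{x}^k \leq \bar{c}^*$ for all $k \geq 0$, so $\bar{x}^* = \lim_{k \to \infty} \bar{x}^k \leq \bar{c}^*$ and thus $\lim_{k \to \infty} \bar{x}^k = \bar{c}^*$.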
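The iteration $\bar{x}^{k+1} = F(\bar{x}^k)$ from $\bar{x}^0 = \bar{0}$ can be sketched in a few lines of Python. This is an illustration for a known model only; the dictionary encoding and the names `bellman_F` and `value_iteration` are assumptions, and the model is the cloud example from the introduction.

```python
# Hypothetical encoding of the cloud example:
# (type, action) -> (cost, [(probability, offspring list), ...])
BMDP = {
    ("T", "a1"): (1.0, [(1.0, ("S", "S", "S"))]),
    ("T", "a2"): (8.0, [(1.0, ())]),
    ("S", "a1"): (1.6, [(1.0, ())]),
    ("S", "a2"): (1.0, [(0.4, ("S",)), (0.6, ())]),
}


def bellman_F(bmdp, x):
    """One application of F:
    x_q = min_a [ c(q,a) + sum_alpha p(q,a)(alpha) * sum_i x_{alpha_i} ].
    """
    result = {}
    for q in x:
        candidates = []
        for (typ, _action), (cost, outcomes) in bmdp.items():
            if typ == q:
                expected_children = sum(
                    p * sum(x[t] for t in offspring) for p, offspring in outcomes
                )
                candidates.append(cost + expected_children)
        result[q] = min(candidates)
    return result


def value_iteration(bmdp, types, iterations):
    """Compute x^k = F^k(0) for k = iterations."""
    x = {q: 0.0 for q in types}
    for _ in range(iterations):
        x = bellman_F(bmdp, x)
    return x
```

On this example the iterates increase monotonically and, after a handful of steps, settle at the values 5.8 for T and 1.6 for S discussed in the introduction.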
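The following is a simple corollary of Theorem 4.
Corollary 5. In every BMDP there exists an optimal static strategy, i.e., a static strategy $\sigma^*$ with $ETotal^*(B, q, \sigma^*) = ETotal^*(B, q)$ for every $q \in P$.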
Note that for a BMDP with a fixed static strategy $\sigma$ (or, equivalently, for a BMC), we have that $F(\bar{x}) = B_\sigma \bar{x} + \bar{c}_\sigma$, for some non-negative matrix $B_\sigma \in \mathbb{R}^{n \times n}_{\geq 0}$ and a positive vector $\bar{c}_\sigma > 0$ consisting of all one-step costs $c(q, \sigma(q))$. We will refer to $F$ as $F_\sigma$ in such a case and exploit this fact later in various proofs.
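We now show that $\bar{c}^*$ is in fact essentially a unique fixed point of $F$.
Theorem 6. If $F(\bar{x}) = \bar{x}$ and $\bar{x}_q < \infty$ for some $q \in P$, then $\bar{x}_q = \bar{c}^*_q$.
Proof. By Corollary 5, there exists an optimal static strategy, denoted by $\sigma^*$, which yields the finite optimal reward vector $\bar{c}^*$. We clearly have that $\bar{x} = F(\bar{x}) \leq F_{\sigma^*}(\bar{x})$, because $\sigma^*$ is just one possible pick of actions for each type rather than the minimal one as in (♠). Furthermore, as $F_{\sigma^*}$ is monotone, applying it repeatedly to both sides gives
$$\bar{x} \leq F^k_{\sigma^*}(\bar{x}) = B^k_{\sigma^*} \bar{x} + \sum_{i=0}^{k-1} B^i_{\sigma^*} \bar{c}_{\sigma^*} \quad \text{for all } k \geq 1.$$
Due to Theorem 4.d, we know that $\bar{c}^*_q \leq \bar{x}_q < \infty$, so all entries in the $q$-th row of $B^k_{\sigma^*}$ have to converge to 0 as $k \to \infty$, because otherwise the $q$-th row of $\sum_{k=0}^\infty B^k_{\sigma^*}$ would have at least one infinite value and, as a result, the $q$-th position of $\bar{c}^* = (\sum_{k=0}^\infty B^k_{\sigma^*}) \bar{c}_{\sigma^*}$ would also be infinite as all entries of $\bar{c}_{\sigma^*}$ are positive. Therefore, $\lim_{k \to \infty} (B^k_{\sigma^*} \bar{x})_q = 0$ and so
$$\bar{x}_q \leq \lim_{k \to \infty} \Big( B^k_{\sigma^*} \bar{x} + \sum_{i=0}^{k-1} B^i_{\sigma^*} \bar{c}_{\sigma^*} \Big)_q = \bar{c}^*_q.$$
Together with $\bar{c}^*_q \leq \bar{x}_q$, this gives $\bar{x}_q = \bar{c}^*_q$. The proof is now complete.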

Q-learning
We next discuss the applicability of Q-learning to the computation of the fixed point defined in the previous section. Q-learning [30] is a well-studied model-free RL approach to compute an optimal strategy for discounted rewards. Q-learning computes so-called Q-values for every state-action pair. Intuitively, once Q-learning has converged to the fixed point, $Q(s, a)$ is the optimal reward the agent can get while performing action $a$ after starting at $s$. The Q-values can be initialised arbitrarily, but ideally they should be close to the actual values. Q-learning learns over a number of episodes, each consisting of a sequence of actions with bounded length. An episode can terminate early if a sink-state or another non-productive state is reached. Each episode starts at the designated initial state $s_0$. The Q-learning process moves from state to state of the MDP using one of its available actions and accumulates rewards along the way. Suppose that in the $i$-th step, the process has reached state $s_i$. It then either performs the currently (believed to be) optimal action (so-called exploitation option) or, with probability ε, picks uniformly at random one of the actions available at $s_i$ (so-called exploration option). Either way, if $a_i$, $r_i$, and $s_{i+1}$ are the action picked, the reward observed and the state the process moved to, respectively, then the Q-value is updated as follows:
$$Q_{i+1}(s_i, a_i) = (1 - \lambda_i) Q_i(s_i, a_i) + \lambda_i \big( r_i + \gamma \cdot \max_a Q_i(s_{i+1}, a) \big),$$
where $\lambda_i \in\ ]0, 1[$ is the learning rate and $\gamma \in\ ]0, 1]$ is the discount factor. Note the model-freeness: this update does not depend on the set of transitions nor their probabilities. For all other pairs $s, a$ we have $Q_{i+1}(s, a) = Q_i(s, a)$, i.e., they are left unchanged. Watkins and Dayan showed the convergence of Q-learning [30].
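Theorem 7 (Convergence [30]). For $\gamma < 1$, bounded rewards $r_i$ and learning rates $0 \leq \lambda_i < 1$ satisfying $\sum_{i=0}^\infty \lambda_i = \infty$ and $\sum_{i=0}^\infty \lambda_i^2 < \infty$, we have that $Q_i(s, a) \to Q(s, a)$ as $i \to \infty$ for all $(s, a) \in S \times A$ almost surely if all $(s, a)$ pairs are visited infinitely often.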
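A single step of the classical update rule for reward-maximising Q-learning can be written as a small function. This is a minimal sketch: the dictionary representation of the Q-table, the `actions` callback, and the convention that a state without actions contributes 0 are assumptions made for illustration.

```python
def q_update(Q, s, a, r, s_next, actions, lam, gamma):
    """Apply one Q-learning update in place and return the new Q(s, a).

    Q       -- dict mapping (state, action) to the current estimate
    actions -- function returning the actions available in a state
    lam     -- learning rate, gamma -- discount factor
    """
    # Best estimate at the successor state; 0.0 if it has no actions
    # (assumed convention for sink states).
    best_next = max((Q[(s_next, b)] for b in actions(s_next)), default=0.0)
    # Only the visited pair (s, a) changes; all other entries stay as they are.
    Q[(s, a)] = (1 - lam) * Q[(s, a)] + lam * (r + gamma * best_next)
    return Q[(s, a)]
```

Note that the function only needs the observed reward and successor state, never the transition probabilities, which is what makes the method model-free.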
However, in the total reward setting, which corresponds to Q-learning with discount factor $\gamma = 1$, Q-learning may not converge, or may converge to incorrect values. It is nevertheless guaranteed to work for finite-state MDPs in the setting of undiscounted total reward with a target sink-state under the assumption that all strategies reach that sink-state almost surely. The assumption that we make instead is that every transition of a BMDP incurs a positive cost. This guarantees that a process that does not terminate almost surely generates an infinite expected reward, in which case Q-learning will converge (or rather diverge) to ∞, so our results generalise these existing results for Q-learning.
We adapt the Q-learning algorithm to minimise cost as follows. Each episode starts at the designated initial state $q_0 \in P$. The Q-learning process moves from state to state of the BMDP using one of its available actions and accumulates costs along the way. Suppose that, in the $i$-th step, the process has reached state $\alpha$. It then selects uniformly at random one of the entities of $\alpha$, e.g., the $j$-th one, $\alpha_j$, and either performs the currently (believed to be) optimal action or, with probability ε, picks an action uniformly at random among all the actions available for $\alpha_j$. If $a_i$ is the action picked and $c$ and $\beta$ denote the observed cost and the entities spawned by this action, respectively, then the Q-value of the pair $(\alpha_j, a_i)$ is updated as follows:
$$Q_{i+1}(\alpha_j, a_i) = (1 - \lambda_i) Q_i(\alpha_j, a_i) + \lambda_i \Big( c + \sum_{k \leq |\beta|} \min_{a' \in A(\beta_k)} Q_i(\beta_k, a') \Big),$$
and all other Q-values are left unchanged. In the next section we show that Q-learning almost surely converges (diverges) to the optimal finite (respectively, infinite) value of $\bar{c}^*$ under rather mild conditions.
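The adapted algorithm can be sketched as follows. This is an illustrative toy version, not the Mungojerrie implementation mentioned in the summary: the dictionary encoding of the model (used here only as a black-box simulator), the per-pair learning rate 1/n, the seed, and all names are assumptions.

```python
import random
from collections import defaultdict

# Hypothetical simulator for the cloud example:
# (type, action) -> (cost, [(probability, offspring list), ...])
BMDP = {
    ("T", "a1"): (1.0, [(1.0, ("S", "S", "S"))]),
    ("T", "a2"): (8.0, [(1.0, ())]),
    ("S", "a1"): (1.6, [(1.0, ())]),
    ("S", "a2"): (1.0, [(0.4, ("S",)), (0.6, ())]),
}


def actions_of(bmdp, q):
    """Actions available to an entity of type q."""
    return [a for (t, a) in bmdp if t == q]


def q_learning(bmdp, initial_type, episodes, epsilon=0.1, seed=0, max_steps=1000):
    """Cost-minimising Q-learning over (type, action) pairs."""
    rng = random.Random(seed)
    Q = {key: 0.0 for key in bmdp}
    visits = defaultdict(int)
    for _ in range(episodes):
        state = [initial_type]
        for _ in range(max_steps):
            if not state:
                break  # empty state reached, episode ends
            j = rng.randrange(len(state))  # entity chosen uniformly at random
            q = state.pop(j)
            acts = actions_of(bmdp, q)
            if rng.random() < epsilon:
                a = rng.choice(acts)  # exploration
            else:
                a = min(acts, key=lambda b: Q[(q, b)])  # exploitation
            # Observe cost and offspring; the learner never reads probabilities.
            cost, outcomes = bmdp[(q, a)]
            beta = rng.choices(
                [o for _, o in outcomes], weights=[p for p, _ in outcomes]
            )[0]
            target = cost + sum(
                min(Q[(t, b)] for b in actions_of(bmdp, t)) for t in beta
            )
            # Assumed learning-rate schedule: 1/n per pair (the first update
            # overwrites the arbitrary initial value).
            visits[(q, a)] += 1
            lam = 1.0 / visits[(q, a)]
            Q[(q, a)] = (1 - lam) * Q[(q, a)] + lam * target
            state[j:j] = beta  # offspring replace the expanded entity
    return Q
```

For example, `q_learning(BMDP, "T", 10000)` returns a table whose entry for `("T", "a1")` should approach the optimal value of the example as the number of episodes grows.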

Convergence of Q-Learning for BMDPs
We show almost sure convergence of the Q-learning to the optimal valuesc * in a number of stages. We first focus on the case when all optimal values inc * are finite. In such a case, we show a weak convergence of the expected optimal values for BMCs to the unique fixed-pointc * , as defined in Section 3. To establish this, we show that the expected Q-values are monotonically decreasing (increasing) if we start with Q-values κc * for κ > 1 (κ < 1). This convergence from above and below gives us convergence in expectation using the squeeze theorem.
We then establish almost sure convergence toc * by proving a contraction argument, with the extra assumption that the selection of the Q-value to update is done independently at random in each step.
In the next step, we extend this result to BMDPs, first establishing that Q-learning will almost surely converge to the region of the Q-values less than or equal to $\bar{c}^*$. We then show that, when considering the pointwise limit inferior of the sequences of Q-values, there is no point in that region such that every ε-ball around it has a non-zero probability to be represented in the limit inferior. This establishes that $\bar{c}^*$ is the fixed point the Q-values converge to.
Only at the very end, we show that Q-learning also converges (or rather diverges) to the optimal value even if that value happens to be infinite. We then turn to a type with non-finite optimal value and provide an argument for the divergence to ∞ of its corresponding Q-value.
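We assume that all the Q-values are stored in a vector $Q$ of size $|P| \cdot |A|$. We also use $Q(q, a)$ to refer to the entry for type $q \in P$ and action $a \in A(q)$. We introduce the target for $Q$ operator, $T$, that maps a Q-values vector $Q$ to:
$$T(Q)(q, a) = c(q, a) + \sum_{\alpha \in P^*} p(q, a)(\alpha) \sum_{i \leq |\alpha|} \min_{a' \in A(\alpha_i)} Q(\alpha_i, a').$$
We call $T$ the 'target', because, when the $Q(q, a)$ value is updated, then
$$E(Q_{i+1}(q, a)) = (1 - \lambda_i) Q_i(q, a) + \lambda_i T(Q_i)(q, a).$$
Thus, when $Q(q, a)$ is selected for update with a chance of $p_{q,a}$, we have that
$$E(Q_{i+1}(q, a)) = (1 - p_{q,a} \lambda_i) Q_i(q, a) + p_{q,a} \lambda_i T(Q_i)(q, a).$$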

Convergence for BMCs with finitec *
Since BMCs have only one action, we omit mentioning it for ease of notation. Note that for BMCs, the target for the Q-values is a simple affine function:
$$T(Q)(q) = c(q) + \sum_{\alpha \in P^*} p(q)(\alpha) \sum_{i \leq |\alpha|} Q(\alpha_i),$$
and it coincides with operator $F$ as defined in Section 3. Therefore, due to Theorem 6, $T(Q)$ has a unique fixed point, which is $\bar{c}^*$. Moreover, $T(Q) = BQ + \bar{c}$, where $B$ is a non-negative matrix and $\bar{c}$ is a vector of one-step costs $c(q)$, which are all positive. Naturally, applying $T$ to a non-negative vector $Q$ or multiplying it by $B$ is monotone: $Q \geq Q'$ implies $T(Q) \geq T(Q')$ and $BQ \geq BQ'$. Also, as $T$ is affine, $E(T(Q)) = T(EQ)$ holds, where $Q$ is a random vector.
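The affine form $T(Q) = BQ + \bar{c}$ can be made concrete on a small BMC. The sketch below is illustrative only: it uses the cloud example with the unreliable server fixed for every S task, orders the types as [T, S], and takes the entry of $B$ in row $q$ and column $q'$ to be the expected number of offspring of type $q'$ of an entity of type $q$; the function names are assumptions.

```python
def apply_T(B, c, Q):
    """Affine target operator T(Q) = B Q + c for a BMC (plain nested lists)."""
    return [sum(b * q for b, q in zip(row, Q)) + ci for row, ci in zip(B, c)]


def iterate_T(B, c, Q, steps):
    """Apply T repeatedly, starting from Q."""
    for _ in range(steps):
        Q = apply_T(B, c, Q)
    return Q


# BMC obtained from the cloud example when every S task uses the
# unreliable server. Type order: [T, S].
B = [
    [0.0, 3.0],  # T spawns three S tasks
    [0.0, 0.4],  # S respawns itself with probability 0.4
]
c = [1.0, 1.0]   # one-step costs of T and S
```

Iterating from the zero vector and from a multiple of the fixed point approaches the same vector, here 6 for T and 1/0.6 for S, from below and from above respectively.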
We now start with a lemma describing the behaviour of Q-learning for initial Q-values when they happen to be equal to κc * for some κ ≥ 1.
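Lemma 8. Let $Q_0 = \kappa \bar{c}^*$ for a scalar factor $\kappa \geq 1$. Then the following holds for all $i \in \mathbb{N}$, assuming that the Q-value to be updated in each step is selected independently at random: $\bar{c}^* \leq T(EQ_i) \leq EQ_{i+1} \leq EQ_i$.
Proof. We show this by induction over $i$. For the induction basis ($i = 0$), we have $T(\kappa \bar{c}^*) = \kappa B \bar{c}^* + \bar{c} = \kappa (B \bar{c}^* + \bar{c}) - (\kappa - 1) \bar{c} = \kappa \bar{c}^* - (\kappa - 1) \bar{c} \leq \kappa \bar{c}^*$, and, by monotonicity, $\bar{c}^* = T(\bar{c}^*) \leq T(Q_0)$. As every entry of $EQ_1$ is a convex combination $(1 - p_q \lambda_0) Q_0(q) + p_q \lambda_0 T(Q_0)(q)$ of the corresponding entries of $Q_0$ and $T(Q_0)$, $\bar{c}^* \leq T(EQ_0) \leq EQ_1 \leq EQ_0$ follows.
For the induction step ($i \to i + 1$), we use the induction hypothesis $\bar{c}^* \leq T(EQ_i) \leq EQ_{i+1} \leq EQ_i$. The monotonicity of $T$ then provides $T(EQ_{i+1}) \leq T(EQ_i) \leq EQ_{i+1}$. With $T(\bar{c}^*) = \bar{c}^*$ (from the fixed point equations) and the induction hypothesis, $\bar{c}^* \leq T(E(Q_{i+1})) \leq E(Q_{i+1})$ follows. Using $E(Q_{i+2})(q) = (1 - p_q \lambda_{i+1}) E(Q_{i+1})(q) + p_q \lambda_{i+1} T(E(Q_{i+1}))(q)$, $T(EQ_{i+1}) \leq EQ_{i+2} \leq EQ_{i+1}$ holds, completing the induction step.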
By simply replacing all ≤ with ≥ in the above proof, we can get the following for all initial Q-values that happen to be κc * where κ ≤ 1: Lemma 9. Let Q 0 = κc * for a scalar factor κ ∈ [0, 1]. Then the following holds for all i ∈ N, assuming that the Q-value to update in each step is selected independently at random:c * ≥ T (EQ i ) ≥ EQ i+1 ≥ EQ i .
We now first establish that the distance between Q andc * can be upper bounded by the distance between Q and T (Q) with a fixed linear factor µ > 0. Lemma 10. There exists a constant µ > 0 such that Proof. We show this for κ > 1. The proof for κ < 1 is similar, and there is nothing to show for κ = 1.
We first consider the linear programme with a variable for each type with the following constraints for some fixed $\delta > 0$: $Q \geq \bar{c}^*$, $T(Q) \leq Q$, and $\sum_{q \in P} Q(q) = \sum_{q \in P} \bar{c}^*(q) + \delta$.
An example solution to this constraint system is $Q = \big(1 + \frac{\delta}{\sum_{q \in P} \bar{c}^*(q)}\big) \bar{c}^*$. We then find a solution minimising the objective $\sum_{q \in P} |(Q - T(Q))(q)|$, noting that all entries are non-negative due to the first constraint. This is expressed by adding $2|P|$ constraints and minimising $\sum_{q \in P} x_q$.
As $\bar{c}^*$ is the only fixed point of $T$, and $\sum_{q \in P} Q(q) = \sum_{q \in P} \bar{c}^*(q) + \delta$ implies that, for an optimal solution $Q^*$, $Q^* \neq \bar{c}^*$, we have that $\sum_{q \in P} |(Q^* - T(Q^*))(q)| > 0$. Due to the constraint $Q \geq \bar{c}^*$, we always have $Q = \bar{c}^* + Q_\Delta$ for some $Q_\Delta > \bar{0}$.
We can now re-formulate this linear programme to look for $Q_\Delta$ instead of $Q$: $Q_\Delta \geq \bar{0}$, $B Q_\Delta \leq Q_\Delta$, and $\sum_{q \in P} Q_\Delta(q) = \delta$, with the objective to minimise $\sum_{q \in P} |(Q_\Delta - B Q_\Delta)(q)|$.
The optimal solution $Q^*_\Delta$ to this linear programme gives an optimal value $Q^* = \bar{c}^* + Q^*_\Delta$ for the former and, vice versa, the value $Q^*$ for the former provides an optimal solution $Q^* - \bar{c}^*$ for the latter, and these two solutions have the same value in their respective objective function.
Thus, while the former constraint system is convenient to show that the value of the objective function is positive, the latter constraint system is, except for $\sum_{q \in P} Q_\Delta(q) = \delta$, linear. This means that any optimal solution for $\delta = \delta_1$ can be obtained from the optimal solution for $\delta = \delta_2$ just by rescaling it by $\delta_1 / \delta_2$. It follows that the optimal value of the objective function is linear in $\delta$, i.e., there exists $\mu > 0$ such that its value is $\mu \delta$.
We now show that the sequence of Q-values updates converges in expectation toc * when Q 0 = κc * . Lemma 11. Let Q 0 = κc * where κ ≥ 0. Then, assuming that each type-action pair is selected for update with a minimal probability p min in each step, and that ∞ i=0 λ i = ∞, then lim i→∞ EQ i =c * holds.
Proof. We prove this for $\kappa \geq 1$. A similar proof shows this for any $\kappa \in [0, 1]$. Lemma 8 provides that all $EQ_i$ satisfy the constraints $EQ_i \geq \bar{c}^*$ and $T(E(Q_i)) \leq EQ_i$.
Let p min be the smallest probability any Q-value is selected with in each update step. Due to Lemma 10, there is a fixed constant µ > 0 such that By taking the expected value of both sides and the fact thatc * ≤ T (EQ i ) ≤ EQ i+1 ≤ EQ i due to Lemma 8, we get and finally just by rearranging these terms we get Note that all summands are positive by Lemma 8. With Theorem 12. When each Q-value is selected for an update with a minimal probability p min in each step, and ∞ i=0 λ i = ∞, then lim i→∞ EQ i =c * holds for every starting Q-values Q 0 ≥0.
Proof. We first note that none of the entries of $\vec{c}^*$ can be 0. This implies that there is a scalar factor $\kappa \geq 0$ such that $\vec{0} \leq Q_0 \leq \kappa\vec{c}^*$. As the $Q_i$ are monotone in the entries of $Q_0$, and as the property holds for $Q_0 = \vec{0} = 0 \cdot \vec{c}^*$ and $Q_0 = \kappa\vec{c}^*$ by Lemma 11, the squeeze theorem implies that it also holds for $Q_0$.
Convergence of the expected value is a weaker property than expected convergence, which also explains why our assumptions are weaker than in Theorem 7. With the common assumption of sufficiently fast falling learning rates, $\sum_{i=0}^{\infty} \lambda_i^2 < \infty$, we will now argue that the pointwise limes inferior of the sequence of Q-values almost surely converges to $\vec{c}^*$. This will later allow us to infer convergence of the actual sequence of Q-values to $\vec{c}^*$.
Theorem 13. When each Q-value is selected for update with a minimal probability $p_{\min}$ in each step, then $\lim_{i \to \infty} Q_i = \vec{c}^*$ holds almost surely for every starting Q-values $Q_0 \geq \vec{0}$.
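The squeeze argument can be illustrated numerically with a minimal sketch. It iterates the expected update deterministically, assuming the affine operator $T(Q) = BQ + \vec{c}$, a uniform selection probability `p_min` that is independent of the current values, and learning rates $\lambda_i = (i+1)^{-0.75}$. The two-type matrix `B`, the costs `c`, and all function names are hypothetical choices made for this illustration only.

```python
def T(Q, B, c):
    """Affine BMC operator T(Q) = B Q + c on plain Python lists."""
    n = len(c)
    return [c[q] + sum(B[q][r] * Q[r] for r in range(n)) for q in range(n)]


def expected_q_iteration(Q0, B, c, steps, p_min=0.5):
    """Iterate E[Q_{i+1}] = E[Q_i] + p_min * lambda_i * (T(E[Q_i]) - E[Q_i])."""
    Q = list(Q0)
    for i in range(steps):
        lam = 1.0 / (i + 1) ** 0.75   # assumed schedule: the sum of lambda_i diverges
        TQ = T(Q, B, c)
        Q = [Q[q] + p_min * lam * (TQ[q] - Q[q]) for q in range(len(Q))]
    return Q


# Hypothetical two-type BMC: B[q][r] is the expected number of type-r offspring of q.
B = [[0.5, 0.25],
     [0.0, 0.5]]
c = [1.0, 1.0]
c_star = [3.0, 2.0]   # solves c* = B c* + c for this particular B and c

from_above = expected_q_iteration([3 * v for v in c_star], B, c, 20000)  # kappa = 3
from_zero = expected_q_iteration([0.0, 0.0], B, c, 20000)                # kappa = 0
```

Any starting table between $\vec{0}$ and $3\vec{c}^*$ is sandwiched between these two runs, because the update is monotone in the starting values.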
Proof. We assume for contradiction that, for some $\hat{Q} \neq \vec{c}^*$, there is a non-zero chance of a sequence $\{Q_i\}_{i \in \mathbb{N}_0}$ such that $\|\hat{Q} - \liminf_{i \to \infty} Q_i\|_\infty < \varepsilon$ for all $\varepsilon > 0$, and there is a type $q$ such that $\hat{Q}(q) < T(\hat{Q})(q)$.
Then there must be an $\varepsilon > 0$ such that $\hat{Q}(q) + 3\varepsilon < T(\hat{Q} - 2\varepsilon \cdot \vec{1})(q)$. We fix such an $\varepsilon > 0$. By assumption, the probability of $\|\hat{Q} - \liminf_{i \to \infty} Q_i\|_\infty < \varepsilon$ is positive. Then, in particular, the chance that $\liminf_{i \to \infty} Q_i > \hat{Q} - \varepsilon \cdot \vec{1}$ and $\liminf_{i \to \infty} Q_i < \hat{Q} + \varepsilon \cdot \vec{1}$ hold at the same time is positive.
Thus, there is a positive chance that the following holds: there exists an $n_\varepsilon$ such that, for all $i > n_\varepsilon$, $Q_i \geq \hat{Q} - 2\varepsilon \cdot \vec{1}$. Thus, the expected limit value of $Q_i(q)$ is at least $\hat{Q}(q) + 3\varepsilon$, for every tail of the update sequence. Now, we can use $\hat{Q} - 2\varepsilon \cdot \vec{1}$ as a bound on the estimates used in the updates of Q-learning, as $Q_i \geq \hat{Q} - 2\varepsilon \cdot \vec{1}$ holds. At the same time, the variation of the sum of the updates goes to 0 when $\sum_{i=0}^{\infty} \lambda_i^2$ is bounded. Therefore, it cannot be that $\liminf_{i \to \infty} Q_i < \hat{Q} + \varepsilon \cdot \vec{1}$ holds; a contradiction.
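We note that if, for Q-values $Q \geq \vec{0}$, there is a $q' \in P$ with $Q(q') < \vec{c}^*(q')$, then there is a $q \in P$ with $Q(q) < T(Q)(q)$ and $Q(q) < \vec{c}^*(q)$. This is because, for the Q-values $Q'$ with $Q'(q) = \min\{Q(q), \vec{c}^*(q)\}$ for all $q \in P$, $Q' < \vec{c}^*$. Thus, there must be a type $q \in P$ such that $\kappa = Q'(q)/\vec{c}^*(q) < 1$ is minimal, and $Q' \geq \kappa\vec{c}^*$. As we have shown before, $T(\kappa\vec{c}^*) = \kappa\vec{c}^* - (\kappa - 1)\vec{c}$, such that the following holds, using that $T$ is monotone and $\vec{c}$ is positive: $T(Q)(q) \geq T(Q')(q) \geq T(\kappa\vec{c}^*)(q) = \kappa\vec{c}^*(q) + (1 - \kappa)\vec{c}(q) > \kappa\vec{c}^*(q) = Q'(q) = Q(q)$. Thus, we have that $\liminf_{i \to \infty} Q_i \geq \vec{c}^*$ holds almost surely. With $\lim_{i \to \infty} \mathbb{E}Q_i = \vec{c}^*$, it follows that $\lim_{i \to \infty} Q_i = \vec{c}^*$ holds almost surely.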
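As an illustration of Theorem 13, the following sketch runs the sampled Q-value update on a hypothetical one-type BMC. The offspring distribution, the cost, the learning-rate schedule $\lambda_i = (i+1)^{-0.75}$ (chosen so that $\sum \lambda_i = \infty$ and $\sum \lambda_i^2 < \infty$), and the function name are all assumptions made for this example, not taken from the implementation discussed later.

```python
import random


def q_learning_bmc(q0, steps, seed=0):
    """Q-learning for a hypothetical one-type BMC.

    The single type pays cost 1 per step and spawns two copies of itself
    with probability 0.25, or no offspring otherwise (0.5 expected offspring),
    so the expected total cost is c* = 1 / (1 - 0.5) = 2.
    """
    rng = random.Random(seed)
    q = q0
    for i in range(steps):
        lam = 1.0 / (i + 1) ** 0.75          # assumed learning-rate schedule
        offspring = 2 if rng.random() < 0.25 else 0
        target = 1.0 + offspring * q          # sampled cost plus value of all offspring
        q = (1.0 - lam) * q + lam * target
    return q


estimate_from_zero = q_learning_bmc(0.0, 200_000)
estimate_from_ten = q_learning_bmc(10.0, 200_000, seed=7)
```

Note that the target multiplies the current estimate by the number of offspring, so the disturbance is not bounded a priori; this is the feature that separates the branching setting from the standard one.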

Convergence for BMDPs and finite $\vec{c}^*$
We start with showing that, for BMDPs, the pointwise limes superior of each sequence is almost surely less than or equal to $\vec{c}^*$. We then proceed to show that the limes inferior of a sequence is almost surely $\vec{c}^*$, which together implies almost sure convergence.

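Lemma 14. When each Q-value of a BMDP is selected for update with a minimal probability $p_{\min}$ in each step, $\sum_{i=0}^{\infty} \lambda_i = \infty$, and $\sum_{i=0}^{\infty} \lambda_i^2 < \infty$, then $\limsup_{i \to \infty} Q_i \leq \vec{c}^*$ holds almost surely for every starting Q-values $Q_0 \geq \vec{0}$.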
Proof. To show the property for the limes superior, we fix an optimal static strategy $\sigma^*$ that exists due to Corollary 5.
We define a BMC obtained by replacing each type $q$ in the BMDP with $A(q) = \{a_1, \ldots, a_k\}$ by $k$ types $(q, a_1), \ldots, (q, a_k)$ with one action each, where each spawned type $q'$ is replaced by the type-action pair $(q', \sigma^*(q'))$.
It is easy to see that a type $(q, \sigma^*(q))$ of the resulting BMC has the same value as the type $q$ and the type-action pair $(q, \sigma^*(q))$ in the BMDP that we started with.
When identifying these corresponding type-action pairs, we can look at the same sampling for the BMDP and the BMC, leading to sequences $Q_0, Q_1, Q_2, \ldots$ and $Q'_0, Q'_1, Q'_2, \ldots$, respectively, where $Q'_0 = Q_0$.
It is easy to see by induction that $Q_i \leq Q'_i$. Considering that $\{Q'_i\}_{i \in \mathbb{N}}$ almost surely converges to $\vec{c}^*$ by Theorem 13, we obtain our result.
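The coupling in this proof can be sketched in code. The toy BMDP below, its costs and probabilities, the strategy `SIGMA`, and the learning-rate schedule are hypothetical; the sketch only illustrates that, when both learners see the same samples, the BMDP table (which minimises over actions) never exceeds the table of the BMC induced by a static strategy.

```python
import random

# Hypothetical BMDP: (type, action) -> (cost, [(probability, offspring types), ...])
BMDP = {
    ("s", "split"): (1.0, [(0.3, ["s", "t"]), (0.7, [])]),
    ("s", "wait"):  (2.0, [(0.5, ["t"]), (0.5, [])]),
    ("t", "go"):    (1.0, [(0.4, ["t"]), (0.6, [])]),
}
SIGMA = {"s": "split", "t": "go"}   # static strategy defining the induced BMC


def sample_offspring(rng, dist):
    """Draw one offspring list from [(probability, offspring), ...]."""
    u, acc = rng.random(), 0.0
    for prob, kids in dist:
        acc += prob
        if u < acc:
            return kids
    return dist[-1][1]


def coupled_run(steps, q_init=5.0, seed=1):
    """Run BMDP and induced-BMC Q-learning on the same samples.

    Returns both final tables and whether Q_i <= Q'_i held after every step.
    """
    rng = random.Random(seed)
    pairs = sorted(BMDP)
    q_mdp = {p: q_init for p in pairs}   # Q_i  (BMDP, minimises over actions)
    q_bmc = dict(q_mdp)                  # Q'_i (BMC induced by SIGMA), Q'_0 = Q_0
    dominated = True
    for i in range(steps):
        lam = 1.0 / (i + 1) ** 0.75      # assumed learning-rate schedule
        pair = rng.choice(pairs)         # every pair is selected with probability 1/3
        cost, dist = BMDP[pair]
        kids = sample_offspring(rng, dist)
        # BMDP target: each offspring contributes its minimal Q-value.
        t_mdp = cost + sum(min(q_mdp[p] for p in pairs if p[0] == k) for k in kids)
        # BMC target: each offspring k is the single type (k, SIGMA[k]).
        t_bmc = cost + sum(q_bmc[(k, SIGMA[k])] for k in kids)
        q_mdp[pair] = (1 - lam) * q_mdp[pair] + lam * t_mdp
        q_bmc[pair] = (1 - lam) * q_bmc[pair] + lam * t_bmc
        dominated = dominated and all(q_mdp[p] <= q_bmc[p] + 1e-12 for p in pairs)
    return q_mdp, q_bmc, dominated


q_mdp, q_bmc, dominated = coupled_run(5000)
```

The domination check does not depend on `SIGMA` being optimal; optimality is only needed so that the dominating BMC sequence converges to $\vec{c}^*$ rather than to a larger value.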
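Theorem 15. When each Q-value of a BMDP is selected for update with a minimal probability $p_{\min}$ in each step, $\sum_{i=0}^{\infty} \lambda_i = \infty$, and $\sum_{i=0}^{\infty} \lambda_i^2 < \infty$, then $\lim_{i \to \infty} Q_i = \vec{c}^*$ holds almost surely for every starting Q-values $Q_0 \geq \vec{0}$.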
Proof. As a first simple corollary from Lemma 14, we get the same result for the limes inferior (as lim inf ≤ lim sup must hold).
We now assume for contradiction that, for some vector $\hat{Q} < \vec{c}^*$, there is a non-zero chance of a sequence $\{Q_i\}_{i \in \mathbb{N}}$ such that $\|\hat{Q} - \liminf_{i \to \infty} Q_i\|_\infty < \varepsilon$ for all $\varepsilon > 0$.
Thus, the expected limit value of $Q_i(q, \sigma^*(q))$ is at least $\hat{Q}(q, \sigma^*(q)) + 3\varepsilon$, for every tail of the update sequence. Now, we can use $T(\hat{Q} - 2\varepsilon \cdot \vec{1})(q, \sigma^*(q))$ as a bound on the estimation of $T(Q)(q, \sigma^*(q))$ during the update of the Q-value of the type-action pair $(q, \sigma^*(q))$. At the same time, the variation of the sum of the updates goes to 0 when $\sum_{i=0}^{\infty} \lambda_i^2$ is bounded. Therefore, it cannot be that $\liminf_{i \to \infty} Q_i(q, \sigma^*(q)) < \hat{Q}(q, \sigma^*(q)) + \varepsilon$ holds; a contradiction.

Divergence
We now show divergence of the Q-values to $\infty$ when at least one of the entries of $\vec{c}^*$ is infinite. First, due to Theorem 6 and its proof, we have that $\vec{c}^* = \sum_{i=0}^{\infty} B^i \vec{c}$ for some non-negative $B$ and positive $\vec{c}$. Therefore $\vec{c}^*$ is monotonic in $B$ for BMCs. Likewise, the value of $\vec{c}^*$ for a BMDP depends only on the cost function and the expected number of successors of each type spawned: two BMDPs with the same cost functions and the same expected numbers of successors have the same fixed point $\vec{c}^*$. Thus, if a type $q$ with one action spawns either exactly one $q'$ or exactly one $q''$ with a chance of 50% each, or if it spawns 10 successors of type $q'$ and another 10 of type $q''$ with a chance of 5%, while dying without offspring with a chance of 95%, both lead to identical matrices $B$ and so to the same $\vec{c}^*$ (though this difference may impact the performance of Q-learning).
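The arithmetic behind this example can be checked directly. In the sketch below, `q1` and `q2` stand for the two successor types; the helper names and the use of exact fractions are choices made for this illustration.

```python
from fractions import Fraction as F


def expected_offspring(dist):
    """Expected number of offspring per type for [(probability, {type: count}), ...]."""
    mean = {}
    for prob, kids in dist:
        for kind, count in kids.items():
            mean[kind] = mean.get(kind, 0) + prob * count
    return mean


def offspring_variance(dist, kind):
    """Variance of the number of offspring of one type."""
    m1 = sum(prob * kids.get(kind, 0) for prob, kids in dist)
    m2 = sum(prob * kids.get(kind, 0) ** 2 for prob, kids in dist)
    return m2 - m1 ** 2


# The two offspring distributions described in the text.
coin_flip = [(F(1, 2), {"q1": 1}), (F(1, 2), {"q2": 1})]
rare_burst = [(F(5, 100), {"q1": 10, "q2": 10}), (F(95, 100), {})]
```

Both distributions yield an expected 1/2 successor of each type, hence the same row of $B$, while the per-type variance differs (1/4 versus 19/4), which is one plausible reason for the performance difference mentioned above.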
Naturally, raising the expected number of successors of any type for any type-action pair strictly raises $\vec{c}^*$, while lowering it reduces $\vec{c}^*$, and for every set of expected numbers, the value of $\vec{c}^*$ is either finite or infinite.
Let us consider a set of parameters at the fringe of finite vs. infinite $\vec{c}^*$, and let us choose them pointwise not larger than the parameters from the BMC or BMDP under consideration. As the fixed point from Section 3 is clearly growing continuously in the parameter values, this set of expected successors leads to a $\vec{c}^*$ which is not finite.
We now look at the family of parameter values that lead to $\alpha \in [0, 1[$ times the expected successors from our chosen parameters at the fringe between finite and infinite values, and refer to it as the $\alpha$-BMDP. Let also $\vec{c}^*_\alpha$ denote the fixed point for the reduced parameters. As the solution to the fixed point grows continuously, so does $\vec{c}^*_\alpha$. Moreover, if $\vec{c}^*_1 = \lim_{\alpha \to 1} \vec{c}^*_\alpha$ were finite, then $\vec{c}^*$ would be finite as well, because then $\vec{c}^*_1 = \vec{c}^*$. Clearly, for all parameters $\alpha \in [0, 1[$, the Q-values of an $\alpha$-BMC or $\alpha$-BMDP converge to $\vec{c}^*_\alpha$. Thus, the Q-values for the BMC or BMDP we have started with converge to a value which is at least $\sup_{\alpha \in [0,1[} \vec{c}^*_\alpha$. As this is not a finite value, Q-learning diverges to $\infty$.
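For intuition, consider the simplest possible instance of this scaling argument: a hypothetical one-type BMC with cost 1 whose fringe parameter is an expected offspring number of exactly 1. Its $\alpha$-version has the fixed point $\vec{c}^*_\alpha = \sum_{i \geq 0} \alpha^i = 1/(1 - \alpha)$, which the sketch below evaluates; the function name and parameters are assumptions of this example.

```python
def c_star_alpha(alpha, cost=1.0, fringe_mean=1.0):
    """Fixed point cost / (1 - alpha * fringe_mean) of a one-type alpha-BMC."""
    m = alpha * fringe_mean
    if not 0.0 <= m < 1.0:
        raise ValueError("the geometric series only converges for 0 <= alpha * mean < 1")
    return cost / (1.0 - m)


alphas = [0.5, 0.9, 0.99, 0.999]
values = [c_star_alpha(a) for a in alphas]   # grows without bound as alpha -> 1
```

Each `values` entry is finite, but the sequence has no finite supremum, mirroring the lower bound $\sup_{\alpha \in [0,1[} \vec{c}^*_\alpha$ used in the argument.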

Experimental Results
We implemented the algorithm described in the previous section in the formal reinforcement learning tool Mungojerrie [21], a C++-based tool which reads BMDPs described in an extension of the PRISM language [18]. The tool provides an interface for RL algorithms akin to that of [3] and invokes a linear programming tool (GLOP) [22] to compute the optimal expected total cost based on the optimality equations (♠).

Benchmark Suite
The BMDPs on which we tested Q-learning are listed in Table 1. For each model, the numbers of types in the BMDP are given. Table 1 also shows the total cost as computed by the LP solver, which has full access to the BMDP. This is followed by the estimate of the total cost computed by Q-learning and the time taken by learning. The learner has several hyperparameters: $\epsilon$ is the exploration rate, $\alpha$ is the learning rate, and tol is the tolerance for Q-values to be considered different when selecting an optimal strategy. Finally, ep-l is the maximum episode length and ep-n is the number of episodes. The last two columns of Table 1 report the values of ep-l and ep-n when they deviate from the default values. All performance data are the averages of three trials with Q-learning. Since costs are undiscounted, the value of a state-action pair computed by Q-learning is a direct estimate of the optimal total cost from that state when taking that action.
Models cloud1 and cloud2 are based on the motivating example given in the introduction. Examples bacteria1 and bacteria2 model the population dynamics of a family of two bacteria [28] subject to two treatments. The objective is to determine which treatment results in the minimum expected cost to extinction of the bacteria population. The protein example is based on a stochastic Petri net description [19] of protein synthesis, with entities corresponding to active and inactive genes and proteins. The example frozenSmall [3] is similar to the classical frozen lake example, except that one of the holes results in the process branching into two entities. Entities that fall in the target cell become extinct. The objective is to determine a strategy that results in a minimum number of steps before extinction. Finally, the remaining 5 examples are randomly created BMDP instances.

Conclusion
We study the total reward optimisation problem for branching decision processes with unknown probability distributions, and give the first reinforcement learning algorithm to compute an optimal policy. Extending Q-learning is hard, even for branching processes, because they lack a central property of the standard convergence proof: as the value range of the Q-table is not a priori bounded for a given starting table Q 0 , the variation of the disturbance is not bounded. This looks like a more substantial obstacle than the one Q-learning faces when maximising undiscounted rewards for finite-state MDPs, and it is well known that this defeats Q-learning. So it is quite surprising that we could not only show that Q-learning works for branching processes, but extend these results to branching decision processes, too. Finally, in the previous section, we have demonstrated that our Q-learning algorithm works well on examples of reasonable size even with default hyperparameters, so it is ready to be applied in practice without the need for excessive hyperparameter tuning.