Variance-constrained actor-critic algorithms for discounted and average reward MDPs
Abstract
In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance-related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms that operate on three timescales—a TD critic on the fastest timescale, a policy gradient (actor) on the intermediate timescale, and a dual ascent for Lagrange multipliers on the slowest timescale. In the discounted setting, we point out the difficulty in estimating the gradient of the variance of the return and incorporate simultaneous perturbation approaches to alleviate this. The average setting, on the other hand, allows for an actor update using compatible features to estimate the gradient of the variance. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.
Keywords
Markov decision process (MDP) · Reinforcement learning (RL) · Risk-sensitive RL · Actor-critic algorithms · Multi-timescale stochastic approximation · Simultaneous perturbation stochastic approximation (SPSA) · Smoothed functional (SF)

1 Introduction
The usual optimization criteria for an infinite horizon Markov decision process (MDP) are the expected sum of discounted rewards and the average reward (Puterman 1994; Bertsekas 1995). Many algorithms have been developed to maximize these criteria, both when the model of the system is known (planning) and unknown (learning) (Bertsekas and Tsitsiklis 1996; Sutton and Barto 1998). These algorithms can be categorized into value function-based methods, which are mainly based on the two celebrated dynamic programming algorithms, value iteration and policy iteration; and policy gradient methods, which are based on updating the policy parameters in the direction of the gradient of a performance measure, i.e., the value function of the initial state or the average reward. Policy gradient methods estimate the gradient of the performance measure either without using an explicit representation of the value function (e.g., Williams 1992; Marbach 1998; Baxter and Bartlett 2001) or using such a representation, in which case they are referred to as actor-critic algorithms (e.g., Sutton et al. 2000; Konda and Tsitsiklis 2000; Peters et al. 2005; Bhatnagar et al. 2007, 2009a). Using an explicit representation of the value function (e.g., linear function approximation) in actor-critic algorithms reduces the variance of the gradient estimate, at the cost of adding a bias.
Actor-critic methods were among the earliest to be investigated in RL (Barto et al. 1983; Sutton 1984). They comprise a family of reinforcement learning (RL) methods that maintain two distinct algorithmic components: an actor, whose role is to maintain and update an action-selection policy; and a critic, whose role is to estimate the value function associated with the actor's policy. Thus, the critic addresses a problem of prediction, whereas the actor is concerned with control. A common practice is to update the policy parameters using stochastic gradient ascent, and to estimate the value function using some form of temporal difference (TD) learning (Sutton 1988).
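The critic's prediction role can be illustrated with a minimal sketch (a hypothetical two-state deterministic chain under a fixed policy, not the paper's algorithm): tabular TD(0) driven by the TD error, whose learned values can be checked against the exact solution of the Bellman equations.

```python
# Tabular TD(0) on a hypothetical deterministic two-state chain:
# state 0 -> state 1 (reward 1), state 1 -> state 0 (reward 0).
P = {0: 1, 1: 0}            # next-state map under the fixed policy
R = {0: 1.0, 1: 0.0}        # expected rewards
gamma = 0.9

V = [0.0, 0.0]              # critic's (tabular) value estimates
step = 0.1                  # critic step-size
x = 0
for _ in range(20000):
    x_next = P[x]
    td_error = R[x] + gamma * V[x_next] - V[x]   # temporal difference error
    V[x] += step * td_error
    x = x_next

# Bellman equations: V(0) = 1 + 0.9 V(1), V(1) = 0.9 V(0),
# hence V(0) = 1 / (1 - 0.81).
print(V)
```

The same prediction loop underlies the TD-critics used later, with a linear function approximator in place of the table.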
However, in many applications we may prefer to minimize some measure of risk as well as maximize a usual optimization criterion. In such cases, we would like to use a criterion that incorporates a penalty for the variability induced by a given policy. This variability can be due to two types of uncertainties: (i) uncertainties in the model parameters, which is the topic of robust MDPs (e.g., Nilim and Ghaoui 2005; Delage and Mannor 2010; Xu and Mannor 2012), and (ii) the inherent uncertainty related to the stochastic nature of the system, which is the topic of risk-sensitive MDPs (e.g., Howard and Matheson 1972; Sobel 1982; Filar et al. 1989).
In risk-sensitive sequential decision-making, the objective is to maximize a risk-sensitive criterion such as the expected exponential utility (Howard and Matheson 1972), a variance-related measure (Sobel 1982; Filar et al. 1989), the percentile performance (Filar et al. 1995), or conditional value-at-risk (CVaR) (Ruszczyński 2010; Shen et al. 2013). Unfortunately, when we include a measure of risk in our optimality criteria, the corresponding optimal policy is usually no longer Markovian stationary (e.g., Filar et al. 1989) and/or computing it is not tractable (e.g., Filar et al. 1989; Mannor and Tsitsiklis 2011). In particular, (i) in Sobel (1982), the author analyzed variance constraints in the context of a discounted reward MDP and showed the existence of a Bellman equation for the variance of the return. However, it was established there that the operator underlying the aforementioned Bellman equation is not necessarily monotone, which is a crucial requirement for employing popular dynamic programming procedures for solving MDPs. (ii) In Mannor and Tsitsiklis (2013), the authors provide hardness results for variance-constrained MDPs and in particular show that finding a globally mean–variance optimal policy in a discounted MDP is NP-hard, even when the underlying transition dynamics are known. (iii) In Filar et al. (1989), the authors established hardness results for the average reward MDP, with a variance constraint that differs significantly from its counterpart in the discounted setting. Nevertheless, this variance constraint is well motivated, considering that the objective is to optimize a long-run average reward. However, the mathematical difficulties in finding a globally mean–variance optimal policy remain, even with this altered variance constraint.
Although risk-sensitive sequential decision-making has a long history in operations research and finance, it has only recently grabbed attention in the machine learning community. Most of the work on this topic (including those mentioned above) has been in the context of MDPs (when the model of the system is known) and much less work has been done within the reinforcement learning (RL) framework (when the model is unknown and all the information about the system is obtained from the samples resulting from the agent's interaction with the environment). In risk-sensitive RL, we can mention the work by Borkar (2001, 2002, 2010) and Basu et al. (2008), who considered the expected exponential utility; the one by Mihatsch and Neuneier (2002), which formulated a new risk-sensitive control framework based on transforming the temporal difference errors that occur during learning; and the one by Tamar et al. (2012) on several variance-related measures. Tamar et al. (2012) study stochastic shortest path problems, and in this context, propose a policy gradient algorithm [and in a more recent work (Tamar and Mannor 2013) an actor-critic algorithm] for maximizing several risk-sensitive criteria that involve both the expectation and variance of the return random variable (defined as the sum of the rewards that the agent obtains in an episode).
In this paper,^{1} we develop actor-critic algorithms for optimizing variance-related risk measures in both discounted and average reward MDPs. In the following, we first summarize our contributions in the discounted reward setting and then those in the average reward setting.
Note that we operate in a simulation optimization setting, i.e., we have access to reward samples from the underlying MDP. Thus, we must estimate the mean and variance of the return (we use a TD-critic for this purpose) and then use these estimates to compute the gradient of the Lagrangian, which is in turn used to descend in the policy parameter. We estimate the gradient of the Lagrangian using two simultaneous perturbation methods: simultaneous perturbation stochastic approximation (SPSA) (Spall 1992) and smoothed functional (SF) (Katkovnik and Kulchitsky 1972), resulting in two separate discounted reward actor-critic algorithms. In addition, we also propose second-order algorithms with a Newton step, using both SPSA and SF.
Simultaneous perturbation methods have been popular in the field of stochastic optimization and the reader is referred to Bhatnagar et al. (2013) for a textbook introduction. First introduced in Spall (1992), the idea of SPSA is to perturb each coordinate of a parameter vector uniformly using a Rademacher random variable, in the quest for finding the minimum of a function that is only observable via simulation. Traditional gradient schemes require \(2\kappa _1\) evaluations of the function, where \(\kappa _1\) is the parameter dimension. On the other hand, SPSA requires only two evaluations irrespective of the parameter dimension and hence is an efficient scheme, especially useful in high-dimensional settings. While a one-simulation variant of SPSA was proposed in Spall (1997), the original two-simulation SPSA algorithm is preferred as it is more efficient and also seen to work better than its one-simulation variant. Later enhancements to the original SPSA scheme include deterministic perturbations based on certain Hadamard matrices (Bhatnagar et al. 2003) and second-order methods that estimate the Hessian using SPSA (Spall 2000; Bhatnagar 2005). The SF schemes are another class of simultaneous perturbation methods, which again perturb each coordinate of the parameter vector uniformly. However, unlike SPSA, Gaussian random variables are used here for the perturbation. Originally proposed in Katkovnik and Kulchitsky (1972), the SF schemes have been studied and enhanced in later works such as Styblinski and Opalski (1986) and Bhatnagar (2007). Further, Bhatnagar et al. (2011) proposes both SPSA- and SF-like schemes for constrained optimization.
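As a sketch of the two-simulation SPSA idea (on a toy quadratic objective rather than a simulated system), all coordinates are perturbed at once by Rademacher variables, so only two function evaluations are needed regardless of the dimension:

```python
import random

def spsa_gradient(f, theta, beta):
    """Two-simulation SPSA estimate of the gradient of f at theta.
    All coordinates are perturbed simultaneously by Rademacher (+/-1)
    variables, so only two evaluations of f are needed regardless of
    the dimension of theta."""
    delta = [random.choice([-1.0, 1.0]) for _ in theta]
    f_plus = f([t + beta * d for t, d in zip(theta, delta)])
    f_minus = f([t - beta * d for t, d in zip(theta, delta)])
    diff = (f_plus - f_minus) / (2.0 * beta)
    return [diff / d for d in delta]

# Toy check: f(theta) = sum_i theta_i^2 has gradient 2 * theta.
f = lambda th: sum(t * t for t in th)
random.seed(0)
n = 5000
avg = [0.0, 0.0]
for _ in range(n):
    g = spsa_gradient(f, [1.0, -2.0], beta=0.01)
    avg = [a + gi / n for a, gi in zip(avg, g)]
print(avg)   # close to [2.0, -4.0]
```

The cross-coordinate terms in a single estimate are zero-mean noise; averaging many estimates (as the actor's slower timescale effectively does) recovers the true gradient.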
Proof of convergence Using the ordinary differential equations (ODE) approach, we establish the asymptotic convergence of our algorithms to locally risk-sensitive optimal policies; in light of the hardness results of Mannor and Tsitsiklis (2013), this is the best one can hope to achieve. Our algorithms employ multi-timescale stochastic approximation in both settings. The convergence proof proceeds by analysing each timescale separately. In essence, the iterates on a faster timescale view those on a slower timescale as quasi-static, while the slower timescale iterates view those on a faster timescale as equilibrated. Using this principle, we show that the TD-critic (on the fastest timescale in all the algorithms) converges to the fixed point of the Bellman operator, for any fixed policy \(\theta \) and Lagrange multiplier \(\lambda \). Next, for any given \(\lambda \), the policy update tracks the corresponding ODE in the asymptotic limit and converges to its equilibria. Finally, the \(\lambda \) updates on the slowest timescale converge, and the overall convergence is to a local saddle point of the Lagrangian. Moreover, the limiting point is feasible for the constrained optimization problem mentioned above, i.e., the policy obtained upon convergence satisfies the constraint that the variance is upper-bounded by \(\alpha \).
Simulation experiments We demonstrate the usefulness of our discounted and average reward risk-sensitive actor-critic algorithms in a traffic signal control application. On this high-dimensional system with state space \(\approx \)10\(^{32}\), the objective in our formulation is to minimize the total number of vehicles in the system, which indirectly minimizes the delay experienced by the system. The motivation behind using a risk-sensitive control strategy is to reduce the variations in the delay experienced by road users. From the results, we observe that the risk-sensitive algorithms proposed in this paper result in a long-term (discounted or average) cost that is higher than that of their risk-neutral variants. However, from the perspective of the empirical variance of the cost (both discounted and average), the risk-sensitive algorithms outperform their risk-neutral variants. Moreover, the experiments in the discounted setting also show that our SPSA-based actor-critic scheme outperforms the policy gradient algorithm proposed in Tamar et al. (2012), from both mean–variance and gradient estimation standpoints. This observation justifies using the actor-critic approach for solving risk-sensitive MDPs, as it reduces the variance of the gradient estimated by the policy gradient approach at the cost of introducing a bias induced by the value function representation.
Remark 1
It is important to note that both our discounted and average reward algorithms can be easily extended to other variance-related risk criteria such as the Sharpe ratio, which is popular in financial decision-making (Sharpe 1966) (see Remarks 3, 7 for more details).
Remark 2
 (i)
we optimize for the Lagrange multiplier \(\lambda \), which plays a role similar to \(\beta \) as a trade-off between the mean and variance, and
 (ii)
it is usually more natural to know an upper bound on the variance (as in the mean–variance formulations considered in this paper) than to know the ideal trade-off between the mean and variance (as considered in the expected exponential utility formulation).
 (i)
The authors develop policy gradient and actor-critic methods for stochastic shortest path problems in Tamar et al. (2012) and Tamar and Mannor (2013), respectively. On the other hand, we devise actor-critic algorithms for both discounted and average reward MDP settings; and
 (ii)
More importantly, we note the difficulty in the discounted formulation that requires estimating the gradient of the value function at every state of the MDP and also sampling from two different distributions. This precludes us from using compatible features, a method that has been employed successfully in actor-critic algorithms in a risk-neutral setting (cf. Bhatnagar et al. 2009a) as well as more recently in Tamar and Mannor (2013) for a risk-sensitive stochastic shortest path setting. We alleviate the above-mentioned problems in the discounted setting by employing simultaneous perturbation based schemes for estimating the gradient in the first-order methods and the Hessian in the second-order methods that we propose.
 (iii)
Unlike Tamar et al. (2012) and Tamar and Mannor (2013), who consider a fixed \(\lambda \) in their constrained formulations, we perform dual ascent using sample variance constraints and optimize the Lagrange multiplier \(\lambda \). In rigorous terms, \(\lambda _n\) in our algorithms is shown to converge to a local maximum of \(\nabla _\lambda L(\theta ^{\lambda },\lambda )\) (here \(\theta ^\lambda \) is the limit of the \(\theta \) recursion for a given value of \(\lambda \)), and the limit \(\lambda ^*\) is such that the variance constraint is satisfied for the corresponding policy \(\theta ^{\lambda ^*}\).
2 Preliminaries
We consider sequential decision-making tasks that can be formulated as a reinforcement learning (RL) problem. In RL, an agent interacts with a dynamic, stochastic, and incompletely known environment, with the goal of optimizing some measure of its long-term performance. This interaction is often modeled as a Markov decision process (MDP). An MDP is a tuple \(({\mathcal {X}},{\mathcal {A}},R,P,x^0)\), where \({\mathcal {X}}\) and \({\mathcal {A}}\) are the state and action spaces; \(R(x,a)\), \(x\in {\mathcal {X}}\), \(a\in {\mathcal {A}}\), is the reward random variable whose expectation is denoted by \(r(x,a)={\mathbb {E}}\big [R(x,a)\big ]\); \(P(\cdot \mid x,a)\) is the transition probability distribution; and \(x^0 \in {\mathcal {X}}\) is the initial state.^{2} We assume that both state and action spaces are finite.
The rule according to which the agent acts in its environment (i.e., selects an action at each state) is called a policy. A Markovian stationary policy \(\mu (\cdot \mid x)\) is a probability distribution over actions, conditioned on the current state x. The goal in an RL problem is to find a policy that optimizes the long-term performance measure of interest, e.g., maximizes the expected discounted sum of rewards or the average reward.
In policy gradient and actor-critic methods, we define a class of parameterized stochastic policies \(\big \{\mu (\cdot \mid x;\theta ),x\in {\mathcal {X}},\theta \in \varTheta \subseteq {\mathbb {R}}^{\kappa _1}\big \}\), estimate the gradient of the performance measure w.r.t. the policy parameters \(\theta \) from the observed system trajectories, and then improve the policy by adjusting its parameters in the direction of the gradient. Here \(\varTheta \) denotes a compact and convex subset of \({\mathbb {R}}^{\kappa _1}\). Our algorithms project the iterates onto \(\varTheta \), which ensures stability, a crucial requirement for establishing convergence. Since in this setting a policy \(\mu \) is represented by its \(\kappa _1\)-dimensional parameter vector \(\theta \), policy-dependent functions can be written as functions of \(\theta \) in place of \(\mu \). So, we use \(\mu \) and \(\theta \) interchangeably in the paper.

(A1) For any state-action pair \((x,a)\in {\mathcal {X}}\times {\mathcal {A}}\), the policy \(\mu (a\mid x;\theta )\) is continuously differentiable in the parameter \(\theta \).

(A2) The Markov chain induced by any policy \(\theta \) is irreducible.
Finally, we denote by \(d^\mu (x)\) and \(\pi ^\mu (x,a)=d^\mu (x)\mu (a\mid x)\) the stationary distributions of state x and state-action pair (x, a) under policy \(\mu \), respectively. The stationary distributions can be seen to exist because we consider a finite state-action space setting, and irreducibility here implies positive recurrence. Similarly, in the discounted formulation, we define the \(\gamma \)-discounted visiting distributions of state x and state-action pair (x, a) under policy \(\mu \) as \(d^\mu _\gamma (x\mid x^0)=(1-\gamma )\sum _{n=0}^\infty \gamma ^n\Pr (x_n=x\mid x_0=x^0;\mu )\) and \(\pi ^\mu _\gamma (x,a\mid x^0)=d^\mu _\gamma (x\mid x^0)\mu (a\mid x)\).
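To make the discounted visiting distribution concrete, the following sketch (hypothetical two-state transition matrix under a fixed policy) accumulates \(d^\mu_\gamma(\cdot\mid x^0)\) by propagating the state distribution forward in time:

```python
# Discounted visiting distribution on a hypothetical two-state chain.
P = [[0.5, 0.5],
     [0.2, 0.8]]            # P[i][j] = Pr(next = j | current = i)
gamma = 0.9
x0 = 0

# d_gamma(x | x0) = (1 - gamma) * sum_n gamma^n Pr(x_n = x | x_0 = x0),
# accumulated term by term up to a truncation horizon.
dist = [1.0 if x == x0 else 0.0 for x in range(2)]   # Pr(x_n = . | x0)
d_gamma = [0.0, 0.0]
coeff = 1.0 - gamma
for _ in range(500):
    d_gamma = [d + coeff * p for d, p in zip(d_gamma, dist)]
    dist = [sum(dist[i] * P[i][j] for i in range(2)) for j in range(2)]
    coeff *= gamma
print(d_gamma)   # a probability distribution over the two states
```

The truncation error is of order \(\gamma^{500}\), so the result is numerically exact; one can verify that the computed vector satisfies the defining linear relation \(d - \gamma P^{\mathsf T} d = (1-\gamma) e_{x^0}\).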
3 Discounted reward setting
 1.
\(\min _\theta \varLambda ^\theta (x^0) \quad \) subject to \(\quad V^\theta (x^0)\ge \alpha \),
 2.
\(\max _\theta V^\theta (x^0)-\alpha \sqrt{\varLambda ^\theta (x^0)}\),
 3.
Maximizing the Sharpe ratio, i.e., \(\;\max _\theta V^\theta (x^0)/\sqrt{\varLambda ^\theta (x^0)}\). The Sharpe ratio (SR) is a popular risk measure in financial decision-making (Sharpe 1966). Remark 3 presents extensions of our proposed discounted reward algorithms to optimize the Sharpe ratio.
However, in our setting, the Lagrangian \(L(\theta ,\lambda )\) is not necessarily convex in \(\theta \), which implies there may not be a unique saddle point. The problem is further complicated by the fact that we operate in a simulation optimization setting, i.e., only sample estimates of the Lagrangian are obtained. Hence, by performing primal descent and dual ascent, one can only get to a local saddle point, i.e., a tuple \((\theta ^*, \lambda ^*)\) that is a local minimum of the Lagrangian w.r.t. \(\theta \) and a local maximum w.r.t. \(\lambda \). As an aside, global mean–variance optimization of MDPs has been shown to be NP-hard in Mannor and Tsitsiklis (2013), and the best one can hope for is to find an approximately optimal policy.
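The primal-descent/dual-ascent scheme can be sketched on a hypothetical deterministic toy problem (exact gradients in place of the sample estimates of the Lagrangian used by our algorithms):

```python
# Toy deterministic analogue of primal descent / dual ascent:
# minimize f(theta) = theta^2 subject to 1 - theta <= 0, via the
# Lagrangian L(theta, lam) = theta^2 + lam * (1 - theta).
theta, lam = 0.0, 0.0
step_theta, step_lam = 0.05, 0.005    # lam moves on the slower timescale
for _ in range(20000):
    theta -= step_theta * (2.0 * theta - lam)   # descent in theta
    lam += step_lam * (1.0 - theta)             # ascent in lam
    lam = min(max(lam, 0.0), 100.0)             # project onto [0, lam_max]
print(theta, lam)   # saddle point: theta = 1, lam = 2
```

The iterates settle at the saddle point where the constraint is active: \(\theta^*=1\) (so the constraint holds with equality) and \(\lambda^*=2\) (so \(2\theta^*-\lambda^*=0\)).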
Lemma 1
Proof
 1.
two different sampling distributions, \(\pi ^\theta _\gamma \) and \(\widetilde{\pi }^\theta _\gamma \), are used for \(\nabla V^\theta (x^0)\) and \(\nabla U^\theta (x^0)\), and
 2.
\(\nabla V^\theta (x^{\prime })\) appears in the second sum of \(\nabla U^\theta (x^0)\) equation, which implies that we need to estimate the gradient of the value function \(V^\theta \) at every state of the MDP, and not just at the initial state \(x^0\).
4 Discounted reward risk-sensitive actor-critic algorithms
In this section, we present actor-critic algorithms for optimizing the risk-sensitive measure (3). These algorithms are based on two simultaneous perturbation methods: simultaneous perturbation stochastic approximation (SPSA) and smoothed functional (SF).
4.1 Algorithm structure

An inner loop that descends in \(\theta \) using the gradient of the Lagrangian \(L(\theta ,\lambda )\) w.r.t. \(\theta \), and

An outer loop that ascends in \(\lambda \) using the gradient of the Lagrangian \(L(\theta ,\lambda )\) w.r.t. \(\lambda \).

\(A_n\) is a positive definite matrix that fixes the order of the algorithm. For the first order methods, \(A_n=I\) (I is the identity matrix), while for the second order methods \(A_n \rightarrow \nabla ^2_\theta L(\theta _n,\lambda _n)\) as \(n \rightarrow \infty \).

\(\varGamma \) is a projection operator that keeps the iterate \(\theta _n\) stable by projecting onto a compact and convex set \(\varTheta := \prod _{i=1}^{\kappa _1} [\theta ^{(i)}_{\min },\theta ^{(i)}_{\max }]\). In particular, for any \(\theta \in {\mathbb {R}}^{\kappa _1}\), \(\varGamma (\theta ) = (\varGamma ^{(1)}(\theta ^{(1)}),\ldots , \varGamma ^{(\kappa _1)}(\theta ^{(\kappa _1)}))^T\), with \(\varGamma ^{(i)}(\theta ^{(i)}):= \min (\max (\theta ^{(i)}_{\min },\theta ^{(i)}),\theta ^{(i)}_{\max })\).

\(\varGamma _\lambda \) is a projection operator that keeps the Lagrange multiplier \(\lambda _n\) within the interval \([0,\lambda _{\max }]\), for some large positive constant \(\lambda _{\max } < \infty \) and can be defined in an analogous fashion as \(\varGamma \).

\(\zeta _1(n), \zeta _2(n)\) are step-sizes selected such that the \(\theta \) update is on the faster and the \(\lambda \) update is on the slower timescale. Note that another step-size \(\zeta _3(n)\), on the fastest timescale, is used for the TD-critic, which provides the estimate of the Lagrangian for a given \((\theta ,\lambda )\).
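The projection operator \(\varGamma\) defined above is simply componentwise clipping onto a box; a minimal sketch:

```python
def project(theta, lo, hi):
    """Componentwise projection of theta onto the box
    prod_i [lo[i], hi[i]], i.e., the operator Gamma above."""
    return [min(max(l, t), h) for t, l, h in zip(theta, lo, hi)]

# Hypothetical bounds [0, 1] per coordinate.
print(project([2.5, -1.0, 0.3], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0]))
# -> [1.0, 0.0, 0.3]
```

\(\varGamma_\lambda\) is the scalar special case with the interval \([0, \lambda_{\max}]\).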
 1.
First order This corresponds to \(A_n = I\) in (6). The proposed algorithms here include RS-SPSA-G and RS-SF-G, where the former estimates the gradient using SPSA, while the latter uses SF. These algorithms use the following choice for the perturbation vector: \(p_n=\beta _n\varDelta _n\). Here \(\beta _n>0\) is a positive constant and \(\varDelta _n\) is a perturbation random variable, i.e., a \(\kappa _1\)-vector of independent Rademacher (for SPSA) or Gaussian \({\mathcal {N}}(0,1)\) (for SF) random variables.
 2.
Second order This corresponds to \(A_n\) converging to \(\nabla ^2_\theta L(\theta _n,\lambda _n)\) as \(n\rightarrow \infty \). The proposed algorithms here include RS-SPSA-N and RS-SF-N, where the former uses SPSA for the gradient/Hessian estimates and the latter employs SF for the same. These algorithms use the following choice for the perturbation vector: for RS-SPSA-N, \(p_n=\beta _n\varDelta _n + \beta _n\widehat{\varDelta }_n\), where \(\beta _n>0\) is a positive constant and \(\varDelta _n\) and \(\widehat{\varDelta }_n\) are \(\kappa _1\)-vectors of independent Rademacher random variables; for RS-SF-N, \(p_n=\beta _n\varDelta _n\), where \(\varDelta _n\) is a \(\kappa _1\)-vector of Gaussian \({\mathcal {N}}(0,1)\) random variables.
The overall flow of our proposed actor-critic algorithms is illustrated in Fig. 1 and Algorithm 1. The overall operation involves the following two loops: at each time instant n,
 (1)
Unperturbed simulation For \(m=0,1,\ldots ,m_n\), take action \(a_m\sim \mu (\cdot x_m;\theta _n)\), observe the reward \(R(x_m,a_m)\), and the next state \(x_{m+1}\) in the first trajectory.
 (2)
Perturbed simulation For \(m=0,1,\ldots ,m_n\), take action \(a^+_m\sim \mu (\cdot x^+_m;\theta _n^+)\), observe the reward \(R(x^+_m,a^+_m)\), and the next state \(x^+_{m+1}\) in the second trajectory.
Outer loop (actor update) Estimate the gradient/Hessian of \(\widehat{V}^{\theta }(x^0)\) and \(\widehat{U}^{\theta }(x^0)\), and hence the gradient/Hessian of the Lagrangian \(L(\theta ,\lambda )\), using either the SPSA (17) or SF (18) method. Using these estimates, update the policy parameter \(\theta \) in the descent direction using either a gradient or a Newton decrement, and the Lagrange multiplier \(\lambda \) in the ascent direction.
In the next section, we describe the TD-critic and subsequently, in Sects. 4.3 and 4.4, present the first- and second-order actor-critic algorithms, respectively.
4.2 TD-critic

(A3) The basis functions \(\{\phi _v^{(i)}\}_{i=1}^{\kappa _2}\) and \(\{\phi _u^{(i)}\}_{i=1}^{\kappa _3}\) are linearly independent. In particular, \(\kappa _2,\kappa _3\ll n\) and \(\varPhi _v\) and \(\varPhi _u\) are full rank. Moreover, for every \(v\in {\mathbb {R}}^{\kappa _2}\) and \(u\in {\mathbb {R}}^{\kappa _3}\), \(\varPhi _vv\ne e\) and \(\varPhi _uu\ne e\), where e is the n-dimensional vector with all entries equal to one.
We now claim that the projected Bellman operator \(\varPi T\) is a contraction mapping w.r.t. the \(\nu \)-weighted norm, for any policy \(\theta \).
Lemma 2
Proof
See Sect. 7.1. \(\square \)
We now describe the TD algorithm that updates the critic parameters corresponding to the value and square value functions (note that we require critic estimates for both the unperturbed and the perturbed policy parameters). This algorithm is an extension of the algorithm proposed by Tamar et al. (2013b) to the discounted setting. Recall from Algorithm 1 that, at any instant n, the TD-critic runs two \(m_n\)-length trajectories corresponding to the policy parameters \(\theta _n\) and \(\theta _n + \delta \varDelta _n\).
4.2.1 Convergence rate
Let \(\nu _{\min } = \min (\nu _v,\nu _u)\), where \(\nu _v\) and \(\nu _u\) are minimum eigenvalues of \(\varPhi _v^\mathsf {\scriptscriptstyle T}{\varvec{D}}^\theta \varPhi _v\) and \(\varPhi _u^\mathsf {\scriptscriptstyle T}{\varvec{D}}^\theta \varPhi _u\), respectively. Recall that \({\varvec{D}}^\theta \) denotes the stationary distribution of the underlying policy \(\theta \). From (A2), (A3) and the fact that we consider finite statespaces, we have that \(\nu _{\min } > 0\).
From recent results in Korda and Prashanth (2015) that provide non-asymptotic bounds for TD(0) with function approximation, we know that the canonical \(O(m^{-1/2})\) rate can be achieved under an appropriate choice of the step-size \(\zeta _3(m)\). The following rate result is crucial in setting the trajectory lengths \(m_n\) and relating them to the perturbation constants \(\beta _n\) [see (A4) in the next section]:
Theorem 1
Proof
The first claim follows directly from Theorem 1 in Korda and Prashanth (2015), while the second claim can be proven in an analogous manner as the first. \(\square \)
The above rate result holds only if the step-size is set using \(\nu _{\min }\), and the latter quantity is unknown in a typical RL setting. However, a standard trick to overcome this dependence while obtaining the same convergence rate is to employ iterate averaging, proposed independently by Polyak and Juditsky (1992) and Ruppert (1991). The latter approach involves using a larger step-size \( \varTheta (1/n^{\varsigma _1})\) with \(\varsigma _1 \in (1/2,1)\) and coupling this with averaging of the iterates. An iterate-averaged variant of Theorem 1 can be stated, and we refer the reader to Theorem 2 of Korda and Prashanth (2015) for further details.
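A minimal illustration of iterate averaging (on a toy stochastic-approximation problem, estimating a mean from noisy samples, rather than the TD-critic itself): the raw iterate uses the larger step-size \(1/n^{0.75}\), with exponent in \((1/2, 1)\), and the average of the iterates is reported.

```python
import random

# Robbins-Monro iteration for the mean of noisy observations with a
# larger step-size 1/n^0.75, combined with Polyak-Ruppert averaging.
random.seed(1)
theta = 0.0       # raw stochastic-approximation iterate
avg = 0.0         # running average of the iterates
for n in range(1, 50001):
    sample = 3.0 + random.gauss(0.0, 1.0)     # noisy observation, mean 3
    theta += (sample - theta) / n ** 0.75     # no problem constants needed
    avg += (theta - avg) / n                  # average of iterates
print(avg)   # close to 3.0
```

The averaged iterate attains the canonical rate without the step-size having to be tuned to an unknown constant such as \(\nu_{\min}\).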
4.3 First-order algorithms: RS-SPSA-G and RS-SF-G
 (i)
\(\beta _n \ge 0\) and vanish asymptotically [see (A4) below for the precise condition];
 (ii)
\(\varDelta _n^{(i)}\)’s are independent Rademacher and Gaussian \({\mathcal {N}}(0,1)\) random variables in SPSA and SF updates, respectively;
 (iii)
\(\varGamma \) and \(\varGamma _\lambda \) are projection operators that keep the iterates \((\theta _n,\lambda _n)\) stable and were defined in Sect. 4.1. These projection operators are necessary to ensure convergence of the algorithms.
4.3.1 Choosing the trajectory length \(m_n\), perturbation constants \(\beta _n\) and step-sizes \(\zeta _3(n),\zeta _2(n), \zeta _1(n)\)
(A4) The step-size schedules \(\{\zeta _2(n)\}\) and \(\{\zeta _1(n)\}\) satisfy
$$\begin{aligned}&\zeta _2(n),\; \beta _n \rightarrow 0, \quad \frac{1}{\sqrt{m_n}\,\beta _n}\rightarrow 0, \end{aligned}$$
(22)
$$\begin{aligned}&\sum _n \zeta _1(n) = \sum _n \zeta _2(n) = \infty , \end{aligned}$$
(23)
$$\begin{aligned}&\sum _n \zeta _1(n)^2, \;\; \sum _n \frac{\zeta _2(n)^2}{\beta _n^2} <\infty , \end{aligned}$$
(24)
$$\begin{aligned}&\zeta _1(n) = o\big (\zeta _2(n)\big ). \end{aligned}$$
(25)
Equation (22) is motivated by a similar condition in Prashanth et al. (2016) and ensures that the bias from a finite-length (\(m_n\)) trajectory run of the TD-critic can be ignored. A simple setting that ensures (22) is to have \(m_n = C_1 n^{\varsigma _2}\) and \(\beta _n = C_2 n^{-\varsigma _3}\), where \(C_1, C_2\) are constants and \(\varsigma _2, \varsigma _3 >0\) with \(\varsigma _3 < \varsigma _2/2\). This ensures that the trajectories increase in length as a function of the outer loop index n, at a rate that is sufficient to cancel the bias induced by the TD-critic. Lemma 6 in Sect. 7 makes this claim precise, in particular justifying the need for (22) in (A4).
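A quick numerical sanity check of condition (22), under the hypothetical choice \(m_n = C_1 n^{\varsigma_2}\), \(\beta_n = C_2 n^{-\varsigma_3}\) with \(\varsigma_3 < \varsigma_2/2\) (so that \(\sqrt{m_n}\,\beta_n\) grows and its reciprocal vanishes):

```python
# Check that 1 / (sqrt(m_n) * beta_n) -> 0 for a hypothetical choice
# m_n = C1 * n^s2, beta_n = C2 * n^(-s3) with 0 < s3 < s2 / 2.
C1, C2 = 10.0, 1.0
s2, s3 = 1.0, 0.25

def ratio(n):
    m_n = C1 * n ** s2
    beta_n = C2 * n ** (-s3)
    return 1.0 / (m_n ** 0.5 * beta_n)

vals = [ratio(10 ** k) for k in range(1, 6)]
print(vals)   # monotonically decreasing toward zero
```

Here the ratio behaves as \(n^{\varsigma_3 - \varsigma_2/2} = n^{-1/4}\) (up to a constant), confirming the claimed decay.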
We provide a proof of convergence of the first-order SPSA and SF algorithms to a tuple \((\theta ^{\lambda ^*},\lambda ^*)\), which is a (local) saddle point of the risk-sensitive objective function \(\widehat{L}(\theta ,\lambda ) \mathop {=}\limits ^{\triangle } \widehat{V}^\theta (x^0) + \lambda (\widehat{\varLambda }^\theta (x^0) - \alpha )\), where \(\widehat{V}^\theta (x^0) = {\bar{v}}^\mathsf {\scriptscriptstyle T}\phi _v(x^0)\) and \(\widehat{\varLambda }^\theta (x^0) = {\bar{u}}^\mathsf {\scriptscriptstyle T}\phi _u(x^0) - ({\bar{v}}^\mathsf {\scriptscriptstyle T}\phi _v(x^0))^2\), with \(\bar{v}\) and \(\bar{u}\) defined by (12). Further, the limit \(\theta ^{\lambda ^*}\) satisfies the variance constraint, i.e., \(\widehat{\varLambda }^{\theta ^{\lambda ^*}}(x^0) \le \alpha \). See Theorems 3, 4, 5 and Proposition 1 in Sect. 7 for details.
Remark 3
Remark 4
(One-simulation SR variant) For the SR objective, the proposed algorithms can be modified to work with only one simulated trajectory of the system. This is because in the SR case we do not require the Lagrange multiplier \(\lambda \), and thus the simulated trajectory corresponding to the nominal policy parameter \(\theta \) is not necessary. In this implementation, the gradient is estimated as \(\nabla _iS(\theta ) \approx S(\theta +\beta \varDelta )/(\beta \varDelta ^{(i)})\) for SPSA and as \(\nabla _iS(\theta ) \approx (\varDelta ^{(i)}/\beta )S(\theta +\beta \varDelta )\) for SF.
Remark 5
(Monte Carlo critic) In the above algorithms, the critic uses a TD method to evaluate the policies. These algorithms can be implemented with a Monte Carlo critic that, at each time instant n, computes a sample average of the total discounted rewards corresponding to the nominal \(\theta _n\) and perturbed \(\theta _n+\beta \varDelta _n\) policy parameters. This implementation would be similar to that in Tamar et al. (2012), except here we use simultaneous perturbation methods to estimate the gradient.
4.4 Second-order algorithms: RS-SPSA-N and RS-SF-N
4.4.1 RS-SPSA-N algorithm
(A5) For any sequences of matrices \(\{A_n\}\) and \(\{B_n\}\) in \({\mathcal {R}}^{\kappa _1\times \kappa _1}\) such that \({\displaystyle \lim _{n\rightarrow \infty } \parallel A_n - B_n \parallel }\) \(= 0\), the \(\varUpsilon \) operator satisfies \({\displaystyle \lim _{n\rightarrow \infty } \parallel \varUpsilon (A_n) - \varUpsilon (B_n) \parallel }\) \(= 0\). Further, for any sequence of matrices \(\{C_n\}\) in \(\mathcal{R}^{\kappa _1\times \kappa _1}\), we have$$\begin{aligned} {\displaystyle \sup _n \parallel C_n\parallel }<\infty \quad \Rightarrow \quad \sup _n \parallel \varUpsilon (C_n)\parallel< \infty \text { and }\sup _n \parallel \{\varUpsilon (C_n)\}^{-1} \parallel <\infty . \end{aligned}$$
4.4.2 RS-SF-N algorithm
Critic update As in the case of the RS-SF-G algorithm, we run two simulations with unperturbed and perturbed policy parameters, respectively. Recall that the perturbed simulation corresponds to the policy parameter \(\theta _n+\beta _n\varDelta _n\), where \(\varDelta _n\) represents a \(\kappa _1\)-dimensional vector of independent Gaussian \({\mathcal {N}}(0,1)\) random variables. The critic parameters for both these simulations are updated as described earlier in Sect. 4.2.
Remark 6
The second-order variants of the algorithms for SR optimization can be worked out along lines similar to those outlined in Sect. 4.4, and the details are omitted here.
5 Average reward setting
Lemma 3
Proof
Lemma 4
Proof
From Lemma 4, we notice that \(\delta _n\psi _n\) and \(\epsilon _n\psi _n\) are unbiased estimates of \(\nabla \rho (\mu )\) and \(\nabla \eta (\mu )\), respectively, where \(\psi _n=\psi (x_n,a_n)=\nabla \log \mu (a_n\mid x_n)\) is the compatible feature (see e.g., Sutton et al. 2000; Peters et al. 2005).
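For a Gibbs (softmax) policy, the compatible feature \(\psi (x,a)=\nabla \log \mu (a\mid x)\) has a simple closed form. A minimal sketch, in which the feature matrix and dimensions are illustrative rather than from the paper:

```python
import numpy as np

def softmax_policy(theta, feats):
    # mu(a|x) proportional to exp(theta^T f(x,a)); feats has one row per action.
    z = feats @ theta
    p = np.exp(z - z.max())
    return p / p.sum()

def compatible_feature(theta, feats, a):
    # psi(x,a) = grad_theta log mu(a|x) = f(x,a) - sum_b mu(b|x) f(x,b).
    mu = softmax_policy(theta, feats)
    return feats[a] - mu @ feats

theta = np.array([0.5, -0.2])
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])          # 3 actions, 2 policy features (illustrative)
psi = compatible_feature(theta, feats, a=1)
# A defining property: the compatible features average to zero under the policy,
# which is what makes products such as delta_n * psi_n unbiased gradient estimates.
```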
6 Average reward risk-sensitive actor-critic algorithm
Although our estimates of \(\rho (\theta )\) and \(\eta (\theta )\) are unbiased, since we use biased estimates for \(V^\theta \) and \(U^\theta \) (linear approximations in the critic), our gradient estimates \(\nabla \rho (\theta )\) and \(\nabla \eta (\theta )\), and as a result \(\nabla L(\theta ,\lambda )\), are biased. The following lemma shows the bias in our estimate of \(\nabla L(\theta ,\lambda )\).
Lemma 5
Proof
Remark 7
Remark 8
Remark 9
(Simultaneous perturbation analogues) In the average reward setting, a simultaneous perturbation algorithm would estimate the average reward \(\rho \) and the square reward \(\eta \) on the faster timescale and use these to estimate the gradient of the performance objective. However, a drawback of this approach, compared to the algorithm proposed above, is the necessity of simulating two trajectories (instead of one) for each policy update.
In the following section, we establish the convergence of our average reward actor-critic algorithm to a (local) saddle point of the risk-sensitive objective function \(L(\theta ,\lambda )\).
7 Convergence analysis of the discounted reward risk-sensitive actor-critic algorithms
Our proposed actor-critic algorithms use multi-timescale stochastic approximation, and we use the ordinary differential equation (ODE) approach (see Chapter 6 of Borkar (2008)) to analyze their convergence. We first provide the analysis for the SPSA-based first-order algorithm RS-SPSA-G in Sect. 7.1 and later provide the necessary modifications to the proof for the SF-based first-order algorithm and the SPSA/SF-based second-order algorithms.
7.1 Convergence of the first-order algorithm: RS-SPSA-G
Recall that RS-SPSA-G is a two-loop scheme where the inner loop is a TD critic that evaluates the value/square value functions for both the unperturbed and the perturbed policy parameters. The outer loop, on the other hand, is a two-timescale stochastic approximation algorithm, where the faster timescale updates the policy parameter \(\theta \) in the descent direction using SPSA estimates of the gradient of the Lagrangian, and the slower timescale performs dual ascent for the Lagrange multiplier \(\lambda \) using sample constraint values. The faster timescale \(\theta \)-recursion sees the \(\lambda \)-updates on the slower timescale as quasi-static, while the slower timescale \(\lambda \)-recursion sees the \(\theta \)-updates as equilibrated.
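The interplay of the two outer-loop timescales can be illustrated on a scalar toy problem in which analytic functions stand in for the converged inner-loop TD critic's outputs. Everything below (the toy V and Lambda, the step sizes, projection bounds and perturbation constant) is an illustrative stand-in, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

# Analytic stand-ins for the inner-loop TD critic's outputs: "mean cost" V and
# "variance" Lam of a scalar policy parameter theta (illustrative choices).
V = lambda th: (th - 1.0) ** 2
Lam = lambda th: 2.0 * th ** 2
alpha = 0.5                                          # variance budget
L = lambda th, lam: V(th) + lam * (Lam(th) - alpha)  # Lagrangian

theta, lam, beta = 0.8, 0.0, 0.05
for n in range(1, 20001):
    zeta2 = 0.2 * n ** -0.55          # faster timescale: theta descent
    zeta1 = 2.0 / (100.0 + n)         # slower timescale: lambda ascent
    delta = rng.choice([-1.0, 1.0])   # SPSA perturbation
    # Two evaluations per iteration: nominal theta and perturbed theta + beta*delta.
    grad = (L(theta + beta * delta, lam) - L(theta, lam)) / (beta * delta)
    theta = np.clip(theta - zeta2 * grad, -2.0, 2.0)              # projected descent
    lam = np.clip(lam + zeta1 * (Lam(theta) - alpha), 0.0, 10.0)  # dual ascent
# The iterates approach the saddle point theta* = 0.5, lam* = 0.5 of this toy
# Lagrangian, where the variance constraint Lam(theta*) = alpha is active.
```

Note how \(\lambda \) sees a nearly converged \(\theta ^{\lambda }\) at each of its (much smaller) steps, which is exactly the quasi-static viewpoint described above.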

Step 1: Critic’s convergence We establish that, for any given values of \(\theta \) and \(\lambda \) that are updated on slower timescales, the TD critic converges to a fixed point of the projected Bellman operator for value and square value functions.

Step 2: Convergence of \(\theta \)-recursion We utilize the fact that, owing to projection, the \(\theta \) parameter is stable. Using a Lyapunov argument, we show that the \(\theta \)-recursion tracks the ODE (55) in the asymptotic limit, for any given value of \(\lambda \) on the slowest timescale.

Step 3: Convergence of \(\lambda \)-recursion This step is similar to earlier analyses for constrained MDPs. In particular, we show that the \(\lambda \)-recursion in (19) converges and that the overall convergence of \((\theta _n,\lambda _n)\) is to a local saddle point \((\theta ^{\lambda ^*},\lambda ^*)\) of \(\widehat{L}(\theta ,\lambda )\), with \(\theta ^{\lambda ^*}\) satisfying the variance constraint in (3).
Theorem 2
Remark 10
It is easy to conclude from the above theorem that the TD critic parameters for the perturbed policy parameter also converge almost surely, i.e., \(v^+_m \rightarrow \bar{v}^+\) and \(u^+_m \rightarrow \bar{u}^+\) a.s., where \(\bar{v}^+\) and \(\bar{u}^+\) are the unique solutions to the TD fixed point relations for the perturbed policy parameter \(\theta _n + \beta _n \varDelta _n\), where \(\theta _n, \beta _n\) and \(\varDelta _n\) denote the policy parameter, perturbation constant and perturbation random variable, respectively. The latter quantities are updated in the outer loop (see Algorithm 1).
We first provide a proof of Lemma 2 (see Sect. 4.2), which claimed that the operator \(\varPi T\) for the value/square value functions is a contraction mapping. The result in Lemma 2 is essential in establishing the convergence result in Theorem 2.
Proof
Proof
(Theorem 2) The v-recursion in (13) performs temporal-difference (TD) learning with function approximation for the value function, while the u-recursion does the same for the square value function. The convergence of the v-recursion to the fixed point in (12) can be inferred from Tsitsiklis and Van Roy (1997).
 (A1) The function h is Lipschitz. For any c, define \(h_c(w) = h(cw)/c\). Then, there exists a continuous function \(h_\infty \) such that \(h_c \rightarrow h_\infty \) as \(c \rightarrow \infty \), uniformly on compacts. Furthermore, the origin is an asymptotically stable equilibrium for the ODE$$\begin{aligned} \dot{w}_t= h_\infty (w_t). \end{aligned}$$(54)
 (A2) The martingale difference \(\{\varDelta M_{m}, m\ge 1\}\) is square-integrable with$$\begin{aligned} {\mathbb {E}}[\left\| \varDelta M_{m+1} \right\| ^2 \mid {\mathcal {F}}_m] \le C_0 (1 + \left\| w_m \right\| ^2), \quad m\ge 0, \end{aligned}$$
It is straightforward to verify (A1), as \(h_c(w) = Mw + \xi /c\) converges to \(h_\infty (w) = Mw\) as \(c\rightarrow \infty \). Given that M is negative definite, it is easy to see that the origin is an asymptotically stable equilibrium for the ODE (54). (A2) can also be verified using the same arguments that were used to show that the martingale difference associated with the regular TD algorithm with function approximation satisfies a bound on the second moment (cf. Tsitsiklis and Van Roy 1997). \(\square \)
In the following, we show that the update of \(\theta \) is equivalent to gradient descent for the function \(\widehat{L}(\theta ,\lambda )\) and converges to a limiting set that depends on \(\lambda \).
The main result regarding the convergence of the policy parameter \(\theta \) for both the RS-SPSA-G and RS-SF-G algorithms is as follows:
Theorem 3
Under (A1)–(A4), for any given Lagrange multiplier \(\lambda \), \(\theta _n\) updated according to (19) converges almost surely to \(\theta ^* \in {\mathcal {Z}}_{\lambda }\).
The proof of the above theorem requires the following lemma, which shows that the conditions on \(m_n\) and \(\beta _n\) in (A4) ensure that running the TD critic for only a finite trajectory length \(m_n\) does not introduce any bias.
Lemma 6
Proof
In order to prove Theorem 3, we require the well-known Kushner–Clark lemma (see Kushner and Clark 1978, pp. 191–196). For the sake of completeness, we recall this result below.

(B1) h is a continuous \({\mathbb {R}}^{\kappa _1}\)valued function.

(B2) The sequence \(\xi _{1,n},n\ge 0\) is a bounded random sequence with \(\xi _{1,n} \rightarrow 0\) almost surely as \(n\rightarrow \infty \).

(B3) The stepsizes \(a(n),n\ge 0\) satisfy \( a(n)\rightarrow 0 \text{ as } n\rightarrow \infty \text { and } \sum _n a(n)=\infty \).
 (B4) \(\{\xi _{2,n}, n\ge 0\}\) is a sequence such that for any \(\epsilon >0\),$$\begin{aligned} \lim _{n\rightarrow \infty } P\left( \sup _{m\ge n} \left\| \sum _{i=n}^{m} a_i \xi _{2,i}\right\| \ge \epsilon \right) = 0. \end{aligned}$$

(B5) The ODE (61) has a compact subset K of \({\mathcal {R}}^{\kappa _1}\) as its set of asymptotically stable equilibrium points.
Theorem 4
Assume (B1)–(B5). Then, \(x_n\) converges almost surely to the set K.
Proof

From (A1), together with the facts that the state space is finite and the projection \(\varGamma \) is onto a compact set, we have from Theorem 2 of Schweitzer (1968) that the stationary distributions \({\varvec{D}}^\theta _\gamma (x\mid x^0)\) and \(\widetilde{d}^\theta _\gamma (x\mid x^0)\) are continuously differentiable. This in turn implies continuity of \(\nabla \widehat{V}(\theta _n)\) and \(\nabla \widehat{U}(\theta _n)\). Thus, (B1) follows for \(\nabla \widehat{L}(\theta _n, \lambda )\).

In light of Lemma 6 and (A4), we have that \(\xi _{1,n} \rightarrow 0\) as \(n\rightarrow \infty \).

(A4) implies (B3).
 A simple calculation shows that \( {\mathbb {E}}(\xi _{2,n})^2 \le {\mathbb {E}}({\mathcal {T}}_n^{(i)})^2 \le C_3/\beta _n^2\) for some \(C_3<\infty \). Applying Doob's inequality, we obtain$$\begin{aligned} P\left( \sup _{l\ge k} \left\| \sum _{n=k}^{l} \zeta _2(n) \xi _{2,n}\right\| \ge \epsilon \right) \le&\dfrac{1}{\epsilon ^2} \sum _{n=k}^{\infty } \zeta _2(n)^2 {\mathbb {E}}\left\| \xi _{2,n}\right\| ^2 \end{aligned}$$(63)$$\begin{aligned} \le&\dfrac{C_3}{\epsilon ^2} \sum _{n=k}^{\infty } \frac{\zeta _2(n)^2}{\beta _n^2} \rightarrow 0 \text { as } k \rightarrow \infty . \end{aligned}$$(64)Thus, (B4) is satisfied.
 \({\mathcal {Z}}_\lambda \) is an asymptotically stable attractor for the ODE (55), with \(\widehat{L}(\theta ,\lambda )\) itself serving as a strict Lyapunov function. This can be inferred as follows:$$\begin{aligned} \dfrac{d \widehat{L}(\theta ,\lambda )}{d t} = \nabla \widehat{L}(\theta ,\lambda ) \dot{\theta }= \nabla \widehat{L}(\theta ,\lambda ) \check{\varGamma }\big (-\nabla \widehat{L}(\theta ,\lambda )\big ) < 0. \end{aligned}$$
Step 3: (Analysis of \(\lambda \)-recursion and convergence to a local saddle point) We first show that the \(\lambda \)-recursion converges and then prove that the whole algorithm converges to a local saddle point of \(\widehat{L}(\theta ,\lambda )\).
Theorem 5
\(\lambda _n \rightarrow {\mathcal {F}}\) almost surely as \(n \rightarrow \infty \), where \({\mathcal {F}}\mathop {=}\limits ^{\triangle }\big \{\lambda \mid \lambda \in [0,\lambda _{\max }],\;\check{\varGamma }_\lambda \big [\widehat{\varLambda }^{\theta ^\lambda }(x^0)-\alpha \big ]=0,\;\theta ^\lambda \in {\mathcal {Z}}_\lambda \big \}\).
Proof
We next claim that the limit \(\theta ^{\lambda ^*}\) corresponding to \(\lambda ^*\) satisfies the variance constraint in (3), i.e.,
Proposition 1
For any \(\lambda ^*\) in \(\hat{{\mathcal {F}}} \mathop {=}\limits ^{\triangle }\big \{\lambda \mid \lambda \in [0,\lambda _{\max }),\;\check{\varGamma }_\lambda \big [\widehat{\varLambda }^{\theta ^\lambda }(x^0)-\alpha \big ]=0,\;\theta ^\lambda \in {\mathcal {Z}}_\lambda \big \}\), the corresponding limiting point \(\theta ^{\lambda ^*}\) satisfies the variance constraint \(\widehat{\varLambda }^{\theta ^{\lambda ^*}}(x^0) \le \alpha \).
Proof
Follows in a similar manner as Proposition 10.6 in Bhatnagar et al. (2013). \(\square \)
From Theorems 3, 4, 5 and Proposition 1, it is evident that the actor recursion (19) converges to a tuple \((\theta ^{\lambda ^*},\lambda ^*)\) that is a local minimum w.r.t. \(\theta \) and a local maximum w.r.t. \(\lambda \) of \(\widehat{L}(\theta ,\lambda )\). In other words, overall convergence is to a (local) saddle point of \(\widehat{L}(\theta ,\lambda )\). Further, the limit is also feasible for the constrained problem in (3) as \(\theta ^{\lambda ^*}\) satisfies the variance constraint there.
7.2 Convergence of the first-order algorithm: RS-SF-G
Note that since RS-SPSA-G and RS-SF-G use different methods to estimate the gradient, their proofs differ only in the second step, i.e., the convergence of the policy parameter \(\theta \).
7.2.1 Proof of Theorem 3 for SF
Proof
7.2.2 Convergence of the second-order algorithms: RS-SPSA-N and RS-SF-N
Convergence analysis of the second-order algorithms involves the same steps as that of the first-order algorithms. In particular, the first step involving the TD critic and the third step involving the analysis of the \(\lambda \)-recursion follow along similar lines as earlier, whereas the \(\theta \)-recursion analysis in the second step differs significantly.
Theorem 6
Under (A1)–(A5), for any given Lagrange multiplier \(\lambda \) and \(\varepsilon > 0\), there exists \(\beta _0 >0\) such that for all \(\beta \in (0, \beta _0)\), \(\theta _n \rightarrow \theta ^* \in {\mathcal {Z}}^\varepsilon _{\lambda }\) almost surely.
7.2.3 Proof of Theorem 6 for RS-SPSA-N
Before we prove Theorem 6, we establish that the Hessian estimate \(H_n\) in (30) converges almost surely to the true Hessian \(\nabla ^2_{\theta } L(\theta _n, \lambda )\) in the following lemma.
Lemma 7
 (i)
\(\left\| \dfrac{L(\theta _n + \beta _n \varDelta _n + \beta _n \widehat{\varDelta }_n, \lambda ) - L(\theta _n,\lambda )}{\beta _n^2 \varDelta _n^{(i)} \widehat{\varDelta }_n^{(j)}} - \nabla ^2_{\theta _n^{(i, j)}} L(\theta _n, \lambda ) \right\| \rightarrow 0\),
 (ii)
\(\left\| \dfrac{L(\theta _n + \beta _n \varDelta _n + \beta _n \widehat{\varDelta }_n, \lambda ) - L(\theta _n,\lambda )}{\beta _n \widehat{\varDelta }_n^{(i)}} - \nabla _{\theta _n^{(i)}} L(\theta _n, \lambda ) \right\| \rightarrow 0\),
 (iii)
\(\left\| H^{(i, j)} - \nabla ^2_{\theta _n^{(i, j)}} L(\theta _n, \lambda ) \right\| \rightarrow 0\),
 (iv)
\(\left\| M - \varUpsilon (\nabla ^2_{\theta _n} L(\theta _n, \lambda ))^{-1} \right\| \rightarrow 0\).
Proof
The proofs of the above claims follow from Propositions 10.10, 10.11 and Lemmas 7.10 and 7.11 of Bhatnagar et al. (2013), respectively. \(\square \)
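To see why the difference quotient in claim (i) of the lemma recovers the Hessian, one can check it numerically on a function with a known Hessian. The quadratic below is a toy stand-in for \(L(\cdot ,\lambda )\) at a fixed \(\lambda \); all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy quadratic with a known Hessian, standing in for L(., lambda) at fixed lambda.
H_true = np.array([[2.0, 0.5],
                   [0.5, 1.0]])
L = lambda th: 0.5 * th @ H_true @ th

def spsa_hessian_sample(theta, beta):
    # One sample of the estimator in claim (i) of the lemma:
    # [L(theta + beta*Delta + beta*Delta_hat) - L(theta)] / (beta^2 Delta^(i) Delta_hat^(j)),
    # with Delta, Delta_hat independent Rademacher perturbation vectors.
    d = rng.choice([-1.0, 1.0], size=2)       # Delta
    dh = rng.choice([-1.0, 1.0], size=2)      # Delta_hat
    diff = L(theta + beta * d + beta * dh) - L(theta)
    return diff / (beta ** 2 * np.outer(d, dh))

theta = np.array([0.3, -0.7])
H_est = np.mean([spsa_hessian_sample(theta, 0.05) for _ in range(100000)], axis=0)
H_est = 0.5 * (H_est + H_est.T)               # symmetrize, as the algorithm does
```

In expectation the first-order term of the Taylor expansion cancels (it is odd in the perturbations), leaving exactly the \((i,j)\) Hessian entry, which the averaged estimate reproduces.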
Proof
7.2.4 Proof of Theorem 6 for RS-SF-N
Proof
We first establish the following result for the gradient and Hessian estimators employed in RS-SF-N:
Lemma 8
 (i)
\(\Bigg \Vert E \left[ \frac{1}{\beta _n^2} \bar{H}(\varDelta _n) (L(\theta _n +\beta _n \varDelta _n,\lambda ) - L(\theta _n,\lambda ))\mid \theta _n,\lambda \right] - \nabla ^2_{\theta } L(\theta _n,\lambda ) \Bigg \Vert \rightarrow 0\).
 (ii)
\(\Vert E\left[ \dfrac{1}{\beta _n} \varDelta _n (L(\theta _n+\beta _n\varDelta _n,\lambda ) - L(\theta _n,\lambda ))\mid \theta _n,\lambda \right] - \nabla L(\theta _n,\lambda ) \Vert \rightarrow 0\).
Proof
The proofs of the above claims follow from Propositions 10.1 and 10.2 of Bhatnagar et al. (2013), respectively. \(\square \)
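The Gaussian-smoothing estimator of claim (ii) can likewise be checked numerically on a toy quadratic with a known gradient (again an illustrative stand-in for \(L(\cdot ,\lambda )\) at fixed \(\lambda \)):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy quadratic with known gradient, standing in for L(., lambda) at fixed lambda.
A = np.array([[1.5, 0.2],
              [0.2, 1.0]])
L = lambda th: 0.5 * th @ A @ th

def sf_grad_sample(theta, beta):
    # One sample of the estimator in claim (ii) of the lemma:
    # (1/beta) * Delta * (L(theta + beta*Delta) - L(theta)), with Delta ~ N(0, I).
    d = rng.standard_normal(2)
    return (d / beta) * (L(theta + beta * d) - L(theta))

theta = np.array([0.4, -0.6])
g_est = np.mean([sf_grad_sample(theta, 0.1) for _ in range(100000)], axis=0)
g_true = A @ theta   # equals (0.48, -0.52)
```

Since \(E[\varDelta \varDelta ^\mathsf {T}] = I\) for standard Gaussian \(\varDelta \), the leading term of the expansion has expectation exactly \(\nabla L(\theta ,\lambda )\), and the averaged estimate converges to it.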
The rest of the analysis is identical to that of RS-SPSA-N.
Remark 11
(On convergence rate) In the above, we established asymptotic limits for all our algorithms using the ODE approach. To the best of our knowledge, there are no convergence rate results available for multi-timescale stochastic approximation schemes, and hence, for actor-critic algorithms. This is true even for actor-critic algorithms that do not incorporate any risk criterion. In Konda and Tsitsiklis (2004), the authors provide asymptotic convergence rate results for linear two-timescale recursions. It would be an interesting direction for future research to obtain concentration bounds for general (nonlinear) two-timescale schemes.
While a rigorous analysis of the convergence rate of our proposed schemes is difficult, one could make a few concessions and use the following argument to see that the SPSA-based algorithms converge quickly: in order to analyse the rate of convergence of the \(\theta \)-recursion, assume (for sufficiently large n) that the TD critic has converged in the inner loop. This is because the trajectory lengths \(m_n \rightarrow \infty \) as \(n \rightarrow \infty \), and under appropriate stepsize settings (or with iterate averaging) one can obtain a convergence rate of the order \(O\left( 1/\sqrt{m}\right) \) on the root mean square error of TD (see Theorem 1). Now, if one holds \(\lambda \) fixed, then invoking asymptotic normality results for SPSA [see Proposition 2 in Spall (1992)], it can be shown that \(n^{1/3}(\theta _n - \theta ^{\lambda })\) is asymptotically normal, where \(\theta ^{\lambda }\) is a limit point in the set \({\mathcal {Z}}_\lambda \). Similar results also hold for second-order SPSA variants [cf. Theorem 3a in Spall (2000)]. Both the aforementioned claims are proved using a well-known result on asymptotic normality of stochastic approximation schemes due to Fabian (1968).
The second-order schemes such as RS-SPSA-N score over their first-order counterpart RS-SPSA-G from an asymptotic normality perspective. This is because obtaining the optimal convergence rate for RS-SPSA-G requires that the stepsize \(\zeta _2(n)\) be set to \(\zeta _2(0)/n\) with \(\zeta _2(0) > 1/\lambda _{\min }(\nabla ^2_\theta L(\theta ^{\lambda },\lambda ))\), whereas there is no such constraint for the second-order algorithm RS-SPSA-N. Here \(\lambda _{\min }(A)\) denotes the minimum eigenvalue of the matrix A. The reader is referred to Dippon and Renz (1997) for a detailed discussion of the convergence rate of (one-timescale) SPSA-based schemes using the asymptotic mean-square error.
Remark 12
8 Convergence analysis of the average reward risk-sensitive actor-critic algorithm
 1.
The first step is the convergence of \(\rho \), \(\eta \), V, and U, for any fixed policy \(\theta \) and Lagrange multiplier \(\lambda \). This corresponds to a TD(0) (with extension to \(\eta \) and U) proof. Using arguments similar to those in Step 2 of the proof of RS-SPSA-G, one can show that the \(\theta \)- and \(\lambda \)-recursions track \(\dot{\theta }_t =0\) and \(\dot{\lambda }_t=0\), when viewed from the TD critic timescale \(\{\zeta _3(t)\}\). Thus, the policy \(\theta \) and Lagrange multiplier \(\lambda \) are assumed constant in the analysis of the critic recursion.
 2. The second step is to show the convergence of \(\theta _n\) to an \(\varepsilon \)-neighborhood \({\mathcal {Z}}_\lambda ^\varepsilon \) of the set of asymptotically stable equilibria \({\mathcal {Z}}_\lambda \) of the ODE$$\begin{aligned} \dot{\theta }_t=\check{\varGamma }\big (-\nabla L(\theta _t,\lambda )\big ), \end{aligned}$$(70)where the projection operator \(\check{\varGamma }\), defined in (56), ensures that the evolution of \(\theta \) via the ODE (70) stays within the compact and convex set \(\varTheta \subset {\mathbb {R}}^{\kappa _1}\). Again, it is assumed here that \(\lambda \) is fixed, because the \(\theta \)-recursion is on a faster timescale than \(\lambda \)'s.
 3.
The final step is the convergence of \(\lambda \) and showing that the whole algorithm converges to a local saddle point of \(L(\theta ,\lambda )\), where the limit is shown to satisfy the variance constraint in (40).
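The recursions whose convergence the three steps above establish can be sketched in code. The function below is a simplified, illustrative rendering of one joint update of the average-reward risk-sensitive actor-critic (eligibility traces and the projection of \(\theta \) onto \(\varTheta \) are omitted, the signs follow the cost-minimization convention, and all argument names are ours, not the paper's):

```python
import numpy as np

def rs_ac_step(phi, phi_next, rew, rho, eta, v, u, psi, theta, lam,
               z3, z2, z1, alpha, lam_max):
    # One combined update of the average-reward risk-sensitive actor-critic
    # (a simplified sketch of the recursions in Sect. 6).
    rho_new = rho + z3 * (rew - rho)               # average reward (fastest timescale)
    eta_new = eta + z3 * (rew ** 2 - eta)          # average square reward
    delta = rew - rho + v @ phi_next - v @ phi     # TD error, differential value
    eps = rew ** 2 - eta + u @ phi_next - u @ phi  # TD error, square value
    v_new = v + z3 * delta * phi                   # critic updates
    u_new = u + z3 * eps * phi
    # Actor: grad L ~ (delta + lam*(eps - 2*rho*delta)) * psi, psi = grad log mu.
    theta_new = theta - z2 * (delta + lam * (eps - 2.0 * rho * delta)) * psi
    # Lagrange multiplier: projected dual ascent on the variance constraint.
    lam_new = np.clip(lam + z1 * (eta_new - rho_new ** 2 - alpha), 0.0, lam_max)
    return rho_new, eta_new, v_new, u_new, theta_new, lam_new
```

Note the single simulated transition per update: unlike the simultaneous perturbation variants, no second (perturbed) trajectory is needed, since the gradient is formed from the compatible features.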
Lemma 9
Proof
The proof for the average reward \(\rho (\mu )\) and differential value function \(v^\mu \) follows in a similar manner as Lemma 5 in Bhatnagar et al. (2009a). It is based on verifying Assumptions (A1)–(A2) of Borkar and Meyn (2000), and uses the second part of Assumption (A3) of our paper, i.e., \(\varPhi _v v \ne e\) for every \(v\in {\mathbb {R}}^{\kappa _2}\), where e denotes the vector of all ones. The proof for \(\rho (\mu )\) and \(v^\mu \) can be easily extended to the square average reward \(\eta (\mu )\) and square differential value function \(u^\mu \). \(\square \)
Step 2: Actor’s convergence
Let \({\mathcal {Z}}_\lambda =\big \{\theta \in \varTheta :\check{\varGamma }\big (-\nabla L(\theta ,\lambda )\big )=0\big \}\) denote the set of asymptotically stable equilibrium points of the ODE (70) and \({\mathcal {Z}}_\lambda ^\varepsilon =\big \{\theta \in \varTheta :\Vert \theta -\theta _0\Vert <\varepsilon ,\;\theta _0\in {\mathcal {Z}}_\lambda \big \}\) the set of points in the \(\varepsilon \)-neighborhood of \({\mathcal {Z}}_\lambda \). The main result regarding the convergence of the policy parameter in (47) is as follows:
Theorem 7
Assume (A1)–(A4). Then, given \(\varepsilon>0,\;\exists \beta >0\) such that for \(\theta _n,\;n\ge 0\) obtained by the algorithm, if \(\sup _{\theta _n} \Vert {\mathcal {B}}(\theta _n,\lambda )\Vert <\beta \), then \(\theta _n\) governed by (47) converges almost surely to \({\mathcal {Z}}^\varepsilon _\lambda \) as \(n\rightarrow \infty \).
Proof
Remark 13
(Bias in estimating the gradient) We do not always expect that \(\sup _{\theta } \Vert {\mathcal {B}}(\theta ,\lambda )\Vert \rightarrow 0\). However, if there is no bias, or negligibly small bias, in the actor-critic algorithm, which is directly related to the choice of the critic's function space, then we will definitely gain from using actor-critic instead of policy gradient. Note that the choice between actor-critic and policy gradient is a bias–variance tradeoff, and as with any other bias–variance tradeoff, if the variance reduction is more significant (given the number of samples used to estimate each gradient) than the introduced bias, then it is advantageous to use actor-critic instead of policy gradient. Also note that this tradeoff exists even in the original (risk-neutral) form of actor-critic and policy gradient and has nothing to do with the risk-sensitive objective function studied in this paper. For more details on this, we refer the reader to Theorem 2 and Remark 2 in Bhatnagar et al. (2009b).
Step 3: \(\lambda \) Convergence and overall convergence of the algorithm
Theorem 8
\(\lambda _n \rightarrow {\mathcal {F}}\) almost surely as \(n \rightarrow \infty \), where \({\mathcal {F}}\mathop {=}\limits ^{\triangle }\big \{\lambda \mid \lambda \in [0,\lambda _{\max }], \check{\varGamma }_\lambda \big (\varLambda (\theta ^\lambda ) - \alpha \big )=0,\;\theta ^\lambda \in {\mathcal {Z}}_\lambda \big \}\).
Proof
The proof follows in a similar manner as that of Theorem 3 in Bhatnagar and Lakshmanan (2012). \(\square \)
As in the discounted setting, the following proposition claims that the limit \(\theta ^{\lambda ^*}\) corresponding to \(\lambda ^*\) satisfies the variance constraint in (40), i.e.,
Proposition 2
For any \(\lambda ^*\) in \(\hat{{\mathcal {F}}} \mathop {=}\limits ^{\triangle }\big \{\lambda \mid \lambda \in [0,\lambda _{\max }),\;\check{\varGamma }_\lambda \big [ \varLambda ^{\theta ^\lambda }(x^0)-\alpha \big ]=0,\;\theta ^\lambda \in {\mathcal {Z}}_\lambda \big \}\), the corresponding limiting point \(\theta ^{\lambda ^*}\) satisfies the variance constraint \(\varLambda ^{\theta ^{\lambda ^*}}(x^0) \le \alpha \).
Using arguments similar to those used to prove the convergence of RS-SPSA-G, it can be shown that the ODE (77) is equivalent to \(\dot{\lambda }_t = \check{\varGamma }_\lambda \big [\nabla _\lambda L(\theta ^{\lambda _t},\lambda _t)\big ]\), and thus the actor parameters \((\theta _n,\lambda _n)\) updated according to (47) converge to a (local) saddle point \((\theta ^{\lambda ^*},\lambda ^*)\) of \(L(\theta ,\lambda )\). Moreover, the limiting point \(\theta ^{\lambda ^*}\) satisfies the variance constraint in (40).
9 Experimental results
We evaluate our algorithms in the context of a traffic signal control application. The objective in our formulation is to minimize the total number of vehicles in the system, which indirectly minimizes the delay experienced by the road users. The motivation behind using a risk-sensitive control strategy is to reduce the variations in the delay experienced by the road users.
9.1 Implementation

Policy search phase Here each iteration involves simulation runs with the nominal policy parameter \(\theta \) as well as the perturbed policy parameter \(\theta ^+\) (algorithm-specific). We run each algorithm for 500 iterations, where the run length for a particular policy parameter is 150 steps.

Policy test phase After the completion of the policy search phase, we freeze the policy parameter and run 50 independent simulations with this (converged) choice of the parameter. The results presented subsequently are averages over these 50 runs.
 1.SPSA-G This is a first-order risk-neutral algorithm with SPSA-based gradient estimates that updates the parameter \(\theta \) as follows:$$\begin{aligned} \theta _{n+1}^{(i)} =\,&\varGamma _i\left( \theta _n^{(i)} + \frac{\zeta _2(n)}{\beta \varDelta _n^{(i)}}(v^+_n - v_n)^\mathsf {\scriptscriptstyle T}\phi _v(x^0)\right) , \end{aligned}$$where the critic parameters \(v_n, v^+_n\) are updated according to (13). Note that this is a two-timescale algorithm with a TD critic on the faster timescale and the actor on the slower timescale. Unlike RS-SPSA-G, this algorithm, being risk-neutral, does not involve the Lagrange multiplier recursion.
 2.SF-G This is a first-order risk-neutral algorithm that is similar to SPSA-G, except that the gradient estimation scheme used here is based on the smoothed functional (SF) technique. The update of the policy parameter in this algorithm is given by$$\begin{aligned} \theta _{n+1}^{(i)} =\,&\varGamma _i\left( \theta _n^{(i)} + \zeta _2(n)\Big (\frac{\varDelta _n^{(i)}}{\beta }(v^+_n - v_n)^\mathsf {\scriptscriptstyle T}\phi _v(x^0)\Big )\right) . \end{aligned}$$
 3.SPSA-N This is a risk-neutral algorithm and is the second-order counterpart of SPSA-G. The Hessian update in this algorithm is as follows: for \(i,j=1,\ldots , \kappa _1\), \(i \le j\), the update is$$\begin{aligned} H^{(i, j)}_{n+1}= H^{(i, j)}_n + \zeta ^{\prime }_2(n)\bigg [&\dfrac{(v_n-v^+_n)^\mathsf {\scriptscriptstyle T}\phi _v(x^0)}{\beta ^2 \varDelta ^{(i)}_n\widehat{\varDelta }^{(j)}_n} - H^{(i, j)}_n \bigg ], \end{aligned}$$(80)and for \(i > j\), we set \(H^{(i, j)}_{n+1} = H^{(j, i)}_{n+1}\). As in RS-SPSA-N, let \(M_n \mathop {=}\limits ^{\triangle } H_n^{-1}\), where \(H_n = \varUpsilon \big ([H^{(i,j)}_n]_{i,j = 1}^{\kappa _1}\big )\). The actor updates the parameter \(\theta \) as follows:$$\begin{aligned} \theta _{n+1}^{(i)}= \varGamma _i\bigg [\theta _n^{(i)} + \zeta _2(n)\sum \limits _{j = 1}^{\kappa _1} M^{(i, j)}_n\Big (&\dfrac{(v^+_n - v_n)^\mathsf {\scriptscriptstyle T}\phi _v(x^0)}{\beta \varDelta _n^{(j)}} \Big )\bigg ]. \end{aligned}$$(81)The rest of the symbols, including the critic parameters, are as in RS-SPSA-N.
 4.SF-N This is a risk-neutral algorithm and is the second-order counterpart of SF-G. It updates the Hessian and the actor as follows: for \(i,j,k=1,\ldots , \kappa _1\), \(j< k\), the Hessian update is$$\begin{aligned} \mathbf{Hessian: } \quad H^{(i, i)}_{n + 1} =\,&H^{(i, i)}_n + \zeta ^{\prime }_2(n)\bigg [\dfrac{\big (\varDelta ^{(i)^2}_n-1\big )}{\beta ^2}(v_n-v^+_n)^\mathsf {\scriptscriptstyle T}\phi _v(x^0) - H^{(i, i)}_n \bigg ],\\ H^{(j, k)}_{n + 1} =\,&H^{(j, k)}_n + \zeta ^{\prime }_2(n)\bigg [\dfrac{\varDelta ^{(j)}_n\varDelta ^{(k)}_n}{\beta ^2}(v_n-v^+_n)^\mathsf {\scriptscriptstyle T}\phi _v(x^0) - H^{(j, k)}_n \bigg ], \end{aligned}$$and for \(j > k\), we set \(H^{(j, k)}_{n+1} = H^{(k, j)}_{n+1}\). As before, let \(M_n \mathop {=}\limits ^{\triangle } H_n^{-1}\), with \(H_n\) formed as in SPSA-N. Then, the actor update for the parameter \(\theta \) is as follows:$$\begin{aligned} \mathbf{Actor: } \quad \theta _{n+1}^{(i)}= \varGamma _i\bigg [\theta _n^{(i)} + \zeta _2(n)\sum \limits _{j = 1}^{\kappa _1} M^{(i, j)}_n\frac{\varDelta _n^{(j)}}{\beta }(v^+_n - v_n)^\mathsf {\scriptscriptstyle T}\phi _v(x^0) \bigg ]. \end{aligned}$$The rest of the symbols, including the critic parameters, are as in RS-SPSA-N.
 5.
RS-SPSA-G This is the first-order risk-sensitive actor-critic algorithm that attempts to solve (40) and updates according to (19).
 6.
RS-SF-G This is a first-order algorithm and the risk-sensitive variant of SF-G that updates the actor according to (20).
 7.
RS-SPSA-N This is a second-order risk-sensitive algorithm that estimates the gradient and Hessian using SPSA and updates them according to (31).
 8.
RS-SF-N This second-order risk-sensitive algorithm is the SF counterpart of RS-SPSA-N, and updates according to (36).
 9.
TAMAR This is a straightforward adaptation of the algorithm proposed in Tamar et al. (2012). The main difference between this and our algorithms is that TAMAR uses a Monte Carlo critic, while our algorithms employ a TD critic. Moreover, TAMAR incorporates a \(\lambda \)-recursion that is identical to that of our algorithms (see Eq. 21). In contrast, the algorithm proposed in Tamar et al. (2012) is for a fixed \(\lambda \) that may not be optimal. Note that even though TAMAR is an algorithm proposed for the stochastic shortest path (SSP) setting, it can be implemented in the traffic signal control problem since we truncate the simulation after 150 steps.
Let \(D_n\) denote the sum of rewards obtained from a single simulation run in the policy search phase. Further, let \(z_n:= \sum _{m=0}^{150} \nabla \ln \mu _\theta (x_m,a_m)\) denote the likelihood derivative. Then, the update rule is given by$$\begin{aligned} \tilde{V}_{n+1} =&\tilde{V}_{n} + \zeta _3(n) \big ( D_n - \tilde{V}_n \big )\\ \tilde{\varLambda }_{n+1} =&\tilde{\varLambda }_{n} + \zeta _3(n) \big ( D_n^2 - \tilde{V}_n^2 - \tilde{\varLambda }_n \big )\\ \theta _{n+1}^{(i)} =&\varGamma _i\left( \theta _n + \zeta _2(n) \big ( D_n - \lambda _n (D_n^2 - 2 D_n \tilde{V}_n) \big ) z_n^{(i)} \right) , \quad i=1,\ldots , \kappa _1,\\ \lambda _{n+1} =\,&\varGamma _\lambda \bigg [\lambda _n + \zeta _1(n)\Big (\tilde{\varLambda }_n - \alpha \Big )\bigg ]. \end{aligned}$$Note that the \(\theta \)-recursion above corrects an error (we believe it is a typo) in the corresponding update rule [i.e., Eq. 13 in Tamar et al. (2012)], which is missing the multiplier \(D_n\) in the last term of the \(\theta \)-recursion. The latter multiplier originates from the gradient of the value function [see Lemma 4.2 in Tamar et al. (2012)].
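One iteration of this baseline can be sketched as follows. The projections \(\varGamma _i\) and \(\varGamma _\lambda \) are realized as simple clippings, and whether the current or freshly updated variance estimate enters the \(\lambda \)-update is an implementation detail; both choices and all bounds below are illustrative:

```python
import numpy as np

def tamar_update(D, z, V_t, Lam_t, theta, lam,
                 z3, z2, z1, alpha, theta_bound=5.0, lam_max=10.0):
    # One iteration of the TAMAR baseline as written above: D is the return of
    # a single simulated trajectory and z the likelihood derivative, i.e., the
    # sum of grad log mu over the trajectory.
    V_new = V_t + z3 * (D - V_t)                        # Monte-Carlo value estimate
    Lam_new = Lam_t + z3 * (D ** 2 - V_t ** 2 - Lam_t)  # Monte-Carlo variance estimate
    # Policy update; note the multiplier D inside the variance-gradient term.
    g = (D - lam * (D ** 2 - 2.0 * D * V_t)) * z
    theta_new = np.clip(theta + z2 * g, -theta_bound, theta_bound)
    lam_new = np.clip(lam + z1 * (Lam_new - alpha), 0.0, lam_max)
    return V_new, Lam_new, theta_new, lam_new
```

The contrast with our algorithms is visible here: the critic quantities \(\tilde{V}\) and \(\tilde{\varLambda }\) are plain Monte-Carlo averages over whole trajectories, with no bootstrapped TD structure.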
 1.
AC This is an actor-critic algorithm that minimizes the long-run average sum of the single-stage cost function \(h(x_n)\), without considering any risk criteria. This is similar to Algorithm 1 in Bhatnagar et al. (2009a).
 2.
RS-AC This is the risk-sensitive actor-critic algorithm that attempts to solve (40) and is described in Sect. 6.
9.2 Results
Figure 3 shows the distribution of the discounted cumulative cost \(D^\theta (x^0)\) for the algorithms in the discounted setting. Figure 4 shows the total arrived road users (TAR) obtained for all the algorithms in the discounted setting, whereas Fig. 5 presents the average junction waiting time (AJWT) for the first-order SF-based algorithm RS-SF-G. TAR is a throughput metric that measures the number of road users who have reached their destination, whereas AJWT is a delay metric that quantifies the average delay experienced by the road users.
The performance of the algorithms in the average setting is presented in Fig. 6. In particular, Fig. 6a shows the distribution of the average reward \(\rho \), while Fig. 6b presents the average junction waiting time (AJWT) for the average cost algorithms.
Observation 1
The risk-sensitive algorithms that we propose result in a long-term (discounted or average) cost that is higher than that of their risk-neutral variants, but with a significantly lower empirical variance of the cost, in both the discounted and average cost settings.
The above observation is apparent from Figs. 3 and 6a, which present results for discounted and average cost settings respectively.
Observation 2
From a traffic signal control application standpoint, the risk-sensitive algorithms exhibit a mean throughput/delay that is close to that of the corresponding risk-neutral algorithms, but with a lower empirical variance in throughput/delay.
Throughput (TAR) for algorithms in the discounted setting: standard deviation from 50 independent simulations shown after ±
Algorithm  Riskneutral  Risksensitive 

SPSAG  \(754.84 \pm 317.06\)  \(622.38 \pm 28.36\) 
SFG  \(832.34 \pm 82.24\)  \(810.82 \pm 36.56\) 
SPSAN  \(1077.2.66 \pm 250.42\)  \(942.3 \pm 65.77\) 
SFN  \(1013.62 \pm 152.22\)  \(870.5 \pm 61.61\) 
From the results in Figs. 3, 4 and Table 1, it is apparent that the second-order schemes (RS-SPSA-N and RS-SF-N) in the discounted setting exhibit better results in comparison to the first-order methods (RS-SPSA-G and RS-SF-G), in terms of both the mean and variance of the long-term discounted cost as well as the throughput (TAR) performance.
Observation 3
The policy parameter \(\theta \) converges for the risk-sensitive algorithms.
Observation 4
RS-SPSA, which is based on an actor-critic architecture, outperforms TAMAR, which employs a policy gradient approach.

Step 1 (True gradient estimation): Estimate \(\nabla _\theta \varLambda (x^0)\) using the likelihood ratio method, along the lines of Lemma 4.2 in Tamar et al. (2012). For this purpose, simulate a large number, say \(T_1=1000\), of trajectories of the underlying MDP (as before, we truncate the trajectories to 150 steps). This estimate can be safely assumed to be very close to the true gradient and hence, we shall use it as the benchmark for comparing our SPSA-based actor-critic scheme vs. the policy gradient approach of TAMAR.
Step 2 (Policy gradient approach of TAMAR):
–
Fix a policy parameter.
–
Run two simulations for the policy above.
–
Estimate \(\nabla _\theta \varLambda (x^0)\) using the scheme in TAMAR.
–
Calculate the distance (in \(\ell _2\) norm) between the estimate above and the benchmark defined in Step 1.
–
Repeat the above steps 100 times and collect the mean and standard errors of the \(\ell _2\) distance in the last step above.
Step 3 (Actor-critic approach of RS-SPSA):
–
Fix a policy parameter.
–
Run two simulations, one for the unperturbed parameter and the other for the perturbed parameter, where the perturbation is performed as in RS-SPSA (see Sect. 4.3).
–
Estimate \(\nabla _\theta \varLambda (x^0)\) using the scheme in RS-SPSA.
–
Calculate the distance (in \(\ell _2\) norm) between the estimate above and the benchmark defined in Step 1.
–
Repeat the above steps 100 times and collect the mean and standard errors of the relevant \(\ell _2\) distance as in Step 2.
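The comparison protocol above can be sketched in code as follows. Here the likelihood-ratio benchmark of Step 1 is replaced by a hypothetical `benchmark_gradient` (for a toy objective whose gradient is known in closed form), and the actor-critic estimate is replaced by the standard two-measurement SPSA form with Rademacher perturbations (cf. Spall 1992); the objective `Lambda` is a stand-in, not the variance of the return from our experiments.

```python
import math
import random

def spsa_gradient(f, theta, delta=0.05, rng=random):
    """Two-measurement SPSA gradient estimate: one function evaluation at
    theta + delta*Delta and one at theta - delta*Delta, where Delta is a
    vector of independent +/-1 (Rademacher) perturbations."""
    perturb = [rng.choice((-1.0, 1.0)) for _ in theta]
    plus = [t + delta * d for t, d in zip(theta, perturb)]
    minus = [t - delta * d for t, d in zip(theta, perturb)]
    diff = f(plus) - f(minus)
    return [diff / (2.0 * delta * d) for d in perturb]

def l2_distance(g, h):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g, h)))

# Hypothetical smooth objective standing in for Lambda(x^0); its exact
# gradient plays the role of the Step 1 benchmark.
def Lambda(theta):
    return sum(t ** 2 for t in theta)

def benchmark_gradient(theta):
    return [2.0 * t for t in theta]

def mean_error(theta, replications=100, seed=0):
    # Steps 2-3: repeat the estimate and average the l2 distances
    # to the benchmark.
    rng = random.Random(seed)
    dists = [l2_distance(spsa_gradient(Lambda, theta, rng=rng),
                         benchmark_gradient(theta))
             for _ in range(replications)]
    return sum(dists) / len(dists)
```

Replacing `spsa_gradient` with a likelihood-ratio estimator over two sampled trajectories would yield the corresponding TAMAR column of the comparison.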
Table 2 \(\ell _2\) distance between the gradient estimated using either RS-SPSA or TAMAR and a likelihood ratio benchmark: mean and standard error from 100 replications shown before and after ±, respectively
Policy  TAMAR  RS-SPSA

\(\theta ^{(i)}=0.5, \, \forall i\)  \(655.77 \pm 18.65\)  \(142.1 \pm 9.56\)
\(\theta ^{(i)}=1, \, \forall i\)  \(694.99 \pm 16.67\)  \(149.82 \pm 10.25\)
\(\theta ^{(i)}=2, \, \forall i\)  \(720.99 \pm 14.85\)  \(146.67 \pm 9.31\)
\(\theta ^{(i)}=5, \, \forall i\)  \(941.53 \pm 25.39\)  \(200.08 \pm 13.25\)
\(\theta ^{(i)}=7, \, \forall i\)  \(1167.78 \pm 37.14\)  \(210.73 \pm 12.97\)
\(\theta ^{(i)}=10, \, \forall i\)  \(1489.32 \pm 43.43\)  \(277.15 \pm 11.93\)
From the mean and standard errors presented in Table 2 for six different policies, it is evident that RS-SPSA produces more accurate estimates of the policy gradient than TAMAR, which explains its faster convergence (compared to TAMAR) in the experiments of Fig. 8. The trend did not change when the true gradient was estimated from a larger number of trajectories. In particular, with \(T_1=5000\) (see Step 1 above), the relevant \(\ell _2\) distances for TAMAR and RS-SPSA were observed to be \((683.06 \pm 26.75)\) and \((143.02 \pm 14.44)\), respectively, for the policy \(\theta ^{(i)}=1, \forall i\).
10 Conclusions and future work
We proposed novel actor-critic algorithms for control in risk-sensitive discounted and average reward MDPs. All our algorithms involve a TD critic on the fast timescale, a policy gradient (actor) on the intermediate timescale, and a dual ascent for Lagrange multipliers on the slowest timescale. In the discounted setting, we pointed out the difficulty in estimating the gradient of the variance of the return and incorporated simultaneous perturbation based SPSA and SF approaches for gradient estimation in our algorithms. The average setting, on the other hand, allowed the actor to employ compatible features to estimate the gradient of the variance. We provided proofs of convergence to locally (risk-sensitive) optimal policies for all the proposed algorithms. Further, using a traffic signal control application, we observed that our algorithms resulted in empirically lower variance than their risk-neutral counterparts.
As future work, it would be interesting to develop a risk-sensitive algorithm that uses a single trajectory in the discounted setting. An orthogonal direction of future research is to obtain finite-time bounds on the quality of the solution obtained by our algorithms. As mentioned earlier, this is challenging because, to the best of our knowledge, there are no convergence rate results available for multi-timescale stochastic approximation schemes, and hence for actor-critic algorithms.
Footnotes
 1.
This paper is an extension of an earlier work by the authors (Prashanth and Ghavamzadeh 2013) and includes novel second-order methods in the discounted setting, detailed proofs for all proposed algorithms, and additional experimental results.
 2.
Our algorithms can be easily extended to a setting where the initial state is determined by a distribution.
 3.
Henceforth, we shall drop the subscript \(\theta \) and use \(\nabla L(\theta ,\lambda )\) to denote the derivative w.r.t. \(\theta \).
 4.
We extend this to the case of variance-constrained MDPs in Sect. 6.
 5.
By an abuse of notation, we use \(v_n\) (resp. \(v^+_n, u_n, u^+_n\)) to denote the critic parameter \(v_{m_n}\) (resp. \(v^+_{m_n}, u_{m_n}, u^+_{m_n}\)) obtained at the end of a trajectory of length \(m_n\).
 6.
Similar to the discounted setting, the risk-sensitive average reward algorithm proposed in this paper can be easily extended to other risk measures based on the long-term variance of \(\mu \), including the Sharpe ratio (SR), i.e., \(\max _\theta \rho (\theta )/\sqrt{\varLambda (\theta )}\). The extension to SR will be described in more detail in Sect. 3.
 7.
For notational convenience, we drop the dependence of \(\bar{v}\) and \(\bar{u}\) on the underlying policy parameter \(\theta \); this dependence should be clear from the context.
 8.
We would like to point out that the experimental setting involves ‘costs’ rather than ‘rewards’, and the algorithms implemented should be understood as optimizing a negative reward.
 9.
The AJWT performance of the other algorithms in the discounted setting is similar and the corresponding plots are omitted here.
Notes
Acknowledgments
This work was supported in part by the National Science Foundation (NSF) under Grants CMMI-1434419, CNS-1446665, and CMMI-1362303, and by the Air Force Office of Scientific Research (AFOSR) under Grant FA9550-15-1-0050.
References
 Altman, E. (1999). Constrained Markov decision processes (Vol. 7). Boca Raton: CRC Press.
 Barto, A., Sutton, R., & Anderson, C. (1983). Neuron-like elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man and Cybernetics, 13, 835–846.
 Basu, A., Bhattacharyya, T., & Borkar, V. (2008). A learning algorithm for risk-sensitive cost. Mathematics of Operations Research, 33(4), 880–898.
 Baxter, J., & Bartlett, P. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15, 319–350.
 Bertsekas, D. (1995). Dynamic programming and optimal control. Belmont, MA: Athena Scientific.
 Bertsekas, D. (1999). Nonlinear programming. Belmont, MA: Athena Scientific.
 Bertsekas, D., & Tsitsiklis, J. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.
 Bhatnagar, S. (2005). Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization. ACM Transactions on Modeling and Computer Simulation, 15(1), 74–107.
 Bhatnagar, S. (2007). Adaptive Newton-based multivariate smoothed functional algorithms for simulation optimization. ACM Transactions on Modeling and Computer Simulation, 18(1), 1–35.
 Bhatnagar, S. (2010). An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes. Systems & Control Letters, 59(12), 760–766.
 Bhatnagar, S., & Lakshmanan, K. (2012). An online actor-critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications, 153(3), 688–708.
 Bhatnagar, S., Fu, M., Marcus, S., & Wang, I. (2003). Two-timescale simultaneous perturbation stochastic approximation using deterministic perturbation sequences. ACM Transactions on Modeling and Computer Simulation, 13(2), 180–209.
 Bhatnagar, S., Sutton, R., Ghavamzadeh, M., & Lee, M. (2007). Incremental natural actor-critic algorithms. In Proceedings of advances in neural information processing systems (Vol. 20, pp. 105–112).
 Bhatnagar, S., Sutton, R., Ghavamzadeh, M., & Lee, M. (2009a). Natural actor-critic algorithms. Automatica, 45(11), 2471–2482.
 Bhatnagar, S., Sutton, R., Ghavamzadeh, M., & Lee, M. (2009b). Natural actor-critic algorithms. Technical report TR09-10, Department of Computing Science, University of Alberta.
 Bhatnagar, S., Hemachandra, N., & Mishra, V. (2011). Stochastic approximation algorithms for constrained optimization via simulation. ACM Transactions on Modeling and Computer Simulation, 21(3), 15.
 Bhatnagar, S., Prasad, H., & Prashanth, L. (2013). Stochastic recursive algorithms for optimization (Vol. 434). Berlin: Springer.
 Borkar, V. (2001). A sensitivity formula for the risk-sensitive cost and the actor-critic algorithm. Systems & Control Letters, 44, 339–346.
 Borkar, V. (2002). Q-learning for risk-sensitive control. Mathematics of Operations Research, 27, 294–311.
 Borkar, V. (2005). An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 54(3), 207–213.
 Borkar, V. (2008). Stochastic approximation: A dynamical systems viewpoint. Cambridge: Cambridge University Press.
 Borkar, V. (2010). Learning algorithms for risk-sensitive control. In Proceedings of the nineteenth international symposium on mathematical theory of networks and systems (pp. 1327–1332).
 Borkar, V. S., & Meyn, S. P. (2000). The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2), 447–469.
 Chen, H., Duncan, T., & Pasik-Duncan, B. (1999). A Kiefer–Wolfowitz algorithm with randomized differences. IEEE Transactions on Automatic Control, 44(3), 442–453.
 Delage, E., & Mannor, S. (2010). Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1), 203–213.
 Dippon, J., & Renz, J. (1997). Weighted means in stochastic approximation of minima. SIAM Journal on Control and Optimization, 35(5), 1811–1827.
 Fabian, V. (1968). On asymptotic normality in stochastic approximation. The Annals of Mathematical Statistics, 39, 1327–1332.
 Filar, J., Kallenberg, L., & Lee, H. (1989). Variance-penalized Markov decision processes. Mathematics of Operations Research, 14(1), 147–161.
 Filar, J., Krass, D., & Ross, K. (1995). Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1), 2–10.
 Gill, P., Murray, W., & Wright, M. (1981). Practical optimization. London: Academic Press.
 Howard, R., & Matheson, J. (1972). Risk-sensitive Markov decision processes. Management Science, 18(7), 356–369.
 Katkovnik, V., & Kulchitsky, Y. (1972). Convergence of a class of random search algorithms. Automation and Remote Control, 8, 81–87.
 Konda, V., & Tsitsiklis, J. (2000). Actor-critic algorithms. In Proceedings of advances in neural information processing systems (Vol. 12, pp. 1008–1014).
 Konda, V. R., & Tsitsiklis, J. N. (2004). Convergence rate of linear two-timescale stochastic approximation. Annals of Applied Probability, 14(2), 796–819.
 Korda, N., & Prashanth, L. (2015). On TD(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. In International conference on machine learning (ICML).
 Kushner, H., & Clark, D. (1978). Stochastic approximation methods for constrained and unconstrained systems. Berlin: Springer.
 Mannor, S., & Tsitsiklis, J. (2011). Mean–variance optimization in Markov decision processes. In Proceedings of the twenty-eighth international conference on machine learning (pp. 177–184).
 Mannor, S., & Tsitsiklis, J. N. (2013). Algorithmic aspects of mean–variance optimization in Markov decision processes. European Journal of Operational Research, 231(3), 645–653.
 Marbach, P. (1998). Simulation-based methods for Markov decision processes. Ph.D. thesis, Massachusetts Institute of Technology.
 Mas-Colell, A., Whinston, M., & Green, J. (1995). Microeconomic theory. Oxford: Oxford University Press.
 Mihatsch, O., & Neuneier, R. (2002). Risk-sensitive reinforcement learning. Machine Learning, 49(2), 267–290.
 Milgrom, P., & Segal, I. (2002). Envelope theorems for arbitrary choice sets. Econometrica, 70(2), 583–601.
 Nilim, A., & Ghaoui, L. E. (2005). Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5), 780–798.
 Peters, J., Vijayakumar, S., & Schaal, S. (2005). Natural actor-critic. In Proceedings of the sixteenth European conference on machine learning (pp. 280–291).
 Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4), 838–855.
 Prashanth, L., & Bhatnagar, S. (2011). Reinforcement learning with function approximation for traffic signal control. IEEE Transactions on Intelligent Transportation Systems, 12(2), 412–421.
 Prashanth, L., & Bhatnagar, S. (2012). Threshold tuning using stochastic optimization for graded signal control. IEEE Transactions on Vehicular Technology, 61(9), 3865–3880.
 Prashanth, L., & Ghavamzadeh, M. (2013). Actor-critic algorithms for risk-sensitive MDPs. In Proceedings of advances in neural information processing systems (Vol. 26, pp. 252–260).
 Prashanth, L., Jie, C., Fu, M., Marcus, S., & Szepesvari, C. (2016). Cumulative prospect theory meets reinforcement learning: Prediction and control. In Proceedings of the 33rd international conference on machine learning (pp. 1406–1415).
 Puterman, M. (1994). Markov decision processes: Discrete stochastic dynamic programming. London: Wiley.
 Ruppert, D. (1991). Stochastic approximation. In B. K. Ghosh & P. K. Sen (Eds.), Handbook of sequential analysis (pp. 503–529). New York: Marcel Dekker.
 Ruszczyński, A. (2010). Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125, 235–261.
 Schweitzer, P. J. (1968). Perturbation theory and finite Markov chains. Journal of Applied Probability, 5, 401–413.
 Sharpe, W. (1966). Mutual fund performance. Journal of Business, 39(1), 119–138.
 Shen, Y., Stannat, W., & Obermayer, K. (2013). Risk-sensitive Markov control processes. SIAM Journal on Control and Optimization, 51(5), 3652–3672.
 Sion, M. (1958). On general minimax theorems. Pacific Journal of Mathematics, 8(1), 171–176.
 Sobel, M. (1982). The variance of discounted Markov decision processes. Journal of Applied Probability, 19, 794–802.
 Spall, J. (1992). Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3), 332–341.
 Spall, J. (1997). A one-measurement form of simultaneous perturbation stochastic approximation. Automatica, 33(1), 109–112.
 Spall, J. (2000). Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control, 45(10), 1839–1853.
 Styblinski, M. A., & Opalski, L. J. (1986). Algorithms and software tools for IC yield optimization based on fundamental fabrication parameters. IEEE Transactions on Computer-Aided Design, CAD-5(1), 79–89.
 Sutton, R. (1984). Temporal credit assignment in reinforcement learning. Ph.D. thesis, University of Massachusetts Amherst.
 Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
 Sutton, R., & Barto, A. (1998). Reinforcement learning: An introduction. Cambridge: MIT Press.
 Sutton, R., McAllester, D., Singh, S., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Proceedings of advances in neural information processing systems (Vol. 12, pp. 1057–1063).
 Tamar, A., & Mannor, S. (2013). Variance adjusted actor critic algorithms. arXiv:1310.3697.
 Tamar, A., Di Castro, D., & Mannor, S. (2012). Policy gradients with variance related risk criteria. In Proceedings of the twenty-ninth international conference on machine learning (pp. 387–396).
 Tamar, A., Di Castro, D., & Mannor, S. (2013a). Policy evaluation with variance related risk criteria in Markov decision processes. arXiv:1301.0104.
 Tamar, A., Di Castro, D., & Mannor, S. (2013b). Temporal difference methods for the variance of the reward to go. In Proceedings of the thirtieth international conference on machine learning (pp. 495–503).
 Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5), 674–690.
 Wiering, M., Vreeken, J., van Veenen, J., & Koopman, A. (2004). Simulation and optimization of traffic in a city. In IEEE intelligent vehicles symposium (pp. 453–458).
 Williams, R. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, 229–256.
 Xu, H., & Mannor, S. (2012). Distributionally robust Markov decision processes. Mathematics of Operations Research, 37(2), 288–300.