1 Introduction

Over the last few years, reinforcement learning has witnessed enormous progress in applications such as intelligent control [1], autonomous driving [2], and strategy games [3]. Compared with single-agent reinforcement learning, multi-agent reinforcement learning (MARL) tasks are more challenging in the sense that agents in such scenarios interact not only with the environment, but also with other agents. In this paper, we study MARL algorithms with local rewards and actions, where the agents aim to cooperatively perform policy prediction and improvement over connected communication graphs. In these cases, the agents share a joint state whose transition depends on the local actions of the individual agents, while the rewards remain local. To learn the policy and achieve globally optimal estimation, the agents need to share their local information with their connected neighbors. Note that the multi-agent problems in such a setting cannot be solved by simply applying a single-agent approach to each agent independently, as an agent cannot learn its own policy without communicating with others to obtain accurate estimates of the global joint state value.

A trivial distributed approach could use a central controller to broadcast the global joint state value estimate to all the agents at each iteration. Nevertheless, because of the use of the central controller, this algorithm structure is vulnerable to the breakdown of the central node and can raise privacy concerns. Furthermore, this approach does not scale well and faces challenges in large-scale applications due to constraints on memory and computation. To improve scalability and robustness, recent articles have proposed fully decentralized actor-critic methods, where each agent only communicates with its neighbors over a network (e.g., Refs. [4, 5]). Communication-efficient and resilient algorithms have also been developed in Refs. [6, 7]. However, these works mainly focus on on-policy reinforcement learning methods, which potentially face challenges in exploration capability and sample complexity [8, 9].

On the other hand, off-policy multi-agent reinforcement learning is an attractive approach due to its generality and the potential for efficiency in distributed off-policy evaluation and control (see, e.g., Refs. [10,11,12,13]). To this end, Ref. [10] proposes a distributed off-policy actor-critic method with policy consensus where the local value functions are independent, although the algorithm’s convergence to the solution of the Bellman equation is not guaranteed. Moreover, Ref. [11] develops a multi-agent decomposed policy gradient method that could narrow the performance gap between multi-agent policy gradient (MAPG) methods and multi-agent value-based approaches. Besides, Ref. [12] considers the application of policy iteration-based algorithms in adaptive output synchronization of heterogeneous multi-agent systems. Reference [13] designs two distributed policy evaluation algorithms for Markov decision processes (MDPs) based on a special case of the single-agent gradient temporal difference learning proposed in Ref. [14], by incorporating consensus-based collaborations between agents. Recently, Ref. [15] has proposed a fully decentralized actor-critic MARL framework that learns individual policies separately for decentralization. Unlike on-policy methods, an off-policy agent aims to perform policy evaluation or policy improvement on a given target policy while generating trajectories according to a different behavior policy. Thus, reweighting the updates is often needed to compensate for the bias caused by the policy difference. Such importance sampling-based methods have already been widely used in single-agent off-policy algorithms [14, 16,17,18,19].

In this paper, we develop a multi-agent online stochastic gradient update for an off-policy actor-critic method. The proposed algorithm is incremental and scales linearly with the size of approximation features. We make the following main contributions:

  • We propose the first fully gradient descent-based multi-agent actor-critic approach for off-policy settings, where the critic uses a GTD-type paradigm and the actor adopts a policy gradient method derived from the full gradients of the joint objective function.

  • Under standard conditions, we prove that the critic updates of the agents gradually reach a consensus that is the TD solution of the projected Bellman equation and the actor updates converge to the set of asymptotically stable fixed points.

  • We introduce a non-trivial generalization of the multi-agent Boyan’s chain problem where the state value function is still linearly representable with a feature map as in the single-agent Boyan’s chain problem. Experimental results demonstrate the improved performance of the proposed algorithm over the state-of-the-art baseline algorithm.

2 Related Work

Single-Agent Setting Gradient temporal difference (GTD) methods including the linear temporal difference with gradient correction (TDC) [14, 16, 20], emphatic temporal difference (ETD) approaches [18], and off-policy actor-critic methods [19] are well-known off-policy extensions of the temporal difference methods for policy evaluation and control for a single agent. Among these, in Ref. [19], Degris et al. propose a method using gradient-based TD learning as a critic and policy gradient as an actor. However, in the derivation of the actor update, a semi-gradient is used to approximate the full gradient. This introduces additional bias, and as a result the convergence and stability of the algorithm are undermined [21]. This issue is partially addressed by subsequent works including Refs. [21,22,23] in single-agent settings, but the convergence is not directly established. Although our primary focus is the multi-agent setting, we start our work by developing an incremental and online stochastic gradient update for actor-critic methods in the single-agent setting which adopts the full gradient. We prove that this is guaranteed to converge under reasonable assumptions.

Multi-agent setting There have been recent advances in the development and analysis of decentralized multi-agent off-policy actor-critic methods [10,11,12,13, 15, 24, 25]. In contrast to the works including Refs. [4, 5, 15, 24] which use a critic consensus step, Zhang et al. introduced in Ref. [10] a distributed off-policy actor-critic method using a consensus step for the policy parameters. The analysis of Ref. [10] considers a special form of local value functions and ignores the biases caused by the difference between the behavior policy and the target policy in off-policy scenarios. Moreover, Ref. [11] proposes a multi-agent policy gradient (MAPG) algorithm with improved performance based on a decomposed critic assumption. However, their algorithm adopts a centralized critic structure instead of a fully decentralized one; the latter is usually more general in applications as it does not require a central node that collects critic information from each agent. Furthermore, Ref. [13] proposes multi-agent generalizations of the off-policy GTD and TDC algorithms in Markov decision processes and provides a theoretical guarantee of their weak convergence to a consensus point. Nevertheless, the method only considers policy evaluation problems and is not applicable to general control problems. Although its authors later developed a type of multi-agent actor-critic algorithm in Ref. [25], they focus on a different scenario where multiple agents are essentially completing a single-agent task with different behaviors: the agents reach a consensus policy that generates only one action to be implemented by a single actor for solving the single-agent problem. In contrast, the multiple agents considered in our method take different actions cooperatively to solve multi-agent problems based on their local policies. Recently, in Ref. [26], Chen et al. have conducted a finite-time analysis of a multi-agent temporal difference (TD) learning method, but they only focus on policy evaluation tasks. In Ref. [15], a decentralized approximate actor-critic MARL framework is developed with a primal-dual optimization to guarantee full decentralization and scalability. However, they only consider the multi-agent generalization framework for TD methods, which is proven to diverge in the seminal counterexamples introduced by Ref. [27], where GTD and TDC methods are still sound and convergent. As a result, their multi-agent framework is not applicable to the more stable gradient-based off-policy actor-critic methods in this paper. Moreover, Suttle et al. study a decentralized version of an off-policy actor-critic algorithm using emphatic temporal difference learning with a critic variable consensus step and analyze its convergence under linear function approximation [24]. Compared with Ref. [24], our proposed algorithm is a novel gradient-based multi-agent off-policy actor-critic algorithm that adopts a consensus step for the critic estimates, with its critic derived from gradient descent on the corresponding projected Bellman error function. Therefore, the proposed algorithm is fully incremental and is likely to exhibit improved stability compared to ETD-based methods [28]. This is corroborated by the experimental results provided in Sect. 6. The main features of the related works are compared in Table 1.

Table 1 Comparison between this paper and the related works

3 Notation and Problem Setting

To facilitate the analysis of multi-agent algorithms, researchers usually study a Markov decision process (MDP) as a testbed for real-world reinforcement learning problems that can showcase the properties (such as convergence and stability) of their MARL algorithms (see, e.g., Refs. [4,5,6,7, 10, 11, 13,14,15, 24,25,26, 29]). Here, we first consider a Markov decision process for a single agent with a discrete state space \({\mathcal {S}}\), a discrete action space \({\mathcal {A}}\), and a transition probability function \(P: {\mathcal {S}}\times {\mathcal {S}} \times {\mathcal {A}} \rightarrow [0, 1]\), where \(P(s^\prime |s, a)\) denotes the probability of moving into the next state \(s^{\prime }\) from the current state s after taking action a. Moreover, an expected reward function \({\mathcal {R}}: {\mathcal {S}} \times {\mathcal {A}} \times {\mathcal {S}} \rightarrow {\mathbb {R}}\) gives an expected reward for any given transition triple \((s, a, s^{\prime })\). In the decision process, we observe a stream of data, consisting of states \(s_{t} \in {\mathcal {S}},\) actions \(a_{t} \in {\mathcal {A}},\) and rewards \(r_{t} \in {\mathbb {R}}\), for time steps \(t=1,2, \ldots \), with actions being selected according to a fixed behavior policy, \(\mu (a | s) \in [0,1]\). Given a (possibly state-dependent) discount-rate parameter \(\gamma : {\mathcal {S}} \rightarrow [0,1]\) [28], the value function for policy \(\pi : {\mathcal {S}} \times {\mathcal {A}} \rightarrow [0,1]\) is defined as

$$\begin{aligned} V^{\pi , \gamma }(s) := {\mathbb {E}}\Bigg [\sum _{p=1}^\infty \Bigg (\prod _{q=1}^{p-1}\gamma (s_{t+q})\Bigg ) r_{t+p} \,\Bigg |\, s_t = s,\pi \Bigg ]. \end{aligned}$$
(1)

For all \(a \in {\mathcal {A}}\) and for all \(s \in {\mathcal {S}}\), we define the action-value function, \(Q^{\pi , \gamma }(s, a)\) to be

$$\begin{aligned} Q^{\pi , \gamma }(s, a)&= \sum _{s^{\prime } \in {\mathcal {S}}} P\left( s^{\prime } | s, a\right) \left[ {\mathcal {R}}\left( s, a, s^{\prime }\right) \right. \nonumber \\&\quad \left. +\gamma \left( s^{\prime }\right) V^{\pi , \gamma }\left( s^{\prime }\right) \right] . \end{aligned}$$
(2)

It can be shown that \(V^{\pi , \gamma }(s)=\) \(\sum _{a \in {\mathcal {A}}} \pi (a | s) Q^{\pi , \gamma }(s, a),\) for all \(s \in {\mathcal {S}}\). In off-policy settings, one wants to learn about a target policy \(\pi \), while generating actions according to the behavior policy \(\mu \), which is often chosen to achieve better exploration of the state and action spaces.

The policy \(\pi _{\theta }: {\mathcal {A}} \times {\mathcal {S}} \rightarrow [0,1]\) denotes a differentiable function with a parameter vector, \(\theta \in {\mathbb {R}}^{N_{\theta }}, N_{\theta } \in {\mathbb {N}}\). In the decision process, we aim to choose \(\theta \) to maximize the objective function defined as

$$\begin{aligned} J_{\gamma }(\theta ) :=\sum _{s \in {\mathcal {S}}} d^{\mu }(s) V^{\pi _{\theta }, \gamma }(s), \end{aligned}$$
(3)

where we use \(d^{\mu }(s):=\lim _{t \rightarrow \infty } P\left( s_{t}=s | s_{0}, \mu \right) \) to represent the limiting distribution of states, and \(P\left( s_{t}=s | s_{0}, \mu \right) \) denotes the probability that \(s_{t}=s\) under behavior policy \(\mu \) starting from \(s_{0}\). Here, the objective function is weighted by \(d^{\mu }\) because actions are generated by following the behavior policy in the off-policy setting. When there is no confusion, we can write \(\pi \) to represent \(\pi _{\theta }\).

In the multi-agent setting, we consider a set of agents \({\mathcal {N}}:= \{1, 2, \ldots , N\}\). Define \(\{{\mathcal {G}}_t\}_{t\in {\mathbb {N}}}:= \{({\mathcal {N}}, {\mathcal {E}}_t)\}_{t\in {\mathbb {N}}}\) as a sequence of connected graphs on these agents over time, where \({\mathcal {E}}_t\) denotes the edge set at time t. Specifically, \((j, i)\) is an edge in \({\mathcal {E}}_t\) whenever there is communication between agent i and agent j. We assume that the communication is symmetric so that the communication graph is undirected. Let \({\mathcal {S}}\) be the joint state space, \({\mathcal {A}}^{i}\) be the action space, and \({\mathcal {R}}^{i}\) be the reward space for agent i, for \(i = 1,\dots , N\). A joint action can be written as \(a:=\left( a^{1}, a^{2}, \ldots , a^{N}\right) \in {\mathcal {A}}^{1} \times {\mathcal {A}}^{2} \times \cdots \times {\mathcal {A}}^{N}\), and a joint reward as \(r:=\left( r^{1}, r^{2}, \ldots , r^{N}\right) \in {\mathcal {R}}^{1} \times {\mathcal {R}}^{2} \times \cdots \times {\mathcal {R}}^{N}\). Here, the joint state s is observable by all agents, while the reward \(r^{i}\) and the action \(a^{i}\) are private for agent i.

In this setting, the agents’ states are coupled by the joint state transition matrix \(P(\cdot | \cdot , a) \in {\mathbb {R}}^{|{\mathcal {S}}| \times |{\mathcal {S}}|}\) defined above with a joint action a. As a motivation, this scenario arises in a wide variety of multi-agent applications such as mobile sensing networks [30], robotics [31], and power grids [32, 33]. Assume each agent \(i \in {\mathcal {N}}\) has its own local behavior policy \(\mu ^{i}: {\mathcal {A}}^{i} \times {\mathcal {S}} \rightarrow [0,1]\). For each \(i \in {\mathcal {N}},\) let \(\pi _{\theta ^{i}}^{i}: {\mathcal {A}}^{i} \times {\mathcal {S}} \rightarrow [0,1]\) be the local target policy function with parameter \(\theta ^{i} \in \Theta ^{i}\), where \(\Theta ^{i}\) is a compact subset of \({\mathbb {R}}^{m_{i}}\). Further suppose that each \(\pi _{\theta ^{i}}^{i}\) is continuously differentiable with respect to \(\theta ^{i}\) and set \(\theta =\left[ ({\theta ^{1}})^{T}, \ldots , ({\theta ^{N}})^{T}\right] ^{T}\). Then, the global behavior and target policies can be written as

$$\begin{aligned} \mu&=\prod _{i=1}^{N} \mu ^{i}: \Big (\prod _{i=1}^{N}{\mathcal {A}}^{i}\Big )\times {\mathcal {S}} \rightarrow [0,1], \end{aligned}$$
(4)
$$\begin{aligned} \pi _{\theta }&=\prod _{i=1}^{N} \pi _{\theta ^{i}}^{i}: \Big (\prod _{i=1}^{N}{\mathcal {A}}^{i}\Big ) \times {\mathcal {S}} \rightarrow [0,1] . \end{aligned}$$
(5)

In the multi-agent decision process, we aim to maximize the objective function of (3) for the global behavior and target policies in (4) and (5). Note that actor-critic algorithms are essentially the combination of a policy improvement structure and a critic algorithm. The policy improvement structure is used to update the current policy based on the state value (or action value) that is estimated by the critic algorithm. Then, the critic algorithm computes the state value to evaluate the action made by the actor under the current policy. The alternating process continues until convergence to a stationary fixed point, which is a candidate for the optimal solution.

Assume that if \(\pi _{\theta ^{i}}^{i}\left( a^{i} | s\right) >0\), then \(\mu ^{i}\left( a^{i} | s\right) >0\) for all \(i \in {\mathcal {N}}\), \(\left( a^{i}, s\right) \in {\mathcal {A}}^{i} \times {\mathcal {S}}\), and \(\theta ^{i} \in \Theta ^{i}\). Moreover, for all \(\theta \in \Theta \), suppose that the Markov chains obtained by following \(\pi _{\theta }\) and \(\mu \) are irreducible and aperiodic, and vectors \(d^{\pi _{\theta }}, d^{\mu } \in [0,1]^{|{\mathcal {S}}|}\) represent their steady-state distributions, respectively. For an integer \(m\in {\mathbb {N}}\), we write [m] as shorthand for the set \(\{1, 2, \cdots , m \}\). The definitions of the acronyms used in this paper are summarized in Table 2.

4 Multi-agent Gradient-Based Off-Policy Actor-Critic (MGOPAC) Algorithm

In this section, we first explain the basic theoretical ideas underlying the Gradient-TD methods for the critic updates in the single-agent setting. Then, we derive the actor updates using a full gradient method with emphatic weights to produce a backward-view mechanistic algorithm with eligibility traces. Finally, the single-agent updates are extended to the multi-agent setting, allowing us to specify the actor-critic steps.

Table 2 Acronym definition table

4.1 Single-Agent Critic Algorithm

For value function estimation, we first define the \(\lambda \)-return (bootstrapped return) as

$$\begin{aligned} G_{t}^{\lambda }(V):= r_{t+1}+\gamma _{t+1}\left[ (1-\lambda ) V\left( s_{t+1}\right) +\lambda G_{t+1}^{\lambda }(V)\right] , \end{aligned}$$
(6)

where \(\lambda \in [0,1]\) denotes a constant eligibility trace parameter, \(\gamma _{t+1}\) is the discount factor at state \(s_{t+1}\), and \(V(s_{t+1})\) is the corresponding value function. In the following, we introduce the \(\lambda \)-weighted Bellman equation. A more detailed description is provided in [34].

First, let us consider the linear function approximation of the state value function: \(V^{\pi , \gamma }(s) \approx V_{{\textbf{v}}}(s):={\textbf{v}}^{\top } \phi (s)\), where \(\phi (s) \in {\mathbb {R}}^{N_{{\textbf{v}}}}, N_{{\textbf{v}}} \in {\mathbb {N}},\) is the feature vector of state s,  and \({\textbf{v}} \in {\mathbb {R}}^{N_{{\textbf{v}}}}\) is the weight vector. Then, the projection operator \(\Pi \) to the linear space can be written as [28]

$$\begin{aligned} \Pi = \Phi (\Phi ^\top D \Phi )^{-1} \Phi ^\top D, \end{aligned}$$
(7)

where D is an \(|{\mathcal {S}}|\times |{\mathcal {S}}|\) diagonal matrix with \(d^\mu (s)\), the stationary probability of visiting state s under \(\mu \), as its diagonal elements; and \(\Phi \) is the \(|{\mathcal {S}}| \times N_{{\textbf{v}}}\) matrix whose rows are the feature vectors \(\phi (s)^\top \), \(s \in {\mathcal {S}}\). The Gradient-TD method was first proposed by Sutton et al. [35]. It incrementally learns the weights \({\textbf{v}}\), and the auxiliary variables \({\textbf{w}}\), under an off-policy protocol, with guaranteed stability and linear per-time-step complexity. This type of method not only works in the off-policy setting, but also achieves convergence comparable to conventional TD methods on on-policy problems.
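As a small numerical illustration of the projection in Eq. (7), the following NumPy sketch builds \(\Pi \) from a toy feature matrix and visitation distribution; the arrays Phi and d_mu are hypothetical placeholders, not quantities taken from the paper's experiments.

```python
import numpy as np

# Hypothetical toy example: 4 states, 2 features per state.
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0],
                [0.2, 0.8]])            # rows are phi(s)^T
d_mu = np.array([0.4, 0.3, 0.2, 0.1])   # stationary distribution under mu
D = np.diag(d_mu)

# Projection onto the span of the features, weighted by d^mu (Eq. (7)).
Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

# Pi is idempotent: projecting an already-projected vector changes nothing.
assert np.allclose(Pi @ Pi, Pi)
```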

We first introduce the derivation of the critic updates in the single-agent setting based on the GTD(\(\lambda \)) algorithm [16], which can be viewed as an extension of the linear TD method with gradient correction (TDC) [14] to eligibility traces. Define the \(\lambda \)-weighted Bellman operator for policy \(\pi \) as

$$\begin{aligned} \begin{aligned} \left( T_{\pi }^{\lambda , \gamma } V^{\pi , \gamma }\right) (s) := {\mathbb {E}}\left[ G_{t}^{\lambda }\left( V^{\pi , \gamma }\right) | s_{t}=s, \pi \right] . \end{aligned} \end{aligned}$$
(8)

The objective is to find the off-policy TD solution, \({\textbf{v}}\), such that the following projected Bellman equation holds

$$\begin{aligned} V_{{\textbf{v}}}=\Pi T_{\pi }^{\lambda , \gamma } V_{{\textbf{v}}}, \end{aligned}$$
(9)

while the data are generated according to a behavior policy \(\mu \). To find the solution of this equation, we minimize the mean-square projected Bellman error (MSPBE) objective function:

$$\begin{aligned} \min _{{\textbf{v}}} ~ {\text {MSPBE}}({\textbf{v}}) :=\left\| V_{{\textbf{v}}}-\Pi T_{\pi }^{\lambda , \gamma } V_{{\textbf{v}}}\right\| ^{2}_{d^{\mu }}. \end{aligned}$$
(10)

To derive an update rule for solving this problem, let us start by introducing the following definitions:

$$\begin{aligned}&\phi _{t} \equiv \phi \left( s_{t}\right) ~~ \text {and} ~~ \gamma _t \equiv \gamma (s_{t})\,, \nonumber \\&G_{t}^{\lambda }({\textbf{v}}) := r_{t+1}+\gamma _{t+1}\left[ (1-\lambda ) {\textbf{v}}^{\top } \phi _{t+1} +\lambda G_{t+1}^{\lambda }({\textbf{v}})\right] \,, \nonumber \\&\delta _{t}^{\lambda }({\textbf{v}}) := G_{t}^{\lambda }({\textbf{v}})-{\textbf{v}}^{\top } \phi _{t}, \end{aligned}$$
(11)

and an operation \({\mathcal {P}}_{\mu }^{\pi }\):

$$\begin{aligned} {\mathcal {P}}_{\mu }^{\pi } \delta _{t}^{\lambda }({\textbf{v}}) \phi _{t} := \sum _{s} d^{\mu }(s) {\mathbb {E}}\left[ \delta _{t}^{\lambda }({\textbf{v}}) | s_{t}=s, \pi \right] \phi (s). \end{aligned}$$
(12)

It can be proved that the following identities hold true [16]:

$$\begin{aligned} {\text {MSPBE}}({\textbf{v}})&=\left\| V_{{\textbf{v}}}-\Pi T_{\pi }^{\lambda , \gamma } V_{{\textbf{v}}}\right\| _{d^{\mu }}^{2} \nonumber \\&=\left( {\mathcal {P}}_{\mu }^{\pi } \delta _{t}^{\lambda }({\textbf{v}}) \phi _{t}\right) ^{\top } {\mathbb {E}}\left[ \phi _{t} \phi _{t}^{\top }\right] ^{-1}\left( {\mathcal {P}}_{\mu }^{\pi } \delta _{t}^{\lambda }({\textbf{v}}) \phi _{t}\right) . \end{aligned}$$
(13)
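To make the objective concrete, the sketch below evaluates the MSPBE directly from the definition in Eq. (10) for the special case \(\lambda = 0\) and a constant discount, given explicit (hypothetical) model matrices; it illustrates the quantity being minimized rather than the incremental algorithm derived next.

```python
import numpy as np

def mspbe(v, Phi, d_mu, P_pi, r_pi, gamma):
    """MSPBE(v) = || Phi v - Pi (r_pi + gamma P_pi Phi v) ||^2_{d_mu},
    i.e. Eq. (10) with lambda = 0 and a constant discount (illustration only).
    P_pi and r_pi are the state-transition matrix and expected rewards under pi."""
    D = np.diag(d_mu)
    Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D   # projection, Eq. (7)
    V = Phi @ v                                             # V_v = Phi v
    TV = r_pi + gamma * (P_pi @ V)                          # Bellman operator under pi
    err = V - Pi @ TV
    return err @ D @ err                                    # d^mu-weighted squared norm
```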

Since forward-view equations are not directly implementable as they require future data, we first convert the TD forward-view equation to the mechanistic backward-view one. Define the importance weighting as \(\rho _t:=\frac{\pi \left( a_{t} | s_{t}\right) }{\mu \left( a_{t} | s_{t}\right) }\). Then, by applying a gradient descent update to the reformulated objective function, we obtain the critic algorithm based on \(\textrm{GTD}(\lambda )\) in the single-agent setting (see [16] for detailed derivations):

$$\begin{aligned} {\textbf{v}}_{t+1}&= {\textbf{v}}_{t} + \alpha _{{\textbf{v}}, t}\nonumber \\&\quad \left[ \delta _{t} {\textbf{e}}_{t}-\gamma _{t+1}(1-\lambda )\left( {\textbf{e}}_{t}^{\top } {\textbf{w}}_{t}\right) \phi _{t+1}\right] , \end{aligned}$$
(14)
$$\begin{aligned} {\textbf{w}}_{t+1}&= {\textbf{w}}_{t} + \alpha _{{\textbf{w}}, t}\left[ \delta _{t} {\textbf{e}}_{t}-\left( {\textbf{w}}_{t}^{\top } \phi _{t}\right) \phi _{t}\right] , \end{aligned}$$
(15)

where \(\delta _{t}({\textbf{v}}):=\) \(r_{t+1}+\gamma _{t+1} {\textbf{v}}^{\top } \phi _{t+1}-{\textbf{v}}^{\top } \phi _{t}\) is the conventional TD error; and \({\textbf{e}}_{t}\) denotes the eligibility trace vector at time step t, which is updated recursively by \({\textbf{e}}_{t}=\rho _{t}\left( \phi _{t}+\gamma _t \lambda {\textbf{e}}_{t-1}\right) \).
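A minimal per-step sketch of the GTD(\(\lambda \)) critic updates (14)–(15) in NumPy is given below; the variable names mirror the symbols above, and the inputs (features, reward, importance ratio, discounts, stepsizes) are assumed to be supplied by the environment and behavior policy.

```python
import numpy as np

def gtd_lambda_step(v, w, e, phi_t, phi_next, reward, rho_t,
                    gamma_t, gamma_next, lam, alpha_v, alpha_w):
    """One backward-view GTD(lambda) critic step, Eqs. (14)-(15)."""
    # Eligibility trace: e_t = rho_t * (phi_t + gamma_t * lambda * e_{t-1}).
    e = rho_t * (phi_t + gamma_t * lam * e)
    # Conventional TD error: delta_t = r_{t+1} + gamma_{t+1} v^T phi_{t+1} - v^T phi_t.
    delta = reward + gamma_next * (v @ phi_next) - v @ phi_t
    # Main weight update, Eq. (14).
    v = v + alpha_v * (delta * e - gamma_next * (1.0 - lam) * (e @ w) * phi_next)
    # Auxiliary weight update, Eq. (15).
    w = w + alpha_w * (delta * e - (w @ phi_t) * phi_t)
    return v, w, e
```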

4.2 Off-Policy Policy-Gradient Method for Actor Algorithm

To update the policy variables, we perform gradient ascent on the global policy, which aims for the maximization of the objective function \(J_{\gamma }(\theta )\) as follows:

$$\begin{aligned} \max _{\theta } J_\gamma (\theta ) := \sum _{s\in {\mathcal {S}}} d^\mu (s)\sum _{a\in {\mathcal {A}}}\pi _\theta (a|s)Q^{\pi _\theta , \gamma }(s,a). \end{aligned}$$
(16)

Specifically, the proposed algorithm updates the variables in proportion to the gradient of the objective function:

$$\begin{aligned} \theta _{t+1} -\theta _{t} = \alpha _{\theta , t} \nabla _{\theta } J_{\gamma }\left( \theta _{t}\right) , \end{aligned}$$
(17)

where \(\alpha _{\theta , t} \in {\mathbb {R}}\) is a positive stepsize parameter.

First, it can be shown that the gradient with respect to \(\theta \) is computed as

$$\begin{aligned} \nabla _{\theta } J_{\gamma }(\theta ) =\sum _{s \in {\mathcal {S}}} m(s) \sum _{a \in {\mathcal {A}}} \nabla _{\theta } \pi _\theta (a | s) Q^{\pi , \gamma }(s, a), \end{aligned}$$
(18)

where m(s) is the emphatic weighting at \(s\in {\mathcal {S}}\), with the vector form defined as \(m^\top := (d^\mu )^\top ({\textbf{I}} - {\textbf{P}}_{\theta , \gamma })^{-1}\), where \({\textbf{P}}_{\theta , \gamma }\) is the matrix with entries \({\textbf{P}}_{\theta , \gamma }(s, s^\prime ) = \gamma \sum _{a\in {\mathcal {A}}}\pi _\theta (a| s) P(s^\prime | s, a)\).
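When the model is small and explicit, the emphatic weighting vector can be computed in closed form from this definition. The sketch below assumes a constant discount and hypothetical arrays d_mu (stationary distribution under \(\mu \)), pi (an \(|{\mathcal {S}}|\times |{\mathcal {A}}|\) target policy), and P (an \(|{\mathcal {S}}|\times |{\mathcal {A}}|\times |{\mathcal {S}}|\) transition tensor); none of these come from the paper's experiments.

```python
import numpy as np

def emphatic_weights(d_mu, pi, P, gamma):
    """m^T = (d^mu)^T (I - P_{theta,gamma})^{-1}, where
    P_{theta,gamma}(s, s') = gamma * sum_a pi(a|s) P(s'|s, a) (constant gamma)."""
    n_states = d_mu.shape[0]
    # Discounted state-to-state transition matrix under the target policy.
    P_theta_gamma = gamma * np.einsum('sa,sat->st', pi, P)
    return d_mu @ np.linalg.inv(np.eye(n_states) - P_theta_gamma)
```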

We now derive an incremental update algorithm using observations sampled from the behavior policy. First, we rewrite Eq. (18) at time step t as an expectation:

$$\begin{aligned} {\textbf{g}}(\theta )&:= \nabla _{\theta } J_{\gamma }(\theta )\nonumber \\&=\sum _{s} d^{\mu }(s) \lim _{t\rightarrow \infty } {\mathbb {E}}_{\mu }\left[ M_{t} | s_{t}=s\right] \nonumber \\&\qquad \cdot \sum _{a \in {\mathcal {A}}} \nabla _{\theta } \pi _\theta (a | s) Q^{\pi , \gamma }(s, a)\,,\nonumber \\&=\sum _{s} d^{\mu }(s) \lim _{t\rightarrow \infty } {\mathbb {E}}_{\mu }\Bigg [{\mathbb {E}}_{\mu }\left[ M_{t} | s_{t}=s, a_{t}, s_{t+1}\right] \Bigg ]\nonumber \\&\qquad \cdot {\mathbb {E}}_{\mu }\Bigg [ {\mathbb {E}}_{\mu } \Bigg [ \rho _t\psi (s_t, a_t)Q^{\pi , \gamma }(s_t, a_t) | s_{t}=s, a_{t}, s_{t+1} \Bigg ] \Bigg ]\,,\nonumber \\&={\mathbb {E}}\Bigg [ \lim _{t\rightarrow \infty } {\mathbb {E}}_{\mu }\Bigg [{\mathbb {E}}_{\mu }\Bigg [M_{t} \rho _t\psi (s_t, a_t)Q^{\pi , \gamma }(s_t, a_t) \nonumber \\&\qquad \qquad \qquad \qquad \qquad | s_{t}=s, a_{t}, s_{t+1}\Bigg ]\Bigg ] | s \sim d^{\mu }\Bigg ]\,,\nonumber \\&=\lim _{t\rightarrow \infty }{\mathbb {E}}_{\mu }\Bigg [M_{t} \rho _t\psi (s_t, a_t)Q^{\pi , \gamma }(s_t, a_t)\Bigg ], \end{aligned}$$
(19)

where \(\rho _t:=\frac{\pi _\theta (a_t | s_t)}{\mu (a_t | s_t)}\), \(\psi (s, a):=\frac{\nabla _{\theta } \pi _\theta (a | s)}{\pi _\theta (a | s)} \); \( M_{t}:=\left( 1-\lambda ^\theta \right) +\lambda ^\theta F_{t}\) and \(F_{t}:= \gamma _{t} \rho _{t-1} F_{t-1}+1 \) are the emphatic and follow-on weights [21]; \(\lambda ^\theta \in [0,1]\) is the bootstrapping parameter, which is set to \(\lambda \), defined earlier; and the new notation \({\mathbb {E}}_{\mu }[\cdot ]\) is employed to denote the expectation conditioned implicitly on all the random variables being drawn from their stationary distributions under the behavior policy. Moreover, it is shown in Ref. [36] that introducing an arbitrary function of the state into these equations as a baseline does not affect the expected value. With the state-value function estimation provided by the critic, \({\hat{V}}\), the right hand side of Eq. (19) can be further replaced by \( \lim \nolimits _{t\rightarrow \infty }{\mathbb {E}}_{\mu }\Bigg [M_t\rho _t \psi \left( s_{t}, a_{t}\right) \big (Q^{\pi , \gamma }\left( s_{t}, a_{t}\right) -{\hat{V}}\left( s_{t}\right) \big )\Bigg ]\). By approximating the action value, \(Q^{\pi , \gamma }\left( s_{t}, a_{t}\right) \), with the off-policy \(\lambda ^\theta \)-return, it follows that

$$\begin{aligned} {\textbf{g}}(\theta ) \approx \widehat{{\textbf{g}}(\theta )}=\lim _{t\rightarrow \infty } {\mathbb {E}}_{\mu }\Bigg [\rho _t M_t \psi \left( s_{t}, a_{t}\right) \big (G_{t}^{\lambda ^\theta }-{\hat{V}}\left( s_{t}\right) \big )\Bigg ], \end{aligned}$$

where the off-policy \(\lambda \)-return is defined by: \(G_{t}^{\lambda ^\theta }=r_{t+1} +(1-\lambda ^\theta ) \gamma _{t+1} {\hat{V}}\left( s_{t+1}\right) + \lambda ^\theta \gamma _{t+1} \rho \left( s_{t+1}, a_{t+1}\right) G_{t+1}^{\lambda ^\theta }\).

Similarly for a mechanistic implementation, the forward view should be converted to a backward view that does not involve the future return. By conducting an argument similar to the derivation for the critic update, we have the backward view of the actor update as

$$\begin{aligned} \theta _{t+1}=\theta _{t} + \alpha _{\theta , t} \delta _{t} {\textbf{e}}_{\theta , t}, \end{aligned}$$
(20)

where \(\delta _{t}:=r_{t+1}+\gamma _{t+1} {\hat{V}}\left( s_{t+1}\right) -{\hat{V}}\left( s_{t}\right) \) is the conventional temporal difference error; and \({\textbf{e}}_{\theta , t} \in {\mathbb {R}}^{N_{\theta }}\) denotes the eligibility trace of \(\psi \), updated by \({\textbf{e}}_{\theta , t} =\rho _t \left( M_t\psi \left( s_{t}, a_{t}\right) +\lambda ^\theta {\textbf{e}}_{\theta , t-1}\right) \).
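Combining Eq. (20) with the follow-on and emphatic weights defined after Eq. (19), one actor step can be sketched as follows; here psi_t is the score vector \(\psi (s_t, a_t)\), delta is the TD error supplied by the critic, and the remaining arguments are assumed inputs.

```python
import numpy as np

def actor_step(theta, e_theta, F, psi_t, delta, rho_t, rho_prev,
               gamma_t, lam_theta, alpha_theta):
    """One backward-view off-policy actor step, Eq. (20)."""
    # Follow-on weight F_t = gamma_t * rho_{t-1} * F_{t-1} + 1 and
    # emphatic weight M_t = (1 - lambda^theta) + lambda^theta * F_t.
    F = gamma_t * rho_prev * F + 1.0
    M = (1.0 - lam_theta) + lam_theta * F
    # Eligibility trace of psi: e_{theta,t} = rho_t * (M_t psi_t + lambda^theta e_{theta,t-1}).
    e_theta = rho_t * (M * psi_t + lam_theta * e_theta)
    # Gradient-ascent step on J_gamma.
    theta = theta + alpha_theta * delta * e_theta
    return theta, e_theta, F
```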

4.3 Multi-Agent Actor-Critic Algorithm

With the above single-agent discussion as a foundation, we proceed to introduce our multi-agent actor-critic algorithm. The overall structure is similar to the single-agent version, with two main differences: (i) each agent can only access its own reward information and it is responsible for maintaining its own estimates of the actor and critic updates; and (ii) the agents perform local updates at each time step and then use a consensus process to average their critic estimates over neighbors and to obtain the global importance sampling ratios for the current global policy.

Algorithm 1
The Multi-Agent Gradient Off-Policy Actor-Critic (MGOPAC) Algorithm

4.3.1 Actor Part

We start with the actor part of the multi-agent algorithm. Note that in this setting each agent has its own target policy. Therefore, it is reasonable that each agent only estimates its own policy’s parameter. For an agent i, we consider the gradient of \(J_\gamma \) w.r.t. its local policy parameter \(\theta ^{i}\) instead of the global joint parameter \(\theta \). Specifically, we have

$$\begin{aligned} \nabla _{\theta ^{i}} J_{\gamma }(\theta ) =\sum _{s \in {\mathcal {S}}} m(s) \sum _{a \in {\mathcal {A}}} \nabla _{\theta ^{i}} \pi _\theta (a | s) Q^{\pi , \gamma }(s, a), \end{aligned}$$
(21)

where m(s) is the emphatic weighting for \(s \in {\mathcal {S}}\), as defined above. Analogously, we can derive an incremental update algorithm by sampling observations from the behavior policy. Equation (21) at time step t can be written as an expectation:

$$\begin{aligned} \nabla _{\theta ^{i}} J_{\gamma }(\theta ) ={\mathbb {E}}_{\mu }\Bigg [M_{t} \rho _t\psi ^i(s_t, a^{i}_t)Q^{\pi , \gamma }(s_t, a_t)\Bigg ], \end{aligned}$$
(22)

where \(\psi ^i(s, a^{i}):=\frac{\nabla _{\theta ^{i}} \pi ^i(a^{i} | s)}{\pi ^i(a^{i} | s)}\) and other terms are defined as above. By introducing the approximate state-value function given by the critic as a baseline, the right hand side of Eq. (22) becomes \({\mathbb {E}}_{\mu }\left[ M_t\rho _t \psi ^i\left( s_{t}, a^{i}_{t}\right) \left( Q^{\pi , \gamma }\left( s_{t}, a_{t}\right) -{\hat{V}}\left( s_{t}\right) \right) \right] \). By approximating the action value, \(Q^{\pi , \gamma }\left( s_{t}, a_{t}\right) \), with the off-policy \(\lambda ^\theta \)-return, we can write the forward view of the actor update at agent i as

$$\begin{aligned} \theta ^{i}_{t+1}-\theta ^{i}_{t}=\alpha _{\theta , t} \rho _t M_t \psi ^i\left( s_{t}, a_{t}^i\right) \big (G_{t}^{\lambda ^\theta }-{\hat{V}}\left( s_{t}\right) \big ). \end{aligned}$$

The forward-view update is not directly implementable since it involves the future return. Denote by \(\delta _t^{i}:= r_{t+1}^{i}+\gamma _{t+1} {\hat{V}}^{i}\left( s_{t+1}\right) -{\hat{V}}^{i}\left( s_{t}\right) \) the conventional temporal difference error corresponding to agent i, and let \(\delta _{t}=\frac{1}{N}\sum _{i=1}^N \delta _t^{i}\). By conducting the derivation as above, we have the backward view of the actor update at agent i as

$$\begin{aligned} \theta ^{i}_{t+1}=\theta ^{i}_{t} + \alpha _{\theta , t} \delta _{t} {\textbf{e}}^{i}_{\theta ,t}, \end{aligned}$$
(23)

where \({\textbf{e}}^{i}_{\theta , t} \in {\mathbb {R}}^{N_{\theta }}\) is the local eligibility trace of \(\psi ^{i}\), updated by \({\textbf{e}}^{i}_{\theta ,t} =\rho _t\big (M_t\psi ^{i}(s_{t}, a^{i}_{t})+\lambda ^\theta {\textbf{e}}_{\theta , t-1}^{i}\big )\).
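The per-agent version differs from the single-agent step only in that agent i maintains the trace of its own score \(\psi ^{i}\) and uses the network-averaged TD error; a minimal sketch, assuming the average \(\delta _{t}\) has already been obtained by a consensus step, is given below.

```python
import numpy as np

def local_actor_step(theta_i, e_theta_i, psi_i_t, delta_bar, rho_t, M_t,
                     lam_theta, alpha_theta):
    """Per-agent backward-view actor update, Eq. (23); delta_bar is the
    network average (1/N) sum_i delta_t^i of the local TD errors."""
    # Local eligibility trace of psi^i.
    e_theta_i = rho_t * (M_t * psi_i_t + lam_theta * e_theta_i)
    # Local policy-parameter update.
    theta_i = theta_i + alpha_theta * delta_bar * e_theta_i
    return theta_i, e_theta_i
```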

4.3.2 Critic Part

We now consider the critic part of the multi-agent algorithm, where each agent tries to attain its own estimate of the TD solution of problem (10). Specifically, at time step t, agent i first executes its own action \(a_{t}^{i} \sim \mu ^{i}\left( \cdot | s_{t}\right) \), observes its local reward \(r_{t}^{i}\) and the next state \(s_{t+1}\), and computes its local log importance sampling weight by \(p_{t}^{i}=\log \Bigg [\pi _{\theta _{t}^{i}}^{i}\left( a_{t}^{i} | s_{t}\right) / \mu ^{i}\left( a_{t}^{i} | s_{t}\right) \Bigg ]\). Note that these weights are scalars. Thus, with reasonable communication, the agents on connected graphs can perform an exact consensus algorithm to evaluate the average of the local log importance sampling weights in finite time. We also refer to [37,38,39] for detailed descriptions of these algorithms. After achieving consensus, we have \(p_{t}^{i}=p_{t}^{j} = p_{t}:=\frac{1}{N} \sum _{i=1}^{N} \log \rho _{t}^{i} \) for all \(i, j \in {\mathcal {N}}\). This implies that \(\exp \left( N p_{t}^{i}\right) =\exp \left( \sum _{i=1}^{N} \log \rho _{t}^{i}\right) =\prod _{i=1}^{N} \rho _{t}^{i}=\rho _{t}=\pi _{\theta }\left( a_{t} | s_{t}\right) / \mu \left( a_{t} | s_{t}\right) \). Then, each agent conducts the local critic updates. For an agent i, it first computes the eligibility trace vector at time step t and the importance sampling ratio:

$$\begin{aligned} {\textbf{e}}_{t} = \rho _t \left( \phi (s_t)+\gamma _t \lambda {\textbf{e}}_{t-1} \right) , \quad \rho _{t}=\exp \left( N p_{t}^{i}\right) . \end{aligned}$$
(24)
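The identity behind Eq. (24) — that averaging the log weights and rescaling by N recovers the product of the local ratios — can be checked numerically; the local ratios below are hypothetical values used only for illustration.

```python
import numpy as np

# Hypothetical local importance ratios rho^i = pi^i(a^i|s) / mu^i(a^i|s).
rho_local = np.array([1.2, 0.8, 1.5, 0.9])
N = rho_local.size

# Each agent holds p^i = log(rho^i); exact average consensus yields p_t.
p_t = np.mean(np.log(rho_local))

# Recovering the global ratio: exp(N * p_t) = prod_i rho^i = rho_t.
rho_global = np.exp(N * p_t)
assert np.isclose(rho_global, np.prod(rho_local))
```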

Then, by averaging the local estimates over its neighbors denoted by \({\mathcal {N}}_i\), agent i updates its local critic parameters \({\textbf{v}}^{i}\) and \({\textbf{w}}^{i}\) via

$$\begin{aligned} {\textbf{v}}_{t+1}^{i}&= \hspace{-0.4em}\sum \limits _{j\in {\mathcal {N}}_i}({C_t})_{ij} \big ( {\textbf{v}}_{t}^{j} + \alpha _{{\textbf{v}}, t}\nonumber \\&\quad \Bigg [\delta _t^{j} {\textbf{e}}_{t} -\gamma _{t+1}(1-\lambda )\big (({\textbf{w}}_{t}^{j} )^{\top } {\textbf{e}}_{t} \big ) \phi _{t+1}\Bigg ] \big ), \end{aligned}$$
(25)
$$\begin{aligned} {\textbf{w}}_{t+1}^{i}&= \hspace{-0.4em}\sum \limits _{j\in {\mathcal {N}}_i}({C_t})_{ij} \big ({\textbf{w}}_{t}^{j} +\alpha _{{\textbf{w}}, t} \Bigg [\delta _t^{j} {\textbf{e}}_{t} -\big (({\textbf{w}}_{t}^{j} )^{\top } \phi _t\big ) \phi _t\Bigg ]\big ), \end{aligned}$$
(26)

where \(\delta _{t}^{i}:=\) \(r_{t+1}^{i}+\gamma ({\textbf{v}}^{i}_t)^{\top } \phi _{t+1}-({\textbf{v}}^{i}_t)^{\top } \phi _{t}\) is the local conventional TD error corresponding to agent i. The connection matrix \(C_t\) is row and column stochastic and is generated from the connected graph such that \((i, j) \in {\mathcal {E}}_{t}\) iff \((C_{t})_{ij} > 0\), and \((C_{t})_{ij}\) denotes the communication weight from agent j to i at time t. The parameters \(\alpha _{{\textbf{v}}, t}\) and \(\alpha _{{\textbf{w}}, t} \) denote the stepsizes of the \({\textbf{v}}\) and \({\textbf{w}}\) updates at time step t. The complete MGOPAC algorithm is described in Algorithm 1.
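For concreteness, one consensus critic step, Eqs. (25)–(26), can be sketched in vectorized form as below; V and W stack the agents' local weights as rows, C is the connection matrix \(C_t\), and all inputs are assumed to be supplied by the surrounding algorithm.

```python
import numpy as np

def consensus_critic_step(V, W, C, e, phi_t, phi_next, rewards,
                          gamma_next, lam, alpha_v, alpha_w):
    """One consensus critic step, Eqs. (25)-(26).
    V, W: (N, d) arrays whose i-th rows are v^i_t and w^i_t;
    C: (N, N) connection matrix; rewards: (N,) local rewards r^i_{t+1}."""
    # Local TD errors delta^i = r^i_{t+1} + gamma_{t+1} (v^i)^T phi_{t+1} - (v^i)^T phi_t.
    delta = rewards + gamma_next * (V @ phi_next) - V @ phi_t            # shape (N,)
    # Local GTD-style updates before neighborhood mixing.
    V_half = V + alpha_v * (np.outer(delta, e)
                            - gamma_next * (1.0 - lam) * np.outer(W @ e, phi_next))
    W_half = W + alpha_w * (np.outer(delta, e) - np.outer(W @ phi_t, phi_t))
    # Averaging over neighbors with the connection matrix C_t.
    return C @ V_half, C @ W_half
```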

5 Analysis

It can be observed that our algorithm has a recursive stochastic form similar to off-policy value function algorithms:

$$\begin{aligned} \theta _{t+1}=\theta _{t}+\alpha _{\theta , t}\left( h\left( \theta _{t}, {\textbf{v}}_{t}\right) +N_{t+1}\right) , \end{aligned}$$
(27)

where \(h(\cdot , \cdot )\) is a differentiable function and \(\left\{ N_{t}\right\} _{t \ge 0}\) denotes a noise sequence. Thus, we consider the behavior of an ordinary differential equation (ODE) of the following form:

$$\begin{aligned} {\dot{\theta }}(t)=h(\theta (t), {\textbf{v}}). \end{aligned}$$
(28)

The two updates (for the actor and for the critic) are not independent at each time step. Therefore, we analyze two separate ODEs using a two time-scale analysis in the framework of [40]. In this scheme, the convergence of the critic updates, which run on the faster time-scale, is analyzed with the actor parameters viewed as fixed; then the actor updates are studied while viewing the faster time-scale as equilibrated at each iteration. We make the following assumptions.

Assumption (A1)

The target policy, defined as a function of \(\theta \), \(\pi ^i_{(\cdot )}(a^{i} | s)\): \({\mathbb {R}}^{N_{\theta }} \rightarrow [0,1]\), is continuously differentiable, \(\forall s \in \) \({\mathcal {S}}, a^{i} \in {\mathcal {A}}^{i}\), and \(i \in \{1, \ldots , N\}\).

Assumption (A2)

The iterates \(\{{\textbf{v}}_t, {\textbf{w}}_t, \theta _t\}\) lie in some compact region \(B_{{\textbf{v}}} \times B_{{\textbf{w}}} \times B_{\theta }\) almost surely; and \(\{{\textbf{e}}_t, {\textbf{e}}_{\theta , t} \}\) are bounded with probability one.

Remark

Assumption (A1) is satisfied by most practical choices of target policy. Assumption (A2) is not demanding for the iterates in practice when the constraint sets are set large enough. Moreover, it is shown in [41] that under proper conditions on the parameter \(\lambda \), the boundedness of the eligibility traces is guaranteed.

Assumption (A3)

The feature matrix \(\Phi \), whose rows are the feature vectors \(\{\phi (s)^{T}\}_{s \in {\mathcal {S}}}\), has linearly independent columns, and the value function is approximated by \(V_{{\textbf{v}}}(s)=\phi (s)^{T} {\textbf{v}}\), which is linear in \({\textbf{v}}\).

Assumption (A4)

It holds that if \(\pi _{\theta ^{i}}^{i}\left( a^{i} | s\right) >0\), then \(\mu ^{i}\left( a^{i} | s\right) >0\), for each \(i \in {\mathcal {N}}\), \(\left( a^{i}, s\right) \in {\mathcal {A}}^{i} \times {\mathcal {S}}\), and \(\theta ^{i} \in \Theta ^{i}\).

These ensure that the importance sampling weights are always finite. Finally, we have the following standard assumptions on features, stepsizes, and connection matrices, which are also made in previous work [5, 10, 24, 41].

Assumption (A5)

\(\left\| \phi _{t}\right\| _{\infty }<\infty , \forall t\), where \(\phi _{t} \in {\mathbb {R}}^{N_{\textrm{v}}}\).

Assumption (A6)

\(\alpha _{{\textbf{v}}, t}\), \(\alpha _{{\textbf{w}}, t}\), \(\alpha _{\theta , t}>0\), \(\forall t\) are stepsizes such that \(\sum _{t} \alpha _{{\textbf{v}}, t}=\sum _{t} \alpha _{{\textbf{w}}, t}=\sum _{t} \alpha _{\theta , t}=\infty \) and \(\sum _{t} \alpha _{{\textbf{v}}, t}^{2}<\) \(\infty \), \(\sum _{t} \alpha _{{\textbf{w}}, t}^{2}<\infty \) and \(\sum _{t} \alpha _{\theta , t}^{2}<\infty \) with \(\frac{\alpha _{\theta , t}}{\alpha _{{\textbf{v}}, t}} \rightarrow 0\) and \(\frac{\alpha _{{\textbf{v}}, t}}{\alpha _{{\textbf{w}}, t}} \rightarrow 0\), as \(t\rightarrow +\infty \). Moreover, \(\lim _{t \rightarrow \infty } \frac{\alpha _{{\textbf{w}}, t+1}}{\alpha _{{\textbf{w}}, t}}=1\).

Assumption (A7)

For each element \(C_{t} \in \left\{ C_{t}\right\} _{t \in {\mathbb {N}}}\), it holds that

  1. The connection matrix \(C_{t}\) is row stochastic, \({\mathbb {E}}\left[ C_{t}\right] \) is column stochastic, and there exists \(\alpha \in (0,1)\) such that, for any \(C_{t}(i, j)>0\), we have \(C_{t}(i, j) \ge \alpha \).

  2. Set the spectral norm \(\rho := \rho \left( {\mathbb {E}}[C_{t}^{T}\left( I-{\textbf{1}}{\textbf{1}}^{T} / N\right) C_{t}]\right) \). Then, it holds that \(\rho <1\).

  3. Given the \(\sigma \)-algebra \(\sigma \left( C_{\tau },\left\{ r_{\tau }^{i}\right\} _{i \in {\mathcal {N}}}; \tau \le t\right) \), matrix \(C_{t}\) is conditionally independent of \(r_{t+1}^{i}\) for each \(i \in {\mathcal {N}}\).

Remark

Here, Assumption (A5) is necessary to show the stability of the critic updates (see the supplementary material for details). Assumption (A6) is standard and similar to the conditions in the literature for single-agent actor-critic algorithms employing two-time-scale analysis. Moreover, the first condition of (A7) is standard for guaranteeing convergence of the update for each agent to a consensus vector [42]. The second condition of (A7) relates to the connectivity of the communication graph \({\mathcal {G}}_t\). In fact, Refs. [43] and [44] showed that for gossip communication schemes, the condition holds true if and only if the corresponding communication network is connected. The third condition requires that \(C_t\) and \(\left\{ r_{t+1}^{i}\right\} _{i \in {\mathcal {N}}}\) are independent conditioned on the past history. This is reasonable as random communication link failures and the mixing schemes usually depend on the active connections rather than on the rewards observed by the agents.
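As a quick numerical illustration of the second condition in (A7), the sketch below builds a simple doubly stochastic mixing matrix for a ring of N agents (a Metropolis-style weight choice made only for this example; it is not necessarily the matrix used in Sect. 6) and checks that the spectral norm of \(C^{T}(I-{\textbf{1}}{\textbf{1}}^{T}/N)C\) is below one.

```python
import numpy as np

def ring_mixing_matrix(N):
    """Doubly stochastic mixing matrix for a ring of N >= 3 agents:
    each agent averages equally with itself and its two ring neighbors."""
    C = np.zeros((N, N))
    for i in range(N):
        C[i, i] = 1.0 / 3.0
        C[i, (i - 1) % N] = 1.0 / 3.0
        C[i, (i + 1) % N] = 1.0 / 3.0
    return C

N = 9
C = ring_mixing_matrix(N)
M = C.T @ (np.eye(N) - np.ones((N, N)) / N) @ C
spectral_norm = np.linalg.eigvalsh(M).max()   # M is symmetric PSD
assert spectral_norm < 1.0                    # second condition of (A7)
```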

With these assumptions in place, we are able to provide the following convergence result for the proposed multi-agent off-policy actor-critic algorithm.

Theorem 1

(Convergence of MGOPAC) Suppose that Assumptions (A1)–(A7) are satisfied. Then, the policy parameters, \(\theta _{t}^{i}\), converge to \({\mathcal {O}}^i:=\{\theta ^{i} \in \Theta ^{i} \,|\, {\textbf{g}}^i(\theta ^{i})=0\}\), for \(i \in [N]\); moreover, the value function weights, \({\textbf{v}}_{t}^{i}\), converge to a consensus vector which is the corresponding solution of (10), with probability one.

We remark that the conclusion regarding the convergence of the policy parameters is, in general, the best result one can prove for a stochastic gradient algorithm with a possibly nonconvex objective function [14, 16, 19, 24]. The detailed proof of Theorem 1 is included in the supplementary. In the following, we provide a sketch of the proof.

Proof Sketch We adopt the scheme of two time-scale analysis [40]. This has also been used to analyze single-agent policy gradient actor-critic algorithms [17, 19, 45]. We analyze the dynamics of the weights of the agents, which include the actor weights \(\{\theta ^{i}_{t}\}_{i\in [N]}\) and the critic weights \(\big \{\big ( ({\textbf{w}}_{t}^{i})^{\top }, ({\textbf{v}}_{t}^{i})^{\top }\big )^{\top }\big \}_{i\in [N]}\). Under the above assumptions, the proof verifies that the requirements in Refs. [46] and [47] are satisfied by the update rules. Once these requirements are verified, we can apply their results to establish the convergence of \(\{\theta ^{i}_{t}\}_{i\in [N]}\) and \(\big \{ {{\textbf{v}}_{t}^{i}}\big \}_{i\in [N]}\) to an asymptotically stable equilibrium and the corresponding TD solution, respectively.

Fig. 1
5-State Boyan Chain MDP with N agents. Every transition of the agents is specified by a probability (left number) and a reward (right number). Here, each feature can be set as a vector of length 2N, denoted by y, whose entries \([y_{2i-1}, y_{2i}]\) form the feature corresponding to the state of Agent i as in the single-agent case, for \(i\in [N]\). At the bottom are the reward multipliers: the reward for agent i is the single-agent reward multiplied by the multiplier associated with the state of the most advanced agent, excluding i

6 Experiments

In this section, we explore the performance of the proposed algorithm for the multi-agent generalizations of the single-agent Boyan’s chain task [48]. We compare it with the state-of-the-art baseline algorithm proposed in Ref. [24], which we denote here as EMOPAC (Emphatic Multi-agent Off-Policy Actor-Critic). This algorithm is currently the only multi-agent generalization of advanced temporal difference learning algorithms (Emphatic TD) with critic consensus as in the proposed method. All experiments were performed with Python 3.7 on an Intel i7 CPU at 3.4 GHz (32 GB RAM) desktop.

Multi-agent Boyan’s Chain Problem Assume we have N agents. We consider a non-trivial multi-agent generalization of the single-agent Boyan’s chain problem, where communication is needed as an agent’s local rewards depend not only on its own states, but on the states of other agents. Specifically, when there is another agent ahead of it, its transition is rewarded less; on the other hand, if the agent is the most advanced in the group, its transition is rewarded more. Each agent’s reward for a transition is equal to the single-agent reward multiplied by the position in the chain of the most advanced agent in the group, excluding itself. Since each local reward is negative, multiplication by a larger multiplier (because another agent has advanced further) has the effect of reducing the reward.

Fig. 2
Convergence on the 9-agent Boyan’s chain of length 5; the right subplot shows the consensus process when \(\lambda = 0\)

For example, suppose we have a 5-state Boyan’s chain with N agents as in Fig. 1. Assume that Agent 1 is in state S3 and Agents \(2, \ldots , N\) are in state S2; then the reward for Agent 1 is the corresponding single-agent reward for state S3 multiplied by 2 (since the most advanced position attained by the agents other than Agent 1 is state S2); similarly, the reward for Agent N equals the corresponding single-agent reward for state S2 multiplied by 3 (since the most advanced position of any other agent is state S3). The total reward is the summation of all agents’ local rewards. For the multi-agent Boyan’s chain, there exists a feature map that can represent the state value function linearly, as in the single-agent case. The size of the state space \(|{\mathcal {S}}|\) grows rapidly with the number of agents; for example, when \(N=9\), \(|{\mathcal {S}}|\) is on the order of \(5^9 = 1{,}953{,}125\) and the size of the joint action space \(|{\mathcal {A}}|\) is on the order of \(2^9 = 512\).
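To make the reward construction explicit, the following sketch computes the local rewards from the agents' chain positions; the single-agent reward and the state-to-multiplier map are passed in as callables (a simplification: the single-agent reward actually depends on the sampled transition), since their exact values are read off Fig. 1 and are not restated here.

```python
def multi_agent_boyan_rewards(states, single_agent_reward, multiplier):
    """Local rewards in the multi-agent Boyan chain (illustrative sketch).

    states:              chain position of each agent, states[i] for agent i
    single_agent_reward: maps agent i's current position to its single-agent
                         Boyan reward for the transition (from Fig. 1)
    multiplier:          maps a chain position to its reward multiplier
                         (the numbers at the bottom of Fig. 1)
    """
    rewards = []
    for i, s_i in enumerate(states):
        # Most advanced position among the *other* agents.
        most_advanced_other = max(s for j, s in enumerate(states) if j != i)
        rewards.append(single_agent_reward(s_i) * multiplier(most_advanced_other))
    return rewards
```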

In the experiments, the connection matrix \(C_t\) is generated from a ring graph that links the agents together successively. The stepsizes \(\alpha _{{\textbf{v}}, t}\), \(\alpha _{{\textbf{w}}, t}\), \(\alpha _{\theta , t}\) are set as \(\frac{\alpha }{(t+1)^\frac{5}{8}}\), \(\frac{\alpha }{(t+1)^\frac{9}{16}}\), and \(\frac{\beta }{t+1}\), respectively, where \(\alpha \) and \(\beta \) are positive constants. These constants are swept over and selected so that each algorithm achieves its best performance. Moreover, the behavior policy is set to be uniform, which means that each feasible action is taken with equal probability at all states. Since the multi-agent experiments are simulated on a single computer, the consensus values of \(\{p_t^{k}\}_{k=1}^N\) and \(\{\delta _t^{k}\}_{k=1}^N\) can be computed directly. We assume that the local target policies are parameterized by a form of softmax function, known as Boltzmann policies, defined as \(\pi _{\theta ^i}^i(s, a^i) = \frac{\exp \big (q_{s,a^i}^T \theta ^i \big )}{\sum _{b^i\in {\mathcal {A}}^i}\exp \big (q_{s,b^i}^T \theta ^i\big )}\), where \(q_{s,b^i}\in {\mathbb {R}}^{N_{\theta }}\) denotes the basis vector for any \(s\in {\mathcal {S}}\) and \(i\in [N]\). Here, \(q_{s,b^i}\) is of the same dimension as \(\theta ^i\), which is set to 2N. The elements of \(q_{s,b^i}\) are generated from a uniform distribution on [0, 1]. The results of applying the proposed method and the baseline algorithm to the multi-agent Boyan’s chain problem under different settings of the bootstrapping parameter \(\lambda \) are shown in Fig. 2. We show the results for \(\lambda = 0\), 0.2, 0.5, since values closer to 1 lead to more oscillation in the learning curves of both algorithms. In the experiments, both algorithms are run for 10 trials. First, the right panel of Fig. 2 shows that the consensus of the agents is reached quickly (within the first 50 episodes). Figure 2 also shows that our algorithm (MGOPAC) attains a stationary value at around 15,000 episodes, while the baseline algorithm (EMOPAC) reaches that value more slowly (at around 23,000 episodes) when \(\lambda = 0\), 0.2. In addition, MGOPAC demonstrates better stability than EMOPAC when \(\lambda = 0\), 0.2, 0.5. Similar merits of the proposed algorithm are observed for other connection graphs and different numbers of states and agents. The results corroborate the favorable properties of the proposed gradient-based method.
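As a final illustration, the Boltzmann parameterization above and its score function \(\psi ^i\) (needed for the actor trace) can be written compactly as below; q_s stacks the basis vectors \(q_{s, b^i}\) for the current state, and the names are placeholders rather than the exact implementation used for Fig. 2.

```python
import numpy as np

def boltzmann_policy(theta_i, q_s):
    """pi^i_{theta^i}(s, .) for the Boltzmann policy of Sect. 6;
    q_s is a (|A^i|, dim) array whose rows are the basis vectors q_{s, b^i}."""
    logits = q_s @ theta_i
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def score(theta_i, q_s, a_idx):
    """psi^i(s, a^i) = grad_{theta^i} log pi^i = q_{s,a^i} - sum_b pi(b|s) q_{s,b}."""
    probs = boltzmann_policy(theta_i, q_s)
    return q_s[a_idx] - probs @ q_s
```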

7 Conclusions and Remarks

This paper proposes a gradient-based multi-agent off-policy actor-critic framework for reinforcement learning problems. Theoretically, we prove that the critic updates converge to a consensus limiting point, which is the TD solution of the projected Bellman equation, and the actor updates converge to the set of asymptotically stable fixed points of problem (16). Moreover, unlike algorithms in previous work, both the critic step and the actor step of the proposed method are based on full gradient descent (ascent) of the corresponding objective functions. In the experiments, we compare our approach with a state-of-the-art baseline algorithm, and the results show the superior performance of our algorithm in terms of stability and convergence rate. Note that the analysis of the proposed algorithm is based on the assumption that the state value function is linearly representable by the feature map and critic weights. Thus, one interesting direction for future research would be to extend the proposed gradient-based framework to scenarios with nonlinear approximators such as deep neural networks and to investigate its performance on more complex real-world problems. Moreover, since the proposed method is currently developed and analyzed in the Markovian setting, it is also worth studying the extension of its analysis and applications to non-Markovian scenarios.