1 Introduction

Over the last few years, reinforcement learning has witnessed enormous progress in applications such as intelligent control [1], autonomous driving [2], and strategy games [3]. Compared with single-agent reinforcement learning, multi-agent reinforcement learning (MARL) tasks are more challenging in the sense that agents in such scenarios interact not only with the environment, but also with other agents. In this paper, we study MARL algorithms with local rewards and actions, where the agents aim to cooperatively perform policy prediction and improvement over connected communication graphs. In these cases, the agents share a joint state whose transition depends on the local actions of the individual agents, while the rewards remain local. To learn the policy and achieve globally optimal estimation, the agents need to share their local information with their connected neighbors. Note that the multi-agent problems in such a setting cannot be solved by simply applying a single-agent approach to each agent independently, as an agent cannot learn its own policy without communicating with others to obtain accurate estimates of the global joint state value.

A trivial distributed approach could use a central controller to broadcast the global joint state value estimate to all the agents at each iteration. Nevertheless, because of the use of the central controller, this algorithm structure is vulnerable to the breakdown of the central node and can raise privacy concerns. Furthermore, this approach does not scale well and faces challenges in large-scale applications due to constraints on memory and computation. To improve scalability and robustness, recent articles have proposed fully decentralized actor-critic methods, where each agent only communicates with its neighbors over a network (e.g., Refs. [4, 5]). Communication-efficient and resilient algorithms have also been developed in Refs. [6, 7]. However, these works mainly focus on on-policy reinforcement learning methods, which potentially face challenges in exploration capability and sample complexity [8, 9].

On the other hand, off-policy multi-agent reinforcement learning is an attractive approach due to its generality and the potential for efficiency in distributed off-policy evaluation and control (see, e.g., Refs. [10,11,12,13]). To this end, Ref. [10] proposes a distributed off-policy actor-critic method with policy consensus where the local value functions are independent, although the algorithm’s convergence to the solution of the Bellman equation is not guaranteed. Moreover, Ref. [11] develops a multi-agent decomposed policy gradient method that could narrow the performance gap between multi-agent policy gradient (MAPG) methods and multi-agent value-based approaches. Besides, Ref. [12] considers the application of policy iteration-based algorithms in adaptive output synchronization of heterogeneous multi-agent systems. Reference [13] designs two distributed policy evaluation algorithms for Markov decision processes (MDPs) based on a special case of the single-agent gradient temporal difference learning proposed in Ref. [14], by incorporating consensus-based collaborations between agents. Recently, Ref. [15] has proposed a fully decentralized actor-critic MARL framework that learns individual policies separately for decentralization. Unlike on-policy methods, an off-policy agent aims to perform policy evaluation or policy improvement on a given target policy while generating trajectories according to a different behavior policy. Thus, reweighting the updates is often needed to compensate for the bias caused by the policy difference. Such importance sampling-based methods have already been widely used in single-agent off-policy algorithms [14, 16,17,18,19].

In this paper, we develop a multi-agent online stochastic gradient update for an off-policy actor-critic method. The proposed algorithm is incremental and scales linearly with the size of approximation features. We make the following main contributions:

  • We propose the first fully gradient descent-based multi-agent actor-critic approach for off-policy settings, where the critic uses a GTD-type paradigm and the actor adopts a policy gradient method derived from the full gradients of the joint objective function.

  • Under standard conditions, we prove that the critic updates of the agents gradually reach a consensus that is the TD solution of the projected Bellman equation and the actor updates converge to the set of asymptotically stable fixed points.

  • We introduce a non-trivial generalization of the multi-agent Boyan’s chain problem where the state value function is still linearly representable with a feature map as in the single-agent Boyan’s chain problem. Experimental results demonstrate the improved performance of the proposed algorithm over the state-of-the-art baseline algorithm.

2 Related Work

Single-Agent Setting Gradient temporal difference (GTD) methods including the linear temporal difference with gradient correction (TDC) [14, 16, 20], emphatic temporal difference (ETD) approaches [18], and off-policy actor-critic methods [19] are well-known off-policy extensions of the temporal difference methods for policy evaluation and control for a single agent. Among these, in Ref. [19], Degris et al. propose a method using gradient-based TD learning as a critic and policy gradient as an actor. However, in the derivation of the actor update, a semi-gradient is used to approximate the full gradient. This introduces additional bias, and as a result the convergence and stability of the algorithm are undermined [21]. This issue is partially addressed by subsequent works including Refs. [21,22,23] in single-agent settings, but the convergence is not directly established. Although our primary focus is the multi-agent setting, we start our work by developing an incremental and online stochastic gradient update for actor-critic methods in the single-agent setting which adopts the full gradient. We prove that this is guaranteed to converge under reasonable assumptions.

Multi-agent setting There have been recent advances in the development and analysis of decentralized multi-agent off-policy actor-critic methods [10,11,12,13, 15, 24, 25]. In contrast to the works including Refs. [4, 5, 15, 24] which use a critic consensus step, Zhang et al. introduced in Ref. [10] a distributed off-policy actor-critic method using a consensus step for the policy parameters. The analysis of Ref. [10] considers a special form of local value functions and ignores the biases caused by the difference between the behavior policy and the target policy in off-policy scenarios. Moreover, Ref. [11] proposes a multi-agent policy gradient (MAPG) algorithm with improved performance based on a decomposed critic assumption. However, their algorithm adopts a centralized critic structure instead of a fully decentralized one; the latter is usually more general in applications as it does not require a central node that collects critic information from each agent. Furthermore, Ref. [13] proposes multi-agent generalizations of the off-policy GTD and TDC algorithms in Markov decision processes and provides a theoretical guarantee of their weak convergence to a consensus point. Nevertheless, the method only considers policy evaluation problems and is not applicable to general control problems. Although its authors later developed a type of multi-agent actor-critic algorithm in Ref. [25], they focus on a different scenario where multiple agents are essentially completing a single-agent task with different behaviors: the agents reach a consensus policy that generates only one action to be implemented by a single actor for solving the single-agent problem. In contrast, the multiple agents considered in our method take different actions cooperatively to solve multi-agent problems based on their local policies. Recently, in Ref. [26], Chen et al. have conducted a finite-time analysis of a multi-agent temporal difference (TD) learning method, but they only focus on policy evaluation tasks. In Ref. [15], a decentralized approximate actor-critic MARL framework is developed with a primal-dual optimization to guarantee full decentralization and scalability. However, they only consider the multi-agent generalization framework for TD methods, which is proven to diverge in the seminal counterexamples introduced by Ref. [27], where GTD and TDC methods are still sound and convergent. As a result, their multi-agent framework is not applicable to the more stable gradient-based off-policy actor-critic methods in this paper. Moreover, Suttle et al. study a decentralized version of an off-policy actor-critic algorithm using emphatic temporal difference learning with a critic variable consensus step and analyze its convergence under linear function approximation [24]. Compared with Ref. [24], our proposed algorithm is a novel gradient-based multi-agent off-policy actor-critic algorithm that adopts a consensus step for the critic estimates, with its critic derived from gradient descent on the corresponding projected Bellman error function. Therefore, the proposed algorithm is fully incremental and is likely to exhibit improved stability compared to ETD-based methods [28]. This is corroborated by the experimental results provided in Sect. 6. The main features of the related works are compared in Table 1.

Table 1 Comparison between this paper and the related works

3 Notation and Problem Setting

To facilitate the analysis of multi-agent algorithms, researchers usually study a Markov decision process (MDP) as a testbed for real-world reinforcement learning problems that can showcase the properties (such as convergence and stability) of their MARL algorithms (see, e.g., Refs. [4,5,6,7, 10, 11, 13,14,15, 24,25,26, 29]). Here, we first consider a Markov decision process for a single agent with a discrete state space \({\mathcal {S}}\), a discrete action space \({\mathcal {A}}\), and a transition probability function \(P: {\mathcal {S}}\times {\mathcal {S}} \times {\mathcal {A}} \rightarrow [0, 1]\), where \(P(s^\prime |s, a)\) denotes the probability of moving into the next state \(s^{\prime }\) from the current state s after taking action a. Moreover, an expected reward function \({\mathcal {R}}: {\mathcal {S}} \times {\mathcal {A}} \times {\mathcal {S}} \rightarrow {\mathbb {R}}\) gives an expected reward for any given transition triple \((s, a, s^{\prime })\). In the decision process, we observe a stream of data, consisting of states \(s_{t} \in {\mathcal {S}},\) actions \(a_{t} \in {\mathcal {A}},\) and rewards \(r_{t} \in {\mathbb {R}}\), for time steps \(t=1,2, \ldots \), with actions being selected according to a fixed behavior policy, \(\mu (a | s) \in [0,1]\). Given a (possibly state-dependent) discount-rate parameter \(\gamma : {\mathcal {S}} \rightarrow [0,1]\) [28], the value function for policy \(\pi : {\mathcal {S}} \times {\mathcal {A}} \rightarrow [0,1]\) is defined as

$$\begin{aligned} V^{\pi , \gamma }(s) := {\mathbb {E}}\Bigg [\sum _{p=1}^\infty \Bigg (\prod _{q=1}^{p-1}\gamma (s_{t+q})\Bigg ) r_{t+p} \,\Bigg |\, s_t = s,\pi \Bigg ]. \end{aligned}$$
(1)

For all \(a \in {\mathcal {A}}\) and for all \(s \in {\mathcal {S}}\), we define the action-value function, \(Q^{\pi , \gamma }(s, a)\) to be

$$\begin{aligned} Q^{\pi , \gamma }(s, a)&= \sum _{s^{\prime } \in {\mathcal {S}}} P\left( s^{\prime } | s, a\right) \left[ {\mathcal {R}}\left( s, a, s^{\prime }\right) \right. \nonumber \\&\quad \left. +\gamma \left( s^{\prime }\right) V^{\pi , \gamma }\left( s^{\prime }\right) \right] . \end{aligned}$$
(2)

It can be shown that \(V^{\pi , \gamma }(s)=\) \(\sum _{a \in {\mathcal {A}}} \pi (a | s) Q^{\pi , \gamma }(s, a),\) for all \(s \in {\mathcal {S}}\). In off-policy settings, one wants to learn about a target policy \(\pi \), while generating actions according to the behavior policy \(\mu \), which is often chosen to achieve better exploration of the state and action spaces.

The policy \(\pi _{\theta }: {\mathcal {A}} \times {\mathcal {S}} \rightarrow [0,1]\) denotes a differentiable function with a parameter vector, \(\theta \in {\mathbb {R}}^{N_{\theta }}, N_{\theta } \in {\mathbb {N}}\). In the decision process, we aim to choose \(\theta \) to maximize the objective function defined as

$$\begin{aligned} J_{\gamma }(\theta ) :=\sum _{s \in {\mathcal {S}}} d^{\mu }(s) V^{\pi _{\theta }, \gamma }(s), \end{aligned}$$
(3)

where we use \(d^{\mu }(s):=\lim _{t \rightarrow \infty } P\left( s_{t}=s | s_{0}, \mu \right) \) to represent the limiting distribution of states, and \(P\left( s_{t}=s | s_{0}, \mu \right) \) denotes the probability that \(s_{t}=s\) under behavior policy \(\mu \) starting from \(s_{0}\). Here, the objective function is weighted by \(d^{\mu }\) because actions are generated by following the behavior policy in the off-policy setting. When there is no confusion, we can write \(\pi \) to represent \(\pi _{\theta }\).

In the multi-agent setting, we consider a set of agents \({\mathcal {N}}:= \{1, 2, \ldots , N\}\). Define \(\{{\mathcal {G}}_t\}_{t\in {\mathbb {N}}}:= \{({\mathcal {N}}, {\mathcal {E}}_t)\}_{t\in {\mathbb {N}}}\) as a sequence of connected graphs on these agents over time, where \({\mathcal {E}}_t\) denotes the edge set at time t. Specifically, \((j, i)\) is an edge in \({\mathcal {E}}_t\) whenever there is communication between agent i and agent j. We assume that the communication is symmetric so that the communication graph is undirected. Let \({\mathcal {S}}\) be the joint state space, \({\mathcal {A}}^{i}\) be the action space, and \({\mathcal {R}}^{i}\) be the reward space for agent i, for \(i = 1,\dots , N\). A joint action can be written as \(a:=\left( a^{1}, a^{2}, \ldots , a^{N}\right) \in {\mathcal {A}}^{1} \times {\mathcal {A}}^{2} \times \cdots \times {\mathcal {A}}^{N}\), and a joint reward as \(r:=\left( r^{1}, r^{2}, \ldots , r^{N}\right) \in {\mathcal {R}}^{1} \times {\mathcal {R}}^{2} \times \cdots \times {\mathcal {R}}^{N}\). Here, the joint state s is observable by all agents, while the reward \(r^{i}\) and the action \(a^{i}\) are private for agent i.

In this setting, the agents’ states are coupled by the joint state transition matrix \(P(\cdot | \cdot , a) \in {\mathbb {R}}^{|{\mathcal {S}}| \times |{\mathcal {S}}|}\) defined above with a joint action a. As a motivation, this scenario arises in a wide variety of multi-agent applications such as mobile sensing networks [30], robotics [31], and power grids [32, 33]. Assume each agent \(i \in {\mathcal {N}}\) has its own local behavior policy \(\mu ^{i}: {\mathcal {A}}^{i} \times {\mathcal {S}} \rightarrow [0,1]\). For each \(i \in {\mathcal {N}},\) let \(\pi _{\theta ^{i}}^{i}: {\mathcal {A}}^{i} \times {\mathcal {S}} \rightarrow [0,1]\) be the local target policy function with parameter \(\theta ^{i} \in \Theta ^{i}\), where \(\Theta ^{i}\) is a compact subset of \({\mathbb {R}}^{m_{i}}\). Further suppose that each \(\pi _{\theta ^{i}}^{i}\) is continuously differentiable with respect to \(\theta ^{i}\) and set \(\theta =\left[ ({\theta ^{1}})^{T}, \ldots , ({\theta ^{N}})^{T}\right] ^{T}\). Then, the global behavior and target policies can be written as

$$\begin{aligned} \mu&=\prod _{i=1}^{N} \mu ^{i}: \Big (\prod _{i=1}^{N}{\mathcal {A}}^{i}\Big )\times {\mathcal {S}} \rightarrow [0,1], \end{aligned}$$
(4)
$$\begin{aligned} \pi _{\theta }&=\prod _{i=1}^{N} \pi _{\theta ^{i}}^{i}: \Big (\prod _{i=1}^{N}{\mathcal {A}}^{i}\Big ) \times {\mathcal {S}} \rightarrow [0,1] . \end{aligned}$$
(5)

In the multi-agent decision process, we aim to maximize the objective function of (3) for the global behavior and target policies in (4) and (5). Note that actor-critic algorithms are essentially the combination of a policy improvement structure and a critic algorithm. The policy improvement structure is used to update the current policy based on the state value (or action value) that is estimated by the critic algorithm. Then, the critic algorithm computes the state value to evaluate the action made by the actor under the current policy. The alternating process continues until convergence to a stationary fixed point, which is a candidate for the optimal solution.

Assume that if \(\pi _{\theta ^{i}}^{i}\left( a^{i} | s\right) >0\), then \(\mu ^{i}\left( a^{i} | s\right) >0\) for all \(i \in {\mathcal {N}}\), \(\left( a^{i}, s\right) \in {\mathcal {A}}^{i} \times {\mathcal {S}}\), and \(\theta ^{i} \in \Theta ^{i}\). Moreover, for all \(\theta \in \Theta \), suppose that the Markov chains obtained by following \(\pi _{\theta }\) and \(\mu \) are irreducible and aperiodic, and vectors \(d^{\pi _{\theta }}, d^{\mu } \in [0,1]^{|{\mathcal {S}}|}\) represent their steady-state distributions, respectively. For an integer \(m\in {\mathbb {N}}\), we write [m] as shorthand for the set \(\{1, 2, \cdots , m \}\). The definitions of the acronyms used in this paper are summarized in Table 2.

4 Multi-agent Gradient-Based Off-Policy Actor-Critic (MGOPAC) Algorithm

In this section, we first explain the basic theoretical ideas underlying the Gradient-TD methods for the critic updates in the single-agent setting. Then, we derive the actor updates using a full gradient method with emphatic weights to produce a backward-view mechanistic algorithm with eligibility traces. Finally, the single-agent updates are extended to the multi-agent setting, allowing us to specify the actor-critic steps.

Table 2 Acronym definition table

4.1 Single-Agent Critic Algorithm

For value function estimation, we first define the \(\lambda \)-return (bootstrapped return) as

$$\begin{aligned} G_{t}^{\lambda }(V):= r_{t+1}+\gamma _{t+1}\left[ (1-\lambda ) V\left( s_{t+1}\right) +\lambda G_{t+1}^{\lambda }(V)\right] , \end{aligned}$$
(6)

where \(\lambda \in [0,1]\) denotes a constant eligibility trace parameter, \(\gamma _{t+1}\) is the discount factor at state \(s_{t+1}\), and \(V(s_{t+1})\) is the corresponding value function. In the following, we introduce the \(\lambda \)-weighted Bellman equation. A more detailed description is provided in [34].

First, let us consider the linear function approximation of the state value function: \(V^{\pi , \gamma }(s) \approx V_{{\textbf{v}}}(s):={\textbf{v}}^{\top } \phi (s)\), where \(\phi (s) \in {\mathbb {R}}^{N_{{\textbf{v}}}}, N_{{\textbf{v}}} \in {\mathbb {N}},\) is the feature vector of state s,  and \({\textbf{v}} \in {\mathbb {R}}^{N_{{\textbf{v}}}}\) is the weight vector. Then, the projection operator \(\Pi \) to the linear space can be written as [28]

$$\begin{aligned} \Pi = \Phi (\Phi ^\top D \Phi )^{-1} \Phi ^\top D, \end{aligned}$$
(7)

where D is an \(|{\mathcal {S}}|\times |{\mathcal {S}}|\) diagonal matrix with \(d^\mu (s)\), the stationary probability of visiting state s under \(\mu \), as its diagonal elements; and \(\Phi \) is the \(|{\mathcal {S}}| \times N_{{\textbf{v}}}\) matrix whose rows are the feature vectors \(\phi (s)^\top \), \(s \in {\mathcal {S}}\). The Gradient-TD method was first proposed by Sutton et al. [35]. It incrementally learns the weights \({\textbf{v}}\), and the auxiliary variables \({\textbf{w}}\), under an off-policy protocol, with guaranteed stability and linear per-time-step complexity. This type of method not only works in the off-policy setting, but also achieves convergence comparable to conventional TD methods on on-policy problems.
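As a small numerical illustration of the projection in Eq. (7), the following NumPy sketch builds \(\Pi \) from a toy feature matrix and visitation distribution; the arrays Phi and d_mu are hypothetical placeholders, not quantities taken from the paper's experiments.

```python
import numpy as np

# Hypothetical toy example: 4 states, 2 features per state.
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0],
                [0.2, 0.8]])            # rows are phi(s)^T
d_mu = np.array([0.4, 0.3, 0.2, 0.1])   # stationary distribution under mu
D = np.diag(d_mu)

# Projection onto the span of the features, weighted by d^mu (Eq. (7)).
Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D

# Pi is idempotent: projecting an already-projected vector changes nothing.
assert np.allclose(Pi @ Pi, Pi)
```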

We first introduce the derivation of the critic updates in the single-agent setting based on the GTD(\(\lambda \)) algorithm [16], which can be viewed as an extension of the linear TD method with gradient correction (TDC) [14] to eligibility traces. Define the \(\lambda \)-weighted Bellman operator for policy \(\pi \) as

$$\begin{aligned} \begin{aligned} \left( T_{\pi }^{\lambda , \gamma } V^{\pi , \gamma }\right) (s) := {\mathbb {E}}\left[ G_{t}^{\lambda }\left( V^{\pi , \gamma }\right) | s_{t}=s, \pi \right] . \end{aligned} \end{aligned}$$
(8)

The objective is to find the off-policy TD solution, \({\textbf{v}}\), such that the following projected Bellman equation holds

$$\begin{aligned} V_{{\textbf{v}}}=\Pi T_{\pi }^{\lambda , \gamma } V_{{\textbf{v}}}, \end{aligned}$$
(9)

while the data are generated according to a behavior policy \(\mu \). To find the solution of this equation, we minimize the mean-square projected Bellman error (MSPBE) objective function:

$$\begin{aligned} \min _{{\textbf{v}}} ~ {\text {MSPBE}}({\textbf{v}}) :=\left\| V_{{\textbf{v}}}-\Pi T_{\pi }^{\lambda , \gamma } V_{{\textbf{v}}}\right\| ^{2}_{d^{\mu }}. \end{aligned}$$
(10)

To derive an update rule for solving this problem, let us start by introducing the following definitions:

$$\begin{aligned}&\phi _{t} \equiv \phi \left( s_{t}\right) ~~ \text {and} ~~ \gamma _t \equiv \gamma (s_{t})\,, \nonumber \\&G_{t}^{\lambda }({\textbf{v}}) := r_{t+1}+\gamma _{t+1}\left[ (1-\lambda ) {\textbf{v}}^{\top } \phi _{t+1} +\lambda G_{t+1}^{\lambda }({\textbf{v}})\right] \,, \nonumber \\&\delta _{t}^{\lambda }({\textbf{v}}) := G_{t}^{\lambda }({\textbf{v}})-{\textbf{v}}^{\top } \phi _{t}, \end{aligned}$$
(11)

and an operation \({\mathcal {P}}_{\mu }^{\pi }\):

$$\begin{aligned} {\mathcal {P}}_{\mu }^{\pi } \delta _{t}^{\lambda }({\textbf{v}}) \phi _{t} := \sum _{s} d^{\mu }(s) {\mathbb {E}}\left[ \delta _{t}^{\lambda }({\textbf{v}}) | s_{t}=s, \pi \right] \phi (s). \end{aligned}$$
(12)

It can be proved that the following identities hold true [16]:

$$\begin{aligned} {\text {MSPBE}}({\textbf{v}})&=\left\| V_{{\textbf{v}}}-\Pi T_{\pi }^{\lambda , \gamma } V_{{\textbf{v}}}\right\| _{d^{\mu }}^{2} \nonumber \\&=\left( {\mathcal {P}}_{\mu }^{\pi } \delta _{t}^{\lambda }({\textbf{v}}) \phi _{t}\right) ^{\top } {\mathbb {E}}\left[ \phi _{t} \phi _{t}^{\top }\right] ^{-1}\left( {\mathcal {P}}_{\mu }^{\pi } \delta _{t}^{\lambda }({\textbf{v}}) \phi _{t}\right) . \end{aligned}$$
(13)
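To make the objective concrete, the sketch below evaluates the MSPBE directly from the definition in Eq. (10) for the special case \(\lambda = 0\) and a constant discount, given explicit (hypothetical) model matrices; it illustrates the quantity being minimized rather than the incremental algorithm derived next.

```python
import numpy as np

def mspbe(v, Phi, d_mu, P_pi, r_pi, gamma):
    """MSPBE(v) = || Phi v - Pi (r_pi + gamma P_pi Phi v) ||^2_{d_mu},
    i.e. Eq. (10) with lambda = 0 and a constant discount (illustration only).
    P_pi and r_pi are the state-transition matrix and expected rewards under pi."""
    D = np.diag(d_mu)
    Pi = Phi @ np.linalg.inv(Phi.T @ D @ Phi) @ Phi.T @ D   # projection, Eq. (7)
    V = Phi @ v                                             # V_v = Phi v
    TV = r_pi + gamma * (P_pi @ V)                          # Bellman operator under pi
    err = V - Pi @ TV
    return err @ D @ err                                    # d^mu-weighted squared norm
```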

Since forward-view equations are not directly implementable as they require future data, we first convert the TD forward-view equation to the mechanistic backward-view one. Define the importance weighting as \(\rho _t:=\frac{\pi \left( a_{t} | s_{t}\right) }{\mu \left( a_{t} | s_{t}\right) }\). Then, by applying a gradient descent update to the reformulated objective function, we obtain the critic algorithm based on \(\textrm{GTD}(\lambda )\) in the single-agent setting (see [16] for detailed derivations):

$$\begin{aligned} {\textbf{v}}_{t+1}&= {\textbf{v}}_{t} + \alpha _{{\textbf{v}}, t}\nonumber \\&\quad \left[ \delta _{t} {\textbf{e}}_{t}-\gamma _{t+1}(1-\lambda )\left( {\textbf{e}}_{t}^{\top } {\textbf{w}}_{t}\right) \phi _{t+1}\right] , \end{aligned}$$
(14)
$$\begin{aligned} {\textbf{w}}_{t+1}&= {\textbf{w}}_{t} + \alpha _{{\textbf{w}}, t}\left[ \delta _{t} {\textbf{e}}_{t}-\left( {\textbf{w}}_{t}^{\top } \phi _{t}\right) \phi _{t}\right] , \end{aligned}$$
(15)

where \(\delta _{t}({\textbf{v}}):=\) \(r_{t+1}+\gamma _{t+1} {\textbf{v}}^{\top } \phi _{t+1}-{\textbf{v}}^{\top } \phi _{t}\) is the conventional TD error; and \({\textbf{e}}_{t}\) denotes the eligibility trace vector at time step t, which is updated recursively by \({\textbf{e}}_{t}=\rho _{t}\left( \phi _{t}+\gamma _t \lambda {\textbf{e}}_{t-1}\right) \).
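A minimal per-step sketch of the GTD(\(\lambda \)) critic updates (14)–(15) in NumPy is given below; the variable names mirror the symbols above, and the inputs (features, reward, importance ratio, discounts, stepsizes) are assumed to be supplied by the environment and behavior policy.

```python
import numpy as np

def gtd_lambda_step(v, w, e, phi_t, phi_next, reward, rho_t,
                    gamma_t, gamma_next, lam, alpha_v, alpha_w):
    """One backward-view GTD(lambda) critic step, Eqs. (14)-(15)."""
    # Eligibility trace: e_t = rho_t * (phi_t + gamma_t * lambda * e_{t-1}).
    e = rho_t * (phi_t + gamma_t * lam * e)
    # Conventional TD error: delta_t = r_{t+1} + gamma_{t+1} v^T phi_{t+1} - v^T phi_t.
    delta = reward + gamma_next * (v @ phi_next) - v @ phi_t
    # Main weight update, Eq. (14).
    v = v + alpha_v * (delta * e - gamma_next * (1.0 - lam) * (e @ w) * phi_next)
    # Auxiliary weight update, Eq. (15).
    w = w + alpha_w * (delta * e - (w @ phi_t) * phi_t)
    return v, w, e
```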

4.2 Off-Policy Policy-Gradient Method for Actor Algorithm

To update the policy variables, we perform gradient ascent on the global policy, which aims for the maximization of the objective function \(J_{\gamma }(\theta )\) as follows:

$$\begin{aligned} \max _{\theta } J_\gamma (\theta ) := \sum _{s\in {\mathcal {S}}} d^\mu (s)\sum _{a\in {\mathcal {A}}}\pi _\theta (a|s)Q^{\pi _\theta , \gamma }(s,a). \end{aligned}$$
(16)

Specifically, the proposed algorithm updates the variables in proportion to the gradient of the objective function:

$$\begin{aligned} \theta _{t+1} -\theta _{t} = \alpha _{\theta , t} \nabla _{\theta } J_{\gamma }\left( \theta _{t}\right) , \end{aligned}$$
(17)

where \(\alpha _{\theta , t} \in {\mathbb {R}}\) is a positive stepsize parameter.

First, it can be shown that the gradient with respect to \(\theta \) is computed as

$$\begin{aligned} \nabla _{\theta } J_{\gamma }(\theta ) =\sum _{s \in {\mathcal {S}}} m(s) \sum _{a \in {\mathcal {A}}} \nabla _{\theta } \pi _\theta (a | s) Q^{\pi , \gamma }(s, a), \end{aligned}$$
(18)

where m(s) is the emphatic weighting at \(s\in {\mathcal {S}}\), with the vector form defined as \(m^\top := (d^\mu )^\top ({\textbf{I}} - {\textbf{P}}_{\theta , \gamma })^{-1}\), where \({\textbf{P}}_{\theta , \gamma }\) is the matrix with entries \({\textbf{P}}_{\theta , \gamma }(s, s^\prime ) = \gamma \sum _{a\in {\mathcal {A}}}\pi _\theta (a| s) P(s^\prime | s, a)\).
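When the model is small and explicit, the emphatic weighting vector can be computed in closed form from this definition. The sketch below assumes a constant discount and hypothetical arrays d_mu (stationary distribution under \(\mu \)), pi (an \(|{\mathcal {S}}|\times |{\mathcal {A}}|\) target policy), and P (an \(|{\mathcal {S}}|\times |{\mathcal {A}}|\times |{\mathcal {S}}|\) transition tensor); none of these come from the paper's experiments.

```python
import numpy as np

def emphatic_weights(d_mu, pi, P, gamma):
    """m^T = (d^mu)^T (I - P_{theta,gamma})^{-1}, where
    P_{theta,gamma}(s, s') = gamma * sum_a pi(a|s) P(s'|s, a) (constant gamma)."""
    n_states = d_mu.shape[0]
    # Discounted state-to-state transition matrix under the target policy.
    P_theta_gamma = gamma * np.einsum('sa,sat->st', pi, P)
    return d_mu @ np.linalg.inv(np.eye(n_states) - P_theta_gamma)
```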

We now derive an incremental update algorithm using observations sampled from the behavior policy. First, we rewrite Eq. (18) at time step t as an expectation:

$$\begin{aligned} {\textbf{g}}(\theta )&:= \nabla _{\theta } J_{\gamma }(\theta )\nonumber \\&=\sum _{s} d^{\mu }(s) \lim _{t\rightarrow \infty } {\mathbb {E}}_{\mu }\left[ M_{t} | s_{t}=s\right] \nonumber \\&\qquad \cdot \sum _{a \in {\mathcal {A}}} \nabla _{\theta } \pi _\theta (a | s) Q^{\pi , \gamma }(s, a)\,,\nonumber \\&=\sum _{s} d^{\mu }(s) \lim _{t\rightarrow \infty } {\mathbb {E}}_{\mu }\Bigg [{\mathbb {E}}_{\mu }\left[ M_{t} | s_{t}=s, a_{t}, s_{t+1}\right] \Bigg ]\nonumber \\&\qquad \cdot {\mathbb {E}}_{\mu }\Bigg [ {\mathbb {E}}_{\mu } \Bigg [ \rho _t\psi (s_t, a_t)Q^{\pi , \gamma }(s_t, a_t) | s_{t}=s, a_{t}, s_{t+1} \Bigg ] \Bigg ]\,,\nonumber \\&={\mathbb {E}}\Bigg [ \lim _{t\rightarrow \infty } {\mathbb {E}}_{\mu }\Bigg [{\mathbb {E}}_{\mu }\Bigg [M_{t} \rho _t\psi (s_t, a_t)Q^{\pi , \gamma }(s_t, a_t) \nonumber \\&\qquad \qquad \qquad \qquad \qquad | s_{t}=s, a_{t}, s_{t+1}\Bigg ]\Bigg ] | s \sim d^{\mu }\Bigg ]\,,\nonumber \\&=\lim _{t\rightarrow \infty }{\mathbb {E}}_{\mu }\Bigg [M_{t} \rho _t\psi (s_t, a_t)Q^{\pi , \gamma }(s_t, a_t)\Bigg ], \end{aligned}$$
(19)

where \(\rho _t:=\frac{\pi _\theta (a_t | s_t)}{\mu (a_t | s_t)}\), \(\psi (s, a):=\frac{\nabla _{\theta } \pi _\theta (a | s)}{\pi _\theta (a | s)} \); \( M_{t}:=\left( 1-\lambda ^\theta \right) +\lambda ^\theta F_{t}\) and \(F_{t}:= \gamma _{t} \rho _{t-1} F_{t-1}+1 \) are the emphatic and follow-on weights [21]; \(\lambda ^\theta \in [0,1]\) is the bootstrapping parameter, which is set to \(\lambda \), defined earlier; and the new notation \({\mathbb {E}}_{\mu }[\cdot ]\) is employed to denote the expectation conditioned implicitly on all the random variables being drawn from their stationary distributions under the behavior policy. Moreover, it is shown in Ref. [36] that introducing an arbitrary function of the state into these equations as a baseline does not affect the expected value. With the state-value function estimation provided by the critic, \({\hat{V}}\), the right hand side of Eq. (19) can be further replaced by \( \lim \nolimits _{t\rightarrow \infty }{\mathbb {E}}_{\mu }\Bigg [M_t\rho _t \psi \left( s_{t}, a_{t}\right) \big (Q^{\pi , \gamma }\left( s_{t}, a_{t}\right) -{\hat{V}}\left( s_{t}\right) \big )\Bigg ]\). By approximating the action value, \(Q^{\pi , \gamma }\left( s_{t}, a_{t}\right) \), with the off-policy \(\lambda ^\theta \)-return, it follows that

$$\begin{aligned} {\textbf{g}}(\theta ) \approx \widehat{{\textbf{g}}(\theta )}=\lim _{t\rightarrow \infty } {\mathbb {E}}_{\mu }\Bigg [\rho _t M_t \psi \left( s_{t}, a_{t}\right) \big (G_{t}^{\lambda ^\theta }-{\hat{V}}\left( s_{t}\right) \big )\Bigg ], \end{aligned}$$

where the off-policy \(\lambda \)-return is defined by: \(G_{t}^{\lambda ^\theta }=r_{t+1} +(1-\lambda ^\theta ) \gamma _{t+1} {\hat{V}}\left( s_{t+1}\right) + \lambda ^\theta \gamma _{t+1} \rho \left( s_{t+1}, a_{t+1}\right) G_{t+1}^{\lambda ^\theta }\).

Similarly for a mechanistic implementation, the forward view should be converted to a backward view that does not involve the future return. By conducting an argument similar to the derivation for the critic update, we have the backward view of the actor update as

$$\begin{aligned} \theta _{t+1}=\theta _{t} + \alpha _{\theta , t} \delta _{t} {\textbf{e}}_{\theta , t}, \end{aligned}$$
(20)

where \(\delta _{t}:=r_{t+1}+\gamma _{t+1} {\hat{V}}\left( s_{t+1}\right) -{\hat{V}}\left( s_{t}\right) \) is the conventional temporal difference error; and \({\textbf{e}}_{\theta , t} \in {\mathbb {R}}^{N_{\theta }}\) denotes the eligibility trace of \(\psi \), updated by \({\textbf{e}}_{\theta , t} =\rho _t \left( M_t\psi \left( s_{t}, a_{t}\right) +\lambda ^\theta {\textbf{e}}_{\theta , t-1}\right) \).
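Combining Eq. (20) with the follow-on and emphatic weights defined after Eq. (19), one actor step can be sketched as follows; here psi_t is the score vector \(\psi (s_t, a_t)\), delta is the TD error supplied by the critic, and the remaining arguments are assumed inputs.

```python
import numpy as np

def actor_step(theta, e_theta, F, psi_t, delta, rho_t, rho_prev,
               gamma_t, lam_theta, alpha_theta):
    """One backward-view off-policy actor step, Eq. (20)."""
    # Follow-on weight F_t = gamma_t * rho_{t-1} * F_{t-1} + 1 and
    # emphatic weight M_t = (1 - lambda^theta) + lambda^theta * F_t.
    F = gamma_t * rho_prev * F + 1.0
    M = (1.0 - lam_theta) + lam_theta * F
    # Eligibility trace of psi: e_{theta,t} = rho_t * (M_t psi_t + lambda^theta e_{theta,t-1}).
    e_theta = rho_t * (M * psi_t + lam_theta * e_theta)
    # Gradient-ascent step on J_gamma.
    theta = theta + alpha_theta * delta * e_theta
    return theta, e_theta, F
```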

4.3 Multi-Agent Actor-Critic Algorithm

With the above single-agent discussion as a foundation, we proceed to introduce our multi-agent actor-critic algorithm. The overall structure is similar to the single-agent version, with two main differences: (i) each agent can only access its own reward information and it is responsible for maintaining its own estimates of the actor and critic updates; and (ii) the agents perform local updates at each time step and then use a consensus process to average their critic estimates over neighbors and to obtain the global importance sampling ratios for the current global policy.

Algorithm 1
The Multi-Agent Gradient Off-Policy Actor-Critic (MGOPAC) Algorithm

4.3.1 Actor Part

We start with the actor part of the multi-agent algorithm. Note that in this setting each agent has its own target policy. Therefore, it is reasonable that each agent only estimates its own policy’s parameter. For an agent i, we consider the gradient of \(J_\gamma \) w.r.t. its local policy parameter \(\theta ^{i}\) instead of the global joint parameter \(\theta \). Specifically, we have

$$\begin{aligned} \nabla _{\theta ^{i}} J_{\gamma }(\theta ) =\sum _{s \in {\mathcal {S}}} m(s) \sum _{a \in {\mathcal {A}}} \nabla _{\theta ^{i}} \pi _\theta (a | s) Q^{\pi , \gamma }(s, a), \end{aligned}$$
(21)

where m(s) is the emphatic weighting for \(s \in {\mathcal {S}}\), as defined above. Analogously, we can derive an incremental update algorithm by sampling observations from the behavior policy. Equation (21) at time step t can be written as an expectation:

$$\begin{aligned} \nabla _{\theta ^{i}} J_{\gamma }(\theta ) ={\mathbb {E}}_{\mu }\Bigg [M_{t} \rho _t\psi ^i(s_t, a^{i}_t)Q^{\pi , \gamma }(s_t, a_t)\Bigg ], \end{aligned}$$
(22)

where \(\psi ^i(s, a^{i}):=\frac{\nabla _{\theta ^{i}} \pi ^i(a^{i} | s)}{\pi ^i(a^{i} | s)}\) and other terms are defined as above. By introducing the approximate state-value function given by the critic as a baseline, the right hand side of Eq. (22) becomes \({\mathbb {E}}_{\mu }\left[ M_t\rho _t \psi ^i\left( s_{t}, a^{i}_{t}\right) \left( Q^{\pi , \gamma }\left( s_{t}, a_{t}\right) -{\hat{V}}\left( s_{t}\right) \right) \right] \). By approximating the action value, \(Q^{\pi , \gamma }\left( s_{t}, a_{t}\right) \), with the off-policy \(\lambda ^\theta \)-return, we can write the forward view of the actor update at agent i as

$$\begin{aligned} \theta ^{i}_{t+1}-\theta ^{i}_{t}=\alpha _{\theta , t} \rho _t M_t \psi ^i\left( s_{t}, a_{t}^i\right) \big (G_{t}^{\lambda ^\theta }-{\hat{V}}\left( s_{t}\right) \big ). \end{aligned}$$

The forward-view update is not directly implementable since it involves the future return. Denote by \(\delta _t^{i}:= r_{t+1}^{i}+\gamma _{t+1} {\hat{V}}^{i}\left( s_{t+1}\right) -{\hat{V}}^{i}\left( s_{t}\right) \) the conventional temporal difference error corresponding to agent i, and let \(\delta _{t}=\frac{1}{N}\sum _{i=1}^N \delta _t^{i}\). By conducting the derivation as above, we have the backward view of the actor update at agent i as

$$\begin{aligned} \theta ^{i}_{t+1}=\theta ^{i}_{t} + \alpha _{\theta , t} \delta _{t} {\textbf{e}}^{i}_{\theta ,t}, \end{aligned}$$
(23)

where \({\textbf{e}}^{i}_{\theta , t} \in {\mathbb {R}}^{N_{\theta }}\) is the local eligibility trace of \(\psi ^{i}\), updated by \({\textbf{e}}^{i}_{\theta ,t} =\rho _t\big (M_t\psi ^{i}(s_{t}, a^{i}_{t})+\lambda ^\theta {\textbf{e}}_{\theta , t-1}^{i}\big )\).
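The per-agent version differs from the single-agent step only in that agent i maintains the trace of its own score \(\psi ^{i}\) and uses the network-averaged TD error; a minimal sketch, assuming the average \(\delta _{t}\) has already been obtained by a consensus step, is given below.

```python
import numpy as np

def local_actor_step(theta_i, e_theta_i, psi_i_t, delta_bar, rho_t, M_t,
                     lam_theta, alpha_theta):
    """Per-agent backward-view actor update, Eq. (23); delta_bar is the
    network average (1/N) sum_i delta_t^i of the local TD errors."""
    # Local eligibility trace of psi^i.
    e_theta_i = rho_t * (M_t * psi_i_t + lam_theta * e_theta_i)
    # Local policy-parameter update.
    theta_i = theta_i + alpha_theta * delta_bar * e_theta_i
    return theta_i, e_theta_i
```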

4.3.2 Critic Part

We now consider the critic part of the multi-agent algorithm, where each agent tries to attain its own estimate of the TD solution of problem (10). Specifically, at time step t, agent i first executes its own action \(a_{t}^{i} \sim \mu ^{i}\left( \cdot | s_{t}\right) \), observes its local reward \(r_{t}^{i}\) and the next state \(s_{t+1}\), and computes its local log importance sampling weight by \(p_{t}^{i}=\log \Bigg [\pi _{\theta _{t}^{i}}^{i}\left( a_{t}^{i} | s_{t}\right) / \mu ^{i}\left( a_{t}^{i} | s_{t}\right) \Bigg ]\). Note that these weights are scalars. Thus, with reasonable communication, the agents on connected graphs can perform an exact consensus algorithm to evaluate the average of the local log importance sampling weights in finite time. We also refer to [37,38,39] for detailed descriptions of these algorithms. After achieving consensus, we have \(p_{t}^{i}=p_{t}^{j} = p_{t}:=\frac{1}{N} \sum _{i=1}^{N} \log \rho _{t}^{i} \) for all \(i, j \in {\mathcal {N}}\). This implies that \(\exp \left( N p_{t}^{i}\right) =\exp \left( \sum _{i=1}^{N} \log \rho _{t}^{i}\right) =\prod _{i=1}^{N} \rho _{t}^{i}=\rho _{t}=\pi _{\theta }\left( a_{t} | s_{t}\right) / \mu \left( a_{t} | s_{t}\right) \). Then, each agent conducts the local critic updates. For an agent i, it first computes the eligibility trace vector at time step t and the importance sampling ratio:

$$\begin{aligned} {\textbf{e}}_{t} = \rho _t \left( \phi (s_t)+\gamma _t \lambda {\textbf{e}}_{t-1} \right) , \quad \rho _{t}=\exp \left( N p_{t}^{i}\right) . \end{aligned}$$
(24)
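The identity behind Eq. (24) — that averaging the log weights and rescaling by N recovers the product of the local ratios — can be checked numerically; the local ratios below are hypothetical values used only for illustration.

```python
import numpy as np

# Hypothetical local importance ratios rho^i = pi^i(a^i|s) / mu^i(a^i|s).
rho_local = np.array([1.2, 0.8, 1.5, 0.9])
N = rho_local.size

# Each agent holds p^i = log(rho^i); exact average consensus yields p_t.
p_t = np.mean(np.log(rho_local))

# Recovering the global ratio: exp(N * p_t) = prod_i rho^i = rho_t.
rho_global = np.exp(N * p_t)
assert np.isclose(rho_global, np.prod(rho_local))
```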

Then, by averaging the local estimates over its neighbors denoted by \({\mathcal {N}}_i\), agent i updates its local critic parameters \({\textbf{v}}^{i}\) and \({\textbf{w}}^{i}\) via

$$\begin{aligned} {\textbf{v}}_{t+1}^{i}&= \hspace{-0.4em}\sum \limits _{j\in {\mathcal {N}}_i}({C_t})_{ij} \big ( {\textbf{v}}_{t}^{j} + \alpha _{{\textbf{v}}, t}\nonumber \\&\quad \Bigg [\delta _t^{j} {\textbf{e}}_{t} -\gamma _{t+1}(1-\lambda )\big (({\textbf{w}}_{t}^{j} )^{\top } {\textbf{e}}_{t} \big ) \phi _{t+1}\Bigg ] \big ), \end{aligned}$$
(25)
$$\begin{aligned} {\textbf{w}}_{t+1}^{i}&= \hspace{-0.4em}\sum \limits _{j\in {\mathcal {N}}_i}({C_t})_{ij} \big ({\textbf{w}}_{t}^{j} +\alpha _{{\textbf{w}}, t} \Bigg [\delta _t^{j} {\textbf{e}}_{t} -\big (({\textbf{w}}_{t}^{j} )^{\top } \phi _t\big ) \phi _t\Bigg ]\big ), \end{aligned}$$
(26)

where \(\delta _{t}^{i}:=\) \(r_{t+1}^{i}+\gamma ({\textbf{v}}^{i}_t)^{\top } \phi _{t+1}-({\textbf{v}}^{i}_t)^{\top } \phi _{t}\) is the local conventional TD error corresponding to agent i. The connection matrix \(C_t\) is row and column stochastic and is generated from the connected graph such that \((i, j) \in {\mathcal {E}}_{t}\) iff \((C_{t})_{ij} > 0\), and \((C_{t})_{ij}\) denotes the communication weight from agent j to i at time t. The parameters \(\alpha _{{\textbf{v}}, t}\) and \(\alpha _{{\textbf{w}}, t} \) denote the stepsizes of the \({\textbf{v}}\) and \({\textbf{w}}\) updates at time step t. The complete MGOPAC algorithm is described in Algorithm 1.
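For concreteness, one consensus critic step, Eqs. (25)–(26), can be sketched in vectorized form as below; V and W stack the agents' local weights as rows, C is the connection matrix \(C_t\), and all inputs are assumed to be supplied by the surrounding algorithm.

```python
import numpy as np

def consensus_critic_step(V, W, C, e, phi_t, phi_next, rewards,
                          gamma_next, lam, alpha_v, alpha_w):
    """One consensus critic step, Eqs. (25)-(26).
    V, W: (N, d) arrays whose i-th rows are v^i_t and w^i_t;
    C: (N, N) connection matrix; rewards: (N,) local rewards r^i_{t+1}."""
    # Local TD errors delta^i = r^i_{t+1} + gamma_{t+1} (v^i)^T phi_{t+1} - (v^i)^T phi_t.
    delta = rewards + gamma_next * (V @ phi_next) - V @ phi_t            # shape (N,)
    # Local GTD-style updates before neighborhood mixing.
    V_half = V + alpha_v * (np.outer(delta, e)
                            - gamma_next * (1.0 - lam) * np.outer(W @ e, phi_next))
    W_half = W + alpha_w * (np.outer(delta, e) - np.outer(W @ phi_t, phi_t))
    # Averaging over neighbors with the connection matrix C_t.
    return C @ V_half, C @ W_half
```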

5 Analysis

It can be observed that our algorithm has a recursive stochastic form similar to off-policy value function algorithms:

$$\begin{aligned} \theta _{t+1}=\theta _{t}+\alpha _{\theta , t}\left( h\left( \theta _{t}, {\textbf{v}}_{t}\right) +N_{t+1}\right) , \end{aligned}$$
(27)

where \(h(\cdot , \cdot )\) is a differentiable function and \(\left\{ N_{t}\right\} _{t \ge 0}\) denotes a noise sequence. Thus, we consider the behavior of an ordinary differential equation (ODE) of the following form:

$$\begin{aligned} {\dot{\theta }}(t)=h(\theta (t), {\textbf{v}}). \end{aligned}$$
(28)

The two updates (for the actor and for the critic) are not independent at each time step. Therefore, we analyze two separate ODEs using a two time-scale analysis in the framework of [40]. In this scheme, the convergence of the critic updates, which run on the faster time-scale, is analyzed with the actor parameters viewed as fixed; then the actor updates are studied while viewing the faster time-scale as equilibrated at each iteration. We make the following assumptions.

Assumption (A1)

The target policy, defined as a function of \(\theta \), \(\pi ^i_{(\cdot )}(a^{i} | s)\): \({\mathbb {R}}^{N_{\theta }} \rightarrow [0,1]\), is continuously differentiable, \(\forall s \in \) \({\mathcal {S}}, a^{i} \in {\mathcal {A}}^{i}\), and \(i \in \{1, \ldots , N\}\).

Assumption (A2)

The iterates \(\{{\textbf{v}}_t, {\textbf{w}}_t, \theta _t\}\) lie in some compact region \(B_{{\textbf{v}}} \times B_{{\textbf{w}}} \times B_{\theta }\) almost surely; and \(\{{\textbf{e}}_t, {\textbf{e}}_{\theta , t} \}\) are bounded with probability one.

Remark

Assumption (A1) is satisfied by most practical choices of target policy. Assumption (A2) is not demanding for the iterates in practice when the constraint sets are set large enough. Moreover, it is shown in [41] that under proper conditions on the parameter \(\lambda \), the boundedness of the eligibility traces is guaranteed.

Assumption (A3)

The feature matrix \(\Phi \), whose rows are the feature vectors \(\{\phi (s)^{T}\}_{s \in {\mathcal {S}}}\), has linearly independent columns, and the value function is approximated by \(V_{{\textbf{v}}}(s)=\phi (s)^{T} {\textbf{v}}\), which is linear in \({\textbf{v}}\).

Assumption (A4)

It holds that if \(\pi _{\theta ^{i}}^{i}\left( a^{i} | s\right) >0\), then \(\mu ^{i}\left( a^{i} | s\right) >0\), for each \(i \in {\mathcal {N}}\), \(\left( a^{i}, s\right) \in {\mathcal {A}}^{i} \times {\mathcal {S}}\), and \(\theta ^{i} \in \Theta ^{i}\).

These ensure that the importance sampling weights are always finite. Finally, we have the following standard assumptions on features, stepsizes, and connection matrices, which are also made in previous work [5, 10, 24, 41].

Assumption (A5)

\(\left\| \phi _{t}\right\| _{\infty }<\infty , \forall t\), where \(\phi _{t} \in {\mathbb {R}}^{N_{\textrm{v}}}\).

Assumption (A6)

\(\alpha _{{\textbf{v}}, t}\), \(\alpha _{{\textbf{w}}, t}\), \(\alpha _{\theta , t}>0\), \(\forall t\) are stepsizes such that \(\sum _{t} \alpha _{{\textbf{v}}, t}=\sum _{t} \alpha _{{\textbf{w}}, t}=\sum _{t} \alpha _{\theta , t}=\infty \) and \(\sum _{t} \alpha _{{\textbf{v}}, t}^{2}<\) \(\infty \), \(\sum _{t} \alpha _{{\textbf{w}}, t}^{2}<\infty \) and \(\sum _{t} \alpha _{\theta , t}^{2}<\infty \) with \(\frac{\alpha _{\theta , t}}{\alpha _{{\textbf{v}}, t}} \rightarrow 0\) and \(\frac{\alpha _{{\textbf{v}}, t}}{\alpha _{{\textbf{w}}, t}} \rightarrow 0\), as \(t\rightarrow +\infty \). Moreover, \(\lim _{t \rightarrow \infty } \frac{\alpha _{{\textbf{w}}, t+1}}{\alpha _{{\textbf{w}}, t}}=1\).

Assumption (A7)

For each element \(C_{t} \in \left\{ C_{t}\right\} _{t \in {\mathbb {N}}}\), it holds that

  1. The connection matrix \(C_{t}\) is row stochastic, \({\mathbb {E}}\left[ C_{t}\right] \) is column stochastic, and there exists \(\alpha \in (0,1)\) such that, for any \(C_{t}(i, j)>0\), we have \(C_{t}(i, j) \ge \alpha \).

  2. Set the spectral norm \(\rho := \rho \left( {\mathbb {E}}[C_{t}^{T}\left( I-{\textbf{1}}{\textbf{1}}^{T} / N\right) C_{t}]\right) \). Then, it holds that \(\rho <1\).

  3. Given the \(\sigma \)-algebra \(\sigma \left( C_{\tau },\left\{ r_{\tau }^{i}\right\} _{i \in {\mathcal {N}}}; \tau \le t\right) \), matrix \(C_{t}\) is conditionally independent of \(r_{t+1}^{i}\) for each \(i \in {\mathcal {N}}\).

Remark

Here, Assumption (A5) is necessary to show the stability of the critic updates (see the supplementary material for details). Assumption (A6) is standard and similar to the conditions in the literature for single-agent actor-critic algorithms employing two-time-scale analysis. Moreover, the first condition of (A7) is standard for guaranteeing convergence of the update for each agent to a consensus vector [42]. The second condition of (A7) relates to the connectivity of the communication graph \({\mathcal {G}}_t\). In fact, Refs. [43] and [44] showed that for gossip communication schemes, the condition holds true if and only if the corresponding communication network is connected. The third condition requires that \(C_t\) and \(\left\{ r_{t+1}^{i}\right\} _{i \in {\mathcal {N}}}\) are independent conditioned on the past history. This is reasonable as random communication link failures and the mixing schemes usually depend on the active connections rather than on the rewards observed by the agents.
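As a quick numerical illustration of the second condition in (A7), the sketch below builds a simple doubly stochastic mixing matrix for a ring of N agents (a Metropolis-style weight choice made only for this example; it is not necessarily the matrix used in Sect. 6) and checks that the spectral norm of \(C^{T}(I-{\textbf{1}}{\textbf{1}}^{T}/N)C\) is below one.

```python
import numpy as np

def ring_mixing_matrix(N):
    """Doubly stochastic mixing matrix for a ring of N >= 3 agents:
    each agent averages equally with itself and its two ring neighbors."""
    C = np.zeros((N, N))
    for i in range(N):
        C[i, i] = 1.0 / 3.0
        C[i, (i - 1) % N] = 1.0 / 3.0
        C[i, (i + 1) % N] = 1.0 / 3.0
    return C

N = 9
C = ring_mixing_matrix(N)
M = C.T @ (np.eye(N) - np.ones((N, N)) / N) @ C
spectral_norm = np.linalg.eigvalsh(M).max()   # M is symmetric PSD
assert spectral_norm < 1.0                    # second condition of (A7)
```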

With these assumptions in place, we are able to provide the following convergence result for the proposed multi-agent off-policy actor-critic algorithm.

Theorem 1

(Convergence of MGOPAC) Suppose that Assumptions (A1)–(A7) are satisfied. Then, the policy parameters, \(\theta _{t}^{i}\), converge to \({\mathcal {O}}^i:=\{\theta ^{i} \in \Theta ^{i} \,|\, {\textbf{g}}^i(\theta ^{i})=0\}\), for \(i \in [N]\); moreover, the value function weights, \({\textbf{v}}_{t}^{i}\), converge to a consensus vector which is the corresponding solution of (10), with probability one.

We remark that the conclusion regarding the convergence of the policy parameters is, in general, the best result one can prove for a stochastic gradient algorithm with a possibly nonconvex objective function [14, 16, 19, 24]. The detailed proof of Theorem 1 is included in the supplementary. In the following, we provide a sketch of the proof.

Proof Sketch We adopt the scheme of two time-scale analysis [40]. This has also been used to analyze single-agent policy gradient actor-critic algorithms [17, 19, 45]. We analyze the dynamics of the weights of the agents, which include the actor weights \(\{\theta ^{i}_{t}\}_{i\in [N]}\) and the critic weights \(\big \{\big ( ({\textbf{w}}_{t}^{i})^{\top }, ({\textbf{v}}_{t}^{i})^{\top }\big )^{\top }\big \}_{i\in [N]}\). Under the above assumptions, the proof verifies that the requirements in Refs. [46] and [47] are satisfied by the update rules. Once these requirements are verified, we can apply their results to establish the convergence of \(\{\theta ^{i}_{t}\}_{i\in [N]}\) and \(\big \{ {{\textbf{v}}_{t}^{i}}\big \}_{i\in [N]}\) to an asymptotically stable equilibrium and the corresponding TD solution, respectively.

Fig. 1
5-State Boyan Chain MDP with N agents. Every transition of the agents is specified by a probability (left number) and a reward (right number). Here, each feature can be set as a vector of length 2N, denoted by y, whose entries \([y_{2i-1}, y_{2i}]\) form the feature corresponding to the state of Agent i as in the single-agent case, for \(i\in [N]\). At the bottom are the reward multipliers: the reward for agent i is the single-agent reward multiplied by the multiplier associated with the state of the most advanced agent, excluding i

6 Experiments

In this section, we explore the performance of the proposed algorithm for the multi-agent generalizations of the single-agent Boyan’s chain task [48]. We compare it with the state-of-the-art baseline algorithm proposed in Ref. [24], which we denote here as EMOPAC (Emphatic Multi-agent Off-Policy Actor-Critic). This algorithm is currently the only multi-agent generalization of advanced temporal difference learning algorithms (Emphatic TD) with critic consensus as in the proposed method. All experiments were performed with Python 3.7 on an Intel i7 CPU at 3.4 GHz (32 GB RAM) desktop.

Multi-agent Boyan’s Chain Problem Assume we have N agents. We consider a non-trivial multi-agent generalization of the single-agent Boyan’s chain problem, where communication is needed as an agent’s local rewards depend not only on its own states, but on the states of other agents. Specifically, when there is another agent ahead of it, its transition is rewarded less; on the other hand, if the agent is the most advanced in the group, its transition is rewarded more. Each agent’s reward for a transition is equal to the single-agent reward multiplied by the position in the chain of the most advanced agent in the group, excluding itself. Since each local reward is negative, multiplication by a larger multiplier (because another agent has advanced further) has the effect of reducing the reward.

Fig. 2
Convergence on the 9-agent Boyan’s chain of length 5; the right subplot shows the consensus process when \(\lambda = 0\)

For example, suppose we have a 5-state Boyan’s chain with N agents as in Fig. 1. Assume that Agent 1 is in state S3 and Agents \(2, \ldots , N\) are in state S2; then the reward for Agent 1 is the corresponding single-agent reward for state S3 multiplied by 2 (since the most advanced position attained by the agents other than Agent 1 is state S2); similarly, the reward for Agent N equals the corresponding single-agent reward for state S2 multiplied by 3 (since the most advanced position of any other agent is state S3). The total reward is the summation of all agents’ local rewards. For the multi-agent Boyan’s chain, there exists a feature map that can represent the state value function linearly, as in the single-agent case. The size of the state space \(|{\mathcal {S}}|\) grows rapidly with the number of agents; for example, when \(N=9\), \(|{\mathcal {S}}|\) is on the order of \(5^9 = 1{,}953{,}125\) and the size of the joint action space \(|{\mathcal {A}}|\) is on the order of \(2^9 = 512\).
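To make the reward construction explicit, the following sketch computes the local rewards from the agents' chain positions; the single-agent reward and the state-to-multiplier map are passed in as callables (a simplification: the single-agent reward actually depends on the sampled transition), since their exact values are read off Fig. 1 and are not restated here.

```python
def multi_agent_boyan_rewards(states, single_agent_reward, multiplier):
    """Local rewards in the multi-agent Boyan chain (illustrative sketch).

    states:              chain position of each agent, states[i] for agent i
    single_agent_reward: maps agent i's current position to its single-agent
                         Boyan reward for the transition (from Fig. 1)
    multiplier:          maps a chain position to its reward multiplier
                         (the numbers at the bottom of Fig. 1)
    """
    rewards = []
    for i, s_i in enumerate(states):
        # Most advanced position among the *other* agents.
        most_advanced_other = max(s for j, s in enumerate(states) if j != i)
        rewards.append(single_agent_reward(s_i) * multiplier(most_advanced_other))
    return rewards
```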

In the experiments, the connection matrix \(C_t\) is generated from a ring graph that links the agents together successively. The stepsizes \(\alpha _{{\textbf{v}}, t}\), \(\alpha _{{\textbf{w}}, t}\), \(\alpha _{\theta , t}\) are set as \(\frac{\alpha }{(t+1)^\frac{5}{8}}\), \(\frac{\alpha }{(t+1)^\frac{9}{16}}\), and \(\frac{\beta }{t+1}\), respectively, where \(\alpha \) and \(\beta \) are positive constants. These constants are swept over and selected so that each algorithm achieves its best performance. Moreover, the behavior policy is set to be uniform, which means that each feasible action is taken with equal probability at all states. Since the multi-agent experiments are simulated on a single computer, the consensus values of \(\{p_t^{k}\}_{k=1}^N\) and \(\{\delta _t^{k}\}_{k=1}^N\) can be computed directly. We assume that the local target policies are parameterized by a form of softmax function, known as Boltzmann policies, defined as \(\pi _{\theta ^i}^i(s, a^i) = \frac{\exp \big (q_{s,a^i}^T \theta ^i \big )}{\sum _{b^i\in {\mathcal {A}}^i}\exp \big (q_{s,b^i}^T \theta ^i\big )}\), where \(q_{s,b^i}\in {\mathbb {R}}^{N_{\theta }}\) denotes the basis vector for any \(s\in {\mathcal {S}}\) and \(i\in [N]\). Here, \(q_{s,b^i}\) is of the same dimension as \(\theta ^i\), which is set to 2N. The elements of \(q_{s,b^i}\) are generated from a uniform distribution on [0, 1]. The results of applying the proposed method and the baseline algorithm to the multi-agent Boyan’s chain problem under different settings of the bootstrapping parameter \(\lambda \) are shown in Fig. 2. We show the results for \(\lambda = 0\), 0.2, 0.5, since values closer to 1 lead to more oscillation in the learning curves of both algorithms. In the experiments, both algorithms are run for 10 trials. First, the right panel of Fig. 2 shows that the consensus of the agents is reached quickly (within the first 50 episodes). Figure 2 also shows that our algorithm (MGOPAC) attains a stationary value at around 15,000 episodes, while the baseline algorithm (EMOPAC) reaches that value more slowly (at around 23,000 episodes) when \(\lambda = 0\), 0.2. In addition, MGOPAC demonstrates better stability than EMOPAC when \(\lambda = 0\), 0.2, 0.5. Similar merits of the proposed algorithm are observed for other connection graphs and different numbers of states and agents. The results corroborate the favorable properties of the proposed gradient-based method.
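As a final illustration, the Boltzmann parameterization above and its score function \(\psi ^i\) (needed for the actor trace) can be written compactly as below; q_s stacks the basis vectors \(q_{s, b^i}\) for the current state, and the names are placeholders rather than the exact implementation used for Fig. 2.

```python
import numpy as np

def boltzmann_policy(theta_i, q_s):
    """pi^i_{theta^i}(s, .) for the Boltzmann policy of Sect. 6;
    q_s is a (|A^i|, dim) array whose rows are the basis vectors q_{s, b^i}."""
    logits = q_s @ theta_i
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def score(theta_i, q_s, a_idx):
    """psi^i(s, a^i) = grad_{theta^i} log pi^i = q_{s,a^i} - sum_b pi(b|s) q_{s,b}."""
    probs = boltzmann_policy(theta_i, q_s)
    return q_s[a_idx] - probs @ q_s
```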

7 Conclusions and Remarks

This paper proposes a gradient-based multi-agent off-policy actor-critic framework for reinforcement learning problems. Theoretically, we prove that the critic updates converge to a consensus limiting point, which is the TD solution of the projected Bellman equation, and the actor updates converge to the set of asymptotically stable fixed points of problem (16). Moreover, unlike algorithms in previous work, both the critic step and the actor step of the proposed method are based on full gradient descent (ascent) of the corresponding objective functions. In the experiments, we compare our approach with a state-of-the-art baseline algorithm, and the results show the superior performance of our algorithm in terms of stability and convergence rate. Note that the analysis of the proposed algorithm is based on the assumption that the state value function is linearly representable by the feature map and critic weights. Thus, one interesting direction for future research would be to extend the proposed gradient-based framework to scenarios with nonlinear approximators such as deep neural networks and to investigate its performance on more complex real-world problems. Moreover, since the proposed method is currently developed and analyzed in the Markovian setting, it is also worth studying the extension of its analysis and applications to non-Markovian scenarios.