Distributed dynamic stochastic approximation algorithm over time-varying networks

In this paper, a distributed stochastic approximation algorithm is proposed to track the dynamic root of a sum of time-varying regression functions over a network. Each agent updates its estimate by using its local observation, the dynamic information of the global root, and the information received from its neighbors. Compared with similar works in the optimization area, we allow the observations to be noise-corrupted, and the noise condition is much weaker. Furthermore, instead of an upper bound on the estimation error, we present the asymptotic convergence result of the algorithm: the consensus and convergence of the estimates are established. Finally, the algorithm is applied to a distributed target tracking problem, and a numerical example is presented to demonstrate its performance.


Introduction
In recent years, there has been a huge increase in the scale of data in many real-life applications. Traditional centralized computing methods face many challenges and are sometimes entirely infeasible for large-scale problems. As a result, distributed algorithms over multi-agent systems have received much attention from researchers in diverse areas, including the consensus problem [1][2][3][4], resource allocation (RA) [5,6], multi-unmanned aerial vehicle (MUAV) control [7], and distributed target tracking [8]. Distributed algorithms are usually associated with a network of agents, each of which has limited computation and communication ability. The agents are required to cooperatively achieve a global objective by using their local observations and the information transmitted from their neighbors. Compared with centralized approaches, distributed algorithms have the advantages of robustness against network link failures, privacy protection, and reduced communication and computation costs.
One of the important branches of distributed algorithms is the distributed optimization problem. It seeks the minimizer of a global function written as a sum of the local functions of the agents. In particular, distributed optimization for time-invariant cost functions has become a mature discipline with many results; see [9][10][11] and the references therein. On the other hand, optimization problems with time-varying cost functions have attracted much attention due to their appearance in various applications, for example, signal processing [12] and online optimization [13,14]. The main challenge of time-varying optimization lies in the fact that the minimizer of the time-varying cost function changes with time. Since traditional optimization algorithms can only move the estimates towards the minimizer of the cost function at the current time, they cannot track the movement of the minimizer in a dynamic environment. To cope with this issue, two different strategies have been developed. The first is the running method [15,16], where the algorithm samples the time-varying cost function at a fixed frequency and performs a traditional optimization algorithm on the sampled function between sampling times. The second is the prediction-correction method [17,18], where the algorithm optimizes only the cost function at the current time at each step, but additional information on the dynamics of the moving minimizer is required, such as the second derivative of the cost function [18] or an additional constraint on the dynamics of the minimizer [14]. While we consider a different problem in this paper, the method we utilize closely resembles the second strategy.
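The gap between the two strategies can be seen in a toy scalar instance (our own illustration, not from the paper): take f_k(x) = (x − θ_k)²/2 with a minimizer drifting by a known constant c per step, and compare a pure gradient correction with a correction plus drift prediction.

```python
# Toy comparison: track the minimizer theta_k of f_k(x) = 0.5*(x - theta_k)^2,
# where the minimizer drifts by a known constant c each step (hypothetical setup).
c, a, K = 0.1, 0.5, 200   # drift per step, gradient step size, horizon
theta = 0.0
x_run = x_pc = 0.0        # running method vs. prediction-correction
err_run, err_pc = [], []

for _ in range(K):
    theta_next = theta + c
    x_run = x_run - a * (x_run - theta)       # correction only
    x_pc = x_pc - a * (x_pc - theta) + c      # correction + predicted drift
    theta = theta_next
    err_run.append(abs(x_run - theta))
    err_pc.append(abs(x_pc - theta))
```

The running method settles at a steady tracking lag of c/a, while the prediction step removes the lag entirely, which is exactly why the dynamic information of the moving minimizer is valuable.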
It can be noticed that an optimization problem can often be transformed into a root-seeking problem, since minimizing a differentiable convex function is equivalent to seeking the root of its gradient, and when gradients are unavailable we can use finite differences to estimate them [19]. So it is natural to consider the distributed root-tracking problem. Distributed stochastic approximation for time-invariant regression functions has been studied by many researchers as a solution to the distributed root-seeking problem [20][21][22]. Inspired by the work on dynamic stochastic approximation [23,24], in this paper we propose a distributed stochastic approximation algorithm for tracking the changing root of a sum of time-varying regression functions over a network. Each agent aims to track the changing root of the global function, but it can only access a noise-corrupted local observation and the information transmitted from its neighbors. In addition, the noise-corrupted dynamics of the roots of the global regression function is assumed to be known to all agents.
In this paper, the distributed root-tracking problem for time-varying regression functions is considered. First, motivated by the truncation technique given in [22], a distributed stochastic approximation algorithm with expanding truncations is introduced. The key difference is that the observation of the local function in this algorithm is noise-corrupted, while exact gradient information is often required in the optimization algorithms mentioned above. Second, under the assumption that the noise-corrupted dynamics of the global roots is known to all agents, the convergence conditions of the algorithm are introduced. Third, it is proved that the estimates generated by the distributed algorithm achieve consensus and converge with probability one. Finally, we apply the algorithm to a distributed target tracking problem. A numerical example is given to demonstrate the performance of the algorithm.
The rest of the paper is organized as follows. The problem formulation and the distributed stochastic approximation algorithm are given in Section 2. The convergence conditions and results are presented in Section 3. To support the proof of the convergence result, two auxiliary sequences are defined and analysed in Section 4. The proof of the main result is given in Section 5. In Section 6, a distributed target tracking problem is solved by the algorithm and a numerical example is demonstrated. Some concluding remarks are given in Section 7.

Problem formulation
Consider a network system consisting of N agents. The interaction relationship among the agents is described by a time-varying digraph G(k) = {V, E(k)}, where V = {1, . . . , N} is the set of agents and E(k) is the set of edges at time k. The adjacency matrix associated with the graph is denoted by W (k) = [w ij (k)] N i,j=1 . The time-varying global regression function is given by f k (x) = Σ N i=1 f i,k (x), (1) where f i,k (·) : R l → R l is the local function associated with agent i. Denote by θ k the root of the sum function f k (·) at time k, i.e., f k (θ k ) = 0, k = 1, 2, . . . Further, assume that the dynamics of the root θ k is governed by θ k+1 = g k (θ k ) + ξ k+1 , (2) where the function g k (·) : R l → R l is known to all agents, and {ξ k } is the sequence of dynamic noises. As can be seen in Section 6, this assumption is reasonable in some real-life applications and has been studied before in [14,25,26]. For each agent i, the distributed root-tracking problem is to track the dynamic root of the time-varying global function by using its noise-corrupted observation of the local function f i,k (·), the dynamic information of the root g k (·), and the information obtained from its adjacent neighbors.
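To make the setup concrete, here is a minimal numerical sketch (the linear local functions A_i(θ_k − x) and the constant drift g are hypothetical choices, not from the paper): an individual f_{i,k} may have many roots, but the sum has the unique root θ_k.

```python
import numpy as np

rng = np.random.default_rng(0)
N, l = 4, 2   # number of agents, dimension of the state

# Hypothetical linear local functions f_{i,k}(x) = A_i (theta_k - x): some A_i
# are singular (those agents "see" only part of the error), but sum(A_i) is
# positive definite, so the global sum has the unique root x = theta_k.
A = [np.diag(v) for v in ([1, 0], [0, 1], [1, 1], [2, 1])]

def f_local(i, x, theta_k):
    return A[i] @ (theta_k - x)

def f_global(x, theta_k):
    return sum(f_local(i, x, theta_k) for i in range(N))

def g(theta):   # known root dynamics: here a constant drift
    return theta + np.array([0.1, -0.1])

theta = np.zeros(l)
# theta_{k+1} = g_k(theta_k) + xi_{k+1} with small dynamic noise
theta_next = g(theta) + 0.01 * rng.standard_normal(l)
```

Agent 0 alone cannot pin down θ_k (its A_0 is singular), which is what makes the cooperation through the network necessary.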

Algorithm
We now introduce the distributed root-tracking algorithm as follows: where 1) x i,k ∈ R l is the estimate of θ k given by agent i at time k; 2) O i,k+1 , defined by (7), is the local observation of agent i; 3) {a k } k≥0 is the sequence of step sizes used by all agents; 4) x * is a fixed vector in R l known to all agents; 5) {M k } k≥0 is a sequence of positive numbers increasingly diverging to infinity with M 0 ≥ ||x * ||; 6) σ i,k is the truncation number of agent i up to time k; and 7) h k (·) is the composition of the root dynamics up to time k, i.e., h k (x) = g k (g k−1 (· · · g 0 (x))). Let us explain the algorithm. 1) For agent i, x i,k is the estimate of θ k . Since the dynamics of {θ k } is governed by (2), in order to make sure the estimate tracks the dynamic root, the update at time k + 1 utilizes g k (x i,k ) instead of x i,k , as shown in (3). 2) For agent i, a truncation happens when one of the following cases holds: a) σ i,k < σ̄ i,k , which means that there is at least one neighbor whose truncation number is larger than that of agent i; b) ||x i,k+1 − h k (x * )|| > M σ i,k , which means that the distance between the intermediate value x i,k+1 and h k (x * ) is larger than the truncation bound. When a truncation happens, the estimate x i,k is pulled back to h k−1 (x * ). 3) It can be seen that truncations may not happen at the same time for different agents in the network. So for agent i, the update (5) makes sure that the truncation number of agent i is not smaller than the largest truncation number among its neighbors, i.e., σ̄ i,k . As shown in Lemma 4, this technique guarantees that the difference between the truncation numbers of different agents is bounded, which helps the algorithm converge. 4) The truncation mechanism makes sure that the estimates x i,k do not drift too far away from h k−1 (x * ). As shown in Lemma 1, the distance between h k−1 (x * ) and the dynamic root {θ k } is bounded, so this truncation condition is a reasonable choice.
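The verbal description above can be condensed into one synchronous step (a sketch under our own reading of (3)-(7): the observation model, the exact counter update, and the anchor h_k(x*) are assumptions, since the displayed equations are not reproduced here):

```python
import numpy as np

def truncated_update(x, sigma, W, g_k, anchor, obs, a_k, M):
    """One synchronous step of the distributed root-tracking update (sketch).

    x      : (N, l) current estimates x_{i,k}
    sigma  : (N,)   truncation counters sigma_{i,k}
    W      : (N, N) doubly stochastic weight matrix W(k)
    g_k    : known root dynamics, applied to a single estimate (row)
    anchor : h_k(x*), the point truncated estimates are pulled back to
    obs    : (N, l) noisy local observations O_{i,k+1}
    a_k    : step size
    M      : callable m -> truncation bound M_m, increasing to infinity
    """
    N = x.shape[0]
    # largest truncation number among in-neighbors (agents j with W[i, j] > 0)
    sigma_bar = np.array([sigma[W[i] > 0].max() for i in range(N)])
    # consensus on g_k of the neighbors' estimates, plus local correction
    x_mid = W @ np.apply_along_axis(g_k, 1, x) + a_k * obs
    x_new, sigma_new = x_mid.copy(), sigma.copy()
    for i in range(N):
        lagging = sigma[i] < sigma_bar[i]                          # case a)
        too_far = np.linalg.norm(x_mid[i] - anchor) > M(sigma[i])  # case b)
        if lagging or too_far:
            x_new[i] = anchor                      # pull back to h_k(x*)
            sigma_new[i] = max(sigma_bar[i], sigma[i] + (1 if too_far else 0))
    return x_new, sigma_new
```

For instance, on a fully connected graph with one agent holding a larger counter, a single step forces every counter up to the maximum, which is the behavior that Lemma 4 quantifies on general graphs.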
In our previous work [27], we proposed the distributed root-tracking algorithm without the expanding truncation mechanism. To guarantee convergence, in [27] we assumed that the dynamic root {θ k } and the estimates {x i,k } of all agents are bounded sequences. With the introduction of the expanding truncation mechanism, this assumption is removed in this paper.

Assumptions
Let us list the assumptions to be used in the paper.
where a is a positive constant, and where η is an unknown constant specified later in Lemma 1. A3 The class of functions {f i,k (·)} k≥0 is equi-continuous for i = 1, . . . , N, i.e., for any fixed i and any ε > 0, there exists δ > 0 such that where δ only depends on ε. Furthermore, for any c > 0, there exists a constant α(c) such that ||f i,k (θ k + ν)|| < α(c) for all ν with ||ν|| ≤ c, all i ∈ V, and k = 1, 2, . . . A4 a) The adjacency matrices W (k), k ≥ 0, are doubly stochastic; b) there exists a constant 0 < κ < 1 such that c) the graph G ∞ = {V, E ∞ } is strongly connected, where E ∞ = {(j, i) : (j, i) ∈ E(k) for infinitely many indices k}; d) there exists a positive integer B such that for all (j, i) ∈ E ∞ and any k ≥ 0. A5 For any i ∈ V, the noise sequence {ε i,k+1 } k≥0 is such that where m(k, T) = max{m : Σ m i=k a i ≤ T} and {n k } denotes the indices of any convergent subsequence {x i,n k − θ n k }. A6 g k (·) : R l → R l is equi-continuous with respect to k and is such that A1 and A2 are the standard assumptions for stochastic approximation. A3 implies the local boundedness of the functions f i,k (·). Notice that the upper bound α(c) in A3 should be uniform with respect to k.
A4 describes the information exchange among agents; we refer to [9] for a detailed explanation. Set Φ(k, k + 1) = I N and Φ(k, s) = W (k)W (k − 1) · · · W (s) for k ≥ s. By Proposition 1 in [9] it follows that there exist constants c > 0 and 0 < ρ < 1 such that ||Φ(k, s) − (1/N )11 T || ≤ cρ k−s+1 for all k ≥ s ≥ 0. (9) Notice that in A5 b), the noise condition is required to hold only along the indices of any convergent subsequence {x i,n k − θ n k }. As will be seen in the next section, this makes the convergence analysis much easier compared with requiring the noise condition to hold along the whole sequence.
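A concrete family satisfying A4 a) makes the geometric bound visible (our own illustration: a fixed ring of N = 6 agents with weight 1/3 to itself and to each ring neighbor):

```python
import numpy as np

# Metropolis-style weights on a ring of N agents: symmetric, doubly
# stochastic, with positive diagonal -- a concrete W(k) satisfying A4 a).
N = 6
W = np.zeros((N, N))
for i in range(N):
    W[i, i] = W[i, (i - 1) % N] = W[i, (i + 1) % N] = 1 / 3
J = np.ones((N, N)) / N    # the limit (1/N) 1 1^T

# For a fixed connected W, the product Phi(k, 0) = W^k approaches J
# geometrically, with ratio rho = second largest |eigenvalue| of W.
rho = sorted(np.abs(np.linalg.eigvals(W)))[-2]
errs = {k: np.linalg.norm(np.linalg.matrix_power(W, k) - J) for k in (1, 5, 10, 20)}
```

Here ρ = 2/3 is the second-largest eigenvalue modulus of W, playing the role of ρ in the geometric bound from Proposition 1 of [9].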
In A6, d k (x) measures the difference between the estimation error x i,k − θ k and the prediction error g k (x i,k ) − g k (θ k ). This assumption implies that the dynamics of the root, i.e., g k (·), tends to a linear function as time k goes to infinity. For example, if the dynamics of the changing roots is g k (x) = x + c, then A6 holds with γ k = 0.
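The remark can be verified directly (a sketch assuming, as the surrounding text indicates, that d_k(x) is the deviation of g_k from linearity in the error, i.e., d_k(x) = ||g_k(x) − g_k(θ_k) − (x − θ_k)||):

```latex
d_k(x) = \bigl\| g_k(x) - g_k(\theta_k) - (x - \theta_k) \bigr\|
       = \bigl\| (x + c) - (\theta_k + c) - (x - \theta_k) \bigr\| = 0 ,
```

so A6 indeed holds with γ_k = 0. Similarly, g_k(x) = (1 + γ_k)x gives d_k(x) = γ_k ||x − θ_k||, which under our reading is consistent with A6 whenever the γ_k are summable.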

Main result
Denote X k = col{x 1,k , . . . , x N,k }. Further, we denote the disagreement vector of X k by X ⊥,k = D ⊥ X k , where D ⊥ = (I N − (1/N )11 T ) ⊗ I l .

Theorem 1 Let {x i,k } be the estimates produced by (3)-(7)
with an arbitrary initial value x i,0 . Assume A1-A4 and A6 hold. If for a fixed sample ω, A5 holds for all agents and A7 holds, then for this ω the following assertions take place: i) There exists a positive integer k 0 depending on ω such that or, in the compact form, Theorem 1 i) shows that the truncations cease after a finite number of steps. This implies that the difference between the estimate x i,k and h k−1 (x * ) is bounded, which is desirable, as shown in Lemma 1. Before we move on to the proof of Theorem 1, we first show that the truncation mechanism is reasonable.

Lemma 1 If A6 and A7 hold, the sequence
Proof By A6 and A7, from (2) it follows that As mentioned in Section 2, Lemma 1 shows that the distance between h k−1 (x * ) and {θ k } is bounded. Since we hope the estimates x i,k generated by algorithm (3)-(7) track the root {θ k }, the truncation mechanism is intuitively reasonable.

Auxiliary sequences
The next two sections focus on the proof of Theorem 1. Prior to analyzing x i,k , however, we need to introduce two auxiliary sequences x̂ i,k and ε̃ i,k for each agent i ∈ V. The motivation for constructing these two sequences comes from the nature of distributed algorithms with expanding truncations. Recall the convergence analysis of the stochastic approximation algorithm with expanding truncations (SAAWET) [28]. The key step is to show that the truncations cease after a finite number of steps, whereby the boundedness of the estimates is established. If the number of truncations increased unboundedly, then the estimate would be pulled back to x * infinitely many times, which would produce a convergent subsequence of the estimate sequence. A contradiction can then be derived by analysing the properties along this subsequence, which proves the boundedness of the estimates.
Although the problem in this paper differs from the one in [28] since the regression function here is time-varying, we use the same approach to prove the boundedness of the estimates. Notice that the distributed algorithm with expanding truncations (3)-(7) shares this feature. However, the estimate sequence may still not contain any convergent subsequence, because truncations may occur at different times for different i ∈ V. Therefore, the analysis approach used for SAAWET cannot be directly applied to algorithm (3)-(7).
To overcome this difficulty, we introduce the auxiliary sequences x̂ i,k and ε̃ i,k . As will be shown, the auxiliary sequences x̂ i,k satisfy the recursions (19)-(21), for which the truncation number at time k is the same for all agents and the estimates x̂ i,k of all agents are pulled back to h k−1 (x * ) when σ k > σ k−1 . The auxiliary noise ε̃ i,k satisfies a condition similar to A5 b). These facts make the analysis of (19)-(21) feasible.
It is shown below that the important feature of the auxiliary sequences is that x̂ i,k and x i,k coincide after a finite number of steps, which means that the convergence of the two sequences is equivalent.
Denote by τ i,m = inf{k : σ i,k = m} the smallest time when the truncation number of agent i reaches m, by τ m = min i∈V τ i,m the smallest time when at least one agent has its truncation number reach m, and by σ k = max i∈V σ i,k the largest truncation number among all agents at time k.
For any i ∈ V, define the auxiliary sequences {x̂ i,k } k≥0 and {ε̃ i,k } k≥0 as follows: where m is an integer. Note that for the considered sample ω there exists a unique integer m ≥ 0 corresponding to each integer k. So, {x̂ i,k } k≥0 and {ε̃ i,k } k≥0 are uniquely determined by the sequences {x i,k } k≥0 and {ε i,k } k≥0 . From (14) we conclude (16).

Lemma 4 Assume A4 holds. Then i)
where d i,j is the length of the shortest directed path from i to j in G ∞ , and B is the positive integer given in A4 d).
where D = max i,j∈V d i,j .
Proof i) Since G ∞ is strongly connected by A4 c), for any j ∈ V there exists a sequence of nodes i 1 , . . . forming a directed path from i to j, and hence by (6) and (5) we have σ i 1 ,k+B ≥ σ i,k . Repeating this procedure, we obtain σ i 2 ,k+2B ≥ σ i 1 ,k+B ≥ σ i,k , and finally we reach (34). ii) For some m ≥ 1, let τ m = k 1 . Then there exists an i such that τ i,m = k 1 . By (34) we have σ j,k 1 +Bd i,j ≥ m for all j ∈ V. For the case where σ j,k 1 +Bd i,j = m for all j ∈ V, we have τ j,m ≤ k 1 + Bd i,j for all j ∈ V. Noticing τ m = k 1 , by the definition of τ j,m we obtain (35). For the case where σ j,k 1 +Bd i,j > m for some j ∈ V, we have τ m+1 ≤ k 1 + Bd i,j for that j, and hence τ m+1 ≤ τ m + BD. Again, we obtain (35). This corollary can be easily obtained from (34).
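The propagation in part i) can be checked numerically on the simplest strongly connected graph (our own illustration: a directed ring, where at every step each agent adopts the largest counter among itself and its single in-neighbor, playing the role of updates (5)/(6) with B = 1):

```python
import numpy as np

# Max-consensus propagation of truncation counters over a directed ring:
# each step, every agent raises its counter to the largest counter among
# its in-neighbors (including itself).
N = 6
in_neighbors = [[(i - 1) % N, i] for i in range(N)]   # directed ring + self-loop

sigma = np.zeros(N, dtype=int)
sigma[0] = 3              # suppose agent 0 has truncated three times

steps = 0
while sigma.min() < sigma.max():
    sigma = np.array([max(sigma[j] for j in in_neighbors[i]) for i in range(N)])
    steps += 1
# The counter reaches all agents in exactly N - 1 steps, the diameter of the
# ring, mirroring the bound sigma_{j, k + B d_{i,j}} >= sigma_{i,k} with B = 1.
```

This is the mechanism behind Lemma 4 ii): a new truncation level set at one agent is adopted by every agent within at most BD further steps.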
We show that {h k (x * ) − θ k+1 } is a convergent sequence by proving that it is a Cauchy sequence. For two integers j > i > 0, we see where the last inequality comes from A6 and Lemma 1. By A6 and A7 we know that Σ ∞ k=1 γ k < ∞ and Σ ∞ k=1 ||ξ k || < ∞. Furthermore, we can prove that (37) holds for case iii). Since one of cases i), ii), iii) must take place when lim k→∞ σ k = ∞, we conclude that (37) holds in Case 2.
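The increment that makes the sequence Cauchy can be written out (a sketch assuming h_k(x) = g_k(h_{k−1}(x)) and θ_{j+1} = g_j(θ_j) + ξ_{j+1}, with the linearization defect d_j bounded via A6):

```latex
\bigl(h_j(x^*) - \theta_{j+1}\bigr) - \bigl(h_{j-1}(x^*) - \theta_j\bigr)
  = \Bigl[ g_j\bigl(h_{j-1}(x^*)\bigr) - g_j(\theta_j)
           - \bigl(h_{j-1}(x^*) - \theta_j\bigr) \Bigr] - \xi_{j+1} ,
```

so the increment norm is at most d_j(h_{j−1}(x*)) + ||ξ_{j+1}||; since Σ γ_k < ∞ and Σ ||ξ_k|| < ∞, the increments are absolutely summable and {h_k(x*) − θ_{k+1}} is Cauchy.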

Corollary 2
In (38) we show that τ σ+1 = ∞ when lim k→∞ σ k = σ < ∞. From (20) we know that {x̂ i,k } and {x i,k }, and {ε̃ i,k } and {ε i,k }, coincide after a finite number of steps.

Proof of the main result
Define: Since the matrices W (k) are doubly stochastic, by the property of D ⊥ we have The following lemma characterizes the closeness of the auxiliary sequence along its convergent subsequences indexed by {n k }.
for any m = n k , . . . , m(n k , T k ), and

Proof Consider a fixed sample path ω for which A5 and A7 hold.
Let C > || ||. There exists an integer k C > 0 such that From Lemma 5 we know that there exist constants T 1 > 0 and k 0 > k C such that Define where c and ρ are given by (9). Select T such that For any k ≥ k 0 and any T k ∈ [0, T] define So from (51) and (56) it follows that We intend to prove s k > m(n k , T k ). Assume the converse: that for sufficiently large k ≥ k 0 and any T k ∈ [0, T], s k ≤ m(n k , T k ). (59) We first show that there exists a positive integer k 1 > k 0 such that (60) holds for any k ≥ k 1 . We prove (60) for two cases: lim k→∞ σ k = ∞ and lim k→∞ σ k = σ < ∞. i) lim k→∞ σ k = ∞: From (58) we know that ||x̂ i,n k − θ n k || ≤ M 0 + C + 1 for all i ∈ V. First, we prove that for sufficiently large k, truncation does not happen at time n k + 1. For any i ∈ V, we consider the following two cases. a) x̂ i,n k and ε̃ i,n k +1 take values as in (13): from (19) we have x̂ i,n k +1 = Σ j w ij (n k )[g n k (x̂ j,n k ) − g n k (θ n k ) − (x̂ j,n k − θ n k )] + Σ j w ij (n k )[θ n k +1 − ξ n k +1 + (x̂ j,n k − θ n k )]. Since A4 indicates that W (n k ) is doubly stochastic, by A6, (51) and direct calculation we have the following inequalities, and hence by Lemma 1 we know ||x̂ i,n k +1 − h n k (x * )|| ≤ η + 2M 0 + 2C + 3. b) x̂ i,n k and ε̃ i,n k +1 take values as in (14): from (19) we have the same expression plus the term a n k ε̃ i,n k +1 .
By A5 a) we know that a n k ||ε̃ i,n k +1 || < 1 for sufficiently large k. Then, by A3, A4, A6 and (51), we have the following inequalities, and hence by Lemma 1 we know ||x̂ i,n k +1 − h n k (x * )|| ≤ η + M 1 .
So we have shown that when || n k || ≤ M 0 + C + 1, we have ||x̂ i,n k +1 − h n k (x * )|| ≤ η + M 1 . Since {M k } is a sequence of positive numbers increasingly diverging to infinity, there exists a positive integer k 1 > k 0 such that M σ n k > η + M 1 for all k ≥ k 1 . Thus, truncation does not happen at time n k + 1.
Notice that (58) holds for all j : n k ≤ j ≤ s k . So, similarly to the proof above, we can show that truncation does not happen at times n k + 1, . . . , s k + 1. Then we conclude s k < τ σ n k +1 .
Hence (60) holds for sufficiently large k ≥ k 1 and any T k ∈ [0, T]. Now we consider the following recursive algorithm starting from n k : where Z k = col{z 1,k , . . . , z N,k }. By (60) we know that (48) holds for m = n k , . . . , and hence where the second inequality comes from A6, the third comes from (58), A3 and A5, the fourth comes from (51) and (59), and the last comes from (59). Denote by Z ⊥,s = D ⊥ Z s the disagreement vector of Z s . By multiplying both sides of (63) with D ⊥ we have By definition we know that D ⊥ G m ( m ) = D ⊥ m = 0, hence we have So, inductively, by (46) and (47) we have From (9), set Ψ n = Σ n m=1 a m ε̃ m+1 ; by (62) we know that ||Ψ s − Ψ n k −1 || ≤ T k for all s : So, summing by parts with (9), we have for sufficiently large k ≥ k 1 and any T k ∈ [0, T]. Notice that Z s = Z ⊥,s + (1 ⊗ I l )z̄ s , where z̄ s = (1/N )(1 T ⊗ I l )Z s . We derive Since ||Z ⊥,n k || ≤ 2|| n k || ≤ 2C, from (65) and (67) it follows that for sufficiently large k ≥ k 1 and any T k ∈ [0, T], Therefore, from (56) and (51) we know that for sufficiently large k ≥ k 1 and any T k ∈ [0, T], Now we look back at the recursive algorithm (19) and rewrite it in the compact form where X̂ k = col{x̂ 1,k , . . . , x̂ N,k }. Then by (63) and (64), X̂ s k +1 = Z s k +1 . So by (71) it follows that We now show that for sufficiently large k ≥ k 1 and any T k ∈ [0, T]. We consider the following two cases: lim k→∞ σ k = ∞ and lim k→∞ σ k = σ < ∞. i) lim k→∞ σ k = ∞: notice that M σ k > η + M 0 + 1 + C when k ≥ k 1 . By (20) and (21) we know that X̂ s k +1 = X s k +1 and σ s k +1 = σ s k . So s k + 1 < τ σ n k +1 by (60).
From (73) we know that (48) holds for m = s k for sufficiently large k ≥ k 1 and any T k ∈ [0, T]. From X̂ s k +1 = Z s k +1 and (73), it follows that for sufficiently large k ≥ k 1 and any T k ∈ [0, T], which contradicts the definition of s k . Thus (59) does not hold, so s k > m(n k , T k ), and hence (49) holds.
In conclusion, the proof of Lemma 6 is complete. By multiplying both sides of (48) with (1/N )(1 T ⊗ I l ), we have Setting We can rewrite (74) as The following lemma gives the noise property of the sequence {ζ k+1 }.

Lemma 7 Assume all the conditions in Lemma 6 hold.
{x̂ n k } is a convergent subsequence with limit at the considered sample ω. Then for this ω

Proof In the proof of Lemma 6 it has been pointed out that there exists a T ∈ (0, 1) such that m(n k , T) < τ σ n k +1 for sufficiently large k. So Now we need to show that ζ (i) k+1 also satisfies the property above for i = 1, 2. First we consider ζ (1) k+1 . We see that where the last inequality comes from A6.
Since lim k→∞ n k = ∞, by setting where o(1) → 0 as k → ∞. Hence for n k + m < m(n k , T) and sufficiently large k, And hence lim sup k→∞ So we complete the proof.
By the boundedness of { m k } we can extract a convergent subsequence, still denoted by { m k }, with limit lim k→∞ m k = . So by the assumption v(x) = 0 ↔ x = 0 we know there exists a constant β such that || || > β, and hence by (50) we conclude for sufficiently small T > 0 and large k. Set k to be a vector in between m k and m(m k ,T) . From (50) it follows that || k || ≤ c 2 T + || || + 1 for sufficiently large k. We consider the following Taylor expansion: v( m(m k ,T) ) − v( m k ) Similarly to (51), we take sufficiently large k; then by A3, A6 and Lemma 7, there exists a constant c such that a j (f j+1 g j (x j )) + ζ j+1 ≤ a j f j+1 θ j+1 + d j (x j ) − ξ j+1 + j

Distributed target tracking

The communication graph among the agents is generated as a Poisson random graph 1 with design parameter 0 ≤ p N ≤ 1; we choose p N = 0.25. Denote by N i the neighbor set of agent i and by n i the cardinality of N i . Set W (k) = [w ij ] N i,j=1 for all k ≥ 1, with w ij = 1/n i if agent j is in the set N i . All agents aim to track the target state θ k cooperatively. We assume that each agent can observe, with noise, only one component of the target state. In mathematical terms, the local function for agent i is defined as f i,k (x) = e k i (θ k − x), where e k is the 4-dimensional diagonal matrix with only the kth diagonal element being 1 and every other element being 0. The selection of k i will be explained later. Since the state θ k is unknown to the agents, each agent can only obtain a noise-corrupted observation of this local function instead of its exact value. The global function can then be written as f k (x) = Σ N i=1 e k i (θ k − x). It can be seen that, while each agent can only estimate one component of θ k with its own local function, θ k is the unique root of the global function f k (x). For our experiment, we take ξ k = (1/k 2 )v k , where {v k } is a sequence of i.i.d. random variables uniformly distributed over [−1, 1], and the step sizes a k = 20/k. We let the sampling interval be T = 0.1 s and the truncation bound
1 For the details of the Poisson random graph, we refer to [30].
M k = k + 80, and x * = [1, 1, 1, 1] T . The initial value x i,0 for each agent is drawn from the uniform distribution over [−2, 2]. The observation noise ε i,k is white Gaussian noise. As for the selection of k i : for agent i, if i mod 4 ≠ 0, then k i = i mod 4; if i mod 4 = 0, then k i = 4.
Denote by {x i,k } k≥1 , i ∈ V, the estimates given by (3)-(7), and by x̄ k = (1/N ) Σ N i=1 x i,k the average of the x i,k , i ∈ V. In Fig. 1, the dashed lines denote the state of the moving target and the solid lines the averaged estimates of the entries {θ j k , j = 1, . . . , 4} k≥1 of {θ k } k≥1 . From the figure we can see that the estimates track the moving target successfully.
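The experiment can be reproduced in spirit with a simplified simulation (a sketch: the constant-velocity model G for g_k, the tamed step size 20/(k + 100) in place of the paper's 20/k, the noise level 0.05, and the omission of the truncation mechanism are all our assumptions, made to keep the bare sketch stable and short):

```python
import numpy as np

# Simplified re-creation of the target-tracking experiment (assumptions noted
# in the lead-in; not the paper's exact algorithm, since truncation is omitted).
rng = np.random.default_rng(0)
N, l, T = 8, 4, 0.1                      # agents, state dimension, sampling interval
G = np.eye(l); G[0, 2] = G[1, 3] = T     # constant-velocity model, theta = [px, py, vx, vy]

comp = [i % l for i in range(N)]         # agent i observes error component comp[i]
W = np.ones((N, N)) / N                  # doubly stochastic averaging weights

x = rng.uniform(-2, 2, size=(N, l))      # initial estimates
theta = np.array([0.0, 0.0, 1.0, 0.5])   # initial target state (assumed)

for k in range(1, 2001):
    a_k = 20 / (k + 100)                 # tamed step size (assumption)
    obs = np.zeros((N, l))
    for i in range(N):                   # O_{i,k+1}: one noisy component of the error
        e = np.zeros(l); e[comp[i]] = 1.0
        obs[i] = e * (theta - x[i]) + 0.05 * rng.standard_normal(l)
    x = W @ (x @ G.T) + a_k * obs        # average g_k(x_j), then local correction
    theta = G @ theta + rng.uniform(-1, 1, size=l) / k**2   # xi_k = v_k / k^2

err = np.linalg.norm(x.mean(axis=0) - theta)
spread = np.linalg.norm(x - x.mean(axis=0))
```

Even this bare version exhibits the two phenomena established in the paper: the agents' estimates agree (small spread) and the averaged estimate follows the moving root.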

Conclusion
The distributed root-tracking problem for a sum of timevarying regression functions over a network is considered in this paper. It is assumed that a noise-corrupted dynamic information of the roots is known to all agents in the network. Each agent updates its estimate by using the local observation, the dynamic information of the global root, and information received from its neighbors. A distributed stochastic approximation algorithm is proposed and the consensus and convergence of the estimates are established.
For future research, it is of interest to relax the conditions on the dynamic information of the global roots, and to consider the convergence of the algorithm over unbalanced networks.