1 Introduction

We consider distributed non-linear least squares estimation in networked systems. The networked system consists of heterogeneous entities or agents whose inter-agent collaboration conforms to a pre-assigned, possibly sparse communication graph. The agents acquire local, noisy, non-linear observations of an unknown phenomenon (an unknown static vector parameter θ) in a streaming fashion over discrete time instances t. The goal of each agent is to continuously generate an estimate of θ over time instances t in a recursive fashion, where an agent's estimate update simultaneously assimilates the newly acquired local observations and the information received through messages from agents in its immediate neighborhood. The assumed setup is highly relevant in several emerging applications in the context of cyber-physical systems (CPS) and the Internet of things (IoT), such as state estimation in smart grids, predictive maintenance, and production monitoring in industrial manufacturing systems. For example, in continuous state estimation of a smart grid, the acquired measurements (voltages, angles) are in general non-linear functions of the unknown state; further, the measurements are inherently distributed across different physical locations (elements of the system), and they arrive continuously over time at a prescribed sampling rate. Furthermore, the scale (network size) of the distributed system (e.g., a large-scale micro-grid) and near real-time requirements on the estimation results make distributed, fusion center-free processing a desirable choice.

An important aspect of distributed estimation algorithms in the context of the applications described above is communication efficiency, i.e., achieving good estimation performance with minimal communication cost. Real-world applications such as large-scale deployments of CPS or IoT typically involve entities or agents with limited on-board energy resources. In addition to the limited on-board power, the energy required per unit of communication is usually significantly higher than the energy required per unit of computation [48]. Hence, communication efficiency is a highly desirable trait in such systems. Moreover, for large-scale systems that require continuous monitoring, it is crucial to reduce the communication cost as much as possible without compromising the performance of the inference task at hand, which in turn ensures a longer lifetime for such systems.

In this paper, we propose and analyze a communication efficient distributed estimator for non-linear observation models that we refer to as \(\mathcal {CREDO-NL}\). The estimator \(\mathcal {CREDO-NL}\) generalizes the recently proposed distributed estimator \(\mathcal {CREDO}\), see [37, 38], which is designed for linear measurement (observation) models only. The specific contributions of the paper are as follows.

We propose the non-linear distributed estimator \(\mathcal {CREDO-NL}\), which works for a broad class of non-linear observation models and where the model information, in terms of agent i's sensing function and noise statistics, is available only at agent i itself. With the proposed algorithm, each agent communicates probabilistically sparsely over time. More precisely, the probability that determines whether a node communicates at time t decays sub-linearly to zero with t, which in turn makes the communication cost scale sub-linearly with time t.

Despite dropping communications and the presence of non-linearities in the sensing model, we show that the proposed algorithm achieves the optimal O(1/t) rate of mean square error (MSE) decay. The achievability of the optimal MSE decay in terms of time t translates into significant improvements in the rate at which the MSE scales with respect to the per-agent average communication cost \(\mathcal {C}_{t}\) up to time t, namely from \(O(1/\mathcal {C}_{t})\) with existing methods, e.g., [15, 16, 31, 34–36, 40], to \(O\left (1/\mathcal {C}_{t}^{2-\zeta }\right)\) with the proposed method, where ζ>0 is arbitrarily small. We also establish strong consistency of the estimate sequence at each agent, showing that each agent's local estimator converges almost surely to the true parameter θ. Simulation examples confirm significant communication savings of \(\mathcal {CREDO-NL}\) over existing alternatives, by at least an order of magnitude.

We now briefly review the literature on distributed inference and motivate our algorithm \(\mathcal {CREDO-NL}\). Distributed inference algorithms can be broadly divided into two classes based on the presence of a fusion center. The first class assumes the presence of a fusion center, e.g., [11, 23, 26, 27, 47]. The fusion center assigns sub-tasks to the individual agents and subsequently fuses the information from the different agents. However, when the data samples are geographically distributed across the individual agents and are streamed in time, fusion center-based solutions are impractical.

The second class of distributed inference methods is fusion center-free. These works typically assume that the agents are interconnected over a generic network and that each agent acquires its local measurements in a streaming fashion. These estimators are iterative (recursive): at each iteration (time instance), each agent assimilates its new measurement and exchanges messages with its immediate neighbors, see, e.g., [2, 4–6, 14, 20, 22, 24, 25, 28–31, 34–36, 39, 43, 46]. Most related to our work are references that consider distributed estimation under non-linear observation models, as we do here, or distributed convex stochastic optimization, e.g., [15, 16, 31, 34–36, 40]. However, among these works, the best achieved MSE communication rate is \(O(1/\mathcal {C}_{t})\). In contrast, we establish here a strictly faster MSE communication rate equal to \(O\left (1/\mathcal {C}_{t}^{2-\zeta }\right)\) (ζ>0 is arbitrarily small). Finally, it is worth noting that there exist a few distributed algorithms (without a fusion node) that are also designed to achieve communication efficiency, e.g., [13, 21, 44–46]. In [46], a data censoring method is employed to save on computation and communication costs. However, the communication savings in [46] are a constant proportion with respect to a vanilla method that uses all allowable communications at all times. In [21], the communication savings come at the cost of extra computations. References [13, 44, 45] also consider a different setup than we do here, namely distributed optimization (with no fusion center) where the data is available a priori (i.e., it is not streamed). In terms of the strategy to save communications, references [13, 21, 44, 45] consider, respectively, deterministically increasingly sparse communication, an adaptive communication scheme, and selective activation of agents. These strategies are different from ours, which utilizes a randomized, increasing “sparsification” of communications.

Consensus+innovations methods, see, e.g., [16, 17, 19, 20], are a sub-class of distributed recursive algorithms (the second class of algorithms mentioned above) that process data in a streaming fashion. With consensus+innovations methods, each node updates its estimate at each iteration in a two-fold manner: by weight-averaging its solution estimate with the neighbors' solution estimates (consensus) and by assimilating its newly acquired data sample (innovation). Therein, the consensus and innovation weights are usually time-varying and are carefully designed to achieve optimal asymptotic performance, measured, e.g., through the asymptotic covariance of the estimate sequence. Within the class of consensus+innovations distributed estimation algorithms (see, e.g., [18, 20]), the design of communication efficient methods has been addressed in [37], see also [38], for linear observation models, wherein a mixed time-scale stochastic approximation method dubbed \(\mathcal {CREDO}\) has been proposed. We extend here \(\mathcal {CREDO}\) to non-linear observation models. Technically speaking, establishing convergence and asymptotic rates of convergence for \(\mathcal {CREDO-NL}\) involves establishing guarantees for the existence of stochastic Lyapunov functions for the estimate sequence. The update of the estimate sequence in \(\mathcal {CREDO-NL}\) involves a gain matrix which is in turn a function of the estimate itself. Moreover, in addition to the gain matrix being a function of the estimate, the sensing functions exhibit localized behavior in terms of smoothness and global observability. Hence, the setup considered in this paper requires technical tools different from those for \(\mathcal {CREDO}\), which we develop in this paper.

The rest of the paper is organized as follows. Section 2 describes the problem that we consider and gives the needed preliminaries on conventional (centralized) and distributed recursive estimation. Section 3 presents the novel \(\mathcal {CREDO-NL}\) algorithm that we propose, while Section 4 states our main results on the algorithm's performance. Section 5 presents the simulation experiments, Section 6 provides a discussion, and we conclude in Section 7. Proofs of the main results are relegated to Appendix A.

2 Model and preliminaries

2.1 Sensing and network models

Let θ∈Θ, where \(\Theta \subset \mathbb {R}^{M}\) (its properties are to be specified shortly), be an M-dimensional parameter that is to be estimated by a network of N agents. Every agent n at time index t makes a noisy observation yn(t), a non-linear function of θ corrupted by noise. Formally, the observation model for the n-th agent is given by,

$$\begin{array}{*{20}l} \mathbf{y}_{n}(t)=\mathbf{f}_{n}\left({\boldsymbol{\theta}}\right)+\mathbf{\gamma}_{n}(t), \end{array} $$
(1)

where \(\mathbf {f}_{n}:\mathbb {R}^{M}\mapsto \mathbb {R}^{M_{n}}\) is a non-linear sensing function, with \(M_{n}\ll M\) typically, \(\{\mathbf {y}_{n}(t)\}\), with \(\mathbf{y}_{n}(t)\in \mathbb {R}^{M_{n}}\), is the observation sequence of the n-th agent, and {γn(t)} is a zero-mean, temporally independent and identically distributed (i.i.d.) noise sequence at the n-th agent with nonsingular covariance \(\mathbf {R}_{n}\in \mathbb {R}^{M_{n}\times M_{n}}\). The noise processes are independent across different agents. We state an assumption on the noise processes before proceeding further. Throughout, we denote by ∥·∥ the \(\mathcal {L}_{2}\)-norm of its vector or matrix argument and by \(\mathbb {E} [\cdot]\) the expectation operator.
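To make the observation model (1) concrete, the following minimal Python sketch draws one round of measurements; the particular sensing functions, dimensions, and noise covariances below are illustrative assumptions, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 5                        # parameter dimension, number of agents
theta = rng.uniform(-1.0, 1.0, M)  # true (unknown) static parameter

# Illustrative local models: agent n observes a scalar non-linear
# function of theta (M_n = 1 here), corrupted by zero-mean noise.
def f_n(n, z):
    return np.array([np.sin(z[n % M] + z[(n + 1) % M])])

R = [0.25 * np.eye(1) for _ in range(N)]   # noise covariances R_n

def observe(n):
    """One streaming sample y_n(t) = f_n(theta) + gamma_n(t), cf. (1)."""
    gamma = rng.multivariate_normal(np.zeros(1), R[n])
    return f_n(n, theta) + gamma

y = [observe(n) for n in range(N)]  # one time instance of measurements
```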

Assumption 1

There exists ε1>0, such that, for all n, \(\mathbb {E} \left [\left \|\gamma _{n}(t)\right \|^{2+\epsilon _{1}}\right ]<\infty \).

We remark that the main results of the paper (Theorems 4.1 and 4.2) continue to hold even if ε1=0. The above assumption encompasses a general class of noise distributions in the setup.

The heterogeneity of the setup is exhibited in the agent-dependent sensing functions and noise covariances. Each agent is interested in reconstructing the true underlying parameter θ. We assume that an agent is aware only of its local observation model, i.e., the non-linear sensing function fn(·) and the associated noise covariance Rn; hence, it has no information about the observation models and noise processes of the other agents.

The agents are interconnected through a communication network that we assume throughout the paper is modeled as an undirected, simple, connected graph G=(V,E), with V=[1⋯N] and E denoting the set of agents (nodes) and the set of communication links, respectively, see [3]. (With the proposed \(\mathcal {CREDO-NL}\) method, the available links in E are activated selectively across algorithm iterations in a probabilistic fashion, as will be detailed in Section 3.) The neighborhood of node n in graph G is

$$ \Omega_{n}=\left\{l\in V\,|\,(n,l)\in E\right\}. $$
(2)

The node n has degree dn=|Ωn|. The structure of the graph is described by the N×N symmetric adjacency matrix A=A⊤=[Anl], with Anl=1 if (n,l)∈E and Anl=0 otherwise. Let D=diag(d1,…,dN). The graph Laplacian L=D−A is positive semidefinite, with eigenvalues ordered as 0=λ1(L)≤λ2(L)≤⋯≤λN(L). The eigenvector of L corresponding to λ1(L) is \((1/\sqrt {N})\mathbf {1}_{N}\). (Here, 1N is the N-dimensional vector with all entries equal to one.) The multiplicity of the zero eigenvalue equals the number of connected components of the network; for a connected graph, λ2(L)>0. This second eigenvalue is the algebraic connectivity or the Fiedler value of the network (see [7] for instance).
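The graph quantities above are straightforward to compute numerically; the following minimal sketch (using an assumed path graph purely as an example) builds the Laplacian and checks connectivity via the Fiedler value.

```python
import numpy as np

# Adjacency matrix of an undirected simple graph G = (V, E);
# a path on four nodes is used purely as an example.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
D = np.diag(A.sum(axis=1))       # degree matrix D = diag(d_1, ..., d_N)
L = D - A                        # graph Laplacian L = D - A

eigvals = np.linalg.eigvalsh(L)  # ascending: lambda_1 <= ... <= lambda_N
fiedler = eigvals[1]             # algebraic connectivity lambda_2(L)
print("connected:", fiedler > 1e-9)
```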

Example: distributed static phase estimation in smart grids

Many applications within cyber-physical systems and the Internet of things can be modeled as non-linear distributed estimation problems of type (1). Such a class of models arises, e.g., with state estimation in power systems; therein, a phasorial representation of voltages and currents is usually utilized, wherein the non-linearity in general emerges from the power-flow equations [1, 33]. Here, we focus on a specific problem within this class, namely distributed static phase estimation in smart grids. We describe the model briefly and refer to, e.g., [12, 19] for more details. Here, graph G corresponds to a power grid network of n=1,...,N generators and loads (a single generator or a single load is a node in the graph), while the edge set E corresponds to the set of transmission lines or interconnections. (For simplicity, even though it is not necessary, we assume that the physical interconnection network matches the inter-node communication network.) Assume that G is connected. The state of a node n is described by \((\mathcal {V}_{n},{\phi _{n}})\), where \(\mathcal {V}_{n}\) is the voltage magnitude and ϕn is the phase angle. As is commonly assumed, e.g., [12], we let the voltages \(\mathcal {V}_{n}\) be known constants; on the other hand, the angles ϕn are unknown and are to be estimated. Following a standard approximation path, the real power flow across the transmission line between nodes n and l can be expressed as, e.g., [12]:

$$ \mathcal{P}_{nl}(\mathbf{\phi})=\mathcal{V}_{n}\,\mathcal{V}_{l}\,b_{nl}\,\sin(\phi_{nl}), $$
(3)

where ϕ is the vector that collects the unknown phase angles ϕn across all nodes, bnl is line (n,l)'s admittance, and ϕnl=ϕn−ϕl. Denote by Em⊆E the set of lines equipped with power flow measuring devices. The power flow measurement at line (n,l) is then given by:

$$\begin{array}{*{20}l} {}y_{nl}(t) =\mathcal{P}_{nl}(\mathbf{\phi})+\gamma_{nl}(t) =\mathcal{V}_{n}\,\mathcal{V}_{l}\,b_{nl}\sin(\phi_{nl})+\gamma_{nl}(t), \end{array} $$
(4)

where {γnl(t)} is zero-mean i.i.d. measurement noise with finite moment \(\mathbb {E}[\!|\gamma _{nl}(t)|^{2+\epsilon _{1}}]\), for some ε1>0. Assume that each measurement ynl(t) is assigned to one of its incident nodes, n or l. Further, let \(\Omega _{n}^{\prime }\) denote the set of all indices l such that the measurements ynl(t) are available at node n. Then, it becomes clear that the angle estimation problem is a special case of model (1), with the measurement vectors \(\mathbf {y}_{n}(t)=[y_{nl}(t),~l\in \Omega _{n}^{\prime }]^{\top }\), n=1,...,N, noise vectors \(\mathbf {\gamma }_{n}(t)=[\gamma _{nl}(t),~l\in \Omega _{n}^{\prime }]^{\top }\), n=1,...,N, and sensing functions \(\mathbf {f}_{n}(\boldsymbol {\phi })=[\mathcal {V}_{n}\,\mathcal {V}_{l}\,b_{nl}\, \sin (\phi _{nl}),~l\in \Omega _{n}^{\prime }]^{\top }\), n=1,...,N. It can be shown that, under reasonable assumptions on the phase angle ranges (which correspond to the admissible parameter set Θ) and on the smart grid network and admittance structure, the assumptions we make on the sensing model are satisfied, and hence, \(\mathcal {CREDO-NL}\) can be effectively applied; we refer to [12, 19] for details.
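For illustration, the following sketch simulates the line-measurement model (3)–(4); the voltages, admittances, phase angles, and measurement assignment used here are placeholder values, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def power_flow(phi_n, phi_l, V_n=1.0, V_l=1.0, b_nl=1.0):
    """Real power flow across line (n, l), cf. (3)."""
    return V_n * V_l * b_nl * np.sin(phi_n - phi_l)

def measure_line(phi_n, phi_l, sigma=0.1):
    """Noisy line measurement y_nl(t), cf. (4)."""
    return power_flow(phi_n, phi_l) + sigma * rng.standard_normal()

# Node n stacks the measurements of its assigned lines into y_n(t).
phi = {1: 0.20, 2: -0.10, 3: 0.05}   # illustrative phase angles
omega_prime_1 = [2, 3]               # Omega'_1: lines (1,2) and (1,3)
y_1 = np.array([measure_line(phi[1], phi[l]) for l in omega_prime_1])
```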

2.2 Preliminaries: centralized batch and recursive weighted non-linear least squares estimation

In this subsection, we review the preliminaries of centralized batch and recursive weighted non-linear least squares estimation.

Consider a networked setup with a hypothetical fusion center that has access to the samples collected at all nodes at all times. In such a setting, for the sensing model described in (1), one of the classical algorithms that finds extensive use is weighted non-linear least squares (WNLS) (see, for example, [15]). The applicability of WNLS to fairly generic setups, which may be characterized by the absence of full noise statistics, makes it particularly appealing in practice. We discuss properties of the WNLS estimator before proceeding further. Define the cost function \(\mathcal {Q}_{t}\) as follows:

$$ \mathcal{Q}_{t}\left(\mathbf{z}\right)=\sum_{s=0}^{t}\sum_{n=1}^{N} \left(\mathbf{y}_{n}(s)-\mathbf{f}_{n}(\mathbf{z})\right)^{\top}\mathbf{R}_{n}^{-1}\left(\mathbf{y}_{n}(s)-\mathbf{f}_{n}(\mathbf{z})\right). $$
(5)

The hypothetical fusion center in such a setting generates the estimate sequence \(\left \{{\widehat {\boldsymbol {\theta }}}_{t}\right \}\) in the following way:

$$ {\widehat{\boldsymbol{\theta}}}_{t}\in{\arg\!\min}_{\mathbf{z}\in\Theta}\mathcal{Q}_{t}(\mathbf{z}). $$
(6)
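For concreteness, a minimal sketch of the batch WNLS estimator (5)–(6) follows; a generic box-constrained local solver is used as a stand-in for the exact minimization over Θ, and the data layout is an assumption of the sketch.

```python
import numpy as np
from scipy.optimize import minimize

def wnls_estimate(samples, f, R_inv, theta0, bounds):
    """Approximate batch WNLS: minimize Q_t in (5) over a box set Theta.

    samples[n] -- list of observations y_n(0), ..., y_n(t) of agent n
    f[n]       -- sensing function of agent n
    R_inv[n]   -- inverse noise covariance of agent n
    """
    def Q(z):
        cost = 0.0
        for n, ys in enumerate(samples):
            for y in ys:
                r = y - f[n](z)
                cost += r @ R_inv[n] @ r
        return cost
    # A box-constrained local solver stands in for the exact minimizer in (6).
    return minimize(Q, theta0, bounds=bounds).x
```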

The consistency and the asymptotic behavior of the estimate sequence \(\{{\widehat {\boldsymbol {\theta }}}_{t}\}\) have been analyzed in the literature under the following weak assumptions:

Assumption 2

The set Θ is a compact convex subset of \(\mathbb {R}^{M}\) with non-empty interior int(Θ), and the true (but unknown) parameter θ∈int(Θ).

Assumption 3

The sensing model is globally observable, i.e., any pair \({\boldsymbol {\theta }}, \acute {{\boldsymbol {\theta }}}\) of possible parameter instances in Θ satisfies

$$ \sum_{n=1}^{N}\left\|\mathbf{f}_{n}({\boldsymbol{\theta}})-\mathbf{f}_{n}\left(\acute{{\boldsymbol{\theta}}}\right)\right\|^{2}=0 $$
(7)

if and only if \({\boldsymbol {\theta }}=\acute {{\boldsymbol {\theta }}}\).

Assumption 4

The sensing function fn(·) for each n is continuously differentiable in the interior int(Θ) of the set Θ. For each θ in the set Θ, the (normalized) gain matrix Γθ, defined by

$$\begin{array}{*{20}l} \mathbf{\Gamma}_{{\boldsymbol{\theta}}}=\frac{1}{N}\sum_{n=1}^{N}\nabla\mathbf{f}_{n} \left({\boldsymbol{\theta}}\right)\mathbf{R}_{n}^{-1}\nabla\mathbf{f}_{n}^{\top}\left({\boldsymbol{\theta}}\right), \end{array} $$
(8)

is invertible, where \(\nabla \mathbf {f}_{n}(\cdot) \in \mathbb {R}^{M \times M_{n}}\) denotes the gradient of fn(·).

Smoothness conditions on the sensing functions, such as the one imposed by Assumption 4, are common in statistical estimation with non-linear observation models. Note that the matrix Γθ is well defined at the true value of the parameter θ, as θ∈int(Θ) and the continuous differentiability of the sensing functions holds for all θ∈int(Θ).

The asymptotic properties of the WNLS estimator in terms of consistency and asymptotic normality are characterized by the following classical result:

Proposition 1

([15]) Let the parameter set Θ be compact and the sensing function fn(·) be continuous on Θ for each n. Let \(\mathcal {G}_{t}\) be an increasing sequence of σ-algebras such that \(\mathcal {G}_{t} = \sigma \left (\left \{\left \{\mathbf {y}_{n}(s)\right \}_{s=0}^{t-1}\right \}_{n=1}^{N}\right)\). Further, denote by θ the true parameter to be estimated. Then, a WNLS estimator of θ exists, i.e., there exists a \(\{\mathcal {G}_{t}\}\)-adapted process \(\left \{{\widehat {\boldsymbol {\theta }}}_{t}\right \}\) such that

$$ {\widehat{\boldsymbol{\theta}}}_{t}\in\text{argmin}_{\mathbf{z}\in\Theta}\mathcal{Q}_{t}(\mathbf{z}),~\forall t. $$
(9)

Moreover, if the model is globally observable, i.e., Assumption 3 holds, the WNLS estimate sequence \(\left \{{\widehat {\boldsymbol {\theta }}}_{t}\right \}\) is consistent, i.e.,

$$ \mathbb{P}_{{\boldsymbol{\theta}}}\left({\lim}_{t\rightarrow\infty}{\widehat{\boldsymbol{\theta}}}_{t}={\boldsymbol{\theta}}\right)=1, $$
(10)

where \(\mathbb {P}_{{\boldsymbol {\theta }}}(\cdot)\) denotes the probability operator. Additionally, if Assumption 4 holds, the parameter estimate sequence is asymptotically normal, i.e.,

$$\begin{array}{*{20}l} \sqrt{t+1}\left({\widehat{\boldsymbol{\theta}}}_{t}-{\boldsymbol{\theta}}\right)\overset{D}{\Longrightarrow}\mathcal{N}\left(0, \mathbf{\Sigma}_{c}\right), \end{array} $$
(11)

where

$$\begin{array}{*{20}l} \mathbf{\Sigma}_{c}=\left(N\mathbf{\Gamma}_{{\boldsymbol{\theta}}}\right)^{-1}, \end{array} $$
(12)

Γθ is as given by (8) and \(\overset {\mathcal {D}}{\Longrightarrow }\) refers to convergence in distribution (weak convergence).

The centralized WNLS estimator above suffers from significant communication overhead due to the required access to the data samples of all agents at all times. Moreover, the minimization in (6) requires batch processing due to the non-sequential nature of the minimization. Recursive centralized estimators utilizing stochastic approximation-type approaches have been proposed in [9, 10, 32, 41, 42]; these avoid batch processing through the development of sequential, albeit centralized, estimators. However, such recursive estimators still suffer from an enormous communication overhead, as the fusion center requires access to the data samples of all agents at all times, as well as the global model information in terms of the sensing functions and the noise statistics across agents.

2.3 Preliminaries: distributed WNLS

Sequential distributed recursive schemes conforming to the consensus+innovations type update (see, for example, [19] and Eq. (13) ahead), where each agent's knowledge of the model is limited to its own, have been proposed in [16, 40]. In [16], so as to achieve the optimal asymptotic covariance, the global model information is made available through a carefully constructed gain matrix update, which adds computational complexity and communication cost. In contrast with [16], reference [40] introduces a trade-off: it accepts sub-optimality of the asymptotic covariance while using only local model information at the individual agents for evaluating the gain matrix, thus saving communication cost. However, for both of the aforementioned algorithms in [16, 40], the number of communications scales linearly with the number of per-node sampled observations {yn(t)}. This paper builds upon the ideas of the sequential distributed recursive schemes for non-linear observation models proposed in [16, 40] to construct a communication efficient scheme without compromising the performance in terms of the mean square error. That is, we aim to achieve the order-optimal MSE decay rate of Θ(1/t) (see, e.g., [9]) in terms of the number of per-node processed samples, while reducing the Θ(t) communication cost that is characteristic of previous approaches.

Before proceeding further, we briefly summarize the estimator in [40] which is referred to as the benchmark estimator henceforth. The overall update rule at an agent n corresponds to

$$\begin{array}{*{20}l} &{\widehat{\mathbf{x}}}_{n}(t+1)=\mathbf{x}_{n}(t)-\underbrace{\widehat{\beta}_{t}\, \sum_{l\in\Omega_{n}}\left(\mathbf{x}_{n}(t) -\mathbf{x}_{l}(t)\right)}_{\mathrm{neighborhood~ consensus}}\\ &-\underbrace{\widehat{\alpha}_{t}\left(\nabla \mathbf{f}_{n}(\mathbf{x}_{n}(t))\right)\mathbf{R}_{n}^{-1}\left(\mathbf{f}_{n}(\mathbf{x}_{n}(t))-\mathbf{y}_{n}(t)\right)}_{\mathrm{local\ innovation}} \end{array} $$
(13)

and

$$ \mathbf{x}_{n}(t+1)=\mathcal{P}_{\Theta}[{\widehat{\mathbf{x}}}_{n}(t+1)], $$
(14)

where Ωn is the communication neighborhood of agent n (determined by the Laplacian L); ∇fn(·) is the gradient of fn; \(\mathcal {P}_{\Theta }[\cdot ]\) is the projection operator corresponding to projecting onto Θ; and \(\{\widehat{\beta}_{t}\}\) and \(\{\widehat{\alpha}_{t}\}\) are the consensus and innovation weight sequences given by

$$ \widehat{\beta}_{t}=\frac{\widehat{\beta}_{0}}{(t+1)^{\delta_{1}}},\,\,\,\widehat{\alpha}_{t}=\frac{\widehat{\alpha}_{0}}{t+1}, $$
(15)

where \(\widehat {\alpha }_{0}, \widehat {\beta }_{0} > 0\), 0<δ1<1/2−1/(2+ε1), and ε1 was defined in Assumption 1. From the asymptotic normality result in Theorem 2 in [40], it can be inferred that the MSE decays as O(1/t).
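For concreteness, the following is a minimal sketch of one iteration of the benchmark update (13)–(15); the array layout, function handles, and default weight constants are assumptions of the sketch rather than prescriptions of [40].

```python
import numpy as np

def benchmark_step(x, t, neighbors, grad_f, f, R_inv, y, project,
                   alpha0=1.0, beta0=1.0, delta1=0.1):
    """One consensus+innovations iteration (13)-(14) for all agents.

    x is an N x M array of current estimates; y[n] is agent n's newest
    sample. The weight sequences follow (15)."""
    beta_t = beta0 / (t + 1) ** delta1
    alpha_t = alpha0 / (t + 1)
    x_new = np.empty_like(x)
    for n in range(x.shape[0]):
        consensus = beta_t * sum(x[n] - x[l] for l in neighbors[n])
        innovation = alpha_t * grad_f[n](x[n]) @ R_inv[n] @ (f[n](x[n]) - y[n])
        x_new[n] = project(x[n] - consensus - innovation)  # cf. (14)
    return x_new
```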

Communication efficiency

The communication cost \(\mathcal {C}_{t}\) is defined as the expected number of per-node communications up to iteration t. Formally, the communication cost \(\mathcal {C}_{t}\) is given by

$$\begin{array}{*{20}l} \mathcal{C}_{t}= \mathbb{E}\left[\sum_{s=0}^{t-1}\mathbb{I}_{\{agent~n~transmits~at~s\}}\right], \end{array} $$
(16)

where agent n is arbitrary (the expectation in (16) does not depend on n) and \(\mathbb {I}_{A}\) represents the indicator of event A. The communication cost \(\mathcal {C}_{t}\) for both the centralized WNLS estimator (where all agents transmit their samples yn(t) to the fusion center at all times t) and the distributed estimators in [16, 40] is \(\mathcal {C}_{t} = \Theta (t)\), where we note that the iteration count t is equivalent to the number of per-node samples collected up to time t. Hence, for these methods, the MSE decays as \(O\left (\frac {1}{\mathcal {C}_{t}}\right)\).

3 \(\mathcal {CREDO-NL}\): a communication efficient distributed WNLS estimator

In this section, we present the \(\mathcal {CREDO-NL}\) estimator. \(\mathcal {CREDO-NL}\) is based on a carefully chosen protocol that makes the communications increasingly sparse in a probabilistic sense. Intuitively speaking, the communication protocol exploits the idea that, with a gradual accumulation of information at the agents through communications, an agent is able to gather sufficient information about the parameter of interest, which then allows it to drop communications increasingly often. Technically speaking, for each node n, at every time t, we introduce a binary random variable ψn,t, where

$$\begin{array}{*{20}l} \psi_{n,t}= \left\{\begin{array}{ll} \rho_{t} & \mathrm{with~probability}~\zeta_{t}\\ 0 & \text{else}, \end{array}\right. \end{array} $$
(17)

where the ψn,t's are independent both across time and across nodes, i.e., across t and n, respectively, as well as independent of the nodes' observations in (1). The random variable ψn,t abstracts the decision of node n at time t on whether to participate in the neighborhood information exchange. We specifically take ρt and ζt of the form

$$\begin{array}{*{20}l} \rho_{t} = \frac{\rho_{0}}{(t+1)^{\epsilon/2}}, \zeta_{t} = \frac{\zeta_{0}}{(t+1)^{(1/2-\epsilon/2)}}, \end{array} $$
(18)

where 0<ε<1. Furthermore, define βt to be

$$\begin{array}{*{20}l} \beta_{t}=\left(\rho_{t}\zeta_{t}\right)^{2} = \frac{\beta_{0}}{(t+1)},\,\,\beta_{0}>0. \end{array} $$
(19)

With the above development in place, we define the random time-varying Laplacian \(\mathbf {L}(t)\in \mathbb {R}^{N\times N}\), which abstracts the inter-node information exchange, as follows:

$$\begin{array}{*{20}l} \mathbf{L}_{i,j}(t)= \left\{\begin{array}{ll} -\psi_{i,t}\psi_{j,t} & \{i,j\}\in E, i\neq j\\ 0 & i\neq j, \{i,j\}\notin E\\ \sum_{l\neq i}\psi_{i,t}\psi_{l,t}& i=j. \end{array}\right. \end{array} $$
(20)
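The protocol (17)–(20) is simple to simulate; the sketch below samples one realization of L(t), with placeholder values for ρ0, ζ0, ε, and the edge set.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_laplacian(t, edges, N, rho0=1.0, zeta0=1.0, eps=0.01):
    """Sample one realization of L(t) under the protocol (17)-(20)."""
    rho_t = rho0 / (t + 1) ** (eps / 2)
    zeta_t = zeta0 / (t + 1) ** (0.5 - eps / 2)
    # psi_{n,t} equals rho_t with probability zeta_t, and 0 otherwise.
    psi = np.where(rng.random(N) < zeta_t, rho_t, 0.0)
    L = np.zeros((N, N))
    for i, j in edges:
        w = psi[i] * psi[j]        # link active iff both endpoints are
        L[i, j] -= w
        L[j, i] -= w
        L[i, i] += w
        L[j, j] += w
    return L

L_t = random_laplacian(t=10, edges=[(0, 1), (1, 2)], N=3)
```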

The communication protocol (17)–(20) assumes that neighboring nodes communicate only when the corresponding communication link is bi-directional. How bi-directional communication links can be enforced in practice is discussed next. Let us first assume that there exists a dedicated reliable bi-directional communication link between any two neighboring nodes. Consider a link between nodes n and l at time t. If ψn,t≠0, node n participates in communication, and it turns on both its transmitting and receiving antennas. If ψn,t=0, it switches off both its transmitting and receiving antennas. Suppose that ψn,t≠0, and consider two scenarios: (1) ψl,t=0 and (2) ψl,t≠0. Consider first the former case. Since node n listens to the dedicated channel to node l and node l does not transmit, node n verifies that it does not receive the respective message from node l (e.g., within a prescribed time window), and hence, it does not incorporate node l's estimate in its update. Also, as ψl,t=0, node l does not include the estimate of node n, by algorithm construction. Next, consider the case ψl,t≠0. In this case, node n listens to the channel and receives the message from node l, and thus, it incorporates node l's estimate in its update. Completely symmetrically, node l listens to the channel from node n to node l, receives the respective message, and includes node n's estimate in its update. Overall, the preceding discussion explains how the symmetric communication protocol can be established. A very similar consideration applies if the links are unreliable but still symmetric, in the sense that if the link from n to l is strong enough to support communication, then so is the link from l to n. Finally, if the physical links can fail in an asymmetric fashion, then the proposed algorithm (see (26)–(28) ahead) cannot be implemented in its direct form. More precisely, asymmetrically failing links make the Laplacian matrices L(t) non-symmetric. The algorithm (26)–(28) and the corresponding analysis would need to change in such a scenario. This lies outside the scope of this paper but corresponds to an interesting future research direction.

With the protocol described in (17)–(20), both the weight assigned to the links and the probability of the existence of a link decay over time. We next consider the first moment, the second moment, and the variance of the Laplacian entries for {i,j}∈E:

$$\begin{array}{*{20}l} &\mathbb{E}\left[\mathbf{L}_{i,j}(t)\right]= -\left(\rho_{t}\zeta_{t}\right)^{2} = -\beta_{t} = -\frac{\beta_{0}}{(t+1)}\\ &\mathbb{E}\left[\mathbf{L}_{i,j}^{2}(t)\right] = \left(\rho_{t}^{2}\zeta_{t}\right)^{2} = \frac{\rho_{0}^{2}\beta_{0}}{(t+1)^{1+\epsilon}} \end{array} $$
(21)
$$\begin{array}{*{20}l} &Var\left(\mathbf{L}_{i,j}(t)\right) = \frac{\rho_{0}^{2}\beta_{0}}{(t+1)^{1+\epsilon}} - \frac{\beta_{0}^{2}}{(t+1)^{2}}. \end{array} $$
(22)

For future reference, we also introduce the mean of the Laplacian sequence {L(t)}, \(\overline {\mathbf {L}}(t) = \mathbb {E}\left [\mathbf {L}(t)\right ]\), and the residual \(\widetilde {\mathbf {L}}(t) = \mathbf {L}(t)-\overline {\mathbf {L}}(t)\). Thus, it holds that \(\mathbb {E}\left [\widetilde {\mathbf {L}}(t)\right ] = \mathbf {0}\), and

$$\begin{array}{*{20}l} \mathbb{E}\left[\left\|\widetilde{\mathbf{L}}(t)\right\|^{2}\right] \leq 2\,N^{3}\mathbb{E}\left[\widetilde{\mathbf{L}}_{i,j}^{2}(t)\right] \leq \frac{2\,N^{3}\beta_{0}\rho_{0}^{2}}{(t+1)^{1+\epsilon}}, \end{array} $$
(23)

where ∥·∥ denotes the \(\mathcal{L}_{2}\) norm. Inequality (23) can be obtained as follows. First, we have that \(\left \|\widetilde {\mathbf {L}}(t)\right \| \leq \left \|\widetilde {\mathbf {L}}(t)\right \|_{F}\), where ∥·∥F denotes the Frobenius norm. Also, note that

$$\begin{array}{*{20}l} {}\left\|\widetilde{\mathbf{L}}(t)\right\|_{F}^{2} &= \sum_{i,j=1}^{N} \left|\widetilde{\mathbf{L}}_{i,j}(t) \right|^{2} = \sum_{i=1}^{N} \left(\sum_{j\neq i}\left|\widetilde{\mathbf{L}}_{i,j}(t) \right|^{2} + \left|\widetilde{\mathbf{L}}_{i,i}(t) \right|^{2}\right)\\ &= \sum_{i=1}^{N} \left(\sum_{j\neq i}\left|\widetilde{\mathbf{L}}_{i,j}(t) \right|^{2} + \left|\sum_{j\neq i}\widetilde{\mathbf{L}}_{i,j}(t) \right|^{2}\right)\\ &\leq\sum_{i=1}^{N} \left(\sum_{j\neq i}\left|\widetilde{\mathbf{L}}_{i,j}(t) \right|^{2} + N \sum_{j\neq i}\left|\widetilde{\mathbf{L}}_{i,j}(t) \right|^{2}\right)\\ &\leq 2\,N \sum_{i=1}^{N} \sum_{j \neq i} \left|\widetilde{\mathbf{L}}_{i,j}(t)\right|^{2}. \end{array} $$

Taking expectations and using (21)–(22), inequality (23) follows.

Next, we also have that \(\overline {\mathbf {L}}(t)=\beta _{t}\overline {\mathbf {L}}\), where

$$\begin{array}{*{20}l} \overline{\mathbf{L}}_{i,j}= \left\{\begin{array}{ll} -1 & \{i,j\}\in E, i\neq j\\ 0 & i\neq j, \{i,j\}\notin E\\ -\sum_{l\neq i}\overline{\mathbf{L}}_{i,l}& i=j. \end{array}\right. \end{array} $$
(24)

We next give an assumption on the connectivity of the inter-agent communication graph.

Assumption 5

The inter-agent communication graph is connected on average, i.e., \(\lambda _{2}(\overline {\mathbf {L}}) > 0\), which implies \(\lambda _{2}(\overline {\mathbf {L}}(t))>0\), where \(\overline {\mathbf {L}}(t)\) denotes the mean of the Laplacian matrix L(t) and λ2(·) denotes the second smallest eigenvalue.

Assumption 5 ensures consistent information flow among the agents. Technically speaking, the communication graph, modeled here as a random undirected graph, need not be connected at all times. It is to be noted that Assumption 5 ensures that the graph associated with \(\overline {\mathbf {L}}(t)\) is connected at all times, as \(\overline {\mathbf {L}}(t)=\beta _{t}\overline {\mathbf {L}}\). We now state an additional assumption on the smoothness of the sensing functions for the distributed setup.

Assumption 6

For each n, the sensing function fn(·) is Lipschitz continuous on Θ, i.e., for each agent n, there exists a constant kn>0 such that

$$\begin{array}{*{20}l} \left\|\mathbf{f}_{n}\left({\boldsymbol{\theta}}\right)-\mathbf{f}_{n}\left(\acute{{\boldsymbol{\theta}}}\right)\right\| \le k_{n}\left\|{\boldsymbol{\theta}}-\acute{{\boldsymbol{\theta}}}\right\|, \end{array} $$
(25)

for all θ,θΘ.

With the communication protocol established, we propose an update by which every node n generates an estimate sequence {xn(t)}, with \(\mathbf {x}_{n}(t)\in \mathbb {R}^{M}\), in the following way:

$$\begin{array}{*{20}l} &{\widehat{\mathbf{x}}}_{n}(t+1)=\mathbf{x}_{n}(t)-\underbrace{\sum_{l\in\Omega_{n}} \psi_{n,t}\psi_{l,t}\left(\mathbf{x}_{n}(t)-\mathbf{x}_{l}(t)\right)}_{\mathrm{neighborhood~ consensus}}\\ &-\underbrace{\alpha_{t}\left(\nabla \mathbf{f}_{n}(\mathbf{x}_{n}(t))\right)\mathbf{R}_{n}^{-1} \left(\mathbf{f}_{n}(\mathbf{x}_{n}(t))-\mathbf{y}_{n}(t)\right)}_{\mathrm{local~ innovation}} \end{array} $$
(26)

and

$$ \mathbf{x}_{n}(t+1)=\mathcal{P}_{\Theta}[{\widehat{\mathbf{x}}}_{n}(t+1)], $$
(27)

where Ωn denotes the neighborhood of node n with respect to the network represented by \(\overline {\mathbf {L}}\), αt is the innovation gain sequence, given by αt=α0/(t+1) with α0>0, and \(\mathcal {P}_{\Theta }[\cdot ]\) is the projection operator corresponding to projecting onto Θ. The random variable ψn,t determines the activation state of node n. By activation, we mean that if ψn,t≠0, then node n can send and receive information in its neighborhood at time t, whereas if ψn,t=0, node n neither transmits nor receives information. The link between node n and node l gets assigned a weight of \(\rho _{t}^{2}\) if and only if ψn,t≠0 and ψl,t≠0.
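A minimal sketch of one \(\mathcal {CREDO-NL}\) iteration (26)–(27) follows, with the realized link weights ψn,tψl,t supplied externally; the data layout and function handles are assumptions of the sketch.

```python
import numpy as np

def credo_nl_step(x, t, psi, neighbors, grad_f, f, R_inv, y, project,
                  alpha0=1.0):
    """One CREDO-NL iteration (26)-(27) for all agents.

    psi[n] is the realization of psi_{n,t} drawn as in (17), so the
    consensus weight on an active link equals rho_t**2."""
    alpha_t = alpha0 / (t + 1)
    x_new = np.empty_like(x)
    for n in range(x.shape[0]):
        consensus = sum(psi[n] * psi[l] * (x[n] - x[l])
                        for l in neighbors[n])
        innovation = alpha_t * grad_f[n](x[n]) @ R_inv[n] @ (f[n](x[n]) - y[n])
        x_new[n] = project(x[n] - consensus - innovation)  # cf. (27)
    return x_new
```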

The update in (26) can be written in a compact manner as follows:

$$\begin{array}{*{20}l} {\widehat{\mathbf{x}}}(t+1)&=\mathbf{x}(t)-\left(\mathbf{L}(t)\otimes\mathbf{I}_{M}\right) \mathbf{x}(t)\\ &\quad+\alpha_{t}\mathbf{G}(\mathbf{x}(t))\mathbf{R}^{-1}\left(\mathbf{y}(t)-\mathbf{f}\left(\mathbf{x}(t)\right)\right). \end{array} $$
(28)

Here, ⊗ denotes the Kronecker product, IM denotes the M×M identity matrix, and:

$$\begin{array}{@{}rcl@{}} \mathbf{x}(t)^{\top}&=&\left[\mathbf{x}_{1}(t)^{\top}\cdots \mathbf{x}_{N}(t)^{\top}\right],\qquad \mathbf{y}(t)^{\top} = \left[\mathbf{y}_{1}(t)^{\top} \cdots \mathbf{y}_{N}(t)^{\top} \right]\\ {\widehat{\mathbf{x}}}(t)^{\top}&=&\left[{\widehat{\mathbf{x}}}_{1}(t)^{\top}\cdots {\widehat{\mathbf{x}}}_{N}(t)^{\top}\right]\\ \mathbf{f}(\mathbf{x}(t)) &=& \left[\mathbf{f}_{1}(\mathbf{x}_{1}(t))^{\top}\cdots \mathbf{f}_{N}(\mathbf{x}_{N}(t))^{\top}\right]^{\top}\\ \mathbf{R}^{-1}&=&\text{diag}\left[\mathbf{R}_{1}^{-1}, \cdots, \mathbf{R}_{N}^{-1}\right]\\ \mathbf{G}\left(\mathbf{x}(t)\right)&=&\text{diag}\left[\nabla \mathbf{f}_{1}\left(\mathbf{x}_{1}(t)\right), \cdots, \nabla \mathbf{f}_{N}\left(\mathbf{x}_{N}(t)\right)\right]. \end{array} $$

Remark 1

The Laplacian sequence that plays a role in the analysis of this paper takes the form \(\mathbf{L}(t)=\beta _{t}\overline {\mathbf{L}}+\widetilde {\mathbf{L}}(t)\), where the residual Laplacian sequence \(\widetilde {\mathbf{L}}(t)\) does not scale with βt. This is because the communication rate is chosen adaptively over time, which makes the Laplacian matrix sequence not identically distributed.

We refer to the parameter estimate update in (26) and the projection in (27), in conjunction with the randomized communication protocol, as the \(\mathcal {CREDO-NL}\) algorithm. We next state a condition on the sensing functions (standard in the literature on general recursive procedures) that guarantees the existence of stochastic Lyapunov functions and, hence, the convergence of the distributed estimation procedure.

Assumption 7

The following aggregate strict monotonicity condition holds: there exists a constant c1>0 such that for each pair \({\boldsymbol {\theta }}, \acute {{\boldsymbol {\theta }}}\) in Θ we have that

$$\begin{array}{*{20}l} {}\sum_{n=1}^{N}\left({\boldsymbol{\theta}}-\acute{{\boldsymbol{\theta}}}\right)^{\top} \!\left(\nabla f_{n}({\boldsymbol{\theta}})\right)\mathbf{R}_{n}^{-1}\!\left(f_{n}({\boldsymbol{\theta}})-f_{n}(\acute{{\boldsymbol{\theta}}})\right)\!\geq c_{1} \left\|{\boldsymbol{\theta}}-\acute{{\boldsymbol{\theta}}}\right\|^{2}. \end{array} $$
(29)

The instrumental step in analyzing the convergence of the proposed algorithm is ensuring the existence of appropriate stochastic Lyapunov functions (see, for example, [16–20]), which is in turn guaranteed by Assumption 7.

Remark 2

It is to be noted that Assumptions 6–7 are only sufficient conditions. Moreover, the assumptions which play a key role in establishing the main results, i.e., Assumptions 1, 2, 6, and 7, are required to hold only on the parameter set Θ rather than on the entire space \(\mathbb {R}^{M}\), which makes our algorithm applicable to very general non-linear sensing functions.

We consider a specific example to give more intuition about the assumptions in this paper. If the fn(·)'s are linear, i.e., fn(θ)=Fnθ, where Fn is the sensing matrix with dimensions Mn×M, Assumption 3 becomes equivalent to \(\sum _{n=1}^{N}\mathbf {F}_{n}^{\top }\mathbf {R}_{n}^{-1}\mathbf {F}_{n}\) being full rank. In this context, the monotonicity condition in Assumption 7 is trivially satisfied by the positive definiteness of the matrix \(\sum _{n=1}^{N}\mathbf {F}_{n}^{\top }\mathbf {R}_{n}^{-1}\mathbf {F}_{n}\). We formalize an assumption on the innovation gain sequence {αt} before proceeding further.

Assumption 8

We require that α0 satisfies

$$\begin{array}{*{20}l} \alpha_{0}c_{1}> 1, \end{array} $$
(30)

where c1 is defined in Assumption 7 and α0 is the innovation gain at t=0.

The communication cost per node for the proposed algorithm is given by \(\mathcal {C}_{t} = \sum _{s=0}^{t-1}\zeta _{s} = \Theta \left (t^{(1+\epsilon)/2}\right)\), which is strictly sub-linear in t since ε<1.
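The sub-linear growth of \(\mathcal {C}_{t}\) is easy to verify numerically; a small sketch (with placeholder ζ0 and ε) follows.

```python
import numpy as np

def comm_cost(t, zeta0=1.0, eps=0.01):
    """Expected per-node communication cost C_t = sum_{s<t} zeta_s."""
    s = np.arange(t)
    return np.sum(zeta0 / (s + 1) ** (0.5 - eps / 2))

# C_t grows like t^{(1+eps)/2}, i.e., strictly sub-linearly in t:
for t in (10**3, 10**4, 10**5):
    print(t, round(comm_cost(t), 1))
```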

4 Main results

In this section, we present the main results for the proposed algorithm \(\mathcal {CREDO-NL}\); the proofs of the main results are relegated to Appendix A. The first result concerns the consistency of the estimate sequence {xn(t)}.

Theorem 4.1

Let Assumptions 1–3 and 5–8 hold. Consider the sequence {xn(t)} generated by algorithm (26)–(27) at each agent n, with the parameters set to \(\rho _{t} = \frac {\rho _{0}}{(t+1)^{\epsilon /2}},\) \(\zeta _{t} = \frac {\zeta _{0}}{(t+1)^{(1/2-\epsilon /2)}},\) and αt=α0/(t+1), where ρ0 and ζ0 are arbitrary positive constants and α0 satisfies Assumption 8. Then, for each n, we have

$$ \mathbb{P}_{{\boldsymbol{\theta}}}\left({\lim}_{t\rightarrow\infty}\mathbf{x}_{n}(t)={\boldsymbol{\theta}}\right)=1. $$
(31)

Theorem 4.1 verifies that the estimate sequence generated by \(\mathcal {CREDO-NL}\) at any agent n is strongly consistent, i.e., xn(t)→θ almost surely (a.s.) as t→∞. While Assumption 4 is needed for asymptotic normality results as in Proposition 1, it is not necessary for Theorem 4.1 (nor for Theorem 4.2 ahead) to hold.

We now state a main result of this paper which establishes the MSE communication rate for the proposed algorithm \(\mathcal {CREDO-NL}\).

Theorem 4.2

Let the hypothesis of Theorem 4.1 hold. Then, we have, for each n,

$$\begin{array}{*{20}l} {\mathbb{E}}\left[\left\|\mathbf{x}_{n}(t)-{\boldsymbol{\theta}}\right\|^{2}\right] =O\left(\frac{1}{t}\right). \end{array} $$
(32)

Furthermore, for each n, we have:

$$\begin{array}{*{20}l} {\mathbb{E}}\left[\left\|\mathbf{x}_{n}(t)-{\boldsymbol{\theta}}\right\|^{2}\right] =O\left(\mathcal{C}_{t}^{-\frac{2}{\epsilon+1}}\right), \end{array} $$
(33)

where 0<ε<1 is as defined in (18).

We make several remarks on Theorems 4.1 and 4.2.

Remark 3

Note that ε in Theorem 4.2 can be taken to be arbitrarily small. Hence, \(\mathcal {CREDO-NL}\) achieves an MSE communication rate arbitrarily close to \(1/\mathcal {C}_{t}^{2}\). This is a significant improvement over existing non-linear distributed consensus+innovations estimation methods, e.g., [18, 20]: they have an O(t) communication cost up to time t and an MSE iteration-wise rate of O(1/t), hence achieving an \(O(1/\mathcal {C}_{t})\) MSE communication rate. \(\mathcal {CREDO-NL}\) achieves the order-optimal O(1/t) MSE iteration-wise rate with a reduced communication cost, thus significantly improving the MSE communication rate.

Remark 4

Observe that the \(\mathcal {CREDO-NL}\) algorithm, with \(\beta_{t}=\beta_{0}(t+1)^{-1}\), has a communication cost of \(\mathcal {C}_{t} = \Theta \left (t^{0.5(1+\epsilon)}\right)\). From this, we can see that the MSE as a function of \(\mathcal {C}_{t}\) is given by \(\text {MSE} = O\left (\mathcal {C}_{t}^{-2/(1+\epsilon)}\right)\).

Of course, with a βt that decays faster than 1/t, the communication cost reduces further. However, it can be shown that in this case the algorithm no longer produces good estimates. Namely, from standard arguments in stochastic approximation, it can be shown that for \(\beta_{t}=\beta_{0}(t+1)^{-1-\delta}\), with δ>0, the \(\mathcal {CREDO-NL}\) estimate sequence may not converge to θ.

Remark 5

The \(\mathcal {CREDO-NL}\) algorithm builds on our prior work in [37, 38, 40], but establishing Theorems 4.1–4.2 incurs several technical challenges with respect to that work. Namely, from a technical standpoint, the \(\mathcal {CIWNLS}\) algorithm in [40] deals with the challenge of non-linear observation models, while \(\mathcal {CREDO}\) in [37, 38] deals with the challenge of increasingly sparse communications. Differently from \(\mathcal {CREDO}\) and \(\mathcal {CIWNLS}\), this paper accounts for both of these challenges simultaneously. This makes the mean square and asymptotic normality analyses more challenging. As a consequence, while for \(\mathcal {CIWNLS}\) and \(\mathcal {CREDO}\) we established both MSE iteration-wise convergence rates and asymptotic normality, here we establish only the MSE (iteration-wise and communication-wise) convergence rate results. Next, \(\mathcal {CREDO-NL}\) is a single time scale stochastic approximation-type algorithm, while both \(\mathcal {CIWNLS}\) and \(\mathcal {CREDO}\) are two time scale algorithms. Further, the consensus potentials in \(\mathcal {CIWNLS}\) and in \(\mathcal {CREDO-NL}\) are the same only on average, i.e., up to the first moment. The difference in the higher order moments calls for a different analysis: the randomized communication protocol incurs, with \(\mathcal {CREDO-NL}\), an increased upper bound on the iteration-wise MSE. A careful analysis in this paper shows that the additional terms in the MSE bounds with \(\mathcal {CREDO-NL}\) decay faster with time t than 1/t, and hence, the MSE iteration-wise rate remains order-optimal and equal to 1/t (see the proofs of Theorems 4.1 and 4.2 in Appendix A). Finally, we point out that the differences of Theorem 4.1 with respect to the works [37, 38] mainly arise from the fact that we consider here non-linear observation models. Due to this difference, several terms that appear in the MSE upper bounds are bounded in a technically different way; see the proof of Lemma 1 in Appendix A. Therein, we need to use arguments like the non-expansiveness property of projections and the Lipschitz continuity of the functions fn, none of which is explicitly used in [37, 38].

5 Simulation experiments

This section corroborates our theoretical findings through simulation examples and demonstrates the communication efficiency of \(\mathcal {CREDO-NL}\).

Specifically, we compare the proposed communication efficient distributed estimator, \(\mathcal {CREDO-NL}\), with the benchmark distributed recursive estimator in (13) and the diffusion algorithm in [43], both of which utilize all inter-neighbor communications at all times, i.e., they have a linear communication cost. The example demonstrates that the proposed communication efficient estimator has a similar MSE iteration-wise rate as the two benchmark estimators. The simulation also shows that the proposed estimator improves the MSE communication rate with respect to the two benchmarks.

We generate a random geometric network of 10 agents, shown in Fig. 1.

Fig. 1 Network deployment of 10 agents

The relative degree of the graph is equal to 0.4. The graph was generated as a connected instance of the geometric graph model with radius \(r=\sqrt {\text {ln}(N)/N}\). To be specific, the first step involves generating 10 points in a unit square, and two nodes are connected with a link if the distance between them is less than \(\sqrt {\text {ln}(N)/N}\). We repeat the procedure until we obtain a connected graph instance. We choose the parameter set Θ to be \(\Theta =\left [-\frac {\pi }{4}, \frac {\pi }{4}\right ]^{7}\subset \mathbb {R}^{7}\). This choice of Θ conforms with Assumption 2.

The sensing functions are chosen to be certain trigonometric functions, as described below. The underlying parameter is θ=[θ1, θ2, θ3, θ4, θ5, θ6, θ7], and thus \({\boldsymbol {\theta }}\in \mathbb {R}^{7}\). The sensing functions at the agents are taken to be f1(θ)= sin(θ1+θ2+θ3), f2(θ)= sin(θ3+θ2+θ4), f3(θ)= sin(θ3+θ4+θ5), f4(θ)= sin(θ4+θ5+θ6), f5(θ)= sin(θ6+θ5+θ7), f6(θ)= sin(θ6+θ7+θ1), f7(θ)= sin(θ1+θ2+θ7), f8(θ)= sin(θ1+θ2+θ4), f9(θ)= sin(θ2+θ3+θ6), and f10(θ)= sin(θ3+θ4+θ6). Thus, each node makes a scalar observation at each time t. The noises γn(t) are Gaussian, i.i.d. both in time and across nodes, with the (stacked) covariance matrix equal to 0.25×I10. The local sensing functions render the parameter θ locally unobservable, but θ is globally observable since, on the parameter set Θ considered here, sin(·) is one-to-one and the linear combinations of the components of θ appearing as the arguments of the sin(·)'s constitute a full-rank system for θ. Hence, the global observability requirement specified by Assumption 3 is satisfied. The deterministic but unknown value of the parameter is taken to be θ=[π/6, −π/7, π/12, −π/5, π/16, 7π/36, π/10]. Under the sensing functions specified above and the parameter set \(\Theta =\left [-\frac {\pi }{4}, \frac {\pi }{4}\right ]^{7}\), it can be easily verified that the model conforms to the conditions specified in Assumptions 3–7. The projection operator \(\mathcal {P}_{\Theta }\) onto the set Θ, defined in (14), is given by,

$$\begin{array}{*{20}l} \left[\mathbf{x}_{n}(t)\right]_{i}= \left\{\begin{array}{ll} \frac{\pi}{4} & [{\widehat{\mathbf{x}}}_{n}(t)]_{i} \geq \frac{\pi}{4}\\ \left[{\widehat{\mathbf{x}}}_{n}(t)\right]_{i} & \frac{-\pi}{4} < [{\widehat{\mathbf{x}}}_{n}(t)]_{i} < \frac{\pi}{4}\\ \frac{-\pi}{4} & [{\widehat{\mathbf{x}}}_{n}(t)]_{i} \leq \frac{-\pi}{4}, \end{array}\right. \end{array} $$
(34)

for all i=1,⋯,M.
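In code, the projection (34) amounts to coordinate-wise clipping; a one-function sketch follows.

```python
import numpy as np

def project_box(x_hat, low=-np.pi / 4, high=np.pi / 4):
    """Projection (34) onto Theta = [-pi/4, pi/4]^M: coordinate-wise
    clipping of the intermediate estimate."""
    return np.clip(x_hat, low, high)
```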

The parameters of the two benchmarks and of the proposed estimator are as follows. The benchmark estimator in (13) has the consensus weight set to \(0.48(t+1)^{-1}\). For the proposed estimator, we set \(\rho_{t}=0.45(t+1)^{-0.01}\) and \(\zeta_{t}=(t+1)^{-0.49}\). The step size sequence for the benchmark estimator proposed in [43] is set to \(\mu_{t}=(0.3(t+20))^{-1}\).

It is to be noted that the Laplacian matrix considered for the benchmark estimator and the expected Laplacian matrix for the proposed estimator \(\mathcal {CREDO-NL}\) are equal, i.e., \(\mathbf {\overline {L}} =\mathbf {L}\). The innovation weight is set to \(\alpha_{t}=(0.3(t+20))^{-1}\). It is to be noted that, with this time-shifted innovation potential, the theoretical results in this paper continue to hold. As a performance metric, we use the relative MSE estimate averaged across nodes:

$$\begin{array}{*{20}l} \frac{1}{N}\sum_{n=1}^{N} \frac{\|\mathbf{x}_{n}(t)-{\boldsymbol{\theta}} \|^{2}}{ \|\mathbf{x}_{n}(0)-{\boldsymbol{\theta}}\|^{2}}, \end{array} $$

further averaged across 100 independent runs of the estimators. In the above equation, xn(0) refers to the initial estimate at each node, which is set to xn(0)=0. Figure 2 plots the relative MSE decay in terms of the number of iterations, i.e., the number of samples. It can be seen that the MSE decay of the two benchmark estimators and the MSE decay of the proposed estimator \(\mathcal {CREDO-NL}\) are very similar with respect to the iteration count. Figure 3 plots the MSE decay of the three estimators in terms of the communication cost per node. It can be seen, for example, that at a relative MSE level of \(10^{-1}\), the proposed estimator requires 20 and 18 times fewer communications than the estimator in (13) and the algorithm in [43], respectively. One can also notice a faster MSE decay in terms of the communication cost for \(\mathcal {CREDO-NL}\) as compared to the benchmark (13), thus confirming our theory.
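For reference, the relative MSE metric above can be computed as in the following sketch; the row-wise layout of the estimates is an assumption of the sketch.

```python
import numpy as np

def relative_mse(x, x0, theta):
    """Node-averaged relative MSE used in Figs. 2 and 3; x and x0 hold
    the current and initial estimates of the N agents row-wise."""
    num = np.sum((x - theta) ** 2, axis=1)
    den = np.sum((x0 - theta) ** 2, axis=1)
    return float(np.mean(num / den))
```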

Fig. 2 Comparison of the proposed and benchmark estimators in terms of relative MSE versus the number of iterations. The light blue line represents the \(\mathcal {CIWNLS}\) algorithm, the dark blue line represents the diffusion-based algorithm proposed in [43], and the red line represents the proposed estimator

Fig. 3 Comparison of the proposed and benchmark estimators in terms of relative MSE versus the communication cost per node. The light blue line represents the \(\mathcal {CIWNLS}\) algorithm, the dark blue line represents the diffusion-based algorithm proposed in [43], and the red line represents the proposed estimator

6 Discussion

In the context of existing work on non-linear distributed methods, e.g., [15, 16, 31, 34–36, 40], the current paper contributes by developing a method with a strictly faster communication rate of \(O(1/\mathcal {C}_{t}^{2-\zeta })\) (ζ>0 arbitrarily small) with respect to the existing \(O(1/\mathcal {C}_{t})\) rates. Further, with respect to existing works that develop methods designed to achieve communication efficiency, e.g., [13, 21, 44–46], we develop here a different scheme with randomized, increasingly sparse communications. Finally, this paper is a continuation of the works [37, 38] but, in contrast with [37, 38], it considers non-linear observation models. This requires novel analysis techniques, as detailed in Section 1. It would be interesting to apply the proposed method to real data sets, e.g., in the context of IoT or power systems applications, in addition to the synthetic data tests considered here.

7 Conclusions

In this paper, we have proposed \(\mathcal {CREDO-NL}\), a communication-efficient distributed estimation scheme for non-linear observation models. We established strong consistency of the estimate sequence at each agent and characterized the MSE decay in terms of the per-agent communication cost \(\mathcal {C}_{t}\). \(\mathcal {CREDO-NL}\) achieves the MSE decay rate \(O\left (\mathcal {C}_{t}^{-2+\zeta }\right)\), where ζ>0 is arbitrarily small. Future research directions include extending the proposed algorithm to a mixed time-scale stochastic approximation-type algorithm, so as to achieve an asymptotic covariance independent of the network, as well as extending the presented ideas to distributed stochastic optimization.

8 Appendix A: Proof of Main Results

We present the proofs of main results in this section.

Proof of Theorem 4.1

We start the proof with the following useful lemma.

Lemma 1

For each n, the process {xn(t)} satisfies

$$\begin{array}{*{20}l} \mathbb{P}_{{\boldsymbol{\theta}}}\left(\sup_{t\ge 0} \left\|\mathbf{x}(t)\right\| < \infty\right) =1. \end{array} $$
(35)

Proof

Consider (27). Since the projection is onto a compact convex set, it is non-expansive. It follows that the inequality

$$\begin{array}{*{20}l} \left\|\mathbf{x}_{n}(t+1)-{\boldsymbol{\theta}}\right\|\leq \left\|{\widehat{\mathbf{x}}}_{n}(t+1)-{\boldsymbol{\theta}}\right\| \end{array} $$
(36)

holds for all n and t. We first note that,

$$\begin{array}{*{20}l} \mathbf{L}(t)=\beta_{t}\overline{\mathbf{L}}+\widetilde{\mathbf{L}}(t), \end{array} $$
(37)

where \(\mathbb {E}\left [\widetilde {\mathbf {L}}(t)\right ] = \mathbf {0}\) and \(\mathbb {E}\left [\widetilde {\mathbf {L}}_{i,j}^{2}(t)\right ] = \frac {\rho _{0}^{2}\beta _{0}}{(t+1)^{1+\epsilon }} - \frac {\beta _{0}^{2}}{(t+1)^{2}}\), for {i,j}∈E,ij.

Define \(\mathbf{z}(t)=\mathbf{x}(t)-\mathbf{1}_{N}\otimes{\boldsymbol{\theta}}\) and \(V(t)=\|\mathbf{z}(t)\|^{2}\). Note that z(t) corresponds to the estimation error vector at time t; its squared norm V(t) will first serve as a Lyapunov function to establish the almost sure boundedness of x(t) as in Lemma 1. Let \(\{\mathcal {F}_{t}\}\) be the natural filtration generated by the random observations and the random Laplacians, i.e.,

$$\begin{array}{*{20}l} \mathcal{F}_{t}=\mathbf{\sigma}\left(\left\{\left\{\mathbf{y}_{n}(s)\right\}_{n=1}^{N}, \left\{\mathbf{L}(s)\right\}\right\}_{s=0}^{t-1}\right). \end{array} $$
(38)

Now, consider the update rules (26)–(28). By algebraic manipulations, conditional independence, and utilizing (36), we have that,

$$\begin{array}{*{20}l} &{}\mathbb{E}\left[V(t+1)|\mathcal{F}_{t}\right] \leq V(t) + \beta_{t}^{2}\mathbf{z}^{\top}(t)\left(\mathbf{\overline{L}} \otimes \mathbf{I}_{M}\right)^{2}\mathbf{z}(t)\\ &{}+\alpha_{t}^{2}{\mathbb{E}}\left[\left\|\mathbf{G}\left(\mathbf{x}(t)\right)\mathbf{R}^{-1} \left(\mathbf{y}(t)-\mathbf{f}\left(\mathbf{1}_{N}\otimes{\boldsymbol{\theta}}\right)\right)\right\|^{2}\right]\\ &{}-2\beta_{t}\mathbf{z}^{\top}(t)\left(\mathbf{\overline{L}}\otimes \mathbf{I}_{M}\right)\mathbf{z}(t)\\ &{}-2\alpha_{t}\mathbf{z}^{\top}(t)\mathbf{G}\left(\mathbf{x}(t)\right)\mathbf{R}^{-1} \left(\mathbf{f}\left(\mathbf{x}(t)\right)-\mathbf{f}\left(\mathbf{1}_{N}\otimes{\boldsymbol{\theta}}\right)\right)\\ &{}+2\alpha_{t}\beta_{t}\mathbf{z}^{\top}(t)\left(\mathbf{\overline{L}} \otimes \mathbf{I}_{M}\right)\mathbf{G}\left(\mathbf{x}(t)\right)\mathbf{R}^{-1} \left(\mathbf{f}\left(\mathbf{x}(t)\right)\,-\,\mathbf{f}\left(\mathbf{1}_{N}\otimes{\boldsymbol{\theta}}\right)\right) \\ &{}+\alpha_{t}^{2}\left\|\left(\mathbf{f}\left(\mathbf{x}(t)\right)- \mathbf{f}\left(\mathbf{1}_{N}\otimes{\boldsymbol{\theta}}\right)\right)^{\top}\mathbf{G}^{\top} \left(\mathbf{x}(t)\right)\mathbf{R}^{-1}\right\|^{2}\\&{}+ \mathbf{z}^{\top}(t){\mathbb{E}}\left[\left(\widetilde{\mathbf{L}}(t)\otimes\mathbf{I}_{M}\right)^{2}\right]\mathbf{z}(t). \end{array} $$
(39)

Consider the orthogonal decomposition

$$\begin{array}{*{20}l} \mathbf{z}=\mathbf{z}_{C}+\mathbf{z}_{C \perp}, \end{array} $$
(40)

where zC denotes the projection of z onto the consensus subspace \(\mathcal {C}=\left \{\mathbf {z} \in \mathbb {R}^{MN}\, |\,\mathbf {z}=\mathbf {1}_{N}\otimes \mathbf {a}~\text{for some}~\mathbf {a} \in \mathbb {R}^{M} \right \}\). The following inequalities hold for all t≥t1, where t1 is a sufficiently large positive integer:

$$\begin{array}{*{20}l} &{}\mathbf{z}^{\top}(t){\mathbb{E}}\left[\left(\widetilde{\mathbf{L}}(t)\otimes\mathbf{I}_{M}\right)^{2}\right]\mathbf{z}(t) \overset{(q0)}{\le} \frac{c_{5}\left\|\mathbf{z}_{\mathcal{C}^{\perp}}(t)\right\|^{2}}{(t+1)^{1+\epsilon}} \\ &{}\mathbf{z}^{\top}(t)\left(\mathbf{\overline{L}} \otimes \mathbf{I}_{M}\right)^{2}\mathbf{z}(t) \overset{(q1)}{\le} \lambda_{N}^{2}(\mathbf{\overline{L}})||\mathbf{z}_{C\perp}(t)||^{2};\\ &{}\mathbf{z}^{\top}(t)\mathbf{G}\left(\mathbf{x}(t)\right)\mathbf{R}^{-1} \left(\mathbf{f}\left(\mathbf{x}(t)\right)\,-\,\mathbf{f}\left(\mathbf{1}_{N}\otimes{\boldsymbol{\theta}}\right)\right)\!\overset{(q2)}{\ge}\! c_{1}||\mathbf{z}(t)||^{2} {\ge} 0;\\ &{}\mathbf{z}^{\top}(t)\left(\mathbf{\overline{L}}\otimes \mathbf{I}_{M}\right)\mathbf{z}(t) \overset{(q3)}{\ge} \lambda_{2}(\mathbf{\overline{L}})\left\|\mathbf{z}_{C\perp}(t)\right\|^{2};\\ &{}\mathbf{z}^{\top}(t)\left(\mathbf{\overline{L}}\otimes \mathbf{I}_{M}\right)\mathbf{G}\left(\mathbf{x}(t)\right)\mathbf{R}^{-1} \left(\mathbf{f}\left(\mathbf{x}(t)\right)-\mathbf{f}\left(\mathbf{1}_{N}\otimes{\boldsymbol{\theta}}\right)\right) \\&\overset{(q4)}{\le} c_{2}\left\|\mathbf{z}(t)\right\|^{2}. \end{array} $$
(41)

Here, we recall that \(\lambda _{N}(\mathbf {\overline {L}})\) is the largest eigenvalue of the matrix \(\mathbf {\overline {L}}\). Further, c1 is defined in Assumption 7, and c2, c5 are appropriately chosen positive constants. Here, zC⊥(t)=z(t)−zC(t), where zC(t) is the projection of z(t) onto the consensus subspace \(\mathcal C\). Inequality (q0) holds because, as noted above, \(\mathbb {E}\left [\widetilde {\mathbf {L}}_{i,j}^{2}(t)\right ] \leq \frac {\rho _{0}^{2}\beta _{0}}{(t+1)^{1+\epsilon }} \), for {i,j}∈E, i≠j. Specifically, the constant c5 can be taken to equal \(2 \,N^{3}\,\rho _{0}^{2}\beta _{0}\). Next, inequalities (q1) and (q3) follow from the properties of the Laplacian. Inequality (q2) follows from Assumption 7, and (q4) follows from Assumption 6, since ∥∇fn(xn(t))∥ is uniformly bounded from above by kn for all n, and hence ∥G(x(t))∥≤ maxn=1,⋯,N kn. (Recall the quantity G(x(t)) defined before Remark 1.) That is, c2 can be taken as \((\max _{n=1,\cdots,N}k_{n})^{2} (\max _{n=1,\cdots,N}\|\mathbf {R}_{n}^{-1}\|) \|\overline {\mathbf {L}}\|\). We also have

$$\begin{array}{*{20}l} {\mathbb{E}}\left[\left\|\mathbf{G}\left(\mathbf{x}(t)\right)\mathbf{R}^{-1} \left(\mathbf{y}(t)-\mathbf{f}\left(\mathbf{1}_{N}\otimes{\boldsymbol{\theta}}\right)\right)\right\|^{2}\right] \le c_{4}, \end{array} $$
(42)

for some constant c4>0. Inequality (42) uses the fact that the observation noise has finite covariance, together with the almost sure bound ∥G(x(t))∥≤maxn=1,⋯,Nkn, which again follows from Assumption 6. In particular, c4 may be taken as \((\max _{n=1,\cdots,N}k_{n})^{2} (\max _{n=1,\cdots,N}\|\mathbf {R}_{n}^{-1}\|)^{2} (\max _{n=1,\cdots,N}\|\mathbf {R}_{n}\|)^{2} \). We further have

$$\begin{array}{*{20}l} {}\left\|\mathbf{G}\left(\mathbf{x}(t)\right)\mathbf{R}^{-1}\left(\mathbf{f}\left(\mathbf{x}(t)\right)-\mathbf{f} \left(\mathbf{1}_{N}\otimes{\boldsymbol{\theta}}\right)\right)\right\|^{2}\le c_{3}\left\|\mathbf{z}(t)\right\|^{2}, \end{array} $$
(43)

where c3>0 is a constant. Inequality (43) follows from the Lipschitz continuity in Assumption 6 together with the bound ∥G(x(t))∥≤maxn=1,⋯,Nkn; in particular, c3 may be taken as \((\max _{n=1,\cdots,N}k_{n})^{4} (\max _{n=1,\cdots,N}\|\mathbf {R}_{n}^{-1}\|)^{2}\). Applying the bounds (41)–(43) in (39), we obtain, after some algebraic manipulations,

$$\begin{array}{*{20}l} {}\mathbb{E}\left[V(t+1)|\mathcal{F}_{t}\right]&\le (1+c_{8}{\alpha^{2}_{t}})V(t)\\&\quad-c_{9}\left(\beta_{t}-\frac{c_{5}}{(t+1)^{1+\epsilon}}\right) \left\|\mathbf{z}_{C\perp}(t)\right\|^{2}+c_{6}{\alpha^{2}_{t}}, \end{array} $$
(44)

where c6,c8,c9 are appropriately chosen positive constants, and c5 is as in (41). In particular, c6 may be taken as c4, c8 may be taken as \(\beta _{0}^{2}\,(\lambda _{N}(\overline {\mathbf {L}}))^{2} /\alpha _{0}^{2} + 2\beta _{0}\sqrt {c_{3}}+c_{3}\), and c9 may be taken as \(2\, \lambda _{2}(\overline {\mathbf {L}})\).

Since \(\frac {c_{5}}{(t+1)^{1+\epsilon }}\) goes to zero faster than βt, there exists t2 such that, for all t≥t2, \(\beta _{t} \ge \frac {c_{5}}{(t+1)^{1+\epsilon }}\). Dropping the resulting non-positive term in (44), we obtain, for all t≥t2,

$$\begin{array}{*{20}l} {\mathbb{E}}[V(t+1) | \mathcal{F}_{t}] \le (1+c_{8}{\alpha^{2}_{t}})V(t)+\widehat\alpha_{t}^{2}, \end{array} $$
(45)

where \(\widehat {\alpha }_{t} = \sqrt {c_{6}}\,\alpha _{t}\). Since \(\sum _{s}\alpha _{s}^{2}<\infty \), the product \(\prod _{s=t}^{\infty }(1+c_{8}\alpha _{s}^{2})\) exists and is finite for every t. Now, let {W(t)} be such that

$$\begin{array}{*{20}l} W(t)=\left(\prod_{s=t}^{\infty}(1+c_{8}\alpha_{s}^{2})\right)V(t)+\sum_{s=t}^{\infty}\widehat{\alpha}_{s}^{2},~\forall t\geq t_{2}. \end{array} $$
(46)

Combining (45) with (46), it can be shown that {W(t)} satisfies

$$\begin{array}{*{20}l} {\mathbb{E}}[W(t+1) | \mathcal{F}_{t}] \le W(t). \end{array} $$
(47)
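To verify (47), note that, by (46) and (45), a sketch of the computation reads

$$\begin{array}{*{20}l} {\mathbb{E}}[W(t+1)|\mathcal{F}_{t}] &= \left(\prod_{s=t+1}^{\infty}(1+c_{8}\alpha_{s}^{2})\right){\mathbb{E}}[V(t+1)|\mathcal{F}_{t}] + \sum_{s=t+1}^{\infty}\widehat{\alpha}_{s}^{2}\\ &\le \left(\prod_{s=t}^{\infty}(1+c_{8}\alpha_{s}^{2})\right)V(t) + \left(\prod_{s=t+1}^{\infty}(1+c_{8}\alpha_{s}^{2})\right)\widehat{\alpha}_{t}^{2} + \sum_{s=t+1}^{\infty}\widehat{\alpha}_{s}^{2}, \end{array} $$

where the middle term is at most a constant multiple of \(\widehat{\alpha}_{t}^{2}\), the infinite product being bounded uniformly in t; absorbing this bounded factor into the constant c6 (equivalently, into the definition of \(\widehat{\alpha}_{t}\)) yields (47).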

Hence, {W(t)} is a non-negative supermartingale and converges a.s. to a finite random variable W as t→∞. It then follows from (46) that V(t)→W a.s. as t→∞. Thus, we conclude that the desired claim holds. □

The following lemma plays a key role in establishing the convergence of the estimate sequence.

Lemma 2

(Lemma 4.1 in [18]) Consider the scalar time-varying linear system

$$\begin{array}{*{20}l} u(t+1)\leq(1-r_{1}(t))u(t)+r_{2}(t), \end{array} $$
(48)

where {r1(t)} is a sequence, such that

$$\begin{array}{*{20}l} \frac{a_{1}}{(t+1)^{\delta_{1}}}\leq r_{1}(t)\leq 1 \end{array} $$
(49)

with a1>0,0≤δ1<1, whereas the sequence {r2(t)} is given by

$$\begin{array}{*{20}l} r_{2}(t)\le\frac{a_{2}}{(t+1)^{\delta_{2}}} \end{array} $$
(50)

with a2>0,δ2≥0. Then, if u(0)≥0 and δ1<δ2, we have

$$\begin{array}{*{20}l} {\lim}_{t\to\infty}(t+1)^{\delta_{0}}u(t)=0, \end{array} $$
(51)

for all 0≤δ0<δ2−δ1. Also, if δ1=δ2, then the sequence {u(t)} stays bounded, i.e., \(\sup _{t\geq 0} u(t)<\infty \).
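Lemma 2 is straightforward to sanity-check numerically. The sketch below (with arbitrarily chosen constants a1, a2, δ1, δ2, not tied to the paper) iterates the recursion (48) at equality and confirms that (t+1)^δ0 u(t) decays for 0≤δ0<δ2−δ1:

```python
# Sanity check of Lemma 2 with illustrative constants (not from the paper):
# u(t+1) = (1 - r1(t)) u(t) + r2(t), with r1(t) = a1/(t+1)^d1 <= 1 and
# r2(t) = a2/(t+1)^d2.
a1, d1 = 0.5, 0.4
a2, d2 = 1.0, 1.0
d0 = 0.3          # any 0 <= d0 < d2 - d1 = 0.6 should give (t+1)^d0 * u(t) -> 0

u, T = 1.0, 10**6
for t in range(T):
    u = (1.0 - a1 / (t + 1) ** d1) * u + a2 / (t + 1) ** d2

# Asymptotically u(t) ~ (a2/a1) * t^(d1 - d2); the printed value is roughly
# 2 * T^(-0.3) (about 0.03 here) and keeps shrinking as T grows.
print((T + 1) ** d0 * u)
```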

We now prove the almost sure convergence of the estimate sequence to the true parameter. Following steps similar to those in the proof of Lemma 1, we have, for t large enough,

$$\begin{array}{*{20}l} &{\mathbb{E}}[V(t+1)|\mathcal{F}_{t}]\le\left(1-2c_{1}\alpha_{t}+c_{7}\alpha^{2}_{t}\right)V(t)+c_{6}\alpha_{t}^{2}\\ &\le V(t)+c_{6}\alpha_{t}^{2}, \end{array} $$
(52)

since, for t large enough, \(-2c_{1}\alpha _{t}+c_{7}\alpha ^{2}_{t}<0\). Here, the constant c6 is as in (44), and c7 is an appropriately chosen positive constant that may be taken as \(\beta _{0}^{2}\,(\lambda _{N}(\overline {\mathbf {L}}))^{2} /\alpha _{0}^{2} + 2\beta _{0}\sqrt {c_{3}}+c_{3}\). Now, consider the \(\{\mathcal {F}_{t}\}\)-adapted process {V1(t)} defined as

$$\begin{array}{*{20}l} V_{1}(t)&=V(t)+c_{6}\sum_{s=t}^{\infty}\alpha_{s}^{2}\\ &=V(t)+c_{6}\alpha_{0}^{2}\,\sum_{s=t}^{\infty}(s+1)^{-2}. \end{array} $$
(53)

Since {(t+1)−2} is summable, the process {V1(t)} is bounded from above. Moreover, \(\{V_{1}(t)\}_{t\geq t_{1}}\) is a supermartingale and hence converges a.s. to a finite random variable. By the definition in (53), {V(t)} then also converges a.s. to a non-negative finite random variable V∗. Finally, taking expectations on both sides of (52) and noting that, for t large enough, \(c_{7}\alpha _{t}^{2}\le c_{1}\alpha _{t}\), we obtain

$$\begin{array}{*{20}l} {\mathbb{E}}[V(t+1)]\le \left(1-c_{1}\alpha_{t}\right){\mathbb{E}}[V(t)]+c_{6}\alpha_{0}^{2}\,(t+1)^{-2}, \end{array} $$
(54)

for t large enough. The sequence {E[V(t)]} then falls under the purview of Lemma 3 ahead (whose hypothesis c>1 holds, since α0c1>1 by Assumption 8), and we have \({\mathbb {E}}[\!V(t)]\to 0\) as t→∞. Finally, by Fatou's lemma, using the non-negativity of the sequence {V(t)}, we conclude that

$$\begin{array}{*{20}l} 0\leq {\mathbb{E}}[V^{*}]\le\liminf_{t\to\infty}{\mathbb{E}}[V(t)]=0, \end{array} $$
(55)

which implies that V∗=0 a.s. Hence, ∥z(t)∥→0 a.s. as t→∞, and the desired assertion follows.
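To illustrate the strong consistency result, the following self-contained toy simulation (our own construction with hypothetical sensing functions fn(x)=kn x+0.5 tanh(x), illustrative weight sequences, and a 4-agent ring; it is not the paper's experimental setup) runs a CREDO-NL-style consensus + innovations recursion for a scalar parameter, with link activations that become sparser over time. All agents' estimates approach θ:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 4, 50000
theta = 1.5                                   # true (scalar) parameter
k = np.array([1.0, 0.8, 1.2, 0.9])            # hypothetical per-agent gains
r = np.array([0.5, 0.4, 0.6, 0.5])            # hypothetical noise variances
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]      # ring communication graph

def f(x):                                     # strictly increasing, Lipschitz sensing
    return k * x + 0.5 * np.tanh(x)

def fgrad(x):                                 # bounded gradients (Assumption 6-like)
    return k + 0.5 / np.cosh(x) ** 2

x = np.zeros(N)                               # initial estimates at all agents
a0, b0, eps = 0.3, 0.4, 0.5
for t in range(T):
    alpha = a0 / (t + 1)                      # innovation weight
    beta = b0 / (t + 1) ** 0.51               # consensus weight (illustrative decay)
    rho = (t + 1) ** ((eps - 1) / 2)          # link activation probability, so the
                                              # cost up to t grows as t^((1+eps)/2)
    lap = np.zeros(N)
    for i, j in edges:
        if rng.random() < rho:                # probabilistically sparse communication
            lap[i] += x[i] - x[j]
            lap[j] += x[j] - x[i]
    y = f(theta) + np.sqrt(r) * rng.standard_normal(N)   # local noisy observations
    x = x - beta * lap + alpha * (fgrad(x) / r) * (y - f(x))

print(np.abs(x - theta))                      # all entries small: estimates near theta
```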

We will use the following approximation result (Lemma 3) and generalized convergence criterion (Lemma 4) in the proof of Theorem 4.2; Lemma 3 is an extension of Lemma 5 in [18], and Lemma 4 is Lemma 10 in [8].

Lemma 3

Let {bt} be a scalar sequence satisfying

$$\begin{array}{*{20}l} b_{t+1}\le \left(1-\frac{c}{t+1}\right)b_{t}+d(t+1)^{-2}, \end{array} $$
(56)

where d>0 and c>1. Then, we have,

$$\begin{array}{*{20}l} \limsup_{t\to\infty}~(t+1)\,b_{t}<\infty. \end{array} $$
(57)
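The role of the condition c>1 can again be seen numerically; in the sketch below (illustrative constants), iterating (56) at equality keeps (t+1) bt bounded for c>1, whereas for c<1 the homogeneous part decays only like t−c, so (t+1) bt grows without bound:

```python
def iterate(c, d=1.0, b0=1.0, T=10**6):
    """Iterate b_{t+1} = (1 - c/(t+1)) * b_t + d/(t+1)^2 and return (T+1)*b_T."""
    b = b0
    for t in range(T):
        b = (1.0 - c / (t + 1)) * b + d / (t + 1) ** 2
    return (T + 1) * b

print(iterate(c=2.0))   # bounded: tends to d/(c-1) = 1.0 as T grows
print(iterate(c=0.5))   # grows roughly like T^(1-c); diverges as T grows
```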

Lemma 4

Let {J(t)} be an \(\mathbb {R}\)-valued \(\{\mathcal {F}_{t+1}\}\)-adapted process such that \(\mathbb {E}\left [J(t)|\mathcal {F}_{t}\right ]=0\) a.s. for each t≥1. Then the sum \(\sum _{t\geq 0}J(t)\) exists and is finite a.s. on the set where \(\sum _{t\geq 0}\mathbb {E}\left [J(t)^{2}|\mathcal {F}_{t}\right ]\) is finite.

Proof of Theorem 4.2

Consider inequality (54), and recall that, by Assumption 8, we have α0c1>1. The sequence {E[V(t)]} thus falls under the purview of Lemma 3, and we have

$$\begin{array}{*{20}l} &\limsup_{t\to\infty}(t+1){\mathbb{E}}[V(t+1)] < \infty \\ &\Rightarrow {\mathbb{E}}[V(t)] = O\left(\frac{1}{t}\right). \end{array} $$
(58)

Inequality (58) now implies that, for each agent n,

$$\begin{array}{*{20}l} &{\mathbb{E}}[\left\|\mathbf{x}_{n}(t)-{\boldsymbol{\theta}}\right\|^{2}] = O\left(\frac{1}{t}\right). \end{array} $$
(59)

The communication cost \(\mathcal {C}_{t}\) for the proposed \(\mathcal {CREDO-NL}\) algorithm is given by \(\mathcal {C}_{t} = \Theta \left (t^{\frac {\epsilon +1}{2}}\right)\), and thus the assertion follows in conjunction with (59). □
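To make the last step explicit, invert the communication cost relation: \(t = \Theta \left (\mathcal {C}_{t}^{\frac {2}{1+\epsilon }}\right)\), so that (59) can be restated in terms of the communication cost as

$$\begin{array}{*{20}l} {\mathbb{E}}\left[\left\|\mathbf{x}_{n}(t)-{\boldsymbol{\theta}}\right\|^{2}\right] = O\left(\frac{1}{t}\right) = O\left(\mathcal{C}_{t}^{-\frac{2}{1+\epsilon}}\right) = O\left(\frac{1}{\mathcal{C}_{t}^{2-\zeta}}\right), \end{array} $$

with \(\zeta = \frac {2\epsilon }{1+\epsilon }\), which can be made arbitrarily small by choosing ε>0 small.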