1 Introduction

Federated learning (FL) [1] is a distributed architecture in which users no longer need to send their data to a server for training. Instead, data remains local, and training happens in collaboration between the clients and the server. Compared to a fully decentralized solution, communication occurs between the server and the clients (or agents), instead of directly between the agents themselves. Such a solution is advantageous because users no longer need to worry about sharing their data with an unknown party, and the high cost of sending all their raw data is eliminated. In this way, the data stays safely on each user's device, and no extra communication cost is incurred for transferring it remotely. However, such a distributed architecture is not robust to communication failures and computational overloads, nor is it immune to privacy attacks when agents are required to share their local updates. In standard FL, millions of users can be connected to one server at a time. This means a single server is responsible for communicating with all clients, a significant computational burden that renders the system susceptible to communication failures. Furthermore, whether clients send their gradient updates or their local models, information about their data can be inferred from the exchanges and leaked [2,3,4,5]. Consider, for instance, the logistic risk; the gradient of the loss function is a constant multiple of the feature vector. Thus, even though the actual data samples are not sent to the server, information about them can still be inferred from the gradient updates or the models.
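To make the leakage concrete: for the logistic loss, the stochastic gradient is a scalar multiple of the feature vector, so a single shared gradient reveals the feature direction exactly. The sketch below (our own illustration; the variable names are not from the cited references) verifies this numerically.

```python
import numpy as np

# For the logistic loss Q(w; x, y) = log(1 + exp(-y * w @ x)),
# the gradient is
#   grad = -y * sigmoid(-y * w @ x) * x,
# i.e., a scalar multiple of the private feature vector x. An
# eavesdropper observing one gradient recovers x up to scale.

rng = np.random.default_rng(0)
x = rng.standard_normal(5)          # private feature vector
y = 1.0                             # label
w = rng.standard_normal(5)          # current model

margin = -y * (w @ x)
scalar = -y / (1.0 + np.exp(-margin))   # -y * sigmoid(-y * w @ x)
grad = scalar * x                       # gradient of the logistic loss

# The gradient is collinear with x: normalizing recovers x's direction.
cos = abs(grad @ x) / (np.linalg.norm(grad) * np.linalg.norm(x))
print(f"cosine similarity between gradient and x: {cos:.6f}")
```

The cosine similarity equals 1, confirming that the gradient and the feature vector are collinear.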

These considerations motivate us to propose an architecture for federated learning with privacy guarantees. In particular, we introduce the graph federated architecture, which consists of multiple servers, and we privatize the algorithm by ensuring that the communication occurring between the servers and the clients is secure. Graph-homomorphic perturbations, initially introduced in [6], focus on the communication between servers. They are based on adding correlated noise to the messages sent between servers such that the noise cancels out when averaged across all messages from all servers. As for the privatization between the clients and their servers, we share noisy updates as opposed to noisy models. Together, the two protocols ensure that the effect of the added noise is reduced.

Other works have also contributed to addressing the same challenges we are considering in this work, albeit differently. For example, the work [7] introduces a hierarchical architecture, where it is assumed there are multiple servers connected in a tree structure. Such a solution still has one main server and thus faces the same robustness problem as standard FL. The graph federated learning architecture in this work (which appeared in the earlier conference publication [8]) is a more general structure. The work [9] generalizes the standard distributed learning framework to include local updates. While [10] adopts an architecture similar to the GFL architecture proposed earlier in [8], it does not deal with privacy, and it employs different objective functions and a different learning algorithm based on the alternating direction method of multipliers. Likewise, a plethora of solutions exist that address privacy issues. These methods may be split into two sub-groups: those using random perturbations to ensure a certain level of differential privacy [11,12,13,14,15,16,17,18,19,20], and those that rely on cryptographic methods [21,22,23,24,25]. Both have their advantages and disadvantages. While differential privacy is easy to implement, it hinders the performance of the algorithm by reducing the model utility. As for cryptographic methods, they are generally harder to implement since they require more computational and communication power [26, 27]. Furthermore, they restrict the number of participating users. In what follows, we focus on differentially private methods.

The main contribution of this work is three-fold. First, we introduce a new generalized and more realistic architecture for the federated setting, in which multiple servers are connected by a graph structure. Furthermore, many earlier works have proposed adding Laplacian noise to the information shared among agents in order to ensure some level of privacy. However, these works have largely ignored the fact that such noise degrades the mean-square-error (MSE) performance of the network from \(O(\mu )\) down to \(O(\mu ^{-1})\), where \(\mu\) is the small learning parameter. To resolve this issue, we define a new noise generation scheme that maintains the MSE at O(1) while ensuring privacy. Although the work [20] proposed a noisy-distributed consensus strategy, that reference lacks a useful construction method for the perturbations; in this work, we devise such a construction scheme. Therefore, the main difference between our proposed method and previous works is that we devise a noise construction scheme that ensures the total sum of the added noise cancels out centrally, resulting in the improved MSE bound of O(1). Finally, we prove that clients sharing noisy updates, as opposed to noisy models, leads to improved performance relative to what is commonly done in the prior literature. Moreover, we do not assume bounded gradients, as is commonly done in previous works [12, 15, 16], since this condition does not actually hold in most practical situations; note, for instance, that even quadratic risks do not have bounded gradients. For this reason, we will not rely on this condition, and will instead show that our noise construction ensures differential privacy with high probability for most cases of interest. The main results shown in this work are as follows:

  1. Privatized GFL under graph-homomorphic perturbations converges in the MSE sense to an O(1) neighbourhood of the true model \(w^o\), as opposed to \(O(\mu ^{-1})\) when random perturbations are used instead.

  2. Privatized FL under perturbed gradients converges in the MSE sense to an \(O(\mu )\) neighbourhood of the true model \(w^o\), as opposed to \(O(\mu ^{-1})\) when perturbed models are shared instead.

  3. GFL with graph-homomorphic perturbations and perturbed gradients is \(\epsilon (i)\)-differentially private with high probability.

2 Graph federated architecture

In the graph federated architecture, which we initially introduced in [8], we consider P federated units connected by a graph structure. Each federated unit consists of a server and a set of K agents. Thus, the overall architecture can be represented by the graph depicted in Fig. 1. We denote the combination matrix connecting the servers by \(A \in {\mathbb {R}}^{P\times P }\), and we write \(a_{mp}\) to refer to the elements of A. We assume each agent of every server has its own dataset \(\{x_{p,k,n}\}_{n=1}^{N_{p,k}}\), which is non-iid when compared to those of the other agents. The subscript p refers to the federated unit, k to the agent, and n to the data sample. We note the difference between our proposed architecture and a fully distributed setting: the graph federated architecture consists of a network of federated units, while a fully distributed network removes the need for servers and assumes clients are connected to each other based on some graph structure. Such an architecture is an improvement on the original federated architecture, and not necessarily on the fully distributed architecture. Instead of all clients communicating with the same server, we split the load among multiple servers.

Fig. 1: The graph federated learning architecture

With this architecture, we associate a convex optimization problem that will take into account the cost function at each federated unit. Thus, the optimization goal is to find the optimal global model \(w^o\) that minimizes an average empirical risk:

$$\begin{aligned} w^o \,\overset{\Delta }{=}\,\mathop {\textrm{argmin}}\limits _{w\in {\mathbb {R}}^M} \frac{1}{P}\sum _{p=1}^P \frac{1}{K}\sum _{k=1}^K J_{p,k}(w), \end{aligned}$$
(1)

where each individual cost is an empirical risk defined over the local loss functions \(Q_{p,k}(\cdot ;\cdot )\):

$$\begin{aligned} J_{p,k} (w) \,\overset{\Delta }{=}\,\frac{1}{N_{p,k}} \sum _{n=1}^{N_{p,k}} Q_{p,k}(w;x_{p,k,n}). \end{aligned}$$
(2)

To solve problem (1), each federated unit p runs the standard federated averaging (FedAvg) algorithm [1]. An iteration i of the algorithm consists of the server p selecting a subset of L participating agents \({\mathcal {L}}_{p,i}\). Then, in parallel, each agent runs a series of stochastic gradient descent (SGD) steps. We call these local steps epochs, denote an epoch by the letter e, and denote the total number of epochs by \(E_{p,k}\). The data point sampled by agent k in federated unit p during the \(e^{th}\) epoch of iteration i is indexed by b. Thus, during an iteration i, each participating agent \(k \in {\mathcal {L}}_{p,i}\) starts from the last global model \({\varvec{w}}_{p,i-1}\) and sends its new model \({\varvec{w}}_{p,k,E_{p,k}}\) to the server after \(E_{p,k}\) epochs. During a single epoch e, the agent updates its current local model \(w_{p,k,e-1}\) by running a single SGD step. Thus, an agent repeats the following adaptation step for \(e=1,2,\ldots , E_{p,k}\):

$$\begin{aligned} {\varvec{w}}_{p,k,e} =&\,{\varvec{w}}_{p,k,e-1} - \frac{ \mu }{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,k,e-1};{\varvec{x}}_{p,k,b}), \end{aligned}$$
(3)

where \({\varvec{x}}_{p,k,b}\) is the data point sampled by agent k in federated unit p, and \({\varvec{w}}_{p,k,0} = {\varvec{w}}_{p,i-1}\). After all the participating agents \(k \in {\mathcal {L}}_{p,i}\) run all their epochs, the server aggregates their final models \({\varvec{w}}_{p,k,E_{p,k}}\), which we rename as \({\varvec{w}}_{p,k,i}\) since each is the final local model at iteration i:

$$\begin{aligned} {\varvec{\psi }}_{p,i} = \frac{1}{L}\sum _{k\in {\mathcal {L}}_{p,i}} {\varvec{w}}_{p,k,i}. \end{aligned}$$
(4)

Next, at the server level, these estimates are combined across neighbourhoods using a diffusion type strategy, where we first consider the previous steps (3) and (4) as the adaptation step and the following step as the combination step:

$$\begin{aligned} {\varvec{w}}_{p,i} = \sum _{m\in {\mathcal {N}}_p}a_{pm}{\varvec{\psi }}_{m,i}. \end{aligned}$$
(5)

To introduce privacy, the models communicated at each round between the agents and the servers need to be encrypted in some way. We could either apply secure multiparty computation (SMC) tools, like secret sharing, or use differential privacy. We focus on differential privacy, i.e., masking tools that can be represented by added noise. Thus, we let agent 1 in federated unit 2 add a noise component \({\varvec{g}}_{2,1,i}\) to its final model \({\varvec{w}}_{2,1,i}\) at iteration i, and then let server 2 add \({\varvec{g}}_{12,i}\) to the message \({\varvec{\psi }}_{2,i}\) it sends to server 1. More generally, we denote by \({\varvec{g}}_{pm,i}\) the noise added to the message sent by server m to server p at iteration i. Similarly, we denote by \({\varvec{g}}_{p,k,i}\) the noise added to the model sent by agent k to server p during the ith iteration. We use unseparated subscripts pm for the inter-server noise components to point out that they can be collected into a matrix structure. In contrast, the agent-server noise subscripts are separated by a comma to highlight the hierarchical structure. Thus, the privatized algorithm can be written as a client update step (6), a server aggregation step (7), and a server combination step (8):

$$\begin{aligned} {\varvec{w}}_{p,k,i}&= {\varvec{w}}_{p,i-1} - \frac{\mu }{E_{p,k}} \sum _{e=1}^{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,k,e-1};{\varvec{x}}_{p,k,b}), \end{aligned}$$
(6)
$$\begin{aligned} {\varvec{\psi }}_{p,i}&= \frac{1}{L}\sum _{k \in {\mathcal {L}}_{p,i}} \left( {\varvec{w}}_{p,k,i} + {\varvec{g}}_{p,k,i}\right) , \end{aligned}$$
(7)
$$\begin{aligned} {\varvec{w}}_{p,i}&= \sum _{m\in {\mathcal {N}}_p} a_{pm}({\varvec{\psi }}_{m,i} + {\varvec{g}}_{pm,i}). \end{aligned}$$
(8)
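For concreteness, one iteration of the privatized steps (6)-(8) can be sketched in a short simulation. The quadratic loss, the small fully connected graph of three servers, and all parameter values below are illustrative assumptions made for this sketch; the algorithm itself is loss-agnostic.

```python
import numpy as np

# One privatized GFL iteration, steps (6)-(8), sketched with the
# quadratic loss Q(w; x) = 0.5 * ||x - w||^2 (an assumption made here
# for concreteness). P units, K agents each, L sampled per iteration.
rng = np.random.default_rng(1)
P, K, L, E, M, mu, sigma_g = 3, 4, 2, 5, 2, 0.1, 0.01

# Symmetric, doubly stochastic combination matrix A (Assumption 1).
A = np.array([[0.5, 0.25, 0.25],
              [0.25, 0.5, 0.25],
              [0.25, 0.25, 0.5]])

data = rng.standard_normal((P, K, 10, M))        # x_{p,k,n}
w = [rng.standard_normal(M) for _ in range(P)]   # w_{p,i-1}
psi = np.zeros((P, M))

for p in range(P):
    participants = rng.choice(K, size=L, replace=False)  # L_{p,i}
    models = []
    for k in participants:
        wk = w[p].copy()
        for e in range(E):                        # E local epochs, step (6)
            xb = data[p, k, rng.integers(10)]     # sampled data point b
            grad = wk - xb                        # gradient of quadratic loss
            wk -= (mu / E) * grad
        g_pk = rng.laplace(0, sigma_g / np.sqrt(2), M)   # client noise
        models.append(wk + g_pk)
    psi[p] = np.mean(models, axis=0)              # aggregation, step (7)

w_new = np.zeros((P, M))
for p in range(P):                                # combination, step (8)
    for m in range(P):
        g_pm = rng.laplace(0, sigma_g / np.sqrt(2), M)   # inter-server noise
        w_new[p] += A[p, m] * (psi[m] + g_pm)
print(w_new.shape)
```

Here the inter-server noise is drawn independently; Sect. 3.3 replaces it with the correlated graph-homomorphic construction.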

The client update step (6) follows from (3) by combining the multiple epochs for \(e=1,2,\ldots , E_{p,k}\) into one update step, with \({\varvec{w}}_{p,k,i} = {\varvec{w}}_{p,k,E_{p,k}}\) and \({\varvec{w}}_{p,k,0} = {\varvec{w}}_{p,i-1}\), namely:

$$\begin{aligned} {\varvec{w}}_{p,k,E_{p,k}}&= {\varvec{w}}_{p,k,E_{p,k}-1} - \frac{\mu }{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,k,E_{p,k}-1};{\varvec{x}}_{p,k,b}) \\&= {\varvec{w}}_{p,k,E_{p,k}-2} -\frac{\mu }{E_{p,k}} \sum _{e=E_{p,k}-1}^{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,k,e-1};{\varvec{x}}_{p,k,b}) \\&= {\varvec{w}}_{p,k,0} -\frac{\mu }{E_{p,k}}\sum _{e=1}^{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,k,e-1};{\varvec{x}}_{p,k,b}). \end{aligned}$$
(9)

3 Performance analysis

In this section, we show a list of results on the performance of the algorithm. We study the convergence of the privatized algorithm (6)–(8), and examine the effect of privatization on performance.

3.1 Modeling conditions

To go forward with our analysis, we require certain reasonable assumptions on the graph structure and cost functions.

Assumption 1

(Combination matrix) The combination matrix A describing the graph is symmetric and doubly-stochastic, i.e.:

$$\begin{aligned} a_{pm} = a_{mp}, \quad \sum _{m=1}^P a_{mp} = 1. \end{aligned}$$
(10)

Furthermore, the graph is strongly-connected and A satisfies:

$$\begin{aligned} \iota _2 \,\overset{\Delta }{=}\,\rho \left( A -\frac{1}{P}\mathbbm {1}\mathbbm {1}^\textsf{T}\right) < 1. \end{aligned}$$
(11)

\(\square\)
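As a quick numerical illustration of Assumption 1, the mixing rate \(\iota _2\) in (11) can be computed directly for a small doubly stochastic matrix; the matrix below is our own example, not one from the paper.

```python
import numpy as np

# Checking Assumption 1 numerically: for a symmetric, doubly stochastic
# A describing a strongly connected graph, the mixing rate
#   iota_2 = rho(A - (1/P) * ones @ ones.T)
# is strictly below 1.
A = np.array([[0.5, 0.25, 0.25],
              [0.25, 0.5, 0.25],
              [0.25, 0.25, 0.5]])
P = A.shape[0]
assert np.allclose(A, A.T) and np.allclose(A.sum(axis=0), 1.0)

iota2 = max(abs(np.linalg.eigvals(A - np.ones((P, P)) / P)))
print(f"iota_2 = {iota2:.3f}")  # 0.250 for this A
```

For this A, the eigenvalues are \(\{1, 0.25, 0.25\}\); removing the Perron eigenvalue leaves \(\iota _2 = 0.25 < 1\).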

Assumption 2

(Convexity and smoothness) The empirical risks \(J_{p,k}(\cdot )\) are \(\nu\)-strongly convex for some \(\nu > 0\), and the loss functions \(Q_{p,k}(\cdot ;\cdot )\) are convex, namely:

$$\begin{aligned} J_{p,k}(w_2)&\ge J_{p,k}(w_1) + \nabla _{w^{\textsf{T}}} J_{p,k}(w_1)(w_2-w_1) + \frac{\nu }{2}\Vert w_2 - w_1 \Vert ^2, \end{aligned}$$
(12)
$$\begin{aligned} Q_{p,k}(w_2;\cdot )&\ge Q_{p,k}(w_1;\cdot ) + \nabla _{w^{\textsf{T}}} Q_{p,k}(w_1;\cdot ) (w_2 - w_1). \end{aligned}$$
(13)

Furthermore, the loss functions have \(\delta\)-Lipschitz continuous gradients, meaning there exists \(\delta >0\) such that for any data point \(x_{p,k,n}\):

$$\begin{aligned} \Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w_2;x_{p,k,n}) - \nabla _{w^{\textsf{T}}} Q_{p,k}(w_1;x_{p,k,n})\Vert \le \delta \Vert w_2 - w_1\Vert . \end{aligned}$$
(14)

\(\square\)

We also require a bound on the difference between the global optimal model \(w^o\) and the local optimal models \(w^o_{p,k}\) that minimize \(J_{p,k}(\cdot )\). This assumption is used to bound the gradient noise and the incremental noise defined further ahead. It is not a restrictive assumption; rather, it specifies when collaboration among different agents is meaningful. In other words, since the agents have non-iid data, their optimal models may sometimes be too different, in which case collaboration would hurt their individual performance. For example, in a recommender system, people in the same country are more likely to be recommended the same movie than people in different countries. Thus, users within one country may have different but relatively close optimal models, in contrast to users across different countries.

Assumption 3

(Model drifts) The distance of each local model \(w_{p,k}^o\) to the global model \(w^o\) is uniformly bounded, i.e., there exists \(\xi \ge 0\) such that \(\Vert w^o - w_{p,k}^o\Vert \le \xi\). \(\square\)

3.2 Network centroid convergence

We study the convergence of the algorithm from the network centroid’s \({\varvec{w}}_{c,i}\) perspective:

$$\begin{aligned} {\varvec{w}}_{c,i} \,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P {\varvec{w}}_{p,i}. \end{aligned}$$
(15)

We write the central recursion as:

$$\begin{aligned} {\varvec{w}}_{c,i}&= {\varvec{w}}_{c,i-1} - \mu \frac{1}{PL}\sum _{p=1}^P \sum _{k\in {\mathcal {L}}_{p,i}} \frac{1}{E_{p,k}} \sum _{e=1}^{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k} ({\varvec{w}}_{p,k,e-1};{\varvec{x}}_{p,k,b}) \\&\quad + \frac{1}{PL} \sum _{p=1}^P \sum _{k\in {\mathcal {L}}_{p,i}} {\varvec{g}}_{p,k,i}+ \frac{1}{P}\sum _{p,m = 1}^P a_{pm} {\varvec{g}}_{pm,i}. \end{aligned}$$
(16)

Next, we define the model error as \({\widetilde{{\varvec{w}}}}_{c,i} \,\overset{\Delta }{=}\,w^o - {\varvec{w}}_{c,i}\) and the average gradient noise:

$$\begin{aligned} {\varvec{s}}_i \,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P {\varvec{s}}_{p,i}, \end{aligned}$$
(17)

with the per-unit gradient noise \({\varvec{s}}_{p,i}\):

$$\begin{aligned} {\varvec{s}}_{p,i} \,\overset{\Delta }{=}\,\widehat{\nabla _{w^{\textsf{T}}} J_{p}}({\varvec{w}}_{p,i-1}) - \nabla _{w^{\textsf{T}}} J_p({\varvec{w}}_{p,i-1}), \end{aligned}$$
(18)

and

$$\begin{aligned} \widehat{\nabla _{w^{\textsf{T}}} J_p}(\cdot )&\,\overset{\Delta }{=}\,\frac{1}{L} \sum _{k\in {\mathcal {L}}_{p,i}} \frac{1}{E_{p,k}} \sum _{e=1}^{E_{p,k}} \nabla _{w^{\textsf{T}}} Q_{p,k}(\cdot ;{\varvec{x}}_{p,k,b}). \end{aligned}$$
(19)

We introduce the average incremental noise \({\varvec{q}}_i\) and the local incremental noise \({\varvec{q}}_{p,i}\), which capture the error introduced by the multiple local update steps:

$$\begin{aligned} {\varvec{q}}_i&\,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P {\varvec{q}}_{p,i}, \end{aligned}$$
(20)
$$\begin{aligned} {\varvec{q}}_{p,i}&\,\overset{\Delta }{=}\,\frac{1}{L} \sum _{k \in {\mathcal {L}}_{p,i}} \frac{1}{E_{p,k}} \sum _{e=1}^{E_{p,k}} \Big ( \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,k,e-1}; {\varvec{x}}_{p,k,b}) - \nabla _{w^{\textsf{T}}} Q_{p,k}({\varvec{w}}_{p,i-1}; {\varvec{x}}_{p,k,b})\Big ). \end{aligned}$$
(21)

We then arrive at the following error recursion:

$$\begin{aligned} {\widetilde{{\varvec{w}}}}_{c,i} = {\widetilde{{\varvec{w}}}}_{c,i-1} + \mu \frac{1}{P}\sum _{p=1}^P \nabla _{w^{\textsf{T}}} J_p({\varvec{w}}_{p,i-1}) + \mu {\varvec{s}}_i + \mu {\varvec{q}}_i- {\varvec{g}}_{i}, \end{aligned}$$
(22)

where \({\varvec{g}}_{i}\) is the total added noise at iteration i:

$$\begin{aligned} {\varvec{g}}_{i} \,\overset{\Delta }{=}\,\frac{1}{PL} \sum _{p=1}^P \sum _{k \in {\mathcal {L}}_{p,i}} {\varvec{g}}_{p,k,i} + \frac{1}{P}\sum _{p,m=1}^P a_{pm}{\varvec{g}}_{pm,i} \end{aligned}$$
(23)

We estimate the first and second-order moments of the gradient noise in the following lemma. To do so, we use the fact, shown in previous work (Lemma 1 in [28]), that the individual gradient noise is zero-mean with a bounded second-order moment:

$$\begin{aligned} {\mathbb {E}}\left\{ \Vert {\varvec{s}}_{p,i} \Vert ^2 | {\mathcal {F}}_{i-1}\right\} \le \beta _{s,p}^2 \Vert {\widetilde{{\varvec{w}}}}_{p,i-1}\Vert ^2 + \sigma _{s,p}^2, \end{aligned}$$
(24)

where the constants are defined as:

$$\begin{aligned} \beta _{s,p}^2&\,\overset{\Delta }{=}\,\frac{6\delta ^2}{L} \left( 1 + \frac{1}{K}\sum _{k=1}^K \frac{1}{E_{p,k}}\right) , \end{aligned}$$
(25)
$$\begin{aligned} \sigma _{s,p}^2&\,\overset{\Delta }{=}\,\frac{1}{LK}\sum _{k=1}^K \left( \frac{12}{E_{p,k}} + 3\right) \frac{1}{N_{p,k}}\sum _{n=1}^{N_{p,k}} \Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o;x_{p,k,n})\Vert ^2, \end{aligned}$$
(26)

and \({\mathcal {F}}_{i-1}\) is the filtration generated by all past data subsampling used to compute the stochastic gradients. Using Assumption 3 and the Lipschitz condition (14), we can guarantee that \(\sigma _{s,p}^2\) is finite, since each term can be bounded as:

$$\begin{aligned} \Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o; x_{p,k,n})\Vert ^2 \le 2\Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o_{p,k};x_{p,k,n})\Vert ^2 + 2\delta ^2 \xi ^2. \end{aligned}$$
(27)
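For completeness, the bound (27) follows from the inequality \(\Vert a+b\Vert ^2 \le 2\Vert a\Vert ^2 + 2\Vert b\Vert ^2\), the Lipschitz condition (14), and Assumption 3:

```latex
\begin{aligned}
\|\nabla_{w^{\sf T}} Q_{p,k}(w^o; x_{p,k,n})\|^2
&\le 2\|\nabla_{w^{\sf T}} Q_{p,k}(w^o_{p,k}; x_{p,k,n})\|^2
   + 2\|\nabla_{w^{\sf T}} Q_{p,k}(w^o; x_{p,k,n})
   - \nabla_{w^{\sf T}} Q_{p,k}(w^o_{p,k}; x_{p,k,n})\|^2 \\
&\le 2\|\nabla_{w^{\sf T}} Q_{p,k}(w^o_{p,k}; x_{p,k,n})\|^2
   + 2\delta^2 \|w^o - w^o_{p,k}\|^2
 \le 2\|\nabla_{w^{\sf T}} Q_{p,k}(w^o_{p,k}; x_{p,k,n})\|^2 + 2\delta^2 \xi^2.
\end{aligned}
```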

Lemma 1

(Estimation of first and second-order moments of the gradient noise) The gradient noise defined in (17) is zero-mean and has a bounded second-order moment:

$$\begin{aligned} {\mathbb {E}}\left\{ \Vert {\varvec{s}}_i \Vert ^2 | {\mathcal {F}}_{i-1} \right\}&\le \beta _s^2 \Vert {\widetilde{{\varvec{w}}}}_{c,i-1}\Vert ^2 + \sigma _s^2 + \frac{2}{P}\sum _{p=1}^P \beta _{s,p}^2 \Vert {\varvec{w}}_{p,i-1} - {\varvec{w}}_{c,i-1}\Vert ^2 \end{aligned}$$
(28)

where the constants \(\beta _s^2\) and \(\sigma _s^2\) are given by:

$$\begin{aligned} \beta _s^2&\,\overset{\Delta }{=}\,\frac{2}{P}\sum _{p=1}^P \beta _{s,p}^2, \quad \sigma _s^2 \,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P\sigma _{s,p}^2. \end{aligned}$$
(29)

Proof

The above result follows from applying Jensen's inequality and the bounds on the per-unit gradient noise \({\varvec{s}}_{p,i}\). \(\square\)

The new term appearing in the bound on the gradient noise is what we call the network disagreement:

$$\begin{aligned} \frac{1}{P} \sum _{p=1}^P \Vert {\varvec{w}}_{p,i} - {\varvec{w}}_{c,i}\Vert ^2. \end{aligned}$$
(30)

It captures the difference in the path taken by the individual models versus the network centroid. We bound this difference in Lemma 3. However, before doing so, we show that the second-order moment of the incremental noise is on the order of \(O(\mu )\). From Lemma 5 in [28], we can bound the individual incremental noise:

$$\begin{aligned} {\mathbb {E}} \Vert {\varvec{q}}_{p,i}\Vert ^2 \le&a \mu ^2 {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{p,i-1}\Vert ^2 + a \mu ^2\xi ^2 + \frac{1}{K}\sum _{k=1}^K (b_k\mu ^4 + c_k \mu ^2)\sigma _{q,p,k}^2, \end{aligned}$$
(31)

where the constants are given by:

$$\begin{aligned} a&\,\overset{\Delta }{=}\,\frac{4\delta ^2}{K}\sum _{k=1}^K \frac{(E_{p,k}+1)(1-\lambda )-1+\lambda ^{E_{p,k}+1}}{E_{p,k}^2(1-\lambda )^2}, \end{aligned}$$
(32)
$$\begin{aligned} b_k&\,\overset{\Delta }{=}\,\frac{2E_{p,k}(E_{p,k}+1)(1-\lambda )^2 - 4E_{p,k}(1-\lambda ) + 4\lambda }{E_{p,k}^2 (1-\lambda )^3 } -\frac{ 2\lambda ^{E_{p,k}+1}}{E_{p,k}^2 (1-\lambda )^3},\end{aligned}$$
(33)
$$\begin{aligned} c_k&\,\overset{\Delta }{=}\,\frac{E_{p,k}-1}{3E_{p,k}}, \end{aligned}$$
(34)
$$\begin{aligned} \lambda&\,\overset{\Delta }{=}\,1-2\nu \mu + 4\delta ^2\mu ^2, \end{aligned}$$
(35)
$$\begin{aligned} \sigma ^2_{q,p,k}&\,\overset{\Delta }{=}\,3\sum _{n=1}^{N_{p,k}} \Vert \nabla _{w^{\textsf{T}}} Q_{p,k}(w^o_{p,k};x_{p,k,n})\Vert ^2. \end{aligned}$$
(36)

The following result follows.

Lemma 2

(Estimation of second-order moment of the incremental noise) The incremental noise defined in (20) has a bounded second-order moment:

$$\begin{aligned} {\mathbb {E}} \Vert {\varvec{q}}_i\Vert ^2&\le O(\mu ) {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i-1}\Vert ^2 + O(\mu )\xi ^2 + O(\mu ^2 )\sigma _{q}^2 \\&\quad + \frac{O(\mu )}{P}\sum _{p=1}^P {\mathbb {E}}\Vert {\varvec{w}}_{p,i-1} - {\varvec{w}}_{c,i-1}\Vert ^2, \end{aligned}$$
(37)

where the constant \(\sigma _q^2\) is the average of \(\sigma _{q,p,k}^2\):

$$\begin{aligned} \sigma _{q}^2 \,\overset{\Delta }{=}\,\frac{1}{PK}\sum _{p=1}^P\sum _{k=1}^K (b_k\mu ^4 + c_k \mu ^2)\sigma _{q,p,k}^2. \end{aligned}$$
(38)

Proof

The above result follows from applying Jensen's inequality and the bounds on the per-unit incremental noise \({\varvec{q}}_{p,i}\). Furthermore, since \(a = O(\mu ^{-1}), b_k = O(\mu ^{-1}),\) and \(c_k = O(1)\), the expression reduces to (37). \(\square\)

We now bound the network disagreement. To do so, we first introduce the eigendecomposition of \(A = QH Q^\textsf{T}\):

$$\begin{aligned} Q \,\overset{\Delta }{=}\,\begin{bmatrix} \frac{1}{\sqrt{P}}\mathbbm {1}&Q_{\theta } \end{bmatrix}, \quad H \,\overset{\Delta }{=}\,\begin{bmatrix} 1 &{} 0 \\ 0 &{} H_{\theta } \end{bmatrix}, \end{aligned}$$
(39)

where \(H_{\theta }\) is a diagonal matrix that includes the last \((P-1)\) eigenvalues of A and \(Q_{\theta }\) their corresponding eigenvectors.

Lemma 3

(Network disagreement) The average deviation from the centroid is bounded during each iteration i:

$$\begin{aligned} \frac{1}{P}\sum _{p=1}^P {\mathbb {E}}\Vert {\varvec{w}}_{p,i} - {\varvec{w}}_{c,i}\Vert ^2&\le \frac{ \iota _2^i }{P} {\mathbb {E}} \Vert (Q_{\theta } \otimes I){\varvec{{ {\mathcal {W}}}}}_0\Vert ^2 + \frac{\iota _2^2 }{P}\sum _{j'=0}^{i-1}\iota _2^{j'}\sum _{p=1}^P \Bigg \{ \mu ^2\bigg (\frac{2\delta ^2}{\iota _2(1-\iota _2) } \\&\quad +\beta _{s,p}^2 + O(\mu ) \bigg ) \bigg ( \lambda _p^{j'} A^{j'}[p] \text{ col }\left\{ {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert ^2\right\} _{p=1}^P + \sum _{j=0}^{j'-1} \lambda _p^j \\&\quad \times A^j[p]\text{ col }\left\{ \mu ^2 \sigma _{s,p}^2 + O(\mu ^2)\xi ^2 + O(\mu ^3)\sigma _{q,p}^2 + \sigma _{g,p}^2\right\} _{p=1}^P \bigg ) \\&\quad + \mu ^2\frac{2\Vert \nabla _{w^{\textsf{T}}} J_p(w^o)\Vert ^2}{\iota _2(1-\iota _2)}+ \mu ^2\sigma _{s,p}^2 + O(\mu ^3)\xi ^2 + O(\mu ^4)\sigma _{q,p}^2 \\&\quad + \frac{1}{\iota _2^2} \sigma _{g,p}^2\Bigg \}, \end{aligned}$$
(40)

where \({\varvec{{ {\mathcal {W}}}}}_{0} \,\overset{\Delta }{=}\,\text{ col }\left\{ {\varvec{w}}_{p,0}\right\} _{p=1}^P\) and \(\lambda _p \,\overset{\Delta }{=}\,\sqrt{1-2\nu \mu + \delta ^2\mu ^2} + \beta _{s,p}^2 \mu ^2 + O(\mu ^2) \in (0,1)\). Then, in the limit:

$$\begin{aligned} \limsup _{i\rightarrow \infty } \frac{1}{P}\sum _{p=1}^P {\mathbb {E}} \Vert {\varvec{w}}_{p,i} -{\varvec{w}}_{c,i}\Vert ^2 \le&\frac{\iota _2^2}{P(1-\iota _2)} \sum _{p=1}^P \left( \mu ^2 \sigma _{s,p}^2 + \frac{1}{\iota _2^2}\sigma _{g,p}^2\right) + O(\mu )\sigma _{g,p}^2 + O(\mu ^3). \end{aligned}$$
(41)

Proof

See “Appendix 2”. \(\square\)

Thus, from the above lemma, we see that the individual models gravitate toward the centroid model, with an error introduced by the added privatization. The effect of the added noise overpowers that of the gradient and incremental noise, since the latter are on the order of the step-size.

Then, using the above result, we can establish the convergence of the centroid model to a neighbourhood of the true optimal model \(w^o\) in the mean-square-error (MSE) sense.

Theorem 1

(Centroid MSE convergence) Under Assumptions 1, 2 and 3, the network centroid converges to the optimal point \(w^o\) exponentially fast for a sufficiently small step-size \(\mu\):

$$\begin{aligned} {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i}\Vert ^2&\le \lambda _c {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i-1} \Vert ^2 +\mu ^2 \sigma _s^2 + O(\mu ^{2})\xi ^2 + O(\mu ^{3})\sigma _q^2 + {\mathbb {E}} \Vert {\varvec{g}}_{i}\Vert ^2 \\&\quad + \frac{O(\mu )}{P}\sum _{p=1}^P {\mathbb {E}}\Vert {\varvec{w}}_{p,i-1}-{\varvec{w}}_{c,i-1}\Vert ^2, \end{aligned}$$
(42)

where \(\lambda _c = \sqrt{1-2\nu \mu + \delta ^2\mu ^2} +\beta _s^2\mu ^2 + O(\mu ^{2}) \in (0,1)\). Then, letting i tend to infinity, we get:

$$\begin{aligned} \limsup _{i\rightarrow \infty } {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i}\Vert ^2&\le \frac{\mu ^2 \sigma _s^2 + O(\mu ^{2})\xi ^2 + O(\mu ^{3})\sigma _q^2 + {\mathbb {E}}\Vert {\varvec{g}}\Vert ^2}{1-\lambda _c} + \sum _{p=1}^PO(1 ) \sigma _{g,p}^2+ O(\mu ). \end{aligned}$$
(43)

Proof

See “Appendix 3”. \(\square\)

The dominant term in the above bound is the variance of the added noise, amplified by a factor of \(\mu ^{-1}\), since:

$$\begin{aligned} 1- \lambda _c&= 1- \sqrt{1-O(\mu ) + O(\mu ^2)} - O(\mu ^2)= O(\mu )- O(\mu ^2) = O(\mu ) \end{aligned}$$
(44)

which allows us to rewrite the bound as follows:

$$\begin{aligned} \limsup _{i \rightarrow \infty } {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{c,i}\Vert ^2&\le O(\mu )\sigma _s^2 + O(\mu )\xi ^2 + O(\mu ^2)\sigma _q^2 + O(\mu ^{-1}){\mathbb {E}}\Vert {\varvec{g}}\Vert ^2 \\&\quad + \sum _{p=1}^P O(1)\sigma _{g,p}^2 + O(\mu ), \end{aligned}$$
(45)

with \({\mathbb {E}}\Vert {\varvec{g}}\Vert ^2\) representing the variance of the total added noise, which is independent of time. While decreasing the step-size generally improves performance, the above result shows that this need not be the case with privatization. Thus, since the added noise degrades model utility, it is important to choose a privatization scheme that reduces this effect. In what follows, we look closely at such a scheme.

3.3 Graph-homomorphic perturbations

We consider a specific privatization scheme and specialize the above results. The goal of the scheme is to remove the \(O(\mu ^{-1})\) term from the MSE bounds. Thus, focusing on the centroid model expression (16), we wish to cancel out the total added noise amongst servers, i.e.,

$$\begin{aligned} \sum _{p,m=1}^P a_{pm}{\varvec{g}}_{pm,i} = 0. \end{aligned}$$
(46)

To achieve this, we introduce graph-homomorphic perturbations defined as follows [6]. We assume each server p draws a sample \({\varvec{g}}_{p,i}\) independently from the Laplace distribution \(Lap(0,\sigma _g/\sqrt{2})\) with variance \(\sigma _{g}^2\). Server p then sets the noise \({\varvec{g}}_{mp,i}\) added to the message sent to its neighbour m as:

$$\begin{aligned} {\varvec{g}}_{mp,i} = {\left\{ \begin{array}{ll} {\varvec{g}}_{p,i}, &{} m \ne p, \\ - \frac{1-a_{pp}}{a_{pp}} {\varvec{g}}_{p,i}, &{} m = p. \end{array}\right. } \end{aligned}$$
(47)

With such a construction, condition (46) is satisfied:

$$\begin{aligned} \sum _{p,m=1}^P a_{pm}{\varvec{g}}_{pm,i}&= \sum _{p \ne m} a_{pm} {\varvec{g}}_{p,i} - \sum _{p=1}^P a_{pp} \left( \frac{1-a_{pp}}{a_{pp}} \right) {\varvec{g}}_{p,i} \\&= \sum _{p=1}^P (1-a_{pp}) {\varvec{g}}_{p,i} - (1-a_{pp}) {\varvec{g}}_{p,i} = 0. \end{aligned}$$
(48)
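The cancellation (48) is easy to verify numerically. In the sketch below (our own check, with an arbitrary symmetric doubly stochastic A), \({\varvec{g}}_{pm,i}\) is the noise that server m adds to its message to server p, constructed as in (47).

```python
import numpy as np

# Numerical check of (48): with graph-homomorphic perturbations (47),
# the weighted inter-server noises cancel exactly:
#   sum_{p,m} a_pm * g_pm = 0.
rng = np.random.default_rng(2)
A = np.array([[0.5, 0.25, 0.25],
              [0.25, 0.5, 0.25],
              [0.25, 0.25, 0.5]])   # symmetric, doubly stochastic
P, M, sigma_g = A.shape[0], 4, 1.0

# Each server p draws one sample g_{p,i} ~ Lap(0, sigma_g / sqrt(2)).
g = rng.laplace(0, sigma_g / np.sqrt(2), (P, M))

total = np.zeros(M)
for p in range(P):            # receiving server
    for m in range(P):        # sending server, which generated g[m]
        if p != m:
            g_pm = g[m]                               # first case of (47)
        else:
            g_pm = -(1 - A[m, m]) / A[m, m] * g[m]    # self-noise case
        total += A[p, m] * g_pm
print(np.allclose(total, 0))  # True
```

The cancellation relies on the columns of A summing to one: each server's noise appears with total weight \((1-a_{pp})\) off-diagonal and \(-(1-a_{pp})\) on the diagonal.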

Thus, with such a scheme, the noise components on the order of \(O(\mu ^{-1})\) that result from the noise added between the servers cancel out in the error recursions. However, since gradients are evaluated at the local models \({\varvec{w}}_{p,i}\) and not at the centroid \({\varvec{w}}_{c,i}\), the effect of the noise is still present. This remaining error is controlled by the step-size, so its effect can be mitigated by choosing a smaller step-size. In the next corollary, we show that if no noise is added by the clients and graph-homomorphic perturbations are used amongst the servers, then the error converges to \(O(1)\sigma _g^2\).

Corollary 1

(Centroid MSE convergence under graph-homomorphic perturbations) Under Assumptions 1, 2 and 3, the network centroid with graph-homomorphic perturbations converges to the optimal point \(w^o\) exponentially fast for a sufficiently small step-size \(\mu\):

$$\begin{aligned} {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i}\Vert ^2&\le \lambda _c {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i-1} \Vert ^2 +\mu ^2 \sigma _s^2 + O(\mu ^{2})\xi ^2 + O(\mu ^{3})\sigma _q^2 \\&\quad + \frac{O(\mu )}{P}\sum _{p=1}^P {\mathbb {E}}\Vert {\varvec{w}}_{p,i-1}-{\varvec{w}}_{c,i-1}\Vert ^2. \end{aligned}$$
(49)

Then, letting i tend to infinity, we get:

$$\begin{aligned} \limsup _{i\rightarrow \infty } {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{c,i}\Vert ^2&\le \frac{\mu ^2 \sigma _s^2 +O(\mu ^{2})\xi ^2 + O(\mu ^{3})\sigma _q^2 }{1-\lambda _c} + \sum _{p=1}^PO(1 ) \sigma _{g,p}^2 + O(\mu ). \end{aligned}$$
(50)

Proof

Starting from (43), and setting \({\mathbb {E}} \Vert {\varvec{g}}\Vert ^2 = 0\) since the total noise \({\varvec{g}}_{i} = 0\) under graph-homomorphic perturbations with no client noise, we obtain the final result. \(\square\)

3.4 Sharing gradients as opposed to weight estimates

We next show that sharing gradients, as opposed to models, yields better performance under added noise. In the remainder of this section, and for the sake of simplicity, we illustrate this conclusion by considering one federated unit, say \(p=1\). If we were to introduce differential privacy to federated learning directly, then random Laplacian noise would be added to each model by the client before aggregation by the server, and the new privatized aggregation step would become:

$$\begin{aligned} {\varvec{w}}_{1,i} = \frac{1}{L}\sum _{k \in {\mathcal {L}}_{1,i}} \left( {\varvec{w}}_{1,k,i} + {\varvec{g}}_{1,k,i} \right) . \end{aligned}$$
(51)

However, if we were to study the MSE convergence of this privatized algorithm, we would notice a new \(O(\mu ^{-1})\sigma _g^2\) term in the bound (Theorem 1). To address this degradation, we now describe an alternative implementation that shares gradients as opposed to weight estimates. Note first that the FL algorithm can be expressed in a single step taken from the server’s perspective:

$$\begin{aligned} {\varvec{w}}_{1,i} = {\varvec{w}}_{1,i-1} - \mu \frac{1}{L}\sum _{k \in {\mathcal {L}}_{1,i}} \frac{1}{E_{1,k}}\sum _{e=1}^{E_{1,k}} \widehat{\nabla _{w^{\textsf{T}}} J_{1,k}}({\varvec{w}}_{1,k,e-1}). \end{aligned}$$
(52)

This suggests that instead of every agent sharing its final model \({\varvec{w}}_{1,k,i}\), they could share the total update:

$$\begin{aligned} \frac{1}{E_{1,k}}\sum _{e=1}^{E_{1,k}} \widehat{\nabla _{w^{\textsf{T}}} J_{1,k}}({\varvec{w}}_{1,k,e-1}). \end{aligned}$$
(53)

The server then aggregates the updates from all participating agents and updates the previous model \({\varvec{w}}_{1,i-1}\). In this case, if we were to privatize this new version of the algorithm, we would add random noise to the updates, and this noise is then scaled by the step-size:

$$\begin{aligned} {\varvec{\psi }}_{1,k,i-1}&= \frac{1}{E_{1,k}} \sum _{e=1}^{E_{1,k}} \widehat{\nabla _{w^{\textsf{T}}} J_{1,k}}({\varvec{w}}_{1,k,e-1}), \end{aligned}$$
(54)
$$\begin{aligned} {\varvec{w}}_{1,i}&= {\varvec{w}}_{1,i-1} - \mu \frac{1}{L}\sum _{k\in {\mathcal {L}}_{1,i}} \left( {\varvec{\psi }}_{1,k,i-1} + {\varvec{g}}_{1,k,i} \right) . \end{aligned}$$
(55)
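The contrast between the two privatization points can be sketched as follows. The local updates `psi` are placeholders standing in for the averaged mini-batch gradients of (54), and the one-step local model is a simplification of the actual multi-epoch training; the point is only that the same noise realization enters the aggregate scaled by \(\mu\) in one case and unscaled in the other.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, L, sigma_g, M = 0.1, 30, 0.5, 2
w_prev = np.zeros(M)

# Placeholder local updates psi_k (stand-ins for (54)); any values would do.
psi = rng.normal(size=(L, M))
# Zero-mean Laplacian noise with variance sigma_g^2 (scale = sigma_g / sqrt(2)).
noise = rng.laplace(scale=sigma_g / np.sqrt(2), size=(L, M))

# (55): the server aggregates noisy *updates*; the noise enters scaled by mu.
w_updates = w_prev - mu * np.mean(psi + noise, axis=0)

# (51)-style: agents share noisy *models*; the noise enters unscaled.
local_models = w_prev - mu * psi            # simplified one-step local models
w_models = np.mean(local_models + noise, axis=0)

# Distance of each aggregate from the noise-free aggregate:
clean = w_prev - mu * psi.mean(axis=0)
err_updates = np.linalg.norm(w_updates - clean)
err_models = np.linalg.norm(w_models - clean)
print(err_updates / err_models)   # ≈ mu
```

The ratio of the two errors equals \(\mu\), matching the \(O(\mu)\) versus \(O(\mu^{-1})\) gap established below.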

We show in the following theorem the effect of the added noise on the new FL algorithm. It turns out that the noise introduces an \(O(\mu )\) error instead of \(O(\mu ^{-1})\).

Theorem 2

(MSE convergence of privatized FL) Under Assumptions 2 and 3, the privatized FL algorithm (54)–(55) converges exponentially fast, for a small enough step-size, to a neighbourhood of the optimal model:

$$\begin{aligned} {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{1,i}\Vert ^2&\le \lambda {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{1,i-1}\Vert ^2 +O( \mu ^2) \sigma _{s,1}^2 + O(\mu ^2)\xi ^2 + \frac{\mu ^2}{L}\sigma _{g,1}^2 + O(\mu ^3). \end{aligned}$$
(56)

where \(\lambda = \sqrt{1-2\nu \mu + (\beta _{s,1}^2+\delta ^2)\mu ^2} + O(\mu ^2) \in (0,1)\). Then, in the limit:

$$\begin{aligned} \limsup _{i \rightarrow \infty } {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{1,i}\Vert ^2 \le O(\mu ) (\sigma _{s,1}^2 + \xi ^2 + \sigma _{g,1}^2) + O(\mu ^2). \end{aligned}$$
(57)

Proof

See “Appendix 6”. \(\square\)

Thus, sharing the updates instead of the models is advantageous, since the effect of the added noise on the performance is reduced. The \(O(\mu )\) factor allows us to increase the noise variance while ensuring the model utility does not deteriorate significantly. Therefore, to guarantee an \(\epsilon (i)\)-DP algorithm, we let the added noise be a zero-mean Laplacian random variable with variance \(\sigma _{g}^2\).

4 Privacy analysis

We study the privacy of the algorithm (6)–(8) in terms of differential privacy. We focus on graph-homomorphic perturbations and show that the adopted scheme is differentially private. To do so, we first define what it means for an algorithm to be \(\epsilon\)-differentially private. Without loss of generality, assume agent 1 in federated unit 1 decides not to participate, and its data samples \(x_{1,1}\) are replaced by a new set \(x'_{1,1}\) with a different distribution. Then, with the new data, the algorithm takes a different path; we denote the new models by \({\varvec{w}}'_{p,k,i}\). The idea behind differential privacy is that an outside observer should not be able to distinguish between the two trajectories \({\varvec{w}}_{p,k,i}\) and \({\varvec{w}}'_{p,k,i}\) and conclude whether agent 1 participated in the training. More formally, differential privacy is defined below.

Definition 1

(\(\epsilon (i)\)-Differential Privacy) We say that the algorithm given in (6)–(8) is \(\epsilon (i)\)-differentially private for server p at time i if the following condition holds on the joint distribution \(f(\cdot )\):

$$\begin{aligned} \frac{f\left( \left\{ \left\{ {\varvec{\psi }}_{p,j} + {\varvec{g}}_{pm,j} \right\} _{m\in {\mathcal {N}}_p \setminus \{p\} } \right\} _{j=0}^i \right) }{f\left( \left\{ \left\{ {\varvec{\psi }}'_{p,j} + {\varvec{g}}_{pm,j} \right\} _{m\in {\mathcal {N}}_p \setminus \{p\} }\right\} _{j=0}^i \right) } \le e^{\epsilon (i)}. \end{aligned}$$
(58)

\(\square\)

Thus, the above definition states that minimally varied trajectories have comparable probabilities. In addition, the smaller the value of \(\epsilon\), the higher the privacy guarantee. The goal is therefore to decrease \(\epsilon\) as long as the model utility is not strongly affected.
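Definition 1 is an instance of the classical Laplace-mechanism notion of differential privacy: for a scalar Laplace mechanism with scale \(b\) and sensitivity \(\Delta\), the ratio of output densities under two neighbouring inputs is bounded by \(e^{\Delta /b}\). A quick numerical sanity check of this standard fact, with illustrative values of our own choosing:

```python
import numpy as np

def laplace_pdf(x, mean, b):
    """Density of a Laplace(mean, scale b) random variable."""
    return np.exp(-np.abs(x - mean) / b) / (2 * b)

Delta, b = 1.0, 2.0   # illustrative sensitivity and noise scale
eps = Delta / b       # resulting privacy level of the mechanism

# Ratio of densities of the two "trajectories" (inputs differing by Delta):
xs = np.linspace(-10.0, 10.0, 1001)
ratio = laplace_pdf(xs, 0.0, b) / laplace_pdf(xs, Delta, b)
print(ratio.max() <= np.exp(eps) + 1e-12)   # True: the ratio never exceeds e^eps
```

In (58) the same bound is applied to the joint density of all messages exchanged up to time i, which is why the privacy level \(\epsilon (i)\) accumulates over iterations.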

Next, in order to show that the algorithm is differentially private, we require the sensitivity of the algorithm to be bounded. The sensitivity at time i is defined as:

$$\begin{aligned} \Delta (i)&= \Vert {\varvec{{ {\mathcal {W}}}}}_{i} - {\varvec{{ {\mathcal {W}}}}}'_{i}\Vert . \end{aligned}$$
(59)

It measures the distance between the weight iterates along the original and the alternative trajectories. It is shown in “Appendix 4” that \(\Delta (i)\) can be bounded as follows:

$$\begin{aligned} \Delta (i) \le B+B' + \sqrt{P}\Vert w^o-w^{'o}\Vert , \end{aligned}$$
(60)

for constants B and \(B'\) chosen by the designer. Moreover, the above bound holds with high probability given by:

$$\begin{aligned} {\mathbb {P}}(\Delta (i) \le B + B' + \sqrt{P} \Vert w^o - w'^o\Vert )&\ge \left( 1- \frac{\lambda ^i_{\max } {\mathbb {E}}\Vert {\varvec{{ {\mathcal {W}}}}}_0\Vert ^2 + O(\mu ) + O(\mu ^{-1})}{B^2} \right) \\&\quad \times \left( 1- \frac{\lambda '^i_{\max } {\mathbb {E}}\Vert {\varvec{{ {\mathcal {W}}}}}'_0\Vert ^2 + O(\mu ) + O(\mu ^{-1})}{B'^2} \right) . \end{aligned}$$
(61)

This result shows that the sensitivity can be bounded with high probability, which in turn is dependent on the values chosen for B and \(B'\). Larger values for these constants increase the probability, but nevertheless lead to a looser bound for privacy (as shown in Theorem 3). Therefore, the choice of B and \(B'\) needs to be balanced judiciously to ensure the desired level of privacy.

Using the bound on the sensitivity and from the definition of differential privacy, we can finally show that the algorithm is differentially private with high probability.

Theorem 3

(Privacy of GFL algorithm) If the algorithm (6)–(8) adopts graph-homomorphic perturbations, then it is \(\epsilon (i)\)-differentially private with high probability at time i for a standard deviation of \(\sigma _g = \sqrt{2}(B+B'+\sqrt{P}\Vert w^o-w'^o\Vert )(i+1) / \epsilon (i)\).

Proof

See “Appendix 5”. \(\square\)

Thus, the above theorem suggests that if we wish the algorithm to be \(\epsilon (i)\)-differentially private, we need to choose the noise variance accordingly. The larger the variance, the more private the algorithm. However, the longer the algorithm runs, the larger the noise variance required to maintain the same privacy guarantee. Said differently, if we fix the added noise, then as time passes the algorithm becomes less private and more information is leaked. With graph-homomorphic perturbations, however, we can afford to increase the variance, since its effect on the MSE is constant, and thus decrease the leakage.
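A small helper, sketched from the expression in Theorem 3 together with the sensitivity bound (60), illustrates how the required noise grows with the iteration index i. The function name and argument choices are ours; `model_gap` stands for \(\Vert w^o - w'^o\Vert\).

```python
import numpy as np

def required_sigma_g(eps_i, i, B, B_prime, P, model_gap):
    """Noise standard deviation yielding eps(i)-DP at iteration i (Theorem 3).

    B and B_prime are the designer-chosen constants of the sensitivity
    bound (60); model_gap is the distance between the optimal models of
    the two neighbouring datasets.
    """
    sensitivity = B + B_prime + np.sqrt(P) * model_gap
    return np.sqrt(2) * sensitivity * (i + 1) / eps_i

# Keeping eps fixed, the required noise grows linearly with the runtime:
print(required_sigma_g(1.0, 10, 1.0, 1.0, 10, 0.1))
print(required_sigma_g(1.0, 100, 1.0, 1.0, 10, 0.1))   # ~10x larger
```

This makes the trade-off explicit: doubling the horizon roughly doubles the noise needed for the same \(\epsilon\), which is affordable here precisely because the graph-homomorphic noise does not enter the centroid MSE with an \(O(\mu^{-1})\) factor.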

Moreover, we study the effect of the model drift on the privacy of the algorithm. Examining the probability that the sensitivity is bounded, the model drift \(\xi\) appears in the \(O(\mu )\) term: the smaller the model drift, the higher the probability that the sensitivity is bounded, which in turn implies that the algorithm is differentially private with higher probability. Furthermore, studying the average \(\epsilon (i)\), we see that:

$$\begin{aligned} {\mathbb {E}}\, \epsilon (i)&= \frac{\sqrt{2}}{\sigma _g} \sum _{j=1}^i {\mathbb {E}} \Delta (j) \\&\le \frac{\sqrt{2}}{\sigma _g} \sum _{j=1}^i \left( {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,j}\Vert + {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}'_{p,j}\Vert + \Vert w^o - w'^o\Vert \right) \\&\le \frac{\sqrt{2}}{\sigma _g} \sum _{j=1}^i \lambda ^{j/2} {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert + \frac{1}{\sqrt{1-\lambda }} \left( O(\mu )(\sigma _{s,p}^2 + \xi ^2 + \sigma _g^2) + O(\mu ^{3/2}) \right) \\&\quad + \lambda '^{j/2} {\mathbb {E}}\Vert {\widetilde{{\varvec{w}}}}'_{p,0}\Vert + \frac{1}{\sqrt{1-\lambda '}} \left( O(\mu )(\sigma '^2_{s,p} + \xi '^2 + \sigma _g^2) + O(\mu ^{3/2}) \right) \\&\quad + \Vert w^o - w'^o\Vert \\&\le \frac{1-\lambda ^{i/2}}{1-\lambda ^{1/2}} {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}_{p,0}\Vert + \frac{1-\lambda '^{i/2}}{1-\lambda '^{1/2}} {\mathbb {E}} \Vert {\widetilde{{\varvec{w}}}}'_{p,0}\Vert + i\Vert w^o - w'^o\Vert \\&\quad + \frac{i}{\sqrt{1-\lambda }} \left( O(\mu ) (\sigma _{s,p}^2 + \xi ^2 + \sigma _g^2) + O(\mu ^{3/2}) \right) \\&\quad + \frac{i}{\sqrt{1-\lambda '}} \left( O(\mu ) (\sigma '^2_{s,p} + \xi '^2 + \sigma _g^2) + O(\mu ^{3/2}) \right) , \end{aligned}$$
(62)

and hence, as the model drift decreases, so does \(\epsilon (i)\) on average. Therefore, with a smaller model drift we can achieve higher privacy with more certainty.

5 Experimental analysis

We conduct a series of experiments to study the influence of privatization on the GFL algorithm. The aim of the experiments is to show the superior performance of graph-homomorphic perturbations over random perturbations, and of perturbing gradients over perturbing models, as well as to study the effect of different parameters on the performance of the algorithm.

5.1 Regression

We first study a regression problem on simulated data, chosen for its tractability. We consider the quadratic loss, which has a closed-form solution, i.e., a formal expression for the true model \(w^o\) is known; this makes the calculation of the mean-square error feasible and more accurate.

Therefore, consider a streaming feature vector \({\varvec{u}}_{p,k,n} \in {\mathbb {R}}^M\) with output variable \({\varvec{d}}_{p,k}(n) \in {\mathbb {R}}\) given by:

$$\begin{aligned} {\varvec{d}}_{p,k}(n) = {\varvec{u}}_{p,k,n}^\textsf{T}w^{\star } + {\varvec{v}}_{p,k}(n), \end{aligned}$$
(63)

where \(w^{\star }\in {\mathbb {R}}^M\) is some generating model, and \({\varvec{v}}_{p,k}(n)\) is a zero-mean Gaussian random variable with variance \(\sigma _{v_{p,k}}^2\), independent of \({\varvec{u}}_{p,k,n}\). Then, the optimal model that solves the following problem:

$$\begin{aligned} \min _w \frac{1}{P}\sum _{p=1}^P \frac{1}{K}\sum _{k=1}^K \frac{1}{N_{p,k}}\sum _{n=1}^{N_{p,k}} \Vert {\varvec{d}}_{p,k}(n) - {\varvec{u}}_{p,k,n}^\textsf{T}w\Vert ^2 + \rho \Vert w\Vert ^2 \end{aligned}$$
(64)

is found to be:

$$\begin{aligned} w^o = ({\widehat{R}}_u + \rho I)^{-1}( {\widehat{R}}_u w^{\star } + {\widehat{r}}_{uv}), \end{aligned}$$
(65)

where \({\widehat{R}}_u\) and \({\widehat{r}}_{uv}\) are defined as:

$$\begin{aligned} {\widehat{R}}_u&\,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P \frac{1}{K}\sum _{k=1}^K \frac{1}{N_{p,k}} \sum _{n=1}^{N_{p,k}} {\varvec{u}}_{p,k,n}{\varvec{u}}_{p,k,n}^\textsf{T}, \end{aligned}$$
(66)
$$\begin{aligned} {\widehat{r}}_{uv}&\,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P \frac{1}{K}\sum _{k=1}^K \frac{1}{N_{p,k}} \sum _{n=1}^{N_{p,k}} {\varvec{v}}_{p,k}(n){\varvec{u}}_{p,k,n}. \end{aligned}$$
(67)

We consider \(P = 10\) units, each with \(K= 100\) agents, and assume \(N_{p,k}=100\) samples per agent. We randomly generate two-dimensional feature vectors \({\varvec{u}}_{p,k,n}\) from a zero-mean Gaussian distribution with a randomly generated covariance matrix \(R_{u_{p,k}}\), and calculate the corresponding outputs according to (63). To make the data non-iid across agents, the covariance matrix \(R_{u_{p,k}}\) and the noise variance \(\sigma _{v_{p,k}}^2\) differ across agents. When running the algorithm, each unit samples \(L = 11\) agents at random, and each agent runs \(E_{p,k} \in [1,10]\) epochs with a mini-batch of \(B_{p,k} \in [5,10]\) samples.
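The data model (63) and the closed-form solution (65)–(67) can be reproduced in a few lines. The per-agent covariance and noise-level distributions below are our own illustrative choices, not necessarily those used in the paper's simulations:

```python
import numpy as np

rng = np.random.default_rng(2)
P, K, N, M, rho = 10, 100, 100, 2, 0.1   # units, agents, samples, dim, regularizer
w_star = rng.normal(size=M)              # generating model of (63)

# Accumulate the empirical moments (66)-(67) agent by agent; each agent
# draws its own covariance and noise level to make the data non-iid.
R_hat = np.zeros((M, M))
r_hat = np.zeros(M)
for _ in range(P * K):
    A = rng.normal(size=(M, M))
    cov = A @ A.T + 0.1 * np.eye(M)                        # agent covariance R_u
    u = rng.multivariate_normal(np.zeros(M), cov, size=N)  # features
    v = rng.normal(scale=rng.uniform(0.1, 1.0), size=N)    # observation noise
    d = u @ w_star + v                                     # outputs, model (63)
    R_hat += (u.T @ u) / N / (P * K)                       # term of (66)
    r_hat += (u.T @ v) / N / (P * K)                       # term of (67)

# Closed-form regularized solution of (64), i.e., eq. (65).
w_o = np.linalg.solve(R_hat + rho * np.eye(M), R_hat @ w_star + r_hat)
print(w_o)
```

The MSD in (68) can then be computed exactly against this \(w^o\), which is what makes the quadratic setting convenient for the experiments below.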

We compare three algorithms: the standard GFL algorithm, the privatized GFL algorithm with random perturbations, and the privatized GFL with homomorphic perturbations. We do not add noise between the clients and their server to focus on the effect of the perturbations between the servers. In the first set of simulations, we fix the step-size \(\mu =0.7\) and the regularization parameter \(\rho = 0.1\). We fix the variance of the added noise for privatization in both schemes to \(\sigma _g^2 = 0.1\). We then plot the mean-square-deviation (MSD) at each time step for the centroid model:

$$\begin{aligned} \text{ MSD}_i \,\overset{\Delta }{=}\,\Vert {\varvec{w}}_{c,i} - w^o\Vert ^2, \end{aligned}$$
(68)

as seen in Fig. 2. We observe that the privatized GFL with random perturbations performs worse than the other two algorithms, whereas graph-homomorphic perturbations do not cause such a decay in performance. Thus, our suggested scheme tracks the performance of the original GFL algorithm well, without compromising the privacy level.

Fig. 2 Performance of GFL with no perturbations (blue), with graph-homomorphic perturbations (green), and random perturbations (red)

Fig. 3 Performance curves of privatized GFL with varying noise variance

We next study the extent of the noise's effect on the model utility. We run a series of experiments with varying noise variances \(\sigma _g^2 = \{0.001, 0.01, 0.1,1,2,10\}\) for the two privatized GFL algorithms, and plot the resulting MSD curves in Fig. 3a. For a fixed step-size, as we increase the variance, the MSD of the algorithm with random perturbations increases significantly, unlike the algorithm with graph-homomorphic perturbations. We conclude that the algorithm with random perturbations is more sensitive to the variance of the added noise: beyond some variance, it breaks down, while graph-homomorphic perturbations delay that effect to much larger variances. In addition, as long as the step-size is small enough, we can always control the effect of the graph-homomorphic perturbations.

However, if we look at the individual MSD of a single federated unit, we discover that the performance decays as the noise variance increases, though not to the extent of random perturbations. We plot in Fig. 3b the average individual MSD for the varying noise variance:

$$\begin{aligned} \text{ MSD}_{\text{ avg },i} \,\overset{\Delta }{=}\,\frac{1}{P}\sum _{p=1}^P \Vert {\varvec{w}}_{p,i} - w^o\Vert ^2. \end{aligned}$$
(69)

We observe that for a fixed noise variance, graph-homomorphic perturbations result in better performance. Furthermore, as we increase the noise variance, the network disagreement increases for both schemes; this comes as no surprise and is in accordance with Lemma 3. As previously mentioned, graph-homomorphic perturbations have the added value of not being negatively affected by a decrease in the step-size. Moreover, even though the improvement may not seem significant, the source of the error differs between the two schemes. With graph-homomorphic perturbations, the information about the true model remains distributed in the network and can be retrieved by running a consensus-type step at the end of the learning algorithm; at that point, the local models no longer contain information about the local data, and agents can safely share them. When random perturbations are used, however, such reconstruction is not possible, since the information has been lost in the network due to the added perturbations.

Fig. 4 Performance curves of privatized GFL with varying step-size

We next fix the noise variance \(\sigma _{g}^2 = 0.1\) and vary the step-size \(\mu = \{0.1, 0.5, 1, 5 \}\). According to Theorem 4, the MSD resulting from random perturbations includes an \(O(\mu ^{-1})\) term, which is not the case with graph-homomorphic perturbations. Thus, we expect a decrease in the step-size not to significantly affect the privatized algorithm with graph-homomorphic perturbations, as opposed to random perturbations. Indeed, as seen in Fig. 4, as \(\mu\) is increased, the final MSD increases; this is likely due to the \(O(\mu )\sigma _s^2\) term in the bound. In contrast, for very small or very large \(\mu\), the performance of the privatized algorithm with random perturbations degrades. For both privacy schemes, the rate of convergence slows down as the step-size decreases. Thus, there exists an optimal step-size that achieves a good compromise between fast convergence and a low MSD.

5.2 Privatized federated learning

We focus on the single server FL setting (i.e., \(P = 1\)), where we assume we have \(K=1000\) agents of which we choose \(L=30\) at a time. We generate non-iid datasets of varying size for each agent as in the previous section. We allow each agent to run varying epochs \(E_k \in [1,10]\) during an iteration of the algorithm. We set the step-size \(\mu = 0.2\), \(\rho = 0.007\) and \(\sigma _g^2 = 0.02\). We compare three algorithms: the standard FL algorithm, the privatized FL algorithm with sharing of models, and the privatized FL algorithm with sharing of updates. We plot the average MSD curves after repeating the experiment 100 times. As expected, the effect of the added noise is worse when models are shared (yellow curve Fig. 5) than when updates are shared (red curve Fig. 5).

Fig. 5 MSD plots of FL

We next study the effect of the step-size on the MSD of the privatized FL algorithm. We expect the MSD to increase with \(\mu\) when updates are shared. When models are shared, since the gradient-noise variance is scaled by \(\mu\) and the added-noise variance by \(\mu ^{-1}\), we expect a trade-off: as \(\mu\) is increased, the effect of the gradient noise grows while that of the added noise diminishes, and as \(\mu\) is decreased, the effect of the added noise overpowers that of the gradient noise. Indeed, we observe this phenomenon in (a) and (b) of Fig. 6.

Finally, we study the effect of the variance of the added noise. We fix the step-size at \(\mu =0.2\) and vary the noise variance \(\sigma _g^2 = \{0.01,0.05, 0.1,0.5\}\). In both cases, the performance diminishes as we increase \(\sigma _g^2\) ((c), (d) of Fig. 6). However, larger values of the added noise variance affect the perturbed models more than the perturbed gradients: the algorithm diverges at lower values of \(\sigma _g^2\) when models are shared than when gradients are shared. Thus, sharing updates can handle larger values of \(\sigma _g^2\) before the algorithm diverges. In addition, since the noise variance is tuned by the step-size, we can always find a suitable \(\mu\) to decrease its effect.

Fig. 6 MSD plots of privatized FL with varying step-size and variance of added noise

5.3 Classification

We now focus on a classification problem applied to a dataset on click-through rate prediction of ads. We consider the Avazu click-through dataset [29]. We split the 5101 data samples unequally among a total of 50 agents, assuming \(P = 5\) units each with \(K = 10\) agents. We add non-iid noise to the data at each agent to change their distributions. We again compare three algorithms: standard GFL, privatized GFL with graph-homomorphic perturbations, and privatized GFL with random perturbations. We use a regularized logistic risk with regularization parameter \(\rho = 0.03\) and set the step-size \(\mu = 0.5\). We repeat the algorithms for multiple levels of privacy and settle on a noise variance \(\sigma _g^2 = 0.6\) for which the privatized algorithm with random perturbations still converges. We plot in Fig. 7 the testing error on a set of 256 clean samples that were not perturbed with noise. We use the centroid model learned at each iteration to calculate the corresponding testing error. We observe that graph-homomorphic perturbations do not hinder the performance of the privatized model, while random perturbations significantly reduce the utility of the learnt model.

Fig. 7 Testing error of GFL with no perturbations (blue), with graph-homomorphic perturbations (green), and random perturbations (red)

6 Conclusion

In this work, we introduced graph federated learning and implemented an algorithm that guarantees privacy of the data in a differential-privacy sense. We showed that general privatization schemes based on adding independent random perturbations to updates in federated learning have a negative effect on the performance of the algorithm, driving it farther away from the true optimal model. However, we showed that by adding graph-homomorphic perturbations, which exploit the graph structure, performance can be recovered with guaranteed privacy; in other words, using dependent perturbations does not entail the same trade-off between privacy and utility. In federated learning, we also proved that sharing perturbed gradients instead of perturbed models significantly reduces the effect of the added noise on the model utility. Thus, we no longer have to choose what to prioritize: we can have a highly privatized algorithm with good model utility.