Introduction

The increasing demand to process data generated on the Internet has led to the growing use of Machine Learning (ML) methods. Traditionally, a single server or data center stores data collected from different parties and performs model training centrally; this paradigm is called centralized learning. Centralized learning is vulnerable to leakage of personal and private information: numerous studies [16, 36] have shown that by analyzing the output of a machine learning model, attackers can infer sensitive information about individuals in the training dataset, e.g., bank transaction records and personal medical data.

Without proper regulations and privacy consent, it is challenging in practice to aggregate datasets from different industry domains because of privacy concerns. Countries worldwide have introduced a series of legal regulations to protect user privacy. In May 2018, the EU General Data Protection Regulation (GDPR) [34] took effect, setting precise requirements for companies that process user data. The Health Insurance Portability and Accountability Act (HIPAA) [3] also defines how medical data can be used and released. These regulations guide how companies may use data and make unauthorized access illegal: deliberately attempting to access computer systems, networks, or data without proper authorization or permission is forbidden. Therefore, a far-reaching and urgent issue is how to exploit the full potential of machine learning and other artificial intelligence methods while ensuring user privacy and data security.

Recently, researchers have explored training global models while keeping each participant's data local. A typical case is Federated Learning (FL), proposed by McMahan et al. [26]. FL provides user privacy protection without sharing data among participants. It is a good solution to the data-silo problem: it can learn from different datasets and improve the generalization of the model, and clients can flexibly adopt different local training algorithms. Technically, the FL architecture can be applied to any industry, but its security issues must be considered. The basic Federated Averaging (FedAvg) scheme cannot guarantee data privacy: a malicious participant can infer sensitive information about the original datasets from the shared models and hyper-parameters. This gradient leakage problem [35] is one of the major security concerns in FL. Similarly, data reconstruction attacks [24] and inference attacks [31] can be mounted against classic FL schemes. Even though cryptographic primitives such as homomorphic encryption (HE), differential privacy (DP), and secure multi-party computation (MPC) are widely applied to FL, they introduce trade-offs between accuracy, efficiency, and learning performance.

The typical attack points in an FL scenario are a compromised server and malicious clients. On the one hand, the server deployed by a service provider is considered a passive attacker under the honest-but-curious security model: it provides its services strictly following the established learning protocols, but it may also leak sensitive information about users from their local model updates. On the other hand, participants are considered active attackers who try to recover sensitive information about the other participants' training data from the shared global model parameters.

There are several open-access FL frameworks and libraries available to academia and industry, such as TensorFlow Federated (TFF) [1] by Google, FedML [15], FATE [38] by WeBank, and Flower [5], among others. Different platforms use various strategies. However, applying these libraries requires considerable re-engineering effort to accommodate new ML models. Moreover, they all use a client–server structure and do not support managed, access-controlled networks. We conduct a comparative analysis with these schemes, especially regarding privacy mechanisms, in "Comparisons of different FL frameworks".

In this paper, we propose a Kafka-based privacy-preserving framework for FL. The scheme improves network reliability by using Kafka for message distribution and ensures data privacy with a Diffie-Hellman scheme for secure server-side aggregation and DP on the gradients to prevent inference attacks. The contributions of this paper are summarized as follows:

  1. A combined framework of a partial Diffie-Hellman protocol and a DP scheme is proposed to mitigate the main security threats in FL.

  2. A practical FL scheme with Apache Kafka and Zookeeper is designed to achieve anonymous authentication with role-based ACLs and to support independent and asynchronous model distribution.

  3. The proposed method is compared and evaluated against the state of the art with respect to security, efficiency, effectiveness, and accuracy via security analysis and empirical studies.

The structure of the paper is as follows. In "Related work and primitives", we introduce the primitives related to the proposed framework, including Apache Kafka, the secure aggregation scheme, and DP. We then present the main framework and protocol in "Proposed framework". Subsequently, security and performance analyses are conducted in "Security analysis" and "Evaluations", respectively. Finally, we conclude the paper and outline future work in "Conclusion and future work".

Related work and primitives

This section introduces the main primitives related to our framework: the FL algorithm, Apache Kafka, the secure aggregation scheme, and DP.

Federated learning

McMahan et al. [26] first introduced the concept of federated learning, which eliminates the need to store data centrally and allows each participant to update gradients in a distributed way. It gained popularity for its ability to address data privacy issues, which previous distributed machine learning schemes did not. Generally, FL works with both independent and identically distributed (IID) and non-IID data. Based on how the data features are partitioned, FL can be categorized into three groups: horizontal, vertical, and hybrid. In horizontal federated learning (HFL), the clients' data share the same feature space but differ in sample space. By contrast, in vertical federated learning (VFL), the data share the same sample space but differ in feature space. Hybrid federated learning involves datasets that differ in both sample and feature space [38]. A definition of FL is given in Definition 1, and a typical HFL training process is depicted in Fig. 1.

Fig. 1  Training process of federated learning (round r)

Definition 1

(Federated learning) Let \(U=\{U_1,U_2,\ldots , U_n\}\) be a set of n participants, where each participant \(U_i\) has its own local dataset \(D_i\). Let \(U_c\) be the central server (or aggregator). The task of FL is to compute the global model \(M_{global}\) on node \(U_c\) based on all uploaded local models \(\forall i\in U: M_{i}\), where each local model is generated from the local client dataset \(D_i\).

In the first round, the server generates a model with random parameters \(\theta _0\) and sends it to all clients. After receiving the model sent by the server, k out of the n participating clients locally compute the training gradients on their own datasets and send the updated models to the server. Then, the server aggregates the clients' gradients and computes the global parameters \(\theta _r=\sum _{i=1}^{k}\theta _i/k\). After a round of updates is completed, each client checks whether the accuracy of the local model meets the requirements and stops training if it does; otherwise, it prepares for the next round of training.

During the FL training process, the goal is to minimize the global loss \(f_\mathrm{{FL}}\), i.e.,

$$\begin{aligned} \mathop {\min }\limits _{w}\frac{1}{K}\sum _{n=1}^N\sum _{k=1}^{k_n}f_k(w), \end{aligned}$$
(1)

where w and K represent the global model weights and the number of participating clients, respectively.

The most typical algorithm for local model training is Federated Averaging, or FedAvg for short, whose pseudocode is listed in Algorithm 1. In FL, the server chooses a fraction of clients \(0 \le C\le 1\) and sets the learning rate to \(\eta \). Each client k computes the local update \(w\leftarrow w-\eta \nabla \ell (w;b)\) over its mini-batches b and sends the result to the server. The server then updates the parameters as the weighted average \(w_{t+1}\leftarrow \sum _{k=1}^{K}\frac{n_k}{n}w_{t+1}^k\).

Algorithm 1  FedAvg
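Since the pseudocode figure is not reproduced here, the following is a minimal sketch of a FedAvg round in the notation above; it assumes model weights represented as NumPy vectors and a user-supplied gradient function `local_grad(w, batch)`, and is an illustration rather than the paper's exact implementation.

```python
# Minimal FedAvg sketch: local SGD updates on each client, then a weighted average
# on the server. `local_grad(w, batch)` is an assumed user-supplied gradient function.
import numpy as np

def client_update(w, batches, eta, epochs, local_grad):
    w = w.copy()
    for _ in range(epochs):
        for b in batches:
            w -= eta * local_grad(w, b)        # w <- w - eta * grad(w; b)
    return w

def fedavg_round(w_global, clients, eta, epochs, local_grad):
    # clients: list of (batches, n_k) pairs, where n_k is the client's dataset size
    n = sum(n_k for _, n_k in clients)
    updates = [(client_update(w_global, batches, eta, epochs, local_grad), n_k)
               for batches, n_k in clients]
    # weighted average: w_{t+1} = sum_k (n_k / n) * w_{t+1}^k
    return sum((n_k / n) * w_k for w_k, n_k in updates)
```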

Apache Kafka

Apache Kafka [22] is a popular publish/subscribe (pub/sub) system for data pipelines, streaming analytics, data integration, and mission-critical applications. A basic structure of a Client-Kafka/Zookeeper-Producer architecture is shown in Fig. 2.

Fig. 2  Communication topology of Kafka cluster

The basic components of a simple Kafka platform are Zookeeper nodes, Kafka broker nodes, producers, and consumers. Message flows are separated by logical channels called "topics". Producers publish messages to a topic, the brokers store them, and consumers that subscribe to the topic can listen for and read messages from the brokers. Meanwhile, Kafka relies on Zookeeper for robust synchronization: as a centralized service, Zookeeper can track clients' states, add or remove nodes, distribute messages, and maintain access control lists. These components communicate with each other through secure, encrypted channels. Current Kafka includes several security features, such as channel security, authentication, and authorization (ACLs), described in the following:

  • Channel security: To transmit messages safely between the different components, Secure Sockets Layer (SSL)/Transport Layer Security (TLS) communication channels are implemented.

  • Authentication: Simple Authentication and Security Layer (SASL) or SSL client authentication is used to authenticate clients against brokers or, optionally, against Zookeeper.

  • Authorization (ACLs): Authorization is the process by which the application decides which users have access to which resources. A practical example: one client has the right to publish messages to a topic, another has the right to read messages from that topic, and no other action is allowed. These rules are defined in Kafka through Access Control Lists (ACLs).
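As an illustration of the pub/sub and channel-security features above (not the paper's actual deployment), the sketch below publishes and consumes model updates over SSL with the kafka-python library; the topic name "model-updates", the broker address, and the certificate paths are assumptions.

```python
# Hypothetical kafka-python sketch: topic name, broker address, and certificate
# paths are illustrative; SSL provides channel security and client authentication.
from kafka import KafkaProducer, KafkaConsumer

serialized_model_update = b"\x00" * 16       # placeholder for a serialized, masked model

producer = KafkaProducer(
    bootstrap_servers="broker:9093",
    security_protocol="SSL",                 # encrypted SSL/TLS channel
    ssl_cafile="ca.pem",
    ssl_certfile="client.pem",               # client certificate used for authentication
    ssl_keyfile="client.key",
)
producer.send("model-updates", value=serialized_model_update)   # publish to the topic
producer.flush()

consumer = KafkaConsumer(
    "model-updates",                         # the aggregator subscribes to the same topic
    bootstrap_servers="broker:9093",
    security_protocol="SSL",
    ssl_cafile="ca.pem",
    ssl_certfile="server.pem",
    ssl_keyfile="server.key",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,              # stop iterating if no new messages arrive
)
for record in consumer:
    print(len(record.value))                 # hand the update to the aggregation step
```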

These features make Kafka an excellent solution for handling and mediating communication between two applications. For an FL structure, Apache Kafka can provide a secure communication channel between the clients and the server and can store and distribute messages. Meanwhile, it can authenticate clients without extra security mechanisms such as VPNs or username-password authentication. Its distributed and scalable nature makes Kafka well suited to the FL environment for the following reasons:

  1. Scalability: Kafka excels at handling high-throughput, large-scale data streaming with low latency, which aligns well with the requirements of real-time processing applications such as FL. Moreover, compared with other pub/sub systems, such as Amazon Kinesis, Apache Kafka can easily be integrated with other platforms and services.

  2. Data integration and distribution: Kafka's publish-subscribe model allows data streams to be integrated and distributed among the participants and the server, which fits the server-client structure of FL.

  3. Reliability: Kafka provides reliable communication channels and ensures that data streams are highly available. In FL, where communication is a bottleneck and devices may go offline during communication, Kafka's resilience ensures that data updates and model synchronization can be reliably managed even during network disruptions.

Secure aggregation

The secure aggregation protocol was proposed by Segal et al. [29] to prevent gradient leakage attacks in FL. It builds on secret sharing [30], public-key cryptography, and pseudo-random number generation. The core idea of the algorithm is to add random masks to the updates that completely change their values but cancel out during aggregation.

These masks are generated by pairwise Diffie–Hellman key agreement [9]: pseudo-random generators create the noise, and their seeds are the shared secret keys \(g^{ab}\). Here \(g^a\) and \(g^b\) are the public keys of the two clients with secret keys a and b, respectively, and g is a public generator element. The scheme is secure under the Computational Diffie–Hellman assumption.

Definition 2

(Computational Diffie–Hellman (CDH) assumption) Consider a cyclic group G of order q, a generator \(g\in G\), and random \(a,b \in {\mathbb {Z}}_q^*\). Given \((G,q,g,g^a,g^b)\), it is computationally intractable to compute the value \(g^{ab}\).

The Diffie–Hellman key agreement protocol is defined as follows:

  1. Alice picks a random natural number a with \(1<a<q\) as her secret key and sends the element \(g^a\) of G to Bob;

  2. Bob picks a random natural number b with \(1<b<q\) as his secret key and sends the element \(g^b\) of G to Alice;

  3. Alice computes the element \(\left( g^b\right) ^a=g^{b a}\) of G;

  4. Bob computes the element \(\left( g^a\right) ^b=g^{a b}\) of G.

Thus \(g^{ab}=g^{ba}\) is their shared secret (this is done for every pair of clients). To handle client dropout and unresolved noise, the private keys a and b are shared via Shamir's k-out-of-n threshold secret sharing [30]: any k shares of a secret can reconstruct it perfectly, while an entity holding fewer than k shares can reconstruct the secret only with negligible probability.
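To make the pairwise key agreement concrete, here is a toy sketch in Python under our own assumptions (a small prime modulus and Python's standard library standing in for a cryptographic PRG); it is not the paper's implementation.

```python
# Minimal Diffie-Hellman seed-agreement sketch; the toy prime modulus is for
# readability only -- a standardized group should be used in practice.
import secrets, hashlib, random

q = 2**61 - 1                           # toy prime modulus (illustrative only)
g = 3                                   # public generator

a = secrets.randbelow(q - 2) + 2        # Alice's secret key, 1 < a < q
b = secrets.randbelow(q - 2) + 2        # Bob's secret key, 1 < b < q
A, B = pow(g, a, q), pow(g, b, q)       # exchanged public values g^a and g^b

shared_alice = pow(B, a, q)             # (g^b)^a
shared_bob = pow(A, b, q)               # (g^a)^b
assert shared_alice == shared_bob       # g^{ab} = g^{ba}

# Use the shared secret as a PRG seed for the pairwise mask (Python's random
# module stands in here for a cryptographic PRG).
seed = hashlib.sha256(str(shared_alice).encode()).digest()
mask = random.Random(seed).random()
```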

Differential privacy

DP was first proposed by Dwork et al. [10] and is considered the most secure perturbation-based privacy protection method. DP provides statistical privacy guarantees for individual records and prevents inference attacks on the model without incurring the additional computational overhead of encryption-based methods.

In the context of FL, adjacency means that one dataset \(\mathcal {D'}\) can be obtained from the other dataset \({\mathcal {D}}\) by removing the data of a single client [2]. DP can be achieved by applying a differentially private transformation, most commonly by adding random noise to the data.

Depending on where the noise is added, there are two categories of DP: central (or global) DP and local DP. In local DP, the noise is added directly to the clients' local datasets; after collecting the noisy data, the aggregator can compute statistics and publish them. The significant advantage of this mode is that it does not require a trusted aggregator; however, adding noise to the raw data reduces accuracy. By contrast, in the central mode, clients send their trained models to the aggregator, and the aggregator publishes the results after masking them. This requires a trusted aggregator, but the final results remain private.

Definition 3

(Differential privacy) Differential privacy [12] quantifies and limits the information disclosed about an individual with privacy loss parameters \((\epsilon ,\delta )\). A randomized algorithm \({\mathcal {A}}\) is \((\epsilon ,\delta )\)-differentially private if for all \({\mathcal {S}}\subseteq \text {Range}({\mathcal {A}})\) and for all adjacent datasets \({\mathcal {D}}\) and \({\mathcal {D}}'\)

$$\begin{aligned} P(\mathcal {A(D)}\in S)\le e^\epsilon P(\mathcal {A(D')}\in S)+\delta \end{aligned}$$
(2)

where S denotes a set of possible outcomes of \({\mathcal {A}}\). If \(\epsilon \) is small, the difference between the probabilities of obtaining the result from dataset \({\mathcal {D}}\) or \(\mathcal {D'}\) is small, which means it is hard to guess which dataset \({\mathcal {A}}\) was run on.

Definition 4

(Global sensitivity) Global sensitivity tells us how much noise must be added to the results. Given a sequence of counting queries \({\mathcal {Q}}\), global sensitivity measures the maximal change in the result when one record is removed from the dataset \({\mathcal {D}}\).

For \({\mathcal {Q}}: {\mathcal {D}} \rightarrow {\mathcal {R}}\), the global sensitivity of \({\mathcal {Q}}\) is defined as:

$$\begin{aligned} GS=\underset{D,D'}{\max }\ \left\| Q(D)-Q(D') \right\| \end{aligned}$$
(3)

To satisfy the differential privacy definition, two primary noise mechanisms are used in DP: the Laplace mechanism (LM) and the exponential mechanism (EM).

Theorem 1

(Laplace mechanism) Given a function \(Q: D \rightarrow \) R over an arbitrary domain D, the randomized algorithm \({\mathcal {A}}\) provides \(\epsilon \)-DP if \({\mathcal {A}}\) satisfies:

$$\begin{aligned} {\mathcal {A}}=Q(D)+\left( {\text {Lap}}\left( \frac{GS}{\epsilon }\right) \right) , \end{aligned}$$

where the noise \({\text {Lap}}\left( \frac{GS}{\epsilon }\right) \) is drawn from a Laplace distribution with scale \(\frac{GS}{\epsilon }\), added independently to each dimension of the output of Q.
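As a small illustration of the Laplace mechanism (our own sketch, not the paper's code), the noise can be added coordinate-wise to a clipped gradient; the sensitivity and \(\epsilon \) values are assumed inputs.

```python
# Laplace mechanism sketch: adds Lap(GS / epsilon) noise independently to each
# coordinate of a query result (here, a gradient vector).
import numpy as np

def laplace_mechanism(value, global_sensitivity, epsilon, rng=None):
    rng = rng or np.random.default_rng()
    scale = global_sensitivity / epsilon           # b = GS / epsilon
    return value + rng.laplace(loc=0.0, scale=scale, size=np.shape(value))

# Example: privatize a (clipped) gradient with GS = 1.0 and epsilon = 1.0
grad = np.array([0.12, -0.05, 0.30])
noisy_grad = laplace_mechanism(grad, global_sensitivity=1.0, epsilon=1.0)
```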

Theorem 2

(Exponential mechanism) Given a function Q whose input is a dataset D and whose output is an entity object \(r \in \) Range, let q(D, r) be a score function that assigns each output r a score, and let GS be the sensitivity of the score function. Then, the mechanism \({\mathcal {A}}\) maintains \(\epsilon \)-DP if:

$$\begin{aligned} {\mathcal {A}}(r, q)=\left\{ \text{ return } r \text{ with } \text{ probability } \propto \exp \left( \frac{\epsilon q(D, r)}{2GS}\right) \right\} . \end{aligned}$$

Proposed framework

This section explains FedlabX, the Kafka-based privacy-preserving framework, in detail. The framework overview is presented in Fig. 3: DP is applied to the trained model gradients, Kafka brokers with Zookeeper handle communication, and secure aggregation is performed on the server side.

Fig. 3  Overview framework of the FedlabX

Security model

We follow the semi-honest threat model, in which all participating parties, including curious adversaries, follow the protocol properly. Additionally, parties in the semi-honest model can passively gather and analyze publicly available data in the system. We assume that the parties in the system have Probabilistic Polynomial Time (PPT) computational power. Based on this scheme, we define the following security and privacy requirements:

Definition 5

(Correctness) The proposed model is correct if the server conducts aggregation following the protocol honestly without accuracy loss.

Definition 6

(Data Privacy) Any PPT adversary party can infer the original data of the clients from the proposed model with a negligible probability only.

Definition 7

(Client Unlinkability) Any PPT adversary party can link the client local data to the aggregated data with a negligible probability only.

FedlabX framework

Our framework has three parties: the server, the clients, and the Apache Kafka message queue. In this framework, the server initializes a model and sends the untrained model to the request broker in Kafka. We assume Kafka is hosted by a trusted third party. Once the clients request the model from the Kafka request broker, Kafka relays the untrained model to the clients. The proposed model allows the central broker to set a policy that admits only specified clients, based on a 1-out-of-n verification scheme [18] with predefined n attributes. We assume an established secure channel between the central server and the clients; lightweight encryption [19, 20] can be adopted by the central broker during verification. Each client trains the model on its local dataset. Once the server requests the trained model, clients respond to the response broker in Kafka with the trained and masked model. Finally, the response broker sends the updated models to the server, and the server runs the secure aggregation algorithm to obtain the final result. This framework is well suited to cross-silo FL scenarios with few clients but large databases, and the Kafka client–server communication structure with message storage and transmission makes it practical for real industrial settings. Meanwhile, from the security perspective, the secure aggregation and DP schemes make FedlabX privacy-preserving and prevent most attacks when the server and clients are semi-honest. We implement a variant of secure aggregation that is more efficient and saves computation cost.

This scheme contains the following phases: setup and key generation, model updates, and secure aggregation. The main structure is shown in Fig. 4, and we detail each phase in the following protocol.

Fig. 4  Structure of the phases

Phase 1 (Setup and key generations)

This phase includes the key generation between the clients. Let the set of clients be \(C=\{1,2,\dots , n\}\). The server chooses an appropriate group G in which the computational Diffie–Hellman problem is hard. G is a cyclic group of q elements with generator g, and the following general parameters are generated:

$$\begin{aligned} (G,\cdot ),\quad |G|=q,\quad g \in G: G=\langle g \rangle \end{aligned}$$

Let \(i \ne j \in C\) be a fixed pair of clients. Then the following protocol that includes seed and noise generation must be run for every pair of clients:

Seed generation

  1. Client i chooses an exponent \(a_i\in _R {\mathbb {Z}}^*_{q}\) and sends the value \(g^{a_i}\) to the server. The server then sends the pair (\(i, g^{a_i}\)) to client j.

  2. Client j chooses an exponent \(a_j\in _R {\mathbb {Z}}^*_{q}\) and sends the value \(g^{a_j}\) to the server. The server then sends the pair (\(j, g^{a_j}\)) to client i.

  3. The common key of clients i and j is \(key_{i,j}=g^{a_ia_j}\).

     At the end of this phase, every client \(i\in C\) stores a set of common keys with the other clients:

     $$\begin{aligned} K_i=\{\mathrm{{key}}_{i,1}, \mathrm{{key}}_{i,2},\ldots , \mathrm{{key}}_{i,n}\}. \end{aligned}$$

Noise generation

To prevent inference attacks, clients implement DP by adding noise to the gradients after model training, based on DP-SGD [2]. In our case, we use the Laplace noise generation mechanism [11] of DP. The Laplace mechanism preserves \((\epsilon ,0\))-differential privacy, i.e., it is \(\epsilon \)-differentially private.

Definition 8

Let Q be an aggregated query. The result of Q is \(\alpha \)-differentially private if it is perturbed by a random variable \({\mathcal {R}}\sim \mathrm{{Lap}}(\theta ,\alpha )\), where \(\theta \) is the mean and \(\alpha \) is the scale.

A random variable \({\mathcal {R}}\sim \mathrm{{Lap}}(0,\Delta _Q/\alpha )\) follows a Laplace distribution if:

$$\begin{aligned} Pr({\mathcal {R}}=r)=\frac{\alpha }{2\Delta _Q}e^{-|r|\alpha /\Delta _Q} \end{aligned}$$
(4)

When we add Laplacian noise to the gradients, we exploit the infinite divisibility of the distribution, whereby the random variable \({\mathcal {R}}\) can be computed by summing up n other random variables. We generate the partial noise using the following:

$$\begin{aligned} {\mathcal {R}}&=\sqrt{{\mathcal {B}}\cdot \sum _{i=1}^n{\mathcal {R}}_i},~\text {for}~{\mathcal {R}},{\mathcal {R}}_i\sim \mathrm{{Lap}}(0,\alpha ),~\text {and}\end{aligned}$$
(5)
$$\begin{aligned} {\mathcal {B}}&\sim \mathrm{{Beta}}(1,n-1)~\text {is a single r.v.}, \end{aligned}$$
(6)

where Pr\(({\mathcal {B}}=x)=(n-1)(1-x)^{n-2}\).

Phase 2 (model updates)

As a result of the execution of the first phase, in which the seeds and noise are generated, all participating clients are ready with the noise-masked model and key pairs. The clients then update the model by responding to the request sent from the central server. Models arrive at and are stored in the Kafka brokers connected with Zookeeper. In this way, even if clients have different hardware conditions and network bandwidth, the framework can still synchronize the communication round.

Phase 3 (secure aggregation)

In this phase, the server computes the aggregated client data for a given update. Let \(l \in {\mathbb {N}}\) be the length of the users’ data of a new update. This length can be different for every update but fixed during a single update. Let \(x_i \in \{0,1\}^l\) be the data of client i, and \(P_l\) be a pseudorandom generator of output size l.

The following protocol describes phases 2 and 3 of the framework:

Protocol 1

  1. The server chooses a random salt \(r \in _R\{0,1\}^n\).

  2. The server sends the chosen r to every client i.

  3. Every client \(i \in C\) computes

     $$\begin{aligned} s_i = x_i + \sum _{j \le i}P_l(r|\mathrm{{key}}_{i,j})-\sum _{i \le j}P_l(r|\mathrm{{key}}_{j,i}) \end{aligned}$$

  4. Every client i sends \(s_i\) to the server.

  5. The server computes \(\sum _{i\in c}s_i\) to get the aggregation of the individual data.

The result of the \(\sum _{i\in c}s_i\) in step 5 of Protocol 1 is our final result of the training request.
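The cancellation of the pairwise masks in Protocol 1 can be illustrated with the following toy sketch, in which integer update vectors and a hash-seeded generator stand in for the real data and the PRG \(P_l\); it is a demonstration of the idea, not the paper's implementation.

```python
# Toy secure-aggregation sketch: pairwise PRG masks derived from shared keys
# cancel out when the server sums all masked updates.
import hashlib, random

def prg(salt, key, length):
    # Stand-in for P_l: expands salt||key into `length` pseudo-random integers.
    seed = hashlib.sha256(f"{salt}|{key}".encode()).digest()
    return [random.Random(seed + bytes([t])).randrange(0, 2**16) for t in range(length)]

def mask_update(i, x_i, keys, salt):
    # keys[(i, j)] holds key_{i,j}, shared by clients i and j (same value both ways).
    s_i = list(x_i)
    for j in {b for (a, b) in keys if a == i}:
        noise = prg(salt, keys[(i, j)], len(x_i))
        sign = 1 if j < i else -1               # opposite signs within each pair (i, j)
        s_i = [v + sign * m for v, m in zip(s_i, noise)]
    return s_i

# Example with three clients holding length-2 updates x_i.
clients = {1: [10, 20], 2: [1, 2], 3: [5, 5]}
keys = {(i, j): f"key_{min(i, j)},{max(i, j)}" for i in clients for j in clients if i != j}
salt = 42
masked = [mask_update(i, x, keys, salt) for i, x in clients.items()]
aggregate = [sum(col) for col in zip(*masked)]
assert aggregate == [16, 27]                    # masks cancel: the sum of the raw x_i
```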

Security analysis

FedlabX shows its advantage in data privacy preservation. With a modified federated ML structure, we focus on the privacy and security properties of this framework. In this section, we analyze the security of the proposed model, considering two types of threat models: internal and external attacks. Table 1 briefly summarizes the threats, sources of vulnerability, and countermeasures in the FedlabX framework. We do not go into the details of attack prevention with DP and secure aggregation here, as other papers have proved them [12, 14, 21, 29].

Table 1 Summary of threats and countermeasures in the FedlabX framework

Remark 1

(Data privacy) If secure aggregation and Laplace noise generation are implemented, then the proposed model satisfies the data privacy requirement, i.e., the original data are protected from gradient leakage attacks and inference attacks.

Below are a few arguments supporting Remark 1. We assume that the central server and clients are semi-honest in the proposed model: they follow the protocol properly but try to exploit and infer useful information from the training process for their own purposes. Therefore, for an insider attack, data reconstruction from the gradients is the leading security issue. In this attack, the semi-honest nodes obtain the gradient updates during iterative training and reconstruct the local training data from the leaked gradients. Deep leakage from gradients (DLG) [42] and improved deep leakage from gradients (iDLG) [41] follow this approach: they optimize the pixels of a randomly generated image (or text) so that its gradients match those of the real data point. If the training uses frequent updates computed from gradients over few data points or a small number of epochs, the training data can be reconstructed completely. Since FedlabX uses secure aggregation, the scheme prevents gradient leakage attacks [6, 13, 37].

External attacks, such as man-in-the-middle attacks, Sybil attacks, and inference attacks, occur when attackers outside the architecture try to hack the system to steal data or crash the protocol. The man-in-the-middle attack is one of the most typical: an attacker sniffs the traffic between clients and the central server and tries to infer the clients' private information. However, as the proposed model assumes the existence of secure communication channels [7, 17], we do not consider man-in-the-middle attacks in our scheme. The main security issue in this case is the inference attack. In an inference attack, an attacker trains shadow models similar to the attacked one, performing a similar task over similar datasets. These malicious shadow models are fed different input data, and over their outputs a binary attack classifier is trained whose task is to decide whether an input was part of the training data. Despite its complexity trade-off [33], DP can prevent membership inference attacks, as proved in [8, 27, 28].

Theorem 3

The proposed model is computationally correct by following the protocol honestly.

Proof

Based on Definition 5, the proposed model is correct if the server conducts aggregation following the protocol without accuracy loss. In other words, the central server in Protocol 1 should finally compute \(\sum _{i\in c}x_i\). Each client i computes \(s_i = x_i + \sum _{j \le i}P_l(r|\mathrm{{key}}_{i,j})-\sum _{i \le j}P_l(r|\mathrm{{key}}_{j,i})\). Since the pseudo-random generator \(P_l\) takes the seed \(r|\mathrm{{key}}_{i,j}\) shared by each pair of clients (as a result of phase 1 of the proposed protocol), the same random number is generated within each pair of clients. The central server receives these two random numbers with opposite signs, so they cancel each other after addition. As a result, only the \(x_i\) values of each client remain; therefore

$$\begin{aligned} \sum _{i\in c} \left( x_i + \sum _{j \le i}P_l(r|\mathrm{{key}}_{i,j})-\sum _{i \le j}P_l(r|\mathrm{{key}}_{j,i})\right) = \sum _{i\in c} x_i \quad \square \end{aligned}$$

Theorem 4

If the function \(P_l\) used in phase 3 of the protocol is a PRNG, then Protocol 1 satisfies client unlinkability.

Proof

Assume that the function \(P_l\) used in phase 3 of Protocol 1 is a PRNG. A PPT adversary can access the masked client data \(s_i\) sent in step 4 of Protocol 1. To link a client identity to the client data, the adversary needs \(x_i\), which is blinded with

$$\begin{aligned} \sum _{j \le i}P_l(r|\mathrm{{key}}_{i,j})-\sum _{i \le j}P_l(r|\mathrm{{key}}_{j,i}) \end{aligned}$$

As a result of the PRNG seeded with the secure key concatenated with the random salt r, the above blinding factor appears random from any PPT adversary's point of view; hence the adversary can learn \(x_i\) from the masked value \(s_i\) with negligible probability only. \(\square \)

Evaluations

In this section, we compare the Kafka-based privacy-preserving federated learning framework (FedlabX) with the other typical frameworks mentioned before, Flower, FATE, TFF, and FedML, regarding privacy mechanisms, system scalability and interoperability, and computational complexity. Furthermore, we evaluate the performance of the proposed FedlabX framework in terms of computational cost and accuracy.

Comparisons of different FL frameworks

Comparisons of privacy mechanisms

We list the privacy mechanisms of the most popular FL frameworks in Table 2.

Table 2 Comparisons of different FL frameworks

The comparison shows that the FATE framework applies different cryptographic mechanisms from the other frameworks. FATE implements secure computation protocols based on Paillier homomorphic encryption (HE) and multi-party computation (MPC). It provides strong privacy protection but relies on heavy computation. Flower is one of the most user-friendly FL libraries among these open-access frameworks, and it provides DP for privacy preservation. According to the official website, its "LightSecAgg" secure aggregation protocol has not been implemented yet, so its diagram and abstraction may not be accurate in practice; the same holds for other mechanisms. TFF applies cryptographic mechanisms very similar to ours, combining SA and DP schemes. FedML only provides DP, which guarantees a certain level of privacy but cannot resolve the gradient leakage risks in FL.

Notably, only our framework achieves anonymous authentication. In FL settings, data sanitization (anti-poisoning) and anomaly detection cannot be applied to user data or submitted updates, which is why FL has no defense against data poisoning and no anomaly detection. Most other schemes apply standard secure communication channels to exchange messages but do not specify how clients are authenticated to keep malicious clients out. In our framework, by contrast, anonymous authentication is achieved with Kafka ACLs and 1-out-of-n attribute verification.

Comparisons of system scalability and interoperability

The comparisons of system scalability and interoperability of different frameworks can be found in Table 3.

Table 3 System scalability and interoperability comparisons of different FL frameworks

As observed, FATE has the capability to support various data structures, including HFL, VFL, and Hybrid FL, whereas other frameworks may only support HFL. Additionally, FATE offers a comprehensive one-step federation solution that encompasses assessment and evaluation. This aspect enhances its operability while potentially impacting its scalability. Flower is predominantly used in stand-alone cases, which can result in average operability and scalability. TFF focuses on edge computing and emphasizes ease of cooperation with other platforms and structures. However, the complexity in operating TFF may arise due to its reliance on command-line operations rather than a graphical user interface. FedML demonstrates its versatility in various applications, including topology-based mode, stand-alone mode, edge computing, and non-IID case. Its user interface and API offer ease of use, allowing for straightforward implementation. However, extending the framework to integrate with other platforms can be challenging. FedLabX supports both stand-alone mode and edge computing. It facilitates seamless collaboration with different strategies and platforms, offering enhanced interoperability. However, its operability is rated as average due to the absence of a user interface.

Comparisons of computational complexity

All calculations below assume a single server and n users, each holding a data vector of size m. The computational complexity of FL libraries is influenced by various factors, including the size and complexity of the data and models, the algorithms and protocols used, and the efficiency optimizations employed within the library. Here, however, we only discuss the computational complexity added by the privacy mechanisms, especially the SA, DP, and HE strategies. We analyze the computational complexity of FedlabX in the following.

On the client side

Compared to basic FL, the main differences in computational cost lie in the key generation at the initialization stage, the PRNG function for secure aggregation, and the Laplacian noise generation for DP. The computation cost of each client i can be divided into three parts: (1) performing the \(n-1\) key agreements, which takes \(O(n-1)\) computation time; (2) generating the noise \(P_l(r|\mathrm{{key}}_{i,j})\) with the PRNG function for every other user j, which takes O(mn) time in total; (3) generating local noise with the Laplace mechanism, which takes O(mn) computation time. Overall, the computation cost of a standard client is \(O(mn+n)\).

On the server side

There is no key setup on the server side; thus, we only need to consider the computation cost during the federated process. The server's computation cost consists of performing the aggregation function by removing the \(P_l(r|\mathrm{{key}}_{i,j})\) values, which takes \(O(mn^2)\) time in the worst case.

The computational complexity of different FL frameworks is summarized in Table 4.

Table 4 Comparisons of computational complexity of different FL frameworks

The most significant computation cost of FATE lies in HE, which is mainly determined by the encryption and decryption of messages. Each client encrypts the predicted value and sends the resulting ciphertext to the server; therefore, the computational complexity for each client is \(O(m^2)\), where m is the size of the message and m is much larger than the number of clients n. FATE's computational complexity is the highest. On the server side, calculations are performed on the ciphertext, resulting in a computational complexity of O(m). However, the SA strategy is also employed, which increases the computational complexity to \(O(mn^2)\). In conclusion, the overall computational complexity is \(O(mn^2)\).

The Flower library and FedML have the same complexity since they both use the DP strategy. On the client side, the computational complexity is determined by the Laplace mechanism, which has a complexity of O(m). Similarly, on the server side, where aggregation is performed, the complexity is also determined by the message size, resulting in O(m). Thus, both the client and server components exhibit a computational complexity of O(m). TFF and FedlabX employ the same privacy strategies, resulting in comparable complexity on both the client and server sides.

Prototype performance

In this sub-section, we first clarify the hardware and public datasets for prototype evaluation. Then, the accuracy results with DP and without DP are shown and explained.

Hardware: Both the central server and the clients have the same hardware settings in this experiment. All experiments were executed on 2x Intel(R) Xeon(R) Gold 6230R CPUs @ 2.10 GHz, one A100 GPU, and 256 GB of RAM. Every component ran in a Docker container, with resources allocated dynamically. One container was dedicated to MongoDB for storing the trained models, and another served as the message queue. These containers had minimal resource usage, except for disk I/O. Accuracy testing was performed separately using a Python script. The experiment was designed so that each client's VRAM usage stayed slightly below 4 GB, ensuring that the 40 GB GPU was not a bottleneck when using 10 clients.

Dataset: We use the standard MNIST [25] and CIFAR-10 [23] datasets to conduct the experiments. MNIST is a handwritten digit recognition dataset consisting of 60,000 training examples and 10,000 testing examples, each a 28 \(\times \) 28 gray-level image. For FL, MNIST is divided into four sub-datasets of 100 samples each. Compared to MNIST, CIFAR-10 is more complex, consisting of 60,000 32 \(\times \) 32 color images in 10 classes, with 6000 images per class. The MNIST and CIFAR-10 datasets are widely used in machine learning for tasks such as image classification. While these datasets do not contain personally identifiable information (PII), they can still be subject to specific privacy attacks. One of the most severe is membership inference, where an attacker attempts to infer whether a specific data point was part of the original dataset [31]. Moreover, although MNIST and CIFAR-10 do not directly contain PII, individuals could potentially be re-identified by combining the datasets with external information; an attacker may attempt to re-identify individuals based on their unique characteristics.

Fig. 5  Test accuracy of FedlabX framework

Accuracy results without DP: We take the FedAvg algorithm (Algorithm 1) on the MNIST and CIFAR-10 databases with IID and non-IID data structures as our baseline. Table 5 shows the accuracy of centralized learning and FL on MNIST and CIFAR-10. More detailed results can be found in Fig. 5, which plots the accuracy of the FedlabX algorithm on the IID and non-IID data structures without DP. The training sets are evenly partitioned among ten clients. For the IID setting, each client is randomly assigned a uniform distribution over the ten classes. For the non-IID setting, the data are sorted by class and divided to create an extreme 1-class non-IID partition, where each client receives data from only a single class. We perform 50 communication rounds of training on MNIST and CIFAR-10. We use the same notation as in Algorithm 1: B, the batch size, and E, the number of local epochs. The following parameters are used for the FedlabX framework: for MNIST, B = 32, E = 25, \(\eta \) = 0.01, and decay rate = 0.995; for CIFAR-10, B = 128, E = 60, \(\eta \) = 0.1, and decay rate = 0.992.
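The IID and 1-class non-IID partitions described above can be produced with a short sketch like the following (our own illustration; the label array is assumed to be a NumPy array of class indices, and synthetic labels stand in for the real MNIST targets):

```python
# Partition dataset indices among clients: IID (uniform random split) versus
# extreme 1-class non-IID (each client receives samples of a single class).
import numpy as np

def iid_partition(labels, num_clients, rng=None):
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(labels))
    return np.array_split(idx, num_clients)

def one_class_partition(labels, num_clients):
    # Assumes num_clients equals the number of classes, as in the 10-client setup.
    return [np.where(labels == c)[0] for c in range(num_clients)]

# Example with synthetic labels standing in for MNIST targets
labels = np.random.default_rng(1).integers(0, 10, size=60_000)
iid_shards = iid_partition(labels, num_clients=10)
noniid_shards = one_class_partition(labels, num_clients=10)
```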

Table 5 Baseline accuracy

The following observations can be made: in the IID setting, the accuracy on the MNIST dataset can even exceed the centralized accuracy. There is a slight drop of about 2% in the non-IID case on MNIST. On the CIFAR-10 database, however, the accuracy reaches around 80% in the IID setting and drops sharply in the non-IID scenario, reaching only about 50%.

Accuracy with DP: From the definition of DP (Definition 3), the parameter \(\epsilon \) measures the ability of the randomized algorithm \({\mathcal {A}}\) to resist attacks: the smaller \(\epsilon \) is, the greater the privacy protection it provides. This is because a smaller \(\epsilon \) means larger Laplace noise, i.e., \(b=\frac{\Delta q}{\epsilon }\). Consequently, the accuracy increases as the \(\epsilon \) value grows. In this experiment, we fix the parameter \(\theta \) to \(10^{-5}\). In most of the literature, algorithms have been evaluated with \(\epsilon \) ranging from as little as 0.1 to as much as 8. In our case, we analyze the accuracy on MNIST and CIFAR-10 with \(\epsilon \) values from 1 to 8. The main results can be found in Figs. 6 and 7.

On the MNIST database, the differences as the \(\epsilon \) value increases are small: the highest accuracy ranges from 89 to 91%. However, with more training rounds, the performance at \(\epsilon = 4\) is better than at \(\epsilon = 8\), which suggests that with a higher \(\epsilon \) fewer training rounds are needed. This effect is most clearly observable around \(\epsilon =1\). Figure 6 suggests we should focus more on the \(\epsilon <1\) and \(\epsilon =1\)–2 ranges.

Fig. 6  Accuracy of different levels of \(\epsilon \) setting with different rounds of training on MNIST

On the CIFAR-10 database, there is a significant drop after adding the noise to the model gradients with DP. When the \(\epsilon \) equals 1, the best accuracy only reaches around 44%. With the increase of \(\epsilon \), the accuracy also improves. The details can be found in Fig. 7.

Fig. 7  Accuracy of different levels of \(\epsilon \) setting with different rounds of training on CIFAR-10

We examined the client training loss with respect to the number of epochs. The training loss serves as a metric for monitoring the progress of local training on each client and for evaluating the convergence and performance of models across clients. In our case, we fixed the number of clients to 4 and averaged all the training losses to obtain the results. As observed, a larger \(\epsilon \) value of DP leads to more loss, particularly when the number of epochs is small. Additionally, the proposed framework suffers less training loss on MNIST than on CIFAR-10 due to MNIST's smaller size. However, the framework converges on both datasets as the number of epochs increases. More details can be found in Fig. 8.

Fig. 8  Client training loss over the epochs

Table 6 Test accuracy of various differentially private training methods on MNIST and CIFAR-10 with different privacy loss and \(\theta =10^{-5}\)

Comparisons with some state-of-the-art FL frameworks

Table 6 depicts the accuracy of various differentially private training methods on the MNIST and CIFAR-10 databases with different privacy losses. LDP-FL [32] adds noise directly to the raw data, which significantly drops the accuracy on CIFAR-10. For a more rigorous comparison, we include FedAvg-LDP and FedAvg-CDP [39], two state-of-the-art algorithms. The difference between them is that LDP adds noise directly to the data, while CDP adds noise after model aggregation. These two algorithms have a structure that closely resembles our framework, making them suitable for comparative analysis. We also include DP-SGD [4], which modifies the activation functions to prevent activations from exploding under differential privacy. DP-SGD applies DP when calculating the gradients instead of adding noise directly to the model or the dataset, which leads to better accuracy. \(\rho \)-zero-concentrated DP-SGD [40] has attracted increasing attention and is also applied in the 2020 Census Disclosure Avoidance System for stronger privacy. However, its accuracy is not satisfactory even on MNIST. FedlabX does not reach the best accuracy but reaches the median.

Limitations of FedlabX

Based on the analysis and evaluations above, it is apparent that the FedlabX framework still has certain limitations that can be attributed to the cryptographic components adopted in this work.

  1. Size limitation and latency of Kafka: The Kafka configuration limits the size of communication messages to 1 MB by default, and the model trained in the FL framework might exceed this limit. To solve this problem, we can modify the broker configuration to increase the allowed maximum message size (a configuration sketch is given after this list). Still, processing large messages consumes more CPU and memory on the producer and consumer sides, which might cause high latency for the end-user.

  2. Computation cost: As the comparison of computational complexity in Table 4 shows, the secure aggregation protocol, especially the key agreement and PRNG generation processes, increases the computational complexity of our framework.

  3. Accuracy drop: The choice of a DP scheme inevitably impacts the accuracy of the framework. This is evident from the results in Table 6, where the accuracy consistently decreases as more noise is added to the algorithm.
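As a hedged illustration of the first limitation, the message-size limit can be raised on the producer side as sketched below; `max_request_size` is a standard kafka-python producer setting and `message.max.bytes` a standard broker setting, but the concrete values are assumptions rather than the configuration used in our experiments.

```python
# kafka-python producer configured to send larger (~10 MB) serialized models.
# The broker (message.max.bytes) and topic (max.message.bytes) limits must be
# raised to at least the same value; all values here are illustrative.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker:9093",
    max_request_size=10 * 1024 * 1024,   # default is ~1 MB; raised for large models
    compression_type="gzip",             # shrinks payloads, at extra CPU cost
)
```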

Conclusion and future work

This paper proposed FedlabX, a practical privacy-preserving framework built on basic FL. We adopted secure aggregation and differential privacy mechanisms to achieve a privacy-preserving scheme, while Kafka supports anonymous authentication and an asynchronous message queue. Our security analysis shows that FedlabX protects against a semi-honest server and semi-honest clients, which is sufficient for a cross-silo FL scenario. The experimental results show that FedlabX's accuracy is higher than that of LDP-FL and centralized DP-FL; however, improvements can still be made compared to TensorFlow DP-SGD. The following future work is worth considering:

  1. How can we set a reasonable \(\epsilon \) with an explanation or justification? If the relationship between the parameter \(\epsilon \) and the probability of an attacker's success can be established, it will give an intuitive understanding of the strength of differential privacy protection.

  2. Based on the current scheme, we can add more privacy-preserving features and comprehensively analyze different schemes, such as MPC and HE.

  3. Non-IID data massively impacts accuracy in FL. Another important research direction is how to improve the accuracy of the FedlabX framework in this setting.