FL-GUARD: A Holistic Framework for Run-Time Detection and Recovery of Negative Federated Learning

Federated learning (FL) is a promising approach for learning a model from data distributed across massive numbers of clients without exposing data privacy. It works effectively in an ideal federation, where clients share homogeneous data distributions and learning behaviors. However, FL may fail to function appropriately when the federation is not ideal, entering an unhealthy state called Negative Federated Learning (NFL), in which most clients gain no benefit from participating in FL. Many studies have tried to address NFL. However, their solutions either (1) apply preventive measures throughout the entire learning life-cycle or (2) tackle NFL only in the aftermath of numerous learning rounds. Thus, they either (1) indiscriminately incur extra costs even when FL could perform well without them or (2) waste numerous learning rounds. Additionally, none of the previous work accounts for clients who may be unwilling/unable to follow the proposed NFL solutions when those solutions are used to upgrade an FL system in use. This paper introduces FL-GUARD, a holistic framework that can be employed on any FL system for tackling NFL in a run-time paradigm: it dynamically detects NFL at an early stage (tens of rounds) of learning and activates recovery measures only when necessary. Specifically, we devise a cost-effective NFL detection mechanism, which relies on an estimation of the performance gain on clients. Only when NFL is detected do we activate the NFL recovery process, in which each client learns an adapted model in parallel with training the global model. Extensive experimental results confirm the effectiveness of FL-GUARD in detecting NFL and recovering from it to a healthy learning state. We also show that FL-GUARD is compatible with previous NFL solutions and robust against clients unwilling/unable to take any recovery measures.


Introduction
Federated learning is a distributed learning paradigm in which multiple devices (also called clients) learn a shared global model collectively without disclosing their private data [1]. In the vanilla FL (FedAvg [1]) system, the global model is learned over iterative rounds, each containing two steps: (1) Training. The clients train the global model on their local data and then submit their local updates to a central server. (2) Aggregation. The server aggregates the local updates and distributes the aggregated model as the new global model to all clients. Such a distributed learning paradigm has found a wide range of applications where data are decentralized and data privacy is essential [2,3].
Despite the advantage of FL in protecting privacy, massive experiments on real-world datasets [4-8] have reported the failure of FL [9,10]. That is, for most clients, the model produced by FL, when tested on a client's private data, cannot achieve better performance than the private model produced by that client's local stand-alone training. Issues leading to FL failure include (but are not limited to):
• data heterogeneity among clients, also called non-IID data [11];
• client inactivity due to connection problems or hardware failure [12];
• attacks from malicious clients/servers, e.g., data poisoning [13] and model poisoning [14,15];
• noises introduced by privacy-protection measures, e.g., differential privacy [16].
Consequences of FL failure include: (1) clients becoming unwilling to participate in FL, (2) wasted rounds of client computation (and client-server interactions), and (3) disintegration of the entire federation in the worst case. Many remedy solutions have been proposed to prevent FL failure [9,12,17-27]. However, an FL system using the existing solutions faces a dilemma: if the remedy is predetermined to be used, it incurs extra (high) costs even if FL could have done well without it; in contrast, if the system chooses not to activate the remedy at first, the possible failure and the need for the remedy would manifest only after hundreds of learning rounds, so the waste of client computation cannot be avoided. Neither choice is perfect unless the necessity of a remedy is known at run-time. However, no previous work addresses this dilemma. Additionally, when employing a remedy solution to upgrade an FL system in use, none of the existing work considers the following realistic scenarios: (1) some clients are unwilling to take the proposed remedy solutions until they observe others benefiting from doing so, and (2) some clients are unable to take any remedy solution due to their limited computing/communication power.
Many questions remain open in the search for a cure for failed FL. For example: How should a failed FL process be defined? Is failure detectable at an early stage (so that numerous futile learning rounds can be saved)? Is it possible to achieve good local performance on clients when the global FL keeps failing? If yes, what countermeasures should clients take? What are the expenses of those countermeasures? What happens if not all clients take such measures?
In this paper, we attempt to answer the above questions. Under the trivial assumption that each client of the federation has a private model previously trained on its local data, our work aims at learning a model for each client that can outperform its private model. We coin a new term, Negative Federated Learning (NFL), to name the undesirable state of an FL system in which the iterative client-server interactions do not help most clients in learning. Based on the vanilla FL paradigm, which is widely used in practice, we propose a novel FL framework called FL-GUARD for tackling NFL in a run-time detection and recovery paradigm.
Our run-time NFL detection scheme relies on a metric called performance gain (PG), i.e., the improvement in learning accuracy that clients obtain from participating in FL. Measuring the PG on a client requires the accuracy of the model learned in the FL system (e.g., the global model learned in the vanilla FL system) on that client's testing dataset. However, testing a model in each round of FL may incur non-trivial extra costs for clients. To avoid such costs, we choose to estimate the PG instead. Specifically, each client leverages its training data as a surrogate to estimate the accuracy of the model learned in the FL system on its testing dataset. The estimated PG is then obtained by calculating the difference between (1) the estimated accuracy of the model learned in the FL system and (2) the accuracy of the private model obtained before FL starts. The estimated PG is uploaded to the server together with the respective local updates. Next, the server computes an overall performance gain during aggregation. If the overall PG value remains negative after a certain number of learning rounds, the system reports the state of NFL. Our detection scheme is cost-effective since the per-client PG is a numeric value that can be transmitted at trivial cost and easily estimated while clients train the model in FL.
Once NFL is detected, the system has to take recovery measures to improve the performance of FL on each client. Our key idea is to personalize the global model learned by vanilla FL. Specifically, each client learns an adapted model in parallel with training the global model. We expect the adapted model to fit the local data distribution while gaining benefits (e.g., generalization ability) from the global model learned in each round of FL. Therefore, the optimization goal of the adapted model is to minimize two terms, i.e., the loss on the local data and the parametric divergence from the global model. However, when the global model cannot bring any valuable knowledge for learning the adapted model, adding the second term may negatively influence model adaptation. To mitigate such negative impacts, we introduce a parameter λ, tuned dynamically during training, to control the weight of the parametric divergence. Tuning λ dynamically at run-time avoids the costs of repeatedly tuning hyperparameters before the start of each FL task, which is a main disadvantage of many existing adaptation methods [10,12,19-21,28].
The contributions of our paper are summarized as follows: (1) We propose FL-GUARD, a holistic framework for tackling NFL in a run-time detection and recovery paradigm. The framework can be easily employed on any FL system to detect NFL and recover from it. To the best of our knowledge, this is the first dynamic solution for tackling NFL at run-time. The importance of a run-time NFL solution is twofold. On the one hand, it does not incur extra (high) costs when vanilla FL performs well. On the other hand, it can save numerous training rounds that would otherwise be conducted in vain when FL indeed needs recovery.
(2) We design a cost-effective mechanism for run-time NFL detection. The proposed scheme can detect NFL at a very early stage of learning and thus save hundreds of futile learning rounds before a necessary recovery is activated.
(3) We also introduce an NFL recovery method, which, activated at run-time per the result of NFL detection, improves the performance of federated learning on individual clients by learning an adapted model for each client.
(4) We conduct extensive experiments on federated image classification and language modeling tasks. The results confirm the effectiveness of FL-GUARD in detecting NFL at an early stage of learning and recovering from it. We also demonstrate that FL-GUARD is compatible with previous NFL solutions and robust against clients unwilling/unable to take any recovery measures.

Related Work
Federated learning, originally developed by Google, is an emerging technique for learning from decentralized data [1]. A significant incentive for a client to engage in FL is to obtain a model better than the private model that it can train independently without cooperation from other clients [10,19]. However, many studies report that such an incentive is not always guaranteed, as many issues can have negative effects on FL. Zhao et al. [29] observed that the accuracy of a model learned by FL may drop by over 50% when data distributions differ across clients. Both McMahan et al. [1] and Briggs et al. [30] showed that client inactivity harms the convergence of federated learning. Bhagoji et al. [14] showed that attackers can easily manipulate the model learned by FL to generate false predictions. Yu et al. [10] argued that differential privacy can also induce a significant performance drop in the model learned by FL. Many real-world FL tasks reportedly suffer from a mixture of all these negative effects and thus fail to outperform local stand-alone training on clients [9,10]. These observations have motivated many solutions against negative impacts on FL. The solutions can be classified into two categories: (category A) global FL and (category B) personalized FL.
Research in category A aims to learn a single global model that performs uniformly well on most clients and to mitigate negative impacts on global model learning. To this end, researchers have proposed enhancing the vanilla FL paradigm with a public dataset shared among all clients to balance their different data distributions [29], with robust aggregation designed against attacks [31], with control variates learned collaboratively by all clients to regularize global model training [22-24,32], and so on. These schemes are designed for a cooperative environment, where all clients are willing to follow the newly proposed FL algorithms despite their extra computation/communication costs compared to FedAvg. In contrast, our scheme is more flexible, allowing for uncooperative clients who stick to the vanilla FL algorithm. In a cooperative environment, though, these schemes can be combined with our solution to improve the performance of the global model.
Research in category B aims to learn multiple personalized models (called adapted models in this paper), each produced for a specific client (or a subset of clients), and to mitigate negative impacts on model personalization/adaptation [9,12,18,19,25-28,33,34]. Our work is related to this category because the fundamental idea of our NFL recovery measure is to perform model adaptation. However, different from existing work in category B, our work not only performs model adaptation in cases of NFL but also detects NFL at run-time to evaluate the necessity of NFL recovery measures. Our proposed NFL detection scheme can be integrated with most studies in category B to save the unnecessary extra expenses those studies would introduce in well-performing FL processes.

Negative Federated Learning
This section presents the definition of NFL. Table 1 summarizes the notations used throughout this paper. We start by introducing vanilla FL.

Preliminaries
Federated learning (FL) is a distributed learning process in which N devices (i.e., clients) train a shared global model collectively without the need to centralize their private data. Vanilla FL [1] learns the optimal global model, denoted as w∗, via iterative rounds of interactions between the clients and a central server. In each round r, the server first distributes the global model w^{r−1} to a set of active clients, denoted as C^r. Then, each client i ∈ C^r uses its private dataset D_i to produce a locally updated model, i.e., w_i^r ← w^{r−1} − η∇L(D_i; w^{r−1}), where η is the learning rate, and reports the local update to the central server. Next, the server aggregates the received updates into a new global model, e.g., via the weighted average w^r ← Σ_{i∈C^r} (n_i/n) w_i^r, where n_i is the size of D_i and n = Σ_{i∈C^r} n_i. Then, the server distributes the new global model to the active clients for the next round of learning.
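To make the round structure concrete, the following sketch runs FedAvg-style rounds on a scalar toy model. The quadratic local loss and the function names are illustrative assumptions, not the systems used in the paper:

```python
def fedavg_round(global_w, clients, lr=0.1):
    """One vanilla-FL round on a scalar toy model: (1) each client takes a
    local gradient step on its own data; (2) the server averages the local
    updates weighted by local data size (FedAvg-style aggregation)."""
    updates, sizes = [], []
    for data in clients:
        # Toy local loss L(D_i; w) = mean((w - x)^2); gradient 2*(w - mean(D_i)).
        mean_x = sum(data) / len(data)
        updates.append(global_w - lr * 2 * (global_w - mean_x))
        sizes.append(len(data))
    n = sum(sizes)
    return sum(u * s for u, s in zip(updates, sizes)) / n

# Usage: two clients with different data pull the model toward the
# size-weighted optimum (the pooled mean, 5/3 here).
w = 0.0
for _ in range(100):
    w = fedavg_round(w, [[1.0, 1.0], [3.0]])
```

With IID data the fixed point of this iteration is the pooled optimum, which is why the ideal federation behaves like centralized training.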
Ideally, the performance of w^r improves as the learning proceeds and approaches that of imaginary centralized learning (i.e., pooling all private data together and training the model on the centralized dataset). However, it turns out that FL can go wrong, resulting in most clients being unable to obtain quality models from FL and thus unwilling to contribute to FL.

Definition of Negative Federated Learning
To decide if FL is doing well, let us assume that every client i has a private model trained independently on D_i beforehand. Negative federated learning (NFL) refers to the state of an FL system in which the model obtained from FL does not win out over the private models of most clients.
To formalize the definition of NFL, we first define a metric describing the performance gain obtained from FL by a client i. Given a model learned by FL and a private model, we denote by V_i the performance of the model learned by FL and by P_i the performance of the private model; the on-client performance gain, denoted β_i, is given by Eq. 1.
Next, we define a system-wide metric β, named the overall performance gain, as a weighted sum of the per-client gains given by Eq. 2. Here, α_i is a weight indicating how much client i matters in the performance gain evaluation. Possible weight schemes include a uniform weight, a weight proportional to local data size, or a positive value indicating local data quality [35].
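Written out, the two metrics take the following form; this is a reconstruction from the surrounding definitions (the symbols β_i, β, and α_i, and the normalization of the weights, are inferred rather than quoted from the paper):

```latex
\beta_i = V_i - P_i \quad \text{(Eq.~1)}
\qquad
\beta = \sum_{i=1}^{N} \alpha_i \,\beta_i ,
\quad \alpha_i \ge 0,\ \sum_{i=1}^{N} \alpha_i = 1 \quad \text{(Eq.~2)}
```

Under the uniform scheme α_i = 1/N, β is simply the average per-client gain.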
We now define NFL as follows. For a given federated learning system (e.g., one running FedAvg [1]), if there does not exist a positive integer R such that for any model learned in this system after round R (e.g., the global model w^r, where r ≥ R, learned in the vanilla FL system), β ≥ 0, then we say the system is in negative federated learning. In such a case, |β| quantifies the overall negative effects on the participating clients.
Note that the NFL concept presented in this paper is defined for the system-wide learning performance. An individual client i may consider itself to be encountering negative learning if its β_i remains negative, and may then decide to take "individual-level measures" to improve its performance gain. We provide a guideline for each client to perform "individual-level measures" at the end of Sect. 4.3. This issue, however, is not further discussed in other sections, since our paper focuses on the collective learning behavior of the federation.

Tackling Negative Federated Learning
This section introduces a novel framework named FL-GUARD for tackling NFL at run-time of federated learning. FL-GUARD treats NFL as a measurable, detectable, and avoidable state that may occur in any FL process. When a system enters NFL, it can utilize our proposed approach to recover to a healthy state. The framework is named FL-GUARD because it acts like a guard of FL systems, detecting and confronting the negative learning problem. Figure 1 shows an overview of FL-GUARD. In the following subsections, we first introduce the method for detecting NFL and then detail the recovery measure.

Detecting Negative Federated Learning
Intuitively, NFL can be detected by checking the value of β after each round of interaction. However, on-client model testing in each round is costly, as it incurs non-trivial extra computation on clients. To avoid such costs, we propose to use the training data as a surrogate to estimate the value of β during the learning process and utilize the estimated value for NFL detection.
We take the vanilla FL system as an example to illustrate how β is estimated efficiently at run-time. In each round r, each participating client i first utilizes its first training batch b_i^r to estimate its performance gain from engaging in FL by Eq. 3, i.e., β_i ← EV(w^{r−1}, b_i^r) − P_i, where EV is a function for evaluating model performance on a given dataset, w^{r−1} is the latest global model that client i received from the vanilla FL system, and P_i is the performance of client i's private model on its testing dataset. The private model that client i possessed before engaging in FL is not updated at FL run-time; thus, client i only needs to compute the value of P_i once before FL starts.
Note that client i can easily obtain the value of EV(w^{r−1}, b_i^r) when performing the feed-forward pass of model w^{r−1} during training. Thus, the additional computation cost introduced by computing Eq. 3 is trivial. EV(w^{r−1}, b_i^r) in Eq. 3 is an unbiased estimate of the performance of model w^{r−1} on client i's testing dataset (i.e., V_i in Eq. 1), because the batch b_i^r is randomly sampled from client i's dataset D_i and has a distribution similar to that of client i's complete testing dataset. When submitting its local update, client i reports the computed β_i value to the server.
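The per-client estimate can be sketched as below; `accuracy` is a stand-in for the paper's EV function, and the one-parameter threshold "model" is purely illustrative:

```python
def estimate_client_gain(eval_fn, global_w, first_batch, private_acc):
    """Per-client performance-gain estimate (Eq. 3):
    beta_i = EV(w^{r-1}, b_i^r) - P_i, using the first training batch
    of the round as a surrogate for the client's held-out test set."""
    return eval_fn(global_w, first_batch) - private_acc

# Toy EV: accuracy of a threshold classifier on labeled scalar points.
def accuracy(w, batch):
    return sum((x > w) == y for x, y in batch) / len(batch)

batch = [(0.2, False), (0.8, True), (0.9, True), (0.1, False)]
# The global model (threshold 0.5) classifies all four points correctly,
# so the estimated gain over a private accuracy of 0.75 is positive.
beta_i = estimate_client_gain(accuracy, 0.5, batch, private_acc=0.75)
```

Since the forward pass on the first batch happens during training anyway, the only extra work here is one subtraction.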
Given all β_i values reported by clients i ∈ C^r in round r, the overall performance gain β is estimated on the server in two steps: (1) The server computes a β^r value by Eq. 4, which selects the median of {β_i | i ∈ C^r}.
(2) The server computes the overall performance gain β by Eq. 5, which averages the β^r values computed over the last c rounds.
The above two steps are designed with malicious clients in mind, who may report fabricated β_i values and cause undesirable fluctuation of β. With these two steps, a malicious client can hardly influence the estimated overall performance gain, no matter whether its fabricated β_i is close to or diverges from the values submitted by honest clients. The server uses β to detect the NFL state. A negative β indicates that the model learned by FL performs worse than most clients' private models, implying the state of NFL. An advantage of using β is that it can be obtained without the high cost of evaluating the learned model on (large) datasets in each round.
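A minimal sketch of the server-side two-step estimate (median per round, then a moving average over the last c rounds); the class and variable names are ours, not the paper's:

```python
from collections import deque
from statistics import median

class GainEstimator:
    """Server-side estimate of the overall performance gain:
    (1) take the median of the per-client values reported this round (Eq. 4);
    (2) average the medians over the last c rounds (Eq. 5).
    The median step damps the influence of fabricated reports."""
    def __init__(self, c=50):
        self.history = deque(maxlen=c)  # keeps only the last c round medians

    def update(self, round_betas):
        self.history.append(median(round_betas))
        return sum(self.history) / len(self.history)

est = GainEstimator(c=3)
est.update([-0.1, -0.2, 5.0])          # the outlier 5.0 is ignored by the median
overall = est.update([-0.1, -0.3, -0.2])
```

A single fabricated β_i thus shifts neither the round median nor the c-round average by more than one honest client could.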
NFL Detection. Based on the above discussion, we develop a cost-effective NFL detection scheme on the server as follows: (1) During iterative FL, the server monitors whether β < 0 after each round of client-server interactions. If this is true in more than NR rounds, where NR is a threshold for Negative Rounds, the server reports a state of NFL.
(2) A reported NFL can be canceled as well.If β ≥ 0 is observed for c consecutive rounds and NFL has previously been reported, the server cancels the NFL report, but the detection process continues.
For early detection of NFL (e.g., within tens of rounds), NR is recommended to be less than 100. However, a too-small NR value may produce a false positive (i.e., report well-performing FL as NFL). Fortunately, this does not negatively influence the learning performance, as the report can be canceled in subsequent rounds when β ≥ 0 is observed for c consecutive rounds. False positives only incur some (unnecessary) recovery steps before the NFL report is canceled.
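Putting the reporting and cancellation rules together, a server-side detector might look like the sketch below (the thresholds here are toy values; the paper recommends NR < 100, and the exact bookkeeping after a cancellation is our assumption):

```python
class NFLDetector:
    """Run-time NFL detection sketch: report NFL once the estimated overall
    gain has been negative in more than NR rounds; cancel the report after
    c consecutive non-negative rounds, while detection continues."""
    def __init__(self, nr=50, c=50):
        self.nr, self.c = nr, c
        self.negative_rounds = 0   # cumulative count of rounds with gain < 0
        self.nonneg_streak = 0     # consecutive rounds with gain >= 0
        self.nfl = False

    def observe(self, overall_gain):
        if overall_gain < 0:
            self.negative_rounds += 1
            self.nonneg_streak = 0
            if self.negative_rounds > self.nr:
                self.nfl = True
        else:
            self.nonneg_streak += 1
            if self.nfl and self.nonneg_streak >= self.c:
                self.nfl = False   # cancel the NFL report
        return self.nfl

det = NFLDetector(nr=3, c=2)
states = [det.observe(g) for g in [-1, -1, -1, -1, 0.1, 0.2]]
```

The trace shows the intended behavior: NFL is reported on the fourth negative round and canceled after two consecutive non-negative rounds.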
Once NFL is detected in a system, it is necessary to take measures to improve the performance of the model learned by FL on individual clients. We achieve this by making a local adaptation to the global model learned by the vanilla FL paradigm while that global model is iteratively trained on each client. The details of model adaptation are presented in the next section.

Recovering From Negative Federated Learning
This section presents how we perform model adaptation on the global model learned by vanilla FL when that global model cannot perform well on most clients. While learning a single global model that performs uniformly well on most (or even all) clients may not be impossible, we believe that performing model adaptation at FL run-time is a much easier approach to achieving better local performance, especially when some clients may fail to take any NFL recovery measures or deliberately disrupt the learning system.
Our model adaptation approach introduces, for each client, an adapted model v_i, which fits the local data better and is tightly coupled with the process of learning the global model w. In each round r, model adaptation updates the adapted model v_i and the global model w^{r−1} simultaneously on client i. The objective function for optimizing the global model is the local training loss ℓ(w^{r−1}, D_i). We do not explain this objective function in detail, as it is the same as in the vanilla FL system.
To optimize the adapted model v_i, we minimize two terms: (1) the loss of the adapted model on the local training data, ℓ(v_i, D_i), and (2) the divergence between the parameters of the adapted model and the learned global model, ‖v_i − w_i^r‖², where w_i^r is initialized by w^{r−1} in each round. The first term is minimized to make the adapted model fit the local data, while the second is minimized to improve the adapted model's generalization performance. We empirically find that indiscriminately minimizing the second term may not always help. When the global model performs well on most clients, minimizing ‖v_i − w_i^r‖² on those clients can improve the generalization performance of their adapted models. In contrast, if the global model performs much worse than the local optimal model on a client, or diverges significantly from that client's local optimal model, minimizing ‖v_i − w_i^r‖² on that client may worsen the performance of its adapted model. We therefore introduce a parameter λ to flexibly control the minimization of ‖v_i − w_i^r‖². λ is computed by Eqs. 6-8, where σ(⋅) is the sigmoid function and ⟨⋅,⋅⟩ is the dot product of two vectors. With λ, the objective function for optimizing the adapted model is given by Eq. 9.
The value of λ becomes small in two cases. First, the global model performs poorly, causing a large loss ℓ(w_i^r, D_i), and thus the first σ(⋅) in Eq. 8 gives a small value. Second, the parametric difference (v_i − w_i^r) strongly disagrees with the gradient of ℓ(v_i, D_i), so the second σ(⋅) produces a small output. In both cases, we want to play down the impact of the global model and allow ℓ(v_i, D_i) to dominate the learning process. Conversely, if the global model has a much smaller training error and the two minimization goals (i.e., minimizing ℓ(v_i, D_i) and minimizing ‖v_i − w_i^r‖²) agree, it is relatively safer to put more emphasis on the second goal. The above discussion explains the usage of λ in Eq. 9.
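One plausible reading of Eqs. 6-9 is sketched below. The exact arguments of the two sigmoids are assumptions inferred from the discussion above (small when the global model's local loss is large, and small when the pull toward the global model disagrees with the local gradient direction), and the proximal weight `mu` is also an assumed hyperparameter:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def adaptation_weight(loss_global, grad_v, v, w):
    """Assumed form of Eqs. 6-8: lambda is a product of two sigmoids.
    The first shrinks when the global model's local loss is large; the
    second shrinks when <grad of l(v_i, D_i), v_i - w_i^r> is negative,
    i.e., when the two minimization goals disagree."""
    agree = dot(grad_v, [vi - wi for vi, wi in zip(v, w)])
    return sigmoid(-loss_global) * sigmoid(agree)

def adapted_step(v, w, grad_v, lam, lr=0.1, mu=1.0):
    """One gradient step on Eq. 9 (assumed form):
    l(v_i, D_i) + lam * (mu / 2) * ||v_i - w_i^r||^2."""
    return [vi - lr * (g + lam * mu * (vi - wi))
            for vi, g, wi in zip(v, grad_v, w)]

# Usage: a large local loss of the global model shrinks lambda, so the
# adapted model mostly follows its own local gradient.
lam_bad = adaptation_weight(5.0, [1.0], [2.0], [0.0])
lam_good = adaptation_weight(0.1, [1.0], [2.0], [0.0])
```

This reproduces the qualitative behavior described above: λ stays in (0, 1) and is much smaller when the global model fits the local data poorly.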

FL-GUARD Operations
Algorithm 1 presents the complete pseudo-code for implementing our FL-GUARD framework, which adds NFL detection and model adaptation for NFL recovery on top of the vanilla FL system. At the end of each round r, clients return only the local update w_i^r along with a single float value β_i to the server. Thus, the communication cost of FL-GUARD is almost the same as that of vanilla FL. Moreover, the transmission of β_i does not raise new privacy-protection challenges like the extra variates transmitted in previous NFL solutions, e.g., SCAFFOLD [22], FedNova [23], and FedLin [24].
There are two modes for using FL-GUARD: (1) Detection and recovery, which is described in Algorithm 1 and highly recommended. An FL system running in this mode operates without the costs of NFL recovery until it detects NFL. Only when NFL is reported is an indicator, denoted as NFL, set. Then, the clients start recovery by activating model adaptation. (2) All-time recovery. Alternatively, the system may activate NFL recovery for the entire learning life cycle by setting NFL ← True at line s1 and canceling the operations at lines s7-s14. In either mode, activating model adaptation does not change the results of learning the global model. However, once model adaptation is activated, each client utilizes the adapted model (instead of the learned global model) for its local inference/testing. Thus, the performance gain metric β_i is estimated as β_i ← EV(v_i, b_i^r) − P_i for client i. We recommend the "detection and recovery" mode, since our NFL detection is inexpensive and the costs of (unnecessary) NFL recovery can be avoided in a well-performing FL process.
As an additional refinement, we provide a guideline for each client to perform "individual-level measures" in FL-GUARD. Specifically, each client can independently start model adaptation, even when the system-wide NFL state is not reported, if its locally computed β_i remains negative for more than NR rounds. Moreover, each client can stop its activated model adaptation to save computation cost if it observes β_i ≥ 0 for c consecutive rounds. FL-GUARD allows these "individual-level measures" because they pose no harmful impact on the system-wide learning results.

Advantage of FL-GUARD
Compared to existing solutions for tackling NFL, FL-GUARD has the following advantages: (1) FL-GUARD tackles NFL dynamically. It can not only activate recovery at system run-time based on the result of NFL detection but also cancel the NFL report and stop recovery when it becomes unnecessary. Such dynamicity avoids extra expenses in well-performing FL processes. More importantly, as we will see in the experiments, when vanilla FL is sufficient to perform well, indiscriminate activation of NFL recovery measures (ours or most of the previous ones) may slightly harm performance. This result justifies the need for the detection and recovery mode.
(2) FL-GUARD can be easily employed upon any FL system for run-time NFL detection and recovery and is compatible with most existing NFL handling techniques. This advantage is demonstrated empirically in Sect. 5.4. An important benefit of integrating existing NFL recovery techniques with our NFL detection scheme is saving the (unnecessary) extra expenses of these techniques when vanilla FL is sufficient to perform well.
(3) FL-GUARD is robust against clients who do not take any NFL recovery measures. As mentioned in Sect. 1, when upgrading an FL system in use, it is common to see (i) some clients unwilling to adopt the newly proposed techniques until they observe others benefiting from doing so and (ii) some clients unable to adopt the newly proposed techniques due to their limited computing/communication power. These clients tend to remain vanilla FL clients, who stick to FedAvg and disable all NFL recovery measures even when the FL system is in the NFL state. Fortunately, our FL-GUARD clients (i.e., those activating model adaptation when NFL is detected) can still achieve performance gains from FL, even if the others (e.g., attackers, resource-constrained clients, etc.) remain vanilla FL clients. Such robustness, as shown in Sect. 5.4, is important for employing a newly proposed technique to upgrade an FL system in use.

Experiments and Results
We evaluate the performance of FL-GUARD on two typical federated learning tasks, namely CIFAR (image classification on CIFAR-10 [36]) and SHAKE (language modeling on Shakespeare dataset [7]).The major issues that may lead to NFL, which include data heterogeneity among clients, client inactivity, attacks from malicious clients, and noises introduced by differential privacy protection, are all considered in our FL tasks.
Table 2 summarizes the default environment settings in different tasks and the hyperparameters used for model training.Our experiment settings all follow previous literature.Unless otherwise stated, all experiments in this paper are conducted under the default settings specified in Table 2. Below, we detail the default environment setup.
Simulation of non-IID local data. We simulate unbalanced and non-IID data distributions across clients in a way similar to previous work [1,11]. For CIFAR-10, we allocate its 50,000 training samples and 10,000 testing samples to 100 clients based on a "non-IID (mixed)" scheme. Specifically, we first sort the CIFAR-10 dataset by labels and then set up a case where 50 clients have samples from 10 classes, 30 clients have samples from 5 classes, and the remaining 20 clients have samples from 2 classes. Moreover, we set the amount of data allocated to clients following a Lognormal(0, 2²) distribution.
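A simplified sketch of the "non-IID (mixed)" allocation is shown below. It captures only the limited-classes-per-client aspect; shard sizes here are uniform rather than Lognormal, and the helper names are ours:

```python
import random

def noniid_allocate(samples, client_specs, per_class=5, seed=0):
    """Simplified non-IID allocation: each client draws all of its samples
    from a limited set of classes.
    samples: list of (x, label) pairs.
    client_specs: list of (num_clients, classes_per_client) groups,
    mirroring the mixed scheme (e.g., 50 clients x 10 classes, ...)."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in samples:                       # group samples by class label
        by_label.setdefault(y, []).append((x, y))
    labels = sorted(by_label)
    allocation = []
    for n_clients, k in client_specs:
        for _ in range(n_clients):
            client_data = []
            for lab in rng.sample(labels, k):  # pick k classes for this client
                pool = by_label[lab]
                client_data.extend(rng.sample(pool, min(per_class, len(pool))))
            allocation.append(client_data)
    return allocation

# Toy usage: 100 samples over 10 classes; 2 clients see 2 classes each.
data = [(i, i % 10) for i in range(100)]
clients = noniid_allocate(data, [(2, 2)])
```

The real setup additionally skews client sizes with a Lognormal(0, 2²) draw, which this sketch omits for brevity.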
For Shakespeare, we follow previous work [1,7,37] to allocate each speaking role to one client and term this allocation scheme "non-IID (by role)". Following Wang et al. [37], we filter out clients with fewer than 10,000 data samples and then sample a random subset of 100 clients. In each client, we allocate 90% of the data for local training and the remainder for testing. The sampled dataset contains a total of 1,609,724 samples for training and 171,017 for testing.
We also form a balanced and IID data allocation on CIFAR-10 and Shakespeare for reference and comparison. In the IID scheme, the original CIFAR-10/Shakespeare dataset is first shuffled and then allocated to 100 clients. Each client receives the same amount of data samples and owns samples from all ten classes in CIFAR-10 and all 80 classes in Shakespeare.

Simulation of client inactivity. We simulate client inactivity by uniformly sampling at random 10% of all clients (K/N = 10%, where K is the number of active clients) in every round to participate in model training, the same setting as in most previous work [1,10,23,38,39]. Among the selected clients in every round, several are malicious and poison the global model being learned by FL. The details of the attacks are given in the next paragraph.
Simulation of attacks. Our experiments randomly select some of the clients to be malicious attackers who poison the global model by reporting model parameters updated on label-flipped data samples. Details of the attack strategy can be found in previous work [40]. Based on the state-of-the-art attacks on FL [40-42], we set 30% of all clients to be attackers.
Simulation of differential privacy protection. We follow previous work [10,16] to simulate differential privacy protection on the server, i.e., to aggregate local updates and produce the updated global model in each round r by w^r ← w^{r−1} + (1/K) Σ_{i∈C^r} clip(Δ_i^r, S) + N(0, σ²I), where the norm of each local update Δ_i^r is clipped with an upper bound S and Gaussian noise N(0, σ²I) is added. Hyper-parameters are set as S = 15, σ = 0.001 for CIFAR, and S = 50, σ = 0.001 for SHAKE.

[Fig. 2: Using β for NFL detection. Results are reported based on the vanilla FL system. The pink curve shows results under the default environment setup specified in Table 2. The blue curve shows results under an ideal FL scenario, in which we remove most of the negative effects on FL by using IID data allocations in both tasks and removing all attackers.]

[Table 3: Tuning the parameter NR used in the NFL detection scheme. We show the index of the round in which NFL is reported; a bar (-) means that NFL is not reported. We recommend setting NR = 50 and NR = 70 on CIFAR and SHAKE, respectively, for both a quick response to NFL under the default setup specified in Table 2 and no false positives under the ideal FL setup specified in Fig. 2.]

We mainly monitor two metrics to evaluate whether FL brings benefits to its participating clients: (1) the average local accuracy (ACC), evaluated for the model that the FL system provides for on-client local inference, by testing that model on each client's private testing dataset; and (2) the average performance gain (β), compared to local stand-alone training.
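The simulated DP aggregation can be sketched as follows; the 1/K scaling of the clipped updates is an assumption consistent with the description above, and the per-coordinate noise draw stands in for N(0, σ²I):

```python
import math
import random

def dp_aggregate(prev_global, local_updates, clip_bound, sigma, seed=0):
    """Server-side DP aggregation sketch: clip each local update's L2 norm
    to clip_bound (S), average the clipped updates, and add Gaussian noise
    with standard deviation sigma to each coordinate."""
    rng = random.Random(seed)
    clipped = []
    for delta in local_updates:
        norm = math.sqrt(sum(d * d for d in delta))
        scale = min(1.0, clip_bound / norm) if norm > 0 else 1.0
        clipped.append([d * scale for d in delta])
    k = len(clipped)
    return [w + sum(c[j] for c in clipped) / k + rng.gauss(0, sigma)
            for j, w in enumerate(prev_global)]

# Usage: an update with norm 5 is rescaled to norm 1 before averaging
# (sigma = 0 here so the clipping effect is visible in isolation).
new_w = dp_aggregate([0.0, 0.0], [[3.0, 4.0], [0.0, 0.0]],
                     clip_bound=1.0, sigma=0.0)
```

Clipping bounds each client's influence on the aggregate, which is what makes the added Gaussian noise meaningful for privacy accounting.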
Each FL process is given a fixed number of learning rounds (i.e., 1000 rounds for CIFAR and 500 rounds for SHAKE). All experiments are repeated for three runs with different random seeds. We report results averaged over these runs, and the results shown in the tables are averaged over the last 10 rounds.

Results of NFL Detection and Recovery
We first justify the usefulness of β for NFL detection and then study the proposed NFL detection/recovery scheme.
Using β for NFL Detection. Fig. 2 reports the results of β when the vanilla FL system is in NFL and in ideal FL, respectively. We also plot each round's true δ value for reference. In the NFL state (pink curve), both β and the true δ value keep fluctuating below zero, implying the poor performance of federated learning on individual clients. In contrast, the value of β/δ quickly rises above zero in ideal FL. β tells NFL apart from ideal FL in only tens of learning rounds. The results confirm the usefulness of β as a surrogate of the true δ for fast NFL detection. Note that the amplitude of fluctuation of β is smaller than that of δ in NFL. This is mainly due to the smoothing effect of averaging β over the last c = 50 rounds.
The NR Threshold for NFL Detection. Table 3 presents the number of learning rounds needed by our detection scheme, with different NR (Negative Rounds) thresholds, to report NFL. Generally, our detection method responds to NFL very quickly, though it causes false positives in the case of ideal FL when NR < 50 on CIFAR and NR < 70 on SHAKE (since the threshold value is too small). As NR increases, the false positives disappear, but it takes more rounds for the system to report NFL. False positives have negligible impacts on learning performance (since they can be canceled when β ≥ 0 is observed for c consecutive rounds). However, they (unnecessarily) incur double learning costs on all clients. Therefore, we set a moderate value of NR = 50 for CIFAR and NR = 70 for SHAKE. Learning on SHAKE converges more slowly than learning on CIFAR and thus needs a greater NR to prevent false positives.
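The NR-threshold detection rule can be sketched as a simple streak counter over the per-round β estimates (a minimal illustration; the function name is ours):

```python
def detect_nfl(beta_history, NR=50):
    """Report NFL (return the 1-based round index) once beta has stayed
    below zero for NR consecutive rounds; return None otherwise."""
    negative_streak = 0
    for r, beta in enumerate(beta_history, start=1):
        negative_streak = negative_streak + 1 if beta < 0 else 0
        if negative_streak >= NR:
            return r
    return None
```

A single non-negative β resets the streak, which is why small NR values risk false positives early in ideal FL while large NR values delay the report.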
Learning with NFL detection and recovery. With the recommended parameter settings above, we monitor β and the run-time ACC of FL-GUARD running in both the detection-and-recovery mode and the all-time recovery mode. To better study the effect of recovery, we set the recovery in the above modes to be long-term (i.e., never stopped after being activated) and try an additional short-term recovery mode, in which recovery is activated by NFL detection and then, after β > 0 is observed for c = 50 consecutive rounds, the system cancels the NFL report and stops the recovery. Note that short-term recovery is exactly the same as detection-and-recovery until recovery stops.
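The short-term recovery mode can be viewed as a two-state machine: recovery is activated after NR consecutive negative β values and canceled after c consecutive non-negative ones. A minimal sketch (names are ours, not from the paper):

```python
def run_short_term_recovery(beta_stream, NR=50, c=50):
    """Track whether recovery is active in each round. Activation: NR
    consecutive rounds with beta < 0. Cancellation: c consecutive rounds
    with beta >= 0 while recovering. Returns the per-round active flags."""
    recovering, neg_streak, pos_streak = False, 0, 0
    history = []
    for beta in beta_stream:
        if not recovering:
            neg_streak = neg_streak + 1 if beta < 0 else 0
            if neg_streak >= NR:
                recovering, neg_streak, pos_streak = True, 0, 0
        else:
            pos_streak = pos_streak + 1 if beta >= 0 else 0
            if pos_streak >= c:
                recovering = False  # cancel the NFL report, stop recovery
        history.append(recovering)
    return history
```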
The results shown in Fig. 3 reveal the following findings:
• First, the detection scheme reports NFL quickly (in tens of learning rounds), confirming the usefulness of β for fast NFL detection.
• Next, as a result of activating recovery, the green ACC increases rapidly, overtaking the gray line (the performance of local stand-alone learning), and soon approaches the blue ACC (all-time recovery). These results indicate the effectiveness of model adaptation in run-time NFL recovery.
• The results of the green ACC also indicate that once recovery stops, the performance stops growing too. However, the system does not return to NFL, because the adapted models are kept by the clients for local inference.
• Finally, the orange curves show further performance improvements if the recovery continues, reaching almost the same final ACCs as the all-time recovery.
All the above findings clearly show the effectiveness and efficiency of our proposed detection and recovery scheme and the performance improvements from model adaptation.

Comparison with Previous Methods
We compare FL-GUARD with the following approaches:
1. FedProx [38], which utilizes a proxy regularization term to tackle statistical data heterogeneity in the federation;
2. FedMedian [31], a robust aggregation approach that takes the coordinate-wise median of local updates to prevent the global model from being poisoned;
3. TrimmedMean [31], similar to FedMedian, but takes the coordinate-wise trimmed mean of local updates in aggregation;
4. multi-Krum [43], another robust aggregation approach, which discards the top-k local updates with a relatively large distance to the others;
5. K-norm [28], similar to multi-Krum, but discards the top-k local updates with relatively large norms;
6. FT [12], which fine-tunes the global model on the client after an FL process ends;
7. FB [12], similar to FT, but fine-tunes only the top layer of the global model on the client after an FL process ends;
8. KD [10], which augments FT with knowledge distillation;
9. MTL [10], which augments FT with multi-task learning;
10. PerFedAvg(HF) [44], similar to FT, but utilizes the meta-learning technique for training the global model;
11. APFL [21], a recent adaptation approach that integrates the global model with a per-client local model;
12. Ditto [28], a recent work similar to ours, but it regularizes the adapted model with a frozen copy of the global model.
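FedMedian and TrimmedMean, two of the robust aggregation baselines listed above, reduce to simple coordinate-wise statistics over the stacked local updates; a minimal numpy sketch:

```python
import numpy as np

def fed_median(updates):
    """FedMedian-style aggregation: coordinate-wise median of local updates."""
    return np.median(np.stack(updates), axis=0)

def trimmed_mean(updates, trim=1):
    """TrimmedMean-style aggregation: per coordinate, drop the `trim`
    largest and `trim` smallest values, then average the rest."""
    arr = np.sort(np.stack(updates), axis=0)
    return arr[trim:len(updates) - trim].mean(axis=0)
```

Both statistics are insensitive to a bounded fraction of extreme (e.g., poisoned) updates, which is the intuition behind their robustness.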
For a fair comparison, all the above approaches are implemented upon the vanilla FL (FedAvg) system. The results of FedAvg are also reported for reference. Our NFL solution is compared with only the most competitive work, instead of all the work, in each category of existing NFL handling approaches (presented in Sect. 2), because approaches in the same category perform similarly. Later, in Sect. 5.4, we will show that our NFL solution is compatible with many existing NFL handling approaches.
Comparison under the default environment settings. As presented in the upper half of Table 4, FL-GUARD outperforms all previous approaches in tackling NFL in the vanilla FL system, with its all-time recovery mode leading on all metrics. When the detection-and-recovery mode is used, the performance of FL-GUARD drops slightly (by less than 0.8), but it still outperforms most previous approaches.
Comparison under the ideal FL environment settings. The lower half of Table 4 presents the comparative results under the ideal FL environment settings. Almost all previous approaches cannot compete with the vanilla FL paradigm (FedAvg). Some even reduce ACC by as much as 6%, and FL-GUARD with all-time recovery also harms learning performance. In contrast, FL-GUARD running in the detection-and-recovery mode still performs better on both datasets than the other methods. The reason is that, in the detection-and-recovery mode, FL-GUARD never activates recovery when FL performs well without it; in this case, FL-GUARD is exactly the same as vanilla FL. This explains the identical results of FedAvg and the bottom row, and justifies the need for detection and recovery.
The above results reveal the most important advantage of FL-GUARD: the ability to dynamically tackle NFL at run-time, a feature not reported in any previous work. If NFL occurs, it can be detected and recovered from; if it does not, FL-GUARD behaves exactly like vanilla FL and incurs no recovery cost. The observations under the "IID" and "non-IID (by role)" schemes are similar to those in previous work [1].
Table 6 demonstrates that client inactivity does pose some negative impacts on FL, since both methods see a reduction in accuracy as the ratio of active clients in each round drops from 90% to 10%. Compared to FedAvg, the accuracy and local performance gain achieved by FL-GUARD are much more stable. Note that the change in learning accuracy may not be monotonic in the ratio of active clients. This is probably because, under our test environment where clients' data are non-IID, more active clients may slightly enhance the random disparities in the reported model parameters.
Table 7 shows the negative impacts of the poisoning attacks. As the proportion of attackers increases, the performance metrics of FedAvg degrade significantly. However, FL-GUARD is much more resilient against such negative impacts. In particular, FL-GUARD provides a high positive gain in ACC (δ > 10) even when the ACC of the global model drops by nearly 15. These results further confirm the effectiveness of FL-GUARD in adapting to local data and thereby salvaging the local performance of FL on clients.
Table 8 presents the negative impacts caused by the noise introduced by differential privacy. A reduction occurs in both accuracy and local performance gain as the noise scale increases from σ = 0.001 to σ = 0.005. The change in the results on SHAKE is less pronounced and not strictly monotonic compared to the results on CIFAR. This indicates the better resilience of the federated LSTM against noise from differential privacy compared with the federated CNN. In contrast to FedAvg, FL-GUARD produces much better accuracy. These results show that FL-GUARD is more resilient against large noise and can adapt well to client data.

Additional Properties of FL-GUARD
We conduct additional experiments to study the compatibility of FL-GUARD with previous techniques for tackling NFL. We also investigate the robustness of FL-GUARD against vanilla FL clients who stick to FedAvg and do not take any NFL recovery measures. Meanwhile, we evaluate partial layer adaptation in FL-GUARD, namely making local adaptations to only the top layer(s) of the model to reduce the recovery cost. All experiments in this section are conducted under the default environment settings. We present the results on CIFAR here, as those on SHAKE are quite similar.
Compatibility with previous NFL recovery techniques. Table 9 presents the results of FL-GUARD combining our NFL detection scheme and model adaptation with previous NFL recovery techniques, including FedProx [38], TrimmedMean [31], multi-Krum [43], and K-norm [28]. We also try replacing our model adaptation approach with APFL and Ditto, which produce competitive results in Sect. 5.2. It can be seen that integrating these techniques into FL-GUARD does not significantly influence the system performance. This illustrates the flexibility of our FL-GUARD framework in adopting different kinds of NFL recovery techniques along with our NFL detection scheme (and model adaptation) for tackling NFL in a run-time paradigm.

Robustness against vanilla FL clients. We now investigate the robustness of FL-GUARD against vanilla FL clients who stick to FedAvg and do not take any NFL recovery measures. We show (1) the average δ of all clients in FL and (2) the average δ of FL-GUARD clients (i.e., those following the rules in FL-GUARD to run the recovery scheme). As shown in Fig. 4, when the proportion of vanilla FL clients increases from 0% to 80%, the average δ of all clients quickly drops below zero, indicating that the entire system's resistance to NFL is weakened by the increasing proportion of vanilla FL clients. However, the average δ of FL-GUARD clients is much more resilient against such variation, remaining well above zero even when the vanilla clients account for over 50%. These results show the robustness of FL-GUARD against vanilla FL clients and again confirm the effectiveness of our recovery scheme. Such robustness has never been reported in any previous work, but it is important when using a new technique to upgrade an FL system in use. The performance gains obtained by FL-GUARD clients can be persuasive evidence for attracting vanilla FL clients into our FL-GUARD community.
Partial layer adaptation. Finally, we show that when adopting model adaptation for NFL recovery, the recovery costs can easily be reduced via partial layer adaptation, i.e., making local adaptations to only the parameters in the top-most layer(s) of the neural model on each client while freezing (namely, refraining from updating) the lower layers. Table 10 presents the results of partial layer adaptation with different numbers of frozen lower layers (L_fb). Surprisingly, no apparent loss in accuracy is observed as the number of frozen layers increases. The results indicate that, when making model adaptation in FL-GUARD, it is safe to freeze several lower layers of neural models to reduce the adaptation costs without compromising accuracy.
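The idea of partial layer adaptation can be sketched abstractly: local gradient updates are applied only to layers above the frozen boundary (a toy illustration using scalar "layers"; the names and the SGD-style update rule are ours):

```python
def partial_adapt(layers, grads, lr=0.1, freeze_lower=2):
    """Partial layer adaptation: apply a local gradient step only to layers
    with index >= freeze_lower; lower layers are frozen (left unchanged)."""
    return [w if i < freeze_lower else w - lr * g
            for i, (w, g) in enumerate(zip(layers, grads))]
```

Freezing the lower layers skips their backward computation and storage on the client, which is where the adaptation-cost saving comes from.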

Conclusion
This paper addressed the problem of negative federated learning (NFL). We proposed to leverage an estimated performance gain for each client from participating in FL to detect NFL at the early stage of learning. The estimated per-client performance gain was obtained efficiently by leveraging the information produced when clients train the global model. We also designed a technique for NFL recovery, which additionally learns an adapted model for each client. All these techniques are employed in a holistic framework called FL-GUARD, which tackles NFL in a run-time detection and recovery paradigm. Extensive experiments showed that FL-GUARD achieved higher accuracy than previous approaches and handled various NFL (and ideal FL) scenarios effectively and efficiently. We also showed that FL-GUARD was compatible with existing NFL recovery techniques and robust against clients who do not take any recovery measures. For future work, we are interested in exploring more effective and efficient schemes for NFL detection and recovery to further enhance FL-GUARD.

Fig. 1
Fig. 1 An overview of FL-GUARD. Dashed arrows denote the optional model adaptation, which is activated after NFL is detected to recover to a healthy learning state

Fig. 3
Fig. 3 Run-time results of FL-GUARD under the default environment settings.The gray lines in the figures of ACC indicate the performance of local stand-alone training obtained before FL starts

Fig. 4
Fig. 4 FL-GUARD is robust against vanilla FL clients

Table 1
Main notations used throughout this paper

Table 2
Task profiles and default experimental settings. The architecture of the neural model trained in each FL task is the same as described in previous work [1]

Table 6
Varying the ratio of active clients in each round FL-GUARD is robust against the negative effects introduced by the decreasing ratio of active clients

Table 7
Varying the proportion of attackers in each round

Table 9
FL-GUARD is compatible with previous techniques for tackling NFL

Table 10
Results of partial layer adaptation with different numbers of frozen lower layers (L_fb)