FedTCR: communication-efficient federated learning via taming computing resources

Federated learning (FL) enables multiple distributed devices to collaboratively learn a shared global model while keeping training data local. Due to the synchronous update mode between the server and devices, the straggler problem has become a significant bottleneck for efficient FL. Existing approaches attempt to tackle this issue with asynchronous model aggregation. However, these studies only change the global model updating manner to mitigate the straggler effect; they do not investigate the intrinsic causes of the straggler effect and therefore cannot fundamentally solve the problem. Furthermore, asynchronous approaches usually ignore slow-responding but important local updates while frequently aggregating fast-responding ones throughout training, which may degrade model accuracy. Thus, we propose FedTCR, a novel Federated learning approach via Taming Computing Resources. FedTCR includes a coarse-grained logical computing cluster construction algorithm (LCC) and a fine-grained intra-cluster collaborative training mechanism (ICT) as part of the FL process. The computing resource heterogeneity among devices and the communication frequency between devices and the server are indirectly tamed during this process, which substantially resolves the straggler problem and significantly improves the communication efficiency of FL.
Experimental results show that FedTCR achieves much faster training, reducing the communication cost by up to 8.59× while improving model accuracy by 13.85%, compared to state-of-the-art FL methods.


Introduction
The number of smart devices, such as smartphones and wearable devices, has grown exponentially over the last few years. Currently, there are around 3 billion smartphones and 7 billion connected Internet of Things (IoT) devices in the world [1]. Meanwhile, the vast amount of valuable data generated by these devices provides a huge opportunity for crafting sophisticated machine learning (ML) models. To improve the communication efficiency of FL, approaches that reduce the number of communication rounds [7,8] or the overall transmission bits [9,10] have been presented. These works require the cloud-based server to synchronously aggregate the global model and also assume that there is no resource or data heterogeneity. However, in practice, devices have heterogeneous data or (communication and computing) resources, which forces the server to wait for the slowest device (i.e., the straggler [11-13]) during the model aggregation stage, significantly prolonging model training time.
To address the straggler problem in synchronous FL, various asynchronous communication schemes (e.g., [14,15]) have been introduced. Asynchronous communication mechanisms take full account of resource heterogeneity and aggregate the global model asynchronously throughout training. In each training iteration, the server can update the global model after receiving a single local update, without waiting for the straggling clients [16]. Compared with traditional synchronous FL approaches that require the server to wait for stragglers (i.e., slower clients) at each global iteration, current asynchronous schemes reduce the waiting delay for model aggregation and improve the communication efficiency of FL. However, these research efforts only modify the global model updating mode to mitigate the straggler effect; they do not explore the intrinsic causes of the straggler issue, and thereby do not fundamentally improve the communication efficiency of FL. Furthermore, asynchronous methods usually aggregate fast devices with short response latency frequently while decreasing the involvement of slow-responding but important local updates, which significantly degrades model performance. Two natural questions arise:
- What are the intrinsic causes of the straggler problem?
- How can we fundamentally mitigate the straggler effect while guaranteeing model accuracy?
In fact, the heterogeneous computing capacity of edge devices is the most critical cause of the straggler effect. An FL system usually involves a large number of heterogeneous edge devices, which are equipped with different computing capacities (e.g., CPU, memory). Furthermore, computing capacity is the most significant factor in determining the model response time (i.e., the time a client takes to finish a single round of training) [17]. As the amount of CPU resources allocated to a client increases, its model training time decreases near-linearly [18]. Thus, differences in computing capability may be the underlying factor behind the straggler effect in FL. If we can tame the heterogeneity of computing resources in the FL system and keep devices with a minimal gap in computing capability, then the straggler problem can be fundamentally solved. In addition, by taming the heterogeneity of clients' computational resources, we can increase the frequency with which slow-responding devices participate in model training compared with traditional asynchronous approaches, thus improving model performance.
Nevertheless, there are several challenges in realizing these intuitive ideas. On one hand, each device in the FL setting has a fixed amount of computing capacity, which is an inherent physical property, so we cannot directly remove this heterogeneity. Therefore, how to tame the heterogeneity of computing resources without changing the physical properties of devices is a key point in realizing our insights. In addition, existing asynchronous methods reduce the participation of slow-responding devices during training, which degrades model performance compared with standard FL approaches. Accordingly, how to tame the heterogeneous computational resources and design reasonable updating mechanisms that increase the engagement of slow-responding devices is another critical issue for improving model accuracy.
To overcome these shortcomings, we propose FedTCR, a novel federated learning approach via taming computing resources. FedTCR includes a coarse-grained logical computing cluster construction algorithm (LCC) and a fine-grained intra-cluster collaborative training mechanism (ICT) as part of the FL process. The computing resource heterogeneity among devices is indirectly tamed during this process, which substantially resolves the straggler problem and improves model performance. Our main contributions are as follows: (1) We propose a novel communication-efficient FL framework called FedTCR. To the best of our knowledge, it is the first work to explore the inherent causes of the straggler problem and propose a communication-efficient approach that essentially addresses it, which opens up a new perspective for improving the communication efficiency of FL. FedTCR constructs a virtual logical computing cluster and designs an intra-cluster collaborative training mechanism, which mitigates the impact of computational resource heterogeneity on the straggler effect and increases the involvement of slow-responding devices during training. (2) To construct a virtual logical computing cluster, we propose a coarse-grained logical computing cluster construction algorithm (LCC). LCC partitions devices into multiple clusters with minimal differences in overall computing capacity and regards each cluster as a logical device in the FL system, which indirectly tames the heterogeneity in computing capacity.

Related work
In this section, we discuss work related to our problem, covering synchronous and asynchronous federated communication optimization.

Synchronous federated communication optimization
McMahan et al. [4] propose the FedAvg algorithm, in which, instead of communicating with the server after every iteration, each client performs multiple iterations of local updating to compute a more convergent weight update. By reducing the overall communication frequency between devices and the server, the required communication overhead of FL is significantly mitigated. Similar to the work in [4], Xin et al. [19] adopt a two-stream model with an MMD (Maximum Mean Discrepancy) constraint on every selected device, enabling each device to learn more generalized features from the global model in each training iteration. This approach greatly accelerates model convergence and improves the communication efficiency of FL. Based on the observation that a large number of parameters in complex training models are sparsely distributed and close to zero [20], various studies, including Gaia [21], CMFL [22], and eSGD [23], improve communication efficiency by transferring only a small fraction of important or relevant local updates to the server for aggregation. Apart from reducing the required communication frequency, plenty of compression-based approaches have been proposed that aim to reduce the total bits transmitted by each client. Konečný et al. [9] present two approaches, structured updates and sketched updates, to compress the transferred local updates. Jeong et al. [24] introduce a federated distillation approach that compresses the model via an online version of knowledge distillation. Similarly, the authors in Refs. [25,26] propose to periodically quantize the gradient before uploading it to the server. The authors in Ref.
[27] propose a new compression framework called sparse ternary compression (STC), which extends the existing top-k gradient sparsification technique to meet the requirements of the federated learning environment. Recently, the authors in Refs. [28,29] propose to prune the architecture of the training network and upload only a small sub-network to the server. In summary, synchronous federated communication optimization methods attempt to improve communication efficiency by reducing the number of communication rounds or communication bits. However, these schemes assume that local clients have no resource heterogeneity. In a real FL scenario, devices usually have heterogeneous data or computing and communication resources. In each communication round, the server has to wait for the slowest client (i.e., the straggler) to upload its local update for aggregation, which significantly prolongs model training time.

Asynchronous federated communication optimization
To alleviate the straggler effect in standard FL, Li et al. [11] propose the CluFed algorithm, in which a fraction of fast-responding devices within a predefined training deadline are selected in each global aggregation. As an improvement, the authors in [12] propose a new FL framework called FedCS, where the maximum number of participants is selected within a predefined deadline in each communication round. Compared with the work in Ref. [11], which only accounts for the training deadline without performing participant selection, FedCS achieves higher model accuracy because it involves more participants in each global iteration. However, simply excluding slow-responding straggler devices might bias performance toward devices with stronger computing capabilities, and also skew the effective data distribution by under-representing slower devices. As such, the authors in Ref. [13] propose an asynchronous FL approach called Async, which conducts updates between devices and the server asynchronously: the server can immediately update the global model whenever it receives a local update, without waiting for other local updates. The simulation results show that Async is robust in real-world FL settings in which devices have heterogeneous computing resources or join part-way through training. However, Async significantly delays model convergence in the non-IID situation. Thus, Xie et al. [14] present the FedAsync framework, which asynchronously updates the global model after receiving each newly updated local update. FedAsync takes full account of the timestamp of each received update and sets a staleness function to define the weight of each received update during the global aggregation stage. Similarly, Chen et al.
[15,16] propose asynchronous FL frameworks that allow wait-free communication and computation to address the straggler problem. The authors in [30] propose an asynchronous online federated learning framework, which considers both continuously arriving data and the straggler problem in current asynchronous federated learning. This approach tackles the challenges associated with varying computational loads at heterogeneous edge devices that lag or drop out.
Although these asynchronous approaches alleviate the straggler problem, they only modify the global model aggregation manner and do not explore the intrinsic causes of the straggler problem. Therefore, they do not fundamentally improve the communication efficiency of FL. Besides, in an asynchronous FL setting, fast-responding devices play a larger role in the training process while slow-responding ones are indirectly ignored, which easily degrades model accuracy.

Summary
Synchronous federated communication optimization methods try to improve communication efficiency by reducing the number of communication rounds and bits, while asynchronous ones still suffer from straggler-related drawbacks. Current methods thus have unavoidable flaws in solving the communication-efficiency problem. Therefore, in this paper, we explore the essential reason why communication in FL is inefficient and aim to provide a practical solution. Specifically, we aim to address the two questions raised in the introduction.

Formally, the entire FL training process can be regarded as a series of optimization processes. Assume that there are N devices in the FL setting, with device set $U = \{U_i\}_{i=1}^{N}$. Each device $U_i \in U$ has a fixed data sample set $D_i$, and the data of all devices is denoted as $D = \{D_i\}_{i=1}^{N}$. A shared global model is cooperatively learned by all devices in the FL system. Our ultimate optimization goal is to minimize the weighted average loss function $f(\omega)$ and find the optimal model parameters $\omega$, i.e.,

$$\min_{\omega} f(\omega) = \sum_{i=1}^{N} \frac{|D_i|}{|D|} f_i(\omega), \quad (1)$$
where $f_i(\omega)$ is the local loss of device $U_i$ on its data set $D_i$. To address the problem in Eq. (1), the FedAvg algorithm is commonly used in a synchronous update manner. In FedAvg, a fraction of devices is randomly selected at each communication iteration. Each selected device performs E epochs of local model training on its local data using a common optimizer (e.g., stochastic gradient descent (SGD) [31]). Conducting more local updates has proven effective in improving the communication efficiency of FL compared with traditional approaches. However, FedAvg requires the server to synchronously aggregate the global model and assumes that there is no resource heterogeneity. The server has to wait for the slowest device (i.e., the straggler) to upload its local update in the model fusion stage, which easily prolongs the whole training time and increases the communication overhead. To better solve Eq. (1), a better solution is needed to mitigate the straggler effect in FL.
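To make the baseline concrete, the following is a minimal, self-contained sketch of the FedAvg loop described above, using a toy scalar model with a synthetic quadratic local loss (the Client class and all numbers are illustrative, not from the paper):

```python
import random

class Client:
    """Toy client: n local samples, local loss (w - t)^2 / 2, so grad = w - t."""
    def __init__(self, t, n):
        self.t, self.n = t, n
    def grad(self, w):
        return w - self.t

def fedavg(global_w, clients, rounds=10, frac=1.0, epochs=5, lr=0.1):
    """Each round: sample a fraction of clients, run E local SGD epochs,
    then average the local models weighted by local data size."""
    for _ in range(rounds):
        k = max(1, int(frac * len(clients)))
        selected = random.sample(clients, k)
        updates, sizes = [], []
        for c in selected:
            w = global_w
            for _ in range(epochs):        # E local SGD steps
                w -= lr * c.grad(w)
            updates.append(w)
            sizes.append(c.n)
        total = sum(sizes)                 # weighted average, as in Eq. (1)
        global_w = sum(w * n for w, n in zip(updates, sizes)) / total
    return global_w
```

With two equally sized clients whose local optima are 1.0 and 3.0, the global model converges to their weighted mean 2.0. Note that each round's duration is bounded by the slowest selected client, which is exactly the straggler effect discussed above.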
All key notations used in this paper are presented in Table 1.

Problem definition
As described in the introduction, our optimization goal is to substantially address the straggler effect for efficient communication while guaranteeing model performance. To this end, we formulate the whole FL learning process as the following optimization problem. In a standard FL system, the server has to wait for the slowest device to send its local model update for aggregation, which significantly prolongs the entire training time. The overall training time in each communication iteration equals the sum of the local model computation time and the local update transfer time. The computation time is determined mainly by the slowest device, and the transfer time by the total number of communication bits transferred by clients. More specifically, the computation time is governed by the heterogeneity of devices' computing resources: the greater the variability of computing resources between devices, the longer the computation time. Thus, two aspects must be considered for more efficient FL. On one hand, to obtain a lower computation time, we should minimize the gap in computing resources between devices. On the other hand, to reduce the overall transfer time, the number of communication bits accumulated throughout training must also be considered.
Formally, assume the computation time of device $U_i$ for local model training in the t-th iteration is $L_i^t$. Since the server in each communication iteration must wait for the slowest device to upload its local update, the computation time of global iteration t, i.e., $L^t$, is given by Eq. (2):

$$L^t = \max_{1 \le i \le N} L_i^t. \quad (2)$$
Based on Eq. (2), the total computation time over the whole training process is $L_{total} = \sum_{t=1}^{T} L^t$ (Eq. (3)). In addition, suppose $\phi_t$ denotes the overall communication bits transmitted by all devices in the t-th iteration, and the target prediction accuracy is achieved in the T-th iteration. Then the overall communication bits accumulated throughout the FL procedure, i.e., $\Omega_T$, are the cumulative sum of transmission bits of all devices up to the final T-th iteration: $\Omega_T = \sum_{t=1}^{T} \phi_t$ (Eq. (4)). Finally, another optimization objective is to guarantee model accuracy. Let $A_T$ be the final model accuracy achieved in the T-th iteration; our goal is to maximize $A_T$ (Eq. (5)).
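The three objectives can be collected into one display (a reconstruction from the prose above, since the original equation bodies were lost in extraction):

```latex
\begin{aligned}
&\min\; L_{\mathrm{total}} = \sum_{t=1}^{T} L^{t},
  \qquad L^{t} = \max_{1 \le i \le N} L_i^{t}, \\
&\min\; \Omega_T = \sum_{t=1}^{T} \phi_t, \\
&\max\; A_T,
\end{aligned}
```

i.e., the goal is to jointly shrink the per-round straggler time and the cumulative traffic while preserving the final accuracy.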

Overview
To reach the optimization goals listed in Eqs. (2)-(5), we propose a communication-efficient federated learning framework called FedTCR. The key idea behind FedTCR is to tame the heterogeneity in computational capability by building virtual logical computing clusters, thereby essentially reducing the straggler effect caused by gaps in computing resources between devices. In addition, to increase the involvement of slow devices during training and improve model accuracy, we design a novel collaborative training mechanism that assigns more importance to slow devices in the model aggregation stage. In FedTCR, the logical computing cluster construction part groups all selected devices into multiple clusters based on their heterogeneous computing resources. Each partitioned cluster has minimal divergence in total computing resources, so the whole FL architecture can be logically viewed as composed of multiple cluster nodes with small gaps in computational resources. In this way, we essentially correct the heterogeneous characteristics of computing resources, which indirectly reduces the long waiting time during global model aggregation. In addition, in each computing cluster, we select the device with the highest computational capability as the cluster head. In each communication iteration, only the cluster head communicates with the server. Our insight is that having one device instead of all devices communicate with the server may significantly reduce the total number of communication bits transferred between devices and the server. Finally, devices within each cluster still have heterogeneous computing resources, which may prolong intra-cluster training. To further tame this heterogeneity, an intra-cluster collaborative training mechanism is designed. In this mechanism, devices can immediately communicate with the cluster head after finishing their local updates, without waiting for others to finish local training. Devices with stronger computing resources can communicate more frequently with the cluster head, which accelerates intra-cluster model convergence. Meanwhile, unlike existing asynchronous communication mechanisms that let fast-responding devices interact frequently with the server while ignoring important but slow-responding updates, our mechanism gives slow-responding devices relatively more significance in intra-cluster model aggregation and lets them communicate with the server indirectly. Thus, the involvement of slow-responding devices in training is improved and higher model performance is achieved in our proposed FedTCR method.
Algorithm 1 and Fig. 2 provide the complete process of FedTCR. First, using the logical computing cluster construction (LCC) algorithm, the device set U is divided into M clusters. The obtained virtual logical computing cluster set and the set of cluster heads are denoted as $C = \{C_j\}_{j=1}^{M}$ and $H = \{H_j\}_{j=1}^{M}$, respectively. As depicted in Fig. 2, the clusters have minimal computational resource gaps between each other, which fundamentally corrects the heterogeneous features of computation resources. This cluster construction step is computationally efficient, requiring Θ(NM) time. Second, using the intra-cluster collaborative training (ICT) mechanism, each cluster $C_j \in C$ performs intra-cluster training in parallel based on the current global model $\omega^t$. After H training epochs, a cluster model $\omega_{C_j}^{H}$ is obtained in each cluster $C_j$. Finally, the server aggregates all transferred cluster models using the same averaging method as the traditional FedAvg algorithm, yielding a newly updated global model $\omega^{t+1}$. The above steps are repeated for T global iterations, after which a convergent global model $\omega^T$ and the accumulated communication bits $\Omega_T$ are returned. Note that the intra-cluster collaborative training (ICT) and global model aggregation steps are also computationally efficient, requiring Θ(NHM) and Θ(M) time, respectively. In the following, we describe the LCC and ICT algorithms in "LCC algorithm" and "ICT mechanism", respectively.
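One global round of this workflow can be sketched as follows, with toy scalar stand-ins: the per-member quadratic loss, the sequential "fastest first" intra-cluster loop, and the function name fedtcr_round are our illustrative assumptions, not the paper's implementation:

```python
def fedtcr_round(global_w, clusters, caps, targets, intra_epochs=3, lr=0.1):
    """One global iteration: each logical cluster refines the current
    global model internally, then the server averages the M cluster
    models exactly as FedAvg would."""
    cluster_models = []
    for cluster in clusters:               # clusters built by LCC
        w = global_w
        for _ in range(intra_epochs):      # stand-in for ICT: fastest first
            for dev in sorted(cluster, key=lambda d: -caps[d]):
                w -= lr * (w - targets[dev])   # one toy local SGD step
        cluster_models.append(w)
    # Server-side aggregation: plain average of the M cluster models.
    return sum(cluster_models) / len(cluster_models)
```

Iterating fedtcr_round drives the global model toward a consensus of the members' local optima; because the per-round work of each cluster is nearly equal (by LCC), no cluster waits long for another.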

LCC algorithm
The heterogeneity in computing resources may be the key underlying factor for the straggler effect in FL environments. Indeed, device heterogeneity also includes communication capabilities, power, and the environment where a device is located, and these factors also affect communication efficiency. Considering all of these restrictions simultaneously makes for a very complex system problem. In this paper, we assume that devices share the same background and focus on the factor "computing power", which accounts for the most time-consuming part, and explore the relationship between computing power and communication efficiency. A naive solution to mitigate the straggler effect caused by heterogeneous computing resources is to correct the heterogeneity and reduce the gaps in computing power. Computing power is a logical concept that can be defined in various ways; for example, as floating-point computing power, which is easy to obtain from CPU or GPU parameters. However, the available computing resources (e.g., the number of CPUs) that each device is equipped with are fixed; they are inherent properties of devices, and we cannot physically eliminate this gap. To solve this problem, we present a logical computing cluster construction (LCC) algorithm.
The idea of LCC is to group all heterogeneous devices into multiple clusters so that the clusters have minimal divergence in total computing resources. In particular, LCC allocates clients to clusters so that each cluster has, as far as possible, the same total computing capacity. For example, suppose there are ten clients whose computing capacities are 2, 5, 6, 1, 3, 4, 7, 2, 3, 4. If they are divided into three clusters, the clusters will be {7, 3, 3}, {6, 4, 2}, and {5, 4, 2, 1}, whose total capacities (13, 12, 12) are very close. LCC then regards each cluster as a virtual edge device in the FL setting, which essentially reduces the straggler effect caused by heterogeneous computing resources. To this end, LCC first divides all devices into multiple initial clusters based on the fixed computing resources each device is equipped with. Then, LCC dynamically adjusts the initial clustering result several times until clusters with a minimal gap in computing resources are obtained. Note that there are many ways to construct a clustering result with balanced computing resources; here, we present only one simple method.
In the LCC algorithm, to achieve an initial clustering, we first sort the device set according to each device's computing capacity. Then, we sequentially assign a device from the sorted device set to every predefined cluster. After one allocation pass, each cluster contains one device with a different computing power, but the clusters are obviously not yet balanced. Therefore, we next assign devices with relatively low computational power from the sorted device set to clusters with high total computational resources. Meanwhile, to avoid assigned devices being selected again, we remove each device from the sorted set after it is assigned. These steps are repeated until all devices are assigned, yielding an initial clustering. However, the initial clustering is not the best grouping. To improve it, LCC repeatedly moves the device with the least computing power from the cluster with the most total computation resources to the cluster with the least. After several iterations, LCC achieves a relatively balanced grouping.
Afterward, we select the device with the highest computing power in each cluster as the cluster head. In each iteration, only the cluster head communicates with the server; every other device in a cluster transfers its local model updates to its cluster head. In contrast to existing asynchronous communication schemes, where all devices communicate with the server, our algorithm indirectly improves the involvement of slower devices, especially under a high degree of heterogeneity. Furthermore, with the virtual logical cluster structure, the communication overhead between devices and the server is greatly reduced. More importantly, the waiting time for global model aggregation between clusters is indirectly reduced, as the computing resources between clusters have small gaps after constructing the logical computing clusters. Thus, the straggler effect in FL can be indirectly mitigated.
Algorithm 2 provides the specific logical computing cluster construction process. In the initial stage, the server first sorts the client set U by the computing power set P to obtain the sorted device set. In each iteration, devices from the sorted set are sequentially assigned to the clusters. After each assignment pass, we flip the variable flag, which denotes the assignment direction and is initially true. If flag = true, we sequentially allocate M devices from the sorted set to clusters C_1 through C_M; otherwise, we allocate them in the reverse order, from C_M to C_1. In both cases, the M assigned devices are removed from the sorted set. This process is repeated until all devices are assigned, yielding an initial cluster set C. After that, we dynamically adjust the initial cluster set C for r iterations to achieve a better grouping. Finally, the device with the highest computing power in each C_j is selected as the cluster head.
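The serpentine initial assignment plus rebalancing described above might be sketched as follows (a hypothetical rendering of Algorithm 2, not the paper's code; the helper name lcc_partition, the rounds cap, and the move-acceptance test are our assumptions):

```python
def lcc_partition(capacities, m, rounds=10):
    """Partition device indices into m clusters with near-equal capacity totals."""
    order = sorted(range(len(capacities)), key=lambda i: -capacities[i])
    clusters = [[] for _ in range(m)]
    forward = True
    # Serpentine initial assignment: take devices in descending capacity
    # order and deal them out, flipping direction after each pass (the
    # "flag" in Algorithm 2) so no cluster always receives the larger device.
    for start in range(0, len(order), m):
        batch = order[start:start + m]
        targets = range(m) if forward else range(m - 1, -1, -1)
        for dev, c in zip(batch, targets):
            clusters[c].append(dev)
        forward = not forward
    # Rebalancing: move the weakest device out of the heaviest cluster
    # into the lightest one, for at most `rounds` iterations.
    for _ in range(rounds):
        totals = [sum(capacities[i] for i in c) for c in clusters]
        hi, lo = totals.index(max(totals)), totals.index(min(totals))
        if hi == lo or len(clusters[hi]) <= 1:
            break
        dev = min(clusters[hi], key=lambda i: capacities[i])
        if totals[hi] - capacities[dev] >= totals[lo]:  # only if it narrows the gap
            clusters[hi].remove(dev)
            clusters[lo].append(dev)
        else:
            break
    return clusters
```

On the worked example above (capacities 2, 5, 6, 1, 3, 4, 7, 2, 3, 4 with M = 3), this sketch produces cluster totals of 13, 12, and 12, matching the grouping {7, 3, 3}, {6, 4, 2}, {5, 4, 2, 1}.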

ICT mechanism
Although the LCC algorithm of "LCC algorithm" tames the heterogeneity of computation resources across clusters, devices within each cluster still have gaps in computing capacity, which may prolong intra-cluster training due to waiting for the slowest devices to upload client updates to the cluster head. Thus, to further correct the heterogeneous computing resources, we propose an intra-cluster collaborative training mechanism (ICT). Our main point is to let devices with stronger computing power help devices with weaker ones train. Our insight is that stronger computing capability usually means faster local training with shorter response latency. If fast devices communicate with the cluster head immediately after finishing their client updates, without waiting for slower devices with longer local computing times, then the slower devices can indirectly learn the newly updated cluster model from the stronger ones, which accelerates cluster model convergence and significantly reduces intra-cluster training time. Furthermore, unlike existing asynchronous federated learning approaches that require all devices to communicate asynchronously with the server, our mechanism has two important advantages: (1) devices in each cluster only transfer their client updates to the cluster head, and only the cluster head is responsible for uploading the aggregated client updates for global aggregation; compared with traditional synchronous or asynchronous FL approaches, where all devices communicate with the server, our scheme significantly reduces the total communication overhead between devices and the server; (2) our mechanism assigns relatively more significance to slower devices than to faster ones in each intra-cluster aggregation, thereby indirectly increasing the role of slower devices in model training and improving model performance.
We next illustrate the training process of the proposed intra-cluster collaborative training mechanism (ICT). As described in "LCC algorithm", clusters have minimal gaps in total computing resources after applying LCC. Therefore, there is also a minimal difference in the training time needed to obtain a satisfactory cluster model when each cluster trains for the same number of epochs. Without loss of generality, we assume each cluster requires H epochs of collaborative training to obtain a desirable cluster model. Let S denote the number of devices in cluster $C_j$, of which devices $U_1$ and $U_S$ have the strongest and weakest computing resources, respectively. At a certain epoch $\tau \le H$, the cluster head first aggregates the client model update from the fastest device $U_1$ to get the newly updated cluster model $\omega_{C_j}^{\tau}$, and then distributes it to the devices that are ready to perform local model training in the next epoch $\tau + 1$. In this way, slower but ready devices can use the latest cluster model from the cluster head for local training, which indirectly accelerates cluster model convergence. Note that devices that are not ready continue their local training based on the cluster model from the previous epoch until they obtain their local updates to send to the cluster head for the next iteration.
In Fig. 3, we give an example of the intra-cluster training process for one communication round in one cluster C_j. In FL, there are two types of data: IID (independent and identically distributed) and non-IID data. IID means that the data of different devices follow the same distribution, whereas non-IID means that the distributions of different devices' data differ even when the data come from the same categories. At epoch τ, client U_1 quickly finishes its local model training and sends its local model update ω^k_{τ*,E} to the cluster head, where E is the number of local training epochs. In this way, different clusters upload their training weights at roughly the same time: from a broad perspective, clusters with the same computing power have the same convergence speed. Furthermore, even though clients within each cluster remain heterogeneous, the ICT mechanism realizes intra-cluster collaborative training; that is, ICT enables devices with strong computing power to help weaker devices train. We also allocate different weights to clients with different computing power: clients with strong computing power receive small weights, while clients with weak computing power receive large ones. Through collaborative training and this weight setting, clients with weak computing power have more influence on the entire model, which helps speed up model convergence.
However, as Fig. 3 depicts, the cluster head interacts more frequently with the stronger devices than with the weaker ones, which inevitably biases the cluster model toward the stronger devices. More balanced training is therefore needed. An intuitive way to achieve unbiased training is to assign relatively higher weights to the weaker devices that update less frequently, so that the cluster model does not drift toward the stronger devices. To this end, we propose a new intra-cluster weighted aggregation approach that dynamically adjusts the relative weight assigned to each device in a cluster based on the number of times the device has updated the cluster model so far. In other words, we give clients with weak computing power larger weights to increase their importance. In standard FL, every client carries the same weight in the collaborative training; here we only change the clients' weight values, which adds no time cost. On the contrary, compared with the original FL model, we allocate more reasonable weights to different clients, which accelerates model convergence; this is the essential reason why our method is more efficient than the original FL model. In this way, the degradation in model accuracy caused by neglecting updates from weaker devices is greatly mitigated. Assume there are |S| devices in cluster C_j ∈ C. For an intra-cluster epoch τ ∈ H, let R^j_k denote the number of times device U_k has uploaded its client updates in that cluster so far. To follow the heuristic, the defined weight α^k_τ must satisfy two requirements: (1) a relatively slower device U_k in cluster C_j is assigned a relatively larger weight value in each training epoch, i.e., the weight assigned to each device must be a positive, decreasing function of its update count R^j_k; (2) the weights of all devices in each cluster C_j must sum to 1, i.e., $\sum_{k=1}^{|S|} \alpha^k_\tau = 1$. To satisfy these two conditions, we give a functional form as in Eq. (7). By combining Eqs. (6) and (7), our mechanism enables the slower devices to play a greater role in the training process, which avoids potential bias toward the faster devices. It therefore dynamically adjusts and balances the intra-cluster training.
where e is the base of the natural logarithm, used to capture the time effect, and ∝ denotes proportionality up to the normalization that makes the weights sum to 1.
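The exact functional form of Eq. (7) is not reproduced in this copy; one plausible form satisfying both requirements above, assumed here purely as a sketch, is a softmax over the negative update counts:

```python
import math

def intra_cluster_weights(update_counts):
    """One plausible form for Eq. (7): a softmax over the *negative*
    update counts R_k, so devices that have uploaded fewer updates
    (slower devices) receive larger weights, and all weights sum to 1.
    The paper's actual functional form may differ; this is a sketch."""
    exps = [math.exp(-r) for r in update_counts]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical counts: U1 (fast) has uploaded 5 updates so far,
# U2 has uploaded 3, and U3 (slow) only 1.
alphas = intra_cluster_weights([5, 3, 1])
# alphas is increasing: the slowest device gets the largest weight.
```

Any strictly decreasing positive function of R_k, normalized to sum to 1, would satisfy requirements (1) and (2) equally well; the exponential simply decays smoothly with the update count.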
Algorithm 3 describes the specific process of the ICT mechanism. In this algorithm, M denotes the number of clusters, and the cluster set is represented as C = {C_1, C_2, ..., C_k, ..., C_M}. In the initial stage, the server first distributes the latest global model parameters ω_t to all cluster heads. Every cluster head then takes the global model parameters ω_t as the initial cluster model and sends it to all selected devices. After that, every device U_k ∈ S_{C_j} performs local model training using SGD on its local data, starting from the latest distributed cluster model. After E epochs of local training, a faster device (e.g., U_k) immediately sends its local update ω^k_{τ*,E} back to the cluster head. The cluster head then updates the cluster model ω_τ^{C_j} using the computed weight α^k_τ. After H epochs, all clusters push their cluster models ω_H^{C_j} to the server.
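Assuming Eq. (6) is a weighted average of the devices' most recent local updates (and flattening model parameters to plain lists of floats for simplicity), the cluster head's aggregation step can be sketched as:

```python
def aggregate_cluster(local_models, alphas):
    """Sketch of the intra-cluster weighted aggregation in the spirit
    of Eq. (6): the cluster model is the alpha-weighted average of the
    most recently received local update from each device.  Model
    parameters are flat lists of floats; real models would use tensors."""
    assert abs(sum(alphas) - 1.0) < 1e-9   # weights must sum to 1
    dim = len(local_models[0])
    return [sum(a * m[i] for a, m in zip(alphas, local_models))
            for i in range(dim)]

# three devices, two-parameter toy models, illustrative weights
models = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
cluster_model = aggregate_cluster(models, [0.2, 0.3, 0.5])
# cluster_model ≈ [1.2, 1.3]
```

In an actual ICT round only one device's update arrives at a time; the head would re-evaluate the weights from the current update counts and recompute this average with each device's latest stored update.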

Convergence proof
In this section, we provide the convergence proof of our proposed FedTCR algorithm. We first introduce some definitions and assumptions needed for the convergence analysis.
Definition 1 (Smoothness) A function f is L-smooth with constant L > 0 if for all x, y, $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2$, where ∇f(x) denotes the gradient of f at x and ⟨·,·⟩ represents the inner product of two vectors.

Definition 2 (Strong convexity)
A function f is μ-strongly convex with constant μ > 0 if for all x, y, $f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2$.

Assumption 1 (Existence of a global optimum) Assume there exists a global model optimum ω* of the function f(ω), i.e., ∇f(ω*) = 0. Based on the above definitions and assumptions, we give the following convergence guarantee in Theorem 1.

Theorem 1 Assume that the global model loss function f is L-smooth and μ-strongly convex. We group all devices into M clusters, i.e., C = {C_1, C_2, ..., C_M}, with S devices in each cluster C_j ∈ C.
We set the learning rate η < 1/L. After T global iterations, in which the server aggregates the M cluster models, the global model converges toward the optimum ω*.

Proof sketch. For convenience, we write ω^k_{τ,E} as ω_{τ,E} and ω_H^{C_{j*}} as ω_H. Similar to the work in [14], by the L-smoothness and μ-strong convexity introduced in Definitions 1 and 2, for all e ∈ E we obtain the per-step descent inequality in Eq. (11). Since each client performs at most E local updates before communicating with the cluster head, taking the total expectation of Eq. (11) yields Eq. (12). After performing E local updates, each device uploads its local update to the cluster head to update the cluster model; in each intra-cluster epoch τ ∈ H (ignoring the index j in ω_τ^{C_j} and the index k in α^k_τ for convenience) the recursion for the cluster model follows from Eq. (12). Ignoring τ in α_τ, telescoping, and taking total expectation over the H intra-cluster updates on the cluster head gives the cluster-level bound. In each global iteration t, the server aggregates the M cluster updates ω_H (where ω_H^{C_j} denotes the cluster model of cluster j; we write it as ω_H in the following), which yields Eq. (15). Finally, applying Eq. (15) over T iterations of inter-cluster training and taking total expectation completes the bound. Therefore, our proposed FedTCR algorithm converges as stated in Theorem 1.

Experimental setting
FL datasets and models. We evaluate our proposed FedTCR approach using two popular image classification datasets and two typical training models. We use convolutional neural network (CNN) and multilayer perceptron (MLP) models, as in Refs. [4, 9]. The CNN architecture includes two 3 × 3 convolutional layers with 32 and 64 filters, respectively, followed by a 2 × 2 max-pooling layer and two fully connected layers with 128 and 10 units. The output is a classification label from 0 to 9, with each label corresponding to one handwritten digit. We use the ReLU activation function, as in Ref. [9]. The MLP model includes an input layer, two hidden layers with 300 neurons each, and an output layer producing a classification label from 0 to 9. For both training models, we use the MNIST [32] and CIFAR-10 [33] datasets for model training and testing.
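The layer sizes above imply the parameter counts worked out below. The padding scheme and the placement of the single pool are our assumptions (valid, i.e., unpadded, convolutions and one 2 × 2 max-pool after both conv layers on a 28 × 28 × 1 MNIST input); the paper does not state them, so the totals are a sketch:

```python
def conv_params(in_ch, out_ch, k):
    """Weights plus biases of a k x k convolution layer."""
    return in_ch * k * k * out_ch + out_ch

def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

# Assumptions (not stated in the paper): 'valid' convolutions, a single
# 2x2 max-pool after the two conv layers, 28x28x1 MNIST input.
side = 28 - 2 - 2          # two 3x3 valid convs: 28 -> 26 -> 24
side //= 2                 # 2x2 max-pool: 24 -> 12
flat = 64 * side * side    # 64 feature maps of 12x12 -> 9216 features

total = (conv_params(1, 32, 3)      # conv1: 320 parameters
         + conv_params(32, 64, 3)   # conv2: 18,496 parameters
         + fc_params(flat, 128)     # fc1
         + fc_params(128, 10))      # fc2: 10-way classifier
```

Under these assumptions the fully connected layers dominate the model size, which matters for FL since the whole parameter vector is what gets communicated each round.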
Compared FL methods. We compare our proposed FedTCR with six typical synchronous and asynchronous FL methods: (i) FedAvg [4], a baseline synchronous FL method proposed by McMahan et al.; (ii) TiFL [18], in which clients with faster response times are grouped in the same tier and tier membership changes dynamically according to the clients' training and response times; (iii) CluFed [11], an asynchronous updating scheme that selects only a small fraction of fast-responding devices while neglecting slow-responding devices at each global aggregation; (iv) FedCS [12], an FL framework that handles the straggler problem by selecting the maximum number of participants within a predefined deadline in each communication round; (v) Async [13], an asynchronous FL framework that conducts updates between devices and the server asynchronously upon receiving a new local update; (vi) FedAsync [14], a typical asynchronous communication method adopting weighted averaging to update the global training model.

FL training hyperparameters.
In our experiments, we implement FedTCR and the six compared methods in PyTorch. For all seven methods, we set the total number of clients N to 100, the fraction of participants K_1 to 1, the mini-batch size B to 50, and the number of local training epochs E to 4. During the whole training process, we use a fixed learning rate η = 0.01. Similar to the work in Refs. [4, 14], we adopt two ways to split the datasets. One is IID, where the training data are randomly shuffled across 100 clients, each holding 600 training samples and 100 test samples for MNIST, and 500 training samples and 100 test samples for CIFAR-10. The other is non-IID, where the whole dataset is first sorted by sample label and divided into 200 fragments of 300 samples each, and each participating client is randomly assigned 2 fragments with two different labels.
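The non-IID split described above can be sketched as follows (the function name and seed are ours; the shard counts match the text):

```python
import random

def noniid_partition(labels, num_clients=100, shards=200,
                     shard_size=300, seed=0):
    """Sketch of the non-IID split described above: sort sample indices
    by label, cut them into `shards` contiguous fragments, and deal two
    fragments to each client, so each client sees at most two labels."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    fragments = [order[s * shard_size:(s + 1) * shard_size]
                 for s in range(shards)]
    rng = random.Random(seed)
    rng.shuffle(fragments)
    return [fragments[2 * c] + fragments[2 * c + 1]
            for c in range(num_clients)]

# toy labels: 60,000 samples, 10 balanced classes (like MNIST training)
labels = [i % 10 for i in range(60000)]
clients = noniid_partition(labels)
# every client holds 600 samples drawn from at most two labels
```

Because 6,000 samples per class divide evenly into 300-sample fragments, every fragment is single-label here, so each client holds at most two distinct labels, the extreme heterogeneity the non-IID experiments target.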
Heterogeneous computing resource setup. To simulate the heterogeneous computing resources of devices in the FL system, we define a set of 5 CPU allocations, i.e., P = {2 CPUs, 1 CPU, 0.75 CPU, 0.5 CPU, 0.25 CPU}, for all devices. Each device is randomly assigned computing resources from the set P. This leads to varying local model computation times across clients, which simulates the straggler effect potentially caused by heterogeneous computing resources. In our experimental setup, the 100 clients are divided into 5 clusters of 20 clients each. Within each cluster, all 20 clients use the ICT mechanism for collaborative training, and we select just one client to upload the weights to the server. By allocating
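The clustering goal of this setup can be illustrated with a greedy sketch. This is not the paper's LCC algorithm; the seed, the random power draw, and the greedy placement rule are our assumptions, and the sketch only demonstrates the objective of balancing total computing power across equally sized clusters:

```python
import random

P = [2.0, 1.0, 0.75, 0.5, 0.25]   # CPU shares, as in the setup above

def build_clusters(num_devices=100, num_clusters=5, seed=0):
    """Greedy sketch in the spirit of LCC: sort devices by computing
    power and repeatedly place the next device into the non-full
    cluster with the smallest total power, so cluster totals end up
    nearly balanced while sizes stay equal."""
    rng = random.Random(seed)
    power = [rng.choice(P) for _ in range(num_devices)]
    cap = num_devices // num_clusters          # 20 devices per cluster
    clusters = [[] for _ in range(num_clusters)]
    totals = [0.0] * num_clusters
    for dev in sorted(range(num_devices), key=lambda d: -power[d]):
        # among clusters that still have room, pick the lightest one
        j = min((c for c in range(num_clusters) if len(clusters[c]) < cap),
                key=lambda c: totals[c])
        clusters[j].append(dev)
        totals[j] += power[dev]
    return clusters, totals

clusters, totals = build_clusters()
```

Placing the largest devices first is the classic longest-processing-time heuristic; it keeps the per-cluster totals close, which is the property the paper relies on so that clusters finish their intra-cluster training in comparable time.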

Simulation results
The test accuracy. In our experiments, we first measure the global model test accuracy and the model convergence properties of our proposed FedTCR algorithm and the six compared methods on both MLP and CNN models.
We then depict their relationship in Figs. 4 and 5. Note that for all six compared algorithms, we chose the best performance for plotting. For the compared CluFed and FedCS algorithms, we set a range of values for the predefined response-time threshold, i.e., {1 s, 2 s, 2.5 s, 3.0 s, 4.0 s, 5.0 s}; for the compared FedAsync and Async approaches, we set the initial staleness value σ = {0.1, 0.15, 0.2, 0.25, 0.5} and the asynchronous aggregation weighting α = {0.1, 0.2, 0.3, 0.4, 0.5}. We then select the best performance for CluFed and FedCS, obtained at a predefined response time of 2.5 s. For the Async and FedAsync algorithms, we select the best performance at α = 0.1 and α = 0.5, respectively; furthermore, the initial staleness value σ is set to 0.2 for the FedAsync algorithm.
Figure 4 shows the results on the MNIST dataset, while Fig. 5 shows the experimental results on CIFAR-10. We can observe that for both training models, the black line, which indicates training with our proposed FedTCR algorithm, achieves the best performance among the seven algorithms, with the highest final test accuracy and the fastest model convergence. This is because the newly proposed weighted aggregation manner in FedTCR more effectively engages the slower devices in model training, leading to better model performance. Moreover, our proposed FedTCR approach essentially reduces the heterogeneity of computational resources and accelerates model convergence.
In contrast, the red, blue, green, and purple lines indicate the FedAvg, TiFL, FedCS, and CluFed algorithms, respectively. Both CluFed and FedCS show prediction accuracy closest to FedAvg because they follow the same synchronous model-updating strategy. However, since the FedCS approach involves more clients participating in global training at each communication round, FedCS shows relatively higher model accuracy than the CluFed method. Compared with FedAvg, TiFL, CluFed, FedCS, FedAsync, and our proposed FedTCR algorithm, the Async algorithm (yellow line) achieves the worst model accuracy, as it aggregates the global model from only one client at each communication round and neglects local updates from slower devices; it therefore cannot fundamentally address the stragglers. Although the FedAsync algorithm also uses asynchronous updating, it shows relatively higher accuracy than the Async approach, because FedAsync takes full account of the weight that each received local update should have in different communication rounds.
Motivated by the above visualization results, and to make clear to what extent our proposed FedTCR outperforms the compared algorithms in improving model accuracy, we list the final test accuracy (Acc) and the accuracy improvement of FedTCR over the six baseline FL methods (impr. (a)) in Tables 3 and 4. The test accuracy is shown as a function of communication rounds for both the IID and non-IID data distributions. In both scenarios, our proposed FedTCR algorithm with the CNN model on the MNIST dataset reaches high test accuracy. For CNN under the IID setting, FedTCR reaches an accuracy of 91.23% at the final iteration, which is 3.40%, 3.41%, 5.76%, 8.03%, 11.03%, and 13.85% higher than FedAvg, TiFL, FedCS, CluFed, FedAsync, and Async, respectively. Compared with training in the IID scenario, FedTCR converges at around 28 communication rounds under the non-IID distribution. For CNN under the non-IID distribution, the accuracy reaches up to 90.68% at the final communication round, outperforming the best baseline FL method, FedAvg, by 2.84% and the worst baseline method, Async, by 13.59%. For MLP on MNIST under the IID distribution, FedTCR obtains a relatively lower accuracy of up to 90.12% at the final iteration, which is 1.48% and 11.12% higher than FedAvg and Async, respectively. Similar to the CNN case, under the non-IID distribution the FedTCR algorithm achieves up to 89.68% accuracy, which is 1.40% and 12.50% higher than FedAvg and Async, respectively. For the CIFAR-10 dataset, however, the model accuracy is relatively lower than on MNIST: the test accuracy reaches up to 79.44% and 77.38% with the CNN model under IID and non-IID, respectively.
The training time. To compare the effectiveness of FedTCR in mitigating the straggler effect and reducing the model training time, we test the final convergence time of all compared algorithms and our FedTCR method on MLPs and CNNs. We present the comparison results of the convergence time in Figs. 6 and 7. Figure 6 shows the results on the MNIST dataset, while Fig. 7 shows the experimental results on CIFAR-10. Our approach adopts the cluster structure, in which, instead of all devices communicating with the server, only the cluster head transmits the cluster model to the server for global aggregation; it therefore reduces the total number of communication bits transmitted.
Adaptability to heterogeneity. We now evaluate FedTCR's adaptability to different levels of device heterogeneity. We define the heterogeneity degree among the devices in terms of v_i, the number of iterations that worker n_i can process per unit time. We then depict the convergence time of the seven policies under different degrees of worker heterogeneity. As the MLP and CNN models show the same trend on both training datasets, we only depict the representative situation in the following figures. The reason lies in that FedAvg forces all devices to stop and wait for the straggler when carrying out the aggregation, so its convergence is significantly influenced by the straggler. With FedTCR, the straggler only influences the inter-cluster aggregation, so its scope of influence is smaller than in FedAvg. With FedAsync and Async, each worker sends its model update to the server immediately, so the straggler's scope of influence is the smallest. Although FedAsync and Async adapt to heterogeneity better than FedTCR in this respect, the convergence time of FedTCR is always shorter.
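The precise formula for the heterogeneity degree is garbled in this copy; a common choice, assumed here purely as a sketch, is the ratio of the fastest to the slowest worker's per-unit-time iteration count v_i:

```python
def heterogeneity_degree(speeds):
    """Assumed definition (the exact formula in the text is not
    recoverable here): the ratio of the fastest worker's per-unit-time
    iteration count v_i to the slowest worker's."""
    return max(speeds) / min(speeds)

# With the CPU shares from the experimental setup, iterations per unit
# time are proportional to the allocated share, so the degree is:
degree = heterogeneity_degree([2.0, 1.0, 0.75, 0.5, 0.25])  # -> 8.0
```

Under this reading, a degree of 1 means perfectly homogeneous workers, and larger values mean the synchronous methods spend proportionally longer waiting on the slowest worker each round.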

Conclusion and future work
This paper explores the intrinsic reasons for the generation of the straggler effect and fundamentally improves the communication efficiency of FL. We propose a novel federated learning paradigm called FedTCR. The main idea behind FedTCR is to tame the heterogeneity in computing resources and thereby essentially reduce the waiting time during model aggregation. In FedTCR, a coarse-grained logical computing cluster construction algorithm (LCC) and a fine-grained intra-cluster collaborative training mechanism (ICT) are designed to mitigate the impact of the straggler effect caused by heterogeneous computing resources. The LCC algorithm divides all participating devices into multiple clusters with minimal gaps in computing resources and constructs virtual logical computing clusters, so that the clusters have minimal gaps in the training time needed to complete the entire inter-cluster training process. Furthermore, by constructing a virtual logical cluster, the cluster head communicates with the server on behalf of all devices in its cluster, which greatly reduces the required communication overhead. In ICT, we propose a new weighted average updating mechanism that assigns relatively higher weights to slower devices, which increases the involvement of slower devices and improves the model performance. We not only prove the convergence of FedTCR in theory but also verify its performance on popular FL datasets and networks. Experimental results show that our proposed algorithm converges quickly, reducing the communication overhead by up to 8.59× and achieving 13.85% higher model performance compared with state-of-the-art FL methods. Future work will explore better virtual logical computing cluster construction algorithms, for example, dynamic grouping algorithms that cluster devices according to the real-time usage of their computing resources, yielding a more balanced clustering of devices.

Fig. 1
Fig. 1 The flowchart of standard FL framework

Figure 2
gives an overview of the proposed FedTCR framework, which comprises a cloud-based server, a set of cluster heads, and many client devices. Each device holds a set of local data and a local model; each cluster head holds a cluster model. In every iteration, the devices use the optimization function to update the local model based on their local data. In FedTCR, the cluster heads conduct intra-cluster collaborative aggregation, while the server conducts inter-cluster aggregation synchronously, in the same way as standard FL.

Fig. 2
Fig.2The overview of our proposed FedTCR framework

Fig. 3
Fig.3The overview of our proposed ICT mechanism

R^j_1, R^j_2, ..., R^j_k, ..., R^j_S, respectively. The intra-cluster model updating mechanism can be defined as

$$\omega_\tau^{C_j} = \sum_{k=1}^{S} \alpha^k_\tau \, \omega^k_{\tau,E}, \quad (6)$$

where ω_τ^{C_j} denotes the cluster model of cluster C_j obtained in the τ-th training epoch, α^k_τ is the relative weight of device U_k in the τ-th training epoch, and ω^k_{τ,E} represents the local model update after training for E local epochs.

Fig. 4
Fig. 4 The final test accuracy of two different training models on MNIST dataset

Fig. 5
Fig. 5 The final test accuracy of two different training models on CIFRA-10 dataset

Fig. 6 Fig. 7
Fig. 6 The training time of two different training models on the MNIST dataset
Figure 8(a) shows the result of the MLP model on the MNIST dataset, while Fig. 8(b) describes

Fig. 10
Fig. 10 Comparison of FedTCR's weighted aggregation and uniform baseline approach

Table 1
ω^k_{τ,E}: local updates after E local epochs for device U_k
α^k_τ: weighted average parameter for device U_k in epoch τ
δ_t: learning rate in the t-th iteration
l_k(x_k, y_k; ω): loss value for training sample {x_k, y_k}
U: client set, U = {U_i}_{i=1}^N
D: local dataset set, D = {D_i}_{i=1}^N
φ_t: overall bits transmitted by all devices in the t-th iteration
Ω_T: accumulated communication bits through the T-th iteration
L^t_i: computation time of device U_i in the t-th iteration
L_t: computation time in the t-th iteration
L_total: total computation time of all devices during the whole training process
A_T: model test accuracy in the T-th iteration
H: set of cluster heads, H = {H_j}_{j=1}^M

Ensure: final global model ω_T, accumulated communication bits Ω_T
1: C, H ← LCC(U, P)
2: Initialize global model ω_0, global iterations T, intra-cluster epochs H
3: for each communication iteration t = 0, 1, 2, ..., T do

Algorithm 2
The complete process of the LCC algorithm
Require: client set U = {U_i}_{i=1}^N, computation power set P = {P_i}_{i=1}^N, cluster set C = ∅, set of cluster heads H = ∅, variable flag
Ensure: cluster set C and set of cluster heads H
1: Initialize variable flag ← true
2: Sort set P to obtain the sorted set P ← {P_1 ≤ ... ≤ P_i ≤ ... ≤ P_N} and the corresponding client set U ← {U_i}_{i=1}^N
18: for each iteration r = 1, 2, ..., r do
19: Adjust the initially assigned cluster set C
20: end for
21: for each cluster C_j ∈ C from 1 to M in parallel do
22: Select a device as head H_j
23: end for

C = {C_1, C_2, ..., C_j, ..., C_M}, and there are S devices in each cluster C_j ∈ C. For a cluster C_j ∈ C, the number of times each device has transferred updates so far is denoted {R^j_k}. Without loss of generality, in global iteration t, we assume that the server receives the M cluster updates. In each intra-cluster training epoch τ ∈ H, the cluster head receives the latest local model update ω^k_{τ*,E} from a device U_k and sets ω^k_{τ,E} ← ω^k_{τ*,E}. Here, ω^k_{τ*,E} is the result of at least E local training epochs on device U_k, and τ* is the timestamp at which the local update from device U_k was obtained. Note that we use ω_H^{C_{j*}} with j* = arg max_k F(ω_H^{C_{j*}}) in the following. For convenience, we denote by ω_H^{C_1}, ω_H^{C_2}, ..., ω_H^{C_j}, ..., ω_H^{C_M} the cluster updates from the cluster heads H_1, H_2, ..., H_j, ..., H_M, respectively. For a given cluster update, ω_H^{C_j} is the result of at least H intra-cluster collaborative training epochs.

Table 2
Parameters required in our experiments. For each inter-cluster training process, we randomly select a fixed fraction of clusters, K_2, at each global communication round for global aggregation and set K_2 to 1. This means that 5 × 1 clusters are sampled at each global iteration. We list the required experimental parameters in Table 2.
Obviously, FedTCR converges faster and achieves the smallest convergence time of all compared methods on both MLPs and CNNs.
Communication bits. To make clear to what extent FedTCR outperforms the other two cases in improving the communication efficiency of FL, we next evaluate the total number of communication bits transferred to the server for all

Table 3
FedTCR's performance using the MNIST dataset