1 Introduction

Federated Learning (FL) is a Machine Learning approach born [1] to address two practical challenges: (1) Communication cost: most real-world data useful for training are collected locally; bringing them all to one place for centralized learning can be prohibitively expensive, especially in real-time learning applications where time is of the essence, for example, predicting the next word when texting on a smartphone; and (2) Privacy protection: many applications, such as those in the healthcare field, must protect user data privacy; the private data can only be seen by its local owner, so the learning may only use a content-hiding representation of this data, which is much less informative.

Fig. 1 Federated Learning (FL) architectures: Conventional FL versus Edge FL

FL can compute a learning model using distributed training data that remain private and unmoved on local machines, hereafter referred to as the “participants”. A conventional FL architecture, illustrated in Fig. 1a, consists of these participants and a central “aggregation” server. The learning is an iterative procedure as follows. In the first step, the aggregation server broadcasts a global learning model, initially random, to all the participants. In the second step, each participant in parallel uses its own local data to improve this model. In the third step, the participants send their respective models to the server, which in turn aggregates them to obtain a new global model. The procedure then repeats from the first step until the global model converges.

The idea of FL is simple, yet it has proven effective in useful settings [2,3,4]. In practice, more than 90% of global data is stored and processed locally [5]. Furthermore, FL fits nicely into today’s trend toward decentralized computing, crowdsourcing the training process to local machines’ computing capabilities. That said, much room remains to realize FL’s full promise. Its reliance on the aggregation server for central aggregation of local training models is a vulnerability: for large-scale applications, the server can easily become a bottleneck. Hence the usefulness of Edge Computing.

Edge Computing has emerged as a viable technology for network operators to push computing resources closer to the users, avoiding long-haul crossing of the network core and thus improving network efficiency and user experience. Originally initiated for cellular networks to realize the 5G vision [6], Edge Computing has become mainstream with broader applicability in many types of long-distance wired or wireless networks [7, 8], serving various data- or computing-intensive applications, e.g., video optimization [9], content delivery [10], big data analytics [11], roadside assistance [12], and augmented reality [13].

In this paper, we assume an Edge Computing architecture for FL. In this so-called Edge Federated Learning (eFL) architecture [14,15,16], a middle layer of edge servers, which are commodity servers deployed at the edge of the network near the participants, serves as “regional” aggregation servers. This is illustrated in Fig. 1b. The learning in eFL works in a hierarchical manner as follows: (1) each regional server computes a regional model by aggregating the local models in its region; and (2) the central global server updates the global model by aggregating only the regional models. The performance bottleneck is no longer an issue because the global server only needs to communicate with the edge servers, which are few, and each edge server communicates only with the local participants in its region, which is also a much smaller group.

Fig. 2 Tradeoff between decentralizing the central aggregation server into distributed edge servers and the resulting learning accuracy: eFL is less accurate with more edge servers. Here, the learning is applied to the real-world MNIST dataset [17] distributed non-IID among 300 participants. The x-axis is the global time during the algorithm run; the y-axis is the accuracy evaluated against the test dataset

We emphasize that while methods such as client selection [18] and compression [19] can reduce communication costs for conventional FL, the eFL architecture is still useful and, in many cases, the only feasible choice, thanks to the following unique benefits: (1) The edge servers lie at the edge of the network, with a much shorter physical distance to the participant side; it is therefore faster and cheaper to move local model data from the participants to the edge servers than to the far-away global server; (2) FL is increasingly applied to IoT applications, where participants are Internet-of-Things devices that may not be able to transmit over long distances; here eFL is more suitable than conventional FL because we can take advantage of the untapped computational capacity at the edge nodes to aggregate local models; and (3) In general, shifting computing from centralization in a single datacenter to decentralization across regional servers also has a positive ecological impact, given the rapidly increasing carbon footprint of big cloud datacenters [20]. Moreover, eFL does not preclude the use of client selection and compression techniques; they are orthogonal in the sense that we can also adopt them in the eFL architecture to further reduce communication cost.

However, eFL incurs an inevitable tradeoff in learning accuracy: the more the aggregation server is decentralized into distributed edge servers, the worse the resulting learning accuracy. Figure 2 illustrates an example. We compared the learning accuracy of FL and eFL on a real-world dataset under fair parameter settings. While FL offers 90%+ accuracy, deploying more edge servers in eFL results in increasingly worse accuracy, dropping to 80%. This is understandable because a regional server in eFL gathers information only from its region, whereas the parameter server in FL receives information from all the participants. Although the global server in eFL will eventually receive information from all the participants via edge aggregation, the learning process is slower to converge.

This tradeoff can be significant and should be minimized, hence the focus of this paper. We propose an approach based on solving an edge server assignment problem: given a number of edge servers, decide which server aggregates which participants so as to maximize the learning accuracy. To date, it is implicitly understood in eFL studies that this assignment is given a priori. We make the following contributions:

  • The proposed edge assignment problem does not require any parameters. It is purely data-driven yet protects data privacy. The only assumption made is knowledge of the local models after a few early rounds of local training. To our knowledge, this paper is the first effort on the edge assignment problem aimed at maximizing learning accuracy that works for every eFL setting.

  • The problem is nontrivial because there is no closed form to express the learning accuracy as an objective function. Due to FL’s iterative manner, its accuracy can only be measured on the fly after the procedure completes. The best we can do, therefore, is a heuristic approach. Yet no obvious heuristic benchmark exists other than random assignment. We propose a heuristic solution that is data-driven and also useful as a better benchmark.

  • A well-known hindrance to the learning accuracy of eFL, inherited from FL, is the non-IID (non-independent and identically distributed) nature of local training data observed in the real world [21,22,23]. Indeed, the edge assignment problem is not much of a need if the data is IID, because any balanced-size assignment should suffice. Our solution is noticeably effective in the non-IID case, where the assignment problem matters.

The remainder of the paper is organized as follows. Related work is discussed in Sect. 2. Essential background on FL and eFL is provided in Sect. 3. We motivate and state the edge server assignment problem in Sect. 4. The proposed algorithm is presented in Sect. 5. We analyze the results of the evaluation study in Sect. 6. The paper concludes in Sect. 7 with pointers to future work.

2 Related work

The use of edge computing in eFL to address the performance bottleneck of the FL server has only recently been investigated but has quickly gained traction. We discuss the work most related to ours below.

2.1 General surveys on eFL

Good surveys of the state of the art are provided in [15, 24, 25]. The comprehensive review in [15] categorizes the challenges in eFL into different topics, namely communication cost, resource allocation, and privacy and security, and considers applications of eFL in cyberattack detection, edge caching and computation offloading, base station association, and vehicular networking. The survey in [24] summarizes the research problems and methods in eFL in terms of applications, development tools, communication efficiency, security, privacy, migration, and scheduling. It also suggests open problems in eFL. The authors in [25] present a report resulting from investigating more than 500 FL papers published between 2016, the year FL was first introduced, and October 2021; 57.3% of these papers discussed eFL. It is clear that FL and Edge Computing are well-suited for each other.

2.2 Effect of non-IID on accuracy

In [14], the authors prove that under some feasible assumptions eFL can provide a learning accuracy approaching that of a centralized learning setting where all the training data are collected in one place. One of the assumptions is IID-ness of the distributed training datasets, which is often implicit in most machine learning algorithms. However, in an FL setting, the training data reside independently in different physical places and their distribution is often non-IID, meaning that the data distribution at each place may be very statistically different from that at another place [21,22,23]; put another way, the participants’ data are strongly skewed. In a typical implementation of FL, the model aggregation in each iterative round involves only a subset of local models, and a careless selection of them that ignores the non-IID nature of the training data may hurt the global model’s convergence and learning accuracy [22]. Indeed, the local models can overfit local data, leading to a poor global model.

The effect of non-IID data in an eFL setting remains largely unexplored. A rare technique to deal with this problem was proposed in [26]. The starting point is that edge servers are allowed to have overlapping coverage: a local model may be included in the model aggregation at multiple servers and is updated based on model updates sent from these multiple servers. With the overlapping, the servers are closer to IID, or equivalently, less non-IID, in terms of the training data they aggregate. This approach, however, incurs more communication cost because a local model may be sent to more than one server. In contrast, our work assumes non-overlapping servers. We maximize the learning accuracy without increasing the communication cost.

2.3 Communication cost of eFL

Besides learning accuracy, the communication cost incurred due to model calculations at the mobile devices is an important concern for FL and eFL. Methods such as client selection [18] and compression [19] have been proposed to address this issue for conventional FL. They are orthogonally applicable to eFL regardless of any server assignment choice. The method in [27] is exclusively aimed at eFL to reduce communication cost. In this method, the edge servers not only aggregate the local models but also collaboratively join in the computation of the local models. In particular, each server collaborates with a constant number k of participants. The learning model is based on deep learning, which is a sequence of two types of layers: low layers and high layers. At each participant, the raw training data go through the low layers, at the end of which the output, including the model weights and the ground truth, is uploaded to the corresponding edge server. Each edge server, upon receipt of the low-layer outputs from its collaborating participants, performs the calculations at the high layers of the model. This approach’s drawback is that local devices must share with their edge servers partial raw local data, namely the aforementioned ground truth. This violates the privacy preservation of the FL paradigm. In our eFL setting, local models are trained completely at the participant side and the job of an edge server is only to aggregate local models to obtain a regional model. Hence, no raw data is shared beyond its local owner.

2.4 Edge assignment in eFL

To our knowledge, the only other work sharing the same goal as our paper, i.e., solving the edge server assignment problem in eFL, is [28]. In this reference work, the objective is to minimize the divergence between eFL’s global model and the “centralized” global model; the latter is the result of running the same learning method with all the data centralized in one place. The fundamental difference between this work and ours is two-fold. First, they assume that the local data distribution at each and every local participant is known to the global server. Second, they assume that the global data distribution, combining all the local data together, is also known. These two assumptions are too strong, thus limiting the applicability of their edge assignment solution. In contrast, our work makes no such assumptions and hence suits any eFL setting.

3 Background and preliminaries

We consider a supervised learning task of learning a function that maps an input object to an output value, called the “label”, based on a set of input–output pair samples, called the “training set”. The label here can be a class label in classification learning or a real-valued number in regression learning.

Let X and Y denote the input and output spaces, respectively. Suppose that we are given a training set of samples, \(\mathcal {D} = \{(x_1, y_1),..., (x_{|\mathcal {D}|}, y_{|\mathcal {D}|})\}\), such that \(x_{i} \in X\) is the feature vector of the \(i^{th}\) input sample and \(y_i \in Y\) its corresponding label. The learning task is to find a function \(g: X \rightarrow Y\) so that given each new input object x we predict that its label is g(x). The formulation of g depends on the underlying learning method in use, for example, support vector machines or deep neural networks. In this paper, we assume that g is uniquely formulated based on a model \(\textbf{w} \in \mathbb {R}^d\), which is a vector of d parameters. For example, d can be small if using a simple Multi-Layer Perceptron or as large as millions of parameters if using a Convolutional Neural Network. Since g can be derived from \(\textbf{w}\), hereafter we are interested in finding \(\textbf{w}\).

We quantify the prediction quality by the following empirical loss

$$\begin{aligned} F(\textbf{w};\, \mathcal {D}) \triangleq \frac{1}{|\mathcal {D}|} \sum _{(x, y) \in \mathcal {D}} l\bigg (\textbf{w};x, y\bigg ) \end{aligned}$$
(1)

where \({l}(\textbf{w};\, x,y)\) is a user-defined function measuring the prediction loss on sample (x, y) using model \(\textbf{w}\).

The goal of the learning task is to find \(\textbf{w}\) given \(\mathcal {D}\) to minimize \(F(\textbf{w};\, \mathcal {D})\). A typical way to find \(\textbf{w}\) is by the iterative method of Stochastic Gradient Descent as follows:

Algorithm 1: SGD

Here, \(\nabla\) denotes the gradient operator. When the number of rounds T is sufficiently large, the value \(\textbf{w}^{(T)}\) at the end of this loop should converge to the optimal \(\textbf{w}\). The parameter \(\eta\), called the learning rate, is predefined. The larger \(\eta\), the faster the convergence, but with a higher risk of overshooting the optimum; on the other hand, if \(\eta\) is too small, the learning can be too slow to converge.

For ease of presentation, we denote this algorithm with \(\texttt {SGD}(\textbf{w}^{(0)};\, \mathcal {D})\) where \(\textbf{w}^{(0)}\) is the initial model to begin the loop with and \(\mathcal {D}\) the set of training samples.
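Since the pseudocode figure for Algorithm 1 is not reproduced here, the following is a minimal Python sketch of the loop above under the stated notation; the names sgd and grad_loss are ours, and grad_loss(w, x, y) is assumed to return the gradient of \(l(\textbf{w};\, x, y)\) with respect to \(\textbf{w}\).

```python
import numpy as np

def sgd(w0, dataset, grad_loss, eta=0.01, T=100):
    """Gradient-descent loop of Algorithm 1 (full-batch form).

    w0        : initial model vector of d parameters
    dataset   : list of (x, y) training samples, i.e., D
    grad_loss : callable returning the gradient of l(w; x, y) w.r.t. w
    eta       : learning rate
    T         : number of rounds
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(T):
        # Gradient of the empirical loss F(w; D): the average per-sample gradient.
        g = np.mean([grad_loss(w, x, y) for (x, y) in dataset], axis=0)
        w = w - eta * g
    return w
```

The mini-batch variant used later in Sect. 6.1 replaces the full average with an average over a random mini-batch of samples in each step.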

3.1 Federated learning (FL)

In FL, the training samples are not available all at one place, but instead they reside independently and privately on many local participant machines. Let K be the number of participants and \(\mathcal {D}\) = \(\mathcal {D}_1 \cup \mathcal {D}_2 \cup ...\cup \mathcal {D}_K\) where \(\mathcal {D}_k\) denotes the subset of training samples owned by participant \(k \in [K]\).

The prediction loss can then be expressed as

$$\begin{aligned} F(\textbf{w};\, \mathcal {D})&= \frac{1}{|\mathcal {D}|} \sum _{(x, y) \in \mathcal {D}} l\bigg (\textbf{w};x, y\bigg ) \nonumber \\&= \sum _{k=1}^K \frac{|\mathcal {D}_k|}{|\mathcal {D}|} \bigg (\underbrace{ \frac{1}{|\mathcal {D}_k|} \sum _{(x, y) \in \mathcal {D}_k} l(\textbf{w};\, x, y)}_{F(\textbf{w};\, \mathcal {D}_k)} \bigg ) \end{aligned}$$
(2)
$$\begin{aligned}&= \sum _{k=1}^K \frac{|\mathcal {D}_k|}{|\mathcal {D}|} F(\textbf{w};\, \mathcal {D}_k). \end{aligned}$$
(3)

We can think of \(F(\textbf{w};\, \mathcal {D}_k)\) as the local prediction loss of participant k. In the IID case, where the set of training samples is distributed uniformly at random among the participants, we would have

$$\begin{aligned} \mathbb {E}_{\mathcal {D}_k} [F(\textbf{w};\, \mathcal {D}_k)] = F(\textbf{w};\, \mathcal {D}). \end{aligned}$$
(4)

That is, the expectation of \(F(\textbf{w};\, \mathcal {D}_k)\) over an IID-generated \(\mathcal {D}_k\) as a subset of \(\mathcal {D}\) would equal \(F(\textbf{w}, \mathcal {D})\). What this implies is a simple distributed learning approach: the participants can each independently solve the learning problem using their own training data and the average over all these local models provides a good approximation for the optimal model. This is the foundation for FL.

Fig. 3 Federated Averaging (FedAvg) algorithm when applied to conventional FL and edge FL

The earliest and arguably most popular FL algorithm is FedAvg [1]. FedAvg uses SGD, presented above, as the learning method. In its simplest form it works as follows, summarized in Algorithm 2 and illustrated in Fig. 3a. At the beginning, the server broadcasts to the participants an initial model \(\textbf{w}^{(0)}\), which can be random. Starting with this model (\(\textbf{w}^{(0)}\), or \(\textbf{w}^{(t-1)}\) in later rounds), each participant performs local training on its local data using \(\texttt {SGD}\) for a number of rounds. The resulting local model, \(\textbf{w}^{(t)}_k\), is then sent back to the server. Upon receipt of these local models, the server computes their average weighted by local training size. The server then sends this updated global model, \(\textbf{w}^{(t)}\), back to the participants. The procedure repeats for a number of global update rounds. The number of rounds, T, is chosen such that convergence is reached.

Algorithm 2: FedAvg

We denote this algorithm with \(\texttt {FL}(\textbf{w}^{(0)};\, \{\mathcal {D}_1, \mathcal {D}_2,..., \mathcal {D}_K\})\), which uses \(\textbf{w}^{(0)}\) as the initial global model and applies to a FL system of K participants with local training datasets \(\mathcal {D}_1\), \(\mathcal {D}_2\),..., \(\mathcal {D}_K\), respectively.

This algorithm has several variants [1]. For example, instead of broadcasting the global model in each round to all the participants, we send it to only a random subset of them, which will result in lower communication and computation costs. To further reduce the local computation cost of the participants, the SGD procedure can be modified to apply on a mini-batch of training samples in each gradient descent step rather than on the entire training set.
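A minimal sketch of this simplest form of FedAvg follows, reusing the sgd sketch from the previous subsection; names and default values are illustrative only.

```python
import numpy as np

def fedavg(w0, local_datasets, grad_loss, eta=0.01, T=10, local_rounds=5):
    """Simplest-form FedAvg sketch (cf. Algorithm 2): every participant trains
    locally, then the server averages the local models weighted by local data size."""
    w_global = np.asarray(w0, dtype=float)
    sizes = np.array([len(d) for d in local_datasets], dtype=float)
    weights = sizes / sizes.sum()                    # |D_k| / |D|
    for _ in range(T):                               # global rounds
        local_models = [sgd(w_global, d_k, grad_loss, eta=eta, T=local_rounds)
                        for d_k in local_datasets]   # done in parallel in practice
        # Weighted average of the returned local models becomes the new global model.
        w_global = sum(p * w_k for p, w_k in zip(weights, local_models))
    return w_global
```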

3.2 Edge federated learning (eFL)

In FL, the server is prone to be a bottleneck when there are many participants. eFL alleviates this bottleneck by deploying edge servers near the participant side to serve as regional aggregators. Let M be the number of edge servers. Let \(z_{ik} \in \{0, 1\}\) denote the assignment of participant k to edge server i. The aggregation coverage of server i is the set of participants k such that \(z_{ik}=1\). Our setting assumes that each participant is assigned to only one edge server, i.e.,

$$\begin{aligned} \forall k \in [K] : \sum _{i=1}^{M} z_{ik} = 1. \end{aligned}$$
(5)

Intuitively, we can think of eFL as a distributed system of FL subsystems, each running aggregation on an edge server and thus having a corresponding edge model. The job of the global/central server in this distributed system is simply to compute the global model by averaging the edge models. We can modify the FedAvg algorithm above to work for eFL as follows, summarized in Algorithm 3 and illustrated in Fig. 3b. The procedure runs G rounds of global update. In each round, in Step 2b, we apply Algorithm 2 to compute the edge model for each edge server. These edge models from Step 2b are then averaged at the central server, weighted by the training size per edge server. The training size of an edge server is the total data size of its assigned participants.

Algorithm 3: eFL
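A minimal sketch of this hierarchical procedure follows, reusing the fedavg sketch above as the per-region FL subsystem; here assignment is a list of M lists of participant indices (one per edge server), and all names are illustrative.

```python
import numpy as np

def efl(assignment, w0, local_datasets, grad_loss, G=10, E=2, L=10, eta=0.01):
    """Hierarchical FedAvg sketch (cf. Algorithm 3): each edge server runs FedAvg
    over its assigned participants (E edge rounds, L local rounds per edge round);
    the global server then averages the edge models weighted by regional data size."""
    w_global = np.asarray(w0, dtype=float)
    region_sizes = np.array([sum(len(local_datasets[k]) for k in region)
                             for region in assignment], dtype=float)
    betas = region_sizes / region_sizes.sum()        # |D_i| / |D|
    for _ in range(G):                               # global rounds
        edge_models = [fedavg(w_global, [local_datasets[k] for k in region],
                              grad_loss, eta=eta, T=E, local_rounds=L)
                       for region in assignment]
        w_global = sum(b * w_i for b, w_i in zip(betas, edge_models))
    return w_global
```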

We denote this algorithm with

$$\begin{aligned} \texttt {eFL}(\varvec{Z}, \textbf{w}^{(0)}; M, \{\mathcal {D}_1, \mathcal {D}_2,..., \mathcal {D}_K\}) \end{aligned}$$

which uses \(\textbf{w}^{(0)}\) as the initial global model and applies to an eFL system with M edge servers and K participants with local training datasets \(\mathcal {D}_1\), \(\mathcal {D}_2\),..., \(\mathcal {D}_K\), respectively, and \(\varvec{Z}\) the assignment matrix of these K participants to the M edge servers. The complexities of eFL depend on the following parameters:

  • How many rounds, G, does the global server need to update the global model from the edge models?

  • How many rounds, E, does each edge server need to update the edge model from the assigned local models before sending the updated edge model to the global server?

  • How many rounds, L, does each participant need to perform SGD before sending the updated local model to its edge server?

We can then represent the complexities as follows (a small numerical illustration follows the list):

  • Computation cost: the total number of local SGD steps each participant has to perform until the whole eFL system stops (here, assuming that each SGD step is O(1)-time), which is \(T = G \times E \times L\).

  • Communication cost: the total number of messages transmitted between the participants and the edge servers (\(K \times T/L\)) and between the edge servers and the global server (\(M \times T/(E \times L)\)) until the whole eFL system stops, which is \(K\times T/L + M\times T/(E \times L)\) = \(K\times G\times E + M\times G\).
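To make the two formulas concrete, the snippet below plugs in one configuration used later in the evaluation (K = 300 participants, M = 30 servers, E = 2, L = 10); the value G = 100 is only an assumed example, not a setting from the evaluation.

```python
K, M = 300, 30          # participants, edge servers
G, E, L = 100, 2, 10    # global, edge, and local rounds (G is illustrative)

T = G * E * L                         # local SGD steps per participant: 2000
participant_to_edge = K * G * E       # K * T / L           = 60000 messages
edge_to_global = M * G                # M * T / (E * L)     =  3000 messages
total_messages = participant_to_edge + edge_to_global        # 63000
```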

One can prove that when certain conditions hold, like in FL [29], eFL will eventually converge to a performance comparable to that of the centralized learning setting [14], which is true for both convex and non-convex loss functions.

4 Problem motivation and statement

Consider an eFL setting given (G, E, L) parameters. Its performance then depends on the edge server assignment \(\varvec{Z}=\{z_{ik}\}_{M\times K}\). Our goal is to find \(\varvec{Z}\) to maximize the learning accuracy. To date, it is implicit in eFL designs that this assignment is random or given a priori without justification. Although eFL has gained a lot of traction, the edge server assignment is rarely touched, except [28] which differs from ours as discussed in Sect. 2.

The server assignment problem is important because it is often the case that the training data distribution among the participants is non-IID. As such, Eq. (4) is usually untrue: averaging \(F(\textbf{w}, \mathcal {D}_k)\) over \(\mathcal {D}_k\) can unpredictably be bad as an approximation for \(F(\textbf{w}, \mathcal {D})\). The non-IID case remains a severe hindrance to FL accuracy. With eFL, if a proper edge assignment \(\varvec{Z}\) is used, we can hope to neutralize the non-IID effect.

If the IID assumption holds true, FL works fine and the edge assignment problem for eFL would not matter, because a random assignment that balances the coverage size of each edge server would provide a good learning accuracy. We thus seek an assignment solution \(\varvec{Z}\) that works more effectively for the non-IID case, yet as well as the random solution for the IID case. Unfortunately, no closed form exists to formulate the learning accuracy as an objective function. Due to FL/eFL’s iterative manner, the accuracy can only be computed on the fly after the procedure completes.

In our eFL setting, once the server assignment is determined at the beginning, it is fixed for all rounds. This is the focus of our assignment problem. Adjusting the assignment in each round would (1) make the learning slower, and (2) quickly become less effective after each round because the local models grow increasingly similar. Therefore, although we could also solve a similar assignment problem that adjusts the assignment in each round, that would just be an incremental development, which we could explore in the future (but it is not the main point of this paper).

We assume that participants can communicate with any server and propose the assignment algorithm for this case in the next section. In general, there may be a connectivity restriction that allows only certain participants to connect to certain servers (due to practical reasons). We discuss this general case in Sect. 5.3.

5 The assignment algorithm

Ideally, we want the data distribution at the edge level to be IID so as to minimize the global learning loss. For example, consider an eFL setting with 4 participants and 2 edge servers such that participant 1’s and participant 2’s training datasets consist of mostly label A and participant 3’s and participant 4’s mostly label B. Then participants 1 and 3 should be combined on one edge server and participants 2 and 4 on the other. This way, each edge server has comprehensive coverage of the training data, having both labels A and B.

5.1 The rationale

We elaborate on this heuristic mathematically as follows. Think of each edge server as a “virtual” participant with a virtual training set \(\mathbb {D}_i \triangleq \bigcup _{k: z_{ik}=1} \mathcal {D}_k\). The loss F(.) can be rewritten as follows

$$\begin{aligned} F(\textbf{w};\, \mathcal {D}) = \frac{1}{|\mathcal {D}|} \sum _{(x, y) \in \mathcal {D}} l\bigg (\textbf{w};x, y\bigg ) \end{aligned}$$
(6)
$$\begin{aligned} = \sum _{i=1}^M \frac{|\mathbb {D}_i |}{|\mathcal {D}|} \bigg (\underbrace{ \frac{1}{|\mathbb {D}_i |} \sum _{(x, y) \in \mathbb {D}_i} l(\textbf{w};\, x, y)}_{F(\textbf{w}; \mathbb {D}_i)} \bigg ) \end{aligned}$$
(7)
$$\begin{aligned} = \sum _{i=1}^M \underbrace{\frac{|\mathbb {D}_i |}{|\mathcal {D}|}}_{\beta _i} F(\textbf{w}; \mathbb {D}_i) = \sum _{i=1}^M \beta _i F(\textbf{w}; \mathbb {D}_i). \end{aligned}$$
(8)

Let \(Q_i(x, y)\) be the hidden ground-truth distribution representing the samples in \(\mathbb {D}_i\). Denote

$$\begin{aligned} E_i(\textbf{w})\triangleq & {} \mathbb {E}_{Q_i} [ F(\textbf{w}; \mathbb {D}_i) ] \end{aligned}$$
(9)
$$\begin{aligned} E(\textbf{w})\triangleq & {} \sum _{i=1}^M \beta _i E_i(\textbf{w}) \end{aligned}$$
(10)

Applying Proposition 2 of [5], the global optimum \(\textbf{w}^* \triangleq \arg \min _{\textbf{w}} F(\textbf{w};\, \mathcal {D})\) introduces the following error bound when minimizing \(E_i(\textbf{w})\): with probability \(1-\delta\) where \(\delta\) is arbitrarily small, we have

$$\begin{aligned}& E_i(\textbf{w}^*) - \min _{\textbf{w}} E_i(\textbf{w}) \nonumber \\ &\quad \le \frac{A}{|\mathcal {D}|} + 2 \bigg \langle Q_i - \sum _{j=1}^M \beta _j Q_j \bigg \rangle \end{aligned}$$
(11)

where we denote \(\langle f \rangle \triangleq \int _{x,y} | f(x,y) | dxdy\) and A is a constant only depending on the definition of the loss function l, dimensionality of the sample space X, and the threshold \(\delta\).

Summing this over all edge servers i’s, we have

$$\begin{aligned} & E(\textbf{w}^*) - \min _{\textbf{w}} E(\textbf{w}) \end{aligned}$$
(12)
$$\begin{aligned} &= \sum _{i=1}^M \beta _i E_i(\textbf{w}^*) - \min _{\textbf{w}} \sum _{i=1}^M \beta _i E_i(\textbf{w}) \end{aligned}$$
(13)
$$\begin{aligned} &\le \sum _{i=1}^M \beta _i E_i(\textbf{w}^*) - \sum _{i=1}^M \beta _i \min _{\textbf{w}} E_i(\textbf{w}) \end{aligned}$$
(14)
$$\begin{aligned} &= \sum _{i=1}^M \beta _i \bigg ( E_i(\textbf{w}^*) - \min _{\textbf{w}} E_i(\textbf{w}) \bigg ) \end{aligned}$$
(15)
$$\begin{aligned} &\le \sum _{i=1}^M \beta _i \frac{A}{| \mathcal {D} |} + 2 \sum _{i=1}^M \beta _i \bigg \langle Q_i - \sum _{j=1}^M \beta _j Q_j \bigg \rangle \end{aligned}$$
(16)
$$\begin{aligned} &\le \frac{A}{| \mathcal {D} |} + 2 \sum _{i=1}^M \beta _i \bigg \langle Q_i - \sum _{j=1}^M \beta _j Q_j \bigg \rangle . \end{aligned}$$
(17)

Here, the inequality going from Eq. (13) to Eq. (14) holds because

$$\begin{aligned} \min _{\textbf{w}} \sum _{i=1}^M \beta _i E_i(\textbf{w}) = \sum _{i=1}^M \beta _i E_i(\hat{\textbf{w}}) \ge \sum _{i=1}^M \beta _i \min _{\textbf{w}} E_i(\textbf{w}), \end{aligned}$$

where \(\hat{\textbf{w}} \triangleq \arg \min _{\textbf{w}} \sum _{i=1}^M \beta _i E_i(\textbf{w})\).

To summarize, we have

$$\begin{aligned} & E(\textbf{w}^*) - \min _{\textbf{w}} E(\textbf{w}) \nonumber \\ &\quad \le \frac{A}{| \mathcal {D} |} + 2 \sum _{i=1}^M \beta _i \bigg \langle Q_i - \sum _{j=1}^M \beta _j Q_j \bigg \rangle . \end{aligned}$$
(18)

This means that, given a choice of edge assignment matrix \(\varvec{Z}\), which determines the terms \(Q_i\), the right-hand side of this inequality provides an upper bound on the accuracy gap between FL and eFL. Therefore, to minimize this gap, our heuristic is to compute \(\varvec{Z}\) to minimize this upper bound. The bound is minimized when \(Q_i(.) = Q_j(.) = Q(.)\) for all i, j, because that leads to

$$\begin{aligned} \bigg \langle Q_i - \sum _{j=1}^M \beta _j Q_j \bigg \rangle = \bigg \langle Q - Q \sum _{j=1}^M \beta _j \bigg \rangle = 0. \end{aligned}$$

In other words, ideally we want the aggregate probability distribution of the samples virtually belonging to each server to be identical. This mathematically justifies the intuition that the data distribution at the edge level should be made IID to minimize the global learning loss. However, we do not know the underlying distribution of each participant. Thus, we propose a heuristic algorithm below.

5.2 The heuristic algorithm

To make the aggregate probability distribution of the samples virtually belonging to each server identical, our heuristic algorithm is based on two guidelines: (1) participants assigned to the same server should be statistically diverse, and (2) we use the diversity of the local models observed empirically to represent this statistical diversity.

Consider participant k and let \(P_k(x, y)\) denote its data probability distribution. From the strong law of large numbers, when the training data size \(|\mathcal {D}_k|\) is sufficiently large, the loss \(F(\textbf{w};\, \mathcal {D}_k)\) should converge to the true risk \(F_k(\textbf{w})\) of model \(\textbf{w}\) for participant k, which is

$$\begin{aligned} F(\textbf{w};\, \mathcal {D}_k) \approx F_k(\textbf{w})&\triangleq \mathbb {E}_{P_k}[l(\textbf{w};\, x, y)] \nonumber \\&= \int l(\textbf{w};\, x, y) dP_k. \end{aligned}$$
(19)

Participant k’s local model after one round of SGD is

$$\begin{aligned} \textbf{w}^{(1)}_k&= \textbf{w}^{(0)} - \eta \nabla _w F(\textbf{w}^{(0)};\, \mathcal {D}_k) \nonumber \\&\approx \textbf{w}^{(0)} - \eta \nabla _w F_k(\textbf{w}^{(0)}). \end{aligned}$$
(20)

Since

$$\begin{aligned} \nabla _w F_k(w)&= \nabla _w \int l(\textbf{w};\, x, y) dP_k \\&= \int \nabla _w l(\textbf{w};\, x, y) dP_k, \end{aligned}$$

considering two participants k and \(k'\), their model divergence after one SGD round is

$$\begin{aligned}&\Vert \textbf{w}^{(1)}_{k} - \textbf{w}^{(1)}_{k'}\Vert \nonumber \\&= \eta \bigg \Vert \nabla _w F_k(\textbf{w}^{(0)}) - \nabla _w F_{k'}(\textbf{w}^{(0)}) \bigg \Vert \nonumber \\&= \eta \bigg \Vert \int \nabla _w l(\textbf{w}^{(0)};\, x, y)\, dP_k - \int \nabla _w l(\textbf{w}^{(0)};\, x, y)\, dP_{k'} \bigg \Vert \nonumber \\&= \eta \bigg \Vert \int \nabla _w l(\textbf{w}^{(0)};\, x, y)\, d(P_k - P_{k'}) \bigg \Vert \nonumber \\&\le \eta C \int \bigg | d(P_k-P_{k'}) \bigg |, \end{aligned}$$
(21)

where the constant C denotes the upper bound \(C \triangleq \max _{x,y} \Vert \nabla _w l(\textbf{w}^{(0)};\, x, y) \Vert\) (the maximal norm of the gradient of the per-sample loss at \(\textbf{w}^{(0)}\)). This bound exists given the Lipschitz-ness of the loss function, a standard assumption in the FL literature to guarantee convergence [5, 29]. For example, it holds when l is the 2-norm.

This inequality implies that if participants k and \(k'\) have similar data distributions \(P_k\) and \(P_{k'}\), they should have small model divergence; conversely, if they have large model divergence, their data distributions should be largely different. Therefore, the model divergence after the first SGD round is a good representation of the statistical difference between the data distributions of two participants. The representation is even better if more SGD rounds take place.

We note that we are not the first to observe that model divergence is a consequence of data distribution skewness; earlier examples are [21, 29,30,31]. However, those works analyze the divergence between the global FL model and the global SGD model (the centralized setting combining all local data) as a function of convergence time, concluding that the stronger the non-IID-ness between participants, the more the global FL model diverges from the centralized setting. They do not analyze the model divergence between participants. In contrast, we consider the model divergence at the participant level, comparing individual participants, to conclude that if two participants have similar data distributions, they should have small model divergence, and vice versa. Based on this, we conclude that this kind of model divergence can capture the data distribution difference between participants, which leads to our approach of using each participant’s local model after some early rounds as a way to measure their non-IID-ness.

Our proposed edge assignment is run by the central server and consists of two steps. First, each participant k sends the server its 1st-round local model \(\textbf{w}^{(1)}_{k}\). Here, local data privacy remains intact. Second, the central server applies a graph partitioning technique to divide the participants into groups of statistically diverse participants. The assignment algorithm is described in Algorithm 4 below.

Algorithm 4: Edge Assignment

We elaborate on this algorithm. The goal of Step 3b is to partition graph G into M equally-sized clusters such that the vertices in the same cluster are maximally connected. A well-connected cluster means that its assigned participants are statistically diverse. The equal-size objective is to balance the aggregation workload across the edge servers. This partitioning is essentially a balanced min-cut graph partitioning problem [32]:

$$\begin{aligned} \min \sum _{k} \sum _{k' \not \in \text {same cluster with } k} dist(k, k')\nonumber \\ \text {such that each cluster has size~} K/M. \end{aligned}$$
(22)

This problem is NP-complete, even for the simplest case of \(M=2\) [32]. However, one can use efficient approximation algorithms such as [33,34,35]. For example, [32] shows that we can get a polylogarithmic approximation with respect to the min-cut objective if we relax the cluster size to be bounded by \((1+\epsilon ) K/M\) for arbitrary \(\epsilon >0\). Note that our Edge Assignment algorithm does not require optimality of the graph partitioning step. In practice, METIS [35] is a popular tool for different graph partitioning objectives, including a practically effective solution for the balanced partitioning problem, and we can use METIS for our algorithm.
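Since the pseudocode figure for Algorithm 4 is not reproduced here, the sketch below illustrates the data flow of the assignment step, assuming the first-round local models have already been collected as a K × d array. Instead of calling METIS, it uses a simple greedy placement as a stand-in for the balanced partitioning step, so it should be read as an illustration of the heuristic rather than the exact partitioner used in our implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def edge_assignment(local_models, M):
    """Group statistically diverse participants on the same edge server.

    local_models : (K, d) array of participant models after the early round(s)
    M            : number of edge servers
    Returns a list of M participant-index lists of (near-)equal size.
    """
    K = len(local_models)
    # Pairwise model divergence (Minkowski distance of order 1).
    dist = cdist(local_models, local_models, metric="cityblock")

    capacity = int(np.ceil(K / M))
    clusters = [[] for _ in range(M)]
    # Greedy stand-in for balanced min-cut partitioning: place each participant
    # in the non-full cluster whose current members it is farthest from,
    # which keeps every cluster internally diverse.
    order = np.argsort(-dist.sum(axis=1))          # most "distant" participants first
    for k in order:
        best, best_score = None, -1.0
        for i, members in enumerate(clusters):
            if len(members) >= capacity:
                continue
            score = dist[k, members].sum() if members else 0.0
            if best is None or score > best_score:
                best, best_score = i, score
        clusters[best].append(int(k))
    return clusters
```

In the evaluation of Sect. 6, the pairwise divergences serve as edge weights of the participant graph and METIS computes the balanced partition; the greedy loop above is only a self-contained stand-in.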

The implementation of eFL comprises two steps: first, run Algorithm 4 to compute an edge assignment, and then run the eFL algorithm (Algorithm 3) using the resulting edge assignment. The complexities of Algorithm 4 are as follows:

  • Computation cost: In Step 2, each participant in parallel runs a round of SGD, which is assumed to take a constant unit cost. With K participants, the total cost of this step is O(K). In Step 3, the server incurs a cost of \(O(K^2)\) to form the edge-weighted graph and then a cost of \(O(K^{M^2})\) to run the min-cut graph partitioning. The total computation cost is thus dominated by \(O(K^{M^2})\). With a fixed number of edge servers, M, this cost is polynomial in the number of participants K. In practice, M is much smaller than K.

  • Communication cost: Steps 1, 2, and 3 each incur K messages sent between the server and the K participants, for a total of 3K messages.

These costs are incurred one time only, in the initial phase of the learning process. Once the server assignment is determined at the beginning, it does not change across rounds. Adjusting the assignment in each round would (1) make the learning slower, and (2) quickly become less effective after each round because the local models grow increasingly similar. Therefore, although we could also solve a similar assignment problem that adjusts the assignment in each round, that would just be an incremental development. We could explore this direction in the future, but it is not the main point of this paper.

In our heuristic, we assume that the training dataset of a participant is sufficiently large that the loss obtained on the (possibly biased) local training dataset approximates the true risk. This assumption is reasonable because: (1) assumptions of a “sufficiently large” sample are common in research involving sampling and approximation, and we make this assumption as the basis for the generalizability of our theoretical analysis, which serves as a guideline for designing our (heuristic) algorithm; (2) in practice, when applying FL in real-world applications, it is rarely the case that a client has very few samples, so our assumption should hold for most clients and the effect of small-size clients should be small; in any case, we can further reduce this effect by setting a minimum threshold on the dataset size a client must satisfy to become a meaningful FL participant; and (3) in general, although limited dataset size affects IID-ness, it is less influential than other factors such as data size variation from one client to another, label distribution variation, and data distribution correlation [36]; therefore, our assumption still applies to many cases where non-IID-ness is due to these factors while data sizes are sufficiently large.

5.3 Generalized algorithm

Our work can be extended to the case where we associate with each pair of server i and participant k an assignment cost \(c_{ik} \ge 0\) and constrain the assignment so that the total cost stays within a budget \(C>0\); i.e., \(\sum _{i=1}^M \sum _{k=1}^K z_{ik}c_{ik} \le C\). The case we consider in this paper corresponds to setting \(c_{ik} = 0\) for all i, k, meaning that every participant can be assigned to any candidate server. The case where only certain participants may connect to certain candidate servers corresponds to setting \(c_{ik} = \infty\) for the pairs that cannot connect. In general, to find the best assignment \(\varvec{Z} = \{z_{ik}\}_{M\times K}\), we can solve the following combinatorial optimization problem:

$$\begin{aligned} \min \bigg \{&\sum _{i=1}^{M} \sum _{k=1}^K \sum _{k'=1}^K dist(k, k') z_{ik} (1-z_{ik'}) \bigg \} \end{aligned}$$
(23)
$$\begin{aligned} \text {s. t.~} 1)&\sum _{i=1}^M \sum _{k=1}^K z_{ik}c_{ik} \le C\end{aligned}$$
(24)
$$\begin{aligned} 2)&\forall ~ k \in [K]: \sum _{i=1}^M z_{ik} = 1 . \end{aligned}$$
(25)

This belongs to the class of non-linear integer programming problems (NP-hard), and we can leverage the literature in this area to find an approximate solution [37]. We leave this extension for future work.

6 Evaluation study

Table 1 Real-world datasets used in evaluation

We implemented eFL (Algorithm 3) and compared three edge assignment algorithms: (1) \({\texttt {eFL\_new}}\): our proposed assignment algorithm (Algorithm 4), which groups statistically distant participants on the same server using our heuristic; METIS [35] was used for the graph partitioning step; (2) \({\texttt {eFL\_sim}}\): same as \({\texttt {eFL\_new}}\) but with the opposite assignment strategy of grouping statistically similar, instead of different, participants on the same server; and (3) \({\texttt {eFL\_rnd}}\): random assignment, which places participants at servers uniformly at random. We also implemented conventional \(\texttt {FL}\) using the FedAvg algorithm (Algorithm 2), which serves as a baseline.

6.1 Evaluation setup

The learning was applied to two real-world classification datasets: MNIST [17] for hand-written digit recognition and CIFAR10 [38] for object image recognition. Each dataset is split into a testing set and a training set. The training set is distributed into 300 local training datasets hosted by 300 participant machines, respectively. The numbers of classes, features, training samples, and test samples are given in Table 1. The algorithms above are compared in terms of learning accuracy on the test set.

The local data distribution can be IID or non-IID. For the IID case, the training set is distributed uniformly at random such that each participant has the same number of samples for each of the 10 labels: 20 samples per label per participant for the MNIST dataset and 16 for the CIFAR10 dataset. For the non-IID case, we create an imbalanced label distribution, as commonly used in non-IID FL evaluation studies [1, 21, 22, 26]. Specifically, each participant has 90% of its training samples with only one dominant label, while the remaining samples are spread equally likely over the other 9 labels. For example, for non-IID MNIST, a participant has 180 samples with one label and the remaining 20 samples with labels chosen equally among the other 9 labels.
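The following is a minimal sketch, with illustrative names and defaults, of how such a 90%-dominant-label split can be generated from the training labels; for brevity it samples each participant's local set independently rather than enforcing a strictly disjoint partition of the full dataset.

```python
import numpy as np

def sample_noniid_local_set(labels, dominant_label, n_samples=200,
                            dominant_frac=0.9, rng=None):
    """Draw one participant's sample indices: 90% from the dominant label,
    the rest uniformly from the other labels.

    labels : 1-D array of class labels of the full training set
    """
    rng = rng or np.random.default_rng()
    labels = np.asarray(labels)
    n_dom = int(dominant_frac * n_samples)
    dom_pool = np.where(labels == dominant_label)[0]
    rest_pool = np.where(labels != dominant_label)[0]
    return np.concatenate([rng.choice(dom_pool, n_dom, replace=False),
                           rng.choice(rest_pool, n_samples - n_dom, replace=False)])

# Example: 300 participants whose dominant labels cycle over the 10 classes.
# parts = [sample_noniid_local_set(train_labels, k % 10) for k in range(300)]
```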

For the eFL setting, we vary the number of edge servers \(M \in \{10, 20, 30\}\) to explore the effect of decentralization in eFL on the learning performance of the global model. Each edge server then aggregates 30, 15, and 10 participants on average, respectively. Two configurations for the numbers of edge rounds and local rounds are considered: (E = 2, L = 10) and (E = 5, L = 20). The former represents the case where edge and global aggregation updates are more frequent, and the latter the case where they are less frequent. It is well known in FL for the non-IID case that when L is higher, meaning longer local SGD computation before aggregation, the divergence between local models grows wider, slowing down the convergence of the global model [1]. In total, we have 24 model cases for each eFL algorithm. For the FL setting, M and E are irrelevant; the only applicable parameter is L, which we set to \(L = 20\) or \(L = 100\) so that the number of resulting global model updates is the same for both FL and eFL. These two choices correspond to the cases (E = 2, L = 10) and (E = 5, L = 20), respectively. In total, we have 4 model cases for FL.

In general, the parameter server in FL selects different participants in each round to converge faster and achieve better accuracy. The same selection rule can also apply to each edge server in eFL. Because our main goal is to study the tradeoff of having edge servers versus conventional FL, we focus on how to assign the edge servers, not on selecting participants for each round. Therefore, our experiments consider the full-selection case where all participants are included in each global broadcast.

Multi-Layer Perceptron (MLP) and Logistic Regression (LR) were used as the learning methods for the MNIST dataset, while a Convolutional Neural Network (CNN) was used for CIFAR10. The setup details are as follows:

  • MLP for MNIST (203,530 model parameters): Fully connected (784, 256) \(\rightarrow\) sigmoid activation \(\rightarrow\) Fully connected (256, 10) \(\rightarrow\) Softmax() (a PyTorch sketch of the MLP and LR models follows this list).

  • LR for MNIST (7850 parameters): Fully connected (784, 10) \(\rightarrow\) Softmax(). The 7850 parameters are 784 * 10 weights and 10 bias terms.

  • CNN for CIFAR10 (258,762 model parameters): we use the convention (in channels, out channels, kernel) for convolutional layers and ReLU activation after convolutional layers. Specifically, Conv2D(in_channels = 3, out_channels = 32, kernel = 3) \(\rightarrow\) MaxPool2D(2, 2) \(\rightarrow\) Dropout(p = 0.1) \(\rightarrow\) Conv2D(in_channels = 32, out_channels = 64, kernel = 3) \(\rightarrow\) MaxPool2D(2, 2) \(\rightarrow\) Dropout(p = 0.1) \(\rightarrow\) Conv2D(in_channels = 64, out_channels = 128, kernel = 3) \(\rightarrow\) MaxPool2D(2, 2) \(\rightarrow\) Dropout(p = 0.1) \(\rightarrow\) Fully connected(128*2*2, 256) \(\rightarrow\) ReLU() \(\rightarrow\) Fully connected(256, 128) \(\rightarrow\) ReLU() \(\rightarrow\) Fully connected(128, 10) \(\rightarrow\) Softmax().
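For reference, the MLP and LR models above correspond to the following PyTorch sketch (layer sizes as listed; variable names are ours). The parameter counts check out as 784·256 + 256 + 256·10 + 10 = 203,530 for the MLP and 784·10 + 10 = 7,850 for LR, matching the numbers above.

```python
import torch.nn as nn

# MLP for MNIST: 784 -> 256 (sigmoid) -> 10, with 203,530 parameters in total.
mlp = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 256),
    nn.Sigmoid(),
    nn.Linear(256, 10),
    nn.Softmax(dim=1),
)

# Logistic Regression for MNIST: a single 784 -> 10 layer, 7,850 parameters.
lr = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 10),
    nn.Softmax(dim=1),
)

assert sum(p.numel() for p in mlp.parameters()) == 203530
assert sum(p.numel() for p in lr.parameters()) == 7850
```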

We implemented these neural networks with Torch. For the SGD algorithm (Algorithm 1) in FL and eFL, we employed its “mini-batch” version [1] with batch size set to 10 for MNIST and 20 for CIFAR10; the learning rate \(\eta\) is set to 0.01 and 0.15, respectively. These parameters were chosen to fit the datasets reasonably.

As described above, we used the datasets to populate the training data for the distributed participants. Using these datasets, we created different data distribution scenarios to simulate the IID and non-IID cases. Then, for each case, we evaluated how our proposed algorithm outperforms the benchmarks. These different settings result in different model divergences between participants, and thus different graph instances for graph partitioning in our algorithm. To form the participant graph in our edge assignment solution, the model divergence between two participants is computed using the local models resulting after running SGD for 10 rounds. In computing this divergence, we use the Minkowski metric of order 1 (we found that order 2, i.e., the Euclidean distance, is a worse representation of the statistical difference in the raw data).
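Concretely, the pairwise divergences can be computed directly on the flattened parameter vectors of the participants' models, as in the short sketch below (assuming the local models are available as PyTorch modules; the function name is ours).

```python
import torch
from torch.nn.utils import parameters_to_vector

def divergence_matrix(local_models):
    """local_models: list of K torch modules (participant models after the
    early local SGD rounds). Returns the K x K matrix of Minkowski-1 (L1)
    distances between their flattened parameter vectors."""
    vecs = torch.stack([parameters_to_vector(m.parameters()).detach()
                        for m in local_models])
    return torch.cdist(vecs, vecs, p=1.0)
```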

6.2 Results

Our goal is to test the following hypotheses: (H1) having more edge servers in eFL leads to lower learning accuracy, which is the tradeoff eFL incurs for decentralizing the bottleneck of FL; (H2) this tradeoff is more pronounced in the non-IID case, hence the importance of addressing this case for eFL; (H3) our edge assignment solution is especially helpful in the non-IID case, even more so when eFL has more edge servers; and (H4) this is because our solution effectively neutralizes the non-IID-ness of the local training data at the edge level, justifying the heuristic used in our algorithm. We present the results below.

6.2.1 Accuracy tradeoff due to edge server deployment

Fig. 4 Accuracy tradeoff of eFL for the MNIST dataset: effect of the number of edge servers (M = 10, 20, 30) deployed, for different combinations of the number of local training updates (L) and the number of edge training updates (E). Deploying more servers leads to lower learning accuracy, and this tradeoff is more visible in the non-IID case

Figure 4 shows the accuracy of eFL as a function of the number of edge servers (M) for the MNIST case. Here, we chose \(\texttt {eFL\_rnd}\) for illustration because random assignment is the de facto choice in the literature. FL is included in this figure for comparison. As expected, eFL is not as accurate as FL; FL can still continue improving its accuracy with more global rounds, whereas eFL does not improve beyond 2000 rounds. The figure also shows that increasing the number of edge servers in eFL makes this accuracy tradeoff more significant. This is understandable since, given a fixed number of participants, as more edge servers are placed, each aggregates fewer participants. This reduces the diversity of the data distribution seen at each edge aggregation, so the regional models tend to drift apart. As a result, the global model aggregated from these regional models gets worse.

There is, however, a noteworthy observation: the eFL accuracy tradeoff is more visible in the non-IID case. For example, for the MNIST dataset, when the number of servers increases from 10 to 30, the accuracy decreases by 3–4% in the IID case (Fig. 4a), whereas it decreases by almost 20% in the non-IID case (Fig. 4b). Similar patterns on the effect of deploying more servers are also seen for the CIFAR10 dataset, illustrated in Fig. 5. This evaluation suggests that the accuracy tradeoff of eFL due to decentralization can be substantial, hence the motivation to design a good edge server assignment to minimize it.

Fig. 5 Accuracy tradeoff of eFL for the CIFAR10 dataset: effect of the number of edge servers (M = 10, 20, 30) deployed, for different combinations of the number of local training updates (L) and the number of edge training updates (E). Deploying more servers leads to lower learning accuracy, and this tradeoff is more visible in the non-IID case

6.2.2 Effect of the edge server assignment

Fig. 6 Comparison of eFL_new versus eFL_rnd: the y-axis is the ratio of eFL_new’s test accuracy to eFL_rnd’s test accuracy, for different combinations of the number of local training updates (L) and the number of edge training updates (E)

Figure 6 compares eFL using our proposed edge assignment (\(\texttt {eFL\_new}\)) versus eFL using a random edge assignment (\(\texttt {eFL\_rnd}\)). The figure plots the ratio of the test accuracy of \(\texttt {eFL\_new}\) to that of \(\texttt {eFL\_rnd}\). A ratio larger than 1 means better test accuracy for the former, and a ratio less than 1 the opposite.

We observe that \(\texttt {eFL\_new}\) is comparable to or better than \(\texttt {eFL\_rnd}\) in all configurations. They are comparable in the IID case, which is expected because a random assignment suffices when the local data is IID. They are also comparable when few edge servers are deployed, in either the IID or the non-IID case; with few edge servers, the degree of decentralization in eFL is not significant, leaving little room for accuracy improvement from an edge server assignment. That said, we still see slightly better accuracy for \(\texttt {eFL\_new}\) than for \(\texttt {eFL\_rnd}\). In the non-IID case, when more edge servers are deployed, the server assignment becomes critical to the learning accuracy, and \(\texttt {eFL\_new}\)’s superiority becomes more noticeable. For example, as seen in Fig. 6b for the MNIST dataset with 30 edge servers, \(\texttt {eFL\_new}\) improves over \(\texttt {eFL\_rnd}\) by 15%. This improvement is 20% for the CIFAR10 dataset; see Fig. 6d.

Fig. 7 Comparison of eFL_new versus eFL_sim: the y-axis is the ratio of eFL_new’s test accuracy to eFL_sim’s test accuracy, for different combinations of the number of local training updates (L) and the number of edge training updates (E)

Figure 7 compares \(\texttt {eFL\_new}\) to another benchmark, \(\texttt {eFL\_sim}\), the assignment method that uses the opposite heuristic: grouping “similar” instead of “diverse” participants on the same edge server. Here, we emphasize the non-IID case, where the assignment choice matters. Clearly, \(\texttt {eFL\_new}\) is much better. For MNIST (Fig. 7a), \(\texttt {eFL\_new}\)’s accuracy is 1.25+ times that of \(\texttt {eFL\_sim}\); the margin is even higher in earlier training rounds. For CIFAR10 (Fig. 7b), this margin ranges from 1.30 times for the 20-server case to 2 times for the 10-server case. These improvements over \(\texttt {eFL\_sim}\) are more impressive than those over \(\texttt {eFL\_rnd}\). The above evaluation supports our hypothesis that a proper assignment choice can improve eFL significantly. It also validates the theoretical heuristic behind the design of \(\texttt {eFL\_new}\), which outperforms the alternatives \(\texttt {eFL\_rnd}\) and \(\texttt {eFL\_sim}\).

6.2.3 IID-ness at the edge vs. IID-ness at the local

Fig. 8 Edge IID-ness using eFL on the MNIST dataset: each plot shows the distribution of the training samples of the participants assigned to each server over the 10 classification labels. The x-axis is the label, the y-axis is the number of samples having this label, and each curve is for a server (\(M=10\) or \(M=30\) servers)

Fig. 9 Edge IID-ness using eFL on the CIFAR10 dataset: each plot shows the distribution of the training samples of the participants assigned to each server over the 10 classification labels. The x-axis is the label, the y-axis is the number of samples having this label, and each curve is for a server (\(M=10\) or \(M=30\) servers)

Our assignment heuristic aims to maximize the IID-ness of the distribution of the training samples over the set of edge servers; we refer to this as “edge IID-ness”. Here, we provide experimental evidence that \(\texttt {eFL\_new}\) does achieve good edge IID-ness by comparing it to \(\texttt {eFL\_rnd}\). We do not include \(\texttt {eFL\_sim}\) in the comparison because, as analyzed above, it is even worse than \(\texttt {eFL\_rnd}\). The results are illustrated in Fig. 8 for the MNIST dataset and Fig. 9 for the CIFAR10 dataset. In these figures, each plot shows the distribution of the training samples of the participants assigned to each server over the 10 classification labels. There are M (10 or 30) curves in each plot, each representing a server. Good IID-ness is obtained if (1) the curves appear close to each other, because ideally each server should have the same distribution curve; and (2) the horizontal deviation is small, because ideally each label should have the same number of samples. In our evaluation, the global dataset has identical numbers of samples for each label.
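For completeness, the per-server label distributions plotted in these figures can be computed from an assignment as in the short sketch below (one straightforward way; names are illustrative).

```python
import numpy as np

def edge_label_histograms(assignment, participant_labels, n_classes=10):
    """assignment         : list of participant-index lists, one per edge server
    participant_labels : list of 1-D label arrays, one per participant
    Returns an (M, n_classes) matrix of label counts per edge server."""
    rows = []
    for region in assignment:
        region_labels = np.concatenate([participant_labels[k] for k in region])
        rows.append(np.bincount(region_labels.astype(np.int64), minlength=n_classes))
    return np.vstack(rows)
```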

Excellent edge IID-ness is achieved by eFL_new in the case where the local data distribution is IID. For example, see Fig. 8a for MNIST and Fig. 9a for CIFAR10, where the server curves are almost identical and the sample size for each label is also very similar. Although eFL_rnd results in a fairly even label distribution among the samples belonging to each server, this distribution, as well as the total sample size, still varies widely between servers: for the same label, the number of samples can differ by 300 between the two most different servers. The implication is that even in the IID-local case, which is the most desirable scenario, \(\texttt {eFL\_new}\) is still more “edge-IID” than eFL_rnd.

Figure 8b for MNIST and Fig. 9b for CIFAR10 show the edge IID-ness comparison for the non-IID local data distribution case. eFL_rnd clearly has substantial deviations both horizontally and vertically. In contrast, \(\texttt {eFL\_new}\) produces curves that look much closer to a constant line. The edge IID-ness improvement of \(\texttt {eFL\_new}\) over \(\texttt {eFL\_rnd}\) is also more noticeable when there are more edge servers: \(M=30\) servers compared to \(M=10\) servers.

The above evaluation confirms that our proposed assignment, \(\texttt {eFL\_new}\), does achieve better edge IID-ness, which explains why it offers better learning accuracy than the other benchmark methods.

6.3 More results

Fig. 10 \(\texttt {eFL\_new}\) vs. \(\texttt {eFL\_rnd}\) with Logistic Regression: (left) \(\texttt {eFL\_rnd}\) cannot provide an acceptable accuracy, only about 10%; (right) \(\texttt {eFL\_new}\) is highly effective, reaching 42% accuracy

In the discussion of the experimental results above, we used two deep neural networks, MLP and CNN, as the underlying learning models for FL and eFL. Our method is also effective for a simpler learning method such as Logistic Regression, which is essentially an MLP with the middle layers removed. Figure 10 compares \(\texttt {eFL\_new}\) vs. \(\texttt {eFL\_rnd}\) with Logistic Regression using a 30-server non-IID eFL setting on the MNIST dataset. This setting is difficult to learn with Logistic Regression, even using FL. \(\texttt {eFL\_rnd}\) cannot provide an acceptable accuracy, only about 10%. In contrast, \(\texttt {eFL\_new}\) is much better, reaching 42% accuracy.

Fig. 11 Illustration of accuracy improvement under increasing degrees of non-IID-ness: (left) 1-label dominance; (right) 3-label dominance

Figure 11 provides another illustration supporting \(\texttt {eFL\_new}\) under increasing degrees of non-IID-ness. Using a 30-server eFL setting with MNIST, two non-IID cases are considered: local samples at each participant are 90% dominated by one label versus by three labels. The accuracy improvement of \(\texttt {eFL\_new}\) over \(\texttt {eFL\_rnd}\) is clearly more substantial in the 1-label dominance case, which is more non-IID than the 3-label dominance case, the latter being closer to IID.

7 Conclusions

FL cannot scale when the number of participants is too large for the global server to aggregate. eFL improves scalability by using edge servers as regional aggregators, which, however, degrades learning accuracy. We have shown that the edge server assignment is critical to minimizing this tradeoff. We have proposed a simple yet effective solution based on the idea that the local models aggregated by an edge server should be maximally diverse. Our evaluation study shows that this solution outperforms the de facto standard random assignment by up to 20% on popular real-world datasets. The proposed assignment solution is especially helpful when more edge servers are deployed and the local training data distribution is non-IID. Making no strong assumptions, it is useful as a universal benchmark for comparing eFL algorithms. The work in this paper can naturally be extended to cases where the edge servers and participants are limited by computing and communication constraints. We will investigate these extensions in our future work.