Introduction

With the explosive growth of computing power and data, data-driven machine learning (ML) has become an increasingly attractive technology [1]. However, the performance of machine learning heavily depends on large-scale, high-quality data, which is often unavailable in practice because of the privacy sensitivity of the data [2]. Additionally, aggregating large amounts of training data on a central server is both expensive and insecure. As an emerging learning paradigm, federated learning (FL) enables collaborative machine learning without requiring clients to share their raw data, thereby reducing communication costs and ensuring data privacy [3,4,5,6].

FL was first proposed in a centralized form [3]. A general FL system uses a central server to coordinate the federated training task over the data of the participating clients in a star topology. Multiple clients perform local model training on their own datasets and periodically transmit the updated models to the central server. The server then conducts model averaging and broadcasts the newly aggregated model to all clients. Despite these achievements, the centralized FL framework faces several challenges in real-world scenarios. Centralized FL requires a central parameter server (PS) for model aggregation, parameter encryption and decryption, and other sensitive operations. On the one hand, finding a trusted third party to act as the central server that receives and aggregates client information is difficult in many FL environments [4]. On the other hand, the server may become the busiest node in the network, incurring a high communication cost, and if it is attacked it constitutes a single point of failure for the whole system.

Unlike centralized FL, decentralized federated learning (DFL) offers a favorable solution to these two challenges because it eliminates the need for a central server. In DFL, clients exchange their model parameters or gradient information directly in a peer-to-peer manner without the help of a third party [5]. However, a practical problem is how to aggregate information among clients to achieve model consensus and stable convergence [7]. In practice, each client in a DFL framework plays a role similar to that of the central server in model aggregation, yet it can only exchange messages directly with its neighbors, which makes consensus mechanisms essential for reaching the common learning goal [8].

Gossip averaging is widely used in decentralized algorithms; it converges quickly toward consensus among clients by exchanging messages in a peer-to-peer manner [9]. Specifically, each client sends a message to one or a group of other clients, and the message propagates through the whole network. Consequently, a common approach is to incorporate gossip averaging into DFL methods, where model updating and model averaging are executed alternately on the client side [10,11,12,13,14]. However, directly averaging the model parameters of local clients in a linearly weighted manner suffers from client drift [15] or weight divergence [16], especially when the training data distributions of different clients are heterogeneous, which slows convergence and degrades learning performance. Moreover, current research on non-IID problems predominantly focuses on the centralized FL framework [17,18,19,20,21] and is not directly applicable to DFL. The main challenge of DFL is to ensure model convergence and maintain accuracy in a fully decentralized training setting. Yet, the existing aggregation algorithms used in DFL cannot achieve competitive performance, and redundant communication places a heavy load on network bandwidth.

In contrast to existing work, we propose federated incremental subgradient-proximal optimization (FedISP) to address the aforementioned issues. Regarding the communication mode, the links are directed and static: each client communicates only with its clockwise neighbor, forming a ring topology that reduces the probability of network congestion and improves reliability. Regarding the aggregation scheme, incremental subgradient-proximal methods [22] are introduced instead of direct averaging to ensure model consensus across clients. Compared with cross-device scenarios, FedISP is better suited to cross-silo institutions, which have more stable communication links and fewer network nodes. Moreover, FedISP is not only more feasible and easier to implement than general DFL schemes that require complex system design, but it also exhibits favorable convergence properties with no additional communication overhead.

In summary, the main contributions of this paper are as follows.

  1. In this work, we apply incremental optimization methods to minimize the FL loss, giving rise to a novel DFL framework named FedISP with a ring topology.

  2. Theoretically, we guarantee the convergence of FedISP under convex conditions and obtain theoretical bounds on its performance. FedISP converges to within a bounded error of the optimum at a constant learning rate and converges exactly to an optimal solution at a diminishing learning rate. These theoretical results also suggest that FedISP can mitigate the problem of weight divergence.

  3. We design numerical experiments to evaluate model consensus via weight divergence and the network outputs of clients. The results show that our design significantly reduces weight divergence and that the models of different clients produce similar predictions. We also analyze the communication efficiency of the algorithm and give practical examples showing the advantages of the proposed method in terms of communication cost.

  4. We run extensive experiments on four image classification datasets in both IID and non-IID settings using the proposed FedISP algorithm and baseline methods, including the state-of-the-art DFL method Def-KT and popular centralized methods such as FedAvg and FedProx. The results show the superiority of FedISP over the baselines.

The remainder of this article is organized as follows. Section “Related work” provides an overview of related works. We formulate the considered problem and introduce the incremental optimization method in Section “System model”. The theoretical results and convergence analysis of FedISP are described in detail in Section “FedISP via incremental methods”. In Section “Experiment”, we investigate the impact of the learning rate hyperparameter on the algorithm and provide experimental results demonstrating the superiority of FedISP over the baseline methods. The limitations are discussed in Section “Discussion”, and the conclusion is given in Section “Conclusions and future work”.

Related work

Google first proposed FedAvg to aggregate clients’ information to learn a shared model while keeping private training data local [3]. Many studies retain this star architecture for communication and model aggregation because of its simple distributed parallelism. However, Zhao et al. [16] showed that the accuracy of FedAvg degrades substantially on highly skewed CIFAR-10 and proposed a data-sharing strategy in response. To improve the performance of FL on non-IID data, FedProx [17] leveraged a proximal term to restrict local updates toward the global model, thereby mitigating client drift. Similarly, Li et al. [20] corrected the local updates via the similarity between model representations, and Zhu et al. [18] performed inter-client collaboration through clients’ personalized cloud models. Nevertheless, these methods require the presence of a server and entail certain limitations in terms of communication cost.

To avoid the problems caused by the centralized FL framework, research on DFL frameworks has attracted much attention. The network topology formed by the clients determines the DFL training method and thus significantly affects the communication complexity and convergence performance of DFL. Regarding the communication methods of DFL, Roy et al. [10] first provided a peer-to-peer decentralized FL algorithm in which clients interact with each other directly without the help of a central component, but this approach incurs significant communication costs. Lalitha et al. [23, 24] proposed a Bayesian-based distributed algorithm in which each client updates its beliefs by aggregating information from its one-hop neighbors to train a model that best fits the observations over the entire network. The authors of [25, 26] both proposed blockchain-based decentralized FL frameworks in which clients use the blockchain for global model storage and local model update exchange. The authors of [27] and [11] utilized a ring topology to constrain the communication links among clients. In RDFL [27], a trusted node acquires the models of the remaining trusted nodes and then performs knowledge distillation and FedAvg. In C-DFL [11], each node exchanges models with its two connected neighbors and iteratively aggregates its model through model averaging. In addition, semi-decentralized architectures for FL have been explored [28,29,30], where peer-to-peer communications used to exchange model parameters are combined with client–server interactions to improve model training performance.

To ensure model consensus across clients in DFL, gossip averaging has been applied in DFL schemes to perform model aggregation [10,11,12,13,14]. In [14], a segmented gossip approach was proposed that allows clients to transmit model segments for averaging with other clients, fully utilizing the node-to-node bandwidth without harming the convergence rate. In the methods of [10,11,12,13], clients transmit and average the whole model. Considering the non-IID scenario, several studies [21, 27, 31] focused on integrating DFL with knowledge transfer. In [31], Li et al. proposed a DFL framework called Def-KT that introduces a mutual knowledge transfer algorithm to aggregate models. In [21], each federation obtains a personalized model through cyclic knowledge extraction. Wang et al. [27] proposed the RDFL algorithm, where clients use models obtained from other trusted nodes to distill knowledge that guides local training; FedAvg is then performed on all trusted nodes to obtain a new model. However, these methods often impose large communication costs on clients, and most improvement schemes for non-IID problems start from existing federated algorithms and aim only to improve the model’s generalization ability. Few attempts have been made to directly address the issue of weight divergence in DFL through mathematical optimization.

System model

In this section, we first present a formal description of the FL problem and give an overview of the idea of FedISP, which utilizes incremental subgradient-proximal methods to solve this optimization problem.

Consider a federated network with m clients whose local datasets \(D_1,\ldots ,D_m\) are uniformly sampled from m distinct distributions \(P_1,\ldots ,P_m\). All clients use the same type of task model with parameters \(w^{(i)}\). We formulate the FL problem as

$$\begin{aligned} {w}=\underset{{w} \in \mathbb {R}^n}{\mathop {\arg \min }} \sum _{i=1}^m f_i({w}), \end{aligned}$$
(1)

where \({f_i}({w}):\mathbb {R}^n\rightarrow \mathbb {R}\) are real-valued loss functions, such as the training objective function that maps the model parameter \(w^{(i)}\in \mathbb {R}^n\) of client i to a real-valued training loss. Specifically,

$$\begin{aligned} f_i({w})=\mathbb {E}_{{x} \sim P_i}[l({w}; {x})] \approx \frac{1}{\left| D_i\right| } \sum _{{x} \in D_i} l({w}; {x}), \end{aligned}$$
(2)

where \(l({w}; {x})\) is the loss function and \(|D_{i}|\) is the number of local samples of client i.

Let \(F({w})=\sum _{i = 1}^m {{f_i}(w)}\). Denote by \(F^{*} = \inf _{w \in \mathbb {R}^{n}}F({w})\) the optimal value of problem (1) and by \(W^{*}=\{w^{*}\in \mathbb {R}^n : F(w^*)=F^{*}\}\) the set of optimal solutions.

For the solution of problem (1), the objective function can be reformulated as follows:

$$\begin{aligned} \begin{aligned} G({w}):=2 F({w})=\sum _{i=1}^m f_i({w})+\sum _{i=1}^m f_i({w})\\ =\sum _{i=1}^m f_i({w})+\sum _{i=1}^m h_i({w}), \end{aligned} \end{aligned}$$
(3)

where \({h_i}({w})={f_i}(w)\). Therefore

$$\begin{aligned} G({w}):=\sum _{i=1}^m G_i({w})=\sum _{i=1}^m [f_i({w})+h_i({w})]. \end{aligned}$$
(4)

Then, the optimization problem (1) can be rewritten as

$$\begin{aligned} {w}&=\underset{{w} \in \mathbb {R}^n}{\mathop {\arg \min }}\left\{ F({w}):=\sum _{i=1}^m f_i({w})\right\} \nonumber \\&=\underset{{w} \in \mathbb {R}^n}{\mathop {\arg \min }}\left\{ G({w}):= \sum _{i=1}^m\left[ f_i({w})+h_i({w})\right] \right\} . \end{aligned}$$
(5)

Note that G(w) is equivalent, up to a factor of 2, to the objective function of the common FL problem, so the two problems share the same set of optimal solutions, and the optimal solution \(w^*\) can theoretically achieve the same performance.

In addition, the loss function used in FL is not necessarily convex, but it can often be treated as convex near the optimal solution, allowing for effective optimization using gradient-based methods.

Let X be a convex set containing the optimal solution and G(w) be convex on X. The optimization problem presented in Eq. (5) can be solved by incremental subgradient-proximal methods [22]. For the kth iteration, let \(w_{k}^{(0)}=w_{k-1}^{(m)}\) and take the following steps:

$$\begin{aligned}&{z}_{k}^{(i)}=\underset{{w} \in X}{\mathop {\arg \min }} \left\{ h_{i}({w})+\frac{1}{2\alpha _{k}}\left\| {w}-{w}_{k}^{(i-1)}\right\| ^{2}\right\} \end{aligned}$$
(6)
$$\begin{aligned}&{w}_{k}^{(i)}={z}_{k}^{(i)}-\alpha _{k} \widetilde{\nabla } f_{i}({z}_{k}^{(i)}), \end{aligned}$$
(7)

where \(\alpha _{k}\) is the learning rate.
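To make the two-step update concrete, the sketch below implements one client's incremental proximal step (Eq. 6), approximated by a few epochs of SGD on \(h_i(w)+\frac{1}{2\alpha _k}\Vert w-w_k^{(i-1)}\Vert ^2\), followed by the incremental subgradient step (Eq. 7), itself approximated by one pass of SGD over the local data. The function and argument names (`client_update`, `prox_epochs`, etc.) are illustrative assumptions, not part of the original implementation.

```python
import copy
import torch


def client_update(model, received_state, data_loader, loss_fn, alpha, prox_epochs=1):
    """One FedISP client step: proximal optimization of h_i, then a subgradient pass on f_i.

    `received_state` is w_k^{(i-1)} passed from the previous client on the ring.
    This is a sketch: the exact argmin in Eq. (6) is approximated by SGD on the
    proximally regularized local loss.
    """
    # Frozen copy of the received parameters, used as the anchor of the proximal term.
    anchor = {name: p.detach().clone() for name, p in received_state.items()}
    model.load_state_dict(received_state)
    optimizer = torch.optim.SGD(model.parameters(), lr=alpha, momentum=0.5)

    # --- Incremental proximal optimization (approximate argmin in Eq. 6) ---
    for _ in range(prox_epochs):
        for x, y in data_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)  # h_i(w) = f_i(w) evaluated on the local data
            # Proximal term (1 / (2*alpha)) * ||w - w_k^{(i-1)}||^2
            prox = sum(((p - anchor[name]) ** 2).sum()
                       for name, p in model.named_parameters())
            (loss + prox / (2.0 * alpha)).backward()
            optimizer.step()
    z = copy.deepcopy(model.state_dict())  # intermediate result z_k^{(i)}

    # --- Incremental subgradient optimization (Eq. 7, approximated by one local epoch) ---
    for x, y in data_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()    # subgradient of f_i at the current point
        optimizer.step()                   # w_k^{(i)} = z_k^{(i)} - alpha * subgradient

    return model.state_dict()              # sent to the clockwise neighbor
```

The intermediate state `z` is kept only to mirror the notation of Eqs. (6) and (7); in this sketch the subgradient pass simply continues from it.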

FedISP via incremental methods

In this section, we introduce the proposed framework FedISP with a cyclic order of client selection. In addition, the specific implementation steps of FedISP are described in detail.

FedISP algorithm

Consider a network topology with m clients and no server. The clients involved in the FL task are arranged on a ring and communicate only with their respective neighbors through peer-to-peer links. Figure 1 gives an overview of the communication graph. The data flow among clients circulates around the ring: each client sends its model to its right-hand neighbor and receives a model from its left-hand one, forming a directed communication link. To preserve privacy, only model parameters are communicated across the network; raw data are never exchanged.

Fig. 1
figure 1

The structure of FedISP. The model training process of each client can be divided into two steps, i.e., incremental proximal optimization and incremental subgradient optimization

FedISP aims to achieve a common learning goal by accumulating model advantages in a cyclic manner among clients without compromising data privacy. Before training, cross-silo institutions, such as multiple medical institutions, can negotiate their order on the ring; for example, institutions that are geographically closer can act as neighbors to reduce communication costs.

We assume that the clients are indexed by \(i=1,2,\ldots ,m\) and that the initial model parameter of the kth iteration is \(w_{k}^{(0)}\). In particular, client 1 receives the model from the last client on the ring to form a closed loop. In the kth global iteration, the two-step incremental optimization process of the ith client is as follows. In local training, each client seeks an approximate solution to its objective using its local data and its local solver (e.g., stochastic gradient descent).

Incremental proximal optimization: Client i receives the model \(w_{k}^{(i-1)}\) passed from the previous client \(i-1\) to guide the local training, that is, to optimize \(h_{i}\) by Eq. (6).

The proximal term keeps the result of the proximal iteration \(z_{k}^{(i)}\) close to the previous model \(w_{k}^{(i-1)}\) while fine-tuning the local model, which helps exploit the distinct training datasets to enhance the generalization capability.

Incremental subgradient optimization: After the proximal iteration produces the intermediate result \(z_{k}^{(i)}\), client i computes \(w_{k}^{(i)}\) locally by Eq. (7) to optimize \(f_{i}\). Finally, the trained local model is pushed to the neighboring client in the clockwise direction.

Through the iteration above, each client not only gains the capability to handle diverse data but also maintains performance on its local data. We repeat the above optimization process until every client has completed training in the current global iteration. Then, we set \(w_{k+1}^{(0)}=w_{k}^{(m)}\) and begin the next global iteration, continuing until the loss function converges. Finally, all clients learn the data distributions of the other clients through the FedISP algorithm, i.e., the global models converge consistently.

Algorithm 1 outlines the primary procedure, in which FedISP implements the optimization steps of incremental subgradient-proximal methods among clients; that is, it iteratively optimizes G(w) by alternately optimizing \(h_{i}\) and \(f_{i}\) with the incremental proximal method and the incremental subgradient method, respectively, until a preset maximum number of iterations K is reached.

Algorithm 1
figure a

FedISP for decentralized federated learning (proposed framework)
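Putting the pieces together, the following sketch shows one possible realization of the training loop described in Algorithm 1, reusing the hypothetical `client_update` routine from the previous section; the names `fedisp_train` and `alpha_schedule` are illustrative and not taken from the paper.

```python
def fedisp_train(clients, data_loaders, loss_fn, alpha_schedule, num_rounds):
    """Sketch of the FedISP training loop over a ring of m clients.

    `clients` is a list of m models arranged on the ring; `alpha_schedule(k)`
    returns the learning rate for round k (constant or diminishing).
    """
    # w_k^{(0)}: the model state that circulates around the ring.
    state = clients[0].state_dict()

    for k in range(num_rounds):
        alpha = alpha_schedule(k)
        for model, loader in zip(clients, data_loaders):
            # Client i receives w_k^{(i-1)} from its counterclockwise neighbor,
            # performs the proximal and subgradient steps, and forwards w_k^{(i)}.
            state = client_update(model, state, loader, loss_fn, alpha)
        # After the last client, w_{k+1}^{(0)} = w_k^{(m)} starts the next round.
    return state
```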

Table 1 summarizes the communication complexity of different FL approaches. In the centralized client–server architecture [3, 32], clients exchange model weights with the server twice (upload and download) per global iteration to obtain model updates, doubling the total amount of transferred data, and the server may at times experience a heavy node load. In fully connected P2P communication, all clients can exchange information arbitrarily, which incurs a large communication cost because each client communicates with all other clients in the network. In RDFL [27], each client communicates N-1 times with its clockwise neighbor in the ring topology to obtain the models of the remaining clients at each synchronization step. In C-DFL [11], each client shares its model with its two neighboring clients in the ring topology per round. Def-KT [31], a peer-to-peer architecture, uses the fewest communications, because in each round half of the participating clients are randomly selected to perform pairwise communication with the remaining clients for mutual knowledge transfer. Compared with RDFL and C-DFL, which use a similar ring architecture, the proposed FedISP achieves better performance in terms of communication load and cost, because each client needs to transmit the smallest number of models.

Table 1 Communication complexity. M and N denote the size of the model parameters and the number of clients, respectively
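The per-round communication volumes implied by the description above can be tallied as in the rough sketch below. These formulas are back-of-the-envelope estimates derived from the prose, not the exact entries of Table 1, and `per_round_traffic` is a hypothetical helper; Def-KT is omitted because its count depends on the pairing scheme.

```python
def per_round_traffic(num_clients, model_bits):
    """Rough per-round communication volume (in bits) for N clients and model size M bits."""
    N, M = num_clients, model_bits
    return {
        "centralized (FedAvg/FedProx)": 2 * N * M,  # each client uploads and downloads once
        "fully connected P2P": N * (N - 1) * M,     # every client sends to every other client
        "RDFL": N * (N - 1) * M,                    # each client relays N-1 models on the ring
        "C-DFL": 2 * N * M,                         # each client shares with its two neighbors
        "FedISP": N * M,                            # each client sends one model clockwise
    }


# Example with 10 clients and the 0.24M-bit CNN used later in the experiments.
print(per_round_traffic(10, int(0.24e6)))
```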

Convergence analysis of FedISP

In this section, we analyze the convergence of the proposed framework when G is convex. Similar to the analysis of many incremental and stochastic optimization algorithms [17, 22], we make the following assumption.

Assumption 1

There exists a constant \(c \in \mathbb {R}\) such that \(\max \left\{ \left\| \widetilde{\nabla } f_i({z}_k^{(i)})\right\| ,\left\| \widetilde{\nabla } h_i({z}_k^{(i)})\right\| \right\} \le c\) holds for all k and i, where \(\widetilde{\nabla }\) denotes a subgradient and \(\left\| \bullet \right\| \) is the norm. Furthermore, \(\max \left\{ f_i({w}_k^{(0)})-f_i({z}_k^{(i)}),h_i({w}_k^{(0)})-h_i({z}_k^{(i)})\right\} \le c\left\| {w}_k^{(0)}-{z}_k^{(i)}\right\| \) holds for all k and i.

For our problem in Eq. (5), Assumption 1 naturally holds if both \(f_i(w)\) and \(h_i(w)\) are Lipschitz continuous over the entire space \(\mathbb {R}^n\).

Now, we provide the convergence guarantee for FedISP when \(f_i(w)\) and \(h_i(w)\) are convex functions and the learning rate \(\alpha \) is a constant.

Theorem 1

(Prop. 3.2 in [22]) Under Assumption 1, and assuming the functions \(f_i(w)\) and \(h_i(w)\) in Eq. (5) are convex, let \(\{w_k^{(i)}\}\) be the sequence generated by Eq. (7) with a cyclic order of client selection, and let the learning rate \(\alpha _k\) be fixed at some positive constant \(\alpha \).

If \(G^*=- \infty \), then

$$\begin{aligned} \liminf \limits _{k \rightarrow \infty } G(w_k^{(i)}) = {G^*}. \end{aligned}$$
(8)

If \(G^*>- \infty \), then

$$\begin{aligned} \liminf \limits _{k \rightarrow \infty } G(w_k^{(i)}) \le G^* + \frac{{\alpha \beta {m^2}{c^2}}}{2}, \end{aligned}$$
(9)

where \(\beta = \frac{1}{m} + 4\) and c is the constant of Assumption 1.

Theorem 1 implies that with a constant learning rate (\(\alpha _k \equiv \alpha \)), convergence can be established to a neighborhood of the optimum, which shrinks to 0 as \(\alpha \rightarrow 0\).

Next, we provide the convergence guarantee for FedISP when \(f_i(w)\) and \(h_i(w)\) are convex functions and the learning rate \(\alpha _k\) is variable.

Theorem 2

(Prop. 3.4 in [22]) Under Assumption 1, and assuming the functions \(f_i(w)\) and \(h_i(w)\) in Eq. (5) are convex, let \(\{w_k^{(i)}\}\) be the sequence generated by Eq. (7) with a cyclic order of client selection. If the learning rate \(\alpha _k\) satisfies \(\mathop {\lim }_{k\rightarrow \infty }{\alpha _k}=0\) and \(\sum _{k = 0}^\infty {\alpha _k} =\infty \), then

$$\begin{aligned} \liminf \limits _{k\rightarrow \infty } G(w_k^{(i)})=G^*. \end{aligned}$$
(10)

Furthermore, if \(W^*\) is nonempty and \(\sum _{k = 0}^\infty {\alpha _k^2}<\infty \), then \(\{w_k^{(i)}\}\) converges to some \(w^* \in W^*\).

Theorem 2 implies that with a diminishing learning rate \(\alpha _k\) satisfying the above conditions, the entire sequence \(\{w_k^{(i)}\}\) converges to the same optimal solution, that is, \(w^* = \mathop {\lim }_{k\rightarrow \infty }w_k^{(0)}=\mathop {\lim }_{k\rightarrow \infty }w_k^{(1)}=\cdots = \mathop {\lim }_{k\rightarrow \infty }w_k^{(m)}\).

Experiment

In this section, a series of experiments are executed to evaluate the performance of the proposed framework. A comparison with the baseline methods is conducted to validate the efficacy of our approach.

Experimental setup

We conduct numerical experiments on four image classification datasets: MNIST [33], CIFAR-10 [34], FMNIST (Fashion-MNIST) [35], and EMNIST (Extended MNIST) [36]. For the CIFAR-10 dataset, a convolutional neural network (CNN) with two convolutional layers and three fully connected layers is utilized. For the MNIST, FMNIST, and EMNIST datasets, multilayer perceptron (MLP) networks with three hidden layers are used. The details of the network architectures are given in the analysis of weight divergence (see Section ‘Weight divergence’). In addition, we analyze the case of different numbers of clients by conducting experiments on the CIFAR-100 dataset, where a CNN composed of three convolutional layers and two fully connected layers is utilized.

We compare FedISP with three state-of-the-art approaches: (1) FedAvg [3], (2) FedProx [32], and (3) Def-KT [31], which cover centralized FL (FedAvg, FedProx) and decentralized FL (Def-KT). We also compare with a local training method named Solo, in which clients train their local models independently without communication. In the IID case, we also compare with centralized learning, and all approaches use the same model for fairness.

Fig. 2
figure 2

Top-1 accuracy vs. number of communication rounds on the four datasets, where different algorithms are performed (IID)

Table 2 The total numbers of transferred bits in the training process on the four datasets

PyTorch is employed to implement our method and the other baselines. We set the number of clients to 10, and all clients fully participate in each global iteration by default. To fairly compare the training performance of the different methods, we use SGD as the local solver with a batch size of 64, a momentum of 0.5, a weight decay of 0.00001, and a learning rate of 0.01. FedProx requires tuning the hyperparameter \(\mu \) to control the weight of its proximal term. We adjust \(\mu \) within \(\{0.001, 0.01, 0.1, 1\}\) following [32] and report the best results. For MNIST, FMNIST, EMNIST, and CIFAR-10, the best \(\mu \) value for FedProx is 0.001 in our setting, and unless explicitly specified, we use this \(\mu \) for all remaining experiments. Def-KT randomly selects half of the clients to train local models in each round and guide the remaining clients via mutual learning, which converges faster than when only a single pair of clients communicates. All reported top-1 accuracies are measured on a global test set; for approaches without a server, the reported accuracy is the average test accuracy over all clients.

Both IID and non-IID data settings are considered in evaluating the performance of the proposed method. In the IID case, the training data are shuffled and randomly distributed to the clients. In the non-IID case, similar to previous studies [18, 19, 31, 37], two heterogeneous scenarios are considered to simulate the distinct data distributions of federated clients, referred to as pathological non-IID and Dirichlet non-IID. With the pathological non-IID partitioning strategy, most clients own samples from only \(\xi \) classes; in particular, because EMNIST has 62 classes with unequal numbers of samples, each client owns data from more than \(\xi \) classes in our setting. The Dirichlet partitioning strategy simultaneously induces quantity skew and label distribution skew [5, 20], and the degree of data heterogeneity can be controlled by the hyperparameter \(\beta \). In extreme cases (\(\beta =0.1\)), some clients have very few local samples.
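The Dirichlet partition can be implemented in the usual way, as in the sketch below; `dirichlet_partition` is a hypothetical helper and not necessarily the exact script used in the paper.

```python
import numpy as np


def dirichlet_partition(labels, num_clients=10, beta=0.5, seed=0):
    """Dirichlet label-skew partition: smaller beta means stronger heterogeneity.

    For every class, a proportion vector drawn from Dir(beta) decides how that
    class's samples are split across clients. Returns one index array per client.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]

    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet([beta] * num_clients)
        # Split this class's samples at the cumulative proportion boundaries.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())

    return [np.array(ci) for ci in client_indices]
```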

Results in the IID data setting

Accuracy comparison

Figure 2 shows the top-1 accuracy per round during model training on the four image classification datasets under the IID data setting. The proposed framework FedISP possesses convergence properties similar to those of centralized learning. In terms of communication rounds, FedISP converges as fast as centralized learning, and both are much faster than the other algorithms. Because the best \(\mu \) values for FedProx are small, FedProx with the best \(\mu \) performs very similarly to FedAvg; when \(\mu =1\), FedProx becomes slower due to the proximal term. Compared with the other algorithms, Def-KT converges the slowest and fails to reach the level of centralized learning within the given number of communication rounds.

Fig. 3
figure 3

Average weight divergence of the corresponding parameters between every pair of clients (FedISP (fixed))

Communication overhead

The model weights occupy 0.24M and 0.57M bits for the CNN and MLP, respectively. Table 2 displays the approximate total number of transmitted bits required to achieve the target accuracy. Since FedProx and FedAvg perform similarly, only one of them is listed. The analysis is as follows. (1) To reach the target accuracy, FedAvg needs more communication rounds and transfers the largest total number of bits. (2) Compared with FedAvg and Def-KT, FedISP shares fewer parameters; although Def-KT shares fewer bits in each global iteration, its convergence is slower and less stable. These results show that FedISP is a decentralized framework with high communication efficiency, reducing the communication overhead in terms of both the number of communication rounds and the communication cost per round.

Weight divergence

To study the degree of model consensus after federated training and thereby verify the convergence of FedISP, we perform comparative experiments with two different types of models (MLP and CNN) on the MNIST and CIFAR-10 datasets in the IID setting. Specifically, we calculate the difference between the corresponding parameters of each network layer for every pair among the m clients and then average the results. The network layers are indexed by l, and the parameters are indexed by k. We calculate the average weight divergence by

$$\begin{aligned} D^{l,k}=\frac{\sum _{i=1}^m{\sum _{j=i+1}^m\left| {w}_i^{l, k}-{w}_j^{l, k}\right| }}{m(m-1) / 2}, \end{aligned}$$
(11)

where \(\left| \bullet \right| \) denotes the absolute value.
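The sketch below shows one way to compute this pairwise average weight divergence from a list of client state dicts; `average_weight_divergence` is an illustrative name, not a routine from the paper.

```python
import itertools
import torch


def average_weight_divergence(client_states):
    """Average absolute difference of corresponding parameters over all client pairs (Eq. 11)."""
    m = len(client_states)
    pairs = list(itertools.combinations(range(m), 2))    # m(m-1)/2 client pairs
    divergence = {}
    for name in client_states[0]:                         # layer/parameter name, i.e., index (l, k)
        diff = sum(torch.abs(client_states[i][name] - client_states[j][name])
                   for i, j in pairs)
        divergence[name] = diff / len(pairs)              # element-wise D^{l,k}
    return divergence
```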

Figure 3 shows the average weight divergence of the model parameters with \(\alpha =0.01\) after federated training. On MNIST, the difference between the models after convergence is sufficiently small compared to the magnitude of the model parameters (on the order of \(10^{-2}\)). However, on CIFAR-10, there is still some distance among the model weights, which is also reflected in the magnitude of the accuracy standard deviation in Table 5.

For FedISP with a diminishing learning rate, we compute the weight divergence after federated training using the same metric on the CIFAR-10 dataset. Further discussion and details about the diminishing learning rate are presented in the non-IID experiments (see Section ‘Variable learning rate’), as FedISP with a fixed learning rate already shows promising results in the IID setting. In addition, to better assess the gaps in the model weights, we report the relative rather than the absolute average weight divergence, which is computed by

$$\begin{aligned} r^{l,k}=\frac{D^{l,k}}{\left| {w}_i^{l, k}\right| }, \end{aligned}$$
(12)

where \({w}_i^{l, k}\) denotes the model parameter of a particular client such as \(C_0\). The results are shown in Fig. 4 and Table 3.

Fig. 4
figure 4

Average weight divergence of the corresponding parameters between every pair of clients [FedISP(d)], CNN on CIFAR-10

According to Fig. 4, the weight divergence decreases significantly. The abnormally large relative values of a few parameters in Table 3 occur because the weight values themselves are small. For instance, the model parameters corresponding to the maximum relative values of the fc1 and fc2 weights are \(-3.37e-6\) and \(1.17e-5\), respectively, whereas their absolute differences are only \(4.26e-5\) and \(2.04e-5\), respectively.

Table 3 The statistics of normalized average weight divergence [FedISP(d)]

Likewise, we compare the differences among the network outputs of different clients on the test set by computing

$$\begin{aligned} D^{sample}=\frac{\sum _{i=1}^m{\sum _{j=i+1}^m\left\| {y}_i^{sample}-{y}_j^{sample}\right\| }}{m(m-1) / 2}, \end{aligned}$$
(13)

where y represents the logits output by the model and \(\left\| \bullet \right\| \) represents the \(l_2\) vector norm. The relative value is computed by

$$\begin{aligned} {r^{sample}} = \frac{D^{sample}}{{\left\| {y_0^{sample}} \right\| }}, \end{aligned}$$
(14)

where \(y_0^{sample}\) denotes the logit output of client 0 on each test sample.
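As with the weight metric, a short sketch of how Eqs. (13) and (14) could be evaluated is given below; `output_divergence` and the batched evaluation are assumptions made for illustration.

```python
import itertools
import torch


@torch.no_grad()
def output_divergence(models, test_loader):
    """Average pairwise l2 distance of client logits per test sample (Eqs. 13 and 14)."""
    m = len(models)
    pairs = list(itertools.combinations(range(m), 2))
    abs_vals, rel_vals = [], []
    for x, _ in test_loader:
        logits = [model(x) for model in models]             # y_i^{sample}, shape (batch, classes)
        d = sum(torch.norm(logits[i] - logits[j], dim=1)
                for i, j in pairs) / len(pairs)             # D^{sample}
        abs_vals.append(d)
        rel_vals.append(d / torch.norm(logits[0], dim=1))   # r^{sample}, client 0 as reference
    return torch.cat(abs_vals), torch.cat(rel_vals)
```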

Table 4 The statistics of average model output difference on the test sample [FedISP(d)]. The number of test samples is 10,000
Fig. 5
figure 5

Top-1 accuracy vs. numbers of communication rounds for FedISP with a fixed learning rate from \(\{0.002, 0.005, 0.01, 0.02\}\) (Dirichlet non-IID, \(\beta =0.1\))

From Table 4, although a few parameters exhibit large relative weight divergence, the network outputs remain similar across clients, suggesting that these parameters have little influence on network behavior. Overall, the above experiments empirically verify the convergence not only of the model accuracy but also of the model weights. Accordingly, FedISP combines incremental optimization techniques with DFL, enabling all clients to achieve improved performance and superior model consensus.

Results in the non-IID data setting

The experiments conducted in the IID setting demonstrate the feasibility and effectiveness of the FedISP algorithm. Guided by the convergence analysis in Section ‘Convergence analysis of FedISP’, we first study the influence of the learning rate on FedISP and then compare our method with the other baselines in different experimental settings.

Fixed learning rate

Recall that the upper bound on the convergence of the loss in Eq. (9) of Theorem 1 depends on \(\alpha \) when the learning rate is constant. Theoretically, the loss can reach a neighborhood of the optimal value, and the error bound relative to the optimum tends toward zero as the learning rate decreases. To study the influence of the learning rate on model accuracy and convergence, we set the learning rate to a constant selected from \(\{0.002, 0.005, 0.01, 0.02\}\).

Figure 5 shows the top-1 accuracy curves of the FedISP method at different learning rates. A higher learning rate accelerates the convergence of FL, while a lower learning rate shrinks the neighborhood error around the optimal value, leading to higher accuracy. This experimental phenomenon accords with the result of Theorem 1. However, when the fixed learning rate is too low, training is prone to falling into a local optimum, and a finite number of iterations may not reach a sufficiently high test accuracy, as shown in Fig. 5b. On the other hand, if the learning rate is too high, convergence becomes unstable, especially in scenarios with data heterogeneity, as shown in Fig. 5a. Therefore, to balance accuracy and convergence speed, we select a learning rate of 0.01 as the benchmark for comparison with other methods in the subsequent experiments.

Variable learning rate

Fig. 6
figure 6

Diagram of different learning rate decay strategies

We now evaluate the efficiency and analyze the behavior of FedISP under different learning rate decay strategies motivated by Theorem 2. As shown in Eq. (10), the limit inferior of the objective function equals the optimal value, giving a stronger theoretical guarantee than Eq. (9). Following common learning rate adjustment practice, we consider the following five schedules (a code sketch of these schedules is given after the list), and Fig. 6 presents the corresponding learning rate curves.

Table 5 The top-1 accuracy of FedISP with different learning rate strategies on the test set after 500 epochs of training. We report the mean and standard deviation among all clients (\(m\pm \sigma \))
Fig. 7
figure 7

Top-1 accuracy vs. number of communication rounds for FedISP on the four datasets, where different learning rate strategies are performed (Dirichlet non-IID, \(\beta =0.1\))

  (a) Equispaced: \(\alpha _k=\alpha _0 \times {\left( \frac{1}{2} \right) ^{\lfloor k / 50 \rfloor }}\); we set the initial learning rate \(\alpha _0\) to 0.01 and the step size to 50, which means that the learning rate is halved every 50 epochs.

  (b) Exponential: \(\alpha _k=\alpha _0\gamma ^k\), where we set \(\gamma \) to 0.99.

  (c) Linear: \({\alpha _k}=\max \left\{ {(\alpha _0- c{\alpha _0}){\frac{N - k}{N}} + c{\alpha _0},c{\alpha _0}} \right\} \), where c denotes the attenuation factor; we perform linear interpolation between the initial and target learning rates and then fix the learning rate.

  (d) Cosine: \({\alpha _k}=\alpha _{\min } + \frac{1}{2}(\alpha _0-\alpha _{\min })\left( 1 + \cos {\frac{k}{K}}\pi \right) \), where \(\alpha _{\min }\) is the minimum learning rate.

  (e) Harmonic: \({\alpha _k} = {\left\{ \begin{array}{ll} \alpha _0,&{}{\text {if}}\ k<50 \\ {\frac{\alpha _0}{k-49},}&{}{\text {otherwise.}} \end{array}\right. }\)

For ease of distinction, we abbreviate these as FedISP (fixed) and FedISP (\(a\sim e\)).
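A minimal sketch of these five schedules as a single Python function is shown below; the default constants mirror the settings stated above, while `lr_schedule`, the attenuation factor, and `alpha_min` values are assumptions used only for illustration.

```python
import math


def lr_schedule(strategy, k, alpha0=0.01, total_rounds=500,
                gamma=0.99, attenuation=0.1, alpha_min=0.0):
    """Return the learning rate at round k for the five decay strategies (a)-(e)."""
    if strategy == "equispaced":      # (a) halve every 50 rounds
        return alpha0 * 0.5 ** (k // 50)
    if strategy == "exponential":     # (b) alpha0 * gamma^k
        return alpha0 * gamma ** k
    if strategy == "linear":          # (c) interpolate down to attenuation*alpha0, then hold
        target = attenuation * alpha0
        return max((alpha0 - target) * (total_rounds - k) / total_rounds + target, target)
    if strategy == "cosine":          # (d) cosine annealing toward alpha_min
        return alpha_min + 0.5 * (alpha0 - alpha_min) * (1 + math.cos(math.pi * k / total_rounds))
    if strategy == "harmonic":        # (e) constant for 50 rounds, then ~1/k decay
        return alpha0 if k < 50 else alpha0 / (k - 49)
    raise ValueError(f"unknown strategy: {strategy}")
```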

Table 5 shows the impact of the different learning rate strategies adopted by the FedISP algorithm. In the non-IID case with different levels of imbalance, FedISP with a fixed learning rate no longer maintains the excellent performance it shows in the IID case, and model consensus deteriorates as the degree of heterogeneity increases. To better adapt to different application scenarios, such as heterogeneous data, FedISP with a diminishing learning rate is a favorable choice.

From Fig. 7, we find that FedISP with different learning rate strategies produces characteristic accuracy curves, especially FedISP (a), which exhibits a stepwise, seemingly unsmooth rise in accuracy. The reason can be explained as follows.

In the early stage of model training, a higher learning rate ensures fast convergence, and the influence of the proximal term is very small; the early curves of FedISP (fixed) almost coincide with those of FedISP (\(a\sim e\)). With increasing communication rounds, FedISP (fixed) soon reaches a bottleneck and is prone to falling into a local optimum, while a dynamic learning rate helps escape the local optimum and attain higher accuracy.

To further investigate the reason for the inflection point in the accuracy curves under the variable learning rate strategies in Fig. 7, we use the FMNIST dataset (pathological non-IID, \(\xi =2\)) as an example for analysis. Figure 8 illustrates the angular relationships among the proximal update \({z}_k^{(i)} - w_{k-1}^{(i)}\), the local update \({w}_{k}^{(i)} - z_{k}^{(i)}\), and the synthetic update \({w}_{k}^{(i)} - w_{k-1}^{(i)}\), where the model parameters are flattened into a vector. For example, the angle between the proximal update and the local update is computed by

$$\begin{aligned} \theta _{1}=\arccos \frac{\left\langle {z}_k^{(i)} - {w}_{k-1}^{(i)},\,{w}_{k}^{(i)} - {z}_k^{(i)} \right\rangle }{\left\| {z}_k^{(i)} - {w}_{k-1}^{(i)} \right\| \left\| {w}_{k}^{(i)} - {z}_k^{(i)} \right\| }. \end{aligned}$$
(15)
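A short helper that computes this angle from flattened parameter vectors could look as follows; `flatten` and `update_angle` are illustrative names, and the commented usage example assumes hypothetical state dicts for the three models involved.

```python
import torch


def update_angle(u, v):
    """Angle (in degrees) between two flattened update vectors, as in Eq. (15)."""
    cos = torch.dot(u, v) / (torch.norm(u) * torch.norm(v))
    return torch.rad2deg(torch.arccos(cos.clamp(-1.0, 1.0)))


def flatten(state_dict):
    """Concatenate all parameters of a state dict into a single vector."""
    return torch.cat([p.flatten().float() for p in state_dict.values()])


# theta_1: angle between the proximal update and the local update of client i,
# where w_prev, z, w_new would hold w_{k-1}^{(i)}, z_k^{(i)}, and w_k^{(i)}:
# theta1 = update_angle(flatten(z) - flatten(w_prev), flatten(w_new) - flatten(z))
```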
Fig. 8
figure 8

The angular relationships of the weight increments during the training stage of a certain client on FMNIST (pathological non-IID, \(\xi \)=2)

Fig. 9
figure 9

Average weight divergence of the corresponding parameters between every pair of clients (FedISP(d)), CNN on CIFAR-10

Table 6 The statistics of the average model output difference on the test samples [FedISP(d)]

The direction of the proximal update differs from that of the local update, as indicated by the obtuse angle. This shows that while the local update provides personalized knowledge, it simultaneously counteracts the effect of the proximal update. When the data distribution is IID, the local update direction is consistent with the globally optimal direction. However, when the data distributions deviate significantly among clients, the update direction in the local training stage diverges from the optimal path. Thus, a decreasing learning rate is beneficial for further optimizing the model parameters, as it reduces the influence of local training (i.e., the step size of the local update). Additionally, the smaller the learning rate, the more strongly the learning between neighboring clients is constrained.

In addition, as seen for FedISP (a) in Fig. 7, each decay step of the learning rate (which tends toward zero) leads to a noticeable increase in accuracy, bringing it closer to the optimal value, which accords with the result of Theorem 1. In contrast to FedISP (a), the accuracy curves of FedISP (\(b\sim e\)), whose learning rates decay more smoothly, behave more smoothly, and the continuous decay provides continuous momentum. Among these five strategies, FedISP (e) converges faster at the cost of lower final accuracy, while FedISP (d) achieves the best accuracy in most cases. In general, FedISP with a diminishing learning rate converges more robustly than FedISP with a fixed learning rate in both the IID and non-IID cases. This conclusion also shows the agreement between the theoretical and empirical results.

Weight divergence

In the non-IID case, the weight divergence is expected to be larger than in the IID case because of the inconsistent data distributions across clients. Using the same weight divergence calculation as in the IID case, Fig. 9 shows the average weight divergence of the CNN model on the CIFAR-10 dataset. Compared with the case of \(\beta =0.5\), the weight difference in the more unbalanced case \(\beta =0.1\) is larger, but both weight divergences are within an acceptable range. For example, the average model output difference across the clients in the case \(\beta =0.1\) is shown in Table 6; except for individual samples, the model behavior among clients is similar.

Fig. 10
figure 10

Top-1 accuracy vs. number of communication rounds on the four datasets, where different algorithms are performed (pathological non-IID, \(\xi \) = 2)

Different degrees of heterogeneity

Next, we compare the different methods in the non-IID case. We study the effect of data heterogeneity by varying the concentration parameter \(\beta \) of the Dirichlet distribution and the class parameter \(\xi \) of the pathological non-IID partition on the four datasets. A smaller value of \(\beta \) or \(\xi \) indicates a higher degree of data heterogeneity across clients. Figure 10 presents the convergence curves of the different methods in the pathological non-IID case (\(\xi =2\)). The complete results for the different cases are presented in Table 7, and the analyses are given below.

Table 7 The top-1 accuracy of FedISP (BEST) and the other baselines on the test set after 500 epochs of training with \(\beta \) from \(\{0.1, 0.5, 1\}\) and \(\xi \) from \(\{2, 4, 8\}\). For SOLO, Def-KT, and FedISP, we report the mean and standard deviation among all clients (\(m\pm \sigma \))
  (1) In the non-IID data settings, Solo shows much worse accuracy than the FL algorithms, which indicates the benefit of FL. However, as \(\beta \) decreases (\(\beta =1, 0.5, 0.1\)) or \(\xi \) decreases (\(\xi =8, 4, 2\)), the quantity skew and label distribution skew become more severe, resulting in a decline in FL performance, reflected by a decrease in mean accuracy and an increase in the standard deviation of accuracy.

  (2) Since MNIST is relatively simple, all FL approaches obtain high accuracy, and FedISP achieves the best accuracy in the extreme non-IID case (\(\beta =0.1\)). On the more challenging EMNIST and FMNIST datasets, the performance of FedAvg and FedProx degrades considerably, while FedISP maintains a test accuracy close to that of the IID scenario even under severe heterogeneity.

  (3) For the CNN on CIFAR-10, FedProx does not show a clear advantage despite its similar proximal term. In contrast, FedISP degrades the least among the five approaches; in particular, FedISP outperforms centralized learning in the case of slight heterogeneity (\(\beta =1\)).

  (4) Compared with the state-of-the-art Def-KT, which is based on mutual learning instead of model averaging, the global accuracy of this baseline oscillates more severely and may even fail to converge on strongly non-IID data, while FedISP attains higher accuracy and more stable learning performance.

  (5) Overall, FedISP always achieves the best accuracy across the six unbalanced levels in terms of both the mean and standard deviation of accuracy. The experiments demonstrate the effectiveness and robustness of FedISP. To the best of our knowledge, this is the first time that a DFL algorithm has exceeded the performance of the conventional centralized architecture in such a horizontal comparison.

Number of clients

Table 8 The top-1 accuracy with different numbers of clients on the CIFAR-10 and CIFAR-100 datasets (Dirichlet non-IID, \(\beta =0.5\))
Table 9 The top-1 accuracy with different participation rates per round on the CIFAR-10 and CIFAR-100 datasets when the total number of clients is 40 (Dirichlet non-IID, \(\beta =0.5\))

We investigate the impact of the number of clients on FedAvg and our method when all clients participate in the communication process on the CIFAR-10 and CIFAR-100 datasets. As the number of clients increases, the amount of local data per client decreases, so overfitting is more likely to occur during local training. From Table 8, we find that the performance of FedAvg degrades, while FedISP maintains good convergence performance as the number of clients increases.

If a node finds that its successor is offline when sending its parameters, FedISP requires an extension to ensure the continuity of training. Two potential options are route discovery protocols and a distributed storage service. The former allows nodes to actively track the reachability of their neighbors and search for alternative nodes when a neighboring node becomes unavailable. The latter maintains a unified record of the connectivity of all nodes, enabling all participating nodes to monitor the online status of the others.

To emulate the performance of FL when some nodes are offline, we investigate the scenario in which only a subset of clients participates in training in each round under the same degree of heterogeneity (Dirichlet non-IID, \(\beta =0.5\)). Specifically, we set the proportion of participating clients per round within \(\{0.1, 0.25, 0.5, 0.75\}\) with a total of 40 clients, corresponding to 4, 10, 20, and 30 participating clients per round, respectively. Table 9 lists the results after 500 epochs of federated training. As the participation rate decreases, FedAvg struggles to converge because of the reduced number of participating clients per round and the data bias caused by heterogeneity, while FedISP maintains its advantage in convergence performance. Additionally, the performance of each method improves as more clients participate in each round. These results show that FedISP can train stably with partial participation.

Discussion

For application scenarios with different degrees of data heterogeneity and different participation rates, FedISP achieves a significant advantage over the other methods, particularly in cases of extreme heterogeneity. In terms of privacy protection, all training occurs locally without a server, guaranteeing the confidentiality of user data. The results demonstrate the adaptability of FedISP and its superiority in accuracy over the other state-of-the-art methods.

The limitations of FedISP mainly concern communication effectiveness, including the communication failure problem and the restriction on the number of clients. In a ring architecture, communication stability is crucial: if some clients experience communication failures while sending parameters, the training process may be interrupted. Therefore, it is necessary to verify the effectiveness of FedISP under unreliable communication conditions, such as transmission errors or communication failures, and to develop a corresponding response plan. Furthermore, as the number of clients on the ring increases, the synchronous nature of FedISP may lead to longer training times per round. Our future work will focus on improving FedISP to enhance its robustness and scalability.

Conclusions and future work

To address the weight divergence and poor communication efficiency of DFL, this paper presents the FedISP algorithm, which leverages incremental methods to implement DFL. Strong convergence guarantees are established for FedISP under the assumption of a convex objective, ensuring convergence within a certain error bound at a constant learning rate and exact convergence to an optimal solution with an appropriately diminishing learning rate. Compared with FedISP with a fixed learning rate, the training accuracy and weight divergence are further improved through learning rate decay strategies. Experiments on the MNIST, CIFAR-10, FMNIST, and EMNIST datasets in both IID and non-IID settings indicate that the proposed FedISP framework has excellent application prospects for handling various real-world cases. The results show that FedISP outperforms the baseline approaches, including FedAvg, FedProx, and Def-KT, in terms of efficiency and learning performance.

In the future, we plan to further study the sequence selection mechanism of FedISP with a fixed ring or randomized ring, as well as more efficient communication extensions for FedISP. For potential unreliable communication cases, we will explore communication methods such as route discovery protocols to ensure the stable operation of FedISP. Additionally, in the case of a large number of clients, a hierarchical approach can be considered by dividing the clients into subgroups or clusters to allow for more efficient and parallelized computations.