Semi-HFL: semi-supervised federated learning for heterogeneous devices

In the vanilla federated learning (FL) framework, the central server distributes a globally unified model to each client and uses labeled samples for training. However, in most cases, clients are equipped with different devices and exposed to a variety of situations. There are great differences between clients in storage, computing, communication, and other resources, which means the unified deep models used in traditional FL cannot fit clients' personalized resource conditions. Furthermore, a great deal of labeled data is needed in traditional FL, whereas data labeling requires a great investment of time and resources, which is hard for individual clients. As a result, clients often have only vast amounts of unlabeled data, which conflicts with the requirements of federated learning. To address these two issues, we propose Semi-HFL, a semi-supervised federated learning approach for heterogeneous devices, which divides a deep model into a series of small submodels by inserting early exit branches to meet the resource requirements of different devices. Furthermore, considering the limited availability of labeled data, Semi-HFL introduces semi-supervised techniques for training in the above heterogeneous learning process. Specifically, the semi-supervised learning process includes two training phases, unsupervised learning on clients and supervised learning on the server, which makes full use of clients' unlabeled data. Through image classification, text classification, next-word prediction, and multi-task FL experiments on five kinds of datasets, we verify that, compared with the traditional homogeneous learning method, Semi-HFL not only achieves higher accuracies but also significantly reduces the global resource overhead.


Introduction
With the development of technology, the computing capabilities of end devices such as mobile phones have greatly improved, and an increasing number of end devices are now capable of performing complicated computing tasks. In a self-driving scenario, for example, the vehicle needs to perceive its surrounding environment, distinguish objects, and plan a path in real time. A distributed computing framework is formed when massive numbers of end devices engage in the computation. For model training, traditional distributed computing solutions require users to upload their data directly to the server. However, this process consumes a significant amount of communication resources. More critically, the majority of these data contain users' personal information, posing a serious threat to their privacy. Federated learning (FL) has become a popular distributed training framework in recent years because it allows the model to be trained locally and effectively prevents data leakage. Under the FL framework, a central server aggregates multiple clients' model parameters and finally obtains a globally unified large model, realizing cross-device and cross-region collaborative training under the premise of fully protecting users' privacy.
In fact, under the distributed computing framework, the composition of end devices is extremely complex, not only in quantity but also in variety: mobile phones, smart wearable devices, cameras, and so on. Because different clients are exposed to different situations and tasks, their computation, communication, and storage capabilities vary greatly. Even when the same type of task is performed, there is strong heterogeneity in environments and other factors, which is called system heterogeneity [1]. As a result, in the federated learning process, the upper limit of model complexity at which each client can participate in learning differs. To maximize the use of clients' data for training, model complexity must be adapted to different clients [2,3]. However, in a traditional federated learning framework, a globally unified model is distributed to each client, and models with the same structure are then aggregated without taking system heterogeneity into account, resulting in some stragglers' data features failing to be integrated into the global model. In addition, traditional federated learning also necessitates labeled data on the clients. Unfortunately, in most cases, building a sufficiently large labeled database is extremely challenging. Data labeling not only takes a long time but also requires many specialists to do low-skilled, highly repetitive work, wasting human resources. In real-world scenarios, there is a significant amount of unlabeled data on clients. To lessen the demand for client-side labeled data, it is critical to relax the premise of supervised learning in the federated learning framework by integrating semi-supervised learning approaches into FL.
Currently, as far as we know, studies about heterogeneous federated learning are mostly based on fully labeled data on clients [2,4]. There is still a wide gap between heterogeneous federated learning and semi-supervised learning. To solve the above problems simultaneously, we propose a semi-supervised federated learning method for heterogeneous devices (Semi-HFL), which allows heterogeneous clients to learn together with only a limited amount of labeled data on the server. First, for clients with different resources, based on the idea of multi-branch fast inference [5], a model can be divided into several small models that can independently complete training and inference tasks by inserting an early exit branch in the middle of the model (shown in Fig. 1), forming a series of submodels of various complexities that satisfy various client resource requirements. In the process of federated learning, models of suitable size are distributed to matching clients. After multiple iterations of local updates, they are sent to the server for aggregation; thus, a global model with multiple branches is formed. In the inference stage, the global model can achieve fast inference through middle branches. Meanwhile, to make full use of the massive unlabeled data from clients, we use a tiny amount of labeled data on the server to pretrain the multi-branch model. After all submodels are distributed to the clients, pseudo-labels are generated for local training; thus, semi-supervised learning is realized. In this paper, we will verify the effectiveness of Semi-HFL in different data distributions. To reduce the impact of data skew on the accuracy of models, we introduce a regularization term [6] in the loss function to balance the parameters of local and global models.
The main contributions are as follows:
• A novel heterogeneous federated learning method called Semi-HFL is proposed, which introduces a multi-branch model to solve the system heterogeneity problem in FL. There are two main innovations. First, a splitting method for multi-branch models is designed to split the global model into submodels of different complexities. Second, we give a novel aggregation method for aggregating the split submodels into a global one in the aggregation step of FL.
• A semi-supervised learning method under the above FL framework is proposed. For heterogeneous federated settings, a "multi-teacher to multi-student" semi-supervised learning mode is formed using a modest quantity of labeled data on the server. It breaks the limitation that traditional semi-supervised FL methods apply only to single-exit models.
• The convergence of Semi-HFL is analyzed, and its feasibility is verified through image and text classification experiments on different data distributions. The effectiveness of Semi-HFL is demonstrated both theoretically and practically.
In the following content, related work is introduced in Sect. "Related works". Sections "The proposed method: Semi-HFL" and "Algorithm" illustrate Semi-HFL from mathematical and algorithmic perspectives, respectively. The convergence analysis is presented in Sect. "Convergence analysis", and experiments are conducted in Sect. "Experimental verification". Finally, Sect. "Conclusions" concludes the paper.

Heterogeneous federated learning
Federated learning can connect a great number of clients to realize collaborative training, and has been applied to many fields such as neural architecture search (NAS) [7], industrial cyber-physical systems [8], and recommender systems [9].

Fig. 1 The process of splitting a fast inference model

However, due to the different environments and equipment each client faces, there is significant heterogeneity among clients. This heterogeneity can be roughly divided into data heterogeneity caused by unbalanced data distribution, system heterogeneity caused by different client resource status, and model heterogeneity caused by various tasks [1]. Aiming at data heterogeneity, Wang et al. [10] proposed a monitoring scheme that can infer the composition of training data and designed a new loss function called Ratio Loss to reduce the impact of imbalance. Based on the homomorphic encryption technique, [11] chose to mitigate data heterogeneity through user selection. Besides, some researchers [12] optimized models by learning the global feature representation shared between non-IID data. As for system heterogeneity, [2] constructed a series of models of different complexities to adapt to various devices by reducing the width of the hidden layers. [4] proposed a federated learning protocol that can manage clients based on their resource conditions. In addition, techniques like asynchronous federated learning [13,14] have been studied. When clients face different application scenarios or perform different tasks, the problem of model heterogeneity also arises. Hence, [15] put forward Moreau Envelopes to perform personalized federated learning by introducing a regularized loss function, which helps separate personalized model optimization from the global model.
In general, personalized federated learning methods for model heterogeneity can be divided into the following categories: Adding User Context [16], Mixture of Global and Local Models [17], Multi-task Learning [18], Meta-Learning [19], Knowledge Distillation [20], Base+Personalization Layers [21], Transfer Learning [22], and so on.
We mainly focus on the system heterogeneity problem under FL. The method proposed in this paper will transform a large model into small models of different complexities to meet the requirements of different clients by inserting branches in the middle layer.

Fast inference
The multi-branch fast inference model has received great attention from a large number of researchers and has been applied to a variety of task scenarios, such as image classification [23][24][25], text ranking [26,27], text classification [28], machine translation [29], key point detection [30], etc. In the multi-branch model, three key issues are mainly involved [31]: the design of the model, the training process, and the inference process. First of all, when it comes to designing a model, the number and positions of branches are two key factors. For example, some researchers chose to insert branches after a specific intermediate layer [5,32], and some preferred to insert after each layer [33], but this will bring additional overhead [31]. Second, the training of multi-branch models can be roughly divided into trunk and branch collaborative training [5,32] and separate training. The latter is usually more scalable. From the perspective of the training methods, knowledge distillation is a popular branch training method [28,34,35]. Under this method, the middle branches of the model are regarded as students, and the subsequent branch or the last exit is the teacher. Finally, in the inference process, an important issue is how to set the criteria for samples to exit the model. Currently, there are two ways, one is to preset the rules artificially [33,36,37], such as the loss threshold [5], and the other one is the learnable rules [38][39][40].
In this paper, we use multi-branch versions of the LeNet and ResNet models to perform experiments on the basis of the work of Teerapittayanon et al. [5]. During the training phase, we conduct collaborative training of the trunk and branches. In the inference process, the number of samples exiting at each branch is determined by the ratio of that branch's training accuracy to the sum of all branches' accuracies.

Semi-supervised learning
A good deep learning model usually necessitates massive labeled data for training. In practice, however, labeling data takes a lot of time and effort, whereas unlabeled data are cheap and easy to obtain, leading to the development of semi-supervised learning [41]. Since the advent of semi-supervised learning in the 1970s [42][43][44], it has received widespread attention [45][46][47] because of its great advantage of leveraging unlabeled data. Currently, semi-supervised learning methods are mainly divided into generative models [48,49], consistency regularization [50,51], graph-based methods [52,53], pseudo-labeling methods [54,55], and hybrid methods [56,57]. In particular, pseudo-labeling is a very common method in semi-supervised learning and is the main focus of this paper. This method uses unlabeled data with high-confidence predictions as training data for the model. It can be combined with knowledge distillation [58], data augmentation [59], and other techniques to achieve extremely competitive results.
Although semi-supervised learning has been a research hotspot in past decades [41], as far as we know, few researchers have studied how to implement semi-supervised learning under the framework of federated learning. Therefore, we study how to realize semi-supervised learning under federated learning.

The proposed method: Semi-HFL
The main innovations of the proposed method are the heterogeneous federated learning framework and the semi-supervised learning method designed for it. Next, we introduce Semi-HFL from these two aspects.

Heterogeneous federated learning
Under the framework of federated learning, clients differ in computing, communication, and storage resources. For example, a mobile phone, as a common smart device, usually has 4-8 GB of running memory and 64-256 GB of storage space, while a portable computer can reach 64 GB and 2 TB, respectively. Different resource conditions lead to different computing, communication, and storage capabilities. Therefore, how to maximize the use of clients' resources, avoid clients being abandoned for failing to meet model training requirements, and improve resource utilization efficiency has become an urgent problem in heterogeneous federated learning.
Considering that increasing model depth brings more computing, communication, and storage overhead, while a shallow model consumes fewer resources, we insert exit branches into the neural network model to convert a single-exit global deep model into a multi-exit model. Based on each exit branch, we split the global deep model into several shallow models that can independently complete training and inference tasks, adapting to the resource requirements of different clients (as shown in Fig. 2). We assume that the number of layers in each branch is small enough that the computing overhead at each branch is less than that of further computation in the deep network.
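As an illustrative sketch (not the paper's code), the decomposition can be viewed as follows: the backbone is cut into segments, an exit branch follows each segment, and submodel i bundles the first i segments with the ith exit. The segment and branch names below are placeholders.

```python
# Hypothetical sketch of splitting a multi-exit model into single-exit
# submodels: submodel i = first i backbone segments + branch i.

def split_into_submodels(backbone_segments, branches):
    """Return single-exit submodels, shallowest first."""
    assert len(backbone_segments) == len(branches)
    return [{"backbone": backbone_segments[: i + 1],  # parts theta_(1)..theta_(i)
             "branch": branches[i]}                   # exit branch lambda_i
            for i in range(len(branches))]

submodels = split_into_submodels(["seg1", "seg2", "seg3"],
                                 ["exit1", "exit2", "exit3"])
# the shallowest submodel uses only "seg1" plus "exit1";
# the deepest uses all three segments plus "exit3"
```

Each entry can then be dispatched to the client cluster whose resources match its depth.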

The training process
Taking the classification task as an example, for the training process of the above-mentioned multi-branch model, we suppose that the global model $\omega(t)$ has $n$ branches, corresponding to $n$ submodels, where $t$ denotes the $t$th round of FL. The $i$th branch together with all the backbone layers before it constitutes the submodel $\omega^i(t)$, forming a sequence of submodels $\omega^1(t), \omega^2(t), \ldots, \omega^n(t)$. In this paper, we assume that the complexity of $\omega^i(t)$ is greater than that of $\omega^{i-1}(t)$. In addition, we use $\theta^i(t)$ to represent the backbone parameters of the $i$th submodel $\omega^i(t)$, and $\lambda^i(t)$ to represent the parameters of its branch. For the entire network backbone, the $n$ branches split the backbone into $n$ parts. $\theta^{i(k)}(t)$ denotes the $k$th part of this split backbone in the $i$th submodel, which corresponds to the backbone between the $(k-1)$th and $k$th branches in the global model (as shown in Fig. 2). Note that in $\theta^{i(k)}(t)$, $k \le i$ always holds. For example, $\theta^{i(2)}(t)$ is a part of $\omega^i(t)$ and corresponds to the backbone network between the first and second branches of $\omega(t)$. All clients that receive the same submodel constitute a client cluster. If the total number of clients is $L$, the numbers of clients in the clusters are denoted by $l_1, l_2, \ldots, l_n$, where $L = l_1 + l_2 + \cdots + l_n$. Then, the local training process of federated learning can be expressed as
$$\omega_j^i(t, e+1) = \omega_j^i(t, e) - \eta \nabla \tilde{F}_j^i\big(\omega_j^i(t, e)\big),$$
where $\tilde{F}_j^i(\omega_j^i(t))$ is the regularized loss function and $\omega_j^i(t)$ is the model of the $j$th client in the $i$th cluster. Considering that the data distribution between clients is not always independent and identical, i.e., data heterogeneity, to enable the locally trained model to integrate the features of local data while avoiding overfitting, we follow the work of Li et al. [6] and introduce a regularization term in the loss:
$$\tilde{F}_j^i\big(\omega_j^i(t)\big) = F_j^i\big(\omega_j^i(t)\big) + \frac{\mu}{2}\,\big\|\omega_j^i(t) - \omega^i(t)\big\|^2,$$
where $F_j^i(\omega_j^i(t))$ is the local loss function.
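A minimal sketch of one local update with the proximal regularization term (FedProx-style), assuming scalar lists stand in for model tensors; `grad` is a placeholder for the true gradient of the local loss, and all names are illustrative.

```python
# Hypothetical sketch of one proximal SGD step:
# w <- w - lr * (grad_F(w) + mu * (w - w_global)), element-wise.

def prox_sgd_step(w_local, w_global, grad, lr=0.1, mu=0.3):
    return [w - lr * (g + mu * (w - wg))
            for w, g, wg in zip(w_local, grad, w_global)]

w_global = [0.0, 0.0, 0.0]
w_local = [1.0, 1.0, 1.0]
# with a zero task gradient, the proximal term alone pulls the client model
# toward the global model: each weight becomes 1 - 0.1 * 0.3 = 0.97
w_next = prox_sgd_step(w_local, w_global, grad=[0.0, 0.0, 0.0])
```

The proximal term keeps local models from drifting too far from the cluster model under skewed data, which is exactly what the regularization above is for.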
The local optimization task can be expressed as
$$\min_{\omega_j^i(t)} \tilde{F}_j^i\big(\omega_j^i(t)\big).$$
Hence, the global optimization goal is as follows:
$$\min_{\omega(t)} \sum_{i=1}^{n} \sum_{j=1}^{l_i} \frac{D_j^i}{D}\, \tilde{F}_j^i\big(\omega_j^i(t)\big),$$
where $D$ is the total data size of all clients. In our work, consistent with Li et al. [6], we set $\mu = 0.3$. After performing a predefined number of iterations, each client uploads its model to the server for aggregation. The aggregation process can be divided into two stages: homogeneous aggregation and heterogeneous aggregation. Homogeneous aggregation aggregates the models within the same client cluster:
$$\omega^i(t+1) = \sum_{j=1}^{l_i} \frac{D_j^i}{D^i}\, \omega_j^i(t),$$
where $D_j^i$ is the data size of the $j$th client under the $i$th cluster and $D^i$ is the data size of all clients under the $i$th cluster. Based on the result of homogeneous aggregation, heterogeneous aggregation aggregates the models between different client clusters, whose architectures and complexities differ. It can be divided into two parts: backbone aggregation and branch aggregation. Since each branch appears in exactly one cluster, branch aggregation can be expressed as
$$\lambda^{*i}(t+1) = \lambda^i(t+1).$$
Backbone aggregation is as follows:
$$\theta^{*(i)}(t+1) = \sum_{k=i}^{n} \frac{D^k}{\sum_{m=i}^{n} D^m}\, \theta^{k(i)}(t+1),$$
where $\theta^{*(i)}(t+1)$ denotes the backbone network parameters between the $(i-1)$th and $i$th branches in the global model. Therefore, the entire backbone network parameters are the union of all $\theta^{*(i)}(t+1)$:
$$\theta^{*}(t+1) = \bigcup_{i=1}^{n} \theta^{*(i)}(t+1).$$
Hence, the global model parameters are the union of the backbone parameters and all the branch parameters:
$$\omega(t+1) = \theta^{*}(t+1) \cup \big\{\lambda^{*1}(t+1), \lambda^{*2}(t+1), \ldots, \lambda^{*n}(t+1)\big\}.$$

The inference process

In the inference stage of the multi-branch model, we believe that the higher the test accuracy of a branch, the more likely it is to produce correct inference results; thus, the number of exit samples (samples that are credible enough to exit the network early through that branch) should be larger. Therefore, we adopt a proportional exit method to define the number of exit samples at each branch. That is, after training on the server, each branch's accuracy in the aggregated large model is calculated.
The ratio of each branch's accuracy to the sum of all branches' accuracies is the sample exit ratio
$$p_i = \frac{acc_i}{\sum_{j=1}^{n} acc_j},$$
where $p_i$ is the proportion of exit samples at branch $i$, and $acc_i$ is the test accuracy of the submodel formed by branch $i$. We refer to the work of Teerapittayanon et al. [5], defining an entropy function as the criterion for determining whether a sample exits at a branch. The smaller the entropy value, the more credible the prediction is considered to be and the more likely it is to exit early, and vice versa. Once the exit ratio of each branch is determined, all samples are sorted by entropy value, and only samples with sufficiently small entropy can exit early through the branch. The entropy is defined as
$$\mathrm{entropy}(y) = -\sum_{c=1}^{C} y_c \log y_c,$$
where $C$ is the number of classes.
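The proportional-exit rule above can be sketched as follows, with illustrative accuracy values and softmax outputs (not taken from the paper's experiments): exit ratios follow branch accuracies, and individual samples are ranked by the entropy of their predicted class distribution.

```python
import math

def entropy(probs):
    """entropy(y) = -sum_c y_c * log(y_c) over the C class probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def exit_ratios(branch_accs):
    """p_i = acc_i / sum_j acc_j."""
    total = sum(branch_accs)
    return [a / total for a in branch_accs]

ratios = exit_ratios([0.6, 0.9])           # -> [0.4, 0.6]
confident = entropy([0.98, 0.01, 0.01])    # low entropy: may exit early
uncertain = entropy([0.34, 0.33, 0.33])    # near-uniform: continues deeper
```

Sorting samples by `entropy` and letting the lowest-entropy fraction `p_i` leave at branch i realizes the proportional exit described above.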

Semi-supervised learning
In real scenarios, due to factors like labor costs, there is massive unlabeled data on the clients and only a limited amount of labeled data on the server. Hence, we further extend the above heterogeneous learning method to semi-supervised scenarios. Based on the basic idea of knowledge distillation, we design a multi-teacher to multi-student semi-supervised training method for the heterogeneous federated learning framework. Specifically, the whole semi-supervised FL contains four steps: supervised learning on the server, unsupervised learning on clients, federated aggregation, and model fine-tuning.

Supervised learning on the server
In the supervised learning stage, the server pretrains the global model with the labeled data $D_L$ and obtains the test accuracies of all branch submodels. After pretraining, the model is decomposed into a series of submodels $Teachers = \{\omega^1, \omega^2, \ldots, \omega^n\}$ that are distributed to the matching clients as teacher models.

Unsupervised learning on clients
The model obtained by the $j$th client under the $i$th cluster is $\omega_j^i$, and its test accuracy is $acc_j^i$. We regard $\omega_j^i$ as the teacher used to predict labels for the client's local unlabeled data, i.e., to produce pseudo-labeled data. We assume that the higher the initial accuracy of the model, the more reliable the predicted labels are, and the more pseudo-labeled data should participate in training the local model. Therefore, we take $acc_j^i$ as the proportion for selecting local training data from the pseudo-labeled data. That is, if the amount of unlabeled data on the client is $D_j^i$, then $D_j^i \cdot acc_j^i$ samples should be selected from the pseudo-labeled data for training. Among all predicted labels, we use the entropy function defined in Eq. (11) as the basis for selecting samples: if the entropy of a pseudo-label is small enough, it is taken as training data for local learning. We then obtain a series of student models of a specific submodel trained by different clients, expressed as $Students_i = \{\omega_1^i, \omega_2^i, \ldots, \omega_{l_i}^i\}$. All client student models can be expressed as $Students = \{Students_1, Students_2, \ldots, Students_n\}$.
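A hedged sketch of the client-side selection rule above, assuming each sample's pseudo-label entropy has already been computed (all identifiers and values are hypothetical): the client keeps the floor(D * acc) predictions with the smallest entropy as its local training set.

```python
# Hypothetical sketch: rank local samples by pseudo-label entropy and keep
# the most confident fraction, sized by the submodel's initial accuracy.

def select_pseudo_labeled(entropies, acc):
    """entropies: {sample_id: entropy of its predicted label distribution}.
    Return the ids of the floor(len * acc) most confident samples."""
    k = int(len(entropies) * acc)
    ranked = sorted(entropies, key=entropies.get)  # smallest entropy first
    return set(ranked[:k])

entropies = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.2}
kept = select_pseudo_labeled(entropies, acc=0.5)   # keeps the 2 most confident
```

A more accurate teacher thus contributes more pseudo-labeled samples to local training, matching the proportion rule described above.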

Federated aggregation
In the aggregation stage, similar to the method in Sect. "Heterogeneous federated learning", the corresponding parts of all student models are aggregated. Specifically, each student cluster is first aggregated internally with data-size weighting, i.e., homogeneous aggregation, the result of which is $\{Students_1, Students_2, \ldots, Students_n\}$. Then, heterogeneous aggregation between the student clusters is performed. Finally, we obtain the global student model $Students$.

Model fine-tuning
After the global student model is obtained by federated aggregation, to prevent the pseudo-labeled data from causing the model to drift, we imitate a teacher's guidance and correction of students during their growth, using the labeled data stored on the server to fine-tune the model. After that, a new global model, which serves as the teacher model in the next iteration, is acquired. The above process is repeated iteratively; thus, the student model grows step by step and performs better and better. Finally, the global model integrates the features of both labeled and unlabeled data.

Algorithm
To explain the methods mentioned in Sect. "The proposed method: Semi-HFL" more clearly, we further provide their algorithms, Heterogeneous FL and Semi-HFL, in this section.

Heterogeneous FL
Algorithm 1 illustrates the whole process of federated training of the multi-branch model. The inputs include the number of federated rounds T, the number of local updates E, the number of heterogeneous client clusters n, the number of clients under each cluster, the learning rate η, and the data size of each client. The output is the global model. Before training starts, the server divides the global multi-branch model into a series of single-exit submodels. Each branch corresponds to a single-exit submodel. Since branches are inserted at different points, the corresponding submodels have different complexities; hence, their requirements for storage, computing, and communication resources differ. We assume that the submodel series {ω^1(t), ω^2(t), . . . , ω^n(t)} is sorted from shallow to deep. The fifth line of Algorithm 1 distributes submodels of different complexities to the selected client clusters during one iteration. All clients under one cluster share the same model architecture. Each client uses local data to train the received submodel E times (lines 14-20). After all selected clients have completed local training, the models are aggregated (line 11). Specifically, the aggregation process is divided into homogeneous aggregation within clusters (lines 22-25) and heterogeneous aggregation between clusters (lines 26-37). A client first aggregates with those in its cluster that share the same model. This aggregation uses the FedAvg method (line 24) proposed by McMahan et al. [60]; that is, model parameters are aggregated according to the amount of client data. After each resource-heterogeneous cluster obtains an average submodel through the above process, the branch and backbone networks are aggregated between clusters. For the aggregation of the backbone network, we regard the backbone as the concatenation of the parts between adjacent branches.
Therefore, the backbone network is composed of n parts, which can be expressed as {θ * (1) (t) , θ * (2) (t) , . . . , θ * (n) (t)}. In particular, the first part of the backbone network in submodel 1 is θ 1(1) (t), and the first part of the backbone network in submodel 2 can be expressed as θ 2(1) (t). Since the depth of each model is different, not every part of the backbone network is included in every submodel. Therefore, we will aggregate each part of the backbone network separately (line 30). For the aggregation of branches, since the submodel corresponding to each branch is only distributed to one type of client cluster during model distribution and each type of branch only appears in one cluster, the aggregation method of the branches is the same as the homogeneous aggregation. Finally, the union of each part of the backbone network and branches constitutes a global multi-branch network (line 32 and line 37).
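The two aggregation stages of Algorithm 1 can be sketched with scalar stand-ins for model parameters (all numbers here are illustrative, not the paper's data): homogeneous aggregation is a FedAvg within a cluster, and heterogeneous backbone aggregation averages each backbone part over the clusters deep enough to contain it.

```python
# Hypothetical sketch: cluster i holds backbone parts 1..i; part k is
# therefore shared by clusters i >= k and averaged over them.

def weighted_avg(values, weights):
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

# homogeneous aggregation: FedAvg over the clients of one cluster
client_params = [2.0, 4.0]
client_sizes = [100, 300]
cluster_param = weighted_avg(client_params, client_sizes)       # 3.5

# heterogeneous backbone aggregation, weighted by cluster data size
part1_by_cluster = {1: 1.0, 2: 3.0}   # part 1 exists in clusters 1 and 2
cluster_data = {1: 100, 2: 300}
theta_star_1 = weighted_avg([part1_by_cluster[i] for i in (1, 2)],
                            [cluster_data[i] for i in (1, 2)])  # 2.5
part2_by_cluster = {2: 5.0}           # part 2 exists only in cluster 2
theta_star_2 = part2_by_cluster[2]    # deepest part: no averaging needed
```

Branches behave like `part2_by_cluster`: since each branch lives in exactly one cluster, its "heterogeneous" aggregate is just the homogeneous result.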

Semi-HFL
For more general scenarios with limited labeled data, Algorithm 2 further extends Algorithm 1 to semi-supervised scenarios, giving a semi-supervised learning method under the framework of heterogeneous federated learning. We assume that there is a small amount of labeled data on the server and a large amount of unlabeled data on the clients. Compared with Algorithm 1, the output of Algorithm 2 remains unchanged, and the input adds the server-side labeled data. Before distributing the heterogeneous models to clients, the server uses the labeled data for pretraining and obtains the accuracy of each branch {acc_1, acc_2, . . . , acc_n} (line 5). After the models are distributed to clients, the entropy values of all local samples are calculated (line 10), and the number of local training samples is obtained (line 11). Only samples with sufficiently small entropy values can be used to train the local model. The pseudo-labels of the training samples are the predictions of the model downloaded from the server (line 12). After local training is completed, similar to Algorithm 1, homogeneous aggregation and heterogeneous aggregation are performed. However, different from Algorithm 1, to reduce the unreliability of local pseudo-labels, the server-side labeled data are also used to fine-tune the aggregated model. The fine-tuning method is the same as the pretraining method (lines 20-26).
Assumption 2 During a federated learning process, all clients will participate in the training.

Assumption 3
In the multi-branch model, since the number of branch layers is very small compared to the backbone, we assume that the decomposed parts of the backbone network can independently complete training and inference tasks. The relationship between them is the sequential input and output relationship. For example, for two adjacent parts, the output of the previous model is the input of the next model. During the federated learning process, they are distributed to each client for training.

Assumption 4
The local loss function F j is convex.

Assumption 6
The expectation and variance of the stochastic gradient of each client meet the following conditions:
$$\mathbb{E}\big[g_j\big] = \nabla F_j, \qquad \mathbb{E}\,\big\|g_j - \nabla F_j\big\|^2 \le \sigma^2,$$
where $g_j$ denotes the stochastic gradient of client $j$.

Assumption 7
The $L_2$-norm of the difference between local and global gradients has an upper bound.

For the analysis of the convergence of the backbone network, we use mathematical induction to analyze each part of the backbone network from front to back. The entire decomposed backbone network can be expressed as $\theta^{*(1)}(t), \theta^{*(2)}(t), \ldots, \theta^{*(n)}(t)$, where $\theta^{*(k)}(t)$ is the $k$th part of the decomposed backbone. According to the induction method, we only need to analyze the following two points to prove the convergence of the global model:

Proof of Theorem 1 According to Assumption 3, $\theta^{*(1)}(t)$ is a submodel that can complete training and inference tasks independently. Because it is located at the beginning of the backbone model, its input during training is client data, and it is not affected by the subsequent submodels. The proof of convergence is similar to that of traditional federated learning. Taking FedAvg as an example, when all clients have the same amount of data, the aggregation reduces to
$$\theta^{*(1)}(t+1) = \frac{1}{L} \sum_{j=1}^{L} \theta_j^{*(1)}(t).$$
Therefore, the key to proving convergence lies in proving that the gap between the average loss of local iteration and the global minimum loss (Eq. 14) decreases as the number of iterations increases, i.e., that the upper bound $B_k$ in Eq. 14 decreases as $T$ increases, where $F_{\theta^{*(1)}}(t, e)$ is the loss function value of the average model obtained at the $e$th local training step in the $t$th federated round, and $\theta^{*(1)}_{best}$ is the global optimal parameter that minimizes the loss function. To prove the above formula, the following two conditions need to be met:

Lemma 1 (Central learning) In the process of optimization, the model parameters should be continuously optimized, i.e., their distance to the optimal parameters has an upper bound.

Lemma 2 (Local learning) Conditioned on $\mathcal{F}(t,0)$, the parameter variation in the local learning process has an upper bound, where $\mathcal{F}(t,0)$ represents all the historical information before the start of the $t$th federated round.

Lemma 1 is the condition of central learning, indicating that each iteration brings the model parameters closer to the optimal model parameters. Based on Lemma 1, Lemma 2 focuses on the characteristics of distributed learning, limiting the variation range of each client's model parameters. Through analysis, we find that the above two conditions hold when the learning rate satisfies $\eta \le 1$; the specific proofs of the above two formulas are given in the Appendix. Combining Eqs. 15 and 16, we can get the following conclusion, where $Q = \theta^{*(1)}(0,0) - \theta^{*(1)}_{best}$. From Eq. 17, we conclude that as the number of iterations $T$ increases, the upper bound of the gap between the average local loss and the minimum loss continues to narrow, indicating that the model converges under the framework of federated learning.
In addition to the backbone network, since a single type of branch network exists under only one client cluster and only participates in homogeneous aggregation during the federated learning process, its convergence analysis is consistent with the traditional federated learning process; please refer to the analysis of $\theta^{*(1)}(t)$.

Experimental verification
In this paper, we use the MNIST, Cifar10, MR [61], and Shakespeare [62] datasets to verify the effectiveness of Semi-HFL on image classification, text classification, and next-word prediction tasks. Specifically, this section is divided into four parts: Semi-HFL feasibility verification, resource overhead study, ablation experiment, and extended experiment, corresponding to Sect. "Semi-HFL feasibility verification", "Two-level heterogeneity", "Multi-level heterogeneity", and "Resource overhead study", respectively. In the Semi-HFL feasibility verification part, we conduct separate experiments on the two-level and multi-level heterogeneity cases under the independent and identically distributed (IID) and non-independent and identically distributed (non-IID) settings. Through Sect. "Semi-HFL feasibility verification", we hope to find whether Semi-HFL can ensure accuracy is not compromised compared with other methods, and to assess the impact of different heterogeneity cases on the final performance of the model. In Sect. "Two-level heterogeneity", we only consider the IID distribution, further measuring the overhead of Semi-HFL in terms of storage, computing, and communication resources. After that, in the ablation experiments shown in Sect. "Multi-level heterogeneity", we explore the necessity of adding a regularization term to the client loss function. Finally, to test the generalization capability of Semi-HFL, we run additional multi-task experiments. The main comparison methods include FedAvg [60], FedProx [6], and FedProto [63], where FedAvg is a homogeneous method, while FedProx and FedProto are heterogeneous methods.
Regarding the processing method for the non-IID distribution, we first sort all samples according to their labels and then divide them equally, in order, into a predefined number of packages. Each client then selects the same number of packages as its local samples. Specifically, for MNIST, in addition to the 6000 labeled samples on the server, we divide the remaining 54,000 unlabeled samples into 250 packages in label order, and each client randomly picks 5 of them as local training data. Similarly, we divide the client training data of Cifar10 into 500 packages and that of MR into 100, with each client picking 10 and 2 packages, respectively, forming non-IID distributions. For the non-IID setting of the Shakespeare dataset, following the setting of [62], each device is equipped with the text data of only one role in the works of William Shakespeare. The models in this paper include both CNN and RNN models. In the image classification task, the MNIST dataset uses the LeNet model and the Cifar10 dataset uses the ResNet-18 model. In the text classification task, the model for MR is slightly modified from the model in [61], and the Shakespeare dataset uses a variant of the long short-term memory (LSTM) network in [62] for next-word prediction. All model structures are shown in Fig. 3. The optimizer used in the experiments is SGD; the learning rate is 0.1 for Cifar10 and MR, 0.05 for MNIST, and 0.5 for Shakespeare. In total, 50 clients are involved in the FL framework, and in each round, 20% of them are selected by the server to train collaboratively. Table 1 lists the main experimental results of this paper.
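The label-sorted package partition described above can be sketched as follows (a minimal illustration; the function name and the use of Python's `random` module are our own, and packages are drawn without replacement here for simplicity):

```python
import random


def partition_non_iid(labels, num_packages, packages_per_client,
                      num_clients, seed=0):
    """Sort sample indices by label, slice them into equal 'packages',
    and let each client draw a fixed number of packages as local data.
    Assumes num_clients * packages_per_client <= num_packages; any
    trailing samples that do not fill a package are dropped."""
    rng = random.Random(seed)
    # Sort indices by label so each package covers a narrow label range,
    # producing the skewed (non-IID) per-client distributions.
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    size = len(order) // num_packages
    packages = [order[k * size:(k + 1) * size] for k in range(num_packages)]
    pool = list(range(num_packages))
    rng.shuffle(pool)
    clients = []
    for c in range(num_clients):
        picked = pool[c * packages_per_client:(c + 1) * packages_per_client]
        clients.append([i for p in picked for i in packages[p]])
    return clients
```

With the MNIST numbers above, 54,000 samples split into 250 packages with 5 packages per client exactly covers 50 clients.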

Semi-HFL feasibility verification
In this section, we insert branches at different positions of the model to form multi-level heterogeneity, exploring the effectiveness of Semi-HFL in different heterogeneous cases. The evaluation metric is model accuracy. In the experiments, we randomly select 10% of the training data as the labeled data on the server and distribute the remaining 90% as unlabeled data to the clients. Considering that in real distributed scenarios, datasets may show two kinds of distributions across clients, i.e., IID and non-IID, we consider both distributions for the MNIST, Cifar10, and MR datasets. Note that since Shakespeare is usually considered non-independently and identically distributed, it is tested only in the non-IID case. To ensure the fairness of the comparative experiments, we adopt the proposed semi-supervised learning method in the benchmarks as well.

Two-level heterogeneity
First, we explore the effectiveness of two-level federated heterogeneity, which means inserting one branch in the middle of the model. After the model is split at the branch, two models with different computational complexities are formed, each of which can independently complete training and inference. For the insertion position of the early exit branch, we try two different positions to form two kinds of two-level heterogeneity, so that we can test the effectiveness of Semi-HFL for heterogeneous models formed by different insertion positions. The models corresponding to MNIST, Cifar10, and Shakespeare are shown in Fig. 3, where the two insertion positions correspond to branch 1 and branch 2. The two-level heterogeneous model thus has two cases: one composed of branch 1 and branch 3 (represented by "1+3"), and the other composed of branch 2 and branch 3 (represented by "2+3"). The MR model is also shown in Fig. 3; since it is small, we insert only one branch, forming only one kind of two-level heterogeneous model. The experimental results on the image and text datasets are shown in Figs. 4 and 5, respectively.
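The branch-based splitting can be illustrated with a structural sketch (the layer names and the `build_heterogeneous_submodels` helper are hypothetical; real submodels would hold trainable layers rather than strings):

```python
def build_heterogeneous_submodels(layers, branch_after):
    """Given an ordered backbone `layers` and exit positions `branch_after`
    (layer counts, deepest last = full model), return one submodel per
    heterogeneity level: a prefix of the backbone plus its own exit branch.
    Shallower submodels share their backbone prefix with deeper ones, which
    is what makes cross-level aggregation possible."""
    submodels = []
    for level, pos in enumerate(sorted(branch_after), start=1):
        submodels.append({
            "level": level,
            "backbone": layers[:pos],   # shared prefix of the global model
            "exit": f"branch_{level}",  # per-level early exit classifier
        })
    return submodels
```

For the "1+3" case, `branch_after` would contain the position of branch 1 and the full depth (branch 3); adding branch 2's position yields the "1+2+3" multi-level case.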
In the figures, "1+3" and "2+3" represent the two kinds of two-level heterogeneous models, composed of branch 1 and branch 3, and branch 2 and branch 3, respectively. "Avg1", "Avg2", and "Avg3" represent homogeneous learning frameworks whose models consist only of branch 1, branch 2, and branch 3, respectively, trained with FedAvg. Similarly, FedProx is adopted in "Prox1", "Prox2", and "Prox3", whose models match "Avg1", "Avg2", and "Avg3". "Proto-2" denotes two-level heterogeneous FedProto, i.e., each client can choose between two model sizes. The results in Figs. 4 and 5 show that regardless of dataset or distribution, under the same semi-supervised learning method, the test accuracy obtained by Semi-HFL is never lower than that of the other methods; it is about 1 percentage point higher on average on MNIST, 10 percentage points higher on Cifar10, 1-5 percentage points higher on MR, and up to 5 percentage points higher on Shakespeare. This is because the heterogeneous federated learning method proposed in this paper divides the global model into submodels of different depths. In the local learning process, each submodel searches for its optimal parameters without considering the other parts of the network, thus reducing the coupling between the various parts of the model and the constraints between parameters during updates, and allowing parameter optimization to proceed in a larger search space. In contrast, in methods like FedAvg, FedProx, and FedProto, since the vertical structure of all client models is the same, all parameters are updated in the direction that maximizes the accuracy of the last exit, a collaborative optimization process; greater constraints and a smaller search space make the final model performance inferior to Semi-HFL. Therefore, Figs. 4 and 5 demonstrate, to some degree, the feasibility and generality of Semi-HFL on image and text tasks. Additionally, the final convergence value of "1+3" is significantly better than that of "2+3" in Figs. 4c, d. This is because, on Cifar10, the model corresponding to branch 1 performs better than branch 2, as can be seen from the comparison between "Avg1" and "Avg2", or between "Prox1" and "Prox2", in the figures. Hence, different branch insertion positions lead to different training results. In general, if a branch model performs better under homogeneous training, the heterogeneous model containing that branch is correspondingly better.

(Fig. 4 caption: Two-level heterogeneity performance comparisons on image datasets under IID and non-IID distributions. "1+3" represents two-level heterogeneous FL consisting of the first and third branches, while "2+3" is composed of the second and third branches. "Avg1", "Avg2", and "Avg3" are all homogeneous FL; the server under each includes only the single-exit model corresponding to branch 1, branch 2, and branch 3, respectively, trained with FedAvg. The model settings of "Prox1", "Prox2", and "Prox3" are similar, but the training method is FedProx. "Proto-2" contains two kinds of client models and is trained with FedProto.)
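The aggregation implied by this setup — averaging each backbone layer only over the clients whose submodels are deep enough to contain it — can be sketched as follows (scalar weights stand in for tensors; the function is our illustration, not the paper's released implementation):

```python
def aggregate_heterogeneous(client_updates):
    """Depth-wise federated averaging: layer i of the global backbone is
    the mean of layer i over every client whose submodel contains it.
    `client_updates` is a list of per-client layer lists, ordered from the
    input layer; shallower clients simply contribute fewer layers."""
    depth = max(len(update) for update in client_updates)
    global_layers = []
    for i in range(depth):
        contributions = [u[i] for u in client_updates if len(u) > i]
        global_layers.append(sum(contributions) / len(contributions))
    return global_layers
```

Shallow layers are averaged over all clients, while the deepest layers are averaged only over clients holding the full model, which is why shallow submodels still benefit from every participant.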

Multi-level heterogeneity
To further verify the feasibility of Semi-HFL in more complex heterogeneous situations, we increase the number of branches by inserting branches at the positions of branch 1 and branch 2 simultaneously, forming a multi-level heterogeneous model (represented by "1+2+3"). Since the MR model adopted in this paper is small, we explore the multi-level heterogeneous situation only for the MNIST, Cifar10, and Shakespeare datasets in Figs. 6, 7, and 8. Figures 6a, b, 7a, b, and 8a show the results of MNIST and Cifar10 under the IID and non-IID distributions, and of Shakespeare under the non-IID distribution. Even when the heterogeneous situation is more complex and there are more types of submodels, the models trained by Semi-HFL still hold obvious advantages: about 1 percentage point higher on average on MNIST and 10 percentage points higher on Cifar10. On Shakespeare, Semi-HFL remains about 5 percentage points higher on average than FedAvg and FedProx. In addition, we also show the local accuracy trends of all clients before federated aggregation under different heterogeneous cases in Figs. 6c, d, 7c, d, and 8b. Within the experimental range, increasing the degree of heterogeneity does not significantly affect model performance, i.e., the final convergence value does not change significantly. However, the higher the degree of heterogeneity, the slower the convergence and the greater the difference between local accuracies.

Resource overhead study
The initial motivation for introducing heterogeneous federated learning is to meet clients' heterogeneous needs in storage, computing, communication, and other resources. Therefore, we verify whether Semi-HFL consumes fewer resources than the other methods. Since the data distribution does not directly affect the resource overhead, this section assumes the IID distribution (non-IID for Shakespeare) and calculates the storage, computing, and communication resources consumed on the four datasets under the different training methods. Storage, computing, and communication resources are measured by model size, FLOPs, and the number of transmitted parameters, respectively. Besides, since the resource overhead of FedAvg is close to that of FedProx, we compare Semi-HFL only with FedAvg and FedProto. Finally, the test accuracy vs. resource overhead scatter plot (Fig. 9) is obtained. The dots in the figure represent the resource overhead of all clients participating in a certain round of training under Semi-HFL, yellow stars are the average resource cost and accuracy of FedProto, and red stars are FedAvg's. It is worth noting that in the scatter plot of computing resource overhead, since each client is assigned the same number of samples in the experiment, we only calculate the computing resource cost of a single sample for each participating client.
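The per-layer parameter and FLOPs accounting used for such comparisons can be approximated as in this sketch (layer shapes are illustrative; we count multiply-accumulate operations and ignore activations, pooling, and normalization):

```python
def count_resources(layer_specs):
    """Rough per-layer parameter and multiply-accumulate (MAC) counts for a
    CNN, used only to compare submodel overheads.
    Each spec is ('conv', in_ch, out_ch, kernel, out_h, out_w) for a square
    convolution, or ('fc', in_features, out_features) for a dense layer."""
    params, macs = 0, 0
    for spec in layer_specs:
        if spec[0] == "conv":
            _, cin, cout, k, oh, ow = spec
            params += cout * (cin * k * k + 1)        # weights + biases
            macs += cin * k * k * cout * oh * ow      # MACs per input sample
        elif spec[0] == "fc":
            _, fin, fout = spec
            params += fout * (fin + 1)
            macs += fin * fout
    return params, macs
```

Summing these counts over only the layers a submodel actually contains shows directly why a shallower submodel, e.g. the "1" prefix of "1+3", costs less to store, compute, and transmit than the full model.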
It can be clearly seen from the figure that for MNIST, Cifar10, MR, and Shakespeare alike, the models obtained by the heterogeneous training method not only achieve higher accuracies than the other training methods but also significantly reduce the storage, computing, and communication resource overheads. This is because, in Semi-HFL, smaller submodels are trained and transmitted, reducing the overall resource overhead. Meanwhile, it is worth noting that the communication resource overhead of FedProto is smaller than the others' in most cases, because model parameters are replaced by protos for transmission between the clients and the server. In addition, on the MNIST, Cifar10, and Shakespeare datasets, when the heterogeneity level is the same, the larger the submodels are, the more resources are consumed overall; for example, the overall consumption of "2+3" is higher than that of "1+3". Under different heterogeneity levels, the larger the proportion of shallow models, the smaller the resource overhead; for example, "1+2+3" consumes fewer resources than "2+3" but more than "1+2".

Ablation experiment
In the Semi-HFL method proposed in this paper, since the locally trained submodel is a part of the global model, to prevent the gap between the submodel and the global model from being too large, we add a regularization term to the local loss functions to achieve a balance between the local models and the global model. To demonstrate that this approach is effective, we conduct ablation experiments in this section. For brevity, we pick one dataset each from the image and text datasets for validation.

(Fig. 6 caption, partially recovered: "… (shown in a and b) and performance comparisons of different heterogeneous cases (shown in c and d) when the dataset is MNIST. "1+2+3" represents multi-level heterogeneous FL consisting of branch 1, branch 2, and branch 3. There are three kinds of models with different complexities among clients in "Proto-3", and its training method is FedProto.")
We take the "1+3" heterogeneous situation as an example and obtain the experimental results shown in Fig. 10. Each subfigure corresponds to the local training results of the clients with and without the regularization term under different distributions: the green line indicates training with the regularization term, while the blue line is without it. It can be seen that on Cifar10 (Fig. 10a), the clients with the regularization term perform significantly better than those without, under both the IID and non-IID distributions. The gap is especially obvious when the client data follow a non-IID distribution. This is because local data skew (i.e., non-IID data) can easily lead to overfitting of locally trained models, resulting in unsatisfactory performance on the test dataset. On the text dataset MR (Fig. 10b), the distance between the green and blue lines is small, indicating that adding the regularization term does not deteriorate model performance. Therefore, based on the above test results, we believe it is necessary to add a regularization term to the local loss function.
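The regularized local objective discussed above can be sketched as a FedProx-style proximal term (the coefficient name `mu` and the flat scalar-weight representation are our simplifications, not the paper's exact formulation):

```python
def local_loss_with_prox(task_loss, local_weights, global_weights, mu=0.01):
    """Local objective with a proximal regularization term: the squared L2
    distance between the local submodel weights and the corresponding slice
    of the global model, scaled by mu/2, penalizes the submodel for drifting
    too far from the aggregated model during local training."""
    prox = sum((w - g) ** 2 for w, g in zip(local_weights, global_weights))
    return task_loss + 0.5 * mu * prox
```

A larger `mu` keeps local submodels closer to the global model, trading local fit for consistency, which matters most under non-IID data where local overfitting is strongest.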

Extended experiment
To explore the generalization capability of Semi-HFL, we extend the above single-task verification to multi-task experiments under heterogeneous FL orchestration. In detail, there are two kinds of tasks in the FL framework: one is recognizing handwritten digits based on the MNIST dataset, and the other is recognizing costumes based on the FashionMNIST dataset. In the experimental settings, the number of clients is 100, 50 of which are requested to perform handwritten digit recognition tasks, while the remaining 50 perform costume recognition tasks. Each client has only one kind of dataset, which is related to its task. Both kinds of recognition tasks can be handled by a globally aggregated model, which is a variant of LeNet and is shown on the left of Fig. 6. Since there are 60,000 training images and 10,000 test images in each of the MNIST and FashionMNIST datasets, we randomly choose 6000 images from each of the two datasets as labeled data on the server; the remaining 54,000 MNIST training images are evenly distributed to the clients performing handwritten digit recognition, and the remaining 54,000 FashionMNIST training images are distributed in the same way. The test dataset is composed of 20,000 images from MNIST and FashionMNIST. We also consider both the IID and non-IID distributions in multi-task federated learning. When the data distribution is non-IID, the data partition method is similar to that of MNIST in the single-task experiments above, and the learning rate, optimizer, and other settings are also similar to the single-task experiments. Figure 11 compares the performance of the global models trained by Semi-HFL and the other methods. It shows that Semi-HFL holds clear advantages over the other methods regardless of the distribution. Besides, we can also find that the overall performances of FedProx and FedProto are better than that of FedAvg. This is easy to understand: multi-task FL is essentially a kind of task-based heterogeneous FL.
FedProx was designed to address the heterogeneity problem by adding a regularization term to the local loss function of FedAvg, while FedProto tries to meet heterogeneous requirements by changing the structures of specific model layers. Therefore, the declining trend shown by the FedAvg curve is caused by overfitting, which is also essentially a heterogeneity problem; that is exactly why we choose to add the regularization term in Semi-HFL. So far, we can believe that the Semi-HFL proposed in this paper is effective for both single-task and multi-task FL.

Conclusions
This paper proposes Semi-HFL, a new heterogeneous federated learning framework based on semi-supervised learning that addresses the resource heterogeneity and unlabeled-data challenges in federated learning, inspired by multi-branch fast-inference models. Specifically, by inserting early exit branches in the middle of the model, the globally unified model of the traditional federated learning framework is split into submodels adapted to diverse client computing, communication, and storage resources. During this process, a semi-supervised federated learning technique is designed in consideration of the availability of labeled data, and a regularization term is introduced into the loss function to address the overfitting of local models. On the one hand, the framework caters to clients' personalized needs and provides a novel approach to the heterogeneity problem in federated learning systems. On the other hand, unlabeled data are fully utilized, which greatly saves labeling costs. Through image classification, text classification, and next-word prediction experiments, it is shown that regardless of the data distribution, the accuracy of the model trained by Semi-HFL is higher than that of both the homogeneous and heterogeneous baselines, while fewer resources are consumed. In addition, as the degree of heterogeneity increases, the convergence speed slows down and the variance of the clients' accuracies grows.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Proof of Lemma 1
According to Assumption 5, the local loss function F_j is M-smooth, and we have (A1). Because F_j is also convex, we can then obtain (A6). The conclusion in Eq. (A6) holds for a single local training step of a client; extending it to the entire local training process, we obtain the conclusion in (A7).
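For reference, the standard inequalities that M-smoothness and convexity of F_j provide, and on which steps (A1)-(A6) build, are (our restatement of the textbook definitions, not the paper's exact equations):

```latex
% M-smoothness of F_j (upper quadratic bound):
F_j(\mathbf{y}) \le F_j(\mathbf{x})
  + \nabla F_j(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})
  + \frac{M}{2}\,\|\mathbf{y}-\mathbf{x}\|^2
% Convexity of F_j (complementary lower linear bound):
F_j(\mathbf{y}) \ge F_j(\mathbf{x})
  + \nabla F_j(\mathbf{x})^{\top}(\mathbf{y}-\mathbf{x})
```

Combining the two bounds along a gradient step is the usual route to per-step descent guarantees of the kind Eq. (A6) states.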

Proof of Lemma 2
For any two clients, the following conditions are met. Taking client 1 and client 2 as examples: