Personalized Semi-Supervised Federated Learning for Human Activity Recognition

The most effective data-driven methods for human activity recognition (HAR) are based on supervised learning applied to the continuous stream of sensor data. However, these methods perform well only on restricted sets of activities in domains for which a fully labeled dataset is available. Coping with the intra- and inter-subject variability of activity execution in large-scale real-world deployments is still a challenge. Semi-supervised learning approaches for HAR have been proposed to address the challenge of acquiring the large amount of labeled data that is necessary in realistic settings. However, their centralised architecture incurs scalability and privacy problems when the process involves a large number of users. Federated Learning (FL) is a promising paradigm to address these problems. However, the FL methods that have been proposed for HAR assume that the participating users can always obtain labels to train their local models. In this work, we propose FedHAR: a novel hybrid method for HAR that combines semi-supervised and federated learning. In particular, FedHAR combines active learning and label propagation to semi-automatically annotate the local streams of unlabeled sensor data, and it relies on FL to build a global activity model in a scalable and privacy-aware fashion. FedHAR also includes a transfer learning strategy to personalize the global model on each user. We evaluated our method on two public datasets, showing that FedHAR reaches recognition rates and personalization capabilities similar to state-of-the-art supervised FL approaches. As a major advantage, FedHAR only requires a very limited amount of annotated data to populate a pre-trained model and a small number of active learning questions that quickly decreases while using the system, leading to an effective and scalable solution for the data scarcity problem of HAR.


INTRODUCTION
The evolution of mobile computing and sensor technologies in the last decades made it possible to develop intelligent applications that continuously monitor our daily activities to enable context-aware services [12]. Off-the-shelf mobile and wearable devices that are pervasively adopted in our everyday life can be used to track physical movements and hence to constantly monitor daily activities. The majority of approaches for Human Activity Recognition (HAR) based on mobile and wearable devices in the literature rely on supervised learning methods to infer activities from the continuous stream of data generated by the sensors embedded in those devices [32]. While supervised learning leads to high recognition rates, collecting a significant amount of labeled data to train the recognition model is often a real challenge [17]. For instance, data annotation can be performed directly by the monitored subject while performing activities (self-annotation). However, this approach is very obtrusive and error-prone. Alternatively, external observers can annotate the activity execution of a subject (in real-time or by semi-automatic video annotation). Annotation by external observation is particularly time-consuming and privacy-intrusive.
Among the many solutions that have been proposed to tackle the labeled data scarcity problem, semi-supervised learning techniques represent a promising research direction that has been explored in the last few years [2]. Semi-supervised methods only use a small amount of labeled data to initialize the recognition model, which is continuously updated taking advantage of unlabeled data. However, there are still several challenges that limit the deployment of those solutions in realistic scenarios. Indeed, even though semi-supervised approaches mitigate the data scarcity problem, they do not consider the scalability and privacy issues that arise in training a real-world recognition model that includes data from a large number of different users. From the scalability point of view, the computational effort that is required to train a global model grows significantly as the number of users increases. Considering privacy aspects, activity data may reveal sensitive information, like the daily behavior of a subject and her habits [6]. Accurate HAR also requires a certain amount of personalization for the end users [53].
In 2016, Google introduced the Federated Learning (FL) framework [27,35,57]. In the FL paradigm, the model training task is distributed over a multitude of nodes (e.g., mobile devices). Each node uses its own labeled data to train a local model. The resulting model parameters of each participating node are forwarded to a server that is in charge of aggregating them. Finally, the server shares the aggregated parameters with the participating nodes. FL is a promising direction to make activity recognition scalable for a large number of users. Moreover, FL mitigates the privacy problem since only model parameters, and not actual data, are shared with the server, and privacy-preserving mechanisms (e.g., Secure Multiparty Computation, Differential Privacy) are used when aggregating parameters [57].
FL has been recently applied to HAR showing that it can reach an accuracy very close to centralised methods [14].
However, all existing solutions assume that each node has complete availability of labeled sensor data. This is actually the general setting of existing works based on FL, that in the literature has been primarily considered for fully supervised learning tasks [27]. While this assumption may be valid for some applications (e.g., the Google approach for keyboard suggestions improvement relies on labeled data implicitly provided by users when typing or confirming suggestions [24]), it is not realistic for applications like HAR where labeled data availability is significantly limited. Extending FL to semi-supervised learning is one of the open challenges in this area [27].
In this work, we propose FedHAR: a hybrid semi-supervised and FL framework that enables personalized, privacy-aware, and scalable HAR based on mobile and wearable devices. Unlike existing approaches, FedHAR considers a limited availability of labeled data. In particular, FedHAR combines active learning and label propagation to provide labels to a significantly large amount of unlabeled data. Semi-automatically labeled data are then used by each node to perform local training, thus obtaining the model parameters that are then transmitted to the server that aggregates them using Secure Multiparty Computation. Similar to existing approaches, FedHAR also relies on transfer learning to fine-tune the global model for each user [56]. At the same time, FedHAR is also designed to generate a global model that generalizes over unseen users.
Considering the limitations of existing evaluation methodologies for FL applied to HAR [21], we designed a novel evaluation methodology to robustly assess both the generalization and the personalization capabilities of our approach.
The results of our experimental evaluation on two publicly available datasets show that FedHAR reaches recognition rates close to state-of-the-art solutions that assume the complete availability of labeled data. Moreover, both the generalization and the personalization capabilities of FedHAR keep increasing over time. Last but not least, the number of triggered active learning questions is small and acceptable for a real-world deployment.
To the best of our knowledge, FedHAR is the first FL framework for HAR that tackles the data scarcity problem.
Hence, we believe that FedHAR is a significant step towards realistic deployments of HAR systems based on FL. Even though we evaluated FedHAR only on activity recognition based on mobile and wearable devices, our method can also be applied to other HAR applications (e.g., HAR in smart-homes with environmental sensors) and more generally to other human-centered domains characterized by low availability of labeled data.
In summary, the contributions of this work are the following:
• We present FedHAR, a novel hybrid approach that combines federated, semi-supervised, and transfer learning to tackle the data scarcity problem for real-world personalized HAR.
• We propose a novel strategy to reliably evaluate the evolution of the personalization and generalization capabilities of FedHAR over time.
• An extensive evaluation on public datasets shows that FedHAR reaches similar recognition rates with respect to well-known approaches that assume high-availability of labeled data. At the same time, FedHAR triggers a small number of active learning questions that quickly decreases while using the system.

Labeled data scarcity in HAR
Considering HAR based on data collected from mobile devices' inertial sensors, the majority of approaches rely on supervised machine learning [4,9,23,30,46]. However, these approaches need a significant amount of labeled data to train the classifier. Indeed, different users may perform the same activities in very different ways, but also distinct activities may be associated with similar motion patterns. The annotation task is costly, time-consuming, intrusive, and hence prohibitive on a large scale [17]. In the following, we summarize the main methodologies that have been proposed in the literature to mitigate this problem.
Some research efforts focused on knowledge-based approaches based on logical formalisms, especially targeting smart-home environments [13,15]. These approaches usually rely on ontologies to represent the common-sense relationships between activities and sensed data. One of the main issues of knowledge-based approaches is their inadequacy to model the intrinsic uncertainty of sensor-based systems and the large variety of activity execution modalities.
Data augmentation is a more popular solution adopted in the literature to mitigate the data scarcity problem, especially considering imbalanced datasets [11,38]. In these approaches, the available labeled data are slightly perturbed to generate new labeled samples. With respect to our method, data augmentation is an orthogonal approach that could be integrated to further increase the amount of labeled data. Recently, data augmentation in HAR has also been tackled taking advantage of GAN models to generate synthetic data that are more realistic than the ones obtained by the above-mentioned approaches [10,51]. However, GANs need to be trained with a significant amount of data.
Many transfer learning approaches have been applied to HAR to fine-tune models learned from a source domain with available labeled data to a target domain with low-availability of labeled data [16,40,42,52]. FedHAR relies on transfer learning to fine-tune the personal local model taking advantage of the global model trained by all the participating devices.
An effective method to tackle data scarcity for HAR when the feature space is homogeneous (like in FedHAR) is semi-supervised learning [2,22,34,45]. Semi-supervised methods only use a restricted labeled dataset to initialize the activity model. Then, a significant amount of unlabeled data is semi-automatically annotated. The most common semi-supervised approaches for HAR are self-learning [34], label propagation [44], co-learning [33], and active learning [1,25,36]. Also, hybrid solutions based on semi-supervised learning and knowledge-based reasoning have been proposed in [5]. Existing semi-supervised solutions do not consider the scalability problems related to building a recognition model with a large number of users for real-world deployments. Moreover, the data required to build such collaborative models is sensitive, as it could reveal private information about the users (e.g., user health condition and habits) [6,39,50].

Federated Learning for HAR
Recently, the FL paradigm has been proposed to distribute model training over a multitude of nodes [8,19,27,29,35,57].
A recent survey divides FL methods into three categories: horizontal, vertical, and transfer FL [57]. FedHAR is a horizontal FL method: the participating mobile devices share the same feature space, but they differ in sample space (i.e., each device considers data for a specific user).
FL has been previously applied to mobile/wearable HAR to distribute the training of the activity recognition model among the participating devices [14,21,43,55,56,60]. Existing works show that FL solutions for HAR reach recognition accuracy similar to standard centralized models [43]. Moreover, since personalization is an important aspect for HAR [53], existing works also show that applying transfer learning strategies to fine-tune the global model on each client leads to a significantly improved recognition rate [14,56]. One of the major drawbacks of these solutions is that they assume high availability of labeled data, hence considering a fully supervised setting.
The combination of semi-supervised and federated learning for HAR was partially explored in [60]. However, that work has a different objective, since it focuses on collaboratively learning a deep feature representation of sensor data through autoencoders. The learned feature representation is then used to recognize activities on a labeled dataset in a fully supervised setting.
A common limitation of the previously cited approaches is the methodology adopted to evaluate FL for HAR applications [14,21,55]. Indeed, none of the proposed methodologies truly assesses the generality of the global model over users whose data have never been used for training. Moreover, only one iteration of the FL process is evaluated, while in a realistic deployment this process is repeated periodically (e.g., every night) with different data. In this work, we propose an evaluation methodology that overcomes the above-mentioned issues (see Section 4.2).

Overview of FedHAR
In the following, we summarize the architecture of FedHAR. For the sake of this work and without loss of generality, we illustrate FedHAR applied to physical activity recognition based on inertial sensor data collected from personal mobile devices.
3.1.1 Federated Learning. The overall data flow of our FL approach is depicted in Figure 1. Following the FL paradigm, the actors of FedHAR are a server and a set of devices that cooperate to train and update the weights of a global activity recognition model. Each device uses a certain amount of labeled data to perform local training of the model. As we will explain later, the labels are not naturally available, and they are obtained taking advantage of semi-supervised learning techniques. The resulting weights of the local models are transmitted to the server. The server takes care of performing a privacy-preserving aggregation of the weights deriving from the local models of the participating devices.
The updated global model weights are then transmitted to each device, which will apply transfer learning methods to personalize it. The personalized local model is then used for activity classification.

Local models.
In FedHAR, each device stores two distinct local instances of the activity model (as depicted in Figure 2). The first is the one used for classification, and it is called Personalized Local Model. When a device receives an update about global model weights from the server, the Personalized Local Model is fine-tuned on the specific user using transfer learning methods. As we will discuss in Section 3.5, fine-tuning leads to a personal model that is specialized in recognizing the activities of the specific user.
Each device also trains a twin model without applying personalization. We refer to this model as Local Model and its weights are the ones used during FL. The advantage of using the Local Model for FL is that it leads to a global model that better generalizes over unseen users. Keeping private the weights of the Personalized Local Model has also a positive impact on privacy protection.
In the following, we explain in detail how the local models are used in FedHAR. The Personalized Local Model is used for the classification of activities in real-time on the continuous stream of pre-processed sensor data. Before classification, each unlabeled data sample is stored in the User Device Storage. After classification, if the confidence on the current prediction is low, an active learning process is started, and the system asks the user about the activity that she was actually performing. The feedback from the user is finally associated with the corresponding data sample in the User Device Storage.
Figure 2b shows how the periodic training of the local models is performed in FedHAR. This process is triggered when the server needs to update the global model. Before starting the FL process, the labeled and unlabeled data points stored in the User Device Storage are used to update the Label Propagation model, which will automatically label a significant portion of unlabeled data. Then, the newly labeled data obtained after label propagation, as well as the ones obtained by active learning, are used to perform local training of both the Personalized Local Model and the Local Model during the FL process. As we previously mentioned, only the weights of the Local Model are actually transmitted to the server during the FL process to update the global model. We will describe our federated learning strategy in more detail in Section 3.3.

Initialization of the global model
FedHAR assumes a limited availability of labeled data. In particular, it only considers a restricted annotated dataset (we will call it pre-training dataset in the following) that is used to initialize the label propagation model and the weights of the global model. In realistic settings, the pre-training dataset can be, for example, a combination of publicly available datasets, or a small training set specifically collected by a restricted number of volunteers. Figure 3 describes the initialization mechanism of FedHAR.
In FedHAR, the activity model is based on deep learning. However, learning features from such a small training set would lead to an unreliable feature representation, negatively impacting the recognition rate during classification.
Hence, FedHAR relies on a pre-processing pipeline based on standard segmentation and handcrafted feature extraction that in the literature proved to be effective for HAR [5]. In particular, for each axis of each inertial sensor, we consider the following features: average, variance, standard deviation, median, mean squared error, kurtosis, symmetry, zero-crossing rate, number of peaks, energy, and difference between maximum and minimum. Hence, the global model is trained by feeding the network with the feature vectors extracted from the pre-training dataset as described above. Then, the resulting weights are transmitted to the participating devices. Clearly, the same pre-processing pipeline is also applied for classification on each device. The pre-training dataset is also used to initialize the Label Propagation model that is distributed to all the participating devices. More details about Label Propagation will be presented in Section 3.4.2.
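To make the pre-processing pipeline more concrete, the following is a minimal sketch of per-axis segmentation and feature extraction. It covers only a subset of the listed features; the mean squared error and symmetry features are omitted because their exact definitions are not given here. The function names, the window size, the stride, and the definition of energy as the mean of squared samples are all assumptions made for illustration.

```python
import math
import statistics


def axis_features(window):
    """Compute a subset of handcrafted features for one sensor axis.

    Assumes the window holds at least two samples.
    """
    n = len(window)
    mean = statistics.fmean(window)
    variance = statistics.pvariance(window, mu=mean)
    # Zero-crossing rate: fraction of consecutive pairs whose sign differs.
    crossings = sum(1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0))
    # Peaks: samples strictly greater than both neighbours.
    peaks = sum(
        1 for i in range(1, n - 1)
        if window[i] > window[i - 1] and window[i] > window[i + 1]
    )
    # Kurtosis as the fourth standardized moment (0.0 for a constant window).
    kurtosis = (
        sum((x - mean) ** 4 for x in window) / (n * variance ** 2)
        if variance > 0 else 0.0
    )
    return {
        "average": mean,
        "variance": variance,
        "std": math.sqrt(variance),
        "median": statistics.median(window),
        "kurtosis": kurtosis,
        "zero_crossing_rate": crossings / (n - 1),
        "num_peaks": peaks,
        "energy": sum(x * x for x in window) / n,  # assumed definition
        "range": max(window) - min(window),
    }


def segment(signal, size, step):
    """Split a signal into fixed-size windows with the given stride."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]
```

In such a sketch, the feature vector fed to the network would be the concatenation of these values over every axis of every inertial sensor.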

The Federated Learning strategy of FedHAR
FedHAR takes advantage of FL to distribute the training of the global activity model over a large number of devices.
Each participant contributes to training the global model with a certain amount of labeled data. We will explain in Section 3.4 how labeled data are actually collected thanks to semi-supervised learning.
Periodically (e.g., each night) the server starts a global model update process. The devices that are available to perform computation (e.g., the ones idle and charging) inform the server that they are eligible to take part in the FL process. Afterwards, the server executes several communication rounds to update the weights of the global model.
A communication round consists of the following steps:
• The server sends the latest version of the global weights to a fraction of the eligible devices
As proposed in FedAVG [35], the server updates the global model weights by computing a weighted average of the locally learned model weights provided by the clients. Since the local weights may reveal private information, the aggregation is performed using the Secure Multiparty Computation approach presented in [8].
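As an illustration of the aggregation step, the sketch below computes a FedAVG-style weighted average in which each client's weights count in proportion to its number of local training samples. It works on flat weight vectors in the clear and deliberately omits the Secure Multiparty Computation layer; the function name and the data layout are assumptions.

```python
def federated_average(client_weights, client_sizes):
    """Weighted average of client weight vectors, FedAVG-style.

    client_weights: one flat list of floats per client.
    client_sizes: number of local training samples used by each client.
    """
    total = sum(client_sizes)
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for j in range(dim):
            # Each client contributes proportionally to its data size.
            averaged[j] += weights[j] * size / total
    return averaged
```

For example, with two clients holding 1 and 3 samples, the second client's weights contribute three quarters of the aggregated value.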

Semi-supervised learning
As previously mentioned, in order to tackle the labeled data scarcity issue, FedHAR relies on a combination of two semi-supervised learning techniques: Active Learning and Label Propagation. The former is used to obtain the ground truth from the user only when the confidence on the prediction is low. The latter is used to propagate available labels to unlabeled data.
3.4.1 Active Learning. FedHAR takes advantage of active learning to ask the user for feedback about her currently performed activity when there is uncertainty in the classifier's prediction. In particular, we adopted a state-of-the-art non-parametric method called VAR-UNCERTAINTY [62]. This approach compares the prediction confidence with a threshold $\theta \in [0, 1]$ that is dynamically adjusted over time. Initially, $\theta$ is initialised to $\theta = 1$. Let $A = \{a_1, a_2, \ldots, a_n\}$ be the set of target activities. Given the probability distribution over the possible activities of the current prediction $\langle p_1, p_2, \ldots, p_n \rangle$, we denote with $p^\star = \max_i p_i$ the probability value of the most likely activity $a^\star \in A$ (i.e., the predicted activity). If $p^\star$ is below $\theta$, we consider the system uncertain about the current activity performed by the user. In this case, an active learning process is started by asking the user the ground truth $a^f \in A$ about the current activity. The feedback is stored in the User Device Storage. When $a^f = a^\star$, it means that the most likely activity $a^\star$ is actually the one performed by the user, and hence the threshold $\theta$ is decreased to reduce the number of questions. On the other hand, when $a^f \neq a^\star$, $\theta$ is increased. More details about this active learning strategy can be found in [62].
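The threshold logic described above can be sketched as follows. This is not the exact VAR-UNCERTAINTY procedure of [62]: the multiplicative adjustment, the step value, and the clamping of the threshold to at most 1 are assumptions, and the class and method names are hypothetical.

```python
class ThresholdActiveLearner:
    """Confidence-threshold active learning with a dynamic threshold."""

    def __init__(self, threshold=1.0, step=0.05):
        self.threshold = threshold  # starts at 1, as described in the text
        self.step = step            # hypothetical adjustment step

    def needs_query(self, probabilities):
        """Ask the user when the top class probability is below the threshold."""
        return max(probabilities) < self.threshold

    def update(self, predicted, feedback):
        """Adjust the threshold after the user has answered a question."""
        if feedback == predicted:
            # The prediction was right: lower the threshold to ask less often.
            self.threshold *= 1.0 - self.step
        else:
            # The prediction was wrong: raise the threshold (capped at 1).
            self.threshold = min(1.0, self.threshold * (1.0 + self.step))
```

With this scheme, a run of confirmed predictions steadily lowers the threshold, which matches the expected decrease in the number of questions over time.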
In this work, we assume that active learning questions are prompted to the user in real-time through a dedicated application with a user-friendly interface. For the sake of usability, when querying the user, FedHAR only presents a couple of alternatives taken from the most probable activities. Figure 4 shows a screenshot of an active learning application that we implemented for smart-watches in another research work.

Label Propagation.
In order to increase the number of labeled samples to perform local training, FedHAR also relies on label propagation. Given a set of labeled and unlabeled data points, the goal of label propagation is to automatically annotate a portion of unlabeled data [61]. In FedHAR, the label propagation model is initialized with the pre-training dataset, and new labeled data points are obtained thanks to active learning.
The intuition behind label propagation is that data points close in the feature space likely correspond to the same class label. The Label Propagation model is a fully connected graph $G = (V, E)$ where the nodes in $V$ are the data samples (labeled or unlabeled) and the weight on each edge in $E$ is the similarity between the connected data points. In the literature, this similarity is usually computed using K-Nearest Neighbors (KNN) or Radial Basis Function Kernel (RBF kernel). FedHAR relies on the RBF kernel due to its trade-off between computational costs and accuracy [54]. Formally, the RBF kernel function is defined as $K(x, x') = e^{-\gamma \|x - x'\|^2}$, where $\|x - x'\|^2$ is the squared Euclidean distance between the feature vectors of two nodes $x$ and $x'$ (where $x'$ is a labeled node), and $\gamma \in \mathbb{R}^+$. Hence, the value of the RBF kernel function increases as the distance between data points decreases. The kernel is used to perform inductive inference to predict the labels on unlabeled data points (only when the prediction is reliable). This process is repeated until convergence (i.e., when there are no more unlabeled data points for which label propagation is reliable).
The Label Propagation process is started when the server needs to update the global model, and it is executed considering all the labeled and unlabeled samples stored in the User Device Storage. The Label Propagation model is personal and never shared with other users nor with the server.
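The sketch below illustrates the idea of propagating labels with an RBF kernel. It is a simplified kernel-voting variant and not the graph-based algorithm of [61]: each unlabeled point receives kernel-weighted votes from the labeled points and is annotated only when the winning label's share of the votes reaches a reliability threshold. The process repeats until no further point can be labeled reliably. The function names and the `min_confidence` parameter are assumptions.

```python
import math


def rbf_kernel(x, y, gamma):
    """RBF similarity: exp(-gamma * squared Euclidean distance)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def propagate_labels(labeled, unlabeled, gamma=1.0, min_confidence=0.9):
    """Iteratively label points whose kernel-weighted vote is reliable.

    labeled: list of (feature_vector, label) pairs.
    unlabeled: list of feature vectors.
    Returns (labeled_after, still_unlabeled).
    """
    labeled = list(labeled)
    pending = list(unlabeled)
    if not labeled:
        return labeled, pending
    changed = True
    while changed and pending:
        changed = False
        remaining, newly_labeled = [], []
        for x in pending:
            votes = {}
            for y, label in labeled:
                votes[label] = votes.get(label, 0.0) + rbf_kernel(x, y, gamma)
            total = sum(votes.values())
            best = max(votes, key=votes.get)
            # Accept the label only when it clearly dominates the vote.
            if total > 0 and votes[best] / total >= min_confidence:
                newly_labeled.append((x, best))
                changed = True
            else:
                remaining.append(x)
        labeled.extend(newly_labeled)
        pending = remaining
    return labeled, pending
```

Points that lie between classes stay unlabeled, which mirrors the requirement that labels are propagated only when the prediction is reliable.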

Model Personalization
FedHAR adopts a transfer learning strategy to fine-tune the Personalized Local Model on each user. The intuition behind the personalization mechanism is that the last layers of the neural network (i.e., the ones closer to the output) encode personal characteristics of activity execution, while the remaining layers encode more general features that are common between different users [58]. A similar mechanism has been also used in a recent FL approach to HAR [56]. As depicted in Figure 5, we refer to the last layers of the neural network as the User Personalized Layers, while we refer to the remaining ones as Shared Hidden Layers.
In FedHAR, when a node receives the updated global weights from the server, each layer of the Personalized Local Model is updated according to these weights except for the User Personalized Layers. As we previously mentioned, the weights of the Personalized Local Model are not used for the FL process, and hence the User Personalized Layers are only stored locally and never shared with the server.
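The layer-wise update can be illustrated with a short sketch in which a model is represented as a dictionary from layer names to weight lists. The representation, the layer names, and the function name are assumptions for illustration only.

```python
def apply_global_update(personalized_model, global_weights, personalized_layers):
    """Overwrite shared layers with global weights; keep personalized layers.

    personalized_model, global_weights: dicts mapping layer name -> weights.
    personalized_layers: names of the layers that stay local to the user.
    """
    updated = dict(personalized_model)
    for name, weights in global_weights.items():
        if name not in personalized_layers:
            updated[name] = list(weights)  # copy to avoid aliasing
    return updated
```

After such an update, the personalized layers would still be trained locally on the user's own labeled data during the following local training steps.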

EXPERIMENTAL EVALUATION
In this section, we describe in detail the extensive experimental evaluation that we carried out to quantitatively assess the effectiveness of FedHAR. First, we describe the public datasets that we considered in our experiments. Then, we present our novel evaluation methodology that aims at assessing both the generalization and personalization capabilities of FedHAR. Finally, we discuss the results that we obtained on the target datasets.

Datasets
Since FL makes sense when many users participate in collaboratively training the global model, we considered publicly available datasets of physical activities (performed both in outdoor and indoor environments) that were collected involving a significant number of subjects. However, there are only a few public datasets with these characteristics. One of them is MobiAct [49], which includes labeled data from 60 different subjects with high variance in age and physical characteristics. The dataset contains data from inertial sensors (i.e., accelerometer, gyroscope, and magnetometer) of a smartphone placed in the pocket of each subject during activity execution. Due to its characteristics, this dataset was also used in other works that proposed FL applied to HAR (e.g., the work presented in [55]). In our experiments, we considered the following physical activities: standing, walking, jogging, jumping, and sitting.
Manuscript submitted to ACM
We also consider the well-known WISDM dataset [30]. This dataset has been widely adopted as benchmark for HAR. WISDM contains accelerometer data collected from a smartphone in the pocket of each subject during activity execution. WISDM includes data from 36 subjects. The activities included in this dataset are the following: walking, jogging, sitting, standing, and taking stairs.

Evaluation methodology
In the following, we describe the novel methodology that we designed to evaluate the effectiveness of FedHAR, both in terms of personalization and generalization. We split each dataset into three partitions that we call $Pt$, $Tr$, and $Ts$.
The partition $Pt$ (i.e., pre-training data) contains data of users that we only use to initialize the global model. $Tr$ (i.e., training data) is the dataset partition that includes data of users who participate in FL. Finally, $Ts$ (i.e., test data) is a dataset partition that includes data of left-out users that we only consider to periodically evaluate the generalization capabilities of the global model. In our experiments, we randomly partition the users as follows: 15% whose data will populate $Pt$, 65% whose data will populate $Tr$, and 20% whose data will populate $Ts$.
We partition the data for each user in the training partition $Tr$ into $h$ shards of equal size. In realistic scenarios, each shard should contain data collected during a relatively long time period (e.g., a day) where a user executes many different activities. However, the considered datasets only have a limited amount of data for each user (usually less than one hour of activities for each user). Hence, we generate shards as follows. Given a user $u \in Tr$, we randomly assign to each shard a fraction $1/h$ of the available data samples associated with $u$ in the dataset. Note that each data sample of a user is associated with exactly one shard. This mechanism allows us to simulate the realistic scenario described before, where users perform several types of activities in each shard.
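A minimal sketch of the user partitioning and of the random shard assignment is shown below, assuming the 15%/65%/20% split over the pre-training, training, and test partitions. The function names are hypothetical, and when the number of samples is not a multiple of the number of shards the shard sizes differ by at most one.

```python
import random


def split_users(users, rng, pretrain_frac=0.15, train_frac=0.65):
    """Randomly split users into pre-training, training, and test partitions."""
    users = list(users)
    rng.shuffle(users)
    n_pt = round(len(users) * pretrain_frac)
    n_tr = round(len(users) * train_frac)
    # The remaining users (about 20% by default) form the test partition.
    return users[:n_pt], users[n_pt:n_pt + n_tr], users[n_pt + n_tr:]


def make_shards(samples, num_shards, rng):
    """Randomly assign each sample of one user to exactly one shard."""
    samples = list(samples)
    rng.shuffle(samples)
    return [samples[i::num_shards] for i in range(num_shards)]
```

Passing an explicit `random.Random` instance keeps each repetition of the experiment reproducible while still varying the split across repetitions.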
Evaluation Algorithm. In the following, we describe our novel evaluation methodology step by step. First, the labeled data in the pre-training partition $Pt$ are used to initialize the global model, which is then distributed to the devices of all the users in the training partition $Tr$. Then, we evaluate the recognition capabilities of this initial model on the test partition $Ts$ in terms of F1 score. This assessment allows us to measure how the initial global model generalizes on unseen users before any FL step.
As we previously mentioned, we partitioned the data of each user in $Tr$ into exactly $h$ shards. For the sake of evaluation, we assume a synchronous system in which the shards of the different users in $Tr$ are temporally aligned and occur simultaneously (i.e., the first shards of every user occur in the same time interval, the second shards of every user occur in the same time interval, and so on). The evaluation process is composed of $h$ iterations, one for each shard. Considering the $i$-th shard, we proceed as follows:
(1) The device of each user in $Tr$ exploits the Personalized Local Model to classify the continuous stream of inertial sensor data in its shard. We use the classification output to evaluate the recognition rate in terms of F1 score, providing an assessment of personalization. Note that, during this phase, we also apply our active learning strategy and we keep track of the number of triggered questions.
(2) When all data in the shard have been processed (by all devices), the server starts a number of communication rounds with a subset of the devices in order to update the global weights. Each round is implemented as follows:
(a) The server randomly selects a certain percentage $p\%$ of users in $Tr$ and sends to their devices the last update of the global weights.
(3) After the execution of all the communication rounds, the latest version of the global weights is sent to the devices of the users in $Tr$, which will apply the personalization process described in Section 3.5.
Note that our evaluation methodology introduces several levels of randomness: assigning users to $Pt$, $Tr$, and $Ts$; assigning data samples to shards; selecting devices at each communication round. We iterate experiments 10 times and average the results in order to make our estimates more robust.
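The overall structure of the evaluation can be sketched as a nested loop over shards and communication rounds, with a random subset of the participating users selected at each round. The sketch only enumerates which users would be selected; the classification, local training, and aggregation steps are left out. The function names and all numeric values used when calling them are hypothetical.

```python
import random


def select_clients(users, percentage, rng):
    """Randomly pick the given percentage of users for one round."""
    k = max(1, round(len(users) * percentage / 100))
    return rng.sample(list(users), k)


def evaluation_schedule(users, num_shards, rounds_per_shard, percentage, seed=0):
    """List (shard, round, selected users) triples for one experiment run."""
    rng = random.Random(seed)
    schedule = []
    for shard in range(num_shards):
        # Personalized classification on this shard would happen here.
        for rnd in range(rounds_per_shard):
            selected = select_clients(users, percentage, rng)
            schedule.append((shard, rnd, selected))
        # Global weights would then be sent to all participating users.
    return schedule
```

Repeating such a run with different seeds and averaging the resulting scores corresponds to the repeated-experiment protocol described above.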

Results
In the following, we report the results of the evaluation of FedHAR.

Classification model and hyper-parameters.
As classification model, we used a fully-connected deep neural network. The network consists of four fully connected layers having respectively 128, 64, 32, and 16 neurons, and a softmax layer for classification. We use Adam [28] as optimizer. We chose this network for three reasons: 1) our method does not rely on feature learning, 2) training a fully-connected network is more suitable for mobile devices since it requires fewer computational resources than more complex networks (e.g., CNN, LSTM), and 3) a similar configuration exhibited good performance in the literature [55]. As hyper-parameters, we empirically chose = 30%, [...] represent informative data that are crucial for label propagation. Hence, the evaluation on both datasets confirms that the combination of active learning and label propagation leads to the best results.
In Figure 8 and Figure 9 we show the generalization capability of the global model on left-out users (i.e., users in partition ) after each communication round performed during the FL process with the users in .
The red lines mark the end of each shard. The results indicate that the federated model steadily improves even for those users who did not contribute training data, even though the number of active learning questions continuously decreases.
These plots also confirm that the combination of label propagation and active learning leads to the best results on both datasets.

FedHAR versus approaches based on fully labeled data.
We compared our approach with two existing FL methods based on fully labeled data. The first one is the well-known FedAVG [35], which is the most common FL method in the literature. FedAVG simply averages the model parameters obtained from local training on the participating nodes (without any personalization).
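The averaging step of FedAVG can be illustrated with a short sketch. The representation of a model as a flat list of floats and the function name `federated_average` are assumptions made for illustration only. In its usual formulation, FedAVG weights each client by the number of its local training samples; when all clients hold the same amount of data, this reduces to a plain mean.

```python
def federated_average(client_weights, client_sizes=None):
    """Average per-client parameter vectors (hypothetical helper).

    client_weights: list of equally long lists of floats, one per client.
    client_sizes: optional number of local samples per client; if omitted,
    every client counts equally.
    """
    if not client_weights:
        raise ValueError("need at least one client")
    if client_sizes is None:
        client_sizes = [1] * len(client_weights)
    total = sum(client_sizes)
    averaged = [0.0] * len(client_weights[0])
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            # Each client contributes proportionally to its share of the data.
            averaged[i] += w * size / total
    return averaged
```

For example, `federated_average([[1.0, 2.0], [3.0, 4.0]])` returns `[2.0, 3.0]`, while passing `client_sizes=[1, 3]` shifts the result towards the second client.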
The second method that we use for comparison is called FedHealth [14]. This is one of the first FL approaches proposed for activity recognition on wearable sensors data. Similarly to our approach, FedHealth applies personalization using transfer learning.
For the sake of fairness in our experiments, we adapted FedAVG and FedHealth to use the same neural network that we use in FedHAR. We then performed our experiments using our evaluation methodology, simulating that, for FedAVG and FedHealth, each node has the ground truth for every data sample in each shard. Consequently, the evaluation of those methods does not include active learning or label propagation. Moreover, unlike FedHAR, FedAVG and FedHealth use only a single local model.
The results of this comparison for the users in (i.e., the ones that actively participated in the FL process) are reported in Figure 10a and Figure 10b.
From these plots, we observe that FedHAR converges to recognition rates that are similar to solutions based on fully labeled data at each shard. The advantage of FedHAR is that it can be used for realistic HAR deployments where the availability of labeled data is scarce. Despite a reduced number of required annotations, FedHAR performs even better than FedAVG on the WISDM dataset, while on MobiAct it performs slightly worse. Moreover, FedHAR is only ≈ 3% behind FedHealth on both datasets.
Figure 11 shows how the recognition rate improves between shards for each activity for the users in on both datasets. We observed an improvement of the recognition rate shard after shard for each considered activity. The only exception is the standing activity on the MobiAct dataset in the third shard, which maintains the same F1 score. The greatest improvement occurs between the first and the second shard. This is due to the fact that, in the first shard, activities are recognized using the initial global model only trained with the pre-training dataset. Starting from the second shard, classification is performed with the Personalized Local Model updated thanks to FL and personalized using our transfer learning approach.

Impact of personalization.
Figure 12 and Figure 13 show the impact of the FedHAR personalization strategy based on transfer learning. This evaluation is performed on the users in the partition.
As expected, fine-tuning the personal models improves the recognition rate and reduces the number of active learning questions. Note that, during the first shard, classification is performed using the weights derived from the pre-training dataset, and personalization is applied starting from the second shard.

Generality of the approach
While we designed FedHAR with wearable-based activity recognition as the target application, we believe that this combination of semi-supervised learning and FL can also be applied to many other applications. Our method is suitable for human-centered classification tasks that have the following characteristics:
• There is a large number of clients that participate in the FL process.
• Classification needs to be performed on a continuous data stream, where labels are not naturally available.
• Each node generates a significant amount of unlabeled data.
• It is possible to periodically obtain the ground truth by delivering active learning questions to users who are available to provide a small number of labels. Note that, in real-time applications like HAR, for the sake of usability active learning questions should be prompted temporally close to the prediction.
• It is possible to obtain a limited training set to initialize the global model. Hence, a small group of volunteers should be available (in an initial phase) for annotated data acquisition.
• The nodes should be capable of computing training operations. Clearly, nodes can also rely on trusted edge gateways/servers (as proposed in [55]).

Privacy concerns
Although FL is a significant step towards protecting user privacy in distributed machine learning, the shared model weights may still reveal some sensitive information about the participating users [37,41]. Similarly to other works, FedHAR uses Secure Multiparty Computation (SMC) [7,18] to mitigate this problem. However, other approaches have been proposed, including differential privacy (DP) [3,20], and hybrid approaches that combine SMC and DP [48].
The advantage of DP is its reduced communication overhead, which comes at the cost of lower model accuracy. For the sake of this work, we opted for SMC in order to more realistically compare the effectiveness of our semi-supervised approach with other approaches, considering privacy as an orthogonal problem. However, we also plan to investigate how to integrate differential privacy in our framework and its impact on the recognition rate.
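To give an intuition of why SMC-style aggregation does not alter the aggregated model, the following toy sketch shows pairwise additive masking, one common building block of secure aggregation. It is not the protocol used in FedHAR, whose details are not given in this section: the modulus, the integer encoding of the updates, and the function names are assumptions, and a single seeded generator stands in for the pairwise shared secrets that a real protocol would derive through key agreement.

```python
import random

MODULUS = 2**31 - 1  # assumed modulus for integer-encoded (quantized) updates


def mask_updates(updates, seed=0):
    """Add pairwise masks that cancel out once all masked updates are summed."""
    rng = random.Random(seed)  # stand-in for pairwise shared secrets
    n, dim = len(updates), len(updates[0])
    masked = [list(u) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            for d in range(dim):
                m = rng.randrange(MODULUS)
                # Client i adds the mask and client j subtracts the same mask.
                masked[i][d] = (masked[i][d] + m) % MODULUS
                masked[j][d] = (masked[j][d] - m) % MODULUS
    return masked


def aggregate(masked):
    """Server-side sum of the masked updates, coordinate by coordinate."""
    dim = len(masked[0])
    return [sum(u[d] for u in masked) % MODULUS for d in range(dim)]
```

The server only sees the masked vectors, yet their sum equals the sum of the original updates because every mask is added once and subtracted once. Practical protocols additionally handle clients that drop out during a round, which this sketch ignores.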

Need for larger datasets for evaluation
We evaluated FedHAR choosing those well-known public HAR datasets that involved the highest number of users, simulating the periodicity of FL iterations by partitioning the dataset. However, the effectiveness of FedHAR in large-scale scenarios should be evaluated on significantly larger datasets. By larger, we mean in terms of the number of users involved, the amount of available data for each user, and the number of target activities. Indeed, FedHAR makes sense when thousands of users are involved, continuously performing activities day after day. However, observing the encouraging results on our limited datasets, we are confident that FedHAR would perform even better in such large-scale evaluations.

Resource efficiency
It is important to mention that FedHAR is not optimized in terms of computational efficiency. Indeed, training two deep learning models on mobile devices may be computationally demanding and it may be problematic especially for those devices with low computational capabilities. This problem could be mitigated by relying on trusted edge gateways, as proposed in [56].
We want to point out that several research groups are proposing effective ways to dramatically reduce the computational effort of deep learning processes on mobile devices [31,59]. Moreover, the GPU modules embedded in recent smartphones exhibit performance similar to that of entry-level desktop GPUs, and this trend is expected to continue in the next few years [26].
Another limitation of our work is that the label propagation model requires storing the collected feature vectors as a graph. This is clearly not sustainable for a long time on a mobile device. This problem could be solved by imposing a limit on the size of the label propagation graph and periodically deleting old or poorly informative nodes.
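One possible way to bound the size of the label propagation graph is sketched below. This is only an illustration of the idea, not part of FedHAR: the class name `BoundedNodeStore`, the scalar `informativeness` score, and the eviction policy (least informative first, oldest first among ties) are hypothetical, and the edges of the graph are not modeled.

```python
class BoundedNodeStore:
    """Keep at most max_nodes feature vectors (hypothetical helper)."""

    def __init__(self, max_nodes):
        self.max_nodes = max_nodes
        self._nodes = {}   # node_id -> (informativeness, insertion order, features)
        self._clock = 0    # monotonically increasing insertion counter

    def add(self, node_id, features, informativeness):
        self._nodes[node_id] = (informativeness, self._clock, features)
        self._clock += 1
        while len(self._nodes) > self.max_nodes:
            # Evict the least informative node; among ties, the oldest one.
            victim = min(self._nodes, key=lambda k: self._nodes[k][:2])
            del self._nodes[victim]

    def node_ids(self):
        return set(self._nodes)
```

With `max_nodes=2`, adding nodes with scores 0.9, 0.2, and 0.5 in this order leaves the nodes scored 0.9 and 0.5 in the store. How to score the informativeness of a node (for instance, from the confidence of the prediction associated with it) is left open here.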

CONCLUSION AND FUTURE WORK
In this work we presented FedHAR, a novel semi-supervised federated learning framework for activity recognition on mobile devices. FedHAR takes into account the data scarcity problem, combining active learning and label propagation to semi-automatically annotate sensor data for each user. To the best of our knowledge, FedHAR is the first application of federated learning to activity recognition that is not based on the assumption that labeled data exists for all participating clients. Our results show that the combination of active learning and label propagation leads to recognition rates that are close to the ones reached by solutions that rely on fully supervised learning to train the local models.
In future work, we plan to use a different federated model depending on the profile of each user (e.g., age, fitness, weight). Indeed, HAR is more effective when the collaborative model is trained considering users with similar physical traits [47]. This approach would require adapting FedHAR to automatically group users who share similar characteristics while, at the same time, protecting their privacy.