Introduction

The majority of approaches for Human Activity Recognition (HAR) based on data continuously acquired from mobile and wearable devices rely on supervised learning methods [1].

While supervised learning leads to high recognition rates, collecting a sufficiently representative amount of labeled data to train the recognition model is often a real challenge [2]. For instance, data annotation can be performed directly by the monitored subject while performing activities (self-annotation). However, this approach is very obtrusive and error-prone. Alternatively, external observers can annotate the activity execution of a subject (in real-time or by semi-automatic video annotation), but these methods are time-consuming and privacy-intrusive.

Among the solutions that have been proposed to tackle the labeled data scarcity problem, semi-supervised learning represents a promising research direction that has been explored in the last few years [3]. Semi-supervised methods only use a small amount of labeled data to initialize the recognition model, which is continuously updated taking advantage of unlabeled data. However, there are still several challenges that limit the deployment of these methods in realistic scenarios. Indeed, even though semi-supervised approaches mitigate the data scarcity problem, they do not consider the scalability and privacy issues that arise in training a real-world recognition model that includes data from a large number of different users. From the scalability point of view, the computational effort that is required to train a global model significantly grows as the number of users increases. Considering privacy aspects, activity data may reveal sensitive information, like the daily behavior of a subject and her habits [4]. Accurate HAR also requires a certain amount of personalization on the end users [5].

In 2016, Google introduced the federated learning (FL) framework [6]. In the FL paradigm, the model training task is distributed over a multitude of nodes (e.g., mobile devices). Each node uses its own labeled data to train a local model. The resulting model parameters of each participating node are forwarded to a server that is in charge of aggregating them. Finally, the server shares the aggregated parameters to the participating nodes. FL is a promising direction to make activity recognition scalable for a large number of users. Moreover, FL mitigates the privacy problem since only model parameters, and not actual data, are shared with the server, and privacy-preserving mechanisms (e.g., Secure MultiParty Computation, Differential Privacy) are used when aggregating parameters [7].

FL has been recently applied to HAR showing that it can reach an accuracy very close to centralized methods [8]. However, all existing solutions assume that each node has complete availability of labeled sensor data. This is actually the general setting of existing works based on FL, which in the literature has been primarily considered for fully supervised learning tasks [9]. While this assumption may be valid for some applications (e.g., the Google approach for keyboard suggestions improvement relies on labeled data implicitly provided by users when typing or confirming suggestions [10]), it is not realistic for applications like HAR where labeled data availability is significantly limited. Extending FL to semi-supervised learning is one of the open challenges in this area [9].

In this work, we propose FedAR: a hybrid semi-supervised and FL framework that enables personalized, privacy-aware, and scalable HAR based on mobile and wearable devices. Unlike the majority of existing solutions, FedAR assumes a limited availability of labeled data. In particular, FedAR combines active learning and label propagation to provide labels to a large amount of unlabeled data. Newly labeled data are periodically used by each node to perform local training. The resulting model parameters are transmitted to the server, which aggregates them using Secure Multiparty Computation. FedAR also relies on transfer learning to fine-tune the global model for each user, while generating a global model that generalizes over unseen users.

Considering the limitations of existing evaluation methodologies for FL applied to HAR [11], we designed a novel evaluation methodology to robustly assess both the generalization and the personalization capabilities of our approach. The results of our experimental evaluation on two publicly available datasets show that FedAR reaches recognition rates close to state-of-the-art solutions that assume the complete availability of labeled data. Moreover, both the generalization and the personalization capabilities of FedAR keep increasing over time. Last but not least, the amount of triggered active learning questions is small and acceptable for a real-world deployment.

To the best of our knowledge, FedAR is the first FL framework for HAR that tackles the data scarcity problem while considering personalization. Hence, we believe that FedAR is a significant step towards realistic deployments of HAR systems based on FL.

In summary, the contributions of this work are the following:

  • We present FedAR, a novel hybrid approach that combines federated, semi-supervised, and transfer learning to tackle the data scarcity problem for real-world personalized HAR.

  • We propose a novel strategy to reliably evaluate the evolution of the personalization and generalization capabilities of FedAR over time.

  • An extensive evaluation on public datasets shows that FedAR reaches similar recognition rates with respect to well-known approaches that assume high availability of labeled data. At the same time, FedAR triggers a small number of active learning questions that quickly decreases while using the system.

Related work

Labeled data scarcity in HAR

Considering HAR based on data collected from mobile devices’ inertial sensors, the majority of approaches rely on supervised machine learning [12,13,14,15,16]. However, these approaches need a significant amount of labeled data to train the classifier. Indeed, different users may perform the same activities in very different ways, but also distinct activities may be associated with similar motion patterns. The annotation task is costly, time-consuming, intrusive, and hence prohibitive on a large scale [2]. In the following, we summarize the main methodologies that have been proposed in the literature to mitigate this problem.

Unsupervised approaches have been proposed to derive activity clusters from unlabeled sensor data [17]. Those approaches still need annotations to reliably associate an activity label to each cluster. Since distinct human activities often share similar sensor patterns, purely unsupervised data-driven approaches for activity recognition are still a challenge considering real-world scenarios.

Some research efforts focused on knowledge-based approaches based on logical formalisms, especially targeting smart-home environments [18, 19]. These approaches usually rely on ontologies to represent the common-sense relationships between activities and sensed data. One of the main issues of knowledge-based approaches is their inadequacy to model the intrinsic uncertainty of sensor-based systems and the large variety of activity execution modalities.

Data augmentation is a more popular solution adopted in the literature to mitigate the data scarcity problem, especially considering imbalanced datasets [20, 21]. In these approaches, the available labeled data are slightly perturbed to generate new labeled samples. With respect to our method, data augmentation is an orthogonal approach that could be integrated to further increase the amount of labeled data. Recently, data augmentation in HAR has also been tackled taking advantage of GAN models to generate synthetic data more realistic than the ones obtained by the above-mentioned approaches [22, 23]. However, GANs require to be trained with a significant amount of data.

Many transfer learning approaches have been applied to HAR to fine-tune models learned from a source domain with available labeled data to a target domain with low availability of labeled data [24,25,26,27]. FedAR relies on transfer learning to fine-tune the personal local model taking advantage of the global model trained by all the participating devices.

An effective method to tackle data scarcity for HAR when the feature space is homogeneous (like in FedAR) is semi-supervised learning [3, 28,29,30]. Semi-supervised methods only use a restricted labeled dataset to initialize the activity model. Then, a significant amount of unlabeled data is semi-automatically annotated. The most common semi-supervised approaches for HAR are self-learning [30], label propagation [31], co-learning [32], and active learning [33,34,35]. Active learning has also been adopted in HAR to handle the class imbalance problem [36]. Hybrid solutions based on semi-supervised learning and knowledge-based reasoning have been proposed in [37]. Existing semi-supervised solutions do not consider the scalability problems related to building a recognition model with a large number of users for real-world deployments. Moreover, the data required to build such collaborative models is sensitive, as it could reveal private information about the users (e.g., user health condition and habits) [4, 38, 39].

Federated learning for HAR

Recently, the FL paradigm has been proposed to distribute model training over a multitude of nodes [6, 7, 9, 40,41,42]. A recent survey divides FL methods into three categories: horizontal, vertical, and transfer FL [7]. FedAR is a horizontal FL method: the participating mobile devices share the same feature space, but they have a different space in samples (i.e., each device considers data for a specific user). Among the required characteristics of FL approaches, the personalization of the global model on each client plays a major role [43].

FL has been previously applied to mobile/wearable HAR to distribute the training of the activity recognition model among the participating devices [8, 11, 44,45,46,47,48]. In this area, recent works also proposed to learn the global model in a decentralized fashion [49]. Existing works show that FL solutions for HAR reach recognition accuracy similar to standard centralized models [45]. Moreover, since personalization is an important aspect for HAR [5], existing works also show that applying transfer learning strategies to fine-tune the global model on each client leads to a significantly improved recognition rate [8, 46]. One of the major drawbacks of these solutions is that they assume high availability of labeled data, hence considering a fully supervised setting.

The combination of federated and active learning has been recently proposed for Intrusion Detection Systems [50]. However, semi-supervised federated learning solutions for HAR have been only partially explored. The existing works mainly focus on unsupervised methods to collaboratively learn (based on the FL setting) a robust feature representation from the unlabeled stream of sensor data. The global feature representation is then used to build activity classifiers using a limited amount of labeled data. For instance, the work in [47] proposes an approach based on autoencoders, while the work in [51] is based on self-supervised learning. However, those works do not consider model personalization and they do not propose approaches to continuously obtain new labeled data from each user. Nonetheless, we believe that those works focus on a very important orthogonal problem with respect to the one addressed by FedAR. Indeed, feature learning from unlabeled data could be integrated in FedAR to further reduce the amount of active learning questions and to improve the recognition rate. Recently, the work in [52] proposes a solution to build the global model by aggregating the local models’ gradients from a small number of clients with labeled data and a large number of clients with unlabeled data. This method is based on a semi-supervised loss that relies on a novel unsupervised gradients aggregation strategy. Differently from this work, we do not assume the existence of clients with full availability of labeled data, and we also propose a practical solution to continuously improve the global model thanks to active learning and label propagation.

A common limitation in the literature is the methodology adopted to evaluate FL for HAR applications [8, 11, 48]. Indeed, none of the proposed methodologies truly assess the generality of the global model over users whose data have never been used for training. Moreover, only one iteration of the FL process is evaluated, while in a realistic deployment this process is repeated periodically with different data. In this work, we propose an evaluation methodology that overcomes the above-mentioned issues (see Section 5.2).

Overview of FedAR

In this section, we describe the overall FedAR framework at a high level.

Overall architecture

The overall architecture of FedAR is depicted in Fig. 1. For the sake of this work and without loss of generality, we illustrate FedAR applied to physical activity recognition based on inertial sensors data collected from personal mobile devices.

Fig. 1 Overall architecture of FedAR

Following the FL paradigm, the actors of FedAR are a server and a set of clients that cooperate to periodically compute the weights of a global activity recognition model. In order to address the labeled data scarcity problem, FedAR initializes the global model in an offline phase with a limited amount of labeled data, while each client implements a semi-supervised learning strategy (i.e., a combination of active learning and label propagation) to semi-automatically label a portion of the unlabeled sensor data stream. An overview of our semi-supervised strategy is described in Section 3.3.

Periodically (e.g., every night), the server starts a process to update the weights of the global model. Each client uses its available labeled data to train its local model. The resulting local weights are transmitted to the server, which aggregates them with the ones from the other clients to obtain a new version of the global model. Finally, the new version of the global model weights is transmitted to each client. Since different users may perform activities in very different ways, a model personalization module on each client is in charge of fine-tuning the updated global model weights on the specific user. A more detailed overview on the global model update and personalization is described in Section 3.4.

Local models

One of the strengths of FedAR is that it is designed considering both personalization and generalization aspects. Personalization is crucial for the local models to recognize the activities of each user more accurately. On the other hand, generalization is a desirable property for the global model. Indeed, some participating users may not wish to collect labeled data (not even a small amount), or may have devices not adequate to perform local training. Those users are not able to actively contribute to the federated learning process, and their clients would directly use the last version of the global model for activity classification.

In order to guarantee both personalization and generalization, in FedAR, each client stores two distinct instances of the activity model. The first is called Shareable Model, and it is the one used for federated learning. In order to personalize the activity model on each user, a straightforward solution would be to fine-tune the Shareable Model with transfer learning approaches [53]. However, recent studies show that a global model built by aggregating the weights of fine-tuned models exhibits poor generalization capabilities on external users [11]. In order to overcome this problem, in FedAR, at the end of each global model update the clients that actively contribute to the federated learning process create a copy of the Shareable Model that is called Personalized Model. The Personalized Model is fine-tuned on the specific user and it is used for activity classification. Besides improving generalization, keeping the weights of the Personalized Model private has a positive impact on privacy protection [54].

Semi-supervised data labeling and classification

Figure 2 depicts the semi-supervised data labeling and classification flow of FedAR.

Fig. 2 Semi-supervised data labeling and classification data flow

Each client in FedAR uses the Personalized Model to classify activities in real-time on the continuous stream of unlabeled pre-processed sensor data. Before classification, each unlabeled data sample is stored in the Feature Vectors Storage. This storage collects both unlabeled and labeled data samples. After classification, if the confidence on the current prediction is below a threshold, an active learning process is started, and the system asks the user about the activity that she was actually performing. The feedback from the user is then associated with the corresponding feature vector in the Feature Vectors Storage. Active learning makes it possible to assign a label to those informative data points that can effectively improve the local model. For the sake of usability, the number of active learning queries should be low, since they may bother the user during activity execution. For this reason, FedAR also periodically applies a Label Propagation algorithm to spread the labels acquired through active learning to a larger number of unlabeled data points. The advantage of label propagation is to further improve the recognition rate by training the classifier with a significant amount of labeled data samples and, at the same time, to reduce the number of needed active learning queries over time.

Global model update and personalization

Periodically (e.g., every night), the server asks the participating clients to update the global model. This process is depicted in Fig. 3.

Fig. 3 Local models training and personalized model update

First, each client replaces its Shareable Model with the current version of the Global Model. Then, the labeled data in the Feature Vectors Storage are used to perform local training of the Shareable Model. After training, the updated Shareable Model weights are then forwarded to the server, which is in charge of aggregating the weights from all the clients to generate a new version of the global model. These steps are repeated until convergence of the global model. At the end of this process, the Shareable Model of each client is replaced with the last stable version of the Global Model.

Then, the Model Personalization module generates a copy of the Shareable Model that is called Personalized Model, which is fine-tuned using the Feature Vectors Storage. The result of this process is a Personalized Model that takes advantage of the high-level features of the Global Model as well as the personalized aspects of the specific user.

FedAR under the hood

In this section, we describe in detail the algorithms of FedAR.

The activity model

Since we consider a setting with limited availability of labeled data, activity models that automatically learn features from raw data are not effective in FedAR. Indeed, based on our experiments that we describe in Section 5.3.6, CNN models reach significantly lower recognition rates in FedAR due to the high complexity of learning reliable features from limited labeled data. For this reason, in FedAR, the activity classification model is based on a fully connected deep learning model, and the input is a vector of handcrafted features. In particular, we choose features that proved to be effective for HAR [37]. Recent studies in the HAR domain demonstrate that a good choice of handcrafted features and fully connected models can lead to recognition rates comparable to the ones of state-of-the-art CNN models [55].

In particular, for each axis of each inertial sensor, we consider the following features: average, variance, standard deviation, median, mean squared error, kurtosis, symmetry, zero-crossing rate, number of peaks, energy, and difference between maximum and minimum. These features are extracted from fixed-length temporal windows of sensor data of w seconds. Before feature extraction, we apply a median filter on each temporal window to reduce noise in sensor measurements. After feature extraction, we apply standardization as a feature scaling technique.
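The pre-processing pipeline above (median filter, per-axis features, standardization) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the kernel size of the median filter, the use of the fourth standardized moment for kurtosis, the mean-centering before counting zero crossings, and the definition of energy as the mean of squares are all assumptions. The "mean squared error" and "symmetry" features are left out because the text does not define them precisely.

```python
import numpy as np

def median_filter(x, k=3):
    """Sliding median with edge padding; the kernel size k is an assumed value."""
    pad = k // 2
    padded = np.pad(x, pad, mode="edge")
    return np.array([np.median(padded[i:i + k]) for i in range(len(x))])

def axis_features(x):
    """Features for one axis of one inertial sensor over one temporal window."""
    x = np.asarray(x, dtype=float)
    mean, var, std = x.mean(), x.var(), x.std()
    centered = x - mean
    # Kurtosis as the fourth standardized moment (assumed definition).
    kurt = np.mean(centered ** 4) / var ** 2 if var > 0 else 0.0
    # Zero-crossing rate of the mean-centered signal (assumed definition).
    signs = np.sign(centered)
    zcr = np.mean(signs[:-1] * signs[1:] < 0)
    # Number of strict local maxima.
    peaks = np.sum((x[1:-1] > x[:-2]) & (x[1:-1] > x[2:]))
    energy = np.mean(x ** 2)
    return np.array([mean, var, std, np.median(x), kurt, zcr,
                     peaks, energy, x.max() - x.min()])

def extract_features(window):
    """window: array of shape (n_samples, n_axes) -> flat feature vector."""
    window = np.asarray(window, dtype=float)
    return np.concatenate([axis_features(median_filter(window[:, a]))
                           for a in range(window.shape[1])])

def standardize(F):
    """Zero mean and unit variance per feature column."""
    mu, sigma = F.mean(axis=0), F.std(axis=0)
    return (F - mu) / np.where(sigma > 0, sigma, 1.0)
```

In a deployment, the standardization statistics would presumably be computed once on training data and reused at inference time; the sketch computes them on the matrix it is given only for brevity.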

Initialization of the global model

At the very beginning, the participating clients in FedAR need a pre-trained global model to infer labels on unlabeled data. However, in this work we assume limited availability of labeled data.

Fig. 4 Initialization of the global model in FedAR

Hence, FedAR initializes the global model using a restricted annotated dataset (we will call it pre-training dataset in the following). The pre-training dataset is also used to initialize the label propagation algorithm. In realistic settings, the pre-training dataset can be, for example, a combination of publicly available datasets, or a small training set specifically collected by a restricted number of volunteers. Figure 4 summarizes the initialization mechanism of FedAR.

The federated learning strategy of FedAR

In the following, we describe the FL process to update the global and local models. Periodically (e.g., each night) the server starts a global model update process. The devices that are available to perform computation (e.g., the ones idle and charging) inform the server that they are eligible to take part in the FL process. Afterwards, the server executes several communication rounds to update the weights of the global model.

A communication round consists of the following steps:

  • The server sends the latest version of the global weights to a fraction of the eligible devices

  • Each device uses the labeled data in the Feature Vectors Storage to train the Shareable Model

  • When local training is completed, each device sends the new weights of the Shareable Model to the server

  • The server aggregates the local weights to compute the new global weights

The communication rounds are repeated until the global model converges. Then, the new weights are transmitted to each participating device including the ones that did not actively contribute to the communication rounds.

The server updates the global model weights by executing a weighted average of the locally learned model weights provided from clients. Since the local weights may reveal private information, the aggregation is performed using the Secure Multiparty Computation approach presented in [41]. The pseudo-code of the server-side federated learning process is described in Algorithm 1, while the client-side in Algorithm 2.

Algorithm 1 Server-side federated learning process
Algorithm 2 Client-side federated learning process
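A communication round with weighted-average aggregation can be sketched as follows. This is an illustrative approximation, not a transcription of Algorithms 1 and 2: weighting each client by its number of labeled samples is an assumption (the text only states that a weighted average is used), `local_train` is a hypothetical callback standing in for training the Shareable Model, and the Secure Multiparty Computation layer of [41] is omitted entirely, so the aggregation here runs in the clear.

```python
import numpy as np

def aggregate(client_weights, client_sizes):
    """Weighted average of per-layer weights.

    Each client is weighted by its number of labeled samples (assumption).
    """
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(n_layers)
    ]

def communication_round(global_weights, clients, local_train, fraction=0.5, rng=None):
    """One round: select a fraction of eligible clients, train locally, aggregate.

    clients: list of per-client labeled datasets (Feature Vectors Storage content).
    local_train(weights, data): hypothetical function returning updated weights.
    """
    if rng is None:
        rng = np.random.default_rng()
    k = max(1, int(round(fraction * len(clients))))
    selected = rng.choice(len(clients), size=k, replace=False)
    updates, sizes = [], []
    for idx in selected:
        data = clients[idx]
        # Each client starts from a copy of the latest global weights.
        updates.append(local_train([w.copy() for w in global_weights], data))
        sizes.append(len(data))
    return aggregate(updates, sizes)
```

The server would call `communication_round` repeatedly until the global weights stop changing appreciably; the convergence criterion is not specified in the text and is left out of the sketch.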

Model personalization

FedAR adopts a transfer learning strategy to fine-tune the Personalized Model on each user. The intuition behind the personalization mechanism is that the last layers of the neural network (i.e., the ones closer to the output) encode personal characteristics of activity execution, while the remaining layers encode more general features that are common between different users [56].

Fig. 5 Shared and personal layers

As depicted in Fig. 5, we refer to the last l layers of the neural network as the User Personalized Layers, while we refer to the remaining ones as Shared Hidden Layers. In FedAR, when the update of the global model is complete, each client creates the Personalized Model as a copy of the Shareable Model. In order to fine-tune the Personalized Model on each user, the Shared Hidden Layers are frozen, and the Feature Vectors Storage is used to train the User Personalized Layers.
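The freezing mechanism can be illustrated with a small sketch. It assumes that each layer is stored as a dictionary with a weight matrix `W` and a bias `b`, and that per-layer gradients have already been computed on the user's labeled data by some training routine; both the data layout and the plain gradient-descent update are hypothetical choices made for illustration.

```python
import numpy as np

def make_personalized(shareable_layers):
    """Create the Personalized Model as a deep copy of the Shareable Model."""
    return [{"W": layer["W"].copy(), "b": layer["b"].copy()}
            for layer in shareable_layers]

def fine_tune_step(layers, grads, l, lr=0.01):
    """One gradient step that updates only the last l layers.

    The first len(layers) - l layers (Shared Hidden Layers) stay frozen;
    the last l layers (User Personalized Layers) are trained.
    """
    first_personal = len(layers) - l
    for i in range(first_personal, len(layers)):
        layers[i]["W"] -= lr * grads[i]["W"]
        layers[i]["b"] -= lr * grads[i]["b"]
    return layers
```

Because the Personalized Model is a copy, fine-tuning it leaves the Shareable Model untouched, which is what allows the latter to keep taking part in federated aggregation.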

Semi-supervised learning

In the following, we describe how each client semi-automatically provides labels to the stream of unlabeled sensor data. FedAR relies on a combination of two semi-supervised learning techniques: Active Learning and Label Propagation.

Active learning

An active learning process asks for the user's feedback about her currently performed activity when there is uncertainty in the classifier’s prediction. The intuition is the following: unlabeled data samples for which the classification confidence is significantly low would have the most impact in improving the classifier if the label were available (i.e., they are the most informative ones).

FedAR relies on a state-of-the-art non-parametric active learning approach called VAR-UNCERTAINTY [57]. This approach compares the prediction confidence with a threshold \(\theta \in [0,1]\) that is dynamically adjusted over time. Initially, \(\theta\) is set to \(\theta =1\). Let \(\mathbf {A} = \{A_1, A_2, \dots , A_n\}\) be the set of target activities. Given the probability distribution over the possible activities of the current prediction \(\langle p_1, p_2, \dots , p_n \rangle\), we denote with \(p^\star = \max _{i} p_i\) the probability value of the most likely activity \(A^\star \in \mathbf {A}\) (i.e., the predicted activity). If \(p^\star\) is below \(\theta\), we consider the system uncertain about the current activity performed by the user. In this case, an active learning process is started by asking the user for the ground truth \(A^f \in \mathbf {A}\) about the current activity. The feedback \(A^f\) is stored in the Feature Vectors Storage. When \(A^f = A^\star\), it means that the most likely activity \(A^\star\) is actually the one performed by the user, and hence the threshold \(\theta\) is decreased to reduce the number of questions. On the other hand, when \(A^f \ne A^\star\), \(\theta\) is increased. More details about this active learning strategy can be found in [57]. The pseudo-code of classification and active learning is described in Algorithm 3.
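The threshold dynamics described above can be sketched as follows. The update rule follows the description in this section (decrease \(\theta\) on confirmation, increase it on correction); the multiplicative step size `step` and its default value are assumptions, since the exact adjustment is only given in [57].

```python
class VarUncertainty:
    """Dynamic-threshold active learning, following the description in the text."""

    def __init__(self, theta=1.0, step=0.01):
        self.theta = theta  # threshold, initially 1
        self.step = step    # adjustment step (assumed hyperparameter)

    def needs_query(self, probs):
        """Ask the user when the top class probability p* is below theta."""
        return max(probs) < self.theta

    def update(self, predicted, feedback):
        """Adjust theta after the user answered a query."""
        if feedback == predicted:
            # Prediction confirmed: lower theta to ask fewer questions.
            self.theta *= (1 - self.step)
        else:
            # Prediction was wrong: raise theta to ask more questions.
            self.theta *= (1 + self.step)
        # Keep theta in [0, 1].
        self.theta = min(1.0, max(0.0, self.theta))
```

A typical use would call `needs_query` on every prediction and call `update` only when a query was actually shown and answered.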

Fig. 6 Example of an active learning interface for smart-watches

We assume that active learning queries are prompted to the user in real-time through a dedicated application, thanks to a user-friendly interface. Each query asks the user to choose the activity that she is currently performing among the possible ones. For the sake of usability, FedAR only presents a couple of alternatives taken from the most probable activities. Figure 6 shows a screenshot of an active learning application that we implemented for smart-watches in another research work.

Algorithm 3 Classification and active learning

Label propagation

The major drawback of active learning is that the queries may interrupt the user while performing an activity. In order to reduce the interaction with the user and, at the same time, to train the local models with a larger amount of labeled data, FedAR also relies on label propagation. The Label Propagation process is started when the server requests an update of the global model (see Algorithm 1). Given a set of labeled and unlabeled data points, the goal of label propagation is to automatically spread labels to a portion of unlabeled data [58]. The intuition behind label propagation is that data points close in the feature space likely correspond to the same class label. The Label Propagation model of FedAR is a fully connected graph \(g = (V,E)\) where the nodes V are all the data samples in the Feature Vectors Storage and the weight on each edge in E is the similarity between the connected data points. In the literature, this similarity is usually computed using K-Nearest Neighbors (KNN) or Radial Basis Function Kernel (RBF kernel). FedAR relies on the RBF kernel due to its trade-off between computational costs and accuracy [59]. Formally, the RBF kernel function is defined as \(K(x, x^{\prime }) = e ^{- \gamma ||x - x^{\prime } ||^2}\) where \(||x - x^{\prime } ||^2\) is the squared Euclidean distance between the feature vectors of two nodes x and \(x^{\prime }\) (where \(x^{\prime }\) is a labeled node), and \(\gamma \in \mathbb {R}^{+}\). Hence, the value of the RBF kernel function increases as the distance between data points decreases. The kernel is used to perform inductive inference to predict the labels on unlabeled data points, based on a threshold on the similarity between the nodes. This process is repeated until convergence (i.e., when there are no more unlabeled data points for which label propagation is reliable based on the threshold).

In FedAR, the Label Propagation model (i.e., the graph) is initialized with the labeled data points of the pre-training dataset. Moreover, this model is personal and never shared with other users nor with the server.
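The following sketch illustrates threshold-based propagation with the RBF kernel. It is a deliberately simplified stand-in, not the graph-based algorithm of [58]: each unlabeled point receives the label of its most similar labeled point when the kernel value reaches a threshold, and newly labeled points can in turn propagate their labels until no reliable assignment remains. The values of `gamma` and `threshold`, the use of integer class labels, and the convention that -1 marks an unlabeled point are all assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2) for all pairs of rows in X and Y."""
    sq = (np.sum(X ** 2, axis=1)[:, None]
          + np.sum(Y ** 2, axis=1)[None, :]
          - 2 * X @ Y.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

def propagate_labels(X_lab, y_lab, X_unl, gamma=1.0, threshold=0.8):
    """Return labels for X_unl; -1 marks points left unlabeled."""
    X_lab = np.asarray(X_lab, dtype=float)
    y_lab = np.asarray(y_lab)
    X_unl = np.asarray(X_unl, dtype=float)
    labels = np.full(len(X_unl), -1)
    while True:
        pending = np.flatnonzero(labels == -1)
        if len(pending) == 0:
            break
        K = rbf_kernel(X_unl[pending], X_lab, gamma)
        best = K.argmax(axis=1)
        reliable = K[np.arange(len(pending)), best] >= threshold
        if not reliable.any():
            break  # convergence: no remaining point passes the threshold
        newly = pending[reliable]
        labels[newly] = y_lab[best[reliable]]
        # Newly labeled points join the labeled set for the next pass.
        X_lab = np.vstack([X_lab, X_unl[newly]])
        y_lab = np.concatenate([y_lab, labels[newly]])
    return labels
```

With labeled points at (0, 0) and (10, 10), a point at (0.5, 0) is not labeled in the first pass for `gamma=1.0` and `threshold=0.8`, but is reached in the second pass through a neighbor at (0.1, 0), which shows why the process is iterated.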

Experimental evaluation

In this section, we describe in detail the extensive experimental evaluation that we carried out to quantitatively assess the effectiveness of FedAR. First, we describe the public datasets that we considered in our experiments. Then, we present our novel evaluation methodology that aims at assessing both the generalization and personalization capabilities of FedAR. Finally, we discuss the results that we obtained on the target datasets.

Datasets

Since FL makes sense when many users participate in collaboratively training the global model, we considered publicly available datasets of physical activities (performed both in outdoor and indoor environments) that were collected involving a significant number of subjects. However, there are only a few public datasets with these characteristics. One of them is MobiAct [60], which includes labeled data from 60 different subjects with high variance in age and physical characteristics. The dataset contains data from inertial sensors (i.e., accelerometer, gyroscope, and magnetometer) of a smartphone positioned in a trousers’ pocket freely chosen by the subject in any random orientation. \(73\%\) of the participants were male, while \(27\%\) were female. The subjects’ age spanned between 20 and 47 (average: 26), the height ranged from 160 cm to 189 cm (average: 175 cm), and the weight varied from 50 kg to 120 kg (average: 76 kg). The adopted data acquisition frequency is the highest enabled by the sensors of the selected smartphone (i.e., at most 200 Hz). Due to its characteristics, this dataset was also used in other works that proposed FL applied to HAR (e.g., the work presented in [48]). In our experiments, we considered the following physical activities: standing, walking, jogging, jumping, and sitting. The distribution of activity labels in MobiAct is illustrated in Table 1.

We also consider the well-known WISDM dataset [12]. This dataset has been widely adopted as a benchmark for HAR. WISDM contains accelerometer data (sampling rate 20 Hz) collected from a smartphone located in the front pants leg pocket of each subject during activity execution. WISDM includes data from 36 subjects. The data collection was supervised by one of the WISDM team members to ensure the quality of the collected data. The activities included in this dataset are the following: walking, jogging, sitting, standing, and taking stairs. The distribution of activity labels in WISDM is illustrated in Table 2. Unfortunately, further information about the participants like gender, age, and weight distribution is not publicly available.

Table 1 MobiAct: distribution of the considered activities
Table 2 WISDM: distribution of the considered activities

Evaluation methodology

In the following, we describe the methodology that we designed to evaluate the effectiveness of FedAR, both in terms of personalization and generalization. We split each dataset into three partitions that we call Pt, Tr, and Ts. The partition Pt (i.e., pre-training data) contains data of users that we only use to initialize the global model. Tr (i.e., training data) is the dataset partition that includes data of users who participate in FL. Finally, Ts (i.e., test data) is a dataset partition that includes data of left-out users that we only consider to periodically evaluate the generalization capabilities of the global model. In our experiments, we randomly partition the users as follows: 15 % whose data will populate Pt, 65 % whose data will populate Tr, and 20 % whose data will populate Ts.
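The user-level partition into Pt, Tr, and Ts can be sketched as below. The function name, the fixed seed, and the rounding of partition sizes are illustrative assumptions; only the 15/65/20 proportions come from the text.

```python
import random

def split_users(user_ids, pt=0.15, tr=0.65, seed=0):
    """Randomly split users into (Pt, Tr, Ts); Ts receives the remaining users."""
    users = list(user_ids)
    random.Random(seed).shuffle(users)
    n_pt = round(pt * len(users))
    n_tr = round(tr * len(users))
    return users[:n_pt], users[n_pt:n_pt + n_tr], users[n_pt + n_tr:]
```

Splitting by user rather than by sample is what keeps the users in Ts completely unseen during training, so that the evaluation on Ts measures generalization.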

We partition the data for each user in Tr into sh shards of equal size. In realistic scenarios, each shard should contain data collected during a relatively long time period (e.g., a day) where a user executes many different activities. However, the considered datasets only have a limited amount of data for each user (usually less than 1 h of activities for each user). Hence, we generate shards as follows. Given a user \(u \in Tr\), we randomly assign to each shard a fraction \(\frac{1}{sh}\) of the available data samples associated with u in the dataset. Note that each data sample of a user is associated with exactly one shard. This mechanism allows us to simulate the realistic scenario described before, where users perform several types of activities in each shard.
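The shard generation for a single user can be sketched as below. The helper name `make_shards` and the striding trick are illustrative assumptions; when the number of samples is not a multiple of sh, the shard sizes in this sketch differ by at most one.

```python
import random


def make_shards(samples, sh, seed=0):
    """Randomly assign each sample of one user to exactly one of `sh` shards."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    # Striding over the shuffled indices yields shards whose sizes differ by at most one.
    return [[samples[i] for i in indices[k::sh]] for k in range(sh)]


# Example: 10 samples of one user split into sh = 3 shards.
shards = make_shards(list(range(10)), sh=3)
```

Since the assignment is random, each shard is expected to contain a mix of the activities performed by the user, which is the property the simulation relies on.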

Evaluation algorithm

In the following, we describe our novel evaluation methodology step by step. First, the labeled data in Pt are used to initialize the global model, which is then distributed to the devices of all the users in Tr that will use it as the first version of the Personalized Model. We evaluate the recognition capabilities of the initial pre-trained global model on the partition Ts in terms of F1 score. This assessment allows us to measure how the initial global model generalizes on unseen users before any FL step.
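For reference, the F1 score used throughout the evaluation can be computed as in the sketch below. The text does not state how per-class scores are aggregated, so the macro average used here is an assumption, and `macro_f1` is a hypothetical helper name.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class with no support and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with two activities.
score = macro_f1(["walk", "walk", "sit", "sit"], ["walk", "sit", "sit", "sit"])
```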

As we previously mentioned, for each user, we partition their data samples in Tr into exactly sh shards. For the sake of evaluation, we assume a synchronous system in which the shards of the different users in Tr are temporally aligned and occur simultaneously (i.e., the first shards of every user occur in the same time interval, the second shards of every user occur in the same time interval, and so on). Note that, in the considered datasets, each user has a different data distribution and a different number of samples. Hence, within a specific shard, each client contributes data that follow its personal distribution. The evaluation process is composed of sh iterations, one for each shard. Considering the i-th shard, we proceed as follows:

  1. The device of each user in Tr exploits the Personalized Model to classify the continuous stream of inertial sensor data in its shard. We use the classification output to evaluate the recognition rate in terms of F1 score, providing an assessment of personalization. Note that, during this phase, we also apply our active learning strategy and we keep track of the number of triggered questions.

  2. When all data in the shard have been processed (by all devices), the server starts a number r of communication rounds with a subset of the devices in order to update the global weights. Each round is implemented as follows:

    (a) The server randomly selects a certain percentage \(p\%\) of users in Tr and sends the latest version of the global weights to their devices.

    (b) Upon receiving the global weights, each selected device applies Label Propagation (see Section 4.5.2) and uses the newly labeled data to train its Shareable Model. After training, the resulting weights are transmitted to the server.

    (c) The server merges the received weights, obtaining a new version of the global model weights.

    (d) We evaluate in terms of F1 score the recognition rate of the resulting global model on the left-out users in Ts (providing an assessment of generalization).

  3. After the execution of all the communication rounds, each user’s device:

    (a) replaces the weights of the Shareable Model and Personalized Model with the ones of the latest global model;

    (b) fine-tunes the Personalized Model with labeled data from active learning and label propagation;

    (c) starts the personalization process described in Section 4.4.
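A single communication round of step 2 can be sketched as follows. This is a simplified illustration under stated assumptions: model weights are flat lists of floats, the merge is a plain unweighted average in the style of FedAVG (the text only says that the server "merges" the weights), and `local_train` is a hypothetical callable standing in for label propagation plus Shareable Model training on one device. Secure aggregation is omitted.

```python
import random


def average_weights(weight_lists):
    """Element-wise mean of flat weight vectors (unweighted, FedAVG-style assumption)."""
    n = len(weight_lists)
    return [sum(ws) / n for ws in zip(*weight_lists)]


def communication_round(global_weights, tr_users, local_train, p=0.30, rng=random):
    """One round: sample a fraction p of Tr, train locally, merge the returned weights."""
    k = max(1, round(p * len(tr_users)))
    selected = rng.sample(list(tr_users), k)
    # Each selected device starts from a copy of the current global weights.
    updates = [local_train(user, list(global_weights)) for user in selected]
    return average_weights(updates), selected


# Toy usage: every device shifts each weight by +1.0, so the average does too.
toy_train = lambda user, weights: [w + 1.0 for w in weights]
new_weights, selected = communication_round(
    [0.0, 1.0], list(range(10)), toy_train, rng=random.Random(0)
)
```

In the evaluation loop, this function would be called r times per shard, with the global model evaluated on Ts after each call.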

Note that our evaluation methodology introduces several levels of randomness: assigning users to Ts, Tr, and Pt; assigning data samples to shards; and selecting devices at each communication round. We repeat each experiment 10 times and average the results in order to make our estimates more robust.

Results

In the following, we report the results of the evaluation of FedAR.

Classification model and hyper-parameters

As explained and motivated in Section 4.1, our classification model is a fully connected deep neural network. The network consists of four fully connected layers having respectively 128, 64, 32, and 16 neurons, and a softmax layer for classification. We use Adam [61] as optimizer. The choice of this specific network architecture is due to the good performance reported in the federated HAR literature [48]. As hyper-parameters, we empirically chose \(w=4s\), \(p=30\%\), \(r=10\), \(l=2\), \(sh=3\), and 10 local training epochs with a batch size of 30 samples. These hyper-parameters have been empirically determined based on data in Ts. The low number of epochs and communication rounds is due to the small size of the public datasets. This also limits the data in each shard. In a large-scale deployment, these parameters should be accurately calibrated.

Impact of semi-supervised learning

Figure 7 and Fig. 8 show how the F1 score and the percentage of active learning questions change at each shard for the users in Tr.

Fig. 7 MobiAct: the impact of label propagation and active learning on the subjects that participated in the FL process.

Fig. 8 WISDM: the impact of label propagation (LP) and active learning (AL) on the subjects that participated in the FL process.

We observe that the F1 score significantly improves shard after shard, while the number of active learning questions decreases. Averaging the results of both datasets, the percentage of active learning questions at the first shard is around \(25\%\), while at the last shard it is only around \(5\%\). This result indicates that our method continuously improves the recognition rate with a limited amount of labels provided by the users. Moreover, the continuous decrease in the number of questions supports the usability of our method, which will prompt fewer and fewer questions over time. These figures also show the impact of combining active learning with label propagation. Without label propagation, active learning alone leads to a lower recognition rate and a higher number of questions. This means that the labeled data points derived by label propagation improve the activity model. On the other hand, we observe that label propagation leads to unsatisfactory results without active learning. Indeed, the labeled samples obtained by active learning represent informative data that are crucial for label propagation. Hence, the evaluation on both datasets confirms that the combination of active learning and label propagation leads to the best results.

In Fig. 9 and Fig. 10 we show the generalization capability of the global model on left-out users (i.e., users in partition Ts) after each communication round performed during the FL process with the users in Tr.

Fig. 9 MobiAct: the trend of F1 score on the left-out users after each communication round. This figure also shows the impact of active learning and label propagation. Each red line marks the end of a shard.

Fig. 10 WISDM: the trend of F1 score on the left-out users after each communication round. This figure also shows the impact of active learning and label propagation. Each red line marks the end of a shard.

The red lines mark the end of each shard. The results indicate that the federated model steadily improves even for those users who did not contribute training data, although the number of active learning questions continuously decreases. These plots also confirm that the combination of label propagation and active learning leads to the best results on both datasets.

FedAR versus approaches based on fully labeled data

We compared our approach with two existing FL methods based on fully labeled data. The first one is the well-known FedAVG [6], which is the most common FL method in the literature. FedAVG simply averages the model parameters derived by the local training on the participating nodes (without any personalization).

The second method that we use for comparison is FedHealth [8]. This is one of the first FL approaches proposed for activity recognition on wearable sensor data. Similarly to our approach, FedHealth applies personalization using transfer learning.

Since FedAR considers a limited amount of available labeled data, our goal is to achieve a recognition rate that is as close as possible to the one obtained by solutions that assume full availability of annotations.

For the sake of fairness, in our experiments we adapted FedAVG and FedHealth to use the same neural network that we use in FedAR. We ran both methods under our evaluation methodology, simulating that each node has the ground truth for every data sample in each shard. Hence, the evaluation of those methods does not include active learning and label propagation. Moreover, differently from FedAR, FedAVG and FedHealth only use a single local model.

The results of this comparison for the users in Tr (i.e., the ones that actively participated in the FL process) are reported in Fig. 11a and b.

Fig. 11 Comparison of FedAR with methods based on fully labeled data.

From these plots, we observe that FedAR converges to recognition rates that are similar to solutions based on fully labeled data at each shard. The advantage of FedAR is that it can be used for realistic HAR deployments where the availability of labeled data is scarce. Despite a reduced number of required annotations, FedAR performs even better than FedAVG on the WISDM dataset, while on MobiAct it performs slightly worse. Moreover, FedAR is only \(\approx 3\%\) behind FedHealth on both datasets.

FedAR performance on each activity

Figure 12 shows how the recognition rate improves between shards for each activity for the users in Tr on both datasets.

Fig. 12 F1 score at each shard for each activity on the users that participated in the FL process.

We observed an improvement of the recognition rate shard after shard for each considered activity. The only exception is the standing activity on the MobiAct dataset in the third shard, which maintains the same F1 score.

In general, the greatest improvement occurs between the first and the second shards. This is due to the fact that, in the first shard, activities are recognized using the initial global model only trained with the pre-training dataset. Starting from the second shard, classification is performed with the Personalized Model updated thanks to FL and personalized using our transfer learning approach.

Impact of personalization

Figure 13 and Fig. 14 show the impact of the FedAR personalization strategy based on transfer learning. This evaluation is performed on the users in the Tr partition.

Fig. 13 MobiAct: results on the users that participated in the FL process for each shard, with and without personalization.

Fig. 14 WISDM: results on the users that participated in the FL process for each shard, with and without personalization.

As expected, fine-tuning the personal models improves both the recognition rate and the number of active learning questions. Note that, during the first shard, classification is performed using the weights derived from the pre-training data, and personalization is applied starting from the second shard.

Fully connected vs convolutional models

The classification model in FedAR is a fully connected network (we will refer to it as MLP for the sake of brevity) that receives handcrafted feature vectors as input. Nonetheless, Convolutional Neural Networks (CNNs) proved to be very effective in fully supervised HAR approaches, since they can automatically learn features from raw data [55].

We performed a preliminary experiment to compare MLP and CNN in a fully supervised centralized approach using a leave-one-subject-out cross-validation. As CNN architecture, we consider the one recently proposed in [62] since it proved to be one of the most effective for HAR. Figure 15 shows the outcome of this comparison. We observe that, considering a fully supervised centralized setting, CNN is more effective on both datasets.
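The leave-one-subject-out protocol used in this preliminary experiment can be sketched as follows. The generator name and the dictionary-based data layout are assumptions made for illustration; model training and scoring are left out.

```python
def leave_one_subject_out(samples_by_subject):
    """Yield (held_out_subject, train_samples, test_samples), one fold per subject."""
    for held_out in samples_by_subject:
        # Train on the data of every other subject, test on the held-out one.
        train = [
            sample
            for subject, data in samples_by_subject.items()
            if subject != held_out
            for sample in data
        ]
        test = list(samples_by_subject[held_out])
        yield held_out, train, test


# Toy example with three subjects.
folds = list(leave_one_subject_out({"u1": [1, 2], "u2": [3], "u3": [4, 5]}))
```

Each fold tests on a subject whose data were never seen during training, so the averaged score reflects generalization to new users.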

Fig. 15 Centralized setting: MLP vs CNN based on leave-one-subject-out cross-validation.

However, we observed that CNN struggles to learn reliable features in our federated and semi-supervised setting, since the amount of labeled data to train the classifier is limited (cold start issue). Figures 16 and 17 show the comparison of FedAR using our MLP model with handcrafted features and the CNN model. On both datasets, MLP quickly reaches a higher F1 score than CNN with a significantly lower number of active learning queries. Since features are computed a priori, the MLP model can immediately focus on training the classification layers rather than learning features. Hence, these results motivate our choice of adopting an MLP model with handcrafted features in FedAR.

Fig. 16 WISDM: results on the users that participated in the FL process for each shard using both CNN and MLP networks.

Fig. 17 MobiAct: results on the users that participated in the FL process for each shard using both CNN and MLP networks.

Discussion

Generality of the approach

While we designed FedAR with wearable-based activity recognition as the target application, we believe that this combination of semi-supervised learning and FL can also be applied to many other applications. Our method is suitable for human-centered classification tasks that have the following characteristics:

  • There is a large number of clients that participate in the FL process.

  • Classification needs to be performed on a continuous data stream, where labels are not naturally available.

  • Each node generates a significant amount of unlabeled data.

  • It is possible to periodically obtain the ground truth by delivering active learning questions to users that are available to provide a small number of labels. Note that, in real-time applications like HAR, for the sake of usability active learning questions should be prompted temporally close to the prediction.

  • It is possible to obtain a limited training set to initialize the global model. Hence, a small group of volunteers should be available (in an initial phase) for annotated data acquisition.

  • The nodes should be capable of computing training operations. Clearly, nodes can also rely on trusted edge gateways/servers (as proposed in [48]).

Privacy concerns

Although FL is a significant step towards protecting user privacy in distributed machine learning, the shared model weights may still reveal some sensitive information about the participating users [63, 64]. Similarly to other works, FedAR uses Secure Multiparty Computation (SMC) [65, 66] to mitigate this problem. However, other approaches have been proposed, including differential privacy (DP) [67, 68] and hybrid approaches that combine SMC and DP [69].

The advantage of DP is the reduced communication overhead, at the cost of affecting the accuracy of the model. For the sake of this work, we opted for SMC in order to more realistically compare the effectiveness of our semi-supervised approach with other approaches, considering privacy as an orthogonal problem. However, we also plan to investigate how to integrate differential privacy into our framework and its impact on the recognition rate.

Need of larger datasets for evaluation

We evaluated FedAR choosing those well-known public HAR datasets that involved the highest number of users, simulating the periodicity of FL iterations by partitioning the dataset. However, the effectiveness of FedAR on large-scale scenarios should be evaluated on significantly larger datasets. By larger, we mean in terms of the number of users involved, the amount of available data for each user, and the number of target activities. Indeed, FedAR makes sense when thousands of users are involved, continuously performing activities day after day. However, observing the encouraging results on our limited datasets, we are confident that FedAR would perform even better on such large-scale evaluations.

Another limitation of the considered datasets is that both of them include data from only one position of the mobile device (trousers pocket). Since mobile and wearable devices can be worn in different positions (e.g., wristbands), a larger evaluation should also consider different device positions.

Resource efficiency

It is important to mention that FedAR is not optimized in terms of computational efficiency. Indeed, training two deep learning models on mobile devices may be computationally demanding and it may be problematic especially for those devices with low computational capabilities. This problem could be mitigated by relying on trusted edge gateways, as proposed in [46].

We want to point out that several research groups are proposing effective ways to dramatically reduce the computational effort of deep learning processes on mobile devices [70, 71]. Moreover, the GPU modules embedded in recent smartphones exhibit performance similar to that of entry-level desktop GPUs, and this trend is expected to continue in the next few years [72].

Another limitation of our work is that the label propagation model requires storing the collected feature vectors as a graph. This is clearly not sustainable for a long time on a mobile device. This problem could be solved by imposing a limit on the size of the label propagation graph and periodically deleting old or poorly informative nodes.
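One simple way to bound the storage of the label propagation graph is sketched below. This is only an illustration of the mitigation mentioned above, not part of FedAR: the class name `BoundedFeatureStore` is hypothetical, and eviction is purely age-based (oldest node first), whereas removing poorly informative nodes would require an additional scoring criterion that is not modeled here.

```python
from collections import deque


class BoundedFeatureStore:
    """Keeps at most `max_nodes` feature vectors, evicting the oldest ones first."""

    def __init__(self, max_nodes):
        # A deque with maxlen silently drops the oldest item when it is full.
        self._nodes = deque(maxlen=max_nodes)

    def add(self, feature_vector, label=None):
        """Store a feature vector; label is None for unlabeled samples."""
        self._nodes.append((tuple(feature_vector), label))

    def nodes(self):
        """Return the stored (feature_vector, label) pairs, oldest first."""
        return list(self._nodes)

    def __len__(self):
        return len(self._nodes)


# Toy usage: only the three most recent feature vectors are retained.
store = BoundedFeatureStore(max_nodes=3)
for i in range(5):
    store.add([float(i)], label="walking" if i % 2 == 0 else None)
```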

Conclusion and future work

In this work we presented FedAR, a novel semi-supervised federated learning framework for activity recognition on mobile devices. FedAR takes into account the data scarcity problem, combining active learning and label propagation to semi-automatically annotate sensor data for each user. To the best of our knowledge, FedAR is the first application of federated learning to personalized activity recognition that is not based on the assumption that labeled data exists for all participating clients. Our results show that the combination of active learning and label propagation leads to recognition rates that are close to the ones reached by solutions that rely on fully supervised learning to train the local models.

In future work, we plan to investigate how federated clustering can further help mitigate the non-IID problem for HAR [73]. Indeed, HAR is more effective when the collaborative model only involves users that are similar to one another [74]. We will study solutions based on federated clustering to automatically group users considering model similarity, creating a dedicated federated model for each cluster. Also, we plan to extend FedAR to automatically learn features from the unlabeled data stream, following the research direction proposed in [51].