1 Introduction

Many countries are currently experiencing a rapidly ageing population, which places immense pressure on healthcare resources (Chen et al. 2011). An appealing solution to this issue is to expand the role of continuous healthcare monitoring to private homes, complementing and potentially reducing the need for hospital inpatient care and face-to-face interaction with health professionals (Noor et al. 2020). In this context, smart home technology has emerged as a feasible approach to attain this goal (Amiribesheli et al. 2015). In particular, the ubiquity of wearable devices and smartphones makes monitoring data easily accessible (Bi et al. 2021). This creates an opportunity to leverage the massive volume of sensor data to extract clinically relevant information. Automatic human activity recognition (HAR) is therefore an enabling step towards inferring the behavior of people, further facilitating decision making and necessary interventions from carers and healthcare professionals (Diethe et al. 2017).

HAR has drawn extensive attention in recent years, and a range of machine learning approaches has been explored to address this problem, such as decision trees (Xu et al. 2019), support vector machines (Khan et al. 2014), and naive Bayesian networks (Gomes et al. 2012). In particular, the current state-of-the-art performance in HAR is achieved with deep learning architectures such as convolutional neural networks (CNN) (Wan et al. 2020; Gao et al. 2021) and long short-term memory (LSTM) networks (Ullah et al. 2019; Singh et al. 2021). The main benefit of these methods is that they can automatically extract discriminative, data-driven features from the raw input data.

Despite this remarkable progress, a significant challenge confronting this task lies in the lack of annotations, on which the aforementioned methods heavily rely (Bi et al. 2021). In real-world activity recognition systems, annotating HAR data is not only labour-intensive and time-consuming, but also demands domain-specific knowledge and skills (Bi et al. 2021). Furthermore, the annotation of HAR data in real life (through unscripted experiments) raises privacy and ethical concerns that may limit the annotation process (Zeng et al. 2017). These factors result in a noticeable scarcity of labelled data. Under these circumstances, achieving favourable performance with limited annotations becomes a challenging task.

Semi-supervised learning (SSL) and active learning are two compelling solutions to tackle this issue. Semi-supervised learning improves the model’s performance and generalization ability by leveraging unlabelled data, while active learning reduces the number of annotations required by strategically choosing the samples with maximal information and highest training utility (Bi et al. 2021). Owing to their lower dependence on annotations, the application of SSL and active learning to HAR has attracted increasing interest recently. Semi-supervision paradigms, such as self-training (Bota et al. 2019), co-training (Chen et al. 2020; Lv et al. 2018) and graph-based models (Han et al. 2019), have been successfully applied in HAR. Active learning effectively boosts the performance of HAR as well, where different active sample selection strategies are exploited during the active learning process, including uncertainty (Bi et al. 2021), diversity (Saito et al. 2015) and representativeness (Lughofer 2012) based schemes.

Given that both paradigms are effective in overcoming the hurdle of label scarcity, yet approach the problem from different perspectives, we propose to explore the combination of active learning and SSL for the HAR task, in the hope of enhancing the labelling efficiency with the former and taking advantage of unlabelled data with the latter. Active semi-supervised learning has been studied in a few image-related tasks (Rottmann et al. 2018; Zhang et al. 2019). However, to our knowledge, the integration of active learning into semi-supervised learning has never been investigated in HAR.

With the booming development of deep learning, applying deep models in SSL has attracted growing attention and shown remarkable improvements in performance (Zeng et al. 2017; Chen et al. 2020). Most of the current SSL methods work in an iterative manner. They usually employ the network of the last epoch in each iteration to predict the labels of unlabelled samples and use these predictions as training targets for the following iteration. However, if the predictions of the latest epoch are unreliable, which commonly happens due to the various sources of randomness in deep neural networks, they will mislead the subsequent model training and degrade the final performance. Based on the finding that an ensemble of multiple neural networks generally generates more robust predictions than a single network (Srivastava et al. 2014a), we propose to incorporate a temporal ensemble (Laine and Aila 2016) into a CNN for HAR by aggregating the historical outputs of the network, trained with dropout regularization.

Overall, this paper proposes a deep active semi-supervised approach for human activity recognition, in order to promote the recognition performance with reduced annotation cost. The main novelties and contributions of the proposed approach are threefold.

  1. We design a novel deep HAR model incorporating active learning and semi-supervised learning into one framework, which improves the model’s performance in a low annotation regime by selecting the most informative samples to be annotated and taking advantage of the information of massive unlabelled instances. To the best of our knowledge, this is the first work to combine these two techniques into one framework for HAR.

  2. A novel unsupervised loss term is introduced for employing the temporal ensemble of the deep model subject to consistency regularization, effectively enabling semi-supervised learning in combination with the supervised loss component. This unsupervised loss term reduces the impact of prediction uncertainty, producing more accurate and stable predictions for activities.

  3. We evaluate our proposed method, which we call ActSemiCNN, on three real benchmark datasets for activity recognition, i.e., PAMAP2, USCHAD and UCIHAR datasets. Extensive experiments were conducted to assess the proposed approach. The results demonstrate that ActSemiCNN achieves state-of-the-art recognition performance with significantly reduced annotation cost, and exhibits strong robustness and generalization ability.

The remainder of this paper is organized as follows. Section 2 briefly recalls related paradigms and methodologies. We detail the proposed deep active semi-supervised method in Sect. 3. Section 4 presents a comparative study applied to real benchmark datasets. We discuss the proposed method in Sect. 5. Finally, Sect. 6 concludes the paper.

2 Related work

In this section we review related works in human activity recognition, where the deep learning-based methods are described in Sect. 2.1, and the active learning and SSL-based methods are introduced in Sect. 2.2.

2.1 Deep learning-based activity recognition

Over the past decade, most wearable sensor-based HAR methods have relied on feature engineering based on domain expertise (Bi et al. 2020). However, these methods are relatively limited, as they depend on human creativity to come up with novel features and lack the power to capture underlying explanatory factors in low-level sensory inputs (Saeed et al. 2019; Merritt et al. 2018).

Most recently, the explosion of deep learning techniques has opened a new path for a broad spectrum of problems, as these techniques can automatically extract representative features. Kumar et al. (2021) studied deep models, diverse embedding representations and ensembling techniques for co-morbidity recognition from clinical records. Deep learning frameworks for HAR have been studied in a number of works (Chen and Xue 2015; Ordóñez and Roggen 2016; Yao et al. 2017; Bianchi et al. 2019; Haresamudram et al. 2019; Wan et al. 2020). Chen and Xue (2015) first proposed a CNN-based HAR method to classify activities collected with acceleration sensors. A CNN was utilized to automatically learn discriminative features from the signal sequences of accelerometers and gyroscopes in Bianchi et al. (2019). Haresamudram et al. (2019) leveraged an unsupervised convolutional auto-encoder to first extract feature representations and then used a multi-layer perceptron (MLP) to tune the network. To reduce the cost of hardware facilities, Wan et al. (2020) designed a real-time CNN-based HAR method for local feature extraction from smartphone accelerometer data. In Gao et al. (2021), a new multi-branch CNN was introduced, which performs kernel selection among multiple branches by means of an attention mechanism.

In addition, recurrent neural networks (RNN) show competitive results when applied to HAR tasks. RNNs and their extensions, such as the gated recurrent unit and long short-term memory (LSTM), have been applied to HAR in several recent publications. Ordóñez and Roggen (2016) combined LSTM and CNN to explicitly model the temporal dynamics of sequential data, achieving prominent performance in HAR from sensor data. Murad and Pyun (2017) proposed to use RNNs for building recognition models that are capable of capturing long-range dependencies in variable-length input sequences. Convolutional and recurrent neural networks were integrated to exploit local interactions among similar mobile sensors and extract temporal relationships to model signal dynamics in Yao et al. (2017). Ullah et al. (2019) developed an end-to-end deep model which consists of a single-layer neural network for data pre-processing and a stacked multi-layer LSTM network. An attention mechanism was further incorporated with LSTM in Singh et al. (2021), which not only captures the spatio-temporal features but also learns important time points.

The emerging formulation of self-supervised learning has been applied to HAR in the last two years. Saeed et al. (2019) designed an auxiliary task of recognizing diverse transformations performed on the raw input features, which is implemented by training a multi-task CNN, yielding features with high generalization ability. Haresamudram et al. (2020) introduced masked reconstruction to the HAR task as a viable self-supervised pre-training objective, and demonstrated improved performance over state-of-the-art semi-supervised learning methods. A contrastive predictive coding framework was developed (Haresamudram et al. 2021) to capture the underlying temporal structure of HAR data, leading to significantly improved recognition performance.

2.2 SSL and active learning-based HAR methods

Semi-supervised learning has been extensively used for sensor-based activity recognition. Semi-supervised HAR methods can be categorized from different perspectives. According to how the unlabelled samples are utilized, they can be classified into self-training, co-training and graph-based methods (Zhu et al. 2018). Lopes et al. (2012) proposed a self-training-based approach, which uses an ensemble of classifiers to alternatively select the unlabelled samples with the most confident predictions and assumes that their predicted labels are correct in order to assist further training of the classifiers. Co-training (Chen et al. 2020) first selects confident samples from independent feature spaces for two classifiers, and the selected samples with their estimated labels are then added to the training set. Liu et al. (2021) developed an HAR approach based on graph convolutional networks, which encodes arbitrary graphs by automatically updating the structure information under manifold regularization.

According to how features are extracted and utilized, semi-supervised HAR methods can be classified into feature engineering based methods (Subramanya et al. 2012) and feature learning based methods. One typical feature engineering based method was designed in Subramanya et al. (2012), which introduces a discriminative method based on boosted decision stumps to select features. The advent of deep learning has boosted the development of deep SSL-based HAR methods. Zeng et al. (2017) utilized unlabelled data in both the feature learning and model learning stages using a CNN-Ladder architecture. An adversarial network with an auto-encoder and two discriminator networks represented by fully connected layers was developed in Balabka (2019), which relieves the heavy reliance on large labelled datasets. Chen et al. (2020) designed a co-training HAR framework integrating an attention mechanism with recurrent convolutional models. The state-of-the-art mean teacher semi-supervised model was introduced to HAR in Narasimman et al. (2021), which averages model weights over training steps to produce more robust results.

Table 1 Representative related works with different degrees of supervision on human activity recognition

Active learning is another promising solution to address the annotation scarcity issue. It aims to reduce the labelling cost by selecting the most informative samples for annotation. Uncertainty sampling is the most extensively used strategy in active learning. Entropy (Hossain et al. 2017), marginal sampling (Bi et al. 2021) and least confident (Alemdar et al. 2011) measures have been adopted to measure the informativeness of the instances in HAR. In Shahmohammadi et al. (2017), Query-by-committee was employed to identify the samples that are worth annotating.

Table 1 tabulates the most representative works in deep learning, SSL and active learning-based human activity recognition. Analyzing the above methods, we find that: (1) The application of active learning and SSL in HAR has been intensively investigated and shown to be effective in reducing labelling costs. However, none of the above methods addresses activity recognition by jointly combining active learning and SSL. (2) The deep SSL models analyzed above mostly employ the predictions of the network on the unlabelled samples in every iteration but ignore the model’s uncertainty, which may mislead the model training with erroneous estimated labels if the current predictions are unreliable. Our proposed method differs from the aforementioned HAR methods in two aspects. Firstly, rather than using active learning or SSL separately, our proposed method integrates the two components into a unified framework. Secondly, we combine temporal ensembling with CNNs, incorporating the history information of networks during training, which is expected to generate more reliable predictions on unlabelled samples.

Fig. 1 Workflow of the proposed method. Given data from a number of participants, we first split the dataset into three independent subsets. Then samples from the validation set and a small number of randomly selected samples from the training set are manually annotated. Next, hyperparameters of the backbone model are selected based on the labelled training and validation sets. In the active semi-supervised model training process, we first select informative samples via active learning for annotation, and then train the model with both labelled and unlabelled instances. Finally, the obtained model is applied on the test set.

3 Methodology

We first formulate the problem and overview the pipeline of the proposed method in Sect. 3.1. Next, we introduce the two components, i.e., active learning based sample selection in Sect. 3.2 and semi-supervised model training with temporal ensembling in Sect. 3.3.

3.1 Problem formulation and overview

The HAR task is formulated as a classification problem where data samples are a set of sensor data sequences gathered over a time interval and classes correspond to activities. We will first introduce the application scenario and processing flow of the proposed method, and then give an overview of the active semi-supervised deep model.

The proposed method is designed for a common scenario where data from a number of participants is available, but labels are scarce due to a limited annotation budget. Figure 1 illustrates the processing pipeline. Given the sequential data from a number of participants as input, we first split the whole dataset into three independent subsets, namely training, validation and test sets, in a subject-wise fashion (each subject appears in exactly one partition). Next, we randomly select a very small number of samples (less than the total annotation budget) from the training set and manually label them. Based on the labelled training set and the validation set, we then determine the optimal hyperparameters of the backbone neural network. In the active semi-supervised model training process, we first select informative samples via active learning for annotation, and then train the semi-supervised model with both the labelled and unlabelled instances. Finally, the model is applied on the test set to assess its performance.
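The subject-wise partitioning described above can be illustrated with a minimal sketch. The function name `subject_wise_split` and its arguments are hypothetical and chosen here for illustration only; the sketch assumes that every sample carries the identifier of the participant who produced it.

```python
def subject_wise_split(subject_ids, val_subjects, test_subjects):
    """Split sample indices so that each subject falls in exactly one partition.

    subject_ids   : list with one subject identifier per sample
    val_subjects  : identifiers reserved for validation
    test_subjects : identifiers reserved for testing
    All remaining subjects are used for training.
    """
    val_subjects, test_subjects = set(val_subjects), set(test_subjects)
    if val_subjects & test_subjects:
        raise ValueError("a subject cannot be in both validation and test sets")

    train_idx, val_idx, test_idx = [], [], []
    for idx, sid in enumerate(subject_ids):
        if sid in test_subjects:
            test_idx.append(idx)
        elif sid in val_subjects:
            val_idx.append(idx)
        else:
            train_idx.append(idx)
    return train_idx, val_idx, test_idx


# Toy usage with made-up sample-to-subject assignments.
ids = [101, 101, 105, 106, 102, 106]
train_idx, val_idx, test_idx = subject_wise_split(ids, [105], [106])
```

Splitting by subject rather than by sample avoids windows from the same participant appearing in both the training and the test partitions.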

Next, we describe the framework of the proposed active semi-supervised deep model. For a given HAR training set with N instances, we use \(\mathbf {X}\) to denote all the training instances and L to denote the number of labels allowed by the annotation budget. The labelled sample set is denoted as \(S_s = \{(x_i,y_i), i=1,\ldots ,l\}\) with \(y_i\in \{1,\ldots ,C\}\), where C indicates the number of activity classes and \(l < L\). Figure 2 illustrates the framework, which includes two steps, i.e., active learning based sample selection (Step 1) and semi-supervised model training with temporal ensembling (Step 2).

In Step 1, given the input raw data with labelled sample set \(S_s\), we iteratively select samples to be annotated via active learning under the restriction of annotation budget L. In each iteration, we first train a supervised deep model with the existing labels and then make predictions on the unlabelled candidates, based on which the most informative samples are selected and annotated, yielding an updated set of labels for the next iteration. Note that the initial labels \(S_s\) for the first active learning iteration are a small number of randomly selected and annotated samples. In Step 2, both the labelled and unlabelled data are fed into a temporal ensembling-based 1-dimensional (1-D) CNN for training, with dropout as a regularization method. We apply an integrated semi-supervised loss function with two terms, i.e., a supervised loss term and an unsupervised loss term, to learn the weights of the network. The optimization of the loss function is an iterative process, where the updated network parameters \(\Theta \) and the normalized ensemble prediction \(\tilde{z}\) in the current iteration act as input for the network training of the next iteration. \(\tilde{z}\) is set to 0 in the first iteration. We describe the two steps in more detail in Sects. 3.2 and 3.3 respectively.
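The iterative label acquisition of Step 1 can be sketched as a generic loop. This is an illustrative outline under stated assumptions rather than the implementation used in this work: `train_fn`, `score_fn` and `annotate_fn` are hypothetical placeholders for supervised training, informativeness scoring and manual annotation, and `per_round` is an assumed name for the number of samples queried per iteration.

```python
def active_label_acquisition(initial_labels, pool_ids, budget, per_round,
                             train_fn, score_fn, annotate_fn):
    """Grow a labelled set until it contains `budget` labels.

    train_fn(labels)           -> model trained on the current labels
    score_fn(model, sample_id) -> informativeness (higher = more informative)
    annotate_fn(sample_id)     -> label supplied by the annotator
    """
    labels = dict(initial_labels)
    pool = [i for i in pool_ids if i not in labels]

    while len(labels) < budget and pool:
        model = train_fn(labels)  # supervised training on existing labels
        ranked = sorted(pool, key=lambda i: score_fn(model, i), reverse=True)
        chosen = ranked[:min(per_round, budget - len(labels))]
        for i in chosen:
            labels[i] = annotate_fn(i)  # request a new annotation
        chosen_set = set(chosen)
        pool = [i for i in pool if i not in chosen_set]
    return labels


# Toy usage: the "model" is a dummy and larger ids count as more informative.
result = active_label_acquisition(
    initial_labels={0: "walking"}, pool_ids=range(10), budget=5, per_round=2,
    train_fn=lambda labels: len(labels),
    score_fn=lambda model, i: i,
    annotate_fn=lambda i: "unknown",
)
```

The loop never requests more labels than the budget allows, and the final labelled set is what Step 2 would then train on.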

Fig. 2 Model framework. Step 1: Active learning based sample selection. This is an iterative process, where in each iteration, we first train a supervised deep CNN model with existing labels and then make predictions on the unlabelled candidates, based on which the most informative samples are actively selected to request new annotations, yielding an updated set of labels. Step 2: Semi-supervised model training with temporal ensembling. Both the labelled and unlabelled samples are fed into the network with dropout regularization. An integrated loss function which consists of a supervised and an unsupervised loss term is then calculated based on the predictions of the current network. Next, we update network parameters \(\Theta \) by optimizing the loss function via the stochastic gradient descent (SGD) algorithm. The updated \(\Theta \) and normalized ensemble prediction \(\tilde{z}\) act as input for the network training for the next iteration.

In this work, we exploit a backbone CNN structure as illustrated in Fig. 3, which consists of nine convolutional layers, two max-pooling layers, two dropout layers, one average pooling layer, one fully connected layer and a softmax layer connected to the output (Laine and Aila 2016). Batch normalization and the leaky rectified linear unit (LReLU) activation function (Maas et al. 2013) are sequentially applied following each of the convolutional layers. The network parameters are detailed in Fig. 3. This CNN structure is utilized in both the active learning based sample selection and the semi-supervised model training processes. Other network structures can be flexibly incorporated into the proposed framework as well; in this work, however, we only report the results with the CNN structure shown in Fig. 3.

Fig. 3 Backbone network architecture

3.2 Active learning based sample selection

Unlike supervised learning, which trains classifiers with randomly chosen and annotated samples, we use active learning to strategically select the most beneficial set of samples to be annotated. By doing so, the unnecessary annotation of samples carrying little information is avoided, which improves the labelling efficiency and thus reduces the labelling cost. Therefore, it is essential to define an effective criterion to assess the informativeness of candidate samples and then select proper candidates to be annotated.

The uncertainty-based active sampling scheme, which tends to select the samples with the highest uncertainty, is the most recognized and extensively employed sampling criterion. Entropy is a typical indicator for measuring the uncertainty of a probability distribution. Higher values of entropy imply more uncertainty in the distribution (Settles 2009). However, several works in diverse applications (Cao et al. 2020; Bi et al. 2019, 2021) have experimentally shown that, although the performance of entropy-based active learning is generally superior to passive random selection, the performance improvement of the strategy decreases when high entropies are caused by small class probabilities of nonsignificant classes. This issue becomes more severe when a large label set is present in multi-class classification tasks. Previous results (Cao et al. 2020; Bi et al. 2021) show that best-versus-second-best (BvSB)-based active selection can effectively overcome this shortcoming of the entropy-based sampling scheme by measuring the difference in class probabilities between the first and second most probable classes. For this reason, we apply BvSB as the sample selection scheme in this work. Let \(\Theta \) represent the learnable parameters of the deep model, and let \(P(y_\mathrm {B}|x_i,\Theta )\) and \(P(y_\mathrm {SB}|x_i,\Theta )\) denote the two highest estimated class probabilities of sample \(x_i\) output by the classifier. The sampling criterion can then be described as:

$$\begin{aligned} x_i^{BvSB}=\mathop {\arg \min }_{x_i, i\in \mathcal {U}}\left( P\left( y_\mathrm {B}|x_i,\Theta \right) -P\left( y_\mathrm {SB}|x_i,\Theta \right) \right) . \end{aligned}$$
(1)

With this sampling scheme, the instances close to the decision boundaries are preferentially selected. In BvSB-based active learning, we first compute the class probabilities of samples in the candidate pool \(\mathcal {U}\). Samples meeting the BvSB sampling criterion are then iteratively selected to be annotated and incorporated into the training set for subsequent classifier retraining.
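The BvSB criterion of Eq. 1 can be illustrated with a short sketch. The function names are hypothetical, and selecting the `k` samples with the smallest margins in one batch is an assumption made here to extend the single-sample arg min of Eq. 1 to batch queries.

```python
def bvsb_margin(probs):
    """Difference between the best and second-best class probabilities."""
    best, second_best = sorted(probs, reverse=True)[:2]
    return best - second_best


def select_bvsb(pool_probs, k):
    """Return the ids of the k pool samples with the smallest BvSB margin.

    pool_probs : dict mapping sample id -> list of class probabilities
    """
    ranked = sorted(pool_probs, key=lambda i: bvsb_margin(pool_probs[i]))
    return ranked[:k]


# Toy pool with three candidates and three classes (made-up probabilities).
pool = {
    0: [0.90, 0.05, 0.05],  # confident prediction, large margin
    1: [0.40, 0.35, 0.25],  # two classes almost tied, small margin
    2: [0.50, 0.10, 0.40],
}
queried = select_bvsb(pool, k=2)
```

A small margin means the classifier can barely separate its two most probable classes, so such samples lie close to a decision boundary. Unlike entropy, the margin is unaffected by how the remaining probability mass is spread over the less probable classes.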

Note that this work is primarily concerned with the combination of active learning and SSL, and with the application of the unified framework to HAR. BvSB is one promising solution for providing informative samples for the subsequent model training. However, other active learning strategies can be applied as well, provided that they are capable of selecting the most beneficial instances.

3.3 Semi-supervised CNN model with temporal ensembling

Given the updated labelled sample set \(S_s\), we next introduce our proposed semi-supervised CNN model with temporal ensembling which aims to learn a deep model making use of both labelled and unlabelled samples. To this end, we define a loss function given by:

$$\begin{aligned} \begin{aligned} Loss\left( \Theta |{\mathbf {X}},{S_s}\right) = Loss_s\left( \Theta |{\mathbf {X}},{S_s}\right) +w(t)\times Loss_u(\Theta |{\mathbf {X}}), \\ \end{aligned} \end{aligned}$$
(2)

where \(\Theta \) denotes the set of CNN parameters to be optimized, \(Loss_s(\Theta )\) stands for the supervised loss term, and \(Loss_u(\Theta )\) is the unsupervised loss term. w(t) is the unsupervised loss weighting function, which starts from zero and ramps up along a Gaussian curve during the first 80 training epochs (Laine and Aila 2016).
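A possible form of the weighting function w(t) is sketched below. The Gaussian ramp-up \(\exp [-5(1-T)^2]\), with T advancing linearly from zero to one over the ramp-up period, follows Laine and Aila (2016). The maximum weight `w_max` is a hypothetical parameter introduced for illustration, since its value is not given in this section.

```python
import math


def rampup_weight(epoch, rampup_epochs=80, w_max=1.0):
    """Unsupervised loss weight w(t) with a Gaussian ramp-up.

    The weight is zero at epoch 0, follows w_max * exp(-5 * (1 - T)^2)
    with T = epoch / rampup_epochs during the ramp-up, and stays at
    w_max afterwards.
    """
    if epoch <= 0:
        return 0.0
    phase = min(epoch / rampup_epochs, 1.0)
    return w_max * math.exp(-5.0 * (1.0 - phase) ** 2)


def total_loss(loss_s, loss_u, epoch):
    """Combine both terms as in Eq. 2: Loss = Loss_s + w(t) * Loss_u."""
    return loss_s + rampup_weight(epoch) * loss_u
```

With this schedule the unsupervised term contributes little in the early epochs, when the ensemble targets are still unreliable.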

3.3.1 Supervised loss term

The supervised loss term is designed to enforce the consistency between the CNN prediction on the labelled samples and ground truth labels, which follows the cross entropy loss form as,

$$\begin{aligned} \begin{aligned}&Loss_s(\Theta |{\mathbf {X}},{S_s})\\&\quad =-\frac{1}{{L}}\sum _{i=1}^{{L}}\sum _{j=1}^C 1\{y_i=j\} \log P(y_i=j|x_i,{\Theta }), \end{aligned} \end{aligned}$$
(3)

where \(P(y_i=j|x_i, {\Theta })\) indicates the probability of predicting \(x_i\) to have class label j.
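As an illustration of Eq. 3, the following sketch computes the mean cross entropy over the labelled samples from given class probabilities. It assumes 0-indexed class labels (whereas the text uses \(1,\ldots ,C\)), and the function name is hypothetical.

```python
import math


def supervised_loss(probs, labels):
    """Mean cross entropy over labelled samples.

    probs[i][j] : predicted probability that sample i belongs to class j
    labels[i]   : ground-truth class of sample i, in 0..C-1
    The indicator 1{y_i = j} keeps only the log-probability of the true class.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total -= math.log(p[y])
    return total / len(labels)


# Two labelled samples, two classes (made-up probabilities).
loss = supervised_loss([[0.5, 0.5], [0.25, 0.75]], [0, 1])
```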

3.3.2 Unsupervised loss term

Because an ensemble of multiple neural networks generally yields better predictions than a single network (Srivastava et al. 2014a), we adopt a temporal ensembling strategy (Laine and Aila 2016) to improve the model’s performance. In this scheme, training is performed on a single network, but the predictions are obtained from a number of pre-networks by accumulating the predictions of multiple frozen instances of the same network during training. The history-involved predictions therefore correspond to an ensemble consensus of a large number of pre-networks from different epochs. Dropout (Srivastava et al. 2014b) is used as a regularizer, which has been shown to improve generalization and to provide more certain predictions.

Based on the above analysis, we apply the temporal ensembling to the activity prediction of unlabelled samples, where the labels inferred in this way are exploited as training targets for the unlabelled instances. The unsupervised loss term measures the divergence between the current network outputs and the previous ensemble predictions, as given by,

$$\begin{aligned} Loss_u(\Theta |{\mathbf {X}})=\frac{1}{CN}\sum _{i=1}^N\Vert z_i-\tilde{z}_i\Vert ^2, \end{aligned}$$
(4)

where \(z_i\) indicates the current prediction, and \(\tilde{z}_i\) denotes the ensemble output aggregated from the deep neural networks in previous epochs. After every training epoch, the network outputs \(z_i\) are accumulated into ensemble outputs \(\mathbf {Z}_i\) by,

$$\begin{aligned} \mathbf {Z}_i \leftarrow \alpha \mathbf {Z}_i + (1-\alpha )z_i, \end{aligned}$$
(5)

where \(\alpha \) is a momentum term that controls how far the ensemble reaches to previous training epochs. To generate training targets \(\tilde{z}_i\), we perform bias correction on \(\mathbf {Z}_i\) by,

$$\begin{aligned} \tilde{z}_i = \frac{\mathbf {Z}_i}{1-\alpha ^t}, \end{aligned}$$
(6)

where t is the index of the current training epoch, and \(\mathbf {Z}\) and \(\tilde{z}\) are set to zero in the first training epoch.

Because w(t) starts from zero, the total loss in Eq. 2 is dominated by the supervised loss term at the start of training. As training progresses, the unsupervised loss term plays an increasingly important role. The optimization of the loss function in Eq. 2 is conducted using the SGD algorithm. In the tth iteration, the model parameters \(\Theta \) are updated with \(\tau \) as the learning rate,

$$\begin{aligned} \Theta _{t+1}=\Theta _t-\tau \frac{\partial Loss(\Theta |{\mathbf {X}},{S_s})}{\partial \Theta }. \end{aligned}$$
(7)
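The temporal ensembling steps of Eqs. 4 to 6 can be sketched for plain Python lists as follows. This is a simplified illustration under stated assumptions, not the training code of this work: the function names are hypothetical, and the unsupervised loss is written as a non-negative mean squared difference, as in Laine and Aila (2016), so that minimizing it pulls the current outputs towards the ensemble targets.

```python
def update_ensemble(Z, z, alpha):
    """Eq. 5: accumulate the current outputs z into the ensemble outputs Z."""
    return [alpha * Z_c + (1.0 - alpha) * z_c for Z_c, z_c in zip(Z, z)]


def bias_correct(Z, alpha, t):
    """Eq. 6: correct the start-up bias of Z after t >= 1 completed epochs."""
    return [Z_c / (1.0 - alpha ** t) for Z_c in Z]


def unsupervised_loss(outputs, targets):
    """Mean squared difference between current outputs and ensemble targets,
    averaged over N samples and C classes."""
    n, c = len(outputs), len(outputs[0])
    squared = sum((a - b) ** 2
                  for z, z_tilde in zip(outputs, targets)
                  for a, b in zip(z, z_tilde))
    return squared / (c * n)


# One sample, two classes: Z starts at zero and is updated after epoch 1.
alpha = 0.6
Z = update_ensemble([0.0, 0.0], [0.8, 0.2], alpha)
target = bias_correct(Z, alpha, t=1)
```

After the first epoch the bias-corrected target coincides with the network output of that epoch, because the division by \(1-\alpha ^t\) compensates for the zero initialization of \(\mathbf {Z}\).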

To conclude this section, Algorithm 1 and Fig. 4 illustrate the pseudo-code and flowchart of the proposed approach respectively.

figure a

4 Experimental evaluation

In this section, we evaluate and analyze the performance of the proposed method on three real HAR datasets. We first introduce the datasets and experimental settings in Sect. 4.1. Next, we conduct an ablation study in Sect. 4.2. Section 4.3 reports the comparative results with several competing approaches on benchmark object test, statistical significance test, and cross validation (CV) settings. The impact of parameters on the model performance is presented in Sect. 4.4.

Fig. 4 Flowchart of the proposed approach

4.1 Datasets and experimental setting

4.1.1 Datasets

We consider three public benchmark datasets for the evaluation of the proposed method; their key information is summarized in Table 2.

The PAMAP2 dataset (Reiss and Stricker 2012) was collected from 9 participants wearing three inertial measurement units on the hand, chest, and ankle respectively, sampled at 100 Hz. The sensors include an accelerometer, gyroscope, magnetometer, temperature sensor and heart rate sensor. There are 52 raw features in total. This dataset contains 12 activities: lying, sitting, standing, walking, running, cycling, Nordic walking, ascending stairs, descending stairs, cleaning, ironing, and rope jumping. Following Saeed et al. (2019) and Haresamudram et al. (2020), the data from participant 106 is used for testing, the data from participant 105 for validation and the rest for training.

The USCHAD dataset (Zhang and Sawchuk 2012) was recorded from 14 volunteers using a triaxial accelerometer and gyroscope attached to the participant’s front right hip, yielding 6 features in total. The sampling rate of the sensor data is 100 Hz. The dataset includes 12 activities: walking forward, walking left, walking right, walking upstairs, walking downstairs, running forward, jumping, sitting, standing, sleeping, elevator up and elevator down. Following Saeed et al. (2019) and Haresamudram et al. (2020), data from participants 1 to 10 is used for training, 11 and 12 for validation, and 13 and 14 for testing.

The UCIHAR dataset (Anguita et al. 2013) was collected with the triaxial accelerometer and gyroscope of a Samsung Galaxy SII smartphone worn by 30 volunteers on their waist, providing 6 features. The activity set includes 6 basic human activities: walking, walking upstairs, walking downstairs, sitting, standing and laying. Following Chen et al. (2020) and Khan and Ahmad (2021), the training set was created with 70% of the volunteers, whereas the remaining 30% were selected to generate the test set.

Table 2 Summary of the dataset
Table 3 Experimental configurations
Table 4 MacroF1 values with different combinations of methods on PAMAP2 dataset with different training sample numbers

4.1.2 Experimental setting

Table 3 displays the experimental configurations of the proposed method. The learning rate follows the cosine annealing function (Loshchilov and Hutter 2016). We choose the minimum value of the learning rate \(\tau \) and the weight decay by grid search on the baseline CNN with a limited training set, i.e., 200 randomly selected training samples, and the validation set. We performed experiments with \(\tau \) over {0.00001, 0.0001, 0.001, 0.01} and weight decay over {0.0001, 0.0005, 0.001, 0.005, 0.01}. The \(\tau \) and weight decay with the best performance are selected as the hyperparameters used for model training. Based on the obtained results, the weight decay and \(\tau \) are set to 0.0005 and 0.0001 respectively. We empirically set the batch size to 20 in the low annotation setting, i.e., with an annotation budget of less than 500, and to 100 otherwise. We train the network for 50 epochs. All the CNN-based methods use the same hyperparameters, as listed in Table 3.
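For illustration, the single-cycle cosine annealing schedule of Loshchilov and Hutter (2016), without warm restarts, can be sketched as follows. The starting learning rate `lr_max` used in the example is a hypothetical value, as it is not stated in this section; `lr_min` corresponds to the selected minimum \(\tau = 0.0001\), and the schedule spans the 50 training epochs.

```python
import math


def cosine_annealing_lr(epoch, total_epochs, lr_max, lr_min):
    """Learning rate at a given epoch under single-cycle cosine annealing.

    Decays from lr_max at epoch 0 to lr_min at epoch == total_epochs.
    """
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


# lr_max = 0.001 is an assumed value for illustration only.
schedule = [cosine_annealing_lr(e, 50, lr_max=0.001, lr_min=0.0001)
            for e in range(51)]
```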

We studied the impact of the dropout rate and ensembling momentum on the semi-supervised temporal ensembling; the analysis can be found in Sect. 4.4. Based on the comparative results, the dropout rate and ensembling momentum are set to 0.2 and 0.6 respectively. For active learning, we randomly select 100 samples in the first iteration for the supervised model training, and actively select 100 samples to be annotated in each active learning iteration. All the experiments were executed on a workstation with a GeForce RTX 3090 GPU and 64 GB RAM, whilst the code is implemented with the PyTorch library.

4.1.3 Evaluation metric

As is common in HAR research, we use Macro F1-Score as the evaluation metric, defined in the usual way:

$$\begin{aligned} \begin{aligned} {{Macro F1}}&=\frac{1}{C}\sum _{i=1}^C\frac{2\cdot {{Precision_i}}\cdot {Recall_i}}{Precision_i+Recall_i},\\ Precision_i&=\frac{TP_i}{TP_i+FP_i},\\ Recall_i&=\frac{TP_i}{TP_i+FN_i}, \end{aligned} \end{aligned}$$
(8)

where, for a given class i, \(TP_i\) and \(FP_i\) denote the number of true positives and false positives respectively, and \(FN_i\) represents the number of false negatives. Macro F1-Score is denoted as MacroF1 in the following sections.
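For concreteness, Eq. (8) can be computed directly from label lists as in the minimal sketch below. Treating an undefined precision, recall or F1 (zero denominator) as 0 is our assumption, since the text does not specify a convention; the function name and toy labels are illustrative only.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores, following Eq. (8)."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # Assumed convention: undefined ratios count as 0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / num_classes

# Toy example with three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, num_classes=3)
print(round(score, 4))
```

Here the per-class F1 scores are 0.5, 0.8 and 2/3, so every class contributes equally regardless of its size. The same quantity should be obtainable from `sklearn.metrics.f1_score(y_true, y_pred, average="macro")`.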

4.2 Ablation study

In this set of experiments, taking the PAMAP2 data set as an example, we conduct an ablation study to examine the contributions of the key components of the method, i.e., active learning based sample selection and semi-supervised learning using temporal ensembling, to the prediction performance. To verify their effectiveness, apart from the baseline CNN, we establish two comparison methods which successively add each of the two components. The compared methods include:

  1. CNN: This is the baseline CNN model with the architecture depicted in Fig. 3.

  2. ActCNN (short for active convolutional neural networks): the combination of CNN and active learning.

  3. ActSemiCNN (short for active semi-supervised convolutional neural networks): the combination of CNN, active learning and SSL, i.e., the proposed method.

Fig. 5 MacroF1 values as a function of the number of labels on the PAMAP2 dataset

Fig. 6 Confusion matrices. a CNN with 100 initial samples. b ActCNN with 100 initial samples and 100 actively selected samples. c ActSemiCNN with 100 initial samples and 100 actively selected samples

To analyze the sensitivity of the three compared methods to the number of labels, we performed experiments with different annotation budgets. To get more reliable results, we performed 5 independent repetitions for all three methods, and the average MacroF1 values and standard deviations are reported. Note that for the annotation budget of 100 labels, 50 samples are randomly selected in the first iteration and 50 samples are further chosen in the active learning process. Figure 5 presents the MacroF1 curves as a function of the annotation budget, and the numerical results are listed in Table 4. Figure 6 gives the confusion matrices of the inner steps in one repetition of ActSemiCNN with an annotation budget of 200. These results lead to the following observations:

  1. ActSemiCNN consistently yields the highest MacroF1 value for every number of labels compared. The comparison between ActCNN and CNN reflects the benefit brought by active learning, and the comparison between ActSemiCNN and ActCNN reflects the benefit brought by SSL. In this experiment, when only 200 samples can be annotated, ActCNN outperforms CNN by 0.05 in MacroF1, while ActSemiCNN outperforms ActCNN by 0.04.

  2. Inspecting Fig. 6a, we notice obvious confusion between walking and Nordic walking, and among the ascending stairs, descending stairs and cleaning classes. The confusions among these 5 most confused classes are greatly relieved in the results of ActCNN, as shown in Fig. 6b, which can be explained as follows. Taking the repetition in Fig. 6 as an example, our active learning policy chooses more instances (81) from the 5 classes than random selection does (44), and selects fewer samples from the classes on which the current classifier is confident (1 sample from lying and none from sitting). ActCNN surpasses CNN with random samples by a MacroF1 value of 0.05, supporting the effectiveness of active learning.

  3. Fig. 6c reflects a further improvement in MacroF1 value (0.04) compared with Fig. 6b, which indicates that utilizing unlabelled data is particularly effective in improving the recognition performance.

  4. The MacroF1 value (0.76) of ActSemiCNN with 200 labels is even higher than the one achieved by the CNN with 500 labels (0.74), showing a clear reduction of the annotation cost.

Table 5 MacroF1 values with different methods on benchmark test objects

4.3 Results and comparison

In this section, we conduct an extensive evaluation of our approach on the PAMAP2, USCHAD and UCIHAR datasets. The comparison with 7 competitors on benchmark test objects, the statistical significance test and the CV experiments are presented in Sects. 4.3.1 to 4.3.3 respectively. The compared methods include:

  1. DeepConvLSTM (Ordóñez and Roggen 2016): DeepConvLSTM is a supervised deep architecture that combines convolutional and LSTM recurrent layers to recognize activities. It has reached state-of-the-art performance in distinguishing complex human activities (Slaton et al. 2020; Mahmud et al. 2020).

  2. CoTrCNN: This is a semi-supervised method using a co-training pipeline (Stikic et al. 2008) with a CNN model (Laine and Aila 2016) as shown in Fig. 3.

  3. ConvAE (Haresamudram et al. 2019): This is a state-of-the-art deep architecture (Mahmud et al. 2020; Liu et al. 2020) which includes a convolutional encoder and decoder with a bottleneck layer in between. The feature representation from the bottleneck layer is used by an MLP for the classification.

  4. MTSelfCNN (Haresamudram et al. 2020): This is a state-of-the-art self-supervised method. The accelerometer representations are learnt by training a multi-task CNN to recognize eight transformations applied to the raw input signal. The original paper reports results on the UCIHAR dataset in a semi-supervised setting, which are used for comparison in this paper.

  5. SemiConvAttn (Chen et al. 2020): This is a semi-supervised deep model based on a co-training framework, where attention-based recurrent convolutional models are introduced to handle multi-modality data.

  6. MeanTeacher (Narasimman et al. 2021): This is a semi-supervised HAR model combining the state-of-the-art mean teacher learning scheme with a CNN.

  7. DAL (Bi et al. 2021): This is a state-of-the-art active learning-based model which selects samples via a marginal sampling scheme coupled with temporal-frequency features.

Among the above 7 approaches, we implemented DeepConvLSTM, DAL, CoTrCNN and MeanTeacher ourselves. The numerical results used for comparison with the other 3 methods are taken directly from the original papers (Haresamudram et al. 2019, 2020; Chen et al. 2020).

4.3.1 Performance on benchmark test object

The quantitative results on benchmark test objects are summarized in Table 5, wherein the MacroF1 values, used labels and time consumption are reported, with the best results highlighted in bold. The results of the 5 self-implemented methods are averaged over 5 independent repetitions, and the corresponding standard deviations are reported.

Table 6 Significance comparison on PAMAP2 dataset
Table 7 Significance comparison on USCHAD dataset
Table 8 Significance comparison on UCIHAR dataset
Table 9 MacroF1 values of CV experiments with different methods on two datasets
Table 10 MacroF1 values of the proposed method on PAMAP2 dataset with different dropout rates
Table 11 MacroF1 values of the proposed method on PAMAP2 dataset with different ensembling momentum values

Table 5 suggests that our proposed method yields the best performance among all competitors on all three datasets. With the same number of used labels, ActSemiCNN outperforms DeepConvLSTM, DAL, CoTrCNN and MeanTeacher by 0.14, 0.09, 0.10 and 0.09 on the PAMAP2 dataset, by 0.05, 0.03, 0.10 and 0.04 on the USCHAD dataset, and by 0.29, 0.16, 0.10 and 0.06 on the UCIHAR dataset respectively. ActSemiCNN even outperforms ConvAE with 20% labels and SemiConvAttn with 1000 labels, which demonstrates that our proposed method substantially reduces the annotation cost without loss of predictive performance. The superiority of ActSemiCNN over DAL again demonstrates the advantage of utilizing unlabelled data during model training.

The overall MacroF1 performance on the USCHAD dataset is lower than on the PAMAP2 dataset, because USCHAD is a more challenging dataset. Firstly, the sensor data is collected from a single motion node attached to the hip, which provides less information than the multi-position setup of the PAMAP2 dataset. Secondly, some activities involve orientation, such as elevator up or down, and are difficult to discriminate (Mahmud et al. 2020). In particular, CoTrCNN achieves the lowest MacroF1 value of 0.35, which is due to two reasons. Firstly, CoTrCNN directly assigns pseudo labels to unlabelled samples and utilizes them in the model retraining, which is unreliable when the model's performance is unsatisfactory. Secondly, CoTrCNN splits the features into two views, and a model trained on an incomplete view is limited in performance compared with one trained on the complete view.

As revealed in Table 5, ActSemiCNN and MeanTeacher consume more time than the other methods, because both engage the massive set of unlabelled samples in training throughout the whole process.

4.3.2 Statistical significance comparison

Apart from the numerical evaluation, we further assess statistical significance by carrying out a variance-based F hypothesis test. Given that the original papers of ConvAE, MTSelfCNN and SemiConvAttn only provide the mean experimental results without variance information, the significance comparison was conducted on the five self-implemented methods. It is assumed that the MacroF1 values of multiple repetitions of each compared method follow a normal distribution. Under this assumption, we calculated the p values, which represent the level of evidence against the null hypothesis that no statistical difference exists between two sets of observations. A significant difference between two groups of results is declared when the obtained p value is less than the commonly employed threshold of 0.05.
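One plausible reading of the variance-based F-test is a one-way analysis of variance applied to the repeated MacroF1 values of two methods; this interpretation is our assumption rather than a confirmed detail of the procedure. The sketch below computes the F statistic and its degrees of freedom under that reading. The two lists of scores are made-up illustrative numbers, not results from the paper.

```python
def one_way_f_statistic(*groups):
    """Return (F, df_between, df_within) for a one-way ANOVA F-test."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical MacroF1 values from 5 repetitions of two methods.
method_a = [0.74, 0.76, 0.75, 0.77, 0.73]
method_b = [0.66, 0.68, 0.67, 0.65, 0.69]
f_stat, df1, df2 = one_way_f_statistic(method_a, method_b)
print(f_stat, df1, df2)
```

The p value is then the upper tail probability of the F distribution with `(df1, df2)` degrees of freedom at `f_stat`, which could be obtained with, for example, `scipy.stats.f.sf(f_stat, df1, df2)` or directly via `scipy.stats.f_oneway`.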

Tables 6, 7 and 8 tabulate the statistical significance results on the three experimental datasets. S1–S5 in the tables denote DeepConvLSTM, DAL, CoTrCNN, MeanTeacher and ActSemiCNN respectively. Specifically, the symbols ‘0’, ‘+1’ and ‘\(-1\)’ respectively mean that the method in the column is statistically equivalent to, significantly better than and significantly worse than the method in the row. The symbol ‘-’ denotes that the method in the column is the same as the one in the row.

Inspecting the 3 tables, we can draw the following conclusions:

  1. Our proposed ActSemiCNN is statistically superior to the other competitors on all three datasets.

  2. The two semi-supervised methods which involve all the unlabelled instances during training, i.e., MeanTeacher and ActSemiCNN, consistently yield better or at least comparable results compared with the other 3 methods.

4.3.3 Performance on cross-validation experiment

To demonstrate the robustness of the proposed method to the choice of test subjects, we further conduct a leave-one-subject-out CV (LOSO-CV) experiment. During the experiment, we repeatedly hold the data from one of the subjects out of the training set and use it only for testing purposes. This is done until all the subjects have been used in the test set, and the average values of MacroF1 scores over all subjects are computed. In this experiment, we only focus on the low-annotation scenario with 200 labels available. It should be noted that the correspondence between the samples and subjects in the UCIHAR dataset is unavailable; the results reported for this dataset were therefore obtained with a 5-fold CV instead.
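The LOSO-CV protocol described above amounts to the following splitting logic, shown here as a small sketch. It assumes that a subject identifier is available for every sample window; `subject_ids` and the function name are illustrative.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx) for every subject."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx

# Toy example: six windows recorded from three subjects.
subject_ids = [1, 1, 2, 2, 2, 3]
folds = list(leave_one_subject_out(subject_ids))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Each subject is used exactly once as the test set, and the reported score is the mean of the per-fold MacroF1 values. scikit-learn offers an equivalent splitter, `sklearn.model_selection.LeaveOneGroupOut`, for readers who prefer a library routine.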

Table 9 presents the results of the CV experiments. As can be observed from the table, the MacroF1 scores of the proposed method are consistently higher than those of the competing methods. ActSemiCNN outperforms DeepConvLSTM, DAL, CoTrCNN and MeanTeacher by 0.12, 0.13, 0.08 and 0.10 on the PAMAP2 dataset, by 0.13, 0.08, 0.15 and 0.03 on the USCHAD dataset, and by 0.27, 0.14, 0.15 and 0.03 on the UCIHAR dataset respectively, which suggests that ActSemiCNN is robust to inter-subject variability.

4.4 The impact of parameters

This section investigates the impact of the key parameters of ActSemiCNN, namely the dropout rate and the ensembling momentum, on the recognition performance. Taking the PAMAP2 dataset as an example, we conducted comparative experiments on these 2 parameters with varying values; the results are tabulated in Tables 10 and 11.

The dropout rate specifies the proportion of nodes randomly dropped out during training. Validation experiments were performed with dropout rates in the range [0, 0.5] with a step of 0.1, where the value 0 means that we do not drop any nodes in the network. Table 10 shows that the highest MacroF1 score is achieved with a dropout rate of 0.2. Based on these results, we set the dropout rate to 0.2 in all other experiments.

Ensembling momentum controls how far the aggregation reaches back into the training history, and takes values in [0, 1). A value approaching 1 means that the current loss is dominated by the history predictions, while a value of 0 means that no history information is included in the current loss. We conducted experiments with ensembling momentum in the range [0.4, 0.8] with a step of 0.1. Table 11 shows that the best result is obtained with a value of 0.6, which is employed as the ensembling momentum in all experiments.
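To make the role of the momentum concrete, the sketch below follows the temporal ensembling update of Laine and Aila (2016): an exponential moving average of per-sample predictions, with a start-up bias correction. Whether ActSemiCNN applies exactly this bias correction is our assumption, and the prediction vectors are toy values.

```python
def update_ensemble(ensemble, current, momentum, epoch):
    """One temporal-ensembling step for a single sample (epoch is 1-based).

    Returns the updated running ensemble and the bias-corrected target that
    a consistency loss would compare against the current prediction.
    """
    new_ensemble = [momentum * z_old + (1.0 - momentum) * z_new
                    for z_old, z_new in zip(ensemble, current)]
    # Compensate for the zero initialisation of the running ensemble.
    correction = 1.0 - momentum ** epoch
    target = [z / correction for z in new_ensemble]
    return new_ensemble, target

# Toy class-probability predictions for one sample over two epochs.
ensemble = [0.0, 0.0, 0.0]
ensemble, target_1 = update_ensemble(ensemble, [0.2, 0.7, 0.1], 0.6, epoch=1)
ensemble, target_2 = update_ensemble(ensemble, [0.4, 0.5, 0.1], 0.6, epoch=2)
print(target_1, target_2)
```

With a momentum of 0 the target equals the current prediction, so no history is used; as the momentum approaches 1 the target changes ever more slowly and is dominated by earlier epochs.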

5 Discussions

The limited availability of annotations is arguably the most critical barrier that hinders the development of HAR. It is therefore crucial to develop approaches which can achieve favorable activity recognition performance while easing the burden of annotation. This is precisely the motivation of this work. The novelties of this paper mainly lie in two aspects: (i) the initial integration of active learning and semi-supervised learning into one HAR framework; (ii) the exploitation of temporal ensembling with consistency regularization in the deep CNN optimization. In what follows, we will: (1) offer explanations of ActSemiCNN, (2) provide insights on potential future research, (3) analyze the limitations of the proposed model.

5.1 Explanations

The comparative experiments in Sect. 4 suggest that the proposed model considerably boosts the recognition performance in the low-annotation regime, and exhibits strong robustness and generalization ability. The superiority of ActSemiCNN is attributed to the effectiveness of its two components. (1) The applied active learning selects the samples which are most difficult to classify, so that including the annotations of these samples in training effectively helps the classifier refine the decision boundaries. (2) Leveraging the unlabelled data during training gives rise to more accurate results, and the consensus ensemble prediction further alleviates the prediction uncertainty.

5.2 Insights

From this analysis, we conclude that the integration of diverse weak supervision paradigms yields additive benefits for activity prediction, which offers insights for potential future research. (1) Besides the label scarcity issue, the HAR task is confronted with another challenge, i.e., individual diversity. The hybrid of active learning and transfer learning is believed to be a promising solution for transferable personalized activity analysis with sparse annotations. (2) Activity data are usually collected with multiple sensors, which brings challenges for data fusion. The integration of self-supervised learning and semi-supervised learning is a compelling avenue to address this issue, where the former enables mutual and complementary interaction between data of different views via a proxy task, and the latter is capable of inferring useful information from the vast amount of unlabelled data.

5.3 Limitations

Although this study reveals prominent improvements, ActSemiCNN still suffers from two limitations. (1) Involving the unlabelled data throughout the whole training process inevitably increases the time consumption. More computational resources are needed if there are requirements on processing timeliness. (2) Although active learning greatly reduces the number of samples to be annotated, it places higher demands on the labelling quality. This is because, in the active learning paradigm, the queries are usually raised on the samples that are most difficult to classify, which requires the experts to be able to assign the correct label out of two or several options. The oracles therefore need to be carefully selected to guarantee that active learning proceeds smoothly and effectively.

6 Conclusion

Recognizing human activities from wearable sensors has been a challenging task, especially when annotations are scarce. The prime purpose of this paper is to explore the effect of (1) the combination of different paradigms of weakly supervised learning, and (2) the ensemble consensus, on the accuracy and robustness of human activity recognition. To this end, we presented a novel activity recognition approach called ActSemiCNN, which integrates active learning and semi-supervised learning with temporal ensembling into one framework. We draw the following conclusions from extensive comparative experiments: (1) The integration of active learning and semi-supervised learning leads to state-of-the-art performance, and the annotation cost is greatly reduced, which is attributed to the active sample selection and the utilization of massive unlabelled data. (2) The statistical significance and cross-validation tests highlight the effectiveness of ensemble consensus in enhancing the robustness of HAR models.

In future work we plan to investigate few-shot learning for adaptive and personalized human activity recognition. We are also interested in exploring the interplay between different views of multi-modality activity data.