Introduction

Human activity recognition (HAR) is a fundamental technology in medical services and healthcare that extracts information from time series data collected by wearable sensors to predict human activities [1]. Owing to its real-time operation and portability, HAR is widely used in rehabilitation monitoring [2], geriatric monitoring [3], and other fields [4]. As technology and intelligent algorithms have matured, traditional HAR no longer satisfies users' needs; instead, personalized services that can accurately recognize the activities of new individuals have become the focus of current research.

In recent years, deep learning based HAR algorithms have made remarkable progress [5, 6]. Specifically, their powerful representation capability removes the dependence on manual feature engineering: they capture correlations within time series data and automatically extract features for better classification. However, traditional deep learning relies on a large amount of high-quality labeled training data to obtain a robust model [7, 8], and deviations in how individuals perform activities lead to discrepancies in data distribution, which makes deep learning models perform poorly on new individuals, as shown in Fig. 1. Therefore, effectively eliminating the discrepancy among individuals is the key to enhancing the generalization ability of personalized HAR services.

Fig. 1
figure 1

Overview of HAR processing. Data and mean values of two individuals collected by the z-axis acceleration sensor on our private data set: a left foot of subject 1; b right foot of subject 8

Transfer learning breaks the assumption in machine learning that training and testing data must follow the same distribution [9]. It improves the generalization ability of a model by reducing distribution discrepancies across domains [10, 11], and has been widely applied in computer vision [7, 12], medical decision-making [13,14,15,16], natural language processing (NLP) [17,18,19], and other fields [20,21,22,23]. By utilizing massive, similar, high-quality activity time series data captured from other individuals, transfer learning can effectively improve a model's generalization to new individuals, thus realizing personalized HAR services. Following the terminology of transfer learning, we denote the labeled activity data as the source domain and the unlabeled activity data collected from a new individual as the target domain [24]. In addition, because labeled data often come from multiple independent individuals in practice, we conduct multi-source transfer learning [25, 26] to separately eliminate the differences in activity data among individuals and achieve more precise HAR on the target task.

On the other hand, the complexity of data distributions in real scenarios constrains the effectiveness of transfer learning models. Ensemble learning improves classification performance by constructing and combining multiple weak classifiers, and has proved to be an efficient strategy for multi-source transfer learning [27, 28]. We exploit the advantages of ensemble learning for two main purposes: (1) adopting a co-training strategy to select high-confidence target samples for training, and (2) combining the transfer results from different source domains to reduce sensitivity to distribution variance and achieve better average performance. Our previous studies have systematically illustrated the advantages of ensemble learning for transfer learning [29, 30].

In this paper, we fully consider the distribution variance among individuals and propose an end-to-end model, the multi-source unsupervised co-transfer network (MUCT), to establish a personalized and precise classification model for HAR. We use feature extractors to automatically extract features from time series data and use domain adaptation to align the distributions across domains. Furthermore, we develop a consistency filter consisting of heterogeneous classifiers that assign a pseudo-label to a target sample only when their predictions agree. Based on this co-training strategy, high-confidence target samples are selected to augment the training set and iteratively boost the effectiveness of the classifiers. Finally, we aggregate the classification results from the multiple source domains to acquire the final result. The main contributions of this model can be summarized as follows:

  • We propose an unsupervised multi-source transfer learning network that provides a feasible end-to-end way to realize personalized HAR service;

  • An adaptive feature extractor is presented to automatically extract features from time series data and align the distributions among domains to improve the transferability of the model;

  • MUCT can iteratively enhance the robustness of the model by training with high-confidence target samples selected by a consistency filter;

  • Experimental results on benchmark data sets and a real-world data set (collected by our signal sensors) demonstrate the superior performance of our model over traditional unsupervised domain adaptation methods.

The remainder of this paper is structured as follows. “Related work” section presents related work. “Methods” section describes the workflow of MUCT. “Experiment” section reports experimental results on benchmark data sets and a real-world data set. In “Discussion” section, we analyze the results and discuss future work. The last section concludes the paper.

Related work

We discuss work on human activity recognition via traditional machine learning methods and multi-source transfer learning.

Human activity recognition

In recent years, the spotlight has been on HAR [31], which learns high-level information about human activities from raw sensor input [32]. Some recent surveys introduce traditional machine learning models for HAR [33,34,35,36,37]. Bayat et al. [38] proposed a digital low-pass filter that extracts the gravity component from body acceleration to improve classifier performance. Hossain et al. [39] combined K-means with active learning to reduce the reliance on training data labels in HAR. Some recent methods have bridged deep learning and HAR, utilizing deep learning to extract features from raw time series data. Zeng et al. [40] proposed a deep model based on convolutional neural networks (CNNs) with a partial weight-sharing technique, which captures the local dependency and scale invariance of a signal; the partial weight-sharing technique can be applied to sensor signals to further improve performance. Lee et al. [41] proposed a one-dimensional CNN-based method that transforms x, y, and z acceleration data from a smartphone triaxial accelerometer into vector magnitudes.

Many traditional HAR methods require manual feature design, which is time consuming and labor intensive. Although deep learning methods such as CNNs have gradually become mainstream in HAR, they fail to account for individual variance, which causes poor generalization of classifiers across individuals.

Multi-source domain adaptation

Domain adaptation is a significant branch of transfer learning that aligns a labeled source domain and an unlabeled target domain in a specific feature space. In real-world applications, data are often drawn from multiple sources, so multi-source domain adaptation is an appropriate way to gain knowledge from multiple perspectives and enhance classification on the target domain [42]. Zhu et al. [43] used a Maximum Mean Discrepancy (MMD) loss to align domain-specific distributions and a disagreement classifier (DISC) loss to align domain-specific classifiers, enabling the classifiers to achieve satisfactory results in the target domain. Fang et al. [27] presented a method that maps target domain samples into multi-label samples, comprehensively considering the correlations between labels; the model creates a shared label subspace across the source domains and applies the obtained knowledge to the target domain classifier.

Recently, domain adaptation methods have been applied to HAR. Wang et al. [44] measured the distance between the activity data distributions of multiple persons to find the best source domain for a transfer task, used CNN and long short-term memory (LSTM) layers to extract spatial and time series features, and used an MMD loss to align them as closely as possible. Zhao et al. [45] proposed a transfer learning-embedded decision tree that integrates decision trees with the K-means algorithm for personalized HAR. However, these domain adaptation methods cannot handle multiple source domains, and without a suitable ensemble paradigm their generalization ability is poor in practical applications, especially on unbalanced data sets.

We propose the MUCT model, which leverages knowledge gained from multiple individuals with annotated information (source domains) to precisely recognize the activities of a specific person (target domain). By adopting a boosting strategy, it improves generalization on both benchmark and real-world data sets.

Methods

We show how our model addresses individual variance and enhances generalization in HAR. Our model consists of multi-source feature alignment and a consistency filter, as shown in Fig. 2. Multi-source feature alignment extracts discriminative and domain-invariant features for the source and target domains. Heterogeneous classifiers within the consistency filter provide predictive labels for samples from different views. The consistency filter selects high-confidence samples with pseudo-labels to enrich the diversity of training samples and promote the learning of target information. The pseudocode is shown in Algorithm 1.

Fig. 2
figure 2

Overview of MUCT model. The feature extractor builds a common feature space for each pair of source and target domains and extracts domain-invariant feature representations to alleviate the distribution variance among domains. High-confidence data in the target domain are filtered according to the classification results of multiple classifiers and added, with their pseudo-labels, to the training set. The average probability over the outputs of all classifiers gives the final classification result

Problem statement

We propose a personalized HAR service solution. We denote the K labeled source domains as \(\mathcal {D}_{s_{k}}=\left\{ \left( x^{s_{k}}_{i}, y^{s_{k}}_{i}\right) \right\} _{i=1}^{n_{s_{k}}}\), where \(n_{s_k}\) is the number of instances in the kth source domain. \(\mathcal {D}_{t}=\left\{ x^{t}_{i}\right\} _{i=1}^{n_{t}}\) is the target domain, with \(n_{t}\) unlabeled instances. \(\mathcal {D}_{s_k}\) and \(\mathcal {D}_t\) are sampled from data distributions p and q, respectively, with \(p \ne q\). MUCT trains a network \(\textbf{y} = H(\textbf{x})\) that effectively reduces the distribution variance between the source and target domains, so that the target risk \(R_{t}(H)=\mathbb {E}_{(\textbf{x}, \textbf{y}) \sim q}[H(\textbf{x}) \ne \textbf{y}]\) can be bounded by leveraging the source domains.

Algorithm 1
figure a

MUCT

Multi-source feature alignment

In personalized HAR, the distribution variance between domains reduces the generalizability of the model on the target domain: the feature extractor cannot extract domain-invariant features, so the classifier is unable to classify the target domain data effectively. Multi-source feature alignment reduces this variance and encourages the feature extractor to extract domain-invariant features.

To extract domain-invariant features, we introduce MMD to measure the variance between domains. MMD embeds the sample feature distributions in a Reproducing Kernel Hilbert Space (RKHS) and computes the distance between the mean embeddings of the different domains [46]. The MMD distance is defined as

$$\begin{aligned} d_{mmd}(p, q) \triangleq \left\| \textbf{E}_{x^{s} \sim p}\left[ \phi \left( x^{s}\right) \right] -\textbf{E}_{x^{t} \sim q}\left[ \phi \left( x^{t}\right) \right] \right\| _{\mathcal {H}}^{2}, \end{aligned}$$
(1)

where \(\mathcal {H}\) is an RKHS endowed with a characteristic kernel \(k\left( X_{s}, X_{t}\right) =\langle \phi \left( X_{s}\right) , \phi \left( X_{t}\right) \rangle \), \(\langle \cdot ,\cdot \rangle \) denotes the inner product of vectors, \(\phi (\cdot )\) maps a feature distribution into the RKHS, and p and q are the distributions of \(x^s\) and \(x^t\), respectively. The empirical estimate of \(d_{mmd}(p, q)\) can be written as

$$\begin{aligned} \hat{d}_{mmd}(p, q) &= \left\| \frac{1}{n_{s}} \sum _{x_{i} \in \mathcal {D}_{s}} \phi \left( x_{i}\right) -\frac{1}{n_{t}} \sum _{x_{j} \in \mathcal {D}_{t}} \phi \left( x_{j}\right) \right\| _{\mathcal {H}}^{2} \\ &= \frac{1}{n_{s}^{2}} \sum _{i=1}^{n_{s}} \sum _{j=1}^{n_{s}} k\left( x_{i}^{s}, x_{j}^{s}\right) +\frac{1}{n_{t}^{2}} \sum _{i=1}^{n_{t}} \sum _{j=1}^{n_{t}} k\left( x_{i}^{t}, x_{j}^{t}\right) \\ &\quad -\frac{2}{n_{s} n_{t}} \sum _{i=1}^{n_{s}} \sum _{j=1}^{n_{t}} k\left( x_{i}^{s}, x_{j}^{t}\right) . \end{aligned}$$
(2)

In multi-source transfer learning, it is challenging to construct a suitable common latent feature space for all domains because the data distributions of the domains differ. When the number of domains is large, a single common latent feature space must accommodate all domains and therefore reduces the discriminability of the features [43]. To enhance the representation capability of the feature extractor, we propose K subnetworks \(\left\{ f_k(\cdot )\right\} _{k=1}^K\) as feature extractors, constructing a common latent feature space for each pair of source and target domains. The source domain feature extraction process can be expressed as \(f_{s_{k}}=f_{k}\left( x^{s_{k}}_{i}\right) \). Similarly, each feature extractor is applied to the target domain data; the target features extracted by the kth feature extractor can be expressed as \(f_{t_{k}}=f_{k}\left( x^{t}_{j}\right) \). We use MMD to measure the variance between the distributions of the source and target domains. The MMD loss is

$$\begin{aligned} \mathcal {L}_{mmd} = \sum _{k=1}^{K}\hat{d}_{mmd}\left( f_k\left( x^{s_k}\right) , f_k\left( x^{t}\right) \right) . \end{aligned}$$
(3)
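For illustration, the sketch below shows one way the empirical MMD of Eq. (2) and the alignment loss of Eq. (3) could be computed in PyTorch. It is a minimal sketch, not our exact implementation; in particular, the Gaussian (RBF) kernel and its bandwidth are assumptions, since the kernel choice is not specified here.

```python
import torch

def rbf_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)).
    # The bandwidth sigma is an assumption, not a value from the paper.
    dist_sq = torch.cdist(x, y, p=2) ** 2
    return torch.exp(-dist_sq / (2 * sigma ** 2))

def mmd_loss(f_s, f_t, sigma=1.0):
    # Empirical MMD^2 between source features f_s (n_s x d) and
    # target features f_t (n_t x d), following Eq. (2).
    k_ss = rbf_kernel(f_s, f_s, sigma).mean()   # (1 / n_s^2) * sum k(x_s, x_s)
    k_tt = rbf_kernel(f_t, f_t, sigma).mean()   # (1 / n_t^2) * sum k(x_t, x_t)
    k_st = rbf_kernel(f_s, f_t, sigma).mean()   # (1 / (n_s n_t)) * sum k(x_s, x_t)
    return k_ss + k_tt - 2 * k_st

# The total alignment loss of Eq. (3) is the sum of mmd_loss(f_k(x_sk), f_k(x_t))
# over the K source-target pairs.
```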

Consistency filter

We propose a consistency filter to enrich the diversity of training samples and promote the learning of target information by assigning pseudo-labels to target domain samples. The filter collects the predictions of all classifiers for each target domain sample; when all classifiers predict the same result, that prediction becomes the sample's pseudo-label, and the sample is eventually added to the training set of each classifier. As the iterations proceed, the number of target domain samples in the training set grows, so the feature extractor increasingly focuses on extracting discriminative features that are specific to the target domain rather than domain-invariant. In other words, the consistency filter helps extract latent features of the target domain and ultimately improves the classifiers' performance. We use multiple classifiers in the final recognition task: for each source domain, two heterogeneous classifiers, \(C_{k1}(\cdot )\) and \(C_{k2}(\cdot )\), perform the classification from different perspectives, receiving the domain-invariant features produced by the feature extractor \(f_k(\cdot )\) of the kth source domain. For each target instance we compute a filter label \(S_j\); \(S_j = 0\) indicates that the instance is highly trusted data, which helps improve the classifiers' performance, and otherwise the instance is treated as untrusted data. \(S_j\) is computed as

$$\begin{aligned} S_{j}= \sum _{k=1}^{K} EQ\left( C_{k_{1}}\left( f_k\left( x^{t}_{j}\right) \right) , C_{k_{2}}\left( f_k\left( x^{t}_{j}\right) \right) \right) . \end{aligned}$$
(4)

\(EQ(\cdot ,\cdot )\) judges whether the predictions of the two classifiers are consistent, outputting 0 when they are and 1 otherwise. After selection, we add each high-confidence sample and its pseudo-label \(\left( x^t_j, C_{1_{1}}\left( f_1\left( x^{t}_{j}\right) \right) \right) \) to the confidence data set \(\mathcal {D}_t^{*} = \left\{ \left( x^{*}_l, y^{*}_l\right) \right\} ^{n_*}_{l=1}\), which is then added to the training set of each classifier, where \(n_*\) is the number of samples in the confidence data set. To prevent the classifiers from failing to converge, we run the consistency filter only once every \(\beta \) iterations, where \(\beta \) is a hyperparameter.
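The selection step can be sketched as follows. This is an illustrative PyTorch fragment rather than the exact implementation: it assumes each classifier outputs class logits and that predictions are compared via their argmax.

```python
import torch

@torch.no_grad()
def consistency_filter(feature_extractors, classifiers, x_t):
    # feature_extractors: list of K subnetworks f_k.
    # classifiers: list of K pairs (C_k1, C_k2) of heterogeneous classifiers.
    # x_t: a batch of unlabeled target samples.
    # Returns the target samples on which all 2K predictions agree (S_j = 0),
    # together with their pseudo-labels.
    preds = []
    for f_k, (c_k1, c_k2) in zip(feature_extractors, classifiers):
        feat = f_k(x_t)
        preds.append(c_k1(feat).argmax(dim=1))
        preds.append(c_k2(feat).argmax(dim=1))
    preds = torch.stack(preds)                  # shape (2K, batch)
    agree = (preds == preds[0]).all(dim=0)      # True where every classifier agrees
    return x_t[agree], preds[0][agree]          # high-confidence samples + pseudo-labels
```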

In HAR, a single classifier cannot accurately classify the data from multiple views because of their spatial and time series characteristics, such as frequency and sequence. The heterogeneous classifiers are not only trained on data from different domains but also combine the advantages of different classification algorithms, so the filtered target domain data have a high confidence level. Therefore, for each source domain we use two heterogeneous classifiers that classify the samples from different perspectives and obtain reliable predictive labels. We use cross-entropy to optimize the heterogeneous classifiers. The cross-entropy loss measures the difference between two probability distributions of the same random variable; here, it measures the difference between the true label distribution and the classifier's prediction, and the smaller the cross-entropy, the better the prediction. We take the cross-entropy loss as the classification loss of each classifier:

$$\begin{aligned} \mathcal {L}_{cls} &= \sum _{k=1}^{K} \sum _{z=1}^{2} \mathcal {L}_{ce}\left( C_{k_{z}}\left( f_{k}\left( x^{s_{k}}\right) \right) , y^{s_{k}}\right) \\ &\quad + \sum _{k=1}^{K} \sum _{z=1}^{2} \mathcal {L}_{ce}\left( C_{k_{z}}\left( f_{k}\left( x^{*}\right) \right) , y^{*}\right) , \end{aligned}$$
(5)

where \(\mathcal {L}_{ce}\) is cross-entropy loss, K is the number of source domains, and \(C_{k_z}\) is the zth classifier in the kth domain.

Multi-source unsupervised co-transfer learning network

Designing features by hand is difficult in human activity recognition, and the connections among the multiple features of different activities are challenging to capture. To this end, we propose MUCT. Our model comprises K feature extractors, 2K classifiers, and a consistency filter. Its total loss consists of a domain loss and a classification loss: the domain loss is measured by MMD on the domain-invariant features produced by the feature extractors, and the classification loss is the cross-entropy that drives the classifiers' (and thus the network's) performance. The consistency filter obtains high-confidence target domain data, which are added with their pseudo-labels to the training set; because target domain data participate in training, the classifiers perform better on the target domain. The total loss of our model is

$$\begin{aligned} \mathcal {L}_{muct} = \mathcal {L}_{cls} + \mathcal {L}_{mmd}. \end{aligned}$$
(6)

In MUCT, features are first extracted by the feature extractors, and MMD drives the feature distributions of each pair of source and target domains to be as similar as possible. After this stage, we can approximately consider the source and target domain distributions to be aligned, which provides the basis for the second stage, in which MUCT improves the classifiers' performance on the target domain through the consistency filter. The final label of a target sample is determined by averaging: each classifier predicts the class probabilities of the sample, the probabilities are summed and averaged, and the category with the highest average probability is taken as the label. This process can be expressed as

$$\begin{aligned} H(x^t_i) = \frac{1}{2K} \sum _{k=1}^{K}\sum _{z=1}^{2}C_{k_{z}}\left( f_{k}\left( x^t_i\right) \right) . \end{aligned}$$
(7)
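A minimal sketch of this averaging step in PyTorch is shown below; the function name and arguments are illustrative and assume that each classifier returns logits that are converted to probabilities with softmax.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict(feature_extractors, classifiers, x_t):
    # Average the class probabilities of all 2K classifiers, as in Eq. (7),
    # and take the most probable class as the final label.
    probs = 0.0
    for f_k, (c_k1, c_k2) in zip(feature_extractors, classifiers):
        feat = f_k(x_t)
        probs = probs + F.softmax(c_k1(feat), dim=1) + F.softmax(c_k2(feat), dim=1)
    probs = probs / (2 * len(feature_extractors))
    return probs.argmax(dim=1)
```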

Experiment

We conducted a series of evaluations and experiments to compare the effectiveness and performance of our method with other state-of-the-art (SOTA) domain adaptation methods on two benchmark HAR data sets and a real-world data set.

Data sets and setup

We compared the effectiveness and performance of MUCT with ResNet [47], JDA [48], TCA [49], DAN [50], DAAN [51], and MFSAN [43], which include traditional deep learning, single-source domain adaptation, and multi-source domain adaptation methods.

We performed three standard comparisons: (1) Source combine unites the different source domains into one, ignoring the distribution variance among them; (2) Single best reports the best result among the single-source domain adaptation tasks from each source to the target, to compare the upper bound of single-source adaptation with multi-source adaptation; (3) Multi-source compares the classification results of multi-source domain adaptation methods. To further validate the effectiveness of every component of our model, we carried out ablation experiments on several MUCT variants: (1) \(MUCT_{SV}\), with a single classifier; (2) \(MUCT_{NC}\), without a consistency filter; (3) MUCT, the whole pipeline, combining multiple classifiers and a consistency filter. Considering computation and time constraints, we followed the experimental setup of FedHealth [52] and randomly selected four subjects from each data set to evaluate the effectiveness and performance of our model.

Experimental data sets included two HAR benchmark data sets, the UCI daily and sports data set (DSADS) [53] and the WISDM Lab human activity recognition data set (WISDM) [54], and a real-world data set collected by our signal sensors (real data set). Table 1 describes these data sets.

The DSADS data set was collected from eight subjects (four males and four females) who wore sensors at five positions while completing 19 movements; each subject performed every movement for five minutes, and the sensors sampled data at 25 Hz. The five minutes of data collected for each movement were divided into 60 segments. Sensors were worn on the subject's torso (T), right arm (RA), left arm (LA), right leg (RL), and left leg (LL). We randomly selected four individuals from the DSADS data set and denoted them \(P_1\)–\(P_4\). We built four transfer tasks using them as target domains: \(P_1,P_2,P_3 \rightarrow P_4\); \(P_1,P_2,P_4 \rightarrow P_3\); \(P_1,P_3,P_4 \rightarrow P_2\); and \(P_2,P_3,P_4 \rightarrow P_1\).

WISDM includes six activities collected from 36 subjects. Cell phone accelerometers recorded six types of human activity data: Walking, Jogging, Upstairs, Downstairs, Sitting, and Standing. We randomly selected four individuals from this data set, labeled them \(P_1\)–\(P_4\), and built four transfer tasks with them as target domains: \(P_1,P_2,P_3 \rightarrow P_4\); \(P_1,P_2,P_4 \rightarrow P_3\); \(P_1,P_3,P_4 \rightarrow P_2\); and \(P_2,P_3,P_4 \rightarrow P_1\).

The real data set includes 13 activities collected from four subjects. We collected the activity data from four volunteers using wearable sensors in Kunming, Yunnan Province, China, and labeled the volunteers \(P_1\)–\(P_4\). Sensor units were placed on the waist (W), right arm (RA), left arm (LA), right leg (RL), and left leg (LL). The units sampled activity data 20 times per second, and the recordings were grouped into 5-second segments after collection.
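As an illustration of this preprocessing, the sketch below splits a raw multi-channel sensor stream into non-overlapping 5-second windows (100 samples at 20 Hz). It is only a minimal example under these assumptions; the exact segmentation and any window overlap used in our pipeline may differ.

```python
import numpy as np

def segment(signal, fs=20, window_sec=5):
    # signal: (n_samples, n_channels) raw sensor stream.
    # Split into non-overlapping windows of window_sec seconds,
    # e.g. 5 s at 20 Hz -> 100 samples per window.
    win = fs * window_sec
    n_windows = len(signal) // win
    return signal[:n_windows * win].reshape(n_windows, win, signal.shape[1])
```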

Table 1 Benchmark and real HAR data sets

We compared our MUCT model with the SOTA algorithms listed above on human activity recognition problems. Our model was implemented in the PyTorch framework. We set the initial learning rate to 0.2 and decreased it over iterations, and we used the SGD optimizer with momentum 0.9. ResNet-18 was used to extract features, and a softmax classifier and a DNN were used as the two heterogeneous classifiers. For ResNet, DAN, DAAN, and MFSAN, we used ResNet-18 as the backbone network with a learning rate of 0.2 and the SGD optimizer.
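This configuration can be sketched in PyTorch as follows. It is a simplified illustration of the setup described above: the 256-dimensional feature size, the number of classes, and the step-decay schedule are assumptions, and only the softmax classifier branch is shown.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Feature extractor and one classifier head for a single source domain.
feature_extractor = resnet18(num_classes=256)   # ResNet-18 backbone used as feature extractor
classifier = nn.Linear(256, 19)                 # softmax classifier head (19 classes assumed)

params = list(feature_extractor.parameters()) + list(classifier.parameters())
optimizer = torch.optim.SGD(params, lr=0.2, momentum=0.9)
# The learning rate decreases with iterations; StepLR is one possible schedule.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
```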

Effectiveness of MUCT

Performance on DSADS. We compared our model with the others on DSADS and report results averaged over five trials. Table 2 shows that the algorithms performed better under Source Combine than under Single Best, which does not square with our previous experience. The reason is that the differences in how subjects perform activities are small in DSADS, which leads to only slight variance in the data distributions, so combining sources does not degrade the model; on the contrary, the additional data increase its generalization performance. The experiments show that, in this case, good classification results can be obtained even without transfer learning. For these reasons, the performance of the multi-source model MFSAN did not exceed that of Source Combine.

Although there is little difference in data distribution, the data still have complex temporal and spatial characteristics. Our algorithm performs better because the multiple heterogeneous classifiers and the consistency filter address these aspects. Due to the small differences in data distribution, the consistency filter can screen out more high-confidence target samples to participate in classifier training, and the heterogeneous classifiers can capture more features of the data for classification. In addition, the consistency filter adds target domain data to the training set so that the classifiers can adapt to the target distribution. In the ablation experiments, the accuracy of \(MUCT_{NC}\) (without the consistency filter) and \(MUCT_{SV}\) (with a single classifier) both decreased, which confirms this analysis. Our model achieved the best results in all four groups of experiments on DSADS.

Table 2 Comparison of classification accuracy (%) on DSADS

Performance on WISDM. We compared the results of our model with the others on the WISDM data set. WISDM was collected on smartphones with a single sensor, so little information is available. Table 3 compares classification accuracy on WISDM. Performance under Single Best was clearly better than under Source Combine. DAAN achieved better accuracy because WISDM has few category labels, so it can align label-conditioned subdomains more precisely and thus better extract common features associated with the labels. The multi-source model MFSAN outperformed the Source Combine models because it can capture information from multiple source domains.

As shown in Table 3, our model can improve the performance of the HAR task more effectively with multiple classifiers and a consistency filter. In WISDM, MUCT achieves the best results in terms of average accuracy.

Table 3 Comparison of classification accuracy (%) on WISDM
Fig. 3
figure 3

ACC curve on real data set

Fig. 4
figure 4

Box plots on real data set

Fig. 5
figure 5

Confusion matrix on real data set

Performance on real data set. We evaluated our model's performance on a real application data set, with results shown in Table 4. Performance under Single Best was better than under Source Combine, which is consistent with our previous experience. The multi-source model MFSAN outperformed both Single Best and Source Combine because multi-source models can obtain more source domain information and improve classifier performance by adapting to the target data distribution. Our model achieved the best performance overall. To further verify its performance, we ran the task \(P_2,P_3,P_4 \rightarrow P_1\) on the real data set and plotted the accuracy (ACC) curves of four models in Fig. 3, where the horizontal axis is the number of training iterations and the vertical axis is the model's accuracy. From the ACC curves we can observe each model's convergence rate and stability: although DAAN's curve rises more sharply, our model reaches higher accuracy and is more stable after convergence. To verify the reliability of our model, we repeated the experiment 5, 10, 15, and 20 times and drew box plots, as shown in Fig. 4. The four groups of experiments have similar mean, maximum, and minimum values, which shows that our model performs reliably in HAR service.

Table 4 Comparison of classification accuracy (%) on real data set

Evaluation of MUCT on unbalanced data set

We tested the performance of our model on unbalanced data through experiments on the real data set, whose data and label distribution are shown in Table 5. Activity 10 accounts for 20.73% of the data, whereas activity 8 accounts for only 4.69%. This imbalance exists not only among activities but also among domains: for activity 8, there are only 15 samples from \(P_1\), while there are 33, 61, and 46 samples from \(P_2\), \(P_3\), and \(P_4\), respectively. This imbalance is a challenge for the HAR task.

Table 5 Data distribution of real data set

To confirm that our model handles unbalanced data, we computed the confusion matrix on the real data set with source domains \(P_2, P_3, P_4\) and target domain \(P_1\). Figure 5 shows the confusion matrices of DAN, DAAN, MFSAN, and our model, where each row represents the true label of a category of samples and each column represents the predicted label. The diagonal of the confusion matrix corresponds to correct classifications, so the matrices let us observe the models' behavior on the unbalanced data set. MFSAN is superior to DAN and DAAN on unbalanced data; DAN tends to classify data as label 1, while DAAN is prone to labels 5 and 12. Our model weights the data, which enables it to perform better on unbalanced data sets, with fewer misclassifications and higher accuracy.

Fig. 6
figure 6

ROC curve on real data set

To further analyze the performance of our model on unbalanced data sets, we ran the task with source domains \(P_2, P_3, P_4\) and target domain \(P_1\) on the real data set and drew the Receiver Operating Characteristic (ROC) curve, comparing our model with MFSAN, DAN, and DAAN. Figure 6 shows the ROC curves, which reflect the relationship between a learner's true-positive rate and false-positive rate. The Area Under the Curve (AUC) of MFSAN is significantly larger than those of DAN and DAAN; thanks to richer samples, the multi-source domain models perform better on unbalanced data sets. The AUC of our model is larger than those of the other models, which confirms its stronger robustness on unbalanced data sets.
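For reference, a micro-averaged multi-class ROC curve and its AUC can be computed from the predicted class probabilities as in the sketch below (using scikit-learn). The exact averaging used for Fig. 6 is not detailed here, so this is only one common choice; the function name is illustrative.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from sklearn.preprocessing import label_binarize

def micro_roc(y_true, y_prob, n_classes):
    # y_true: (n,) integer labels; y_prob: (n, n_classes) predicted probabilities.
    # Micro-average ROC over all classes: binarize the labels and flatten
    # both arrays so every (sample, class) pair contributes one score.
    y_bin = label_binarize(y_true, classes=np.arange(n_classes))
    fpr, tpr, _ = roc_curve(y_bin.ravel(), y_prob.ravel())
    return fpr, tpr, auc(fpr, tpr)
```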

Discussion

Our experiments illustrated that our model can achieve impressive classification results for personalized HAR without target data labels. The model acquires high-confidence data through the consistency filter to assist classifier training and maintains a high AUC on unbalanced data sets. Hence, it provides a promising, easy-to-use technique for personalized HAR problems. Below, we summarize the method's innovations and drawbacks and propose future work.

Our model can work in a personalized setting without labels for the new individual's activity data. This multi-source unsupervised transfer learning method requires less labeled data from the target individual than traditional deep learning methods, and the information from multiple source domains improves classification performance on the target domain. Since our model does not need target domain labels, it can quickly process data from new users without manual annotation. Because of its multi-source nature, it can take full advantage of existing data from multiple individuals rather than from a single one.

We use the idea of co-training, together with heterogeneous classifiers and high-confidence data screening, to improve the performance and stability of our model. Tables 2, 3 and 4 compare our model with the other methods. After introducing high-confidence data, the classifiers adapt better to the target domain distribution, and the multi-perspective classifiers can exploit hidden features of the data from multiple views, improving the stability of the model. Figure 3 shows our model's superior stability and performance. Sample weighting, multiple classifiers, and high-confidence data also help our model adapt to data imbalance; Figs. 5 and 6 show its strong performance on unbalanced data sets.

Our model is an end-to-end neural network that avoids the manual design of activity recognition features, significantly reducing manual work compared with traditional and supervised models. In the feature extraction module, we use a CNN as the backbone network; CNNs have a strong advantage for long-term and repetitive behaviors, whereas RNNs are more suitable for short and naturally ordered behaviors [55]. The backbone network can be replaced to suit different realistic scenarios.

While the experiments showed that our model performs well on public and real-world data sets for personalized HAR, some issues remain. The quality of the pseudo-labels of the high-confidence data directly affects the performance of our model, so the classifiers must already perform reasonably well when filtering high-confidence data. We must also rely on experience to set suitable parameters for the filtering, and finding the best parameters is a challenge. Moreover, since the target domain data are more important to the classifiers, they should receive more attention during training.

To address these shortcomings, we propose some future work. After obtaining high-confidence data, we can use a method similar to TrAdaBoost [56] to weight the source domain data and the high-confidence data. In addition, since source domain samples contribute differently to the classification results, we can weight them according to the similarity between the source and target domains when integrating the multiple classification results.

Conclusion

We proposed the MUCT model, which offers a viable solution for personalized HAR. Our model accommodates the differences in how individuals perform activities and can quickly recognize a new individual's activity data by leveraging previously collected data. In the absence of target domain labels, it uses co-training to make full use of target domain data, and multi-view classifiers are added to improve performance. The model is an end-to-end network that automatically extracts features with less labor than traditional methods. Experimental results on public and real-world data sets show that our model yields superior personalized classification results in HAR and remains robust in practical applications.