1 Introduction

Stroke is a worldwide health problem that causes disability and death (Donnan et al. 2008); it occurs when a blood clot cuts off the oxygen supply to a region of the brain. Hemiparesis, a very common post-stroke symptom, is the partial or complete paralysis of one side of the body, i.e., the side opposite to where the blood clot occurs, and it results in difficulties in performing activities, e.g., due to reduced arm movement. Patients can recover some of their capabilities with intense therapeutic input, so it is important to assess their recovery levels regularly. There are many approaches to assessing patients' recovery levels, including brain imaging (Wintermark et al. 2005), questionnaire-based assessment (Pietro et al. 2007), and lab-based clinical assessment (Barreca et al. 2005).

Brain imaging is deemed one of the most reliable approaches, as it can provide information about brain hemodynamics (Wintermark et al. 2005). However, it requires special equipment and is very expensive. Questionnaire-based approaches investigate functional ability over a period of time using questionnaires, which can be categorised into two types: patient-completed and caregiver-completed (Pietro et al. 2007). Although much cheaper than brain imaging, these approaches may contain a high level of bias. For instance, patients may not remember their daily activities (i.e., recall bias), and caregivers may not be able to observe the patient all the time. These biases make questionnaire-based approaches less precise. Lab-based clinical assessment approaches (Barreca et al. 2005), on the other hand, provide an alternative solution: the patient's upper limb functionality is assessed by clinicians, e.g., by observing the patient's ability to complete certain pre-defined activities (Barreca et al. 2005). Compared with brain imaging or questionnaire-based approaches, the cost of lab-based clinical assessment is reasonable and the accuracy is high. However, the assessment is normally carried out in clinics/hospitals, which is inconvenient for patients and makes continuous monitoring less feasible.

In this work, we aim to build an automated stroke rehabilitation assessment system using wearable sensing and machine learning techniques. Different from the aforementioned approaches, our system can measure patients objectively and continuously in free-living environments. We collect accelerometer data using wrist-worn sensors and design compact features that can capture rehabilitation-related movements, before mapping these features to clinical assessment scores (i.e., the model training process). The trained model can then be used to infer the recovery level of unseen patients. In free-living environments, different types of movements may relate to different frequencies: activities such as running or jumping may correspond to high-frequency signals, while sedentary behaviour or eating may correspond to low-frequency signals. In this study, instead of recognising daily activities explicitly, which is hard to achieve given the limited annotation (e.g., without frame/sample-wise annotation), we transform the raw accelerometer data to the frequency domain, where we design features that encode rehabilitation-related movements. Specifically, the wavelet transform (Walden et al. 2000) is used, and the wavelet coefficients represent the frequency information at particular decomposition scales. Preece et al. (2009) provide some commonly used wavelet features extracted from accelerometer data. However, to capture stroke rehabilitation-related activities, domain knowledge should be taken into account to design better features. After stroke, patients have difficulties moving one side (i.e., the paralysed side) due to the brain injury, and data from the paralysed side tends to describe upper limb functional ability better than data from the non-paralysed side (i.e., the normal side). However, such signals can be significantly affected by personal behaviours or irrelevant daily activities, and such noise should be suppressed before developing the predictive models. Various wavelet features were studied, and we propose two new types of daily-activity-invariant features that can encode information from both the paralysed and non-paralysed sides, before developing predictive models for stroke rehabilitation assessment. Specifically, our contributions can be summarised as follows:

  • Stroke-rehab-driven features: We propose two new types of compact wavelet-based features that encode information from both the paralysed and non-paralysed sides to represent upper limb functional ability for stroke rehabilitation assessment. They can significantly suppress the influence of personal behaviours or irrelevant daily activities in data collected in noisy free-living environments.

  • Automated assessment system: Based on the proposed stroke-rehab-driven features, we develop an automated system using the longitudinal mixed-effects model with Gaussian process prior (LMGP). Various predictive models were studied, and we find that LMGP can model the random effects caused by the heterogeneity among subjects in an 8-week longitudinal study.

  • Comprehensive evaluation: Comprehensive experiments are designed to study the effectiveness of our system. We also study the effect of different feature subsets on modelling the mixed effects in LMGP. Compared with other approaches, the results demonstrate the effectiveness of the proposed system on both acute and chronic patients.

2 Background and related work

As described in Sect. 1, lab-based clinical assessment is one of the most effective stroke rehabilitation assessment methods. In this section, we introduce the lab-based approach named Chedoke Arm and Hand Activity Inventory (CAHAI) scoring (Barreca et al. 2006), on which our automated system is built. Some sensing and machine learning techniques for automated health assessment are also reviewed.

Fig. 1: The clinical behaviour assessment for CAHAI scoring (Barreca et al. 2006)

2.1 Chedoke arm and hand activity inventory (CAHAI)

CAHAI scoring is a clinical assessment method for stroke rehabilitation. It is a fully validated measure (Barreca et al. 2006) of upper limb functional ability comprising 9 tasks, each scored on a 7-point quantitative scale. In the assessment, the patient is asked to perform the 9 tasks, including opening a jar of coffee, drawing a line with a ruler, calling 911, etc., and the clinician scores each behaviour based on the patient's performance, from 1 (total assist) to 7 (complete independence, i.e., performed timely and safely) (Barreca et al. 2006). A task example, "call 911", is shown in Fig. 1. The minimum and maximum total scores are therefore 9 and 63, respectively. A CAHAI score form can be found in Fig. 12 in Appendix 2.

2.2 Automated behaviour assessment using wearables

Recently, wearable sensing and machine learning (ML) techniques have been extensively studied for automated health assessment. Compared with traditional assessment approaches (e.g., via self-reporting, clinical assessment, etc.), which are normally subjective and expensive, automated systems may provide an objective, low-cost alternative that can also be used for continuous monitoring/assessment. Some automated systems have been developed to assess the behaviours associated with diseases such as Parkinson's disease (zia ur et al. 2019; Hammerla et al. 2015), autism (Ploetz et al. 2012), and depression (Little et al. 2020); or to monitor health status such as sleep (Zhai et al. 2020; Supratak et al. 2017), fatigue (Bai et al. 2020; Ibrahim et al. 2020), or recovery level after surgery (Ratcliffe et al. 2020; Gurchiek et al. 2019).

After collecting behavioural or physiological signals (e.g., accelerometer, ECG, audio, etc.), assessment/monitoring models can be developed. For applications with high interpretability requirements, feature engineering can be a crucial step. For example, with gait parameters extracted from IMU sensors (such as stride, velocity, etc.), one can build simple ML models (e.g., random forest) for Parkinson's disease classification (zia ur et al. 2019) or fatigue score regression (Ibrahim et al. 2020). Compared with the redundant raw IMU data, gait parameters are more compact and interpretable, making them suitable for clinical applications. However, designing interpretable/clinically-relevant features can be a time-consuming process, which may also require domain knowledge (Zhai et al. 2020; Ibrahim et al. 2020; zia ur et al. 2019; Ratcliffe et al. 2020; Gurchiek et al. 2019).

On the other hand, when interpretability is less of a requirement, deep learning can be an alternative approach, which can be applied directly to the raw signal (Supratak et al. 2017) or to engineered features (Hammerla et al. 2015; Zhai et al. 2020; Bai et al. 2020; Little et al. 2020) for (high-level) representation learning and classification/regression tasks. However, it normally requires adequate data annotation for good model generalisation.

2.3 Sensing techniques for automated stroke rehabilitation monitoring

With the rapid development of sensing/ML techniques, researchers have also started to apply various sensors to stroke rehabilitation monitoring. In Dolatabadi et al. (2017), a Kinect sensor is used in a home-like environment to detect key joints such that stroke patients' behaviour can be assessed. In Ganesh et al. (2018), a wireless surface electromyography (sEMG) device is used to monitor the muscle recruitment of post-stroke patients to evaluate the effect of orthotic intervention. In clinical environments, five wearable sensors placed on the trunk and on the upper arm and forearm of both upper limbs have been used to measure the reaching behaviours of stroke survivors (Jung et al. 2018). To monitor the motor functions of stroke patients during rehabilitation sessions at clinics, an ecosystem including a jack and a cube for hand-grasping monitoring, as well as a smart watch for arm dynamics monitoring, was designed (Bobin et al. 2019). These techniques can objectively assess/measure the behaviours of stroke patients, yet they are limited either to clinical environments (Bobin et al. 2019; Jung et al. 2018; Ganesh et al. 2018) or to constrained environments [e.g., in front of a camera (Dolatabadi et al. 2017)].

Most recently, wrist-worn sensors have been used for stroke rehabilitation monitoring of patients in free-living environments (Halloran et al. 2019; Tang et al. 2020). In each trial, 3-day accelerometer data are collected from both wrists (with a trial-wise annotation, i.e., the CAHAI score), and in both works (Halloran et al. 2019; Tang et al. 2020) the data analysis is performed using a sliding-window approach. To reduce the redundancy of the raw data, PCA features are extracted from each window (Halloran et al. 2019; Tang et al. 2020). Moreover, due to the lack of window-wise annotation, Halloran et al. (2019) assign a pseudo label to each window such that a random forest regressor can be trained, while Tang et al. (2020) employ Gaussian Mixture Model (GMM) clustering to learn a holistic trial-wise representation before developing the regression model. Both methods (Halloran et al. 2019; Tang et al. 2020) suffer from the lack of annotation. In Halloran et al. (2019), pseudo labelling is introduced, yet the trained model is affected by the introduced label noise. In Tang et al. (2020), the application of GMM clustering (on the sliding windows) is computationally expensive for large datasets, and the trained model does not generalise well to unseen subjects.

In our work, by analysing the nature of the paralysed/non-paralysed sides, we design stroke-rehab-driven features which directly encode a long accelerometer sequence (e.g., a trial with 3 days of accelerometer data) into a very compact representation. The features are expected to emphasise stroke-related behaviours while suppressing irrelevant activities. Based on the proposed features, a predictive model that adapts to different subjects/time slots can be developed using LMGP (Shi et al. 2012) for CAHAI score prediction.

3 Methodology

In this section, we introduce our method, from data collection, data pre-processing, and feature design to predictive models. Our aim is to develop an automated model which can map the free-living 3-day accelerometer data to the CAHAI score. With the trained model, we can automatically infer the CAHAI score in an objective and continuous manner. To achieve this, we first reduce the data redundancy via pre-processing and design compact and discriminant features. Given the proposed features, a longitudinal mixed-effects model with Gaussian process prior (LMGP) is used (Shi et al. 2012), which can further reduce the impact of the large variability (caused by different subjects and time slots) and improve prediction results.

3.1 Data acquisition

Fig. 2: Demographic information of the collected dataset (with 59 subjects): the distributions of acute/chronic condition, gender, dominant/non-dominant hand, and paralysed/non-paralysed side with respect to age

3.1.1 Participants

Data is collected as part of a larger research study which aims to use a bespoke, professionally-written video game as a therapeutic tool for stroke rehabilitation (Shi et al. 2013). Ethical approval is obtained from the National Research Ethics Committee and all work undertaken is in accordance with the Declaration of Helsinki. Written informed consent is obtained from all subjects. A cohort of 59 stroke survivors, without significant cognitive or visual impairment, is recruited for the study. Patients were divided into two groups, i.e.,

  • Group 1: the acute patient group, consisting of 26 participants who enrolled into the study within 6 months after stroke;

  • Group 2: the chronic patient group, including 33 participants who were 6 months or more post onset of stroke.

The distributions of acute/chronic condition, gender, dominant/non-dominant hand, paralysed/non-paralysed side with respect to age are shown in Fig. 2.

These 59 patients visit the clinic for CAHAI scoring once a week (on a random weekday) for a duration of 8 weeks. During these 8 weeks, they are asked to wear two wrist-worn sensors for 3 full days (including night time) each week. They are also advised to remove the devices when showering or swimming. Since some patients need time to become familiar with this data collection procedure, for better data quality we do not use the first week's accelerometer data. The first week's CAHAI scores are used as medical history information.

3.1.2 Data collection

In contrast to the other aforementioned sensing techniques (Jung et al. 2018; Bobin et al. 2019; Ganesh et al. 2018; Dolatabadi et al. 2017), in this study we collect accelerometer data from wrist-worn sensors in free-living environments. The sensor used in this study, the AX3 (Axivity Ltd 2020), is a triaxial accelerometer logger designed for physical activity/behaviour monitoring, and it has been widely used in the medical community [e.g., for the UK Biobank physical activity study (Doherty et al. 2017)]. The wrist bands are also designed such that users can wear them comfortably without affecting their behaviours. The data is collected at a 100 Hz sampling rate, which preserves the characteristics of daily human activities well (Bouten et al. 1997). Different from human activity recognition, which requires sample-wise or frame-wise annotation (Guan and Ploetz 2017; Ploetz and Guan 2018), the data collection in this study is relatively straightforward. The patients wear both wrist-worn sensors for 3 full days a week, before visiting clinicians for CAHAI scoring (i.e., week-wise annotation). In other words, we aim to use accelerometer data captured in free-living environments to represent the stroke survivors' upper limb activities and thereby measure the degree of paresis (Stig Jørgensen et al. 1999), i.e., the CAHAI score.

One problem with most commercial sensors is that only summary data (e.g., step counts from Fitbit), rather than raw data, are available. The algorithms producing the summary data are normally not open source and may vary from vendor to vendor, making the data collection and analysis device-dependent and thus less practical in terms of generalisation and scalability. The AX3 device used in this study, on the other hand, outputs the raw acceleration in the x, y, z directions. It is simple and transparent, making the collected data re-usable, which is crucial for research communities.

3.2 Data pre-processing

Fig. 3: The signal vector magnitude (VM) data collected from two patients (on the paralysed side); patient la012 has a CAHAI score of 55, while patient la040 has a CAHAI score of 26

For accelerometer data, the signal vector magnitude (VM) (Karantonis et al. 2006) is a popular representation; it is simply the magnitude of the triaxial acceleration, defined as \(a(t) = \sqrt{a_x^2(t) + a_y^2(t) + a_z^2(t)},\) where \(a_x(t)\), \(a_y(t)\), \(a_z(t)\) are the accelerations along the x, y, z axes at timestamp t. The gravity effect can be removed by \(VM(t) = |a(t) -1 |\). Because of its simplicity and effectiveness, VM has been widely used in health monitoring tasks, such as fall detection (Karantonis et al. 2006), physical activity monitoring (Doherty et al. 2017), and perinatal stroke assessment (Gao et al. 2019). To further reduce the data volume, we use the second-wise VM, i.e., the mean VM over each second (100 samples per second) is used as the new representation. Some second-wise VM examples (from two patients) can be found in Fig. 3.
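For illustration, this pre-processing step can be sketched in Python/NumPy as below (a minimal sketch; the function and variable names are ours, not from the original implementation):

```python
import numpy as np

FS = 100  # sampling rate in Hz, as used in this study

def second_wise_vm(acc_xyz, fs=FS):
    """Gravity-removed signal vector magnitude (VM), averaged over
    non-overlapping 1-second windows.

    acc_xyz: array of shape (n_samples, 3) with a_x, a_y, a_z in units of g.
    Returns an array of shape (n_seconds,) holding the second-wise mean VM.
    """
    a = np.sqrt((acc_xyz ** 2).sum(axis=1))   # a(t) = sqrt(ax^2 + ay^2 + az^2)
    vm = np.abs(a - 1.0)                      # VM(t) = |a(t) - 1| removes gravity
    n_seconds = len(vm) // fs
    return vm[: n_seconds * fs].reshape(n_seconds, fs).mean(axis=1)

# example on synthetic data (1 hour at 100 Hz); a real trial spans ~3 days,
# i.e. roughly 259,200 second-wise samples after this step
acc = np.array([0.0, 0.0, 1.0]) + 0.05 * np.random.randn(3600 * FS, 3)
vm_sec = second_wise_vm(acc)                  # shape: (3600,)
```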

3.3 The proposed stroke-rehab-driven features

3.3.1 Challenges

We aim to build a model that can map the 3-day time-series data to the CAHAI score. Different from other wearable-based behaviour analysis tasks (e.g., Ploetz et al. 2012; Guan and Ploetz 2017), the annotation here is inadequate. Even with the second-wise VM data, each trial still includes roughly 3 days \(\times\) 24 h/day \(\times\) 3600 s/h \(=259200\) samples (a.k.a. timestamps) with only one annotation (i.e., the CAHAI score). In contrast to the popular deep learning based human activity recognition approaches, which can be trained when rich (frame-wise or sample-wise) annotations are available, the lack of annotation here makes it hard to learn effective representations directly (using machine/deep learning) from the raw data. Moreover, the data is collected in free-living environments, and the 3 full days (per week) can fall on weekdays or weekends, which may increase the intra-subject variability significantly and make the data hard to model. To address the aforementioned issues, domain-knowledge-driven feature engineering may play a major role in extracting compact and discriminant signatures.

3.3.2 Wavelet features

For time-series analysis, wavelet analysis is a powerful tool for representing various aspects of non-stationary signals, such as trends, discontinuities, and repeated patterns (Ayachi et al. 2016; Walden et al. 2000; Preece et al. 2009), and it is especially useful for signal compression or noise reduction. Given these properties, wavelet features have been widely used in accelerometer-based analysis of daily living activities (Ayachi et al. 2016). In this work, we use the discrete wavelet transform (DWT) and the discrete wavelet packet transform (DWPT) as feature extractors, based on which new features are designed to preserve the stroke rehabilitation-related information. More details of DWT and DWPT can be found in Appendix 3.

After applying the DWT and DWPT, the VM signals are transformed into wavelet coefficients at different decomposition scales. In this work, DWT coefficients at scales \(\{2, 3, 4, 5, 6, 7\}\) and DWPT coefficients at scales \(\{1.1, 1.2, 1.3, 1.4\}\) are employed, and the corresponding normalised Sum of Absolute values of the coefficients at the different Decomposition scales (referred to as SAD features) are used as the new representation. Specifically, SAD includes DWPT features defined as

$$\begin{aligned} \begin{aligned} \ {}&SAD_{1.1} = \frac{\left\| {\textbf {W}}_{3.4} \right\| _1}{N/{2^3}} = 2^3 \frac{\left\| {\textbf {W}}_{3.4} \right\| _1}{N}, \\ \ {}&SAD_{1.2} = \frac{\left\| {\textbf {W}}_{3.5} \right\| _1}{N/{2^3}} = 2^3 \frac{\left\| {\textbf {W}}_{3.5} \right\| _1}{N}, \\ \ {}&SAD_{1.3} = \frac{\left\| {\textbf {W}}_{3.6} \right\| _1}{N/{2^3}} = 2^3 \frac{\left\| {\textbf {W}}_{3.6} \right\| _1}{N}, \\ \ {}&SAD_{1.4} = \frac{\left\| {\textbf {W}}_{3.7} \right\| _1}{N/{2^3}} = 2^3 \frac{\left\| {\textbf {W}}_{3.7} \right\| _1}{N}, \\ \end{aligned} \end{aligned}$$
(1)

and DWT features defined as

$$\begin{aligned} SAD_{j} = \frac{\left\| {\textbf {W}}_{j} \right\| _1}{N/{2^j}} = 2^j \frac{\left\| {\textbf {W}}_{j} \right\| _1}{N}, \qquad \qquad j = 2,3,4,5,6,7, \end{aligned}$$
(2)

where \({\textbf {W}}\) denotes the wavelet coefficients and N denotes the length of the VM data. More technical details of DWT, DWPT, as well as the scale selection, can be found in Appendix 4.
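A minimal sketch of the SAD extraction using the PyWavelets library is given below. The choice of mother wavelet ('db4' here) and the mapping of the level-3 packet nodes to \(W_{3.4}\), ..., \(W_{3.7}\) (we assume the upper four frequency-ordered nodes) are our assumptions for illustration; the exact settings used in this work are those described in Appendix 4.

```python
import numpy as np
import pywt  # PyWavelets

def sad_features(vm, wavelet="db4"):
    """10-dimensional SAD vector of Eqs. (1)-(2):
    4 DWPT entries (scales 1.1-1.4) followed by 6 DWT entries (scales 2-7)."""
    n = len(vm)

    # DWPT part: level-3 wavelet packet nodes taken as W_{3.4} ... W_{3.7}
    wp = pywt.WaveletPacket(data=vm, wavelet=wavelet, mode="symmetric", maxlevel=3)
    nodes = wp.get_level(3, order="freq")              # 8 frequency-ordered nodes
    sad_dwpt = [2 ** 3 * np.abs(node.data).sum() / n for node in nodes[4:8]]

    # DWT part: detail coefficients at decomposition scales j = 2 .. 7
    coeffs = pywt.wavedec(vm, wavelet, level=7)        # [cA7, cD7, cD6, ..., cD1]
    sad_dwt = [2 ** j * np.abs(coeffs[-j]).sum() / n for j in range(2, 8)]

    return np.asarray(sad_dwpt + sad_dwt)

# usage: one SAD vector per wrist, e.g.
# sad_p, sad_np = sad_features(vm_sec_paralysed), sad_features(vm_sec_non_paralysed)
```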

Through the wavelet transformation, a long sequence (e.g., the VM data in Fig. 3) can be transformed into the compact SAD representation (i.e., a 10-dimensional feature vector, with entries listed in Eqs. 1 and 2). In Fig. 4, we visualise the compact SAD features corresponding to the paralysed sides of two patients (i.e., patients la012 and la040 from Fig. 3). We notice that in the SAD feature space it is not easy to distinguish the paralysed sides of these two different patients (in terms of CAHAI), indicating the necessity of developing more advanced stroke-related features (e.g., by also considering the non-paralysed side).

Fig. 4: 10-dimensional SAD features extracted from the paralysed side of two patients (with different CAHAI scores); they exhibit similar patterns, indicating the necessity of developing more informative stroke-related features

3.3.3 Proposed features

Based on the compact SAD representation, we aim to further design effective features for reliable CAHAI score regression. In Figs. 3 and 4, we visualise the behaviour patterns in different feature spaces. Specifically, we plot the paralysed side of patient la012 (with CAHAI score 55) and patient la040 (with CAHAI score 26) using the VM representation (Fig. 3) and the SAD representation (Fig. 4). Both figures reveal the limitations of the two representations. Although VM demonstrates distinct patterns for the two patients, these differences may also stem from large intra-class variability (e.g., personalised behaviour patterns). Moreover, the redundancy as well as the high dimensionality make VM hard to model. On the other hand, SAD has low dimensionality, yet both patients exhibit a high level of similarity, indicating that the SAD of the paralysed side alone is not enough for distinguishing patients with different recovery levels.

Fig. 5: SAD representation of both the paralysed and non-paralysed sides for two different patients (la012 with CAHAI score 55, and la040 with CAHAI score 26). SAD features from the non-paralysed side may contain discriminant information for stroke-rehab modelling

Fig. 6: The two proposed PNP representations for two patients (la012 and la040), which can provide discriminant information for distinguishing patients with different recovery levels (clinical CAHAI scores)

Given these observations, we further visualise the SAD features from both the paralysed and non-paralysed sides for both patients in Fig. 5. We can see that patient la012 (with a high recovery level) uses both hands (almost) equally, while patient la040 (with a low recovery level) tends to use the non-paralysed side more. These observations motivate us to design new features using both sides, instead of the paralysed side alone. In this work, we propose two types of features that combine both the Paralysed side and the Non-Paralysed side, namely 1) \({PNP^1}\), which encodes the ratio information with entries defined as:

$$\begin{aligned} PNP^1_k = \frac{SAD_k^{p}}{SAD_k^{np}} \end{aligned}$$
(3)

as well as its variant 2) \(\mathbf {PNP^2}\) with entries defined as:

$$\begin{aligned} PNP^2_k = \frac{SAD_k^{np}-SAD_k^{p}}{SAD_k^{np}+SAD_k^{p}}, \end{aligned}$$
(4)

where k represents the scales defined in the SAD features (as shown in Eqs. 1 and 2), and p and np refer to the paralysed side and non-paralysed side, respectively. We also visualise patients la012 and la040 using the newly proposed features \(PNP^1\) and \(PNP^2\) in Fig. 6, from which we can see that the proposed features can clearly distinguish these two patients, in contrast to SAD (Fig. 4). Although the proposed PNP features empirically exhibit the desired properties (i.e., compact and informative) for these two patients, it should be pointed out that larger-scale experiments are needed to evaluate the generalisation capability; these are provided in the experimental section.
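Given the two SAD vectors of a trial, the PNP features of Eqs. (3) and (4) are simple element-wise operations; a minimal sketch (with a hypothetical function name) is:

```python
import numpy as np

def pnp_features(sad_p, sad_np):
    """Entry-wise PNP features from the paralysed (p) and non-paralysed (np)
    10-dimensional SAD vectors."""
    sad_p, sad_np = np.asarray(sad_p), np.asarray(sad_np)
    pnp1 = sad_p / sad_np                         # Eq. (3): ratio encoding
    pnp2 = (sad_np - sad_p) / (sad_np + sad_p)    # Eq. (4): normalised difference
    return pnp1, pnp2
```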

We summarise the procedure of generating PNP features as follows:

  1. Given the 3-day raw accelerometer data, calculate the signal vector magnitude (VM) with the gravity effect removed;

  2. calculate the second-wise VM (the mean VM value for each second) as the new representation;

  3. calculate the DWPT features at scales \(\{1.1, 1.2, 1.3, 1.4\}\) and the DWT features at scales \(\{2, 3, 4, 5, 6, 7\}\);

  4. given the DWPT and DWT features, calculate the 10-dimensional SAD features via Eqs. (1) and (2);

  5. given the SAD features, calculate the two proposed \({PNP^1}\) and \({PNP^2}\) features via Eqs. (3) and (4).

Table 1 The proposed rehab-driven features

We list the 4 types of features, i.e., the original wavelet features extracted from the paralysed (\({SAD^p}\)) and non-paralysed (\({SAD^{np}}\)) sides separately, as well as the two newly proposed features (\({PNP^1}\) and \({PNP^2}\)). Based on the 10 scales, we can form a 40-dimensional feature vector, as shown in Table 1. However, there exists a certain level of noise and redundancy (especially in \({SAD^p}\) and \({SAD^{np}}\)), so it is crucial to develop a feature selection mechanism or powerful prediction models for higher performance.

3.4 Predictive models

Based on the proposed representation, we aim to develop predictive models that can map the features to the CAHAI score. Although we have reduced the data redundancy significantly, noise that encodes irrelevant information may still exist. It is therefore crucial to develop a robust mechanism to select the most relevant features, and here we use a popular linear feature selection model (LASSO). To model the nonlinear random effects in this longitudinal study, we also propose to use the longitudinal mixed-effects model with Gaussian process prior (LMGP).

It is worth noting that our model also takes advantage of the medical history information (i.e., the CAHAI score from the first visit) to predict the CAHAI scores for the remaining 7 weeks (i.e., week 2 to week 8). From a practical application perspective, the CAHAI score from the initial week (referred to as ini) may serve as an important normalisation factor across individuals.

3.4.1 The linear fixed-effects model

Since there may exist some redundant or irrelevant features for the prediction task, we first propose to use LASSO (Least Absolute Shrinkage and Selection Operator) for feature selection.

Given the 41-dimensional input variables (the 40 wavelet features listed in Table 1 and the CAHAI score from the initial week), we first standardise the data using the z-norm: each feature entry \(x_k\) is normalised as \(x_k^{new} = (x_k-\overline{x}_k) / s_k\), where \(\overline{x}_k\) and \(s_k\) are the mean and standard deviation of the \(k^{th}\) feature. Using LASSO, useful features can then be selected, based on which a prediction model can be developed. For simplicity, we first use a linear model to predict the target CAHAI score \({y_{ij}}\):

$$\begin{aligned} {y_{ij}} = {\varvec{x}}_{ij}^\textrm{T} {\varvec{\beta }} + \epsilon _{ij}, \ \epsilon _{ij}\sim N(0, \sigma ^2), \end{aligned}$$
(5)

where i stands for the \(i\)th patient and j for the \(j\)th visit (during week 2 to week 8); \({\varvec{x}}_{ij}\) represents the selected feature vector; \({\varvec{\beta }}\) is the model parameter vector to be estimated, and \(\epsilon _{ij}\) is the random noise term.
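A sketch of this fixed-effects pipeline (z-normalisation, LASSO-based selection, then a linear regression on the selected subset) using scikit-learn is shown below; the regularisation strength is a placeholder rather than a value used in this work.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, LinearRegression

def fit_linear_fixed_effects(X, y, alpha=0.1):
    """X: (n_trials, 41) array of the 40 wavelet features plus ini;
    y: (n_trials,) clinical CAHAI scores for weeks 2-8.
    Standardise, select features with LASSO, then fit Eq. (5) on the subset."""
    scaler = StandardScaler()
    Xz = scaler.fit_transform(X)                  # z-normalisation per feature

    lasso = Lasso(alpha=alpha).fit(Xz, y)         # sparsity-constrained selection
    selected = np.flatnonzero(lasso.coef_)        # indices with non-zero coefficients

    model = LinearRegression().fit(Xz[:, selected], y)
    return scaler, selected, model

# prediction for new trials:
# y_hat = model.predict(scaler.transform(X_new)[:, selected])
```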

3.4.2 Longitudinal mixed-effects model with Gaussian process prior (LMGP)

Using a linear model for CAHAI score prediction is simple. However, it ignores the heterogeneity among subjects in this longitudinal study. To model this heterogeneity, we propose to use a nonlinear mixed-effects model (Shi et al. 2012), which consists of a fixed-effects part and a random-effects part. The random-effects part mainly models the heterogeneity, making the prediction process subject/time-adaptive for longitudinal studies. The longitudinal mixed-effects model with Gaussian process prior (LMGP) is defined as follows:

$$\begin{aligned} {y_{ij}} = {\varvec{x}}_{ij}^\textrm{T} {\varvec{\beta }} + g({\varvec{\phi }}_{ij}) + \epsilon _{ij}, \ \epsilon _{ij}\sim N(0, \sigma ^2), \end{aligned}$$
(6)

where i, j stand for the \(i^{th}\) patient at the \(j^{th}\) visit (from week 2 to week 8); \(\epsilon _{ij}\) refers to the independent random error and \(\sigma ^2\) is its variance. In Eq. (6), \({\varvec{x}}_{ij}^T{\varvec{\beta }}\) is the fixed-effects part and \(g({\varvec{\phi }}_{ij})\) represents the nonlinear random-effects part; the latter can be modelled using a non-parametric Bayesian approach with a GP prior (Shi et al. 2012).

It is worth noting that in LMGP the fixed-effects part \({\varvec{x}}_{ij}^\textrm{T}{\varvec{\beta }}\) explains a linear relationship between the input features and CAHAI, while the random-effects part \(g({\varvec{\phi }}_{ij})\) explains the variability caused by differences among individuals or time slots across different weeks. By considering both parts, LMGP provides personalised modelling for this longitudinal data analysis. In LMGP, it is important to select the input features used to model each part, and we refer to them as the fixed-effects features and the random-effects features, respectively. The effect of the fixed-effects features is studied in the experimental evaluation section.

For LMGP training, we first ignore the random-effects part and only estimate the parameters \(\hat{{\varvec{\beta }}}\) of the fixed-effects part (via ordinary least squares, OLS). With the estimated parameters \(\hat{{\varvec{\beta }}}\), the residual \(r_{ij}=y_{ij} - {\varvec{x}}_{ij}^\textrm{T} \hat{ {\varvec{\beta }} } = g({\varvec{\phi }}_{ij}) + \epsilon _{ij}\) can be calculated, from which we can model the random effects

$$\begin{aligned} g({\varvec{\phi }}_{i,j})\sim GP(0,K(\cdot , \cdot ;\, {\varvec{\theta }}) ). \end{aligned}$$

In this paper we consider three different kernels for \(K(\cdot , \cdot ;\, {\varvec{\theta }})\) (linear, squared exponential and rational quadratic); here we take the squared exponential as an example. The squared exponential (covariance) kernel function is defined as \(K\left( {\varvec{\phi }}, {\varvec{\phi }}' ;\, {\varvec{\theta }}\right) =v_{0} \exp \left\{ - d({\varvec{\phi }}, {\varvec{\phi }}')/2 \right\}\), where \(d({\varvec{\phi }}, {\varvec{\phi }}')=\sum _{q=1}^{Q} w_{q}\left( {\phi _{q}}-{\phi _{q}^{\prime }}\right) ^{2}\) is a weighted distance between \({\varvec{\phi }}\) and \({\varvec{\phi }}'\). It involves the hyper-parameters \({\varvec{\theta }} = (v_0, w_1, \ldots , w_Q )\). In a Bayesian approach, we may choose the values of these parameters based on prior knowledge; this is, however, a difficult task due to the large dimension of \({\varvec{\theta }}\). We therefore use an empirical Bayesian method.

The training procedure includes two steps: (I) estimate \({\varvec{\beta }}\) and \(\sigma\) in Eq. (5); (II) estimate the values of the hyper-parameters \({\varvec{\theta }}\) by an empirical Bayesian method, i.e., maximise the marginal likelihood from \({\varvec{r}}_i \sim N({\varvec{0}}, {{\varvec{C}}_{i}}+\sigma ^2 {\varvec{I}})\) for \(i=1, \ldots , n\), where \(\textbf{C}_i\in \mathbb {R}^{J \times J}\) is the covariance matrix of \(g(\cdot )\), with elements defined by \(K(\phi _{i,j}, \phi _{i,j'};\, {\varvec{\theta }})\). To obtain more accurate results, an iterative method may be used: except for the initial step, the error term in Eq. (5) used in step (I) is replaced by

$$\begin{aligned} {{\varvec{\epsilon }}_{i}}=(\epsilon _{i1}, \ldots , \epsilon _{iJ}) \sim N({\varvec{0}}, {{\varvec{C}}_{i}}+\sigma ^2 {\varvec{I}}) \end{aligned}$$

where all the parameters are evaluated using the values obtained in the previous iteration.

The prediction is relatively easy to compute. The posterior distribution of \(g({{\varvec{\phi }}_{i}})\) is a multivariate normal with mean \(\textbf{C}\left( \textbf{C}+\sigma ^{2} \textbf{I}\right) ^{-1} {\varvec{r}}_{i}\) and variance \(\sigma ^{2} \textbf{C} \left( \textbf{C}+\sigma ^{2} \textbf{I}\right) ^{-1}\).

The fitted value can therefore be calculated as the sum of \({\varvec{x}}^T_{ij} \hat{ {\varvec{\beta }} }\) and the above posterior mean; the variance can be calculated accordingly. A detailed description can be found in Shi et al. (2011).
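The two-step idea can be sketched with scikit-learn as below. This is only an approximation of LMGP (Shi et al. 2012): it fits OLS for the fixed-effects part and then a zero-mean GP (squared exponential kernel with ARD length-scales, which play the role of the weights \(w_q\), plus a white-noise term for \(\sigma^2\)) to the pooled residuals by maximising the marginal likelihood, without the subject-wise grouping or the iterative refinement described above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF, WhiteKernel

def fit_lmgp_like(X_fixed, Phi_random, y):
    """Step (I): OLS for the fixed-effects part of Eq. (5).
    Step (II): zero-mean GP on the residuals r = y - X beta_hat, with kernel
    hyper-parameters obtained by maximising the marginal likelihood."""
    ols = LinearRegression().fit(X_fixed, y)
    residual = y - ols.predict(X_fixed)

    q = Phi_random.shape[1]
    kernel = (ConstantKernel(1.0) * RBF(length_scale=np.ones(q))  # v0 and ARD weights w_q
              + WhiteKernel(noise_level=1.0))                     # plays the role of sigma^2
    gp = GaussianProcessRegressor(kernel=kernel).fit(Phi_random, residual)
    return ols, gp

def predict_lmgp_like(ols, gp, X_fixed_new, Phi_random_new):
    """Fitted value = fixed-effects prediction + posterior mean of g(.);
    the GP also returns a predictive standard deviation for uncertainty."""
    g_mean, g_std = gp.predict(Phi_random_new, return_std=True)
    return ols.predict(X_fixed_new) + g_mean, g_std
```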

4 Experimental evaluation

In this section, several experiments are designed to evaluate the proposed features as well as the prediction system. The patients are split into two groups according to the nature of their condition, i.e., the acute patient group (26 subjects) and the chronic patient group (33 subjects). Experiments are conducted on each group separately.

Specifically, for each group, leave-one-subject-out cross validation (LOSO-CV) is applied. That is, for a given group (acute or chronic) with n subjects, in each iteration one subject is used as the test set while the remaining \(n-1\) subjects are used for training. This procedure is repeated n times so that every subject is tested, and the average prediction performance across all subjects is reported.

Since CAHAI score prediction is a typical regression problem, we use the root mean square error (RMSE) as the evaluation metric; lower mean RMSE values indicate better performance.
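A sketch of the LOSO-CV protocol with RMSE as the metric is given below (using scikit-learn's LeaveOneGroupOut; whether the reported figure averages per-subject RMSEs, as done here, or pools all predictions is our assumption):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import mean_squared_error

def loso_rmse(model_factory, X, y, subject_ids):
    """Train on n-1 subjects, test on the held-out subject, and average the
    per-subject RMSEs. `model_factory` returns a fresh regressor with
    .fit/.predict, e.g. lambda: LinearRegression()."""
    logo = LeaveOneGroupOut()
    rmses = []
    for train_idx, test_idx in logo.split(X, y, groups=subject_ids):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        rmses.append(np.sqrt(mean_squared_error(y[test_idx], pred)))
    return float(np.mean(rmses))
```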

4.1 Evaluation of the proposed feature PNP

In this subsection, we evaluate the effectiveness of the proposed PNP features. The most straightforward approach is to calculate the correlation coefficients against the target CAHAI scores. In Table 2 we report the corresponding correlation coefficients (for \(PNP^1_k\) and \(PNP^2_k\) at the 10 scales) for the acute and chronic patient groups. The correlation coefficients of the original wavelet features (paralysed side \(SAD^p_k\) and non-paralysed side \(SAD^{np}_k\) at the 10 scales) against the CAHAI score are also reported for comparison. From Table 2, we can see that:

  • PNP features generally have higher correlation coefficients (than SAD) against the CAHAI scores.

  • PNP features at scales \(k=1.1\) to \(k=5\) show higher correlations with the CAHAI scores.

  • for chronic patients, SAD features (on the non-paralysed side) exhibit comparable correlation scores with PNP features.

These observations indicate the necessity of selecting useful features when building the prediction system. Although PNP demonstrates stronger predictive capacity, in some cases SAD (e.g., extracted from the non-paralysed side) may also provide important information for a certain population (e.g., chronic patients).
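For reference, the per-feature analysis of Table 2 amounts to computing Pearson correlation coefficients between each candidate feature and the clinical CAHAI score; a minimal sketch (with a hypothetical helper name) is:

```python
import numpy as np
from scipy.stats import pearsonr

def feature_cahai_correlations(X, y, feature_names):
    """Pearson correlation of each candidate feature (columns of X)
    against the clinical CAHAI score y."""
    return {name: pearsonr(X[:, k], y)[0] for k, name in enumerate(feature_names)}
```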

Table 2 Correlation coefficients of the wavelet features and CAHAI score
Fig. 7: Cross-correlation of the candidate features for the two patient groups (top: acute patients; bottom: chronic patients). In general, the PNP features, the SAD features and the medical history information ini are less correlated with each other than features within the same type (e.g., within the PNP features)

For a better understanding of the relationship between these features, we also report the cross-correlation between each pair of features. Note that we also include the medical history feature, i.e., the initial week-1 CAHAI score. From Fig. 7, we have the following observations:

  • For both patient groups, the PNP features are highly correlated. PNP features within the same type (\({PNP}^1\) or \({PNP}^2\)) tend to be positively correlated, while PNP features from different types tend to be negatively correlated.

  • For acute patients, SAD features for each side (paralysed side \({SAD^p}\) or non-paralysed side \({SAD^{np}}\)) are highly (positively) correlated, yet the SAD features from different sides are less correlated. For chronic patients, however, SAD features from both sides are highly (positively) correlated.

  • In general, the PNP features, the SAD features and the medical history information ini are less correlated with each other, indicating that they provide potentially complementary information to be fused.

Based on the above findings, it is clear that within each feature type there may exist a high level of redundancy, and it is necessary to select the most relevant feature subsets. For the acute and chronic patient groups, the optimal feature subset may differ due to the different movement patterns (e.g., on the paralysed/non-paralysed sides). Although the proposed PNP features can alleviate this problem to some extent, it is beneficial to combine the less correlated features (i.e., PNP, SAD, and ini).

4.2 Evaluation of the predictive models

4.2.1 Feature selection

Based on the feature correlation analysis in Sect. 4.1, it is important to select the most relevant features from the various sources (i.e., PNP, SAD, and ini). Different from the correlation-based approach, which selects each feature independently (by its correlation coefficient), LASSO selects features by solving a linear optimisation problem with a sparsity constraint, and it takes the relationships between the features into consideration. Using LASSO, we select the most important features for the acute and chronic patient groups, as shown in Table 3.

Table 3 Selected features using LASSO

It is also worth mentioning that the wavelet-based features offer a certain level of interpretability. Based on the energy-preserving condition, \(SAD_j\) represents the point energy of the signal at decomposition level j (see Appendix 4 for more details). Specifically, it relates to the amount of energy at different activity levels (i.e., in different frequency bands determined by the decomposition scale j). Activities such as jumping or lifting an object may correspond to high-frequency signals, while sedentary behaviour or eating may correspond to low-frequency signals. Based on this, we can interpret the key features in Table 3. For example, for acute patients the key features (which are highly related to stroke-rehab modelling) correspond to asymmetric activities at the low/medium-frequency level (i.e., \(PNP_3^2, PNP_6^1\)), non-paralysed-side activities at the low/medium-frequency level (i.e., \(SAD_2^{np}, SAD_6^{np}\)), and paralysed-side activities at the high-frequency level (i.e., \(SAD_{1.2}^p\)).

4.2.2 Performance of linear fixed-effects model

Based on the selected features, we perform leave-one-patient-out cross validation on the two patient groups separately using the linear fixed-effects model. As shown in Fig. 8, the prediction results for the chronic patients (mean RMSE 3.29) tend to be much better than those for the acute group (mean RMSE 7.24). One of the main reasons might be the nature of the patient groups. In Fig. 9, we plot the clinical CAHAI distribution (i.e., the ground-truth CAHAI) from week 2 to week 8, and we can see that the clinical CAHAI scores are very stable for chronic patients. On the other hand, for acute patients, who suffered a stroke within the past 6 months, health status is less stable and is affected significantly by various factors, and in this case the simple linear fixed-effects model yields less promising results.

Fig. 8: Linear model prediction vs clinical CAHAI; left: acute patients (RMSE 7.24); right: chronic patients (RMSE 3.29). Each point corresponds to a trial (i.e., data collected over 3 days), and different colours represent different subjects

Fig. 9: Clinically assessed CAHAI distribution with respect to visit; stroke rehabilitation levels may be stable for chronic patients while varying substantially for acute patients

4.2.3 Performance of Longitudinal mixed-effects model with Gaussian process prior (LMGP)

We also develop LMGP for both patient groups. We applied different covariance kernels in the LMGP models and found that the one with the powered exponential kernel achieves the best results. The following discussion therefore focuses on the model with this kernel. More results using other kernels can be found in Appendix 5.

Fig. 10: LMGP prediction vs clinical CAHAI; left: acute patients (RMSE 5.75); right: chronic patients (RMSE 3.12). Each point corresponds to a trial (i.e., data collected over 3 days), and different colours represent different subjects

Here, we use the selected features (from Table 3) as both the fixed-effects features and the random-effects features. As with the linear fixed-effects model, we evaluate the performance using leave-one-patient-out cross validation, and the mean RMSE values are reported in Fig. 10, from which we can see that LMGP further reduces the errors compared with the fixed-effects linear model, with mean RMSEs of 5.75 for acute patients and 3.12 for chronic patients, respectively.

Fig. 11: Continuous monitoring using LMGP for four patients (top: two chronic patients; bottom: two acute patients); dark points are the trial-wise/week-wise predictions (each trial comprising data collected over 3 days per week) and red points are the corresponding ground-truth CAHAI scores

Based on LMGP, we also perform "continuous monitoring", i.e., week-wise CAHAI score prediction, for four patients (two from each patient group) from week 2 to week 8; the results are reported (with mean and \(95\%\) confidence interval) in Fig. 11. Such confidence intervals are extremely helpful when uncertainty measurement is required.

4.2.4 On the fixed-effects part of LMGP

LMGP includes two key parts, i.e., the linear fixed-effects part and the non-linear random-effects part, and it is important to choose the key features for modelling each of them. Since the fixed-effects part measures the main (linear) relationship between the input features and the predicted CAHAI, we study the corresponding feature subsets. For the random-effects part, we use the full set of LASSO-selected features (as shown in Table 3).

To select the most important feature subset for modelling the fixed-effects part, we rank the features (from Table 3) based on two criteria: the LASSO coefficients and the correlation coefficients (between features and CAHAI, as described in Sect. 4.1). Table 4 shows the ranked features; here only the top \(50\%\) of the features (i.e., the top three features for acute patients and the top five features for chronic patients) are used to model the fixed-effects part, and the settings as well as the results are reported in Table 5.
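A sketch of this ranking step is given below (hypothetical helper; the dictionaries of LASSO coefficients and correlation coefficients are assumed to come from the earlier selection and correlation steps):

```python
def rank_and_select(lasso_coefs, corr_coefs, top_frac=0.5):
    """Rank the LASSO-selected features by |LASSO coefficient| and by
    |correlation with CAHAI|, keeping the top 50% of each ranking as
    candidate fixed-effects features.
    Both arguments are dicts mapping feature name -> coefficient."""
    k = max(1, round(top_frac * len(lasso_coefs)))
    by_lasso = sorted(lasso_coefs, key=lambda f: -abs(lasso_coefs[f]))[:k]
    by_corr = sorted(corr_coefs, key=lambda f: -abs(corr_coefs[f]))[:k]
    return by_lasso, by_corr
```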

Table 4 Feature importance ranking (based on two criteria) for acute/chronic patients

It is interesting to observe that the performance varies when different feature subsets are applied. Specifically, with the top feature subsets, modelling LMGP's fixed-effects part can further reduce the errors to some extent for acute patients, whereas for chronic patients the errors increase. The top 5 features selected via the LASSO criterion yield the worst performance for chronic patients, and one possible explanation is the absence of the feature ini, the initial health condition, which is a major attribute for chronic patient modelling (see Fig. 9).

Table 5 LMGP’s fixed-effects part modelling results (RMSE) based on different feature subsets

4.2.5 Model comparison

Based on our proposed (41-dimensional) stroke-rehab-driven features, we compare LMGP with a number of classical predictive models, namely a neural network (NN), support vector regression (SVR) and random forest regression (RF), for the acute and chronic patient groups. It is worth noting that we cannot apply popular deep learning architectures such as convolutional neural networks (CNN) or recurrent neural networks (RNN) to the time-series signal, due to the lack of frame-wise or sample-wise annotation. However, with the stroke-rehab-driven features and trial-wise annotation, simple neural networks such as the multi-layer perceptron (MLP) can be applied, and here we use a 3-layer MLP.

Table 6 Predictive model comparison based on the proposed stroke-rehab-driven features (in LOSO-CV setting)

LOSO-CV is applied, with the mean RMSE values reported in Table 6, from which we observe that linear models (linear SVR and the linear fixed-effects model) yield better results than non-linear methods (NN, SVR with rbf kernel, and RF). One explanation is over-fitting: the trained non-linear models do not generalise well to the unseen patients/environments in this longitudinal study setting. RF is normally known for its high generalisation capability, yet it may suffer from the low dimensionality of the selected features (6 features for acute patients and 10 features for chronic patients). Given the simplicity of the linear models and the designed low-dimensional features, linear models tend to suffer less from over-fitting and produce reasonable results in these challenging environments. Compared with the linear models, our LMGP can further model the longitudinal mixed effects (i.e., with a linear fixed-effects part and a non-linear random-effects part), making the system adaptive to different subjects/time slots and yielding the lowest errors.
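For illustration, the baseline comparison can be sketched as follows, reusing the LOSO-CV helper sketched in the evaluation protocol above; the hyper-parameters shown are placeholders, not the settings used in this work.

```python
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# candidate baselines, each evaluated with the LOSO-CV helper sketched earlier
baselines = {
    "MLP (3-layer)": lambda: MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000),
    "SVR (linear)": lambda: SVR(kernel="linear"),
    "SVR (rbf)": lambda: SVR(kernel="rbf"),
    "Random forest": lambda: RandomForestRegressor(n_estimators=200),
    "Linear fixed-effects": lambda: LinearRegression(),
}
# for name, factory in baselines.items():
#     print(name, loso_rmse(factory, X_selected, y, subject_ids))
```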

Table 7 Method comparison (in LOSO-CV setting)

We also compare our approach with other automated CAHAI score regression methods in the existing literature (Tang et al. 2020; Halloran et al. 2019). Different from our approach, Tang et al. (2020) and Halloran et al. (2019) are purely data-driven. To address the lack-of-annotation problem, Tang et al. (2020) use GMM clustering (on the sliding windows) to learn latent features that can be aggregated into a trial-wise representation, while Halloran et al. (2019) employ a pseudo-labelling strategy for the trial-wise representation. However, neither data-driven feature can suppress the substantial noise in the original accelerometer signal, and such noise (e.g., irrelevant daily activities) significantly affects the performance of both approaches. In contrast, by taking advantage of domain knowledge, our proposed stroke-rehab-driven representation is compact yet informative, and from Tables 6 and 7 we can see that it tends to yield lower errors than (Tang et al. 2020; Halloran et al. 2019), irrespective of the predictive model, for both patient groups.

5 Conclusions

In this work, we develop an automated stroke rehabilitation assessment system using wearable sensing and machine learning techniques. We collect accelerometer data using wrist-worn sensors, based on which we build models for CAHAI score prediction, providing objective and continuous rehabilitation assessment. To map the long time-series (i.e., 3-day accelerometer data) to the CAHAI score, we propose a pipeline spanning data cleaning, feature design, and predictive model development. Specifically, we propose two compact features which capture the rehabilitation characteristics well while suppressing irrelevant daily activities, which is crucial when analysing data collected in free-living environments. We further use LMGP, which makes the model adaptive to different subjects and different time slots (across different weeks). Comprehensive experiments are conducted on both acute and chronic patients, and very promising results are achieved, especially for the chronic patient group. We also study different feature subsets for modelling the fixed-effects part of LMGP, and the experiments suggest that the errors can be further reduced for the challenging acute patient population.

Due to irrelevant daily activities and the strong heterogeneity among subjects, dealing with free-living data is very challenging for researchers in mathematics, computing science and other areas. It is also crucial, particularly in medical research, to develop models which have good mathematical properties and a physical explanation. We hope that the ideas behind the new features and the models discussed in this paper can provide some hints for addressing similar problems in health research.