In this section, we first describe the PD dataset [13, 31] that was used for evaluating the developed models. We also provide a brief description of the Physical Activity Monitoring Dataset (PAMAP2) [32] that was used for transfer learning of the deep model with hand-crafted features. Next, we describe signal segmentation and extraction of the hand-crafted features. Finally, we describe the proposed deep models.
Datasets
Collection of PD Data
A protocol was designed to record motion data from 24 PwP with idiopathic PD as they performed a variety of ADL [13, 31]. A summary of patient characteristics is shown in Table 3. The average age was 58.9 years (range 42–77 years). Fourteen of the PwP were female and ten were male. The average disease duration was 9.9 years (range 4–17 years). The average UPDRS-III score was 29.7 before taking PD medications and 17.3 one hour after taking PD medications. The institutional review board approved the study, and all patients provided written informed consent.
Table 3 Subject demographics. LEDD stands for Levodopa Equivalent Daily Dose. Values are presented as number or mean ± standard deviation.

Two wearable sensors (Great Lakes NeuroTechnologies Inc., Cleveland, OH), each consisting of a triaxial gyroscope and accelerometer, were mounted on the most affected wrist and ankle to collect motion data at a sampling rate of 64 Hz. The participants stopped their PD medication the night before the experiment and started the experiments in their medication OFF states. Fifteen of the subjects performed various ADL in four rounds spanning 4 h. The ADL were cutting food, unpacking groceries, grooming, resting, drinking, walking, and dressing. Each activity trial lasted between 15 and 60 s, and each round lasted about 2–4 min. The subjects were asked to perform the ADL at their own pace, and no training was provided. After the first round, the subjects resumed their routine PD medications. Twenty activity trials were missing due to unsuccessful data collection. In addition, two subjects performed only three rounds since they started the experiment in their medication ON states. The total duration of each round for all 15 subjects is shown in Fig. 4a.
The other nine subjects cycled through multiple stations (such as a laundry room, entertainment station, snack station, and desk work) in a home-like setting while engaging in unconstrained activities. Next, the subjects resumed their routine PD medications. Once the medication took effect (as confirmed by a neurologist), the subjects repeated the same ADL or cycled through the stations in their medication ON states. For these nine subjects, the recording was continuous for about 2 h. Rounds of 10 min were later segmented close to the UPDRS-III assessments, as shown in Fig. 4a.
Concurrently, clinical examinations were performed by a neurologist to measure and record the subjects’ UPDRS-III scores. Four rounds of UPDRS-III assessment were performed for the 15 subjects, one at the beginning of every hour of the experiment. Two rounds of UPDRS-III assessment were performed for the other nine participants, at the beginning and end of the experiment. In each assessment, 27 signs of PD were scored on a 0–4 scale across different body parts and both sides; thus, the total UPDRS-III score, the sum of the 27 sign scores, ranged from 0 to 108.
Physical activity monitoring dataset
PAMAP2 is a public dataset of motion signals recorded using two wearable sensors while nine healthy subjects performed various ADL. The subjects were 27.22 ± 3.31 years old, with eight males and one female. The wearable sensors contained triaxial gyroscopes and accelerometers with a 100 Hz sampling rate and were mounted on the dominant side’s arm and ankle. The recorded ADL included 12 protocol activities such as lying, sitting, standing, walking, watching TV, and working on a computer. We used this dataset for transfer learning of the deep-learning models. The reason for selecting this dataset was the availability of the gyroscope signals and the similarity in the sensor placement locations with our PD dataset.
Data preprocessing
For both datasets, we used only the angular velocity signals generated by the gyroscopes. We found experimentally that the gyroscope performs better than the accelerometer in estimating UPDRS III, which is in agreement with the finding of Dia et al. [11]. In addition, using one sensor type decreased the computation power and time required to train and test the models because of the reduction in data dimensionality. The energy consumption of gyroscopes is higher than that of accelerometers, which can constrain long-term recording [33]; however, the availability of devices with long battery life can mitigate this issue. The collected signals were filtered to eliminate low- and high-frequency noise using a bandpass FIR filter with 3 dB cutoff frequencies of 0.5 and 15 Hz.
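The band-pass filtering step can be sketched as follows. The 0.5–15 Hz pass band and the 64 Hz sampling rate come from the text; the filter order and the zero-phase application via `filtfilt` are assumptions, since the paper does not report the filter design details.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

FS = 64  # sampling rate of the PD dataset (Hz)

def bandpass_filter(signal, low=0.5, high=15.0, fs=FS, numtaps=513):
    """Band-pass FIR filtering of a 1-D gyroscope signal.

    The 0.5-15 Hz pass band follows the paper; the filter order (numtaps)
    and the zero-phase filtfilt application are assumptions.
    """
    taps = firwin(numtaps, [low, high], pass_zero=False, fs=fs)
    return filtfilt(taps, [1.0], signal)

# Example: a 3 Hz component (in band) survives; a slow 0.05 Hz drift is removed.
t = np.arange(0, 60, 1 / FS)
x = np.sin(2 * np.pi * 3 * t) + 2 * np.sin(2 * np.pi * 0.05 * t)
y = bandpass_filter(x)
```

Applying the filter forward and backward (`filtfilt`) avoids phase distortion, which matters when short- and long-term windows extracted later must stay time-aligned.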
For the PD dataset, we excluded the data recorded during the UPDRS-III examinations from our analysis to ensure that the developed model would not benefit from the UPDRS III-specific tasks that elicit PD symptoms. Next, 2–4 rounds of data with a maximum duration of 10 min (i.e., a maximum of \(N_S\) samples) were selected from each subject’s recordings. Fig. 4a shows the number and duration of the rounds as well as the corresponding UPDRS-III score for all subjects. A total of 91 rounds (\(N_R\)) were selected to form the set \(\mathcal {D}=\{ (X^{(r)},y^{(r)}) \}_{r=1}^{N_R}\) \((X^{(r)} \in \mathbb {R}^{N_S^{(r)} \times 6}\), \(y^{(r)} \in \mathbb {R})\), where \(X^{(r)}\) denotes the motion time-series data in round r with \(N_S^{(r)}\) samples, and \(y^{(r)}\) denotes the UPDRS-III score for round r. This set was used to train and test the developed algorithms using LOOCV. The distribution of these rounds based on the assessed UPDRS III is shown in Fig. 4b. Similarly, for the PAMAP2 dataset, 1-min rounds of data were selected from each subject’s recordings after down-sampling the signals to 64 Hz. Each round included one activity. A total of 455 rounds were selected to form the set \(\mathcal {D}\) for the PAMAP2 dataset.
Segmentation
PD symptoms manifest in body movements over both short and long time scales. Therefore, features must be extracted from both short and long durations of the motion signals [34, 35]. Hence, we segmented the signals using 5-s windows for short-term features and 1-min windows for long-term features. The segmentation process is shown in Fig. 5a.
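The two-scale windowing can be sketched as below. Non-overlapping, consecutive windows are an assumption; the paper does not report the window overlap.

```python
import numpy as np

def segment(X, fs=64, win_sec=5):
    """Split a (num_samples, num_channels) round into consecutive
    non-overlapping windows of win_sec seconds (overlap is an assumption).
    Trailing samples that do not fill a full window are dropped.
    """
    win = int(win_sec * fs)
    n = (len(X) // win) * win
    return X[:n].reshape(-1, win, X.shape[1])

# A 10-min round of 6-channel gyroscope data (wrist + ankle, 3 axes each):
X = np.random.randn(10 * 60 * 64, 6)
short = segment(X, win_sec=5)    # 5-s windows for short-term features
long_ = segment(X, win_sec=60)   # 1-min windows for long-term features
```

With these settings, a 10-min round yields 120 short-term windows of 320 samples and 10 long-term windows of 3840 samples, so each 1-min window spans exactly 12 short-term windows.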
Feature extraction
We extracted \(N_{SF}=26\) short-term and \(N_{LF}=32\) long-term features from each segment of the data. First, 39 short-term features were extracted from the three (x, y, z) axes’ signals of the wrist sensor and 39 from the ankle sensor (i.e., segmented X). The short-term features were selected to capture high-frequency symptoms such as tremor. They consisted of the 4–6 Hz signal power (x3), percentage power of frequencies > 4 Hz (x3), 0.5–15 Hz signal power (x3), amplitude and lag of the first auto-correlation peak (x6), number and sum of auto-correlation peaks (x6), spectral entropy (x3), dominant and secondary frequencies and their powers (x12), and the cross-correlation between the x–y, x–z, and y–z axes (x3). The details of these features are provided in our previous work [36]. This step provided a total of 78 features from the three axes of the wrist and ankle sensors. Next, the features were averaged across the three axes to obtain \(N_{SF}=26\) features. In summary, a feature vector (\(\vec {fv} \in \mathbb {R}^{N_{SF}}\)) was extracted from each 5-s window, providing a set \(\mathcal {D}_{S}=\{ (S^{(r)},y^{(r)}) \}_{r=1}^{N_R}\) \((S^{(r)} \in \mathbb {R}^{{N_{Ws}^{(r)}} \times N_{SF}}\), \(y^{(r)} \in \mathbb {R})\), where \(S^{(r)}=[\vec {fv}_1\vec {fv}_2...\vec {fv}_{N_{Ws}^{(r)}}]\) and \(N_{Ws}^{(r)}\) is the number of 5-s windows in round r.
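Three of the spectral short-term features can be sketched as follows. The Welch PSD settings (`nperseg`, window) are assumptions; the paper defines the features but not the estimation parameters, and the full feature set is detailed in [36].

```python
import numpy as np
from scipy.signal import welch

FS = 64  # sampling rate (Hz)

def band_power(f, pxx, lo, hi):
    """Approximate signal power in the [lo, hi] Hz band from a PSD."""
    band = (f >= lo) & (f <= hi)
    return pxx[band].sum() * (f[1] - f[0])

def short_term_features(seg, fs=FS):
    """Three of the short-term features for one axis of a 5-s segment:
    4-6 Hz (tremor-band) power, percentage power above 4 Hz, and
    spectral entropy. PSD settings are assumptions.
    """
    f, pxx = welch(seg, fs=fs, nperseg=min(len(seg), 256))
    p_tremor = band_power(f, pxx, 4, 6)
    pct_above4 = band_power(f, pxx, 4, fs / 2) / band_power(f, pxx, 0, fs / 2)
    p_norm = pxx / pxx.sum()
    spec_entropy = -np.sum(p_norm * np.log2(p_norm + 1e-12))
    return p_tremor, pct_above4, spec_entropy
```

A simulated 5 Hz tremor segment concentrates its power in the 4–6 Hz band and yields a low spectral entropy, whereas broadband noise yields a high one.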
Similarly, 48 long-term features were extracted from the three (x, y, z) axes’ signals of the wrist sensor and 48 from the ankle sensor (i.e., segmented X). The long-term features were selected to capture low-frequency symptoms such as bradykinesia. These features were the average jerk (x3), velocity peak-to-peak (x3), 1–4 Hz signal power (x3), 0.5–15 Hz signal power (x3), Shannon entropy (x3), standard deviation (x3), number and sum of auto-correlation peaks (x6), Gini index (x3), sample entropy (x3), mean (x3), skewness (x3), kurtosis (x3), spectral entropy (x3), and the dominant frequency and its power [36] (x6). Next, the features were averaged across the three axes to obtain \(N_{LF}=32\) features. In summary, a feature vector (\(\vec {fv} \in \mathbb {R}^{N_{LF}}\)) was extracted from each 1-min window, providing a set \(\mathcal {D}_{L}=\{ (L^{(r)},y^{(r)}) \}_{r=1}^{N_R}\) \((L^{(r)} \in \mathbb {R}^{N_{Wl}^{(r)} \times N_{LF}}\), \(y^{(r)} \in \mathbb {R})\), where \(L^{(r)}=[\vec {fv}_1\vec {fv}_2...\vec {fv}_{N_{Wl}^{(r)}}]\) and \(N_{Wl}^{(r)}\) is the number of 1-min windows in round r.
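Two of the long-term features can be sketched as below. The exact definitions used in the paper are given in [36]; the jerk-like measure here (mean absolute derivative of the angular velocity) and the Gini-index formulation are assumptions chosen for illustration.

```python
import numpy as np

def average_jerk(seg, fs=64):
    """Mean absolute first derivative of an angular-velocity segment,
    a simple jerk-like bradykinesia measure (assumed definition)."""
    return np.mean(np.abs(np.diff(seg))) * fs

def gini_index(seg):
    """Gini index of the absolute signal values: 0 for a uniformly
    spread signal, approaching 1 when power is concentrated in a
    few samples."""
    x = np.sort(np.abs(seg))
    n = len(x)
    i = np.arange(1, n + 1)
    return np.sum((2 * i - n - 1) * x) / (n * np.sum(x))
```

For example, a constant signal has a Gini index of 0, while a signal with a single nonzero sample has a Gini index of \((n-1)/n\).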
Regression models for UPDRS-III estimation
In our preliminary work, we explored two different architectures based on single-channel and dual-channel LSTMs of hand-crafted features and showed that the latter provides superior performance [24]. In this section, we first describe an extension to that model that applies transfer learning using the PAMAP2 dataset. Next, we develop new 1D and 2D CNN-LSTM models using the raw motion signals and their time–frequency representations, respectively. The proposed ensemble model is then described. Lastly, Gradient Tree Boosting is described as a traditional machine-learning method for comparison purposes.
Dual-channel LSTM network with transfer learning
LSTM is a special type of recurrent neural network designed to overcome the vanishing gradient problem that arises when training with gradient descent and backpropagation through time. LSTMs can efficiently learn temporal dependencies and have been used successfully in applications involving signals with temporal memory. In this work, the LSTM architecture proposed in [37] is used.
An LSTM unit consists of an input gate (i), input modulation gate (g), forget gate (f), output gate (o), and memory cell (\(c_t\) at time step t). Before applying the gate operations, the current feature vector (\(\vec {fv}^{(r)}_t\)) at time t in round r is linearly transformed using the following equation:
$$\begin{aligned} \vec {x}^{(r)}_t=W_{fx} \vec {fv}^{(r)}_t +b_{fx} \end{aligned}$$
(1)
where \(\vec {x}^{(r)}_t \in \mathbb {R}^{N_H}\), \({N_H}\) is the number of hidden states, and \(W_{fx}\) and \(b_{fx}\) are the weight matrix and bias vector, respectively. The gate operations are performed on \(\vec {x}^{(r)}_t\) using the \({N_H}\) hidden states (\(h_{t-1} \in \mathbb {R}^{N_H}\)) and internal states (\(c_{t-1} \in \mathbb {R}^{N_H}\)) from the previous time step, as defined below:
$$\begin{aligned} i_t = \sigma \left( W_{xi} \vec {x}^{(r)}_t + W_{hi} h_{t-1} + b_i\right) \end{aligned}$$
(2)
$$\begin{aligned} g_t = \phi \left( W_{xg} \vec {x}^{(r)}_t + W_{hg} h_{t-1} + b_g\right) \end{aligned}$$
(3)
$$\begin{aligned} f_t = \sigma \left( W_{xf} \vec {x}^{(r)}_t + W_{hf} h_{t-1} + b_f\right) \end{aligned}$$
(4)
$$\begin{aligned} o_t = \sigma \left( W_{xo} \vec {x}^{(r)}_t + W_{ho} h_{t-1} + b_o\right) \end{aligned}$$
(5)
$$\begin{aligned} c_t = f_t c_{t-1} + i_t g_t \end{aligned}$$
(6)
$$\begin{aligned} h_t = o_t \phi \left( c_t\right) \end{aligned}$$
(7)
where \(W_{ab}\) is a weight matrix (\(a \in \{x,h\}\), \(b \in \{i,g,f,o\}\)), and \(\sigma\) and \(\phi\) are the logistic sigmoid and tanh activation functions, respectively. The output (\(\hat{y}^{(r)}\)) of the many-to-one LSTM network is calculated from \(h_{t}\) of the last LSTM layer at the last \(\vec {x}^{(r)}\) in round r using the following linear transformation:
$$\begin{aligned} \hat{y}^{(r)}=W_{hy} h_{t} +b_y \end{aligned}$$
(8)
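The many-to-one pass of Eqs. (1)–(8) can be sketched as a minimal numpy forward pass. The weight shapes follow the equations above; the random initialization, the scale factor 0.1, and the choice \(N_H = 8\) are illustrative assumptions (the paper searches \(N_H\) in 16–224).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(FV, p):
    """Many-to-one forward pass of Eqs. (1)-(8) for one round.

    FV: (T, N_F) sequence of feature vectors; p holds the weight
    matrices W_* and bias vectors b_* named as in the text. Returns
    the scalar UPDRS-III estimate computed from the last hidden state.
    """
    N_H = p["W_hi"].shape[0]
    h = np.zeros(N_H)
    c = np.zeros(N_H)
    for fv in FV:
        x = p["W_fx"] @ fv + p["b_fx"]                         # Eq. (1)
        i = sigmoid(p["W_xi"] @ x + p["W_hi"] @ h + p["b_i"])  # Eq. (2)
        g = np.tanh(p["W_xg"] @ x + p["W_hg"] @ h + p["b_g"])  # Eq. (3)
        f = sigmoid(p["W_xf"] @ x + p["W_hf"] @ h + p["b_f"])  # Eq. (4)
        o = sigmoid(p["W_xo"] @ x + p["W_ho"] @ h + p["b_o"])  # Eq. (5)
        c = f * c + i * g                                      # Eq. (6)
        h = o * np.tanh(c)                                     # Eq. (7)
    return float(p["W_hy"] @ h + p["b_y"])                     # Eq. (8)

# Randomly initialized weights for illustration (N_F = N_SF = 26 features).
rng = np.random.default_rng(0)
N_F, N_H = 26, 8
params = {"W_fx": rng.standard_normal((N_H, N_F)) * 0.1, "b_fx": np.zeros(N_H),
          "W_hy": rng.standard_normal(N_H) * 0.1, "b_y": 0.0}
for gate in "igfo":
    params[f"W_x{gate}"] = rng.standard_normal((N_H, N_H)) * 0.1
    params[f"W_h{gate}"] = rng.standard_normal((N_H, N_H)) * 0.1
    params[f"b_{gate}"] = np.zeros(N_H)

y_hat = lstm_forward(rng.standard_normal((12, N_F)), params)  # 12 five-second windows
```

Because \(|h_t| \le 1\) elementwise (Eq. 7), the magnitude of the output is bounded by the \(\ell_1\) norm of \(W_{hy}\) plus \(|b_y|\), which is a convenient sanity check.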
After segmentation and feature extraction (refer to the segmentation and feature extraction sections), there was only one long-term feature vector for each 1-min window, while there were 12 short-term feature vectors (one per 5-s window). Therefore, we developed a dual-channel LSTM network to combine the two sets of feature vectors as a strategy to appropriately handle this difference in the numbers of short-term feature vectors (\(S^{(r)}=[\vec {fv}_1\vec {fv}_2...\vec {fv}_{N_{Ws}^{(r)}}]\)) and long-term feature vectors (\(L^{(r)}=[\vec {fv}_1\vec {fv}_2...\vec {fv}_{N_{Wl}^{(r)}}]\)). This method was based on building a separate LSTM channel for each of the short-term and long-term sets (\(\mathcal {D}_{S}\) and \(\mathcal {D}_{L}\), respectively) and then integrating the outcomes of the two channels into one UPDRS-III estimate using a fully connected layer. The feature vectors in both sets were linearly transformed by a fully connected layer to a depth of \(N_{H}\) hidden states in both channels (Eq. 1). The transformed feature vectors \(\vec {x}^{(r)}\) were then passed to a many-to-one LSTM network in each channel, as shown in Fig. 5a. The hidden states \(h_{t}\) from the last feature vector in both channels were then concatenated to create a fusion feature that was passed through a fully connected layer to estimate UPDRS III (Eq. 8).
Transfer learning: Due to the limited number of data rounds in the PD dataset available to train the LSTM network, we applied transfer learning to improve its performance. The weights of the LSTM network for estimating UPDRS III were not randomly initialized; instead, they were transferred from an LSTM network first trained on the PAMAP2 dataset to perform activity classification. Then, only the last LSTM layer and the fully connected layers were fine-tuned for estimating UPDRS III. Note that transfer learning could be used only with the hand-crafted features. Although the sensors in the PD and PAMAP2 datasets were placed on the same extremities, the axes’ orientations and the exact placements differed. Therefore, the weights of a deep model learned on PAMAP2 were not transferable to the PD dataset when raw signals were used. However, extracting features and averaging them across axes eliminated the effect of the differing sensor orientations between the two datasets.
1D CNN-LSTM network
We used a CNN as a data-driven feature extraction method to explore the raw signals. We fed the CNN feature maps into an LSTM network to model their temporal dependencies and estimate UPDRS III. Our proposed 1D CNN-LSTM is shown in Fig. 5b. It consisted of three convolutional blocks. The first block consisted of two convolutional layers with 32 filters of width 8, followed by a max-pooling layer. The second block had the same structure but was deeper, with 64 filters. The third block had one convolutional layer and a global average pooling layer, forming the bottleneck that extracts the short-term, data-driven features. These features were fed to a many-to-one LSTM network followed by two fully connected layers (96 nodes and one output node) to estimate UPDRS III. The number of convolutional layers was increased by repeating Conv Block-2 multiple times.
Training a well-performing CNN-LSTM model on a relatively limited number of training rounds can be challenging. To overcome this, we applied data augmentation by allowing a random start for each round of ADL and used a 0.5-dropout layer. In addition, we propose a novel two-stage training procedure. In the first stage, a CNN with a fully connected layer was trained on 5-s windows to estimate UPDRS III while extracting short-term features. The best CNN weights, selected based on the validation data, were saved. In the second stage, the fully connected layer of the pre-trained CNN was discarded since it does not extract new features. The features extracted by the CNN (i.e., from the global average pooling layer) were then fed to the LSTM network to estimate UPDRS III for each ADL round.
2D CNN-LSTM network
Many PD symptoms have spectral signatures; for example, tremor manifests at 4–6 Hz and bradykinesia at low frequencies. Therefore, a CNN can learn new temporal and spectral features if trained on time–frequency representations of the raw signals. For this purpose, we generated spectrograms by applying a short-time Fourier transform to the 1-min windows and taking the magnitude, using a 5-s Kaiser window with 90% overlap. The spectrograms of the windows from each axis were stacked to construct a time \(\times\) frequency \(\times\) axes tensor, which was fed to a 2D CNN-LSTM network as shown in Fig. 5c. The 2D CNN-LSTM consisted of three convolutional blocks; the first comprised two convolutional layers with 32 filters of size 5 \(\times\) 5, followed by a max-pooling layer. The rest of the architecture was similar to the 1D CNN-LSTM described above, except that all filters were of size 5 \(\times\) 5. The same two-stage training strategy was used to address the limited training data.
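The spectrogram tensor construction can be sketched as follows. The 5-s Kaiser window and 90% overlap come from the text; the Kaiser shape parameter `beta` is an assumption, as the paper does not report it.

```python
import numpy as np
from scipy.signal import stft

FS = 64  # sampling rate (Hz)

def spectrogram_tensor(X, fs=FS, win_sec=5, overlap=0.9, beta=14):
    """Magnitude spectrograms of a (num_samples, num_axes) 1-min window,
    stacked into a time x frequency x axes tensor. The Kaiser beta is
    an assumption; the paper specifies only a 5-s Kaiser window with
    90% overlap.
    """
    nperseg = int(win_sec * fs)        # 320 samples per STFT segment
    noverlap = int(overlap * nperseg)  # 288 samples of overlap (90%)
    specs = []
    for ax in range(X.shape[1]):
        f, t, Z = stft(X[:, ax], fs=fs, window=("kaiser", beta),
                       nperseg=nperseg, noverlap=noverlap)
        specs.append(np.abs(Z).T)      # time x frequency magnitudes
    return np.stack(specs, axis=-1)    # time x frequency x axes

X = np.random.randn(60 * FS, 6)        # one 1-min window, 6 gyroscope axes
S = spectrogram_tensor(X)
```

With a 320-sample segment, each spectrogram has 161 frequency bins (0–32 Hz), so the tremor band around 4–6 Hz is resolved at 0.2 Hz per bin.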
The Ensemble Model
We explored the accuracy of UPDRS-III estimation using an ensemble of the three developed models. As shown in Fig. 5d, the ensemble was formed by averaging the UPDRS-III scores from the individual models to obtain one estimate for each round of ADL.
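The averaging step amounts to a per-round mean over the three model outputs. The scores below are hypothetical, for illustration only.

```python
import numpy as np

# Hypothetical per-round UPDRS-III estimates from the three models:
# columns = dual-channel LSTM, 1D CNN-LSTM, 2D CNN-LSTM.
estimates = np.array([
    [28.1, 30.4, 29.2],   # round 1
    [15.7, 14.9, 16.3],   # round 2
])

# Ensemble: average the three models' scores for each round.
ensemble = estimates.mean(axis=1)
```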
Gradient Tree Boosting
Gradient Tree Boosting is a traditional machine-learning method widely used in practice for solving regression problems [38]. It is based on an ensemble of \(N_t\) weak regression trees (\(\{f_i\}_{i=1}^{N_t}\)) that estimate the output \(\hat{y}\), here the UPDRS-III score, as follows:
$$\begin{aligned} \hat{y}\left( \vec {fv}_t\right) =\sum _{i=1}^{N_t} {f_i\left( \vec {fv}_t\right) } \end{aligned}$$
(9)
where \(f_i(\vec {fv}_t)=w_{q(\vec {fv}_t)}\) is the prediction of regression tree i with L leaves, \(q(\vec {fv}_t)\) is the tree structure that maps \(\vec {fv}_t\) to the index of the corresponding leaf, and \(w \in \mathbb {R}^L\) is the vector of leaf weights. The regression trees are learned using an additive training strategy: one tree is added at each iteration to optimize an objective function that includes the first- and second-order gradient statistics of the loss function.
The short- and long-term feature vectors (refer to the feature extraction section) were combined into one feature vector and fed into the Gradient Tree Boosting model. For every 5-s segment in a 1-min interval, the long-term feature vector was repeated and concatenated with the corresponding short-term feature vector to form a matrix of \(N_{Ws}\) feature vectors with \(N_{SF}+N_{LF}\) features (\(SL^{(r)} \in \mathbb {R}^{N_{Ws}^{(r)} \times (N_{SF}+N_{LF})}\)). The combined set \(\mathcal {D}_{TB}=\{ (SL^{(r)},y^{(r)}) \}_{r=1}^{N_R}\) was used to train and test the model. During testing, the model first estimated \(\hat{y}\) for each feature vector in \(SL^{(r)}\), and these estimates were then averaged to obtain \(\hat{y}^{(r)}\) (i.e., the UPDRS-III score) for round r.
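The feature combination can be sketched with random placeholder features. The dimensions \(N_{SF}=26\), \(N_{LF}=32\), and the 12:1 ratio of 5-s to 1-min windows come from the text; the feature values here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SF, N_LF = 26, 32     # short- and long-term feature counts
n_min = 10              # number of 1-min windows in a 10-min round
n_ws = 12 * n_min       # number of 5-s windows (12 per minute)

S = rng.standard_normal((n_ws, N_SF))    # short-term feature vectors
L = rng.standard_normal((n_min, N_LF))   # long-term feature vectors

# Repeat each long-term vector over the 12 short-term windows it spans,
# then concatenate feature-wise into SL with N_SF + N_LF = 58 columns.
SL = np.concatenate([S, np.repeat(L, 12, axis=0)], axis=1)
```

Each row of `SL` is then scored by the boosted trees, and the per-row estimates are averaged to produce the round-level UPDRS-III estimate.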
Implementation
The UPDRS-III estimation methods were evaluated and compared using the data of the 24 PD subjects described in the dataset section, with LOOCV. In addition, an inner split was applied to the training data to select a random 20% for validation. The mean and standard deviation of the training data in each cross-validation iteration were calculated and used to normalize the entire data. The developed dual-channel LSTM and CNN-LSTM networks were implemented in TensorFlow [39]. In each cross-validation iteration, the networks were trained for 200 epochs using the Adam optimizer [40]. During training, the depths of the CNN and LSTM networks and the filter sizes were optimized by selecting the best-performing model on the validation data (i.e., maximum validation \(\rho\)) and then evaluating it on the held-out test data. The depth of the CNNs was increased by repeating Conv Block-2 up to four times. The LSTM hyper-parameter space (number of layers: 1–3; number of hidden states: 16–224) was searched. Mini-batches of size 2 and a learning rate of 1e-3 were used during training. In each mini-batch, the signals of all the rounds were repeated to have a length equal to that of the longest round. In addition, before feeding the hand-crafted or data-driven features of each round to the network in each epoch, a random start point was selected and the data prior to it were excluded. This augmentation approach was applied to prevent the LSTM network from memorizing the training sequences.
The Gradient Tree Boosting algorithm was implemented using the XGBoost library [38]. The learning rate was 0.1. A grid search was applied to find the optimal number of regression trees in the range of 10–200 with a step of 20, the tree depth in the range of 3–10 with a step of 2, and the percentage of features used per tree in the range of 10–50% with a step of 10%.
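The hyper-parameter search can be sketched as below. The paper uses XGBoost; here scikit-learn's `GradientBoostingRegressor` stands in so the example is self-contained, with `max_features` playing the role of the per-tree feature percentage. The grid is a representative subset of the paper's ranges, and the data are synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Representative subset of the paper's grid (trees: 10-200 step 20;
# depth: 3-10 step 2; feature fraction: 10-50% step 10%), kept small
# for brevity.
param_grid = {
    "n_estimators": [10, 50, 190],
    "max_depth": [3, 5, 9],
    "max_features": [0.1, 0.3, 0.5],
}
model = GradientBoostingRegressor(learning_rate=0.1)
search = GridSearchCV(model, param_grid, cv=3,
                      scoring="neg_mean_absolute_error")

# Synthetic stand-in for the combined feature set (58 features per vector).
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 58))
y = X[:, 0] * 4 + rng.standard_normal(60)  # synthetic UPDRS-III targets
search.fit(X, y)
```

In the paper, the grid search would instead be run inside each LOOCV iteration on the training subjects only, to avoid leaking the test subject into model selection.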