Data-Driven Models for Objective Grading Improvement of Parkinson’s Disease

Parkinson’s disease (PD) is a progressive disorder of the central nervous system that causes motor dysfunctions in affected patients. Objective assessment of symptoms can support neurologists in fine evaluations, improving patients’ quality of care. Herein, this study aimed to develop data-driven models based on regression algorithms to investigate the potential of kinematic features to predict PD severity levels. Sixty-four patients with PD (PwPD) and 50 healthy subjects of control (HC) were asked to perform 13 motor tasks from the MDS-UPDRS III while wearing wearable inertial sensors. Simultaneously, the clinician provided the evaluation of the tasks based on the MDS-UPDRS scores. One hundred-ninety kinematic features were extracted from the inertial motor data. Data processing and statistical analysis identified a set of parameters able to distinguish between HC and PwPD. Then, multiple feature selection methods allowed selecting the best subset of parameters for obtaining the greatest accuracy when used as input for several predicting regression algorithms. The maximum correlation coefficient, equal to 0.814, was obtained with the adaptive neuro-fuzzy inference system (ANFIS). Therefore, this predictive model could be useful as a decision support system for a reliable objective assessment of PD severity levels based on motion performance, improving patients monitoring over time. Electronic supplementary material The online version of this article (10.1007/s10439-020-02628-4) contains supplementary material, which is available to authorized users.


INTRODUCTION
Parkinson's disease (PD) is a common degenerative disorder of the central nervous system characterized by both motor and non-motor symptoms. Traditionally, in clinical practice, PD motor signs are assessed by neurologists while they observe patients performing motor tasks described in section III of the Movement Disorders Society Unified Parkinson's Disease Rating Scale (MDS-UPDRS III) and patient diaries. 10 Thus, the clinical evaluation is mainly based on the experience of clinicians that assign a score ranging from 0 (no signs clinically evident) to 4 (severely impaired) for each task performed by patients. Finally, the overall score, which is the sum of all the exercises, represents a semi-quantitative index for identifying the severity of the impairments due to the pathology. To evaluate the PD progression, it is necessary to understand the long-term monitoring of the disease, assessing the patients periodically.
This traditional evaluation, based on clinical scales, however, relies on clinical expertise, and is subjected to inter-and intra-observer variability. 20 Moreover, using diaries can be difficult for patients with recalling bias when reporting motor fluctuations. Therefore, the traditional evaluation methods are suboptimal for PD diagnosis and monitoring, and novel methods and technologies should be investigated. 6,22 In this direction, different tools have investigated as diagnostic or prognostic indicators, such as: the use of image guidance and robot-assisted techniques to im-prove the accuracy of electrode placement during deep brain stimulation surgery 29 ; the use of Leap Motion Controller as non-invasive methods to assess motor tasks 4 ; the use of EEG algorithmic complexity as prognosis for idiopathic rapid eye movement sleep behaviour disorder (RBD) which is a serious risk factor for PD 23 ; or the use of an olfactory identification test as prognosis for idiopathic hyposmia that is another preclinical marker of PD. 16 Furthermore, recent studies endorse the idea that wearable devices together with artificial intelligence technology could provide decision-making support systems that can help the clinical practitioners in the objective assessment of PD, 26 allowing also the long-term monitoring and management of the pathology. 18 Furthermore, accurately defining PD subtypes can be challenging as well. Since PD is a progressive disease with heterogeneity in individual disease trajectories, 31 investigating the longitudinal clinical records is necessary to understand the disease progression. To detect the motor impairment changes and to identify PD related symptoms in individual, multiple sensor-based measurements of UPDRS could provide useful information. 8 Capturing the data from multiple affected limbs (upper and lower) provides an opportunity for investigating the relationships between clinical rating scale of motor tasks. Moreover, machine learning methods could provide a useful decision support system to support clinicians in PD management. 6,22 In this direction, recent studies analysed motor tasks for PD subtyping identification with wearable technology that primarily relied on machine learning techniques 9 to develop predictive models. This is basically treated as a clustering problem, where the model can compute continuous numeric signal metrics for the subgrouping of the clusters. In previous studies, for the assessment of the motor dysfunction in PD, different regression models were applied. The regression analysis provides a functional relationship among variables that are expressed in the form of a dependent variable and one or more explanatory variables. In a recent study, logistic regression was used to classify freezing of gait and non-freezing of gait in PD patients affected by motor fluctuations by using acceleration sensors. 19 Regression techniques were used to assess the fine motor skills of PD patients and healthy subjects through the analysis of touchscreen typing pattern. 13 In this study, evolving the binary classification problem into a regression analysis for estimating the severity of individual PD motor symptoms could assist physicians by providing explainable insights into the subject's condition against PD. Furthermore, a support vector machine regression model was also suc-cessfully used for remote tracking of PD based on speech signal analysis. 7 Additionally, Patel et al. 18 proposed to implement a regression random forest for the longitudinal assessment of motor fluctuations by analysing MDS-UPDRS III tasks from both upper and lower limbs with wearable sensors. Finally, a treatment-response index estimated from wearable sensors for quantifying PD motor states was developed by Thomas et al., 27 thus supporting personalized treatment for advanced PD patients.
In general, very few studies, to the best of our knowledge, use the clinical score as a predictor against the extracted parameters from wearable technology and apply regression models to classify healthy and PD subjects. Moreover, there are few studies focused on multiple sensor-based measurements approach to capturing data from multiple affected limbs of PD. First, our study aims to understand the significance of the extracted parameters for the prediction of Parkinson disease progression using motion data acquired from two wearable devices, called SensHand 6 and SensFoot. 22 In this study, we aimed at investigating the relationship between the extracted information from the sensors and the clinical scores provided by the clinicians according to the MDS-UPDRS. To achieve this goal, multiple feature selection methods, 4,15 combined with multiple regression models, 11,12,17,25 were investigated to understand which technique could be the most helpful for longitudinal clinical data analysis and long-term monitoring of PD progression. We already investigated classification approach with supervised learning in previous works, considering both binary classification 21 (i.e., PD patients vs healthy controls), and multigroup classification (i.e., PD patients, healthy subjects, and subjects with idiopathic hyposmia that are at risk for developing the disease). 6,22 Differently, here we want to investigate whether our system is able to identify the pathology progression, from the mild to advanced stages. Therefore, we applied the regression approach, which is projected towards the possibility to have a continuous curve of the evolution of the pathology for improving the fine assessment of PD patients according to their motor performances.
The remain of the paper is structured as follows: in the Materials and Methods section we describe the participants to the study, the wearable system used to measure the motor performance, the experimental protocol and data analysis. In the Results section, the findings about features selection and the accuracy of the regression models are reported. Then, in Discussion section, we provide an interpretation of the results considering also the limitations of the study.

MATERIALS AND METHODS
The methodology applied in this study is synthesized in the flowchart in Fig. 1.

Involved Subjects
A total of 114 subjects (64 PD patients: 40 males, 24 females; mean age ± standard deviation (SD) 66.60 ± 8.87 years old, mean Hoehn&Yahr stage ± SD 1.9 ± 0.7, mean MDS-UPDRS III score ± SD 15.7 ± 8.9; and 50 HC: 39 male, 11 females; mean age ± standard deviation 65.46 ± 2.70 years old) were involved in this study. The neurologist that supported the experiment defined the exclusion criteria. Those who had impairments (e.g., prosthesis or arthrosis) or diseases other than PD (e.g., neurological) that could affect the performance of the required tasks were excluded from the study. Before and during the experiments, all patients were on the medication state and were examined by the neurologist. All subjects involved in this study gave written informed consent before the beginning of the experimental sessions. Procedures of this study were approved by the local Medical Ethical Committee (Azienda Sanitaria Locale, Massa and Carrara, Italy; Approval No. 1148/ 12.10.10).

Instruments
The system used in this work is composed of two wearable modules that provide an objective and quantitative analysis of the movements of the upper and lower limbs through inertial measurement units (IMUs). IMUs, integrated into the iNEMO-M1 board and composed of three-axis gyroscope L3G4200D, sixaxis geomagnetic module LSM303DLHC and ARMbased 32-bit microcontroller STM32F103RE (STMicroelectronics, Italy), were used to develop SensHand and SensFoot device (Fig. 2), respectively, for upper and lower limb analysis. 22 The SensHand includes 4 iNEMO-M1, which are positioned on the wrist and on the distal phalanges of the thumb, index, and middle fingers and connected and synchronized between them through the CAN-bus standard.
The SensFoot includes a iNEMO-M1 and is placed on the dorsum of the subject's foot with a Velcro-strap to ensure integrity between the foot and sensor. 6 Both modules are supplied by a rechargeable LiPo battery and integrated with a Bluetooth module (SPBT2632C2A, v3.0, STMicroelectronics) for wireless data transmission to a remote PC, where data were stored and analysed offline. Both modules collected data with a sampling frequency of 100 Hz.

Experimental Protocol and Feature Extraction
The trial session included exercises for upper and lower limbs, based on the tasks of the MDS-UPDRS III. During the exercises, the subjects took a comfortable and standardized sitting position, holding a right angle between the trunk and the thigh (on the hip) and between the thigh and the shin (on the knee). Initial and final fixed positions were established for each exercise to allow a static acquisition of 3 s to acquire the initial position as a reference for each trial.  Data acquired from both accelerometer and gyroscope were preprocessed with adequate filters: bandpass 0.5-15 Hz for HRST, bandpass 0.5-20 Hz for POST, low-pass with cutoff frequency equals to 3 Hz for GAIT, low-pass with cutoff frequency equals to 5 Hz for all the others. Custom-made algorithms were applied to calculate spatiotemporal and frequency features, as already described in previous works 6,22 for both upper and lower limbs, obtaining 190 parameters.  All exercises were performed twice, both for the left and right sides; then, the average value of the parameters for each side was calculated and used for the data analysis.

Statistical Analysis
All the statistical analyses were performed with SPSS21 (IBM, Armonk, North Castle, NY, USA). First, the Kolmogorov-Smirnov test was applied to each measured parameter to verify if it had a normal or abnormal distribution. Since each parameter had an abnormal distribution, the unpaired Mann-Whitney (MW) U-test for non-parametric samples was calculated 28 to identify the features that were statistically significant for distinguishing between the HC and PD groups. The test is issued to determine whether two independent samples (in our case healthy and patients) have the same distributions with the same median without assuming normal distributions. 9 The choice of the cut-off points to accept the alternative hypothesis was p £ 0.05. 2 All the measured features were normalized according to the linear relationship reported in Eq. (1): The normalized features were used for further investigations in this paper. This scaling procedure is more robust to control for both scale (min and max) and location (mean). The mean and standard deviation of a group gives information on how different the individual group population is from the other group in terms of the location of numeric values (mean) and variation in the values (standard deviation). With the linear scaling, it is easier to estimate the group difference in terms of mean and standard deviation between the values 0 and 1.
Feature Selection Algorithms All the features identified as significant by the MW test were included for the analysis with multiple feature selection methods. The feature selection step involves finding the most relevant subset of features from the original set by eliminating inappropriate or redundant features. 14 In this study, multiple tests were run using various algorithms developed in Weka 3.8 to define the optimal subsets of attributes, which would help to avoid overfitting due to redundant information. The chosen feature selection methods have been used in previous studies to obtain the optimum set of parameters to increase the overall accuracy of the system. 4,15 Specifically, five feature selection methods 1 were investigated in this study: Correlation-based feature subset evaluation (CfsSubsetEval): This method evaluates the worth of a subset of attributes that work well together, preferring sets of attributes that are highly correlated with the class but that have low intercorrelation. This multivariate method ignores irrelevant and redundant features from the dataset. This evaluator is implemented with the Best-First search algorithm.
PCA ranker: Ranker methods basically use singleattribute evaluators to generate a ranked list. Unlike other single-attribute evaluators, PCA transforms the set of attributes and then ranks the new attributes according to their eigenvalues.
Correlation attribute evaluation: This method evaluates the worth of an attribute by measuring the correlation coefficient, based on Pearson's correlation, between each attribute and the class.
C 2 attribute evaluation: The C 2 is calculated between each feature and the target, and the features with the best C 2 scores are selected. This method determines if the association between two categorical variables of the sample would reflect their real association in the population.
Wrapper subset evaluation: This method evaluates the worth of the attribute sets by using a learning scheme, and it employs cross-validation to estimate the accuracy of the learning scheme for each set of attributes. Based on the prediction performance, a score is assigned to each subset, and the best subset is chosen.

Regression Methods for PD Severity Evaluation
After selecting the optimal feature array, multiple regression techniques were applied to evaluate the accuracy of the proposed system in automatically quantifying the severity of the disease in clinical terms. Indeed, the regression models can estimate the functional relationship between explanatory variables and a target variable. Moreover, regression trees can be studied to develop a prediction model able to overcome the discrete scores and severity levels of the traditional clinical scales toward the possibility to associate a continuous response score to each patient according to the pathology progression. In particular, supervised machine learning techniques, such as support vector regression (SVR), 17 random forest (RF), 3 adaptive neuro-fuzzy inference system (ANFIS), 12,17 and linear regression (LR) 11 were tested in this work. Based on preliminary investigations with different hyperparameters configurations, default hyperparameters tuning was used since it allowed obtaining the best accuracy in regression models. All the methods were run over the ten-fold cross-validation. For applying the regression models, both the clinical scores and the extracted features from the sensors were used in the normalized form (see Eq. 1) to make sure to fit the model completely and not to overlap the predictor and target values. All the prediction algorithms were developed on the open-source platform Weka 3.8 except for the ANFIS model, which was developed on MATLAB 2019b (The MathWorks, Inc., Natick, MA, USA). Following, a description of each model is reported:

Support Vector Regression
Support vector machine was proposed for solving the binary classification problem. As for the multi-class classification problem, the form of SVR allows for estimating the continuous function of training datasets. It is able to model complex nonlinear relationships by using an appropriate kernel function that maps the input into higher-dimensional feature space and transforms the nonlinear relationships into linear forms. 17 Since previous studies endorsed the significance of the RBF kernel, 11 it was used also in this work. For the prediction, the margin of tolerance (epsilon) for the loss function is set to 0.1, while the weight vector is set to measure the number of successive trails.
SVR looks for the optimal function f x ¼ w; x ð Þþb, where w is the weight vector and b is the threshold value. Thus, SVR minimizes the expected risk prediction. The optimal function is solved in the feature space to make predetermined risk function minimization. After the introduction of the kernel, the regressive function becomes as in Eq. (2): where K x i ; x ð Þ is a kernel function, and a i and a Ã i are Lagrangian multipliers. 17

Adaptive Neuro-Fuzzy Inference System
The path for building fuzzy systems and information extraction is usually based on two parts (i.e., the knowledge formulation from the conscious path and knowledge formulation from the subconscious path). In the first path, rules and membership functions (MF) are consequent from human intelligence based on their expertise, experience, and understanding. With the subconscious formulation of knowledge, rules and membership functions are developed using automated techniques such as gradient methods, learning techniques, and clustering methods. Fuzzy reasoning is a procedure to derive a conclusion from a set of fuzzy ifthen rules. These rules are evaluated using the MF, which is a curve where each point in the input space maps the degree of membership between 0 and 1. The fuzzification step maps the input characteristics to quantify the grade of membership of the fuzzy set via the Gaussian MF with hybrid learning methodology. Then, defuzzification converts the input MF into ifthen rules, which derive the output characteristics for output MFs. Before gathering data using a fuzzy Cmeans (FCM), some parameters need to be quantified, such as: number of clusters (in our case 2), partition matrix exponent (default is 2) for final partition or membership function matrix U, maximum number of repetitions (default is 200), and minimum improvement (default is 1e-5). For this work, the default stopping threshold of 10 À5 was applied. 24 The final step is associated with the sum of all the output MFs into one single-value output. In this study, a fuzzy inference system (FIS) was generated with an FCM algorithm to cluster the inputs. The dataset was divided into a predefined number of clusters using the FCM algorithm. The number of clusters is pre-specified to help in defining the preliminary number of rules for each cluster and the membership of each set of data in the cluster. This technique is also known as semisupervised since the information about the MF is predefined by the user with the identifying hyper-parameters. 17

Random Forest
Random forest uses multiple learning algorithms for forecasting both classification and regression problems. RF combines the results of decision trees trained by the ''bagging'' method. RF is one of the most accurate classifiers among the current algorithms that use decision tree methods and maintains accuracy when a large proportion of the data are missing. It can handle many input variables. It generates an internal unbiased estimation of the generalization error as the forest building progresses.. It runs quickly to produce a forest of decision trees for the classifier. It is assumed that the number of cases in the training set is N, and the number of variables in the classifier is M. The system selects the number of input variables that will be used to determine the decision at a node of the tree. For each node of the tree, m of the M variables is randomly selected, and the decision at the node is based on them. The best split based on these m variables is calculated in the training set.

Linear Regression
Linear regression is a linear approach to modelling the relation between scalar response and one or more explanatory variables. When the outcome and all the attributes are numeric, linear regression is a natural technique to consider. The idea is to express the outcome as a linear combination of the attributes, with predetermined weights as in Eq. (3): where x is the class; a 1 ; a 2 . . . :a k are the attribute values; and w 0 ; w 1 ; . . . w k are weights. The weights are calculated from the training data. 30

Mann-Whitney Significance Test Between PD Patients and HC
The unpaired MW U-test was employed to identify the significant features, among the 190 extracted parameters, for comparing healthy controls and the PD group. Both right and left sides were separately compared, and a parameter was considered significant if the MW U-test showed statistical significance in at least one side. Normalized mean values, standard deviations and p value were reported in Tables 1 and 2 for each measured feature. Statistically significant parameters were selected from all the exercises; in  particular, 52 features were selected from the lower limbs and 78 from the upper limbs. Therefore, each exercise contributed to distinguishing PD and HC. Furthermore, both features extracted from accelerometer and gyroscope showed significance, thus suggesting that both the sensors are equally important for the assessment of the disease progression.
One hundred thirty significant parameters were selected and further investigated with the feature selection methods to select the best subset of features that could be applied as input to data-driven predictive models (i.e., SVR, RF, ANFIS, and LR) to improve the PD severity classification.

Predictive Accuracy from the Regression Methods
The four investigated regression algorithms were applied to each of the five feature selection methods, and the accuracy of the obtained results are reported in terms of correlation coefficient and root mean squared error in Table 3.
The correlation-based feature subset evaluation feature selection method with the Best-First search algorithm selected thirteen featuresas input for the predictive models. The best-performing model was ANFIS with 0.814 accuracy (as correlation coefficient), whereas SVR, RF, and LR, obtained 0.799, 0.790, and 0.738, respectively.
Nineteen parameters were selected with the PCA ranker method, 20 parameters with the correlation attribute evaluation, 15 parameters with the C 2 attribute evaluation, and 12 with the wrapper method. RF resulted in the best performance for PCA ranker (0.685), correlation attribute (0.648), and wrapper evaluation (0.606). Differently, ANFIS achieved the best performance for the aforementioned CfsSubsetEval and for the C 2 attribute evaluation method.

DISCUSSION
This study aimed at providing an objective assessment of fine motion performance decline in PD motor symptoms and at examining different regression models that allowed the identification of a linear relationship between the measured motion parameters and the clinical scores assigned by the neurologists according to the MDS-UPDRS III. Currently, indeed, the clinical scales represent the gold standard for clinical evaluations. Therefore, the first step of our study is to demonstrate that our system agrees with traditional clinical evaluations. Furthermore, it offers the neurologist the opportunity to objectify the clinical assessment. This is a fundamental step to achieve the trust of the neurologist in adopting our system as a decision support tool for improving PD diagnosis procedure. Using regression algorithms can significantly improve the method of performing PD diagnosis because they not only allow identification of whether or not a subject has motor impairments caused by the pathology, as supervised binary classification algorithms can be used, but they also could allow better ranking of the disease progression. Regression models, indeed, permit quantification of the grade of these impairments, allowing classification of patients according to the severity of PD. After 190 spatiotemporal features were extracted from the wearable sensor system, the final feature sets were selected for the regression models and features were included from both upper and lower limbs, reinforcing the hypothesis that a complete motion analysis provides better results than using a reduced set of exercises. 5 Indeed, the set of spatiotemporal features that obtained the maximum accuracy included parameters extracted from Gait, Heel-Toe Tapping, Pronation/ Supination, Finger Tapping, Arms Swing, and Postural Tremor, confirming the significance of these exercises already found in previous works 4,5 for the assessment of motor dysfunction in patients with PD. Furthermore, the selected features are derived from accelerometers and gyroscopes, suggesting that both sensors are equally important for capturing clinically relevant information.
Among the different investigated feature selection methods, the ''CfsSubsetEval'' provided the best solution for selecting a feature array of 13 parameters to use as input for the regression algorithms. In particular, the best performance was obtained with the AN-FIS model that reached a correlation coefficient equal to 0.814. The good results endorse the significance of the selected parameters to evaluate the motor performance in PD patients. ANFIS and SVR were also investigated together previously for the assessment of the PD progression, 11,12,17 and particularly ANFIS already achieved the highest accuracy 12,17 for the assessment of disease progression, promoting the idea that such algorithm could represent a robust method to assess the progressive degenerative process that characterizes the PD development. Thus, it should be further investigated in future studies using larger datasets.
Since ANFIS is a combination of a fuzzy inference system (FIS) and neural network (NN), it can learn automatically from the data, which incidentally is the strength of feed-forward artificial NN. 17 NN represents a broad class of computational models inspired by biological networks found in the central nervous systems and animal brains. They can be used to approximate unknown mapping of a large number of inputs, 12 such as the parameters derived from the wearable system used in this study. Moreover, the Cmeans clustering used to develop the ANFIS model improved the predictive accuracy. Fuzzy C-means is a method of clustering that allows one piece of data to belong to two or more clusters. This capability is important for the intended application since the measurements used for PD assessment include several nonlinear patterns.
Nevertheless, the main limitation of the ANFIS model is that it cannot work very well with many features. Thus, there is the need to identify a reduced number of significant features that allows the highest accuracy to be obtained when using this predictor method.
Additionally, even if a sizeable number of subjects were involved in the study when considering different stages of pathology, the size of each group drastically decreases, resulting also in unbalanced groups that is a limitation of this study and could bias the accuracy of the system. Thus, in future works, the dataset should be enlarged, paying attention to recruiting balanced groups of patients with different levels of the disease. Also, in this work we had not the possibility to average multiple clinical gradings from different physicians, but we performed single evaluations on patients collaborating with the Neurology department of the Apuane Hospital, a small hospital in the Nord-West area of Tuscany, Italy. The Neurology department contributed to this study with a team, composed of two neurologists (the head physician and another physician) and one assistant physician that performed all the clinical evaluations. Adding more clinicians for evaluating the performance of the patients according to the clinical scales could reduce the inter-rater variability and improve the accuracy of the machine learning methods. In future works, we are managing multicenter studies with HD video recording evaluations to enable more physicians to clinically evaluate patients directly in place or offline later and then make a more statistically robust grading and data processing.
In conclusion, this work shows a high linear correlation between the extracted information from the sensors and clinical scores provided by the clinician. Therefore, this work represents a first unavoidable step toward a data-driven objective assessment of the pathology that could support the neurologist in improving the accuracy of the PD evaluations. Such a system, combining wearable technologies and artificial intelligence techniques, can aim at overcoming the traditional assessment of the pathology, based on semiquantitative scales with a limited number of levels, where the scores are mainly based on the expertise of the clinicians, and provide a more continuous grading of the disease, defined on the basis of objectively measured parameters. Certainly, a data-driven assessment can help to reduce the inter-rater variability that typically affects the PD diagnosis, moving to more precise evaluations, and potentially improve the quality of care for PD patients.

FUNDING
Open access funding provided by Universita`degli Studi di Firenze within the CRUI-CARE Agreement.

OPEN ACCESS
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://crea tivecommons.org/licenses/by/4.0/.