Introduction

Parkinson’s disease (PD) is the second most common chronic, progressive, neurodegenerative disease, affecting more than 10 million people worldwide. In the United Kingdom, two people are diagnosed every hour, and an estimated 145,000 people were living with a PD diagnosis in 2018 [1]. Extensive research has shown that Rest Tremor (RT) is the most common and most easily recognised symptom of PD [2]. RT describes unilateral, involuntary, rhythmic and alternating movements of a relaxed and supported limb, mostly the hands, and typically disappears with action; it may also appear in the lips, chin, jaw and legs [3]. RT presents in 70–\(90\%\) of PD patients and occurs at a frequency between 4 and 6 Hz [2]. In addition to RT, Kinetic Tremor (KT) and Postural Tremor (PT), both considered action tremors, can also occur [4]. KT is a type of tremor present during voluntary hand movements such as touching the tip of the nose or writing. PT occurs when a person maintains a position against gravity, such as stretching out the arms. PT occurs at a frequency between 6 and 9 Hz, while KT occurs at a frequency between 9 and 12 Hz [4].

Tremor severity often indicates PD progression and severity, and it can also be used to evaluate treatment efficacy. Currently, Parkinson’s tremor severity is scored on the Unified Parkinson’s Disease Rating Scale (UPDRS) from 0 to 4, with 0: normal, 1: slight, 2: mild, 3: moderate, and 4: severe [5]. The UPDRS is a clinically based scale in which the clinician assigns numerical scores based on qualitative observations of the patient in various postures; such observations are often insensitive and subjective. The assessment depends on the examiner’s skills and knowledge and varies from one examiner to another, so examiners frequently disagree on assessments and scores [6,7,8]. There is evidence that the UPDRS has high inter- and intra-rater variability [9,10,11]. Thus, a patient’s tremor may be assigned a UPDRS score by one examiner and, at the next appointment, be assessed by a different examiner and assigned a higher score. In this situation, it is difficult to determine whether the two different scores reflect worsening symptoms or merely subjectivity.

The process associated with the UPDRS assessment is also time-consuming. It requires a lengthy administrative procedure of approximately 30 min, and it needs specialised official training to improve the consistency of data acquisition and interpretation; together, these make it impractical for routine clinical practice [5, 12]. A further time burden is that many elements of the UPDRS assessment must be completed by patients, requiring additional time on top of the time the examiner needs to review those elements. This time burden limits the use of the UPDRS in routine clinical practice; the scale is therefore mainly used in clinical research.

The limitations mentioned above make the UPDRS unwieldy in routine clinical practice and might lead to poor management of PD and wasteful use of resources. Many studies have tried to classify tremor severity scores using Machine Learning (ML) algorithms combined with signal processing techniques [13]. A well-documented challenge in applying ML algorithms to medical data is imbalanced class distributions, i.e. the scarcity of one or more classes in the training data, which causes false negatives [14]. Such misclassification can lead to wrong assessments and consequently affect treatment. Several approaches have been suggested to address the imbalanced-dataset issue, and they can be divided into three major groups [15]: (a) data resampling, (b) algorithmic modification, and (c) cost-sensitive learning. Data resampling techniques include over-sampling, under-sampling, and hybrid approaches (combined over- and under-sampling). Resampling techniques have proven effective for handling imbalanced datasets in different applications [16]. However, it is well established that different resampling techniques behave differently with different datasets and different ML algorithms [17,18,19].

A thorough review of the relevant literature indicates that this is the first study to explore different resampling techniques with two classifiers, namely Random Forest (RF) and an artificial neural network based on a multi-layer perceptron (ANN-MLP), to enhance tremor severity estimation on a highly imbalanced tremor severity dataset. In addition, this study offers important insights into advanced metrics and into how standard metrics can misrepresent classification results. Moreover, this study enhances tremor severity detection with very high accuracy without neglecting minority classes.

The remainder of this paper is organised as follows: Section “Related Work” reviews the related literature. Section “Methodology” explains the proposed methodology, including the dataset description, signal analysis, feature extraction, the application of different resampling techniques with classifiers, and evaluation. The results are presented in Section “Results and Discussions”. Section “Conclusion and Future Work” concludes the paper and gives directions for future work.

Related Work

Parkinson’s disease severity estimation could bridge the gap between the clinical scales currently used by clinicians and objective evaluation methods, in such a way that objective measurements align with clinical scoring systems such as the UPDRS. Therefore, several studies have explored different sensors and methods for estimating tremor severity, including soft computing techniques and statistical analysis. A few of these studies are reviewed below.

Niazmand et al. [20] used an accelerometer to estimate tremor severity, utilising data collected from 10 PD patients and 2 healthy control subjects. The data were collected from triaxial accelerometers integrated into a pullover while subjects performed the rest and posture UPDRS motor tasks. The movement frequency was calculated using a peak-detection technique, and tremor was assessed based on frequency, as shown in Table 1. The correlation between the accelerometer measurements and UPDRS scores was calculated, achieving \(71\%\) sensitivity for detecting tremor and \(89\%\) sensitivity for detecting posture tremor. However, the study achieved good results only when the pullover fit the patient exactly; performance was lower for a loose-fitting pullover. This limitation can be a barrier to using the pullover for routine and continuous PD assessment, since it might not be comfortable for patients and it is difficult to design a pullover that fits all patients. On the other hand, a tightly fitting pullover might increase muscle tension, particularly in posture tremor, and the accelerometer position can change depending on the executed movements. The study is also limited by the lack of information about patients’ UPDRS severities; the data might be biased towards some severities, which could affect the performance measurements.

Table 1 Tremor severity vs frequency ranges

Rigas et al. [21] conducted a study to estimate tremor severity using a set of wearable accelerometers placed on the subjects’ arms. The study involved 10 PD patients with tremor ranging from 0 to 3 on the UPDRS scale, 8 PD patients without tremor, and 5 healthy subjects. Data were collected while subjects performed Activities of Daily Living (ADLs). The collected data were processed through low-pass and band-pass filters with cut-off frequencies of 3 Hz and 3–12 Hz, respectively. A set of features (dominant frequency, energy of the dominant frequency, high- and low-frequency energy, spectrum entropy and mechanical energy) was extracted from the filtered signals. For severity classification, a Hidden Markov Model (HMM) was employed with leave-one-out cross-validation (LOOCV). They achieved \(87\%\) overall accuracy, with \(91\%\) sensitivity and \(94\%\) specificity for tremor 0, \(87\%\) sensitivity and \(82\%\) specificity for tremor 1, \(69\%\) sensitivity and \(79\%\) specificity for tremor 2, and \(91\%\) sensitivity and \(83\%\) specificity for tremor 3. However, the collected data lacked the most severe class (tremor severity 4); the model therefore cannot be generalised to assess all patients, particularly PD patients with severity-4 tremor. The sensitivity of \(69\%\) and specificity of \(79\%\) for severity-2 tremor are also relatively low.

Bazgir et al. [22] used a smartphone’s built-in triaxial accelerometer to estimate tremor severity for 52 PD patients. Data were acquired from a triaxial accelerometer attached to patients’ hands. A set of features was extracted, including power spectral density (PSD), median frequency, frequency dispersion and the fundamental tremor frequency. Tremor severity was then estimated using an artificial neural network (ANN) classifier. The results were significant, achieving \(91\%\) accuracy. However, no other performance metrics were reported, such as error rate, sensitivity, specificity, precision and F-score, which are very important for evaluating classification models, especially in medical classification problems, since accuracy neglects the difference between types of errors [23]. In a recent follow-up study, Bazgir et al. [24] tried to improve the accuracy using a Sequential Forward Selection (SFS) approach for feature selection, with the same signal processing and feature extraction approaches. In that study, \(100\%\) accuracy was achieved with a Naïve Bayes classifier. However, as in the previous study, no other performance metrics such as sensitivity and specificity were reported.

The authors in [25] collected triaxial accelerometer data from 19 PD patients using a smartwatch while the patients performed five motor tasks: sitting quietly, folding towels, drawing, hand rotation and walking. A wavelet feature extraction technique was used to process the acquired signal and extract the relative energy and mean relative energy for each accelerometer axis. The extracted features were used to predict tremor severity at three levels, 0, 1 and 2, where 2 represents tremor severities 2, 3 and 4, using a support vector machine (SVM) classifier; the prediction was made by summing the predictions over all axes, since the tremor is often present in only one axis. The model was evaluated using LOOCV and achieved \(78.91\%\) overall accuracy, \(67\%\) average precision and \(79\%\) average recall. However, the precision for severity 2 was \(28\%\), which is very low compared with severities 1 and 3, which achieved \(98\%\) and \(75\%\) respectively. Moreover, a major problem with this experiment is that severities 2, 3 and 4 were combined into a single score of 2, which does not reflect the actual severity of the tremor; this might not be helpful for neurologists assessing the tremor, and it does not capture tremor progression, especially in advanced stages.

Based on an extensive review of the literature, only one study was identified that utilised an over-sampling technique, in that case to distinguish PD patients from healthy people using speech signals [26]. The authors combined an RF classifier with over-sampling and obtained \(94.89\%\) accuracy. However, their study did not classify tremor severity.

A common limitation of most of the previous studies is that the authors did not take into consideration all tremor levels and the imbalanced class distribution in the collected data. Also, some of these studies only used data collected while subjects performed specific tasks, which do not necessarily include ADLs. Some studies did not report advanced performance metrics such as sensitivity, specificity, F-score, AUC and IBA, which are very important for evaluating classification models.

Methodology

Figure 1 illustrates the proposed framework for classifying an imbalanced tremor severity dataset using resampling techniques. In the first step, the raw accelerometer signal is preprocessed to eliminate sensor-orientation dependency, non-tremor data and artefacts. In the second step, a set of tremor severity features is extracted from the preprocessed signal. In the third step, the data are split into training and test subsets, and the training data are resampled to avoid classifier bias. Finally, in the fourth step, the training and test data are passed to a classifier to estimate tremor severity, and the results are evaluated. Each step is described in detail in the subsequent sections.

Fig. 1

Proposed framework for tremor severity classification

Signal Processing

In order to extract meaningful features from the accelerometer data, some preprocessing was performed to eliminate non-tremor data and artefacts. The vector magnitude of the three orthogonal accelerations, namely \(A_X\), \(A_Y\), and \(A_Z\), was calculated to avoid dependency on sensor orientation and to avoid processing the signal in three dimensions. Also, since this work focuses on the severity of any tremor type, band-pass Butterworth filters with cut-off frequencies of 3–6 Hz for RT, 6–9 Hz for PT and 9–12 Hz for KT were applied to remove low- and high-frequency bands and retain the tremor bands, as suggested by earlier work [4].
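The preprocessing described above can be sketched as follows, assuming NumPy and SciPy are available. The 50 Hz sampling rate, filter order and synthetic test signal are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def vector_magnitude(ax, ay, az):
    """Combine the three axes to remove sensor-orientation dependency."""
    return np.sqrt(ax**2 + ay**2 + az**2)

def bandpass(signal, low_hz, high_hz, fs, order=4):
    """Band-pass Butterworth filter retaining one tremor band."""
    nyq = fs / 2.0
    b, a = butter(order, [low_hz / nyq, high_hz / nyq], btype="band")
    return filtfilt(b, a, signal)  # zero-phase filtering

fs = 50.0  # assumed sampling rate in Hz
t = np.arange(0, 4, 1 / fs)
ax = np.sin(2 * np.pi * 5 * t)   # 5 Hz component (rest-tremor band)
ay = np.sin(2 * np.pi * 10 * t)  # 10 Hz component (kinetic-tremor band)
az = np.zeros_like(t)

mag = vector_magnitude(ax, ay, az)
rt = bandpass(mag, 3, 6, fs)    # rest tremor band
pt = bandpass(mag, 6, 9, fs)    # postural tremor band
kt = bandpass(mag, 9, 12, fs)   # kinetic tremor band
```

Each filtered signal then feeds the windowing and feature-extraction steps below.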

The filtered signals were split into 4 s windows that can be labelled and used as inputs. Fixed-length sliding windows with \(50\%\) overlap were utilised, which has been shown in the literature to be effective in activity recognition [27]. The window time series is represented as:

$$\begin{aligned} {\{a_t\}}_{t=0}^{w_l} \end{aligned}$$
(1)

where \(a_t\) is the acceleration at time t and \(w_l\) is the window length (number of samples).
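A minimal sketch of the windowing step, assuming NumPy; the window length of 200 samples corresponds to 4 s at an assumed 50 Hz sampling rate.

```python
import numpy as np

def sliding_windows(signal, w_l, overlap=0.5):
    """Split a 1-D signal into fixed-length windows with the given overlap."""
    step = int(w_l * (1 - overlap))
    return np.array([signal[i:i + w_l]
                     for i in range(0, len(signal) - w_l + 1, step)])

sig = np.arange(1000)
windows = sliding_windows(sig, w_l=200)  # 4 s windows at an assumed 50 Hz
# consecutive windows share 100 samples, i.e. 50% overlap
```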

Guided by previous work, a set of features was extracted from each window in the time and frequency domains. The selected features and their descriptions are listed in Appendix A (Table 8). The features were extracted from the three filtered signals, 3–6 Hz for RT, 6–9 Hz for PT, and 9–12 Hz for KT, to form a 102-element feature vector. Frequency-domain features were calculated after transforming the raw signal from the time domain to the frequency domain using the Fast Fourier Transform (FFT), according to Expression 2 below.

$$\begin{aligned} \displaystyle F(k) = \displaystyle \sum _{t=0}^{W_l-1}a_t e^{\left( \frac{{ -j2\pi kt}}{W_l}\right) } \qquad \text {For}\quad k=0 \;... \;W_l-1, \end{aligned}$$
(2)

where F(k) is a complex sequence with the same dimensions as the input sequence \({(a_t)}_{t=0}^{w_l}\) and \(e^\frac{{ -j2\pi }}{W_l}\) is a primitive \(W_l\text{th}\) root of unity.
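Expression 2 is the standard discrete Fourier transform, so evaluating the sum directly should match a library FFT; the small window length below is only for illustration.

```python
import numpy as np

w_l = 8
a = np.random.default_rng(0).normal(size=w_l)  # one toy window
t = np.arange(w_l)

# Direct evaluation of Expression 2 for each k
F_direct = np.array([np.sum(a * np.exp(-2j * np.pi * k * t / w_l))
                     for k in range(w_l)])

# numpy's FFT computes the same sum in O(n log n)
F_fft = np.fft.fft(a)
```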

The features were carefully selected to provide detailed, discriminatory information about signal characteristics that are highly correlated with tremor severity, such as distribution, autocorrelation, central tendency, degree of dispersion, shape of the data, stationarity, entropy measures and dissimilarity.

Tremor severity can be distinguished by amplitude, as tremor amplitude has shown a high correlation with the UPDRS score [28]: the amplitude increases as severity increases. Similarly, a previous study showed that tremor severity is highly correlated with frequency sub-bands [20], as each tremor severity or score appears within a specific frequency range, as shown in Table 1. Therefore, features such as the mean, maximum, energy, number of peaks, and number of values above and below the mean are chosen, alongside the median in case the values are not normally distributed. In addition, these features showed a high correlation with tremor severity classification in previous studies [29,30,31]. To measure signal dispersion, the standard deviation is selected, since it has been found to be an effective measure for quantifying tremor severity [32, 33].

Skewness and kurtosis are chosen to characterise the data distribution. Kurtosis has been used in previous studies to detect tremor because tremor signals are more spiky (higher kurtosis) than non-tremor signals [30, 34, 35]. Consequently, a high-severity tremor almost certainly has a high kurtosis value, and vice versa. Skewness, on the other hand, measures the lack of symmetry, and it has been used to measure random movements to assess medication response: while patients are on medication the tremor decreases, and so does the skewness of the tremor signal [30]. Therefore, skewness is expected to decrease with less severe tremor and increase with more severe tremor. The Spectral Centroid Amplitude (SCA), which is the weighted power distribution, and the maximum weighted PSD are further features related to the spectral energy distribution [36]. Since every frequency sub-band represents a tremor severity [20], the maximum weighted power and the weighted power distribution can quantify tremor severity.

PD tremor is a rhythmical movement, hence sample entropy and autocorrelation have been chosen to measure regularity and complexity in the time series, as previous work has established that the sample entropy and autocorrelation of tremor are significantly lower than those of non-tremor movements [37,38,39]. Two further complexity measures were selected: the Complexity-Invariant Distance (CID) [40, 41] and the Sum of Absolute Differences (SAD) [42]. SAD and CID measure time-series complexity in different ways; a more complex time series has more peaks and valleys, which increase the difference between consecutive values in the window. The tremor signal is more complex because tremor frequency and amplitude are higher than those of normal movements, which increases the number of peaks and valleys in the signal. As a result, complexity is correlated with tremor severity.
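The successive-difference idea behind SAD and CID can be illustrated as below, assuming NumPy. Note that `complexity_estimate` is only the complexity term used inside CID, not the full complexity-invariant distance between two series; the two toy signals are assumptions for illustration.

```python
import numpy as np

def complexity_estimate(x):
    """Complexity term of CID: more peaks and valleys produce larger
    successive differences, hence a larger value."""
    return np.sqrt(np.sum(np.diff(x) ** 2))

def sum_abs_diff(x):
    """Sum of Absolute Differences between consecutive samples (SAD)."""
    return np.sum(np.abs(np.diff(x)))

t = np.linspace(0, 4, 200)
smooth = np.sin(2 * np.pi * 1 * t)   # 1 Hz: few oscillations, low complexity
tremor = np.sin(2 * np.pi * 5 * t)   # 5 Hz: more oscillations per window

ce_smooth, ce_tremor = complexity_estimate(smooth), complexity_estimate(tremor)
sad_smooth, sad_tremor = sum_abs_diff(smooth), sum_abs_diff(tremor)
```

Both measures are larger for the faster-oscillating signal, matching the intuition in the text.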

Previous research has established that tremor intensity identifies tremor severity [4]. Tremor intensity at various frequencies can be quantified through the PSD, and since tremor severity is correlated with frequency ranges or bandwidth spread [20], three features are chosen: the fundamental frequency, the median frequency, and the frequency dispersion. The fundamental frequency is the frequency with the highest power in the power spectrum. The median frequency divides the PSD into two equal parts. The frequency dispersion is the width of the frequency band that contains \(68\%\) of the PSD. In addition, guided by previous work, the difference between the fundamental frequency and the median frequency was extracted as an additional feature, since the fundamental tremor frequency can differ between patients [22].
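The three spectral features can be sketched from a periodogram-style PSD as below (NumPy only). The PSD estimator and the centring of the 68% band on the cumulative 16%–84% range are assumptions for illustration; the paper does not fix these details.

```python
import numpy as np

def psd_features(x, fs):
    """Fundamental frequency, median frequency and frequency dispersion
    from a simple periodogram-style PSD."""
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    psd = np.abs(np.fft.rfft(x)) ** 2
    fundamental = freqs[np.argmax(psd)]          # peak of the spectrum
    cum = np.cumsum(psd) / np.sum(psd)           # cumulative power
    median_f = freqs[np.searchsorted(cum, 0.5)]  # splits PSD into equal halves
    # dispersion: width of the band holding 68% of the power (16%..84%)
    lo = freqs[np.searchsorted(cum, 0.16)]
    hi = freqs[np.searchsorted(cum, 0.84)]
    return fundamental, median_f, hi - lo

fs = 50.0
t = np.arange(0, 4, 1 / fs)
x = np.sin(2 * np.pi * 5 * t)  # pure 5 Hz tone in the rest-tremor band

f0, fmed, disp = psd_features(x, fs)
```

For a pure tone, the fundamental and median frequencies coincide and the dispersion is near zero; real tremor windows spread power over a band.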

Resampling Techniques

This section presents a brief overview of the resampling techniques employed in this study. Resampling methods can be categorised into three groups: over-sampling, under-sampling, and hybrid (a combination of over- and under-sampling).

Over-Sampling Techniques

Over-sampling techniques consist of adding samples to the minority classes. In this study, three over-sampling techniques are explored, as described below:

  1. (a)

    Synthetic Minority Over-sampling Technique (SMOTE) [43] synthetically creates new samples in the minority class instead of replicating original samples, which would lead to over-fitting. SMOTE creates samples based on similarities in feature space, along the line segments joining a minority instance and its ‘k’ nearest minority-class neighbours.

  2. (b)

    Adaptive Synthetic Sampling Approach (ADASYN) [44] generates samples in the minority class according to a weighted distribution computed with the K-nearest neighbours (K-NN) algorithm. ADASYN assigns higher weights to instances that are difficult to classify with a K-NN classifier, so more synthetic instances are generated for these higher-weight instances.

  3. (c)

    Borderline SMOTE [45] identifies minority samples near the decision boundary (borderline samples), and the SMOTE algorithm is then applied to generate synthetic samples along the decision boundary between the minority and majority classes.
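To make the interpolation idea concrete, the following is a minimal NumPy sketch of the basic SMOTE step; production code would normally use a library implementation (e.g. imbalanced-learn) rather than this simplified version, and the toy data here is an assumption.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=None):
    """Minimal SMOTE sketch: each synthetic sample lies on the line segment
    between a minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to all other minority samples
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                   # position along the segment
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)

X_min = np.random.default_rng(1).normal(size=(20, 3))  # toy minority class
synthetic = smote_sketch(X_min, n_new=30, seed=2)
```

Because each synthetic point is a convex combination of two real minority samples, it stays inside the region spanned by the minority class rather than duplicating existing points.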

Under-Sampling Techniques

Under-sampling techniques work by removing samples from the majority classes. In this study, five under-sampling techniques are examined, as described below:

  1. (a)

    Condensed Nearest Neighbour (CNN) [46] was originally designed to reduce the memory used by the K-nearest neighbours algorithm. It iterates over the majority classes and selects a subset of samples that are correctly classified by the 1-nearest neighbour rule, thus keeping only relevant samples and eliminating insignificant samples from the majority classes.

  2. (b)

    Tomek links [47] is an enhancement of the CNN technique: whereas CNN initially chooses samples randomly, the Tomek-links method first finds Tomek links, which are pairs of samples that belong to different classes and are each other’s 1-nearest neighbours. It then removes the Tomek-link samples belonging to the majority classes, or alternatively both samples of each pair. In this study, only the majority-class members of each Tomek link are removed, to retain the minority classes and increase the distance between classes by removing majority-class samples near the decision boundary.

  3. (c)

    AllKNN [48] is an under-sampling technique based on Edited Nearest Neighbours (ENN) [49], which applies a K-nearest neighbour (K-NN) classifier to the majority class and removes samples that are misclassified, i.e. samples whose class differs from the majority of their k nearest neighbours. The AllKNN technique removes all majority-class samples adjacent to the minority class, in order to make the classes more separable: it removes majority-class samples that have at least one nearest neighbour in the minority class.

  4. (d)

    Instance Hardness Threshold (IHT) [50] removes samples with a high hardness value, where the hardness value indicates the probability of being misclassified. This approach removes majority-class samples that are classified with low probability (those that overlap minority-class samples).

  5. (e)

    The NearMiss [51] technique is based on the average distance from majority-class samples to minority-class samples. There are three versions of NearMiss: NearMiss-1 selects the majority-class samples with the smallest average distance to their three closest minority-class samples; NearMiss-2 selects the majority-class samples with the smallest average distance to the three farthest minority-class samples; and NearMiss-3 selects the majority-class samples with the smallest distance to each minority-class sample. In this study, NearMiss-1 is used.
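A compact NumPy sketch of the NearMiss-1 rule used in this study (keep the majority samples with the smallest average distance to their three closest minority samples); the toy data and the `n_keep` value are illustrative assumptions.

```python
import numpy as np

def nearmiss1(X_maj, X_min, n_keep, k=3):
    """NearMiss-1 sketch: keep the n_keep majority samples whose average
    distance to their k closest minority samples is smallest."""
    # pairwise distances: each majority sample to every minority sample
    d = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    avg_closest = np.sort(d, axis=1)[:, :k].mean(axis=1)
    keep = np.argsort(avg_closest)[:n_keep]
    return X_maj[keep]

rng = np.random.default_rng(0)
X_maj = rng.normal(0, 1, size=(100, 2))  # toy majority class
X_min = rng.normal(3, 1, size=(10, 2))   # toy minority class
X_maj_reduced = nearmiss1(X_maj, X_min, n_keep=20)
```

The retained majority samples are those closest to the minority region, which is what makes NearMiss-1 focus the classifier on the class boundary.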

Hybrid Resampling (Combination of Over- and Under-sampling)

The last group examined is the hybrid approach, which combines over- and under-sampling techniques. This approach cleans the noise generated by over-sampling by removing majority-class samples that overlap minority-class samples. In this study, two hybrid techniques are examined, as described below:

  1. (a)

    Synthetic Minority Over-sampling Technique combined with Tomek links (SMOTETomek) [52] first increases the number of minority-class samples by generating synthetic samples, as discussed in Section “Over-Sampling Techniques”, and subsequently applies the Tomek-links under-sampling technique to the original and synthetic samples, as discussed in Section “Under-Sampling Techniques”.

  2. (b)

    Synthetic Minority Over-sampling Technique combined with Edited Nearest Neighbour (SMOTEENN) [53] is the second hybrid approach. SMOTEENN starts with SMOTE, as discussed in Section “Over-Sampling Techniques”, followed by Edited Nearest Neighbour (ENN), as discussed in Section “Under-Sampling Techniques”. In this study, the 3-nearest neighbour algorithm is used with ENN.

Classifiers

Two classifiers are considered for classification: an Artificial Neural Network based on a Multi-Layer Perceptron (ANN-MLP) [54], and Random Forest (RF) [55]. These classifiers were adopted based on previous studies that achieved high performance in the classification of different types of balanced and imbalanced datasets [56,57,58].

The ANN-MLP is a feed-forward ANN consisting of multiple layers (an input layer, one or more hidden layers, and an output layer). The ANN-MLP layers are fully connected, so each node in one layer is connected to every node in the following layer with its own weight. ANN-MLP training is implemented through backpropagation [59], adjusting the connection weights to minimise the mean squared error between the network’s predictions and the actual classes.

The RF classifier is an ensemble learning algorithm [55] composed of a set of decision trees, designed to overcome the overfitting problem of single decision trees. Each tree is trained on a random sample drawn from the original training dataset using the bootstrap method [60]. The remaining training data are used to estimate the error and feature importance, decreasing the correlation between the constructed trees in the forest and hence the variance of the final model. The final classification is based on a majority vote of the decision trees in the forest.

The results were evaluated using traditional metrics including accuracy, precision, sensitivity, specificity and F1-score [61]. In addition, advanced metrics were used, such as the geometric mean (Gmean) [62], the Index of Balanced Accuracy (IBA) [63], and the Area Under the Curve (AUC) [64], which are explained in Section “Performance Metrics”.

Experimentation

The experimental dataset used in this study is introduced in Section “Dataset”, illustrating the imbalanced class distribution problem. The classifier architectures are then described in Section “Classifiers Architecture”, and the performance metrics used to evaluate the proposed models are presented in Section “Performance Metrics”.

Dataset

The tremor dataset was taken from the Levodopa Response Trial wearable data from the Michael J. Fox Foundation for Parkinson’s Research (MJFF) [65]. The data were collected from wearable sensors in both laboratory and home environments using different devices: a Pebble smartwatch, a GENEActiv accelerometer and a Samsung Galaxy Mini smartphone accelerometer. Triaxial accelerometer data were collected from 30 subjects fitted with either 3 or 8 sensors over four days. On the first day of data collection, participants came to the laboratory on their regular medication regimen and performed a set of ADLs and items of the motor examination of the Movement Disorder Society-sponsored revision of the Unified Parkinson’s Disease Rating Scale (MDS-UPDRS) [5], which is used to assess motor symptoms. The tasks performed include: standing, walking straight, walking while counting, walking upstairs, walking downstairs, walking through a narrow passage, finger to the nose (left and right hands), repeated arm movement (left and right hands), sit to stand, drawing on paper, writing on paper, typing on a computer keyboard, assembling nuts and bolts, taking a glass of water and drinking, organising sheets in a folder, folding a towel, and sitting. For each task, symptom severity scores (rated 0–4) were provided by a clinician. On the second and third days, accelerometer data were collected while participants were at home performing their usual activities. On the fourth day, the same procedures as on the first day were performed once again, but with the participants off medication for 12 h. In this study, only the GENEActiv accelerometer tremor data collected on the first day are utilised.

Table 2 shows the class (severity) distribution of the 32,414 instances (windows) segmented from the collected data. It is clear that the data distribution is skewed towards less severe tremor. This bias can cause significant changes in the classification output: the classifier becomes more sensitive to the majority classes and less sensitive to the minority classes, which risk being ignored altogether. Thus, the different resampling techniques described in Section “Resampling Techniques” were utilised to eliminate the effect of the imbalanced class distribution.

Table 2 Imbalanced classes (severities) distribution

Classifiers Architecture

In this study, the ANN-MLP was built using Keras [66] with TensorFlow [67] as the back-end. The network contains 102 nodes in the input layer (the shape of the feature vector); 200, 180, 180 and 100 nodes in the four hidden layers respectively; and five nodes in the output layer, corresponding to the five classes (severities). Each hidden layer applies the rectified linear unit (ReLU) [68] activation function, since it is computationally efficient and tends to show better convergence than the sigmoid function [69]. The output layer applies the softmax activation function [70] to predict class probabilities.

The RF classifier was built with 100 trees, following the suggestion of Oshiro et al. [71] that the number of trees should be between 64 and 128. Gini impurity was selected as the decision-tree split criterion because it tends to split a node into one small pure node and one large impure node [72], and it can be computationally more efficient than entropy by avoiding the \(\log\) computation.

To avoid over-fitting with the ANN-MLP, \(20\%\) of the data was held out as validation data to monitor model performance and stop training when performance degrades; this early stopping is a form of regularisation that halts the iterations before the model begins to overfit. The RF is an ensemble machine learning algorithm that by default uses bagging (bootstrap aggregation), a resampling technique that draws random samples from the dataset with replacement. Its performance is estimated via the out-of-bag method, which is similar to validation data: each tree in the forest is tested on the roughly \(36.8\%\) of samples that were not used in building that tree.
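The 36.8% figure follows from the bootstrap itself: each of the n draws with replacement misses a given sample with probability (1 − 1/n), so the chance that the sample ends up out-of-bag is (1 − 1/n)^n, which approaches e^{−1} ≈ 0.368. A quick check, stdlib only:

```python
import math
import random

n = 1_000  # training-set size (illustrative)
analytic = (1 - 1 / n) ** n  # probability a sample is never drawn

# empirical check: how often does sample 0 miss an n-draw bootstrap?
random.seed(0)
trials = 1_000
oob = sum(0 not in {random.randrange(n) for _ in range(n)}
          for _ in range(trials)) / trials
# both analytic and oob sit close to e**-1 ~= 0.368
```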

Performance Metrics

The most frequently used metrics for evaluating classification algorithms are accuracy, precision, sensitivity (True Positive Rate) and specificity (True Negative Rate) [61]. However, these metrics are illusory and insufficient for assessing classifiers on imbalanced classification problems, since they are sensitive to the data distribution [73]. For example, a classifier that predicts mostly majority-class samples has high accuracy but is useless for detecting minority-class samples. Sensitivity and precision do not take true negatives into account, which can cause issues, especially in medical diagnosis, where a misclassified true negative can lead to unnecessary and expensive treatment. Hence, other metrics such as the F1-score [61] and the geometric mean (Gmean) [62] are widely used to balance sensitivity and precision, as the ultimate goal of a classifier is to improve sensitivity without impacting precision [14]. Gmean and the F1-score are accurate metrics in this setting because they are less influenced by the majority classes in imbalanced data [74]. However, even though Gmean and the F1-score minimise the influence of the imbalanced distribution, they do not take into account the true negatives and each class’s contribution to the overall performance [63]. Hence, in this study, advanced metrics are included in addition to these, namely the Index of Balanced Accuracy (IBA) [63] and the Area Under the Curve (AUC) [64].

$$\begin{aligned} \text{Accuracy}= \,\, & {} \frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}} \end{aligned}$$
(3)
$$\begin{aligned} \text{Precision}=\,\, & {} \frac{\text{TP}}{\text{TP}+\text{FP}} \end{aligned}$$
(4)
$$\begin{aligned} \text{Sensitivity}=\,\, & {} \text{TPR} = \frac{\text{TP}}{\text{TP}+\text{FN}} \end{aligned}$$
(5)
$$\begin{aligned} \text{Specificity}= \,\, & {} \text{TNR} = \frac{\text{TN}}{\text{TN}+\text{FP}} \end{aligned}$$
(6)
$$\begin{aligned} F1=\,\, & {} \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision}+\text{Sensitivity}} \end{aligned}$$
(7)
$$\begin{aligned} \text{Gmean}=\,\, & {} \sqrt{\text{Sensitivity} \times \text{Specificity}} \end{aligned}$$
(8)
$$\begin{aligned} \text{IBA}_\alpha=\,\, & {} \left( 1+\alpha \times (\text{TPR}-\text{TNR})\right) \times \text{Gmean}^2 \, , \quad \text {where} \quad 0 \le \alpha \le 1, \end{aligned}$$
(9)

where TP, FP, TN, FN, TPR, TNR and \(\alpha\) refer, respectively, to true positive, false positive, true negative, false negative, true positive rate, true negative rate and the weighting factor.
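Given the confusion-matrix counts, Eqs. (3)-(8) can be computed directly. A minimal Python sketch for the binary case (function and variable names are illustrative, not from the study):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute Eqs. (3)-(8) from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)          # TPR, Eq. (5)
    specificity = tn / (tn + fp)          # TNR, Eq. (6)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    gmean = (sensitivity * specificity) ** 0.5
    return {"accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "specificity": specificity,
            "f1": f1, "gmean": gmean}

# Hypothetical counts: 80 TP, 10 FP, 90 TN, 20 FN
m = classification_metrics(tp=80, fp=10, tn=90, fn=20)
```

With these counts, accuracy is 0.85 while Gmean is lower (about 0.85 vs. 0.80 sensitivity), showing how the two metrics respond differently to an error split across classes.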

The most appropriate metrics for evaluating imbalanced data are AUC and IBA. There are many advantages to using AUC to evaluate classifiers, particularly in medical fields [75,76,77]. AUC is independent of prevalence, and it can be used to compare multiple classifiers and to compare classifier performance across different classes.

IBA is a performance metric that takes into consideration the contribution of each class to the overall performance, so a high IBA is obtained when the accuracies of all classes are high and balanced. The IBA evaluates the relationship between TPR and TNR, which reflects the class distribution. IBA can take any value between 0 and 1; the best performance is achieved when TPR = TNR = 1 with \(\alpha = 1\), giving IBA = 1.
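Eq. (9) and this balancing behaviour can be checked with a short sketch (illustrative code, not from the study):

```python
def iba(tpr, tnr, alpha=1.0):
    """Index of Balanced Accuracy, Eq. (9): scales Gmean^2 by the
    TPR-TNR difference, penalising unbalanced class accuracies."""
    gmean_sq = tpr * tnr                  # Gmean^2 = TPR * TNR
    return (1 + alpha * (tpr - tnr)) * gmean_sq

# Perfect, balanced classifier: TPR = TNR = 1 gives IBA = 1
assert iba(1.0, 1.0) == 1.0
# Same Gmean^2 (0.576) but TNR-dominated performance is penalised
print(iba(0.6, 0.96))   # (1 - 0.36) * 0.576
```

Note that two classifiers with identical Gmean can receive different IBA scores depending on which rate dominates, which is exactly why IBA is used here alongside Gmean.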

Results and Discussions

This section is presented in four parts. The first part discusses the results of the over-sampling techniques. The second part presents the results of the under-sampling techniques. The third part presents the results of the hybrid techniques. Finally, the best results obtained from each resampling technique are compared to determine the best resampling technique and the best classifier for the PD tremor dataset. In addition, the results without any resampling technique are presented as a baseline in order to evaluate resampling performance.

Over-Sampling Results

Table 3 shows the performance of the two classifiers, RF and ANN-MLP, on the PD tremor severity dataset resampled using three over-sampling techniques: SMOTE, ADASYN and Borderline SMOTE. Overall, all of the over-sampling techniques improved classifier performance significantly. It can also be observed that the ANN-MLP classifier performed better than the RF classifier with over-sampling, while the RF classifier achieved better results than ANN-MLP without over-sampling. The best results were achieved using the ANN-MLP classifier combined with Borderline SMOTE, with \(95.04\%\) overall accuracy, \(96\%\) Gmean, \(93\%\) IBA and \(99\%\) AUC. However, the AUC scores of both classifiers with all over-sampling techniques were \(99\%\). Hence, it is important to evaluate classifiers with a different metric such as IBA, which reveals slight performance differences between classifiers with different over-sampling techniques; this supports the discussion of imbalanced-dataset performance metrics in Section “Performance Metrics”.

The best performance of RF was achieved with the ADASYN technique, obtaining \(92.58\%\) overall accuracy, \(95\%\) Gmean, \(91\%\) IBA and \(99\%\) AUC, while the worst performance of RF was with Borderline SMOTE, with \(91.54\%\) overall accuracy, \(94\%\) Gmean, \(89\%\) IBA and \(99\%\) AUC. On the other hand, ANN-MLP achieved its best performance with the Borderline technique and its worst with SMOTE. Neither classifier obtained its best performance with SMOTE, but SMOTE still outperformed applying the classifiers without over-sampling.

Table 3 Performance metrics with/without over-sampling for tremor severity classification with RF and ANN-MLP

Interestingly, the worst performance among the over-sampling techniques was obtained using the RF classifier with the Borderline technique, while Borderline achieved the highest performance with ANN-MLP. This result shows that over-sampling performance varies among classifiers; hence, no assumption can be made about the best over-sampling technique, because every dataset, classifier and over-sampling technique has its own characteristics, and different combinations can produce different results.
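The over-sampling techniques compared here share one core mechanism, illustrated by SMOTE: a synthetic minority sample is placed on the line segment between a minority sample and one of its k nearest minority-class neighbours (ADASYN and Borderline SMOTE differ mainly in which samples they choose to interpolate from). A dependency-free toy sketch of that interpolation step, assuming small tuples of features (in practice a library such as imbalanced-learn would be used; this is illustration only):

```python
import random

def smote_like_sample(minority, k=2, rng=random):
    """Generate one synthetic sample by SMOTE-style interpolation:
    pick a minority point, pick one of its k nearest minority
    neighbours, and interpolate a random fraction of the way."""
    x = rng.choice(minority)
    # k nearest neighbours of x within the minority class (squared Euclidean)
    neighbours = sorted(
        (p for p in minority if p is not x),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
    )[:k]
    nb = rng.choice(neighbours)
    lam = rng.random()                    # interpolation factor in [0, 1)
    return tuple(a + lam * (b - a) for a, b in zip(x, nb))

# Tiny hypothetical minority class
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
synthetic = smote_like_sample(minority)
```

Because the synthetic point lies between two real minority points, it always falls inside the minority region rather than duplicating an existing sample, which is what lets the classifier learn a broader decision boundary for the minority class.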

Figure 2 shows the confusion matrices for the ANN-MLP and RF classifiers without resampling and with the best over-sampling techniques. It is clear from the confusion matrices how the over-sampling techniques improved the prediction of all classes without any bias towards the majority classes. The associated ROCs for the same results are shown in Fig. 3.

Fig. 2
figure 2

Normalised confusion matrices of ANN-MLP and RF without resampling and with best oversampling results; a ANN-MLP without resampling, b ANN-MLP with BorderlineSMOTE, c RF without resampling, d RF with ADASYN

Fig. 3
figure 3

ROC of ANN-MLP and RF without resampling and with best oversampling results; a ANN-MLP without resampling, b ANN-MLP with BorderlineSMOTE, c RF without resampling, d RF with ADASYN

Under-Sampling Results

Table 4 shows the performance of the two classifiers, RF and ANN-MLP, on the PD tremor severity dataset resampled using five under-sampling techniques: TomekLinks, CNN, AllKNN, IHT and NearMiss. It is clear that the overall classifier performance with under-sampling is significantly worse than with over-sampling. However, some under-sampling techniques did improve classifier performance. Both classifiers (ANN-MLP and RF) achieved their best performance with the IHT under-sampling technique, with RF achieving the better results: \(86.11\%\) overall accuracy, \(91.00\%\) Gmean, \(82.00\%\) IBA and \(96.00\%\) AUC. Even though ANN-MLP achieved \(85.66\%\) accuracy with AllKNN, this is not the best under-sampling technique, since it obtained only \(13\%\) IBA, which indicates that some classes were not predicted correctly; it also shows that the highest overall accuracy does not necessarily mean the classifier performed well. The worst performance of both classifiers was with the CNN under-sampling technique, which did not improve most metrics; on the contrary, it impaired the results.

Table 4 Performance metrics with/without under-sampling for tremor severity classification with RF and ANN-MLP

What is striking about the results in Table 4 is that the most important metrics, Gmean and IBA, are very low and declined dramatically with some under-sampling techniques, even though other metrics improved. For example, when the AllKNN technique was applied with both classifiers, the accuracy, precision and sensitivity improved significantly, while the IBA and Gmean declined. The IBA declined from 26 to \(13\%\) with ANN-MLP and from 11 to \(1\%\) with RF, and Gmean declined from 50 to \(34\%\) with ANN-MLP and from 32 to \(9\%\) with RF. These results indicate that relying on standard metrics is neither sufficient nor appropriate for multi-class imbalanced-dataset classification.

Similar to the over-sampling techniques, some under-sampling techniques improved the performance of one classifier and deteriorated that of the other. For example, NearMiss improved RF performance but deteriorated ANN-MLP performance, which supports the argument presented above that resampling techniques do not perform uniformly across different classifiers and datasets.
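The under-sampling techniques take the opposite route and discard majority samples. TomekLinks, for instance, removes majority samples that form a "Tomek link": a pair of mutual nearest neighbours with different labels, i.e. noisy or borderline points. A small illustrative sketch of that rule (not the study's implementation; the brute-force neighbour search is for clarity only):

```python
def tomek_majority_indices(X, y, majority_label):
    """Indices of majority samples that form a Tomek link (mutual
    nearest neighbours with different labels); removing them cleans
    the class boundary."""
    def nearest(i):
        # index of the single nearest neighbour of sample i
        return min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
        )

    to_remove = set()
    for i in range(len(X)):
        j = nearest(i)
        if y[i] != y[j] and nearest(j) == i:  # mutual NNs, opposite labels
            if y[i] == majority_label:
                to_remove.add(i)
    return to_remove

# Hypothetical data: class 0 is the majority; sample 0 sits right
# next to a minority point and therefore forms a Tomek link with it
X = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.2, 0.1)]
y = [0, 1, 0, 0, 0]
drop = tomek_majority_indices(X, y, majority_label=0)
```

Only the majority half of each link is discarded, which is why TomekLinks removes comparatively few samples and acts more as a boundary-cleaning step than an aggressive balancer.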

Figure 4 shows the confusion matrices of the ANN-MLP and RF classifiers without resampling and with the best under-sampling techniques. Even though both classifiers did not achieve results as high as with the over-sampling techniques, the results improved and the bias towards the majority classes was reduced. The associated ROCs for the same results are shown in Fig. 5.

Fig. 4
figure 4

Normalised confusion matrices of ANN-MLP and RF without resampling and with best undersampling results; a ANN-MLP without resampling, b ANN-MLP with IHT, c RF without resampling, d RF with IHT

Fig. 5
figure 5

ROC of ANN-MLP and RF without resampling and with best undersampling results; a ANN-MLP without resampling, b ANN-MLP with IHT, c RF without resampling, d RF with IHT

Hybrid Results

Table 5 shows the performance of the two classifiers, RF and ANN-MLP, on the PD tremor severity dataset resampled using two hybrid techniques: SMOTETomek and SMOTEENN. In contrast to the under-sampling techniques, both hybrid techniques improved classifier performance significantly, with SMOTEENN outperforming SMOTETomek for both classifiers. SMOTEENN obtained its best results with the RF classifier: \(92.61\%\) overall accuracy, \(95\%\) Gmean, \(91\%\) IBA and \(99\%\) AUC.
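The hybrid techniques chain the two ideas: SMOTEENN first over-samples with SMOTE and then cleans the result with Edited Nearest Neighbours (ENN), which drops any sample whose label disagrees with the majority vote of its k nearest neighbours. A minimal sketch of the ENN cleaning step alone (illustrative, not the study's code):

```python
from collections import Counter

def enn_filter(X, y, k=3):
    """Edited Nearest Neighbours: keep a sample only if the majority
    vote of its k nearest neighbours agrees with its own label."""
    keep = []
    for i in range(len(X)):
        neighbours = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(X[i], X[j])),
        )[:k]
        vote, _ = Counter(y[j] for j in neighbours).most_common(1)[0]
        if vote == y[i]:
            keep.append(i)
    return keep

# Hypothetical data: a lone class-1 point inside the class-0
# cluster (index 3) is removed; the clean class-1 cluster survives
X = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (0.1, 0.0),
     (5.0, 5.0), (5.1, 5.1), (5.0, 5.1), (5.2, 5.0)]
y = [0, 0, 0, 1, 1, 1, 1, 1]
kept = enn_filter(X, y, k=3)
```

Applying ENN after SMOTE removes exactly the kind of noisy synthetic points that interpolation can place inside the wrong class, which is consistent with SMOTEENN outperforming plain SMOTE in Table 5.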

Table 5 Performance metrics with/without hybrid resampling for tremor severity classification with RF and ANN-MLP

Figures 6 and 7 present the confusion matrices and ROCs of the ANN-MLP and RF classifiers without resampling and with the best hybrid resampling techniques. Similar to the over-sampling and under-sampling techniques, the results improved and the bias towards the majority classes was reduced.

Fig. 6
figure 6

Normalised confusion matrices of ANN-MLP and RF without resampling and with best hybrid resampling results; a ANN-MLP without resampling, b ANN-MLP with ENN, c RF without resampling, d RF with ENN

Fig. 7
figure 7

ROC of ANN-MLP and RF without resampling and with best hybrid resampling results; a ANN-MLP without resampling, b ANN-MLP with ENN, c RF without resampling, d RF with ENN

Performance Comparison

Table 6 shows the best results obtained from the two classifiers, ANN-MLP and RF, in combination with all resampling techniques. Among these results, the best performance was obtained from the ANN-MLP classifier with Borderline SMOTE, achieving \(93.81\%\) overall accuracy, \(96\%\) Gmean, \(91\%\) IBA and \(99\%\) AUC, while RF achieved its best performance with ADASYN and SMOTEENN on all metrics except overall accuracy, where the difference was very small \((0.03\%)\). However, neither classifier improved significantly with IHT in comparison with the other resampling techniques, although RF's performance with it was higher.

Table 6 Resampling techniques comparison for tremor severity classification with RF and ANN-MLP
Fig. 8
figure 8

Best resampling techniques for tremor severity classification with RF and ANN-MLP

As mentioned in Section “Performance Metrics”, the most important metrics are IBA and AUC. The combinations of ANN-MLP with Borderline SMOTE, RF with ADASYN and RF with SMOTEENN obtained the same results on these metrics, with \(91\%\) IBA and \(99\%\) AUC, and the overall performance of these combinations achieved the best results with only slight differences in some metrics. The worst improvements among the best results were obtained by the combination of ANN-MLP with IHT, followed by RF with IHT. The order of the best combinations, from high to low, is therefore ANN-MLP with Borderline SMOTE, RF with SMOTEENN, RF with ADASYN and finally ANN-MLP with SMOTEENN, as shown in Fig. 8. It can thus be suggested that the best approaches for estimating tremor severity are the over-sampling and hybrid approaches, while the worst are the under-sampling approaches. This conclusion is supported by the findings in Sections “Over-Sampling Results”, “Under-Sampling Results” and “Hybrid Results”.

Table 7 shows summary results and a comparison with the state of the art. The measured tremor, the number of patients, the sensors used, the approach, the metrics and the measured severities are shown in the table. Although high classification performance has been reported in the literature for tremor severity estimation, most of these studies did not report advanced performance metrics such as Gmean and IBA, which are very important for evaluating classification models. Also, many studies did not take all types and levels of tremor into consideration. In addition, none of these studies considered class imbalance issues.

Table 7 Comparison with the state-of-the-art

Conclusion and Future Work

In this study, a set of resampling techniques was investigated to improve Parkinson's disease tremor severity classification. It can be concluded that the proposed algorithms can improve the classification process. The classifiers were evaluated with advanced metrics, such as AUC, Gmean and IBA, that are not influenced by the data distribution. The results show that ANN-MLP with Borderline SMOTE is the best classification approach for identifying tremor severity. The results also show that the over-sampling techniques performed better than the under-sampling and hybrid techniques, and that different resampling techniques achieve different results with different classifiers.

We acknowledge that this study has a number of limitations. First, the sample size is small, and it may not represent the entire PD population. Second, the data was gathered in a single environment; as a result, the outcomes may vary if the environment changes. Third, the proposed method should be tested on a variety of datasets.

For future studies, it is suggested to apply different resampling techniques with more classifiers, and to examine the influence of resampling techniques on the classification of individual classes rather than only the overall performance. The work could also be extended to include the rest of the collected data, so that the proposed algorithms can be evaluated on a larger dataset.