1 Introduction

Parkinson’s disease (PD) is a chronic neurological disorder caused by the progressive degeneration and death of dopaminergic neurons, which are responsible for coordinating movement at the level of muscular tone [43]. People with PD show different symptoms including rigidity, tremor, slow movements, impaired voice, and poor balance [26, 28, 40, 52]. Based on these symptoms, different automated approaches have been developed for PD detection [17, 24, 30, 33, 41, 45, 57]. However, PD detection through vocal signal processing is particularly beneficial for two main reasons: (1) the literature suggests that around 90% of PD patients have voice impairment issues [48], and (2) voice disorders are considered early symptoms of PD, hence PD detection through voice signals is a promising way for the early prediction of PD [7, 43]. Additionally, voice recording-based PD detection enables home-based tele-monitoring and tele-diagnosis of PD [53]. Motivated by these factors, this paper develops a novel learning model for early detection of PD through acoustic signal processing and machine learning methods.

Recently, data mining and machine learning researchers have developed various automated systems for the detection of PD based on voice or speech signals, physiological signals, wearable sensors for gait analysis, and handwriting movement analysis [2, 24, 27, 32, 40, 48, 49, 55]. Owing to the above-mentioned facts, it is quite natural to detect PD from vocal data. Sarkar et al. obtained a balanced dataset by collecting multiple types of voice samples from 68 subjects [53]. Dysphonia-related features were extracted from the voice signals using the Praat software [21]. They highlighted the problem of subject overlap in data having many voice or speech samples per subject and proposed a cross-validation scheme suited to this setting, i.e., leave-one-subject-out cross-validation (LOSO CV). Under LOSO CV, they obtained an accuracy of 55% using k-nearest neighbour (KNN) and SVM models. Subsequently, many machine learning researchers utilized the data collected by Sarkar et al. [53] and tried to improve the PD detection accuracy by evaluating the feasibility of various feature extraction and feature selection methods [11, 16, 18, 19, 23, 32, 37, 39, 42, 46, 47]. For example, Canturk et al. utilized a machine learning system consisting of four feature selection algorithms and five different classifiers but achieved an accuracy of only 57.5% [23]. Benba et al. explored the feasibility of human factor cepstral coefficient (HFCC) features and utilized only the multiple types of vowel phonation data. For classification, they developed an SVM model with different types of kernels and obtained a PD detection accuracy of 87.5% using LOSO CV [19]. Recently, Rahman et al. [50] collected a relatively larger dataset and showed that achieving high PD detection accuracy on a larger dataset is a challenging task. Most recently, Ali et al. showed that instead of selecting features from multiple types of voice data, improved performance can be obtained by selecting samples before feature selection [13]. That is, feature selection from data having only one type of sample is a better strategy than feature selection from heterogeneous data, i.e., data containing multiple types of hybrid samples.

After critically analyzing the results published by Ali et al. [13], we noticed that the improved performance obtained by sample selection before feature selection is due to the fact that different types of sustained vowel phonations are sensitive to different subsets of features and different models; consequently, different samples or vowel phonations have different optimal subsets of features and different optimal models. Hence, if we construct one model and try to obtain one global optimal subset of features for data having multiple types of samples together, performance degrades. Thus, in this paper, we exploit these findings and propose a novel ensemble method, namely EOFSC (Ensemble model with Optimal Features and Sample-Dependent base Classifiers). We consider multiple types of data containing three different types of sustained vowel phonations, i.e., the vowels “a”, “o” and “u”. In the first set of experiments, we explore the feasibility of integrating feature selection through an F-score-based statistical model with a deep neural network (DNN) and other conventional machine learning models (including conventional ensemble models such as AdaBoost and random forest). Through these numerical experiments, we further consolidate the findings of Ali et al. [13]; consequently, three different DNN configurations with different optimal subsets of features are obtained for the three different types of vowel phonation data. In the second step, we utilize the three types of base classifiers (the DNN models), which are sample and feature dependent. Finally, the three developed base models/classifiers are integrated to construct the EOFSC model, using a majority voting criterion to evaluate the final decision of the newly developed model. Experimental results prove the effectiveness of the feature-driven DNN integrated system (F-DNN) over conventional DNN and all the other similar hybrid/integrated systems. Simulation results show that the proposed ensemble model further enhances the performance of the F-DNN integrated system by 6.5%. Figure 1 provides more details about EOFSC.

Fig. 1 Block diagram of the proposed EOFSC ensemble method. EOFSC: ensemble model with optimal features and sample-dependent base classifiers, \(\lambda \): the hyper-parameter setting of the neural network, \(N_{s}\): size of the optimal subset of features

The main contributions of this study are summarized as follows:

  1. Feature selection at the input level of DNN has not been well studied [54]. Recently, Taherkhani et al. [54] found that feature selection coupled with the feature extraction capability of deep learning improves the performance of deep learning models. In this paper, we consolidate this finding by cascading an F-score-based statistical model for feature selection with a DNN model for PD detection using multiple types of phonation datasets.

  2. The performance of the F-DNN method was compared with conventional DNN, ten similar integrated/hybrid intelligent systems based on conventional machine learning models (including conventional ensemble models such as AdaBoost and random forest), and many renowned previous methods. Experimental results validate the effectiveness of integrating feature selection with DNN based on three commonly used evaluation criteria, i.e., accuracy, ROC curves, and AUC.

  3. To further improve PD detection performance, this paper proposes a novel ensemble model and validates its effectiveness by demonstrating the performance improvement experimentally on two different voice datasets.

The remainder of the manuscript is organized as follows. Section 2 briefly discusses the multiple types of vowel phonation data and the proposed method. Section 3 discusses the experimental results. Section 4 deals with the comparative study, and the last section presents the conclusion of the whole study.

2 Materials and methods

2.1 Multiple types of vowel phonation datasets

In this paper, we use two different multiple types of vowel phonation datasets. The first dataset is a publicly available benchmark dataset known as multiple types of speech data for Parkinson’s disease [29]. The dataset was collected and made public by Sarkar et al. [53] and was distributed in two parts, named the training database and the testing database. It is worth noting that the name training database does not mean that these data are utilized only for training purposes and the testing database only for testing purposes. The main objective was to simulate a proposed method using the training database and then re-simulate it by training the model on the training database and testing it on the testing database, which validates the effectiveness of any proposed method more robustly. The training database was constructed by recording multiple types of speech samples, i.e., words, sentences, vowels, and numbers. Thus, 26 samples were recorded from each subject, and from each sample a set of 26 acoustic features was extracted using the Praat acoustic software [21]. However, recent studies reported that better performance can be obtained by utilizing the multiple types of vowels only, and that many of the extracted features are irrelevant for the speech data [13]. Hence, in this paper we follow the same methodology and utilize multiple types of vowel phonations for each subject. That is, for each subject three vowel phonations were considered, i.e., the vowels “a”, “o” and “u”. The vowel “a” data is denoted by \(D_{1}\), the vowel “o” data by \(D_{2}\), and the vowel “u” data by \(D_{3}\). Details of the extracted set of 26 features and their statistical parameters are tabulated in Table 1. From the testing database, the same set of features was extracted; however, the testing database contains three replications of vowel “a” and three replications of vowel “o”. Moreover, the training database contains data of 40 subjects (20 healthy subjects and 20 PD patients), and the testing database contains data of 28 PD patients.

The second dataset is also a multiple types of voice phonations dataset. It was collected by Rahman et al. [50] at Lady Reading Hospital (Medical Teaching Institution), Pakistan. This dataset is relatively larger and was collected from 160 subjects, of which 60 subjects belong to the PD class and the remaining 100 subjects to the healthy class. From each subject three vowel phonations were recorded, i.e., the vowels “a”, “o” and “u”. The vowel “a” data is denoted by \(D_{1}\), the vowel “o” data by \(D_{2}\), and the vowel “u” data by \(D_{3}\). Hence, the dataset consists of \(160 \times 3 = 480\) voice phonations, of which 180 voice recordings are from the PD class and 300 voice recordings are from the healthy class. From each sample we extracted 18 time-frequency features and 26 Mel frequency cepstral coefficient-based features, i.e., (MFCC0, MFCC1, ..., MFCC12) and their derivatives (Delta0, Delta1, ..., Delta12).
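To illustrate how the 26 MFCC-based features of the second dataset can be obtained, the following minimal sketch uses the librosa library; the sampling rate, the frame-level mean aggregation, and the function name are our own assumptions, not a description of the authors' extraction pipeline, and the 18 time-frequency features are omitted here.

```python
import librosa
import numpy as np

def mfcc_delta_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Hypothetical extraction of MFCC0-MFCC12 and Delta0-Delta12 for one recording."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # MFCC0 ... MFCC12 per frame
    delta = librosa.feature.delta(mfcc)                   # Delta0 ... Delta12 per frame
    # Average over frames to obtain one 26-dimensional vector per vowel phonation
    return np.concatenate([mfcc.mean(axis=1), delta.mean(axis=1)])
```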

Table 1 Statistical parameters of the extracted set of features. m: mean, std: standard deviation, \( Tr_{h} \): healthy subjects of the training database, \( Tr_{pd} \): PD patients of the training database, \( Ts_{pd} \): PD patients of the testing database

2.2 The proposed methods

In classification systems, feature selection methods are used to mine the most relevant features in a feature space [3, 4, 6, 9, 10, 14, 51]. This paper proposes the use of feature ranking through the F-score. The F-score-based feature ranking model measures the discrimination between two sets of real numbers [25]. For a given dataset with instances \(I_{j}\), \(j=1,2,\ldots,n\), if the number of instances related to healthy subjects is \(m_{+}\) and the number of instances of PD patients is \(m_{-}\), then the F-score of the kth feature is defined as

$$\begin{aligned} S_1= & {} \frac{1}{(m_{+}-1)} \sum _{j=1}^{m_{+}}(I_{j,k}^{(+)}-\overline{I}_{k}^{(+)})^{2} \nonumber \\ S_2= & {} \frac{1}{(m_{-}-1)} \sum _{j=1}^{m_{-}}(I_{j,k}^{(-)}-\overline{I}_{k}^{(-)})^{2} \nonumber \\ F(k)= & {} \frac{(\overline{I}_{k}^{(+)} - \overline{I}_{k})^{2} + ( \overline{I}_{k}^{(-)} - \overline{I}_{k})^{2}}{S_1 + S_2}, \end{aligned}$$
(1)

where \(\overline{I}_{k}\), \(\overline{I}_{k}^{(+)}\), and \(\overline{I}_{k}^{(-)}\) are the averages of the \(k\mathrm{th}\) feature over the whole, positive, and negative datasets, respectively. Moreover, \(I_{j,k}^{(+)}\) is the \(k\mathrm{th}\) feature of the \(j\mathrm{th}\) positive instance and \(I_{j,k}^{(-)}\) is the \(k\mathrm{th}\) feature of the \(j\mathrm{th}\) negative instance. In (1), the numerator measures the discrimination between the positive and negative sets, while the denominator measures the variation within each of the two sets [5]. The discriminative power of a feature is proportional to its F-score value. After feature ranking by the F-score-based statistical model, a threshold on the F-score must be chosen, i.e., only those features whose F-score exceeds the threshold are selected. In this study, we apply a hybrid grid search algorithm (HGSA) to search for the optimal threshold that results in an optimal subset of the extracted set of features. The obtained feature subset is supplied to the DNN model for classification.
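To make the ranking step concrete, the sketch below computes Eq. (1) for every feature of a feature matrix; the function name and the NumPy-based implementation are our own illustration, not the authors' code.

```python
import numpy as np

def f_scores(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Compute the F-score of Eq. (1) for each column (feature) of X.

    X: (n_instances, n_features) feature matrix.
    y: binary labels, 1 for healthy (positive) and 0 for PD (negative).
    """
    X_pos, X_neg = X[y == 1], X[y == 0]
    mean_all = X.mean(axis=0)
    mean_pos, mean_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Within-class scatter terms S1 and S2 (sample variances of each class)
    s1 = ((X_pos - mean_pos) ** 2).sum(axis=0) / (len(X_pos) - 1)
    s2 = ((X_neg - mean_neg) ** 2).sum(axis=0) / (len(X_neg) - 1)
    # Between-class separation divided by within-class scatter
    return ((mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2) / (s1 + s2)

# Features whose F-score exceeds a chosen threshold t are retained:
# selected = np.where(f_scores(X, y) > t)[0]
```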

The performance of a DNN model depends on its hyper-parameter configuration, and an inappropriate network configuration results in poor performance. Hyper-parameters are the variables that determine a neural network architecture or configuration. To optimize the neural network architecture, two important hyper-parameters are considered in this paper, i.e., the number of hidden layers (L) and the width of each hidden layer, i.e., the number of neurons in each hidden layer (\(W_{h}\)). It is worth mentioning that two types of neural networks are discussed in the literature, i.e., artificial neural networks (ANNs), or shallow neural networks, and DNNs. A shallow neural network or ANN refers to a neural network that uses only one hidden layer [8]. When we optimize an ANN, we can only tune the number of neurons in its hidden layer, i.e., the width of the hidden layer; we cannot tune the number of hidden layers, as there is no concept of depth in shallow neural networks. In contrast, DNNs refer to neural networks that use multiple hidden layers and are trained using new methods [1, 12, 34, 35]. More precisely, neural networks with a many-layer structure, i.e., two or more hidden layers, are called deep neural networks [44]. Before utilizing a neural network for classification tasks, it is trained on training data. During the training process, the neural network learns a fitting function known as the hypothesis \(h_{\alpha }(x)\) from the patterns of the training data. The values of the parameters of the hypothesis are estimated by minimizing the objective function given as follows

$$\begin{aligned} J(\alpha ) = \frac{1}{m} \sum _{j=1}^{m}{\text {cost}}(h_\alpha (x^{(j)}), y^{j}), \end{aligned}$$
(2)

where m stands for the number of training samples and \(\alpha \in \mathbb {R}^{d}\) represents the neural network parameters. To solve (2), we used the L-BFGS algorithm, which is an optimizer in the family of quasi-Newton methods. After the training phase, the performance of the trained network is checked by applying the validation or testing data. This is regarded as a neural network hyper-parameter optimization problem, i.e., we need to search for the hyper-parameter configuration \(B_{\lambda }\) that yields maximum generalization performance, or minimum validation loss, under LOSO CV. Thus, the main objective of the hyper-parameter optimization problem for LOSO CV on a dataset having \(N_{S}\) subjects is to find the optimal model or hyper-parameters \(\lambda \) that minimize \(l(\lambda )\). This can be formulated as follows.

$$\begin{aligned} l(\lambda ) = \frac{1}{N_{S}} \sum _{i=1}^{N_{S}}\mathcal {L}(B_{\lambda }, D^{i}_\text {train}, D^{i}_\text {valid}), \end{aligned}$$
(3)

where \(D^{i}_\text {train}\) and \(D^{i}_\text {valid}\) denote the training data and the validation (testing) data during the i-th fold of LOSO CV, respectively, and the function \(\mathcal {L}\) yields the loss obtained during each fold of the LOSO CV.
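As an illustration of Eq. (3), the sketch below estimates the LOSO CV loss of one candidate configuration \(B_{\lambda }\) with scikit-learn; the use of MLPClassifier with the lbfgs solver and a zero-one loss is our assumed concrete realization, not the authors' exact implementation.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neural_network import MLPClassifier

def loso_loss(X, y, groups, hidden_layer_sizes):
    """Mean per-subject loss l(lambda) of Eq. (3) for one DNN configuration.

    groups holds one subject identifier per row of X, so each LOSO fold
    leaves out all samples of a single subject.
    """
    losses = []
    for train_idx, valid_idx in LeaveOneGroupOut().split(X, y, groups):
        clf = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,
                            solver="lbfgs", max_iter=1000, random_state=0)
        clf.fit(X[train_idx], y[train_idx])
        # Zero-one loss on the held-out subject (an assumed choice of L)
        losses.append(1.0 - clf.score(X[valid_idx], y[valid_idx]))
    return float(np.mean(losses))
```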

Equation (3) formulates the optimization of the neural network model only. However, we also need to search for the subset of features that ensures minimum loss, which is another optimization problem. Thus, the two optimization problems are merged, or hybridized, into one. The first optimization problem is searching for an optimal threshold for the F-score-based statistical model, which results in an optimal subset of n features, while the second is the optimization of the neural network configuration. Hence, (3) can be modified as follows.

$$\begin{aligned} l(\lambda , n) = \frac{1}{N_{S}} \sum _{i=1}^{N_{S}}\mathcal {L}(B_{\lambda }, n, D^{i}_\text {train}, D^{i}_\text {valid}). \end{aligned}$$
(4)

In (4), \(\lambda \) denotes the hyper-parameters of the neural network model while n is the parameter of the F-score-based statistical model. To solve the optimization problem in (4), we utilize the HGSA algorithm. The algorithm arranges n, \(W_{1}\), and \(W_{2}\) as coordinates of a point on a grid; that is, each point on the grid is represented by \((n, W_{1}, W_{2})\), where \(W_{1}\) and \(W_{2}\) denote the number of neurons in the first and second hidden layers, respectively. Thus, the grid is hybrid in nature, as each point on the grid merges the hyper-parameters of the two models as coordinates of one point. For each experiment, or each type of dataset, the search algorithm returns an optimal point on the hybrid grid, whose coordinates denote the optimal subset of features and the optimal DNN hyper-parameters that produce the optimized performance. In the first step, HGSA yields three subsets of features and three different DNN configurations for the three different types of vowel phonation data. It is important to note that the same algorithm is used for the other ten similar hybrid intelligent systems that use ten conventional machine learning models; however, the DNN model is replaced by one of the ten conventional machine learning models, and hence \(\lambda \) denotes the hyper-parameter(s) of the conventional machine learning model under consideration.
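A minimal sketch of the hybrid grid search is given below; it assumes the f_scores and loso_loss helpers introduced above and exhaustively scans a small illustrative grid of \((n, W_{1}, W_{2})\) triples, which is our simplification of the HGSA rather than the authors' exact search procedure.

```python
from itertools import product
import numpy as np

def hybrid_grid_search(X, y, groups,
                       n_grid=range(1, 27),        # candidate feature-subset sizes
                       w1_grid=(1, 5, 9, 20, 30),  # illustrative widths, not the paper's grid
                       w2_grid=(4, 6, 18, 30)):
    """Return the (n, W1, W2) grid point with the lowest LOSO CV loss of Eq. (4)."""
    ranking = np.argsort(f_scores(X, y))[::-1]      # features sorted by decreasing F-score
    best_point, best_loss = None, np.inf
    for n, w1, w2 in product(n_grid, w1_grid, w2_grid):
        top_n = ranking[:n]                          # keeping the top-n features is equivalent
                                                     # to choosing an F-score threshold
        loss = loso_loss(X[:, top_n], y, groups, hidden_layer_sizes=(w1, w2))
        if loss < best_loss:
            best_point, best_loss = (n, w1, w2), loss
    return best_point, best_loss
```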

After constructing three different DNN models corresponding to the three different types of vowel phonations, the ensemble model, i.e., EOFSC, is developed. The EOFSC model integrates the three base classifiers (i.e., the DNN models), which are sample and feature dependent, i.e., for each type of dataset (having one specific type of sample per subject), the corresponding base classifier has its own optimal subset of features. These three base classifiers are ensembled, and a voting criterion is utilized to evaluate the final prediction of the developed EOFSC model. The working of the proposed EOFSC method is depicted in Fig. 1.
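The final voting step can be sketched as follows; the three fitted per-vowel classifiers and the hard majority vote are shown schematically, and the variable names are ours.

```python
import numpy as np

def eofsc_predict(base_models, feature_subsets, samples_by_vowel):
    """Majority-vote prediction of the EOFSC ensemble for one subject.

    base_models: the three per-vowel DNN classifiers (for vowels "a", "o", "u").
    feature_subsets: the corresponding optimal feature-index subsets.
    samples_by_vowel: the subject's three feature vectors, one per vowel.
    """
    votes = [model.predict(x[subset].reshape(1, -1))[0]
             for model, subset, x in zip(base_models, feature_subsets, samples_by_vowel)]
    # With three base classifiers, the majority class is simply the most frequent vote
    return np.bincount(votes).argmax()
```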

3 Simulation results and discussion

In this section, a total of nine experiments are performed. The first five experiments are performed using dataset 1, the next three on dataset 2, and the last experiment carries out independent testing across the two datasets. The first experiment is simulated on data that contain multiple types of vowel phonations for each individual, while the second experiment is designed for datasets having one type of vowel phonation for each subject. For both experimental settings, the F-DNN method is applied. To further validate the performance of the developed method, the third experiment is performed on the testing database following the approach of Sarkar et al. [53] and Benba et al. [19]. In order to validate the effectiveness of integrating feature selection with DNN, ten similar integrated systems are also developed in experiment four and compared with F-DNN. In the fifth experiment, the three DNN models, which are sample and feature dependent, are utilized to construct the proposed ensemble model, and its performance is compared with that of the three base classifiers; a further improvement of 5% is observed for dataset 1. Similar experiments are carried out for dataset 2. For validation purposes, leave-one-subject-out (LOSO) cross-validation is performed, and for evaluation purposes, six different evaluation metrics are utilized, including ROC curves, area under the curve (AUC), specificity, accuracy, sensitivity, and Matthews correlation coefficient (MCC).
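For reference, the scalar metrics used throughout the experiments can be computed from the pooled LOSO predictions as sketched below; this is a generic illustration with scikit-learn, not the authors' evaluation script.

```python
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def evaluation_metrics(y_true, y_pred, y_score):
    """Accuracy, sensitivity, specificity, MCC, and AUC from pooled LOSO predictions.

    y_pred holds hard class labels (1 = PD, 0 = healthy) and y_score the
    predicted probability of the PD class used for the ROC/AUC.
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true-positive rate on PD subjects
        "specificity": tn / (tn + fp),   # true-negative rate on healthy subjects
        "MCC": matthews_corrcoef(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),
    }
```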

3.1 Experiments on the first multiple types of vowel phonations dataset

3.1.1 Experiment No 1: performance of the F-DNN method using LOSO CV on data containing multiple vowel phonations for each individual

In this experiment, we consider multiple types of vowel phonations for each individual. The dataset is denoted by \(D_{m}\). During the validation process for the F-DNN model, LOSO CV is utilized: the multiple samples of one subject are used for validation and the remaining data are used to train the model. In this experiment, we utilize the F-score-based statistical model to eliminate irrelevant features, and the selected subset of features is supplied to the DNN model for classification. The evaluation measures at different subsets of features are tabulated in Table 2. It can be observed from the table that for the optimal subset of 25 features, i.e., for \(n=25\), a PD detection accuracy of 87.5% is obtained. This optimal subset, ranked by F-score and searched by the HGSA algorithm, includes all features \(F_{i}\) with \(i \in \{1, \ldots, 26\} \setminus \{18\}\). The performance of the multiple samples per subject dataset \(D_{m}\) at different subsets of features is depicted in Fig. 2.

Fig. 2 Accuracy of the multiple samples per subject dataset at different subsets of features. \(D_m\): dataset having multiple types of samples per subject. X-axis: size of the subset of features, i.e., n. Y-axis: accuracy, denoted by ACC(%) in Table 2

Table 2 Results of the proposed method using LOSO CV for multiple types of samples per subject data. L: total number of layers in the DNN including the input and output layers, \(W_{1}\): width of the first hidden layer, \(W_{2}\): width of the second hidden layer, ACC: accuracy (%), Sen: sensitivity (%), Spec: specificity (%)

To validate the effectiveness of the F-DNN method on the multiple types of vowel phonations data, we also checked the performance of the conventional neural network on the same dataset. The conventional neural network model was optimized using a grid search algorithm. Its best performance was an accuracy of 77.5%, sensitivity of 65%, specificity of 90%, and MCC value of 0.568, achieved with an optimized configuration having 9 neurons in the first hidden layer and 30 neurons in the second hidden layer. Hence, it is clear that the F-DNN method improves the accuracy of the conventional neural network by 10% for multiple samples per subject data.

3.1.2 Experiment No 2: performance of the proposed method using LOSO CV on data having one sample per subject

In this experiment, we utilize one sample per subject data; thus, we construct three different datasets from the multiple samples per subject data. Each dataset is independently processed by simulating the F-DNN method. Table 3 reports the performance of each dataset at different subsets of features and different network configurations. From the table, it is clear that the highest PD detection accuracy of 90% is obtained for \(D_{2}\), i.e., vowel “o”, at an optimal subset of features of size \(n=7\) (Table 3). Thus, the findings of this study consolidate the findings of [53], which pointed out that vowel “o” samples contain complementary information for PD compared to other types of samples. Additionally, the simulation results validate the importance of optimizing the neural network for each subset of features through HGSA: if an optimally configured neural network is not utilized, we may obtain poor performance even with an optimal subset of features. Comparing the optimal subsets of features obtained in experiment 2 with the optimal subset of features produced in experiment 1, it is evident that different types of samples are sensitive to different subsets of features and different DNN configurations. The performance of each dataset at different subsets of features is depicted in Fig. 3.

Fig. 3 Accuracies of the datasets having one type of vowel phonation for each individual at different subsets of features. \(D_{1}\): dataset containing only vowel “a” phonations, \(D_{2}\): dataset containing only vowel “o” phonations, \(D_{3}\): dataset containing only vowel “u” phonations

Table 3 Results of the proposed method using LOSO CV for datasets having one type of vowel phonation for each individual. \(D_j\), j=1,2,3: Type of dataset used. \(D_{1}\): Dataset containing vowel “a” phonation for each subject, \(D_{2}\): Dataset containing vowel “o” phonation for each subject, \(D_{3}\): Dataset containing vowel “u” phonation for each subject

From Fig. 3, it is evident that an accuracy of 72.5% is obtained for \(D_{1}\) on the full feature set, i.e., when only a conventional neural network is used, while an accuracy of 87.5% is obtained when the F-DNN integrated method is applied. Similarly, an accuracy of 80% is obtained for \(D_{2}\) with the conventional neural network, whereas an accuracy of 90% is achieved with the F-DNN integrated method. Finally, an accuracy of 67.5% is obtained for \(D_{3}\) using the conventional neural network, while an accuracy of 87.5% is achieved when the F-DNN integrated method is applied. Hence, we can conclude that the F-DNN integrated method improves the strength of the conventional DNN.

3.1.3 Experiment No 3: performance of the proposed method using LOSO validation on testing database

In this experiment, the performance of the F-DNN integrated method is validated on the testing database while the F-DNN model is trained on the data of the training database. After the training process, the performance of the F-DNN method is evaluated by performing LOSO validation on the testing database, which results in 100% accuracy. The evaluation measures on the full set of features and on a reduced subset of features are reported in Table 4. It is worth noting that the table does not report specificity and MCC because the testing database contains no healthy subjects.

Table 4 Performance of the F-DNN method for LOSO validation on testing database

3.1.4 Experiment No 4: performance comparison of the F-DNN method with other similar integrated methods

To validate the strength of the F-DNN method, we developed ten similar integrated intelligent systems by utilizing ten conventional machine learning models. Each of the developed integrated systems uses the same F-score-based statistical model for feature ranking, a machine learning model for prediction, and HGSA for searching for an optimal subset of features and an optimized version of the machine learning model. The evaluation measures for each learning system are given in Table 5. It is evident from the table that the highest accuracy of 90% is obtained by the F-DNN method.

Table 5 Performance comparison of the proposed method with other similar methods that use a conventional machine learning model instead of a deep learning model. C/E/K/\(W_{1}\): C is the hyperparameter of SVM (Lin), SVM (RBF), SVM (Poly), and LR; E is the hyperparameter of the AdaBoost and RF models denoting the number of estimators; K is the hyperparameter of the KNN model (number of neighbours); \(W_{1}\) is the number of neurons in the first hidden layer of the DNN. G/D/\(W_{2}\): G is the gamma hyperparameter of SVM (RBF); D is the hyperparameter of RF denoting the depth of the base estimator; \(W_{2}\) is the number of neurons in the second hidden layer of the DNN

To further validate the strength of the proposed F-DNN method against the conventional DNN, we utilized two more evaluation metrics, namely the ROC curve and the AUC. In machine learning, the quality of the output of a learning model is usually gauged using ROC curves; a model having a larger area under the ROC curve (AUC) is considered more robust. The ROC curves of the proposed method and the conventional DNN are drawn in Fig. 4. It is evident from these curves that F-DNN has the better ROC curve, owing to its larger AUC of 0.945.

Fig. 4 ROC charts of the conventional DNN and the proposed F-DNN model for dataset 1. a ROC chart of the conventional DNN model. b ROC chart of the DNN-based integrated system, i.e., the F-DNN model

3.1.5 Experiment No 5: performance of the EOFSC ensemble model

In this experiment, we utilize the models from experiment 2 that yielded the optimized performance for each type of data. As there are three types of sustained phonation datasets, three models with different subsets of features and DNN configurations were obtained, and the configurations of these three models (the base classifiers of the proposed ensemble model) are feature and sample dependent. In this experiment, we integrate the three models and utilize the voting criterion to evaluate the final prediction of the developed EOFSC ensemble model. From experiment 2, it can be noticed that the first DNN model with \(W_{1}=1\) and \(W_{2}=18\) yielded a PD detection accuracy of 87.5%, the second DNN with \(W_{1}=5\) and \(W_{2}=4\) yielded 90%, and the third DNN model with \(W_{1}=20\) and \(W_{2}=6\) also yielded 87.5%. However, when the proposed EOFSC method was developed, an improved accuracy of 95% was obtained, which is a further 5% better than the best of the three base classifiers. Moreover, the proposed ensemble model yielded a sensitivity of 100% and a specificity of 90%; hence, both these metrics also improved by 5% compared to the F-DNN integrated system.

3.2 Experiments on the second multiple types of vowel phonations dataset

3.2.1 Experiment No 6: performance of the proposed method using LOSO CV on data having one sample per subject

In this experiment, similar to experiment 2, we utilize one sample per subject data, resulting in three different datasets derived from the multiple samples per subject data. The F-DNN method is developed for each of the datasets. In accordance with dataset 1, the highest PD detection accuracy of 87.5% is obtained for \(D_{2}\), i.e., vowel “o”, at \(n=10\). Again, it is shown that different types of samples are sensitive to different subsets of features and different DNN configurations.

3.2.2 Experiment No 7: performance comparison of the F-DNN method with other similar integrated methods using dataset 2

Similar to experiment 4 for dataset 1, in this experiment we develop ten similar integrated intelligent systems by utilizing ten conventional machine learning models. For dataset 2, the evaluation measures for each method are given in Table 6. It is evident from the table that the highest accuracy of 87.5% is obtained by the F-DNN method, whereas the conventional DNN model results in 75% accuracy. Thus, the results for the second dataset also validate the effectiveness of integrating feature selection with the DNN model.

Table 6 Performance comparison of the proposed method with other similar methods that use conventional machine learning models for dataset 2. C/E/K/\(W_{1}\): C is the hyperparameter of SVM (Lin), SVM (RBF), SVM (Poly), and LR; E is the hyperparameter of the AdaBoost and RF models denoting the number of estimators; K is the hyperparameter of the KNN model (number of neighbours); \(W_{1}\) is the number of neurons in the first hidden layer of the DNN. G/D/\(W_{2}\): G is the gamma hyperparameter of SVM (RBF); D is the hyperparameter of RF denoting the depth of the base estimator; \(W_{2}\) is the number of neurons in the second hidden layer of the DNN

For the second dataset, we also plotted ROC curves for the proposed F-DNN method and the conventional DNN, as shown in Fig. 5. It is evident from these curves that F-DNN has the better ROC curve, owing to its larger AUC of 0.893.

Fig. 5 ROC charts of the conventional DNN and the proposed F-DNN model for dataset 2. a ROC chart of the conventional DNN model. b ROC chart of the DNN-based integrated system, i.e., the F-DNN model

3.2.3 Experiment No 8: performance of the EOFSC ensemble model

In this experiment, we utilize the models from experiment 6 that yielded the optimized performance for each type of data: for each type of vowel phonation dataset, a corresponding optimal network model is developed. The proposed EOFSC ensemble model is developed for dataset 2 by integrating the three models and utilizing the voting criterion to evaluate the final prediction. From experiment 7, it can be noticed that integrating feature selection with DNN improved the PD detection accuracy from 75% to 87.5%. The proposed EOFSC method was developed by ensembling the F-DNN models for \(D_{1}\), \(D_{2}\), and \(D_{3}\), resulting in an improved PD detection accuracy of 93.75%; thus, the proposed ensemble further improves the performance of the F-DNN models by 6.25%. Moreover, the proposed ensemble model yielded a sensitivity of 100%, a specificity of 90%, and an MCC of 0.878. The experimental results therefore show an increase in sensitivity from 91.16% to 100% and in specificity from 85% to 90%.

3.3 Experiment No 9: independent testing

Recently published work has shown that achieving high accuracy under cross-validation is relatively easy, especially when the datasets are small; however, maintaining such high performance during independent testing is a challenging task, and results obtained during independent testing are more reliable and practical. Therefore, for a more practical validation, we also carried out independent testing. We trained our model using the data of the second dataset and tested it using the data of the first dataset (20 PD and 20 healthy subjects). The optimal single-phonation performance of 70% accuracy was obtained for the vowel “o” phonations, whereas the proposed EOFSC ensemble approach yielded 85% PD detection accuracy during independent testing. These results clearly highlight the importance of the proposed EOFSC ensemble approach.

4 Performance comparison of the proposed methods with previous methods

To validate the effectiveness of the proposed EOFSC ensemble method, a comparative analysis is carried out with previously published state-of-the-art methods on both datasets. Table 7 provides a brief description of the previously reported methods on both datasets.

Table 7 Performance comparison of the proposed method with other methods

5 Conclusion

In this study, the findings of recently published work on PD detection based on multiple types of voice data were critically analyzed. It was pointed out that different types of voice phonations are sensitive to different subsets of features and different models. Based on these findings, the feasibility of integrating feature selection with DNN was evaluated. The results show that integrating feature selection with DNN models further improves their performance. Additionally, the obtained results consolidated the findings of the recently published work, i.e., for each type of voice phonation, a unique subset of features and a unique model was obtained.

Exploiting the above-discussed findings, a novel ensemble model, namely EOFSC, was developed. The ensemble model further improved the performance of the DNN-based integrated system (F-DNN) obtained under the optimal phonation by 6.5%. It was observed that the proposed ensemble model shows better performance than integrated systems based on DNN and other conventional machine learning models, and than many renowned previous methods for PD detection based on multiple types of speech, voice, or vowel phonation data. Based on the obtained results, it can be concluded that the proposed ensemble approach is a step forward in the domain of automated PD detection.