A novelty detection approach to effectively predict conversion from mild cognitive impairment to Alzheimer’s disease

Accurately recognising patients with progressive mild cognitive impairment (pMCI) who will develop Alzheimer’s disease (AD) in subsequent years is very important, as early identification of those patients will enable interventions to potentially reduce the number of those transitioning from MCI to AD. Most studies in this area have concentrated on high-dimensional neuroimaging data with supervised binary/multi-class classification algorithms. However, neuroimaging data are more costly to obtain than non-imaging data, and healthcare datasets are normally imbalanced, which may reduce classification performance and reliability. To address these challenges, we proposed a new strategy that employs unsupervised novelty detection (ND) techniques to predict pMCI from the AD neuroimaging initiative non-imaging data. ND algorithms, including the k-nearest neighbours (kNN), k-means, Gaussian mixture model (GMM), isolation forest (IF) and extreme learning machine (ELM), were employed and compared with supervised binary support vector machine (SVM) and random forest (RF). We introduced optimisation with nested cross-validation and focused on maximising the adjusted F-measure to ensure maximum generalisation of the proposed system by minimising false negative rates. Our extensive experimental results show that ND algorithms (0.727 ± 0.029 kNN, 0.7179 ± 0.0523 GMM, 0.7276 ± 0.0281 ELM) obtained performance comparable to the supervised binary SVM (0.7359 ± 0.0451) with 20% stable MCI misclassification tolerance, and were significantly better than RF (0.4771 ± 0.0167). Moreover, we found that the non-invasive, readily obtainable, and cost-effective cognitive and functional assessment was the most efficient predictor of pMCI within 2 years with ND techniques. Importantly, we presented an accessible and cost-effective approach to pMCI prediction, which does not require labelled data.


Introduction
Alzheimer's disease (AD) is the most common cause of dementia, accounting for around 60-80% of dementia cases [7]. It is estimated that there will be 12.7 million AD patients in the US by 2050 [7]. Mild cognitive impairment (MCI) is an intermediate stage between normal cognitive ageing and dementia [64]. Most MCI patients remain stable (stable MCI, sMCI), and some return to normal cognition [8]. However, MCI patients are at greater risk of converting to AD than cognitively normal individuals [51]. Studies have reported a 3-15% annual conversion rate from MCI to AD (progressive MCI, pMCI) [22,52]. Receiving an MCI diagnosis is traumatic for patients and their families. Therefore, accurate and early identification of the MCI patients who will convert to AD within a few years and those who will remain MCI is essential for determining treatments [14]. Early prognosis of pMCI can bring significant clinical and economic benefits [7]. However, it is often challenging for expert clinicians to identify pMCI patients at an early stage [73]. Advanced clinical decision support systems based on machine learning have been developed and widely used in healthcare [19,62] for the prediction and classification of different diseases, such as breast cancer [42][43][44], brain disorders [21] and AD [10,11,20,48,49,58].
The most commonly used machine learning method in this area is the SVM [73], which can achieve high accuracy on high-dimensional data via the kernel trick. SVM has been widely used to classify sMCI and pMCI patients based on MRI and FDG-PET data [3]. Zhang and Shen [74] achieved AUCs of 69.7% and 67.6% with MRI and FDG-PET images, respectively. Gaser et al. [23] obtained 81% AUC with MRI data for predicting pMCI. A domain transfer SVM was proposed to predict the conversion of MCI, reporting 76.4% AUC on MRI data and 74.1% on FDG-PET data [18]. With regions of interest extracted from MRI data, an SVM classifier achieved 75.6% accuracy in predicting the conversion of pMCI patients within 3 years [11]. The AUC for classifying sMCI and pMCI reached 93.59% based on the combination of Apolipoprotein E (ApoE), MRI, FDG-PET and CSF features [26]. In that study, the NiftyReg toolbox [53], a feature extraction tool for AD images, was employed to extract features from MRI and FDG-PET images, and the CSF and ApoE modalities were transferred into a high-dimensional space using the random tree embedding method.
However, MRI is expensive and less accessible [36], and it often poses challenges for some patient groups. FDG-PET is invasive and expensive, involves radiation exposure [3], and most basic clinics do not have standardised screening equipment [37]. Performing FDG-PET in all MCI patients is neither justified nor recommended [67]. Demographic and clinical characteristics are much cheaper and easier to obtain than imaging data and biomarkers, and measures from neuropsychological assessments have proven effective in detecting the conversion of MCI by statistical analysis [39]. Moreover, neuropsychological assessments [17] have also proven to be good predictors when applying machine learning [9]. The power of cognitive and functional assessments (CFA) to predict AD has also been compared with biomarkers, imaging data and multi-modal combinations, and the CFA was the best predictor [15]. The combination of age, CFA and MRI improved the AUC to 90% from the 80% obtained with MRI alone. Moreover, there are only around 36 sMCI records and 46 pMCI records in the ADNI dataset used in [26], which is insufficient for our research. Therefore, we focused on non-imaging data in this paper to find the most effective modalities for pMCI prediction.
The imbalance of samples from different health conditions poses a challenge when training machine learning algorithms [6]. Imbalanced data may cause inflated performance estimates for a binary/multi-class classifier. The synthetic minority over-sampling technique (SMOTE) [65] and cluster-based instance selection (CBIS) [71] are commonly used re-sampling techniques for dealing with imbalanced data. However, these methods either generate new data points to enlarge the minority class or remove data points to reduce the majority class, which changes the original data. Novelty detection (ND), also called one-class classification [60], deals with imbalanced data differently, since knowledge of only one class (the minority or majority class) is used during the training phase [38]. ND has two advantages: (1) it is unsupervised; (2) only data from one class (the normal or given class) are needed to train a model. An enclosed boundary is trained to surround the given class tightly and is then used to classify unseen data: data points inside the boundary are classified as normal, otherwise they are abnormal [24]. ND has been widely used in real-life applications, such as electronic security systems [47,56] and tumour and disease detection [50,66,72]. A comparison of ND algorithms and multi-class classification on different datasets [70] found that ND techniques work better than traditional multi-class classification algorithms when there is a high imbalance ratio in the data.
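To make the re-sampling idea concrete, the core of SMOTE can be sketched in a few lines of NumPy: each synthetic minority point is an interpolation between a minority sample and one of its k nearest minority neighbours. This is an illustrative sketch under our own naming and toy data, not the reference implementation.

```python
import numpy as np

def smote_sample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating between
    each chosen sample and one of its k nearest minority neighbours
    (the core idea behind SMOTE)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # exclude self-distance
    nbrs = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours
    base = rng.integers(0, n, size=n_new)        # samples to start from
    pick = nbrs[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                 # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[pick] - X_min[base])

# toy 2-D minority class
X_min = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [0.5, 0.5], [0.2, 0.8]])
X_new = smote_sample(X_min, n_new=4, k=3, rng=0)
print(X_new.shape)  # (4, 2)
```

Because every synthetic point is a convex combination of two real minority points, the new samples always lie between existing ones, which is exactly why SMOTE changes the original data distribution, as noted above.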
To the authors' knowledge, there is no study utilising ND techniques to predict pMCI subjects based on non-imaging data. In this paper, five ND techniques, including the Gaussian mixture model (GMM) [75], k-nearest neighbours (kNN) [27], k-means [33], extreme learning machine (ELM) [40] and isolation forest (IF) [43], were compared for predicting the conversion of pMCI. These basic ND algorithms are easily implemented (e.g., they have few hyperparameters and training parameters) and easily understood. Moreover, one objective of this paper is to show that ND algorithms have potential for pMCI prediction, not to find the best ND algorithm. We compared the performance of the ND algorithms with binary-class SVM and RF. The SVM is the most commonly used algorithm in this area and achieved the best AUC in Gupta et al. [26], while the RF achieved the best classification performance in Bucholc et al. [15] and 85% AUC in Pereira et al. [63]. We also optimised the non-imaging modalities and the parameters with nested cross-validation (nCV), an effective way to incorporate feature selection and parameter tuning when training an optimal machine learning model [61]. This enables thorough evaluation to ensure generalisability and limits leakage of data across training and testing sets. Compared with related work, the proposed strategy employs sMCI data only to build a model, which can then be used to predict whether an MCI patient will develop AD within 2 years. We demonstrated that the CFA can reliably and accurately predict the conversion of MCI. The approach could assist clinicians in the prognosis of MCI subjects with high conversion risk in future, as well as reduce the assessment time and cost of patient evaluation.
This study has two objectives: (1) To prove that ND techniques can predict pMCI patients who have a high risk of developing AD in 2 years. (2) To optimise the neuropsychological assessments to find the most cost-effective and time-saving assessments for predicting pMCI.
The contribution of this study is four-fold: (1) introducing ND techniques to predict pMCI subjects who are at high risk of converting to AD, based on easily obtained non-imaging data; (2) assessing the detection performance in comparison to binary-class classifiers; (3) optimising the number of neuropsychological assessments, considering the time they take, so that MCI conversion can be predicted quickly with ND techniques; and (4) using a comprehensive evaluation framework involving nCV and optimisation of the adjusted F-measure (AGF) to ensure maximum generalisation of the proposed system, minimise false negative rates and limit data leakage.

Participants
The data used in this study were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu), which is the most frequently used dataset for developing computational approaches to predict the MCI subjects who are at high risk of converting to AD early. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD). For up-to-date information, see www.adni-info.org.
All subjects performed a clinical dementia rating (CDR) test and received the CDR sum of boxes (CDR-SB) score, which is widely accepted and used as a reliable and objective gold standard for AD assessment [9,15,17]. The CDR-SB score ranges from 0 to 18.0. We categorised the data into normal control (NC), MCI and AD (see Table 1).

Selected records
In our research, a total of 1060 subjects from the ADNI1, ADNIGO, ADNI2 and ADNI3 stages were considered. They were categorised into two groups: 681 sMCI subjects (293 male and 388 female, see Table 2) who were MCI at baseline and throughout the follow-up period, and 379 pMCI subjects (229 male and 150 female, see Table 2) who were MCI at baseline but converted to AD at follow-up time points within the whole ADNI study.
The selection of records for sMCI and pMCI subjects is illustrated in Fig. 1, where '0' denotes the baseline record (the first time a patient attended the test); '1', '1.5', '2', '3' and '4' represent the records of the 1-year, 1.5-year, 2-year, 3-year and 4-year follow-ups; and 'end' denotes the last record in ADNI so far. The red lines mark the selected records. As shown in Fig. 1, the baseline (0-year) records of sMCI subjects were selected as the sMCI class data in this study. For the pMCI subjects, the earliest record within 2 years before conversion to AD was selected. For example, if a subject was MCI at baseline, converted to AD at the 4-year follow-up and has a 2-year record, then the 2-year record is selected (red line in Fig. 1a). If there is no 2-year record, the 3-year record is selected (red line in Fig. 1b). If a patient converted to AD 1.5 years after baseline, the baseline record is selected (red line in Fig. 1c). If there are no records within 2 years before conversion to AD, the subject was not considered. Our way of selecting the pMCI data is in line with a previous study [46].
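The selection rule above can be sketched as a small helper. The names `visits` (a mapping from follow-up time in years, 0 = baseline, to a record) and `conversion_time` are hypothetical, introduced only for illustration; they are not fields of the ADNI data.

```python
def select_record(visits, conversion_time=None):
    """Pick the record used for one subject, following the rule above.
    `visits` maps follow-up time in years (0 = baseline) to a record;
    `conversion_time` is the follow-up year at which the subject
    converted to AD (None for sMCI subjects)."""
    if conversion_time is None:
        return visits[0]                       # sMCI: baseline record
    # pMCI: earliest record within 2 years before conversion
    eligible = [t for t in visits if conversion_time - 2 <= t < conversion_time]
    if not eligible:
        return None                            # no eligible record: subject excluded
    return visits[min(eligible)]

# subject converts at year 4 and has a 2-year record -> the 2-year record (Fig. 1a)
visits = {0: "bl", 1: "m12", 2: "m24", 3: "m36"}
print(select_record(visits, conversion_time=4))  # m24
```

With the 2-year record missing, the same call returns the 3-year record (Fig. 1b), and for a conversion at 1.5 years it returns the baseline record (Fig. 1c), matching the cases described above.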

Modalities
As mentioned above, most studies aiming to predict pMCI patients used neuroimaging data, CSF biomarkers, CFA, genetic biomarkers and their combinations. However, obtaining such imaging data and biomarkers is expensive and needs well-trained technicians [55]. Also, the FDG-PET is invasive and involves radiation exposure [3].
Considering those factors, we focused on the demographic, neuropsychological assessment, and Apolipoprotein E genotype (the number of ApoE ε4 alleles, ApoE4) modality categories. Eight potential non-imaging variables available in the ADNI dataset and easily assessed in a clinical setting at minimal cost and without special equipment were selected, to determine whether these cost-effective and non-invasive AD factors have high discriminative power for prognosing pMCI. The features were grouped into four modality types: (1) demographics (age); (2) genetic (ApoE4); (3) cognitive and functional assessments (CFA), including the Mini-Mental State Examination (MMSE), Functional Activities Questionnaire (FAQ) and Alzheimer's Disease Assessment Scale-cognitive subscale (ADAS13); and (4) memory, including logical memory-delayed recall total number of story units recalled (LDELTOTAL), Rey's Auditory Verbal Learning Test (RAVLT) immediate (RAVLT.immediate) and RAVLT percent forgetting (RAVLT.perc.forgetting) [17]. Accordingly, we analysed the performance of ND models constructed with each modality type and their combinations. The demographic and neuropsychological characteristics of the studied subjects are detailed in Table 2. More details of the different modalities can be found in the supplementary material.

Selected novelty detection algorithms
Five commonly used and easily interpretable ND algorithms based on GMM, kNN, k-means, ELM and IF were selected.
(a) The kNN [27] is a distance-based ND algorithm. An unseen data point, x, is discriminated as normal or abnormal by comparing the distance NN_k(x) from x to its k nearest neighbours with the distance NN_k(NN_k(x)) from those neighbours to their own k nearest neighbours. The acceptance of x can be defined as f(x) = I(NN_k(x) / NN_k(NN_k(x)) ≤ θ), where I(·) is the indicator function and θ is a threshold.
(b) The k-means [33] is a traditional clustering-based algorithm. Unseen data is classified as normal if it belongs to the clusters trained on the given normal data only; otherwise, it will be detected as an anomaly.
(c) The GMM [75] is a popular probabilistic method. It builds a model combining several weighted Gaussian components. For an N-dimensional vector, x, the probability density function under parameters θ = {w_i, μ_i, Σ_i} can be written as p(x|θ) = Σ_{i=1..M} w_i g(x|μ_i, Σ_i), where w_i is the weight assigned to the i-th Gaussian component and g(x|μ_i, Σ_i) is the Gaussian density of each component; μ_i and Σ_i are the mean vector and covariance matrix, respectively. A new unseen data point is classified as normal if its probability is larger than a threshold; otherwise, it is an anomaly.
(d) The ELM [40] is a randomised learning neural network whose input weights and biases are randomly determined and whose output weights are computed without traditional iterative training. The ELM network contains n input nodes, L hidden nodes and one output. It identifies novel data by a sign function: for a test point, if the value of the sign function is −1, the point is novel; if it is +1, the point is normal.
(e) Isolation forest [43]. The IF exploits the fact that anomalous observations are few and significantly different from normal observations. During training, the data are recursively partitioned according to a randomly selected attribute and split value, and each partition is used to construct iTrees until the data points are isolated. New data are classified by an anomaly score, defined as S(x, n) = 2^(−E(h(x))/c(n)), where h(x) is the path length of x, E(h(x)) is its average over the iTrees and c(n) is the average path length of an unsuccessful search in a binary search tree of n points. If S is smaller than 0.5, the new data point is normal; if S is greater than 0.5 and close to 1, it is abnormal.
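As a concrete illustration of the distance-based idea in (a), a minimal kNN-style novelty detector can be fitted on the target class only. This is a simplified sketch, not the exact formulation used in the paper: the score here is just the distance to the k-th nearest training point, and the threshold is taken from the training data so that roughly a chosen fraction of target points would be rejected. The synthetic data and function names are our own.

```python
import numpy as np

def knn_novelty_detector(X_train, k=2, gamma=0.2):
    """Fit a simple kNN novelty detector on the target (e.g. sMCI) class only.
    Score = distance to the k-th nearest training point; the threshold is the
    (1 - gamma) quantile of the training points' own leave-one-out scores, so
    roughly a fraction gamma of the target class falls outside the boundary."""
    def score(x, exclude_self=False):
        d = np.sort(np.linalg.norm(X_train - x, axis=1))
        return d[k] if exclude_self else d[k - 1]  # skip the zero self-distance
    train_scores = [score(x, exclude_self=True) for x in X_train]
    thr = np.quantile(train_scores, 1 - gamma)
    return lambda x: score(x) <= thr               # True -> accepted as target class

rng = np.random.default_rng(0)
X_target = rng.normal(0, 1, size=(100, 3))         # stand-in for target-class features
accept = knn_novelty_detector(X_target, k=2, gamma=0.2)
print(accept(np.zeros(3)), accept(np.full(3, 8.0)))  # near the centre vs far away
```

A point near the centre of the training cloud falls inside the learned boundary and is accepted, while a far-away point is rejected as novel, mirroring the inside/outside decision described earlier.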

Performance metrics
Selecting the right metrics to assess and compare the performance of models is a very important step in ND.
Machine learning algorithms are normally evaluated with a confusion matrix, shown in Table 3. The true positives (TP) and false negatives (FN) are the numbers of positive (pMCI in this study) subjects correctly classified and misclassified, respectively. The true negatives (TN) and false positives (FP) are the numbers of negative (sMCI) subjects correctly classified and misclassified, respectively.
The overall accuracy is the most commonly used metric to evaluate a classification algorithm. However, it can be misleading in ND problems or imbalanced classification since it puts more weight on the negative/majority class [12]. Therefore, to fairly assess the performance of ND algorithms, the Adjusted F-measure (AGF) and the area under the receiver operating characteristic curve (AUC) were used in this study for optimising the modalities and parameters and fairly evaluating the performance of the trained models on identifying pMCI and both sMCI and pMCI patients.
(a) The adjusted F-measure (AGF) [4], an improvement of the F-measure (F_1 score), is an evaluation metric for imbalanced classification. The AGF gives more weight to the positive class than other measures, and a higher AGF score indicates better performance. The AGF is defined as AGF = sqrt(F_2 · InvF_0.5), where F_2 is the F_{β=2} score and InvF_0.5 is the F_{β=0.5} score computed with inverted class labels (switching the initial positive and negative classes). The general F score is defined as F_β = (1 + β²) · precision · recall / (β² · precision + recall). (b) The AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate across a continuum of thresholds [30]. It is a fundamental metric for evaluating a diagnostic model and has been commonly used in biomedical research to assess classification and prediction performance for disease diagnosis and prognosis [28].
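Given the F-beta definition above, the AGF can be computed directly from confusion-matrix counts. The counts in the example below are invented for illustration only.

```python
import math

def f_beta(tp, fp, fn, beta):
    """General F-beta score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

def agf(tp, fp, fn, tn):
    """Adjusted F-measure: geometric mean of F2 on the original labels
    and F0.5 computed after inverting the positive and negative classes."""
    f2 = f_beta(tp, fp, fn, beta=2)
    inv_f05 = f_beta(tn, fn, fp, beta=0.5)   # labels switched: TN becomes TP, etc.
    return math.sqrt(f2 * inv_f05)

# e.g. 30 pMCI correctly flagged, 8 missed, 10 sMCI misclassified, 50 correct
print(round(agf(tp=30, fp=10, fn=8, tn=50), 4))
```

Because F2 emphasises recall on the positive (pMCI) class, a model that misses many converters is penalised heavily, which is why the AGF was chosen here to minimise false negative rates.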

Experiments setup
Nested cross-validation (nCV) [61] was employed to obtain an unbiased evaluation of each model. The nCV comprises a fivefold inner CV for tuning parameters and selecting features, and a fivefold outer CV for evaluating the performance of the different ND and binary classification algorithms. First, the original data, including sMCI and pMCI, were each split into five folds; four folds were used as the model development set for the inner CV and one fold as an independent test set (see Fig. 2a).
For ND, the sMCI in the model development set was further split into 5 subsets. Four of the five subsets were used as a training set to optimise the hyperparameters of the different algorithms, i.e., the number of nearest neighbours in kNN, the number of clusters in k-means, the number of Gaussian components in GMM, the sigma of the RBF kernel in ELM and the number of trees in IF. The held-out subset and the pMCI in the model development set were used as a validation set, on which the trained model was validated across different parameters and modalities. The optimal parameters for each modality were selected by the mean AUC over the fivefold inner CV. The final model was trained on the sMCI subjects of the model development set with the optimised hyperparameters that achieved the best average inner CV performance, and was tested on the independent test set in the outer CV. An overview of the nCV framework for ND is shown in Fig. 2a. We also provide the nCV structure for the supervised binary-class classification algorithms in Fig. 2b. Unlike ND, binary-class classification needs both sMCI and pMCI in the training set. As shown in the fivefold inner CV in Fig. 2b, after obtaining the model development set, both the sMCI and pMCI classes were split into 5 folds; one fold was used as a validation set and the remaining four folds as the training set in the inner CV stage.
The optimal parameters for the different modalities, i.e., those with the best mean AUC over the fivefold inner CV, were selected. For the outer CV, the model development set, including both sMCI and pMCI, was used to train the final model. We did not use an inner CV in the SVM and RF experiments, since their parameters were optimised by the optimisation methods inside the toolbox.
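The nCV loop for ND described above can be sketched end-to-end. Everything in this sketch is a stand-in: Gaussian blobs replace the real sMCI/pMCI features, a simple distance-to-k-th-neighbour score replaces the tuned ND models, the candidate k values are arbitrary, and the AUC is computed with a plain rank formula.

```python
import numpy as np

def auc(scores_pos, scores_neg):
    """Rank-based AUC: P(score of a positive > score of a negative)."""
    s = np.concatenate([scores_pos, scores_neg])
    r = s.argsort().argsort() + 1            # ranks 1..n (ties ignored for brevity)
    n_pos, n_neg = len(scores_pos), len(scores_neg)
    return (r[:n_pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def folds(n, k, rng):
    """Random partition of range(n) into k folds."""
    return np.array_split(rng.permutation(n), k)

def nd_score(X_train, X_eval, k):
    """Novelty score: distance to the k-th nearest training point
    (higher = more novel, i.e. more pMCI-like)."""
    d = np.linalg.norm(X_eval[:, None, :] - X_train[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, k - 1]

rng = np.random.default_rng(0)
X_smci = rng.normal(0.0, 1, (100, 4))        # stand-in for sMCI features
X_pmci = rng.normal(1.5, 1, (50, 4))         # stand-in for pMCI features

outer_aucs = []
for s_test, p_test in zip(folds(100, 5, rng), folds(50, 5, rng)):
    s_dev = np.setdiff1d(np.arange(100), s_test)
    p_dev = np.setdiff1d(np.arange(50), p_test)
    # inner CV: train on sMCI development folds only, validate on held-out sMCI + pMCI
    best_k, best_auc = None, -1.0
    for k in (1, 3, 5):
        inner = []
        for s_val in folds(len(s_dev), 5, rng):
            tr = np.setdiff1d(np.arange(len(s_dev)), s_val)
            pos = nd_score(X_smci[s_dev[tr]], X_pmci[p_dev], k)
            neg = nd_score(X_smci[s_dev[tr]], X_smci[s_dev[s_val]], k)
            inner.append(auc(pos, neg))
        if np.mean(inner) > best_auc:
            best_k, best_auc = k, np.mean(inner)
    # retrain on the full development sMCI, evaluate on the outer test fold
    pos = nd_score(X_smci[s_dev], X_pmci[p_test], best_k)
    neg = nd_score(X_smci[s_dev], X_smci[s_test], best_k)
    outer_aucs.append(auc(pos, neg))
print(round(float(np.mean(outer_aucs)), 3))
```

Note that the pMCI data never enter the training step, only the validation and test steps, which is the property that lets this scheme sidestep class imbalance.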
In ND, a parameter specifying the fraction of target subjects rejected (the sMCI misclassification tolerance), γ, should be pre-defined to determine the proportion of sMCI subjects allowed outside the decision boundary built in the training stage. In this study, the parameter γ determined the power of the model to identify sMCI patients accurately. We evaluated different values of γ (γ = 0.1, 0.2 and 0.3), enabling evaluation of performance when 10%, 20% and 30% of sMCI subjects were allowed to be misclassified as pMCI. The parameters of kNN, k-means, GMM and ELM were selected from the range {1, 2, …, 15}. The parameter of IF was chosen from the range {5, 10, …, 100}.
The implementation detail of the proposed strategy based on nCV is given in Fig. 3. All experiments were implemented in MATLAB 2021b. The DD_Tools toolbox [68] was used for ND implementation.

Experimental results and analysis
Extensive experiments were conducted on every single modality and their combinations. The results of ND models and the binary classifiers were reported and analysed in this section.

Performance on the validation set
The mean AGF values obtained with the best parameters over the fivefold inner CV are recorded in Table 4. The best modalities for each ND algorithm with respect to the AGF are highlighted in bold in Table 4. The single CFA modality achieved the best AGF among the single modalities. With 10% sMCI misclassification tolerance (γ = 0.1), the ELM obtained the highest AGF (0.8934 ± 0.005) with the combination of ApoE4 and CFA (ApoE4 + CFA). Employing Age + ApoE4 + CFA, the highest AGF was achieved by the GMM (0.7985 ± 0.0074). The single CFA yielded the highest AGF for k-means (0.6559 ± 0.0149) and IF (0.6345 ± 0.0144). With γ = 0.2, the ELM obtained the best AGF (0.896 ± 0.0036) with ApoE4 + CFA and the kNN obtained the highest AGF (0.8395 ± 0.0033) with the single CFA. With γ = 0.3, ApoE4 + CFA produced the best AGF with the ELM (0.897 ± 0.002), GMM (0.8935 ± 0.0035), kNN (0.8861 ± 0.0031) and k-means (0.862 ± 0.0056), while the single CFA gave the best AGF for the IF (0.8537 ± 0.0053).
Overall, ApoE4 + CFA produced the best AGF for 9 out of 15 ND algorithm configurations (5 algorithms, each with γ = 0.1, 0.2 and 0.3) and the single CFA produced the highest AGF for 5 out of 15 configurations for predicting pMCI subjects. The CFA, ApoE4 + CFA, Age + CFA and Age + ApoE4 + CFA gave the best four AGF values across the different algorithms (Table 4). Interestingly, combining the CFA improved the performance of all models trained with the other single modalities and their combinations. To determine whether there are significant differences among the different modalities, we performed an ANOVA (ANalysis Of Variance) test [35] on the AGF averaged across the fivefold inner CV. The result shows a significant difference between at least one pair of modalities (p = 9.09E−41). To further investigate which pairs of modalities differed significantly, we conducted the Tukey post hoc test [31] (see Table 5). From the studentised range table [31], the critical value is 3.399: if the score of a pair of modalities is greater than 3.399, the pair differs significantly; otherwise, there is no difference between the pair. We found significant differences between most of the modalities. However, there were no significant differences among the CFA, Age + CFA, CFA + ApoE4 and Age + CFA + ApoE4, which are the most effective modalities for predicting pMCI with ND algorithms (see Table 5). Importantly, there are significant differences between the modalities with and without the CFA, meaning that the CFA plays an important role in predicting pMCI with ND algorithms.
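The ANOVA-plus-Tukey procedure used above can be sketched for equal-sized groups. The toy AGF values below are invented, and the function is a simplified version of the studentised-range computation, not the exact analysis pipeline used for Table 5; each pairwise q statistic would be compared against the critical value from the studentised range table.

```python
import numpy as np
from itertools import combinations

def anova_tukey(groups):
    """One-way ANOVA F statistic plus pairwise Tukey q statistics
    (studentised range) for equal-sized groups."""
    k, n = len(groups), len(groups[0])
    means = np.array([g.mean() for g in groups])
    grand = means.mean()
    ss_between = n * ((means - grand) ** 2).sum()
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_within = ss_within / (k * (n - 1))          # within-group mean square
    f_stat = (ss_between / (k - 1)) / ms_within
    se = np.sqrt(ms_within / n)                    # standard error for Tukey q
    q = {(i, j): abs(means[i] - means[j]) / se
         for i, j in combinations(range(k), 2)}
    return f_stat, q

rng = np.random.default_rng(0)
# toy per-fold AGF values for three hypothetical modalities
groups = [rng.normal(mu, 0.02, 5) for mu in (0.66, 0.72, 0.73)]
f_stat, q = anova_tukey(groups)
print(round(float(f_stat), 1))
```

A large F statistic indicates that at least one pair of groups differs; the individual q values then localise which pairs are responsible, mirroring the two-step analysis reported above.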

Performance on the test data
To validate the performance of the different modalities and compare the ND algorithms with the binary classifiers, we report the mean AGF calculated from the outer-fold independent test sets in Table 6. The results are consistent with those of the inner CV (Table 4). The best AGF values and corresponding modalities for each ND algorithm are highlighted in Table 6. The CFA was the most promising single-modality predictor of pMCI subjects. The CFA, ApoE4 + CFA, Age + CFA and Age + ApoE4 + CFA gave the best four AGF values in different situations. For example, with γ = 0.1, Age + CFA gave the highest AGF of the ELM (0.7232 ± 0.021), which was the best AGF among all the algorithms. ApoE4 + CFA gave the highest AGF of k-means (0.5732 ± 0.0663). The single CFA gave AGFs of 0.5916 ± 0.0373 and 0.5175 ± 0.0671 for kNN and IF, while the GMM obtained its best AGF (0.6669 ± 0.0347) with Age + ApoE4 + CFA.
With γ = 0.2, the best AGFs of ELM, kNN and IF were obtained with the single CFA (0.7276 ± 0.0281, 0.727 ± 0.029 and 0.6391 ± 0.0632). The best AGFs of GMM (0.7179 ± 0.0523) and k-means (0.6606 ± 0.0443) were obtained with ApoE4 + CFA and Age + CFA, respectively. With γ = 0.3, ApoE4 + CFA gave the best AGFs of 0.7341 ± 0.0216 and 0.7297 ± 0.0233 for kNN and GMM, and the single CFA gave the best AGFs of 0.677 ± 0.0205, 0.7055 ± 0.0304 and 0.6767 ± 0.047 for k-means, ELM and IF, respectively. Overall, 8 out of 15 algorithm configurations achieved their best AGF with the CFA modality and 4 out of 15 with the ApoE4 + CFA modality.
The mean AUC of the fivefold outer CV is reported in Table 7 to show the overall performance of the selected models on both sMCI and pMCI identification, with the best modalities for each ND algorithm highlighted in bold. The best AUCs of the different algorithms at each γ were obtained with CFA and ApoE4 + CFA. Moreover, the kNN obtained its best AUC with ApoE4 + CFA at every γ: with γ = 0.1, 0.2 and 0.3, ApoE4 + CFA gave AUCs of 0.8551 ± 0.026, 0.8552 ± 0.0256 and 0.8535 ± 0.0234, respectively. The CFA gave the best AUC of the GMM (0.8441 ± 0.0294, 0.8445 ± 0.0255 and 0.8453 ± 0.0309).

Results of supervised binary classification algorithm
The AGF and AUC of the SVM and RF over the fivefold outer CV are shown in Table 8, with the best modalities for each algorithm with respect to the AGF and AUC highlighted in bold, respectively. We found that the SVM performed better than the RF in this setting. The highest AGFs of the SVM (0.7359 ± 0.0451) and RF (0.4771 ± 0.0167) were achieved with Age + CFA and CFA + Memory, respectively. In terms of AUC, the SVM presented a low value (0.5726 ± 0.0119) with the single CFA and its highest value (0.8651 ± 0.0242) with CFA + Memory. Compared with the SVM and RF, the ND algorithms achieved comparable performance with less data information.

Discussion
Identifying MCI patients at higher risk of conversion to AD is crucial for effective treatment of the disease. In this paper, we have proposed a new strategy that employs ND techniques (k-means, kNN, GMM, ELM and IF) to predict the conversion of MCI with cognitive and functional assessments. Two supervised binary classification algorithms, SVM and RF, were compared with the unsupervised ND techniques on the same datasets. The best parameters and modalities were selected based on the AGF within the nCV structure to generalise the trained models and limit data leakage.
By comparing the performance of different modalities and their combinations with the trained ND models, we found that the single CFA and ApoE4 + CFA modalities achieved the best AUC (Table 7), and the CFA, Age + CFA, ApoE4 + CFA and Age + ApoE4 + CFA modalities achieved the best four AGF values (Tables 4, 6). By comparing the performance of the ND models with the SVM and RF (Table 8), we found that, without labels or pMCI subjects, the ND techniques achieved comparable performance in identifying pMCI subjects, meaning that ND techniques have the potential to predict pMCI given sMCI data only. We are not claiming that ND techniques are the best for predicting pMCI subjects; rather, we propose a new pathway to identify and predict pMCI subjects by training ND models on sMCI subjects only.
By applying our proposed strategy, we found that the CFA modality, including the Mini-Mental State Examination (MMSE), the Alzheimer's Disease Assessment Scale-cognitive subscale (ADAS13) and the Functional Activities Questionnaire (FAQ), was a very important and effective predictor for identifying the pMCI subjects who converted to AD within the 2-year follow-up in the ADNI dataset. Moreover, the CFA modality provided additional discriminative information for identifying pMCI subjects correctly, since pMCI predictions combining the CFA were more accurate than those based solely on another single modality. These results lend support to existing clinical practices that rely relatively heavily on the CFA [25]. Even though ApoE4 + CFA achieved the highest AUC in most of the experiments, there was no significant difference between the single CFA and ApoE4 + CFA for predicting the conversion of MCI according to the Tukey post hoc test in Table 5. The ApoE genotype is obtained from a blood test, not all laboratories have the equipment for ApoE genotyping, and if patients are referred for an ApoE genotyping test, obtaining the results can take a long time [59]. Therefore, the CFA should be considered a significant predictor for predicting the conversion of MCI.
Comparing the AGF obtained from the outer CV with that from the inner CV, we found slightly different results. For example, with γ = 0.2, the ELM achieved its best AGF with ApoE4 + CFA in the inner CV, while in the outer CV its best AGF was obtained with the single CFA. There are two possible reasons for this slight difference: (1) the locations of the points in the validation set and test set differ, and (2) there is a gap between the trained boundary and the edge points of the training set, and the width of this gap differs at different locations. Therefore, the prediction results on the validation and test sets may differ slightly. However, this slight difference can be ignored, since there is no significant difference among the four best modalities (Table 5).
Compared with the supervised binary classifiers, the ND algorithms are unsupervised: labels are not needed for training and evaluating the models. Additionally, the imbalanced nature of the dataset can be ignored with ND techniques, since only sMCI subjects are needed to train the ND models, while pMCI subjects are used only for validating and testing them. Furthermore, the rejection rate of sMCI subjects can be set with ND techniques, so the level of tolerance for sMCI misclassification can be controlled. Moreover, the ND algorithms achieved better AUC than the SVM and RF with the CFA alone. The kNN, GMM and ELM algorithms can provide an effective prediction of pMCI subjects, especially with 20% and 30% sMCI misclassification tolerance. In this work, we recommend γ = 0.2, since γ = 0.3 would cause more sMCI patients to receive inappropriate treatment and bring an unnecessary financial burden to their carers. As mentioned in the introduction, an increasing number of studies classify sMCI and pMCI patients based on neuroimaging data, i.e., MRI and FDG-PET, with supervised binary/multi-class classification algorithms. However, neuroimaging data are usually costly to obtain, and FDG-PET is invasive, which may cause discomfort and may not be suitable for some patients. We only used clinical assessments in this paper, which are easier to obtain, more cost-effective and more time-saving than neuroimaging data. With the proposed strategy, patients only need to complete three assessments for pre-diagnosis: the MMSE takes 5-10 min [29], the FAQ 6-10 min, and the ADAS13 usually 30-40 min [9].
We recognise two key limitations of our research. Firstly, only the ADNI dataset was used in the experiments; alternatives do exist, such as the Australian Imaging Biomarkers and Lifestyle Study of Ageing (AIBL) and the National Alzheimer's Coordinating Centre (NACC) datasets. Secondly, we only considered the total score of each assessment, so some details captured within the assessments were ignored.
As discussed in the next section, we plan to address these limitations in our future work.

Conclusion
In this study, we proposed a new strategy that employs ND techniques to predict pMCI subjects who have a higher risk of developing AD within a 2-year follow-up period, based on non-imaging features in the ADNI dataset. We designed an nCV-based evaluation and hyperparameter optimisation framework to thoroughly benchmark an array of unsupervised ND techniques, including kNN, k-means, GMM, ELM and IF, and compared them with supervised SVM and RF binary classification methods. The extensive experimental results show that the ND techniques can effectively predict pMCI subjects within two years from non-imaging, easily obtained features. The kNN, GMM and ELM algorithms achieved performance comparable to the supervised SVM with 20% and 30% sMCI misclassification tolerance. We rigorously optimised the parameters for best generalisation performance using nCV and the AGF, and showed that our strategy was stable from training to testing evaluation, with only insignificant differences among the best-performing feature combinations.
In terms of novelty, we are the first to propose employing ND techniques to identify pMCI patients, and our results show that ND techniques have the potential to predict pMCI subjects accurately. We also found that the CFA was the most promising predictor of pMCI with ND algorithms. The CFA is readily available within most clinical settings, and it is non-invasive and cost-effective compared to other popular modalities. Therefore, the proposed strategy could efficiently assist in predicting pMCI subjects in practice. In future work, we will focus on applying more advanced ND algorithms and different AD datasets to our problem domain, and on developing a platform that can automatically predict pMCI with ND techniques. We will investigate the sub-assessments within each neuropsychological assessment to shorten and simplify the whole process of predicting MCI conversion early. Moreover, we will work on applying ND algorithms to neuroimaging data.