Intelligent feature selection with modified K-nearest neighbor for kidney transplantation prediction

Abstract

The prediction of kidney transplantation outcomes is an important challenge, especially given the scarcity of available organs. Graft survival prediction helps physicians make the right decisions and enhance survival rates by adjusting medical procedures. It also supports the best choice among existing kidney donors and of the immunosuppressive management suitable for a patient. However, accurate prediction of graft survival remains elusive despite the advancements in this field. The purpose of our research is to design an intelligent kidney transplantation prediction method that addresses this problem using data mining techniques. The novelty of this study lies in presenting: (a) an integrated prediction method, (b) a new intelligent feature selection method, and (c) a modified K-nearest neighbor classifier. The proper variables are chosen by merging three feature selectors: the new feature selection method combines gain ratio, naïve Bayes, and a genetic algorithm. The cleaned dataset is then used to provide fast and precise predictions through a modified K-nearest neighbor classifier. Each stage of the proposed method has been evaluated through extensive experiments. Experimental results demonstrate the efficiency of every step of the proposed method. Additionally, the proposed method has been evaluated against recent methods, and the results show that it outperforms all recent and similar methods in the literature. The method can also be applied to other related transplant datasets.

Introduction

Data mining methods are used in many disciplines and have improved people's lives. Diseases affect people differently, and end-stage organ failure is one of the worst conditions; it arises from hypertension, diabetes, and other underlying diseases. In such cases, the best solution is organ transplantation [1].

Predicting the result of kidney transplantation is significant in helping physicians make the right decisions, saving patients' lives, and making good use of the available resources. Available kidneys are scarce, and patients wait a long time to find a suitable one. Some kidney transplantation operations fail due to graft failure, which can be caused by an unsuitable match between recipient and donor [2]. There is also the economic burden of returning to dialysis or searching for another transplant.

Kidney transplant patients who have suffered from renal failure have better outcomes than patients remaining on a waiting list [3]. However, graft failure lowers the patient survival rate and returns the patient to the waiting list [4]. Although kidney transplantation is the preferred management of renal failure, the number of kidneys available for transplant is less than the number of patients [5]. Thus, efforts directed at extending the graft survival interval are very significant for clinical practice. One way to enhance long-term allograft survival is to discover the factors that negatively affect the outcome, observe the relations among them, and design a prediction method to improve the result. Models that predict allograft outcome may consequently have an important influence on physicians' management plans and improve the overall clinical result [6].

Graft survival is the period of time during which the transplanted kidney still works and the patient requires neither dialysis nor another transplant [7]. Although a considerable number of papers have inspected the variables of kidney transplantation [7, 8], few papers have employed data mining methods in designing a prediction method for kidney transplantation [9,10,11]. Recent papers have demonstrated that data mining methods offer improved outcomes in predicting graft survival [2].

Commonly, feature selection and outlier rejection are two essential matters that should be addressed precisely before building a prediction method, as they have a powerful influence on its operation. Graft survival is influenced by a diversity of features, and many useless features may exist in kidney transplantation data. Numerous prediction methods do not perform well with large numbers of variables. Feature selection methods greatly improve a method's accuracy and speed by reducing the studied features to only the valuable ones; they also decrease the method's complexity [12].

Conversely, training the prediction method without data preparation, where the data often contain outliers, reduces the prediction accuracy, since the prediction is influenced by those uninformative instances whose quality is very poor. Outlier rejection is an essential data mining procedure [13]. It improves prediction accuracy because it helps to select the correct instances. Therefore, outlier rejection is important for removing all bad instances that affect the result.

This paper presents a novel kidney transplantation prediction method based on data analysis. The novelty of this study lies in presenting an integrated prediction method, a new intelligent feature selection method, and a modified K-nearest neighbor classifier. The proposed method combines variable selection with outlier rejection and data mining techniques to provide enhanced predictive power. The methodology is also suitable for any transplant dataset.

The proposed method is introduced in three phases: (1) the data organization phase (DOP), (2) the variable selection method (VSM), and (3) the outlier rejection and prediction phase (PP). During DOP, the input data are preprocessed for use in the next phases. VSM selects the reduced variables list to diminish the method's complexity and reduce the variable dimensionality, enhancing the prediction method. Variable selection is accomplished using gain ratio, naïve Bayes, and a genetic algorithm. PP introduces a modification of the K-nearest neighbor algorithm to classify testing instances, reject outliers, and predict the outcome of graft survival.

The validity of the proposed method is evaluated using the urology and nephrology center dataset. Each stage of the proposed prediction method has been assessed through extensive experiments. The evaluation results emphasize that our proposed method is efficient: the proposed VSM selects the reduced variables list effectively, and the proposed PP provides good prediction accuracy. The proposed method has also been assessed against similar work in the literature, and it outperforms recent and similar techniques on the prediction metrics.

Our paper is organized as follows: Part 2 presents background on variable selection, outlier rejection, and the K-nearest neighbor algorithm. Part 3 outlines the recent techniques used for predicting graft survival, especially after kidney transplantation. Part 4 presents the proposed method in detail. Part 5 presents the dataset used, the evaluation results, and a discussion. Part 6 introduces the conclusion. Finally, part 7 lists the references.

Background

This part introduces the basic concepts used in this research: variable selection, outlier rejection, and the K-nearest neighbor algorithm.

Variable selection

Variable selection is an important step in constructing a prediction method [14]. The dataset may contain some unrelated variables; hence, selecting a proper variable selection method that can identify them has a positive effect on the results of machine learning techniques, preventing overfitting, producing better accuracy, and leading to faster, more economical models [15].

There are three approaches to feature selection: filter, wrapper, and embedded methods [16,17,18,19]. Filter methods assess sets of variables by inspecting their information content or measuring the correlation between variables [12]; they do not rely on the data mining technique being used. Variable weights are computed, and low-weight variables are discarded [15]. Wrapper methods employ data mining methods to evaluate subsets of features using the resulting accuracy [12]; they use the outcomes of machine learning methods to judge the efficiency of a given subset of features [15]. Embedded methods accomplish feature selection during model training (e.g., through the neural network weights between the input and hidden layers). Table 1 offers a comparison of variable selection methods.

Table 1 Comparison of variable selection methods [18]

Outlier rejection

Outlier detection has been used for decades to identify and remove anomalous observations from data. Outliers arise owing to mechanical faults, changes in system behavior, human error, and instrument faults, or simply through natural deviations in the data [20].

Outlier rejection is capable of discovering the few instances embedded in the dataset that degrade the performance results [17, 21]. These instances must be excluded from the dataset because such outliers reduce the predictive accuracy [13].

K-nearest neighbor algorithm

K-nearest neighbor is a simple algorithm and one of the oldest rules used for pattern classification [19]. It represents the training instances of the dataset as points in the variable space, where the points belong to several target classes. To classify an instance Pt: (1) represent the instance in the feature space; (2) compute the distances between the instance Pt and the training instances and keep the K nearest ones; (3) assign Pt to the class most common among these neighbors. Figure 1 illustrates the algorithm assuming three target classes (i), (ii), and (iii), and K = 7; the point Pt is classified to class (ii) because class (ii) accounts for the largest share of the neighbors.

Fig. 1 K-nearest neighbor classifier example
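To make the voting rule concrete, the following minimal Python sketch (not the authors' code) implements the three steps above, assuming numeric feature vectors and Euclidean distance; the sample points loosely mirror Fig. 1.

```python
# A minimal sketch of the K-nearest neighbor rule described above.
from collections import Counter
import math

def knn_classify(train_points, train_labels, pt, k=7):
    """Assign pt to the class most common among its k nearest training points."""
    # Step 2: compute the Euclidean distance from pt to every training instance.
    distances = [
        (math.dist(x, pt), label)
        for x, label in zip(train_points, train_labels)
    ]
    # Step 3: keep the k closest instances and take a majority vote.
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Illustrative example in the spirit of Fig. 1: with k = 7,
# Pt falls to the class holding the most neighbors.
points = [(1, 1), (1, 2), (2, 1), (5, 5), (5, 6), (6, 5), (6, 6), (9, 1)]
labels = ["i", "i", "i", "ii", "ii", "ii", "ii", "iii"]
print(knn_classify(points, labels, (5, 5.5), k=7))  # -> "ii"
```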

Previous work

The importance of predicting kidney transplantation outcomes requires no emphasis. Preoperative graft survival prediction can simplify physician evaluation and boost the survival rate by altering the medical procedure. It also allows selection of the best available donor together with the best immunosuppressive management for a patient [10, 22].

Numerous approaches have been used to build predictive models. A multivariate analysis was used to predict kidney transplantation outcomes from deceased donors [23]. In another study, multivariate analysis was employed to predict creatinine levels in kidney recipients from living donors [24]. The probability of graft survival for deceased-donor transplants was studied using tree-based regression [6, 25]. Artificial neural networks were used to predict the probability of delayed graft function after kidney transplantation from deceased donors [26].

Kaplan–Meier survival analysis was used by Rana et al. [27] to give an overview of kidney transplantation survival. Several other papers have examined the influence of patient features (age, sex, and nationality) [28] and of preoperative transplant factors on graft status after transplant using the Kaplan–Meier survival function [29, 30]. These previous studies were constrained by using few variables. It is therefore necessary to use a larger dataset containing a greater number of candidate predictors, in order to discover possibly unknown relations among the diverse predictors that could influence kidney transplant results.

Data mining methods have been employed widely in the medical area of organ transplantation. Atallah et al. [31] designed a new integrated prediction method to predict graft survival outcomes. They used information gain merged with naïve Bayes to choose the important variables, then used K-nearest neighbor to predict graft survival; the results outperformed recent methods. Kusiak et al. [32] evaluated the efficiency of rough sets and two decision trees in forecasting kidney dialysis patients' survival. While the outcomes of both approaches were precise, it is unclear whether the method scales to a bigger dataset. In another study, decision trees were also used to predict the outcome of acute liver failure [30]. To widen the candidate predictors, another research paper used demographic, pharmaceutical, and clinical data to compare the accuracy of an artificial neural network and a nomogram in predicting graft outcome at five years; the results revealed that the artificial neural network outperformed the nomogram [10]. In a further study, Lin et al. [22] assessed the influence of candidate predictors on the kidney survival rate using logistic regression, the Cox proportional hazards method, and artificial neural networks. Dag et al. [33] used four classification techniques to categorize graft outcomes for heart patients; they applied data balancing techniques to deal with the unbalanced classes of the dataset, and their analysis revealed that logistic regression together with SMOTE provides the best outcome prediction. Delen et al. [34] used a database with many features and four machine learning methods, and performed sensitivity analysis on the preferred method to obtain the significant predictors; in their analysis, support vector machines were the best predictor. Another study presents a weighted decision tree algorithm for graft survival prediction after kidney transplantation using patients' pretransplant data, and it predicts graft survival with good results [35]. Hence, the common thread among these studies is their ability to effectively predict an outcome using a group of candidate predictors.

To create a precise prediction model, several requirements must be satisfied: the use of a dataset that covers a large patient population, to find unknown patterns between the variables that can change the clinical results [36]; the use of an intelligent variable selection technique that can accurately identify the most significant variables in the dataset and thereby promote the operation of the learning classifier; and, lastly, the use of outlier rejection that can identify and discard instances with exceptional behavior.

The purpose of our study is to design a prediction method that can classify graft status successfully. The novelty of this study lies in presenting an integrated prediction method, a new intelligent feature selection method, and a modified K-nearest neighbor classifier. The proposed prediction method combines variable selection with outlier rejection and machine learning methods to enable better predictive abilities. It is therefore essential to employ a variable selection technique that identifies the influential variables and at the same time eliminates the unimportant ones. Feature selection is accomplished with gain ratio, naïve Bayes, and a genetic algorithm. This study introduces a method composed of three phases, namely: (i) the data organization phase (DOP), (ii) the variable selection method (VSM), and (iii) the outlier rejection and prediction phase (PP).

The proposed method

The purpose of this paper is to design a novel method to classify the graft status outcome successfully. Figure 2 illustrates the proposed method. In the following sections, the three phases of this method are explained in more detail.

Fig. 2 Proposed method (PM)

Data organization phase (DOP)

In this phase, the dataset is preprocessed so that it is ready for building and testing the model. This process is called data cleaning and consists of three steps. The first step excludes all variables related to the operative and postoperative procedures, since the proposed method is intended to predict the kidney transplantation result before the operation. The second step excludes the variables that have no influence on the prediction procedure (e.g., the patient name, the hospital ID number, and investigation dates). The third step excludes from the dataset all instances that contain missing data.
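As an illustration of these three cleaning steps, the following hedged pandas sketch uses hypothetical column names ("operative_time", "patient_name", etc.), since the actual variable names of the dataset are not listed here.

```python
# A sketch of the three DOP steps; file and column names are illustrative only.
import pandas as pd

df = pd.read_csv("kidney_transplant.csv")  # hypothetical file name

# Step 1: drop operative/postoperative variables, since prediction must
# rely only on information available before the operation.
post_op_cols = ["operative_time", "postop_creatinine"]  # placeholders
df = df.drop(columns=post_op_cols, errors="ignore")

# Step 2: drop variables with no predictive role (identifiers, dates).
id_cols = ["patient_name", "hospital_id", "investigation_date"]  # placeholders
df = df.drop(columns=id_cols, errors="ignore")

# Step 3: exclude instances with missing data.
df = df.dropna()
```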

Variable selection method (VSM)

Commonly, variable selection is a significant part of any data mining procedure. Choosing the proper features will surely boost the method's prediction accuracy while reducing the time and processing cost. This paper offers a new feature selection procedure for identifying the highly influential and informative variables that form the reduced variables list.

The proposed genetic-based feature selection combines a filter method and a wrapper method with a genetic algorithm to effectively choose an effective collection of features. Gain ratio is used as the filter method and naïve Bayes as the wrapper method. The filter method computes a score for every feature and identifies the most influential features in the dataset according to these scores. The wrapper method evaluates subsets of the features using cross-validation, and the reduced variables list is selected based on this evaluation.

Gain ratio is used to measure the weight of each attribute; it is chosen because it overcomes the bias toward attributes with many values. Information gain is a ranking variable selection technique [37] that selects the features carrying the most information with respect to the classes [38]. Information gain depends on entropy; it quantifies the information an attribute contributes to the dataset.

Suppose that S is a set of n tuples and C is the set of k classes, and P(Ci, S) denotes the proportion of tuples in S that belong to class Ci. Then the information needed to classify a tuple in S is given by [39]:

$${\text{Info}}(S) = - \sum\limits_{i = 1}^{k} P(C_{i}, S)\log_{2} (P(C_{i}, S))$$
(1)

Suppose attribute A has v distinct values that partition S into subsets {S1, …, Sv}, where Si contains the tuples of S taking the ith value of A. Then the expected information required to classify a tuple after partitioning by variable A, InfoA(S), is given in Eq. (2) [39]:

$${\text{Info}}_{A} (S) = \sum\limits_{i = 1}^{v} \frac{|S_{i}|}{|S|}\,{\text{Info}}(S_{i})$$
(2)

Hence, the difference between Info(S) and InfoA(S) gives the information gained by splitting S on variable A, as shown in Eq. (3) [39]:

$${\text{Gain}}(A) = {\text{Info}}(S) - {\text{Info}}_{A} (S)$$
(3)

Information gain is biased toward attributes with many values. Gain ratio overcomes this problem by using the split information defined in Eq. (4) [40]:

$${\text{SplitInfo}}_{A} (S) = - \sum\limits_{i = 1}^{v} {\frac{{|S_{i} |}}{|S|}} *\log_{2} \left( {\frac{{|S_{i} |}}{|S|}} \right)$$
(4)

Then, the gain ratio is calculated as introduced in Eq. (5) [40]:

$${\text{GainRatio}}(A) = \frac{{{\text{Gain}}(A)}}{{{\text{SplitInfo}}_{A} (S)}}$$
(5)
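For concreteness, Eqs. (1)–(5) can be computed for a discrete attribute with a few lines of Python; this is a sketch rather than the authors' implementation.

```python
# A sketch of Eqs. (1)-(5), assuming a discrete attribute column and a
# class-label column given as Python lists.
import math
from collections import Counter

def info(labels):
    """Eq. (1): entropy of a set of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(attr_values, labels):
    """Eqs. (2)-(5): gain ratio of an attribute with respect to the classes."""
    n = len(labels)
    # Partition the labels by attribute value (the subsets S_i).
    partitions = {}
    for a, y in zip(attr_values, labels):
        partitions.setdefault(a, []).append(y)
    # Eq. (2): expected information after partitioning by the attribute.
    info_a = sum(len(s) / n * info(s) for s in partitions.values())
    gain = info(labels) - info_a                       # Eq. (3)
    split_info = -sum(
        len(s) / n * math.log2(len(s) / n) for s in partitions.values()
    )                                                  # Eq. (4)
    return gain / split_info if split_info > 0 else 0  # Eq. (5)
```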

Naïve Bayes has been shown to be among the most effective probabilistic classifiers [41, 42]. It uses Bayes' theorem to calculate class probabilities and classifies instances according to the highest probability. The reason for using the naïve Bayes classifier is that it simplifies computation while giving high accuracy and speed [40]. The way naïve Bayes operates can be specified as follows [40].

Assume D is a set of instances with known class labels. Every instance is represented by an n-dimensional vector, X = (x1, x2, …, xn), over n variables (V1, V2, …, Vn). Assume that (C1, C2, …, Cm) represents the m classes. For any instance X, the classifier predicts the class with the highest posterior probability given X, i.e., the class Ci that maximizes P(Ci|X). In other words, X is assigned to class Ci if and only if [40]:

$$P(C_{i} |X) > P(C_{j} |X)\quad {\text{for}}\quad 1 \le j \le m,\quad j \ne i$$

Then, Eq. (6) gives Bayes' theorem, used to calculate P(Ci|X) [40]:

$$P(C_{i} |X) = \frac{{P(X|C_{i} )P(C_{i} )}}{P(X)}$$
(6)
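The following compact Python sketch applies Eq. (6) under the usual class-conditional independence assumption for categorical variables; P(X) is dropped since it is constant across classes. It is illustrative only and omits refinements such as Laplace smoothing.

```python
# A minimal categorical naive Bayes sketch based on Eq. (6).
from collections import Counter, defaultdict

def train_nb(X, y):
    """Estimate P(Ci) and P(Vj = v | Ci) from training tuples."""
    priors = {c: n / len(y) for c, n in Counter(y).items()}
    cond = defaultdict(Counter)  # (class, variable index) -> value counts
    for xi, ci in zip(X, y):
        for j, v in enumerate(xi):
            cond[(ci, j)][v] += 1
    return priors, cond

def predict_nb(priors, cond, x):
    """Return the class Ci maximizing P(Ci) * prod_j P(xj | Ci)."""
    def score(c):
        s = priors[c]
        for j, v in enumerate(x):
            counts = cond[(c, j)]
            s *= counts[v] / max(sum(counts.values()), 1)
        return s
    return max(priors, key=score)
```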

The genetic algorithm, used here within the wrapper method, is a well-known stochastic search and optimization method [43,44,45] inspired by the process of natural evolution [46]. It can be applied to many optimization, search, and data mining problems [47, 48]. A genetic algorithm works iteratively, producing new generations from old ones using the standard genetic operators of selection, crossover, and mutation, and it uses a fitness function to evaluate each generation [39]. Additional details on genetic algorithms are given in [48].

Figure 3 clarifies the proposed variable selection algorithm in detail. First, a population of individuals is initialized randomly and the generation counter starts at zero. The gain ratio is calculated for every feature in this population, and the features with gain ratio greater than zero are added to the reduced variables list. This list is then trained and tested with the naïve Bayes classifier. Basically, naïve Bayes examines the influence of every individual variable and evaluates its significance through its effect on accuracy: it eliminates one input variable at a time, and the remaining variables are used for training and testing with the naïve Bayes classifier. The resulting classifier accuracy is compared with the accuracy of the classifier built using all variables; if the accuracy decreases, the eliminated variable is added to the reduced variables list, otherwise it is discarded. The genetic algorithm then produces the next population iteratively using its operators of selection, crossover, and mutation [49, 50]. Selection specifies the parents that will be tested in the next generation. Each individual is coded as a binary string whose length equals the number of variables: a one indicates a selected variable, and a zero a non-selected variable. Typical crossover and mutation are applied to produce new offspring [39, 51]. Each new generation is evaluated again using the hybrid filter-and-wrapper process. The algorithm computes the accuracy of each generation until the termination criteria are satisfied: it stops when the number of iterations reaches the maximum limit, or when the reduced variables list (RVL) with the highest accuracy across generations has been reached and the accuracy has not changed for several iterations. Algorithm 1 presents the proposed variable selection in detail, and a code sketch of this loop follows Fig. 3.

Algorithm 1 The proposed variable selection algorithm

Fig. 3 Proposed variable selection method
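The following Python sketch outlines the GA loop of Algorithm 1 under stated assumptions: an `accuracy(mask)` helper, which trains and tests the naïve Bayes classifier on the variables selected by a binary mask, is assumed, and the operator parameters are illustrative rather than the actual values of Table 3.

```python
# A hedged sketch of the GA-driven search for the reduced variables list.
import random

def ga_select(n_vars, accuracy, pop_size=20, max_gen=50,
              p_cross=0.6, p_mut=0.03):
    # Random initial population of binary chromosomes (1 = variable selected).
    pop = [[random.randint(0, 1) for _ in range(n_vars)]
           for _ in range(pop_size)]
    best, best_acc = None, -1.0
    for _ in range(max_gen):
        scored = sorted(pop, key=accuracy, reverse=True)
        top_acc = accuracy(scored[0])
        if top_acc > best_acc:
            best, best_acc = scored[0][:], top_acc
        # Selection: keep the fitter half of the population as parents.
        parents = scored[: pop_size // 2]
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            # One-point crossover.
            if random.random() < p_cross:
                cut = random.randrange(1, n_vars)
                child = a[:cut] + b[cut:]
            else:
                child = a[:]
            # Bit-flip mutation.
            child = [1 - g if random.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
    return best, best_acc  # best reduced variables list found
```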

Outlier rejection and prediction phase (PP) using the modified K-nearest neighbor algorithm

The most commonly used distance-based algorithm is K-nearest neighbor (KNN), which relies on distance measures as the only classification criterion. KNN is conceptually well suited to predicting kidney transplantation outcomes through data analysis. Another benefit is that, being a lazy learner, KNN does not need to learn and maintain a model, so it can adapt to rapid changes [52].

Modified K-nearest neighbor (MKNN) is a modification of the classical KNN classifier. In addition to the distance measures, MKNN also takes into consideration the quality of each instance among the neighbors of the tested instances. A test item may be close in distance to instances at the edge of one class while being much more similar in characteristics to instances of another class, so distance alone is not a sufficient criterion.

Algorithm 2 introduces the operation of the MKNN algorithm. Initially, the training instances are represented in an n-dimensional feature space. Then, the center of each class group is identified using Eq. (7):

$$C = \left( {\frac{1}{n}\sum\limits_{i = 1}^{n} {x_{i}^{1} } ,\frac{1}{n}\sum\limits_{i = 1}^{n} {x_{i}^{2} } , \ldots ,\frac{1}{n}\sum\limits_{i = 1}^{n} {x_{i}^{m} } } \right)$$
(7)

where C denotes the center of the class, \(x_{i}^{m}\) is the mth dimension of the ith instance, and n is the number of instances in the class. Next, the distances between the class center C and the instances of each class group, denoted xm, are computed using Eq. (8):

$$x_{m} = {\text{Dis}}(y,z) = \sqrt {\sum\limits_{i = 1}^{n} {(y_{i} - z_{i} )^{2} } }$$
(8)

where Dis(y, z) is the distance between instances y and z, n is the number of dimensions, yi is the ith dimension of instance y, zi is the ith dimension of instance z, and m indexes the class group. Subsequently, the average of these distances, denoted Dth, is calculated as shown in Eq. (9):

$$D_{\text{th}} = \sum {\frac{{x_{m} }}{m}}$$
(9)

where m is the number of class groups.

Then, for each instance Ei of a class group, if the distance between the class center and the instance is less than or equal to Dth, the instance Ei is kept as a member of the class group; otherwise, the point is deleted as an outlier.

Then, from the H remaining points, the M instances nearest to the test point Pt are picked. Finally, each test point Pt is assigned to the class that is most common among the M selected instances.

Algorithm 2 The modified K-nearest neighbor algorithm
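One possible reading of Algorithm 2 is sketched below in Python: class centers from Eq. (7), instance-to-center distances from Eq. (8), the threshold Dth from Eq. (9) for outlier rejection, and finally a majority vote among the M nearest surviving instances. The helper names are illustrative, not the authors' code.

```python
# A hedged sketch of MKNN: center-based outlier rejection, then voting.
import math
from collections import Counter

def mknn_classify(train, labels, pt, m_neighbors=7):
    # Eq. (7): center of each class group.
    groups = {}
    for x, c in zip(train, labels):
        groups.setdefault(c, []).append(x)
    centers = {c: [sum(col) / len(xs) for col in zip(*xs)]
               for c, xs in groups.items()}
    # Eq. (8): Euclidean distance of every instance to its own class center.
    dists = [(x, c, math.dist(x, centers[c])) for x, c in zip(train, labels)]
    # Eq. (9): threshold Dth as the average of these distances
    # (one reading of the ambiguous averaging in the text).
    d_th = sum(d for _, _, d in dists) / len(dists)
    # Outlier rejection: drop instances farther than Dth from their center.
    kept = [(x, c) for x, c, d in dists if d <= d_th]
    # Majority vote among the M nearest surviving instances.
    nearest = sorted(kept, key=lambda xc: math.dist(xc[0], pt))[:m_neighbors]
    return Counter(c for _, c in nearest).most_common(1)[0][0]
```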

An example is shown in Fig. 4. It presents classes A, B, and C in a two-dimensional space, where one event Pt is to be classified using the 7-nearest neighbor rule. The steps of the modified K-nearest neighbor algorithm are clarified in detail in Fig. 4.

Fig. 4 Steps of the proposed modified K-nearest neighbor

Results and discussion

In this part, the principal contributions of the proposed method are evaluated: (i) the new proposed variable selection method (VSM), (ii) the outlier rejection and prediction phase (PP) based on the MKNN classifier, and (iii) the complete proposed prediction method.

The used dataset

To evaluate the proposed prediction method, the kidney transplantation dataset from the urology and nephrology center, Mansoura University, is used. The dataset includes information on the kidney transplant such as demographic data, medical history, and medical conditions during and after the transplant. It also contains preoperative factors for both recipients and donors, in addition to other factors such as the date of transplant and dialysis information. The required approvals for this study were obtained from the institutional review board (IRB) committee of Mansoura University (R/19.01.16).

Renal transplantation started four decades ago at the urology department, Mansoura University, Egypt. The first renal transplantation was performed on a patient suffering from chronic pyelonephritis that had ended in renal failure, and the operation was successful. Since then, nearly 80 renal transplantations have been performed yearly. The urology and nephrology center provides health care for a population of seven million; in this area, seventeen hemodialysis centers provide service for nearly 2000 patients, so the waiting list is long. A further constraint is that cadaveric transplantation is not legal in Egypt [53].

By March 2017, 2811 renal transplants had been performed in the center. After deleting the instances with missing values, 2750 patients were included in the experiments. The age of the recipients was 29.1 ± 10.9 (mean ± SD) and the male-to-female ratio was 2026 (74%):702 (26%); the age of the donors was 36.7 ± 10.2 (mean ± SD) and the male-to-female ratio was 1246 (46%):1482 (54%). The ratio of living related to living unrelated donors was 2293 (84%):435 (16%), and the ratio of donors with the same blood group to donors with a different blood group was 2192 (80%):536 (20%). Graft follow-up ranged from 0 to 33.5 years (mean ± SD, 7.7 ± 1.4).

Performance metrics

In the following experiments, five metrics are evaluated: (i) accuracy, (ii) precision, (iii) sensitivity, (iv) F-measure, and (v) error. The confusion matrix, illustrated in Table 2, is used to compute these metrics. Its entries are described as follows, taking graft survival as the positive class:

  • True positive (TP): the number of instances predicted as graft survival whose historic outcome was graft survival.

  • True negative (TN): the number of instances predicted as graft failure whose historic outcome was graft failure.

  • False positive (FP): the number of instances predicted as graft survival while the historic outcome was graft failure.

  • False negative (FN): the number of instances predicted as graft failure while the historic outcome was graft survival.

Table 2 Confusion matrix

Accuracy, presented in Eq. (10), is the proportion of correctly classified instances among all instances.

$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{FP}} + {\text{TN}} + {\text{FN}}}}$$
(10)

Precision, presented in Eq. (11), is the proportion of correctly classified positive instances among all instances predicted as positive.

$${\text{Precision}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FP}}}}$$
(11)

Sensitivity, presented in Eq. (12), is the proportion of correctly classified positive instances among all actual positive instances.

$${\text{Sensitivity}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}}$$
(12)

F-measure, presented in Eq. (13), is the harmonic mean of precision and sensitivity:

$$F{\text{ - measure}} = 2*\frac{{{\text{precision}}*{\text{sensitivity}}}}{{{\text{precision}} + {\text{sensitivity}}}}$$
(13)

Error, presented in Eq. (14), is the complement of accuracy:

$${\text{Error}} = 1 - {\text{accuracy}}$$
(14)
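A small sketch showing how Eqs. (10)–(14) follow from the confusion-matrix counts, with graft survival treated as the positive class; the counts in the example call are illustrative only.

```python
# Eqs. (10)-(14) computed from confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)           # Eq. (10)
    precision = tp / (tp + fp)                           # Eq. (11)
    sensitivity = tp / (tp + fn)                         # Eq. (12)
    f_measure = (2 * precision * sensitivity
                 / (precision + sensitivity))            # Eq. (13)
    error = 1 - accuracy                                 # Eq. (14)
    return accuracy, precision, sensitivity, f_measure, error

print(metrics(tp=80, tn=15, fp=3, fn=2))  # illustrative counts
```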

Constraint settings

In the proposed variable selection method, gain ratio is used as the filter method. Each variable receives a value that specifies its importance: variables with a zero value are discarded, and features with gain ratio greater than zero are selected. A higher gain ratio indicates a higher chance of obtaining pure classes in the target class. We also use a genetic algorithm to obtain the reduced variables list. The population is initialized randomly and consists of many chromosomes, each coded as a binary string [54]. Bit value (0) denotes a non-selected variable, and bit value (1) a selected variable; the chromosome length equals the number of variables. Table 3 introduces the genetic algorithm parameters used in the experiments.

Table 3 Genetic algorithm factors

In the proposed prediction method, MKNN is used for classification. We determined the best value of k experimentally: starting with k = 1, we test the error rate of the classifier, then repeat the process, incrementing k to add one more neighbor each time. We choose the k value that produces the highest accuracy with minimum error.
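The search for k can be written as a simple loop; the `evaluate_error(k)` helper, which would train and test the MKNN classifier for a given k, is an assumption of this sketch rather than part of the paper's code.

```python
# A sketch of the experimental choice of k: keep the k with minimum error.
def choose_k(evaluate_error, k_max=30):
    best_k, best_err = 1, float("inf")
    for k in range(1, k_max + 1):
        err = evaluate_error(k)  # e.g., cross-validated error for this k
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err
```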

K-fold cross-validation

K-fold cross-validation is a method used to estimate the performance metrics of the prediction method. It minimizes the bias introduced by random training and testing samples [55, 56]. Instead of dividing the dataset into two random sampling groups (a training group and a testing group), which is prone to bias, the dataset is partitioned into k mutually exclusive groups of approximately equal size. The prediction method is then tested k times, once on each of the k test sets. The reported performance metric is the average of the k performance metrics over the k folds, as given in Eq. (15) [57]:

$${\text{PM}} = \frac{1}{K}\sum\limits_{i = 1}^{k} {{\text{PM}}_{i} }$$
(15)

where PM specifies the performance metric and k is the number of folds. In our experiments, we use ten-fold cross-validation. Figure 5 presents an explanatory figure for ten-fold cross-validation, where the unfilled sections denote the training sets and the filled sections the testing sets.

Fig. 5 Explanatory figure for ten-fold cross-validation
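A hedged sketch of Eq. (15) using scikit-learn's KFold is shown below; the GaussianNB classifier and the accuracy metric are placeholders for whichever classifier and metric are being evaluated.

```python
# Eq. (15): the reported metric is the average over the k folds.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def cv_metric(X, y, k=10):
    scores = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True).split(X):
        clf = GaussianNB().fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], clf.predict(X[test_idx])))
    return np.mean(scores)  # PM = (1/k) * sum of PM_i
```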

Evaluating the proposed variable selection method

This study introduces a novel intelligent variable selection method that combines filter and wrapper methods, using gain ratio, naïve Bayes, and a genetic algorithm. Naïve Bayes has been verified as a highly effective probabilistic classifier [41, 42], so we use it to verify the efficiency of our proposal. First, all variables in the dataset are used to train the naïve Bayes classifier; this configuration is called without feature selection (WFS). Second, we trained the individual steps of our proposed variable selection method with the NB classifier; these steps are labeled gain ratio (GR), naïve Bayes (NB), and genetic algorithm (GA). Finally, we trained the complete proposed variable selection method with the NB classifier. Experimental results are introduced in Figs. 6, 7, 8, 9 and 10. In general, the performance metrics improve as the number of instances increases: the largest number of instances yields the best performance, maximizing accuracy, precision, sensitivity, and F-measure while minimizing error. The proposed variable selection method achieves the best performance metrics compared to the other methods, with scores of 0.77, 0.78, 0.77, 0.77, and 0.22 for accuracy, precision, sensitivity, F-measure, and error, respectively. Our proposed variable selection method merges the benefits of gain ratio, naïve Bayes, and the genetic algorithm, as each compensates for the weaknesses of the others. The results verify that the proposed variable selection method can choose the essential and influential variables to produce the reduced variables list, which boosts the performance of the method and diminishes its error.

Fig. 6 Accuracy of the different feature selection methods

Fig. 7 Precision of the different feature selection methods

Fig. 8 Sensitivity of the different feature selection methods

Fig. 9 F-measure of the different feature selection methods

Fig. 10 Error of the different feature selection methods

Evaluating the proposed outlier rejection and prediction phase

In this subsection, we evaluate the proposed classifier (MKNN) against the classical KNN classifier and two other classifiers, namely decision tree (J48) and naïve Bayes (NB), without applying any feature selection. Figures 11, 12, 13, 14 and 15 show the accuracy, precision, sensitivity, F-measure, and error of the different classification methods. As shown in the figures, the highest accuracy, precision, sensitivity, and F-measure are obtained by the proposed MKNN, owing to the rejection of outlier instances that would otherwise reduce the performance. It also achieves the lowest error compared to the other classification techniques.

Fig. 11 Accuracy of the different classification techniques without feature selection

Fig. 12 Precision of the different classification techniques without feature selection

Fig. 13 Sensitivity of the different classification techniques without feature selection

Fig. 14 F-measure of the different classification techniques without feature selection

Fig. 15 Error of the different classification techniques without feature selection

MKNN is a modification of the KNN classifier. It uses a distance-based method to reject outliers and enhance the performance, thereby building on the advantages of the KNN classifier. The proposed outlier rejection and prediction phase outperforms all the other techniques, with scores of 0.79, 0.8, 0.79, 0.79, and 0.25 for accuracy, precision, sensitivity, F-measure, and error, respectively.

Evaluating the proposed method

Finally, the proposed prediction method with all its stages is examined against the most recent techniques used to design prediction methods, to prove its efficiency. Table 4 introduces the most recent techniques used for evaluation and the methods they employ. Results are presented in Figs. 16, 17, 18, 19, 20 and 21 and Tables 5, 6, 7, 8, 9 and 10. As demonstrated there, the proposed method presents the highest performance: it achieves the highest accuracy among all recent techniques with a score of 0.86, and the highest precision, sensitivity, F-measure, and ROC area with scores of 0.77, 0.86, 0.79 and 0.82, respectively. It also has the minimum error, with a score of 0.23. This proves the efficiency of the proposed method and all its stages.

Table 4 Most recent prediction methods utilized for evaluation
Fig. 16 Accuracy of the proposed method against recent techniques

Fig. 17 Precision of the proposed method against recent techniques

Fig. 18 Sensitivity of the proposed method against recent techniques

Fig. 19 F-measure of the proposed method against recent techniques

Fig. 20 Error of the proposed method against recent techniques

Fig. 21 ROC area of the proposed method against recent techniques

Table 5 Accuracy of the proposed method against recent techniques
Table 6 Precision of the proposed method against recent techniques
Table 7 Sensitivity of the proposed method against recent techniques
Table 8 F-measure of the proposed method against recent techniques
Table 9 Error of the proposed method against recent techniques
Table 10 ROC area of the proposed method against recent techniques

Conclusion

Predicting the result of kidney transplantation is very important and requires no emphasis; a successful prediction method is therefore an essential goal. In this study, we have designed a new method to classify the graft result. The method is composed of three stages, namely: (i) the data organization phase (DOP), (ii) the variable selection method (VSM), and (iii) the outlier rejection and prediction phase (PP). It combines variable selection with outlier rejection and machine learning methods to enable better predictive abilities. The proposed prediction method includes a new intelligent feature selection procedure and a modified K-nearest neighbor classifier. The new feature selection procedure combines gain ratio, naïve Bayes, and a genetic algorithm to choose the essential features from the dataset. Additionally, the proposed modified K-nearest neighbor provides the outlier rejection and prediction module, which uses distance-based measures with K-nearest neighbor to classify patients. The efficiency of the proposed method was evaluated using the urology and nephrology center dataset. Each stage of the proposed method was assessed through extensive experiments, and the overall method was examined to verify its consistency. The evaluation results emphasize that our proposed method is efficient: the proposed VSM selects the reduced variables list effectively, and the proposed outlier rejection and prediction phase enhances the prediction accuracy. Experimental results showed that the proposed method gives more precise results than the most recent methods. In general, the results offer a new method that could help improve the outcome of kidney transplantation, and the method can also be applied to other related transplant datasets.

References

1. Oztekin A, Al-Ebbini L, Sevkli Z, Delen D (2018) A decision analytic approach to predicting quality of life for lung transplant recipients: a hybrid genetic algorithms-based methodology. Eur J Oper Res 266(2):639–651
2. Topuz K, Zengul FD, Dag A, Almehmi A, Yildirim MB (2018) Predicting graft survival among kidney transplant recipients: a Bayesian decision support model. Decis Support Syst 106:97–109
3. Ojo AO, Hanson JA, Meier-Kriesche H-U, Okechukwu CN, Wolfe RA, Leichtman AB, Agodoa LY, Kaplan B, Port FK (2001) Survival in recipients of marginal cadaveric donor kidneys compared with other recipients and wait-listed transplant candidates. J Am Soc Nephrol 12(3):589–597
4. Ojo AO, Wolfe RA, Agodoa LY, Held PJ, Port FK, Leavey SF, Callard SE, Dickinson DM, Schmouder RL, Leichtman AB (1998) Prognosis after primary renal transplant failure and the beneficial effects of repeat transplantation: multivariate analyses from the United States renal data system. Transplantation 66(12):1651–1659
5. Procurement O (2015) Organ procurement and transplantation network, vol 9. HRSA, DHHS, pp 36–42
6. Krikov S, Khan A, Baird BC, Barenbaum LL, Leviatov A, Koford JK, Goldfarb-Rumyantzev AS (2007) Predicting kidney transplant survival using tree-based modeling. ASAIO J 53(5):592–600
7. Hariharan S, Johnson CP, Bresnahan BA, Taranto SE, McIntosh MJ, Stablein D (2000) Improved graft survival after renal transplantation in the United States, 1988 to 1996. N Engl J Med 342(9):605–612
8. Hoot N, Aronsky D (2005) Using Bayesian networks to predict survival of liver transplant patients. In: Proceedings AMIA 2005 symposium. American Medical Informatics Association, pp 345–349
9. Brown TS, Elster EA, Stevens K, Graybill JC, Gillern S, Phinney S, Salifu MO, Jindal RM (2012) Bayesian modeling of pretransplant variables accurately predicts kidney graft survival. Am J Nephrol 36(6):561–569
10. Akl A, Ismail AM, Ghoneim M (2008) Prediction of graft survival of living-donor kidney transplantation: nomograms or artificial neural networks? Transplantation 86(10):1401–1406
11. Dag A, Topuz K, Oztekin A, Bulur S, Megahed FM (2016) A probabilistic data-driven framework for scoring the preoperative recipient-donor heart transplant survival. Decis Support Syst 86:1–12
12. Sajadfar N, Ma Y (2015) A hybrid cost estimation framework based on feature-oriented data mining approach. Adv Eng Inform 29(3):633–647
13. Angiulli F, Pizzuti C (2002) Fast outlier detection in high dimensional spaces. In: European conference on principles of data mining and knowledge discovery. Springer, pp 15–27
14. Aggarwal M (2013) Performance analysis of different feature selection methods in intrusion detection. Int J Sci Technol Res 2(6):225–231
15. Blum AL, Rivest RL (1993) Training a 3-node neural network is NP-complete. In: Machine learning: from theory to applications. Springer, pp 9–28
16. Zhang M, Yao J (2004) A rough sets based approach to feature selection. In: IEEE annual meeting of the fuzzy information, 2004. Processing NAFIPS'04. IEEE, pp 434–439
17. Hung Y (2009) A neural network classifier with rough set-based feature selection to classify multiclass IC package products. Adv Eng Inform 23(3):348–357
18. Khan M, Quadri S (2013) Effect of using filter based feature selection on performance of machine learners using different datasets. BVICAM's Int J Inf Technol 5:597–603
19. Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244
20. Hodge V, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126
21. Suh NP (2005) Complexity: theory and applications. Oxford University Press on Demand, Oxford
22. Lin RS, Horn SD, Hurdle JF, Goldfarb-Rumyantzev AS (2008) Single and multiple time-point prediction models in kidney transplant outcomes. J Biomed Inform 41(6):944–952
23. Poli F, Scalamogna M, Cardillo M, Porta E, Sirchia G (2000) An algorithm for cadaver kidney allocation based on a multivariate analysis of factors impacting on cadaver kidney graft survival and function. Transpl Int 13(1):S259–S262
24. Zapletal C, Lorenz M, Woeste G, Wullstein C, Golling M, Bechstein W (2004) Predicting creatinine clearance by a simple formula following live-donor kidney transplantation. Transpl Int 17(9):490–494
25. Goldfarb-Rumyantzev AS, Scandling JD, Pappas L, Smout RJ, Horn S (2003) Prediction of 3-yr cadaveric graft survival based on pre-transplant variables in a large national dataset. Clin Transpl 17(6):485–497
26. Brier ME, Ray PC, Klein JB (2003) Prediction of delayed renal allograft function using an artificial neural network. Nephrol Dial Transpl 18(12):2655–2659
27. Rana A, Gruessner A, Agopian VG, Khalpey Z, Riaz IB, Kaplan B, Halazun KJ, Busuttil RW, Gruessner RW (2015) Survival benefit of solid-organ transplant in the United States. JAMA Surg 150(3):252–259
28. Heldal K, Hartmann A, Grootendorst DC, de Jager DJ, Leivestad T, Foss A, Midtvedt K (2009) Benefit of kidney transplantation beyond 70 years of age. Nephrol Dial Transpl 25(5):1680–1687
29. Port FK, Bragg-Gresham JL, Metzger RA, Dykstra DM, Gillespie BW, Young EW, Delmonico FL, Wynn JJ, Merion RM, Wolfe RA (2002) Donor characteristics associated with reduced graft survival: an approach to expanding the pool of kidney donors. Transplantation 74(9):1281–1286
30. Nakayama N, Oketani M, Kawamura Y, Inao M, Nagoshi S, Fujiwara K, Tsubouchi H, Mochida S (2012) Algorithm to determine the outcome of patients with acute liver failure: a data-mining analysis using decision trees. J Gastroenterol 47(6):664–677
31. Atallah DM, Badawy M, El-Sayed A, Ghoneim MA (2019) Predicting kidney transplantation outcome based on hybrid feature selection and KNN classifier. Multimed Tools Appl 78(14):20383–20407
32. Kusiak A, Dixon B, Shah S (2005) Predicting survival time for kidney dialysis patients: a data mining approach. Comput Biol Med 35(4):311–327
33. Dag A, Oztekin A, Yucel A, Bulur S, Megahed FM (2017) Predicting heart transplantation outcomes through data analytics. Decis Support Syst 94:42–52
34. Delen D, Oztekin A, Tomak L (2012) An analytic approach to better understanding and management of coronary surgeries. Decis Support Syst 52(3):698–705
35. Atallah DM, Eldesoky AI, Amira Y, Ghoneim MA (2014) One-year renal graft survival prediction using a weighted decision tree classifier. Int J Eng Technol 3(3):327
36. Kattan MW (2005) When and how to use informatics tools in caring for urologic patients. Nat Rev Urol 2(4):183
37. Martín-Valdivia MT, Díaz-Galiano MC, Montejo-Raez A, Ureña-López L (2008) Using information gain to improve multi-modal information retrieval systems. Inf Process Manage 44(3):1146–1158
38. Mukras R, Wiratunga N, Lothian R, Chakraborti S, Harper D (2007) Information gain feature selection for ordinal text classification using probability re-distribution. In: Proceedings of the Textlink workshop at IJCAI. p 16
39. Yang C-H, Chuang L-Y, Yang CH (2010) IG-GA: a hybrid filter/wrapper method for feature selection of microarray data. J Med Biol Eng 30(1):23–28
40. Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam
41. Inza I, Larrañaga P, Etxeberria R, Sierra B (2000) Feature subset selection by Bayesian network-based optimization. Artif Intell 123(1–2):157–184
42. Qiang G (2010) An effective algorithm for improving the performance of Naïve Bayes for text classification. In: 2010 Second international conference on computer research and development
43. Holland JH (1975) Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. University of Michigan Press, Ann Arbor
44. Abed MA, Ismail AN, Hazi ZM (2010) Pattern recognition using genetic algorithm. Int J Comput Electr Eng 2(3):583
45. Tan F, Fu X, Zhang Y, Bourgeois AG (2008) A genetic algorithm-based method for feature subset selection. Soft Comput 12(2):111–120
46. Shahamat H, Pouyan AA (2015) Feature selection using genetic algorithm for classification of schizophrenia using fMRI data. J AI Data Min 3(1):30–37
47. Golberg DE (1989) Genetic algorithms in search, optimization, and machine learning. Addison Wesley, Reading
48. Holland JH (1992) Adaptation in natural and artificial systems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press, Cambridge
49. Ghareb AS, Bakar AA, Hamdan AR (2016) Hybrid feature selection based on enhanced genetic algorithm for text categorization. Expert Syst Appl 49:31–47
50. Pakath R, Zaveri JS (1995) Specifying critical inputs in a genetic algorithm-driven decision support system: an automated facility. Decis Sci 26(6):749–771
51. Ammu P, Preeja V (2013) Review on feature selection techniques of DNA microarray data. Int J Comput Appl 61(12):39–44
52. Hmeidi I, Hawashin B, El-Qawasmeh E (2008) Performance of KNN and SVM classifiers on full word Arabic articles. Adv Eng Inform 22(1):106–111
53. Ghoneim MA, Bakr MA, Refaie AF, Akl AI, Shokeir AA, Shehab El-Dein AB, Ammar HM, Ismail AM, Sheashaa HA (2013) Factors affecting graft survival among patients receiving kidneys from live donors: a single-center experience. BioMed Res Int 2013:1–9
54. Refaeilzadeh P, Tang L, Liu H (2009) Cross-validation. In: Encyclopedia of database systems. Springer, pp 532–538
55. Garson GD (1998) Neural networks: an introductory guide for social scientists. Sage, Thousand Oaks
56. Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI, vol 2. Montreal, Canada, pp 1137–1145
57. Olson DL, Delen D (2008) Advanced data mining techniques. Springer, Berlin


Author information


Corresponding author

Correspondence to Dalia M. Atallah.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Cite this article

Atallah, D.M., Badawy, M. & El-Sayed, A. Intelligent feature selection with modified K-nearest neighbor for kidney transplantation prediction. SN Appl. Sci. 1, 1297 (2019). https://doi.org/10.1007/s42452-019-1329-z


Keywords

  • Kidney transplantation
  • Graft failure
  • Gain ratio
  • Feature selection
  • Naïve Bayes
  • Genetic algorithm
  • K-nearest neighbor