Introduction

The extensive application of information technology in medicine provides strong support for clinical diagnosis [1, 2]. During clinical diagnosis [3, 4], a clinical decision support system (CDSS) analyzes and predicts a patient's condition from the patient's current disease information and the system knowledge base, providing supporting information for diagnosis and treatment. A CDSS can help doctors apply complex medical knowledge to a wide range of medical problems more efficiently and quickly, and thus find more solutions for difficult and complicated diseases [5, 6].

In recent years, machine learning has developed rapidly and is widely used in clinical diagnosis [7, 8]. Machine-learning-based clinical diagnosis [7] treats the disease diagnosis process as a prediction problem whose features are the clinical manifestations of the disease. A feature space is built from these clinical manifestations, and existing cases with their diagnostic results serve as the training set of the machine learning model, so that new cases can be predicted.

However, a key problem to be solved in machine-learning-based clinical diagnosis is sample imbalance [9, 10]. Common diseases affect a large number of patients and thus produce a large case sample (the majority class). Rare diseases affect very few patients and produce only a small case sample (the minority class) [11, 12]. When trained on an imbalanced dataset, machine learning models tend to predict samples as the majority class [13, 14]. Although high precision can be achieved, the sensitivity of the model is extremely low, so the model cannot correctly classify minority samples [15, 16].

At present, methods for the sample imbalance problem can be divided into the algorithm level [17, 18] and the data level [19, 20]. Algorithm-level methods adapt the learning algorithm to the characteristics of imbalanced samples so as to improve sensitivity on the minority class. Ensemble learning [17] is a common example: multiple weak classifiers are trained in combination and their outputs are merged according to certain rules. SMOTE [19] is a common data-level algorithm, which improves minority sensitivity by synthesizing new minority samples. However, each approach has drawbacks: ensemble algorithms do not take the sample distribution into account [21], and SMOTE is prone to synthesizing "noisy samples" and "boundary samples" [22].

Based on the above, we took the collected missed abortion [23] and diabetes [24] datasets as the research object and proposed a hybrid sampling algorithm combining SMOTE and ENN to solve the sample imbalance problem in clinical diagnosis. First, we combined SMOTE and ENN, using ENN to delete "noisy samples" from the majority class after SMOTE synthesized the minority samples. Then, because a CDSS requires an understandable machine learning model, we used decision trees to model and predict the missed abortion dataset. Finally, since a decision tree is biased toward the majority class on an imbalanced dataset, we used three ensemble algorithms to ensemble the decision tree and improve its classification performance. The comparison experiments consist of three parts: first, a comparison with other sampling algorithms to verify the effectiveness of the proposed algorithm; then, a comparison with other ensemble algorithms to achieve an accurate clinical diagnosis; finally, statistical experiments to verify whether the proposed algorithm is significantly better than existing sampling algorithms.

The rest of this work is organized as follows. Section 2 presents the medical datasets and the proposed hybrid algorithm. Section 3 covers the comparative and statistical experiments. Section 4 provides the discussion and analysis, and Sect. 5 concludes.

Datasets and methods

Medical datasets

In this work, a missed abortion dataset collected from 2016 to 2020 is selected for research. The dataset contains 249 missed abortion samples and 112 normal samples, described by 7 features: Age, Ethnicity, Number of births, History of abortion, Cesarean section, Infection during pregnancy and Thyroid test results of the pregnant women. In addition, we also selected the UCI diabetes medical dataset. It contains 500 diabetic samples and 268 normal samples, described by 8 features: Pregnancies, Glucose, Blood pressure, Skin thickness, Insulin, Body mass index, Diabetes pedigree function and Age.

Ensemble algorithm

The ensemble algorithm [17, 25], a research hotspot in machine learning, has been increasingly applied to clinical diagnosis. An ensemble algorithm combines multiple weak classifiers of relatively low precision to train a strong classifier of high precision. Ensemble algorithms generally consist of two stages: a weak classifier generation stage and a weak classifier combination stage.

In the weak classifier generation stage, different generation methods are used to produce multiple weak classifiers. In the weak classifier combination stage, the weak classifiers are combined by voting and the final prediction model is output. Ensemble algorithms can be divided into Bagging [26], Adaboost [27] and Random forest [28] according to how the training sets are generated and how the classifiers are combined. They are introduced as follows:

Bagging uses bootstrap sampling on the original training set to obtain T training subsets with the same number of samples. The T subsets are then used to train T weak classifiers. Finally, the T trained weak classifiers are applied to the test set, and the prediction result is output by voting:

$$H_{Bagging}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} I\left( h_{t}(x) = y \right), \quad y = 1,2,\cdots,L$$
(1)

where \(I(\cdot)\) is an indicator function, that is, \(I(True) = 1\) and \(I(False) = 0\), and \(h_{t}(x)\) is the weak classifier trained on the t-th bootstrap subset. Because the subsets \(T_{1}, T_{2}, \cdots, T_{T}\) are generated randomly and independently, the weak classifiers \(h_{t}(x)\) can be trained in any order.
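As a concrete illustration, the short Python sketch below implements the majority vote of Eq. (1) directly; the function name and the toy predictions are our own, not part of the original work.

```python
import numpy as np

def bagging_vote(predictions):
    """Combine weak-classifier outputs by majority vote, as in Eq. (1).

    predictions: array of shape (T, n_samples) holding the label h_t(x)
    predicted by each of the T weak classifiers for every sample.
    Returns the label y maximizing sum_t I(h_t(x) = y) for each sample.
    """
    predictions = np.asarray(predictions)
    classes = np.unique(predictions)            # candidate labels y in Y
    # votes[c, i] = number of weak classifiers predicting class c for sample i
    votes = np.array([(predictions == c).sum(axis=0) for c in classes])
    return classes[votes.argmax(axis=0)]        # argmax over y

# e.g. three weak classifiers voting on four samples
print(bagging_vote([[0, 1, 1, 0],
                    [0, 1, 0, 0],
                    [1, 1, 0, 0]]))             # -> [0 1 0 0]
```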

Adaboost trains the weak classifiers on the training subsets in turn, and the training of each subsequent weak classifier depends on the performance of the previous one: misclassified samples appear in the training subset of the next weak classifier with high probability. Finally, the T trained weak classifiers are applied to the test set, and the prediction result is output by weighted voting:

$$H_{Adaboost}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} \ln\left( \frac{1}{\beta_{t}} \right) I\left( h_{t}(x) = y \right), \quad y = 1,2,\cdots,L$$
(2)

where \(I(\cdot)\) is the indicator function, \(h_{t}(x)\) is the weak classifier, and \(\beta_{t}\) is the weight that controls both the adjustment of the sample weights and the weighting coefficient of each weak classifier. Unlike Bagging, Adaboost focuses more on samples that are prone to misclassification.

On the basis of the Bagging algorithm, Random forest uses bootstrap sampling on the original training set. During the training of the T weak classifiers, a random subset of features is considered at each node, and the feature with the greatest effect on the prediction is chosen as the split point of the decision tree. Finally, the T trained decision trees are applied to the test set, and the prediction result is output by voting:

$$D_{Random\,forest}(x) = \arg\max_{y \in Y} \sum_{t=1}^{T} I\left( d_{t}(x) = y \right), \quad y = 1,2,\cdots,L$$
(3)

where \(I(\cdot)\) is an indicator function and \(d_{t}(x)\) is a decision tree classifier. As in Bagging, Random forest trains on T bootstrap subsets and generates T decision tree classifiers.
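The three ensembles can be instantiated with standard library implementations. The following sketch uses scikit-learn (our choice; the paper does not name an implementation), with a decision tree as the weak classifier and illustrative parameter values.

```python
# Minimal sketch of the three ensemble algorithms (scikit-learn >= 1.2;
# older versions use base_estimator= instead of estimator=).
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier

T = 100  # number of weak classifiers, an illustrative value

models = {
    # Bagging: T bootstrap subsets, one tree per subset, majority vote
    "Bagging": BaggingClassifier(estimator=DecisionTreeClassifier(),
                                 n_estimators=T),
    # Adaboost: trees trained sequentially, misclassified samples re-weighted
    "Adaboost": AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=1),
                                   n_estimators=T),
    # Random forest: bagging plus random feature selection at each split
    "Random forest": RandomForestClassifier(n_estimators=T),
}
# usage: models["Bagging"].fit(X_train, y_train).predict(X_test)
```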

The proposed hybrid sampling algorithm

According to the sampling strategy, data sampling algorithms can be divided into over-sampling and under-sampling [29]. Over-sampling improves the sensitivity of the minority class by synthesizing minority samples. SMOTE [19] is a classical over-sampling algorithm that reduces dataset imbalance by synthesizing new minority samples.

Let \(x_{i\_min}\) be a minority sample, and find its \(k\) (typically \(k = 5\)) nearest neighbor minority samples \(x_{ik\_min}\) according to the Euclidean distance. A new minority sample is then synthesized between \(x_{i\_min}\) and a selected neighbor \(x_{ik\_min}\), as given by Eq. (4).

$$x_{new} = x_{i\_min} + rand(0,1) \times \left( x_{ik\_min} - x_{i\_min} \right), \quad i = 1,2,\cdots,N$$
(4)

where rand(0,1) is a random number between 0 and 1. By setting the over-sampling rate, Eq. (4) is applied repeatedly until the two classes contain the same number of samples.
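A minimal sketch of the synthesis step of Eq. (4) is given below, assuming NumPy and scikit-learn for the nearest-neighbor search; the function and variable names are ours, and in practice a library implementation such as imbalanced-learn's SMOTE would normally be used instead.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_synthesize(X_min, n_new, k=5, seed=0):
    """Synthesize n_new minority samples by linear interpolation, Eq. (4)."""
    X_min = np.asarray(X_min)
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each sample is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # pick a minority sample x_i_min
        j = rng.choice(idx[i, 1:])          # one of its k nearest neighbors
        lam = rng.random()                  # rand(0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(new)
```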

Figure 1a shows the original dataset. Figure 1b shows that SMOTE relieves the sample imbalance to a certain extent but synthesizes new "noisy samples" and "boundary samples" [22, 30]. For these problems, some scholars [22, 31,32,33] have proposed Borderline-SMOTE [22], Adasyn-SMOTE [31], ANS-SMOTE [32] and Gaussian-SMOTE [33].

Fig. 1 Samples simulation plot after SMOTE and ENN sampling. a Original dataset, b SMOTE dataset, c ENN dataset

Recently, some scholars [21, 34] have proposed clustering over-sampling algorithms. For example, Douzas et al. [21] proposed the k-means-SMOTE algorithm, which first clusters the dataset with k-means and then over-samples the minority class within the clusters using SMOTE. Similarly, Ma et al. [34] proposed the Cure-SMOTE algorithm, which uses Cure to identify and delete "noisy samples" before over-sampling with SMOTE.

Unlike over-sampling, under-sampling achieves class balance by deleting majority samples. ENN [20] is a common under-sampling algorithm, which deletes a majority sample when its class differs from that of its k nearest neighbors. Let \(x_{j\_maj}\) be a majority sample, find its \(k\) (typically \(k = 3\)) nearest neighbor samples \(x_{jk\_maj}\), and compare the class of \(x_{j\_maj}\) with the classes of its \(k\) nearest neighbors according to Eq. (5):

$$x_{j\_del} = I\left( Class\left( x_{j\_maj} \right) \ne Class\left( x_{jk\_maj} \right) \right)$$
(5)

According to Eq. (5), if the class of \(x_{j\_maj}\) differs from the class of its k nearest neighbors, \(x_{j\_maj}\) is deleted. Figure 1c shows the samples simulation plot after ENN sampling. ENN balances the two classes by deleting "noisy samples". However, the neighbors of majority samples are usually also majority samples, so the number of samples that can be deleted is limited. Therefore, Tomek link [35], Instance hardness under-sampling [36], Radial based under-sampling [37] and other under-sampling algorithms have been proposed.
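The deletion rule of Eq. (5) can be sketched as follows; `enn_filter` and its defaults are our illustration under the usual ENN convention (drop a sample when the majority of its k neighbors disagree with its class), not the paper's code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def enn_filter(X, y, k=3):
    """Drop every sample whose class disagrees with the majority class of
    its k nearest neighbors, i.e. the deletion rule of Eq. (5)."""
    X, y = np.asarray(X), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                  # idx[:, 0] is the sample itself
    neighbor_labels = y[idx[:, 1:]]            # labels of the k neighbors
    # keep a sample iff more than half of its neighbors share its class
    keep = (neighbor_labels == y[:, None]).sum(axis=1) > k / 2
    return X[keep], y[keep]
```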

Both over-sampling and under-sampling can balance the two classes and improve the sensitivity of the minority class to a certain extent. However, the specificity on the majority class generally declines after sampling, possibly because sampling distorts the sample distribution of the original dataset [38, 39].

To solve this problem, we propose a hybrid sampling algorithm combining SMOTE and ENN. The algorithm first uses SMOTE to over-sample the imbalanced dataset and synthesize new minority samples. Then ENN under-samples the over-sampled dataset to delete the "noisy samples". Figure 2 shows the samples simulation plot after SMOTE-ENN sampling.

Fig. 2 Samples simulation plot after SMOTE-ENN sampling

As Fig. 2 shows, the dataset after SMOTE-ENN sampling is more balanced, and the "noisy samples" synthesized by SMOTE are deleted, in contrast to the datasets sampled by SMOTE or ENN alone. The steps of SMOTE-ENN are shown in Algorithm 1.

Algorithm 1 The SMOTE-ENN hybrid sampling algorithm
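For reference, this SMOTE-then-ENN combination is available off the shelf in imbalanced-learn as SMOTEENN; the snippet below is a usage sketch with illustrative parameters and stand-in data, not the authors' implementation.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN

# Stand-in for an imbalanced medical dataset
X, y = make_classification(n_samples=768, weights=[0.65], random_state=0)

sampler = SMOTEENN(random_state=42)          # SMOTE over-sampling, then ENN cleaning
X_res, y_res = sampler.fit_resample(X, y)
print("before:", Counter(y), "after:", Counter(y_res))
```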

Experimental results

Evaluation indexes

Traditional evaluation indexes mainly measure overall classification performance: even if all minority samples are misclassified, a seemingly good result can still be reported. Therefore, some scholars have proposed per-class indexes to evaluate classification performance [40, 41].

Let TP be the number of minority samples correctly predicted, TN the number of majority samples correctly predicted, FN the number of minority samples incorrectly predicted, and FP the number of majority samples incorrectly predicted. Then:

Prediction accuracy on the minority class (sensitivity):

$$Sensitivity = TP/\left( {TP + FN} \right)$$
(6)

Prediction accuracy on the majority class (specificity):

$$Specificity = TN/\left( {FP + TN} \right)$$
(7)

Sensitivity and specificity represent the prediction accuracy on the minority and majority classes, respectively. To reflect the classification performance on an imbalanced dataset more comprehensively, this paper also uses the F-measure, defined as:

$$F-measure = 2 \times Recall \times Precision/\left( Recall + Precision \right)$$
(8)

Recall is the same as sensitivity. Only when both recall and precision are high will the F-measure be correspondingly high. In addition, the Matthews correlation coefficient (MCC) [42] is an evaluation index that integrates sensitivity and specificity, defined as:

$$MCC = (TP \times TN - FP \times FN)/\sqrt {\left( {TN + FN} \right)\left( {TN + FP} \right)\left( {TP + FN} \right)\left( {TP + FP} \right)}$$
(9)

When the two classes differ greatly in size, the value of MCC is usually much smaller than sensitivity and specificity, because TN and FP are of the same order of magnitude and much larger than TP and FN. The MCC index can therefore clearly reflect the influence of an imbalanced dataset on the classifier while considering the effect on both classes.
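The four indexes can be computed directly from the confusion counts. The sketch below is our illustration, with the minority class treated as positive and made-up example counts:

```python
import numpy as np

def class_metrics(tp, tn, fp, fn):
    """Eqs. (6)-(9): sensitivity, specificity, F-measure and MCC
    from minority-positive confusion counts."""
    sensitivity = tp / (tp + fn)                 # minority recall, Eq. (6)
    specificity = tn / (fp + tn)                 # majority recall, Eq. (7)
    precision = tp / (tp + fp)
    f_measure = 2 * sensitivity * precision / (sensitivity + precision)  # Eq. (8)
    mcc = (tp * tn - fp * fn) / np.sqrt(         # Eq. (9)
        (tn + fn) * (tn + fp) * (tp + fn) * (tp + fp))
    return sensitivity, specificity, f_measure, mcc

# e.g. 40 of 50 minority and 450 of 500 majority samples predicted correctly
print(class_metrics(tp=40, tn=450, fp=50, fn=10))
```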

Experimental setting

In this section, we selected 11 existing sampling algorithms for comparison with the proposed SMOTE-ENN. The over-sampling algorithms are SMOTE [19], Borderline-SMOTE [22], Adasyn-SMOTE [31] and Gaussian-SMOTE [33]; we also selected two clustering over-sampling algorithms, k-means-SMOTE [21] and Cure-SMOTE [34]. The under-sampling algorithms are ENN [20], Tomek link [35], Instance hardness under-sampling [36] and Radial based under-sampling [37]. The hybrid algorithms are the combination of SMOTE and ENN [19, 20] (the proposed algorithm) and the combination of SMOTE and Tomek link [19, 35].

In the experiments, we perform tenfold cross validation on the sampled datasets. First, we use three decision tree algorithms to perform tenfold cross validation on each sampled dataset and record the results of the various indexes. Then, we use three ensemble algorithms to ensemble the decision tree and record their index results. Finally, we use two statistical testing methods to compare SMOTE-ENN with the other 11 sampling algorithms and verify its significance.
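As an illustration of the protocol, the sketch below wraps the sampler and a decision tree in an imbalanced-learn pipeline so that sampling is re-fit inside every fold; this is a common leak-free variant of the setup described above, and the synthetic data is a stand-in for the paper's datasets.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
from imblearn.pipeline import Pipeline
from imblearn.combine import SMOTEENN

# Stand-in imbalanced data; in the paper this would be the missed
# abortion or diabetes dataset.
X, y = make_classification(n_samples=768, weights=[0.65], random_state=0)

pipe = Pipeline([("sampling", SMOTEENN(random_state=0)),
                 ("tree", DecisionTreeClassifier(random_state=0))])
scores = cross_validate(pipe, X, y, cv=10,
                        scoring=["recall", "f1", "matthews_corrcoef"])
print({k: round(v.mean(), 3) for k, v in scores.items()
       if k.startswith("test_")})
```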

The sample distribution after sampling

To observe the sample distribution of the sampled dataset, this section presents scatter plots of the diabetes dataset after sampling by SMOTE, ENN and SMOTE-ENN. The dataset class is plotted on the Z axis, and any two features are plotted on the X and Y axes. Figure 3 presents the resulting scatter plots; a plotting sketch follows the figure caption.

Fig. 3 Samples scatter plot after the three sampling algorithms sampled
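A plotting sketch in the spirit of Fig. 3 (our illustration; the random stand-in data and the two feature names, Glucose and Body mass index, are arbitrary choices from the diabetes feature list):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))               # stand-in for two diabetes features
y = (X[:, 0] + X[:, 1] > 1).astype(int)     # stand-in class labels

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], y, c=y, cmap="coolwarm", s=10)
ax.set_xlabel("Glucose"); ax.set_ylabel("Body mass index"); ax.set_zlabel("Class")
plt.show()
```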

Observing Fig. 3a, the two classes of the diabetes dataset differ greatly in number, and there are many "noisy samples" and "boundary samples". Figure 3b–d shows the scatter plots after SMOTE, ENN and SMOTE-ENN sampling, respectively. After SMOTE sampling (Fig. 3b), although the two classes are balanced in number, a large number of "boundary samples" are generated, and many "noisy samples" remain from the original diabetes dataset. After ENN sampling (Fig. 3c), ENN effectively deletes the "noisy samples". After SMOTE-ENN sampling (Fig. 3d), the minority samples are effectively synthesized and the "noisy samples" are deleted, which significantly improves the sensitivity on the minority class.

Comparison with other sampling algorithms

To observe the effect of the sampling algorithms on the missed abortion and diabetes datasets, this section compares 12 sampling algorithms: SMOTE (SM), Borderline-SMOTE (BSM), Adasyn-SMOTE (ASM), Gaussian-SMOTE (GSM), k-means-SMOTE (KSM), Cure-SMOTE (CSM), ENN, Tomek link (TL), Instance hardness under-sampling (IHU), Radial based under-sampling (RBU), SMOTE-Tomek link (SMTOM) and SMOTE-ENN (SMENN). Three decision tree algorithms are used to test the sampled datasets, and the results are shown in Tables 1, 2 and 3.

Table 1 Results of C4.5 on the missed abortion and diabetes datasets after sampling
Table 2 Results of Randomtree on the missed abortion and diabetes datasets after sampling
Table 3 Results of Reptree on the missed abortion and diabetes datasets after sampling

As shown in Tables 1, 2 and 3, the sensitivity, specificity and other indexes of the three decision tree algorithms on the original missed abortion and diabetes datasets are all poor. This shows that sample imbalance greatly damages the classification performance of the decision tree algorithms; in clinical diagnosis, such results are clearly unacceptable. With the over-sampling algorithms, the sensitivity of the decision tree algorithms on the sampled datasets improves significantly. Among them, the decision tree algorithms perform best on the missed abortion dataset after k-means-SMOTE sampling. Similarly, k-means-SMOTE has the best effect on the diabetes dataset, where the MCC indexes of the three decision tree algorithms are 59.5%, 54.8% and 58.2%, respectively, clearly better than the other over-sampling algorithms. The effect of Cure-SMOTE is also better than the remaining over-sampling algorithms. This shows that the clustering over-sampling algorithms are significantly better than the plain over-sampling algorithms.

Among the under-sampling algorithms, the decision tree algorithms perform best on the missed abortion and diabetes datasets after IHU sampling. Overall, ENN and IHU outperform the over-sampling algorithms, while Tomek link and RBU perform worse. The specificity of C4.5 decreases significantly after RBU sampling, possibly because RBU blindly deletes some important majority samples. Among the hybrid algorithms, SMOTE-ENN has the best effect on the missed abortion dataset, with all indexes better than SMOTE-Tomek link. Compared with the original datasets, the imbalance ratio improves after sampling; the Maj/Min ratio of the diabetes dataset reaches 500/500 after both SMOTE and k-means-SMOTE sampling. Moreover, SMOTE-ENN has the best effect among all the sampling algorithms, mainly because it not only synthesizes minority samples but also deletes the "noisy samples" in the majority class. Among the decision tree algorithms, Randomtree shows the best classification performance.

Comparative experiments of ensemble algorithms

Clinical diagnosis based on machine learning has extremely high requirements on diagnostic results, so three ensemble algorithms are used to ensemble the decision tree. As before, we apply the 12 sampling algorithms to the missed abortion and diabetes datasets and use Random forest, Adaboost and Bagging to test the sampled datasets; the weak classifier for Adaboost and Bagging is Randomtree. The results of the ensemble algorithms on the sampled missed abortion and diabetes datasets are shown in Tables 4, 5 and 6.

Table 4 Results of Random forest on the missed abortion and diabetes datasets after sampling
Table 5 Results of Adaboost on the missed abortion and diabetes datasets after sampling
Table 6 Results of Bagging on the missed abortion and diabetes datasets after sampling

Tables 4, 5 and 6 show that the classification performance of the three ensemble algorithms on the original missed abortion and diabetes datasets is very poor: each index is only slightly higher than when the decision tree is used alone. On the sampled datasets, the classification performance of the three ensemble algorithms improves significantly. Among the over-sampling algorithms, Gaussian-SMOTE has the best effect on the missed abortion dataset, with MCC indexes of 86.9%, 85.9% and 86.9% for Random forest, Bagging and Adaboost, respectively. Similarly, k-means-SMOTE has the best effect on the diabetes dataset. Among the under-sampling algorithms, IHU has the best effect on the diabetes dataset, with MCC indexes of 80.7%, 67.9% and 75.4% for Random forest, Bagging and Adaboost, respectively. The effect of ENN on the diabetes dataset is also better than that of the over-sampling algorithms.

Among the hybrid algorithms, SMOTE-ENN has the better effect on the missed abortion and diabetes datasets, with indexes significantly better than SMOTE-Tomek link. In addition, the indexes of the three ensemble algorithms on the missed abortion dataset after SMOTE-Tomek link sampling are lower than those after Gaussian-SMOTE and IHU. Comparing the three ensemble algorithms, Random forest has the best classification performance on the sampled missed abortion dataset; after SMOTE-ENN sampling, its sensitivity and MCC are 93.9% and 95.4%, respectively, consistent with the previous experimental results. Random forest shows the same behavior on the diabetes dataset after SMOTE-ENN sampling, with sensitivity and MCC of 97.0% and 90.0%, respectively. In summary, we select SMOTE-ENN as the sampling algorithm and Random forest as the diagnosis algorithm, which together give the best classification performance.

Statistical test

To further compare the results of the different sampling algorithms and observe whether there are significant differences between them, statistical tests are required. We used two statistical tests: a pairwise comparison and a multiple comparison.

For the pairwise comparison, the Wilcoxon test [43] is selected to compare the sampling algorithms. The Wilcoxon test can be described as follows:

First, compute the difference between the results of the two sampling algorithms on each index, and rank the differences by their absolute values starting from 1. If two differences are equal, both receive the average of their ordinal ranks.

Next, attach the sign of each difference to its rank; the sum of the positive ranks gives \(R+\) and the sum of the negative ranks gives \(R-\). The smaller of the two is taken as the T value.

Finally, look up the critical value for the chosen significance level; the null hypothesis is that there is no difference between the algorithms. If the \(T\) value is less than or equal to the critical value, the null hypothesis is rejected and the algorithms are considered significantly different.
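In practice the test can be run with SciPy, whose `wilcoxon` statistic is exactly min(R+, R−); the index values below are placeholders, not results from Table 7:

```python
# Pairwise comparison of two sampling algorithms over the six indexes
# with the Wilcoxon signed-rank test.
from scipy.stats import wilcoxon

smote_enn = [0.98, 0.96, 1.00, 0.99, 0.97, 0.98]   # six index values (placeholders)
baseline  = [0.90, 0.85, 0.97, 0.91, 0.84, 0.90]
stat, p = wilcoxon(smote_enn, baseline)            # stat = min(R+, R-)
print(f"T = {stat}, p = {p:.3f}")                  # reject H0 if T <= 2 (alpha = 0.05)
```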

Following the Wilcoxon test procedure, we take the results of the 6 indexes as the data values, with significance level \(\alpha =0.05\) and the null hypothesis that all algorithms perform the same. The Wilcoxon tests based on Random forest, Adaboost and Bagging are shown in Table 7.

Table 7 Wilcoxon test based on Random forest, Adaboost and Bagging

With 6 paired index results and significance level \(\alpha =0.05\), the critical value is 2; that is, the maximum T value for rejecting the null hypothesis is 2. The results show that when testing the sampled missed abortion dataset with Random forest, Adaboost and Bagging, the null hypothesis can be rejected, i.e., SMOTE-ENN is significantly better than the other sampling algorithms.

For the multiple comparison, we use the Friedman test to compare all sampling algorithms. For each index, the algorithms are ranked by their results in descending order; tied results receive the average of their ranks. For each algorithm, the average rank \(R_{j}\) is used as the comparison value in the Friedman test:

$$\chi_{F}^{2} = \frac{12N}{{k\left( {k + 1} \right)}}\left[ {\mathop \sum \limits_{j = 1}^{k} R_{j}^{2} - \frac{{k\left( {k + 1} \right)^{2} }}{4}} \right]$$
(10)

where \(N\) is the number of indexes, \(k\) is the number of algorithms, and \(R_{j}\) is the average rank of algorithm \(j\). To obtain a better-behaved statistic, the \(\chi_{F}^{2}\) distribution is transformed into the \(F_{F}\) distribution:

$$F_{F} = \frac{{\left( {N - 1} \right)\chi_{F}^{2} }}{{N\left( {k - 1} \right) - \chi_{F}^{2} }}$$
(11)

The \(F_{F}\) statistic follows an F distribution with \(k - 1\) and \(\left( {k - 1} \right)\left( {N - 1} \right)\) degrees of freedom. The experimental results of Random forest, Adaboost and Bagging are compared at significance level \(\alpha =0.05\), with the null hypothesis that there is no difference between the 12 sampling algorithms. According to Eqs. (10) and (11), with \(N = 6\), the Friedman test result for Random forest is:

$$\chi_{F}^{2} = \frac{12 \times 6}{{12 \times 13}}\left[ {8.33^{2} + 7.17^{2} + 10.00^{2} + 3.67^{2} + 4.50^{2} + 4.83^{2} + 6.17^{2} + 9.17^{2} + 2.00^{2} + 11.83^{2} + 9.33^{2}+ 1.00^{2}- \frac{{2028 }}{4}} \right] = 57.69$$
$$F_{F} = \frac{{\left( {6 - 1} \right) \times 57.69}}{{6\left( {12-1} \right) - 57.69}} = 34.71$$

When \(\alpha = 0.05\), \(F\left( {12,60} \right) = 1.917\). Since \(34.71 \gg 1.917\), the null hypothesis can be rejected, and the 12 sampling algorithms are considered significantly different. Similarly, the Friedman test results for Adaboost and Bagging are 30.45 and 49.28, respectively, both much larger than 1.917, so the null hypothesis is again rejected.
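For reference, the statistics above can be reproduced from the reported average ranks with a few lines of NumPy (our sketch):

```python
import numpy as np

# Average ranks of the 12 sampling algorithms over N = 6 indexes,
# as listed in the Friedman computation above.
R = np.array([8.33, 7.17, 10.00, 3.67, 4.50, 4.83,
              6.17, 9.17, 2.00, 11.83, 9.33, 1.00])
N, k = 6, 12

chi2_F = 12 * N / (k * (k + 1)) * (np.sum(R**2) - k * (k + 1)**2 / 4)  # Eq. (10)
F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)                        # Eq. (11)
print(f"chi2_F = {chi2_F:.2f}, F_F = {F_F:.2f}")   # ~57.69 and ~34.7
```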

Discussion and analysis

Among all the sampling algorithms, the classification performance of the decision trees on the missed abortion and diabetes datasets after the 4 over-sampling algorithms is significantly better than after Tomek link and ENN. The effect of IHU is significantly better than the over-sampling algorithms, with Randomtree MCC indexes of 85.6% and 70.2% on the missed abortion and diabetes datasets, respectively. SMOTE-ENN has the best effect on the missed abortion dataset: the average precision, sensitivity, specificity, F-measure, MCC and AUC of Randomtree are 98.8%, 95.9%, 100.0%, 98.8%, 97.1% and 98.0%, respectively, significantly better than SMOTE-Tomek link. Similarly, on the diabetes dataset after SMOTE-ENN sampling, the precision, sensitivity, specificity, F-measure, MCC and AUC of Randomtree are 93.2%, 95.4%, 90.4%, 93.2%, 86.1% and 92.8%, respectively. This shows that SMOTE-ENN not only synthesizes minority samples but also deletes the "noisy samples" in the majority class.

In addition, by observing the scatter plots of the diabetes dataset after sampling, it is found that the Maj/Min ratio after ENN deletion does not reach 112/112, while the ratio after SMOTE synthesis reaches 249/248. The number of "noisy samples" that ENN can delete is therefore limited. Unfortunately, owing to the working principle of SMOTE, some of the synthesized samples fall within the majority class. It is therefore necessary to clean the dataset after SMOTE synthesis, mainly to delete the "noisy samples" blindly synthesized by SMOTE. SMOTE-ENN first uses SMOTE to synthesize minority samples and then uses ENN to delete the "noisy samples" in the majority class. Although the Maj/Min ratio of the diabetes dataset after SMOTE-ENN sampling is only 114/49, all indexes of the three decision trees are optimal.

In the experiments, Randomtree has the best classification performance among the three decision tree algorithms, so we use the ensemble algorithms to ensemble Randomtree. Comparing the three ensemble algorithms, Random forest, Bagging and Adaboost all perform poorly on the unsampled missed abortion and diabetes datasets, especially on the sensitivity index. As before, the over-sampling algorithms generally outperform the under-sampling algorithms, although IHU is significantly better than the over-sampling algorithms, with MCC indexes of Random forest, Adaboost and Bagging on the diabetes dataset of 80.7%, 85.4% and 67.9%, respectively. Overall, the ensemble algorithms have the best classification performance on the missed abortion and diabetes datasets after SMOTE-ENN sampling, showing that the ensemble algorithms behave consistently on both datasets. In addition, ensembling Randomtree with Adaboost and Bagging significantly improves the classification performance.

To further test the validity of SMOTE-ENN, the pairwise and multiple comparisons are applied to the 12 sampling algorithms. In the pairwise comparison, the precision, sensitivity, specificity, F-measure, MCC and AUC indexes of the three ensemble algorithms on the sampled missed abortion dataset are taken as the values. At significance level \(\alpha =0.05\), the Wilcoxon pairwise tests reject the null hypothesis, which means that SMOTE-ENN has a significant advantage over the other sampling algorithms. Similarly, in the multiple comparison, the same six indexes are taken as the values; at \(\alpha =0.05\), whichever ensemble algorithm is used for the test, SMOTE-ENN is significantly better than the other sampling algorithms.

In general, high sample imbalance seriously damages the classification performance of the ensemble algorithms. Sampling the missed abortion and diabetes datasets alleviates the influence of sample imbalance to a certain extent. Overall, the over-sampling algorithms are better than the under-sampling algorithms, although IHU has the best effect among the single sampling algorithms. The effect of SMOTE-Tomek link is worse than that of some single sampling algorithms. The effect of SMOTE-ENN is optimal, mainly because it not only synthesizes minority samples but also deletes the "noisy samples" in the majority class. In addition, Random forest has the best classification performance among the ensemble algorithms. Therefore, Random forest is used as the diagnosis algorithm for the missed abortion and diabetes datasets.

Conclusion

Medical datasets are often imbalanced, and different diseases have different sample sizes. Some diseases have only a few or even a single case sample, which greatly reduces the diagnostic effectiveness of machine learning algorithms. In clinical diagnosis, minority samples are extremely important, and predicting difficult diseases can greatly help doctors treat patients in advance. A hybrid sampling algorithm combining SMOTE and ENN is proposed to study missed abortion diagnosis. First, SMOTE is used to synthesize minority samples so that the majority and minority classes are balanced. Then, ENN under-samples the synthesized dataset to delete the "noisy samples" in the majority class. Finally, ensemble algorithms are used to model and predict the sampled dataset. Randomtree has the best classification performance on the missed abortion and diabetes datasets after SMOTE-ENN sampling, with all indexes significantly better than the other sampling algorithms. In addition, Random forest has the best classification performance among the ensemble algorithms, and is therefore selected as the diagnosis algorithm for the missed abortion and diabetes datasets.