Background

Acute coronary syndrome (ACS) refers to a group of conditions caused by decreased blood flow in the coronary arteries such that part of the heart muscle is unable to function properly or dies [1, 2]. Major adverse cardiac events (MACE) denote a composite of adverse events related to the cardiovascular system [3, 4], which may lead to severe or fatal outcomes for ACS patients. MACE prediction, a crucial and widely explored topic, plays a pivotal role in the optimal management of ACS patients at the early stage of hospitalization, e.g., clinical decision making on care and treatment, drug development and cost estimation [4, 5].

Over the past decades, a large number of studies have been proposed to facilitate risk assessment [1, 4]. Many traditional ACS risk score tools, e.g., TIMI [5], PURSUIT [6] and GRACE [7], have been widely used in real clinical settings and have shown good discriminatory accuracy in predicting MACE for ACS patients [8, 9]. However, these traditional models have several inherent limitations [10]. In particular, models developed using data from clinical trials and registries may not be representative of a general department patient population because of the strict inclusion and exclusion criteria of the cohorts [1]. In addition, to obtain a simple and easy-to-use tool, traditional risk scoring models are built on a small set of hand-picked risk factors selected by their significant univariate relationship to the end point via univariate logistic regression, which may degrade predictive performance [4, 10, 11]. Moreover, it is hard to enroll new and more discriminatory risk factors into those traditional models, which limits their extensibility [1].

Recently, with the rapid growth of electronic health record (EHR) data, a multitude of risk prediction models exploiting the potential of EHRs have become available and achieved significant improvements in this field [4, 10,11,12,13]. Most of these models are built on machine learning and data mining techniques. Although valuable, they still have deficiencies when applied to mining EHRs, particularly due to the vague, imprecise and uncertain clinical information contained in EHR data. Specifically, most of these models assume that MACEs have been correctly annotated in the EHR dataset and focus on the learning capabilities of the MACE prediction scheme. However, unambiguous MACE annotation may be difficult and imprecise because the information required to assign certain MACE labels to individual patients is often lacking.

Both the traditional risk scoring models and the machine learning based models provide diverse perspectives on the problem of MACE prediction [4]; each of them therefore yields complementary information that can be fused to produce an integrative and reliable result. With a proper strategy for constructing such an ensemble, it can be applied to the MACE prediction problem with imprecise and uncertain information. Dempster-Shafer Theory [14, 15] (DST) of evidence is a general framework for reasoning with uncertainty by combining multiple pieces of evidence to obtain a more reliable result, and it has been widely employed in sensor fusion [16], financial distress detection [17] and medical diagnosis [18]. To this end, we propose a hybrid method using Rough Set Theory [19] (RST) and Dempster-Shafer Theory of evidence for MACE prediction. The proposed approach integrates four state-of-the-art models, including one traditional ACS risk scoring model, i.e., GRACE, and three machine learning based models, i.e., Support Vector Machine [20] (SVM), L1-Logistic Regression [21] (L1-LR), and Classification and Regression Tree [22] (CART), to generate comprehensive and reliable MACE prediction results. In particular, RST is applied to determine the weights of the four single models; the prediction results generated by these single models are then treated as basic beliefs for the problem propositions, and an ensemble MACE prediction result is generated by combining each single model's evidence so that the overall prediction performance is enhanced.

We comparatively evaluate the performance of the proposed model on a clinical dataset of 2930 ACS patients collected from the cardiology department of the Chinese PLA General Hospital. The experimental results demonstrate that, in terms of reducing the uncertainty caused by human subjective cognition in patient data recording and annotation, our proposed method performs better than the traditional single models.

Preliminaries

Rough set theory

Rough set theory was first proposed by Pawlak [19] and is widely used to deal with problems containing uncertainty. In RST, an information system is defined as a pair \( \mathbb{I}=\left(\mathrm{U},\mathrm{A}\cup \mathrm{R}\right) \), where U = {u1, u2,  … , ut} is a nonempty finite set of objects, A = {a1, a2,  … , an} is a nonempty finite set of attributes, and R = {r1, r2,  … , rm} is a nonempty finite set of results. With each subset P ⊆ A, there is an indiscernibility relation (also called an equivalence relation) defined as IND(P) = {(x, y) ∈ U2| ∀ai ∈ P, ai(x) = ai(y)}. The set of objects U can be partitioned based on the relation IND(P), which is denoted by U ∕ IND(P), where an element of U ∕ IND(P) is called an equivalence class. Accordingly, the indiscernibility relations of A, R, and A − {aj} are defined as IND(A) = {(x, y) ∈ U2| ∀ai ∈ A, ai(x) = ai(y)}, IND(R) = {(x, y) ∈ U2| ∀ri ∈ R, ri(x) = ri(y)}, and IND(A − {aj}) = {(x, y) ∈ U2| ∀ai ∈ A, ai ≠ aj, ai(x) = ai(y)}, j = 1, 2, … , n. Based on information entropy, the dependence of R on A can be defined as:

$$ \mathrm{D}\left(\mathrm{IND}\left(\mathrm{R}\right)/\mathrm{IND}\left(\mathrm{A}\right)\right)=-\sum \limits_{\left[\mathrm{x}\right]\in \mathrm{U}/\mathrm{IND}\left(\mathrm{R}\right)}\mathrm{p}\left[\mathrm{x}\right]\sum \limits_{\left[\mathrm{y}\right]\in \mathrm{U}/\mathrm{IND}\left(\mathrm{A}\right)}\mathrm{p}\left(\left[\mathrm{y}\right]/\left[\mathrm{x}\right]\right)\ln \left(\mathrm{p}\left(\left[\mathrm{y}\right]/\left[\mathrm{x}\right]\right)\right) $$
(1)

where \( \mathrm{p}\left[\mathrm{x}\right]=\frac{\operatorname{card}\left[\mathrm{x}\right]}{\operatorname{card}\left[\mathrm{U}\right]} \), \( \mathrm{p}\left(\left[\mathrm{y}\right]/ \left[\mathrm{x}\right]\right)=\frac{\operatorname{card}\left(\left[\mathrm{y}\right]\cap \left[\mathrm{x}\right]\right)}{\operatorname{card}\left[\mathrm{x}\right]} \). The significance of attribute aj can be defined as:

$$ \upomega \left({\mathrm{a}}_{\mathrm{j}},\mathrm{A},\mathrm{R}\right)=\left|\mathrm{D}\left(\mathrm{IND}\left(\mathrm{R}\right)/\mathrm{IND}\left(\mathrm{A}-\left\{{\mathrm{a}}_{\mathrm{j}}\right\}\right)\right)-\mathrm{D}\left(\mathrm{IND}\left(\mathrm{R}\right)/\mathrm{IND}\left(\mathrm{A}\right)\right)\right|,\mathrm{j}=1,2,\dots, \mathrm{n}. $$
(2)

Finally, the weight of attribute aj is defined as follows:

$$ \mathrm{w}\left({\mathrm{a}}_{\mathrm{j}}\right)=\frac{\upomega \left({\mathrm{a}}_{\mathrm{j}},\mathrm{A},\mathrm{R}\right)}{\sum \limits_{\mathrm{k}=1}^{\mathrm{n}}\upomega \left({\mathrm{a}}_{\mathrm{k}},\mathrm{A},\mathrm{R}\right)} $$
(3)
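
To make Eqs. (1)–(3) concrete, the following is a minimal sketch (our own illustration, not the authors' code) of the entropy-based weight calculation in Python. It assumes the attributes are the columns of a discrete-valued array (in our setting, the dichotomized model outputs) and that `result` holds the decision attribute (the MACE label); all names are illustrative.

```python
import numpy as np

def partition(columns):
    """Group object indices into the equivalence classes induced by IND(P)."""
    classes = {}
    for idx, row in enumerate(columns):
        classes.setdefault(tuple(row), []).append(idx)
    return [set(v) for v in classes.values()]

def dependence(result, attrs):
    """D(IND(R)/IND(A)) of Eq. (1)."""
    n = len(result)
    d = 0.0
    for x in partition(result.reshape(-1, 1)):      # [x] in U/IND(R)
        p_x = len(x) / n
        for y in partition(attrs):                  # [y] in U/IND(A)
            p_yx = len(x & y) / len(x)
            if p_yx > 0:
                d -= p_x * p_yx * np.log(p_yx)
    return d

def rst_weights(result, attrs):
    """Attribute significance (Eq. 2) normalized into weights (Eq. 3)."""
    full = dependence(result, attrs)
    sig = np.array([abs(dependence(result, np.delete(attrs, j, axis=1)) - full)
                    for j in range(attrs.shape[1])])
    return sig / sig.sum()
```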

Dempster-Shafer theory

Let Θ be the frame of discernment, which represents all possible mutually exclusive states of a system. The power set 2Θ is the set of all subsets of Θ, including the empty set ∅, and its elements represent propositions about the actual state of the system. A basic probability assignment (BPA) is defined as m : 2Θ → [0, 1], where m satisfies m(∅) = 0 and \( \sum \limits_{\mathrm{A}\subseteq \Theta}\mathrm{m}\left(\mathrm{A}\right)=1 \); m(A) is called the BPA of proposition A. If m(A) > 0, the subset A is called a focal element. The belief function of proposition A, denoted Bel(A), is defined as \( \mathrm{Bel}\left(\mathrm{A}\right)=\sum \limits_{\mathrm{B}\subseteq \mathrm{A}}\mathrm{m}\left(\mathrm{B}\right),\forall \mathrm{A}\subseteq \Theta \). The plausibility function of proposition A, denoted Pl(A), is defined as \( \mathrm{Pl}\left(\mathrm{A}\right)=1-\mathrm{Bel}\left(\overline{\mathrm{A}}\right)=\sum \limits_{\mathrm{B}\cap \mathrm{A}\ne \varnothing}\mathrm{m}\left(\mathrm{B}\right),\forall \mathrm{A}\subseteq \Theta . \) The belief function and plausibility function represent the minimal and maximal support of A based on the BPA, respectively.
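
As a small illustration of these definitions, the snippet below computes Bel and Pl on a two-state frame of discernment; the BPA values are made up purely for illustration.

```python
# Frame of discernment for the MACE problem and an illustrative (made-up) BPA.
theta = frozenset({"MACE", "no MACE"})
m = {frozenset({"MACE"}): 0.5,
     frozenset({"no MACE"}): 0.3,
     theta: 0.2}                      # m(empty set) = 0 by definition

def bel(a):
    """Bel(A): total mass of all subsets of A."""
    return sum(v for b, v in m.items() if b <= a)

def pl(a):
    """Pl(A): total mass of all focal elements intersecting A."""
    return sum(v for b, v in m.items() if b & a)

print(bel(frozenset({"MACE"})), pl(frozenset({"MACE"})))   # 0.5 0.7
```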

When a system has more than one basic probability assignment function, Dempster's combination rule can combine them. Let m1 and m2 be two different BPA functions whose focal elements are A1, A2, … , Am and B1, B2, … , Bn, respectively. If \( \sum \limits_{{\mathrm{A}}_{\mathrm{i}}\cap {\mathrm{B}}_{\mathrm{j}}=\varnothing }{\mathrm{m}}_1\left({\mathrm{A}}_{\mathrm{i}}\right){\mathrm{m}}_2\left({\mathrm{B}}_{\mathrm{j}}\right)<1 \), we have:

$$ {\mathrm{m}}_{1,2}\left(\mathrm{C}\right)=\left({\mathrm{m}}_1\bigoplus {\mathrm{m}}_2\right)\left(\mathrm{C}\right)=\left\{\begin{array}{c}\frac{1}{1-\mathrm{K}}\sum \limits_{{\mathrm{A}}_{\mathrm{i}}\cap {\mathrm{B}}_{\mathrm{j}}=\mathrm{C}}{\mathrm{m}}_1\left({\mathrm{A}}_{\mathrm{i}}\right){\mathrm{m}}_2\left({\mathrm{B}}_{\mathrm{j}}\right),\forall \mathrm{C}\subseteq \Theta, \mathrm{C}\ne \varnothing \\ {}0,\mathrm{C}=\varnothing \end{array}\right. $$
(4)

where \( \mathrm{K}=\sum \limits_{{\mathrm{A}}_{\mathrm{i}}\cap {\mathrm{B}}_{\mathrm{j}}=\varnothing }{\mathrm{m}}_1\left({\mathrm{A}}_{\mathrm{i}}\right){\mathrm{m}}_2\left({\mathrm{B}}_{\mathrm{j}}\right) \) indicates the conflict between the two bodies of evidence and is called the conflict probability, and the coefficient \( \frac{1}{1-\mathrm{K}} \) is a normalization factor.
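
A compact sketch of Dempster's rule in Eq. (4) is given below, with focal elements represented as frozensets; the example BPAs are illustrative only.

```python
def dempster_combine(m1, m2):
    """Dempster's rule of combination (Eq. 4) for two BPAs over the same frame."""
    combined, k = {}, 0.0
    for a, va in m1.items():
        for b, vb in m2.items():
            c = a & b
            if c:                                     # non-empty intersection
                combined[c] = combined.get(c, 0.0) + va * vb
            else:
                k += va * vb                          # conflict probability K
    if k >= 1.0:
        raise ValueError("evidence is in total conflict and cannot be combined")
    return {c: v / (1.0 - k) for c, v in combined.items()}   # 1/(1-K) normalization

# Illustrative BPAs (values are made up):
theta = frozenset({"MACE", "no MACE"})
m1 = {frozenset({"MACE"}): 0.6, theta: 0.4}
m2 = {frozenset({"MACE"}): 0.3, frozenset({"no MACE"}): 0.5, theta: 0.2}
print(dempster_combine(m1, m2))
```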

Methods

In this study, we propose an ensemble approach that integrates traditional risk scoring models and advanced machine learning based models to alleviate the limitations mentioned above. Figure 1 shows the outline of our proposed method. As depicted in Fig. 1, we first calculate the weights of the four single models, i.e., GRACE, SVM, CART, and L1-LR, based on RST. After that, we employ DST to integrate the weighted outputs of the models into the ensemble MACE prediction result.

Fig. 1 The outline of the proposed method

To make the proposed method easier to follow, we use a subset of our real-world dataset to show how the method is implemented step by step. Table 1 shows 10 patient samples from the collected dataset with their corresponding outputs from the models trained in our previous work.

Table 1 The original outputs of single models for 10 patient samples

Weight calculation using rough set theory

Before calculating the weight of each single prediction model, we need to transform the models' outputs into dichotomous variables so that RST can be applied to calculate the dependence of each model on the final prediction results. For each model, we choose as the threshold the output value whose point on the receiver operating characteristic (ROC) curve is closest to the top-left corner. Computed over all patient samples, the thresholds are 0.2348, 0.22689, 0.2584 and 106.5 for SVM, L1-LR, CART and GRACE, respectively. Throughout this and the following sections, we use the data obtained from our work to give a more practical description. Based on the dichotomized outputs, we calculate the weight of each single model according to Eqs. (1)–(3). The weights are 0.5363, 0.1765, 0.1177 and 0.1696 for SVM, L1-LR, CART and GRACE, respectively. Table 2 shows the dichotomized outputs, optimal thresholds and weights of the four single models.

Table 2 The dichotomized outputs, optimal thresholds and weights of single models for 10 patient samples
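
The threshold selection and dichotomization step can be sketched as follows, assuming scikit-learn is available; `scores` and `labels` stand in for a single model's raw outputs and the recorded MACE annotations, and the helper names are our own.

```python
import numpy as np
from sklearn.metrics import roc_curve

def optimal_threshold(labels, scores):
    """Cut-off whose ROC point (FPR, TPR) lies closest to the corner (0, 1)."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    dist = np.sqrt(fpr ** 2 + (1.0 - tpr) ** 2)
    return thresholds[np.argmin(dist)]

def dichotomize(scores, threshold):
    """Binarize a model's outputs at the chosen threshold."""
    return (np.asarray(scores) >= threshold).astype(int)

# The dichotomized columns of the four models then feed the RST weight
# calculation of Eqs. (1)-(3).
```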

Model fusion using Dempster-Shafer evidence theory

Before using Dempster-Shafer Theory to combine the four models' outputs, we need to transform the outputs into basic probability assignments (BPAs). However, in our study, the range of GRACE's outputs is from 2 to 258, which cannot be used directly as a BPA, and moreover, the four single models have different optimal thresholds, which may influence the combination results. To alleviate these problems, we first normalize the GRACE outputs to the range [0, 1] by Eq. (5), and then apply Eq. (6) to adjust the threshold of each single model to the same value, i.e., 0.5, to eliminate the influence caused by the different optimal thresholds.

$$ {\mathrm{A}}_{\mathrm{GRACE},\mathrm{j}}=\frac{{\mathrm{O}}_{\mathrm{GRACE},\mathrm{j}}-{\min}_{\mathrm{GRACE}}}{\max_{\mathrm{GRACE}}-{\min}_{\mathrm{GRACE}}};\mathrm{j}=1,2,3,\dots, \mathrm{n} $$
(5)

where n is the number of patients, and OGRACE, j and AGRACE, j indicate the original and normalized output of the GRACE model for the jth patient, respectively. minGRACE and maxGRACE, the minimum and maximum values of the original GRACE output, are 37 and 201 in our study, respectively.

$$ {{\mathrm{A}}^{\ast}}_{\mathrm{i},\mathrm{j}}=\left\{\begin{array}{ll}0.5\times \frac{{\mathrm{A}}_{\mathrm{i},\mathrm{j}}}{{\mathrm{Threshold}}_{\mathrm{i}}},& {\mathrm{A}}_{\mathrm{i},\mathrm{j}}<{\mathrm{Threshold}}_{\mathrm{i}}\\ {}0.5,& {\mathrm{A}}_{\mathrm{i},\mathrm{j}}={\mathrm{Threshold}}_{\mathrm{i}}\\ {}0.5\times \frac{{\mathrm{A}}_{\mathrm{i},\mathrm{j}}-{\mathrm{Threshold}}_{\mathrm{i}}}{1-{\mathrm{Threshold}}_{\mathrm{i}}}+0.5,& {\mathrm{A}}_{\mathrm{i},\mathrm{j}}>{\mathrm{Threshold}}_{\mathrm{i}}\end{array}\right.;\kern0.5em \mathrm{j}=1,2,\dots, \mathrm{n} $$
(6)

where A∗i, j is the adjusted output of the ith model for the jth patient and Ai, j is the corresponding output before adjustment (for GRACE, the normalized output from Eq. (5)), with i ∈ {SVM, L1-LR, CART, GRACE}; Thresholdi is the ith model's optimal threshold used in the dichotomization procedure for the RST weight calculation. Table 3 shows the adjusted outputs of each single model based on Eqs. (5) and (6).

Table 3 The adjusted outputs of single models for 10 patient samples
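
A sketch of the output-adjustment step in Eqs. (5) and (6) is given below: the GRACE scores are min-max normalized, and every model's output is then rescaled piecewise so that its own optimal threshold maps to 0.5. The function names are illustrative.

```python
import numpy as np

def normalize_grace(scores, lo, hi):
    """Eq. (5): min-max normalization of the raw GRACE scores to [0, 1]."""
    return (np.asarray(scores, dtype=float) - lo) / (hi - lo)

def adjust(a, threshold):
    """Eq. (6): piecewise rescaling so that `threshold` is mapped to 0.5."""
    a = np.asarray(a, dtype=float)
    below = 0.5 * a / threshold
    above = 0.5 * (a - threshold) / (1.0 - threshold) + 0.5
    return np.where(a < threshold, below, np.where(a > threshold, above, 0.5))
```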

Based on the adjusted outputs, we can obtain the BPA for each patient. In our method, we incorporate the weights calculated by RST into the BPAs using the following functions:

$$ {\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{i},\mathrm{j}}}\left(\varnothing \right)=0 $$
(7)
$$ {\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{i},\mathrm{j}}}(1)=\frac{{\mathrm{w}}_{\mathrm{i}}\times {{\mathrm{A}}^{\ast}}_{\mathrm{i},\mathrm{j}}}{{\mathrm{w}}_{\mathrm{i}}\times {{\mathrm{A}}^{\ast}}_{\mathrm{i},\mathrm{j}}+{\mathrm{w}}_{\mathrm{i}}\times \left(1-{{\mathrm{A}}^{\ast}}_{\mathrm{i},\mathrm{j}}\right)+1} $$
(8)
$$ {\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{i},\mathrm{j}}}(0)=\frac{{\mathrm{w}}_{\mathrm{i}}\times \left(1-{{\mathrm{A}}^{\ast}}_{\mathrm{i},\mathrm{j}}\right)}{{\mathrm{w}}_{\mathrm{i}}\times {{\mathrm{A}}^{\ast}}_{\mathrm{i},\mathrm{j}}+{\mathrm{w}}_{\mathrm{i}}\times \left(1-{{\mathrm{A}}^{\ast}}_{\mathrm{i},\mathrm{j}}\right)+1} $$
(9)
$$ {\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{i},\mathrm{j}}}\left(\Theta \right)=\frac{1}{{\mathrm{w}}_{\mathrm{i}}\times {{\mathrm{A}}^{\ast}}_{\mathrm{i},\mathrm{j}}+{\mathrm{w}}_{\mathrm{i}}\times \left(1-{{\mathrm{A}}^{\ast}}_{\mathrm{i},\mathrm{j}}\right)+1} $$
(10)

where wi is the weight of the ith model with i∈{SVM, L1-LR, CART, GRACE}.
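
The construction of the weighted BPA in Eqs. (7)–(10) can be sketched as follows; note that the three masses share the common denominator wi + 1 and therefore sum to one. The names are illustrative.

```python
def weighted_bpa(a_star, w):
    """Eqs. (7)-(10): BPA of one model, weighted by its RST weight w."""
    denom = w * a_star + w * (1.0 - a_star) + 1.0     # = w + 1
    return {
        1: w * a_star / denom,              # m(1): mass assigned to MACE
        0: w * (1.0 - a_star) / denom,      # m(0): mass assigned to no MACE
        "theta": 1.0 / denom,               # m(Theta): residual uncertainty
    }

# e.g. weighted_bpa(0.7, 0.5363) for an SVM output adjusted to 0.7
```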

According to the weighted BPAs obtained by Eqs. (7)–(10), we can employ Dempster's combination rule to combine the four models' BPA functions. Based on Eq. (4), we have:

$$ {\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{all},\mathrm{j}}}(1)=\left({\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{SVM},\mathrm{j}}}\bigoplus {\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{L_1-\mathrm{LR},\mathrm{j}}}\bigoplus {\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{CART},\mathrm{j}}}\bigoplus {\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{GRACE},\mathrm{j}}}\right)(1) $$
(11)
$$ {\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{all},\mathrm{j}}}(0)=\left({\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{SVM},\mathrm{j}}}\bigoplus {\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{L_1-\mathrm{LR},\mathrm{j}}}\bigoplus {\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{CART},\mathrm{j}}}\bigoplus {\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{GRACE},\mathrm{j}}}\right)(0) $$
(12)

Thus, the final decision value for the jth patient, i.e., Rall, j, can be simply represented as:

$$ {\mathrm{R}}_{\mathrm{all},\mathrm{j}}=\frac{{\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{all},\mathrm{j}}}(1)}{{\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{all},\mathrm{j}}}(0)+{\mathrm{m}}_{{{\mathrm{A}}^{\ast}}_{\mathrm{all},\mathrm{j}}}(1)} $$
(13)

Table 4 shows the patient samples' BPAs, the combined BPAs and the final decision values. Note that the prediction labels are determined by thresholding the decision value at its optimal threshold, i.e., 0.4759, selected with the same criterion as in the dichotomization procedure. After all the procedures above, we obtain the ensemble prediction model, which takes the RST-derived weight of each single model into account when combining the BPAs with DST.

Table 4 The BPA, combined BPA and the final decision value for 10 patient samples
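
Putting the pieces together, the sketch below combines the four weighted BPAs with Dempster's rule specialized to the binary frame {1, 0} plus the ignorance element Θ (Eqs. (11) and (12)) and computes the decision value of Eq. (13). It assumes the `weighted_bpa` helper sketched above; all names are illustrative.

```python
from functools import reduce

def combine_binary(m1, m2):
    """Dempster's rule on the frame {1, 0} with the ignorance element 'theta'."""
    def meet(a, b):
        if a == "theta":
            return b
        if b == "theta":
            return a
        return a if a == b else None        # None marks an empty intersection
    combined, k = {1: 0.0, 0: 0.0, "theta": 0.0}, 0.0
    for a, va in m1.items():
        for b, vb in m2.items():
            c = meet(a, b)
            if c is None:
                k += va * vb
            else:
                combined[c] += va * vb
    return {c: v / (1.0 - k) for c, v in combined.items()}

def ensemble_decision(bpas):
    """Eqs. (11)-(13): fuse the four models' BPAs and return R_all,j."""
    m_all = reduce(combine_binary, bpas)
    return m_all[1] / (m_all[0] + m_all[1])

# e.g. ensemble_decision([weighted_bpa(a, w) for a, w in zip(adjusted, weights)])
```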

Experiments and results

Based on our previous work, we obtained the original outputs of the four single models, i.e., SVM, L1-LR, CART and GRACE, for a total of 2930 ACS patient samples collected from the Cardiology Department of the Chinese PLA General Hospital. We employed 5-fold cross validation to construct both the four single models and our proposed model. For comparison with other ensemble methods, we also trained Bagging [23] and AdaBoost [24] models with 5-fold cross validation. The area under the ROC curve [25] (AUC), prediction accuracy (ACC) and their corresponding standard deviations (STD) are used to evaluate all models. All model constructions and statistical analyses were completed in R version 3.3.1 (The R Foundation for Statistical Computing, Vienna, Austria). Table 5 lists the four single models' weights in each fold of the 5-fold cross validation. Tables 6 and 7 show the AUC values and accuracy for all models in our study.
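
For reference, a rough sketch of the per-fold metric computation is shown below (our own Python illustration; the study itself was carried out in R). In the actual experiments the single models, weights and thresholds are re-derived within each training fold, which this fragment does not show.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, accuracy_score

def fold_metrics(labels, decision_values, threshold=0.4759, folds=5, seed=0):
    """AUC and ACC of the ensemble decision values, computed fold by fold."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    aucs, accs = [], []
    for _, test_idx in skf.split(decision_values.reshape(-1, 1), labels):
        y, r = labels[test_idx], decision_values[test_idx]
        aucs.append(roc_auc_score(y, r))
        accs.append(accuracy_score(y, (r >= threshold).astype(int)))
    return np.mean(aucs), np.std(aucs), np.mean(accs), np.std(accs)
```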

Table 5 The weights of single models in each fold
Table 6 The AUC values of all models
Table 7 The accuracy values of all models

From Table 5, we can see that each model receives a different weight in each fold, which indicates that the weight calculation step distinguishes the discrimination ability of each single model and shapes the construction of the proposed model in every fold of the cross validation. As illustrated in Tables 6 and 7, our proposed method achieves the highest AUC value compared with the four single models, which means it can combine the outputs of the single models and generate a more reliable prediction result. In addition, among all models with AUC values above 0.70, the accuracy of our model is competitive. Moreover, compared with the traditional ensemble methods, i.e., Bagging and AdaBoost, our model performs better by a clear margin. Furthermore, the proposed model is the only one whose AUC values are above 0.70 in all five folds, with a competitive standard deviation, which indicates the outstanding stability of our method. Figures 2 and 3 present a more accessible comparison between our proposed model and the other models.

Fig. 2 The average AUC values with standard deviation

Fig. 3 The average accuracy values with standard deviation

Discussion

The problem of MACE prediction plays a vital role in the optimal treatment management of ACS patients during their hospitalization. To address the limitations of traditional risk scoring models and machine learning methods, as well as the uncertainties of EHR data, we presented an ensemble approach. We first employed RST to determine each single MACE prediction model's weight. DST was then applied to combine all weighted single models into our ensemble model so as to enhance the performance of MACE prediction. Experiments were conducted on a clinical dataset collected from the Cardiology Department of the Chinese PLA General Hospital. The experimental results show that our proposed method achieves the best prediction performance, with an AUC of 0.715, which indicates that our model can combine the various information provided by the single models to generate more reliable and stable prediction results on the MACE prediction problem.

It should be mentioned that several issues need further exploration.

In our current work, the single models we employed are taken directly from our previous work with no further selection. However, the single models' outputs have a significant impact on the final prediction results. Thus, we need to explore which single models are the most appropriate for the proposed method to combine so as to improve the prediction performance. Furthermore, resampling, a key technique for constructing additional single models, is also a promising direction for building a more powerful and robust ensemble prediction model on top of the proposed method.

In our future research, we plan to develop and deploy a continuous MACE prediction service in practice. Note that the dynamic nature of a patient's status is often essential to risk stratification and to the subsequent treatment interventions adopted in clinical practice. Thus, it would be valuable to provide a continuous MACE prediction service throughout patients' hospital stay. Such a service would not only anticipate MACEs at runtime, but also monitor patient treatment processes in a continuous and predictive fashion.

Conclusion

In this paper, we presented an ensemble approach to alleviate the limitations of traditional ACS risk scoring models and machine learning models, as well as the uncertainties of EHR data. We first employed RST to determine the weight of each single model. After that, DST was applied to combine the weighted outputs of the single models into the final prediction results. The experimental results indicate that our proposed method achieves an AUC of 0.715 with a competitive standard deviation, a better performance on the MACE prediction problem than that of the single models.