Introduction

Approximately 12.2 million incidents of stroke occur worldwide each year [1, 2]. Stroke is the leading cause of chronic disability, with an increasing burden on patients and society [1, 3]. Management of stroke leads to an improved prognosis due to recanalization and restoration [4,5,6]. Nevertheless, primary prevention remains the most effective means of reducing the stroke burden.

Identifying and managing risk factors is important to prevent strokes. Because the pathophysiology differs between ischemic and hemorrhagic stroke, risk factors can differ between the two types [7, 8]. Nevertheless, risk factors, including old age, hypertension, diabetes mellitus (DM), smoking, and obesity, are common to ischemic and hemorrhagic stroke [1].

Risk factors can be classified into modifiable and unmodifiable. The management and identification of modifiable risk factors such as hypertension, DM, dyslipidemia, and atrial fibrillation (AF) are a major focus. Stratification of stroke occurrence aids in primary prevention by selecting candidates for the appropriate management of risk factors.

The Framingham Stroke Risk Score (FSRS), CHADS2, ABCD, and ABCD2 scores are widely used in clinical practice to stratify stroke occurrence [9,10,11,12,13]. The FSRS is based on cohort data in subjects aged 55–84 years and predicts the 10-year stroke risk [13, 14]. The model is calculated using variables, including age, sex, systolic blood pressure, use of antihypertensive drug, presence/absence of left ventricular hypertrophy on electrocardiogram, prevalent cardiovascular disease, current smoking status, current/previous AF, and DM [14]. The CHADS2 investigates data from the National Registry of AF consisting of Medicare beneficiaries aged 65–95 years and is used to identify stroke risk and determine the eligibility of patients with AF for antithrombotic therapies [10]. The CHADS2 assigns one point for the presence of congestive heart failure, hypertension, and DM and age ≥ 75 years and two points for stroke history or transient ischemic attack [10]. The ABCD score, proposed for risk stratification after transient ischemic attack [12], combines variables of age, blood pressure, clinical features, and duration of systems scored using a six-point scale. Moreover, ABCD2 is a modified model that includes history of DM as the clinical variable [9].

Previous models of stroke occurrence stratification have limitations for primary prevention: each model stratifies different risk categories in one person; some models are applicable to patients with specific conditions, and each model requires different fixed variables. These variables are required, but it is impossible to modify the variables to fit each dataset. Medical datasets include the potential risk factors of stroke and require a variable modification model. These models do not fit the current clinical situation, and modification by increasing the number of datasets remains difficult.

Prevention models require flexibility and easy calculation. A model that can be transformed according to the available variables in datasets, rather than fixed variables, can be widely used. Furthermore, if individuals can calculate their own stroke probability by themselves, motivation for the management of risk factors will further increase.

We aimed to develop a scoring model for stroke occurrence using a public dataset. The established model exhibits characteristics of flexibility for the included variables, which allows modification of other datasets and simplicity for self-calculation.

Methods

Dataset

In this study, we used the benchmark dataset of the Kaggle competition for stroke prediction (https://www.kaggle.com/asaumya/healthcare-dataset-stroke-data. Accessed December 1, 2018). This dataset is a subset of the original stroke data collected from healthdata.gov and accounts for 1.18% of the whole original dataset [15]. The dataset comprises a total of 43,400 records, of which 783 correspond to patients with stroke and the others to non-stroke participants. Overall, the occurrence ratio of stroke was 1.8%.

The dataset contained three continuous variables and seven categorical variables. Continuous variables include age, body mass index (BMI), and average glucose level, while categorical variables include heart disease, hypertension, sex, ever-married status, smoking status, work type, and residence type (Table 1). The detailed lists of variables in the dataset are also described in a previous study [15].

Table 1 Descriptive statistics of numerical variables and number of cases of categorical variables for the whole datasets

Notably, the dataset lacked some information. Specifically, the smoking status of 13,292 cases (approximately 30%) and BMI of 1,462 cases (approximately 3%) were missing. In the development datasets, we dropped cases with missing BMI information. Additionally, we dropped seven cases that selected “other” in the sex question to remove ambiguous data.

Finally, we randomly divided the development datasets containing 41,931 cases with 643 occurrences of stroke into three subsets: 60% (N = 25,158) were used as the training dataset, 20% (N = 8,386) were used as the validation dataset, and 20% (N = 8,387) were used as the test dataset. We then preprocessed the three datasets, fit the model to the training dataset and validated it with the validation dataset, and evaluated the model performance on the test dataset (Table 1).

Preprocessing

We transformed 10 variables using the weight of evidence (WoE) method. For each variable, the WoE is computed by calculating the logarithm of the ratio of the proportion of non-strokes over the proportion of strokes as follows:

$${\text{W}\text{o}\text{E}}_{\text{i}}=\text{l}\text{n}\left(\frac{{\text{N}\text{S}}_{\text{i}}/\sum {\text{N}\text{S}}_{\text{i}}}{{\text{S}}_{\text{i}}/\sum {\text{S}}_{\text{i}}}\right),$$

where \({\text{N}\text{S}}_{\text{i}}\) is the number of non-strokes in bin i, \(\sum {\text{N}\text{S}}_{\text{i}}\) is the total number of non-strokes, \({\text{S}}_{\text{i}}\) is the number of strokes in bin i, and \(\sum {\text{S}}_{\text{i}}\) is the total number of strokes.

WoE transformation converts categorical variables into numerical values and has a linear relationship with the logistic function. Thus, WoE-transformed variables are well-suited for input features in the logistic regression model. High positive WoE values indicate a low risk, whereas high negative WoE values indicate a high risk. WoE is widely used in models for risk management, such as credit risk models [16,17,18].

Logistic regression

We used a logistic regression model for the assessment of stroke occurrence because the non-strokes/strokes odds ratio in the logistic regression is easy to calculate and interpret, and logistic regression has been widely used in building prediction models for various diseases [19, 20]. We first performed univariate logistic regression to evaluate the significance of 10 transformed variables. Finally, we performed a multivariable logistic regression with input transformed variables at P < 0.05 in the univariate logistic regression. We used a backward selection approach to build the final model. The significance of each transformed variable in the logistic regression model was determined at a threshold of P < 0.05.

Model performance

The developed model was used to predict stroke occurrence from the validation and test datasets. To evaluate the model performance, we computed the area under the receiver operating curve (AUROC) and Kolmogorov–Smirnov (KS) statistics for the validation and test datasets. The AUROC measured the discriminatory power of a stroke prediction model that can be interpreted as the probability that strokes receive better scores than non-strokes [21]. The KS statistic computed the maximum differences between the cumulative distributions of two discriminations, such as strokes and non-strokes, where each discrimination score had a value between 0 and 1 [21]. We have also computed the accuracy of the stroke-scoring model. Accuracy represents the ratio of correct predictions made by the model on both the test and validation datasets, which includes the number of true positives and true negatives divided by the total number of predictions.

Scorecard model

Borrowing the concept of the credit scoring model, [16, 17] we developed a scorecard model for strokes. In the clinic, the use of a disease score rather than the probability of disease ranging from 0 to 1 is usually beneficial to determine a patient’s health status. According to the scorecard model, the odds ratios used in the logistic regression could be converted into a disease score as follows:

$${{\rm{P}}_0} = {\rm{A}} + {\rm{B}} \times {\rm{ln}}\left( {{\rm{odds}}\,{\rm{ratio}}} \right),$$
(1)

where A and B are constraints that need to be determined through specific disease scorecard model settings and \({\text{P}}_{0}\) indicates the user-defined baseline score. The point of double odds (POD) representing the score that doubles the odds was used to determine these constraints. The sum of POD and \({\text{P}}_{0}\) yields the double odds ratio as follows:

$${{\rm{P}}_0} + {\rm{POD}} = {\rm{A}} + {\rm{B}} \times {\rm{ln}}\left( {2 \times {\rm{odds}}\,{\rm{ratio}}} \right).$$
(2)

By solving Eqs. (1) and (2), we calculated the two constraints as follows:

$$\left\{ {\begin{array}{*{20}{c}}{{\rm{A}} = {{\rm{P}}_0} - {\rm{B}} \times {\rm{ln}}\left( {{\rm{odds}}\,{\rm{ratio}}} \right)}\\{{\rm{B}} = {\rm{POD}}/{\rm{ln}}\left( 2 \right).}\end{array}} \right.$$

A detailed description of the scoring method with POD has been provided previously [22].

Results

WoE Transformation

To illustrate the effect of WoE transformation for each variable, we plotted both WoE values and stroke occurrences as a function of variable bins. For each variable bin, we additionally computed stroke occurrences as a ratio of the number of strokes to the number of strokes and non-strokes. Trends between WoE values and stroke occurrences show an inverse relationship (Fig. 1). The WoE values of the continuous variables, such as age, BMI, and average glucose level, showed decreasing trends as the bin number increased. Additionally, the categorical variables, including heart disease, hypertension, ever-married status, smoking status, and work type, showed a good WoE distribution, with a difference between the maximum and minimum WoE values of more than 1. In contrast, sex and residence type showed similar WoE values within each category.

Fig. 1
figure 1

WoE values, stroke risk, and the number of samples for each bin of 10 variables. The vertical bar shows the number of samples, while the line depicts the WoE value or the percentage of stroke incidence for each bin within the training dataset. Abbreviations: C, children; FS, formerly smoked; GJ, government job; N/A, not available; NW, never worked; NS, never smoked; P, private job; SE, self-employed; WoE, weight of evidence

Stroke occurrence Model using logistic regression

We summarized the univariate and multivariable logistic regression analysis (Table 2). Except for residence type, all variables were significant in the univariate logistic regression analysis. However, in the multivariable logistic regression, only four variables, namely age, average glucose level, heart disease, and hypertension, were selected by a backward stepwise selection method. All the included variables, such as age, average glucose level, heart disease, and hypertension, were significant in predicting stroke occurrences (P < 0.05). Our multivariable logistic regression model showed that the AUROC and KS values of the validation (test) datasets were 85.3% (83.4%) and 57.6% (53.3%), respectively.

Table 2 Logistic regression and fit results

Stroke-scoring model

The distribution of stroke-scoring for the validation and test datasets shows that our model comprised four variables and scores ranging from 361 to 965 (Fig. 2; Table 3). Using the score cards (Fig. 3), we can easily evaluate an individual’s stroke probability. As an example, let’s consider the case of a 63-year-old patient with heart disease and hypertension, and an average glucose level of 130. Based on the stroke-scoring model, the patient’s score would be calculated as follows: Starting with a base score of 703 points, we add 42 points for their age, 50 points for having heart disease, 24 points for having hypertension, and 1 point for an average glucose level of 130, resulting in a total score of 820 points. According to Fig. 2, if the stroke score falls between 800 and 899 points, the probability of stroke is very high at 7.59%. Our stroke-scoring model was found to be 90.6% accurate for both the test and validation dataset when the disease score threshold was set at 800.

Fig. 2
figure 2

Score distribution for the validation and test datasets and stroke occurrences for each score bin. The ticker values on the left y-axis represent the density of the score distribution. The numbers above the markers on the left (right) represent stroke risk for the validation (test) dataset

Fig. 3
figure 3

Score cards for average glucose level, age, heart disease, and hypertension. Our stroke-scoring model allows to easily calculate an individual’s stroke score using base points and four variables. The stroke risk is determined by the stroke score and score distribution in Fig. 2

Table 3 Number of strokes and non-strokes in each bin of disease score

Discussion

This study developed a flexible and simple scoring model for stroke occurrences using a public dataset. The first step included the transformation of each variable using the WoE method. In this process, changes in stroke occurrences were observed according to the increase or decrease in continuous variables and the presence or absence of categorical variables. The next step was the selection of significant variables that affect stroke occurrences. Significant variables reflect risk factors of stroke. Finally, the scoring model was developed with the base point and scorecards using the significant variables.

Our model was based on the credit evaluation model of banks. The evaluation of credit is one of the most significant data for loans in the banking system [16, 23]. Banks use the credit evaluation model to predict the probability of default by loan applicants [16, 23]. In this study, loan applicants are cases in the dataset and default means stroke occurrences. The advantage of the credit evaluation model of banks is the flexibility of included variables. Because potential variables that can affect individual credit scoring have increased exponentially in recent years, it is necessary to determine whether each variable should be included in the credit evaluation model. Similar to that of the credit evaluation model, the flexibility of our model enables us to add new potential determinants for stroke occurrences in the process. Furthermore, our model is not limited to stroke; it can also be used for other disease models fitted to other datasets containing different variables.

Our model offers several advantages over previous models. First, it employs coefficients for individual variables, facilitating easy comprehension and interpretation of the effects of each variable on an individual’s stroke probability. In contrast, earlier models such as FSRS, CHADS2, ABCD, and ABCD2 have fixed variables, making it challenging to incorporate new variables beyond the fixed ones [9,10,11,12,13]. Second, the model uses a simple calculation method involving the addition of scores assigned to each variable. However, it should be noted that the model must be re-fitted to include new variables, which may result in changes to variable scores during the re-fitting process.

The simplicity of our scoring model allows individuals to easily self-calculate their probability of stroke. The WoE transformation is critical for the development of our scoring model [22]. The WoE values correspond to the stroke occurrences according to changes in each variable. The WoE values of significant variables are expressed as a monotonic relationship with stroke occurrences. The monotonic relationship easily enables each variable to transform our scoring model. The model provides a common base point and scorecards for variables included in the final model. The patients can intuitively understand that the management of included variables reduces the scores and stroke occurrences. This self-calculation of our model can motivate individuals to manage determinants to reduce stroke occurrences for primary prevention. In addition, the significant variables in our model are not essential for model development in other datasets. Each institution can easily replace other significant variables with their own medical data instead of unavailable significant variables in other datasets.

The selection of significant variables reveals risk factors of stroke. All variables except residence type were significant in the univariate regression; however, only well-known predisposing factors were selected as significant variables after multivariable logistic regression. The four significant variables in our model are consistent with previous studies on risk factors predisposing individuals to stroke. Hypertension is the single most important modifiable risk factor for both ischemic stroke and intracerebral hemorrhage [24,25,26]. Furthermore, the incidence of stroke increases with age [27], and increasing degrees of glucose intolerance lead to an increasing risk of thromboembolic stroke [28, 29]. The process of variable selection is used to find potential risk factors for specific diseases in medical datasets. However, we can estimate the limitations of our model based on the exclusion of smoking status, which is a well-known predisposing factor [30, 31]. The exclusion of smoking status may have two effects. First, stroke incidence shows a dose-dependent relationship with smoking with a more than twofold increase in heavy smokers compared with light smokers [32]. In addition, cessation of smoking decreases the risk of stroke in former smokers to the level of non-smokers after 5 years of abstinence [33]. Our dataset included smoking status (never smokers, former smokers, and current smokers), but the dose of smoking and the period of cessation of smoking were not included in the dataset. Second, the missing data corresponding to the smoking status in the dataset represented 30.6% (8,706/41,931) of included cases. The large percentage of missing data on smoking status may affect the exclusion of smoking status in our final model. Nonetheless, the selected variables in the scoring model are consistent with those reported in previous studies on risk factors of stroke [25, 28, 31]. Although the smoking status was excluded, the most well-known risk factors of stroke were included in our model. Therefore, the selection of variables in this study could be used as an example to identify new risk factors using various datasets and confirm risk factors in longitudinal studies.

This study had several limitations. First, the dataset contains limited information on stroke type and variables. Further information regarding diseases and variables could help improve the performance of the disease-scoring model. Second, the distribution of disease scores was not smooth. We could develop a smoother distribution of stroke scores if we had the numerical information of systolic and diastolic blood pressure in addition to a yes or no categorical variable for hypertension.

Conclusions

The disease-scoring model, based on the scorecard concept, is flexible and simple, making it suitable for individual use or adoption by government agencies responsible for public health. Although the current model is based on a limited number of variables extracted from publicly available datasets, it has the potential to expand into a model that can better reflect the risk factors for stroke by utilizing large-scale data collection at institutional and national levels. Even with the addition of more variables, the results will be presented in a scorecard format that can be easily evaluated by the public, making it highly applicable for individual use. If the stroke score is high due to factors such as high body weight or glucose levels, patients may be motivated to manage their health by adjusting their weight or managing their diet to reduce the risk factors of stroke. In conclusion, this scorecard model offers several advantages for identifying and managing risk factors, thereby promoting primary prevention of stroke.