Introduction

Warfarin is an oral anticoagulant; it has been used for the treatment and prevention of thromboembolic disorders for more than 60 years1. Despite its well-studied clinical pharmacology, high efficacy, and cost-effectiveness, it is clinically challenging to determine the appropriate dosage of warfarin for each individual because of its narrow therapeutic window and variable patient responses2. Warfarin is one of the ten main anticoagulants that cause adverse drug events3. The risk of thrombosis increases if the dosage is insufficient, and conversely, the risk of bleeding increases if the dosage is excessive3. Conventionally, the international normalized ratio (INR) blood coagulation test is performed daily to achieve optimal efficacy and minimize side effects of warfarin, and physicians adjust the warfarin dosage individually based on their medical experience and INR values4.

Recent studies have attempted to predict the ideal warfarin dosage using machine learning algorithms, and most models were developed using genetic data obtained through genetic testing5,6,7,8,9,10,11,12. Genetic variations in CYP2C9, VKORC1, and CYP4F2 have been shown to have significant correlations with warfarin. However, genetic testing is not performed in actual clinical settings because it is time-consuming13. One study aimed to predict the adjustment dose of warfarin using only clinical data14. However, warfarin adjustment doses are prescribed to outpatients and can interact with some foods or alcohol, thus affecting their lifestyle. In such cases, it is questionable whether it is more appropriate to use lifelog data that reflects the patient’s lifestyle and environment than using clinical data included in the electronic medical records. Consequently, it is more proper to use EMR derived from controlled inpatients populations in large hospitals for initial warfarin dose prediction.

In this study, we suggested that clinical warfarin dosing algorithms that predict discharge warfarin dosage to the initial warfarin dosage without genetic data are important for the following rationale. In tertiary hospital, it is difficult to use the genetic test in initial warfarin dose judgement immediately after hospitalized, becase it takes about two weeks the genetic test to come out. It results in physicians must decide the initial warfarin dosage based on limited the patient’s information. Accordingly, we concluded that a tool providing reliable warfarin dosage from limited clinical information is clinically relevant.

Finally, we proposed machine learning models that predict the warfarin discharge dosage as an appropriate initial warfarin dosage using only clinical data generated in the hospital within 2 days of hospitalization. The models proposed in this study can not only reduce unnecessary treatment duration but also prevent the adverse drug event of warfarin, by rapidly presenting appropriate warfarin dosing. Additionally, it can contribute to both hospitals and patients, by securing space for other patients in hospital wards and reducing the financial burden of hospitalization costs for patients.

Methods

Ethical approval

The Institutional Review Board of Asan Medical Center approved the protocols of this study (No. 2021-0321), which was conducted corresponding to the 2008 Declaration of Helsinki. Also, this study exempted the requirement for informed consent as the database used for the study consisted of anonymous, de-identified information. All experiments were performed in accordance with relevant guidelines and regulations. Also, we gained the data access to MIMIC-III database to use for external validation process taking courses titled Biomedical Responsible Conduct of Research from CITI.

Study design

In general, warfarin is prescribed at 8 p.m based on initial blood test results, physiological measurements, and concomitant medications of a patient on the day of starting warfarin. Initially, a dose of 5 mg or higher may be initiated to rapidly achieve a therapeutic level, but in cases where the patient is deemed to be at a higher risk of bleeding, a lower dose of 1–2 mg may be used. Following the initiation of warfarin, the daily INR tests are performed to determine the appropriate dosage until the patient reaches a stable therapeutic level (INR 2.0–3.0). While some patients may easily reach the appropriate warfarin dosage, for others, the daily fluctuations in INR values can lead to several days of adjustment before achieving the target range of INR 2.0–3.0. Among the patients admitted to Asan Medical Center, those included in this dataset typically undergo such a process. Reflecting the warfarin therapy schedule of the patients, we designed four machine learning models to predict the warfarin dosage prescribed at discharge using only clinical data measured on the first and second days of hospitalization (Fig. 1). At first, we extracted clinical variables in electronic medical records (EMRs) of the Asan Medical Center (AMC) and Medical Information Mart for Intensive Care III (MIMIC-III) database15 with same criteria for model development and validation. Then, data pre-processing was performed and the models were trained using training set. To assess the clinical utility of the models, the initial warfarin dosage prescribed by physicians was compared with the models in terms of the mean absolute error (MAE)16 and accuracy. Finally, we evaluated our models in the internal validation set that whether the models can provide more accurate warfarin dosage than physicians initial dose. In addition, we externally validated the models to external validation set derived from the MIMIC-III. Data pre-processing, model development, training and validation were conducted in Python 3.8.10. Additionally, we analyzed the predictions of five physicians for 40 data points by calculating the intraclass correlation coefficient (ICC)17 value and compared with the distributions of the model predictions to explore the objectivity of the models.

Figure 1
figure 1

Overview of the workflow. Our single-center study complied with the workflow in the following order: (1) based on clinical rationales, we extracted 17 features associated with warfarin from the electronic medical record (EMR) database up to the second day of hospitalization; (2) we conducted data pre-processing, such as missing value imputation, outlier filtering, and normalization, and created the final dataset; (3) the models that predict the warfarin discharge dosage using the dataset were developed. (4) The discharge dose was used as ground-truth to calculate the error with physicians initial dose and model predictions. The yellow circles represent the one day during hospitalization. The green and orange circle indicate the first warfarin dosage after hospitalization and the warfarin dosage at discharge, respectively.

Data source

We used the EMRs of the Asan Medical Center (AMC) as the source of development cohort. The EMRs was derived from Asan BiomedicaL Research Enviroment (ABLE) platform that Asan Medical Center has been developing a de-identification system for biomedical research18. It ensured the accuracy and completeness of the data.

Data collection

The model development cohort consisted of patients admitted to the cardiovascular or thoracic and cardiovascular surgery departments of AMC between January 1, 2018, and October 31, 2020. All the selected participants were at least 19 years old; exclusions were based on the following criteria: none of warfarin prescription at discharge;< 3 warfarin prescriptions; and no weight measurements within 2 days of hospitalization (Fig. 2). The external validation set derived from the MIMIC-III followed the same workflow except for medical department codes, as the medical department codes of the intensive care unit (ICU) can not be found. Finally, the development cohort derived from AMC EMRs comprised 3168 patients and the external validation cohort derived from the MIMIC-III comprised 891 patients.

Feature selection

The process of feature selection used for model development and validation was conducted only when had already been proofed correlations with warfarin based on clinical rationales19,20,21,22,23,24,25,26. The 17 clinical variables regarded as key factors of warfarin dosage adjustments and associated with bleeding were selected from the four tables of demographic, diagnosis, medication, vital sign (Supplementary Table S1). First of all, four demographic variables of age, sex, height, weight were used. Second, disease status of renal disease was selected in the diagnosis table, due to association with renal disease and warfarin. Third, seven medicines, affecting functions of warfarin and increasing a risk of bleeding and causing warfarin dosage adjustments when combined with warfarin, were selected in the medication table. Additionally, systolic blood pressure and diastolic blood pressure measurements were selected as variables in the vital sign table because uncontrolled blood pressures can increase bleeding.

Figure 2
figure 2

Cohort diagram. AMC Asan medical center, EHR electronic health record, MIMIC-III medical information mart for intensive care III.

Categorization

We conducted categorization of variables in the status of renal disease and concurrent medication twice to reduce redundancy and consider more comprehensive information. It allowed us to group variables with clinical associations together and capture their collective impact on the prediction. This approach made the machine learning models more robust exploring broader patterns and relationships in the dataset. At first, the International Classification of Diseases, Tenth Revision (ICD-10) codes, which are the diagnostic codes used in the AMC EMRs, have a hierarchical structure; therefore, they can be grouped using the first disease category code. For example, code I48 includes all diseases related to atrial fibrillation and flutter. Finally, all ICD-10 codes for renal disease are selected based on the first three or four characters and can be found as Supplementary Table S2. Then, the medication information was categorized into each group based on the associated ingredients of the prescribed drugs. In specific, rosuvastatin, simvastatin, and atorvastatin were grouped as statins, whereas furosemide and torasemide were grouped as diuretic (loop). Last, isosorbide dinitrate and nitroglycerin were grouped as nitrate. Consequently, a total of 35 components were assigned into 7 groups with similar drug effects.

Data transformation

The categorical variables need to convert into numerical values because it can’t use in machine learning models. Consequently, we performed one-hot encoding at two times to variables in status of renal disease, sex. First, the sex code underwent one-hot encoding, with 1 representing male sex and 2 representing female sex. Next, status of renal disease underwent one-hot encoding as 1 if a specific diagnostic code was assigned within 2 days of hospitalization and 0 otherwise. Finally, the entire of categorical variables were transformed as numerical vectors can be used in machine learning model.

Imputation

Data cleaning was performed for each model, since the appropriate method differs depending on the type of model. The XGBoost and Random Forest models are a decision-tree-based ensemble machine learning algorithm, but artificial neural network and linear regression models are not. Therefore, we performed two imputation methods to preprocess missing values in numeric variables, by considering whether the model was a decision-tree-based model. Because the tree-based models can automatically handle missing values and are not sensitive to missing values or outliers27, any missing data was replaced with minus 1 in XGBoost and Random Forest models. Mean imputation would be appropriate to deal with the missing data for artificial neural network and linear regression model, because the numeric variables used in this study do not have a wide range of values. Accordingly, we performed mean imputation that convert missing values into mean values of the specific variable in artificial neural network and linear regression model.

Normalization

Finally, we conducted normalization in artificial neural network and linear regression models and all variables were normalized using the minimum-maximum scaling method28, resulting in a range from 0 to 1.

Data split

The AMC dataset (n = 3168) was separated as 80% for the training set (n = 2534) and 20% for the internal validation set (n = 634) using random split method. Additionally, we performed external validation to evaluate the generalizability of the model to the external validation set (n = 891) from the MIMIC-III.

Model development and validation

The following models were developed: artificial neural network (ANN)29, linear regression30, extreme gradient boosting (XGBoost)31, and random forest32 models. These models were trained in training set including 17 clinical features and predicted a warfarin discharge dosage. We conducted Grid Search33 with random shuffles of 5-fold cross-validation34 to identify the optimal hyperparameters for each model. The entire of final hyperparameters of four models was shown in Supplementary Table S3. Finally, the XGBoost and random forest models utilized the raw dataset as the input, Whereas the ANN and linear regression models used the minimum-maximum scaled dataset as the input to help a rapid optimization of each models.

Performance metrics

We used both of MAE and predictive accuracy as performance metrics to evaluate the model prediction ability. First, the discharge dosage of warfarin set as target. Then, the error between the predictions, including physicians initial dose and model prediction, and the discharge warfarin dosage were calculated for each sample. Finally, the errors for each sample were summed and divided with entire sample size to obtain the MAE. The MAE of the models was compared with the physicians initial dose to examine that our models’ predictions were more accurate and rapid than the physicians.

Subsequently, three thresholds set according to the suggestion of cardiovascular physicians that if the prediction error was within 0.5 mg, 1.0 mg, 1.5 mg, the prediction could be accepted as good, normal, the least, respectively. Reflecting this, we calculated the accuracy of the model prediction using three thresholds: 0.5 mg, 1.0 mg, and 1.5 mg. This approach calculating the accuracy was consulted a logistic regression that conducts a binary classification whether the prediction was greater than a specific cut-off value35. The prediction was classified as accurate if the MAE of the sample was smaller than the corresponding threshold; otherwise, it was classified as inaccurate. For example, when the threshold was 0.5 mg, if the MAE of a particular sample was 0.3 mg, the prediction was classified as accurate. We calculated the proportion of samples with accurate predictions determined by each model and evaluated the accuracy of predictions based on each threshold.

Model interpretations

We used the Shapley additive explanations (SHAP) method to obtain insights of the predictions of our models and understand how each variable contributes to predictions36. The SHAP method is an explainable artificial intelligence method that decomposes the output of the model into the contributions of each feature, allowing for an analysis of the influence of each feature on the model37. It considers dependencies between features and can calculate positive and negative impacts, unlike traditional variable importance measures. Higher SHAP values indicate that the patient needs higher warfarin dose. The SHAP values calculated using the internal validation set were applied to visualize beeswarm and waterfall plots. First, the impact of each feature for model predictions was analyzed through a beeswarm plot. Second, waterfall plots showed that how the model considers the worth of each feature for individual predictions.

Comparison of models’ and physicians’ predictions

We selected 20 data points with accurate model predictions, high physician prediction errors and 20 data points with high model prediction errors and accurate physician predictions. The XGBoost model was used. Subsequently, we constructed a dataset with 50% models accuracy. Next, we distributed these datasets to five physicians and asked them to predict the appropriate warfarin discharge dosage. We analyzed intraclass correlation coefficient (ICC) of the physicians’ predictions to test the interrater agreement using 2-way random effects model in R. Finally, we compared the predictions of the machine learning models and those of the physicians.

Results

Participants characteristics

The baseline characteristics of the two datasets used for model development and validation are listed in Table 1. Also, we additionally confirmed both of the last INR(PT) value and the duration of hospitalization for check the baseline condition of patients, but not used in the models.

Table 1 Characteristics of participants.

Model performance

The MAE and accuracy at the threshold for both datasets are listed in Table 2. The following MAEs were calculated for the internal validation set: XGBoost, 0.9; random forest, 1.0; Artifical neural nets (ANN), 0.9; and linear regression, 1.0. All models had better prediction performance than expert physicians (MAE of 1.3). Using the external validation dataset, the following MAEs were achieved: physicians, 1.8; random forest, 1.8; linear regression, 1.8; ANN, 2.0; and XGBoost, 1.9. Consequently, internal validation of the internal validation set from the AMC EMRs confirmed that all predictions of the artificial intelligence models had lower errors and higher accuracy than those made by physicians regarding MAEs and accuracy. However, the physicians showed similar or superior performance when compared with all machine learning models in terms of the MAE, in external validation derived from the MIMIC-III. The MAE box plots of the internal validation set and external validation set are shown in Fig. 3.

Table 2 Model performance according to the MAE and accuracy.
Figure 3
figure 3

Bar plot representing the MAE. Performance abilities of the models based on the MAE were visualized using the internal validation (n = 634) and external validation (n = 891) sets. The unit of error is milligram (mg).

Model interpretations

We examined the SHAP values of the features with the most impact on model predictions using the beeswarm plot (Fig. 4). We also investigated the impact of features on individual predictions. We randomly selected 4 data points from the internal validation set with no missing values, and our models accurately predicted all of them. Subsequently, we calculated the SHAP values using a waterfall plot to explain individual predictions. The waterfall plot explains the influence of each feature on individual predictions (Fig. 5). The patient of Fig. 5a was 61-years-old male and taken with variable medications including aldosterone antagonist, nitrate, lipid lowering agents, diuretic. His height was 169cm and weight was 94.2 kg. The patients of Fig. 5b was 64-years-old male and taken nitrate and lipid lowering agents, also have a renal disease. His height was 173.6 cm and weight was 73.85 kg. The patient of Fig. 5c was 49-years-old male and taken with Amiodarone. His height was 168.2 cm and weight was 72 kg. The patient of Fig. 5d was 26-years-old female and taken with aldosterone antagonist, diuretic, nitrate, tramadol. Her height was 165.7 cm and weight was 57 kg. Especially, her systolic and diastolic blood pressure were 116 mmHg and 80 mmHg, respectively. As a result, we identified that the contributions of features affected a model prediction were different for each patient. For example, weight affected negative impact for patient of Fig. 5d, but positive impact to the rest of patients.

Figure 4
figure 4

SHAP beeswarm plot of the features affecting the predictions of the XGBoost model. Features are ranked in descending order based on the absolute value of their influence on the XGBoost model. The x-axis indicates SHAP values. Each dot denotes a data point. Colors represent high values (red) or low values (blue) of specific data points. SHAP Shapley additive explanations.

Figure 5
figure 5

SHAP waterfall plot. The x-axis represents the individual warfarin dosage prediction of the models. The y-axis represents the input features of the models. E[f(x)] (2.714) represents the baseline value, which is the model output of the entire dataset, and f(x) represents the individual model output for each patient. Each arrow indicates whether a specific feature increased (red) or decreased (blue) the warfarin dosage. SHAP Shapley additive explanations.

Comparison of models’ and physicians’ predictions

Finally, we collected the predictions of the model and those of the five physicians for 40 data points (Fig. 6). First, intraclass correlation coefficient (ICC) of the physicians’ predictions was calculated to measure the interrater agreement (Table 3). ICC is a test that how different raters measure subjects similarly from equal data point. It is important as it represents measurement errors caused by variation in rater judgement38. ICC values range from 0 to 1, with values less than 0.5 indicating low reliability, values between 0.5 and 0.75 indicating moderate reliability, values between 0.75 and 0.9 indicating good reliability, and values above 0.9 indicating excellent reliability17. In other words, the high reliability demonstrates less prone to prediction error based on raters. As shown in Table 3, the ICC value of five physicians was obtained to 0.36. It showed the significant variability in the distribution of dosage predicted by physicians, since the ICC value below 0.5. As shown in Fig. 6, it suggested that each physician tends to focus on a specific warfarin dosage range based on their clinical experience. Besides, prediction accuracy of the models and the five physicians was compared. Physicians accurately predicted approximately 9 of 40 samples, achieving 23% accuracy; however, the models demonstrated 50% accuracy. This indicated that it can be difficult for physicians to determine the appropriate warfarin discharge dosage based on the 17 clinical features obtained within 2 days of hospitalization.

Table 3 ICC of the predictions by the physicians for 40 data points.
Figure 6
figure 6

Distribution of warfarin dosage predicted by the physicians and models. We identified the predictions of five physicians and that of our model using 40 data points. The actual warfarin dosage at discharge (yellow) was distributed evenly within 1–5 mg. The average prediction dosage of our model (green) was 2 \(.\) 6 mg, and the maximum prediction dosage of our model was 3 mg. The prediction accuracy of model and the physicians are 50% and 23%, respectively. WFR warfarin, DC discharge.

Discussion

In order to maximize the efficacy and safety of warfarin, it is important to determine the appropriate warfarin dosage for each individual39. However, there are variability in warfarin response for individual, due to differences about genetic and clinical factors. In particular, VKORC1, CYP2C9, and CYP4F2 polymorphisms account for approximately 40% of the variability in warfarin response, and novel genetic variants that have not yet been identified are assumed to account for approximately 50% of the variability. In contrast, clinical factors that can be considered in clinical practice only affect warfarin response by about 10%40. Nonetheless, genetic test is not routinely performed to determine the warfarin because genetic testing is costly and time-consuming41. On the other hand, warfarin has very reasonable cost in Korea. Therefore, the trial and error method that tries the specific warfarin dose then adjusted the dose has generally been used and is more efficient than genetic test. In clinical practice, physicians do not consider the genetic characteristics of patients rigorously42 and mainly decided the initial warfarin dosage through the frequent INR monitoring to maintain the target INR (2.0-3.0). Accordingly, we designed the machine learning models that used only clinical factors and predicted warfarin dose at discharge, to reflect the medical environment in Korea. Also, our models proposed in this study can be adoptable for developing countries where have difficult to conduct the genetic test. Additionally, the complex correlation with warfarin, including age, gender, weight, and drug-drug interactions, should be considered for warfarin. However, the priorities of features that have impact on warfarin are different as the health conditions of each patient are different, such as concurrent medication or disease status. Eventually, physicians depend on own clinical experiences to decide the initial warfarin dosage. We required five physicians who prescribed the warfarin at least 3 years predict the initial warfarin dosage for 40 data points to identify the difference of judgement of warfarin dosage for each physician. As a result, the ICC value analyzed in this study was 0.36, showing significant variability in the warfarin dose prediction distribution for each physician. Besides, the predictive accuracy of the physicians was less than half our XGBoost model. Consequently, our models can support the physicians by providing more objective and accurate warfarin dosage.

Our models showed similar or superior performance when compared to other warfarin dosage models (Table 4). We identified previous regression models that predict warfarin dosage as numerical target and use MAE as a performance metrics, to compare accurately between our models and other models. Gage et al developed a multiple regression model that predict warfarin dosage in derivation cohort (N = 1015), included Caucasian and African American, Hispanic, and validated the models in validation cohort (N = 292)5. Both of pharmacogenetic and clinical model were developed and reached a MAE of 1.0, 1.5, respectively. Pavani et al developed an artificial neural network using ten genetic variables as inputs and therapeutic warfarin dosage as the output in Indian population (N = 240)6. Roche-Lima et al collected cardiovascular patients of 190 Caribbean Hispanic were > 21 years old and developed seven machine learning algorithms that predict warfarin dosing in Caribbean Hispanics using pharmacogenetic data7. Among them, random forest regressor (RFR) significantly outperformed all other models with a MAE of 4.74. Tong et al recruited 685 patients who diseased atrial fibrillation or thromboembolic venous disease in a Spanish population using the data last 3 consecutive months and used multiple linear regression8. Both of pharmacogenetic and clinical model were developed and internally validated with a MAE of 3.5, 5.0, respectively in a validation cohort (N = 129). Grossi et al collected 377 patients who were over 18 years old and treated with warfarin in Caucasian population and developed an artificial neural network to predict an optimal warfarin maintenance dose9. The final model reached a MAE of 5.72. Saleh recruited 4271 multi-ethnicity patients who received warfarin and developed an artificial neural network using both of genotyping and clinical data. The artificial neural network model reached a MAE of 9.0. Hernandez et al generated pharmacogenomic warfarin dosing model using clinical and genotyping retrospective data from a derivation cohort of 349 African Americans patients were \(\ge\) 18 years and the model reached with a MAE of 10.9 mg/week11. Alzubiedi et al collected demographic, clinical, and genetic data from 163 African-American patients with a stable warfarin dose12. They developed both of a multiple linear regression model and artificial neural network model with MAE of 10.8, 10.9 respectively. Whereas, both of XGBoost and artificial neural network model in this study achieved 0.9 with MAE and outperformed aforementioned algorithms. Additionally, our models provided more appropriate warfarin dosage than those initially prescribed by physicians using clinical data within 2 days of hospitalization. It demonstrated that the models can make appropriate warfarin dosing decisions without the same level of effort as physicians who would consider various factors such as the INR value. These results are likely the results of successfully selecting important variables that may interact with warfarin from our initial clinical data obtained through discussions with experienced clinical experts and effectively utilizing the refined variables.

Table 4 Comparison with prior works.

Limitation

Our study has a limitation of homogeneous caused its single-center design. We suffered the ethical issues of obtaining the multi-center EMRs. It is difficult to access to EMRs to different centers, because it includes patient’s medical record and cover patient privacy. Eventually, we had no choice about the diverse populations, but to should use a single population. It caused that the external validation in this study didn’t successfully perform and our models can’t be generalized to the different ethnicity. In the further work, we have to obtain an access of a multi-center cohort and conduct a multi-center study, to improve the accuracy and representativeness of the models. Additionally, if we obtain the data with larger sample size, it would be proper separating the group by low-dose and high-dose, respectively and training the models to improve the accuracy of the models.

Conclusion

The patients who participated in this study hospitalized for an average 14 days and the INR measurement of the first and last days of hospitalization were 1.7 and 2.2, respectively. In other words, we assumed that the discharge warfarin dosage was appropriate as the initial dosage since the discharge dosage make patients maintain the target INR (from 2.0 to 3.0). Finally, we developed 3 machine learning models and 1 deep learning model using data of a model development cohort from a tertiary hospital in Korea to predict warfarin discharge dosage. Then, we evaluated our models with MAE and accuracy in both of internal and external validation. Our models not only outperformed physicians in internal validation, but also the previous models. In internal validation set with MAE, XGBoost and artificial neural network models achieved 0.9, and random forest and linear regression models achieved 1.0, whereas physicians achieved 1.3. Besides, our models outperformed previous warfarin dosing algorithm as we mentioned. Therefore, our models that provide proper warfarin dose within 2 days of hospitalization can be useful for patients when just start warfarin and might be effective tools that help physicians choose the personalized warfarin dosage when initiation warfarin.