1 Introduction

Prescription errors are a serious problem that threaten patient safety. A large variation in the frequency of prescription errors has been reported in different studies, ranging from 1.1 to 41.7% [1,2,3,4,5]. Major categories of prescription errors include wrong drug, wrong dose, or wrong strength [6]. While pharmacists conduct prescription audits to detect inappropriate prescriptions, human checks usually have limitations. In a clinical decision support (CDS) systems or computerized physician order entry (CPOE) system, overdose alert features have been implemented in recent years. However, most dose alert features currently implemented in CDS or CPOE are simple designs that warn only when the prescribed dose exceeds the approved upper-limit dose. With this specification, it is difficult to detect incorrect doses for drugs whose appropriate dose range varies widely depending on the disease or patient.

Oral corticosteroids are potent anti-inflammatory drugs used in the treatment of many diseases. The doses of these drugs vary greatly, depending on the disease or severity or patient population. In Japan, prednisolone tablets are one of the most commonly prescribed oral corticosteroids. These tablets may be administered daily, at a maintenance dose of ≤ 5 mg/body, in chronic diseases such as collagen disease. Conversely, in the chemotherapy of malignant lymphoma, high-dose prednisolone such as 100 mg/body is prescribed. Moreover, there is a possibility of dose adjustment within the same patient, based on his/her severity of symptoms.

To accommodate the wide variation in the required dose, oral prednisolone tablets are available in some strengths (1 mg, 2.5 mg, and 5 mg tablets in Japan). When adjusting the dose from a previous prescription, wrong input of the dose or wrong selection of strengths may occur due to the existence of multiple strengths of the drug. Prednisolone tablets, which have a very wide range of approved doses, are high-risk drugs that are prone to prescription errors [7, 8]. Even with the maximum approved dose of prednisolone tablets (e.g., 100 mg/day) set as a threshold for dose alert features currently implemented in CDS or CPOE, it is difficult to detect inappropriate doses in the clinical setting.

In recent years, research using machine learning (ML) has been actively conducted in the healthcare fields [9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31], and its usefulness has been reported for solving complex problems such as diagnostic support and prognosis prediction. The aim of this study is to develop a predictive model to determine the appropriateness of the dosage of oral prednisolone tablets using a machine learning algorithm. Real-world data in the field of drug safety, such as dose errors, are imbalance with very few positive cases. We hypothesized that ML would perform better than conventional models in detecting dose-inappropriate prescription, which is highly disproportionate data. We explore the possibility that a highly accurate prediction model using machine learning will be implemented as a prescription audit function for CDS and CPOE and that it will contribute to the prevention of dose errors of high-risk drugs.

2 Related Work

Clinical data are often imbalanced, i.e., some classes have much fewer instances than others. The percentage of positive cases in clinical dataset of Table 1 is imbalanced (1.6–47.2%). Without dealing with data imbalances, the minority class is often insensitive to machine learning. This problem is serious when the minority class is the target of the prediction. Several approaches have been proposed and research into solutions to overcome the problem of imbalanced datasets [32, 33].

Table 1 Previous works: binary prediction with multi-ML algorithm in clinical data (excluding image and text data) in 2017–2020

Ichikawa et al. [34] developed screening model for hyperuricemia using the random under-sampling method on highly imbalanced data with a positive rate of 1.4%. The synthetic minority over-sampling technique (SMOTE) [35] is the best-known over-sampling technique and is often used for imbalanced medical data: prediction of all-cause mortality by Sakr et al. [36] (34,212 patients data with 11.3% positive cases), prediction of stroke by Wu et al. [37] (1131 patients data with 5.0% positive case), and prediction of the risk for severe complication after bariatric surgery by Cao et al. [38] (training dataset of 37,811 patients with 3.2% positive case).

Feature selection [39] is a method of considering an imbalanced distribution by selecting the best features. De Silva et al. [40] identified prediabetes predictors from 6346 persons data with 23.4% positive case by combining feature selection and machine learning. In this study, four feature selection algorithms were used to select 15 to 30 variables depending on severity from 156 preselected variables. Ali et al. [41] improve the diagnostic performance of Parkinson’s disease by two-dimensional data selection using voice data.

Cost-sensitive learning [42] may also be used for imbalanced learning. This concept takes into account the cost of prediction errors when training a machine learning model by assuming that the cost of misclassification of the minority sample is higher than that of the majority sample. Nakamura et al. [43] used the L2-regularized logistic regression with cost-sensitive learning and CNN models to identify readmissions within 30 days of children and reported that the F-measures for both models were similar.

3 Methods

3.1 Database and Features

The data warehouse, which stores EMR data at Obihiro Kosei General Hospital, a regional center hospital in a rural area of Hokkaido in Japan, was used.

One hundred twenty-one thousand one hundred ninety prescription orders for oral prednisolone tablets (5 mg, 2.5 mg, and 1 mg), from April 1, 2012, to March 31, 2017, were extracted. Prescription orders in which one strength was prescribed more than once were excluded, assuming that the drug is taken every other day or that the instruction requires tapering (these prescriptions are sometimes seen in Japan for the purpose of reducing side effects). Also, those in which different strengths were prescribed on different days were excluded. When different strengths were prescribed for the same number of days, their doses were combined to calculate the daily dose of prednisolone. As a result, 116,285 prednisolone tablet orders were aggregated into 89,443 observations. In addition, the prescription that the previous prescription was confirmed during the investigation period were eligible (i.e., the prescription ordered prednisolone tablets for the first time was excluded). Eighty-two thousand five hundred fifty-three observations were obtained (Online Resource 1).

For the target prescriptions, department, age, gender, daily dose, and prescription days were investigated. For previous prescriptions, prescription date, daily dose, and prescription days were investigated. These features are available through the pharmacy’s dispensing system.

3.2 Standardization of Clinical Departments

The names of the departments included in this study varied depending on the characteristics and size of the hospital. Hence, the clustering of medical departments was done to construct a general model that could be applied to hospitals of any characteristics and size. The subject data were stratified in two dimensions: daily doses (seven categories), the number of prescription days (nine categories), and the number of cases corresponding to each cell were calculated for every medical department (Online Resource 2). A Euclidean distance-based cluster analysis (Ward method) was performed, based on a distribution table normalized so that the total number of prescriptions was one for each clinical department. Nineteen clinical departments were classified into six clusters with a similar prescription trend for prednisolone tablets.

3.3 Outcome and Variates

Prescription modification for prednisolone doses was investigated, and 135 prescriptions with dose modification were defined as positive cases. For positive cases, the dose in the pre-revision order (first edition) was investigated. Positive cases included dose-related changes (i.e., increase or decrease in the daily dose, addition and deletion of prednisolone tablets, or changes in the strength of prednisolone tablets). Changes in prescribing days, comments, and other concomitant medications were not considered positive cases. Eighty-two thousand four hundred eighteen prescriptions without dose modification were considered negative case. There was significant imbalance in the outcome, with a positive rate of 0.16%

The following features have been adapted to build a predictive model in the pharmacy prescription audit setting: the cluster of clinical departments, current daily dose (pre-revision dose in positive case), current prescription days (pre-revision days in positive case), daily dose of the previous prescription, and prescription days of the previous prescription.

3.4 Data Preprocessing

The data were stratified by the percentage of positive cases and divided into a 70% training dataset and a 30% test dataset (Fig. 1).

Fig. 1
figure 1

Procedure for data processing and machine learning

To see if dealing with imbalance data contributes to improved performance, training datasets were prepared with/without resampling preprocessing. Original training dataset includes 57,693 negative cases and 94 positive cases (positive rate: 0.16%). The preprocessed dataset was resampled by SMOTE so that the minority class was 10% of the majority (i.e., training dataset after resampling includes 57,693 negative cases and 5769 positive cases). SMOTE was performed using the Imbalanced-Learn library in the Python 3.6 language with the following settings: sampling_strategy = 0.1, k_neighbors = 5, and n_jobs = 1. The test dataset was not resampled.

3.5 Model Development

Logistic regression (LR) and following five machine learning algorithms were used for learning by training dataset: support vector machine (SVM), k-nearest neighbor (KNN), random forest (RF), gradient boosting (GB), and balanced random forest (BRF). Then, the learning models were validated on the test dataset.

The Python 3.6 language was used for coding the algorithm, and the Scikit-Learn library was used for all ML modules, except for BRF. The Imbalanced-Learn library was used for the latter. Hyperparameter optimization was not explored in this study.

BRF is an adaptation of random forest that under-samples the majority class, making use of the fact that random forest is an ensemble method [44]. In BRF, for each tree in the random forest, a bootstrap sample is drawn from the minority class. The same number of observations is then randomly drawn from the majority class. In the usual under-sampling method, most of the information in the majority class is lost without being used. The BRF algorithm uses the majority class data in the ensemble tree. Although studies that have used BRF for imbalanced data have been reported in various fields [45,46,47,48], very few have been reported for the healthcare field.

3.6 Model Performance

To evaluate model performance, the following indicators were computed: area under the receiver operating characteristic curve (ROC-AUC), accuracy, precision, and recall.

$$\mathrm{Accuracy}=\frac{\mathrm{True}\;\mathrm{Positives}+\mathrm{True}\;\mathrm{Negatives}}{\mathrm{True}\;\mathrm{Positives}+\mathrm{False}\;\mathrm{Positives}+\mathrm{False}\;\mathrm{Negatives}+\mathrm{True}\;\mathrm{Negatives}}$$
(1)
$$\mathrm{Precision}=\frac{\mathrm{True}\;\mathrm{Positives}}{\mathrm{True}\;\mathrm{Positives}+\mathrm{False}\;\mathrm{Positives}}$$
(2)
$$\mathrm{Recall}=\frac{\mathrm{True}\;\mathrm{Positives}}{\mathrm{True}\;\mathrm{Positives}+\mathrm{False}\;\mathrm{Negatives}}$$
(3)

ROC-AUC was used as the primary performance indicator. Recall was used as the second indicator because oversight should be avoided rather than overdetermined in building a predictive model for medical safety fields.

In addition, a confusion matrix, which displays model performance via true positives, false positives, false negatives, and true negatives, was evaluated.

3.7 Internal Validation

Internal validation was performed using k-fold stratified cross-validation (CV) for each prediction model. A normal CV divides the data into k parts in sequential order, and there may be very few or no positive cases in a sub-dataset, in imbalanced data. Stratified CV equally divides cases into positive cases and negative cases into all sub-datasets. Hence, stratified CV is used to validate imbalanced data in the healthcare field [49,50,51,52,53]. In this study, the data were divided evenly into five sub-datasets. The arithmetic means of the performance score, obtained from the five sub-datasets, was defined as performance after internal validation. fivefold stratified CV was only applied to training dataset.

3.8 Statistical Analysis

All features had no missing data and no interpolation was done. For each feature in the model, positive case and negative case rates were described. Numerical information such as age, dose, days, pre-dose, and pre-days was tested by the Student’s t test. The chi-square test was used for sex, which is the count information. Differences in the distribution of the cluster of clinical department were compared by Fisher’s exact test. These statistical analyzes were performed using R (version 3.6.3) software.

4 Results

Data related to 82,553 prednisolone prescriptions were applied to the model. Table 2 shows the background information of the subject data. Age and dose were significantly different between the groups, but their median differences were small.

Table 2 Characteristics of the dataset that was used in this machine learning

The prescription patterns of prednisolone tablets differed by the clinical department (Online Resource 3). In gastroenterology, respiratory medicine, and cardiology departments, low-dose and long-term prescriptions were common. In the pediatrics department, moderate-dose and short-term prescriptions were common. In the hematology department, peaks were observed in high-dose and short-term zones because oral prednisolone was used as an anticancer drug treatment for malignant lymphomas.

Cluster analysis classified clinical departments into six clusters with similar prescription patterns (Fig. 2). Seven departments, which had a tendency to prescribe low to moderate doses for longer duration, were grouped in one cluster, while the hematology and gynecology departments were each clustered in one clinical department. There were no positive cases in multiple clusters of clinical departments, and a significant difference was confirmed in the cluster distribution among the groups.

Fig. 2
figure 2

Ward clustering of clinical department with squared Euclidean distance. CCD, cluster of clinical department)

In the training dataset of the original data (without SMOTE), the learning results showed that the ROC-AUC of RF, GB, SVM, and BRF models was higher than that of LR (Table 3). The BRF model, which is a classifier corresponding to imbalanced data, showed the highest performance among the five ML models and had the highest recall. The LR and SVM models showed zero precision and recall and could not detect any positive cases. Due to the highly imbalanced data in this study, accuracy and ROC-AUC were high even when no positive cases were detected. Confirmation of the confusion matrix showed that no true positives existed for the LR and SVM models (Online Resource 4).

Table 3 Performance indicator in 5 machine learning and logistic regression models with original data (without SMOTE)

As a result of validation with the test dataset using the models learned by the original training dataset, only the BRF model showed high ROC-AUC, but the precision was very low (Table 3). The LR and SVM models were unable to classify true positives even on test dataset.

In the over-sampling data by SMOTE, performance improved for both training and test datasets for most algorithms (Table 4). Amplification of positive cases increased the proportion of true positives. The highest performing model was BRF even after resampling. With over-sampling preprocessing, precision of BRF in training dataset has also greatly improved. In the test dataset with SMOTE; there was no improvement of Recall and ROC-AUC in the BRF model, but the precision increased. In the test dataset after resampling by SMOTE, the SVM model showed the highest ROC-AUC equivalent to BRF model.

Table 4 Performance indicator in 5 machine learning and logistic regression models with preprocessing data by SMOTE

5 Discussion

Oral corticosteroids, such as prednisolone, are associated with various adverse events including peptic ulcer, osteoporosis, hyperglycemia, and susceptibility to infections. Corticosteroids are one of the most commonly used drugs for which hospitalization due to adverse drug events is required [54]. Because corticosteroids have symptoms related to overdose and withdrawal, inappropriate dosing due to prescribing errors is a serious concern [55, 56]. Thus, it is of clinical significance to accurately detect any error related to the dose correction of prednisolone tablets. Determining the appropriate dose of oral prednisolone requires complex considerations such as calculation of daily dose (the sum of multiple standards), disease, severity, weight (especially for children), and the relationship with the previously prescribed dose. There are limits to human prescription audits, and objective support from CDS or CPOE system is desired. We developed the best ML model to detect the dose-related modified prescriptions of prednisolone tablets.

LR model without resampling, as a traditional model, could not be judged for true positives at all, and its performance was not clinically applicable. The BRF model showed highest ROC-AUC and highest recall without preprocessing. Chen et al. reported the usefulness of BRF in imbalanced data with a minority class using 2.3 to 9.7% datasets and that under-sampling appears to be superior to over-sampling [44]. The minority class of the dataset in this study was 0.16%, which was extremely imbalanced compared to the previously reported datasets. However, when combined with stratified CVs, the BRF model showed very high performance at a clinically implementable level.

By adding pre-processing of over-sampling by SMOTE, the training performance of each model was improved. While SVM performance has improved significantly, BRF has not seen much performance improvement. It seems that the BRF without resampling had already obtained sufficient performance. Another reason may be that under-sampling by BRF after over-sampling by SMOTE offset the preprocessing effect. The SVM model after SMOTE resampling (SMOTE + SVM) showed the highest ROC-AUC for the test dataset, slightly above SMOTE + BRF. Because SVM classifiers are very sensitive to imbalanced data [57], SMOTE + SVM has been reported to be a good combination [58] and was considered to be the best algorithm candidate in future developmental studies. In this study, default parameters are used for machine learning. Adjusting hyperparameters can further improve performance. Only one type of SMOTE setting, which sets the amplification factor of the minority class to 10% of that of the majority class, is being considered. It has not been verified whether this amplification setting is optimal. Since the data in this study were obtained from a single facility, the applicability of this model in other facilities needs to be further investigated.

In the prediction in the medical safety area, it is important to catch all positive cases. Therefore, in this study, recall was more important than precision. Since in this study, positive cases were defined as “dose-related prescribing correction cases,” various types of positive cases were observed. High-risk prescriptions with prednisolone dose errors that must be detected include the wrong selection of strength (a case in which instead of a dose reduction from 1 tablet of 5 mg [5 mg/day] to 4 tablets of 1 mg [4 mg/day], 4 tablets of 5 mg [20 mg/day] were prescribed), wrong selection of dose unit (a case in which instead of a prescription of 5 mg, 5 tablets were prescribed), and typing error (a case in which instead of 10 mg, 1 mg or 100 mg was prescribed). On the other hand, positive cases in this study also include cases with little risk in which the dose was adjusted after the prescription order, according to the patient’s condition (for example, a correction from 5 to 4 mg after the order). In this study, since the data were collected retrospectively, it was not possible to investigate the reasons for dose correction. Because in the case of an acute exacerbation, prednisolone may suddenly be administered at high doses; it was also difficult to objectively define “high-risk error dose,” even if the dose is many times higher than the previous prescription. In this study, assuming a secondary audit by a pharmacist, we attempted to construct a first screening model to detect any dose modification. Therefore, the precision of the best ML model in this study was expected to be low to some extent.

Because the optimal ML model constructed in this study is assumed to be used in the setting of community pharmacies, the input variables were limited to the following information described in Japanese prescriptions (i.e. the information which could be recognized by the community pharmacists): patient age, gender, clinical department, and previous prescription information. Features such as laboratory data, indications, severity, and weight were not used in this study because they are difficult to obtain at many pharmacies in Japan. The addition of these variables is expected to improve predictive performance.

Furthermore, the input variables of different departments were clustered in this study. It was observed that the prednisolone prescribing pattern was different for each clinic, and clinic information was considered to be an influential factor in erroneous dose detection. However, the name of the department name can vary, depending on the number of beds and features of the hospital. For example, in the university hospitals, medical departments are highly subdivided (e.g. “endocrinology and metabolism,” “nephrology,” “diabetes”), but in regional small or medium hospitals, integrated names are often listed (e.g. “general internal medicine”). In this study, the clustering of clinical departments was performed using prescription patterns of dose and days to construct a general-purpose model. However, since there has been no report on the standardization of the clustering of clinical departments for prednisolone prescription patterns, it is necessary to verify our method as the clustering by the Ward’s method with Euclidean and thresholds for dividing doses and days to create frequency distributions.

There are few reports of ML adaptation to medication error such as prescription audits [59,60,61,62,63,64]. To the best of our knowledge, this is the first report that ML was considered for prescription audits of drugs with a very wide range of clinical doses, such as oral steroids.

The results of this study show that even in the field of clinical drug safety, where the positive cases are few, ML dealing with imbalanced data may be useful for problems that cannot be solved by mathematical models. Prednisolone tablets, a commonly prescribed oral corticosteroid, is a high-risk drug, and its overdose or underdose is a significant risk to patients. Since this study targets prescription audit settings in pharmacies, a prediction model was constructed with an emphasis on recall to prevent under-detection. As a result, although we built a high-performance machine learning model, prescriptions that detect inappropriate doses of prednisolone should be reviewed by pharmacists for the necessary of prescription question to physicians. Accurate warning regarding prescription dosage errors with ML will be beneficial to both healthcare professionals and patients. It is expected that ML will be extensively utilized in clinical drug safety management, including prescription audits, to prevent serious incidents.