Introduction

Although definitions vary, individuals who visit the emergency department (ED) at least three times per year are commonly considered “frequent users”1,2,3. Frequent ED users often display heterogeneous profiles (a combination of mental health disorders, physical comorbidities, and low socioeconomic status1,4,5), leading to complex needs that are not adequately dealt with in an ED6. A significant proportion of frequent ED users have numerous chronic diseases, such as coronary artery disease or chronic obstructive pulmonary disease4,7. Those conditions could be managed in primary care, preventing acute deteriorations that lead to ED use8. Since frequent ED use for complex needs may occur because those needs have not been adequately addressed in a primary care context, this type of ED use is considered suboptimal. As an indicator of unmet needs, it is associated with negative outcomes for patients (e.g., higher hospital admission or mortality rates9,10). Furthermore, ED costs are generally higher than those in a primary care setting, resulting in a socioeconomic burden for the health system7,11,12. In the province of Quebec (Canada), frequent ED users with chronic diseases represent 9.2% of all ED users but account for 28.8% of all ED visits13. Moreover, a recent Canadian census shows that the prevalence of chronic conditions will increase as the population ages; the burden on the healthcare system is therefore likely to increase14.

Targeted interventions such as case management have been shown to help reduce ED visits and ED costs, while improving patient satisfaction and clinical outcomes2,15,16. In this context, accurately predicting frequent ED use helps target the users most likely to benefit from such interventions. Much work has been done with statistical models in this direction. In particular, logistic regression (LR) is a standard and widely used statistical model17. However, with the constant improvement in the quantity and quality of measurements (electronic health records), in statistical models, and in computing capacity, modern machine learning (ML) models are becoming increasingly popular. Previous studies have successfully predicted frequent ED use for a specific issue18,19 or in a local hospital20 using ML models other than LR. Yet no study has compared the predictive power of ML models in a general population while considering chronic diseases.

This study aims to compare the performance of logistic regression with that of four ML models for predicting frequent emergency department use in an adult population with chronic diseases in the province of Quebec (Canada).

Methods

All methods in this study were carried out in accordance with the TRIPOD guidelines for model development and validation (see the Supplementary material Table S1)21.

Study design and data sources

This is a population-based retrospective cohort study. We used medico-administrative databases from the health insurance board of the province of Quebec (Régie de l’assurance maladie du Québec), which manages the health insurance plan for Quebec citizens. The following files were used:

  1. The patient demographic register, which contains information about the sex, date of birth, date of death (if applicable), and place of residence of the patient;

  2. The physician reimbursement claim register, which contains information about medical services provided by a fee-for-service physician in Quebec: date of service, place of service (emergency, medical clinic, etc.), physician specialty, diagnosis (International Classification of Diseases, ninth revision or ICD-9), and the medical act procedure performed by the physician;

  3. The hospital register, which contains information about the reasons for hospitalization (main diagnosis and up to 25 secondary diagnoses coded in ICD-10), dates of admission and release from hospital, and all medical procedures performed during the hospitalization.

Selection of participants

The study population included all adults (18 years and older) living in the province of Quebec who had at least one ED visit during the inclusion period (i.e., between the 1st of January 2012 and the 31st of December 2013), were diagnosed with at least one chronic condition, and did not have dementia. Patients with dementia may have special needs compared to cognitively intact patients and were thus not included. In this study, the diseases considered were those from the Canadian Institute for Health Information (see the Supplementary material Table S2): asthma, chronic obstructive pulmonary disease (COPD), congestive heart failure (CHF), coronary artery disease (CAD), diabetes, epilepsy, and high blood pressure (HBP)22. Those specific conditions, also known as ambulatory care sensitive conditions, are a set of chronic diseases for which timely intervention in primary care could reduce the risk of hospitalization or the occurrence of acute episodes for those diseases23,24,25. The index date was defined as one ED visit selected at random among all ED visits occurring during the inclusion period26. The index date was then used as a “starting point” for measuring patient characteristics, such as ED use, age, or diagnoses.
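The random assignment of the index date can be illustrated with a minimal sketch. The paper's analyses were run in SAS and R; the Python below is only an illustration, and the function name `assign_index_date`, the uniform draw, and the example visit dates are assumptions that do not come from the study.

```python
import random
from datetime import date

# Inclusion period stated in the text.
INCLUSION_START = date(2012, 1, 1)
INCLUSION_END = date(2013, 12, 31)


def assign_index_date(ed_visits, rng):
    """Pick one ED visit at random among those in the inclusion period.

    Assumption: every eligible visit is equally likely to be chosen.
    Returns None when the patient has no visit in the inclusion period.
    """
    eligible = sorted(d for d in ed_visits if INCLUSION_START <= d <= INCLUSION_END)
    if not eligible:
        return None
    return rng.choice(eligible)


# Hypothetical visit history for one patient.
visits = [date(2011, 11, 3), date(2012, 5, 14), date(2013, 2, 1), date(2014, 1, 9)]
index_date = assign_index_date(visits, random.Random(42))
print(index_date)
```

Only the two visits inside the inclusion period can be selected here; visits before 2012 or after 2013 are ignored when drawing the index date.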

There were two exclusion criteria (Fig. 1). First, patients living in remote areas were excluded (6.8%). Remote areas were defined as municipalities with fewer than 10,000 inhabitants and with a weak or no metropolitan influence zone (the percentage of the resident employed labour force who commute to work in urban areas is less than 5%). This exclusion ensured that remote residents, who tend to use the ED as an alternative to walk-in clinics (as there are fewer primary care alternatives27,28), were not included. However, patients living in municipalities with fewer than 10,000 inhabitants and with high or moderate metropolitan influence were included. Secondly, patients who died during the year after their index date (8.3%) were excluded, as patients such as those at the end of life can require specialized healthcare29,30. In addition, that exclusion helped reduce immortal time bias31.

Figure 1. Flowchart of the study cohorts.

Outcome and independent variables

Frequent ED use was investigated using two different definitions: (1) having at least three visits (“frequent users 3”) and (2) having at least five visits (“frequent users 5”) during the year following the index date (as mentioned in the previous subsection, the index date is an assigned ED visit between 2012 and 2013). Those definitions were chosen amongst the most common ones in order to compare performance in two populations that were different, yet still considered frequent users.
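The two outcome definitions can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the follow-up year is approximated as 365 days, and the index visit itself is assumed not to count towards the total, which the text does not specify.

```python
from datetime import date, timedelta


def count_followup_visits(ed_visits, index_date, days=365):
    """Count ED visits strictly after the index date and within `days` days of it.

    Assumptions: the index visit is excluded and one year is taken as 365 days.
    """
    end = index_date + timedelta(days=days)
    return sum(1 for d in ed_visits if index_date < d <= end)


def frequent_user_labels(ed_visits, index_date):
    """Return the two binary outcomes used in the text (thresholds 3 and 5)."""
    n = count_followup_visits(ed_visits, index_date)
    return {"frequent_users_3": n >= 3, "frequent_users_5": n >= 5}


# Hypothetical patient: four visits fall in the year following the index date.
index_date = date(2012, 5, 14)
visits = [
    date(2012, 5, 14),  # index visit
    date(2012, 6, 1),
    date(2012, 9, 10),
    date(2013, 1, 20),
    date(2013, 3, 2),
    date(2013, 8, 30),  # outside the follow-up year
]
labels = frequent_user_labels(visits, index_date)
print(labels)
```

With four follow-up visits, this hypothetical patient is a frequent user under the three-visit definition but not under the five-visit definition.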

Independent variables (or predictors) considered at the index date were sex, age, residential area (metropolitan: ≥ 100,000; small town: 10,000–100,000; rural: < 10,000 with high or moderate metropolitan influence), material and social deprivation indices32, public prescription drug insurance plan status (PPDIP, see below for the different statuses), having been hospitalized in the two years before the index date, the number of previous ED visits during the year before the index date (PV), and the combined comorbidity index of Charlson (CCI33). The following diagnoses were considered: chronic disease (one diagnosis for each condition, i.e. asthma, COPD, CHF, CAD, diabetes, epilepsy, and HBP), chronic non-cancer pain (CNCP)34, injury, common mental disorders (CMD)35, serious mental disorders (SMD)35, alcohol abuse, and drug abuse. Each condition was identified using the reported diagnoses in the hospital register (one diagnosis) or in the physician reimbursement claim register (at least two diagnoses), during a two-year period before the index date.
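The identification rule for diagnoses (one hospital diagnosis, or at least two physician claim diagnoses, in the two years before the index date) can be written as a short sketch. The function name, the 730-day approximation of the two-year window, and the choice to exclude the index date itself from the window are assumptions made for illustration.

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 730  # assumption: the "two-year period" approximated as 730 days


def has_condition(hospital_dx_dates, physician_dx_dates, index_date):
    """Apply the rule from the text for a single condition.

    A condition is present when the look-back window contains at least one
    hospital diagnosis or at least two physician claim diagnoses.
    Assumption: the window is [index_date - 730 days, index_date).
    """
    start = index_date - timedelta(days=LOOKBACK_DAYS)

    def in_window(d):
        return start <= d < index_date

    n_hospital = sum(1 for d in hospital_dx_dates if in_window(d))
    n_physician = sum(1 for d in physician_dx_dates if in_window(d))
    return n_hospital >= 1 or n_physician >= 2


# Hypothetical diagnosis dates for one condition (e.g., COPD).
index_date = date(2013, 3, 10)
two_claims = has_condition([], [date(2012, 1, 15), date(2012, 11, 2)], index_date)
one_claim = has_condition([], [date(2012, 1, 15), date(2010, 6, 1)], index_date)
one_hospital = has_condition([date(2012, 7, 4)], [], index_date)
print(two_claims, one_claim, one_hospital)
```

A single physician claim is not enough under this rule, whereas a single hospital diagnosis is.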

Regarding PPDIP status, the Quebec province has four different statuses: “regular recipient of PPDIP”, “admissible to PPDIP and age ≥ 65 years with guaranteed income supplement” (GIS), “not admissible to PPDIP” (individuals with a private insurance plan), or “admissible to PPDIP and being a recipient of last-resort financial assistance” (LRFA)36.

Fewer than 5% of observations had missing values, mainly for the material and social deprivation indices, and those observations were kept.

Statistical analysis

Frequent ED use prediction is a case of supervised learning, meaning that there are explicit labelled classes (i.e., frequent user or not). Along with logistic regression (LR), four ML predictive models amongst the most efficient for predicting a binary outcome38 were evaluated:

  1. Gradient boosting machines (GBM) build an ensemble of successive decision trees; each tree is a weak learner that improves on the previous one using the residuals39. Tuning parameters were the learning rate and the tree depth.

  2. The naïve Bayes (NB) model is based on Bayes’ theorem and uses a priori probabilities40. The tuning parameter was the Laplace smoothing for probabilities.

  3. Neural networks (NN) feed data through interconnected hidden layers of “neurons”, which apply mathematical operations to the inputs (the independent variables)41. Tuning parameters were the number of neurons and the weight decay.

  4. Random forests (RF) aggregate many decision trees; each tree applies sequential splits to the data such that the separation is maximized with regard to a homogeneity criterion (e.g., the Gini index), resulting in a tree-like structure40. RF were evaluated with a binary (RF1) and a continuous outcome (RF2). Tuning parameters were the number of trees and the homogeneity criterion used.
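To make the role of a tuning parameter concrete, the sketch below implements a categorical naïve Bayes classifier with Laplace smoothing in plain Python. It is a teaching illustration only: the study used the R package e1071, and the function names, the toy predictors, and the smoothing value `alpha=1.0` are assumptions rather than the authors' settings.

```python
import math
from collections import Counter, defaultdict


def fit_naive_bayes(rows, labels, alpha=1.0):
    """Fit a categorical naive Bayes model with Laplace smoothing `alpha`."""
    n_features = len(rows[0])
    class_counts = Counter(labels)
    # levels[j] holds the categories observed for feature j.
    levels = [set(row[j] for row in rows) for j in range(n_features)]
    # counts[(c, j)][v] is the number of class-c rows where feature j equals v.
    counts = defaultdict(Counter)
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            counts[(c, j)][v] += 1
    return {"alpha": alpha, "class_counts": class_counts,
            "levels": levels, "counts": counts, "n": len(labels)}


def predict_proba(model, row):
    """Posterior probability of each class for one row of categorical values."""
    log_post = {}
    for c, n_c in model["class_counts"].items():
        lp = math.log(n_c / model["n"])  # a priori class probability
        for j, v in enumerate(row):
            k = len(model["levels"][j])
            # Laplace smoothing keeps unseen categories from giving probability 0.
            smoothed = (model["counts"][(c, j)][v] + model["alpha"]) / (n_c + model["alpha"] * k)
            lp += math.log(smoothed)
        log_post[c] = lp
    top = max(log_post.values())
    total = sum(math.exp(lp - top) for lp in log_post.values())
    return {c: math.exp(lp - top) / total for c, lp in log_post.items()}


# Toy data (hypothetical): previous ED visits band and age band; 1 = frequent user.
rows = [("3+", "65+"), ("3+", "18-64"), ("0-2", "65+"),
        ("0-2", "18-64"), ("0-2", "18-64"), ("3+", "65+")]
labels = [1, 1, 0, 0, 0, 1]
model = fit_naive_bayes(rows, labels, alpha=1.0)
posterior = predict_proba(model, ("3+", "65+"))
print(posterior)
```

Without smoothing, the category "3+" would have zero probability among non-frequent users in this toy sample; the smoothing term replaces that zero with a small positive value, which is the behaviour tuned in the NB model.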

The cohort was randomly divided into a training set (80% of the cohort) for building models and a testing set (remaining 20%) for evaluating performance18,42. This procedure is commonly used to guard against overfitting, a sensitive issue when dealing with ML algorithms43. Area under the ROC curve (AUC), sensitivity (SEN), specificity (SPE), positive predictive value (PPV), and negative predictive value (NPV) were computed to compare performances. AUC 95% confidence intervals were also computed using DeLong’s method44. Following Grinspan et al.18, the predictive ability of a model was judged on its AUC, based on five categories: poor (0.50–0.59), fair (0.60–0.69), good (0.70–0.79), very good (0.80–0.89), and excellent (0.90–1.0). The best cut-off thresholds were selected using Youden’s statistic45 in order to compute sensitivity, specificity, positive predictive value, and negative predictive value. All tuning parameters were optimized by searching for the maximum AUC, but only the results with the selected parameters are presented here for clarity and brevity.
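The performance measures described above can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the function names and the example scores are hypothetical, the AUC is computed with a simple pairwise comparison that suits only small samples, and the half-open category boundaries and the "below chance" label are assumptions. DeLong confidence intervals are not covered here.

```python
def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties count one half; this equals the normalized Mann-Whitney U statistic.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def confusion_metrics(scores, labels, threshold):
    """SEN, SPE, PPV and NPV when predicting positive for score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    tn = sum(1 for s, y in pairs if s < threshold and y == 0)

    def ratio(a, b):
        return a / b if b else float("nan")

    return {"SEN": ratio(tp, tp + fn), "SPE": ratio(tn, tn + fp),
            "PPV": ratio(tp, tp + fp), "NPV": ratio(tn, tn + fn)}


def youden_threshold(scores, labels):
    """Observed score that maximizes Youden's J = SEN + SPE - 1."""
    def j(t):
        m = confusion_metrics(scores, labels, t)
        return m["SEN"] + m["SPE"] - 1
    return max(sorted(set(scores)), key=j)


def auc_category(value):
    """Map an AUC to the five categories quoted in the text (half-open bins assumed)."""
    if value < 0.50:
        return "below chance"  # not a category in the text; added for completeness
    for lower, name in [(0.90, "excellent"), (0.80, "very good"),
                        (0.70, "good"), (0.60, "fair"), (0.50, "poor")]:
        if value >= lower:
            return name


# Hypothetical predicted probabilities on a tiny test set (1 = frequent user).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
a = auc(scores, labels)
t = youden_threshold(scores, labels)
metrics = confusion_metrics(scores, labels, t)
print(a, auc_category(a), t, metrics)
```

In this toy example, the threshold chosen by Youden's statistic captures every frequent user at the cost of one false positive, which is the kind of trade-off the cut-off selection is meant to balance.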

Results from ML models (except LR) are not as directly interpretable as those from regression models, which straightforwardly assess the effect of predictive variables on the outcome with quantities such as odds ratios. However, the ML framework allows for the evaluation of variable importance in a prediction model (also called feature importance). It was computed as the mean decrease in the Gini index in the case of GBM and RF, as a combination of the absolute values of the weights for NN, and as the absolute value of the t-statistic for LR43. While it is not possible to compare variable importance directly from one model to another because the models differ in nature, variable importance is still useful as an interpretable and relative measure of the contribution of each predictor. In our models, all the variables are categorical; GBM, LR, and NN compute variable importance relative to a baseline category, while RF computes an overall variable importance. Of note, there is no available variable importance measure when using the NB algorithm.

Sensitivity analyses were conducted on a population of frequent users with at least four visits and with a 50/50 split between the training and testing sets.

Statistical significance level was set at α = 0.05 and differences in descriptive statistics were evaluated using chi-square tests. All analyses were performed with statistical software programs SAS (version 9.4) and R (version 4.2 with packages e1071, nnet, ranger, and xgboost).

Ethics approval and consent to participate

The research ethics board of the Centre intégré universitaire de santé et de services sociaux de l’Estrie – Centre hospitalier universitaire de Sherbrooke (number MP-31–2017-1571 – Urgences-CPSA) approved this study. The need for informed consent was waived by the aforementioned research ethics board due to the retrospective nature of the study.

Results

Characteristics of participants

Out of 451,775 ED users, 43,151 (9.5%) and 13,676 (3.0%) were frequent users 3 and frequent users 5, respectively (Table 1). For both definitions, differences between frequent users and non-frequent users were statistically significant except for the residential area variable.

Table 1 Descriptive statistics for the different populations.

Main results

Multiple combinations of explicative variables were evaluated. The following variables were selected for their clinical interpretation and explicative power: age, public prescription drug insurance plan status, Charlson comorbidity index, number of previous ED visits, chronic obstructive pulmonary disease, injury, serious mental disorders, common mental disorders, chronic non-cancer pain, alcohol, and drugs. No missing values were observed in the variables selected for prediction.

Model performances are shown in Tables 2 and 3 for frequent users 3 and 5, respectively. In both cases, RF1 (binary outcome) had poor performance regarding AUC and SEN, followed by NB (poor or fair). On the other hand, RF1 had the highest SPE and PPV. GBM, LR, NN, and RF2 had similarly good performance (or very good in the case of GBM, LR, and RF2 for frequent users 5). Performance improved as the threshold for frequent use was increased from three to five visits, except for RF1. Overall, SPE (NPV) was higher than SEN (PPV).

Table 2 Model performances for frequent users 3.
Table 3 Model performances for frequent users 5.

Variable importance results are shown in Tables 4 and 5. Those measures are relative, meaning that importance can only be compared between variables within the same model (e.g., variable importance values from LR and GBM are not comparable). However, the ranking of independent variables in each model can still be compared across models, along with the relative magnitude. All models reported the number of previous ED visits as the most important variable for prediction. The magnitude by which it was superior to the other variables varied considerably. CCI and PPDIP were also important, but to a lesser extent (for instance, their importance was respectively 6 and 12 times less than PV for RF2 in the case of frequent users 5). Among chronic diseases, COPD was the most important.

Table 4 Variable importance for the predictive models (frequent users 5).
Table 5 Variable importance for the predictive models (frequent users 3).

No significant changes were observed in the interpretation of results during sensitivity analyses.

Limitations

Both quantity and quality of data are essential in an ML context. In this study, we had access to an exhaustive medico-administrative database which included hospital and physician data, but it did not include patient-reported outcomes (e.g., perceived health, included in the Canadian Community Health Survey46). Such data could improve the predictive power of models in future work. For instance, studies using national health surveys and telephone interviews found that fair or poor health status and dissatisfaction with treatment outcome were significantly associated with frequent ED use47,48.

Our study focused on frequent ED users with chronic diseases; though results should only be generalized to this population, chronic diseases are common in the frequent ED user population. Better understanding of a population of ED users with chronic diseases is relevant for other healthcare aspects as chronic diseases are linked not only to frequent ED use, but also to hospitalisations, functioning, and deaths24,49,50.

Discussion

This paper aimed to compare four ML prediction models (gradient boosting machine, naïve Bayes, neural networks, and random forests) with logistic regression for predicting frequent ED use in a population with chronic diseases. Those ML models have been successfully used to predict related outcomes, such as ED revisits, in-hospital mortality, or hospital admissions at ED triage42,51,52. Accurate ML models may help with early identification of frequent ED use, thus improving targeted interventions such as case management2,15,16. To this end, case-finding tools are appropriate, such as CONECT-6, which was derived from LR models53.

Model performance

In our study, no model clearly outperformed the others. Other studies on frequent ED use that applied ML reached a similar conclusion18,20,54, though they either focused on a specific chronic disease such as asthma or epilepsy or used hospital-only data. In fact, a recent systematic review comparing the performance of LR with ML models (including the ones used in this study) for clinical prediction of a binary outcome showed that there is currently no clear performance benefit of ML models55. However, this review included only studies that used clinical data. Other studies that focus on ED-related issues (e.g., risk of emergency hospital admission, risk for sepsis, heart failure readmission) found improved predictions with ML56,57, although this is not a general rule58. Among other factors, the number of variables (58 to 121 variables56) or the presence of very discriminative variables57 explained those improved predictions. In our models, increasing the threshold for frequent ED use (thus reducing the number of frequent ED users) gave slightly better performance for all models except RF1. A higher threshold increased the homogeneity of the characteristics of frequent users, thus facilitating prediction of their ED use, a result that has already been observed54. Other studies also compared LR to ML models using administrative claims data59,60 and found similar performance, though they did not focus on ED-related outcomes.

In medical studies, the signal-to-noise ratio (i.e., the amount of information contained in the database that is useful for the prediction61) is often low, which may partly explain the modest improvements (if any) of ML models. The type of available variables may also affect performance. For instance, in a study about uncontrolled diabetes prediction, LR was outperformed by NN or GBM62. The authors used data from administrative claims and from the US census, in which they had access to social determinants, such as food insecurity or recreational park access. It is possible to tune ML models more precisely to overcome those limitations. In our study, this fine-tuning would have amounted to evaluating model parameters over broader spaces. As an example, NN is known for its ability to model complex and nonlinear relationships by combining multiple hidden layers41, an ability that is limited in traditional LR. Broader ranges of tuning parameters could also be evaluated: GBM has shown good performance and helped refine clinical tools when allowed to learn slowly57. However, this fine-tuning comes with a high computational cost and added complexity. The added complexity may result in overfitting and limited generalization.

Our models had higher sensitivity than positive predictive value, apart from RF1. This means that most models captured a fair portion of frequent ED users, but the number of false positives was substantial. This contrasts with another study on frequent ED use among children with asthma20. The authors also applied ML models (LR and RF amongst others) and found higher positive predictive values (48–70) than sensitivities (16–27). However, their threshold choice was guided by a maximization of the AUC rather than by a statistical criterion. Besides, specificity and negative predictive values were high in our study. This is a known issue when dealing with imbalanced health datasets (frequent users 3 and 5 represented 9.5% and 3.0% of the cohort respectively)63,64. Algorithms learn mostly from the majority class, which introduces a bias towards non-frequent users. There are possible adjustments such as under-sampling from the majority class or over-sampling from the minority class, but those may not be recommended procedures as they distort prevalence55. Learning from highly imbalanced datasets is an active research area65 and may improve prediction of frequent ED use in the future, especially if combined with multiple different models66.
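The link between class imbalance and a low positive predictive value follows from Bayes' theorem and can be checked with a short sketch. The sensitivity and specificity below (0.70 and 0.80) are hypothetical values chosen for illustration, not results from this study; only the prevalences of 9.5% and 3.0% come from the text, and the 50% prevalence stands for an artificially balanced sample.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def npv(sensitivity, specificity, prevalence):
    """Negative predictive value from Bayes' theorem."""
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)


# Hypothetical classifier characteristics (assumptions for illustration).
SEN, SPE = 0.70, 0.80

# Cohort prevalences from the text, plus a balanced sample after resampling.
for label, prevalence in [("frequent users 3", 0.095),
                          ("frequent users 5", 0.030),
                          ("balanced sample", 0.500)]:
    print(label, round(ppv(SEN, SPE, prevalence), 3), round(npv(SEN, SPE, prevalence), 3))
```

For a fixed sensitivity and specificity, the positive predictive value falls sharply as prevalence decreases while the negative predictive value rises, so predictive values estimated on a resampled, balanced dataset would not carry over to the original cohort.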

Variable importance

The models developed in our study are also interesting from a clinical point of view (i.e., risk stratification by variable for frequent ED use). In our study, CCI, PPDIP, and the previous number of ED visits were important, though the latter was the most important variable by a large margin. This result is supported by other studies on frequent ED use conducted with LR26,67,68, but also with ML models18,20. In fact, this variable is usually so important that Brennan et al.68 stated that “targeting patients with the most extreme number of ED visits may be the best and most practical option for targeted interventions”, thus allowing for optimal resource allocation. Hudon et al. (2020) also found that an LR model including this variable and previous hospitalization performed almost as well as models with more variables such as comorbidities, sociodemographic status, and public prescription drug insurance status26. Even when predicting other ED-related outcomes, the previous number of visits is relevant: Rahimian et al.56 predicted emergency hospital admission (after an ED visit) using RF and GBM and found that it was the most important variable. They also found that other variables are excellent predictors of emergency hospital admission, such as laboratory test results (e.g., cholesterol ratio, haemoglobin, platelets).

Conclusions

Frequent ED use is a major issue in primary and emergency care, and ML models are becoming increasingly popular in medicine and healthcare in general. They are rapidly evolving and offer new opportunities, but although there has been substantial theoretical progress with ML models, the small improvements they bring do not show a clear superiority over simpler models, which still display reasonable performance55,69. In our study, LR was as successful in predicting frequent ED use as the ML models, and the number of previous ED visits was the most important variable. Access to other variables, such as patient-reported outcomes or clinical notes, may be more helpful for refining the prediction of frequent ED use. Those types of data have been successfully used with machine learning models in a primary care context, although not for ED use prediction70,71. Future work also includes considering complex non-linear interactions, where ML models outperform traditional ones72.