1 Introduction

The management of multimorbidity or multiple long-term conditions (MLTC), defined as having two or more chronic diseases simultaneously [1, 2], is challenging for both patients and providers because of interacting care plans, interactions between diseases, and the involvement of multiple healthcare professionals [3,4,5,6]. These complex interactions can lead to potentially unnecessary healthcare utilization and potentially preventable adverse health outcomes, such as emergency department (ED) visits or acute hospitalizations [7, 8]. The predominantly monodisciplinary organization of care adds to this complexity, resulting in higher mortality, greater treatment burden, and lower quality of life for patients with MLTC [3, 8]. Consequently, increases in healthcare utilization [8] and healthcare costs [9, 10] put pressure on current healthcare systems [1].

An integrated care approach has been recommended for patients with MLTC to prevent adverse outcomes and potentially preventable healthcare utilization [11]. Integrated care refers to “initiatives seeking to improve outcomes of care by overcoming issues of fragmentation through linkage or coordination of services of providers along the continuum of care” [12]. However, to be effective and reduce healthcare utilization, such approaches should target patients who experience fragmentation and are most in need of integrated care [13]. Different determinants related to the patient, the involved healthcare professionals, and the healthcare system contribute to this increased need for integrated care. These determinants include physical functioning [14], patient activation [15], communication amongst healthcare providers [16], and healthcare utilization [8]. Healthcare professionals do not have the time or resources to create an overview of these determinants for each patient [17]. In contrast, information technology tools based on routinely collected electronic health record (EHR) data can partially provide insights into these determinants and the concurrent risk of high healthcare utilization [17].

Since the delivery of an integrated care approach is neither feasible nor warranted for all patients with MLTC, support tools that improve the identification of patients who would benefit most could prove valuable [18]. Predicting a high risk of potentially preventable healthcare utilization based on EHR data could be a valuable way to identify a preselection of patients needing integrated care [18]. Such a data-driven tool based on readily available EHR data can complement more subjective identification approaches by providing insights into the expected potentially preventable healthcare utilization [19]. Previous research developing prediction models with primary care data and data from one general hospital in the Netherlands showed that patients with high healthcare utilization could be reliably identified [18, 20]. However, further development with EHR data from other hospitals, including more specialized medical centers, is needed to substantiate the added value of such prediction models.

Academic medical centers (AMCs) provide highly specialized care for numerous complex patients with MLTC [21, 22]. Nonetheless, research on predicting high healthcare utilization in patients with MLTC within AMCs is limited [18, 23, 24]. Because of inadequate geographic access resulting from longer travel times [25], higher patient complexity [22], higher costs of care [21], and greater fragmentation as a consequence of hyperspecialization [26], support in identifying patients in need of integrated care in AMCs is needed.

Prediction models could contribute to a better identification of patients with MLTC who are most in need of integrated care by providing insights into the expected future risk of potentially preventable healthcare utilization. Therefore, this study aims to develop and internally validate prediction models for future outpatient visits, ED visits, and acute hospitalizations with machine learning (ML) in patients with MLTC, based on EHR data from an AMC in the Netherlands.

2 Materials and methods

2.1 Data

Data were derived from the EHR of an academic hospital in the Netherlands, the University Medical Center Groningen (UMCG). Since 2006, the Netherlands has had a regulated, competitive universal health insurance system in which private health insurance is mandatory for all citizens [27]. General practitioners have an important gatekeeper function, and referrals are needed for hospital and specialist care [27]. Electronic records are mostly not centrally stored, nationally standardized, or interoperable between care domains [27].

Data for this study were collected based on Diagnosis-Treatment Combinations (DTCs), which are used in the Netherlands to code hospital care and claim payments. DTCs include information on all care activities performed per patient within one hospital, including the associated diagnoses, specialties, and types of care activity [28]. The registered diagnoses within a DTC include an International Classification of Diseases and Related Health Problems 10 (ICD-10) code and can be linked to the Dutch Hospital Data – Clinical Classification Software (DHD-CCS) [29]. Dutch Hospital Data (DHD) organizes national registration, facilitates research, and connects hospitals in the field of data to optimize medical-specialist care [30]. In the DHD-CCS, the ICD-10 codes are clustered into relevant subgroups, which DHD has further classified into five main categories: chronic, oncologic, acute, elective, and other diagnoses (supplementary table S1). Because the ICD-10 provides more detailed diagnosis groups than needed compared to the DHD-CCS, the 152 diagnosis groups defined by the DHD-CCS classification were used for this research.
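To illustrate how such a linkage can be implemented, the following R sketch joins registered ICD-10 diagnoses to DHD-CCS diagnosis groups via a lookup table; it is not the study's actual pipeline, and the file and column names (dtc_diagnoses.csv, dhd_ccs_lookup.csv, icd10_code, ccs_group, main_category) are hypothetical.

```r
# Illustrative sketch only (not the authors' code): link ICD-10 codes
# registered in DTCs to DHD-CCS diagnosis groups via an assumed lookup table.
library(dplyr)
library(readr)

diagnoses <- read_csv("dtc_diagnoses.csv")    # assumed columns: patient_id, icd10_code
dhd_ccs   <- read_csv("dhd_ccs_lookup.csv")   # assumed columns: icd10_code, ccs_group, main_category

diagnoses_grouped <- diagnoses %>%
  left_join(dhd_ccs, by = "icd10_code") %>%
  select(patient_id, ccs_group, main_category) %>%  # keep the 152 DHD-CCS groups
  distinct()                                        # and their main category
```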

For all adult patients who visited the UMCG in 2017 and 2018, EHR data of all outpatient visits, ED visits, and hospitalizations were collected. For this study, we used demographic data (sex, date of birth, four-digit zip code (PC4)) and healthcare utilization data (type of care activity, linked diagnoses, and involved specialty).

The Central Ethics Review Board of the UMCG approved the pseudonymous use of the data for research purposes and a waiver of informed consent (#20200861, amendment approval number: #107275). Patient data were pseudonymized and patients who objected to the use of their data were excluded from this study prior to data collection. We followed the reporting guidelines from the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) statement [31].

2.2 Study design and sample

We performed a retrospective cohort study and aimed to include patients with a potentially higher complexity of MLTC and higher healthcare use [32]. Based on registered outpatient visit diagnoses in 2017, MLTC was therefore defined as having two or more chronic and/or oncological conditions (based on DHD-CCS codes) and having two or more outpatient visits registered by two or more different specialties in one year. Included patients had to have at least two registered outpatient visits in the year of inclusion, because outpatient care is an important driver of preventable high healthcare utilization [33]. Chronic and oncological conditions were defined based on the DHD-CCS classification (supplementary table S1). Oncological conditions were included because they are usually considered in MLTC research but are classified within the DHD-CCS in a separate category next to chronic conditions [34,35,36]. For our study, we included all patients with MLTC in 2017 who were aged 18 years or older on 1 January 2017.
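A minimal R sketch of this selection is given below, assuming a long-format table of 2017 outpatient visits (visits_2017) with one row per visit and diagnosis; all object and variable names are illustrative assumptions, not the study's actual code.

```r
# Sketch of the cohort selection under the stated MLTC definition; the input
# table and columns (visits_2017, ccs_group, main_category, visit_id,
# specialty, age_jan2017) are assumptions for illustration.
library(dplyr)

mltc_cohort <- visits_2017 %>%
  group_by(patient_id) %>%
  summarise(
    n_chronic_onco = n_distinct(ccs_group[main_category %in% c("chronic", "oncologic")]),
    n_outpatient   = n_distinct(visit_id),
    n_specialties  = n_distinct(specialty),
    age            = first(age_jan2017),
    .groups = "drop"
  ) %>%
  filter(n_chronic_onco >= 2,   # two or more chronic and/or oncological conditions
         n_outpatient   >= 2,   # two or more outpatient visits
         n_specialties  >= 2,   # involving two or more specialties
         age            >= 18)  # adults only
```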

2.3 Outcomes

We aimed to predict high healthcare utilization in 2018 with demographic and healthcare utilization data from 2017. Based on previous research that developed prediction models for patients with MLTC in a Dutch general hospital [18], we included three types of high healthcare utilization outcomes in 2018 with the following cut-off values:

  • ≥ 12 outpatient visits,

  • ≥ 2 ED visits,

  • ≥ 1 acute hospitalization(s).

Acute hospitalization(s) were defined based on whether the registered diagnosis for a hospitalization was labeled as acute according to the DHD-CCS classification (supplementary table S1). For patients who visited the hospital in 2017 but did not visit it in 2018, we labeled their outcomes for 2018 as zero (i.e., not having one of the outcomes).
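The outcome labeling can be sketched as follows, assuming a per-patient table of 2018 utilization counts (utilization_2018); patients without any 2018 record receive zero counts and are therefore labeled as not having the outcomes. Variable names are hypothetical.

```r
# Sketch of the three binary 2018 outcomes; utilization_2018 and its columns
# are assumed for illustration.
library(dplyr)
library(tidyr)

outcomes_2018 <- mltc_cohort %>%
  left_join(utilization_2018, by = "patient_id") %>%
  replace_na(list(outpatient_visits = 0, ed_visits = 0, acute_hospitalizations = 0)) %>%
  mutate(
    high_outpatient = as.integer(outpatient_visits     >= 12),  # >= 12 outpatient visits
    high_ed         = as.integer(ed_visits              >= 2),   # >= 2 ED visits
    acute_hosp      = as.integer(acute_hospitalizations >= 1)    # >= 1 acute hospitalization
  )
```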

2.4 Predictors

Demographic and healthcare (utilization) characteristics of 2017 were used as predictors. Demographics included age and sex of the patient; age was determined on 1 January 2017. Selected healthcare (utilization) characteristics were the registered DHD-CCS diagnoses and the involved medical specialties at outpatient visits in 2017, which were included as separate binary variables for each patient. Other included characteristics were the number of registered diagnoses based on the DHD-CCS classification and the number of involved medical specialties at outpatient visits. Further, the number of outpatient visits, ED visits, hospitalization days, and hospitalization days with an acute diagnosis were used. We also used the PC4 per patient and calculated the distances in kilometers (km) to the UMCG and to the closest hospital in the patient's region. Details on how these distances were calculated are provided in supplementary material S1.1. All predictors can be retrieved automatically from the patient's EHR. The predictors included in the baseline settings are provided in supplementary tables S1 and S2.
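As an illustration of the distance predictor, the sketch below computes the haversine distance from a table of PC4 centroids to the UMCG; the centroid table, its column names, and the approximate UMCG coordinates are assumptions and are not taken from supplementary material S1.1.

```r
# Sketch of a travel-distance predictor; pc4_centroids (longitude/latitude per
# PC4) and the UMCG coordinates are illustrative assumptions.
library(geosphere)

umcg_coord <- c(6.57, 53.22)   # approximate lon/lat of the UMCG (assumption)

pc4_centroids$dist_umcg_km <- distHaversine(
  p1 = as.matrix(pc4_centroids[, c("lon", "lat")]),  # one row per PC4 centroid
  p2 = umcg_coord
) / 1000                                             # metres to kilometres
```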

2.5 Machine learning development and evaluation

To balance the interpretability, predictive accuracy, robustness to imbalanced datasets, and handling of non-linearity, we used four ML prediction models for each outcome, namely Elastic Net Regression (ENR) [37], eXtreme Gradient Boosting (XGB) [38], Logistic Regression (LR) [39], and Random Forest (RF) [40]. A brief description of all four algorithms, including the rationale for their usage, is provided in supplementary material S1.2.

Data were randomly split into a stratified cross-validation (CV) (80%) and independent test (IT) (20%) set. To prevent overfitting, the IT set was kept separate from the CV set during model development and was only used for final model evaluation. A 10-fold CV was performed to assess the robustness of the results and provide confidence intervals (CI). Our baseline settings for all models included the 152 DHD-CCS diagnosis groups, 1000 trees for the RF and XGB models, and default hyperparameter ranges (supplementary material S1.2). To evaluate the performance of each model, we calculated the area under the curve (AUC) of the receiver operating characteristic (ROC) and its 95% CI based on the mean AUC of the CV folds and the standard errors. Hyperparameter values for ENR, XGB, and RF were tuned using a grid search of size 100, in which 100 hyperparameter combinations were sampled from prespecified ranges to determine the best hyperparameters for each model. The hyperparameter combination that resulted in the best AUC in the CV set was selected for each model. To optimize prediction results, hyperparameter ranges were visually inspected for all four models per outcome by plotting them against the AUC. The hyperparameter range was adjusted if no global maximum was detectable and a decreasing or increasing trend in the AUC was visible for a hyperparameter value.
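A simplified sketch of this procedure for one outcome, using the caret package with a stratified 80/20 split, 10-fold CV, and a random search over 100 hyperparameter combinations optimizing the AUC, is shown below; data object names are assumptions and the actual study pipeline may differ.

```r
# Illustrative sketch of split, cross-validation, and tuning for >= 12
# outpatient visits; model_data (one row per patient, outcome + predictors)
# is assumed.
library(caret)

set.seed(2018)
model_data$high_outpatient <- factor(model_data$high_outpatient,
                                     levels = c(0, 1), labels = c("no", "yes"))

idx    <- createDataPartition(model_data$high_outpatient, p = 0.8, list = FALSE)  # stratified split
cv_set <- model_data[idx, ]    # 80% cross-validation set
it_set <- model_data[-idx, ]   # 20% independent test set

ctrl <- trainControl(method = "cv", number = 10, search = "random",
                     classProbs = TRUE, summaryFunction = twoClassSummary)

xgb_fit <- train(high_outpatient ~ ., data = cv_set, method = "xgbTree",
                 metric = "ROC",                     # optimise AUC across the CV folds
                 trControl = ctrl, tuneLength = 100) # 100 sampled hyperparameter combinations
```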

To optimize prediction results of the baseline settings for all models, we tested broader and narrower diagnosis groups based on the DHD-CCS definitions (233 and 17 groups, respectively; supplementary tables S3 and S4) and 3000 trees for the RF and XGB models. Moreover, several variables were combined or transformed into theoretically relevant features, and we then tested whether adding these features resulted in better AUCs. The five features that we defined and evaluated are described in supplementary table S5. In addition, we tested log transformations of skewed variables with and without the five features.
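A minimal sketch of such feature engineering could look as follows; the variables shown are hypothetical examples, not the five features described in supplementary table S5.

```r
# Sketch of a combined feature and log transformations of skewed counts;
# variable names are illustrative assumptions.
library(dplyr)

model_data <- model_data %>%
  mutate(
    visits_per_specialty = n_outpatient / pmax(n_specialties, 1),  # example combined feature
    log_outpatient       = log1p(n_outpatient),                    # log(1 + x) for skewed counts
    log_ed               = log1p(n_ed_visits)
  )
```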

In the CV set, we evaluated the AUC (equal to the concordance (c) statistic for binary outcomes) and its 95% confidence interval (CI) based on the mean AUC of the ROC across the CV folds and the standard error. Discrimination and calibration were evaluated on the IT set for the final models per outcome. For discrimination, the AUC with its 95% CI (based on DeLong [41]) and ROC curves were used. The ROC curve visualizes the tradeoff between the true positive rate (sensitivity) and the false positive rate (1 - specificity) [42]. The AUC ranges from 0.5 to 1.0, with higher values indicating better discrimination, meaning that the model can better differentiate patients with the outcome from those without. The final model per outcome was chosen based on the highest mean AUC in the CV set. In addition, calibration curves, intercepts, and slopes in the IT set were used. Calibration can be assessed by plotting the predicted probabilities against the observed proportions, visualizing the agreement between the observed outcome and the predicted probability [43]. For the calibration intercept, values < 0 indicate systematically too high predictions (overestimation of risks) and values > 0 systematically too low predictions (underestimation of risks). If the slope of the calibration curve is < 1, the risks are overfitted, meaning that low risks are underestimated and high risks are overestimated; when the slope is > 1, the risks are underfitted. A perfect calibration curve has a slope of 1 and an intercept of 0. Lastly, the precision-recall (PR) AUC and PR curves in the IT set were evaluated to provide a more nuanced understanding of precision and recall and the impact of different classification thresholds [44]. All analyses were performed and all figures created with R version 4.2.1 [45].
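The IT-set evaluation can be sketched as follows: AUC with a DeLong 95% CI via pROC, calibration intercept and slope via logistic recalibration on the logit of the predicted probabilities, and the PR AUC via PRROC. Object names follow the sketches above and are assumptions; this is not the authors' evaluation code.

```r
# Sketch of discrimination, calibration, and precision-recall evaluation on
# the IT set for one outcome.
library(pROC)
library(PRROC)

p_hat <- predict(xgb_fit, newdata = it_set, type = "prob")[, "yes"]
p_hat <- pmin(pmax(p_hat, 1e-6), 1 - 1e-6)   # avoid infinite logits at 0 or 1
y     <- as.integer(it_set$high_outpatient == "yes")

roc_it <- roc(y, p_hat)
ci.auc(roc_it, method = "delong")            # AUC with DeLong 95% CI

lp <- qlogis(p_hat)                          # logit of the predicted probabilities
calib_intercept <- coef(glm(y ~ offset(lp), family = binomial))[1]  # < 0: overestimation
calib_slope     <- coef(glm(y ~ lp,         family = binomial))[2]  # < 1: overfitted risks

pr <- pr.curve(scores.class0 = p_hat[y == 1],   # predicted risks of patients with the outcome
               scores.class1 = p_hat[y == 0],   # predicted risks of patients without
               curve = TRUE)
pr$auc.integral                                  # PR AUC
```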

2.6 Model explainability

To increase the interpretability and explainability of each final selected model, we calculated SHapley Additive exPlanations (SHAP) [46] values per outcome. The SHAP method is intended to explain individual predictions of complex ML models and is based on additive feature attribution and cooperative game theory [47]. With SHAP feature importance and summary plots, we provide insights into the contribution of the 15 most important predictors.
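SHAP values for a trained XGB model can be obtained directly from the xgboost predict method with predcontrib = TRUE, as sketched below; extracting the booster from a caret fit and the assumption that all predictors are numeric (so the matrix columns match the training data) are simplifications for illustration.

```r
# Sketch of SHAP-based feature importance for a final XGB model; object and
# column names (xgb_fit, predictor_columns) are assumptions.
library(xgboost)

booster <- xgb_fit$finalModel                      # underlying xgb.Booster from the caret fit
X_it    <- as.matrix(it_set[, predictor_columns])  # numeric predictor matrix (assumed names)

shap_vals <- predict(booster, newdata = X_it, predcontrib = TRUE)  # per-feature SHAP values + bias
shap_vals <- shap_vals[, colnames(shap_vals) != "BIAS"]            # drop the bias term

mean_abs_shap <- sort(colMeans(abs(shap_vals)), decreasing = TRUE)
head(mean_abs_shap, 15)                            # 15 most important predictors
```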

2.7 Sensitivity analyses

We performed several sensitivity analyses and repeated the hyperparameter grid search using our baseline settings. First, we trained and evaluated our models on a subset of the data that excluded all patients who visited the hospital in 2017 but not in 2018. Second, we tested different cut-off values: ≥ 10 and ≥ 14 instead of ≥ 12 outpatient visits, ≥ 1 instead of ≥ 2 ED visits, and ≥ 1 overall hospitalization instead of ≥ 1 acute hospitalization, similar to previous research [20, 48]. Finally, we examined whether random oversampling of our outcomes (balancing the data) [49] improved the predictive performance.
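The oversampling sensitivity analysis can be sketched with caret's upSample, which randomly duplicates minority-class rows until the classes are balanced before retraining; object names follow the earlier sketches and are assumptions.

```r
# Sketch of random oversampling of the outcome classes in the CV set.
library(caret)

predictors <- setdiff(names(cv_set), "high_outpatient")
balanced   <- upSample(x = cv_set[, predictors],
                       y = cv_set$high_outpatient,
                       yname = "high_outpatient")
table(balanced$high_outpatient)   # classes are now equally sized
```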

3 Results

3.1 Population characteristics

Demographic and healthcare (utilization) characteristics of the total included population and of the healthcare utilization subgroups in 2017 are shown in Table 1, including the three outcomes in 2018. For further insight into the included population, diseases and involved medical specialties with a prevalence of ≥ 2% in 2017 are shown in supplementary tables S6 and S7. In addition, bar plots depicting the number of diagnoses versus age and gender are displayed in supplementary figs. S1 and S2. This study included 14,486 patients, with a median age of 60 years and 54.4% females. Patients in the three high healthcare utilization groups were more often male and older than the total included population. Of the total population, 14.0% had ≥ 12 outpatient visits, 12.7% had ≥ 2 ED visits, and 4.9% had ≥ 1 acute hospitalization in 2017. For patients who visited the hospital in 2017, the three outcomes in 2018 were distributed as follows: 8.8% with ≥ 12 outpatient visits, 6.7% with ≥ 2 ED visits, and 3.5% with ≥ 1 acute hospitalization. The characteristics of the CV and IT sets for all three outcomes are provided in supplementary table S8.

Table 1 Population characteristics in 2017, including the three outcomes in 2018

3.2 Final model selection and evaluation

Results of the different model optimization and feature engineering approaches and their corresponding AUCs are provided in supplementary table S9. We ultimately used our baseline settings for ≥ 12 outpatient visits and ≥ 1 acute hospitalization(s), since the optimization approaches did not increase the AUC. For ≥ 2 ED visits, we used the baseline settings with 3000 trees for the XGB and RF models (instead of 1000 trees).

We further evaluated discrimination and calibration in the IT set for these final settings per outcome. The supplementary fig. S3-S5 show ROC curves on the IT dataset for the final settings per outcome. For all three outcomes, ROC curves were similar for all four models.

Calibration curves per outcome on the IT dataset, including calibration intercept and slope, are provided in Figs. 1, 2 and 3. For ≥ 12 outpatient visits, all four models overestimated higher risks, as indicated by the calibration curves and calibration intercepts (all < 0, Fig. 1). The XGB and RF models showed comparable calibration curves for ≥ 12 outpatient visits, with XGB showing slightly better overall agreement and less variability in the curve (Fig. 1b and d). For ≥ 2 ED visits, all four models overestimated higher risks (Fig. 2). The RF model showed an overestimation of low and high risks but an underestimation in between (Fig. 2d); the confidence interval of its calibration curve covered the ideal line well, but the range of predicted probabilities was limited. The XGB model had the most ideal calibration curve for ≥ 2 ED visits, with both an intercept close to 0 and a slope close to 1 (Fig. 2b). For ≥ 1 acute hospitalization(s), the ENR and LR models overfitted the risks, meaning that low risks were underestimated and high risks were overestimated (Fig. 3a and c). The RF model vastly underestimated higher risks (Fig. 3d). The XGB model had the most ideal calibration curve for ≥ 1 acute hospitalization(s), even though it overestimated lower risks and underestimated higher risks (Fig. 3b).

Fig. 1 Calibration curves, intercept, and slope for the final models (baseline settings) of ≥ 12 outpatient visits in the independent testing set

Fig. 2 Calibration curves, intercept, and slope for the final models (baseline settings with 3000 trees for the eXtreme Gradient Boosting and Random Forest models) of ≥ 2 emergency department visits in the independent testing set

Fig. 3 Calibration curves, intercept, and slope for the final models (baseline settings) of ≥ 1 acute hospitalization(s) in the independent testing set

Table 2 provides the final selected model per outcome based on the AUC and the calibration curves. For each final model per outcome, the corresponding AUC and 95% CI in the CV and IT set and calibration slope and intercept in the IT set are shown. We selected the XGB model with baseline settings for ≥ 12 outpatient visits as well as for ≥ 1 acute hospitalization(s) and the XGB model with 3000 trees for ≥ 2 ED visits.

Table 2 Final selected models with corresponding discrimination (cross-validation (CV) and independent test (IT)) and calibration intercept and slope (IT set)

The final selected models per outcome were evaluated with precision-recall curves (supplementary figs. S6–S8). The PR AUCs were 0.30 for ≥ 12 outpatient visits, 0.17 for ≥ 2 ED visits, and 0.11 for ≥ 1 acute hospitalization(s). The highest precision reached was around 0.7 for ≥ 12 outpatient visits, 0.5 for ≥ 2 ED visits, and 1.0 for ≥ 1 acute hospitalization(s); however, these values were only achieved at low recall levels.

3.3 Model explainability

Per outcome, we plotted the SHAP feature importance and summary plots of the final selected model, showing the contribution of the 15 most important variables to the predictions (Figs. 4, 5 and 6). The contributions of all variables with a mean Shapley value ≥ 0.001 are provided in supplementary tables S10–S12. For ≥ 12 outpatient visits in 2018, the number of outpatient visits and the involvement of the internal medicine specialty in 2017 had the highest contributions to the predictions, changing the predicted probability by 4.6 and 2.6 percentage points, respectively (Fig. 4a). High numbers of outpatient visits, involved specialties, and diagnoses tended to increase the risk of ≥ 12 outpatient visits (Fig. 4b). In contrast, very high age decreased the risk of ≥ 12 outpatient visits, while lower to medium-high ages increased the risk.

For ≥ 2 ED visits in 2018, the most important predictors were the number of outpatient visits, the number of ED visits, the distance to the UMCG, and the involvement of the internal medicine specialty in 2017 (Fig. 5). A high number of outpatient and ED visits, and a shorter distance to the UMCG tended to increase the risk of ≥ 2 ED visits (Fig. 5b).

For ≥ 1 acute hospitalization(s) in 2018, the age of the patient, the presence of an acute diagnosis, the number of diagnoses, and the distance to the UMCG in 2017 had the highest influence (Fig. 6). Higher age, a higher number of diagnoses and a shorter distance to the UMCG tended to increase the risk for ≥ 1 acute hospitalization(s) (Fig. 6b).

Fig. 4 SHapley Additive exPlanations (SHAP) feature importance (a) and summary plot (b) for ≥ 12 outpatient visits: eXtreme Gradient Boosting model with baseline settings. The 15 most important variables are displayed based on the mean absolute Shapley values

Fig. 5 SHapley Additive exPlanations (SHAP) feature importance (a) and summary plot (b) for ≥ 2 emergency department visits: eXtreme Gradient Boosting model with 3000 trees. The 15 most important variables are displayed based on the mean absolute Shapley values

Fig. 6 SHapley Additive exPlanations (SHAP) feature importance (a) and summary plot (b) for ≥ 1 acute hospitalization(s): eXtreme Gradient Boosting model with baseline settings. The 15 most important variables are displayed based on the mean absolute Shapley values

3.4 Sensitivity analysis

Results of the sensitivity analyses, with the baseline setting models as a reference, are displayed in supplementary table S13. The sensitivity analyses based on the subset of the data and on balancing the data for the three outcomes did not improve predictive performance. The different cut-off values did not improve predictive performance either, except for the cut-off of ≥ 14 outpatient visits, which resulted in slightly higher AUCs for all four models.

4 Discussion

To investigate whether prediction models can contribute to the identification of patients with MLTC at high risk of future potentially preventable healthcare utilization, we developed and internally validated prediction models with EHR data for future ≥ 12 outpatient visits, ≥ 2 ED visits, and ≥ 1 acute hospitalization(s). Our study used a large and well-defined sample of patients with MLTC in an AMC and adds to the limited number of available clinical prediction models that report measures of both discrimination and calibration. Furthermore, we improved the explainability of the models by calculating SHAP values [23, 47]. Overall, our models could identify patients with expected future potentially preventable healthcare utilization and showed better or similar performance compared to previous research, as outlined below. Nonetheless, our models could not substantiate whether the identified patients are indeed the patients most in need of integrated care.

Our study shows similar or higher AUC values for predicting outpatient and ED visits compared to previous research [18, 20, 50]. For ≥ 12 outpatient visits, our final selected XGB model resulted in an AUC of 0.83 (95% CI: 0.81–0.85) in the CV set and 0.82 (95% CI: 0.80–0.84) in the IT set. One previous study predicted outpatient visits with a LR model and found an AUC of 0.75 [18]; our study showed an AUC of 0.80 for the LR model. As we used comparable data, the inclusion of more variables on the patients' healthcare (utilization) characteristics in our study might explain the better performance. For ≥ 2 ED visits, our final selected XGB model resulted in an AUC of 0.76 (95% CI: 0.74–0.78) in the CV set and 0.76 (95% CI: 0.73–0.80) in the IT set. Previous studies found AUCs ranging between 0.66 and 0.79 and used cut-offs of ≥ 1 to ≥ 4 ED visits in their models [18, 20, 50]. For ≥ 1 acute hospitalization(s), our final selected XGB model resulted in an AUC of 0.75 (95% CI: 0.73–0.78) in the CV set and 0.73 (95% CI: 0.67–0.78) in the IT set. Studies with similar cut-off values focusing on acute or unplanned hospitalizations found lower AUCs of 0.69–0.70 [18, 20]. Another study focusing on any hospitalization within six months reported an AUC of 0.84 [48], which is higher than our findings. Their best-performing model included demographic, healthcare utilization, diagnosis, and medication data, which, together with their tenfold larger population, might explain the difference with our findings. Similarly, a review of prediction models for emergency hospital admission found AUCs in 18 studies ranging between 0.63 and 0.83 in models using administrative or clinical record data [51]. Models with an AUC > 0.80 also included medication data or polypharmacy as predictors, which seem to improve the predictive performance for unplanned hospitalizations. Our findings suggest that complex ML models with less straightforward explainability, such as XGB and RF, do not substantially increase predictive performance compared to more explainable models such as ENR or LR.

To the best of our knowledge, the calibration of comparable prediction models has only been reported in two studies [18, 48]. In line with previous research, we found reasonable agreement between observed and predicted probabilities for outpatient visits. We found similar calibration for ED visits, with an overestimation of higher predicted risks. In contrast, our model for acute hospitalization(s) underestimated higher risks, whereas previous studies overestimated higher risks [18, 48]. We did, however, find an overestimation of risks in the LR model (instead of the finally selected XGB model), which is in line with previous studies. Overall, previous literature and our study suggest that higher risks of potentially preventable healthcare utilization can be predicted less accurately than lower risks. In addition, we evaluated precision-recall curves, which is recommended when dealing with imbalanced datasets [52]. Results were suboptimal and showed high precision only at low recall values. Overall, patients at very high risk had the best potential to be correctly classified by the models, but many remain undetected due to the low recall. Further insights are needed to assess whether such very high-risk patients could also be identified based on clinical experience and judgement alone.

Several variables increased the risk of all three outcomes. In line with previous research, we identified more prior-year diagnoses, outpatient visits, and hospitalizations as important predictors for all three outcomes [18, 20, 48]. Higher age has been reported in previous studies as an important predictor for all three outcomes, whereas we only found this for ≥ 1 acute hospitalization(s). In our study, very high ages seemed to decrease the risk of ≥ 12 outpatient visits. This finding is partially reflected in our descriptive results (Table 1) and can be explained by changes in the treatment goals of the oldest patients [53] and greater challenges in visiting the hospital [54]. Other important predictors in our models were the number of involved specialties and the distance to the UMCG. Only one previous study included residential areas in its models [50]. In line with our expectations, most high healthcare utilizers seem to live closer to the UMCG. Nonetheless, some outliers were visible among patients with a larger distance to the UMCG and an increased risk of acute hospitalizations, potentially indicating complex patients with a high travel and disease burden.

4.1 Future work

Future work is needed to evaluate the added value of prediction models for identifying patients with MLTC most in need of integrated care. First, it should be evaluated whether the identified high-risk patients are indeed the patients who either self-report experiencing fragmented care or whose healthcare providers recognize a need for integrated care. Such an assessment is in line with the suggestion by Verhoeff et al. (2022) [18] and can shed light on whether the high healthcare utilization is related to experienced care fragmentation or simply to the complexity of the patients.

Second, if the models can identify patients needing integrated care, identification with prediction models should be compared to identifying patients based on more straightforward threshold values. Comparisons of complex versus simple models have similarly been performed in the prediction of chronic opioid therapy and of hospitalizations within six months [48, 55]. Such threshold values could include, for example, a certain amount of prior healthcare utilization and a specific number of diagnoses. In addition, the models' performance could be compared to identification based on clinical experience and judgement alone. Identification based on these values, next to or in combination with clinicians' experience, could be as (cost-)effective as the extensive development, implementation, and maintenance of prediction models.

Third, previous research suggests that including medication data could further improve predictive performance. Hence, future work should assess the impact of including more data on the predictive performance and on accurately identifying the target population. Moreover, we recommend further evaluating the added value of additional performance indicators beyond the commonly used AUC [56, 57]. The identification of patients at high risk of potentially preventable healthcare utilization remains challenging and might be unfeasible for the total population. Therefore, models with an acceptable precision-recall AUC might be able to identify a subset of the population in which the model can achieve high precision [58]. In addition, other measures of MLTC that capture disease burden differently than simple disease counts should be assessed [32, 59]. Furthermore, the feasibility of a hybrid case-finding approach [19], in which quantitative prediction modeling and a more qualitative personal assessment are combined, should be investigated further.

4.2 Limitations

Our study has several limitations, the first being the limited prediction time span of one year. Our descriptive results suggest that around half of our patient population had high healthcare utilization over the two-year study period, suggesting that long-term high healthcare utilizers might represent a different patient population [24]. In addition, we could not differentiate between incident and prevalent patients with MLTC in our study due to the limited time span. Longitudinal studies could further differentiate long-term healthcare utilizers, provide insights into the accumulation of conditions, and inform prevention [60].

Second, our data were collected before the COVID-19 pandemic. Healthcare utilization might have decreased or changed after COVID-19 and might not be comparable to data from 2017 and 2018 [61, 62].

Third, although our sensitivity analysis showed slightly higher AUCs for ≥ 14 outpatient visits, we did not further evaluate this model due to resource constraints. Future research should investigate whether a cut-off of ≥ 14 outpatient visits results in better predictive performance, even though the increase in AUC in our sensitivity analyses was only minor.

Fourth, we labeled all patients who visited the hospital in 2017 but did not in 2018 as zero (i.e., not having one of the outcomes). This labeling is in line with reality, as future healthcare utilization or death will not be known at the time of prediction. However, this could create a mixed class of patients labeled as zero: patients who did not return, visited another hospital, or died. Therefore, we performed a sensitivity analysis that excluded patients who did not visit the hospital in 2018; the results were comparable.

Fifth, the inclusion of ‘enabling’ characteristics, such as income or region of the country, could improve the selection of patients [24]. We could not include such characteristics because this information is not collected in the administrative data based on DTC codes. However, our data are readily available from the EHR, and we included the distance to the UMCG and to the closest hospital in the patient's region as a proxy for the region of the country.

Sixth, the prevalence of MLTC might be underestimated in our study because the definition was based on DTC codes and focused on diagnoses registered at outpatient visits. Within one DTC trajectory, a medical specialist cannot simultaneously register multiple diagnoses of the same main diagnostic group associated with his or her medical specialty [63]. Nonetheless, diagnoses from other main diagnostic groups and from other medical specialties can be registered in parallel within one DTC trajectory, and the DTC data accurately report the involvement of medical specialists and the healthcare utilization. Our data included registered diagnoses from one AMC only. Patients with MLTC often visit multiple primary and secondary healthcare providers [7, 8], leading to a potential underestimation of MLTC in our study. Since data across healthcare providers are not centrally collected or available in the Netherlands, we were not able to include such data [27]. This potential underestimation of the MLTC burden might have impacted the predictive performance of the number of diagnoses in this study.

Finally, we did not externally validate our models on data from other AMCs, which is recommended for assessing the general applicability of prediction models [42]. In addition, temporal external validation on updated post-COVID-19 data has not been performed, limiting conclusions on the consistency of our models over time. As AMCs provide highly specialized care to heterogeneous patient populations, the general applicability of our models to other AMCs was beyond the scope of this study. We assume that our models must be retrained or updated in other centers to account for local and hospital-specific variability in patient populations [23, 42, 64]. As high healthcare users can visit multiple hospitals in one year [65], developing and externally validating prediction models with data from multiple Dutch hospitals would be ideal. However, data sharing across hospitals remains problematic in the Netherlands to date. Consequently, implementing locally developed and internally validated models is more feasible.

5 Conclusion

Patients with MLTC and high future healthcare utilization can be identified with ML models. Whether these models identify patients who experience fragmented care and are most in need of an integrated care approach has yet to be substantiated. Patients identified with these models should be compared to identification based on clinical experience and judgment to assess the added value and potential (cost-)effectiveness of incorporating such prediction models within the EHR.