The rapid growth of healthcare predictive models has been fueled by the widespread availability of three essential elements: (1) large-scale digital and electronic health record (EHR) data; (2) off-the-shelf, robust machine learning (ML) algorithms; and (3) powerful, low-cost computing platforms [1]. Together, these elements have fostered an ‘arms race’ toward superior predictive performance, with each report highlighting the latest algorithm, pipeline, or process. Concurrently, the complexity of model development has quickly outpaced the reporting capabilities of traditional scientific journal formats [2]. Thus, today, even experienced readers are stymied by the increasingly difficult task of assessing the value of a rapidly growing number of predictive models [3].

These challenges are likely to spur new approaches to evaluating and comparing healthcare predictive models. In the meantime, we continue to focus on assessing the value of a model within three key domains: (1) clinical utility, (2) validity, and (3) feasibility and sustainability (Table 1). While the listed items are not comprehensive, they can help identify key gaps and potential pitfalls.

Table 1 Key domains and considerations for evaluating the value of a predictive model

To see this framework in action, we apply it to the study by Roimi et al. [4], which sought to predict bloodstream infections among patients with suspected infection in two independent ICUs: Beth Israel Deaconess Medical Center (BIDMC) in the US and Rambam Health Care Campus (RHCC) in Israel. Using time-stamped EHR data consisting of medications, laboratory values, vital signs, treatments, and comorbidities, the authors implemented tree-based ML methods to identify patients whose blood cultures, drawn after > 48 h in the ICU, would be positive with a clinically significant pathogen.

This study had two relatively novel features. First, rather than building a single model suitable for both sites, the authors created site-specific models that shared similar feature engineering (i.e., data pre-processing and variable identification) and modeling pipelines. Second, they designed a process that incorporated successive transformations, summary statistics, and/or time-series trends, such that each dataset expanded to include > 7000 predictors. To prevent overfitting, their final modeling approach included only the top 50 most predictive variables; a sketch of this general pattern follows.
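In broad strokes, and purely as an illustration rather than the authors’ actual pipeline, such an expand-then-select step can be sketched as below. The variable names (heart_rate, lactate, wbc), the specific summary statistics, and the choice of gradient boosting are assumptions for concreteness.

```python
# Illustrative sketch only - not the authors' code. Expands time-stamped
# EHR events into per-patient summary features, then keeps the 50 most
# predictive variables by tree-ensemble importance.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def slope(s: pd.Series) -> float:
    """Crude linear trend of one measurement over its observation window."""
    s = s.dropna()
    return float(np.polyfit(np.arange(len(s)), s.to_numpy(), 1)[0]) if len(s) > 1 else 0.0

def expand_features(raw: pd.DataFrame) -> pd.DataFrame:
    """One row per patient: summary statistics plus a trend per raw variable."""
    grouped = raw.groupby("patient_id")
    feats = {}
    for col in ["heart_rate", "lactate", "wbc"]:  # hypothetical variables
        feats[f"{col}_last"] = grouped[col].last()
        feats[f"{col}_mean"] = grouped[col].mean()
        feats[f"{col}_max"] = grouped[col].max()
        feats[f"{col}_slope"] = grouped[col].apply(slope)
    return pd.DataFrame(feats)

# raw_events and labels are assumed inputs: time-stamped EHR rows and
# per-patient culture positivity (0/1), aligned on patient_id.
X = expand_features(raw_events).fillna(0.0)   # in practice, thousands of columns
screen = GradientBoostingClassifier().fit(X, labels)
top50 = X.columns[np.argsort(screen.feature_importances_)[::-1][:50]]
final_model = GradientBoostingClassifier().fit(X[top50], labels)
```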

Among the 31.7% of ICU patients at BIDMC who had blood cultures drawn after > 48 h, 6.4% had a positive culture. The BIDMC-specific model’s discriminative performance (area under the receiver operating characteristic curve, AUROC) was excellent (test set: 0.89–0.90). Similarly, among the 56.3% of RHCC patients with such cultures, 15.9% had a positive culture, with an RHCC-specific test AUROC of 0.92–0.93. At a sensitivity threshold of 50%, specificity was ≥ 93%, and the number needed to screen to detect one true-positive case was low (1.2–3).
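As a back-of-the-envelope check (using the rounded BIDMC figures above: prevalence p ≈ 0.064, sensitivity Se = 0.5, specificity Sp ≈ 0.93), the number needed to screen is simply the reciprocal of the positive predictive value:

```latex
\mathrm{PPV} = \frac{Se \cdot p}{Se \cdot p + (1 - Sp)(1 - p)}
             = \frac{0.5 \times 0.064}{0.5 \times 0.064 + 0.07 \times 0.936}
             \approx 0.33,
\qquad
\mathrm{NNS} = \frac{1}{\mathrm{PPV}} \approx 3.
```

The higher prevalence (and, implicitly, specificity) at RHCC pushes the PPV up and the NNS toward the lower end of the reported 1.2–3 range.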

To evaluate the value of this predictive model, we begin by assessing clinical utility. Patients with bloodstream infection have high mortality, exceeding 25%, making them a prime target for care improvement [5, 6]. A substantial proportion (33%–42%) of patients undergoing blood culture are not on antibiotics [7]. In this study, 21%–28% of patients had not yet received antibiotics in the 6 h following cultures. Although sepsis patients are not an identical population, they benefit from earlier antibiotics, suggesting an opportunity to accelerate the initiation of appropriate antibiotic use among patients with a high predicted risk of blood culture positivity [8].

At the same time, unnecessary antibiotics are increasingly being recognized as a threat at both the patient and population levels. A predictive model that lacks a strong negative likelihood ratio could inappropriately increase antibiotic exposure in certain patients, depending on whether and how negative scores are presented to clinicians. At a sensitivity threshold of 50%, the expected impact of this model’s negative likelihood ratio (0.51–0.54) would be only small to moderate. Competing diagnostic tests such as procalcitonin and rapid pathogen assays are also entering mainstream clinical use and may offer alternate approaches to risk stratification [9, 10].
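For orientation, the negative likelihood ratio at a 50% sensitivity threshold follows from its definition; the reported range of 0.51–0.54 corresponds to site-specific specificities of roughly 93%–98%, consistent with the “≥ 93%” figure above:

```latex
\mathrm{LR}^{-} = \frac{1 - Se}{Sp}:
\qquad
\frac{1 - 0.5}{0.93} \approx 0.54,
\qquad
\frac{1 - 0.5}{0.98} \approx 0.51.
```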

Together, these factors suggest this model has at least moderate potential clinical utility for patients not on antibiotics at the time of blood culture sampling. However, how the end-user would act on the predicted risk remains incompletely defined. Moreover, since the action will likely differ based on whether or not patients are already on antibiotics, evaluation within each of these subpopulations is important.

Assessing model validity is a considerably more challenging task. Although the authors include a moderately high level of detail in the manuscript, key questions remain about the feature-engineering and modeling pipeline. Why is the single strongest predictor of blood culture positivity (the duration between blood culture sampling and last defecation) as much as fivefold more important than the next few variables? Indeed, defecation variables make up seven of the top 100 variables across sites. Similarly, why does the model incorporate variables not traditionally seen as clinically relevant (e.g., mean corpuscular hemoglobin concentration, or the partial pressure of alveolar oxygen divided by positive end-expiratory pressure)? Are these novel insights, proxy measures of other key predictors, or spurious findings?

In the modeling approach, why did the authors implement an unusual soft-voting ensemble across multiple tree-based methods, when routine approaches would favor increasing the number of trees within a single method? Why does performance degrade so markedly when models are applied across sites, or when simple logistic regression is used? Would a regularized linear model using the full feature set improve performance? Variability in data sampling might account for these differences, but it could also make the models brittle in real-world use. In this case, incomplete insight into the approach could undermine ‘live’ model validity in real-world settings.
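For readers less familiar with the terminology, a generic soft-voting construction (averaging predicted probabilities across heterogeneous tree ensembles) and the two routine alternatives mentioned above can be sketched in a few lines of scikit-learn. This is purely illustrative; the authors’ actual estimators and hyperparameters are not specified here, and the toy data are a stand-in.

```python
# Illustration only - not the authors' implementation.
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=50, random_state=0)  # toy stand-in

# Soft voting: average predicted probabilities across different tree methods.
voting = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
        ("et", ExtraTreesClassifier(n_estimators=200, random_state=0)),
    ],
    voting="soft",
).fit(X, y)

# Routine alternative 1: a single method, simply with more trees.
bigger_forest = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X, y)

# Routine alternative 2: an L1-regularized linear model over the full feature set.
lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
```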

Finally, we assess feasibility and sustainability. An advantage of site-specific model development is that the resulting predictions maximally leverage local patterns and data [11]. In theory, this makes them better suited to implementation within current clinical workflows. On the other hand, such predictive models may ‘overfit’ specific clinical workflows and EHR documentation processes, making them difficult to compare between sites. In addition, if practices change, model performance could degrade significantly. Institution-specific models also require staff who can implement and maintain local algorithms, as well as data with enough samples and granularity to support a similar modeling strategy. Further, as modeling approaches become more complex, addressing declines in future model performance and mitigating unintended consequences can become increasingly challenging.

Although this study breaks intriguing new ground by leveraging routine clinical data to predict blood culture positivity in the ICU, key questions about the model’s clinical utility, validity, and feasibility/sustainability remain. These considerations should inform prospective clinical testing, which remains the only way to evaluate true model value. The study also highlights the tremendous new potential that predictive models hold for improving critical care. Predictive tools are becoming commoditized; however, the challenge of assessing model value will remain. Considering this changing landscape, new venues and modalities for assessing and comparing model development, performance, and value will become increasingly important for a successful future of predictive analytics in healthcare.