There is growing consensus that evaluation of the effects of artificial intelligence (AI) interventions within health systems is needed to ensure their safe, equitable, and patient-centered use. Given the speed with which AI is developing, the phrase “building the plane while flying it” could not be more apt. For AI’s full potential to be realized, we must quickly develop and implement methods that ensure its effectiveness is replicated across diverse clinical environments, its benefits are distributed equitably, and its adverse consequences are minimized.

Pragmatic implementation science methods1 that assess and enhance the impact of complex interventions in real-world environments have great potential to help AI achieve its goals, in replicable ways, while minimizing adverse unintended consequences including patient harm, system inefficiency, and disparities in care delivery. To illustrate these methods, we use the example of predictive AI sepsis alerts, which are already in common use, and the Practical, Robust Implementation and Sustainability Model (PRISM)2 (Fig. 1), a frequently used implementation science framework.

Figure 1

How the Practical, Robust Implementation and Sustainability Model (PRISM) facilitates equitable implementation of health interventions.

AI INTERVENTIONS ARE COMPLEX AND CONTEXT-DEPENDENT

When considering what methods are needed to assess and enhance replicability across clinical settings, it is important to recognize that AI innovations satisfy the definition of a complex intervention.3 Beyond the predictive AI model itself, additional intervention components are necessary to guide actions taken in response to a given AI prediction (Fig. 2). These “decision” components are environment- or context-dependent. They must be tailored to unique aspects of the clinical setting, including organizational culture, workflow, and infrastructure, for the AI intervention to produce the desired outcome.

Figure 2

Using pragmatic implementation science methods to plan, implement, and sustain effective, equitable AI interventions.

AI model performance also varies with context and the data available. The “brittleness” of AI models, or their inability to maintain predictive performance when applied to data sets other than those they were trained on, is one of the most important challenges to AI improving clinical care. Thus, like other complex interventions, AI interventions are unlikely to deliver their benefits when implemented in new contexts without iterative tailoring, or adaptations, using carefully selected implementation strategies.

For instance, an AI model developed at one hospital to predict sepsis will likely have worse prediction accuracy when initially deployed in another, requiring retraining on local data. Further, predictive AI models can only offer a probability. The intervention designers must decide on the probability above which a clinician will be notified. The clinician must then prioritize their response to the alert amid multiple competing demands, which are influenced by unique contextual factors such as the size and acuity of their patient census. This example illustrates why AI interventions are unlikely to maintain the same magnitude of effectiveness when initially deployed in a new context.
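The threshold decision described above can be made concrete with a short sketch. Assuming a locally validated set of predicted sepsis probabilities paired with observed outcomes (all data and names here are hypothetical, for illustration only), designers can tabulate sensitivity against alert burden for candidate notification thresholds:

```python
# Hypothetical local validation data: (predicted sepsis probability, true label)
validation = [
    (0.92, 1), (0.81, 1), (0.75, 0), (0.64, 1), (0.55, 0),
    (0.47, 0), (0.40, 1), (0.33, 0), (0.21, 0), (0.10, 0),
]

def alert_tradeoff(data, threshold):
    """Sensitivity and alert burden if clinicians are notified above `threshold`."""
    alerts = [(p, y) for p, y in data if p >= threshold]
    true_cases = sum(y for _, y in data)
    caught = sum(y for _, y in alerts)
    sensitivity = caught / true_cases if true_cases else 0.0
    alerts_per_patient = len(alerts) / len(data)
    return sensitivity, alerts_per_patient

for t in (0.3, 0.5, 0.7):
    sens, burden = alert_tradeoff(validation, t)
    print(f"threshold {t:.1f}: sensitivity {sens:.2f}, alert rate {burden:.2f}")
```

A lower threshold catches more true sepsis cases but raises the alert burden clinicians must triage; where the local team sets this cutoff is itself a context-dependent design decision, not a property of the model.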

Implementation science frameworks like PRISM (Fig. 1) are useful in planning, implementing, and maintaining complex health interventions because they provide a scaffold by which to measure multiple contextual factors, process outcomes, and clinical outcomes across settings and subgroups. PRISM facilitates monitoring and iterative data-driven adjustments until the desired outcomes are achieved (Fig. 2). For instance, in addition to mortality, the RE-AIM constructs of PRISM support measuring process outcomes, like time to antibiotic administration, that are critical to understanding the effectiveness results of a sepsis alert. The contextual domains of PRISM facilitate understanding aspects of the environment that influence outcomes. Qualitative methods are often used to capture contextual drivers of outcomes that are otherwise difficult to measure, like distrust of an AI model’s prediction accuracy.

HEALTH EQUITY OUTCOMES MUST BE MONITORED TO PREVENT BIAS

Bias can be introduced at every phase of the AI “lifecycle,” from data creation to model deployment.4 Given how easily AI models can incorporate and conceal bias, proactive and iterative monitoring of both care delivery and clinical outcomes is needed to rapidly identify and address disparities. While the “black box” aspect of many AI models has been cited as an important barrier to trust and to detection of bias, close monitoring by health systems for implementation and outcome disparities can mitigate these drawbacks. This approach can also help address the lack of representativeness in existing data by promoting greater scrutiny and transparency regarding the completeness, relevance, and quality of data, ensuring appropriate inclusion of historically underrepresented populations and of behavioral, environmental, and social determinants of health (SDoH) measures. It is also important to consider whether an adequate quantity of data is available to meaningfully apply AI in equitable ways. To avoid these potential pitfalls, equity should be considered from the beginning, when the problem and AI model are specified, and iteratively revisited throughout the AI “lifecycle.”

For example, RE-AIM implementation outcomes, with their emphasis on representativeness, can measure whether an AI sepsis alert is being delivered at the same rate, in the same way, and with the same outcomes across demographics like race/ethnicity and other SDoH measures. If, for instance, worse outcomes are found in non-English-speaking patients, qualitative data combined with quantitative process outcomes can be used to understand drivers of inequity, identify strategies to address them, and reevaluate outcomes once targeted implementation strategies have been deployed to assess whether the disparity has resolved.
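A minimal sketch of such stratified monitoring, using hypothetical encounter records and subgroup labels (the field names, groups, and tolerance below are illustrative assumptions, not a prescribed method), might compute alert delivery rates per subgroup and flag gaps beyond a chosen tolerance:

```python
# Hypothetical encounter records: (subgroup label, whether the alert was delivered)
encounters = [
    ("English", True), ("English", True), ("English", False), ("English", True),
    ("Spanish", True), ("Spanish", False), ("Spanish", False), ("Spanish", False),
]

def alert_rates_by_group(records):
    """Fraction of encounters in each subgroup where the alert was delivered."""
    totals, fired = {}, {}
    for group, alerted in records:
        totals[group] = totals.get(group, 0) + 1
        fired[group] = fired.get(group, 0) + int(alerted)
    return {g: fired[g] / totals[g] for g in totals}

def flag_disparities(rates, tolerance=0.10):
    """Subgroups whose alert rate trails the best-served group by more than `tolerance`."""
    best = max(rates.values())
    return [g for g, r in rates.items() if best - r > tolerance]

rates = alert_rates_by_group(encounters)
print(rates)                  # {'English': 0.75, 'Spanish': 0.25}
print(flag_disparities(rates))  # ['Spanish']
```

A flagged subgroup is a prompt for the qualitative follow-up described above, not a verdict: rate differences may reflect case mix as well as inequitable delivery.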

IMPLEMENTATION SCIENCE METHODS PAIRED WITH A LEARNING HEALTH SYSTEM INFRASTRUCTURE WILL MAKE TAILORING, RIGOROUS EVALUATION, AND MONITORING OF AI INTERVENTIONS FEASIBLE

The integration of pragmatic implementation science methods with the evolving informatics-driven learning health system (LHS)5 can help facilitate both the equity and the replicability of AI intervention effectiveness in diverse contexts. As LHS infrastructures advance, the speed, feasibility, and robustness of implementation science–guided evaluations will also grow. Contextual, process, and effectiveness data that previously took days to months to collect can now be queried and displayed to implementers in real time, allowing for more rapid, iterative adaptations to optimize the fit of the AI intervention with its context and correct any unanticipated negative outcomes. This LHS informatics infrastructure also makes pragmatic randomized trials6 and interrupted time series designs more feasible, allowing for more accurate estimates of AI’s effects on the quintuple aim: clinical effectiveness, health equity, cost, and patient and clinician experience.

For example, an operational, automated dashboard displaying RE-AIM outcomes7 of a sepsis alert, populated with data extracted from the EHR, allows for close monitoring of intervention delivery, effectiveness, and unintended harms while requiring minimal health system resources. Rapid qualitative assessments can be performed in response to these quantitative interval evaluations to understand drivers of the observed outcomes.
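As a rough illustration of what such a dashboard might compute at each refresh (the record fields and metric choices below are hypothetical simplifications of RE-AIM outcomes, not a specification), reach and a process outcome such as median time to antibiotics can be derived from a handful of EHR fields:

```python
# Hypothetical EHR extract: one record per encounter eligible for the sepsis alert.
encounters = [
    {"alerted": True,  "minutes_to_antibiotics": 45},
    {"alerted": True,  "minutes_to_antibiotics": 60},
    {"alerted": False, "minutes_to_antibiotics": 180},
    {"alerted": True,  "minutes_to_antibiotics": 50},
]

def reaim_snapshot(records):
    """Reach (share of eligible encounters alerted) and median time to antibiotics."""
    reach = sum(r["alerted"] for r in records) / len(records)
    times = sorted(r["minutes_to_antibiotics"] for r in records)
    mid = len(times) // 2
    median = times[mid] if len(times) % 2 else (times[mid - 1] + times[mid]) / 2
    return {"reach": reach, "median_minutes_to_antibiotics": median}

print(reaim_snapshot(encounters))
```

In practice these summaries would be recomputed on a schedule and stratified by the same subgroups used for equity monitoring, so that the dashboard surfaces both overall drift and emerging disparities.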

The RE-AIM outcome of effectiveness can display not only the relative mortality rate associated with the sepsis alert but also balancing measures such as rates of Clostridioides difficile infection. Iterative, qualitative methods can be deployed to identify possible unintended consequences. For instance, if nurses express concerns that the sepsis alerts delay them from performing other duties, implementers can monitor patient falls and pressure wounds in response. Because healthcare environments are both complex and dynamic, the iterative evaluations and adaptations should be continued even after an intervention has demonstrated effectiveness. This will ensure its continued effectiveness and equity over time in an ever-changing context.

In conclusion, the integration of pragmatic implementation science methods and the LHS can enable the informed design, feasible monitoring, and iterative tailoring of AI interventions that are essential for their effective and equitable use. Application of these approaches offers a path toward realizing AI’s great potential to propel healthcare toward achievement of the quintuple aim.