Introduction

With the advent of increasingly available high-dimensional health data, combined with accelerating computational abilities to process and analyze them, there is an emerging opportunity to define health and disease states, and their underlying physiologic and pathophysiologic mechanisms, with more clarity, precision, and efficiency. Aspirationally, these advances might be applied to real-time diagnosis and patient management. Perhaps nowhere in the healthcare system are the challenges of creating useful models with direct, time-critical clinical applications more relevant, and the obstacles to achieving those goals more formidable, than in the intensive care unit (ICU). Machine learning (ML)-based artificial intelligence (AI) techniques to define states and predict future events are commonplace in almost all aspects of modern life. However, their penetration into acute care medicine has been slow, stuttering, and uneven. There are many papers describing the various types of ML approaches available [1,2,3], but the realization of such approaches and tools to aid clinicians has been erratic.

Major obstacles to the widespread, effective application of AI approaches to the real-time care of critically ill patients need to be addressed. Presently, clinical decision support systems (CDSS) cannot replace bedside clinicians in acute and critical care environments. The reasons are many and include the immaturity of CDSS with respect to situational awareness, the fundamental bias in many large databases that do not reflect the target populations of patients being treated (making fairness an important issue), and technical barriers to timely access to valid data and its display in a fashion useful for clinical workflow. The inherent “black-box” nature of many predictive algorithms and CDSS makes trustworthiness and acceptance by the medical community difficult. Logistically, collating and curating, in real time, the multidimensional data streams from various sources needed to inform the algorithms, and ultimately displaying relevant clinical decision support in a format that adapts to individual patient responses and signatures, represent the efferent limb of these systems and are often ignored during initial validation efforts. Similarly, legal and commercial barriers to access to many existing clinical databases limit studies addressing the fairness and generalizability of predictive models and management tools. We will explore the barriers to effective use of AI in critical care medicine, and ways to either bypass or address them to achieve effective CDSS.

Real-world clinical data for both model-building and CDSS

Large amounts of highly granular data—such as those from devices for monitoring and life support, laboratory and imaging studies, and clinical notes—are continuously generated and stored in electronic health records (EHRs) of critically ill patients. The massive number of patients with data available for analysis dwarfs clinical trial sample sizes. Thus, there is both ample availability of data and a clear opportunity for data-driven CDSS. Compared with clinical trials or prospectively enrolled cohort studies, the disadvantages of real-world data, such as bias and non-random missingness, are, if addressed, offset by obvious advantages, including an unselected patient population with larger sample size and the ability to update and focus analyses, all with the potential to maximize external validity at a fraction of the cost. Currently, most critical care EHR data are available only for patient care and not for secondary use. Barriers include legal and ethical issues related to privacy protection as well as technical issues related to concept mapping across different EHR vendors, where similar clinical concepts are represented differently, thus introducing semantic ambiguity [4]. But a very large obstacle is the lack of incentive to make intensive care data available for local, regional, or general use. However, the concept that the healthcare system could learn from the data of all its patients is attractive and should foster data solidarity.

Responsible sharing of large ICU datasets at all levels implies finding the right balance between privacy protections and data usability. This requires careful combinations of governance policies and technical measures for de-identification to comply with ethical and legal standards and with privacy laws and regulations (e.g., the Health Insurance Portability and Accountability Act in the USA and the General Data Protection Regulation in the EU). These challenges contributed to the fact that, until recently, freely available ICU databases were sourced only from the USA. A partial list of publicly available large intensive care databases from the USA, Europe, and China is provided in Table 1. Most are described and accessible on the PhysioNet platform [5]. There are also numerous databases and data-sharing initiatives that are less freely available. Access to these typically requires collaboration with the institutes from which the data were sourced (e.g., the Critical Care Health Informatics Consortium, the Dutch Data Warehouse, and ICUdata.nl).
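
To make the technical side of de-identification concrete, the sketch below illustrates two measures commonly combined with governance policies: salted one-way hashing of patient identifiers and per-patient date shifting. This is a minimal sketch in Python; the salt handling, field names, and offset range are illustrative assumptions, not a recipe for HIPAA or GDPR compliance.

```python
import hashlib
import random
from datetime import datetime, timedelta

# Illustrative only: real de-identification must follow institutional
# governance and applicable law (HIPAA, GDPR), and must also handle
# free text, rare values, and re-identification risk assessment.

SECRET_SALT = "stored-and-managed-outside-the-dataset"  # assumption: a securely kept secret

def pseudonymize_id(patient_id: str) -> str:
    """One-way, salted hash of the patient identifier."""
    return hashlib.sha256((SECRET_SALT + patient_id).encode()).hexdigest()[:16]

def shift_date(patient_id: str, timestamp: datetime, max_days: int = 365) -> datetime:
    """Shift all of a patient's timestamps by one consistent random offset,
    preserving intervals between events while hiding true calendar dates."""
    rng = random.Random(pseudonymize_id(patient_id))  # deterministic per patient
    return timestamp + timedelta(days=rng.randint(-max_days, max_days))

record = {"patient_id": "MRN-0042", "event_time": datetime(2023, 5, 1, 14, 30)}
print(pseudonymize_id(record["patient_id"]), shift_date(record["patient_id"], record["event_time"]))
```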

Table 1 Publicly available ICU databases

Operationally, the research question should determine the choice of dataset, as datasets differ substantially in cohort size, data granularity, treatment intensity, and outcomes. To foster model generalizability, at least two different datasets should be used. One barrier to this kind of external validation would be removed if these free databases were available in common data models using standard vocabularies; the recent effort to map the MIMIC-IV dataset to the Observational Medical Outcomes Partnership (OMOP) common data model is an important first step [6]. The US-based Patient-Focused Collaborative Hospital Repository Uniting Standards (CHoRUS) for Equitable AI has initiated the generation of a harmonized, geographically diverse, large multi-domain dataset of ICU patients including EHR, text, image, and waveform data (bridge2ai.org/chorus). This public-facing dataset should soon be available to complement existing databases, with the added advantage of significant diversity. Alternatively, the R-based Intensive Care Unit Data (RICU) package and the Yet Another ICU Benchmark (YAIB) offer opportunities for combined analyses of critical care datasets. Another limitation of these datasets may be that they contain ICU-only data.
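
To illustrate what mapping to a common data model involves, the Python sketch below shows a source-to-standard vocabulary lookup in the spirit of OMOP. The local codes and concept IDs are illustrative placeholders, not the actual MIMIC-IV/OMOP mapping [6].

```python
# Minimal sketch of source-to-standard concept mapping in the spirit of the
# OMOP common data model. Local codes and concept IDs below are
# illustrative placeholders.

LOCAL_TO_OMOP = {
    # (source system, local code) -> OMOP standard concept_id (placeholder values)
    ("hospital_a", "HR"):      3027018,
    ("hospital_b", "PULSE"):   3027018,   # same clinical concept, different local code
    ("hospital_a", "LACTATE"): 3047181,
}

def harmonize(source: str, local_code: str, value: float, unit: str) -> dict:
    """Translate one local measurement row into an OMOP-like measurement record."""
    concept_id = LOCAL_TO_OMOP.get((source, local_code))
    if concept_id is None:
        raise KeyError(f"Unmapped local code: {source}/{local_code}")
    return {"measurement_concept_id": concept_id, "value_as_number": value, "unit": unit}

# Two sites recording the same concept under different names become comparable:
print(harmonize("hospital_a", "HR", 92, "bpm"))
print(harmonize("hospital_b", "PULSE", 95, "bpm"))
```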

Despite the limited number of ICU datasets, the flurry of excellent modeling work afforded by these freely available intensive care datasets has exposed a severe translational gap, with implementation at the bedside and demonstration of improved patient outcomes using those models proving very challenging [7, 8].

Bias in database origins and model validation/governance

There is a fundamental flaw in building AI-CDSS using existing EHRs and evaluating the models for accuracy against real-world data, given the health disparities present in these databases. This is a setup for encoding structural inequities in the algorithms, thereby legitimizing their existence and perpetuating them in a data-driven healthcare delivery system. Two sources of bias deserve attention: the social patterning of the data generation process [9] and the social determinants of care [10]. The social patterning of data generation pertains to how a patient is represented by her data during healthcare encounters. In an ideal world, everyone is cared for in an equitable fashion. But existing EHRs suffer from bias because of how patients and their care are captured. These biases are reflected, and may be reinforced, by AI through the processes of model development and deployment. Furthermore, models built on EHR data skewed toward a primarily Caucasian population may not model well African-American, Hispanic, or Asian patients [11, 12]. EHR databases that are representative of the demographics of the patients for whom the AI-CDSS is intended are necessary.
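
One concrete safeguard is to audit model performance stratified by demographic subgroup before any deployment. A minimal sketch, assuming scikit-learn, pandas, and a held-out validation set; the column names (y_true, y_prob, group) are illustrative:

```python
# Sketch of a pre-deployment fairness audit: performance stratified by
# demographic subgroup.

import pandas as pd
from sklearn.metrics import roc_auc_score, recall_score

def subgroup_audit(df: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    rows = []
    for group, sub in df.groupby("group"):
        if sub["y_true"].nunique() < 2:
            continue  # AUROC is undefined when a subgroup has only one class
        rows.append({
            "group": group,
            "n": len(sub),
            "auroc": roc_auc_score(sub["y_true"], sub["y_prob"]),
            "sensitivity": recall_score(sub["y_true"], sub["y_prob"] >= threshold),
        })
    return pd.DataFrame(rows)

# Large gaps between rows of this table flag potentially encoded
# inequities that warrant investigation before bedside use.
```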

To avert AI-legitimized and AI-enabled further marginalization of those already disproportionately burdened by disease and societal inequities, regulatory guardrails are needed. Such guardrails would be policies and/or incentive structures developed through continuous open dialogue and engagement with communities that are disproportionately burdened, marginalized, or unrepresented. But unless the ML community prioritizes the “who” (who is developing and deploying AI) and the “how” (whether there is transparency and accountability for responsible AI), these CDSS efforts will be less effective.

Designing clinical decision support systems for situational awareness

Situational awareness (SA) is foundational to decisions and actions in sectors like aviation and medicine [13]. Robust SA, which entails recognizing the relevant elements in an environment, understanding their meaning, and forecasting their short-term progression, is a prerequisite for sound decisions. Lapses in SA are a primary cause of safety-related incidents and accidents [13, 14]. SA continuously evolves, influenced by changing external circumstances and individual internal factors. Heavy workloads and fatigue, with diminishing mental capacity, can hinder a clinician’s ability to achieve and maintain SA in critical care environments. In contrast, extensive experience in a specific context can enhance SA, as familiarity guides what to focus on. Well-designed CDSS should improve SA.

For the foreseeable future, AI-based CDSS will work alongside human decision makers rather than as autonomous support systems. Such CDSS should transfer essential information to decision makers as quickly as possible and with the lowest possible cognitive effort [15]. User-centered, SA-oriented design is needed for the successful implementation of AI-CDSS. In complex and dynamic environments, AI-CDSS design should allow staff to clearly grasp information, reduce their workload, and strengthen their confidence in the diagnoses; these aspects promote staff acceptance and trust, which ultimately determine whether AI-CDSS are implemented.

A wide gap exists between health AI done right and implementations in practice. Building and deploying AI predictive tools in health care is not easy. The data are messy and challenging, and creating models that can integrate, adapt, and analyze such data requires a deep understanding of the latest ML strategies and the ability to employ them effectively. Presently, only a few AI-based algorithms have shown evidence of improved clinician performance or patient outcomes in clinical studies [6, 16, 17]. Reasons proposed for this so-called AI chasm [18] include a lack of the expertise needed to translate a tool into practice, a lack of funding available for translation, underappreciation of clinical research as a translation mechanism, disregard for the potential value of the early stages of clinical evaluation and of the analysis of human factors [19], and poor reporting and evaluations [2, 8, 20].

State-of-the-art tools and best practices exist for performing rigorous evaluations. For instance, the Developmental and Exploratory Clinical Investigations of DEcision support systems driven by Artificial Intelligence (DECIDE-AI) guideline [15] provides an actionable checklist of minimal reporting items, facilitating the appraisal of CDSS studies and the replicability of their findings. Early-stage clinical evaluation of AI-CDSS should also place a strong emphasis on validation of performance and safety, similar to phase 1 and 2 pharmaceutical trials, before efficacy evaluation at scale in phase 3. Small changes in the distribution of the underlying data between the algorithm training and clinical evaluation populations (i.e., dataset shift) can lead to substantial variation in clinical performance and expose patients to potentially unexpected harm [21, 22]. Human factors (or ergonomics) evaluations are commonly conducted in safety-critical fields such as the aviation, military, and energy sectors [23], assessing the effect of a device or procedure on its users’ physical and cognitive performance [24]. However, few clinical AI studies have reported on the evaluation of human factors [25]. The FDA recently released the “Artificial Intelligence and Machine Learning (AI/ML) Software as a Medical Device Action Plan,” which outlines its direction [24, 26], and the National Academy of Medicine has announced its AI Code of Conduct [27], but more work needs to be done. Clinical AI algorithms should be given the same rigorous scrutiny as drugs and medical devices undergoing clinical trials.
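
Dataset shift can be monitored with simple distributional checks. As a minimal sketch, the Python code below compares each input feature’s training distribution against its live distribution with a two-sample Kolmogorov-Smirnov test (SciPy); the significance cutoff, feature name, and simulated drift are illustrative assumptions, and production monitoring would add multiplicity control and clinically informed thresholds.

```python
# Minimal dataset-shift monitor: flag features whose live distribution
# differs from the training distribution. The 0.01 cutoff is arbitrary
# and for illustration only.

import numpy as np
from scipy.stats import ks_2samp

def detect_shift(train: dict, live: dict, alpha: float = 0.01) -> list:
    flagged = []
    for feature, train_values in train.items():
        stat, p_value = ks_2samp(train_values, live[feature])
        if p_value < alpha:
            flagged.append((feature, stat, p_value))
    return flagged

rng = np.random.default_rng(0)
train = {"lactate": rng.normal(2.0, 1.0, 5000)}
live = {"lactate": rng.normal(2.6, 1.2, 500)}   # simulated drift in the live population
print(detect_shift(train, live))
```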

Bridging the implementation gap in acute care environments

Timely interventions require early and accurate identification of the patients who may benefit from them. Two prominent examples relevant to critical care are models using readily available EHR data that can accurately predict clinical deterioration and sepsis hours before they occur [28,29,30,31,32,33]; these models exemplify real-time CDSS that alert clinicians and prompt evaluation, testing, and intervention [33]. Translation of these approaches into clinical intervention studies has improved outcomes [16, 34, 35]. Despite these systems’ early promise, important technical and social obstacles must be addressed to ensure their success. Indeed, the previously described “implementation gap” for medical AI extends to CDSS predicting clinical deterioration and sepsis [7].

Most CDSS development begins with retrospective data; these data often differ in quality and availability from data in the production EHR, which can degrade model performance during implementation [36, 37]. Further, outcome labels based on EHR data are generally proxies for real-world outcomes. Imprecise retrospective definitions unavailable in real time, such as billing codes, may compromise the validity of outcome labels [38, 39].

The clinical deterioration and sepsis CDSS models generated headlines for their high discrimination. While discrimination is important, more nuance is needed to understand whether a model is “good enough” to be used for individual patient decision-making. Even when discrimination is high, the threshold chosen for alerts may result in suboptimal sensitivity or excessive false alarms [40]. Balancing sensitivity with false alarms and lead time for alerts remains a persistent challenge, and the optimal balance varies by use case [41]. Also, performance variation across settings, case mix and time must be measured and addressed [42, 43]. Evaluating model fairness across socioeconomic groups is another critical consideration before model implementation.
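
The tradeoff can be made explicit by tabulating sensitivity, positive predictive value, and alert burden across candidate thresholds before go-live. A minimal Python sketch with synthetic data (the event rate, risk scores, and candidate thresholds are illustrative):

```python
# Sketch: operating-characteristic table for choosing an alert threshold.

import numpy as np

def threshold_table(y_true: np.ndarray, y_prob: np.ndarray, thresholds=(0.1, 0.2, 0.3, 0.5)):
    for t in thresholds:
        alert = y_prob >= t
        tp = np.sum(alert & (y_true == 1))
        sensitivity = tp / max(np.sum(y_true == 1), 1)
        ppv = tp / max(np.sum(alert), 1)        # low PPV means many false alarms per true case
        alerts_per_100 = 100 * np.mean(alert)   # alert burden placed on clinicians
        print(f"threshold={t:.2f}  sensitivity={sensitivity:.2f}  "
              f"ppv={ppv:.2f}  alerts/100 patients={alerts_per_100:.1f}")

rng = np.random.default_rng(1)
y_true = rng.binomial(1, 0.1, 2000)                          # 10% event rate
y_prob = np.clip(0.3 * y_true + rng.beta(2, 8, 2000), 0, 1)  # imperfect risk model
threshold_table(y_true, y_prob)

# Even a model with high AUROC can force a choice between missed cases
# (low sensitivity) and alarm fatigue (low PPV) at any single threshold.
```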

Information Technology infrastructure and expertise are also essential for implementing CDSS effectively. Vendors increasingly provide proprietary “turnkey” CDSS solutions for identifying clinical deterioration and sepsis [42,43,44]. While convenient, limitations include inconsistent transparency and performance, user experience constraints, and opportunity costs [45]. Alternative approaches may improve performance but generally require substantial resources and may be more vulnerable to “break-fix” issues and other challenges [46].

The social challenges to CDSS implementation are substantial. Successful implementation requires an understanding of intended users, their workflows and resources, and a vision of how these should change based on CDSS output [47, 48]. Implementation science methods offer guidance. Formative work might use determinant frameworks and logic models to understand which behaviors a CDSS is meant to influence, thereby informing clinical workflow [49, 50].

Efforts to comprehend expected user needs may increase trust and facilitate adoption. Model explainability also improves trust and CDSS adoption. The high complexity of many “black-box” ML models may preclude clinicians from valuing CDSS information when the output is incongruent with clinical intuition. Modern approaches to improving explainability include SHapley Additive exPlanations (SHAP), a model-agnostic approach, based on game theory, to visualizing predictor variable contributions to model output. User interface design for real-time CDSS requires expertise in human factors, may be limited by vendor software capabilities, and may require adherence to regulatory guidance from governmental agencies.
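
As an illustration, the sketch below applies the open-source shap package to a toy gradient-boosting model; the feature names and synthetic outcome are assumptions for the example, not a clinical model.

```python
# Sketch of SHAP-based explanation for a tree-ensemble risk model,
# assuming the `shap` package and scikit-learn. All data are synthetic.

import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "heart_rate": rng.normal(90, 15, 1000),   # illustrative EHR-style features
    "lactate": rng.normal(2.0, 1.0, 1000),
    "sbp": rng.normal(110, 20, 1000),
})
y = ((X["lactate"] > 2.5) & (X["sbp"] < 100)).astype(int)  # synthetic outcome

model = GradientBoostingClassifier().fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Cohort-level view: which predictors drive model output overall.
shap.summary_plot(shap_values, X)

# Patient-level view: why this patient received this risk score, the kind
# of display that can reconcile model output with clinical intuition.
shap.force_plot(explainer.expected_value, shap_values[0], X.iloc[0])
```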

CDSS must be paired with the ability to measure what matters to patients and clinicians. Evaluation frameworks from implementation science may facilitate CDSS evaluations, capturing elements of both efficacy and effectiveness [51]. Study design choices for implementation evaluation will depend on available resources, local factors, and the clinical problem. Pragmatic randomized trials and quasi-experimental designs offer advantages over pre-post designs or comparisons against historical controls [34, 52].

A roadmap to effective adoption of AI-based tools

The integration of AI into healthcare necessitates meticulous planning, active stakeholder involvement, rigorous validation, and continuous monitoring, including monitoring of adoption. Adhering to software development principles and involving end-users helps ensure successful CDSS adoption, ultimately resulting in improved patient care and enhanced operational efficiency. A dynamic approach that involves regular assessment and refinement of AI technology is essential to align it with evolving healthcare needs and technological advancements. Creating data cards, structured summaries of the essential facts about the various aspects of ML datasets needed by stakeholders across a project’s lifecycle for responsible CDSS development, is a useful and insightful initial step in this process. Figure 1 summarizes the issues addressed in this paper as a roadmap to effective CDSS completion, and Table 2 itemizes obstacles to efficient CDSS rollout and potential solutions.
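
In practice, a data card can start as a small, structured, version-controlled artifact. A minimal sketch follows; the field set is an illustrative example, not a mandated schema, and real data cards are considerably richer.

```python
# Minimal illustrative data card for an ML dataset; all values are examples.

from dataclasses import dataclass, asdict
import json

@dataclass
class DataCard:
    name: str
    source: str
    time_span: str
    population: str
    known_biases: list
    outcome_definition: str
    intended_use: str

card = DataCard(
    name="example-icu-cohort-v1",
    source="Single-center EHR extract (illustrative)",
    time_span="2015-2022",
    population="Adult ICU admissions; demographics summarized separately",
    known_biases=["referral patterns", "documentation differences across units"],
    outcome_definition="In-hospital mortality derived from discharge disposition",
    intended_use="Model development and fairness auditing, not clinical deployment",
)
print(json.dumps(asdict(card), indent=2))
```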

Fig. 1

The process of creating, testing, and launching an effective Clinical Decision Support System (CDSS) is multifaceted and ongoing. The interaction of multiple processes and the involvement of various stakeholders along the way improve the likelihood of final adoption during real-world deployment (dashed vertical line). Important, as illustrated in this workflow diagram, is the ongoing assessment that refines models and information transfer options. At the start, one uses a model card, a short document that provides key information about a machine learning model. This is central to maintaining focus throughout the workflow cycle of CDSS development

Table 2 Guide to Addressing Obstacles in CDSS Development

It is vital to have a designated owner to define the problem the AI technology is intended to solve and to oversee its design and deployment [53]. Involving a broader array of stakeholders before the pilot phase is equally crucial. This inclusive approach encourages early feedback and insights before full deployment, enhancing potential adoption and ensuring effective communication with the owner throughout the pilot deployment. Engaging a representative team of stakeholders deepens their understanding of the technology and supports its seamless integration into existing workflows. This involvement should encompass a diverse range of end-users, including medical professionals, clinicians, patients, caregivers, and other stakeholders within the clinical workflow. Early and active engagement of stakeholders in design processes ensures alignment of the technology with its intended objectives, its smooth integration into established healthcare processes, and early prevention of safety risks and biases.

User Acceptance Testing forms a critical step in the software qualification process, in which end-users rigorously evaluate the technology's functionality and, in the case of CDSS, their agreement with the model output. It can inform false-positive and false-negative risks. This evaluation ensures that the technology aligns with their specific needs and expectations. The User Acceptance Testing phase offers invaluable insights into requirements, integration options, and validation of AI-CDSS outputs against those requirements, and contributes significantly to interface design improvements. Human factors studies can be performed to demonstrate technology usability [54]. Usability factors and empirical measures can also be used in the testing phase [55]. Involving end-users in testing greatly facilitates the technology's meeting its intended use and cultivates a sense of ownership, empowering end-users with a deeper understanding of how the technology integrates into their workflow and further enhancing its overall effectiveness. Before the AI-CDSS is introduced into the workflow, needs-adjusted training and instructions for its use can facilitate AI-CDSS acceptance [56].

Effectively measuring and monitoring AI technology adoption is pivotal for evaluating its real-world effectiveness and pinpointing areas for enhancement. Utilizing quantitative metrics, such as tracking interface interactions like button clicks, provides data on user engagement, shedding light on usage patterns. Concurrently, surveys and qualitative interviews, focus groups, and direct observation offer deeper insights into user experiences and perceptions. This dual approach enables healthcare organizations to refine the technology, prioritizing user satisfaction and feedback [57]. It also serves as an avenue for end-users to voice safety concerns and broader issues. Real-world deployment necessitates a consistent feedback mechanism, since end-users might override recommended actions or decisions or disagree with AI-CDSS output. This feedback should be systematically shared with the development team or relevant organization, capturing information on agreement with the technology’s output and recommended decisions or actions. This process is akin to documenting protocol deviations in clinical trials and should encompass any safety concerns or other issues, such as bias. A comprehensive root cause analysis of disagreements, along with mitigation strategies, should be recorded at the point of care, enhancing the overall safety and efficacy of the technology.
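
A minimal sketch of what such quantitative adoption tracking might look like; the event names, actions, and fields are illustrative, and a real system would write to an audit-grade store and feed results back to the development team.

```python
# Sketch of adoption telemetry for a deployed CDSS alert: log each alert
# and the clinician's response, then summarize override rates.

from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class AlertEvent:
    alert_id: str
    clinician_action: str   # e.g., "accepted", "overridden", "dismissed"
    reason: Optional[str]   # free-text reason captured at the point of care
    timestamp: datetime

log: list = []

def record(alert_id: str, action: str, reason: Optional[str] = None) -> None:
    log.append(AlertEvent(alert_id, action, reason, datetime.now(timezone.utc)))

record("sepsis-0001", "accepted")
record("sepsis-0002", "overridden", reason="already receiving antibiotics")

actions = Counter(e.clinician_action for e in log)
override_rate = actions["overridden"] / max(len(log), 1)
print(actions, f"override rate = {override_rate:.0%}")
```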

The transition from the pilot phase to general deployment marks a pivotal stage in AI adoption. Successful pilot deployments act as a springboard for broader adoption [58]. Human trust is an important factor, and further education on AI, along with transparent information, can build this trust for clinicians and patients [59,60,61]. Identifying and leveraging technology champions within the healthcare system can profoundly influence the dissemination of the technology’s value. These advocates play a vital role in communication campaigns, training, and facilitating a seamless transition to widespread deployment, ensuring a comprehensive understanding of the technology’s benefits.

Governance and regulatory considerations

The rapid advances in AI, and in particular the release of publicly available generative AI applications leveraging advanced large language models, have greatly accelerated discussions of the promises and pitfalls of AI deployment in society and healthcare [7,8,9,10,11,12,13,14,15,16,17]. Heightened concerns about the development and deployment of AI have generated discussion about how to ensure that AI remains ‘aligned’ with human objectives and interests. As a result, a rapidly evolving set of regulations is being drafted by a wide variety of regional, federal, and international governing bodies and is expected to become formalized over the next three years; examples include the World Health Organization’s report on the Ethics and Governance of AI for Health and the European Union’s report on AI in Healthcare: Applications, Risks, and Ethical and Societal Impacts. In the USA, the White House’s Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People, the National Institute of Standards and Technology’s Artificial Intelligence Risk Management Framework, and the Food and Drug Administration’s guidance on Software as a Medical Device and Clinical Decision Support Devices do the same. These documents highlight several governing principles for safe AI, including that the technology should: do no harm; be safe; be accurate; be free of bias and discrimination; preserve privacy; be intelligible to end-users; be monitored on an ongoing basis; and address consent for use. These principles align closely with those of effective, safe, and equitable healthcare delivery, yet AI poses novel challenges given its dependence on rapidly evolving and increasingly complex algorithmic underpinnings.

The AI workforce

The rapid growth of AI has accelerated discoveries across diverse scientific fields, affected every work environment [62,63,64,65], and is reshaping the labor market with unprecedented speed and scale, with 40% of the global workforce expected to require significant AI upskilling or reskilling [66]. The rapid adoption of AI into healthcare and clinical research is an opportunity to transform how we discover, diagnose, treat, and understand health and disease. The American Medical Association supports this vision of human–machine collaboration by rebranding the AI acronym as “augmented intelligence” [67]. AI-augmented clinical care requires an AI-literate medical workforce, but we presently lack sufficiently skilled workers in medical domain-specific AI applications. Many biomedical and clinical science domain experts lack a foundational understanding of AI systems and methodologies, and there is currently not enough opportunity for rapid AI training in clinical medicine and research. AI tools and systems require increasingly less underlying mathematical or technical knowledge to operate, in line with the US Food and Drug Administration (FDA) processes to authorize AI algorithms as “software as a medical device” [68]. As evidenced by NIH Common Fund programs (AIM-AHEAD and Bridge2AI), there is a universally acknowledged AI training gap and a clear need for accessible and scalable AI upskilling approaches to help raise the first global generation of AI-ready healthcare providers.

The future ICU workforce will require specialized AI critical care training that prioritizes a conceptual AI framework and high-level taxonomies over programming and mathematics. Clinicians must understand the indications and contraindications of relevant clinical AI models, including the ability to interpret and appraise published models and training datasheets associated with a given AI tool across various demographic populations [69, 70]. AI training programs in critical care must also be agile enough to adapt to rapid shifts in the AI landscape. Last, these programs should instill in trainees a fundamental working knowledge of bias, fairness, trust, explainability, data provenance, and responsibility and accountability.

It is essential that the diversity of AI researchers mirror the diverse populations they serve. There are significant gaps in gender, race, and ethnicity [71, 72]. A lack of diverse perspectives can negatively affect resulting products, as has plagued the AI field for years [73, 74]. The 2022 Artificial Intelligence Index Report states that 80% of new computer science PhDs specializing in AI were male and 57% were White, proportions that have not changed significantly since 2010. There is thus a critical need for nationwide academic-industrial collaborative training programs to fund, develop, and mentor diverse AI researchers to ensure AI fairness in biomedical research [75].

Conclusion

AI is here to stay. It will permeate the practice of critical care and has immense potential to support clinical decision making, alleviate clinical burden, educate clinicians and patients, and save lives. Yet, although this complex, multifaceted, and rapidly advancing technology will reshape how healthcare is provided, it brings along deep ethical, fairness, and governance issues that must be addressed in a timely fashion.