Introduction

Sepsis is a life-threatening condition characterised by organ dysfunction caused by a dysregulated immune response to infection [1]. It is a global health priority estimated to cause 2.9 million deaths in children under 5 each year [2]. Although many factors contribute to sepsis-related mortality and morbidity, the failure to make a prompt and accurate diagnosis plays a key role and is a leading cause of preventable harm. Rapid and accurate identification of sepsis is challenging because it usually presents with subtle or non-specific signs and symptoms, particularly early in the clinical course. Once sepsis is established, the risk of mortality remains unacceptably high despite appropriate treatment interventions [3]. Early detection of sepsis and identification of patient subgroups that will benefit from specific treatment interventions are therefore critical to improving sepsis outcomes.

In simple terms, the field of artificial intelligence (AI) applies computer science to data to solve problems. AI is differentiated from other statistical and computational methods by its emphasis on learning, which provides significant analytical advantages over conventional rule-based approaches. The most established examples of AI in clinical medicine involve image analysis, with improvements in, for example, the detection of diabetic retinopathy [4], the diagnosis of colorectal cancer [5], and the prediction of lung cancer risk [6]. AI exhibits a spectrum of autonomy: as algorithms become increasingly complex, human participation and understanding decline. Deep learning approaches are, however, able to deal with large, multimodal and unstructured datasets and with complex data relationships. This capability lends itself to the study of a heterogeneous multi-system disorder like sepsis. Generative AI and deep learning can derive insights from diverse multi-dimensional datasets including laboratory values, demographic data, healthcare professional notes, vital signs, radiological imaging, microbiological data, histopathological images and omics-related (e.g. genomic, transcriptomic, proteomic and metabolomic) data. In other scientific domains, AI is revolutionising the development of new therapeutics, including antimicrobials, by predicting the structure of chemical compounds and identifying therapeutic targets [7].

Despite these opportunities, there are substantial challenges for AI systems to overcome if they are to improve healthcare delivery. In this narrative review, we aim to address these opportunities and challenges for AI to improve the outcomes of children with sepsis in the intensive care unit.

Opportunities for Artificial Intelligence to Improve Sepsis Outcomes in Children

Early Recognition

Existing tools to improve the early recognition of sepsis in children rely upon a constellation of physiological variables (such as heart rate), clinical risk factors, examination findings and difficult-to-quantify contextual information such as parental or clinician ‘concern’ [8]. Previous research efforts have sought to identify diagnostic biomarkers and to construct decision support tools to assist clinicians with the recognition of serious infections, including sepsis, in children [9, 10]. With few exceptions, these efforts have failed to impact clinical decision making in childhood sepsis. More recently, researchers have derived and validated gene transcript signatures to differentiate children with and without sepsis [11] or to guide treatment decisions in serious infections [12]. These more sophisticated approaches have the potential to improve diagnostic discrimination but are themselves likely to be limited because they represent a snapshot of the immune response in an often rapidly evolving clinical presentation. The large number of potentially informative, time-varying indicators of sepsis and serious infection constitutes a data problem that has so far proven intractable.

The capacity of AI to interrogate complex datasets presents an exciting opportunity (Table 1), placing AI in a position to broadly re-formulate the clinical evaluation of the child with suspected sepsis. Several studies have demonstrated that AI-based tools derived from electronic health records (EHRs) and omics-related data can be used successfully for early prediction, phenotype characterisation, prognostication and treatment personalisation of sepsis in children and neonates [13,14,15,16,17,18]. Real-time data derived from a range of bedside monitors, including electrocardiogram waveforms and infant cry signals, have also been evaluated for the early prediction of sepsis in children and neonates [19, 20]. Inclusion of real-time data in a recurrent neural network model resulted in forecasts that became progressively more accurate over time [21]. The capacity of AI to incorporate a range of dynamic data inputs into accurate and continuous predictions of risk represents a step-change in diagnostic capability. Developments in large language models (LLMs) are revolutionising the integration of AI within and beyond healthcare. These LLMs, including BERT, GPT-4, and PaLM, employ advanced natural language processing (NLP) techniques, architectures and extensive training data, making them more capable of processing complex inputs and producing refined outputs [22,23,24]. The integration of these models into chatbots, such as ChatGPT, Bard, and Claude, amplifies their functionality through natural, human-like conversational interactions [25,26,27]. These technologies hold significant promise. Developing domain-specific models trained with sepsis-related data and case studies is essential to generate meaningful responses. The deployment of such systems through globally available chatbots could assist individuals in low-resource settings where access to specialised healthcare resources may be limited.
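As a concrete illustration of the recurrent, continuous-prediction approach described above, the sketch below shows a minimal PyTorch model that re-estimates sepsis risk at every timestep as new monitor data arrive. The feature set, sampling rate and architecture are assumptions for illustration only, not the models used in the cited studies.

```python
import torch
import torch.nn as nn

class SepsisRiskRNN(nn.Module):
    """Minimal recurrent model emitting an updated sepsis risk at every timestep."""
    def __init__(self, n_features: int = 5, hidden: int = 32):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, vitals: torch.Tensor) -> torch.Tensor:
        # vitals: (batch, time, n_features), e.g. heart rate, respiratory rate,
        # temperature, SpO2 and mean arterial pressure sampled each minute
        hidden_states, _ = self.gru(vitals)
        return torch.sigmoid(self.head(hidden_states)).squeeze(-1)  # (batch, time)

# One hour of minute-by-minute observations for a single (simulated) patient
model = SepsisRiskRNN()
stream = torch.randn(1, 60, 5)        # stand-in for normalised monitor data
risk_trajectory = model(stream)       # risk is re-estimated at every minute
print(risk_trajectory[0, -1].item())  # most recent risk estimate
```

In a deployed system each new observation would extend the sequence, so the forecast can sharpen as the clinical trajectory unfolds.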

Table 1 Opportunities and challenges of artificial intelligence to improve sepsis outcomes in children

While contemporary literature has focused on comparing the performance of AI-based tools to that of human decision-making, opportunities to successfully implement AI are more likely to be found where healthcare practitioners actively collaborate with AI tools [28, 29]. The usefulness of human-AI collaboration (‘humans in the loop’) will likely depend on the type of task and the clinical context. Some studies have indicated that clinicians and AI together perform better than either alone [30], whereas others have shown that human-AI collaboration provides no additional benefit over AI alone [31]. This also appears to be influenced by the clinician’s level of expertise; trainees seem to benefit more from AI input than their more experienced colleagues [32]. The accuracy of AI systems significantly affects clinicians’ willingness to collaborate with them, while the acceptability of an AI system increases when its use is supported by clinical leaders and peers [33•]. A human-AI collaboration approach might also improve ‘explainability’ and build trust among users. Understanding how clinicians and AI systems interact to improve the care of children with sepsis is an important component of future studies.

Unsupervised Learning and Sepsis Phenotypes

Sepsis is a heterogeneous and crudely defined syndrome. The definition of paediatric sepsis is still based on the 2005 International Paediatric Sepsis Consensus Conference [34], but this definition, combining systemic inflammation with suspected infection, is too non-specific to be meaningful to either clinicians or researchers. The Third International Consensus Definitions for Sepsis and Septic Shock (Sepsis-3) advanced the definitions of sepsis and septic shock in adults through the incorporation of organ dysfunction criteria [1], and work is underway to adapt these definitions to the specific characteristics and presentations of sepsis in children [35]. This work is necessary to understand the true burden of disease and to guide improvements in clinical care and study design. Even so, as currently conceived, sepsis definitions still fail to reflect the range of clinical phenotypes that are likely to have different outcomes and that may respond differently to intervention. Unsupervised learning can derive new clinical phenotypes in adults with sepsis that show clear differences in treatment outcomes. Simulations of randomised controlled trial results, in which the proportion of patients with different clinical phenotypes was varied, support the hypothesis that new clinical phenotypes derived from large multi-dimensional datasets may enable effective personalised sepsis interventions [36].
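As a minimal sketch of phenotype derivation by unsupervised learning, the example below clusters simulated admission variables with k-means and selects the number of candidate phenotypes by silhouette score. The variables and data are hypothetical stand-ins, not a published phenotyping pipeline.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Stand-in for admission variables (e.g. lactate, platelets, creatinine, CRP)
X = rng.normal(size=(500, 4))
X_std = StandardScaler().fit_transform(X)  # cluster on comparable scales

# Choose the number of candidate phenotypes from the data rather than a priori
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_std)
    print(k, round(silhouette_score(X_std, labels), 3))
```

Candidate phenotypes derived this way would still require clinical validation and testing for differential treatment response.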

Early Antimicrobial Optimisation

Treatment outcomes in sepsis are improved by early, effective antimicrobial use. Antimicrobial susceptibility testing (AST) is required to inform antibiotic choice, but traditional culture-based AST takes days to provide meaningful results. Given the large datasets generated by mass spectrometry and whole genome sequencing (WGS) analyses, AI can be used to reduce the time to pathogen identification and to infer antimicrobial susceptibility [37•, 38]. In clinical settings with a substantial burden of multidrug-resistant organisms, AI can shorten the time to effective therapy. It may also improve antimicrobial stewardship by ensuring that the narrowest-spectrum appropriate antibiotic replaces broad-spectrum empirical agents as quickly as possible, reducing the collateral damage associated with broad-spectrum antibiotics [39].
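The sketch below illustrates the general form of genomic susceptibility inference: a classifier is trained on the presence or absence of genomic markers derived from WGS to predict the phenotypic AST label. The marker matrix and labels are simulated placeholders, not a validated pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# Rows: bacterial isolates; columns: presence/absence of genomic markers
# (e.g. known resistance genes or k-mers) derived from WGS
X = rng.integers(0, 2, size=(300, 200))
y = rng.integers(0, 2, size=300)  # phenotypic AST label: resistant vs susceptible

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```

Because sequencing can be completed before culture-based AST, even a modestly accurate classifier of this kind could bring forward the switch from empirical to targeted therapy.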

Challenges to Overcome for Artificial Intelligence to Improve Sepsis Outcomes in Children

Data Quality, Governance, Transparency

One of the major challenges in this context is accurately labelling children and neonates with sepsis. The use of International Classification of Diseases (ICD) codes to classify sepsis fails to provide accurate labelling [40, 41]. A lack of consensus in neonatal sepsis definitions is a particular problem, leading to some neonatal AI sepsis tools being based on culture positivity and others on clinical assessment, independent of culture results [42, 43]. Large, well-curated and expert-labelled datasets are essential, but they presuppose accurate classification and labelling; in reality, sepsis definitions remain non-specific and controversial. What is the classifier trying to predict? This is a major hurdle for supervised learning systems, fostering interest in weakly supervised and unsupervised systems that require less labelling effort while also presenting opportunities for the derivation of new sepsis phenotypes [44].

Supervised machine learning models perpetuate errors in their training datasets, which hampers the construction of clinically useful AI tools. Algorithms are derived from data dominated by majority groups and may perform inadequately in groups that are poorly represented in the dataset, a problem termed algorithmic bias. Algorithms should be developed and validated in samples representative of the population in which they will be used, and performance should be evaluated in subgroups defined by age, sex, ethnicity, economic group and location [45]. Imbalanced datasets with a low proportion of sepsis cases may yield misleading classification and erroneously treat infrequent events (including sepsis) as noise [46].
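One common mitigation for the class-imbalance problem just described is to re-weight the loss so that rare sepsis cases are not treated as noise. The sketch below, on simulated data with roughly 2% prevalence, uses scikit-learn's class_weight="balanced" option and reports per-class metrics rather than raw accuracy; all data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 10))
y = (rng.random(5000) < 0.02).astype(int)   # ~2% sepsis prevalence

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# "balanced" up-weights the rare positive class during fitting
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te), digits=3))
```

Note that a naive model predicting 'no sepsis' for everyone would score 98% accuracy here, which is why per-class sensitivity and specificity matter more than overall accuracy.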

The development of AI systems for paediatric sepsis is complicated by the relatively lower event rate in children and by age-related heterogeneity in the pathobiology of childhood sepsis [47]. For example, risk factors such as prematurity and low birth weight significantly increase the risk of sepsis-associated mortality in neonates, but not in older children and adolescents. Additionally, sepsis symptoms in neonates overlap significantly with normal preterm physiology, creating a problem of false alarms in neonatal units already prone to alarm fatigue [48, 49]. To overcome this particular problem, technologies that record motion, sound and video may be incorporated into emerging AI systems, which will likely provide valuable insights into how babies respond to care and treatments.

Multi-centre studies are necessary to derive reliable and generalisable AI tools but present significant data governance challenges. Studies of AI in sepsis are most commonly single-centre retrospective designs [35, 50], limiting their generalisability. In a recent meta-analysis, only three prospective studies and one RCT of AI systems for sepsis prediction were identified, and none in children [51]. A notorious pitfall in developing AI models is the risk of overfitting, which arises when a trained model is evaluated only on the dataset from which it was derived and fails to generalise to external cohorts [52, 53]. Multi-centre studies require access to substantial data and computer science infrastructure and capability; moreover, the pre-processing of multiple datasets remains a challenging and time-consuming task. Restrictions on data sharing limit the availability of data from multiple centres and prevent the development of widely generalisable systems. Federated learning, in which an AI algorithm is trained through multiple independent sessions, each using its own local dataset, may offer a solution to the challenge of data sharing between different hospitals or jurisdictions. With this approach, a common AI model can be created without data sharing, thereby tackling critical challenges such as data security, data privacy and access to heterogeneous data [54]. Addressing data privacy and security is vital to ensure the safe implementation of these technologies. Models that can operate locally at the device level with robust encryption and access controls will further enhance data security and mitigate privacy concerns [55].
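To make the federated learning idea concrete, the sketch below implements a toy version of federated averaging (FedAvg) for logistic regression across four simulated hospital sites: each site computes a local update on its own data, and only model weights, never patient records, are shared with the coordinator. The sites, data and hyperparameters are illustrative assumptions.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One site trains locally; only the updated weights leave the site."""
    w = weights.copy()
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))       # logistic predictions
        w -= lr * X.T @ (p - y) / len(y)   # gradient step on local data only
    return w

rng = np.random.default_rng(3)
sites = [(rng.normal(size=(200, 8)), rng.integers(0, 2, size=200))
         for _ in range(4)]                # four hospitals' private datasets
global_w = np.zeros(8)

for _ in range(10):                        # communication rounds
    local_ws = [local_update(global_w, X, y) for X, y in sites]
    global_w = np.mean(local_ws, axis=0)   # coordinator averages the weights
```

Production systems add secure aggregation, differential privacy and handling of non-identically distributed site data, but the data-stays-local principle is the same.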

Furthermore, alterations in the characteristics of day-to-day practices, healthcare systems and patient populations over time can cause data shift or data drift, which in turn can adversely affect model performance [56]. Continuous updating of AI tools, in addition to external validation, is therefore needed to maintain adequate performance in daily practice. The US Food and Drug Administration (FDA) has proposed a blueprint for adaptive AI systems, in which it would approve not only an initial model but also a process to update models over time [57].
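A simple form of post-deployment drift surveillance is to compare the distribution of incoming inputs against the training data. The sketch below applies a two-sample Kolmogorov-Smirnov test to one simulated feature; the variable, values and significance threshold are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(4)
training_hr = rng.normal(120, 15, size=5000)  # heart rates seen at training time
recent_hr = rng.normal(128, 15, size=500)     # heart rates from the live system

# A significant KS statistic flags a shift in the input distribution
stat, p_value = ks_2samp(training_hr, recent_hr)
if p_value < 0.01:
    print(f"Possible data drift (KS={stat:.3f}); review calibration/retraining")
```

Drift in model inputs does not always degrade predictions, so distribution checks are usually paired with ongoing monitoring of outcome-linked performance.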

The reporting of AI system performance is an additional challenge. The diagnostic discrimination of prediction models is typically reported using the area under the receiver operating characteristic curve (AUROC or AUC). The AUC provides a global measure of discrimination and is equivalent to the concordance or c-statistic: the probability that a randomly selected subject with the outcome of interest has a higher predicted probability than a randomly selected subject without the outcome. The AUC can be misleading as a measure of model performance because it does not take prior probability into account, provides no information on the distribution of model errors, and weights omission and commission errors equally [58, 59]. As a result, risk estimates may be unreliable even in models with good discrimination [60]. Put simply, translating this global measure of discrimination into metrics that are meaningful to clinicians is challenging. Evaluations should quantify other measures of accuracy, including sensitivity and specificity at clinically meaningful thresholds, and likelihood ratios, which convert pre-test odds into post-test odds in the presence of a positive or negative test.
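The worked example below, using hypothetical confusion-matrix counts at a single alert threshold, shows how sensitivity, specificity and likelihood ratios translate into the post-test probability a clinician actually needs at the bedside. The counts and assumed prevalence are not drawn from any cited study.

```python
# Hypothetical counts at one alert threshold (not from any cited study)
tp, fn, fp, tn = 80, 20, 150, 750

sensitivity = tp / (tp + fn)              # 0.80
specificity = tn / (tn + fp)              # ~0.83
lr_pos = sensitivity / (1 - specificity)  # positive likelihood ratio, ~4.8
lr_neg = (1 - sensitivity) / specificity  # negative likelihood ratio, ~0.24

prevalence = 0.05                         # assumed pre-test probability of sepsis
pre_odds = prevalence / (1 - prevalence)
post_odds = pre_odds * lr_pos             # odds after a positive alert
post_prob = post_odds / (1 + post_odds)   # ~0.20
print(f"LR+={lr_pos:.1f}  LR-={lr_neg:.2f}  P(sepsis | alert)={post_prob:.2f}")
```

Here a positive alert raises the probability of sepsis from 5% to about 20%, a far more actionable statement than a global AUC.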

Reporting Guidelines and Implementation

Most AI studies assess the performance of algorithms on retrospectively collected static datasets, whereas physicians seek to implement these models in dynamic, real-world hospital settings using real-time data [61]. This contrast limits the applicability of these algorithms in daily practice. For AI systems to be effectively implemented, the processes of algorithm development and validation need to be transparently reported. Reporting guidelines that reflect the specific requirements of AI system evaluations are now established, extending from the reporting of diagnostic or prognostic accuracy (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis-AI, TRIPOD-AI [62]), through early clinical validation (Developmental and Exploratory Clinical Investigations of DEcision support systems driven by Artificial Intelligence, DECIDE-AI [63•]), to clinical trials (CONSORT-AI [64]). These guidelines are essential to improve transparency and ensure the replicability of study findings, thereby supporting successful implementation.

Challenges Specific to LLMs

To establish the reliability and effectiveness of LLMs in clinical practice, several challenges need to be addressed. Firstly, existing models are predominantly trained on general datasets and are prone to errors termed ‘hallucinations’, whereby a model invents information. These limitations can significantly reduce the accuracy of a model’s responses, particularly in specific domains like sepsis. Recent progress, such as integrating constitutional AI to produce harmless AI systems, improving training techniques, and utilising medical resources as training data, has shown promise [25, 55, 65,66,67]. Nevertheless, developing domain-specific models trained with sepsis-related data and case studies is essential to generate reliable model outputs.

Challenges of Implementing Artificial Intelligence Tools in Daily Practice

Systems

Significant challenges remain if we are to harness AI in healthcare systems. The translation of AI innovations into clinical care is limited by healthcare systems in which data are rarely available for real-time analysis. Data may be siloed in a range of systems which separately document pathology results, physiological variables and clinical notes. Unified data formats such as Fast Healthcare Interoperability Resources (FHIR) may support the aggregation of data, but adequate investment in local analytics infrastructure is essential to allow data collection, curation, transformation and analysis [68]. Learning healthcare systems, which incorporate research and innovation to enable rapid improvements in the quality of care driven by continuous analysis of routinely collected data, may have a role in successfully implementing AI systems [69].
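As a sketch of how a standard FHIR interface supports such aggregation, the example below queries a hypothetical FHIR server for recent heart-rate Observations. The base URL and patient identifier are placeholders; the search parameters and LOINC code follow standard FHIR conventions.

```python
import requests

BASE = "https://fhir.example-hospital.org"   # hypothetical FHIR endpoint
resp = requests.get(
    f"{BASE}/Observation",
    params={"patient": "12345", "code": "8867-4",  # 8867-4: LOINC heart rate
            "_sort": "-date", "_count": 50},
    headers={"Accept": "application/fhir+json"},
    timeout=10,
)
bundle = resp.json()

# Each bundle entry is an Observation resource with a numeric valueQuantity
heart_rates = [
    e["resource"]["valueQuantity"]["value"]
    for e in bundle.get("entry", [])
    if "valueQuantity" in e["resource"]
]
print(heart_rates)
```

Because conformant systems expose the same resource shapes, the same client code can in principle aggregate data across institutions.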

Demonstrating the Clinical Validity of AI Systems

Following the development and validation of novel AI algorithms, deployment may be undertaken first in ‘silent mode’, that is, running the algorithm without incorporating its results into routine practice. Deployment of an AI system in daily practice requires a multi-faceted approach that includes all relevant stakeholders [70]. Large-scale, well-conducted RCTs are arguably the most important step toward illuminating the impact of AI tools on clinically meaningful outcomes. To allow the results of RCTs to be interpreted and the usefulness of AI systems to be rigorously investigated, methods for conducting and reporting RCTs should be standardised [71, 72]. There are presently no published RCTs of an AI-driven approach to sepsis in children. Once an AI system has been integrated into the hospital system, it is vitally important to provide recurring training to healthcare professionals to ensure its proper implementation. Since the usefulness of AI tools can be greatly influenced by the way people provide input and evaluate output, regulatory authorities should mandate that human factors be tested and that adequate training be provided for end users of AI systems [73].

Explainability and Collaboration Between Clinicians and AI

Contemporary AI techniques can reveal complex relationships within multidimensional data that are challenging to understand and explain to clinicians. Clinical implementation is likely to succeed best where models derived from AI are not only accurate but also transparent, interpretable and actionable [74]. The consequences associated with incorrect model predictions, in this context the recognised consequences of a missed diagnosis of sepsis, argue strongly in favour of explainability. An explainable system allows for the recognition of errors and the identification of bias or confounding. The best performing models, such as deep neural networks, are, however, often the least explainable, leading to a trade-off between performance and explainability. Work to enhance the explainability of complex AI systems by reporting the relative importance of model constituents is developing rapidly. Measures such as Shapley additive explanations (SHAP), for example, have been developed to estimate and visualise the contribution of individual model features in complex and highly accurate ensemble decision trees [75].
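A brief sketch of the SHAP approach just mentioned: for tree-based models, the shap library decomposes each individual prediction into additive per-feature contributions. The model and data here are simulated placeholders, not a clinical model.

```python
import numpy as np
import shap                                   # pip install shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 6))                 # stand-in clinical features
y = rng.integers(0, 2, size=400)

model = GradientBoostingClassifier().fit(X, y)

# TreeExplainer attributes each prediction to additive feature contributions
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:10])
print(np.asarray(shap_values).shape)          # one attribution per patient per feature
```

At the bedside, such attributions could indicate, for example, that a rising lactate contributed most to an individual alert, supporting the error recognition described above.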

This issue can be partially resolved by involving clinical end users (and consumers) in every step of the development process of these tools. Clinicians need to understand how the application of AI tools will impact clinical workflows and improve patient care, something rarely examined and reported. In the pre-deployment phase of an AI system, it should be determined whether the system outputs will be managed by a designated team or by attending physicians. Although the first option seems to reduce alarm fatigue among attending physicians, some physicians may find it disruptive. System alerts must balance sensitivity against alert fatigue, taking into consideration the clinical importance of sepsis, its prevalence and the acuity of the patient population [76]. Alerts may be accompanied by additional recommendations or information and sent to healthcare providers as messages via email, phones or personal devices, or through existing hospital EHR systems. This affirms the importance of EHR systems with the functionality to support complex AI systems [77]. Additionally, surveillance of an AI tool’s diagnostic performance should be ongoing after implementation, permitting adjustment of alerting thresholds if necessary. As another solution to alert fatigue, alerts can be classified as hard alerts, which should elicit an immediate response, or soft alerts, which can be managed more flexibly [33•, 78••, 79], as sketched below.
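The sketch below shows one minimal way such two-tier alerting might be expressed in code; the threshold values are illustrative assumptions that would be tuned against the observed alert burden after deployment.

```python
def classify_alert(risk: float, hard: float = 0.8, soft: float = 0.4) -> str:
    """Two-tier alerting: hard alerts demand immediate review,
    soft alerts join a routine review worklist, the rest stay silent."""
    if risk >= hard:
        return "hard"   # e.g. page the attending team immediately
    if risk >= soft:
        return "soft"   # e.g. add to the next scheduled review round
    return "none"

for risk in (0.15, 0.55, 0.92):
    print(risk, classify_alert(risk))
```

Thresholds of this kind would be revisited as part of the ongoing performance surveillance described above.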

There is limited evidence to guide the process of AI implementation in sepsis or other clinical domains. Frameworks such as the recently published SALIENT framework offer insight into the barriers and enablers and a roadmap to successful AI implementations [61].

Ethical and Regulatory Challenges

Accountability for the clinical impact of AI systems is presently unclear. If a clinically validated AI system fails or produces predictions that result in harm, will healthcare providers, developers or regulators be held accountable [80]? AI developers may begin to exert substantial influence on healthcare service provision. These developers have a responsibility to create safe systems, ensure data security and responsibly influence public views on health [81]. The implementation and maintenance of AI tools are complicated where the input data and the AI system have different owners. Significant investments must be made in data transfer and processing to ensure optimal data security, to bring stakeholders together, to develop trust-based relationships and to define roles and responsibilities. There are many standards for assuring the safety of digital systems [82, 83]. However, these standards are not well suited to AI-driven systems, which evolve through learning and without explicit intervention [84]. Although the FDA has taken some steps to address these issues, many healthcare systems lack a regulatory framework to assure the safe application of AI in healthcare [67]. A recent report from the UK Care Quality Commission concluded that further work is needed to clarify how healthcare facilities should implement AI systems into clinical workflows [85].

An aspiration for the use of AI should be to make medicine more human, not less, prioritising more time between clinicians and patients and limiting excessive automation. The potential transformation of healthcare delivery through AI should prompt serious collective reflection from clinicians, healthcare providers and AI developers regarding the ethics of healthcare provision.

Barriers to Implementing Artificial Intelligence for Sepsis in Children in Low- and Middle-Income Countries

The global burden of sepsis-associated deaths in children and neonates falls disproportionately upon low- and middle-income countries (LMICs) [2]. The vast inequity in resources between high- and low-income settings means that, despite the enormous public health impact, the communities most affected by sepsis are the least likely to benefit from technological progress. An ethical approach to AI implementation would ensure that its benefits are shared. Not only is sepsis more prevalent in children in LMICs, it is also more difficult to treat, being more often associated with antimicrobial-resistant infections [86]. There is an urgent need for better diagnostics, both to improve early sepsis recognition and to guide more judicious use of antibiotics. Many novel diagnostic platforms have limited suitability in LMICs because of cost and the requirements for technical expertise and laboratory facilities [87, 88]. Paradoxically, AI tools could be a solution to this unmet need. While AI systems derived from extensive EHR datasets are unlikely to be widespread in these settings, alternative approaches to data infrastructure may allow high-performance computing situated in organisations such as universities to interact with near-ubiquitous smart or mobile devices, even in remote locations. As discussed earlier, AI systems implemented in populations of children and neonates in LMICs should be developed in representative groups. AI developers and healthcare providers should resist the temptation to import solutions derived in entirely different patient groups.

The challenges associated with this digital transformation should not be underestimated. Much of the burden of sepsis mortality is attributable to a lack of access to healthcare services. Civil registration systems in LMICs may be limited, which presents difficulties in quantifying the burden of sepsis in children [89, 90]. Even where healthcare is accessible, many healthcare facilities in LMICs do not have EHR systems to support sophisticated use of clinical data. Accessing and harmonising data from highly heterogeneous clinical systems is a significant technical and governance challenge. Significant investment must be made to develop the infrastructure necessary for high-quality, large-scale data storage in LMICs [91]. Widespread use of big data to create AI tools is made even more difficult by the differing regulatory approaches within and between LMICs [92]. Factors such as linguistic and cultural differences add to the challenge of data sharing between LMICs [93].

Sepsis in children in LMICs is associated with unacceptably high mortality and long-term disability. Data derived from EHRs or mobile and smart devices may enable AI tools to relieve the burden on healthcare practitioners and inform evidence-based, personalised management of children and neonates with sepsis. While transforming the data ecosystem in LMICs with AI-based systems is appealing, the priorities of existing health systems must be considered seriously.

Conclusion

Artificial intelligence offers a rapidly evolving approach to the analysis of routine clinical and biological data with the potential to transform the care of children with suspected sepsis. Rapid developments in the application of LLMs offer exciting and unpredictable opportunities for innovation. However, there are many challenges impeding the development and implementation of safe, reliable and acceptable AI systems in sepsis care. The unique challenges of LMICs should be urgently addressed to reflect the disproportionate impact of sepsis in these settings.