With the steady advent of electronic health records, the abundance of routinely recorded data related to the treatment and care of intensive care unit (ICU) patients can be nothing short of overwhelming. Tens of thousands of data points may be registered for every single ICU patient each day, sourced from a multitude of channels. These include readouts and waveforms from devices for monitoring and life support, medication administration records, imaging studies, laboratory tests and microbiology results, and detailed clinical documentation such as fluid charts, nursing notes, procedural reports and care plans [1].
Researchers and clinicians are increasingly advocating for making these data available for secondary use, especially given the current scientific and societal interest in using data science and artificial intelligence to leverage the power of big data [2, 3]. The number of publications focusing on big data applications in healthcare has surged in recent years, reflecting this rapidly growing interest [4]. Potential use cases are extensive and include disease surveillance, clinical audits for quality improvement, personalized medicine and advanced clinical research.
However, making ICU data available for secondary use comes with significant technical, ethical, economic and legal challenges. These include the complexity of integrating data from different hospitals, which use varying data formats and require adequate mapping to standard vocabularies; inconsistencies in recording critical treatment decisions; stringent patient privacy regulations; costs of data management; and legal concerns over data ownership. Standardized data validation processes offer a partial solution by enforcing consistent data granularity, terminology, and uniform labels across institutions. Common data models, such as that of the Observational Medical Outcomes Partnership (OMOP), facilitate data harmonization by mapping raw data into a unified structure, enabling analyses to be performed on large and diverse datasets while promoting data accuracy and compliance. Despite these advances, gaps remain, particularly in achieving broad adoption and addressing variability in data quality.
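To make the harmonization step concrete, the following is a minimal sketch of mapping site-specific labels and units onto one shared concept. It is an illustration only and not the OMOP tooling itself: the site names, local labels and the concept name `creatinine_umol_per_l` are hypothetical placeholders rather than real vocabulary identifiers. The only external fact used is the standard creatinine conversion of 1 mg/dL = 88.4 µmol/L.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    site: str
    local_label: str
    value: float


# Hypothetical lookup: (site, local label) -> (shared concept, unit factor).
# Labels and concept names are placeholders, not real OMOP concept IDs.
MAPPING = {
    ("hospital_a", "Kreatinine (umol/l)"): ("creatinine_umol_per_l", 1.0),
    ("hospital_b", "CREA mg/dL"): ("creatinine_umol_per_l", 88.4),
}


def harmonize(m):
    """Return (shared_concept, converted_value), or None if no mapping exists."""
    entry = MAPPING.get((m.site, m.local_label))
    if entry is None:
        # Unmapped rows are left for manual review rather than guessed.
        return None
    concept, factor = entry
    return concept, round(m.value * factor, 1)


if __name__ == "__main__":
    print(harmonize(Measurement("hospital_a", "Kreatinine (umol/l)", 75.0)))
    print(harmonize(Measurement("hospital_b", "CREA mg/dL", 1.0)))
    print(harmonize(Measurement("hospital_b", "Unknown label", 1.0)))
```

Returning `None` for unmapped labels, instead of passing the raw value through, keeps mapping gaps visible so that clinicians can validate them.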
With the notable exception of some freely available ICU datasets (such as MIMIC from the USA and its European counterparts AmsterdamUMCdb, HiRID and SICdb), ICU data sharing currently remains scarce [1, 5, 6, 7]. These datasets have been instrumental in advancing critical care research, yet the overall lack of widespread data sharing obstructs broader progress in the field [8, 9]. Making data available for federated use might be a scalable solution to the challenges of ICU data sharing. With this method of data sharing and analysis, ICUs collaborate without centralizing the data in one location. Instead, they share insights and train models across distributed datasets while keeping patient data local and secure [9].
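The federated principle can be illustrated with a deliberately simple sketch: each ICU computes aggregate statistics locally, and only those aggregates are pooled. This is a toy example of federated analytics under assumed names (`local_summary`, `pooled_mean_sd`); it is not a description of any specific federated platform, and real deployments typically add further safeguards such as minimum group sizes.

```python
import math


def local_summary(values):
    """Runs inside each ICU: returns aggregates only, never patient rows."""
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    return n, total, total_sq


def pooled_mean_sd(summaries):
    """Runs at the coordinator: combines per-site aggregates."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mean = total / n
    # Sample variance reconstructed from pooled sums.
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, math.sqrt(max(variance, 0.0))


if __name__ == "__main__":
    site_a = local_summary([2.0, 4.0])  # e.g. lactate values held at site A
    site_b = local_summary([6.0, 8.0])  # values held at site B
    print(pooled_mean_sd([site_a, site_b]))
```

The coordinator obtains the same mean and standard deviation as it would from the combined rows, although it never sees an individual measurement.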
One notable challenge associated with the secondary use of routinely collected ICU data is concern over data quality, which may affect the reliability of any derived insights. This is typically referred to by the umbrella expression “garbage in, garbage out”, implying that flawed, incomplete, or low-quality input data will inevitably lead to faulty outputs, regardless of how sophisticated the data processing or analysis might be. Moreover, federated approaches may be particularly prone to this risk because it is challenging for data users to verify data quality without direct access, while data providers may lack the incentive to implement data quality checks and provide adequate documentation [10, 11].
Despite these legitimate concerns, the potential benefits for future ICU patients associated with utilizing routinely collected clinical data for data science applications in the ICU are substantial and may far outweigh the risks, especially if common pitfalls are acknowledged and avoided [12]. One significant benefit is that the sheer volume, granularity and continuity of ICU data allow for detailed analyses of clinical trajectories over time, even if the data are noisy. This richness enables identification of new trends and patterns, such as early signs of clinical deterioration and prediction of relevant clinical outcomes. Exploiting routinely collected ICU data supports data-driven decision-making aimed at improving patient outcomes. Even if the input data are imperfect, the capacity to generate real-time insights can be a game-changer in the fast-paced ICU environment. Furthermore, since these clinical data are already gathered as part of standard patient care, applying them to data science initiatives is both practical and cost-effective, maximizing resources without additional burden.
In addition, unlike data from controlled clinical trials—which often include only highly selected patient populations, adhere to strict protocols and may equally suffer from data quality issues—routinely collected data reflect the diversity and variability of everyday clinical care. Utilizing these real-world data leads to insights that are applicable to a far wider range of critically ill patients. This is especially important for patient populations that are typically underrepresented in clinical trials, such as the very elderly or those with extensive comorbidities. In addition, real-world data can help identify knowledge gaps in clinical practice and areas for improvement.
Recognizing that intensive care medicine often relies on evidence from studies with limitations due to challenges in conducting large-scale randomized controlled trials, big data research offers an opportunity to generate insights that are non-inferior to those from traditional studies if rigorous analytical methods are applied to extensive real-world datasets [13]. Big data models have the added advantage of continuous learning and adaptation. Unlike clinical trial results, which remain fixed once the study concludes, big data models can evolve by incorporating new information over time. This allows predictions and insights to improve as more data are gathered, ensuring that models remain relevant to changing patient populations and treatment practices. With big data, we can enhance the evidence base and contribute meaningfully to patient care, supplementing and reinforcing existing evidence.
As big data and artificial intelligence inevitably become integral to the future of intensive care medicine, ensuring the quality of collected data is more important than ever [3]. Modern data science techniques, such as imputation, feature selection and regularization, are successfully employed to address incomplete or noisy data [14, 15]. In addition, machine learning models can be trained to detect and account for data outliers and discrepancies. However, the effectiveness of these techniques is significantly enhanced when the underlying data, particularly clinical documentation, are of high quality. ICUs should prioritize improving the accuracy and completeness of the data they are already collecting by implementing systematic data validation processes and regular quality audits. Standardized data entry protocols, consistent documentation practices and routine staff training are essential steps to enhance reliability. Beyond this, practicing intensivists can contribute significantly by creating comprehensive data dictionaries, validating mappings and addressing questions from data end users. These measures ensure that clinical documentation adheres to standardized terminologies and is both accurate and readily interpretable for downstream data use, complementing the legal and contractual requirements that are already enforced in many European countries.
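As one illustration of imputation, the sketch below carries the last observed value forward for a limited number of time steps and records which values were imputed. It is a simple stand-in for the broader family of techniques mentioned above, not a recommended method: the function name `forward_fill` and the `max_gap` limit are assumptions chosen for the example, and a suitable limit would depend on the variable and the clinical context.

```python
def forward_fill(series, max_gap):
    """Carry the last observed value forward for at most `max_gap` steps.

    `series` is a list of regularly spaced values, with None for missing.
    Returns (filled, imputed), where imputed[i] is True if filled[i]
    was carried forward rather than observed.
    """
    filled, imputed = [], []
    last, age = None, 0
    for value in series:
        if value is not None:
            last, age = value, 0
            filled.append(value)
            imputed.append(False)
            continue
        age += 1
        if last is not None and age <= max_gap:
            filled.append(last)
            imputed.append(True)
        else:
            # Too stale (or nothing observed yet): leave the gap visible.
            filled.append(None)
            imputed.append(False)
    return filled, imputed


if __name__ == "__main__":
    hourly_map = [None, 80, None, None, None, 90]  # illustrative values
    print(forward_fill(hourly_map, max_gap=2))
```

Keeping the `imputed` flags alongside the filled values lets downstream models and reviewers distinguish measured data from assumptions.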
Involving data scientists in ICU data management is essential to establishing rigorous quality control processes. Their role encompasses several key responsibilities: designing and implementing data validation checkpoints, developing automated tools to detect inconsistencies or missing values, and collaborating with clinical staff to create comprehensive data dictionaries that clarify terminologies and ensure data consistency. Data scientists are also instrumental in setting up data governance frameworks to monitor data quality continuously and conducting exploratory data analyses to identify trends in data entry errors or patterns of missing data. These measures can identify and rectify common errors before they impact analyses (Fig. 1). In addition, data scientists play a critical role in developing models that suit ICU data, particularly for analyzing complex, time-dependent patient trajectories, a challenge requiring approaches like reinforcement learning to capture temporal trends effectively. In this context, well-constructed algorithms are those that deliver reliable and accurate insights across diverse ICU datasets, remaining stable without being overly sensitive to minor data variations. Data scientists support these goals by setting validation criteria, conducting bias analyses and performing external validations to assess model performance in real-world settings. Demonstrating to ICU professionals how better data quality enhances models and, consequently, clinical decision support is crucial for creating a sustainable positive feedback loop.
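A validation checkpoint of the kind described here could, in its simplest form, look like the sketch below, which flags missing values and values outside a plausible range. The field names and limits in `PLAUSIBLE_RANGES` are hypothetical and would need to be agreed with clinical staff; the example is meant to show the shape of such a check, not a validated rule set.

```python
# Hypothetical plausibility limits; real limits should be set with clinicians.
PLAUSIBLE_RANGES = {
    "heart_rate_bpm": (0, 350),
    "temperature_c": (25.0, 45.0),
}


def validate(record):
    """Return a list of human-readable issues found in one record."""
    issues = []
    for field, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(field)
        if value is None:
            issues.append(f"{field}: missing")
        elif not low <= value <= high:
            issues.append(f"{field}: {value} outside [{low}, {high}]")
    return issues


if __name__ == "__main__":
    # A temperature of 98.6 suggests Fahrenheit entered in a Celsius field.
    print(validate({"heart_rate_bpm": 80, "temperature_c": 98.6}))
    print(validate({"temperature_c": 37.0}))
```

Counting such issues per unit, device or shift is one way to surface the patterns of data entry errors and missing data mentioned above.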
With improved data quality as a foundation, models trained on datasets of unprecedented size can be better integrated into ICU practice to support clinical decision-making. For these models to have their greatest impact on patient care, clinicians must trust both the model predictions and their reliability. This trust can be enhanced when models not only provide predictions but also communicate their level of certainty, allowing clinicians to make informed decisions about whether to incorporate or question a model recommendation. Building this trust includes validating model outputs with real-world data, cross-referencing recommendations with clinical observations, and ensuring transparency in the model’s decision-making processes. By combining insights from these large-scale models with clinical expertise, ICUs can responsibly leverage data-driven insights, enhancing patient care while remaining mindful of the limitations inherent in these models. It is important to recognize that some uncertainties arise from gaps in available data or limitations within the model, while others reflect natural variability in patient responses. When model recommendations diverge from clinical judgment, these factors should be carefully considered, with model insights serving to support, not replace, clinical expertise.
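One possible way for a model to communicate its level of certainty is sketched below: the predictions of several independently trained models are summarized by their mean and their spread, and a large spread marks the prediction as one to question. This is an assumption-laden illustration. The function name, the `max_spread` threshold of 0.10 and the example probabilities are hypothetical, and disagreement between models mainly reflects model-related uncertainty rather than natural variability in patient responses.

```python
from statistics import mean, pstdev


def summarize_prediction(member_probs, max_spread=0.10):
    """Summarize risk estimates from several models for one patient.

    `member_probs` holds one predicted probability per model.
    `max_spread` is a hypothetical threshold on their standard deviation.
    """
    estimate = mean(member_probs)
    spread = pstdev(member_probs)
    return {
        "risk": round(estimate, 2),
        "spread": round(spread, 2),
        # False signals that the models disagree and the output
        # deserves extra clinical scrutiny.
        "confident": spread <= max_spread,
    }


if __name__ == "__main__":
    print(summarize_prediction([0.70, 0.72, 0.68]))  # models agree
    print(summarize_prediction([0.20, 0.90, 0.50]))  # models disagree
```

Presenting the spread next to the risk estimate gives clinicians a basis for deciding whether to incorporate or question a recommendation.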
Fig. 1 Data science advancements in data-generation infrastructure and analysis methodologies can significantly reduce errors through the implementation of rigorous checkpoints. This figure illustrates how common errors at each stage of the data processing pipeline, from raw data extraction to model output, can be minimized by setting quality standards and applying corrective measures. By adhering to these best practices, the overall accuracy and reliability of the analysis workflow are improved.
Ensuring that data integrity, well-constructed algorithms and expert interpretation converge is essential to transform raw data into actionable knowledge and deliver meaningful advancements in patient care. If we achieve this convergence, we may be on the cusp of a new era in intensive care medicine.
Data availability
Not applicable.
References
1. Johnson AEW, Bulgarelli L, Shen L et al (2023) MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 10:1. https://doi.org/10.1038/s41597-022-01899-x
2. Beam AL, Kohane IS (2018) Big data and machine learning in health care. JAMA 319:1317–1318. https://doi.org/10.1001/jama.2017.18391
3. Topol EJ (2019) High-performance medicine: the convergence of human and artificial intelligence. Nat Med 25:44–56. https://doi.org/10.1038/s41591-018-0300-7
4. Jimma BL (2023) Artificial intelligence in healthcare: a bibliometric analysis. Telemat Inform Rep 9:100041. https://doi.org/10.1016/j.teler.2023.100041
5. Thoral PJ, Peppink JM, Driessen RH et al (2021) Sharing ICU patient data responsibly under the society of critical care medicine/European society of intensive care medicine joint data science collaboration: the Amsterdam university medical centers database (AmsterdamUMCdb) example. Crit Care Med 49:e563–e577. https://doi.org/10.1097/CCM.0000000000004916
6. Hyland SL, Faltys M, Hüser M et al (2020) Early prediction of circulatory failure in the intensive care unit using machine learning. Nat Med 26:364–373. https://doi.org/10.1038/s41591-020-0789-4
7. Sadeghi S, Hempel L, Rodemund N, Kirsten T (2024) Salzburg intensive care database (SICdb): a detailed exploration and comparative analysis with MIMIC-IV. Sci Rep 14:11438. https://doi.org/10.1038/s41598-024-61380-0
8. Heavner SF, Kumar VK, Anderson W et al (2024) Critical data for critical care: a primer on leveraging electronic health record data for research from society of critical care medicine’s panel on data sharing and harmonization. Crit Care Explor 6:e1179. https://doi.org/10.1097/CCE.0000000000001179
9. van Genderen ME, Cecconi M, Jung C (2024) Federated data access and federated learning: improved data sharing, AI model development, and learning in intensive care. Intensiv Care Med 50:974–977. https://doi.org/10.1007/s00134-024-07408-5
10. Sheller MJ, Edwards B, Reina GA et al (2020) Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data. Sci Rep 10:12598. https://doi.org/10.1038/s41598-020-69250-1
11. Blacketer C, Defalco FJ, Ryan PB, Rijnbeek PR (2021) Increasing trust in real-world evidence through evaluation of observational data quality. J Am Med Inform Assoc 28:2251–2257. https://doi.org/10.1093/jamia/ocab132
12. Sauer CM, Dam TA, Celi LA et al (2022) Systematic review and comparison of publicly available ICU data sets-a decision guide for clinicians and data scientists. Crit Care Med 50:e581–e588. https://doi.org/10.1097/CCM.0000000000005517
13. Granholm A, Alhazzani W, Derde LPG et al (2022) Randomised clinical trials in critical care: past, present and future. Intensiv Care Med 48:164–178. https://doi.org/10.1007/s00134-021-06587-9
14. Mukherjee K, Gunsoy NB, Kristy RM et al (2023) Handling missing data in health economics and outcomes research (HEOR): a systematic review and practical recommendations. Pharmacoeconomics 41:1589–1601. https://doi.org/10.1007/s40273-023-01297-0
15. Tian Y, Zhang Y (2022) A comprehensive survey on regularization strategies in machine learning. Inf Fusion 80:146–166. https://doi.org/10.1016/j.inffus.2021.11.005
Ethics declarations
Conflicts of interest
None to declare.
Cite this article
Lijović, L., Elbers, P. Leveraging the power of routinely collected ICU data. Intensive Care Med 51, 163–166 (2025). https://doi.org/10.1007/s00134-024-07745-5
DOI: https://doi.org/10.1007/s00134-024-07745-5
