Key messages

Publication of papers reporting the use of machine learning to analyse routinely collected ICU data is increasing rapidly: around half of the identified studies were published since 2015.

Machine learning methods have changed over time: the earliest studies all used neural networks, whereas recent studies also make wide use of support vector machines and random forests.

The majority of published studies analysed data on fewer than 1000 patients. Beyond the smallest studies, predictive accuracy increased with increasing sample size.

Reporting of the validation of predictions was variable and incomplete—few studies validated predictions using independent data.

Methodological and reporting guidelines may increase confidence in reported findings and thereby facilitate the translation of study findings towards routine use in clinical practice.


Intensive care units (ICUs) face financial, bed management, and staffing constraints, among others. Efficient operation within these limits is difficult because the constraints are multidimensional and interconnected [1]. Extremely detailed data covering all aspects of patients’ journeys into and through intensive care are now collected and stored in electronic health records (EHRs). Data that are typically available in these EHRs include demographic information, repeated physiological measurements, clinical observations, laboratory test results, and therapeutic interventions. Such detailed data offer the potential to provide improved prediction of outcomes such as mortality, length of stay, and complications, and hence improve both the care of patients and the management of ICU resources [2,3,4].

Machine learning is a form of artificial intelligence (AI) in which a model learns from examples rather than pre-programmed rules. Example inputs and outputs for a task are provided to ‘the machine’ and, using learning algorithms, a model is created so that new information can be interpreted. Machine learning approaches can provide accurate predictions based on large, structured datasets extracted from EHRs [5, 6]. There have been rapid developments in machine learning methodology, but many methods still require large datasets to model complex and non-linear effects, and thereby improve on prediction rules developed using standard statistical methods [6,7,8]. Papers describing applications of machine learning to routinely collected data are published regularly [7], but there is no recent systematic review summarizing their characteristics and findings [9]. We systematically reviewed the literature on uses of machine learning to analyse routinely collected ICU data with a focus on the purposes of the application, type of machine learning methodology used, size of the dataset, and accuracy of predictions.
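The idea of learning a rule from example inputs and outputs, rather than pre-programming it, can be illustrated with a deliberately minimal model. The sketch below (plain Python, entirely hypothetical toy data, and not a method drawn from any of the reviewed studies) fits a one-variable decision stump by searching for the threshold that best separates the observed outcomes:

```python
def fit_stump(xs, ys):
    """Learn a one-feature decision stump from labelled examples:
    choose the threshold t maximizing training accuracy of 'x >= t'."""
    best_t, best_acc = None, -1.0
    for t in xs:  # candidate thresholds: the observed values themselves
        acc = sum((x >= t) == bool(y) for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Toy data: a binary outcome that tends to occur at higher values of a
# single (hypothetical) measurement.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
t = fit_stump(xs, ys)
# New information is then interpreted via the learned rule: predict 1 if x >= t.
```

Real ICU applications replace this single threshold with models that combine hundreds of EHR variables, but the workflow (examples in, learned rule out) is the same.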


Systematic review design, definitions, and inclusion/exclusion criteria

Search strategy

Candidate articles were identified from searches of Web of Science and MEDLINE. There was no restriction on the publication date, but only articles written in English were included. Two searches connected with an ‘AND’ statement were used—one to capture applications of artificial intelligence and the other to capture the ICU setting. Searches for artificial intelligence used the following terms: ‘Machine Learning’, ‘Artificial Intelligence’, ‘Deep Learning’, ‘Neural Network’, ‘Support vector machine’, ‘Prediction Network’, ‘Forecast Model’, ‘Data mining’, ‘Supervised Learning’, and ‘Time series prediction’. Searches for the ICU setting used the following terms: ‘Cardiac Intensive Care Unit’, ‘CICU’, ‘ICU’, ‘Coronary Care’, ‘Critical Care’, ‘High Dependency’, and ‘HDU’. The search terms were developed iteratively, initially using subject headings from citation indexes and text word searches for machine learning (e.g. ‘Artificial Intelligence/or Machine Learning/’, ‘Pattern Recognition, Automated/’, and ‘Machine’, ‘Artificial’, ‘Deep’, ‘Supervised’ respectively). The first 30 relevant papers were extracted and mined for specific terms (e.g. ‘Prediction’, ‘Support vector machine?.tw’, ‘Demand’). The search was run again with these terms included, and the first 30 new relevant papers were extracted and mined for specific terms. These were included in the search terms to generate the final list of search terms (see Additional file 1). Review papers were set aside for separate analysis.

Eligibility criteria

Eligible papers (1) used machine learning or artificial intelligence (AI), defined as any form of automated statistical analysis or data science methodology; (2) analysed routinely collected data that were generated as part of patients’ standard care pathway in any hospital worldwide; (3) analysed data collected in the ICU, defined as an area with a sole function to provide advanced monitoring or support to single or multiple body systems; and (4) were published in a scientific journal or conference proceeding when the proceeding detailed the full study. Studies from all ICUs were eligible, regardless of specialty. There was no limit on the age of included patients. The following types of study were excluded: (1) use of machine learning to process or understand medical images; (2) studies that focused on text mining; (3) analyses of novel research data rather than routinely collected data; (4) studies that implemented additional data collection techniques beyond hospitals’ routine systems; (5) studies based on data from a general medicine ward, coronary care unit, operating theatre or post-anaesthetic care unit, or emergency room; (6) conference abstracts and proprietary machine learning systems. Papers describing reviews of machine learning based on ICU data were also retrieved.

Study selection

Details of papers were uploaded to EndNote X8 (Clarivate Analytics, Philadelphia, PA, USA), and duplicates were removed using EndNote’s duplicate identification tool. One author (DS) screened the titles and abstracts and retrieved the full text of papers judged to be potentially eligible for the review. Final decisions about eligibility, based on reading the full text of the manuscripts, were made by one author (DS), with a randomly selected subset checked by two further authors (BG and JS). Conflicts were resolved by consensus. An additional file provides a full bibliography (see Additional file 2).

Review process and data extraction

The study characteristics to be extracted, and their definitions and categories, were decided iteratively following study of 40 eligible papers. We extracted information on the following study features: (1) aim (categorized as improving prognostic models, classifying sub-populations, determining physiological thresholds of illness, predicting mortality, predicting length of stay, predicting complications, predicting health improvement, detecting spurious values, alarm reduction, improving upon previous methods (with details) and other (with details)); (2) type of machine learning (categorized as classification/decision trees, naïve Bayes/Bayesian networks, fuzzy logic, Gaussian process, support vector machine, random forest, neural network, superlearner, not stated and other (with details)). All types of machine learning used were recorded; (3) dataset size (the number of patients, episodes or samples analysed); (4) whether the study used data from the publicly available Medical Information Mart for Intensive Care II/III (MIMIC-II/III), which includes deidentified health data on around 40,000 patients treated at the Beth Israel Deaconess Medical Center between 2001 and 2012 [10]; (5) method used to validate predictions (categorized as independent data, randomly selected subset with leave-P-out (P recorded), k-fold cross-validation (k recorded), randomly selected subset, other (with details), no validation). For studies that validated results for multiple machine learning techniques, we recorded the method corresponding to the most accurate approach. For studies that dichotomized length of stay in order to validate predictions, we recorded the threshold as the highest length of stay in the lower-stay group; (6) measure of predictive accuracy (area under the receiver operating characteristic (ROC) curve (AUC), proportion of subjects correctly classified, sensitivity, and specificity). Each measure reported was recorded.
When measures of predictive accuracy were recorded for multiple machine learning techniques, we recorded the measures for the most accurate approach. For multiple outcomes, we recorded the measures corresponding to the longest-term outcome. When multiple validation datasets were used, we recorded the measures for the most accurate approach; (7) reporting of results from standard statistical methods such as linear regression and logistic regression. We recorded the method and the corresponding measures of accuracy, using the same rules as described above when more than one result was reported. In response to a suggestion from a peer reviewer, we recorded whether papers reported on calibration and, if so, the method that was used.

Risk of bias was not assessed in the included studies because the purpose of our review was descriptive—the aim was not to draw conclusions about the validity of estimates of predictive accuracy from the different studies. The size of dataset analysed was tabulated according to the study aims. Analyses were restricted to studies that used machine learning to predict complications, mortality, length of stay, or health improvement. The size of dataset according to the type of machine learning, the approach to validation according to outcome predicted, and the measure of predictive accuracy according to outcome predicted were tabulated. The distribution of AUC according to the number of patients analysed and outcome predicted was plotted, along with the number of papers published according to the type of machine learning and year of publication.


Identification of eligible studies

Two thousand eighty-eight papers were identified through Web of Science and 773 through MEDLINE. After duplicates were removed, the titles and abstracts of 2450 unique papers were screened, of which 2023 papers were classified as ineligible. Of 427 papers for which the full text was reviewed, 169 were found to be ineligible, mainly because they did not use machine learning or did not analyse routinely collected ICU data. The review therefore included 258 papers (Fig. 1). MIMIC-II/III data were used in 63 (24.4%) of these studies.

Fig. 1

PRISMA 2009 flow diagram of study review process and exclusion of papers. From [11]

Purpose of machine learning in the ICU

The most common study aims were predicting complications (77 papers [29.8% of studies]), predicting mortality (70 [27.1%]), improving prognostic models (43 [16.7%]), and classifying sub-populations (29 [11.2%]) (Table 1). The median sample size across all studies was 488 (IQR 108–4099). Only six studies (three predicting complications, two improving prognostic models, and one predicting mortality) analysed data on more than 100,000 patients, while 35 analysed data on 10,000–100,000 patients, 18 (51.4%) of which attempted to predict mortality. Most studies (211 [81.8%]) reported analyses of fewer than 10,000 patients. Large sample sizes (> 10,000 patients) were most frequent in studies predicting complications, mortality, or length of stay, and those aiming to improve prognostic models or risk scoring systems. Sample sizes were usually less than 1000 in studies determining physiological thresholds or detecting spurious values.

Table 1 Number and proportion of papers according to the aim of study and number of patients analysed

All further analyses were restricted to the 169 studies that predicted at least one of four clearly definable types of outcome: complications, mortality, length of stay, and health improvement. MIMIC-II/III data were used in 45 (26.6%) of these 169 studies, a similar proportion to that among all 258 included studies (63 [24.4%]).

Type of machine learning

Among studies that predicted complications, mortality, length of stay, or health improvement, 12 (7.1%) predicted more than one of these types of outcome (Table 2). The most commonly used types of machine learning were neural networks (72 studies [42.6%]), support vector machines (40 [23.7%]), and classification/decision trees (34 [20.1%]). The median sample size was 863 (IQR 150–5628). More than half of the studies analysed data on fewer than 1000 patients. There were no strong associations between the type of machine learning and size of dataset, although the proportion of studies with sample sizes less than 1000 was highest for those using support vector machines and fuzzy logic/rough sets. Machine learning methods used in fewer than five papers were combined under the “Other” category. Data on the machine learning methods used in the different types of prediction study are available from the authors on request.

Table 2 Number and proportion of papers according to the type of machine learning used and number of patients analysed (for prediction studies only)

Machine learning studies using ICU data were published from 1991 onwards (Fig. 2). The earliest studies were based on fewer than 100 patients: the first studies based on more than 1000, 10,000, and 100,000 patients were published in 1996, 2001, and 2015 respectively. Although study sizes have increased over time (among studies published in 2017 and 2018, the median [IQR] sample size was 3464 [286–21,498]), studies based on fewer than 1000 patients have been regularly published in recent years. Six studies used data on more than 100,000 patients: one in 2015, one in 2017, and four in 2018 [2, 12,13,14,15,16].

Fig. 2

Number of papers published according to the sample size and year of publication

The earliest machine learning studies all used neural networks (Fig. 3). Papers using other machine learning methods were reported from 2000 onwards, with support vector machines reported from 2005 and random forests from 2012. Of the 258 studies, 125 (48%) were published from 2015 onwards. The most commonly reported machine learning approaches in these studies were support vector machines (37 [29.6% of recent studies]), neural networks (31 [24.8%]), random forests (29 [23.2%]), and classification/decision trees (27 [21.6%]).

Fig. 3

Number of papers published according to the type of machine learning and year of publication

Approaches to validation

Table 3 shows that of the 169 studies that predicted complications, mortality, length of stay, or health improvement, 161 (95.3%) validated their predictions. Validations were rarely based on independent data (10 studies [6.2%]). The most commonly used approaches were random subsets of the data, either with (71 [44.1%]) or without (71 [44.1%]) k-fold cross-validation. Studies predicting length of stay were most likely to use independent data and least likely to use k-fold cross-validation. Data on approach to validation according to the type of machine learning and outcome predicted are available from the authors on request.
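To make the distinction between these internal validation schemes concrete, the following sketch (plain Python, with a hypothetical dataset of 100 patient indices) shows how k-fold cross-validation partitions the data so that every patient is used for testing exactly once, whereas a single random subset holds out each patient at most once:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle n patient indices and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = k_fold_indices(100, 5)

# Each fold is held out once for testing; the remaining k-1 folds train
# the model, so every patient contributes to exactly one test set.
for test_fold in folds:
    train = [j for f in folds if f is not test_fold for j in f]
    assert set(train).isdisjoint(test_fold)
```

Neither scheme guards against the test data sharing systematic patterns with the training data, which is why validation on truly independent data remains the stronger check.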

Table 3 Number and proportion of papers according to outcome predicted and approach to validation (for prediction studies only)

Measures of predictive accuracy reported

The majority of the 161 papers that quantified the predictive accuracy of their algorithm reported the AUC (97 [60.2%]), of which 43 (26.7%) also reported accuracy, sensitivity, and specificity (Table 4). Sixty-two studies (38.5%) reported these measures but not the AUC. The AUC was most likely to be reported by studies predicting mortality (47 [69.1%]). Papers predicting complications and health improvement were more likely to report only accuracy, sensitivity, and specificity. All 18 papers predicting the numerical outcome of length of stay validated their predictions: 8 (44.4%) reported the proportion of variance explained (R2). Five papers dichotomized length of stay and reported the AUC: two at 1 day [17, 18], one at 2 days [19], one at 7 days [4], and one at 10 days [20]. Data on reported measures of predictive accuracy according to the type of machine learning and outcome predicted are available from the authors on request.
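These accuracy measures are simple functions of the predictions. As an illustration, a sketch in plain Python (hypothetical outcome labels and risk scores, not data from any reviewed study) computing sensitivity, specificity, and the rank-based AUC:

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive case is
    scored higher than a randomly chosen negative case (the
    Mann-Whitney formulation), counting ties as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

Unlike sensitivity and specificity, the AUC needs no classification threshold, which is one reason it dominates reporting in these studies.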

Table 4 Number and proportion of papers according to outcome predicted and measure of predictive accuracy reported (for studies that validated predictions)

Figure 4 shows the distribution of AUC according to the size of dataset, for all prediction studies and for studies predicting mortality or complications, with AUCs from the 10 studies that used external validation shown as individual data points. The median AUC was higher in the smallest studies (< 100 patients), which is likely to reflect over-optimism arising from internal validation in small samples. Above this, the median AUC increased with sample size, from the 100–1000 patient group up to the 100,000–1,000,000 patient group. AUCs for both a machine learning and a standard statistical approach were reported in only 12 studies (Fig. 5). In all but one of these, the machine learning AUC exceeded that from the standard statistical approach. However, the difference appeared related to study size: three of the four studies with substantial differences between the AUCs were based on fewer than 1000 patients.

Fig. 4

Boxplots showing the distribution of AUC scores according to the size of dataset, for all studies and separately for studies predicting mortality and complications. Numbers displayed are the median AUC for each group. A cross indicates the AUC of one of the 10 papers using independent test data. We did not plot results for studies predicting the length of stay and health improvement because the numbers of such studies were small

Fig. 5

Comparison of AUC scores found in complication or mortality prediction papers according to the technique used to produce them. A line of equality is also provided

The proportion of papers reporting on calibration was low: 30 (11.6%) of the 258 papers included in the review and 23 (13.6%) of the 169 studies that predicted complications, mortality, length of stay, or health improvement. Among these 23 papers, 21 reported Hosmer-Lemeshow statistics [21], one reported the Brier score, and one used a graphical approach.
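Both of the main calibration measures reported are straightforward to compute. A sketch in plain Python (hypothetical outcomes and predicted probabilities) of the Brier score and of the binned observed-versus-expected comparison that underlies Hosmer-Lemeshow statistics and calibration plots:

```python
def brier_score(y_true, probs):
    """Mean squared difference between the predicted probability and the
    0/1 outcome; 0 is perfect, lower is better."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)

def calibration_bins(y_true, probs, g=10):
    """Group patients into g equal-width risk bins and compare the mean
    predicted probability with the observed event rate in each bin (a
    sketch; Hosmer-Lemeshow implementations typically use deciles of
    risk and form a chi-square statistic from these quantities)."""
    bins = []
    for k in range(g):
        lo, hi = k / g, (k + 1) / g
        grp = [(p, t) for p, t in zip(probs, y_true)
               if lo <= p < hi or (k == g - 1 and p == 1.0)]
        if grp:
            mean_pred = sum(p for p, _ in grp) / len(grp)
            obs_rate = sum(t for _, t in grp) / len(grp)
            bins.append((mean_pred, obs_rate))
    return bins
```

In a well-calibrated model, the two numbers in each bin agree; plotting observed rate against mean predicted probability gives the graphical approach used by one of the reviewed papers.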


Key findings

Interest in the use of machine learning to analyse routinely collected ICU data is burgeoning: nearly half of the studies identified in this review were published since 2015. Sample sizes, even in recently reported studies, were often too small to exploit the potential of these methods. Among studies that used machine learning to predict clearly definable outcomes, the most commonly used methods were neural networks, support vector machines, and classification/decision trees. Recently reported studies were most likely to use support vector machines, random forests, and neural networks. Most studies validated their predictions using random subsets of the development data, with validations based on independent data rarely reported. Reporting of predictive accuracy was often incomplete, and few studies compared the predictive accuracy of their algorithm with that using standard statistical methods.

Strengths and limitations

We used comprehensive literature searches but may have omitted studies using proprietary machine learning methods, or code repositories that were not peer reviewed and published in the literature databases searched. Because the review was descriptive, we did not assess the risk of bias in the included studies. Robust conclusions therefore cannot be drawn about the reasons for variation in AUC between studies, or about differences in performance between machine learning prediction algorithms and those based on standard statistical techniques. Although there were clear changes over time in the machine learning techniques used, we did not compare the performance of the different techniques. Most of the analyses included in the review related to studies that predicted the clearly definable outcomes of complications, mortality, length of stay, or health improvement: quantitative conclusions about other types of study were not drawn.

Results in context with literature

The last systematic review of the use of machine learning in the ICU was published in 2001 [9]. It noted the particular suitability of the data-rich ICU environment for machine learning and artificial intelligence. Further narrative reviews stated the need to understand model assumptions and methods to validate predictions when conducting machine learning studies [21, 22]. Papers in this review rarely compared the performance of machine learning with that of predictions derived using standard statistical techniques such as logistic regression. Empirical studies have suggested that standard statistical techniques produce predictions that are often as accurate as those derived using machine learning [23]. Standard statistical techniques may have greater transparency with regard to inputs, processing, and outputs: the ‘black-box’ nature of machine learning algorithms can make it difficult to understand the relative importance of the different predictors and the way that they contribute to predictions. This makes it difficult to understand and correct errors when they occur. Thus, studies have highlighted the desirability of transparent reasoning from machine learning algorithms [24]. Our review documents the evolving use of machine learning methods in recent years, but shows that limitations in the conduct and reporting of the validation of these studies persist.

Figure 4 suggests that studies based on small sample sizes and using internal validation overestimated model performance. Although there are no fixed minimum dataset sizes appropriate for machine learning applications, data on many tens or hundreds of thousands of patients may be required for these approaches to realize their potential and provide clear advantages over standard statistical analyses [25]. For dichotomous outcomes such as in-hospital mortality, methods such as random forests and support vector machines may demonstrate instability and over-optimism even with more than 200 outcome events per variable [23]. However, the majority of prediction studies included in our review analysed data on fewer than 1000 patients, which is likely to be too few to exploit the power of machine learning [6, 7]. Machine learning techniques are data hungry, and ‘over-fitting’ is more likely in studies based on small sample sizes. Achieving large sample sizes will require continuing development of digital infrastructure that allows linkage between databases and hence generation of datasets on large clinical populations [6, 16, 26]. Truly large datasets (population sizes of > 100,000 individuals) have so far been difficult to generate because of concerns over data privacy and security. Sharing these data with large commercial players who have the programming and processing capacity to extract multiple signals from them is even more difficult [27]. Only three papers included in our review addressed use of machine learning to identify data errors [28,29,30]. Errors are common in routine EHR data [6], and thus, datasets must be cleaned before analysis. This represents one of the most important tasks in using large datasets and is impractical without automation.


The most rigorous approach to quantifying the likely performance of machine learning algorithms in future clinical practice, and avoiding over-optimism arising from selection of variables and parametrizations, is to validate algorithms using independent data [31]. However, this was done in only a small minority of studies. Among studies that quantified predictive accuracy, most validated their models using random subsets of their development data. Because patterns of data in such test datasets do not differ systematically from patterns in the training datasets, they may overestimate model performance. A substantial minority of studies did not validate predictions or report the area under the ROC curve. Studies rarely discussed the implementation of machine learning algorithms that had been developed and whether they improved care. Nor did studies report newer performance metrics that may overcome limitations of the AUC, such as its insensitivity to the number of false positives when predicting rare events and its equal weighting of false positive and false negative predictions [32, 33].
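The over-optimism of naive internal validation can be demonstrated in a few lines. In this sketch (plain Python, synthetic noise data; the 1-nearest-neighbour model and seed are illustrative choices, not methods from the reviewed studies), a model that merely memorizes its training data scores perfectly when 'validated' on that same data, even though the labels contain no signal at all:

```python
import random

def one_nn(train_x, train_y, x):
    """1-nearest-neighbour prediction: copy the label of the closest
    training point."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[i]

rng = random.Random(42)
x = [rng.random() for _ in range(40)]
y = [rng.randint(0, 1) for _ in range(40)]   # labels are pure noise

# 'Validating' on the training data itself: each point is its own nearest
# neighbour (distance 0), so resubstitution accuracy is a perfect 1.0.
resub = sum(one_nn(x, y, xi) == yi for xi, yi in zip(x, y)) / len(x)

# On genuinely independent data, the same model hovers around chance,
# because there was never any signal to learn.
new_x = [rng.random() for _ in range(40)]
new_y = [rng.randint(0, 1) for _ in range(40)]
held_out = sum(one_nn(x, y, xi) == yi
               for xi, yi in zip(new_x, new_y)) / len(new_x)
```

Random train/test splits and cross-validation mitigate this extreme case, but only independent data can reveal whether the training sample itself is unrepresentative.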

The papers included in our study generally focused on discrimination (the ability to differentiate between patients who will and will not experience the outcome). Few studies reported on calibration (the degree of agreement between model predictions and the actual outcomes). Model calibration is sensitive to shifts in unmeasured covariates and is particularly important when models are used in population groups that are different from those used for model development.

Reporting standards for applications of machine learning using routinely collected healthcare data, as well as critical appraisal tools, might improve the utility of studies in this area, as has been seen with randomized trials (CONSORT), multivariable prediction models (TRIPOD), and risk of bias in prediction studies (PROBAST) [34,35,36,37,38,39]. These might assist editors and peer reviewers, for example by discouraging applications based on small datasets and insisting that model performance be evaluated either on an external dataset or, for studies using internal validation, on a held-out subset or a procedure that compensates for statistical over-optimism. To ensure that results are reproducible, and to facilitate assessment of discrimination and calibration in new settings, journals and the academic community should promote access to datasets and sharing of analysis code [40].


The increasing availability of very large and detailed datasets derived from routinely collected ICU data, and widespread recognition of the potential clinical utility of machine learning to develop predictive algorithms based on these data, is leading to rapid increases in the number of studies in this area. However, many published studies are too small to exploit the potential of these methods. Methodological, reporting, and critical appraisal guidelines, particularly with regard to the choice of method and validation of predictions, might increase confidence in reported findings and thereby facilitate the translation of study findings towards routine use in clinical practice.