The last 2 decades have seen considerable improvements in the delivery of care to critically ill children, such that paediatric intensive care unit (PICU) mortality rates are now typically below 10%, and indeed often closer to 5% [1, 2]. Paradoxically, this may hinder evaluation of new therapies aimed at further improving survival, because the number of patients required for an adequately powered clinical trial with mortality as the endpoint may now be unfeasibly large. Several strategies exist that may allow for a reduction in study size without loss of power. These include: (1) screening out extremely low- and high-risk patients, (where treatment effects may be less pronounced), (2) employing stratified randomisation based on mortality risk at trial entry [3], and (3) estimating treatment effects after covariate adjustment [4]. Identifying a suitable screening/stratification tool poses several challenges. It must be reliable, relatively simple, calculable within a short time frame, and should not reflect local treatment preferences, such as choice of monitoring tool or a particular ancillary therapy. At face value, contemporary mortality risk tools appear to fulfil these criteria, although valid criticisms of this approach have been voiced [5].

An alternative is to choose a relevant endpoint with a higher incidence than mortality. This could be a composite endpoint, such as mortality and morbidity combined, or indeed a different endpoint altogether, such as quality of life. However, both approaches have shortcomings. Composite endpoints have been criticised as being prone to subjectivity and providing misleading impressions of the impact of a given treatment [6]. Current tools for measuring paediatric quality of life are crude, and the optimal time to measure this post PICU discharge is unknown (1 month, 1 year, etc.) [7]. Recovery may take time to plateau; conversely, some deficits (e.g. post-traumatic stress) may not be apparent initially after discharge [7]. Prolonging the measurement period to accommodate this creates other problems; it increases the likelihood of confounding and may have major cost implications for a trial.

A third option involves the use of a surrogate endpoint for mortality. A surrogate is defined as “a variable that provides an indirect measurement of effect in situations where direct measurement of clinical effect is not feasible” [8]. Organ failure/dysfunction has been postulated as a surrogate for mortality, because it appears to meet the majority of criteria needed for validity [8]. First, there is biological plausibility of a causal link between the two. Second, epidemiological studies have shown prognostic value of organ failure for mortality [911]. The final major criterion is that there should be evidence from clinical trials that treatment effects on the surrogate produce similar effects on the main outcome, in other words, evidence that a treatment that ameliorates organ failure also reduces mortality. To date, this crucial piece of evidence is lacking.

Qualitative definitions for organ failure exist [9, 12]. However, a quantitative score, which also incorporates varying degrees of organ dysfunction, is likely to increase the sensitivity of this surrogate for identifying the likelihood of the endpoint (mortality). Three paediatric organ failure scores have been developed: PRISM III Acute Physiology Score [13], the paediatric logistic organ dysfunction score (PELOD) [14] and P-MODS [15]. All are clinical scores, and all have shown a correlation between degree of organ dysfunction and risk of mortality. Of the three, PELOD appears the most appealing. Unlike P-MODS, PELOD was developed in multiple centres and has been validated across several countries [14, 16]. And unlike PRISM III APS, the details of how to actually calculate the score are in the public domain.

Within the publication of the 2003 Lancet paper that validated PELOD externally, the authors stated the score was now fit for use as a surrogate in clinical trials [16], and indeed this has since become the case. However, in 2006, the same authors highlighted an error in their earlier paper, namely that the PELOD score did not, in fact, calibrate as originally stated [17]. Discrimination and calibration are two vital aspects of goodness of fit in a prognostic score. Discrimination refers to how well the score diagnoses or predicts the endpoint per se, while calibration refers to the accuracy of risk prediction within probability bands (for example, if the score assigns a 20% risk of mortality to 100 patients, we would expect 20 of them to die). An organ failure score that is poorly calibrated cannot be used to track changes in the degree of organ dysfunction, and thus loses a key criterion for validity as a surrogate.

The lack of calibration for PELOD has now been confirmed in a second paper, published in this issue of the Journal [18]. Garcia evaluated PELOD across 1,476 admissions from two PICUs in Brazil and Argentina [18], showing remarkably similar results to those from the original PELOD validation in 2003 [16]. The score continues to discriminate excellently (area under curve 0.93 for Garcia vs. 0.91 for Leteurte), but calibrates poorly, with both studies demonstrating under-prediction of mortality in lower risk groups and over-prediction in higher risk patients [16, 18]. The validity of Garcia’s results is corroborated by their case mix and standardised mortality ratio, which are similar to current United Kingdom PICUs [1], and the fact that PIM2 is well calibrated within their study population.

Why is PELOD so poorly calibrated? A likely reason was highlighted previously by Garcia and colleagues: namely that PELOD does not assign risk on a continuous scale [19]. A risk score typically estimates probability of death using a ratio (continuous) scale ranging from 0 to 100%; this can be calculated via a number of mechanisms, most commonly using a logit transformation. Provided at least one of the variables within the model is continuous, an almost infinite number of covariate patterns are possible, limited only by the precision of the measuring tool (for example, base excess is usually expressed to one decimal place). PELOD is a composite score, with a single coefficient. However, its scale is ordinal, containing 33 ranks ranging from 0 to 71. This means that, for example, a patient cannot be assigned a risk between 3.1 and 16.3% nor between 39.2 and 79.6%. Thus, for example, a patient with a “true” risk of 4% would be calculated as 16.3% by PELOD.

The limitation of this approach can be highlighted by examining the original development paper for PELOD [14]. Continuous variables were first categorised using the Fisher algorithm, then further amalgamated into fewer and fewer (and indeed coarser) categories using a combination of cluster analysis and repeated logistic regressions. This approach loses considerable information and hence accuracy [20], and is further compounded when the dataset is small (594 patients, 51 deaths). The lowest score on PELOD (score 1) was formed by assigning the same weight (odds ratio 1.4, coefficient 0.33) to the lowest category of dysfunction seen in four different organ systems, which demonstrated odds ratios ranging from 1.4 to 3.6 (coefficients 0.32 to 1.27, table 6 of the paper). Similarly, a PELOD score of 10 (odds ratio 32.5) represented the amalgamation of four organ systems exhibiting moderate dysfunction showing considerably different odds ratios of 8.5 to 74.4. This resulted in over 80% of the variability in PELOD being explained by two organ systems: cardiovascular and neurologic. Conversely, pulmonary haematologic and hepatic dysfunction each accounted for less than 5%. This is in direct contrast to the adult MODS score, where variability (measured by partial correlation coefficients) is more evenly distributed among organ systems [21].

One implication of using PELOD as a surrogate endpoint is loss of power. To illustrate this, imagine a hypothetical treatment that produces a relative decrease in organ failure severity and hence mortality risk of 20% across a PICU population. The relativity assumption means that the absolute risk reduction is greater for higher risk patients (for example, the treatment may produce an absolute drop in risk of 16% for a patient with a baseline risk of 80%, but only a 2% drop if the baseline is 10%). I have illustrated this using a hypothetical, “true” organ failure score based upon the PIM2-derived risk profiles of 7,500 patients within my own PICU (Fig. 1). If we assume that only patients with a baseline risk greater than 10% are entered into this trial, the histogram on the left shows the effect of the treatment on risk. An adequately powered trial (80%) requires 120 patients per arm, giving a p value of 0.002. If PELOD is used, the “true” risk for each patient will be rounded up to the next PELOD category (e.g. a patient with risk of 4% will be classed by PELOD as 16.3%). The effect of using PELOD on this study population is shown in the right figure; indeed, the therapy is now classed as showing non-significant benefit, with a p value of 0.10.

Fig. 1
figure 1

Hypothetical trial for a treatment producing an average relative risk reduction of 20% as measured by true mortality risk (left) and PELOD (right). The reduction in risk is shown by a left shift in the histograms associated with treatment (white bars)

Is PELOD broken beyond repair? Probably not, but it does need major resuscitation. Where possible, ordinal variables (e.g. heart rate ranges) should be re-expressed as continuous, and their relationship to risk (log odds of mortality) should be delineated using a technique that allows for non-linearity both in the univariate and multivariable modelling process (e.g. multivariable fractional polynomials or cubic splines). Alternative data reduction techniques should be considered (as per the cardiovascular variable in the MODS score) [21]. Therapies that are required to maintain a variable within the normal range may need to be incorporated within the score (e.g. inotrope dose, as in the SOFA score) [22]. These represent but a few suggestions; others can be found elsewhere [23, 24]. Lastly, this would require a much larger data set than previously [14].

Perhaps the most important point is that which goes beyond PELOD, or indeed any other organ failure score: we have not yet fully validated organ failure as a surrogate for death. Validation cannot rely purely on demonstrating a tight correlation between organ failure and death [25], nor upon fulfilling statistical criteria based on conditional distributions [26]. The demonstration of a treatment effect on the surrogate alone is insufficient; evidence must exist that this translates into a similar effect on mortality. A disconnect between the two has been demonstrated in numerous trials, where a beneficial effect on the surrogate translated into a harmful effect on the true endpoint [27]. This may also have relevance in the setting of organ failure, as it has been suggested that this entity may actually be protective, and hence attempting to reverse it may bring harm [28]. Regardless of this, the way forward requires (1) creation of a valid organ failure score, followed by (2) testing the score in the setting of a trial powered for a mortality endpoint. I think we may be waiting some time.