Introduction

The use of intensive care unit (ICU) length of stay (LOS) and its covariate risk-adjusted equivalent (RALOS), analogous to risk-adjusted mortality, as a quality metric and a proxy for costs has a long history [1,2,3]. Systematic reviews of variables predicting LOS [4] and of statistical estimators of RALOS [5, 6] have appeared, although caveats about such an endeavour, particularly with respect to individual patients, have been expressed [7, 8]. The relationship between the observed LOS and the expected RALOS of a cohort of ICUs may be formulated as a difference, observed minus expected LOS (OMELOS [9]), or as a ratio (the risk adjusted LOS ratio, RALOSR [10]), with corresponding confidence intervals (CI), and displayed in a ranked “caterpillar” plot [11]. ICU LOS ranking uncertainty may also be addressed with respect to a single ICU (versus all other ICUs) or simultaneously across all ICUs [12]; these are two different estimands [13].

The purpose of this paper is to address these themes by way of a particular estimator of RALOS, the generalized linear mixed model (GLMM [6, 14]), compared with the more familiar linear mixed model (LMM [6, 10]). Becker et al. have cautioned regarding the misalignment between hypotheses stated in terms of non-transformed variables (for instance, raw ICU LOS) and the transformed data (log ICU LOS) used to test them [15]. That is, inference on the transformed (log) scale does not equate with inference on the original scale [16,17,18]; back-transformation of a log-scale mean by exponentiation yields a geometric, not an arithmetic, mean. This difference was resolved by appropriate choice of family and link functions within the GLMM framework. The monotonicity or otherwise between the RALOSR in the (mean) ranked arithmetic (GLMM) or geometric (LMM) metric across ICUs was determined and the impact of formal ranking procedures [12] was examined.

Methods

Ethics statement

Access to the data was granted by the Australian and New Zealand Intensive Care Society (ANZICS) Centre for Outcomes & Resource Evaluation (CORE) Management Committee in accordance with standing protocols; local hospital (The Queen Elizabeth Hospital) Ethics of Research Committee waived the need for patient consent to use their data in this study. The dataset was anonymized before release to the authors by ANZICS CORE, custodians of the database. The dataset is the property of the ANZICS CORE and contributing ICUs and is not in the public domain. Access to the data by researchers, submitting ICUs, jurisdictional funding bodies and other interested parties is obtained under specific conditions and upon written request [19].

Data management

Data were accessed from the ANZICS Adult Patient Database [20], in this instance for calendar year 2016, and processed as previously detailed [21]. Individual ICUs were anonymized but, for purposes of data management and illustration, were given non-identifying integer values.

Statistical analysis

The modelling approach was to use a parsimonious set of predictor variables and their interactions similar to a previous paper utilizing data from the ANZICS Adult Patient Database [21]; no automated routine for covariate selection, such as stepwise regression, was used. The primary focus was the prediction of RALOS and not on coefficient interpretation, albeit subscribing to a data- not algorithmic-modeling scenario, as defined in Breiman 2001 [22].

  1. Prediction of ICU LOS

    a. GLMM: this was undertaken using the Stata™ (Version 17) module “meglm” (Gaussian family, log link) with ICU site as a random intercept and ICU LOS (in days, the original scale of the dependent variable, calculated from date-stamped hour and minute electronic records) as the dependent variable.

    b. Predicted LOS was established as “fitted” values, both including the site-specific random effect (RE) and for the fixed part of the model only (FE).

    c. Performance sensitivity analysis was undertaken using a split-sample estimation (60%) / validation (40%) technique, based upon random allocation of site as a stratum.

    d. R-squared (R2), at the patient and ICU level, was calculated as the square of the (product-moment) correlation coefficient of LOS versus model predictions. With respect to R2, values of 20–28% at the patient level and 50–70% at the ICU level have been previously found for predictive models [5, 7].

    e. Different GLMM family and link combinations were also considered, based on the distribution of the LOS (positive integer values with skewed distribution): gamma family and Poisson family with log link.

    f. OMELOS was calculated as observed LOS minus RALOS, the latter from the “meglm” output. For each ICU, point estimates and CI were calculated using the “mean” and “bca” (bias corrected and accelerated bootstrap [23]) commands provided by Stata™.

    g. Ratios of observed LOS and RALOS (risk adjusted LOS ratio, RALOSR [10]) were also computed using the “ratio” and “bca” bootstrap (1000 repetitions) commands of Stata™.

    h. Visualisation of ICU LOS and model predictions was performed using kernel density plots [24] (smoothed histograms) in Stata™.

  2. ICU LOS as a quality metric

    a. This was undertaken using fixed effects model predictions following Straney et al. [10], not including site-specific RE, to avoid adjusting for what was desired to establish, that is, ICU performance.

    b. As well as the outputs from the GLMM above, an LMM was estimated, as “mixed” within Stata™ Version 17, with the same variables as the GLMM, ICU site as a random intercept and the dependent variable transformed to the log scale (log(LOS)), again following Straney [10].

      i. Normality of log ICU LOS was tested computationally and graphically using the user-written Stata module “qctest” [25].

      ii. Predictions (log(LOS)) were estimated from both the fitted (RE) and fixed (FE) parts of the model.

      iii. For “mixed”, ICU RALOSRs were established using a user-written (jlm) “ratio” command to compute the ratio of the geometric means of the LOS and the RALOS, which was subsequently bootstrapped to estimate “bca” CI.

    c. As a sensitivity analysis, RE and their standard errors (SE) were predicted at the ICU level from both the “meglm” and “mixed” models and 95% CI were calculated for the point estimates (± 1.96*SE) [26].

  3. Model specification was checked using:

    a. Information criteria for covariate selection: the Akaike (AIC) and Schwarz’s Bayesian (BIC) criteria [27]. Further details are provided in the Supplementary file (“Stata command syntax and model specification”, P 2/3).

    b. Residual analysis: for the GLMM, deviance and Anscombe; for the LMM, conventional and standardised residuals [26].

    c. R2 estimates at the patient and ICU level (see above).

  4. OMELOS, RALOSR and ICU RE displays were produced using the Stata user-written module “forest” (through “metan” V 4.05, 29th November 2021: [28]); metric point estimates were ranked in the displays.

  5. Using the point estimate and SE from OMELOS, RALOSR and ICU RE estimates, displays of rank confidence sets, both marginal and simultaneous, were produced using the R statistical package “csranks” [29].
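The two LOS indices defined in the steps above can be sketched for a single ICU as follows. This is a minimal illustration in Python rather than the Stata commands used in the study; the function names are hypothetical, and a simple percentile bootstrap is substituted for the bias corrected and accelerated (“bca”) interval, so the limits will generally differ from those reported. The geometric-mean ratio assumes that LMM predictions have already been exponentiated to the day scale.

```python
import random
from statistics import fmean, geometric_mean


def omelos(observed, expected):
    """Observed minus expected LOS: mean patient-level difference, in days."""
    return fmean([o - e for o, e in zip(observed, expected)])


def ralosr(observed, expected):
    """Risk adjusted LOS ratio on the arithmetic scale: ratio of means."""
    return fmean(observed) / fmean(expected)


def ralosr_geometric(observed, expected):
    """Ratio of geometric means (assumes day-scale, strictly positive values)."""
    return geometric_mean(observed) / geometric_mean(expected)


def percentile_bootstrap_ci(observed, expected, stat, reps=1000, alpha=0.05, seed=2016):
    """Percentile bootstrap CI, resampling patients (pairs) with replacement.

    Assumption: a stand-in for the BCa interval, which adds bias and
    acceleration corrections that are not implemented here.
    """
    rng = random.Random(seed)
    n = len(observed)
    draws = []
    for _ in range(reps):
        idx = [rng.randrange(n) for _ in range(n)]
        draws.append(stat([observed[i] for i in idx], [expected[i] for i in idx]))
    draws.sort()
    lower = draws[int(reps * alpha / 2)]
    upper = draws[int(reps * (1 - alpha / 2)) - 1]
    return lower, upper


# Illustrative values only, not study data.
obs = [1.0, 2.0, 3.0, 4.0, 10.0]
exp_los = [2.0, 2.0, 3.0, 3.0, 5.0]
print(omelos(obs, exp_los), ralosr(obs, exp_los))
print(percentile_bootstrap_ci(obs, exp_los, omelos))
```

Resampling observed and expected values as pairs keeps each patient's risk adjustment attached to that patient's outcome, which is the natural unit of resampling within an ICU.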

Results

Details of cohort

The initial database for the calendar year 2016 consisted of 94,361 adult patients from 125 ICUs with a median annual patient number of 524 (25th percentile 328, 75th percentile 1028, minimum 152 and maximum 2887). Patient demographics are displayed in Table 1.

Table 1 Cohort demographics

The patient variables used to model ICU LOS were age (and its square), APACHE III score (and its square), ANZICS risk of death score (log), pre-ICU days; death in ICU [30], acute renal failure, treatment limitation, cardiac arrest pre-ICU and mechanical ventilation on day 1 of ICU (as binary variables, 1/0); hospital ICU classification (4 level categorical); collapsed APACHE III categorical variables for surgical and medical diagnoses (30 level; see Supplementary files: Appendix 1, Table 1). Multiple variable interactions were utilised in modelling; Stata command syntax for both GLMM and LMM is given in Supplementary files: Appendix 1, page 2. GLMM and LMM models converged satisfactorily with a total patient number of 87,980, representing, for complete case analysis, a missing data fraction of 9%; no multiple imputation was undertaken.

Modelling approaches

Generalised linear mixed model (GLMM)

The GLMM converged after 124 iterations, requiring the built-in Stata™ maximization option “difficult” and the non-default BFGS (Broyden–Fletcher–Goldfarb–Shanno) algorithm. Model coefficients are displayed in Supplementary files: Appendix 1, Table 2. Residual (deviance and Anscombe) analysis was acceptable and predicted ICU LOS values are shown in Table 2.

Table 2 Predicted values of ICU length of stay; n = 87,980

The GLMM predictions compared with ICU LOS are displayed in Fig. 1 using kernel density plots.

Fig. 1 Kernel density plots of observed ICU LOS and GLMM predictions (RE and FE), truncated at 20 days

For the RE model, the split-sample sensitivity analysis yielded patient R2 (predicted versus observed ICU LOS) of 0.19 (development set, n = 48,015, 75 ICUs) and 0.21 (validation set, n = 39,965, 49 ICUs). For the whole estimation sample, n = 87,980, patient and ICU R2 were 0.20 and 0.85 respectively.
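The R2 reported here is the squared product-moment correlation between observed LOS and model predictions. A minimal Python sketch of that calculation follows; the function names are hypothetical, and the ICU-level version assumes that the correlation is taken between per-ICU means of observed and predicted LOS, which the text does not state explicitly.

```python
from collections import defaultdict


def pearson_r2(x, y):
    """Square of the product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def icu_level_r2(icu, observed, predicted):
    """R2 between per-ICU means (assumed definition of the ICU-level R2)."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # sum observed, sum predicted, count
    for site, o, p in zip(icu, observed, predicted):
        t = totals[site]
        t[0] += o
        t[1] += p
        t[2] += 1
    obs_means = [t[0] / t[2] for t in totals.values()]
    pred_means = [t[1] / t[2] for t in totals.values()]
    return pearson_r2(obs_means, pred_means)


# Illustrative values only, not study data: three ICUs, two patients each.
icu = [1, 1, 2, 2, 3, 3]
observed = [1.0, 3.0, 2.0, 6.0, 5.0, 7.0]
predicted = [2.0, 2.0, 4.0, 4.0, 6.0, 6.0]
print(pearson_r2(observed, predicted), icu_level_r2(icu, observed, predicted))
```

Averaging within ICUs removes patient-level scatter, which is why an ICU-level R2 can be much higher than the patient-level value for the same model.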

Two different GLMM family and link combinations, gamma family and Poisson family with log link, failed to converge.

Linear mixed model (LMM)

The LMM converged rapidly; model coefficients are displayed in Supplementary files: Appendix 1, Table 3. Residual (conventional and standardised) analysis was acceptable and predicted ICU LOS values in the log-metric are shown in Table 3. Log-ICU LOS was not normally distributed as per the “qctest” Stata module.

Table 3 Predicted ICU length of stay (log metric) values; n = 87,980

For the RE model (log metric), the split-sample sensitivity analysis yielded patient R2 (predicted versus observed ICU LOS) of 0.30 (n = 48,015) and 0.28 (n = 39,965). For the whole estimation sample, n = 87,980, patient and ICU R2 were 0.29 and 0.96 respectively.

The LMM log predictions compared with log (observed) ICU LOS are seen in Fig. 2 using kernel density plots.

Fig. 2 Kernel density plots of observed ICU LOS and LMM predictions; log metric

Similarly, ICU LOS geometric means are plotted in Fig. 3 for raw ICU LOS and LMM predictions (fixed and random effects).

Fig. 3 Kernel density plots of geometric means (GM) by ICU for observed ICU LOS and LMM predictions

For the whole estimation sample, n = 87,980, ICU R2 was 0.38 and 0.88 for the fixed and random effects LMM models respectively.

Quality metrics: tertiary ICUs used as exemplars

RALOSR FE: GLMM vs LMM

The combined graph (Fig. 4) shows the ratio changes across the spread of ICUs, but there was no concordance of ICU rankings between the two estimators, albeit the comparison is between the arithmetic and geometric LOS predictions. For the GLMM, upper RALOSR 95% CI limits were < 1 in 12 ICUs and lower RALOSR 95% CI limits were > 1 in 19; for the LMM these counts were 14 and 14 respectively.

Fig. 4 RALOSR for fixed effects, GLMM versus LMM

OMELOS

The OMELOS fixed effects estimates are shown in Fig. 5. There was no concordance of ICU rankings compared with the RALOSR for either the GLMM or LMM models. The upper 95% CI limits were < 0 in 12 ICUs and lower 95% CI limits were > 0 in 19.

Fig. 5 OMELOS (from GLMM)

Site-specific random effects

The ICU site RE are plotted in Fig. 6 for both the GLMM and LMM models. There was no concordance of ICU rankings between the two model RE and the LMM RE were constrained in magnitude compared with the GLMM. The GLMM and LMM RE upper 95% CI limits were < 0 in 12 and 14 ICUs and lower 95% CI limits were > 0 in 18 and 14 respectively. Not surprisingly, ICU rankings were discordant between the RE and FE models.

Fig. 6 Site specific RE: GLMM and LMM

ICU site rankings

Marginal confidence sets: RALOSR: GLMM versus LMM

Figure 7 shows marginal ICU site rankings estimated for the RALOSR for both GLMM and LMM (fixed effects) as estimated by the “csranks” package. The interpretation of “marginal” is that the confidence set for a given ICU covers the rank of that single ICU with probability 95%.

Fig. 7 Marginal confidence sets for RALOSR: GLMM and LMM

For marginal confidence sets, the GLMM produced clusters of similarly ranked ICUs, but for the LMM the rankings were far more concentrated and the 95% limits were wider.

Simultaneous confidence sets: RALOSR: GLMM versus LMM

Figure 8 shows simultaneous ICU site rankings estimated for the RALOSR for both GLMM and LMM (fixed effects). The interpretation of “simultaneous” is that the confidence sets, built from all pairwise differences in ICU RALOSR, cover the ranks of all ICUs jointly with 95% probability. Site rank clustering for GLMM is less apparent than for the marginal sets and the simultaneous confidence sets are more concentrated than in the marginal case. Simultaneous 95% limits were wider for both estimators.
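The distinction between marginal and simultaneous confidence sets for ranks can be illustrated with a short Python sketch working from point estimates and SE. This is not the “csranks” algorithm: Bonferroni-adjusted normal critical values are substituted here as a conservative stand-in, ICU estimates are assumed independent, rank 1 is arbitrarily assigned to the largest value, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def rank_confidence_sets(est, se, j=None, alpha=0.05):
    """Return {unit index: (lowest rank, highest rank)}.

    j=None : simultaneous sets, adjusting over all n(n-1)/2 pairwise differences.
    j=k    : marginal set for unit k, adjusting over its n-1 differences only.
    Assumptions: independent estimates, normal approximation, Bonferroni
    adjustment (conservative).
    """
    n = len(est)
    if j is None:
        n_comparisons = n * (n - 1) // 2
        targets = range(n)
    else:
        n_comparisons = n - 1
        targets = [j]
    z = NormalDist().inv_cdf(1 - alpha / (2 * n_comparisons))
    sets = {}
    for i in targets:
        surely_above = surely_below = 0
        for k in range(n):
            if k == i:
                continue
            diff = est[k] - est[i]
            half_width = z * sqrt(se[i] ** 2 + se[k] ** 2)
            if diff - half_width > 0:
                surely_above += 1  # unit k is credibly larger than unit i
            elif diff + half_width < 0:
                surely_below += 1  # unit k is credibly smaller than unit i
        sets[i] = (surely_above + 1, n - surely_below)
    return sets


# Illustrative values only, not study data: four ICUs, two nearly tied.
est = [1.30, 1.00, 0.98, 0.70]
se = [0.02, 0.02, 0.02, 0.02]
print(rank_confidence_sets(est, se))        # simultaneous sets for all ICUs
print(rank_confidence_sets(est, se, j=1))   # marginal set for the second ICU
```

ICUs whose estimates cannot be separated share overlapping rank sets, which is the clustering seen in the rank displays; the simultaneous adjustment uses a larger critical value, so its sets can only be as wide as, or wider than, the marginal ones.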

Fig. 8 Simultaneous confidence sets for RALOSR, GLMM and LMM

OMELOS from GLMM

Figure 9 shows marginal and simultaneous confidence sets for the OMELOS metric (fixed effects). Marginal rank clustering appears less marked than for RALOSR, both GLMM and LMM. Simultaneous set ranking still preserved some clustering features; 95% confidence limits were wider.

Fig. 9 Marginal and simultaneous confidence sets for the OMELOS metric

Site-specific RE: GLMM and LMM

Figure 10 shows the marginal confidence sets for the ICU site-specific ranked RE for both the GLMM and the LMM. The GLMM shows clustering of the site RE, whereas the LMM estimates are compressed, with wider 95% limits.

Fig. 10 Marginal confidence sets for ICU RE ranks: GLMM and LMM

Figure 11 shows the simultaneous confidence sets for ranked ICU RE for both the GLMM and the LMM. The GLMM shows clustering of the site RE, whereas the LMM estimates are compacted, with wider 95% limits.

Fig. 11 Simultaneous confidence sets for ICU RE ranks: GLMM and LMM

Discussion

Both the GLMM and LMM performed satisfactorily with respect to model specification and prediction of ICU LOS. However, there was no concordance of ICU rankings between model predictions, GLMM versus LMM, nor between the quality metrics used: RALOSR, OMELOS and site-specific RE. That is, there was no “one best model”; thus, ICU “performance” is determined by model choice and any rankings based upon it should be interpreted with circumspection. These inconsistencies are further examined.

Predictive models

Within the critical care literature, prediction of ICU LOS has predominantly used linear regression [8, 31], generalised linear regression (GLM [32, 33]) and LMM [6, 10], the latter formally accounting for patient clustering within ICUs. Although GLM variants (Poisson, negative binomial and gamma), including the mixed model (RE) formulation [33, 34], have also been utilised, the current study, despite detailed examination, found lack of convergence with mixed effects Poisson and gamma models, possibly related to the large cohort size and multiple factor interactions. In the current study, the maximum ICU LOS was 127 days and there were no negative predicted LOS days, as may occur with linear regression on raw ICU LOS [33]. No formal truncation of LOS was undertaken; the implications of these measures have been previously discussed in detail [6].

The R2 values for both models at the patient and ICU level for predicted LOS were reasonable. At the patient and ICU level, R2 values of 20–28% and 50–70% respectively have been found for predictive models [5, 7]. This being said, the current study operated at the ICU level, and a focus on performance at the individual level would seem to be neither warranted nor an intrinsically productive exercise [7, 8]. For other right skewed variables, such as health costs, there would appear to be an upper limit to R2 [35], and comparison of R2 values between models with different functional forms of the dependent variable, for instance raw and log transformed [31, 33], is not a justifiable practice [36]. Formal computational R2 measures have been described for both LMM and GLMM [37, 38], but a simple, easily computed measure was preferred. Uncertainty, as confidence intervals (CI), has been variously estimated: analytically [39] or by the bootstrap [40], of which there is a “bewildering” array of methods [41].

LOS, either ICU or hospital, is right (positively) skewed and log transformation has frequently been applied to LOS as the dependent regression variable. This being said, appropriate retransformation [42, 43] to the original metric (days) is problematic, as \(\exp\left\{E\left(\ln Y\right)\right\}\ne E\left(Y\right)\), and has rarely been addressed within the biomedical, as opposed to the econometric [6, 44], literature. Although correction terms for back-transformation to the original metric under both homo- and hetero-skedasticity have been implemented in Stata for linear regression models [45], such is not the case for LMM, albeit the theoretical basis for such has been established by Ramírez-Aldana and Naranjo [46].
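The direction and size of the retransformation discrepancy can be made explicit. By Jensen's inequality, for any positive, non-degenerate \(Y\),

\[
\exp\left\{E\left(\ln Y\right)\right\} < E\left(Y\right),
\]

so naive exponentiation of a log-scale mean always understates the arithmetic mean. Under the purely illustrative assumption (not made in the present analysis) that \(\ln Y \sim N(\mu, \sigma^{2})\),

\[
\exp\left\{E\left(\ln Y\right)\right\} = e^{\mu}, \qquad E\left(Y\right) = e^{\mu + \sigma^{2}/2},
\]

that is, the back-transformed value is the geometric mean (and, in this case, the median), and it falls short of the arithmetic mean by the factor \(e^{\sigma^{2}/2}\). When the log-scale residuals are not normal or not homoskedastic, this simple factor no longer applies.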

There has been debate regarding the virtues, or otherwise, of log transformation in analysis [16]. In (linear) regression, the requirement for “normality” applies to model residuals not to the data covariates [47] and log transformation guarantees neither reduced dependent variable skewness nor variation; in fact, it may produce the opposite [48]. In the current case, normality of ICU LOS was not attained by log transformation, implying that the raw ICU LOS was not log-normally distributed. With respect to inference on the additive (arithmetic [6]) or multiplicative (geometric [10]) scale, the geometric mean, being multiplicative, has found use in analysing compounding investment [49] and the physical sciences [50], but lacks a “clear and concise physical interpretation” [51]. It exhibits bias for small samples and is sensitive to the probability distribution and skewness of the variable under consideration; only for the lognormal distribution is the geometric mean equivalent to the median [51]. For skewed data sets with many zeros, the common practice of adding a small positive constant to the observations (the “shift” parameter) before log transformation has little to recommend it as such a parameter has a highly significant effect on the estimator of the geometric mean [16]. Recent reviews have cautioned against the “routine” use of log-transformation in regression; rather GLM, or, as in the current paper, GLMM have been endorsed [14, 52, 53]. As noted by Deb et al., “Properly interpreting results from a log-transformed model requires substantially more effort” [54].
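The sensitivity of the geometric mean to the “shift” parameter noted above can be seen in a few lines of Python. The values are invented for illustration and are not study data, and the function name is hypothetical.

```python
from math import exp, log
from statistics import fmean


def shifted_geometric_mean(values, shift):
    """exp(mean(log(y + shift))): geometric mean after adding a constant."""
    return exp(fmean([log(v + shift) for v in values]))


# Illustrative skewed data containing zeros (not study data).
skewed = [0.0, 0.0, 1.0, 10.0, 100.0]
for shift in (0.001, 0.1, 1.0):
    print(shift, shifted_geometric_mean(skewed, shift))
```

With zeros present, the choice among seemingly innocuous constants changes the estimate by more than an order of magnitude, whereas the arithmetic mean of the same values is simply moved by the constant added.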

Quality measures

ICU LOS would seem to be an exemplary quality measure, as it reflects resource use [3], and has been used with outcome measures, such as the standardised mortality rate (SMR), in “efficiency plots” [2] in a number of jurisdictions [39, 55,56,57]. Empirical studies have also demonstrated independence of indices of ICU LOS and the SMR [9, 31, 58].

Using three ICU LOS indices (OMELOS, RALOSR and site-specific RE) with two estimators of ICU LOS (GLMM and LMM), there was no monotonicity of ICU LOS point estimates or rankings between indices and/or estimators. No intrinsic merit of one or more of these indices / estimators would appear to have been demonstrated, although attention has been drawn to potential limitations of the geometric (mean) metric and it could be argued that, ceteris paribus, site-specific RE encapsulate ICU differences more adroitly [59]. Caterpillar plots have been used to display indices of RALOS [9, 10], but the debate regarding the appropriate way to analyse and present such data, since the seminal paper (1996) of Goldstein and Healy, “The graphical presentation of a collection of means” [60], is substantial [59, 61]. One particular problem with the caterpillar and forest plot [62] variant is that of “…eyeballing …” the estimates, whereby inference (of, say, ICU differences) is conducted in a non-transparent manner [63]. Formal solutions to this problem have been proposed [21, 64], but the current study used ranking measures. Rankings are estimates, not true values, and such uncertainty may be addressed by constructing confidence sets for the ICU LOS ranks as (i) marginal, where the confidence set covers the rank of a single ICU with 95% probability, and (ii) simultaneous, where the confidence sets, built from all pairwise differences in ICU LOS, cover the ranks of all ICUs jointly with 95% probability [12]. As implemented in the “csranks” software [29], the multiple hypothesis testing regimen controls the familywise error rate and any false directional claim about the sign of a difference; the assumptions involved are “weak” and robust to small differences between (ICU) units ([12], especially “Remark 3.5”).
Not surprisingly, the ranking estimates and conventional point-estimates and 95% CI across quality indices and estimator were not consistent, but the former more easily displayed ICU clustering (small measure estimate differences) and simultaneous inference across ICUs. Ranking estimates for all hospital ICU classifications and quality metrics are displayed in Supplementary files: Appendix II. With respect to between-ICU discrimination, the OMELOS metric would appear to be most favourable for both marginal and simultaneous confidence sets, although this was not as explicit in the rural / regional ICU cohort. This may reflect practice patterns within ICU cohorts and / or ICU patient yearly number; the latter varied substantially over ICU hospital classification (Table 1), as expected. We view the utilisation of the confidence sets for the ICU LOS ranks as a major advancement.

Implications of the current study

The upshot of our analysis is that there is no “one best model”; each model produced different rankings. ICUs may be unfairly labelled as “poor performers” when using a particular risk-adjustment model and deemed “good performers” when using a different model. “Performance” in this context may represent quality of care or stewardship of limited resources. Casting a hospital as a “poor performer” may not only negatively affect their reimbursement but may also negatively impact their standing in the community. As such, a multifarious approach to the development and testing of future predictive and risk-adjustment models is mandated to ensure that only the “one best model” is promulgated. Conversely, if multiple models produce different rankings (as we found here), then no one model should be proffered as the definitive solution for risk-adjustment.

Limitations

The current study was registry derived [20] and it is known that clinical studies using observational databases may be sensitive to database choice [65]. Only two estimators of LOS have been reported, albeit many potential estimators exist; the performance of some of these has been discussed in detail [6]. Death in ICU was also treated as a fixed model covariate rather than censored, as in time-to-event analysis, to facilitate straightforward analysis of the total ICU population. Similarly, ICU LOS, and not hospital LOS, was analysed as a quality-of-care indicator, as the former appears to be the most plausible choice, at least within the critical care literature, and more particularly in so-called “efficiency plots”. The models entailed a large number of associated covariates, but the “problem” of covariate multicollinearity was discounted [66]. The impact of “exit block” upon ICU LOS [67] was not subject to quantification.

Conclusions

Inference regarding adjusted ICU LOS was dependent upon the statistical estimator and the quality index used to quantify any LOS differences. Therefore, formal ranking estimates, being subject to model determination, are problematic. Development and testing of future predictive and risk-adjustment models should utilize a comprehensive approach, such as that implemented here, to test the consistency of different models in producing ICU rankings.