Introduction

Functional decline and disability affect many survivors of critical illness and can be long-lasting [1]. Post-intensive care syndrome comprises physical, cognitive, and mental health impairments, which can result in adverse socioeconomic consequences and are recognised by patients, clinicians, and public sector organisations as a major public health issue [2, 3]. Muscle wasting occurs rapidly in critical illness and is the result of decreased protein synthesis, bioenergetic failure, and intramuscular inflammation [4,5,6]. Nutritional and metabolic interventions may be able to reverse these pathological changes, improving patient outcomes [7]. The variation in outcomes collected makes comparison between trials challenging, limiting future systematic reviews and meta-analyses [8, 9].

A methodological approach to address this issue is the creation of a Core Outcome Set (COS). This approach does not prevent researchers from evaluating additional outcomes; rather, it provides a minimum standard, ensuring that essential outcomes within a research area are consistently assessed using the same measurement instruments. Core outcome measures for clinical effectiveness of nutritional and metabolic interventions in critical illness (CONCISE) is an internationally agreed set of outcomes and measurement instruments for use at 30 and 90 days post enrolment, in nutritional and metabolic clinical research in critically ill adults [10]. The development of CONCISE involved a systematic review identifying outcome measures used in critical care nutrition trials and their clinimetric properties, followed by a consensus process. The following measurement instruments were recommended: Short Form-36 Physical Component Score (SF-36 PCS) [11], 30 s sit-to-stand (30STS) [12], 6-min walk test (6MWT) [13], Short Physical Performance Battery (SPPB) [14], Barthel Index [15], Katz Index [16], Lawton Instrumental Activities of Daily Living (IADL) [17], Global Leadership Initiative on Malnutrition criteria (GLIM) [18] and handgrip strength (HGS) [19].

Clinicians and researchers using the measurement instruments recommended by CONCISE need to be aware of the clinimetric properties of these measurement instruments, to ensure valid and reliable research. Clinimetric or measurement properties refer to the quality of the measurement tool and the quality of its performance [20]. This systematic review and meta-analysis aimed to summarise and evaluate the clinimetric properties of the measurement instruments recommended in CONCISE.

Methods

The review was registered on PROSPERO (CRD42023438187) on 21st June 2023. This study followed the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) methodology for systematic reviews of Patient-Reported Outcome Measures (PROMs) [21]. This review is reported in line with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement (Additional file 1: Table S1) [22], as recommended by the COSMIN guidelines pending publication of the combined PRISMA-COSMIN guideline [23].

Search strategy and selection criteria

A search strategy was designed based on the search filter for finding studies on clinimetric properties, developed by Terwee et al. [24]. The search strategy is outlined in Additional file 1. Four electronic databases (MEDLINE via Ovid, EMBASE via Ovid, CINAHL via Healthcare Databases Advanced Search, and CENTRAL via Cochrane) were searched from inception to December 2022. Studies identified in the preliminary systematic review process for CONCISE were added [8, 10]. Reference lists were manually searched to screen for eligible studies and relevant review articles. No limits for language, date or geographical region were used. Citations were imported to the web-based collaboration software platform, Covidence [25].

Inclusion and exclusion criteria

Inclusion and exclusion criteria were established prior to screening. Studies in any language were included if they examined at least one clinimetric property of a CONCISE measurement instrument in adults ≥ 18 years with critical illness or recovering from critical illness. To ensure completeness, we also included studies examining the clinimetric properties of variations or components of CONCISE measurement instruments, including the Short Form-36 Physical Functioning (SF-36 PF), five times STS (5xSTS) and SPPB 4 m gait speed. We included systematic reviews and pooled analyses where they provided new data. Unpublished studies, preprints, and conference abstracts without subsequent study publication were excluded.

Two authors (TD, EK) screened each title and abstract independently to determine eligibility for inclusion. Disagreements were resolved through discussion with a third reviewer (ZP). Full texts were assessed by both authors against the predetermined inclusion and exclusion criteria. Data extraction was completed by two authors (TD, EK) independently using standardised extraction forms. Data extraction included publication details (e.g., title, year, journal), patient characteristics (e.g. age, sex, severity and duration of illness), details of measurement setting (e.g., type of intensive care unit (ICU), timeframe) and the predetermined clinimetric properties of the measurement instrument. Authors were contacted for missing demographic data. Clinimetric properties extracted were based on the COSMIN guidelines and are described in Table 1. Data included structural validity (factor analysis results on dimensionality), internal consistency (Cronbach’s alpha), reliability (intraclass correlations), measurement error (standard error of measurement (SEM), smallest detectable change (SDC) and minimal important change (MIC)), construct validity (convergent validity—correlation of CONCISE instruments with comparator measures (Additional file 1: Table S2), divergent validity—correlation of CONCISE instruments with dissimilar measures (Additional file 1: Table S2); and known-groups validity—comparison of CONCISE instrument scores between two subgroups using relative effect sizes or area under the curve (AUC)), responsiveness to change (mean differences, median differences, AUC or relative effect sizes), predictive validity (correlation, odds ratio, AUC or regression coefficient) and interpretability (floor and ceiling effects). 
Content validity (as per step 5 of COSMIN guidelines) [26] was not evaluated as the aim of this review was to present and evaluate the clinimetric properties of the measurement instruments which had reached consensus through rigorous methodology in CONCISE, and not to formulate additional recommendations about the use of specific outcome measurement instruments.

Table 1 COSMIN clinimetric properties and updated criteria for good measurement properties

Assessment of risk of bias and certainty of the evidence

Two independent reviewers (TD, EK) used the COSMIN checklist to evaluate the risk of bias of clinimetric properties, blinded to each other's ratings [21]. Disagreements were resolved by discussion with a third reviewer (ZP). Based on the risk of bias assessment, studies were rated as either very good, adequate, doubtful, or inadequate. Following this, each clinimetric property result was rated against the criteria for good measurement (clinimetric) properties (Table 1). Each result was rated as sufficient (+), insufficient (−) or indeterminate (?). Predictive validity was not rated as this is not included in the COSMIN checklist. Specific hypotheses were developed for construct validity and responsiveness (Additional file 1: Table S3). Construct validity and responsiveness were considered sufficient (+) if ≥ 75% of the hypotheses were met, or insufficient (−) if ≥ 75% of the hypotheses were not met; otherwise they were considered inconsistent (±) [21]. All results for each clinimetric property were qualitatively summarised and, where appropriate, quantitatively pooled, and this summarised result was evaluated against the criteria for good measurement (clinimetric) properties to get an overall rating. Finally, the evidence was graded using the modified Grading of Recommendations, Assessment, Development and Evaluation (GRADE) approach [21]. GRADE was adopted and modified as per COSMIN guidelines to rate four of the five GRADE factors (risk of bias, inconsistency, imprecision, and indirectness). Disagreements were resolved by discussion with a third reviewer (ZP).
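The 75% decision rule used for construct validity and responsiveness can be sketched as a small function. This is an illustrative sketch only: the function name `rate_hypotheses`, the ASCII return symbols, and the treatment of the case where no hypotheses were tested are assumptions made here, not part of the COSMIN guidance or of the review's own workflow.

```python
def rate_hypotheses(n_met: int, n_total: int) -> str:
    """Rate construct validity or responsiveness from hypothesis counts.

    Returns "+" (sufficient), "-" (insufficient), "±" (inconsistent),
    or "?" when no hypotheses were tested (an assumption of this sketch).
    """
    if n_total <= 0:
        return "?"
    if not 0 <= n_met <= n_total:
        raise ValueError("n_met must lie between 0 and n_total")
    n_not_met = n_total - n_met
    # Integer arithmetic avoids floating-point edge cases at exactly 75%.
    if 4 * n_met >= 3 * n_total:
        return "+"
    if 4 * n_not_met >= 3 * n_total:
        return "-"
    return "±"


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for met, total in [(3, 4), (1, 4), (2, 4)]:
        print(met, total, rate_hypotheses(met, total))
```

Exactly 75% of hypotheses met counts as sufficient under this rule, and exactly 75% not met counts as insufficient; anything in between is rated inconsistent.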

Data synthesis

For reliability, where there were three or more studies, we calculated pooled intraclass correlation coefficients (ICCs) and 95% confidence intervals using a standard generic inverse variance random effects model. ICC values were combined based on estimates derived from a Fisher transformation, z = 0.5 × ln((1 + ICC)/(1 − ICC)), which has an approximate variance Var(z) = 1/(N − 3), where N is the sample size [27]. Between-study heterogeneity was evaluated using the I² statistic. Where meta-analysis was not appropriate, we calculated means and standard deviations weighted by the number of participants included per study. Where it was not possible to pool results statistically, results were descriptively summarised. Meta-analysis of data was performed using the statistical software package Review Manager 5.4 (RevMan 5.4.1). Where effect sizes were missing and studies provided sufficient data, Cohen's d was computed as the effect size to assess responsiveness. In cases where the data did not allow for Cohen's d calculation, standardised response mean (SRM) was used as an alternative effect size measure.
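The pooling of ICCs on the Fisher z scale can be illustrated with a short sketch. It assumes a DerSimonian-Laird estimate of between-study variance, which is one common form of generic inverse variance random effects model; the text does not specify the estimator, and the analysis itself was run in RevMan rather than with this code. The function names and the example inputs are hypothetical.

```python
import math


def fisher_z(icc: float) -> float:
    """Fisher transformation: z = 0.5 * ln((1 + ICC) / (1 - ICC))."""
    return 0.5 * math.log((1 + icc) / (1 - icc))


def pool_iccs(iccs, sample_sizes):
    """Pool ICCs with an inverse variance random effects model on the z scale.

    Assumes Var(z) = 1 / (N - 3) and a DerSimonian-Laird tau^2 (assumption).
    """
    z = [fisher_z(r) for r in iccs]
    v = [1.0 / (n - 3) for n in sample_sizes]
    w = [1.0 / vi for vi in v]
    sum_w = sum(w)
    z_fixed = sum(wi * zi for wi, zi in zip(w, z)) / sum_w
    # Cochran's Q and the derived heterogeneity measures.
    q = sum(wi * (zi - z_fixed) ** 2 for wi, zi in zip(w, z))
    df = len(z) - 1
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    # Random effects weights add tau^2 to each within-study variance.
    w_re = [1.0 / (vi + tau2) for vi in v]
    z_re = sum(wi * zi for wi, zi in zip(w_re, z)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    lower, upper = z_re - 1.96 * se, z_re + 1.96 * se
    # tanh is the inverse of the Fisher transformation.
    return {
        "icc": math.tanh(z_re),
        "ci95": (math.tanh(lower), math.tanh(upper)),
        "i2": i2,
        "tau2": tau2,
    }


if __name__ == "__main__":
    # Hypothetical ICCs and sample sizes, not values from the included studies.
    print(pool_iccs([0.82, 0.90, 0.88], [25, 40, 60]))
```

Back-transforming with `tanh` keeps the pooled estimate and its confidence limits inside the valid range for an ICC, which is the main reason for pooling on the z scale rather than averaging ICCs directly.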

Results

Study selection

The search identified 4316 studies. Forty-seven were included in the review, reporting data for 12,308 participants. The PRISMA flow diagram is shown in Fig. 1. All included articles were in English. Table 2 outlines the characteristics of the included studies.

Fig. 1 PRISMA diagram

Table 2 Key characteristics of included studies

Risk of bias

The COSMIN risk of bias rating varied from inadequate to very good. Ratings for individual studies are provided in Additional file 1: Table S4. Multiple studies tested more than one measurement property (n = 15). The breakdown of studies reporting clinimetric properties was as follows: structural validity (n = 0), internal consistency (n = 4), reliability (n = 10), measurement error (n = 9), hypothesis testing for construct validity (n = 25) and responsiveness (n = 12). Certainty of evidence was rated using the GRADE approach [21]. Ratings ranged from very low to high. GRADE ratings are outlined in Additional file 1: Table S5.

Measurement instruments

Full results are outlined in Additional file 1: Tables S6, S7, S8 and Fig. 2. No studies tested structural validity and it is therefore not included below.

Fig. 2 Results overview. The colour of the box refers to the COSMIN criteria for good measurement (clinimetric) properties: green = sufficient; orange = indeterminate or inconsistent; red = insufficient. The Grading of Recommendations, Assessment, Development and Evaluation (GRADE) rating for the certainty of evidence is presented in each box. CI = confidence intervals; GLIM = global leadership initiative on malnutrition; HGS = handgrip strength; IADL = instrumental activities of daily living; ICC = intraclass correlation coefficient; MIC = minimal important change; SD = standard deviation; SDC = smallest detectable change; SF-36 PCS = short form-36 physical component score; SF-36 PF = short form-36 physical functioning; SPPB = short physical performance battery; STS = sit-to-stand; 6MWT = 6-min walk test

Physical function

Short Form-36 Physical Function (SF-36 PF)

Eleven studies reported data for the SF-36 PF [28,29,30,31,32,33,34,35,36,37,38]. The SF-36 PF had excellent internal consistency (pooled Cronbach’s α 0.94) supported by a high certainty of evidence but was rated indeterminate due to no information on its structural validity. It had sufficient test–retest reliability (Pooled ICC 0.86) supported by a low certainty of evidence [32, 33, 35]. There was a moderate to high certainty of evidence supporting sufficient construct validity and responsiveness [29,30,31, 35,36,37,38,39]. No studies tested measurement error. Floor effects post ICU discharge ranged from 6 to 32% and ceiling effects post ICU discharge ranged from 9 to 38% (Additional file 1: Table S7 and Fig. 3) [34, 35, 37]. The SF-36 PF score at 1 month post ICU discharge was not predictive of 1 year mortality or 6 month readmissions [52]. There was no data on the association with length of stay.

Fig. 3 Floor effects in hospital and during recovery from critical illness. Floor effects for CONCISE measurement instruments in hospital and during recovery from critical illness. Where more than one study reported a result, the mean was calculated. Relevance threshold set at 15%. BI = Barthel Index; HGS = handgrip strength; SF-36 PCS = physical component score of the short form-36; SF-36 PF = physical functioning score of the short form-36; SPPB = short physical performance battery; 30STS = 30 s sit-to-stand; 5xSTS = five times sit-to-stand; 6MWT = 6-min walk test.

Short Form-36 Physical Component Score (SF-36 PCS)

Nine studies reported data for the SF-36 PCS [29, 33, 37,38,39,40,41,42,43]. No studies tested internal consistency or reliability. There was a moderate to high certainty of evidence supporting sufficient construct validity and responsiveness [33, 37,38,39,40,41,42,43]. The MIC of the SF-36 PCS was 6.5 but measurement error was rated indeterminate due to no calculation of SDC [42]. A floor effect of 3% was seen at 6 months post ICU discharge (Additional file 1: Table S7 and Fig. 3) [42]. The SF-36 PCS score at 1 month post discharge was not predictive of 1 year mortality or 6 month readmissions [29]. There was no data on the association with length of stay.

Sit-to-stand (STS)

Two studies reported data for the 30STS [44, 45] and three studies for the 5xSTS [39, 46, 47]. When pooled together, there was a very low certainty of evidence supporting excellent test–retest reliability (ICC 0.99) and inter-rater reliability (Pooled ICC 0.95) [44, 46, 47]. Sufficient construct validity was supported by a high certainty of evidence [39, 47] and one study demonstrated sufficient responsiveness with a low certainty of evidence [39]. Measurement error was indeterminate due to no calculation of MIC but the SEM of the 30STS ranged from 0.51 to 1.51 repetitions and the SDC ranged from 1.19 to 4.45 repetitions [29, 35]. No floor or ceiling effects were seen at hospital discharge [35]. A floor effect of 15% was seen at ICU discharge when using the 30STS and 35% at 3 months post discharge when using the 5xSTS (Additional file 1: Table S7 and Fig. 3) [39, 45]. STS performance at ICU discharge was predictive of hospital length of stay [47]. There was no data on the association with mortality or hospital readmissions.

6-min walk test (6MWT)

Nine studies reported data for the 6MWT [13, 28, 30, 31, 36, 38, 39, 48]. No studies in our review tested the reliability of the 6MWT. Sufficient construct validity and responsiveness were supported by a high certainty of evidence [13, 28, 30, 31, 39, 40]. Measurement error was rated as insufficient with a high certainty of evidence as the range for MIC was estimated to be 14–30 m by anchor-based methods which was lower than the SDC of 21–34 m [37]. A floor effect of 40% was seen at hospital discharge and 4% at 3 months post ICU discharge (Additional file 1: Table S7 and Fig. 3) [38, 39]. 6MWT performance at 3 and 6 months post ICU discharge can predict 1 year mortality and hospital readmissions [6, 12] [30]. There was no data on the association with length of stay.

Short Physical Performance Battery (SPPB)

Two studies reported data for the SPPB [29, 49]. No studies in our review tested the reliability of the SPPB. Sufficient construct validity supported by a low certainty of evidence was demonstrated in one study [49]. Responsiveness to change was insufficient from awakening to ICU discharge (ES 0.33) with a very low certainty of evidence [49]. Measurement error was indeterminate due to no calculation of MIC. The reported range of SDC was 1.3–1.5 points [49]. The SPPB had a significant floor effect of 83% at awakening and 57% at ICU discharge (Additional file 1: Table S7 and Fig. 3) [49]. SPPB performance at 1 month post ICU discharge was not predictive of 1 year mortality or 6 month readmissions [29]. There was no data on the association with length of stay.

Short Physical Performance Battery (SPPB)—4 m gait speed

Five studies reported data on the SPPB 4 m gait speed [30, 31, 36, 40, 50]. Excellent test–retest reliability of the SPPB 4 m gait speed was supported by a low certainty of evidence (ICC range 0.89–0.99) [50]. Sufficient construct validity was supported by a high certainty of evidence and responsiveness was indeterminate [30, 31, 36, 40, 50]. Measurement error was rated insufficient with a high certainty of evidence as the range for MIC was estimated to be 0.13–0.14 m/s by anchor-based methods which was lower than the SDC of 0.06 m/s [50]. No studies tested interpretability. SPPB 4 m gait speed performed at 6 months was predictive of hospital readmissions between 6 and 12 months [40]. There was no data on the association with mortality or length of stay.

Activities of daily living

Barthel Index

Four studies reported data for the Barthel Index [51,52,53,54]. It showed sufficient inter-rater reliability (ICC 0.98) and good internal consistency (Cronbach’s α 0.81) supported by a low certainty of evidence but was rated indeterminate for internal consistency due to no information on structural validity [52]. Sufficient construct validity was supported by a high certainty of evidence [52, 54]. Sufficient responsiveness was demonstrated in a single study with a very low certainty of evidence [51]. Measurement error was rated as indeterminate due to no calculation of MIC. A floor effect of 11% and a ceiling effect of 1% were seen at ICU discharge with an SEM of 7.2 points and an SDC of 20 points (Additional file 1: Table S7 and Fig. 3) [52]. There was no data on the association with mortality, hospital readmissions, or length of stay.
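The SEM and SDC quoted above for the Barthel Index are numerically consistent with the conventional relation SDC = 1.96 × √2 × SEM, although the review does not state which formula the primary study used. The sketch below assumes that relation; the function name `sdc95` is hypothetical.

```python
import math


def sdc95(sem: float) -> float:
    """Smallest detectable change at the 95% level from the SEM.

    Assumes the conventional formula SDC = 1.96 * sqrt(2) * SEM; the
    primary study may have used a different convention.
    """
    return 1.96 * math.sqrt(2) * sem


if __name__ == "__main__":
    barthel_sem = 7.2  # points, as reported in the text above
    print(f"Barthel Index SDC: {sdc95(barthel_sem):.1f} points")
```

Under this assumption an SEM of 7.2 points gives an SDC of approximately 20 points, so a change in an individual patient's Barthel Index smaller than this cannot be distinguished from measurement error.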

Katz Index

Eight studies reported data for the Katz Index [40, 55,56,57,58,59,60,61]. No studies in our review examined the Katz Index in terms of internal consistency, reliability, measurement error and interpretability. Construct validity was rated insufficient with a high certainty of evidence [40, 57, 60, 61]. Responsiveness was sufficient in a single study with a very low certainty of evidence [57]. The Katz index score on ICU admission was predictive of short term (in-hospital to 90 days) mortality but there was no data on the association with longer term mortality, hospital readmissions or length of stay [55, 56, 59, 62].

Instrumental Activities of Daily Living (Lawton IADL)

Four studies provided data on the Lawton IADL [40, 53, 56, 63]. No studies in our review examined the IADL in terms of internal consistency, reliability, responsiveness, measurement error or interpretability. Sufficient construct validity was supported by a moderate certainty of evidence [40]. The IADL at ICU admission was predictive of long term mortality but there were conflicting results regarding shorter term mortality and it was not predictive of hospital length of stay [53, 56, 63]. When performed at 6 months, it was not predictive of hospital readmissions between 6 and 12 months [40].

Muscle/nerve function

Handgrip strength (HGS)

Fifteen studies reported data on HGS [29, 36, 40, 47, 52, 54, 64,65,66,67,68,69,70,71]. There was excellent inter-rater reliability (Pooled ICC 0.95) and good test–retest reliability (Pooled ICC 0.89) supported by a very low to low certainty of evidence [65, 68]. Construct validity was inconsistent and no studies tested responsiveness [31, 36, 40, 47, 52, 54, 64, 69, 71, 72]. Measurement error was indeterminate due to no calculation of MIC. The SEM ranged from 2.8 to 4.5 kg and the SDC from 7.8 to 12.5 kg [65]. Significant floor effects were seen during ICU admission ranging from 26 to 55% (Additional file 1: Table S7 and Fig. 3) [64, 69, 71]. Handgrip strength performed well in the diagnosis of ICU-acquired weakness with high sensitivity and specificity [64]. Handgrip strength during ICU admission was not predictive of in-hospital mortality, hospital length of stay or ICU length of stay [69,70,71]. When performed at 1 month and 6 months post ICU discharge, handgrip strength was not predictive of 1 year mortality or hospital readmissions [29, 40].

Nutritional status

Global Leadership Initiative on Malnutrition Criteria (GLIM)

Two studies reported data for the GLIM [73, 74]. No studies in our review examined the GLIM in terms of reliability, responsiveness, measurement error and interpretability. There was a high certainty of evidence supporting sufficient construct validity. Two studies validated the GLIM against the Subjective Global Assessment (SGA) demonstrating a high level of precision (AUC 0.85–0.93) and agreement (Kappa 0.85) [48, 49]. The GLIM at ICU admission was predictive of ICU mortality and hospital length of stay [73]. There was no data on its association with longer term mortality and hospital readmissions.

Discussion

This systematic review and meta-analysis evaluated the clinimetric properties of the measurement instruments recommended in CONCISE [10]. The SF-36 PCS, SF-36 PF, STS, 6MWT and Barthel Index had the strongest clinimetric properties and certainty of evidence. The SPPB, Katz Index and handgrip strength had less favourable results. There was limited available data for the IADL and GLIM.

Measurement instruments

The CONCISE measurement instruments are established and considered feasible to use during critical illness and its recovery. Our review highlighted differences between the instruments in the strength of clinimetric properties and performance at different time points. The ability to stand from sitting unaided is increasingly recognised by patients as playing a fundamental role in activities of daily living [75,76,77], and our data show the STS to be an attractive functional independence test with minimal floor effects at ICU and hospital discharge when the repetition-based 30STS is used. Our data also support previous findings regarding the 6MWT being a well-defined test for use in critical care nutrition research, post ICU discharge [13, 30]. ICU survivors experience profound disability, with previous work demonstrating that only 40% could ambulate at 7 days after ICU discharge [78]. As a result, more complex outcome measures including the 6MWT, SPPB and the Physical Function in ICU Test (PFIT-S) are plagued by floor effects at ICU or hospital discharge, as demonstrated in our data [13, 38, 79]. The properties of the SPPB in critically ill patients are poorly defined, with a significant floor effect at ICU discharge. Interestingly, the 4 m gait speed test, a component of the SPPB, had robust clinimetric properties post hospital discharge, suggesting it may be best used later in the recovery period.

The SF-36 and its PCS are widely reported in critical care rehabilitation trials [80] with well-established clinimetric properties [37]. While our data support excellent construct validity and responsiveness of the SF-36 PCS with no significant floor or ceiling effects, we found no data describing its internal consistency or reliability. The closely related SF-36 PF domain had excellent internal consistency and reliability, but it showed significant ceiling effects in patients with good recovery trajectories and significant floor effects in those with persistent impairment [37].

Measurement of activities of daily living was deemed essential in the CONCISE Delphi process. Our data suggest the Barthel Index currently has the best clinimetric properties, with more limited evidence for the Katz Index and IADL. Handgrip strength had excellent inter-rater reliability, but studies with larger sample sizes are needed to improve the certainty of evidence and allow generalisability in trials of critical illness. There are also significant floor effects when it is used during ICU admission.

The GLIM criteria are a diagnostic tool for malnutrition rather than a patient-reported or performance-based measurement instrument. Reliability, responsiveness, and measurement error testing, as described elsewhere in this review, are therefore less relevant for the GLIM criteria and have not been studied. The GLIM was highly accurate in diagnosing malnutrition in critical illness and showed excellent construct validity when compared to the SGA, supporting its use in the ICU setting.

Implications for outcome selection and future research

The paucity of relevant research and the difficulty of face-to-face assessments during recovery from critical illness make mandating measurement instruments challenging. The use of patient-reported questionnaires, such as the SF-36, or objective performance-based measurement instruments that can be feasibly administered at home via telemedicine, such as the STS [81, 82], may reduce loss to follow-up and enable adequate analysis of interventions over recovery from critical illness.

It has previously been suggested that a single measurement instrument to evaluate functional outcomes cannot be used due to the presence of floor and ceiling effects at different time points, which we highlight above [49]. This means identifying change over time or change in response to an intervention is challenging. The repetition based 30STS has robust clinimetric properties and no floor and ceiling effects at hospital discharge making it an attractive measure of physical function for longitudinal nutrition studies in critical illness.

The strong interest in activities of daily living suggests the Katz Index and IADL require further evaluation in the critically ill population. It has previously been suggested that the Barthel Index is more suitable than the Katz Index for assessing patients after an ICU stay [84] and our analysis supports this recommendation. Additional clinimetric research is required for a more complete evaluation of IADL, handgrip strength and GLIM. Without further research, these instruments may be less attractive for future clinical trials involving patient care. Defining measurement error and responsiveness in more detail for all CONCISE measurement instruments will aid future trial design and sample size calculation.

Strengths and limitations

This review followed the COSMIN methodology and a rigorous approach was taken to the evaluation of the quality and certainty of evidence using the COSMIN risk of bias checklist, COSMIN’s criteria for good measurement properties and the modified GRADE approach [21]. The most important limitations are the low number of high-quality studies and the possibility that relevant studies with clinimetric data were missed in our searches; results should be interpreted with this in mind. This is especially true for responsiveness, where studies that used a CONCISE measurement instrument but did not comment specifically on responsiveness would not have appeared in our search. To minimise this, we included all randomised controlled trials of nutrition in critical illness since 2000 from the preliminary CONCISE systematic review [8, 10], but studies with non-nutritional interventions using CONCISE measurement instruments may have been missed. Due to the small number of studies, we included all studies in this review regardless of the risk of bias, and subgroup analysis was not performed. We also had to adapt the COSMIN methodology for PROMs to use for the CONCISE performance-based and diagnostic measurement instruments. The studies examined were heterogeneous, with variable time points of measurement which were often different to the 30 day or 90 day fixed time points we recommend in CONCISE. Finally, there were no studies evaluating structural validity, and the risk of bias was doubtful in many of the studies due to the small sample size or other important methodological flaws such as an inappropriate time interval between assessments when examining reliability. This reinforces the need for large high-quality clinimetric studies in critical illness.

Conclusion

The CONCISE measurement instruments are established and feasible to administer during critical illness and its recovery. The SF-36 PF, SF-36 PCS, STS, 6MWT, and Barthel Index had the strongest clinimetric properties and certainty of evidence. Further clinimetric research into all the CONCISE measurement instruments will improve outcome selection for future trials of nutrition and metabolic interventions in critical illness and enable greater generalisability of findings between studies. We suggest using this review alongside CONCISE to guide outcome selection for future trials of nutrition and metabolic interventions in critical illness.