Intensive Care Medicine

Volume 33, Issue 4, pp 619–624

Reproducibility of physiological track-and-trigger warning systems for identifying at-risk patients on the ward

Authors

  • Christian P. Subbe
    • Department of Medicine, Wrexham Maelor Hospital
  • Haiyan Gao
    • Intensive Care National Audit and Research Centre, Tavistock House
  • David A. Harrison
    • Intensive Care National Audit and Research Centre, Tavistock House

DOI: 10.1007/s00134-006-0516-8

Cite this article as:
Subbe, C.P., Gao, H. & Harrison, D.A. Intensive Care Med (2007) 33: 619. doi:10.1007/s00134-006-0516-8

Abstract

Objective

Physiological track-and-trigger warning systems are used to identify, as early as possible, patients on acute wards who are at risk of deterioration. The objective of this study was to assess the inter-rater and intra-rater reliability of the physiological measurements, aggregate scores and triggering events of three such systems.

Design

Prospective cohort study.

Setting

General medical and surgical wards in one non-university acute hospital.

Patients and participants

Unselected ward patients: 114 patients in the inter-rater study and 45 patients in the intra-rater study were examined by four raters.

Measurements and results

Physiological observations obtained at the bedside were evaluated using three systems: the medical emergency team call-out criteria (MET); the modified early warning score (MEWS); and the assessment score for sick-patient identification and step-up in treatment (ASSIST). Inter-rater and intra-rater reliability were assessed by intra-class correlation coefficients, kappa statistics and percentage agreement. There was fair to moderate agreement on most physiological parameters, and fair agreement on the scores, but better levels of agreement on triggers. Reliability was partially a function of simplicity: MET achieved a higher percentage of agreement than ASSIST, and ASSIST higher than MEWS. Intra-rater reliability was better than inter-rater reliability. Using corrected calculations improved the level of inter-rater agreement but not intra-rater agreement.

Conclusion

There was significant variation in the reproducibility of different track-and-trigger warning systems. The systems examined showed better levels of agreement on triggers than on aggregate scores. Simpler systems had better reliability. Inter-rater agreement might be improved by electronic calculation of scores.

Keywords

Observer variation · Reproducibility of results · Critical illness · Scoring systems

Introduction

Physiological track-and-trigger warning systems are used to identify, as early as possible, patients on acute wards who are at risk of deterioration. There are three main types in use [1]:
  1. Single- and multiple-parameter systems identify patients by comparing bedside observations with a simple set of criteria and indicating whether one or more of the parameters has reached predefined thresholds.

  2. Aggregate weighted scoring systems allocate a weight to each observation as a function of its abnormality, and a summary score is derived.

  3. Combination systems combine single- or multiple-parameter systems with aggregate weighted scoring systems.

Single-parameter systems have been used extensively by Australian medical emergency teams (MET) [2]. Multiple-parameter, aggregate weighted scoring and combination systems are mainly in use in U.K. hospital settings. A survey of acute hospitals in England indicated that most hospitals were using aggregate weighted scoring systems [1]. Interventional studies have shown that the use of track-and-trigger systems may reduce adverse outcomes in medical and surgical patients [2, 3, 4, 5, 6]. It is not known whether these systems are reproducible, since no information on inter- and intra-rater reliability has been published.
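To make the three types concrete, the following sketch shows, in Python, how each class of system turns a set of bedside observations into a trigger or a score. The thresholds and weights are purely illustrative and are not the published MET, MEWS or ASSIST criteria (those are given in the references and the Electronic Supplementary Material).

```python
# Illustrative sketch only: thresholds and weights below are invented for
# demonstration and are NOT the published MET, MEWS or ASSIST criteria.

def single_parameter_trigger(obs):
    """Trigger if any single observation breaches its (illustrative) threshold."""
    return (obs["respiratory_rate"] > 30
            or obs["heart_rate"] > 130
            or obs["systolic_bp"] < 90
            or obs["conscious_level"] != "alert")

def aggregate_weighted_score(obs):
    """Sum a weight for each observation according to how abnormal it is."""
    score = 0
    score += 2 if obs["respiratory_rate"] > 25 else (1 if obs["respiratory_rate"] > 20 else 0)
    score += 2 if obs["heart_rate"] > 120 else (1 if obs["heart_rate"] > 100 else 0)
    score += 2 if obs["systolic_bp"] < 90 else (1 if obs["systolic_bp"] < 100 else 0)
    score += 0 if obs["conscious_level"] == "alert" else 3
    return score

def combination_trigger(obs, score_threshold=4):
    """Combination system: trigger on any single breach OR a high aggregate score."""
    return single_parameter_trigger(obs) or aggregate_weighted_score(obs) >= score_threshold

obs = {"respiratory_rate": 22, "heart_rate": 105, "systolic_bp": 95, "conscious_level": "alert"}
print(aggregate_weighted_score(obs), combination_trigger(obs))  # 3 False
```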

In this study, inter-rater and intra-rater reliability of the physiological measurements, aggregate scores and triggering events of three systems were examined: a single-parameter system, the call-out criteria for MET [2]; and two aggregate scoring systems, the modified early warning score (MEWS) [6] and the assessment score for sick-patient identification and step-up in treatment (ASSIST) [7].

Methods

Design and data collection

A prospective observational study was conducted at Wrexham Maelor Hospital, a district general hospital in North Wales. The study was approved by the local research ethics committee. Participants were adult patients from general medical and surgical wards. A number of wards were selected to satisfy the sample size calculation (below) and all patients on these wards able to give informed consent were invited to participate. Patients were informed about the purpose of the study and received an information leaflet. Verbal consent was obtained.

Based on assumptions for inter-rater reliability (kappa = 0.8, proportion of positive results = 0.07) with four raters, a sample of 93 patients was required to estimate kappa with a standard error of 0.1. For the intra-rater reliability, with an assumed value of kappa = 0.9, the required sample size was 44 patients. Sample size calculations were performed using a custom-designed module [8].

Data were collected by four members of hospital staff on 3 days. All four raters were familiar with the scoring methods in their clinical practice and received an induction prior to the study. Two investigators prepared the consent and patient identification data prior to the study.

For inter-rater reliability, data were collected on two acute medical and two acute surgical wards. A senior doctor (Certificate of Completion of Specialist Training equivalent in Intensive Care Medicine), a junior doctor (Senior House Officer level), a registered nurse (E-grade; 5 years of experience) and a student nurse, who had previously worked as a health care assistant (nursing auxiliary), collected the data. The order in which the raters took the measurements was randomized for each ward from a set of possible permutations. Raters were blinded to the results of their colleagues. For the intra-rater study, the same raters examined separate patients from one medical and one surgical ward, each examining the same patients four times at 15-min intervals, blinded to their previous scores. There were no interventions between the four sets of measurements.

Age and normal blood pressure, derived from an average of the previous 48 h, were collected first. Raters then measured the remaining parameters: systolic blood pressure; temperature; respiratory rate; pulse rate; and level of consciousness. Blood pressure was measured electronically (Dinamap, Critikon, Tampa, Fla.) and checked manually where appropriate. Blood pressure was measured by all four raters on the first 18 patients, but the repeated measurement was found to be unacceptable to the patients. For subsequent patients, blood pressure was measured only once, noted on the patient's bedside sheet, and copied by subsequent raters. Temperature was taken orally (Temp-PlusII, IVAC Corp., San Diego, Calif.), measured only once, noted and copied by subsequent raters. All other parameters were measured by each rater in turn. Pulse rate was counted over 15 s in regular heart rhythm and 1 min in irregular heart rhythm; respiratory rate was counted over 30 s. Raters calculated urine output per kilogram and hour from the output over the last 4 h.
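As a minimal worked illustration of the derived quantities described above, the sketch below converts the counting intervals given in the protocol into rates and computes urine output per kilogram and hour; the example patient weight and volumes are assumed, not taken from the study.

```python
def pulse_rate(count, interval_s):
    """Convert a pulse count over interval_s seconds to beats per minute
    (15 s in regular rhythm, 60 s in irregular rhythm, as in the protocol)."""
    return count * 60 / interval_s

def respiratory_rate(count, interval_s=30):
    """Convert a 30-s breath count to breaths per minute."""
    return count * 60 / interval_s

def urine_output_ml_kg_h(volume_ml, hours, weight_kg):
    """Urine output per kilogram and hour from the output over the last few hours."""
    return volume_ml / (hours * weight_kg)

print(pulse_rate(22, 15))                           # 88 beats/min
print(respiratory_rate(9))                          # 18 breaths/min
print(round(urine_output_ml_kg_h(200, 4, 70), 2))   # ~0.71 ml/kg per hour (assumed 70-kg patient)
```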

Raters scored the observations according to the three systems. The MET criteria were scored as one if any criterion was fulfilled and otherwise as zero. The MEWS and ASSIST were scored according to scoring charts. Blood pressure in MEWS was scored differently from the published scoring method, by deviation from the patient's norm (C. Stenhouse, pers. commun.). Details of the scoring systems, including the modification to MEWS, are contained in the Electronic Supplementary Material.

Data were entered into a spreadsheet by a data-entry clerk not involved in data collection. Logic, range and consistency checks were applied to all variables. Outliers and missing data were checked against the original data collection sheets.

Statistical analysis

Statistical analysis was performed using intra-class correlation coefficients for continuous variables (systolic blood pressure, heart rate, respiratory rate, temperature and aggregate scores), and kappa statistics for categorical variables (conscious level, trigger events and aggregate scores). Two-way and one-way analysis of variance were used in calculating the intra-class correlation coefficients for the inter-rater and intra-rater studies, respectively [9]. Bootstrap methods were used to provide bias-corrected confidence intervals. For the inter-rater study, we also calculated kappa and phi statistics [10] for each of the six possible pairings among the raters. All analyses were performed in Stata 8.2 (StataCorp LP, College Station, Texas).
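The analyses in the study were run in Stata 8.2. Purely as an illustration of the approach, the Python sketch below computes an ANOVA-based intra-class correlation coefficient and a simple two-rater Cohen's kappa; the specific ICC form (two-way random effects, single measurement, absolute agreement) is an assumption, and the bootstrap confidence intervals and multi-rater kappa used in the paper are not shown.

```python
import numpy as np

def icc_two_way(x):
    """ICC (two-way random effects, single rater, absolute agreement).

    x: (n_subjects, k_raters) array with one measurement per cell.
    Mean squares come from the standard two-way ANOVA decomposition.
    """
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    msr = k * ((x.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # between subjects
    msc = n * ((x.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # between raters
    sse = ((x - grand) ** 2).sum() - msr * (n - 1) - msc * (k - 1)
    mse = sse / ((n - 1) * (k - 1))                              # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters on categorical ratings."""
    a, b = np.asarray(a), np.asarray(b)
    cats = np.union1d(a, b)
    po = (a == b).mean()                                         # observed agreement
    pe = sum((a == c).mean() * (b == c).mean() for c in cats)    # expected by chance
    return (po - pe) / (1 - pe)

# Toy data only: 5 patients rated by 4 raters
rates = np.array([[18, 20, 19, 18],
                  [24, 22, 25, 24],
                  [14, 15, 14, 16],
                  [30, 28, 29, 31],
                  [20, 20, 21, 19]])
print(icc_two_way(rates))
print(cohens_kappa([0, 1, 0, 0, 1], [0, 1, 0, 1, 1]))
```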

Disagreements in total scores and trigger events could be a result of disagreements in physiological measurements or incorrect calculation. To examine the relative impact of these disagreements, the three systems were recalculated from the original measurements, and agreement was assessed both on the scores as recorded by the raters and on the corrected scores.
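A minimal sketch of this check, assuming the raw measurements and the rater-recorded scores are available side by side; `score_chart` is a hypothetical placeholder standing in for the published scoring charts.

```python
def score_chart(measurements):
    """Placeholder aggregate score; stands in for the published scoring chart."""
    return sum(measurements)  # illustrative only

def correct_scores(records):
    """For each record, recompute the score from the raw measurements and
    report how often the rater's own arithmetic differed from it."""
    corrected, miscalculated = [], 0
    for raw, recorded in records:            # (measurements, score written down by rater)
        true_score = score_chart(raw)
        corrected.append(true_score)
        if true_score != recorded:
            miscalculated += 1
    return corrected, miscalculated / len(records)

records = [([1, 0, 2], 3), ([0, 0, 1], 2), ([2, 1, 0], 3)]
corrected, error_rate = correct_scores(records)
print(corrected, error_rate)   # [3, 1, 3], one third of the scores were miscalculated
```

Agreement can then be assessed once on the scores as recorded by the raters and again on the corrected scores, isolating the contribution of calculation errors.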

To interpret the strength of agreement, we adopted the following guidelines [11]: < 0.20 poor; 0.21–0.40 fair; 0.41–0.60 moderate; 0.61–0.80 good; 0.81–1.00 very good.
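Expressed as a simple lookup, with the band boundaries taken directly from the guideline quoted above:

```python
def agreement_strength(stat):
    """Verbal interpretation of a kappa or ICC value, per the guidelines in [11]."""
    if stat <= 0.20:
        return "poor"
    if stat <= 0.40:
        return "fair"
    if stat <= 0.60:
        return "moderate"
    if stat <= 0.80:
        return "good"
    return "very good"

print(agreement_strength(0.53))  # "moderate"
```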

Results

Inter-rater reliability

In the inter-rater study, 114 patients were examined. The four raters were not able to perform four sets of measurements on all 114 patients, as some patients were called for clinical investigations or were otherwise unavailable. In total, 433 sets of measurements were obtained.

Nine sets of observations from three patients were excluded as their normal blood pressures were missing, leaving 424 sets of observations included in the study. One hundred nine, 102, 107 and 106 patients were examined, respectively, by the senior doctor, junior doctor, registered nurse and student nurse. Of the 424 sets of observations, 412 (97.1%) were missing urine output. All other parameters were 100% complete. Urine output was therefore excluded from the analysis.

Allowing raters to copy the temperature and blood pressure (apart from the first 18 patients) introduced the potential for errors in copying the figures. Copying errors in temperature and blood pressure were identified in 1.4% and 0.6% of observations, respectively.

Agreement on respiratory rate, heart rate and systolic blood pressure (method 1: blood pressure of 18 patients taken by each rater) was similar, with intra-class correlation coefficients (95% confidence intervals) of 0.57 (0.45–0.70), 0.63 (0.52–0.73) and 0.65 (0.40–0.85), respectively. Copying errors had almost no effect on the agreement on systolic blood pressure (method 2: blood pressure of 96 patients taken once and copied), with an intra-class correlation coefficient of 0.99 (0.97–1.00), and only a small effect on temperature, with an intra-class correlation coefficient of 0.74 (0.51–0.91). There were no significant differences in the mean physiological measurements among the raters for respiratory rate (p = 0.44), systolic blood pressure (p = 0.34 and 0.09 by methods 1 and 2, respectively) or heart rate (p = 0.23). The small number of copying errors in temperature, predominantly by one rater, where agreement was otherwise perfect, led to a small but significant difference in mean temperature (p = 0.03). Kappa agreement was moderate (0.53, 95% confidence interval 0.31–0.78) on the levels of consciousness used in MEWS and fair (0.35, 0.22–0.48) on those used in ASSIST.

The percentage of correctly calculated scores was lower for MEWS and ASSIST than for MET (Table 1). Overall, 27 (6.4%) observations were scored higher and 49 (11.5%) lower than the correct score for MEWS, and 12 (2.8%) observations were scored higher and 67 (15.8%) lower than the correct score for ASSIST. There were statistically significant differences in the percentage of correctly calculated scores among raters for MET and MEWS.
Table 1
Number of correctly calculated scores for inter-rater study. MET call-out criteria for medical emergency teams, MEWS modified early warning score, ASSIST assessment score for sick-patient identification and step-up in treatment

                 | Student nurse | Registered nurse | Junior doctor | Senior doctor | Total      | p-value
Observations, n  | 106           | 107              | 102           | 109           | 424        |
MET, n (%)       | 98 (92.5)     | 106 (99.1)       | 101 (99.0)    | 109 (100)     | 414 (97.6) | 0.001
MEWS, n (%)      | 81 (76.4)     | 81 (75.7)        | 92 (90.2)     | 94 (86.2)     | 348 (82.1) | 0.01
ASSIST, n (%)    | 78 (73.6)     | 90 (84.1)        | 86 (84.3)     | 90 (82.6)     | 344 (81.1) | 0.15

The p-value indicates statistical significance of difference in correctly calculated scores among raters

The agreement indices among the four raters (Table 2) suggest that the raters had a higher level of agreement on aggregate score for ASSIST than for MEWS. There were no significant differences among the raters in mean scores for the two systems (p = 0.40 and 0.13 for MEWS and ASSIST calculated by raters, 0.41 and 0.14 corrected). The distributions of MEWS and ASSIST scores for the four raters are shown in the Electronic Supplementary Material.
Table 2
Level of agreement of aggregate scores and triggers among the four raters for inter-rater study

                       | Triggered, n (%) / score, median (interquartile range) [range] | Kappa statistic (95% confidence interval) | All agreed, n (%) | Three agreed, n (%) | Intra-class correlation coefficient (95% confidence interval)
Calculated by raters   |                 |                     |           |            |
MET trigger            | 11 (2.6)        | –0.03 (–0.05, 0.00) | 86 (77.5) | 106 (95.5) |
MEWS score             | 1 (1, 2) [0, 8] | 0.20 (0.13, 0.27)   | 17 (15.3) | 53 (47.8)  | 0.45 (0.34, 0.55)
MEWS trigger           | 60 (14.2)       | 0.18 (0.09, 0.27)   | 62 (55.9) | 94 (84.7)  |
ASSIST score           | 1 (0, 1) [0, 8] | 0.46 (0.38, 0.55)   | 41 (36.9) | 80 (72.1)  | 0.49 (0.40, 0.57)
ASSIST trigger         | 19 (4.5)        | 0.20 (0.04, 0.38)   | 84 (75.7) | 104 (93.7) |
Corrected calculations |                 |                     |           |            |
MET trigger            | 7 (1.7)         | –0.02 (–0.04, 0.05) | 90 (81.1) | 106 (95.5) |
MEWS score             | 1 (1, 2) [0, 8] | 0.22 (0.15, 0.30)   | 18 (16.2) | 55 (49.6)  | 0.50 (0.42, 0.59)
MEWS trigger           | 69 (16.3)       | 0.37 (0.25, 0.51)   | 64 (57.7) | 101 (91.0) |
ASSIST score           | 1 (0, 2) [0, 8] | 0.50 (0.42, 0.58)   | 43 (38.7) | 83 (74.8)  | 0.66 (0.55, 0.76)

Agreement on triggers was similar in MEWS and ASSIST, and was improved by using corrected scores. Percentage agreement was higher on triggers than on scores. In MET, any patient who did not trigger the first three criteria but caused serious worry was scored as one. In the 424 sets of observations, 5 patients were classed as triggering via this criterion, all by one rater.

Pairwise agreements were similar to overall agreement, and agreement using phi appeared better than kappa (see Electronic Supplementary Material).

Intra-rater reliability

There were 180 sets of observations from 45 patients in the intra-rater study. All observations were used in the analyses. Urine output was missing in 170 (94.4%) sets of observations and was therefore excluded. All other parameters were 100% complete. There were copying errors for temperature in 0.6% of observations and for blood pressure in 1.1%.

There was 100% agreement on conscious level, with all patients scored as “Alert”. Intra-rater agreement on respiratory rate, heart rate and systolic blood pressure was similar to that in the inter-rater study. Agreement on temperature, with an intra-class correlation coefficient of 0.98 (0.94–1.00), was better in the intra-rater study than in the inter-rater study.

The proportions of scores calculated correctly were similar to those from the inter-rater study (Table 3). MET was scored correctly by all raters for all observations. In MEWS, 17 (9.4%) observations were scored higher and 14 (7.8%) lower than the correct score, and in ASSIST 11 (6.1%) observations were scored higher and 22 (12.2%) lower than the correct score.
Table 3
Number of correctly calculated scores for intra-rater study

                 | Student nurse | Registered nurse | Junior doctor | Senior doctor | Total      | p-value
Observations, n  | 48            | 24               | 84            | 24            | 180        |
MET, n (%)       | 48 (100)      | 24 (100)         | 84 (100)      | 24 (100)      | 180 (100)  | 1
MEWS, n (%)      | 40 (83.3)     | 24 (100)         | 66 (78.6)     | 19 (79.2)     | 149 (82.8) | 0.05
ASSIST, n (%)    | 33 (68.8)     | 24 (100)         | 72 (85.7)     | 18 (75.0)     | 147 (81.7) | 0.003

The p-value indicates statistical significance of difference in correctly calculated scores among raters

The agreement indices (Table 4) suggest intra-rater agreement on score was similar for MEWS and ASSIST. There was good agreement on triggers for MEWS and ASSIST, although the confidence intervals for ASSIST were very wide due to the low number of events. Only 1 patient triggered the MET calling criteria on a single observation.
Table 4
Level of agreement of total scores and triggers among the four raters for intra-rater study

                       | Triggered, n (%) / score, median (interquartile range) [range] | Kappa statistic (95% confidence interval) | All agreed, n (%) | Three agreed, n (%) | Intra-class correlation coefficient (95% confidence interval)
Calculated by raters   |                 |                      |           |           |
MET trigger            | 1 (0.6)         | –0.01 (–0.02, –0.01) | 44 (97.8) | 45 (100)  |
MEWS score             | 1 (1, 2) [0, 6] | 0.53 (0.39, 0.68)    | 24 (53.3) | 37 (82.2) | 0.71 (0.60, 0.76)
MEWS trigger           | 26 (14.4)       | 0.64 (0.46, 0.84)    | 37 (82.2) | 45 (100)  |
ASSIST score           | 1 (1, 1) [0, 5] | 0.59 (0.46, 0.74)    | 27 (60.0) | 40 (88.9) | 0.81 (0.58, 0.93)
ASSIST trigger         | 6 (3.3)         | 0.66 (–0.02, 1.00)   | 43 (95.6) | 45 (100)  |
Corrected calculations |                 |                      |           |           |
MET trigger            | 1 (0.6)         | –0.01 (–0.02, –0.01) | 44 (97.8) | 45 (100)  |
MEWS score             | 1 (1, 2) [0, 5] | 0.56 (0.42, 0.68)    | 23 (51.1) | 37 (82.2) | 0.68 (0.53, 0.75)
MEWS trigger           | 23 (12.8)       | 0.58 (0.31, 0.81)    | 37 (82.2) | 44 (97.8) |
ASSIST score           | 1 (1, 1) [0, 5] | 0.54 (0.42, 0.68)    | 25 (55.6) | 35 (77.8) | 0.57 (0.24, 0.83)
ASSIST trigger         | 8 (4.4)         | 0.48 (–0.03, 1.00)   | 41 (91.1) | 45 (100)  |

Discussion

Scoring systems such as the ones used in this study have become an important tool of clinical risk management for critically ill patients on general wards. Thus far, it is not known whether these assessments are reproducible and how large the likely errors are if different members of staff perform what is meant to be an identical assessment. In the present study we have provided some data on how three systems used in the U.K. perform. There was only fair to moderate agreement on measurements of the parameters used to generate the scores, and only fair agreement on the scores. Reassuringly, there was better percentage agreement on the decision whether a patient had triggered or not.

As one would expect, reproducibility was partially a function of simplicity: MET achieved higher percentage agreement than ASSIST, and ASSIST higher than MEWS. Intra-rater reliability was better than inter-rater reliability. Using corrected calculations improved the level of inter-rater agreement but not intra-rater agreement, suggesting that if scoring systems were misapplied, each rater was doing so in a consistent manner.

The systems were selected because they represent three levels of complexity. MET is very simple but does not allow a patient's progress to be tracked. MEWS is a complete assessment that takes into account urine output and relative changes in blood pressure as compared with previous measurements. ASSIST is a simplified version with only four parameters and an age constant. Both ASSIST and MEWS allow monitoring of clinical progress. The chosen systems are representative of the wide range of scoring systems currently in use, but any system should be assessed in the setting where it is used.

There were a number of potential weaknesses in this study. Firstly, repeated measurements were taken within an hour, but it is possible that patients could have deteriorated or improved during this time. We did not assess whether there was systematic drift of figures between measurements.

A small number of patients were not able or willing to give consent. In particular, patients with reduced neurological function (approximately 5% of all patients) could not be included, and were likely to be generally sicker patients. Inclusion might have led to different results with regard to reliability of the trigger mechanism; however, abnormal neurological scores have been found to be rare in previous studies [3, 12].

It was our aim to assess the reliability of the scoring process in clinical practice. The reliability depends partially on the reliability of the electronic measurement devices used for blood pressure and temperature. This could not be assessed directly as repeated measurement was unacceptable to the patients. Our results therefore represent the human element of reliability only.

Different scores might perform better in different scenarios. As MET and ASSIST collect only basic information, they might be appropriate for screening a large population. MEWS includes two further important pieces of physiological information; however, identifying old records to assess relative changes in blood pressure is unlikely to be performed reliably in a large number of patients. MEWS is therefore probably better suited as a monitoring tool for pre-selected patients known to be at high risk of catastrophic deterioration. In addition, as raters were not familiar with details of the patient's condition, the trigger criterion for any patient who caused serious worry was almost certainly underutilized.

Kappa is a chance-corrected measure of agreement: the difference between observed and expected agreement, expressed as a fraction of the maximum possible difference. Negative values indicate that observed agreement was lower than expected by chance. As trigger events with MET were very rare, expected agreement was extremely high. Kappa is largely meaningless for events this rare, and the chance-independent measure phi can only assess agreement between two raters.
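A small numerical example with purely illustrative data makes the point: if two raters each flag 2 of 100 patients but never the same patient, raw agreement is 96%, yet kappa is close to zero (here slightly negative), because chance agreement alone is already about 96%.

```python
import numpy as np

def cohens_kappa(a, b):
    """Chance-corrected agreement between two raters on binary ratings."""
    a, b = np.asarray(a), np.asarray(b)
    po = (a == b).mean()                                            # observed agreement
    pe = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())      # expected by chance
    return (po - pe) / (1 - pe)

n = 100
rater_a = np.zeros(n, dtype=int); rater_a[:2] = 1    # flags patients 0 and 1
rater_b = np.zeros(n, dtype=int); rater_b[2:4] = 1   # flags patients 2 and 3

print((rater_a == rater_b).mean())      # 0.96  raw agreement
print(cohens_kappa(rater_a, rater_b))   # ~ -0.02 despite 96% raw agreement
```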

Problems with reproducibility are common when assessing both bedside physiological measurements [13] and scoring systems [14, 15, 16, 17, 18, 19]. Intra- and inter-rater variability for APACHE II scores has been reported at 10–15% [14, 15] but may be reduced if data are collected by highly trained experts [15]. Despite problems with the reliability of scoring components, the main target measure might be largely unaffected [16, 17]. This corresponds to our finding that there was greater agreement on the presence of a trigger event than on the values of the scores.

This study was not designed to assess whether the systems helped to identify critically ill patients on general wards. Differences in reliability should be taken into account when choosing a score and the clinical area and patient group to which it will be applied. Determinants of reliability in different professional groups need further investigation.

Conclusion

There was significant variation in the reproducibility of physiological track-and-trigger warning systems used by different health care professionals. All three systems examined showed better agreement on triggers than aggregate scores. Simpler systems had better reliability. Further research should examine how reliability can be improved.

Acknowledgements

This study was funded by the UK National Health Service Research and Development Service Delivery and Organisation Programme (SDO/74/2004). The authors thank S. Ameeth, S. Collins, K. Ghosh, C. Rincon and J. Tobler for their help in preparing the study, obtaining consent from patients and collecting the data. We thank A. Pawley for entering data into electronic format and L. Gemmell for advising on the format and facilitating the setup of the study.

Supplementary material

Electronic Supplementary Material (DOC, 148 kb): 134_2006_516_MOESM1_ESM.doc

Copyright information

© Springer-Verlag 2007