Introduction

No fully objective scale exists for assessing comatose patients, so assessment depends largely on the clinical skill of physicians [1]. Several scales are nevertheless used to evaluate these patients. One of the most commonly used is the Glasgow Coma Scale (GCS) [2, 3], but it has important limitations, such as the impossibility of assessing the verbal component in intubated comatose patients. In this situation, some physicians record the lowest possible score for this component. In addition, the GCS has no clinical index for brainstem reflexes. Trained personnel prefer to apply the GCS, although interpretation of intermediate GCS scores remains difficult for emergency physicians. In general, the GCS cannot precisely detect clinical changes in comatose patients [4]. Efforts have been made to improve the GCS, and many scoring systems have been developed to replace it [5–7]. The Full Outline of Unresponsiveness (FOUR) score was developed by Wijdicks et al. to evaluate consciousness in comatose patients [1, 8]. This scale has four components: eye, motor, brainstem, and respiration. The score of each component ranges from 0 to 4 (Table 1). The brainstem reflex and respiration components provide an assessment in place of the verbal response [9]. Several studies have translated the FOUR score into different languages and assessed its validity and reliability in different populations [10–17], but no study has validated the Persian version. This study aims to assess the predictive validity and inter-rater reliability of the Persian version of the FOUR score in unconscious patients with traumatic brain injury in an intensive care unit.

Table 1 Definition of the FOUR score and the GCS

Methods

Development of the Persian Version of the FOUR Score

Translation of the FOUR score into Persian followed a standardized forward–backward procedure. First, two professional translators independently carried out the forward translation from English into Persian. Then, a first consensus meeting was held to compare the two Persian versions and discuss the accuracy of the statements, in order to agree on a fully comprehensible and accurate Persian translation consistent with the original English text. Next, another expert translator who did not have access to the original English scale back-translated the Persian version. After that, in a second consensus meeting, the back-translated version was compared with the original scale to develop the final Persian version. Finally, the Persian version of the FOUR score was validated.

Inter-Rater Reliability and Predictive Validity

To assess inter-rater reliability of the FOUR score, the GCS was applied as a standard scale for comparison. To compare the results of the FOUR score with those of the GCS, three different types of raters scored both the Persian version of the FOUR score and the GCS: two ICU physicians, two ICU head nurses, and two senior nursing students. Each of the nurses and physicians had at least 2 years of clinical experience in an intensive care unit (ICU). Prior to the study, the raters were instructed in how to apply the FOUR score and the GCS accurately. Subsequently, a trial session was performed on a few patients to ensure that they fully understood the procedure. The raters were provided with written instructions and scoring sheets to be used during the examination of all patients. Six categories of pairwise ratings were analyzed: (1) physician–nurse, (2) physician–student, (3) nurse–student, (4) physician–physician, (5) nurse–nurse, and (6) student–student. Each pair of raters scored 14 patients using both the FOUR score and the GCS, for a total of 84 patients. To reduce bias, the order of examining and scoring each patient was set randomly. The raters of each pair completed their scorings within a period of 1 h without awareness of the other’s scores.

To assess the predictive validity of the FOUR score, the outcome was assessed at discharge from the hospital using the modified Rankin Scale (mRS) by one of the raters of each pair who was randomly selected. Then, the results were compared with the ones of the FOUR score as a common standard scale. The rating of mRS scale was done according to 7 points as follows: 0 = no symptoms, 1 = no significant disability, 2 = slight disability, 3 = moderate disability, 4 = moderately severe disability, 5 = severe disability, and 6 = dead [18]. In this study, mRS score 0–2 was considered as good outcome and score 3–6 as poor outcome.
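The outcome grouping described above can be written as a small lookup. The following is only an illustrative sketch: the names `MRS_LABELS`, `classify_outcome`, and `is_death` are hypothetical and do not come from the study, while the labels and the 0–2 versus 3–6 split are taken from the text.

```python
# Labels of the 7-point modified Rankin Scale as listed in the text.
MRS_LABELS = {
    0: "no symptoms",
    1: "no significant disability",
    2: "slight disability",
    3: "moderate disability",
    4: "moderately severe disability",
    5: "severe disability",
    6: "dead",
}


def classify_outcome(mrs):
    """Return 'good' for mRS 0-2 and 'poor' for mRS 3-6 (the study's grouping)."""
    if mrs not in MRS_LABELS:
        raise ValueError(f"mRS must be an integer from 0 to 6, got {mrs!r}")
    return "good" if mrs <= 2 else "poor"


def is_death(mrs):
    """Return True only for mRS = 6, the mortality endpoint used in the analysis."""
    if mrs not in MRS_LABELS:
        raise ValueError(f"mRS must be an integer from 0 to 6, got {mrs!r}")
    return mrs == 6


# Hypothetical discharge scores, for illustration only.
for score in (1, 3, 6):
    print(score, MRS_LABELS[score], classify_outcome(score), is_death(score))
```

Note that mortality (mRS = 6) is a subset of poor outcome (mRS = 3–6), so the two endpoints overlap by definition.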

Participants

A total of 87 patients admitted to the intensive care unit (ICU) of Shahid Beheshti Hospital in Qom, Iran, were enrolled from March to December 2013. They were evaluated within 7 days of admission to the ICU. The inclusion criteria were age > 18 years and unconsciousness due to an acute traumatic brain injury. The exclusion criteria were treatment with neuromuscular junction blockers or sedatives and an interval longer than 1 h between the two raters’ pairwise scorings. Written informed consent was obtained from each patient’s legal surrogate. The Ethics Committee of Qom University of Medical Sciences approved this study.

Statistical Analysis

Inter-rater reliability of the FOUR score and the GCS was assessed using the weighted Cohen’s kappa (κ w) for the total score as well as for the score of each item. κ w coefficients of 0.4 or less were considered poor agreement, and values greater than 0.8 were considered excellent agreement between the raters [19]. Internal consistency was assessed by Cronbach’s α, and concurrent validity was assessed by calculating Spearman’s correlation coefficient between the FOUR score and the GCS. Predictive validity was assessed with receiver operating characteristic (ROC) curves, which show the power of the FOUR score and the GCS to predict mortality or poor outcome at discharge from the hospital. Sensitivity and specificity were calculated for both scales. Logistic regression was used to estimate the odds ratios of the FOUR score and the GCS for predicting mortality or poor outcome at discharge. The mean of the two raters’ ratings for each patient was used for the ROC curve and regression analyses. SPSS V20 and MedCalc 14 were used to analyze the data. The level of statistical significance was set at P < 0.05.
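For readers unfamiliar with the weighted kappa, a minimal sketch of the calculation follows. It is not the SPSS or MedCalc procedure used in the study. The text does not state whether linear or quadratic weights were applied, so the `weighting` argument is an assumption, as are the function names and the example ratings. The label "intermediate" is also ours, since the text only names the poor (0.4 or less) and excellent (greater than 0.8) ranges.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories, weighting="linear"):
    """Weighted Cohen's kappa for two raters scoring the same patients.

    categories: the ordered list of all possible scores (e.g. 0..16).
    weighting: 'linear' or 'quadratic' disagreement weights (an assumption;
    the source text does not say which scheme was used).
    """
    if weighting not in ("linear", "quadratic"):
        raise ValueError("weighting must be 'linear' or 'quadratic'")
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same, non-empty set of patients")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two ordered categories are required")

    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    power = 1 if weighting == "linear" else 2

    def disagreement(i, j):
        # 0 on the diagonal, 1 for the two most distant categories.
        return (abs(i - j) / (k - 1)) ** power

    observed = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    margin_a = Counter(index[a] for a in rater_a)
    margin_b = Counter(index[b] for b in rater_b)

    # Mean weighted disagreement actually observed.
    obs = sum(disagreement(i, j) * c for (i, j), c in observed.items()) / n
    # Mean weighted disagreement expected by chance from the marginals.
    exp = sum(
        disagreement(i, j) * margin_a[i] * margin_b[j]
        for i in range(k)
        for j in range(k)
    ) / n ** 2
    if exp == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1 - obs / exp


def agreement_label(kappa_w):
    """Apply the two thresholds named in the text."""
    if kappa_w <= 0.4:
        return "poor"
    if kappa_w > 0.8:
        return "excellent"
    return "intermediate"  # range not labeled in the text


# Hypothetical total FOUR scores from one rater pair, for illustration only.
physician = [4, 7, 10, 16, 0, 12]
nurse = [4, 8, 10, 16, 1, 12]
kw = weighted_kappa(physician, nurse, list(range(17)))
print(round(kw, 3), agreement_label(kw))
```

Weighting matters for ordinal scales such as the FOUR score and the GCS: a one-point disagreement is penalized much less than a disagreement across most of the scale, which an unweighted kappa would treat identically.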

Results

Eighty-seven unconscious patients with acute traumatic brain injury were included in our study, but three of them were excluded because the pairwise ratings were not completed within a time interval of 1 h. Statistical analysis was therefore performed on 84 patients. Their mean age was 42.6 ± 11.7 years (range 25–70 years), and 63 (74.1 %) were men. Sixty-one patients (71.8 %) were intubated and mechanically ventilated at the time of scoring; thus, a score of 1 was recorded for the GCS verbal subscore.

In total, 168 ratings were performed for 84 patients by the FOUR score and the GCS. The frequency of the total score for each scale and its subscales is illustrated in Fig. 1. Cronbach’s α showed a high degree of internal consistency for the GCS (α = 0.82) as well as the FOUR score (α = 0.93).
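As a reminder of what Cronbach’s α measures, the standard formula can be sketched in a few lines. This is a generic illustration, not the SPSS computation behind the reported values. The function name and the example subscores are hypothetical.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    items: one list of scores per component (e.g. eye, motor, brainstem,
    respiration), each containing one value per rating, in the same order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two components are required")
    n = len(items[0])
    if n < 2 or any(len(col) != n for col in items):
        raise ValueError("every component needs the same number (>= 2) of ratings")

    # Total score of each rating across all components.
    totals = [sum(col[p] for col in items) for p in range(n)]
    total_var = variance(totals)
    if total_var == 0:
        raise ValueError("alpha is undefined when all total scores are identical")
    item_var_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical FOUR subscores (0-4) for five ratings, for illustration only.
eye = [0, 1, 2, 3, 4]
motor = [0, 1, 3, 3, 4]
brainstem = [1, 1, 2, 4, 4]
respiration = [0, 1, 2, 3, 4]
print(round(cronbach_alpha([eye, motor, brainstem, respiration]), 3))
```

Values close to 1 indicate that the components vary together across patients, which is what the reported α values for both scales suggest.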

Fig. 1 Frequency of scores of the GCS and the FOUR score for 168 ratings

Spearman’s correlation coefficient was high (r = 0.95, P < 0.001) between the total scores of the scales.

The inter-rater agreement for each pair of raters was excellent both for the total FOUR score (κ w 0.923, 95 % CI, 0.874–0.971) and for the total GCS score (κ w 0.838, 95 % CI, 0.889–0.987). Kappa values for all pairs of raters and for each subscale of the FOUR score and the GCS are shown in Table 2. The inter-rater agreement of both scales for each pair of raters was excellent regardless of the level of expertise and experience (Table 3).

Table 2 Inter-rater agreement for the GCS and the FOUR scores by κ w
Table 3 Receiver operating characteristic curve analyses in predicting mortality (mRS = 6) and poor outcome (mRS = 3–6) at discharge for the GCS and the FOUR scores and their subscales

Sixteen patients (18.8 %) died at hospital (mRS = 6), and 40 patients (47.1 %) had poor outcomes (mRS = 3–6) at hospital discharge. The area under the curve (AUC) in the ROC curve was estimated to compare the scales in the prediction power of in-hospital mortality and poor outcome at discharge (Fig. 2).

Fig. 2 Receiver operating characteristic curve for the GCS and the FOUR scores for poor outcome (mRS = 3–6) and mortality (mRS = 6) at discharge

AUC values in prediction of in-hospital mortality were significantly different between the FOUR score (AUC = 0.835; 95 % CI, 0.739–0.907) and the GCS (AUC = 0.772; 95 % CI, 0.668–0.856) (P = 0.01).
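The AUC has a direct interpretation: it is the probability that a randomly chosen patient who died had a lower total score than a randomly chosen survivor, counting ties as one half. The sketch below computes it that way. It is only an illustration with hypothetical data and a hypothetical function name. It does not reproduce the MedCalc analysis and does not implement the statistical comparison of two AUCs.

```python
def auc_lower_score_predicts_event(scores, events):
    """Area under the ROC curve when LOWER scores indicate higher risk.

    scores: total scale score per patient (e.g. mean of two raters).
    events: truthy where the outcome (e.g. in-hospital death) occurred.
    """
    if len(scores) != len(events):
        raise ValueError("scores and events must have the same length")
    with_event = [s for s, e in zip(scores, events) if e]
    without_event = [s for s, e in zip(scores, events) if not e]
    if not with_event or not without_event:
        raise ValueError("both outcome groups must contain at least one patient")

    # Count pairs in which the patient with the event has the lower score.
    concordant = 0.0
    for p in with_event:
        for q in without_event:
            if p < q:
                concordant += 1.0
            elif p == q:
                concordant += 0.5
    return concordant / (len(with_event) * len(without_event))


# Hypothetical total scores and mortality flags, for illustration only.
four_totals = [2, 10, 3, 12]
died = [1, 1, 0, 0]
print(auc_lower_score_predicts_event(four_totals, died))
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect separation of the two outcome groups.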

The sum of sensitivity and specificity was maximized to predict in-hospital mortality at a total score of 6 for both the FOUR (sensitivity = 100 %; specificity = 62 %) and the GCS (sensitivity = 100 %; specificity = 61 %).

In prediction of poor outcome, AUC values were not significantly different between the total FOUR score: 0.983 (95 % CI, 0.928–0.999) and the total GCS: 0.987 (95 % CI, 0.934–1.000).

The sum of sensitivity and specificity was maximized to predict poor outcome at a total score of 6 for both the FOUR score (sensitivity = 100 %, specificity = 91.1 %) and the GCS (sensitivity = 100 %, specificity = 95 %).
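Choosing the cut-off that maximizes the sum of sensitivity and specificity (equivalent to maximizing Youden’s index) can be sketched as follows. The text does not say whether a patient scoring exactly at the cut-off counts as test-positive; the sketch assumes "score ≤ cut-off predicts the event". The function name and data are hypothetical.

```python
def best_cutoff(scores, events):
    """Return (cutoff, sensitivity, specificity) maximizing sensitivity + specificity.

    A patient is predicted to have the event when score <= cutoff
    (an assumption; lower scores indicate deeper coma).
    Ties are resolved in favor of the lowest cut-off.
    """
    if len(scores) != len(events):
        raise ValueError("scores and events must have the same length")
    n_event = sum(1 for e in events if e)
    n_no_event = len(events) - n_event
    if n_event == 0 or n_no_event == 0:
        raise ValueError("both outcome groups must contain at least one patient")

    best = None
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, e in zip(scores, events) if e and s <= cutoff)
        true_neg = sum(1 for s, e in zip(scores, events) if not e and s > cutoff)
        sensitivity = true_pos / n_event
        specificity = true_neg / n_no_event
        if best is None or sensitivity + specificity > best[1] + best[2]:
            best = (cutoff, sensitivity, specificity)
    return best


# Hypothetical total scores and poor-outcome flags, for illustration only.
totals = [3, 5, 6, 6, 9, 12, 14, 16]
poor = [1, 1, 1, 0, 0, 0, 0, 0]
print(best_cutoff(totals, poor))
```

With these made-up data the criterion selects a cut-off of 6 with sensitivity 1.0 and specificity 0.8; the values reported in the study come from its own 84 patients, not from this example.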

Table 4 shows the results of logistic regression between the total score and patient outcome for the two scales.

Table 4 Odds ratios, confidence intervals, and the percent of cases correctly classified for the GCS and the FOUR scores for poor outcome (mRS = 3–6) and mortality (mRS = 6) at discharge

With the FOUR score, each 1-point increase in total score was associated with an estimated 33 % reduction in odds of experiencing in-hospital mortality under the unadjusted model (OR = 0.67, 95 % CI, 0.54–0.85) and 85 % reduction in odds of poor outcome (OR = 0.15, 95 % CI, 0.04–0.6). These relations remained after adjusting for age and sex.

With the GCS total score, each 1-point increase in total score was associated with an estimated 40 % reduction in odds of in-hospital mortality (OR = 0.6, 95 % CI, 0.43–0.83) and estimated 80 % reduction in odds of poor outcome under the unadjusted model (OR = 0.2, 95 % CI, 0.04–0.4). These relations remained after adjusting for age and sex.
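The percentage reductions quoted above follow directly from the odds ratios: a reduction of (1 − OR) × 100 % per 1-point increase in total score. A short sketch of that arithmetic follows; the function name is hypothetical, and the multi-point extension assumes the usual logistic model in which the effect is multiplicative per point.

```python
def percent_reduction_in_odds(odds_ratio, points=1):
    """Percent reduction in odds for a given increase in total score.

    odds_ratio: odds ratio per 1-point increase (from logistic regression).
    points: size of the increase; the per-point odds ratio compounds
    multiplicatively under the logistic model.
    """
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    return (1 - odds_ratio ** points) * 100


# Odds ratios reported in the text (unadjusted models).
reported = {
    "FOUR, in-hospital mortality": 0.67,
    "FOUR, poor outcome": 0.15,
    "GCS, in-hospital mortality": 0.6,
    "GCS, poor outcome": 0.2,
}
for label, odds_ratio in reported.items():
    print(f"{label}: {percent_reduction_in_odds(odds_ratio):.0f} % reduction per point")
```

Because the FOUR score (0–16) and the GCS (3–15) span different ranges, a 1-point change is not the same fraction of each scale, so per-point odds ratios of the two scales are not directly comparable in magnitude.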

Discussion

The present study assessed the predictive validity and inter-rater reliability of the Persian version of the FOUR score among unconscious patients with traumatic brain injury in an intensive care unit by comparing it with the GCS as the standard scale.

The FOUR score is simple to apply, covers the minimum necessary elements for assessing impaired consciousness, and specifically distinguishes certain unconscious states. It was developed to overcome the limitations of the GCS, which cannot assess the verbal score in intubated patients and does not test brainstem reflexes.

The results showed that the inter-rater agreement was excellent for the FOUR score (κ w = 0.923) and comparable with the GCS (κ w = 0.838). This finding is consistent with the results of the developers of the scale (κ w = 0.82 for both scales) [1], the French version (κ w = 0.86 for the FOUR score and κ w = 0.85 for the GCS) [10], the Spanish version (κ w = 0.93 for the FOUR score and κ w = 0.96 for the GCS) [12], and also the Italian version of the scale (κ w = 0.953 for the FOUR score and κ w = 0.943 for the GCS) [11].

It is interesting that our findings on inter-rater agreement are similar to those of studies with different raters and various levels of experience. For instance, in the present study, scorings were performed by two nurses, two physicians, and two nursing students, whereas in the study of the Italian version they were performed by neurologists and neurology residents with clinical expertise [11]; nevertheless, both studies found similar inter-rater agreement.

We found that the inter-rater agreement was excellent for all pairs, even for the student–student pair, who were less experienced than the nurses and physicians. This finding is consistent with the study of Eelco F. M. Wijdicks, which showed good inter-rater agreement in the nurse–physician pair for the GCS (κ w = 0.77) and the FOUR score (κ w = 0.75) [1], but it is slightly at variance with the finding of a study on the Italian version that involved highly, moderately, and less experienced raters and showed that the performances of the FOUR score and the GCS were comparable only among the highly and moderately experienced raters. The difference between our findings and those of the Italian version may be due to differences in patients [20]. Considering that standard instruction is required to apply a scale accurately [21], the excellent inter-rater agreement in the present study may result from the standard and thorough instruction in applying the new scale that the raters received before scoring. The difference between our findings and those of the French version may be related to differences in the approach to and quality of instruction.

In contrast, in a study by Michael Fischer, physician–nurse pair agreement (neurologist–ICU staff) was 0.56 with the GCS and 0.66 with the FOUR score [14]. However, in their study, agreement within the neurologist–neurologist and nurse–nurse pairs was also lower than ours with both scales, which may explain the difference between the findings of the two studies.

Our results show that the AUC values from the ROC curves are similar and excellent for predicting poor outcome with both scales, but the AUC value for predicting in-hospital mortality differed significantly between the scales and was better with the FOUR score. This finding is not consistent with the results of the Italian version, which indicate that both scoring systems are excellent predictors of in-hospital mortality and less accurate in patients with a poor outcome [11]. In addition, they reported that the scales were comparable in their power to predict in-hospital mortality, but that the prediction power of the FOUR score was lower than that of the GCS for poor outcome. The difference between our findings and those of the Italian version may result from differences in patients and sampling settings.

In our study, among the patients with a poor outcome (mRS ≥ 3), the odds ratio for the FOUR score was somewhat lower than that for the GCS. Lower odds ratios have been associated with a positive predictive value for a higher chance of a positive outcome with increased total score values [1, 13]. The proportion of cases correctly classified for both poor outcome and in-hospital mortality was similar for the GCS and the FOUR score. This is consistent with the result of Cohen [14].

Conclusion

The present study shows that the reliability of the Persian version of the FOUR score, as well as its prediction power for poor outcome (predictive validity), is comparable to that of the GCS; moreover, it is superior to the GCS because of its higher prediction power for in-hospital mortality and its ability to assess brainstem reflexes. Therefore, the Persian version of the FOUR score is a simple-to-use, easy-to-teach, and reliable scale for all practitioners, even less experienced ones such as nursing students. It can also serve as a communication tool among the members of a treatment team and can be applied reliably to assess patients with impaired consciousness and patients with traumatic brain injury in intensive care units, provided that the raters receive standard instruction.

We conclude that the Persian version of the FOUR score could be a good substitute for the GCS among unconscious patients. Further studies are recommended in other patient groups and settings.