Background

The recognition of risk factors that can stratify a population of cirrhotic patients into subgroups with different survival is of great prognostic value for the clinician. Numerous attempts have been made to develop a reliable prognostic survival model for cirrhosis. The target populations of the scoring systems in the literature include patients with liver cirrhosis [1–7], alcoholic liver disease [8, 9], variceal bleeding [10–17], and upper gastrointestinal bleeding including variceal bleeding [18–20]. The Child-Turcotte classification [1] and its subsequent modification by Pugh [10] are old empiric methods to assess hepatocellular functional reserve in candidates for portosystemic shunting. Although Child-Turcotte and Child-Pugh scores (CPS) have not been formally evaluated for their statistical accuracy, they have been useful for risk-stratifying groups of patients with cirrhosis [21–23], for assessing the efficacy of interventional procedures such as transjugular intrahepatic portosystemic shunting [24, 25] or sclerotherapy [17, 26], and for evaluating therapy for complications of cirrhosis [27–29]. Although CPS is considered an adequate method to establish the degree of liver failure and the survival probability [30], two of its elements (ascites and encephalopathy) are very subjective, and a further drawback is its limited discriminatory ability [7]. In some studies, the prognostic value of CPS is described as incomplete, and other variables are demonstrated to have prognostic significance [31]. In addition, CPS does not include prognostic factors unrelated to hepatic function (cardiac, renal, pulmonary, acid-base and electrolyte status, and other important associated comorbid conditions and factors).

The Acute Physiology, Age and Chronic Health Evaluation (APACHE) II and III scores, developed by Knaus et al in 1985 and 1991, respectively [32, 33], are used mainly for critically ill patients of all disease categories admitted to intensive care units (ICUs). They differ in how chronic health status is assessed, in the number of physiologic variables included (12 vs. 17), and in the total score. Specific parameters of liver function (i.e. serum bilirubin and albumin) are included only in the APACHE III scoring system. Some prognostic variables (e.g., prothrombin time) and other indicators of response to therapy (e.g., blood units transfused), which are known to be important outcome predictors in cirrhotic patients, are not measured by the acute physiology scores [17, 21, 22, 26]. APACHE II and III scores have been successfully used to risk stratify cirrhotic patients admitted to medical ICUs [34–38]. APACHE II has previously been used to risk stratify a mixed population of both ICU and non-ICU cirrhotic patients with upper gastrointestinal bleeding [39], while recently an incomplete APACHE III score (i.e. a score in which data for blood gas analysis were omitted) has been reported to be superior to CPS in risk stratifying cirrhotic patients outside ICU settings [40].

The aim of the present study was to compare the prognostic accuracy of Child-Pugh, 24 hour APACHE II and complete 24 hour APACHE III scoring systems in predicting hospital mortality of patients with liver cirrhosis admitted to a gastroenterological medical ward.

Methods

This prospective study included two hundred consecutive hospitalizations of 147 patients with liver cirrhosis admitted to the Department of Gastroenterology of the University Hospital of Heraklion, from February 1999 through January 2001. For the purpose of the study, each admission was considered as one patient. The criterion for inclusion was the presence, at admission or in the past history, of any of the major complications of cirrhosis (ascites, encephalopathy, variceal bleeding or spontaneous bacterial peritonitis). Patients transferred from elsewhere were included in the study only if the transfer occurred within 24 hours after initial admission. Patients with hepatocellular carcinoma and patients admitted for less than one day were excluded. Patients admitted to a medical ICU during the first 24 hours of their presentation were also excluded. The diagnosis of cirrhosis was based on liver biopsy in 93 out of 200 patient admissions (46.5%). For the remaining 107 patient admissions, the diagnosis of cirrhosis was based on clinical, laboratory and radiological criteria: history of portal hypertension excluding other etiologies, evidence of esophageal varices confirmed by endoscopy, splenomegaly, ascites confirmed by abdominal ultrasound and physical examination, impaired liver function tests and clotting profile, and ultrasound or computed tomography criteria [39, 41].

To calculate the APACHE II score [32], twelve common physiological and laboratory values (temperature, mean arterial pressure, heart rate, respiratory rate, oxygenation (PaO2 or A-aDO2), arterial pH, serum sodium, serum potassium, serum creatinine, haematocrit, white blood cell count and Glasgow coma score) are each marked from 0 to 4, with 0 being normal and 4 being the most abnormal. The sum of these marks is added to a mark adjusting for patient age and a mark adjusting for chronic health problems (severe organ insufficiency or immunocompromised patients) to arrive at the APACHE II score.
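The additive structure described above can be sketched in a few lines. This is only an illustration of the summation step: the function name, the variable keys and the example marks are assumptions made for this sketch, and the point tables that turn a measured value into a mark are those of the original APACHE II publication [32] and are not reproduced here.

```python
# Hypothetical helper: adds up marks that have ALREADY been assigned.
# The tables that convert raw measurements into marks are not included.

APS_VARIABLES = (
    "temperature", "mean_arterial_pressure", "heart_rate", "respiratory_rate",
    "oxygenation", "arterial_ph", "sodium", "potassium", "creatinine",
    "haematocrit", "white_cell_count", "glasgow_coma",
)


def apache_ii_total(aps_marks, age_mark, chronic_health_mark):
    """Sum the twelve physiological marks, the age mark and the chronic health mark."""
    missing = [name for name in APS_VARIABLES if name not in aps_marks]
    if missing:
        raise ValueError(f"missing marks for: {missing}")
    marks = [aps_marks[name] for name in APS_VARIABLES]
    if any(m < 0 for m in marks) or age_mark < 0 or chronic_health_mark < 0:
        raise ValueError("marks must be non-negative")
    # The text describes marks of 0 to 4; the upper bound is not enforced here
    # because the original scoring tables are not part of this sketch.
    return sum(marks) + age_mark + chronic_health_mark


# Made-up example: every physiological variable marked 1, age mark 3, chronic mark 2.
example_marks = {name: 1 for name in APS_VARIABLES}
example_total = apache_ii_total(example_marks, age_mark=3, chronic_health_mark=2)
```

Requiring all twelve marks makes a missing measurement an explicit error instead of a silent zero, which matters when a score is meant to be complete.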

APACHE III scores range from 0 to 299 and are derived from marks for the extent of abnormality of 17 physiologic measurements (the acute physiology score), an adjustment for age, and an adjustment for seven comorbidities that reduce immune function and influence hospital survival [33]. The 17 physiological variables include eleven laboratory parameters (haematocrit, white blood cell count, serum creatinine, serum BUN, serum sodium, serum albumin, serum bilirubin, blood glucose, PaO2, A-aDO2, and a scoring for acid-base abnormalities), five vital signs (pulse, mean blood pressure, temperature, respiratory rate, urine output) and a modified Glasgow coma score.

Clinical and laboratory data necessary for the CPS and APACHE systems, together with prothrombin time (PT) values, were recorded on the first day for all patients. Physiological data (temperature, heart rate, mean blood pressure and respiratory rate) were recorded 3-hourly during the first 24 hours of admission. The calculation of APACHE II and III scores was based on the worst values recorded during the first 24 hours after admission.
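Selecting the worst value from the 3-hourly recordings can be sketched as follows, taking "worst" to mean the reading that earns the most points under the scoring table. That reading of "worst" is an assumption of this sketch. The function names, the heart rate readings and the placeholder point function are also hypothetical, and the placeholder thresholds are not taken from the APACHE tables.

```python
def worst_reading(readings, points_for):
    """Return (value, points) for the reading that earns the most points."""
    if not readings:
        raise ValueError("at least one reading is required")
    value = max(readings, key=points_for)
    return value, points_for(value)


def placeholder_heart_rate_points(bpm):
    """Illustrative stand-in only; NOT the APACHE point table."""
    return 0 if 60 <= bpm <= 100 else 1


# Eight made-up 3-hourly heart rate readings covering the first 24 hours.
heart_rates = [72, 88, 131, 95, 90, 84, 79, 76]
worst_value, worst_points = worst_reading(heart_rates, placeholder_heart_rate_points)
```

Because the choice is driven by points and not by the raw number, the worst reading can be either an abnormally high or an abnormally low value.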

Statistical analysis

The chi-square test was used to assess the differences in mortality among Child-Pugh classes A, B, and C. The individual relationship of each score (CPS, APACHE II, APACHE III) and of PT values to the risk of death was assessed by t-test. Pearson correlation was used to assess the magnitude of the correlation of length of stay (LOS) with CPS, APACHE II and APACHE III. Descriptive statistics were expressed as mean ± SD unless otherwise stated. Discrimination was tested using receiver operating characteristic (ROC) curves and by comparing areas under the curve (AUCs) [42]. AUCs between 0.7 and 0.8 were classified as "acceptable" and between 0.8 and 0.9 as "excellent" discrimination [43]. For the different scoring systems tested, the sensitivity, specificity, overall correctness of prediction, and positive and negative predictive values were calculated, and the cutoff point giving the best Youden index was determined [44]. This cutoff point was also used to calculate the predicted and observed outcome for patients. In order to test the overall classification accuracy of the APACHE III score in association with PT, we applied discriminant analysis (backward stepwise method). A P value less than 0.05 was considered statistically significant for all of the above analyses. Calibration was assessed using the Hosmer-Lemeshow goodness-of-fit statistic, which divides subjects into deciles based on predicted probabilities of death and then computes a chi-square from observed and expected frequencies [45]. Lower chi-square values and higher P values are associated with a better fit. A good fit was defined as P > 0.05.
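The choice of a cutoff by the Youden index (sensitivity + specificity - 1) can be illustrated with a short sketch. It assumes that a score at or above the cutoff predicts death. The function name and the toy data are hypothetical and are not study data.

```python
def youden_best_cutoff(scores, died):
    """Return (cutoff, J, sensitivity, specificity) maximising J = sens + spec - 1.

    Assumption: a score greater than or equal to the cutoff predicts death.
    """
    n_died = sum(1 for d in died if d)
    n_survived = len(died) - n_died
    if n_died == 0 or n_survived == 0:
        raise ValueError("both outcomes must be present")
    best = None
    for cutoff in sorted(set(scores)):
        true_pos = sum(1 for s, d in zip(scores, died) if s >= cutoff and d)
        true_neg = sum(1 for s, d in zip(scores, died) if s < cutoff and not d)
        sensitivity = true_pos / n_died
        specificity = true_neg / n_survived
        j = sensitivity + specificity - 1
        # Strict comparison keeps the lowest cutoff when several tie.
        if best is None or j > best[1]:
            best = (cutoff, j, sensitivity, specificity)
    return best


# Toy data for illustration only (1 = died, 0 = survived).
toy_scores = [5, 6, 7, 8, 9, 10, 11, 12]
toy_died = [0, 0, 0, 1, 0, 1, 1, 1]
cutoff, j, sens, spec = youden_best_cutoff(toy_scores, toy_died)
```

Ties are resolved here in favour of the lowest cutoff, which is an arbitrary choice of this sketch; a clinical analysis would state its own tie rule.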

Results

Of the 200 patient admissions, 137 (68.5%) were men and 63 were women. The mean age was 62.3 years (range, 33–86 years). Eighty eight (44%) admissions were for viral-associated cirrhosis (HBV-associated 40 cases, HCV-associated 48 cases), 66 (33%) for alcoholic cirrhosis and 37 (18.5%) for cryptogenic cirrhosis. In 9 cases (4.5%) there was both viral and alcoholic etiology. The reasons for admission were ascites for 127 patients (63.5%), encephalopathy for 37 (18.5%), variceal bleeding for 54 (27%), spontaneous bacterial peritonitis for eight (4%), and another infection (respiratory, biliary and urinary tract infections, cellulitis) for 36 patients (18%). Five cases were transferred to a medical ICU. During the study period, 23 patients (11.5%) died. Three patients died in the ICU and 20 patients in the medical ward. The causes of death were liver failure in seven cases (30.4%), kidney failure in two cases (8.7%), hepatorenal syndrome in seven cases (30.4%), variceal bleeding in one case (4.3%), and infection in six cases (26%).

Forty nine cases (24.5%) were classified as Child-Pugh class A, 88 cases (44%) as class B and 63 cases (31.5%) as class C. No deaths were recorded among patients with Child-Pugh class A. Two patients with Child-Pugh class B and 21 with class C died. Mortality increased significantly with increasing Child-Pugh class (P < 0.001). Table 1 shows that there were significant differences in CPS, APACHE II score, APACHE III score, and PT between survivors and non-survivors. Table 2 reports predictive values of the various scoring systems calculated at the cutoff point giving the best Youden index. ROC curves are shown in Figure 1. The discrimination power of CPS and APACHE III, as measured by AUC, was excellent, while that of APACHE II was acceptable. When information regarding PT values was combined with the APACHE III score into a new discriminant function, the overall classification accuracy of APACHE III was not improved; PT was therefore deleted from the full model (non-significant at the 5% level). The results of the Hosmer-Lemeshow goodness-of-fit tests are shown in Table 3, while deciles of risk are shown in Tables 4, 5 and 6. The Hosmer-Lemeshow statistic was best for CPS. However, for the two APACHE scores, calibration was poor.
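The comparison of mortality across Child-Pugh classes can be re-derived from the counts given above (0 of 49, 2 of 88 and 21 of 63 deaths). The sketch below computes a Pearson chi-square on the resulting 3 x 2 table. It is an assumption that this is the variant of the test that was used, since the text does not say whether a test for trend was applied instead. The function names are hypothetical.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def p_value_two_df(chi2):
    """Upper-tail probability of a chi-square with exactly 2 degrees of freedom."""
    return math.exp(-chi2 / 2)


# Rows: Child-Pugh class A, B, C. Columns: died, survived (counts from the text).
child_pugh_table = [[0, 49], [2, 86], [21, 42]]
chi2, df = pearson_chi_square(child_pugh_table)
p_value = p_value_two_df(chi2)
```

Under these assumptions the statistic is about 43 on 2 degrees of freedom, which is consistent with the reported P < 0.001. The closed-form P value holds only for 2 degrees of freedom; other table sizes need a general chi-square distribution function.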

Figure 1 ROC curves for CPS, APACHE II and APACHE III scoring systems

Table 1 Average values on prothrombin time and average scores on Child-Pugh, APACHE II and APACHE III
Table 2 Comparison of the predictive values of the scoring systems
Table 3 Hosmer-Lemeshow goodness-of-fit tests
Table 4 Deciles of risk for Child-Pugh score
Table 5 Deciles of risk for APACHE II score
Table 6 Deciles of risk for APACHE III score

The median LOS for survivors was 9 days (range 2–85 days). By Child-Pugh class, the median LOS was 7 days (range 2–17 days) for class A, 9 days (range 2–48 days) for class B, and 15 days (range 2–85 days) for class C. CPS and APACHE III score correlated strongly with the duration of hospitalization (P < 0.001), while APACHE II score had a weak and non-significant correlation.

Discussion

The performance of prognostic models is evaluated by their discrimination and calibration. Discrimination (i.e., the ability of a prognostic score to classify patients correctly as survivors or non-survivors) is measured by the AUC [42, 43]. Calibration evaluates the degree of correspondence between the estimated probabilities of mortality produced by a model and the actual mortality experience of patients, and can be tested using the Hosmer-Lemeshow goodness-of-fit statistic [45].
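Both measures can be made concrete with a small sketch. The AUC is computed here as the proportion of (non-survivor, survivor) pairs in which the non-survivor has the higher score, with ties counted as one half. The Hosmer-Lemeshow statistic is computed from groups ordered by predicted probability of death. The function names and toy data are hypothetical, and the predicted probabilities are taken as given, since the way they were derived from each score is not described at this point in the text.

```python
def auc_by_pairs(scores, died):
    """AUC as the share of (death, survivor) pairs ranked correctly; ties count 0.5."""
    deaths = [s for s, d in zip(scores, died) if d]
    survivors = [s for s, d in zip(scores, died) if not d]
    if not deaths or not survivors:
        raise ValueError("both outcomes must be present")
    wins = 0.0
    for x in deaths:
        for y in survivors:
            wins += 1.0 if x > y else 0.5 if x == y else 0.0
    return wins / (len(deaths) * len(survivors))


def hosmer_lemeshow_chi2(probabilities, died, groups=10):
    """Hosmer-Lemeshow chi-square over groups ordered by predicted probability.

    The statistic is usually referred to a chi-square with (groups - 2)
    degrees of freedom.
    """
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i])
    n = len(order)
    chi2 = 0.0
    for g in range(groups):
        members = order[g * n // groups:(g + 1) * n // groups]
        if not members:
            continue
        observed = sum(died[i] for i in members)
        expected = sum(probabilities[i] for i in members)
        size = len(members)
        if 0 < expected < size:  # skip degenerate groups to avoid division by zero
            chi2 += (observed - expected) ** 2 / expected
            chi2 += ((size - observed) - (size - expected)) ** 2 / (size - expected)
    return chi2


# Toy data for illustration only (1 = died, 0 = survived).
toy_auc = auc_by_pairs([5, 6, 7, 8, 9, 10, 11, 12], [0, 0, 0, 1, 0, 1, 1, 1])
well_calibrated = hosmer_lemeshow_chi2([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0], groups=2)
poorly_calibrated = hosmer_lemeshow_chi2([0.5, 0.5, 0.5, 0.5], [1, 1, 1, 1], groups=1)
```

The two toy calls show that a model can rank patients well and still predict the wrong number of deaths: discrimination depends only on the ordering of scores, whereas calibration compares predicted with observed counts within each group.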

In our series, discrimination was acceptable to excellent for Child-Pugh and APACHE scores; however, both APACHE prognostic systems had inadequate goodness-of-fit for death. Our results for APACHE II and CPS discrimination compare well with those published by Afessa et al [39]. In their study, the prognostic value of APACHE II (AUC 0.78) was as good as that of Child-Pugh score (AUC 0.76) in predicting short-term outcome of 111 cirrhotic patients hospitalized for upper gastrointestinal bleeding, although no information regarding correct classification rates, sensitivity, specificity, cutoff values and goodness-of-fit was reported. The reported APACHE II mean values were higher for both survivors and non-survivors (17.2 ± 6 and 25.6 ± 10.1 respectively). However, 71% of their patients were ICU admissions and 57.6% had active variceal bleeding; thus they studied much sicker patients than those included in our sample.

Butt et al reported that, using discriminant analysis, APACHE III score correctly classified 75% of cases vs. 67% of cases for Child-Pugh score [40]. No cutoff values were reported, the overall model calibration was not tested, and data from blood gas analysis were not included in the calculation of the APACHE III score, thus resulting in an incomplete score. The APACHE III mean values were high for both survivors and non-survivors (58.9 ± 35.1 and 87.4 ± 30.3 respectively). This might be related to the high percentage of patients admitted with upper gastrointestinal tract bleeding (57%). Since four out of five vital signs (pulse, mean blood pressure, respiratory rate, and possibly urine output) and some of the laboratory parameters (i.e. haematocrit, serum BUN, and possibly creatinine) needed to arrive at the APACHE III score are markedly affected by bleeding, this might be the reason for the observed higher scores. Furthermore, the authors did not specify whether they included patients admitted to a medical ICU during the first 24 hours of their admission, whereas patients with hepatocellular carcinoma were also included in the study. The reported mortality on day 1 was 26% and 68% in patients with an APACHE III score of 51 to 75 points and greater than 75 respectively. It is noteworthy that in our series 17 out of 67 patients (25.3%) with an APACHE III score of 51 to 75 and 6 out of 11 patients (54.5%) with an APACHE III score greater than 75 also died. This suggests that, at least in this sub-group of sicker patients, our results compare well.

There are many potential reasons for the insufficient calibration of APACHE scores. Clinically useful predictive models should demonstrate ease of use, accuracy, reproducibility and acceptance by data collecting staff [46]. Some variables of the APACHE scores (e.g., heart rate) depend on continuous monitoring. In addition, it has been shown that inter-observer variability is high when these scoring systems are not used on a regular basis (as in most non-ICU wards), thus affecting the accuracy and reproducibility of the data [47, 48]. This is potentially relevant in our study, since physiological data collection was performed by several physicians and over a long period of time (24 months). As previously suggested [49], we tried to minimize variability by having one person coordinate the process of data collection and by keeping a written reference of definitions based on the original articles on the APACHE scores.

Another potential reason for the inadequate calibration is the difference in level of disease severity between our database and the development databases of the mortality prediction systems [50]. Statistically derived prediction models like the APACHE systems are calibrated to the overall outcome prevalence in the development sample. Although APACHE II and III have been shown to work well in cirrhotic patients with a high severity of illness admitted to ICUs [34–38], it is well known that the performance of mortality prediction models usually deteriorates when they are applied to different population samples (e.g., less sick patients) [51]. In studies conducted in cirrhotic patients admitted to ICU, cutoff values for APACHE II have ranged from 17 (AUC 0.69; ICU mortality rate 52%) to 22 (AUC 0.79; ICU mortality rate 36%, hospital mortality 46%) [34, 37], while those reported for APACHE III have ranged from 75 (AUC 0.78; ICU mortality rate 43%, hospital mortality 57%) to 80 (AUC 0.75) [34, 35]. In our series, APACHE II scores equal to or greater than 17 and 22, and APACHE III scores equal to or greater than 75 and 80, were recorded in only 25 (12.5%), 5 (2.5%), 11 (5.5%) and 8 (4%) patients respectively, thus emphasizing the much lower level of disease severity in our patients. It should also be recognized that the wide 95% CI of our AUCs (Table 2) suggests a sample size problem, especially given that only 23 patients died.

Potential limitations of our study should also be mentioned. Our study was performed in an academic referral hospital; therefore our results may not be applicable to institutions with different patient populations. Because mathematical equations for APACHE III have not been published, and for APACHE II this equation is available only for admission, these equations have not been used to calculate the relative risk of death. In agreement with other studies [34, 35, 37, 39, 40], we wanted to test the accuracy of single-score values. Patients admitted to a medical ICU during the first 24 hours of their presentation were excluded from our study, thus resulting in a mortality rate of only 11.5%. It could be argued that the rationale for excluding these patients weakens our study, since patients who are sicker at presentation are more likely to die. However, physiological data included in the APACHE III score are recorded 3-hourly during the first 24 hours of admission, and the worst value in this time interval is taken into account to calculate the total score [33]. Furthermore, we aimed to identify, within a 24 hour interval, patients not sick enough to be admitted to a medical ICU but who are unlikely to benefit from standard therapy and for whom more intensive monitoring and treatment might be tried.

Conclusions

In conclusion, we cannot recommend the use of APACHE II and III scores in non-ICU patients. The present study showed that the discrimination power of CPS and APACHE III, as measured by AUC, was excellent, while that of APACHE II was acceptable. Although the Hosmer-Lemeshow statistic revealed adequate goodness-of-fit for CPS, this was not the case for APACHE II and III scores. Our results indicate that, among the three scores, CPS had the least statistically significant discrepancy between predicted and observed mortality across the strata of increasing predicted mortality. This supports the hypothesis that APACHE scores do not work accurately outside ICU settings.