Background

Measuring validity, including content, construct, and criterion validity, is a fundamental issue in medical research. In epidemiological studies, criterion validity (referred to simply as validity in this paper) is the most widely used, and it depends heavily on the criterion measurement. Ideally, the criterion should completely reflect the 'truth' and is commonly referred to as the 'gold standard'. When a 'gold standard' is available, sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) are employed to quantify the validity of the measure under examination.

In the 'real' world, particularly in epidemiological studies, a 'gold standard' is rarely available and is often too difficult or costly to establish. Researchers therefore often use proximate measures of the 'gold standard' as the criterion for assessing validity. For example, the validity of medical conditions recorded in hospital discharge administrative data has been assessed by re-abstraction of inpatient charts by health professionals, as well as by comparison with patient self-reported data. In such situations, when there is no 'gold standard', the kappa statistic is commonly used to assess agreement as an estimate of "validity".

In 1960, building on the chance-corrected reliability of content analysis[1], Cohen developed the kappa statistic for the evaluation of categorical data, which corrects or adjusts for the amount of agreement that can be expected to occur by chance alone[2]. Since its inception, kappa has been widely studied and critiqued (Table 1). One common criticism is that kappa is highly dependent on the prevalence of the condition in the population[3, 4]. To overcome this limitation, several alternative methods for assessing agreement have been investigated[5-8]. In 1993, Byrt et al[9] proposed the prevalence-adjusted bias-adjusted kappa (PABAK), which assumes a fifty percent prevalence of the condition and the absence of any bias. PABAK has been employed for agreement assessment in many previous studies[10-17]. Compared with kappa, PABAK reflects an idealized situation, ignoring the variation of prevalence across conditions and the bias present in the "real" world. To demonstrate the performance of kappa and PABAK, we assessed the agreement between hospital discharge administrative data and chart review data, considering that the prevalence of a condition varies with the sampling method employed. We analyzed kappa and PABAK under three sampling scenarios: 1) random sampling, 2) sampling restricted by conditions, and 3) case-control sampling.

Table 1 Literature related to kappa

Methods

Random sample

A total of 4,008 hospital discharge records were randomly selected from admissions between January 1, 2003 and June 30, 2003 at the four adult teaching hospitals in Alberta, Canada, with at least 1,000 records from each hospital.

Defining conditions in ICD-10 administrative data

Professionally trained health record coders read through each patient's medical chart to assign International Classification of Diseases, 10th revision (ICD-10) diagnoses that appropriately described the patient's hospitalization. Each discharge record contained a unique identification number for the admission, a patient chart number, and up to 16 diagnoses. We defined 32 conditions based on ICD-10 codes[18].
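As a minimal illustration of this coding-based definition (not the study's actual algorithm), a condition can be flagged whenever any of a record's diagnosis fields starts with one of the condition's ICD-10 code prefixes. The prefixes and record layout below are assumptions for illustration only and may differ from the groupings of reference [18]:

```python
# Illustrative sketch, not the authors' algorithm: flag defined conditions
# from the up-to-16 diagnosis fields of a discharge record. The ICD-10
# prefixes below are examples only; the record layout is hypothetical.
CONDITION_CODES = {
    "hypertension": ("I10", "I11", "I12", "I13", "I15"),
    "myocardial_infarction": ("I21", "I22"),
    "congestive_heart_failure": ("I50",),
}

def flag_conditions(diagnoses):
    """Return the set of defined conditions matched by any diagnosis code."""
    return {
        condition
        for condition, prefixes in CONDITION_CODES.items()
        if any(dx.startswith(prefixes) for dx in diagnoses)
    }

# A record coded with hypertension (I10) and an acute myocardial infarction
# (I21.0) would be flagged for both conditions:
print(flag_conditions(["I10", "I21.0", "E11.9"]))
```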

Defining conditions in chart data

Charts of the 4,008 randomly selected patients were located using the personal unique identifier and admission date. Two professionally trained reviewers completed a thorough chart review for each patient, examining the chart cover page, discharge summaries, narrative summaries, pathology reports (including autopsy reports), trauma and resuscitation records, admission notes, consultation reports, surgery/operative reports, anesthesia reports, physician daily progress notes, physician orders, diagnostic reports, and transfer notes for evidence of any of the 32 conditions. The process took approximately one hour per chart.

Restricted sample

We extracted records with any one of three conditions (i.e. hypertension, myocardial infarction and congestive heart failure) from the ICD-10 administrative data. Among the 1,126 records meeting this criterion, 887 had hypertension, 336 had myocardial infarction and 254 had congestive heart failure.

Case-control sample

We defined a case-control sample for each of the three conditions based on the ICD-10 administrative data. The first sample included the 887 records with hypertension and 887 randomly selected records from those without hypertension, for a total of 1,774 records. Using the same method, the second and third samples were generated for myocardial infarction and congestive heart failure, yielding 672 records for myocardial infarction (336 with myocardial infarction, 336 without) and 508 records for congestive heart failure (254 with congestive heart failure and 254 without).
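A minimal sketch of the restricted and case-control sampling schemes, assuming the discharge records are held in a pandas DataFrame with hypothetical boolean condition columns (the column names and random seed are illustrative, not from the original study):

```python
# Minimal sketch of the restricted and case-control sampling schemes,
# assuming a pandas DataFrame `records` with one row per discharge and
# hypothetical boolean indicator columns derived from the ICD-10 data.
import pandas as pd

def restricted_sample(records: pd.DataFrame, conditions: list) -> pd.DataFrame:
    """Keep records flagged with any one of the given conditions."""
    return records[records[conditions].any(axis=1)]

def case_control_sample(records: pd.DataFrame, condition: str,
                        seed: int = 0) -> pd.DataFrame:
    """All cases plus an equal number of randomly selected controls (1:1)."""
    cases = records[records[condition]]
    controls = records[~records[condition]].sample(n=len(cases),
                                                   random_state=seed)
    return pd.concat([cases, controls])

# e.g. case_control_sample(records, "hypertension") would pair the 887
# hypertension records with 887 randomly chosen records without hypertension.
```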

Statistical indices of agreement

We calculated the prevalence of each condition, the kappa and PABAK statistics, positive agreement, negative agreement, observed agreement, and chance agreement in each of the three samples. Kappa and PABAK are defined as:

$$\kappa = \frac{I_o - I_e}{1 - I_e}, \qquad \mathrm{PABAK} = 2I_o - 1$$

where I_o and I_e are the observed agreement and chance agreement, respectively.

The cross-classification of each condition between the two databases and the formulas for calculating the agreement statistics can be found in Additional file 1. The statistical values reported in this paper (e.g. kappa) are sample statistics rather than population parameters, although the sample statistics are used to estimate population values.
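The following sketch, assuming the cell labels a, b, c, d of Additional file 1 (a = condition present in both sources, b and c = present in one source only, d = absent in both), computes the agreement indices used in this paper; it is an illustration, not the authors' code:

```python
# Minimal sketch of the agreement indices from the 2x2 cross-classification:
# a = present in both sources, b and c = present in one source only,
# d = absent in both.
def agreement_indices(a: int, b: int, c: int, d: int) -> dict:
    n = a + b + c + d
    pa, pb, pc, pd_ = a / n, b / n, c / n, d / n             # cell proportions
    i_o = pa + pd_                                           # observed agreement
    i_e = (pa + pb) * (pa + pc) + (pc + pd_) * (pb + pd_)    # chance agreement
    return {
        "positive agreement": 2 * pa / (2 * pa + pb + pc),
        "negative agreement": 2 * pd_ / (2 * pd_ + pb + pc),
        "observed agreement": i_o,
        "chance agreement": i_e,
        "kappa": (i_o - i_e) / (1 - i_e),
        "PABAK": 2 * i_o - 1,
    }

# Hypothetical example: a condition recorded in both sources for 80 records,
# in only one source for 20 + 40 records, and in neither for 3,860.
print(agreement_indices(80, 20, 40, 3860))
```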

This study was approved by the ethics committee of the University of Calgary, Canada.

Results

Random sample for 32 conditions

The statistical indices of agreement for the 32 conditions are presented in Table 2. Across the 32 conditions, the prevalence ranged from 0.25% to 22.13% in the ICD-10 administrative data and from 0.60% to 30.19% in the chart review data. Negative agreement ranged from 0.92 to 1.00, and positive agreement from 0.21 to 0.84. Kappa varied from 0.20 to 0.83, whereas PABAK varied from 0.72 to 0.99, with the PABAK value greater than kappa for every condition in the sample. The difference between the PABAK and kappa values ranged from 0.06 to 0.77: hypertension and metastatic cancer had the smallest difference (0.06), while blood loss anemia and coagulopathy had the largest (0.77). Kappa and PABAK plotted against the prevalence of the 32 conditions are shown in Figure 1.

Table 2 Prevalence and reliability index between chart abstract and ICD-10 discharge abstract data for 32 conditions
Figure 1 The comparison of kappa and PABAK with changes in the prevalence of the conditions.

Restricted sample for select conditions

The prevalence of hypertension, myocardial infarction and congestive heart failure in the ICD-10 data increased from 22.13%, 8.38% and 6.34% in the random sample to 78.77%, 29.84% and 22.56% in the restricted sample, respectively (see Table 3). The difference between the PABAK and kappa values for hypertension, myocardial infarction and congestive heart failure was 0.09, 0.03 and 0.05, respectively, compared with 0.06, 0.18 and 0.18 in the random sample.

Table 3 Prevalence and reliability index between chart abstract and ICD-10 discharge abstract data for 3 select conditions, by sampling method

Case-control sample for select conditions

By design, the prevalence of each of these three conditions in the ICD-10 administrative data was 50%. Both the PABAK and kappa values for hypertension, myocardial infarction and congestive heart failure were 0.82, 0.88 and 0.89, respectively, giving a difference of 0 between the two indices.
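This equality follows directly from the definitions in Methods: with the ICD-10 marginal fixed at 50% by the case-control design, the chance agreement is

$$I_e = \tfrac{1}{2}\,p + \tfrac{1}{2}\,(1 - p) = \tfrac{1}{2}, \qquad \kappa = \frac{I_o - \tfrac{1}{2}}{1 - \tfrac{1}{2}} = 2I_o - 1 = \mathrm{PABAK},$$

where p denotes the chart-review prevalence, so kappa and PABAK must coincide regardless of p.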

Discussion

We investigated the performance of a prevalence-unadjusted statistical index (kappa) and a prevalence-adjusted one (PABAK) for assessing agreement between administrative data and medical chart review data in a randomly selected sample of 32 conditions, as well as in a restricted sample and a case-control sample for three selected conditions (hypertension, myocardial infarction and congestive heart failure). Our results indicate that for the same condition the prevalence varies with the sampling method, and this variation affects both the kappa and PABAK statistics. We highlight that 1) kappa values varied more widely than PABAK values across the 32 conditions; 2) PABAK should usually not be interpreted as measuring the same agreement parameter as kappa in administrative data, particularly for conditions with low prevalence; and 3) the gap between the two statistics for the same condition narrowed as its prevalence increased and disappeared when the prevalence was fixed at 50%.

The sampling method employed has a significant effect on the assessment of validity. Random, restricted, and case-control samples are all commonly used in validity studies of administrative data[19-21]. In our study, the kappa value for hypertension was 0.72 for the random sample and 0.69 for the restricted sample. Previous studies indicate that the kappa value is highly dependent on the prevalence of the condition[3, 4]. The prevalence of hypertension was 22.13% in the random sample and 78.77% in the restricted sample. The difference in the kappa values for hypertension might therefore be caused by the difference in prevalence between the random and restricted samples. The sampling method also affected the PABAK value, which varied by type of sampling method. By definition, PABAK assumes a prevalence of 50% with zero bias[6]; its value depends only on the observed agreement[9]. It reached 0.82 for the case-control sample, in which the prevalence of hypertension was 50%. These findings are consistent with Vach's[22] report based on a hypothetical sample. To overcome the effect of prevalence on the kappa value, some researchers advocate using a balanced sample and avoiding kappa when assessing the validity of conditions with low or very high prevalence[3, 4, 9, 23]. A potential reason for the variation in PABAK across sampling methods is the change in observed agreement, which resulted from the variation in prevalence estimates across sampling methods. Interpretation of validation study results should therefore consider the sampling effect and the disease prevalence.

Kappa is sensitive to the prevalence of a condition defined from administrative data. Agresti's study[24] investigated the influence of prevalence on kappa by comparing kappa values from populations with different prevalences. Vach's[22] study indicated that kappa is highest when the prevalence equals 50%. In reality, nearly all diseases have a population prevalence lower than 50%. A case-control design with 1:1 matching automatically pushes the prevalence to 50% and thereby inflates the kappa value. In our study, kappa for hypertension in the case-control sample (0.82) was higher than in the random sample (0.72). Caution should therefore be used when interpreting kappa from case-control studies.
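A small numerical sketch makes this prevalence dependence concrete. Holding the accuracy of one source relative to the other fixed (the sensitivity and specificity values below are hypothetical, chosen purely for illustration), kappa rises toward its maximum as the prevalence approaches 50%, while PABAK simply tracks the observed agreement:

```python
# Hypothetical sketch: fix the accuracy of one source relative to the other
# (sensitivity 0.80, specificity 0.95, chosen for illustration) and vary the
# prevalence p; kappa rises toward 50% prevalence while PABAK tracks only
# the observed agreement.
def kappa_pabak(p: float, sens: float = 0.80, spec: float = 0.95):
    a = p * sens                # positive in both sources
    c = p * (1 - sens)          # positive in the reference source only
    b = (1 - p) * (1 - spec)    # positive in the other source only
    d = (1 - p) * spec          # negative in both sources
    i_o = a + d
    i_e = (a + b) * (a + c) + (c + d) * (b + d)
    return (i_o - i_e) / (1 - i_e), 2 * i_o - 1

for p in (0.01, 0.05, 0.10, 0.25, 0.50):
    k, pb = kappa_pabak(p)
    print(f"prevalence={p:.2f}  kappa={k:.2f}  PABAK={pb:.2f}")
```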

In this study, we compared the performance of kappa and PABAK in measuring agreement between administrative data and chart review data, but we were unable to determine which statistic measures agreement more accurately and reliably, because we could not establish the parameter of 'true' validity of the administrative data. However, assessing these two statistics against administrative data coding practice (i.e. face validity) demonstrates that PABAK values should usually not be interpreted as measuring the same agreement parameter as kappa, particularly for conditions with low prevalence. In our random sample, PABAK ranged from 0.72 to 0.99 across the 32 conditions, suggesting a high degree of "validity" of administrative data in recording these conditions. The high PABAK values for obesity and weight loss (0.96 for both conditions) are examples that call the validity of PABAK into question. Administrative data coding guidelines and practice[25] instruct coders not to code these conditions, even if they are documented in charts, because they may not affect length of stay, healthcare or therapeutic treatment. Additionally, coders may intentionally omit these conditions because of the limited time available to code each chart. These two conditions therefore generally have very poor validity in administrative data, a fact revealed by the very poor positive agreement (0.30 for obesity and 0.21 for weight loss) and kappa values (0.20 and 0.28). PABAK assumes that bias is absent and that the prevalence is 50%; when bias is present and the prevalence departs from 50%, the PABAK and kappa values are inconsistent.
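A hypothetical low-prevalence table (cell counts invented for illustration, not the study's actual data) reproduces this pattern: chance agreement is already very high when almost every record is negative in both sources, so kappa stays low even though PABAK approaches 1.

```python
# Hypothetical low-prevalence table: the condition is rarely coded and
# positive agreement is poor, yet the mass of joint negatives keeps
# observed agreement, and hence PABAK, very high.
a, b, c, d = 15, 10, 75, 3900          # 2x2 cells for 4,000 records
n = a + b + c + d
i_o = (a + d) / n                                       # observed agreement ~0.98
i_e = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2    # chance agreement ~0.97
kappa = (i_o - i_e) / (1 - i_e)                         # ~0.25
pabak = 2 * i_o - 1                                     # ~0.96
pos_agreement = 2 * a / (2 * a + b + c)                 # ~0.26
print(f"kappa={kappa:.2f}  PABAK={pabak:.2f}  PA={pos_agreement:.2f}")
```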

Bias also has an effect on kappa values. Bias is the extent to which the two sources disagree on the proportion of positive (or negative) cases and is reflected in a difference between cells b and c (see Additional file 1)[26]. The kappa value is higher when bias is large and lower when bias is small or absent[9]. In our study, the values of b and c changed for the same condition across the different samples. For myocardial infarction, b and c were 0.55% and 4.92% in the random sample, but 1.95% and 10.04% in the restricted sample. Furthermore, the prevalence of the condition changed from 8.38% in the random sample to 29.84% in the restricted sample. From the definition of kappa, such changes in prevalence clearly affect its value. In the restricted sample, both the difference between b and c and the prevalence of myocardial infarction were higher than in the random sample, which partly explains why the kappa value was higher in the restricted sample than in the random sample.
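Byrt et al[9] make the joint influence of prevalence and bias explicit: writing the cell proportions as in Additional file 1 and defining a prevalence index and a bias index, kappa can be expressed in terms of PABAK as

$$\mathrm{PI} = a - d, \qquad \mathrm{BI} = b - c, \qquad \kappa = \frac{\mathrm{PABAK} - \mathrm{PI}^2 + \mathrm{BI}^2}{1 - \mathrm{PI}^2 + \mathrm{BI}^2},$$

so that, for a fixed observed agreement (and hence fixed PABAK), a larger bias index raises kappa while a more extreme prevalence index lowers it, consistent with the behaviour described above.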

Our study has at least two limitations. First, we could not capture the 'true' value of kappa for the conditions in the administrative data; therefore, the difference between the 'true' and estimated values of kappa, and their changes due to variation in prevalence, could not be evaluated. Second, we employed chart data extracted by reviewers as the 'gold standard' to assess the validity of the ICD-10 data, and such a criterion standard depends on the quality of the charts.

Conclusion

Our study indicates that the prevalence of conditions varies depending on the sampling method employed, and that these changes affect both kappa and PABAK. Although PABAK theoretically adjusts for prevalence, it may overstate the agreement between two data sources and lead to misleading conclusions. Since no single agreement statistic can capture all of the desired information, we encourage researchers to report kappa, the prevalence, positive agreement, negative agreement, and the relative frequency in each cell (i.e. a, b, c and d) so that readers can judge the validity of administrative data from multiple perspectives.