Introduction

Chronic obstructive pulmonary disease (COPD) is among the leading causes of disability and death in developed countries. The prevalence of COPD is still on the rise, and costs for the health system are substantial [1, 2]. Airflow limitation that is not fully reversible after bronchodilator application is a key feature of COPD, and spirometry is the routine diagnostic procedure of choice recommended to diagnose COPD [3, 4]. However, the degree of obstruction that establishes the diagnosis of COPD is still under debate [5]. The Global Initiative for chronic Obstructive Lung Disease (GOLD) defined COPD as a fixed post-bronchodilator ratio of forced expiratory volume in 1 second and forced vital capacity (FEV1/FVC ratio) of less than 0.70 [6]. This definition is widely accepted, mainly because of its practicability.

Since the FEV1 value decreases more quickly with age than the (F)VC, the GOLD definition tends to overdiagnose COPD in the elderly [7, 8]. Therefore, some authors suggested using the lower limit of normal (LLN) procedure to diagnose COPD [5]. The LLN is based on age-stratified pre-bronchodilator cut-off values of the FEV1/FVC ratio, and a value below the lower fifth percentile of an age-matched healthy reference group is considered abnormal and consistent with a diagnosis of COPD [9, 10]. Multiple studies showed that application of any population-derived LLN will result in lower prevalence estimates of COPD compared to the GOLD definition in the elderly [5, 9]. The question remains, however, which method should be preferred. Another question is whether there are other pulmonary function test variables that could improve the diagnostic accuracy of GOLD or LLN. The final question is which definition best predicts prognosis and is thus useful for treatment decisions in COPD. To answer all three questions, the two criteria should be compared with an alternative, acceptable reference standard applied in the relevant domain, that is, a population suspected of COPD [11]. An expert panel diagnosis of COPD, based on all available diagnostic information from the clinical assessment, smoking habits, and a complete pulmonary function test (PFT), could be regarded as such a reference standard [11].

To the best of our knowledge, our study is the first to validate GOLD and LLN criteria against an expert panel diagnosis in patients suspected of COPD and to assess their prognostic ability.

Since, in daily practice, establishing the diagnosis of COPD is usually not based on a single PFT parameter, we furthermore assessed whether the addition of other PFT parameters to the GOLD or LLN criteria increases diagnostic accuracy compared to either definition alone.

Methods

Subject and Study Design

In a prospective cohort study, 405 patients aged ≥ 65 years with a general practitioner's (GP) diagnosis of COPD were enrolled in a stable phase of their disease. Primary assessment was between 2001 and 2003. Population and study characteristics have been published previously [12, 13]. In brief, all patients underwent a detailed standardized clinical examination at an outpatient clinic (University Medical Center Utrecht) including PFT, chest radiography, and echocardiography. The study complied with the Declaration of Helsinki, and the Medical Ethical Committee of the University Medical Center Utrecht approved the study protocol. All participants gave their written informed consent.

Pulmonary function tests

All PFT were performed with a fixed volume bodyplethysmograph and Masterscreen (Masterlab Jaeger, Würzburg, Germany). The post-bronchodilator test was assessed after inhalation of ipratropium bromide (40 micrograms twice). For predicted values of lung function markers, the recommendations of the European Respiratory Society were used [14].

Expert panel diagnosis of COPD

In the absence of a true reference standard, a consensus of an expert panel is widely accepted as the best alternative [11]. The initial expert panel of the study was composed of a qualified pulmonologist (JWL) and a GP with special interest in COPD (FHR). The panel determined presence or absence of COPD based on all available results from the clinical assessment, including history taking and smoking history, chest radiographs, and finally spirometric and bodyplethysmographic information (data and graphs). Besides the FEV1/FVC ratio, other parameters from the PFT were also considered, including the shape of the curve, FEV1 as % predicted, presence of reversibility, RV/TLC, resistance, air trapping, and DLCO value. Also, smoking habits, a history of allergy or hyperreactivity, onset of (periods of) dyspnoea and coughing at an early age, and a history of pulmonary embolism or lung diseases other than COPD were used in case of doubt and when applicable.

The same members of the expert panel re-evaluated the diagnosis in a random sample of 80 (20%) cases in 2011, resulting in an excellent kappa statistic between initial and repeat evaluation of 0.90. Another random sample of 120 (30%) cases was externally validated by a panel composed of a German pulmonologist (MH) and a Dutch GP with a special interest in COPD (APS). Kappa statistic between the initial panel diagnosis and that of the external panel was 0.76.

Both panels followed the aforementioned 'strategy' to diagnose COPD. Asthma was diagnosed if a reversible obstruction was accompanied by a typical history of asthma (allergy, hyperreactivity, onset of symptoms at a young age). Reversibility was considered present if the FEV1 increased by 200 ml and/or 12% after bronchodilator therapy, according to the American Thoracic Society definition [10].
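The reversibility criterion can be written as a short check. This is a minimal sketch, not the study's own code: the function name, the unit (litres), and the `require_both` switch are assumptions introduced here to cover the "and/or" wording of the text.

```python
def is_reversible(fev1_pre_l: float, fev1_post_l: float,
                  min_abs_l: float = 0.200, min_rel: float = 0.12,
                  require_both: bool = False) -> bool:
    """Check the bronchodilator response of FEV1 (values in litres).

    min_abs_l: absolute gain threshold (200 ml as stated in the text).
    min_rel:   relative gain threshold (12% of the pre-bronchodilator value).
    require_both: hypothetical switch; False reads the criterion as "or",
                  True reads it as "and".
    """
    if fev1_pre_l <= 0:
        raise ValueError("pre-bronchodilator FEV1 must be positive")
    gain_l = fev1_post_l - fev1_pre_l
    abs_ok = gain_l >= min_abs_l
    rel_ok = gain_l / fev1_pre_l >= min_rel
    return (abs_ok and rel_ok) if require_both else (abs_ok or rel_ok)


# Illustrative values only, not patient data from the study.
print(is_reversible(1.50, 1.72))                     # gain 220 ml, about 14.7%
print(is_reversible(3.00, 3.25, require_both=True))  # gain 250 ml, about 8.3%
```

With a large baseline FEV1 the absolute and relative thresholds can disagree, which is why the "and" versus "or" reading matters in practice.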

Diagnosis of COPD according to the GOLD and LLN criteria

A post-bronchodilator FEV1/FVC ratio < 0.70 established the diagnosis 'GOLD-COPD' [4]. The 'GOLD-COPD' was graded using post-bronchodilator % of predicted FEV1 values: GOLD stage 1 (mild): ≥ 80%; stage 2 (moderate): 50-79%; stage 3 (severe): 30-49%; stage 4 (very severe): < 30% [4].
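The fixed-ratio rule and the staging thresholds above translate directly into code. The following sketch uses a hypothetical function name and returns `None` when the ratio is not below 0.70; it assumes FEV1 is supplied as % of predicted.

```python
from typing import Optional


def gold_stage(fev1_fvc_post: float, fev1_pct_pred_post: float) -> Optional[int]:
    """Return the GOLD stage (1-4), or None if the ratio is not below 0.70.

    fev1_fvc_post:      post-bronchodilator FEV1/FVC ratio (e.g. 0.65).
    fev1_pct_pred_post: post-bronchodilator FEV1 as % of predicted (e.g. 72).
    """
    if fev1_fvc_post >= 0.70:
        return None          # no 'GOLD-COPD'
    if fev1_pct_pred_post >= 80:
        return 1             # mild
    if fev1_pct_pred_post >= 50:
        return 2             # moderate
    if fev1_pct_pred_post >= 30:
        return 3             # severe
    return 4                 # very severe


# Illustrative values only.
print(gold_stage(0.65, 85))  # stage 1
print(gold_stage(0.72, 60))  # None: ratio not below 0.70
```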

An FEV1/FVC ratio below the lower fifth percentile of healthy reference groups (similar age) established the diagnosis 'LLN-COPD'. From several LLN equations provided by http://www.spirxpert.com/controversies/workinggroup.html [14-17] we selected three LLN reference equations on the basis of sample size, popularity and comparability with the age of our study population. These equations were derived from the following populations:

1. Enright et al. [16]: USA; healthy Caucasians with no respiratory symptoms, N = 1,227, 26% male, age range 65-85 years, non-smokers or all-time smoking duration < 5 years.

2. Quanjer et al. (ECCS/ERS) [14]: Europe; healthy never-smokers with no respiratory symptoms, N = 1,204, 27% male, age range 20-70 years.

3. Falaschetti et al. (Health Survey for England) [17]: Great Britain; healthy never-smokers with no respiratory disease, N = 6,053, 41% male, age range 16-85 years.

The aforementioned reference equations of the LLN-COPD definitions are based on pre-bronchodilator values, and the GOLD-COPD definition on post-bronchodilator FEV1/FVC values. We analysed both pre- and post-bronchodilator cut-off values for both definitions. Because the results were similar, we only present the post-bronchodilator results.
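To make the contrast between the two rules concrete, the sketch below classifies one hypothetical patient under both. It does not reproduce any of the three published reference equations: the predicted ratio and residual standard deviation are invented inputs, and the lower fifth percentile is approximated under an assumed normal distribution of the reference values.

```python
from statistics import NormalDist


def lower_limit_of_normal(predicted: float, residual_sd: float,
                          percentile: float = 0.05) -> float:
    """Lower limit of normal, assuming normally distributed reference values."""
    z = NormalDist().inv_cdf(percentile)   # about -1.645 for the 5th percentile
    return predicted + z * residual_sd


def is_lln_copd(fev1_fvc: float, predicted: float, residual_sd: float) -> bool:
    """'LLN-COPD': measured ratio below the lower fifth percentile."""
    return fev1_fvc < lower_limit_of_normal(predicted, residual_sd)


def is_gold_copd(fev1_fvc: float) -> bool:
    """'GOLD-COPD': measured ratio below the fixed 0.70 cut-off."""
    return fev1_fvc < 0.70


# Hypothetical elderly patient: predicted ratio 0.74, residual SD 0.06.
ratio = 0.67
print(round(lower_limit_of_normal(0.74, 0.06), 3))        # LLN near 0.641
print(is_gold_copd(ratio), is_lln_copd(ratio, 0.74, 0.06))
```

A ratio that falls between the LLN and 0.70 is positive under the fixed cut-off but negative under the LLN, which is the discordant zone the comparison in this study is concerned with.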

Prognostic outcomes

Exacerbations of COPD (need for short course of oral steroids), hospitalization for COPD, and all-cause mortality were assessed blinded to the diagnostic classification.

Data analysis

Continuous data are expressed as mean (standard deviation, SD) or median (quartiles), as appropriate. Comparisons between groups were made with Fisher's exact test or Mann-Whitney U-test. Sensitivity, specificity, positive and negative predictive values and kappa (κ) statistics with 95% confidence intervals (CI) were calculated for each COPD definition with the 'expert panel diagnosis of COPD' of the initial panel as the reference test. The classification system proposed by Landis and Koch was used to determine the level of concordance (a κ of 0.81-1.00 is considered almost perfect) [18]. A bootstrap method was used for calculating the 95% CI of κ and for assessing the statistical significance of differences between correlated κ values [19, 20]. The diagnostic ability of different PFT parameters for predicting COPD according to the reference standard was tested using receiver operating characteristic (ROC) curves with C-statistics and 95% CI [20]. The two PFT parameters predicting best according to the C-statistics were incorporated into a new 'modified' GOLD- or LLN-based definition, and the diagnostic performance (false positives, false negatives, kappa) of these extended models was compared to that of the original definitions (Figure 1).
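The accuracy measures named above can all be derived from a 2x2 table of index test versus reference test. The sketch below is illustrative only: the counts are invented, the function names are hypothetical, and the percentile bootstrap shown is one common variant that may differ from the procedure used in the study.

```python
import random
from typing import Dict, Tuple


def diagnostic_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Sensitivity, specificity, predictive values and Cohen's kappa."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Agreement expected by chance from the marginal totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (observed - expected) / (1 - expected),
    }


def bootstrap_kappa_ci(tp: int, fp: int, fn: int, tn: int,
                       n_boot: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap 95% CI for kappa, resampling patients."""
    rng = random.Random(seed)
    cells = ["tp"] * tp + ["fp"] * fp + ["fn"] * fn + ["tn"] * tn
    kappas = []
    for _ in range(n_boot):
        sample = rng.choices(cells, k=len(cells))
        counts = {c: sample.count(c) for c in ("tp", "fp", "fn", "tn")}
        try:
            kappas.append(diagnostic_metrics(**counts)["kappa"])
        except ZeroDivisionError:
            continue  # degenerate resample with an empty margin
    kappas.sort()
    return kappas[int(0.025 * len(kappas))], kappas[int(0.975 * len(kappas)) - 1]


# Invented counts for illustration, not the study data.
m = diagnostic_metrics(tp=40, fp=10, fn=5, tn=45)
print({k: round(v, 3) for k, v in m.items()})
print(bootstrap_kappa_ci(40, 10, 5, 45))
```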

Figure 1

Flow chart for diagnostic algorithm. In clinical practice the diagnosis of COPD is based on multiple variables. As the simplest model we chose a three-parameter PFT approach in which an initial COPD YES/NO diagnosis based on FEV1/FVC levels was corrected if FEV1 and RV/TLC levels were altered counterintuitively*. * As thresholds for FEV1 and RV/TLC levels different cut-off levels were used and kappa statistics calculated for all alternatives. Each change in COPD diagnosis only materializes if both parameters deviate by ≤ 5/7.5/10/12.5/15/20% from 100% of the predicted value. Example: If deviations of 10% (from 100%) are chosen as thresholds for both FEV1 and RV/TLC (as % of predicted) in order to change the GOLD-COPD diagnosis from 1) 'yes' into 'no' (i.e., FEV1 ≥ 90% and RV/TLC ≤ 110%; [2) or vice versa, from 'no' into 'yes', FEV1 < 90% and RV/TLC > 110%]), then the number of misclassified patients (false positives + false negatives) is reduced from 69 to 33, and the κ statistic improves from 0.64 to 0.83. Abbreviations: as in table 1.
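The correction rule described in the Figure 1 legend can be sketched as follows. This is one reading of the worked example in the legend (10% deviation: 'yes' to 'no' if FEV1 ≥ 90% and RV/TLC ≤ 110%; 'no' to 'yes' if FEV1 < 90% and RV/TLC > 110%); the function and parameter names are hypothetical.

```python
def modified_diagnosis(initial_copd: bool,
                       fev1_pct_pred: float,
                       rv_tlc_pct_pred: float,
                       deviation_pct: float = 10.0) -> bool:
    """Correct an FEV1/FVC-based diagnosis using FEV1 and RV/TLC (% predicted).

    The initial diagnosis is only overturned when BOTH parameters point
    the other way; otherwise it is kept unchanged.
    """
    fev1_cut = 100.0 - deviation_pct      # e.g. 90% of predicted
    rv_tlc_cut = 100.0 + deviation_pct    # e.g. 110% of predicted
    if initial_copd and fev1_pct_pred >= fev1_cut and rv_tlc_pct_pred <= rv_tlc_cut:
        return False   # 'yes' corrected to 'no'
    if not initial_copd and fev1_pct_pred < fev1_cut and rv_tlc_pct_pred > rv_tlc_cut:
        return True    # 'no' corrected to 'yes'
    return initial_copd


# Illustrative values only.
print(modified_diagnosis(True, fev1_pct_pred=95, rv_tlc_pct_pred=105))   # False
print(modified_diagnosis(False, fev1_pct_pred=70, rv_tlc_pct_pred=125))  # True
```

Requiring both parameters to disagree with the initial diagnosis keeps the rule conservative: a single discordant value leaves the FEV1/FVC-based result in place.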

Prognostic analyses for the different outcomes (exacerbation, hospitalization, death) were performed with univariate Cox regression.

Missing data and statistical analyses

Very few values in the dataset were missing, with the exception of the diffusion capacity of carbon monoxide (DLCO), with 42 missing values. For residual volume (RV) and total lung capacity (TLC) there were five missing values. As deletion of subjects with missing values may lead to biased results, we imputed missing values using a regression method with the addition of a random error term [21].
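The idea of regression imputation with a random error term can be illustrated with a single predictor. This is a simplified sketch, not the procedure of the statistical package used in the study: the choice of FEV1 % predicted as the predictor for DLCO is an assumption made here for illustration, and the data are invented.

```python
import random
from statistics import fmean
from typing import List, Optional


def impute_with_noise(x: List[float], y: List[Optional[float]],
                      seed: int = 0) -> List[float]:
    """Fill missing y values with a linear prediction from x plus random noise.

    The noise is drawn from a normal distribution whose SD equals the
    residual SD of the regression fitted on the complete cases.
    """
    rng = random.Random(seed)
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mean_x = fmean(p[0] for p in pairs)
    mean_y = fmean(p[1] for p in pairs)
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in pairs]
    resid_sd = (sum(r * r for r in residuals) / (len(pairs) - 2)) ** 0.5
    return [yi if yi is not None
            else intercept + slope * xi + rng.gauss(0.0, resid_sd)
            for xi, yi in zip(x, y)]


# Invented values: FEV1 % predicted (complete) and DLCO % predicted (one missing).
fev1 = [50.0, 60.0, 70.0, 80.0, 90.0, 65.0]
dlco = [45.0, 58.0, 62.0, 77.0, 85.0, None]
print(impute_with_noise(fev1, dlco))
```

Adding the error term preserves the natural scatter of the variable; imputing the bare regression prediction would make the imputed values look more certain than they are.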

All statistical analyses were carried out using SPSS (PASW Statistics 18) and R for Windows (version 2.11.0).

Results

The mean age of patients was 73 (5.3) years, and 45% were female. The baseline characteristics of participants are shown in tables 1 and 2. Table 3 shows the differential performance of the GOLD approach and of three different LLN definitions with the expert panel diagnosis as the reference test.

Table 1 Baseline characteristics of patients with and without COPD according to the expert panel and the GOLD definition
Table 2 Pulmonary function test of patients with and without COPD according to the expert panel and the GOLD definition
Table 3 Diagnostic test performance of GOLD and LLN with the expert panel as the reference test

In our elderly cohort all regression equations used for the LLN definition had a lower FEV1/FVC threshold than the GOLD definition. Specificity was higher and sensitivity lower for LLN than for GOLD. When compared to the reference test, kappa statistics were higher for GOLD than for any of the three LLN definitions, although not all differences were statistically significant (table 3).

'Misdiagnosed' patients with the GOLD definition as compared to the expert panel

There was reasonable concordance between the diagnosis of COPD with the GOLD definition and the expert panel (κ = 0.64, 95% CI 0.57-0.71, table 3). Classification according to GOLD resulted in 33 false positive and 36 false negative diagnoses as compared to the expert panel diagnosis of COPD (table 4).

Table 4 Baseline characteristics of patients with a 'correct' and 'false' GOLD-COPD diagnosis according to the reference-test

In general, patients with a "true positive COPD diagnosis" tend to have RV/TLC values (far) above 100% of predicted, FEV1 values (far) below 100% of predicted, and DLCO values (far) below 80% of predicted, as compared to healthy individuals.

In our cohort, the median RV/TLC of patients with COPD according to the expert panel was high (median [quartiles]: 124 [112; 140] as % of predicted), median FEV1 low (67 [54;79] % of predicted), and also DLCO levels low (62 [49;76] % of predicted) (table 4).

Prognostic capacity of the COPD definitions

During a median follow-up of 4.5 (quartiles 3.9; 5.1) years, 148 patients experienced at least one COPD exacerbation (defined as a 7-10 day course of prednisolone), 67 patients were hospitalized for pulmonary reasons, and 60 patients died.

A COPD diagnosis according to the expert panel identified the largest number of patients that experienced any of the aforementioned events, followed by COPD according to GOLD and COPD-LLN (see table 5). The occurrence of outcomes related to the different classifications of COPD with percentages related to the classification of COPD is presented in table 6.

Table 5 Prognostic outcomes according to different COPD definitions within the whole cohort
Table 6 Prognostic outcomes according to different COPD definitions within each "COPD definition"

Hazard ratios of COPD yes versus no for the prognostic outcomes for any of the definitions are presented in table 7. With all COPD definitions, those with COPD had significantly worse prognostic outcomes as compared to those with 'no COPD'.

Table 7 Prognostic outcomes according to different COPD definitions in univariate Cox regression analysis

Pulmonary function test predictors of the diagnosis of COPD, using the expert panel as the reference

Of all PFT variables (besides FEV1/FVC), FEV1 % predicted, RV/TLC and DLCO % predicted performed best in predicting the expert diagnosis of COPD. The C-statistics of these variables using the expert panel diagnosis of COPD as the reference were 0.95 (95% CI 0.93-0.97), 0.85 (95% CI 0.81-0.89), and 0.77 (95% CI 0.73-0.82), respectively.
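The C-statistic equals the probability that a randomly chosen patient with COPD has a more "abnormal" value than a randomly chosen patient without COPD, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the function name and the data are illustrative assumptions, and for FEV1, where lower values indicate disease, the scores are negated first.

```python
from typing import Sequence


def c_statistic(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    scores_pos: marker values in patients with the reference diagnosis.
    scores_neg: marker values in patients without it.
    Higher scores are assumed to indicate disease.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count half
    return wins / (len(scores_pos) * len(scores_neg))


# Invented FEV1 % predicted values; negated because low FEV1 indicates COPD.
fev1_copd = [55, 67, 79]
fev1_no_copd = [85, 70, 95]
auc = c_statistic([-v for v in fev1_copd], [-v for v in fev1_no_copd])
print(round(auc, 3))  # 8 of 9 pairs are ordered correctly
```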

Addition of FEV1 and RV/TLC to GOLD-COPD and LLN-COPD

Addition of FEV1 and RV/TLC (both as % predicted) to the GOLD- or LLN-based definition improved diagnostic test performance significantly. The kappa statistic for the GOLD definition increased from 0.64 to 0.83 (p < 0.001) and the number of misdiagnoses decreased from 69 to 33 (highest kappa seen for 10% deviation; see Figure 1 for explanation of the algorithm). For the LLN definitions (Enright/Quanjer/Falaschetti) the kappa rose from 0.46/0.53/0.57 to 0.77/0.79/0.80 and the number of misdiagnoses decreased from 117/98/90 to 44/40/39 (highest kappa seen for all three LLN definitions for 5% deviation; see Figure 1 for explanation of the algorithm).

Discussion

In our study we show that false positive diagnosis of COPD occurred more often with the GOLD definition, while false negatives were more common with the LLN definitions as compared to an expert panel diagnosis as the reference test. Adding FEV1 and RV/TLC improved the GOLD and LLN approach, reducing misdiagnosed COPD by up to 50% depending on the cut-points applied. The expert panel diagnosis predicted best the occurrence of exacerbations of COPD, pulmonary hospitalizations, and all-cause mortality, followed by the GOLD and LLN definitions.

The choice of a fixed cut-off point for the GOLD-COPD definition was made for reasons of generalization and simplification [22]. Although FEV1/FVC ratios even lower than 0.7 can be expected in the elderly without a pathological correlate [7], a spirometric test result of > 0.7 does not necessarily exclude a diagnosis of COPD in these patients. Elderly patients in particular tend to incompletely empty their lungs during the FVC manoeuvre [23], resulting in a lower FVC value and thus an increased FEV1/FVC ratio, rendering false-negative COPD diagnoses more likely.

Multiple studies have already shown that fewer patients are diagnosed as COPD positive when LLN definitions are applied instead of GOLD, especially in the elderly (e.g., 36 vs. 15% in a healthy Dutch cohort of patients aged ≥ 50 years) [5, 7, 9, 24, 25]. The present study confirms the aforementioned differences in prevalence rates of COPD according to GOLD or LLN. Importantly, however, all previous studies contributing to the discussion of whether LLN or GOLD should be applied compared both methods without a reference test. Without a reference, it is impossible to determine which method performs better [11]. This lack of evidence and the resulting diagnostic uncertainties have not been adequately appreciated. Application of the LLN will increase the chance of classifying COPD patients as having no COPD and thus the risk of undertreatment, especially of elderly patients (Figure 2).

Figure 2

Changing the threshold of the FEV1/FVC ratio changes the number of misdiagnoses in both directions. Application of the LLN definition in elderly patients, which generally results in FEV1/FVC thresholds lower than 0.7, reduces the number of false positive (FP) diagnoses but consequently increases the number of false negative (FN) diagnoses.

The diagnosis of our expert panel was validated internally and externally. Re-evaluation of 80 cases (20%) in 2011 by the same panel as in 2001/2003 yielded a very good kappa of 0.90. External validation by a panel including a German pulmonologist and a Dutch GP with special interest in COPD yielded a somewhat lower kappa of 0.76, which can still be considered good agreement.

Despite a higher number of patients with mild COPD, the expert panel diagnosis of COPD was highly associated with COPD exacerbations, pulmonary hospitalizations and all-cause deaths, underlining the validity of the expert panel diagnosis.

As expected, an LLN-based diagnosis of COPD generated fewer false positives and more false negatives as compared to the conventional GOLD definition. The overall accuracy of LLN was similar to or worse than that of the conventional GOLD definition when compared to the expert panel diagnosis of COPD. Misclassifications occurred mainly in patients with GOLD stage I and II.

Typically, in diagnostic test research there is a trade-off between specificity and sensitivity (see Figure 2). For a balanced approach the consequences of false positive and false negative cases should be assessed. A false positive result in the case of COPD may lead to over-treatment and therefore avoidable expenses for the health system. Furthermore, the adverse effects of pulmonary medication might cause more harm than benefit to some patients [26-28]. In addition, a false positive diagnosis of COPD increases the risk that physician and patient remain unaware of other possible reasons for the complaints, such as cardiovascular diseases, notably heart failure [29].

The effect of a false negative diagnosis is undertreatment of patients with COPD at a point in time when they probably would benefit most (GOLD stages I and II). Table 5 summarizes the effect of classifying the presence or absence of 'COPD' according to the different methods. LLN tends to categorize elderly patients with mild obstruction as 'no COPD' (high specificity and low sensitivity). In absolute numbers, LLN identified fewer patients with clinically relevant prognostic events (COPD exacerbation, hospitalization, mortality) than the GOLD or panel definition. As a clinical consequence, fewer elderly patients would receive therapy targeted at reducing these events if the clinician applied LLN instead of the GOLD or panel diagnoses. Table 7 shows that the prognostic abilities of LLN, GOLD and panel were comparable, with clearly overlapping 95% confidence intervals of the hazard ratios. Early diagnosis and identification of false negatives may enable intervention strategies such as counseling for smoking cessation and exercise training when pulmonary compromise is still mild [30]. Initiation of pharmacotherapy can reduce symptoms, improve quality of life, and decrease the number of acute exacerbations [31, 32]. Guidelines therefore advocate early detection of airflow limitation [4].

A multiple-test approach incorporating bodyplethysmographic data seems a reasonable way to establish a more reliable diagnosis of COPD, although we have to consider that bodyplethysmography is costly, with an average price of 75 to 200 US dollars per test [33].

FEV1 is probably the most important determinant of obstruction, and RV/TLC is known to be highly and inversely correlated to FEV1% of predicted [33]. Normal values of FEV1 and RV/TLC in subjects with an FEV1/FVC ratio < 0.70 should motivate re-evaluation of a positive diagnosis of COPD based solely on the conventional GOLD criteria, an approach many pulmonologists apply in clinical practice. As an alternative, DLCO could be used instead of RV/TLC, although more missing and indecisive results with this method were seen in our analysis and might generally be expected (data not shown).

The National Institute for Clinical Excellence (NICE) acknowledged the importance of FEV1 as a distinct parameter to diagnose COPD, and defined airflow obstruction as present if both the FEV1/FVC ratio is < 0.7 and the FEV1 is < 80% of predicted, thus 'starting' the diagnosis of COPD from GOLD II onwards [34]. Application of the NICE definition (post-bronchodilator values) in our cohort as compared to the expert panel diagnosis of COPD led to very high specificity (99%; only two false positives) but low sensitivity (66%, 82 false negatives). Incorporating a second PFT parameter such as FEV1 into an FEV1/FVC-based definition might effectively reduce false positive test results; however, for correction of false negative results at least a three-parameter approach is needed.
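The two-parameter NICE rule described above is a conjunction of two thresholds, which explains why it can only remove positives relative to the fixed-ratio rule and never add them. A minimal sketch, with a hypothetical function name and illustrative values:

```python
def nice_airflow_obstruction(fev1_fvc_post: float,
                             fev1_pct_pred_post: float) -> bool:
    """Airflow obstruction as described in the text for NICE:
    FEV1/FVC < 0.7 AND FEV1 < 80% of predicted (post-bronchodilator)."""
    return fev1_fvc_post < 0.70 and fev1_pct_pred_post < 80.0


# A patient with ratio 0.66 and FEV1 85% of predicted (GOLD stage 1)
# is positive under the fixed-ratio rule but negative under this rule.
print(0.66 < 0.70)                          # fixed-ratio rule: True
print(nice_airflow_obstruction(0.66, 85))   # two-parameter rule: False
```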

Certain limitations need to be considered in the interpretation of our findings. First, PFT was only performed once at baseline, and secular trends could have been missed. Second, graphical PFT results such as the flow-volume curve or the flow-pressure curve also informed the diagnosis of the expert panel, but we could not quantify how much these graphs added to the final decision of the panel.

Another limitation of our study is incorporation bias when assessing the added value of other PFT variables to improve the diagnosis of GOLD or LLN [35]. PFT parameters play an important role in the diagnosis of the expert panel. Thus, overoptimism regarding the diagnostic performance of PFT variables such as FEV1 and RV/TLC should be considered. Robust external validation and accurate cut-off calculations are still needed before the proposed algorithm of including FEV1 and RV/TLC in the GOLD or LLN definition may be adopted in routine practice. Our intention, however, was not to create a new definition of COPD, but to raise awareness of some of the shortcomings of the single fixed FEV1/FVC cut-off value of 0.7 and the age-adjusted LLN definitions.

In conclusion, both the conventional GOLD criteria and some of the most frequently used LLN-based diagnoses of COPD share major shortcomings as compared to the expert panel diagnosis of COPD. While the GOLD definition tends to overdiagnose COPD, LLN-based definitions tend to underdiagnose COPD in symptomatic patients. Adding the information on FEV1 and RV/TLC to the GOLD or LLN definition reduced the number of misdiagnoses substantially for either definition. Further studies are needed to explore the usefulness of 'an upgraded' GOLD or LLN diagnosis with determination of the optimal cut-off values for RV/TLC and DLCO.