Background

Idiopathic pulmonary fibrosis (IPF) is a rare disease with a median survival of 3–5 years after diagnosis [1]. Current treatment options such as pirfenidone and nintedanib are still limited with respect to prolonging life [2]. Mortality alone does not appear to be a sufficient clinical endpoint regarding patients’ outcomes [1, 3,4,5]. Thus, health-related quality of life (HRQL) as a patient-reported outcome is gaining relevance [6]. Existing HRQL instruments are not yet sufficiently validated as clinically meaningful endpoints in IPF [7,8,9]. At the same time, the use of validated HRQL instruments is strongly recommended for marketing-authorisation applications of novel treatments [10, 11].

The Short Form 36 Health Status Questionnaire (SF-36) is a generic instrument [12] which is frequently used as a secondary endpoint in clinical trials in IPF [13,14,15]. Generic HRQL instruments are designed to measure overall health states and allow comparisons across patients with different diseases and the general population. Evaluating the validity of these generic instruments in specific diseases is indispensable and is also needed for the SF-36 in IPF [9]. Currently, two studies provide psychometric characteristics of the SF-36 in IPF based on longitudinal data [16, 17]. To our knowledge, only these studies analysed whether the SF-36 can detect changes or stability of HRQL over time, which is essential for an endpoint in clinical trials. Tomioka et al. used observational data of a single outpatient centre in Japan [16]. The analysis of Swigris et al. was based on international multicentre data, which were part of the randomised clinical trial BUILD-1. Thus, the study population was subject to numerous inclusion and exclusion criteria [17, 18]. Hence, the external validity of the results of both studies might be reduced. Belkin et al. proposed that additional research should take place before a broad implementation of the SF-36 [8]. Moreover, only Swigris et al. provide disease-specific minimally important differences (MID), which are required to evaluate changes in HRQL over time [17, 19]. Therefore, patients would benefit from further longitudinal analysis based on multicentre data in a real-world setting.

The aim of this study was (1) to assess the psychometric characteristics of the SF-36 in IPF (acceptance and feasibility; discrimination ability; construct and criterion validity, and internal consistency; responsiveness and test-retest-reliability) and (2) to evaluate disease-specific MIDs, using data from a comprehensive European registry, which provides real-world data from patients in different disease stages and with different ethnic backgrounds.

Materials and methods

Data and participants

The data source was the European IPF Registry (eurIPFreg), one of Europe’s leading IPF longitudinal databases with nine participating countries and eleven study centres [20]. Both eurIPFreg and eurIPFbank (the biobank of eurIPFreg) have been reviewed and received positive votes from institutional review boards in Germany (e.g. Ethics Committee of Justus-Liebig-University of Giessen; 111/08), France, Italy, Austria, Spain, Czech Republic, Hungary and the UK. The research was conducted strictly according to the principles of the Declaration of Helsinki. The eurIPFreg and eurIPFbank are listed in ClinicalTrials.gov (NCT02951416). Patients have been included in the registry since November 2009. The datasets generated and investigated during the current study are not publicly available due to registry regulations, but are available from the corresponding author on reasonable request and with the agreement of the Principal Investigators of the eurIPFreg.

Patients’ data were collected by standardised questionnaires for physicians and patients at baseline and follow-up visits with intervals of three to six months, considering individual necessity and practical issues. Interim documentation in case of unscheduled visits was possible. The collected data were comprehensive and included, besides clinical measurements and demographic data, patient-self-reported instruments [21].

The study population comprised incident and prevalent IPF patients. The exclusion criteria were: missing information on sex or age, absence of an IPF diagnosis validated by a multidisciplinary team, missing lung function test at baseline, and absent or incomplete information on SF-36 items (more than 50% missing values within each dimension) [22]. If the date of questionnaire completion or of a medical examination was missing, we used the predefined follow-up date.

HRQL instrument

The SF-36 version 2 was used [22]. It contains 36 items categorised into 8 dimensions (vitality (VITAL), physical functioning (PFI), bodily pain (PAIN), general health perceptions (GHP), physical role functioning (ROLPH), emotional role functioning (ROLEM), social role functioning (SOCIAL), mental health (MHI)) and a physical as well as a mental component score (PCS and MCS), which can be calculated for individuals providing all dimensions. The dimensions range from zero to 100; higher values imply higher functional health and well-being. The PCS and MCS are standardised to a normal distribution (mean of 50, standard deviation (SD) of 10), with higher values indicating better functional health and well-being. Scores were calculated based on the German scoring system to provide comparability, since the majority of the patients considered were German [23].
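The scaling described above can be illustrated with a minimal Python sketch. It is not the scoring algorithm of the SF-36 manual or the code used in this study: the function names, the assumption that items are already recoded so that higher values mean better health, and the norm mean and SD passed to `t_score` are all hypothetical. In particular, the actual PCS and MCS are weighted combinations of all eight dimensions; only the final linear standardisation to mean 50 and SD 10 is shown here.

```python
from statistics import mean
from typing import Optional, Sequence


def dimension_score(items: Sequence[Optional[float]],
                    lo: float, hi: float) -> Optional[float]:
    """Scale the mean of the answered items of one dimension to 0-100.

    `items` holds recoded item values (higher = better, assumed);
    `lo`/`hi` are the lowest and highest possible item values.
    Returns None if more than 50% of the items are missing.
    """
    answered = [x for x in items if x is not None]
    if len(answered) * 2 < len(items):  # more than half missing
        return None
    return (mean(answered) - lo) / (hi - lo) * 100.0


def t_score(value: float, norm_mean: float, norm_sd: float) -> float:
    """Standardise a value to mean 50 and SD 10 relative to a norm population."""
    return 50.0 + 10.0 * (value - norm_mean) / norm_sd
```

For instance, a dimension with items `[1, 2, 3, None]` on a 1 to 3 scale yields a score of 50, whereas `[1, None, None]` yields no score because more than half of the items are missing.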

Anchors

For purposes of examining the validity of the SF-36 in IPF, we used the following anchors at baseline and follow-up: 6-min walking distance (6MWD) [24,25,26]; percent of the predicted value of forced vital capacity (FVC % pred), based on Global Lung Function Initiative (GLI) equations; percent of the predicted value of carbon monoxide diffusion capacity of the lung (DLCO % pred), corrected for haemoglobin or, if not available, uncorrected; modified New York Heart Association Classification (NYHA) grade, evaluated by the physician (I-IV, the higher the more impaired) [27]; Baseline Dyspnoea Index (BDI) (scale 0–12, the lower the more impaired) (baseline only) and Transitional Dyspnoea Index (TDI) (scale −9 to 9, the lower the more impaired) (follow-up only) [28]; long-term oxygen therapy (LTOT) (baseline only); Modified Medical Research Council (mMRC) Dyspnea Scale (1–5, the higher the more impaired) (baseline only) [29]; and an item of the SF-36 which indicates perceived change in health during the previous year (follow-ups only). This SF-36 item is not included in any of the dimensions or component scores [12, 22].

Cross-sectional analysis

The SF-36 was not completed at the first visit in all cases. Therefore, in this study we defined baseline as the date of the first completed SF-36. Additionally, not all examinations were performed at each visit, so we accepted anchors within a time frame of plus/minus 45 days around the first completed SF-36. This time frame was chosen for two reasons: the SF-36 refers to the health status of the last 4 weeks, and dates were frequently given only as month/year, in which case we set the day to the 15th. We therefore used 45 days as the maximum interval between anchors and SF-36.
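The date handling described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the registry's actual processing: the function names are hypothetical, and the window is treated as inclusive (an anchor exactly 45 days away is accepted), which the text does not specify.

```python
from datetime import date
from typing import Optional

WINDOW_DAYS = 45  # maximum distance between anchor and SF-36


def impute_date(year: int, month: int, day: Optional[int] = None) -> date:
    """Build a date; if only month/year are known, set the day to the 15th."""
    return date(year, month, day if day is not None else 15)


def within_window(sf36_date: date, anchor_date: date,
                  window: int = WINDOW_DAYS) -> bool:
    """True if the anchor lies within plus/minus `window` days of the SF-36."""
    return abs((anchor_date - sf36_date).days) <= window
```

With these helpers, an SF-36 dated only "March 2012" is placed on 15 March 2012, and an anchor measured on 29 April 2012 (45 days later) is still accepted, while one from 30 April 2012 is not.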

Acceptance and feasibility

To assess acceptance and feasibility we examined the frequency of missing responses to items. As there might be some differences in specific populations, we searched for a possible influence of age, gender and severity of disease (estimated by DLCO % pred, FVC % pred, 6MWD) on the frequency of missing items via Pearson and Spearman correlation for metric and categorical variables, respectively.
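As an illustration of this step, the following Python sketch counts missing items per respondent and computes Pearson and Spearman correlation coefficients using only the standard library. It is not the SAS code used in this study; the function names are hypothetical, missing answers are assumed to be coded as `None`, and ties in the Spearman correlation are handled with average ranks, which is an assumption. Significance testing is omitted.

```python
from math import sqrt


def missing_count(answers):
    """Number of unanswered items (coded as None) for one respondent."""
    return sum(a is None for a in answers)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In this sketch, `pearson` would be applied to pairs such as age and number of missing items, and `spearman` to ordinal or categorical variables.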

Discrimination ability

Ceiling and floor effects in single items were examined as a possible indicator of an insufficient discrimination ability.
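A simple way to screen items for floor and ceiling effects is to compute the share of responses in the worst and best answer categories. The sketch below is illustrative only: the function names are hypothetical, and the 60% threshold is an assumption borrowed from the way skewed items are reported in the Results, not a predefined criterion of this study.

```python
from collections import Counter


def category_shares(responses, categories):
    """Share of valid responses per answer category (None = missing)."""
    counts = Counter(r for r in responses if r is not None)
    n = sum(counts[c] for c in categories)
    return {c: (counts[c] / n if n else 0.0) for c in categories}


def floor_ceiling(responses, categories, threshold=0.60):
    """Flag floor/ceiling effects for one item.

    `categories` must be ordered from worst to best answer.
    """
    shares = category_shares(responses, categories)
    return {
        "floor": shares[categories[0]] > threshold,
        "ceiling": shares[categories[-1]] > threshold,
    }
```

For example, an item answered `[1, 1, 1, 1, 2, 3, None]` on a 1 (worst) to 3 (best) scale has four of six valid responses in the worst category and would be flagged for a floor effect.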

Construct and criterion validity, and internal consistency

The construct validity of the domains and summary measures was checked by comparing individuals with and without LTOT via the Wilcoxon-Mann-Whitney test, to account for possible non-normal distributions. We assumed that individuals with LTOT have a lower HRQL than individuals without [30].

The criterion validity of the domains and summary measures was evaluated via Pearson correlation in case of metric anchors and Spearman correlation in case of ordinal anchors. A better health status, and thus better values of the anchors, should be associated with higher HRQL and vice versa. The strength of correlation was categorised according to Cohen as large (greater than 0.5), moderate (0.3–0.5), small (0.1–0.3), and trivial (less than 0.1) [31]. Internal consistency was assessed with Cronbach’s alpha for the domains and summary scores of the SF-36.
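The categorisation of correlation strength and the computation of Cronbach’s alpha can be sketched as follows. This is a hedged illustration, not the code used in this study: the function names are hypothetical, the assignment of the boundary values 0.1, 0.3 and 0.5 to a category is an assumption (the cited ranges share their end points), and the alpha function expects complete data without missing values.

```python
from statistics import variance


def cohen_strength(r: float) -> str:
    """Categorise the absolute correlation according to Cohen's thresholds.

    Boundary handling is an assumption: 0.3 and 0.5 count as 'moderate',
    0.1 counts as 'small'.
    """
    a = abs(r)
    if a > 0.5:
        return "large"
    if a >= 0.3:
        return "moderate"
    if a >= 0.1:
        return "small"
    return "trivial"


def cronbach_alpha(items):
    """Cronbach's alpha.

    `items` is a list of item columns; each column lists the scores of the
    same respondents in the same order (no missing values).
    """
    k = len(items)
    totals = [sum(vals) for vals in zip(*items)]  # sum score per respondent
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))
```

Two perfectly parallel items, e.g. `[[1, 2, 3, 4], [1, 2, 3, 4]]`, give an alpha of 1, the upper bound of the coefficient.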

Longitudinal analysis

Given the flexible intervals between visits, the time frame between baseline and follow-up could not be defined a priori. As the SF-36 evaluates HRQL over the last four weeks, the interval between baseline and follow-up needed to be at least 28 days. The exception was the SF-36 change item, which has a time horizon of one year; for this item we considered only follow-ups with an interval of 300 to 450 days.

Consistent with the baseline procedure, the follow-up anchors were selected within a time frame of plus/minus 45 days around a completed SF-36 form. For this purpose, we used a stepwise approach to find the nearest anchor around the SF-36 measurement and excluded matched anchors before starting the next search, so that an anchor examination was never used for two SF-36 measurements. The number of follow-up visits with documented HRQL and anchors varied and could be more than one. To improve the power of these analyses, we used the first and last observation per anchor and individual, provided their health status (improved vs. baseline, deteriorated vs. baseline, same as baseline) differed between these two observations. For example, if the health status was initially stable but deteriorated afterwards, we used both events in different groups and therefore in different analyses. Considering an individual twice in one group (e.g. deterioration) would have led to bias; in this case, we considered only the last measurement of the respective anchor. For TDI we used only one observation, taken within plus/minus 45 days of a completed SF-36; this SF-36 was compared with the preceding SF-36, as the TDI measures the change between two visits.
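The stepwise matching of anchors to SF-36 measurements can be illustrated by a greedy nearest-neighbour search in which each matched anchor is removed before the next search. The sketch below is one possible reading of the procedure, not the implementation used for the registry: the function name is hypothetical, and processing the SF-36 dates in chronological order as well as preferring the earlier anchor in case of ties are assumptions.

```python
from datetime import date


def match_anchors(sf36_dates, anchor_dates, window=45):
    """Match each SF-36 date to its nearest unused anchor within the window.

    Returns a dict {sf36_date: anchor_date}. Each anchor examination is used
    at most once; SF-36 dates without a suitable anchor are left out.
    """
    available = sorted(anchor_dates)
    matches = {}
    for sf in sorted(sf36_dates):  # chronological order (assumption)
        candidates = [a for a in available if abs((a - sf).days) <= window]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda a: abs((a - sf).days))
        matches[sf] = nearest
        available.remove(nearest)  # exclude before the next search
    return matches
```

If, for example, SF-36 forms were completed on 15 January and 20 February 2012 and the only nearby anchor dates from 1 February 2012, the anchor is assigned to the January form (17 days apart) and the February form remains unmatched.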

Responsiveness and test-retest-reliability

For assessing responsiveness and test-retest-reliability, the individuals were categorised depending on whether their health status, and thus their anchors, changed during the follow-up or not. We defined changes of at least the MID of the anchor as improvement or deterioration, respectively. If the shift from baseline to follow-up was less than the MID, we defined the anchor as unchanged. We defined the following MIDs for the changes of the anchors: 6MWD ≥30 m [32,33,34], FVC % pred ≥10%, and DLCO % pred ≥15% [35], TDI =1 [28, 36], modified NYHA score ≥ 1 [37]. If the anchor is stable, there should not be a significant difference in the SF-36 between baseline and follow-up (test-retest-reliability). The responsiveness was tested by comparing baseline and follow-up values of the SF-36 for improved and deteriorated anchors separately. A relevant change of the anchors should be accompanied by a significant shift of HRQL. We used the Wilcoxon signed-rank test to account for possible non-normal distribution of differences and for the possibly small numbers of observations within the anchors per group.
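The grouping by anchor change can be sketched as a small classification function. This is an illustration only: the names are hypothetical, the MIDs of FVC % pred and DLCO % pred are interpreted as absolute percentage points (an assumption, as the text does not state whether absolute or relative changes are meant), and only anchors with a simple numeric difference are included.

```python
# Anchor MIDs from the text and direction of the scale
# (True = higher values indicate better health).
ANCHORS = {
    "6MWD": (30.0, True),         # metres
    "FVC % pred": (10.0, True),   # percentage points (assumption)
    "DLCO % pred": (15.0, True),  # percentage points (assumption)
    "NYHA": (1.0, False),         # grade, higher = more impaired
}


def classify_change(anchor: str, baseline: float, follow_up: float) -> str:
    """Classify an anchor change as improved, deteriorated or unchanged."""
    mid, higher_is_better = ANCHORS[anchor]
    delta = follow_up - baseline
    if not higher_is_better:
        delta = -delta  # make positive values mean improvement
    if delta >= mid:
        return "improved"
    if delta <= -mid:
        return "deteriorated"
    return "unchanged"
```

A gain of 30 m in the 6MWD is thus classified as improved, a decline of 5 percentage points in FVC % pred as unchanged, and a shift from NYHA grade II to III as deteriorated.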

Minimal important difference (MID)

The MIDs of the summary scores and the dimensions were estimated anchor- and distribution-based. To obtain distribution-based MIDs we used half standard deviation (SD) of baseline values of normally distributed domains [38, 39]. Normality was evaluated by visual inspection [38, 39].

For anchor-based MIDs, only anchors with a correlation ≥0.3 at baseline were considered, to ensure a sufficient relationship [31, 39]. MIDs were estimated via linking, which is unaffected by the degree of correlation [40]. To this end, the MID of the anchor was multiplied by the quotient of the baseline SD of the HRQL domain and the baseline SD of the anchor.

$$ {MID}_{HRQL}={MID}_{anchor}\times \left({SD}_{HRQL}/{SD}_{anchor}\right) $$

As only metric anchors provide a meaningful SD, categorical anchors needed to be excluded and only the following metric anchors were used: 6MWD, FVC % pred, and DLCO % pred. The mean of distribution- and anchor-based MIDs (if the domain was normally distributed and the anchor correlated significantly with r ≥ 0.3) was calculated to provide an overall estimate of the specific MID. Additionally, the mean of the distribution-based MID and the MID of the anchor with the highest correlation was provided.
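The two MID estimates and their mean can be written down in a few lines. The following Python sketch mirrors the half-SD rule and the linking formula given above; the function names and the example values are hypothetical, and the checks for normal distribution and for a correlation of at least 0.3 are assumed to have been done beforehand.

```python
from statistics import mean, stdev


def distribution_mid(hrql_baseline):
    """Distribution-based MID: half the SD of the baseline HRQL values."""
    return 0.5 * stdev(hrql_baseline)


def anchor_mid(mid_anchor, hrql_baseline, anchor_baseline):
    """Anchor-based MID via linking: MID_anchor * (SD_HRQL / SD_anchor)."""
    return mid_anchor * stdev(hrql_baseline) / stdev(anchor_baseline)


def overall_mid(dist_mid, anchor_mids):
    """Mean of the distribution-based MID and the eligible anchor-based MIDs."""
    return mean([dist_mid, *anchor_mids])
```

With invented baseline values of a domain, `[30, 40, 50, 60, 70]`, and of the 6MWD, `[300, 350, 400, 450, 500]`, the SDs are about 15.8 and 79.1. The distribution-based MID is then about 7.9, the anchor-based MID is 30 × (15.8 / 79.1) = 6.0, and their mean is about 7.0.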

Sensitivity analysis

To detect possible bias we tested a possible influence of study sites on HRQL, adjusted for age, gender, DLCO % pred, FVC % pred and 6MWD.

All statistical analyses were performed using SAS software (version 9.3, ©2002–2010 by SAS Institute Inc., Cary, NC, USA).

Results

Cross-sectional analysis

Out of 528 IPF patients, we excluded 139 patients who had no SF-36 and one individual who had answered only one question. From the resulting 388 patients we excluded three individuals without information on gender and six individuals without date of birth. For 121 of the remaining 379 individuals, there was no FVC measurement around the first SF-36. This does not mean that no FVC measurement existed at all, only that none fell within 45 days of the first SF-36. The study population thus included 258 individuals (73.3% male) with a mean age of 67.3 years (SD 10.7) and on average 2.6 years since first diagnosis (SD 2.8). Despite the tolerance of plus/minus 45 days between SF-36 and anchor, it was not possible to provide all anchors for each patient. HRQL as expressed by MCS and PCS was considerably reduced compared with norm values (mean 45.3, SD 11.8 and mean 34.6, SD 10.5 versus mean 50.0, SD 10.0) (Table 1). Except for ROLEM and ROLPH, all HRQL measures were normally distributed based on visual inspection.

Table 1 Baseline characteristics

Acceptance and feasibility

Regarding single items, 75.2% (194 individuals) had no missing item in the SF-36, 21.3% (n = 55) had one to ten and 3.5% (n = 9) had eleven to 28 missing items. The number of missing items correlated significantly with age (r = 0.13, p = 0.03). Gender and severity of disease had no significant influence. A graphic representation at item level can be found in Additional file 1: Figure S1. Within the dimensions, the percentage of completely answered items ranged from 93.0% (ROLEM) to 95.7% (PAIN) (Table 2).

Table 2 Missing items within the dimensions

Discrimination ability

The distributions of several items were skewed; for six items, more than 60% of responses fell into the worst answer category: ROLPH 1–4 (67.9, 74.3, 69.1 and 69.1%) and PFI 1 (78.9%) and 4 (65.6%). Almost half of the study population rejected (answer: ‘definitely false’) the statement that their ‘health is excellent’ (45.8%, item 5 of GHP, possible answers: definitely true; mostly true; don’t know; mostly false; definitely false) (Additional file 2: Figure S2).

Construct and criterion validity, and internal consistency

PCS correlated significantly and moderately with several anchors, whereas MCS did not correlate with any anchor with r ≥ 0.3. ROLEM, MHI and PAIN did not reach moderate or high correlations either. Other dimensions correlated significantly with particular anchors on a moderate to high level (Table 3). The tests showed significantly lower HRQL in individuals with LTOT except for MCS, MHI, and PAIN (Table 4). Cronbach’s alpha ranged from 0.85 (SOCIAL) to 0.87 (ROLEM); MCS and PCS showed good internal consistency as well (0.86 both).

Table 3 Criterion validity analysed via correlation coefficients
Table 4 Construct validity: mean difference of QOL between patients without and with long-term oxygen therapy; significant differences of QOL confirm construct validity

Longitudinal analysis

SF-36 follow-up data were available for 161 individuals, almost half of whom (78, 48.5%) had up to four further documentations of HRQL; the maximum number of completed SF-36 questionnaires was 10. The mean time between baseline and all considered follow-ups was 1.3 years (SD 0.88, range 0.1–5.0 years). The number of considered matches of anchors and HRQL (n = 591) was higher than the number of individuals within the follow-up study population, as different visits per patient needed to be considered to provide as many temporally matching documented anchors and completed SF-36 questionnaires per individual as possible. Moreover, we included individuals twice, with their first and last observation per anchor, if their health status with respect to the anchor varied.

Test-retest-reliability and responsiveness

Analyses for test-retest-reliability did not show significant differences of HRQL except for SOCIAL and the anchor FVC % pred (Table 5). Individuals with relevant changes of the health status based on the anchors had significant changes in all SF-36 dimensions and summary scales except for PAIN (responsiveness) (Table 6).

Table 5 Test-retest-reliability: mean change of QOL in stable health status in anchor; non-significant changes of QOL confirm test-retest-reliability
Table 6 Responsiveness: mean change of QOL in changed health status in anchor; significant changes of QOL confirm responsiveness

Minimal important difference (MID)

A normal distribution could not be assumed for ROLEM and ROLPH, so valid distribution-based MIDs could not be provided for these two dimensions. As we considered only anchors with a correlation of at least 0.3 and none of the anchors correlated sufficiently with MCS, ROLEM, GHP, MHI and PAIN, it was not possible to provide any anchor-based MIDs for them. Combining the criteria of normal distribution and an at least moderate correlation, it was not possible to calculate an MID for ROLEM. The overall mean MIDs of PCS and MCS were five and six, respectively. Mean MIDs of the dimensions ranged from seven to 21 based on anchors correlating with r ≥ 0.3 and estimated MIDs of normally distributed domains and summary scores. Taking only distribution-based values and the MID of the anchor with the highest correlation, the mean MIDs ranged from seven to 14 (Table 7).

Table 7 Minimal important differences (MID)

Sensitivity analysis

The patients of the study sites varied in HRQL, disease severity, age and gender. After adjusting for age, gender, DLCO % pred, FVC % pred and 6MWD, no influence of study site on HRQL was detectable.

Discussion

The SF-36 seems to provide adequate psychometric properties to assess HRQL in an IPF cohort. Our analysis demonstrated an increased number of missing items in older patients [41]. It is well known that the number of missing items is higher in older populations [42, 43]. In particular, items containing the wording ‘work or other regular daily activity’ (dimensions ROLEM and ROLPH) led to a higher number of missing values in our study as well as in the studies of Hayes et al. and Mallinson [42, 43].

A possible reason could be a misunderstanding of the wording ‘work or other regular daily activity’, as most of the older participants were probably retired or not able to hold down a regular job [42]. As 75.2% of participants in our study completed the questionnaire without any missing values, we assume that the higher age of most patients suffering from IPF is not necessarily a limiting factor.

As expected in a severe disease such as IPF, there was a floor effect of the items regarding limitations in ‘vigorous activities’ and ‘climbing several flights of stairs’ (dimension PFI) as well as the statement ‘my health is excellent’ (dimension GHP). As the dimension PFI contains ten items and considers different levels of activities, the floor effect of two items may be acceptable. Surprisingly, 4.4 and 7.9% of our study population reported having no limitations at all in these two physical activity categories and 1.6% rated their health as excellent.

Construct validity was also given. However, the dimensions MHI and PAIN and the MCS were not significantly reduced in individuals receiving LTOT. This might be caused by a positive influence of LTOT on well-being in some IPF patients. Regarding the criterion validity, it needs to be mentioned that the correlation of the anchors with the MCS was lower than their correlation with the PCS, which was also found in other studies [17, 44, 45]. Furthermore, the influence of dyspnea and physical activity measured via mMRC, BDI, NYHA, and 6MWD on HRQL was higher than the influence of clinical parameters such as vital and diffusion capacity. Other studies showed similar results, with varying interpretation of the relevance of the correlation between pulmonary function and HRQL [16, 46,47,48,49].

Longitudinal analysis indicated sufficient psychometric properties, although the small number of observations limited the validity. Additionally, MIDs could not be estimated in all cases due to insufficient correlation with anchors or non-normal distribution. Where the assumptions were met, the mean MIDs were higher than those of Swigris et al. (this study: range 5–21; Swigris et al.: range 2–4). Considering only the anchor with the highest correlation, the mean MIDs decreased and approached the MIDs of Swigris et al. The authors of the latter study used different methods and only two anchors [17]. Additionally, the strength of correlations and distribution patterns were not considered in providing MIDs. The different methods in combination with the highly selected study sample of the BUILD-1 trial may explain the differences in our results.

The strength of this study lies in the international multicentre population of IPF patients of all ages and disease stages without strict inclusion and exclusion criteria, which provides a ‘real life’ setting and transferable results. We investigated a potential influence of the study sites and countries on HRQL. After adjusting for age, gender, DLCO % pred, FVC % pred and 6MWD there was no correlation with HRQL. The number of incorrect diagnoses should be negligible as the diagnosis was based on multidisciplinary discussion and on ATS/ERS/JRS/ALAT guideline criteria [4, 50]. To consider clinical and patient-centred values, we used objective anchors such as lung function values (FVC % pred, DLCO % pred) and need of supplemental oxygen (LTOT), as well as subjective parameters such as dyspnea scores (self-reported by patients (mMRC, BDI/TDI) and rated by the physician (NYHA)) and a measure of physical functioning (6MWD). The MID was estimated based on anchors as well as on distribution, as widely recommended [51, 52].

Our study has several limitations. First of all, the follow-up intervals varied and only 62.6% of the study population had at least one follow-up SF-36. Additionally, in some cases the date of examination and visit was missing and the scheduled visit date was used as proxy instead. For example, in 19 of 364 analysed baseline and follow up SF-36 questionnaires the date needed to be approximated. The share of missing values of single items still met regulatory requirements. Some analyses were based on a small number of observations.

Conclusion

The SF-36 appears to be a valid instrument to measure HRQL in IPF and can therefore be used in RCTs or for individual monitoring of this disease. Nevertheless, further evaluation of longitudinal aspects and MIDs is recommended. Our findings have a great potential impact on the evaluation of IPF patients in clinical trials as well as on individual disease monitoring.