Introduction

Food allergy affects almost 4% of the general population in westernized countries [1], and it is the primary cause of anaphylaxis presenting to emergency departments [2]. The only proven therapy is careful avoidance of the causal food(s) and provision of medication for emergency treatment [3]. Consequently, patients often fear an allergic reaction and are continuously faced with dietary and social restrictions in their daily lives, which can have a negative impact on quality of life [4–11].

To measure Health-Related Quality of Life (HRQL), disease-specific questionnaires are significantly more sensitive than generic ones, and they are important for estimating the general burden of food allergy as well as measuring the response to interventions or future treatments. However, generic HRQL instruments allow comparison of the burden of disease between patient populations with different diseases [12]. Recently, as part of the EuroPrevall project, the first self-administered HRQL questionnaires specific for food allergy have been developed and validated: the Food Allergy Quality of Life Questionnaire-Child Form, -Teenager Form and -Adult Form (FAQLQ-CF, -TF, -AF). The FAQLQs showed good validity, internal consistency and discriminative abilities [13–16], but test-retest reliability was not extensively investigated.

Reliability measures are important to ensure that what the questionnaire is measuring is dependable and repeatable [12] and that it allows sample sizes to be determined for clinical trials [17]. The aim of this study was therefore to assess the test-retest reliability of the self-administered FAQLQ-CF, -TF and -AF.

Methods

Patients

We contacted Dutch children (8–12 years), adolescents (13–17 years) and adults (≥18 years) with food allergy, who were recruited from our clinic or by advertisement. We included patients with the most prevalent food allergies.

Questionnaires

The FAQLQ-CF contains 24 items and 4 domains, the FAQLQ-TF contains 23 items and 3 domains, and the FAQLQ-AF contains 29 items and 4 domains [13–15]. The total FAQLQ score is the sum of all the items divided by the number of items and ranges from 1 (minimal impairment in HRQL) to 7 (maximal impairment in HRQL) [18, 19].
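The scoring rule above can be sketched as follows. This is a minimal illustration, assuming every item is answered on the 1 to 7 scale; the function name `faqlq_total_score` and the example responses are hypothetical, and the handling of missing items is not described in the text and is not modelled here.

```python
def faqlq_total_score(item_scores):
    """Total score: sum of the item scores divided by the number of items.

    Assumes each item is scored from 1 (minimal impairment) to 7
    (maximal impairment) and that no item is missing.
    """
    if not item_scores:
        raise ValueError("at least one item score is required")
    if any(not 1 <= score <= 7 for score in item_scores):
        raise ValueError("item scores must lie between 1 and 7")
    return sum(item_scores) / len(item_scores)


# Hypothetical responses to a 23-item form (the item count of the FAQLQ-TF)
example_items = [4] * 10 + [5] * 8 + [2] * 5
total = faqlq_total_score(example_items)  # 90 / 23, about 3.91
```

Because the total is an average over items, it stays on the same 1 to 7 scale for all three forms despite their different item counts.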

Procedures

We sent the FAQLQs by mail to be completed at home. Regarding the FAQLQ-CF, parents were instructed that they were allowed to explain a question when needed, but they were not allowed to tell the child which answer to give. All patients who completed the first questionnaires (test) received the second questionnaires (re-test) 10–14 days after completion of the first. Patients who did not respond in time were excluded from the study [20, 21] as well as patients who reported a clinically important change in disease between the measurements or within 2 months before the study. We defined a clinically important change in disease that could influence HRQL as a food allergic reaction of grade 3 or 4 according to the Mueller classification [22]. The study was approved by the local medical ethics review commission (METc 2005/051).

Statistical analysis

Data were analysed using SPSS software for Windows (version 14.0). To investigate test-retest reliability of the FAQLQs, we used the intraclass correlation coefficient (ICC), using a one-way ANOVA [20, 21, 23]. Values should be above 0.70 for group comparison studies and above 0.90–0.95 for individual measurements over time [24].
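For readers who want to see the calculation, a one-way ICC for two measurements per patient can be sketched as below. This is not the SPSS procedure used in the study; it is a minimal sketch of the standard one-way random-effects formula, ICC = (MSB − MSW) / (MSB + (k − 1) MSW), and the function name `icc_oneway` and the example scores are hypothetical.

```python
def icc_oneway(test, retest):
    """ICC from a one-way random-effects ANOVA with k = 2 measurements per patient."""
    if len(test) != len(retest) or len(test) < 2:
        raise ValueError("need paired scores for at least two patients")
    n, k = len(test), 2
    grand_mean = (sum(test) + sum(retest)) / (n * k)
    patient_means = [(a + b) / k for a, b in zip(test, retest)]

    # Between-patient mean square, with n - 1 degrees of freedom
    ms_between = k * sum((m - grand_mean) ** 2 for m in patient_means) / (n - 1)

    # Within-patient mean square, with n * (k - 1) degrees of freedom
    ss_within = sum((a - m) ** 2 + (b - m) ** 2
                    for a, b, m in zip(test, retest, patient_means))
    ms_within = ss_within / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical total scores for five patients (not study data)
first_scores = [2.0, 3.5, 4.0, 5.5, 6.0]
second_scores = [2.5, 3.0, 4.5, 5.0, 6.5]
icc = icc_oneway(first_scores, second_scores)  # 4.875 / 5.125, about 0.95
```

In this form the two measurements of a patient are treated as interchangeable, so swapping test and retest values within a patient leaves the result unchanged.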

As a second measure of test-retest reliability, we calculated Lin’s concordance correlation coefficient (CCC). The different components of the CCC [Pearson correlation coefficient (measure of precision), location shift and scale shift (measures of accuracy)] were calculated. We plotted the first measurement against the second measurement, and we used major axis analyses to calculate the best fitting line [25].
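The CCC and its components can be illustrated with the sketch below. It assumes Lin's usual estimator, with variances and covariance computed using a 1/n divisor, and the usual closed-form major axis slope; the function names, the choice of which measurement goes on which axis, and the example data are hypothetical, and the significance tests on slope and intercept are not reproduced.

```python
import math


def _moments(x, y):
    """Means, variances and covariance with a 1/n divisor, as in Lin's estimator."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need paired scores for at least two patients")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return mean_x, mean_y, var_x, var_y, cov_xy


def lin_ccc(x, y):
    """Lin's CCC with its precision (Pearson r) and accuracy components."""
    mean_x, mean_y, var_x, var_y, cov_xy = _moments(x, y)
    pearson_r = cov_xy / math.sqrt(var_x * var_y)                   # precision
    scale_shift = math.sqrt(var_x / var_y)                          # 1 = equal spread
    location_shift = (mean_x - mean_y) / (var_x * var_y) ** 0.25    # 0 = equal means
    accuracy = 2 / (scale_shift + 1 / scale_shift + location_shift ** 2)
    ccc = 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)     # equals r * accuracy
    return {"ccc": ccc, "pearson_r": pearson_r, "accuracy": accuracy,
            "scale_shift": scale_shift, "location_shift": location_shift}


def major_axis_line(x, y):
    """Slope and intercept of the major axis (orthogonal) best fitting line."""
    mean_x, mean_y, var_x, var_y, cov_xy = _moments(x, y)
    if cov_xy == 0:
        raise ValueError("major axis slope is undefined when the covariance is zero")
    spread = var_y - var_x
    slope = (spread + math.sqrt(spread ** 2 + 4 * cov_xy ** 2)) / (2 * cov_xy)
    return slope, mean_y - slope * mean_x


# Hypothetical data: the retest is exactly one point higher for every patient
first = [1.0, 2.0, 3.0, 4.0, 5.0]
second = [2.0, 3.0, 4.0, 5.0, 6.0]
result = lin_ccc(first, second)
slope, intercept = major_axis_line(first, second)
```

In this constructed example the Pearson correlation is 1 (perfect precision) while the CCC is 0.8, because the constant one-point location shift lowers the accuracy component. The major axis line has slope 1 and intercept 1, that is, it runs parallel to the concordance line but does not coincide with it.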

Visual assessment of test-retest agreement was obtained by use of Bland-Altman plots [26]. Differences between the first and the second measurement were plotted against the mean of the first and the second measurement. Limits of agreement (mean difference ± 1.96 × SD of the differences) were calculated, which reflect the interval within which about 95% of the differences between the two measurements should lie [27, 28]. A correlation coefficient (r) was calculated to assess whether the difference was related to the mean [26].
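The limits of agreement can be computed as in the following sketch. It assumes differences are taken as first minus second measurement and that the SD is the sample standard deviation; the function name `bland_altman` and the example scores are hypothetical, and no plotting is attempted.

```python
import statistics


def bland_altman(first, second):
    """Mean difference, limits of agreement and share of differences inside them."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need paired scores for at least two patients")
    diffs = [a - b for a, b in zip(first, second)]        # first minus second
    means = [(a + b) / 2 for a, b in zip(first, second)]  # x-axis of the plot
    mean_diff = statistics.mean(diffs)                    # systematic bias
    sd_diff = statistics.stdev(diffs)                     # sample SD (n - 1 divisor)
    lower = mean_diff - 1.96 * sd_diff
    upper = mean_diff + 1.96 * sd_diff
    share_inside = sum(lower <= d <= upper for d in diffs) / len(diffs)
    return {"means": means, "diffs": diffs, "mean_diff": mean_diff,
            "sd_diff": sd_diff, "limits": (lower, upper),
            "share_inside": share_inside}


# Hypothetical total scores for five patients (not study data)
first_scores = [2.0, 3.5, 4.0, 5.5, 6.0]
second_scores = [2.5, 3.0, 4.5, 5.0, 6.5]
ba = bland_altman(first_scores, second_scores)
```

A mean difference close to zero indicates little systematic bias between test and retest, while the width of the limits reflects the size of the random measurement error.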

Results

Patients

We contacted 148 patients, of whom 131 completed and returned the first questionnaire and 114 responded to the second questionnaire. This resulted in an overall response rate of 77%. A few patients were excluded, resulting in 101 patients who were eligible for the analysis of test-retest reliability (Table 1). The descriptive characteristics are shown in Table 2. Mean duration between the first and second measurement was 11 days for all three age groups.

Table 1 Patient recruitment
Table 2 Demographics and clinical characteristics

Analysis of FAQLQs

ICCs were ≥0.900 for the FAQLQs, and CCCs were comparably high. Location shift and scale shift should both be considered minimal according to Lin’s examples [29]. Pearson correlation should be considered moderate in the FAQLQ-CF and good in the FAQLQ-TF and -AF (Table 3). Comparable results were found for the individual domains of the FAQLQs (data not shown).

Table 3 Reliability and agreement measures of the FAQLQs

Figure 1 illustrates the correlation between the first and second measurement. Major axis analysis revealed no significant differences of the slope and intercept of the best fitting line from the concordance line for the FAQLQ-CF and -TF. For the FAQLQ-AF there were significant but modest differences of the slope (1.10, P = 0.046) and the intercept (−0.612, P = 0.019) of the best fitting line from the concordance line. The slope and intercept of the best fitting line of the FAQLQ-CF, -TF and -AF did not differ significantly from each other.

Fig. 1 FAQLQ score of the first measurement against the FAQLQ score of the second measurement, with the 45° line through the origin, in (A) children, (B) adolescents and (C) adults

The Bland-Altman plots are shown in Fig. 2. About 95% of the differences lie within the 1.96 SD limits of agreement. There was no significant correlation between the mean of both scores and the differences of both scores for the FAQLQ-CF and -TF. There was a significant but modest correlation between the mean of both scores and the differences of both scores for the FAQLQ-AF (r = −0.334; P = 0.046). No significant systematic bias was observed, which means that mean differences of both scores were all close to zero. The limits of agreement were narrowest for the FAQLQ-TF and wider for the FAQLQ-CF and -AF.

Fig. 2 Bland-Altman plots for the FAQLQs in (A) children, (B) adolescents and (C) adults. The difference between both measurements (calculated as first measurement minus second measurement) is plotted against the mean of both measurements

Discussion

This article describes the evaluation of the test-retest reliability of the recently developed self-administered FAQLQ-CF, -TF and -AF. Overall, reliability was considered to be excellent for the FAQLQs as measured with the ICC and CCC. Additionally, Bland–Altman plots showed that mean differences were all close to zero, supporting the high reliability of the FAQLQs.

In this study we used ICCs calculated by a one-way ANOVA, CCCs and Bland-Altman plots to assess test-retest reliability. However, different methods can be used to assess test-retest reliability, and there is much discussion in the literature on the best way to do this [20]. A disadvantage of the ICC is that if patient groups are very homogeneous, the ICC tends to be low, because the ICC compares variance among patients to total variance. If patient groups are very heterogeneous, the ICC tends to be high. Thus, the ICC would only generalise to similar populations. Additionally, the one-way ICC does not take into account the order in which observations were taken [29]. Therefore, the CCC is a useful additional measure. Like ICCs calculated by a one-way ANOVA, the CCC takes into account mean differences between the first and second measurement, but it also takes into account variance differences between the first and second measurement by reducing the magnitude of the resulting test-retest reliability estimate. In addition, the CCC is a better tool to distinguish between bias and imprecision [20, 29]. There can be large differences between ICC and CCC scores, especially in studies with heterogeneous groups. The similar scores we found in our study suggest that both coefficients performed well in this population and that results can be generalised to other groups. Bland-Altman plots are very illustrative in assessing test-retest agreement. They were useful to identify some extreme and outlying differences, to analyse the magnitude of the measurement error, which was small, and to visualise a possible relationship between the difference and the mean of both scores [26].
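The dependence of the ICC on the heterogeneity of the patient group can be made concrete with a small worked example. The notation and numbers are assumptions chosen for illustration only: the between-patient variance is written as σ_p², the measurement error variance as σ_e², and the error variance is held fixed at 0.25 while the between-patient variance changes.

```latex
\mathrm{ICC} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_e^2}, \qquad \sigma_e^2 = 0.25:
\quad
\sigma_p^2 = 0.25 \;\Rightarrow\; \mathrm{ICC} = \frac{0.25}{0.50} = 0.50,
\qquad
\sigma_p^2 = 2.25 \;\Rightarrow\; \mathrm{ICC} = \frac{2.25}{2.50} = 0.90
```

With an identical measurement error, the same questionnaire would thus appear far less reliable in the homogeneous group than in the heterogeneous one.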

This study may also have some limitations. Firstly, the sample sizes were relatively small. However, we found that the reliability of the questionnaires was very high, which indicates that the sample sizes were adequate and that a greater number of patients would probably not have influenced the outcomes. Another limitation may be that the majority of adults in this study were female. However, we did not find significant differences in the test-retest reliability outcomes between men and women (data not shown). Therefore, we think that the imbalance between men and women did not influence the generalisability of the results of the FAQLQ-AF. Finally, the significant correlation between the first and second measurement of the FAQLQ-AF (Fig. 1C) and between the mean of both scores and the differences of both scores of the FAQLQ-AF (Fig. 2C) was an unexpected finding. We think this correlation might be due to an outlier. This assumption was supported by a re-analysis excluding this outlier, which showed that the correlation was no longer significant.

In summary, the FAQLQs showed excellent reliability and are thus promising measures both for evaluative studies in patients with food allergy and for monitoring individual patients. The high test-retest reliability supports the value of the FAQLQs for clinical trials with relatively small sample sizes. We recommend the use of the FAQLQs in clinical trials of current management strategies of food allergy, and they may also be useful when new treatments become available. Currently, the longitudinal validity of the FAQLQs and the validity of several other European language versions of the FAQLQs are being investigated.