Key Points for Decision Makers

A pilot phase may increase the cost of valuation studies and is time-consuming. It is currently unknown whether an extensive pilot phase has a meaningful impact on interviewer performance and whether it may help minimize interviewer effects in EQ-5D-5L valuation studies.

This study highlighted the benefits of an extensive pilot phase for data quality and interviewer performance: the face validity of the cTTO data improved and prediction errors decreased throughout the data collection process, and especially during the pilot phase.

A pilot phase may have substantial benefits for data collection in EQ-VT studies, as it can help reveal issues and identify poorly performing interviewers for exclusion; it might prove most beneficial in EQ-5D-5L valuation studies in which protocol compliance issues and interviewer effects exist.

1 Introduction

The EQ-5D is the most widely used multidimensional instrument for measuring health-related quality of life and quality-adjusted life years [1]. A partial explanation for its popularity is that many EQ-5D value sets constructed at the national level are available, reflecting the belief that preferences for health can differ across populations. The EuroQol Group developed a standardized valuation protocol for EQ-5D-5L valuation studies that implements two valuation techniques: the composite time trade-off (cTTO) and the discrete choice experiment (DCE). Additionally, interviewer training materials are standardized and officially translated in an attempt to harmonize the methodology and the training of interviewers in valuation studies across countries, maximizing the comparability of the resulting value sets [2, 3]. Developing a country-specific value set using these valuation techniques is nevertheless challenging, as it requires trained interviewers to guide participants through the whole interview process [4, 5]. Interviewer behavior might also add unwanted variability to the data.

The results of the first wave of EQ-5D-5L valuation studies raised concerns about data quality, especially in the cTTO part of the data collection. Multiple issues were observed, including few worse-than-dead responses, low values for mild states, clustering of values, and a high frequency of inconsistent responses [6,7,8,9]. When the EuroQol Group realized that these issues were interviewer-driven, measures were taken to improve interviewer performance [10, 11]. Refinements of the valuation protocol included the introduction of the quality control (QC) tool, a feedback module, and three practice states to improve the reliability and validity of the data and promote interviewer performance [10,11,12].

In the cyclic QC process, Ramos-Goñi et al. defined minimum requirements for protocol compliance as a baseline for the initial assessment of whether each interviewer should continue or stop data collection. The cyclic nature of the process allowed study teams to reflect on interviewers' performance and gave interviewers continuous feedback to improve their skills and minimize interviewer effects during the entire data collection period [10]. However, other factors, such as the sociodemographic characteristics of the participants and their preferences, might contribute to the apparent existence of interviewer effects [13, 14]. Since a pilot phase is not usually included in EQ-5D-5L valuation studies, it is not clear whether an extensive pilot phase is needed to improve data quality and standardize interviewers' performance.

The aim of this study was to investigate how interviewer performance evolved during the EQ-5D-5L valuation study in Egypt and to examine the effect of the extensive pilot phase on improving protocol compliance and face validity and on reducing interviewer effects and prediction errors in the cTTO data. These insights can guide the design of future valuation studies and training materials and help improve interviewer performance and the quality of the collected data.

2 Methods

2.1 Data Source

This study used cTTO data and QC reports of the Egyptian EQ-5D-5L valuation study [15]. A total of 1303 interviews were conducted between July 2019 and March 2020 by 12 interviewers and two principal investigators (PIs). Ten interviews were test interviews done by the PIs. Once interviewers were recruited and trained, they conducted pilot interviews until the study team decided, based on the QC tool, that they had acquired the expertise needed to obtain good-quality interviews. Three interviewers were excluded due to interviewer effects seen in the data (113 interviews). The final analysis included 206 pilot interviews and the 974 actual interviews that were used in calculating the Egyptian tariff [15]. Members of the general public were recruited from different Egyptian governorates using multi-stratified quota sampling to select a sample representative in terms of age, sex, and geographical distribution. Each participant was interviewed face to face by a trained interviewer using the Egyptian translated version of the EQ-VT-2.1 protocol [2]. Interviews took place at the interviewers' office or at the participants' home, workplace, or other public places, according to the participants' preferences. Interviewer training was performed in four stages: interviews of the candidate interviewers by the PIs, initial training, pilot interviews, and retraining [16].

2.2 Quality Control (QC)

The QC reports cover two main aspects, namely protocol compliance and interviewer effects, in addition to other metadata such as the number of iteration steps and the time spent on the better than dead (BTD) and worse than dead (WTD) sections of the cTTO task [10]. Protocol compliance is assessed against four criteria: the time spent on the wheelchair (WC) example should not be less than 3 min; the time spent on the actual cTTO tasks should not be less than 5 min; there should be no clear inconsistency in the cTTO ratings; and the interviewer should have used the lead time in the WC example. An interview was flagged if the interviewer was not compliant with any of these criteria. A conservative threshold of four flagged interviews out of ten was established as the limit at which to stop and retrain the interviewer; if, after a further ten interviews by the same interviewer, four or more interviews were again flagged, the interviewer was to be excluded from data collection [10]. Interviewer effects were assessed for any unusual clustering or distribution by comparing the cTTO value distribution for each interviewer with the overall distribution of values for all interviewers. The QC reports were discussed in periodic online meetings between the Egyptian team and the EQ-VT support team: weekly during the pilot phase (every five interviews per interviewer) and every 2 weeks during actual data collection (every ten interviews per interviewer); the feedback received was discussed with all interviewers. All 12 interviewers were compliant with the minimum requirements of the protocol. However, three interviewers, along with the interviews they had conducted, were excluded from data collection and data analysis due to strong clustering and inconsistent distributions in their cTTO data despite retraining and close monitoring, which could indicate poor engagement in the valuation tasks and interviewer effects.
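For illustration, the four flagging criteria and the stop-and-retrain rule can be expressed as a minimal sketch in Python; the data structure and field names below are hypothetical, not part of the EQ-VT software:

```python
from dataclasses import dataclass

@dataclass
class Interview:
    """Hypothetical per-interview QC metadata (field names are illustrative)."""
    wc_example_minutes: float   # time spent on the wheelchair (WC) example
    ctto_task_minutes: float    # time spent on the actual cTTO tasks
    clear_inconsistency: bool   # a clear inconsistency in the cTTO ratings
    lead_time_used: bool        # lead time shown in the WC example

def is_flagged(iv: Interview) -> bool:
    """An interview is flagged if any of the four compliance criteria fails."""
    return (
        iv.wc_example_minutes < 3
        or iv.ctto_task_minutes < 5
        or iv.clear_inconsistency
        or not iv.lead_time_used
    )

def needs_retraining(batch: list[Interview]) -> bool:
    """Stop-and-retrain rule: four or more flagged interviews out of ten."""
    return sum(is_flagged(iv) for iv in batch) >= 4
```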

2.3 Data Analysis

Sample demographics and QC indicators were analyzed using IBM SPSS Statistics for Windows, version 22.0 (IBM Corp., Armonk, NY, USA); Stata version 14 was used to test protocol compliance, interviewer effects, clustering, and predictive accuracy.

2.3.1 Sample Demographic Characteristics and QC Tool Indicators

Descriptive statistics were presented for the sample sociodemographic characteristics and the QC tool indicators: percentages for discrete variables, and means and standard deviations for continuous variables.

2.3.2 Protocol Compliance, Interviewer Effects and Clustering

Data were divided into batches of ten interviews per interviewer. We compared the rate of flagged interviews between the pilot phase and the actual data collection phase, and calculated the rate of flagged interviews by interviewer, to assess the effect of the pilot phase on protocol compliance and to investigate whether the rate of flagged interviews continued to decrease beyond the pilot phase or plateaued within it. This allowed us to determine whether there was a decreasing trend in flagged interviews over the course of the study.
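As a sketch of how these batch-level flag rates could be computed from a long-format export of the QC reports (the file name and column names are assumptions):

```python
import pandas as pd

# One row per interview; assumed columns: interviewer (id), seq (order of the
# interview within its interviewer), flagged (bool), phase ("pilot"/"actual").
qc = pd.read_csv("qc_reports.csv")  # hypothetical file name

# Batches of ten consecutive interviews within each interviewer.
order = qc.groupby("interviewer")["seq"].rank(method="first") - 1
qc["batch"] = (order // 10 + 1).astype(int)

# Flag rate (%) per batch and per phase.
print(qc.groupby("batch")["flagged"].mean().mul(100).round(1))
print(qc.groupby("phase")["flagged"].mean().mul(100).round(1))
```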

To test whether interviewer effects were reduced during the pilot phase and subsequent rounds of cTTO data collection, three-level mixed models were estimated in which the variance in values was partitioned into variance attributed to responses, to respondents, and to interviewers, with responses nested in respondents and respondents nested in interviewers, on each subsample of ten interviews per interviewer per batch. Intraclass correlation (ICC) coefficients were calculated to investigate whether the share of variance attributed to interviewers decreased over the collected rounds of data.
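A sketch of this variance decomposition, using statsmodels as a stand-in for the Stata estimation actually used (the file name, column names, and data layout are assumptions based on the description above):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Long-format cTTO data, one row per response.
# Assumed columns: value, interviewer, respondent, batch.
df = pd.read_csv("ctto_long.csv")  # hypothetical file name

def interviewer_icc(batch_df: pd.DataFrame) -> float:
    """Share of total variance attributed to interviewers in one batch.

    Intercept-only model with interviewers as the grouping level and
    respondents as a nested variance component; individual responses
    form the residual level, approximating the three-level model.
    """
    fit = smf.mixedlm(
        "value ~ 1",
        data=batch_df,
        groups="interviewer",
        re_formula="1",
        vc_formula={"respondent": "0 + C(respondent)"},
    ).fit(reml=True)
    var_interviewer = float(fit.cov_re.iloc[0, 0])
    var_respondent = float(fit.vcomp[0])
    var_residual = float(fit.scale)
    return var_interviewer / (var_interviewer + var_respondent + var_residual)

icc_by_batch = df.groupby("batch").apply(interviewer_icc)
print(icc_by_batch.round(3))
```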

Reduction of clustering on easily obtained values (−1, −0.5, 0, 0.5, and 1) was compared across batches and taken as an initial indication of quality improvement. Scatter plots were used to investigate whether clustering decreased over rounds of the collected data.
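A sketch of these clustering summaries under the same assumed data layout (the percentage of responses on the easy values and, as in Fig. 1b, the mean number of unique values per respondent):

```python
import pandas as pd

EASY_VALUES = {-1.0, -0.5, 0.0, 0.5, 1.0}  # values that are easy to land on

# Assumed columns: respondent, batch, value (one row per cTTO response).
df = pd.read_csv("ctto_long.csv")  # hypothetical file name

# Percentage of responses clustered on the easily obtained values, per batch.
clustered = df["value"].isin(EASY_VALUES).groupby(df["batch"]).mean().mul(100)

# Mean number of unique cTTO values per respondent, per batch (cf. Fig. 1b).
uniques = (df.groupby(["batch", "respondent"])["value"].nunique()
             .groupby("batch").mean())
print(clustered.round(1))
print(uniques.round(1))
```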

2.3.3 Predictive Accuracy

To test whether the pilot phase had a significant effect on the aggregate predictive accuracy of the models employed in the value set calculation, two samples were compared: the sample used for the value set calculation (n = 974) and a sample of equal size comprising the pilot data (n = 206) and the first 768 actual interviews. The omission of actual interviews from the second sample was balanced by interviewer, such that the number of actual interviews excluded for each interviewer equaled their number of pilot interviews. First, we applied the Egyptian value set to all health states valued in the pilot and actual data [15]. For each of the two samples, the mean absolute error (MAE) was computed as the mean of the absolute differences between the values assigned by respondents and the index values. As a comparison, we randomly drew two further samples of similar size from all collected data (pilot plus actual) and compared their performance with that of the two original samples. The random draws were repeated 10,000 times to ensure robustness of the sample selection.
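A sketch of the MAE computation and the repeated random draws, matching the 10,000 draws of 1000 respondents reported in Fig. 3 (the data layout and column names are assumptions; the seed is arbitrary):

```python
import numpy as np
import pandas as pd

# Assumed columns: respondent, value (observed cTTO value),
# index_value (value-set prediction for the same health state).
df = pd.read_csv("ctto_long.csv")  # hypothetical file name

def mae(sample: pd.DataFrame) -> float:
    """Mean absolute difference between observed values and index values."""
    return (sample["value"] - sample["index_value"]).abs().mean()

respondents = df["respondent"].unique()
rng = np.random.default_rng(seed=0)

# Respondents (not individual responses) are drawn, so each draw keeps
# all ten cTTO tasks per respondent together.
draw_maes = []
for _ in range(10_000):
    drawn = rng.choice(respondents, size=1000, replace=False)
    draw_maes.append(mae(df[df["respondent"].isin(drawn)]))

print(f"MAE over draws: mean={np.mean(draw_maes):.3f}, sd={np.std(draw_maes):.3f}")
```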

To determine whether predictive accuracy improved at the interviewer level as more interviews were completed, the Egyptian value set was applied to the valuation data. We then calculated the MAE within each interview (ten responses per interview) as the mean of the absolute differences between the index values and the values provided by the respondent. Subsequently, decreasing trends in the MAE over time were visualized using scatter plots of the within-interview MAEs for each interviewer over the sequence of interviewing.

Ordinary least squares (OLS) regression analyses, with the respondent-level MAE as the dependent variable and the rank order in which the interviews were conducted by the interviewer (Time) as the independent variable, were conducted for each interviewer separately (Eq. 1). This allowed us to test whether the MAE improved as interviewers completed more interviews; in other words, whether the outcomes of a cTTO interview became more similar to the results of the final value set. In addition, we explored models that included a dummy variable (Pilot) indicating whether the data were pilot data (Pilot = 1) or non-pilot data (Pilot = 0) (Eq. 2), as well as the interaction between pilot status and the sequence of interviews (Time*Pilot) (Eq. 3). For each of these variables, p-values were calculated to test the significance of their relationship with the respondent-level MAE. A significant parameter estimate for the dummy variable would show that the MAE was larger or smaller in the pilot than in the actual data, and the interaction term would show whether the improvement in prediction error was larger in the pilot phase:

$$\mathrm{MAE}_{i} = \beta_{0} + \beta_{1}\mathrm{Time} + \varepsilon_{i}, \tag{1}$$

$$\mathrm{MAE}_{i} = \beta_{0} + \beta_{1}\mathrm{Time} + \beta_{2}\mathrm{Pilot} + \varepsilon_{i}, \tag{2}$$

$$\mathrm{MAE}_{i} = \beta_{0} + \beta_{1}\mathrm{Time} + \beta_{2}\mathrm{Pilot} + \beta_{3}\mathrm{Pilot} \times \mathrm{Time} + \varepsilon_{i}. \tag{3}$$

In Eqs. (1), (2) and (3), \(\mathrm{MAE}_{i}\) represents the mean absolute error for the interview conducted with respondent \(i\). \(\beta_{0}\) is the regression intercept, \(\beta_{1}\) captures the effect of interview sequence (Time), and \(\beta_{2}\) and \(\beta_{3}\) capture the effect of the pilot phase and of the interaction between the pilot phase and interview sequence, respectively. \(\varepsilon_{i}\) is the residual error term.
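The three per-interviewer regressions could be estimated along the following lines (a sketch using statsmodels OLS rather than the Stata code actually used; the file name and column names are assumptions):

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per interview; assumed columns: interviewer, mae (within-interview
# MAE), time (rank order of the interview within its interviewer), pilot (0/1).
iv = pd.read_csv("interview_mae.csv")  # hypothetical file name

for name, g in iv.groupby("interviewer"):
    m_a = smf.ols("mae ~ time", data=g).fit()                        # Eq. (1)
    m_b = smf.ols("mae ~ time + pilot", data=g).fit()                # Eq. (2)
    m_c = smf.ols("mae ~ time + pilot + time:pilot", data=g).fit()   # Eq. (3)
    print(
        f"{name}: b_time={m_a.params['time']:+.4f} (p={m_a.pvalues['time']:.3f}), "
        f"R2_A={m_a.rsquared:.2f}, R2_B={m_b.rsquared:.2f}, "
        f"p_interaction={m_c.pvalues['time:pilot']:.3f}"
    )
```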

3 Results

3.1 Sample Demographic Characteristics

In this study, 1180 interviews conducted by the nine interviewers who completed data collection were included in the final analysis: 206 pilot interviews and 974 actual interviews.

Table 1 gives an overview of the study sample characteristics. The majority of the participants in the pilot phase were highly educated, employed and lived in urban areas in Cairo.

Table 1 Background characteristics of the Egyptian participants

3.2 QC Tool Indicators

Table 2 compares the QC tool indicators for the pilot and actual data collection phases, showing the improvement in the actual data collection phase.

Table 2 Quality control (QC) tool indicators

3.3 Protocol Compliance, Interviewers’ Effects and Clustering

Data were divided into 14 batches clustered by interviewer. The first three batches represented the pilot phase (n = 206) and the subsequent batches (4–14) represented the actual data collection phase (n = 974). The average number of interviews per interviewer in the pilot phase was 23 (range 10–40), and the average number of interviews in the actual data collection was 108 (range 78–169). Each batch consisted of ten interviews per interviewer, except batches 3 and 14. Table 2 shows the exact number of pilot and actual interviews for each interviewer.

There was no effect of the pilot phase on protocol compliance in terms of the four QC tool indicators: the percentage of flagged interviews did not exceed 3.3% per batch in either the pilot phase or the actual data collection phase. Interviewer effects did not improve beyond the pilot phase and did not decrease substantially over time. However, the share of variance attributed to interviewers over the collected rounds of data, as measured by the ICC, did not exceed 6.7% across the whole study.

The face validity of the data improved: less clustering on the easily attained responses was observed over time (Fig. 1a). In addition, the range of the mean number of unique values per respondent was 5.7–6.3 in the pilot phase, increasing to 6.9–8.1 in the actual data collection phase (Fig. 1b). Moreover, the percentage of respondents with fewer than five unique values decreased over the data collection process, from 16.7–25.6% in the pilot phase to 3.3–12.6% in the actual data collection.

Fig. 1 Percentage of clustered responses (a) and mean number of unique values per interview (b) per batch; pilot data (left of the red line), actual data (right of the red line)

The percentage of respondents using only integer values when trading life years also decreased over the data collection process, from 36.8–46.7% in the pilot data to 13.6–40.0% in the actual data (Fig. 2a).

Fig. 2 Percentage of respondents using only integer values (a) and mean absolute error (MAE) (b) per batch; pilot data (left of the red line), actual data (right of the red line)

3.4 Predictive Accuracy

Predictive accuracy increased over batches and beyond the pilot phase: the MAE per batch ranged from 0.42 to 0.46 in the pilot data and from 0.32 to 0.40 in the actual data (Fig. 2b). The MAE averaged across batches of the actual data was 0.37, lower than that of the pilot data (0.44), and the MAE for the first 974 interviews, comprising all pilot interviews and the first 768 actual interviews, was 0.39. Drawing 1000 respondents randomly from the whole dataset (pilot + actual data) (Fig. 3) led to MAEs that were higher than those of the actual data but lower than those of the pilot data. Figure 4 shows the MAEs per respondent within each interviewer over the sequence of interviewing. It is clear from Fig. 4 that the noise in the data decreased in later rounds of interviews.

Fig. 3 Mean absolute errors (MAEs) for 10,000 random draws of 1000 respondents completing all ten composite time trade-off (cTTO) tasks

Fig. 4 Mean absolute error (MAE) per respondent per interviewer over the sequence of interviews

In Table 3, model A shows the OLS regression analyses of MAE on interview sequence (Time) by interviewer. There was a significant effect of Time for six of the nine interviewers, indicating that the MAE decreased for most interviewers as they completed more interviews (a sequence effect).

Table 3 Regression coefficients for mean absolute error (MAE) over interview sequence (Time) by interviewer (model A), adding type of data (pilot or actual) (model B), and adding the interaction between interview sequence and type of data (model C)

In model B, adding the Pilot variable increased the explained variance (R²) compared with model A, but the effect of the interaction variable (Time*Pilot) was not significant for most interviewers, as demonstrated by p_Time*Pilot (model C). The signs of the coefficients of the Pilot variable signal that the MAE was generally lower in the actual data than in the pilot data. This is compatible with the notion that the final value set model was estimated on the actual data, so the MAE estimates for the pilot data were out-of-sample predictions, which are expected to have larger errors than within-sample predictions. In model B, five of the nine interviewers still had a significant effect of Time, showing that regardless of whether the data were pilot data, the MAE decreased as interviewers completed more interviews. Later interviews thus provided responses more similar to the final value set model than earlier responses, which suggests that the precision of the interviews may have improved.

4 Discussion

4.1 Main Findings

To our knowledge, this is the first study to highlight the benefits of an extensive pilot phase on data quality and interviewer performance. We examined the improvement in protocol compliance, face validity, and interviewer effects, in addition to the reduction of prediction errors in the cTTO data. Our main findings show that the face validity of the data seems to improve: the number of unique values per respondent, as well as the use of non-integer numbers, seems to increase, while clustering of values seems to decrease in the interviews included in the actual data collection versus the pilot phase. Furthermore, we have shown that the values collected in the pilot study differ from those collected in the actual data collection, as shown by the higher MAEs. The MAE seems to decrease for interviews conducted later in the data collection, both within the pilot phase and in the interviews completed as part of the actual data collection.

4.2 Interpretation

The face validity and prediction error data show a similar pattern: during the pilot phase there was a substantial improvement in the key characteristics examined in the current study, due to the feedback shared with the interviewers regarding their performance. A written debriefing was sent to each interviewer that included a formative evaluation of their performance and the main issues to be considered during the next set of interviews. In addition, the interviewers were advised to standardize the outline of the interview during the cTTO task to ensure precision of responses. This included informing respondents that they would be presented with different health states of varying severity, showing them the full range of the TTO scale with the 6-month increments or decrements during the example questions, and asking participants for the rationale behind illogical answers. Furthermore, the MAE also seems to improve as an interviewer completes more interviews. These two outcomes combined suggest that there is a learning effect (sequence effect) for the interviewers, leading to better data quality after the pilot phase. After the pilot phase was completed, there were still some improvements, but not as large as those made during the pilot phase. This may suggest a role for the implementation of pilot phases in future EQ-VT studies.

The MAE data show a substantial difference between the MAE of the data used for the Egyptian value set (0.37) and that of the pilot data (0.44). Although the MAE for the pilot data is based on an out-of-sample prediction, one would still expect the difference in MAE to be very small if a pilot phase had no effect on predictive accuracy. This, along with the observation that MAEs on average decreased for individual interviewers over their interview sequence, strengthens the conclusion that a pilot phase has a positive effect on the predictive accuracy of the collected data.

The Egyptian valuation study showed high levels of protocol compliance in terms of the four QC tool indicators from the initial waves of data collection onwards: the percentage of flagged interviews did not exceed 11% per interviewer in the pilot phase and 4% in the actual data collection phase. These rates are typically higher in other studies; for example, the Peruvian EQ-5D-5L valuation study reported 0–19% of interviews flagged per interviewer [17]. This might be attributed to the use of the QC tool elements as part of the interviewer training for the Egyptian valuation study. Since protocol compliance was already high initially, the effect of a pilot phase on protocol compliance may have been limited in the current study. However, studies that initially report lower rates of protocol compliance may still be able to improve compliance during a pilot phase before actual data collection starts.

In EQ-5D-5L valuation studies, interviewers play a major role in motivating respondents to engage in the valuation tasks and to express their values accurately, in addition to dealing with particular participant behaviors or characteristics. In this study, the interviewer training was extensive and performed in four stages to minimize inter- and intra-interviewer effects and to improve performance; this process has been detailed in a previous publication [16]. Interviewer effects did not improve beyond the pilot phase and did not decrease substantially over time. However, the share of variance attributed to interviewers did not exceed 6.7% across the whole study. The remaining interviewer effects might be attributed to differences in interviewers' personalities and styles, in addition to variation in participant characteristics and in the time and place of the interview (regional differences in values), which might affect how participants completed the valuation interview [13, 14, 17]. Other studies have reported interviewer effects as well, but did not quantify them as in the current study, making comparisons difficult [17,18,19].

Overall, it seems that a pilot phase may have substantial benefits for data collection in EQ-VT studies. Our data show a likely learning effect, whereby the quality of the collected data increased with the number of interviews completed by an interviewer: the more interviewing experience, the higher the prediction accuracy and the lower the level of logical inconsistency. The lower number of inconsistent responses among more experienced interviewers was also found in a previous study by Yang et al. [20]. The lessons learned from the extensive pilot phase in the Egyptian valuation study and the strict implementation of quality control allowed us to provide the interviewers with better feedback, which improved their performance. Although these requirements increased study costs and led to the removal of data, implementing an extensive pilot phase seems to be very effective at revealing data quality issues and improving the quality of the sample used for estimating the value set.

4.3 Strengths and Limitations

This is not the first EQ-VT study in which a pilot phase was implemented before the final data collection phase commenced. However, it is the first with the current structure: in our study, each interviewer completed an average of 23 pilot interviews before commencing actual data collection, whereas in Peru and France interviewers conducted only five to ten pilot interviews [17, 21]. This is substantially fewer than in our study, and the size of the pilot sample allowed us to assess the effects of a pilot phase in more detail than was possible in previous studies, which is a strength of this study.

One limitation of this study is that there were some differences in sample background characteristics between the pilot phase and the actual data collection phase: most participants in the pilot phase were highly educated, employed, and lived in urban areas of Cairo. A pilot study is usually conducted in a central location to reduce costs, achieve a consistent sampling frame for all interviewers, and facilitate PI-interviewer interactions. However, it is not clear how this affected the results.

4.4 Implications

For EQ-5D-5L valuation studies, achieving the minimum quality control requirements is not enough to guarantee good data quality. As shown in the current study, implementing an extensive pilot phase may substantially improve the face validity and predictive accuracy of the data collected in the actual data collection phase, helping to ensure high standards of data quality for generating value sets. Moreover, interviewer effects should be addressed more carefully, particularly in the QC process, and further exploratory research is needed to control interviewer effects in future EQ-5D-5L valuation studies.

5 Conclusion

This study demonstrated the benefits of a pilot phase and of strict implementation of the QC tool in improving the face validity and prediction accuracy of cTTO data. An extensive pilot phase may be especially beneficial in EQ-5D-5L valuation studies that initially have more issues with protocol compliance and interviewer effects.