Key Points for Decision Makers

The Self-directed Online Assessment of Preferences (SOAP) tool is freely available, open-source utility-elicitation software compatible with modern web browsers, including touch-screen mobile devices.

SOAP modules can easily be developed for other clinical scenarios.

The SOAP MESCC module is valid, reproducible, and responsive for ex ante utility elicitation.

1 Introduction

Quality-adjusted life-years (QALYs) are used to concurrently quantify morbidity and mortality within a single parameter [1]. For this reason, QALYs can facilitate the discussion of risks and benefits during patient counselling regarding treatment options [2]. To help make funding decisions, policy makers may also combine QALYs with cost estimates to calculate the incremental cost-effectiveness ratio [3]. QALYs are calculated using “utilities,” or health-related quality-of-life (HRQoL) weights, which are obtained by direct valuation or from generic health status measures [4].
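For reference, the standard formulations (our notation; these formulas are conventional and not reproduced from [1] or [3]) are

\[ \mathrm{QALYs} = \sum_{i} u_i \, t_i, \qquad \mathrm{ICER} = \frac{C_{1} - C_{0}}{Q_{1} - Q_{0}}, \]

where \(u_i\) is the utility of health state \(i\), \(t_i\) is the time spent in that state, \(C\) denotes cost, \(Q\) denotes QALYs, and subscripts 1 and 0 indicate the new intervention and its comparator, respectively.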

The choice of utility valuation approach is driven by available data. Direct valuation is the classical approach in which individuals rate hypothetical health state descriptions using the time trade-off or standard gamble procedures [5]. These procedures can be used to measure utilities for very specific and uncommon health states. However, it can be cumbersome to develop valid health state descriptions for particular diseases. Alternatively, techniques have been developed to convert generic health status measures (e.g. EuroQol-5 Dimensions [EQ-5D], Short Form-6 Dimensions [SF-6D], or Health Utilities Index 3) to utilities [1]. Conversion of generic health state measures is advantageous because custom health state descriptions are not required. However, utilities can only be obtained for health states actually observed in a cohort of patients involved in the generic health survey.

Unfortunately, generic health scores have not been collected for many diseases, meaning direct valuation is necessary for measuring utilities. Best practices in economic evaluation are to recruit a sample of healthy individuals from the general population for utility valuation [6, 7]. Traditionally, general population utility valuation has been conducted using face-to-face interviews, phone interviews, or postal surveys [8]. These forms of survey administration are time intensive and costly, so web-based surveys are increasingly being used [9,10,11,12,13,14,15,16,17,18,19,20,21,22]. Typically, these studies are conducted using proprietary software, which limits application to other disease contexts. Furthermore, the psychometric properties of these proprietary software programs have not been assessed [23].

It is important to determine whether web-based utility valuation has acceptable psychometric properties. If it does, investigators could build disease-specific modules on a common, previously validated platform rather than developing custom software for each new utility valuation study. To meet this need, we developed the Self-directed Online Assessment of Preferences (SOAP), a new open-source (non-proprietary), web-based, self-directed utility valuation platform usable on major computer systems, including touch-screen devices (Appendix 1 and 2 in the Electronic Supplementary Material [ESM]). SOAP was designed with flexibility in mind and can accept new health state descriptions (modules) with minimal programming.

We decided to first create a SOAP module for metastatic epidural spinal cord compression (MESCC), a condition for which HRQoL data are limited. MESCC can be treated with surgery or radiotherapy, but few high-quality studies compare these interventions using generic health status measures for patients. However, surgery and radiotherapy outcomes could be compared using utilities obtained by direct valuation of hypothetical probe health state descriptions. The European Organisation for Research and Treatment of Cancer (EORTC) MESCC working group has developed an HRQoL questionnaire for MESCC [24]. Items from this questionnaire could be used to generate health state descriptions for a SOAP module.

The objective of this study was to determine whether the SOAP platform can be used to develop a valid, reproducible, and responsive module for MESCC. For this first application of the SOAP platform, we developed a MESCC module based on the work of the EORTC and measured psychometric properties in a general population sample.

2 Methods

2.1 Self-directed Online Assessment of Preferences (SOAP) Platform

Electronic utility valuation protocols are distinguished by the form of health state descriptions, assessment approach, navigation rules, and auxiliary functions [25]. A detailed description of these elements for the SOAP MESCC module is provided in Appendix 1 and 2 in the ESM.

2.2 Metastatic Epidural Spinal Cord Compression (MESCC) Module

EORTC phase I development of a MESCC questionnaire in Canada found that patients and healthcare providers felt that ambulation, urinary continence, pain, and independence were important HRQoL issues for MESCC. Since phase I development was restricted to HRQoL and did not specifically consider treatment effects and adverse events, we reviewed prospective studies on MESCC to identify reported outcomes and adverse events [26,27,28,29]. The EORTC items captured all treatment outcomes identified in our review. However, the review identified a large and disparate set of adverse effects. To develop a manageable decision analytic model, all adverse effects were grouped as an “other symptoms” attribute.

A tabular (point-form) presentation of health states was chosen because participants prefer it, it is believed to impose less cognitive burden than a narrative format, and it produces results similar to those of a narrative format [30, 31]. Therefore, we presented health states as a point-form list of five dysfunctional attributes: non-ambulatory (N), incontinent of urine (I), pain (P), dependent (D), and “other symptoms” (S). To reduce the number of potential health states, EORTC items were collapsed to indicate the presence (+) or absence (−) of each dysfunctional attribute, producing 32 discrete health states (Appendix 3 in the ESM).
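As a minimal illustration of how five binary attributes generate the 32 (2^5) discrete states, the following R sketch enumerates them; the variable names and label format are ours, not taken from the SOAP source code.

```r
# Enumerate the 32 (2^5) MESCC health states from the five binary attributes.
# Codes follow the text: D (dependent), N (non-ambulatory), I (incontinent of
# urine), P (pain), S (other symptoms); "+" = dysfunction present, "-" = absent.
attrs <- c("D", "N", "I", "P", "S")

# Every combination of presence/absence across the five attributes
combos <- expand.grid(rep(list(c("-", "+")), length(attrs)))
names(combos) <- attrs

states <- data.frame(
  label = apply(combos, 1, function(row) paste0(attrs, row, collapse = "")),
  n_dysfunctions = rowSums(combos == "+")
)

nrow(states)        # 32
head(states$label)  # "D-N-I-P-S-", "D+N-I-P-S-", ...
```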

When possible, the phrasing for presence or absence of dysfunctional attributes was created using the same EORTC items identified in the MESCC module development process (Table 1). Items were rephrased in the second person and restructured as declarative sentences. Items describing feelings or worries were not used because we wanted to make the health state descriptions as objective as possible. The rationale for each attribute formulation was as follows:

Table 1 Health state attributes
1. Dependence (D). The two items identified by the MESCC working group were combined into one attribute to highlight the implications of loss of independence. The qualifiers “do” and “do not” were added to indicate complete function and dysfunction.

2. Lack of ambulation (N). The MESCC working group developed a new item that was used as the functional level. Again, two items were combined to highlight the implications of loss of mobility.

3. Incontinence of urine (I). The item identified by the MESCC working group with a qualifier was used as the functional level. An item from the EORTC bladder cancer module (BLM44) was used to highlight the implications of loss of bladder control.

4. Pain (P). As MESCC can only occur from the cervical spine to the thoracolumbar junction, pain was not differentiated by the terms “upper” and “lower” back as identified by the MESCC working group. As most patients with spine metastasis have some element of pain, the functional state had patients requiring pain medications. Use of pain medication served as a qualifier and was taken from the EORTC bone metastasis module (BM38).

5. Other symptoms (S). To maintain efficiency, all adverse effects were characterized by several common adverse symptoms. These items were all taken from the core EORTC questionnaire.

Valuations were obtained with the standard gamble method, using a ping-pong search algorithm. In the standard gamble, success is typically framed as “perfect” health for an undetermined period of time; in this context, perfect health can be inferred to be the absence of any dysfunction. Therefore, the fully functional health state (D-, N-, I-, P-, S-) was chosen as the success anchor. To eliminate confusion around life expectancy, all scenarios were framed as having a certain life expectancy of 5 years; that is, for both the probe health scenario and the success health scenario, participants were told their life expectancy would certainly be 5 years. This was the maximum survival reported in a randomized controlled trial on treatments for MESCC [27]. Probe health states were presented in a random order.
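The exact question sequence is specified in Appendix 1 in the ESM; purely as a sketch, a ping-pong search of the kind described above can be implemented along the following lines in R. The respondent-interaction function, the failure outcome (immediate death), and the probing steps are illustrative assumptions, not the SOAP implementation.

```r
# Illustrative ping-pong standard-gamble search. choose_gamble(p) returns
# TRUE if the respondent prefers a gamble with probability p of the fully
# functional state (and 1 - p of immediate death) over remaining in the
# probe health state for certain; both alternatives framed with a 5-year
# life expectancy, as in the study.
ping_pong_sg <- function(choose_gamble, precision = 0.05) {
  lo <- 0           # highest p at which the gamble has been rejected
  hi <- 1           # lowest p at which the gamble has been accepted
  from_top <- TRUE
  while (hi - lo > precision + 1e-9) {  # small tolerance for floating point
    # Alternate ("ping-pong") between probing just below the upper bound
    # and just above the lower bound
    p <- round(if (from_top) hi - precision else lo + precision, 2)
    if (choose_gamble(p)) hi <- p else lo <- p
    from_top <- !from_top
  }
  (lo + hi) / 2     # indifference probability, taken as the utility
}

# A simulated respondent whose true utility for the probe state is 0.62
ping_pong_sg(function(p) p > 0.62)  # returns 0.625
```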

The MESCC module was pilot tested in a sample of 40 participants to assess acceptability and ease of use. Participants were asked to rate the SOAP MESCC module using a five-point Likert rating for the statement “[t]his website is easy to use”, and 92.5% of participants strongly agreed or agreed with the statement.

2.3 Subjects

To be compliant with best practice in economic evaluation, we sought to conduct a direct utility valuation study with a sample of the general population who had not experienced MESCC using the SOAP MESCC module (ex ante valuation) [6]. Prior to this general population direct valuation study, the psychometric properties of the SOAP MESCC module had to be evaluated. To approximate a general population sample for this psychometric validation study, participants were recruited from the emergency department waiting rooms at The Ottawa Hospital, an academic hospital in Ottawa, Ontario, Canada. Only patients’ family members or friends (i.e. individuals accompanying patients) aged ≥ 18 years were eligible to participate. Participants were required to be able to read English and have access to the internet outside of the hospital. A minimum sample size of 50 participants has been recommended in published guidelines for reliability and responsiveness evaluations [23]. To ensure robust results, we set the sample size for this study at 75.

2.4 Survey Procedures

Participants completed the first survey in the emergency department using a touch-screen device. Investigators did not assist participants in navigating or completing the survey. Each participant valued the fully dysfunctional health state D+N+I+P+S+, one randomly selected singly dysfunctional health state, and one triply dysfunctional health state. Dysfunctional elements were nested to ensure a logical ordering of utilities for the three health states. For example, if the singly dysfunctional health state was D-N-I+P-S-, the triply dysfunctional health state would include incontinence plus two of dependence, lack of ambulation, pain, or other symptoms.
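For illustration, the nesting described above can be reproduced in a few lines of R; the variable names are ours, and the SOAP module’s internal randomization may differ.

```r
# Draw the nested probe set each participant valued: one random singly
# dysfunctional state, a triply dysfunctional state containing that same
# dysfunction, and the fully dysfunctional state.
attrs <- c("D", "N", "I", "P", "S")
label <- function(dys) {
  paste0(attrs, ifelse(attrs %in% dys, "+", "-"), collapse = "")
}

single <- sample(attrs, 1)                            # e.g. "I"
triple <- c(single, sample(setdiff(attrs, single), 2))

c(single = label(single), triple = label(triple), full = label(attrs))
# e.g. single "D-N-I+P-S-", triple "D+N-I+P-S+", full "D+N+I+P+S+"
```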

Investigators contacted participants via email and/or phone 2 days after the initial survey with information to access the retest. Participants completed the second survey using their personal device. For the retest, participants were presented with the same probe health states they completed in the emergency room, but states were presented in a new random order.

2.5 Statistical Analysis

“Validity” refers to whether a tool under investigation measures what it is supposed to measure [32]. Specifically, “construct validity” concerns whether results obtained using the tool under investigation are consistent with a priori hypotheses [32]. We hypothesized that valuations should follow the logical ordering of the health states: singly ≥ triply ≥ fully dysfunctional. We considered singly = triply = fully a valid response because we could not exclude the possibility of a ceiling effect with one dysfunction. Participant responses were deemed “valid” if their utilities followed this order. The proportion of participants providing valid responses on the test and retest was computed.
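In code, this classification reduces to a simple ordering check; a minimal R sketch (variable names are ours):

```r
# Classify a participant's valuations as valid when they follow the logical
# ordering singly >= triply >= fully dysfunctional; ties are allowed because
# a ceiling effect with one dysfunction could not be excluded. Skipped
# scenarios (NA) are classified as invalid, as in the Results.
is_valid <- function(u_single, u_triple, u_full) {
  !anyNA(c(u_single, u_triple, u_full)) &&
    u_single >= u_triple && u_triple >= u_full
}

is_valid(0.80, 0.55, 0.30)  # TRUE: logically ordered
is_valid(0.40, 0.55, 0.30)  # FALSE: triply valued above singly
```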

“Reproducibility” concerns the stability of participants’ responses on repeated testing and can be characterized by agreement and reliability [32]. “Agreement” quantifies the absolute differences in participants’ repeated responses. We assessed agreement using the smallest detectable change [23]. We classified agreement as adequate if the smallest detectable change was less than the minimal clinically important difference (MCID) [23]. By anchoring to Eastern Cooperative Oncology Group functional levels, an MCID of 0.05 for cancer utilities obtained by the three-level EQ-5D (EQ-5D-3L) has been proposed [33]. This MCID has also been used for direct utility valuation by the standard gamble and time trade-off of EQ-5D-3L health states [34]. The precision of the standard gamble algorithm used in our study was also 0.05. Therefore, we used an MCID of 0.05 in this study. Systematic differences between the test and retest sessions were quantified using the smallest detectable change calculation. “Reliability” concerns the fraction of pooled study variance across the repeated tests attributable to differences between participants (participant variance) rather than individual test–retest variability (noise) [32]. If responses are stable, the ratio of noise to participant variance should be small, and the ratio of participant variance to the variance of the pooled test and retest results should be high. Reliability accounting for systematic differences between the test and retest, stratified by the number of dysfunctions in the health state, was quantified using the intraclass correlation coefficient (ICC), interpreted with the following categories: < 0.21, slight reliability; 0.21–0.40, fair reliability; 0.41–0.60, moderate reliability; 0.61–0.80, substantial reliability; > 0.80, almost perfect reliability [35]. An ICC ≥ 0.70 was considered adequate [23].
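A minimal R sketch of these computations, assuming the common formulas SEM = SD(differences)/√2 and SDC = 1.96 × √2 × SEM (scaled by √n for average measures) and the two-way, absolute-agreement, average-measures ICC from the irr package; the exact variance-components model used in the study may differ in detail.

```r
# Test-retest agreement and reliability for one group of health states.
# "test" and "retest" are utility vectors from participants with valid
# responses on both occasions.
library(irr)  # provides icc()

reproducibility <- function(test, retest, mcid = 0.05) {
  d   <- retest - test
  sem <- sd(d) / sqrt(2)                         # standard error of measurement
  sdc <- 1.96 * sqrt(2) * sem / sqrt(length(d))  # average-measures SDC

  # Two-way model, absolute agreement, average measures: accounts for
  # systematic differences between the test and retest sessions
  fit <- icc(cbind(test, retest), model = "twoway",
             type = "agreement", unit = "average")

  list(sdc                  = sdc,
       agreement_adequate   = sdc < mcid,        # SDC below the MCID of 0.05
       icc                  = fit$value,
       reliability_adequate = fit$value >= 0.70)
}
```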

“Responsiveness” reflects the ability of a tool to detect clinically important changes and can be quantified using Guyatt’s responsiveness index [36]. This index is proportional to the ratio of the MCID to the root mean squared error of the difference between the test and retest value. If test–retest variability is small relative to the MCID, the tool is deemed responsive because meaningful changes are of greater magnitude than test–retest fluctuation [37]. Values of 0.20, 0.50, and 0.80 were interpreted as small, moderate, and large levels of responsiveness, respectively [38].
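Taking that description literally, the index can be computed as follows; this is a sketch under the stated assumption, as published formulations of Guyatt’s statistic scale the denominator in slightly different ways.

```r
# Guyatt's responsiveness index: the MCID divided by the root mean squared
# error of the test-retest differences.
guyatt_ri <- function(test, retest, mcid = 0.05) {
  d <- retest - test
  mcid / sqrt(mean(d^2))
}
# Interpretation per the text: 0.20 small, 0.50 moderate, 0.80 large
```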

Statistical analysis was performed using the statistical programming language R [39]. The distributions of age (Kruskal–Wallis test) and sex (Chi-squared test) were compared between participants providing valid and invalid responses on the test and retest, with valid responses defined as above (decreasing utilities assigned to the singly, triply, and fully dysfunctional states). Reproducibility, agreement, reliability, and responsiveness were measured only for participants providing valid responses on both the test and the retest. Since the SOAP tool is intended for measuring average utilities from the general public, average measures (rather than individual measures) of smallest detectable change, ICCs, and Guyatt’s responsiveness indices were calculated [40].

3 Results

Of 285 participants who completed utility valuations in the emergency department, only 113 (39.6%) completed the retest. Of these 113 participants, 92 (81.4%) provided valid responses on the first test, and 75 (66.4%) provided valid responses on both the test and the retest (Table 2). The response validity pattern was not associated with age (p = 0.2336) or sex (p = 0.971) (Table 2). Only data from participants providing valid responses on both the test and the retest were used for the reproducibility and responsiveness analyses. Seven respondents skipped at least one scenario during the test and were classified as providing invalid responses. Only one respondent skipped a question during the retest, and their responses were also classified as invalid.

Table 2 Characteristics of participants stratified by response pattern

Agreement for all groups of health states was adequate since their smallest detectable change was less than the MCID of 0.05 (Table 3). Mean ICCs were all > 0.8, indicating almost perfect reliability, and all ICCs were significantly greater than the pre-specified adequacy threshold of 0.7 (Table 3). Guyatt’s responsiveness indices all exceeded 0.80, indicating large responsiveness for the utility valuation (Table 3) [38].

Table 3 Agreement, reliability, and responsiveness measurements

4 Discussion

Utility valuation studies are traditionally conducted using face-to-face interviews, phone interviews, or postal surveys. These modes of administration have undergone psychometric validation. Web surveys are increasingly used for utility valuation and usually use custom and proprietary valuation tools that have not been psychometrically validated. It would be beneficial and efficient for investigators to be able to build disease-specific modules on a common platform that has been used to develop modules with acceptable psychometric properties.

We developed a new platform called the SOAP (Appendix 1 and 2 in the ESM). For the first application of this platform, we developed a module for MESCC health states. The SOAP platform met published benchmarks for reproducibility (both agreement and reliability) and responsiveness for utility measurement. This study demonstrated that the SOAP platform can be used to develop modules with acceptable psychometric properties.

In total, 81.4% of participants provided valid responses on the first test, and 66.4% provided valid responses on both the test and the retest. These results should be considered in the context of other ex ante valuation studies reported in the literature. We classified a participant’s responses as valid if their utility valuations decreased with increasing dysfunctional attributes in the health state. For example, if a participant valued the fully dysfunctional health state higher than the singly dysfunctional health state, their responses were classified as invalid. This definition of validity is termed “logical consistency” and has been used in traditional general population ex ante utility valuation studies of EQ-5D-3L health states.

Logical consistency rates for face-to-face valuations have been reported for the UK and the Netherlands [41, 42]. In the UK study, 12 pairs of health states per participant could be evaluated for logical consistency, and the median rate of logical consistency per participant ranged from 83.8 to 91.7%. In the Dutch study, 87.6% of participants provided at least one pair of logically inconsistent valuations. Postal surveys conducted in the USA and New Zealand reported at least one logically inconsistent pairing in 88% and 79% of participants, respectively [43, 44]. With 81.4% of participants providing a valid response (18.6% providing a logically inconsistent response), the logical consistency rate for the SOAP MESCC module was similar to that of traditional population studies. Logical consistency has also been assessed for other self-administered general population ex ante utility valuation studies of EQ-5D-3L health states over the internet [19, 45, 46]. Each study reported a logical consistency rate < 70%.

Compared with the SOAP MESCC module, the face-to-face, postal, and web-based EQ-5D-3L utility valuation studies required greater cognitive effort because participants rated more health states (between five and ten) that were also more complex (five attributes with three levels of dysfunction). Furthermore, these studies did not provide error checking, whereas the SOAP MESCC module notified participants of a logical error if they rejected a lottery with a 100% chance of success. Considering these differences, a logical consistency rate of 81.4% on the first test with the SOAP MESCC module is consistent with the literature.

Valuing MESCC health states using the classical standard gamble is problematic for two reasons. First, the classical standard gamble uses perfect health as a top anchor, which is an unrealistic outcome for metastatic cancer. Second, the classical standard gamble considers timeless (i.e. perpetual) health states, which is incongruent with the metastatic cancer disease process. To make the standard gamble more realistic, we characterized perfect health as the absence of dysfunctions and restricted all health states (including the top anchor) to a survival period of 5 years. These modifications may affect the interpretation of our results relative to classic utility assessment.

Utilities are typically estimated for specific health states and are used to weight the time in such health states. Consequently, a utility value for a specific state is typically considered “timeless,” that is, utilities are usually assumed not to change with time spent in a health state [47]. As a reflection of this, the duration of time spent in a probe health state is not specified in the classical standard gamble [5]. For MESCC health states, we were concerned that the most severe health states would connote poor survival and therefore confound the measurement of HRQoL with quantity of life in the standard gamble. To alleviate this difficulty, we explicitly stated a 5-year duration for each health state, which was the longest survival observed in a randomized controlled trial of treatments for MESCC [27]. This approach has also been used in other utility valuation studies for cancer health states [48]. This modification to health state descriptions should not affect results because the standard gamble (and all other utility-elicitation methods) relies on the “utility independence” assumption [49]. Under this assumption, if a health state has a utility of \(x\), the utility of this health state for 5 years should still be \(x\). Unfortunately, a systematic review concluded that individuals tend not to satisfy the utility independence assumption, with no consistent pattern of violation [50]. We are unaware of any algorithm to convert utilities for a fixed period of time to “timeless” utilities. Consequently, the utilities measured in this study may not be directly comparable to utilities obtained using the classical standard gamble.
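One common way to formalize the utility independence assumption (our notation, not reproduced from [49]) is that the utility of spending a duration \(t\) in health state \(q\) factors into independent quality and duration terms:

\[ U(q, t) = u(q)\,w(t). \]

Under this factorization, with death assigned a utility of 0, the standard-gamble indifference condition for a probe state becomes \(u(q)\,w(5) = p\,u(\mathrm{full})\,w(5)\), so the elicited probability \(p = u(q)/u(\mathrm{full})\) does not depend on the fixed 5-year duration.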

A strength of our study is that we built on the work conducted by the EORTC MESCC working group to ensure the attributes in the MESCC module were appropriate and representative of the MESCC disease process. A limitation of our study is that we did not assess criterion validity by comparing utilities obtained by SOAP MESCC and a “gold standard” [32]. This could be done by having patients with MESCC value their own health using the SOAP MESCC module and comparing these utility valuations with those derived from a generic health questionnaire. We did not have the resources to conduct such a study. Furthermore, measures of logical validity, reproducibility, and responsiveness are more relevant than MESCC criterion validity to investigators considering developing modules for new diseases.

To our knowledge, this is the first validated open-source, web-based, self-directed utility valuation module. For the first application of the SOAP platform, we developed a module for MESCC health states. We have demonstrated the SOAP MESCC module is valid, reproducible, and responsive for obtaining ex ante utilities. Considering the successful psychometric validation of the SOAP MESCC module, other investigators can consider developing modules for other diseases where direct utility valuation is needed.