Response-shift effects in neuromyelitis optica spectrum disorder: a secondary analysis of clinical trial data

Background Researchers have long posited that response-shift effects may obfuscate treatment effects. The present work investigated possible response-shift effects in a recent clinical trial testing a new treatment for Neuromyelitis Optica Spectrum Disorder (NMOSD). This pivotal trial provided impressive support for the drug Eculizumab in preventing relapse, but less strong or null results as the indicators became more subjective or evaluative. This pattern of results suggests that response-shift effects are present. Methods This secondary analysis utilized data from a randomized, double-blind trial evaluating the impact of Eculizumab in preventing relapses in 143 people with NMOSD. Treatment arm and then relapse status were hypothesized ‘catalysts’ of response shift in two series of analyses. We devised a “de-constructed” version of Oort structural-equation modeling using random-effects modeling for use in small samples. This method begins by testing an omnibus response-shift hypothesis and then, pending a positive result, implements a series of random-effects models to elucidate specific response-shift effects. Results In the omnibus test, the ‘standard quality-of-life (QOL) model’ captured the experience of the Placebo group substantially less well than that of the Eculizumab group. Treatment-related recalibration and reconceptualization response-shift effects were detected. Detected relapse-related response shifts included recalibration, reprioritization, and reconceptualization. Conclusions Trial patients experienced response shifts related to treatment- and relapse-related experiences. Published trial results likely under-estimated Eculizumab vs. Placebo differences due to recalibration and reconceptualization, and relapse effects due to recalibration, reprioritization, and reconceptualization. This novel random-effects-model application builds on response-shift theory and provides a small-sample method for better estimating treatment effects in clinical trials.
Electronic supplementary material The online version of this article (10.1007/s11136-020-02707-y) contains supplementary material, which is available to authorized users.


Introduction
Despite the advantages of rigorous clinical trial designs in providing unbiased estimates of treatment outcomes, these designs may also lead to somewhat paradoxical findings. For example, a treatment may have an unarguable benefit on objective outcomes but a less clear impact on more subjective outcomes. Research on response-shift effects provides a theory-driven and empirically testable path toward understanding such paradoxes. "Response shift" refers to the idea that when individuals experience a change in health status, they may change their internal standards, values, or conceptualization of a target construct like "quality of life" (QOL) [1,2]. Over the past two decades, research in a broad range of therapeutic areas has supported that response-shift effects can influence clinical research findings, and can represent positive and negative adaptation [3-12]. While response-shift effects are generally small, they can influence study conclusions and are thus not inconsequential [3,4].
The current methods for detecting response-shift effects [1,13,14] work with the idea that unexpected levels of QOL scores reflect adaptation [9,11,12,15-20]. For example, if a clinician-assessed outcome does not agree with a patient-reported outcome (PRO), this "discrepancy" may signal patients' changes in internal standards, values, or conceptualization of the target construct (e.g., QOL) [17]. Rather than suggesting that either the clinician-assessed or the patient-reported outcome is flawed or biased, this discrepancy suggests that there is 'more than meets the eye,' and that a deeper investigation of the situation is warranted. A recent study of people with spinal cord injury (SCI) reported that while objective measures of motor and cognitive function had stabilized one to five years post-injury, patient-reported outcomes reflected recalibration and reconceptualization response shifts [5]. Specifically, patients experienced improvements in physical functioning primarily by dint of improvements in physical role performance over time (recalibration) [5]. They also appeared to change their conceptualization of QOL over time such that over the long-term follow-up, the people with SCI stopped considering their SCI per se as part of their general health, and instead only considered SCI sequelae as part of their general health [5]. These response-shift effects may be important in understanding the full range of dynamics that matter for QOL, as in for example, measures of clinical significance in other patient populations such as with multiple sclerosis and spinal disorders [21,22].
Researchers have long posited that response-shift effects may obfuscate treatment effects. A substantial number of articles have discussed the importance of response-shift effects in clinical trials (e.g., [23,24]), and several studies have tested for response-shift effects in clinical trials [6,11,25-29]. Several of these studies used the then-test method, a method prone to recall bias and lack of specificity which challenges interpretation [30-32]. One of the studies used the relative-importance method to evaluate reprioritization response shifts [11], and two of these studies either combined the then-test with the Schedule for the Evaluation of Individualized QOL (SEIQOL) individualized method [27] or used another individualized metric, the Patient-Generated Index (PGI) [29]. These latter two studies thus provide a fuller, qualitative context to the respondents' changes in priorities and conceptualizations of QOL. The metrics are, however, difficult to harness in quantitative metrics that can help to interpret trial outcomes in comparison to metrics that ignore response-shift effects.
The present work aimed to investigate possible response-shift effects in a recent clinical trial (n = 143) testing a new treatment for Neuromyelitis Optica Spectrum Disorder (NMOSD) [33]. This uncommon but severe form of demyelinating disease is a relapsing, autoimmune, inflammatory disorder that typically affects the optic nerves and spinal cord, leading to blindness and paralysis [34]. Often initially misdiagnosed as having multiple sclerosis, NMOSD patients face a frightening trajectory of severe relapses that leave residual neurologic disability and bring about unpredictable and disabling future attacks [35].
This NMOSD clinical trial provided impressive support in preventing relapse (primary outcome) for the drug Eculizumab. It also provided support for Eculizumab on the more objective secondary outcomes, which were clinician-assessed indicators as well as the EQ-5D utility measure of health state [36]. There was, however, less strong support as the indicators became more subjective or evaluative (e.g., EQ-5D visual analogue scale or EQ-5D VAS indicator of global health; evaluative physical functioning), with null results for the evaluative self-report measure of mental functioning [37]. This pattern of results led us to hypothesize that response-shift effects are present.
Response-shift methods for detecting effects in secondary analyses often rely on relatively large sample sizes [38]. For example, the abovementioned SCI study used Oort's Structural Equation Modeling (SEM) [10,39], a well-vetted method that has been used in a number of secondary analyses of observational data [4,10,40-42]. This approach provides an ordered series of steps that test for response-shift effects, with later steps conducted only if earlier steps pass muster. The sample for the present study is too small for an Oort SEM analysis. Instead, we devised a "de-constructed" approach that is more appropriate for use with small samples. This method begins by testing an omnibus response-shift hypothesis and then implements a series of analyses aimed to elucidate what is uncovered in the omnibus test. We investigated possible effects first for Treatment Arm as a 'catalyst' of response shift due to the substantial health-state changes that differentiated the two groups [2,17]. We then examined Relapse Group as a catalyst, to better understand these findings.

Sample and trial procedure
This secondary analysis utilized data from a randomized, double-blind, time-to-event trial evaluating the impact of Eculizumab in preventing relapses in 143 people with NMOSD. Eligible participants were patients of age 18 years or older, who had a diagnosis of NMOSD or neuromyelitis optica chronic medical condition. This international trial recruited from 80 sites across four continents. Figure 1 provides the timing of clinician- and patient-reported outcome collection over the course of the trial. (For complete details on trial inclusion and exclusion criteria and procedures, see reference [33].) The trial was conducted in accordance with the provisions of the Declaration of Helsinki, the International Conference on Harmonization guidelines for Good Clinical Practice, and applicable regulatory requirements. The trial was approved by the institutional review board at each participating institution. All the patients provided written informed consent before participation.

Measures
For the present analysis, we included information about treatment arm (i.e., Eculizumab vs. Placebo) as well as the following clinician- and patient-reported outcome data and information about relapse.
Clinician-assessed outcomes. Clinicians who were blind regarding trial-group assignment rated patients' disability on the Kurtzke Expanded Disability Status Scale (EDSS) [43]. This standard neurological outcome tool assigns scores based on eight Kurtzke Functional Systems that include signs of disability (pyramidal, cerebellar, brainstem, sensory, bowel/bladder control, visual, cerebral, and other). The EDSS score ranges from 0 (no disability) to 10 (death). Treating clinicians or appropriately trained staff members evaluated patients using the modified Rankin scale (MRS) [44], which assesses the degree of dependence in daily activities; scores range from 0 (no disability) to 6 (death). The Hauser Ambulation Index (HAI) [45] focuses on mobility disability by assessing how much time and degree of assistance is needed to walk 25 feet. Its scores range from 0 to 9, with higher scores indicating increased impairment.
PROs. Patients completed the European Quality of Life 5-Dimension 3-Level (EQ-5D-3L) questionnaire [36]. For the purpose of this study, we included the EQ-5D VAS item, a subjective global score of self-reported health ranging from 0 (worst imaginable health) to 100 (best imaginable health). The Short-Form-36v2 (SF-36v2™) [46] is a generic evaluative measure of functional health that includes eight domain scores (general health, physical functioning, physical role performance, social functioning, emotional role performance, mental health, pain, vitality) that are summarized with the Physical Component Score (PCS) and Mental Component Score (MCS). The norm-based scoring system of the SF-36™ ranges from 0 to 100, with a normative mean
On-trial relapse was defined as a patient's new onset of neurologic symptoms or worsening of new neurologic symptoms if those symptoms persisted for more than 24 h, were attributed to NMO, and were preceded by at least 30 days of clinical stability.
On the basis of the neurological exam and the OSIS, the treating clinician and a blinded examining clinician judged the severity of the relapse ("Clinician-Assessed Relapse"). A 'major' relapse was defined as an increase in 2-3 points in OSIS Visual Acuity (depending on whether the patient started with a score of 2-7 or 0-1, respectively); and an increase of 2-3 points on the OSIS Motor Subscale (depending on whether the patient started with a score of 2-6 or 0-1, respectively). Any loss in proprioception on the OSIS Sensory Subscale was considered 'major.' An independent panel of three experts (two neurologists and one neuro-ophthalmologist) who were blinded to treatment assignment then adjudicated the relapse by considering information from the Clinician-Assessed Relapse, Magnetic Resonance Imaging data, Optical Coherence Tomography imaging data, and the recorded exam. This adjudication process was intended to strengthen the robustness of the trial's primary endpoint by reducing error variance due to (a) geographic differences in standards of care; and (b) a potential bias toward over-reporting a neurologic event as an on-trial relapse to mitigate potential long-term sequelae of a missed relapse. There were 43 patients with Clinician-Assessed Relapses, of whom 22 were adjudicated positively (i.e., categorized in the Adjudicated-Relapse group).

Statistical analysis
This secondary analysis of NMOSD trial data examined evidence of response-shift effects in trial outcomes. We began by focusing on differences by treatment arm (Eculizumab vs. Placebo) and then examined differences by relapse status. Relapse status was defined as a three-level variable (No Relapse, Clinician-Assessed Relapse, Adjudicated Relapse). This variable allowed us to test relationships that had more power (due to larger sample size than simply comparing the Adjudicated-Relapse and No-Relapse groups), and that differentiated more subjective indicators of signal (i.e., Clinician-Assessed) from more objective indicators (i.e., Adjudicated).
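The three-level relapse variable can be illustrated with a minimal sketch. The function name and the assumption that the three levels are mutually exclusive (so that a positively adjudicated relapse is not also counted as Clinician-Assessed) are introduced here for illustration and are not confirmed by the trial documentation; the counts are those reported above.

```python
# Hypothetical coding of the three-level relapse-status variable.
# Assumption (not stated in the source): the levels are mutually exclusive,
# so an adjudicated relapse is not also counted as Clinician-Assessed.

def relapse_status(clinician_assessed: bool, adjudicated: bool) -> str:
    """Map the two relapse flags onto one three-level label."""
    if adjudicated:
        return "Adjudicated Relapse"
    if clinician_assessed:
        return "Clinician-Assessed Relapse"
    return "No Relapse"

# Counts reported in the text: 143 patients, 43 with a Clinician-Assessed
# Relapse, of whom 22 were adjudicated positively.
n_total, n_clinician, n_adjudicated = 143, 43, 22
counts = {
    "No Relapse": n_total - n_clinician,
    "Clinician-Assessed Relapse": n_clinician - n_adjudicated,
    "Adjudicated Relapse": n_adjudicated,
}
print(counts)
```

Under this assumed coding, both relapse levels contribute to the comparison, which is the sense in which the variable offers more power than contrasting only the Adjudicated-Relapse and No-Relapse groups.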
These analyses aimed to "de-construct" different aspects of measurement invariance in the context of a small sample to characterize recalibration, reprioritization, and reconceptualization response shifts. We proceeded in four steps.
Step 1: Hypothesis-driven group differences in expected-observed discrepancy scores. This step tested the 'omnibus response-shift hypothesis' that there are differences between expected and observed QOL scores ("discrepancy scores") as a function of the hypothesized response-shift catalyst (i.e., treatment arm and then whether the person ultimately had a relapse). If this omnibus test does not support a response-shift effect, then subsequent analytic steps would not be implemented.
To examine discrepancy-score differences by catalyst group, we used the Rapkin and Schwartz residual-modeling approach [17,47]. We began by computing a principal component from the PRO scores, including the EQ-5D VAS and the 8 domain scores of the SF-36™ at all time points (Supplementary Table 1). This analysis enabled summarizing the PRO scores in one component score using data from all time points (see Results section for details). If this analysis had not supported the existence of one dominant component, we would have reduced the scores included such that they were well represented by a unidimensional component score. We then used this component score as the dependent variable in a random-effects model [48] that included the following demographic and clinical predictors at all available time points: gender, race, country, ethnicity, age, number of years since diagnosis, number of years since NMOSD-presenting symptoms, body mass index, and treatment compliance; and scores on the MRS, HAI, EDSS, and KFS. We saved the residuals from this model, and then tested models predicting these residuals (i.e., scores capturing the discrepancy between expected and observed outcomes) using hypothesized response-shift catalyst groups as the independent variable. Paneled histograms illustrate catalyst-group differences in the discrepancy scores.
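The residual-modeling logic can be sketched on synthetic data. This is a simplified illustration, not the authors' code: it substitutes ordinary least squares for the random-effects model, uses made-up covariates and a made-up two-level catalyst group, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT trial data): 200 observations, 9 PRO scores
# driven by one shared factor, plus 3 illustrative covariates.
n_obs, n_pro = 200, 9
covariates = rng.normal(size=(n_obs, 3))
group = rng.integers(0, 2, size=n_obs)   # hypothetical catalyst group (0/1)
shared = covariates @ np.array([0.5, -0.3, 0.2]) - 0.8 * group + rng.normal(size=n_obs)
pro = shared[:, None] + rng.normal(scale=0.8, size=(n_obs, n_pro))

# 1. First principal component of the standardized PRO scores.
z = (pro - pro.mean(axis=0)) / pro.std(axis=0, ddof=1)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
first = eigvecs[:, -1]                   # eigh sorts eigenvalues ascending
if first.sum() < 0:                      # fix the arbitrary sign: higher = better
    first = -first
component = z @ first

# 2. "Standard model": predict the component from covariates only
#    (OLS here as a simplification of the random-effects model).
X = np.column_stack([np.ones(n_obs), covariates])
beta, *_ = np.linalg.lstsq(X, component, rcond=None)

# 3. Discrepancy scores = observed minus expected.
discrepancy = component - X @ beta

# 4. Compare discrepancy scores by hypothesized catalyst group.
median_by_group = {g: float(np.median(discrepancy[group == g])) for g in (0, 1)}
print(median_by_group)
```

Because the catalyst group is deliberately left out of the standard model, any systematic group effect remains in the residuals, which is what the omnibus test then examines.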
If these results suggested that there were response-shift effects, steps 2-4 would then examine evidence of recalibration, reprioritization, and reconceptualization response shifts, respectively. Random-effects models [48], and more specifically random-intercept models, were used to examine longitudinal differences in patterns of emphasis by catalyst group-whether PCS and MCS differed by treatment arm or relapse group in their ability to explain EQ-5D VAS scores; and whether such dynamics changed over time.
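One way to write the kind of random-intercept model described above is given below, as a sketch rather than the authors' exact specification. The notation is an assumption introduced here: i indexes patients, t indexes visits, G_i is the catalyst group (treatment arm or relapse status), and u_i is the patient-level random intercept.

```latex
\begin{aligned}
\mathrm{VAS}_{it} ={}& \beta_0 + u_i + \beta_1\,\mathrm{PCS}_{it} + \beta_2\,\mathrm{MCS}_{it}
                      + \beta_3\,G_i + \beta_4\,t \\
&+ \beta_5\,(G_i \times \mathrm{PCS}_{it}) + \beta_6\,(G_i \times \mathrm{MCS}_{it}) \\
&+ \beta_7\,(G_i \times t \times \mathrm{PCS}_{it}) + \beta_8\,(G_i \times t \times \mathrm{MCS}_{it}) \\
&+ (\text{remaining two-way terms}) + \varepsilon_{it},
\qquad u_i \sim N(0, \sigma_u^2), \quad \varepsilon_{it} \sim N(0, \sigma_\varepsilon^2)
\end{aligned}
```

In this notation, the two-way coefficients (β5, β6) express group differences in patterns of emphasis, and the three-way coefficients (β7, β8) express whether those differences change over time.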

Handling of missing data
There was very little missing data in this data set, and the variables we used in our modeling had no missing data.

Sample
The study sample included 143 people, of whom 107 had Definitive Neuromyelitis Optica and 36 had NMO Spectrum Disorder (Table 1). Two-thirds of the sample was on Eculizumab and one-third on placebo, and the sample evinced high levels of treatment adherence. The sample had a mean age of 44 and a mean age of diagnosis of 41. The sample was predominantly female. Each patient had between three and 23 clinician visits during the trial, and each spent between two and 30 months under study. Table 2 displays the descriptive statistics of baseline scores on the clinician- and patient-reported outcomes. On average, the sample had 'slight disability' on the MRS, and scores on the HAI and EDSS consistent with some gait abnormalities, but not enough to prevent independent walking. The sample's average SF-36™ PCS score was substantially below norm-based means; the MCS score was slightly but not significantly below norm-based means. The biggest decrements on the SF-36™ domain scores were in physical functioning and physical role performance. On the EQ-5D VAS, mean scores reflected substantial health impairment. The Self-Care domain of the EQ-5D evinced the greatest decrement. Figure 2 shows the mean change from baseline on the SF-36™ domains by treatment arm. The Eculizumab group evidenced bigger changes in the SF-36™ physical domains compared to the Placebo group, which showed larger changes in the mental domains.

Component score used for creating discrepancy scores
Supplementary Table 1 shows the loadings of each of the PROs used in the PCA. The PRO data from all time points were effectively captured in one component score (successive eigenvalues = 4.95, 0.96, and 0.85; successive variances explained = 55%, 10.7%, and 9.4%). Fig. 3 shows the distribution of the discrepancy scores in the entire sample. The distribution was centered around zero, and slightly left-skewed.
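The reported variances explained follow from the eigenvalues if one assumes the PCA was run on the correlation matrix of the nine PRO scores (EQ-5D VAS plus eight SF-36™ domains), in which case the eigenvalues sum to 9. That assumption is ours; the short check below only shows that it reproduces the reported percentages.

```python
# Assumption: PCA on the correlation matrix of 9 standardized PRO scores,
# so total variance equals the number of variables (9).
n_variables = 9
eigenvalues = [4.95, 0.96, 0.85]   # first three eigenvalues reported in the text

pct_explained = [round(100 * ev / n_variables, 1) for ev in eigenvalues]
print(pct_explained)
```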

Treatment arm as catalyst
Step 1: Treatment arm differences in expected-versus-observed discrepancy scores
The Kruskal-Wallis non-parametric test revealed differences in the central tendencies of the distributions of the discrepancy score by treatment arm (test statistic = 108.40, df = 1, p < 0.0005). The placebo group had a systematically lower median (Fig. 4). For the Eculizumab patients, the discrepancy score was generally close to zero. A sensitivity test was done omitting one low-scoring outlier in the Placebo group and the results were essentially unchanged.
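For readers less familiar with the Kruskal-Wallis test, the sketch below computes the H statistic from ranks and converts it to a p-value using the usual chi-square approximation. It is a simplified illustration (no tie correction, closed-form tail probabilities for 1 or 2 degrees of freedom only); the function names are ours, and we assume, without confirmation from the source, that the reported p-values rest on the chi-square approximation.

```python
import math
import numpy as np

def kruskal_wallis_h(*samples):
    """Return (H, df) for the Kruskal-Wallis test, assuming no tied values."""
    data = np.concatenate(samples)
    n = data.size
    ranks = np.empty(n)
    ranks[np.argsort(data)] = np.arange(1, n + 1)   # rank 1 = smallest value
    total, start = 0.0, 0
    for s in samples:
        r = ranks[start:start + len(s)]              # ranks of this group
        total += r.sum() ** 2 / len(s)
        start += len(s)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    return h, len(samples) - 1

def chi2_sf(x, df):
    """Upper-tail chi-square probability (closed forms for df = 1 and 2)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only covers df = 1 or 2")

# Toy example: two fully separated groups of three values each.
h, df = kruskal_wallis_h([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
p_value = chi2_sf(h, df)
```

Applied to the statistic reported above, `chi2_sf(108.40, 1)` is far below 0.0005. The same helper applied to the relapse-group statistic reported later (14.87, df = 2) gives roughly 0.0006, which rounds to the reported p = 0.001.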
Step 2: Treatment arm differences in patterns of emphasis
Table 3 shows the results of random-effects models assessing differences in patterns of emphasis in the trial participants.
There were significant two-way interactions between treatment arm and PCS and MCS scores, such that the Placebo patients had a greater emphasis on PCS and a lesser emphasis on MCS in their EQ-5D VAS scores as compared to the Eculizumab patients.
Step 3: Treatment arm differences in changes over time in patterns of emphasis
There were no significant three-way interactions for Treatment Arm with time and PCS or MCS (Table 3). These results suggest that differences in patterns of emphasis did not change over time. Residuals overall and for each group were non-normal (p < 0.0005 for each treatment arm) due to skewness (−0.57 and −0.39, for Placebo and Eculizumab, respectively).

Step 4: Group differences in conceptualization of QOL
Table 4 shows results of the series of random-effects models aimed at clarifying how each domain's relationship with Treatment Arm varied across models when adjusting for all the other domains. These models suggested that the Placebo group was associated with substantially worse-than-expected EQ-5D VAS and Vitality scores. None of the other seven SF-36™ domain scores had statistically important relationships with Treatment Arm after adjusting for the other QOL domain scores.
Because Eculizumab was highly effective at preventing relapse, we hypothesized that the response-shift effects related to treatment arm overwhelmingly reflected the impact of relapse on patients. We thus investigated response-shift effects by relapse status using the same series of analyses.

Relapse group as catalyst
Step 1: Relapse-group differences in expected-versus-observed discrepancy scores
The Kruskal-Wallis non-parametric test supported that there were relapse-group differences in the discrepancy-score distributions (test statistic = 14.87, df = 2, p = 0.001). Figure 5 shows the distribution of discrepancy scores by relapse group. For No-Relapse patients, the discrepancy score was generally close to zero. For the Clinician-Assessed Relapse and Adjudicated-Relapse groups, the score varied more widely. Post hoc pairwise comparisons revealed that the Adjudicated-Relapse group had substantially larger and more-negative discrepancy scores than the other groups
Step 2: Relapse-group differences in patterns of emphasis
Table 4 shows results of random-effects models assessing differences in patterns of emphasis in the trial participants. There was a significant two-way interaction between Adjudicated Relapse and PCS, and
Step 3: Relapse-group differences in changes over time in patterns of emphasis
There were significant three-way interactions for Relapse-by-time-by-PCS and Relapse-by-time-by-MCS (b = −0.01 in both cases; p = 0.02 and 0.01, respectively), after adjusting for main effects and two-way interactions (Table 5).
These results suggest that although PCS and MCS are more important in accounting for EQ-5D VAS for people who had an adjudicated relapse than for people with no relapse, this difference attenuates over time. Residuals overall and especially for the no-relapse group were non-normal (p < 0.0005 and 0.0005, respectively) due to skewness (−0.48 and −0.52, respectively). For the adjudicated and clinician-assessed relapse groups, the residuals were normally distributed (p = 0.05 and 0.50, respectively).
Step 4: Relapse-group differences in conceptualization of QOL
Table 6 shows results of the series of random-effects models aimed at clarifying how each domain's relationship with relapse status varied across models when adjusting for all the other domains. These models suggested that relapse status was associated with substantially worse-than-expected EQ-5D VAS scores for both Clinician-Assessed and Adjudicated-Relapse groups, after adjusting for the 8 SF-36™ domain scores. In other words, in contrast to the SF-36™ domain scores, EQ-5D VAS scores uniquely discriminated Relapse-Group deficits. On the other hand, people who had a Clinician-Assessed Relapse had slightly better than expected Social-Function scores. In other words, Social-Function scores uniquely revealed a strength of this Group. None of the other seven SF-36™ domain scores had statistically important relationships with relapse status after adjusting for the other QOL domain scores.
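The Step 4 logic (one model per QOL score, each adjusting for all the other scores) can be sketched as follows. This is an illustration on synthetic data using ordinary least squares in place of random-effects models; the score names, the 0/1 group indicator, and the built-in group deficit on one score are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
group = rng.integers(0, 2, size=n).astype(float)   # hypothetical 0/1 catalyst group
common = rng.normal(size=n)                        # shared QOL factor
names = ["VAS", "PF", "MH", "SF"]                  # illustrative subset of scores
scores = {k: common + rng.normal(scale=0.7, size=n) for k in names}
scores["VAS"] = scores["VAS"] - 1.0 * group        # built-in unique group deficit

def unique_group_effect(target):
    """Group coefficient for `target`, adjusting for all other scores (OLS)."""
    others = [scores[k] for k in names if k != target]
    X = np.column_stack([np.ones(n), group, *others])
    beta, *_ = np.linalg.lstsq(X, scores[target], rcond=None)
    return float(beta[1])                          # column 1 is the group indicator

effects = {k: round(unique_group_effect(k), 2) for k in names}
print(effects)
```

In this toy setup, the score carrying the built-in deficit shows by far the largest adjusted group coefficient, which mirrors the sense in which one score can "uniquely discriminate" a group difference.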

Discussion
This secondary analysis of clinical trial data revealed that not receiving active treatment and, more specifically, the experience of relapse made people change their thinking about QOL (see summary in Table 7). The implications of such changes for interpreting treatment effects may be substantial. Our results suggest that the QOL impacts of placebo/relapse were under-estimated by the usual analyses, and thus the benefit of Eculizumab is likely even greater than what was documented in the pivotal clinical trial [33], extending to subjective outcomes. Of note, the whole study sample started the trial with close-to-normal scores on the MCS, despite decidedly low scores on the PCS and EQ-5D VAS. Thus, despite having dealt with the vicissitudes of NMOSD for an average of 4 years, the participants managed to maintain prior to the trial a relatively normal level of mental-health functioning. In this they also managed to maintain stability over the course of the trial, regardless of treatment arm. This paradox is consistent with response-shift theory, which posits that    [2,17]. Our findings likely reflect the 'shadow' of response shift, inferred by the behavior of examined interactions and unique variance explained rather than characterized more directly. People on placebo and/or people who had a relapse are thinking differently about health due to their experiences.
The relapse experience appears to reflect less and less that which is assessed by the SF-36™ generic functional health indicators, and so assessment of more constructs would be required to delineate exactly what 'health' means after relapse. For example, 'health' may have more to do with purpose in life or meaningful social connections, concepts measured by the Ryff Psychological Well-Being scale [49,50]. Including measures of cognitive appraisal [51,52] would also facilitate more direct characterization of the response-shift effects. Nevertheless, in the absence of other such measures, the EQ-5D VAS has clear value in this study.
The present study represents a response-shift investigation of clinical trial data using accepted analytic methods. Triggered by prior unexpected non-significant treatment differences in the more subjective domains related to mental health, here we pursued a series of analyses to explicate these patterns. These analytic steps begin by testing an omnibus response-shift hypothesis that examines the distribution of discrepancy scores by catalyst group. If this hypothesis does not support response shift, then no further analyses would be done. In our companion paper [53], we provide a method that builds on these findings to enable estimation of how response shift affects measured outcomes.
It should be noted that the residual-modeling approach specified in our analyses is distinct from Mayo's 2008 method [7]. While Mayo's 2008 method also works with residuals, the Rapkin and Schwartz method [17] explicitly computes a 'standard model' that includes all available antecedents, and saves the residuals (i.e., discrepancies), which are then used as the dependent variable in hypothesis-driven analyses. Once the response-shift omnibus hypothesis is supported (i.e., the aforementioned discrepancies differ by catalyst group), the method presented in this article then implements a series of random-effects models to test response-shift effects operationalized in ways similar to the Oort SEM method. If measures of appraisal had been collected in the trial data, the Rapkin and Schwartz method would also examine main effects and interactions of appraisal and change in appraisal in conjunction with catalyst (i.e., treatment arm or relapse status) main effects and interactions. In contrast, the Mayo method creates residuals based on a short list of antecedents (i.e., disease severity, age, sex, and comorbidity), and then creates residual-trajectory scores which are then modeled using latent class analysis. Both methods utilize residuals to test response-shift hypotheses in interesting and informative ways, but the method used in our work is correctly identified as the Rapkin and Schwartz (2004) method [17].
The study has a number of strengths including the high-quality data on relapse, the inclusion of subjective and objective indicators, and the longitudinal follow-up with low attrition. Its limitations must, however, be acknowledged. Our results likely under-estimate response-shift effects for several reasons. First, the sample sizes of those who ultimately had an adjudicated relapse are relatively small, affording statistical power to detect only large effect sizes [54]. Accordingly, some models may be over-identified. To reach significance despite low power means more than to do so when aided by high power. This situation prevents the application of well-codified response-shift analyses using SEM that would enable us to work with collinear domain scores (using residual correlation), and to model moderation and mediation effects more robustly. The residuals from the two random-effects models were also not always normally distributed, which violates a random-effects model assumption [48,55]. Random-effects models appear, however, to be robust to such violations [56,57]. Further, the study does not include measures of certain relevant constructs such as well-being or of cognitive processes underlying patient self-report. Measures of QOL appraisal processes [51,52] would facilitate a more narrative and nuanced description of how the relapse groups differed in their frames of reference, standards of comparison, experience sampling, and patterns of emphasis [58,59]. Future research might include such cognitive-appraisal and well-being scales [49,50] in prospective clinical trials of new treatments to ensure that the patients' experience is captured over the course of the trial. Finally, most data that were collected from those who ultimately suffered a relapse were collected before that relapse. Thus, the study design afforded little opportunity for detecting Relapse-Group differences.
Despite these odds, we found such differences, perhaps suggesting that relapse patients are experiencing sub-clinical, early warning signs of a relapse. Further investigation into these early warning signs might enable interventions to delay the 'tipping point' to full relapse [60,61]. In summary, this study of response-shift effects in the Eculizumab clinical trial suggests response-shift effects by treatment arm and relapse status. Using a series of analytic steps aimed at detecting the 'shadow' or reflection of response shift, we found that, among Placebo patients and as relapse criteria became more specific and rigorous, commonly accepted clinical and demographic indicators explained less well the patients' QOL ratings. The idea of 'health' among placebo patients and among those who eventually relapsed reflected different patterns of emphasis, and these emphases changed over time for relapse patients, compared to the No-Relapse Group, even when these differences were "watered down" by the inclusion of pre-relapse data. We conclude that there are other aspects of QOL that become more important when one experiences a relapse, aspects that are not well captured in the SF-36™ and/or EQ-5D VAS. This 'shadow' of response shift may take a more definite shape when more relevant constructs are included in a study well powered to explicate the relapse experience.