Introduction to the special section: “Methodologies and considerations for meaningful change”


The determination of what constitutes a 'meaningful change' on a health outcome measure remains controversial in both methodological and applied research. Motivated by the question of how to better understand the efficacy and effectiveness of interventions, or the natural history of conditions [1,2], the concept builds on the widely held belief that statistical significance in itself is not sufficient to establish a treatment benefit [3,4]. Since health-related quality of life (HRQL) research should reflect patients' perceptions and evaluations, the topic is of immense theoretical, statistical, and practical relevance. It was therefore timely to offer a space to present discussions, methods, and questions related to this topic, even as new methods and interpretive standards emerge.
In collaboration with the Psychometrics Special Interest Group of the International Society for Quality of Life Research (ISOQOL), the editor-in-chief (JRB) developed a call for papers and selected the editorial team (AT and WRL) to take this special issue forward. The call invited submissions exploring existing and novel methods for defining meaningful change thresholds for clinical outcome assessments, such as patient- or clinician-reported outcome measures; submissions closed in April 2021. A simulated dataset, described below, was also provided to encourage researchers to evaluate different methods using the same data. The main aim of the special section was to collate a series of methodological and applied articles reflecting current thinking and developments in meaningful change research. We also wanted to encourage the practice of explicitly stating whether thresholds are intended to support between-group, within-group, or within-individual interpretations [3,5-7].
For this special section, we broadly define "meaningful change research" as the determination of guidelines for interpreting health outcome score changes or differences based on the patients' (or the target population's) perception. For a particular score difference (often described as a "threshold") to indicate a "meaningful change" over time, (i) patients (or an appropriate proxy) need to have described this score difference as directional (e.g., improved or deteriorated), and (ii) to a degree that, in their eyes, reflects a meaningful difference from the previous state (see, for example, [3,4]). A variety of methods are used to operationalize this, including anchor-based methods or qualitative evaluations of score differences that are perceived as meaningful [8].
When working towards concrete operationalizations, the level, type, and magnitude of change need to be specified. For example, it is likely inadmissible to use change thresholds based on group differences to interpret differences between individuals or within individuals over time [7], although this may be a common practice. Table 1 provides an overview of these three key considerations when classifying change, and we point out three examples:
• Minimal within-individual change over time: the smallest amount of change over time a given person must show on an individual level to be regarded as having a meaningful change (1B, 2B, 3A);
• Minimal between-group difference in change over time: the smallest difference between the changes of one group versus another group that is considered meaningful (1A, 2C, 3A);
• Minimal within-group change over time: the smallest amount of change over time a group of people must show to be regarded as having had a meaningful change (1A, 2B, 3A).

Other combinations, such as cross-sectional between-individual differences, are also used in practice [3], in addition to 'larger than minimal' thresholds [9]. Similarly, while some definitions focus on changes that 'warrant a change in a patient's management' [10], we do not consider this to be a necessity, as some studies (e.g., of natural history) do not involve treatment evaluations, yet still must establish a meaningful change. Finally, we consciously avoid the use of specific terms such as 'minimal clinically important difference' [11] or 'minimal important change' [4] within this editorial, given that these terms have been used interchangeably to describe a range of the combinations arising from Table 1. Standardized terminology is more likely to be achieved through a consensus-based approach in a large group such as the SISAQOL-IMI [12]. Until consensus is achieved, it is essential for clarity of communication that all dimensions in Table 1 are clarified in the description of a threshold, e.g., "minimal within-individual change over time".
The special section is split into two parts: the first focuses on meaningful change using clinical anchors, the second presents papers based on what are often called "distribution-based" approaches. Distribution-based approaches are typically described as (i) using measures of cross-sectional or longitudinal (often inter-individual) variability to define (ii) a minimal score difference that would be seen as exceeding the level of measurement error (or otherwise nuisance or negligible variability) given a particular psychometric model [13]. These thresholds have no connection to (external) evaluations of the "meaningfulness" of that particular score difference. It is for this reason that regulators such as the FDA have historically stated that distribution-based approaches cannot be used as the sole basis for establishing a responder definition [14]. Instead, the assumption is that score differences greater than measurement error are due to a more systematic factor or factors, hence the inference of meaning. Their singular advantage in this context is that they do not depend on finding a suitable external clinical anchor, which can be challenging for some applications, but can be calculated solely using data from the measure being evaluated. In contrast, an index of meaningful change would offer information about 'meaningfulness' either by providing information about the connection to a criterion of change or by offering a clear content-based operationalization of meaningfulness (be it qualitative or quantitative). However, when such a criterion is not available, distribution-based methods can be useful. Furthermore, the submissions on these methods to this special section were of high quality, their inclusion offers the opportunity to contrast the approaches, and the contribution of these methods is too important to leave out of a special section such as this.
Additionally, they have an established history of use for the study of individuals over time (i.e., idiographic research) to complement trends at the group level (i.e., nomothetic research) [5,15,16].
Finally, we want to thank Pip Griffiths (Digital Medicine Society; IQVIA) for providing the simulated dataset that two articles used to illustrate their approaches [17,18], and which readers may find useful for further exploring some of the issues raised in this special section. The simulated dataset comprises responses to the twelve-item 'Simulated Disease Questionnaire' for 2000 individuals at four time points. The items have four response categories, where higher scores indicate worse health (graded response model). Responses to a seven-category transition rating (i.e., global impression of change) were also simulated at the follow-up time points (for more details, please refer to https://osf.io/khmzg/).
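For readers who wish to get a feel for data of this structure before visiting the OSF repository, the following is a minimal Python sketch of simulating graded-response-model data of the same shape. All item parameters, change trajectories, and the transition-rating rule here are our own illustrative assumptions and do not reproduce the actual dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_items, n_cats, n_times = 2000, 12, 4, 4

# Hypothetical item parameters (the real dataset's parameters differ):
# discrimination a_i and ordered category thresholds b_i1 < b_i2 < b_i3.
a = rng.uniform(1.0, 2.5, n_items)
b = np.sort(rng.normal(0.0, 1.0, (n_items, n_cats - 1)), axis=1)

# Latent severity: higher theta = worse health; on average, people improve.
theta0 = rng.normal(0.0, 1.0, n_persons)
slopes = rng.normal(-0.3, 0.2, n_persons)            # person-specific change
theta = theta0[:, None] + slopes[:, None] * np.arange(n_times)

def grm_sample(theta_t):
    """Draw item responses 0..3 from a graded response model."""
    # P(X >= k) for k = 1..3; shape (persons, items, 3)
    logits = a[None, :, None] * (theta_t[:, None, None] - b[None, :, :])
    p_geq = 1.0 / (1.0 + np.exp(-logits))
    u = rng.uniform(size=(theta_t.size, n_items, 1))
    return (u < p_geq).sum(axis=2)                    # thresholds passed

responses = np.stack([grm_sample(theta[:, t]) for t in range(n_times)], axis=1)
# shape: (2000 persons, 4 time points, 12 items), item scores 0..3

# Seven-category transition rating at the follow-ups, driven by latent
# change plus rating noise (an illustrative rule, not the published one).
change = theta[:, 1:] - theta[:, [0]]
transition = np.clip(np.round(3 - 2 * change + rng.normal(0, 0.7, change.shape)),
                     0, 6).astype(int)               # 0 = much worse .. 6 = much better
```

Because the category-exceedance probabilities are ordered, a single uniform draw per item yields a valid categorical response; this is the standard inverse-CDF trick for ordinal IRT simulation.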

The special section
The response to the call was enthusiastic, with twenty-seven submissions exploring a range of conceptual and practical issues, of which fifteen are now brought together in this special section. Ten of these papers focus on meaningful change, and five papers and two letters address distribution-based indices. The focus of each paper, in terms of meaningful change versus distribution-based indices, together with a further classification of the level and type of threshold, is provided in Table 2. Two things are clear from this table. First, most papers focus on within-individual change over time. Second, several papers on meaningful change did not precisely specify the magnitude of change (minimal versus greater). In one of these cases, meaningful change was instead conceptualized in terms of hypothetical patient-perceived treatment success [19]. In another paper [20], which did specify the magnitude, the authors used the term 'minimal' to reflect ratings of 'a little better' and 'meaningful' to reflect ratings of 'better' and 'much better'. We recommend that future papers be clearer about the intended magnitude, but note that the two options for the magnitude dimension in Table 1 are not exhaustive, as options such as patient-perceived treatment success can also be of interest.

Setting the scene for the first part of the special section is a report of an online survey regarding how clinicians from different disciplines determine individual-level meaningful change on patient-reported outcome measures (PROMs) [21]. The authors investigated how oncology and mental health clinical care providers who used PROMs in the USA determine whether a patient's symptoms have changed. Most commonly, clinicians compared two consecutive scores, without a visual aid; the use of normative scores was uncommon. This research highlights the importance of aligning meaningful change research with current practice, but also shows that education in the value of interpretative tools is warranted.
The papers in this section investigate the use of anchors for the derivation of meaningful change thresholds. Anchor-based methods are the most widely applied approach for estimating meaningful change, but this does not mean they are without problems. In the second paper of this section [22], the authors highlight and discuss five important issues with anchors that should be kept in mind, rather than viewing anchor-based approaches as a perfect gold standard. This article serves as a helpful collection of methodological issues to consider when reading the collected papers. The third paper [23] illustrates a fundamental practical question for determining meaningful change thresholds, and likely for any threshold determination: how scoring rules and ranges limit the usability of group-level minimal important differences in individual-level responder definitions. Using the example of the EORTC QLQ-C30 subscales, the authors illustrate how the commonly used 10-point change may be misleading because, due to scaling, an individual cannot actually show a 10-point change on every scale. They present considerations (their Fig. 2) to further support responder threshold selection.
Moving to investigations of the effectiveness of study design and analysis approaches, the fourth paper [24] reports the results of a simulation study evaluating the importance of the strength of the correlation between the anchor and the clinical outcome assessment measuring change, varying sample size, change score variability, and anchor correlation strength in the estimation of the meaningful change threshold at the individual and group level. Using receiver operating characteristic and logistic regression analyses, they show that sample size and change score variability are key factors impacting the required anchor correlation, but that using an 'acceptable' cut-off of > 0.30 was often insufficient for accurate estimates of individual meaningful change thresholds, and always insufficient for group changes. The fifth paper [17] builds on the simulation dataset that accompanied this call to address the problem that traditional methods of evaluating within-individual change ignore floor/ceiling effects and measurement error in PROM scores and global (transition) ratings. The team combined a longitudinal graded response model with a transition item to measure latent change. The method produced tighter estimates of meaningful change than traditional methods, with the methods overlapping most when the proportion of responders was about 50% of participants. Extensions of this approach show promise for a range of applications [25,26]. The final simulation study in the first part [27] casts a view forward to the papers on distribution-based thresholds, as the team evaluated the effects of sample characteristics commonly observed in clinical trials on four anchor-based threshold selection procedures and two distribution-based ones. In a large simulation design, they found that both methodological choices and clinical characteristics influence the results and conclusions, and they suggest prioritizing study designs with strongly responsive endpoints in settings with about 50% anchor-based responders.

Table 2 Focus and classification of each paper

Paper                    Focus               Level               Type                             Magnitude
[23]                     N/A                 Individual          Change over time                 N/A
Griffiths et al. [24]    Meaningful change   Group, Individual   Change over time                 Minimal
Ho et al. [33]           Distribution-based  Individual, Group   Change over time                 N/A
Jones et al. [21]        Meaningful change   Individual          Change over time                 Not specified
Lee et al. [32]          Both                Individual          Change over time                 Minimal
Li [18]                  Distribution-based  Individual          Change over time                 N/A
Peipert et al. [30]      Distribution-based  Individual          Change over time                 N/A
Poon et al. [29]         Meaningful change   Individual          Change over time (hypothetical)  Minimal
Qin et al. [27]          Meaningful change   Individual          Change over time                 Not specified
Smit et al. [16]         Both                Individual          Change over time                 Meaningful a
Wyrwich & Norman [22]    Meaningful change   General             General                          General
Wyrwich et al. [19]      Meaningful change   Individual          Change over time (hypothetical)  Meaningful b
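The two anchor-based estimation techniques examined in this simulation study, receiver operating characteristic analysis and logistic regression, can be sketched in a few lines. The data below are invented for illustration (a 'stable' and an 'improved' anchor group with assumed means and spreads) and are not values from any paper in this section:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative change scores for anchor-defined groups (assumed values).
stable   = rng.normal(0.0,  8.0, 300)
improved = rng.normal(10.0, 8.0, 300)
change = np.concatenate([stable, improved])
anchor = np.concatenate([np.zeros(300), np.ones(300)])   # 1 = improved

# ROC: sweep candidate cut-offs, keep the one maximising Youden's J
cuts = np.unique(change)
sens = np.array([(change[anchor == 1] >= c).mean() for c in cuts])
spec = np.array([(change[anchor == 0] <  c).mean() for c in cuts])
roc_threshold = cuts[np.argmax(sens + spec - 1)]

# Logistic regression via plain Newton/IRLS iterations (no sklearn needed):
# the threshold is the change score at which P(improved) = 0.5, i.e. -b0/b1.
X = np.column_stack([np.ones_like(change), change])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (anchor - p))
logit_threshold = -beta[0] / beta[1]

print(roc_threshold, logit_threshold)   # both near the midpoint of the groups
```

Both estimators aggregate across individuals, which is why (as discussed in the editorial commentary below) they target the location of a threshold rather than a group mean change.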
Moving to empirical papers exploring questions of meaningful change, one team explored if, how, and when meaningful change in depressive symptoms occurred during a period of four months using three data sources [16]: weekly questionnaires, qualitative reports, and ecological momentary assessment (EMA; five prompts per day). The 'if' was assessed in terms of measurement error (weekly level), perceived meaningfulness (qualitative), and statistically significant changes in the modeled trajectory of symptoms. The distinction between sudden and gradual change (the 'how') and when change occurred varied considerably between methods. This research will help others evaluate what information each method can provide, alone or in combination, when designing studies to assess health changes. It also points to the potential of EMA and experience sampling to increase patient-centeredness and granularity when collecting HRQL data [28]. The use of multiple data sources also plays a key role in the three papers concluding this section. One team [20] sought to evaluate the validity of a rheumatoid arthritis flare questionnaire by examining minimal and meaningful within-individual change in patients with rheumatoid arthritis using three anchors: patient global ratings, physician global ratings, and a disease activity index. They found that patients were most likely to report meaningful improvement and physicians most likely to report meaningful worsening, with changes in either direction on the disease activity index least likely to be classified as meaningful. Another team [19] utilized a clinicians-then-patients qualitative interview methodology to understand patient priorities for treatment and a threshold to declare treatment success for adult and adolescent patients with alopecia areata and ≥ 50% scalp hair loss.
This paper details a novel qualitative method of explicitly incorporating patient input into the definition of an individual change threshold and the endpoint of percentage hair loss. The authors documented that, because patients had discussed hair loss issues extensively online, they were able to make appropriate ratings of their hair loss that were largely consistent with values provided by clinical experts. The first part ends with a qualitative study to define meaningful change in physical function after weight loss [29]. The team evaluated how much weight loss would hypothetically be meaningful for overweight and obese individuals, were they to lose weight. These individuals all agreed that a ≥ 10% weight loss would be associated with a meaningful improvement in their physical functioning, and that a one-point change at the item level of two HRQL instruments would represent a noticeable change.
The papers in the second part of the special section focus on distribution-based indices. The papers explore how these indices are affected by different definitions of the error variance, distributions, and levels of uncertainty. The first paper [30] builds upon previous work by the authors [5] proposing approaches for the identification of treatment responders, providing further justification and elaboration for the use of the coefficient of repeatability (also known as the 'smallest real difference' [31] or 'minimally detectable change' [13]) for within-individual interpretations of statistically significant change. However, rather than focusing on the conventional p < 0.05 threshold, the authors explore more liberal thresholds. This article serves as a helpful reminder that significance levels are not fixed, and that a less strict level, yielding a smaller change threshold, will be sufficient in some scenarios. In addition, the paper has two letters attached to it in this same issue, which discuss the interpretation of the attached statistical significance level and the applicability of the index to individual change classification; both discussions are also of interest for other indices and their interpretation. The second paper [32] also focuses on a version of the reliable change index and compares its use based on classical test theory and item response theory. Classical test theory assumes measurement error is constant across the scale range, whereas item response theory relaxes this assumption. The authors compare these approaches to detect change beyond measurement error, with the item response theory-based thresholds fluctuating above or below the fixed classical test theory threshold in accordance with the baseline score. Their Table 4 presents an overview of thresholds for PROMIS short-form users within oncology. Using item response models, another team [33] proposes a method for increasing the precision of measurement of within-individual change.
They build on existing approaches to quantify the error associated with individual scores derived from item response theory analyses: using plausible values, the precision of scores across the spectrum of theta (the severity of the underlying trait) can be incorporated. This can increase the accuracy of measuring intra-individual changes, which is very useful, for example, in individuals with chronic illness who need to be monitored repeatedly over time, and it provides an extension to more typical distribution-based methods.
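The distribution-based indices discussed in these papers (coefficient of repeatability, smallest real difference, minimally detectable change, and the reliable change index) share a common classical-test-theory core. A minimal sketch, with illustrative numbers rather than values from any specific instrument:

```python
import numpy as np

def mdc(sd_baseline, reliability, ci=1.96):
    """Minimal detectable change / coefficient of repeatability under
    classical test theory: ci * SEM * sqrt(2), with SEM = SD * sqrt(1 - r)."""
    sem = sd_baseline * np.sqrt(1.0 - reliability)
    return ci * sem * np.sqrt(2.0)

def reliable_change_index(pre, post, sd_baseline, reliability):
    """Jacobson-Truax style RCI: observed change divided by the standard
    error of the difference between two scores."""
    se_diff = sd_baseline * np.sqrt(1.0 - reliability) * np.sqrt(2.0)
    return (post - pre) / se_diff

# Illustrative numbers (assumed, not from any paper in the section):
print(round(mdc(10.0, 0.90), 2))                              # -> 8.77
print(round(reliable_change_index(45.0, 56.0, 10.0, 0.90), 2))  # -> 2.46
```

The item-response-theory variants discussed above replace the single, score-independent SEM with one that varies across the theta range, which is why their thresholds fluctuate around the fixed classical-test-theory value.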
All PROM scores are subject to measurement error, and using raw individual change scores does not account for this fact. The last two papers in the special section use regression and predictive frameworks to derive change metrics that also allow one to quantify the uncertainty associated with the estimate. One team [34] presents alternatives to raw change scores that were developed over 50 years ago [35,36] but have so far not been widely used or explored within patient-reported outcome research. The two approaches provide estimates of an individual's true gain after incorporating measurement error, which have both conceptual advantages and greater sensitivity compared to raw change scores. The final paper of the special section [18] compares three distribution-based methods: the reliable change index, one of its variants, and Bayesian regression models that regress post-scores on pre-scores to identify group-level change over time. The article shows that there are only small differences between the methods in detecting change when PROM reliability is high, but that none of them outperforms all others when reliability is lower. The article offers a technical discussion comparing the advantages and disadvantages of these approaches.
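A generic frequentist version of such a regression-based approach, fitting post-scores on pre-scores in a stable reference sample and flagging individuals who fall outside the resulting prediction band, can be sketched as follows. It is a simplified stand-in for the methods in these papers, and all numbers are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stable reference (e.g., test-retest) sample: post-score is pre-score
# plus regression to the mean and error only. Values are assumptions.
pre_ref  = rng.normal(50, 10, 500)
post_ref = 0.9 * pre_ref + 5 + rng.normal(0, 4, 500)

# Ordinary least squares fit of post on pre in the stable sample
b1, b0 = np.polyfit(pre_ref, post_ref, 1)        # slope, intercept
resid_sd = np.std(post_ref - (b0 + b1 * pre_ref), ddof=2)

def changed(pre, post, z=1.96):
    """Flag change beyond the ~95% prediction band of the stable model."""
    predicted = b0 + b1 * pre
    return abs(post - predicted) > z * resid_sd

print(changed(50.0, 52.0))   # small shift: within expected variability
print(changed(50.0, 65.0))   # large shift: beyond the prediction band
```

A Bayesian version of the same idea, as in the final paper, would additionally propagate the uncertainty in the fitted coefficients into the classification.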

Editorial commentary
In closing, we want to take the opportunity to highlight three topics that struck us when reading and editing the papers. A first observation is that anchor-based methods for within-individual guidelines should be based on finding a threshold separating 'no change' and 'changed' groups on the anchor. The notion of locating a threshold, lying along a continuum of perceived change, is supported by recent research [37]. As individuals will vary in their personal threshold, many methods use the mean of these individual threshold locations or otherwise derive a threshold aggregated across individuals (e.g., receiver operating characteristic curves, logistic regression, discriminant analysis; [4,38]). Similarly, the longitudinal item response model presented within this special section is designed to estimate the location of this threshold [17]. Therefore, from a theoretical standpoint, we view anchor-based methods such as receiver operating characteristic curves, logistic regression, discriminant analysis, and longitudinal item response theory models as useful techniques for identifying a threshold for within-individual change that separates groups of responders and non-responders. However, regarding estimates of mean score change within an 'improved' anchor group, we maintain that they do not target the location of a threshold and are therefore theoretically biased estimators of within-individual change thresholds [4]. Instead, mean change within an 'improved' anchor group has been proposed as more appropriate to guide thresholds for within-group changes over time [39,40]. Similarly, calculating the difference in mean change in scores between an 'improved' and a 'stable' anchor group is not a theoretically appropriate estimator of a within-individual change threshold [37], but has instead been proposed as more suited to between-group differences in change over time [40,41].
However, simulations presented within this special section [27] suggest that deviations from normally distributed score changes may pose a challenge to these theoretical ideals. Further planned simulations should help to confirm this [42].
A second observation is that current methods for within-individual thresholds and their clinical application use estimators relying on between-individual variability [4,7]. For example, meaningful change threshold estimation typically compares between-individual variation in an anchor measure with between-individual variation in change between two assessment points. Distribution-based indices, likewise, are based on between-individual variance (e.g., the standard deviation of a test score multiplied by a constant representing the level of accepted uncertainty and by another quantity such as the reliability coefficient). If researchers or clinicians are interested in understanding how a group of patients is classified over the course of time (and not in making a statement about individual patients), then using measures that are based on between-individual variance is likely an appropriate approach [4]. However, if a statement about an individual patient is the goal, then we know that between-individual variability is not always a good or justifiable proxy for within-individual variability [28,43-46]. In such a situation, the use of within-individual methods (e.g., EMA or related methods to explore intra-individual variation [16]) might be more appropriate. In the call for papers, we encouraged authors to explicitly justify whether thresholds were intended for between-group, within-group, or within-individual interpretations and why this was appropriate. This has led to calls for more nuance in interpretation [7]; to pragmatic responses that within-individual change methodology faces challenges in practical applications [5] (but see [16,28] for contrasting examples); to detailed statements on how to interpret a given index and when and where it is appropriate to use it [4]; as well as to wider discussions and explanations of the methods leading to such indices [17,30,34].
We especially see the development of appropriate within-individual methods for the identification of change as a key priority that also aligns with current technological developments for practice.
A third point is that in many submissions the variability or uncertainty associated with either the threshold or the change estimate is an important element in interpretation. Knowing the uncertainty associated with a threshold estimate is important, but not always explained or provided. Regardless of the type of variability used and whether a threshold based on meaning to patients or distributions is sought, recognizing and making transparent that there is uncertainty associated with these thresholds is a valuable reminder that none of the methods discussed in this special section offer absolute results. Because the use of meaningful change methodology and distribution-based thresholds has been ritualized to a degree, it is not always considered whether a particular method to determine thresholds is the most appropriate one for a given context. Additionally, emerging mixed-methods research relies on classifying particular patients as "changed" for identification in case studies, with limited or no allowance for measurement error, as well as assuming that the classification threshold applies to this particular patient [47,48]. Transparency about uncertainty in thresholds and classifications, as well as whether it is appropriate to apply a threshold for group or individual change, is therefore a key consideration for developing mixed-methods research agendas around how health outcome measures are used by patients more broadly [49-52]. We think that this intersection between epistemology, psychometrics, and various fields of clinical practice contains one of the strongest development opportunities for our understanding of (subjective) health outcome measurement, but substantial work is needed to align theories and practices for a coordinated research effort in this area.
The call for papers was issued to invite discussion, development, and state-of-the-art research and practice. We are grateful for the excellent range of submissions received and to all authors and reviewers involved in selecting the published papers, which represent a two-year collective effort. We hope that readers find these papers useful both in developing their own research and in helping the field further extend its efforts around patient-centeredness. When we can all agree on what a meaningful change is and how to measure it for a particular patient, measure, and population, then we will have the opportunity to bring about meaningful change in clinical practice and at the social and policy level.