Minimal important change (MIC): a conceptual clarification and systematic review of MIC estimates of PROMIS measures

We define the minimal important change (MIC) as a threshold for a minimal within-person change over time above which patients perceive themselves importantly changed. There is a lot of confusion about the concept of MIC, particularly about the distinction between minimal important change and minimal detectable change, which calls into question the validity of published MIC values. The aims of this study were: (1) to clarify the concept of MIC and how to use it; (2) to provide practical guidance for estimating methodologically sound MIC values; and (3) to improve the applicability of PROMIS by summarizing the available evidence on plausible PROMIS MIC values. We discuss the concept of MIC and how to use it and provide practical guidance for estimating MIC values. In addition, we performed a systematic review in PubMed on MIC values of any PROMIS measure from studies using recommended approaches. A total of 50 studies estimated the MIC of a PROMIS measure, of which 19 studies used less appropriate methods. MIC values of the remaining 31 studies ranged from 0.1 to 12.7 T-score points. We recommend using the predictive modeling method, possibly supplemented with the vignette-based method, in future MIC studies. We consider a MIC value of 2–6 T-score points for PROMIS measures reasonable to assume at this point. For surgical interventions a higher MIC value might be appropriate. We recommend more high-quality studies estimating MIC values for PROMIS.
Supplementary Information: The online version contains supplementary material available at 10.1007/s11136-021-02925-y.


Introduction
There are several ways to interpret change scores arising from patient-reported outcome measures (PROMs). One possible threshold is the minimal important change (MIC) estimate, which refers to the smallest change in score that patients consider important. There is a lot of confusion about the concept of MIC, which calls into question the validity of published MIC values [1,2]. First, there is inconsistency in the terminology used (e.g., minimal important change, minimal important difference, minimal clinically important difference, meaningful change threshold, to name a few). Similar terms may refer to different concepts and vice versa. Second, there is particular confusion about minimal important change and minimal detectable change, which are different concepts [3,4]. Third, there are differences in the methods used for estimating the MIC, some more and some less methodologically sound [5]. This confusion hampers and may even bias the interpretation of PROM change scores in research and clinical practice.
An increasingly used, innovative set of PROMs is the Patient-Reported Outcomes Measurement Information System (PROMIS®). It covers domains of health-related quality of life (HRQOL), such as pain, fatigue, physical function, anxiety, depression, and the ability to participate in social roles and activities, that are commonly important for adults and children with and without (chronic) medical conditions [6,7]. Most PROMIS measures are rooted in item response theory (IRT)-based item banks (i.e., large sets of calibrated questions measuring the same domain (construct)), which enable efficient measurement through fixed-length short forms and/or computerized adaptive testing (CAT) [8][9][10]. A number of studies have estimated MIC values for PROMIS measures. However, in light of its increasing use across the world [11][12][13][14][15][16][17][18][19], and the aforementioned confusion in the interpretation literature, additional guidance is needed on interpreting PROMIS change scores.
The aims of this study were: (1) to clarify the concept of MIC and how to use it; (2) to provide practical guidance for estimating methodologically sound MIC values; and (3) to improve the applicability of PROMIS by summarizing the available evidence on plausible PROMIS MIC values.

Part 1: the concept of MIC and how to use it
We define the MIC as a threshold for a minimal within-person change over time above which patients perceive themselves importantly changed. Assuming that all patients have their individual threshold of what they consider a minimal important change, the MIC can be conceptualized as the mean of these individual thresholds [20,21]. This definition of MIC is made up of three important elements: first, it refers to a threshold for a minimal change above which patients perceive themselves as changed (improved or deteriorated). Second, it refers to a change that is considered important to patients. And third, it refers to a within-patient change over time.
These three elements not only define what the MIC is but also clarify what the MIC is not. The MIC does not refer to thresholds for changes that are considered more than minimal (e.g., a mean change in patients who reported to be "much better" is not a MIC). There are other relevant concepts that reflect meaningful change thresholds that are larger than minimal, such as Clinically Significant Change [22], Sufficiently Important Difference [23] or Smallest Worthwhile Effect [24]. These concepts are outside the scope of this paper.
Next, the MIC is not a minimal detectable change (MDC, also referred to as smallest detectable change (SDC)). The MDC is the smallest change in score that can be detected statistically with some degree of certainty (e.g., 95 or 90%), based on the standard error of measurement (SEM) or limits of agreement from a test-retest reliability design. The MDC does not relate to the importance of change to the patients under investigation [4,25,26,27]. The MDC is an important benchmark for interpreting PROM change scores, but it is also outside the scope of this paper.
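To make the contrast with the MIC concrete, the sketch below computes an MDC from reliability data alone, without any patient judgment of importance. It assumes the commonly used formulas SEM = SD × √(1 − reliability) and MDC = z × √2 × SEM; the function names and the example values (an SD of 10 T-score points and a reliability of 0.90) are illustrative assumptions and are not taken from the reviewed studies.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM from the score SD and a test-retest reliability coefficient."""
    return sd * math.sqrt(1.0 - reliability)


def minimal_detectable_change(sem: float, z: float = 1.96) -> float:
    """MDC for the confidence level implied by z (1.96 for 95%, 1.645 for 90%).

    The factor sqrt(2) reflects that a change score carries the
    measurement error of two administrations.
    """
    return z * math.sqrt(2.0) * sem


# Illustrative values only (assumed, not from the review).
sem = standard_error_of_measurement(sd=10.0, reliability=0.90)
mdc95 = minimal_detectable_change(sem)
print(round(sem, 2), round(mdc95, 2))
```

Nothing in this calculation refers to what patients consider important, which is why an MDC cannot stand in for a MIC.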
Finally, the MIC is not a difference between (groups of) patients. For example, a difference between patients who reported to be "a little better" and those who reported to be "about the same" refers to a minimal important difference (MID), not a minimal important within-person change (MIC). The MID is another relevant benchmark for interpreting PROM scores but is also outside the scope of this paper.
The MIC, as defined above, can be used for different purposes. In research, some use the MIC value as a threshold to determine the number of responders in clinical trials or other studies (i.e., patients who have a change at least as large as the MIC value) [28,29]. This responder definition adds a meaningful interpretation to study results from the patients' perspective. In clinical practice, the MIC value can also be used to determine the number of responders in groups of patients who receive certain treatments to inform future patients about the expected effects of treatments. For example, a patient can be told that about 70% of patients experience a minimal important change after a given treatment. This may facilitate shared decision-making. However, it is necessary to acknowledge that the estimated MIC value is derived from a wider sample of patients, and the threshold may not apply to the individual patient in the clinical trial or in the consultation room. If a responder is defined as an individual whose PROM change score exceeds the MIC, then on a group level the percentage of responders will probably be correct. However, this does not mean that all patients have been classified correctly, based on their individual PROM change score being smaller or greater than their individual MIC. This is because all patients have their own individual threshold of what they consider a minimal important change [20]. Furthermore, measurement error in the PROM change score contributes to misclassification of individuals.
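The group-level responder definition described above can be sketched as follows. The function name, the example change scores, and the MIC value of 3 T-score points are hypothetical; the sketch also assumes a measure on which a higher score means improvement (for symptom measures the sign would need to be reversed).

```python
def responder_proportion(change_scores, mic):
    """Proportion of patients whose change is at least as large as the MIC.

    Assumes positive change scores indicate improvement.
    """
    if not change_scores:
        raise ValueError("change_scores must not be empty")
    responders = sum(1 for change in change_scores if change >= mic)
    return responders / len(change_scores)


# Hypothetical follow-up minus baseline T-score changes for ten patients.
changes = [-1.0, 0.5, 2.0, 3.5, 4.0, 6.5, 7.0, 1.0, 5.0, 8.0]
proportion = responder_proportion(changes, mic=3.0)
print(f"{proportion:.0%} responders")
```

The resulting percentage is a group-level summary; as noted above, it does not guarantee that each individual patient is classified correctly.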
In addition to being used as a threshold for responder definitions, the MIC value can be used as a probabilistic value, rather than a deterministic cut-point, by clinicians to interpret change scores in light of the probability that an individual patient has experienced a meaningful change. For example, if the estimated MIC value of a PROM is 10 points and an individual patient has changed more than 10 points, it is more likely that the patient has importantly improved than that the patient has not importantly improved. This might help the clinician start a conversation with the patient.

Part 2: guidance for estimating MIC values
A variety of methods have been used in the literature to estimate MIC values [1,30,31]. Many methods, however, do not refer to the concept of MIC as described above. MIC methods are often categorized into distribution-based and anchor-based methods. Distribution-based methods use statistical parameters, such as a standard deviation (SD) or standard error of measurement (SEM) for estimating the MIC value. These parameters refer to measurement error (minimal detectable change) but do not relate to the importance of the change to the patients under investigation and, while they add useful context to interpreting MIC values, they do not capture the spirit of the MIC [3,4,27].
Anchor-based methods are generally more appropriate because they relate change scores on the instrument of interest to an external criterion of important change. Often, a single question at follow-up is used as the external criterion (the anchor), asking patients how much they have changed, for example on a global 5- or 7-point rating scale ranging from "much worse" to "much better". The simplest and most prevalent method used to estimate the MIC value is the mean change method, where the MIC value (further referred to as MIC mean) is defined as the mean change score on the measure of interest in the subgroup of patients that reported to be "a little better" (minimal important improvement) or "a little worse" (minimal important deterioration) on the anchor question [32]. Studies have shown that a MIC for improvement may not be the same as a MIC for deterioration [33][34][35]. The mean change method has some important drawbacks. First, the subgroup of patients who reported to be "a little better" is often small, which results in imprecise MIC mean estimates. More importantly, the MIC mean value does not reflect a threshold for minimal improvement because it is defined as the mean of the entire group of patients who reported to be "a little better". As all patients in this group reported to be minimally importantly changed on the anchor, the mean change in score on the PROM of interest in this group of patients is higher than the threshold for minimal important change. Finally, it has been shown that if the anchor is not completely accurate, MIC mean estimates are more severely biased than those of other anchor-based methods [36].
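For reference, the mean change method amounts to the following calculation. This is a minimal sketch with hypothetical data and a hypothetical function name; it is shown to make the drawbacks above tangible (only the "a little better" subgroup is used, and the result is a subgroup mean rather than a threshold), not as a recommended approach.

```python
from statistics import mean


def mic_mean(change_scores, anchor_ratings, category="a little better"):
    """Mean change in the subgroup that gave the stated anchor rating.

    Returns the subgroup mean and the subgroup size, because the
    precision of the estimate depends on how many patients remain.
    """
    subgroup = [
        change
        for change, rating in zip(change_scores, anchor_ratings)
        if rating == category
    ]
    if not subgroup:
        raise ValueError(f"no patients rated themselves as {category!r}")
    return mean(subgroup), len(subgroup)


# Hypothetical data: six patients, three of whom report "a little better".
changes = [0.5, 3.0, 5.0, 9.0, -1.0, 4.0]
anchors = [
    "about the same", "a little better", "a little better",
    "much better", "about the same", "a little better",
]
estimate, n_subgroup = mic_mean(changes, anchors)
print(estimate, n_subgroup)
```

In this toy example half of the sample is discarded before the estimate is computed, which illustrates the precision problem.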
Two additional, more appropriate, anchor-based MIC methods are the ROC method and the MIC predictive modeling method, which are described in more detail below and in Online supplement 2. In addition, a relatively new qualitative method, based on comparing vignettes (descriptions of health status of hypothetical patients), is also described below.

ROC method
The Receiver Operating Characteristic (ROC) curve method is based on the ability of a measure to distinguish patients who reported to be improved from patients who reported to be not improved (i.e., stayed the same or worsened) on the anchor. The MIC value (further referred to as MIC ROC) is most often defined as the value for which the sum of the proportions of misclassifications ([1-sensitivity] + [1-specificity]) is smallest [32]. An advantage of this method is that it uses the entire study sample, leading to more reliable estimates than the MIC mean. Moreover, it estimates the threshold between 'not changed' and 'a little better' (minimal important improvement) or 'a little worse' (minimal important deterioration). A disadvantage is that the MIC ROC will be biased if the percentage of improved patients is not 50% [20].
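The cut-off search described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the code from Online supplement 2: the function name and data are hypothetical, candidate cut-offs are restricted to observed change scores, a patient is classified as improved when the change score is at least the cut-off, and ties are resolved in favor of the lowest cut-off.

```python
def mic_roc(change_scores, improved):
    """Cut-off minimizing (1 - sensitivity) + (1 - specificity).

    `improved` holds 1 for patients improved on the anchor and 0 otherwise.
    Returns the selected cut-off and its summed misclassification.
    """
    n_improved = sum(improved)
    n_not_improved = len(improved) - n_improved
    if n_improved == 0 or n_not_improved == 0:
        raise ValueError("both anchor groups must contain patients")

    best_cut, best_error = None, float("inf")
    for cut in sorted(set(change_scores)):
        true_pos = sum(
            1 for c, i in zip(change_scores, improved) if i and c >= cut
        )
        true_neg = sum(
            1 for c, i in zip(change_scores, improved) if not i and c < cut
        )
        sensitivity = true_pos / n_improved
        specificity = true_neg / n_not_improved
        error = (1 - sensitivity) + (1 - specificity)
        if error < best_error:  # strict '<' keeps the lowest cut-off on ties
            best_cut, best_error = cut, error
    return best_cut, best_error


# Hypothetical data: ten patients, five improved on the anchor.
changes = [-2, 0, 1, 2, 3, 4, 5, 6, 7, 8]
improved = [0, 0, 0, 0, 1, 1, 0, 1, 1, 1]
cut_off, error = mic_roc(changes, improved)
print(cut_off, round(error, 2))
```

Because the search only considers observed scores, the selected cut-off can jump between samples, which is one reason the estimate is less precise than a model-based one.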

Predictive modeling method
The predictive modeling approach is based on the predicted probability that a patient belongs to the improved group (based on the anchor) given the observed change score [21]. This method uses logistic regression analysis with the group variable (improved versus not improved [stayed the same or worsened] on an anchor) as the dependent variable and the change score on the instrument of interest as the independent variable. The MIC value (further referred to as MIC predict) is defined as the change score associated with a likelihood ratio of 1, which is the change score where the posttest probability of belonging to the improved group (i.e., after knowing the patient's PROM change score) equals the pretest probability of belonging to the improved group (before knowing the patient's PROM change score, the pretest probability is the percentage of improved patients in the sample) [20,21]. The MIC predict is more precise than the MIC ROC and a formula has been published to correct the MIC predict for bias if the percentage of improved patients is not 50% [20]. It is therefore considered a better option than the MIC ROC. In Online supplement 2 we provide additional details and SPSS and R codes (see also [37]) for how MIC ROC and MIC predict can be calculated.
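A likelihood ratio of 1 means that the posttest odds equal the pretest odds, so for a fitted model logit(P(improved)) = b0 + b1 × change, the MIC predict is the change score at which b0 + b1 × change equals the log-odds of the proportion improved in the sample. The sketch below illustrates this with a small hand-written Newton-Raphson fit so that it runs without external packages. It is not the SPSS or R code from Online supplement 2, the function names and data are hypothetical, and the bias adjustment of [20] is deliberately not implemented.

```python
import math


def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Fit logit(P(y=1)) = b0 + b1*x by Newton-Raphson; returns (b0, b1).

    Assumes the two groups overlap on x (no perfect separation).
    """
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w                  # entries of X'WX
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0.0:
            raise ValueError("singular information matrix")
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


def mic_predict(change_scores, improved):
    """Change score where the predicted probability equals the pretest probability."""
    b0, b1 = fit_logistic(change_scores, improved)
    pretest = sum(improved) / len(improved)
    pretest_log_odds = math.log(pretest / (1.0 - pretest))
    return (pretest_log_odds - b0) / b1


# Hypothetical data: overlapping groups, 50% improved on the anchor.
changes = [0, 1, 2, 3, 4, 2, 3, 4, 5, 6]
improved = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(round(mic_predict(changes, improved), 3))
```

In practice an established statistics package would normally be used for the regression; the hand-written fit is only there to keep the example self-contained.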

Vignette-based method
The anchor-based MIC methods described above depend on the reliability and validity of the anchor question, which has been criticized [30,38,39]. An alternative method for instruments with IRT-based scores is a vignette-based method, often referred to as bookmarking or standard setting. With this method, patients are asked to compare vignettes (descriptions of health status of hypothetical patients) in focus groups or in a survey [40][41][42]. Each vignette represents a health status with an associated score on the underlying IRT metric. Patients are asked to indicate whether a hypothetical change in health status from one vignette to another would be considered an important change. The MIC (further referred to as MIC vignette) has been defined as the mean difference in scores between pairs of vignettes that represent a minimal important change. If the mean difference is used to estimate the MIC vignette, this method may suffer from the same issue as the MIC mean in that it represents a value higher than the minimal threshold. Alternatively, it would also be possible to ask patients to rate the change between two (or more) vignettes on an anchor question and then use the predictive modeling method to estimate the MIC predict.
In Box 1 we provide a summary of general recommendations for the design and analysis of MIC studies.

Box 1: Recommendations for conducting and reporting MIC studies
1. The predictive modeling or ROC method should be used over the mean change method because MIC predict and MIC ROC provide a threshold between improved and not improved patients [21], while the MIC mean does not reflect a threshold for minimal improvement, but rather a mean in a subgroup of patients who considered themselves as minimally improved. The MIC predict is more precise than the MIC ROC and can be corrected for bias if the percentage of improved patients is not 50%, and is therefore recommended as the best option. Vignette-based methods can also be considered or used in addition to the MIC predict or MIC ROC because they do not require a longitudinal study. The MIC vignette is typically determined in a qualitative or survey study [40,97].
2. The MIC predict or MIC ROC should be determined in a longitudinal study, where patients complete the instrument of interest at baseline and again after a relevant time period (e.g., after an intervention). The most efficient design is one in which about half of the patients are expected to change (at least to a minimal important degree) on the domain of interest (e.g., physical function) and about half of the patients are expected not to change. If an intervention is applied between baseline and follow-up measurement, this intervention should be clearly described.
3. An anchor question should be completed by the patients at follow-up. The anchor question should measure the same construct as the instrument of interest. For example, for estimating the MIC of a fatigue instrument, the anchor question should state "how much has your fatigue changed since …". The anchor question should refer to a change since the previous measurement (e.g., since before treatment). The anchor question can have 3-7 response options, ranging from "much worse" to "much better". Patients who report to be "a little better" or more will be included in the improved group, while the rest of the patients will be included in the not improved group.
4. The sample size of the MIC study should be at least 100 patients [2]. Ideally, the percentage of patients in the improved group should be 50%. The percentage of improved patients should be reported. If the percentage of improved patients is not about 50%, the adjusted MIC predict should be used (see Online supplement 2) [20].
5. We recommend plotting the distributions of change scores in the improved group and in the not improved group (see Online supplement 2) to visualize how well the instrument of interest can distinguish between the improved and not improved patients [32].
6. The correlation between the change score on the instrument of interest and the anchor question should be at least 0.30 to assume validity of the anchor [2]. This correlation should be reported. If the correlation is too low, the data are not suitable for estimating the MIC value.
7. A 95% confidence interval around the MIC value should also be calculated and reported (see Online supplement 2) [98].
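Several of the recommendations in Box 1 are simple enough to check automatically before any MIC value is estimated. The sketch below is one hypothetical way to do so: it reports the sample size, the percentage of improved patients, and the correlation between change scores and the anchor, using the thresholds stated in Box 1 (at least 100 patients, correlation of at least 0.30). The function names, the numeric anchor coding (for example -2 for "much worse" to +2 for "much better", with +1 meaning "a little better"), and the use of a Pearson rather than Spearman correlation are assumptions made for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def check_mic_study(change_scores, anchor_codes, improved_from=1,
                    min_n=100, min_r=0.30):
    """Summarize the design checks listed in Box 1 (items 4 and 6).

    `anchor_codes` are numeric anchor ratings; patients coded at or above
    `improved_from` form the improved group (assumed coding).
    """
    n = len(change_scores)
    n_improved = sum(1 for code in anchor_codes if code >= improved_from)
    r = pearson_r(change_scores, anchor_codes)
    return {
        "n": n,
        "n_sufficient": n >= min_n,
        "pct_improved": 100.0 * n_improved / n,
        "anchor_correlation": r,
        "anchor_correlation_sufficient": r >= min_r,
    }


# Tiny hypothetical data set, far too small for a real MIC study.
report = check_mic_study([0.0, 1.0, 2.0, 3.0], [-1, 0, 1, 2])
print(report)
```

A report such as this one can be included alongside the MIC estimate so that readers can judge the percentage improved and the anchor correlation for themselves.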

Part 3: evidence on plausible MIC values of PROMIS measures
To summarize the available evidence on plausible MIC values of PROMIS measures we performed a search in PubMed from inception up to May 31, 2021 to identify all studies that estimated the MIC of one or more PROMIS measures.

Methods
We extracted relevant search terms from the COSMIN PubMed filter for finding studies on measurement properties [43]. The full search strategy is presented in Online supplement 3. One author (CBT) screened the abstracts. We included studies that determined a MIC value for any PROMIS measure (adults and pediatric, any domain, any language, any version (e.g., v1.0, v2.0), full bank, short form or CAT) in any population. We extracted the following information: PROMIS measure(s) used (including domain, version number, administration type, language, age version) and country in which data were collected, study population, intervention(s), length of follow-up, sample size on which the MIC value(s) was/were based, MIC methods used, correlation between PROMIS change scores and the anchor (Spearman correlation if presented, otherwise Pearson correlation), percentage of patients improved based on the anchor (only for studies estimating MIC ROC or MIC predict), and MIC values.
We only extracted MIC values based on anchor-based methods or vignette-based methods. We did not extract distribution-based MIC values. We only extracted MIC values based on longitudinal anchors, referring to within-person change over time. We did not extract values based on cross-sectional anchors, referring to minimal important differences between groups of patients (e.g., difference between patients who reported to be "slightly improved" and patients who reported to be "not changed" [44] or differences between patients with different levels of disease [45]) because these values refer to a minimal important difference (MID) rather than a minimal important change (MIC). When MIC values of other instruments were used as an anchor, we checked whether these MIC values were based on anchor-based methods. Furthermore, we did not extract MIC values that referred to more than a minimal important change (for example, MIC mean values based on mean changes in patients who reported to be "much better" were not included). We extracted MIC values for minimal important improvement and for minimal important deterioration separately. MIC values determined in groups of fewer than 10 patients were not extracted. Data extraction was initially performed by one author (either JDP, RC, PG, or CBT) for each paper, and extracted data were checked by another author (CBT or LBM). Missing information (for example, regarding the version numbers of PROMIS measures used) was requested by email (by CBT) from the primary authors of the papers.
All PROMIS measures are scored on a T-score metric, in which 50 is the mean of a relevant reference population (often a general population) with a standard deviation (SD) of 10. Higher scores mean more of the concept being measured (e.g., worse fatigue, better physical function).
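The T-score metric described above is a linear rescaling of a standardized score, which makes it easy to translate between SD units and T-score points. The helper names below are hypothetical and the sketch assumes the standardized score is expressed relative to the same reference population.

```python
def to_t_score(z: float) -> float:
    """Convert a standardized score (reference mean 0, SD 1) to a T-score."""
    return 50.0 + 10.0 * z


def sd_units_to_t_points(sd_units: float) -> float:
    """Convert a difference in reference-population SD units to T-score points."""
    return 10.0 * sd_units


print(to_t_score(0.0))            # reference population mean
print(to_t_score(-1.5))           # 1.5 SD below the reference mean
print(sd_units_to_t_points(0.5))  # half a reference SD in T-score points
```

For example, a change of half a reference-population SD corresponds to 5 T-score points.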

Results
The search yielded 911 abstracts, including 50 studies that estimated a MIC value of a PROMIS measure [41,. All studies used self-reported PROMIS data; no studies on proxy-reported data were found. Of these 50 studies, 10 studies used only distribution-based methods [49,50,52,55,58,66,68,74,75,77]; five studies estimated a minimal important difference (MID) rather than minimal important change (MIC) [44,62,63,72,73]; one study averaged estimates based on cross-sectional and longitudinal anchors as well as distribution-based estimates [84]; one study estimated a MIC value that referred to more than a minimal important change [92]; and two studies intended to calculate an anchor-based MIC but reported only a distribution-based MIC because the area under the ROC curve was considered too low [82,83]. Data from these 19 studies were not extracted.
Of the 28 studies that used anchor-based methods, 12 reported the correlation between the PROMIS change scores and the anchor. These correlations ranged from 0.02 to 0.76.
In several studies MIC values were presented for more than one PROMIS item bank (Tables S6, S8, S9, S10, and S11 are found in Online Supplement 1).
Only two studies estimated MIC values for five different PROMIS pediatric item banks (Mobility, Upper Extremity, Pain Interference, Fatigue, and Depressive Symptoms, Table S11), with MIC values ranging from 0.1 to 12.7 [41,61].

Discussion
We defined the minimal important change (MIC) as a threshold for a minimal within-person change over time above which patients perceive themselves importantly changed. Assuming that all patients have their individual threshold of what they consider a minimal important change, the MIC can be conceptualized as the mean of these individual thresholds. The MIC can be used to determine the number of responders in a group of patients to interpret study results or to inform patients about expected treatment results, or to help clinicians to estimate the probability that an individual patient has experienced a meaningful change, facilitating a conversation with the patient. There is no perfect MIC method. Distribution-based methods are not appropriate because they do not relate to the importance of the change to patients. We consider the predictive modeling method the most appropriate anchor-based method, because, unlike the mean change method, it refers to a threshold for minimal important change. Moreover, the MIC predict is more precise than the MIC ROC and a formula has been published to correct the MIC predict for bias if the percentage of improved patients is not 50% [20]. A disadvantage of all anchor-based MIC methods is the concern about the reliability and validity of the anchor question. The relatively new vignette-based method does not depend upon an anchor question, but the MIC vignette may represent a value higher than a minimal threshold if based on mean differences between vignettes. We recommend the predictive modeling method, possibly supplemented with the vignette-based method if the time and knowledge to design vignettes and recruit patients for that kind of study are available.
Our systematic review showed that published MIC estimates for PROMIS measures vary widely (more widely than the range of MIC estimates currently published on the HealthMeasures website [93]) and were often generated by less appropriate methods. The lower end of the observed range of MIC values (0.1 T-score points) is, in our opinion, implausible as a MIC threshold. The highest MIC values (7 T-score points or higher) were almost all found in adult patients undergoing surgery. It has been suggested before that an invasive procedure like surgery might require a larger change to be considered an important improvement, but results in the literature have been inconsistent [94,95]. For non-surgical interventions, we consider a MIC value of 2-6 points (covering about two thirds of the published MIC values) reasonable to assume at this point. There is not enough evidence yet to make more specific domain-specific or population-specific recommendations. Further studies are needed to examine whether MIC values differ across domains or between adults and children.
We particularly noticed several methodological concerns that might explain such a wide range of MIC estimates. First, most of these studies used the mean change method, which may represent a value higher than a minimal threshold. We did not exclude these results because this method is currently the most widely used method in the field (despite the critiques raised here) and only five studies used the ROC method, one study used the predictive modeling method [90], and three studies used a vignette-based method. In theory, it is likely that MIC mean values represent an [...] [96]. We strongly recommend that PROMIS users follow these reporting recommendations. A reporting guideline for MIC studies is being developed by an international group led by researchers from McMaster University, Canada (personal communication).
To gain more insight into the meaning of PROMIS change scores, more high-quality MIC studies are needed. To increase the understanding of the concept of MIC and improve the field, we need to agree on a clear definition of the MIC and report MIC values that are based on this definition. We recommend not publishing MIC values based on data where the correlation between the change score and the anchor is too low. We recommend reporting the anchor correlations and stating that the low correlation prevents MIC estimation, rather than publishing MIC values based on distribution-based methods. We offer recommendations for conducting MIC studies (Box 1) that may help prevent the situation where the correlation between the change score and the anchor is too low. Alternatively, we recommend using vignette-based methods. The recommendations in Box 1 can also be used to re-analyze existing data. More data are also needed to examine whether the MIC value differs across the PROMIS metric and across settings (e.g., duration of disease, kind of intervention, length of follow-up) [26]. In case researchers need to analyze a study (e.g., responders in a clinical trial) and no credible anchor-based MIC value is available, researchers could decide to use a distribution-based value, such as 0.5 × SD, or use a range of different values in a sensitivity analysis, but we argue that these values should not be called MIC values because distribution-based values refer to the concept of measurement error and are not based on the concept of MIC. However, as stated in Part 1, researchers should keep in mind that the estimated MIC value is derived from a wider sample of patients, and the MIC threshold or responder classification may not apply to the individual patient in the clinical trial or in the consultation room.
This study has some limitations. First, we only searched PubMed and the abstracts were screened by one author only, so we may have missed some MIC studies. Second, we based our review on one definition of minimal important change and excluded studies and MIC estimates that were not in line with this definition. Others may have different opinions, and the excluded studies and estimates may nevertheless provide relevant information about the interpretation of PROMIS (change) scores. Strong points of the study were that data extraction was checked by a second author and missing information was requested by email from the corresponding authors of the papers.
In conclusion, 50 studies estimated the MIC of a PROMIS measure, of which 19 studies used less appropriate methods. MIC values of the remaining 31 studies ranged from 0.1 to 12.7 T-score points. We consider a MIC value of 2-6 T-score points for PROMIS measures reasonable to assume at this point. For surgical interventions a higher MIC value might be appropriate. We recommend more high-quality studies estimating MIC values for PROMIS. This paper provides recommendations for designing and analyzing future MIC studies.
Funding No funding was received for conducting this study.

Declarations
Conflict of interest D. Cella was co-author of one of the included PROMIS MIC papers [44] and CB. Terwee was co-author of another included PROMIS MIC paper [89], but neither was involved in the data extraction of these papers. CB. Terwee and D. Cella are board members of the PROMIS Health Organization. The other authors have no conflicts of interest to declare that are relevant to the content of this article.

Research involving human participants and/or animals This study does not include human participants or animals.
Informed consent Because the study does not include human participants, informed consent is not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.