Response shift results of quantitative research using patient-reported outcome measures: a descriptive systematic review

Purpose The objective of this systematic review was to describe the prevalence and magnitude of response shift effects, for different response shift methods, populations, study designs, and patient-reported outcome measures (PROMs). Methods A literature search was performed in MEDLINE, PSYCINFO, CINAHL, EMBASE, Social Science Citation Index, and Dissertations & Theses Global to identify longitudinal quantitative studies that examined response shift using PROMs, published before 2021. The magnitude of each response shift effect (effect sizes, R-squared, or percentage of respondents with response shift) was ascertained based on reported statistical information or as stated in the manuscript. Prevalence and magnitudes of response shift effects were summarized at two levels of analysis (study and effect levels), for recalibration and reprioritization/reconceptualization separately, and for different response shift methods, and population, study design, and PROM characteristics. Analyses were conducted twice: (a) including all studies and samples, and (b) including only unrelated studies and independent samples. Results Of the 150 included studies, 130 (86.7%) detected response shift effects. Of the 4868 effects investigated, 793 (16.3%) revealed response shift. Effect sizes could be determined for 105 (70.0%) of the studies for a total of 1130 effects, of which 537 (47.5%) resulted in detection of response shift. Whereas effect sizes varied widely, most median recalibration effect sizes (Cohen's d) were between 0.20 and 0.30, and median reprioritization/reconceptualization effect sizes rarely exceeded 0.15, across the characteristics. Similar results were obtained from unrelated studies. Conclusion The results draw attention to the need to focus on understanding variability in response shift results: Who experiences response shifts, to what extent, and under which circumstances?
Supplementary Information The online version contains supplementary material available at 10.1007/s11136-023-03495-x.


Background
Longitudinal measurements of patient-reported outcomes (PROs) can be affected by response shift. Whereas several definitions of response shift exist [1], they all draw upon the working definition provided by Sprangers and Schwartz in 1999 [2, 3], where response shift refers to a change in the meaning of one's self-evaluation of a target construct as a result of (a) a change in the respondent's internal standards of measurement (i.e., recalibration); (b) a change in the importance of component domains constituting the target construct (i.e., reprioritization); or (c) a redefinition of the target construct (i.e., reconceptualization). When response shift occurs, the responses to a PROM at one point in time do not have the same meaning as the responses to that PROM at another point in time (see the illustrative vignette in Fig. 1, based on [4]; also see [5]). Response shift has important implications when inferences, actions, and decisions in health care are made based on the use of PROMs to measure change in quality of life (QOL) [6]. However, despite a proliferation of research on response shift spanning several decades, a comprehensive descriptive synthesis of quantitative response shift results has thus far not been reported.
There are many different statistical methods for detecting and quantifying the magnitudes of response shift effects. Informed by our previous work [7-9], we a priori classified the methods broadly as follows: design-based methods, latent-variable methods, and regression-based methods (see Table 1 for detailed descriptions and explanations). Design-based methods involve collection of data for the specific purpose of detecting response shift effects. Common examples include the then-test and individualized methods. Latent-variable methods allow for testing response shift effects by longitudinally examining the consistency (or invariance) of measurement models (e.g., structural equation models or item response theory or Rasch models). Regression-based methods involve the use of various regression analytical techniques to classify people or test for hypothesized response shift effects. Sébille et al. [8] have shown that all these methods operationalize the working definition of Sprangers and Schwartz [2, 3], albeit in different ways.
Most studies on response shift have focused on response shift detection, and relatively fewer studies have focused on estimating the magnitudes of response shift effects. A previous scoping review by Sajobi et al. [7] on 101 studies using quantitative response shift methods published through 2016 indicated that 96 studies (95%) had detected response shift. Of these studies, 82 (85.4%) detected recalibration response shift, 20 studies (20.8%) detected reprioritization response shift, four studies (4.2%) detected reconceptualization response shift, and seven studies (7.3%) reported a general response shift effect without indicating a particular pathway. A more recent systematic review of 107 studies, also using quantitative methods, which were published between 2010 and 2020 [10] found that only 91 studies (70.5%) had detected response shift. Less than half of the studies (51 studies) overlapped with the former review by Sajobi et al. [7]. Recalibration response shift was found in 73 studies, with 27 (37%) studies using the then-test, 24 (33%) applying structural equation modeling (SEM), and 22 (30%) adopting other methods. In both reviews, reprioritization and reconceptualization response shifts were detected less frequently and, if they were, they were predominantly identified by Oort's SEM method [11, 12].

Mrs. Adams
Imagine Mrs. Adams, who is diagnosed with stomach cancer. Before receiving chemotherapy she completes a PROM and, in response to the item "I feel tired," endorses "6" on a 7-point scale, with 7 being most tired, because she needs to rest almost every day. She answers "2" to the item "How is your health" (7-point scale with 1 being best health) because she is hopeful that she will fully recover physically. After therapy, she is still confined to bed and sleeps most of the day, although her fatigue was worse during chemotherapy. After chemotherapy, she answers "5" to "I feel tired," given her experience with worse fatigue. She also endorses "3" to the health item.
Although she came to realize that she will never recover, she intensely enjoys the company and support of her loved ones. In this example, the response scale for fatigue ("I feel tired") has been recalibrated, as the response scale options refer to different levels of fatigue. The response scale of the "How is your health" item has been affected by "reprioritization," as it was more influenced by physical health prior to chemotherapy and more by social health after therapy. If social health did not play any role when answering the item before therapy, then the response scale has been reconceptualized, i.e., a constituent domain was not part of the conceptualization of the target construct at that time. With these two items, the scores after chemotherapy are incomparable with those prior to the therapy (based on [4]).

Table 1
Overview of response shift methods including response shift detection and effect size estimation (if possible)

Detection of response shift
Effect size metrics as reported in the included studies

Design-based methods
This family of methods requires changes or extensions to common study designs to enable the detection of response shift [3].

Then-test
The then-test is an additional measurement at the follow-up occasion. Respondents complete the same measure as they did at baseline and follow-up, but now with the instruction to re-evaluate their level of baseline functioning [8, 32]. This includes:
1.1.1 Then-test: original
Conventional use of the then-test as described above.
1.1.2 Then-test: derivative
Study-specific adaptations of the then-test. For example, when applied to valuation of health states, respondents at follow-up are asked to evaluate their own health state of that moment and are asked to recall their valuation of their health state at the previous interview [33].

Then-test: original
Recalibration: Comparison of the mean difference between baseline and then-test.

Explanation:
For example, when chemotherapy induces fatigue, patients may adapt to this higher fatigue level. As a consequence, they may recalibrate the response scale for fatigue. This is indicated when respondents retrospectively (at the then-test) report less fatigue than they did at baseline. The comparison of the mean difference between baseline and then-test is then indicative of recalibration.

Then-test: derivative
Recalibration: Unique for each study.
Then-test: original
Standardized mean differences (SMDs) between baseline (X1) and then-test (X2) scores were calculated based on available information (see footnote a). We used reported SMDs, if provided, when insufficient information was available to calculate the SMDs.

Then-test: derivative
Same as Then-test original.

Individualized methods
A defining feature of this family of methods is respondent-generated content: the domains, items, or scale anchors are generated or defined by the respondents.

Schedule for the Evaluation of Individual Quality of Life (SEIQoL)
Respondents nominate the five domains (also called cues) most relevant to their HRQoL and assess their current functioning in each domain using a visual analogue scale (VAS) ranging from best to worst possible functioning. Patients then weight the relative importance of each domain by allocating 100 points to the five domains, using a pie chart disk (judgment analysis can also be used). The SEIQoL generates an overall index score, which is the sum of all five domain products (multiplication of each domain's weight by its corresponding level) [34]. If the SEIQoL is administered at two points in time, response shift can be assessed [8].
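The index calculation described above can be sketched in a few lines; this is an illustrative helper (the function name and example values are hypothetical, not drawn from the SEIQoL literature):

```python
def seiqol_index(levels, weights):
    """SEIQoL index: sum of the five domain products, i.e., each domain's
    VAS functioning level (0-100) multiplied by its relative weight
    (100 points allocated across the five domains)."""
    if len(levels) != 5 or len(weights) != 5:
        raise ValueError("the SEIQoL uses exactly five nominated domains")
    if sum(weights) != 100:
        raise ValueError("the 100 points must be fully allocated")
    # Each product is level * (weight / 100); the index ranges from 0 to 100.
    return sum(level * weight / 100 for level, weight in zip(levels, weights))

# Five nominated domains rated on a 0-100 VAS, weighted via the pie chart disk
index = seiqol_index(levels=[80, 60, 90, 50, 70], weights=[30, 20, 20, 15, 15])  # 72.0
```

Because the index is a weighted sum, administering the instrument at two time points and comparing the nominated domains and their weights, rather than the index alone, is what allows reprioritization and reconceptualization to be examined.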

Recalibration:
Cannot be detected (only in combination with another method, e.g., the then-test) (e.g., [34]).
Reprioritization: Statistical test of change in the domain weights. This may entail the difference in intra-class correlation coefficients between domain weights over time [35]; or subtraction of weights at follow-up from weights at baseline within unchanged domains [36]; or change in cue weights [37].
Reconceptualization: Statistical test of change in the number or type of the nominated domains over time. This may entail a mere count of domains mentioned at follow-up but not at baseline [36, 38], or the number (or percentage) of participants who changed at least one domain over time [37] or the most important domain over time [39]. A qualitative review of change in domain content is part of the procedure [8].
Recalibration, reprioritization, and reconceptualization: Effect sizes were not reported and could not be calculated based on information reported in the included studies, except for one study where the then-test method was applied to calculate SMDs based on the SEIQoL [34].


Patient-Generated Index (PGI)
Respondents nominate up to five areas, in relation to their disease, that impacted their QOL, and one additional area not related to their disease. Respondents are then asked to rate the severity of the nominated areas on a scale of 0-10, with 0 being severe (or the worst they can imagine) and 10 being mild (or as they would like to be). Finally, respondents are asked to distribute 12 tokens among the nominated areas, giving at least one token to each area and more tokens to the areas that are in most need of improvement. The total score is calculated by multiplying each area's severity score by the proportion of the 12 tokens assigned to that area and then summing this across the six areas (five disease-related and one non-disease-related) [40]. An area-weighted score can also be calculated [41]. If the PGI is administered at two points in time, response shift can be assessed [42].
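The PGI total score described above amounts to a token-weighted sum of the severity ratings; a minimal sketch follows, with a hypothetical function name and illustrative values:

```python
def pgi_total(severities, tokens):
    """PGI total score: each area's severity rating (0-10) multiplied by the
    proportion of the 12 tokens assigned to that area, summed over the six
    areas (five disease-related and one non-disease-related)."""
    if len(severities) != 6 or len(tokens) != 6:
        raise ValueError("this sketch assumes six nominated areas")
    if sum(tokens) != 12 or any(t < 1 for t in tokens):
        raise ValueError("all 12 tokens must be distributed, at least one per area")
    # Summing severity * tokens first, then dividing by 12, avoids rounding
    # each area's token proportion separately.
    return sum(s * t for s, t in zip(severities, tokens)) / 12

# Severity ratings for six areas with an even token distribution
total = pgi_total(severities=[4, 6, 3, 7, 5, 8], tokens=[2, 2, 2, 2, 2, 2])  # 5.5
```

The resulting score stays on the same 0-10 metric as the severity ratings, since the token proportions sum to one.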

Recalibration:
Cannot be detected (only in combination with another method, e.g., the then-test).
Reprioritization: Statistical test of change in the domain weights (change in number of tokens) over time. This may be combined with qualitative interview results (e.g., [41, 43]).
Reconceptualization: Statistical test of change in the number or type of areas nominated over time. This includes expansion, reduction, or completely different domains. This may involve calculating an index of change score [42] or be combined with qualitative interview results (e.g., [41, 43]).
Recalibration, Reprioritization, and Reconceptualization: Effect sizes were not reported and could not be calculated based on information reported in the included studies.

Cantril's ladder and/or changes in anchors
Respondents are asked to describe their best and worst imaginable life satisfaction as anchors for the ladder. They then rate their current life satisfaction on that ladder, with the lowest rung being the worst and the highest rung being the best. If Cantril's ladder is administered at two points in time, response shift can be assessed [3]. In some studies, patients are then invited to locate the pre-test anchors on the post-test scale, with the post-test anchors indicating numbers 1 and 10. This rating scale is extended at both extremes, allowing the anchor descriptions of the first assessment to be worse, better, or correspond with those of the second assessment [44, 45].

Recalibration:
Statistical test of the difference between baseline and transformed baseline scores. The transformed scores are a function of the baseline scores and the positions of the best and worst anchors in Cantril's ladder at follow-up [44].

Recalibration:
SMDs between baseline and transformed baseline scores [45] (based on the same formulas as for the then-test), or SMDs with a different denominator than provided in footnote a, qualifying as another metric [44].

Other design-based methods
1.3.1 Ideal-scale approach
Respondents are asked to complete a questionnaire twice: first in reference to their actual status (e.g., how they perceive their current QOL) and second in reference to their ideal status (e.g., how they would like their QOL to be or how they expect their QOL to change) [46]. These two questionnaires are administered one after the other at the same assessment point. Administration of both questionnaires is repeated over time [3].

Recalibration:
The response scale of what ideal entails may undergo recalibration, which is indicated by a statistical test of mean changes in ideal scores over time [3]. Alternatively, change in internal standards can be captured by comparing actual and ideal status (e.g., QOL expectancies) between baseline and follow-up [46].

Recalibration:
Effect sizes were not reported and could not be calculated based on information reported in the included studies.
Direct and moderated response shift effects: Effect sizes were not reported and could not be calculated based on information reported in the included study.

Change in importance ratings
Respondents are asked to indicate the importance of QOL domains over time, using response scales per domain (e.g., from very unimportant to very important) or by ranking the domains according to importance [48].
Reprioritization: If the relative importance of (the QOL) domains changes over time, this is indicated by statistical tests of mean change in importance ratings over time.
Reprioritization: SMDs of importance ratings, based on the same formulas as for the then-test [49].

Preference-based methods using vignettes
These methods assess the importance and value a patient explicitly places on a health state or quality-of-life dimension [3]. Patients are asked to rate (e.g., from poor to excellent) one or more anchoring vignettes, describing a particular (hypothetical) health state, at different points in time [8].
Reprioritization: Statistical test of mean change in vignette ratings over time [8].
Reprioritization: SMDs of vignette ratings, based on the same formulas as for the then-test [50, 51].
Change in item discrimination parameter estimates: Effect sizes were not reported and could not be calculated based on information reported in the included studies.

Regression methods
Statistical methods that rely on regression analysis (excluding latent-variable models).

Regression methods with classification
Use of regression models to classify people as having had response shift or not.

Classification and Regression Tree (CART)
A non-parametric method that involves recursive partitioning of the longitudinal data into homogeneous subgroups (nodes) with respect to the change in the PROM scores and corresponding explanatory clinical status variables.
Response shift is inferred when there is a discrepancy between clinical status and change in PROM scores, or a change in the relative importance of PROM domains [8].

Recalibration:
Inconsistent changes in PROM scores and clinical status.
Reprioritization: Change in the order of importance of each domain over time.
Classification is based on the percentage of respondents identified as having had recalibration and/or reprioritization response shifts.

Random forest regression
Evaluates changes in the relative contribution of PROM domains to the prediction of an outcome over time in each group. The relative importance of each domain is assessed using the average variable importance (AVI), which is the relative contribution of a domain to the prediction of an outcome in a CART averaged across several bootstrap samples. The change in the AVI for each component domain in predicting a global PROM score over time for each group is examined. Response shift is indicated by crossing curves [8].

Reprioritization:
Interaction between change in AVI for different domains.
Classification is based on the percentage of respondents identified as having had reprioritization response shift.

Mixed Models and Growth Mixture Models
Mixed (random effects) models are used to obtain residuals of observed minus predicted PROM scores, after which growth mixture models are used to identify latent classes of the centered residuals' growth trajectories. Response shift is inferred when there is change in centered residuals over time [8].
General response shift effect: Discrepancy between observed and predicted scores (centered residuals having a pattern of fluctuation over time deviating from zero).
Reprioritization: Effects of domain scores on global PROM scores that vary with time (i.e., interaction with time).
Classification is based on the percentage of respondents identified as having had (general or reprioritization) response shifts.

Regression methods without classification
Use of regression models that do not allow for classification.

Relative importance analysis
Application of logistic regression or discriminant analysis to rank PROM domains based on their relative importance in discriminating between groups. Response shift is inferred based on changes in relative importance or rank ordering of the PROM domains [8].

Reprioritization:
Change in relative importance (logistic regression or discriminant analysis coefficients) of PROM domains over time.
Effect sizes were not reported and could not be calculated based on information reported in the included studies.

Other regression methods without classification
A variety of study-specific applications of regression models to test for response shift effects as defined by the researchers.
Unique for each study.
Reported model R-squared (see [61] and [62] for study-specific details).

Other study-specific methods
Methods that are unique to a particular study (and have not been applied in other studies). This includes various combinations of other design-based methods and other statistical methods.
Unique for each study. Effect sizes were not reported and could not be calculated based on information reported in the included studies.
a The following procedures were used to standardize the mean difference, in the order of hierarchy depending on available information: (1) the standard deviation of the difference, SDdiff = sqrt(SD1^2 + SD2^2 - 2 x r x SD1 x SD2), where SD = standard deviation and r = correlation, which was assumed to be 0.5 when the actual correlation could not be determined; (2) the pooled standard deviation; or (3) the standard deviation of the baseline measurements. When only the median and quartiles were reported, the mean was estimated as (q1 + m + q3)/3, where q1 is the first quartile, m is the median, and q3 is the third quartile, and the standard deviation was derived following Eq. 15 from Wan et al. (2014): SD = (q3 - q1)/η(n), where numerical values for η(n) associated with different sample sizes are provided in Table 2 of Wan et al. When a 95% confidence interval of the mean was reported, the standard deviation was derived from UL95%CI = the upper limit of the 95% confidence interval. When the t statistic, t, of a paired t test was provided, the standard deviation of the difference was derived as SD = D x sqrt(n)/t, where D is the mean of the difference between baseline and then-test.
b For Schmitt's method, recalibration, or "beta-change," refers to a change in the metric of the latent factor, which may or may not involve recalibration of any of its measurement indicators. Thus, recalibration by Schmitt's method does not distinguish between recalibration and reprioritization as operationalized by Oort's method.

Previous meta-analyses of response shift effects have focused on estimating the magnitudes of the effects, with results suggesting that effect sizes are relatively small on average. However, the meta-analyses also reveal substantial heterogeneity. A meta-analysis of studies published up to 2005 that used the then-test revealed Cohen's d effect sizes ranging from 0.08 to 0.32 [13]. A more recent systematic review examined response shift effects in persons with cancer [14]. Seventeen of the 35 studies reported effect sizes, of which 12 studies found negligible to small effect sizes, four studies found moderate effect sizes, and one study identified a single effect size of large magnitude. A systematic review of nine studies that examined response shift in people with orthopedic conditions after rehabilitation [15] found effect sizes varying in magnitude, although most were small. To date, systematic reviews on the magnitudes of response shift effects have included only studies focusing on either a particular response shift method (i.e., the then-test) or a specific patient population (i.e., persons with cancer or an orthopedic condition).
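The median/quartile conversions referenced in footnote a can be sketched as follows, assuming the Wan et al. (2014) estimators (mean from (q1 + m + q3)/3; SD from the interquartile range, with η(n) computed analytically rather than read from their Table 2); the function names are illustrative:

```python
from statistics import NormalDist

def mean_from_quartiles(q1, m, q3):
    """Wan et al. (2014) estimate of the mean from the first quartile,
    median, and third quartile."""
    return (q1 + m + q3) / 3

def sd_from_iqr(q1, q3, n):
    """Wan et al. (2014) SD estimate from the interquartile range:
    SD ~ (q3 - q1) / eta(n), where eta(n) is approximated by
    2 * Phi^-1((0.75*n - 0.125) / (n + 0.25)),
    with Phi^-1 the standard normal quantile function."""
    eta = 2 * NormalDist().inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    return (q3 - q1) / eta

# Example: a study reporting median 50 with IQR [40, 60] in a sample of n = 50
mean_est = mean_from_quartiles(40, 50, 60)  # 50.0
sd_est = sd_from_iqr(40, 60, 50)            # about 15.3
```

For large n the denominator approaches 1.35, the familiar ratio of the IQR to the SD under normality, so the correction mainly matters for small samples.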
The above reviews reveal considerable heterogeneity in characteristics of response shift studies, as they were conducted in different populations, employed different study designs, used different PROMs, and applied different response shift methods. These observations give rise to the question: What is the prevalence (i.e., relative frequency) and magnitude of response shift effects for different response shift methods and across different characteristics of response shift studies? To answer this question, it is important to consider the results from quantitative response shift studies, including results from studies for which effect sizes cannot be obtained. However, a descriptive synthesis of all quantitative response shift results has thus far not been reported.
To address this gap, we conducted a systematic review of all published quantitative studies that investigated response shift using PROMs. Our aim was to describe evidence about response shift results, including distributions of response shift prevalence and, where possible, effect sizes, for different response shift methods, and population, study design, and PROM characteristics. We recognize that there continues to be debate about the conceptualization of response shift in the QOL and health measurement literature. We therefore initiated the Response Shift - in Sync Working Group, which aims to synthesize the work on response shift to date [16], including the definitional and theoretical underpinnings of response shift [17], the critical examination of response shift detection methods and their underlying operationalizations of response shift [8], and the implications of response shift for healthcare decision-making based on PROMs [6]. The descriptive systematic review reported herein is part of this initiative. With this review we do not intend to make recommendations about what response shift is or which metrics should be used. Rather, our aim is to describe and synthesize the results of response shift research to date, including the inherent heterogeneity in operationalization. This type of descriptive synthesis is important for identifying gaps, formulating new research questions, designing new longitudinal studies, and guiding future research directions.

Methods
We conducted a systematic review (registered retrospectively in INPLASY at the time of data analysis: #202290033) [18] following guidelines by Cooper, Hedges, and Valentine [19] and used the PRISMA statement as a guide for reporting the results [20].

Search strategy and eligibility criteria
Studies on response shift were identified by searching the following library databases: (a) MEDLINE, PSYCINFO, and CINAHL using the EBSCO interface; (b) EMBASE using the OVID interface; (c) Social Science Citation Index using the Web of Science interface; and (d) Dissertations & Theses Global using the ProQuest interface (see Fig. 2). All searches were conducted using the same combination of the following terms and corresponding abbreviations in all indexed fields: "response shift" OR "longitudinal measurement invariance" OR "retrospective bias" OR "longitudinal differential item" OR "longitudinal DIF." The searches were limited to the English language and a date of publication before January 1, 2021. For the Social Science Citation Index, an additional limit was applied to exclude meeting abstracts. No other filters were applied to any of the searches. Duplicate records were identified and removed using the duplicate screening tool in the EPPI Reviewer platform [21]. Manuscripts that reported on errata or on the same study as reported in another manuscript were identified as duplicates during the data extraction process after confirming that no additional relevant information could be extracted. We retained the manuscripts that reported the most detailed results.
We aimed to include all longitudinal quantitative studies that examined response shift using a PROM. Exclusion criteria were sequentially applied in the order shown in Fig. 2. The titles and abstracts of each citation were randomly assigned for independent screening by two team members (RS, LR, MEGV, VS, MAGS), all of whom were thoroughly familiar with response shift, using the EPPI Reviewer platform [21]. The full text was subsequently retrieved for each citation identified as potentially relevant, and each was randomly assigned for screening by two of the same team members. Disagreements were reconciled via consensus.

Data extraction
Data extraction for each included study was completed by one of three team members (LR, MV, RS). Ambiguities were discussed among team members to achieve agreement. Study-level information was extracted using the EPPI Reviewer application, and detailed information about each response shift effect was extracted and entered into a spreadsheet. The following study-level data extraction characteristics (see Tables 2, 3

Statistical analyses
In all studies, authors concluded whether response shift was found or not, although the conclusions may have been based on different grounds, e.g., statistical significance (where different studies adopted different alpha levels) or verbal conclusions in the absence of statistical tests. We followed the authors' conclusions regarding the existence or non-existence of a response shift effect. Where possible, we determined the magnitude of each response shift effect based on reported statistical information from which an effect size could be derived. Table 1 includes a description of response shift detection and effect size calculation (if possible) for each method. We used reported effect sizes, if provided, when insufficient information was available to calculate effect sizes. Standardized mean differences (Cohen's d) were calculated for the then-test and latent-variable methods based on information reported in each study, using the difference between baseline (X1) and follow-up (then-test) (X2) scores as follows: Cohen's d = (X1 - X2)/SD (where SD = standard deviation). For some studies, this meant that we first had to transform medians, interquartile ranges (IQR), and t or z statistics into means and standard deviations [19, 22]. We used the following hierarchy to standardize the mean difference, based on (1) the standard deviation of the difference, (2) the pooled standard deviation, or (3) the standard deviation of the baseline measurements (see the footnote to Table 1 for details). For SEM, response shift effects were based on parameter estimates of models that adjust for a lack of longitudinal measurement invariance (for more information see [23]). All effect sizes were converted to absolute values. We followed Cohen's guidelines [24] to interpret effect sizes of 0.2, 0.5, and 0.8 as small, moderate, and large, respectively. For regression methods that do not use classification of people as having response shift or not, the reported R-squared was used as a measure of effect size, with values of 0.01, 0.06, and 0.14 being indicative of a small, moderate, and large magnitude, respectively [24]. For regression-based response shift methods that do use classification, the proportion of people having undergone response shift was extracted as an indication of the magnitude of effects.
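As a minimal sketch (not the authors' code) of the first option in that hierarchy, Cohen's d standardized by the standard deviation of the difference, with the r = 0.5 assumption noted in the footnote to Table 1:

```python
from math import sqrt

def cohens_d_then_test(mean_baseline, mean_then, sd_baseline, sd_then, r=0.5):
    """Standardized mean difference between baseline and then-test scores,
    standardized by the SD of the difference:
    SD_diff = sqrt(SD1^2 + SD2^2 - 2*r*SD1*SD2),
    where r is assumed to be 0.5 when the baseline/then-test correlation
    is not reported."""
    sd_diff = sqrt(sd_baseline ** 2 + sd_then ** 2 - 2 * r * sd_baseline * sd_then)
    # Effect sizes were converted to absolute values in the review.
    return abs(mean_baseline - mean_then) / sd_diff

# Example: baseline mean 60 (SD 10), then-test mean 55 (SD 10)
d = cohens_d_then_test(60.0, 55.0, 10.0, 10.0)  # 0.5
```

With r = 0.5 and equal SDs, the denominator reduces to the common SD, so the example yields d = 5/10 = 0.5, a moderate effect by Cohen's guidelines.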
Response shift results and effect sizes were summarized for recalibration and reprioritization or reconceptualization effects at different levels of analysis (study and effect levels). Accordingly, the synthesis focused on describing distributions of prevalence (relative frequency)

Table 2
Prevalence of response shift results by method
a Unknown includes several effects for which the pathway was unknown due to it not being explicitly reported. N = the number of studies or response shift effects. For each study, response shift methods were only counted when results about the response shift effects were reported. For studies that reported the same results in multiple manuscripts, only the results of the first published study were counted. % RS detected = the percentage of detected response shift effects out of the # of studies or # of response shift effects that were investigated or possible. SEM = Structural Equation Model. IRT/Rasch = Item Response Theory/Rasch Measurement Model. n/a = not applicable.

Study-level results
Effect-level results

Table 3
Prevalence and magnitude of effect sizes by method. Based on a total of 105 studies and 1130 effects for which effect sizes could be determined. N = the corresponding # of studies or response shift effects (which are the same for prevalence and magnitudes of the effects). The # of studies (first column) do not add up to 105 or to their respective subtotals (bolded rows) because some studies applied more than one response shift method.
% RS detected = the percentage of detected response shift effects out of the # of studies or # of response shift effects. % sample = percentage of the pooled sample size (total number of people across studies). SEM = Structural Equation Model. n/a = not applicable because either the response shift pathway could not be investigated due to the method used or the results could not be discerned from what was reported in the manuscript. IQR = interquartile range.
a Unknown includes several effects for which the pathway was unknown due to it not being explicitly reported. b Cohen's d was calculated for 951 of the 1062 effects based on the statistical information reported, following the procedures described in Table 1. The remaining 111 effect sizes could not be calculated due to inadequate statistical information, in which case we relied on the Cohen's d effect sizes reported in the manuscript (of these, 89 effect sizes were explicitly reported as standardized mean differences and 22 effect sizes were assumed to be standardized mean differences based on the overall description of the methods). c Based on one study where the then-test method was applied to calculate SMDs based on the SEIQoL [34].

and magnitude of response shift effects, based on (a) the proportion of studies detecting response shift (study level) and (b) the proportion of response shift effects identified (effect level) for different response shift methods and population, study design, and PROM characteristics. Consistent with our descriptive aim and recognizing the inherent heterogeneity in operationalizations, we used non-parametric statistics to describe the distributions of the effect sizes, including their medians and IQRs for continuous effect sizes and percentages for classification (i.e., we did not pool effect sizes statistically).

Total
Although we sought to describe all response shift effects, we also wanted to account for situations where multiple analyses and studies were done on the same sample. To do so, we first conducted the analyses based on all response shift effects and subsequently repeated the same analyses on the subset of response shift effects from unrelated studies and samples that do not overlap with samples from the same study (e.g., subsamples) or other studies (with details reported in Supplementary Tables S1 to S5). Studies were considered related when analyses from different studies were conducted on the same or overlapping samples or when the same results were reported in multiple manuscripts. For related studies, only the first (original) study was counted. Independent samples do not overlap with other samples. When samples overlapped, only the overall sample was counted (subsamples were not counted). Analyses were conducted using SPSS [25], with violin plots created using the ggplot2 package in R [26].
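As a minimal illustration of these descriptive summaries (medians and IQRs of absolute effect sizes, without statistical pooling), the following Python sketch can be used; the effect size values and the linear-interpolation quantile convention are illustrative assumptions, not data or code from the review:

```python
import numpy as np

def describe_effect_sizes(effect_sizes):
    """Median and IQR (25th-75th percentile) of absolute effect sizes,
    as used to summarize distributions without pooling them."""
    es = np.abs(np.asarray(effect_sizes, dtype=float))
    q1, med, q3 = np.percentile(es, [25, 50, 75])
    return float(med), (float(q1), float(q3))

# Hypothetical recalibration effect sizes (Cohen's d) for one method
median, iqr = describe_effect_sizes([0.05, 0.10, 0.22, 0.30, 0.38])
```

The same summary would be computed separately for each response shift method and characteristic; the review itself produced the corresponding violin plots with ggplot2 in R.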

Studies
Of the 1906 records screened, 150 studies fulfilled the eligibility criteria and were included (see Fig. 2). Of these studies, 125 were unrelated to any of the other studies and 25 related studies involved analyses of the same or overlapping samples, of which 9 were identified as primary (first published) studies and 16 as secondary (related studies that were published after the corresponding primary study). We identified a total of 4868 response shift effects (Table 2), of which 917 were from secondary related studies and 284 from primary related studies (3667 effects were from unrelated studies). Results of the 150 studies and 4868 response shift effects are described first, followed by a description of the results based on the 134 unrelated and primary related studies and 3579 response shift effects from independent samples (excluding 372 effects from subsamples), with further details provided in Supplementary Tables S1 to S5.

Prevalence
Of 150 studies, 130 (86.7%) reported detection of one or more response shift effects (Table 2), based on criteria defined by the authors. However, response shift effects were detected for only 793 (16.3%) of the total 4868 effects investigated. Most response shift results were based on 82 studies that utilized the then-test method, with 86.6% of the studies and 39.2% of the 1004 corresponding effects resulting in detection of recalibration response shift. SEM methods were applied in 44 of the studies, of which 79.5% resulted in detection of at least one response shift effect. However, only 7.7% of all corresponding 3139 effects revealed response shift, including 16.4% of 986 recalibration effects and 3.7% of 2153 reprioritization or reconceptualization effects.
Other methods were less frequently applied, ranging from 3 to 13 studies. When considering methods that were based on at least 10 studies (i.e., not IRT/Rasch and other study-specific methods), the highest percentage of detected response shift effects at the study level was found for individualized methods (100%, 12 studies). At the effect level, the percentage of detected response shift effects was also relatively high for individualized methods (74.2%), despite the small number of 31 effects. In general, across all response shift pathways, the prevalence of response shift detection was lower when the number of investigated response shift effects was larger.

Magnitude
Effect sizes could be determined for 105 (70.0%) of the studies and a total of 1130 response shift effects, with 96 (91.4%) of these studies resulting in detection of 537 (47.5%) response shift effects. Cohen's d (standardized mean difference) was the most common effect size metric, obtained for 1062 effects from 91 studies (see Table 3). Most of these effect sizes were based on studies using the then-test (72 studies and 929 effects), resulting in an overall median effect size of 0.22 with substantial dispersion (IQR 0.10-0.38) for recalibration effects. Cohen's d effect sizes were also determined for 111 effects from 19 studies using SEM, where response shift was detected in all of these studies and for 95.5% of the effects, with median effect sizes of 0.22 (IQR 0.14-0.35) for recalibration and 0.10 (IQR 0.00-0.14) for reprioritization or reconceptualization. Other methods enabling the calculation of Cohen's d effect sizes included other design-based methods (3 studies and 15 effects) and individualized methods (1 study and 7 effects) (see Table 3). The distribution of effect sizes across these methods is visualized as violin plots in Fig. 3. Additionally, two studies (27 effects) provided the R-squared statistic as an effect size for regression methods without classification, resulting in a median effect size of 0.01 (IQR 0.00-0.02) for reprioritization/reconceptualization. Response shift effect sizes could also be obtained for 27 effects from 17 studies using classification methods, all of which resulted in detection of response shifts. The greatest effect size was obtained for the design-based individualized classification method, which also had the largest number of effects (13), resulting in 68.2% of the pooled sample size indicating reprioritization/reconceptualization response shift. Finally, three studies (14 effects) reported study-specific effect size metrics (see Table 3).

Prevalence
At the study level, most studies involved participants of mixed sex (121 studies), who were mostly adults (100 studies), with a medical condition (141 studies), and/or undergoing a medical intervention (69 studies) (Table 4). The study-level prevalence of detected response shift ranged from 66.7% (only male; 12 studies) to 90.0% (medical condition: stroke; 11 studies), when excluding population characteristics with fewer than 10 studies. The corresponding effect-level prevalence values were much lower and ranged from 5.9% (mental health condition; 472 effects) to 32.9% (other/unspecified intervention; 149 effects) for population characteristics with at least 100 effects.

Magnitude
When considering population characteristics for which at least 100 effect sizes could be determined, median recalibration effect sizes ranged from 0.17 (IQR 0.09-0.28; 381 effects for female samples) to 0.31 (IQR 0.22-0.42; 101 effects for male samples). The reprioritization or reconceptualization effect sizes are based on fewer effects (ranging from 2 to 30 per population characteristic), with median effect sizes ranging from 0.03 (IQR 0.00-0.11), based on 17 effects from samples with medical interventions, to 0.20 (IQR 0.10-0.39) for 8 effects from samples with other/unspecified interventions.

Prevalence
Most studies employed an observational design (122 studies), conducted primary analyses (90 studies), had a sample size between 57 and 254 (79 studies), and employed an observation period greater than 12 months (90 studies). Across the four study design characteristics, the study-level prevalence of detected response shift ranged from 72.2% (time period from 6 to 12 months) to 91.3% (time period from 1 to 6 months), excluding the one study with unknown data analysis (Table 5). Again, the corresponding effect-level prevalence values were lower and ranged from 8.4% (1132 effects for sample size > 411) to 30.2% (1590 effects for primary analysis).

Magnitude
Fig. 3 Distribution of absolute effect sizes across types of response shift methods. Note: All violin plots have the same area, which is determined by the distribution of effects within each method. Four effect sizes for which the response shift pathway is unknown are excluded. Two extreme effect sizes of 6.9 [63] and 2.9 [64] for the then-test are not shown to ensure that the other distributions remain discernible (instead of becoming flat lines). The outliers were only removed for this visualization; they were included in the statistical analyses.

When considering study design characteristics with at least 100 effect sizes, the smallest median recalibration effect size (Cohen's d) was 0.15 (IQR 0.09-0.24), based on 194 effects from studies with sample sizes between 255 and 410. The largest median effect size was 0.26 (IQR 0.16-0.41), based on 145 effects from studies adopting an experimental design, and 0.26 (IQR 0.13-0.45), based on 391 effects from studies with sample sizes less than 57. The reprioritization or reconceptualization effect sizes are based on fewer effects (ranging from 4 to 33 per study design characteristic), with median effect sizes ranging from 0.01 (for 7 effects from studies employing a time frame of < 1 month) to 0.14 (based on 15 effects from studies conducting primary analyses and 13 effects from studies using a sample size between 57 and 254), excluding the time period classification "not reported," since this is essentially a missing data category.

Prevalence
With respect to PROM type, most studies employed a generic PROM (76 studies), with the SF family of PROMs (47 studies) and the EQ-5D (14 studies) being most prevalent (Table 6). Disease-specific PROMs were used in 57 studies, with the EORTC measures being used most often (17 studies). Of the PROM domains, the physical (96 studies), general health/QOL (95 studies), and psychological other than depression (85 studies) domains were measured most frequently. When excluding PROM types used in 10 or fewer studies, the study-level prevalence of detected response shift for PROM types ranged from 69.6% (46 studies using other types of PROMs) to 90.0% (10 studies using individualized PROMs). Regarding the different PROM domains, study-level prevalence was within a range of 12 percentage points for the major domains, ranging from 56.7% for the 'other' domain to 70.8% for the physical domain (excluding those with 10 or fewer studies).
When considering PROM types with at least 100 effect sizes, the effect-level prevalence of detected response shift ranged from 10.5% (611 effects for 'other' generic PROMs) to 25.2% (616 effects generated by studies employing EORTC measures). The corresponding effect-level prevalence values for the PROM domains (excluding those with fewer than 100 effects) ranged from 10.5% for other PROM domains (440 effects) to 28% for both general health/QOL and pain (561 and 236 effects, respectively).

Magnitude
When considering PROM types with at least 100 effect sizes, median recalibration effect sizes (Cohen's d) ranged from 0.17 (IQR 0.09-0.28) for 315 effects from studies using the EORTC PROMs to 0.29 (IQR 0.12-0.48) for 115 effects based on studies using 'other' disease-specific PROMs. Median reprioritization/reconceptualization effect sizes ranged from 0.07 (IQR 0.03-0.14), based on 16 effects for other types of PROMs, to 0.17 (IQR 0.09-0.47), based on 4 effects for the EORTC family. For PROM domains, median recalibration effects based on at least 100 effects ranged from 0.20 (IQR 0.10-0.35), based on 429 effects for the physical domain, to 0.23 (IQR 0.09-0.41), based on 182 effects for general health/QOL, and 0.23 (IQR 0.11-0.33) for 199 effects for the psychological domain other than depression. The reprioritization or reconceptualization effect sizes are based on fewer effects (ranging from 2 ('other' PROM domain) to 18 (SF family) across all PROM characteristics), with median effect sizes ranging from 0.05 (IQR 0.00-0.09), based on 2 effects for the 'other' PROM domain, to 0.17 (IQR 0.09-0.47), based on 4 effects for the EORTC PROMs.

Unrelated studies and non-overlapping samples
Prevalence and effect size estimates were similar for most methods and population, study design, and PROM characteristics when only unrelated studies and non-overlapping samples were considered (see Tables S1-S5). Study-level differences in prevalence ranged from 0.0 to 5.9% for most methods and characteristics (when considering methods and characteristics with at least 10 studies), with the exception of the characteristics 'only males' and 'sample size of 255-410,' which had a greater prevalence when considering only independent effects. Effect-level differences in prevalence ranged from 0.0 to 14.9%, with 15 of the effect-level differences exceeding 5% (when considering methods and characteristics with at least 100 effects). The median difference in Cohen's d estimates of recalibration response shift was 0.02 (when considering methods and characteristics with at least 100 effects), with the largest differences for the characteristics 'unknown sex' and 'mental health condition.' For reprioritization/reconceptualization, the number of effects was too small to warrant meaningful comparisons.

Discussion
To further the field of response shift research, this study described variation in the prevalence of response shift results and, where possible, the magnitude of response shift effects for quantitative studies using PROM data. Consistent with earlier reviews [7, 10], the most frequently applied response shift method was the then-test, followed by SEM, other design-based methods, regression methods without classification, individualized methods, regression methods with classification, IRT/Rasch, and other study-specific methods. Most studies reported detection of one or more response shift effects. However, response shift effects were detected for only a sixth of all effects investigated. Clearly, study-level prevalence is expected to be higher than effect-level prevalence because a study was classified as having detected response shift when response shift was detected for any of the multiple effects being studied. However, it is noteworthy that the prevalence distributions at the study level and the effect level also differed across the different methods and population, study design, and PROM characteristics, i.e., different methods and characteristics would be identified as having a higher or lower prevalence of response shift effects depending on whether this is determined at the study level or the effect level. Individual studies and previous reviews primarily focused on study-level results, drawing binary conclusions about whether response shift is present or not. Our results show that such a one-sided focus may be misleading; only the combined information at the study and effect levels provides a comprehensive overview of response shift results.
Effect sizes were determined for 105 of the 150 studies. Whereas the median effect sizes varied per method, population, study design, and PROM characteristic, they were all of small magnitude, with most recalibration effect sizes between 0.20 and 0.30 and reprioritization/reconceptualization effect sizes rarely exceeding 0.15. There may be methodological explanations for the small effect sizes found. One explanation would be the presence of heterogeneous samples, where response shifts may occur at the individual level but in different directions that cancel each other out at the group level. Nonetheless, given the small median effect sizes, we need to acknowledge that there are empirical situations where the impact of response shift may be quite small, or even negligible, when we are only interested in results for the entire group. But even in such contexts, there is reason to consider such effect sizes relevant. Many effects in PROM research have a comparable magnitude. For example, a systematic review aimed to investigate, among other things, whether patients who share their responses to PROMs with their health care provider have better health. The results indicated small effect sizes for such PROMs feedback on patient-reported health outcomes [27]. This review illustrates that the target signal (in this case PROMs feedback) may not be substantially different in magnitude from other processes triggered by using PROMs, such as response shift.
The most striking finding is that the effect sizes varied widely, ranging from zero to large, both within and between studies. This observation draws attention to the importance of considering the dispersion of response shift effects within studies rather than relying exclusively on within-study pooled results. The large variability in effect size estimates may also have methodological explanations, including variability in response shift methods (e.g., SEM methods lead to smaller effect sizes than then-tests), study populations, sample sizes, and PROMs. The proportion of variability in effect sizes attributable to such study characteristics is currently unknown. If part of the variability in effect sizes were indeed caused by differences between study populations, then substantial effects may be experienced by some groups or individuals, in specific contexts, or with certain PROMs. Ignoring such variability would not do justice to the experiences of the respondents and the richness of the data. In other words, variability can be highly meaningful. A parallel can be drawn with precision medicine, e.g., in cancer treatment. Whereas some treatments may hardly affect the survival of a particular population, further investigation into the variability of effects may reveal that a treatment can be highly effective in a subgroup of patients whose tumor DNA matches the working mechanisms of that treatment. Hence, information about the variability of treatment effects enables the development of targeted therapy, which ultimately results in more life years gained. Whereas research into response shift will not lead to such dramatic gains, we may need to move toward 'precision methodology' for response shift. The key message is that, rather than focusing on effect sizes for the entire group, we should focus on describing and understanding variability in effects: in terms of identification (who experiences response shift?), magnitude (to what extent?), and circumstances (under which conditions?). Moreover, arguments around social justice and societal inequalities would require such subgroup analyses, investigating whether response shift effects systematically favor or disadvantage some groups of people [6, 28].
A number of limitations of this systematic review merit attention. We omitted studies that were not reported in English. We also cannot preclude the possibility that we missed relevant papers despite our extensive literature search. The synthesis of included studies was challenged by different operationalizations of response shift and inadequate reporting of study results and/or methodology. A substantial number of studies required a disproportionate amount of effort from the team to ensure consensus about the extracted data or about considering information as missing. Moreover, dependencies in the data arose from multiple studies (often secondary analyses) being conducted on the same or overlapping samples. We therefore repeated the analyses on independent data only. Further, we cannot be certain that our classifications of study populations, designs, and PROMs represent the best characteristics to highlight heterogeneity. A more important caveat relates to a recent review of response shift methods [8], which concluded that for each response shift method extra steps need to be taken to ensure that the results can indeed be qualified as response shift, i.e., the effects need to be caused by change in the meaning of the self-report (see also Vanier et al. [17] and Sprangers et al. [29]). In the present study, we considered all detected effects as response shift effects, although their substantiation may be questioned. However, this limitation is inherent to the current stage of response shift research rather than to this systematic review. Further, the heterogeneity of the results may, in part, be due to the variety of detection methods. However, this systematic review cannot disentangle the heterogeneity induced by variation of methods, study context, and design. Another limitation is that the use of different methods and metrics may preclude a clear view of how the resulting numbers compare. We have therefore provided a table where each method is described, response shift detection is detailed, the two most prevalent methods (the then-test and Oort's SEM approach) are further explained, and the effect size metric for all applicable methods is provided (see Table 1). We would like to add that we provided an overview of the various methods and metrics as intended and did not aim to solve the inherent heterogeneity of response shift research. We acknowledge the plea for conceptual and operational clarity about what response shift is, but this is beyond the scope of this systematic review. We also would like to highlight that such a plea is not limited to response shift but is equally applicable to the quality-of-life research field at large and, for that matter, to other behavioral and social science research [30]. Further, we did not perform an assessment of the methodological quality of individual studies. The heterogeneity of the included studies with regard to response shift methods, population characteristics, study design, and PROMs used precludes such an unambiguous assessment. For example, sample size does not apply as a quality criterion to individualized methods. Rather than weighing different study aspects as an indication of study quality, we made them the focus of our main analyses, by describing the prevalence and, where possible, the magnitude of response shift effects for each response shift method, population, study design, and PROM characteristic. Finally, whereas this descriptive review provides insight into how response shift effects and effect sizes vary per characteristic, it does not allow for direct comparison of effects across characteristics, however tempting. Studies examining the same characteristic may differ in many other relevant aspects. For example, the number of response shift parameters in a SEM model is a multiple of that of other methods (e.g., the then-test). Moreover, latent-variable methods can only detect response shift when it affects a minority of items and a majority of study participants [31]. Hence, the percentage of detected response shift effects is generally lower and not directly comparable to that of other methods. Moreover, response shift effects or effect sizes are based on different numbers of studies. Generally, more extreme numbers were found for methods and characteristics based on fewer studies or response shift effects. We therefore only described the results of recalibration response shift when they were based on at least 10 studies and 100 effects. This arbitrary cut-off was intended to guard against over-interpretation of the results. In the case of reprioritization and reconceptualization response shift, the small number of effects reported precluded the application of such criteria. We also refrained from using qualifiers such as a higher or lower prevalence and magnitude of response shift effects and provided the minimum and maximum numbers instead.
The current descriptive review of the results of quantitative response shift studies is the most comprehensive to date. The data provide insight into the heterogeneity of response shift results, i.e., how the number and magnitude of response shift effects differ across studies employing different response shift methods, populations, research designs, and PROMs. In this sense, this paper draws attention to what some scholars may consider a foundational issue in response shift research: the longstanding challenge of harmonizing different metrics of response shift across the various measurement procedures from which it is derived. But even in the absence of such harmonization, insight into response shift effects and effect sizes can inform the future planning of longitudinal PROM studies, guide the selection of the requisite PROM(s), provide important information for analyzing PROM data in diverse populations, and, most importantly, identify those respondents susceptible to response shift effects for whom different healthcare decisions may need to be made.

(2) the pooled standard deviation: SD = SD_pooled; (3) the standard deviation of the baseline measurement: SD = SD_baseline; (4) when the median and interquartile range (IQR) were provided, the mean was derived following Eq. 14 from Wan et al. (2014); (5) when confidence intervals were provided, the standard deviation was derived from the reported confidence interval.
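As a sketch of two of these conversions, the following Python computes the pooled standard deviation with the standard two-group formula and estimates the mean from the median and quartiles via Eq. 14 of Wan et al. (2014), i.e., (q1 + m + q3)/3; the input values are purely illustrative and not drawn from the review:

```python
import math

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation of two groups (standard two-group formula)."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2)
                     / (n1 + n2 - 2))

def mean_from_median_iqr(q1, median, q3):
    """Estimate of the mean from the median and quartiles
    (Eq. 14 of Wan et al., 2014): (q1 + m + q3) / 3."""
    return (q1 + median + q3) / 3.0

# Hypothetical inputs: two groups' SDs and sizes; a reported median and IQR
sd_p = pooled_sd(sd1=1.2, n1=50, sd2=1.0, n2=50)
mean_hat = mean_from_median_iqr(2.0, 3.0, 5.0)
```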

For SEMs, recalibration SMD = (μ_post − μ_pre) / SD, where μ_post and μ_pre are the intercept or threshold values at the follow-up and baseline occasions and SD is the standard deviation used to standardize the difference in intercept or threshold values. This standard deviation can be based on model parameter estimates or sample characteristics, following the same hierarchy as for the calculation of the SMD for the then-test in Eqs. 2-4. Reprioritization/reconceptualization SMD = (λ_post − λ_pre) × μ_post / SD, where λ_post and λ_pre are the factor loading values at the follow-up and baseline occasions, μ_post is the mean of the latent variable(s) at the follow-up occasion, and SD is the standard deviation used to standardize the difference in factor loading values. This standard deviation was based on model parameter estimates or sample characteristics, following the same hierarchy as for the calculation of the SMD for the then-test in Eqs. 2-4.
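A minimal sketch of the recalibration SMD described above; the parameter values are hypothetical, and the function simply standardizes the difference between follow-up and baseline intercept/threshold estimates (the same arithmetic applies to then-test vs. post-test means):

```python
def recalibration_smd(post, pre, sd):
    """Recalibration SMD: (follow-up estimate - baseline estimate) / SD,
    where SD is the chosen standardizer (model-based or sample-based)."""
    return (post - pre) / sd

# Hypothetical SEM intercepts at follow-up and baseline, SD from the model
smd = recalibration_smd(post=10.5, pre=10.0, sd=2.5)  # 0.2
```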

Fig. 2 Identification of studies via databases and registers (identification, screening, and inclusion stages).

5. Study results: detection (yes/no) and magnitude (see under statistical analyses) of recalibration, reconceptualization, and reprioritization, and dependencies, i.e., whether the response shift effect pertained to a subsample (or group) of an overall sample reported in the same manuscript or to the same or overlapping sample from another study.

Table 4
Prevalence and magnitude of response shift results across population characteristics. IQR = interquartile range. n/a = not applicable because either the response shift pathway could not be investigated due to the method used or the results could not be discerned from what was reported in the manuscript.
a Unknown includes several effects for which the pathway was unknown due to it not being explicitly reported. b Prevalence is based on a total of 150 studies and 4868 effects that were investigated or possible. N = # of studies or response shift effects. The # of studies may not add up to 150 (first column) because some studies implemented multiple methods or had multiple samples that were counted separately. % RS detected = the percentage of detected response shift effects of # studies conducted or # response shift effects that were investigated or possible. c Magnitude is based on 91 studies and 1062 effects for which Cohen's d could be determined. N = the # of these response shift effects. ES = Cohen's d.

Table 5
Prevalence and magnitude of response shift results for different study design characteristics. Prevalence is based on a total of 150 studies and 4868 effects that were investigated or possible. N = # of studies or response shift effects. The # of studies may not add up to 150 (first column) because some studies implemented multiple methods or had multiple samples that were counted separately. % RS detected = the percentage of detected response shift effects of # studies conducted or # response shift effects that were investigated or possible. IQR = interquartile range. n/a = not applicable because either the response shift pathway could not be investigated due to the method used or the results could not be discerned from what was reported in the manuscript. Example: For observational study designs, there are a total of 122 studies, of which 88.5% had detected response shift, and 3978 effects, of which 16.2% had detected response shift. There were 1724 recalibration effects, of which 28.1% had detected response shift, and 879 for which a Cohen's d was obtained.
a Unknown includes several effects for which the pathway was unknown due to it not being explicitly reported. c Magnitude is based on 91 studies and 1062 effects for which Cohen's d could be determined. N = the # of these response shift effects. ES = Cohen's d.

Table 6
Prevalence and magnitude of response shift results for different PROM characteristics. Prevalence is based on a total of 150 studies and 4868 effects that were investigated or possible. N = # of studies or response shift effects. The # of studies may not add up to 150 (first column) because some studies implemented multiple methods or had multiple samples that were counted separately. % RS detected = the percentage of detected response shift effects of # studies conducted or # response shift effects that were investigated or possible. IQR = interquartile range. n/a = not applicable because either the response shift pathway could not be investigated due to the method used or the results could not be discerned from what was reported in the manuscript. Example: For "Generic PROMs," there are a total of 76 studies, of which 84.2% had detected response shift, and 1971 effects, of which 14.7% had detected response shift. There were 769 recalibration effects, of which 23.0% had detected response shift, and 324 for which a Cohen's d was obtained. There were also 1202 reprioritization and/or reconceptualization or unknown effects, of which 9.3% had detected response shift, and 18 for which a Cohen's d was obtained.
a Unknown includes several effects for which the pathway was unknown due to it not being explicitly reported. c Magnitude is based on 91 studies and 1062 effects for which Cohen's d could be determined. N = the # of these response shift effects. ES = Cohen's d. d Bolded values indicate pooled results for all generic or disease-specific PROMs. SF = Short Form Health Survey, EQ-5D = EuroQol 5 Dimensions, EORTC = European Organization for Research and Treatment of Cancer. e First and second most frequent PROM based on the # of studies in which the PROM was used (the # of effects were used when two PROMs are tied based on the # of studies alone).