Plain English summary

Desmoid tumors are associated with significant symptoms that affect patients’ quality of life. The effect of treatment on patient-reported outcomes is as important as its effect on the tumor. The efficacy and safety of nirogacestat have been evaluated in an appropriately designed clinical study. Although patient-reported outcome(s) may be included in a clinical study, these outcomes are infrequently included in drug labeling for oncology drugs primarily due to challenges related to the reliability and validity of the measurements. The GODDESS tool was developed to assess Desmoid Tumor/Aggressive Fibromatosis (DT/AF) symptom severity and their impact on patients’ lives. This paper presents the results of the analysis of patient-reported outcomes in the study of nirogacestat. GODDESS was found to be dependable, valid, responsive, and interpretable as a means of examining the effect of treatment on patient-reported outcomes.

Introduction

Desmoid tumor (DT), also known as aggressive or deep fibromatosis [1], is characterized by the development of non-metastasizing, locally aggressive connective-tissue neoplasms and may cause significant morbidity [2]. DT is diagnosed in approximately 1000 to 1650 people in the United States annually, constituting less than 3% of all soft-tissue tumors [3,4,5]. DT can arise in almost any soft tissue but is most typically found in the abdominal wall, intra-abdominal cavity, and extremities [6].

Although DT mortality is low, DT is associated with significant symptoms and impact on patients’ health-related quality of life (HRQoL), particularly when the tumor(s) grow larger regardless of the location [3, 7]. The most frequently reported symptoms associated with DT include pain, soreness, or tingling (caused by the tumor pressing on nearby nerves, muscles, or blood vessels), difficulty moving the arms or hands, limping, or other difficulties moving the legs or feet [4, 8]. Intra-abdominal DT may also cause bloating, constipation, abdominal pain, and/or intestinal obstruction. Improvement of symptomatology and limiting the effects on daily living are highly important to patients with DT [8]; however, no Food and Drug Administration (FDA)-approved therapies are available for DT.

Treatment of DT requires multidisciplinary team collaboration to create an overall treatment plan with the goal of improving both clinical markers, such as progression-free survival and objective response, and patient-relevant outcomes, such as desmoid-specific symptoms (e.g., pain), functioning, and overall HRQoL [9,10,11]. Possible treatment options include systemic therapies, locoregional therapies (e.g., surgery), and/or active surveillance [12]. The presence of a tumor, in addition to the active treatment used to control the tumor, may cause side effects, as well as emotional, social, and financial effects. Hence, patient-reported outcome (PRO) measures are needed to appropriately assess the patient experience. Despite the importance of the patient experience in the overall treatment plan, PRO data rarely reach FDA labeling for oncology drugs, primarily because of methodological concerns about whether the patient experience was captured accurately, for instance, lack of pre-specification or multiplicity adjustment, or the absence of appropriate PRO instruments for a specific disease state [13, 14]. The absence of PRO measure data in labeling limits a holistic understanding of the treatment benefit-risk and may compromise fully informed treatment decisions. Following the 21st Century Cures Act of 2016, the FDA issued a guidance aiming to advance the collection of patient experience data for regulatory decision-making [15].

The Gounder/DTRF Desmoid Symptom/Impact Scale (GODDESS©) was developed to assess patient-reported signs and symptoms associated with DT and their impact on functioning and daily living, with the intent to better capture the patient experience and treatment effects in DT and eventual use as an explanatory endpoint in clinical trials. To date, development of the GODDESS© included qualitative research [8], but evidence of the instrument’s psychometric properties, including the interpretation of clinically meaningful within-person and between-group changes, was lacking. This evidence is needed to enable researchers to make informed decisions on the appropriateness of GODDESS© as a clinical trial efficacy endpoint.

The objectives of this study were to assess GODDESS© cross-sectional and longitudinal measurement properties, including assessment of clinically meaningful, within-person and between-group change in score, using data from a clinical study within the instrument’s context of use.

Materials and methods

Data source

Analyses were conducted using blinded data from the Phase 3, double-blind, randomized, placebo-controlled study of nirogacestat in adults with desmoid tumor/aggressive fibromatosis (DT/AF) (DeFi; NCT03785964). Patients were administered nirogacestat or placebo tablets twice daily continuously in 28-day cycles. Eligible patients were ≥ 18 years old, with histologically confirmed DT/AF that had progressed by ≥ 20% per Response Evaluation Criteria In Solid Tumors (RECIST) v1.1 within 12 months of screening, and an Eastern Cooperative Oncology Group (ECOG) Performance Status (PS) score ≤ 2. A 1:1 randomization allocation was stratified based on tumor location: intra-abdominal (including mesentery/pelvis) and extra-abdominal (including head/neck, para-spinal, extremities, abdominal wall, chest wall, and other). DeFi was conducted in accordance with Good Clinical Practice; all study materials were approved by the appropriate ethics body at each participating center. Informed consent was obtained prior to any study procedures and followed ICH GCP/CFR guidelines. Consent included the collection and use of NIR-DT-301 study data, which were used for the validation of the GODDESS© tool. See further details in Appendix 3 of the supplementary material [16].

Patient-reported outcome assessments

The GODDESS© tool was developed by Memorial Sloan Kettering Cancer Center (MSK) and Desmoid Tumor Research Foundation (DTRF) to measure signs and symptoms of DT (Desmoid Tumor Symptom Scale: DTSS) and their impact on patients’ lives (Desmoid Tumor Impact Scale: DTIS) based on previously conducted content validity related work [8].

DTSS comprises 11 items: items 1–7 assess the severity of key signs and symptoms, including pain, fatigue, swelling, muscle weakness, and difficulty moving; item 8 asks for the specific location of the desmoid tumor; and items 9–11 assess intra-abdominal-specific signs/symptoms and are administered only to those reporting an intra-abdominal tumor location in item 8. DTIS comprises 17 items assessing the impact of symptoms on functioning and daily living. DTSS items 1–7 and 9–11 are evaluated on an 11-point numeric rating scale (NRS) from 0 to 10 to measure severity from “none” to “as bad as you can imagine,” with a 24-h recall period. DTIS items are evaluated either on a 5-point Likert scale ranging from “none of the time” to “all of the time” to measure frequency (items 1–9), or an 11-point NRS from 0 to 10 to measure severity from “none” to “as bad as you can imagine” (items 10–17), with a 7-day recall period. Higher scores indicate more severe symptomatology/impact. GODDESS© is available in 20 languages.

Other PRO measures used in the analysis included the Brief Pain Inventory Short Form (BPI-SF) [17], which assesses clinical pain severity and pain interference with feelings and functions; the Patient-Reported Outcomes Measurement Information System Physical Function (PROMIS-PF) short-form 10a version 2.0 plus 3 additional items [18], which assesses self-reported capability to perform various physical activities; the European Organisation for Research and Treatment of Cancer Quality of Life Questionnaire-Core 30 (EORTC QLQ-C30) version 3.0 [19], which assesses cancer patients’ HRQoL; and the Patient Global Impression of Severity (PGIS) and the Patient Global Impression of Change (PGIC), referring specifically to desmoid symptom severity and overall status, respectively [20].

All PRO measures were completed by the patients using home electronic PRO (ePRO) devices at screening, baseline (i.e., Cycle 1), and in 28-day cycles thereafter. For full details of the PRO measure assessment schedule, see Appendix 4 of the supplementary material.

Analyses

Measurement properties of the PRO tools were evaluated by using measurements at baseline, cycle 4, and cycle 7 from the DeFi study. The blinded data were pooled across treatment arms for these validation analyses. Baseline demographic and clinical variables were described using descriptive statistics (e.g., mean, standard deviation (SD), median) for quantitative variables, and percentage and frequency for categorical variables.

Completion rate from baseline was calculated as the number of patients who completed the measure at each timepoint divided by the number of patients expected at that timepoint (i.e., those still on trial, excluding subject discontinuations). A summary DTSS score over a week was defined as the average of the daily diary score over the 7-day period if the patient completed ≥ 4 of 7 days; otherwise, the score was considered missing.
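The weekly summary rule can be sketched in Python as follows. This is a minimal illustration rather than the study's analysis code; the function name `weekly_dtss_score` and the use of `None` to mark a diary day that was not completed are assumptions made here.

```python
from statistics import mean
from typing import Optional, Sequence


def weekly_dtss_score(
    daily_scores: Sequence[Optional[float]], min_days: int = 4
) -> Optional[float]:
    """Average the completed daily diary scores in a 7-day window.

    `daily_scores` holds one entry per day; None marks a day that was
    not completed (an assumed encoding). Returns None (missing) when
    fewer than `min_days` days were completed.
    """
    if len(daily_scores) != 7:
        raise ValueError("expected exactly 7 daily entries")
    completed = [s for s in daily_scores if s is not None]
    if len(completed) < min_days:
        return None  # too few diary days: weekly score is missing
    return mean(completed)
```

With four completed days the weekly score is the mean of those four values; with three or fewer it is treated as missing.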

Inter-item correlations at baseline were assessed using Spearman and polychoric correlation coefficients for the 11-point NRS item pairs and the 5-point Likert scale items, respectively. Values > 0.4 provided support for combining items into a multi-item scale, whereas pairs of items with coefficients > 0.9 and/or < 0.3 were considered for further scrutiny when developing the scoring algorithm [21,22,23]. Confirmatory factor analyses (CFA) were performed to further evaluate the structure (Appendix 1).
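As an illustration of the screening rule for NRS item pairs, the sketch below computes a Spearman coefficient (Pearson correlation of average ranks) and applies the stated cut-offs. The function names and the labels returned by `screen_pair` are hypothetical; the label for coefficients between 0.3 and 0.4 is an assumption, since the text does not say how that range was handled. Polychoric correlations for the Likert items are not covered here.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for idx in order[i : j + 1]:
            ranks[idx] = avg_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


def screen_pair(r: float) -> str:
    """Apply the cut-offs described in the text to one coefficient."""
    if r > 0.9 or r < 0.3:
        return "scrutinize"  # possible redundancy or weak relation
    if r > 0.4:
        return "supports combining"
    return "inconclusive"  # 0.3-0.4: not specified in the text
```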

Internal consistency reliability, reflecting the degree to which a set of items in the same scale co-vary, was assessed at baseline, cycles 4 and 7 using Cronbach’s alpha coefficients with 95% confidence intervals (CIs); values ≥ 0.70 are considered to represent acceptable reliability [24]. Test–retest reliability, i.e., the repeatability of scores over a time period in patients who are not expected to experience change, was assessed among stable subjects, defined as subjects with no change in PGIS responses between baseline and cycle 2, using 2-way mixed, absolute agreement, single-measure, intra-class correlation coefficients (ICCs) between the two assessments [25]. Values of 0.50–0.90 are considered to represent moderate to good reliability and values > 0.90 excellent reliability [26].
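Both reliability statistics can be written compactly. The sketch below is illustrative only: it uses the standard formula for Cronbach's alpha and the McGraw and Wong absolute-agreement, single-measure ICC computed from two-way ANOVA mean squares, and it does not compute the confidence intervals reported in the study. Function names and the data layout are assumptions.

```python
from statistics import variance
from typing import Sequence


def cronbach_alpha(items: Sequence[Sequence[float]]) -> float:
    """items[i][p] is the score of respondent p on item i."""
    k = len(items)
    if k < 2:
        raise ValueError("need at least two items")
    totals = [sum(scores) for scores in zip(*items)]  # per-respondent sum
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def icc_absolute_agreement(data: Sequence[Sequence[float]]) -> float:
    """data[p][t] is the score of subject p at assessment t
    (e.g., t = 0 for baseline, t = 1 for cycle 2)."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[t] for row in data) / n for t in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k / n * (ms_cols - ms_error)
    )
```

Because the ICC here measures absolute agreement, a constant shift between the two assessments lowers the coefficient even when the rank order of subjects is perfectly preserved.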

Construct validity was assessed by examining convergent and known-groups validity. Convergent validity, referring to how well constructs that theoretically should be related to each other are observed to be related, was evaluated at baseline by correlation coefficients between DTSS and DTIS and concepts captured from other PRO instruments (i.e., DTSS and BPI-SF Pain Severity items and overall Pain Severity scale, PROMIS-PF short-form 10a score, EORTC QLQ-C30 Global Health Status, symptom scales, and appetite loss; DTIS and BPI-SF Interference scale, 3 PROMIS items, EORTC QLQ-C30 functional scales and insomnia). Correlation coefficients ≥ 0.40 are considered evidence of convergent validity [22]. Known-groups validity, demonstrating the degree to which DTSS and DTIS scores can distinguish among groups of subjects hypothesized to be different in the concept of interest, was evaluated by investigating whether the distributions of DTSS and DTIS varied by groups defined by PGIS (response categories) and ECOG PS (0 versus 1).

Responsiveness and interpretation of meaningful within-person change in score

Responsiveness reflects the ability of an instrument to detect changes in groups of patients who have changed in the measured concept. Change from baseline to cycles 4 and 7 for DTSS and DTIS was evaluated within groups of patients who had changed according to PGIS, using analysis of covariance (ANCOVA) and magnitude of change via effect sizes (ES) (within-group: mean change from baseline/SDbaseline; between-group: mean change from baseline between two groups/SDpooled). Cohen’s ES rules were used for interpretation (i.e., 0.20 = small, 0.50 = moderate, and 0.80 = large) [27].
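The two effect sizes can be sketched as below. This is an illustration, not the study's code: the function names are hypothetical, and the pooled SD is assumed to be computed from the change scores of the two groups being compared, which the text does not specify.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def within_group_es(baseline: Sequence[float], follow_up: Sequence[float]) -> float:
    """Mean change from baseline divided by the baseline SD."""
    changes = [f - b for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def between_group_es(changes_a: Sequence[float], changes_b: Sequence[float]) -> float:
    """Difference in mean change between two groups divided by the
    pooled SD (assumed here to be pooled over the change scores)."""
    na, nb = len(changes_a), len(changes_b)
    pooled_sd = sqrt(
        ((na - 1) * stdev(changes_a) ** 2 + (nb - 1) * stdev(changes_b) ** 2)
        / (na + nb - 2)
    )
    return (mean(changes_a) - mean(changes_b)) / pooled_sd
```

Negative values indicate a decrease in score, i.e., improvement on DTSS and DTIS; the magnitude is what Cohen's thresholds (0.20, 0.50, 0.80) are applied to.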

Meaningful change was evaluated both at within-person level (i.e., the amount of change a person would report to indicate that a relevant benefit has been experienced) and between-group level (i.e., the difference in scores between treatment and comparator groups considered clinically meaningful). Of note, within-person thresholds can be used to define responders for statistical tests of proportions improved in each treatment group, whereas between-group thresholds support statistical inference of mean change between treatment and comparator groups and are used to assess the magnitude of the observed difference. To define meaningful change thresholds (MCTs) in DTSS and DTIS, anchor-based and distribution-based methods were evaluated and supported by visual displays of change score distribution using empirical cumulative distribution function (eCDF) curves. Firstly, the appropriateness of anchors (i.e., PGIS, PGIC) was tested by correlating them with DTSS and DTIS change scores from baseline to cycles 4 and 7; correlations > 0.30 were considered desirable [28]. Anchor-based methods included descriptive and modeling approaches (mixed models repeated measures [MMRM], Appendix 2) of the change scores from baseline. Distribution-based estimates included the half SD of the baseline score and the standard error of measurement, i.e., SDbaseline × sqrt[1−test–retest reliability coefficient]. MCTs at the within-person level were evaluated based on the anchor-based estimates if the latter exceeded the relevant distribution-based estimates and the absolute upper 95% CI of the anchor ‘no change’ group. MCTs at the between-group level were also estimated based on anchor-based estimates and more specifically on the mean change between minimally improved patients and those exhibiting ‘no change.’
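The distribution-based estimates and the acceptance check for an anchor-based estimate can be sketched as follows. The function names are hypothetical, and the check is one reading of the rule described above: the magnitude of the anchor-based estimate must exceed both the distribution-based estimates and the upper 95% CI bound of the 'no change' group.

```python
from math import sqrt
from statistics import stdev
from typing import Sequence


def half_sd(baseline_scores: Sequence[float]) -> float:
    """Half of the baseline standard deviation."""
    return 0.5 * stdev(baseline_scores)


def standard_error_of_measurement(
    baseline_scores: Sequence[float], reliability: float
) -> float:
    """SEM = SD_baseline * sqrt(1 - test-retest reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return stdev(baseline_scores) * sqrt(1.0 - reliability)


def anchor_estimate_is_supported(
    anchor_estimate: float,
    distribution_estimates: Sequence[float],
    no_change_upper_ci: float,
) -> bool:
    """True when |anchor estimate| exceeds every distribution-based
    estimate and the absolute upper CI bound of the no-change group."""
    magnitude = abs(anchor_estimate)
    return magnitude > max(distribution_estimates) and magnitude > abs(
        no_change_upper_ci
    )
```

For example, a candidate threshold of a 1.30-point decrease against distribution-based estimates of 0.41 and 1.20 passes the first condition; the no-change CI bound would need to be supplied from the study data.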

Results

Of 142 total patients randomized in DeFi, 109 (76.8%) had extra-abdominal disease and 33 (23.2%) had intra-abdominal disease. At baseline, median age was 34 years, and the majority were female (64.8%), white race (83.1%), non-Hispanic/Latino ethnicity (85.9%), and enrolled in North America (68.3%) (Table 1).

Table 1 Demographic and baseline clinical information

Completion and distribution

One patient was excluded from analysis due to incongruous baseline dates. For the DTSS, a high completion rate was observed during the baseline weekly period (ranging from 68.8% on day 1 to 93.6% on day 7). Completion rates dropped in later cycles, although most days remained above 65%. For the DTIS, a high completion rate was observed at baseline (92.9%), with lower rates at later cycles (65.4% at cycle 4, 67.0% at cycle 7).

Patients showed low symptomatology at baseline with mostly ceiling effects (≥ 20% patients reported the “best” possible score 0 = None, i.e., could not improve) and a good distribution of remaining response options in all DTSS items. Items referring specifically to intra-abdominal symptoms were only answered by a subset, i.e., those with intra-abdominal tumors. These items had the highest percentage of ceiling effects (e.g., ~ 70% of patients reported “None” for item 10: “Nausea”) (Fig. 1). In general, symptomatology improvement was observed in the post-baseline cycles, as most patients reported no symptoms (Supplementary Figs. S2, S3).

Fig. 1 DTSS item distribution at baseline: day -1

Patients reported low impact of disease on functioning and daily living at baseline, as most DTIS items exhibited floor effects (> 20% reported no impact of the symptom). An exception to this was observed for the mobility items (i.e., item 1: difficulty moving, and item 4: comfortable in bed), as less than 20% reported no impact. Items reflecting mental concerns (e.g., item 12: fear tests, item 13: fear of recurrence/growth, item 16: anxiety, and item 17: frustration) reported high percentages (≥ 20%) of the worst impact option (Fig. 2). Most DTIS items showed improvement in the post-baseline cycles (Supplementary Fig. S4–S7).

Fig. 2 DTIS item distribution at baseline: day -1

Structural validity

DTSS inter-item correlations were generally moderate to large (0.50 ≤ r < 0.90), indicating that a total score is plausible. Larger correlations were mostly observed among items 1–7, supporting the combination of these into a total symptom scale. The largest correlations were observed among items 1–3, providing evidence of a domain reflecting pain. In addition, moderate-to-strong correlations were observed among items 5–7, supporting an extra-abdominal domain. In contrast, items 9–11, reflecting intra-abdominal symptoms (and answered by a subsample of the initial population [n = 50]), showed lower correlations (Table 2).

Table 2 Item-to-item correlation: GODDESS© DTSS items at baseline

Moderate inter-item correlations were observed among DTIS items 1, 2, 6, 7, 8 and 10, supporting combination of these into a domain reflecting physical functioning impact. However, item 10 was strongly correlated with item 2 (i.e., r > 0.90), suggesting potentially redundant content (both items examined the ‘reaching up’ impact). Moderate inter-item correlations were also observed among items 3–5, supporting the creation of a domain reflecting sleep impact. Strong correlations (r > 0.70) were observed among items 12–17, providing support to create a domain of emotional impact. Items 9 and 11 (‘appearance’ impact) were not highly correlated with other items, indicating they may not belong to a specific domain (Table 3).

Table 3 Item-to-item correlation: GODDESS© DTIS items at baseline

Our hypotheses were further confirmed via CFA models (Appendix 1). Considering the above, a DTSS total symptom score based on items 1–7 (i.e., excluding the specific items for patients with intra-abdominal tumors) and individual domain scores reflecting pain (items 1–3), extra-abdominal symptoms (items 5–7), and intra-abdominal symptoms (items 9–11) were created. Total symptom and individual domain scores were created by averaging the relevant item scores to constitute a daily score, and weekly scores were created by averaging the daily scores of at least 4 out of 7 days before each timepoint. For the DTSS total score, the pain domain score (average of items 1–3) was averaged with items 4–7. For DTIS, three individual domain scores were created reflecting impact on physical functioning (items 1, 2, 6, 7, 8), sleep (items 3–5), and emotional well-being (items 12–17). Items 9, 10, and 11 were considered standalone scores and did not undergo further psychometric assessment. DTSS total symptom and individual domain scores, and the DTIS emotional impact domain score, range from 0 to 10. DTIS physical functioning and sleep domain scores range from 0 to 4 (Supplementary Table S1).
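The DTSS daily scoring described above can be sketched as follows. This is an illustration only: the function name, the dictionary layout (item number mapped to a 0–10 response), and the requirement that all of items 1–7 be present are assumptions, since handling of missing items is not described here.

```python
from statistics import mean
from typing import Dict

PAIN_ITEMS = (1, 2, 3)
EXTRA_ABDOMINAL_ITEMS = (5, 6, 7)
INTRA_ABDOMINAL_ITEMS = (9, 10, 11)


def dtss_daily_scores(items: Dict[int, float]) -> Dict[str, float]:
    """Compute daily DTSS domain and total scores from item responses."""
    pain = mean(items[i] for i in PAIN_ITEMS)
    scores = {
        "pain": pain,
        "extra_abdominal": mean(items[i] for i in EXTRA_ABDOMINAL_ITEMS),
        # Total: the pain domain counts once, alongside items 4-7.
        "total": mean([pain] + [items[i] for i in (4, 5, 6, 7)]),
    }
    # Items 9-11 are only administered for intra-abdominal tumors.
    if all(i in items for i in INTRA_ABDOMINAL_ITEMS):
        scores["intra_abdominal"] = mean(items[i] for i in INTRA_ABDOMINAL_ITEMS)
    return scores
```

Averaging the pain domain with items 4–7, rather than averaging all seven items, keeps the three pain items from carrying three-sevenths of the total score.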

Internal consistency reliability of the DTSS total symptom score was above the conventional threshold of 0.70 at both baseline (0.95; 95% CI 0.94, 0.96) and cycle 7 (0.96; 95% CI 0.95, 0.98). Similar findings were observed for the DTSS individual domains. Internal consistency reliability was also above the threshold for all DTIS domains (Supplementary Table S2). Test–retest reliability, calculated in subjects with ‘no change’ in PGIS scores between baseline and cycle 2, was > 0.90 for the DTSS total symptom and domain scores and > 0.85 for all DTIS domains (Fig. 3).

Fig. 3 Empirical cumulative distribution function of DTSS by PGIS over time

Construct validity was well demonstrated by moderate-to-strong correlations of DTSS total symptom and individual domain scores (i.e., pain, extra-abdominal, intra-abdominal) with similar concepts from other PRO measures as hypothesized. For example, stronger correlations were found with measures of more highly related constructs (such as the “BPI pain at its worst in the last 24 h”; Total r = 0.74; pain domain r = 0.87; extra-abdominal domain r = 0.64; intra-abdominal domain r = 0.58) than with more distally related constructs (such as the “EORTC QLQ-C30 Appetite Loss Scale”; Total r = 0.18; pain domain r = 0.19; extra-abdominal domain r = 0.13; intra-abdominal domain r = 0.34). Further relationships are shown in Supplementary Table S3. Construct validity was also supported for the DTIS domains using similar hypotheses. For example, stronger correlations were found between the “PROMIS-PF short form Bend or Twist Your Back” item and the physical functioning domain (r = − 0.69) than with the sleep domain (r = − 0.39) or the emotional domain (r = − 0.37); further relationships are displayed in Supplementary Table S4. Construct validity was further demonstrated by comparing DTSS (total symptom and individual domains) and DTIS (individual domains) scores among groups hypothesized to be different at baseline. Higher mean scores, indicating more severe symptomatology and impact, were observed in groups with higher PGIS responses (indicating greater severity) and higher ECOG PS scores (indicating a lower level of functionality) (Supplementary Fig. S1).

Responsiveness and interpretation of meaningful within-person change in score

The differences in mean change scores among groups of patients classified by PGIS as improved, stable, or worsened followed the expected pattern in DTSS and DTIS. Mean DTSS total change scores for these PGIS groups were − 1.52 for improved, − 0.23 for stable, and 0.76 for worsened. Results in the same direction were identified for the DTSS pain, extra-abdominal, and intra-abdominal domains, and the DTIS physical functioning, sleep, and emotional impact domains. Between-group ES were mostly moderate to large for the improved group, supporting the ability to detect change (Table 4). ES were mostly small to moderate for the worsened group, which could be due to the limited number of patients who deteriorated during the study. For this reason, MCTs were estimated only for improvement.

Table 4 Ability to Detect Change: Change in DTSS and DTIS scores by PGIS groups from Baseline to Cycle 4

Correlation coefficients between DTSS total and individual domain, and DTIS individual domains change scores, with PGIS, were above the recommended threshold of 0.30 for all scale scores except for DTSS intra-abdominal domain. For the latter, PGIC was selected as the more appropriate anchor (Supplementary Table S5); however, the recommended threshold was met only at cycle 4 (Table 5).

Table 5 Threshold estimates for within-person improvement in DTSS and DTIS scores

For patients with improvement in the PGIS anchor, change in DTSS total score ranged from 1.27 to 1.64 (mean and median values), including the anchor-model approach estimate (1.33) (Appendix 2). Additionally, distribution-based estimates ranged from 0.41 to 1.20, indicating that change score magnitude is likely larger than the measurement error. Hence, a 1.30-decrease was recommended as the improvement within-person MCT for DTSS total score. Following a similar rationale, the recommended improvement within-person MCT for DTSS pain domain score was a 2.00-decrease; for DTSS extra-abdominal domain, a 1.40-decrease; for DTSS intra-abdominal domain, a 1.00-decrease; for DTIS physical functioning, a 0.60-decrease; for DTIS sleep, a 0.70-decrease; and for DTIS emotional, a 1.80-decrease. All thresholds exceeded the upper bound of the 95% CI of the ‘no change’ group (Supplementary Table S6). For the between-group improvement MCTs, change in mean scores between minimally improved and stable patients was calculated as a 0.80-decrease for DTSS total score, a 1.16-decrease for DTSS pain domain score, and a 1.00-decrease for DTSS extra-abdominal domain score. Considering that these are group-level estimations and to avoid underestimation bias, a more conservative between-group improvement estimate of a 1.00-point decrease was recommended for the DTSS total and extra-abdominal domain scores, and a 1.20-point decrease for the DTSS pain domain score. For the DTSS intra-abdominal domain, the number of persons indicating minimal improvement was too low (n = 3) to allow confident recommendations. Following a similar rationale, the recommended between-group improvement MCT was a 0.50-decrease for DTIS physical functioning and sleep scores, and a 2.00-decrease for DTIS emotional impact score (Supplementary Table S7).

The eCDF curves showed the expected shift to the left and clear differentiation, as seen by consistent separation and generally non-crossing curves, mostly for the improvement group. Curves for ‘no change’ or ‘worsening’ were well separated for DTSS total and individual domain scores, and mostly overlapped for the DTIS individual domain scores. For the DTSS intra-abdominal domain score, no clear differentiation was observed (Supplementary Figs. S8–S13).

Discussion

This work contributed to the assessment of GODDESS© psychometric properties, including responsiveness and estimates of clinically meaningful within-person and between-group improvement score change in DT patients. Furthermore, scoring algorithms were suggested and all results were evaluated for the identified domains. More specifically, the GODDESS© tool comprises two scales, the DTSS and the DTIS, for which the following scores were suggested: DTSS total score, DTSS pain, DTSS intra-abdominal and extra-abdominal domains; DTIS physical functioning, DTIS sleep, and DTIS emotional domains.

Strengths of this study include the investigation of DTSS and DTIS scales within a blinded, global, interventional study of DT/AF and via ePROs. ePROs contribute to more accurate and complete participant responses [29]. Descriptive analysis showed ceiling effects, reflecting a sample with low symptom severity; however, the whole range of item responses was used, indicating that the response scales were appropriate. Validity and reliability analyses for DTSS and DTIS scales showed that all scores are valid and reliable measures of DT/AF symptomatology and impact. Recommended reliability thresholds were met (Cronbach’s alpha ≥ 0.70; test–retest ICC > 0.85), and construct validity, including convergent and known-groups validity, was supported. Importantly, the responsiveness of the GODDESS© symptom and impact scores suggested a suitable ability to detect change in patients based on those who reported change on the PGIS. Notably, the responsiveness of GODDESS© was confirmed despite patients’ low symptomatology at baseline. Correlations of change in GODDESS© symptom and impact scores with PGIS suggested that PGIS was an adequate anchor (r > 0.3) for all scores except for the DTSS intra-abdominal domain score. The MCT analysis enabled the derivation of within-person and between-group estimates for identifying clinically relevant responders based on improvement. A within-person threshold should be used to define responders; these can then be analyzed via a test of proportions to investigate whether there are more responders in the treatment group than in the comparator group. Alternatively, researchers may assess mean change between treatment and comparator groups using statistical comparison. The difference in change scores should be at least as large as the between-group MCT to denote changes important to patients.

Study limitations include the relatively small sample size. Although the sample size was adequate for most analyses, especially for a rare disease, it may be suboptimal for factor analysis. Various rules of thumb for CFA exist, such as typically requiring 10 observations per item [23]. Going forward, it is recommended to replicate factorial validity in other studies with a sample size informed by a simulation-based approach. The work presented here could be used as initial information for such simulation-based sample sizing. Furthermore, the limited sample size, especially within the intra-abdominal domain, in combination with the small number of patients experiencing deterioration throughout the study period, did not allow estimation of MCTs for deterioration. However, sample size limitations should be contextualized by considering the rare nature of DT. Although replication of our findings in other studies with larger samples is recommended, this may be particularly challenging for this disease area. In addition, although translation studies have already been conducted and GODDESS© is linguistically validated among various cultures and languages, further psychometric validity testing is needed to investigate measurement invariance properties. Finally, it should be noted that, as with all scales, the MCT estimation is dependent on the population under study, its baseline characteristics, and the anchors [30]. Therefore, it is recommended that future studies be conducted to also provide MCT estimates in different populations and/or with different anchors.

In conclusion, our study showed that GODDESS© has appropriate psychometric properties. In addition, within-person and between-group MCTs of improvement for the different symptom and impact domain scales were estimated. These may be used in future efficacy analyses to identify responders versus non-responders and to assess meaningful efficacy differences between treatment and comparator groups. This study provides a rationale for consideration of the GODDESS© tool to assess DT symptom and impact severity improvement as an endpoint in clinical trials or population research studies.