Introduction

Systemic mastocytosis is a rare, clonal mast cell neoplasm driven by the KIT D816V mutation [1]. It is characterized by uncontrolled proliferation and activation of mast cells, which leads to severe and unpredictable symptoms [2]. The incidence of all systemic mastocytosis subtypes is approximately 0.89 per 100,000 per year [3], and the prevalence of indolent systemic mastocytosis (ISM) in the Groningen region of the Netherlands, a major referral area for systemic mastocytosis patients, is estimated at 13 per 100,000 [4]. Unlike other forms of systemic mastocytosis, ISM is associated with a normal or near-normal life expectancy [5]; however, many ISM patients experience severe, life-limiting symptoms that significantly impact daily life [6, 7]. Smoldering systemic mastocytosis (SSM) is similar to ISM in its symptomatology but is associated with a relatively higher burden of mast cells, and it was considered a rare subtype of ISM prior to the 2016 WHO reclassification of systemic mastocytosis [8]. Unfortunately, there are limited treatment options available for patients with systemic mastocytosis and no approved therapies for patients with ISM [8].

As drug sponsors develop ISM treatments, the availability of well-defined and reliable patient-reported outcome (PRO) questionnaires to assess the clinical benefit of those interventions is important. However, no such instrument yet exists that is considered consistent with Food and Drug Administration (FDA) regulatory guidelines for use in the ISM patient population [9, 10, 11, 12]. To fill this gap, the Indolent Systemic Mastocytosis Symptom Assessment Form (ISM-SAF) (©Blueprint Medicines Corporation) was developed in ways consistent with regulatory [9] and scientific guidelines [10, 13] to evaluate clinical benefit hypotheses for use in product approval and labeling decisions.

The content validity of ISM-SAF was established [14], as evidenced by a variety of qualitative research inquiries along with feedback from the FDA to ensure the ISM-SAF aligned with regulatory expectations for instruments intended for use in clinical trials [9]. The goal of the present study was to perform an exploratory psychometric evaluation of scores produced by the ISM-SAF and to explore its use as a clinical trial screening tool. The psychometric performance of scores produced by the ISM-SAF among patients who have ISM or SSM with respect to score variability, distribution, missingness, reliability, and construct-related validity was evaluated to provide evidence for the trustworthiness of the ISM-SAF scores. Additionally, this study aimed to establish an ISM-SAF total symptom score (TSS) cutoff value (i.e., a severity cutoff point) that could distinguish patients with moderate to severe symptoms relative to those with less severe symptoms; subsequently, the ISM-SAF could be used to screen patient eligibility for clinical studies assessing symptomatic improvement based on a minimum level of sign and symptom severity.

Methods

Study design

A prospective, non-interventional, observational study utilized an online survey of patients in the United States diagnosed with ISM or SSM, who completed PRO assessments using a web-based electronic platform (SurveyMonkey®) over the course of 15 days. All study documents were submitted to and approved by a centralized institutional review board (IRB), Schulman IRB, prior to initiating patient recruitment.

Patients were identified through advertising by The Mastocytosis Society, a patient advocacy group for individuals with mastocytosis and other mast cell disorders. The target sample size for this study was 75 adult patients (age ≥ 18) with ISM or SSM. When interested individuals clicked on the web-enabled link in the study advertisement or study recruitment email, they were directed to a web-based, Health Insurance Portability and Accountability Act-compliant [15] platform (SurveyMonkey®) to provide electronic informed consent using an informed consent form [16, 17]. Patient eligibility was confirmed via a patient screener. Participants with a self-reported diagnosis of ISM or SSM were recruited for study participation. Individuals were excluded from the study if they self-reported mast cell activation syndrome, advanced systemic mastocytosis, or any other hematologic malignancies/blood cancers. Additionally, all participants were asked to provide medical documentation of their ISM or SSM diagnosis. Participants unable to provide medical documentation of diagnosis were still allowed to participate; however, a separate analysis was performed for participants whose ISM or SSM diagnosis was confirmed based on a physician review of medical records. Patients were then provided with Day 1 assessments within 48 h. Specifically, patients were asked on Day 1 to provide demographic and health information and complete the following PRO assessments: ISM-SAF, Patient Global Impression of Severity (PGIS), 12-Item Short Form Survey, Version 2 (SF-12v2®), and Mastocytosis Quality of Life Questionnaire (MC-QoL). Subsequently, patients were asked to complete the ISM-SAF on each of the ensuing 13 days (Day 2–Day 14), followed by completion of the ISM-SAF, PGIS, SF-12v2®, and MC-QoL on Day 15.

Analysis populations

The analysis populations included two cohorts. The first cohort included all patients who self-reported a diagnosis of ISM or SSM (Self-reported Diagnosis Cohort), and the second cohort included the subsample of patients who also provided a confirmed diagnosis of ISM or SSM via medical documentation (Medically Documented Diagnosis Cohort). Test–retest reliability for the ISM-SAF scores was evaluated using a subsample of patients who exhibited no change in PGIS from Day 1 to Day 15. Post-hoc reliability and validity analyses were performed on patients with only a self-reported diagnosis (i.e., without medical documentation) to give confidence that the scores were similar between patient samples.

Study assessments

ISM-SAF

The ISM-SAF is a 12-item daily diary that assesses the severity of 11 ISM symptoms including bone pain, abdominal pain, headache, nausea, spots, itching, flushing, fatigue, dizziness, brain fog, and diarrhea over a 24-h recall period with an 11-point numeric rating scale (NRS), where 0 = No [symptom] and 10 = Worst imaginable [symptom]; the twelfth item assesses diarrhea frequency by asking patients to enter a discrete numerical value. As a once-daily diary, the ISM-SAF was completed daily from Day 1 to Day 15 on the study’s web-based platform.

The ISM-SAF is scored at an item level, domain level, and total score level. Two severity domains were hypothesized: the Gastrointestinal Symptom Score (GSS), composed of abdominal pain, nausea, and diarrhea severity (score range 0–30), and the Skin Symptom Score (SSS), composed of spots, itching, and flushing severity (score range 0–30). The daily domain scores are generated by summing the item scores of each day, and all contributing items need to be completed to calculate a daily score. The daily total symptom score (TSS) was created by combining all items except the diarrhea frequency item (range 0–110). Weekly scores were derived as seven-day averages of daily scores (Week 1: Days 2–8, Week 2: Days 9–15, with a minimum of four daily scores required), and biweekly scores were derived by averaging scores over 14 days (Days 2–15, with a minimum of seven daily scores required).
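The scoring rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the instrument's official scoring program: the item labels, function names, and data layout (one dictionary per diary day, `None` for a missing daily score) are assumptions made for the example, and applying the complete-items rule to the TSS as well as the domain scores is also an assumption.

```python
from statistics import mean
from typing import Mapping, Optional, Sequence

# Hypothetical item labels; the official ISM-SAF item wording is not reproduced here.
GSS_ITEMS = ("abdominal_pain", "nausea", "diarrhea_severity")
SSS_ITEMS = ("spots", "itching", "flushing")
TSS_ITEMS = (
    "bone_pain", "abdominal_pain", "headache", "nausea", "spots", "itching",
    "flushing", "fatigue", "dizziness", "brain_fog", "diarrhea_severity",
)  # all severity items; the diarrhea frequency item is excluded


def daily_score(entry: Mapping[str, Optional[int]], items: Sequence[str]) -> Optional[int]:
    """Sum the 0-10 item scores for one day; None if any contributing item is missing."""
    values = [entry.get(item) for item in items]
    if any(v is None for v in values):
        return None
    return sum(values)


def period_score(daily_scores: Sequence[Optional[float]], min_days: int) -> Optional[float]:
    """Average the available daily scores; None if fewer than min_days are present."""
    present = [s for s in daily_scores if s is not None]
    if len(present) < min_days:
        return None
    return mean(present)


def weekly_and_biweekly(daily_scores: Sequence[Optional[float]]):
    """daily_scores holds 15 daily scores, with index 0 corresponding to Day 1."""
    week1 = period_score(daily_scores[1:8], min_days=4)      # Days 2-8
    week2 = period_score(daily_scores[8:15], min_days=4)     # Days 9-15
    biweekly = period_score(daily_scores[1:15], min_days=7)  # Days 2-15
    return week1, week2, biweekly
```

For example, a patient who rates every severity item as 2 on a given day would have a daily GSS of 6, a daily SSS of 6, and a daily TSS of 22.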

Supportive measures

The psychometric evaluation of the ISM-SAF was supported by other clinical and PRO assessments, which were administered on Day 1 and Day 15:

Patient Global Impression of Severity (PGIS)

The PGIS is a single item that asks patients to rate their overall symptom severity at present on a five-point scale (“0–absent,” “1–minimal,” “2–moderate,” “3–severe,” and “4–very severe”).

SF-12v2® Health Survey (SF-12v2®)

The SF-12v2® is a 12-item PRO questionnaire assessing physical and emotional health- and function-related limitations using a recall period of “the past week” on three- and five-point verbal response scales (scores range from 0 to 100, with higher scores representing better health) [18, 19]. It comprises eight health domains (physical functioning, role physical, bodily pain, general health, vitality, social functioning, role emotional, and mental health) and composite scores are calculated for mental and physical constructs.

Mastocytosis Quality of Life Questionnaire (MC-QoL)

The MC-QoL is a 27-item PRO questionnaire assessing health-related quality of life impairment in patients with cutaneous mastocytosis and ISM [20] using a recall period of “the past two weeks” and a five-point verbal response scale (scores range from 0 to 100, where higher scores indicate higher health-related quality of life impairment). It consists of four domains (symptoms, emotions, social life/functioning, and skin) and a total score is calculated.

Analyses

Sample

Descriptive statistics for age, gender, race, ethnicity, work status, and education level; experience of mastocytosis in the skin; and treatment history were computed and presented for the study sample upon entry into the study.

Score distribution

Item-level and domain-level score distributions for the ISM-SAF were evaluated in terms of respondents’ use of the entire scale and for floor and ceiling effects.

Reliability

Reliability estimates characterize the consistency and reproducibility of scores produced by a questionnaire when administered to a particular target patient population in a particular context of use. They can be evaluated using various methods, depending on the nature of the assessment and context of administration. In this study, the reliability of ISM-SAF scores was assessed in two ways. First, internal consistency reliability was investigated by calculating Cronbach’s alpha coefficient (α, range 0 to 1) for the TSS, GSS, and SSS (biweekly scores) and again with each item removed to assess the impact that removal had on the overall α. Values of α greater than 0.70 are typically seen as sufficient for research purposes [21]. Second, test–retest reliability was assessed among patients who exhibited no change in PGIS from Day 1 to Day 15, using the intra-class correlation coefficient (ICC) [22] and its 95% confidence interval, based on the comparison of ISM-SAF TSS, domain, and item scores collected during Week 1 and Week 2.
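As an illustration of the internal consistency calculation, the following Python sketch computes Cronbach's alpha and the alpha obtained when each item is removed in turn. It uses the standard textbook formula with sample variances; the function names and the data layout (one list of item scores per patient, complete cases only) are assumptions for the example, and the sketch is not the analysis code used in the study.

```python
from statistics import variance
from typing import List, Sequence


def cronbach_alpha(rows: Sequence[Sequence[float]]) -> float:
    """Cronbach's alpha for complete-case data.

    rows: one sequence of item scores per patient, all of equal length (k >= 2).
    alpha = k / (k - 1) * (1 - sum of item variances / variance of summed score)
    """
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def alpha_if_item_deleted(rows: Sequence[Sequence[float]]) -> List[float]:
    """Alpha recomputed with each item removed in turn (requires k >= 3)."""
    k = len(rows[0])
    return [
        cronbach_alpha([list(row[:i]) + list(row[i + 1:]) for row in rows])
        for i in range(k)
    ]
```

Comparing each entry of `alpha_if_item_deleted` with the full-scale alpha shows whether dropping an item would raise or lower the estimate, which is the comparison described above.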

Construct-related validity

Construct-related validity is concluded upon evidence that scores produced by a target questionnaire relate to scores from other assessments in ways that are logical and according to a priori hypotheses [9]. In the present study, the relationships between ISM-SAF scores and those generated by the supportive assessments were examined via correlational analysis and interpreted based on the following absolute value guidelines (correlation range is -1 to 1): negligible relationship, r = 0.0–0.09; small relationship, r = 0.10–0.29; medium relationship, r = 0.30–0.49; and strong relationship, r ≥ 0.50 [23, 24].
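The interpretation guidelines above can be expressed as a small helper, shown here alongside a plain-Python Spearman rank correlation (the coefficient type named in the Table 4 caption). This is a sketch under stated assumptions: the function names are hypothetical, tied values receive average ranks, and a coefficient that falls between two published bands (for example 0.095) is assigned to the lower band.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        tied_rank = (pos + end) / 2 + 1  # average of positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = tied_rank
        pos = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def describe_strength(r: float) -> str:
    """Map |r| to the bands quoted in the text (band edges are an assumption)."""
    size = abs(r)
    if size < 0.10:
        return "negligible"
    if size < 0.30:
        return "small"
    if size < 0.50:
        return "medium"
    return "strong"
```

Because the guidelines apply to absolute values, a coefficient of -0.62 and one of 0.62 are both described as strong; the sign only indicates direction.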

Known-groups methods characterize the degree to which a PRO questionnaire generates scores capable of distinguishing among patient groups hypothesized to be clinically distinct [9]. This analysis was conducted using the PGIS, MC-QoL (tertiles), and SF-12v2® (tertiles) to categorize patients into “known groups” on Day 15, and ISM-SAF scores were described across patient severity groups. It was hypothesized that the higher ISM-SAF scores (greater symptoms) would be associated with groups of patients with higher PGIS and MC-QoL scores and lower SF-12v2® scores.

Daily, weekly, or biweekly TSS and domain scores were used in correlational and known-groups analyses to match the recall period of the respective supportive assessment administered on Day 15 (i.e., PGIS correlation with Day 15 ISM-SAF scores, SF-12v2® correlation with Week 2 ISM-SAF scores, and MC-QoL correlation with biweekly ISM-SAF scores).

ISM-SAF score severity cutoffs

To estimate a cutoff value in the ISM-SAF TSS to identify respondents who experience moderate to severe signs and symptoms of ISM, tertile groupings were formed and receiver operating characteristic (ROC) analyses were conducted. Tertile groupings of the biweekly TSS were calculated for both the Self-reported Diagnosis Cohort and the Medically Documented Diagnosis Cohort. ROC curve analysis was conducted to separate patients who were minimally symptomatic from patients who were moderately or more severely symptomatic based on the dichotomized biweekly PGIS scores at Day 15 (i.e., patients with a score of one or below on the PGIS were defined as having minimal or absent symptom severity [coded as 0], and patients with a score of two or above were identified as having some level of symptom severity [coded as 1]). Individual TSSs were examined with regard to sensitivity (i.e., the degree to which the score would correctly identify individuals with moderate to severe symptoms) and specificity (i.e., the degree to which the score would correctly identify individuals who did not have moderate to severe symptoms). Positive and negative predictive values (PPV and NPV) indicated the degree to which the score identified individuals who were also classified as moderate or severe/very severe versus absent/minimal on the PGIS, respectively. The cutoff point on the TSS with the largest Youden’s index indicated the maximization of sensitivity and specificity.
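The cutoff search described above can be sketched as follows. The sketch dichotomizes the PGIS as stated in the text, evaluates each observed score as a candidate cutoff, and keeps the one with the largest Youden's index (sensitivity + specificity - 1). Treating a score at or above the cutoff as screen-positive, the function names, and the toy data in the usage note are assumptions for illustration; this is not the study's analysis code.

```python
from typing import Dict, Optional, Sequence


def pgis_to_label(pgis: int) -> int:
    """PGIS of 0 or 1 (absent/minimal) is coded 0; PGIS of 2 or above is coded 1."""
    return 1 if pgis >= 2 else 0


def cutoff_stats(scores: Sequence[float], labels: Sequence[int], cutoff: float) -> Dict[str, Optional[float]]:
    """Classification statistics when score >= cutoff is treated as screen-positive.

    Assumes both classes (0 and 1) are present in labels.
    """
    tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "cutoff": cutoff,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp) if (tp + fp) else None,
        "npv": tn / (tn + fn) if (tn + fn) else None,
        "youden": sensitivity + specificity - 1,
    }


def best_cutoff(scores: Sequence[float], labels: Sequence[int]) -> Dict[str, Optional[float]]:
    """Evaluate every observed score as a cutoff; return the one maximizing Youden's index.

    Ties are resolved in favor of the lowest cutoff.
    """
    candidates = sorted(set(scores))
    return max((cutoff_stats(scores, labels, c) for c in candidates), key=lambda d: d["youden"])
```

A usage example with made-up values: for biweekly scores `[5, 10, 15, 20, 25, 30, 35, 40]` and PGIS ratings `[0, 1, 2, 1, 2, 3, 2, 4]`, `best_cutoff(scores, [pgis_to_label(p) for p in pgis])` selects a cutoff of 25. In practice, the cutoff with the largest Youden's index is only a starting point, since sample retention and other considerations may also bear on the choice.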

Results

Study sample

A total of 116 eligible patients were screened into the study; 103 were included in the Self-reported Diagnosis Cohort, and 58 were included in the Medically Documented Diagnosis Cohort (ISM: n = 56, 96.6%; SSM: n = 2, 3.4%). In the Self-reported Diagnosis Cohort, mean age was 50.2 years (standard deviation [SD] = 12.6), 81.6% were female, and 98.1% were white. Demographic characteristics for the Medically Documented Diagnosis Cohort were largely similar, with a slightly lower proportion of male patients compared to the Self-reported Diagnosis Cohort (10.3% versus 18.4%). Complete demographic and health information details for both cohorts are presented in Table 1; Additional file 1: Table S1 additionally contains demographic and health information for those patients with only a self-reported diagnosis (n = 45). Concomitant medications reported by patients on entry into the study are presented in Table 2.

Table 1 Sample demographic and health characteristics
Table 2 Concomitant medication use (Self-reported Diagnosis Cohort; N = 103)

Score distribution

Descriptive analysis of the ISM-SAF indicated that while patients used the range of response options available to them for each item (i.e., 0 to 10), not all patients reported experiencing all symptoms and, when symptoms were reported, severity ratings were variable. In the Self-reported Diagnosis Cohort, the mean weekly GSS, SSS, and TSS were 5.3 (SD = 4.5), 8.3 (SD = 5.3), and 27.3 (SD = 15.4), respectively. The mean weekly ISM-SAF item scores ranged from 1.4 (SD = 1.8) to 4.6 (SD = 2.4), all below the midpoint of the 0–10 scale. Notably, responses tended to cluster near the lower end of the scale (i.e., less severe symptom experience) and many patients reported “no [symptom]” (i.e., a response choice of “0”). The same pattern was observed in the Medically Documented Diagnosis Cohort.

Reliability

Internal consistency reliability

Internal consistency estimates (α) are presented in Table 3 and suggest adequate reliability for use in research settings for the TSS and marginal to adequate reliability for the GSS and SSS as a biweekly score in both cohorts. Removal of items from the TSS typically reduced overall alpha coefficients; any instances in which alpha increased (e.g., Item 4, spots) were only marginal. Additional file 1: Table S2 presents internal consistency reliability estimates for those patients with only a self-reported diagnosis.

Table 3 Internal consistency reliability (α) on the biweekly ISM-SAF

Test–retest reliability

Test–retest reliability estimates comparing Week 1 (an average of scores generated on Days 2 to 8) and Week 2 (an average of scores generated on Days 9 to 15) were all excellent (> 0.75) [25] based on patients who exhibited no change in PGIS scores from Day 1 to Day 15 (n = 61) (Table 3).

Validity

Construct-related validity

The relationships between the TSS and other variables were strong and in the expected direction. No noteworthy differences or distinctions were observed regarding the pattern of relationships among the Self-reported Diagnosis Cohort and the Medically Documented Diagnosis Cohort. As expected, the TSS was more strongly correlated with variables assessing symptoms and physical function (such as the role physical and bodily pain domains of the SF-12v2® and the symptoms domain of the MC-QoL) and less strongly correlated with variables associated with more distal disease impacts (such as the mental component score or the role emotional domain of the SF-12v2®). Patients reporting increased symptom involvement on the ISM-SAF also rated themselves as more severely afflicted on the PGIS. Correlations with other measures were generally greater for the TSS than for the GSS and SSS, except for the MC-QoL Skin domain, which correlated most strongly with the SSS as expected (Table 4). Additional file 1: Table S3 presents comparable data for those patients with only a self-reported diagnosis.

Table 4 Spearman correlations of ISM-SAF total and domain scores with other measures administered at Day 15

Known-groups analysis

Based on results from both cohorts, TSS, GSS, and SSS scores were clearly distinct across all patient severity groups, in the hypothesized direction (i.e., patients with greater symptoms and impacts, as assessed by the PGIS, MC-QoL, and SF-12v2®, also scored higher on the ISM-SAF), and those differences were statistically significant (p < 0.05) (Table 5). Additional file 1: Table S4 presents comparable data for those patients with only a self-reported diagnosis.

Table 5 Known-groups analysis of the ISM-SAF total and domain scores based on PGIS, MC-QoL, and SF-12v2® assessments administered at Day 15

ISM-SAF score severity cutoffs

ISM-SAF tertile groupings

The biweekly TSS marking the 33rd percentile (P33) was 19.1 for the Self-reported Diagnosis Cohort and 20.6 for the Medically Documented Diagnosis Cohort. The biweekly TSS scores marking P66 were 31.2 and 35.1, respectively. These results suggest that a biweekly TSS ranging from 19.1 to 20.6 would delineate the two-thirds of the study population reporting the most severe symptomatic experience.

ROC curve analysis

The analysis of the Self-reported Diagnosis Cohort suggested that a TSS of 21 (sensitivity = 82.0% [i.e., correctly identifies 82.0% of patients with moderate to severe symptoms], specificity = 68.3% [i.e., correctly identifies 68.3% of patients whose symptoms are not moderate or severe], PPV = 79.4% [i.e., 79.4% of patients flagged by the cutoff were classified as moderate or severe/very severe on the PGIS], NPV = 71.8% [i.e., 71.8% of patients not flagged by the cutoff were classified as absent/minimal on the PGIS], Youden's index = 0.50) can be used as a threshold to identify patients with at least moderate symptoms (Fig. 1a).

Fig. 1 Receiver operating characteristic curve: Biweekly total symptom score predicting moderate/severe/very severe on Patient Global Impression of Severity in the Self-reported Diagnosis Cohort (n = 102; a, shown on the left) and the Medically Documented Diagnosis Cohort (n = 57; b, shown on the right)

The analysis of the Medically Documented Diagnosis Cohort suggested a TSS of 28 (sensitivity = 80.7%, specificity = 76.9%, PPV = 80.6%, NPV = 76.9%, Youden's index = 0.58) can be used as a threshold to identify patients with at least a moderate condition (Fig. 1b).
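As a quick arithmetic check, the reported Youden's indices can be reproduced from the reported sensitivities and specificities, assuming the standard definition of the index as sensitivity plus specificity minus one (with both expressed as proportions):

```latex
\begin{align*}
J &= \text{sensitivity} + \text{specificity} - 1 \\
J_{\text{self-reported}} &= 0.820 + 0.683 - 1 = 0.503 \approx 0.50 \\
J_{\text{medically documented}} &= 0.807 + 0.769 - 1 = 0.576 \approx 0.58
\end{align*}
```

Both values agree with the indices reported for the two cohorts.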

Discussion

With its content validity established [14], results from the present observational study demonstrated the ISM-SAF to be capable of generating reliable and construct-valid scores when administered in its target patient population. Specifically, internal consistency estimates (α) for the TSS express strong reliability and, while lower for the GSS and SSS, are still acceptable, particularly for a newly developed assessment [21]. Further, test–retest reliability (all ICCs ≥ 0.86), construct validity (e.g., correlational analyses indicated that ISM-SAF scores were more strongly correlated with variables assessing symptoms and physical function, and less strongly correlated with variables associated with more distal disease impacts), and known-groups analyses (e.g., TSS, GSS, and SSS were distinguished among clinically unique groups as specified by the PGIS, SF-12v2®, and MC-QoL) all generated results supporting the strong performance of the ISM-SAF scores.

Another goal of the study was to estimate a cutoff value for the ISM-SAF TSS capable of distinguishing respondents who experience moderate to severe signs and symptoms of ISM from those who are less afflicted. The purpose of this exploration was to anticipate use of the ISM-SAF to screen patients into (or out of) future clinical studies based on a minimum level of symptom severity. While descriptive tertile groupings suggest a biweekly TSS in the range of 19.1 to 20.6 would delineate patients reporting the most severe experience of ISM symptoms, ROC analyses suggested that a biweekly TSS of 21 to 28 would be adequate for that purpose. Choosing an optimal cutoff point for clinical trial screening purposes, however, should take other factors into consideration. For example, particularly for a rare condition (such as systemic mastocytosis), care must be taken to ensure that the severity cutoff point does not exclude large numbers of potential patients (i.e., does not limit the clinical study sample and ability to draw reliable conclusions regarding product efficacy or safety). For the present study, a biweekly ISM-SAF TSS cutoff value of between 21 and 28 was suggested for screening purposes in Blueprint Medicines’ BLU-285-2203 pivotal Phase 2 clinical trial. The upper value of 28 was the more conservative recommendation, and it was assumed that the use of this cutoff would retain a large enough sample to meet clinical study goals. Researchers could be confident that the use of this cutoff value would allow the identification of patients with moderate to severe symptoms.

Patients who entered this study were taking many concomitant medications. Thus, it should be noted that patients’ symptom experience—as captured by the ISM-SAF—could have been impacted by management of ISM symptoms through the use of symptomatic treatments. Although there is the potential for experience of side effects with medication use, it is anticipated that the overall ISM symptom experience of patients in this study may have been less severe than in the absence of symptomatic treatment use. This further supports the value of 28 as a more conservative recommendation for moderate symptom threshold; however, the relatively small proportion of patients in the more severe PGIS categories should be noted as a limitation to the score severity cutoff analyses.

Although study patients reported symptom severity across the range of ISM-SAF response options (0–10), responses clustered near the lower end of the scale (i.e., less severe symptoms). From a measurement perspective, it is tempting to conclude a “floor effect” (a potentially artificial or unnatural lower limit of response choices and subsequent inability to measure levels of the target concept that fall below that lower limit) [26]. Relevant here, however, is that it is conceptually impossible to experience a symptom less severely than not experiencing the symptom at all (which is what the response choice of “0” reflects, “No [symptom]”). Therefore, it is likely that the observed data reflect the actual experiences of the target patient population. This pattern was anticipated and is consistent with the qualitative research activities that contributed to the development of the ISM-SAF, which showed that not all patients experience all symptoms on all days and that, when they do experience a given symptom, its severity is variable [14].

Another potential limitation in this study is that patients self-reported their ISM or SSM diagnosis. To address the possibility of including patients who did not have systemic mastocytosis, a separate psychometric analysis was performed on the 56% (n = 58/103) of patients who provided medical documentation of a confirmed ISM or SSM diagnosis. The reliability and validity findings were similar between the two cohorts, which adds to investigator confidence that the entire sample (N = 103) did have ISM or SSM. Additionally, psychometric analyses were performed on patients with only a self-reported diagnosis without medical documentation (44%, n = 45/103) to give confidence that the scores were similar between patient samples. While minor differences in the data were observed (e.g., less distinct differences in the SSS between known groups of patients with only a self-reported diagnosis), overall findings from the post-hoc analysis were comparable to those from patients with a medically documented diagnosis. This similarity in demographic characteristics and score reliability and validity estimates supports the conclusion that these two samples come from the same population of patients, demonstrating the veracity of the results.

Conclusions

In conclusion, the ISM-SAF produced reliable and construct-valid scores that were capable of distinguishing among clinically distinct groups when administered in the target patient population. These results, along with its strong development history including review, comment, and input from division representatives from the FDA and evidence of content validity, support the use of the ISM-SAF in clinical studies designed to evaluate ISM treatments pursuant to product labeling goals. Additionally, this study supported the use of the ISM-SAF as a study entry criteria tool (using a biweekly TSS of between 21 and 28 as a potential cutoff) for future clinical studies. Implementation of the ISM-SAF in future studies will enable further evaluation of the psychometric performance of its scores, including sensitivity to change, as well as inform score interpretation guidelines, when administered to patients with ISM.