Comparison of Simple-Summated Scoring and Toxicity Index Scoring of Symptom Bother in the NSABP B-30 Clinical Trial

Level of symptom burden for cancer patients can be summarized using simple-summated scoring of multiple patient-reported symptoms. The Toxicity Index (TI) is an alternative that has been used primarily to summarize clinician-reported toxicities. This study compares the TI with simple-summated scoring of 28 patient-reported symptoms. It is a secondary analysis of longitudinal data from a clinical trial of women with stage 2 or 3 breast cancer: baseline (n = 2156) and 6 months later (n = 1764). Study participants completed the 28-item Breast Cancer Prevention Trial symptom checklist assessing level of symptom bother in the past 7 days and four criterion items assessing general health and overall quality of life. Associations of simple-summated scoring of the 28 cancer-related symptoms with the general health and overall quality of life items tended to be larger than those of the TI summary scoring of the symptoms. For example, the Spearman correlation of change in quality of life was −0.38 with change in the simple-summated score and −0.23 with change in the TI. The findings suggest that simple-summated scoring and differential weighting of the level of symptom bother yield similar results. Clinicians can use simple-summated scoring rather than more complicated scoring algorithms to obtain an indication of overall level of symptom burden among cancer patients.

The Assessing the Symptoms of Cancer Using Patient-reported Outcomes (ASCPRO) working group asserted that symptom burden is the patient-reported counterpart to disease and tumor burden (Cleeland & Sloan, 2010). The ASCPRO group also noted that more stable and informative self-reports of symptoms are obtained by asking about symptom severity rather than simply the presence or absence of symptoms.
The National Cancer Institute (NCI) Symptom Management and Health-Related Quality of Life Steering Committee recommended a core set of 12 symptoms be "assessed across oncology trials for the purposes of better understanding treatment efficacy and toxicity and to facilitate cross study comparisons" (p. 1): fatigue, insomnia, pain, anorexia, dyspnea, cognitive problems, anxiety, nausea, depression, sensory neuropathy, constipation, and diarrhea (Reeve et al., 2014). Each symptom can be examined separately, but there is often interest in combining different symptoms into a summary score (McColl, 2004). The NCI committee did not recommend how level of symptom burden should be summarized. Often, symptom items are summed or averaged (Niu et al., 2021).
For example, the "symptom distress" scale is the sum of responses to 13 symptoms: nausea, appetite, insomnia, pain, fatigue, bowel pattern, concentration, appearance, breathing, outlook, and cough (Stapleton et al., 2016). While "there are some instances where sum scores are justified; the problem … is employing methods without any justification" (McNeish & Wolf, 2020, p. 2301). More generally, Ferrando & Lorenzo-Seva (2020, p. 207) commented that "all the models and scoring schemas are approximations, and the key issue is to decide whether the approximation is good enough for the purposes of the study." The Toxicity Index (TI) is an alternative to simple summation (Razaee et al., 2021). The TI was inspired by hash functions and summarizes all n observed toxicity grades as defined in the Common Terminology Criteria for Adverse Events (CTCAE). Each of the n toxicity grades X_i (i = 1, …, n) for an individual is arranged in descending order: X_1 ≥ X_2 ≥ … ≥ X_n. An individual's TI score is a function of the ordered toxicity grades. Any TI ≥ 3 corresponds to a dose-limiting toxicity, and the maximum toxicity grade is the integer part of the final score. For example, a TI of 3.0 indicates a single grade 3 toxic event, whereas a TI of 3.5 means that the patient experienced at least 1 grade 3 toxic event plus additional toxic events. All toxicity grades are represented in the score, although lower grades contribute less to the final score than higher grades. The TI has been used to summarize clinician-reported toxicities from the CTCAE (Henry et al., 2021) and may also be useful for summarizing patient-reported symptom measures.
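The text above describes the TI but does not print its formula. A commonly cited form, given here as an assumption rather than as the authors' stated definition, weights each ordered grade by the inverse product of one plus every higher-ranked grade:

```latex
\[
\mathrm{TI} \;=\; X_1 \;+\; \sum_{i=2}^{n} \frac{X_i}{\prod_{j=1}^{i-1} \left(1 + X_j\right)},
\qquad X_1 \ge X_2 \ge \dots \ge X_n .
\]
```

Under this assumed form, a single grade 3 event gives TI = 3, and grades of 3 and 2 give 3 + 2/(1 + 3) = 3.5. The terms after the first always sum to less than 1, so the integer part of the TI equals the maximum grade, which agrees with the properties described above.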
Simple-summated scoring and the TI represent conceptually distinct scoring approaches. The former treats each symptom as equivalent in its contribution to the total score (i.e., each additional symptom has equal input on the simple sum). The TI heavily weights having one bothersome (i.e., very much bothered) symptom and each additional symptom has decreasing input. If the 28 items are scored from 0 (not at all) to 4 (very much), as shown in the Appendix, the sum for a patient who reports that they are very much bothered by one symptom but not at all bothered by the 27 other symptoms is 4 on the 0-112 possible range (average score = 0.14 on 0-4 possible range). The TI for a person reporting being not at all bothered for 27 symptoms and very much bothered by one symptom is much higher: 4.00 on the 0-4.99 possible range.
Simple-summated scoring of symptoms has not yet been compared to the TI. This paper provides an initial comparison using secondary analyses of patient-reported data collected in the NSABP B-30 cancer clinical trial.

Sample
Patient-reported data in the NSABP B-30 clinical trial (Swain et al., 2010) at baseline (n = 2156) and 6 months later (n = 1764) were analyzed. Study participants had stage 2 or 3 breast adenocarcinoma and had undergone primary surgery with lumpectomy or mastectomy and axillary lymph node dissection. Combinations of doxorubicin (A), cyclophosphamide (C), and docetaxel (T) were evaluated by randomizing participants to one of three regimens: 1) sequential A 60 mg/m² plus C 600 mg/m² every 3 weeks for four cycles followed by T 100 mg/m² every 3 weeks for four cycles; 2) concurrent A 60 mg/m², C 600 mg/m², plus T 60 mg/m² every 3 weeks for four cycles; or 3) A 60 mg/m² plus T 60 mg/m² every 3 weeks for four cycles.

Measures
Study participants completed questionnaires to assess symptoms before chemotherapy, at cycle 4, day 1 of chemotherapy, at 6 months, and then every 6 months through 24 months following chemotherapy initiation. A 28-item Breast Cancer Prevention Trial symptom checklist with five response options (Not at All, A little bit, Somewhat, Quite a Bit, Very Much) assessing symptom bother in the past 7 days was administered (Ganz et al., 2011).

Criterion Measures
Four self-report items hypothesized to be associated with level of symptom burden were also included on the same survey as the symptom items:
1. In general, would you say that your health is Excellent, Very good, Good, Fair, or Poor.
2. In the past 7 days, I feel ill (Not at All, A little bit, Somewhat, Quite a Bit, Very Much).
3. In the past 7 days, I am bothered by side effects of treatment (Not at All, A little bit, Somewhat, Quite a Bit, Very Much).
4. Please score your overall quality of life in the past 7 days on an 11-point scale between dead (0) and perfect health (10).
The first is the most widely used self-reported health item (Hays et al., 2015), the second is a FACT-G item (Cella et al., 1993), the third is the "GP5" item (Pearman et al., 2018), and the fourth item is a global quality of life item (Diehr et al., 2007). We scored all four criterion items so that a higher score represented better health. We also created a 3-item general health scale score by summing responses to the first three items (coefficient alpha [Cronbach, 1951] = 0.75 at the 6-month follow-up; product-moment correlations among the items ranged from 0.46 to 0.57). Because higher criterion scores represent better health and higher symptom scores represent more bother, we hypothesized that the symptom bother simple sum and the TI would be negatively and monotonically related to the criterion items and the general health scale.
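Coefficient alpha for a summed scale such as the 3-item general health scale can be computed from the item variances and the variance of the total score. The following is a minimal sketch in Python using only the standard library. The helper name `cronbach_alpha` is hypothetical and the responses are made up for illustration; they are not trial data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total score)
    """
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Illustrative (made-up) responses to three items, each scored 0-4 so that
# a higher score represents better health.
responses = [
    [4, 4, 3],
    [2, 1, 2],
    [3, 4, 4],
    [0, 1, 1],
    [3, 2, 3],
    [1, 2, 0],
]

print(round(cronbach_alpha(responses), 2))
```

Alpha rises as the items covary more strongly relative to their individual variances, so three items that move together closely yield a value near 1.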

Analyses
Analyses were conducted for those with complete data on the 28 symptom items (88% of the baseline sample). We estimated item-scale correlations and internal consistency reliability for the symptom items. We also conducted categorical exploratory factor analysis with diagonally weighted least squares to evaluate the dimensionality of the symptom items, followed by categorical confirmatory factor analysis. We evaluated model fit using the comparative fit index (CFI) and the root mean square error of approximation (RMSEA). CFI > 0.95 and RMSEA < 0.06 are indicators of excellent model fit (Hu & Bentler, 1999).
Because the TI is a rank-order index, for comparability we estimated Spearman correlations of the TI and simple-summated symptom score at the 6-month follow-up with the 3-item general health scale and the 4 criterion items at 6 months and with their change from baseline to 6 months later. We also estimated Spearman correlations for change in the TI and simple-summated score with change in the general health scale and criterion items. Pearson product-moment correlations tended to be similar or not as large for the TI and similar or larger for the simple sum (results not shown). The significance of differences between the dependent rank-order correlations was estimated using z-statistics (Steiger, 1980). We regressed (ordinary least squares) change in each of the criterion items on TI at baseline, TI at follow-up, simple sum at baseline, and simple sum at follow-up. Next, we ran three separate probit regression models for each criterion item, controlling for its baseline value. The first model included the TI at baseline, the second model the simple-summated symptom score, and the third model included both the TI and the simple-summated score. Finally, we regressed (ordinary least squares) the 3-item health scale on dummy variables representing the levels of each symptom item.
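The rank-order correlations described above can be illustrated with a short standard-library Python sketch: a Spearman correlation is the Pearson correlation of the ranks, with tied values assigned their average rank. The helper names and the change scores below are hypothetical and for illustration only. The Steiger (1980) test for comparing dependent correlations is not shown.

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up change scores (6 months minus baseline) for eight hypothetical patients.
change_symptom_sum = [12, -3, 5, 20, 0, 8, -6, 15]
change_quality_of_life = [-2, 1, -1, -4, 0, 0, 2, -1]

print(round(spearman(change_symptom_sum, change_quality_of_life), 2))
```

A negative coefficient here would indicate that larger increases in symptom bother go with larger declines in quality of life, which is the direction reported in this study.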

Results
Internal consistency reliability for the 28-item symptom scale was 0.83 and item-scale correlations ranged from 0.13 (vaginal bleeding) to 0.55 (general aches and pains). The exploratory factor analysis suggested two possible underlying factors.
The first factor was represented by 23 symptom items and the other factor by 5 items that represented vasomotor and vaginal symptoms (night sweats, hot flashes, cold sweats, vaginal dryness, pain with intercourse). The 23-item general symptom factor correlated 0.484 with the vasomotor/vaginal symptom factor. The one-factor confirmatory model did not fit the data (RMSEA = 0.077, CFI = 0.713). The two-factor confirmatory model fit the data better than the one-factor model (RMSEA = 0.054, CFI = 0.857), with the general 23-item symptom factor correlated 0.465 with the vasomotor/vaginal symptom factor. The fit of the two-factor model improved after adding correlated residuals (RMSEA = 0.037, CFI = 0.934). A bifactor model also fit the data reasonably well (RMSEA = 0.042, CFI = 0.921). The bifactor model revealed larger loadings of the 5 vasomotor/vaginal items on their group factor than on the general factor. Because the pattern of results reported below was similar for all 28 symptoms and the 23 symptoms that loaded on the first factor (results available upon request), we only report TI and simple sum results based on all 28 items.
Spearman correlations between the TI and the simple-summated total symptom score were 0.90 at baseline, 0.86 at follow-up, and 0.66 for change between baseline and follow-up. Table 1 presents Spearman correlations of the 6-month TI and simple-summated symptom score with the 6-month general health scale and the 4 criterion items, and with change in the criterion measures from baseline to 6 months later. The simple sum score had similar but generally stronger correlations than did the TI with the general health scale and criterion items. Spearman correlations of change in the TI and the simple sum with change in the 4 criterion items are provided in Table 2. The simple-summated score for the 28 symptom items was more highly correlated than was the TI with these criteria in most comparisons (7 of 10 comparisons in Table 1 and all 6 of those in Table 2).
In the probit regression models that included either the TI or the simple sum, the standardized coefficients were similar for both, and all were statistically significant (p < .0001). In the models that included both the TI and the simple sum, the TI was significant for the general health scale while the simple sum was significant for the not feeling ill and not bothered by side effects items (Table 3).
The symptom levels with statistically significant associations with the 3-item general health scale at baseline are shown in Table 4. As expected, increasingly severe levels of symptoms tended to be associated with less positive self-rated health. For example, compared to those not at all bothered, the unique decrement in self-rated health was −1.97, −1.92, −3.30, and −5.40 for those reporting they had been bothered by headaches a little bit, somewhat, quite a bit, and very much, respectively, in the past 7 days. The strongest association was seen for those reporting quite a bit of bother from vomiting (−27.81 on the 0-100 possible range). Fifteen of the symptoms (mouth sores, diarrhea, difficulty with bladder control, constipation, hot flashes, vaginal discharge, vaginal bleeding or spotting, vaginal dryness, pain with intercourse, cramps, swelling of hands or feet, weight gain, forgetfulness, night sweats, cold sweats) did not have significant unique associations with self-rated health.

Discussion
The direction and magnitude of associations of the TI and simple-summated scoring of the 28 cancer-related symptoms with the general health scale and single items assessing health and overall quality of life were similar. Monotonic associations of the levels of the symptom items were found in regressions of self-rated health on the symptom bother items. However, several symptoms were not significantly uniquely associated with self-rated health, and, for symptoms with significant associations, not all levels were significantly different from not at all bothered. The differential magnitude of associations across symptoms at the same level of bother (e.g., being quite a bit bothered by vomiting had a negative impact of −25.29 while being quite a bit bothered by headaches had a negative impact of −4.11) suggests that patients weigh symptoms differentially in terms of their impact on their health (Lee et al., 2012).
While differential weighting of symptoms and levels within symptoms might lead to a better summary of level of symptom burden, it has long been argued that "most attempts at differential item weighting show relatively little improvement over simple scoring" (Wainer, 1976, p. 216; quoting B.F. Green, 1975, personal communication). A more recent study showed that simple-sum scores "perform very well in terms of fidelity and the slight impact of nonlinearity and are quite stable under cross-validation" (Ferrando & Lorenzo-Seva, 2020, p. 223). Alternative approaches include weighting different subscales to obtain a summary score. For example, the Kidney Disease Quality of Life (KDQOL) instrument has a summary score that combines scores from a 12-item symptoms/problems scale, 8-item effects of kidney disease scale, and a 4-item burden of kidney disease scale by weighting mean scores by the number of items per domain (Peipert et al., 2019).
The 11 respondents who reported they were very much bothered on 29% (8 of the 28) of the symptom items had a rounded TI score of 5.00000 (the unrounded TI never reaches 5). But the average of the 28 symptom items on the 0-4 possible range for these 11 patients ranged from 1.4 to 2.3 (mean = 1.78) and the range of 0-10 overall quality of life ratings was 3-8 (mean = 4.9, SD = 1.6). Moreover, there were 18 cases endorsing very much for more than 8 symptoms that also had a rounded TI score of 5.00000; their average symptom scores ranged from 1.6 to 2.5 (mean = 2.04) and their overall quality of life ratings ranged from 3 to 9 (mean = 5.5, SD = 1.9).
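The saturation of the TI near its ceiling can be made explicit with a short derivation. It assumes the commonly cited form of the TI, in which each ordered score is divided by the product of one plus every higher-ranked score; this form is an assumption and is not printed in this text. For a respondent with m items scored 4 (very much), the first m terms form a geometric series, the remaining terms are nonnegative, and the total stays below 5:

```latex
\[
5\left(1 - 5^{-m}\right)
\;=\; 4 \sum_{k=0}^{m-1} 5^{-k}
\;\le\; \mathrm{TI} \;<\; 5 ,
\qquad\text{so for } m = 8:\quad
5 - 5^{-7} \;=\; 5 - \tfrac{1}{78125} \;\approx\; 4.9999872 \;\le\; \mathrm{TI} \;<\; 5 .
\]
```

Under that assumption, every respondent with eight or more items at the highest level falls within about 0.000013 of 5, whatever the responses to the remaining items, which is why such respondents are nearly indistinguishable on the TI even though their average symptom scores and quality of life ratings differ.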
The method used to combine items to produce a summary score needs to be justified for the specific context of use (U.S. Food and Drug Administration, 2009). Future research may help identify circumstances in which the present findings do not hold. The results of this study need replication in other datasets using different patient-reported symptom measures. It may also be useful to compare the summary approaches examined here with latent profile analysis (Whisenant et al., 2022). The TI was shown to have greater power than the maximum grade method to detect differences between side effects of treatments assessed using the CTCAE (Gresham et al., 2020). Because patient-reported symptoms include disease-related and treatment-related effects, it is not possible to generalize from this study to clinician-reported