SF-12 or SF-36 in pituitary disease? Toward concise and comprehensive patient-reported outcomes measurements

Purpose Pituitary diseases severely affect patients’ health-related quality of life (HRQoL). The most frequently used generic HRQoL questionnaire is the Short Form-36 (SF-36). The shorter 12-item version (SF-12) can improve efficiency of patient monitoring. This study aimed to determine whether SF-12 can replace SF-36 in pituitary care. Methods In a longitudinal cohort study (August 2016 to December 2018) among 103 endoscopically operated adult pituitary tumor patients, physical and mental component scores (PCS and MCS) of SF-36 and SF-12 were measured preoperatively, and 6 weeks and 6 months postoperatively. Chronic care was assessed with a cross-sectional study (N = 431). Mean differences and agreement between SF-36 and SF-12 change in scores (preoperative vs. 6 months) were assessed with intraclass correlation coefficients (ICC) and limits of agreement, depicting 95% of individual patients. Results In the longitudinal study, mean differences between change in SF-36 and SF-12 scores were 1.4 (PCS) and 0.4 (MCS) with fair agreement for PCS (ICC = 0.546) and substantial agreement for MCS (ICC = 0.931). For 95% of individual patients, the difference between change in SF-36 and SF-12 scores varied between −14.0 and 16.9 for PCS and between −7.8 and 8.7 for MCS. Cross-sectional results showed fair agreement for PCS (ICC = 0.597) and substantial agreement for MCS (ICC = 0.943). Conclusions On a group level, SF-12 can reliably reproduce MCS in pituitary patients, although PCS is less well correlated. However, individual differences between SF-36 and SF-12 can be large. For pituitary diseases, alternative strategies are needed for concise, but comprehensive patient-reported outcome measurement.


Introduction
Pituitary/sellar tumors are rare, with a prevalence of 78-94 per 100,000 individuals [1]. Both the tumor and its treatment may cause short-and long-term sequelae [2,3]. Patients may suffer from symptoms due to compression of local critical structures such as the optic nerve [3], and characteristic symptoms in case of hormone excess or deficiency, such as infertility and hypogonadism in prolactinoma [4,5], and musculoskeletal, cardiovascular, and metabolic abnormalities in acromegaly and Cushing's disease [6,7]. Moreover, both functioning and nonfunctioning tumors frequently cause cognitive and psychological symptoms such as mental fatigue, emotional instability, loss of libido, and depressive symptoms [8,9]. As a result of this complex multisystem morbidity, pituitary/sellar diseases profoundly affect patients' general health-related quality of life (HRQoL), which generally remains impaired even long after biomedical control [8][9][10].
Since discrepancies may exist between patients' perspective on their HRQoL and the more objective clinicianreported outcome measures [11], patient-reported outcome measures (PROMs) are increasingly used both in clinical monitoring, and as outcome measures in clinical trials [8]. Besides disease-specific PROMs, PROMs assessing general HRQoL are used frequently [8], providing the opportunity to compare different disease populations. The Short Form-36 (SF-36) [12] is the most frequently used generic PROM in patients with a pituitary/sellar tumor [8]. This questionnaire consists of 36 questions covering eight domains of health and wellbeing with corresponding subscales, which are used to estimate a physical (PCS) and a mental component score (MCS). A shorter version, the Short Form-12 (SF-12) [13], has been developed, comprising 12 items of the SF-36 that can be used to calculate the PCS and MCS, omitting the subscale scores. The SF-12 has been studied in different patient populations and has shown strong correlations with the SF-36 [14][15][16][17][18][19][20] but has not been evaluated in pituitary diseases.
Due to the wide range of local and systemic symptoms, but also characteristic 'endocrine' symptoms caused by pituitary/sellar tumors, multiple disease-specific or symptom-specific PROMs should be used to comprehensively measure outcomes relevant for pituitary patients, together with a generic PROM allowing for comparison with other diseases [21]. To increase efficiency and to reduce the patient burden of completing these questionnaires, it would be valuable to investigate whether the number of questions can be reduced, whilst maintaining the capacity to reliably monitor HRQoL in patients with pituitary/sellar disease. Therefore, the aim of this study was to determine whether the SF-12 can be used instead of the SF-36 to assess the PCS and MCS in the monitoring of pituitary/sellar diseases.

Study design
For the analyses, data of two previously published cohorts [21,22] were used. The first study was a longitudinal cohort of consecutive patients treated surgically for a pituitary/ sellar tumor between August 2016 and December 2018 [21], who completed multiple PROMs before, and 6 weeks and 6 months after surgery. The second cohort was a large cross-sectional study performed in a chronic care setting [22], which was used to further validate our results. This cohort consisted of pituitary patients after a median of 13.0 years since diagnosis, recruited between September 2016 and March 2017. Both studies were performed at the Leiden University Medical Center, a Dutch tertiary referral center for patients with pituitary/sellar disease, and were approved by the institutional ethical committee (p16.091, p12.067).

Patient population
For the longitudinal cohort study, all consecutive patients, ≥18 years, and scheduled for endoscopic transsphenoidal resection of a pituitary/sellar tumor were eligible. For the cross-sectional study, we approached all patients with a history of a pituitary/sellar tumor, aged ≥ 18 years, and under active follow-up at our center. Exclusion criteria included a follow-up of <6 months, insufficient Dutch language skills, an incapacity to complete the questionnaires, and living abroad. For both studies, eligible patients were invited to participate by letter, and were enrolled after informed consent.

Baseline characteristics
For the longitudinal study, the baseline characteristics collected from patient charts included age, sex, marital status, education level, tumor type, size, and invasion, date of diagnosis, prior treatment of the tumor, preoperative pituitary function, visual functioning, and cerebral nerve deficits, if present. Detailed information on the collection and categorization of these data is presented elsewhere [21]. In addition, the Dutch comorbidity questionnaire, Statistics Netherlands, was used to assess the most common chronic diseases [23], categorized into diabetes mellitus, neurovascular disease, cardiovascular disease, and malignancies. Finally, the Short Form-Health and Labor Questionnaire (SF-HLQ) [24] was used to determine whether patients had a paid job.
For the cross-sectional cohort, data on age, sex, marital status, education level, tumor type, date of diagnosis, pituitary function, and work status were collected and categorized similarly to the longitudinal cohort [22].

Health-related quality of life
Patients completed the SF-36 version 1 [12], which was originally developed and validated in patients with hypertension, diabetes mellitus, congestive heart failure, myocardial infarction, and depression [25,26]. The PCS and MCS of the SF-36 range from 0 to 100, higher scores indicating a better HRQoL. The PCS and MCS of the SF-12 were calculated using the 12 corresponding items of the SF-36 [27] and similarly range from 0 to 100, higher scores indicating a better HRQoL. The SF-12 was developed and validated in the general population of the United States and the same patient populations as the SF-36 and includes the 12 items that predicted the SF-36 subscales most accurately in these populations [27]. The Dutch versions of the SF-36 and SF-12 have been validated in the Netherlands [28,29].

Statistical analysis
In order to determine the correlation between SF-36 and SF-12 scores of the longitudinal cohort, intraclass correlation coefficients (ICCs) for absolute agreement were calculated between the component scores of both questionnaires at the different timepoints. Moreover, ICCs for absolute agreement were used to assess the correlation between change in SF-36 and SF-12 scores (preoperatively vs. 6 months postoperatively). An ICC value of ≥0.41 was considered fair; ICC ≥0.61 moderate; and ICC ≥0.81 substantial [30].
Bland-Altman plots [31] were created to assess agreement of the SF-12 and SF-36 scores at each timepoint. Bland-Altman plots are scatter plots, showing the differences between SF-36 and SF-12 scores for individual patients plotted against the mean of each patient's SF-36 and SF-12 scores. In each plot, the population mean (d) of all individual differences between the two scores is visualized, as well as the limits of agreement, which represent the 95% range of all individual measurements (calculated as d + 1.96 × SD difference and d − 1.96 × SD difference ). Similarly, Bland-Altman plots were created to assess agreement of the change in SF-12 and SF-36 scores over time (6 months vs. preoperatively).
To assess the course of HRQoL over time, proportions of patients in the following categories were calculated twice using the SF-36 items and SF-12 items: no relevant change on all timepoints, persistent improvement or deterioration (on both 6 weeks and 6 months), transient improvement or deterioration (only at 6 weeks) and late improvement or deterioration (only at 6 months). A clinically relevant change in SF-36 scores is not yet known for pituitary patients, but in chronic disease populations, 0.5 SD is typically regarded as the minimal important difference for HRQoL instruments [32]. Therefore, a clinically relevant change (improvement or deterioration) was defined as ≥0.5 SD of the change in SF-36 scores, and no relevant change as <0.5 SD.
To determine the ability of the SF-12 to replicate clinically relevant changes, the proportion of patients that had a clinically relevant change in the same direction on both the SF-36 and the SF-12 was calculated.
In order to assess whether the degree of disagreement between SF-36 and SF-12 scores was associated with specific baseline characteristics, patients were categorized into a group with large individual differences between the SF-36 and SF-12, and a group with good agreement of SF-36 and SF-12 scores (all other patients). Following the same line of reasoning as above, the cutoff for large individual differences between SF-36 and SF-12 was defined as 0.5 SD of the change in SF-36 scores. Logistic regression analysis (both crude and adjusted for age, sex, comorbidities, and education level) was used to determine the association between baseline factors and having >5 points difference between SF-36 and SF-12 scores on PCS and/or MCS.
For the cross-sectional cohort, ICCs for absolute agreement, Bland-Altman plots, and logistic regression analyses were calculated and performed similarly for the cohort's single measurement. P values <0.05 were considered statistically significant. All statistical analyses were performed using IBM SPSS 25.0 software (Armonk, NY) [33].

Patient populations and missing data
The longitudinal perioperative cohort consisted of 103 patients, with a median age of 52.9 years (interquartile range [IQR] 37.0-65.0 years), of whom 71 (62.8%) were female (Table 1)

Longitudinal (perioperative) SF-36 and SF-12 scores
In the longitudinal cohort, mean PCS36 decreased from 41.4 preoperatively to 39.7 at 6 weeks and increased to 42.9 at 6 months postoperatively (Fig. 1). PCS12 scores were consistently slightly lower than PCS36 scores, with values of 37.1 preoperatively, 35.0 at 6 weeks and 36.8 at 6 months. MCS36 and MCS12 scores were more comparable, with scores of 43.5 and 42.0 preoperatively, 47.9 and 46.4 at 6 weeks, and 48.1 and 46.4 at 6 months, respectively. Scores were similar in the cross-sectional study (Supplementary 1).

Correlation of SF-36 and SF-12
In the longitudinal cohort, the ICCs of the PCS were 0.590 preoperatively, 0.548 at 6 weeks and 0.622 at 6 months ( Fig. 1), only the latter correlation being considered moderate for the majority of tumor types (Supplementary 2 and  3). On the contrary, the ICCs of the MCS were substantial at all timepoints (0.952 preoperatively, 0.948 at 6 weeks, 0.943 at 6 months) and for all tumor types (Supplementary 2 and 3). Results were similar for the cross-sectional cohort (Supplementary 4 and 5).
In line with these results, the Bland-Altman plots (Fig. 2) of the PCS of the longitudinal cohort showed relatively wide limits of agreement for individual patients (−11.4 to 19.6 preoperatively; −8.3 to 17.8 at 6 weeks; −7.7 to 19.5 at 6 months), with mean differences of 4.1, 4.7, and 5.9 points respectively for the whole group, while the limits of agreement of the MCS were narrower (−5.9 to 8.5 preoperatively; −4.5 to 7.6 at 6 weeks; −5.3 to 8.7 at 6 months), with mean differences of 1.3, 1.5, and 1.7 points respectively. The Bland-Altman plots (Supplementary 6) of the cross-sectional cohort were in concordance with those of the longitudinal cohort.

Longitudinal changes in SF-36 and SF-12
In the longitudinal cohort, mean longitudinal changes (6 months vs. preoperatively) were comparable between SF-36 (PCS 1.3; MCS 4.5) and SF-12 (PCS −0.3; MCS 3.8) scores. However, the correlation for change in SF-36 and SF-12 scores was substantial only for MCS (ICC = 0.931), while the ICC for PCS was considered fair (ICC = 0.546). Limits of agreement were −14.0 to 16.9 for PCS, and −7.8 to 8.7 for MCS, with mean differences of 1.4 for PCS and 0.4 for MCS (Fig. 3). Longitudinal changes of the PCS and MCS between the preoperative measurement and 6 months postoperatively could be calculated for 94 patients for the PCS36, PCS12, and MCS36, and for 95 patients for the MCS12.
The SDs of the change in SF-36 scores were around 10 in this study (data not shown), and the clinically relevant change (0.5 SD) therefore approached 5. Compared with the SF-36 component scores, the PCS12 and MCS12 showed a lower proportion of patients in the clinically relevant improvement categories, and the PCS12 showed a higher proportion of patients in the deterioration categories (Fig. 4). The percentage of patients with no important change on PCS12 (31.9%) was substantially higher than the percentage with no important change on PCS36 (18.2%). Of the patients with a clinically relevant increase (>5 points) on PCS36, 37.5% also had a clinically relevant increase on PCS12 (Table 2). Of the patients with a clinically relevant decrease on PCS36, 47.8% had a clinically relevant decrease on the PCS12. The numbers for the MCS were higher, 79.1% for increase and 87.5% for decrease, respectively ( Table 2).

Association of baseline factors with difference between SF-36 and SF-12 scores
As the minimal important difference (0.5 SD) approached 5 in this study, the cutoff for large individual differences between SF-12 and SF-36 PCS and/or MCS scores was set at 5 points. Preoperatively, 69 patients of the longitudinal cohort (69.7%) had a large individual difference between SF-36 and SF-12. At 6 weeks, this group consisted of 59 patients (59.0%), and at 6 months of 74 patients (77.9%). In the cross-sectional cohort, 318 patients (77.4%) had a difference of >5 points between SF-36 and SF-12 scores on PCS and/or MCS. Overall, no consistent significant associations were found between baseline factors (i.e., sex, tumor type, age, education level, comorbidities, tumor size, time since diagnosis, prior treatment, preoperative pituitary function, and preoperative visual deficits) and having >5 points difference between the two questionnaires (Supplementary 7-9).

Discussion
The present post hoc analysis of two existing cohorts of patients with a pituitary/sellar tumor demonstrates that, on a group level, the MCS derived from the SF-36 and SF-12 shows substantial agreement on all timepoints and over time. However, the agreement between the PCS of both questionnaires is less convincing, since these correlations were not more than fair in both cohorts. Moreover, due to large individual differences between SF-36 and SF-12, the SF-12 cannot reliably replace the SF-36 for individual patients.
SF-36 and SF-12 scores could be calculated for similar numbers of patients. The Bland-Altman plots demonstrated that the mean differences between the SF-36 and SF-12 scores were up to two points for the MCS, and up to six points for the PCS, indicating comparable results for the MCS between both questionnaires on a group level, when individual scores are averaged. However, the limits of agreement show that individual differences between the SF-36 of SF-12 for both the MCS and PCS are large, varying up to seven points for the MCS and up to 15 points for the PCS, which implies that the SF-12 score of an individual patient may differ up to seven (MCS) or 15 (PCS) points from their SF-36 score. Regression analysis was used to assess whether large individual differences were related to specific baseline factors, but overall, no consistently significant associations between baseline factors and a large individual disagreement between the SF-36 and SF-12 were found in both cohorts. Bland-Altman plots were also used to assess to what extent the component scores of both questionnaires showed a comparable change over time. Again, mean differences in change over time were small, but the limits of agreement were wide, varying up to 15 points (PCS), indicating that the change of the SF-12 of an individual patient may differ strongly from the change of their SF-36 scores. Importantly, the proportion of patients with a clinically relevant change in the same direction on both the SF-36 and SF-12 was as low as 37.5% for a clinically relevant increase in the PCS, while the percentages were considerably higher for the MCS.
The SF-36 and SF-12 have been compared previously in other patient groups, such as dialysis patients [14], patients undergoing knee replacement surgery [16], and patients with a history of stroke [17] [14][15][16][17][18][19][20] also found a good correlation for this component score (ICC range: 0.92-0.97). The majority of the studies therefore concluded that the SF-12 scores reliably approach the SF-36 scores, for both the PCS and MCS [14][15][16][17][18][19][20]. Moreover, most longitudinal studies concluded that responsiveness to Percentages add up to 100% for PCS36, PCS12, MCS36, and MCS12. PCS physical component score; MCS mental component score change was also comparable between the SF-36 and SF-12 [16,[18][19][20][34][35][36][37], reporting correlations (r or ICC) for change ranging between 0.84 and 0.94 for the PCS, and between 0.90 and 0.95 for the MCS. In contrast, the present study showed that individual differences between change in SF-36 and SF-12 scores can be large, and that the ICC for change of the PCS (ICC = 0.546) was considerably lower than for the MCS (ICC = 0.931). The large discrepancy between the PCS and MCS correlations and limits of agreement found in our study is not consistent with the existing literature in other patient groups such as osteoarthritis or stroke patients [14-17, 19, 20, 34-37], and might reflect the complex multisystem morbidity of endocrinological conditions. The SF-36 and SF-12 were developed and validated in patient populations with typically less complex morbidity, such as hypertension and myocardial infarction. In pituitary patients, typically, a combination of multiple less apparent symptoms (fatigue and psychological symptoms) and symptoms that are difficult to measure may profoundly impact their HRQoL [8], requiring measurement with the more comprehensive SF-36 instead of the SF-12. For instance, as pituitary patients experience limitations in energy rather than function, it can be expected that physical HRQoL impairment will be reflected by limitations in moderate activities (included in the SF-12), rather than by limitations in light activities such as walking one block or dressing oneself (not included in the SF-12). Indeed, as outlined in Supplementary 11, the SF-12 includes the physical SF-36 items that in general score relatively low in this cohort, while the items not included in the SF-12 score higher. This may in part explain the marked discrepancy between PCS scores of the two questionnaires. Notably, disease-specific characteristics influence the comparability of the SF-36 and SF-12, and therefore, it is important to evaluate per condition whether this shortened version is representative.
Besides the SF-36, other brief generic questionnaires such as the EuroQoL-5D [38] have been used in pituitary patients [39][40][41]. However, this widely used questionnaire only consists of five items, limiting its ability to provide a comprehensive view on the self-perceived health of patients with complex conditions such as pituitary diseases. This is partially depicted by a strong ceiling effect, as most patients report (very) high scores and therefore, most patients only have room for deterioration [21]. Moreover, the EuroQoL-5D is primarily a questionnaire assessing utility, which is used for economic evaluations and should be distinguished from HRQoL. The SF-36 is therefore more suitable, as a generic HRQoL instrument, for individual patient care than the EuroQoL-5D.

Strengths and limitations
Strengths of this study include the use of two cohorts, thereby increasing patient numbers and allowing for not only cross-sectional analysis in a chronic care setting, but also longitudinal analysis in a perioperative setting. Furthermore, the patient population included in the study is heterogeneous and conclusions can therefore be generalized to the total pituitary patient population. Regression analysis showed that this heterogeneity has not influenced the study's outcomes.
A few limitations of this study must be noted. First of all, in the cohorts used in this analysis, the SF-12 was not assessed separately, but was calculated from the SF-36. This may have resulted in slightly different SF-12 scores than would have been obtained using the SF-12 questionnaire. However, in previous research SF-12 scores based on the items embedded in the SF-36 were found to be equivalent to the scores obtained when the SF-12 was administered separately [42]. Furthermore, although the SF-12 and SF-36 have been validated in several countries, differences between and within both questionnaires scores may exist between countries [28,43], possibly resulting in a limited generalizability of the results of this study.

Conclusions
PROMs are increasingly used in both clinical trials and clinical practice. In clinical trials, PROMs serve as HRQoL outcome measures [44][45][46], that consequently influence clinical decision making, health care policy [47], and guideline development [48,49]. In clinical practice, PROMs enable patient monitoring and facilitate patient-doctor communication [50], resulting in the identification of previously unrecognized symptoms, and improvement of patient satisfaction and outcomes [51][52][53][54].
Our research team has obtained experience with a combination of several PROMs in a comprehensive outcome set for pituitary care [21], which harmonizes outcomes, and enables systematic assessment of HRQoL of all patients. However, this comprehensive outcome set can be timeconsuming and therefore burdensome for patients, due to the relatively large number of questions [21]. The present study therefore investigated whether the shorter SF-12 can be used instead of the SF-36 in patients with pituitary/sellar disease and showed that on a group level, the SF-12 can indeed reliably replicate the MCS, whereas evidence for the PCS is less convincing. However, due to large individual differences between SF-36 and SF-12 scores, the SF-12 is not suitable to replicate SF-36 scores for individuals in this population. Given the additional advantage of the SF-36 of generating domain scores, which provide clinicians and nurses with quick insight into the different aspects of patients' HRQoL, we recommend the SF-36 for clinical use in individual pituitary patients. Whether the SF-12 may fulfill the requirements of a generic PROM in the comprehensive set of generic, disease-specific, and symptomspecific PROMs for pituitary patients needs to be evaluated. In the meantime, alternative approaches to decrease the number of questions in this comprehensive outcome set, such as computer adaptive testing [55][56][57], should be explored as well.

Data availability
Data requests can be directed to D.J.L.

Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of interest.
Ethical approval All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. This study was approved by the Medical Ethical Committee of the Leiden University Medical Center (No. p16.091, p12.067).
Informed consent Informed consent was obtained from all individual participants included in the study.
Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons. org/licenses/by/4.0/.