Patient-Reported Outcome Measures for Post-mastectomy Breast Reconstruction: A Systematic Review of Development and Measurement Properties

Background Breast reconstruction (BR) is performed to improve outcomes for patients undergoing mastectomy. A recently developed core outcome set for BR includes six patient-reported outcomes that should be measured and reported in all future studies. It is vital that any instrument used to measure these outcomes as part of a core measurement set be robustly developed and validated so data are reliable and accurate. The aim of this systematic review is to evaluate the development and measurement properties of existing BR patient-reported outcome measures (PROMs) to inform instrument selection for future studies. Methods A PRISMA-compliant systematic review of development and validation studies of BR PROMs was conducted to assess their measurement properties. PROMs with adequate content validity were assessed using three steps: (1) the methodological quality of each identified study was assessed using the COSMIN Risk of Bias checklist; (2) criteria were applied for assessing good measurement properties; and (3) evidence was summarized and the quality of evidence assessed using a modified GRADE approach. Results Fourteen articles reported the development and measurement properties of six PROMs. Of these, only three (BREAST-Q, BRECON-31, and EORTC QLQ-BRECON-23) were considered to have adequate content validity and proceeded to full evaluation. This showed that all three PROMs had been robustly developed and validated and demonstrated adequate quality. Conclusions BREAST-Q, BRECON-31, and EORTC QLQ-BRECON-23 have been well-developed and demonstrate adequate measurement properties. Work with key stakeholders is now needed to generate consensus regarding which PROM should be recommended for inclusion in a core measurement set.

Breast cancer is the most common cancer in women, with over 2 million new cases worldwide in 2018. 1 In the UK, approximately 40% of women who have surgery for breast cancer undergo mastectomy. 2 Breast reconstruction is offered to patients to improve body image and quality of life. 3 Decision-making for BR is complex. There are many types of BR surgery ranging from implant-based procedures to microsurgical free-flaps using tissue from the abdomen, thigh, or buttock. Patients and surgeons need high-quality evidence from well-designed studies to help them make informed decisions about their reconstructive options.
Outcome selection, measurement, and reporting in BR studies, however, is currently heterogeneous and inconsistent. 4,5 This means that results of individual studies cannot be meaningfully compared or combined, limiting their value for decision-making. To address this, a core outcome set (COS), a minimum set of outcomes to be measured and reported in all future research and audit studies of BR, has recently been developed. Robust Delphi methodology involving over 300 key stakeholders, including patients and healthcare professionals, was undertaken. 6 The 11-item COS includes clinical (implant and flap-based complications, major complications, and unplanned surgery), patient-reported (quality of life, normality, emotional and physical well-being, donor-site symptoms/morbidity, and self-esteem), and cosmetic (women's cosmetic satisfaction) outcome domains.
While a COS is an important step in determining what outcomes should routinely be measured, it does not describe how these key outcome domains should be assessed. The next step in improving the quality and consistency of outcome reporting in BR studies is therefore to develop a core measurement set (CMS), a standard set of instruments to assess the core outcome domains. [7][8][9] Patient-reported outcomes are particularly important in BR, and it is vital that any patient-reported outcome measure (PROM) recommended for use in future studies be robustly developed and validated for use in this population. The COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) guideline 10,11 is a critical appraisal tool for evaluating the methodological quality of studies reporting the development and measurement properties of health-related measures. 12 This provides a framework to assess the overall quality of outcome measurement instruments for use in research and clinical practice. The aims of this systematic review are (1) to identify candidate PROMs for each patient-reported outcome domain in the BR COS and (2) to critically appraise, compare, and summarize the quality of studies reporting the development and measurement properties of each PROM using the COSMIN guidelines 11 to inform selection of PROMs for use in future BR studies and inclusion in a BR CMS.

METHODS
This study was registered on the PROSPERO international register of systematic reviews before the literature search was performed (CRD42017075211). The search strategy used four broad search terms recommended by COSMIN for performing a systematic review of measurement properties. 11 These were (1) the constructs of interest, namely the patient-reported outcome domains included in the BR COS 6 (self-esteem, normality, quality of life, donor-site problems, emotional and physical well-being, and women's cosmetic satisfaction), (2) the target population (BR), (3) the comprehensive PROM filter developed by the Patient Reported Outcomes Measurement Group of the University of Oxford, 13 and (4) the measurement properties filter described by Terwee et al. 14 The full search strategy is detailed in the Appendix.

Search Strategy and Paper Identification
Initial scoping work suggested that few PROMs currently exist that have been developed and/or validated specifically for patients undergoing BR surgery. For this reason, no specific constructs for BR were included in the search strategy, to avoid suitable instruments being inappropriately excluded.
Titles and abstracts of the remaining citations were screened independently for eligibility by two reviewers (C.D./S.P.) using predetermined inclusion criteria. Any discrepancies were resolved by discussion between the two reviewers. If uncertainty remained, the full text was obtained for further review and discussion. The reference lists of retrieved articles and existing reviews were manually searched to identify additional potentially relevant studies.

Paper Selection
Full-text original papers published in English reporting the development and/or evaluation of the measurement properties of patient-reported outcome questionnaires in women undergoing BR were eligible for inclusion. To be relevant for inclusion in the CMS, the questionnaire also had to have been developed for patient self-completion, evaluate one of the core patient-reported outcome domains identified in the COS (i.e., health-related quality of life, normality, women's cosmetic satisfaction, physical well-being, emotional well-being, or self-esteem), and have been specifically developed for and/or evaluated in female patients aged 18 years or over who had undergone BR. Breast reconstruction was defined as reconstruction of the breast after total mastectomy for invasive or preinvasive breast cancer or risk reduction.
Excluded were studies involving patients (1) with breast cancer in general without specific reference to BR, (2) undergoing breast conserving surgery or partial BR [e.g., with latissimus dorsi (LD) miniflaps or chest wall perforator flaps], and (3) undergoing cosmetic breast surgery only (e.g., reduction or augmentation surgery).
Papers were screened for inclusion independently by two reviewers (S.P./C.D.) using standardized proforma based on predetermined inclusion criteria. In cases of uncertainty, full-text papers were obtained for further evaluation. Uncertainties that remained after full-text review were resolved by discussion with an experienced methodologist (K.A./R.M.). Reasons for exclusion were recorded.

Data Extraction
Data were extracted onto standardized data extraction proformas. Extracted data included (1) characteristics of PROM instruments, including name of instrument, purpose/objective of study, country of study, recall period, and measurement properties evaluated, (2) PROM instruments assessing each patient-reported outcome domain from the BR COS, including COS item definition, name of PROM instrument, outcome/scales being measured, and number of items per scale, and (3) characteristics of included studies of instruments assessing outcomes in women who had undergone BR, including study author/year, country of study/setting, instrument name, sample size, age (mean), target population, type of RBS performed, and the indication for surgery.

Data Analysis
Selection of PROM Instruments for Full COSMIN Evaluation Nine measurement properties are included in the COSMIN evaluation. 11 These included content, structural, cross-cultural and criterion validity, hypothesis testing for construct validity, internal consistency, reliability, measurement error, and responsiveness. Definitions of these properties are provided in Table 1.
Content validity is the most important measurement property of a PROM and refers to whether the content of an instrument appropriately reflects the construct to be measured. It must be clear that items in the PROM are relevant, comprehensive, and comprehensible with respect to the construct of interest and the target population. 15 Only PROMs assessed by COSMIN criteria as having adequate content validity qualified for full COSMIN evaluation in phase 2 (Fig. 1). PROMs assessed as lacking content validity were excluded from further evaluation (Fig. 1, phase 1). 11,15,16 For PROM instruments undergoing full COSMIN evaluation, data on the instrument's feasibility were also collected. These included patient comprehensibility, completion time, patient's required mental and physical ability level, ease of standardization, ease of score calculation, copyright, cost of using instrument, required equipment, and regulatory agency's requirement for approval.
Evaluating Quality of the PROMs Quality evaluation of the included PROMs consisted of three steps (Fig. 1) and was scored by three reviewers (C.D./R.M./K.A.) independently with disagreements resolved by discussion with a fourth (S.P.).
Step 1. COSMIN Risk of Bias Checklist To evaluate the methodological quality of each single study, the COSMIN Risk of Bias checklist 10,11,17 was used. The COSMIN checklist evaluates the nine measurement properties together with the feasibility and interpretability of the instrument. The risk of bias for each study was rated using a four-point scale as either very good, adequate, doubtful, or inadequate quality and determined by taking the lowest rating of any items (''worst score counts'') within each measurement property.
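The ''worst score counts'' rule described in step 1 can be illustrated with a minimal sketch. The function name and the example item ratings are hypothetical and are not taken from the COSMIN checklist itself; only the four-point scale and the lowest-rating rule come from the text above.

```python
# Four-point scale from the text, ordered from best to worst.
RATING_ORDER = ["very good", "adequate", "doubtful", "inadequate"]


def worst_score_counts(item_ratings):
    """Return the lowest (worst) rating among the items for one measurement property."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    # A rating further along RATING_ORDER is worse; an unknown label raises ValueError.
    return max(item_ratings, key=RATING_ORDER.index)


# Hypothetical item ratings for one measurement property in one study.
example_items = ["very good", "adequate", "doubtful", "very good"]
print(worst_score_counts(example_items))  # doubtful
```

A single ''doubtful'' item is therefore enough to cap the rating of the whole measurement property for that study, regardless of how many items were rated ''very good''.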
Step 2. Applying Criteria for Good Measurement Properties by Using Quality Criteria

2a: Content Validity
Each result of a single study on PROM development and content validity was rated against the 10 criteria for good content validity. 17 The results of all available studies were qualitatively summarized to determine whether, overall, the relevance, comprehensiveness, comprehensibility, and overall content validity were sufficient (+), insufficient (-), or indeterminate (?), taking all evidence into account. Studies assessed as having insufficient content validity following this assessment were excluded from further evaluation in the systematic review as these should not be recommended for use.

2b: Remaining Measurement Properties
For instruments assessed as having sufficient content validity, the result of each study for the remaining measurement properties was rated against the criteria for good measurement properties. 11 Each result was rated as either sufficient (+), insufficient (-), or indeterminate (?).
Step 3. Summary of Evidence and Grading of Quality of Evidence

3a: Content Validity
The overall ratings determined in step 2a were also accompanied by a grading for the quality of the evidence using a modified Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach for systematic reviews of clinical trials 18 (scored as high, moderate, low, or very low). The GRADE approach uses five factors to determine the quality of the evidence: risk of bias, inconsistency, indirectness, imprecision, and publication bias. For evaluating content validity, only three of these factors were applicable, namely risk of bias, inconsistency, and indirectness.
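The grading logic can be sketched as follows. This is an illustration only: it assumes, following the usual GRADE convention, that evidence starts at ''high'' and is moved down one level per downgrade point, and the number of levels assigned to each factor in the example is hypothetical rather than taken from this review.

```python
# Quality-of-evidence levels from the text, ordered from best to worst.
GRADE_LEVELS = ["high", "moderate", "low", "very low"]


def grade_quality(downgrades):
    """Start at 'high' and move down one level per downgrade point (assumed convention).

    `downgrades` maps each factor considered to the number of levels (0 or more)
    by which the evidence is downgraded for that factor.
    """
    levels = list(downgrades.values())
    if any(n < 0 for n in levels):
        raise ValueError("downgrade levels must be non-negative")
    # The grade cannot fall below 'very low'.
    index = min(sum(levels), len(GRADE_LEVELS) - 1)
    return GRADE_LEVELS[index]


# Hypothetical assessment using the three factors applicable to content validity.
content_validity = {"risk of bias": 1, "inconsistency": 0, "indirectness": 0}
print(grade_quality(content_validity))  # moderate
```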

3b: Remaining Measurement Properties
To come to an overall conclusion on the quality of a PROM, the results of all available studies per measurement property had to be consistent. The results were pooled and compared again against the criteria for good measurement properties 11 to determine whether, overall, the measurement property of the PROM was sufficient (+), insufficient (-), inconsistent (±), or indeterminate (?). As with content validity, quality of the evidence was graded using the GRADE approach for each measurement property. For evaluating measurement properties in systematic reviews of PROMs, only four of the five factors (as detailed in step 3a above) were taken into account, namely risk of bias, inconsistency, imprecision, and indirectness.

Table 1 Definitions of measurement properties
Measurement property: Definition
Internal consistency: The degree of the interrelatedness among the items; the extent to which scores for patients who have not changed are the same using different sets of items from same instrument
Reliability: The proportion of the total variance in the measurements which is due to ''true'' differences between patients
Measurement error: The systematic and random error of a patient's score that is not attributed to true changes in the construct to be measured
Content validity: The degree to which an instrument measures the construct(s) it purports to measure; the degree to which the content of an instrument is an adequate reflection of the construct to be measured
Structural validity: The degree to which the scores of an instrument are an adequate reflection of the dimensionality of the construct to be measured
Hypothesis testing for construct validity: The degree to which the scores of an instrument are consistent with hypotheses (for instance, with regard to internal relationships, relationships to scores of other instruments, or differences between relevant groups) based on the assumption that the instrument validly measures the construct to be measured; item construct validity
Cross-cultural validity: The degree to which the performance of the items on a translated or culturally adapted instrument are an adequate reflection of the performance of the items of the original version of the instrument
Criterion validity: The degree to which the scores of an instrument are an adequate reflection of a ''gold standard''
Responsiveness: The ability of an instrument to detect change over time in the construct to be measured; item responsiveness
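The four overall ratings used in step 3b can be illustrated with a simplified sketch that combines per-study ratings for one measurement property. This is not the COSMIN procedure itself, which pools the quantitative results before rating them; the rule below, including the choice to ignore indeterminate results unless every result is indeterminate, is an assumption made for illustration.

```python
VALID_RATINGS = {"+", "-", "?"}


def overall_rating(study_ratings):
    """Combine per-study ratings into '+', '-', '±', or '?' (simplified, assumed rule)."""
    unknown = set(study_ratings) - VALID_RATINGS
    if unknown:
        raise ValueError(f"unexpected ratings: {sorted(unknown)}")
    # Assumption: indeterminate results do not count for or against the property.
    rated = [r for r in study_ratings if r != "?"]
    if not rated:
        return "?"   # nothing but indeterminate results (or no results at all)
    if all(r == "+" for r in rated):
        return "+"   # consistently sufficient
    if all(r == "-" for r in rated):
        return "-"   # consistently insufficient
    return "±"       # sufficient and insufficient results are mixed


# Hypothetical per-study ratings for one measurement property of one PROM.
print(overall_rating(["+", "+", "?"]))  # +
print(overall_rating(["+", "-"]))       # ±
```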

RESULTS

Systematic Literature Search
After removal of duplicates, 2343 abstracts were screened. Based on the title and abstract, 27 articles were selected for full-text review. Of these, 16 articles were excluded from the review for the following reasons: not primary research/reviews (n = 8), not validation studies (n = 5), or not related to BR surgery (n = 3). A further three papers were identified from manual searching. In total, 14 articles describing six BR PROMs met the eligibility criteria and were included in the review (see PRISMA diagram, Fig. 2). Table 2 presents the characteristics of the BR PROMs identified in the review. All included PROMs were evaluated in the English language. The recall period ranged from ''within the last week'' to ''5 years since breast surgery.'' Individual studies evaluated different measurement properties and not all measurement properties were assessed for each PROM. The specific measurement properties assessed for each instrument are listed in Table 2.
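The study-flow numbers reported above can be checked with a short calculation. The variable names are illustrative; the counts are those given in the text.

```python
# Counts taken from the paragraph above.
selected_for_full_text = 27
excluded = {
    "not primary research/reviews": 8,
    "not validation studies": 5,
    "not related to BR surgery": 3,
}
identified_by_manual_search = 3

total_excluded = sum(excluded.values())
included = selected_for_full_text - total_excluded + identified_by_manual_search

print(total_excluded)  # 16
print(included)        # 14
```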

Assessment of Breast Reconstruction Patient-Reported Outcome Domains
Outcome domains or constructs measured across the identified PROMs included satisfaction with breasts, satisfaction with overall outcome, psychosocial well-being, physical well-being, sexual well-being, health-related quality of life, body image, and sexual functioning. These constructs reflected most of the patient-reported outcome domains in the COS. Two COS constructs, namely normality and self-esteem, were not represented as multi-item domains in the identified PROMs. Several questionnaires, however, included single items relating to each of these constructs. For example, the construct of normality was measured in three PROMs (BREAST-Q, MBROS-BI, and BRECON-31), each of which contained individual items referring to ''feeling normal.'' For the self-esteem construct, the BREAST-Q included four items in the psychosocial well-being subscale addressing this issue. BRECON-31 included four items addressing self-image and three items relating to feeling self-conscious. Details of domains and PROMs are presented in Table 3.
Table 4 presents the characteristics of the 14 studies included in the review. Studies were largely conducted in North America and/or Canada (n = 10), with only three studies based in Europe. One study recruited patients from 28 international centers. The sample sizes ranged from 20 to 5000 women with an age range of 18-84 years. Participants underwent a range of implant-based and autologous reconstructions, including pedicled and free transverse rectus abdominis musculocutaneous (TRAM) flaps and latissimus dorsi reconstruction with and without implants. Studies also included patients undergoing bilateral and unilateral surgery and patients receiving nipple/areola reconstruction as well as nipple-sparing procedures.

PROM Instruments Selection for Full COSMIN Evaluation
Of the six identified PROM instruments, only three, BREAST-Q, 19-21 BRECON-31, [22][23][24] and EORTC QLQ-BRECON-23, [25][26][27] were considered to have adequate content validity (see below). Of the remainder, the Michigan BR Outcome Study (MBROS) group developed BR-specific questionnaire item sets for satisfaction (MBROS-S) 28 and body image (MBROS-BI), 29 using input from an expert panel alone. There was no direct patient input into item generation or reduction; therefore, these questionnaires were considered to have insufficient content validity and were excluded from further COSMIN evaluation. Similarly, the patient-based subjective rating scale for BR appearance 30 did not assess content validity and was excluded. Finally, the BREAST-Q CAT 31 and the electronic BREAST-Q 32 were adapted versions of the main BREAST-Q questionnaire. As the main BREAST-Q was being assessed, these were excluded.

Overall Rating and Grading of Quality of Evidence per Measurement Property for Each PROM
A summary of the analysis and grading of measurement properties for each of the three PROM instruments included for full COSMIN evaluation is presented in Table 5. This includes the summary of pooled results (from each study per PROM), the overall rating, and the grading of the quality of evidence assigned to each of the measurement properties that were measured. The overall ratings and quality of evidence for each measurement property assessed for the three PROMs are presented in a simpler way in Table 6 for ease of comparison between instruments. Cross-cultural validity, measurement invariance, and measurement error were not assessed for any of the three included PROMs and thus are not included in Table 6.

Content Validity
All three included PROMs, BREAST-Q, BRECON-31, and EORTC QLQ-BRECON 23, exhibited sufficient high-quality evidence for the three aspects of content validity (relevance, comprehensiveness, and comprehensibility) as well as the quality of the PROM development, with all three PROMs using extensive input from patients undergoing BR in item formation and from systematic reviews. The development and design of the BREAST-Q questionnaire was extensive, with interviews and focus groups of representative BR patients, and included feedback from healthcare professionals on its relevance and comprehensiveness.
The BRECON-31 used robust item generation and item reduction methods. Items were generated from patient focus groups with additional input from an expert panel (plastic surgeons, breast surgeons, and advanced practice nurses) and a literature review. The literature review focused on published articles that related to breast cancer, quality of life, body image, satisfaction, and BR. The EORTC QLQ-BRECON 23 is intended for use alongside the EORTC QLQ-C30 and BR23 to assess patient-reported outcomes in women undergoing mastectomy for invasive breast cancer or ductal carcinoma in situ. 25 Content validity for this PROM showed sufficient high-quality evidence, with development phases incorporating literature reviews and interviews with patients and healthcare professionals.

Structural Validity
All three PROMs showed evidence of structural validity. Both BREAST-Q and EORTC QLQ-BRECON 23 were graded ''high'' for the quality of the evidence. Development of BREAST-Q involved Rasch modeling/methodology (a form of item response theory) to predict individual item responses and evaluate changes in an individual's health-related quality of life (HRQL). 19 Results showed that the fit to the Rasch model was good and item locations were spread out (0.7-6.6). EORTC QLQ-BRECON 23 used confirmatory factor analysis to test how well the measured variables represented the number of constructs. Studies included an adequate sample size in the analysis, and this instrument received an overall sufficient rating and high quality of evidence. BRECON-31 used exploratory factor analysis to identify the underlying relationships between the measured variables. However, the sample size included in the analysis was not adequate, and the instrument received an overall ''insufficient'' rating with low-quality evidence.

Internal Consistency
All three PROMs evaluated internal consistency, each scoring ''high'' for the quality of evidence. All questionnaires showed positive ratings, with Cronbach's α scores ranging from 0.67 to 0.96, suggesting high interrelatedness among constituent outcome measure items. BREAST-Q studies 19,20 reported acceptable Cronbach's α values (≥ 0.70) across the subscales (reconstruction module ranged from 0.88 to 0.96). The exception was the surgical side effects scale of the EORTC QLQ-BRECON 23 questionnaire, which scored 0.67 for Cronbach's α, below the acceptable threshold for internal consistency.
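For readers unfamiliar with the statistic, Cronbach's α for a scale of k items is k/(k - 1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. The sketch below implements this standard formula. The response data are invented for illustration and are not drawn from any of the included studies; only the 0.70 threshold comes from the text.

```python
from statistics import variance

ACCEPTABLE_ALPHA = 0.70  # threshold quoted in the text


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a scale.

    `item_scores` is a list of items, each a list of scores with one entry
    per respondent (all items must cover the same respondents).
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("at least two items are required")
    n = len(item_scores[0])
    if any(len(item) != n for item in item_scores):
        raise ValueError("every item needs one score per respondent")
    sum_item_variances = sum(variance(item) for item in item_scores)
    # Total score per respondent is the sum across items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))


# Invented responses: 3 items answered by 5 respondents on a 1-5 scale.
hypothetical_scale = [
    [4, 5, 3, 2, 4],
    [4, 4, 3, 2, 5],
    [5, 5, 2, 3, 4],
]
alpha = cronbach_alpha(hypothetical_scale)
print(round(alpha, 2), alpha >= ACCEPTABLE_ALPHA)
```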

Reliability
Reliability was assessed in all three PROMs. The quality of evidence for the measurement property varied, with only the BREAST-Q scoring as ''high''-quality evidence. The intraclass correlation coefficient (ICC) was reported across all three PROMs. For BREAST-Q scales, reliability was supported by high Cronbach's α values (> 0.80), high person separation indices (≥ 0.73), an ICC > 0.80, and appropriate item-total correlations (range of means 0.58-0.87). Test-retest reliability for all subscales of the BRECON-31 was good to excellent, with ICC showing excellent agreement (ICC > 0.74) for six of the subscales and good to fair agreement for self-image, arm, intimacy, and nipple subscales. For EORTC QLQ-BRECON 23, test-retest reliability was good, with ICCs for multi-item scales ranging from 0.809 to 0.916 and single items from 0.728 to 0.905. However, the quality of evidence scores for reliability for BRECON-31 and EORTC QLQ-BRECON 23 were ''very low'' and ''moderate,'' respectively.

Criterion Validity
Of the three PROMs, only the BRECON-31 evaluated this measurement property. 15 BRECON-31 used BREAST-Q as the reference standard (or gold standard) and performed well based on the level of concordance found between the two questionnaires. BRECON-31 showed excellent correlation (Pearson correlation coefficient [PCC] = 0.76) for five of the subscales (satisfaction, self-conscious, arm concerns, appearance, and expectations).

Hypothesis Testing for Construct Validity
Hypothesis testing for construct validity was assessed across all three PROMs, evaluating and demonstrating positive supporting evidence. BREAST-Q was compared with EORTC QLQ-BRECON 23, and hypotheses relating to correlations between BREAST-Q scales and other scales were widely supported through moderate correlations. BRECON-31 was compared with EQ-5D results. The EQ-5D showed moderate agreement with a summary score of the BRECON-31 (PCC = 0.50, p < 0.01), and utility ratings correlated moderately with BRECON-31 (PCC = 0.42, p < 0.001). Construct validity for the EORTC QLQ-BRECON 23 questionnaire was assessed using exploratory factor analysis (EFA). The EFA supported the phase 3 provisional six-scale structure; all item-factor weights exceeded 0.4.

Responsiveness
Of the three PROMs, only the EORTC QLQ-BRECON 23 evaluated this property. EORTC QLQ-BRECON 23 scored a sufficient overall rating and scored high for quality of evidence. Changes in mean scale scores from baseline to 6 months were statistically significant (p < 0.001). For scales such as sexuality and surgical side effects, the effect sizes were small, 0.37 and 0.31, respectively.
Table 7 summarizes the different aspects of feasibility evaluated for each PROM. BREAST-Q and EORTC QLQ-BRECON 23 were reported to be acceptable, comprehensible, and easy to complete by patients. The three PROMs differed slightly in the amount of time they took for patients to complete, due to differing numbers of items.

DISCUSSION
This study is, to the best of the authors' knowledge, the first to report a systematic review and critical appraisal of published studies reporting the measurement properties of PROMs developed for use in women undergoing BR using an updated COSMIN methodology. 11 BR is performed to improve patients' quality of life following mastectomy, and six key patient-reported outcome domains are included in the recently developed COS. 6 It is vital that any PROM used to assess these important outcomes be robustly designed and validated if the results are to be meaningful. This review is the first necessary step to understand the performance of existing PROMs to inform instrument selection for patient-reported outcome domains in a BR CMS.
The systematic review identified 14 studies which included 6 different PROMs developed for use in a BR population. Of these, only three, BREAST-Q, BRECON-31, and EORTC QLQ-BRECON 23, were considered to have adequate content validity and were eligible for full measurement property assessment. All three instruments have been used to assess patient-reported outcomes in BR studies, but the most widely used and cited is BREAST-Q. 33 BREAST-Q, BRECON-31, and EORTC QLQ-BRECON 23 all had thorough patient involvement in item generation and reduction, which has been shown to be critical and to greatly increase the validity of BR PROMs. 22

Strengths and Limitations
This study has certain strengths and limitations. To the best of the authors' knowledge, this is the first study that has used the recently updated COSMIN guidelines to assess the methodological quality of validation studies of BR PROMs. A validated and highly sensitive search strategy using published guidance from Terwee et al. 14 was used to identify all potentially relevant studies, and three reviewers independently assessed the quality of each study (with any disagreements resolved by a fourth reviewer), as recommended by COSMIN. The main limitation of this review is the assumption that, if validation studies of BR PROMs were not identified from the search, these had not been carried out. Therefore, the possibility of publication bias cannot be excluded. In addition, this review focused on PROMs developed in a BR population. However, there may be other instruments that may have value in this group (e.g., measures of self-esteem) but were not considered as these had not been developed or validated specifically in BR patients.
Critical appraisal was undertaken using the COSMIN checklist. This methodology has recently been developed and requires that PROM developers report in detail the methods used in the development and validation of their instrument. For PROMs developed before the introduction of COSMIN guidance, this information is often not reported in sufficient detail, if at all, and sometimes assumptions need to be made based on the information the author(s) have provided. Researchers developing PROMs in the future will need to follow COSMIN recommendations when reporting their studies to ensure complete reporting of study details and accurate interpretation of results.

Further Work
The aim of this review was to identify robustly validated PROMs that could be recommended to measure the six key patient-reported outcome domains in the BR COS. The three PROMs identified in this review measure most of the key constructs with specific subscales that adequately address each domain. The domains of ''normality'' and ''self-esteem,'' however, are not constructs specifically included in any of the identified instruments, but both BREAST-Q and BRECON-31 include single items which reflect these domains. Further work is now required to determine whether patients feel that these items are adequate or whether work is needed to develop new PROMs in these areas.
Next steps will involve consensus work with key stakeholders to determine which of the three candidate PROMs should be recommended for use. This process, involving a modified Delphi survey with over 100 professional stakeholders and face-to-face consensus meetings, is already underway. 34 Qualitative work with patients who have undergone BR surgery will also be needed to ensure that the selected PROMs are acceptable for this group.

CONCLUSIONS
This systematic review identified three robustly developed and validated PROMs that could be recommended for use in future BR studies and inclusion in the CMS. Work is now required to determine which instrument should be routinely recommended for use to improve the quality and comparability of BR research and optimize its value for patients.