A prognostic survival model for women diagnosed with invasive breast cancer in Queensland, Australia

Purpose Prognostic models can help inform patients on the future course of their cancer and assist the decision making of clinicians and patients in respect to management and treatment of the cancer. In contrast to previous studies considering survival following treatment, this study aimed to develop a prognostic model to quantify breast cancer-specific survival at the time of diagnosis. Methods A large (n = 3323), population-based prospective cohort of women were diagnosed with invasive breast cancer in Queensland, Australia between 2010 and 2013, and followed up to December 2018. Data were collected through a validated semi-structured telephone interview and a self-administered questionnaire, along with data linkage to the Queensland Cancer Register and additional extraction from medical records. Flexible parametric survival models, with multiple imputation to deal with missing data, were used. Results Key factors identified as being predictive of poorer survival included more advanced stage at diagnosis, higher tumour grade, “triple negative” breast cancers, and being symptom-detected rather than screen detected. The Harrell’s C-statistic for the final predictive model was 0.84 (95% CI 0.82, 0.87), while the area under the ROC curve for 5-year mortality was 0.87. The final model explained about 36% of the variation in survival, with stage at diagnosis alone explaining 26% of the variation. Conclusions In addition to confirming the prognostic importance of stage, grade and clinical subtype, these results highlighted the independent survival benefit of breast cancers diagnosed through screening, although lead and length time bias should be considered. Understanding what additional factors contribute to the substantial unexplained variation in survival outcomes remains an important objective. Supplementary Information The online version contains supplementary material available at 10.1007/s10549-022-06682-5.


Background
Breast cancer is one of the most common cancers diagnosed in women [1]. It is a highly heterogeneous disease with varying prognosis [2]. Prognostic models can help inform patients on the future course of their cancer and assist the decision making of clinicians and patients in respect to management and treatment of the cancer. Such models have been developed and validated to predict breast cancer survival among specific subgroups of patients, commonly based upon clinical and pathological prognostic factors. For example, the Nottingham Prognostic Index (NPI) [3] predicts survival for patients with operable breast cancer based upon tumour size, lymph-node status and histological grade. Alternatively, the 'PREDICT' model [4] and its more recent iteration 'PREDICT v2.0' [5] predict survival among breast cancer patients who have undergone surgery, based upon patient age, pathological factors, treatment factors and mode of detection.
In a broader context, there is evidence to suggest that poor survival prognosis following a breast cancer diagnosis may be associated with a variety of factors, including but not limited to younger or older age [6,7], socioeconomic disadvantage [8,9], and patient comorbidity [10,11]. In contrast to many previous studies looking at survival following treatment [12], the aim of this study was to develop a prognostic model to quantify breast cancer-specific survival at the time of diagnosis, using a range of information from the Breast Cancer Outcomes Study (BCOS), a large, prospective cohort of women diagnosed with invasive breast cancer in Queensland, Australia.

Data
The BCOS is a longitudinal study of women aged 20 to 79 years and diagnosed with invasive breast cancer in Queensland, Australia between 31st March 2010 and 30th June 2013. Eligibility criteria for the study included the ability to speak and understand English and be without any cognitive impairment that would prevent participation in a phone interview.
A total of 5426 potentially eligible women were identified from the Queensland Cancer Register for inclusion in the study. Of these, doctor's consent to contact was not provided for 688 women and 66 women were deceased. Of the 4672 women remaining, 3326 (71%) completed the interview and all but 3 (n = 3323) consented to their data from the study being prospectively linked with Registry data for follow-up purposes.
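The recruitment figures above can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in this paragraph; the variable names are illustrative and not taken from the study.

```python
# Recruitment counts as reported in the text (variable names are illustrative).
identified = 5426          # potentially eligible women from the register
no_doctor_consent = 688    # doctor's consent to contact not provided
deceased = 66              # died before contact
interviewed = 3326         # completed the telephone interview
declined_linkage = 3       # did not consent to prospective data linkage

approached = identified - no_doctor_consent - deceased
response_rate = interviewed / approached
analysed = interviewed - declined_linkage

print(f"Women remaining after exclusions: {approached}")
print(f"Interview response rate: {response_rate:.1%}")
print(f"Women in the analysis cohort: {analysed}")
```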
Full details of data collection procedures have been described previously [13,14]. In summary, a validated, semi-structured telephone interview administered by trained health interviewers was used to obtain information relating to each woman's experience of the pathway to breast cancer diagnosis. A self-administered questionnaire was completed by each woman following their telephone interview in which information was collected about detection of breast cancer (for example, information on frequency of screening and presence of symptoms), health status (including information on the presence of other chronic conditions or comorbidities, Body Mass Index (BMI) and physical activity levels) and socio-demographic characteristics (including age, education, marital status, employment, income, and area of residence).
Additional data including tumour characteristics were obtained by linkage of BCOS data with the Queensland Cancer Register. Additional clinical information was sourced from medical records. The linkage with the Queensland Cancer Register also provided information on all deaths that had occurred up to 31st December 2018, and the cause of death, which was classified according to the coding of the International Statistical Classification of Diseases and Related Health Problems, tenth revision (ICD10) [15].
The medical records provided a composite variable of stage at diagnosis in addition to information on the individual staging components of tumour size and nodal involvement. A variable indicating clinical subtype of the tumour was created based upon the positive or negative status of oestrogen receptor (ER), progesterone receptor (PR) and human epidermal growth factor receptor 2 (HER2), using the St Gallen 2013 criteria [16] excluding Ki67, as has been done elsewhere [6]. Where any of ER, PR or HER2 status was missing (n = 675, 20.3%), or where the combination was ER-/PR+ with HER2± (n = 23, 0.7%), the subtype was considered uncertain and set to missing.
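The derivation of the subtype variable can be sketched as a small classification function. Only the handling of missing receptors and of the ER-/PR+ combination comes from the text. The remaining mapping is an assumption, based on the commonly cited St Gallen 2013 surrogate definitions with Ki67 left out, and may differ from the rules the study applied. The function name and the category labels are hypothetical.

```python
from typing import Optional


def clinical_subtype(er: Optional[bool], pr: Optional[bool],
                     her2: Optional[bool]) -> Optional[str]:
    """Return a surrogate clinical subtype, or None when it is set to missing.

    True means receptor-positive, False receptor-negative, None not recorded.
    """
    # From the text: any missing receptor status gives a missing subtype.
    if er is None or pr is None or her2 is None:
        return None
    # From the text: ER-/PR+ (with any HER2 status) is treated as uncertain.
    if not er and pr:
        return None
    # ASSUMED mapping (St Gallen 2013 surrogates without Ki67).
    if er:
        if pr and not her2:
            return "Luminal A-like"
        return "Luminal B-like"
    return "HER2-enriched" if her2 else "Triple negative"


print(clinical_subtype(True, True, False))
print(clinical_subtype(False, False, False))
print(clinical_subtype(False, True, True))
```

Returning `None` for both missing and uncertain cases mirrors the text, where both groups are handed to the imputation step as missing values.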

Outcome
Breast cancer-specific survival was the outcome of interest. Death due to breast cancer was defined by an ICD10-code of C50 for cause of death.
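As an illustration of this outcome definition, the sketch below codes the event indicator from a cause-of-death field. The function name and the input format are hypothetical. Treating deaths from other causes as censored is the usual convention for cause-specific survival and is assumed here rather than stated in the text.

```python
from typing import Optional


def breast_cancer_death(died: bool, cause_icd10: Optional[str]) -> int:
    """Return 1 for a death coded to ICD-10 C50, otherwise 0 (censored)."""
    if not died or cause_icd10 is None:
        return 0
    code = cause_icd10.strip().upper()
    # C50 covers malignant neoplasm of breast, including subcodes such as C50.9.
    return 1 if code.startswith("C50") else 0


print(breast_cancer_death(True, "C50.9"))   # breast cancer death
print(breast_cancer_death(True, "I21.9"))   # death from another cause
print(breast_cancer_death(False, None))     # alive at end of follow-up
```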

Analysis
Approximately forty of the available patient, tumour, clinical, and healthcare variables were considered as possible variables of interest, based on a priori knowledge or evidence presented within the scientific literature of associations with cancer survival.
Cross tabulations examined the distribution of variables against the outcome variable of breast cancer-specific death. Univariate analyses using Cox Proportional Hazards models and stratified Kaplan-Meier survival estimates were performed to check for evidence of associations between variables of interest and breast cancer-specific death. Variables showing an unadjusted association with the outcome were then considered for inclusion in the prognostic model.
Determining the final prognostic model was done in several steps. Multiple imputation by chained equations was initially used to deal with the small percentages of missing values in the stage at diagnosis (1.7% missing), tumour size (2.2%) and tumour grade (1.4%) variables, and with the clinical subtype variable, in which 20% of values were missing; all were assumed to be missing at random [17]. Based upon standard practice [17], twenty-five imputations were performed on the basis that approximately 22% of all patients had at least one missing value among these variables.
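The text does not describe how estimates from the twenty-five imputed datasets were combined. The standard approach is Rubin's rules, which a minimal sketch can illustrate. The function name is hypothetical, and the example inputs are made-up numbers, not study results.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: the coefficient estimated in each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    pooled_estimate = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, math.sqrt(total)


# HYPOTHETICAL log hazard ratios and variances from five imputations.
est, se = pool_rubin([0.52, 0.48, 0.55, 0.50, 0.47],
                     [0.040, 0.042, 0.039, 0.041, 0.040])
print(f"pooled estimate {est:.3f}, pooled SE {se:.3f}")
```

The between-imputation term is what makes the pooled standard error larger than that of any single completed dataset, reflecting the uncertainty added by the missing values.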
Following imputation, flexible parametric survival models [18] with multivariable fractional polynomials (MFP) [19] were used to guide the selection of variables to include in the final model. This process employed a backward selection approach, sequentially removing the least significant variable from the model until all remaining variables in the model reported p values < 0.05. We ran analyses forcing age at diagnosis to be retained in the model on a priori grounds and given its practicality in a prognostic tool, and a separate analysis without this constraint. We then used flexible parametric models to determine whether the covariates retained in the model had a time-varying effect on the outcome. The best fitting model in respect to the scale and number of degrees of freedom of the baseline spline function was determined using the Bayesian Information Criterion (BIC) statistic [20].
To assess the discrimination and explained variation of the final prognostic model, and the individual contribution of each covariate in the model, we used the D statistic and associated R²_D proposed by Royston and Sauerbrei [21]. The measure of discrimination of a prognostic model indicates how well the model is able to differentiate between patient outcomes. We calculated the area under the Receiver-Operating Characteristic (ROC) curves for 5-year breast cancer-specific mortality, to ascertain how well the model identified patients with poorer survival prognosis. Additionally, we estimated the Harrell's C-index [22], a measure of goodness-of-fit, for the final prognostic model and models singularly including each covariate.
When deriving the prognostic model, we included tumour staging as a potential prognostic factor using the composite staging variable in the first instance, as that included information on the presence of metastases. We ran separate analyses using the tumour and nodes staging components to compare the discrimination, explained variation, and goodness of fit of both final models.
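To make the Harrell's C-index mentioned above concrete, the following is a minimal sketch of how a concordance statistic can be computed for right-censored data. It is a plain pairwise implementation written for illustration, not the Stata routine used in the study. The function name is hypothetical, and pairs with tied survival times are simply skipped.

```python
def harrell_c(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: follow-up time for each patient
    events: 1 if the event (death) was observed, 0 if censored
    risk_scores: model-predicted risk; higher means poorer prognosis
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is usable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # earlier death had the higher risk
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: the third patient is censored at time 3.
print(harrell_c([1, 2, 3, 4], [1, 1, 0, 1], [0.9, 0.7, 0.4, 0.1]))
```

A value of 0.5 corresponds to predictions no better than chance and 1.0 to perfect ranking of patients by outcome.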
Post-estimation prediction was used to estimate the survival probability up to 5 years following breast cancer diagnosis, based upon differing combinations of the prognostic factors. All analyses were conducted using Stata 16 [23].

Results
Of the 3323 women who were part of this study, 251 (7.6%) had died by the time of last follow-up, and the cause of 174 (69%) of these deaths was attributed to breast cancer. The median follow-up time of the study was 6.8 years (range 0.9-8.8 years).
Nine variables showed evidence of an association with breast cancer-specific survival based upon univariate analyses (Table 1). The three staging variables (the composite variable and individual tumour and nodes staging components) were associated with the outcome. Other variables associated with breast cancer-specific survival were age at diagnosis, mode of detection (i.e., via screening or symptoms), tumour grade, clinical subtype, diagnostic interval (time between symptoms or screening occurring and receiving a cancer diagnosis) and private health insurance status. Other variables considered are listed in Supplementary Table 1.
Around half of the patients (48%) who had died from their breast cancer had a stage III or stage IV diagnosis, while approximately three-quarters (72%) of the patients who died were aged 50 years or more (Table 1). The majority (85%) of the patients who died had a symptomatic diagnosis of breast cancer. Over half (54%) of the patients had Luminal A subtype, while approximately 3% of the patients with this subtype died due to their breast cancer. By contrast, far fewer patients (8%) had a tumour of Triple Negative subtype, but 13% of these patients died from their breast cancer. Almost two-thirds (64%) of patients who died due to their breast cancer had tumours with high grade (poorly differentiated) cells. A quarter (25%) of patients who died had a diagnostic interval of 60 or more days. Over half of the patients who died (56%) had full health insurance.

Prognostic model
The final model contained age, stage at diagnosis (composite staging information), tumour grade, clinical subtype, and mode of detection. Age was modelled using restricted cubic splines with two internal knots and centred at age 60 (close to the mean age of patients: 57.5 years). The model with the baseline spline function on the probit scale with 2 degrees of freedom (df) had the lowest BIC statistic. From the analysis using flexible parametric models, there was no evidence of any covariates having a time-varying effect on the outcome.
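For a flexible parametric model on the probit scale, predicted survival is obtained by passing the negative of the linear predictor through the standard normal distribution function, S(t | x) = Φ(−(s(ln t) + xβ)), where s is the baseline spline in log time. The sketch below shows that conversion only. All coefficients are hypothetical, and the baseline is a simple straight line in log time standing in for the 2-df spline, so the printed values are not predictions from the fitted model.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def survival_probit(eta: float) -> float:
    """Survival probability when -probit(S) equals the linear predictor eta."""
    return normal_cdf(-eta)


def linear_predictor(t_years, covariates, coefficients, baseline):
    """eta(t) = baseline(ln t) + sum of coefficient * covariate value."""
    return baseline(math.log(t_years)) + sum(
        coefficients[name] * value for name, value in covariates.items()
    )


def baseline(log_t: float) -> float:
    """HYPOTHETICAL baseline, linear in log time (stand-in for the spline)."""
    return -2.6 + 0.5 * log_t


# HYPOTHETICAL coefficients and patient profile, for illustration only.
coefficients = {"stage_III": 0.9, "symptom_detected": 0.3}
patient = {"stage_III": 1, "symptom_detected": 1}

for years in (1, 5):
    eta = linear_predictor(years, patient, coefficients, baseline)
    print(years, round(survival_probit(eta), 3))
```

Because the baseline is a smooth function of log time, survival can be evaluated at any time point, which is what allows predictions at both 1 and 5 years from a single fitted model.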
The measures of discrimination (D statistic) and explained variation (R²_D), and Harrell's C goodness-of-fit statistic, of the final prognostic model and each covariate in the model are provided in [...]. Removing subtype or age from the final model led to a reduction in D of 0.05 and 0.03 and a reduction in R²_D of 0.02 and 0.01, respectively. When age was included alone in a model it explained less than 1% of the variation in survival.

Predicted survival probabilities
The model was used to predict breast cancer-specific survival, that is, survival under an assumption that breast cancer is the only possible cause of death. To demonstrate the application of this model, we predicted 1-year and 5-year survival probabilities of 12 hypothetical patients, using different scenarios of prognostic factor combinations to characterise these patients (Table 3, Fig. 2). Generally, predicted survival was poorer among patients with advanced stage of disease and of older age. Survival tended to be higher among patients whose breast cancer was detected via screening rather than from symptoms and who had tumours of early stage and low grade. Patients with tumours of Luminal A-like clinical subtype had a better prognosis, while those with Triple Negative subtype tended to fare the worst.

Discussion
This study used detailed data from a longitudinal, population-based cohort of breast cancer patients, supplemented with cancer registry data. This enabled a greater range of potential prognostic factors to be considered in the development of a prognostic survival model than would be possible using cancer registry data alone. Importantly, our results confirmed the prognostic importance of stage at diagnosis, tumour grade, and clinical subtype for women diagnosed with breast cancer, along with age at diagnosis, but also highlighted the survival benefit of breast cancers diagnosed through screening, independently of the measured clinical characteristics.
To our knowledge, this is the first prognostic model of breast cancer survival that has been developed using flexible parametric survival models. Cox Proportional Hazards models have more typically been used for this purpose [24]. The Cox models follow an assumption that each covariate has a constant impact on the hazard throughout the follow-up period [25]. Using flexible parametric models, we were able to evaluate potential time-dependent effects of covariates while developing the final prognostic model. An additional benefit with using this approach was the ability to predict survival at different time points following breast cancer diagnosis, according to different combinations of prognostic factors.
The clinical benefit and common use of prognostic breast cancer survival models has been demonstrated through previous studies. The Nottingham Prognostic Index (NPI) allocates patients into groups based on their survival prognosis and predicts 5-year survival. It has performed well under validation using independent datasets [26,27], and has been used widely in breast cancer management, for example, in selection for breast conserving surgery or for adjuvant therapy [3]. Similarly, the 'PREDICT' prognostic tool [4] is intended for use by patients and clinicians to assess the impact of different treatments on survival following surgery. Since being released as an online tool in 2011 it has been accessed frequently, and from locations around the world [5].
The NPI was constructed using data of approximately 400 breast cancer patients treated by mastectomy at a hospital in Nottingham, England, and is based upon three factors (tumour size, grade, and lymph-node stage). Like the NPI, our prognostic model offers simplicity, an important criterion when developing a clinically useful model [28]. Stage at diagnosis was the strongest predictor in our prognostic model. The staging components of tumour size and nodal status are the two most common predictors in published breast cancer survival models [24].
The prognostic model highlighted survival differences between women whose breast cancer was screen-detected and women whose breast cancer was detected via symptoms. These differences may be influenced by lead time [29], where screening has brought the diagnosis forward and represents an artificial addition to the survival time of screen-detected patients. There is also a possibility of length bias [29], where screening has detected slow-growing breast cancers while they are screen-detectable but not yet symptomatic. The prognostic model predicted survival differences even after adjustment for stage at diagnosis, thus after accounting for the possibility of earlier stage detection among screened women. Findings such as these could be useful in a public health setting, for example in health promotion initiatives highlighting benefits of breast cancer screening. In Australia, women aged between 50 and 74 years are invited for free breast cancer screening every two years [30], while women aged 40-49 years and those aged over 74 years are also eligible for free screening. Our prognostic model predicted better survival among BCOS patients with a screen-detected versus symptomatic breast cancer diagnosis, regardless of age and clinical characteristics.
Potential data limitations include the possibility of selection bias, as the study cohort excluded women with more advanced stage of disease who were too unwell or unable to take part [13]. As a consequence, the prognostic model may overestimate survival among women with advanced stage of disease, relative to what may be seen among a more representative population of breast cancer patients. The exclusion of patients with advanced stage may also lead to an underrepresentation of patients with characteristics traditionally associated with advanced stage cancer and poorer prognosis. For example, socio-economically disadvantaged breast cancer patients tend to have greater risk of presenting with advanced stage of disease [31,32], and tend to have poorer survival outcomes [8,33,34].
With the exception of mode of detection, the prognostic model contained variables sourced from the Queensland Cancer Register or from medical records. Mode of detection information was collected from BCOS patients using computer-assisted telephone interview, by asking how the breast cancer was first detected. Self-reporting biases may arise during data collection via survey questionnaire methods [35]. The composite stage at diagnosis variable used in the prognostic model was obtained from review of medical records. There was a small amount of missing information within the stage at diagnosis variable and within some of the other variables included in the prognostic model. We used multiple imputation to handle this missing information, under an assumption the data were missing at random [36].
Externally validating prognostic models using independent datasets gives an indication of the predictive ability and generalisability of these models [28]. Some breast cancer prognostic survival models have shown weaker performance in independent populations [24].
Although our model has not yet been validated using an external dataset, internal validation of the model using Harrell's concordance statistic indicated it had good ability to predict survival. The methodology used to develop our prognostic model has been used in the development of a prognostic survival model of invasive cutaneous melanoma [37]. The melanoma model offered greater discrimination than the breast cancer model (D-statistic of 1.53 versus 1.20), possibly due to the higher prognostic contribution that melanoma thickness had on melanoma survival compared to tumour diameter for breast cancer survival, even though both were the most important contributors in their respective prognostic model. Consistent with this, the prognostic factors in the melanoma model explained approximately half of the variation in survival while the breast cancer model explained one-third of the variation in survival.
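The link between the D statistics quoted above and the proportions of explained variation can be reproduced with the Royston and Sauerbrei formula, R²_D = (D²/κ²) / (σ² + D²/κ²) with κ² = 8/π. The sketch below takes σ² = 1, the value that applies to models on the probit scale. This is an assumption based on the scale reported for the breast cancer model, and it may not be the scale used for the melanoma model. For proportional hazards models σ² would be π²/6 instead. The function name is hypothetical.

```python
import math

KAPPA_SQUARED = 8 / math.pi  # scaling constant in the definition of D


def r2_d(d: float, sigma2: float = 1.0) -> float:
    """Explained variation R²_D implied by a D statistic.

    sigma2 is the error variance of the model scale: 1.0 is ASSUMED here
    (probit scale); use math.pi ** 2 / 6 for proportional hazards models.
    """
    signal = d ** 2 / KAPPA_SQUARED
    return signal / (sigma2 + signal)


print(f"D = 1.20 gives R2_D = {r2_d(1.20):.2f}")
print(f"D = 1.53 gives R2_D = {r2_d(1.53):.2f}")
```

Under this assumption a D statistic of 1.20 corresponds to roughly 36% explained variation and 1.53 to just under half, in line with the figures described in the text.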
Consistent with prognostic models for other cancer types [37,38], this study has specifically focused on survival outcomes using information available at the time of diagnosis. While this provides greater clarity about the role that socio-demographic, diagnostic and other clinical characteristics have in patient survival, there is an increasing role for tailored treatment plans for cancer patients that also consider additional information, including information about specific treatments, that becomes available during the management pathway [12]. As such, while our current model is novel in focusing on diagnostic-related variables, there is the potential to create an expanded prognostic model for this breast cancer cohort that incorporates additional post-diagnosis variables, including treatment-related factors, to compare with other published decision tools.
In conclusion, by using a large, population-based survey of Australian women diagnosed with breast cancer, this study provided novel insights into the important factors known at the time of diagnosis that influence survival outcomes, insights that could not have been obtained through more typical cancer registry-based analyses. Assessing the performance of our model using other breast cancer cohorts would be a beneficial and informative next step. Given the large percentage of survival variation that was unexplained by this model, gaining a better understanding of what additional factors explain survival outcomes among women diagnosed with breast cancer will require dedicated research studies that include a more comprehensive range of factors and/or more nuanced measurements.
Author contributions PB and HF contributed to the study conception and design. Data analyses were performed by HF with guidance by PB. The first draft of the manuscript was written by HF and all authors commented on subsequent versions of the manuscript. All authors read and approved the final manuscript.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions. The initial establishment of the Breast Cancer Outcomes Study was funded by a Cancer Australia Grant (#100639) and Cancer Council Queensland. No further funding was obtained for the analyses described in this study.

Data availability
The de-identified dataset analysed during the current study is not publicly available due to the requirements of the original ethical approval but would be available from the corresponding author on reasonable request following a formal data sharing agreement.

Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.