Background

Breast cancer (BC) is by far the most frequently occurring cancer in women. Every year 522,000 women die from BC [1].

Mammography is used as a screening tool for early diagnosis but has its limitations due to over-diagnosis and a modest impact on mortality [2]. Recent evidence demonstrates that dissemination might occur during the very early stages of tumor evolution and before clinical manifestation of the cancer in the breast [3]. The analyses of circulating markers in order to identify women with disseminated disease before diagnosis have not been successful [4].

Numerous studies have demonstrated that patients with disseminated tumor cells in the bone marrow [5,6,7] or circulating tumor cells (CTCs) [8,9,10,11,12] have an inferior prognosis. The immunocytochemical detection of CTCs is reliant upon the isolation of intact cells.

Adjuvant systemic treatment has reduced BC mortality over the last two to three decades [13]. The current strategy guiding administration of adjuvant systemic treatment relies on primary tumor characteristics. However, systemic relapse and subsequent death are caused by disseminated disease whose biological properties may be very different from those of the primary tumor [14].

Recently, markers based on DNA shed from tumor cells have shown great promise in monitoring treatment response and predicting prognosis [15,16,17,18,19]. However, efforts to characterize the cancer genome have shown that only a few genes are frequently mutated in cancer and the site of mutation per gene differs across tumors [20]. A further limitation is that current technology only allows for the detection of mutant allele fractions down to about 0.1% [15, 21].

Over the last decade, DNA methylation (DNAme) has been shown to be a hallmark of cancer [22] and occurs very early in BC development [23]. DNAme is centered around specific regions (CpG islands) [22] and is chemically and biologically stable. This enables the development of early detection tools and personalized treatment, based upon the analysis of cell-free DNA contained within serum or plasma [24,25,26,27,28,29]. However, two major challenges have to be overcome: (1) the very low abundance of cancer-DNA in the blood; and (2) the high level of “background DNA” shed from white blood cells (WBC) [30] in banked samples.

To date, virtually all research work has been carried out in relatively small studies and focused on the analyses of cell-free DNAme in metastatic/relapsed breast cancers using markers from previously published studies [31]. In our study we: (1) used an epigenome-wide approach to identify new markers which indicate disseminated breast cancer; (2) analyzed the top marker in 419 primary non-metastatic patients before (i.e. immediately after resection of the primary breast cancer) and after adjuvant chemotherapy; and, most importantly (3) analyzed the marker in 925 healthy women who either remained healthy or developed fatal or non-fatal BC within the first three years after serum sample donation.

Methods

Patients and sample collection

We used a total of 31 tissues and 1869 serum samples (Fig. 1). In Phase 1, we analyzed breast cancer tissue and WBCs in order to identify breast cancer specific DNAme markers. In Phase 2, we established serum DNAme assays using serum sets 1 and 2, collected from women attending hospitals in London, Munich, and Prague, who were invited to participate and gave consent. Blood samples (20–40 mL) were obtained (in VACUETTE® Z Serum Sep Clot Activator tubes) and centrifuged at 3000 rpm for 10 min, and the serum was collected and stored at −80 °C. Finally, Phase 3 was initiated to validate the top marker performance by using serum samples from two large clinical studies: (1) from 419 patients recruited within the SUCCESS trial [10] (ClinicalTrials.gov registration ID is NCT02181101), where blood samples were taken before and after chemotherapy and (within 96 h) sent to the laboratory for CTC assessment and serum samples stored (Additional file 1: Figure S1); and (2) from UKCTOCS [32] (ClinicalTrials.gov registration ID is NCT00058032), where serum samples were used from: (i) 229 women who were diagnosed with BC within the first three years after serum sample donation and subsequently died during follow-up; (ii) 231 matched women who developed BC within three years after sample donation and were alive at the end of follow-up; and (iii) 465 women who did not develop BC within five years after sample donation (Additional file 1: Figure S2). Blood samples from all UKCTOCS volunteers were spun down for serum separation after having been transported at room temperature from trial centers to the central laboratory. The median time between sample collection and centrifugation was 22.1 h. Only 1 mL of serum per UKCTOCS volunteer was available. All patients provided written informed consent.

Fig. 1

Study design. Using reduced representation bisulfite sequencing (RRBS), 31 human tissue samples were analyzed to identify a total of 18 regions which underwent thorough technical validation. Six regions were selected whose methylation status has been analyzed in two sets consisting of 110 serum samples. One marker (EFC#93) has been validated in two independent settings: (1) in SUCCESS study serum samples from BC patients before and after chemotherapy; and (2) in UKCTOCS serum samples from women before BC diagnosis (within three years) or who remained healthy for five years

Isolation and bisulfite modification of DNA

DNA was isolated from tissue and serum samples at GATC Biotech (Konstanz, Germany). Tissue DNA was quantified using NanoDrop™ and Qubit™, and the size was assessed by agarose gel electrophoresis. Serum DNA was quantified using the Agilent Fragment Analyzer and the High Sensitivity Large Fragment Analysis Kit (AATI, USA). DNA was bisulfite converted at GATC Biotech.

DNAme analysis in tissue

Genome-wide methylation analysis was performed by reduced representation bisulfite sequencing (RRBS) at GATC Biotech. DNA was digested with MspI followed by size selection of the library, providing enhanced coverage for the CpG-rich regions [33, 34]. The digested DNA was adapter-ligated, bisulfite-modified, and polymerase chain reaction (PCR)-amplified. The libraries were sequenced on Illumina’s HiSeq 2500. Analysis of the first samples sequenced with a 100-bp paired-end mode showed that the library insert size was small. Therefore, the remaining samples were sequenced with a 50-bp paired-end mode. Using Genedata Expressionist® for Genomic Profiling v9.1, we established a bioinformatics pipeline for the detection of cancer specific differentially methylated regions (DMRs). The most promising DMRs were taken forward for the development and validation of serum-based clinical assays.

Targeted ultra-high coverage bisulfite sequencing of serum DNA

Targeted bisulfite sequencing libraries were prepared at GATC Biotech. Bisulfite modification was performed with 1 mL serum equivalent. A two-step PCR approach was used to test up to three different markers per modified DNA sample. The first PCR amplifies the target region and adds linker sequences which are used in the second PCR to add barcodes for multiplexing and sequences needed for sequencing. Ultra-high coverage sequencing was performed on Illumina’s MiSeq or HiSeq 2500 with a 75-bp or 125-bp paired-end mode, respectively.

Data analyses

Genedata Expressionist® for Genomic Profiling was used to map reads to human genome version hg19, identify regions with tumor-specific methylation patterns, quantify the occurrence of those patterns, and calculate relative pattern frequencies per sample. Pattern frequencies were calculated as number of reads containing the pattern divided by total reads covering the pattern region. Methylation patterns are represented in terms of a binary string, where the methylation state of each CpG site is denoted by “1” if methylated or “0” if unmethylated. The algorithm that we developed scans the whole genome and identifies regions that contain at least ten aligned paired-end reads. These read bundles are split into smaller regions of interest (ROIs) which contain at least 4 CpGs in a stretch of < 150 bp. For each region and tissue/sample, the absolute frequency (number of supporting reads) for all observed methylation patterns was determined (Fig. 2a). This led to the discovery of tens of millions of patterns per tissue/sample. The patterns were filtered in a multi-step procedure to identify the methylation patterns specifically occurring in tumor samples. To increase the sensitivity and specificity of our pattern discovery procedure, we pooled reads from different tumor or WBC samples and scored patterns based on over-representation within tumor tissue. The results were summarized in a specificity score, Sp, which reflects the cancer specificity of the patterns. After applying a cut-off of Sp ≥ 10, 1.3 million patterns for BC remained and were further filtered according to the various criteria detailed in Fig. 2b (further details are provided in Additional file 2).
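
The pattern-frequency definition above can be illustrated with a minimal Python sketch. It is not the Genedata pipeline used in the study: the function names, the toy reads, and the way the ROI span is measured (first to last CpG position, inclusive) are assumptions made purely for illustration.

```python
from collections import Counter


def pattern_frequencies(read_patterns):
    """Relative frequency of each methylation pattern in one region.

    read_patterns: one binary string per read covering the region,
    where '1' is a methylated CpG and '0' an unmethylated CpG.
    Frequency = reads containing the pattern / total reads covering the region.
    """
    total = len(read_patterns)
    if total == 0:
        return {}
    counts = Counter(read_patterns)
    return {pattern: n / total for pattern, n in counts.items()}


def is_roi(cpg_positions, min_cpgs=4, max_span=150):
    """ROI criterion from the text: at least 4 CpGs in a stretch of < 150 bp.

    Assumption: the stretch is measured from the first to the last CpG
    position, inclusive.
    """
    if len(cpg_positions) < min_cpgs:
        return False
    span = max(cpg_positions) - min(cpg_positions) + 1
    return span < max_span


# Toy data, invented for illustration: 10 reads over a region with 5 linked CpGs.
reads = ["11111", "00000", "00000", "01000", "00000",
         "00000", "11111", "00000", "00010", "00000"]
freqs = pattern_frequencies(reads)
print(freqs["11111"])  # fraction of reads carrying the fully methylated pattern
```

In this toy region, 2 of the 10 reads carry the fully methylated pattern "11111". In real serum data the corresponding fraction is several orders of magnitude smaller, which is why ultra-high coverage is required.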

Fig. 2

Principles of methylation pattern discovery in tissue (a, b) and analyses in serum (c). a RRBS was used in tissue samples in order to identify CpG methylation patterns that are able to discriminate breast cancer from white blood cells (which were deemed to be the most abundant source of cell-free DNA). “0” represents an unmethylated CpG and “1” represents a methylated CpG. An example of region EFC#93 is provided which is a 136-bp-long region containing five linked CpGs. The cancer pattern consists of reads in which all linked CpGs are methylated, indicated by “11111.” b RRBS data have been processed through a bioinformatic pipeline to identify the most promising markers. c The principles of the serum DNA methylation assay

The 95% confidence intervals (CI) for sensitivity and specificity have been calculated according to the efficient-score method [35]. The endpoints were defined according to the STEEP criteria, with relapse-free survival and overall survival as the primary endpoints. The product-limit method according to Kaplan–Meier was used to estimate survival. The survival estimates in different groups were compared using the log-rank test. The Cox proportional hazards regression model was used for the analyses taking into account all variables simultaneously.
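
As an illustration of the interval calculation, the sketch below assumes that the efficient-score method corresponds to the Wilson score interval without continuity correction; the source does not state which variant was used, and the counts in the example are hypothetical rather than taken from the study.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.

    successes: e.g. number of true positives (for sensitivity)
    n:         e.g. number of cases (for sensitivity)
    z:         normal quantile; 1.959964 gives a two-sided 95% interval
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical counts: 30 marker-positive samples among 50 cases.
low, high = wilson_interval(30, 50)
print(f"sensitivity 60.0%, 95% CI {low:.3f} to {high:.3f}")
```

Unlike the simple normal approximation, this interval stays within 0 and 1 and behaves sensibly for small groups and for proportions near the boundaries.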

Further details on samples and methods can be found in Additional file 2.

Results

The samples, techniques, and purpose of the three phases used in this study (marker discovery, assay development, and assay validation) are summarized in Fig. 1. We first identified DMRs based on their methylation patterns and frequencies in relevant genomic regions, within a BC tissue panel. Methylation patterns with high specificity for breast cancer tissue were identified using the procedure described in Fig. 2b.

The 18 selected BC specific patterns identified by RRBS were further validated using bisulfite sequencing. Thirty-one bisulfite sequencing primer pairs (1–3 per region) were designed and technically validated to determine PCR efficiency and sensitivity. A dilution series obtained by mixing fully unmethylated (i.e. whole genome amplified DNA) with fully methylated DNA (i.e. whole genome amplified DNA treated with CpG methyltransferase) was used to select six reactions which showed good coverage after sequencing (> 10⁴ reads) and sensitivity in highly diluted (< 1:10⁴) samples (Additional file 3: Table S1). The best six reactions were taken into Phase 2, for further testing and assay development, in prospectively collected serum sets. We used ultra-deep bisulfite sequencing to develop assays for these candidate regions in 32 serum samples from Serum Set 1 (Figs. 1 and 2c). Five of the six reactions showed good sensitivity and specificity (particularly when discriminating between metastatic and primary BC), based on the abundance of tumor-specific patterns (see Additional file 1: Figure S3 for a complete overview of pattern counts from region EFC#93) and were selected for further validation in Serum Set 2 (n = 78). DNA methylation marker EFC#93, which was identified in RRBS as a region of ten linked CpGs methylated in BC, was optimized to a pattern of five linked CpGs and showed the best sensitivity and specificity independently in Sets 1 and 2 (Additional file 1: Figure S4). A statistically significantly higher pattern frequency, for the optimized marker EFC#93, was observed in the metastatic BC groups compared to the healthy/benign lesions or primary BC groups, in both Sets 1 and 2. This translates to an area under the curve (AUC) of a receiver operating characteristics (ROC) curve of 0.850 (95% CI = 0.745–0.955, P = 0.000004) and 0.845 (95% CI = 0.739–0.952, P = 0.000004) to discriminate healthy/benign lesions or primary BC from metastatic BC in Set 1 and Set 2, respectively. 
When Set 1 and 2 data were combined, the pattern frequency threshold was set to 0.0008 (i.e. 8 in 10,000 reads demonstrated methylation at all CpGs in the EFC#93 region), which led to a sensitivity of 60.9% and a specificity of 92.0% with respect to identifying metastatic BC (Additional file 1: Figure S4).
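
How a pattern-frequency threshold translates into sensitivity and specificity can be sketched as follows. The function name and the sample values are hypothetical; the only elements taken from the text are the threshold of 0.0008 and the convention, stated in the legend of Fig. 3 for the lower threshold, that a sample counts as positive when its frequency is at or above the cut-off.

```python
def sensitivity_specificity(frequencies, is_case, threshold=0.0008):
    """Classify samples by pattern frequency and score against known status.

    frequencies: EFC#93-style pattern frequency per sample
    is_case:     True for metastatic BC, False for healthy/benign/primary BC
    A sample is called positive when frequency >= threshold.
    Returns (sensitivity, specificity); a value is None if it is undefined.
    """
    tp = fn = tn = fp = 0
    for freq, case in zip(frequencies, is_case):
        positive = freq >= threshold
        if case and positive:
            tp += 1
        elif case:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Invented toy data: three cases followed by five controls.
toy_freqs = [0.0020, 0.0009, 0.0001, 0.0, 0.0003, 0.0010, 0.0, 0.0002]
toy_status = [True, True, True, False, False, False, False, False]
sens, spec = sensitivity_specificity(toy_freqs, toy_status)
print(f"sensitivity {sens:.3f}, specificity {spec:.3f}")
```

Lowering the threshold in such a scheme can only keep or increase the number of samples called positive, which is the trade-off behind any adjustment of the cut-off for samples with a diluted cancer signal.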

EFC#93 was then validated for use as a prognostic and predictive BC marker in clinical trial samples (Fig. 1). As expected, due to delayed sample processing within these trials, serum samples from both SUCCESS and UKCTOCS contained high levels of contaminating WBC DNA, leading to dilution of the cancer signal (Additional file 1: Figure S5). In order to adjust for this, we made an a priori decision to reduce the threshold for EFC#93 pattern frequency by a factor of 10 to 0.00008 (i.e. 8/100,000 reads demonstrated methylation at all five linked CpGs within the EFC#93 region). Table 1 shows SUCCESS patient characteristics, correlated with EFC#93 positivity/negativity, before and after chemotherapy. Using our predetermined threshold, EFC#93 positivity was significantly associated with CTC presence, both before and after chemotherapy (Chi-square test, P < 0.01, Table 1), although EFC#93 pattern frequencies were not significantly different in samples from patients with either no, 1–4, or > 4 CTCs detected, respectively (Additional file 1: Figure S6). Patients who underwent breast-conserving therapy were more likely to be EFC#93-negative compared to patients who underwent a mastectomy; this is in all probability explained by the fact that patients who presented with larger tumors tended to be EFC#93-positive and would not have been eligible for breast-conserving surgery. This is consistent with the finding that EFC#93 positivity after chemotherapy is significantly (P = 0.014) less frequently observed in early stage (T1) compared to late stage (T2–4) cancers. None of the other clinical–pathological features correlated with cell-free DNA methylation of EFC#93 (Table 1). EFC#93 serum positivity before chemotherapy was a very strong marker of poor prognosis, for both relapse-free and overall survival (Table 2 and Fig. 3a and b). This was independent of the prognostic capability of CTCs (Additional file 1: Figures S7 and S8). 
Hazard ratios (HRs) (95% CI) for overall survival in the multivariable model were 5.973 (2.634–13.542) and 3.623 (1.681–7.812) for EFC#93 and CTCs, respectively (Table 2). Patients who were CTC-positive and EFC#93-positive had an extremely poor outcome, with > 70% of these patients relapsing within five years (Fig. 3c and d). Neither serum marker EFC#93 nor CTCs alone were predictive of the outcome in samples collected after chemotherapy (Additional file 1: Figures S9 and S10).
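
The reported hazard ratios and intervals can be read with a small consistency check. The sketch assumes Wald-type intervals, i.e. exp(log HR ± z × SE), which is the usual output of Cox regression software but is not stated in the text. Under that assumption the HR equals the geometric mean of its interval bounds, and the standard error of log HR can be recovered from the interval width.

```python
import math


def implied_hr_and_se(lower, upper, z=1.959964):
    """Recover the point estimate and SE(log HR) from a Wald-type 95% CI.

    Assumes the interval was built as exp(log(HR) +/- z * SE).
    """
    log_lower, log_upper = math.log(lower), math.log(upper)
    hr = math.exp((log_lower + log_upper) / 2.0)
    se = (log_upper - log_lower) / (2.0 * z)
    return hr, se


# Values reported in the text for overall survival (multivariable model).
reported = [("EFC#93", 5.973, 2.634, 13.542), ("CTCs", 3.623, 1.681, 7.812)]
for name, hr, lower, upper in reported:
    implied, se = implied_hr_and_se(lower, upper)
    print(f"{name}: reported HR {hr}, implied HR {implied:.3f}, SE(log HR) {se:.3f}")
```

Both implied values agree with the reported hazard ratios to within rounding, which is what would be expected if the intervals are symmetric on the log scale.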

Table 1 SUCCESS patient characteristics before and after chemotherapy for EFC#93 serum DNAme
Table 2 Univariate and multivariable proportional hazards model for relapse-free and overall survival for SUCCESS serum samples
Fig. 3

EFC#93 serum DNAme and CTC analyses in the SUCCESS trial in samples taken before chemotherapy. Kaplan–Meier analysis for relapse-free survival (a) and overall survival (b) according to the presence (EFC#93 pattern frequency ≥ 0.00008) or absence (EFC#93 pattern frequency < 0.00008) of marker EFC#93 before chemotherapy. Kaplan–Meier analysis for relapse-free survival (c) and overall survival (d) according to the presence/absence of EFC#93 and CTCs. P values from a two-sided log-rank test. CTC– no CTC present, CTC+ at least one CTC present

To assess whether EFC#93 serum DNAme can diagnose women with poor prognostic BC earlier, we analyzed serum samples from 925 women from our UKCTOCS cohort. The amount of DNA as well as the fragment length were dramatically higher than expected and correlated with the average UK temperature (Additional file 1: Figures S11 and S12); there was also a good correlation between DNA amount and fragment length (Additional file 1: Figure S13), indicating a substantial leak of blood cell DNA into the serum during blood transport. Within this nested case/control setting, the women with BC (cases) had provided serum samples up to three years before diagnosis. Again, we hypothesized a priori that the high background levels of DNA from lysed blood cells would reduce assay sensitivity, particularly in a pre-clinical setting where only traces of cancer DNA were expected in the circulation. We therefore split all samples into two groups: (1) low serum DNA amount; and (2) high serum DNA amount. In the “low DNA” group, we observed a significantly higher EFC#93 serum DNAme pattern frequency in the women who developed BC within one year after sample donation and subsequently died (Fig. 4a; cut-off threshold of 0.00008). Due to the high levels of background DNA, no significant findings were observed in the “high DNA” sample groups (Fig. 4b). In the “low DNA” group, EFC#93 DNAme was able to identify 43% of women 3–6 months and 25% of women 6–12 months before the diagnosis of a BC which eventually led to death, with a specificity of 88% (Fig. 4c). The sensitivity of serum EFC#93 methylation in detecting fatal BCs up to one year in advance of diagnosis was ~ 4-fold higher compared to non-fatal BCs (33.9% compared to 9.3%). In fact, the sensitivity for non-fatal BCs was within the false-positive range of the healthy samples, indicating that non-fatal BCs are not detected with this marker.

Fig. 4

Pattern frequency of EFC#93 in women from the UKCTOCS. EFC#93 pattern frequency in samples with low (a) or high (b) amounts of DNA in the serum sample. c Performance of EFC#93 serum DNAme marker (cut-off threshold = 0.00008) depending on time to diagnosis and whether or not women subsequently died. Data separated based on DNA amount in the serum sample (95% CI in brackets). P values in (a) and (b) are from a Mann–Whitney U-test and are relative to the control group. Control no cancer developed, BC-D breast cancer which eventually led to death, BC-ND breast cancer which did not lead to death, mo months, yr years

Discussion

We demonstrate that our serum DNAme marker, EFC#93, can be detected up to one year in advance of BC diagnosis and is a marker for poor prognosis in the adjuvant primary treatment setting. EFC#93 is located within GP5, a gene coding for a surface glycoprotein which has been suggested to be involved in hematogenous breast cancer metastasis [36].

Detecting tumor-specific methylated DNA in serum by targeted ultra-high coverage bisulfite sequencing has the following advantages compared to alternative strategies: (1) patient plasma/serum DNA can be amplified to increase assay sensitivity; (2) abnormal DNAme is a stable tumor-specific marker occurring early in carcinogenesis and is conserved throughout disease progression [22]; (3) selection of CpG island hypermethylation simplifies assay design; and (4) DNAme over several linked CpGs constitutes a clearly detectable signal with a higher specificity (because a pattern spanning several CpGs is less sensitive to sequencing errors).

A key limitation of any current large-scale population-based cell-free DNA study, such as ours, is the lack of high-quality samples. This was evident in both the SUCCESS and UKCTOCS samples, where the blood samples were not processed until 24–96 h after the blood was drawn and hence contained large amounts of leaked WBC DNA. In healthy individuals, cell-free DNA is normally present at concentrations in the range of 0–100 ng/mL and at an average of 30 ng/mL [37]. DNA derived from tumor cells is also shorter than that from non-malignant cells in the plasma of cancer patients and typically 166 bp long [38]. Blood tubes which stabilize cell-free DNA and prevent leakage of WBC DNA are now available [39] and will be used for any future studies.

The leaked DNA in these serum samples will no doubt have led to a preferential amplification of non-cancer DNA. Despite these complicating factors, EFC#93 serum DNAme, before treatment, was a strong prognostic factor and was complementary to CTCs. Some previous studies on CTCs used a cut-off value of > 5 cells/mL; this may certainly be valid and useful for metastatic BC patients. In the SUCCESS setting of primary BC patients, only 8/419 patients (1.9%) had > 5 CTCs/mL. Had we taken this CTC cut-off, the relapse-free survival HR would have been 4.8 with a relatively wide 95% CI of 1.5–15.5 (P = 0.009). Hence, the chosen threshold that we pre-specified in previous work [10] (i.e. CTCs detectable or not) is completely justified in this primary cancer setting.

For the current genetic cell-free DNA markers the detection limit is in the range of 0.1% allele frequency (i.e. 1 mutated in the background of 1000 non-mutated alleles can be detected [15, 21]). Ultra-high coverage bisulfite-sequencing, however, allows for far more sensitive testing. Mammography screening in women aged 50–75 years has a sensitivity of 82–86% and a specificity of 88–92% for detecting any BC; however, the majority of these cancers are not fatal [40]. EFC#93 serum DNAme has a sensitivity of 43% in identifying fatal breast cancer up to six months in advance of current diagnosis at a similar specificity (88%) to mammography, supporting the rationale for incorporating serum DNAme markers in future cancer-screening trials.
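
Why very low pattern frequencies call for ultra-high coverage can be illustrated with a simple sampling model. The sketch assumes that reads are independent draws and that each read carries the cancer pattern with probability equal to the pattern frequency; it ignores sequencing errors, PCR bias, and the limited number of template molecules in 1 mL of serum, so it is a lower bound on the effort required, not a description of the assay.

```python
import math


def detection_probability(frequency, depth):
    """Probability of observing at least one pattern-carrying read.

    Binomial sampling assumption: each of `depth` reads independently
    carries the pattern with probability `frequency`.
    """
    return 1.0 - (1.0 - frequency) ** depth


def reads_for_detection(frequency, probability=0.95):
    """Smallest read depth giving at least `probability` of one pattern read."""
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - frequency))


# 0.001 is the 0.1% allele-frequency limit quoted for mutation-based markers;
# 0.00008 is the EFC#93 pattern-frequency threshold used in the trial samples.
for freq in (0.001, 0.00008):
    depth = reads_for_detection(freq)
    print(f"frequency {freq}: {depth} reads for a 95% chance of at least one hit")
```

Under these assumptions the depth needed scales roughly inversely with the frequency to be detected, so a threshold more than ten times below 0.1% needs more than ten times as many reads per region.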

Based on the evidence accumulated so far, we have to assume that EFC#93 indicates the presence of disseminated breast cancer, which, at least in a proportion of women, will not yet be clinically evident in the breast. Hence, the question arises whether EFC#93-positive mammography-negative women should watch and wait (i.e. within an enhanced surveillance program) or whether this group of women could also be offered a strategy which actively deals with the likely disseminated disease until radiological evidence in the breast starts to arise. Anti-hormonal treatments (i.e. tamoxifen or aromatase inhibitors) are used for both adjuvant and preventive treatment. Therefore, we assessed whether EFC#93 positivity after SUCCESS chemotherapy (which is before the initiation of anti-hormonal treatment) is associated with survival: EFC#93 positivity in post-chemotherapy samples of hormone receptor-negative women still indicated a poor prognosis, whereas EFC#93 positivity in hormone receptor-positive women was no longer associated with poor prognosis (Additional file 1: Figure S14). CTC status in post-chemotherapy samples was not associated with outcome irrespective of subsequent anti-hormonal treatment (Additional file 1: Figure S15).

Conclusions

Overall and for the first time, our study provides evidence that serum DNAme markers can diagnose fatal BCs up to one year in advance of current diagnosis and enable individualized BC treatment which may even commence before obtaining radiological evidence in the breast. In addition, the combination of CTC and cell-free DNA analysis might further improve risk stratification of breast cancer patients. The recent availability of purpose-designed blood tubes that stabilize cell-free DNA will facilitate the implementation of DNAme pattern detection in cell-free DNA as a clinical tool in cancer medicine.