Background

Lymph node metastasis (LNM) is one of the most important factors affecting the prognosis of breast cancer [1]. Accurate assessment of lymph node status can predict patients’ outcomes and guide the choice of treatment [2]. Axillary lymph node dissection (ALND) is the gold standard for evaluating axillary lymph node (ALN) status, but it causes considerable harm to patients. Although the less invasive sentinel lymph node biopsy (SLNB) has become routine, it is still a surgical procedure that carries risk, adds considerable anesthesia time and expense, and causes multiple complications in 3.5–10.9% of patients [3, 4]. Therefore, a low-cost, non-invasive method for evaluating ALN status would be of great benefit to breast cancer patients.

Cell-free DNA (cfDNA) has become an essential biomarker in many cancer applications, such as early detection and outcome prediction [5]. At present, the most commonly used features are the cfDNA level and its sequence information. Previous studies have described a close relationship between abnormal cfDNA levels and ALN metastasis [6, 7], which indicates that cfDNA may be used to assess ALN status. However, cfDNA levels are affected by many pathological processes, such as infection and inflammation [8,9,10]. In addition, some studies have looked for circulating tumor DNA (ctDNA) mutations or ctDNA hypermethylation related to ALN metastasis [1, 11,12,13]; however, no relationship was found between them [2]. Thus, novel disease-specific cfDNA features with high predictive efficacy are needed for predicting LNM.

Recently, cfDNA coverage at gene promoters has been found to carry gene expression information of the tissues of origin [14, 15]. Plasma cfDNA is mainly released by apoptotic cells after enzymatic processing of chromatin [16]. The DNA bound to the nucleosomes is retained, while the exposed DNA between the nucleosomes is digested. Analysis of cfDNA fragments derived from cancers showed that the promoter regions of active genes exhibited depleted coverage, which indicates lower nucleosome occupancy in these regions along with increased gene expression [15]. In cancer patients, cfDNA is mainly derived from tumor and hematopoietic cells [16]. More importantly, studies on breast cancer have shown that many gene expression signatures can be used to estimate the risk of distant relapse, and some of them, such as PAM50, have been commercialized. In addition, immune cells have been shown to play an important role in tumor metastasis, and the peripheral blood immunome of breast cancer patients is influenced by the presence and stage of cancer [17, 18]. Therefore, we hypothesized that cfDNA coverage at gene promoters has the potential to assess ALN status.

In this study, we first compared the nucleosome footprint around the transcriptional start sites (TSS) of ALN-positive and ALN-negative breast cancer patients to identify genes with differential coverage. In order to further evaluate the potential of promoter profiling for evaluating ALN status, we developed a classifier for distinguishing ALN-positive and ALN-negative patients by using multiple machine learning models. Finally, we incorporated some clinicopathological characteristics in our classifier to test whether its performance would improve.

Methods

Participants and study design

From January 2018 to December 2019, plasma samples were prospectively collected before cancer therapy from 330 breast cancer patients, including 162 ALN-positive and 168 ALN-negative patients. We excluded patients who: (1) were pregnant or lactating, (2) had metastatic breast cancer or histologically non-infiltrating tumors, (3) had hematopoietic system diseases or inflammatory breast diseases, or (4) were classified as ALN-negative by fine needle aspiration biopsy. We reviewed all tumor specimens histopathologically and staged them according to the seventh edition of the American Joint Committee on Cancer (AJCC) staging system for breast cancer. All plasma samples were obtained under protocols approved by the institutional review board of The First People's Hospital of Foshan (ID: L[2021]-7), with written informed consent from all participants for research use. Table 1 summarizes the characteristics of the patients, including age, T stage, estrogen receptor (ER) and progesterone receptor (PR) status, expression of human epidermal growth factor receptor 2 (Her2), proliferative fraction (Ki-67 labeling index), and histological grade.

Table 1 Patient characteristics

ALN surgery

The ALN status was ascertained clinically by fine needle aspiration biopsy, ALND or SLNB. Because fine needle aspiration biopsy samples only a limited number of lymph nodes, some positive lymph nodes may be missed, which may increase the false positive rate of the evaluation model. Therefore, patients classified as ALN-negative by fine needle aspiration biopsy were excluded from this study. Indocyanine green with a carbon nanoparticle suspension was used for SLNB, and more than three LNs were examined for cancer.

Extracting and sequencing cfDNA

In total, 1 mL of peripheral blood was collected from each patient in EDTA tubes and immediately subjected to two-step centrifugation to obtain plasma. The samples were centrifuged at 1600g for 10 min, followed by 16,000g for 10 min at 4 °C. The plasma was then stored at − 80 °C until use. Each sample yielded at least 1 ng of total cfDNA for sequencing. cfDNA was extracted from plasma with the QIAamp DNA Blood Mini Kit (Qiagen). A starting amount of approximately 1–5 ng DNA was used for library construction with the Life Sciences Ion Xpress™ Plus Fragment Library Kit. The number of PCR cycles was set to 12. The DNA size distribution of the libraries was analyzed on a Bioanalyzer instrument (Agilent Technologies, Singapore). Sequencing was performed with the Ion PI™ Hi-Q™ OT2 200 Kit and the Ion PI™ Hi-Q™ Sequencing 200 Kit on the Ion Proton platform (ThermoFisher Scientific, USA) with 520 flows. The mean sequencing depth of the samples was approximately 0.3×.

Sequencing data processing

After sequencing, the raw reads were aligned to the human reference genome (hg19) using bwa (ver. 0.7.5). Then, the SAMtools rmdup function (ver. 0.1.18) was used to remove polymerase chain reaction duplicates [19]. GC-bias correction was performed using deepTools (ver. 3.5.0) with the default settings. Tumor fraction calculation and copy number-bias correction were performed using the ichorCNA algorithm [20].

Promoter profiling calculation

The calculation of promoter profiling was similar to that used in our previous studies [15, 21]. Briefly, gene information was downloaded from the RefSeq annotation of the University of California Santa Cruz [22]. The region ranging from − 1 KB to + 1 KB around the transcriptional start site of each transcript, defined as the primary transcription start site (pTSS) region, was first identified. The read counts for each base at the pTSS were calculated using DANPOS with the default settings [23]. After read alignment, the read coverage at the pTSS was extracted from the aligned BAM files using bedtools (ver. 2.17.0). Then, the read coverage was normalized by a reads per kilobase per million mapped reads (RPKM)-like method. The normalized value of promoter profiling was calculated by the following formula:

$$\mathrm{Normalized \, promoter \, profiling}=\frac{\mathrm{cfDNA \, coverage \, around \, TSS}\times \mathrm{1,000,000}}{\mathrm{Total \, mapped \, reads}\times \mathrm{length}},$$

where the length is 2000 for every transcript, because the pTSS region spans − 1 KB to + 1 KB around each transcriptional start site.
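As an illustration of the normalization above, the following minimal Python sketch evaluates the formula for a single pTSS window. The function name, argument names, and the example numbers are assumptions introduced here for illustration; they are not taken from the study's pipeline or data.

```python
def normalized_promoter_profiling(tss_coverage, total_mapped_reads, region_length=2000):
    """RPKM-like normalization of the read coverage in one pTSS window.

    tss_coverage: cfDNA read coverage within -1 KB to +1 KB of the TSS
    total_mapped_reads: total number of mapped reads in the sample
    region_length: window length in bases (2000 in the text)
    """
    if total_mapped_reads <= 0 or region_length <= 0:
        raise ValueError("total_mapped_reads and region_length must be positive")
    return tss_coverage * 1_000_000 / (total_mapped_reads * region_length)


# Hypothetical numbers, for illustration only (not data from the study).
example_value = normalized_promoter_profiling(tss_coverage=12, total_mapped_reads=6_000_000)
print(example_value)
```

Because the window length is the same for every transcript, the normalization mainly removes differences in sequencing yield between samples.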

Models for evaluating lymph node status

To develop the classifier, the patients were first divided into three cohorts: discovery, training and validation. In the discovery cohort, we identified the genes with differential promoter coverage. The plasma samples were then divided into training and validation cohorts in a ratio of 7:3. Based on the training cohort data, we developed classifiers using three models, namely support vector machine (SVM), logistic regression (LR), and linear discriminant analysis (LDA), to distinguish ALN-positive from ALN-negative tumors. The importance of the features was assessed with the sigFeature package in R, and the top 100 features were selected for classifier construction. The SVM classifier was constructed with the linear kernel in the e1071 package using the default settings. To identify the optimal gene combination with the largest area under the curve (AUC), a backward selection method was adopted. To avoid potential bias and over-fitting in the training cohort, leave-one-out cross-validation (LOOCV) was used to evaluate the robustness of these classifiers. Briefly, each subject in the training cohort was withheld in turn, and the remaining subjects were used to train the model. The trained model was then used to determine the class of the withheld subject. This procedure continued until all subjects in the training cohort had been classified. Finally, the efficacy of the selected classifiers was evaluated using the validation cohort data.
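The leave-one-out procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the study used a linear-kernel SVM from the e1071 package in R, whereas this sketch substitutes a simple nearest-centroid classifier so that it runs with the standard library alone. All function names and the toy data are hypothetical.

```python
def leave_one_out_predictions(samples, labels, fit, predict):
    """Withhold each sample in turn, train on the rest, classify the withheld one."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


def fit_centroids(train_x, train_y):
    """Stand-in classifier (NOT the paper's SVM): one mean vector per class."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict_centroid(centroids, x):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(x, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Toy promoter-coverage vectors (hypothetical values, two features per sample).
toy_x = [[0.9, 1.1], [1.0, 1.2], [1.1, 0.9], [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
toy_y = ["neg", "neg", "neg", "pos", "pos", "pos"]

loo_pred = leave_one_out_predictions(toy_x, toy_y, fit_centroids, predict_centroid)
loo_accuracy = sum(p == t for p, t in zip(loo_pred, toy_y)) / len(toy_y)
print(loo_pred, loo_accuracy)
```

Passing `fit` and `predict` as arguments keeps the cross-validation loop independent of the model, so the same loop could wrap an SVM, LR, or LDA implementation.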

Statistical analysis

The Wilcoxon rank-sum test or the Chi-square test was used to compare the two groups. The Benjamini–Hochberg method was used to adjust the raw P-values to control the false discovery rate (FDR). Variables with fold change ≥ 1.5 and FDR ≤ 0.05 were considered statistically significant. The genes with differential promoter coverage were used to plot the uniform manifold approximation and projection (UMAP) and the heat map using the uwot and pheatmap packages in R (version 3.0.1), respectively. Receiver operating characteristic (ROC) curves were plotted and differences in the AUC were compared using the pROC package [24]. GO enrichment analysis was performed using Metascape with the default settings [25]. Housekeeping genes and non-constitutive genes were downloaded from the additional material of a previous study [14].
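For readers unfamiliar with the Benjamini–Hochberg adjustment, the following Python sketch shows how raw P-values can be converted to FDR-adjusted values and then combined with the fold-change threshold. It is an illustration, not the study's code. The function names and example P-values are hypothetical, and the symmetric treatment of fold change (a ratio of 1/1.5 or lower counted as a decrease) is an assumption, since the text states only "fold change ≥ 1.5".

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value to the smallest so that the adjusted
    # values never increase as the raw P-values decrease.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def is_differential(fold_change, fdr, fc_cutoff=1.5, fdr_cutoff=0.05):
    """Thresholds from the text; treating decreases symmetrically is an assumption."""
    if fold_change <= 0:
        raise ValueError("fold_change must be positive")
    magnitude = max(fold_change, 1.0 / fold_change)
    return magnitude >= fc_cutoff and fdr <= fdr_cutoff


# Hypothetical raw P-values for four genes (not data from the study).
raw_p = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(raw_p)
print(fdr)
```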

Results

cfDNA promoter profiling related to tumor expression profiles

To test whether the promoter profiling of cfDNA could be used to predict ALN metastasis, we first studied whether the coverage of gene promoter regions (± 1 KB around TSS) was related to gene expression profiles (Fig. 1). Consistent with previous studies [14], the promoter coverage of housekeeping genes with high expression levels was significantly reduced compared with that of non-constitutive genes (Fig. 3a). Then, we studied whether the footprint of nucleosomes around the TSS differed between the ALN-positive and ALN-negative groups. In ALN-positive breast cancer patients, we observed the loss of related cfDNA signals (Fig. 3b; P = 2.2e−16, Wilcoxon rank sum test).

Fig. 1

Schematic diagram of PPCNM. In cancer, plasma cell-free DNA (cfDNA) is primarily derived from apoptotic tumor and hematopoietic cells. Exposed DNA not bound to a nucleosome is digested, whereas nucleosome-bound DNA escapes digestion and enters the circulation. cfDNA has a nucleosome footprint, which carries information about its original tissues and could reflect their gene expression status. Because axillary lymph node (ALN)-positive and ALN-negative breast cancer patients have different gene expression signatures in tumor and hematopoietic cells, their nucleosome patterns may differ. Therefore, we assume that the promoter coverage of cfDNA detected by whole-genome sequencing could be used to develop classifiers for predicting lymph node metastasis

Genes with differential promoter coverage associated with LNM

The workflow of our study mainly consisted of three stages, including discovery, training and validation stages (Fig. 2). In the discovery cohort, we identified the genes with differential promoter coverage. When comparing the promoter profiling of each gene, we observed 1,071 genes with differential promoter coverage between ALN-positive and ALN-negative patients (Fig. 3e and Additional file 1: Table S1; fold change ≥ 1.5 and FDR ≤ 0.05, Wilcoxon rank sum test). Then, using UMAP, we found that samples from the same groups were clustered together, while the samples from different groups were scattered (Fig. 3c). In addition, the heat map showed distinct patterns of promoter coverage between ALN-positive and ALN-negative breast cancer patients (Fig. 3d). These results indicated that promoter profiling has potential for assessing the ALN status of breast cancer.

Fig. 2

Study design. To develop classifiers to predict ALN status, our study was divided into three stages: discovery, training and validation. In the discovery stage, the genes with differential coverage were identified. In the training stage, different machine learning models were used to develop classifiers from the differential features. The importance of the features was assessed with the sigFeature package in R, and the top 100 features were selected for classifier construction. To identify the optimal gene combination with the largest area under the curve (AUC), a backward selection method was adopted. Finally, the classifiers with the largest AUC were selected. In the validation stage, the predictive efficacy of the selected classifiers was assessed using an internal validation cohort. The detailed characteristics of the breast cancer patients are shown in Table 1. WGS whole genome sequencing, ALN axillary lymph node, TSS transcriptional start site, SVM support vector machine, LR logistic regression, LDA linear discriminant analysis, LOOCV leave one out cross validation

Fig. 3

The cfDNA promoter profiling shows the potential to predict ALN status. a Promoter profiling of the non-constitutive and housekeeping genes. The average promoter coverage was calculated from the whole genome sequencing data of 30 breast cancer patients. The non-constitutive and housekeeping genes were obtained from the additional materials of a previous study [14]. b Promoter profiling of the ALN-negative and -positive breast cancer patients. Mean promoter profiling of protein coding genes derived from 15 ALN-positive and 15 ALN-negative breast cancer patients was detected using whole genome sequencing. c Uniform manifold approximation and projection (UMAP) plot representing the associations between ALN-positive and -negative groups. The genes with differential promoter coverage were used to plot the UMAP. d Heat map of the z-scores of genes with differential read coverage. e Volcano plots of gene transcripts with differential read coverage at the promoter (fold change ≥ 1.5 and false discovery rate [FDR] ≤ 0.05) between 15 ALN-positive and 15 ALN-negative patients. f Analysis of Gene Ontology (GO) enrichment of genes with differential promoter coverage. TSS transcriptional start sites, Decreased genes with differentially decreased read coverage, Increased genes with differentially increased read coverage, Non genes with no differential read coverage, ALN axillary lymph node

GO enrichment analysis of the genes with differential promoter coverage showed that most of the enriched GO terms were immune-associated and growth-associated processes (Fig. 3f). This is consistent with the existing literature showing that cfDNA could reflect the expression status of its original tissues. As the expression of tumor and peripheral blood immunome was closely related to cancer stage [18], these annotation results suggest that the genes with differential promoter coverage may be associated with ALN involvement.

Classifiers for evaluating ALN status

To evaluate the potential of promoter profiling for assessing ALN status, we used WGS to characterize the promoter profiling of cfDNA derived from 330 breast cancer patients collected from January 2018 to December 2019, including 162 ALN-positive and 168 ALN-negative patients. The patients were split into training and validation cohorts in a 7:3 ratio, and the clinicopathological parameters, such as age, T stage, ER, PR, and Her2 status, were well balanced between the two cohorts of breast cancer patients (Table 1; all P > 0.05).

Then, we used the genes with differential promoter coverage in the SVM model to develop classifiers to distinguish ALN-positive from ALN-negative patients. ROC analysis was used to evaluate the AUC, sensitivity, specificity and accuracy of the promoter profiling classifiers (Fig. 4a). Among these combinations, a 48-gene combination named PPCNM performed well in the training cohort after LOOCV, with an AUC of 0.936 (95% confidence interval [CI] 0.904–0.967) and an accuracy of 0.848 (Fig. 4a and Additional file 1: Table S2). The performance of PPCNM was further evaluated in the validation cohort, where its AUC was 0.808 (0.730–0.887) (Fig. 4b). These results indicated that a classifier based on promoter profiling can be used to assess ALN status.

Fig. 4

Receiver operating characteristic (ROC) curves of PPCNM. a Support vector machine (SVM). b Logistic regression (LR). c Linear discriminant analysis (LDA). Acc accuracy, Sen sensitivity, Spe specificity, P the P value of the AUC comparison between SVM vs. LR and SVM vs. LDA calculated by the pROC package in R. The ROC curves show the AUC of the best combination after cross-validation and therefore lack the ‘arc’ shape

Across all cohorts, the average AUC of PPCNM for distinguishing ALN-positive from ALN-negative patients was 0.897 (0.865–0.930), with a sensitivity of 0.914 and a specificity of 0.881 (Fig. 4c). The AUC produced by PPCNM was significantly greater than those of the classifiers based on the LR and LDA models (Fig. 4c, LR: 0.829 [0.789–0.870], P = 8.37E−04 and LDA: 0.757 [0.711–0.803], P = 4.03E−10).
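To make the reported metrics concrete, the sketch below computes an AUC, sensitivity, and specificity from classifier scores in plain Python. It is an illustration only: the study used the pROC package in R, whereas this sketch uses the pairwise-comparison definition of the AUC (the probability that a positive sample scores higher than a negative one). The function names, the decision threshold, and the toy scores are assumptions made for this example.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def sensitivity_specificity(scores, labels, threshold):
    """Call a sample positive when its score is at or above the threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical classifier scores; 1 = ALN-positive, 0 = ALN-negative.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]

toy_auc = roc_auc(toy_scores, toy_labels)
toy_sen, toy_spe = sensitivity_specificity(toy_scores, toy_labels, threshold=0.5)
print(toy_auc, toy_sen, toy_spe)
```

The AUC does not depend on a threshold, whereas sensitivity and specificity do, which is why the two kinds of metric are computed by separate functions.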

PPCNM and tumor DNA fraction

The tumor DNA fraction is one of the most important characteristics of a tumor. First, we calculated the tumor DNA fraction of ALN-positive and ALN-negative patients, and found that its levels were similar between the two groups (Additional file 1: Fig. S1; P-value = 0.1663). In addition, we found that the efficacy of PPCNM was similar across different levels of tumor DNA fraction (all P-value > 0.5; Additional file 1: Table S3). The AUC of tumor DNA fraction alone for predicting ALN status was 0.544 (0.482–0.606). The combination of PPCNM with tumor DNA fraction had an AUC of 0.845 (0.806–0.885), which was significantly lower than that of PPCNM alone (P = 2.8E−04).

PPCNM combined with clinicopathological characteristics

Previous studies have shown a close relationship between ER, PR, Her2, and Ki67 status and ALN metastasis [26, 27]. Therefore, we first investigated whether the efficacy of our classifier differed between the positive and negative status of each feature. The efficacy of the PPCNM model was similar for ER-positive vs. ER-negative, PR-positive vs. PR-negative, Her2-positive vs. Her2-negative, and Ki67-High vs. Ki67-Low patients (Fig. 5a–d). We then combined these clinical characteristics with PPCNM to see whether its performance would further improve. By evaluating the efficacy of all possible combinations with PPCNM, we found that the AUC, accuracy, and sensitivity of PPCNM decreased after it was combined with any one of the four clinical features (Fig. 5e, f and Additional file 1: Table S4).

Fig. 5

Performance of classifiers. a ROC curve for (ER/PR/Her2/Ki67)-positive and (ER/PR/Her2/Ki67)-negative groups. b Performance of the best combinations of PPCNM with different number of clinical features. c ROC curve for the best combinations of PPCNM with different number of clinical features. AUC area under curve, SVM support vector machine, LR logistic regression

Discussion

We found a significant difference in promoter profiling between ALN-positive and ALN-negative breast cancer patients (Fig. 3). The PPCNM classifier, based on promoter profiling and the SVM model, produced the largest AUC (0.897 [0.865–0.930]) for distinguishing these two groups of patients, and its performance was significantly better than that of the classifiers based on the LR and LDA models (Fig. 4c; all P < 0.05). In addition, incorporating clinical characteristics did not further improve the AUC. These findings indicate that PPCNM may be a promising non-invasive tool for evaluating ALN status.

PPCNM comprises 48 genes (Additional file 1: Table S2), which are closely associated with tumor metastasis. For instance, a large number of studies have reported a close relationship between the NF-κB signaling pathway and tumor metastasis [28, 29]. The NF-κB signaling pathway regulates the expression of its downstream target genes, including MMP9, TNFα, uPA and IL8, thus promoting the invasion and metastasis of breast cancer cells [29]. In addition, BHLHE40 confers a pro-survival and pro-metastatic phenotype on breast cancer cells by modulating HBEGF secretion [30], and it facilitates cancer cell invasion by interacting with SP1 [31]. Furthermore, USP20 can promote breast cancer metastasis by stabilizing SNAI2 [32].

ALN status is an essential factor for the prognosis of breast cancer patients and the choice of treatment [2]. Although the less invasive SLNB has become more widely used, LN surgery for evaluating ALN status still causes various side effects. Therefore, a non-invasive method for predicting ALN status may benefit breast cancer patients. Some studies have shown that increased cfDNA levels are related to ALN metastasis [6, 7]. However, cfDNA levels are affected by various physiological and pathological processes [8,9,10], so more specific features of cfDNA are needed for assessing ALN status. Previous studies have reported that cell-free DNA promoter profiling and TF profiling can predict tumor subtypes in prostate cancer and detect early-stage colorectal cancer [14, 33]. Therefore, we assumed that promoter profiling could be used to evaluate ALN status. In this study, we identified specific promoter profile signatures of cfDNA in ALN-positive and ALN-negative patients (Fig. 3e). The classifier (PPCNM) based on these differential variables achieved high performance, with an AUC of 0.897 [0.865–0.930]. We developed a non-invasive method based on plasma cfDNA to assess ALN status, which could dynamically monitor lymph node status. More importantly, our method could avoid the effect of tumor heterogeneity on tissue-based detection. Nevertheless, our research has some limitations. Although our classifier achieved an AUC of 0.897 and WGS data from 330 patients were used in this study, more prospective samples and samples from external centers are needed to confirm its predictive value before clinical application.

In summary, our data suggest that PPCNM is a promising tool based on promoter profiling for evaluating ALN status in breast cancer. PPCNM is a non-invasive technique that requires only low-coverage DNA sequencing and is not affected by cancer heterogeneity. Therefore, the PPCNM classifier may help patients and clinicians to choose appropriate cancer treatments, thus improving the curative effect and the quality of life of cancer patients.