Introduction

Breast cancer (BC) is the most common malignancy and the leading cause of cancer-associated mortality among women worldwide [1]. It accounts for one in four cancer cases and one in six cancer deaths among women, and ranks first for incidence in the vast majority of countries [1]. Approximately 10–20% of all BC patients have a family history of cancer, with multiple family members affected across generations [2]. Germline mutations in specific genes such as BRCA1, BRCA2, CDH1, PALB2, PTEN and TP53 confer an increased risk of developing BC [3].

Recent advances in next-generation sequencing have led to reduced costs for multigene panel testing of cancer predisposition genes for individuals referred for genetic testing, resulting in a higher uptake of testing. However, it is estimated that pathogenic variants in known cancer predisposition genes only account for around 25% of hereditary BC cases [4, 5].

Whole-exome sequencing (WES) is revolutionizing our ability to identify novel genetic variants associated with cancer predisposition. To date, multiple candidate BC predisposition genes have been identified by WES, predominantly from studies on women of European ancestry [6, 7].

Here, we aimed to identify novel candidate BC predisposition genes and variants by performing WES on germline DNA from Asian BC patients who were referred for cancer genetic risk assessment and were BRCA1/2-negative. Pathogenic variants identified from WES were filtered and prioritized using in silico bioinformatic tools, followed by case–control analysis; only significant variants in known cancer genes were selected for further analysis. Notably, we identified pathogenic variants in our cases whose frequencies differed significantly from those in the Genome Aggregation Database (gnomAD) East-Asian (EAS) controls and Singaporean controls [8].

Results

Demographics and clinical information on the study population

Information on the demographics, age at diagnosis, ethnicity, family history, and clinicopathological characteristics of the 290 BC cases is provided in Table 1. The study population consisted only of females, and a large proportion were Chinese (69.3%). The age at first cancer diagnosis ranged from 19 to 75 years, with a mean and median age of 37.5 and 37 years, respectively. Of 290 patients, 65 (22.4%) presented with a family history (including first-degree, second-degree, and third-degree relatives) of BC, 23 (7.9%) with a family history of other cancers and 218 (75.2%) with no family history of breast or any other cancers (Table 1, Additional file 1: Fig. S1). Of the 290 BC cases, 225 patients (77.6%) had early-onset breast cancer (≤ 40 years).

Table 1 Demographics, clinical characteristics, and family history of patients

Filtering of candidate variants

Whole-exome sequencing of 290 BC patients revealed 1,196,466 variants before filtering. Among these, 1,101,796 (92.1%) passed Dynamic Read Analysis for GENomics (DRAGEN) quality-control checks. Further filtering retained functional variants with a gnomAD (EAS) minor allele frequency (MAF) of less than 1%, predicted pathogenic variants with a scaled Combined Annotation-Dependent Depletion (CADD) score greater than 20, and variants in the known or predicted cancer gene lists of the Network of Cancer Genes (NCG) database. This left only 2,496 variants (0.2% of the total; Fig. 1).

Fig. 1

Study design for the selection of variants and genes. (a) List of known or candidate cancer genes in the Network of Cancer Genes [9]. (b) The Cancer Gene Census list of the Catalogue of Somatic Mutations in Cancer (COSMIC) [10]. (c) List of cancer driver genes from Bailey et al. [38]. (d) List of cancer driver genes inferred with nucleotide context from Dietlein et al. [11]

The genes harboring our shortlisted variants were further prioritized using cancer gene databases: the Catalogue of Somatic Mutations in Cancer (COSMIC) [10], cancer driver genes based on nucleotide context [11], and computationally discovered and experimentally validated cancer driver genes [38] (Additional file 4: Table S1). Finally, we shortlisted only variants that were present in three or more patients. All variants were checked with IGV (Additional files 2, 3: Figs. S2, S3).

Identification of pathogenic germline variants

In total, we discovered 49 prioritized variants in 37 prioritized genes across 134 patients (Fig. 2; Additional file 4: Table S2). Most of these variants were nonsynonymous single nucleotide variants (SNVs) (42 variants, or 85.7%), with one frameshift insertion (2.0%), three frameshift deletions (6.1%), and three stop-gains (6.1%). Frameshift insertions, frameshift deletions, and stop-gains were prioritized regardless of their CADD score.

Fig. 2

Oncoplot of variants in prioritized candidate genes, showing the type and frequency of each variant. Each row represents a gene and each column represents one case. The annotation rows at the bottom show the age at diagnosis (diag), family history (FH) status for breast cancer (BC) and ovarian cancer (OC), and ethnicity for each case

All 42 nonsynonymous SNVs had CADD scores greater than 20. The remaining seven variants, which were not nonsynonymous SNVs, also had CADD scores greater than 20, except for a frameshift deletion in HLA-A. Thirty variants (61.2%) were classified as variants of uncertain significance (VUS), two stop-gain variants in KMT2C (4.1%) were classified as pathogenic, and the remaining variants were benign (14 variants, or 28.6%) or likely benign (3 variants, or 6.1%) (Table 2).

Table 2 Predicted pathogenicity and classifications from databases for 49 selected variants in 37 genes

Case–control analysis of the Singapore cases

Case–control analysis was performed for the 49 selected variants, comparing our Singaporean cases against the gnomAD (EAS) and SG10K_Health control cohorts (Table 3). Apart from the two variants in BRD7 and NBEA that were not reported in gnomAD (EAS), all of the remaining 47 variants were significantly enriched in our cohort compared with gnomAD (EAS). In the SG10K_Health control cohort, seven of our 49 selected variants were absent: the aforementioned variants in BRD7 and NBEA, and additional variants in KMT2C, GPRIN2, H3F3A, and MAF. Of the remaining 42 variants that were found in SG10K_Health, 13 were significantly enriched at α = 0.05 in our cohort versus SG10K_Health (Table 3).

Table 3 Allele frequencies and case–control association analysis of 49 variants in 37 selected candidate genes

Case–control analysis using a breast cancer case cohort from dbGaP

Case–control analysis for the 49 germline variants identified from our Singapore breast cancer cohort was repeated using a case cohort from dbGaP (phs000822.v1.p1) against the same control cohorts (Table 3). Only 34 of our 49 variants were found in phs000822.v1.p1. Of these 34 variants, 26 were significantly enriched in phs000822.v1.p1 when compared against gnomAD (EAS), while eight did not reach statistical significance. Next, comparison of the 34 variants against SG10K_Health found that 26 were significantly enriched in phs000822.v1.p1, four were unreported in SG10K_Health, and four did not reach significance. These two sets of comparisons were generally concordant, as 23 of the 26 variants significantly enriched in phs000822.v1.p1 versus gnomAD (EAS) were also significantly enriched in the comparison against SG10K_Health (Table 3). Altogether, 14 variants were significantly enriched in cases, or missing in the control cohorts, across all four sets of case–control comparisons. These variants were found in 89 of the 290 breast cancer patients (30.7%), and 24 of these 89 cases carried more than one of these variants (Additional file 4: Table S3).
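The selection of variants supported across all four case–control comparisons can be expressed as a simple filter. The Python sketch below is illustrative only: the comparison labels, status strings, and function name are hypothetical and are not taken from the study's analysis code. It assumes that each variant has already been assigned one status per comparison, and it treats a missing status (for example, a variant not observed in a case cohort) as non-supporting.

```python
# Hypothetical labels for the four case-control comparisons
# (two case cohorts x two control cohorts).
COMPARISONS = (
    "SG_cases_vs_gnomAD_EAS",
    "SG_cases_vs_SG10K_Health",
    "dbGaP_cases_vs_gnomAD_EAS",
    "dbGaP_cases_vs_SG10K_Health",
)

# Statuses counted as support: significantly enriched in cases,
# or not reported in the control cohort at all.
SUPPORTING = {"enriched", "absent_in_controls"}


def consistently_supported(status_by_variant):
    """Return the variants supported in every one of the four comparisons.

    status_by_variant maps a variant identifier to a dict of
    {comparison label: status string}. A comparison with no recorded
    status is treated as non-supporting.
    """
    selected = []
    for variant, statuses in status_by_variant.items():
        if all(statuses.get(c) in SUPPORTING for c in COMPARISONS):
            selected.append(variant)
    return sorted(selected)
```

Under these assumptions, a variant that is absent from one control cohort and significantly enriched against the other would still be retained, which mirrors the "significantly enriched in cases, or missing in the control cohorts" criterion described above.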

Variant validation by Sanger sequencing

Four of the 14 significant variants were excluded from Sanger sequencing validation because they lie in highly repetitive regions (KMT2C, MUC4, and MAF) or highly polymorphic regions (HLA-DRB1). Seven of the remaining 10 variants, namely GPRIN2 c.983G, NRG1 c.G172A, MYO5A c.A3960T, CLIP1 c.C80T, CUX1 c.C3317T, GNAS c.A266G and MGA c.C1883A, were confirmed by Sanger sequencing. However, the variants in TPTE2, NBEA, and BRD7 could not be validated by Sanger sequencing, suggesting that they were likely false positives (Fig. 3).

Fig. 3

Sanger sequencing validation of variants identified by whole-exome sequencing. Representative sequencing chromatograms showing the different variants found in our breast cancer patients and in an unaffected control. (A) Seven variants were confirmed by Sanger sequencing. (B) Three variants failed to be validated by Sanger sequencing. Arrows indicate the position of the variant

Discussion

Here, we report the largest WES study on germline DNA from Asian breast cancer patients who had undergone cancer risk assessment and were BRCA1 and BRCA2 mutation-negative. Our approach was to select only pathogenic variants that showed a statistically significant difference in frequency against gnomAD East-Asian controls and Singapore controls. This was followed by an additional prioritization step of selecting only variants occurring in well-documented cancer genes, such as those listed in COSMIC, NCG and cancer driver gene databases [9,10,11].

In total, we identified 49 rare pathogenic germline variants in 37 genes that were significantly enriched in breast cancer patients. These were all predicted to be pathogenic using in silico tools, and all had a minor allele frequency of less than 1% or were unreported in gnomAD (EAS). We further validated these results with an independent US-based case cohort of 466 early-onset breast cancer patients obtained from dbGaP. Across four sets of comparisons involving two case cohorts and two control cohorts, 14 variants were consistently enriched in breast cancer cases (Table 3).

Of these 14 variants, seven variants in GPRIN2, NRG1, MYO5A, CLIP1, CUX1, GNAS, and MGA were confirmed by Sanger sequencing. To the best of our knowledge, the specific germline variants identified here have not been reported in any cancer-related studies thus far. However, the functions of the respective genes have been implicated in many cancer types [12,13,14,15,16,17]. The NRG1 nonsynonymous SNV (rs113317778) lies in an immunoglobulin-like domain, while the affected residues in GPRIN2 (rs4445576), CUX1 (rs782176246), GNAS (rs563844600), and MGA (rs61736074) are located within disordered protein regions, which lack a stable tertiary structure and adopt different structural conformations [18,19,20]. Interestingly, a computational study predicted that the mutation in GPRIN2 (p.S328C) generates new microstructural elements in the disordered region and may disrupt protein functions or protein–protein interactions [20]. Other exome sequencing studies have also identified a damaging germline mutation in GPRIN2 (p.A233S) in Iranian patients with familial esophageal squamous cell carcinoma (ESCC) [21], as well as somatic mutations in melanoma samples [22].

Additionally, a frameshift deletion variant in TPTE2 (c.483delT) and two nonsynonymous SNVs in NBEA (c.C2317A) and BRD7 (c.A44C) could not be confirmed by Sanger sequencing. NBEA has segmental duplications on chr15, while BRD7 maps to segmentally duplicated regions on chr3 and chr6. Furthermore, the TPTE2 variant lies within a short 8-nucleotide homopolymer, and TPTE2 has two segmental duplications on chrY and chr21 [23]. Because of their high sequence similarity, reads arising from segmental duplications may be misaligned, resulting in false-positive variant calls.

Seven nonsynonymous SNVs in RNF43, HLA-B, ERBB3, NTRK1, TET2, and DCC identified here have previously been implicated in various cancer types (Additional file 4: Table S4). For example, the HLA-B c.A161G variant, which was detected in 9 patients (3.1%) here, was also found to be associated with high-grade cervical preinvasive lesions and invasive cervical cancer in a recent genome-wide association study [24]. A different study reported that the ERBB3 c.A3355T variant was significantly associated with poor survival in ER-positive cases [25]. Nonetheless, none of these variants were significantly enriched in our case–control analyses.

Of our 49 variants, 4.1% (2/49) were classified as pathogenic and 61.2% (30/49) as VUS by InterVar. This high VUS rate is consistent with our previous study and with those of others on Asian populations [26, 27]. In a large US study on germline genetic testing, Asian patients had approximately two-fold more VUS than non-Hispanic White patients, at a VUS rate above 40% [27]. These substantially higher VUS rates in Asians may reflect the underlying lack of variant data from Asian control populations available for variant reclassification.

Beyond the current study, WES has been performed to detect candidate variants in BRCA-negative patients from other populations. In a study on seven families from France, Italy, the Netherlands, Australia and Spain, investigators found 12 variants in genes involved in DNA repair, cell proliferation and survival, or cell cycle regulation [28]. Sequencing of 52 individuals from 17 Greek families with HBOC, with further validation in additional cohorts from Canada, TCGA and the UK Biobank, led to the prioritization of missense variants in the SETBP1 and C7orf34 genes [29]. In another European study, 54 BRCA-negative families from Belgium underwent WES and 44% harbored variants in known cancer predisposition genes. In particular, nonsense variants in cancer-associated genes involved in DNA repair were enriched in breast cancer patients compared with controls [30]. From 113 families from Tunisia, eight BRCA-negative unrelated patients were selected for WES. Of 24 genes that were prioritized from WES data, five were selected based on their significant association with survival, as determined from analysis using TCGA data [31]. Notably, the strategies for prioritizing and filtering genes and variants differ between studies, as do the variants identified. It is possible that these variants are population-specific or low-penetrance variants.

Our study has limitations. First, we used an independent cohort of US patients with early-onset breast cancer (35 years or younger) from dbGaP to validate the frequency of the 49 variants discovered in our cohort that were found to be associated with breast cancer. However, 17 of the 49 variants were not present in this dbGaP case cohort, possibly due to differences in genetic ancestry between the populations. Hence, further studies in additional Asian as well as European populations are necessary to validate the variants described in this study. Secondly, DNA samples from family members of our cases were not available for segregation analysis. Thirdly, due to limited access to the SG10K_Health cohort, we used the gnomAD (EAS) population for variant filtering. The gnomAD (EAS) cohort comprises individuals of Korean, Japanese and Chinese descent, whereas our study population was South-East Asian, mainly of Chinese, Malay and Indian ethnicity. Nonetheless, gnomAD (EAS) was the most suitable publicly available control population, and was therefore selected.

Conclusions

In summary, the current study identified 49 pathogenic variants in 37 genes associated with breast cancer predisposition, many of which have not been previously documented. Our study provides new insights into genetic susceptibility to BC, and further studies in additional populations of diverse ethnic backgrounds are needed to determine the frequency of these variants and to confirm their association with BC risk.

Materials and methods

Study participants

Two hundred and ninety breast cancer patients who fulfilled one or more of the following criteria were selected for WES: (1) a family history of breast cancer in first- and/or second-degree relatives; (2) bilateral breast cancer; and (3) early-onset breast cancer at the age of 40 years or below (Additional file 1: Fig. S1) [26]. Written informed consent was obtained from all participants and the study was approved by the SingHealth Centralised Institutional Review Board (CIRB Ref: 2018/2147).

Whole-exome sequencing

Genomic DNA was isolated from peripheral blood samples collected from breast cancer patients, as described previously [32, 33]. Libraries were prepared and enriched using the Agilent SureSelect Human All Exon V6 kit (Agilent Technologies, CA, USA) according to the Agilent SureSelect protocols. Paired-end sequencing (2 × 150 bp) of the enriched libraries was performed on the Illumina NovaSeq 6000 platform. Reads were aligned and variants were called with Illumina DRAGEN version 3.5.7 on the BaseSpace Sequence Hub cloud platform [34], with a median coverage of 80× per base.

Prioritization and filtering of variants

The variants were annotated for their transcript effects, CADD v1.3 scaled score [35], and gnomAD minor allele frequencies using ANNOVAR [36]. CADD v1.3 indel scores were filled in manually using the CADD web server. The American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG-AMP) classifications were obtained using InterVar [37]. We removed variants that did not pass DRAGEN’s default quality control checks, variants with gnomAD (EAS) MAF greater than 1%, and variants found in two or fewer patients. Frameshift indels, stop-gains, and nonsynonymous SNVs with a scaled CADD v1.3 score greater than 20 were chosen for further analysis. A scaled CADD score of 20 or above places a variant among the 1% of variants predicted by CADD to be most deleterious.
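The filtering criteria described above can be summarized as a single predicate applied to each annotated variant. The following Python sketch is a minimal illustration under stated assumptions and is not the pipeline used in this study: the `Variant` fields, effect labels, and function name are hypothetical; variants unreported in gnomAD (EAS) are assumed to be retained; and frameshift indels and stop-gains are assumed to be kept regardless of CADD score, as stated in the Results.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical effect labels; truncating variants are kept regardless of CADD.
TRUNCATING = {"frameshift_insertion", "frameshift_deletion", "stopgain"}


@dataclass
class Variant:
    gene: str
    effect: str                    # hypothetical annotation label
    passed_qc: bool                # passed the caller's default QC checks
    eas_maf: Optional[float]       # gnomAD (EAS) MAF; None if unreported
    cadd_scaled: Optional[float]   # scaled CADD score; None if unavailable
    n_patients: int                # number of patients carrying the variant


def keep_variant(v, maf_cutoff=0.01, cadd_cutoff=20.0, min_patients=3):
    """Return True if a variant passes every filter in this sketch."""
    if not v.passed_qc:
        return False
    # Assumption: variants unreported in gnomAD (EAS) are treated as rare.
    if v.eas_maf is not None and v.eas_maf >= maf_cutoff:
        return False
    if v.n_patients < min_patients:
        return False
    if v.effect in TRUNCATING:
        return True
    if v.effect == "nonsynonymous_SNV":
        return v.cadd_scaled is not None and v.cadd_scaled > cadd_cutoff
    return False
```

The handling of variants with a MAF of exactly 1% is not specified in the text; this sketch excludes them, which is an arbitrary choice.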

Prioritization of candidate genes

From the genes of our prioritized variants, we selected only known or candidate cancer genes as listed by the NCG [9]. These genes were then further curated for those that were strongly implicated in cancer in at least one other cancer gene database: the COSMIC database [10], cancer driver genes based on nucleotide context [11], and computationally discovered and experimentally validated cancer driver genes [38] (Additional file 4: Table S1).
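The gene-level rule described above (listed in NCG, and supported by at least one of the three other resources) amounts to a set intersection. The Python sketch below illustrates this logic only; the function name and the placeholder gene lists in the usage note are hypothetical and do not reproduce the actual database contents.

```python
def prioritize_genes(candidate_genes, ncg_genes, other_gene_lists):
    """Keep genes listed in NCG and in at least one other cancer gene list.

    candidate_genes: iterable of gene symbols carrying prioritized variants.
    ncg_genes: set of known or candidate cancer genes from NCG.
    other_gene_lists: iterable of sets (e.g. COSMIC and two driver lists).
    """
    # Union of all supporting lists; a gene needs to appear in only one.
    supported_elsewhere = set().union(*other_gene_lists)
    return sorted(
        gene
        for gene in set(candidate_genes)
        if gene in ncg_genes and gene in supported_elsewhere
    )
```

For example, with placeholder inputs `prioritize_genes(["G1", "G2", "G3"], {"G1", "G2"}, [{"G2"}, set(), {"G3"}])`, only `G2` is retained: `G1` lacks support outside NCG and `G3` is not listed in NCG.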

Manual checking with IGV

All prioritized variants were manually checked with Integrative Genomics Viewer (IGV) [39], except those in highly repetitive regions in MUC4 or KMT2C, or highly polymorphic genes HLA-A or HLA-DRB1, as their alignments were too complex (Additional file 2: Fig. S2). Variants suspected to be false positives were excluded (Additional file 3: Fig. S3).

Case–control analysis

Case–control analysis for the variants was performed for two breast cancer cohorts (cases described in this study and the phs000822.v1.p1 dataset from dbGaP) and two control cohorts (gnomAD (EAS) and SG10K_Health). The dataset from dbGaP is a breast cancer dataset of 466 patients with early-onset breast cancer (diagnosed on or before the age of 35) from the United States of America. The gnomAD (EAS) cohort (gnomAD v2.1.1) comprises 9,977 individuals of East Asian descent while the SG10K_Health cohort consists of whole genomes from 9,770 healthy Chinese, Indian, and Malay volunteers from Singapore [8].

Polymerase chain reaction and Sanger sequencing

Variants that were significant by case–control analysis were validated by polymerase chain reaction (PCR) and Sanger sequencing. PCR primer sets were designed using Primer-BLAST [40]. DNA amplification by PCR was performed using HotStartTaq (Qiagen, Venlo, Netherlands) or Q5 High-Fidelity (New England Biolabs, Ipswich, MA, USA) DNA polymerase, as described in the manufacturer’s protocol. Primer sequences and their respective cycling conditions are listed in Additional file 4: Table S5. The PCR products were then analyzed by 2% agarose gel electrophoresis and purified with ExoSAP-IT Express (Thermo Scientific, USA) prior to sequencing. Cycle sequencing reactions were performed using BigDye Terminator v3.1 kit (Applied Biosystems, Foster City, CA) and the sequencing products were analyzed on a Genetic Analyzer. DNA sequences were visualized and aligned using Geneious Prime version 2022.1.

Statistical analysis

For case–control analyses, a two-sided Fisher’s exact test was used and p values were adjusted for multiple testing using the Benjamini–Hochberg method [41].
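As an illustration of the test and correction named above, the following Python sketch implements a two-sided Fisher's exact test on a 2 × 2 table and the Benjamini–Hochberg adjustment using only the standard library. It is a sketch under stated assumptions rather than the code used in this study: the table is assumed to hold alternate and reference allele counts for cases and controls, the two-sided p value is computed by summing the probabilities of all tables no more likely than the observed one, and the function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Two-sided Fisher's exact test p value for a 2 x 2 count table."""
    row_case = case_alt + case_ref
    row_ctrl = ctrl_alt + ctrl_ref
    col_alt = case_alt + ctrl_alt
    denom = comb(row_case + row_ctrl, col_alt)

    def prob(x):
        # Hypergeometric probability of x alternate alleles among cases.
        return comb(row_case, x) * comb(row_ctrl, col_alt - x) / denom

    lo = max(0, col_alt - row_ctrl)
    hi = min(row_case, col_alt)
    p_obs = prob(case_alt)
    # Sum all tables at most as probable as the observed one
    # (small relative tolerance guards against floating-point ties).
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-7)
    )
    return min(1.0, total)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy table (not study data): 8 vs 2 in cases, 1 vs 5 in controls.
p = fisher_exact_two_sided(8, 2, 1, 5)  # approximately 0.035
```

In practice, each variant would contribute one p value per case–control comparison, and the adjustment would be applied across the set of variants tested.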