Background

Breast cancer is the most commonly diagnosed cancer in women worldwide, making up 11.7% of new cancer diagnoses in 2020 [1]. Heritability estimates for breast cancer range from 13% [2] to 30% [3]. Breast cancer follows a predominantly complex genetic architecture, which in large part remains unresolved to this day [4]. Identifying disease-predisposing genes in breast cancer can help to understand pathological pathways and to discover new clinical biomarkers or drug targets. However, the linking of single-marker associations identified in genome-wide association studies (GWAS) to target genes is still ongoing [5], which limits mechanistic understanding of the disease.

The analysis of data from diverse ancestral groups can uncover new insights about genetic risk factors due to ancestral differences in variant frequency and linkage disequilibrium patterns, especially in the context of low-frequency variants, as well as variation in environmental factors [6,7,8]. Thus, extending genetic studies to diverse populations and groups is a necessary advance to gain a comprehensive understanding of genetic architectures of complex diseases.

In this study, we extend a recently published gene aggregation method combining coding and regulatory variants [9] to large-scale whole-genome genotyping cohorts to uncover novel genes implicated in breast cancer development. We used data from the Breast Cancer Association Consortium (BCAC), which has previously been analyzed using GWAS [10, 11], candidate gene analysis [12], and polygenic risk score analysis [13, 14].

We employ the following strategies to empower the discovery of novel gene-disease associations using data from BCAC: (1) aggregation of all coding and regulatory variants linked to a single gene, (2) effective utilization of low-frequency variants, (3) exploiting genetic diversity between different ancestral groups, and (4) restricting multiple testing burden to one statistical test per gene (~ 18,500).

Methods

Samples and genotype data

We used data on 142,670 individuals from BCAC. Detailed description of recruitment criteria, sample demographics, genotyping quality control, and imputation of additional markers have been reported previously [10, 15, 16]. In short, 83,471 breast cancer cases and 59,199 controls of diverse ancestry were recruited in 80 studies (see Fig. 1A, Additional file 1: Table S1). For each study, country of origin, and case and control numbers can be found in Additional file 1: Table S2. Samples were genotyped using the OncoArray (Illumina) [17], a custom SNP array enriched for cancer-associated genetic regions.

Fig. 1

Study design. A Breast cancer patients and control individuals included in this study originate from 33 different study center countries, and comprise samples of African, Asian, European, or Latin American and Hispanic ancestry. B The extended SKAT-O analysis implemented in mummy includes variants located in coding regions with an extended window and variants located in linked regulatory regions. Regulatory regions were linked to a gene based on overlap with the genetic range of its coding features or on the presence of gene-specific eQTLs from GTEx data in those regulatory regions

Quality control of genotype data

Sample quality control based on genotype and imputation quality has been performed previously [10]. In short, samples were genotyped on the custom OncoArray. Genotyped markers were retained only if they met all of the following quality criteria: (i) call rate above 98% in all consortia, (ii) minor allele frequency (MAF) of at least 1%, and (iii) no significant deviation from Hardy–Weinberg equilibrium (exclusion thresholds: P < 10−7 in controls, P < 10−12 in cases). Markers were imputed in a two-stage approach using shapeit2 and impute2 (V2) and the October 2014 (version 3) release of the 1000 Genomes dataset as reference panel [10]. The imputation was carried out for 5-Mb segments of the genome and for groups of 10,000 samples to reduce the computational burden. We included only low-frequency variants (MAF < 0.05). Variants with imputation accuracy scores (generated with IMPUTE version 2) below 0.7 were excluded from analysis.
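The variant selection described above can be illustrated with a minimal sketch. It only encodes the two thresholds stated in the text (MAF < 0.05 and imputation accuracy of at least 0.7); the `Variant` record and its field names are hypothetical and are not taken from the mummy code.

```python
from dataclasses import dataclass

MAF_MAX = 0.05   # low-frequency threshold stated in the text
INFO_MIN = 0.7   # minimum imputation accuracy stated in the text


@dataclass(frozen=True)
class Variant:
    """Hypothetical per-variant record; field names are illustrative only."""
    variant_id: str
    maf: float    # minor allele frequency
    info: float   # imputation accuracy score


def keep_for_aggregation(v: Variant) -> bool:
    """Keep low-frequency variants (MAF < 0.05) with accuracy >= 0.7."""
    return v.maf < MAF_MAX and v.info >= INFO_MIN


variants = [
    Variant("v1", maf=0.01, info=0.90),  # kept
    Variant("v2", maf=0.20, info=0.95),  # too common
    Variant("v3", maf=0.03, info=0.50),  # poorly imputed
]
kept_ids = [v.variant_id for v in variants if keep_for_aggregation(v)]
```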

Selection of genetic elements

Our previously developed analysis pipeline “mummy” [9] was used to identify coding and regulatory regions for individual genes and to prepare input data for robust rare variant SNPset association testing software MONSTER [18]. Aggregation tests were performed for genes defined in GENCODE v25 and with at least three but not more than 5000 low-frequency variants.

For each of these genes, we identified genetic elements that are likely to contain relevant functional or expression variation using the mummy wrapper. These include the exons and untranslated regions (UTR) of the gene. We selected additional regulatory elements that have been shown to be enriched for complex trait associations [19, 20]: promoter, enhancer, and transcription-factor-binding units if they could be linked to the gene. These elements were identified from the Ensembl build 84 resource. The link of regulatory genetic elements to genes was either based on physical overlap with the coding region, e.g., when an element was located within an intron of the gene, or physical overlap with significantly associated eQTLs for the specific gene (see Fig. 1B). Thus, we included the three types of regulatory elements if there was evidence that they affect expression levels of the gene. This was based on eQTL data for all available cell types from GTEx version 6.
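The two linking rules can be sketched as a simple interval test. This is an illustration under stated assumptions, not the mummy implementation: the `Interval` type, the 1-based inclusive coordinates, and the function names are hypothetical, and the eQTL list is assumed to contain only eQTLs that are already significant for the gene in question.

```python
from typing import NamedTuple, Sequence, Tuple


class Interval(NamedTuple):
    """Hypothetical genomic interval, 1-based with inclusive ends."""
    chrom: str
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals on the same chromosome share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def is_linked(
    element: Interval,
    gene_body: Interval,
    gene_eqtls: Sequence[Tuple[str, int]],
) -> bool:
    """Link a regulatory element to a gene by either rule from the text.

    Rule 1: the element physically overlaps the gene (e.g., it is intronic).
    Rule 2: the element contains a significant eQTL for this gene.
    `gene_eqtls` holds (chromosome, position) pairs, assumed pre-filtered.
    """
    if overlaps(element, gene_body):
        return True
    return any(
        chrom == element.chrom and element.start <= pos <= element.end
        for chrom, pos in gene_eqtls
    )
```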

For each gene, all the low-frequency variants in these selected genetic elements were extracted and formatted to the MONSTER required input and weighted using Phred-scaled EigenPC pathogenicity scores [21]. EigenPC scores have been previously shown to offer the best balance between coding and noncoding variants for application in aggregation testing [9].

The original implementation of “mummy” was adapted to allow for the input of genotype data based on DNA microarrays instead of sequencing data in VCF format. The adapted “mummy” code is accessible on github here: https://github.com/stef-mueller/mummy_for_genotypes.

Gene-based aggregation test

MONSTER (Minimum P‐value Optimized Nuisance parameter Score Test Extended to Relatives) was used to perform SNPset variant aggregation tests for the variants selected for each gene [18]. MONSTER generalizes the SKAT-O algorithm to allow for testing of related samples and sample cohorts with underlying population structure using a mixed effects model. SKAT-O is a unified test that combines a variance component and a burden test. The original MONSTER code was adapted to allow for the inclusion of larger sample numbers. The adapted MONSTER code is available on github here: https://github.com/stef-mueller/MONSTER.
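For orientation, the SKAT-O statistic for unrelated samples is commonly written as below. The notation is an assumption introduced here and does not appear in the original text: g_ij is the genotype of sample i at variant j, y_i the phenotype, \hat{\mu}_i its fitted value under the null model, and w_j the variant weight. MONSTER replaces the null model with a mixed effects model, so this is a sketch of the underlying idea rather than of the MONSTER implementation.

```latex
S_j = \sum_{i=1}^{n} g_{ij}\,\bigl(y_i - \hat{\mu}_i\bigr), \qquad
Q_{\mathrm{SKAT}} = \sum_{j=1}^{m} w_j^{2} S_j^{2}, \qquad
Q_{\mathrm{burden}} = \Bigl(\sum_{j=1}^{m} w_j S_j\Bigr)^{2}

Q_{\rho} = (1-\rho)\,Q_{\mathrm{SKAT}} + \rho\,Q_{\mathrm{burden}}, \qquad 0 \le \rho \le 1
```

The reported test uses the smallest P-value obtained over a grid of values of \rho, so that \rho = 0 recovers the variance component test and \rho = 1 the burden test.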

Samples were processed in 15 study groupings due to the computational demand. Groups were formed based on study origin and genetic ancestry of samples while ensuring balanced case and control numbers. Additional file 1: Table S3 lists the number of analyzed genes for each cohort. Sample numbers per cohort can be found in Additional file 1: Table S4.

The mixed effects models testing for gene associations included relatedness in the form of a kinship matrix as a random effect. The kinship matrix was derived by, first, creating an LD pruned marker set using plink2 [22] (window size: 50 kb, step size: 5, r2 threshold: 0.5, minor allele frequency threshold: > 0.2), second, calculating a relationship matrix using gemma [23], third, calculating individuals’ inbreeding coefficients using the plink2 --ibc command, and fourth, combining relationship matrix and inbreeding coefficients to the MONSTER required input format. Additionally, age and for some cohorts the recruitment study or study country were included in the model as fixed effects (Additional file 1: Table S5).

As is common for SNPset aggregation tests, MONSTER outputs P-values but not effect sizes or effect directions for linear mixed model aggregation tests. To check for unaccounted population stratification effects, raw aggregation test results per cohort were plotted against the theoretical distribution of P-values using quantile–quantile (QQ) plots (see Additional file 1: Figure S1), and genomic inflation factors lambda and lambda1000 were calculated (see Additional file 1: Table S4). Lambda depends on sample size and increases for large samples, whereas lambda1000 rescales lambda to a study of 1000 cases and 1000 controls and is therefore comparable across studies.
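The two inflation metrics can be computed as in the following sketch. It assumes the conventional definitions: lambda is the median 1-df chi-square statistic implied by the P-values divided by the median of the 1-df chi-square distribution, and lambda1000 rescales lambda to 1000 cases and 1000 controls. Whether the authors used exactly this procedure is not stated, and the function names are illustrative.

```python
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()
# Median of the 1-df chi-square distribution (about 0.4549).
_CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def p_to_chi2_1df(p: float) -> float:
    """1-df chi-square statistic whose upper-tail probability is p (0 < p <= 1)."""
    # A 1-df chi-square variable is a squared standard normal variable.
    return _STD_NORMAL.inv_cdf(p / 2.0) ** 2


def inflation_lambda(p_values) -> float:
    """Genomic inflation factor: observed over expected median chi-square."""
    return median(p_to_chi2_1df(p) for p in p_values) / _CHI2_1DF_MEDIAN


def lambda_1000(lam: float, n_cases: int, n_controls: int) -> float:
    """Rescale lambda to a study of 1000 cases and 1000 controls."""
    scale = (1.0 / n_cases + 1.0 / n_controls) / (1.0 / 1000 + 1.0 / 1000)
    return 1.0 + (lam - 1.0) * scale
```

For uniformly distributed P-values, `inflation_lambda` is close to 1; for a fixed lambda above 1, `lambda_1000` shrinks toward 1 as the cohort grows.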

Two of the 15 cohorts, one of European ancestry and the Latin American and Hispanic group, were found to have increased genomic inflation factors with lambda1000 metrics of 1.32 and 1.14, respectively. Thus, raw aggregation test P-values for these two cohorts were corrected using the genomic control method.
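A genomic control correction can be sketched as follows. Two points are assumptions rather than statements from the text: the P-values are mapped to 1-df chi-square statistics before rescaling, and the inflation estimate `lam` is supplied by the caller, because the text does not say which estimate was used as the divisor.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def genomic_control(p_values, lam: float):
    """Divide each 1-df chi-square statistic by lam and return new P-values.

    P-values must lie in (0, 1]. Values of lam at or below 1 leave the
    input unchanged, following the usual convention of never deflating.
    """
    if lam <= 1.0:
        return list(p_values)
    corrected = []
    for p in p_values:
        chi2 = _STD_NORMAL.inv_cdf(p / 2.0) ** 2         # P-value -> statistic
        z = sqrt(chi2 / lam)                             # rescaled |z|
        corrected.append(2.0 * _STD_NORMAL.cdf(-z))      # statistic -> P-value
    return corrected
```

The correction preserves the ranking of genes within a cohort and only makes each P-value less extreme.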

Meta-analysis of aggregation tests

Two meta-analyses were performed to combine raw aggregation association results from individual cohorts. First, to allow for comparison with the published GWAS [10] results based on the same sample set, all cohorts including samples of predominantly European ancestries (twelve cohorts, all named “eur*”) were combined in an all-European meta-analysis. Next, a second meta-analysis was performed including all cohorts.

The Stouffer [24, 25] method was used to perform the meta-analysis. It combines the z-statistic derived from P-values of the aggregate test for each cohort after weighting with the square root of the respective sample size. For cohorts with increased genomic inflation factor lambda1000, genomic control corrected P-values, rather than the raw P-values, were included in the meta-analysis. The R package metaP (version 1.3) was used to perform Stouffer meta-analysis. No evidence for increased inflation was observed for the meta-analysis results based on QQ plots and inflation estimates (Additional file 1: Figure S2).
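The weighted Stouffer combination for a single gene can be sketched as below. This is a standard-library illustration of the formula and not a reproduction of the R package call; the function name is hypothetical, and P-values are assumed to lie strictly between 0 and 1.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def stouffer_weighted(p_values, sample_sizes) -> float:
    """Combine per-cohort P-values with weights sqrt(sample size)."""
    if not p_values or len(p_values) != len(sample_sizes):
        raise ValueError("need one sample size per P-value")
    weights = [sqrt(n) for n in sample_sizes]
    # Upper-tail z-scores: small P-values map to large positive z.
    z_scores = [-_STD_NORMAL.inv_cdf(p) for p in p_values]
    z_combined = sum(w * z for w, z in zip(weights, z_scores)) / sqrt(
        sum(w * w for w in weights)
    )
    return _STD_NORMAL.cdf(-z_combined)
```

Because aggregation tests report no effect direction, the combination is one-sided: consistent small P-values across cohorts reinforce each other, and larger cohorts contribute more.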

Benjamini–Hochberg false discovery rate (FDR) method was used to correct the meta-analysis results for multiple testing. To ensure robust association signals, genes with missing results for the majority of cohorts were excluded from further analysis. Significant hits were defined as those with FDR-corrected P-values < 0.05.
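The Benjamini–Hochberg adjustment and the resulting hit definition can be sketched as follows. The function name is illustrative, and the sketch assumes that genes with missing results have already been removed.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending P-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so that monotonicity is enforced.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(q_values) if q < 0.05]
```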

Follow-up on significantly associated genes

We evaluated whether any of the significant gene-based associations with breast cancer overlapped with significant single-marker associations arising from the European ancestry GWAS. The genome-wide association analysis for single markers in the European ancestry samples has been previously described [10]. The comparison was based on coding and regulatory regions of the gene-based hits with a flanking region of 100 kb. The flanking region of 100 kb was chosen to ensure inclusion of the majority of cis-eQTL elements which, based on GTEx data of 44 tissues, have a median distance of 28.9 or 50.1 kb from the transcription start site (TSS) of genes for primary and secondary cis-eQTLs, respectively [26]. Loci that included SNPs with P-values below 5 × 10−8 from the single-marker association analysis in the examined regions were classified as previously identified breast cancer association hits.
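The classification rule in this paragraph can be sketched as a window test. The 100 kb flank and the 5 × 10−8 threshold come from the text; the function name, the tuple layout of the GWAS results, and the toy coordinates are hypothetical.

```python
FLANK_BP = 100_000       # flanking region stated in the text
GWS_THRESHOLD = 5e-8     # genome-wide significance threshold stated in the text


def previously_identified(chrom, start, end, gwas_results) -> bool:
    """True if any genome-wide significant SNP lies within the flanked region.

    `gwas_results` is an iterable of (chromosome, position, p_value) tuples.
    """
    lo = max(1, start - FLANK_BP)
    hi = end + FLANK_BP
    return any(
        c == chrom and lo <= pos <= hi and p < GWS_THRESHOLD
        for c, pos, p in gwas_results
    )


# Toy single-marker results, for illustration only.
toy_gwas = [
    ("12", 150_000, 1e-9),   # significant, within 100 kb of the first region
    ("12", 160_000, 1e-5),   # nearby but not genome-wide significant
]
known = previously_identified("12", 200_000, 250_000, toy_gwas)
novel = not previously_identified("12", 600_000, 650_000, toy_gwas)
```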

We carried out bioinformatic annotations for each significantly associated gene. Four open-source databases were queried for prior evidence of a causal role of the genes in breast cancer pathology specifically as well as any cancer pathology. First, the ClinVar database was used to identify any putative pathogenic, single-gene variants reported previously in the context of the phenotypes of interest. The ClinVar database was queried on the 1st of March 2021. Pathogenic, single-gene ClinVar variant entries with at least one star review status were classified as supportive evidence.

Second, the aggregated gene-disease database MalaCards [27] was used to identify any significant correlation of genes and phenotypes of interest based on 68 different data sources and utilizing NLP (Natural Language Processing) algorithms to include evidence from non-structured data sources like research publications. Supportive evidence of a causal role of genes was defined as a MalaCards search relevance score over 1. The MalaCards database was queried on the 1st of March 2021.

Third, the expert-curated Genetics Home Reference data was queried for all genes of interest and examined for evidence of causal role in breast cancer or any cancer. The queried data version was published on the 28th of July 2020.

And fourth, investigating possible roles as driver genes in breast cancer and cancer pathogenicity, we queried the COSMIC Cancer Gene Census data (version 92) which classifies genes as either (1) TIER1: genes with strong evidence of causal role promoting cancer such as documented relevance in cancer and oncogenic mutations, (2) TIER2: genes with substantial indications to play a role in cancer etiology, and (3) untiered genes: genes with no substantial evidence of a causal role.

Results

Gene-wise aggregation analysis was performed in 83,471 breast cancer patients and 59,199 matched controls. Of those 142,670 samples, 83.4% (n = 119,014) were of European, 10.7% (n = 15,321) of Asian, 4.1% (n = 5784) of African, and 1.8% (n = 2551) of Latin American and Hispanic ancestry. Samples were recruited to studies in 33 countries (see Fig. 1A).

All-European meta-analysis finds 14 associated breast cancer genes

First, we combined gene-wise association results for European cohorts in an all-European meta-analysis. After multiple testing correction, we found 14 genes located in nine different regions to be significantly associated with breast cancer risk (Table 1). Overlap in coding and regulatory regions of genes can cause non-unique mapping of variants to multiple genes for the association aggregation test performed in MONSTER. Thus, four loci were identified containing more than one associated gene. Regional plots for all 14 genes can be found in Additional file 1: Figure S3.

Table 1 Meta-analysis hits in samples of European ancestry. Results for significant (q < 0.05) gene associations from the meta-analysis of 12 cohorts of European ancestry. Genes with overlapping coding and/or regulatory regions are summarized as a single locus defined as the intersection of all included genetic regions. Overlap with single-marker association results from Michailidou et al. [10] is also shown, with new associations identified for FMNL3 and AC058822.1

For twelve of the 14 associated genes, the region (gene plus a 100-kb flanking region) contained markers that were individually associated with breast cancer at genome-wide significance (P-value < 5 × 10−8).

Two novel associations

The gene-wise aggregation of low-frequency variants based on coding and regulatory features was able to extend findings of a standard GWAS analysis. The analysis identified two novel gene associations that do not overlap previously reported single-marker-based loci (Fig. 2). The FMNL3 (Formin-Like 3) gene at 12q13.12 was associated with breast cancer risk with a q-value of 0.013. It encodes the Formin-like protein 3, a cytoskeletal regulator, whose overexpression is associated with cancer cell migration, invasion, metastasis, and poor prognosis in multiple cancer types, such as colorectal carcinoma [28], nasopharyngeal carcinoma [29], and tongue squamous cell carcinoma [30].

Fig. 2

Regional plot of the FMNL3 gene on chromosome 12. Regional plots for the breast cancer association of FMNL3 at 12q13.12. A Depiction of coding regions of all coding genes (data retrieved from Ensembl biomart hg38) within the chromosomal region with FMNL3 highlighted in blue. B Variants included in the aggregation test, plotted according to their chromosomal position and analysis weight. Highlighted in blue are variants exclusively present in the analysis of samples of diverse ancestry. C Single-marker association results based on the same samples [10], with blue solid line denoting P-value for meta-analysis of all cohorts for gene of interest (P = 1.24 × 10−5) in this study and blue dashed line denoting unadjusted P-value for all-European meta-analysis (P = 6.11 × 10−6)

The second novel association was found at 4q12 for AC058822.1 (q-value = 0.020), also named RP11-231C18.3. This lncRNA gene is a scarcely characterized genetic element spanning almost 1 Mb.

Gene-based aggregation can help identify the causal genes

To assess whether the gene-based approach can help highlight biologically plausible gene candidates, we assessed whether other evidence, such as genetic epidemiological studies or cell models, supports a role for the significantly associated genes in cancer. We queried different public databases for links to breast cancer and other cancer types for the 14 genes found to be associated with breast cancer in the all-European meta-analysis.

Two genes, MAP3K1 and FGFR2, in addition to being previously identified in breast cancer-associated genetic region in GWAS (see Table 2), are both classified as TIER1 cancer-driving genes in COSMIC Cancer Gene Census. Thus, there is strong evidence that somatic mutations in both genes have a functional involvement in cancer etiology.

Table 2 Support for a Role in Cancer for the 14 Associated Genes. Prior supportive evidence for genes associated with breast cancer in the aggregation test was based on the presence of pathogenic cancer mutations in those genes according to ClinVar, the curated genetic reference database Genetics Home Reference, and the aggregation database MalaCards. In addition, hit genes were queried in the COSMIC Cancer Gene Census database

To search for previous causal evidence of germline mutations in associated genes, we queried the ClinVar, Genetics Home Reference, and MalaCards databases, the last two being an expert-curated gene-disease database and an aggregation database of 68 data sources, respectively. Five genes were implicated in the development of other cancer types: SRGAP2C, MAP3K1, FGFR2, LSP1, and FMNL3.

In addition, the gene ABRAXAS1 codes for a subunit of the BRCA1-A complex [31]. This protein complex plays an important role in DNA damage repair and mutations in the BRCA1 gene predispose to increased risks of cancer [32].

In summary, we found support for aggregated gene associations coinciding with prior causal evidence in breast cancer for two of the nine associated loci and in any cancer for five of them. Among the four associated loci with no or very limited prior evidence in cancer pathophysiology is the single-gene locus spanning the gene ABRAXAS1, a promising candidate gene for further follow-up owing to its close interactions with the BRCA1 protein and its role in DNA damage repair [33].

Including ancestrally diverse samples finds additional gene associations

We furthermore tested gene-based associations in the African (n = 5784), Asian (n = 15,321), and Latin American and Hispanic (n = 2551) ancestry cohorts. There were no significant associations after FDR multiple testing correction. We considered suggestive associations with unadjusted, or in case of the Latin American and Hispanic cohort genomic control corrected, P-values below 1 × 10−4. While no suggestive associations were found in the Latin American and Hispanic cohort, four and five gene associations could be identified in the African and Asian cohort, respectively (Additional file 1: Table S7 and Table S8). This included a suggestive association of gene CBLB (unadjusted P-value: 2.11 × 10−5, Additional file 1: Figure S5) in the African cohort. The E3 Ubiquitin Ligase Cbl-b, coded by oncogene CBLB, has been reported to affect cancer development and progression [34] and has been proposed as a clinical biomarker in breast cancer [35]. No variants located in the coding region of CBLB (plus 100 kb flanking region) were found to be associated in the 2017 large-scale GWAS [10]. None of the variants at this locus have been previously linked to any breast cancer phenotype based on the GWAS Catalog. Thus, the inclusion of diverse ancestry samples shows promise for the identification of new suggestive associations for a plausible candidate gene.

In a second meta-analysis, all 15 sample cohorts, including European ancestry cohorts and cohorts of Asian, African, or Latin American and Hispanic ancestry, were combined (Additional file 1: Table S6). This analysis identified an additional association of gene ESR1 (FDR adjusted P-value in all cohort meta-analysis: 0.0269; Additional file 1: Figure S4). The gene ESR1 codes for the estrogen receptor alpha protein and genetic variations in this gene have been reported to be associated with breast cancer [10, 36] and are well described in breast cancer etiology [37] impacting cancer progression [38], treatment success [39], and long term disease outcomes [40].

Discussion

We report the results of a gene-based association analysis in the BCAC resource. Adopting a recently proposed aggregation method that combines variants in coding and regulatory regions, we were able to replicate and extend previously reported findings. This aggregation method helps identify target genes of previously reported single-marker associations and uncovers additional associations that were missed by other methods.

We found 14 genes located in nine loci to be significantly associated with breast cancer risk in samples of European ancestry. Variants near seven of these loci have previously been implicated in breast cancer development based on the 2017 GWAS by Michailidou et al. [10] and we were able to link those single-marker associations to putative target genes. We found independent evidence for a role in cancer development for five of the genes. Two of them, MAP3K1 and FGFR2, are long-established risk genes for breast cancer mediated by both germline and somatic mutations [41, 42]. MAP kinase MEKK1, coded by MAP3K1, has been reported to promote cancer cell migration by contributing to an accommodating breast tumor microenvironment [43, 44], while FGFR2 has been identified as a viable drug target in breast cancer [45]. Additionally, the genes SRGAP2C, LSP1, and FMNL3 have been implicated in the etiology of other types of cancer. Although there is currently no functional evidence to substantiate the role of these three genes in breast cancer, sharing of genetic risk factors between different cancers is prevalent [46]. Jiang et al. report a genetic correlation of 0.24, 0.18, and 0.15 for breast cancer with ovarian, lung, and colorectal cancer, respectively [2].

As a further plausible target gene, we have identified ABRAXAS1, which codes for a subunit of the BRCA1 DNA repair protein complex. Differential allelic expression in the genomic region 4q21, in which gene ABRAXAS1 is located, has been previously reported to be associated with breast cancer susceptibility [47]. Interestingly, a recent study using burden testing for rare, protein-truncating or pathogenic variants in ABRAXAS1 based on sequencing data from 60,000 patients and 53,000 controls from the BCAC cohort did not find a significant disease association, with the odds ratio reported as 0.98 (0.50–1.94) [12]. In contrast, our approach focusing on low-frequency coding and regulatory variants identified a significant association of this gene with breast cancer risk. This suggests that our method enables gene discoveries that are missed by other approaches because the local genetic architecture differs among genes affecting breast cancer susceptibility.

Beyond the identification of putative target genes in loci that have been previously found to harbor disease-associated variants, we report here two new disease associations for genes FMNL3 and AC058822.1. FMNL3 is a member of the diaphanous-related formin family, which represents a family of highly conserved cytoskeletal regulatory proteins [48]. FMNL3 expression is reported to promote migration and invasion of cancer cells and predicts clinical outcome in different solid cancers such as colorectal carcinoma [28, 49], squamous cell carcinoma of the tongue [30], and melanoma [50]. No markers in the proximity of this gene were found to be associated with breast cancer in the 2017 GWAS in the same dataset.

Features of the method that may facilitate discoveries beyond those identified by other approaches include (i) a reduction of multiple testing burden, (ii) boosting signals by aggregating over all genetic regions affecting an individual gene's expression and function, (iii) inclusion of low-frequency variants often underpowered in other studies, and (iv) ability to synthesize evidence for genetic risk factors in different ancestries regardless of differences in non-disease-associated variational background.

The inclusion of samples of non-European ancestry in genetic studies can advance our understanding of genetic disease landscapes [8]. However, differences between populations in terms of allele frequencies and linkage disequilibrium can lead to heterogeneity and false positive associations in single-marker association analyses. Additionally, different causal variants may be present in different ancestral groups [51], which can be driven by ancestry differences in allele frequencies. Aggregation methods offer a solution because they can accommodate multiple causal variants at a locus. A meta-analysis including all cohorts in this study was able to identify an additional association for ESR1, which was not detected in a European ancestry only analysis. Ancestry-related differences in disease-associated variants and minor allele frequencies in the ESR1 locus (6q25 region) have been previously reported [52, 53]. The ESR1 gene codes for the estrogen receptor alpha monomer, an established risk factor and promising clinical biomarker in breast cancer pathophysiology [37, 54, 55].

The comparably small sample size of cohorts of non-European ancestry is a limitation of our study. Although no gene reached FDR-corrected significance in these analyses, nine genes were associated at suggestive thresholds, including the biologically plausible candidate gene CBLB. This gene codes for the E3 Ubiquitin Ligase Cbl-b, which has an established role in cancer development and progression [56, 57]. Evidence has recently been mounting that CBLB expression may be useful as a prognostic factor in breast cancer [35, 58, 59].

We note the following limitations for the adopted method in this study. First, no effect sizes or effect directions are derived. Second, it is not clear how statistical power for identification of associations is affected by gene length, mutational constraint, number of transcripts, and amount of prior evidence for regulatory elements. Future analyses could deliver insights in this regard. Third, we were not always able to narrow down associations to a single target gene in loci due to overlapping genetic features. This limitation is affected by the LD structure in a specific region and the amount of prior information available in the form of eQTL data and regions of overlapping transcripts. Fourth, although we are able to find plausible target genes applying this method to samples of diverse ancestry, there is potential for further optimisation. Regulatory features for genes have been identified using GTEx data, which is predominantly derived from European ancestry samples. Additionally, variants are weighted using Phred-scaled EigenPC pathogenicity scores [21]. These scores are derived using unsupervised learning on a training dataset predominantly based on samples of European descent. Fifth, the current implementation of the method is computationally demanding but nonetheless able to analyze large sample sets (here over 140,000 samples). Sixth, our analysis did not consider different transcripts of genes, so our findings are limited to the assigned major transcript. And lastly, the optimal aggregation method depends on the genetic architecture at a given locus. We used SKAT-O, a unified test, to capture a range of different architectures. However, the choice of method may affect the results.

Conclusions

Our findings show that the use of extended gene aggregation methods covering coding and regulatory regions in addition to standard single-marker tests (i.e., GWAS) has the potential to discover novel associations in available datasets. This study helps uncover the role of low-frequency genetic variation in breast cancer susceptibility and empowers gene discovery in ancestrally diverse cohorts.