Introduction

Genome-wide association studies (GWAS) for breast cancer (BC) have identified hundreds of risk-associated loci and have generated long lists of candidate loci requiring further validation1. Nevertheless, the identification of the causal variants and their target genes, as well as understanding the underlying biological mechanisms, remain challenging. This is because disease risk loci often have many variants in high linkage disequilibrium (LD) with the risk-associated variant, harbour multiple genes and mainly fall in noncoding genome regions2. However, the overrepresentation of potential causal variants at active gene regulatory regions3,4 indicates that variants regulating gene expression levels likely influence BC genetic predisposition, both proximally and over a long range5,6,7,8,9,10,11. These variants have commonly been mapped by expression quantitative trait loci (eQTL) analysis, but this approach is impacted by the effects of negative feedback control and environmental factors12. An increasingly popular alternative approach is to detect imbalances in allelic transcript levels—differential allelic expression (DAE). By comparing the relative expression of the two alleles in a heterozygous individual, each allele will serve as an internal standard for the other, thus controlling for trans-regulatory and environmental factors affecting both alleles13,14. Consequently, this directly indicates regulatory variants acting in cis (rSNPs).

Given the importance of cis-regulatory variants for BC susceptibility, a genome-wide map of cis-regulatory variants would be key to interpreting GWAS results and identifying causal variants of risk. Studies in various healthy tissues showed that DAE is a relatively common event13,15,16,17,18,19. Given that gene expression regulation is tissue-specific, performing these studies in the tissue from which the disease arises, namely, normal breast tissue, is essential. Although others have used allelic expression analysis to identify BC risk, this was carried out in tumour tissue or lymphoblastoid cells20,21. This study proposes an integrative approach to identify causal variants of risk that have a cis-regulatory role (Fig. 1): to combine GWAS results with SNPs associated with DAE levels in normal breast tissue. Hence, we first carried out DAE analysis in normal breast tissue samples at a genome-wide level, then mapped the candidate risk regulatory variants for GWAS loci and finally functionally unveiled the mechanisms underlying BC risk at a selected locus.

Fig. 1

Strategy framework used to identify causal variants and target genes associated with breast cancer risk. aeSNP an SNP that passed quality control and at which allelic expression (AE) was measured, daeSNP an aeSNP showing differential AE (DAE), Genotyped SNP an SNP with genotype information (either genotyped in the study or imputed) and tested for association with AE ratios, daeQTL an SNP associated with AE ratios measured for a daeSNP, risk-daeQTL a daeQTL with an r2 ≥ 0.4 with a GWAS hit variant, candidate risk rSNP a variant with an r2 ≥ 0.95 with the risk-daeQTL.

Methods

SNP and call filtering at the gDNA and cDNA levels

We used an Illumina Infinium Exon510S-Duo array dataset of normal breast tissue available from Gene Expression Omnibus (GEO, www.ncbi.nlm.nih.gov/geo/) under accession number GSE3502322. It consists of 66 samples of DNA and cDNA (derived from total RNA) run on Illumina Infinium Exon510S-Duo arrays. These exon-centric microarrays contain probes for 511,354 SNPs, with more than 60% of the markers located within 10 kb of a gene and targeting more than 99.9% of human RefSeq genes. Sample filtering and normalisation were performed as described previously, and two samples displaying poor dynamic range of intensities in either channel (IQR of non-normalized log2 X or log2 Y < 1), or showing clear evidence of contamination of the RNA sample with DNA, were removed from further analysis22. For the remaining 64 samples, we used within-array strip-level quantile normalization to correct for dye-biases between the two channels (in general X = Cy5 = allele A and Y = Cy3 = allele B) from each array to obtain normalized intensities (X* and Y*) for each SNP. The function stripNormalize in the crlmm R package (version 1.10.0) was used to normalize the data. Log-ratios (M = log2 X* − log2 Y*) and average log-intensities [S = 0.5(log2 X* + log2 Y*)] were then calculated for each SNP on each array. We work on the log-ratio scale rather than the β-scale [β = X*/(X* + Y*)], as used in Ref.13, as it has been shown to have more desirable properties for statistical testing in other applications of Infinium technology23.
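As an illustration of the two summary quantities defined above, the following minimal Python sketch computes M and S for a single SNP from already normalised intensities. It is not the crlmm-based pipeline used in the study; the function names and the example intensities are hypothetical.

```python
import math

def log_ratio_and_intensity(x_norm, y_norm):
    """Return (M, S) for one SNP from normalised intensities X* and Y*."""
    log_x = math.log2(x_norm)
    log_y = math.log2(y_norm)
    m = log_x - log_y            # M = log2 X* - log2 Y*
    s = 0.5 * (log_x + log_y)    # S = 0.5 (log2 X* + log2 Y*)
    return m, s

def beta_value(x_norm, y_norm):
    """Beta scale, shown only for comparison: X* / (X* + Y*)."""
    return x_norm / (x_norm + y_norm)

# Hypothetical intensities for one SNP on one array
m, s = log_ratio_and_intensity(2048.0, 512.0)
print(m, s)                        # 2.0 10.0
print(beta_value(2048.0, 512.0))   # 0.8
```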

Genome-wide DAE analysis

Allelic expression was measured in the filtered dataset of SNPs and samples in a varying number of heterozygous (AB) individuals for each transcribed SNP (aeSNP). As cDNA was prepared from total RNA, without selection for poly-A mRNAs, AE was measured for variants in fully processed and unspliced primary transcripts. Allelic expression ratios (AE ratios) were defined as the log2 of the ratio between the levels of allele A transcript and the levels of allele B transcript (heterozygote ratio), normalised by the same heterozygote ratio calculated for genomic DNA (gDNA) (Fig. S1), to account for copy number variation and correct for technical biases. Differential allelic expression (DAE) was called at the sample level when AE ratios were greater than 0.58 or less than − 0.58 (corresponding to the log2 of a 1.5-fold difference between alleles).
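The AE ratio definition and the sample-level DAE call can be sketched as follows. This is a hedged illustration only: it assumes that normalisation by gDNA amounts to subtracting the gDNA log2 heterozygote ratio from the cDNA log2 heterozygote ratio, and the function names and input values are hypothetical.

```python
import math

DAE_THRESHOLD = 0.58  # threshold from the text, roughly log2(1.5)

def ae_ratio(cdna_a, cdna_b, gdna_a, gdna_b):
    """log2(A/B) in cDNA minus log2(A/B) in gDNA for one heterozygote.

    Assumption: gDNA normalisation is a subtraction on the log2 scale.
    """
    return math.log2(cdna_a / cdna_b) - math.log2(gdna_a / gdna_b)

def is_dae(ratio, threshold=DAE_THRESHOLD):
    """Sample-level DAE call: AE ratio above +threshold or below -threshold."""
    return abs(ratio) > threshold

def fold_change(ratio):
    """Convert an AE ratio (log2 scale) into a fold difference between alleles."""
    return 2 ** abs(ratio)

# Hypothetical signals: allele A twice as abundant as B in cDNA, balanced gDNA
ratio = ae_ratio(800.0, 400.0, 500.0, 500.0)
print(ratio, is_dae(ratio), fold_change(ratio))  # 1.0 True 2.0
```

Working on the log2 scale makes the call symmetric: a 1.5-fold excess of either allele gives the same absolute AE ratio.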

To define aeSNPs displaying mono-allelic expression (maeSNPs)24, we first identified SNPs presenting extreme allelic expression (AE) ratios, specifically with values greater than 0.58 or lower than − 0.58, without any heterozygous samples showing intermediate values. Next, for each aeSNP in this subset, we used a binomial test to assess whether heterozygous samples were equally distributed between positive and negative AE ratios. We applied multiple testing correction to p-values, and variants with a false discovery rate (FDR) above 5% were considered to display a random choice of expression between alleles and classified as maeSNPs. Genes with at least one maeSNP were labelled maeGenes.

After filtering out maeSNPs, the remaining aeSNPs were tested for differential allelic expression by applying an Equal or Given Proportions test (prop.test function in R), with the alternative hypothesis that the proportion of heterozygotes with absolute AE ratios ≥ 0.58 is greater than 10% for any given SNP. The resulting p-values were corrected using a false discovery rate of 5% to define daeSNPs (Fig. 1, Suppl Fig. S1) i.e., heterozygous transcribed variants displaying differential allelic expression. Genes with at least one daeSNP were henceforth denominated daeGenes.
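The logic of the per-SNP test can be illustrated with a short Python sketch. Note that it swaps in an exact one-sided binomial test in place of R's prop.test (which uses a chi-squared approximation), so p-values will differ slightly from those of the study; the function names and the example AE ratios are hypothetical.

```python
from math import comb

def binom_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0)."""
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

def dae_proportion_pvalue(ae_ratios, threshold=0.58, p0=0.10):
    """One-sided test that the share of heterozygotes with |AE| >= threshold exceeds p0.

    Exact binomial stand-in for prop.test; not the study's implementation.
    """
    n = len(ae_ratios)
    k = sum(1 for r in ae_ratios if abs(r) >= threshold)
    return k, n, binom_upper_tail(k, n, p0)

# Hypothetical AE ratios for ten heterozygotes at one aeSNP (four exceed the threshold)
ratios = [0.9, -0.7, 0.1, 0.2, 1.2, -0.1, 0.3, 0.65, 0.0, -0.2]
k, n, p = dae_proportion_pvalue(ratios)
print(k, n, round(p, 4))  # 4 10 0.0128
```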

Validation of nine daeSNPs was performed by TaqMan® PCR technology, as described previously25, in 25 independent normal breast tissue samples, with a variable number of heterozygous individuals per SNP, using the following TaqMan® Genotyping Assays predesigned by Applied Biosystems: C___8354687_10; C__29939330_20; C__31232634_10; C___3133316_10; C__11844169_10; C___2627792_10; C___1517694_1_; C____787630_20; C___3108259_10. The same prop.test was applied to confirm the presence of differential allelic expression.

Genotype imputation

Imputation was run on the Illumina Exon 510 Duo germline genotype data from the 64 samples that passed microarray quality control filters. Before imputation, quality control was applied to the genotyping data, and SNPs with call rates < 85%, minor allele frequency < 0.01, or Hardy–Weinberg equilibrium test p-value < 1.0E−05 were excluded from the analysis. Imputation was performed using MACH1.026 and the phased haplotypes for HapMap3 release (HapMap3 NCBI Build 36, CEU panel: Utah residents with Northern and Western European ancestry) as a reference panel. We applied the recommended two-step imputation process: model parameters (crossover and error rates) were estimated before imputation using all haplotypes from the study subjects and running 100 iterations of the Hidden Markov Model (HMM) with the command options --greedy and -r 100. Genotype imputation was then carried out using the model parameter estimates from the previous round with the command options --greedy, --mle and --mldetails specified. Imputation results were assessed by the platform-specific measure of imputation uncertainty for each SNP (rq score) and filtered for an rq score ≥ 0.3, as suggested on the author webpage (http://csg.sph.umich.edu/abecasis/mach/tour/), and MAF ≥ 0.01.

This imputation was conducted using HapMap3, which was the most widely used and trusted reference dataset for populations of European ancestry at that time. Since then, more comprehensive reference panels such as the 1000 Genomes Project and the Haplotype Reference Consortium (HRC) have become available, providing improved accuracy. Consequently, for subsequent analyses, including proxy retrieval, we utilized these updated reference panels to enhance the robustness of our findings.

Ancestry analysis

Genetic similarity, as a proxy for global ancestry, was determined by Principal Component Analysis on imputed genotypes from 64 samples, as implemented in PLINK (v.2.00a6LM) using 1000 Genomes Project data (release 20181203) as reference. Data wrangling was performed using bcftools (v.1.10.2, htslib v1.10.2–3).

daeQTL mapping

daeSNPs are not necessarily the regulatory variants (rSNPs) responsible for the observed allelic imbalance. Therefore, mapping of candidate rSNPs associated with the observed DAE, henceforth designated as daeQTLs (differential allelic expression quantitative trait loci) (Fig. S1, Fig. 1), was performed to identify neighboring variants that are statistically associated with the AE ratios observed at the daeSNPs. This approach took into consideration the pattern of AE ratio distribution displayed by each daeSNP. This pattern is highly dependent on the LD between the daeSNP and the rSNP acting upon the gene27.

To test the association between candidate SNP zygosity and the allelic expression of a daeSNP, henceforth designated daeQTL analysis (differential allelic expression quantitative trait loci analysis), we considered the pattern of the allelic expression (AE) ratio distribution displayed at each daeSNP, as this is dependent on the linkage disequilibrium between the daeSNP and the rSNP. When a single rSNP is in strong LD (r2 ~ 1) with the daeSNP, the normalised AE ratios for all heterozygotes will be unidirectional, with all samples preferentially expressing the same allele (i.e., all samples exhibiting either positive AE ratios or negative AE ratios). In this case, a one-sample Wilcoxon test was used to compare the mean normalised AE ratios for samples heterozygous for the candidate rSNP to 0.
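The one-sample case can be illustrated with an exact signed-rank test written in plain Python. This is a sketch under simplifying assumptions (no zero values and no tied absolute values), not the R routine used in the study; the function name and the example ratios are hypothetical.

```python
def signed_rank_exact_pvalue(values):
    """Two-sided exact signed-rank test of symmetry about 0.

    Assumes no zeros and no tied absolute values. Returns (W+, p-value).
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: abs(values[i]))
    # W+ is the sum of the ranks (by absolute value) of the positive observations
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if values[i] > 0)
    max_sum = n * (n + 1) // 2
    # counts[s] = number of subsets of the ranks {1..n} whose sum equals s
    counts = [0] * (max_sum + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    extreme = max(w_plus, max_sum - w_plus)
    upper = sum(counts[extreme:]) / 2 ** n
    return w_plus, min(1.0, 2 * upper)

# Hypothetical AE ratios at a daeSNP for six samples heterozygous at a candidate rSNP
w, p = signed_rank_exact_pvalue([0.9, 1.1, 0.7, 1.3, 0.8, 1.0])
print(w, p)  # 21 0.03125
```

With all six ratios positive, the statistic takes its maximum value, which is the unidirectional pattern expected when the candidate rSNP is in strong LD with the daeSNP.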

When the rSNP is not in r2 ~ 1 with the daeSNP, the distribution of the AE ratios will depend on the rSNP-daeSNP haplotypes present in the analysed samples. In this scenario, we applied a two-sample Wilcoxon test with the alternative hypothesis that the absolute AE ratios in the samples heterozygous for the candidate rSNP are higher than the absolute AE ratios in the samples homozygous for the tested rSNP. These tests were performed for rSNP-daeSNP pairs located within 500 kb of each other. p-values were adjusted with the Benjamini‒Hochberg method28, using all daeSNP/tested SNP pairs, with the distance between them as a covariate (package ihw, R)29 and reported as significant when the false discovery rate was below 5%.
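The multiple-testing step can be sketched with an unweighted Benjamini-Hochberg adjustment in Python. This illustration deliberately omits the covariate-based weighting (distance between SNPs) described above, so it is a simplification rather than the study's procedure; the function name and p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for four daeSNP/candidate SNP pairs
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
significant = [q < 0.05 for q in adj]
print([round(q, 4) for q in adj])  # [0.04, 0.0533, 0.0533, 0.2]
print(significant)                 # [True, False, False, False]
```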

Breast cancer GWAS data retrieval

One thousand and eleven GWAS-significant risk-associated SNPs for BC published until April 2018 were retrieved from the NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog)30 using the gwasrapidd R package31. We applied a suggestive p-value threshold (p ≤ 1.0E−05) to capture a broader set of potentially significant variants, including those identified in meta-analyses and replication studies, to ensure the inclusion of biologically relevant findings. This approach aimed to provide a comprehensive and robust analysis. Additionally, we included only studies involving European populations and the following reported traits: “Breast cancer”, “Breast cancer (early onset)”, “Breast cancer (estrogen-receptor negative)”, “Breast cancer (male)”, “Breast cancer in BRCA1 mutation carriers”, “Breast cancer in BRCA2 mutation carriers”, “Breast cancer male”, and “Breast cancer and/or colorectal cancer”. The complete list of SNPs is presented in Table S1.

Proxy SNP retrieval

Variants in LD with index SNPs were retrieved from Ensembl32 using the function get_ld_variants_by_window() from the ensemblr R package (https://github.com/ramiromagno/ensemblr) using the 1000 GENOMES project data (phase_3) for the CEU population and a genomic window size of 500 kb (250 kb upstream and downstream of the queried variant). The r2 cut-off used varied between 0.2 and 0.95 depending on the analysis and is indicated in each analysis description.
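For readers less familiar with the LD measure used for these cut-offs, the sketch below computes r2 between two biallelic variants from phased haplotypes. It only illustrates the statistic; in the study, LD values were retrieved from Ensembl rather than computed locally, and the function name and toy haplotypes are hypothetical.

```python
def ld_r2(haplotypes):
    """r2 between two biallelic variants.

    haplotypes: list of (allele_at_variant_1, allele_at_variant_2), alleles coded 0/1.
    """
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n
    p_b = sum(h[1] for h in haplotypes) / n
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    d = p_ab - p_a * p_b                       # coefficient of disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r2 is undefined when a variant is monomorphic")
    return d * d / denom

# Toy example with eight haplotypes
haps = [(1, 1)] * 3 + [(1, 0)] + [(0, 0)] * 4
r2 = ld_r2(haps)
print(round(r2, 3))           # 0.6
print(r2 >= 0.4, r2 >= 0.95)  # True False
```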

Annotation of variants

Variants were annotated according to hg38/GRCh38 with biomaRt v 2.40.5. aeSNP consequence types were categorised as follows: UTR if classified as 3_prime_UTR_variant or 5_prime_UTR_variant; coding if classified as coding_sequence_variant, incomplete_terminal_codon_variant, missense_variant, stop_retained_variant, synonymous_variant, stop_lost, start_lost, stop_gained, splice_region_variant, splice_acceptor_variant or splice_donor_variant; intronic if classified as intron_variant; and noncoding_transcript_variant if classified as noncoding_transcript_variant, noncoding_transcript_exon_variant or mature_miRNA_variant. We classified aeSNPs further according to gene biotype as follows: pseudogene if located in IG_C_pseudogene, processed_pseudogene, transcribed_unprocessed_pseudogene, transcribed_unitary_pseudogene, translated_unprocessed_pseudogene, unprocessed_pseudogene, unitary_pseudogene, transcribed_processed_pseudogene, polymorphic_pseudogene or rRNA_pseudogene; protein-coding gene if located in protein_coding, IG_V_gene, TR_C_gene, TR_J_gene, TR_V_gene or TEC; and noncoding_rna if located in lncRNA, miRNA, misc_RNA, snRNA, snoRNA, scaRNA or ribozyme.
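The consequence-type grouping described above can be expressed as a simple lookup. The Python sketch below mirrors the categories listed in the text; the fallback label "unannotated" for terms outside these lists and the function name are assumptions made for illustration.

```python
CONSEQUENCE_CLASSES = {
    "UTR": {"3_prime_UTR_variant", "5_prime_UTR_variant"},
    "coding": {
        "coding_sequence_variant", "incomplete_terminal_codon_variant",
        "missense_variant", "stop_retained_variant", "synonymous_variant",
        "stop_lost", "start_lost", "stop_gained", "splice_region_variant",
        "splice_acceptor_variant", "splice_donor_variant",
    },
    "intronic": {"intron_variant"},
    "noncoding_transcript_variant": {
        "noncoding_transcript_variant", "noncoding_transcript_exon_variant",
        "mature_miRNA_variant",
    },
}

def classify_consequence(term):
    """Map a consequence term to the broader class used for aeSNPs."""
    for label, terms in CONSEQUENCE_CLASSES.items():
        if term in terms:
            return label
    return "unannotated"  # assumed fallback label, not stated in the text

print(classify_consequence("missense_variant"))     # coding
print(classify_consequence("5_prime_UTR_variant"))  # UTR
```

The gene biotype grouping could be handled with a second dictionary of the same shape.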

To test whether classes of consequence type and gene biotype were overrepresented (i.e., enriched) in the list of daeSNPs, we applied two-tailed Fisher’s exact tests. Information from imprinted genes was retrieved from a comprehensive study of genomic imprinting in the breast33 and from the geneimprint database (http://www.geneimprint.com) searching for Imprinted Genes: by Species: Human.
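A two-tailed Fisher's exact test on a 2 × 2 table can be sketched in plain Python as below. This is an illustrative implementation that sums the probabilities of all tables no more likely than the observed one; the function name, the small relative tolerance and the example counts are hypothetical and do not come from the study.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-tailed Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)

# Toy table: rows = daeSNP / non-daeSNP, columns = in class / not in class
print(round(fisher_exact_two_sided(3, 1, 1, 3), 4))  # 0.4857
```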

Retrieval of previously suggested BC target genes

Genes previously suggested as targets of cis-acting regulatory variation in post-GWAS studies for BC, with extensive fine-scale mapping and in silico prediction or functional analysis, and those classified as Inquisit 1 by Fachal et al.4 are indicated in Table S2.

GTEx eQTL and gene expression data retrieval

The Genotype-Tissue Expression (GTEx) project identified expression quantitative trait loci (eQTL) using normal mammary tissue samples34. eGenes (genes with at least one SNP in cis significantly associated, at a false discovery rate (FDR) of ≤ 0.05, with expression differences of that gene) and significant variant-gene associations based on permutations were downloaded from GTEx Analysis V8 (dbGaP Accession phs000424.v8.p2, available on 18/07/2019).

All SNP-gene associations tested for breast mammary tissue, including nonsignificant and gene expression levels (TPM), were downloaded from GTEx Analysis V7 (available on 2016-01-15).

Comparison of daeGenes, eGenes and gwasGenes

First, the list of publicly available eGenes was compared with the daeGenes identified in our study, restricting this comparison to genes analysed in both datasets. Then, we investigated the percentage of gwasGenes, defined as genes containing variants in moderate to strong LD (r2 ≥ 0.4) with GWAS index SNPs, displaying evidence of cis-regulation by either DAE or eQTL analysis.
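The comparison amounts to set operations on gene identifiers, as in the hedged Python sketch below. It assumes that the gwasGene percentage is computed among genes analysed in both datasets, which may differ from the exact denominator used in the study; the function name and gene labels are placeholders.

```python
def compare_gene_sets(dae_genes, e_genes, gwas_genes, tested_in_both):
    """Overlap of daeGenes, eGenes and gwasGenes among genes tested by both analyses."""
    universe = set(tested_in_both)
    dae = set(dae_genes) & universe
    egenes = set(e_genes) & universe
    gwas = set(gwas_genes) & universe
    supported = gwas & (dae | egenes)  # gwasGenes with DAE and/or eQTL evidence
    return {
        "dae_only": dae - egenes,
        "egene_only": egenes - dae,
        "both": dae & egenes,
        "gwas_supported": supported,
        "gwas_supported_pct": 100.0 * len(supported) / len(gwas) if gwas else 0.0,
    }

# Placeholder gene labels
result = compare_gene_sets(
    dae_genes={"G1", "G2", "G3"},
    e_genes={"G3", "G4"},
    gwas_genes={"G2", "G4", "G5", "G7"},
    tested_in_both={"G1", "G2", "G3", "G4", "G5", "G6"},
)
print(sorted(result["gwas_supported"]))  # ['G2', 'G4']
```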

Functional characterisation of candidate risk SNPs

daeQTLs in moderate to strong linkage disequilibrium (LD) (r2 ≥ 0.4) with GWAS index SNPs were defined as risk-daeQTLs. These variants, along with their proxies (r2 ≥ 0.95), were considered candidate risk rSNPs in the subsequent analysis. Candidate risk rSNPs were examined for regulatory potential by assessing the overlap of the variant’s location with epigenetic marks derived from the ENCODE35 and NIH Roadmap Epigenomics project data36 using the R package haploR (https://github.com/cran/haploR). Candidate causal variants (variants overlapping with DNase I hypersensitivity sites and H3K4me1 or H3K4me3 histone modifications in normal breast or breast tumour cell lines) at the 5q14.1–14.2 locus were further analysed regarding their genomic context and transcription factor (TF) binding using the UCSC Genome Browser37,38, HaploReg v4.139 and RegulomeDB v1.140 tools. Emphasis was given to overlap with TF binding identified in breast myoepithelial cells (BR. MYO, E027), human mammary epithelial cells (HMECs, E119), variant human mammary epithelial cells (vHMECs, E028) and two BC cell lines (MCF-7 and T47D). Allele-specific epigenetic modifications (H3K4me3 and DNase I), RNA polymerase II (POL2) and TF binding with alignment data available in HMEC, MCF-7 and MCF-10A breast cell lines from ENCODE were retrieved and visualised using the Integrative Genomics Viewer (IGV Version 2.3.71) tool41, to analyse protein‒DNA interactions and allelic preferential binding. Differential allelic binding was analysed in heterozygous candidate risk rSNPs located within TF binding peaks in experiments with a read coverage at the SNP site ≥ 20. We applied a two-tailed binomial test with the null hypothesis assuming no bias (balanced binding of the protein to the two alleles of the variant). The p-value was corrected for multiple testing using the R package qvalue42. When multiple tracks for the same SNP, trait and cell line existed, only the p-value for the experiment with higher total read counts was reported in the main manuscript.
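The allelic binding test can be illustrated with an exact two-tailed binomial test on allele-specific read counts. The Python sketch below applies the coverage filter stated above and leaves out the subsequent q-value correction; the function name and read counts are hypothetical.

```python
from math import comb

MIN_COVERAGE = 20  # minimum read coverage at the SNP site, from the text

def allelic_binding_test(ref_reads, alt_reads, min_coverage=MIN_COVERAGE):
    """Two-tailed exact binomial test against balanced binding (p = 0.5).

    Returns None when coverage at the SNP site is below the minimum.
    """
    n = ref_reads + alt_reads
    if n < min_coverage:
        return None
    k = max(ref_reads, alt_reads)
    # For p = 0.5 the distribution is symmetric, so double the upper tail
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical read counts at one heterozygous candidate risk rSNP
print(round(allelic_binding_test(5, 15), 4))  # 0.0414
print(allelic_binding_test(3, 10))            # None (coverage below 20)
```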

Analysis related to alternative transcription at the 5q14.1–14.2 locus was carried out in three ways. First, sQTLseekeR (v1.4)43 was used to test the association of genetic variants with alternative isoform expression in both normal breast and tumour tissue using total read counts derived from RNA-seq data from the TCGA (TCGA-BRCA, hg19) and GTEx (phs000424.v6.p1, hg38) projects. Only ATG10 displayed sufficient alternative transcription dispersion to allow sQTL analysis. Additionally, all SNPs within 5 kb upstream or downstream of ATG10 were included in the analysis, not only the candidate risk rSNPs, to increase the stringency of the association exercise. P-values for all SNPs tested for ATG10 sQTL analysis were controlled for multiple testing using a 5% FDR. Correlation analyses between -log10 (FDR q-value) and LD (r2) with rs7707921 were performed using Pearson’s test. Then, the overlapping of variant location with RNA processing-associated proteins was assessed using CLIP data retrieved from POSTAR2 (http://lulab.life.tsinghua.edu.cn/postar/)44 and from RBP-Var (http://www.rbp-var.biols.ac.cn/)45, which additionally informed on riboSNitch potential46. Finally, allele-specific RBP binding predictions were performed with RBPmap47 using the analysed variant flanking sequence (30 nucleotides on each side, with the variant at index 31) using all available human RBP motifs.

Haplotype analysis

Haplotypes in the 5q14.1–14.2 region were analysed on Haploview 4.2 using the imputed genotypes from the 64 normal breast tissue samples48. For candidate risk SNPs whose genotype was not possible to determine (because it was neither genotyped nor imputed), a proxy SNP in strong LD (r2 ≥ 0.95) was used instead. Haplotype blocks were generated using the default algorithm.

TCGA-BRCA gene expression analysis

Processed gene expression and isoform expression from RNA-Seq data for 113 normal solid tissues and 1102 primary solid tumours from the TCGA-BRCA project, together with corresponding clinical data, were retrieved from the Genomic Data Commons archive using the R package TCGAbiolinks49 accessed in October 2018. Isoform expression was annotated according to the genome assembly hg19, and total gene expression was annotated according to hg38. We applied two-sample Wilcoxon tests to compare the mean expression of ATG10 isoforms between normal-solid tissues (normal-matched) and breast tumours, correcting for multiple testing with the Benjamini and Hochberg (BH) procedure. We applied Pearson’s test to correlate gene expression among ATG10, RPS23, and ATP6AP1L. Spearman’s test was applied to correlate ATG10, RPS23, and ATP6AP1L with MYC and MAX gene expression.

Results

Cis-regulatory variation is common in normal breast tissue

Genome-wide allelic expression (AE) analysis was performed using microarray data from 64 normal breast tissue samples. Of these, 61 samples were of European ancestry (EUR), two samples were of American ancestry (AMR) and one sample was of African ancestry (AFR) (Fig. S2). Normalised allelic expression ratios were calculated for SNPs in coding and noncoding regions upon filtering for the cDNA signals’ expression level and allelic discrimination potential. Overall, we identified 91,467 autosomal allelic-expressed SNPs (aeSNPs) located in 21,527 annotated Ensembl genes (median of three aeSNPs per gene) (Fig. S3). Unsurprisingly, the number of aeSNPs analysed per gene correlated with the annotated gene length (rho = 0.60, p-value < 2.2e−16, Fig. S4).

We found that almost one-third of the aeSNPs (26,266 out of 91,467) displayed biallelic differential expression (daeSNPs, q-value ≤ 0.05) (Table 1, Table S3), while 84 SNPs displayed monoallelic expression (maeSNPs). TaqMan PCR validated seven out of nine daeSNPs (Fig. S5) that showed significant DAE and concordant preferential expression (Fisher’s exact test p-value > 0.05).

Table 1 Summary of the genome-wide breast tissue allelic expression analysis results.

The daeSNPs are distributed across the genome, with low interchromosomal variability (ranging from 26 to 35%, Fig. S6). They overlapped 13,688 (65%) annotated genes (daeGenes), of which 3666 (17%) harboured three or more daeSNPs (Fig. 2a, Table 1, Table S3). When considering daeSNPs mapping exclusively to one gene, we identified 8193 daeGenes (out of 12,944) that showed evidence of being under the control of allele-specific cis-acting factors, either genetic or epigenetic. In terms of consistency of DAE detection across the length of these genes, we found that in the majority of daeGenes, the frequency of daeSNPs was higher than 40% (7476 in 13,688), with 3894 daeGenes presenting imbalances in all the analysed aeSNPs (Fig. 2b). The aeSNPs showed a large distribution of mean |AE ratios|, with daeSNPs centred at 0.60 (corresponding to a difference between alleles of 1.5) and non-daeSNPs centred at 0.26 (corresponding to a difference of 1.2). Twelve per cent of daeSNPs showed average absolute AE ratios between 1 and 5, corresponding to average allelic fold changes ranging from 2 to 34 (Fig. 2c, Table S3). The amplitude of the imbalances measured at aeSNPs correlated negatively with the average expression level of both alleles (rho =  − 0.4, p-value < 2.2e−16) (Fig. 2d) but not with the standard deviation across individuals (Fig. S7). The aeSNPs are located mainly in intronic regions and noncoding transcript genes, but non-daeSNPs and daeSNPs showed differences in class distribution for consequence type, with daeSNPs enriched at unannotated regions and depleted at intronic, non-coding regions and UTRs (p-value < 0.05, Fig. 2e). Although most of the aeSNPs analysed were in protein-coding genes, daeSNPs were relatively more common in noncoding genes and pseudogenes when compared to non-daeSNPs and depleted in protein-coding genes (p-value < 0.05, Fig. 2f).

Fig. 2

Characterisation of aeSNPs. (a) Histogram of the rank number of daeSNPs identified per gene across 17,135 annotated genes. (b) Histogram of the rank proportion of daeSNPs per aeSNPs identified per gene. (c) Box plot with the distribution of the mean of the absolute values of AE ratios across heterozygous individuals measured at non-daeSNPs and daeSNPs. (d) Distribution of the mean absolute values of AE ratios at aeSNPs according to the average intensity of both alleles at aeSNPs in the microarray; the results of a Spearman’s correlation test are shown. (e,f) Relative frequency of aeSNPs and daeSNPs according to consequence type and gene biotype, respectively.

Monoallelic expression in breast tissue

Regarding monoallelic expression, maeSNPs were annotated to 44 Ensembl genes (Table 1, Table S4, Fig. S8), the majority of which were previously reported as imprinted in breast tissue (e.g., IGF2 or ZDBF2) or in other tissues (e.g., KCNQ1, KCNQ1OT1, RTL1, NAA60, ZIM2, and L3MBTL1), validating our AE analysis. Interestingly, we detected maeSNPs in a region containing the lncRNA MEG9 and a cluster of miRNA genes that had only previously been reported as imprinted in nonhuman species50,51,52. Additionally, we found unreported monoallelic expression at an intergenic region (22q11.23), suggesting the existence of unannotated transcripts in this region. Notably, we observed two groups of heterozygotes preferentially expressing opposite alleles of rs17122278, an intronic variant of ARCN1, suggesting the latter as a candidate novel monoallelically expressed protein-coding gene in breast tissue.

Mapping of daeQTLs in normal breast tissue

Evidence of DAE supports that a gene’s expression is controlled by cis-regulatory variation, which can be mapped using AE ratios as a quantitative trait, in what we termed DAE quantitative trait loci (daeQTL) analysis. Here, we found a minority of daeSNPs (6928 out of 26,266) for which all the heterozygotes preferentially expressed the same allele. This pattern indicates moderate to strong linkage disequilibrium between the daeSNP and the rSNPs acting on it27. Hence, our mapping approach considered the allelic expression (AE) ratio distribution pattern displayed at each daeSNP, and one-sample or two-sample Wilcoxon tests were applied accordingly. We identified 54,357 daeQTLs (5% FDR) for 6761 (26%) daeGenes (Table S5), primarily located within 20 kb of the corresponding daeSNP but extending as far as the 500 kb window used for the analysis (Fig. 3a). daeQTLs for MROH8 and ZNF132, two coding genes located on chromosomes 20 and 19, respectively, were among the most significant ones found (Fig. 3b,c), but we identified other highly significant daeQTLs (adjusted p-values smaller than 5.0E−04) for 2507 genes.

Fig. 3

Mapping of variants associated with differential allelic expression. (a) Empirical cumulative distribution for the distance between the daeSNP and corresponding mapped daeQTL. (b,c) daeQTL mapping result for the most significant daeQTL identified for MROH8 using a one-sample Wilcoxon test and for ZNF132 using a two-sample Wilcoxon test. The AE ratios calculated at the daeSNPs are represented on the y-axis in the two panels and stratified according to genotype at the candidate SNP (black dots represent heterozygous individuals, and red dots represent homozygous individuals).

Identification of target genes within BC risk loci

To pinpoint the most likely candidate target genes within BC risk loci, a main post-GWAS challenge, we identified the genes within previously reported GWAS loci (gwasGenes) displaying the most robust evidence of being under the control of cis-regulatory variation, provided either by DAE (daeGenes) or eQTL (eGenes) analysis. We found that most gwasGenes (783 out of 948) showed such evidence, with 69% of all gwasGenes showing evidence via DAE analysis (358 genes identified solely by DAE and 300 by both analyses) (Table S6). Compared to all genes studied, gwasGenes presented a significant enrichment of DAE (Fisher’s exact test p-value = 2.48e−05). Finally, we successfully mapped daeQTLs for 385 gwasGenes (Fig. 4, Table S6).

Fig. 4

Summary of colocalisation analyses for daeGenes and risk-daeQTLs. (a) UpSet plot for 15,706 genes tested for DAE and eQTL (GTEx breast mammary tissue). daeGenes genes identified as having differential allelic expression in normal breast tissue; eGenes genes reported as being eQTL genes in GTEx mammary tissue data (q-value ≤ 0.05); gwasGenes genes where GWAS index SNPs or proxies (r2 ≥ 0.4) are located, daeGenes with daeQTL mapping daeGenes for which daeQTLs were identified. (b) UpSet plot for 424 variants located in DHSs, according to the presence of protein binding and location in active promoters and/or enhancers in breast cell lines.

Next, we verified our ability to identify 178 previously proposed breast cancer target genes (Table S2). We found that 44% of these genes were exclusively daeGenes (e.g., ELL, TOX3, RNF115), 23% were both daeGenes and eGenes (e.g., CASP8, POU5F1B, STXBP4) and 14% were exclusively eGenes (e.g., RMND1, HELQ, PRKRIP1). However, we did not find evidence supporting other genes, such as CITED4, IGFBP5 and MYC (Table S2). As total levels of gene expression may confound the ability to identify daeGenes and eGenes, it is noteworthy that eGenes showed higher median levels overall than daeGenes, and only 4.7% of exclusive daeGenes showed low median levels (< 0.1 TPM) (Fig. S9).

Identification of causal variants within BC risk loci

Another post-GWAS challenge we addressed was the identification of the causal variants within risk loci. We first identified 1416 daeQTLs in moderate to strong LD (r2 ≥ 0.4) with GWAS index SNPs (Fig. 1) (GWAS p-value < 1.0E−05), henceforth referred to as risk-daeQTLs. These were distributed across 93 loci in 19 chromosomes, primarily in introns, followed by intergenic regions (Table S7, Fig. S10). Then, we assessed these risk-daeQTLs plus their proxies (r2 ≥ 0.95) for their cis-acting regulatory potential. We started by identifying 425 variants located in DNase I hypersensitivity sites (DHS), of which the majority (69%) mapped to regions with histone marks associated with active regulatory elements (Fig. 4b, Fig. S11). More specifically, 149 risk-daeQTLs localised in both active promoter (H3K4me3 and H3K9ac) and regions with active enhancer-associated (H3K4me1 and H3K27ac) histone marks, 76 localised exclusively in regions with active enhancer-associated marks and another 67 exclusively localised in regions with active promoter-associated marks. Of these, 122 risk-daeQTLs also showed protein binding evidence, thus representing strong candidate causal variants within 41 of the initial 93 BC risk loci (Table S8).

Among these 41 risk loci, we detected 47 novel candidate target genes in 29 loci with no previous report of target genes, such as SMC2 in 9q31.1, MLLT10 in 10p12.32 and MAN2C1 and PTPN9 in 15q24.2. We confirmed previously reported target genes in nine loci and identified eight novel genes, including NASP and IPP in 1p34.1 and ATP6AP1L in 5q14.1. Finally, we identified strong candidate causal variants at three loci but could not discern the target gene due to a lack of genomic annotation (Table 2, Table S9).

Table 2 Loci with candidate risk rSNPs and novel suggested target genes.

Notably, 2222 daeQTLs were also in lower LD with GWAS hits (0.2 ≤ r2 < 0.4), representing a valuable dataset warranting further exploration (Table S10).

Mapping of cis-regulatory risk variants at the 5q14.1–14.2 locus

To further show the potential use of our integrated approach, we focused our follow-up studies on the BC risk locus 5q14.1–14.2, where some of the most significant risk-daeQTLs and candidate causal variants were identified. In this locus, rs7707921 was previously associated with BC risk in two meta-analyses (OR for alternative A allele = 1.07, 95% CI [1.05–1.1], p = 5E−11)9,53. The region containing this intronic variant of ATG10, its proxy variants (r2 ≥ 0.4) and other risk-associated variants reported in this locus spans three genes (ATG10, RPS23, and ATP6AP1L), hindering the identification of the causal variant(s) and their target gene(s) in this locus.

First, all three genes showed DAE, supporting their regulation by cis-regulatory variants: 10 daeSNPs out of 37 aeSNPs at ATG10, one daeSNP out of two aeSNPs at RPS23 and three daeSNPs out of five aeSNPs at ATP6AP1L (Fig. S12). The highest mean |AE ratios| detected at daeSNPs in these genes was 1.27 (2.4-fold) at RPS23, followed by 0.92 (1.9-fold) at ATP6AP1L (Fig. 5—panel 2, Fig. S11). By daeQTL mapping analysis, we identified daeQTLs for all three genes: 56 for ATG10 (spreading along the ATG10-ATP6AP1L region), 4 for RPS23 (limited to RPS23-ATP6AP1L) and 26 for ATP6AP1L (spreading along the ATG10-ATP6AP1L region) (Fig. 5—panels 3 to 5). Additionally, we classified as risk-daeQTLs the 38 ATG10 daeQTLs and 24 ATP6AP1L daeQTLs (22 of which are common to the two genes) in moderate to strong LD (r2 ≥ 0.4) with the risk-associated variants. Furthermore, both ATG10 and ATP6AP1L daeQTL analysis results strongly correlated with the corresponding LD with the GWAS lead-SNP rs7707921 (Fig. S13), further supporting the role of variants regulating the expression of these two genes in the risk for breast cancer.
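The paired values quoted above (1.27 with 2.4-fold, 0.92 with 1.9-fold) are consistent with AE ratios being expressed on a log2 scale. Under that assumption, the conversion to fold difference can be sketched as follows; the function name is hypothetical.

```python
def fold_from_log2(abs_log2_ratio: float) -> float:
    """Convert an absolute log2 allelic expression ratio into a fold difference."""
    if abs_log2_ratio < 0:
        raise ValueError("expected an absolute (non-negative) log2 ratio")
    return 2.0 ** abs_log2_ratio


# Mean |AE ratio| values quoted in the text for the two genes.
rps23_fold = round(fold_from_log2(1.27), 1)
atp6ap1l_fold = round(fold_from_log2(0.92), 1)
```

On the same scale, a 1.5-fold allelic difference corresponds to a log2 ratio of about 0.58.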

Fig. 5

Evidence of DAE and daeQTL analysis at the 5q14.1 BC risk locus. The top track shows the mean values of the absolute AE ratios measured at aeSNPs across the region, with the non-daeSNPs shown in grey, the daeSNPs in black and the daeSNPs with mapped daeQTLs in red. The subsequent tracks show the daeQTL mapping corrected p-values for ATG10, RPS23 and ATP6AP1L.

Cis-regulatory risk variants act via two different mechanisms on genes in the 5q14.1–14.2 locus

The overlap analysis of the risk-daeQTLs with epigenetic marks in breast cell lines identified seven candidate causal variants for ATG10 and ATP6AP1L (Tables S8, S11). We investigated these variants further for allelic differences in transcription factor binding and association with histone modifications and DHSs. One of these SNPs, rs111549985, overlies the active promoter of ATG10 (Fig. S14), and its minor G-allele is preferentially associated with the H3K4me3 modification in HMECs (2.7-fold, p = 3.7e−03) and shows robust preferential binding by POL2 in MCF7 cells (ninefold, p = 4.0E−04). However, DHS was more significantly associated with the major/reference C-allele in T47D cells (0.5-fold, p = 4.6e−05) (Fig. 6a, Table S12). Another two candidate causal variants, rs226198 (intronic to RPS23) and rs6880209 (located at the RPS23 5ʹ UTR), overlie the shared promoter of RPS23 and ATP6AP1L and a predicted enhancer interacting with the ATG10 promoter (Fig. S15). The minor C-allele of rs226198 showed preferential binding by the MYC and MAX transcription factors, which are known to cooperate in cancer54 (12.6-fold and 7.9-fold difference, respectively, p < 2.2e−16) and preferential H3K4me3 marking (2.7-fold, p = 1.4e−02) in MCF7 cells (Fig. 6b, Table S12). It would be interesting to elucidate whether rs226198 impacts the binding of both factors and H3K4me3 deposition or whether this epigenetic mark is a consequence of altered transcription, as previously suggested55,56. The minor T-allele of rs6880209 also showed preferential binding by MYC (4.8-fold, p < 2.2e−16) and MAX (2.4-fold, p = 2.7e−03), with smaller fold-change differences than rs226198, and additional preferential binding by POL2 (2.6-fold, p = 1.27e−06) in MCF7 cells. However, similar to rs111549985, DHS preferentially occurred in the major/reference C-allele in T47D cells (5.3-fold, p = 9.1e−04) (Fig. 6c, Table S12).
Interestingly, the expression of MAX correlated with that of ATG10, RPS23 and ATP6AP1L, and the expression of MYC correlated with the expression of ATG10 (Fig. S16). Furthermore, the expression levels of ATG10 and ATP6AP1L were positively correlated in breast tissue from healthy women (top 2.5% quantile of 500,000 pairwise tests) and in normal-matched tissue from patients with BC (Fig. S17). The observation that ATG10 and ATP6AP1L are in different topologically associating domains (TADs) and that the candidate causal variants rs226198 and rs6880209 fall on the boundary between them (Fig. S18) suggests that the correlated gene expression is driven not by a shared pattern of chromatin condensation but by a shared cis-regulatory sequence.

Fig. 6

Variants at the 5q14.1 risk locus associated with differential transcription factor binding. Allele-specific analysis of the effect of three candidate risk rSNPs—(a) rs111549985, (b) rs226198 and (c) rs6880209—on RNA polymerase II (POL2) and transcription factor (TF) binding, DNase I targeting (DHS) and H3K4me3 modification in different heterozygous cell lines. An asterisk indicates statistically significant imbalances (two-sided binomial test, p-value ≤ 0.05). HMEC human mammary epithelial cells, MCF7 human breast (adenocarcinoma) cell line, T47D human breast tumour cell line, MCF10A human breast epithelial cell line.
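The two-sided binomial test named in this legend can be illustrated with a short sketch. It assumes a null hypothesis of equal read counts from the two alleles at a heterozygous site (p = 0.5); the function name and the read counts are hypothetical, and the implementation used in the study is not described here.

```python
from math import comb


def allelic_binomial_p(ref_reads: int, alt_reads: int) -> float:
    """Exact two-sided binomial p-value for allelic imbalance under p = 0.5.

    With p = 0.5 the distribution is symmetric, so the two-sided p-value is
    twice the probability of the smaller tail, capped at 1.
    """
    if ref_reads < 0 or alt_reads < 0:
        raise ValueError("read counts must be non-negative")
    n = ref_reads + alt_reads
    if n == 0:
        raise ValueError("at least one read is required")
    k = min(ref_reads, alt_reads)
    # Integer arithmetic keeps the tail sum exact even for deep coverage.
    tail = sum(comb(n, i) for i in range(k + 1))
    return min(1.0, 2 * tail / 2 ** n)


# Toy counts (invented): 9 reads on one allele and 1 on the other.
p_imbalanced = allelic_binomial_p(9, 1)
p_balanced = allelic_binomial_p(5, 5)
```

The fold difference reported alongside each p-value would then be the ratio of the two read counts, here 9/1.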

Since genetic variants affecting mRNA decay or alternative splicing57 can cause allelic expression imbalances, we aimed to further explore the role of alternative transcription in regulating gene expression and driving risk at the 5q14.1 locus. To accomplish this, we performed an sQTL analysis for ATG10 that was not restricted to the candidate risk rSNPs but included all SNPs located within 5 kb upstream and downstream of ATG10 to increase the stringency of the analysis.

We identified six sQTLs (FDR ≤ 5%) in the tumour data, whose minor alleles were associated with changes in the expression of two protein-coding isoforms: decreased expression of ENST00000458350 (one extra exon) and increased expression of ENST00000282185 (longer 3ʹ UTR) (Fig. 7, Fig. S19a, Table S13). Interestingly, ENST00000282185 is expressed at significantly lower levels in tumours than in normal-matched tissue, in line with the reported oncogenic effect of UTR length58, although with a small effect size (fold-change = 1.20) (Fig. S20). The strong correlation between sQTL q-values and LD with the lead GWAS SNP rs7707921 (r = 0.94, p-value = 3.15E−12, Fig. S18b) supports the contribution of alternative transcription of ATG10 to BC risk. Although no sQTL was detected for ATG10 in normal breast data (Table S13), sQTL nominal p-values and LD with rs7707921 were still correlated in normal-matched breast samples (r = 0.59, p-value = 0.002) (Fig. S21). ATP6AP1L did not display sufficient alternative transcription dispersion to allow the sQTL analysis. Subsequent functional analysis of the ATG10 sQTLs and their proxy SNPs (LD r2 ≥ 0.95) predicted that rs111549985 (5ʹ UTR) and rs6884232 (3ʹ UTR) each cause a riboSNitch (a functional RNA structure disrupted by an SNP46). Although RBP binding data for breast tissue do not exist, these variants have been reported to disrupt the binding of Xrn2 (involved in termination by RNA polymerase II) and of Igf2bp1 (a translation regulator) in K562 cells (Tables S14, S15), which would require confirmation in breast cells.
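The correlation between sQTL significance and LD with the lead SNP can be sketched as a Pearson correlation. This is an illustration under stated assumptions: the text does not specify how q-values were transformed, so the −log10 transform is assumed here, and the function names and toy values are hypothetical.

```python
from math import log10, sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def signal_vs_ld(q_values: Sequence[float], ld_r2: Sequence[float]) -> float:
    """Correlate -log10(q) of each tested SNP with its LD (r2) to the lead SNP."""
    return pearson([-log10(q) for q in q_values], ld_r2)


# Toy values (invented): stronger sQTL signal at SNPs in higher LD.
toy_q = [1e-1, 1e-2, 1e-3]
toy_r2 = [0.1, 0.2, 0.3]
toy_r = signal_vs_ld(toy_q, toy_r2)
```

A coefficient close to 1 indicates that the SNPs most strongly associated with isoform ratios are also those in highest LD with the GWAS lead SNP.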

Fig. 7

Variants at the 5q14.1 risk locus associated with alternative transcription. Six sQTLs in high LD with rs7707921 were identified for ATG10. The − log10(q-value) for the sQTL analysis (y-axis) is shown for the 5q14.1–14.2 region (hg38). Colour intensity represents the LD (r2) between the analysed variants and the GWAS lead SNP rs7707921. Below are two ATG10 transcripts whose expression ratios are associated with the sQTLs.

Risk model for 5q14.1 links higher expression of ATG10 and ATP6AP1L with protection against BC

Haplotype analysis of the samples included herein revealed two common haplotypes: one harbouring the major alleles of all proposed risk-rSNPs and the GWAS lead SNP rs7707921 (frequency of 71.1%) and another with the corresponding minor alleles (frequency of 21.9%) (Fig. S22). The proposed risk-rSNPs are among the most significant eQTLs for the two genes: rs111549985 for ATG10 and rs6880209 for ATP6AP1L (Fig. S23)59. Therefore, the most common haplotype is associated with an increased risk for BC and lower expression of ATG10 and ATP6AP1L (Fig. S22).

Our proposed model for risk at 5q14.1 (Fig. 8) posits that the minor alleles of rs111549985, rs226198, and rs6880209 confer protection against BC by increasing (1) the binding of POL2 to the promoter of ATG10 (driven by rs111549985), (2) the binding of POL2 to the shared promoter of RPS23/ATP6AP1L (driven by rs6880209), and (3) the binding of MYC and MAX to a regulatory region (possible enhancer) (driven by rs226198), thereby increasing the expression of ATG10 and ATP6AP1L. These results reveal a complex regulatory landscape at the 5q14.1–14.2 locus, with multiple independent causal variants.

Fig. 8

Complex risk regulatory landscape of the 5q14.1 locus. (a) Levels of expression of ATG10 and ATP6AP1L genes differ between the haplotypes containing either the minor alleles of rs111549985, rs226198 and rs6880209 (above) or the major ones (below). Coloured arrows indicate the direction of transcription of the individual genes, the saturation of the corresponding colours indicates the strength of protein binding, the number of green circles indicates the level of H3K4me3 and the coloured curvy lines indicate the relative levels of transcript produced. (b) Schematic representation of the proposed model for the positive correlation between ATG10 and ATP6AP1L via a shared regulatory region.

Discussion

Here, we present the first genome-wide map of genes showing differential allelic expression (daeGenes) and their genetic determinants (daeQTLs) in normal breast tissue. We found widespread differential allelic expression (DAE) across the genome and identified daeQTLs for 26% of daeGenes. By intersecting this map with GWAS data, we identified risk-daeQTLs and target genes for 93 BC risk loci. Next, we retrieved epigenetic regulatory annotations on all candidate risk-rSNPs (risk-daeQTLs and their proxies in high LD) to prioritise variants with regulatory potential for further functional analysis. We identified 406 variants with strong regulatory potential annotated to 55 different chromosomal bands and candidates for regulating the expression levels of 96 genes. Our results represent a practical and valuable resource for prioritising GWAS loci for follow-up. As a proof of concept, we functionally characterised the 5q14.1–14.2 BC risk locus in depth and proposed four causal regulatory variants targeting the genes ATG10 and ATP6AP1L acting via multiple allele-specific mechanisms. Our results suggest a complex regulatory landscape underlying BC aetiology.

We show that cis-acting variants regulate the expression of 65% of genes in normal breast tissue, with some genes displaying extreme allelic differences of up to 32-fold. Notably, we identified a novel gene with monoallelic expression, ARCN1, which warrants further inspection to confirm its imprinting status. An enrichment of daeSNPs at intergenic and intronic regions, as well as noncoding transcripts, noncoding genes, and pseudogenes, concurs with previous reports of predominant allelic imbalances of expression at gene-depleted regions and genes under fewer evolutionary constraints60,61.

To overcome the lack of phasing information, we applied two different tests in the daeQTL mapping, according to the AE ratio distribution, which led to the identification of 54,357 variants associated with AE ratios for 6761 genes, both protein-coding and noncoding. The stringent statistical correction and the use of distance as a covariate in the second mapping approach increased its confidence level but limited the statistical power to identify regulatory variants in lower LD with the daeSNP or located more distally.

We found evidence of expression regulation by cis-acting variants for most reported GWAS loci and believe that alternative mechanisms are at play in the remainder. We identified risk-daeQTLs at 93 different loci, including 72 loci with novel candidate risk target genes (including NEK10 at 3p24.1 and ZBED6 and ZC3H11A at 1q32.1). Moreover, the initial daeQTL map in normal breast tissue can be further mined whenever new risk variants are identified through GWAS. These results offer a resource platform for functional studies of causal variants and target genes and can help uncover the role of cis-regulatory variation in BC risk.

Finally, we conducted an in silico functional analysis of the 5q14.1–14.2 BC risk locus and identified three strong candidate causal variants: rs111549985, rs226198, and rs6880209. We predict that these variants functionally impact TF binding, chromatin state, and gene expression levels of ATG10 and ATP6AP1L. A similar involvement of diverse regulatory mechanisms has been suggested previously for other BC risk loci4,62,63. Both ATG10 (involved in autophagy) and the ATP6AP1L pseudogene have been suggested to have roles in cancer64,65,66,67. A variant at ATG10 (rs7313473) was previously associated with BC risk by regulating promoter activity, and ATG10 was suggested to act as a tumour suppressor gene in breast tissue68. For ATP6AP1L, another variant (rs10514231) was reported to lead to ATP6AP1L downregulation by decreasing the binding affinity of TCF7L2 in an intronic regulatory region67. Although we did not find supporting evidence for the same variants, our results show an indirect association between the lower expression of ATG10 and ATP6AP1L and BC risk, suggesting that the downregulation of these two genes may contribute to tumorigenesis.

The advantages of our analysis compared to previous reports of AE in normal breast and tumour tissue16,18,19,69 include the use of the largest number of normal breast tissue samples, the genome-wide approach, and the mapping of candidate regulatory variants. We found a similar frequency of daeSNPs to previous reports in other tissues/cell lines but a higher frequency of daeGenes13,17,18,70. This higher frequency of daeGenes could be due to our ability to identify genes regulated by common cis-acting variants with weak to large effect sizes19, a consequence of the conditions imposed to call DAE (an allelic difference of 1.5-fold and a minimum number of heterozygotes). Additionally, we did not integrate the AE ratios of multiple daeSNPs in the same gene due to the absence of phase data and to maximise the information extracted from daeSNPs that might be located in different LD blocks. The complex regulatory landscape we identified at the 5q14.1 locus, with multiple cis-acting variants located in the same haplotypes and AE likely resulting from the sum of the effects of each variant, supports this analysis approach. Furthermore, we propose that a global measure of the AE imbalance at each gene would impair the mapping of daeQTLs at individual daeSNPs and restrict the analysis to genes with multiple daeSNPs. Finally, besides the more commonly studied protein-coding genes, we analysed noncoding genes and pseudogenes, such as ATP6AP1L.

Our results confirm the advantage of using DAE analysis to detect the effect of rSNPs compared to eQTL analysis, as shown by the higher number of daeGenes than eGenes among gwasGenes71,72,73. As a minority of gwasGenes were exclusively eGenes, we believe that DAE and eQTL analyses are complementary and should be used in parallel when possible.

Our use of microarray data could be seen as a limitation compared to RNA-seq data, which have more extensive transcriptome coverage and high quantification accuracy for more extreme allelic imbalances. However, microarrays are a widely used and precise technology for measuring AE13,16,22, as we confirmed by validating the monoallelic expression of known imprinted genes and by independent PCR analysis. The only publicly available RNA-seq dataset with normal breast tissue is from the GTEx project. However, our approach presents several advantages: (1) we processed and hybridised the DNA and RNA samples in parallel to minimise technical issues, (2) we used total RNA, which includes coding/noncoding genes and spliced/unspliced transcripts, and (3) we showed that the range of gene expression levels of daeGenes was comparable to that of the eGenes from the GTEx dataset. The next steps will be to carry out matched RNA-seq and DNA-seq to combine all the advantages mentioned above and expand the discovery of daeGenes and rSNPs.

While our approach of defining risk-daeQTLs based on LD with GWAS index SNPs provides valuable insights and a potential link, it does not establish causality. Future work should address this limitation by performing co-localization analysis for all identified risk-daeQTL loci.

Our study predominantly analysed samples from individuals of European ancestry, and we combined these results with GWAS findings that included European populations for consistency. However, leveraging a multi-ancestry study design in future research is essential. Different ancestral populations possess distinct genetic backgrounds, unique sets of genetic variants and varying frequencies of shared variants. Incorporating data from multiple ancestries can broaden our understanding of the genetic and regulatory mechanisms underlying breast cancer risk19,74.

Here, we provide a genome-wide list of variants with strong regulatory potential for normal breast tissue, a valuable resource for researchers prioritising GWAS results for functional characterisation and those interested in other BC-related traits. The extensive characterisation of the regulatory landscape at the 5q14.1 BC risk locus identified candidate causal variants and revealed the multiple mechanisms involved. Further studies of this locus will elucidate the mechanisms involved and the relative contributions of each variant and target gene to the genetic risk. Overall, our results reinforce the importance of cis-regulatory variation as a major player in BC susceptibility and the power of identifying these variants in the disease’s tissue of origin, normal breast tissue. They also show that multiple causal variants may co-occur and act via independent cis-regulatory mechanisms at BC risk loci, supporting a broader approach to functional studies.