Background

Somatic mutations in DNA binding sites recognized by transcription factors [1, 2] can alter regulator binding affinities and expression of target genes [3], often leading to malignant cell transformation. Affinity changes can be directly associated with cancer progression with a striking example of a GABP (ETS-family factor) binding site emerging in TERT promoter [4] associated with progression of different tumors [5, 6]. A number of other promoter mutations have been linked with an increased risk of cancer [7, 8], and the number is expected to grow rapidly along extensive sequencing of complete cancer genomes. For instance, in the recent study of regions associated with a risk of epithelial ovarian cancer [9] out of nearly three hundreds significant single-nucleotide polymorphisms only two were found in protein coding regions, whereas 25 were localized in transcription factor binding sites. Likewise, cancer drivers identified in knockdown experiments [10] not necessarily carry mutations in coding regions, thus underlining the importance of regulatory mutations modifying gene expression.

Frequencies of synonymous and non-synonymous substitutions allow studying selection of cancer somatic mutations in protein-coding regions [11]. For non-coding regions estimates of selection pressure can be based on functional annotation of sequence variants. In particular, DNA sequence motifs recognized by the transcription factors [12] usually have strict and degenerate positions [13]. Assessment of potential binding affinity changes induced by substitutions allows studying selection of sequence variants in binding sites in a way resembling usage of non-synonymous and synonymous substitutions in codons.

Somatic mutations often tend to destroy binding sites [3, 14] of specific transcription factors reflecting positive selection of variants. Conversely, binding sites of other transcription factors were reported to avoid mutational changes [14], but the significance of negative selection pressure acting at somatic mutations remains controversial [3, 14, 15].

Mutations in cancer cell lineages are strongly context-dependent [16]; thus, mutation signatures of different cancer types should be properly taken into account to avoid statistical bias. Here, we used genome wide data [16] on several cancer types with different mutation signatures to study the frequencies of somatic mutations that alter binding sites for specific transcription factors.

Results

Assessing selection pressure on transcription factor binding sites

To study selection pressure on gene regulatory regions we used mutation sets from different cancer samples grouped by the tissue [16]. First, we selected mutations in putative regulatory regions (intronic and promoter genomic segments), which made up to 50 % of total mutation calls (Additional file 1: Table S1). Then, we mapped transcription factor binding sites [17] in small windows centered at the mutation sites (see Methods, predicted binding sites were allowed to be located in the vicinity of but not necessarily overlap the mutated base). For a particular tested motif, on average 5 ± 3 % of tested windows included binding sites predictions (see Additional file 2: Table S2 for complete data), but there were exceptional cases with deviating prediction rates.

In this study we considered local windows centered at the mutated bases. In each window we considered only predicted transcription factor binding sites that were located not farther than 10 bp from the mutation. This allowed distinguishing mutations that substantially changed motif affinity (substitution in a core motif position) from those only weakly affecting binding (substitution in a weak motif position or in a flanking position). At the same time, such local analysis allowed to set aside vast fluctuations in mutation rate between different genome domains.

Putative affinity changes were estimated for the mutated allele versus the germline allele [18]. We separately considered both directions of affinity changes that can be caused by a nucleotide substitution in a binding site: the affinity loss (disruption of a binding site predicted for the germline allele) and gain (improvement or emergence of a binding site with stronger prediction for the mutated allele).

To evaluate the selection pressure we compared the observed frequency of mutations substantially changing the binding site affinity with the expected frequency estimated from simulated control data.

We used two different control data sets: (1) the shuffle control consisting of sequences with randomly shuffled nucleotides around the actual mutated bases, similar to that in previous studies [3] but controlling the mutation context (the germline and mutated nucleotides and the proximal 5' and 3' nucleotides); and (2) the genomic control consisting of randomly sampled segments of promoter and intronic regions not overlapping the cancer mutation-centered windows (see Methods for details).

To account for specific mutation signatures of different cancer types (see Additional file 3: Figure S1), binding sites predictions in both shuffle and genomic controls were sampled to equalize the resulting distribution of mutation contexts of given control data to match the cancer mutations data, separately for each cancer type.

Finally, we identified binding motifs that significantly and consistently exhibited an exceptional rate of mutation-induced affinity changes versus both control data sets with equalized contexts distribution (FDR-corrected two-tail Fisher’s P < 0.05, see Additional file 2: Table S2 and Additional file 4: Table S3 for complete data, see Methods for details and Fig. 1 for the overall workflow).

Fig. 1
figure 1

Procedure used to evaluate affinity change events and estimate significance of difference between observed and expected frequencies. Top panel: prediction of binding sites in cancer and control data and evaluation of affinity change events. Bottom panel: binding sites predictions and affinity change events of AP-2 motifs; an example of 2 × 2 contingency table used to compute Fisher’s exact test P-value

Limited magnitude of selection pressure requires high statistical power

For each transcription factor binding motif we estimated magnitude of selection pressure on somatic mutations overlapping the predicted binding sites. The selection pressure magnitude was defined as the ratio of the observed and expected frequencies of mutation-induced affinity changes assessed for the somatic mutations (observed) and the simulated control data (expected), respectively (see Methods). The typical values of the selection pressure magnitude were around 0.9–0.95 (negative selection) and 1.05–1.1 (positive selection, see Fig. 2) and were in a similar range for mutations causing the affinity gain or the affinity loss.

Fig. 2
figure 2

Selection magnitude for affinity loss and gain of ETS, FOX and C/EBP motifs in different cancer types. X-axis displays the selection magnitude for motif affinity loss (a) or gain (b) caused by somatic mutations. Box-plots are provided for ETS-related (14 motifs), FOX (13 motifs), C/EBP-related (9 motifs) and NR3 (Steroid hormone receptors, 11 motifs) transcription factor families in three cancer types with the largest numbers of mutation calls. In particular, C/EBP motifs display frequent affinity loss in breast cancer, FOX and NR3 motifs are protected from both the affinity loss and gain in lung adenocarcinoma and breast cancer, and ETS motifs tend to emerge in all three cancer types (breast, lung and liver). Data for two control datasets (shuffle, genomic) are shown

With ratios expressing selection pressure magnitude so close to 1, a large data volume was necessary to attain acceptable statistical significance. In particular, the simulated control sets were several times larger than the initial cancer data set, especially for cancer types with less called mutations. The most robust observations were made on cancer types with the highest mutation counts and thousands to dozens of thousands predictions per binding motif (see Additional file 2: Table S2).

Selection of mutations altering binding motif affinity

Among transcription factors with binding sites experiencing frequent affinity loss we observed those belonging to AP-2 and C/EBP families, whose binding motifs were previously reported as mutation-enriched [3, 14]. Binding motifs of zinc finger SP and KLF families were also enriched with affinity loss-causing mutations in several cancer types. Mutations in binding sites of other transcription factors, in particular, belonging to the ETS family, persistently induced affinity gain (see Table 1 and Additional file 2: Table S2 and Additional file 4: Table S3). Binding motifs enriched with mutations causing the affinity loss or the affinity gain are likely to be under positive selection.

Table 1 Examples of selection magnitude for conserved binding motifs and motifs frequently affected by somatic mutations

Conversely, for a much wider set of transcription factor binding motifs, mutations leading to either affinity gain or affinity loss were depleted (Table 1, Additional file 2: Table S2 and Additional file 4: Table S3). Furthermore, binding motifs of certain factors were simultaneously protected from both the affinity loss and gain in several cancer types. In particular, there were several families of nuclear receptors (Fig. 3, TFClass families [19] are shown). Such conserved binding motifs indicate action of negative selection against somatic mutations. Importantly, negative selection of HOX and FOX motif-changing variants, reported earlier for normal tissues [20], was also exhibited in different cancers.

Fig. 3
figure 3

Transcription factor binding motifs protected from somatic mutations in different cancer types. The size of a pie shows the total number of motifs in a given transcription factor family (given in curly braces according to TFClass). The slices of a pie show the number of conserved binding motifs protected from any affinity change (yellow), motifs protected from affinity loss (magenta), and motifs protected from affinity gain (deep purple)

Only a few binding motifs were found significant for cancer types with limited number of available mutation calls due to lower statistical power. However, these orphan motifs often belonged to the families found under systematic protection from the affinity loss or gain in larger data sets (Additional file 4: Table S3).

Location of somatic mutations relative to binding motifs is associated with motif information content

Frequencies of cancer somatic mutations highly depend on the sequence context (see Additional file 3: Figure S1). For a given DNA motif, a mutation within a given context is more likely to occur at some positions and to avoid others, i.e. some motif positions are expected to accumulate more substitutions as local nucleotide context at these motif positions is similar to the frequent mutation context.

At the same time, the selection pressure would be the most exhibited at motif positions with high information content. A single-nucleotide substitution in such positions would cause greater changes in motif affinity [21]. To compare substitution rates at different motif positions we have aligned the motif predictions in the mutation-centered windows for germline alleles and estimated the positional density of mutations by normalizing the mutation counts in each position to the total number of tested windows (see Fig. 4 for a visual description).

Fig. 4
figure 4

Procedure used to evaluate mutation frequencies relative to motif occurrences

Figure 5 shows distribution of substitutions for AP2A (frequent affinity loss) and ESR1 (conserved) motifs based on breast cancer data. Normalized substitution frequencies are plotted along the motif logo and clearly demonstrate that somatic mutations often occupy the G(+4) base in AP2A. It is known, that 5’-TGA-3’ context (5’-TCA-3’ on the reverse strand) is frequently mutated in breast cancer (see Additional file 3: Figure S1). Accordingly, in both controls the mutations of G(+4) are more frequent than in other positions. Still, the actual G(+4) mutation rate in the AP2A motif is 1.5х higher in breast cancer genome as compared to the control (simulated) data.

Fig. 5
figure 5

Relative location of mutated bases in reference to the AP2A (top panel) and ESR1 (bottom panel) binding motifs predicted for breast cancer data. Y axis shows the relative fraction of mutation-centered windows with the legitimate motif predictions, X axis shows the location of a mutated base relative to the motif. Motif logos are scaled according to the discrete information content. Somatic variants tend to localize at strict position of the AP2A motif (red line) leading to affinity change. Variants in ESR1 motif (purple line) avoid strict and prefer degenerate positions, the motif is protected from affinity change

At the same time, the major C(+4) in TCA context of the ESR1 motif strongly avoids substitutions in cancer data if compared to any of the controls. Comparison with two other TGA boxes in ESR1 is even more illustrative. The first one is centered at G(+10) and has substitution rate approximately at the expected level. The second one (with the weak information content reflected as logo column height) is centered at G(+15) and, probably, is less important for the ESR1 binding affinity. Consequently, it aggregates significantly more somatic mutations than expected from the control data.

Stronger negative selection acts in DNase accessible regions

Accuracy of binding sites prediction in silico is limited and it is hard to distinguish true binding sites from false positive predictions without direct experimental data. To increase the confidence of binding site prediction, we considered subsets of mutations occurring in DNase accessible segments [22] of promoters and introns for breast cancer and lung adenocarcinoma.

Mutation rates may unpredictably depend on chromatin accessibility. Hence, a separate control set constructed from DNase accessible regions was necessary to evaluate selection of mutations in DNase accessible regions. The resulting estimates of the selection pressure magnitude became comparable with those for the whole set of mutations in promoter and intronic segments.

A smaller absolute number of mutations in DNase accessible regions resulted in a lower number of binding sites predictions and a lower statistical power (Additional file 5: Table S4), thus the absolute number of featured binding motifs was also smaller. However, the major observations persisted. In particular, motifs of FOX and several NR families were found protected from somatic mutations whereas selected members of AP-2 and C/EBP families displayed persistent affinity loss.

Taking the motifs found under significant negative selection for the full set of intronic + promoter mutations (P-value < 0.05 versus a particular control data), we compared the estimates of the selection pressure magnitude with those for mutations in the DNase accessible regions. There was no systematic difference for the shuffle control. The genomic control revealed consistently lower relative frequency of the affinity changing events (i.e. stronger selection magnitude) for the most of significant motifs in DNase accessible regions (Fig. 6). We consider this as an indication of stronger negative selection pressure on DNase accessible regions. The magnitude of positive selection in DNase accessible regions was lower (closer to 1), and the number of motifs detected under positive selection was lower (Additional file 5: Table S4).

Fig. 6
figure 6

Fold change (log2) of negative selection magnitude for mutations in DNase accessible subregions compared to that in the promoter and intronic segments. Y axis displays selection magnitude fold change (log2), the ratio between selection magnitudes estimated for DNase accessible regions to those for all promoter and intronic segments, the respective genomic control data is used in the both cases. Lower values of selection magnitude correspond to the stronger negative selection, thus negative fold change values correspond to stronger negative selection in DNase accessible regions. X axis displays different significantly conserved motifs (P < 0.05) for the set of promoter and intronic mutations. Data for affinity loss (a, top panel) and affinity gain (b, bottom panel) is presented for breast cancer (top subpanels) and lung adenocarcinoma (bottom subpanels). Members of FOX and NR transcription factor families are colored in blue and green

Discussion

Similar binding motifs are under similar selection pressure

Transcription factors of a given structural family [19] usually share similar binding preferences and it is not always possible to distinguish binding sites bound by different members of the same family. Attribution of binding predictions to a particular transcription factor is not entirely reliable, that is why we focused on observations that were consistent for different members of a given motif family (Figs. 2 and 3). Furthermore, transcription factors of the same structural family often had binding motifs obtained from different experimental data sets [17] and thus had different prediction biases. Thus, consistent detection of selection pressure for several motifs belonging to transcription factors of the same family increased our confidence that the detected effects were not arising from biased predictions of a particular low-quality binding model.

Genomic control data highlight negative selection

We emphasize the usage of genomic control in addition to shuffle control, since genomic sequences prefer or avoid occurrences of sequence motifs in a non-random fashion. For instance, composition of CpG islands correlates with the presence of many Kruppel-like transcription factor motifs, whereas nucleosome binding motifs facilitate binding of TBP factor [23]. All these regularities are destroyed in the shuffle control.

In general, the genomic control gave more conservative estimates of selection magnitude (Fig. 2) but there was a notable overlap between the resulting sets of motifs (Additional file 6: Figure S2) identified with any of the two controls, especially for the motifs conserved by negative selection. However, for positive selection the overlap was quite limited. For example, binding sites of HIF-1 transcription factor were found under strong positive selection both for affinity gain and affinity loss when the shuffle control was used, but the effect completely disappeared with the genomic control (Additional file 2: Table S2). Such observations were not easy to interpret; they might be caused by some local bias in nucleotide composition, which was controlled only for the shuffle control set. Thus, in our analysis we focused on cases, which were consistent (with the same selection direction) and significant (P < 0.05) for both control data sets.

Biological interpretation

Many transcription factors with binding sites under selection have been reported to be involved in cell malignant transformation. For instance, we detected significant enrichment of affinity gain events for C/EBP that was reported to be important for malignant conversion of human breast epithelial cells [24]. Binding sites of GABP, a member of ETS family, are created by mutations in TERT promoter and associated with development of many cancer types [4]. In our study the affinity gain of ETS binding sites appeared under positive selection, whereas affinity loss was under negative selection. FOX proteins, whose binding sites were found under negative selection both for the affinity gain and loss, were also suggested to be involved in cancer progression [25].

Our definition of the selection pressure pertains to ensembles of sites, which are destroyed by mutations statistically less often (or more often) than expected. Why mutations occur in particular binding sites is a separate question. For instance, suppose there is a trend for binding sites of a particular motif to be conserved, whereas specific small subset of these sites undergoes recurrent mutations , thus opposing the general trend. Such observation might help to locate important components of cancer regulatory network.

General notes and conclusions

In our analysis we used classic PWM approach to predict transcription factor binding sites. The application of position weight matrices was often expected to result in high false positive rates, especially when utilizing conventional motif databases such as TRANSFAC [26]. However, the prediction quality of modern motif collections, including motifs derived from ChIP-Seq data [27], is noticeably better, which is supported by recent benchmarking studies [28, 29]. Furthermore, eukaryotic genomes contain many low-affinity, poorly conserved binding sites that are nevertheless likely to play an important role in cis-regulatory control and evolution [30]. This substantiates taking into account at least medium-scoring PWM hits, especially, to study affinity change events from single-nucleotide somatic substitutions, which have limited effective potential for “motif editing”.

To summarize our findings, we observed transcription factor binding sites of many motif families in several cancer types altered by somatic mutations significantly less frequently than expected, both for mutations causing affinity loss or gain. The avoidance of mutations in binding motifs indicates the action of negative selection maintaining specific paths in cellular regulatory circuits. This negative selection of mutations that might lead to substantial affinity gain rejects possible alternative explanation that the observed differences are caused by transcription factors providing protection against mutations at binding sites by occupying the respective DNA segments. Another alternative explanation, that the observed statistical phenomena arise from biased mutation patterns of a particular cancer type, can be ruled out because conserved binding motifs are shared in cancers with substantially different mutation signatures (see Additional file 1: Table S1 and Additional file 3: Figure S1).

Finally, conservation of binding motifs against mutations was exhibited even in a simple test considering how often mutations occupied positions within the motifs versus nearby positions in the vicinity (see Methods and Additional file 7: Table S5). The resulting list of motifs was less selective and harder to interpret since the substitution itself and affinity change direction were not considered. However, major families of motifs that tended to overlap/avoid mutations (i.e. with more/less mutations within motifs than expected) were consistent with the detailed affinity change test, in particular, members of ETS, AP-2 and FOX, HOX families were overlapping and avoiding mutations, respectively.

Further analysis of factors with binding motifs protected by negative selection against mutations can reveal cell regulatory pathways crucial for the survivability of various human cancers.

Methods

Overview of cancer mutations data

We used published whole genome somatic mutations data for ten cancer types [16] (507 samples with varying sequencing depth) with mutations from different samples of the same cancer type aggregated. The total number of mutation calls varied between cancer types with breast cancer, liver cancer and lung adenocarcinoma having the largest numbers (Additional file 1: Table S1). Only single-nucleotide substitutions were considered.

Ensembl gene annotation was used to select mutations in intronic regions and [−5000;+500] bp intervals around transcription start sites (promoters). The total length of intronic and promoter DNA was 1,6⋅109 bps with an average mutations density of 1–2 substitutions per 5 kb for the cancer types with the highest number of mutation calls.

Mutations in coding regions were excluded. An overview of the initial mutation data is given in Additional file 1: Table S1. The relative frequencies of mutation contexts (the 5' and 3' nucleotides surrounding the mutated/germline alleles) are shown in Additional file 3: Figure S1.

The mutation coordinates were used to extract mutation-centered [−50; +50] bp genomic windows based on Ensembl GRCh37.p13 (release 75) human genome assembly. Overlapping windows for closely located mutations were considered independently (thus a single binding site in theory could be assessed for more than one mutation). Within a single cancer sample some windows with different mutation coordinates had identical sequences (e.g. due to genomic repeats), and, for each particular sample, only one of these windows was kept for further analysis. For each cancer type, recurrent mutations in different cancer samples were considered as independent observations. The statistical analysis (see below) was performed separately for each cancer type.

Assessing binding motif affinity changes

DNA motifs recognized by transcription factors are highly divergent. A basic binding site model, position weight matrix (PWM), accounts for such divergence by assigning a score for each oligonucleotide of some fixed length, with high scoring sequences selected as binding sites. A fixed score threshold defines the positive prediction rate (motif P-value) for the given PWM. With a uniform distribution of background frequencies the motif P-value is equal to the fraction of oligonucleotides (“words”) scoring above the given threshold among all words of the fixed length. For a given sequence variant the score of the best binding site prediction defines the respective motif P-value. In annotation of regulatory SNPs the ratio of motif P-values for two sequence variants was used to quantify the effects of nucleotide substitutions in predicted binding sites [31, 32].

The correspondence between the motif P-value and the energy of specific binding (the so-called discriminative energy) is clarified in the classic work [21]. It is shown that for a point substitution in a binding motif the log-ratio of P-values defined by two alternative sequence variants is approximately equal to the difference of corresponding discriminative energies. We defined substantial difference as the drop in the discriminative energy that increases the motif P-value four fold, which corresponds to the substitution of a perfectly nondegenerate position to a completely degenerate position in an imaginary model that did not contain any degenerate positions.

To predict the binding sites and to quantify affinity change events we used PERFECTOS-APE software [18]. Binding sites were predicted in [−50;+50] bp mutation-centered windows as best hits of 278 A/B/C-quality (highest quality) PWMs from the HOCOMOCO v9 [17] collection. The closest position of a binding site was required to be located not farther than 10 bp (one helix pitch) away from the mutated base, windows with predictions farther than 10 bp from the mutated base were discarded (separately for each motif). This setup allowed us to bypass global variability in mutational rates, which may induce an unknown bias in co-localization of mutations with binding motifs. To avoid such conditioning by genome location preferences we predicted binding sites only in small windows centered at mutations, thus considering only short genomic segments containing both a binding site and a somatic mutation.

For predictions we used the motif P-value threshold of 0.0005 (which roughly defines the false positive rate corresponding to a single expected prediction per 1000 bp of a random double stranded sequence). The approach to take PWM score thresholds according to a common false positive rate was recently demonstrated to be the least biased in a comparative study [29].

It is not trivial to assess the quality of binding site prediction since very little is known on the negative control, the DNA sequences that do not bind a particular transcription factor. For example, FoxA2 has well-exhibited binding preferences and has been tested for false-positive predictions in [33] with the help of EMSA experiments. For the HOCOMOCO v9 model at 0.0005 P-value the resulting experimentally-justified FDR was about 17 % providing an intuition on binding site prediction error rate. Yet, since only 64 hand-picked binding sites (41 positive and 23 negative) were tested with EMSA, this FDR value cannot be used in any quantitative estimation.

We used the P-value ratio thresholds of 4 (affinity gain, motif emergence) and 0.25 (affinity loss, motif disruption) to distinguish between substantial and non-substantial binding site alterations. This setup was identical for each particular cancer type (to estimate the observed affinity change frequencies) and control data (to estimate the expected frequencies).

To sum up, for a given transcription factor binding motif and for a given mutation-centered window an affinity loss event was counted if (1) the best prediction for the germline sequence passed motif P-value of 0.0005 considering PWM hits not farther than 10 bp away from the mutated base and (2) the best prediction for the mutated allele, again, considering PWM hits not farther than 10 bp away from the mutated base, had P-value at least 4 times weaker. Symmetrically, an affinity gain event was counted under the same restrictions for the best PWM hit but for the sequence with the mutated allele, and with the respective P-value for the germline sequence predictions being 4 times weaker.

We did not require the best hits in the germline and the mutant sequence to appear at the same position. Thus, with our approach we did not count affinity change events in windows with two good motif hits, only one of which was affected by the mutation.

Simulated control data

Two simulated control sequence sets, the shuffle control and genomic control, were used to estimate the expected frequency of affinity changing substitutions. The relative size of the simulated control sets depended on the total number of mutation calls in a particular cancer type. Lower relative size of the control data sets was used for the larger mutation sets (see Additional file 1: Table S1). Higher relative sizes of the control data sets was used for the cancers with lower numbers of mutation calls to provide stable estimates of expected frequencies.

The shuffle control set was obtained by shuffling the flanking sequences within [−50;+50] bp around the mutated base keeping the mutation context, the immediate 5' and 3' nucleotides, and the substitution itself, intact. Multiple shuffles were gathered for each mutation (Additional file 1: Table S1). This was the only step where the window length was explicitly used.

The windows for the genomic control were sampled from intronic and promoter regions in a way that they did not overlap the cancer mutation-centered windows. Each segment of [−50;+50] bp had the central base and its neighboring 5' and 3' nucleotides identical to the mutation context of a given somatic mutation locus, the respective nucleotide alternative was added. For each somatic mutation several genomic control windows were extracted, the number depended on the total number of mutations for a particular cancer type (Additional file 1: Table S1).

Both the shuffle and genomic controls were used to predict transcription factor binding sites in the same way as for the cancer data. For each binding motif the windows with binding sites predictions for the germline alleles were used to evaluate statistical significance of the affinity loss. Likewise, the windows with binding sites predicted for the simulated mutated alleles were used to evaluate statistical significance of the affinity gain. The windows with predictions for both alleles participated in both types of analysis (Fig. 1), and the windows without predictions were discarded.

Since binding sites predictions depended on the nucleotide composition and, consequently, on the mutation contexts (the 5’ and 3’ nucleotides proximal to the mutated base), we equalized the mutation contexts distributions of the test and control data for each particular cancer type before the statistical evaluation. To achieve this, we sampled the windows with binding sites predictions in control data (both shuffle and genomic) to match a given mutation context distribution of a particular cancer for each binding motif separately.

In a limited number of cases there were not enough control data to completely equalize the contexts distribution (see Additional file 2: Table S2). Yet, even for cancer types with low numbers of mutation calls, where the relative required size of the control data sets was extremely large, no less than 95 % of predictions with matching contexts were successfully sampled from the control data. Importantly, for cancer types with abundant mutation calls context equalization was almost perfect (99.9-100 % match of the contexts distributions with the non-perfect match only for exceptional motifs, see Additional file 2: Table S2), since a lower relative size of the control data set was generally required (see Additional file 1: Table S1). During significance evaluation (see below) the “missing” control predictions were considered as if they made the contingency tables more uniform (i.e. reducing the difference and its possible statistical significance).

Thus, for each binding motif we obtained the final sets of mutation-centered windows with binding sites overlapping with or located in the close vicinity of mutations for test and control data with the equalized mutation contexts distribution. This eliminated possible bias from the non-randomness of mutational signatures and made possible a comparison of the binding sites alteration frequencies in cancer versus control data.

The events of mutation-induced motif changes were counted for each cancer type and the control data sets (shuffle and genomic) using the same procedure. For each binding motif the Fisher's exact test was computed using 2×2 contingency tables (substantial affinity loss or gain versus non-substantial affinity change/no change, cancer mutations versus the control data, separately for shuffle and genomic control, refer to Fig. 1 for a scheme).

Only cases that passed 0.05 FDR-corrected (for 278 tested binding motifs) Fisher’s exact test P-value in both comparisons (versus the shuffle and versus genomic controls) were considered significant for a particular cancer type.

For selected motifs we also assessed localization of mutations relative to the binding motif predictions (see the specific section in Results, the workflow is shown in Fig. 4).

Estimating the selection magnitude

The affinity loss frequency was calculated as the fraction of affinity loss events out of all tested windows with legitimate motif predictions for the germline allele (see the previous section). Symmetrically, the affinity gain frequency was calculated as the fraction of windows with affinity gain events among all windows with legitimate predictions for the mutated allele.

The absolute values of the affinity loss and gain event frequencies are biased by the specific mutation signature of each particular cancer type. To account for this effect, we computed the relative values normalizing cancer frequencies for those of the control data. The ratio of the affinity loss (gain) frequency for cancer somatic mutations to the affinity loss (gain) frequency in the control set (genomic or shuffle) defined the selection magnitude for affinity loss (gain). The magnitude greater than 1 corresponded to the positive selection, the magnitude less than 1 corresponded to the negative selection.

Assessing mutations in DNase accessible segments

As an additional test, we considered DNase accessible segments of all introns and promoters. The DNase accessibility data (breast cancer and lung adenocarcinoma only) were collected for related cell lines and normal tissues [34] (Additional file 5: Table S4). The resulting reduced set consisted of 104905 (596253) mutations for breast cancer (lung adenocarcinoma) thus including nearly 30 % (90 %) from the respective total sets of intronic and promoter mutations.

Open chromatin regions are enriched with binding sites of the most of transcription factors but depleted of others [22]. To account for this non-randomness, the shuffle and the genomic controls were produced with the same pipeline as for the total set of promoter + intronic mutations but restricted to DNase accessible regions only.

A simplified test to reveal selection pressure

We also tested a basic overlap of mutations and motifs not taking into account the affinity change. To this end we used 2×2 contingency tables for the test/control data and mutations overlapping/not overlapping with the motifs (Additional file 7: Table S5) with the binding sites predictions performed using the same setup as in the main workflow. On the one hand, this simplified test did not capture neutral substitutions within the motifs and did not allow separating affinity loss and gain events. On the other hand, the results of this test did not depend on arbitrary selected motif P-value ratio thresholds.

Availability of supporting data

The data sets supporting the results of this article are included within the article and its additional files.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.