Background

In the post-sequencing era of genome biology, the main focus of interest has shifted from questions of genome structure to problems of gene function. Many research activities are now directed towards establishing precise understanding of genetic architecture underlying phenotypic variation. Recently, chemical mutagenesis has become a major method of promoting functional analysis of the mouse genome. N-ethyl-N-nitrosourea (ENU) is known to generate a spectrum of alleles, mainly by introducing single base replacement changes (i.e. transitions and transversions), and has been the mutagen of choice for enhancing the genetic resource for biomedical applications [1, 2].

A central issue of any mutagenesis screen is its efficiency in recovering mutations of interest. One criticism concerning the utility of ENU mutagenesis is that it may not be an efficient way of producing appropriate models of naturally occurring genetic variants (such as human genetic disorders) [3]. Indeed, ENU-induced nucleotide changes are known to exhibit a strong bias towards particular lesions [4, 5], and it remains possible that this property restricts the general applicability of ENU mutagenesis.

While phenotype-based mutagenesis screens have been effective in obtaining a variety of phenotypic mutants [6], ENU mutagenesis has recently been extended to gene-driven analyses of the mouse genome aimed at producing allelic series of functional mutations at particular loci [7–14]. The sequence-based approach is also suited for analyzing the mechanisms of mutation, as it should identify any sequence changes induced within the targeted regions, thereby allowing a precise evaluation of the true mutational pattern. To gain a more comprehensive understanding of the mutational bias inherent in ENU mutagenesis, this article describes a detailed examination of the pattern of induced DNA sequence changes recently obtained from our sequence-based screen of ENU-mutagenized mice using a temperature gradient capillary electrophoresis (TGCE) system [13]. Our aim here is four-fold: (i) to estimate the frequency per nucleotide site of ENU-induced mutation; (ii) to see if there is any indication of a strand-specific pattern of induced mutation; (iii) to contrast the mutational patterns in phenotype- versus sequence-based mutagenesis screens; and (iv) to infer the spectrum of amino acid changes obtained by ENU mutagenesis screens. The new data provided by the present analysis will be of great use in promoting the efficient production of murine models for human genetic disorders. Moreover, it will also expand our understanding of the molecular mechanisms of mutagenic processes, which are the underpinnings of any comparative and evolutionary genomic data.

Results and discussion

Site-specific pattern and frequency of induced mutations

Whereas our mutagenesis screen includes nongenic portions of the mouse genome [13], the present analysis focuses on the sequence changes identified within the protein-coding genes (or more specifically, mutations detected in exonic as well as flanking intronic or promoter regions) so that the detected mutations are classified as base changes on the nontranscribed (i.e. sense) strand. Excluding those mutations detected in the intergenic sequences, we have identified, by scanning a total of 181,031,647 nucleotide sites, 131 unique germline mutations, which include 130 base replacement changes as summarized in Table 1; the remaining one is a length mutation that deletes a cytosine site on the nontranscribed strand. The overall frequency of base replacement mutation is therefore obtained as 7.18 × 10⁻⁷ per nucleotide site per generation. (Inclusion of mutations detected in intergenic sequences yields a slightly higher rate of 7.49 × 10⁻⁷ [13].) Based on a preliminary experiment utilizing Taq polymerase-induced errors as positive controls, we determined that under the experimental conditions used in our mutation screening [13], the rate of mutation discovery by our TGCE system was ~50%; the absolute rate of ENU-induced heritable mutation is therefore estimated roughly as 1.4 × 10⁻⁶ per nucleotide site, which is approximately two orders of magnitude higher than the spontaneous mutation rate in humans (~2 × 10⁻⁸ per site per generation [15–17]). The details of the control experiment will be published elsewhere.
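
The arithmetic behind these estimates can be retraced in a few lines. The sketch below is illustrative only: it uses the counts and the approximate detection rate quoted in this section, and the variable names are our own.

```python
# Figures quoted in the text of this section.
N_MUTATIONS = 130           # base replacement changes in genic regions
N_SITES = 181_031_647       # nucleotide sites scanned
DETECTION_RATE = 0.5        # approximate TGCE discovery rate (~50%)
HUMAN_SPONTANEOUS = 2e-8    # spontaneous rate per site per generation

# Observed frequency per nucleotide site per generation.
observed = N_MUTATIONS / N_SITES
# Correct for the fraction of mutations the screen is expected to miss.
corrected = observed / DETECTION_RATE
# Ratio of the corrected ENU-induced rate to the spontaneous rate.
fold = corrected / HUMAN_SPONTANEOUS

print(f"observed  : {observed:.2e} per site")
print(f"corrected : {corrected:.1e} per site")
print(f"fold over spontaneous rate: {fold:.0f}")
```

Because the detection rate is itself only a rough estimate, the corrected rate and the fold difference should be read as order-of-magnitude figures.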

Table 1 Frequency of base replacement changes in the sequence-based screen

Table 1 also shows that in accord with the previous observations [1, 18], ENU preferentially introduced heritable changes at T/A sites. In contrast, only a single instance of G-to-C transversion was recorded, while C-to-G transversion was not detected at all. Although there are several reported instances of putatively ENU-induced G/C-to-C/G changes (e.g. [7, 11, 19, 20]), we suspect that those mutations are in fact spontaneous.

Our collection of ENU-induced mutations further suggests that the mutation frequency at thymines on the nontranscribed strand is substantially higher than the frequency at adenines (11.92 versus 6.05 mutations per 10⁷ sites; Table 1). The 95% confidence intervals (CIs) do not overlap with each other, suggesting that the observed strand asymmetry is not due merely to chance events but instead reflects the true nature of strand-specific mutagenic mechanisms; heritable changes are predominantly introduced at T/A base pairs when thymines are located on the nontranscribed strand (χ² = 9.001, df = 1, P = 0.003). Strand-specificity was not detected among changes at G/C pairs (5.65 versus 4.72 mutations per 10⁷ sites; χ² = 0.359, df = 1, P = 0.549). The possible cause of the mutational asymmetry will be discussed later in this section.

Consequences of ENU mutagenesis on amino acid sequences

Since the distribution of ENU-induced base replacement changes (as described in Table 1) is markedly different from the pattern of spontaneous mutation [15–17], the utility of mouse ENU mutagenesis in recovering disease-causing variants may somehow be limited, solely because it produces a particular class of amino acid changes more effectively than others. To see if the skewed distribution of ENU-induced mutation should lead to biased generation of amino acid variants, we here compute the expected distribution of amino acid replacement changes in the ENU-mutagenized mouse proteome, as detailed in the Methods section below.

Among 420 (= 21 × 20) types of amino acid replacement changes (including those involving termination codons), 170 are possible through single base replacement, whereas other changes necessarily require multiple mutational steps. By combining the per-site frequency of ENU-induced base replacement mutation (Table 1) with the pattern of codon usage in the mouse coding sequences [21], the relative frequencies of the 170 possible amino acid changes were obtained as shown in Table 2, where the expected count of each change among 1,000 nonsynonymous mutations is listed. The expected proportion of synonymous versus nonsynonymous changes for each amino acid was also obtained as summarized in Table 3, which shows that for certain amino acids (e.g. alanine, glycine, and proline), more than 40% of induced base replacement mutations result in synonymous changes. Overall, it is found that 71.5% of base replacement mutations would result in nonsynonymous changes (67.4% missense, 3.9% nonsense, and 0.2% make-sense), while the remaining 28.5% would lead to synonymous changes that do not alter amino acid sequences. As we shall see below, this theoretical prediction is highly concordant with the observed pattern of synonymous and nonsynonymous mutations in the sequence-based screens.
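
The counts of 420 and 170 follow directly from the structure of the genetic code and can be checked by enumeration. The following is a minimal sketch that assumes the standard genetic code; the table is written out in T, C, A, G order, with the termination signal treated as a 21st symbol ('*'). The helper names are ours.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a termination codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def single_base_neighbors(codon):
    """Yield the nine codons that differ from `codon` at exactly one position."""
    for pos, old in enumerate(codon):
        for new in BASES:
            if new != old:
                yield codon[:pos] + new + codon[pos + 1:]


# Ordered (from, to) pairs of distinct symbols linked by one base replacement.
reachable = {
    (CODE[c], CODE[n])
    for c in CODE
    for n in single_base_neighbors(c)
    if CODE[c] != CODE[n]
}

symbols = set(AMINO)                        # 20 amino acids plus termination
all_pairs = len(symbols) * (len(symbols) - 1)
print(all_pairs, len(reachable))
```

The pairs are counted as ordered (a change from X to Y is distinct from Y to X), which is what the 21 × 20 total implies.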

Table 2 Expected frequency of amino acid replacement changes
Table 3 Amino acid-specific pattern of synonymous and nonsynonymous changes

An obvious consequence of ENU mutagenesis is that because of the relatively low occurrence of G/C-to-C/G transversion changes (Table 1), some amino acid replacements would be disproportionately underrepresented (Table 2). As long as the expectation is based on the present collection that does not include any C-to-G transversions (see Table 1), it is suggested that seven of the amino acid changes should never be produced by ENU treatment (zero entries in Tables 2 and 3). Likewise, seven other amino acid replacements that are possible only by G-to-C transversions would be detected only on rare occasions. While these two classes of mutation represent nearly 5% of disease-associated amino acid replacement changes observed in humans [22], they would comprise only 0.48% of the entire set of ENU-induced amino acid replacement changes. This imbalance suggests that a precise understanding of the mutational pattern and frequency of ENU-induced nucleotide changes would be desirable for properly designing and conducting a sequence-based mutagenesis experiment.

Contrasting mutational patterns in phenotype- versus sequence-based screens

To see whether a similar pattern of biased mutation could also be detected in phenotype-based mutagenesis screens, we here contrast two classes of induced mutations, derived from phenotype- and sequence-based screens, respectively. Previously, Justice et al. [1] provided a list of germline mutations detected in phenotype-based screens of ENU-mutagenized mice (see also [18]). Among 62 mutations recorded in Justice et al. [1], two mutations (816SB and 4494SB at the Myo7a locus) accompany deletions [23], while two others (b-m3H and b-m4H at the Gpi1 locus) are considered to have originated from a single mutational event [24]. Hence 59 mutations are relevant for the present analysis. By reviewing the literature not included in Justice et al. [1], we have collected another 218 base replacement changes identified in phenotype-based screens, yielding a total of 277 ENU-induced germline mutations at 143 loci distributed across the 19 autosomes as well as the X chromosome in the mouse. The complete list of the relevant literature is provided as supplementary material [see Additional file 1].

At first, it may appear that the overall pattern of germline mutations detected in phenotype-based screens is similar to the corresponding pattern in our sequence-based screen (Table 4); induced changes are found predominantly at T/A sites (75.1%). Moreover, whereas mutations at G/C sites show no clear sign of strand-specific effects (30 mutations at guanines versus 39 at cytosines on the nontranscribed strand), mutational changes at T/A sites occur more frequently when thymines and not adenines are located on the nontranscribed strand (136 versus 72). Whether the mutational pattern in phenotype-based screens should truly reflect the underlying distribution of ENU-induced mutations, however, still awaits further consideration.

Table 4 Reported number of base replacement changes in ENU mutagenesis

While it is expected that the mutational skew detected in our sequence-based screen (Table 1) is a faithful manifestation of the biased nature of ENU-induced germline mutation, the mutational pattern in phenotype-based screens (Table 4) may be confounded by several additional factors, including (i) unknown nucleotide composition of the genomic regions that are the potential target of the mutagenesis, and (ii) differential phenotypic effects caused by different base replacement mutations. These factors do not affect the mutational pattern in the sequence-based screens (although it may still be confounded due to nonrandom mutation discovery by the TGCE system, which we consider unlikely). One reason is that the per-site mutation frequencies derived from the sequence-based analysis are, by definition, adjusted by the nucleotide composition of the targeted regions. Such adjustments are possible only when we base our analysis on appropriate sequence information, which is rarely available in the phenotype-based analysis. Moreover, while only those mutations that confer noticeable effects on the focal phenotype can be identified by phenotype-based screens, any mutational changes that fall within the targeted regions should in principle be identified by sequence-based screens, irrespective of their phenotypic effects. The former may hence represent a biased set of the latter, comprising a limited class of mutations that potentially have greater impacts on phenotypic function. Put differently, this implies that by investigating the incongruence between the two classes of mutations, we may be able to specify mutations with higher propensities of disrupting genome function.

As a possible sign of such incongruence, we here focus on a statistically significant difference detected for the patterns of mutational changes at thymines. In our sequence-based screen (Table 1), 57 mutations are found to alter thymines on the nontranscribed strand, among which 31 are replaced by cytosines and 21 by adenines. In contrast, among 136 detected changes at thymines in the phenotype-based screens (Table 4), 76 are replaced by adenines, while only 47 result in replacement by cytosines (χ² = 6.778, df = 1, P = 0.009). A similar discrepancy between the two classes of mutations has also been pointed out by Augustin et al. [11]. Since T-to-A mutations may generate nonsense changes whereas T-to-C mutations do not, the excess of T-to-A mutations in phenotype-based screens may be because of the greater impact of nonsense changes on protein function. At first glance, this conjecture may appear valid, since nonsense changes are disproportionately overrepresented among mutations detected in phenotype-based screens (Table 5); among those mutations identified in the protein-coding sequences (i.e. synonymous and nonsynonymous mutations), only 4.6% (3/65) are nonsense in our sequence-based screen (in reasonable agreement with the theoretical expectation of 3.9%; see above), whereas in the phenotype-based screens, the corresponding proportion exceeds 20% (46 among 223 mutations). However, a closer look at the two classes of mutations reveals that the incongruence is present even when only missense mutations are considered (Table 6). Since different subsets among the possible amino acid replacements are achieved by different base replacement changes, this observation suggests that amino acid replacements induced by T-to-A mutations are expected to affect protein function more radically on average, eventually leading to their excess representation in phenotype-based screens.
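
The comparison of T-to-C and T-to-A counts between the two classes of screens is a test of independence on a 2 × 2 table. As an illustration, the sketch below computes a Pearson chi-square statistic without continuity correction from the four counts quoted in this paragraph; the choice to omit the correction is our assumption, made because it gives a value close to the one reported. The function name is ours.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With one degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: screen type; columns: T-to-C, T-to-A (counts quoted in the text).
stat, p = chi_square_2x2(31, 21,   # sequence-based screen
                         47, 76)   # phenotype-based screens
print(f"chi-square = {stat:.3f}, P = {p:.3f}")
```

Only the transitions and one class of transversion enter the table; the remaining changes at thymines (T-to-G) are left out of this particular contrast.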

Table 5 Mutation types in ENU mutagenesis
Table 6 Mutation types in phenotype- versus sequence-based screens

Tables 4 and 5 also summarize the patterns of ENU-induced germline mutations detected independently in other sequence-based mutagenesis screens [7–10, 12, 14], as well as of mutations detected in sequence-based screens of ENU-mutagenized ES cells [20, 25]. (Mutations described in Augustin et al. [11] are not included because the base changes on the nontranscribed strand are not specified.) While the overall patterns may appear similar, they do not show a clear sign of the mutational bias peculiar to our sequence-based screen (more T-to-C transition than T-to-A transversion mutations). Hence, although we expect that different screening methods should identify distinct sets of induced changes, it remains possible that other factors not discussed here further complicate the patterns of mutations detected in mutagenesis screens.

Possible causes of the strand asymmetry

To account for the observed strand bias among mutations induced at T/A sites (Table 1), we here discuss the possibility that the mutational asymmetry is generated as a by-product of transcription-coupled repair (TCR) of DNA damage. The process of TCR has been considered as a candidate responsible for the compositional strand asymmetries in the transcribed portion of the mammalian genome [26–32].

Transcription overexposes the nontranscribed strand to DNA damage, thereby biasing the occurrence of mutations between the two strands. Based on comparative analyses of orthologous sequences, it has recently been hypothesized that the compositional asymmetries in mammalian transcribed regions are indeed caused by the mutational bias inherent in TCR [26]. Despite some criticisms (e.g. [33]), the hypothesis has gained further support from comprehensive analyses of human gene expression, which have detected a significant and positive correlation between the extent of compositional asymmetry and the expression level in the germline [28, 31, 32].

If TCR indeed plays a major part in generating the compositional strand asymmetries in the mammalian genome, then it should also affect the strand-specific pattern of experimentally induced mutation in a predictable manner. As reported previously [4, 5], ENU introduces heritable sequence changes predominantly at T/A sites by adding O²- or O⁴-ethyl adducts on thymines. Consequently, if transcription-associated DNA repair mechanisms (such as TCR) should effectively remove molecular lesions on the transcribed but not on the complementary nontranscribed strand of DNA, the resulting pattern of mutations on the nontranscribed strand should be strongly skewed toward changes at thymines; adenine modifications, on the other hand, should be less frequently observed.

Based on a genome-wide set of heritable mutations obtained from our sequence-based screen, the present study clearly demonstrates that ENU-induced changes in the mouse germline indeed follow the predicted pattern (Table 1). Although similar observations have been made for induced changes at the Hprt locus [34–36], previous reports are based on somatic mutations isolated from phenotypic assays using 6-thioguanine as the selective chemical agent; namely, the mutations are not heritable and the screening methods are not sequence-based. Estimated mutation frequencies are not adjusted by the base composition of the targeted regions either. Therefore, it cannot be excluded that the observed mutational pattern may have been confounded by the region-specific nucleotide composition, and also by the phenotypic effects of induced changes. As detailed above, these factors are adequately controlled in the present analysis.

While premutagenic lesions including O⁶- and N7-ethylguanines are formed at even higher rates in vivo by ENU [37], these lesions are repaired rather rapidly by mechanisms other than TCR [4, 5], eventually making only a minor contribution to heritable changes. The absence of mutational bias at G/C sites (Table 1) may lend further support to the view that TCR, and not the other transcription-associated mutagenic mechanisms such as cytosine deamination [38, 39], is the primary cause of compositional asymmetries in the mammalian transcribed regions [26]. Moreover, since some forms of DNA repair eliminate lesions by preferentially killing the damaged cells [40], the selective removal of the lesions on the transcribed strand may be promoted by the apoptotic effect of strand-specific repair mechanisms.

The strand-specific effects of TCR should be observed among heritable mutations only when mutated nucleotide sites are transcribed in the germline. In other words, strand asymmetry would not be observed when mutations are induced in the nontranscribed portion of the genome, which might include intergenic sequences and pseudogenes. (A caveat here is that some pseudogene sequences retain transcriptional activity while losing their primary ability of encoding amino acid sequences [41–44].) Alternatively, it is also expected that the strand-specific mutational pattern should be absent in genic regions if they are coupled with antisense transcripts [45, 46]. When genes and their antisense products are coexpressed [47], transcription-associated effects should influence both sense and antisense strands equally, thereby erasing the vestiges of strand-specific effects.

Although we have not experimentally confirmed the transcription status in the mouse germ cells of the 54 genic regions screened in our analysis, available data retrieved from the mouse Gene Expression Database (GXD) [48] suggest that in testis, 17 regions are transcribed, whereas none has been demonstrated to be nontranscribed; for the remaining 37 regions, no appropriate information currently exists. The genomic regions included in our sequence-based screen were chosen without any prior knowledge of their function or expression status in the germline. Recalling that some level of germline transcription would involve a large fraction of human genes (71–91% [28]), we expect that most of the remaining 37 sequences are also transcribed in the mouse germ cells, even if they may not be essential for germline function. By contrasting the patterns of experimentally induced germline mutations in genomic regions with distinct transcriptional properties, we would gain a clearer view of the role of transcription in generating mutational and eventual compositional strand asymmetries.

Conclusion

Based on a detailed evaluation of the screening data obtained from our large-scale mutagenesis experiment, the present analysis clearly illustrates the biased nature of ENU-induced mutations, and discusses the possible causes and consequences of the nonuniformity. Despite the mutational bias inherent in ENU mutagenesis, however, this study also provides strong support for its utility in obtaining a series of allelic variants at a genetic locus. We expect that the present findings will be useful in promoting the efficient production of mutant lines harboring amino acid variants. More generally, by enhancing the collection of experimentally induced mutations in unambiguously defined genomic regions, sequence-based mutagenesis studies will further illuminate the molecular basis of mutagenic and repair mechanisms that preferentially produce a certain class of mutational changes over others.

Methods

Sequence-based mutagenesis screen

The detailed method of our sequence-based mutagenesis is documented in Sakuraba et al. [13]. In brief, male C57BL/6J mice were injected intraperitoneally with an ENU dose of 85 or 100 mg/kg at 8–10 weeks of age. The injections were carried out twice at weekly intervals. The ENU-treated males were then mated to DBA/2J or C3H/HeJ females to obtain G1 offspring, which should harbor ENU-induced mutations as heterozygotes. We have chosen 63 genomic regions (54 genic and nine putatively intergenic) for the target-selected mutagenesis and designed 199 primer pairs therein (see Table 1 in [13]). Most of the primer pairs targeted for the genic sequences were designed to cover exonic regions, which were aimed at detecting induced mutations leading to amino acid sequence alterations, while only a few extended to nontranscribed flanking sequences. Heritable changes in the G1 males were screened using a temperature gradient capillary electrophoresis (TGCE) system, which eventually summed up to a screen of nearly 2 × 10⁸ nucleotide sites in total [13].

Since we are interested in obtaining the strand-specific pattern of induced changes, we here concentrate on mutations identified in the 54 genic regions, where the two strands of the genomic DNA can uniquely be discriminated (namely, sense versus antisense strands). By targeting more than 50 loci that together embrace all the 19 autosomes in the mouse, the present analysis should be free from locus- or region-specific effects that may obscure the genuine pattern of ENU-induced mutation. The GC content of the whole screened sites is 48.0% (see also Table 1), which is somewhat lower than the averages for the mouse exonic sequences (52.0–57.1% [49]), but well above the genome-wide average of 42% in mice [50].

The detected mutations are classified according to nucleotide changes on the nontranscribed (or sense) strand. Complementary mutations (e.g. A-to-T versus T-to-A changes) are therefore distinguished and considered different from each other. Note that this distinction cannot be made for sequence changes detected in nongenic regions. The rare occasions of mutations producing length variants (insertions and deletions) are also excluded from the analysis.

Per-site frequency of induced mutations

Given the observed number $k_{ij}$ of mutations from nucleotide $i$ to $j$ ($i, j \in \{T, C, A, G\}$, $j \neq i$), we compute the mutation frequency per nucleotide site as $\mu_{ij} = k_{ij}/L_i$, where $L_i$ represents the total count of nucleotide $i$ within the genomic region covered in the screen. The average frequency of mutation at nucleotide $i$ is simply given by $\mu_i = \sum_j \mu_{ij} = \sum_j k_{ij}/L_i$. Likewise, the overall frequency of mutation per nucleotide site is obtained as $\sum_{ij} k_{ij} / \sum_i L_i$.
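
A short sketch of these three quantities is given below. The function and variable names are ours, and the counts in the example are made-up toy numbers chosen for easy checking; they are not the values in Table 1.

```python
def per_site_frequencies(counts, site_totals):
    """Return (mu_ij, mu_i, overall) from mutation counts k_ij and site totals L_i."""
    # mu_ij = k_ij / L_i for every observed change i -> j.
    mu_ij = {(i, j): k / site_totals[i] for (i, j), k in counts.items()}
    # mu_i = sum over j of k_ij, divided by L_i.
    mu_i = {
        i: sum(k for (src, _), k in counts.items() if src == i) / total
        for i, total in site_totals.items()
    }
    # Overall frequency = all mutations / all screened sites.
    overall = sum(counts.values()) / sum(site_totals.values())
    return mu_ij, mu_i, overall


# Toy numbers for illustration only (hypothetical, NOT the data in Table 1).
toy_counts = {("T", "C"): 3, ("T", "A"): 2, ("A", "G"): 1}
toy_sites = {"T": 1000, "C": 1000, "A": 1000, "G": 1000}
mu_ij, mu_i, overall = per_site_frequencies(toy_counts, toy_sites)
print(mu_ij, mu_i, overall)
```

Dividing by the count of the original nucleotide, rather than by the total number of sites, is what adjusts each frequency for the base composition of the screened regions.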

Assuming that the probability of observing $k$ or fewer mutations within $L$ screened sites should follow a Poisson distribution, $P\{X \le k\} = \sum_{\kappa=0}^{k} e^{-L\mu}(L\mu)^{\kappa}/\kappa!$ (where $\kappa$ designates a dummy variable), we determine the 95% confidence interval (CI) for the mutation frequency $\mu$ by solving this equation with $P = 0.025$ and $0.975$. When no mutations were observed (as for the C-to-G transversion in our collection; see Table 1), we solved the equation $P\{X = 0\} = e^{-L\mu} = 0.05$ to obtain an upper limit for the mutation frequency.
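
One way to solve these equations numerically is by bisection on the Poisson mean $L\mu$, since the cumulative probability decreases monotonically as the mean grows. The sketch below is a minimal illustration under that assumption; the function names, the bracketing interval and the tolerance are our own choices and not part of the original analysis.

```python
import math


def poisson_cdf(k, lam):
    """P{X <= k} for a Poisson variable with mean lam (lam > 0)."""
    return sum(
        math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))
        for x in range(k + 1)
    )


def solve_mean(k, target, tol=1e-10):
    """Bisection for the mean lam at which P{X <= k} equals `target`."""
    lo = 1e-12
    hi = k + 20 * math.sqrt(k + 1) + 20   # generous upper bracket (assumption)
    while hi - lo > tol * hi:
        mid = (lo + hi) / 2
        # The CDF falls as the mean rises, so move right while it is too large.
        if poisson_cdf(k, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def mutation_rate_ci(k, n_sites):
    """95% limits for the per-site frequency, following the equations above."""
    if k == 0:
        # One-sided upper limit from exp(-L * mu) = 0.05.
        return 0.0, -math.log(0.05) / n_sites
    lower = solve_mean(k, 0.975) / n_sites
    upper = solve_mean(k, 0.025) / n_sites
    return lower, upper


# Example with the overall counts quoted in the Results section.
print(mutation_rate_ci(130, 181_031_647))
```

Some conventions for exact Poisson intervals use $k - 1$ rather than $k$ in the lower-limit equation; the sketch follows the equation as written here.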

Expected pattern of amino acid changes

To infer the effect of ENU mutagenesis on the mouse protein sequences, the expected pattern of amino acid replacement changes is computed by superimposing the distribution of ENU-induced base replacement mutations onto the pattern of codon usage in the mouse coding sequences. For each of the 64 triplet codons, there are nine codons that could be reached by single base replacement. The relative frequency distribution of 576 (= 64 × 9) possible codon replacements is obtained simply by multiplying the abundance of the original codon in the mouse coding sequences, as described in the Codon Usage Database [21] based on GenBank Release 151.0 (19 December, 2005), by the per-site frequency of appropriate base replacement mutation (as derived above). Base replacement mutation in a codon is either synonymous or nonsynonymous; the latter leads to an amino acid sequence alteration, while the former preserves the coded amino acid despite the change in the DNA sequence. (Mutation that preserves a termination codon (e.g. TAA-to-TAG) is here considered as synonymous.) Nonsynonymous mutation is further subdivided into missense (exchange of an amino acid by another amino acid), nonsense (replacement of an amino acid by a termination codon), or make-sense (replacement of a termination codon by one of the 20 amino acids) mutation. The expected distribution of amino acid replacement changes is obtained from the derived pattern of codon replacement, by summing up the appropriate codons in case of degeneracy. The result is summarized in Table 2.
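
The weighting scheme described above can be sketched as follows. This is an illustration only: it assumes the standard genetic code, and the codon usage and mutation frequencies passed in at the end are uniform placeholders (hypothetical inputs), not the Codon Usage Database values or the frequencies in Table 1. With those placeholders the totals are simply counts of the 576 codon replacements in each class.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a termination codon.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify(src, dst):
    """Classify a codon replacement using the definitions given in the text."""
    a, b = CODE[src], CODE[dst]
    if a == b:
        return "synonymous"      # includes stop-to-stop changes
    if b == "*":
        return "nonsense"
    if a == "*":
        return "make-sense"
    return "missense"


def expected_classes(codon_usage, base_rate):
    """Weight each single-base codon replacement by codon abundance times the
    per-site frequency of the base change, then total the weights by class."""
    totals = Counter()
    for codon in CODE:
        for pos, old in enumerate(codon):
            for new in BASES:
                if new == old:
                    continue
                mutant = codon[:pos] + new + codon[pos + 1:]
                weight = codon_usage[codon] * base_rate[(old, new)]
                totals[classify(codon, mutant)] += weight
    return totals


# Placeholder inputs (assumptions): equal codon usage, equal mutation rates.
uniform_usage = {codon: 1 for codon in CODE}
uniform_rate = {(i, j): 1 for i in BASES for j in BASES if i != j}
totals = expected_classes(uniform_usage, uniform_rate)
print(dict(totals))
```

Replacing the placeholders with observed codon abundances and the twelve per-site frequencies would give class proportions of the kind reported in the Results; summing weights by amino acid pair instead of by class would give a table in the style of Table 2.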

Literature survey

We carried out a comprehensive literature survey and collected reported instances of ENU-induced base replacement mutations detected in phenotype-based mutagenesis screens. This has been done by adding germline mutations that are not included in the previous list [1, 18]. We do not claim to have included all the relevant studies, but we expect that the collection should represent an unbiased sample of the literature. An additional literature survey has been conducted also for germline mutations detected independently in other sequence-based studies, as well as for mutations identified in mutagenized embryonic stem (ES) cell systems.