Man's yesterday may ne'er be like his morrow; Nought may endure but Mutability.

Percy Bysshe Shelley (1816) Mutability

The first hint that the cytosine-guanine (CpG) dinucleotide might constitute a hotspot for pathological mutations in the human genome came nearly 25 years ago with the finding that two different CGA → TGA (Arg → Term) nonsense mutations had recurred quite independently at different locations in the factor VIII (F8) gene causing haemophilia A [1]. The potential generality of this phenomenon was supported by the finding that 12 of the 34 (35 per cent) single base-pair (bp) substitutions then known to cause human inherited disease were C → T and G → A transitions within CpG dinucleotides [2]. Further studies soon confirmed that the CpG dinucleotide was a mutation hotspot in a variety of different human disease genes, including PAH [3], F9 [4], LDLR [5], RB1 [6], HPRT1 [7] and DMD [8]. As mutation data accumulated, CGA → TGA transitions were encountered particularly frequently as a cause of human genetic disease; such nonsense mutations are inherently more likely to come to clinical attention than missense mutations [9, 10].

From the outset, it was realised that the hypermutability of the CpG dinucleotide was related to its role as the major site of cytosine methylation in the human genome. The reason traditionally put forward to explain this association has been that, while cytosine spontaneously deaminates to uracil (which is efficiently recognised as a non-DNA base and removed by uracil-DNA glycosylase), the spontaneous deamination of 5-methylcytosine (5mC) yields thymine [11], thereby creating G•T mismatches whose removal by methyl-CpG binding domain protein 4 (MBD4) and/or thymine DNA glycosylase followed by base excision repair is error prone [12-16]. It remains possible, however, that mCpG transitions are not exclusively caused by the spontaneous deamination of 5-methylcytosine and may also arise through the action of other mechanisms and processes [17-19]. Irrespective of the precise nature of the underlying mechanism, Krawczak et al. (1998) [9] estimated that the rate of CG → TG (and CG → CA on the other strand) transitions was five times the base mutation rate. Subsequent estimates of 5mC hypermutability, derived from various studies of polymorphism, disease mutations or evolutionary divergence, have ranged between four-fold and 15-fold [20-26].

It has been known for some time that cytosine methylation also occurs in the context of CpNpG sites in mammalian genomes, where N represents any nucleotide [27, 28]. Since the intrinsic symmetry of the CpNpG trinucleotide would support a semi-conservative model of replication of the methylation pattern (as with the CpG dinucleotide), it comes as no surprise that both maintenance and de novo methylation occur at CpNpG sites in mammalian cells [28]. In their recent paper on the human methylome, Lister et al. [29] reported abundant DNA methylation in CpHpG trinucleotides (where H = A, C or T). Specifically, some 17.3 per cent of 5mC in embryonic stem cells was found to occur within CpApG, CpCpG and CpTpG, with a further 7.2 per cent of 5mC occurring in CpHpH. Although Lister et al. [29] suggested that non-CpG methylation is almost entirely lost upon differentiation (a conclusion based solely upon the analysis of foetal lung fibroblasts), others have noted CpNpG methylation within human genes in a variety of different somatic tissues [30, 31]. Although the extent of non-CpG methylation in the germline remains unclear, if we were to assume not only that CpHpG methylation occurs in the germline, but also that 5mC deamination can occur within a CpHpG context, then it is very likely that methylated CpHpG sites would constitute mutation hotspots. Indirect evidence that this might indeed be the case has come from a disproportionately high number of C → T and G → A transitions at CpNpG sites in studies of the human NF1 [32] and BRCA1 [33] genes. In the light of the above, we have revisited the question of CpG dinucleotide hypermutability and explored the potential contribution that CpHpG transitions might make to human inherited disease.

According to the April 2010 release of the Human Gene Mutation Database (HGMD; http://www.hgmd.org) [34], 56,457 pathological mutations have been reported in a total of 2,242 human genes. A subset of 54,625 pathological missense and nonsense mutations in 2,113 genes, with ± 2 bp genomic DNA sequence flanking the site of mutation, was retrieved from the HGMD. The numbers of C → T and G → A mutations in this mutation dataset that were located within either CpG dinucleotides or CpHpG trinucleotides were counted and termed 'mutations in di/trinucleotide' (Table 1). Only these C → T and G → A transitions, found in the context of a CpG dinucleotide or CpHpG trinucleotide, would be compatible with a model of methylation-mediated deamination of 5mC. The remaining mutations in this HGMD dataset that were located in non-CpG or non-CpHpG di/trinucleotides within the genes in question were also counted and termed 'mutations not in di/trinucleotide' (Table 1). Thus, 18.2 per cent of the studied missense/nonsense mutations causing human inherited disease are located in the CpG dinucleotide, while the corresponding proportion for the CpHpG trinucleotide is 9.9 per cent. To assess the significance of these figures, the number of all possible C → T and G → A mutations within either CpG dinucleotides or CpHpG trinucleotides within the coding regions of the mutated genes, termed 'possible mutations in di/trinucleotides', were also counted (Table 1). In parallel, all possible single bp substitutions that occurred in a non-CpG dinucleotide or non-CpHpG trinucleotide context (as well as mutations other than C → T and G → A in CpG and CpHpG) within the coding regions of the mutated genes were counted as 'possible mutations not in di/trinucleotide' (Table 1).
A weak positive correlation was noted between the number of CpG mutations in the 2,113 genes analysed and the number of possible CpG mutations in these genes (Pearson's correlation 0.129, p = 2.45 × 10⁻⁹), implying that the CpG mutation frequency is influenced to some extent by the frequency of occurrence of the underlying CpG dinucleotide. Unsurprisingly, a significantly greater proportion (approximately ten-fold) of observed pathological missense/nonsense mutations within these genes were C → T and G → A transitions within CpG dinucleotides than would have been expected (by chance alone) for all possible mutations (Table 1; p < 10⁻²³⁰). A weak positive correlation (Pearson's correlation 0.251, p = 1.01 × 10⁻³¹) was also noted between the number of CpHpG-located mutations and the number of CpHpG trinucleotides in these genes, implying that the CpHpG mutation frequency is influenced to some extent by the frequency of occurrence of the underlying CpHpG trinucleotide. Once again, a greater proportion (approximately two-fold) of observed pathological missense/nonsense mutations within these genes were C → T and G → A transitions within CpHpG trinucleotides than would have been expected by chance alone for all possible mutations (Table 1; p < 10⁻²³⁰).
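The counting scheme described above can be illustrated with a minimal Python sketch. It is not the pipeline actually used for the HGMD analysis: the function names (`methylation_context`, `classify_substitution`, `count_possible`) are hypothetical, each mutation is assumed to be supplied as a 5-base context string (the reference base with its ± 2 bp flanks), and every possible substitution is counted regardless of its coding consequence. A G → A change is treated as a C → T change on the opposite strand, so the G of a CpG, or the G of a CpDpG (D = A, G or T, the complement of H), is assigned to the corresponding methylatable context.

```python
from collections import Counter
from typing import Optional

H = set("ACT")  # IUPAC H: any base except G
D = set("AGT")  # IUPAC D: complement of H, i.e. any base except C


def methylation_context(seq: str, i: int) -> Optional[str]:
    """Return 'CpG', 'CpHpG' or None for the base at index i of seq."""
    seq = seq.upper()
    base = seq[i]
    if base == "C":
        # C on the given strand: look at the two bases downstream.
        nxt = seq[i + 1] if i + 1 < len(seq) else ""
        nxt2 = seq[i + 2] if i + 2 < len(seq) else ""
        if nxt == "G":
            return "CpG"
        if nxt in H and nxt2 == "G":
            return "CpHpG"
    elif base == "G":
        # G pairs with a C on the opposite strand: look upstream instead.
        prev = seq[i - 1] if i >= 1 else ""
        prev2 = seq[i - 2] if i >= 2 else ""
        if prev == "C":
            return "CpG"
        if prev in D and prev2 == "C":
            return "CpHpG"
    return None


def classify_substitution(context: str, alt: str) -> str:
    """Classify a substitution given a 5-base context (reference base at index 2)."""
    ref = context[2].upper()
    # Only C>T and G>A are compatible with deamination of 5mC.
    if (ref, alt.upper()) in {("C", "T"), ("G", "A")}:
        return methylation_context(context, 2) or "other"
    return "other"


def count_possible(seq: str) -> Counter:
    """Count all possible single-base substitutions in seq by context class."""
    counts = Counter()
    for i in range(len(seq)):
        ctx = methylation_context(seq, i)
        if ctx:
            counts[ctx] += 1      # the one deamination-type transition
            counts["other"] += 2  # the two remaining substitutions at this site
        else:
            counts["other"] += 3
    return counts
```

Applied gene by gene, `classify_substitution` would yield the observed counts and `count_possible` the counts of possible mutations; comparing the two sets of proportions is then a matter of whichever statistical test is preferred.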

Table 1 Numbers of C → T and G → A mutations found in CpG dinucleotides and CpHpG trinucleotides in a dataset of 54,625 missense and nonsense mutations in 2,113 different human genes (HGMD) and the numbers of possible C → T and G → A mutations in CpG dinucleotides and CpHpG trinucleotides within the coding regions of the mutated genes.

From the data presented in Table 1, we estimate that ~11.8 per cent of the 9,947 CpG mutations (ie 1,176) occurred within this dinucleotide by chance alone and hence would not necessarily have originated via the methylation-mediated deamination of 5mC. In a similar vein, we estimate that ~46 per cent (2,460) of the CpHpG mutations (5,402) occurred within these trinucleotides by chance alone and hence may not have originated via methylation-mediated deamination of 5mC. The other side of this particular coin, however, is that the remaining 54 per cent of the 5,402 observed CpHpG mutations in HGMD (ie the excess of 2,942 over expectation, or ~5 per cent of all the missense/nonsense mutations analysed) may well be attributable to methylation-mediated deamination of 5mC within a CpHpG context. As far as we are aware, this is the first (albeit crude) estimate of the potential impact of CpHpG mutations on human inherited disease.
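The arithmetic behind these estimates can be laid out in a few lines of Python. This is only a sketch: the function name `chance_and_excess` is hypothetical, and the counts expected by chance (1,176 and 2,460) are taken as quoted in the text rather than recomputed from Table 1.

```python
def chance_and_excess(observed: int, expected_by_chance: float, total: int) -> dict:
    """Split an observed mutation count into a chance component and an excess."""
    excess = observed - expected_by_chance
    return {
        # share of the observed mutations explicable by chance alone
        "chance_fraction": expected_by_chance / observed,
        # mutations over and above the chance expectation
        "excess": excess,
        # excess as a share of the mutations in this context
        "excess_fraction_of_class": excess / observed,
        # excess as a share of all missense/nonsense mutations analysed
        "excess_fraction_of_all": excess / total,
    }


TOTAL_MUTATIONS = 54_625  # missense/nonsense mutations retrieved from HGMD

cpg = chance_and_excess(observed=9_947, expected_by_chance=1_176, total=TOTAL_MUTATIONS)
cphpg = chance_and_excess(observed=5_402, expected_by_chance=2_460, total=TOTAL_MUTATIONS)

for name, result in (("CpG", cpg), ("CpHpG", cphpg)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

The `excess_fraction_of_all` entry is the quantity that matters for the disease burden, since it expresses the excess relative to the whole mutation dataset rather than to the di/trinucleotide class alone.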

A similar analysis was performed for 1,766 regulatory mutations (identified in the promoters of 191 human genes) retrieved from the HGMD. The numbers of actual and possible CpG and CpHpG mutations were counted as before, using the promoter sequences of each gene. In order to determine the total numbers of possible CpG/CpHpG and non-CpG/CpHpG mutations, the wild-type promoter sequences for each gene (total length, 22,051 bp) were used (Table 2). As with the missense/nonsense mutations, an approximately twofold higher proportion of observed pathological regulatory mutations within these genes were C → T and G → A transitions within CpG dinucleotides than would have been expected by chance alone for all possible mutations (Table 2; p = 6.03 × 10⁻⁹). We estimate that ~55 per cent of the 94 CpG mutations (ie ~52) probably occurred within these dinucleotides by chance alone rather than via methylation-mediated deamination of 5mC. By contrast, a lower than expected proportion of C → T and G → A regulatory mutations located in CpHpG trinucleotides was observed (p = 0.011). The absence of any excess of C → T and G → A mutations located in CpHpG trinucleotides indicates that most, if not all, the promoter CpHpG mutations probably occurred by chance alone, making it unnecessary to invoke methylation-mediated deamination of 5mC to account for them. Since neither CpG nor CpHpG were found to be under-represented in the examined promoter regions as compared to the coding regions, we surmise that the reduced (CpG) or absent (CpHpG) preponderance of C → T and G → A promoter mutations in the methylatable di/trinucleotides may be due to the relative paucity of cytosine methylation within the promoter regions [35] that would render unmethylated CpG and CpHpG di/trinucleotides no more mutable than any other di/trinucleotide.

Table 2 Numbers of C → T and G → A mutations found in CpG dinucleotides and CpHpG trinucleotides in a dataset of 1,766 regulatory mutations of 191 gene promoters (HGMD) and the numbers of possible C → T and G → A mutations in CpG dinucleotides and CpHpG trinucleotides.

Although two specific examples of non-CpG methylation altering the binding of transcription factors to promoter elements within human genes have so far been reported [36, 37], the functional role of most non-CpG methylation in the human genome is still unclear. Irrespective of the functionality or otherwise of this specific type of post-synthetic DNA modification in the human genome, it would appear that methylation of the CpHpG trinucleotide may leave a significant imprint on the spectrum of missense/nonsense mutations causing human genetic disease.