Introduction

Homologous recombination (HR) is one of the fundamental mechanisms of DNA processing which, in various guises, is found in all phyla of life [1, 2]. HR is capable of playing several distinct roles within an individual organism. In sexually reproducing species, meiotic HR is a carefully regulated process that occurs at a defined stage of differentiation in specific cell types. By contrast, in the same species, HR also operates as a major mechanism of DNA repair in all cell types at all times. HR has been clearly co-opted for different functions throughout its deep evolutionary history. Similarly, HR has been exploited as a laboratory tool for, among other applications, genetic engineering in model organisms [3]. The role of HR in DNA repair [4], somatic mutation [5] and chromosomal engineering has been reviewed elsewhere; this paper will focus on meiotic HR and the recent studies demonstrating its impact on the mutability of mammalian genomes.

Evolutionary geneticists have traditionally regarded mutation and recombination (along with selection and genetic drift) as relatively independent 'forces of evolution': while the former generates variation, the latter reshuffles existing variation into novel combinations. These ideas were formulated before DNA was identified as the molecule of inheritance [6-8], however, and well before any understanding of the molecular mechanisms of mutation was gained. Recent comparative analyses of whole-genome sequences [9-12] give a deeper appreciation of the distinct mutational mechanisms operating to shape genomes over evolutionary timescales. The mutability of any genome can be considered to be the summation of the effects of the distinct mutational mechanisms that operate in that genome. These impacts can be quantified in terms of the rate of each mutational mechanism, the number of bases involved in the resultant mutation and the number of susceptible sites within the genome. Figure 1 displays the major mutational mechanisms operating in mammalian genomes and demonstrates that both their rates and the sizes of the resultant genomic alterations vary widely. The distinction between recombination and mutation described above becomes blurred by the involvement of HR in a number of these mutational processes. There is ongoing discussion about the relationship between allelic HR and single nucleotide polymorphism (SNP; see Hellmann et al. [13] and Nachman [14], and section below, for example). Furthermore, HR between non-allelic (duplicated) sequences has been demonstrated to be an important mode of both pathogenic mutation [15] and genome evolution [16]. These duplicated sequences can exist in tandem arrays or in dispersed repeats [17]. HR is the predominant mutational mechanism operating in polymorphic tandem repeat arrays with repeat units longer than five base pairs (bp) [18], including minisatellites and ribosomal DNA arrays, whereas replication slippage operates on arrays of shorter repeat units. These two influences of HR on shaping genomic variation will be considered in turn, but first one should consider the mechanism of HR and the distribution of allelic HR throughout the human genome.
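
The view of genome mutability as a sum over mutational mechanisms, described above, can be written compactly. What follows is only a sketch: the symbols are introduced here for illustration, and the multiplicative form assumes that the mechanisms act independently and that each has a single characteristic rate and size.

```latex
M = \sum_{i} \mu_i \, b_i \, n_i
```

Here \mu_i is the rate of mechanism i per susceptible site per generation, b_i is the number of bases involved in a typical resultant mutation, n_i is the number of susceptible sites in the genome, and M is the expected number of bases altered per genome per generation. Under these assumptions, a rare mechanism that alters many bases can contribute as much to M as a frequent mechanism that alters a single base.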

Figure 1

Mutation processes operating in the human genome. Different classes of mutated loci have been plotted on a graph indicating the mutation rate and the number of bases involved in the mutation. Those mutation processes that involve HR are shaded in red. References for different classes of mutated loci are as follows: base substitutions [19], short indels [19], microsatellites [20, 21], pathogenic triplet repeats [22], telomere repeats [23], rDNA repeats [24, 25], minisatellites [18], satellites, retroelement insertions [26], duplicated sequences [27] and rearrangements in the single copy portion of the genome [28]. Only the higher-order repeat structure of satellites is shown, indicating that this is likely to be the mutable unit.

Mechanism and genomic distribution of HR

The specifics of the multifarious protein-DNA interactions that underpin HR are beyond the scope of this paper (but have been reviewed by West [1]). It is sufficient to note that HR is initiated by a DNA double-strand break (DSB). This break is subsequently processed and then invades a homologous acceptor sequence [29]. After further processing, an intermediate is formed that can be resolved in one of two ways: a crossover results in the reciprocal splicing of the donor sequence to the acceptor sequence, whereas a gene conversion results in the non-reciprocal transfer of a short tract of sequence between the sequences (Figure 2). A crossover results in a change of phase of flanking markers on either side of the crossover, whereas a gene conversion is only observable if it encompasses a variant site between the homologous sequences. The ratio of these two outcomes is poorly characterised, although recent empirical [30] and statistical analyses [21] seem to indicate that gene conversion is the more frequent outcome.
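
The difference between the two outcomes can be illustrated with a toy model. The following Python sketch is not a model of the molecular process: it treats each homologue as a string of markers, and the function names, the breakpoint positions and the direction of transfer in the gene conversion are assumptions made purely for illustration.

```python
def crossover(donor, acceptor, point):
    """Reciprocal exchange: the two products swap everything after `point`."""
    if len(donor) != len(acceptor):
        raise ValueError("homologues must be the same length in this sketch")
    return donor[:point] + acceptor[point:], acceptor[:point] + donor[point:]


def gene_conversion(donor, acceptor, start, end):
    """Non-reciprocal transfer: the acceptor copies donor[start:end].

    The donor itself is left unchanged, so only one product is returned.
    """
    if len(donor) != len(acceptor):
        raise ValueError("homologues must be the same length in this sketch")
    return acceptor[:start] + donor[start:end] + acceptor[end:]


# Hypothetical haplotypes that differ at every marker (upper vs lower case).
hap1 = "ABCDEFGH"
hap2 = "abcdefgh"

rec1, rec2 = crossover(hap1, hap2, 4)
converted = gene_conversion(hap1, hap2, 3, 5)

print(rec1, rec2)   # flanking markers change phase in both products
print(converted)    # flanking markers keep their original phase
```

In the crossover products the markers at the two ends of each string now derive from different parental haplotypes, whereas the converted haplotype retains both of its original end markers and differs only within the short transferred tract. If the two haplotypes were identical across that tract, the conversion would leave no observable trace.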

Figure 2

Simplified view of the outcomes of homologous recombination (HR). HR is initiated by a double-strand break and, after several processing steps, an intermediate, known as a Holliday junction, is formed, in which two homologous sequences are conjoined. This intermediate can then be resolved in one of two ways, resulting in a crossover or a gene conversion event. Current models of HR posit two Holliday junctions within a more complex intermediate structure [32]; however, the details of these models are beyond the scope of this paper.

The distribution of allelic HR throughout the human genome is likely to be extremely heterogeneous on the fine (kilobase [kb]) scale, although on the coarse scale (tens of megabases), the broad pattern is one of 1.6-fold more recombination events in females than in males, depressed recombination near the centromere and increased recombination in subtelomeric regions [33]. On the fine scale, both empirical studies of recombination in sperm [34] and statistical analysis of patterns of variation in populations [35, 36] indicate the widespread existence of hotspots of recombinatorial activity that can be orders of magnitude more active within an interval of 500 bp to 1 kb than in flanking 'cold' sequences. Population genetics theory suggests that these HR hotspots are likely to be short lived because the dynamics of the HR process are such that recombinogenic variants are doomed to be preferentially gene converted out of existence [37, 38]. This prediction appears to have been bolstered by the recent observation that recombination hotspots are not conserved over the short evolutionary distance that separates humans and chimpanzees [39, 40]. An absence of shared sequence motifs between known recombination hotspots [35] suggests that epigenetic mechanisms might be involved in the inheritance of recombinatorial activity at these locations.

The number of recombination events per meiosis seems to vary significantly both between gametes and between healthy individuals [41]. Intriguingly, mothers with higher rates of recombination tend to have greater reproductive success [42], which would suggest that selection on the dynamics of allelic HR is ongoing.

Allelic HR and sequence diversity

There is, on average, higher sequence diversity in regions of higher recombination rate [14]. It has also been demonstrated that there is a similar correlation between recombination and divergence in comparisons between human and mouse [43] and human and chimpanzee [13]. This correlation need not be explained by a causative relationship between the two (ie HR being mutagenic), although it has been suggested that errors in the repair of DSBs that initiate HR could increase the mutation rate. The hypothesis that recombination is itself mutagenic contrasts with the observation that recombination hotspots are short-lived evolutionary phenomena. Genomic location (ie proximity to telomeres), however, is more evolutionarily stable and plays a role in patterning large-scale recombination activity. Patterns of fine-scale recombination rate in humans may therefore be a poor predictor of the recombinatorial landscape in which sequences have evolved on both human and mouse lineages since they last shared a common ancestor.

Selection has also been invoked to explain the relationship between recombination and diversity. The rationale is that positive (hitchhiking) or negative (background) selection at linked loci is expected to reduce diversity and that, by breaking down linkage, recombination can release neighbouring markers from this selection-induced reduction in diversity [44, 45]. It has been argued, however, that the correlation between recombination and divergence suggests that a neutral, rather than a selective, explanation is more likely. In addition, the relationship between recombination and diversity shows no correlation with gene density [43], which further argues against selection.

It is possible that regions of higher recombination rate have an elevated mutation rate, not because recombination is itself mutagenic, but because the same genomic features elevate both. The best candidate appears to be GC content, which is positively correlated with recombination rate [46]. Indeed, HR is thought to be GC biased; that is, gene conversion will preferentially repair a mismatched base pair to a G-C rather than an A-T [47, 48]. As a consequence, a high recombination rate might be expected to lead to a high GC content over time. Correspondingly, the approximately tenfold greater mutability of the CpG dinucleotide [19] over other dinucleotides suggests that mutation should also be elevated in regions of high GC content. Once GC content is taken into account, much of the association between recombination and mutation at larger scales disappears, although a fine-scale correlation of diversity and recombination rate persists [13].

Non-allelic HR between duplicated sequences

Much of the recent interest in non-allelic HR has been caused by the observation from the Human Genome Project that almost half of the human genome is duplicated elsewhere in the genome [49-51]. Approximately 5-6 per cent of the human genome can be found in > 1 kb blocks of > 90 per cent sequence similarity to other locations in the genome (known as segmental duplications) [50, 51]. Furthermore, about 42 per cent of the genome is accounted for by families of dispersed repetitive elements (short interspersed nuclear elements [SINEs], long interspersed nuclear elements [LINEs] and human endogenous retroviruses [HERVs]) [9]. While these two classes of duplicated sequence are typically considered separately in evolutionary analyses, it is clear that non-allelic HR can occur within both sets of duplicated sequence [52, 53].

Multiple mechanisms account for the origins of these duplicated sequences. Families of dispersed repeats seem to populate the genome in bursts of infectious activity. Non-allelic HR between these small dispersed repeats can in turn lead to much larger segmental duplications [54]. It has been suggested that the rapid populating of an ancestral primate genome with Alu elements facilitated a recent burst of segmental duplication [54] approximately 40 million years ago. It also appears, however, that physical fragility of the DNA sequence additionally plays a role in generating segmental duplications [55].

Non-allelic HR, similarly to allelic HR, can result in crossovers (sometimes known as unequal crossovers) and gene conversion events and is characterised by the existence of hotspots of activity [56]. The duplicated substrates for non-allelic HR can be on the same or non-homologous chromosomes, although it appears that intrachromosomal interactions are much more frequent [57]. Crossovers promoted by non-allelic HR generate rearrangements; the precise structural change depends on the orientation of the duplicated sequences: repeats in direct orientation promote deletions and duplications, whereas repeats in inverted orientation sponsor inversions [15]. Copy number changes sponsored by non-allelic HR need not simply involve deletion or duplication of single copy sequence but may also include dramatic variation in the copy number of tandemly duplicated arrays; for example, an individual X chromosome may carry between one and nine copies of the X-linked opsin genes [58]. The prevalence of these tandem arrays in the human genome has yet to be systematically characterised; however, genes known to exist in polymorphic arrays include those for amylase, alpha-defensins, beta-defensins, opsins, CYP2D6, TSPY, globins, rDNA and histones.
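
The dependence of the structural outcome on repeat orientation can be illustrated with a toy model. In this Python sketch a chromosome is a list of labelled segments; the labels, the function names and the simplification that a crossover falls cleanly within a repeat are assumptions made for illustration only. The inversion function reverses the order of the intervening segments but does not track the orientation of each individual segment.

```python
def unequal_crossover(chrom, i, j):
    """Crossover between direct repeats at index i on one homologue and
    index j on the other (i < j). Returns (deletion, duplication) products."""
    if not 0 <= i < j < len(chrom) or chrom[i] != chrom[j]:
        raise ValueError("i and j must index two copies of the same repeat")
    # Product 1: repeat i is joined to the sequence distal to repeat j.
    deletion = chrom[: i + 1] + chrom[j + 1:]
    # Product 2 (reciprocal): repeat j is joined to the sequence distal to repeat i.
    duplication = chrom[: j + 1] + chrom[i + 1:]
    return deletion, duplication


def inversion(chrom, i, j):
    """Crossover between inverted repeats at indices i and j (i < j) on the
    same chromosome: the segments between the repeats are reversed in order."""
    if not 0 <= i < j < len(chrom):
        raise ValueError("i and j must be valid indices with i < j")
    return chrom[: i + 1] + chrom[i + 1: j][::-1] + chrom[j:]


# Hypothetical layouts: "R" marks a repeat copy, lower-case letters mark
# single-copy segments; "R>" and "<R" mark copies in opposite orientation.
direct = ["a", "R", "b", "R", "c"]
inverted = ["a", "R>", "b", "c", "<R", "d"]

deleted, duplicated = unequal_crossover(direct, 1, 3)
print(deleted)      # segment "b" and one repeat copy are lost
print(duplicated)   # segment "b" and one repeat copy are gained
print(inversion(inverted, 1, 4))
```

Applying the unequal crossover repeatedly to its own duplication product would expand an array further, which is one way to picture the copy number variation in tandem arrays described above.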

Concerted evolution describes the observation that duplicated sequences appear to be more closely related within species than they are to their orthologues in related species [59]. Concerted evolution can arise from both gene conversion and multiple rounds of unequal crossovers (Figure 3). Gene conversions transfer sequence between duplicated substrates, which can lead to the homogenisation of a family of repeats. Concerted evolutionary processes can thwart attempts to date duplication events that equate sequence similarity between two duplicated sequences to the time since duplication [60, 61]. By homogenising duplicated sequences, gene conversion causes such analyses to underestimate the age of the duplication event; many duplications are older than they might first appear. While concerted evolution was initially characterised in tandemly arrayed gene families (eg ribosomal DNA, globins, opsins), it has more recently been observed in interspersed duplications [62].
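
The way in which homogenisation biases duplication dating can be made concrete with a deliberately simple calculation. This Python sketch assumes a strict molecular clock in which two duplicates accumulate divergence at twice the per-copy substitution rate, and it assumes that a single gene conversion event reset a given fraction of the duplicate to zero divergence at some point after duplication. The function names and all parameter values are hypothetical and are not estimates from the literature.

```python
def expected_divergence(mu, generations):
    """Expected per-site divergence between two copies evolving independently
    for `generations`, each at substitution rate `mu` per site per generation."""
    return 2.0 * mu * generations


def apparent_age(true_age, converted_fraction, since_conversion, mu):
    """Age inferred from average divergence when part of the duplicate was
    homogenised by gene conversion `since_conversion` generations ago."""
    if not 0.0 <= converted_fraction <= 1.0:
        raise ValueError("converted_fraction must lie between 0 and 1")
    if not 0 <= since_conversion <= true_age:
        raise ValueError("conversion cannot predate the duplication")
    divergence = (
        converted_fraction * expected_divergence(mu, since_conversion)
        + (1.0 - converted_fraction) * expected_divergence(mu, true_age)
    )
    # Naive dating: invert the clock as if no conversion had occurred.
    return divergence / (2.0 * mu)


# Arbitrary illustrative values, not measurements.
print(apparent_age(true_age=1000, converted_fraction=0.5,
                   since_conversion=100, mu=1e-8))
```

Under these assumptions the inferred age is a weighted average of the true age and the time since conversion, so it can never exceed the true age and falls below it whenever any part of the duplicate has been converted.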

Figure 3

Alternative mechanisms of concerted evolution. Both gene conversion and unequal crossover between duplicated sequences can homogenise duplicated sequences within a species (concerted evolution). This figure shows two interspersed duplicated sequences in direct orientation, which contain variant sites that distinguish the two sequences, shown as green and orange bars. Gene conversion homogenises these duplicates without any change in copy number of the duplicates in intermediate stages. Unequal crossovers can homogenise these repeats by repeated rounds of expansion and contraction with crossovers located at different locations within the duplicates. At each round, one of the two products of the previous unequal crossover (indicated by the double-headed arrow at the crossover point) undergoes an unequal crossover with the same repeat structure on the homologous chromosome.

As with allelic HR, it has been suggested that non-allelic HR influences sequence diversity and divergence in duplicated sequences. There have been conflicting reports of the direction of this influence. Analysis of the long, almost identical inverted repeats on Yq has indicated a significantly lower sequence divergence between humans and chimpanzees within the repeats than in flanking single copy sequence [63]. This difference was attributed to high levels of gene conversion operating between the inverted repeats repressing sequence divergence between orthologous sequences. By contrast, analyses of the duplicated HERV sequences that promote the AZFa deletion (also on Yq) have revealed elevated sequence divergence within known non-allelic HR hotspots [64] and increased sequence diversity flanking the hotspot [65]. Simulations of the gene conversion process suggest that elevated sequence diversity and orthologous divergence are to be expected when duplicated sequences are themselves slightly differentiated [64]. It remains to be seen whether these observations can be generalised to the entire genome, but the enrichment of apparent SNPs (in the dbSNP database of sequence variation) within segmental duplications [66] and the observation of the gene conversion process operating on other chromosomes [67] are highly suggestive.

Evolutionary benefits of having a duplicated genome

Gene duplication and divergence has long been hypothesised to be the major mechanism by which novel gene functions arise [68]. Once a gene has been duplicated, selective constraints are relaxed and there are several mechanisms by which the duplicates can diverge in function while fulfilling the role of the ancestral gene (reviewed by Hurles [69]). The widespread existence of gene families bears testament to the importance of gene duplication in evolution. Comparative whole-genome sequence analysis now gives a more complete picture of how genomes adapt to novel environments. Comparisons between human and rodent and mammalian and avian genomes [10-12] have implied the importance of lineage-specific expansions of particular clustered gene families, although greater effort is required, on a locus by locus basis, to discount the role of neutral processes in the origins of these structures. The clustering of these genes implicates non-allelic HR, both in their origins and in their patterns of sequence evolution. These lineage-specific expansions are often of gene families involved in sensory perception, toxin metabolism, immune response and reproduction [70]. These gene functions are also observed in single copy genes that show evidence of recent positive selection [71], suggesting that these functions are among the most important for rapid adaptation to novel environments. Interestingly, many pathogens also exhibit clusters of genes involved in antigenic variation; it appears that non-allelic HR is an important mutational mechanism operating in the ongoing arms race between pathogens and immune systems. The relatively high mutation rate of non-allelic HR, compared with that of sequence evolution, is probably an important factor underlying this observation.

Non-allelic HR and disease

As with all mutational processes, non-allelic HR generates variation that is subject to natural selection. While this variation can confer evolutionary benefits, as described above, it can also cause disease. Both unequal crossovers and gene conversions between duplicated sequences have pathogenic potential. A growing number of genetic diseases (Table 1) have been recognised to be caused by deletions and duplications of dosage-sensitive genes, inversions disrupting genic structures and gene conversions ablating normal gene function [15]. Disease-causing rearrangements have been identified within tandemly duplicated gene arrays, as well as between interspersed duplicates. While non-allelic HR is not the sole cause of structural variation in the human genome, HR between tandemly duplicated arrays, between segmental duplications and between dispersed repeats has been demonstrated to be a major cause of these mutations.

Table 1 Examples of diseases caused by non-allelic HR.

Perhaps the most common outcome of gene duplication is that one of the copies acquires mutations that may render it non-functional. Not only is the potential for evolving a novel function lost, but this pseudogene now contains a reservoir of mutations that can be gene converted into the remaining functional gene [72-74]. The prevalence of pseudogenes in the human genome suggests that many genes will have associated pseudogenes [75].

While much attention has focused on the role of non-allelic HR in genetic disorders with Mendelian inheritance patterns, little effort has been devoted to investigating its role in the genetics of complex diseases. This is despite the longstanding existence of examples of the role of rearrangements in more complex phenotypes such as drug response [76] and resistance to infectious diseases such as malaria [77]. More recently, elevated copy number of a segmental duplication containing the gene CCL3L1 has been shown to protect against HIV/AIDS [78]. In addition, a chromosomal inversion with a convoluted evolutionary history has been demonstrated to confer a selective advantage in recent generations of the Icelandic population [79]. The lack of studies investigating structural polymorphism and complex traits has perhaps been due to an underestimation of the degree of structural variation in the human genome. Recent studies demonstrate that there is much more large-scale copy number variation than was previously thought to exist and also point towards methods that can redress this under-ascertainment in a systematic fashion [80-82].

Conclusions

The mutagenic potential of non-allelic HR was identified early in the history of molecular genetics, yet, due to the difficulty of experimentally interrogating duplicated sequences, a fuller appreciation of its evolutionary and pathogenic roles has had to await the publication of whole-genome sequences. Clearly, there are both costs and benefits to having a highly duplicated, and therefore mutable, genome.

Despite recent advances, non-allelic HR remains perhaps the most poorly characterised mutation process in the human genome. While the human genome reference sequence provides a reasonable understanding of where non-allelic HR is likely to occur, little is known about the rates of these processes and how they vary between individuals and over evolutionary time-scales. It is worth noting that variation in the frequencies of chromosomal rearrangements along different evolutionary lineages need not reflect the degree of duplication in an ancestral genome, but might result from specific demographic histories (eg population bottlenecks) that have transiently favoured the fixation of chromosomal rearrangements.

Given the role that non-allelic HR appears to have played in the rapid adaptation to novel environments observed within mammalian genome comparisons, it will be of great interest to investigate the genomic changes that must have accompanied the adaptation of different groups of humans to the wide range of environments that our species presently occupies. The recent work on the CCL3L1-containing segmental duplication described above illustrates how a reservoir of structural variation allows a rapid response to the new selective environment posed by a novel human pathogen [78].