1 Species Trees and Incomplete Lineage Sorting

The sequencing of all the great ape genomes [1,2,3,4,5,6] has allowed us to paint a detailed picture of the species relationships between humans and their closest relatives. By joint analysis of full genomes from pairs of species, coalescent hidden Markov models (CoalHMM) (see Chapter 8) can efficiently model both sequence divergence and recombination by approximating the full ancestral recombination graph as a Markov process along the genome. The states in these hidden Markov models represent different gene trees separated by recombination events. Such models can jointly estimate both the time of reproductive isolation (the time of speciation) and the size of the ancestral population that gave rise to the two species. Figure 1 provides an overview of the estimated split times and ancestral population sizes. The estimated split times are computed assuming that the mutation rate across the species tree has remained constant at 0.6 × 10⁻⁹ per year, the rate observed in humans. However, speciation dates produced using a constant mutation rate do not square with the physical dating of ancestral fossil species. To reconcile DNA and fossil evidence, it has been proposed that the yearly mutation rate has slowed down in the great ape lineage [8], possibly as a result of the development of larger body sizes and longer generation times.

Fig. 1

Species tree of humans and the great apes. Dashed lines represent speciation events with estimated dates. Gray numbers show estimated sizes of ancestral populations, and percentages show the estimated amount of incomplete lineage sorting (see below) between two descendant species and their immediate outgroup (e.g., 30% ILS between human, chimpanzee, and gorilla). (The figure is adapted from Mailund et al. [7].)

The time to the most recent common ancestor of sequences sampled from two species lies much further into the past than the time when the species split apart. For this reason, the common ancestor of two lineages from separate species may not be found in the population immediately ancestral to the two species, but even further into the past, in a population that is also ancestral to additional species. When more than two species are sampled, it is therefore possible that a lineage from one of the two most closely related species finds a common ancestor with a lineage from a more distantly related species before it does so with the lineage from its sister species. This is especially true for the relationship between human, chimpanzee, and gorilla. Between the speciation event separating human and chimpanzee and that separating human and gorilla, a lot of ancestral polymorphism was conserved in the large ancestral population. The more rapid the succession of speciation events, and the larger the ancestral population between them, the more ancestral polymorphism will be conserved. The implication is that individual gene trees along the alignment of these three species will not always group the same two species as the species tree does. The phenomenon is called incomplete lineage sorting (ILS) because the lineages of individual gene trees are not completely sorted according to species (see Chapter 1 for further details).
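The gap between split time and coalescence time can be made concrete with the standard neutral coalescent expectation: two lineages that enter an ancestral population of constant effective size N wait on average 2N generations before they coalesce. The sketch below is only an illustration under that simple model (constant ancestral size, no gene flow). The function names and the example split time, ancestral size, and generation time are assumptions chosen for illustration, not estimates from this chapter; only the mutation rate is the one quoted earlier in the text.

```python
def expected_tmrca_years(split_years, ancestral_ne, generation_years):
    """Expected time to the most recent common ancestor of two lineages
    sampled from two species, assuming a constant-size ancestral population.

    Two lineages coalesce on average 2 * Ne generations after entering
    the ancestral population (standard neutral coalescent result).
    """
    return split_years + 2.0 * ancestral_ne * generation_years


def expected_divergence(split_years, ancestral_ne, generation_years, mu_per_year):
    """Expected pairwise sequence divergence per site.

    Mutations accumulate along both lineages, hence the factor of two.
    """
    tmrca = expected_tmrca_years(split_years, ancestral_ne, generation_years)
    return 2.0 * mu_per_year * tmrca


if __name__ == "__main__":
    # Illustrative (hypothetical) values, not estimates from the chapter.
    split = 6.0e6        # years since the species split
    ne = 50_000          # ancestral effective population size
    gen = 25.0           # years per generation
    mu = 0.6e-9          # mutation rate per year quoted in the text

    print(expected_tmrca_years(split, ne, gen))
    print(expected_divergence(split, ne, gen, mu))
```

With these made-up inputs the expected coalescence time exceeds the split time by 2 × 50,000 × 25 = 2.5 million years, which shows why a large ancestral population pushes common ancestry well beyond the speciation event.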

One coalescent hidden Markov model compares three closely related species and exploits information from sequence divergence and ILS to estimate the times of the two speciation events as well as the size of the ancestral population [9, 10]. From this model, it is also possible to extract the proportion of discordant gene trees, that is, gene trees with a topology different from the species tree. Applying this method to human, chimpanzee, and gorilla showed that for ~15% of the genome, humans are more closely related to gorillas than to chimpanzees, and for another ~15%, chimpanzees and gorillas are more closely related to each other than to humans [3] (see Fig. 1). The same model has been applied to alignments of bonobo, chimpanzee, and human, and showed that ~5% of the genome is subject to ILS [4]. Because the proportion of ILS is determined by the ancestral population size and the time between speciation events, we can compute the expected proportions of ILS for trios of species where these parameters have been estimated by other means. Between human, gorilla, and orangutan it is expected to be ~4%, and for human, orangutan, and gibbon it is expected to be ~24% [7]. The great apes thus also showcase how misleading phylogenies built from individual genes may be, since a phylogeny built from a long region of recombining sequence will not represent the population genetic processes that distribute individual lineages among species.
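Under the standard three-species coalescent (constant ancestral size, instantaneous splits, no gene flow), the dependence of ILS on ancestral population size and the time between speciation events has a simple closed form. The two sister lineages fail to coalesce between the two speciation events with probability exp(−t / 2N), where t is the interval in generations, and in that case two of the three equally likely topologies disagree with the species tree. The sketch below implements this textbook expression; it is not the CoalHMM inference itself, and the function name and example inputs are assumptions for illustration.

```python
import math


def ils_proportion(years_between_splits, ancestral_ne, generation_years):
    """Expected fraction of the genome with a gene tree that is
    discordant with the species tree for a trio of species.

    Assumes a constant-size ancestral population between the two
    speciation events, no gene flow, and diploid coalescent scaling.
    """
    generations = years_between_splits / generation_years
    prob_no_coalescence = math.exp(-generations / (2.0 * ancestral_ne))
    # Given no coalescence, the three lineages join in random order:
    # 2 of the 3 possible topologies are discordant.
    return (2.0 / 3.0) * prob_no_coalescence


if __name__ == "__main__":
    # Hypothetical inputs: the proportion falls as the interval grows
    # and rises as the ancestral population gets larger.
    for gap in (0.5e6, 2.0e6, 8.0e6):
        print(gap, ils_proportion(gap, ancestral_ne=50_000, generation_years=25.0))
```

The expression has an upper bound of 2/3, reached when the two speciation events coincide, and it decays toward zero as the interval between them grows relative to 2N generations.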

2 Gene Flow and Demography

Most coalescent hidden Markov models assume that speciation is instantaneous and that the initial split of two populations is not followed by gene flow between the diverging populations. Other coalescent hidden Markov models account for the possibility that such gene flow has occurred [11]. Among the species splits in great ape evolution, most have involved a period of gene flow before consolidation of the populations as separate species [11]. The divergence of the orangutan from the human–chimpanzee–bonobo–gorilla ancestor involved several hundred thousand years of gene flow. The speciation of humans and the chimpanzee–bonobo ancestor most likely also included an extended period of gene flow. Only the speciation separating the bonobo from the chimpanzee seems to be a clear example of an abrupt and permanent split, possibly produced as the Congo River provided a physical barrier between the populations.

An alternative way to estimate gene flow between separating populations is to use methods such as MSMC [12] (see Chapter 7) that can estimate the relative rate of cross-coalescence between populations from the present and into the past. This approach measures the proportion of gene pairs that find common ancestry between two sampled populations rather than within them. Inspecting a curve of this relative cross-coalescence rate can help identify both the time of speciation and whether this was a clean split or a protracted period of reduced gene flow. MSMC, as well as a similar method modeling only one diploid sample (PSMC [13]), also estimates the historical effective population size of a species. Such methods have been used to identify how the great apes have responded to environmental changes and show that great apes have experienced a decline in their effective population size across the last few hundred thousand years [5]. Comparing the curves of historical effective population sizes may also reveal when the species split apart. For the period in the past during which two species share an ancestor, their historical population sizes will be the same. From the time this ancestral population split into two, the sizes of the two populations are free to follow different trajectories, so the species split appears as a separation of the curves of historical population size. Along with methods such as approximate Bayesian computation, PSMC has helped describe the relationship between chimpanzee subspecies [5], showing that eastern and central chimpanzees are most closely related, forming a group separate from Nigeria–Cameroon and western chimpanzees.
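The relative cross-coalescence rate is commonly defined as the between-population coalescence rate divided by the average of the two within-population rates, so that it is near 0 when the populations are fully separated and near 1 when they behave as one population. A minimal sketch of that calculation follows. It does not call MSMC: the input lists stand for per-time-interval coalescence rate estimates that such a tool would produce, and the function names, the 0.5 threshold used as a rough midpoint of the split, and the numbers in the example are all assumptions for illustration.

```python
def relative_cross_coalescence(within_1, within_2, between):
    """Relative cross-coalescence rate per time interval:
    2 * between / (within_1 + within_2).

    All three inputs are sequences of positive coalescence rates,
    one value per time interval, in the same order.
    """
    return [2.0 * b / (w1 + w2) for w1, w2, b in zip(within_1, within_2, between)]


def time_reaching(times, rates, threshold=0.5):
    """First time (going from the present into the past) at which the
    curve reaches the threshold, using linear interpolation between
    neighbouring time points. Returns None if it is never reached.
    """
    for i, value in enumerate(rates):
        if value >= threshold:
            if i == 0:
                return times[0]
            t0, t1 = times[i - 1], times[i]
            r0, r1 = rates[i - 1], value
            return t0 + (threshold - r0) * (t1 - t0) / (r1 - r0)
    return None


if __name__ == "__main__":
    # Hypothetical time points (years ago) and rate estimates.
    times = [0.0, 1.0e5, 2.0e5, 3.0e5]
    within_a = [1.0, 1.0, 1.0, 1.0]
    within_b = [1.0, 1.0, 1.0, 1.0]
    between = [0.0, 0.25, 0.75, 1.0]

    curve = relative_cross_coalescence(within_a, within_b, between)
    print(curve)
    print(time_reaching(times, curve))
```

A curve that rises steeply from 0 to 1 suggests a clean split, whereas a slow rise over a long interval is what a protracted period of gene flow would produce.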

3 Selection

One of the most intriguing questions in great ape evolution is how the adaptive evolution of particular genes has contributed to shaping phenotypes in present-day species. A study comparing a large number of orthologous genes addressed adaptive evolution along the branches of humans and chimpanzees by comparing the rate of evolution at synonymous sites (sites where a mutation does not change the encoded protein) with that at nonsynonymous sites (sites where a mutation replaces an amino acid) [14]. Many of the identified genes were involved in sensory perception and immune defenses, but the genes showing the strongest evidence of positive selection were genes involved in tumor suppression and apoptosis, and genes involved in spermatogenesis [15]. Another way to identify selection in primate genomes is to measure patterns of genetic diversity along the genomes. Selection against slightly deleterious variants reduces genetic diversity in a genomic region around each deleterious variant, a process called background selection (see Charlesworth [16] for a review), in effect reducing the local effective population size. Positive selection also removes variation in a region around the selected variant, leaving a signature in local genetic variation that can be distinguished from that of background selection if the positive selection is strong enough and occurred recently. When a new variant is subject to strong positive selection, variation in the flanking regions is depleted because linked variants are carried to fixation along with the selected variant. This is called a selective sweep because it sweeps away variation in a region around the selected variant and produces a wide genomic region where all individuals from a species share a recent common ancestor [17]. The size of the swept region depends on the strength of selection, the size of the population, and the rate of genetic recombination.
Several methods have been developed to detect sweeps from information in population samples such as the site frequency spectrum, linkage disequilibrium, and population differentiation [18] (see Chapter 5). Because of the relatively small sample sizes available in great apes, no striking examples of recent sweeps on great ape autosomes have been reported (but see Sect. 5 for strong selective sweeps on the X chromosome). However, thanks to the McDonald–Kreitman test framework, there are many estimates of the proportion of beneficial nonsynonymous substitutions (α) across primates (see Chapter 1 for a formal definition of α). Genome-wide estimates in humans and nonhuman primates are very low, with α below 10–20% [1, 19,20,21,22,23], but α can be as high as 50% for particular classes of genes such as immune genes, testis genes, or genes encoding virus-interacting proteins [24, 25].
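One widely used estimator of α within the McDonald–Kreitman framework compares the ratio of nonsynonymous to synonymous polymorphism with the same ratio for fixed differences: α = 1 − (Ds · Pn) / (Dn · Ps). The sketch below implements this simple estimator. It ignores the corrections for slightly deleterious polymorphism and demography that the cited studies apply, and the function name and the counts in the example are invented for illustration.

```python
def mk_alpha(dn, ds, pn, ps):
    """Simple McDonald-Kreitman estimate of the proportion of
    nonsynonymous substitutions fixed by positive selection.

    dn, ds: nonsynonymous and synonymous fixed differences (divergence)
    pn, ps: nonsynonymous and synonymous polymorphisms

    alpha = 1 - (ds * pn) / (dn * ps)
    """
    if dn == 0 or ps == 0:
        raise ValueError("dn and ps must be non-zero for alpha to be defined")
    return 1.0 - (ds * pn) / (dn * ps)


if __name__ == "__main__":
    # Invented counts for a single hypothetical gene set.
    print(mk_alpha(dn=50, ds=100, pn=20, ps=80))
```

Under strict neutrality the two ratios are equal and α is 0; an excess of nonsynonymous divergence relative to polymorphism gives a positive α, and segregating slightly deleterious variants can push the simple estimate below 0.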

It is still debated whether positive or negative selection is more prominent in shaping diversity along great ape genomes. It is also unresolved whether selective sweeps are mainly due to new mutations [17] or to selection on standing variation [26], and which of the two is more important for adaptation and for the surrounding patterns of DNA diversity. One argument suggesting that sweeps from new mutations contribute significantly to variation in diversity is that great apes with larger population sizes show more dramatic reductions in diversity near genes [27]. This dependence on population size is consistent with the action of positive selection rather than negative selection and suggests that new beneficial mutations leading to sweeps arise more often in species with a larger number of individuals subject to mutation. Identification of selective sweeps, from depressions in diversity or distortions of the site frequency spectrum, is limited to the recent past, where a sample of individuals is expected to be represented by many ancestors. An alternative method to quantify the impact of sweeps on longer timescales is to identify extended regions devoid of incomplete lineage sorting. A sweep in an ancestral species will induce common ancestry for all lineages in a wide region around the selected variant and thus preclude the possibility of incomplete lineage sorting in the region. By identifying and comparing such regions in both the ancestor of human and chimpanzee and the ancestor of human and orangutan, it was possible to show that the human–chimpanzee ancestor experienced a higher frequency of strong sweeps than the human–orangutan ancestor [28].

To address the forces of positive and negative selection in the great apes, we need to know what proportion of new mutations are advantageous, neutral, or deleterious and whether these proportions differ across these species. The distribution of fitness effects (DFE) describes the proportions of new mutations that are effectively neutral and new mutations that are under selection [29, 30] (see Chapter 1). The DFE further distinguishes between advantageous mutations, which increase the fitness of the organism, and deleterious mutations, which impair survival or fertility. Several methods are available to infer this continuum of selective effects from DNA sequence data [19,20,21, 31,32,33,34]. Initial studies in humans with modest sample sizes found that ~25% of new nonsynonymous mutations are effectively neutral (−1 < 2Ns < 1), ~15% are weakly deleterious (−10 < 2Ns ≤ −1), and ~60% are moderately to strongly deleterious (2Ns ≤ −10) [19,20,21, 31,32,33,34]. A recent study with a large sample size was able to further refine the estimated proportion of new nonsynonymous mutations that are strongly deleterious (2Ns ≤ −100) to 14–22% and the proportion of weakly deleterious mutations (−10 < 2Ns ≤ −1) to 25–33% [32]. The DFE for new nonsynonymous mutations is quite similar across great apes despite the differences in the species' long-term Ne [22, 35]. This similarity may be explained by the highly leptokurtic DFE of these species, which predicts that substantial changes in Ne will only have a modest impact on the selective effects of mutations. Nonetheless, very different methods and assumptions have been invoked to estimate the DFE across species, and even the shape of the DFE is still a contentious issue. There is very limited knowledge about the DFE of new noncoding mutations, and all we know relies on measures of DNA conservation across mammals and primates.
Thus, for noncoding DNA we are only able to say what proportion of new mutations are effectively neutral (−1 < 2Ns < 1) and what proportion are effectively selected against (2Ns ≤ −1). These rough conservation scores show that only 2–5% of point mutations at noncoding sites might be under purifying selection in humans and other primates [36,37,38,39,40].
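The DFE categories used here are intervals of the scaled selection coefficient 2Ns, and a short sketch can make the bin boundaries explicit. The code below reads the bins as −1 < 2Ns < 1 (effectively neutral), −10 < 2Ns ≤ −1 (weakly deleterious), and 2Ns ≤ −10 (moderately to strongly deleterious). The label for 2Ns ≥ 1, the function names, and the example values are assumptions added for illustration; this is a tally of given values, not a DFE inference method.

```python
from collections import Counter


def dfe_class(two_ns):
    """Assign a scaled selection coefficient 2Ns to a DFE category."""
    if two_ns <= -10:
        return "moderately to strongly deleterious"
    if two_ns <= -1:
        return "weakly deleterious"
    if two_ns < 1:
        return "effectively neutral"
    # Bin not defined in the text; labelled here as an assumption.
    return "advantageous"


def dfe_proportions(two_ns_values):
    """Proportion of mutations falling in each DFE category."""
    counts = Counter(dfe_class(value) for value in two_ns_values)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical scaled selection coefficients for ten new mutations.
    example = [-250.0, -40.0, -12.0, -10.0, -5.0, -1.0, -0.5, 0.0, 0.3, 2.0]
    for label, share in dfe_proportions(example).items():
        print(label, share)
```

Because the categories depend on the product of N and s, the same mutation can move between bins when the effective population size changes, which is why comparisons of the DFE across species with different long-term Ne are informative.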

Balancing selection is another mode of selection that differs from directional selection in that it does not drive selected variants toward fixation or extinction (see Chapter 1 for a definition of balancing selection). Instead, it maintains genetic variation by stabilizing alleles at intermediate frequencies. There are several methods to detect loci under recent and/or long-term balancing selection [22, 41, 42]. A recent study in great apes has confirmed that immune genes are enriched in signals of balancing selection, and it has found that genes involved in the formation of the skin are also under balancing selection [22]. Some of these polymorphisms maintained by balancing selection are even shared between humans and chimpanzees; the most prominent example is the major histocompatibility complex (MHC).

4 Recombination

The rate of recombination varies along the genome. The local recombination rate in each part of the genome can be estimated from patterns of linkage disequilibrium (LD) [43]. It can also be inferred from individually called recombination events by comparing many parent and offspring genomes [44] or by examining genomes of individuals with mixed ancestry [45]. The landscape of varying recombination rate across the genome is referred to as a recombination map. For humans, recombination maps have been produced by all three approaches. Among the great apes, detailed recombination maps only exist for bonobo, chimpanzee, and gorilla. These are produced using the same LD-based method used in humans, allowing direct comparison of recombination maps across species. In all four species, recombination rate varies on a large scale (millions of bases), and this variation is associated with the size of chromosomes, the chromosomal position, the sequence GC content, the gene density, and several other factors [46]. At the fine scale (thousands of bases), recombination rate is determined by the location of so-called recombination hotspots, where about 60% of recombination events occur even though these hotspots constitute only ~6% of the genome [47]. The location of hotspots is determined by the affinity of the PRDM9 protein for certain DNA motifs present at hotspots. This affinity is encoded in a zinc-finger array whose DNA-contacting residues are under strong positive selection. It is now clear that biased gene conversion favors alleles that disrupt hotspots. This depletion of hotspot motifs may result in selection for PRDM9 variants recognizing alternative motifs, producing a turnover of hotspot locations [48]. A comparative analysis of recombination maps of the four species [49, 50] showed that recombination rate on a megabase scale is highly conserved across species, but that the locations of recombination hotspots are completely different.
Only a few hotspots are shared even between chimpanzee subspecies, revealing that turnover of hotspot locations commences on short evolutionary timescales [50].
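A statement such as "about 60% of recombination occurs in ~6% of the genome" is a summary of how unevenly genetic map length is distributed over physical sequence. The sketch below shows one way such a summary could be computed from a windowed recombination map. The input format (one rate and one length per window), the function name, and the toy map are assumptions for illustration and do not reproduce the procedure of the cited studies.

```python
def share_in_hottest(rates, lengths, genome_fraction):
    """Fraction of total recombination that falls within the hottest
    `genome_fraction` of the sequence.

    rates:   recombination rate per base pair for each window
    lengths: physical length of each window in base pairs
    """
    if len(rates) != len(lengths):
        raise ValueError("rates and lengths must have the same length")

    total_length = sum(lengths)
    total_map = sum(rate * length for rate, length in zip(rates, lengths))

    # Walk through windows from hottest to coldest until the requested
    # share of physical sequence has been used up.
    remaining = genome_fraction * total_length
    captured = 0.0
    windows = sorted(zip(rates, lengths), key=lambda w: w[0], reverse=True)
    for rate, length in windows:
        if remaining <= 0:
            break
        used = min(length, remaining)
        captured += rate * used
        remaining -= used
    return captured / total_map


if __name__ == "__main__":
    # Toy map: one hot window among nine cold windows of equal length.
    toy_rates = [9.0] + [1.0] * 9
    toy_lengths = [1000] * 10
    print(share_in_hottest(toy_rates, toy_lengths, 0.1))
```

Applied to maps from two species, the same summary could be high in both while the hottest windows themselves sit in different places, which is the pattern of conserved broad-scale rates and diverged hotspot locations described above.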

Comparative studies of recombination have less power than comparative studies of genome sequences: whereas sequence change can be assigned to individual species branches using standard models of molecular evolution, change in recombination rate has so far only been observed as differences between pairs of species. This is because the differences between two species cannot be resolved into the change that occurred in each species without knowledge of recombination rates in the species' common ancestor. Fortunately, it is now also possible to construct recombination maps for ancestral species if enough incomplete lineage sorting is present [2]. This approach takes advantage of the fact that gene trees with different topologies must be separated by a recombination event. When sequences are sampled from three different species, the majority of recombination events separating gene trees with different topologies will have occurred in the species ancestral to the two most closely related species. This approach has been used to produce a recombination map of the ancestor of human and chimpanzee [3]. By resolving the differences between humans and chimpanzees into the changes that occurred in each species since their divergence, it was shown that recombination rate had evolved more rapidly in humans than in chimpanzees and that striking changes in recombination rate had resulted from a genomic inversion and a chromosome fusion in the human lineage.
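The observation that a change in gene-tree topology implies at least one recombination event between the two trees can be turned into a crude per-window count. The sketch below is not the model-based method of the cited studies. It only tallies topology changes along a hypothetical decoded path of gene trees, and the topology labels, input format, and function name are assumptions for illustration.

```python
def topology_switch_counts(segment_starts, topologies, window_size, sequence_length):
    """Count changes of gene-tree topology in fixed-size windows.

    segment_starts: start coordinate (0-based) of each consecutive segment
    topologies:     topology label of each segment, in the same order
    window_size:    window length in base pairs
    sequence_length: total alignment length in base pairs

    Each change of label between neighbouring segments marks at least
    one recombination event, placed at the start of the later segment.
    """
    n_windows = (sequence_length + window_size - 1) // window_size
    counts = [0] * n_windows
    for i in range(1, len(topologies)):
        if topologies[i] != topologies[i - 1]:
            counts[segment_starts[i] // window_size] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical decoded path for human (H), chimpanzee (C), gorilla (G):
    # "HC" matches the species tree, "HG" and "CG" are discordant.
    starts = [0, 500, 1200, 1800, 2500]
    labels = ["HC", "HG", "HG", "CG", "HC"]
    print(topology_switch_counts(starts, labels, window_size=1000, sequence_length=3000))
```

Recombination events that leave the topology unchanged are invisible to such a count, so it can at best serve as a relative measure, and it only works where there is enough ILS for topologies to vary along the alignment.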

5 The X Chromosomes of Great Apes

The unique mode of inheritance of X chromosomes exposes them to population genetic processes that differ from those of the autosomes. In a simple population genetic model, the effective population size of the X chromosome will be 3/4 that of the autosomes. However, this ratio is influenced by many factors such as a difference in generation time and reproductive variance between the sexes, or a stronger propensity of one sex to migrate between subpopulations. More recently it has been suggested that linked selection on the X chromosome in the form of selective sweeps may contribute significantly to a reduced X–autosome ratio. Analysis of diversity along the X chromosomes of the great apes identified extreme selective sweeps in the form of wide regions with strongly reduced diversity and a higher proportion of singleton polymorphisms [51]. The swept regions overlap partially between species, suggesting some amount of recurrent positive selection on the same genes. A separate study exploiting patterns of ILS to measure the cumulative effect of sweeps in the human–chimpanzee ancestor identified a set of wider regions, spanning the regions identified in extant great ape species [52]. This suggests that regions of the X chromosome are subject to recurrent, very strong positive selection. Since these extreme sweeps are only observed on the X chromosomes, it is possible that this is the result of selection on “selfish genes.” Such selfish genes, catering only for the preferential transmission of X or Y chromosomes into viable sperm, are potentially subject to a particular kind of positive selection called meiotic drive. Even modest transmission distortions will provide selective advantages strong enough to explain the magnitude of these sweeps.
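The 3/4 expectation, and its sensitivity to differences between the sexes, follows from the classical formulas for effective population size with unequal numbers of breeding females (Nf) and males (Nm): Ne(autosome) = 4·Nf·Nm / (Nf + Nm) and Ne(X) = 9·Nf·Nm / (4·Nm + 2·Nf). The sketch below evaluates these textbook expressions. Modeling higher male reproductive variance as a smaller effective number of breeding males is a simplifying assumption, and the function names and example numbers are illustrative only.

```python
def autosomal_ne(n_females, n_males):
    """Effective size for autosomes with separate sexes."""
    return 4.0 * n_females * n_males / (n_females + n_males)


def x_chromosome_ne(n_females, n_males):
    """Effective size for the X chromosome with separate sexes."""
    return 9.0 * n_females * n_males / (4.0 * n_males + 2.0 * n_females)


def x_to_autosome_ratio(n_females, n_males):
    """Expected ratio of X-linked to autosomal effective size."""
    return x_chromosome_ne(n_females, n_males) / autosomal_ne(n_females, n_males)


if __name__ == "__main__":
    # Equal numbers of breeding females and males give the 3/4 expectation.
    print(x_to_autosome_ratio(1000, 1000))
    # Hypothetical case with far fewer breeding males than females.
    print(x_to_autosome_ratio(1000, 100))
```

In this model the ratio stays between 9/16 and 9/8, so demographic differences between the sexes alone cannot push it below 9/16; a lower observed ratio of diversity would point to additional forces such as linked selection on the X chromosome.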

6 Conclusion

The examples of insights provided above represent only the first glimpses of the evolutionary history we share with the great apes as well as the evolution that is private to each species. As genetic diversity across the ranges of each great ape is assayed in more detail, we will get a much deeper understanding of how diverse population genetic processes have shaped genomes very similar to our own.