1 Genomic Variation

1.1 Loci, Alleles, and Polymorphism

Population genomics studies the evolution of genome variants in populations. A locus (pl. loci) refers to a given location in the genome. The particular sequence at a given locus may vary between individuals, each variant being termed an allele. We call loci with at least two alleles polymorphic and invariant loci monomorphic. The term polymorphism refers to the presence of multiple alleles but is commonly used as a countable noun as a substitute for “polymorphic locus” (one polymorphism, several polymorphisms).

Alleles may differ in nucleotide content, but also in length, as a result of nucleotide insertions or deletions (a.k.a. indels). Variable loci of length one can have up to four distinct alleles (A, C, G, or T) and are termed single nucleotide polymorphisms (SNPs). SNPs constitute, so far, the majority of the data accounted for by population genetic models.

1.2 Mutations

Molecular events altering the genome are termed mutations. Mutations include the substitution of one nucleotide by another, the removal or addition of one or several nucleotides, as well as the multiplication of some part of the genome. Mutation is the process by which new alleles are formed. The infinite sites model assumes that during the timeframe of evolution modeled, each locus has undergone at most one mutation [1,2,3]. This model also implies that each mutation creates a new allele in the population and that there is no “backward” or “reverse” mutation. Under this premise, at most two alleles are expected per locus. The infinite sites model is a generally reasonable assumption as the mutation rate is typically low and genomes are large. It might be locally invalidated, however, in case of mutation hotspots or when larger evolutionary timescales are considered. Loci with two alleles are termed diallelic or biallelic, the first term having historical precedence and being more accurate [4], while the second has been more commonly used since the 1990s. Furthermore, in a population genomic dataset, a sampled diallelic locus is called a singleton if one of the two alleles is present in only one haploid genome, and a doubleton if it is present in precisely two haploid genomes.

1.3 The Wright–Fisher Model

The simplest process of allele evolution within a single population is named the Wright–Fisher model. It describes the evolution of alleles in a population of fixed and constant size, where all alleles have the same fitness, and therefore the same chance to be transmitted to the next generation (neutral evolution). The population is assumed to be panmictic, that is, individuals mate randomly. Time is discretized in non-overlapping generations so that the alleles in the current generation are a random sample of the alleles from the previous generation, without new alleles being generated by mutation. Under such conditions, allelic frequencies evolve only because of the stochasticity in the sampling of gametes that will contribute to the next generation, a process termed genetic drift. Because populations are of finite size, alleles will be sampled at their actual frequencies on average only, and the ultimate fate of any allele is either to reach frequency zero in the population and be lost, when by chance no individual carrying this allele has any descendant in the next generation, or to become fixed when all other alleles have been lost. The time until fixation depends on the population size: smaller populations will show a stronger sampling effect and shorter times to fixation. When genetic drift is the only force acting on a population, the number of alleles at a given locus can only decrease over time.
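
The sampling process described above can be illustrated with a short simulation. The following is a minimal sketch, assuming a single diallelic locus in a population of N diploid individuals (2N gene copies); the function name wright_fisher and its parameters are hypothetical and chosen for illustration only.

```python
import random

def wright_fisher(n_diploid, p0, generations, seed=None):
    """Simulate neutral drift at one diallelic locus.

    Returns the list of allele frequencies, one per generation, stopping
    early if the allele is lost (frequency 0) or fixed (frequency 1).
    """
    rng = random.Random(seed)
    n_copies = 2 * n_diploid              # number of gene copies in a diploid population
    count = round(p0 * n_copies)          # initial number of copies of the focal allele
    trajectory = [count / n_copies]
    for _ in range(generations):
        p = count / n_copies
        # Each gene copy of the next generation is drawn at random
        # (with replacement) from the current generation.
        count = sum(rng.random() < p for _ in range(n_copies))
        trajectory.append(count / n_copies)
        if count in (0, n_copies):        # allele lost or fixed: drift stops
            break
    return trajectory

if __name__ == "__main__":
    for n in (10, 100):
        traj = wright_fisher(n, 0.5, 100_000, seed=1)
        print(f"N = {n}: absorbed at frequency {traj[-1]} after {len(traj) - 1} generations")
```

Running the sketch with several seeds and population sizes should show that trajectories always end at 0 or 1, and that absorption tends to take longer in larger populations.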

The Wright–Fisher model with mutation extends the Wright–Fisher model by introducing new alleles in the population, at a given rate. As the mutation rate is low, new mutations appear in a single copy; their initial frequency is then 1∕(2N) in a diploid population. Mutation and drift act in opposite directions and a mutation-drift equilibrium is reached when the rate of allele creation by mutation equals the rate of allele loss by drift. The genetic diversity is then determined solely by the product of the population size N and the mutation rate u. Under the infinite sites model, the expected heterozygosity at a locus in a population of diploid individuals is approximated by [1]

$$ \hat{h}=\frac{4\cdot N\cdot u}{4\cdot N\cdot u+1} $$

while the expected number of distinct alleles and their respective frequencies can be estimated using Ewens’s sampling formula [5].
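
As a numerical illustration of the heterozygosity approximation above, the following sketch evaluates it for a few population sizes. The function name and the example values of N and u are assumptions made for illustration, not values from the text.

```python
def expected_heterozygosity(n_diploid, u):
    """Expected heterozygosity at mutation-drift equilibrium: 4Nu / (4Nu + 1)."""
    theta = 4 * n_diploid * u
    return theta / (theta + 1)

if __name__ == "__main__":
    u = 1e-8  # hypothetical mutation rate per generation
    for n in (1_000, 100_000, 10_000_000):
        print(n, expected_heterozygosity(n, u))
```

The output grows with the product N ⋅ u and saturates toward 1 when 4N ⋅ u is large.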

A substitution occurs when a new mutation has spread in the population, increasing from frequency 1∕(2N) to 1 (see Note 1). Kimura showed that the average time to fixation of a new mutation is 4N generations in a population of diploid individuals [6]. Furthermore, as a neutral mutation has a probability of reaching fixation equal to 1∕(2N) and given that there are 2N ⋅ u new mutations per generation, in a purely neutrally evolving population, the expected number of substitutions per generation is equal to 2N ⋅ u ⋅ 1∕(2N) = u. The substitution rate is therefore independent of the population size and, assuming that the mutation rate is constant in time, the number of substitutions between two populations is a direct measure of the number of generations separating them, a phenomenon termed the molecular clock [7].
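
The cancellation of the population size in the neutral substitution rate can be checked with a few lines of arithmetic. This is a minimal sketch; the function name and the numerical values are hypothetical.

```python
def neutral_substitution_rate(n_diploid, u):
    """Expected number of neutral substitutions per generation."""
    new_mutations = 2 * n_diploid * u      # new mutations entering the population each generation
    p_fixation = 1 / (2 * n_diploid)       # fixation probability of a new neutral mutation
    return new_mutations * p_fixation

if __name__ == "__main__":
    u = 1e-8  # hypothetical mutation rate per generation
    for n in (100, 10_000, 1_000_000):
        print(n, neutral_substitution_rate(n, u))
```

Up to floating-point rounding, the printed rate equals u whatever the value of N.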

1.4 The Backward Wright–Fisher Model: The Standard Coalescent

While the Wright–Fisher process naturally describes the evolution of sequences within populations one generation after the other, population genetic data typically represent individuals sampled at a given time point. For inference purposes, it is therefore convenient to model the history of the genetic material that gave rise to the sample. The ancestry of a sample (also known as the genealogy) is typically modeled backward in time: at every locus, lineages find common ancestors in the past until the most recent common ancestor (MRCA) of the sample is reached. The merging of two lineages in the past is called a coalescence event, and the set of mathematical tools describing this process under a variety of demographic models is referred to as coalescent theory. Kingman [8] first described the standard coalescent, the genealogical model corresponding to the Wright–Fisher model (but see refs. 9 and 10 for a historical perspective). The standard coalescent is, therefore, also referred to as Kingman’s coalescent.
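
The backward process can be sketched by drawing the waiting times between successive coalescence events. The example below relies on a standard result that is not stated in the text above: under the standard coalescent, when k lineages remain, the waiting time to the next coalescence is exponentially distributed with rate k(k − 1)∕2, time being measured in units of 2N generations for a diploid population. Function names are hypothetical.

```python
import random

def simulate_coalescence_times(n, rng):
    """Waiting times (in units of 2N generations) while n, n-1, ..., 2 lineages remain."""
    times = []
    for k in range(n, 1, -1):
        rate = k * (k - 1) / 2             # number of lineage pairs that can coalesce
        times.append(rng.expovariate(rate))
    return times

def expected_tmrca(n):
    """Expected time to the MRCA of n lineages, in units of 2N generations."""
    return sum(2 / (k * (k - 1)) for k in range(2, n + 1))

if __name__ == "__main__":
    rng = random.Random(42)
    n, replicates = 10, 10_000
    mean = sum(sum(simulate_coalescence_times(n, rng)) for _ in range(replicates)) / replicates
    print("simulated mean TMRCA:", mean)
    print("expected TMRCA:      ", expected_tmrca(n))
```

The sum of the waiting times is the time to the MRCA of the sample; its simulated mean should approach the analytical expectation 2(1 − 1∕n).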

2 Beyond the Wright–Fisher Model

The Wright–Fisher model has been extended in several ways to include more realistic assumptions about the underlying evolutionary process. These extensions led to the concept of effective population size (Ne), originally defined as the number of individuals contributing to the gene pool. When a population deviates from the assumptions of the Wright–Fisher model, Ne is no longer equal to the census population size (N). Often (but not always) in such cases, Ne can be obtained by a linear scaling of N such that it reflects the number of individuals from an idealized Wright–Fisher population that would display the same genetic diversity as the actual population under study [11].

2.1 Demography

A possible deviation from the Wright–Fisher assumptions happens when the population size is not constant across generations. The term demographic history generally refers to the collection of demographic parameters (effective sizes, growth rates) that describes the history of the population until its most recent common ancestor [12]. When population size varies in a cyclic manner with a relatively small period of n generations, the resulting genealogies can be modeled by a Wright–Fisher process with a population size equal to the harmonic mean of the historical population sizes, so that

$$ Ne=\frac{n}{\sum_{i=1}^{n}\frac{1}{N_i}}, $$

where Ni refers to the ith population size [13]. More drastic demographic effects include genetic bottlenecks, corresponding to a sharp decrease (shrinkage) in population size.
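
The harmonic mean is dominated by the smallest population sizes, which the following sketch makes explicit. The function name and the example sizes are hypothetical.

```python
def harmonic_mean_ne(sizes):
    """Effective size of a population cycling through the given sizes (harmonic mean)."""
    return len(sizes) / sum(1 / n for n in sizes)

if __name__ == "__main__":
    sizes = [1000, 1000, 10]               # hypothetical cycle with one generation at low size
    print("arithmetic mean:", sum(sizes) / len(sizes))
    print("harmonic mean:  ", harmonic_mean_ne(sizes))
```

In this example a single generation at size 10 pulls the effective size far below the arithmetic mean of the three sizes.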

2.2 Population Structure

In the absence of panmixia, genetic exchanges occur more often between certain individuals, resulting in population structure with several subpopulations. Population structure may occur for different reasons such as overlapping generations, assortative mating, or geographic isolation [12]. Assortative mating occurs when individuals choose their mates according to some similarity between their phenotypes. If the phenotype is genetically determined, assortative mating can influence the level of heterozygosity in the population [14].

Gene flow describes the migration of genetic variants between subpopulations under a scenario of population structure. It reduces genetic differentiation among subpopulations [15]. Ultimately, subpopulations can diverge and become genetically isolated, a process called speciation. The simplest speciation processes involve spontaneous isolation (isolation model) or spontaneous isolation followed by a period of gene flow (isolation with migration model) [16].

When speciation events occur in a short timeframe and ancestral population sizes are large, ancestral polymorphism may persist in the ancestral species, a phenomenon called incomplete lineage sorting (ILS) [17]. The expected amount of ILS depends on the number of generations between two isolation events (ΔT) and the ancestral effective population size NeA [18]:

$$ \Pr (ILS)=\frac{2}{3}{e}^{-\frac{2\cdot {\Delta}_T}{N{e}_A}} $$

The term introgression is used to depict the transfer of genetic material between diverged populations or species through secondary contact [19]. As a result, extant lineages share a common ancestor that predates the two isolation or speciation events. The resulting genealogy may, therefore, be incongruent with the phylogeny defined by the two splits, depending on the order of coalescence events between lineages [20].

3 Statistics on Nucleotide Diversity

Statistics are needed to infer population genetics parameters from polymorphism data. The site frequency spectrum (SFS) describes the empirical distribution of allele frequencies across segregating sites of a given locus or set of loci in a population sample. For a sample of n sequences (in n haploid individuals or n∕2 diploid individuals), the so-called unfolded SFS is the set of counts of derived alleles X = (X1, X2, …, Xn−1), where Xi denotes the number of sites that have n − i ancestral and i derived alleles. The ancestral state is usually estimated using an outgroup sequence. In cases where we cannot assess the ancestral allele, the folded site frequency spectrum, X′, may be calculated instead. X′ represents the distribution of the minor allele frequencies, such that \( {X}_i^{\prime }={X}_i+{X}_{n-i} \) for i < n∕2 and \( {X}_{n/ 2}^{\prime }={X}_{n/ 2} \) [13, 21, 22]. The shape of the SFS is affected by underlying population genetic processes, such as demography and selection, and therefore serves as the input of many population genetics methods [23] (see Fig. 1).
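
The definitions of the unfolded and folded spectra can be made concrete with a small example. The sketch below assumes that each site is encoded as a list of n values, 0 for the ancestral and 1 for the derived allele; this encoding and the function names are assumptions made for illustration.

```python
def unfolded_sfs(sites, n):
    """Return [X_1, ..., X_(n-1)], where X_i counts sites with i derived alleles."""
    sfs = [0] * (n - 1)
    for site in sites:
        i = sum(site)                      # number of derived alleles at this site
        if 0 < i < n:                      # monomorphic sites are not part of the SFS
            sfs[i - 1] += 1
    return sfs

def folded_sfs(sfs):
    """Fold an unfolded SFS: X'_i = X_i + X_(n-i), and X'_(n/2) = X_(n/2) when n is even."""
    n = len(sfs) + 1
    folded = []
    for i in range(1, n // 2 + 1):
        if i == n - i:                     # middle class, only present when n is even
            folded.append(sfs[i - 1])
        else:
            folded.append(sfs[i - 1] + sfs[n - i - 1])
    return folded

if __name__ == "__main__":
    # Five sites observed in four haploid sequences; the last site is monomorphic.
    sites = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
    spectrum = unfolded_sfs(sites, 4)
    print(spectrum)                        # [2, 1, 1]
    print(folded_sfs(spectrum))            # [3, 1]
```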

Fig. 1

Effect of demography on the shape of the site frequency spectrum (SFS). The figure depicts four scenarios: constant population size, exponential growth, genetic bottleneck, and population structure. The red curve shows the expectation under a constant population size. In the case of exponential growth or a genetic bottleneck, the SFS displays an excess of low-frequency variants. Population structure, here simulated as two subpopulations exchanging migrants at a low rate, results in an excess of intermediate-frequency variants when we reconstruct a single SFS from the two subpopulations. Simulations were performed using the msprime software [24] (see also Chapter 9 and the online companion material)

Watterson’s theta, here noted \( {\hat{\theta}}_S \), is an estimator of the population mutation rate θ = 4Ne ⋅ u, where Ne is the (diploid) effective population size and u the mutation rate. It is derived from the number of segregating sites Sn of a sample of size n [25]. Assuming an infinite sites model, the expectation of Sn is equal to the product of u and the expected total length of the genealogy, which depends on the sample size:

$$ E\left[S_n\right]=u\cdot 4\cdot Ne\sum \limits_{i=1}^{n-1}\frac{1}{i}. $$

Since 4Ne ⋅ u = θ, the equation may be written as E[Sn] = θ ⋅ an, where \( {a}_n={\sum}_{i=1}^{n-1}\frac{1}{i} \). The proposed estimator of θ for the sample is

$$ {\hat{\theta}}_S=\frac{{\hat{S}}_n}{a_n}=\frac{{\hat{S}}_n}{\left(1+\frac{1}{2}+\dots +\frac{1}{n-1}\right)}, $$

where \( {\hat{S}}_n \) is the observed number of segregating sites in the sample. In order to be comparable, values of θ are usually reported per site, and \( {\hat{\theta}}_S \) is then further divided by the sequence length L. This estimator is unbiased when the data is generated from a Wright–Fisher process but is not robust to deviations from it, due to selection or demography [26].
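
Watterson’s estimator can be computed directly from the number of segregating sites and the sample size. The following is a minimal sketch; the function name and the seq_length argument (used for the per-site value) are hypothetical.

```python
def watterson_theta(num_segregating, n, seq_length=1):
    """Watterson's estimator: S_n divided by a_n = 1 + 1/2 + ... + 1/(n - 1).

    Set seq_length to the sequence length L to obtain a per-site estimate.
    """
    a_n = sum(1 / i for i in range(1, n))
    return num_segregating / a_n / seq_length

if __name__ == "__main__":
    # Hypothetical data: 11 segregating sites in 4 sequences of 1000 sites.
    print(watterson_theta(11, 4))                    # estimate for the whole locus
    print(watterson_theta(11, 4, seq_length=1000))   # estimate per site
```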

Tajima’s π, the average pairwise heterozygosity, is a measure of nucleotide diversity defined as the average number of pairwise differences between the sequences of a sample [27]. Under the infinite sites model, the number of mutations separating two orthologous chromosomes Dij is equal to the number of nucleotide differences between sequences i and j. As the expectation of the average pairwise nucleotide differences between all pairs of sequences in a sample is equal to θ = 4Ne ⋅ u [28], Tajima’s estimator of θ is:

$$ {\hat{\theta}}_{\pi }=\frac{2}{n\left(n-1\right)\cdot L}\sum \limits_{i=1}^{n-1}\sum \limits_{j=i+1}^n{D}_{ij}, $$

where L is the total sequence length.
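
The double sum over all pairs of sequences translates directly into code. The sketch below assumes aligned sequences of equal length given as strings; the function name and the toy alignment are hypothetical.

```python
from itertools import combinations

def tajima_pi(sequences):
    """Average number of pairwise differences per site among aligned sequences."""
    n = len(sequences)
    length = len(sequences[0])
    total = 0
    for seq_i, seq_j in combinations(sequences, 2):
        # D_ij: number of positions at which sequences i and j differ
        total += sum(a != b for a, b in zip(seq_i, seq_j))
    return 2 * total / (n * (n - 1) * length)

if __name__ == "__main__":
    alignment = ["AAAA", "AAAT", "AATT"]   # toy alignment of three sequences
    print(tajima_pi(alignment))            # 4 differences over 3 pairs and 4 sites
```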

4 Selective Processes

4.1 Protein-Coding Genes

The coding region of a protein-coding gene, also known as the coding DNA sequence (CDS), is the portion of DNA, or RNA, that encodes a protein. A start codon and a stop codon delimit the coding region at the five-prime and three-prime ends, respectively. In mRNAs, the CDS is bounded by the five-prime untranslated region (5’-UTR) and the three-prime untranslated region (3’-UTR), which are also included in the exons. Mutations within coding regions are of two distinct types: synonymous mutations lead to no change of amino-acid at the protein level due to the redundancy of the genetic code, as opposed to non-synonymous mutations. Non-synonymous mutations can further be classified as conservative or non-conservative (= radical), depending on whether they replace an amino-acid by a biochemically similar one or not. Because of the structure of the genetic code, the four types of mutations at one site (toward A, C, G, or T) can in principle be both synonymous and non-synonymous. Sites where n out of four possible mutations are synonymous are called n-fold degenerate. Four-fold degenerate sites only undergo synonymous mutations, while a mutation at a so-called zero-fold degenerate site is necessarily non-synonymous. Most second codon positions are zero-fold degenerate, while many third positions are four-fold degenerate.
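
The degeneracy of a codon position can be explored with the standard genetic code. The sketch below builds the codon table from the usual compact representation (bases ordered T, C, A, G) and counts how many of the three possible nucleotide changes at a position leave the amino-acid unchanged: three synonymous changes correspond to a four-fold degenerate site, none to a zero-fold degenerate site. Names are hypothetical, and the standard nuclear genetic code is assumed.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def synonymous_changes(codon, position):
    """Count the synonymous changes among the three possible mutations at a position (0, 1 or 2)."""
    count = 0
    for base in BASES:
        if base == codon[position]:
            continue                       # not a mutation
        mutant = codon[:position] + base + codon[position + 1:]
        if CODON_TABLE[mutant] == CODON_TABLE[codon]:
            count += 1
    return count

if __name__ == "__main__":
    for codon in ("GGT", "TTT", "ATG"):
        print(codon, CODON_TABLE[codon], [synonymous_changes(codon, p) for p in range(3)])
```

For the glycine codon GGT, for instance, the third position yields three synonymous changes, while the second position yields none.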

4.2 Fitness Effect

The resulting change of fitness at the organism level characterizes the type of mutations: neutral mutations have no impact on the fitness, while harmful or deleterious mutations induce a lower fitness. Conversely, advantageous mutations increase the fitness of the organism compared to the wild-type genotype. There is, however, a wide range of selective effects, which extends the categorization of mutations from strongly deleterious, through weakly deleterious, neutral to mildly and highly adaptive mutations. The relative frequencies of these types of mutations represent the distribution of fitness effects [29, 30].

The selection coefficient (s) is a measure of differences in fitness, which determines the changes in genotype frequencies that occur due to selection. It is commonly expressed as a relative fitness. If one considers a single locus with two alleles A and a, a standard parametrization is to attribute a fitness of 1 to the homozygote AA and a relative fitness of 1 + s to the homozygote aa. The heterozygote Aa is attributed a fitness of 1 + h ⋅ s, where h is the so-called coefficient of dominance. The s parameter varies between −1 and +∞ (but see Note 2), wherein values between −1 and 0 are indicative of negative selection, while positive values correspond to positive selection [13, 31]. The efficiency of selection, however, depends on both s and the effective population size, Ne, so that mutations with |Ne ⋅ s| ≪ 1 behave in effect like neutral mutations, whose fate is determined by genetic drift only [29].
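
The effect of this parametrization on allele frequencies can be sketched with the classical deterministic recursion for one locus with two alleles. The example assumes random mating and an infinitely large population (no drift), two assumptions that are not part of the text above; the function name is hypothetical.

```python
def next_allele_frequency(q, s, h):
    """Frequency of allele a after one generation of selection.

    Genotype fitnesses: AA = 1, Aa = 1 + h*s, aa = 1 + s.
    Assumes random mating (Hardy-Weinberg proportions) and no drift.
    """
    p = 1 - q
    w_AA, w_Aa, w_aa = 1.0, 1 + h * s, 1 + s
    mean_fitness = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    return (q * q * w_aa + p * q * w_Aa) / mean_fitness

if __name__ == "__main__":
    q = 0.01                               # hypothetical initial frequency of allele a
    for generation in range(0, 501, 100):
        print(generation, round(q, 4))
        for _ in range(100):
            q = next_allele_frequency(q, s=0.05, h=0.5)
```

With s > 0 the frequency of allele a increases toward 1, with s < 0 it decreases, and with s = 0 it remains constant in this deterministic setting.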

4.3 Types of Selection

Positive selection acts on alleles that increase fitness, raising their frequency in the population over time, while negative selection (= purifying selection) decreases the frequency of alleles that impair fitness. Both positive and negative selection decrease genetic diversity. Conversely, balancing selection acts by maintaining multiple alleles in the gene pool of a population at frequencies higher than expected by drift alone. Three mechanisms are generally acknowledged: heterozygote advantage, where heterozygotes have a higher fitness than homozygotes and maintain genetic polymorphism; frequency-dependent selection, where the fitness of the genotype is inversely proportional to its frequency in the population; and environment-dependent fitness of genotypes (also known as local adaptation) [31, 32].

4.4 Inference of Selection in Protein-Coding Sequences

The strength and direction of selection acting on protein-coding regions may be assessed by contrasting the rate of non-synonymous (potentially under selection, dN) to synonymous (assumed to be neutral, dS, but see, for instance, [33]) substitutions between species. In a population of sequences evolving neutrally, all substitutions are neutral and the two rates are equal, leading to a dN∕dS ratio equal to one on average. Assuming non-synonymous mutations are either neutral or deleterious while synonymous mutations are always neutral, the rate of non-synonymous substitutions will be lower than the rate of synonymous substitutions, and the dN∕dS ratio will be lower than one. Conversely, if non-synonymous mutations are positively selected, their rate of fixation may exceed the rate of synonymous mutation, leading to a higher substitution rate and a dN∕dS ratio higher than one.

At the population level, the ratio of non-synonymous (pN) to synonymous (pS) polymorphism is indicative of the strength of purifying selection acting on a protein. Because non-synonymous mutations are more likely to have a negative fitness effect and be counter-selected, they tend to be removed from the population by purifying selection or to segregate at low frequency. We can estimate the synonymous and non-synonymous genetic diversity by computing the average pairwise heterozygosity π separately for non-synonymous and synonymous mutations, noted πN and πS, respectively. The πN∕πS ratio is therefore generally below one: the stronger the purifying selection, the closer the ratio is to zero.

Contrasting the dN∕dS and pN∕pS ratios makes it possible to test the selection regime acting on the sequences [34]. If mutations are all neutral or deleterious, we expect the ratios dN∕dS and pN∕pS to be equal. Positively selected mutations will tend to quickly rise to fixation and will not be observed as polymorphism, leading to a dN∕dS ratio higher than pN∕pS. Conversely, balancing selection will lead to an excess of polymorphism detectable as dN∕dS < pN∕pS [35]. A simple measure of the proportion of amino-acid substitutions resulting from positive selection (α) is given by 1 − (dS ⋅ pN)∕(dN ⋅ pS) [36]. Using the complete synonymous and non-synonymous site frequency spectra, it is further possible to estimate the distribution of fitness effects and account for slightly deleterious and slightly advantageous mutations when estimating the rate of adaptive substitutions (see Chapter 5) [37].
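
The comparison of divergence and polymorphism ratios reduces to simple arithmetic once the four quantities are available. The sketch below computes α = 1 − (dS ⋅ pN)∕(dN ⋅ pS); the function name and the input values are hypothetical and do not come from any dataset.

```python
def alpha(dn, ds, pn, ps):
    """Proportion of amino-acid substitutions attributed to positive selection.

    Computed as 1 - (dS * pN) / (dN * pS); equals zero when dN/dS == pN/pS.
    """
    return 1 - (ds * pn) / (dn * ps)

if __name__ == "__main__":
    dn, ds = 0.02, 0.10                    # hypothetical divergence estimates
    pn, ps = 0.001, 0.01                   # hypothetical polymorphism estimates
    print("dN/dS:", dn / ds)
    print("pN/pS:", pn / ps)
    print("alpha:", alpha(dn, ds, pn, ps))
```

In this example dN∕dS exceeds pN∕pS, so α is positive; a negative value would indicate an excess of non-synonymous polymorphism relative to divergence.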

5 Linkage and Recombination

5.1 The Coalescent with Recombination

In sexually reproducing species, recombination refers to both the shuffling of non-homologous chromosomes and the rearrangement of homologous chromosomes during meiosis. Such cross-over events cause each chromosome to have two parent chromosomes in the previous generation, which are themselves the products of recombination events in the previous generations. Therefore, any chromosome in the current generation can be viewed as a mosaic of chromosomes that existed in the past (see Fig. 2) [38]. The collection of coalescence and recombination events that describes the history of sampled chromosomes until the most recent common ancestor of each non-recombining block is reached (see Fig. 2) is called the ancestral recombination graph (ARG) [39]. Compared to a tree-like genealogy of a sample without recombination, whose complexity depends only on the sample size, the complexity of the ARG grows with the sample size and the number of recombination events in the ancestry of the sample.

Fig. 2

An ancestral recombination graph. An ancestral recombination graph is a collection of recombination (1–2) and coalescence (3–5) events. In each depicted chromosome, white bars represent segregating ancestral material, black bars represent coalesced ancestral material, and thin lines represent non-ancestral material. The asterisk denotes trapped non-ancestral material. Note that “1” does not impact the sample because the resulting segments are joined back together in “4” before coalescing in “5.” There are thus only two relevant TMRCAs in the ARG, separated at position x

Backward in time, the most recent common ancestor (MRCA) denotes the first individual where the entire sample (population) coalesces for a particular non-recombining block. The time to the most recent common ancestor (TMRCA) denotes the timing of this event. DNA sequences provide no information beyond the MRCA in a sample of genomes since all individuals will share any mutation that happens further back in time [40]. In the presence of recombination, different parts of the genome will have different MRCAs. In this case, all ancestral material is eventually found as a contiguous sequence in the grand most recent common ancestor (GMRCA) of the sample (see Fig. 2). If the GMRCA is not an MRCA for any nucleotide, this individual does not have any significance for DNA sequences [39].

In the ARG, nucleotide segments that are found both in past chromosomes and in contemporary samples are termed ancestral genetic material (see Fig. 2). Conversely, non-ancestral genetic material refers to segments that are found in past chromosomes but not in contemporary samples. Furthermore, non-ancestral genetic material flanked on both sides by ancestral genetic material is referred to as trapped genetic material. In this setting, recombination events that happen in trapped genetic material can affect linkage disequilibrium between present-day nucleotides (see Fig. 2). Thus the existence of trapped genetic material introduces long-range correlations between genealogies rendering the coalescent with recombination a non-Markovian process along chromosomes [41]. The Sequentially Markov coalescent (SMC) is an approximation to the coalescent with recombination whereby recombination events are assumed to happen only within ancestral material. This approximation allows the use of efficient algorithms in both simulation and data analysis [42, 43].

5.2 Impact of Linkage on Selection

A non-random association between alleles at different loci is termed linkage disequilibrium (LD). LD arises from genetic drift, population admixture, and selection, but is reduced by recombination each generation. It is, therefore, higher between close loci and decays with increasing physical distance [44].
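
Two standard measures of LD, which are not defined in the text above, are the coefficient D = pAB − pA ⋅ pB and the squared correlation r² = D²∕(pA(1 − pA)pB(1 − pB)), where pAB is the frequency of the haplotype carrying alleles A and B. The sketch below computes both and illustrates the classical decay of D by a factor (1 − r) per generation under random mating, with r the recombination fraction between the two loci. Function names and values are hypothetical.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r2) from the haplotype frequency p_ab and the allele frequencies p_a, p_b."""
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2

def ld_after(d0, r, generations):
    """Expected D after a number of generations of random mating with recombination fraction r."""
    return d0 * (1 - r) ** generations

if __name__ == "__main__":
    d, r2 = linkage_disequilibrium(p_ab=0.4, p_a=0.5, p_b=0.5)
    print("D =", d, "r2 =", r2)
    for r in (0.001, 0.01, 0.1):           # tighter linkage means slower decay
        print("r =", r, "D after 50 generations:", ld_after(d, r, 50))
```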

Linked selection refers to the reduction of diversity at neutral sites that happens as a result of their physical linkage to variants under selection [45]. In the absence of recombination, all variants segregating in a chromosome would undergo the same shift in frequency as the selected variant. However, recombination creates new allelic combinations and reduces this correlation as the physical distance from the selected locus increases (see Fig. 3).

Fig. 3

Impact of selection on genetic diversity. Black lines represent individual genomes. SNP variants are displayed by filled circles. Distinct variants at the same position are depicted with different colors: neutral variants in gray, positive variants in red or yellow, and negative variants in blue. (a) A positively selected new variant spreads in the population and removes genetic diversity at linked loci, generating a hard selective sweep. (b and c) Segregation of several positively selected variants in different genetic backgrounds, either from standing variation or recurrent mutations, resulting in a soft selective sweep. (d) Reduction of neutral diversity because of linkage to deleterious mutations (background selection). (e) Competitive segregation of positively selected variants at distinct loci, resulting in the loss of advantageous variants (Hill–Robertson interference)

Background selection refers to a form of linked selection where the reduction of diversity at neutral loci results from linkage to a locus under purifying selection [46] (see Fig. 3d). Genetic hitchhiking is commonly used to depict linked selection due to linkage to a locus under positive selection [47], where a new beneficial mutation rises in frequency in a population. As the new positively selected allele increases in frequency, nearby linked alleles on the chromosome will “hitchhike” along with it, also growing in frequency, thus producing a selective sweep of genetic diversity. Hard sweeps occur when a new mutation is positively selected and is therefore exclusively associated with the genetic background where it arose. Conversely, soft sweeps occur when a mutation is already segregating in the population at the onset of selection. This mutation may exist in several genetic backgrounds and therefore does not prompt a complete loss of genetic variation after the selective sweep [47] (see Fig. 3a–c).

Linkage of two or more loci can also impair the efficacy of positive selection, a phenomenon termed Hill–Robertson interference (HRI) [48]. When two advantageous mutations at distinct loci in distinct individuals segregate in the population, one will be lost unless a recombination event brings them together. In the absence of recombination between the selected loci, only the unlikely event of recurrent mutations can generate the optimal haplotypic combination [49] (see Fig. 3e).

6 Notes

  1. The use of the term substitution differs in population genetics and molecular biology. In the latter case, it describes a particular type of mutation where a single nucleotide replaces a distinct one (as opposed to insertions/deletions, for instance).

  2. In some instances, s is substituted by − s, so that the relative fitnesses become ωAA = 1, ωAa = 1 − h ⋅ s and ωaa = 1 − s.