Genome-scale analysis of demographic history and adaptive selection
One of the main topics in population genetics is the identification of adaptive selection among populations. For this purpose, population history must be correctly inferred so that the effect of random drift can be evaluated and excluded when identifying selection. With the rapid progress in genomics over the past decade, vast amounts of genome-scale variation data have become available for population genetic analysis; these data, however, require more sophisticated models to infer a species' demographic history and more robust methods to detect local adaptation. Here we review what has been achieved in the fields of demographic modeling and selection detection. We summarize the rationales of the methods, their implementations, and some classical applications. We also propose that some widely used methods can be improved in both theoretical and practical aspects in the near future.
Keywords: genomics, demographic history, local adaptation, natural selection
Abbreviations
composite likelihood test
EHH, extended haplotype homozygosity
GWSS, genome-wide selection scans
iHS, integrated Haplotype Score
MCMC, Markov Chain Monte Carlo method
MRCA, Most Recent Common Ancestors
PSMC, Pairwise Sequentially Markovian Coalescent
SFS, Site Frequency Spectrum
SNP, single nucleotide polymorphism
thyroid stimulating hormone receptor
Identifying adaptive selection has been a central issue in the study of molecular evolution since Motoo Kimura (1968) argued that neutrality, rather than selection, drives the majority of variation in DNA. There has also been long-standing interest in the nature of selection in the study of domestication since Charles Darwin (1859), because artificial selection produces the phenotypic and genetic variation that distinguishes domesticated organisms from their wild ancestors (Mannion, 1999). Inferring the demographic history of the population(s) concerned plays a vital role in both aims, because a properly inferred model provides a null hypothesis for the expectation under neutrality (Nielsen et al., 2007). To distinguish selected traits from those caused by bottleneck effects, it is also vital to understand the history through which the first population captured from the wild became a domestic population (Axelsson et al., 2013). In addition, demographic models inferred from genetic data complement archeological evidence in understanding prehistoric events, such as the number and timing of major continental fluctuations in population size as well as migration. Research on demographic history and on adaptive selection therefore plays an essential role in evolutionary biology.
In the past decade, there has been explosive progress in genomics (International Human Genome Sequencing Consortium, 2001; Li et al., 2010; Huo et al., 2012; NCBI Resource Coordinators, 2013). The explosion started with the revolution in sequencing techniques, which stimulated the accumulation of genomic data and subsequently pushed the improvement of analysis methods. Today, data accumulate hundreds of times faster than when the Human Genome Project first started (NCBI Resource Coordinators, 2013). The available genomic data now cover a wide range of species, extending from the initial goal of human and key laboratory model species to primates and domesticated animals and plants, and presently to endangered organisms with special scientific or cultural value (NCBI Resource Coordinators, 2013; Grigoriev et al., 2013). This flood of data has allowed population genetic approaches to be applied widely in numerous organisms, which in turn has stimulated the development of those approaches, for example, to infer more detailed demographic history, to identify adaptive selection more accurately and sensitively, and to perform the computation more rapidly with more data and fewer constraints (Nielsen et al., 2007; Crisci et al., 2012).
For demographic inference, the simplest and most straightforward approach is based on polymorphism data organized as a Site Frequency Spectrum (SFS). A coalescent process or a diffusion process can be applied to trace the history of a species. In that way a series of parameters describing the history can be inferred by maximum likelihood, approximate Bayesian computation, or the Markov Chain Monte Carlo (MCMC) method (Crisci et al., 2012). For the identification of local adaptation, population genetic statistics are applied to seek outliers of genetic variation and differentiation, across genomes within and between species, that are driven by selective forces (Sabeti et al., 2006). In this review we focus on some recent advances in demographic inference and in the identification of adaptive selection. We also provide some perspectives on how the approaches mentioned could be improved. We suggest that a whole-genome point of view may contribute to the future progress of population genomics, in both theoretical and applied aspects.
Approaches on Demographic History with Genome-Scale Data
To infer demographic history is to estimate past population events from present-day population data. With the development of next-generation sequencing techniques, the present-day data can be either a large genome-scale set of traditional molecular polymorphism data from multiple individuals or heterozygosity data obtained from a single whole-genome sequence. We discuss the two in turn.
Methods with polymorphism dataset
Inference of population history requires two steps. First, the history of the population must be formalized in a mathematical model in which the evolutionary events during that history are described by a set of parameters. Second, statistical inference methods are applied to those parameters. In the past decade, the coalescent process has been the most widely used model (Nielsen and Wakeley, 2001; Crisci et al., 2012). Recently, a diffusion approach has been adopted as well. These two approaches are discussed in the section "Coalescent process versus diffusion process". The statistical inference methods are discussed in the section "Methods to infer demographic parameters".
Coalescent process versus diffusion process
The isolation-with-migration model (Nielsen and Wakeley, 2001) is one of the most common models for inferring demographic scenarios, under which different methods can be used to trace the evolutionary history of genetic variation. Most straightforwardly, the genealogy of alleles can be traced backward in time under the coalescent process, from which the parameters of the demographic model can be derived, including changes in effective population size and the timing of events such as bottlenecks and exponential growth. When two or more populations are considered, the divergence time can also be derived. When not limited to the Wright-Fisher model, the coalescent method can consider migration rate and recombination rate in the gene tree as well. The coalescent is the most widely used method and has been implemented in numerous demographic inference programs (Wooding and Rogers, 2002; Adams and Hudson, 2004; Hey and Nielsen, 2004; Thornton and Andolfatto, 2006; Becquet and Przeworski, 2007; Lopes et al., 2009; Hey, 2010).
The polymorphism dataset could be organized as the Site Frequency Spectrum (SFS), which is the distribution of allele frequencies in a sampled dataset. In the case of multiple populations, a joint SFS (JSFS) could be used, and the evolution of genetic polymorphism among populations could be described as the change over time of allele distribution in the SFS/JSFS. In the context of neutral theory, the change could be approximated with a diffusion process. The Kolmogorov forward equation for diffusion approximation of neutrality could be introduced to approximate the distribution of allele frequencies at given time (Hartl and Clark, 2007). Different from the methods based on coalescent process, the method based on diffusion process could provide more flexible demographic history model with acceptable computational performance (Gutenkunst et al., 2009), and it has been used to deal with complicated demographic model including three populations with migration and recombination based on genome-scale SNP dataset (Gutenkunst et al., 2009; Zhao et al., 2013).
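As an illustration of the data structure described above, the following minimal sketch tabulates an unfolded SFS from a matrix of derived-allele calls and folds it for the case in which the ancestral state is unknown. The function names and the 0/1 input encoding are assumptions made for this example and are not taken from any of the cited programs.

```python
def site_frequency_spectrum(sites, n_chromosomes):
    """Unfolded SFS: entry k counts the SNPs whose derived allele is
    carried by exactly k of the n sampled chromosomes.

    `sites` is a list of per-SNP lists of 0 (ancestral) / 1 (derived).
    """
    sfs = [0] * (n_chromosomes + 1)
    for site in sites:
        if len(site) != n_chromosomes:
            raise ValueError("every site needs one call per chromosome")
        sfs[sum(site)] += 1
    return sfs


def fold(sfs):
    """Folded SFS: counts by minor-allele count, for use when the
    ancestral state cannot be determined."""
    n = len(sfs) - 1
    folded = [0] * (n // 2 + 1)
    for k, count in enumerate(sfs):
        folded[min(k, n - k)] += count
    return folded


# Hypothetical toy data: 4 SNPs typed in 4 chromosomes.
toy_sites = [[0, 0, 0, 1], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0]]
toy_sfs = site_frequency_spectrum(toy_sites, 4)
toy_folded = fold(toy_sfs)
```

A joint SFS for several populations would be built in the same way, with one index per population.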
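Following Gutenkunst et al. (2009), the density $\phi$ of derived-allele frequencies in $P$ populations evolves according to the diffusion equation

$$\frac{\partial \phi}{\partial \tau} = \frac{1}{2}\sum_{i=1}^{P}\frac{\partial^{2}}{\partial x_i^{2}}\left[\frac{x_i(1-x_i)}{\nu_i}\,\phi\right] - \sum_{i=1}^{P}\frac{\partial}{\partial x_i}\left[\left(\gamma_i\,x_i(1-x_i) + \sum_{j=1}^{P} M_{i\leftarrow j}\,(x_j - x_i)\right)\phi\right]$$

Here $\tau = t/(2N_{ref})$ is the time unit, where $t$ is the time in generations and $N_{ref}$ is the reference effective population size. $x_i$ is the allele frequency in population $i$, which runs from 0 to 1. $\nu_i = N_i/N_{ref}$ is the relative effective size of population $i$. $M_{i\leftarrow j} = 2N_{ref}\,m_{i\leftarrow j}$ is the scaled migration rate, where $m_{i\leftarrow j}$ is the proportion of "chromosomes" per generation in population $i$ that are new migrants from population $j$. Finally, $\gamma_i = 2N_{ref}\,s_i$ is the scaled selection coefficient, in which $s_i$ is the relative selective advantage of variants in population $i$.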
The program ∂a∂i (Gutenkunst et al., 2009) uses single nucleotide polymorphism (SNP) data from a given genomic region as its input dataset; the region can be as large as a whole genome obtained by resequencing. If an outgroup is used, a statistical correction is needed for ancestral state misidentification (Hernandez et al., 2007). Such misidentification, which is caused by mutation rates that vary across sites and over time, violates the parsimony assumption that the ancestral state of each SNP matches the orthologous allele at the outgroup locus (Hwang and Green, 2004). In the original ∂a∂i paper, which inferred a population history of humans, a tri-nucleotide transition rate matrix for the primate lineage was used to correct for the misidentification. Since the tri-nucleotide transition rate matrix varies among mammalian lineages, a customized matrix should be inferred for the lineage under study.
Methods to infer demographic parameters
Demographic history consists of population events in the past, whereas the available genetic diversity data are contemporary. One therefore has to find the estimates of historical parameters that give the best fit to the present polymorphism dataset. Several alternative statistical inference procedures can be used for this purpose. Here we discuss the maximum likelihood method and the approximate Bayesian method. We also briefly discuss the Markov Chain Monte Carlo (MCMC) method, which is widely used in Bayesian computation.
In essence, inference of demographic history is a statistical procedure. It looks for the demographic parameter set under which the expected SFS best fits the real dataset sampled from the population. It is therefore natural to introduce the Maximum Likelihood (ML) method, which estimates the parameter values under which the observed sample has the highest likelihood.
Here S is the joint SFS of the P populations, and L(Θ|S) is the likelihood function of the joint SFS under the diffusion model with parameter set Θ.
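The maximum likelihood step can be sketched as follows. This is a minimal illustration, assuming that each SFS entry is an independent Poisson variable whose mean is the model expectation (the composite likelihood form used by Gutenkunst et al., 2009) and, for the toy model only, that the expected count of sites with k derived copies under constant population size is θ/k. The grid search, the function names, and the observed counts are hypothetical and stand in for a real optimizer and real data.

```python
import math


def neutral_expected_sfs(theta, n):
    """Expected SFS entries k = 1..n-1 under the standard neutral model
    with constant population size: E[S_k] = theta / k."""
    return [theta / k for k in range(1, n)]


def poisson_log_likelihood(observed, expected):
    """log L(Theta | S), treating each SFS entry as an independent
    Poisson variable with mean given by the model expectation."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected SFS differ in length")
    log_l = 0.0
    for s, m in zip(observed, expected):
        if m <= 0:
            raise ValueError("expected counts must be positive")
        log_l += s * math.log(m) - m - math.lgamma(s + 1)
    return log_l


def grid_search_ml(observed, n, grid):
    """Return the grid value of theta with the highest likelihood."""
    return max(
        grid,
        key=lambda theta: poisson_log_likelihood(
            observed, neutral_expected_sfs(theta, n)
        ),
    )


# Hypothetical observed SFS for n = 5 chromosomes (entries k = 1..4).
observed_sfs = [10, 5, 3, 2]
theta_grid = [i / 10 for i in range(1, 301)]
best_theta = grid_search_ml(observed_sfs, 5, theta_grid)
```

For this one-parameter toy model the estimate can be checked by hand, since the likelihood is maximized at the total number of segregating sites divided by the sum of 1/k. Realistic demographic models have several parameters and require numerical optimization.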
Approximate Bayesian Computation (ABC) is another method; it simulates parameter values from demographic models that could have given rise to the observed dataset. Given the observed dataset x and the joint density of the parameter values θ that define the population history model, the probability of θ given x can be treated as the posterior p(θ|x) according to Bayes' formula. The essence of the computation is thus the integral of a certain function of the posterior distribution. The method works well when the posterior distribution is simple or low-dimensional, for example, for fluctuations in the effective size of a single population (Thornton and Andolfatto, 2006). When a complicated model is considered, the computation becomes a high-dimensional integral that is intractable. Summary statistics are therefore used to reduce the data to a restricted set and simplify the computation. popABC (Lopes et al., 2009) follows this route, which makes it possible to consider both recombination and migration with genomic data. The program and its successors have been used in detecting rapid radiation in spiny lobsters (Palero et al., 2009), in inferring the demographic history of African Pygmies (Batini et al., 2011), and in studying recombination rate variation and speciation in rodents (Nachman and Payseur, 2012).
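The idea of comparing simulated and observed summary statistics can be illustrated with a rejection sampler, the simplest form of ABC. The sketch below is not the popABC algorithm: it assumes a single constant-size population, uses the number of segregating sites as the only summary statistic, simulates it from the standard coalescent without recombination, and takes a uniform prior on θ. All names, the prior bound, the tolerance, and the observed value are hypothetical.

```python
import math
import random


def poisson_draw(mean, rng):
    """Draw a Poisson variate (Knuth's multiplication method)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_segregating_sites(theta, n, rng):
    """Simulate S for n chromosomes under the standard coalescent.

    While k lineages remain, the waiting time (in units of 2N
    generations) is exponential with rate k(k-1)/2; mutations then fall
    on the tree as a Poisson process with rate theta/2 per unit length.
    """
    total_length = 0.0
    for k in range(n, 1, -1):
        total_length += k * rng.expovariate(k * (k - 1) / 2.0)
    return poisson_draw(theta * total_length / 2.0, rng)


def abc_rejection(observed_s, n, prior_max, tolerance, n_draws, seed=0):
    """Keep the prior draws whose simulated summary statistic lies
    within `tolerance` of the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.0, prior_max)
        simulated_s = simulate_segregating_sites(theta, n, rng)
        if abs(simulated_s - observed_s) <= tolerance:
            accepted.append(theta)
    return accepted


# Hypothetical observation: 20 segregating sites in 10 chromosomes.
posterior_sample = abc_rejection(20, 10, prior_max=30.0, tolerance=2,
                                 n_draws=2000, seed=1)
```

The accepted values approximate the posterior distribution of θ. A smaller tolerance improves the approximation at the cost of a lower acceptance rate.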
Summary statistics use only part of the information in a dataset, which may reduce statistical power. To use the full information in the dataset while avoiding direct computation of the complicated high-dimensional integral of the posterior, the Markov Chain Monte Carlo (MCMC) method was introduced. Instead of computing the integral, the method constructs a Markov chain and draws a series of samples from it to obtain a reliable approximation of the target distribution. It starts from an initial parameter value, with a prior distribution specified for the parameters. A series of samples is then drawn, each proposed from the current state of the chain and accepted or rejected according to its posterior probability relative to that of the current state. When the number of samples is large enough and the chain has become stationary, the distribution of the samples (exactly the equilibrium distribution of the Markov chain) can be taken as the posterior distribution (Beaumont, 2010). The MCMC method is a powerful way to approximate the posterior distribution in Bayesian computation, particularly for complicated demographic patterns.
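A minimal sketch of this sampling scheme is the random-walk Metropolis algorithm below. It is an illustration under strong assumptions, not the sampler of any cited program: the only parameter is θ, the likelihood is the Poisson SFS likelihood with neutral expectation θ/k, the prior is uniform on (0, prior_max), and the observed counts are hypothetical.

```python
import math
import random


def log_posterior(theta, observed_sfs, prior_max):
    """Unnormalized log posterior of theta: uniform prior on
    (0, prior_max) and Poisson SFS likelihood with E[S_k] = theta / k."""
    if not 0.0 < theta < prior_max:
        return -math.inf
    log_p = 0.0
    for k, s in enumerate(observed_sfs, start=1):
        mean = theta / k
        log_p += s * math.log(mean) - mean
    return log_p


def metropolis(observed_sfs, n_steps, step_size=1.0, prior_max=100.0,
               start=1.0, seed=0):
    """Random-walk Metropolis sampler; returns the chain of theta."""
    rng = random.Random(seed)
    theta = start
    current = log_posterior(theta, observed_sfs, prior_max)
    chain = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, observed_sfs, prior_max)
        # Accept with probability min(1, posterior ratio).
        if math.log(1.0 - rng.random()) < candidate - current:
            theta, current = proposal, candidate
        chain.append(theta)
    return chain


# Hypothetical observed SFS for n = 5 chromosomes (entries k = 1..4).
chain = metropolis([10, 5, 3, 2], n_steps=20000, seed=1)
burn_in = 2000
posterior_mean = sum(chain[burn_in:]) / len(chain[burn_in:])
```

The early part of the chain is discarded as burn-in because it still depends on the starting value. In practice convergence would be checked with several chains started from different values.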
Table: Comparison of some recent demographic models. Methods compared: Wooding and Rogers, 2002; Adams and Hudson, 2004; Hey and Nielsen, 2004; Thornton and Andolfatto, 2006; Becquet and Przeworski, 2007; Lopes et al., 2009; Gutenkunst et al., 2009.
Model with heterozygosity dataset
Besides the approaches mentioned above, a further approach uses the heterozygous sites of a single diploid genome to obtain information on population parameters (McVean and Cardin, 2005). This method, the Pairwise Sequentially Markovian Coalescent (PSMC) model, was used to infer human population history (Li and Durbin, 2011). The PSMC model considers the local density of heterozygous sites along chromosomes, which reflects how segments with a constant time to the Most Recent Common Ancestor (MRCA; abbreviated TMRCA by Li and Durbin (2011)) are separated by historical recombination events. From this, population parameters such as the past effective population size and the recombination rate can be inferred.
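The input signal for this kind of analysis, the local density of heterozygous sites, can be sketched as follows. The example only summarizes the data and does not implement the hidden Markov model of PSMC. The bin size of 100 bp, the 0-based coordinates, and the function names are assumptions made for the illustration.

```python
def heterozygosity_bins(het_positions, chrom_length, bin_size=100):
    """Mark each consecutive bin of the chromosome as True if it
    contains at least one heterozygous site (0-based positions)."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    bins = [False] * n_bins
    for pos in het_positions:
        if not 0 <= pos < chrom_length:
            raise ValueError("position outside the chromosome")
        bins[pos // bin_size] = True
    return bins


def windowed_density(bins, window):
    """Fraction of heterozygous bins in consecutive windows of
    `window` bins; long runs of low density suggest a recent TMRCA."""
    densities = []
    for start in range(0, len(bins), window):
        chunk = bins[start:start + window]
        densities.append(sum(chunk) / len(chunk))
    return densities


# Hypothetical heterozygous positions on a 1,000-bp sequence.
toy_bins = heterozygosity_bins([5, 150, 199, 950], chrom_length=1000)
toy_density = windowed_density(toy_bins, window=5)
```

Segments with few heterozygous bins correspond to regions where the two haplotypes coalesced recently, and segments with many correspond to older coalescence times.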
Genome-wide scanning for local adaptation
From “Survival of the fittest” to “Neutral or near neutral mutations”
The concepts of selection, adaptation, and evolution were first described in the famous Origin of Species by Darwin (1859). Natural selection changes fitness by accumulating tiny variations from generation to generation as organisms adapt to their environments. Phenotypic adaptation is the result of mutation and selection during evolution. Several decades later, Ronald Aylmer Fisher (1930) proposed what is now known as "Fisher's fundamental theorem", in which the rate of increase in fitness is defined mathematically as a function of the genetic variance in fitness. Natural selection could then be measured not only by changes in fitness but also by genetic variation.
Later, with the development of molecular biology, the principles of population genetics were extended to sequence data. The neutral theory held that most genetic variation is the accumulation of neutral mutations, whose fixation or loss is governed by genetic drift (Kimura, 1968). Subsequent theories argued that, with a broader definition of neutrality, the majority of genetic variation can be attributed to random drift during evolution rather than to adaptation to the local environment (Ohta, 1992; Nei, 2005). Detecting the signatures of selection within a genome by statistical approaches has therefore become a challenge, and has been one of the central tasks of evolutionary genetics for the past two decades (Kreitman and Akashi, 1995).
The genome is shaped by two kinds of evolutionary force: neutral processes during demographic history, and natural selection. Generally, genetic drift, population growth, migration, and other demographic events affect the whole genome, whereas natural selection caused by changes in the local environment leaves imprints on particular segments or parts of the genome, which can affect phenotype and fitness. Natural selection also changes the frequency of mutations across populations. Advantageous mutations approach fixation under directional selection, advantageous heterozygotes are maintained by balancing selection, and purifying selection removes deleterious mutants. These genetic signatures in genome sequences open a door to detecting natural selection. Mathematical methods and statistical tests help to interpret the imprints of evolution and adaptation.
Seeking genome-wide signatures of adaptation
Statistical methods have been developed to distinguish causal genetic variants subject to selection from neutral genetic variation (Hartl and Clark, 2007). Whole-genome sequencing provides large-scale genetic variation data for this purpose. Two strategies are applied. The first is based on genome-wide selection scans (GWSS), which detect outliers or structure violations as the signatures of selection. The second is based on genome-wide association approaches, in which a priori phenotypic or environmental information is required to identify associated loci as potential candidates for selection. Ultimately, the function of the genes involved should be investigated to link the candidate genetic variation to the phenotypic variation. For the first strategy, the challenge is to distinguish the effect of selection from that of past neutral events. For the second, fine-scaled phenotypic records are scarcer in non-model species than in model species or domesticated organisms. Whichever strategy is applied, experimental validation of candidate variants would be highly persuasive, but in most cases it is very difficult.
Comparative genomic data (between/among species)
The ratio of non-synonymous (dN) to synonymous (dS) substitution rates in coding regions is often used to identify loci that deviate from neutrality between species. Under the neutral theory, ω = dN/dS reflects the type of selection: ω = 1, ω > 1, and ω < 1 indicate that the tested sites are evolving neutrally, under positive selection, and under purifying selection, respectively. Comparing the average ω across a whole gene or DNA segment gives the most conservative results, because the signal from the sites truly under selection is diluted by neutral sites unless the whole region is under selection. Yang (1997, 1998) therefore allowed ω to vary across lineages or among protein sites, to test whether specific lineages or sites were subject to Darwinian selection. A likelihood ratio test is applied to compare the null and the alternative hypotheses (Yang, 1997, 1998). This approach is implemented in the PAML software, which includes the branch model, the site model, and the branch-site model, allowing ω to vary across branches, among sites, and both across branches and among sites, respectively (Yang, 1997, 1998; Yang and Nielsen, 2002; Zhang et al., 2005; Yang, 2007; Yang and Nielsen, 2008; Yang and dos Reis, 2011).
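The comparison of nested models can be sketched as a likelihood ratio test. The example below is a generic illustration rather than PAML code: it assumes that twice the log-likelihood difference is compared with a chi-square distribution whose degrees of freedom equal the difference in the number of free parameters, and it supports only 1 or 2 degrees of freedom so that the tail probability has a closed form. The log-likelihood values and function names are hypothetical.

```python
import math


def chi2_survival(x, df):
    """Upper-tail probability of a chi-square variable (df = 1 or 2)."""
    if x < 0:
        raise ValueError("test statistic must be non-negative")
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch supports only df = 1 or df = 2")


def likelihood_ratio_test(lnl_null, lnl_alternative, df):
    """Return (2 * delta lnL, p-value) for two nested models."""
    statistic = max(0.0, 2.0 * (lnl_alternative - lnl_null))
    return statistic, chi2_survival(statistic, df)


def classify_omega(omega, tolerance=1e-6):
    """Interpret a dN/dS estimate as described in the text."""
    if abs(omega - 1.0) <= tolerance:
        return "neutral"
    return "positive selection" if omega > 1.0 else "purifying selection"


# Hypothetical log-likelihoods of a null model (no class of sites with
# omega > 1) and an alternative model that adds such a class.
lrt_statistic, lrt_p = likelihood_ratio_test(-1003.0, -1000.0, df=2)
```

A point estimate of ω above 1 is not sufficient on its own. The test asks whether a model allowing ω > 1 fits significantly better than one that does not.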
These methods have been widely applied in comparative genomic analyses. In mammals, several studies using large sets of genomes have been performed to detect loci or lineages under selection (Clark et al., 2003; Kosiol et al., 2008; Nielsen et al., 2005; Li et al., 2010). For example, Li et al. (2010) showed that positively selected genes were significantly enriched in the functional categories of blood circulation and gas exchange activity in mammals. Zhang et al. (2013) compared two related bat genomes and identified genes in the DNA damage checkpoint and NF-κB pathways that had been subject to strong selection, implying possible adaptation to flight. In addition, a comparative analysis of two falcons, the peregrine and the saker, showed an accelerated evolutionary rate in homeostasis-related genes responsible for circulation (Zhan et al., 2013).
Comparative population genomic data (within and between species)
Under the neutral theory, polymorphism and divergence data are expected to reflect the accumulation of neutral mutations within and between species, respectively. Hudson et al. (1987) established a statistical test (the HKA test) to examine whether DNA sequences with a higher or lower evolutionary rate between species also show a correspondingly higher or lower level of polymorphism within species. Although Wright and Charlesworth (2004) extended the HKA test by incorporating maximum likelihood, the HKA test has rarely been used in genome-wide analysis because it assumes a constant effective population size (Nei and Kumar, 2000). McDonald and Kreitman (1991) applied the idea underlying the HKA test and proposed that the ratio of dN:dS between species equals the ratio of dN:dS within species when sequences evolve neutrally. They found that this ratio was higher for divergence than for polymorphism in Adh sequences from three species of fruit flies, suggesting that beneficial replacement mutations had been fixed by positive selection. Bustamante et al. (2005) applied this approach to divergence and polymorphism data from human and chimpanzee to identify signatures of positive and purifying selection across the human genome. The positively selected genes were predicted to be enriched in categories such as defense/immunity, transcription, and sensory perception, whereas negative selection was expected to affect processes of cell structure and motility, ectoderm development, general vesicle transport, and intracellular protein traffic.
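The comparison of McDonald and Kreitman amounts to a test of independence on a 2 × 2 table of counts. The sketch below assumes that the table holds the numbers of non-synonymous and synonymous differences that are fixed between species (dn, ds) and polymorphic within species (pn, ps), tests it with a two-sided Fisher's exact test, and reports the neutrality index and the derived quantity α = 1 − NI as commonly used summaries. The counts in the example are hypothetical and the function names are chosen for this illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]]:
    sum the probabilities of all tables (with the same margins) that
    are no more probable than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def table_probability(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = table_probability(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        table_probability(x)
        for x in range(low, high + 1)
        if table_probability(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def mk_test(dn, ds, pn, ps):
    """Return (neutrality index, alpha, p-value).

    NI = (pn/ps) / (dn/ds); NI < 1 (alpha > 0) indicates an excess of
    non-synonymous divergence, as expected under positive selection.
    """
    if min(dn, ds, ps) <= 0:
        raise ValueError("dn, ds and ps must be positive")
    neutrality_index = (pn / ps) / (dn / ds)
    p_value = fisher_exact_two_sided(dn, ds, pn, ps)
    return neutrality_index, 1.0 - neutrality_index, p_value


# Hypothetical counts for one gene.
ni, alpha, mk_p = mk_test(dn=20, ds=40, pn=5, ps=50)
```

With these counts the non-synonymous to synonymous ratio is five times higher between species than within species, and the test indicates that the difference is unlikely under neutrality.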
Population genomic data within species
Evolutionary forces skew the genetic diversity and allele frequencies of loci across genomes. Various methods based on estimators of genetic diversity within species have been developed to detect signatures of selection. Here we mention tests based on heterozygosity, tests based on FST, tests based on allele frequency, and tests based on haplotypes. We also discuss composite approaches that combine multiple tests.
Local adaptation signals have two manifestations. Within one population, the loci involved are expected to show low heterozygosity. Between populations, they show higher genetic differentiation as measured by FST. These two indicators, heterozygosity and FST, and their derived forms are widely used in population genomic analysis.
Genetic differentiation between populations is usually measured with the genetic fixation index FST. Generally, positive selection gives rise to lower heterozygosity within populations and higher genetic differentiation of loci between populations. The FST statistic has been developed over decades to estimate selection (Wright, 1943; Weir and Cockerham, 1984; Slatkin and Voelm, 1991; Cockerham and Weir, 1993). Pairwise FST between populations is compared to detect differentiated signals under positive selection directly (Akey et al., 2002; Barreiro et al., 2008; Lam et al., 2010; Zhao et al., 2013). An integrated approach based on different estimators of FST has been successfully applied to detect local adaptation signals in the population genomics of giant pandas (Zhao et al., 2013).
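A simple form of such a scan can be sketched as follows. The example assumes biallelic loci, two populations of equal weight, and the basic definition FST = (HT − HS)/HT computed from allele frequencies. It does not apply the sample-size corrections of estimators such as that of Weir and Cockerham (1984), and the outlier fraction and function names are arbitrary choices for the illustration.

```python
def fst_biallelic(p1, p2):
    """FST at one biallelic locus from the allele frequencies in two
    equally weighted populations: (H_T - H_S) / H_T."""
    p_mean = (p1 + p2) / 2.0
    h_total = 2.0 * p_mean * (1.0 - p_mean)
    if h_total == 0.0:
        return 0.0  # locus is monomorphic in the pooled sample
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def top_outliers(fst_values, fraction=0.01):
    """Indices of the loci in the top `fraction` of the empirical
    genome-wide FST distribution (at least one locus is returned)."""
    n_keep = max(1, int(len(fst_values) * fraction))
    ranked = sorted(range(len(fst_values)),
                    key=lambda i: fst_values[i], reverse=True)
    return sorted(ranked[:n_keep])


# Hypothetical allele frequencies (population 1, population 2).
toy_frequencies = [(0.30, 0.30), (0.10, 0.90), (0.50, 0.55), (0.20, 0.40)]
toy_fst = [fst_biallelic(p1, p2) for p1, p2 in toy_frequencies]
toy_candidates = top_outliers(toy_fst, fraction=0.25)
```

Loci in the upper tail are only candidates, because demographic history can also generate extreme FST values.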
Branch length indicates the level of genetic differentiation. Outlier loci with longer branches in Tibetans than the genomic average were considered candidates. To further validate the power of the transformed method, a larger sample of Tibetans was genotyped, and the frequency of the advantageous allele at the outlier locus EPAS1 was examined. The results showed that the SNP in EPAS1 was associated with erythrocyte count and hemoglobin quantity (Yi et al., 2010a).
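The branch-length calculation can be sketched under the assumption that the statistic referred to here is the population branch statistic (PBS) described by Yi et al. (2010), in which pairwise FST values among a focal population and two reference populations are log-transformed into divergence times and combined into the length of the branch leading to the focal population. The function names and the FST values are hypothetical.

```python
import math


def divergence_time(fst):
    """Log transform of FST into an additive divergence measure:
    T = -log(1 - FST)."""
    if not 0.0 <= fst < 1.0:
        raise ValueError("FST must lie in [0, 1)")
    return -math.log(1.0 - fst)


def population_branch_statistic(fst_focal_ref1, fst_focal_ref2,
                                fst_ref1_ref2):
    """Length of the branch leading to the focal population in the
    three-population tree: (T_f1 + T_f2 - T_12) / 2."""
    return (
        divergence_time(fst_focal_ref1)
        + divergence_time(fst_focal_ref2)
        - divergence_time(fst_ref1_ref2)
    ) / 2.0


# Hypothetical locus: the focal population is strongly differentiated
# from both reference populations, which are similar to each other.
pbs_candidate = population_branch_statistic(0.5, 0.5, 0.0)
pbs_background = population_branch_statistic(0.02, 0.03, 0.02)
```

A locus whose value far exceeds the genome-wide distribution has changed in frequency specifically on the focal branch.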
A haplotype provides integrated information on a set of neighboring SNPs rather than the sum of individual SNPs. The diversity and frequency of haplotypes in a population are distorted by very recent positive selection. Since the phased high-resolution human HapMap became available, statistical methods based on the variance in heterozygosity and on the length and frequency of haplotypes have been developed to detect very recent adaptation within and between populations (Sabeti et al., 2002, 2006; Voight et al., 2006). The Extended Haplotype Homozygosity test (EHH) identifies long-range haplotypes with reduced heterozygosity as signatures of recent selective sweeps (Sabeti et al., 2002). The integrated Haplotype Score (iHS) compares ancestral and derived haplotypes to measure the strength of selection at each locus (Sabeti et al., 2006; Voight et al., 2006). The Cross Population EHH test (XP-EHH) measures and compares differences in haplotype frequency between populations to localize adaptation of a population to its local environment (Sabeti et al., 2007; Grossman et al., 2010). Haplotype-based methods have also been widely used in domesticated species to detect signals of adaptation to new environments under artificial selection. Vonholdt et al. (2010) used FST and XP-EHH to identify several SNPs involved in memory formation and behavioral sensitization during dog domestication. Toomajian et al. (2006) proposed a haplotype-sharing statistic and also used the haplotype-based EHH method to identify early-flowering alleles in Arabidopsis thaliana.
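The quantity underlying these tests can be sketched as follows. The example assumes phased haplotypes coded as strings of 0/1 and computes EHH as the probability that two randomly chosen chromosomes carrying the core allele are identical at every marker from the core to a given marker. It extends in one direction only, and it does not perform the integration and standardization steps that iHS and XP-EHH add. The data and function name are hypothetical.

```python
from collections import Counter
from math import comb


def extended_haplotype_homozygosity(haplotypes, core_index, core_allele,
                                    end_index):
    """EHH of the core allele from the core marker to `end_index`
    (inclusive, to the right of the core)."""
    if end_index < core_index:
        raise ValueError("end_index must not precede core_index")
    segments = [
        tuple(h[core_index:end_index + 1])
        for h in haplotypes
        if h[core_index] == core_allele
    ]
    n_carriers = len(segments)
    if n_carriers < 2:
        raise ValueError("need at least two chromosomes with the core allele")
    # Count the pairs of carriers that share an identical extended haplotype.
    identical_pairs = sum(comb(c, 2) for c in Counter(segments).values())
    return identical_pairs / comb(n_carriers, 2)


# Hypothetical phased haplotypes; the core SNP is the first marker.
toy_haplotypes = ["1100", "1100", "1101", "1110", "0100"]
ehh_decay = [
    extended_haplotype_homozygosity(toy_haplotypes, 0, "1", end)
    for end in range(4)
]
```

EHH starts at 1 at the core and decays with distance as recombination and mutation break up the haplotype. A selected allele that rose in frequency quickly shows slower decay than expected for its frequency.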
Functional analysis of candidates
Generally, three approaches are used to investigate the biological function of outlier candidates. The first is prediction of potential gene function. For a protein encoded within a candidate selected locus, the spatial structure of the protein is simulated to infer the conformational change potentially caused by the variant (Sabeti et al., 2007; Grossman et al., 2013). For the full set of outliers, candidate loci are annotated with a gene ontology database, and the functional categories enriched in candidate loci are considered potentially selected. An alternative, more precise functional analysis associates outlier variants with phenotypic variation or environmental change. Some studies provide good examples of combining genome-wide selection scans with genome-wide association analysis (Yi et al., 2010; Grossman et al., 2013).
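The enrichment step can be illustrated with a one-sided hypergeometric test, a common choice for this purpose that is assumed here rather than taken from the studies cited. It asks how likely it is to find at least the observed number of candidate genes in a functional category if candidates were drawn at random from all annotated genes. The gene counts and the function name are hypothetical.

```python
from math import comb


def enrichment_p_value(total_genes, genes_in_category, n_candidates,
                       candidates_in_category):
    """P(X >= candidates_in_category) when n_candidates genes are drawn
    without replacement from total_genes, of which genes_in_category
    belong to the functional category."""
    upper = min(genes_in_category, n_candidates)
    favourable = sum(
        comb(genes_in_category, i)
        * comb(total_genes - genes_in_category, n_candidates - i)
        for i in range(candidates_in_category, upper + 1)
    )
    return favourable / comb(total_genes, n_candidates)


# Hypothetical numbers: 20,000 annotated genes, 200 in the category,
# 100 candidate genes, 8 of which fall in the category.
category_p = enrichment_p_value(20000, 200, 100, 8)
```

When many categories are tested, the resulting p-values need correction for multiple testing.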
However, prediction and statistical association analyses provide only general information about potential biological function. It is important to examine whether the genetic variants carrying signals of adaptive selection actually produce a phenotype with high fitness in the new environment. In dogs, expression and functional experiments were performed after the GWSS analysis to test adaptive signatures of domestication; these showed that domesticated dogs have higher enzymatic activity for digesting starch-rich diets than their ancestors, wolves (Axelsson et al., 2013). However, few studies have been able to examine the function of selected polymorphic loci in intergenic regions, because no protein is expressed from them.
A diffusion approximation model with recombination
For the P populations in total there will be a set of r values, whose mean can represent the recombination rate of the P populations within one generation. In the whole process, the most urgent task is to determine the SNP pairs. The two SNPs in a pair should be so tightly linked that only one recombination event can happen between them; at the same time, any two pairs of SNPs should be distant enough from each other that linkage within groups can be ignored. However, the model makes an arbitrary assumption that may be violated in real datasets. To make sure that equation (13) holds, we suppose that every term in the diffusion functions (11) and (12) is equal. We therefore obtain equations (14) and (15), which mean that migration and drift are equal between groups. However, it is possible that drift and migration have compensating effects, so that equations (14) and (15) do not hold while equation (13) is still satisfied. If so, the estimate of the recombination rate may be biased. Future studies are needed to assess this deviation with either simulated or real datasets.
Challenges for identifying genomic signature of selection
In the genome era, tests for recent positive selection have been well developed, as shown above. However, tests for balancing and negative selection within species are rare. Although some approaches and software packages have attempted to test for balancing selection (Excoffier et al., 2009; Excoffier and Lischer, 2010) and negative selection (Tajima, 1989; Yang, 1998), they do not meet the needs of GWSS; for example, there are few haplotype-based tests for balancing and purifying selection. Signatures of balancing and negative selection need to be interpreted and mined from large-scale genome data, which calls for more diverse methods in the future.
Genomic signatures of selection detected by GWSS can be distorted by demographic history and population structure. Tajima's D cannot distinguish signatures caused by selection from those caused by demographic fluctuation (Yang, 2006; Hartl and Clark, 2007). Other estimators are also sensitive to population growth and contraction (Hartl and Clark, 2007), and undetected population structure and admixture influence heterozygosity and haplotype diversity (Hartl and Clark, 2007). It is therefore important to infer population genetic structure and demographic background in order to distinguish the effect of selection from neutrality. Williamson et al. (2005) set a good example of this idea. They used presumed neutral loci to infer the population history, and then used simulations under that history as the null hypothesis to infer selection signals. As more genomic data become available, more intensive and integrative methods will be needed to incorporate both selective forces and the influence of demographic history.
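Why Tajima's D is confounded by demography can be seen from its definition: it is the standardized difference between two estimators of θ, one based on the mean pairwise difference and one based on the number of segregating sites, and any process that produces an excess of rare variants, whether a selective sweep or population growth, makes it negative. The sketch below computes the statistic from an SFS using the formulas of Tajima (1989). The input layout, the function name, and the counts are assumptions made for the illustration.

```python
import math


def tajimas_d(sfs):
    """Tajima's D from an SFS in which sfs[k-1] is the number of
    segregating sites with k copies of one allele among n chromosomes
    (k = 1..n-1)."""
    n = len(sfs) + 1
    segregating = sum(sfs)
    if n < 4 or segregating == 0:
        raise ValueError("need at least 4 chromosomes and 1 segregating site")
    # Mean number of pairwise differences.
    pi = sum(k * (n - k) * c for k, c in enumerate(sfs, start=1))
    pi /= n * (n - 1) / 2.0
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * segregating + e2 * segregating * (segregating - 1)
    return (pi - segregating / a1) / math.sqrt(variance)


# Hypothetical spectra for n = 5 chromosomes.
d_neutral_shape = tajimas_d([12, 6, 4, 3])   # counts proportional to 1/k
d_rare_excess = tajimas_d([25, 0, 0, 0])     # only singletons
```

Because both scenarios give negative values, the observed statistic is best compared with its distribution in simulations under the inferred demographic history, as in the strategy of Williamson et al. (2005), rather than with the standard neutral expectation.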
Last but not least, testing the selective effect of particular loci by experimental approaches, whether cellular and molecular approaches or high-throughput techniques, remains a barrier. The difficulties come from two aspects. First, it is still hard to avoid false negative results. The theoretical assumptions may be violated in populations of certain species, the effects of genetic structure or demographic history may not be totally excluded, and selection signals may be drowned out by noise. All of these can introduce false negative results. Second, present methods can give evidence of whether a locus was under selection, but cannot indicate the level at which selection acts on the locus: the molecular level, the cellular level, the tissue or organ level, or the level of the individual. It is impossible to test every possible level for each locus, let alone all permutations and combinations of loci. In general, although it is easy to hypothesize that variants at non-synonymous sites change the protein sequence and possibly the protein structure, verification of the biological effects of identified selected loci will lag behind the identification of such loci in the next several years.
We are grateful to the Supercomputing Center of the Chinese Academy of Sciences (CAS) for the supercomputing resources used for demographic history simulation. Studies in our laboratory were supported by grants from the National Natural Science Foundation of China (Grant No. 31230011) and the Knowledge Innovation Program of the Chinese Academy of Sciences (KSCX2-EW-Z-4).
Compliance With Ethics Guidelines
The authors declare that they have no conflict of interest.
This article does not contain any studies with human or animal subjects performed by any of the authors.
- Bustamante CD, Fledel-Alon A, Williamson S, Nielsen R, Hubisz MT, Glanowski S, Tanenbaum DM, White TJ, Sninsky JJ, Hernandez RD, Civello D, Adams MD, Cargill M, Clark AG (2005) Natural selection on protein-coding genes in the human genome. Nature 437:1153–1157
- Darwin C (1859) On the origin of species. John Murray, London
- Excoffier L, Hofer T, Foll M (2009) Detecting loci under selection in a hierarchically structured population. Heredity (Edinb) 103:285–298
- Excoffier L, Lischer HE (2010) Arlequin suite ver 3.5: a new series of programs to perform population genetics analyses under Linux and Windows. Mol Ecol Resour 10:564–567
- Fay JC, Wu CI (2000) Hitchhiking under positive Darwinian selection. Genetics 155:1405–1413
- Fu YX, Li WH (1993) Statistical tests of neutrality of mutations. Genetics 133:693–709
- Grigoriev IV, Nordberg H, Shabalov I, Aerts A, Cantor M, Goodstein D, Kuo A, Minovitsky S, Nikitin R, Ohm RA et al (2013) The genome portal of the Department of Energy Joint Genome Institute. Nucleic Acids Res 40(Database issue):D26–D32
- Hartl DL, Clark AG (2007) Principles of population genetics, 4th edn. Sinauer Associates Inc, Sunderland
- Hudson RR, Kreitman M, Aguade M (1987) A test of neutral molecular evolution based on nucleotide data. Genetics 116:153–159
- McDonald JH, Kreitman M (1991) Adaptive protein evolution at the Adh locus in Drosophila. Nature 351:652–654
- NCBI Resource Coordinators (2013) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 41(Database issue):D8–D20
- Nei M, Kumar S (2000) Molecular evolution and phylogenetics. Oxford University Press, Oxford
- Nielsen R, Wakeley J (2001) Distinguishing migration from isolation: a Markov chain Monte Carlo approach. Genetics 158:885–896
- Nielsen R, Bustamante C, Clark AG, Glanowski S, Sackton TB, Hubisz MJ, Fledel-Alon A, Tanenbaum DM, Civello D, White TJ et al (2005) A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biol 3:e170
- Slatkin M, Voelm L (1991) FST in a hierarchical island model. Genetics 127:627–629
- Tajima F (1989) Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–595
- Williamson SH, Hernandez R, Fledel-Alon A, Zhu L, Nielsen R, Bustamante CD (2005) Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc Natl Acad Sci USA 102:7882–7887
- Wooding S, Rogers A (2002) The matrix coalescent and an application to human single-nucleotide polymorphisms. Genetics 161:1641–1650
- Wright S (1943) Isolation by distance. Genetics 28:114–138
- Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13:555–556
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.