In the following subsections we examine some basic examples of running simulations with msprime, starting with the simplest possible models and adding the various complexities required to model biological populations. We use a notebook-like approach throughout, where we intersperse code chunks and their results freely within the text.
2.1 Trees and Replication
At the simplest level, coalescent simulation is about generating trees (or genealogies). These trees (which are always rooted) represent the simulated history of a sample of individuals drawn from an idealized population (in later sections we show how to vary the properties of this idealized population). The function msprime.simulate runs these simulations and the parameters that we provide define the simulation that is run. It returns a TreeSequence object, which represents the full coalescent history of the sample. In later sections we discuss the effects of recombination, when this TreeSequence contains a sequence of correlated trees. For now, we focus on non-recombining sequences and use the method first() to obtain the tree object from this tree sequence. (In general, we can use the trees() iterator to get all trees; see Section 2.7.) For example, here we simulate a history for a sample of three chromosomes:
This code chunk illustrates the basic approach required to draw a tree in a Jupyter notebook. We first generate a tree sequence object (ts), and we then obtain the tree object representing the first (and only) tree in this sequence. Finally, we draw a representation of this tree using the IPython SVG function on the output of the tree.draw() method. By default, tree.draw() returns a depiction of the tree in SVG format, but it also supports plain text rendering. For example, print(tree.draw(format="unicode")) prints the tree to the console using Unicode box-drawing characters. This is a very useful debugging tool. We have omitted the import statements required for the SVG function here as it is rather specific to the Jupyter notebook environment. All code chunks in this chapter are included in the accompanying Jupyter notebook and are fully runnable.
The output of one random realization of this process is shown in Fig. 1. The resulting tree has five nodes: nodes 0, 1, and 2 are leaves, and represent our samples. Node 3 is an internal node, and is the parent of 0 and 2. Node 4 is also an internal node, and is the root of the tree. In msprime, we always refer to nodes by their integer IDs and obtain information about these nodes by calling methods on the tree object. For example, the code tree.children(4) will return the tuple (1, 3) here, as these are the node IDs of the children of the root node. Similarly, tree.parent(0) will return 3.
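The parent and child relationships just described can be mimicked with a plain dictionary. The sketch below is not msprime's data structure; the parent mapping and the children helper are hypothetical names, used only to make the topology of Fig. 1 concrete.

```python
# Topology described for Fig. 1: node 3 is the parent of 0 and 2,
# and node 4 (the root) is the parent of 1 and 3.
# `parent` and `children` are illustrative names, not msprime API.
parent = {0: 3, 2: 3, 1: 4, 3: 4, 4: None}


def children(node):
    """Return the IDs of all nodes whose parent is `node`, in sorted order."""
    return tuple(sorted(c for c, p in parent.items() if p == node))


# The root is the only node without a parent; leaves have no children.
root = next(n for n, p in parent.items() if p is None)
leaves = sorted(n for n in parent if not children(n))

print(children(4), parent[0], root, leaves)
```

Storing only the parent of each node is enough to recover every other relationship in a rooted tree, which is why the lookups above need no further bookkeeping.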
The height of a tree node is determined by the time at which the corresponding ancestor was born. So, contemporary samples always have a node time of zero, and time values increase as we go upwards in the tree (i.e., further back in time). Times in msprime are always measured in generations.
When we run a single simulation, the resulting tree is a single random sample drawn from the probability distribution of coalescent trees. Since a single random draw from any distribution is usually uninformative, we nearly always need to run many different replicate simulations to obtain useful information. The simplest way to do this in msprime is to use the num_replicates argument.
In this example we run 1000 independent replicates of the coalescent for a sample of 10 chromosomes, and compute the mean time to the MRCA of the entire sample, i.e., the root of the tree. The value of 3.7 generations in the past we obtain is of course highly unrealistic. However, by default, time is measured in units of Ne generations (see the next section for details on how to specify population models and interpret times). It is important to note here that although time is measured in units of generations, this is of course an approximation and we may have fractional values. Internally, during a simulation time is scaled into coalescent units using the Ne parameter and once the simulation is complete, times are scaled back into units of generations before being presented to the user. This removes the burden of such tedious time scaling calculations from the user. We discuss these time scaling issues in more detail in the next section.
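The value of roughly 3.7 can be checked against standard coalescent theory. Assuming a diploid effective size Ne, the expected waiting time while k lineages remain is 4Ne/(k(k − 1)) generations, so the expected TMRCA of n samples is 4Ne(1 − 1/n). The function name below is hypothetical and the snippet is only a sketch of this arithmetic, not part of msprime.

```python
def expected_tmrca(sample_size, Ne=1.0):
    """Expected time to the MRCA, in generations, for a diploid size Ne."""
    # Sum the expected waiting times while k = n, n - 1, ..., 2 lineages remain.
    return sum(4 * Ne / (k * (k - 1)) for k in range(2, sample_size + 1))


print(expected_tmrca(10))          # close to 3.6 with Ne = 1
print(expected_tmrca(2, Ne=10))    # pairwise case: 2 * Ne generations
```

A simulated mean of about 3.7 from 1000 replicates is consistent with this expectation of 3.6.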
The simulate function behaves slightly differently when it is called with the num_replicates argument: rather than returning a single tree sequence, we return an iterator over the individual replicates. This means that we can use the convenient for loop construction to consider each simulation in turn, but without actually storing all these simulations. As a result, we can run millions of replicates using this method without using any extra storage.
When simulating coalescent trees, we are often interested in more than just the mean of the distribution of some statistic. Rather than compute the various summaries by hand (as we have done for the mean in the last example), it is convenient to store the result for each replicate in a NumPy array and analyze the data after the simulations have completed. For example:
Here we simulate 1000 replicates, storing the time to the MRCA for each replicate in the array T_mrca. We use the Python enumerate function to insert values into this array: it ensures that j is 0 for the first replicate, 1 for the second, and so on. Thus, by the time we finish the loop, the array has been filled with TMRCA values generated under the coalescent. We then use the NumPy library (which has an extensive suite of statistical functions) to compute the mean and variance of this array. This example is idiomatic, and we will use this type of approach throughout. In the interest of brevity, we will omit all further import statements from code chunks.
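The replicate-and-store pattern can be sketched with the standard library alone. Here replicate_tmrcas is a hypothetical generator that stands in for the iterator over replicates; it draws TMRCA values directly from exponential waiting times rather than calling msprime, and it uses a plain list and the statistics module in place of a NumPy array.

```python
import random
import statistics


def replicate_tmrcas(sample_size, num_replicates, Ne=1.0, seed=None):
    """Yield one simulated TMRCA (in generations) per replicate.

    Stand-in for iterating over replicate simulations: with k lineages the
    waiting time to the next coalescence is exponential with mean
    4 * Ne / (k * (k - 1)) generations.
    """
    rng = random.Random(seed)
    for _ in range(num_replicates):
        t = 0.0
        for k in range(sample_size, 1, -1):
            t += rng.expovariate(k * (k - 1) / (4 * Ne))
        yield t


num_replicates = 1000
T_mrca = [0.0] * num_replicates          # preallocated result array
for j, t in enumerate(replicate_tmrcas(10, num_replicates, seed=42)):
    T_mrca[j] = t                        # j is 0, 1, 2, ... as in the text

print(statistics.mean(T_mrca), statistics.variance(T_mrca))
```

Because the generator yields one value at a time, only the summary we choose to keep (here, one number per replicate) is ever held in memory.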
It is usually more convenient to use the num_replicates argument to perform replication, but there are situations in which it is desirable to specify random seeds manually. For example, if simulations require a long time to run, we may wish to use multiple processes to run these simulations. To ensure that the seeds used in these different processes are unique, it is best to manually specify them. For example,
In this example we randomly create a list of 1000 seeds between 1 and 2^32 − 1 (the range accepted by msprime). We then use the multiprocessing module to create a worker pool of four processes, and run our different replicates in these subprocesses. The results are then collected together in an array so that we can easily process them. This approach is a straightforward way to utilize modern multi-core processors.
Specifying the same random seed for two different simulations (with the same parameters) ensures that we get precisely the same results from both simulations (at least, on the same computer and with the same software versions). This is very useful when we wish to examine the properties of a specific simulation (for example, when debugging), or if we wish to illustrate a particular example. We will often set the random seed in the examples in this tutorial for this reason.
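The seeding scheme can be illustrated without msprime. In this sketch run_replicate is a hypothetical worker that draws a single exponential value from its own seeded generator; in practice it would run one simulation with that seed. The built-in map is used so that the example runs anywhere, on the assumption that a process pool would be swapped in for real workloads.

```python
import random


def run_replicate(seed):
    """Hypothetical worker: one pairwise coalescence time drawn from `seed`."""
    rng = random.Random(seed)
    return rng.expovariate(1.0)


# Draw 1000 distinct seeds in the range 1 .. 2**32 - 1.
seeds = random.Random(1).sample(range(1, 2**32), 1000)

# The built-in map runs the workers serially; multiprocessing.Pool(4).map
# accepts the same (function, iterable) arguments and would spread the calls
# across four processes.
results = list(map(run_replicate, seeds))

# The same seed always reproduces the same value.
print(run_replicate(seeds[0]) == results[0])
```

Sampling without replacement guarantees that no two workers share a seed, and rerunning any single seed reproduces that replicate exactly.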
2.2 Population Models
In the previous section the only parameters we supplied to simulate were the sample_size and num_replicates parameters. This allows us to randomly sample trees with a given number of nodes, but, as it leaves the population unspecified, has little connection with biological reality. The most fundamental population parameter is the effective population size, or Ne. This parameter simply rescales time; larger effective population sizes correspond to older coalescence times:
Thus, when we specify Ne = 10 we get a mean pairwise coalescence time of about 20 generations, and with Ne = 100, the mean coalescence time is about 200 generations. See ref. 43 for details on the biological interpretation of effective population size.
By default, Ne = 1 in msprime, which is equivalent to measuring time in units of Ne generations. It is very important to note that Ne in msprime is the diploid effective population size, which means that all times are scaled by 2Ne (rather than Ne for a haploid coalescent). Thus, if we wish to compare the results that are given in the literature for a haploid coalescent, then we must set Ne to 1∕2 to compensate. For example, we know that the expected coalescence time for a sample of size 2 is 1, and this is the value we obtain from the pairwise_T_mrca function when we have Ne = 0.5. We will usually assume that we are working in haploid coalescent time units from here on, and so set Ne = 0.5 in most examples. However, when running simulations of a specific organism and/or population, it is substantially more convenient to use an appropriate estimated value for Ne so that times are directly interpretable.
2.2.1 Exponentially Growing/Shrinking Populations
When we provide an Ne parameter, this specifies a fixed effective population size. We can also model populations that are exponentially growing or contracting at some rate over time. Given a population size at the present s and a growth rate α, the size of the population t generations in the past is s e^(−αt). (Note again that time and rates are measured in units of generations, not coalescent units.)
In msprime, the initial size and growth rate for a particular population are specified using the PopulationConfiguration object. A list of these objects (describing the different populations; see Section 2.4) is then provided to the simulate function. When providing a list of PopulationConfiguration objects, the Ne parameter to simulate is not required, as the initial_size of the population configurations performs the same task. For example,
Here we simulate the pairwise TMRCA for positive, zero, and negative growth rates. When we have a growth rate of zero, we see that we recover the usual result of 1.0 (as our initial size, and hence Ne, is set to 1∕2). When the growth rate is positive, we see that the mean coalescence time is reduced, since the population size is getting smaller as we go backwards in time, resulting in an increased rate of coalescence. Conversely, when we have a negative growth rate, the population is getting larger as we go backwards in time, resulting in a slower coalescence rate. (Care must be taken with negative growth rates, however, as it is possible to specify models in which the MRCA is never reached. In some cases this will lead to an error being raised, but it is also possible that the simulator will keep generating events indefinitely. This is particularly important in simulation based approaches to inference from real data.)
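The effect of the growth rate, including the possibility that the MRCA is never reached, can be reproduced with a short time-change argument. Assuming two lineages coalesce at rate 1/(2N(t)) per generation with N(t) = s e^(−αt), the accumulated rate up to time t is (e^(αt) − 1)/(2sα), which stays bounded when α is negative. The function names and the growth rates of ±0.5 are illustrative choices made so that the effect is visible in a small run; this is a sketch, not msprime's algorithm.

```python
import math
import random


def population_size(initial_size, growth_rate, t):
    """Size t generations in the past: s * exp(-alpha * t)."""
    return initial_size * math.exp(-growth_rate * t)


def pairwise_time(initial_size, growth_rate, rng):
    """Draw one pairwise coalescence time (generations) under this model.

    Returns math.inf if the two lineages never coalesce, which can happen
    only when growth_rate is negative.
    """
    e = rng.expovariate(1.0)
    if growth_rate == 0:
        return 2 * initial_size * e
    # Invert the accumulated coalescence rate (exp(alpha*t) - 1) / (2*s*alpha).
    x = 1 + 2 * initial_size * growth_rate * e
    return math.log(x) / growth_rate if x > 0 else math.inf


rng = random.Random(7)
for alpha in (0.5, 0.0, -0.5):
    times = [pairwise_time(0.5, alpha, rng) for _ in range(10000)]
    finite = [t for t in times if math.isfinite(t)]
    # Mean of the finite times, and the number of pairs that never coalesce.
    print(alpha, sum(finite) / len(finite), len(times) - len(finite))
```

With a growth rate of zero the mean is close to 1, a positive rate shortens it, and with a negative rate a fraction of replicates never coalesce at all, which is the situation the warning above describes.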
2.3 Mutations
We cannot directly observe gene genealogies; rather, we observe mutations in a sample of sequences which ultimately have occurred on genealogical branches. We are therefore very often interested not just in the genealogies generated by the coalescent process, but also in the results of mutational processes imposed on these trees. msprime currently supports simulating mutations under the infinitely many sites model (arbitrarily complex mutations are supported by the underlying data model, however). This is accessed by the mutation_rate parameter to the simulate function. As usual, this rate is the per-generation rate.
The tree produced by this code chunk is shown in Fig. 2. Here we have two mutations, shown by the red squares. Mutations occur above a given node in the tree, and all samples beneath this node will inherit the mutation. The infinite sites mutations used here are simple binary mutations, that is, the ancestral state is 0 and the derived state is 1. One convenient way to access the resulting sample genotypes is to use the genotype_matrix() method, which returns an m × n NumPy array, if we have m variable sites and n samples. Thus, if G is the genotype matrix, G[j, k] is the state of the kth sample at the jth site. In our example above, site 0 has a mutation over node 3, and site 1 has a mutation over node 1, and so we get the following matrix:
The genotype matrix gives a convenient way of accessing genotype information, but will consume a great deal of memory for larger simulations. See Section 3.4 for more information on how to access genotype data efficiently.
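The rule that every sample beneath a mutated node inherits the mutation can be sketched directly. The topology below is a hypothetical three-sample tree (node 3 above samples 0 and 2, root 4 above 1 and 3), not necessarily the tree of Fig. 2, and nested lists stand in for the NumPy array, so the entry written G[j, k] in the text is G[j][k] here.

```python
# Hypothetical three-sample tree: node 3 is the parent of 0 and 2, and the
# root 4 is the parent of 1 and 3. Site j carries one mutation above
# mutation_nodes[j].
parent = {0: 3, 2: 3, 1: 4, 3: 4, 4: None}
samples = [0, 1, 2]
mutation_nodes = [3, 1]


def carries(sample, node):
    """True if `node` lies on the path from `sample` up to the root."""
    u = sample
    while u is not None:
        if u == node:
            return True
        u = parent[u]
    return False


# G[j][k] is the state of sample k at site j (0 ancestral, 1 derived).
G = [[int(carries(k, node)) for k in samples] for node in mutation_nodes]
print(G)  # [[1, 0, 1], [0, 1, 0]]
```

Each row has one entry per sample, so the matrix grows with the product of sites and samples, which is the memory cost mentioned above.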
When comparing simulations to analytic results, it is very important to be aware of the way in which the mutation rates are defined in coalescent theory. For historical reasons, the scaled mutation rate θ is defined as 2Neμ, where μ is the per-generation mutation rate. Since all times and rates are specified in units of generations in msprime, we must divide by a factor of two if we are to compare with analytic predictions. For example, the mean number of segregating sites for a sample of two is θ; to run this in msprime we do the following:
Note that here we set the mutation rate to θ∕2 (to cancel out the factor of 2 in the definition of θ) and Ne = 1∕2 (so that time is measured in haploid coalescent time units). Such factor-of-two gymnastics are unfortunately unavoidable in coalescent theory.
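The factor-of-two bookkeeping can be checked numerically. In this sketch the pairwise coalescence time is drawn with mean 2Ne generations, mutations fall on the two branches as a Poisson process with per-generation rate θ/2, and the average count should come out near θ when Ne = 1/2. The function name and the value θ = 5 are illustrative assumptions; the code is a stand-in for the msprime call, not a copy of it.

```python
import random


def segregating_sites(theta, rng, Ne=0.5):
    """Number of mutations separating two samples in one replicate."""
    mu = theta / 2                       # per-generation mutation rate
    t = rng.expovariate(1 / (2 * Ne))    # pairwise coalescence time
    branch_length = 2 * t                # two branches of length t
    # Count Poisson events with rate mu along the total branch length.
    count = 0
    position = rng.expovariate(mu)
    while position < branch_length:
        count += 1
        position += rng.expovariate(mu)
    return count


theta = 5.0
rng = random.Random(3)
S = [segregating_sites(theta, rng) for _ in range(20000)]
print(sum(S) / len(S))                   # should be close to theta
```

The expectation is (θ/2) × 2 × 2Ne, which equals θ only because Ne is set to 1/2.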
2.4 Population Structure
Following ms, msprime supports a discrete-deme model of population structure in which d panmictic populations exchange migrants according to the rates defined in a d × d matrix. This approach is very flexible, allowing us to simulate island models (in which all populations exchange migrants at a fixed rate), one- and two-dimensional stepping stone models (where migrants only move to adjacent demes) and other more complex migration patterns. This population structure is declared in msprime via the population_configurations and migration_matrix parameters in the simulate function. The list of population configurations defines the populations; each element of this list must be a PopulationConfiguration instance (each population has independent initial population size and growth rate parameters). The migration matrix is a NumPy array (or list of lists) of per-generation migration rates; m[j, k] defines the fraction of population j that consists of migrants from population k in each generation. (Note that when running simulations on the coalescence scale, i.e. setting Ne = 1∕2, this is equivalent to the number of migrants per deme and generation M[j, k] = 2Nem[j, k].)
We create our model by first making a list of two PopulationConfiguration objects. For convenience here, we use the sample_size argument to these objects to state that we wish to have two samples from each population. This results in samples being allocated sequentially to the populations when simulate is called: samples 0 and 1 are placed in population 0, and samples 2 and 3 are placed in population 1. We then declare our migration matrix, which is asymmetric in this example. Because M[0, 1] = 0.1 and M[1, 0] = 0, forwards in time, individuals can migrate from population 1 to population 0 but not vice versa. This is illustrated in Fig. 3a which shows the tree produced by this simulation. Each node has been colored by its population (red is population 0 and blue population 1). Thus, the leaf nodes 0 and 1 are both from population 0, and 2 and 3 are both from population 1 (as explained above). As we go up the tree, the first event that occurs is 2 and 3 coalescing in population 1, creating node 4. After this, 4 coalesces with node 0, which has at some point before this migrated into deme 1, creating node 5. Node 1 also migrates into deme 1, where it coalesces with 5. Because migration is asymmetric here, the MRCA of the four samples must occur within deme 1.
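The direction of movement encoded by the matrix is easy to get backwards, so it can help to spell it out. The sketch below reads the asymmetric matrix from this example (written as a list of lists) and lists each nonzero entry both forwards and backwards in time; the variable names are illustrative.

```python
# m[j][k] is the fraction of population j made up of migrants from
# population k in each generation (the asymmetric matrix from the text).
m = [
    [0.0, 0.1],
    [0.0, 0.0],
]

# Forwards in time, individuals move from k into j.
forward = [
    (k, j, m[j][k])
    for j in range(len(m))
    for k in range(len(m))
    if j != k and m[j][k] > 0
]
# Backwards in time, ancestral lineages move the other way, from j into k.
backward = [(j, k, rate) for (k, j, rate) in forward]

for (src, dst, rate) in forward:
    print(f"forwards in time:  individuals move {src} -> {dst} at rate {rate}")
for (src, dst, rate) in backward:
    print(f"backwards in time: lineages move {src} -> {dst} at rate {rate}")
```

The backward direction is the one that matters when reading the tree: lineages sampled in population 0 can enter deme 1, but lineages in deme 1 can never leave it.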
The exact history of migration events is available if we use the record_migrations option. In the next example, we set up a symmetric island model and track every migration event:
Figure 3b shows the tree produced by this code chunk. Here we sample three nodes from population 0, but because there is a lot of migration, the locations of coalescences are quite random. For example, the first coalescence occurs in deme 2 (green), after node 0 has migrated in. To see the details of these migration events, we can examine the “migration records” that are stored by msprime. (These are not stored by default, as they may consume a substantial amount of memory. The record_migrations parameter must be supplied to simulate to turn on this feature.) Migration records store complete information about the time, source, and destination demes and the genomic interval in question. Here we are interested in the total number of migration events experienced by each node:
This code produces the plot in Fig. 4. We can see that node 0 experienced very few migration events before it ended up in deme 2, where it coalesced with 4 (which never migrated). Node 2, on the other hand, migrated 30 times before it finally coalesced with 7 in deme 0. Note that there are many more migration events than nodes here, implying that most migration events are not identifiable from a genealogy in real data.
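Counting events per node amounts to tallying one field of each record. The Migration tuple below is a hypothetical stand-in whose fields follow the description above (time, source and destination demes, genomic interval, plus the node affected), and the four records are made-up values rather than output from the simulation.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for a migration record; the field names follow the
# description in the text (time, source and destination demes, interval).
Migration = namedtuple("Migration", "left right node source dest time")

records = [                      # made-up values, for illustration only
    Migration(0.0, 1.0, 0, 0, 2, 0.02),
    Migration(0.0, 1.0, 2, 0, 1, 0.05),
    Migration(0.0, 1.0, 2, 1, 0, 0.11),
    Migration(0.0, 1.0, 2, 0, 2, 0.30),
]

# Tally how many migration events each node experienced.
events_per_node = Counter(record.node for record in records)
print(events_per_node[0], events_per_node[2])   # 1 3
```

A Counter returns zero for nodes that never migrated, so the tally can be plotted for every node ID without special cases.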
Other forms of migration are also possible between specific demes at specific times. These different demographic events are dealt with in the next section.
2.5 Demographic Events
Demographic events allow us to model more complex histories involving changes to the population structure over time, and are specified using the demographic_events parameter to simulate. Each demographic event occurs at a specific time, and the list of events must be supplied in the order they occur (backwards in time). There are a number of different types of demographic event, which we examine in turn.
2.5.1 Migration Rate Change
Migration rate change events allow us to update the migration rate matrix at some point in time. We can either update a single cell in the matrix or all (non-diagonal) entries at the same time.
The tree produced by this code chunk is shown in Fig. 5a (in this example and those following we have omitted the code required to draw the tree). The samples 0 and 1, and 2 and 3 coalesce quickly within their own populations. However, because the migration rate between the populations is zero these lineages are isolated and would never coalesce without some change in demography. The migration rate change event happens at time 20, resulting in node 5 migrating to deme 1 soon afterwards. The lineages then coalesce at time 21.4.
2.5.2 Mass Migration
This class of event allows us to move some proportion of the lineages in one deme to another at a particular time. This allows us to model population splits and admixture events. Population splits occur when (backwards in time) all the lineages in one population migrate to another.
The tree produced by this code chunk is shown in Fig. 5b. In this case we also have two isolated populations which coalesce down to a single lineage. The population split at time 15 (which, forwards in time produced all the individuals in population 1) results in this lineage migrating back to population 0, where it coalesces with the ancestor of the samples 0, 1, and 2.
Admixture events (i.e., where some fraction of the lineages move to a different deme) are specified in the same way:
The tree produced by this code chunk is shown in Fig. 5c. We begin in this example with six lineages sampled in population 0, zero samples in population 1, and with no migration between these populations. At time 0.5, we specify an admixture event where each of the four extant lineages (5, 7, 0, and 6) has a probability of 1/2 of moving to deme 1. Lineages 0 and 6 migrate, and subsequently coalesce into node 8. Further back in time, at t = 1.1, another demographic event occurs, changing the migration rate between the demes to 0.1, thereby allowing lineages to move between them. Eventually, all lineages end up in deme 1, where they coalesce into the MRCA at time t = 6.9.
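Both kinds of mass migration reduce to the same rule: each lineage in the source deme moves independently with a given probability. The following sketch uses a hypothetical mass_migration helper to show that a proportion of 1 gives a population split and a proportion of 1/2 gives an admixture event; it illustrates the rule only and is not msprime's implementation.

```python
import random


def mass_migration(lineages, proportion, rng):
    """Split `lineages` into those that stay and those that move.

    Each lineage moves independently with probability `proportion`.
    """
    stay, moved = [], []
    for lineage in lineages:
        (moved if rng.random() < proportion else stay).append(lineage)
    return stay, moved


rng = random.Random(11)
extant = [5, 7, 0, 6]                               # lineages named in the text
stay, moved = mass_migration(extant, 0.5, rng)      # admixture
_, everyone = mass_migration(extant, 1.0, rng)      # population split
print(stay, moved, everyone)
```

Which lineages move in the admixture case depends on the random seed, so a different seed will generally produce a different split.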
2.5.3 Population Parameter Change
This class of event represents a change in the growth rate or size of a particular population. Since each population has its own individual size and growth rates, we can change these arbitrarily as we go backwards in time. Keeping track of the actual sizes of different populations can be a little challenging, and for this reason msprime provides a DemographyDebugger class.
To illustrate this, we consider a very simple example in which we have a single population experiencing a phase of exponential growth from 750 to 100 generations ago. The size of the population 750 generations ago was 2000, and it grew to 20,000 over the next 650 generations. The size of the population has been stable at this value for the past 100 generations. We encode this model as follows:
It gives the following output:
Epoch: 0 -- 100.0 generations
       start      end growth_rate |        0
    -------- --------    -------- | --------
0 |    2e+04    2e+04           0 |        0

Events @ generation 100.0
   - Population parameter change for -1: growth_rate -> 0.0035

Epoch: 100.0 -- 750.0 generations
       start      end growth_rate |        0
    -------- --------    -------- | --------
0 |    2e+04    2e+03     0.00354 |        0

Events @ generation 750.0
   - Population parameter change for -1: growth_rate -> 0

Epoch: 750.0 -- inf generations
       start      end growth_rate |        0
    -------- --------    -------- | --------
0 |    2e+03    2e+03           0 |        0
After we set up our model, we use the DemographyDebugger to check our calculations. We see that time has been split into three “epochs.” From the present until 100 generations ago, the population size is constant at 20,000. Then, we have a demographic event that changes the growth rate to 0.0035, which applies over the next epoch (from 100 to 750 generations ago). Over this time, the population grows from 2000 to 20,000 (note that the “start” and “end” of each epoch is looking backwards in time, as we consider epochs starting from the present and moving backwards). At generation 750, another event occurs, setting the growth rate for the population to 0. Then, the population size is constant at 2000 from generation 750 until the indefinite past.
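The growth rate reported for the middle epoch follows from the sizes and times stated in the model. Solving 2000 = 20,000 × e^(−α × 650) for α gives the value shown by the debugger; the variable names below are illustrative.

```python
import math

present_size = 20_000      # size from 100 generations ago until today
ancestral_size = 2_000     # size 750 generations ago
growth_start, growth_end = 750, 100   # generations ago

# Solve ancestral_size = present_size * exp(-alpha * duration) for alpha.
duration = growth_start - growth_end
alpha = math.log(present_size / ancestral_size) / duration
print(round(alpha, 5))     # 0.00354

# Size at the older end of the growth epoch, looking backwards in time.
size_at_750 = present_size * math.exp(-alpha * duration)
print(round(size_at_750))
```

Doing this calculation by hand and then comparing it with the debugger output is a quick way to catch sign errors in the growth rate.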
A more complex example involving a three-population out-of-Africa human model is available in the online documentation.
2.6 Ancient Samples
Up to this point we have assumed that all samples are taken at the present time. However, msprime allows us to specify arbitrary sampling times and locations, allowing us to simulate (for example) ancient samples.
The tree produced by this code chunk is shown in Fig. 6. All of the trees that we previously considered had leaf nodes at time zero. In this case, the samples 0, 1, and 2 are taken at time 0 in population 0, but node 3 is sampled at time 0.75 in population 1. Note that in this case we used the samples parameter to simulate to specify our samples. This is the most general approach to assigning samples, and allows samples to be assigned to arbitrary populations and at arbitrary times.
2.7 Recombination
One of the key innovations of msprime is that it makes simulation of the full coalescent with recombination possible at whole-chromosome scale. Adding recombination to a simulation is simple, requiring very minor changes to the methods given above.
In this case, we provide two extra parameters: length, which defines the length of the genomic region to be simulated, and recombination_rate, which defines the rate of recombination per unit of sequence length, per generation. It is often useful to think of both sequence lengths and recombination rates as defined in units of base-pairs. (Note, however, that these are continuous values, so this correspondence should not be taken too literally. Note also that because msprime assumes an infinite sites mutation model the length parameter is not connected to the number of mutational sites. Thus any number of mutations can occur on a given sequence length, depending on the mutation rate specified.) For this example, we defined a sequence length of 100 kb, and a recombination rate of 10^−8 per base per generation. The result of this particular simulation is a tree sequence that contains 82 distinct trees. Other replicate simulations with different random seeds will usually result in different numbers of trees.
Up to this point we have focused on simulations that returned a single tree representing the genealogy of a sample. The inclusion of recombination, however, means that there may be more than one tree relating our samples. The TreeSequence object returned by msprime is a very concise and efficient representation of these highly correlated trees. To process the trees, we simply consider them one at a time, using the trees() iterator.
This code generates the plot in Fig. 7 showing the time of the MRCA of the sample for each tree across the sequence. We find the TMRCA as before, and plot this against the left coordinate of the genomic interval that each tree covers. A full description of tree sequences and the methods for working with them is beyond the scope of this chapter (but see the online documentation for more details).
It is also possible to simulate data with recombination rates varying across the genome (for example, in recombination hotspots). To do this, we first create a RecombinationMap instance that describes the properties of the recombination landscape that we wish to simulate. We then supply this object to simulate using the recombination_map argument. In the following example, we simulate 100 samples using the human chromosome 22 recombination map from the HapMap project. Figure 8 shows the recombination rate and the locations of breakpoints from the simulation, and the density of breakpoints closely follows the recombination rate, as expected.
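Why breakpoints pile up where the rate is high can be seen from a piecewise-constant map. In the sketch below, a hypothetical 100 kb map with a hotspot between 40 kb and 60 kb is converted into a cumulative expected number of crossovers per generation; the positions, rates, and function name are assumptions for illustration and are unrelated to the HapMap data.

```python
from bisect import bisect_right

# Hypothetical map: rates[i] applies between positions[i] and positions[i + 1].
positions = [0, 40_000, 60_000, 100_000]
rates = [1e-8, 1e-6, 1e-8]          # a hotspot between 40 kb and 60 kb


def genetic_position(x):
    """Expected crossovers per generation between position 0 and x."""
    total = 0.0
    i = bisect_right(positions, x) - 1
    # Add every interval that lies completely to the left of x.
    for left, right, rate in zip(positions[:i], positions[1 : i + 1], rates):
        total += (right - left) * rate
    # Add the part of the interval that contains x, if there is one.
    if i < len(rates):
        total += (x - positions[i]) * rates[i]
    return total


whole = genetic_position(100_000)
hotspot = genetic_position(60_000) - genetic_position(40_000)
print(whole, hotspot / whole)
```

In this made-up map the hotspot covers a fifth of the sequence but accounts for almost all of the expected crossovers, so most breakpoints would be expected to fall inside it.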
Although coordinates are specified in floating point values, msprime uses a discrete loci model when performing simulations. By default, the number of loci is very large (∼2^32), and the locations of breakpoints are translated back into the coordinate system defined by the recombination map. However, the number of loci is configurable and it is possible to simulate a specific number of discrete loci.
Here we simulate the history of two samples in a system with ten loci, each of length 1 with recombination rate of 1 between adjacent loci per generation. In the output, we see that the breakpoints between trees now occur exactly at the integer boundaries between these loci. This shows that we can also simulate models of recombination with discrete loci in msprime, as well as the more standard continuous genome.
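The relationship between continuous coordinates and discrete loci can be pictured as a snapping operation. The helper below is purely illustrative (it is an assumption about how such a mapping could look, not msprime's internal code): it maps a continuous position to the left edge of the locus that contains it.

```python
import math


def snap_to_locus(x, length, num_loci):
    """Map a continuous coordinate to the left edge of the locus containing it."""
    locus_width = length / num_loci
    return math.floor(x / locus_width) * locus_width


# Ten loci of length 1: every breakpoint lands on an integer boundary.
breakpoints = [snap_to_locus(x, length=10, num_loci=10) for x in (0.4, 3.7, 9.99)]
print(breakpoints)   # [0.0, 3.0, 9.0]
```

With a very large number of loci the snapped and continuous positions become practically indistinguishable, which is why the default behaves like a continuous genome.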