
1 Introduction

Understanding how species form and diverge is a central topic in biology. By observing emerging species today, we can study many of the genetic and environmental processes involved and the underlying forces that drive speciation. To learn how specific speciation events unfolded in the past, however, and how existing species formed, we must infer this from the signals those events left behind. Speciation leaves genetic “fossils” in the genomes of the resulting species, and through what you might call genetic paleontology we can reconstruct past events from these signals.

The main objective of the methods we describe in this chapter is to infer demographic parameters, Θ, given genetic data, D, through the model likelihood \( \mathcal{L}(\Theta \mid D)=\Pr(D \mid \Theta) \). Here, we assume that Θ contains information such as effective population sizes, time points where the population structure changes (populations split or admix), or migration rates between populations. We can connect data and demography through coalescent theory [8]. This theory gives us a way to assign probability densities to genealogies, densities that depend on the demographic parameters, f(G | Θ). Then, if we know the underlying genealogy, we can assign probabilities to observed data using standard algorithms such as Felsenstein’s likelihood recursion [7] and get \( \Pr(D \mid G,\Theta) \). In theory, we now simply need to integrate out the nuisance parameter G to get the desired likelihood

$$ \mathcal{L}\left(\Theta \mid D\right)=\Pr \left(D \mid \Theta \right)=\int \Pr \left(D \mid G,\Theta \right)\, f\left(G \mid \Theta \right)\,\mathrm{d}G. $$
(1)

In practice, however, the enormous space of possible genealogies makes this integral intractable for all but a handful of sequences and for any sizeable length of genetic material. Approximations are needed, and the sequential Markov coalescent (see Chapter 1) and coalescent hidden Markov models approximate the likelihood in two steps: they assume that sites are independent given the genealogy, i.e.,

$$ \Pr \left(D \mid G,\Theta \right)\approx \prod_{i=1}^{L}\Pr \left({D}_i \mid {G}_i,\Theta \right) $$
(2)

where L is the length of the alignment, \( D_i \) is the data, and \( G_i \) the genealogy at site i, and they assume that the dependency between neighboring genealogies is Markovian:

$$ f\left(G \mid \Theta \right)\approx f\left({G}_1 \mid \Theta \right)\prod_{i=2}^{L}f\left({G}_i \mid {G}_{i-1},\Theta \right). $$
(3)

Both assumptions are known to be violated, but simulation studies indicate that the model captures the most important summary statistics of the coalescent [17, 18] and that it can be used to accurately infer parameters in various demographic models [2, 14, 16]. Because the joint density of data and genealogies now takes the form

$$ f\left(D,G \mid \Theta \right)=f\left({G}_1 \mid \Theta \right)\prod_{i=2}^{L}f\left({G}_i \mid {G}_{i-1},\Theta \right)\prod_{i=1}^{L}\Pr \left({D}_i \mid {G}_i,\Theta \right), $$
(4)

which is the form of a hidden Markov model, we can compute the likelihood efficiently using the so-called Forward algorithm (see Chapter 3 in Durbin et al. [3]).
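
To make the connection to hidden Markov models concrete, the sketch below shows the Forward algorithm applied to Eq. 4 on a toy discretization of the genealogy space. The state space and the transition and emission matrices are placeholders chosen only for illustration; they are not those computed by any particular CoalHMM.

import numpy as np

def forward_log_likelihood(pi, A, E, obs):
    """Log-likelihood of an observation sequence under a discrete HMM.

    pi  : (K,) initial distribution, f(G_1 | Theta), over K discretized genealogies
    A   : (K, K) transition matrix, f(G_i | G_{i-1}, Theta)
    E   : (K, M) emission matrix, Pr(D_i | G_i, Theta), for M observable symbols
    obs : sequence of symbol indices, one per alignment column
    """
    alpha = pi * E[:, obs[0]]            # joint probability of state and first column
    log_lik = 0.0
    for o in obs[1:]:
        alpha = (alpha @ A) * E[:, o]    # extend the recursion by one column
        scale = alpha.sum()              # rescale to avoid numerical underflow
        log_lik += np.log(scale)
        alpha /= scale
    return log_lik + np.log(alpha.sum())

# Toy example: three discretized coalescence times, two observable symbols
pi = np.array([0.5, 0.3, 0.2])
A = np.array([[0.98, 0.01, 0.01],
              [0.01, 0.98, 0.01],
              [0.01, 0.01, 0.98]])
E = np.array([[0.99, 0.01],
              [0.97, 0.03],
              [0.95, 0.05]])
print(forward_log_likelihood(pi, A, E, obs=[0, 0, 1, 0, 1]))

The running time is linear in the alignment length and quadratic in the number of hidden states, which is what makes whole-genome analyses feasible.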

This efficiency has permitted us and others (see Chapters 7 and 10) to apply this approximation to the coalescent to infer demographic parameters from whole-genome data [1, 9, 11–13, 19, 24, 25, 27], in addition to inferring recombination patterns [20, 21] and scanning for signs of selection [4, 22].

2 Software

We have created a theoretical framework for constructing coalescent hidden Markov models from demographic specifications [2, 14–16] and used it to implement various models in the software package Jocx, available at

https://github.com/jade-cheng/Jocx.git

Jocx handles the state-space explosion problem of dealing with many sequences by creating hidden Markov models for all pairs of sequences and then combining these into a composite likelihood when estimating parameters. In brief, a full analysis consists of a data-preparation step and a parameter-inference step, illustrated by the two commands below; in the remainder of this chapter, we describe in detail how to apply Jocx to sequence data and how to interpret the results.

Jocx.py init . iso a.fasta b.fasta


Jocx.py run . iso nm 0.0001 1000 0.1

Jocx runs a CoalHMM analysis given a model and an optimizer specified on the command line. It uses sequence alignments in the form of “ziphmm” directories, which are also prepared by Jocx. The program prints the progression of the estimated parameters and the corresponding log likelihood to standard output. The source package consists of a set of Python files and requires no installation.
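
Conceptually, the composite likelihood that Jocx maximizes is the product of the pairwise CoalHMM likelihoods, i.e., the sum of their log likelihoods. The sketch below illustrates the idea; pair_log_likelihood is a hypothetical stand-in for one pairwise ZipHMM forward computation, and a real analysis would only sum over the specific pairs (the "groups") listed by each model's help text rather than over all possible pairs.

from itertools import combinations

def composite_log_likelihood(theta, sequences, pair_log_likelihood):
    """Sum of pairwise CoalHMM log likelihoods for one parameter vector theta.

    sequences           : mapping from sample name to its aligned sequence
    pair_log_likelihood : callable(theta, seq_a, seq_b) returning one pair's
                          log likelihood (a stand-in for one ZipHMM computation)
    """
    total = 0.0
    for a, b in combinations(sorted(sequences), 2):   # here: every pair of samples
        total += pair_log_likelihood(theta, sequences[a], sequences[b])
    return total

The optimizers described later in this chapter then maximize this sum over Θ.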

2.1 Preparing Data

Jocx takes two or more aligned sequences as input; the number of sequence pairs depends on the CoalHMM model specified for a particular execution. We will discuss CoalHMM model specification later. For example, for inference in a two-population isolation scenario [14], we need a minimum of one pair of aligned sequences, with one sequence from each of the two populations. The input should be FASTA files with names matching the names of the sequences we will use in the analysis, and since the sequences will be interpreted as aligned, they should all have the same length. The preprocessing skips indel columns and treats all symbols other than A, C, G, and T as the wildcard N.
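
To illustrate these preprocessing rules, the sketch below walks over a pairwise alignment column by column, skipping indel columns and treating any symbol other than A, C, G, and T as the wildcard N. The three-symbol output used here is only for illustration; Jocx's own encoding (the five-state alignments created by the init command later in this section) is richer.

def encode_pair(seq_a, seq_b):
    """Encode one pairwise alignment column by column (illustration only)."""
    bases = set("ACGT")
    symbols = []
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == "-" or y == "-":
            continue                      # skip indel columns
        if x not in bases or y not in bases:
            symbols.append(2)             # wildcard N in either sequence
        elif x == y:
            symbols.append(0)             # identical column
        else:
            symbols.append(1)             # mismatching column
    return symbols

print(encode_pair("ACGT-ANA", "ACCTGATA"))   # [0, 0, 1, 0, 0, 2, 0]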

In the following example, sequence a and sequence b form an alignment. Each sequence may contain multiple data segments (e.g., contigs or chromosomes); in the example, there are two segments, 1 and 2. The names of these data segments must be consistent between the two sequences. The software separates the data-preparation step from the model-inference step. In the data-preparation step, we supply the Fasta sequences by providing their file names, e.g., a.fasta and b.fasta.

$ ls
a.fasta  b.fasta
$ cat a.fasta | wc -c
1827
$ cat b.fasta | wc -c
1827
$ head *.fasta -n 7
==> a.fasta <==
>1
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaagaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaattaaaaaaaaaaacaaaaaaaaaaaaa
>2
aaataaaaaaaaaaaaaaaaaaaaaaaagacaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaataaaaaaaaaaaaaaaaaaaaaaaaaaaa

==> b.fasta <==
>1
ataaaaaaaaaaaaaaaaaaacaaaaaagaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaagaaaaaacaaaaaaaaaaaaaaaaaa
>2
aaaaaaaaaacaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaagaaaaaaaaaaaaaaaaaaaataaaaaaaaataaaaa

We use the ZipHMM framework [26] to calculate likelihoods; in previous experiments, we have found that ZipHMM gives a one to two orders of magnitude speedup in full-genome analyses. To use ZipHMM in Jocx, we must preprocess the sequence files. The preprocessing step is customized to each demographic model and is done using the

$ Jocx.py init

command. This command takes a variable number of arguments, depending on how many sequences are needed for the demographic model we intend to use. The first two arguments are the directory in which to put the preprocessed alignment and the demographic model to use. The sequences used for the alignment, the number of which depends on the model, must be provided as the remaining arguments. In the aforementioned two-population isolation scenario, the model iso, we need to process two aligned sequences, so the init command will take four arguments in total. To create a pairwise alignment for the isolation model, we would execute the following command:

$ Jocx.py init . iso a.fasta b.fasta
# Creating directory: ./ziphmm_iso_a_b
# creating uncompressed sequence file
# using output directory "./ziphmm_iso_a_b"
# parsing "a.fasta"
# parsing "b.fasta"
# comparing sequence "1"
# sequence length: 900
# creating "./ziphmm_iso_a_b/1.ziphmm"
# comparing sequence "2"
# sequence length: 900
# creating "./ziphmm_iso_a_b/2.ziphmm"
# Creating 5-state alignment in directory: ./ziphmm_iso_a_b/1.ziphmm
# Creating 5-state alignment in directory: ./ziphmm_iso_a_b/2.ziphmm

The result of the init command is the directory ziphmm_iso_a_b that contains information about the alignment of a.fasta and b.fasta in a format that ZipHMM can use to efficiently analyze the isolation model. Each Fasta data segment forms its own ZipHMM subdirectory. In the above example, we have two data segments, named 1 and 2, so we have two ZipHMM subdirectories.

$ ls
a.fasta  b.fasta  ziphmm_iso_a_b
$ find ziphmm_iso_a_b/
ziphmm_iso_a_b/
ziphmm_iso_a_b/1.ziphmm
ziphmm_iso_a_b/1.ziphmm/data_structure
ziphmm_iso_a_b/1.ziphmm/nStates2seq
ziphmm_iso_a_b/1.ziphmm/nStates2seq/5.seq
ziphmm_iso_a_b/1.ziphmm/original_sequence
ziphmm_iso_a_b/2.ziphmm
ziphmm_iso_a_b/2.ziphmm/nStates2seq
ziphmm_iso_a_b/2.ziphmm/nStates2seq/5.seq
ziphmm_iso_a_b/2.ziphmm/data_structure
ziphmm_iso_a_b/2.ziphmm/original_sequence

The exact structure of this directory is not important to how Jocx is used, but you must preprocess input sequences to match each demographic model you will analyze.

To see the list of all supported models, use the --help option. Here iso is the two-population two-sequence isolation scenario, shown below.

$ Jocx.py --help
:
ISOLATION MODEL (iso)

    *
   / \  tau
  A   B

3 params -> tau, coal_rate, recomb_rate
2 seqs   -> A, B
1 group  -> AB
:

For each model the tool implements, the --help command shows an ASCII image of the model, annotated with the parameters of the model and with leaves labelled by populations. Below the image, the parameters are listed in the order they will be output when optimizing the model, followed by the sequences in the order they must be provided to the init command when creating the ZipHMM files. Finally, the help lists the pairs of sequences that will be used in the composite likelihood in the list of “groups.” When initializing a sequence alignment, you will get one ZipHMM directory per group.

The two-population isolation demographic model is symmetric, so the order of input Fasta sequences does not matter. This is not always the case. For example, in a three-population admixture model, shown below, the roles the populations take are different. Population C is admixed, and it is formed from ancestral siblings of the two source populations, A and B. The order of input Fasta sequences, therefore, needs to match.

In this model, there are five unknown time points and durations. Four of them are estimated directly: the three-population isolation time (iso_time), the two time points at which the admixed population merges with each of the two source populations (buddy23_time_1a and buddy23_time_2a), and the duration before all populations find their common ancestry along the first population’s lineage (greedy1_time_1a). The fifth duration can be calculated from these: greedy1_time_2a = greedy1_time_1a + buddy23_time_1a - buddy23_time_2a.

$ Jocx.py --help
:
THREE POP ADMIX 2 3 MODEL (admix23)

                   *
                  / \     greedy1_time_1a
  buddy23_time_1a /\  \
                 /  \_/\   buddy23_time_2a
      admix_prop /  <-|  \  iso_time
                A     C   B

7 params -> iso_time, buddy23_time_1,
            buddy23_time_2, greedy1_time_1,
            coal_rate, recomb_rate, admix_prop
3 seqs   -> A, B, C
3 groups -> AC, BC, AB
:

When executing the init command, the order of the Fasta sequences should match the order of species names in the help command:

$ ls
a1.fasta  b1.fasta  c1.fasta
$ Jocx.py init . admix23 a1.fasta b1.fasta c1.fasta
# Creating directory: ./ziphmm_admix23_a_c
# creating uncompressed sequence file
:
$ ls
a1.fasta  b1.fasta  c1.fasta
ziphmm_admix23_a_b  ziphmm_admix23_a_c  ziphmm_admix23_b_c

In the two examples above, each population contributes a single sequence to the CoalHMM’s construction. Jocx also has models that support two sequences per population.

$ Jocx.py --help
:
THREE POP ADMIX 2 3 MODEL 6 HMM (admix23-6hmm)

                   *
                  / \     greedy1_time_1a
  buddy23_time_1a /\  \
                 /  \_/\   buddy23_time_2a
      admix_prop /  <-|  \  iso_time
                A1   C1   B1
                A2   C2   B2

7 params -> iso_time,        buddy23_time_1a,
            buddy23_time_2a, greedy1_time_1a,
            coal_rate, recomb_rate, admix_prop
6 seqs   -> A1, A2, B1, B2, C1, C2
6 groups -> A1C1, B1C1, A1B1, A1A2, B1B2, C1C2
:

In this example, we have the same admixture demographic model as before but with each population contributing two sequences to form six pairwise alignments, which are then used to construct six HMMs for the inference.

$ ls
a1.fasta  a2.fasta  b1.fasta  b2.fasta  c1.fasta  c2.fasta
$ Jocx.py init . admix23-6hmm a1.fasta a2.fasta \
$                             b1.fasta b2.fasta \
$                             c1.fasta c2.fasta
# Creating directory: ./ziphmm_admix23-6hmm_a1_c1
:
$ ls
a1.fasta  b1.fasta  c1.fasta
a2.fasta  b2.fasta  c2.fasta
ziphmm_admix23-6hmm_a1_a2  ziphmm_admix23-6hmm_a1_c1  ziphmm_admix23-6hmm_b1_c1
ziphmm_admix23-6hmm_a1_b1  ziphmm_admix23-6hmm_b1_b2  ziphmm_admix23-6hmm_c1_c2

In the two-population isolation model, there is one demographic transition for a pair of samples: from a two-population isolation scenario (Fig. 1a) to a single ancestral population scenario (Fig. 1b). In the three-population admix model, there are three demographic transitions for a pair of samples: from a two-population duration (Fig. 1a) to a three-population duration (Fig. 2a), then to another two-population duration (Fig. 2b), and finally to a single ancestral population (Fig. 1b). In the three-population duration, only two of the populations are allowed to exchange lineages, shown in Fig. 2a as the second and third populations; hence we call this duration buddy23. In the second two-population duration, one of the two populations only accepts lineages because it was not involved in the admixture event at the previous state-space transition. Since one population never gives lineages during this time, we call this duration greedy1.

Fig. 1
figure 1

Demographic transition in the two-population isolation model for a pair of samples. Backwards in time, the state space transits from a two-population isolation scenario (a) to a single ancestral population scenario (b)

Fig. 2
figure 2

Demographic transitions in the three-population admix model for a pair of samples. Backwards in time, the state space transits from a two-population isolation scenario (Fig. 1a) to a three-population scenario (a), then to another two-population scenario (b), and finally to a single ancestral population scenario (Fig. 1b)

2.2 Inferring Parameters

To infer parameters, we maximize the model likelihood. Jocx implements three optimization subroutines: Nelder–Mead (NM), genetic algorithms (GA), and particle swarm optimization (PSO). After preparing the ZipHMM directories, the user can run the CoalHMM to maximize the likelihood with one of these three algorithms using the run command.

$ Jocx.py run . iso nm 0.0001 1000 0.1

The first argument of this command, like for the init command, is the directory where the ZipHMM preprocessed data is found. The next argument is the demographic model. If we preprocessed the ZipHMM data with the iso model, we can use iso here to fit that model. The third argument is the optimization algorithm, one of nm, ga, and pso.

Following the optimizer option are the initialization values for the optimization. These arguments should match the number and order of parameters given by the --help command. In the iso model, for example, the parameters are these:

$ Jocx.py --help
:
ISOLATION MODEL (iso)

    *
   / \  tau
  A   B

3 params -> tau, coal_rate, recomb_rate
2 seqs   -> A, B
1 group  -> AB
:

In this model, we infer three parameters: the population split time, tau; the coalescence rate, coal_rate; and the recombination rate, recomb_rate. The two populations are assumed to have the same coalescence rate, which is why a single coal_rate parameter suffices.

2.2.1 NM

NM was introduced by John Nelder and Roger Mead in 1965 [23] as a technique for minimizing a function in a multidimensional space. The method maintains a simplex of candidate solutions and uses a small set of coefficients (for reflection, expansion, contraction, and shrinkage) to determine how far each possible move takes the simplex through the parameter space.
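
For readers who want to experiment with the algorithm outside Jocx, the following sketch minimizes a stand-in for the negative log likelihood with SciPy's implementation of Nelder–Mead. The function neg_log_likelihood is a hypothetical placeholder; in Jocx, the corresponding quantity is computed by the pairwise CoalHMMs.

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params):
    """Placeholder for -log L(tau, coal_rate, recomb_rate | D).

    A toy quadratic surface with its minimum at (0.0002, 2000, 0.5);
    Jocx would evaluate the real CoalHMM likelihood here instead.
    """
    tau, coal_rate, recomb_rate = params
    return ((tau / 2e-4 - 1.0) ** 2
            + (coal_rate / 2e3 - 1.0) ** 2
            + (recomb_rate / 0.5 - 1.0) ** 2)

x0 = np.array([0.0001, 1000.0, 0.1])   # initial guess, as passed to Jocx.py run
res = minimize(neg_log_likelihood, x0, method="Nelder-Mead")
print(res.x)                           # approaches (0.0002, 2000, 0.5)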

$ Jocx.py run . iso nm 0.0001 1000 0.1
# algorithm            = _NMOptimiser
# timeout              = None
# max_executions       = 1
#
# 2017-10-11 11:29:08.069462
:
# execution state score param0 param1 param2
0 init     -38.2023478685 0.000376954454165 7480.36836670 0.337649514816
1 fmin-in  -40.5337262711 0.000385595244114 661.208520686 0.920281817958
1 fmin-cb  -40.3804021200 0.000385595244114 694.268946721 0.920281817958
:
1 fmin-cb  -37.8927822292 0.000695082517418 200504630.601 32081.6528250
Optimization terminated successfully.
     Current function value: 37.892782
     Iterations: 262
     Function evaluations: 533
1 fmin-out -37.8927822292 0.000695082517418 200504630.601 32081.652825

In the output of NM’s execution, we have a final report of whether or not the execution was successful, together with the optimal solution. It is possible for the optimizer to fail for various reasons, a large number of parameters being a major cause. If the parameter space is too large, the Nelder–Mead optimizer often fails, and one of the other optimizers will do better.

2.2.2 GA

GA was introduced by John Holland in the 1970s [10]. The idea is to encode each solution as a chromosome-like data structure and to operate on a population of such solutions through actions analogous to genetic alterations, usually selection, recombination, and mutation. For each type of alteration, various authors have developed different techniques.
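
A minimal sketch of this loop is shown below. It mirrors the ingredients reported in the Jocx output that follows (elitism, tournament selection, and Gaussian point mutation), but it is only an illustration, not Jocx's implementation.

import random

def genetic_algorithm(fitness, bounds, pop_size=50, generations=100,
                      tournament_k=5, mutation_rate=0.15, sigma=0.01):
    """Maximize fitness over box-constrained parameters with a simple GA."""
    pop = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        new_pop = [scored[0]]                          # elitism: keep the best solution
        while len(new_pop) < pop_size:
            # tournament selection of two parents
            p1 = max(random.sample(scored, tournament_k), key=fitness)
            p2 = max(random.sample(scored, tournament_k), key=fitness)
            # uniform crossover followed by Gaussian point mutation
            child = [random.choice(pair) for pair in zip(p1, p2)]
            for i, (lo, hi) in enumerate(bounds):
                if random.random() < mutation_rate:
                    child[i] += random.gauss(0.0, sigma * (hi - lo))
                    child[i] = min(max(child[i], lo), hi)
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# toy usage: recover (0.3, 0.7) by maximizing a negated squared error
best = genetic_algorithm(lambda x: -((x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2),
                         bounds=[(0.0, 1.0), (0.0, 1.0)])
print(best)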

$ Jocx.py run . iso ga 0.0001 1000 0.1
# algorithm            = _GAOptimiser
# timeout              = None
# elite_count          = 1
# population_size      = 50
# initialization       = UniformInitialisation
# selection            = TournamentSelection
# tournament_ratio     = 0.1
# selection_ratio      = 0.75
# mutation             = GaussianMutation
# point_mutation_ratio = 0.15
# mu                   = 0.0
# sigma                = 0.01
#
# 2017-10-23 10:31:32.821761
#
# param0 = (1.0000000000000016e-05, 0.001)
# param1 = (99.99999999999996, 10000.0)
# param2 = (0.009999999999999995, 1.0)
#
#
# POPULATION FOR GENERATION 1
# average_fitness = -5.32373335161
# min_fitness     = -10.7962322739
# max_fitness     = -0.613544122419
#
# gen idv     fitness      param0         param1      param2
  1   1   -0.61354412  0.00002825  6305.95175380  0.04139445
  1   2   -1.38710619  0.00004282  2182.61708962  0.03027973
  1   3   -4.45085424  0.00001133   254.73764392  0.01081756
  1   4   -9.37092993  0.00067074   116.84983427  0.13757425
  1   5  -10.79623227  0.00071728   142.34535478  0.81564586
  :
#
# POPULATION FOR GENERATION 2
# average_fitness = -5.83495296756
# min_fitness     = -10.5697879572
# max_fitness     = -0.613544122419
#
# gen idv     fitness      param0         param1      param2
  2   1   -0.61354412  0.00002825  6305.95175380  0.04139445
  2   2   -0.61382451  0.00002825  6305.95175380  0.13757425
  2   3   -6.89850999  0.00002825   116.84983427  0.14110664
  2   4  -10.47909826  0.00067074   145.01523656  0.81564586
  2   5  -10.56978796  0.00067074   142.34535478  0.81564586
  :
:

In the output of GA’s execution, we have multiple generations of solutions and multiple solutions per generation. Solutions in each generation are ordered by fitness, i.e., the best solution is at the top. The final solution is, therefore, the first solution in the last generation.

2.2.3 PSO

PSO was introduced by Eberhart and Kennedy in 1995 [5] as an optimization technique relying on stochastic processes, similar to GA. As its name implies, each individual solution mimics a particle in a swarm. Each particle holds a velocity and keeps track of the best position it has experienced and the best position the swarm has experienced. The former encapsulates the cognitive influence, i.e., a force pulling the particle towards its own best position; the latter encapsulates the social influence, i.e., a force pulling it towards the swarm’s best position. Both forces act on the velocity and drive the particle through the parameter space.
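
A minimal sketch of this update rule is shown below; the omega, phi_particle, and phi_swarm names mirror the hyperparameters reported in the Jocx output that follows, but the implementation is only an illustration, not Jocx's.

import random

def particle_swarm(fitness, bounds, n_particles=50, iterations=100,
                   omega=0.9, phi_particle=0.3, phi_swarm=0.1):
    """Maximize fitness with a basic particle swarm optimizer."""
    dim = len(bounds)
    pos = [[random.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]                       # each particle's best position
    best_fit = [fitness(p) for p in pos]
    swarm_best = max(zip(best_fit, best_pos))[1][:]      # best position seen by the swarm
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                cognitive = phi_particle * random.random() * (best_pos[i][d] - pos[i][d])
                social = phi_swarm * random.random() * (swarm_best[d] - pos[i][d])
                vel[i][d] = omega * vel[i][d] + cognitive + social
                pos[i][d] = min(max(pos[i][d] + vel[i][d], bounds[d][0]), bounds[d][1])
            f = fitness(pos[i])
            if f > best_fit[i]:                          # update the particle's best
                best_fit[i], best_pos[i] = f, pos[i][:]
                if f > fitness(swarm_best):              # update the swarm's best
                    swarm_best = pos[i][:]
    return swarm_best

# toy usage: recover 0.25 by maximizing a negated squared error
print(particle_swarm(lambda x: -(x[0] - 0.25) ** 2, bounds=[(0.0, 1.0)]))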

$ Jocx.py run . iso pso 0.0001 1000 0.1
# algorithm            = _PSOptimiser
# timeout              = None
# max_iterations       = 50
# particle_count       = 50
# max_initial_velocity = 0.02
# omega                = 0.9
# phi_particle         = 0.3
# phi_swarm            = 0.1
#
# 2017-10-23 10:32:29.123305
#
# param0 = (1.0000000000000016e-05, 0.001)
# param1 = (99.99999999999996, 10000.0)
# param2 = (0.009999999999999995, 1.0)
#
#
# PARTICLES FOR ITERATION 1
# swarm_fitness           = -0.832535308472
# best_average_fitness    = -4.40169918533
# best_minimum_fitness    = -9.77654933959
# best_maximum_fitness    = -0.832535308472
# current_average_fitness = -4.40169918533
# current_minimum_fitness = -9.77654933959
# current_maximum_fitness = -0.832535308472
#
#                                           best-   best-     best-     best-
# gen idv  fitness  param0   param1  param2 fitness param0    param1    param2
  1   0  -0.83  0.000044  4619.31  0.20   -0.83   0.000044  4619.31   0.20
  1   1  -0.86  0.000048  4502.80  0.26   -0.86   0.000048  4502.80   0.26
  1   2  -0.89  0.000061  4669.48  0.58   -0.89   0.000061  4669.48   0.58
  1   3  -1.10  0.000035  2970.77  0.31   -1.10   0.000035  2970.77   0.31
  1   4  -1.46  0.000057  2148.93  0.15   -1.46   0.000057  2148.93   0.15
  :
#
# PARTICLES FOR ITERATION 2
# swarm_fitness           = -0.810479293858
# best_average_fitness    = -4.02436023707
# best_minimum_fitness    = -9.12434788412
# best_maximum_fitness    = -0.810479293858
# current_average_fitness = -4.02984771812
# current_minimum_fitness = -9.12434788412
# current_maximum_fitness = -0.810479293858
#
#                                           best-   best-     best-     best-
# gen idv  fitness  param0   param1  param2 fitness param0    param1    param2
  2   0  -0.81  0.000045  4854.87  0.25   -0.81   0.000045  4854.87   0.25
  2   1  -0.82  0.000040  4622.38  0.21   -0.82   0.000040  4622.38   0.21
  2   2  -0.91  0.000064  4599.97  0.59   -0.89   0.000061  4669.48   0.58
  2   3  -1.12  0.000038  2917.40  0.29   -1.10   0.000035  2970.77   0.31
  2   4  -1.39  0.000058  2308.29  0.14   -1.39   0.000058  2308.29   0.14
  :
:

In the output of the PSO’s execution, we have multiple generations and multiple particles (solutions) per generation. Each particle reports two sets of values: the current solution and the best solution this particle has encountered throughout the execution; the latter is never worse than the former. As with GA, each generation is ordered by the particles’ fitness. The final solution is, therefore, the second set of values (the particle’s best) of the first particle in the last generation.

3 Simulation, Execution, and Result Summarization

In this section, we will use a simulation experiment to show how to perform a full analysis and extract the final solution. We will use the software fastSIMCOAL2 [6] to simulate sequences under given demographic parameters, and we will use the two-population isolation model. All scripts and input files used here can be found in the Companion Material of this book.

We execute the following command to generate variable sites of a two-sequence alignment.

$ ./fsc251 -i input.par -n 1

The first argument points to a file containing the demographic parameters, shown below. The second argument specifies the number of simulations to perform; we need only one pairwise alignment.

$ cat input.par
//Number of population samples (demes)
2
//Population effective sizes (number of genes)
12000
12000
//Sample sizes
1
1
//Growth rates: negative growth implies population expansion
0
0
//Number of migration matrices : 0 implies no migration between demes
0
//historical event: time, source, sink, migrants, ...
1 historical event
10000 0 1 1 2 0 0
//Number of independent loci [chromosome]
1 0
//Per chromosome: Number of linkage blocks
1
//per Block: data type, num loci, rec. rate ...
DNA 8000000 0.00000001 0.00000002 0.33

This simulation input file specifies the isolation model demography and its parameters; our goal is to recover these parameters through CoalHMM model-based inference. The historical event line contains seven values: the time of the event (in generations), the source population id, the sink population id, the proportion of lineages that move from the source to the sink at this event, the new relative size of the sink population, the new growth rate, and the index of the migration matrix to use after the event. The last line contains five values: the data type, the number of simulated sites, the per-site recombination rate, the per-site mutation rate, and the transition bias.

ISOLATION MODEL

    *
   / \  Tau
  A   B

Tau = Sim_Time * Sim_Mutation_rate
    = 10000 * 0.00000002
    = 0.0002

Coal_rate = 1 / (2 * Sim_Population_size * Sim_Mutation_rate)
          = 1 / (2 * 12000 * 0.00000002)
          = 2083

Recombination_rate = Sim_Recombination_rate / Sim_Mutation_rate
                   = 0.00000001 / 0.00000002
                   = 0.5
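
The same conversions can be expressed as a small helper that restates the formulas above (the function name is ours, not part of Jocx):

def sim_to_coalhmm(split_time_generations, effective_size, mutation_rate, recombination_rate):
    """Convert fastSIMCOAL2 simulation parameters to the iso model's scale."""
    tau = split_time_generations * mutation_rate               # split time in expected substitutions
    coal_rate = 1.0 / (2.0 * effective_size * mutation_rate)   # coalescence rate
    rho = recombination_rate / mutation_rate                   # recombination relative to mutation
    return tau, coal_rate, rho

print(sim_to_coalhmm(10000, 12000, 0.00000002, 0.00000001))    # (0.0002, 2083.33..., 0.5)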

The simulation program writes its output to a directory named after the input file; in this case, the directory contains three files:

$ ls input
input_1.arb  input_1.simparam  input_1_1.arp

The first file, input_1.arb, lists the file paths and names of the generated alignments. The second file, input_1.simparam, records the simulation conditions and serves as a log. The last file, input_1_1.arp, contains the variable sites of a sequence alignment. The content of this file is shown below.

$ less -S ./input/input_1_1.arp
#Arlequin input file written by the simulation program fastsimcoal2
[Profile]
        Title="A series of simulated samples"
        NbSamples=2
        GenotypicData=0
        GameticPhase=0
        RecessiveData=0
        DataType=DNA
        LocusSeparator=NONE
        MissingData='?'
[Data]
        [[Samples]]
#Number of independent chromosomes: 1
#Total number of polymorphic sites: 10960
# 10960 polymorphic positions on chromosome 1
#414, 1380, 2815, 3855, 4036, 5364, 5772, 5816, ...
#Total number of recombination events: 5381
#Positions of recombination events:
# Chromosome 1
#       3350, 8236, 9270, 10691, 11097, 12316, ...
                SampleName="Sample 1"
                SampleSize=1
                SampleData= {
1_1     1       CCTCGGTTGTTGTCAAGGACAGTAACTATG...
}
                SampleName="Sample 2"
                SampleSize=1
                SampleData= {
2_1     1       GAATAAAAAAAACGTGAATGCAAGTACGAA...
}
[[Structure]]
        StructureName="Simulated data"
        NbGroups=1
        Group={
           "Sample 1"
           "Sample 2"
        }

We use the script arlequin2fasta.py to convert the Arlequin alignment into Fasta files. Since the Arlequin file contains only the variable sites, we need to specify the total length of the simulated sequence, which should match the simulation parameter in the input file input.par, e.g., 8,000,000 in this example.

$ ./arlequin2fasta.py input/input_1_1.arp 8000000

This creates two Fasta sequences for the pairwise alignment, and they are ready for Jocx’s analysis.

$ ls
input  input.par
$ ./arlequin2fasta.py ./input/input_1_1.arp 8000000
$ ls
input  input.par
input_1_1-sample_1-1_1.fasta  input_1_1-sample_2-2_1.fasta
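
Conceptually, the conversion places the simulated alleles at the polymorphic positions listed in the .arp file and fills every other position with a constant background base, so that the output has the full simulated length. The sketch below illustrates this for a single sample, assuming the positions and bases have already been parsed; the function and its conventions are ours for illustration only, not the actual interface of arlequin2fasta.py.

def variable_sites_to_fasta(positions, haplotype, total_length, name, path):
    """Write one full-length FASTA sequence from its polymorphic sites.

    positions    : 1-based positions of the polymorphic columns
    haplotype    : this sample's bases at those columns, in the same order
    total_length : full simulated sequence length (8,000,000 in this example)
    """
    seq = ["a"] * total_length                    # invariant background base
    for pos, base in zip(positions, haplotype):
        seq[pos - 1] = base.lower()
    with open(path, "w") as out:
        out.write(">" + name + "\n")
        for start in range(0, total_length, 60):  # wrap lines at 60 characters
            out.write("".join(seq[start:start + 60]) + "\n")

# hypothetical usage with a handful of sites and a short sequence
variable_sites_to_fasta([4, 13, 28], "CGT", 60, "1", "example.fasta")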

Analysis using Jocx follows the two-step procedure described earlier: we first prepare the ZipHMM data directory using the init command and then infer parameters using the run command. The following commands conduct a full analysis, testing all three optimizers with ten independent executions each.

Jocx.py init . iso \
  ./input_1_1-sample_1-1_1.fasta \
  ./input_1_1-sample_2-2_1.fasta
Jocx.py run . iso pso 0.0001 1000 0.1 > pso-0.stdout
Jocx.py run . iso pso 0.0001 1000 0.1 > pso-1.stdout
:
Jocx.py run . iso pso 0.0001 1000 0.1 > pso-9.stdout
Jocx.py run . iso ga  0.0001 1000 0.1 > ga-0.stdout
Jocx.py run . iso ga  0.0001 1000 0.1 > ga-1.stdout
:
Jocx.py run . iso ga  0.0001 1000 0.1 > ga-9.stdout
Jocx.py run . iso nm  0.0001 1000 0.1 > nm-0.stdout
Jocx.py run . iso nm  0.0001 1000 0.1 > nm-1.stdout
:
Jocx.py run . iso nm  0.0001 1000 0.1 > nm-9.stdout

Upon completion, we receive ten sets of parameter estimates per optimization method. The format of the standard output, which contains the inference results, differs between the optimization methods. We can use the following commands to summarize and plot the outcome; the plotting script is also provided in the Companion Material.

tail nm*.stdout -n 1 -q > nm-summary.txt
grep '500   1' ga-*.stdout > ga-summary.txt
grep '500   1' pso-*.stdout > pso-summary.txt
./box-plot-simple.py nm-summary.txt 3 nm-summary.png
./box-plot-simple.py ga-summary.txt 3 ga-summary.png
./box-plot-simple.py pso-summary.txt 3 pso-summary.png

The results are shown in Fig. 3. The first command collects the inference results from the NM optimizer; the last line of an NM execution’s standard output contains the final estimates. The next two commands collect the inference results from the GA and PSO optimizers, where the first solution/particle in the last generation/iteration (500 in this experiment) contains the estimates.

Fig. 3
figure 3

Summary of ten independent simulations and CoalHMM executions on the two-population isolation model using the three optimisation methods. The three columns show parameters speciation time, coalescence rate, and recombination rate, respectively. The simulated values of these parameters are 0.0002, 2083, and 0.5. The number written below each box-plot is the median value of the estimates shown on the y-axis. This median can be used as a point estimate for the parameters

$ head *summary.txt
==> ga-summary.txt <==
ga-0.stdout: 500   1 -81395.70680891  0.00011837  1815.42025279  0.42354064
ga-1.stdout: 500   1 -81470.10001761  0.00019243  1938.38996498  0.12963492
ga-2.stdout: 500   1 -81424.59984134  0.00021634  1846.60957248  0.19741876
ga-3.stdout: 500   1 -81430.96932585  0.00021685  1886.66976041  0.18309926
ga-4.stdout: 500   1 -81386.45366757  0.00019324  1916.03941578  0.32995308
ga-5.stdout: 500   1 -81463.45628041  0.00004345  1915.25301917  0.23921500
ga-6.stdout: 500   1 -81373.58453032  0.00018669  1968.26116983  0.52133035
ga-7.stdout: 500   1 -81504.94579193  0.00021242  1500.28846236  0.10292456
ga-8.stdout: 500   1 -81374.56618397  0.00019414  2046.25788612  0.52203350
ga-9.stdout: 500   1 -81433.41521075  0.00022051  1876.14477389  0.17886387

==> nm-summary.txt <==
1 fmin-out  -81373.5832257  0.000186088241216  1966.58533828  0.52229387809
1 fmin-out  -81373.5832257  0.000186088706436  1966.58675497  0.52229303544
1 fmin-out  -81373.5832257  0.00018608870264   1966.58640056  0.522294017033
1 fmin-out  -81373.5832257  0.000186088642201  1966.5864041   0.522295006576
1 fmin-out  -81373.5832257  0.000186088168201  1966.58599993  0.522295026609
1 fmin-out  -81373.5832257  0.00018608835674   1966.58624347  0.522297122163
1 fmin-out  -81373.5832257  0.000186088509117  1966.58601275  0.52229560587
1 fmin-out  -81373.5832257  0.000186088949749  1966.58644739  0.522293271654
1 fmin-out  -81373.5832257  0.000186088354698  1966.58755713  0.522294573711
1 fmin-out  -81373.5832257  0.000186088870812  1966.5853934   0.522294569147

==> pso-summary.txt <==
pso-0.stdout: 500   1 -81373.583 0.000186 1966.585 0.522 -81373.583 0.000186 1966.585 0.522
pso-1.stdout: 500   1 -81373.583 0.000186 1966.586 0.522 -81373.583 0.000186 1966.585 0.522
pso-2.stdout: 500   1 -81373.583 0.000186 1966.586 0.522 -81373.583 0.000186 1966.586 0.522
pso-3.stdout: 500   1 -81373.583 0.000186 1966.586 0.522 -81373.583 0.000186 1966.586 0.522
pso-4.stdout: 500   1 -81373.583 0.000186 1966.585 0.522 -81373.583 0.000186 1966.585 0.522
pso-5.stdout: 500   1 -81373.583 0.000186 1966.585 0.522 -81373.583 0.000186 1966.585 0.522
pso-6.stdout: 500   1 -81373.583 0.000186 1966.586 0.522 -81373.583 0.000186 1966.585 0.522
pso-7.stdout: 500   1 -81373.583 0.000186 1966.586 0.522 -81373.583 0.000186 1966.586 0.522
pso-8.stdout: 500   1 -81373.583 0.000186 1966.585 0.522 -81373.583 0.000186 1966.585 0.522
pso-9.stdout: 500   1 -81373.583 0.000186 1966.586 0.522 -81373.583 0.000186 1966.586 0.522

The plotting script simply places these estimates in box plots. Its first argument is the summary file to plot, the second is the number of parameters in the model (three for the two-population isolation model), and the last is the name of the output file. Each particle in the PSO output contains two sets of results, the local best and the swarm best; the second set, the swarm’s best, is the one to use. At the bottom of each box plot we print the median value of the estimates.
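
If only the point estimates are needed, the medians can also be computed directly from a summary file. The snippet below assumes that the last three whitespace-separated fields of each line are the parameter estimates, which holds for the nm, ga, and pso summaries shown above (for pso these are the second, "best" set of values).

import statistics

def median_estimates(summary_path, n_params=3):
    """Median point estimate per parameter from a Jocx summary file.

    Assumes the last n_params whitespace-separated fields of each line are
    the parameter estimates, as in the summaries shown above.
    """
    rows = []
    with open(summary_path) as handle:
        for line in handle:
            fields = line.split()
            rows.append([float(x) for x in fields[-n_params:]])
    return [statistics.median(column) for column in zip(*rows)]

print(median_estimates("pso-summary.txt"))   # e.g. [0.000186, 1966.585, 0.522]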

The demographic parameters we use in this experiment are 0.0002, 2083, and 0.5. They are the split time of the two isolated populations, the coalescent rate, and the recombination rate, respectively. These values are roughly recovered by CoalHMM for all the optimizers.

In summary, the following commands conduct a full simulation-and-estimation analysis and summarize the final results by creating box plots and printing the median estimate for each parameter.

$ ./fsc251 -i input.par -n 1
$ ./arlequin2fasta.py input/input_1_1.arp 8000000
$ ./Jocx.py init . iso \
  ./input_1_1-sample_1-1_1.fasta \
  ./input_1_1-sample_2-2_1.fasta
$ ./Jocx.py run . iso pso 0.0001 1000 0.1 > pso-0.stdout
:
$ grep '500   1' pso-*.stdout > pso-summary.txt
$ ./box-plot-simple.py pso-summary.txt 3 pso-summary.png

The first command simulates a pairwise sequence alignment using the fastSIMCOAL2 program. The second command uses a custom script to convert the simulated alignment from the Arlequin format to the Fasta format. The third command prepares the ZipHMM directories from the Fasta sequences. The fourth command executes CoalHMM’s model inference and writes the output to a file; in practice, multiple independent runs may be dispatched in this step, possibly on an HPC cluster. The fifth command extracts the inference results from the output file: the number 500 is the maximum iteration count for this experiment, and the number 1 indicates the first particle in the last iteration. Finally, the sixth command plots the estimates and reports the medians as the final results.

4 Conclusions

We have presented the Jocx tool for estimating parameters in ancestral population genomics. The tool uses a framework of pairwise coalescent hidden Markov models combined in a composite likelihood to implement various demographic scenarios. A full list of the available demographic models is available through the tool’s help command. Using a simple isolation model, we described an analysis pipeline based on simulating data and then analyzing it using the three different optimizers implemented in Jocx. This pipeline is available in the Companion Material associated with this chapter and serves as a good starting point for getting familiar with Jocx before moving to more involved models.