Ancestral Population Genomics with Jocx, a Coalescent Hidden Markov Model

Coalescence theory lets us probe the past demographics of present-day genetic samples, and much information about the past can be gleaned from variation in the rates of coalescence events as we trace genetic lineages back in time. Fewer and fewer lineages will remain, however, so there is a limit to how far back we can explore. Without recombination, this would prevent us from exploring ancient speciation events: any meaningful species concept requires that, once speciation is complete, individuals of one species are more closely related to one another than they are to individuals of another species. Recombination, however, opens a window to the deeper past. By scanning along a genomic alignment, we get a sequential variant of the coalescence process as it looked at the time of speciation. This pattern of coalescence times is fixed at speciation time and does not erode with time; although accumulated mutations and genomic rearrangements will eventually hide the signal, it lets us glimpse events in the past that would not be observable without recombination. So-called coalescent hidden Markov models allow us to exploit this, and in this chapter, we present the tool Jocx, which uses a framework of these models to infer demographic parameters in ancient speciation events.


Introduction
Understanding how species form and diverge is a central topic of biology, and by observing emerging species today, we can learn about many of the genetic and environmental processes involved. Through such observations, we can understand the underlying forces that drive speciation, but to understand how specific speciation events occurred in the past, and how existing species formed, we must make the inference from the signals these events have left behind. Speciation processes leave genetic "fossils" in the genomes of the resulting species, and through what you might call genetic paleontology we can study these past events.
The main objective of the methods we describe in this chapter is to infer demographic parameters, Θ, given genetic data, D, through the model likelihood: L(Θ | D) = Pr(D | Θ). Here, we assume that Θ contains information such as effective population sizes, time points where population structure changes (populations split or admix), or migration rates between populations. We can connect data and demographics through coalescence theory [8]. This theory gives us a way to assign probability densities to genealogies; densities that depend on the demographic parameters, f(G | Θ). Then, if we know the underlying genealogy, we can assign probabilities to observed data using standard algorithms such as Felsenstein's likelihood recursion [7] and get Pr(D | G, Θ). Theoretically, we now simply need to integrate away the nuisance parameter G to get the desired likelihood:

L(Θ | D) = ∫ Pr(D | G, Θ) f(G | Θ) dG.

In practice, however, the size of the space of all possible genealogies prevents this beyond a small sample of sequences and for any sizeable length of genetic material. Approximations are needed, and the sequential Markov coalescent (see Chapter 1) and coalescent hidden Markov models approximate the likelihood in two steps: they assume that sites are independent given the genealogy, i.e.,

Pr(D | G, Θ) = ∏_{i=1..L} Pr(D_i | G_i, Θ),

where L is the length of the sequence and D_i is the data and G_i the genealogy at site i, and they assume that the dependency between genealogies is Markovian:

f(G_1, ..., G_L | Θ) = f(G_1 | Θ) ∏_{i=2..L} f(G_i | G_{i-1}, Θ).

Both assumptions are known to be invalid, but simulation studies indicate that this model captures the most important summary statistics from the coalescent [17,18] and that it can be used to accurately infer parameters in various demographic models [2,14,16]. Because the likelihood now has the form of a hidden Markov model, we can compute it efficiently using the so-called Forward algorithm (see Chapter 3 in Durbin et al. [3]).
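To make the Forward algorithm concrete, the following is a minimal, self-contained sketch of how an HMM log-likelihood is computed with rescaling to avoid numerical underflow on long sequences. It is illustrative only and is not Jocx's implementation; in a CoalHMM the hidden states would be discretized coalescence times and the matrices would be derived from Θ.

```python
import math

def forward_log_likelihood(pi, trans, emit, obs):
    """Log-likelihood of an observation sequence under an HMM.

    pi[k]       -- initial probability of hidden state k
    trans[j][k] -- probability of moving from state j to state k
    emit[k][o]  -- probability of emitting symbol o from state k
    obs         -- list of observed symbol indices
    """
    n = len(pi)
    # Initialization: joint probability of state k and the first symbol.
    alpha = [pi[k] * emit[k][obs[0]] for k in range(n)]
    log_lik = 0.0
    for o in obs[1:]:
        # Rescale to avoid numerical underflow on long sequences.
        scale = sum(alpha)
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
        # Recursion: sum over predecessor states, then emit symbol o.
        alpha = [sum(alpha[j] * trans[j][k] for j in range(n)) * emit[k][o]
                 for k in range(n)]
    log_lik += math.log(sum(alpha))
    return log_lik
```

The running time is O(L n²) for L sites and n hidden states, which is what makes whole-genome likelihood computations feasible; ZipHMM, discussed below, reduces the effective L by compressing repeated subsequences.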
This efficiency has permitted us and others (see Chapters 7 and 10) to apply this approximation to the coalescent process to infer demographic parameters from whole-genome data [1, 9, 11-13, 19, 24, 25, 27], in addition to inferring recombination patterns [20,21] and scanning for signs of selection [4,22].

Software
We have created a theoretical framework for constructing coalescent hidden Markov models from demographic specifications [2,14-16] and used it to implement various models in the software package Jocx, available at https://github.com/jade-cheng/Jocx.git. Jocx handles the state-space explosion of dealing with many sequences by creating hidden Markov models for all pairs of sequences and then combining these into a composite likelihood when estimating parameters. In brief, a full analysis looks something like the following. In the remainder of this chapter, we describe in detail how to apply Jocx to sequence data and how to interpret the results.
$ Jocx.py init . iso a.fasta b.fasta
$ Jocx.py run . iso nm 0.0001 1000 0.1

Jocx executes CoalHMMs by specifying a model and an optimizer. It uses sequence alignments in the format of "ziphmm" directories, which are also prepared by Jocx. The program prints to standard output the progression of the estimated parameters and the corresponding log likelihood. The source package contains a set of Python files, and it requires no installation.

Preparing Data
Jocx takes two or more aligned sequences as input; the number of sequence pairs depends on the CoalHMM model specified for a particular execution. We will discuss CoalHMM model specification later. For example, for inference in a two-population isolation scenario [14], we need a minimum of one pair of aligned sequences, with one sequence from each of the two populations. The input should be Fasta files with names matching the names of the sequences we will use in the analysis, and since the sequences will be interpreted as aligned, they should all have the same lengths. The preprocessing will skip indels and handle all symbols except A, C, G, and T as the wildcard N. In the following example, sequence a and sequence b form an alignment. Each sequence may have multiple data segments (e.g., contigs or chromosomes). In the example, we have two segments, 1 and 2. The names of these data segments need to be consistent between the two sequences. In the software, we have a data-preparation step and a model-inference step. In the data-preparation step, we supply Fasta sequences by providing their file names, e.g., a.fasta and b.fasta. We use the ZipHMM framework [26] to calculate likelihoods; in previous experiments we have found that ZipHMM gives us one to two orders of magnitude speedup in full-genome analyses. To use ZipHMM in Jocx, we must preprocess the sequence files. The preprocessing step is customized to each demographic model and is done using the $ Jocx.py init command. This command takes a variable number of arguments, depending on how many sequences are needed for the demographic model we intend to use. The first two arguments are the directory in which to put the preprocessed alignment and the demographic model to use. The sequences used for the alignment, the number of which depends on the model, must be provided as the remaining arguments.
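The symbol-handling rules above can be sketched in a few lines of Python. The function names and interfaces here are illustrative, not Jocx's own API; the sketch only demonstrates the two rules stated in the text (gap columns are skipped, and everything outside A, C, G, T becomes the wildcard N).

```python
def clean_sequence(seq):
    """Map every symbol other than A, C, G, and T to the wildcard N.

    Mirrors the preprocessing rule described in the text; this is an
    illustrative helper, not part of Jocx.
    """
    return ''.join(c if c in 'ACGT' else 'N' for c in seq.upper())

def drop_indel_columns(a, b):
    """Skip alignment columns where either sequence has a gap ('-')."""
    pairs = [(x, y) for x, y in zip(a, b) if x != '-' and y != '-']
    return (''.join(x for x, _ in pairs), ''.join(y for _, y in pairs))
```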
In the aforementioned two-population isolation scenario, the model iso, we need to process two aligned sequences, so the init command will take four arguments in total. To create a pairwise alignment for the isolation model, we would execute the following command:

$ Jocx.py init . iso a.fasta b.fasta
# Creating directory: ./ziphmm_iso_a_b
# creating uncompressed sequence file
# using output directory "./ziphmm_iso_a_b"
# parsing "a.fasta"
# parsing "b.fasta"
# comparing sequence "1"
# sequence length: 900
# creating "./ziphmm_iso_a_b/1.ziphmm"
# comparing sequence "2"
# sequence length: 900
# creating "./ziphmm_iso_a_b/2.ziphmm"
# Creating 5-state alignment in directory: ./ziphmm_iso_a_b/1.ziphmm
# Creating 5-state alignment in directory: ./ziphmm_iso_a_b/2.ziphmm

The result of the init command is the directory ziphmm_iso_a_b that contains information about the alignment of a.fasta and b.fasta in a format that ZipHMM can use to efficiently analyze the isolation model. Each Fasta data segment forms its own ZipHMM subdirectory. In the above example, we have two data segments, named 1 and 2, so we have two ZipHMM subdirectories. The exact structure of this directory is not important to how Jocx is used, but you must preprocess input sequences to match each demographic model you will analyze.
To see the list of all supported models, use the --help option. Here iso is the two-population two-sequence isolation scenario, shown below. For each model the tool implements, the --help command will show an ASCII image of the model, annotated with the parameters of the model and with leaves labelled by populations. Below the image, the parameters are listed in the order they will be output when optimizing the model, followed by the sequences in the order they must be provided to the init command when creating the ZipHMM files. Finally, the help lists the pairs of sequences that will be used in the composite likelihood in the list of "groups." When initializing a sequence alignment, you will get a ZipHMM directory per group.
The two-population isolation demographic model is symmetric, so the order of input Fasta sequences does not matter. This is not always the case. For example, in a three-population admixture model, shown below, the roles the populations take are different. Population C is admixed, and it is formed from ancestral siblings of the two source populations, A and B. The order of input Fasta sequences, therefore, needs to match.
In this model, we have five unknown time points and durations to be estimated. They are the three-population isolation time (iso_time), the two time points where the admixed population merges with each of the two source populations (buddy23_time_1a, buddy23_time_2a), and finally the duration before all populations find their common ancestry for the first population (greedy1_time_1a). The last unknown duration can be calculated from these. In this example, we have the same admixture demographic model as before, but with each population contributing two sequences to form six pairwise alignments, which are then used to construct six HMMs for the inference.
In the two-population isolation model, we have one demographic transition for a pair of samples: from a two-population isolation scenario (Fig. 1a) to a single ancestral population scenario (Fig. 1b). In the three-population admix model, we have three kinds of demographic transitions for a pair of samples: from a two-population duration (Fig. 1a) to a three-population duration (Fig. 2a), then to another two-population duration (Fig. 2b), and finally to a single ancestral population (Fig. 1b). In the three-population duration, only two populations are allowed to exchange lineages, shown in Fig. 2a as the second and third populations; hence we call this duration buddy23. In the second two-population duration, one of the two populations only accepts lineages because it was not involved in the admixture event at the previous state-space transition. Since we have one population that never gives lineages during this time, we call this duration greedy1.

Inferring Parameters
To infer parameters, we maximize the model likelihood. Jocx implements three optimization subroutines: Nelder-Mead (NM), genetic algorithms (GA), and particle swarm optimization (PSO). After preparing the ZipHMM directories, the user can run the CoalHMM to maximize the likelihood with one of these three algorithms using the run command. The first argument of this command, as for the init command, is the directory where the ZipHMM preprocessed data is found. The next argument is the demographic model. If we preprocessed the ZipHMM data with the iso model, we can use iso here to fit that model. The third argument is the optimization algorithm, one of nm, ga, or pso.
Following the optimizer option are the initial values for the optimization. These arguments should match the number and order of parameters given by the --help command. In the iso model, for example, we infer three parameters: the population split time, tau; the coalescent rate, coal_rate; and the recombination rate, recomb_rate. In this model, the populations are assumed to have the same coalescent rate, which is why there is only one parameter for it.

NM
NM was introduced by John Nelder and Roger Mead in 1965 [23] as a technique to minimize a function in a many-dimensional space. The method maintains a simplex of candidate solutions and uses a handful of algorithm coefficients to determine how far to reflect, expand, or contract the simplex at each step. In the output of an NM execution, we have a final report of whether or not the execution was successful, together with the optimal solution. It is possible for the optimizer to fail for various reasons, the number of parameters being a major cause. If the parameter space is too large, the Nelder-Mead optimizer often fails, and one of the other optimizers will do better.
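To make the simplex steps concrete, the following is a minimal Nelder-Mead sketch using the standard coefficients (reflection 1.0, expansion 2.0, contraction 0.5, shrink 0.5). It is an illustration of the technique only, not Jocx's implementation, and it omits refinements such as convergence tests and outside contraction.

```python
def nelder_mead(f, x0, step=0.5, iters=200):
    """Minimal Nelder-Mead minimizer for a function of n variables.

    Illustrative sketch with the standard coefficients: reflection
    1.0, expansion 2.0, contraction 0.5, shrink 0.5.
    """
    n = len(x0)
    # Initial simplex: x0 plus one point perturbed along each axis.
    simplex = [list(x0)]
    for i in range(n):
        p = list(x0)
        p[i] += step
        simplex.append(p)
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        # Centroid of all points except the worst.
        cen = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        # Reflect the worst point through the centroid.
        refl = [cen[i] + (cen[i] - worst[i]) for i in range(n)]
        if f(refl) < f(best):
            # Try expanding further along the same direction.
            exp = [cen[i] + 2.0 * (cen[i] - worst[i]) for i in range(n)]
            simplex[-1] = exp if f(exp) < f(refl) else refl
        elif f(refl) < f(simplex[-2]):
            simplex[-1] = refl
        else:
            # Contract towards the centroid; shrink the whole simplex
            # towards the best point if even that fails.
            con = [cen[i] + 0.5 * (worst[i] - cen[i]) for i in range(n)]
            if f(con) < f(worst):
                simplex[-1] = con
            else:
                simplex = [best] + [[(p[i] + best[i]) / 2 for i in range(n)]
                                    for p in simplex[1:]]
    return min(simplex, key=f)
```

Because the simplex has only n + 1 points in an n-dimensional space, the method degrades as the number of parameters grows, which matches the failure mode noted above.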

GA
GA was introduced by John Holland in the 1970s [10]. The idea is to encode each solution as a chromosome-like data structure and operate on a population of them through actions analogous to genetic alterations, usually selection, recombination, and mutation. In the output of a GA execution, we have multiple generations of solutions and multiple solutions per generation. Solutions in each generation are ordered by fitness, i.e., the best solution is at the top. The final solution is, therefore, the first solution in the last generation.
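A minimal real-valued GA can be sketched as follows. This is an illustration of the selection/recombination/mutation loop only, not Jocx's implementation; the population size, mutation rate, and blend-style recombination are arbitrary choices for the sketch.

```python
import random

def genetic_algorithm(f, bounds, pop_size=30, generations=100, seed=1):
    """Minimal real-valued genetic algorithm minimizing f.

    Each 'chromosome' is a list of floats.  Selection keeps the better
    half of the population, recombination blends two parents, and
    mutation adds Gaussian noise.  Illustrative sketch only.
    """
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=f)                     # best (lowest f) first
        parents = pop[:pop_size // 2]       # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            w = rng.random()                # recombination: blend parents
            child = [w * x + (1 - w) * y for x, y in zip(a, b)]
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < 0.2:      # mutation
                    child[i] += rng.gauss(0.0, 0.1 * (hi - lo))
                    child[i] = min(max(child[i], lo), hi)
            children.append(child)
        pop = parents + children
    return min(pop, key=f)
```

Note that keeping the parents in the next generation preserves the best solution found so far, which is why reading the first solution of the last generation, as described above, yields the final estimate.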

PSO
PSO was introduced by Eberhart and Kennedy in 1995 [5] as an optimization technique relying on stochastic processes, similar to GA. As its name implies, each individual solution mimics a particle in a swarm. Each particle holds a velocity and keeps track of the best position it has experienced and the best position the swarm has experienced. The former encapsulates the cognitive influence, i.e., a force pulling towards the particle's own best. The latter encapsulates the social influence, i.e., a force pulling towards the swarm's best. Both forces act on the velocity and drive the particle through the parameter space. In the output of a PSO execution, we have multiple generations and multiple particles (solutions) per generation. Each particle contains two sets of solutions: the current solution and the best solution that this particle has encountered throughout the PSO's execution. The latter is never worse than the former. Similar to GA, each generation is ordered by the particles' fitness. The final solution is, therefore, the second solution of the first particle in the last generation.
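The velocity update described above can be sketched as follows. This is an illustrative sketch, not Jocx's implementation; the inertia weight and the cognitive and social coefficients (0.7, 1.5, 1.5) are common textbook choices, not values taken from Jocx.

```python
import random

def particle_swarm(f, bounds, n_particles=20, iters=100, seed=1):
    """Minimal particle swarm optimizer minimizing f.

    Each velocity is pulled towards the particle's own best position
    (cognitive term) and the swarm's best position (social term).
    Illustrative sketch only.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                   # each particle's best
    gbest = min(pbest, key=f)[:]                  # swarm's best
    w, c1, c2 = 0.7, 1.5, 1.5                     # inertia, cognitive, social
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            if f(pos[i]) < f(pbest[i]):
                pbest[i] = pos[i][:]
                if f(pbest[i]) < f(gbest):
                    gbest = pbest[i][:]
    return gbest
```

The pair (pos[i], pbest[i]) corresponds to the two sets of solutions reported per particle in Jocx's output, with pbest[i] never worse than pos[i].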

Simulation, Execution, and Result Summarization
In this section, we will use a simulation experiment to show how to perform a full analysis and extract the final solution. We will use the software fastSIMCOAL2 [6] to simulate sequences under given demographic parameters, and we will use the two-population isolation model. All scripts and input files used here can be found in the Companion Material of this book.
We execute the following command to generate variable sites of a two-sequence alignment.
$ ./fsc251 -i input.par -n 1

The first argument points to a file containing the demographic parameters, shown below. The second argument specifies the number of simulations to perform; we need only one pairwise alignment. This simulation input file corresponds to the isolation-model demography and model parameters, and our goal is to recover these parameters through CoalHMM model-based inference. The historical event line contains seven parameters: the time of the event (in generations), the source population id, the destination population id, the proportion of the population that migrated in this event, the new population size of the source population, the new growth rate, and the new migration matrix to use after this event. The last line describes the genetic data: the type of data, the length of the simulated sequence, the recombination rate, and the mutation rate. The direct output from the simulation program is a directory with the same name as the input file; in this case, this directory contains three files. The first file, input_1.arb, lists the file paths and names of the generated alignments. The second file, input_1.simparam, records the simulation conditions and serves as a log. The last file, input_1_1.arp, contains the variable sites of a sequence alignment. The content of this file is shown below. We use the script arlequin2fasta.py to convert the Arlequin alignment into Fasta files. Since the Arlequin file contains only the variable sites, we need to specify the total length of the simulated sequence, which should match the simulation parameter in the input file input.par, e.g., 8,000,000 in this example.
$ ./arlequin2fasta.py input/input_1_1.arp 8000000

This creates two Fasta sequences for the pairwise alignment, and they are ready for Jocx's analysis. Analysis using Jocx follows the two-step procedure described earlier: we first prepare the ZipHMM data directory using the init command and then infer parameters using the run command. The following commands conduct a full analysis, testing all three optimization options with ten independent executions per optimizer. Upon completion, we receive ten sets of parameter estimates per optimization method. The format of the standard output, which contains the inference results, is different for each optimization method. We can use the following commands to summarize and plot the outcome. The plotting script is also provided in the Companion Material.
$ tail -q -n 1 nm-*.stdout > nm-summary.txt
$ grep '500 1' ga-*.stdout > ga-summary.txt
$ grep '500 1' pso-*.stdout > pso-summary.txt

The results are shown in Fig. 3. The first command collects the inference results from the NM optimizer; the last line in an NM execution's standard output contains the final estimates. The other two commands collect the inference results from the GA and PSO optimizers; the first solution/particle in the last generation/iteration, which is 500 in this experiment, contains the estimates. The plotting script simply places these estimates in box plots. The first parameter indicates the summary file to plot. The second parameter indicates the number of parameters that this model has; for the two-population isolation model, we have three parameters. Each particle in PSO contains two sets of results, the local best and the swarm's best; the second set, the swarm's best, should be used. The last parameter specifies the output file's name. At the bottom of each box plot we print the median value of the estimates.
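The per-parameter medians printed below the box plots can be computed with a few lines of Python. The summary-file layout assumed here is hypothetical, chosen only for illustration: each line is whitespace-separated and ends with the parameter estimates collected by the commands above.

```python
import statistics

def column_medians(lines, n_params):
    """Median of the last n_params numeric fields on each line.

    Assumes (for illustration only) that each summary line ends with
    the parameter estimates as whitespace-separated numbers.
    """
    rows = []
    for line in lines:
        fields = line.split()
        rows.append([float(x) for x in fields[-n_params:]])
    # Transpose so that each column holds one parameter's estimates.
    return [statistics.median(col) for col in zip(*rows)]
```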
The demographic parameters we use in this experiment are 0.0002, 2083, and 0.5. They are the split time of the two isolated populations, the coalescent rate, and the recombination rate, respectively. These values are roughly recovered by CoalHMM for all the optimizers.
In summary, the following commands conduct a full simulation and estimation analysis, summarizing the final results by creating box plots and printing the median estimate for each parameter.
$ ./fsc251 -i input.par -n 1
$ ./arlequin2fasta.py input/input_1_1.arp 8000000
$ ./Jocx.py init . iso \
    ./input_1_1-sample_1-1_1.fasta \
    ./input_1_1-sample_2-2_1.fasta
$ ./Jocx.py run . iso pso 0.0001 1000 0.1 > pso-0.stdout
  :
$ grep '500 1' pso-*.stdout > pso-summary.txt

The first command simulates a pairwise sequence alignment using the fastSIMCOAL2 program. The second command uses a custom script to convert the simulated alignment from the Arlequin format to the Fasta format. The third command prepares the ZipHMM directories using the Fasta sequences. The fourth command executes CoalHMM's model inference and dumps the output to a file; potentially, multiple independent runs are dispatched, and an HPC cluster may be involved in this step. The fifth command obtains the inference results from the output files. The number 500 here is the maximum iteration count for this experiment, and the number 1 indicates the first particle in the last iteration. Finally, the plotting script plots the parameters and presents the median estimates as the final results.

Conclusions
We have presented the Jocx tool for estimating parameters in ancestral population genomics. The tool uses a framework of pairwise coalescent hidden Markov models combined in a composite likelihood to implement various demographic scenarios. A full list of available demographic models is available through the tool's help command. Using a simple isolation model, we described an analysis pipeline based on simulating data and then analyzing it using the three different optimizers implemented in Jocx. This pipeline is available in the Companion Material associated with this chapter and serves as a good starting point for getting familiar with Jocx before moving on to more involved models.