Estimation of Evolutionary Dynamics and Selection Pressure in Coronaviruses

Evolution of coronaviruses is facilitated by the strong selection, large population size, and great genetic diversity within the susceptible hosts. This predisposition is primarily due to high error rate, and limited proofreading capability of the viral polymerase and by recombination. These characteristics make coronaviruses an interesting model system to study the mechanisms involved in viral evolution and the ways viruses adapt to switch host or to gain novel functions. Here we describe the protocol to estimate selection pressures for the spike gene and evolutionary dynamics of bovine coronaviruses.


Introduction
Coronaviruses encode the largest positive sense single-stranded RNA genomes known, ranging from 27 to 31 Kb in length. Although coronaviruses have been shown to possess proofreading ability [ 1 ], relatively high mutation rates mean that coronaviruses are one of the most diverse, genetically distinct, and recently emerging groups of viruses. The emergence of these viruses are mainly triggered by the virus evolution which could occur due to high mutational rates, selection pressure on genetic diversity, interand intra-host selection, frequency of recombination, and genetic drifts during transmission bottlenecks. Within subfamily Coronaviridae , Alphacoronaviruses , and Betacoronaviruses infect and cause diseases in mammals, whereas Gammacoronaviruses are mainly avian specifi c [ 2 ].
Taken together, experimental data and mathematical models have reinforced the need for studying coronavirus dynamics and evolution, which could provide bases for effective control measures. Recent availability of quantitative deep-sequencing methodologies has provided data that can be modelled for future prediction of transmission dynamics and to estimate relevant parameters. In this protocol, we used publically available S gene data on BCoV, as prototype coronavirus, and analyzed to predict epidemiological linkage, mutation-prone sites and evolution in the S gene of BCoV. The same protocol is applicable to other genes of the coronaviruses and viruses of other families.

Materials
To perform in silico analysis of the S genes of BCoV, the following equipment will be required ( see Note 1 ): 10. An appropriate Internet access.

Methods
The following procedures are adapted for Mac OS but are equally applicable for other systems.

Open the downloaded fi le (BCoV_S genes.fas) in TextEdit and
edit the sequence titles. The sequence titles can be arranged depending upon objective in mind and the availability of downstream analysis tools. One accepted way of labelling the sequence title will be to arrange them in host/isolate_ID/ genotype/country/year (accession number). Remove all illegal characters along with empty spaces and replace them with underscore/understrike (_). However, do not remove any greater than signs (>), which will destroy the .fasta format and may require rebuilding of the data set [ 11 ]. To do so, use the "Find" and "Replace with" options in the TextEdit, which can be opened with "cmd+F" command in Mac OS X. Save the fi le before closing the dataset. Distributed under the GNU General Public License Type "help" or "help <command>" for information on the commands that are available.
Type "about" for authorship and general information about the program.
8. To set the evolutionary model to the GTR substitution, type "lset nst=6 rates=invgamma" after the MrBayes > prompt then press "Enter". The message "Successfully set likelihood model parameters" indicates the success in model setup. 9. To set the sample collection (200) from posterior probability distribution, diagnostic calculation every 1,000 generations, and print and sample frequency to 100, type "mcmc ngen=20000 samplefreq=100 printfreq=100 diagnfreq=1000" after the MrBayes > prompt then press "Enter". Program will start calculating the split frequency depending on the speed of the operating system and the size of the dataset. Note the message "Average standard deviation of split frequencies". If it is below 0.01 after 2,000 generations, type "yes" after "Continue the analysis? (yes/no)" prompt to set more generations. Continue this until the split frequency drops below 0.01. Once reached, type "no" which leads the users to MrBayes > prompt. 10. To summarize the parameter, type "sump" then press "Enter".
11. To summarize the tree type "sumt" then press "Enter". This command will save the tree with extension "nex.con.tre" (i.e., BCoV_S genes_align.nex.con.tre) in the MrBayes folder where the original fi le (BCoV_S genes_align.nex) was kept. The tree can be opened and annotated in the FigTree.
13. Label your sequences by searching your sequence-tag, such as isolate name or country, in the search button when "Taxa" is selected. Similarly, select "Nod" or "Clade" to label the respective items ( see Note 6 ).
14. After annotation, save your tree using File -> Export Graphics -> PDF (or other desired fi le format from the list) -> OK path. The resulting fi le can be used for further editing or for presentation [ 11 ].
1. To analyze the occurrences of synonymous (dS) and nonsynonymous (dN) substitutions in the S gene, use the same fasta fi le (BCoV_S_genes_align.fas) that was generated for phylodynamics ( see Note 7 ).
2. Open the SNAP tool freely available at http://www.hiv.lanl. gov/content/sequence/SNAP/SNAP.html and paste the sequence or upload the dataset.
3. Both accumulated (cumulated dN-dS) and per codon (dN-dS) selection sites can be calculated by the generated table of SNAP.

SNAP
4. Since the selections are calculated on every nucleotide, sites under positive or negative selection can be highlighted ( see Note 8 ).
1. Alternatively and to verify the robustness of the data generated by the SNAP, the same alignment can be used to calculate selection pressure using GTR (general time reversible) substitution model on a neighbor-joining phylogenetic tree by the Datamonkey Web server (Freely available at http://www. datamonkey.org/dataupload.php ).
2. The program uses the computational engine of the HyPhy package [ 12 ] to estimate dN-dS with a variety of evolutionary models and can analyze selection even in the presence of recombination ( see Note 9 ).

Within the BEAST package, open BEAUti program (Bayesian
Evolutionary Analysis Utility) and import Nexus (BCoV_S_ genes_align.nex) or Fasta (BCoV_S_genes_align.fas) fi le of the data set. Remember to execute the data by "File -> Import Data -> Open".
2. Several parameters of the BEAST run (i.e., the date of the sequences, the substitution model, the rate variation among sites, the length of the MCMC chain) can then be adjusted according to specifi c need [ 13 ] ( see Note 10 ).
3. Once all desired parameters are set, fi nally, click on the "Generate BEAST File" to generate .XML fi le which will be used as input for BEAST analysis.

Label the fi le as BCoV_S_genes_align.xml for consistency.
This is a brief explanation in order to run BEAST program and summarize results using TRACER.
2. Open the BEAST program (double-click), a white screen on JAVA environment will appear, wait for several seconds until a second screen appears.
3. Choose the fi le to analyze in this second screen. Before beginning the analyses enable the "Allow overwriting of log fi les" option. Then press "Run" and the analysis will begin.
4. After few moments, depending upon the processing capacity of the operating system and the size of the data, the chain will begin to run. There will be seven columns that extend vertically. Every column is one of the parameters that are being estimated; however, the fi rst and the last column are crucial to observe. The fi rst column is the generation being sampled in every moment (every chain has ten million steps) and the last

BEAST Analysis
Evolution in Coronaviruses column shows how many millions of states will be run per hour (remember, ten million steps per chain). Depending on the length of the chain, the length of the sequences, and the number of sequences to be analyzed, it may take variable time to complete the run.

5.
Once the chain has run, it is required to store the parameters. Close the BEAST window and open the BEAST folder. Every time a chain is run, two fi les are generated: .xml fi le and several ends (.log and .tre). Once the fi rst run is complete, change the name of the .log and .tre fi les. For example, after the completion of a run for BCoV_S_gene_align.xml, BCoV_S_gene_ align.log and BCoV_S_gene_align.tre fi les will be generated. Rename these two fi les to BCoV_S_gene_align1.log and BCoV_S_gene_align1.tre.
7. Finally, three different .log fi les and .tre fi les will be available labelled as BCoV_S_gene_align1.log, BCoV_S_gene_align2. log, and BCoV_S_gene_align3.log. These three fi les contain the estimations of the substitution rate that have to be summarized in TRACER.
8. To summarize the run, open the TRACER program and select the option "File" and "Input Trace fi le", and open the fi rst …1.log fi le from the folder, followed by the addition of the second (…2.log) fi le. Finally add the third (…3.log) log fi le.
9. The estimations of the parameters are viewable in the graphic interface. Select the option "Combined" from the Trace Files (Upper left) and the estimations that will appear on the traces table are the main estimations for all the parameters ( see Note 11 ). Generally, the desired parameters are: Tree Model Root : This is the number of years that passed after the most recent common ancestor (TMRCA). Subtract this number from the most modern date to yield the TMRCA for the dataset.
Clock Rate : This is directly the rate of evolution in substitution/site/year.

Notes
1. This protocol is optimized for Mac OS X; however, all the software packages and tools used here are also available for Windows which can be installed using recommended methodologies. All the software used here are Open Access, which do not require any subscription for any operating systems.
These software packages are only for demonstration purposes, and there may be alternative solutions for the same purpose. The overall time of the data analysis depends upon processing power of the operating system and the number and length of sequences in the dataset.
2. The same phylogenetic tree can be used for different interpretations. Failing to create a proper objective can lead to drawing incorrect conclusions from phylogenetic studies. It is therefore essential to defi ne the objective for the downstream analyses before initiating the study.
3. Construction of datasets depends on the objectives. One of the most common interests of bioinformaticians is to determine the epidemiological linking of the query sequence to that of sequences reported from the world and are available in the public domains. For this purpose, the Basic Local Alignment Search Tool (BLAST) is the most widely used tool, primarily owing to its speed of execution. Search the nucleotide sequences with objective-based keyword such as "Bovine Coronaviruses S gene". Manual editing and investigations of the downloaded sequences are always suggested. Notably, BLAST-Explorer is primarily aimed at helping the construction of sequence datasets for further phylogenetic study, and it can also be used as a standard BLAST server with enriched output. Use BLAST or BLAST-Explorer or other suitable database for construction of datasets.
4. There are different algorithms for DNA sequence alignment with variable degrees of utility. In this protocol, ClustalW was used for simplicity. Any other algorithm can be used depending upon the preferences and interest.
5. Nexus format is required input for MrBayes. Different tools, both online and offl ine, can be used to generate appropriate nexus output. We have only presented two commonly used and easily achievable methods.
6. Detailed demonstration for tree annotation is described in our earlier publication [ 11 ].
7. The fi le used for phylogenetic analysis may contain all available sequences in the public domain, which increases the size of the fi le signifi cantly. However, depending upon the objective in mind, the datasets can be modifi ed accordingly. For the larger datasets, the compiled data will be emailed to the email address provided once ready. This is also important to keep a record for future use.
8. The cut point is calculated to be zero. All sites showing cumulated dN-dS values above 0 are under positive pressure whereas values below 0 are under negative pressure.
9. The nature of parameter selection and interpretation is complex and is beyond the scope of this protocol. Please consult developers' published report for thorough understating of the concepts and applications [ 12 ]. 10. The parameters of the BEAST run are crucial and can determine the nature of output and may heavily infl uence the results. However, normally the default parameters are used [ 13 ].
11. When summarizing the BEAST results, do not use the mean as it appears in the output (this is the arithmetic mean); instead use the geometric mean that appears in the summary statistic