Next-Generation Analysis of Trypanosomatid Genome Stability and Instability

Understanding the rate and patterns of genome variation is becoming ever more amenable to whole-genome analysis through advances in DNA sequencing, which may, at least in some circumstances, have supplanted more localized analyses by cellular and genetic approaches. Whole-genome analyses can utilize both short- and long-read sequence technologies. Here we describe how sequence generated by these approaches has been used in trypanosomatids to examine DNA replication dynamics, the accumulation of modiﬁed histone H2A due to genome damage, and evaluation of genome variation, focusing on ploidy change.


Introduction
The genome is the crucible of life for cellular organisms. Accurate copying and transmission of DNA content from parent to offspring is needed for the propagation of cellular life. However, the genetic content cannot remain unchanged, since alterations in sequence are needed for adaptation to changing or hostile environments, and are the basis for selection by evolution. Normally these content changes accumulate slowly and often occur in a limited way, such as through single base changes or small insertions and deletions, subtly altering the protein and RNA products that are encoded. Other changes are more dramatic, either because they happen at high rates or because they occur through larger scale changes in the genome. These dramatic changes are not frequently found as examples of evolution but instead are normally genetically programmed reactions that reflect specific aspects of organism biology and development. Prominent examples of such adaptive change are immunoglobulin gene rearrangements during B-and T-cell development [1,2], mating type switching during yeast life cycles [3], and genome reconstruction in a range of ciliates [4][5][6]. In all cases, understanding of the timing, rate and mechanism of such dramatic change relative to the slower accrual of evolution-directing mutations is greatly aided by mapping the nature and location of the changes. Such questions are hugely easier to address, and more comprehensively evaluated, using next generation sequencing approaches that study such biology on a genome-wide scale.
Trypanosomatids-parasites within the wider order Kinetoplastae-provide a rich source of adaptive change, including frequent changes in the content and location of genes encoding the critical variant surface glycoprotein coat that underlies immune evasion by antigenic variation in Trypanosoma brucei [7], and the remarkable levels of genome-wide mosaic aneuploidy and gene amplification that are seen during Leishmania growth and dissemination [8,9]. All such processes are amenable to analysis through next generation sequencing. Here, we provide details of three broad approaches that have been applied to date to understand the reactions that underlie stable transmission and directed variation in trypanosomatid genome content. First, we explain how the initiation and progression of DNA replication has been mapped by Marker Frequency Analysis (MFA-seq) [10,11]; elsewhere, we have discussed how this approach compares with other strategies to map replication initiation, either using next generation sequencing or not [12]. Second, we explain how sites of DNA damage may be mapped in trypanosomatids [13], exploiting chromatin immunoprecipitation of histone γH2A, a phosphorylated variant of core histone H2A, on residue Thr130, which is generated in response to a range of DNA insults [14,15]. Third, whole-genome sequencing, to date using Illumina methodology, can be exploited to map patterns of single nucleotide polymorphism (SNP) accumulation during growth [16], to determine patterns of chromosome and gene copy number variation (CNV) [17][18][19], and to predict structural changes, such as translocations [16,20]; here, as an example, we describe the evaluation of chromosome CNV (CCNV) in Leishmania major, using freely available files and datasets. We have described each of the three approaches such that they can be followed individually, rather than relying on cross-checking on materials and protocols between the approaches.
The following files must be obtained prior to the analysis: 1. adapters.fa-.fa file containing the adapter sequences used in the library preparation and sequencing processes.
2. refgenome.fa-fasta file containing the reference genome to which the sequencing data are going to be aligned to. This can be obtained from TriTrypDB.org (Downloads ! Data Files ! Current_Release). 2. Anti-γH2A antibody.

Data Analysis
All analyses can be carried out on a computer running UNIX or Linux with the following software installed: 1. FastQC (http://www.bioinformatics.babraham.ac.uk/pro jects/fastqc/).

Datasets
1. To perform chromosomal copy number variation (CCNV) evaluations, a chromosome-level assembly of the desired reference genome, as well as whole-genome sequencing reads preferentially in FASTQ format with at least 20Â of coverage is needed. Lower genome coverages could be used; however, they can compromise the ploidy analysis. The advantage of using FASTQ instead of fasta reads is the base quality information, which can be used to more accurately identify SNPs and genomic alterations. Although not required in all ploidy estimation methodologies, the use of a General Feature Format (GFF) file containing the genome coordinates of all annotated genes could be important to perform CCNV in complex genomes or to polish the ploidy estimation. Genomic reads generated by Illumina (https://www. illumina.com/documents/products/illumina_sequencing_ introduction.pdf) are suggested due to their high depth and base quality; however, this analysis can also be performed with reads from other technologies as 454 [22] and Ion torrent (https://www.thermofisher.com/br/en/home/ brands/ion-torrent.html) [23]. The use of long-reads such as generated by the PacBio Sequel/RSII (https://www.pacb. com/smrt-science/smrt-sequencing/) and Oxford Nanopore (https://nanoporetech.com/) to perform ploidy estimations is possible, but this analysis would need a different toolset from the one described in this chapter. 1. From a log-phase culture (1 Â 10 7 cells/ml T. brucei procyclic form cells; 5 Â 10 6 cells/ml L. major or L. mexicana promastigote form cells), collect 1 Â 10 8 cells by centrifuging for 10 min at 1000 Â g.

2.
Wash the pellet in 10 ml of 1Â PBS supplemented with 5 mM EDTA (see Note 3), and centrifuge again for 10 min at 1000 Â g.

3.
Resuspend the cells in 1.2 ml of 1Â PBS supplemented with 5 mM EDTA, and add 2.8 ml of ice-cold 100% methanol, in a drop-wise fashion while vortexing gently; the final fixing solution will be of 70% methanol, and the concentration of 2.5 Â 10 7 cells/ml (see Note 4).
4. Wrap the tube in aluminum foil, and store at 4 C (from overnight to up to 3 weeks) until the sorting.
5. Centrifuge the cells for 10 min at 1000 Â g, at 4 C.
6. Wash the pellet in 1 ml of 1Â PBS supplemented with 5 mM EDTA, and centrifuge again for 10 min at 1000 Â g, at 4 C.

2.
Wash the pellet in 10 ml of 1Â PBS supplemented with 5 mM EDTA (see Note 3), and centrifuge again for 10 min at 1000 Â g.

3.
Resuspend the pellet first in 500 μl of 1Â PBS, and then add 9.5 ml of 1% formaldehyde in 1Â PBS.
4. Incubate the cells for 10 min at room temperature.
5. Centrifuge the cells for 10 min at 1000 Â g.
6. Wash the pellet in 10 ml of 1Â PBS, and then centrifuge for 10 min at 1000 Â g.

7.
Resuspend the cells in 4 ml of 1Â PBS, so to have a final concentration of 2.5 Â 10 7 cells/ml (see Note 4).
8. Wrap the tube in aluminum foil, and store at 4 C (from overnight to up to 3 weeks) until the sorting.
9. Centrifuge the cells for 10 min at 1000 Â g.

10.
Resuspend the pellet, first in 1 ml of 0.01% Triton X-100 diluted in 1Â PBS, and then add the remaining 9 ml of 0.01% Triton X-100 diluted in 1Â PBS; incubate for 30 min at room temperature.
12. Wash the cells in 10 ml of 1Â PBS, and then centrifuge for 10 min at 1000 Â g.

13.
Resuspend the pellet first in 4 ml of 1Â PBS with 10 μg/ml of propidium iodide, and 100 μg/ml of RNase A.
14. Incubate for 1 h at 37 C, protected from light.
15. Transfer the cells to the 5 ml BD Falcon™ tube through the 35 μm nylon mesh cell strainer cap. 16. Keep the cells on ice and protected from light. 1. Immediately before introducing the sample into the sorting machine, gently vortex and refilter the sample through the cell strainer cap.
2. Run the samples through the FACS machine-the settings (see Note 6) will vary between systems and different FACS facilities will have different protocols of use in place-contact them first before running the experiment. Propidium iodide is excited by the 488 nm or 561 nm lasers and detected by the 695/40 nm or 610/20 nm detectors (respectively).
3. On the FACS machine software, create the following plots: side scatter area (SSC-A) vs forward scatter area (FSC-A), and PE area (PE-A) vs PE width (PE-W).
4. Initially, run~50,000 events to get an idea of the quality of the sample and to set the gates for the sorting. First, the SSC vs FSC scatter plot will inform on the shape of the cells, while the PE-A vs PE-W scatter will allow you to draw a gate to include only singlet events (excluding agglomerates of cells). From this gate, generate a histogram of cell count vs PE-A. This will allow you to see the "typical" cell cycle profile. On this graph, set the gates to sort cells in G1, S (can be divided into early S and late S populations), and G2/M phase populations.

Sort the samples into the different populations (see Note 7)
into collection tubes (see Note 8) containing 200 μl of lysis buffer; the collection chamber was kept at 4 C (see Note 9). If various collection tubes are needed, keep them on ice until the next step.
6. After the sorting, homogenize the sorted cells and lysis solution, and incubate for 2 h at 55 C (see Note 10).
7. The lysate can be stored at À20 C, for later gDNA extraction (see Note 11), or used straight for gDNA extraction. With the exception of the lysis one, all steps are as described in the instructions of the Qiagen DNeasy Blood and Tissue kit.
1. Thaw frozen lysates at 37 C and, in the case of multiple sorting sessions, pool per cell cycle stage; take note of the total volume per sample.
2. To each sample, add 100% ethanol, in a volume that corresponds to one-third of the sample's volume, and vortex thoroughly.
4. Centrifuge for 1 min at 6000 Â g and discard the followthrough.

5.
Repeat steps 3 and 4 until all the lysate has been passed through the column.
7. Centrifuge for 1 min at 6000 Â g and discard the flow-through. 8. Add 500 μl of buffer AW2 (Qiagen) to the column. 9. Centrifuge for 3 min at 20,000 Â g and discard the flowthrough.
10. Transfer the column to a 1.5 ml Eppendorf and add 50 μl of Buffer AE (Qiagen).

12.
Store the gDNA at À20 C, until needed for library prep.

Sequencing Library Preparation
Library preparation and sequencing were performed at facilities such as the Glasgow Polyomics at the University of Glasgow, or Eurofins Genomics (Germany). Seek advice on which kit and sequencing system to use, as these evolve rapidly and more efficient ones might be available at the time of your experiment. In our work we have used the following: 1. DNA libraries were prepared with the Nextera ® XT DNA Sample Preparation kit (Illumina) or the TruSeq ® DNA Sample Preparation kit (Illumina), according to the manufacturer instructions.
2. The samples were multiplexed and sequenced using either the Illumina MiSeq paired-end 250 bp sequencing system or the Illumina HiSeq paired-end sequencing system, according to the manufacturer instructions. 3. Align the trimmed files onto the reference genome using bow-tie2 and prepare the output for downstream analysis by using samtools to convert the output to binary format and sort it by coordinate. For convenience, this can be achieved by piping the output of bowtie2 directly to samtools, thus preventing the generation of a large intermediate file.

5.
Apply custom MFAseq Python script to perform marker frequency analysis. The script used to perform the MFAseq analysis is available at https://github.com/CampbellSam/MFAseq. Python version 2.7 or higher is required to run this script and the pysam package (version 0.8.3 recommended) is also a dependency that must be installed prior to running the analysis. Running the mfaseq.py script on the command line with no input will provide usage instructions that are also available in the script header. Two alignment files are the only required input but the window size and output file can also be specified as input parameters if necessary. This script will produce output in wiggle format (.wig) that can then be viewed as a custom track on EuPathDB using the GenomeBrowser tool or plotted for visualization using ggplot in R or Python. Details to alter the appearance of the EuPathDB track are also included within the script and will normally be printed as standard out when the output file is generated. The .wig data can also be imported to excel and plotted in Excel or Prism (GraphPad). Bigwig output can be visualized on other genome viewers, such as IGV.

Chromatin-Immunoprecipitation (ChIP)
For immunoprecipitation of γH2A we used 11 μl of antibody (produced in house).
1. If frozen, thaw chromatin on ice. Transfer 10 μl of chromatin to a 1.5 ml Eppendorf and store at À20 C. This will serve as the Input sample and will be used in downstream steps. The number of cycles to subtract is calculated from the dilution factor used (i.e., for a 1:10 dilution, log2 of 10 is 3.322)

ChIP-seq
For library preparation we recommend the TrueSeq ChIP Library Preparation kit from Illumina. The following steps then must be followed for both the Input and Elute sequencing files.
1. Assess read quality using FASTQC. This is done in the same manner as for MFA-seq (see Subheading 3.1.6).
2. Trim reads to remove Illumina adaptor sequences and based with quality scores <20.
# This can be done using Trim Galore, which will by default remove Illumina adapter sequences and bases with a Phred score < 20. Trim_galore also has a database of adaptors used in commercial library prep kits and will trim these if they are found. Custom adaptors can be supplied at the command line.
# For single-end reads trim_galore <file> #For paired-end reads trim_galore -paired <reads1File> <reads2File> 3. Map to the reference genome using Bowtie2 in "very-sensitive" mode and prepare BAM files for downstream analysis. This can be performed in the same manner as for MFA-seq (see Subheading 3.1.6) and will first require a reference genome. If single end reads are used, the command can be altered as follows: FastQC evaluates the quality of the read libraries and generate several report files. Among them, the * .html file contains the summary of the results (Fig. 1a, b). It is desirable that the mean read quality is higher than phred 30, meaning one expected error in 1000 bases (see Note 32).

Run Trimmomatic
Trimmomatic will exclude low quality reads, remove positions with low quality at the extremities of the quality reads and adaptors sequences.   Trimmomatic can be run using single-end or paired-end read libraries. The above example represents the command line by using paired-end reads, which requires two input files, Read-file-1. fastq and input file 2 is Read-file-2.fastq. Trimmomatic will generate four output files. The output files Paired_output_1. fastq and Paired_output_2.fastq contain the reads that have a corresponding high quality read pair with each other. On the other hand, the output files Unpaired_output_1.fastq and Unpaired_output_2.fastq contain the reads that did not have a high quality pair with each other.
At the end of the process, Trimmomatic will generate a summary of the process, as in the example below: This means that from the 21125144 reads present in the input files, 15021842 (71.11%) were of high quality in both pair-end read files, 1980418 (9.37%) were of high quality only in the first read library, 1406608 (6.66%) were of high quality only in the second read library and 2716276 (12.86%) presented low quality in both libraries, and were excluded. A comparison of the quality of read libraries before and after the trimming step can be seen in Fig. 1.

Mapping the Reads to the Reference Genome
BWA-MEM maps the filtered reads to the reference genome, generating an alignment file in Sequence Alignment/Map (SAM) format.
#Index the Reference genome > bwa index <Reference-genome.fasta> If necessary, the user could use all the "paired" and "unpaired" reads in this mapping analysis to improve the depth of coverage. To that end, it is possible to combine all reads in a single file, with the command: In summary, flagstat result showed that from the 30055150 reads in the input files, 25150804(83.68%) mapped in the reference genome TriTrypDB-41_LmajorFriedlin_Genome. fasta, where 25081734(83.48%) of the reads were properly paired (see Note 33). Samtools view flag "-b" converts the SAM to a BAM compact file, while the flag "-q 30" filter the alignment file to only report reads that presented a mapping quality higher than Phred 30. Finally, the flag "-h" generates a header for the BAM file.

Sort .bam File
Samtools sort sorts alignments based on leftmost coordinates. This sorting is required by the majority of the downstream programs that work with alignment files. The flag "-d" BEDTools will generate a file reporting the depth of coverage in each position of each chromosome. This file can be used to estimate the mean genome coverage as well as the mean coverage for each chromosome. This command will count and save, in the "Genome-positions" file, the number of lines in the file "Genome-coverage.bed", which is equivalent to the number of nucleotides in the evaluated genome. In the example, the "SRR6369659-Genome-positions" contains the number "32855095". The genome mean RDC can be obtained by dividing "Genome-RDC" by the "Genome-positions". In the example, by dividing 2144501882 by 32855095, the mean genome coverage of~65Â. In this command line, the "genome_RDC" corresponds to the mean genome coverage, estimated in the 3.3.6 step, which is "65" in the example command.

Notes
MFA-seq Analysis of Replication Dynamics 1. As a fixing agent, methanol is used for T. brucei procyclic cells and Leishmania promastigote cells. Insufficient T. brucei bloodstream form cells could be retrieved after fixing with methanol; instead, fixation with 1% formaldehyde (methanolfree) allows the retrieval of a sufficient number of cells.
2. The number of cells needed is variable depending upon how many cells are needed for downstream applications; for this protocol we recommend preparing 3 Â 10 8 cells for each sorting session. If other numbers of cells are to be used, adjust the volumes used in the protocol accordingly.
3. Washing in 1Â PBS in the presence of EDTA prevents cell aggregation.

4.
A final concentration of 2.5 Â 10 7 cells/ml is ideal for the following steps in the protocol; if denser, the cells start to form aggregates.

5.
After formaldehyde fixation the cells seem to become more fragile, so the speed used in the centrifugation was reduced from 1000 Â g to 700 Â g.
6. Settings such as flow and event threshold rates will vary depending on the machine; this will impact the hours that will be needed to obtain the desirable number of cells (e.g., a machine that sorts a max. of 3000 events per second vs one that does it at 10,000 events per second).
7. The number of cells needed depends on the application. The minimum number of cells we have used is 1 Â 10 6 cells in each of the S phase subpopulations (early and late).
8. The type of tubes will depend on the machine used. 9. The SDS in the lysis buffer will precipitate at 4 C, giving it a 'blurry' look; this is ok during collection and remember to mix it thoroughly after sorting and before lysis. 10. The temperature and incubation period allow for the reversal of the formaldehyde crosslinking.
11. If several sorting sessions are needed to obtain the desired number of cells, store the lysate at À20 C and extract gDNA from the samples of the different sorting sessions at the same time (e.g., if three sessions were run to obtain enough cells in S phase, pool the extracts from the three sessions and extract the gDNA using a single column).
Localization of DNA Damage by γH2A ChIP-seq 12. Do not exceed fixation time as cells will be difficult to lyse.
13. It may be helpful to observe cells before lysis for comparison. If cells are successfully lysed numerous nuclei should appear as black dots and cell bodies will appear "ghostly". It is important that the Dounce homogenizer is tight fitting for optimal lysis.
14. Shearing has been optimized for T. brucei chromatin. Incubation for 5 min results in fragments between 200 and 1500 bp in length. Shearing time may require optimization for other genomes. 15. Quantification has been carried out using a Qubit. If using a spectrometer for quantification, the DNA must be cleaned up by phenol-chloroform prior to quantification. Column purification is not recommended as protein may clog the column. 16. When performing multiple samples to compare, that is, treatment vs control, use the same quantity of chromatin (ng) per IP reaction.
17. Depending on the target, more chromatin may need to be added to the ChIP reaction in order to achieve a high enough concentration of DNA in the eluted sample for downstream analysis. If vastly more chromatin is required, the number of cells used for chromatin extraction can be increased. However, the volumes of Fixation Solution, Glycine, 1Â PBS, Lysis Buffer (as well as the supplements PIC and PMSF), Digestion Buffer (plus PIC and PMSF), Enzymatic Shearing Cocktail and 0.5 M EDTA must all be scaled accordingly. For example, double all reagents when the protocol is started with 2 Â 10 8 parasites.
18. When washing the samples, it is important to prevent the beads from drying; therefore, perform wash steps one tube at a time. 19. When large amounts of chromatin have been used, and so the samples contain higher concentrations of proteins, this step can be extended.
20. If little DNA is available for analysis, input DNA can be diluted further and eluted DNA can also be diluted. However, these must be corrected for accurately when calculating Input percentage.
21. This is especially important when analyzing regions of highly similar sequences such as T. brucei VSG genes. 22. When targeting γH2A, it is likely that some nonspecific binding to the unphosphorylated H2A histone occurs. Therefore, it may be useful to compare ChIP signal enrichment between control and treatment samples (e.g., wild-type and mutant). This can also be done via Deeptools using the bigwigCompare tool.
Next Generation Sequencing to Examine Genome Variation 23. The examples shown here were executed on an Ubuntu 10.10 server with a 2.8-GHz CPU. For 32 million reads, at least 4 GB of RAM is required. More RAM may need to if more deeply sequenced samples are used.

24.
Although not required for all ploidy estimation methodologies, the use of a GFF file can increase accuracy when performing CCNV analysis in complex genomes. This optional step removes the biased impact of non-CDS repetitive regions, as well as the bias from "N" blocks in the reference genome, improving the somy estimation.
25. The Illumina platform usually yields reads with the deepest and highest base quality. However, reads from other platforms, such as 454 and Ion torrent, can also be used.
26. Long-reads, such as generated by the PacBio Sequel/RSII and Oxford Nanopore platforms, are also suitable for ploidy estimations. The .bam file can be generated using BlasR (https://github.com/PacificBiosciences/blasr), and the following steps described in this chapter can be used.
27. It is preferable to use FASTQ format, rather than fasta format, because it encodes not only the sequence information itself, but also the base quality information. This is essential for accurate detection of SNPs and other types of structural alterations.
28. Coverage of at least 20Â is desirable. Using lower genome coverages can compromise sensitivity and accuracy of any structural variation analysis.
29. All software used here are compatible with the Linux operational systems, such as Fedora and Ubuntu. Some of them are also compatible with macOS; however, all of the examples were performed on Linux.
30. A guide on how to work with bash in command line can be seen in: https://www.cs.wcupa.edu/rkline/linux/bash-basics. html. A beginner guide can be seen here: https://www.lifewire. com/guide-to-bash-part-1-hello-world-2202041 and here https://www.bash.academy/. 31. In our example, we are using a paired-end read library from the L. major genome. Thus, when downloading it using fastqdump, the use of "--split-files" flag is mandatory. By doing so, reads from each pair will be split into two different files. When using single-end read libraries, the aforementioned flag is not needed.
32. It is not advisable to use read libraries with a mean read quality lower than 20. Such libraries will reduce the SNP call precision and compromise the accuracy of downstream analysis.
33. We suggest that read libraries in which more than 50% of the read were dropped should not be used in CCNV analysis. As the proportion of excluded reads is not necessarily similar to all chromosomes, this could lead to detection of false variation in chromosome copy number.