Fungal isolates and DNA extraction
Isolates of 15 species in the Latin American clade of Ceratocystis (Mbenoun et al. 2013) were obtained from the culture collection (CMW) of the Forestry and Agricultural Biotechnology Institute (FABI), University of Pretoria, South Africa. Species included C. fimbriata, C. cacaofunesta, C. manginecans, C. platani, C. acaciivora, C. colombiana, C. curvata, C. diversiconidia, C. ecuadoriana, C. eucalypticola, C. fimbriatomima, C. mangicola, C. mangivora, C. neglecta, and C. papillata. Two C. pirilliformis isolates were included as outgroups for the phylogenetic studies (Table 1). Depending on availability, between two and five isolates from different geographic locations and/or hosts were chosen per species, including the ex-type strain of each species. All isolates were grown in culture on 2 % malt extract agar (MEA) supplemented with 50 mg/l streptomycin (Sigma–Aldrich, Germany) and 100 μg/l thymine (Sigma–Aldrich, Germany).
For DNA extraction, mycelium was scraped from the surface of MEA plates and freeze-dried. Samples were crushed to a powder with sterile metal beads in a mixer mill type MM 301 Retsch® tissue grinder (Retsch, Germany). DNA was extracted using a phenol/chloroform method (Goodwin et al. 1992). Extracted DNA was quantified using a NanoDrop ND-1000 instrument (NanoDrop, Wilmington, DE, USA) and the quality assessed by gel electrophoresis on a 1 % agarose gel (AGE). For AGE, 5 μl DNA was combined with 2 μl GelRed™ (Biotium, California, USA) and the DNA visualized under UV illumination. DNA concentrations were standardized to a working dilution of 30 ng/μl for subsequent reactions.
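The standardization to a 30 ng/μl working dilution follows the usual C1 × V1 = C2 × V2 relationship. The sketch below is only an illustration: the 50 μl final volume and the 240 ng/μl stock reading are hypothetical values, not taken from this study.

```python
def dilution_volumes(stock_ng_per_ul, target_ng_per_ul=30.0, final_volume_ul=50.0):
    """Return (stock volume, diluent volume) in microlitres using C1*V1 = C2*V2.

    final_volume_ul is an assumed convenience value; the study does not
    state the volume of working dilution that was prepared.
    """
    if stock_ng_per_ul < target_ng_per_ul:
        raise ValueError("stock is more dilute than the target concentration")
    stock_volume = target_ng_per_ul * final_volume_ul / stock_ng_per_ul
    return stock_volume, final_volume_ul - stock_volume


# Hypothetical NanoDrop reading of 240 ng/ul
stock_ul, water_ul = dilution_volumes(240.0)
print(f"{stock_ul:.2f} ul stock + {water_ul:.2f} ul water")
```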
Single-copy phylogenetic gene regions
Primer design, PCR amplification and sequencing
Regardless of whether ITS sequence data were available on GenBank, amplification and sequencing of all isolates in this study was repeated to avoid discrepancies. The ITS1 and ITS4 primers (White et al. 1990) were used for amplification. Due to a long poly-A repeat in the sequence, primers ITS2 and ITS3 were used for additional sequencing of the internal regions in some isolates (White et al. 1990).
Seven additional gene regions were tested for amplification and for their potential use as phylogenetic markers in Ceratocystis. These were Mcm7, Tsr1, CAL, RPBII, βT 2, FG1093 and MS204. Information for all primers used in PCR and sequencing reactions in this study is summarised in Table 2. The Mcm7 region was amplified with primers Mcm7-709 and Mcm7-1348 (Schmitt et al. 2009) and CAL with CAL2F and CAL2R2 primers (Duong et al. 2012). Some primers were modified at a few nucleotide sites to be more specific for species in Ceratocystis by aligning them to the C. fimbriata genome (GenBank accession number APWK01000000) (Wilken et al. 2014). For amplification of the FG1093 region, the FG1093F.cerato and FG1093R.cerato primers were modified from the original FG1093 E1F1 and FG1093 E3R1 primers (Walker et al. 2012b). For the MS204 region, the MS204F.cerato and MS204R.cerato primers were modified from the MS204 E1F1 and MS204 E5R1 primers. For species where problems were experienced in amplification, a smaller region of MS204 was amplified with the primers MS204F.ceratoB and MS204R.ceratoB. Various primer combinations were tested for the Tsr1 region. These included the original primers (Schmitt et al. 2009) and primers modified from the original (Tsr1.cerato), as well as a completely new forward primer designed in this region. Primers RPB2-5Fb and RPB2-7Rb, used to amplify the RPBII region, and primers T1d, Bt1d and Bt2d, used for amplification of the βT 2 region, were also designed from the C. fimbriata genome. For isolates where no reliable data were available for the βT 1 and EF 1-α regions in GenBank, these regions were amplified using primers βt1a and βt1b (Glass and Donaldson 1995) for βT 1, and EF1-728F and EF1-986R (Jacobs et al. 2004) for EF 1-α.
The PCR reactions were identical to those performed by Duong et al. (2012), except that each reaction contained 0.2 μM of each forward and reverse primer, or 0.8 μM in the case of degenerate primers. The MgCl2 concentration differed among primer sets (Table 2). The PCR program for amplification with the CAL, RPBII, βT 2, and MS204.ceratoB primer sets was identical to the program used previously (Duong et al. 2012). An Expand-PCR program was used for the ITS, Mcm7, βT 1, EF 1-α, MS204 (MS204.cerato) and FG1093 primer sets. The program was as follows: 96 °C 10 min, (94 °C 30 s, 55 °C 45 s, 72 °C 1 min) × 10 cycles, (94 °C 30 s, 55 °C 45 s, 72 °C 1 min +5 s/cycle increase) × 30 cycles, 72 °C 10 min. Annealing temperatures differed for each primer set (Table 2).
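To make the Expand-PCR program concrete, the following sketch lists the extension time of every cycle and sums the programmed hold times. It assumes that the first of the 30 ramped cycles already carries one 5 s increment, and it ignores thermal ramping between steps; both are assumptions, since the text does not specify them.

```python
def expand_pcr_extension_times(base_s=60, fixed_cycles=10, ramp_cycles=30, increment_s=5):
    """Extension time in seconds for each cycle of the Expand-PCR program."""
    fixed = [base_s] * fixed_cycles
    # Assumption: the first ramped cycle is already extended by one increment.
    ramped = [base_s + increment_s * (i + 1) for i in range(ramp_cycles)]
    return fixed + ramped


def total_program_seconds(initial_s=600, denature_s=30, anneal_s=45, final_s=600):
    """Sum of programmed hold times; heating and cooling ramps are not included."""
    cycles = expand_pcr_extension_times()
    return initial_s + sum(denature_s + anneal_s + ext for ext in cycles) + final_s


times = expand_pcr_extension_times()
print(len(times), times[0], times[-1])
print(total_program_seconds() / 60, "min of programmed holds")
```

Under these assumptions the final cycle has a 210 s extension, which shows how the ramp compensates for declining polymerase activity in later cycles.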
PCR and sequencing products were purified with 6 % Sephadex G-50 columns using the manufacturer’s protocols (Sigma–Aldrich, Germany). Amplification reactions for sequencing were performed, as described previously (Duong et al. 2012), using the ABI PRISM® BigDye Terminator Cycle Sequencing Ready Reaction Kit (Applied Biosystems, Foster City, California, USA). All sequence data generated in this study were submitted to GenBank (see Table 1). For the FG1093 region, some isolates required cloning to obtain optimal sequencing results, and this was performed with a pGEM®-T Easy Vector System, following the manufacturer’s protocols (Promega, Madison, WI, USA).
Sequence alignment and phylogenetic analyses
The quality of the raw sequence reads was evaluated and the reads were assembled in CLC Main Workbench v.6 (CLC bio, www.clcbio.com). Consensus sequences for each gene region of all isolates were aligned in MAFFT version 6, with the alignment strategy set to E-INS-i for the ITS data set and L-INS-i for all other data sets (Katoh et al. 2005). Alignments of the data sets were also manually inspected and edited in MEGA 5 (Tamura et al. 2011). Two of the gene regions contained a long poly-A repeat region, and this was excluded from the analyses. Maximum parsimony (MP), maximum likelihood (ML), and Bayesian inference (BI) analyses were applied to each data set individually for tree construction. Parsimony analysis was performed in PAUP* version 4.0 (Swofford 2002) and trees were obtained using the heuristic search option with 1,000 replicates, with random addition of sequences and a tree bisection and reconnection (TBR) branch swapping strategy. Both introns and exons were considered for each gene region. Indels were treated as a fifth character. For application in ML and BI analyses, the best model of evolution for each gene region was identified using jModelTest version 0.1 and applying the Akaike information criterion (Posada 2008). ML tree construction was performed in PhyML 3.0 (Guindon and Gascuel 2003), with the following criteria: proportion of invariable sites was 0, gamma shape was estimated by the program, and the number of substitution types (nst) was set to six (except for βT 1, where nst = 2). The starting tree was obtained using the BIONJ approach, and the branch swapping strategy was set to select the best of either NNI or SPR algorithms. Statistical support for the branches of both MP and ML trees was obtained using 1,000 replicates of non-parametric bootstrap analysis of the sequence data. As Ceratocystis pirilliformis resides in Ceratocystis but is distantly related to the species in the Latin American clade, it was selected as the outgroup to root the trees.
Additional branch support was obtained using Bayesian analysis, applying a Markov Chain Monte Carlo (MCMC) algorithm in the MrBayes version 3.1.2 program (Ronquist and Huelsenbeck 2003). Tree searching was performed using four independent chains run for 6 000 000 generations, with trees sampled every 100th generation. Analyses were performed twice, and concordance between the two runs was investigated by comparing the log likelihoods in Tracer version 1.5 (Rambaut and Drummond 2009). The burn-in for each data set was set to 10 000 generations in MrBayes. The posterior probabilities for the tree topology were obtained by constructing a consensus tree from the data using MrBayes and viewing it in TreeView X (Page 1996).
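As a quick check on what these MCMC settings imply, the arithmetic below counts the trees sampled and retained per run. It assumes that the sampling interval is expressed in generations and that the burn-in of 10 000 generations is converted to samples by the same interval; the initial state at generation 0 is not counted.

```python
def mcmc_sample_counts(generations=6_000_000, sample_every=100, burnin_generations=10_000):
    """Return (sampled, discarded, retained) tree counts for one MCMC run."""
    sampled = generations // sample_every
    discarded = burnin_generations // sample_every
    return sampled, discarded, sampled - discarded


sampled, discarded, retained = mcmc_sample_counts()
print(sampled, discarded, retained)
```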
Phylogenetic value of single genes and gene combinations
A combination of criteria was used to select the gene regions with the most potential for use as phylogenetic markers, and three- and four-gene region combinations were subsequently evaluated. Selection was based on i) the number of species that could be distinguished with significant bootstrap and Bayesian support values (>70 BS and >95 BI) by the gene region, ii) the number of species shown to be monophyletic based on the genealogical sorting index (gsi) value (Cummings et al. 2008), and iii) the congruence in tree topology as compared to a combined reference tree (Nye topological score) (Nye et al. 2006).
The gsiT is a statistical support value, in addition to BS and BI values, that indicates the exclusive ancestry of a group of organisms in a genealogy, and has proven informative in recent fungal phylogenetic studies (Sakalidis et al. 2011; Taole et al. 2012; Walker et al. 2012a). The analysis produces a value on a scale from 0 to 1 for each identified group, with 0 indicating no exclusive ancestry from other groups in the genealogy and 1 representing monophyly. For each gene region, the gsi value was calculated for 100 randomly selected ML bootstrap trees, and 10 000 permutations were performed to assess the statistical support of each gsi value. From these values, the gsiT was calculated as a weighted average of all 100 gsi values for each gene region, with a P value of <0.05 considered statistically significant (Sakalidis et al. 2011). All calculations were performed online at http://www.genealogicalsorting.org/index.php.
To determine the accuracy of each individual gene region in representing the relationships among all taxa, a Bayesian consensus tree for each gene region was compared to a combined reference tree of all of the gene regions (Aguileta et al. 2008). The combined reference tree was constructed from all eight gene regions using a Bayesian approach, as described in Aguileta et al. (2008), incorporating the corresponding nucleotide substitution model for each gene region. The tree topologies were compared in the online program ‘Compare2Trees’ (http://www.mas.ncl.ac.uk/~ntmwn/compare2trees/index.html), which is based on an algorithm that compares the branches and partition of nodes between two trees and gives an overall topological congruence score (Nye et al. 2006). An overall score of 85 % was selected as the cutoff point for a marker to be compatible with the other regions considered.
The five most informative gene regions were selected on the basis of the three criteria described above. A partition homogeneity test (PHT) with 1,000 repeats was performed on different arrangements of three- and four-gene region combinations in PAUP* 4.0 to determine whether the sequences could be combined (Swofford 2002). Combined trees, based on three or four gene regions, were constructed using MP, ML and BI analyses. Identical conditions to those applied to the individual gene regions were used in the different tree construction methods.
Development of SNP markers
Sequence data generation for SNP calling
SNP markers were developed from 454 sequence data for six species in Ceratocystis. These markers could subsequently be used to consider variation in the rest of the species in this genus. This method was shown to be effective in a previous study on fungal species complexes (Pérez 2010; Pérez et al. 2012). To generate sequence data, reduced representations of genome sequences were generated with a protocol similar to the initial steps of an AFLP protocol up to the pre-amplification step (Myburg and Remington 2000) and sequenced with 454 pyrosequencing. Isolates included for sequencing were C. fimbriata (CMW 1547, 15049, 14799) (Table 1), C. cacaofunesta (CMW 14809, 15051, 14798), C. platani (CMW 14802, 23918, 23450), and a combined group comprising C. manginecans, C. acaciivora, C. mangicola, and C. mangivora (CMW 13851, 13852, 21123, 22563, 17568, 17570, 23623, 14797, 15052), which are referred to as the C. manginecans group in this part of the study.
All genomic DNA was digested with a frequent- and a rare-cutting restriction enzyme, after which restriction enzyme-specific adaptors were ligated to the DNA fragments. A master mix digestion reaction consisting of 1× R/L buffer, 2 units EcoRI, 2 units MseI, and ddH2O to a final volume of 10 μl per sample was used to digest 150 ng genomic DNA. The DNA solution was made up to a 20 μl reaction volume with ddH2O, and this was mixed with 10 μl of master mix. The reaction was incubated for 3 h at 37 °C, and then for 15 min at 65 °C. The ligation reaction was performed directly afterwards by combining 30 μl digested DNA with 10 μl ligation master mix (1× R/L buffer, 1 mM ATP pH 7, 1 pmol EcoRI- and 10 pmol MseI adaptors, 1 unit T4 DNA ligase up to 40 μl final volume with ddH2O) and was incubated for 3 h at 22 °C.
Pre-amplification reactions were performed with 5 μl of the ligated products. Total reaction volumes of 30 μl consisted of 1× PCR buffer (+1.5 mM Mg), 0.2 mM dNTP, 0.3 μM EcoRI+A primer, 0.3 μM MseI+C primer, and 0.6 units Taq polymerase (Expand Taq). The PCR program was as follows: 94 °C for 4 min, (94 °C for 30 s; 56 °C for 30 s; 72 °C for 1 min +1 s/cycle extra) × 25 cycles, 72 °C for 2 min. Amplification smears were analysed by AGE on a 2 % gel.
Amplicons of all the isolates of the same species were pooled in four sample sets representing C. fimbriata, C. cacaofunesta, C. platani, and C. manginecans. Amplicons were pooled by combining 25 μl PCR product of each isolate, precipitated with 0.1 vol of 10 M NaOAc and 2.5 vol absolute ethanol, and incubated on ice for 10 min. Samples were then centrifuged and washed with 70 % ethanol, and the dried product was resuspended in 30 μl H2O. Sample sets were size-separated on a 1.2 % agarose gel at 60 V for 1 h. Bands in the size range of 150–450 bp were excised and purified using the NucleoSpin® Extract II kit (Macherey-Nagel, Germany).
The 454 adaptors with identity tags for each of the four species were added to the DNA fragments by means of a PCR reaction. The forward primer sequence consisted of the 454 adaptor A sequence, the species-specific sequence tag, and the EcoRI adaptor-specific sequence (sequence: 5’GCCTCCCTCGCGCCATCAG-NNNN-GACTGCGTACCAATTC3’). The reverse primer sequence consisted of the 454 adaptor B sequence, the species-specific sequence tag, and the MseI adaptor-specific sequence (sequence: 5’GCCTTGCCAGCCCGCTCAG-NNNN-GATGAGTCCTGAGTAA3’) (Pérez 2010). The species-specific identification tag was ATCG for the C. cacaofunesta sample set, CTAG for the C. fimbriata sample set, AGCT for the C. platani sample set, and CAGT for the C. manginecans sample set. The program for the PCR amplification was 94 °C 2 min, (94 °C 30 s, 60 °C 30 s, 72 °C 60 s) × 25 cycles, and 72 °C for 2 min.
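The structure of the tagged fusion primers, and the way the four-base tags allow pooled reads to be traced back to a sample set, can be sketched as follows. The sequences and tags are those given above. The `assign_sample_set` helper is hypothetical and assumes that the 454 adaptor has already been trimmed, so that a read begins with the tag followed by the EcoRI or MseI adaptor-specific sequence; the actual demultiplexing procedure used in the study is not described.

```python
ADAPTOR_A = "GCCTCCCTCGCGCCATCAG"   # 454 adaptor A (forward primer)
ADAPTOR_B = "GCCTTGCCAGCCCGCTCAG"   # 454 adaptor B (reverse primer)
ECORI_SPECIFIC = "GACTGCGTACCAATTC"
MSEI_SPECIFIC = "GATGAGTCCTGAGTAA"

TAGS = {
    "C. cacaofunesta": "ATCG",
    "C. fimbriata": "CTAG",
    "C. platani": "AGCT",
    "C. manginecans": "CAGT",
}


def fusion_primers(sample_set):
    """Return (forward, reverse) fusion primers for one sample set."""
    tag = TAGS[sample_set]
    return ADAPTOR_A + tag + ECORI_SPECIFIC, ADAPTOR_B + tag + MSEI_SPECIFIC


def assign_sample_set(read):
    """Hypothetical demultiplexing: match tag plus adaptor-specific sequence."""
    for sample_set, tag in TAGS.items():
        if read.startswith(tag + ECORI_SPECIFIC) or read.startswith(tag + MSEI_SPECIFIC):
            return sample_set
    return None  # tag not recognised


forward, reverse = fusion_primers("C. fimbriata")
print(forward)
print(reverse)
```

Requiring the adaptor-specific sequence directly after the tag reduces the chance that a four-base motif occurring by chance is mistaken for a tag.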
PCR amplicons from all four sample sets were precipitated using 10 μl NaOAc (10 M) and 200 μl absolute ethanol (100 %). DNA concentrations of all amplicons were adjusted to 30 ng/μl, and 20 μl of each of the four samples were pooled to a final volume of 80 μl. The product was sequenced using the Genome Sequencer 454 FLX (Roche, Inqaba Biotec, Pretoria, South Africa).
SNP identification, marker development and application to Ceratocystis species
Raw reads from the 454 sequencing were assembled using the CLC Genomics Workbench 5.0 (CLC bio), and contigs were generated for orthologous regions containing sequences from all four species. Contigs were constructed with the following parameters: similarity = 0.8, length fraction = 0.5, insertion and deletion costs = 3, mismatch cost = 2 and minimum contig length = 200. Each contig was investigated individually to determine the presence of SNPs that were conserved within a species but able to differentiate between species. Informative contigs were identified based on the number of SNPs present in a contig (minimum of four SNPs) and the number of species between which the SNPs could differentiate. For the purposes of this study, a nucleotide difference was considered as a species-specific SNP only where it occurred in the majority of reads of at least one of the species. Regions that were present in the C. fimbriata genome more than once, based on BLAST results, and regions too variable for primer design were excluded.
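One possible reading of these selection rules is sketched below. It assumes that a position counts as a species-specific SNP when at least two species have different majority bases, where a majority base is one found in more than half of that species' reads, and that a contig is informative when it carries at least four such positions. The data layout (one dictionary per contig position) is hypothetical.

```python
from collections import Counter


def majority_base(bases):
    """Return the base present in more than half of the reads, else None."""
    if not bases:
        return None
    base, count = Counter(bases).most_common(1)[0]
    return base if count * 2 > len(bases) else None


def is_species_snp(column):
    """column maps species name -> list of bases observed at one position."""
    calls = {majority_base(bases) for bases in column.values()}
    calls.discard(None)  # species without a clear majority do not contribute
    return len(calls) > 1


def is_informative_contig(columns, min_snps=4):
    """Apply the minimum of four SNPs per contig stated in the text."""
    return sum(is_species_snp(col) for col in columns) >= min_snps


position = {
    "C. fimbriata": ["A", "A", "A", "G"],
    "C. platani": ["G", "G", "G"],
}
print(is_species_snp(position))
```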
For the selected contigs, primers were designed in the non-variable regions flanking the informative SNP region, using CLC Main Workbench 5.0 (CLC bio). Where SNPs were located too close to the 3’ or 5’ end of the contig, the C. fimbriata genome (isolate CMW 14799, with GenBank accession number APWK01000000) was used to design primers located upstream or downstream from the SNP regions. Parameters were set for a maximum primer length of 22 bp, minimum primer length of 18 bp, maximum G/C content of 0.6, minimum G/C content of 0.4, maximum melting temperature of 58 °C and minimum melting temperature of 48 °C. All primer pairs were designed with melting temperatures (Tm) as close together as possible in order to simplify multiplex PCR reactions. The designed primers (Table S1) were synthesized by Inqaba Biotec (Pretoria, South Africa).
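The design limits listed above amount to a simple filter on candidate primers, sketched below. The melting temperature is passed in as a value rather than computed, because the Tm model used by CLC Main Workbench is not specified here; the example primer sequence and its Tm are hypothetical.

```python
def gc_fraction(primer):
    """Fraction of G and C bases in a primer sequence."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def passes_design_limits(primer, tm_celsius):
    """Check length (18-22 bp), G/C content (0.4-0.6) and Tm (48-58 C)."""
    return (
        18 <= len(primer) <= 22
        and 0.4 <= gc_fraction(primer) <= 0.6
        and 48.0 <= tm_celsius <= 58.0
    )


# Hypothetical candidate primer with a Tm supplied by the design software
candidate = "ACGTTGCAGCTAGCTAGGTC"
print(len(candidate), gc_fraction(candidate), passes_design_limits(candidate, 55.0))
```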
Each of the primer sets designed was first tested on the four species groups used for 454 sequencing. This was to ensure PCR success and to confirm the presence of the SNPs as predicted using the 454 data. The regions that were most informative and amplified well in the majority of species were selected and were amplified in three to five additional isolates of all other species included in this study (Table 1). PCR conditions were identical to those used for the single gene region amplification. The PCR program was as follows: 95 °C 5 min, (94 °C 30 s, 55 °C 30 s, 72 °C 90 s) × 38 cycles and a final extension of 72 °C for 10 min.
PCR products were amplified in 96-well PCR plates and purified using the ExoSAP method (Glenn and Schable 2005). Purified PCR products were used for amplification of sequencing products performed in 96-well MicroAmp® reaction plates. Sequencing products were purified using ethanol precipitation (Glenn and Schable 2005). The dried product was sequenced on an ABI PRISM® 3500xL auto-sequencer (Applied Biosystems, Foster City, California, USA).
Evaluation of SNP markers for species delimitation
Sequences from amplified SNP regions were assembled, analysed, and edited in CLC Main Workbench 5.0 (CLC bio). Two different approaches were applied to investigate the SNP variation among the isolates. First, the entire sequenced region for each SNP primer set was considered and a combined data set of all SNP regions was generated. A cladogram was constructed from the data set based on MP and BI analyses, using settings similar to those used to analyse the single gene regions. Due to the presence of large indels, gaps were coded as a fifth character using FastGap version 1.0.7 (Borchsenius 2007). The best model of evolution for each SNP region was determined using jModelTest 0.1 (Posada 2008) and was implemented in the BI analysis.
In the second approach, only the SNP sites (SNPs and indels) from each region were considered in constructing a haplotype network. All of the SNPs from the selected SNP regions were combined into a single concatenated SNP haplotype for each of the isolates. This was constructed by aligning the sequences of all isolates for each SNP region separately in MEGA 5 using MUSCLE alignment (Edgar 2004) and removing the constant sites. The variable sites from all SNP regions were then concatenated. Haplotypes were determined from the aligned SNP data in DnaSP v. 5 (Librado and Rozas 2009). Gaps were included for haplotype construction. The identified haplotypes were used as input data to construct a haplotype network, based on a median-joining algorithm in NETWORK (Bandelt et al. 1999).
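The construction of concatenated SNP haplotypes can be illustrated with a minimal sketch. It is not the MEGA or DnaSP workflow itself: it simply removes constant alignment columns per region, concatenates the remaining sites per isolate, and groups isolates that share an identical string. Gaps ("-") are treated as a character state, in line with the inclusion of gaps described above. The isolate names and sequences are hypothetical.

```python
def variable_sites(alignment):
    """Indices of alignment columns that are not constant across isolates."""
    seqs = list(alignment.values())
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]


def snp_haplotypes(regions):
    """regions: list of {isolate: aligned sequence}, one dict per SNP region.

    Returns (concatenated SNP string per isolate, isolates grouped by haplotype).
    """
    isolates = list(regions[0])
    concatenated = {iso: "" for iso in isolates}
    for alignment in regions:
        sites = variable_sites(alignment)
        for iso in isolates:
            concatenated[iso] += "".join(alignment[iso][i] for i in sites)
    groups = {}
    for iso, haplotype in concatenated.items():
        groups.setdefault(haplotype, []).append(iso)
    return concatenated, groups


# Hypothetical toy alignments for two SNP regions
region1 = {"iso1": "ACGT", "iso2": "ACGA", "iso3": "AC-T", "iso4": "ACGT"}
region2 = {"iso1": "TTC", "iso2": "TTC", "iso3": "TAC", "iso4": "TTC"}
concatenated, groups = snp_haplotypes([region1, region2])
print(concatenated)
print(groups)
```

The grouped haplotypes would then serve as the input for network construction in a dedicated program.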