Microsatellite discovery in an insular amphibian (Grandisonia alternans) with comments on cross-species utility and the accuracy of locus identification from unassembled Illumina data

The Seychelles archipelago is unique among isolated oceanic islands because it features an endemic radiation of caecilian amphibians (Gymnophiona). In order to develop population genetics resources for this system, we identified microsatellite loci using unassembled Illumina MiSeq data generated from a genomic library of Grandisonia alternans, a species that occurs on multiple islands in the archipelago. Applying a recently described method (PALFINDER) we identified 8001 microsatellite loci that were potentially informative for population genetics analyses. Of these markers, we screened 60 loci using five individuals, directly sequenced several amplicons to confirm their identity, and then used eight loci to score allele sizes in 64 G. alternans individuals originating from five islands. A number of these individuals were sampled using non-lethal methods, demonstrating the efficacy of non-destructive molecular sampling in amphibian research. Although two loci satisfied our criteria as diploid, neutrally evolving loci with the statistical power to detect population structure, our success in identifying reliable loci was very low. Additionally, we discovered some issues with primer redundancy and differences between Illumina and Sanger sequences that suggest some Illumina-inferred loci are invalid. We investigated cross-species utility for eight loci and found most could be successfully amplified, sequenced and aligned across other species and genera of caecilians from the Seychelles. Thus, our study in part supported the validity of using PALFINDER with unassembled reads for microsatellite discovery within and across species, but importantly identified major limitations to applying this approach to small datasets (ca. 1 million reads) and loci with small tandem repeat sizes.


Introduction
Microsatellites or simple sequence repeats (SSR) are one of the most widely used tools in population genetics and conservation biology.As costs associated with highthroughput Next Generation DNA sequencing (NGS) have decreased, microsatellite discovery via genomic shotgun libraries (the Seq-to-SSR approach) has emerged as a reliable and economical methodology for non-model species (Abdelkrim et al. 2009;Allentoft et al., 2009;Castoe et al. 2012a).Several software packages have been developed to identify microsatellite loci this way including MSATCOMMANDER (Faircloth 2008), PALFINDER (Castoe et al. 2012a), and QDD (Meglécz et al. 2010;2014).Typically these methods use de novo assemblies or unassembled reads to identify SSR loci.The use of unassembled reads is a timesaving characteristic (Castoe et al. 2012a;Lance et al. 2013), however, several studies have cautioned against this approach when using short read data, such as from the Illumina platform.These studies have found that Illumina paired-end identified microsatellite loci can have an order of magnitude smaller success rate (in terms of amplification across multiple individuals and sequencing error) than those loci identified from longer 454 or PacBio reads (Drechsler et al. 2013;Wei et al. 2014).Despite this, the use of PALFINDER to identify SSR loci from short, unassembled Illumina reads is widespread (e.g.amphibians, Drechsler et al. 2013;Peterman et al. 2013;bivalves, O'Bryhim et al. 2012a;crustaceans, Stoutamore et al. 2012;fish, O'Bryhim et al. 2012b;Nunziata et al. 2013;mammals, Barthelmess et al. 2013;reptiles, Castoe et al. 2012b).
Interestingly, few studies have performed validation experiments using SSR loci identified from unassembled reads.Those that have (e.g.Mikheyev et al. 2010;Delmas et al., 2011;Castoe et al. 2012b) have typically used unassembled 454 reads, thus leaving the question of whether SSR loci can be reliably identified from shorter reads poorly addressed outside of a few studies (Drechsler et al. 2013;Wei et al. 2014).
In this study, we performed validation experiments (via direct sequencing and power analysis) on SSR loci identified with unassembled Illumina data from caecilians (Amphibia, Gymnophiona).Caecilians are limbless amphibians that are restricted mostly to parts of the wet tropics and typically possess a fossorial ecology (Gower & Wilkinson 2008).As such, they are generally rarely encountered and most species are poorly understood (Gower & Wilkinson 2005).Globally, there are ten families of caecilian (Wilkinson et al. 2011;Kamei et al. 2012).We focused on Grandisonia alternans (Stejneger,1893), an indotyphlid species that is widely distributed across the Seychelles, an archipelago situated off the coast of east Africa in the Indian Ocean (Fig. 1).With both endemic frog and caecilian radiations the Seychelles is unique among isolated oceanic archipelagos (Nussbaum 1984).Given the exceptional evolutionary insights provided by species occurring on small islands (e.g. Warren et al. 2015), the generally high level of threat of extinction faced by such species (e.g.Daltry 2007), and the scarcity of knowledge of caecilian population genetics (e.g.Gower & Wilkinson 2005), we were motivated to develop population genetics tools for the endemic caecilians of the Seychelles.
To accomplish this we identified thousands of putative SSR loci from shotgun genomic sequencing, tested 60 loci to investigate intraspecific utility, generated Sanger sequences to examine sequence similarity to Illumina reads, identified eight loci with putatively desirable characteristics for population genetics and size-scored loci in multiple individuals of G. alternans.We also examined the cross-species utility of these eight loci in additional caecilian taxa and compared divergence in their flanking regions relative to G. alternans sequences.Based on our findings we describe (i) the discovery of six microsatellite loci that amplify across divergent caecilian species, including two loci that satisfy theoretical assumptions of selective neutrally in G. alternans putative populations, and (ii) several important limitations we encountered while using the Seq-to-SSR approach with unassembled Illumina reads.

DNA isolation and microsatellite identification
We generated a shotgun genomic library from a pool of six barcoded Seychelles caecilian specimens, comprising representatives of five species: Grandisonia alternans, G. larvata (Ahl 1934), G. sechellensis (Boulenger 1911), Hypogeophis brevis Boulenger 1911and H. rostratus (Cuvier 1829).Library preparation was performed by first extracting genomic DNA with a Qiagen DNeasy Blood and Tissue kit and then using a standard Illumina Nextera DNA kit to prepare the pooled sample for sequencing (see Lewis et al. 2014).Sequencing was performed using a 500-cycle v.2 reagent kit on an Illumina MiSeq at the Core Research laboratories of the Natural History Museum, London.We removed and trimmed low quality reads of unassembled paired-end data using Illumina software and the FASTX-Toolkit (available at http://hannonlab.cshl.edu/fastx_toolkit/links.html).Barcoded pooled data were de-multiplexed and only sequences from the target taxa G. alternans were retained, originating from a single G. alternans individual from Silhouette Island (UMMZ 192945).This single individual dataset was used in all downstream analysis.

Potentially amplifiable loci (PAL) selection
We used PALFINDER software v 2.03 (Castoe et al. 2012a) to identify putative SSR loci from NGS output, run with default settings and applying the following criteria in Primer3 (Untergasser et al. 2012): (i) a minimum of eight tandem repeats for dinucleotide motifs, (ii) a minimum of six tandem repeats for trinucleotide motifs, and (iii) a minimum of six tandem repeats for tetranucleotide motifs.From the PALFINDER/Primer3 output, a limited number of loci were chosen manually for further screening.All loci were predicted to contain simple microsatellite repeat regions and care was taken to retain loci that corresponded with predicted di-, tri-, and tetramer repeat motifs.The occurrence rate of unique priming regions in the unassembled reads varied considerably, and so primer sets were intentionally selected to represent a range of occurrence frequencies.This approach contrasts with the design of many previous studies, which intentionally avoided priming sites that occur frequently because they may be associated with repetitive elements.Thus, some selected primer sets corresponded to regions where both forward and reverse priming regions occurred up to 1,000 times in the MiSeq reads, and some loci corresponded to where one or both priming regions occurred only once (e.g., Table 1).PALs were checked for homology against the NCBI's nucleotide database and retained only when they did not return high similarity to characterised sequence regions.

Microsatellite screening, fragment analysis, and scoring
Five G. alternans individuals from across the sampled geographical range of the species in the Seychelles archipelago were selected to investigate PCR amplification success and assess size variability in the PALs identified from the MiSeq data.All PCRs were conducted using either a Type-it Microsatellite PCR Kit or MyTaq™ Red Mix with manufacturer's recommended reaction mix and cycling conditions.Initial trials tested a range of annealing temperatures for each locus (50−63 o C), all further PCRs used the optimum 60 O C annealing temperature.To visualise reaction success the amplified products were run on 3% agarose gels (70V, 80mins).
For every locus that showed clear gel bands of approximately the expected size range predicted by PALFINDER in each of the five test samples, one or two G. alternans individuals (including the original sample used for Illumina library preparation) were selected and the locus re-amplified, product cleaned by vacuum filtration and Sanger sequenced in both directions using Big Dye ® Terminator Chemistry v3.1 on a 3730xl DNA Analyser by the DNA Sequencing Facility at the Natural History Museum, London.Chromatograms were compared to predicted PAL sequence and examined to confirm presence of microsatellite repeat region using Geneious R8 (Biomatters Ltd).Where microsatellite regions were confirmed, forward primers were synthesized with fluorescent dye labels prior to bulk screening individuals for size variation via Fragment Analysis (15 second injection) on a 3730xl DNA Analyser run with a Genescan 500 LIZ size standard.
Fragment analysis screening was conducted on all sampled individuals of G. alternans.PCRs used the protocols optimized in initial trials and were performed in 12.5uL reactions with a heated lid, with or without multiplexing.Products were run on agarose to confirm PCR success prior to fragment analysis and product of single locus PCRs (non-multiplexed) were diluted and mixed together prior to fragment analysis so that each resulting .fsafile contained results for four loci with different fluorescent colour labels.Allele sizes were scored using the Geneious Microsatellite Plugin 1.4 using the local southern sizing method.
In addition to fragment analysis, loci that were consistently amplified were chosen to investigate cross-species utility.PCRs were conducted for single individuals of each species, and when a sharp band could be observed on agarose.
Sanger sequencing was carried out as above.

Data Analysis
Microsatellite allele frequencies were compiled in Excel software v12.0 and input files and summary statistics were generated using CONVERT software v1.31 (Glaubitz 2004).To investigate the potential for the loci for future population genetic research, we tested their conformation to neutral expectations (i.e.Hardy Weinberg Equilibrium: HWE, Linkage Equilibrium) using Arlequin software v3.5.1.3(Excoffier & Lischer 2010).Because no a priori information exists as to what might constitute a panmictic population of G. alternans, samples were grouped by island for these analyses under the assumption that contemporary gene flow between islands for this species is highly unlikely.
Tests of statistical power were conducted to further investigate the potential of the loci to uncover genetic differentiation among populations, given the allelic variation observed at each locus.Both the level of differentiation that it was possible to detect using current sampling and the number of samples required to detect relatively low differentiation (at or above FST values of 0.02) were investigated.
Power analysis was performed using the program POWSIM (Ryman & Palm 2006) with each parameter set (Ne, t) employing 1,000 replications.
Cross-species alignments were performed using Geneious R8 and further altered by hand.Approximate estimates of homology (percent base pair differences) were calculated after ends were trimmed and microsatellite regions excluded.

PALs identified in Grandisonia alternans
After filtering results of pooled Illumina sequencing, 983,636 paired end shotgun reads were recovered from the genomic library of G. alternans individual UMMZ 192945.From these reads, PALFINDER identified 8,001 microsatellite loci, but only 560 of these loci contained simple di-, tri-, or tetranucleotide motifs meeting our selection criteria (Fig. 2).Of these, 60 loci were selected for initial PCR, 20 of which were found to amplify consistently across five test individuals and were subsequently Sanger sequenced.Sixteen of the 20 PCR products produced readable sequence data that could be aligned at least partially with the PAL sequence, of which 15 loci contained microsatellite repeat regions (Table 1).The majority of loci that were successfully amplified and re-sequenced corresponded to loci for which both forward and reverse priming regions occurred together multiple times among all PAL sequences (Fig 3).
Many of the PALs showed insertions when compared to Sanger sequence data, with most of these corresponding to duplications of near or adjacent regions (duplications ranged in size from 21 to 94 bps).Two PALs had substantial deletions (30 bps, 150 bps) when compared to Sanger sequence generated from PCR.The majority (~90%) of these indels did not occur beside microsatellite repeat regions (Fig. 4), suggesting that sequencing across potentially problematic repeat regions was not driving the observed sequencing errors.When taking these indels into account, close comparison between Sanger and PAL sequence across loci revealed that in two cases multiple primer sets were targeting the same genomic regions, reducing the number of independent microsatellites containing PALs to 11.These "duplicated" PALs corresponded to primer sets that usually, although not always, had relatively high occurrences of priming regions among the identified PALs (Table 1, Figure 3).
Ultimately eight loci were selected that appear to amplify reliably and show di-allelic size variation; three with di-nucleotide repeat motifs, four with trinucleotide, and one with a tetranucleotide repeat motif (Galt 1-8; Table 2).Figure 4 illustrates the differences between PAL and Sanger sequences for these loci.Data were generated for 64 G. alternans individuals across these eight loci, however, we found evidence that the quality of these SSR loci (in terms of population genetics utility) was variable.Two loci (Galt 2 and 8) displayed behaviour consistent with informative SSRs.Four loci (Galt 1, 2, 6, and 7) appeared to be heterozygous but fixed across all individuals of G. alternans screened.One locus (Galt 3) possessed different allelic sizes, but we never observed a heterozygous individual.One locus (Galt 5) contained large numbers of stutter peaks and we were therefore unable to score it consistently.

Utility of SSR markers from Grandisonia alternans for population genetics
Only two PALs (Galt 4 and Galt 8) amplified well, could be reliably scored and showed size variation consistent with diploid SSR loci.Of the 64 individuals scored, one sample (BMNH 2005(BMNH .1686) ) was not successfully genotyped at both these loci and was therefore excluded from further analysis.Within each of the five putative populations (grouped by island), no signature of linkage was detected between loci (all exact p-values > 0.29).Data for locus Galt 4 conformed to HWE across all five islands (all p-values > 0.12).Significant departures from HWE were observed for locus Galt 8 among individuals drawn from the two largest of the five islands (Mahé: p < 0.0001, Silhouette: p = 0.0415), potentially indicating substructure at this geographical scale.When data for all islands was pooled and re-analysed significant departures from linkage and HWE were observed, consistent with the hypothesis that some level of genetic structuring is present for G. alternans across its distribution in the Seychelles archipelago.
Results of our power analysis of the two loci are presented in Figure 5.Given five putative populations and levels of allelic variation observed from the 63 individuals surveyed here, the current two locus data set has the power to detect differentiation in the order of or above FST values of 0.04 if it is present (Fig. 5A).If sample sizes were increased to 40 per population, the two loci are likely to be useful in identifying differentiation as low as FST = 0.02 (Fig. 5B).

Cross-Species Amplifications
Six of the eight loci were successfully amplified and Sanger sequenced for other caecilian species, including members of the closely related genera Grandisonia, Hypogeophis and Praslinia.When compared with G. alternans data, the maximum sequence divergence observed was 5.28% (Table 2).SSRs were observed in different species for five of the six cross-amplified loci.None of the loci could be successfully amplified for the more distantly related Indian taxa using the G. alternans primer sets.

Discussion
Although we have identified two SSR loci suitable for describing population structure in the Seychelles caecilian, Grandisonia alternans, and six SSR loci that amplify across a range of other caecilian taxa, our investigation into the utility of using short read MiSeq Illumina sequence data to identify SSR loci suitable for population genetics was disappointing.We recognise issues associated with initial locus identification, with the quality of resulting markers, and with the amount of investment (both financial and in time) needed to use this approach.Although our approach differed in some respects from previous studies (using a relatively small NGS dataset, longer paired-end reads, considering a range of loci with respect to both tandem repeat type and frequency of occurrence in the NGS data set), the issues we identify are broadly applicable to validation of loci identified from shotgun datasets.
Using our data, we encountered two major limitations associated with the locus identification portion of the Seq-to-SSR method via PALFINDER: (i) multiple primer sets that amplified the same region and (ii) a PAL that lacked microsatellites.
Both these limitations appear to be directly related to inaccuracies in the short read shotgun sequence data, resulting in PAL sequences that were substantially different from sequences obtained via direct Sanger re-sequencing.This problem is likely to be symptomatic of the type of data used, and we suspect that employing larger NGS effort (for example a full Illumina MiSeq lane for a single sample), while providing substantially more PALs, would not change the proportion of erroneous PALs encountered during locus validation.
In our case, of the 16 comparisons we performed between PAL and Sanger sequences generated for the same individual (Table 1), only two Sanger sequences could be aligned with the PAL sequence on which their priming regions had been designed without the introduction of large insertions (up to 120bp) and or deletions (up to 167bp).Indels relative to the PALFINDER sequences did not occur beside microsatellite sections, suggesting that sequencing errors are not related to problems sequencing across tandem repeat areas.Instead, they are likely related to poor concatenation of overlapping mate pairs, a hypothesis supported by the insertions typically being short (non-microsatellite) duplications of adjacent sequence regions.A simple correlation test revealed no relationship between priming site occurrence rate and indel size in our alignments (R 2 = 0.04, p = 0.47 [two-tailed probability distribution]), indicating that indel patterns are not related to issues of PAL sequencing depth.However, there were two loci where we observed both insertions and deletions in Sanger and corresponding PALFINDER sequences (Galt 6 and 7; Fig 4), which may indicate some further sequencing errors in one of our approaches.
The multiple primer sets that targeted the same genomic regions were not identified prior to Sanger sequencing because of indel differences in their PAL shotgun sequences.We recommend that future studies utilizing shotgun data should align PAL sequences from PALFINDER output allowing for the introduction of indels to help identify and eliminate the presence of such "duplicate loci".Although not a consistent pattern, it appears that priming regions of duplicated loci were generally represented in relatively high numbers among PALs (Table 1; Fig 3), and were more prevalent in tri-and tetramer PALs, repeat motifs that are often considered better targets for locus design (Castoe et al. 2012a).It may be that selecting loci with moderate levels of primer occurrence as recommended by PALFINDER may minimize this problem.The lowest occurrences of priming sites among PALs, however, returned low success rates of useable loci, suggesting that there is likely to be a tradeoff between the risks of selecting duplicate loci and of choosing loci that can be amplified at all.
Our investigation into the Seq-to-SSR approach did identify two SSR loci that conformed to neutral expectations and had significant power to detect genetic structure in G. alternans, and so appear to represent ideal targets for further population genetic study of this species in the Seychelles.However, these were the minority of PAL loci originally identified using PALFINDER applying the recommended stringent criteria for identifying variable SSR loci with 2-4 tandem repeats (sensu Castoe et al. 2012a).The two loci that did pass our own criteria for population genetics microsatellites (i.e., reliable and repeatable amplification, size variability, diploidy, conformation to neutrality) were all that remained from the initial 60 primer sets that were trialed in the wet lab.The majority of PAL primer sets failed to amplify anything (Fig 3), perhaps again due to errors in PAL sequence resulting in erroneous primer design.Of the eight loci that did make it through to the large scale screening of allelic variation, six either did not show patterns of variation consistent with diploid loci or were impossible to score consistently, rendering them unsuitable for most population genetics analyses that assume HWE.Excluding the many primer sets that we did not trial in the wet lab, our end result of 2/60 usable loci represents a disappointing success rate of only 3.3%, a poor return for the time and money spent in the laboratory.
Perhaps more encouraging are our species cross-amplification results.Our study is not the first to demonstrate cross-species amplification of microsatellites in caecilians (Barratt et al. 2012).We did, however, observe several interesting patterns related to cross-species amplification among caecilians from the Seychelles.For example, in some microsatellite flanking regions the raw pairwise similarity between

G. alternans and Hypogeophis spp. was smaller than between G. alternans and other
Grandisonia species (G.larvata and G. sechellensis; Table 2).Although most of the species that were cross-amplified successfully were also present in the original pooled NGS run, we filtered sequences from other taxa out before designing and verifying the loci, so we consider our cross-amplification success to potentially reflect closeness of phylogenetic relationships rather than being an artefact of incorrect filtering of the pooled sample prior to locus design.These results are more likely related to non-monophyletic genus-level classification, homoplasy related to the rapid rate of SSR evolution, or differences in the rate of evolution among taxa; this will need to be revisited once a robust hypothesis of phylogenetic relationships is available for caecilian species from the Seychelles (for a recent perspective on this problem, see Maddock et al. 2016).
Also encouraging was our success in extracting and amplifying DNA from tissue collected with non-lethal sampling methods.The efficacy of non-lethal sampling for obtaining DNA suitable for Sanger sequencing has been demonstrated previously in caecilians (Maddock et al. 2014), however this study is the first to demonstrate the applicability of non-lethal sampling protocols for generating population genetic (microsatellite) data.This is encouraging in terms of conservation biology because it means that the deaths of vast numbers of individuals of conservation concern will no longer be required to provide adequate sample sizes for population genetic studies.
Although we demonstrated an ability to use the Seq-to-SSR method to successfully identify loci that are useful for population genetics inference, the amount of time and financial investment in screening/size scoring was discouraging given the low success rate.Thus, although the identification of these potentially amplifiable loci may be rapid, the application of this method (via screening and size scoring) was, in our case, an inefficient approach.Based on these results, we support pursuing non-SSR loci methods (e.g.RADseq; Davey & Blaxter 2011), identifying SSRs from longer read sequence data (e.g.Drechsler et al. 2013;Wei et al. 2014), or restricting shotgun Seq-to-SSR searches to so called "Best PALs" (> 6 repeats for 4-6mers; Castoe et al. 2012a) when researchers need to perform population genetics assessments using NGS with limited resources.
The genomic shotgun reads of Grandisonia alternans used to identify microsatellites have been uploaded to the NCBI sequence read archive (SAMN04543719-20).
Directly sequenced PCR amplicons were submitted to GenBank under accession number KU739108-739133.Table 1.Seychelles caecilian, Grandisonia alternans.PALs and corresponding primer sets that were identified from short read NGS sequencing and then successfully resequenced with Sanger technology.Numbers of occurrences refers to number of times that priming regions and PALs were present among the unassembled MiSeq data.

Tables
Bolded text indicates loci that were duplicated (multiple primer sets actually amplified the same region) despite each corresponding PAL occurring only once in the PALFINDER output.Asterisk beside locus name indicates a PAL that did not have microsatellite regions when re-sequenced with Sanger technology.Silhouette -/ + / -/ -/ -/ -/ -/ -

Figure 2 :
Figure 2: Total number of PALs containing simple sequence repeat motifs that were identified from 983,636 unassembled Illumina paired-end reads sequenced from Grandisonia alternans (summarised by repeat motif).

Figure 3 .
Figure 3. Number of times that both forward and reverse primer sequences were present in PAL sequences for the 60 PAL loci trialed in this study.Loci were tested on five individuals, Sanger re-sequenced when possible, and then Sanger and PAL sequences were aligned to check for the presence of expected microsatellite repeats motifs and for duplication of amplified regions.

Figure 4 :
Figure 4: Comparison between sequences predicted from shotgun sequencing data using PALFINDER (PALs) and sequences generated from Sanger sequencing with primers designed with PALFINDER/PRIMER 3 for the eight focal loci.Seven of the PALs had indels when compared to Sanger sequencing data.

Figure 5 :
Figure 5: Simulation estimates of power (proportion of significances) (black) and Type 1 error (grey) for the two microsatellite loci developed here (Galt 4 & Galt 8).Closed circles = Chi-squared analysis, open circles = Fishers test; (A) The data generated in this study from 63 individuals drawn for five putative populations have strong power (>0.95) to detect significant differentiation at or above FST of ≈ 0.04, (B) Results of simulations with expected divergence of FST = 0.02 and overall allelic variation observed in this study indicate the effect of sample size on the power and magnitude of Type 1 error.Using these loci only and drawing sample sizes of 40 individuals from each of two populations would be sufficiently powerful to detect FST differentiation as low as 0.02.
Table2Microsatellite/anonymous loci for Grandisonia alternans, annealing temp was 60 o C for all loci.Various qualities of each locus are discussed further in text.Simple sequence repeat (SSR) motifs were confirmed via Sanger sequencing.Percent divergence was calculated using comparisons to G. alternans Sanger sequence data (excluding indels and repeat regions).