The East African (Ugandan) Nile tilapia samples were collected monthly, overnight, between July and December 2016 as part of the commercial catches of local fishermen, using the 127 mm gill nets recommended by the East African member states. Sampling sites were chosen on the advice of fisheries law enforcement officers and local fishermen, and were recorded using a Global Positioning System (GPS). On landing, muscle tissue samples were taken immediately, preserved in absolute ethanol and later stored at −20 °C, except during transportation. The species is not protected but is fished commercially, and sampling was performed by members of a government institution (National Agricultural Research Organization of Uganda); hence no other permission was required. Apart from one non-native population (Lake Victoria), the other three populations (Lakes Albert and George and River Nile) are native. In total, we collected 107 Ugandan individuals (Table 1). All animal rights were observed during the field excursions. Fish were obtained as part of commercial fishing operations by local fishermen and were killed in this process.
For SSR development, we used a single Ethiopian Nile tilapia sample (Lake Ziway), which was accessed through a selective breeding project at the Institute for Integrative Nature Conservation Research-BOKU, Vienna. Fresh fin clips were taken from this sample, so no fish was sacrificed. The Ethiopian sample was used for low-coverage total DNA shotgun sequencing for SSR marker discovery prior to the East African sampling. We also sampled Ethiopian Tilapia zillii for cross-species amplification tests.
A piece of ethanol-preserved muscle tissue (approximately 0.1 g) was digested overnight in 500 µl lysis buffer (2% SDS, 2% PVP 40, 250 mM NaCl, 200 mM Tris–HCl, 5 mM EDTA, pH 8) containing 200 ng of proteinase K. Genomic DNA was extracted using magnetic beads (MagSi-DNA beads, MagnaMedics) and a magnetic separator SL-MagSep96 (Steinbrenner, Germany) following a modified MagSi-DNA Vegetal kit protocol. DNA binding was carried out by mixing 17 µl of beads with 500 µl each of binding buffer and clear lysate in a 2 ml 96-well plate. The DNA-bound beads were then washed twice in 600 µl of cold 80% ethanol. DNA was then eluted twice by adding 50 µl (first elution) and 70 µl (second elution) of elution buffer (10 mM Tris–HCl, pH 8.3, at 65 °C), following the above-mentioned kit protocol. The quality of the extracted DNA was inspected on a 0.8% agarose gel.
High-quality genomic DNA extracted from the Ethiopian sample was sent for library preparation and sequencing on the Illumina MiSeq paired-end (PE) 300 platform (San Diego, USA) as described in Shendure and Ji (2008) and Castoe et al. (2012). Both library preparation and sequencing were done at the Genomics Service Unit, Ludwig-Maximilians-Universität München, Germany. Sequences generated by the Illumina MiSeq were quality checked using FASTQC (Andrews 2010) and trimmed to remove adapter sequences and low-quality regions (Phred < 20) using CUTADAPT vers. 0.11.1 (Martin 2011). Forward and reverse reads were merged using PEAR version 0.9.4 (Zhang et al. 2014), considering only minimum overlaps of 15 bp with a p-value below 0.01 for the highest observed expected alignment scores. The sequences were then screened for microsatellite motifs (di- to pentanucleotide repeats) using the SSR_pipeline program (Miller et al. 2013). Here, we considered sequences with at least 10 repeats for 2mers, eight for 3mers, and six for 4mers and 5mers. A total of 6,724 reads containing SSR motifs were found, comprising 4,629 2mers, 818 3mers, 868 4mers and 409 5mers. For subsequent primer design, only sequences at least 350 bp long, with flanking regions longer than 30 bp around the microsatellite, were considered. This length was chosen to facilitate the detection and elimination of primer dimers and other low molecular weight artefacts from the specific amplification products using the washing method described below. Because the multiplex PCR includes 60 bp oligonucleotides, such artefacts are difficult to suppress; however, they can be unequivocally detected using gel electrophoresis and discarded using magnetic beads. In addition, longer sequences increase the chance of recovering extra information from the flanking sites, which may increase the number of alleles detected.
Raw reads were submitted to the Sequence Read Archive (SRA) database under the reference number SRX3398501.
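The repeat thresholds and length filters used for SSR discovery can be illustrated with a minimal Python sketch. This is not the SSR_pipeline program itself; the function names `find_ssrs` and `primer_candidates` are hypothetical, and the regular-expression approach is an assumption chosen for brevity.

```python
import re

# Minimum repeat counts per motif length, as stated in the text:
# 10 for 2mers, 8 for 3mers, 6 for 4mers and 5mers.
MIN_REPEATS = {2: 10, 3: 8, 4: 6, 5: 6}
MIN_READ_LENGTH = 350  # sequences must be at least this long
MIN_FLANK = 30         # flanking regions must be longer than this


def is_compound(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. ACAC, AAA)."""
    k = len(motif)
    return any(k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k))


def find_ssrs(seq):
    """Return (start, end, motif, repeats) for every SSR meeting the thresholds."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # One k-mer followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_compound(motif):
                continue  # already reported under the shorter motif length
            hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return hits


def primer_candidates(seq):
    """Keep SSRs from sufficiently long reads with long enough flanks."""
    if len(seq) < MIN_READ_LENGTH:
        return []
    return [h for h in find_ssrs(seq)
            if h[0] > MIN_FLANK and len(seq) - h[1] > MIN_FLANK]


# Hypothetical read: a (AC)12 repeat between two 200 bp non-repetitive flanks.
read = "ACGTTGCA" * 25 + "AC" * 12 + "TGCAACGT" * 25
print(primer_candidates(read))
```

A read that passes `primer_candidates` would then be handed to primer design; reads with short flanks or too few repeats return an empty list.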
Primer design for amplicon sequencing
Specific PCR primers were designed in Geneious version 10.3 (Kearse et al. 2012) using the default Primer3 program (Untergasser et al. 2012). Primer3 settings were adjusted manually to an optimal primer melting temperature of 55 °C, a GC content of 20–50–80% (minimum, optimum, maximum), an oligo length of 18–20–23 bp (minimum, optimum, maximum), and an amplification product size between 350 and 450 bp. We designed the primers so that the complete SSR motif would fall within the first or last 300 bp of the amplicon and could therefore be covered completely by one of the MiSeq paired reads. This prevents the repetitive unit from being part of the overlap between the paired reads, which would complicate the merging step in the bioinformatics pipeline. Here, 48 primer pairs were designed. We decided to develop primers from our own shotgun sequences instead of published Nile tilapia genomic information to increase the likelihood that the markers would work in East African populations. However, the availability of a Nile tilapia genome (GenBank accession number MKQE01000000) allowed us to screen all the microsatellite-containing sequences used for primer design for potentially duplicated markers, using BLASTn (Basic Local Alignment Search Tool for nucleotides). BLASTn outputs provide a list of pairwise alignment matches and sequence hits above a statistical threshold (Xiong 2006). In the current study, BLASTn-aligned sequences were selected if their E-values were zero. The E-value is comparable to a probability value p: the lower the value, the less likely it is that the database matches arose by random chance and the more likely that they reflect significant similarity (Xiong 2006). Only primers that originated from sequences showing single matches on the genome were considered, because these were more likely to represent single-copy regions.
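The single-copy screen can be sketched as a filter over BLASTn tabular output. The sketch below assumes the standard 12-column tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); the function name and the example rows are hypothetical, and counting one row as one genomic match is a simplification, since a single region can be reported as several alignments.

```python
from collections import Counter


def single_copy_queries(blast_lines, max_evalue=0.0):
    """Return query IDs with exactly one hit whose E-value is <= max_evalue."""
    hits = Counter()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        qseqid, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            hits[qseqid] += 1
    return sorted(q for q, n in hits.items() if n == 1)


# Hypothetical BLASTn rows: locus01 matches once, locus02 matches twice,
# locus03 only has a weak hit that fails the E-value criterion.
rows = [
    "locus01\tscaf1\t100.0\t400\t0\t0\t1\t400\t1000\t1399\t0.0\t739",
    "locus02\tscaf2\t99.5\t410\t2\t0\t1\t410\t500\t909\t0.0\t745",
    "locus02\tscaf7\t99.0\t410\t4\t0\t1\t410\t90\t499\t0.0\t735",
    "locus03\tscaf3\t85.0\t60\t9\t0\t1\t60\t10\t69\t1e-10\t60.2",
]
print(single_copy_queries(rows))
```

Queries returned by this filter correspond to sequences with a single zero-E-value match, the criterion used in the text to retain a primer pair.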
Of the 48 primer sequences initially mapped on the genome, 13 were found more than once in the genome and were discarded, leaving 35 primer pairs. Although higher numbers of microsatellite markers can provide more robust population genetics results (Capote et al. 2012; Ryman et al. 2006), 8–20 microsatellite loci are generally considered sufficiently informative to determine genetic structure between populations (Arthofer et al. 2018; Vartia et al. 2014; Koskinen et al. 2004). In the current study, the initial 48 primer pairs yielded a number of markers sufficient to test genetic structure patterns in East African Nile tilapia. Nevertheless, more markers can easily be added with the procedure and resources presented here.
For Illumina sequencing, the forward and reverse primers were extended at the 5′ end by part of the Illumina adapters P5 (TCTTTCCCTACACGACGCTCTTCCGATCT) and P7 (CTGGAGTTCAGACGTGTGCTCTTCCGATCT), respectively (Fig. 1). These extensions correspond to the Illumina sequencing primers and served as a linker for the second PCR, in which the remaining parts of the adapters are added. This second PCR was conducted using primers containing all the components necessary for Illumina sequencing. In this second (index) PCR, we used for each sample a novel combination of two different 8 bp indexes, P5-(AATGATACGGCGACCACCGAGATCTACAC[Index]ACACTCTTTCCCTACACGACG) and P7-(CAAGCAGAAGACGGCATACGAGAT[Index]GTGACTGGAGTTCAGACGTGT). This was vital for pooling a large number of samples in the downstream analysis. After the second PCR, the resulting amplicons had the following parts from 5′ to 3′ (Fig. 1): (1) P5 motif for flow cell hybridization, (2) index 1 of 8 bp, (3) P5 sequencing primer, (4) specific forward primer, (5) target DNA for sequencing, (6) specific reverse primer, (7) P7 sequencing primer, (8) index 2 of 8 bp, and (9) P7 motif for flow cell hybridization.
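The two-step primer construction can be made concrete with a short Python sketch. The adapter sequences are those given in the text; the locus-specific primers and the 8 bp indexes below are hypothetical placeholders, as are the function names.

```python
# Adapter parts as given in the text.
P5_TAIL = "TCTTTCCCTACACGACGCTCTTCCGATCT"
P7_TAIL = "CTGGAGTTCAGACGTGTGCTCTTCCGATCT"
P5_INDEX = "AATGATACGGCGACCACCGAGATCTACAC{index}ACACTCTTTCCCTACACGACG"
P7_INDEX = "CAAGCAGAAGACGGCATACGAGAT{index}GTGACTGGAGTTCAGACGTGT"


def tailed_primers(forward, reverse):
    """First PCR: add the partial adapter to the 5' end of each specific primer."""
    return P5_TAIL + forward, P7_TAIL + reverse


def index_primers(index1, index2):
    """Second (index) PCR: insert one 8 bp index into each index primer."""
    for idx in (index1, index2):
        if len(idx) != 8 or set(idx) - set("ACGT"):
            raise ValueError("indexes must be 8 bp of A, C, G or T")
    return P5_INDEX.format(index=index1), P7_INDEX.format(index=index2)


# Hypothetical locus-specific primers and sample indexes, for illustration only.
fwd, rev = tailed_primers("ACGTACGTACGTACGTAC", "TTGGCCAATTGGCCAATT")
i5, i7 = index_primers("ACGTACGT", "TGCATGCA")

# The 3' end of each index primer matches the start of the corresponding tail,
# which is what lets the tail act as a linker in the second PCR.
print(i5.endswith(P5_TAIL[:17]), i7.endswith(P7_TAIL[:17]))
```

Because every sample receives its own pair of indexes, reads from pooled samples can later be assigned back to their sample of origin.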
SSR primer testing
To ascertain the applicability and usability of the developed primers, we first used them in singleplex reactions to test two scenarios: (1) transferability of the developed primers to East African Nile tilapia, and (2) cross-species amplification in T. zillii. The amplification success rate of the candidate loci was tested by assaying a Nile tilapia sample from Uganda. For cross-species amplification, only two genomic DNA samples of T. zillii were tested on the panel of 35 SSR loci. PCR reactions were conducted in a total volume of 10 µl. All primer pairs were tested using the QIAGEN Multiplex PCR Master Mix kit (Qiagen, CA, USA). The PCR reaction for Nile tilapia amplification was composed of 5 µl master mix, 4 µl primer mix and 1 µl genomic DNA. The primer mix was a combination of 1 µl reverse primer and 1 µl forward primer (100 µM each) plus 98 µl of water. The reactions were run with the following PCR profile: initialisation at 95 °C for 15 min, followed by 35 cycles of denaturation at 95 °C for 30 s, annealing at 55 °C for 60 s and elongation at 72 °C for 60 s, and a final extension at 72 °C for 10 min. The success of the PCR was checked by electrophoresis on a 1.8% agarose gel. Here, 33 primer pairs yielded specific PCR products and were subsequently used for the multiplex PCR approach on Ugandan Nile tilapia populations. Cross-species amplification success was scored from the singleplex PCR gel products of two replicates.
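The primer concentrations implied by the volumes in this section follow from the dilution relation C1·V1 = C2·V2. The short Python sketch below works through the arithmetic; the helper name `dilute` is hypothetical.

```python
def dilute(stock_conc, stock_vol, final_vol):
    """Concentration after diluting stock_vol of stock to final_vol (C1*V1 = C2*V2)."""
    return stock_conc * stock_vol / final_vol


# Primer mix: 1 µl forward + 1 µl reverse (100 µM each) + 98 µl water.
primer_mix_volume = 1 + 1 + 98                        # µl
primer_mix_conc = dilute(100, 1, primer_mix_volume)   # µM of each primer

# Singleplex reaction: 4 µl of primer mix in a 10 µl reaction.
singleplex_conc = dilute(primer_mix_conc, 4, 10)      # µM of each primer

print(primer_mix_conc, singleplex_conc)
```

Under these volumes each primer is at 1 µM in the primer mix and 0.4 µM in the singleplex reaction.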
PCR multiplex and Illumina sequencing
All 33 gel-screened primer pairs were combined in a single multiplex reaction. PCR reactions were carried out in a total volume of 10 µl containing 5 µl master mix, 2.5 µl water, 0.5 µl primer mix (1 µM each) and 2 µl genomic DNA. Thermal cycler profiles were the same as for the singleplex PCR. The resulting PCR products were purified with magnetic beads, following a slightly modified Agencourt AMPure XP PCR Purification protocol. Here, we mixed 4 µl of PCR product with 2.86 µl of AMPure XP beads (Beckman Coulter, Inc, Brea, CA) and incubated the mixture for five minutes at room temperature. The DNA-bound beads were captured with an inverted magnetic bead extraction device, VP 407-AM-N (V&P SCIENTIFIC, INC), and washed twice in 200 µl of 80% ethanol for 45 s. The beads were then dried at room temperature for five minutes and eluted in 17 µl of elution buffer (10 mM Tris–HCl, pH 8.3, at 65 °C).
The second (index) PCR was performed in a total reaction volume of 10 µl, containing 5 µl master mix, 2 µl each of the P5 and P7 index primers (1 µM), and 1 µl of purified PCR product as template. The PCR was run with the following thermocycler conditions: 95 °C for 15 min, followed by 10 cycles of denaturation at 95 °C for 30 s, annealing at 58 °C for 60 s and elongation at 72 °C for 60 s, and a final extension at 72 °C for five minutes. Finally, all the PCR products were pooled and sent for PE 300 bp sequencing on an Illumina MiSeq at the Genomics Service Unit, Ludwig-Maximilians-Universität München, Germany. The samples used in this work occupied 11% of the MiSeq run.
Sequence analysis and SSR-GBS genotyping
Reads from Illumina were quality controlled and merged as described in “SSR discovery”. Overlap was only possible because the SSR motifs were covered completely by one of the paired reads. The resulting sequences should start with the forward primer and end with the reverse primer, and we used this criterion to de-multiplex the sequences according to primer content, creating one fastq file per sample and locus using script 1 (Supplementary material Table S4). This script looks for mismatches with the amplification primers at the beginning (forward) and the end (reverse) of the sequences. Only reads with fewer than two mismatched bases to each primer were considered. From this step, allele calling was performed in two stages: first using allele length (AL), which resembles traditional SSR genotyping, and then by considering possible SNP variation within alleles of the same length, thereby recovering the whole amplicon sequence information (WAI). After each stage, a codominant matrix was produced, allowing a comparison between the data that would be produced by traditional SSR genotyping and the data from SSR-GBS. Allele calling based on AL incorporates length variation at the repeat motif plus possible indels in the flanking regions. With WAI, all types of length variation and SNPs are used as information, because two amplicon sequences were considered the same allele only if they were identical.
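The primer-based de-multiplexing criterion can be sketched in Python as follows. This is not script 1 from the pipeline; the function names and the example primers are hypothetical, and the sketch assumes that mismatches are counted position by position without allowing indels, with fewer than two mismatches tolerated per primer.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def mismatches(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_locus(read, primers, max_mismatch=1):
    """Return the locus whose primers flank the merged read, or None.

    primers maps locus name -> (forward, reverse), both written 5'->3'.
    """
    for locus, (fwd, rev) in primers.items():
        rev_rc = revcomp(rev)  # the reverse primer appears reverse-complemented
        if len(read) < len(fwd) + len(rev_rc):
            continue
        if (mismatches(read[:len(fwd)], fwd) <= max_mismatch
                and mismatches(read[-len(rev_rc):], rev_rc) <= max_mismatch):
            return locus
    return None


# Hypothetical primer pair and merged read, for illustration only.
primers = {"locA": ("ACGTACGTAC", "GGATCCGGAT")}
read = "ACGTACGTAC" + "TTTTACACACACTTTT" + "ATCCGGATCC"
print(assign_locus(read, primers))
```

Reads assigned to a locus would then be written to one file per sample and locus, as described in the text.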
To call alleles using AL, we calculated the length distribution per sample and per locus using script 2, producing files containing the number of times each length occurs per marker and sample (Supplementary material Table S4). We then used script 3 to call the alleles automatically and to plot histograms based on sequence length, which resemble the chromatograms obtained in traditional SSR genotyping (Supplementary material Table S4). Only genotypes with a minimum depth of 10 reads were considered. Automatic allele calling considered a genotype homozygous if one length had a frequency equal to or above 90% of the total number of reads. A genotype was assumed to be heterozygous if the combined frequency of two lengths was above 90% of the reads and the frequencies of the two lengths did not differ by more than 20%. If these criteria were not met, the genotype was marked for manual control. Nevertheless, all genotypes were checked manually with the aid of the histograms produced.
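The automatic AL calling rules can be summarised in a minimal Python sketch. It is not script 3 itself; the function name is hypothetical, and the sketch assumes that "did not differ by more than 20%" refers to the absolute difference between the two frequencies, each expressed as a fraction of the total reads.

```python
def call_genotype(length_counts, min_depth=10, threshold=0.90, max_diff=0.20):
    """Call a genotype from a {sequence_length: read_count} dictionary.

    Returns None below the minimum depth, a tuple of two lengths for a
    called genotype, or "manual" when the automatic criteria are not met.
    """
    depth = sum(length_counts.values())
    if depth < min_depth:
        return None
    ranked = sorted(length_counts.items(), key=lambda kv: kv[1], reverse=True)
    len1, n1 = ranked[0]
    f1 = n1 / depth
    if f1 >= threshold:
        return (len1, len1)  # homozygous
    if len(ranked) > 1:
        len2, n2 = ranked[1]
        f2 = n2 / depth
        # Assumption: the 20% limit is an absolute difference in frequency.
        if f1 + f2 > threshold and f1 - f2 <= max_diff:
            return tuple(sorted((len1, len2)))  # heterozygous
    return "manual"


# Hypothetical length distributions for three samples at one locus.
print(call_genotype({380: 95, 382: 5}))
print(call_genotype({380: 50, 384: 45, 382: 5}))
print(call_genotype({380: 70, 384: 25, 382: 5}))
```

Genotypes flagged as "manual" correspond to the cases that the text says were inspected with the help of the length histograms.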
For WAI allele calling, the sequences with the same length as the alleles defined in the AL step were extracted and used to produce a 70% consensus sequence per length class. The sequences were extracted per length allele using script 4, and the consensus sequence was built using script 5. In the 70% consensus, positions at which the most common nucleotide had a frequency below 70% were coded with the ambiguous base “N” and considered potential heterozygous SNPs. These sequences were divided into two files based on the two most frequent nucleotides at that position using script 6. If more than one potential SNP was found, the positions were considered linked and the two most frequent nucleotide combinations were recovered. We observed that chimeric sequences could occur between alleles that differed by more than one SNP, producing sequence states intermediate between the alleles. In each case these intermediate states were less frequent and could be resolved unambiguously by eye, either by comparing them to other alleles in the sampling or by including the sequence length as additional information. However, these occurrences were rare. More than two similarly frequent nucleotide combinations were found in only a few cases, which were excluded because of low overall read counts. Similarly, in samples that were heterozygous with AL, the two most frequent combinations of SNP and length signal were called as alleles. WAI allele calling was finally done using script 7, in which a number was attributed to each unique sequence (allele) and this information was saved in a codominant matrix. All scripts are available on GitHub (https://github.com/mcurto/SSR-GBS-pipeline), and a detailed description of script 1 to script 7 is given in Supplementary material Table S4. The sequence analyses resulted in 26 loci with genotypes for most of the samples, which were used for further statistical assessment.
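The 70% consensus step can be illustrated with a brief Python sketch. It is not script 5 from the pipeline; the function name and the example reads are hypothetical, and the sketch assumes that all reads in a length class can be compared column by column without alignment because they have the same length.

```python
from collections import Counter


def consensus(seqs, min_freq=0.70):
    """Column-wise consensus of equal-length sequences.

    A position is coded "N" when its most common nucleotide has a
    frequency below min_freq, marking a potential heterozygous SNP.
    """
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("need at least one sequence, all of the same length")
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(seqs) >= min_freq else "N")
    return "".join(out)


# Hypothetical reads of one length class: the third position is split 6:4.
reads = ["ACGT"] * 6 + ["ACTT"] * 4
print(consensus(reads))
```

Positions returned as "N" are the ones at which reads would subsequently be separated by their two most frequent nucleotides.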
Raw reads can be found in the SRA database with the references SRX3398667 to SRX3398776.
Descriptive population genetic statistics for the SSR loci were computed using several programs. Micro-Checker version 2.2.3 (Van Oosterhout et al. 2004) was used to test for the presence of null alleles, allele drop-out and stuttering during PCR amplification. Tests for deviations from Hardy–Weinberg Equilibrium (HWE) and calculations of the fixation index (Fis) were performed in GenePop version 4.6.9 (Rousset 2008). Markov chain parameters for all tests in GenePop were set to 10,000 dememorizations, 100 batches and 5,000 iterations per batch, and Fis values were recorded (Weir and Cockerham 1984). Fis was determined specifically to assess the direction of deviations from HWE in the populations, that is, heterozygote excess or deficiency (Dorak 2014). Here, positive and negative Fis values indicate excess homozygosity and excess heterozygosity (outbreeding), respectively (Dorak 2014). Observed heterozygosity (Ho), expected heterozygosity (He) and the polymorphic information content (PIC) of the loci were determined using Cervus version 3.0.7 (Kalinowski et al. 2007). Allelic richness and the number of alleles per locus were calculated with Fstat version 2.9.3.2 (Goudet 2001). To further assess the informativeness, and hence usability, of the developed markers, we analysed genetic structure and performed a principal coordinate analysis (PCoA) on the four Nile tilapia populations using STRUCTURE version 2.3.4 (Hubisz et al. 2009) and GenAlex version 6.5 (Peakall and Smouse 2006), respectively. STRUCTURE classifies populations by genetically allocating them into groups whose individuals share similar patterns of variation (Porras-Hurtado et al. 2013). The program is also useful because it can identify subpopulations within the whole population by maximizing conformity to HWE and linkage equilibrium within potential subpopulations (Porras-Hurtado et al. 2013).
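The per-locus statistics named above can be illustrated with a simplified Python sketch. It uses the textbook definitions (He as one minus the sum of squared allele frequencies, PIC as He minus the sum over allele pairs of 2·p²·q², and Fis as 1 − Ho/He) and is only an approximation of what Cervus, Fstat and GenePop report, since those programs apply their own sample-size corrections and the Weir and Cockerham estimator. The function name and the genotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def locus_stats(genotypes):
    """Summary statistics for one locus from a list of (allele_a, allele_b) tuples."""
    n = len(genotypes)
    counts = Counter(allele for g in genotypes for allele in g)
    freqs = [c / (2 * n) for c in counts.values()]
    ho = sum(a != b for a, b in genotypes) / n           # observed heterozygosity
    he = 1 - sum(p * p for p in freqs)                   # expected heterozygosity
    pic = he - sum(2 * p * p * q * q for p, q in combinations(freqs, 2))
    fis = 1 - ho / he if he > 0 else float("nan")        # simple, uncorrected Fis
    return {"Na": len(counts), "Ho": ho, "He": he, "PIC": pic, "Fis": fis}


# Hypothetical genotypes at one locus, coded by allele number.
balanced = [(1, 1), (1, 2), (2, 2), (1, 2)]
deficit = [(1, 1), (1, 1), (2, 2), (2, 2)]
print(locus_stats(balanced))
print(locus_stats(deficit))
```

In the second example no heterozygotes are observed, so Fis is positive, which matches the interpretation of positive Fis as excess homozygosity given in the text.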
STRUCTURE was run with a burn-in period of 100,000 followed by 500,000 Markov Chain Monte Carlo (MCMC) replications, with 10 iterations for each number of clusters (K). The STRUCTURE default settings for the admixture model and correlated allele frequencies were used. To infer the K that best suits the data, we ran STRUCTURE HARVESTER. This program collates STRUCTURE results and evaluates multiple K values to identify the best K from tens or hundreds of iterations (Earl and vonHoldt 2012), as shown in the supplementary material Table S1, Fig. S1 and Fig. S2. Similarly, to present the STRUCTURE outputs informatively, we ran the CLUMPAK (Cluster Markov Packager Across K) pipeline across the K values to summarize and graphically represent the results obtained from STRUCTURE (Kopelman et al. 2015). From these analyses, we present and compare the results for PIC, number of alleles (Na), allelic richness (Ar), HWE per population, fixation index (Fis), PCoA and STRUCTURE based on the two allele calling methods, AL and WAI.