Characterizing the genetic structure of introduced Nile tilapia (Oreochromis niloticus) strains in Tanzania using double digest RAD sequencing

Tilapia hatcheries in Tanzania rely heavily on importing germplasm. Nevertheless, the genetic structure of the imported stocks is poorly understood. In the current study, the level of genetic diversity and differentiation of eight populations of Nile tilapia (Oreochromis niloticus) strains imported in Tanzania was investigated. Four of the studied strains originated from Thailand, three from Uganda, and one from the Netherlands. Double-digest restriction site–associated DNA sequencing (ddRAD-seq) was applied to identify and genotype single nucleotide polymorphisms (SNPs). In total, 2214 SNPs passed all the quality control steps and were utilized for downstream analysis. Mean heterozygosity estimates were higher for the Thailand strains (Ho, 0.23) compared with the strains from Uganda (Ho, 0.12). Low genetic distance was observed amongst populations from the same geographic origin (Fst, 0.01–0.04). However, genetic distance between populations from different geographic origins was substantial (Fst, 0.24–0.44). Bayesian model–based clustering (STRUCTURE) and discriminant analysis of principal components (DAPC) grouped the studied animals into three distinct clusters. A cross-validation approach (where 25% of animals from each population were considered of unknown origin) was conducted in order to test the efficiency of the SNP dataset for identifying the population of origin. The cross-validation procedure was repeated 10 times resulting in approximately 97% of the tested animals being allocated to the correct geographic population of origin. The breeding history and hatchery practices used to manage these stocks prior and after import appear to be the main factors for the genetic diversity observed in this study. Our study will help inform hatchery stock management and future breeding program designs in Tanzania.


Background
Nile tilapia (Oreochromis niloticus) is the focal farmed fish in Tanzania accounting for over 95% of the country's aquaculture production volume (Kaliba et al. 2006). Farming is mainly conducted in small ponds. In addition, cage farming is also practiced in the country's main lakes: Victoria, Tanganyika, and Nyasa (URT 2016(URT , 2017. Nevertheless, the contribution of farmed tilapia to the fisheries sector and its total economic value is still limited. In 2016, aquaculture (mainly farmed tilapia) accounted for only 2.9% (11,000 tonnes; URT 2017) of the total fish production. Due to the need for increased production (400,000 tonnes per year to meet the FAO recommendation of an average 17 Kg/capita/year; URT 2014), private hatchery operators have been importing tilapia strains from Asia, Europe, and within Africa (mainly Uganda).
The development of genetically improved farmed tilapia (GIFT) from a base population of 7 Nile tilapia strains by ICLARM (currently WorldFish) was a major achievement in the history of tilapia aquaculture (FAO 2011;ADB 2004;Azhar et al. 2004;ADB 2005). Undoubtedly, the GIFT strain has been instrumental in enhancing the worldwide production of Nile tilapia (ADB 2005;Li and Cai 2008;Ansah et al. 2014). Similar approaches have been practiced by several tilapia aquaculture research and development programs, like GET-EXCEL (Tayamen 2004), GenoMar Supreme Tilapia (GST) (Zimmermann and Natividad 2004), Hainan Progift (Thodesen et al. 2013), and FAST (Bolivar 1998).
In Tanzania, no national breeding program exists for tilapia, but some of the aforementioned improved strains have been imported with a view of increasing production output. In particular, the most common imported strains include (i) the Chitralada imported in 2016 by EDEN Agri-Aqua and JAN Aqua Center hatcheries from the Asian Institute of Technology (AIT; Khlong Luang, Pathum Thani, Thailand), (ii) the BIG NIN and GIFT from Thailand, imported in 2017 by Mpanju Fish Farm from the Nam Sai Farms Co. Ltd (Bangrabow, A.Bansang, Prachinburi, Thailand), and (iii) the silver strain/YY super males strain imported in 2017 by Rift Valley AquaCulture Farm from Til-Aqua International (Velden, The Netherlands).
Interestingly, the aforementioned genetically improved strains were originally collected from Africa (ADB 2004(ADB , 2005. The Chitralada strain originated from a base population of 50 animals from an Egyptian stock that was later introduced in Thailand ( Damrongratana and Kessanchai 1966;Pullin and Capili 1988). In comparison, the GIFT strain originated from a diverse tilapia germplasm of more than 2000 individuals from Egypt, Ghana, Kenya, andSenegal (ADB 2004, 2005). Additional tilapia strains derived from the GIFT breeding nucleous include the BIG NIN (http://tilapiathai.com/project/nile-tilapia/). Furthermore, Til-Aqua International developed a silver tilapia strain from an all-male GIFT silver strain and a female line from various strains like GIFT and Chitralada from Egypt and the Philippines.
On the other hand, imported tilapia germplasm from Uganda is not derived from selective breeding. Generally, most of the hatcheries in Uganda (~70%) obtained their broodstock from different lakes, i.e., Victoria, Kyoga, and Albert (Mwanja et al. 2015). Common hatchery practices in the absence of selective breeding may result in the reduction of genetic diversity and inbreeding accumulation. For example, consanquineous matings as a consequence of limited numbers of broodstock and mass selection for growth (Aho et al. 2006;Qin et al. 2007;Smallbone et al. 2016) are still relatively common. The adverse effects of inbreeding accumulation can be minimized with selective breeding programs that utilize pedigree records. However, such programs are relatively expensive, and some GIFT and GIFT-derived strains have already suffered a significant reduction of genetic diversity when compared with the original stocks (McKinna et al. 2010). Therefore, the accurate evaluation of genetic purity and diversity of stocks over time should be a priority.
The objective of this study was to study the level of genetic diversity and differentiation of eight Nile tilapia strains introduced into Tanzania by various sources using ddRAD-seq. The findings of this study will improve knowledge of the genetic structure of existing farmed tilapia strains, and will inform future tilapia breeding program initiatives in Tanzania.

Material and methods
Sample collection, processing, and DNA extraction A total of 160 Nile tilapia fin clip samples derived from eight strains (n = 20 per strain) were collected from seven different hatcheries located in different regions in Tanzania: Silver-YY, BIG NIN, Chitralada-N, Chitralada-E, GIFT, Ruvu Farm-R, Chifive-C and Muleba-M (Table 1; Fig. 1). Regarding the GIFT strain, no information was available, whether it directly originated from the World Fish Center. Fish were reared in labelled plastic tanks of 1.5 m × 1.5 m × 1.5 m at the Institute of Marine Sciences Mariculture Centre (IMS-MC) at Bweni village, Pangani District in Tanga, Tanzania. Fin clips approximately 1.5-cm long were taken and preserved in 95% ethanol and stored at − 20°C. Total genomic DNA was extracted with the QIAsymphony kit using an automated DNA extraction robot (QIAsymphony, QIAGEN, Germany) according to the manufacturer's instructions. Following the DNA extraction, quantification of DNA samples was done using Qubit fluorometer (Thermo Fisher Scientific, USA ) and diluted by TE buffer to 20 ng/μL followed by gel electrophoresis (1% agarose gel) to assess DNA quality.

ddRAD library preparation and sequencing
The ddRAD library preparation protocol was based on the methodology originally reported by Peterson et al. (2012), with modifications described in Palaiokostas et al. (2015). Briefly, for each library, 3-μL DNA samples (20 ng/μL) were simultaneously digested with two highfidelity restriction enzymes: SbfI (CCTGCA|GG recognition site) and SphI (GCATG|C recognition site), both sourced from New England Biolabs, (NEB) UK. Digestions were incubated for 40 min at 37°C in 1× CutSmart Buffer (NEB). Following cooling to c. 22°C , 3 μL of a premade barcode/adapter mix was added to each digested DNA sample. The adapter mix comprised individual-specific barcoded combinations of P1 (SbfI compatible) and P2 (SphI compatible) adapters at 6 nM and 72 nM concentrations, respectively, in 1× reaction buffer 2 (NEB). Ligation was performed over 2 h at 22°C by the addition of 3 μL of a ligation mix comprising 4 mM rATP (Promega, UK), and 2000 cohesive end units of T4 ligase (NEB) in 1× CutSmart Buffer. The ligated samples were then heat denatured at 65°C for 20 min and cooled to room temperature. Thereafter, samples were combined into one pool. The pooled library was column purified (MinElute PCR Purification Kit, Qiagen, UK), and eluted in 80 μL EB buffer (Qiagen, UK). Size selection of fragments, ranging from approximately 400 bp to 700 bp, was performed by agarose gel separation. Following gel purification (MinElute Gel Extraction Kit, Qiagen, UK), the eluted size-selected template DNA (60 μL in EB buffer) was PCR amplified (13-14 cycles PCR dependent on library: 24 separate 12.5 μL reactions, each with 1 μL template DNA) using a high-fidelity Taq polymerase (Q5 Hot Start High-Fidelity DNA Polymerase, NEB). The PCR reactions were column purified (MinElute PCR Purification Kit) and eluted in 50 μL of EB buffer. A final cleanup step was performed using an equal volume of AMPure magnetic beads (Perkin-Elmer, UK). The final library was eluted in 20 μL EB buffer and QUBIT quantified. The library was sequenced at Edinburgh Genomics Facility, University of Edinburgh, on an Illumina HiSeq 4000 instrument (Illumina, Inc., San Diego, CA) using 150 base-paired end reads (v4 chemistry).

Genotyping ddRAD alleles
Sequenced reads were aligned to the Nile tilapia reference genome assembly version GCA_001858045.2 (Conte et al. 2017) using bowtie2 (Langmead and Salzberg 2012). Uniquely aligned reads were retained for downstream analysis. The aligned reads were sorted into RAD loci, and SNPs were identified from both P1 and P2 reads using the Stacks software v2.4 (Catchen et al. 2011;Rochette et al. 2019). SNPs were detected using gstacks (-var-alpha 0.001 -gt-alpha 0.001 -min-mapq 40). SNPs with minor allele frequency (MAF) below 0.05, heterozygocity above 0.7, and those for which > 25 % of individuals had missing genotypes were discarded. Moreover, SNPs showing substantial coverage differences (3 or more standard deviations) from the mean obtained coverage were removed. Finally, only animals with less than 25% missing genotypes were retained for downstream analysis. The aligned reads in the format of bam files were deposited in the National Centre for Biotechnology Information (NCBI) repository under project ID PRJNA518067 (File_S1).  (Pritchard et al. 2000;Falush et al. 2003;Hubisz et al. 2009) using K values from 2 to 5. Markov chain Monte Carlo (MCMC) of 100,000 iterations with a burn-in period of 10,000 was carried out for each K value. For each tested K value, 3 independent MCMC samplings were performed. The obtained posterior probability values (Pritchard et al. 2000) were used to determine the optimal number of clusters. Structure results were interpreted using Structure Harvester (Earl 2012) and CLUMPAK (Kopelman et al. 2015) for identifying the most probable number of clusters. Additionally, the pattern of population structure was also investigated using discriminant analysis of principal components (DAPC) (Jombart et al. 2010). The Bayesian information criterion (BIC) was used for selecting the optimal number of clusters (K) based on the elbow method (Jombart et al. 2010).

Assignment of individuals to population using the SNP dataset
A 4-fold cross validation scheme was performed using the R package adegenet v2.1.1 (Jombart 2008) in order to test the utility of the SNP dataset to correctly identify the geographic population of origin. The origin of 25% animals from each geographic population was masked and formed a test dataset. Predictions regarding population of origin on the aforementioned test set were performed using information obtained through DAPC (predict.dapc) on the remaining training data set. The entire procedure was repeated 10 times in order to minimize potential bias due to the stochasticity of sample allocation in the training /test datasets.

SNP identification
The alignment rate of the sequenced reads to the tilapia reference genome (Genbank accession GCA_001858045.2; Conte et al. 2017) ranged between 94 and 97% across the tested animals. In total, 37,683 putative ddRAD loci were identified; out of which, 3791 contained one or more SNPs. The mean sequence coverage of the identified loci was approximately 170X (SD, 70X). In total, 2214 SNPs passed all quality control steps and were retained for downstream analysis. Moreover, 5 animals were removed due to high rate of missing genotypic data (4 animals from the Chifive-C and 1 animal from the Ruvu Farm-R) resulting in a dataset of 155 animals for the subsequent analysis.

Genetic diversity: distance
The Chitralada-N strain showed the highest level of expected heterozygocity (He , 0.27 ± 0.003) followed by the other Thai-derived strains Chitralada-E, BIG NIN, and GIFT (He, 0.26 ± 0.003). In contrast, the lowest expected heterozygocity was observed in the Ugandan strains (Ruvu Farm-R, CHIFIVE-C, and Muleba-M) ( Table 2). Regarding the Fis metric, the lowest value was detected in the silver-YY strain and the highest in the Chitralada (  Muleba-M (Fst, 0.01), while the highest was found between the silver-YY and the Chifive-C (Fst, 0.44; Table 3).

Population structure
Principal component analysis (PCA) was used to visualize individual relationships within and between populations. The first and second principal components accounted for 24% and 10% of the variation respectively. Overall, PCA indicated the existence of 3 genetic groups (Fig. 2). STRUCTURE analysis provided evidence regarding the genetic structure amongst the tested Nile tilapia populations indicating that K = 4 was the most probable number of clusters (File_S2). Indications for admixture between the strains from Thailand and the Ugandan strains were suggested particularly for the Chitralada-N and E strains (Fig. 3). Furthermore, DAPC supported a close relationship of strains based on their geographical origin. The difference of BIC values between the most probable K values (3 to 5) was less than 1%, with lowest value for K = 4 (File_S3). Regarding the silver-YY strain from Netherlands, both STRUCTURE and DAPC suggested the existence of an isolated cluster, with the exception of one animal from the Ugandan strains that appeared indistinguishable from the silver-YY strain (Fig. 4).

Population prediction using the SNP dataset
Predictions regarding the geographic origin were performed using DAPC on the test set. Using a 4-fold cross validation scheme, a successful population assignment rate of approximately 97% was obtained (Fig. 5). The mean correct assignment rate for the Netheland strain was 98%. Moreover, the correct assignment rates for the Thailand and Ugandan populations were 97% and 98% respectively.

Discussion
Our study revealed higher genetic variation in the case of animals that originated from selective breeding programs. According to the obtained estimate of He, both the silver-YY and the Thai strains (Chitralada-N, Chitralada-E, BIG NIN, and GIFT) had higher genetic variation in comparison with the Ugandan strains (Ruvu Farm-R, CHIFIVE-C, and Muleba-M). This observation is in agreement with findings of previous studies which compared improved strains with local farmed strains by using microsatellite markers (Romana-Eguia et al. 2004;Baggio et al. 2016). Hatchery practices and genetic management of fish stocks constitute key factors affecting the genetic diversity and long-term sustainability of the breeding populations. Both the silver-YY and the Thai strains of our study are based on selective breeding programs (ADB 2004(ADB , 2005Mair 1997) with appropriate hatchery management (Bhujel 2013). However, to the best of our knowledge, the utilized broodstock of the Ugandan strains lack the availability of pedigree records that could be of value for avoiding crossings amongst close relatives (Mwanja et al. 2015). Furthermore, both the Thai and the silver-YY strains utilized in our study have been imported recently to Tanzania (2016-2017) Fig. 3 The admixture analysis assigned individuals in clusters (K = 3-5) using eight strains. Each single vertical bar represents an individual, and each color represents a probability that the individual is assigned to each genepool indicating that crossing of close relatives has been most likely avoided. However, a controversy appears in our study based on the Fis estimates where the strains from Uganda appear to have a lower heterozygosity deficit compared with the Thai strains. Fis is negative in case of an   Confusion matrix for prediction efficiency of the SNP dataset using cross validation. 4-fold cross validation was performed where 5 randomly chosen animals on each population were considered of unknown origin. The entire procedure was repeated 10 times in order to minimize potential bias due to sample allocation in the training/test datasets. The diagonal contains the number of correct population assignments for the overall sum of the cross-validation scheme. Off-diagonals contain the number of erroneously population allocations for each particular case. The confusion matrix is colored according to the number assignments on each populations as indicated by the color bar scale on the right excess of heterozygocity and positive in the case of heterozygosity deficiency (Allendorf and Luikart 2007). Nevertheless, by construction, Fis values depend on the absolute difference between observed and expected heterozygocity. As such, the Fis values should be interpreted in connection with the relevant heterozygocity values, which in the case of the strains from Uganda are less than half compared with the Thai and the silver-YY strains.
Interestingly, the overall level of heterozygosity observed in our analysis is lower compared with prior tilapia studies (Fuerst et al. 2000;Bhassu et al. 2004;Hassanien and Gilbey 2005;Briñez et al. 2011;Baggio et al. 2016;Silva 2015). However, since the aforementioned studies utilized microsatellites, a direct comparison of heterozygosity magnitude may not be appropriate. Compared with other ddRAD studies in fish species, the Thai and the silver-YY strains appear to have heterozygocity values within the reported range (0.18-0.28;Saenz-Agudelo et al. 2015;Leitwein et al. 2016). Nevertheless, in the case of the Ugandan strains, the estimated genetic variation is particularly low. It has to be stressed that since most of the studied strains are descendants of small founder populations, genetic drift could be a main causative factor behind the low genetic diversity (Gilpin and Soulé 1986;Soulé 1987;Was and Wenne 2002;Sekino et al. 2002;Romana-Eguia et al. 2005;Aho et al. 2006;McKinna et al. 2010). Furthermore, particularly in the absence of pedigree records, severe reductions in genetic variation can occur due to crossings between close relatives (Westemeier et al. 1998;Madsen et al. 1999;Romana-Eguia et al. 2005;Aho et al. 2006;Appleyard and Ward 2006;Mbiru et al. 2015;Mapenzi and Mmochi 2016;Qin et al. 2007;Smallbone et al. 2016). In addition, the family structure of the sampled animals was unknown at the time of sampling, and this may affect estimates of genetic variation in a particular population.
Population structure analysis grouped all strains in 3 main clusters which corresponded to their countries of origin. Both STRUCTURE and DAPC can detect the underlying genetic structure of the tested animals, being able to suggest a most probable number of genetic clusters. In the case of STRUCTURE, the most probable number of clusters (K) is assigned by comparing the mean likelihoods of Bayesian-based models of a priori determined K. Regarding DAPC, the most probable K is derived through comparisons of BIC values. In our study, both approaches suggested that the most probable number of clusters is between 3 and 5, with the model selection criteria in both analysis suggesting a K = 4 as the most probable. Nevertheless, we need to point out that only a minute difference of the utilized model selection criteria was observed between K values ranging from 3 to 5, as opposed to when models of K = 2 were tested against K = 3 (File_S2, File_S3).
Regardless of the exact value of most probable K, our analysis suggested the existence of admixture between the Thai origin Chitralada strains and the Ugandan-derived strains. Furthermore, no gene flow was suggested between the GIFT or the BIG NIN strain and the Ugandan strains. As such, our results indicate that the detected similarity of the Chitralada and the Ugandan strains is not due to the common founder stock (African origin), but most probably due to recent admixture. Taking into account that the Thai strains were imported in Tanzania after 2016 and that stock exchange between fish farms in Tanzania is particularly common, we suggest that admixture probably occurred in recent generations. Interestingly, our data clustered one animal from the Ugandan strains (Muleba-M) with the silver-YY strain from Netherlands. However, since our analysis suggested that the silver-YY strain forms a uniquely isolated cluster with no apparent gene flow amongst any of the other tested strains, sample mislabeling could be the underlying reason for clustering an animal of putative Ugandan origin with the silver-YY strain.
Overall, our SNP dataset proved highly efficient in identifying the geographic origin of more than 97% of "putative" unknown samples. The ability to predict the population of origin is most valuable both for aquaculture and for conservation purposes of wild populations. Hybridization and introgression are fairly common in tilapias (Shirak et al. 2009;Wu and Yang 2012). Taking into account that Tanzania is a hot spot for wild chiclid populations, it is easily apparent that introgression with farmed strains could jeopardize the local adaptivity of the wild populations. As such, it is easily apparent that our approach could be also of value in conservation management of wild tilapia populations in Tanzania.
As part of an initiative to facilitate a national tilapia breeding program, the genomic resources of this study offer the required resolution for detecting the genetic diversity and differentiation of introduced tilapia strains in Tanzania. Importing tilapia strains is a valid option for the promotion of tilapia farming in Tanzania. Although GIFT and GIFT-derived strains showed significant impact to food security, rural income, and employment in both Asia and Africa (i.e., Egypt and Ghana) (ADB 2005;Hussain et al. 2013;Ansah et al. 2014), the argument to import the GIFT in Tanzania is still being debated in view of how best to conserve biodiversity and aquaculture germplasm for future use. Lind et al. (2012) suggested a zoned aquaculture system based on large water body catchments, while Mbiru et al. (2015) and Mapenzi and Mmochi (2016) suggest the usage of all-male hybrid populations. However, since improvements using hybridization are done through trial and error (Chattopadhyay 2017), the outcomes are usually a reduced fingerling production due to potential reproductive incompatibilities between crossed species (Popma and Lovshin 1995).
In view of the aforementioned challenges, decisions related to importing strains like GIFT as a main seed source in tilapia farming should be informed by data originating from population genetics, growth performance, economic analysis, and potential biodiversity threats for the local wild populations. A thorough assessment of the "exotic" strain growth performance in comparison with the native ones will reveal the potential of each strain. Research on developing a new strain from the native germplasm for future use could be of outmost importance. Modern genomic tools like ddRAD-seq could play a pivotal role in assisting the formation of a Nile tilapia breeding nucleous of high genetic diversity, upon which future stock improvement will be based.
Author Contributions MM and FP carried out DNA extraction. CP and RH performed ddRAD library preparation and sequencing. DK, MM, and LC framed the study and contributed to designing the experiments. MM and CP performed the statistical and genetic analysis. MM wrote the manuscript. DK and CP revised the manuscript. All authors approved the final draft of the manuscript Funding information Open access funding provided by Swedish University of Agricultural Sciences. We would like to acknowledge financial support from the Swedish International Development Agency (Sida) and BBSRC Institute Strategic Program Grants (BBS/E/D/20002172 and BBS/E/D/30002275) from Roslin Institute (University of Edinburgh). Edinburgh Genomics is partly supported through core grants from NERC (R8/H10/ 56), MRC (MR/K001744/1), and BBSRC (BB/J004243/1).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.