Background

With the advent of next generation sequencing technologies, plants and animals can be genotyped for thousands of single nucleotide polymorphisms (SNPs) at one time. Sequencing technologies accelerate the development of theory of molecular quantitative genetics. Quantitative trait loci (QTL) mapping and genome-wide association study (GWAS) have been considered as new methods applied in breeding programs. Quantitative traits, however, were verified to be affected by many genes termed as QTLs [1]. In GWAS, each QTL is identified based on a significance test. As a result, many QTLs will be ignored because most QTLs have smaller effects and can’t reach the significant levels [2, 3]. In order to avoid the above defect, a new method termed as genomic selection (GS) was proposed by Meuwissen et al. [4]. Genomic selection uses entire genomic data to explain observed phenotypic variation, but not selects single locus based on a significance test. With high density markers, each QTL can be highly in linkage disequilibrium (LD) with at least one marker [5]. Due to the advantages of high accuracy of prediction and reduction of generation interval [6], genomic selection has been widely used in dairy cattle [713] and has been studied in other species [1421]. Compared with livestock and plant breeding, genomic selection is relatively late to be applied in aquatic breeding [22]. Sonesson et al. [23] have studied genomic selection in aquaculture breeding programs by using simulation data but not real data. Recently, an experiment on genomic selection in Atlantic salmon was studied by Hsin-Yuan et al. [24]. This paper was designed to discuss the feasibility of genomic selection applying in large yellow croaker breeding.

Large yellow croaker (Larimichthys crocea) is one of the most important commercial marine fish species in southeast China and Eastern Asia [25]. However, the genetic diversity of large yellow croaker is seriously lost because of over-fishing and environmental degradation [26]. In addition, the fishing technology reserving larger and abandoning smaller individuals gives rise to negative selection for large yellow croaker. Therefore, a good breeding technique is necessary for this species. In traditional animal breeding, genetic values are predicted from the phenotypic data of individuals and their relatives. This algorithm is termed as best linear unbiased prediction (BLUP) [27]. However, BLUP cannot estimate Mendelian sampling term very well [28]. Using genome-wide SNP genotypes may be a better choice to obtain more accurate relationship among relatives. Nielsen et al. [29] have used simulation method and Hsin-Yuan et al. [24] have used real data to compare the accuracies for genomic estimated breeding values (GEBV) with traditional BLUP estimated breeding values (BLUPEBV), and have suggested that accuracy for GEBV was higher than that for BLUPEBV in aquaculture. Therefore, it may be a trend to apply marker-assisted selection (MAS) in fish breeding programs. Especially for some traits, such as meat quality and disease resistance, which cannot be measured on candidates directly, are more suitable to use MAS schemes for breeding.

Various algorithms are used to predict GEBV in genomic selection, including Genomic BLUP (GBLUP) [30] and Bayesian methods [4]. GBLUP was deduced by VanRaden by using genomic relationship matrix (G matrix) to obtain GEBV directly. Another algorithm termed as RR-BLUP (ridge-regression BLUP) can obtain the same results as GBLUP by calculating SNP effects firstly, which was firstly proposed by Meuwissen et al. [4]. The prior distribution of GBLUP algorithm assumes an equal variance across each locus, which is not an accurate assumption when number of QTLs is small [31]. Nevertheless, it is closer to reality if many QTLs exist in the genome. Another assumption is that there are many loci with no variance and a few loci with their own variances [4]. This algorithm is termed as BayesB, which has a mixture of prior distribution. The GEBV are estimated based on MCMC (Monte Carlo Markov Chain) technology in BayesB, which needs much more computing time. The reason is that the prior distribution of markers is proposed in term of variances but not of effects. Therefore, in order to save the computing time, Meuwissen et al. [32] proposed to use a mixture of a distribution with zero effects and an exponential distribution as a prior for the marker effects:

$$ \pi (g)=\left\{{}_{\left(1-\gamma \right)\kern5.25em for\ g = 0}^{\raisebox{1ex}{$1$}\!\left/ \!\raisebox{-1ex}{$2$}\right.\ast \gamma \lambda exp\left(-\lambda \left|g\right|\right)\kern0.5em for\ g\ne 0}\right. $$
(1)

where γ is the proportion of makers existing effects, and λ is the parameter of exponential distribution. This algorithm termed as fast BayesB or iterated conditional expectation (ICE) is not based on MCMC technology, therefore the computing speed is several orders of magnitude faster than MCMC based BayesB. On the basis of the study of ICE, Shepherd et al. [33] developed an algorithm (named emBayesB) by combining expectation-maximization (EM) algorithms with fBayesB. Besides the fast computational speed and relatively high estimation accuracy, emBayesB has other advantages: (i) the algorithm can adjust the value of proportion that SNPs are in LD with QTLs in the calculating process; (ii) heritability, which is set beforehand, hardly affects the estimation results even if the heritability deviates from the actual situation significantly.

Predictive accuracy of GEBV is one of the most important indicators in genomic selection, which has been studied by various methods based on real or simulated data [4, 3437]. It is affected by many factors, such as training sizes, trait heritability, number of QTLs, and also by marker density and statistical methods. This research would estimate trait heritability and assesse the predictive abilities via cross-validation, and compare predictive abilities within various training records and two algorithms (GBLUP and emBayesB). Combined with the formula for predictive accuracy in genomic selection [34], we would predict the training sizes required to reach ideal predictive accuracies. We would also compare results of GWAS with genomic prediction, and assess feasibility of pre-selection of significant SNPs in genomic selection.

Methods

Materials

The experimental materials were large yellow croaker. All fish were reared in a breeding nucleus farm named ‘Jinling Aquaculture Science and Technology Co. Ltd.’ in Ningde City, Fujian Province, P.R.China. The trail was carried out in Key Laboratory of Healthy Mariculture for the East China Sea when the age of fish was two years old. All fish were injected the hormone named Luteinizing hormone releasing hormone A3 (LRH-A3) simultaneously. Approximately 36 h after injection the LRH-A3, all fish would release sperms or eggs almost at the same time, so all progenies which were used as experimental materials had the same age. Three quantitative traits were tested: body weight (BW), body length (BL) and percentage of n-3 highly unsaturated fatty acids (n-3HUFA) in muscle. Growth rate and meat quality are the most important economic traits in large yellow croaker, so the three traits were chosen for research. BW and BL data was derived from the live body directly. n-3HUFA data, however, must be measured by dissection. 500 individuals were tested on traits of BW and BL while 176 individuals sampled randomly from the 500 individuals were tested on trait of n-3HUFA, which would be used as the experimental materials in this research. The parameters of the three traits were shown in Table 1.

Table 1 Statistical results of phenotypic data for three quantitative traits

Library preparation and sequencing

Fin samples of all 500 fish shown in Table 1 were collected for genotyping. To detect whole-genomic SNP markers for all fish, EcoRI and NlaIII based on Genotyping-By-Sequencing (GBS) method were used to construct the libraries for the NGS sequencing (had not been published). Briefly, genomic DNA of each fish individual was incubated at 37 °C with EcoRI and NlaIII (New England Biolabs, NEB), CutSmart™ buffer and MilliQ water. Digestion reactions were heat-inactivated at 65 °C for 20 min and the reaction system was held in 8 °C. The digested DNA was ligated to adapter sequences with CutSmart™ buffer, ATP, T4 DNA ligase, adapter mix and MilliQ water at 16 °C. Restriction-ligation reaction was also heat-inactivated at 65 °C for 20 min and the reaction system was held in 8 °C. The PCR reaction was performed using diluted restriction-ligation samples, dNTP, Taq DNA polymerase (NEB) and IlluminaF primer and indexing primer containing barcodes. The PCR productions were separated by 8 % PAGE. Fragments with 150 ~ 350 bp (with indexes and adaptors) in size were isolated by using a Gel Extraction Kit (Qiagen), which was diluted for sequencing. Then, pair-end sequencing was performed upon the selected libaries by using an Illumina high-throughput sequencing platform (Illumina, Inc; San Diego, CA; USA).

SNP calling and imputation

The raw sequencing reads were quality checked by FastQC [38]. The reads were then quality filtered by the following steps: (1) adaptor sequences were removed from the raw reads; (2) the reads with the ratio of ambiguous ‘N’ bases greater than 5 % were filtered; (3) deleted the continuous base windows of 5 bp that the average quality smaller than 20 at two ends of reads; (4) removed short reads with a length below 50 bp. The cleaned reads were mapped to large yellow croaker reference genome sequence [39] by BWA version 0.7.10 [40]. The alignments files were then sorted and duplicate marked by Picard (http://picard.sourceforge.net) and applied to GATK package [41] for SNP calling. The resulting SNPs were discarded according to any of the criterions: (1) missing rate ≥20 %; (2) MAF (minor allele frequency) < 0.05; (3) significantly deviation from Hardy-Weinberg equilibrium (HWE) (p-value < 0.001). As a result, 29,748 SNPs were retained for BW and BL, and 32,249 SNPs were retained for n-3HUFA, and the average missing rate of markers was 11.9 %. Fig. 1 showed the distribution of minor allele frequency after filtration. Beagle Version 3.3.2 software [42] was used to impute all missing SNPs.

Fig. 1
figure 1

Distribution of minor allele frequency (MAF) for three traits. The left figure (a) showed the distribution of MAF for body weight and body length and the right figure (b) showed that for percentage of n-3 highly unsaturated fatty acids

Statistical methods

The following linear model is fitted to explain the composition of trait Y i :

$$ {Y}_i=\mu +se{x}_k+{\displaystyle \sum_{j=1}^p{X}_{ij}{g}_j}+{e}_i $$
(2)

where Y i is the phenotypic record of individual i (i = 1, 2, …, n), μ is an overall mean, sex k is the fixed effect of the kth sex (k = 1 for male or 2 for female). X i∙ is a 1 × p vector of SNP genotypes on individual i (The SNP genotypes are coded as 0 for genotype ‘A_A’, 1 for ‘A_a’ and 2 for ‘a_a’). g j is the effect of the jth locus, so g is a p × 1 vector of SNP effects, and e i is a residual effect. In most conditions, the value p is much larger than n. \( {\displaystyle \sum_{j=1}^p{X}_{ij}{g}_j} \) can be replaced by ĝ i which is the breeding value of individual i. If need to calculate SNP effects, the genotype codes will be standardized using the formula: \( {X_{ij}}^{\prime }=\left({X}_{ij}-2{p}_j\right)/\sqrt{2{p}_j\left(1-{p}_j\right)} \), where p j is the frequency of allele ‘a’ at locus j. After standardizing, the mean of genotype at locus j is 0 and variance is 1, so the variance of locus j is only decided by the effect g j , which was described by Meuwissen et al. in detail [32].

Two algorithms were used to calculate the effects of SNPs in this study: GBLUP [30] and emBayesB [33]. The GEBV of GBLUP are calculated by the following mixed model equation (MME):

$$ \left[\begin{array}{l}{{\mathrm{l}}_n}^{\prime }{\mathrm{l}}_n{{\kern1em \mathrm{l}}_{\mathrm{n}}}^{\prime}\mathrm{X}\\ {}{\mathrm{X}}^{\prime }{\mathrm{l}}_n\kern0.75em {\mathrm{X}}^{\prime}\mathrm{X}+\mathrm{K}\lambda \end{array}\right]\left[\begin{array}{l}\widehat{\mu}\\ {}\widehat{g}\end{array}\right]=\left[\begin{array}{l}{{\operatorname{l}}_n}^{\prime }y\\ {}{\mathrm{X}}^{\prime }\ y\end{array}\right] $$
(3)

where \( \lambda ={\sigma_{e_i}}^2/{\sigma_{g_i}}^2=p\left(1-{h}^2\right)/{h}^2 \), h 2 is heritability of trait, ĝ is a vector of GEBV, and K is an inverse of G matrix in GBLUP algorithm. G matrix is calculated by the formula \( \frac{\left(\mathrm{X}-P\right){\left(\mathrm{X}-P\right)}^{\prime }}{2{\displaystyle \sum {p}_i\left(1-{p}_i\right)}} \) [30], where P is the vector of frequency of allele ‘a’ in all loci, and p i is the frequency of allele ‘a’ at locus i. The MME in RR-BLUP is very similar with formula (3), but is used to calculate SNP effects, and K is not the inverse of G matrix but an identity matrix, and \( \lambda ={\sigma_e}^2/{\sigma_{g_i}}^2=p\left(1-{h}^2\right)/{h}^2 \). The two algorithms are equivalent in predicting GEBV of candidates, so we defined both two algorithms as GBLUP in this study. σ e 2 and σ g 2 were estimated by the algorithm ‘REML’ (Restricted Maximum Likelihood) [43]. Another way could obtain the similar results of σ e 2 and σ g 2 by using the R-package ‘EMMREML’, Version 3.1 (http://mirror.bjtu.edu.cn/cran/web/packages/EMMREML/index.html). The formula (1) was still used as the prior distribution of SNP effects in emBayesB. All calculation process were written by Fortran codes (the codes of emBayesB were supplied by Shepherd et al. [33], http://www.biomedcentral.com/1471-2105/11/529/additional) and run in the server of Jimei University.

Cross-validation

In order to reduce the errors caused by random sampling, a replicated training-testing method was used to evaluate the results of genomic prediction. Cross-validation of 20 replicates was performed in this research. 400 individuals were randomly sampled as training set and the rest 100 individuals were used as testing set in each repeat in the BW and BL experiment. The same way was used to study n-3HUFA but the number of individuals in training set and testing set was changed to 140 and 36 respectively. In each replicate, the same training set and testing set were used to perform the GBLUP and emBayesB prediction, so the results would have sufficient comparability for the two algorithms. Paired-t tests were used to test whether predictive abilities estimated by two algorithms had significant differences.

The evaluation index was predictive ability obtained by calculating the correlation between GEBV and observed values in the testing set, i.e., r (ĝ,y) . The relationship between predictive ability and predictive accuracy was deduced by Legarra et al. [20]:

$$ {r}_{\left(\widehat{g},y\right)}={r}_{\left(\widehat{g},g\right)}\ast h $$
(4)

where y is observed values, ĝ is genomic estimated breeding values (GEBV), g is true breeding values (TBV), and h is the square root of trait heritability. r (ĝ,g) is the correlation between GEBV and TBV, which is termed as predictive accuracy. True breeding values, however, can only be observed in simulation data. Therefore, we had to substitute phenotypic data for true breeding values. Nevertheless, the predictive abilities still had comparability because h could be considered as a constant in the same trait and the same population.

In order to observe the changes of predictive abilities by the sizes of training set, the training sizes were also changed from the level 100 to 400 (4 levels were used with 100 as a spacing) in BW and BL experiment. Because the number of individuals was very small in n-3HUFA, the training sizes affecting predictive abilities would not be studied any longer. 20 replicates were also used in the experiment. The empirical formula for predictive accuracy was proposed by Daetwyler et al. [34]:

$$ {r}_{\left(\widehat{g},g\right)}=\sqrt{\frac{N{h}^2}{N{h}^2+M}} $$
(5)

where M is the number of independent loci affecting a trait. h 2 is the trait heritability which can be obtained from REML algorithm. r (ĝ,g) is derived from the formula (4), i.e., r (ĝ,g) = r (ĝ,y)/h. To fit the regression equation, Olrike et al. used least-squares curve-fitting method [19]. In this study, we linearized the formula (5) to derive a linear regression equation:

$$ \frac{1-{r_{\left(\widehat{g},g\right)}}^2}{{r_{\left(\widehat{g},g\right)}}^2}=\frac{M}{h^2}\ast \frac{1}{N} $$
(6)

Here the assumption was that y was \( \frac{1-{r_{\left(\widehat{g},g\right)}}^2}{{r_{\left(\widehat{g},g\right)}}^2} \) and x was \( \frac{1}{N} \), so we obtained a linear model y = kx with no intercept, where \( k=\frac{M}{h^2} \). According to formula (6), we could derive how many individuals in training set were required to reach an ideal predictive accuracy. The number of independent loci (M) could be also derived from the equation.

Comparison of GS with GWAS

Although the objective of genomic selection is different with GWAS, it is still informative to compare the significant SNP loci which are analyzed by GWAS and calculated by GS, and it is helpful to find the best method of pre-selecting SNPs for genomic prediction. All individuals were used to perform GWAS analysis and to calculate SNP effects, i.e., 500 individuals were used in BW and BL while 176 individuals were used in n-3HUFA. A linear regression model was used to perform the GWAS analysis. Two algorithms, i.e., RRBLUP (GBLUP) and emBayesB were still used to calculate the SNP effects. We also compared the proportions of total additive genetic variances (VA) explained by the most significant SNPs by GWAS with that explained by the largest absolute effects calculated by GBLUP and emBayesB. The genetic variances explained by significant SNPs were also estimated by REML algorithm [43]. Theoretically, genetic variance of a locus decides the contribution to the phenotypic variance, but when the genotype has been standardized with mean 0 and variance 1, the absolute SNP effect can reflect the contribution of a locus to phenotypic variance [32].

Results

Heritability estimate

The heritability values estimated by algorithm REML were 0.604, 0.586 and 0.438 for trait of body weight, body length and n-3HUFA, when the number of phenotypic records was 500, 500 and 176 respectively. The results were very similar when heritability estimated by different number of phenotypic records (shown in Table 2).

Table 2 Heritability estimates by REML in different number of phenotypic records

Predictive abilities

Table 3 showed the means and standard errors of predictive abilities estimated by GBLUP and emBayesB when training sizes were 400 for BW and BL, and 140 for n-3HUFA. The results showed that predictive abilities estimated by GBLUP were higher than that by emBayesB in trait of body weight and body length. Through the paired t-tests, the differences were extremely significant (P < 0.001) in body weight and significant (P = 0.015) in body length. However, the result was just the opposite in n-3HUFA, i.e., predictive ability estimated by emBayesB was higher than that by GBLUP, but the result of paired t-test did not show significant difference (P = 0.496) between the two algorithms.

Table 3 Predictive abilities of GBLUP and emBayesB for three quantitative traits

Table 4 showed trend of the predictive abilities with different number of individuals in training set. In general, the increase of predictive abilities accompanied with the increase of training sizes. In this study, we assumed the formula (5) was appropriate for both GBLUP and emBayesB algorithms. Combined with formula (6), we obtained the curve fitting equations of predictive accuracies, which were shown in Table 5.

Table 4 Predictive abilities of GBLUP and emBayesB in different number of phenotypic records
Table 5 Curve fitting equations of accuracies and required training sizes to reach ideal accuracies

Results of GWAS, GBLUP and emBayesB

Figures 2, 3 and 4 showed the results of GBLUP, emBayesB and GWAS. Y-axes represented the absolution values of SNP effects estimated by GBLUP and emBayesB. The vertical lines indicated the significant SNP loci analyzed by GWAS. The results showed that significant SNPs found by GWAS tended to cluster together in some regions. By comparing the results of three algorithms, we found that three algorithms could not always found the SNPs associated with phenotypes consistently. Using body weight as an example, all algorithms could find coincident SNP loci associated with phenotypes in chromosome 1, 6, 10, 11, 13 and 24. However, some significant loci (in chromosome 4, 8 and 23) were found by GBLUP and GWAS but not found by emBayesB. Similarly, some significant loci (in chromosome 12, 15 and 16) were found by GBLUP and emBayesB but not found by GWAS. The proportions of total additive genetic variances explained by significant SNPs (or SNPs with the largest absolute effects) were shown in Table 6. We could find that the same number of significant SNPs by GWAS explained the least proportion of total additive genetic variance, and that by emBayesB could explain the largest proportion (even more than 100 % in n-3HUFA).

Fig. 2
figure 2

Manhattan plot of absolute SNP effects estimated by GBLUP and emBayesB for body weight. X-axis represented the chromosome number (1–24). Number 25 was not chromosome but SNPs which had not been located on specific loci in genome. The upper figure was the results of emBayesB, and the lower figure was the results of GBLUP. Vertical lines indicated the 83 significant SNP loci (P-value < 10−5) analyzed by GWAS

Fig. 3
figure 3

Manhattan plot of absolute SNP effects estimated by GBLUP and emBayesB for body length. X-axis represented the chromosome number (1–24). Number 25 was not chromosome but SNPs which had not been located on specific loci in genome. The upper figure was the results of emBayesB, and the lower figure was the results of GBLUP. Vertical lines indicated the 43 significant SNP loci (P-value < 10−6) analyzed by GWAS

Fig. 4
figure 4

Manhattan plot of absolute SNP effects estimated by GBLUP and emBayesB for n-3HUFA. X-axis represented the chromosome number (1–24). Number 25 was not chromosome but SNPs which had not been located on specific loci in genome. The upper figure was the results of emBayesB, and the lower figure was the results of GBLUP. Vertical lines indicated the 48 significant SNP loci (P-value < 10−4) analyzed by GWAS

Table 6 Proportions of additive genetic variances explained by significant SNPs or SNPs with large absolute effects

Discussion

Heritability estimation

Although the heritability estimated by different phenotypic records was very similar, the stability was poorer when the number of phenotypic records used to estimate heritability became smaller. For example, when 400 individuals had phenotypic records, the standard error of mean for heritability estimation was 0.013 in body weight, but the result became 0.054 when only 100 individuals had records. Therefore, we suggest using as many individuals having phenotypes as possible to estimate heritability.

Predictive abilities

The predictive abilities by GBLUP were slightly higher than that by emBayesB in BW and BL, which was not coincident with simulation results but coincident with some real data. In most simulation results, the accuracies of Bayesian method were higher than that of GBLUP [4, 32, 33]. However, the results in some real data showed that the accuracies of GBLUP were similar to or even higher than that of Bayesian method [10, 11, 19, 44]. One point we need to pay attention to is that relatively small number (~50 or fewer) of QTLs was used in simulation study [4, 32, 33, 37]. GBLUP, however, has no advantage when number of QTLs is smaller. Therefore, we speculate more QTLs (more than 50) affecting BW and BL exist in the genome of large yellow croaker. We think another reason maybe the exponential distribution is not a suitable prior distribution in large yellow croaker. Maybe a better distribution needs to be studied in this species. However, the result was just the opposite in n-3HUFA, which may be explained by the reason that not many QTLs affecting n-3HUFA exist in genome. Another evidence could support this viewpoint in Table 6. Only 48 significant SNPs could explain most (even more than 100 %) proportion of total additive genetic variance. In view of the advantages in different algorithms, we suggest doing a test before deciding to use which algorithm to calculate marker effects and GEBV. According to results of this research, we suggest GBLUP is more suitable to perform genomic prediction in body weight and body length in large yellow croaker. The result of significance test in n-3HUFA did not show significant difference between the two algorithms. We think the reason may be relatively small training sizes were used to perform genomic prediction, which may be also the reason why the standard errors of predictive abilities in n-3HUFA were higher than that in BW and BL.

According to the predictive accuracy equations shown in Table 5, we can derive the training sizes required to reach ideal accuracies. That is to say, at least 1000 individuals are needed to reach predictive accuracy of 0.8. We think this is a very good result that only ~1000 individuals can reach a so high accuracy. The reason may be high trait heritability and the consistent rearing environment. The number of independent QTL loci (i.e., M value), was also observed from the equation, which was far more than 50, which may support our speculation in the above discussion.

Comparison of GWAS and GS

At the present time, it is quite expensive to perform genomic selection in large yellow croaker breeding. The genotyping of a candidate by GBS still costs more than 2 to 3 broodstock, so it is necessary to compare the results of GWAS with GS, and find the best method of pre-selecting SNPs for genomic prediction. Figures 2, 3 and 4 showed significant SNPs by GWAS tend to cluster together in specific regions. The reasons may be that strong correlations exist between adjacent SNPs and single-marker analysis was used by GWAS in this study. When one SNP locus is correlated with phenotypes significantly, the adjacent locus may show similar result. Table 6 showed the significant SNPs by GWAS could explain the least proportion of total additive genetic variances, which is still caused by the clusters phenomenon. Although 83 SNP loci with P-value < 10−5 in body weight, many clustered significant SNPs just corresponded to one QTL actually.

Figures 2, 3 and 4 showed that the distributions of SNP effects were very different between GBLUP and emBayesB. The main reason is the different prior assumptions. emBayesB assumes most loci having no effects, and therefore compresses effects of most loci to near zero. GBLUP, however, assumes all loci having equal variance, and therefore QTLs seem to be everywhere in genome. Table 6 shows the largest absolute SNP effects estimated by emBayesB can explain more proportions of total additive genetic variance than that by GBLUP. We think the reason may be prior distribution of emBayesB highlights loci with large effects.

According to proportion of total additive genetic variance explained by significant SNPs (shown in Table 6), we can speculate the number of QTLs in n-3HUFA is fewer than that in BW and BL, which can offer a reference for using pre-selection of SNPs in genomic selection. Using n-3HUFA as an example, the largest absolute effects by emBayesB can explain more than 100 % of total additive genetic variance, which means the 48 SNPs is enough to perform genomic prediction in n-3HUFA. It is reasonable that a portion of SNPs explain slightly more than 100 % of total genetic variance, because not all SNP loci are associated with the trait and the SNPs with no effect may interfere with the estimation. Ulrike et al. [19] have suggested pre-selection of SNPs can improve predictive accuracies in genomic selection, which means it is possible to obtain higher additive genetic variance using a portion of SNPs.

GS in fish breeding

Through analyzing the data in the experiment, using genome-wide markers to estimate genetic values is feasible in large yellow croaker. However, it still has some limitation to apply genomic selection in fish breeding. The reason is high mortality rate in fish, which causes genomic selection applied in fish is not the same as domestic animals completely. For example, dairy cattle and pigs can be selected in an early stage by using GEBV as a reference because they have low death rates. Most of fish, however, will die in the process of growth, which leads to the result that an early-stage selection is not a good scheme in fish breeding. If we want to select the broodfish in the early stage, more fish are required to select to meet the quantity of adult broodfish, which means more fish are required to be measured and genotyped. Therefore, the costs will increase a lot if an early-stage selection is performed in fish genomic breeding. The better way to save costs is selecting the broodfish in adult stage. In this stage, the fish have relatively low mortality, but the traits such as body size are also easy to be measured, which means phenotypic selection is viable for these traits. However, some traits such as meat quality and disease resistance are not suitable to be measured in candidates, so phenotypic selection is not suitable in these traits. Therefore, we suggest genomic selection is more suitable for some traits which cannot be measured on candidates directly. Without doubt, the selection age of fish is still in adult stage but not in early stage. In addition, because of high cost of genotyping, other methods such as pre-selection of significant SNPs [45, 46] or using extreme phenotypic records [47] can be considered in genomic prediction of fish breeding.

Conclusions

In this study, heritability estimates by REML were 0.604, 0.586 and 0.438 for trait of body weight, body length and n-3HUFA respectively. The research showed that using genome-wide sequence data to estimate genetic values was feasible in large yellow croaker, which is helpful to promote this technology to apply in fish breeding. GBLUP and emBayesB had respective advantages on different traits, so we suggest doing a test before deciding to use which algorithm in specific trait in genomic selection. Combined with the predictive accuracy equations, we derived that at least 1000 individuals in training set could reach a predictive accuracy of 0.8 in body weight and body length. Three algorithms, i.e., GBLUP, emBayesB and GWAS, cannot always find significant SNPs associated with phenotypes consistently. The significant SNPs by emBayesB could explain the maximal proportion of total additive genetic variance, while that by GWAS explained the minimal proportion, which can offer a reference for pre-selection of SNPs in genomic selection. Because of high mortality rate of fish and high cost in genomic sequencing, genomic selection may be more suitable for applying on the traits which cannot be measured on candidates directly.

Abbreviations

BL, body length; BLUPEBV, BLUP estimated breeding values; BLUP, best linear unbiased prediction; BW, body weight; EM, expectation-maximization; emBayesB, BayesB based on expectation-maximization; GBLUP, Genomic BLUP; GBS, Genotyping-by-sequencing; GEBV, genomic estimated breeding value; G matrix, genomic relationship matrix; GS, genomic selection; GWAS, genome-wide association study; HWE, Hardy-Weinberg equilibrium; ICE, iterated conditional expectation; LD, linkage disequilibrium; MAF, minor allele frequency; MAS, marker-assisted selection; MCMC, Monte Carlo Markov Chain; MME, mixed model equation; n-3HUFA, percentage of n-3 highly unsaturated fatty acids; NEB, New England Biolabs; QTL, quantitative trait loci; REML, Restricted Maximum Likelihood; RR-BLUP, ridge-regression BLUP; SNP, single nucleotide polymorphism; TBV, true breeding values