Background

Dagginess (faecal soiling of the perineum region) and internal nematode parasites, are major animal health traits with ethical and welfare implications of interest to the New Zealand sheep industry. Firstly, dagginess has a known association with flystrike (genetic correlations range between 0.34 ± 0.13 and 0.81 ± 0.15) [1, 2], with the majority of flystrike occurring in the breech. Secondly, internal parasites are increasingly becoming resistant to anthelmintic drenches; there is known resistance to all major classes of anthelmintics [3]. Finally, there is a perception that internal parasite load is associated with level of dagginess. Breeding to reduce level of dagginess and internal parasite load as measured by faecal egg counts is a strategy that can provide cumulative and permanent progress.

Dagginess and parasite resistance, as measured by faecal egg counts of Strongyle (FEC) and Nematodirus egg count (NEM), have been shown to be moderately heritable traits [4], with estimates in New Zealand sheep of 0.37 and 0.34 for dag score at three and eight months (DAG3, DAG8) and between 0.18 and 0.21 for FEC/NEM traits [5]. Dagginess and FEC/NEM are often thought to be genetically associated in sheep. However, recent estimates showed that genetic and phenotypic correlations between FEC/NEM and dagginess traits (DAG3 and DAG8) were low to zero in New Zealand sheep [5]. A subsequent study also showed that dagginess was not correlated with wool length, bulk or type [6]. This does not rule out the immune response to worm burden irrespective to FEC/NEM as a cause of dagginess, but does suggest that the cause is internal and may involve processes within the intestinal tract.

Traditional genome-wide linkage studies which find quantitative trait loci (QTL) associated with a trait of interest has been used successfully in animal production [7, 8]. The sequencing of many domesticated species: e.g. cattle, chicken and sheep, have allowed the introduction of high-density SNP genotyping platforms. These involve thousands and for some species hundreds of thousands of SNPs approximately equally spaced across the genome, to capture the greatest amount of linkage disequilibrium with causative QTL. This has produced rapid progress in genome-wide association studies (GWAS) which have already identified regions associated with production [9], fertility [10], disease [1114] and polledness [15] traits in cattle and sheep.

The same platforms have given rise to marker assisted selection on a genome-wide scale, called genomic prediction or selection [16]. The sum of the effect each SNP has on a trait is used to predict the animals’ molecular breeding values (mBVs) [17]. Thus potentially all the genetic variation for a trait could be picked up by the SNP panel due to the extent of LD between the SNPs on the panel and causative QTL. The dairy industry has already adopted GP to increase genetic gain [18], and it has been recently implemented in the New Zealand sheep industry [19].

A resource consisting of greater than 3.5M pedigree recorded animals, born between 1990 and 2010 from 233 industry recorded flocks, with estimated breeding values (eBVs) for a number of production traits, including DAG3, DAG8 and FEC in summer (FEC1), autumn (FEC2) and as adult (AFEC), was available for use. Of these, 8705 have been genotyped on the Illumina Ovine SNP50BeadChip (50K). The aim of this study was to estimate the accuracy of mBVs for these traits, using genomic BLUP, which assumes all SNPs have a small effect and are normally distributed. A second aim was to use the SNP effects generated from the genomic prediction analysis to identify regions associated with these traits, in a GWAS.

Results and discussion

Quality control

A step by step quality control pipeline was performed [20]. From the initial set of 54,977 useable SNPs, 4869 were not retained by the Ovine HapMap [21] and a further 1781 SNPs were discarded due to one or more of the following; non-autosomal (including pseudoautosomal), minor allele frequency (MAF) = 0, call frequency <0.97 and Illumina quality score (GC10) value <0.422. The final dataset included 8705 animals and 48,327 SNPs.

Summary of dependent variables

Of the 3.5M animals used for eBV estimation there were 95,544 and 75,979 raw measurements for DAG3 and DAG8, respectively. The traits FEC1 (scored in summer) and FEC2 (scored in autumn) are repeatable traits with two samples (a and b) potentially collected at each time point, several days apart (Table 1). For AFEC, this trait is not recorded and the eBVs are generated using estimated genetic and phenotypic correlations with other traits including FEC1 and FEC2. Table 1 summarizes the raw measurements used by Sheep Improvement Limited (SIL), the New Zealand sheep genetic evaluation database, to generate the eBVs and the resulting dependent variables (y) used for molecular breeding value (mBV) calculation; i.e. are parent averaged de-regressed, have reliabilities greater or equal to 0.8 times the heritability and were for animals genotyped on the 50K SNP chip. There were between 1957 and 4164 animals for each trait with y values; corresponding reliabilities were between 0.34 and 0.51.

Table 1 Summary of raw phenotypes, de-regressed dependent variables with parent average removed and reliabilities

The dependent variables were split into a training and validation datasets, based on birth year, for the genomic prediction and to estimate accuracy of the prediction equations. This was performed for each breed (Romney, Coopworth, Perendale, Texel and three breed crosses designated: CompRCP, CompRCPT and CompCRP) and trait (Table 2). Training set cut offs were chosen to ensure adequate numbers were in the training and validation datasets, see methods for a complete description.

Table 2 The year of birth of the first animals placed in the validation set and number (n) of animals in training and validation sets for each breed

Principal component analysis

Using the G1 matrix, described by VanRaden [22], the first six principal components (PC) using the animals in the training set were calculated. The six PCs accounted for between 0.60 and 0.73 of the genetic variation contained in the genomic relationship matrix for each trait (Table 3). In Fig. 1, the first (PC1) and second (PC2) largest axes of variation are plotted using the animals with FEC1 y values. The distinction of the four main breed groups (Romney, Coopworth, Perendale and Texel), with the three composites breeds (shown as ‘other’) clustered in-between, are shown clearly and is typical of the New Zealand sheep industry.

Table 3 The genetic variance explained by the first 6 principal components for each trait
Fig. 1
figure 1

The first two principal components (PC) calculated for all animals for faecal egg count in summer. Romney (blue), Coopworth (green), Texel (yellow), Perendale (purple) and others/composites (grey)

Accuracy of genomic prediction

The accuracies of the five traits calculated as the adjusted weighted correlation between mBV and y (rA) and as the weighted average individual accuracy (rI) are shown in Table 4. These were calculated using the animals in the validation set. For the dagginess traits the rA ranged between 0.11 and 0.56 for those breeds in the training set and rI ranged between 0.31 and 0.52. The Perendales had the lowest rA and rI reflecting the low number of animals from this breed in the training set; n = 52 and 50 for DAG3 and DAG8, respectively (see methods). For the three breeds present in the validation set only (Texel, CompRCPT and CompCRP) rA and rI were poor for the Texel who are the furthest removed from the validation set. The two composites have more than 30 % of their genetic background from Romney, Coopworth or Perendale breeds and had accuracies close to values seen by the breeds represented in the training set.

Table 4 Accuracies as the weighted correlation between mBV and dependent variable (rA) and the weighted average individual accuracy (rI) calculated for the five traits in the seven validation breeds

For the FEC traits, the rA ranged between 0.18 and 0.71 for those breeds represented in the training set and between 0.28 and 0.55 for rI. The Perendales again had the lowest accuracies; the number of animals of this breed in the training set were low (n = 164, 175 and 123, for FEC1, FEC2 and AFEC, respectively). The composite (CompRCP) also had a low number of animals in the training set for FEC2 and AFEC, 101 and 10, respectively. The slightly higher accuracies seen for CompRCP compared to the Perendales may be due to the CompRCP animals consisting of at least 50 % Romney, Coopworth and/or Perendale. Again, of the breeds represented only in the validation sets, the Texel had the lowest rA and rI except for AFEC, where CompRCPT had a rA of −0.22. Adult FEC had lower accuracies for most breeds compared to FEC1 and FEC2. This is probably a reflection of the lower numbers of animals available with y values for this trait, which in turn is due to this trait being indirectly predicted from correlated traits.

The accuracies were compared to the theoretical accuracies using equation 8 from Goddard [23], following the assumptions of an effective population size (Ne) of 405 (New Zealand Romney, Table S4 [21]), number of records available per trait and genome length of 30M. The theoretical accuracies were 0.18 and 0.20 for DAG3 and DAG8 respectively and 0.32, 0.31 and 0.23 for FEC1, FEC2 and AFEC, respectively. The accuracy estimates obtained in this study are higher than those calculated theoretically. The reason for the higher accuracies is that the theoretical values are for ‘unrelated’ animals i.e. ~10 generations or more distant. In practice, most of the validation animals have an ancestor 1–3 generations distant in the training data set and as such will have higher estimated accuracies than expected from theory.

Simulations showed that when there are limited numbers of animals from one population set, then the most accurate genomic predictions are generated when information from all populations are combined in the training set rather than predicting separately by population [24]. However, the more genetically diverse the populations are, the less accurate are the genomic predictions for across breed analysis. This corroborates the low accuracy for the Texels in this analysis, as they are the most divergent breed in the validation set compared to those present in the training set. The Texels originated from Texel an island offshore from the Netherlands, while the Romneys were from England, their estimated divergence is 160 to 240 generations ago [21]. To increase accuracies for Texels, more animals are required so that some may be combined in the training set. This may in part be achieved by increasing the number of composites with at least 50 % Texel, if pure-breds are hard to collect.

The accuracies for the CompRCPT and CompCRP are higher than expected for a breed not present in the training set, however, they are at least 50 % and at least 30–50 %, respectively, of the breeds represented in the training set. The strength of the genetic relationships between individuals and breeds was shown in the principal component analysis, e.g. for FEC1 (Fig. 1).

Implementation in industry in New Zealand for these traits currently uses the mBVs as described here with minor modifications [25]. These are then blended with eBVs calculated on all available animals (see Dodds [26] for a brief description).

GWAS

The quantile-quantile (Q-Q) plots (Fig. 2) showed that the deviation of the majority of observed -log10(P) values from the expected values was insignificant (lambda ranged between 1.001 and 1.021). The SNPs seen to be deviating from the expected values were interpreted as SNPs associated with the trait of interest, as the SNPs are departing from the null hypothesis of no genetic association and no LD between SNPs. There were 32 regions associated with DAG3, DAG8, FEC1, FEC2 and/or AFEC with a P value < 0.0001 (Additional file 1).

Fig. 2
figure 2

Quantile-quantile plot for dag score at three (a) and eight months (b), faecal egg count in summer (c), autumn (d) and as adult (e). The 0–1 line (solid) and the slope (dash) are also plotted

Figure 3 (a and b) show the Manhattan plots of the resulting -log10(P) values for DAG8 and FEC1, respectively. The Manhattan plot for DAG3 was similar to DAG8, and plots for FEC2 and AFEC were similar to FEC1 (Additional file 2). A summary of genes underlying the top SNPs with a P value < 0.0001 for each trait is in Additional file 1. For DAG3 and DAG8 there was one peak common to both traits detected on OAR15 (Fig. 4b), comprising of a single SNP (s22390; P value 5.04 × 10−6 and 2.72 × 10−10, respectively). Annotation on Ovine genome v3.1 (http://www.ensembl.org/Ovis_aries) showed there are no known genes or proteins within 100kbp window of this SNP. Two predicted genes were observed (Ensembl transcript: GENSCAN00000038546 and GENSCAN00000038543), however RNA-seq data at Ensembl does not provide supporting evidence for these genes being real. The 100kbp sequence was scanned for open reading frames in all 6 frames, these were matched against a collection of protein signature databases using InterProScan 5 [27]. This identified three matching domains; integrin beta subunit, insulin-like growth factor binding protein and Agouti (Additional file 3).

Fig. 3
figure 3

Manhatten plot of -log10(P) values of SNPs for dag score at eight months (a) and faecal egg count in summer (b). Ordered on the ovine genome v3 map, P < 0.0001 (solid line), P < 0.001 (dash line)

Fig. 4
figure 4

Distribution of -log10(P) values for dag score at three (blue) and eight months (orange) and faecal egg count in summer (green), autumn (purple) and as adult (red) for candidate regions on OAR 7 (a), 15 (b) and 26 (c)

For the FEC traits there were two distinguishable similarities on OAR 7 and 26 (Fig. 4a and c). On OAR 7 there was a notable peak consisting of 17 SNPs in total for all three traits, spanning a 3.36 Mbp region (45,194,749bp to 48,549,329bp), with the top SNP within this region passing the P < 10−4 threshold for FEC1 (s65809). On OAR 26 there was one significant peak (OAR26_25273391), with a P value ranging between 8.38 × 10−8 and 2.18 × 10−11 for FEC1, FEC2 and AFEC.

The gene positioned under the peak on OAR 7 is the vacuolar protein sorting 13 homolog C (VPS13C; OMIM: 608879). This family of proteins are involved in the trafficking of membrane proteins between the trans-Golgi network and the prevacuolar compartment (Saccharomyces cerevisiae) corresponding to the multivesicular body/late endosome in mammals [28]. Four splice variants of this gene have been observed, along with three repeat regions. Homolog C arose from duplication of homolog A, also known as chorein, which is involved in chorea-acanthocytosis (ChAc), an autosomal recessive disease. However, homolog C cannot compensate for defunct homolog A in ChAc patients [28]. The VSP13 family of proteins are expressed in most tissues including the small intestine and colon [28].

The gene positioned under the peak on OAR 26 is the zeta-sarcoglycan (SGCZ; OMIM: 608113) gene. This gene is involved in the formation of the sarcoglycan (SG) complex with SGCE, SGCB and SGCD in smooth muscle, retina and Schwann cells [29, 30]. The SG complex is part of the dystrophin-glycoprotein complex that interacts between the actin cytoskeleton and the extracellular matrix, essential for membrane stability. Mutations in the sarcoglycans cause limb-girdle muscular dystrophy, with malfunctions of digestive smooth muscle leading to dysphagia, vomiting, chronic constipation and acute digestive dilatations. This region also overlaps with a copy number variant (CNV) region associated with obesity in mice [31].

Numerous studies have been carried out to investigate the genetic control of resistance to internal parasites. The majority of these studies involved microsatellite-based linkage studies [3238]. There are only a few more recent studies involving SNP chip data [11, 12, 39, 40], one published study using both microsatellites and SNPs [41] and one using candidate gene approach [42]. Numerous traits representing parasite resistance have been used in these previous studies, for example, immunoglobulin A activity, packed cell volume and eosinophil counts, as well as the standard FEC and NEM traits. Given this, of the 32 identified SNPs with a P < 0.0001, there were 16 regions which overlapped previous QTL/GWAS studies on gastrointestinal parasites (Additional file 4). Notably the region on OAR 7 (~45.3cM) overlapped with four other studies [12, 36, 40, 41]. A region on OAR 2 (~129.9cM) also overlapped or is near regions identified in five other studies [11, 32, 33, 36, 39]. Four regions associated with DAG3 and/or DAG8 in this study were also identified in previous studies on gastrointestinal parasites (OAR 1 ~ 154.6cM [11, 35]; OAR 3 ~ 138.4cM; [33]; OAR 8 ~ 71.1cM [39]; and OAR 15 ~ 40.2; [11, 38, 41]). It has been reported in a comprehensive dataset that FEC and dag score traits are not strongly genetically or phenotypically correlated, however, it was suggested that parasite worm burden could still play a role in faeces accumulation [5]. Even though no similarities between significant regions were found between the two trait sets in this study, there may be similarities in the genes involved in the whole physiological response. It could be that the part of the response involved in reducing FEC/pasture contamination is a different gene set from those involved in what leads to faecal accumulation. Therefore, similarities between regions associated with dag score in this study and FEC from previous studies could arise, relating to the physiological response to the parasitic infection.

Conclusions

These results indicate that genomic prediction can be implemented for most breeds in the New Zealand sheep industry for dagginess and FEC traits. In addition, three regions have been identified, one on OAR15 shows association with faecal accumulation and two regions (OAR7 and OAR26) show an association with the FEC traits. This study adds additional information in the quest for understanding the genes potentially involved in the host response to internal parasites and faecal accumulation.

Methods

This study was carried out in strict accordance of the guidelines of the 1999 New Zealand Animal Welfare Act and was approved by the AgResearch’s Invermay Animal Ethics committee.

Estimated breeding values and dependent variables

Estimated breeding values (eBVs) were available for DAG3, DAG8, FEC1, FEC2 and AFEC from analyses performed by Sheep improvement Limited (SIL), the New Zealand sheep genetic evaluation system [43]. The eBVs were generated from approximately 3.5M pedigree recorded animals from 230 industry recorded flocks. To ensure SNP associations were not due to pedigree information, dependent variables (y) were calculated taking into account the individuals own and descendants’ information. Parent average effects are removed [44], assuming all genetic variation is explained by the markers (c = 0). The resulting values were de-regressed using the reliability of the eBV with parent-average removed.

The eBVs were available directly from SIL, however, for completeness the following is a description of the measurement criteria of the traits for input into the SIL database. Dagginess is a subjectively, visually-assessed trait scored at approximately three and eight months of age (DAG3; DAG8) using a 6-point scale: zero (no dagginess) to five (complete coverage of the breech and down the legs by faecal material) [6]. Measurements on FEC traits for input into SIL for BV estimation are done through the WormFEC™ programme [45]. Faecal egg count is a repeatable trait with BVs estimated from two samples (taken several days apart) taken in summer (FEC1) and autumn (FEC2). Egg counts are made of Strongyle (primarly Ostertagia spp, Trichostrongylus spp, Cooperia curticei, and in a proportion of farms Haemonchus contortus [46]). Breeding values for AFEC are estimated from genetic and phenotypic correlations with FEC1 and FEC2.

Genotypes and quality assurance

Of the 3.5M pedigree recorded animals, 8705 had been genotyped, and passed quality control, with the Illumina® Ovine 50K SNP chip, according to the manufacturer’s protocol. These were mainly sires, only 22 % were female, and were predominately Romney, Coopworth, Perendale or Texel, plus other breeds and various crosses and composites. Genotyping results were put through a quality control pipeline before analysis [20]. In summary, SNPs are discarded if they have a call rate <97 %, appear non-autosomal (including pseudoautosomal), minor allele frequency (MAF) ≤0.01, Illumina quality score (GC10) value <0.422 and departed from Hardy Weinberg disequilibrium (1 × 10−6). The SNPs that were not retained as part of the Ovine HapMap study [21] were also discarded.

Genomic prediction analysis

Breed designation and reliability threshold

Data were filtered on breed and reliability before analysis. Breed was designated by the following conditions: Romney, Coopworth and Perendale were reported if their breed composition was greater or equal to 75 %. There were also three composite breeds considered, based on the breed composition of the New Zealand sheep industry. Firstly, CompRCPT were those that were greater than 50 % of combined Romney, Coopworth, Perendale breeds and at least 25 % Texel. Secondly, CompRCP were those that were greater than 50 % of combined Romney, Coopworth, Perendale breeds and less than 25 % Texel. Finally CompCRP were those that had greater than 30 % and less than or equal to 50 % of combined Romney, Coopworth and Perendale breeds.

The reliability cut off was 80% of the heritability estimate used for eBV estimation. Animals had to have dependent variable reliabilities equal to or above this cut off to be considered for analysis as typically they have either not been measured or alternatively progeny tested for the trait. The number of animals in the final analysis (i.e. with both genotypes and eBVs above cut off) were 2640 for DAG3 (44 % female), 1957 for DAG8 (31 % female), 4165 for FEC1 (33 % female), 3269 for FEC2 (27 % female) and 2204 for AFEC (16 % female).

Training and validation assignment

After the above filtering, genotypes were scored on the number of copies of the ‘A’ allele (based on Illumina AB calling format). Missing genotypes were filled in using the breed mean, estimated using a least squares regression on breed proportions as Romney, Coopworth, Perendale, Texel and other, to generate allele frequencies for each SNP within breed. The missing values are then replaced weighted by the individuals breed proportion of Romney, Coopworth, Perendale, Texel or other.

Training and validation sets were formed to a) derive a prediction equation using the training set and b) to estimate the accuracy of the prediction equation in the validation set. For estimating the SNP effects for the GWAS all animals were used in the training set. The animals were split into validation and training sets based on birth year (Table 2). The Texel, CompRCPT and CompCRP animals were only used in the validation set, to see how well the predictions work for these groups when not directly in the training set, as well as for groups represented in training. Training set cut off years were chosen for each breed, using a number of criteria. First, at least 200 animals per breed are used for validation. Secondly, if there are less than 400 animals roughly half are required in each set. Finally, if there were between 75 and 100 animals then a small portion (~10) were left in the training set, and the rest in validation.

The dataset comprised mainly of males used as sires, for each trait the percentage of females in the training and validation sets were: 39 and 50 % for DAG3, 20 and 43 % for DAG8, 30 and 38 % for FEC1, 26 and 32 % for FEC2 and 9 and 27 % for AFEC, respectively.

GBLUP

For full description of methods see [23], in summary the following methods were applied. Two genomic relationship matrices were used. The first G matrix (G1), as described by VanRaden [22] was used to calculate the coefficients (i.e., a linear prediction equation), while the second G matrix (G2, calculated using breed-specific allele frequencies [47]) was used to calculate the individual accuracies as described below. In a multi-breed population the G2 matrix is more similar to the pedigree-derived relationship matrix than G1, [47] and [23] recommend using G2 in preference to G1 for calculating individual accuracies.

The mBVs were calculated using genomic BLUP method of VanRaden [22]. A mixed model was fitted to the dependent variable, y, for each trait as follows: y =  + Z u + e where X is a matrix of the first six principal components of the G1 [22] matrix (to account for population stratification), β is a vector of fixed effects of the PC, Z is an incidence matrix and u is the animal effects (breeding values) distributed as N(0, G1 σ2 u), where σ2 u is the additive genetic variance and e are the residual effects distributed as N(0,R) where R is a diagonal matrix with diagonal elements (1-r2)/r2 where r2 is the reliability of y.

The mBVs are the predicted animal effects from the above model. The mBVs were obtained by multiplying the SNP effects by the SNP genotypes and summing.

Calculating the accuracy

The accuracies of the mBVs for each breed were derived from the validation animals using two different methods. The first method used the mBVs from GBLUP fitting the G1 matrix; \( {r}_A = \frac{cor\left(\boldsymbol{y},\boldsymbol{M}\boldsymbol{B}\boldsymbol{V}\right)}{h_g} \), and was weighted by 1/(1-r2). The mBV were calculated as above, only using the training set. The effective heritability (h2 g) is equal to the average reliability (r2) of y. The second method uses the prediction error variance (PEV) [48] from a genomic BLUP analysis fitting the G2 matrix, giving; \( {r}_{Ii}=\sqrt{1-\frac{PE{V}_i}{\sigma_u^2}} \) for animal i, where σ2 u is the genetic variance and the PEV i are obtained by inverting the left hand side of the mixed model equation [47]. These were calculated for all validation animals and averaged (weighted by 1/(1-r2)) to give an accuracy, r I , for each breed.

GWAS

To identify genetic regions associated with the five traits, SNP effects (b i ) were obtained from the above genomic BLUP using the G1 matrix and all animals in one dataset. Probability (P) values were calculated for the ith SNP assuming the b i follow a normal distribution with mean zero and variance: \( var=\frac{2 pi\left(1- pi\right){n}_b{\sigma}_b^2}{{\displaystyle \sum \left(2 pi\left(1- pi\right)\right)}} \), where p i is the frequency of the A allele of the i th SNP in the population, σ2 b is equal to the empirical variance of b i and n b is the number of SNPs with effects. The -log10(P) values corresponding to the estimates of the b i were graphed in a Manhattan plot on Ovine genome v3.1 [49] (available at Ensembl http://www.ensembl.org/Ovis_aries), and thresholds set at an initial level calculated using the 5 % Bonferroni correction [50] 0.05/n b  ≈ 10−6 (−log10(P) ~6).

The quantiles were calculated to check whether the distribution of the observed -log10(P) values deviated from the expected distribution (exponential) under the null hypothesis of no genetic association and no LD between SNPs. To do so, the n b -log10(P) values were sorted and plotted against the -log10(1-u) where u = [1 / (n b  + 1), 2 / (n b  + 1), …, n b / (n b  + 1)] as a quantile - quantile (QQ) plot.

Exploration of significant SNPs

For peaks that reached the Bonferroni threshold, the genomic region was explored by inspecting a 100kbp window surrounding the location of the significant SNP using ovine genome v 3.1 (available at Ensembl http://www.ensembl.org/Ovis_aries). A further literature search and Online Mendelian Inheritance in Man (OMIM) were used to identify candidate genes.

Supporting material

The data sets supporting the results of this article are included within the article and its additional files.