Background

The identification of genomic regions that affect traits of interest and the application of marker- or gene-assisted selection [1] in livestock are crucial to speed up genetic improvement. This is especially valid in species for which genetic gain is hampered by the relatively high costs of large-scale phenotyping and the logistic constraints of artificial insemination [2]. However, application of selection assisted by molecular information for traits that are influenced by numerous loci, each one explaining a small portion of the trait variance, is limited by the lack of power of experiments based on low-density marker maps [3]. In sheep, attempts to identify regions that affect quantitative traits (QTL) were performed first by using microsatellite maps [4,5,6,7,8,9,10,11,12,13]. Most of the identified QTL showed low significance levels and large confidence intervals, and thus their use in selection programs was not possible [2].

Nevertheless, the discovery of thousands of single-nucleotide markers (SNPs) and cost-effective tools (SNP arrays) to genotype them on a large number of animals as well as the recent availability of affordable whole-genome sequencing techniques, has opened new perspectives. It is expected that polymorphisms with small effects that collectively explain an increasing amount of the genetic variance may be gradually identified by using denser molecular marker maps on larger structured resource populations [14]. Thus, individuals that belong to pre-existing and new dairy sheep experimental populations have been accurately recorded for several traits and genotyped with 50 K and/or 600 K SNP chips. This is the case of an experimental flock of Sarda ewes, which has been set up since 2000 as a resource population to implement selection assisted by molecular information in the breeding scheme of this Italian dairy breed. QTL detection studies based on SNP arrays and the availability of increasingly accurate genome asec15 nnotation data allow the listing of potential candidate genes [15,16,17,18,19,20]. Moreover, whole-genome re-sequencing of target animals has been used to restrict the number of candidate polymorphisms. Thus, putative causative mutations that affect traits of economic interest have been proposed [21,22,23,24].

Among the available QTL detection approaches, those based on the combined use of linkage disequilibrium (LD) and linkage analysis (LA) information (LDLA) have been indicated as the most powerful, robust and precise in populations that are structured in families [16, 25, 26]. The main reason is that they account for both recombination events that occurred within genotyped generations and historical recombination events that occurred in generations prior to genotyping [25]. Several approaches have been proposed to combine LD and LA information [25, 27,28,29,30,31]. The classical LDLA method [25, 32] performs a variance component analysis at each putative QTL position by using identity-by-descent (IBD) probabilities between haplotypes. The main issue is that the IBD probability matrix is often dense, non-positive definite and computationally demanding for its inversion [29, 33]. Thus, either strategies that perform the hierarchical clustering of haplotypes based on IBD probability [33,34,35] or the approximation of the IBD based on the extent of the identity-by-state status between haplotypes [36] have been used. However, these approximations inevitably result in a loss of information [29].

An alternative way to process IBD information is principal component analysis (PCA), which has the desirable feature of collapsing information that is contained in a set of correlated variables by a smaller set of orthogonal variables. As such, PCA has been proposed as a technique to reduce the dimensionality of predictors in genomic selection [37, 38].

The aim of this study was to detect genomic regions that affect milk traits in the resource population of Sarda sheep by applying an LDLA approach combined with PCA to overcome computational issues of the IBD matrix and minimize the loss of information.

Methods

Resource population

The generation of the resource population (RP) started in 1999 when 10 Lacaune × Sarda F1 sires were mated to Sarda ewes to produce 928 back-cross female lambs in the framework of an European project aimed at detecting QTL in the main European sheep breeds (QLK5-CT-2000-00656; “genesheepsafety”). Subsequently, we focused on the detection of QTL segregating in the pure Sarda breed, and since 2002 we used exclusively Sarda rams (SA) to produce the yearly replacement of RP. Until 2009, the average size of the sire families was 43 daughters whereas, from 2010 onward the average size of families decreased to nine daughters, in order to increase the number of Sarda bloodlines represented in the RP. Sarda sires were always chosen based on their genetic impact on the registered population among rams belonging to the artificial insemination centre of the breed.

In total, 3949 ewes from 161 rams (10 F1 and 151 SA) were generated until 2015. Ewes of RP were kept until the 4th (occasionally the 5th) lactation on an experimental farm. The farming system was similar to that commonly applied in Sardinia with most of the adult ewes lambing in autumn and yearlings lambing between January and March. The ewes were milked twice a day by machine from weaning (3–4 weeks from lambing) until the end of July. The feeding regime was based on controlled grazing, supplemented by hay and concentrates in winter and late spring.

Genotypes and phenotypes

All the ewes of RP and their sires as well as the 10 Lacaune sires of F1 and 11 Sarda sires of SA were genotyped with the Illumina Inc. OvineSNP50 Beadchip. SNP editing was performed using call rate and MAF thresholds of 95% and 1%, respectively. The ovine genome assembly v4.0 and the software SNPchimMpv.3 [39] were used to construct the genetic map by assuming 1 Mb = 1 cM. Unmapped SNPs and SNPs on sex chromosomes were not included in the study.

A large range of traits of economic interest was measured in the RP. In the current study, we focused on milk traits: milk yield (MY); fat yield (FY); protein yield (PY); fat content (FP) and protein content (PP). MY, FP and PP were measured twice a month during the milking period at the am and pm milking. Lactation records were computed by the Fleischmann method, using records from the milking period only (in agreement with ICAR recommendations), by considering an initial suckling period of 30 days. Finally, 13,059 lactations of 3731 ewes recorded from 2000 to 2017 were retained. The average number of records per ewe was 3.5 ± 1.02, ranging from 1 (5% of animals) to 5 (9.3% of animals); most of the ewes (55.5%) had four records.

First, in order to adjust for the main environmental effects, raw lactation records were analysed with single-trait repeatability animal models using the ASReml 4.1 software [40]. Genetic relationships between animals were taken into account by calculating the genomic relationship matrix [41] between 4513 animals, including F1 and SA sires and their genotyped ancestors. The animal model included as fixed effects the year-management-group interaction (37 levels), the year-month of lambing-parity-age class interaction (230 levels) and the milking length within age class (adult and primiparous) as a covariate. The average performance deviation (APD) of each ewe was calculated as the average of lactation records adjusted for fixed effects. The APD used in this study as pseudo-phenotypes for QTL detection differ from the yield deviations [42] that were used in similar studies in that the performances are not adjusted for permanent environmental effects in order to prevent inaccurate estimations due to the likely confounding between permanent environment and additive genetic effects. Indeed, Pearson’s correlations between additive genetic and permanent environmental effects from the repeatability animal model ranged, for the five analyzed traits, from 0.46 for FY to 0.50 for PP. Moreover, although in this study we shall investigate only additive effects, ADP include dominance and epistatic genetic effects when they exist and are suitable pseudo-phenotypes for testing such effects in further analyses. Finally, 3731 APD from as many ewes were available. To verify the suitability of the applied animal model, the ratio between the estimated genomic and total variance was compared with the heritabilities reported in the literature. In the same way, the correlations between APD and GEBV of different traits were compared with phenotypic and genetic correlations estimated in other studies.

Classification of gametes and reconstruction of gametic phases

By “gamete”, we refer to the whole haploid set of autosomes that are inherited by an individual from one of the two parents. Moreover, we classified the gametes of the population as base haplotypes (BH) when inherited from an ungenotyped parent and replicated haplotypes (RH) when inherited from a genotyped parent. The pool of BH comprised both gametes of F1 rams and of 63 SA rams, the maternal or paternal gametes of the 35 ewes with an unknown sire or dam, respectively, and the maternal gametes of the 928 back-cross ewes and of 85 SA rams for which the sire was genotyped. Finally, 1207 gametes were classified as BH, i.e. the 10 F1 sires paternal Lacaune gametes (\({\text{BH}}^{\text{L}}\)) and 1197 Sarda origin gametes (\({\text{BH}}^{\text{S}}\)). Then, all 7462 gametes (\({\text{n}}_{\text{RH}}\)) carried by the 3731 ewes with production records (\({\text{n}}_{\text{P}}\)) were considered as replicates (RH) of the 1207 BH (\({\text{n}}_{\text{BH}}\)). An example of how gametes were classified is given in Appendix.

The paternal and maternal inherited gametes of all the genotyped individuals were reconstructed by using a procedure based on the linkage disequilibrium multilocus iterative peeling method proposed by Meuwissen and Goddard [43]. In this method, the parental origin of the alleles carried by an individual is iteratively inferred on the genotypes of parents and offspring at a given locus if they are already phased or at the neighbouring phased loci if the phase at the given locus is unknown. Here, the LD at the population level was ignored, since the population structure was expected to allow a high level of precision using family relationships only. For individuals with both parents without a genotype, the paternal or maternal origin was arbitrary assigned. Genotypes for which the parental origin of alleles was assigned with a probability lower than 0.99 were assumed missing.

Calculation of IBD probabilities

The marked familial structure of the RP led us to exploit the information from the within-family linkage analysis (LA) in addition to that from the population-wide linkage disequilibrium (LD) to estimate IBD probabilities.

First, IBD probabilities between BH and RH were calculated by LA (\({\text{IBD}}_{\text{LA}}\)) given the known gametic phases and the pedigree information. The grand-parental origin of each RH was estimated at each SNP position with certainty when the genotype at a given position was not missing and the parent transmitting RH was heterozygous. When these conditions were not fulfilled, the probability of a grand-parental origin at a given locus was determined based on information from the closest neighbouring informative loci [44, 45]. Then, transmission from BH to RH was traced through generations following Fernando and Grossman [46]. At each SNP position \(l\), \({\text{IBD}}_{\text{LA}}\) probabilities were stored in a matrix \({\mathbf{H}}_{l}\) with size \({\text{n}}_{\text{RH}} \times {\text{n}}_{\text{BH}}\). Moreover, the number of replicates of a given BH in RP at each SNP position \(l\) (\({\mathbf{f}}_{l}\)) was calculated as \({\mathbf{f}}_{l} = {\mathbf{H}}_{l}^{\varvec{'}} 1\), where \(\bf{1}\) is a vector of \({\text{n}}_{\text{RH}}\) ones (see Appendix).

Secondly, IBD between BH were estimated by LD analysis (\({\text{IBD}}_{\text{LD}}\)) at each SNP position following Meuwissen and Goddard [47]. The \({\text{IBD}}_{\text{LD}}\) probability was conditioned to the identity-by-state (IBS) status of neighbouring SNPs using windows of 21 SNPs (10 upstream and 10 downstream resulting in an average window length of 1 Mb) and to the within-breed (Lacaune or Sarda) expected homozygosity. The \({\text{IBD}}_{\text{LD}}\) between \({\text{BH}}^{\text{S}}\) and \({\text{BH}}^{\text{L}}\) were assumed to be null. At each SNP position \(l\), \({\text{IBD}}_{\text{LD}}\) probabilities were stored in a matrix \({\mathbf{U}}_{l}\) with size \({\text{n}}_{\text{BH}} \times {\text{n}}_{\text{BH}}\).

Once we had precisely estimated the covariances between BH and between BH and RH as well as the BH number of replicates at each locus, the average values across loci were also calculated and stored in \({\mathbf{U}}_{g}\) (\({\text{n}}_{\text{BH}} \times {\text{n}}_{\text{BH}}\)), \({\mathbf{H}}_{g}\) (\({\text{n}}_{\text{RH}} \times {\text{n}}_{\text{BH}}\)) and \({\mathbf{f}}_{g}\) (\({\text{n}}_{\text{BH}}\)), respectively. These matrices will be used later to estimate genome-wide IBD probabilities between gametes to adjust for the polygenic effect of background genes.

Principal component analysis and QTL detection model

Hereafter we describe a novel LDLA approach for QTL mapping that, similarly to the basic LDLA model proposed by Meuwissen et al. [25], relies on the IBD information at the locus level and takes the effect of background genes into account. In their study, Meuwissen et al. [25] modelled the phenotypic records by the random effects of all the inherited gametes (\(\mathbf{h}\), twice the number of individuals) at a given position \(l\) and the random polygenic effects (\(\mathbf{u}\), i.e. the combined effect of background genes) of all the individuals. In the original model [25]: \({\mathbf{y}} = \bf {1} \mu+ {\mathbf{Zh}} + {\mathbf{u}} + {\mathbf{e}}\), the covariance matrix of gametic effects \({\mathbf{G}}_{l}\) included IBD probabilities between founder gametes that were obtained by LD analysis (\({\text{IBD}}_{\text{LD}}\)) [47] and IBD probabilities between founder and non-founder gametes and between non-founders gametes that were obtained by combining the corresponding transmission probabilities (\({\text{IBD}}_{\text{LA}}\)) with \({\text{IBD}}_{\text{LD}}\), using the algorithm described by Fernando and Grossman [46]. The additive polygenic covariance between individuals was considered through the numerator relationship matrix \({\mathbf{A}}\) based on pedigree information. In the basic method [25], the maximum likelihood estimates of the variance components were calculated at each putative QTL position \(l\). The application of this method implies some relevant issues related to the nature of the \({\mathbf{G}}_{l}\) matrix, which is usually dense and may turn out to be non-positive definite, and the computational needs in applying the variance component analysis at each investigated position. To overcome these issues, we propose a novel approach which uses the principal component analysis (PCA) to handle \({\mathbf{G}}_{l}\) and exploits the dimensional reduction of the model equations that may be achieved by PCA to estimate both QTL and polygenic effects. First, PCA is used to capture the IBD information at the locus level with the aim of overcoming the difficulties in inverting the \({\mathbf{G}}_{l}\) matrix, in a different way from previous strategies having the same purpose [33,34,35,36, 48] which frequently result in a loss of information [29]. Second, PCA is applied to the matrix of the genome- wide IBD probabilities between gametes (i.e. the average across loci of IBD probabilities locus-wide) which is used instead of the classical numerator relationship matrix (\({\mathbf{A}}\)) to take into account the polygenic effects. A similar approach was used by Rothammer et al. [49, 50], which applied PCA to reduce the dimension of the relationship matrix and used explanatory PC as fixed effects in their QTL detection model. Third, PC that explain most of the variability of both the locus-level and genome-wide IBD probability matrices are included in the model as fixed effects to estimate both QTL and polygenic effects by performing a least square analysis instead of the more computationally demanding variance component one.

Thus, at each SNP position \(l\) the model is the following:

$${\mathbf{y}} = \bf{1}\mu + {\mathbf{ZV}}_{l} {\varvec{\upbeta}}_{l} + {\mathbf{ZV}}_{g} {\varvec{\upalpha}}_{l} + {\varvec{\upvarepsilon}}$$
(1)

where \({\mathbf{y}}\) is a vector of APD of \({\text{n}}_{\text{p}}\) ewes for MY, PY, FY, PP and FP; \(\mu\) is the overall mean; \({\varvec{\upbeta}}_{l}\) is a vector of the fixed effects of the \({\text{n}}_{{{\text{PC}}_{l} }}\) principal components that explain more than 99% of the within breed variation (\({\text{PC}}_{l}\)) of the IBD probability matrix \({\mathbf{G}}_{l}\), i.e. \({\varvec{\upbeta}}_{l}\) summarizes the effects of haplotypes at the QTL position \(l\); \({\varvec{\upalpha}}_{l}\) is a vector of the fixed effects of the \({\text{n}}_{{{\text{PC}}_{g} }}\) principal components that explain more than 99% of the variation (\({\text{PC}}_{g}\)) of the genome-wide IBD probability matrix, i.e. \({\varvec{\upalpha}}_{l}\) summarizes the polygenic effects of the gametes; \(\bf{1}\) is a vector of \({\text{n}}_{\text{p}}\) ones; \({\mathbf{Z}}\) is a \({\text{n}}_{\text{p}} \times {\text{n}}_{\text{RH}}\) incidence matrix relating phenotypes with RH; \({\mathbf{V}}_{l}\) is a \({\text{n}}_{\text{RH}} \times {\text{n}}_{{{\text{PC}}_{l} }}\) matrix including the \({\text{PC}}_{l}\) scores of RH, \({\mathbf{V}}_{g}\) is a \({\text{n}}_{\text{RH}} \times {\text{n}}_{{{\text{PC}}_{g} }}\) matrix including the \({\text{PC}}_{g}\) scores of RH; \({\varvec{\upvarepsilon}}\) is a vector of \({\text{n}}_{\text{p}}\) residuals assuming that \({\varvec{\upvarepsilon}}\sim {\text{N}}\left( {0,\sigma_{\upvarepsilon}^{2} {\mathbf{R}}^{ - 1} } \right)\) with \({\mathbf{R}}\) a diagonal matrix with the APD’s reliability (\(r\)) as diagonal element. For each investigated trait (MY, PY, FY, PP and FP), reliabilities were calculated as \(r_{\text{i}} = 1 - {\text{se}}\left( {{\hat{\text{a}}}_{\text{i}} } \right)^{2} /\sigma_{\text{a}}^{2}\), from a repeatability linear model \({\text{y}}_{\text{ij}} = {\text{a}}_{\text{i}} + {\text{e}}_{\text{ij}}\), where \({\text{y}}_{\text{ij}}\) is the performance deviation \({\text{j}}\) (i.e. the lactation record \({\text{j}}\) adjusted for the fixed effects estimated with the full animal model) of ewe \({\text{i}}\), \({\text{a}}_{\text{i}}\) is the random ewe effect assuming that \({\mathbf{a}}\sim {\text{N}}\left( {0, \sigma_{\text{a}}^{2} {\mathbf{I}}} \right)\) and \({\text{e}}_{\text{ij}}\) is the corresponding error, assuming that \({\mathbf{e}}\sim {\text{N}}\left( {0, \sigma_{\text{e}}^{2} {\mathbf{I}}} \right)\).

Below, we explain how the PC scores of the \({\mathbf{V}}_{l}\) and \({\mathbf{V}}_{g}\) matrices were calculated. In addition, a numerical example is given in Appendix.

As far as the \({\mathbf{V}}_{l}\) elements are concerned, in order to limit the computational requirements to extract PC directly from the large (\({\text{n}}_{\text{RH}} \times {\text{n}}_{\text{RH}}\)) \({\mathbf{G}}_{l}\) matrix, the PCA was performed on a \({\text{n}}_{\text{BH}} \times {\text{n}}_{\text{BH}}\) matrix denoted as \({\mathbf{U}}_{l}^{w}\), where the \({\text{IBD}}_{\text{LD}}\) probabilities between BH, stored in \({\mathbf{U}}_{l}\), were weighted for the \({\text{IBD}}_{\text{LA}}\) probabilities between BH and RH by condensing \({\mathbf{H}}_{l}\) information in a \({\text{n}}_{\text{BH}} \times {\text{n}}_{\text{BH}}\) diagonal matrix \({\mathbf{F}}_{l}\), in which the diagonal elements are the number of RH of each BH (stored in \({\mathbf{f}}_{l}\), where \({\mathbf{f}}_{l} = {\mathbf{H}}_{l}^{\varvec{'}} 1\)).

The \({\mathbf{U}}_{l}^{w}\) matrix is defined as:

$${\mathbf{U}}_{l}^{w} = {\mathbf{F}}_{l}^{1/2} {\mathbf{U}}_{l} {\mathbf{F}}_{l}^{1/2} .$$
(2)

PCA was carried out on \({\mathbf{U}}_{l}^{w}\) by using the Jacobi algorithm. Eigenvectors (\({\mathbf{V}}_{l}^{w}\)) relating to the largest principal components that together explain more than 99% of the within-breed variation (\({\text{PC}}_{l}\)) were retained. Finally, \({\text{IBD}}_{\text{LA}}\) probabilities between BH and RH (\({\mathbf{H}}_{l}\)) were combined with \({\mathbf{V}}_{l}^{w}\) to define the \({\text{PC}}_{l}\) scores of RH (\({\mathbf{V}}_{l}\)) by:

$${\mathbf{V}}_{l} = {\mathbf{H}}_{l} {\mathbf{F}}_{l}^{ - 1/2} {\mathbf{V}}_{l}^{w} .$$
(3)

Note that when \({\text{IBD}}_{\text{LA}}\) between BH and RH are estimated with certainty and, thus, \({\mathbf{H}}_{l}\) only contains 0 and 1, then: \({\mathbf{H}}_{l} {\mathbf{U}}_{l} {\mathbf{H}}_{l}^{'} = {\mathbf{G}}_{l}\) and \({\mathbf{F}}_{l} = {\mathbf{H}}_{l}^{'} {\mathbf{H}}_{l}\); eigenvalues from \({\mathbf{G}}_{l}\) correspond to eigenvalues from \({\mathbf{U}}_{l}^{w}\) for the explanatory principal components (\({\text{PC}}_{l}\)) and \({\text{PC}}_{l}\) scores from \({\mathbf{G}}_{l}\) correspond to \({\mathbf{V}}_{l}\) (see Appendix). When \({\text{IBD}}_{\text{LA}}\) between BH and RH are estimated without certainty and \({\mathbf{H}}_{l}\) contains intermediate values between 0 and 1, \({\text{PC}}_{l}\) scores from \({\mathbf{G}}_{l}\) and \({\mathbf{V}}_{l}\) do not correspond perfectly and differences tend to increase as the uncertainty of \({\text{IBD}}_{\text{LA}}\) probabilities increases. This is because \({\mathbf{U}}_{l}^{w}\) only considers the covariance that derives from \({\text{IBD}}_{\text{LD}}\) excluding the covariance between RH pairs generated by imprecise estimation of transmissions of BH. This effect is negligible in our experiment because most transmission probabilities are estimated with certainty.

Since the \({\text{IBD}}_{\text{LD}}\) between \({\text{BH}}^{\text{S}}\) and \({\text{BH}}^{\text{L}}\) was set to 0 and two sets of breed-specific PC were obtained, the matrix \({\mathbf{V}}_{l}\) can be detailed as \(\left[ {{\mathbf{V}}_{l}^{\text{S}} {\mathbf{V}}_{l}^{\text{L}} } \right]\). where \({\mathbf{V}}_{l}^{\text{S}}\) and \({\mathbf{V}}_{l}^{\text{L}}\) are the \({\text{PC}}_{l}\) summarising IBD probabilities between the gametes of Sarda and Lacaune origin, respectively. In Eq. (1) \({\mathbf{V}}_{l}\) elements, which are related by the incidence matrix \({\mathbf{Z}}\) to phenotypes, are used as covariates on the investigated traits to estimate QTL effects at locus \(l\) (\({\varvec{\upbeta}}_{l}\)).

As far as the \({\mathbf{V}}_{g}\) elements are concerned, the PCA performed directly on the weighted genome-wide \({\text{IBD}}_{\text{LD}}\) matrix, (\({\mathbf{U}}_{g}^{w}\)) computed as in Eq. (2) resulted in 1022 PC that were needed to capture 99% of the total variation. This limited dimensional reduction is due to the moderate genome-wide \({\text{IBD}}_{\text{LD}}\). probabilities between BH (on average around 0.1) and the small number of replicates of some BH on RP. Thus, in order to not over-parameterize the model, we considered the BH with the highest impact on RP (\({\text{BH}}_{\text{h}}\)). Then, to recover information from BH with few RH, a matrix of coefficients \({\mathbf{W}}\) relating \({\text{BH}}_{\text{h}}\) to all the BH was calculated as:

$${\mathbf{W}} = {\mathbf{U}}_{{g\__{h} }} {\mathbf{U}}_{{g\__{hh} }}^{ - 1} ,$$
(4)

where \({\mathbf{U}}_{g\_h}\) is the section of \({\mathbf{U}}_{g}\) including \({\text{IBD}}_{\text{LD}}\) probabilities between all the BH with \({\text{BH}}_{\text{h}}\) and \({\mathbf{U}}_{g\_hh}^{ - 1}\) is the inverse of the section of \({\mathbf{U}}_{g}\) including \({\text{IBD}}_{\text{LD}}\) probabilities between \({\text{BH}}_{\text{h}}\) pairs. The average number of replicates per \({\text{BH}}_{\text{h}}\) was then updated as \({\mathbf{f}}_{g\_h} = {\mathbf{W^{\prime}f}}_{g}\). The \({\text{BH}}_{\text{h}}\) set was iteratively selected as the smallest group of BH that satisfied the condition that \(\sum {\mathbf{f}}_{g\_h} /{\text{n}}_{\text{RH}}\) is higher than 0.99. According to the analysis at the locus level, the Jacobi algorithm was performed on the matrix \({\mathbf{U}}_{g\_hh}^{w}\) computed as:

$${\mathbf{U}}_{g\_hh}^{w} = {\mathbf{F}}_{g\_h}^{1/2} {\mathbf{U}}_{{g\_hh\varvec{ }}} {\mathbf{F}}_{g\_h}^{1/2} ,$$
(5)

where \({\mathbf{F}}_{g\_h}\) is a diagonal matrix, with its diagonal elements being the number of replicates stored in \({\mathbf{f}}_{g\_h}\). Eigenvectors (\({\mathbf{V}}_{g\_hh}^{w}\)) of the largest principal components that together explain more than 99% of the total variation of \({\mathbf{U}}_{g\_hh}^{w}\) (\({\text{PC}}_{g}\)) were retained. Finally, genome-wide \({\text{IBD}}_{\text{LA}}\) probabilities between BH and RH (\({\mathbf{H}}_{g}\)) were combined with \({\mathbf{V}}_{g\_hh}^{w}\) to define \({\text{PC}}_{g}\) scores of RH (\({\mathbf{V}}_{g}\)) by:

$${\mathbf{V}}_{g} = {\mathbf{H}}_{g} {\mathbf{WF}}_{g\_h}^{ - 1/2} {\mathbf{V}}_{g\_hh}^{\text{w}} .$$
(6)

In Eq. (1), \({\mathbf{V}}_{g}\), scores which are related by the incidence matrix \({\mathbf{Z}}\) to phenotypes, are used as covariates on the investigated traits to estimate polygenic effects (\({\varvec{\upalpha}}_{l}\)).

Note that \({\varvec{\upbeta}}_{l}\) and \({\varvec{\upalpha}}_{l}\) vectors in Eq. (1) are both fixed effects that are depicted separately to highlight that the model aims at estimating QTL effects (\({\varvec{\upbeta}}_{l}\)) while adjusting for the background of polygenes (\({\varvec{\upalpha}}_{l}\)). Moreover, covariates related to \({\varvec{\upbeta}}_{l}\) \(\left( {{\mathbf{ZV}}_{l} } \right)\) are locus-specific whereas covariates related to \({\varvec{\upalpha}}_{l}\) \(\left( {\mathbf{ZV}_{g} } \right)\) remain constant throughout the genome.

In addition, in accordance with \({\mathbf{V}}_{l}\), \({\varvec{\upbeta}}_{l}^{'}\) can be detailed as \(\left[ {{\varvec{\upbeta}}_{l}^{{'{\text{S}}}} {\varvec{\upbeta}}_{l}^{{'{\text{L}}}} } \right]\), where \({\varvec{\upbeta}}_{l}^{\text{S}}\) and \({\varvec{\upbeta}}_{l}^{\text{L}}\) summarize the effects of the Sarda and Lacaune gametes, respectively.

The model was tested at each SNP position by F-tests. Three null hypotheses were tested, \({\text{H}}_{0}\): \({\varvec{\upbeta}}_{l}\) = 0; \({\text{H}}_{0}\): \({\varvec{\upbeta}}_{l}^{\text{S}}\) = 0 and \({\text{H}}_{0}\): \({\varvec{\upbeta}}_{l}^{\text{L}}\) = 0. In the current study, only \({\text{BH}}^{\text{S}}\) tests, corresponding to \({\text{H}}_{0}\): \({\varvec{\upbeta}}_{l}^{\text{S}}\) = 0, will be analysed and discussed.

Genome-wide (GW) significance thresholds were determined by 2000 within-trait permutations of the residuals (\({\varvec{\upvarepsilon}}\)) of the reduce model \({\mathbf{y}} = \bf{1}\mu + {\mathbf{ZV}}_{g} {\varvec{\upalpha}} + {\varvec{\upvarepsilon}}\), where only the polygenic effects were considered. In order to break free from differences in the number of degrees of freedom at each SNP position, genome-wide maxima of the negative logarithms of p-values \([ - \log_{10} \left( {\text{p{-}value}} \right)]\) from each permutation were used to construct the null distribution.

Results

Production data and phenotypes for QTL detection

The ratios between genomic and total variance estimated by a single-trait animal model (Table 1) are consistent with the estimates of heritabilities for dairy traits in the literature [2]. Content traits were more heritable than yield traits. The most and the less heritable traits were PP and PY, respectively. APD and GEBV correlations between traits showed similar values and were consistent with phenotypic and genetic correlations reported in previous studies on sheep [2]. Strong correlations (from 0.88 to 0.95) were observed between yield traits, and moderate positive correlations were observed between content traits (0.58 and 0.62). MY was negatively correlated with both content traits, while the correlations of the other two yield traits with content traits were low.

Table 1 Ratios between genomic and total variance (diagonal, standard errors of estimates in brackets) and correlations between APD (above the diagonal) and between GEBV (below the diagonal)

Reconstruction of gametic phases

Preliminary editing of data led to remove SNPs with more than 5% missing genotypes and with a minor allele frequency (MAF) lower than 0.01. Only SNPs located on the 26 autosomes were retained.

The phasing procedure allowed the reconstruction of the sequence of the alleles carried by the investigated BH and RH for more than 99.5% of the SNP positions. After phasing, another 120 SNPs were excluded, because their genotypes were inconsistent with the phase estimated from neighbouring SNPs. Finally, 43,390 SNPs were retained for further analyses. The explored genomic portion was 2437 Mb long and the average distance between SNPs was 56 ± 49 kb with a maximum gap of 2.377 Mb on Ovis aries (OAR) chromosome 21.

IBD probability calculation

On average, the maximum locus-wide \({\text{IBD}}_{\text{LA}}\) probability between each RH and all the BH was 0.99, which indicates the precise reconstruction of the meioses occurred across RP generations.

The distribution of the genome-wide number of replicates of the 1207 BH in RP (\({\mathbf{f}}_{g}\)) are depicted in Fig. 1. The impact of BH on RP was extremely variable, in fact the average number of replicates per BH ranged from 1 to 202. The locus-wide \({\text{IBD}}_{\text{LD}}\) probability between BH pairs was zero in 68% of cases. The distribution of non-zero \({\text{IBD}}_{\text{LD}}\) probabilities is shown in Fig. 2. The 13% and 6% of the locus-wide \({\text{IBD}}_{\text{LD}}\) were lower than 0.05 and higher than 0.95, respectively, which suggests that, for a large proportion of BH pairs, it would have been possible to approximate the IBD status to 0 or 1. However, another 13% of locus-wide \({\text{IBD}}_{\text{LD}}\) showed intermediate values, for which the approximation to 0 or 1 would have been less accurate.

Fig. 1
figure 1

Genome-wide average number of replicates (\({\mathbf{f}}_{g}\)) per base gamete (BH, n = 1207) across the gametes of the resource population (RH; n = 7462)

Fig. 2
figure 2

Distribution of non-zero \({\text{IBD}}_{\text{LD}}\) probabilities between base gametes (BH; n = 1207) at the locus (blue) and genome-wide (orange) levels

Most of the BH pairs showed a genome-wide \({\text{IBD}}_{\text{LD}}\) probability equal to 0.10, with 95% having values ranging from 0.07 to 0.16 (Fig. 2). This result confirms that the original pool of Sarda gametes as well as the rams used to generate the yearly replacement of RP show a rather large genetic variability.

Principal component analysis and QTL detection

The distribution of the number of PC needed to capture 99% of the locus-wide variability is shown in Fig. 3. The average number of \({\text{PC}}_{l}\) was 32.3 ± 6.4 with a maximum of 74 and a minimum of 9. As far as the breed of origin is concerned, the number of \({\text{PC}}_{l}\) explaining 99% of variation due to \({\text{BH}}^{\text{S}}\) (see Additional file 1) and \({\text{BH}}^{\text{L}}\) were 24.1 ± 6.0 and 8.2 ± 1.2, respectively.

Fig. 3
figure 3

Frequencies across all loci (43,390 SNPs) of the number of principal components capturing 99% of the variation at the locus level (\({\text{PC}}_{l}\))

Concerning the genome-wide analysis, 139 BH with the highest impact on RP (\({\text{BH}}_{\text{h}}\)) were selected on the basis of \({\mathbf{f}}_{g}\) and the genome-wide \({\text{IBD}}_{\text{LD}}\) probabilities between BH pairs. The \({\text{BH}}_{\text{h}}\) set included all 10 Lacaune gametes and 129 Sarda gametes. The sum of the original number of replicates (\({\mathbf{f}}_{g}\)) of \({\text{BH}}_{\text{h}}\) was 0.69 (i.e. 69% of the RH were replicates of \({\text{BH}}_{\text{h}}\)). The remaining 30% of the RH variation was accounted for through the coefficients included in the \({\mathbf{W}}\) matrix and derived from \({\text{IBD}}_{\text{LD}}\) probabilities between \({\text{BH}}_{\text{h}}\) and other BH. At the genome-wide level, the number of \({\text{PC}}_{g}\) needed to capture 99% of the variation was 137.

The distributions of the genome-wide maxima of \(- \log_{10} \left( {\text{p{-}value}} \right)\), corresponding to the null hypothesis \({\text{H}}_{0}\): \({\varvec{\upbeta}}_{l}^{\text{S}}\) = 0, obtained by 2000 within-trait permutations, did not show relevant differences across traits (see Additional file 2). The 5% threshold ranged from 5.59 to 5.69. Thus, the most conservative value, 5.69, was retained as the common GW threshold for all the analysed traits.

Overall, 2563 positions exceeded the 0.05 GW significance threshold for at least one trait (Fig. 4) (see Additional file 3). There were 200, 108, 122, 918 and 1927 SNP positions significantly associated with MY, FY, PY, FP and PP, respectively. The number of significant positions affecting simultaneously one, two, three and four traits was 1943, 546, 56 and 18, respectively. Several significant positions were adjacent, which may be due to linkage disequilibrium between locations. In order to account for such dependency, significant positions were clustered into QTL regions (QTLR). The correlations between \({\mathbf{ZV}}_{l} {\varvec{\upbeta}}_{l}\) (corresponding to the second term of the model Eq. 1) were calculated for all pairs of significant SNPs on the same chromosome. Then, the strongest signal at the chromosome level was retained as the peak of the first QTLR. The peaks of further QTLR along the chromosome were iteratively identified among the significant locations that had correlations lower than 0.15 with the already defined QTLR peaks. Finally, the remaining significant positions were assigned to the QTLR with which they had the highest correlation. When QTLR for different traits overlapped, we considered them as a unique QTLR. This procedure may underestimate the true number of QTL if more than one gene affecting the trait(s) is located in the same genome region.

Fig. 4
figure 4

Manhattan plots showing − log10 (nominal p-values) corresponding to the null hypothesis that the effects of principal components that explain 99% of the variability due to the Sarda base gametes (BHS) at each locus (43,390 SNPs) are zero. The dashed black lines indicate the 0.05 genome-wide significance threshold determined by permutations. MY milk yield; FY fat yield; PY protein yield; FP fat content; PP protein content

Details of the 75 defined QTLR are in Table 2. QTLR were detected across all 26 autosomes except OAR26. The largest number of QTLR (10) was detected on OAR1. Overall, 12, 11, 10, 46 and 43 QTLR significantly affected MY, FY, PY, FP and PP, respectively. Among these 75 QTLR, 46 were significant for one trait only: two for MY, two for FY, two for PY, 23 for FP and 17 for PP; 20 were significant for two traits: one for MY and FY, one for MY and PP and 18 for FP and PP; two QTLR were significant for three traits: one for MY and the two content traits and one for the three yield traits; five QTLR were significant for four traits: one for the three yield traits and FP, three for the three yield traits; and PP, and one for FY, PY, FP and PP; and finally two QTLR significant for all five investigated traits.

Table 2 Details of the QTL regions that include SNP positions exceeding the 0.05 genome-wide significance threshold

The strongest signal was obtained for PP on OAR6 at 85.34 Mb where a nominal p-value of 11.12*10−67 was observed. The corresponding QTLR harboured significant positions also for FP and MY. The most significant position for FP (p-value = 1.26*10−15) was observed at 12.34 Mb on OAR4. This QTLR affected also PP. The most significant results for the three yield traits (p-value = 2.41*10−12, 5.89*10−10 and 2.20*10−11 for MY, FY and PY, respectively) were detected at 55.43 Mb on OAR11 where a significant peak for FP was also identified (Table 2).

Discussion

Power, precision, robustness of QTL mapping experiments in complex populations may be affected by several issues (size of the experiment, number and frequency of base haplotypes, density of marker maps). As described above, the resource population investigated here is constituted by families based on male ancestors. LDLA mapping approaches are expected to be more suitable than linkage and genome-wide association analyses to fine map QTL regions in such populations. In fact, the LDLA analysis combines both the within-family linkage and population-wide linkage disequilibrium information to estimate IBD probabilities between haplotypes [51].

The proposed approach allowed us to solve the model by the least square method, which avoid a computational expensive variance component analysis. The advantages of approaches based on LDLA regression versus those on variance components, in term of ease of use and computing time, are well known [29] and have been clearly demonstrated by Roldand et al. [52].

In our study, LDLA mapping greatly benefits from the structure of the population, in which the ewes born after the first generation have both parents genotyped, which allows a precise reconstruction of the base gametes of the population and their transmission through generations. IBD information can be efficiently captured with PCA and the computational constraints due to the multi-collinearity generated by the high \({\text{IBD}}_{\text{LD}}\) probabilities that may occur between pairs of BH at the locus level can be overcome. The use of PCA avoided the implementation of prior clustering of BH or approximations in the IBD probability estimation [33, 34].

Moreover the strategy used here to collapse IBD information into principal components (i.e. the use of \({\mathbf{U}}_{l}^{w}\) instead of \({\mathbf{G}}_{l}\)) is computationally efficient. It relies on the high precision of LA for defining the ancestral origin of each gamete, which is possible for populations with a strong familial structure. The effectiveness of the method in populations with weaker familial structure should be investigated.

Depending on the eigenvalues threshold used for PCs selection, the proposed approach allows to capture most of the IBD variation with a dramatic decrease in the number of effects to estimate. Although the direct solutions of the analysis are the effects of explanatory PC, the effect corresponding to each BH at position \(l\) can be easily calculated (\({\varvec{\upbeta}}_{l}^{\text{BH}} = {\mathbf{F}}_{l}^{ - 1/2} {\mathbf{V}}_{l}^{w} {\varvec{\upbeta}}_{l}\)). Effects and frequencies of BH may be used as basic information to identify the BH that contribute most to the significance of a given locus. In fact, in their study on the identification of putative causative mutations that affect the protein content, Casu et al. [22] used such information to select individuals for whole-genome re-sequencing.

The QTL detection model proposed here included a fixed factor to adjust for the polygenic background of each individual based on the effect of PC that summarize the genome-wide BH variation. However, at the genome-wide level \({\text{IBD}}_{\text{LD}}\) probabilities between BH had moderate values when averaged across the genome. Indeed, most PC showed small eigenvalues and many of them were necessary to explain most of the variability. Thus, we applied an approach that aimed at reducing the number of BH to be included in the PCA to those that had the highest impact on RP (\({\text{BH}}_{\text{h}}\)) by taking their probability to be carried by an individual with a record into account.

During the development of our method, we applied it to the dataset that was simulated for the XVI QTLMAS meeting [53]. The results of QTL detection were very close to those reported by Garzia Gamez et al. [53] who used a more classical LDLA method [33, 53], which implemented variance component analysis and accounted for an individual random polygenic effect (see Additional file 4).

Overall, a large number of genomic regions that significantly affected milk traits were detected in this study. The number of detected regions largely exceeded those obtained by other LDLA studies in dairy sheep for milk production traits [16]. This larger number of detected QTLR is probably due to the larger size of the analyzed population. Indeed, Garzia-Gamez et al. [16] performed a LDLA mapping on a population of about 1700 Churra ewes that were organized in 16 half-sib families and they detected 34 genome-wide significant regions.

Consistent with the estimates of heritabilities, the number of QTLR that affected content traits was larger than that for yield traits. Several positions suggested pleiotropic effects. Twenty-nine QTLR affected more than one trait: nine affected at least two yield traits and frequently one or both of the content traits, four affected MY and both content traits, and 18 were significant for both content traits.

A very long list of positional candidate genes was obtained by overlapping the sheep genome reference (Oar_v4.0) with each QTLR. Overall, 745 annotated genes were detected but only a few of these were cited as dairy-related in previous studies on cattle [54] and sheep [55]. Among these, the most interesting genes were those in the casein cluster (CSN1S1, CSN1S2, CSN2 and CSN3), which is mapped to OAR6 within the 85.00–85.23 Mb interval. This interval overlaps with the position of the strongest signal for PP found in this study. At almost the same position, a QTL for PP was detected in Churra sheep by GWA [21]. A deeper investigation of this region is ongoing by whole-genome re-sequencing of individuals that carry BH with large effects at the significant location. The aim is to list the candidate causative mutations by performing specific association studies of all the polymorphisms included in the genomic region [22].

The QTLR that affects PP and FP on OAR3 at 137.3 Mb overlaps with the α-lactalbumin gene (LALBA). This gene was previously reported as a strong candidate for PP and has been deeply investigated in the Churra breed [16, 21]. Two other interesting candidate genes are the growth hormone receptor (GHR) and transcription factor AP-2 gamma (TFAP2C) genes. GHR is located on OAR16 within the 31.83-32.00 Mb interval, where a QTLR that affects yield traits was detected. Previous studies in dairy cattle and sheep have shown that GHR affects milk production [54, 56]. TFAP2C, which is involved in the development, differentiation, and oncogenesis of the mammary gland [55], overlaps with a QTLR that is significant for MY, FP and PP on OAR13 at 58.6 Mb.

Conclusions

We present a simple least square model to map QTL. It combines linkage disequilibrium and linkage analysis information and accounts for the polygenic effects of base gametes. The use of principal component analysis was found to be a good strategy to reduce the computational burden. A large number of regions associated to the variability of milk traits were identified. The outputs provided by this method are useful for the selection of individuals and genes that need to be further investigated for identifying causative mutations or markers in strong linkage disequilibrium with causative variants and for implementing them in genomic selection programs.