Background

The biological resources of the Mediterranean sub-region of the Palaearctic include a diversity of domesticated animals [1] comprising 53 officially recognized local breeds of taurine cattle (Bos taurus) in the Iberian Peninsula alone (Additional file 1: Table S1). Taurine cattle are thought to have been domesticated by Neolithic farmers from Bos primigenius populations in the Fertile Crescent around 10,000 years [2], and have since diversified into more than 1000 breeds [3]. Cattle genomes have been shaped not only by human-driven selection, but also by genetic bottlenecks associated with migrations from the origin of domestication, adaptation to different agro-ecological areas and a more strict division of animal populations into breeds led by Europeans since the eighteenth century [3]. Furthermore, multiple events of introgression have been proposed to have influenced European cattle breeds: i) ancestral hybridization with European populations of B. primigenius [4,5,6,7,8,9] (extinct in Europe since the seventeenth century [9]); ii) introgression from African taurine cattle [10]; iii) introgression from non-taurine sources such as indicine breeds (Bos indicus, the humped cattle type resulting from a separate domestication event in the Indus valley [11]) [10, 12]. Such wide-spread gene-flow resulted in complex patterns of admixture and the difficulty in sometimes establishing whether a breed represents the taurine populations that were originally associated with a specific geographic region [10] and could explain the high levels of genetic diversity relative to other domesticated species [12].

Currently, there are two broad groups of cattle breeds, those under intensive management with strong specialization in dairy or meat phenotypes (such as the commercial transboundary Holstein, Charolais, Limousine, and more recently Angus), and the so-called “primitive” breeds, traditional cattle with a low dependence from external inputs that make use of naturally available food resources. Iberian native cattle are found in diverse agro-ecological systems including coastal, mountain, and lowland arid environments (Fig. 1a). Inheritable traits of these cattle have been modified at different times by the various cultures that inhabited this territory, and breeds are often defined based on morphological traits such as coat color, as well as horn size and body shape. For example, osteometric data validated by the molecular sexing of cattle metacarpals dated to the fifteenth century indicated that there was an overall size increase or improvement of cattle in southern Portugal following the Christian reconquista of the 11th–13th centuries AD [13]. Recently, the Food and Agriculture Organization (FAO) has warned that about 67% of the Iberian cattle breeds are at risk as many of these have less than 1000 breeding females and/or less than 20 breeding males [3], which reinforces the need for a continued conservation strategy. The complex origin of the Iberian primitive breeds is reflected in their high diversity in Y-chromosome haplotypes, including the major taurine Y1 and Y2 haplogroups [14, 15] and unique patrilines [16], as well as distinct maternal lineages, i.e. common European T3-matrilines along with more distinct Q-haplotypes [15, 17], and a strong influence of T1-lineages of African origin [18]. This higher diversity relative to their European counterparts is quite notable, given the geographic distance of this territory from the presumed Near-Eastern center of domestication [4, 14, 15, 19, 20]. This makes Iberian cattle a great example for investigating the genomic impact of the intricate processes of cattle diversification both regarding the last 200 years of specific breed formation and the putative earlier admixture events.

Fig. 1
figure 1

a Geographical distribution of the eight Iberian native breeds (maps from https://commons.wikimedia.org/). b Population structure plot determined by NGSadmix shows consistency with breed denomination; each individual is represented by a stacked column of the 2, 5 and 8 proportions (other K values in Fig. S2). c Reproductive isolation of the Mirandesa and Brava breeds relative to the others is clear in the principal component analysis done with PCAngsd; variance explained by each component is shown in parenthesis (other components are in Fig. S3). Colors denote breed names

To uncover genome-wide patterns of diversity associated with the formation of primitive cattle breeds, we sequenced the genomes of 48 individuals belonging to eight breeds of native Iberian cattle (Fig. 1a). Their breed denominations have been shown to agree with population structure inferred from microsatellites [19,20,21,22]. Noteworthy, no clear structure is recovered when using genotypes determined with the Illumina Bovine High-Density 777 k SNP BeadChip in the context of European cattle [4], likely a result of ascertainment bias as Iberian breeds were not included in the discovery panel of the genotyping assay. This reinforces the need for full genome data to accurately determine genetic diversity and measure population differentiation [23].

We confirm that there is a clear genetic distinction between Iberian cattle breeds. In addition, we demonstrate that breed management and associated demographic processes had profound effects on genomic diversity and resulted in unusual patterns of genetic differentiation for autosomes vs sex chromosomes. We further describe genome-wide diversity and introgression in Iberian breeds in relation to 60 previously published taurine (B. taurus) and zebu (B. indicus) cattle genomes from Europe, Africa and Asia [24], and sequence data from one European aurochs (B. primigenius) [5]. We confirm that gene flow has occurred between African taurine and Iberian breeds. Overall, we show how whole-genome data are important for uncovering specific patterns related to recent events in breed formation and management, and provide the ground for future studies on the singularity of locally adapted European cattle breeds.

Results

The 48 Iberian cattle genomes and the previously published shotgun resequencing data from 60 additional individuals including taurine and indicine cattle (Table 1; Additional file 1: Table S2) were mapped with BWA mem to three reference genomes: genome version UMD_3.1.1 (bosTau8) [25], genome version Btau_4.6.1 (bosTau7; contains an assembled Y-chromosome) [25] and to the outgroup wild yak (B. mutus) [26]. Details on the quality-based read trimming and filtering steps are included in the Methods section. Sequencing error rates for all 48 samples are below 0.2% (Additional file 1: Figure S1).

Table 1 Summary of the data sets used in this study

Population structure and individual ancestry were investigated with NGSadmix, which does not require definition of the exact genotypes thus is adequate for low-depth sequencing data [27]. Setting the number of expected clusters to eight (the number of breeds) resulted in the assignment of each individual to the source breed (Fig. 1b) while assuring convergence of the method. This level of genetic homogeneity within Iberian cattle populations is also observed in the results of the principal components analyses (Fig. 1c). The first two PCs explain 10 and 9% of the total variation and show the high differentiation of Mirandesa and Brava. Mirandesa in fact appears as an independent cluster when the number of ancestral K populations is set to two (Fig. 1b), and Brava individuals become a separate cluster when K = 4 (Additional file 1: Figure S2).

We assessed the levels of genetic differentiation between breeds by calculating the fixation index (FST). In general, we observed high levels of differentiation (average 9%), even when admixture has occurred, which precludes the use of Iberian cattle as a single evolutionary unit. Consistent with their higher heterogeneity, the breed pair Arouquesa/Mertolenga had a low FST value of 0.06. The highest FST corresponded to the pairwise comparison of Mirandesa and Alentejana (FST = 0.16) and the lowest FST values were obtained for Preta vs Mertolenga (FST = 0.04) (Table 2).

Table 2 FST values between the eight Iberian breeds. The highest value is shown in bold and the lowest in italic

FST values between Iberian breeds and other taurine cattle ranged from 12 to 33%, partially overlapping the divergence values observed for comparisons within Iberian breeds (Table 2). Mirandesa, the most divergent within the Iberian breeds, has the highest FST values relative to all other breeds (Fig. 5a). The taurine breed with the overall highest FST relative to the Iberian was the Jersey cattle which may be explained by the insular isolated status of this breed [28], although we must note that this might not be a representative sample of the breed. Overall, the FST for autosomes was much higher than for chromosome X within taurine breeds and within the indicine cattle (Fig. 5b; Additional file 1: Figure S7).

When compared to publicly available genomes [24] (database information in the Materials section) of taurine (European Holstein, Angus and Jersey, and African N’Dama) and African indicine cattle (Ogaden, Kenana and Borana), Iberian breeds are clearly assigned by NGSadmix [27] to the European cluster (Fig. 2a) with a slight suggestion of African taurine admixture at K = 3 for autosomal data. As observed previously [24], at K = 3 the clusters observed represent European taurine, African taurine and African indicine ancestries.

Fig. 2
figure 2

a Population structure using 108 individuals at K = 3 clearly divides the European taurine (blue), African taurine (green) and African indicine (pink) ancestries. b Treemix maximum likelihood tree depicting the relationships between taurine cattle breeds (grey: Illumina BovineHD SNP data; black: whole genome data). c Nucleotide diversity in taurine and indicine breeds (Iberian breed names in black)

All analyzed breeds have a positive Tajima’s D (Additional file 1: Figure S4). As observed previously [24], the commercial European breeds have lower nucleotide diversity (average number of pairwise differences) relative to the African breeds (Fig. 2c). The Iberian breeds analyzed in this study have, overall, similar or higher values of nucleotide diversity compared to their European counterparts. The lowest values correspond to Mirandesa, Brava and Alentejana, which had been previously shown to have the lowest heterozygosity in a microsatellite panel [19, 20]).

We then used the maximum likelihood approach implemented in Treemix [29] to uncover the historical relationships between the breeds (Fig. 2b). We intersected our whole genome data with the Illumina BovineHD SNP data of 25 European primitive breeds from [4], which shows that our selection of breeds is representative of the Iberian breed context (Fig. 2b). When allowing for one migration event, we observe gene flow from African taurine to the base of the Iberian clade (Fig. 2b) which had been previously suggested to have occurred [15, 17, 18].

We explicitly test for differential African cattle introgression into Iberian breeds, using the D-statistics [30, 31] and publicly available genomes [24] (database information in the Materials section) of taurine (European Holstein, Angus and Jersey, and African N’Dama). We can confirm that there is a significant excess of shared derived alleles in varying amounts between Iberian breeds and the African taurine N’Dama when compared to a panel of European taurine breeds (Fig. 3). This was observed both for southern Iberian Brava that had the largest African (N’Dama) influence, but also in breeds from the north of Portugal such as Barrosã. These results are further supported by the occurrence of ~ 17% of T1-matrilines in the Iberian cattle analyzed here (Fig. 4). We found no evidence for either indicine or aurochs introgression into Iberian cattle (Additional file 1: Figure S5).

Fig. 3
figure 3

D-statistics determined using genome-wide autosomal data. Negative values indicate an excess of derived alleles shared by the breeds in H1 (denoted in the y-axis) and the African N’Dama breed in comparison with European taurine breeds (H2)

Fig. 4
figure 4

Maximum-likelihood phylogeny of cattle mitogenomes showing that Iberian breeds can be assigned to haplogroups Q, and T, including sub-haplogroup T1 typical of African cattle. Breed acronyms are as follows: ALT, Alentejana; Arouquesa, ARO; Barrosã, BAR; Brava de Lide, BRA; Maronesa, MRO; Mertolenga, MER; Mirandesa, MIR; Preta, PRE; Kenana, KEN; Borana, BOR; Ogaden, OGA; N’Dama, DAM; Holstein, HOL; Jersey, JER

Discussion

Signatures of breeding in the population structure and genetic differentiation of Iberian cattle breeds

The observed genetic homogeneity within Iberian cattle and the high differentiation of Mirandesa and Brava are expected to result from genetic drift due to drastic demographic changes: in the 1970s, Mirandesa was raised in a vast area of the Portuguese territory with over 200,000 animals [32] and since has suffered a significant reduction in population size with less than 6000 breeding females registered in the herd book in 2017 (http://www.fao.org/dad-is/browse-by-country-and-species/en/); Brava has historically been reproductively isolated from other breeds living in semi-feral conditions for the main purpose of its use in bullfights [32]. PCs 3 and 4 separate Alentejana and Preta from the remaining Portuguese native breeds, whereas Maronesa, Barrosã and Mertolenga are separated by PCs 5 and 6 (Additional file 1: Figure S3).

Recent crossbreeding involving Arouquesa cattle is revealed in it being the last to form a discrete cluster, showing contributions from the other populations until K = 7 (Fig. 1b and Additional file 1: Figure S2). This is consistent with an analysis of microsatellite loci, which showed Arouquesa as having the lowest mean genotype membership proportions [19]. This breed is mostly raised in a region located south of the Douro river in the district of Viseu (Fig. 1a), bordering the area of production of Maronesa and in remote times also of the once abundant Mirandesa cattle. Arouquesa has also historically been crossbred with the latter to produce the highly valued “vitela de Lafões”, a meat product certified by the European Union with Protected Geographical Indication, and so admixture is intrinsically linked to its history. Another breed showing high heterogeneity was Mertolenga (Additional file 1: Figure S2), one of the most phenotypically diverse Iberian native breeds, with its three distinct coat color phenotypes mostly raised in separate herds [19]. In general, we observed high levels of differentiation (average 9%), even when admixture has occurred, which precludes the use of Iberian cattle as a single evolutionary unit.

Iberian genetic variation in the context of taurine and zebu cattle diversity

The positive Tajima’s D (Additional file 1: Figure S4) indicates a reduction in the low-frequency polymorphisms, suggestive of population structure, bias in the choice of genomic markers or of a recent bottleneck probably associated with breeding practices. The lower nucleotide diversity of European breeds relative to the African breeds (Fig. 2c) can be explained by a combination of intensive selection and genetic drift in European cattle breeds [24]. Mirandesa, Brava and Alentejana present the lowest values of all, which probably results from management and demographic histories (as mentioned above, Mirandesa has, since the 1970s, suffered a drastic reduction in population size, and significant inbreeding has been detected in Brava and Alentejana [19]).

Iberian cattle show a clear signature of admixture from African cattle and high diversity in mitochondrial DNA and Y chromosome haplotypes

We confirmed that there was gene flow between Iberian breeds and the African taurine N’Dama and this is supported by both nuclear and mitochondrial data (Fig. 3 and Fig. 4). The Iberian Peninsula and the Maghreb regions share natural zoo-geographical affinities, and there were complex biogeographic and historic faunal and human relationships during much of the early Holocene (including maritime pioneer colonization in the West Mediterranean by the agropastoral communities that reached the Iberian Peninsula in the Early Neolithic [33]), which could explain these patterns of genomic admixture. Despite not finding evidence of indicine introgression in Iberian cattle, it is important to notice that the indicine cattle in our sample has taurine introgression (confirmed by the presence of T1 taurine mitochondrial haplotypes in all the indicine samples of Fig. 4), it is likely that these are not adequate for performing this test. Contrary to previous results [4], we did not find evidence for aurochs introgression into Iberian cattle (Additional file 1: Figure S5) when using sequence data from a 6750 year-old British aurochs [5]. Given the probable complex population structure of ancient wild cattle in Europe [5, 9, 34], this result does not preclude that local aurochs introgression occurred, but data from pre-domestic Iberian specimens is required for further testing of this hypothesis.

Y-specific markers are useful to investigate crossbreeding [14] as Y-chromosomal variation is geographically structured, with the Y1 and Y2 lineages being predominant in northern and central European taurine cattle, respectively, while the Y3 lineage is specific of indicine cattle [15]. In addition, the effective population size of the cattle Y-chromosome is strongly reduced by the reproductive success of popular sires. The paternal diversity (Y-chromosome) of Iberian cattle (Additional file 1: Figure S6 and Table S2) appears to have its origins in the dispersal of a heterogeneous male population since the Neolithic along the Mediterranean route, rather than in the recent admixture of transboundary commercial cattle which are generally fixed for a single patriline (e.g. Holstein-Friesian). Isolation and less intensive selection probably also contributed to preservation of much of the original diversity in this region. Interestingly, Jersey bulls shared a distinct patriline with African cattle (one Ogaden individual; Additional file 1: Figure S6). Previous analyses of Y-chromosome polymorphisms showed that Jersey is fixed for a specific haplotype that is intermediate between Y1 and Y2 haplogroups [15], this may-well represent an African Y-lineage but more comprehensive data from African bulls are needed.

The impact of breeding practices on chromosomal variation and general patterns of diversification

The lower effective population size in chromosome X relative to the autosomes should lead to stronger impact of the bottleneck (or population structure) caused by breeding practices, observed in an overall higher Tajima’s (Additional file 1: Figure S4). In this scenario, genetic drift would be expected to result in higher FST values for chromosome X (lower effective population size [35]) relative to autosomes, which is what we observe when we compare taurine and indicine cattle (Fig. 5b). However, comparisons within taurine and within indicine show a much higher FST for autosomes than for chromosome X (Fig. 5b; Additional file 1: Figure S7). This agrees with extensive male-biased gene flow within taurine and within indicine – since males have a single copy of chromosome X, introgression will be more efficient on the autosomes. It is “known” that female populations are more likely to be geographically constrained and human-driven crossbreeding may have been carried out mainly using males [36]. This could also explain the difference in ancestry assignments for autosomes and chromosome X (Fig. 6), with signatures of previously described indicine admixture in the African taurine autosomes, but not observed in chromosome X.

Fig. 5
figure 5

a Autosomal FST between Iberian cattle and taurine/indicine breeds. b Range of autosomal FST values for including European taurine (Holstein, Jersey and the Iberian breeds), African taurine (N’Dama), and the African indicine breeds Ogaden, Kenana and Borana. Also shown are the FST values for sex chromosome X, which is comparatively low within taurine breeds, but shows the expected trend in comparisons with indicine breeds

Fig. 6
figure 6

Population structure at K = 2 determined using the female individuals only (Additional file 1: Table S2). The indicine contribution to African taurine (N’Dama) is not observed in sex chromosome X (bottom) compared to the autosomes (top)

Conclusion

Here we sequenced whole genomes of locally adapted Iberian cattle (for which genomic resources were lacking), and compared them to commercial cattle to uncover genomic patterns associated with the different breeding contexts. Our analyses confirm that these breeds are genetically very distinct and show high levels of genetic variation unlike what would be expected given their limited geographical distribution. Also, Iberian cattle retain much of the original paternal and maternal diversity, which appears to derive from the dispersal of a heterogeneous population since the Neolithic along the Mediterranean route with strong influences from North African taurine cattle, rather than from recent admixture with transboundary commercial cattle. This may have significant impact on the resilience of Iberian cattle to foreseen environmental changes. Not only these breeds produce high-quality certified meat products under local extensive conditions, as they can provide the source for genetic material to improve breeds with depleted genetic diversity, i.e. transboundary commercial cattle. Furthermore, we show that the complex processes underlying the formation of taurine breeds in general had profound effects on genomic diversity and resulted in unusual patterns of genetic differentiation for autosomes vs. sex chromosomes. Our results indicate that genetic differentiation measured using chromosome X might be more representative of the native populations of domesticated cattle, and that comparisons between breeds using autosomal data might be misleading without an appropriate demographic model.

Methods

Materials

Information regarding the breeds and the type of genetic data used to investigate genome diversity and genetic relationships is summarized in Table 1 and supplementary Additional file 1: Note S1. We selected a total of 48 animals representative of Iberian cattle, namely from the Portuguese breeds Alentejana, Arouquesa, Barrosã, Brava de Lide, Maronesa, Mertolenga, Mirandesa and Preta (Fig. 1). The 6 animals of each breed included in our study were nonrelated back to the second generation, originated from several herds, and portray the genetic diversity observed for autosomal microsatellite loci, mitochondrial DNA and Y-chromosome sequences [15, 19]. Sampling was done as described in [19], briefly 9 ml of whole-blood were collected from each animal by qualified veterinarians during their routine practice in the framework of official health control programs. Additionally, we used previously generated publicly available genomic data to make population genomics inferences in the context of worldwide cattle: i) shotgun resequencing data of four indigenous African breeds: N’Dama (Bos taurus), Ogaden (Bos indicus), Boran (Bos indicus) and Kenana (Bos indicus) [24] (Bioproject ID: PRJNA312138); ii) shotgun resequencing data of three transboundary commercial breeds: Holstein, Jersey, and Angus (Bioproject IDs: PRJNA210521, PRJNA318089 and PRJNA318087, respectively); iii) genotyping Illumina BovineHD SNP data [4] (777,692 SNPs; https://doi.org/10.5061/dryad.f2d1q) of 26 European breeds represented by at least 3 individuals: English Longhorn (England), White Park (England), Galloway (Scotland), Highland (Scotland), Kerry Cattle (Ireland), Heck (Germany), Brown Swiss (Switzerland), Fleckvieh (Switzerland), Dutch Belted (The Netherlands), Dutch Friesian (The Netherlands), Groningen Whiteheaded (The Netherlands), Meuse-Rhine-Yssel (The Netherlands), Busha (Balkan region), Romanian grey (Romania), Boskarin (Check Republic and Hungary), Chianina (Italy), Maremmana (Italy), Maltese (Malta), Cachena (Portugal), Berrenda en Colorado (Spain), Berrenda en negro, (Spain), Cardena (Spain), Lidia (Spain), Limia (Spain), Pajuna (Spain), Sayaguesa (Spain). We also included data of an aurochs [5] (England; Bioproject ID: PRJNA294709) to test for admixture with domesticated cattle. Furthermore, 149 full mitochondrial genomes from NCBI’s PopSets 157,778,019 [7], 306,977,267 [37], 355,330,537 [18], and 946,518,556 [38] were used together with the mitochondrial consensus sequences obtained from our shotgun data (details below).

Laboratory procedures

Genomic DNA was extracted using a modified salting-out precipitation method (Gentra Puregene Blood Kit, Qiagen) according to the manufacturer’s recommendations. We prepared equimolar DNA concentrations for all animals before library construction using Nanodrop™ 2000 (Thermo Scientific) and Qubit™ Fluorometer (Qubit™ dsDNA BR Assay Kit, 2–1000 ng, Invitrogen, Oregon, USA) measurements. Following DNA fragmentation by sonication using a program specific for 550 bp inserts (https://www.diagenode.com/en/p/bioruptor-pico-sonication-device), genomic libraries were prepared using the TruSeq DNA PCR-free Library Preparation Kit (Illumina, San Diego, CA) according to the manufacturer’s protocols. Whole-genome paired-end resequencing data was obtained by pooling 16 samples in each lane and using an Illumina HiSeq1500 instrument with 2 × 100 bp reads.

Sequencing data pre-processing

The 48 samples were sequenced to between 1.4X and 2.3X depth of coverage (Additional file 1: Table S2). Methods appropriate for low coverage NGS data [27, 39,40,41] were used throughout the analyses and applied to all samples. Raw Illumina reads were first processed with Trimmomatic (version 0.36) [42] for removal of adapter sequences and trimming bases with quality < 20 and discarded reads with length < 80. Mapping to cattle genome versions UMD_3.1.1 (bosTau8) [25] and Btau_4.6.1 (bosTau7; contains an assembled Y-chromosome) [25], and to the outgroup wild yak (Bos mutus; Bioproject ID: PRJNA74739) [26] was done with BWA mem (version 0.7.12-r1039). Reads showing a mapping hit were further filtered for mapping quality > 25. PCR duplicates were removed with Picard MarkDuplicates (version 1.95; http://broadinstitute.github.io/picard/) and local realignment around indels was done with GATK [43].

Sequencing error rates

Sequencing error rates were determined in ANGSD (version 0.917) [39] using a method that relies on an outgroup and a high quality genome to estimate the expected number of derived alleles (similar to a method described by Reich et al [44]). Briefly, if we observe a higher number of derived alleles in an individual we assume that this excess is due to errors. If the high-quality genome is error free, we will obtain an estimate of the true error rate. If there are errors in the high-quality genome, then the estimated error rate can roughly be understood as the excess error rate relative to the error rate of the high-quality genome.

Population structure

NGSadmix version 32 [27] was used to detect population structure with autosomal data from samples for which shotgun resequencing data was available. NGSadmix infers population structure from genotype likelihoods (that contain all relevant information on the uncertainty of the underlying genotype [45]). NGSadmix was run for K equal 2 to 8 for sites present in a minimum of 10% of the individuals: a total of 951,213 SNP sites for the 48 Iberian samples (Fig. 1b); 129,829 SNP sites for the data set including all 128 animals (Fig. 2a); 628,774 for SNP sites for the data set including the 94 female individuals (Fig. 6). The program was run with different seed values until convergence was reached.

A principal component analysis using the same SNP set for the Iberian breeds was done with PCAngsd [40] which estimates the covariance matrix for low depth NGS data in an iterative procedure based on genotype likelihoods. Genotype likelihoods for all individuals were generated with ANGSD [39] (options -GL 1 -doGlf 2 -minQ 20 -minMapQ 30).

Phylogenetic analyses

Treemix (version 1.13) [29] was used to infer the admixture graphs (Fig. 2b) using allele counts for 512,358 SNP positions included in the Illumina BovineHD SNP that can be unambiguously assigned to autosomal positions in the cattle reference genome version UMD_3.1.1 [25] using [46]. For shotgun resequencing data, allele counts were obtained from allele frequencies calculated in ANGDS [39] for positions covered in at least 3 individuals. Treemix was run using the global option and standard errors were estimated in blocks with 500 SNPs in each. Even though we do not call genotypes on the shotgun data, the individual breeds where correctly assigned to expected branches in the North/Central European and Iberian clades (Fig. 2b), confirming the robustness of our methodological approach.

The software RAxML [47] version 8.1.7 with 100 rapid bootstrap replicates was used to estimate the phylogenetic trees under the GTR + GAMMA model of sequence evolution for complete mitochondrial sequences from [7, 18, 37, 38] together with consensus sequences from the shotgun resequencing data analyzed in this study obtained by choosing the most common base per position (−doFasta 2 in ANGSD [39]).

D-statistics

To determine the pattern of excess shared derived alleles between taxa, indicative of introgression, we estimated D-statistics using the wild yak (Bos mutus) as an outgroup. All samples were mapped to the yak outgroup genome assembly [26]. The D-statistic [30, 31] is approximated by a Gaussian distribution with mean zero [41] in the absence of gene flow between the four populations, allowing for hypothesis testing. We apply an extended version of the D-statistic [41] which can use multiple individuals per population sequenced at low coverage and is implemented in ANGSD [39]. It takes observed allele frequencies for each individual in a population, and then combines them linearly to find an unbiased estimator of population frequency while minimizing the variance [41].

Assessment of genetic diversity and population differentiation

We used methods based on the site frequency spectrum (SFS) [48, 49] to estimate nucleotide diversity, the neutrality test statistic Tajima’s D (Fig. 2C; Additional file 1: Figure S7) and genome-wide FST values (Fig. 5 and Additional file 1: Figure S7). Briefly, after estimating the SFS, posterior sample allele frequencies are calculated using the global SFS as prior. SFSs estimated separately were used to obtain joint SFSs for population pairs, which are then used to estimate FST. For all pairwise breed comparisons, we determined FST using autosomes 1 to 29. For comparisons relating to chromosome X, FST was determined for the sex chromosome and autosomes using only female individuals.