Background

Assessing and developing strategies for maintaining genomic diversity in dairy cattle populations have become important activities in breeding programs due to the faster accumulation of inbreeding per year and reduction in effective population sizes as a result of intensive breeding practices (e.g., artificial insemination) and genomic selection schemes [1,2,3]. Numerous studies have characterized homozygous regions (e.g., runs of homozygosity – ROH) in several livestock species, including cattle [4,5,6], swine [7, 8], sheep [9], goats [10], horses [11], and poultry [12]. For instance, ancient and recent genomic inbreeding can be estimated based on ROH [13]. On the other hand, a stretch of heterozygous genotypes, also known as heterozygous-enriched region (HER) or runs of heterozygosity (ROHet), are far less characterized than ROH in livestock species [14] and could indicate genomic regions with high genetic variability and balancing selection.

The first study reporting HERs in livestock was published by Williams et al. [15] in Chillingham cattle. This breed had not been subjected to artificial selection and remained a closed herd for at least 350 years [15]. Despite this particularity, some genomic regions were still found to have high concentrations of heterozygous genotypes, especially regions containing loci influencing fitness and survival traits. Following this first HER characterization, studies on sheep [9], cattle [16], horses [11], and pigs [17] have been published. As such, substantial evidence indicates that genomic regions linked to some essential traits maintain high heterozygosity across generations. Maintaining haplotypic diversity at specific loci might confer a fitness advantage and be subject to balancing selection [15].

Balancing selection describes any selection processes that result in the maintenance of multiple variants of specific loci at intermediate frequencies within a population [18]. Balancing selection implies that heterozygosity is widespread and persists in the population through heterozygote advantage [19]. Balancing selection is not the only process linked to HERs occurrence in a population. Other processes such as introgression (the transfer of genetic variants from one species to another, e.g., hybridization), admixture (mixture of genetic lines or breeds – usually in the same species), and hypervariable regions (portions of the genome with much higher levels of variation than other similar areas due to mutations, recombination rate, and chromosomal rearrangements) [12, 15] contribute to the appearance and maintenance of HERs. A high concentration of heterozygous alleles in specific genomic regions across a large proportion of a population is defined as HER islands and could inform balancing selection pressure suffered by a population at a specific time.

The accurate assessment of polymorphisms in high-variable genomic regions presents additional challenges and can lead to underestimation of the results [19]. One of the main challenges when quantifying HERs in a population is the definition of the parameters to be used in the analyses. For instance, HER detection can be impacted by the density of the SNP panels, distribution of markers throughout the genome, genotyping quality, and consistency of information (error rates and minor allele frequencies) [20]. Aside from these parameters, the correct identification of HERs also depends on factors such as the minimum length size of a HER and the number of homozygous allowed within a HER [21]. Despite the influence of all these factors in the identification of HERs and HER islands, to the best of our knowledge, no studies have evaluated the impact of different parameters used in HER detection. Therefore, the main objectives of this study were to: 1) evaluate the impact of different parameters (i.e., minimum number of SNPs in a HER, maximum distance between two consecutive SNPs, minimum length of a HER, maximum number of homozygous allowed in a HER) and data source scenarios (i.e., SNP panel densities and whole-genome sequence data – WGS) on the detection of HERs; 2) characterize HERs in Holstein (HOL) cattle based on WGS data, followed by functional genomic analyses of the identified HER islands; and, 3) evaluate the overlap of the HERs found in HOL with those from other taurine (Bos taurus taurus) breeds, including Angus (ANG), Jersey (JER), and Norwegian Red cattle (RDC).

Methods

Data and quality control

Nine hundred and fifty-nine (959) HOL animals from the 1,000 Bull Genomes Project [22] were used in this study. WGS data for these individuals contained 47,379,463 markers distributed across the 29 autosomes. The quality control (QC) was performed following the criteria proposed by Ferenčaković et al. [23] and Biscarini et al. [14] in HER and ROH studies. In brief, the QC removed SNPs with low call rate (< 0.95), duplicated positions, located on non-autosomal chromosomes, or with unknown positions.

Identification of Heterozygous-enriched Regions

The detectRUNs package [24] was used to identify HERs, applying the “consecutive approach” in the analyses, which directly scans the genome SNP by SNP, as proposed by Marras et al. [25]. The impact of different sets of parameters in the identification of HERs was evaluated. These parameters included:

  • Minimum number of SNPs in a HER: 5, 10, 15, 20, 25, and 30;

  • Maximum distance between two consecutive SNPs (GAP, in Kb): 500 Kb, 1,000 Kb, and 2,000 Kb;

  • Minimum length of a HER (Kb): 10 Kb, 25 Kb, 50 Kb, 100 Kb, 500 Kb, and 1,000 Kb; and,

  • Maximum number of homozygous allowed in a HER: 0, 1, 2, 3, 4, and 5.

The combination of all sets of parameters resulted in 648 analyses. Therefore, the following criteria were employed to optimize time and computing efficiency. A subset of 300 HOL animals was randomly chosen from the 959 available. Furthermore, the impact of the parameters on HERs identification was evaluated on three selected chromosomes with 5,351,067 markers (BTA1: 2,987,435 markers, BTA14: 1,544,553 markers, and BTA25: 819,079 markers). These chromosomes were selected to represent long, medium, and short chromosomes in the cattle genome, respectively, providing an overall view of how the parameters chosen for the analyses impact HER detection across the entire genome. As the number of HERs detected is based on a combination of parameters, the average and standard deviation of the number of HERs detected, corresponding to each parameter, were calculated to assess the impact of each parameter on HERs detection. For instance, the effect of the minimum number of SNPs equal to 5 on HERs detection was measured by the average and standard deviation of the total number of HERs detected across all scenarios with a minimum number of SNPs equal to 5. Further, we evaluated the effect of all the parameters on HERs when using data from SNP chip panels instead of WGS. Three SNP panels were derived from WGS data by selectively retaining variants present in commercial SNP panels: 50 K (BovineSNP50), 100 K (GGP Bovine 100 K), and HD (700 K – BovineHD genotype BeadChip).

Lastly, additional scenarios were created as preliminary results about the impact of parameters did not show a relevant differentiation based on the minimum number of SNPs and homozygous allowed (see Results section). Therefore, we considered the minimum number of markers to determine a HER equal to 5 and 10, combining the homozygous allowed inside of a HER, ranging from 0 to 5 (0–2 when the minimum number of markers was equal to 5, and 0–5 for minimum number of markers equal 10), and fixing the minimum length size of a HER in 10 Kb, resulting in nine additional scenarios. This approach was used to identify the best combination of parameters to detect HERs, using all the autosomal chromosomes.

Heterozygosity estimation, nucleotide diversity, and Tajima’s D statistic test

Genetic variation in a population can be measured in several ways, the most common of which are heterozygosity (observed and expected) and the proportion of polymorphic nucleotide sites (π) [26]. Using such information, it is possible to estimate if the markers have been selected for heterozygous or homozygous alleles by estimating Tajima’s D value. Tajima’s D test is a good indicator of balancing selection, because it directly measures allele frequency and, since population size change and population structure should affect all loci, differences in Tajima’s D value between loci are likely to reflect differences in selection pressure [27]. It is possible to access the Tajima’s D value by the equation:

$$D=\frac{\frac{\pi-S}{a_{1}}}{\sqrt{V}},$$

Where, π is proportion of polymorphic nucleotide sites, S is the number of segregation sites, V is the sampling variance of the difference between π and S, and \({a}_{1}=\sum_{i=1}^{n-1}\frac{1}{i}\) is the coefficient related to dependent to the number of sequences (n). Here we compared the results of such metrics with the proportion of times that a marker appears on a HER, aiming to validate the parameters chosen.

Each marker’s observed and expected heterozygosity levels were obtained using the Hardy–Weinberg test statistics implemented in the PLINK v.1.09 software [28]. VCFtools [29] was used to estimate each marker’s nucleotide diversity and to perform the Tajima’s D test [30]. The difference between observed and expected heterozygosity, nucleotide diversity, and Tajima’s D statistic were collectively used to correlate with the results from the proportion of times the SNP appeared on a HER in the population, using the Spearman rank correlation to assess the ranking inside of the diversity metrics and the proportion of times the SNP appeared on a HER in the population in the different scenarios.

Identification of HER islands, linkage disequilibrium, and genomic annotation

We declared as HER island regions those that were present in at least 10% of the individuals and present in each one of the additional scenarios created on the step #2 (Fig. 1). As specific criteria for defining HER islands were lacking in the literature, we opted for a 10% threshold to serve as a representative value, offering a visual depiction of the overall concentration of HERs across the genome. The linkage disequilibrium (LD) for all markers within these HER islands was estimated using the LDheatmap package [31]. The genomic annotation of these regions was performed using the GALLO R package [32] with the annotated data for Bos taurus from the Ensembl database (www.ensembl.org/Bos_taurus/Info/Index), version ARS-UCD1.2 [33]. Subsequently, the WebGestaltR package [34] was used to mine Gene Ontology (GO) and identify potential biological processes, molecular functions, cellular components, and metabolic pathways in which the positional candidate genes might be involved.

Fig. 1
figure 1

Roadmap of the analyses carried out in this study. Min SNP: minimum number of SNPs in a HER; Max Gap: maximum distance between two consecutive SNPs (Kb); Min Length: minimum length of a HER (Kb); Max n hom: Maximum number of homozygous allowed in a HER

Comparison of HER islands from Holstein with other taurine breeds

The HER islands identified in HOL were compared to those from other three taurine (Bos taurus taurus) breeds [ANG (317 animals), JER (179 animals), and RDC (179 animals)] from the 1000 Bull Genomes Project [22]. The parameters used for the identification of HERs in all breeds were:

  • Minimum number of SNPs in a HER equals to 10;

  • Maximum distance between two consecutive SNPs (GAP, in Kb) equals to 500 Kb;

  • Minimum length of a HER (Kb) equals to 10 Kb; and,

  • The maximum number of homozygous genotypes allowed in a HER equals to two.

This combination of parameters was chosen for their ability to effectively capture HERs, as demonstrated in the previous analyses. Figure 1 presents a summary of all the analyses performed.

Results

Impact of the parameters used on HER identification

The effect of the parameters investigated on the number of HERs identified based on WGS data is reported in Fig. 2. The maximum distance between consecutive SNPs did not impact the average number of HERs detected. In contrast, increasing the minimum number of SNP in a HER decreased the number of HERs from 621.31(± 1,276.77), when considering 5 SNPs, to 6.08(± 6.08), when considering 30 SNPs. The same pattern was observed for the minimum length of the HERs, where a decrease in the number of detected HERs was observed when increasing the minimum length size of a HER, with a higher detection when the parameter was equal to 10 Kb. At a minimum HER length of 10 Kb, the corresponding increase in the number of HERs was 603% and 2,283% compared to HER minimum sizes of 25 Kb and 50 Kb, respectively. The minimum length size higher than 100 Kb did not detect HERs, as shown in Fig. 2.

Fig. 2
figure 2

Effect of parameters on the detection of heterozygous-enriched regions

When considering the maximum number of homozygous markers allowed inside a HER, the pattern was the opposite compared to the minimum number of markers and length size of a HER. As expected, more HERs were observed when more homozygous markers were allowed in a HER. The range for this parameter was on average 54.47 (± 196.27) numbers of HER, when allowing 0 homozygous markers, to 494.89 (± 1,169.35), when allowing 5 homozygous markers, an average increase of 808% on the number of detected HERs. A similar pattern was observed in the HD SNP panel results for all the parameters. However, the minimum length of a HER in the 50 K and 100 K SNP panels exhibited a greater variation across the selected sizes, ranging from 16,542.94 (± 34,229.63) HERs detected for a length of 10 Kb to 102.97 (± 194.07) for a length of 1,000 Kb in the 50 K panel, and from 35,365.22 (± 60,859.43) for a length of 10 Kb to 93.36 (± 111.62) for a length of 1,000 Kb in the 100 K panel. The effect of the parameters on the commercial panels is reported in Additional file 1 Figure S1 to Additional file 3 Figure S3, and the number of HERs overlapping between SNP panels and WGS for each parameter combination is reported in Additional file 4 Table S1 to Additional file 6 Table S3.

The average number of HERs detected based on each of the lower density SNP panels evaluated is illustrated in Fig. 3. The number of HERs identified based on SNP panels was higher than the one observed based on WGS data and increased from the 50 K (11,421.13 ± 28,407.06) to the HD panel (64,072.73 ± 159,181.15), with a difference of 4,141%, 8,550%, 23,692%, for 50, 100 K, and HD respectively, in comparison to the WGS data results.

Fig. 3
figure 3

Average and standard deviation of heterozygous-enriched regions detected by whole genome sequence (WGS) and commercial SNP panel

Classification and distribution of HERs across the genome

There is no unique metric for the minimum number of markers or the number of homozygous markers allowed inside HERs that show a high differentiation in the HER call. As such, a minimum number of markers equal to 5 and a maximum number of homozygous ranging from 0 to 2, and a minimum number of markers equals 10, and the maximum homozygous ranging from 0 to 5 were chosen to assess the distribution of HER across the genome, as well the classification based on the length size of a HER. For all these new scenarios, created in step #2 (Fig. 1), the minimum length was hold at 10 Kb, as this value showed a substantial difference in the number of HERs identified, in comparison to the others length sizes evaluated.

Table 1 shows the classification of HERs based on the length size for the scenarios with minimum number of markers equal to 5 and 10. The number of HERs detected increased with the number of homozygous markers allowed in both scenarios. In the scenario considering the minimum number of markers equal to 5, the number of HERs ranged from 36,024 to 99,811, with the number of homozygous markers equal to 0 to 2, respectively. In scenario 10, the range was 17,942 to 115,568 HER detected with the maximum number of homozygous markers ranging from 0 to 5, respectively. Most HER detected in all scenarios were classified as 10–20 Kb, representing 63.3% to 75.1% of the total number of HERs in scenario 5 and 53.4% to 72.0% in scenario 10.

Table 1 Number of heterozygous-enriched regions (HERs) detected and classified based on the length size for the scenarios with minimum number of markers equals to 5 and 10

Figure 4 shows the HER classification by chromosome and the percentage of genome coverage by HERs. BTA10 showed the highest concentration of HERs of large size (> 50 Kb). In addition, BTA10 showed more HERs in scenarios with minimum number of markers equal to 5, with maximum number of homozygous 0 and 1, and the minimum number of markers equal to 10, with maximum number of homozygous 0 to 4. Regarding genome coverage, the percentage of HERs did not exceed 0.01%, with BTA10 showing the highest rate.

Fig. 4
figure 4

Classification of heterozygous-enriched regions (HER), by chromosome, according to length and the average percentage of chromosome coverage by HERs in each of the scenarios with minimum number of the marker in a HER equals to 5 and 10 and number of homozygous markers allowed inside of a HER equals to 0 to 5

Correlations

The Spearman correlation among scenarios ranged from moderate to high, considering the proportion of times that the SNP appears inside a HER. The rank correlation was considered moderate between all scenarios and between the diversity metric estimated (expected and observed heterozygosity, nucleotide diversity, and Tajima’s D test). The overall Spearman correlations are presented in Fig. 5. Additional file 7 Figure S4 provides the Spearman correlations among scenarios and diversity nucleotide metrics per chromosome.

Fig. 5
figure 5

Spearman correlation among scenarios with a minimum number of the marker inside of HER equals to 5 and 10 and a number of homozygous markers allowed inside of a HER from 0 to 5, and differences based on heterozygosity observed and expected, diversity of nucleotide (π) and Tajima’s D statistic

Heterozygous-enriched region islands, linkage disequilibrium, and functional analyses

The overlapped of the HER islands across all nine scenarios created in step #2 (Fig. 1) are presented in Fig. 6. Forty-five islands with markers with high Tajima’s D values and high differential of observed and expected heterozygosity were found across all scenarios evaluated. These HER islands were located on 23 chromosomes and contained 3,440 markers. The percentage that each island appears in each of the scenarios and the average linkage disequilibrium between the markers located within HERs are presented in Table 2.

Fig. 6
figure 6

Distribution of heterozygous-enriched region islands across the Holstein cattle genome

Table 2 Percentage of heterozygous-enriched regions (HERs) islands observed in all scenarios and the average of linkage disequilibrium (r2) of markers inside HER islands in Holstein cattle

The largest HER island was identified on BTA10, in the region between 24,100,254 bp to 24,459,318 bp, with a length of 359.06 Kb. The smallest HER island was identified on BTA26, in the regions between 50,115,341 bp to 50,125,621 bp. The average length size of the HER islands was 40.06 Kb (± 68.37 Kb). The level of linkage disequilibrium was generally small in all HER islands evaluated, ranging from 0.001 to 0.191. One particular HER island (BTA23:21,755,400 –21,781,579 bp) shows part of the markers with strong linkage disequilibrium, as shown in Fig. 7. This region is responsible for coding the gene ENSBTAG00000054235; a gene previously linked with paratuberculosis infection in cattle [35]. Additional file 8 Figure S5 presents the heatmap for all the HER islands found.

Fig. 7
figure 7

Heatmap of linkage disequilibrium for the heterozygous-enriched region islands located on chromosome 23

Table 3 presents the gene annotation for the regions in HER islands. The significant (p < 0.05) biological processes, molecular functions, cellular components, and metabolic pathways for the genes found in the HER islands are presented in Table 4.

Table 3 Gene annotation for the Heterozygous enriched regions (HER) islands
Table 4 Significant (p < 0.05) Gene Ontology (GO) terms and pathways for the genes located within heterozygous-enriched regions (HER) islands

For the QTL enrichment analyses, 31 regions related to 11 traits were identified. Of those, 40% are related to milk, 30% to production,13.33% to the exterior, 10% to health, and 6.67% to reproduction. The QTL enrichment results are presented in Table 5.

Table 5 Significant (p < 0.05) QTL (quantitative trait loci) for the genes located within heterozygous-enriched regions (HER) islands

Comparison among breeds

The number of HER islands found for each breed, the number of islands in common with HOL, the specific regions for the breed and HOL, and the gene set for the specific region in HOL are presented in Table 6.

Table 6 Comparison of heterozygous-enriched region (HER) islands among Holstein (HOL), Angus (ANG), Jersey (JER), and Norwegian red cattle (RDC)

The breed with the highest number of overlapped HER islands was NRC, with 97.14% of the same islands found in HOL cattle. JER had the smallest percentage of HER islands in common with the HOL breed (94.28%), followed by ANG (95.72%). The genes related to each one of the particular breed HER islands are presented in Additional file 9 Table S4.

Discussion

Impact of the parameters used in the identification of HERs

We first evaluated the impact of different parameters on the identification of HERs and HER islands. As presented in Fig. 2, three parameters (i.e., the minimum number of markers, minimum length of a HER, and the maximum number of homozygous markers) significantly influenced the detection of HERs, and the minimum length of 10 Kb enabled the detection of most HERs based on WGS. This corroborates the assumption that HERs are small regions scattered through the genome that randomly occur due to different population processes (e.g., selection, mutation, migration) [12]. This finding directly impacts the choice of the SNP panels to be used for these analyses as the accurate detection of shorter HERs across the genome requires denser SNP panels [36].

The high number of markers in the WGS dataset enabled the identification of small informative HERs across the genome. Therefore, the use of WGS is a good alternative, because the majority of HERs are small DNA stretches throughout the genome (Table 1 and Fig. 4). Another point to be highlighted, as shown by Ceballos et al. [36] is that the total number of heterozygous markers present in WGS is higher than that of SNP panels. Therefore, the chances of capturing HERs are higher when using WGS data than commercial panels.

SNP panels are usually designed based on SNPs present with a frequency of more than 1% in the population used for its design [37]. These polymorphisms typically have higher allele frequency and are located near QTL associated with important traits [38]. The genome scan with such SNP panels assumes that markers not located between two consecutive heterozygous SNP are heterozygous. Therefore, ascertainment bias in the information provided could exist, mainly in the length of a HER. Small-length HERs might not be detected, the length size might be inflated, or yet false-positive HERs might be identified.

As presented in Fig. 3, a substantially inflated number of HERs was obtained when using SNP panels compared to WGS. Interestingly, the higher density (100 K and HD) SNP panels enabled a greater detection rate of HERs. In addition, the larger HERs identified in less dense SNP panels might be multiple HERs side by side in denser panels. As the chances of detecting false-positive or false-negative HERs in SNP panels are higher, additional analyses, such as nucleotide diversity and/or Tajima D statistics, can be complementary metrics to validate the detected HERs.

It seems there is a pattern between the number of HERs found by chromosome and the diversity metrics. Although the overall rank correlation was moderated (Fig. 5), the correlation was higher for chromosomes with a larger number of HERs (Additional file 8 Figure S5 – case of BTA9 and BTA10). This suggests that for chromosomes with a higher number of HERs, and longer HERs, the combination of parameters in the additional scenarios captured the markers in higher diversity in the population, besides the difference between the minimum number of markers that constitute a HER and the number of homozygous allowed inside of a HER.

Regarding the minimum number of markers that make up a HER and the maximum number of homozygous markers allowed inside a HER, these parameters seem more related to how conservative the analyses are. Both parameters work in the opposite direction and have higher impact on detecting HERs. The choice of these parameters needs to consider previous information about the population, selection history, density of SNP panels, and genotyping quality. In populations under a long direct selection process, with small diversity among the individuals, and a small effective population size, the autozygosity levels in these populations are expected to be high [21, 39]. In response, more flexible parameters could be used to detect HERs in these populations. For instance, allowing more homozygous markers inside of HERs could be a way to consider possible calling errors that may wrongly break the sequence of heterozygous regions affected by the genotyping quality [36]. Therefore, the decision to allow or not for more lenient parameters in the analyses depends on the quality of the SNP panels or other DNA genotyping or sequencing platform used.

Heterozygous-enriched region islands

Heterozygous-enriched regions can appear throughout the genome, but when they are concentrated in a specific region, this could indicate a pattern of selection events in the population [40]. Forty-five islands were found in all scenarios, and all of them had a positive Tajima’s D value indicating selection for heterozygous genotypes (balancing selection). The Tajima’s D test considers the nucleotide diversity (π) and an expectation for π based on the average pairwise markers plus the total number of mutations [41]. As a result, if the values are positive, the marker is under balancing selection [30]. Here, we used such analysis as validation to confirm that the markers inside of a HER had been selected over the generations and endorse the potential island of HERs.

Regarding the linkage disequilibrium among markers inside a HER, most markers have shown a low linkage disequilibrium. Comparing our results to those from Qanbari and Wittenburg [42], part of the HER islands found in our study are located in regions defined as recombination hotspot intervals, where the recombination rate exceeds 2.5 standard deviations from the genome-wide average recombination rate. Such recombination breaks and recombines different alleles. In this process, new HERs are created mainly in HERs near telomere regions. However, this does not seem to be the only process that leads to the creation of HER islands. Twenty-eight genes and 11 QTLs were founded related to the HER islands in HOL (Tables 3 and 5) and could be, in some way, associated with a heterozygous advantage. Some of the genes found in these regions are related to immune response, as expected once higher heterozygosity levels can lead to greater infectious disease resistance [43]. On the other hand, many genes were found to be related to traits such as fertility and production. For instance, TP63 or tumor protein p63, which has a function of binding certain regions of DNA and controlling the activities of particular genes, was previously associated with puberty in cattle [44]. The ZSCAN23 gene is involved in transcription’s regulation of RNA polymerase III and has been associated with male fertility [44]. The ARHGAP44 gene, which affects cell polarity, vesicular trafficking, cell cycle, and transcription, has also been reported to influence cow fertility [45].

The TP63 and CDK1 genes are related to a higher number of processes and pathways, as shown in Table 4. Such genes impact mechanisms that control the process of multiplication of cells and/or transcription of the genes and are involved in multiple gene ontological terms. Other traits related to genes found in HER islands include milk fatty acids and milk fat percentage (e.g., MYT1L, [46]); residual feed intake (e.g., MAYO9A [47]); vitrification temperature of mature bovine oocytes (e.g., NEK5 [48]); and, hormone homeostasis and levels of progesterone (e.g., SERPINA6 [49]). Some of these traits, although are under positive selection processes, could still be present within a HER due to certain processes such as increased diversity around a target selection or non-synonymous polymorphisms segregating at intermediate frequencies [19]. These processes can lead to concentration of heterozygous alleles around these regions and increase in the number of polymorphic markers around them.

Regarding the comparison among breeds, at least 94.28% of the HER islands in HOL are present in the other three breeds evaluated. Interestingly, particularities such as the ancestry among the breed (as in HOL and NRC) or distinctive selection processes (HOL and ANG) seem not to show an impact on the presence of common HERs across the breeds. Although the process that contributes to the presence of the heterozygosity in the population is known, the persistence of such heterozygosity in areas of the genome is still not completely uncovered [11]. The most acceptable reason for this persistence the heterozygote advantage [19] is likely linked to the fitness/survival traits that are related to evolutionary process associated with animal adaptation [11, 15].

Limitations and implications

In this study, we applied different parameter combinations for the identification of HERs, which is the first study to assess the impact of various combinations of different parameters on HER detection. This study provides background information for the design of future HER studies to better understand their role in biological mechanisms and evolutive processes. Our results show that HERs are small regions spread across the genome and likely concentrated in genomic regions under balancing selecting pressure. Although it is difficult to determine what are the “true” HERs on the genome when using real datasets (instead of simulated datasets), they provide important insights about the different selection forces that the population may have been through. The use of WGS data for HERs detection studies is the most recommended. Our results show that SNP panels resulted in an inflated number of HERs and applying additional metrics that contribute to decrease such inflation is recommended.

Here, the minimum length to capture more HERs was 10 Kb and, although neither the minimum number of markers nor the maximum number of homozygous markers allowed inside of a HER showed a distinct difference, we observed better identification of HERs with the minimum number of markers equals to 10 and a maximum number of homozygous ranging from 0 to 3. These parameter values seem to be adequate for capturing the relevant HERs, decreasing the capture of noises, and, more importantly, capturing regions with a concentration of HER islands. Interestingly, all the HER islands found had a minimum number of markers equal to or higher than 10, which reinforces the use of such level for the minimum number of markers. Regarding, the level assumed to declare if a region is located in a HER island (> 10%), we understand that it could be considered a low level. A metric that confirms whether the region has been or is not selected should also be applied. In our study, all the islands found showed a Tajima’s D value higher than zero, indicating that such regions on the genome had been selected for higher heterozygosity. Future studies utilizing simulated datasets are recommended to further investigate the roles of HERs in phenotypic variability and evolutionary processes within livestock populations. To achieve more reliable simulation results, it is crucial to understand how HERs are inherited across generations and the mechanisms that influence their presence and concentration in the animal genome. Additionally, it is important to consider the emergence and maintenance of variation, which depends on selection schemes affecting genetic variants, the genetic architecture of complex traits (including the number of genes controlling traits, their effects, and genetic linkage), and the coordination of allele expression [18]. This comprehensive understanding will provide the necessary foundation for conducting simulations that accurately reflect the underlying biological processes of HERs in the cattle genome.

Conclusions

The identification of HERs depends on the parameters used to assess the heterozygosity of the regions. The minimum length of 10 Kb resulted in the highest number of HERs detected, confirming that HERs are small regions scattered throughout the genome. The minimum number of markers that define a HER and the maximum number of homozygous allowed inside a HER did not show substantial impact based on the data sources evaluated, being more linked to the population structure and quality of genotyping. Forty-five HER islands were identified in all scenarios of parameter combinations, with high Tajima’s D values indicating that such regions are likely under balancing selection. In general, those regions have a small linkage disequilibrium and are related to traits such as fertility, production, and immune response. As for the breed comparisons, the majority of the identified HERs were in common among the four taurine breeds (> 94%), regardless of the selection forces each breed went through over the generations.