Background & Summary

Indigenous animal genetic resources, primarily found in developing countries, are known to contain a significant portion of the world’s genetic diversity. Millions of people rely directly on these resources for their livelihoods1. Ethiopia, in particular, is considered a gateway for cattle migrations in Africa2. Presently, the cattle population in Ethiopia exceeds 70 million heads3, with 98.5% of them being indigenous cattle. These indigenous cattle are often named based on their appearance, morphological structure, the ethnic group of the herder, and their geographical location4,5. Over time, these cattle have developed unique adaptive traits that enable them to withstand challenges such as limited feed availability, high environmental temperatures, and a high prevalence of internal and external parasites and diseases. These adaptive features have been shaped through natural and human selection processes6,7.

By far, cattle production in Ethiopia is an integral part of almost all farming systems in the crop-livestock mixed farming systems of highlanders and mid attitudes, and the main occupation in the lowland pastoralists, and still promising to rally around the country’s economic development. Despite multiple functions and significant phenotypic variations of indigenous cattle populations, little attention was paid to the livestock sector, which threatened the country’s cattle diversity and population size. These are mainly associated with complex and interrelated factors such as indiscriminate crossbreeding and interbreeding between adjacent indigenous breeds due to herders’ migrations and socio-cultural interactions8,9. Furthermore, recurrent drought, the prevalence of disease, ethnic conflicts, and the illegal cross-border market hasten the decline in cattle numbers. Thus, a comprehensive understanding of breed characteristics, including population size, genetic landscape, and geographical distribution, is crucial for effectively managing farm animal genetic resources1,10. It also serves as a guiding framework for breed development programs, enabling them to align with specific production needs in diverse environments.

Quantitative genetic analysis has historically been characterized as a black box due to the intricate nature of gene action, which involves multiple loci with unknown effects and their interactions in shaping quantitative traits11. This complexity has posed challenges in understanding the underlying mechanisms and unraveling the genetic architecture of these traits. As a result, researchers have faced difficulties replicating the results of selective breeding across different spatial and temporal scales, making it essential to explore further and elucidate these complex genetic processes. Advancements in genome sequencing, SNP genotyping technologies, and statistical analysis tools have shifted research focus from analyzing neutral variation to exploring functional variation12. Notably, the advent of whole-genome sequencing (WGS) in domestic animals has revolutionized our understanding of their genetic makeup. It has allowed for the identification of causal variants that have significant implications for animal production, health, welfare, and evolutionary studies within livestock species and breeds13. While WGS has become a standard tool in various biological sciences, including animal breeding, its application for genetic characterization and routine evaluation of livestock genetic resources in developing countries is still limited. This study presents the whole-genome sequencing data from 70 indigenous cattle originating from seven distinct Ethiopian cattle populations sampled from various agro-ecological and climatic settings (Table 1; Ayalew et al.14). Thus, our WGS data will serve as a valuable resource for conducting further in-depth studies and investigations in tropical cattle. This sequence dataset will facilitate a deeper understanding of the genetic landscape, allowing for the identification and validation of causal mutations that contribute to essential traits and the exploration of evolutionary patterns.

Table 1 Ethiopian cattle breeds and their respective sampling locations.

Moreover, the detailed analytical procedures offer significant advantages for researchers, such as ease of management of similar WGS and implementation of global cattle meta-assemblies at a broader scale. The meta-assembly, which combines multiple genetic or genomic data assemblies into a single, comprehensive assembly, will enable the accurate validation of regions under selection reported by various researchers, ensuring the identification of actual signals while minimizing false positives and supporting future breed improvement and conservation efforts.

Methods

Cattle sampling and collection

We specifically selected seven indigenous cattle populations (Abigar, Barka, Boran, Fellata, Fogera, Gojjam-Highland, and Horro) for our study, with ten unrelated samples collected from each population. These cattle populations inhabit distinct agro-climatic regions, representing Ethiopia’s diverse environments (Table 1). We selected these particular populations based on their relevance to agricultural practices, providing insights into desirable production traits, environmental adaptation, and regional livestock farming systems. Blood samples were drawn from the jugular vein of the cattle under sterile conditions, using 10 ml EDTA tubes. The samples were carefully transported to the laboratory in an ice box and stored at −20 °C until DNA extraction.

Extraction and quality control of genomic DNA

The blood samples were thawed for 30 minutes at room temperature and underwent DNA extraction using the Tiangen genomic DNA extraction kit based on the manufacturer’s protocols (TIANGEN Biotech, Beijing, China). We conducted 0.8% agarose gel electrophoresis to assess DNA integrity and visualized the resulting DNA bands using a gel imaging apparatus. Each sample’s DNA concentration and quality were determined using a Nanodrop Spectrophotometer (ND-2000, Thermo Scientific, Massachusetts, USA) at a wavelength of A260/A280. Samples with DNA concentrations above 50 μg/μl were then sent to Wuhan Frasergen Bioinformatics Co. Ltd in China for whole-genome sequencing (WGS).

Sequence library preparation and sequencing

The VAHTS Universal DNA Library Prep Kit for MGI (Vazyme, Nanjing, China) was employed to generate sequencing libraries of each sample, targeting fragments of approximately 500 bp in length using one microgram of DNA as input material. Adapter sequences were ligated to each sample. Library size and quantification were assessed using Qubit 3.0 Fluorometers and Bioanalyzer 2100 systems (Agilent Technologies, CA, USA). Finally, the sequencing process was conducted by Frasergen Bioinformatics Co., Ltd. (Wuhan, China) on an MGI-SEQ 2000 platform, resulting in a 150 bp sequence length for each sample.

Sequence data pre-processing and mapping

The demultiplexed 70 individual samples (forward and reverse reads) were received and checked for their quality metrics using FastQC v0.11.815. The raw reads were subjected to initial quality control by Trimmomatic v0.39 using default settings16. After removing adapter sequences and low-quality reads, MultiQC v1.14 was run on the clean reads, and standard sequence quality metrics were confirmed for subsequent analysis. BWA-MEM 0.7.17-r118817 was employed to align individual reads to the latest bovine reference genome ARS-UCD1.218. The aligned reads were converted to binary alignment map (BAM) format, sorted by coordinates, and indexed using SAMtools version 1.619. Finally, the duplicate sequences were marked using the MarkDuplicates function of Picard 2.27.4 (https://broadinstitute.github.io/picard/) to produce a non-duplicated bam file for variant calling.

Variant calling and filtration

High-quality variant calling and filtration are vital in genomic research. The Genome Analysis Toolkit best practices pipeline (https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant discovery) was employed for SNPs discoveries (Fig. 1). First, the marked duplicate bam files were used as input to generate Base Quality Score Recalibration (BQSR) tables using GATK 4.3.0.0. The “Apply BQSR” argument of the same software was then employed to create recalibrated BAM files. The HaplotypCaller method, followed by joint genotyping of all samples and VQSR procedures for SNP recalibrations, was performed using validated SNPs provided by the 1000 bull genome project. In the Variant Quality Score Recalibration (VQSR) procedure, SNP recalibrations utilized different variant annotators, including Quality of Depth (Q.D.), Fisher Strand Test (F.S.), Mapping Quality Score (M.Q.), Mapping Quality Rank Sum Test (MQRankSum), Read Position Rank Sum Test Statistic (ReadPosRankSum), and StrandOddsRatio Test (SOR). Subsequently, the ApplyVQSR procedure was employed to select variants with a true sensitivity of 99.0%. Finally, the ‘SelectVariant’ procedure from the same software was used, and the final SNPs were used for annotations (refer to the Code availability section).

Fig. 1
figure 1

Overview of raw data quality control, sequence mapping, variant calling, and variant filtration pipeline. The pipeline follows GATK’s best practice protocol for germline short variant discovery.

Data Records

The 70 Ethiopian indigenous cattle pair-end raw sequencing data (in fastq.gz format) were available at NCBI under Sequence Read Archive (SRA) accession numbers SRP47834820 and SRP48080321 (Supplementary file 1). The VCF file can be available in the European Variation Archive (EVA) with the accession number for Project PRJEB75238 (https://identifiers.org/ena.embl:ERP159827)22.

Technical Validation

Quality control for raw reads and alignments

In next-generation sequencing (NGS) data analysis, quality control of raw sequence reads is a standard preliminary procedure before further analysis. This crucial pre-processing step enhances the overall data quality and reliability before conducting downstream analyses23. Some essential quality measures used to make choices for the downstream analysis are the base quality, nucleotide distribution, G.C. content, and duplication rate of the raw sequences24. Sequencing of each individual yielded between 13.61 gigabases to 25.45 gigabases, of which 91.8–95.5% of the reads fell above Phred scaled quality score of 30, which proves the bases were called with 99.9% accuracy (Fig. 2). To elucidate all types of variants (including SNVs, indels, and CNVs), a high-depth WGS (30X) is the ‘gold standard’25. Due to budget constraints, it is common practice to sequence fewer samples at high coverage (20 to 30X). However, this approach may result in a poor representation of a population’s genetic variation. The smaller dataset may not adequately capture the full range of genetic diversity present, leading to potential biases or incomplete insights23. Recently, Jiang et al. suggested 4X as the lowest boundary and 10X as an ideal depth for achieving greater than 99% genome coverage in pigs26. The average estimated coverage for each of the 70 Ethiopian cattle samples was above the threshold with an average depth of 14X (Fig. 2). The relatively moderate depth of coverage in our study enhances the resolution and reliability of downstream analyses, leading to more robust findings and insights into the genetic basis of various traits and population dynamics26,27.

Fig. 2
figure 2

Boxplot presentation of 70 Ethiopian cattle sequencing yield, yield Q30 and estimated sequence coverage.

The MultiQC software28 was employed to generate a pooled sequence quality metrics report (Fig. 3). The MultiQC reports for 70 paired-end Ethiopian cattle sequences confirm that the mean quality scores and per-sequence metrics fell within the high sequence standard range for downstream analysis (Fig. 3b,c). Although there is no universal threshold for duplication levels in WGS data, FastQC flagged a warning for sequences with more than 20% duplicates15. Unlike PCR-free methods, PCR-based sequencing introduces bias in sequencing data by causing uneven amplification of genomic regions and generating duplicate reads, which can impact the accuracy of the sequencing data29. Intriguingly, we found an average duplication rate of 17% (Fig. 3a), and this relatively low level of duplication observed in our data can mitigate challenges in variant calling and uneven distribution of coverage across the genome and enhance the efficiency and speed of analysis pipelines30.

Fig. 3
figure 3

The quality control metrics from FastQC analysis of 70 cattle sequences. The metrics from all FASTQ files are consolidated using the MultiQC package.

A uniform G.C. content among reads indicates high-quality sequencing, suggesting minimal artifacts or contaminants24. However, in our dataset comprising 70 forward and 70 reverse sequencing files (140 files), all sequenced in the same lane and on the same instrument, Fig. 3d reveals some deviations from the expected distribution of G.C. content in a subset of 23 files (16.43%). These deviations may be attributed to challenges during library preparations15. Notably, despite deviations observed in the G.C. content distribution of some sequencing files, a warning message is acceptable for fewer than 30% of the reads, indicating that the overall data quality remains suitable for subsequent analysis15.

While the quality control for aligned reads is not routinely conducted, it is a valuable tool for gaining additional insights into sample quality. It can help identify problematic samples that might pass the initial raw data quality control checks24. In our data, 99.2% of the reads were successfully mapped to the Bos taurus (ARS-UCD1.2) reference genome (Supplementary file 2). It suggests that most reads were mapped correctly to their corresponding genomic locations.

Quality control for SNP data

After consolidating individual sample VCF files, the joint genotyping analysis yielded 39 million SNPs. To ensure the reliability of these variants and filter out false-positive calls for downstream analyses, we employed a robust machine-learning model called VQSR (https://gatk.broadinstitute.org/hc/en-us/articles/360035531612-Variant-Quality-Score-Recalibration-VQSR). VQSR is a two-step process that involves training a machine learning model using a training dataset and then applying this model to recalibrate the variant quality scores in the primary dataset. VQSR offers several advantages, including improved accuracy, adaptability, comprehensive assessment, and reduced false positives compared to traditional filtering methods. By incorporating VQSR, we optimized the quality control process and enhanced the validity of our variant calls. Specifically, threshold values of 99% retained about 35 million true variants and excluded four million variants as poor/false positive calls. We also computed the transition/transversion (Ti/Tv) ratio and the heterozygosity-to-homozygosity (het/hom) ratio for SNPs passing the 99% threshold. The observed Ti/Tv and het/hom ratios were 2.35 and 1.17, respectively. These metrics are consistent with values reported for other African zebu cattle breeds31.

To investigate the genomic distribution and functional impact of genetic variants, we used the SNPeff variant annotation tool. A significant portion of variants (over 89%) were annotated within intronic and intragenic regions (Table 2). Notably, while the number of SNPs per chromosome correlated with chromosome length32, our study revealed varying SNP densities across chromosomes. For instance, Chromosome 23 showed the highest SNP density (16.65), whereas the X chromosome had the lowest (6.61). These variations are likely attributed to multiple factors, including differences in recombination and mutation rates, genetic drift, demographic influences, selective pressures, and population history33. Despite containing more repetitive regions, the X chromosome experiences heightened selection pressure against genetic variants, driven by hemizygosity in males and X-chromosome inactivation in females. As a result, the X chromosome exhibits a lower SNP density than autosomes. These unique genetic mechanisms and evolutionary dynamics significantly shape the distinct SNP profiles observed between the X chromosome and autosomes34 Table 3.

Table 2 Single Nucleotide Polymorphisms (SNPs) across various annotation categories.
Table 3 Summary of SNPs density in each chromosome.