Background & Summary

Selective breeding for different agricultural purposes, such as meat, wool, and milk, have established many sheep breeds with unique characteristics worldwide1. The East Friesian sheep (Ostfriesisches Milchschaf) is a highly specialized breed. The breed originates from the Frisia region of both the Netherlands and Germany, and is considered to be the world’s highest producing dairy sheep2,3. In a single lactation, the East Friesian sheep can produce 500–700 kg of milk over a period of approximately 230 days4. Additionally, East Friesian sheep have a relatively high average number of lambs per ewe, 2.25 lambs/litter, but the carcass of lambs is very lean5. In physical appearance, East Friesian sheep have many unique features. They have a relatively large body, head, face, legs, ears all clean of wool. Their most distinctive physical feature is a “rat-tail” which is thin and devoid of wool. The East Friesian sheep, renowned for its adaptability, has been successfully crossbred with breeds known for their robust ketone body composition, such as Suffolk, Dorset, and Texel. This strategic crossbreeding not only enhances the meat quality of the East Friesian sheep but also ameliorates the traits of breeds that exhibit lower milk yields and suboptimal reproductive and lambing capabilities. Hailing from the northern regions of Germany and the Friesland area in the Netherlands, the East Friesian breed has garnered international attention and has been integrated into the livestock industries of various countries, including China, the United Kingdom, and South Africa. The exploration of the breed’s genetic makeup at the molecular level presents a compelling opportunity to deepen our comprehension of the genetic underpinnings of economically significant traits in sheep, thereby contributing to the advancement of the field.

De novo genome assembly is a fundamental and powerful tool employed in the realm of molecular research. Several genomes of sheep genomes have been made publicly available in databases, including East Friesian sheep6, Tibetan sheep7, Rambouillet sheep8, and Texel sheep9. Despite the achievement of chromosome-level assembly in these sheep genomes, there still exist unidentified regions containing gaps that require further investigation and determination. A number of assemblers have been developed for long reads assembly, such as Falcon10, Flye11, Canu12, wtdbg213, NextDenovo14 and Hifiasm15. The Hifiasm method stands out for its utilization of string-overlap graphs to represent genomes, encode information for algorithmic analysis, and visually present both primary and alternative paths along a DNA sequence16. New developments in long-read sequencing technologies, such as Pacific Biosciences (PacBio) circular consensus (CCS) long-read sequencing and ultra-long ONT sequencing, has revolutionized our ability to acquire comprehensive chromosome sequences spanning from one telomere to another. With the availability of a complete genome sequence, researchers would have the opportunity to thoroughly investigate and gain a deeper understanding of genome function, regulation, and evolution17,18.

In this study, we present the first near T2T gap-free genome assembly for East Friesian sheep using a combination of PacBio high-fidelity (HiFi) long-read, Oxford Nanopore (ONT) ultra long-read, and high-throughput chromosome conformation capture (Hi-C) sequencing data. In total, we generated 321 Gb (~107X coverage) ONT reads with a N50 of 63.5 kb, 148 Gb PacBio HiFi CCS reads with a N50 of 22.1 kb (~49X coverage), and 396 Gb Hi-C data (MGISEQ paired-end reads, ~132X coverage) (Table 1). The final genome assembly of East Friesian sheep, termed as EFS v2.0, is about 2.96 Gb with a scaffold N50 of 104.10 Mb, comprising 27 chromosomes without any gaps (Table 2; Fig. 1) and 164 unplaced sequences. We observed that 94.53% of these unplaced sequences consist of repetitive elements, among which satellite sequences constitute 84.64%. Further research and refinement are needed to determine their precise genomic location and functional relevance. The EFS v2.0 assembly captured 41 telomeres and 24 centromeres (Table 3). Notably, the EFS v2.0 assembly closed 35 gaps in total compared to the previously published East Friesian genome6 (Fig. 2).

Table 1 Summary of sequencing data of East Friesian sheep genome.
Table 2 Comparison of four sheep genomes.
Fig. 1
figure 1

Circos plot of the EFS v2.0 genome. From inside to outside, I: GC content in nonoverlapping 1 Mb windows (histograms); II: percent coverage of repetitive sequences in nonoverlapping 1 Mb windows (heat maps); III: gene density calculated based on the number of genes in nonoverlapping 1 Mb windows (heat maps); IV: 27 super-scaffolds. Lengths are shown in Mb.

Table 3 Centromere positions of East Friesian sheep genome.
Fig. 2
figure 2

Overview of the near T2T and gap-free EFS v2.0 reference genome. The box represents the 35 closed gaps identified from GCA_018804185.1. The triangle represents the telomere region, and the circle represents the centromere region.

In the EFS v2.0 genome, repeat sequences accounted for 1.60 Gb, representing 53.98% of the assembly (Table 4). Long interspersed nuclear elements (LINE) retrotransposons (41.46%) were the most abundant component among repetitive elements, which was consistent with a previous study19 (Table 5). Gene annotation identified 24,580 protein-coding genes. Of which, 24,536 genes (99.8%) were anchored to 27 chromosomes (Fig. 1), while 44 genes anchored to unplaced scaffolds. The length and number of exons were similar to those of three other sheep breeds (Fig. 3a,b). Furthermore, the predicted proteins achieved a complete BUSCO score of approximately 98%, indicating high quality annotation (Fig. 3c). 23,891 (97.2%) protein-coding genes were successfully annotated in diverse databases, including Gene Ontology (GO), KOG, Interpro, SwissProt20, Kyoto Encyclopedia of Genes and Genomes (KEGG)21, NCBI nonredundant database (NR), and Translation of European Molecular Biology Laboratory (Trembl) (Table 6). Moreover, 17,328 (~70.5%) genes were supported by all five databases (Fig. 3d). Based on transcriptomic deep-sequencing data, we investigated gene expression level in five different tissues (Table 7). A total of 15,263 (62.2%) genes showed detectable expression levels (transcripts per million ≥ 1) in one or more of these tissues. Through structural variants analysis with the previously published East Friesian sheep6, we identified 232 newly assembled genes, among which 151 were expressed in 5 different transcriptome samples (Table 8; Fig. 4).

Table 4 General statistics of repeats in the EFS v2.0 assembly.
Table 5 Transposable elements (TEs) in the assembled EFS v2.0 assembly.
Fig. 3
figure 3

Quality assessment of the protein-coding genes in the EFS v2.0 assembly. (a) Comparison of exon length among four sheep gene sets. Window refers to the length of every point. (b) Comparison of exon number among four sheep gene sets. No obvious unexpected differences exist among these four organisms, indicating the high quality of gene structure annotation. (c) BUSCO assessment results of protein-coding genes in the EFS v2.0 assembly. (d) Gene function annotation results in a statistics Venn diagram using five public databases: NR, InterPro, KEGG, SwissProt and KOG.

Table 6 Number of functional annotations for predicted genes in the EFS v2.0 assembly.
Table 7 Summary of RNA-seq sequencing data of East Friesian sheep genome.
Table 8 The improvement of EFS v2.0 assembly.
Fig. 4
figure 4

Heatmap representation of new assembled genes. Rows represent new assembled genes, and columns represent 5 different samples. The bar in the upper right corner represents log 2 transformed TPM values. Blue and red boxes represent genes showing lower and higher expression levels, respectively. “Hea_t” represents heart, “Rum_n” represents rumen, “Sub_t” represents subcutaneous fat, “Lun_g” represents lung, and “Per_t” represents perirenal fat.

Methods

Sample collection, DNA preparation and RNA extractions

A 1-year-old female East Friesian sheep from Inner Mongolia key Lab of Bio-manufacture in Inner Mongolia autonomous region of China was chosen for DNA and RNA sequencing. The assembled sequence does not include the Y chromosome due to sampling from females. The animal was healthy, and no genetic defects were observed in it or its parents.

DNA was extracted from fresh blood specimen using the QIAGEN Blood & Cell Culture DNA Midi Kit according to the manufacturer’s instruction (QIAGEN, Germany). TRIzol (Invitrogen, Carlsbad, CA, United States) was used to extract total RNA from heart, rumen, subcutaneous fat, lung and perirenal fat tissues. The concentration of total RNA was determined using the Nano 6000 spectrophotometer Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, United States). The RNA purity was determined using the Qubit® RNA Assay Kit in a Qubit® 2.0 Fluorometer (Life Technologies, Camarillo, CA, United States).

Long insert libraries preparation and sequencing

The library construction and sequencing of RNA-seq full-length transcripts were conducted using a method similar to that described in Yuan, Ge et al.22, resulting in 437,807 full-length non-chimeric reads with mean length of 1,388 bp.

For the DNA PacBio long inserts libraries, the preparation was carried out in accordance with the “Using SMRTbell Express Template Prep Kit 2.0 With Low DNA Input” protocol23 provided by PacBio (Pacific Biosciences, USA). This resulted in libraries with an insert size of approximately 20 kb. Subsequently, the libraries were subjected to sequencing using PacBio Sequel II platforms operating in CCS mode. The subreads were processed through the CCS algorithm of SMRTLink (v8.0.0)24 with specific parameters: “-minPasses 3 -minPredictedAccuracy 0.99 -minLength 500”, yielding 148 Gb of PacBio’s long high-fidelity (HiFi) reads in total.

Furthermore, ultra-long DNA ONT libraries were created following the protocols detailed by Shafin et al.25. These libraries were then sequenced on the PromethION sequencer platform (Oxford Nanopore Technologies, UK). The sequencing effort resulted in the production of 8,180,779 reads, with an N50 value of 63,509 bp.

Short insert libraries preparation and sequencing

RNA-seq libraries were prepared uisng the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, Ipswich, MA, USA) following the manufacturer’s protocol. Subsequently, the RNA libraries were sequenced on a MGISEQ-2000 platform, producing 150 bp paired-end reads.

The Hi-C library was prepared using the same method described in Yin, Chen et al.26 with the same blood specimen and sequenced on a MGISEQ-2000 instrument. A total of 395 Gb of clean data were obtained from 396 Gb of sequencing data using software SOAPnuke (v2.0)27 with parameters “-n 0.01 -l 20 -q 0.1 -i -Q 2 -G 2 -M 2 -A 0.5”.

Genome assembly

With the HiFi reads, the primary contigs were assembled using Hifiasm (v 0.16.1)15 with default parameters. The Hi-C valid reads were employed to anchor contigs onto chromosomes through Juicer28 and 3d-dna pipeline29. The chromosome nomenclature was adopted for the chromosome numbering on the basis of their collinearity with 27 chromosomes of Texel sheep genome30. To achieve a near T2T gap-free reference genome assembly, gaps in the assembly genome were filled using LR_Gapcloser31 with error-corrected ONT long reads produced by NECAT32.

Annotation of repetitive sequences and identification of telomeres and centromeres

Two strategies, de novo and homolog methods, were applied to annotate repetitive sequences. RepeatModeler (v1.0.4)33 was used to identify de novo repeats, and LTR-FINDER (v1.0.7)34 was utilized to annotate long terminal repeats. DNA and protein transposable elements (TEs) were detected by RepeatMasker (v4.0.7)35 and RepeatProteinMasker (v4.0.7), respectively, based on Repbase database36. At last, tandem repeats were identified using Tandem Repeat Finder (v4.10.0)37. Referencing the methods described in sweet tea38, wild blueberry39, and rapeseed studies40, the telomeric sequences in the EFS v2.0 genome assembly were identified using quarTeT (v1.0.3)41 with the “-c animal” option. The quarTeT program comprises four modules: AssemblyMapper, GapFiller, TeloExplorer, and CentroMiner, of which, TeloExplorer is specifically utilized for identifying candidate telomeres. The telomere repeat monomer identified by quarTeT was “TTAGGG/CCCTAA”. The Centromics software (https://github.com/ShuaiNIEgithub/Centromics) was employed for centromere identification. Furthermore, we aligned the human rRNA sequences against the EFS v2.0 genome by using BLASTN with an E value of 1 × 10−5 to investigate the ribosomal DNA (rDNA) in EFS v2.0 genome. In total, we identified 1329 ribosomal RNAs (rRNAs), including 122 18S rRNA, 586 28S RNA, 59 5.8S rRNA, and 562 5S RNA.

Protein-coding genes prediction and functional annotation

Protein-coding genes in the EFS v2.0 assembly were predicted using a similar method to Fang, Mou et al.42. For the RNA-seq-based prediction approach, clean RNA-seq reads were aligned to the EFS v2.0 assembly via Hisat2 (v2.1.0)43 with parameters including–sensitive–no-discordant–no-mixed -I 1 -X 1000–max-intronlen 1000000. Additionally, 437,807 full-length transcripts were matched against the EFS v2.0 assembly using GMAP (v 2017-11-15)44. The aligned reads were then assembled using Stringtie (v1.3.5)45 using the parameters: -f 0.3 -j 3 -c 5 -g 100 -s 10000. Subsequently, TransDecoder (v5.5.0) (https://github.com/TransDecoder/TransDecoder) was invoked to identify the coding sequence with default settings. In the homolog prediction method, GeMoMa (v1.9)46 was used to detect homologous peptides across four mammalian genomes, including human (GCA_000001405.29)47, Texel sheep (GCA_000298735.2)9, Tibetan sheep (GCA_017524585.1)7, and Rambouillet sheep (GCA_016772045.1)8. Genes that had RNA-seq-based prediction support with correct structure, but were not identified via homology-based prediction, were incorporated into the gene set. Ultimately, untranslated regions and alternative splicing regions were determined using Program to Assemble Spliced Alignment48. The integrated gene set was translated into amino-acid sequences and functionally annotated by mapping against KEGG49, Swiss-Prot20, TrEMBL20, KOG50, InterPro51 and NR (NCBI Non-redundant protein) databases using BLAST (v2.2.26)52 with an E-value threshold of 1E-5. Protein domains and motifs were annotated using InterProScan53, from which GO Ontology (GO)54 was derived.

Gene expression analysis

Quality control of raw RNA-seq reads was conducted using SOAPnuke (v2.0)15. Afterwards, the clean reads were aligned to the EFS v2.0 genome using Hisat2 (v2.1.0)43, with the following parameters: ‘--phred33 -p 5 --sensitive --no-discordant --no-mixed -I 1 -X 1000’. A read count matrix was generated using featureCounts55. Gene expression levels were calculated using the transcripts per million (TPM) method.

Identification of new assembled genes

The software Syri (v1.6.3)56 was employed to detect structural variations between the EFS v2.0 genome assembly and the previously published East Friesian sheep6. A gene was classified as newly assembled if the previously published East Friesian sheep6 exhibited a deletion of at least 50 bp and the gene region had a minimum overlap of 30% with that region.

Reads coverage analysis of genome assembly

We assessed whether the long sequencing reads extended across the regions that required gap filling. Prior to this process, the genome contained eight gaps. We employed minimap257 (v 2.24) to map both the ONT and HiFi reads to the EFS v2.0 genome. Utilizing SAMtools58 (v 1.10) with the ‘-q 20’ option, we filtered out low-quality and multi-mapping reads. Subsequently, we utilized the IGV software for visualizing the high-quality alignment results.

Quality value (QV) calculations

In the realm of whole-genome sequencing, the Quality Value (QV) emerges as an essential metric for gauging the precision of nucleotide identification. The QV is derived from the Phred quality score, a measure that captures the negative logarithm of the likelihood that a given base call is erroneous. The QV is precisely calculated through the equation QV = −10 × log10 (error probability). For instance, an error probability of 0.001 equates to a QV of 30, indicating a high confidence in the correctness of the base call. Throughout the sequencing process, each nucleotide is appraised with a Phred score that is contingent upon the signal-to-noise ratio; this score is subsequently converted to a QV, thereby providing an index of the sequencing data’s fidelity. In this study, we have employed the Merqury59 software to meticulously compute the QV, ensuring robust data quality assessment.

Data Records

The DNA sequence reads of East Friesian sheep (Experiment of DNA sequencing data from ultra-long ONT library: SRR2627375660; Experiments of DNA sequencing data from Hi-C library: SRR2627376360; Experiments of DNA sequencing data from PacBio HiFi library: SRR2627376260) and RNA sequence reads of East Friesian sheep (Experiment of 5 transcriptome libraries: SRR26273757-SRR2627376160) have been deposited in the Sequence Read Archive (SRA). The genome assembly have been deposited in the GenBank database under the accession number JAWMPZ00000000061. The files of the gene structure annotation, repeat predictions and gene functional annotation have been deposited at Figshare database62.

Technical Validation

Multiple methods were employed to validate the accuracy and completeness of EFS v2.0 assembly.

Firstly, we utilized long sequencing reads to ascertain their extension across the eight gap regions (Table 9). The resulting plots confirmed comprehensive coverage of the targeted regions (Fig. 5). Secondly, the Hi-C heatmap displayed high consistency across all chromosomes, demonstrating the correct ordering and orientation of contigs in the EFS v2.0 assembly (Fig. 6). Thirdly, the EFS v2.0 assembly exhibited high collinearity with Rambouillet sheep (GCA_016772045.1)8, Tibetan sheep (GCA_017524585.1)7 and the previously published East Friesian sheep (GCA_018804185.1)6 (Fig. 7). Fourthly, the accuracy was confirmed by the high mapping rates of two type sequences on the EFS v2.0 assembly, with 99.93% of ONT reads and 100% of HiFi reads aligning to the EFS v2.0 assembly. Notably, the sequencing assembly attained a remarkable quality value (QV) score of 69.1, signifying an exceptionally low error rate of approximately 1.26 errors per 100 million bases. This level of sequencing accuracy and reliability is highly commendable and will undoubtedly facilitate subsequent genetic analysis and research. Lastly, the Benchmarking Universal Single-Copy Orthologs (BUSCO) test revealed that EFS v2.0 assembly successfully identified 97.1% of 9,226 mammalia gene sets, which exhibiting the highest level of BUSCO completeness among the four commonly used genomes (Fig. 8).

Table 9 The location of the gap to be filled.
Fig. 5
figure 5

Using IGV to demonstrate the coverage of ONT and PacBio reads in the gap 1 region. The IGV images for Gap 1 through Gap 8 are available through the Figshare database62.

Fig. 6
figure 6

The accuracy and completeness of the EFS v2.0 genome assembly. Whole-genome Hi-C heatmap of EFS v2.0 within and between 27 chromosomes.

Fig. 7
figure 7

The identification of syntenic regions for EFS v2.0, Rambouillet sheep and Tibetan sheep was based on conducting homology searches using MCScan (Python version)63, with a minimum requirement of 30 genes per block. Macrosynteny connecting blocks of >30 one-to-one gene pairs are shown.

Fig. 8
figure 8

BUSCO plot of the several sheep genomes. C: Complete BUSCOs; S: Complete and single-copy BUSCOs; D: Complete and duplicated BUSCOs; F: Fragmented BUSCOs; M: Missing BUSCOs; n: Total BUSCO groups searched. East Friesian sheepa: GCA_018804185.1; East Friesian sheepb: EFS v2.0.