A near complete genome assembly of the East Friesian sheep genome

You, Xiaoxiao; Fang, Qinyuan; Chen, Chunhai; Cao, Junwei; Fu, Shaoyin; Zhang, Teng; Wang, Shenyuan; He, Xiaolong; He, Jiangfeng; Zhou, Yang; Wang, Biao; Wang, Liwei; Wang, Zheng; Sun, Tianhao; Yang, Xukui; Te, Rigele; Jian, Jianbo; Zhou, Huanmin; Dai, Yanfeng; Liu, Yongbin

doi:10.1038/s41597-024-03581-w

Data Descriptor
Open access
Published: 11 July 2024

Volume 11, article number 762, (2024)
Scientific Data

Xiaoxiao You¹^na1,
Qinyuan Fang¹^na1,
Chunhai Chen²^na1,
Junwei Cao³,
Shaoyin Fu⁴,
Teng Zhang¹,
Shenyuan Wang³,
Xiaolong He⁴,
Jiangfeng He⁴,
Yang Zhou¹,
Biao Wang⁴,
Liwei Wang⁴,
Zheng Wang¹,
Tianhao Sun¹,
Xukui Yang²,
Rigele Te⁴,
Jianbo Jian² (ORCID: orcid.org/0000-0003-2187-5490),
Huanmin Zhou^1,3,
Yanfeng Dai¹ &
Yongbin Liu^1,4

Abstract

Advancements in sequencing have enabled the assembly of numerous sheep genomes, significantly advancing our understanding of the link between genetic variation and phenotypic traits. However, the genome of East Friesian sheep (Ostfriesisches Milchschaf), a key high-yield milk breed, remains to be fully assembled. Here, we constructed a near-complete and gap-free East Friesian genome assembly using PacBio HiFi, ultra-long ONT and Hi-C sequencing. The resulting genome assembly spans approximately 2.96 Gb, with a contig N50 length of 104.1 Mb and only 164 unplaced sequences. Remarkably, our assembly has captured 41 telomeres and 24 centromeres. The assembled sequence is of high quality in terms of completeness (BUSCO score: 97.1%) and correctness (QV: 69.1). In addition, a total of 24,580 protein-coding genes were predicted, of which 97.2% (23,891) carried at least one conserved functional domain. Collectively, this assembly provides not only a near telomere-to-telomere (T2T) gap-free genome but also a valuable genetic resource for comparative genome studies of sheep, and it will serve as an important tool for the sheep research community.

Background & Summary

Selective breeding for different agricultural purposes, such as meat, wool, and milk, has established many sheep breeds with unique characteristics worldwide¹. The East Friesian sheep (Ostfriesisches Milchschaf) is a highly specialized breed. It originates from the Frisia region of the Netherlands and Germany and is considered to be the world’s highest-producing dairy sheep^2,3. In a single lactation, the East Friesian sheep can produce 500–700 kg of milk over a period of approximately 230 days⁴. Additionally, East Friesian sheep have a relatively high litter size, averaging 2.25 lambs per litter, but the lamb carcass is very lean⁵. In physical appearance, East Friesian sheep have several distinctive features. They have a relatively large body, and the head, face, legs, and ears are all clean of wool. Their most distinctive physical feature is a “rat-tail”, which is thin and devoid of wool. The East Friesian sheep, renowned for its adaptability, has been successfully crossbred with breeds known for their robust carcass composition, such as Suffolk, Dorset, and Texel. This crossbreeding not only enhances the meat quality of the East Friesian sheep but also improves breeds that exhibit lower milk yields and suboptimal reproductive and lambing performance. Hailing from northern Germany and the Friesland area of the Netherlands, the East Friesian breed has garnered international attention and has been integrated into the livestock industries of various countries, including China, the United Kingdom, and South Africa. Exploring the breed’s genetic makeup at the molecular level offers an opportunity to deepen our understanding of the genetic basis of economically important traits in sheep.

De novo genome assembly is a fundamental and powerful tool in molecular research. Several sheep genomes have been made publicly available in databases, including those of East Friesian sheep⁶, Tibetan sheep⁷, Rambouillet sheep⁸, and Texel sheep⁹. Although these genomes have been assembled to chromosome level, they still contain gaps, that is, unresolved regions that require further investigation. A number of assemblers have been developed for long-read assembly, such as Falcon¹⁰, Flye¹¹, Canu¹², wtdbg2¹³, NextDenovo¹⁴ and Hifiasm¹⁵. Hifiasm stands out for its use of string-overlap graphs to represent genomes, encode information for algorithmic analysis, and visually present both primary and alternative paths along a DNA sequence¹⁶. New developments in long-read sequencing technologies, such as Pacific Biosciences (PacBio) circular consensus sequencing (CCS) and ultra-long Oxford Nanopore Technologies (ONT) sequencing, have revolutionized our ability to acquire complete chromosome sequences spanning from one telomere to the other. With a complete genome sequence available, researchers can thoroughly investigate genome function, regulation, and evolution^17,18.

In this study, we present the first near telomere-to-telomere (T2T) gap-free genome assembly for East Friesian sheep using a combination of PacBio high-fidelity (HiFi) long-read, Oxford Nanopore (ONT) ultra-long-read, and high-throughput chromosome conformation capture (Hi-C) sequencing data. In total, we generated 321 Gb (~107X coverage) of ONT reads with an N50 of 63.5 kb, 148 Gb (~49X coverage) of PacBio HiFi CCS reads with an N50 of 22.1 kb, and 396 Gb (~132X coverage) of Hi-C data (MGISEQ paired-end reads) (Table 1). The final genome assembly of East Friesian sheep, termed EFS v2.0, is about 2.96 Gb with a scaffold N50 of 104.10 Mb, comprising 27 chromosomes without any gaps (Table 2; Fig. 1) and 164 unplaced sequences. We observed that 94.53% of these unplaced sequences consist of repetitive elements, among which satellite sequences constitute 84.64%. Further research and refinement are needed to determine their precise genomic location and functional relevance. The EFS v2.0 assembly captured 41 telomeres and 24 centromeres (Table 3). Notably, the EFS v2.0 assembly closed 35 gaps in total compared to the previously published East Friesian genome⁶ (Fig. 2).
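The N50 and coverage figures quoted above follow from simple definitions, which the sketch below illustrates on toy numbers. It is not the authors' code. The function names are hypothetical, and the 3.0 Gb genome size used for the coverage check is an assumption about the denominator behind the reported depths, not a value stated in the text.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembled bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def coverage(total_bases, genome_size):
    """Sequencing depth as total sequenced bases over genome size."""
    return total_bases / genome_size


# Toy contig lengths (arbitrary units), not data from the assembly.
toy_lengths = [80, 70, 50, 40, 30, 20]
toy_n50 = n50(toy_lengths)

# Assumed nominal genome size of 3.0 Gb (hypothetical denominator).
ont_depth = coverage(321e9, 3.0e9)
```

Because a gap-free chromosome is a single contig, contig N50 and scaffold N50 coincide for such an assembly.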

Table 1 Summary of sequencing data of East Friesian sheep genome.

Table 2 Comparison of four sheep genomes.

Table 3 Centromere positions of East Friesian sheep genome.

In the EFS v2.0 genome, repeat sequences accounted for 1.60 Gb, representing 53.98% of the assembly (Table 4). Long interspersed nuclear element (LINE) retrotransposons (41.46%) were the most abundant component among repetitive elements, which is consistent with a previous study¹⁹ (Table 5). Gene annotation identified 24,580 protein-coding genes. Of these, 24,536 genes (99.8%) were anchored to the 27 chromosomes (Fig. 1), while 44 genes were anchored to unplaced scaffolds. The length and number of exons were similar to those of three other sheep breeds (Fig. 3a,b). Furthermore, the predicted proteins achieved a complete BUSCO score of approximately 98%, indicating high-quality annotation (Fig. 3c). In total, 23,891 (97.2%) protein-coding genes were successfully annotated in diverse databases, including Gene Ontology (GO), KOG, InterPro, SwissProt²⁰, Kyoto Encyclopedia of Genes and Genomes (KEGG)²¹, the NCBI non-redundant database (NR), and TrEMBL (Translated EMBL) (Table 6). Moreover, 17,328 (~70.5%) genes were supported by all five databases (Fig. 3d). Based on transcriptomic deep-sequencing data, we investigated gene expression levels in five different tissues (Table 7). A total of 15,263 (62.2%) genes showed detectable expression levels (transcripts per million ≥ 1) in one or more of these tissues. Through structural variant analysis against the previously published East Friesian sheep assembly⁶, we identified 232 newly assembled genes, among which 151 were expressed in the five transcriptome samples (Table 8; Fig. 4).

Table 4 General statistics of repeats in the EFS v2.0 assembly.

Table 5 Transposable elements (TEs) in the assembled EFS v2.0 assembly.

Table 6 Number of functional annotations for predicted genes in the EFS v2.0 assembly.

Table 7 Summary of RNA-seq sequencing data of East Friesian sheep genome.

Table 8 The improvement of EFS v2.0 assembly.

Methods

Sample collection, DNA preparation and RNA extractions

A 1-year-old female East Friesian sheep from the Inner Mongolia Key Lab of Bio-manufacture in the Inner Mongolia Autonomous Region of China was chosen for DNA and RNA sequencing. Because the sampled animal was female, the assembled sequence does not include the Y chromosome. The animal was healthy, and no genetic defects were observed in it or its parents.

DNA was extracted from a fresh blood specimen using the QIAGEN Blood & Cell Culture DNA Midi Kit according to the manufacturer’s instructions (QIAGEN, Germany). TRIzol (Invitrogen, Carlsbad, CA, United States) was used to extract total RNA from heart, rumen, subcutaneous fat, lung and perirenal fat tissues. The concentration of total RNA was determined using the Nano 6000 spectrophotometer Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, United States). The RNA purity was determined using the Qubit® RNA Assay Kit in a Qubit® 2.0 Fluorometer (Life Technologies, Camarillo, CA, United States).

Long insert libraries preparation and sequencing

The library construction and sequencing of RNA-seq full-length transcripts were conducted using a method similar to that described in Yuan, Ge et al.²², resulting in 437,807 full-length non-chimeric reads with mean length of 1,388 bp.

For the DNA PacBio long inserts libraries, the preparation was carried out in accordance with the “Using SMRTbell Express Template Prep Kit 2.0 With Low DNA Input” protocol²³ provided by PacBio (Pacific Biosciences, USA). This resulted in libraries with an insert size of approximately 20 kb. Subsequently, the libraries were subjected to sequencing using PacBio Sequel II platforms operating in CCS mode. The subreads were processed through the CCS algorithm of SMRTLink (v8.0.0)²⁴ with specific parameters: “-minPasses 3 -minPredictedAccuracy 0.99 -minLength 500”, yielding 148 Gb of PacBio’s long high-fidelity (HiFi) reads in total.

Furthermore, ultra-long DNA ONT libraries were created following the protocols detailed by Shafin et al.²⁵. These libraries were then sequenced on the PromethION sequencer platform (Oxford Nanopore Technologies, UK). The sequencing effort resulted in the production of 8,180,779 reads, with an N50 value of 63,509 bp.

Short insert libraries preparation and sequencing

RNA-seq libraries were prepared using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, Ipswich, MA, USA) following the manufacturer’s protocol. Subsequently, the RNA libraries were sequenced on an MGISEQ-2000 platform, producing 150 bp paired-end reads.

The Hi-C library was prepared using the same method described in Yin, Chen et al.²⁶ with the same blood specimen and sequenced on a MGISEQ-2000 instrument. A total of 395 Gb of clean data were obtained from 396 Gb of sequencing data using software SOAPnuke (v2.0)²⁷ with parameters “-n 0.01 -l 20 -q 0.1 -i -Q 2 -G 2 -M 2 -A 0.5”.

Genome assembly

The primary contigs were assembled from the HiFi reads using Hifiasm (v 0.16.1)¹⁵ with default parameters. The valid Hi-C reads were then used to anchor contigs onto chromosomes through the Juicer²⁸ and 3d-dna²⁹ pipelines. Chromosomes were numbered according to their collinearity with the 27 chromosomes of the Texel sheep genome³⁰. To achieve a near T2T gap-free reference genome assembly, gaps in the assembled genome were filled using LR_Gapcloser³¹ with error-corrected ONT long reads produced by NECAT³².
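Whether an assembly is gap-free can be checked directly, since gaps are conventionally written as runs of N in the FASTA sequence. The following is a minimal sketch of such a check, not part of the pipeline described above. The helper names and the toy FASTA record are illustrative assumptions.

```python
import re


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_gaps(sequence, min_len=1):
    """Return 0-based, half-open (start, end) coordinates of N runs."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]


# Toy input: chr1 carries one 4-bp gap, chr2 is gap-free.
TOY_FASTA = ">chr1\nACGTNNNNACGT\nACGT\n>chr2\nACGTACGT\n"

gap_report = {name: find_gaps(seq) for name, seq in parse_fasta(TOY_FASTA)}
total_gaps = sum(len(gaps) for gaps in gap_report.values())
```

For a gap-free assembly, `total_gaps` computed over the chromosome sequences would be zero.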

Annotation of repetitive sequences and identification of telomeres and centromeres

Two strategies, de novo and homology-based, were applied to annotate repetitive sequences. RepeatModeler (v1.0.4)³³ was used to identify de novo repeats, and LTR-FINDER (v1.0.7)³⁴ was used to annotate long terminal repeats. DNA and protein transposable elements (TEs) were detected by RepeatMasker (v4.0.7)³⁵ and RepeatProteinMasker (v4.0.7), respectively, based on the Repbase database³⁶. Finally, tandem repeats were identified using Tandem Repeat Finder (v4.10.0)³⁷. Following the methods described in the sweet tea³⁸, wild blueberry³⁹, and rapeseed⁴⁰ studies, the telomeric sequences in the EFS v2.0 genome assembly were identified using quarTeT (v1.0.3)⁴¹ with the “-c animal” option. The quarTeT program comprises four modules: AssemblyMapper, GapFiller, TeloExplorer, and CentroMiner, of which TeloExplorer is used to identify candidate telomeres. The telomere repeat monomer identified by quarTeT was “TTAGGG/CCCTAA”. The Centromics software (https://github.com/ShuaiNIEgithub/Centromics) was employed for centromere identification. Furthermore, we aligned human rRNA sequences against the EFS v2.0 genome using BLASTN with an E-value of 1 × 10⁻⁵ to investigate the ribosomal DNA (rDNA) in the EFS v2.0 genome. In total, we identified 1,329 ribosomal RNAs (rRNAs), including 122 18S rRNAs, 586 28S rRNAs, 59 5.8S rRNAs, and 562 5S rRNAs.
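The idea behind telomere detection can be illustrated by scanning each end of a sequence for tandem copies of the TTAGGG/CCCTAA monomer. The sketch below is a simplified stand-in and does not reproduce quarTeT's TeloExplorer algorithm. The window size, the minimum copy number, and all function names are hypothetical choices made for illustration.

```python
import re


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def longest_tandem_run(segment, motif):
    """Largest number of back-to-back motif copies found in segment."""
    best = 0
    for m in re.finditer(f"(?:{motif})+", segment):
        best = max(best, (m.end() - m.start()) // len(motif))
    return best


def telomere_ends(sequence, motif="TTAGGG", window=60, min_copies=4):
    """Return (start_has_telomere, end_has_telomere) for one sequence.

    An end is called telomeric when the terminal `window` bases contain at
    least `min_copies` consecutive copies of the motif on either strand.
    Both thresholds are illustrative assumptions.
    """
    seq = sequence.upper()
    motifs = (motif, revcomp(motif))
    head, tail = seq[:window], seq[-window:]
    head_run = max(longest_tandem_run(head, m) for m in motifs)
    tail_run = max(longest_tandem_run(tail, m) for m in motifs)
    return head_run >= min_copies, tail_run >= min_copies


# Toy chromosome: CCCTAA repeats at the start, TTAGGG repeats at the end.
toy = "CCCTAA" * 5 + "ACGTACGTAGCTAGCTAGGATCCGATCGATTAGC" * 3 + "TTAGGG" * 6
toy_calls = telomere_ends(toy)
```

Real telomeres span kilobases, so a practical scan would use a much larger window and copy threshold than this toy example.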

Protein-coding genes prediction and functional annotation

Protein-coding genes in the EFS v2.0 assembly were predicted using a method similar to that of Fang, Mou et al.⁴². For the RNA-seq-based prediction approach, clean RNA-seq reads were aligned to the EFS v2.0 assembly with Hisat2 (v2.1.0)⁴³ using the parameters “--sensitive --no-discordant --no-mixed -I 1 -X 1000 --max-intronlen 1000000”. Additionally, 437,807 full-length transcripts were aligned to the EFS v2.0 assembly using GMAP (v 2017-11-15)⁴⁴. The aligned reads were then assembled using Stringtie (v1.3.5)⁴⁵ with the parameters “-f 0.3 -j 3 -c 5 -g 100 -s 10000”. Subsequently, TransDecoder (v5.5.0) (https://github.com/TransDecoder/TransDecoder) was run with default settings to identify coding sequences. In the homology-based prediction method, GeMoMa (v1.9)⁴⁶ was used to detect homologous peptides across four mammalian genomes: human (GCA_000001405.29)⁴⁷, Texel sheep (GCA_000298735.2)⁹, Tibetan sheep (GCA_017524585.1)⁷, and Rambouillet sheep (GCA_016772045.1)⁸. Genes that had RNA-seq-based prediction support with correct structure, but were not identified via homology-based prediction, were incorporated into the gene set. Finally, untranslated regions and alternative splicing regions were determined using the Program to Assemble Spliced Alignments (PASA)⁴⁸. The integrated gene set was translated into amino-acid sequences and functionally annotated by mapping against the KEGG⁴⁹, Swiss-Prot²⁰, TrEMBL²⁰, KOG⁵⁰, InterPro⁵¹ and NR (NCBI non-redundant protein) databases using BLAST (v2.2.26)⁵² with an E-value threshold of 1E-5. Protein domains and motifs were annotated using InterProScan⁵³, from which Gene Ontology (GO)⁵⁴ terms were derived.
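The E-value threshold used for functional annotation can be illustrated by filtering BLAST hits in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The sketch below makes two assumptions that the text does not state: that the output is in this tabular format, and that the single best hit per query by bit score is kept. All identifiers in the toy data are hypothetical.

```python
def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the top-scoring hit
    of each query whose E-value does not exceed max_evalue."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue  # hit is not significant at the chosen threshold
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Toy hits: gene1 has two significant hits, gene2 has none.
TOY_HITS = "\n".join([
    "gene1\tsp|P1\t98.0\t200\t4\t0\t1\t200\t1\t200\t1e-50\t350",
    "gene1\tsp|P2\t80.0\t180\t36\t0\t1\t180\t5\t184\t1e-20\t150",
    "gene2\tsp|P3\t35.0\t60\t39\t0\t1\t60\t10\t69\t0.5\t30",
])

annotated = best_hits(TOY_HITS)
```

Genes with no hit passing the threshold, such as `gene2` here, would remain unannotated for that database.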

Gene expression analysis

Quality control of raw RNA-seq reads was conducted using SOAPnuke (v2.0)²⁷. Afterwards, the clean reads were aligned to the EFS v2.0 genome using Hisat2 (v2.1.0)⁴³, with the following parameters: ‘--phred33 -p 5 --sensitive --no-discordant --no-mixed -I 1 -X 1000’. A read count matrix was generated using featureCounts⁵⁵. Gene expression levels were calculated using the transcripts per million (TPM) method.
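The TPM calculation can be sketched as follows: read counts are first divided by gene length in kilobases, and the resulting rates are rescaled so that they sum to one million within a sample. This is a generic illustration of the TPM definition rather than the authors' script. The gene names, counts, and lengths are toy values.

```python
def tpm(counts, lengths_bp):
    """Convert per-gene read counts to transcripts per million (TPM).

    counts and lengths_bp are dicts keyed by gene; lengths are in bp.
    """
    # Length-normalised rate: reads per kilobase of gene.
    rates = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    scale = sum(rates.values())
    if scale == 0:
        return {g: 0.0 for g in counts}
    # Rescale so that the values in one sample sum to one million.
    return {g: rate / scale * 1e6 for g, rate in rates.items()}


# Toy sample with three hypothetical genes.
toy_counts = {"geneA": 100, "geneB": 300, "geneC": 0}
toy_lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

toy_tpm = tpm(toy_counts, toy_lengths)

# Genes counted as expressed under the TPM >= 1 criterion used in the text.
expressed = sorted(g for g, value in toy_tpm.items() if value >= 1)
```

Here `geneA` and `geneB` receive the same TPM despite different counts, because `geneB` is three times longer.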

Identification of new assembled genes

The software Syri (v1.6.3)⁵⁶ was employed to detect structural variations between the EFS v2.0 genome assembly and the previously published East Friesian sheep assembly⁶. A gene was classified as newly assembled if the previously published assembly⁶ exhibited a deletion of at least 50 bp relative to EFS v2.0 and the gene region overlapped that deleted region by at least 30%.
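The classification rule can be sketched as a simple interval-overlap test. This is an illustration only: it assumes that the 30% threshold is measured as a fraction of the gene's length, which the text does not specify, and it uses 0-based half-open coordinates. The function names and the toy genes and deletions are hypothetical.

```python
def overlap_len(a_start, a_end, b_start, b_end):
    """Length of the overlap between two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def newly_assembled(genes, deletions, min_sv_len=50, min_frac=0.30):
    """Return names of genes overlapping a qualifying deletion.

    genes: iterable of (name, chrom, start, end) in EFS v2.0 coordinates.
    deletions: iterable of (chrom, start, end) regions absent from the
    older assembly. A gene qualifies if a deletion of at least min_sv_len
    bp covers at least min_frac of the gene's length (assumed definition).
    """
    result = []
    for name, chrom, g_start, g_end in genes:
        g_len = g_end - g_start
        if g_len <= 0:
            continue
        for d_chrom, d_start, d_end in deletions:
            if d_chrom != chrom or d_end - d_start < min_sv_len:
                continue
            if overlap_len(g_start, g_end, d_start, d_end) / g_len >= min_frac:
                result.append(name)
                break
    return result


toy_genes = [
    ("g1", "chr1", 1000, 2000),  # 50% covered by a 1,000-bp deletion
    ("g2", "chr1", 5000, 6000),  # only 10% covered
    ("g3", "chr2", 100, 200),    # 40% covered, but the deletion is < 50 bp
]
toy_deletions = [("chr1", 1500, 2500), ("chr1", 5900, 6400), ("chr2", 120, 160)]

toy_new_genes = newly_assembled(toy_genes, toy_deletions)
```

Only `g1` satisfies both the deletion-size and the overlap criteria in this toy example.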

Reads coverage analysis of genome assembly

We assessed whether the long sequencing reads extended across the regions that required gap filling. Prior to this process, the genome contained eight gaps. We employed minimap2⁵⁷ (v 2.24) to map both the ONT and HiFi reads to the EFS v2.0 genome. Utilizing SAMtools⁵⁸ (v 1.10) with the ‘-q 20’ option, we filtered out low-quality and multi-mapping reads. Subsequently, we utilized the IGV software for visualizing the high-quality alignment results.

Quality value (QV) calculations

The quality value (QV) is a standard metric of base-level accuracy in whole-genome sequencing and assembly. It is expressed on the Phred scale, that is, as the negative logarithm of the probability that a given base is erroneous: QV = −10 × log₁₀ (error probability). For instance, an error probability of 0.001 equates to a QV of 30, indicating high confidence in the correctness of the base call. During sequencing, each nucleotide is assigned a Phred score that depends on the signal-to-noise ratio; this score is subsequently converted to a QV, thereby providing an index of the fidelity of the sequencing data. In this study, we used Merqury⁵⁹ to compute the QV of the assembly.
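The conversion between QV and error probability can be written in a few lines. The sketch below covers only the Phred-scale arithmetic given in the equation above and does not reproduce how Merqury estimates the error probability. The function names are hypothetical.

```python
import math


def qv_from_error(error_probability):
    """Phred-scaled quality value for a per-base error probability."""
    return -10 * math.log10(error_probability)


def error_from_qv(qv):
    """Per-base error probability implied by a quality value."""
    return 10 ** (-qv / 10)


# Worked example from the text: an error probability of 0.001 gives QV 30.
example_qv = qv_from_error(0.001)

# Error probability implied by the assembly QV of 69.1 reported in this study.
assembly_error = error_from_qv(69.1)
expected_errors_per_10mb = assembly_error * 10_000_000
```

Each increase of 10 in QV corresponds to a tenfold reduction in the per-base error probability.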

Data Records

The DNA sequence reads of East Friesian sheep (ultra-long ONT library: SRR26273756⁶⁰; Hi-C library: SRR26273763⁶⁰; PacBio HiFi library: SRR26273762⁶⁰) and the RNA sequence reads of East Friesian sheep (five transcriptome libraries: SRR26273757-SRR26273761⁶⁰) have been deposited in the Sequence Read Archive (SRA). The genome assembly has been deposited in the GenBank database under the accession number JAWMPZ000000000⁶¹. The gene structure annotation, repeat prediction and gene functional annotation files have been deposited in the Figshare database⁶².

Technical Validation

Multiple methods were employed to validate the accuracy and completeness of EFS v2.0 assembly.

Firstly, we used the long sequencing reads to check that they extended across the eight gap regions (Table 9). The resulting plots confirmed comprehensive coverage of the targeted regions (Fig. 5). Secondly, the Hi-C heatmap displayed high consistency across all chromosomes, demonstrating the correct ordering and orientation of contigs in the EFS v2.0 assembly (Fig. 6). Thirdly, the EFS v2.0 assembly exhibited high collinearity with Rambouillet sheep (GCA_016772045.1)⁸, Tibetan sheep (GCA_017524585.1)⁷ and the previously published East Friesian sheep (GCA_018804185.1)⁶ (Fig. 7). Fourthly, the accuracy was confirmed by the high mapping rates of the two types of long reads, with 99.93% of ONT reads and 100% of HiFi reads aligning to the EFS v2.0 assembly. Notably, the assembly attained a quality value (QV) of 69.1, corresponding to an error probability of approximately 1.23 × 10⁻⁷, or about 1.2 errors per 10 million bases. This level of accuracy should facilitate subsequent genetic analysis and research. Lastly, the Benchmarking Universal Single-Copy Orthologs (BUSCO) test revealed that the EFS v2.0 assembly contained 97.1% of the 9,226 genes in the mammalia gene set, the highest level of BUSCO completeness among the four commonly used genomes (Fig. 8).

Table 9 The location of the gap to be filled.

Code availability

No specific code was developed for this study. The data analyses were conducted following the manuals and protocols provided by the developers of the relevant bioinformatics tools, which are described in the Methods section along with the versions used.

References

Mohamadipoor Saadatabadi, L. et al. Signature selection analysis reveals candidate genes associated with production traits in Iranian sheep breeds. BMC Veterinary Research 17 (2021).
Li, X. et al. Whole genome re-sequencing reveals artificial and natural selection for milk traits in East Friesian sheep. Frontiers in Veterinary Science 9, 1034211 (2022).
Kominakis, A., Hager-Theodorides, A. L., Saridaki, A., Antonakos, G. & Tsiamis, G. Genome-wide population structure and evolutionary history of the Frizarta dairy sheep. Animal: an international journal of animal bioscience 11, 1680–1688 (2017).
Nguyen, Q. V. et al. Supplementing grazing dairy ewes with plant-derived oil and rumen-protected EPA+DHA pellets enhances health-beneficial n-3 long-chain polyunsaturated fatty acids in sheep milk. European Journal of Lipid Science and Technology 120, 1700256 (2018).
Afolayan, R. A. et al. Genetic evaluation of crossbred lamb production. 3. Growth and carcass performance of second-cross lambs. Australian Journal of Agricultural Research 58, 5 (2007).
Li, R. et al. A sheep pangenome reveals the spectrum of structural variations and their effects on tail phenotypes. Genome Research 33, 3 (2023).
Li, X. et al. Genomic analyses of wild argali, domestic sheep, and their hybrids provide insights into chromosome evolution, phenotypic variation, and germplasm innovation. Genome Research 32, 9 (2022).
Davenport, K. M. et al. An improved ovine reference genome assembly to facilitate in-depth functional annotation of the sheep genome. GigaScience 11, giab096 (2022).
NCBI Genbank. https://identifiers.org/ncbi/insdc.gca:GCA_000298735.2 (2015).
Chin, C. S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nature Methods 13, 1050–1054 (2016).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology 37, 540–546 (2019).
Koren, S. et al. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Research 27, 722–736 (2017).
Ruan, J. & Li, H. Fast and accurate long-read assembly with wtdbg2. Nature Methods 17, 1–4 (2020).
Hu, J. et al. An efficient error correction and accurate assembly tool for noisy long reads. (bioRxiv, 2023).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 170–175 (2021).
Marx, V. Long road to long-read assembly. Nature Methods 18, 125–129 (2021).
Chen, J. et al. A complete telomere-to-telomere assembly of the maize genome. Nature Genetics 55, 1221–1231 (2023).
Nurk, S. et al. The complete sequence of a human genome. Science (New York, N.Y.) 376, 44–53 (2021).
Su, R. et al. Draft Genome of the European Mouflon (Ovis orientalis musimon). Frontiers in Genetics 11, 533611 (2020).
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Research 27, 49–54 (1999).
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28, 27–30 (2000).
Yuan, Z. et al. Integrative analysis of Iso-Seq and RNA-seq data reveals transcriptome complexity and differentially expressed transcripts in sheep tail fat. PeerJ 9, e12454 (2021).
PacBio. (2021). Procedure & Checklist – Preparing HiFi SMRTbell® Libraries using the SMRTbell Express Template Prep Kit 2.0. Pacific Biosciences of California, Inc. https://www.pacb.com/wp-content/uploads/Procedure-Checklist-Preparing-HiFi-SMRTbell-Libraries-using-SMRTbell-Express-Template-Prep-Kit-2.0.pdf (Updated March 2019).
Chin, C. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nature Methods 10, 563–569 (2013).
Shafin, K. et al. Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes. Nature Biotechnology 38, 1044–1053 (2020).
Yin, D. et al. Gapless genome assembly of East Asian finless porpoise. Scientific Data 9 (2022).
Chen, Y. et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. GigaScience 7 (2017).
Durand, N. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Systems 3, 95–98 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, eaal3327 (2017).
Bahcall, O. Sheep genome. Nature Genetics 46, 665–665 (2014).
Xu, G. C. et al. LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly. GigaScience 8 (2018).
Chen, Y. et al. Efficient assembly of nanopore reads via highly accurate and intact error correction. Nature Communications 12 (2021).
Chen, N. Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences. Current Protocols in Bioinformatics 5 (2004).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Research 35, W265–W268 (2007).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics (Oxford, England) 21(Suppl 1), i351–8 (2005).
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6 (2015).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Research 27, 573–80 (1999).
Liu, H. et al. Chromosome-scale genome assembly of sweet tea (Lithocarpus polystachyus Rehder). Scientific Data 10 (2023).
Zeng, T. et al. The telomere-to-telomere gap-free reference genome of wild blueberry (Vaccinium duclouxii) provides its high soluble sugar and anthocyanin accumulation. Horticulture Research 10, uhad209 (2023).
Li, B. et al. A gap-free reference genome reveals structural variations associated with flowering time in rapeseed (Brassica napus). Horticulture Research 10, uhad171 (2023).
Lin, Y. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Horticulture Research 10, uhad127 (2023).
Fang, X. et al. The sequence and analysis of a Chinese pig genome. GigaScience 1, 16 (2012).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology 37, 907–915 (2019).
Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequence. Bioinformatics 21, 1859–75 (2005).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biology 20, 278 (2019).
Jens et al. GeMoMa: Homology-Based Gene Prediction Utilizing Intron Position Conservation and RNA-seq Data. Methods in Molecular Biology 1962, 161–177 (2019).
NCBI Genbank. https://identifiers.org/ncbi/insdc.gca:GCA_000001405.29 (2022).
Haas, B. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research 31, 5654–5666 (2003).
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Research 44, D457–D462 (2015).
Korf, I. Gene finding in novel genomes. BMC Bioinformatics 5, 59 (2004).
Finn, R. D. et al. InterPro in 2017—beyond protein family and domain annotations. Nucleic Acids Research 45, D190–D199 (2016).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. Journal of Molecular Biology 215, 403–410 (1990).
Mulder, N. J. & Apweiler, R. InterPro and InterProScan: tools for protein sequence classification and comparison. Methods in Molecular Biology 396, 59–70 (2007).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nature Genetics 25, 25–29 (2000).
Liao, Y., Smyth, G. K. & Shi, W. featureCounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30, 923–30 (2013).
Goel, M., Sun, H., Jiao, W.-B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biology 20 (2019).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics (Oxford, England) 25, 2078–2079 (2009).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biology 21 (2020).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP464403 (2023).
Liu, Y. Ovis aries breed East Friesian sheep isolate LY-2023, whole genome shotgun sequencing project. GenBank https://identifiers.org/ncbi/insdc.gca:GCA_033439445.1 (2023).
Liu, Y. B. Annotation Files of East Friesian sheep. figshare https://doi.org/10.6084/m9.figshare.24208689.v3 (2023).
Tang, H. et al. Synteny and Collinearity in Plant Genomes. Science 320, 486–488 (2008).

Acknowledgements

This work was funded by the Inner Mongolia Autonomous Region Open Competition Projects (2022JBGS0024), China Agriculture Research System (CARS-38) and National Key Research and Development Program Projects (2021YFD1300903).

Author information

These authors contributed equally: Xiaoxiao You, Qinyuan Fang, Chunhai Chen.

Authors and Affiliations

Inner Mongolia University, Hohhot, China
Xiaoxiao You, Qinyuan Fang, Teng Zhang, Yang Zhou, Zheng Wang, Tianhao Sun, Huanmin Zhou, Yanfeng Dai & Yongbin Liu
BGI Genomics, Shenzhen, China
Chunhai Chen, Xukui Yang & Jianbo Jian
Inner Mongolia Agricultural University, Hohhot, China
Junwei Cao, Shenyuan Wang & Huanmin Zhou
Inner Mongolia Academy of Agricultural & Animal Husbandry Sciences, Hohhot, China
Shaoyin Fu, Xiaolong He, Jiangfeng He, Biao Wang, Liwei Wang, Rigele Te & Yongbin Liu

Contributions

Y.F.D. and H.M.Z. conceived the study. Y.B.L. was responsible for implementing the study. Q.Y.F., S.Y.W., X.L.H., J.F.H., B.W., L.W.W. and R.T. collected the samples. Q.Y.F., J.W.C., S.Y.W., Z.W., X.K.Y., T.H.S. and J.B.J. processed the samples. X.X.Y., Q.Y.F. and C.H.C. processed bioinformatics data. X.X.Y., S.Y.F., T.Z. and Y.Z. performed bioinformatics analysis. X.X.Y. and C.H.C. wrote the manuscript with significant contributions. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Huanmin Zhou, Yanfeng Dai or Yongbin Liu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

You, X., Fang, Q., Chen, C. et al. A near complete genome assembly of the East Friesian sheep genome. Sci Data 11, 762 (2024). https://doi.org/10.1038/s41597-024-03581-w

Received: 10 November 2023
Accepted: 27 June 2024
Published: 11 July 2024
DOI: https://doi.org/10.1038/s41597-024-03581-w
Springer Nature Limited
