Abstract
Guide Black-Fur sheep (GD) is a breed of Tibetan sheep (Ovis aries) that lives in the Qinghai–Tibetan Plateau region at an altitude of over 4,000 m. However, a lack of genomic information has made it difficult to understand the high-altitude adaptation of these sheep. We sequenced and assembled the GD reference genome using PacBio, Hi-C, and Illumina sequencing technologies. The final assembled genome size was 2.73 Gb, with a contig N50 of 20.30 Mb and a scaffold N50 of 107.63 Mb. The genome is predicted to contain 20,759 protein-coding genes, of which 98.42% have functional annotations. Repeat elements account for approximately 52.2% of the genome. The completeness of the GD genome assembly is supported by a BUSCO score of 93.1%. This high-quality genome assembly provides a critical resource for future molecular breeding and genetic improvement of Tibetan sheep.
Background & Summary
Tibetan sheep play a unique and essential role in China’s rural revitalization strategy. High-quality development of the Tibetan sheep industry is expected to improve social and economic development while maintaining the ecosystem stability and national security of the Qinghai–Tibetan Plateau. Germplasms of these sheep are considered valuable because of their unique biological characteristics, high genetic stability, stress resistance, meat quality, and carpet wool quality1,2,3. Additionally, having undergone long-term natural and artificial selection under specific ecological conditions, they are a model species for studying adaptations to extreme environments. Therefore, assembly of the Tibetan sheep reference genome at the chromosome level will substantially facilitate comparative and functional genomics research. Although draft genome assemblies of certain sheep species have been released, their quality metrics were low4,5,6,7,8. Furthermore, no comprehensive genomes of sheep from high-altitude environments (>4,000 m) are available.
Guide Black-Fur sheep (GD) are known for their black skin and spend most of the year grazing in their core habitat of Guide County, Hainan Tibetan Autonomous Prefecture of Qinghai, China (altitude ~4,100 m). GD sheep exhibit a variety of fleece colors, ranging from blackish-red to grey. Black skin contains more melanocytes than skin of other colors and thus synthesizes more melanin, which absorbs ultraviolet radiation and thereby reduces its damaging effects on organs9,10. This adaptation strategy has allowed GD to thrive in an environment with intense ultraviolet radiation.
In this study, we aimed to produce a high-quality, chromosome-scale, de novo genome assembly of the high-altitude GD using Illumina, PacBio, and chromosome conformation capture (Hi-C) sequencing. Comparisons with other mammalian genomes helped to identify rapidly evolving genes and elucidate the evolutionary history of Tibetan sheep.
Methods
Sample collection and sequencing
All animal experiments were performed under the guidance of ethical regulations from the Institutional Animal Care and Use Committee of Lanzhou Institute of Husbandry and Pharmaceutical Science of Chinese Academy of Agricultural Sciences (Approval No. NKMYD201805; Approval Date: 18 October 2018). We randomly selected one adult male GD (age: 3 years) from flocks raised under standard feeding regimes and with free access to water in Hainan Tibetan Autonomous Prefecture, Qinghai, China (altitude: ~4,100 m). Blood samples were collected from the jugular vein and then preserved in EDTA anti-freezing tubes at −20 °C. The animal was slaughtered by exsanguination after being deprived of food for 24 h. The brain, heart, liver, kidney, spleen, lung, and muscle tissues were excised within 30 min and stored in liquid nitrogen.
For PacBio long-read and Illumina short-read sequencing, genomic DNA was extracted from the blood and liver of the GD. The DNA was sequenced at Frasergen Bioinformatics (Wuhan, China) using the PacBio Sequel platform (Pacific Biosciences), yielding a total of 267.82 Gb of PacBio continuous long reads, corresponding to 98× genomic coverage (Table 1). The DNA was also sequenced using HiSeq X-Ten (Illumina, CA, USA) to correct the long reads. Paired-end sequencing produced 263.48 Gb of short-read data, corresponding to 96× genomic coverage (Table 1).
Full-length transcripts were sequenced with the PacBio Iso-Seq method, which is based on Single Molecule Real-Time (SMRT) sequencing technology. Total RNA was extracted from all tissues (brain, femur, spleen, back muscle, kidney, heart, lung, and liver) using the RN33 kit (Aidlabs Biotechnologies, Beijing, China). The libraries were sequenced on the same PacBio platform described above, producing 36.90 Gb of Iso-Seq reads (Table 1).
For Hi-C sequencing, fresh liver tissues were collected for library construction, which was performed as previously described11. Paired-end (2 × 150 bp) sequencing was conducted on the Illumina HiSeq X-Ten platform to obtain 494.40 Gb of data, corresponding to 181× genomic coverage (Table 1).
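The coverage figures quoted in this section follow from dividing each data yield by the genome size. The short Python sketch below reproduces that arithmetic. It assumes the denominator is the final assembly size of 2.73 Gb; the text does not state which genome size estimate was actually used, so the results should be read as approximate.

```python
# Approximate sequencing depth as total bases sequenced / genome size.
# Assumption: the final assembly size (2.73 Gb) is used as the genome size.
GENOME_SIZE_GB = 2.73

# Data yields in Gb, as reported in the Methods.
yields_gb = {
    "PacBio CLR": 267.82,
    "Illumina paired-end": 263.48,
    "Hi-C": 494.40,
}


def fold_coverage(yield_gb: float, genome_size_gb: float = GENOME_SIZE_GB) -> float:
    """Return the mean depth implied by a data yield and a genome size."""
    return yield_gb / genome_size_gb


coverage = {name: fold_coverage(gb) for name, gb in yields_gb.items()}

for name, depth in coverage.items():
    print(f"{name}: {depth:.1f}x")
```

Under this assumption the integer parts of the computed depths match the reported 98×, 96×, and 181×.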
Genome assembly and polishing
All subreads from SMRT sequencing were used for de novo assembly of the GD genome. The draft genome assembly was obtained using MECAT2 with default parameters12. Variant calling with gcpp in the SMRT Link 4 toolkit was performed to correct errors after the initial genome assembly. Next, the corresponding HiSeq reads were mapped to the contigs from the previous step, and the contigs were polished once using Pilon (v1.22) to correct any remaining errors13. This gave a total of 1,972 primary contigs for GD, with an N50 size of 20.30 Mb (Table 2).
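For readers unfamiliar with the N50 statistic used here, the following Python sketch shows one common way to compute it from a list of sequence lengths. The contig lengths in the example are hypothetical and are not taken from the GD assembly.

```python
def n50(lengths):
    """Return the N50 of a set of sequence lengths.

    The N50 is the length L such that sequences of length >= L
    together cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical contig lengths (arbitrary units), for illustration only.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

The same function applies to scaffold lengths, which is how a contig N50 and a scaffold N50 can differ for one assembly.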
Hi-C scaffolding
Next-generation sequencing short reads were aligned to the assembly using the Burrows-Wheeler Aligner (BWA-MEM algorithm, v0.7.17) to further increase single-base accuracy14. Purge Haplotigs was used to filter redundant sequences resulting from genome heterozygosity15. Pseudochromosomes were created during Hi-C analysis, as described previously16. Briefly, all clean read pairs produced from the Hi-C library were mapped to the polished sheep contig assembly using the BWA-MEM algorithm with default parameters. LACHESIS was then used to cluster contigs into chromosome-level scaffolds based on the genomic proximity signal of Hi-C data17. After scaffolding, we obtained 27 chromosomes containing > 98% of the contigs (Table 3, Fig. 1). The final assembled genome (2.73 Gb) had a scaffold N50 of 107.63 Mb and 1,246 gaps in total (Table 2).
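The idea of grouping contigs by Hi-C proximity signal can be illustrated with a toy example. The Python sketch below performs a greedy average-linkage clustering on normalised link counts. It is not the LACHESIS implementation: the function name, the normalisation by the product of contig lengths, and all input values are assumptions made purely for illustration.

```python
from itertools import combinations


def cluster_contigs(lengths, links, n_groups):
    """Greedy average-linkage clustering of contigs by Hi-C link density.

    lengths : dict mapping contig name -> length
    links   : dict mapping frozenset({a, b}) -> Hi-C read pairs joining a and b
    n_groups: number of groups (e.g. chromosomes) to stop at
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    groups = [{name} for name in lengths]

    def density(g1, g2):
        # Mean normalised link count over all contig pairs between two groups.
        # Assumption: link counts are normalised by the product of lengths.
        total = 0.0
        for a in g1:
            for b in g2:
                count = links.get(frozenset((a, b)), 0)
                total += count / (lengths[a] * lengths[b])
        return total / (len(g1) * len(g2))

    while len(groups) > n_groups:
        # Merge the pair of groups with the highest link density.
        i, j = max(
            combinations(range(len(groups)), 2),
            key=lambda ij: density(groups[ij[0]], groups[ij[1]]),
        )
        groups[i] |= groups[j]
        del groups[j]  # safe because i < j
    return groups


# Hypothetical contigs and link counts, for illustration only.
toy_lengths = {"c1": 10, "c2": 12, "c3": 8, "c4": 9}
toy_links = {
    frozenset(("c1", "c2")): 900,
    frozenset(("c3", "c4")): 800,
    frozenset(("c1", "c3")): 20,
    frozenset(("c2", "c4")): 10,
}
toy_groups = cluster_contigs(toy_lengths, toy_links, n_groups=2)
print(sorted(sorted(g) for g in toy_groups))
```

In this toy input, contigs that share many Hi-C read pairs end up in the same group, which mirrors the intuition that loci on the same chromosome contact each other more often than loci on different chromosomes.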
Annotation of repetitive sequences and genes
Repeat sequences in the GD genome were annotated using both de novo and homology-based prediction methods. Known transposable elements were identified by combining RepeatMasker, RepeatProteinMask, and RepeatModeler18. Repetitive transposable element (TE) sequences comprised ~52% of the total genome assembly, with long terminal repeat retrotransposons being the most abundant (~20.99%) (Table 4).
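A repeat proportion of this kind can be sanity-checked directly from a soft-masked assembly, in which repeats are written in lowercase. The Python sketch below counts lowercase bases in FASTA-formatted text. The toy sequence is hypothetical, and the sketch counts all characters (including any N bases) in the denominator, which may differ from how the reported percentage was calculated.

```python
def masked_fraction(fasta_text: str) -> float:
    """Return the fraction of bases that are lowercase (soft-masked)."""
    masked = 0
    total = 0
    for line in fasta_text.splitlines():
        seq = line.strip()
        if not seq or seq.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(seq)
        masked += sum(1 for base in seq if base.islower())
    if total == 0:
        raise ValueError("no sequence found")
    return masked / total


# Hypothetical soft-masked sequence, for illustration only.
toy_fasta = """>chr_toy
ACGTacgtacGTAC
ttttGGGGCC
"""
print(f"{masked_fraction(toy_fasta):.1%} of bases are soft-masked")
```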
For gene annotation, we adopted a strategy combining ab initio gene finding, homology-based gene prediction, and Iso-Seq reads. First, the assembled GD genome was hard- and soft-masked by RepeatMasker. Second, ab initio gene prediction was performed using the Augustus (v3.3.1) and SNAP gene models, which were pre-trained using homologous proteins19. Third, Exonerate (v2.2.0) set to default parameters was used to predict genes from protein sequences20. Fourth, clean RNA-sequencing reads were assembled into transcripts via Trinity for RNA-based gene prediction, followed by further prediction of gene structure using PASA21. Finally, Maker (v3.00) was employed to integrate the three sets of prediction results22. In total, we identified 20,759 predicted protein-coding genes within the genome, of which 98.42% were functionally annotated (Table 5). A comparison of gene features among GD, Texel sheep (Ovis aries)23, Rambouillet sheep (Ovis aries)24, San Clemente goats (Capra hircus)25, and humans (Homo sapiens)26 revealed similar length distributions for coding sequences, genes, exons, and introns (Fig. 2).
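Length distributions of genes and exons, such as those compared across species above, are typically derived from an annotation file. The Python sketch below collects feature lengths from GFF3-formatted text. The toy records are hypothetical, and the sketch assumes standard nine-column, tab-separated GFF3 with 1-based inclusive coordinates; it is not the procedure used to produce Fig. 2.

```python
from collections import defaultdict
from statistics import mean


def feature_lengths(gff3_text):
    """Collect lengths per feature type from GFF3 text.

    GFF3 coordinates are 1-based and inclusive, so length = end - start + 1.
    """
    lengths = defaultdict(list)
    for line in gff3_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # skip malformed records
        feature_type, start, end = cols[2], int(cols[3]), int(cols[4])
        lengths[feature_type].append(end - start + 1)
    return lengths


# Hypothetical annotation records, for illustration only.
toy_rows = [
    ("chr_toy", "maker", "gene", 1001, 5000, ".", "+", ".", "ID=g1"),
    ("chr_toy", "maker", "mRNA", 1001, 5000, ".", "+", ".", "ID=g1.t1;Parent=g1"),
    ("chr_toy", "maker", "exon", 1001, 1300, ".", "+", ".", "Parent=g1.t1"),
    ("chr_toy", "maker", "exon", 2001, 2500, ".", "+", ".", "Parent=g1.t1"),
    ("chr_toy", "maker", "exon", 4501, 5000, ".", "+", ".", "Parent=g1.t1"),
]
toy_gff3 = "##gff-version 3\n" + "\n".join(
    "\t".join(str(field) for field in row) for row in toy_rows
)

toy_lengths_by_type = feature_lengths(toy_gff3)
for feature_type, values in toy_lengths_by_type.items():
    print(f"{feature_type}: n={len(values)}, mean length={mean(values):.1f}")
```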
The tRNA-related genes were mainly identified by tRNAscan-SE (v1.3.1) and Infernal (v1.1.2) using default parameters18,27. A total of 256,815 non-coding RNA genes were predicted, comprising 528 miRNA, 231 rRNA, 264,021 tRNA, and 2,017 snRNA genes (Table 6).
Data Records
Raw data from the long-read and short-read sequencing have been deposited in the NCBI Sequence Read Archive database with accession numbers SRR22290763 (Iso-seq) and SRR22585187 (whole-genome sequencing) under BioProject PRJNA898852 (refs. 28,29). The final chromosome assembly was deposited in GenBank at NCBI with accession number JBEJUG010000000 (ref. 30). The draft genome assembly and genome annotation were deposited in the Figshare database (https://doi.org/10.6084/m9.figshare.26013145) (ref. 31).
Technical Validation
Evaluation of the genome assembly
Quality metrics included assessment of the completeness of each genome based on the proportion of single-copy conserved orthologs obtained and specific signs of mis-assembly. Benchmarking Universal Single-Copy Orthologs (BUSCO) software32 revealed that ~93% of the conserved genes were identified in the GD genome, confirming the completeness of the obtained assembly. The proportion of single-copy conserved orthologs in this assembly (93.1%) was higher than that found in Hu sheep and Texel sheep23,33. The percentage of duplicated complete BUSCOs detected in the GD genome (1.9%) was comparable to the range observed across the genomes of Texel sheep and Rambouillet sheep (0.9%–1.6%)23,24. The GD genome also exhibited a lower percentage of missing BUSCOs (4.2%) compared with the Texel sheep genome (11.1%)23, suggestive of genomic integrity and the precise assembly of highly complex regions (Table 7).
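BUSCO reports its results as a one-line summary of complete (C), single-copy (S), duplicated (D), fragmented (F), and missing (M) percentages. The Python sketch below parses such a line and checks that the categories add up as expected. The example line uses hypothetical values rather than the GD results in Table 7, and the tolerance for rounding is an assumption.

```python
import re

# Pattern for the one-line BUSCO summary, e.g.
# C:95.0%[S:93.0%,D:2.0%],F:1.0%,M:4.0%,n:1000
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(line):
    """Parse a BUSCO one-line summary into a dict of percentages and n."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("not a BUSCO summary line")
    summary = {key: float(match.group(key)) for key in "CSDFM"}
    summary["n"] = int(match.group("n"))
    return summary


def is_consistent(summary, tol=0.11):
    """Check that S + D equals C and that C + F + M equals 100.

    The tolerance allows for one-decimal rounding in the reported values.
    """
    complete_ok = abs(summary["S"] + summary["D"] - summary["C"]) <= tol
    total_ok = abs(summary["C"] + summary["F"] + summary["M"] - 100.0) <= tol
    return complete_ok and total_ok


# Hypothetical summary line, for illustration only.
toy_summary = parse_busco("C:95.0%[S:93.0%,D:2.0%],F:1.0%,M:4.0%,n:1000")
print(toy_summary, is_consistent(toy_summary))
```

A check of this kind makes it clear whether a quoted percentage refers to complete BUSCOs or to the single-copy subset only.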
Genome collinearity analysis
Genome synteny analysis of the GD, Texel, and Rambouillet sheep breeds was performed using the MUMmer tool with default parameters (filtering of delta sequences with the -1 parameter and removal of collinear fragments <10 kb). Single-nucleotide polymorphisms and variations between the three genomes were identified using the show-snps (-rT parameter) and show-diff (-rH parameter) utilities, respectively. Overall, the GD genome demonstrates strong collinearity with Texel and Rambouillet sheep. Some obvious insertions and inversions on chromosomes 1, 3, and 7 may be species-specific (Fig. 3).
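The MUMmer steps described above can be laid out as an ordered list of commands. The Python sketch below only builds and prints the command lines; it does not run them. All file names are hypothetical, and the use of `delta-filter -l 10000` to drop alignments shorter than 10 kb is an assumption about how the length filter was applied, since the text gives only the `-1`, `-rT`, and `-rH` options.

```python
import shlex


def mummer_commands(reference, query, prefix, min_len=10_000):
    """Return (argv, output_file) pairs for a nucmer-based comparison.

    output_file is the file that stdout would be redirected to, or None
    when the tool writes its own output (nucmer writes <prefix>.delta).
    """
    delta = f"{prefix}.delta"
    filtered = f"{prefix}.filtered.delta"
    return [
        (["nucmer", "--prefix", prefix, reference, query], None),
        # -1: keep one-to-one alignments; -l: minimum alignment length (assumed)
        (["delta-filter", "-1", "-l", str(min_len), delta], filtered),
        # -r: sort by reference; -T: tab-delimited output
        (["show-snps", "-rT", filtered], f"{prefix}.snps"),
        # -r: report on the reference; -H: omit the header
        (["show-diff", "-rH", filtered], f"{prefix}.diff"),
    ]


# Hypothetical file names, for illustration only.
commands = mummer_commands("texel.fa", "gd.fa", "gd_vs_texel")
for argv, output_file in commands:
    line = shlex.join(argv)
    if output_file is not None:
        line += f" > {shlex.quote(output_file)}"
    print(line)
```

Each pairwise comparison (GD against Texel, GD against Rambouillet) would use its own prefix so that the intermediate files do not overwrite one another.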
Code availability
No specific code was developed in this work. The parameters of all commands and pipelines used for data processing are described in the Methods section. Where no detailed parameters are given for a software tool, the default parameters were used, as suggested by the developer.
References
Liu, J. B. et al. Genetic signatures of high-altitude adaptation and geographic distribution in Tibetan sheep. Sci Rep. 10, 18332 (2020).
Zhang, Q. Y. et al. Gangba sheep in the Tibetan plateau: validating their unique meat quality and grazing factor analysis. J Environ Sci. 101, 117–122 (2021).
Liu, G. B. et al. Identification of microRNAs in wool follicles during anagen, catagen, and telogen phases in Tibetan sheep. PLoS One 8, e77801 (2013).
Davenport, K. M. et al. An improved ovine reference genome assembly to facilitate in-depth functional annotation of the sheep genome. GigaScience 11, giab096 (2022).
Jiang, Y. et al. The sheep genome illuminates biology of the rumen and lipid metabolism. Science 344, 1168–1173 (2014).
Li, R. et al. A Hu sheep genome with the first ovine Y chromosome reveal introgression history after sheep domestication. Sci China Life Sci 64, 1116–1130 (2021).
Upadhyay, M. et al. The first draft genome assembly of snow sheep (Ovis nivicola). Genome Biol. Evol. 12, 1330–1336 (2020).
Yang, Y. Z. et al. Draft genome of the Marco Polo Sheep (Ovis ammon polii). GigaScience 6, 1–7 (2017).
Li, M. Z. et al. Genomic analyses identify distinct patterns of selection in domesticated pigs and Tibetan wild boars. Nat Genet 45, 1431–1438 (2013).
Visscher, M. O. Skin color and pigmentation in ethnic skin. Facial Plast Surg Clin North Am. 25, 119–125 (2017).
Crémazy, F. G. et al. Determination of the 3D genome organization of bacteria using Hi-C. Methods Mol Biol. 1837, 3–18, https://doi.org/10.1007/978-1-4939-8675-0_1 (2018).
Xiao, C. L. et al. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat Methods 14, 1072–1074 (2017).
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9, e112963 (2014).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Preprint at arXiv:1303.3997 (2013).
Roach, M. J., Schmidt, S. A. & Borneman, A. R. Purge haplotigs: allelic contig reassignment for third-gen diploid genome assemblies. BMC Bioinf. 19, 460 (2018).
Yin, D. M. et al. Genome of an allotetraploid wild peanut Arachis monticola: a de novo assembly. GigaScience 7, giy066 (2018).
Burton, J. N. et al. Chromosome-scale scaffolding of de novo genome assemblies based on chromatin interactions. Nat Biotechnol 31, 1119–1125 (2013).
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Curr. Protoc. Bioinf. Chapter 4, 4.10.1–4.10.14 (2009).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34, W435–439 (2006).
Slater, G. S. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinf. 6, 31 (2005).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol 29, 644–652 (2011).
Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188–196 (2008).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_000298735.2 (2015).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_002742125.1 (2017).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_001704415.2 (2016).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_000001405.29 (2022).
Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337 (2009).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR22290763 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR22585187 (2023).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_040259355.1 (2024).
Lu, Z. K. The high-quality chromosome-level genome assembly of Guide Black-Fur sheep (Ovis aries). figshare. https://doi.org/10.6084/m9.figshare.26013145 (2024).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_011170295.1 (2020).
Acknowledgements
This work was supported by the National Key R&D Program of China (2021YFD1600703), the Innovation Project of Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2015-LIHPS), the National Natural Science Foundation for General Program of China (31872981), and the Gansu Provincial Science and Technology Plan (22CX8NA014).
Author information
Contributions
J.L. and Z.L. conceived this study. Z.L. and T.G. collected the samples and performed the experiments; Z.L., C.Y., X.A. and Z.C. performed the research and analyzed the data. Z.L. drafted the manuscript. All authors have read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lu, Z., Yuan, C., An, X. et al. Chromosome-level genome assembly of Guide Black-Fur sheep (Ovis aries). Sci Data 11, 711 (2024). https://doi.org/10.1038/s41597-024-03564-x