Background & Summary

The genus Elymus L. belongs to the grass tribe Triticeae and contains approximately 150 species1,2. The genus is composed entirely of polyploid species with the genome constitutions StH, StY, StHY, StPY, and StWY, which together involve five basic genomes. The basic genomes St, H, P, and W are derived from Pseudoroegneria (Nevski) Löve, Hordeum L., Agropyron Gaertn., and Australopyrum (Nevski) Löve, respectively, whereas the origin of the Y genome is still unknown3,4. Elymus species belong to the same tribe as staple food crops such as wheat (Triticum aestivum, 2n = 6x = 42; AABBDD genome), rye (Secale cereale, 2n = 2x = 14), and barley (Hordeum vulgare, 2n = 2x = 14). They are important genetic resources with high diversity and constitute a tertiary gene pool for the improvement of major cereal crops.

Elymus sibiricus L. (Siberian wild rye), a typical species of the genus Elymus, is a well-known perennial, caespitose grass. E. sibiricus is widely distributed in the northern hemisphere, particularly in Sweden, northern Asia, Japan, and North America5, and is mostly used as a perennial forage in temperate regions6,7. E. sibiricus is an allotetraploid with a genome constitution of StStHH (2n = 4x = 28)2. Chromosomal polymorphisms and major rearrangements of E. sibiricus have been revealed by fluorescence in situ hybridization (FISH) in different accessions8,9. Genomic SSR markers were developed by screening an enriched microsatellite DNA library for genetic diversity evaluation10. The transcriptome of E. sibiricus was profiled to reveal candidate genes associated with seed shattering11. Genome sequencing was carried out on the Illumina HiSeq X Ten platform, and a 4.34 Gb draft genome was assembled and used for SSR marker development12.

In this study, we assembled a chromosome-scale reference genome of E. sibiricus by integrating PacBio HiFi reads and chromatin conformation capture (Hi-C) data. The high-quality E. sibiricus assembly provides a reference for the StH genome of the genus Elymus in the Triticeae tribe (Fig. 1). It will facilitate the evaluation of genetic resources of StH species in the genus Elymus. Furthermore, it can serve as an important tool for directly domesticating E. sibiricus as a forage crop or even a cereal crop.

Fig. 1

Overview of the assembled E. sibiricus genome. (a-g) are as follows: collinearity between the chromosomes (red highlights the associations between homologous chromosomes), chromosomes, gene counts, GC content, simple repeat density, DNA transposon density, and LTR element density. The window used for calculating the densities is 100,000 bp.
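As an illustration of the windowed statistics shown in the figure, the following minimal sketch computes GC content in consecutive windows. It assumes non-overlapping windows and excludes ambiguous bases (N) from the denominator; the caption states only the window size, so both choices and the function name are assumptions rather than the authors' procedure.

```python
def gc_content_by_window(sequence, window=100_000):
    """Return the GC fraction of each consecutive, non-overlapping window."""
    sequence = sequence.upper()
    fractions = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        # Count only unambiguous bases so that runs of N do not dilute the value.
        called = sum(chunk.count(base) for base in "ACGT")
        gc = chunk.count("G") + chunk.count("C")
        fractions.append(gc / called if called else 0.0)
    return fractions


# Toy sequence with a 4 bp window (the figure uses 100,000 bp).
print(gc_content_by_window("GGCCAATT", window=4))
```

The last window of a chromosome is usually shorter than the others; the sketch reports its fraction unchanged, which is one of several reasonable conventions.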

Methods

Plant materials and genome sequencing

The inbred line Gaomu No.1 of E. sibiricus used for sequencing had been self-pollinated for six generations. Fresh young leaf tissue was collected and frozen in liquid nitrogen, and DNA was extracted using the CTAB method13. DNA library preparation followed the protocol in the SMRTbell® prep kit 3.0 instruction manual, and sequencing was performed on the PacBio Revio platform. DNA for Hi-C sequencing was purified using the QIAamp DNA Mini Kit (CAT#51306, Qiagen) following the manufacturer’s protocol, while libraries for next-generation sequencing (NGS) of the whole genome were constructed using the MGIEasy Universal DNA Library Prep Kit V1.0 (CAT#1000005250, MGI) following the standard protocol. The Hi-C library was sequenced on the DNBSEQ-T7 platform, and NGS whole-genome sequencing was conducted on the MGISEQ-2000 platform. Fastp v0.23.414 with default parameters was used to obtain NGS clean reads. All genome sequencing and Hi-C sequencing data were derived from a single plant. The data obtained from each platform are shown in Table 1.

Table 1 Data Output Statistics.

Raw reads from full-length transcriptome sequencing were processed into circular consensus sequence (CCS) reads based on the adapter sequences. Subsequently, full-length, non-chimeric (FLNC) transcripts were identified by detecting the poly(A) tail signal and the 5′ and 3′ cDNA primers in the CCS reads. Full-length sequences from the same transcript were grouped into clusters, and a consensus sequence was obtained for each cluster. These consensus sequences were then corrected to obtain high-quality sequences for further analysis. Redundancy among the high-quality full-length (FL) Iso-Seq transcripts was removed using cd-hit v4.8.115 (identity >0.99).

Genome assembly and chromosome construction

The genome of E. sibiricus was assembled at the contig level using hifiasm v0.19.616 with PacBio HiFi reads supplemented by Hi-C data. Conserved homologous probes17 across the A, B, and D genomes of common wheat (Triticum aestivum L.)18 and the H genome of barley (Hordeum vulgare L.)19 were developed using CHORUS2 v2.0.120. BWA v0.7.1721 was used to align the Hi-C data to the draft genome. Subsequently, contigs and Hi-C alignments were classified based on these homologous probes. Classified contigs were subjected to chromosome construction through the polyploid workflow of ALLHiC22. Juicebox v1.11.0823 was used to manually correct the chromatin contact matrix and build the Hi-C interaction heatmap. SubPhaser v1.2.624 (kmer = 15, otherwise default parameters) was used to distinguish between the two subgenomes of E. sibiricus. An H genome-specific transposable element (Gypsy-96_TAe-LTR) was identified with the RepeatExplorer25,26 pipeline using low-coverage NGS data from both the H genome donor species Hordeum bogdanii and the St genome donor species Pseudoroegneria stipifolia. The content of Gypsy-96_TAe-LTR was estimated to be hundreds of times higher in the H genome than in the St genome. We used this element to further confirm which subgenome is H and which is St (Table 2). Benchmarking Universal Single-Copy Orthologs27 (BUSCO v5.2.2) and the LTR Assembly Index28 (LAI) were employed to evaluate the completeness and contiguity of the genome assembly. The final assembly had a genome size of 6.929 Gb with a contig N50 of 49.518 Mb (Table 3). Using SubPhaser and the subgenome-specific repetitive sequence, we successfully separated the two subgenomes (Fig. 2).
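For readers unfamiliar with the contig N50 statistic reported above, the following minimal sketch shows the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name and toy lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must not be empty")


# Toy contig lengths (hypothetical values).
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Comparing `running * 2` with `total` keeps the arithmetic in integers and avoids rounding issues when the total length is odd.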

Table 2 Alignment counts of the subgenome-specific repetitive sequence.
Table 3 Features of the E. sibiricus genome assembly and annotation.
Fig. 2

Principal component analysis (PCA) based on subgenome-specific kmers (kmer = 15).
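The idea behind subgenome-specific k-mers can be sketched as follows: count k-mers in two sets of sequences and keep those that are markedly more frequent in one set than in the other. This is only a conceptual illustration and not the SubPhaser algorithm; the function names, the fold threshold, and the pseudocount are assumptions introduced here.

```python
from collections import Counter


def count_kmers(sequence, k=15):
    """Count all k-mers made up only of A, C, G and T."""
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def enriched_kmers(counts_a, counts_b, min_fold=2.0):
    """Return k-mers at least min_fold more frequent in A than in B.

    A pseudocount of 1 is added to the count in B to avoid division by zero.
    """
    return {
        kmer
        for kmer, n in counts_a.items()
        if n / (counts_b.get(kmer, 0) + 1) >= min_fold
    }


# Toy sequences with k = 4 (the study used k = 15).
a = count_kmers("ACGTACGT", k=4)
b = count_kmers("CGTA", k=4)
print(enriched_kmers(a, b))
```

In practice the counts would be normalized by sequence length before comparison, and the resulting per-chromosome k-mer profiles could then be passed to a PCA such as the one shown in Fig. 2.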

Annotation of repetitive sequences and functional genes

LTRfinder v1.0729 (-w 2 -C -D 15000 -d 1000 -L 7000 -l 100 -p 20 -M 0.85) and LTRHarvest v1.6.530 (-minlenltr 100 -maxlenltr 7000 -mintsd 4 -maxtsd 6 -motif TGCA -motifmis 1 -similar 85 -vic 10 -seed 20 -seqids yes) were used to initially predict long terminal repeat (LTR) sequences. Subsequently, LTR_retriever v2.9.531 was used to merge the results and obtain the final LTR predictions. A de novo repeat sequence database for E. sibiricus was constructed using RepeatModeler v2.0.332 with default parameters. The final repeat sequence predictions were made using the RepeatMasker v4.1.233 pipeline.

The BRAKER3 v3.0.334 pipeline was used for structural annotation of the E. sibiricus genome. The pipeline incorporated three sources of extrinsic evidence: short-read RNA-seq data from a public NCBI Illumina dataset (SRP101478)35, full-length transcriptome sequencing from the current experiment, and protein sequences of Eukaryota from OrthoDB36. BRAKER3 uses the GeneMark-ETP v1.0237 pipeline for gene prediction. Short RNA-seq reads were aligned to the genome with HISAT2 v2.2.139, and transcripts were assembled with StringTie v2.2.138. GeneMarkS-T analyzes the assembled transcripts to predict protein-coding genes, which are then searched against a protein database. ProtHint maps homologous proteins back to the genome, generating hints for another round of gene structure prediction. AUGUSTUS v3.4.040 is trained on the high-confidence gene set and predicts a second genome-wide gene set with hint support. The predictions from these components were integrated using TSEBRA41.

This study found that repetitive sequences accounted for 82.49% of the genome in E. sibiricus (Table 4). A total of 89,800 protein-coding genes were annotated, with an average gene length of 2,315 bp and an average CDS length of 1,075 bp (Table 5). Among these annotated genes, 85,250 genes were annotated in the NR42 database, 49,637 in the Swiss-Prot43 database, 63,623 in the Pfam44 database, 24,763 in the GO45 database, and 18,856 in the KEGG46 database. Additionally, 85,274 genes are annotated in at least one of these databases (Fig. 3).
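The count of genes annotated in at least one database is a set union over the per-database hits, which is why it can exceed the largest single-database count without equalling the sum of all counts. A minimal sketch with toy gene identifiers (hypothetical, not from the real annotation) is given below.

```python
def annotated_in_any(hits_by_database):
    """Return the set of gene IDs with a hit in at least one database."""
    annotated = set()
    for gene_ids in hits_by_database.values():
        annotated.update(gene_ids)
    return annotated


# Toy per-database hits (hypothetical gene IDs).
toy_hits = {
    "NR": {"g1", "g2", "g3"},
    "Swiss-Prot": {"g2"},
    "Pfam": {"g2", "g4"},
}
toy_total_genes = 5

union = annotated_in_any(toy_hits)
print(len(union), len(union) / toy_total_genes)
```

Applied to the reported totals, 85,274 of 89,800 genes corresponds to roughly 95% of the predicted gene set.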

Table 4 Classification of repeat annotation in E. sibiricus.
Table 5 Statistics of the gene prediction.
Fig. 3

Venn diagram of the functionally annotated genes of E. sibiricus across the different databases.

Phylogenetic tree construction

We selected the coding DNA sequences (CDS) of the following genomes for phylogenetic analysis: Oryza sativa47, Brachypodium distachyon48, Triticum aestivum (subgenomes A, B, and D), Secale cereale49, Thinopyrum intermedium (subgenomes St, Jr, and Jvs) (https://phytozome-next.jgi.doe.gov/info/Tintermedium_v3_1), Dasypyrum villosum50, and Hordeum vulgare, along with E. sibiricus (subgenomes H and St). OrthoFinder v2.5.551 with BLAST v2.14.152 as the search engine was used to identify orthologous genes. A total of 2,082 orthologous genes were obtained from the selected genomes. MUSCLE v5.153 was used for multiple sequence alignment. The phylogenetic tree was constructed using RAxML v8.2.1254 with the maximum likelihood method. Divergence times were estimated with MCMCTree v4.10.755 using a calibration time (O. sativa - B. distachyon: 41.5–62.0 MYA) from the TimeTree56 website (Fig. 4).

Fig. 4

Estimation of divergence time between E. sibiricus and related species. Divergence times (unit: MYA) are indicated at each node. Green represents Triticeae species; red represents Brachypodieae species; blue represents Oryzeae species.

Synteny analysis

One Step MCScanX in TBtools-II57 was used for synteny analysis. First, the protein sequences of the subgenomes were aligned using blastp v2.15.0+ (-evalue 1e-5 -num_alignments 5). MCScanX v2022.11.0158 with default parameters was then employed to identify collinear blocks.

Data Records

The raw sequence data reported in this paper have been deposited in the Genome Sequence Archive in the National Genomics Data Center59, China National Center for Bioinformation/Beijing Institute of Genomics60, Chinese Academy of Sciences (GSA: CRA014200)61. The final chromosome assembly of E. sibiricus was deposited at GenBank under the accession number JBDKXM00000000062. The genome assembly and annotation, the conserved homologous probes, and the subgenome-specific repetitive sequences were uploaded to figshare63.

Technical Validation

The genome-wide Hi-C interaction heatmap was generated using Juicebox. The coordinates in the heatmap represent all bins on individual chromosomes, and the color of each point indicates the logarithm of the interaction strength of the corresponding bin pair (Fig. 5). Interaction strength increases from white to red, with darker colors indicating stronger interactions, and the signal along the diagonal is markedly stronger than away from it. The anti-diagonals are typical for Triticeae genomes and correspond to the Rabl configuration of Triticeae chromosomes64,65. Following manual adjustments, the current assembly of the E. sibiricus genome follows the expected distance-dependent interaction decay. At the scale of the global heatmap, the assembly appears satisfactory, with no apparent clustering errors between chromosomes.
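Distance-dependent interaction decay can be checked numerically by averaging the contact counts at each bin separation within a chromosome. The sketch below assumes a square, symmetric contact matrix stored as nested lists; the function name and the toy matrix are illustrative and are not part of the authors' workflow.

```python
def mean_contact_by_distance(matrix):
    """Average contact count for each bin separation d = j - i (j >= i)."""
    n = len(matrix)
    means = []
    for d in range(n):
        # All bin pairs that are exactly d bins apart along the chromosome.
        values = [matrix[i][i + d] for i in range(n - d)]
        means.append(sum(values) / len(values))
    return means


# Toy 3 x 3 intra-chromosomal contact matrix (hypothetical counts).
toy_matrix = [
    [10, 5, 1],
    [5, 10, 5],
    [1, 5, 10],
]
print(mean_contact_by_distance(toy_matrix))
```

In a well-ordered scaffold the averages generally fall as the separation grows. For Triticeae chromosomes, the Rabl configuration mentioned above can raise contacts between the two arms, so a strictly monotonic decrease at the largest separations should not be expected.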

Fig. 5

Contact map after the integration of the Hi-C data and manual correction. Blue boxes represent pseudomolecules; green boxes represent contigs.

The final LTR Assembly Index (LAI) value is 12.61, with a corresponding raw LAI of 18.02. According to the criteria proposed by the authors of LTR_retriever, the E. sibiricus assembly is of reference quality.
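The quality category follows from simple thresholds on the LAI value. The sketch below assumes the commonly cited cut-offs (draft below 10, reference from 10 to below 20, gold at 20 and above); the function name is hypothetical, and the thresholds should be checked against the LAI publication before reuse.

```python
def classify_lai(lai):
    """Map an LAI value to an assembly quality category.

    Thresholds are assumed here: draft < 10 <= reference < 20 <= gold.
    """
    if lai < 0:
        raise ValueError("LAI must be non-negative")
    if lai < 10:
        return "draft"
    if lai < 20:
        return "reference"
    return "gold"


# LAI reported for the E. sibiricus assembly.
print(classify_lai(12.61))
```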

The BUSCO analysis of the whole genome indicates a high level of completeness and contiguity in the E. sibiricus assembly. Of the 4,895 single-copy genes in the BUSCO set, only 38 were missing or fragmented. We also conducted a BUSCO analysis on the longest transcript of each gene. The results indicate a relatively complete annotation, with the majority of genes on each subgenome identified as single-copy (Table 6).
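The counts above translate into a completeness percentage by simple arithmetic. The sketch below treats every BUSCO gene that is neither missing nor fragmented as complete, which is an assumption about how the 38 genes were tallied; the function name is illustrative.

```python
def busco_complete_percent(total, missing_or_fragmented):
    """Percentage of BUSCO genes that are neither missing nor fragmented."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (total - missing_or_fragmented) / total


# Counts reported for the whole-genome BUSCO analysis.
print(round(busco_complete_percent(4895, 38), 2))
```

With these counts the complete fraction is about 99.2%; the breakdown into single-copy and duplicated genes is given in Table 6.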

Table 6 BUSCO estimation for E. sibiricus genome assembly and annotation.

Phylogenetic analysis of the CDS showed close relationships between the St genome of E. sibiricus and the St genome of Th. intermedium, and between the H genome of E. sibiricus and H. vulgare, which is consistent with the recognized genome constitution of E. sibiricus.

The synteny analysis revealed an apparent disruption of collinearity on chromosomes 4H and 6H (Fig. 1), consistent with the species-specific 4H/6H reciprocal translocation previously detected in E. sibiricus by fluorescence in situ hybridization with single-gene probes8.