Background

Salinity stress affects over 6% of the global land area and is a severe problem that limits agriculture [1, 2]. Halophytes are remarkable plants that tolerate high salinity that would kill 99% of other plant species (glycophyte), and are applied to improve saline soil [3, 4]. Some extreme halophytes can survive salinity levels > 1000 mM NaCl, whereas glycophytes, such as rice and Arabidopsis, can only survive 50–100 mM NaCl [4, 5]. Most botanists believe that these salt-sensitive glycophytes may provide limited insights into mechanisms of salinity tolerance, and that extreme halophytes may have enormous values for improving our understanding of salinity tolerance mechanisms [4,5,6]. Given that many important crops are gramineous, understanding the salinity tolerance mechanisms of gramineous halophytes will be helpful in improving the salinity or drought tolerance of cereal crops. Although the genomes of several salinity-tolerant plant species have been reported [7,8,9,10], the genome of an extreme Gramineae halophyte is unavailable. Puccinellia tenuiflora (2n = 14) is a perennial halophyte of the Gramineae and is distributed in Asian and European grasslands [3, 11, 12]. It is a forage grass with high nutritional value and strong tolerance for multiple stress conditions, such as drought, disease, and chilling [3, 11, 12]. P. tenuiflora can survive at pH 10 and 900 mM NaCl [3, 11,12,13,14] and can grow normally and produce seeds under some extreme soil conditions (2–3% salt content and pH > 10) [14, 15]. Given these qualities, P. tenuiflora has been used to recover and exploit saline grasslands or croplands in northern China [14, 15]. A growing number of molecular studies have focused on P. tenuiflora [12, 16,17,18,19,20,21,22,23,24,25,26,27,28]. Currently, P. tenuiflora is recognized as a model halophyte [3, 12]. Unfortunately, the genomic sequence of P. tenuiflora is unavailable. Here, we provide first report on the P. tenuiflora genome. Our work may provide extensive genetic resources for improving the salinity or drought tolerance of cereals.

Construction and content

Evaluation of genome size

Taxonomy characteristics of Puccinellia tenuiflora are available at Flora of China (http://www.efloras.org/florataxon.aspx?flora_id=2&taxon_id=200026128). We surveyed the chromosome number of P. tenuiflora according to Kato et al. [29]. Total genomic DNA was extracted from fresh leaves. We used the conventional method to estimate the P. tenuiflora genome size. Briefly, we generated 49 Gb of high-quality short-insert Illumina reads to analyze the K-mer frequency of distribution [30]. Genome size was calculated using the following formula: Genome size = total K-mer number /K-mer depth [30, 31], in which K-mer depth is the peak value of K-mer distribution. The chromosome number of P. tenuiflora is 14 (Fig. 1). Our K-mer analysis showed that the genome size of extreme halophyte P. tenuiflora was 1.303 Gb (2n = 14) and the genome was complex, with 1.56% heterozygosity and 65.5% repeat content (Table 1).

Fig. 1
figure 1

Chromosome number (a) and habitat (b) of P. tenuiflora

Table 1 Results of K-mer analysis. The K-mer was defined as 17 bp to assess P. tenuiflora genome size by the following formula: total K-mer number/K-mer depth. The heterozygous ratio was determined by the number of heterozygous K-mer/total K-mer number

Genome sequencing

Illumina paired-end (PE) libraries were constructed with short insert sizes of 250 and 450 bp. Illumina mate-pair (MP) libraries were constructed with insert sizes of 2, 5, and 10 k bp (Table 2). We generated 209.13 Gb of raw data by the PE libraries, and 197.38 Gb of raw data by the MP libraries. The Illumina libraries were sequenced on Illumina HiSeq XTen platform. We also sequenced 56.12 Gb of PacBio long reads and 161.03 Gb of 10× genomics barcoded reads (Table 2).

Table 2 Raw data of P. tenuiflora sequencing

Genome assembly

Because the P. tenuiflora genome is highly complex and repeated, its genome was assembled by a combined strategy of PacBio (third-generation), 10× genomic technique, and Illumina Hiseq (second-generation). We generated 312.6× reads of Illumina, 43.2× read of PacBio and 123.87× reads of 10× genomic. First the PacBio sequences were corrected for errors. The accurate sequences of PacBio were assembled into primary contigs based on FALCON (Branch 3.1) [32] and FALCON-Unzip software (https://github.com/PacificBiosciences/FALCON_unzip). After treatment with FALCON-Unzip software, we corrected errors of these contigs using PacBio sequences based on quiver software [33] and using Illumina data based on pilon software [34], and finally obtaining consensus sequences of high quality. Next, we used Illumina long reads of 2, 5, and 10 kb to elongate and combine the preassembled contigs into scaffolds based on SSPACE software [35], and then used 10× genomics linked-reads to further elongate and combine the scaffolds based on 10× FragScaff software. Lastly, we used Purge Haplotigs software (https://bitbucket.org/mroachawri/purge_haplotigs/overview) to filter the redundant sequences caused by high heterozygosity. Finally, we assembled 2638 scaffolds with a total size of 1.107 Gb, contig N50 of 117 kb, and scaffold N50 of 950 kb (Table 3).

Table 3 Assembly results of P. tenuiflora genome

Genome annotation

Annotation of replicate sequences

Transposable elements (TEs) of the P. tenuiflora genome were annotated. We used two methods to find the TEs. The first method was RepeatMasker (version 3.3.0) to discover TEs in an integrated known replicate sequence library (Repbase 15.02) and the de novo replicate sequence library constructed by RepeatModeler (Version 1.0.5) [36, 37], RepeatScout [38], and LTR_FINDER [39]. The second method detected TEs in the P. tenuiflora genome using RepeatProteinMask by searching against the TE protein database [37]. We identified 691 Mb transposable elements (62.44% of the total sequence), including 580 Mb of LTR retrotransposons (52.43%) (Table 4).

Table 4 Overview of the annotation of the P. tenuiflora genome

Annotation of protein-coding genes

A combined strategy (de novo-, homolog-, and RNA-seq-based predictions) was used to annotate protein-coding genes in the P. tenuiflora genome using the following software: Augustus (version 3.0.2) [40, 41], Genescan (version 1.0) [42], Geneid [43], GlimmerHMM (version 3.0.2) [44], and SNAP [45]. The homologous sequences of six species (Zea mays, Sorghum bicolor, Brachypodium distachyon, Setaria italica, Arabidopsis thaliana, and Oryza sativa) were aligned against the repeat-masked P. tenuiflora genome with TBLASTN (E-value ≤10–5) [46], and then Genewise software 2.2.0 was used to predict the gene models [47]. Two strategies were used to assemble the RNA-seq reads to the unique transcripts. First, we mapped the RNA-seq reads to the P. tenuiflora genome with Tophat 2.0.8 [48] and Cufflinks 2.1.1 software [49] (http://cufflinks.cbcb.umd.edu/). Afterward, we used Trinity [50] to assemble the RNA-seq reads, and then used PASA [51] (http://pasapipeline.github.io/) to improve the structure of the assembled genes. We generated non-redundant gene sets using EVidenceModeler (EVM) [52] via integrating gene prediction results of all methods. Finally, the predicted genes were filtered by three criteria: coding region length of ≤50 amino acids; FPKM < 5; and supported only by de novo strategy. Functions of the protein-coding genes were annotated by BLASTP program (best hit with E-value ≤1E-05) against three public protein databases: TrEMBL [53], Swiss-Prot, and NR. The protein domains were analyzed by InterProScan software (4.8) via searching against InterPro databases 29.0 [54], and the GO term information was collected from the InterPro annotation results [55]. Moreover, we also conducted KEGG annotation for all genes [56].

On the basis of P. tenuiflora genomic sequences, we predicted 39,725 protein-coding genes (Tables 5). Of the 39,725 predicted protein-coding genes, the protein sequences of 39,470 genes (99.4%) were similar to sequences of known proteins and could be annotated (Table 6). The average gene length was 2818.5 bp, and the average CDS length was 1082.0 bp. The average exon number per gene was 4.2, with an average exon length of 260.5 bp and average intron length of 550.8 bp (Table 5).

Table 5 General statistics for feature of predicted protein-coding genes of P. tenuiflora genome. Protein-coding genes were predicted through the annotation strategy of de novo prediction and evidence based on homology and transcriptome data. The gene model was integrated with EVM and corrected by PASA to obtain the final set of protein-coding genes
Table 6 Functional annotation of protein-coding genes against different databases. Gene functions were obtained from the best BLASTP hit

Annotation of non-coding RNA

The tRNA genes were discovered with tRNAscan-SE software [57]. The rRNA, miRNA, and snRNA were predicted by INFERNAL software [58] against the Rfam database 9.1 [59]. We annotated non-coding RNA and identified 692 tRNAs, 68 rRNAs, 702 snRNAs, and 1376 microRNAs in the P. tenuiflora genome (Tables 4 and 7). The average lengths of microRNAs, tRNAs, rRNAs, and snRNAs were 124.89 bp, 75.27 bp, 207.79 bp, and 118.21 bp, respectively (Table 7). We deposited the genome sequence in the Genome Warehouse in National Genomics Data Center [60].

Table 7 Identification of non-coding RNAs of P. tenuiflora genome. The tRNAs were predicted by tRNAscan-SE software. The rRNA, miRNA and snRNA genes were extracted by INFERNAL software against the Rfam database

Assessment of genome quality

We assessed genome quality using the following methods: Burrow-Wheeler Aligner (BWA), Core Eukaryotic Genes Mapping Approach (CEGMA), and Benchmarking Universal Single-Copy Orthologs (BUSCO). First, in order to assess the quality of genome assembly, we aligned the high-quality Illumina short reads to the assembly using BWA (http://bio-bwa.sourceforge.net, parameters ‘-o 1 -i 15’) [61]. According to BWA method, 87.41% of raw reads were mapped to the genome with 93.34% coverage (Table 8). Next, we used CEGMA and BUSCO to estimate completeness of the assembly. CEGMA is a set of conserved protein families for a wide range of eukaryotes, and is used to identify exon–intron structures of these conserved protein families in a new genomic sequence [62]. CEGMA analysis revealed 223 out of 248 ultraconserved eukaryotic genes (89.9%) in the P. tenuiflora genome indicating integrity for the core genes in the assembly (Table 9). Moreover, completeness of the assembly also was assessed using BUSCO [63] combined with TBLASTN [46], Augustus (version 3.0.2) [40, 41], and HMMER (version 3.1b2) [64]. The BUSCO analysis showed that our assemblies contained 86.8% complete and 1.7% fragmented embryophyta orthologs, suggesting that the assembly quality was high (Table 10).

Table 8 Genome coverage rate of raw data based on the BWA method. Mapping rate was generated by mapping raw reads to the P. tenuiflora genome to express the reliability of the genome coverage
Table 9 CEGMA analysis results of P. tenuiflora genome
Table 10 BUSCO results of P. tenuiflora genome. C: Complete BUSCOs; S: Complete and single-copy BUSCOs; D: Complete and duplicated BUSCOs; F: Fragmented BUSCOs; M: Missing BUSCOs; n: Total BUSCO groups searched

Utility and discussion

Description of database

The genome assembly of P. tenuiflora consisted of 14,036 contigs with a total size of 1.095 Gb. Finally, we assembled 2638 scaffolds with a total size of 1.107 Gb, contig N50 of 117 kb, and scaffold N50 of 950 kb. On the basis of P. tenuiflora genomic sequences, we predicted 39,725 protein-coding genes, and identified 692 tRNAs, 68 rRNAs, 702 snRNAs, 1376 microRNAs, and 691 Mb transposable elements. We assessed the quality and completeness of the assembled genome through BWA, CEGMA mapping, and BUSCO mapping (Tables 8, 9, 10). The results showed that our assembly had high quality. All raw data for genome assembly are deposited at NCBI. The genome sequence is deposited in the Genome Warehouse in National Genomics Data Center (https://bigd.big.ac.cn/gwh) (accession number GWHABHL00000000).

Significance of database

Halophytes belong to several families and are distributed among multiple clades; this broad distribution pattern suggests that the salinity tolerance mechanisms of halophytes have evolved numerous times or have multiple origins [2]. As a result, halophytes not only exhibit a wide range of salinity tolerance but have also evolved diverse molecular and physiological mechanisms for salinity tolerance [2]. This diversity complicates discovery of the salinity tolerance mechanisms of halophytes. To date, almost all known molecular mechanisms of salinity tolerance were characterized in glycophytes such as rice, wheat, and Arabidopsis [4,5,6]. Glycophytes only provide limited insights into mechanisms of salinity tolerance, and extreme halophytes may have enormous values for improving our understanding of salinity tolerance mechanisms. The genome sequence of extreme halophytes will unlock their molecular studies in salinity tolerance.

The Gramineae is an important plant group because it includes many important food crops, such as rice, wheat, maize, and barley. P. tenuiflora, an extreme Gramineae halophyte, is closely related to barley and wheat. Zhang et al. (2013) reported that P. tenuiflora can grow normally for 6 days under 900 mM NaCl and survive at pH 11 [23]. Wang et al. (2006) found that P. tenuiflora survived 670 mmol/L NaCl [13]. A growing number of molecular biology studies have focused on this species owing to its strong salinity tolerance and high genetic value for cereal improvement [16,17,18,19,20,21,22,23,24,25,26,27,28]. In the present study, we sequenced and assembled the P. tenuiflora genome (2n = 14, size 1.107 Gb). Our work may improve current understanding of salinity tolerance and provides genetic resources for cereal improvement.