Chromosome-scale genome assembly and annotation of Paspalum notatum Flüggé var. saurae

Vega, Juan Manuel; Podio, Maricel; Orjuela, Julie; Siena, Lorena A.; Pessino, Silvina C.; Combes, Marie Christine; Mariac, Cedric; Albertini, Emidio; Pupilli, Fulvio; Ortiz, Juan Pablo A.; Leblanc, Olivier

doi:10.1038/s41597-024-03731-0

Data Descriptor
Open access
Published: 16 August 2024

Volume 11, article number 891, (2024)

Scientific Data

Juan Manuel Vega¹^na1,
Maricel Podio¹^na1,
Julie Orjuela²^na1,
Lorena A. Siena¹,
Silvina C. Pessino¹,
Marie Christine Combes²,
Cedric Mariac²,
Emidio Albertini³,
Fulvio Pupilli⁴,
Juan Pablo A. Ortiz ORCID: orcid.org/0000-0001-8460-6154¹ &
Olivier Leblanc ORCID: orcid.org/0000-0003-3641-1875²

Abstract

Paspalum notatum Flüggé is an economically important subtropical fodder grass that is widely used in the Americas. Here, we report a new chromosome-scale genome assembly and annotation of a diploid biotype collected in the center of origin of the species. Using Oxford Nanopore long reads, we generated a 557.81 Mb genome assembly (N50 = 56.1 Mb) with high gene completeness (BUSCO = 98.73%). Genome annotation identified 320 Mb (57.36%) of repetitive elements and 45,074 gene models, of which 36,079 have a high level of confidence. Further characterisation included the identification of 59 miRNA precursors together with their putative targets. The present work provides a comprehensive genomic resource for P. notatum improvement and a reference frame for functional and evolutionary research within the genus.

Background & Summary

Paspalum notatum Flüggé (bahiagrass) is a subtropical grass native to South America that is widespread on lightly textured soils in warm, humid regions of the Western Hemisphere and extensively used as a pasture and ground cover^1,2. The species forms a multiploid complex in which the diploid (2n = 2x = 20) plants are self-sterile and sexual, while the polyploids (3x = 30, 4x = 40, 5x = 50) are pseudogamous aposporous apomicts, i.e. they form seeds containing maternal embryos^3,4. The diploid form, var. saurae, also known as Pensacola bahiagrass, occurs naturally in a restricted geographical area of Argentina stretching between the western and eastern banks of the Uruguay and Paraná rivers, respectively². It owes its name to the fact that it was inadvertently introduced in the Pensacola area of Florida before 1926 and subsequently naturalized as a warm-season perennial pasture throughout the coastal plain and Gulf Coast regions of the United States⁵. Today, it is one of the most important grasses for pastures and lawns in the southeastern United States⁶. The search for the origin of Pensacola bahiagrass led the agricultural scientist Glenn W. Burton to travel through Brazil, Uruguay, and Argentina, where he eventually found highly diverse populations in a small area of the province of Santa Fe, on the banks of the Paraná River and the island of Berduc, near the city of Cayastá⁵ (Fig. 1a,b). Since cytogenetic studies indicate that polyploid P. notatum races (var. notatum) are autotetraploid and share homologous chromosomes with the saurae plants^7,8, this region was then considered to be the center of origin of the species^2,5.

Because P. notatum establishes well in poor-quality sandy soils and tolerates drought, sporadic flooding, and continuous grazing, the species has been selected and improved by classical and molecular methods for almost 80 years, with about 20 cultivars released to date⁹. While the diploid sexual races could be crossed to generate improved hybrids, tetraploid cultivars were traditionally obtained through ecotype selection due to their apomictic mode of reproduction⁹. However, the experimental production of tetraploid sexual individuals by doubling the chromosomes of diploids and the creation of synthetic sexual tetraploid populations have increased the variability for breeding programs through crosses with natural apomictic pollen donors^9,10,11.

P. notatum ecotypes have relatively small genomes, with 1C values ranging from 0.55 to 0.60 pg¹². Recent studies have provided a wealth of genetic, transcriptomic, and genomic information on the species¹¹ and have set up strategies for the functional characterization of agronomically important genes using genetic transformation and gene editing^13,14,15. Available resources include leaf and flower transcriptomes of sexual and apomictic genotypes^16,17,18, a catalog of small RNAs present during sexual and apomictic reproductive development¹⁹, and a chromosome-scale de novo genome assembly (541 Mb) of the species²⁰. However, information on gene content annotation and miRNA genes is not yet available.
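As a rough cross-check of these figures, a 1C value in picograms can be converted to base pairs with the commonly used factor of 1 pg ≈ 978 Mb. The sketch below applies that factor to the reported range; the function and constant names are illustrative and do not come from the original work.

```python
# Widely used flow-cytometry conversion: 1 pg of DNA is about 978 Mb.
MB_PER_PG = 978.0


def c_value_to_mb(picograms: float) -> float:
    """Convert a 1C DNA content given in picograms to megabase pairs."""
    return picograms * MB_PER_PG


low = c_value_to_mb(0.55)
high = c_value_to_mb(0.60)
print(f"1C = 0.55-0.60 pg corresponds to roughly {low:.0f}-{high:.0f} Mb")
```

Under this assumption the reported 1C range spans roughly 538 to 587 Mb, which brackets the assembly sizes discussed in this work.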

Long-read sequencing technologies have proven to be extremely effective in improving the quality of assembly in complex genomes, with high levels of heterozygosity, polyploidy, and repetitive elements^21,22,23, particularly for non-model species and orphan crops^24,25,26,27,28. Here, we report a chromosome-level genome assembly and annotation of a natural diploid P. notatum biotype (#R1) collected at the species center of diversity using Oxford Nanopore Technology (ONT). The plant #R1 reproduces sexually but occasionally produces aposporous embryo sacs, which is the first step of apomictic reproduction²⁹. Further extensive genomic characterization using Illumina short reads, together with the existing and newly generated transcriptomes, makes the #R1 genome assembly and annotation a valuable resource for providing new insights into the gene content and genome evolution, and for elucidating the developmental genetics of agronomically valuable traits.

Methods

Sample collection

The #R1 plant is a diploid individual collected in a natural population established near the city of Cayastá, Santa Fe Province, Argentina²⁹ (Fig. 1a,c), which belongs to the living germplasm collection of Paspalum spp. of the Instituto de Botánica del Nordeste (IBONE), CONICET-UNNE, Corrientes, Argentina (voucher CTES0553130; Herbarium Carmen L. Cristobal) (Fig. 1b). Several duplicates generated by vegetative propagation through rhizomes are also maintained at the Instituto de Investigaciones en Ciencias Agrarias de Rosario (IICAR), CONICET-UNR, Rosario, Argentina, and at the French National Research Institute for Sustainable Development (IRD), Montpellier, France (Fig. 1d). For ONT sequencing, we used ~5 g of fresh leaf tissue to extract high molecular weight genomic DNA (HMW gDNA) by nuclei isolation and performed quality control, both according to Mariac et al.³⁰. We also extracted total RNA for cDNA synthesis and ONT sequencing from flowers of #R1 immature inflorescences collected before anthesis using a method adapted from Azevedo et al.³¹. Briefly, the plant material was ground in liquid nitrogen, mixed with the extraction buffer, incubated for 15 min at room temperature, and finally extracted using chloroform-isoamyl alcohol. We preserved RNA integrity by avoiding vortexing and keeping samples on ice throughout the extraction process. The genomic DNA used for preparing Illumina sequencing libraries was extracted from ~3 g of fresh leaf tissue using a CTAB (cetyltrimethylammonium bromide) method³² and assessed for concentration and purity using a NanoDrop 2000 (Thermo Scientific, USA).

DNA sequencing

Nanopore sequencing

DNA libraries of the #R1 genotype were prepared from non-fragmented HMW gDNA using the Ligation Sequencing Kit 1D SQK-LSK109 (Oxford Nanopore Technologies). ONT sequencing was carried out using either a MinION Mk1B (Oxford Nanopore Technologies) at IRD or a PromethION (Oxford Nanopore Technologies, UK) at Novogene (Cambridge, UK), employing R9.4.1 Spot-On Flow Cells (Oxford Nanopore Technologies). ONT sequencing FAST5 files were base-called using GUPPY v6.0.6 software and the dna_r9.4.1_450bps_hac_prom.cfg model. The quality control of raw reads in FASTQ format was conducted using NanoPlot v1.31.0 software³³.

Illumina sequencing

Illumina sequencing was carried out at the Instituto de Agrobiotecnología de Rosario (INDEAR; Rosario, Argentina). Sequencing libraries were prepared from 50 ng of genomic DNA using the Nextera DNA Library Prep Kit (Illumina, Inc., San Diego, CA, USA) according to the manufacturer’s instructions and sequenced in 2 × 250 bp paired-end mode on an Illumina HiSeq 1500 platform.

Assessing the heterozygosity level of the #R1 genome

Illumina reads were trimmed to remove adaptors and filtered by quality using Trimmomatic v0.33³⁴. Approximately 277 million high-quality Illumina reads (Q > 37) (Supplementary Table 1) were used as input to count 21-mers using Jellyfish v2.3.0³⁵, followed by genome profiling using GenomeScope³⁶.
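To illustrate what the k-mer counting step produces, the following minimal sketch counts canonical k-mers in a handful of reads and tabulates how many distinct k-mers occur at each multiplicity. It is a toy, in-memory illustration and not the Jellyfish implementation; the function names are hypothetical, and real data sets require a dedicated counter.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_histogram(reads, k=21):
    """Count canonical k-mers over all reads, then tabulate the number of
    distinct k-mers observed at each multiplicity."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


# Toy usage with k=3 so the result can be followed by hand.
histogram = kmer_histogram(["ACGTA", "TACGT"], k=3)
print(dict(histogram))
```

A multiplicity histogram of this kind is the input that profiling tools fit in order to estimate genome size, repeat content, and heterozygosity; in a heterozygous diploid it typically shows two peaks.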

cDNA sequencing

cDNA from flowers of immature #R1 inflorescences was synthesized from 50 ng of total RNA using the SMART-Seq v4 low-input RNA kit (Takara Bio Europe, France). Of the 10 µl reverse transcription reaction, 1 µl was used for quality control and the remaining 9 µl were amplified using SeqAmp DNA Polymerase with SeqAmp CB PCR Buffer for long fragment amplification (Takara Bio Europe, France). A sequencing cDNA library with an estimated concentration of 80 fmol (2000 bp average library size) was prepared using the SQK-LSK109 ligation sequencing kit (Oxford Nanopore Technologies, UK). Preparation included RNA and cDNA purification steps using AMPure XP Beads (Beckman Coulter, France). RNA quality was assessed using the Agilent High Sensitivity DNA Reagent Kit (Agilent Technologies, France). ONT sequencing and base calling were performed at IRD, as described above. Raw reads were filtered for quality (Q > 10) and length (>300 bp) and trimmed (85 bp at both ends) using NanoFilt v1.0³³.
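The read filtering described above can be sketched as follows. This is an illustration of the logic only, not the NanoFilt code: the function names are hypothetical, the mean quality is computed from the mean error probability (one common convention for long reads), and applying the thresholds before trimming is an assumption.

```python
import math


def mean_quality(phred_scores):
    """Average read quality derived from the mean error probability,
    rather than from the arithmetic mean of the Phred scores."""
    if not phred_scores:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
    return -10 * math.log10(mean_error)


def filter_and_trim(seq, quals, min_q=10, min_len=300, crop=85):
    """Return the trimmed (sequence, qualities) pair if the read passes
    the quality and length thresholds, otherwise None.

    Assumption: thresholds are checked before the ends are cropped.
    """
    if len(seq) <= min_len or mean_quality(quals) <= min_q:
        return None
    return seq[crop:len(seq) - crop], quals[crop:len(quals) - crop]


# Toy usage: a 400 bp read of uniform Q20 is kept and cropped to 230 bp.
kept = filter_and_trim("A" * 400, [20] * 400)
print(len(kept[0]))
```

Averaging error probabilities penalizes reads with a few very poor bases more strongly than averaging the scores themselves would.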

Genome survey and assembly

Preliminary k-mer analysis carried out with the Illumina reads predicted a total genome size of 513 Mb, an abundance of repetitive elements of approximately 50.0% and a heterozygosity rate of 1.73%, as indicated by the bimodal k-mer profile (Fig. 2). This high level of heterozygosity was expected for the #R1 genotype based on previous genetic analysis of the natural population from which the plant was collected²⁹ and is similar to that reported for other self-incompatible grasses^37,38. To achieve genome assembly, we first generated 72.13 Gb of ONT long reads (Q > 7) (19.98 Gb from MinION and 52.15 Gb from PromethION) with a read length N50 of 19.71 kb and a GC content of 45.56% (Supplementary Table 1³⁹). The reads were then filtered for quality (Q > 10) and length (>5 kb) using NanoFilt v1.0³³, resulting in 68 Gb of data with a GC content of 45.60% and an N50 of 20.41 kb (Supplementary Table 1), which were assembled using Flye v2.9⁴⁰. The de novo assembled contigs were polished using Racon v1.4.10⁴¹ and scaffolded by RagTag v2.1.0⁴² using the available P. notatum genome reference²⁰ (NCBI Genome assembly ASM2253091v1), excluding the unassigned contigs. The new assembly was polished with 70× coverage of Illumina paired-end sequences. Illumina short-read mapping was performed using BWA-MEM v0.7.17⁴³, and error correction was performed with Pilon v1.23⁴⁴ in two successive iterations. This procedure resulted in a 557.8 Mb #R1 genome (GenBank GCA_036689595.1), including the ten expected chromosome-length scaffolds (N50 = 56.10 Mb) and a GC content of 45.80% (Table 1). Of the total ONT reads used as input, 99.14% were mapped within the assembly, indicating a high degree of raw data inclusiveness. #R1 pseudomolecules were named based on their sequence similarity to the reference chromosomes²⁰. Chromosome size varied between 46.63 and 85.72 Mb, with a mean of 55.78 ± 10.93 Mb (Supplementary Fig. 1).
Some of the unassigned contigs reported by Yan et al.²⁰ showed similarity with sequences within the #R1 chromosomes. These additions probably contribute to the increase in the genome length from 541 Mb of the reference²⁰ to 557.8 Mb of the new assembly.
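For readers less familiar with the N50 statistic quoted for reads, transcripts, and scaffolds, the sketch below shows one standard way to compute it from a list of sequence lengths. The function name and the toy lengths are illustrative only and are not taken from the #R1 data.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L together
    cover at least half of the total length (0 for an empty input)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Toy usage: total = 30, and the two longest sequences (10 + 8) reach half.
print(n50([10, 8, 6, 4, 2]))
```

Because N50 depends only on the sorted lengths, it is insensitive to many short sequences but shifts markedly when a few long ones are added or removed.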

Table 1 Summary statistics of P. notatum genome assembly and annotation.

Flowers and leaves transcriptome assembly

The #R1 genome was used for a reference-guided transcriptome assembly of flowers and leaves. From a total of 11.9 Gb of ONT cDNA reads from the flower transcriptome, ~10 Gb of filtered reads (Q > 10) were assembled using StringTie v2.1.4⁴⁵. The resulting flower transcriptome assembly consisted of 36,317 transcripts with a GC content of 51.68% and an N50 of 2,382 bp (Table 2; Supplementary Table 2) (GenBank GKQU01000000.1). Furthermore, the Illumina cDNA paired-end reads (Q > 30) from leaves of diploid genotypes available from the NCBI database (SRR7347364, SRR7347365, SRR7347366, SRR7347367, SRR7347368, SRR7347369)¹⁷ were assembled in a reference-based manner using Trinity v2.0.2⁴⁶ and produced 76,682 transcripts with a GC content of 46.69% and an N50 of 1,545 bp (Table 2, Supplementary Table 2). The features of both transcriptomes were consistent with previous reports for the species^16,17,18 and were subsequently used as biological evidence for the #R1 genome annotation (see below).

Table 2 Summary of flowers and leaves transcriptome assemblies from diploid P. notatum genotypes.

Genome annotation

Repetitive sequences

Repetitive sequences in the #R1 genome assembly were assessed using the filtered Illumina paired-end reads and the RepeatExplorer2 pipeline integrated into the Galaxy platform (https://repeatexplorer-elixir.cerit-sc.cz/) following the protocol described by Novak et al.⁴⁷. Briefly, a clustering analysis was performed using RepeatExplorer2 and the TAREAN tandem repeat analyzer module. The DANTE tool was used to extract the consensus sequences of transposable elements (TEs) and classify them based on the REXdb database Viridiplantae 3.0 release⁴⁸, using ‘BLOSUM80’ as scoring matrix and no iterative search. RepeatModeler v4.1.2⁴⁹ (RM2) was used to generate a custom library of P. notatum TEs, and RepeatMasker v4.1.2-p1⁵⁰ was used to determine the frequency of repeat DNA families. The RM2 output was then parsed (modified ParseRM.pl script⁵¹) to identify and quantify TE families. The putative centromeric regions of #R1 chromosomes were localized using the centromere-specific satellite sequences of eight grass species (Oryza sativa, Setaria viridis, Setaria italica, Panicum hallii, Panicum capillare, Panicum virgatum, Zea mays and Zea luxurians) described by Melters et al.⁵². Chromosomal positions were determined by BLASTN analysis⁵³ using the satellite sequences as query and considering only the alignments longer than 100 bp and identities >80%⁵⁴. Telomeric regions were identified using the quarTeT tool⁵⁵.
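The alignment thresholds used to place the centromeric satellites (length > 100 bp, identity > 80%) can be applied to BLASTN results as in the sketch below. It assumes the 12 default columns of BLAST+ tabular output (-outfmt 6); the function name and the example records (such as the query name sat1) are hypothetical.

```python
def filter_hits(lines, min_len=100, min_identity=80.0):
    """Yield (subject, start, end, identity, length) for hits passing both
    thresholds. Assumes the 12 default columns of BLAST+ -outfmt 6:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        identity, length = float(fields[2]), int(fields[3])
        sstart, send = int(fields[8]), int(fields[9])
        if length > min_len and identity > min_identity:
            # Minus-strand hits have sstart > send, so normalise the interval.
            yield fields[1], min(sstart, send), max(sstart, send), identity, length


# Hypothetical records: only the first and last pass both thresholds.
example = [
    "sat1\tchr01\t92.5\t150\t10\t1\t1\t150\t5000\t5149\t1e-50\t200",
    "sat1\tchr02\t78.0\t300\t60\t6\t1\t300\t100\t399\t1e-20\t120",
    "sat1\tchr03\t95.0\t90\t4\t0\t1\t90\t10\t99\t1e-30\t150",
    "sat1\tchr04\t88.0\t201\t20\t4\t1\t201\t900\t700\t1e-40\t180",
]
for hit in filter_hits(example):
    print(hit)
```

Clusters of retained hits along a chromosome can then be inspected to propose a putative centromere position.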

Analysis of the Illumina reads with RepeatExplorer2 identified a total of 320 Mb of repetitive sequences (57.36% of the #R1 assembly), predominantly consisting of retrotransposons (82.12%) and DNA transposons (7.17%) and including a substantial proportion of unclassified elements (Fig. 3a, Table 3). When mapped onto the #R1 genome, repetitive sequences occupied a minimum of 44.96% (chr. 02) and a maximum of 71.21% (chr. 08) of the chromosome length (Table 4). As expected, the density distribution of the different repeat elements varied along the chromosomes. LTRs were most abundant in putative centromeric regions, whereas retroelements (LINE and SINE), DNA transposons, and rolling circles were prevalent in chromosome arms (Fig. 3b,c). Simple repeats and satellite repeats appeared regularly distributed along all ten chromosomes (Fig. 3c). The putative centromeric regions could be assigned to eight of the ten chromosomes. For chromosomes 2 and 10, these regions could not be properly defined, probably due to a low assembly resolution in these areas, and therefore the proposed locations are hypothetical (Table 4, Supplementary Fig. 1). Similarly, the putative locations of the telomeric regions of chromosomes 2, 3, 4, 9 and 10 were recognized. However, for chromosomes 1, 5, 6, 7 and 8 the positions given are provisional due to the short alignments obtained (Table 4). The average length of the putative telomeres was 6,255 bp, ranging from 70 bp (chr. 09) to 26,929 bp (chr. 03) (Table 4).

Table 3 Classification of major repeat sequence families in the #R1 genome as assessed using the RepeatMasker software.

Table 4 Length and proportion of repetitive elements of the P. notatum #R1 chromosomes.

Gene annotation

Gene prediction and annotation were performed using the MAKER v2.31.9 pipeline⁵⁶ by integrating ab initio gene model predictions with biological transcriptomic and proteomic data through multiple BLAST steps using Exonerate v2.4.0⁵⁷. The soft-repeat-masked version of the #R1 genome, together with the flower and leaf transcriptomes (this work) merged and filtered for redundancy (similarity threshold of 90%) using CD-HIT⁵⁸, was used as input. In addition, the transcriptome of Sorghum bicolor NCBIv3 (GenBank GCA_000003195.3) and the proteome of Oryza sativa Japonica Group cv. Nipponbare (GenBank GCA_001433935.1) were included as expressed sequence evidence of related species. Two MAKER iterations were performed to obtain the final annotation. In the first one, ab initio gene predictions were carried out using AUGUSTUS v3.2.2⁵⁹ with the EST trust-blindly option enabled and Oryza sativa as the model species. The resulting gene models were filtered to retain only those with an annotation edit distance (AED) <0.5⁵⁶. The outcome of this first annotation was then used to train new species models for AUGUSTUS and SNAP⁶⁰ for the second run of MAKER. Gene models with an AED score > 0.5 and transcripts <50 nt were filtered out. The predicted coding sequences (CDS) obtained with MAKER were then translated to protein sequences using the program GffRead v0.12.7⁶¹ with parameter “-y”. Predicted protein sequences were checked for CDS features (presence of start and stop codons) and for homology with known domains using InterProScan v5.53.87.0⁶² (consulting the databases TIGRFAM, SFLD, SUPERFAMILY, PANTHER, SMART, CDD, PIRSR, Pfam, in April 2023). Gene models that met both criteria were considered as “high confidence”.
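AED ranges from 0 (full agreement between a gene model and its supporting evidence) to 1 (no support), and MAKER records it as an _AED attribute on mRNA features of its GFF3 output. The following sketch shows how transcripts could be selected on that attribute. It is an illustration rather than the procedure used here: the function names and example records are hypothetical, a value of exactly 0.5 is treated as failing, and the transcript length filter is omitted.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute string 'key=value;key=value' into a dict."""
    pairs = (item.split("=", 1) for item in column9.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def keep_mrna_ids(gff_lines, max_aed=0.5):
    """Return the IDs of mRNA features whose _AED attribute is below max_aed."""
    kept = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" in attrs and float(attrs["_AED"]) < max_aed:
            kept.append(attrs.get("ID", ""))
    return kept


# Hypothetical records: only transcript t1 has an AED below 0.5.
records = [
    "##gff-version 3",
    "chr01\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "chr01\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=t1;Parent=g1;_AED=0.12;_eAED=0.12",
    "chr01\tmaker\tmRNA\t1000\t1900\t.\t-\t.\tID=t2;Parent=g2;_AED=0.75",
]
print(keep_mrna_ids(records))
```

Child features (exons, CDS) of the retained transcripts would then be carried over by matching their Parent attribute against the kept IDs.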

Using this strategy, we obtained a total of 51,249 transcripts with an AED < 0.5 (85.18% of the total predicted) (Fig. 4a), which defined 45,074 gene models with approximately 1.14 transcripts per gene (Supplementary Table 3). The average lengths of mRNA and CDS were 3,679 nt and 1,258 nt, respectively. Each predicted gene contained an average of 4.4 exons, and the mean exon length was 346 nt. Of the total predicted gene models, 36,079 (80.04%) were classified as high-confidence (HC) genes. The complete list of genes, their genomic coordinates and corresponding A. thaliana and rice homologs, together with their functional annotation, are summarized from the GFF file in Supplementary Table 3. As expected, over 99% of the flower and leaf transcripts mapped to the #R1 genome, showing a high density towards the ends of the chromosome arms and a low density in most of the putative centromeric regions (Fig. 4b). The number and density of genes per #R1 chromosome are shown in Table 5.

Table 5 Number and density of predicted genes per #R1 chromosome.

Identification of rRNA and tRNAs

rRNA genes were identified using Barrnap v0.9⁶³ software with an e-value cut-off for similarity of 1e⁻¹⁰ and a minimum length threshold of 0.9. In addition, tRNA genes were identified using tRNAscan-SE v1.3.1⁶⁴ with the ‘-infernal’ mode. These analyses resulted in the annotation of 354 rRNA genes and 544 tRNA genes in the #R1 genome (Table 1), whose locations are given in the GFF annotation file deposited in the NCBI database under accession number GCA_036689595.1.

Prediction of microRNA (miRNA) genes and targets

MicroRNA (miRNA) gene precursors present in the #R1 genome were searched using the small RNA (sRNA) sequence database of the reproductive development of sexual and apomictic P. notatum genotypes¹⁹ available at the NCBI under BioProject accession PRJNA373857 and the software ShortStack v3.8.4⁶⁵. miRNA precursors, mature miRNA sequences and putative targets in the #R1 genome were detected as described in Ortiz et al.¹⁹ using the #R1 assembly as a reference. The putative miRNA target regions were analyzed using the #R1 GFF annotation file to determine the location of the mature miRNA alignment (5′ UTR, exon, intron, or 3′ UTR regions) within the genes. Following these procedures, a total of 59 clusters containing sRNAs and distributed across the 10 chromosomes were detected (Supplementary Table 4, sheet 1), most of them producing mature miRNAs of 21 nt (47 clusters) and 22 nt (9 clusters). A total of 52 unique mature miRNAs were predicted, corresponding to 21 known families and including all miRNAs previously described in the species, with the exception of miR390¹⁹ (Supplementary Table 4, sheet 2). Moreover, two miRNAs not previously reported in the species (miR827 and miR3979) were identified (Supplementary Table 4, sheet 2). Fourteen precursors generate putative mature miRNAs with no significant match in miRBase and, therefore, may represent novel Paspalum-specific miRNAs. A search for target regions in the #R1 genome performed with TargetFinder⁶⁶ identified 1,456 unique genomic regions (TF score < 4), of which 1,324 have homology with known proteins (Supplementary Table 4, sheet 3).

Data Records

The raw reads derived from the #R1 genome sequencing using Oxford Nanopore (ONT) technology were deposited in the NCBI Sequence Read Archive (SRA) database under accession Nos. SRS19975480⁶⁷ and SRS19975482⁶⁸. The Illumina raw sequencing data were deposited in the NCBI SRA database under accession Nos. SRS19975483⁶⁹ and SRS19975484⁷⁰. The #R1 genome assembly and annotation were deposited in the NCBI database under accession No. GCA_036689595.1⁷¹. The reads of the #R1 flower cDNA ONT sequencing were deposited in the SRA database under accession No. SRS19975481⁷², and the #R1 flower transcriptome assembly was deposited in the NCBI database under accession No. GKQU00000000.1⁷³. The raw reads from leaves were downloaded from the NCBI Sequence Read Archive (SRA) database, accession Nos. SRR7347364⁷⁴, SRR7347365⁷⁵, SRR7347366⁷⁶, SRR7347367⁷⁷, SRR7347368⁷⁸, SRR7347369⁷⁹. The leaf transcriptome assembly was deposited in the NCBI under the accession number DAWXED000000000⁸⁰. The precursor and mature miRNA sequence data recovered from the #R1 genome have been incorporated in Supplementary Table 4, sheets 1 and 2.

Technical Validation

Assessing the quality of HMW genomic DNA for ONT sequencing

The quality and integrity of the #R1 genomic DNA for ONT sequencing were evaluated using a NanoDrop One/OneC spectrophotometer and a pulsed-field gel electrophoresis system (PFGE, Bio-Rad) according to Mariac et al.³⁰ (https://www.protocols.io/view/high-molecular-weight-dna-extraction-from-plant-nu-83shyne). DNA preparations consistently showed 260/280 nm ratios of 1.8–2.0 and 260/230 nm ratios of 2.0–2.2, confirming the purity of the extraction. The high molecular weight of the DNA preparation was checked by loading 1.5–5.5 µg of genomic DNA in a 1% agarose gel (0.5 × TAE) with 5 µl of 6 × loading buffer and running the electrophoresis with the following parameters: pulse time, initial = 5 s and final = 117 s; running time = 20.5 h; V/cm = 5; angle = 120°; temperature = 14 °C; and mA at end of run = 255. The molecular weight of the genomic DNA preparation obtained ranged from 48 to 200 kb (Supplementary Fig. 2).

Assessment of genome and transcriptome assembly and annotation quality

The NCBI FCS-GX scan tool⁸¹ was used to find contaminants in the assembly, setting the taxon to Viridiplantae. In addition, the presence of organellar DNA was assessed by BLASTn analysis (query coverage >30% and identity >60%) using the Oryza sativa IRGSP-1.0 organellar data set as query. No contaminants or organellar DNA were detected in the #R1 assembly. The software Merqury⁸² was used to estimate the base-level accuracy and k-mer completeness of the #R1 genome. This analysis showed an assembly consensus quality value (QV) of 30.2, which corresponds to an accuracy of 99.9%, and a k-mer completeness value of 84.3%. Nevertheless, we cannot rule out that some regions may include both haplotypes (Supplementary Fig. 3). In addition, the #R1 assembly quality was evaluated with BUSCO v5⁸³ using the Liliopsida gene set as a reference, and by mapping the Illumina paired-end reads onto the genome. The BUSCO score showed the presence of 94.7% of complete genes (with 91.8% of them corresponding to single genes), 4% of fragmented genes and 1.3% of missing genes (Supplementary Fig. 4a). Furthermore, the percentage of the total core genes with more than one ortholog was only 3.1%. Moreover, 97.7% of the paired-end Illumina reads were properly mapped by BWA-MEM v0.7.17 to the #R1 genome, with an estimated average coverage depth of 93.2×. Using the same procedure for assessing the MAKER gene annotation, the BUSCO score showed that 94.4% of the 3,236 Liliopsida single-copy genes were properly annotated, with an average of 1.19 orthologs for each gene (Supplementary Fig. 4b). In this case, the percentage of duplicate transcripts increased to 10.1%, probably due to the inclusion of splicing variants. On the other hand, BUSCO analysis performed to evaluate both the flower and leaf transcriptome assemblies revealed 87.9% and 82.2% of complete, 3.7% and 8.6% of fragmented, and 8.4% and 9.2% of missing genes, respectively (Supplementary Fig. 4c,d).
Overall, these results indicate that both transcriptomes have a high level of completeness, and therefore represent comprehensive evidence of the expressed sequences of the #R1 genome.
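The relationship between the consensus quality value and base-level accuracy quoted above follows the Phred scale, QV = −10·log10(error rate). The short sketch below makes the conversion explicit; the function names are illustrative.

```python
import math


def qv_to_accuracy(qv: float) -> float:
    """Convert a Phred-scaled quality value into per-base accuracy."""
    return 1.0 - 10 ** (-qv / 10)


def accuracy_to_qv(accuracy: float) -> float:
    """Convert a per-base accuracy (below 1.0) into a Phred-scaled quality value."""
    return -10 * math.log10(1.0 - accuracy)


print(f"QV 30.2 -> accuracy {qv_to_accuracy(30.2):.2%}")
```

On this scale a QV of 30.2 corresponds to slightly fewer than one expected error per 1,000 bases, in line with the 99.9% accuracy reported.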

Table 6 Software and parameters used during the #R1 genome sequencing, assembly and annotation.

Code availability

All software packages used in this study were run according to their user manuals. The versions and parameters used are listed in Table 6. No custom code was used in this study.

References

Chase, A. The North American species of Paspalum. In Systematic plant studies. 1–310 (1929).
Gates, R. N., Quarin, C. L. & Pedreira, C. G. S. Bahiagrass. In: Warm‐season (C4) grasses 45, 651-680 (2004).
Burton, G. W. The method of reproduction in common bahia grass, Paspalum notatum. Agron. J. 40(5), 443–452 (1948).
Burton, G. W. Breeding Pensacola Bahiagrass, Paspalum notatum: Method of reproduction. Agron. J. 47(7), 311–314 (1955).
Burton, G. W. A search for the origin of Pensacola Bahia grass. Econ. Bot. 21(4), 379–382 (1967).
Acuña, C. A. et al. Bahiagrass tetraploid germplasm: reproductive and agronomic characterization of segregating progeny. Crop Sci. 49, 581–588 (2009).
Forbes, I. Jr & Burton, G. W. Cytology of diploids, natural and Induced tetraploids, and intra‐species hybrids of Bahiagrass, Paspalum Notatum Flügge. Crop Sci. 1(6), 402–406 (1961).
Quarin, C. L., Burson, B. L. & Burton, G. W. Cytology of intra-and interspecific hybrids between two cytotypes of Paspalum notatum and P. cromyorrhizon. Bot. Gaz. 145(3), 420–426 (1984).
Acuña, C. A. et al. Reproductive systems in Paspalum: Relevance for germplasm collection and conservation, breeding techniques, and adoption of released cultivars. Front. Plant Sci. 10, 1377 (2019).
Zilli, A. L. et al. Widening the gene pool of sexual tetraploid bahiagrass: generation and reproductive characterization of a sexual synthetic tetraploid population. Crop Sci. 58(2), 762–772 (2018).
Ortiz, J. P. A., Pupilli, F., Acuña, C. A., Leblanc, O. & Pessino, S. C. How to become an apomixis model: the multifaceted case of Paspalum. Genes 11(9), 974 (2020).
Galdeano, F. et al. Relative DNA content in diploid, polyploid, and multiploid species of Paspalum (Poaceae) with relation to reproductive mode and taxonomy. J. Plant Res. 129(4), 697–710 (2016).
Mancini, M. et al. The MAP3K-coding QUI-GON JINN (QGJ) gene is essential to the formation of unreduced embryo sacs in Paspalum. Front. Plant Sci. 9, 1547 (2018).
Colono, C. et al. A plant-specific TGS1 homolog influences gametophyte development in sexual tetraploid Paspalum notatum ovules. Front. Plant Sci. 10, 1566 (2019).
May, D., Sanchez, S., Gilby, J. & Altpeter, F. Multi-allelic gene editing in an apomictic, tetraploid turf and forage grass (Paspalum notatum Flüggé) using CRISPR/Cas9. Front. Plant Sci. 14 (2023).
Ortiz, J. P. A. et al. A reference floral transcriptome of sexual and apomictic Paspalum notatum. BMC Genom. 18, 1–14 (2017).
de Oliveira, F. A. et al. Coexpression and transcriptome analyses identify active apomixis-related genes in Paspalum notatum leaves. BMC Genom. 21(1), 1–15 (2020).
Podio, M., Colono, C., Siena, L., Ortiz, J. P. A. & Pessino, S. C. A study of the heterochronic sense/antisense RNA representation in florets of sexual and apomictic Paspalum notatum. BMC Genom. 22, 1–19 (2021).
Ortiz, J. P. A. et al. Small RNA-seq reveals novel regulatory components for apomixis in Paspalum notatum. BMC Genom. 20(1), 1–17 (2019).
Yan, Z. et al. High-quality chromosome-scale de novo assembly of the Paspalum notatum ‘Flugge’ genome. BMC Genom. 23(1), 293 (2022).
Pucker, B., Irisarri, I., de Vries, J. & Xu, B. Plant genome sequence assembly in the era of long reads: Progress, challenges and future directions. Quant. Plant. Biol. 3, e5 (2022).
Sahu, S. K. & Liu, H. Long-read sequencing (method of the year 2022): the way forward for plant omics research. Mol. Plant 16(5), 791–793 (2023).
Warburton, P. E. & Sebra, R. P. Long-Read DNA Sequencing: Recent Advances and Remaining Challenges. Annu Rev Genomics Hum Genet. 24 (2023).
Siadjeu, C., Pucker, B., Viehöver, P., Albach, D. C. & Weisshaar, B. High contiguity de novo genome sequence assembly of trifoliate yam (Dioscorea dumetorum) using long read sequencing. Genes 11(3), 274 (2020).
Hunt, S. P. et al. A chromosome-scale assembly of the garden orach (Atriplex hortensis L.) genome using Oxford Nanopore sequencing. Front. Plant Sci. 11, 624 (2020).
Carballo, J. et al. A high-quality genome of Eragrostis curvula grass provides insights into Poaceae evolution and supports new strategies to enhance forage quality. Sci. Rep. 9(1), 10250 (2019).
Sun, G. et al. Genome of Paspalum vaginatum and the role of trehalose mediated autophagy in increasing maize biomass. Nat. Commun. 13(1), 7731 (2022).
Wu, D. et al. Genomic insights into the evolution of Echinochloa species as weed and orphan crop. Nat. Commun. 13(1), 689 (2022).
Quarin, C. L., Espinoza, F., Martinez, E. J., Pessino, S. C. & Bovo, O. A. A rise of ploidy level induces the expression of apomixis in Paspalum notatum. Sex. Plant Reprod. 13, 243–249 (2001).
Mariac, C., Zekraoui, L. & Leblanc, O. High molecular weight DNA extraction from plant nuclei isolation. Protocols.io. https://doi.org/10.17504/protocols.io.83shyne (2019).
Azevedo, H., Lino-Neto, T. & Tavares, R. M. An improved method for high-quality RNA isolation from needles of adult maritime pine trees. Plant Mol. Biol. Rep. 21, 333–338 (2003).
Clarke, J. D. Cetyltrimethyl ammonium bromide (CTAB) DNA miniprep for plant DNA isolation. Cold Spring Harb. Protoc. 3, pdb.prot5177 (2009).
De Coster, W. & Rademakers, R. NanoPack2: population-scale evaluation of long-read sequencing data. Bioinform. 39(5), btad311 (2023).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinform. 30, 2114–20 (2014).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinform. 27(6), 764–770 (2011).
Vurture, G. W. et al. GenomeScope: fast reference-free genome profiling from short reads. Bioinform. 33(14), 2202–2204 (2017).
Yan, Q. et al. The elephant grass (Cenchrus purpureus) genome provides insights into anthocyanidin accumulation and fast growth. Mol Ecol Resour. 21(2), 526–542 (2021).
Zhang, B. et al. A high-quality haplotype-resolved genome of common bermudagrass (Cynodon dactylon L.) provides insights into polyploid genome stability and prostrate growth. Front. Plant Sci. 13, 890980 (2022).
Doležel, J., Greilhuber, J. & Suda, J. Estimation of nuclear DNA content in plants using flow cytometry. Nat. Protoc. 2(9), 2233–2244 (2007).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 37, 540–546 (2019).
Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27(5), 737–746 (2017).
Alonge, M. et al. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biol. 23(1), 1–19 (2022).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinform. 26(5), 589–595 (2010).
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PloS One 9(11), e112963 (2014).
Kovaka, S. et al. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 20, 1–13 (2019).
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29(7), 644–652 (2011).
Novák, P., Neumann, P. & Macas, J. Global analysis of repetitive DNA from unassembled sequence reads using RepeatExplorer2. Nat. Protoc. 15(11), 3745–3776 (2020).
Neumann, P., Novák, P., Hoštáková, N. & Macas, J. Systematic survey of plant LTR-retrotransposons elucidates phylogenetic relationships of their polyprotein domains and provides a reference for element classification. Mob. DNA 10, 1–17 (2019).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. USA 117(17), 9451–9457 (2020).
Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0 (2013).
Kapusta, A., Suh, A. & Feschotte, C. Dynamics of genome size evolution in birds and mammals. Proc. Natl. Acad. Sci. USA 114(8), E1460–E1469 (2017).
Melters, D. P. et al. Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biol. 14(1), 1–20 (2013).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990).
Salson, M. et al. An improved assembly of the pearl millet reference genome using Oxford Nanopore long reads and optical mapping. G3-Genes, Genom. Genet. 13(5), jkad051 (2023).
Lin, Y. et al. quarTeT: a telomere-to-telomere toolkit for gap-free genome assembly and centromeric repeat identification. Hortic. Res. 10(8), uhad127 (2023).
Holt, C. & Yandell, M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinform. 12(1), 1–14 (2011).
Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinform. 6, 31, https://doi.org/10.1186/1471-2105-6-31 (2005).
Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinform. 28(23), 3150–3152 (2012).
Stanke, M. et al. AUGUSTUS: ab initio prediction of alternative transcripts. Nucleic Acids Res. 34(suppl_2), W435–W439 (2006).
Korf, I. Gene finding in novel genomes. BMC Bioinform. 5(1), 1–9 (2004).
Pertea, G. & Pertea, M. GFF utilities: GffRead and GffCompare. F1000Research 9 (2020).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinform. 30(9), 1236–1240 (2014).
Seemann, T. barrnap 0.9: rapid ribosomal RNA prediction. v0.9 (2018).
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25(5), 955–964 (1997).
Axtell, M. J. ShortStack: comprehensive annotation and quantification of small RNA genes. RNA 19(6), 740–751 (2013).
Fahlgren, N. & Carrington, J. C. miRNA target prediction in plants. Plant MicroRNAs: Methods and Protocols. Springer; New York, NY, USA. pp. 51–57 (2010).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRS19975480 (2024).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRS19975482 (2024).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRS19975483 (2024).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRS19975484 (2024).
NCBI GenBank. https://identifiers.org/ncbi/insdc.gca:GCA_036689595.1 (2024).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRS19975481 (2024).
NCBI Transcriptome Shotgun Assembly. https://identifiers.org/ncbi/insdc:GKQU00000000.1 (2024).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRR7347364 (2019).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRR7347365 (2019).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRR7347366 (2019).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRR7347367 (2019).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRR7347368 (2019).
NCBI Sequence Read Archive. https://identifiers.org/ncbi/insdc.sra:SRR7347369 (2019).
NCBI Transcriptome Shotgun Assembly. https://identifiers.org/ncbi/insdc:DAWXED000000000 (2024).
Astashyn, A. et al. Rapid and sensitive detection of genome contamination at scale with FCS-GX. Genome Biol. 25(1), 60 (2024).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 1–27 (2020).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinform. 31(19), 3210–3212 (2015).

Acknowledgements

This research was funded by the European Union’s Horizon 2020 Research and Innovation Program under the Marie Skłodowska-Curie Grant Agreements 872417 (MAD) and 101007438 (POLYPLOID). J.P.A.O., L.A.S., M.P. and S.C.P. were supported by the National Agency for Scientific and Technological Promotion (ANPCyT), Argentina (PICT-2017-1956, PICT 2019-03414 and PICT 2019-02153); CONICET, Argentina (Projects PUE 22920160100043CO and PIP 11220200101680CO); and the National University of Rosario, Argentina (Project 80020190300021UR). E.A. and F.P. were funded by the European Union-Next Generation EU through the Ministero dell’Università e della Ricerca (MUR), Italy, through the projects PRIN 2022 (2022Z4HLLJ) and PRIN 2022 PNRR (P2022KFJB5), respectively. This work was carried out using the facilities of the CCT-Rosario Computational Center, a member of the High-Performance Computing National System (SNCAD, MincyT-Argentina), and the ISO 9001 certified IRD i-Trop HPC (South Green Platform) at IRD Montpellier, France. L.A.S., M.P., S.C.P. and J.P.A.O. are research staff members of CONICET, Argentina, and J.M.V. was supported by a PhD grant from CONICET.

Author information

These authors contributed equally: Juan Manuel Vega, Maricel Podio, Julie Orjuela.

Authors and Affiliations

Laboratorio de Biología Molecular, Instituto de Investigaciones en Ciencias Agrarias de Rosario (IICAR) CONICET-UNR, Facultad de Ciencias Agrarias, Campo Experimental Villarino, Universidad Nacional de Rosario, Zavalla (S2125ZAA), Santa Fe, Argentina
Juan Manuel Vega, Maricel Podio, Lorena A. Siena, Silvina C. Pessino & Juan Pablo A. Ortiz
DIADE, Univ. Montpellier, CIRAD, IRD, Montpellier, France
Julie Orjuela, Marie Christine Combes, Cedric Mariac & Olivier Leblanc
Department of Agricultural, Food and Environmental Science, University of Perugia, 06121, Perugia, Italy
Emidio Albertini
Institute of Biosciences and Bioresources (IBBR), National Research Council (CNR), 06128, Perugia, Italy
Fulvio Pupilli

Contributions

Conceptualization, J.O., J.M.V., J.P.A.O., S.C.P. and O.L.; Methodology, C.M., M.C.C. and L.A.S.; Software, J.O., J.M.V., M.P. and O.L.; Formal Analysis, J.O., J.M.V. and M.P.; Investigation, C.M., M.C.C., J.O., J.M.V., M.P., S.C.P. and L.A.S.; Data Curation, J.O., M.P., J.P.A.O. and O.L.; Writing – Original Draft, J.P.A.O. and O.L.; Writing – Review & Editing, C.M., E.A., F.P., M.C.C., M.P., J.M.V., J.O. and S.C.P.; Visualization, J.M.V., M.P. and J.P.A.O.; Supervision, J.P.A.O. and O.L.; Funding Acquisition, E.A., F.P., S.C.P., M.P., J.P.A.O. and O.L. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Juan Pablo A. Ortiz or Olivier Leblanc.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Table 1

Supplementary Table 2

Supplementary Table 3

Supplementary Table 4

Supplementary Figures 1–4

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Vega, J.M., Podio, M., Orjuela, J. et al. Chromosome-scale genome assembly and annotation of Paspalum notatum Flüggé var. saurae. Sci Data 11, 891 (2024). https://doi.org/10.1038/s41597-024-03731-0

Received: 19 February 2024
Accepted: 02 August 2024
Published: 16 August 2024
DOI: https://doi.org/10.1038/s41597-024-03731-0
Springer Nature Limited
