Background & Summary

Broomcorn millet (Panicum miliaceum L.), a member of the Paniceae tribe in the Gramineae family, exhibits remarkable adaptability to marginal regions due to its short growing season (60–90 days), low water requirements, high salt tolerance, and efficient nutrient utilization1,2. As a C4 plant, broomcorn millet fixes carbon efficiently and makes economical use of water and nitrogen. Additionally, its grains are gluten-free and highly nutritious, with higher protein, mineral, and antioxidant contents than most other cereals3. Consequently, broomcorn millet has been extensively cultivated in semiarid regions across Asia, Europe, and other continents and is considered one of the oldest crops worldwide4. The cultivation of broomcorn millet holds promise for enhancing food security, diversifying agriculture, and promoting a healthier diet5. Broomcorn millet has an allotetraploid genome consisting of 36 chromosomes (2n = 4x = 36)6. Although four chromosome-level assemblies of broomcorn millet, Jinshu77, LM_v18, LM_v29, and Pm_03908, have been made available, gaps remain in these genomes because highly repetitive sequences are clustered across the genome, particularly in the telomere and centromere regions. In recent years, telomere-to-telomere (T2T) and gap-free genomes have been successfully obtained for various important crops, including rice10, barley11, and maize12.

In the present study, we assembled the first T2T gap-free genome of broomcorn millet (AJ8) (Fig. 1a) using PacBio HiFi long reads, Oxford Nanopore Technologies (ONT) ultra-long reads, and Hi-C sequencing data. The resulting complete genome assembly has a final size of 834.7 Mb and is organized into 18 pseudo-chromosomes (Table 1; Fig. 1b). Genome annotation identified 52.0% repetitive sequences and 63,678 protein-coding genes (Fig. 1b). This complete reference genome provides a robust foundation for future studies on population and conservation genetics of broomcorn millet.

Fig. 1

An overview of the AJ8 genome. (a) Photograph of the AJ8 plant. (b) Circos plot of the AJ8 genome. The plot includes the following components, arranged from inside to outside: (I) Collinear regions within the AJ8 assembly; (II) GC content in non-overlapping 1-Mb windows; (III) Percentage of repeats in 1-Mb sliding windows; (IV) Gene density in 1-Mb sliding windows; (V) Length of each pseudo-chromosome in megabases (Mb).

Table 1 Summary statistics of broomcorn millet genome assemblies.

Methods

Plant materials and growth conditions

The broomcorn millet landrace sequenced in this study was originally collected from the Center for Agricultural Genetic Resources Research, Shanxi Agricultural University, Taiyuan, Shanxi Province (coordinates: E 112° 34′ 26.66″, N 37° 46′ 37.16″). The plants were grown under controlled conditions at a temperature of 25 °C, a humidity of 60%, and a photoperiod of 14 h light and 10 h dark. Twenty seedlings with consistent growth at the fourth-leaf stage were selected, and roots, stems, and leaves were sampled. For each tissue, 2 g was weighed, immediately frozen in liquid nitrogen, and subsequently stored at −80 °C.

Long-insert library preparation and sequencing

Genomic DNA was extracted from leaf tissue using the DNeasy Plant Maxi kit (Qiagen). The PacBio long-insert libraries were prepared according to the manufacturer’s instructions with an insert size of approximately 20 kb (Pacific Biosciences, USA). Subsequently, the libraries were sequenced on PacBio Sequel II platforms in circular consensus sequencing mode. The subreads were processed using SMRTLink (v11.1.0)13 with the parameters “--minPasses 3 --minPredictedAccuracy 0.99 --minLength 500”, yielding approximately 77.0 Gb of high-fidelity (HiFi) reads with an N50 of about 18.0 kb (Table 2).

Table 2 Summary of sequencing data of AJ8 genome.

The ONT ultra-long insert libraries were generated using the Oxford Nanopore SQK-LSK109 kit and then sequenced on a PromethION flow cell (Oxford Nanopore Technologies, Oxford, UK). A total of 165.6 Gb of ONT data with 196x coverage was generated, with an N50 of 55,765 bp (Table 2). After error correction and length filtering, 34.8 Gb of ultra-long ONT reads with an N50 of 92,975 bp were obtained (Table 2).
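The N50 values reported for the HiFi and ONT reads follow the usual definition: the length L such that reads of length L or longer contain at least half of all sequenced bases. The following is a minimal sketch of that calculation, assuming read lengths are already available as a list of integers; the function name and the toy input are illustrative and are not part of the pipeline used in this study.

```python
def n50(lengths):
    """Return the N50 of a collection of read (or contig) lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


# Toy lengths for illustration only; these are not values from Table 2.
toy_reads = [2, 3, 4, 5, 6]
print(n50(toy_reads))
```

The same function applies unchanged to contig lengths, which is how a contig N50 would be derived from an assembly.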

Short-insert library preparation and sequencing

For chromosome conformation capture (Hi-C) sequencing, Hi-C libraries based on the DpnII restriction enzyme were generated as previously described14 and sequenced on the MGISEQ-2000 platform. A total of 253.0 Gb of clean data were obtained from 255.1 Gb of sequencing data using SOAPnuke (v2.0)15 with the parameters “-n 0.01 -l 20 -q 0.1 -i -Q 2 -G 2 -M 2 -A 0.5” (Table 2).

RNA-seq libraries from leaf tissues were constructed using the NEBNext® Ultra™ RNA Library Prep Kit for Illumina® (NEB, Ipswich, MA, USA) following the manufacturer’s protocol. The RNA libraries were then sequenced on an MGISEQ-2000 instrument, generating 150 bp paired-end reads. After quality control with fastp (v0.19.5)16 using the parameters “--adapter_sequence AAGTCGGAGGCCAAGCGGTCTTAGGAAGACAA --adapter_sequence_r2 AAGTCGGATCGTAGCCATGTCGTTCTGTGAGCCAAGGAGTTG --average_qual 15 -l 150”, each library contained more than 7.8 Gb of clean data. More than 98.1% of the clean data had scores greater than Q20 in each library (Table 3).

Table 3 Summary of RNA-seq data of AJ8 genome.

Genome assembly

Using HiFi reads, ultra-long ONT reads, and Hi-C clean data, the primary contigs were assembled with Hifiasm (v 0.19.5)17 using default parameters. To anchor contigs onto chromosomes, we used BWA (v 0.7.12)18 to align the Hi-C clean data to the assembled contigs. Low-quality reads were filtered out using the HiC-Pro pipeline19 with default parameters. The remaining valid reads were used to anchor chromosomes with Juicer20 and the 3d-dna pipeline21. Notably, the hifiasm assembly consisted of contiguous sequences covering the entire length of all chromosomes. This result can be attributed to the high accuracy of HiFi data, the use of ultra-long ONT data, and ongoing improvements in assembly algorithms. Analogous to the T2T genome of rapeseed22, the hifiasm assembly comprises continuous sequences spanning the entirety of nine chromosomes. For further refinement, the T2T assembly was polished using a method similar to that described by Mc Cartney et al.23. In brief, the HiFi reads were aligned to the T2T assembly using Winnowmap2 (v 2.03)24. The resulting alignments were filtered to exclude secondary alignments and alignments with excessive clipping using the ‘falconc bam-filter-clipped’ tool. Finally, racon (v 1.5.0)25 was run on the filtered alignments. The final assembled genome had a length of 834,678,208 bp and a contig N50 of 48.3 Mb (Table 1). The assembled sequences were successfully anchored to 18 pseudo-chromosomes (Table 1). The completeness of the assembly was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO) (v 5.5.0)26 with the embryophyta_odb10 lineage dataset (parameters: -m genome -l embryophyta_odb10). We applied Merqury (v1.3)27 to the PacBio HiFi long reads with a k-mer size of 17 bp to estimate the quality value.
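The k-mer-based quality value can be read as a Phred-scaled per-base error rate. The sketch below follows the formula published for Merqury, in which the share of assembly k-mers absent from the read set is converted to a per-base error probability; the function names and the toy counts are illustrative assumptions, and the sketch does not reproduce Merqury's k-mer counting itself.

```python
import math


def kmer_based_qv(asm_only_kmers, total_asm_kmers, k):
    """Phred-scaled consensus quality from k-mer counts.

    asm_only_kmers:  k-mers found in the assembly but not in the reads
    total_asm_kmers: all k-mers found in the assembly
    k:               k-mer size (17 in this study)
    """
    if total_asm_kmers <= 0:
        raise ValueError("total_asm_kmers must be positive")
    if asm_only_kmers <= 0:
        return math.inf  # no unsupported k-mers: error rate estimated as 0
    if asm_only_kmers >= total_asm_kmers:
        return 0.0  # every k-mer unsupported: error rate estimated as 1
    shared_fraction = 1 - asm_only_kmers / total_asm_kmers
    # Per-base error E = 1 - shared_fraction ** (1 / k), computed stably.
    error_rate = -math.expm1(math.log(shared_fraction) / k)
    return -10 * math.log10(error_rate)


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value back to an error probability."""
    return 10 ** (-qv / 10)


# Hypothetical counts for illustration; not the counts behind Table 1.
print(round(kmer_based_qv(1_000, 800_000_000, 17), 1))
print(qv_to_error_rate(60))
```

Under this scale, a quality value above 60 corresponds to fewer than one expected error per million bases.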

Annotation of repetitive sequences

Tandem repeats and interspersed repeats were identified using the method described in Qu et al.28. In brief, RepeatModeler (v1.0.4)29 and LTR-FINDER (v1.0.7)30 were employed to construct a de novo repeat sequence library. This library was then used to detect interspersed repeats and low-complexity sequences with RepeatMasker (v4.0.7)31. DNA and protein transposable elements (TEs) were identified with RepeatMasker (v4.0.7) and RepeatProteinMasker (v4.0.7), respectively. Tandem repeats were identified using Tandem Repeat Finder (v4.10.0)32. A total of 433.8 Mb (~52.0%) of repetitive sequences were obtained. Among the interspersed repeats, three types of repetitive elements, namely class I (retrotransposons), class II (DNA transposons), and unclassified elements, accounted for 49.6% of the genome assembly (Table 4). The telomeric sequences and centromeric regions in the AJ8 genome assembly were identified using quartet (v1.1.36)33 with “-c plant”.
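To illustrate what telomere detection at chromosome ends involves, the sketch below counts copies of the canonical plant telomeric repeat (TTTAGGG, or its reverse complement CCCTAAA) in a window at each end of a sequence. This is a simplified stand-in and not the algorithm implemented in quartet; the window size, the function names, and the toy sequence are assumptions made for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def telomere_repeat_counts(chrom_seq, window=10_000, motif="TTTAGGG"):
    """Count telomeric repeat copies in the first and last `window` bases.

    Both strands of the motif are checked, and the larger count is kept
    for each end. str.count gives non-overlapping occurrences.
    """
    rc_motif = reverse_complement(motif)
    head = chrom_seq[:window].upper()
    tail = chrom_seq[-window:].upper()
    return {
        "start": max(head.count(motif), head.count(rc_motif)),
        "end": max(tail.count(motif), tail.count(rc_motif)),
    }


# Toy chromosome: 50 repeat copies at the start, 40 at the end.
toy_chrom = "CCCTAAA" * 50 + "ACGT" * 500 + "TTTAGGG" * 40
print(telomere_repeat_counts(toy_chrom, window=350))
```

A chromosome would be called telomere-to-telomere in this simplified view when both ends exceed some minimum repeat count; the threshold itself would need to be chosen by the user.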

Table 4 Interspersed repeat contents in AJ8 genome assembly.

Protein-coding gene prediction and functional annotation

Gene prediction in this study combined transcriptome-based, homology-based, and ab initio prediction methods. For transcriptome-based prediction, 194.2 Gb of Illumina clean reads from 21 samples were assembled by Trinity (v 2.15.1)34 with the parameters ‘--max_memory 200G --CPU 40 --min_contig_length 200 --genome_guided_bam merged_sorted.bam --full_cleanup --min_kmer_cov 4 --min_glue 4 --bfly_opts '-V 5 --edge-thr=0.1 --stderr' --genome_guided_max_intron 10000’, which generated 227,501 transcripts with an N50 of 2,698 bp. These assembled transcripts were aligned against the AJ8 T2T assembly using the Program to Assemble Spliced Alignments (PASA) (v 2.4.1)35. Gene structures were generated from valid transcript alignments (PASA-set). RNA-seq clean reads were also mapped to the AJ8 T2T assembly using Hisat2 (v 2.0.1)36. Stringtie (v 1.2.2)37 and TransDecoder (v 5.7.1) (https://github.com/TransDecoder/TransDecoder) were employed to assemble the transcripts and to identify candidate coding regions, producing gene models (Stringtie-set). Genomes from seven plants, including rice (T2T-NIP)10, maize (T2T Mo17)12, A. thaliana (Col-PEK)38, sorghum (GCA_000003195.3)39, pearl millet (Tift23D2B1-P1-P5)40, foxtail millet (GCA_000263155.2)41, and two previous versions of broomcorn millet (Jinshu7: GCA_026771285.1; Longmi4: GCA_002895445.3)7,9 were downloaded and used as queries to search against the AJ8 T2T assembly using GeMoMa (v 1.9)42. These homology predictions were referred to as the “Homology-set”. For ab initio prediction, AUGUSTUS (v 3.2.3)43 was used to predict coding regions in the repeat-masked genome. All gene models were combined using EvidenceModeler (v 2.1.0)44 with different weights assigned to evidence from different sources (10 for PASA-set, 5 for Stringtie-set, 5 for Homology-set, and 1 for AUGUSTUS gene prediction). Protein-coding genes supported only by ab initio prediction were filtered out. Finally, the gene models were further refined with PASA (v 2.4.1)37 to add untranslated regions and alternative splicing information. The final comprehensive gene set comprised 63,678 genes.

The integrated gene set was translated into amino-acid sequences and annotated using the method described in Zhou et al.45. As a result, 98.8% of the predicted protein-coding genes were functionally annotated (Table 5). In addition, we used Diamond (v 0.9.30)46 (E-value ≤ 1e-05) to perform homology comparisons between AJ8 and Arabidopsis (TAIR10)47, as well as between AJ8 and rice (Osativa_v7_0)48. Among the AJ8 genes, 48,284 (75.83%) exhibited homology with Arabidopsis, while 56,777 (89.16%) showed homology with rice (Table S1).

Table 5 Number of functional annotations for predicted genes in AJ8 assembly.

Gene expression analysis

Gene expression analysis followed the previously reported method49. The expression heatmap was constructed using the heatmap R package. The expression matrix of genes across the transcriptome samples is provided in Table S2.
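Expression levels in this work are reported as FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below shows the standard FPKM formula and one way to count genes passing an FPKM threshold. It is an illustration under stated assumptions rather than the cited method49: the function names are hypothetical, and treating a gene as expressed when it reaches the threshold in at least one sample is an assumed criterion.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragment_count * 1e9 / (gene_length_bp * total_mapped_fragments)


def expressed_genes(fpkm_by_gene, threshold=1.0):
    """Return genes whose FPKM reaches `threshold` in at least one sample.

    fpkm_by_gene maps a gene identifier to a list of per-sample FPKM values.
    The "at least one sample" rule is an assumption of this sketch.
    """
    return {
        gene
        for gene, values in fpkm_by_gene.items()
        if values and max(values) >= threshold
    }


# Toy values for illustration only; gene names are hypothetical.
print(fpkm(100, 2000, 10_000_000))
toy_matrix = {"geneA": [0.2, 3.1], "geneB": [0.0, 0.4], "geneC": [1.0, 0.9]}
print(sorted(expressed_genes(toy_matrix)))
```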

Synteny analysis

Syntenic regions were identified through homology searches with MCScan (Python version)50, requiring a minimum of 30 genes per block.
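The minimum block size can be thought of as a filter applied to collinear blocks of gene pairs. The sketch below assumes an anchors-style text layout in which each block starts with a line beginning with "#" and each following line lists a gene pair in its first two whitespace-separated columns; that layout, the function names, and the toy gene identifiers are assumptions for illustration, and MCScan provides its own screening options for this purpose.

```python
def read_blocks(lines):
    """Group gene pairs into blocks separated by lines starting with '#'."""
    blocks, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            if current:
                blocks.append(current)
                current = []
            continue
        fields = line.split()
        current.append((fields[0], fields[1]))
    if current:
        blocks.append(current)
    return blocks


def filter_blocks(blocks, min_genes=30):
    """Keep only blocks containing at least `min_genes` gene pairs."""
    return [block for block in blocks if len(block) >= min_genes]


# Toy input: one block of 30 pairs and one block of 5 pairs.
toy_lines = ["###"]
toy_lines += [f"a{i}\tb{i}\t100" for i in range(30)]
toy_lines += ["###"]
toy_lines += [f"c{i}\td{i}\t100" for i in range(5)]

kept = filter_blocks(read_blocks(toy_lines), min_genes=30)
print(len(kept), len(kept[0]))
```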

Subgenome phasing

Using repetitive k-mers as “differential signatures”, we phased the subgenomes of AJ8 with SubPhaser51. The resulting subgenome assignments were consistent with the Jinshu7 genome (Table 6; Fig. 4).

Table 6 Correspondence between chromosome identifiers of AJ8 and Jinshu7.

Data Records

The sequencing data and assembled genome sequence have been deposited in the Sequence Read Archive with accession number SRP48256652 under project number PRJNA1059665. The genome assembly has been deposited at GenBank under the WGS accession GCA_038442765.153. Files of the gene structure annotation, repeat predictions and gene functional annotation were deposited at Figshare54.

Technical Validation

Genome assembly and gene prediction quality assessment

The accuracy and integrity of the AJ8 T2T assembly were assessed through several analyses. First, the Hi-C heatmap showed a consistent interaction pattern across all chromosomes, indicating correct ordering and orientation of contigs in the assembly (Fig. 2a). Second, the assembly captured all 18 centromeres and all 36 telomeres, providing strong evidence for its integrity (Fig. 2b; Tables 7, 8). Third, the assembly showed high collinearity with Jinshu7 (GCA_026771285.1)7 and Panicum hallii (GCA_002211085.2)55 (Fig. 2c). Fourth, alignment with minimap2 (v 2.24-r1122)56 showed that 100.0% of ONT reads and 99.98% of HiFi reads could be aligned to the AJ8 T2T assembly. Additionally, the average genome alignment rate of the transcriptome was 98.7% (Table 3). Lastly, the AJ8 T2T assembly had an LTR assembly index of 20.4, a quality value of 61.7, and a BUSCO score of 99.6%, indicating its high quality and completeness (Table 1).

Fig. 2

Quality assessment of the AJ8 genome. (a) Heatmap displaying Hi-C interactions of AJ8 pseudomolecules. (b) Telomere detection map. Triangles and circles represent telomeres and centromeres, respectively, within the AJ8 assembled chromosomes. Orange represents regions with high gene density, while dark sky blue represents regions with low gene density. (c) Synteny analysis of Panicum hallii, AJ8 and Jinshu7.

Table 7 The identified telomeres in AJ8 assembly.
Table 8 The distribution of centromeres in AJ8 assembly.

We compared the length distribution of genes among AJ8, maize12, sorghum (GCF_000003195.3)39, LM_18, and Jinshu77 and found similar patterns (Fig. 3a). The BUSCO analysis showed that 99.5% (single-copy: 19.5%; duplicated: 80.0%) of 1,614 embryophyta single-copy orthologs were identified as complete, while 0.2% were fragmented and 0.3% were missing (Fig. 3b). A total of 62,885 (98.8%) gene models were annotated in diverse databases, and 51,958 gene models (81.6%) exhibited detectable transcriptional activity (FPKM ≥ 1) across 21 RNA-seq samples (Fig. 3c; Table 5; Fig. 3d). Taken together, these results provide strong evidence that a high-quality AJ8 genome has been obtained. This genome provides a solid foundation for uncovering the drought resistance and adaptive mechanisms of AJ8, and also serves as an important reference for the rapid breeding of AJ8 and other crops.

Fig. 3

Gene prediction quality assessment. (a) Gene length distribution in the AJ8 genome compared with the genomes of other species. (b) BUSCO assessment of the AJ8 gene set. (c) Venn diagram showing the number of genes with homology or functional classification by each method. (d) Expression heatmap showing expression levels among 21 RNA-seq samples. The color bar in the lower right corner represents log2-transformed FPKM values. Blue and red boxes indicate genes with lower and higher expression levels, respectively.

Fig. 4

Phased subgenomes of the AJ8 genome. (a) Histogram of differential k-mers among homoeologous chromosome sets. (b) Heatmap and clustering of differential k-mers. The x-axis shows differential k-mers and the y-axis shows chromosomes. The vertical color bar indicates the subgenome to which each chromosome is assigned; the horizontal color bar indicates the subgenome to which each k-mer is specific (blank for non-specific k-mers). (c) Principal component analysis of differential k-mers. Points indicate chromosomes.