The phased telomere-to-telomere reference genome of Musa acuminata, a main contributor to banana cultivars

Liu, Xin; Arshad, Rida; Wang, Xu; Li, Wei-Ming; Zhou, Yongfeng; Ge, Xue-Jun; Huang, Hui-Run

doi:10.1038/s41597-023-02546-9

Data Descriptor
Open access
Published: 16 September 2023

Volume 10, article number 631, (2023)

Scientific Data

Xin Liu^1,2,3,na1 (ORCID: orcid.org/0000-0003-4175-9358),
Rida Arshad^4,na1,
Xu Wang^4,
Wei-Ming Li^5,
Yongfeng Zhou^4,6 (ORCID: orcid.org/0000-0003-0780-2973),
Xue-Jun Ge^1,2 &
Hui-Run Huang^1,2 (ORCID: orcid.org/0000-0002-4656-5627)

Abstract

Musa acuminata is a main wild contributor to banana cultivars. Here, we report a haplotype-resolved and telomere-to-telomere reference genome of M. acuminata, assembled by incorporating PacBio HiFi reads, Nanopore ultra-long reads, and Hi-C data. The genome sizes of the two haploid assemblies were 469.83 Mb and 470.21 Mb, respectively. Multiple assessments confirmed the contiguity (contig N50: 16.53 Mb and 18.58 Mb; LAI: 20.18 and 19.48), completeness (BUSCOs: 98.57% and 98.57%), and correctness (QV: 45.97 and 46.12) of the genome. Repetitive sequences accounted for about half of the genome. In total, 40,889 and 38,269 protein-coding genes were annotated in the two haploid assemblies, respectively, of which 9.56% and 3.37% were newly predicted. Genome comparison identified a large reciprocal translocation involving 3 Mb and 10 Mb segments of chromosomes 01 and 04 within M. acuminata. This reference genome of M. acuminata provides a valuable resource for further understanding the subgenome evolution of Musa species and for the precise genetic improvement of banana.

Background & Summary

The wild relatives of domesticated crops, i.e. crop wild relatives (CWRs), generally possess genetic diversity helpful in developing more productive and resilient crop varieties, thereby providing a wide practical gene pool for genetic improvement of crops¹. In order to address the challenges and threats posed by emerging diseases and climate change, CWRs appear to be a source for solutions to manage both biotic and abiotic stresses^2,3. At present, combining huge sequence information and precise gene-editing tools provides a route to transform CWRs into ideal crops². Therefore, a high-quality reference genome of CWR germplasm is an important prerequisite for efficiently introducing potential useful genes into breeding programmes. Thanks to the advances in sequencing technologies and analytical tools, many high-quality reference genomes for crops as well as their important wild relatives have been generated. These genetic resources will thus facilitate the identification of structural variants and incorporation of the variants from CWRs into crop gene pools.

Banana domestication started at least 7000 years ago in Southeast Asia⁴. Hybridization between various species and subspecies of the genus Musa led to the development of modern, high-yielding bananas⁵. To date, most banana cultivars have been derived from Musa acuminata (A genome), a complex of subspecies geographically segregated in distinct Southeast Asian continental regions and islands⁶. Four M. acuminata subspecies, banksii, burmannica, malaccensis, and zebrina, have been proposed as the main contributors to edible banana cultivars⁴. Several large structural variants in these subspecies have been identified and suggested to be associated with the domestication of banana^7,8,9,10,11. Genome research first started in the subspecies malaccensis. The first draft genome of M. acuminata ssp. malaccensis was assembled by incorporating Sanger and Roche/454 reads, with sequence errors corrected using Illumina data¹². This assembly was anchored to the Musa linkage groups of a genetic map built with SSR and DArT markers. A double-haploid genotype (DH-Pahang) was used in that study to reduce genome complexity and facilitate the assembly process. Recently, a telomere-to-telomere (T2T) reference genome of DH-Pahang was constructed using Nanopore data and polished with Nanopore and Illumina reads, significantly improving contiguity¹³. Although a DH genotype may miss some important genetic information, these genome resources have significantly facilitated studies of banana domestication and genome evolution. With advances in sequencing technologies and bioinformatics software, heterozygosity is no longer a persistent hurdle. An increasing number of haplotype-resolved and T2T genomes have now been published, such as those of lychee¹⁴ and apple¹⁵, providing unprecedented insights into subgenome evolution and domestication history.

In this study, we assembled a haplotype-resolved and telomere-to-telomere reference genome of M. acuminata ssp. malaccensis by incorporating PacBio HiFi reads, Nanopore ultra-long reads, and high-throughput chromatin conformation capture (Hi-C) paired-end reads. An unphased reference genome was first constructed and used to guide haplotype-resolved scaffolding (Fig. 1). Multiple assessment methods were applied to evaluate the quality of the haplotype-resolved assembly. A comprehensive genome comparison between this assembly and the previous reference of the DH genotype identified a large reciprocal translocation involving 3 Mb and 10 Mb segments of chromosomes 01 and 04. Furthermore, the 3-Mb segment (34,734,628 to 37,810,715 bp in chromosome 04) was suggested to be associated with flower development pathways, such as anther/stamen development. The haplotype-resolved genome of M. acuminata will help to obtain a better understanding of potential structural variants, allele-specific expression, and subgenome evolution of Musa species, and will serve as a reliable reference for banana breeding programmes.

Methods

Sample collection and sequencing

The M. acuminata sample used for DNA and RNA extraction was obtained from the South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, China. Tissues were immediately frozen in liquid nitrogen and preserved at −80 °C for DNA/RNA extraction. The CTAB method was used to extract high-quality genomic DNA from leaf tissue samples.

A standard SMRTbell library was constructed using the SMRTbell Express Template Prep Kit 2.0 according to the manufacturer’s recommendations (Pacific Biosciences, CA, USA) and sequenced on a PacBio Sequel II platform. This yielded 32.39 Gb of HiFi data, corresponding to ~65× coverage of the haploid genome. The N50 length of the HiFi reads was 17.32 kb. A Nanopore library was constructed with the Oxford Nanopore SQK-LSK109 kit following the manufacturer’s instructions and sequenced on a PromethION platform. In total, 20.80 Gb of ONT data were obtained, corresponding to ~42× coverage of the haploid genome. The N50 length of the ONT reads was 86.86 kb. A Hi-C library was constructed from cross-linked genomic DNA and sequenced on an Illumina NovaSeq platform (Illumina, San Diego, CA, USA). In total, 134 Gb of Hi-C data were obtained, corresponding to ~268× coverage of the haploid genome. In addition, 15.58 Gb of short-read (NGS) data were obtained using the Illumina NovaSeq platform, corresponding to ~31× coverage of the haploid genome (Table 1).
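The coverage figures above follow from dividing each data yield by the haploid genome size. The sketch below reproduces them under one assumption: a haploid genome size of 500 Mb. That divisor is inferred from the reported numbers and is not stated in this paragraph; the names used are illustrative.

```python
# Sequencing depth = total sequenced bases / haploid genome size.
# ASSUMPTION: a 500 Mb haploid genome size; the text does not state the divisor used.
ASSUMED_HAPLOID_GENOME_BP = 500e6

# Data yields in bases, as reported in the text.
YIELDS_BP = {
    "PacBio HiFi": 32.39e9,
    "ONT ultra-long": 20.80e9,
    "Hi-C": 134e9,
    "Illumina": 15.58e9,
}


def fold_coverage(total_bases: float,
                  genome_size: float = ASSUMED_HAPLOID_GENOME_BP) -> float:
    """Return the fold coverage implied by a data yield and a genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


if __name__ == "__main__":
    for name, bases in YIELDS_BP.items():
        print(f"{name}: ~{fold_coverage(bases):.0f}x")
```

Swapping in a different genome size (for example the k-mer or flow-cytometry estimate) changes every coverage value proportionally.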

Table 1 Summary of sequencing data of Musa acuminata ssp. malaccensis for haplotype-resolved and telomere-to-telomere assembly and genome annotation.

Full size table

Additionally, total RNA was extracted from four tissues (root, leaf, flower, and fruit), and sequencing libraries were prepared using the NEBNext® Ultra™ II Directional RNA Library Prep Kit for Illumina® (New England Biolabs, MA, USA). Paired-end 150-bp reads were generated on the Illumina NovaSeq platform, yielding a total of 26.90 Gb of raw RNAseq data (Table 1). All sequencing was carried out at Anhui Double Helix Gene Technology Co., Ltd. (Anhui, China).

Genome size and heterozygosity estimation

CCS software (https://github.com/PacificBiosciences/ccs) with default parameters was used to generate the consensus (HiFi) reads. Based on these highly accurate HiFi reads, the k-mer distribution was analysed with jellyfish¹⁶ using ‘jellyfish count -C -m 21 -s 100000000’ and ‘jellyfish histo -h 1000000’. The results were subsequently imported into GenomeScope v2.0¹⁷ with k-mer length = 21 and ploidy = 2. The genome size of M. acuminata was estimated to be 450.43 Mb with 21-mers, about 14% smaller than the DH-Pahang genome size (523.00 Mb) estimated by flow cytometry¹². The heterozygosity rate was estimated to be 0.59% (Fig. 2).
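To illustrate what the k-mer step computes, the following sketch counts canonical k-mers (the behaviour selected by the -C option of jellyfish count), builds the multiplicity histogram, and derives a naive genome size from the histogram peak. It is a simplified stand-in under stated assumptions: GenomeScope fits a statistical model to the histogram, whereas the peak-based estimate and the min_depth cutoff here are illustrative choices only.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def kmer_counts(reads, k: int) -> Counter:
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def histogram(counts: Counter) -> Counter:
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def naive_genome_size(hist, min_depth: int = 2) -> float:
    """Total k-mer occurrences divided by the peak depth.

    ASSUMPTION: k-mers below min_depth are treated as sequencing errors.
    This is not the model that GenomeScope fits.
    """
    solid = {depth: n for depth, n in hist.items() if depth >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(solid, key=solid.get)
    total_occurrences = sum(depth * n for depth, n in solid.items())
    return total_occurrences / peak_depth
```

In practice the histogram would come from the jellyfish histo output rather than from an in-memory counter, which does not scale to a full read set.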

De novo haplotype-resolved genome assembly

Fastp v0.23.2¹⁸ was used to filter the Hi-C reads with default parameters. Subsequently, hifiasm v0.16.1-r375¹⁹ was used to generate the primary unphased draft genome from the HiFi and Hi-C reads. This produced a 491.54 Mb draft genome with an N50 of 26.62 Mb, in which only 20 contigs accounted for 90% of the genome length (Table 2). Then, ragtag v2.1.0²⁰ with default parameters was used to order, orient, and cluster the primary contigs, guided by the T2T version of the M. acuminata ssp. malaccensis DH-Pahang genome¹³ (hereafter MAv4). In parallel, the primary contigs were anchored into 11 pseudo-chromosomes using Juicer v1.6²¹ followed by 3D-DNA v180922²². Based on the assembly file obtained from ragtag and the hic file from Juicer and 3D-DNA, Juicebox v2.20.00²³ was used to visualize the Hi-C data and to correct the assembly manually. After this step, only 17 gaps remained in the reference genome. For gap filling, an ONT assembly was constructed with NextDenovo (https://github.com/Nextomics/NextDenovo) with read-cutoff = 1k and genome_size = 500 M. This draft ONT assembly was then polished with NextPolish²⁴ using the HiFi and Illumina reads with default parameters. Subsequently, minimap2 v2.24-r1122²⁵ with default parameters was used to map the polished ONT assembly to the primary reference genome. We examined the breakpoints with the Integrative Genomics Viewer (IGV)²⁶ and manually filled the gaps based on the alignment results. After all remaining gaps had been filled with the ONT assembly, a high-quality reference genome, named MA, was obtained. This unphased assembly is 471.04 Mb in size, with an anchoring rate of 95.83%. The Hi-C heatmap confirmed the contiguity of the assembly (Supplementary Figure S1).
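The contiguity statistics quoted above (N50, and the number of contigs covering 90% of the assembly) can be computed from a list of contig lengths. The sketch below shows the standard definitions; the function names are illustrative, and the toy lengths in the usage note are not taken from the assembly.

```python
from typing import Iterable


def nx(lengths: Iterable[int], fraction: float = 0.5) -> int:
    """Nx: length of the contig at which the cumulative length, summed from
    the longest contig downwards, first reaches `fraction` of the total."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]


def lx(lengths: Iterable[int], fraction: float = 0.9) -> int:
    """Lx: number of contigs (longest first) needed to reach `fraction`
    of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return count
    return len(ordered)
```

For example, with toy contig lengths of 10, 8, 6, 4 and 2, nx(..., 0.5) gives the N50 and lx(..., 0.9) gives the number of contigs covering 90% of the total.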

Table 2 Summary of genome assembly of Musa acuminata ssp. malaccensis genome.

To obtain a haplotype-resolved genome, a similar pipeline was applied (Fig. 1). Two primary haploid assemblies were first generated by hifiasm. Assembly statistics were computed with QUAST²⁷ using default parameters. The cumulative lengths of the two haploid assemblies were 500.78 Mb and 484.36 Mb, with N50 values of 16.53 Mb and 18.58 Mb, respectively (Table 2). After Hi-C scaffolding, 469.83 Mb and 470.21 Mb were anchored to 11 chromosomes, corresponding to anchoring rates of 93.82% and 97.08%, respectively (Table 3). The two haploid assemblies were slightly larger than MAv4 (468.82 Mb)¹³ and represented approximately 90% of the DH-Pahang genome size (523.00 Mb) estimated by flow cytometry¹². All 66 gaps in the two haploid assemblies were filled. In this way, the haplotype-resolved and telomere-to-telomere reference genome of M. acuminata was obtained, and the two haploid assemblies were named MAH1 and MAH2. Circos²⁸ was used to draw the genome features shown in Fig. 3. The Hi-C heatmap confirmed this assembly as a complete and reliable haplotype-resolved reference genome (Fig. 4).
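The anchoring rates can be checked as the anchored length divided by the length of the corresponding primary assembly. The sketch below does this for the three assemblies; pairing each anchored length with its primary assembly length is an inference from the reported numbers, not a formula given in the text.

```python
# (anchored length in Mb, primary assembly length in Mb), as reported in the text.
# ASSUMPTION: the anchoring rate is anchored length / primary assembly length.
ASSEMBLIES_MB = {
    "MA (unphased)": (471.04, 491.54),
    "MAH1": (469.83, 500.78),
    "MAH2": (470.21, 484.36),
}


def anchoring_rate(anchored_mb: float, total_mb: float) -> float:
    """Percentage of the primary assembly placed on pseudo-chromosomes."""
    if total_mb <= 0:
        raise ValueError("total_mb must be positive")
    return 100.0 * anchored_mb / total_mb


if __name__ == "__main__":
    for name, (anchored, total) in ASSEMBLIES_MB.items():
        print(f"{name}: {anchoring_rate(anchored, total):.2f}%")
```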

Table 3 The lengths of the pseudo-chromosomes of Musa acuminata ssp. malaccensis genomes.

Genome quality assessment

Multiple methods were combined to evaluate the quality of the genome assembly. First, the HiFi, Illumina, and RNAseq reads were aligned to the phased genome using minimap2 v2.24-r1122, BWA v0.7.17-r1188²⁹, and HISAT2 v2.2.1³⁰, respectively, with default parameters. BamTools v2.5.1³¹ was used to calculate the read mapping rates. The results showed a HiFi coverage rate of 99.86% and 99.87% for the MAH1 and MAH2 assemblies, respectively. The mapping rate of the Illumina reads reached 99.98% in both haploid assemblies. The mapping rate of the RNAseq reads ranged from 92.44% to 97.34% (Table 4). Second, the LTR Assembly Index (LAI) calculated with LTR_retriever v2.9.0³² was used to assess assembly quality. The LAI values of MAH1 and MAH2 were 20.18 and 19.48, respectively, indicating that our phased assembly reached the gold-standard reference level. Third, the completeness of the haplotype-resolved genome was evaluated with BUSCO v5.4.3³³ against the ‘embryophyta_odb10’ database. In total, 98.57% (1,591 of 1,614) of the BUSCO genes were identified as complete (Table 5). Finally, the consensus quality value (QV) of the genome was assessed with Merqury v1.3³⁴ using k-mers counted by meryl (k = 19), giving QVs of 45.97 and 46.12 (genome accuracy >99.99%) for MAH1 and MAH2, respectively (Table 6, Supplementary Figure S2).
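The link between the reported QVs and the stated accuracy follows from the Phred scale, where QV = −10·log10(error rate). A minimal sketch of that conversion, with illustrative function names:

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error rate."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error rate to a Phred-scaled quality value."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


def qv_to_accuracy_percent(qv: float) -> float:
    """Consensus accuracy (in percent) implied by a quality value."""
    return 100 * (1 - qv_to_error_rate(qv))


if __name__ == "__main__":
    for name, qv in {"MAH1": 45.97, "MAH2": 46.12}.items():
        print(f"{name}: QV {qv} -> accuracy {qv_to_accuracy_percent(qv):.4f}%")
```

A QV of 40 corresponds to one error per 10,000 bases (99.99% accuracy), so both reported values exceed that threshold.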

Table 4 Assessment of genome quality based on mapping with RNAseq reads.

Table 5 BUSCO results of MAH1 (C: 98.57%) and MAH2 (C: 98.57%).

Table 6 The consensus quality values of MAH1 and MAH2.

Repeat and gene annotation

The Extensive de novo TE Annotator (EDTA)³⁵ was used to screen and classify repeat elements. Briefly, a de novo repeat library constructed by RepeatModeler v2.0.1³⁶ was imported into RepeatMasker v4.1.1 (http://repeatmasker.org/) to predict repeats. Repbase³⁷ was then used in RepeatMasker for homology-based repeat prediction. In total, we identified 235.46 Mb (50.11%) and 234.61 Mb (49.90%) of repetitive sequences in MAH1 and MAH2, respectively. Among these, long terminal repeat (LTR) elements, which accounted for 36.61% of MAH1 and 34.19% of MAH2, were the most abundant repeat elements (Supplementary Table S1). These results are comparable with those reported for the previous T2T DH genome version (repeat elements: 52.62%; LTR: 34.85%)¹³.

The standard MAKER3 v3.01.03³⁸ pipeline was used to annotate genes. All high-confidence protein sequences in the Swiss-Prot³⁹ database were used for homology-based prediction. Transcripts from the four tissues (root, leaf, flower, and fruit) were used for gene prediction. AUGUSTUS v3.3.2 and SNAP v20131129 were then used to train the ab initio gene models. Finally, the MAKER3 pipeline was run again to obtain high-quality gene annotations. Functional characterization of the predicted coding genes was performed using eggNOG-mapper v2⁴⁰ based on the eggNOG v5.0 database⁴¹. A total of 40,889 and 38,269 protein-coding genes were annotated in MAH1 and MAH2, respectively. The total lengths of the protein-coding genes were 148.54 Mb and 144.95 Mb, and the average gene lengths were 3.63 kb and 3.79 kb, respectively. Based on the eggNOG-mapper results, 59,143 (74.72%) genes were functionally annotated (Table 7). In addition, the BUSCO completeness scores of the protein-coding genes in MAH1 and MAH2 were 89.41% and 90.27%, respectively (Table 8).
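The summary statistics in this paragraph are internally consistent, as the short check below shows: the average gene length is the total gene length divided by the gene count, and the functionally annotated fraction is taken over the genes of both haplotypes combined. Taking the sum of both haplotypes as the denominator is an inference from the numbers, not something the text states.

```python
# Values as reported in the text.
GENE_COUNTS = {"MAH1": 40_889, "MAH2": 38_269}
GENE_TOTAL_MB = {"MAH1": 148.54, "MAH2": 144.95}
FUNCTIONALLY_ANNOTATED = 59_143


def mean_gene_length_kb(total_mb: float, n_genes: int) -> float:
    """Average gene length in kb from a total length in Mb and a gene count."""
    if n_genes <= 0:
        raise ValueError("n_genes must be positive")
    return total_mb * 1000 / n_genes


def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


if __name__ == "__main__":
    for hap, n_genes in GENE_COUNTS.items():
        length = mean_gene_length_kb(GENE_TOTAL_MB[hap], n_genes)
        print(f"{hap}: mean gene length {length:.2f} kb")
    # ASSUMPTION: the 74.72% refers to genes of both haplotypes combined.
    total_genes = sum(GENE_COUNTS.values())
    print(f"annotated: {percent(FUNCTIONALLY_ANNOTATED, total_genes):.2f}%")
```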

Table 7 Statistics of protein-coding genes in MAH1 and MAH2.

Table 8 Summary of BUSCO analysis of protein-coding genes in MAH1 (C: 89.41%) and MAH2 (C: 90.27%).

Identification of telomeres and centromeres

TIDK v0.2.1 (https://github.com/tolkit/telomeric-identifier) was used to find telomeres. In total, 36 telomeres were found (Table 9). Plant centromeric regions are generally characterized by short tandem repeats that are highly enriched in these regions⁴², accompanied by a collapse in the density of LTR elements such as Copia. Centromeric regions can be located by identifying these distinctive features. We predicted centromeric regions according to the workflow of Shi et al.⁴³, which employs this approach. Using Tandem Repeats Finder v4.09⁴⁴ with the command ‘trf genomes.fa 2 7 7 80 10 50 500 -f -d -m’, we selected tandem repeat units of 183 bp, 148 bp, 124 bp, 125 bp, and 191 bp as candidates based on the sorted results and inspection in IGV (Supplementary Table S2, Supplementary Figure S3). The centromeric regions were defined according to the density of the 183 bp tandem repeat unit, which was the most highly enriched centromeric repeat unit. In this way, all centromeric regions were successfully identified (Table 10, Supplementary Figure S3).
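As a rough illustration of telomere detection, the sketch below counts a telomeric repeat and its reverse complement in windows at both chromosome ends. It is not TIDK's implementation. The motif TTTAGGG (the Arabidopsis-type plant telomeric repeat), the window size, and the min_repeats threshold are all assumptions made for this example; the text does not state which motif or thresholds were used.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_motif(seq: str, motif: str) -> int:
    """Count non-overlapping occurrences of a motif on either strand."""
    seq = seq.upper()
    return seq.count(motif) + seq.count(reverse_complement(motif))


def telomere_ends(chrom_seq: str,
                  motif: str = "TTTAGGG",     # ASSUMED motif, not from the text
                  window: int = 10_000,       # hypothetical window size
                  min_repeats: int = 50):     # hypothetical threshold
    """Return (start_has_telomere, end_has_telomere) for one chromosome."""
    start_hits = count_motif(chrom_seq[:window], motif)
    end_hits = count_motif(chrom_seq[-window:], motif)
    return start_hits >= min_repeats, end_hits >= min_repeats
```

Applied to each chromosome of both haplotypes, a check of this kind would report up to 44 telomeres for 11 chromosome pairs, which gives context for the 36 reported above.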

Table 9 Summary of telomere information of Musa acuminata ssp. malaccensis genome.

Table 10 Summary of centromere information of Musa acuminata ssp. malaccensis genome.

Characterization of a reciprocal translocation in Musa acuminata

Nucmer v4.0.0rc1⁴⁵ with default parameters was used to obtain the syntenic relationship between MAH1 and MAH2. The alignments were then filtered with delta-filter using the parameters ‘-i 90 -l 15000’. In the same way, our haplotype-resolved assembly was aligned against MAv4 using nucmer. The mummerplot command was used to generate the dot plots (Supplementary Figure S4). SyRI v1.6.3⁴⁶ with default parameters was used to identify structural variants between MAH1 and MAH2 (Fig. 5). Overall, 47 translocations with a cumulative size of 2.70 Mb (~0.57%), 23 inversions with a cumulative size of 11.30 Mb (~2.40%), and 53 duplications with a cumulative size of 1.33 Mb (~0.28%) were identified. These structural variants were generally heterozygous, representing more complete genetic information than the double-haploid MAv4 genome.
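The delta-filter options used above keep only alignments with at least 90% identity (-i 90) and a length of at least 15,000 bp (-l 15000). The sketch below applies the same two thresholds to a simple in-memory list; the Alignment class is a hypothetical stand-in for parsed alignment records, not a MUMmer data structure.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Alignment:
    """Hypothetical minimal alignment record (1-based, inclusive coordinates)."""
    ref_start: int
    ref_end: int
    identity: float  # percent identity, 0-100

    @property
    def length(self) -> int:
        return abs(self.ref_end - self.ref_start) + 1


def filter_alignments(alignments: Iterable[Alignment],
                      min_identity: float = 90.0,
                      min_length: int = 15_000) -> List[Alignment]:
    """Keep alignments meeting both the identity and length thresholds."""
    return [
        aln for aln in alignments
        if aln.identity >= min_identity and aln.length >= min_length
    ]
```

Short or divergent alignments, which are often repeat-induced, are removed before dot plots are drawn and structural variants are called.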

MCscan tools⁴⁷ were used to search for syntenic relationships between the two haploid assemblies and MAv4 at the gene level. Briefly, the ‘jcvi.compara.catalog’ module with ‘--cscore=0.99’ and the ‘jcvi.compara.synteny’ module with ‘--minspan=30’ were used to build the syntenic regions; syntenic relationships were then visualized with the ‘jcvi.graphics.karyotype’ module. In addition, potential structural variants and heterozygous regions are shown in Supplementary Figure S5. A reciprocal translocation involving 3 Mb and 10 Mb segments of chromosomes 01 and 04 was identified (Fig. 6a). These reciprocally translocated gene blocks were located in the translocated regions identified in the whole-genome alignment results (Supplementary Figure S4C,D). The 10-Mb segment from 261,650 to 10,745,936 bp in chromosome 01 of MAH1 was linked to 44,882,868 to 34,419,170 bp in chromosome 04 of MAv4 (Supplementary Figure S5). The 3-Mb segment from 34,734,628 to 37,810,715 bp in chromosome 04 of MAH1 was linked to 122,362 to 3,101,126 bp in chromosome 01 of MAv4. The reciprocal translocation between MAH2 and MAv4 was located in similar genomic regions. The large difference in the lengths of chromosomes 01 and 04 between MAH1/2 and MAv4 also resulted from this reciprocal translocation, whereas the other chromosome lengths and the total genome lengths were comparable (Table 3).
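The coordinates above can be turned into segment sizes and relative orientations with a few lines of arithmetic. The sketch below assumes 1-based inclusive coordinates and reads a descending coordinate pair as a block in reverse orientation; both are assumptions of this example rather than statements from the text.

```python
from typing import NamedTuple


class Block(NamedTuple):
    chrom: str
    start: int
    end: int


def span(block: Block) -> int:
    """Block length in bp, assuming 1-based inclusive coordinates."""
    return abs(block.end - block.start) + 1


def relative_orientation(a: Block, b: Block) -> str:
    """Return '+' if both blocks run in the same direction, otherwise '-'."""
    same_direction = (a.end >= a.start) == (b.end >= b.start)
    return "+" if same_direction else "-"


# Coordinates as reported in the text (MAH1 block, linked MAv4 block).
PAIRS = [
    (Block("MAH1_chr01", 261_650, 10_745_936),
     Block("MAv4_chr04", 44_882_868, 34_419_170)),
    (Block("MAH1_chr04", 34_734_628, 37_810_715),
     Block("MAv4_chr01", 122_362, 3_101_126)),
]

if __name__ == "__main__":
    for query, target in PAIRS:
        print(f"{query.chrom} {span(query) / 1e6:.2f} Mb <-> "
              f"{target.chrom} {span(target) / 1e6:.2f} Mb "
              f"({relative_orientation(query, target)})")
```

Under these assumptions the two segments measure about 10.5 Mb and 3.1 Mb in MAH1, matching the rounded sizes used in the text.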

We further performed GO enrichment analysis on the genes located in the translocated regions using TBtools v1.108⁴⁸. The genes in the 10-Mb segment of MAH1 were not significantly enriched in any biological process, whereas those in the 3-Mb segment were enriched in several pathways associated with flower development (Supplementary Table S3), including anther development (GO:0048653), stamen development (GO:0048443), regulation of flower development (GO:0009909), and floral whorl development (GO:0048438). For further validation, we used the nucleotide BLAST tool at the National Center for Biotechnology Information (NCBI) with default parameters to align the identified genes against the non-redundant database, and checked the gene functions manually.
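GO enrichment is commonly assessed with a one-sided hypergeometric test, which asks how likely it is to see at least the observed number of genes with a given GO term in a gene set drawn from the whole annotated genome. The sketch below implements that test as an illustration only; the text does not state which test or multiple-testing correction TBtools applied, and the parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, hits: int) -> float:
    """One-sided hypergeometric p-value P(X >= hits).

    population: all genes considered as background
    annotated:  background genes carrying the GO term
    sample:     genes in the region of interest
    hits:       genes in the region carrying the GO term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie within the population")
    if not 0 <= hits <= min(annotated, sample):
        raise ValueError("hits must lie between 0 and min(annotated, sample)")
    total_draws = comb(population, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, min(annotated, sample) + 1)
    )
    return tail / total_draws
```

When many GO terms are tested, the resulting p-values would normally be adjusted for multiple testing before a term is called enriched.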

Data Records

All raw sequencing reads have been deposited in the National Center for Biotechnology Information (NCBI) under BioProject ID PRJNA962682 and in the National Genomics Data Center (NGDC) under BioProject ID PRJCA018611. The PacBio HiFi, Nanopore, Hi-C, and Illumina sequencing reads have been deposited in the NCBI Sequence Read Archive under accession number SRP435127⁴⁹. The genome assembly is available from GenBank under accession number GCA_030219345.1⁵⁰. The genome annotation files have been deposited in the open-access repository Figshare⁵¹, together with the high-quality reference genome that we constructed for guidance.

Technical Validation

Manual correction for chromosome scaffolding

To construct a high-quality reference genome, we used Juicebox to manually correct the assembly based on the Hi-C alignments (Supplementary Figure S1). As a result, 471.04 Mb (95.83%) of contigs were anchored to 11 pseudo-chromosomes. We then oriented, ordered, and grouped the contigs of our haplotype-resolved genome based on this high-quality reference genome, and again used Juicebox to manually correct the haplotype-resolved genome based on the Hi-C alignments. In total, 469.83 Mb (93.82%) and 470.21 Mb (97.08%) of contigs were anchored to the 11 chromosome pairs, respectively. We further examined the Hi-C alignments of chromosomes 01 and 04 in Juicebox (Fig. 6b) and confirmed that both chromosomes were accurately assembled. In addition, chromosome 01 consists of only one contig (Table 3), further confirming its high contiguity.

Targeted PCR confirmed the reciprocal translocation between Chr01 and Chr04

Based on the syntenic analysis between our assembly and MAv4, we identified a large reciprocal translocation between chromosomes 01 and 04, corresponding to the translocation found in a previous study⁹. In that study, three pairs of primers were designed to amplify the breakpoints located along the reference and hypothesized chromosome structures, thereby showing the presence of chromosomes 01, 04, and 1T4 resulting from the translocation. Here, we used the same primer pairs to perform targeted PCR to validate the chromosome structures found in our sample (Fig. 6c). DNA was extracted from leaf tissue of M. acuminata ssp. malaccensis. PCR was performed in 50-μL volumes containing 2.5 ng of gDNA, 1 μL of specific primers, 32 μL of distilled, deionized water, and 0.5 μL of TaKaRa LA Taq® (Vazyme) using an Eastwin Life Science EDC810 PCR amplification system. The thermal cycling conditions were 94 °C for 5 min, followed by 35 cycles of 94 °C for 45 s, 56 °C for 45 s, and 72 °C for 60 s. PCR products were then visualized by 2% agarose gel electrophoresis with a 100 bp DNA ladder. Only the breakpoint of chromosome 1T4 was amplified in our sample, suggesting that the reciprocal translocation involving the 3 Mb and 10 Mb segments of chromosomes 01 and 04 is present in both haploid genomes of the M. acuminata sample (Fig. 6c). This finding is consistent with our whole-genome alignment results at both the sequence and gene levels.

Code availability

No custom code was used in this study. All software mentioned in the Methods is publicly available. Where no detailed parameters are given for a piece of software, default parameters were used as suggested by the developers.

References

Brozynska, M., Furtado, A. & Henry, R. J. Genomics of crop wild relatives: expanding the gene pool for crop improvement. Plant Biotechnol. J. 14, 1070–1085 (2016).
PubMed CAS Google Scholar
Bohra, A. et al. Reap the crop wild relatives for breeding future crops. Trends Biotechnol. 40, 412–431 (2022).
PubMed CAS Google Scholar
Castaneda-Alvarez, N. P. et al. Global conservation priorities for crop wild relatives. Nat. Plants 2, 16022 (2016).
PubMed Google Scholar
Perrier, X. et al. Multidisciplinary perspectives on banana (Musa spp.) domestication. Proc. Natl. Acad. Sci. USA 108, 11311–11318 (2011).
PubMed PubMed Central ADS CAS Google Scholar
Davey, M. W. et al. A draft Musa balbisiana genome sequence for molecular genetics in polyploid, inter- and intra-specific Musa hybrids. BMC Genom. 14, 683 (2013).
CAS Google Scholar
Perrier, X. et al. Combining biological approaches to shed light on the evolution of edible bananas. Ethnobot. Res. App. 7, 199–216 (2009).
Google Scholar
Shepherd K. Cytogenetics Of The Genus Musa (International Network for the Improvement of Banana and Plantain, 1999).
Hippolyte, I. et al. A saturated SSR/DarT linkage map of Musa acuminata addressing genome rearrangements among bananas. BMC Plant Biol. 10, 65 (2010).
PubMed PubMed Central Google Scholar
Martin, G. et al. Evolution of the banana genome (Musa acuminata) is impacted by large chromosomal translocations. Mol. Biol. Evol. 34, 2140–2152 (2017).
PubMed PubMed Central CAS Google Scholar
Dupouy, M. et al. Two large reciprocal translocations characterized in the disease resistance-rich burmannica genetic group of Musa acuminata. Ann. Bot. 124, 319–329 (2019).
PubMed PubMed Central CAS Google Scholar
Martin, G. et al. Chromosome reciprocal translocations have accompanied subspecies evolution in bananas. Plant J. 104, 1698–1711 (2020).
PubMed PubMed Central CAS Google Scholar
D’Hont, A. et al. The banana (Musa acuminata) genome and the evolution of monocotyledonous plants. Nature 488, 213–217 (2012).
PubMed ADS Google Scholar
Belser, C. et al. Telomere-to-telomere gapless chromosomes of banana using nanopore sequencing. Commun. Biol. 4, 1047 (2021).
PubMed PubMed Central CAS Google Scholar
Hu, G. et al. Two divergent haplotypes from a highly heterozygous lychee genome suggest independent domestication events for early and late-maturing cultivars. Nat. Genet. 54, 73–83 (2022).
PubMed PubMed Central CAS Google Scholar
Sun, X. et al. Phased diploid genome assemblies and pan-genomes provide insights into the genetic history of apple domestication. Nat. Genet. 52, 1423–1432 (2020).
MathSciNet PubMed PubMed Central CAS Google Scholar
Marcais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
PubMed PubMed Central CAS Google Scholar
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11, 1432 (2020).
PubMed PubMed Central ADS CAS Google Scholar
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Article PubMed PubMed Central Google Scholar
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
PubMed PubMed Central CAS Google Scholar
Alonge, M. et al. RaGOO: fast and accurate reference-guided scaffolding of draft genomes. Genome Biol. 20, 224 (2019).
PubMed PubMed Central Google Scholar
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98 (2016).
PubMed PubMed Central CAS Google Scholar
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
PubMed PubMed Central ADS CAS Google Scholar
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst. 3, 99–101 (2016).
PubMed PubMed Central CAS Google Scholar
Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255 (2020).
PubMed CAS Google Scholar
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
PubMed PubMed Central CAS Google Scholar
Thorvaldsdottir, H., Robinson, J. T. & Mesirov, J. P. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform. 14, 178–192 (2013).
PubMed CAS Google Scholar
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
PubMed PubMed Central CAS Google Scholar
Krzywinski, M. et al. Circos: an information aesthetic for comparative genomics. Genome Res. 19, 1639–1645 (2009).
PubMed PubMed Central CAS Google Scholar
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
PubMed PubMed Central CAS Google Scholar
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
PubMed PubMed Central CAS Google Scholar
Barnett, D. W., Garrison, E. K., Quinlan, A. R., Stromberg, M. P. & Marth, G. T. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics 27, 1691–1692 (2011).
PubMed PubMed Central CAS Google Scholar
Ou, S. & Jiang, N. LTR_retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 176, 1410–1422 (2018).
PubMed CAS Google Scholar
Manni, M., Berkeley, M. R., Seppey, M., Simao, F. A. & Zdobnov, E. M. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).
PubMed PubMed Central CAS Google Scholar
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome biol. 21, 245 (2020).
PubMed PubMed Central CAS Google Scholar
Ou, S. et al. Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline. Genome Biol. 20, 275 (2019).
PubMed PubMed Central CAS Google Scholar
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. USA 117, 9451–9457 (2020).
PubMed PubMed Central ADS CAS Google Scholar
Bao, W., Kojima, K. K. & Kohany, O. Repbase Update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 11 (2015).
PubMed PubMed Central Google Scholar
Campbell, M. S., Holt, C., Moore, B. & Yandell, M. Genome annotation and curation using MAKER and MAKER-P. Curr Protoc Bioinformatics 48, 4.11.11–14.11.39 (2014).
Google Scholar
Bairoch, A. & Apweiler, R. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45–48 (2000).
PubMed PubMed Central CAS Google Scholar
Cantalapiedra, C. P., Hernandez-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).
PubMed PubMed Central CAS Google Scholar
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
PubMed CAS Google Scholar
Melters, D. P. et al. Comparative analysis of tandem repeats from hundreds of species reveals unique insights into centromere evolution. Genome Biol. 14, R10 (2013).
PubMed PubMed Central Google Scholar
Shi, X. et al. The complete reference genome for grapevine (Vitis vinifera L.) genetics and breeding. Hortic. Res. 10, uhad061 (2023).
PubMed PubMed Central Google Scholar
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
PubMed PubMed Central CAS Google Scholar
Marcais, G. et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).
PubMed PubMed Central Google Scholar
Goel, M., Sun, H., Jiao, W. B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biol. 20, 277 (2019).
Tang, H. et al. Synteny and collinearity in plant genomes. Science 320, 486–488 (2008).
Chen, C. et al. TBtools: an integrative toolkit developed for interactive analyses of big biological data. Mol. Plant 13, 1194–1202 (2020).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRP435127 (2023).
Liu, X. et al. Musa acuminata subsp. malaccensis genome assembly. GenBank https://identifiers.org/ncbi/insdc.gca:GCA_030219345.1 (2023).
Liu, X. et al. The phased telomere-to-telomere reference genome of Musa acuminata, a main contributor to banana cultivars. Figshare https://doi.org/10.6084/m9.figshare.22716271.v9 (2023).

Acknowledgements

This work was financially supported by the National Natural Science Foundation of China (Nos. 32070237 and 31261140366) and the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDB31000000).

Author information

These authors contributed equally: Xin Liu, Rida Arshad.

Authors and Affiliations

Key Laboratory of Plant Resources Conservation and Sustainable Utilization, South China Botanical Garden, Chinese Academy of Sciences, Guangzhou, 510650, China
Xin Liu, Xue-Jun Ge & Hui-Run Huang
South China National Botanical Garden, Guangzhou, 510650, China
Xin Liu, Xue-Jun Ge & Hui-Run Huang
University of Chinese Academy of Sciences, Beijing, 100049, China
Xin Liu
National Key Laboratory of Tropical Crop Breeding, Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, 518120, China
Rida Arshad, Xu Wang & Yongfeng Zhou
School of Marine Sciences and Biotechnology, Guangxi University for Nationalities, Nanning, 530008, China
Wei-Ming Li
National Key Laboratory of Tropical Crop Breeding, Tropical Crops Genetic Resources Institute, Chinese Academy of Tropical Agricultural Sciences, Haikou, 571101, China
Yongfeng Zhou

Contributions

H.R.H., X.J.G. and Y.Z. designed and supervised the research; X.L., R.A. and H.R.H. wrote the manuscript; X.L., R.A. and X.W. analysed the data; X.L. and W.M.L. collected the experimental materials. All authors contributed to manuscript revision, and all read and approved the submitted version.

Corresponding author

Correspondence to Hui-Run Huang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Liu, X., Arshad, R., Wang, X. et al. The phased telomere-to-telomere reference genome of Musa acuminata, a main contributor to banana cultivars. Sci Data 10, 631 (2023). https://doi.org/10.1038/s41597-023-02546-9

Received: 06 June 2023
Accepted: 05 September 2023
Published: 16 September 2023
DOI: https://doi.org/10.1038/s41597-023-02546-9
Springer Nature Limited

Background & Summary