Background & Summary

Persimmon (Diospyros kaki Thunb.), a member of the Ebenaceae, is an important fruit tree species that originated in East Asia and was successively introduced to Europe and America in the 18th–20th centuries1. Globally, the cultivated area of persimmon reached 1.01 million ha in 2020, with a total yield of 4.24 million tons (www.fao.org). The persimmon industry is negatively affected by labor-intensive artificial de-astringency treatment, transportation difficulties, short shelf life, and limited processing2.

To enhance the persimmon industry, superior cultivars suitable for fresh eating and processing are urgently needed. Current persimmon cultivars are generally classified as either pollination-constant non-astringent (PCNA) or non-PCNA3,4, based on the natural de-astringency capacity of fruits at the commercial maturity stage (fruits with mature peel color but not yet soft). The PCNA group comprises the Japanese-type PCNA (JPCNA) and Chinese-type PCNA (CPCNA), which are edible without any artificial de-astringency treatment and have high market value5. The non-PCNA group includes pollination-variant non-astringent (PVNA), pollination-constant astringent (PCA), and pollination-variant astringent (PVA) types1.

In the PCNA breeding program, inbreeding depression limits efficiency because of the high genetic similarity among the JPCNA cultivars6,7. Modern molecular breeding methods, including CRISPR/Cas9, are potentially effective for breeding new superior cultivars, but the molecular mechanisms regulating crucial traits must first be determined accurately before these methods can be used. The natural de-astringency capacity of JPCNA is controlled by a recessive allele at a single locus, ASTRINGENCY (AST)8,9,10. Previous studies suggested that the natural de-astringency capacity of CPCNA is controlled by dominant alleles11,12, indicating that the CPCNA type may be more effective for breeding new superior PCNA cultivars, although the precise natural de-astringency mechanism of CPCNAs remains unknown.

The most common persimmon cultivars are hexaploid (2n = 6x = 90). Owing to a lack of genomic data, the principal molecular mechanisms underlying the natural de-astringency of CPCNA and other crucial traits of hexaploid persimmon, including fruit size, shape, and flavor, remain challenging to understand. Fortunately, the published genomes of two close relatives of hexaploid persimmon, Diospyros lotus (2n = 2x = 30) and Diospyros oleifera (2n = 2x = 30), have aided the study of persimmon biology13,14,15,16. However, hexaploid and diploid persimmons are different species with discrepant genomic information. When a diploid persimmon genome is taken as the reference, the data that can be obtained on the regulatory mechanisms of some traits in hexaploid persimmon are limited, which reduces breeding efficiency. A hexaploid persimmon genome assembly is therefore urgently needed to support both basic and applied research.

The assembly of polyploid genomes is a major technical challenge hindered by repeat content, transposable elements, high heterozygosity, and gene content17. The assembly of autopolyploids with smaller genetic distances is more susceptible to the misassignment of sub-genome fragments than allopolyploids. With the advancement of sequencing and assembly technology, the autopolyploid genomes of some plants have been reported, such as Ipomoea batatas18, Saccharum spontaneum19,20, Medicago sativa21, and Solanum tuberosum22, which provide a reference for current genome assemblies.

This study uses PacBio circular consensus sequencing (CCS) and high-throughput chromosome conformation capture (Hi-C) technologies to generate an allele-aware chromosome-level genome assembly for D. kaki. The current genomic information will provide a molecular platform for future research and elaborate breeding programs.

Methods

Sampling and sequencing

‘Xiaoguotianshi’ persimmon is one of the five varieties of the CPCNA persimmon ‘Luotiantianshi’ with a good taste and higher soluble solids content than other CPCNA persimmons. The young leaves of D. kaki ‘Xiaoguotianshi’ and D. lotus (wild germplasm) were collected from the Persimmon Germplasm Resources Nursery of Research Institute of Non-timber Forestry, Chinese Academy of Forestry (Yuanyang County, Henan Province, China, 34°55′18″–34°56′27″N, 113°46′14″–113°47′35″E).

Genomic DNA was extracted from the young leaf tissue of D. kaki using a DNAsecure Plant Kit (TIANGEN, Beijing, China). Sequencing libraries with insert sizes of 350 bp were constructed using a library construction kit, following manufacturer’s instructions (Illumina, San Diego, CA, USA). The libraries were sequenced using the Illumina HiSeq X platform.

For the PacBio library, the DNA was used to construct 15-kb-insert-size SMRTbell libraries using the SMRTbell Express Template Prep Kit 2.0, following manufacturer’s instructions (PacBio, CA). Then, libraries were sequenced using PacBio Sequel II, and HiFi reads were obtained using the CCS tool (https://github.com/PacificBiosciences/ccs; v6.0.0) by setting ‘min-passes = 3, min-rq = 0.99’.
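The two CCS thresholds can be read as a simple per-read filter: a read is kept only if it was built from at least three passes and its predicted accuracy is at least 0.99, which corresponds to roughly Q20 on the Phred scale. The following is a minimal sketch of that interpretation only; the function names `rq_to_phred` and `is_hifi` are hypothetical and this is not the CCS implementation.

```python
import math


def rq_to_phred(rq: float) -> float:
    """Convert a predicted read accuracy (rq, between 0 and 1) to a Phred-scaled quality."""
    if rq >= 1.0:
        return math.inf  # a perfect predicted accuracy has no finite Phred value
    return -10.0 * math.log10(1.0 - rq)


def is_hifi(num_passes: int, rq: float,
            min_passes: int = 3, min_rq: float = 0.99) -> bool:
    """Illustrative filter mirroring the thresholds quoted in the text (hypothetical helper)."""
    return num_passes >= min_passes and rq >= min_rq


if __name__ == "__main__":
    # A predicted accuracy of 0.99 corresponds to about Q20.
    print(f"rq 0.99 is about Q{rq_to_phred(0.99):.0f}")
    # Toy reads: (number of passes, predicted accuracy); values are made up.
    for passes, rq in [(3, 0.99), (2, 0.999), (10, 0.98)]:
        print(passes, rq, is_hifi(passes, rq))
```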

For the Hi-C library, formaldehyde was used to fix the chromatin. Leaf cells were lysed, and HindIII endonuclease was used to digest the fixed chromatin. The 5′ overhangs of the DNA were filled in with biotin-labeled nucleotides, and the resulting blunt ends were ligated to each other using DNA ligase. Proteins were removed with protease to release DNA molecules from the crosslinks. The purified DNA was sheared into 350-bp fragments and ligated to adaptors23. The biotin-labeled fragments were extracted using streptavidin beads; following PCR enrichment, the libraries were sequenced on an Illumina HiSeq X instrument.

For RNA sequencing, total RNA was extracted from the leaf, stem, and fruit tissues using an RNAprep Pure Plant Kit (TIANGEN, Beijing, China), and genomic DNA contaminants were removed using RNase-Free DNase I (TIANGEN, Beijing, China). RNA integrity was evaluated on a 1.0% agarose gel stained with ethidium bromide (EB), while RNA quality and quantity were assessed using an Agilent 2100 Bioanalyzer (Agilent Technologies, CA, USA). Intact RNA was then used for cDNA library construction and for Illumina and PacBio sequencing. The cDNA libraries were constructed using the NEBNext Ultra RNA Library Prep Kit (NEB, MA, USA) for Illumina and the SMRTbell Express Template Prep Kit 2.0 (PacBio, CA, USA) for PacBio, following the manufacturers’ instructions. Prepared libraries were sequenced on the Illumina HiSeq X and PacBio Sequel platforms, respectively.

Genome size estimation

K-mer frequency analysis was used to determine genome characteristics24. The genome size of D. kaki was calculated from k-mer (k = 27) statistics using the modified Lander–Waterman algorithm: the total number of k-mers obtained from the sequence reads was divided by the sequencing depth, and the peak value of the k-mer frequency curve was taken as the overall sequencing depth. We estimated the genome size using the formula G = (N × (L − K + 1) − B)/D, where N is the total number of sequence reads, L is the average length of the sequence reads, K is the k-mer length (27 bp)25, B is the total number of low-frequency k-mers (frequency ≤ 1 in this analysis), G is the genome size, and D is the overall depth estimated from the k-mer distribution. Heterozygosity was reflected in the distribution of the number of distinct k-mers (k = 27). On the basis of a total of 222,144,314,592 27-mers and a peak 27-mer depth of 49, the estimated genome size was 4533.56 Mb (Fig. 1).
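The arithmetic behind this estimate can be reproduced in a few lines. The sketch below is illustrative only: it applies the formula as stated in the text to the reported k-mer total and peak depth, and the function names are hypothetical rather than part of any k-mer counting tool.

```python
def usable_kmers(n_reads: int, mean_read_len: int, k: int, low_freq_kmers: int) -> int:
    """Numerator of the formula in the text: N * (L - K + 1) - B."""
    return n_reads * (mean_read_len - k + 1) - low_freq_kmers


def genome_size_bp(total_kmers: int, peak_depth: int) -> float:
    """Genome size G = (usable k-mer count) / D, with D the peak k-mer depth."""
    return total_kmers / peak_depth


# Values reported in the text for D. kaki (k = 27).
TOTAL_27MERS = 222_144_314_592
PEAK_DEPTH = 49

size_bp = genome_size_bp(TOTAL_27MERS, PEAK_DEPTH)
print(f"Estimated genome size: {size_bp / 1e6:.2f} Mb")  # 4533.56 Mb

# Toy illustration of the numerator with made-up inputs:
# 10 reads of 150 bp, k = 27, 40 low-frequency k-mers.
print(usable_kmers(10, 150, 27, 40))  # 1200
```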

Fig. 1 Results of 27-mer frequency analysis to estimate the D. kaki genome size. The haplotype genome size was calculated by dividing the total K-mer count by coverage-depth (222,144,314,592/49 = 4,533,557,441).

The genome size of the sequenced individuals was confirmed using flow cytometry. Approximately 20–50 mg of fresh leaves of D. kaki and D. lotus were chopped with a razor blade in 1 ml of LB01 buffer (15 mM Tris, 2 mM Na2EDTA, 0.5 mM spermine tetrahydrochloride, 80 mM KCl, 20 mM NaCl, 0.1% (vol/vol) Triton X-100), adjusted to pH 7.5 with 1 M NaOH, with β-mercaptoethanol added to 15 mM. The suspension was collected by gentle pipetting and filtered through a 400-mesh nylon strainer. The samples were stained with 100 μg/ml propidium iodide (PI) and 100 μg/ml RNase in an ice bath for 10 min before analysis using a MoFlo-XDP flow cytometer (Beckman Coulter Inc., USA).

Nuclear fluorescence was measured using a MoFlo-XDP high-speed flow cytometer with a 70 μm ceramic nozzle at a sheath pressure of 60 psi. PI fluorescence was detected with a solid-state laser (488 nm) and a 625/26-nm HQ band-pass filter. Debris, cell fragments, and dead cells were eliminated using FL3-Height/SSC-Height gating. Single and double cells were discriminated using FL3-Height/FL3-Area. The final results showed that the genome size of D. kaki was 4.61 Gb (Fig. 2).

Fig. 2 Results of flow cytometry analysis to estimate the D. kaki genome size. The D. lotus genome (2n = 2x = ~1.76 Gb) served as an internal reference standard. Peak R3 shows the DNA amount of D. lotus, peak R4 shows the duplicated DNA amount of D. lotus, and peak R2 shows the DNA amount of D. kaki. The ratios of the peak means were 2.52 (R2/R3) and 1.31 (R2/R4); hence, the estimated genome size of D. kaki (2n = 6x) was ~4.44 Gb and ~4.61 Gb, respectively. Because the distance between peaks R2 and R4 is smaller than the distance between peaks R2 and R3, the estimate of ~4.61 Gb is considered more accurate.
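The two flow cytometry estimates follow from scaling the reference DNA amount by the ratio of peak means. The sketch below reproduces that arithmetic under one stated assumption: the duplicated D. lotus peak is taken to carry exactly twice the 1.76 Gb reference amount. The variable and function names are hypothetical.

```python
# Reference value from the figure caption: D. lotus, 2n = 2x = ~1.76 Gb.
D_LOTUS_GB = 1.76
# Assumption: the duplicated D. lotus peak holds twice the reference amount.
D_LOTUS_DUPLICATED_GB = 2 * D_LOTUS_GB


def estimate_size_gb(peak_ratio: float, reference_gb: float) -> float:
    """Scale the reference DNA amount by the sample/reference ratio of peak means."""
    return peak_ratio * reference_gb


# Ratios of peak means reported in the caption.
est_from_single = estimate_size_gb(2.52, D_LOTUS_GB)
est_from_duplicated = estimate_size_gb(1.31, D_LOTUS_DUPLICATED_GB)

print(f"Against the unreplicated reference peak: {est_from_single:.2f} Gb")   # 4.44 Gb
print(f"Against the duplicated reference peak: {est_from_duplicated:.2f} Gb")  # 4.61 Gb
```

The second estimate relies on the reference peak closest to the sample peak, which is the reason the caption gives for preferring it.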

Genome assembly

In total, 179.09 Gb of PacBio HiFi long reads (8 SMRT cells; 39.53× coverage) and 445.72 Gb of Hi-C paired-end reads (98.39× coverage) were obtained (Table 1). The D. kaki genome was assembled with hifiasm (v0.13-r308)26 using PacBio HiFi reads with default parameter settings. After the initial assembly, Hi-C sequencing data were aligned to the assembled contigs using the Burrows-Wheeler Aligner (BWA) mem option27, and pseudo-chromosomes were constructed using ALLHiC (v0.9.8)20 with the parameter settings -K 90 --minREs 50 --maxlinkdensity 3 --NonInformativeRabio 2. Hi-C scaffolding yielded 90 chromosome-length scaffolds. The final assembly contained 4.52 Gb, with a contig N50 of 5.28 Mb and a scaffold N50 of 44.01 Mb; 4.06 Gb (89.87%) of the assembly was anchored onto 90 chromosome-level pseudomolecules comprising 15 homologous groups, with six allelic chromosomes in each. The assignment to genome haplotypes was based on chromosome length (Tables 2, 3; Figs. 3, 4).
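For readers unfamiliar with the summary statistics quoted here, the sketch below shows how an N50 value and a sequencing depth are conventionally computed. It is an illustration, not the pipeline used in this study: the contig lengths are toy values, the function names are hypothetical, and the 4.53 Gb genome size used for the depth check is an assumption taken from the k-mer estimate.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Length of the shortest sequence in the smallest set of longest
    sequences that together cover at least half of the total length."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50() needs at least one sequence length")


def depth(total_bases_gb: float, genome_size_gb: float) -> float:
    """Average sequencing depth: sequenced bases divided by genome size."""
    return total_bases_gb / genome_size_gb


# Toy contig lengths (made up): total 20, so the longest contigs are
# added until at least 10 is covered (6 + 5 = 11), giving N50 = 5.
print(n50([2, 3, 4, 5, 6]))  # 5

# Depth check with the data volumes reported in the text and an
# assumed genome size of 4.53 Gb (the k-mer estimate).
ASSUMED_GENOME_GB = 4.53
print(f"HiFi depth: {depth(179.09, ASSUMED_GENOME_GB):.2f}x")  # 39.53x
print(f"Hi-C depth: {depth(445.72, ASSUMED_GENOME_GB):.2f}x")  # 98.39x
```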

Table 1 Statistics of data for genome assemblies of D. kaki.
Table 2 Summary of D. kaki genome assembly.
Table 3 Statistics of chromosome length in D. kaki genome.
Fig. 3 Overview of the D. kaki genome. From the outer ring to the inner ring: chromosomes, gene density, TE density, GC content, and synteny.

Fig. 4 Overview of Hi-C contacts in the heat map visualization for assembled chromosomes.

Repetitive sequence annotation

Transposable elements (TEs) in the D. kaki genome were identified by combining de novo- and homology-based approaches. For the de novo-based approach, we used RepeatScout (v1.0.5; https://github.com/mmcco/RepeatScout)28, RepeatModeler (http://www.repeatmasker.org/RepeatModeler.html), and LTR_FINDER (v1.0.7; https://github.com/xzhub/LTR_Finder)29 to build a de novo repeat library. For the homology-based approach, we used RepeatMasker (v3.3.0; http://www.repeatmasker.org/) against the Repbase TE library (http://www.girinst.org/server/RepBase/)30 with a score cut-off of 225 and RepeatProteinMask (v4.0.5; http://www.repeatmasker.org/) against the TE protein database31 with a p-value cut-off of 1e-4. Tandem repeats were identified using Tandem Repeats Finder (v4.0.7; https://tandem.bu.edu/trf/)32 with parameters ‘matching weight: 2, mismatching penalty: 7, indel penalty: 7, minimum alignment score: 50, maximum period size: 2000’. Ultimately, a total of 2.90 Gb of repetitive elements occupying 64.02% of the D. kaki genome were annotated (Table 4). Most of the repeats were long terminal repeats (LTRs) (51.28% of the genome; Table 5). The DNA, LINE, and SINE classes accounted for 5.93%, 2.66%, and 0.03% of the genome, respectively (Table 5).

Table 4 Summary of repeat sequences in D. kaki genome.
Table 5 Summary of TE sequences in D. kaki genome.

Gene prediction and annotation

Homology-based, de novo, and transcriptome-based predictions were used to predict protein-coding genes in the D. kaki genome. Homologous proteins from five plant genomes (Arabidopsis thaliana, D. oleifera, D. lotus, Actinidia chinensis, and Camellia sinensis) were downloaded from Ensembl Plants (http://plants.ensembl.org/index.html) and NCBI (https://www.ncbi.nlm.nih.gov/). The protein sequences were then aligned to the D. kaki genome assembly using TBLASTN33, with an E-value cut-off of 1e-5. The BLAST hits were conjoined using the Solar software34. GeneWise (https://www.ebi.ac.uk/Tools/psa/genewise) was used to predict the exact gene structure of the corresponding genomic regions in each BLAST hit (Homo-set)35. The published RNA-seq data of female flowers and fruit at different developmental stages, and 0.33 Gb of newly sequenced RNA-seq data from the young leaves and stems of ‘Xiaoguotianshi’ (three biological replicates), were mapped to the D. kaki genome using HISAT2 (https://daehwankimlab.github.io/hisat2/, v2.2.1)36, and transcripts were assembled using Cufflinks (http://cole-trapnell-lab.github.io/cufflinks/, v2.1.1)37 (Cufflinks-set; Table 6). A total of 70.54 Gb of Iso-seq data from PacBio transcriptome sequencing of mixed samples containing the young leaves, stems, and fruits of ‘Xiaoguotianshi’ (three biological replicates) were used to create several pseudo-ESTs. These pseudo-ESTs were mapped to the assembly, and gene models were predicted using PASA (http://pasapipeline.github.io/)38 (Table 6). This gene set was denoted as the PASA-T-set and used to train ab initio gene prediction programs. Five ab initio gene prediction programs, namely, Augustus (http://augustus.gobics.de/, v3.2.3), GENSCAN (http://genes.mit.edu/GENSCAN.html, v1.0), GlimmerHMM (http://ccb.jhu.edu/software/glimmerhmm/, v3.0.1), geneid (http://genome.crg.es/software/geneid/), and SNAP (http://korflab.ucdavis.edu/software.html), were used to predict coding regions in the repeat-masked genome39,40,41,42.

Gene model evidence from the Homo-set, Cufflinks-set, PASA-T-set, and ab initio programs was combined using EVidenceModeler (EVM) (http://evidencemodeler.sourceforge.net/) into a non-redundant set of gene structures43.

Table 6 Statistics of RNA-seq and Iso-seq.

Functional annotation of protein-coding genes was performed using BLASTP (E-value: 1e-05) against two integrated protein sequence databases44: SwissProt (http://web.expasy.org/docs/swiss-prot_guideline.html) and NR (ftp://ftp.ncbi.nih.gov/blast/db/). Protein domains were annotated by searching against the InterPro (http://www.ebi.ac.uk/interpro/, v32.0) and Pfam (https://pfam-legacy.xfam.org/, v27.0) databases using InterProScan (v4.8) and HMMER (http://www.hmmer.org/, v3.1), respectively45,46,47,48. Gene ontology (GO, http://www.geneontology.org/page/go-database) terms for each gene were obtained from the corresponding InterPro or Pfam entries. The pathways in which the genes might be involved were assigned using BLAST against the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (http://www.kegg.jp/kegg/kegg1.html, release 53), with an E-value cut-off of 1e-05. Overall, a total of 153,288 protein-coding genes were predicted, with an average sequence length of 7,397.94 bp and an average CDS length of 1,153.82 bp (Table 7). Of these, 135,446 genes were anchored to the 90 chromosomes (Table 8). On average, each predicted gene contained 5.01 exons with an average sequence length of 230.33 bp (Table 7). In total, 98.60% of the genes were functionally annotated via similarity searches against homologous sequences and protein domains (Table 9).

Table 7 Summary of gene structure prediction in D. kaki genome.
Table 8 Statistics of chromosome gene number in D. kaki genome.
Table 9 Statistics of gene function annotation in D. kaki genome.

tRNA genes were identified using the tRNAscan-SE software49. The rRNA fragments were predicted by aligning the rRNA sequences using BlastN at an E-value of 1e-10. The miRNA and snRNA genes were predicted using the INFERNAL software50 against the Rfam database (release 9.1)51. As a result, 110,480 rRNA, 12,297 tRNA, 1,483 miRNA, and 3,510 snRNA genes were annotated (Table 10).

Table 10 Statistics of non-coding RNA in D. kaki genome.

Data Records

Raw genome and transcriptome sequencing data of D. kaki are deposited in the NCBI SRA database under BioProject ID PRJNA810977. The SRA accession numbers of the PacBio HiFi sequencing data are SRR18500470⁵², SRR18500471⁵³, SRR18500472⁵⁴, SRR18500473⁵⁵, SRR18500474⁵⁶, SRR18500475⁵⁷, SRR18500476⁵⁸, and SRR18500477⁵⁹. The SRA accession numbers of the Hi-C sequencing data are SRR18500481⁶⁰, SRR18500482⁶¹, SRR18500483⁶², SRR18500484⁶³, SRR18500485⁶⁴, SRR18500486⁶⁵, SRR18500487⁶⁶, and SRR18500488⁶⁷. The SRA accession numbers of the Illumina sequencing data are SRR18500479⁶⁸ and SRR18500480⁶⁹. The SRA accession number of the Iso-seq data is SRR18500463⁷⁰. The SRA accession numbers of some of the RNA-seq data are SRR18500464⁷¹, SRR18500465⁷², SRR18500466⁷³, SRR18500478⁷⁴, SRR18500489⁷⁵, SRR18500490⁷⁶, and SRP151715⁷⁷. The other RNA-seq data have been deposited in the NCBI SRA database under accession numbers SRR16371984⁷⁸, SRR16371985⁷⁹, SRR16371986⁸⁰, SRR16371987⁸¹, SRR16371988⁸², SRR16371989⁸³, SRR16371990⁸⁴, SRR16371991⁸⁵, SRR16371992⁸⁶, SRR16371993⁸⁷, SRR16371994⁸⁸, SRR16371995⁸⁹, SRR16371996⁹⁰, SRR16371997⁹¹, and SRR16371997⁹², which are associated with BioProject ID PRJNA771936. The assembled genome sequence has been deposited at GenBank under accession number JAQSGO000000000⁹³. Other data, such as gene structure annotation, predicted CDS and protein sequences, and annotation of TEs, tandem repeat sequences, tRNA genes, miRNA genes, snRNA genes, and rRNA genes, are available in the FigShare database⁹⁴.

Technical Validation

Assessment of the completeness of the genome assembly using CEGMA indicated 95.56% coverage of the conserved core eukaryotic genes (haplotypes: DkaA 89.92%; DkaB 90.73%; DkaC 91.13%; DkaD 86.69%; DkaE 89.11%; DkaF 86.69%). The BUSCO (v5.2.2; embryophyta_odb10 database)95 results indicated that the genome and gene set were 99.50% (haplotypes: DkaA 92.70%; DkaB 93.50%; DkaC 92.10%; DkaD 90.10%; DkaE 90.20%; DkaF 84.60%) and 97.50% (haplotypes: DkaA 87.10%; DkaB 86.60%; DkaC 86.10%; DkaD 83.90%; DkaE 84.60%; DkaF 79.80%) complete, respectively (Tables 11, 12). The lower per-haplotype values show that the individual haplotypes lack genes present elsewhere in the genome. Additionally, 99.86% (haplotypes: DkaA 94.88%; DkaB 94.59%; DkaC 93.70%; DkaD 93.94%; DkaE 92.97%; DkaF 90.09%) of the high-quality short reads were mapped back to the assembly (Table 13). Together, these assessments indicate that the D. kaki genome assembly is complete and of high quality.

Table 11 Assessment of the completeness of the genome assembly.
Table 12 Assessment of the completeness of the gene set.
Table 13 Coverage statistics of D. kaki genome.

Inter-genomic comparison analysis revealed a distinct 6-to-1 syntenic relationship between D. kaki and D. oleifera (Fig. 5), which further supported the high quality of the D. kaki assembly.

Fig. 5 The syntenic dot plots of D. kaki and D. oleifera.